  1. Dec 05, 2013
• Move test to X86 dir · e593fea5
      Renato Golin authored
Test is platform independent, but I don't want to force vector-width, as
that could spoil the pragma test.
      
      llvm-svn: 196539
• Add #pragma vectorize enable/disable to LLVM · 729a3ae9
      Renato Golin authored
The intended behaviour is to force vectorization in the presence
of the flag (either turning it on or off), and to keep the behaviour
as expected in its absence. Tests were added to make sure all
cases are covered in opt. No tests were added to other tools, on
the assumption that they should use the PassManagerBuilder in the
same way.
      
      This patch also removes the outdated -late-vectorize flag, which was
      on by default and not helping much.
      
The pragma metadata is attached in the same place as other loop
metadata, but nothing forbids attaching it to a function
(to enable #pragma optimize) or to basic blocks (to hint the basic-block
vectorizers), etc. The logic should be the same all around.
      
      Patches to Clang to produce the metadata will be produced after the
      initial implementation is agreed upon and committed. Patches to other
      vectorizers (such as SLP and BB) will be added once we're happy with
      the pass manager changes.
      
      llvm-svn: 196537
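As a rough illustration of attaching such a hint as loop metadata (a minimal sketch, not part of the patch; it uses today's C++ API and today's metadata spelling, llvm.loop.vectorize.enable, both of which differ from the 3.4-era code):

#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Type.h"
using namespace llvm;

// Attach a vectorize-enable/disable hint to the loop's latch branch.
void markLoopVectorize(BranchInst *LatchBr, bool Enable) {
  LLVMContext &Ctx = LatchBr->getContext();
  Metadata *HintOps[] = {
      MDString::get(Ctx, "llvm.loop.vectorize.enable"), // assumed hint name
      ConstantAsMetadata::get(
          ConstantInt::get(Type::getInt1Ty(Ctx), Enable))};
  MDNode *Hint = MDNode::get(Ctx, HintOps);
  // A loop-ID node lists itself as its own first operand.
  Metadata *LoopIDOps[] = {nullptr, Hint};
  MDNode *LoopID = MDNode::getDistinct(Ctx, LoopIDOps);
  LoopID->replaceOperandWith(0, LoopID);
  LatchBr->setMetadata("llvm.loop", LoopID);
}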
• SLPVectorizer: An in-tree vectorized entry cannot also be a scalar external use · 7ee53cac
      Arnold Schwaighofer authored
We were creating external uses for scalar values in MustGather entries that also
had a ScalarToTreeEntry (i.e., they are also present in a vectorized tuple). This
meant we would keep a value alive both as a scalar and as part of a vector,
causing havoc. This is not necessary, because when we create a MustGather
vector we explicitly create external-use entries for the insertelement
instructions of the MustGather vector elements.
      
      Fixes PR18129.
      
      radar://15582184
      
      llvm-svn: 196508
• Correct word hyphenations · f907b891
      Alp Toker authored
      This patch tries to avoid unrelated changes other than fixing a few
      hyphen-related ambiguities and contractions in nearby lines.
      
      llvm-svn: 196471
  2. Dec 03, 2013
  3. Dec 02, 2013
  4. Nov 28, 2013
  5. Nov 26, 2013
• PR1860 - We can't save a list of ExtractElement instructions to CSE because... · b0082d24
      Nadav Rotem authored
      PR1860 - We can't save a list of ExtractElement instructions to CSE because some of these instructions
      may be removed and optimized in future iterations. Instead we save a list of basic blocks that we need to CSE.
      
      llvm-svn: 195791
• LoopVectorizer: Truncate i64 trip counts of i32 phis if necessary · a2c8e008
      Arnold Schwaighofer authored
      In signed arithmetic we could end up with an i64 trip count for an i32 phi.
      Because it is signed arithmetic we know that this is only defined if the i32
does not wrap. It is therefore safe to truncate the i64 trip count to an i32
      value.
      
      Fixes PR18049.
      
      llvm-svn: 195787
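For a concrete picture (a generic example, not the test case from the patch):

// 'i' is an i32 phi; SCEV may compute the trip count as an i64, but since
// signed overflow is undefined, 'i' cannot wrap and the i64 count can be
// truncated back to i32 for the vectorized loop.
void saxpy(float *a, const float *b, float alpha, int n) {
  for (int i = 0; i < n; ++i)
    a[i] += alpha * b[i];
}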
• PR18060 - When we RAUW values with ExtractElement instructions in some cases · f9f8482e
      Nadav Rotem authored
      we generate PHI nodes with multiple entries from the same basic block but
with different values. Enabling CSE on ExtractElement instructions makes sure
      that all of the RAUWed instructions are the same.
      
      llvm-svn: 195773
• PR17925 bugfix. · abb8505d
      Stepan Dyatkovskiy authored
Short description.

This issue is about treating pointers as integers. We treat pointers as
different if they reference different address spaces. At the same time, we
treat a pointer as equal to an integer (of machine address width). That was
a source of false positives. Consider the following case on a 32-bit machine:

void foo0(i32 addrspace(1)* %p)
void foo1(i32 addrspace(2)* %p)
void foo2(i32 %p)

foo0 != foo1, while
foo1 == foo2 and foo0 == foo2.

As you can see, this breaks transitivity, which means the result depends on
the order in which functions appear in the module. The order foo2, foo0,
foo1 causes foo0 and foo1 to be merged: first foo0 is merged with foo2 and
erased; then foo1 is merged with foo2. So, depending on the order, things we
don't expect to be merged can end up merged.

The fix:
Never treat a pointer as an integer, except for pointers in address space 0.
      
      llvm-svn: 195769
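A minimal sketch of the fixed comparison rule (a hypothetical helper, not the actual MergeFunctions code):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/DerivedTypes.h"
using namespace llvm;

// Pointers compare equal only within one address space, and only address
// space 0 pointers may still compare equal to pointer-width integers.
static bool typesComparable(Type *A, Type *B, const DataLayout &DL) {
  if (A == B)
    return true;
  auto *PA = dyn_cast<PointerType>(A);
  auto *PB = dyn_cast<PointerType>(B);
  if (PA && PB)
    return PA->getAddressSpace() == PB->getAddressSpace();
  if (PA && isa<IntegerType>(B))   // pointer vs. integer
    return PA->getAddressSpace() == 0 &&
           B->getIntegerBitWidth() == DL.getPointerSizeInBits(0);
  if (PB && isa<IntegerType>(A))
    return PB->getAddressSpace() == 0 &&
           A->getIntegerBitWidth() == DL.getPointerSizeInBits(0);
  return false;
}

With this rule, foo0 == foo1 stays false and foo0 == foo2 becomes false as well, so equality is transitive again.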
  6. Nov 25, 2013
  7. Nov 23, 2013
  8. Nov 22, 2013
  9. Nov 21, 2013
  10. Nov 20, 2013
• llvm-cov: Added file checksum to gcno and gcda files. · babe7491
      Yuchen Wu authored
Instead of permanently outputting "MVLL" as the file checksum, clang
will now create gcno and gcda checksums by hashing the destination block
numbers of every arc. This allows llvm-cov to check whether the two gcov
files are synchronized.

Regenerated the test files so they contain the checksum. Also added a
negative test to ensure an error is reported when the checksums don't match.
      
      llvm-svn: 195191
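The hashing idea, as a minimal sketch (a hypothetical mixing function; the real gcno/gcda checksum algorithm is the one defined by the patch, not reproduced here):

#include <cstdint>
#include <vector>

// Fold the destination block number of every arc into one checksum, so a
// .gcno/.gcda pair produced by different builds will disagree and llvm-cov
// can flag the mismatch instead of reporting bogus coverage.
uint32_t fileChecksum(const std::vector<uint32_t> &ArcDestBlocks) {
  uint32_t Hash = 0;
  for (uint32_t Dest : ArcDestBlocks)
    Hash = Hash * 1000003u + Dest; // simple polynomial mix
  return Hash;
}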
  11. Nov 19, 2013
• SLPVectorizer: Fix stale Value pointers in array · 8bc4a0ba
      Arnold Schwaighofer authored
We are slicing an array of Value pointers and processing those slices in a loop.
      The problem is that we might invalidate a later slice by vectorizing a former
      slice.
      
      Use a WeakVH to track the pointer. If the pointer is deleted or RAUW'ed we can
      tell.
      
      The test case will only fail when running with libgmalloc.
      
      radar://15498655
      
      llvm-svn: 195162
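The shape of the fix, as a minimal sketch (assuming the era's WeakVH semantics, which nulled the handle on deletion and followed RAUW; in current LLVM that combined behavior lives in WeakTrackingVH):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Value.h"
#include "llvm/IR/ValueHandle.h"
using namespace llvm;

void processSlices(ArrayRef<Value *> Ops) {
  // Hold the slice through value handles instead of raw Value pointers.
  SmallVector<WeakVH, 8> Slice(Ops.begin(), Ops.end());
  for (WeakVH &VH : Slice) {
    Value *V = VH; // null if vectorizing an earlier slice deleted the value
    if (!V)
      continue;    // stale entry: skip instead of touching freed memory
    // ... process V; later entries may be deleted or RAUW'ed here ...
  }
}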
• Fix an issue where SROA computed different results based on the relative · a1262006
      Chandler Carruth authored
      order of slices of the alloca which have exactly the same size and other
      properties. This was found by a perniciously unstable sort
      implementation used to flush out buggy uses of the algorithm.
      
      The fundamental idea is that findCommonType should return the best
      common type it can find across all of the slices in the range. There
      were two bugs here previously:
      
1) We would accept an integer type whose width is not a byte-width multiple,
   and if there were integer types of different bit-widths, we would accept
   the first one seen. This caused an actual failure in the testcase updated
   here when the sort order changed.
2) If we found a bad combination of types or a non-load, non-store use
   before an integer-typed load or store, we would bail, but if we found the
   integer-typed load or store first, we would use it. The correct behavior
   is to always use an integer-typed operation that covers the partition, if
   one exists.
      
While a clever debugging sort algorithm found problem #1 in our existing
test cases, I have no useful test case ideas for #2; I spotted it by
inspection while looking at this code.
      
      llvm-svn: 195118
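A much-simplified sketch of the corrected selection rule (a hypothetical helper, not SROA's actual findCommonType):

#include "llvm/ADT/ArrayRef.h"
#include "llvm/IR/DerivedTypes.h"
#include <cstdint>
using namespace llvm;

// An integer load/store exactly covering the partition always wins; integer
// candidates of any other width never become the common type, regardless of
// the order in which the slices are visited.
Type *pickCommonType(ArrayRef<Type *> UseTypes, uint64_t PartitionBytes) {
  Type *Fallback = nullptr;
  for (Type *T : UseTypes) {
    if (auto *IT = dyn_cast<IntegerType>(T)) {
      if (IT->getBitWidth() == PartitionBytes * 8)
        return IT;  // covers the whole partition: use it unconditionally
      continue;     // wrong-width integer: never the answer
    }
    if (!Fallback)
      Fallback = T; // first non-integer candidate (simplified)
  }
  return Fallback;
}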
  12. Nov 18, 2013
• The 'optnone' attribute means don't inline anything into this function · dcbe35ba
      Paul Robinson authored
      (except functions marked always_inline).
      Functions with 'optnone' must also have 'noinline' so they don't get
      inlined into any other function.
      
      Based on work by Andrea Di Biagio.
      
      llvm-svn: 195046
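At the source level the interaction looks like this (an illustrative example using Clang attributes, not part of the patch):

__attribute__((always_inline)) static inline int tiny(int x) { return x + 1; }
static int helper(int x) { return x * 2; }

// 'optnone' keeps helper() from being inlined into bar(); the mandatory
// 'noinline' keeps bar() itself from being inlined into its callers.
__attribute__((optnone, noinline)) int bar(int x) {
  int a = tiny(x);   // exception: always_inline callees may still be inlined
  int b = helper(x); // stays an out-of-line call inside the optnone function
  return a + b;
}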
• LoopVectorizer: Extend the induction variable to a larger type · b72cb4ec
      Arnold Schwaighofer authored
In some cases the loop exit count computation can overflow. Extend the type to
      prevent most of those cases.
      
      The problem is loops like:
      int main ()
      {
        int a = 1;
        char b = 0;
        lbl:
          a &= 4;
          b--;
          if (b) goto lbl;
        return a;
      }
      
      The backedge count is 255. The induction variable type is i8. If we add one to
      255 to get the exit count we overflow to zero.
      
      To work around this issue we extend the type of the induction variable to i32 in
      the case of i8 and i16.
      
      PR17532
      
      llvm-svn: 195008
  13. Nov 17, 2013
• Add the cold attribute to error-reporting call sites · 66cd3f1b
      Hal Finkel authored
      Generally speaking, control flow paths with error reporting calls are cold.
      So far, error reporting calls are calls to perror and calls to fprintf,
      fwrite, etc. with stderr as the stream. This can be extended in the future.
      
      The primary motivation is to improve block placement (the cold attribute
      affects the static branch prediction heuristics).
      
      llvm-svn: 194943
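For illustration, the kind of call site this marks (a generic example, not from the patch):

#include <cstdio>
#include <cstdlib>

FILE *open_or_die(const char *path) {
  FILE *f = fopen(path, "r");
  if (!f) {
    // The perror call site receives the cold attribute, so the static
    // branch prediction heuristics treat this whole path as unlikely and
    // block placement can move it out of the hot code.
    perror("fopen");
    exit(1);
  }
  return f;
}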
• Add a loop rerolling pass · bf45efde
      Hal Finkel authored
      This adds a loop rerolling pass: the opposite of (partial) loop unrolling. The
      transformation aims to take loops like this:
      
      for (int i = 0; i < 3200; i += 5) {
        a[i]     += alpha * b[i];
        a[i + 1] += alpha * b[i + 1];
        a[i + 2] += alpha * b[i + 2];
        a[i + 3] += alpha * b[i + 3];
        a[i + 4] += alpha * b[i + 4];
      }
      
      and turn them into this:
      
      for (int i = 0; i < 3200; ++i) {
        a[i] += alpha * b[i];
      }
      
      and loops like this:
      
      for (int i = 0; i < 500; ++i) {
        x[3*i] = foo(0);
        x[3*i+1] = foo(0);
        x[3*i+2] = foo(0);
      }
      
      and turn them into this:
      
      for (int i = 0; i < 1500; ++i) {
        x[i] = foo(0);
      }
      
      There are two motivations for this transformation:
      
        1. Code-size reduction (especially relevant, obviously, when compiling for
      code size).
      
        2. Providing greater choice to the loop vectorizer (and generic unroller) to
      choose the unrolling factor (and a better ability to vectorize). The loop
      vectorizer can take vector lengths and register pressure into account when
      choosing an unrolling factor, for example, and a pre-unrolled loop limits that
      choice. This is especially problematic if the manual unrolling was optimized
      for a machine different from the current target.
      
      The current implementation is limited to single basic-block loops only. The
      rerolling recognition should work regardless of how the loop iterations are
      intermixed within the loop body (subject to dependency and side-effect
      constraints), but the significant restriction is that the order of the
      instructions in each iteration must be identical. This seems sufficient to
      capture all current use cases.
      
      This pass is not currently enabled by default at any optimization level.
      
      llvm-svn: 194939
  14. Nov 16, 2013
• Apply the InstCombine fptrunc sqrt optimization to llvm.sqrt · 12100bf7
      Hal Finkel authored
      InstCombine, in visitFPTrunc, applies the following optimization to sqrt calls:
      
        (fptrunc (sqrt (fpext x))) -> (sqrtf x)
      
but does not apply the same optimization to llvm.sqrt. This is a problem
because, to enable vectorization, Clang generates llvm.sqrt instead of sqrt
in fast-math mode, and because the optimization is applied to sqrt but not
to llvm.sqrt, the fast-math code is sometimes slower.
      
      This change makes InstCombine apply this optimization to llvm.sqrt as well.
      
      This fixes the specific problem in PR17758, although the same underlying issue
      (optimizations applied to libcalls are not applied to intrinsics) exists for
      other optimizations in SimplifyLibCalls.
      
      llvm-svn: 194935
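The source pattern behind the fold, for reference (a generic example):

#include <cmath>

// (fptrunc (sqrt (fpext x))) -> (sqrtf x): the round trip through double
// is unnecessary. With fast-math, Clang emits llvm.sqrt for this call, so
// the same fold must fire on the intrinsic or the "fast" code loses.
float root(float x) {
  return (float)sqrt((double)x);
}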
• InstCombine: fold (A >> C) == (B >> C) --> (A^B) < (1 << C) for constant Cs. · 03f3e248
      Benjamin Kramer authored
      This is common in bitfield code.
      
      llvm-svn: 194925
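The identity is easy to check exhaustively on 8-bit values (a quick standalone verification, not part of the patch):

#include <cassert>

int main() {
  // (A >> C) == (B >> C) holds exactly when A and B agree above bit C,
  // i.e. when A^B has only its low C bits set: (A^B) < (1 << C), unsigned.
  for (unsigned C = 0; C < 8; ++C)
    for (unsigned A = 0; A < 256; ++A)
      for (unsigned B = 0; B < 256; ++B)
        assert(((A >> C) == (B >> C)) == ((A ^ B) < (1u << C)));
  return 0;
}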
• LoopVectorizer: Use abi alignment for accesses with no alignment · dbb7b87d
      Arnold Schwaighofer authored
When we vectorize a scalar access with no alignment specified, we have to set
the target's ABI alignment of the scalar access on the vectorized access.
Keeping the alignment of zero would be wrong, because zero on the new access
means the ABI alignment of the vector type, and most targets have a bigger
ABI alignment for vector types than for their scalar elements.
      
      This probably fixes PR17878.
      
      llvm-svn: 194876
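The rule in miniature (a hypothetical helper; the real change lives in the LoopVectorizer's memory-instruction widening):

#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Type.h"
using namespace llvm;

// An explicit alignment carries over unchanged; an unspecified alignment (0)
// must be pinned to the scalar type's ABI alignment, because 0 on the new
// vector access would instead mean the vector type's larger ABI alignment.
uint64_t vectorizedAlign(uint64_t ScalarAlign, Type *ScalarTy,
                         const DataLayout &DL) {
  return ScalarAlign ? ScalarAlign : DL.getABITypeAlign(ScalarTy).value();
}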
  15. Nov 15, 2013
  16. Nov 14, 2013
  17. Nov 13, 2013
• SampleProfileLoader pass. Initial setup. · 8d6568b5
      Diego Novillo authored
This adds a new scalar pass that reads a file with samples generated
by 'perf' at runtime. The samples read from the profile are
incorporated and emitted as IR metadata reflecting that profile.
      
      The profile file is assumed to have been generated by an external
      profile source. The profile information is converted into IR metadata,
      which is later used by the analysis routines to estimate block
      frequencies, edge weights and other related data.
      
      External profile information files have no fixed format, each profiler
      is free to define its own. This includes both the on-disk representation
      of the profile and the kind of profile information stored in the file.
      A common kind of profile is based on sampling (e.g., perf), which
      essentially counts how many times each line of the program has been
      executed during the run.
      
      The SampleProfileLoader pass is organized as a scalar transformation.
      On startup, it reads the file given in -sample-profile-file to
      determine what kind of profile it contains.  This file is assumed to
      contain profile information for the whole application. The profile
      data in the file is read and incorporated into the internal state of
      the corresponding profiler.
      
      To facilitate testing, I've organized the profilers to support two file
formats: text and native. The native format is whatever on-disk
representation the profiler wants to support; I expect this will mostly
be bitcode files, but it could be anything. To support a native format,
every profiler must implement the SampleProfile::loadNative() function.
      
      The text format is mostly meant for debugging. Records are separated by
      newlines, but each profiler is free to interpret records as it sees fit.
      Profilers must implement the SampleProfile::loadText() function.
      
      Finally, the pass will call SampleProfile::emitAnnotations() for each
      function in the current translation unit. This function needs to
      translate the loaded profile into IR metadata, which the analyzer will
      later be able to use.
      
      This patch implements the first steps towards the above design. I've
      implemented a sample-based flat profiler. The format of the profile is
      fairly simplistic. Each sampled function contains a list of relative
      line locations (from the start of the function) together with a count
      representing how many samples were collected at that line during
      execution. I generate this profile using perf and a separate converter
      tool.
      
      Currently, I have only implemented a text format for these profiles. I
      am interested in initial feedback to the whole approach before I send
      the other parts of the implementation for review.
      
      This patch implements:
      
      - The SampleProfileLoader pass.
      - The base ExternalProfile class with the core interface.
- A SampleProfile sub-class using the above interface. The profiler
  generates branch weight metadata on every branch instruction that
  matches the profile.
      - A text loader class to assist the implementation of
        SampleProfile::loadText().
      - Basic unit tests for the pass.
      
      Additionally, the patch uses profile information to compute branch
      weights based on instruction samples.
      
      This patch converts instruction samples into branch weights. It
      does a fairly simplistic conversion:
      
      Given a multi-way branch instruction, it calculates the weight of
      each branch based on the maximum sample count gathered from each
      target basic block.
      
      Note that this assignment of branch weights is somewhat lossy and can be
      misleading. If a basic block has more than one incoming branch, all the
      incoming branches will get the same weight. In reality, it may be that
      only one of them is the most heavily taken branch.
      
      I will adjust this assignment in subsequent patches.
      
      llvm-svn: 194566
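For a sense of the text format, a hypothetical profile might look like the following (the record syntax here is invented for illustration; the actual format is whatever the loader defines). A function header is followed by one record per sampled line, pairing the line's offset from the function start with its sample count:

main 3
0 12
2 9400
5 31

Here line offset 2 within main accumulated 9400 samples, so the branches reaching its basic block would receive proportionally higher weights.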
  18. Nov 12, 2013