  1. Jan 05, 2015
    • Select lower fsub,fabs pattern to fabd on AArch64 · 93f27ce8
      Karthik Bhat authored
      This patch lowers patterns such as-
        fsub   v0.4s, v0.4s, v1.4s
        fabs   v0.4s, v0.4s
      to
        fabd  v0.4s, v0.4s, v1.4s
      on AArch64.
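
      For context, a hedged sketch of C++ source that can give rise to this
      fsub+fabs vector pattern once the loop is vectorized (the function name
      and use of __restrict are illustrative assumptions, not from the patch):

        #include <cmath>

        // Absolute difference of two float arrays; the vectorized body becomes
        // fsub + fabs on 4 x float vectors, which this patch folds to fabd.
        void fabs_diff(float *__restrict c, const float *__restrict a,
                       const float *__restrict b, int n) {
          for (int i = 0; i < n; ++i)
            c[i] = std::fabs(a[i] - b[i]);
        }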
      
      Review: http://reviews.llvm.org/D6791
      llvm-svn: 225169
    • Parse Tag_compatibility correctly. · 6632d1f6
      Charlie Turner authored
      Tag_compatibility takes two arguments, but before this patch the parser
      would erroneously accept just one; it now produces an error in that case.
      
      Change-Id: I530f918587620d0d5dfebf639944d6083871ef7d
      llvm-svn: 225167
    • Emit the build attribute Tag_conformance. · 8b2caa45
      Charlie Turner authored
      Claim conformance to version 2.09 of the ARM ABI.
      
      This build attribute must be emitted first amongst the build attributes when
      written to an object file. This is to simplify conformance detection by
      consumers.
      
      Change-Id: If9eddcfc416bc9ad6e5cc8cdcb05d0031af7657e
      llvm-svn: 225166
    • Select lower sub,abs pattern to sabd on AArch64 · 8ec742c2
      Karthik Bhat authored
      This patch lowers patterns such as-
        sub	v0.4s, v0.4s, v1.4s
        abs	v0.4s, v0.4s
      to
        sabd	v0.4s, v0.4s, v1.4s
      on AArch64.
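
      For context, a hedged sketch of integer source that can produce this
      sub+abs vector pattern (the function name and use of __restrict are
      illustrative assumptions, not from the patch):

        #include <cstdlib>

        // Absolute difference of two int arrays; the vectorized body becomes
        // sub + abs on 4 x i32 vectors, which this patch folds to sabd.
        void sabs_diff(int *__restrict c, const int *__restrict a,
                       const int *__restrict b, int n) {
          for (int i = 0; i < n; ++i)
            c[i] = std::abs(a[i] - b[i]);
        }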
      
      Review: http://reviews.llvm.org/D6781
      llvm-svn: 225165
    • Fix broken test from r225159. · 6ae456b0
      Michael Kuperstein authored
      llvm-svn: 225164
    • [PM] Don't run the machinery of invalidating all the analysis passes · 539dc4b9
      Chandler Carruth authored
      when all are being preserved.
      
      We want to short-circuit this for a couple of reasons. One, I don't
      really want passes to grow a dependency on actually receiving their
      invalidate call when they've been preserved. I'm thinking about removing
      this entirely. But more importantly, preserving everything is likely to
      be the common case in a lot of scenarios, and it would be really good to
      bypass all of the invalidation and preservation machinery there.
      Avoiding calling N opaque functions to try to invalidate things that are
      by definition still valid seems important. =]
      
      This wasn't really inspired by much other than seeing the spam in the
      logging for analyses, but it seems better to get it checked in rather
      than forgetting about it.
      
      llvm-svn: 225163
    • [PM] Add names and debug logging for analysis passes to the new pass · e5e8fb3b
      Chandler Carruth authored
      manager.
      
      This starts to allow us to test analyses more easily, but it's really
      only the beginning. Some of the code here is still untestable without
      manual changes to create analysis passes, but I wanted to factor it into
      chunks as small as possible.
      
      Next up, in order to be able to test things, are the following (in no particular order):
      - No-op analysis passes so we don't have to use real ones to exercise
        the pass manager itself.
      - Automatic way of generating dummy passes that require an analysis be
        run, including a variant that calls a 'print' method on a pass to make
        it even easier to print out the results of an analysis.
      - Dummy passes that invalidate all analyses for their IR unit so we can
        test invalidation and re-runs.
      - Automatic way to print each analysis pass as it is re-run.
      - Automatic but optional verification of analysis passes everywhere
        possible.
      
      I'm not claiming I'll get to all of these immediately, but that's what
      is in the pipeline at some stage. I'm fleshing out exactly what I need
      and what to prioritize by working on converting analyses and then trying
      to test the conversion. =]
      
      llvm-svn: 225162
    • Fixed a bug in memory dependence checking module of loop vectorization. The... · 40c1b352
      Jiangning Liu authored
      Fixed a bug in the memory dependence checking module of loop vectorization. The following loop should not be vectorized with the current algorithm.
      
      {code}
      // loop body
        ... = a[i]          (1)
        ... = a[i+1]        (2)
        .......
        a[i+1] = ....       (3)
        a[i] = ...          (4)
      {code}
      
      The algorithm tries to collect memory access candidates from AliasSetTracker, and then check memory dependences against one another. The memory accesses are unique in AliasSetTracker, and a single memory access in AliasSetTracker may map to multiple entries in AccessAnalysis, which could cover both 'read' and 'write'. Originally, the algorithm only checked the 'write' entry in Accesses if a 'write' exists. This is incorrect; the consequence is that it ignored all read accesses, so some RAW and WAR dependences were missed.
      
      For the case given above, if we ignore the two reads, the dependence between (1) and (3) cannot be captured, and this loop will be incorrectly vectorized.
      
      The fix simply inserts a new loop to find all entries in Accesses. Since it skips most other memory accesses by checking the Value pointer at the very beginning of the loop, it should not increase compile time visibly.
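
      For illustration, a compilable C++ version of the loop shape above (the
      function and its arguments are hypothetical): the write to a[i+1] in one
      iteration feeds the read of a[i] in the next, so vectorizing would break
      that cross-iteration dependence.

        // Hypothetical example of the pattern shown above.
        void f(int *a, int n, int k) {
          for (int i = 0; i + 1 < n; ++i) {
            int r1 = a[i];      // (1)
            int r2 = a[i + 1];  // (2)
            a[i + 1] = r1 + k;  // (3)
            a[i] = r2 - k;      // (4)
          }
        }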
      
      llvm-svn: 225159
    • [PowerPC] Enable speculation of cttz/ctlz · 9bb61de1
      Hal Finkel authored
      PPC has an instruction for ctlz with defined zero behavior, and our lowering of
      cttz (provided by DAGCombine) is also efficient and branchless, so speculating
      these makes sense.
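
      For context, a hedged example of the kind of source pattern this affects:
      with a branch-free, zero-defined ctlz, the guarded count below can be
      speculated and the branch turned into a select (the function is
      hypothetical; __builtin_clzll is the usual GCC/Clang builtin, which is
      undefined for zero at the source level, hence the guard):

        #include <cstdint>

        // Count leading zeros, defining the zero case explicitly.
        uint64_t leading_zeros_or_64(uint64_t x) {
          return x ? static_cast<uint64_t>(__builtin_clzll(x)) : 64;
        }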
      
      llvm-svn: 225150
    • [SROA] Apply a somewhat heavy and unpleasant hammer to fix PR22093, an · 73b0164f
      Chandler Carruth authored
      assert out of the new pre-splitting in SROA.
      
      This fix makes the code do what was originally intended -- when we have
      a store of a load both dealing in the same alloca, we force them to both
      be pre-split with identical offsets. This is really quite hard to do
      because we can keep discovering problems as we go along. We have to
      track every load over the current alloca which for any reason becomes
      invalid for pre-splitting, and go back to remove all stores of those
      loads. I've included a couple of test cases derived from PR22093 that
      cover the different ways this can happen. While that PR only really
      triggered the first of these two, it's the same fundamental issue.
      
      The other challenge here is documented in a FIXME now. We end up being
      quite a bit more aggressive for pre-splitting when loads and stores
      don't refer to the same alloca. This aggressiveness comes at the cost of
      introducing potentially redundant loads. It isn't clear that this is the
      right balance. It might be considerably better to require that we only
      do pre-splitting when we can presplit every load and store involved in
      the entire operation. That would give more consistent if conservative
      results. Unfortunately, it requires a non-trivial change to the actual
      pre-splitting operation in order to correctly handle cases where we end
      up pre-splitting stores out-of-order. And it isn't 100% clear that this
      is the right direction, although I'm starting to suspect that it is.
      
      llvm-svn: 225149
    • [PowerPC] Materialize i64 constants using rotation with masking · 2f61879f
      Hal Finkel authored
      r225135 added the ability to materialize i64 constants using rotations in order
      to reduce the instruction count. Sometimes we can use a rotation only with some
      extra masking, so that we take advantage of the fact that generating a bunch of
      extra higher-order 1 bits is easy using li/lis.
      
      llvm-svn: 225147
    • [PM] Wire up support for explicitly running the verifier pass. · 9c31db4f
      Chandler Carruth authored
      The required functionality has been there for some time, but I never
      managed to actually wire it into the command line registry of passes.
      Let's do that.
      
      llvm-svn: 225144
  2. Jan 04, 2015
    • [X86][SSE] Added vector packing test for pr12412 · b65a6ee8
      Simon Pilgrim authored
      llvm-svn: 225138
    • [X86][SSE] Added vector integer truncation tests - based off pr15524 · a1540c11
      Simon Pilgrim authored
      llvm-svn: 225137
    • [PowerPC] Materialize i64 constants using rotation · 241ba79f
      Hal Finkel authored
      Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
      5 instructions depending on the locations of the non-zero bits. Sometimes
      materializing a rotated constant, and then applying the inverse rotation, requires
      fewer instructions than the direct method. If so, do that instead.
      
      In r225132, I added support for forming constants using bit inversion. In
      effect, this reverts that commit and replaces it with rotation support. The bit
      inversion is useful for turning constants that are mostly ones into ones that
      are mostly zeros (thus enabling a more-efficient shift-based materialization),
      but the same effect can be obtained by using negative constants and a rotate,
      and that is at least as efficient, if not more.
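
      A small worked illustration of the idea (the constant here is
      hypothetical, not taken from the patch): a value with non-zero bits at
      both ends can be rotated into a form a single small immediate covers, and
      the backend then rotates it back.

        #include <cstdint>
        #include <cstdio>

        // Rotate left by n bits (n in [0, 63]).
        static uint64_t rotl64(uint64_t v, unsigned n) {
          return n ? (v << n) | (v >> (64 - n)) : v;
        }

        int main() {
          uint64_t c = 0xFF00000000000001ULL; // ones at both ends: costly directly
          uint64_t r = rotl64(c, 8);          // 0x1FF: a cheap immediate
          uint64_t back = rotl64(r, 56);      // rotating back recovers c
          printf("%#llx %#llx\n", (unsigned long long)r, (unsigned long long)back);
          return back == c ? 0 : 1;
        }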
      
      llvm-svn: 225135
    • [PowerPC] Materialize i64 constants using bit inversion · ca6375fb
      Hal Finkel authored
      Materializing full 64-bit constants on PPC64 can be expensive, requiring up to
      5 instructions depending on the locations of the non-zero bits. Sometimes
      materializing the bit-reversed constant, and then flipping the bits, requires
      fewer instructions than the direct method. If so, do that instead.
      
      llvm-svn: 225132
    • InstCombine: match can find ConstantExprs, don't assume we have a Value · 087dc8b8
      David Majnemer authored
      We assumed the output of a match was a Value; this would cause us to
      assert because we would fail a cast<>.  Instead, use a helper in the
      Operator family to hide the distinction between Value and Constant.
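
      A minimal sketch of the general idea, not the exact patch: the Operator
      class can report an opcode whether the underlying Value is an Instruction
      or a ConstantExpr (the helper function below is hypothetical):

        #include "llvm/IR/Operator.h"
        using namespace llvm;

        // cast<Instruction>(V) would assert when V is a ConstantExpr;
        // Operator covers both cases.
        static unsigned opcodeOf(Value *V) {
          if (auto *Op = dyn_cast<Operator>(V))
            return Op->getOpcode();
          return 0; // neither an instruction nor a constant expression
        }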
      
      This fixes PR22087.
      
      llvm-svn: 225127
    • ValueTracking: ComputeNumSignBits should tolerate misshapen phi nodes · 6ee8d17b
      David Majnemer authored
      PHI nodes can have zero operands in the middle of a transform.  It is
      expected that utilities in Analysis don't freak out when this happens.
      
      Note that it is considered invalid to allow these misshapen phi nodes to
      make it to another pass.
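
      A hedged sketch of the kind of guard this implies (illustrative only, not
      the exact patch):

        #include "llvm/IR/Instructions.h"
        using namespace llvm;

        // A PHI may transiently have zero incoming values mid-transform, so
        // analysis code should answer conservatively instead of asserting.
        static unsigned numSignBitsOfPHI(const PHINode *PN) {
          if (PN->getNumIncomingValues() == 0)
            return 1; // conservative answer for a degenerate PHI
          // ... otherwise recurse over the incoming values as usual ...
          return 1;
        }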
      
      This fixes PR22086.
      
      llvm-svn: 225126
  3. Jan 03, 2015
    • llvm-readobj: add support to dump COFF export tables · ddd92644
      Saleem Abdulrasool authored
      This enhances llvm-readobj to print out the COFF export table, similar to the
      -coff-import option.  This is useful for testing in lld.
      
      llvm-svn: 225120
    • ARM: permit tail calls to weak externals on COFF · 67f72993
      Saleem Abdulrasool authored
      Weak externals are resolved statically, so we can actually generate the tail
      call on PE/COFF targets without breaking the requirements.  It is questionable
      whether we want to propagate the current behaviour for MachO as the requirements
      are part of the ARM ELF specifications, and it seems that prior to SVN
      r215890, we would have tail-called. For now, be conservative and only
      permit it on PE/COFF where the call will always be fully resolved.
      
      llvm-svn: 225119
    • [PowerPC/BlockPlacement] Allow target to provide a per-loop alignment preference · 5772566e
      Hal Finkel authored
      The existing code provided for specifying a global loop alignment preference.
      However, the preferred loop alignment might depend on the loop itself. For
      recent POWER cores, loops between 5 and 8 instructions should have 32-byte
      alignment (while the others are better with 16-byte alignment) so that the
      entire loop will fit in one i-cache line.
      
      To support this, getPrefLoopAlignment has been made virtual, and can be
      provided with an optional MachineLoop* so the target can inspect the loop
      before answering the query. The default behavior, as before, is to return the
      value set with setPrefLoopAlignment. MachineBlockPlacement now queries the
      target for each loop instead of only once per function. There should be no
      functional change for other targets.
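
      A hedged sketch of the policy described above, as a standalone helper a
      target's getPrefLoopAlignment(MachineLoop*) override could call after
      counting the loop's machine instructions (the function name is
      hypothetical, and treating the return value as a log2 alignment is an
      assumption, not taken from the patch):

        // Illustrative only: 32-byte alignment for loops of 5-8 instructions,
        // 16-byte alignment otherwise, expressed as log2 values.
        unsigned preferredLoopAlignment(unsigned NumLoopInstrs) {
          if (NumLoopInstrs >= 5 && NumLoopInstrs <= 8)
            return 5; // log2(32)
          return 4;   // log2(16)
        }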
      
      llvm-svn: 225117
    • [PowerPC] Use 16-byte alignment for modern cores for functions/loops · d73bfba7
      Hal Finkel authored
      Most modern PowerPC cores prefer that functions and loops start on
      16-byte-aligned boundaries (*), so instruct block placement, etc. to make this
      happen. The branch selector has also been adjusted to account for the extra
      nops that might now be inserted before loop headers.
      
      (*) Some cores actually prefer other alignments for small loops, but that will
          be addressed in a follow-up commit.
      
      llvm-svn: 225115
    • [PowerPC] Add support for the CMPB instruction · 4edc66b8
      Hal Finkel authored
      Newer POWER cores, and the A2, support the cmpb instruction. This instruction
      compares its operands, treating each of the 8 bytes in the GPRs separately,
      returning a 'mask' result of 0 (for false) or -1 (for true) in each byte.
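
      For reference, a small C++ model of the byte-wise semantics described
      above, assuming the comparison is byte equality (the function is
      illustrative only):

        #include <cstdint>

        // For each of the 8 bytes, the result byte is 0xFF if the
        // corresponding bytes of a and b are equal, and 0x00 otherwise.
        uint64_t cmpb_model(uint64_t a, uint64_t b) {
          uint64_t r = 0;
          for (int i = 0; i < 8; ++i) {
            uint64_t mask = 0xFFull << (8 * i);
            if ((a & mask) == (b & mask))
              r |= mask;
          }
          return r;
        }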
      
      Code generation support is added, in the form of a PPCISelDAGToDAG
      DAG-preprocessing routine, that recognizes patterns close to what the
      instruction computes (either exactly, or related by a constant masking
      operation), and generates the cmpb instruction (along with any necessary
      constant masking operation). This can be expanded if use cases arise.
      
      llvm-svn: 225106
    • [X86] Disassembler support for move to/from %rax with a 32-bit memory offset... · ae8e1b38
      Craig Topper authored
      [X86] Disassembler support for move to/from %rax with a 32-bit memory offset when the REX.W and AdSize prefixes are both present.
      
      llvm-svn: 225099
  4. Jan 02, 2015
    • InstCombine: Detect when llvm.umul.with.overflow always overflows · c8a576b5
      David Majnemer authored
      We know overflow always occurs if both ~LHSKnownZero * ~RHSKnownZero
      and LHSKnownOne * RHSKnownOne overflow.
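
      One way to read this (an editor's gloss): the known-one bits are a lower
      bound on each unsigned operand, so if even the product of those lower
      bounds overflows, every possible product does. A small self-contained
      check of that arithmetic, with hypothetical bit patterns:

        #include <cassert>
        #include <cstdint>

        int main() {
          // Hypothetical known-one bits for two 32-bit operands.
          uint32_t LHSKnownOne = 0x80000000u; // LHS is at least 2^31
          uint32_t RHSKnownOne = 0x00000004u; // RHS is at least 4
          uint32_t Prod;
          // 2^31 * 4 = 2^33 does not fit in 32 bits, so the minimal product
          // overflows, and therefore every product consistent with these known
          // bits overflows as well.
          bool MinOverflows = __builtin_mul_overflow(LHSKnownOne, RHSKnownOne, &Prod);
          (void)Prod;
          assert(MinOverflows);
          return MinOverflows ? 0 : 1;
        }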
      
      llvm-svn: 225077
    • [X86] Make the instructions that use AdSize16/32/64 co-exist together without... · 055845f5
      Craig Topper authored
      [X86] Make the instructions that use AdSize16/32/64 co-exist together without using mode predicates.
      
      This is necessary to allow the disassembler to handle AdSize32 instructions in 64-bit mode when the address size prefix is used.
      
      Eventually we should probably also support 'addr32' and 'addr16' in the assembler to override the address size on some of these instructions. But for now we'll just use special operand types that will look up the current mode size to select the right instruction.
      
      llvm-svn: 225075
    • [SROA] Teach SROA to be more aggressive in splitting now that we have · 24ac830d
      Chandler Carruth authored
      a pre-splitting pass over loads and stores.
      
      Historically, splitting could cause enough problems that I hamstrung the
      entire process with a requirement that splittable integer loads and
      stores must cover the entire alloca. All smaller loads and stores were
      unsplittable to prevent chaos from ensuing. With the new pre-splitting
      logic that does load/store pair splitting I introduced in r225061, we
      can now very nicely handle arbitrarily splittable loads and stores. In
      order to fully benefit from these smarts, we need to mark all of the
      integer loads and stores as splittable.
      
      However, we don't actually want to rewrite partitions with all integer
      loads and stores marked as splittable. This will fail to extract scalar
      integers from aggregates, which is kind of the point of SROA. =] In
      order to resolve this, what we really want to do is only do
      pre-splitting on the alloca slices with integer loads and stores fully
      splittable. This allows us to uncover all non-integer uses of the alloca
      that would benefit from a split in an integer load or store (and where
      introducing the split is safe because it is just memory transfer from
      a load to a store). Once done, we make all the non-whole-alloca integer
      loads and stores unsplittable just as they have historically been,
      repartition and rewrite.
      
      The result is that when there are integer loads and stores anywhere
      within an alloca (such as from a memcpy of a sub-object of a larger
      object), we can split them up if there are non-integer components to the
      aggregate hiding beneath. I've added the challenging test cases to
      demonstrate how this is able to promote to scalars even a case where we
      have even *partially* overlapping loads and stores.
      
      This restores the single-store behavior for small arrays of i8s which is
      really nice. I've restored both the little endian testing and big endian
      testing for these exactly as they were prior to r225061. It also forced
      me to be more aggressive in an alignment test to actually defeat SROA.
      =] Without the added volatiles there, we actually split up the weird i16
      loads and produce nice double allocas with better alignment.
      
      This also uncovered a number of bugs where we failed to handle
      splittable load and store slices which didn't have a beginning offset of
      zero. Those fixes are included, and without them the existing test cases
      explode in glorious fireworks. =]
      
      I've kept support for leaving whole-alloca integer loads and stores as
      splittable even for the purpose of rewriting, but I think that's likely
      no longer needed. With the new pre-splitting, we might be able to remove
      all the splitting support for loads and stores from the rewriter. Not
      doing that in this patch to try to isolate any performance regressions
      that causes in an easy to find and revert chunk.
      
      llvm-svn: 225074
    • [SROA] Add a test case for r225068 / PR22080. · e65ae893
      Chandler Carruth authored
      llvm-svn: 225070
  5. Jan 01, 2015
    • [SROA] Teach SROA how to much more intelligently handle split loads and · 0715cba0
      Chandler Carruth authored
      stores.
      
      When there are accesses to an entire alloca with an integer
      load or store as well as accesses to small pieces of the alloca, SROA
      splits up the large integer accesses. In order to do that, it uses bit
      math to merge the small accesses into large integers. While this is
      effective, it produces insane IR that can cause significant problems in
      the rest of the optimizer:
      
      - It can cause load and store mismatches with GVN on the non-alloca side
        where we end up loading an i64 (or some such) rather than loading
        specific elements that are stored.
      - We can't always get rid of the integer bit math, which is why we can't
        always fix the loads and stores to work well with GVN.
      - This is especially bad when we have operations that mix poorly with
        integer bit math such as floating point operations.
      - It will block things like the vectorizer which might be able to handle
        the scalar stores that underlie the aggregate.
      
      At the same time, we can't just directly split up these loads and stores
      in all cases. If there is actual integer arithmetic involved on the
      values, then using integer bit math is actually the perfect lowering
      because we can often combine it heavily with the surrounding math.
      
      The solution this patch provides is to find places where SROA is
      partitioning aggregates into small elements, and look for splittable
      loads and stores that it can split all the way to some other adjacent
      load and store. These are uniformly the cases where failing to split the
      loads and stores hurts the optimizer that I have seen, and I've looked
      extensively at the code produced both from more and less aggressive
      approaches to this problem.
      
      However, it is quite tricky to actually do this in SROA. We may have
      loads and stores to the same alloca, or other complex patterns that are
      hard to handle. This complexity leads to the somewhat subtle algorithm
      implemented here. We have to do this entire process as a separate pass
      over the partitioning of the alloca, and split up all of the loads prior
      to splitting the stores so that we can handle safely the cases of
      overlapping, including partially overlapping, loads and stores to the
      same alloca. We also have to reconstitute the post-split slice
      configuration so we can avoid iterating again over all the alloca uses
      (the slow part of SROA). But we also have to ensure that when we split
      up loads and stores to *other* allocas, we *do* re-iterate over them in
      SROA to adapt to the more refined partitioning now required.
      
      With this, I actually think we can fix a long-standing TODO in SROA
      where I avoided splitting as many loads and stores as should probably be
      split. This limitation historically mitigated the fallout of all
      the bad things mentioned above. Now that we have more intelligent
      handling, I plan to remove the FIXME and more aggressively mark integer
      loads and stores as splittable. I'll do that in a follow-up patch to
      help with bisecting any fallout.
      
      The net result of this change should be more fine-grained and accurate
      scalars being formed out of aggregates. At the very least, Clang now
      generates perfect code for this high-level test case using
      std::complex<float>:
      
        #include <complex>
      
        void g1(std::complex<float> &x, float a, float b) {
          x += std::complex<float>(a, b);
        }
        void g2(std::complex<float> &x, float a, float b) {
          x -= std::complex<float>(a, b);
        }
      
        void foo(const std::complex<float> &x, float a, float b,
                 std::complex<float> &x1, std::complex<float> &x2) {
          std::complex<float> l1 = x;
          g1(l1, a, b);
          std::complex<float> l2 = x;
          g2(l2, a, b);
          x1 = l1;
          x2 = l2;
        }
      
      This code isn't just hypothetical either. It was reduced out of the hot
      inner loops of essentially every part of the Eigen math library when
      using std::complex<float>. Those loops would consistently and
      pervasively hop between the floating point unit and the integer unit due
      to bit math extraction and insertion of floating point values that were
      "stored" in a 64-bit integer register around the loop backedge.
      
      So far, this change has passed a bootstrap and I have done some other
      testing and so far, no issues. That doesn't mean there won't be though,
      so I'll be prepared to help with any fallout. If you see performance swings
      in particular, please let me know. I'm very curious what all the impact
      of this change will be. Stay tuned for the follow-up to also split more
      integer loads and stores.
      
      llvm-svn: 225061
    • [PowerPC] Improve instruction selection bit-permuting operations (64-bit) · c58ce413
      Hal Finkel authored
      This is the second installment of improvements to instruction selection for "bit
      permutation" instruction sequences. r224318 added logic for instruction
      selection for 32-bit bit permutation sequences, and this adds lowering for
      64-bit sequences. The 64-bit sequences are more complicated than the 32-bit
      ones because:
        a) the 64-bit versions of the 32-bit rotate-and-mask instructions
           work by replicating the lower 32-bits of the value-to-be-rotated into the
           upper 32 bits -- and integrating this into the cost modeling for the various
           bit group operations is non-trivial
        b) unlike the 32-bit instructions in 32-bit mode, the rotate-and-mask instructions
           cannot, in one instruction, specify the
           mask starting index, the mask ending index, and the rotation factor. Also,
           forming arbitrary 64-bit constants is more complicated than in 32-bit mode
           because the number of instructions necessary is value dependent.
      
      Plus, support for 'late masking' was added: it is sometimes more efficient to
      treat the overall value as if it had no mandatory zero bits when planning the
      bit-group insertions, and then mask them in at the very end. Unfortunately, as
      the structure of the bit groups is different in the two cases, the more
      feasible implementation technique was to generate both instruction sequences,
      and then pick the shorter one.
      
      And finally, we now generate reasonable code for i64 bswap:
      
              rldicl 5, 3, 16, 0
              rldicl 4, 3, 8, 0
              rldicl 6, 3, 24, 0
              rldimi 4, 5, 8, 48
              rldicl 5, 3, 32, 0
              rldimi 4, 6, 16, 40
              rldicl 6, 3, 48, 0
              rldimi 4, 5, 24, 32
              rldicl 5, 3, 56, 0
              rldimi 4, 6, 40, 16
              rldimi 4, 5, 48, 8
              rldimi 4, 3, 56, 0
      
      vs. what we used to produce:
      
              li 4, 255
              rldicl 5, 3, 24, 40
              rldicl 6, 3, 40, 24
              rldicl 7, 3, 56, 8
              sldi 8, 3, 8
              sldi 10, 3, 24
              sldi 12, 3, 40
              rldicl 0, 3, 8, 56
              sldi 9, 4, 32
              sldi 11, 4, 40
              sldi 4, 4, 48
              andi. 5, 5, 65280
              andis. 6, 6, 255
              andis. 7, 7, 65280
              sldi 3, 3, 56
              and 8, 8, 9
              and 4, 12, 4
              and 9, 10, 11
              or 6, 7, 6
              or 5, 5, 0
              or 3, 3, 4
              or 7, 9, 8
              or 4, 6, 5
              or 3, 3, 7
              or 3, 3, 4
      
      which is 12 instructions, instead of 25, and seems optimal (at least in terms
      of code size).
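
      For reference, a source-level function that exercises this i64 bswap
      lowering (a minimal sketch; __builtin_bswap64 is the usual GCC/Clang
      builtin):

        #include <cstdint>

        // Byte-swap a 64-bit value; on PPC64 this should now select a compact
        // rldicl/rldimi sequence like the one shown above.
        uint64_t bswap64(uint64_t x) {
          return __builtin_bswap64(x);
        }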
      
      llvm-svn: 225056
  6. Dec 31, 2014