  1. Jan 02, 2015
    • Philip Reames's avatar
      Reformat statepoint documentation and fix a couple of typos · dfc238b4
      Philip Reames authored
      Patch by Ramkumar Ramachandra <artagnon@gmail.com>.
      
      llvm-svn: 225084
      dfc238b4
    • Andrea Di Biagio's avatar
      Improved comments. No functional change intended. · 6477847e
      Andrea Di Biagio authored
      llvm-svn: 225080
      6477847e
    • Craig Topper's avatar
      [X86] Bring some better consistency to the naming of the move to/from %al/ax/eax/rax with memory offset. · 4e5ab81a
      Craig Topper authored
      
      llvm-svn: 225078
      4e5ab81a
    • David Majnemer's avatar
      InstCombine: Detect when llvm.umul.with.overflow always overflows · c8a576b5
      David Majnemer authored
      We know overflow always occurs if both ~LHSKnownZero * ~RHSKnownZero
      and LHSKnownOne * RHSKnownOne overflow.
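
      Roughly, the check can be pictured in plain C++ over 64-bit operands
      (hypothetical helper names, not the actual InstCombine code): ~KnownZero
      is the largest value consistent with the known bits and KnownOne the
      smallest, so if both extreme products overflow, every possible product
      overflows.

        #include <cstdint>

        // Known-bits summary of an i64 operand: bits proven zero / proven one.
        struct KnownBits64 {
          uint64_t Zero; // bits known to be 0
          uint64_t One;  // bits known to be 1
        };

        static bool mulOverflows(uint64_t A, uint64_t B) {
          return B != 0 && A > UINT64_MAX / B;
        }

        // Mirrors the condition above: check the product of the largest
        // possible values (~Zero) and of the smallest possible values (One).
        static bool alwaysOverflows(KnownBits64 L, KnownBits64 R) {
          return mulOverflows(~L.Zero, ~R.Zero) && mulOverflows(L.One, R.One);
        }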
      
      llvm-svn: 225077
      c8a576b5
    • David Majnemer's avatar
      Analysis: Reformulate WillNotOverflowUnsignedMul for reusability · 491331ac
      David Majnemer authored
      WillNotOverflowUnsignedMul's smarts will live in ValueTracking as
      computeOverflowForUnsignedMul. It now returns a tri-state result:
      never overflows, always overflows, or sometimes overflows.
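
      A minimal sketch of the tri-state interface being described (names
      approximate; the real declaration lives in ValueTracking):

        // Three-way answer for unsigned multiply overflow.
        enum class OverflowResult { AlwaysOverflows, MayOverflow, NeverOverflows };

        // Callers can now distinguish "provably overflows" from "provably
        // does not" from "unknown", instead of getting a single yes/no.
        inline bool willNotOverflow(OverflowResult R) {
          return R == OverflowResult::NeverOverflows;
        }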
      
      llvm-svn: 225076
      491331ac
    • Craig Topper's avatar
      [X86] Make the instructions that use AdSize16/32/64 co-exist together without using mode predicates. · 055845f5
      Craig Topper authored
      
      This is necessary to allow the disassembler to handle AdSize32 instructions in 64-bit mode when the address-size prefix is used.
      
      Eventually we should probably also support 'addr32' and 'addr16' in the assembler to override the address size on some of these instructions. But for now we'll just use special operand types that will look up the current mode size to select the right instruction.
      
      llvm-svn: 225075
      055845f5
    • Chandler Carruth's avatar
      [SROA] Teach SROA to be more aggressive in splitting now that we have · 24ac830d
      Chandler Carruth authored
      a pre-splitting pass over loads and stores.
      
      Historically, splitting could cause enough problems that I hamstrung the
      entire process with a requirement that splittable integer loads and
      stores must cover the entire alloca. All smaller loads and stores were
      unsplittable to prevent chaos from ensuing. With the new pre-splitting
      logic that does load/store pair splitting I introduced in r225061, we
      can now very nicely handle arbitrarily splittable loads and stores. In
      order to fully benefit from these smarts, we need to mark all of the
      integer loads and stores as splittable.
      
      However, we don't actually want to rewrite partitions with all integer
      loads and stores marked as splittable. This will fail to extract scalar
      integers from aggregates, which is kind of the point of SROA. =] In
      order to resolve this, what we really want to do is only do
      pre-splitting on the alloca slices with integer loads and stores fully
      splittable. This allows us to uncover all non-integer uses of the alloca
      that would benefit from a split in an integer load or store (and where
      introducing the split is safe because it is just memory transfer from
      a load to a store). Once done, we make all the non-whole-alloca integer
      loads and stores unsplittable just as they have historically been,
      repartition and rewrite.
      
      The result is that when there are integer loads and stores anywhere
      within an alloca (such as from a memcpy of a sub-object of a larger
      object), we can split them up if there are non-integer components to the
      aggregate hiding beneath. I've added the challenging test cases to
      demonstrate how this is able to promote to scalars even a case where we
      have even *partially* overlapping loads and stores.
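
      A hypothetical source pattern of the kind described (illustrative, not
      one of the added test cases): the copy of the nested sub-object is
      lowered to an integer load/store pair, while the element accesses are
      floating point, so splitting the integer copy lets both floats promote
      to scalars.

        #include <cstring>

        struct Inner { float re, im; };
        struct Outer { int tag; Inner val; };

        float sum(const Outer &o) {
          Inner tmp;
          std::memcpy(&tmp, &o.val, sizeof(Inner)); // becomes a 64-bit integer copy
          return tmp.re + tmp.im;                   // non-integer uses beneath it
        }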
      
      This restores the single-store behavior for small arrays of i8s which is
      really nice. I've restored both the little endian testing and big endian
      testing for these exactly as they were prior to r225061. It also forced
      me to be more aggressive in an alignment test to actually defeat SROA.
      =] Without the added volatiles there, we actually split up the weird i16
      loads and produce nice double allocas with better alignment.
      
      This also uncovered a number of bugs where we failed to handle
      splittable load and store slices which didn't have a beginning offset of
      zero. Those fixes are included, and without them the existing test cases
      explode in glorious fireworks. =]
      
      I've kept support for leaving whole-alloca integer loads and stores as
      splittable even for the purpose of rewriting, but I think that's likely
      no longer needed. With the new pre-splitting, we might be able to remove
      all the splitting support for loads and stores from the rewriter. Not
      doing that in this patch to try to isolate any performance regressions
      it causes in an easy-to-find-and-revert chunk.
      
      llvm-svn: 225074
      24ac830d
    • Chandler Carruth's avatar
      [SROA] Make the computation of adjusted pointers not leak GEP · 5986b541
      Chandler Carruth authored
      instructions.
      
      I noticed this when working on dialing up how aggressively we can
      pre-split loads and stores. My test case wasn't passing because dead
      GEPs into the allocas persisted when they were built by this routine.
      This isn't terribly harmful; we still rewrote and promoted the alloca
      and I can't conceive of how to cause this to happen in a case where we
      will keep the exact same alloca but rewrite and promote the uses of it.
      If that ever happened, we'd get an assert out of mem2reg.
      
      So I don't have a direct test case yet, but the subsequent commit's test
      case wouldn't pass without this. There are other problems fixed by this
      patch that I spotted purely by inspection such as the fact that
      getAdjustedPtr could have actually deleted dead base pointers. I don't
      know how to get a base pointer to go into getAdjustedPtr today, so
      I think this bug could never have manifested (and I certainly can't
      write a test case for it), but it wasn't the intent of the code. The
      code really just wanted to GC the new instructions built. That can be
      done more directly by comparing with the base pointer which is the only
      non-new instruction that this code can return.
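
      The shape of the more direct cleanup, sketched with hypothetical names
      (not the actual getAdjustedPtr code): anything other than the base
      pointer must be an instruction this routine just built, so if it ends
      up unused it can simply be erased.

        #include "llvm/IR/Instruction.h"
        #include "llvm/Support/Casting.h"
        using namespace llvm;

        static void cleanupAdjustedPtr(Value *Ptr, Value *BasePtr) {
          // BasePtr is the only non-new value this code can return; every
          // other result is a freshly built instruction we own.
          if (Ptr != BasePtr && Ptr->use_empty())
            cast<Instruction>(Ptr)->eraseFromParent();
        }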
      
      llvm-svn: 225073
      5986b541
    • Chandler Carruth's avatar
      [SROA] Add a test case for r225068 / PR22080. · e65ae893
      Chandler Carruth authored
      llvm-svn: 225070
      e65ae893
    • Chandler Carruth's avatar
      [SROA] Fix the loop exit placement to be prior to indexing the splits · 29c22fae
      Chandler Carruth authored
      array. This prevents it from walking out of bounds on the splits array.
      
      Bug found with the existing tests by ASan and by the MSVC debug build.
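
      In generic terms, the fix moves the exit test ahead of the subscript
      (illustrative sketch, not the SROA loop itself):

        #include <cstddef>
        #include <vector>

        void walkSplits(const std::vector<int> &Splits) {
          for (size_t Idx = 0;; ++Idx) {
            if (Idx >= Splits.size())
              break;               // exit checked *before* indexing...
            int S = Splits[Idx];   // ...so Splits[Idx] is always in bounds
            (void)S;
          }
        }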
      
      llvm-svn: 225069
      29c22fae
    • Chandler Carruth's avatar
      [SROA] Fix two total think-os in r225061 that should have been caught on · c39eaa50
      Chandler Carruth authored
      a +asserts bootstrap, but my bootstrap had asserts off. Oops.
      
      Anyway, in some places it is reasonable to cast (as a sanity check) the
      pointer operand to a load or store to an instruction within SROA --
      namely when the pointer operand is expected to be derived from an
      alloca, and thus always an instruction. However, the pre-splitting code
      also deals with loads and stores to non-alloca pointers and there we
      need to just use the Value*. Nothing about the code relied on the
      instruction cast, it was only there essentially as an invariant
      assertion. Remove the two that don't actually hold.
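
      Roughly the pattern involved (hypothetical snippet, not the exact SROA
      code): keep the pointer operand as a Value* wherever it is not
      guaranteed to derive from an alloca.

        #include "llvm/IR/Instructions.h"
        using namespace llvm;

        void inspect(LoadInst &LI) {
          // When the pointer is known to come from an alloca, this doubles as
          // an assertion:
          //   Instruction *Ptr = cast<Instruction>(LI.getPointerOperand());
          // Pre-splitting also sees loads through non-alloca pointers (e.g.
          // an argument), where that cast would fail, so use the Value*.
          Value *Ptr = LI.getPointerOperand();
          (void)Ptr;
        }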
      
      This should fix the proximate issue in PR22080, but I'm also doing an
      asserts bootstrap myself to see if there are other issues lurking.
      
      I'll craft a reduced test case in a moment, but I wanted to get the tree
      healthy as quickly as possible.
      
      llvm-svn: 225068
      c39eaa50
  2. Jan 01, 2015
    • Hal Finkel's avatar
      [PowerPC] use UINT64_C instead of ul · ddf8d7d1
      Hal Finkel authored
      Attempting to fix PR22078 (building on 32-bit systems) by replacing my
      careless use of 1ul, where a uint64_t constant was intended, with
      UINT64_C(1).
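
      A minimal illustration of the problem (not the PowerPC code itself): on
      a 32-bit host, unsigned long may be only 32 bits wide, so shifting 1ul
      past bit 31 is not a valid way to form a 64-bit constant, whereas
      UINT64_C(1) always has a 64-bit type.

        #include <cstdint>

        // With a 32-bit unsigned long, (1ul << 40) shifts past the type's
        // width instead of producing bit 40 of a 64-bit value.
        uint64_t bit40() { return UINT64_C(1) << 40; } // well-defined everywhere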
      
      llvm-svn: 225066
      ddf8d7d1
    • Michael Gottesman's avatar
      Revert "Just use a using directive in SmallMapVector instead of inheriting from MapVector itself." · 40898a5a
      Michael Gottesman authored
      This reverts commit r225059. I think MSVC 2012 has a problem with this. This is
      an attempt to fix one of the MSVC 2012 bots.
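
      For context, the reverted r225059 turned SmallMapVector into an alias
      template along these lines (assumed shape, shown only for illustration);
      the in-tree version remains a thin class inheriting from MapVector:

        #include "llvm/ADT/DenseMap.h"
        #include "llvm/ADT/MapVector.h"
        #include "llvm/ADT/SmallVector.h"
        #include <utility>
        using namespace llvm;

        // Alias-template formulation of the idea; MSVC 2012 appeared to have
        // trouble with the equivalent change in-tree.
        template <typename KeyT, typename ValueT, unsigned N>
        using SmallMapVectorAlias =
            MapVector<KeyT, ValueT, SmallDenseMap<KeyT, unsigned, N>,
                      SmallVector<std::pair<KeyT, ValueT>, N>>;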
      
      llvm-svn: 225065
      40898a5a
    • Chandler Carruth's avatar
      Revert r225053: Add an ArrayRef upcasting constructor from ArrayRef<U*> -> ArrayRef<T*> where T is a base of U. · a1f4697d
      Chandler Carruth authored
      
      This appears to have broken at least the windows build bots due to
      compile errors in the predicate that didn't simply suppress the overload.
      I'm not sure what the fix is, and the bots have been broken for a long
      time now so I'm just reverting until Michael can figure out a fix.
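
      The gist of the reverted change, in miniature (hypothetical standalone
      code, not r225053 itself): a converting overload guarded by an enable_if
      predicate that is supposed to silently drop out of overload resolution
      when U* does not convert to T*; if the predicate itself fails to compile
      rather than evaluating to false, callers see a hard error instead.

        #include <type_traits>

        struct Base {};
        struct Derived : Base {};

        // Participates in overload resolution only when U* converts to T*.
        template <typename T, typename U,
                  typename = typename std::enable_if<
                      std::is_convertible<U *, T *>::value>::type>
        T *upcast(U *Ptr) { return Ptr; }

        Base *example(Derived *D) { return upcast<Base>(D); }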
      
      llvm-svn: 225064
      a1f4697d
    • Chandler Carruth's avatar
      [SROA] Switch to using a more direct debug logging technique in one part · 6044c0bc
      Chandler Carruth authored
      of my new load and store splitting, and fix a bug where it logged
      a totally irrelevant slice rather than the actual slice in question.
      
      The logging here previously worked because we used to place new slices
      onto the back of the core sequence, but that caused other problems.
      I updated the actual code to store new slices in their own vector but
      didn't update the logging. There isn't a good way to reuse the logging
      any more, and frankly it wasn't needed. We can directly log this bit
      more easily.
      
      llvm-svn: 225063
      6044c0bc
    • Chandler Carruth's avatar
      [SROA] Fix formatting with clang-format which I managed to fail to do · 994cde88
      Chandler Carruth authored
      prior to committing r225061. Sorry for that.
      
      llvm-svn: 225062
      994cde88
    • Chandler Carruth's avatar
      [SROA] Teach SROA how to much more intelligently handle split loads and · 0715cba0
      Chandler Carruth authored
      stores.
      
      When there are accesses to an entire alloca with an integer
      load or store as well as accesses to small pieces of the alloca, SROA
      splits up the large integer accesses. In order to do that, it uses bit
      math to merge the small accesses into large integers. While this is
      effective, it produces insane IR that can cause significant problems in
      the rest of the optimizer:
      
      - It can cause load and store mismatches with GVN on the non-alloca side
        where we end up loading an i64 (or some such) rather than loading
        specific elements that are stored.
      - We can't always get rid of the integer bit math, which is why we can't
        always fix the loads and stores to work well with GVN.
      - This is especially bad when we have operations that mix poorly with
        integer bit math such as floating point operations.
      - It will block things like the vectorizer which might be able to handle
        the scalar stores that underlie the aggregate.
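
      To make the bit math concrete, here is a rough C++ analogue of what the
      rewritten IR effectively does when two 32-bit elements live inside one
      64-bit integer (illustrative only, not SROA's actual output):

        #include <cstdint>

        // A small store becomes a mask-and-or on the wide value...
        uint64_t storeLow(uint64_t Whole, uint32_t Elt) {
          return (Whole & 0xFFFFFFFF00000000ull) | Elt;
        }

        // ...and a small load becomes a shift-and-truncate, which GVN and the
        // vectorizer cannot easily relate back to the element-wise stores.
        uint32_t loadHigh(uint64_t Whole) {
          return uint32_t(Whole >> 32);
        }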
      
      At the same time, we can't just directly split up these loads and stores
      in all cases. If there is actual integer arithmetic involved on the
      values, then using integer bit math is actually the perfect lowering
      because we can often combine it heavily with the surrounding math.
      
      The solution this patch provides is to find places where SROA is
      partitioning aggregates into small elements, and look for splittable
      loads and stores that it can split all the way to some other adjacent
      load and store. These are uniformly the cases I have seen where failing
      to split the loads and stores hurts the optimizer, and I've looked
      extensively at the code produced both from more and less aggressive
      approaches to this problem.
      
      However, it is quite tricky to actually do this in SROA. We may have
      loads and stores to the same alloca, or other complex patterns that are
      hard to handle. This complexity leads to the somewhat subtle algorithm
      implemented here. We have to do this entire process as a separate pass
      over the partitioning of the alloca, and split up all of the loads prior
      to splitting the stores so that we can handle safely the cases of
      overlapping, including partially overlapping, loads and stores to the
      same alloca. We also have to reconstitute the post-split slice
      configuration so we can avoid iterating again over all the alloca uses
      (the slow part of SROA). But we also have to ensure that when we split
      up loads and stores to *other* allocas, we *do* re-iterate over them in
      SROA to adapt to the more refined partitioning now required.
      
      With this, I actually think we can fix a long-standing TODO in SROA
      where I avoided splitting as many loads and stores as probably should be
      splittable. This limitation historically mitigated the fallout of all
      the bad things mentioned above. Now that we have more intelligent
      handling, I plan to remove the FIXME and more aggressively mark integer
      loads and stores as splittable. I'll do that in a follow-up patch to
      help with bisecting any fallout.
      
      The net result of this change should be more fine-grained and accurate
      scalars being formed out of aggregates. At the very least, Clang now
      generates perfect code for this high-level test case using
      std::complex<float>:
      
        #include <complex>
      
        void g1(std::complex<float> &x, float a, float b) {
          x += std::complex<float>(a, b);
        }
        void g2(std::complex<float> &x, float a, float b) {
          x -= std::complex<float>(a, b);
        }
      
        void foo(const std::complex<float> &x, float a, float b,
                 std::complex<float> &x1, std::complex<float> &x2) {
          std::complex<float> l1 = x;
          g1(l1, a, b);
          std::complex<float> l2 = x;
          g2(l2, a, b);
          x1 = l1;
          x2 = l2;
        }
      
      This code isn't just hypothetical either. It was reduced out of the hot
      inner loops of essentially every part of the Eigen math library when
      using std::complex<float>. Those loops would consistently and
      pervasively hop between the floating point unit and the integer unit due
      to bit math extraction and insertion of floating point values that were
      "stored" in a 64-bit integer register around the loop backedge.
      
      So far, this change has passed a bootstrap and some other testing with
      no issues. That doesn't mean there won't be any, so I'll be prepared to
      help with any fallout. If you see performance swings in particular,
      please let me know. I'm very curious what the impact of this change
      will be. Stay tuned for the follow-up to also split more
      integer loads and stores.
      
      llvm-svn: 225061
      0715cba0
    • Hal Finkel's avatar
      [PowerPC] Improve instruction selection bit-permuting operations (64-bit) · c58ce413
      Hal Finkel authored
      This is the second installment of improvements to instruction selection for "bit
      permutation" instruction sequences. r224318 added logic for instruction
      selection for 32-bit bit permutation sequences, and this adds lowering for
      64-bit sequences. The 64-bit sequences are more complicated than the 32-bit
      ones because:
        a) the 64-bit versions of the 32-bit rotate-and-mask instructions
           work by replicating the lower 32-bits of the value-to-be-rotated into the
           upper 32 bits -- and integrating this into the cost modeling for the various
           bit group operations is non-trivial
        b) unlike the 32-bit instructions in 32-bit mode, the rotate-and-mask instructions
           cannot, in one instruction, specify the
           mask starting index, the mask ending index, and the rotation factor. Also,
           forming arbitrary 64-bit constants is more complicated than in 32-bit mode
           because the number of instructions necessary is value dependent.
      
      Plus, support for 'late masking' was added: it is sometimes more efficient to
      treat the overall value as if it had no mandatory zero bits when planning the
      bit-group insertions, and then mask them in at the very end. Unfortunately, as
      the structure of the bit groups is different in the two cases, the more
      feasible implementation technique was to generate both instruction sequences,
      and then pick the shorter one.
      
      And finally, we now generate reasonable code for i64 bswap:
      
              rldicl 5, 3, 16, 0
              rldicl 4, 3, 8, 0
              rldicl 6, 3, 24, 0
              rldimi 4, 5, 8, 48
              rldicl 5, 3, 32, 0
              rldimi 4, 6, 16, 40
              rldicl 6, 3, 48, 0
              rldimi 4, 5, 24, 32
              rldicl 5, 3, 56, 0
              rldimi 4, 6, 40, 16
              rldimi 4, 5, 48, 8
              rldimi 4, 3, 56, 0
      
      vs. what we used to produce:
      
              li 4, 255
              rldicl 5, 3, 24, 40
              rldicl 6, 3, 40, 24
              rldicl 7, 3, 56, 8
              sldi 8, 3, 8
              sldi 10, 3, 24
              sldi 12, 3, 40
              rldicl 0, 3, 8, 56
              sldi 9, 4, 32
              sldi 11, 4, 40
              sldi 4, 4, 48
              andi. 5, 5, 65280
              andis. 6, 6, 255
              andis. 7, 7, 65280
              sldi 3, 3, 56
              and 8, 8, 9
              and 4, 12, 4
              and 9, 10, 11
              or 6, 7, 6
              or 5, 5, 0
              or 3, 3, 4
              or 7, 9, 8
              or 4, 6, 5
              or 3, 3, 7
              or 3, 3, 4
      
      which is 12 instructions, instead of 25, and seems optimal (at least in terms
      of code size).
      
      llvm-svn: 225056
      c58ce413
    • Michael Gottesman's avatar
      Add 2x constructors for TinyPtrVector, one that takes in one element and the other that takes in an ArrayRef<EltTy> · 4c638994
      Michael Gottesman authored
      
      Currently one can only construct an empty TinyPtrVector. These are just missing
      elements of the API.
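
      A quick usage sketch, assuming the two constructors take exactly the
      forms named above (one element, and an ArrayRef<EltTy>):

        #include "llvm/ADT/ArrayRef.h"
        #include "llvm/ADT/TinyPtrVector.h"
        using namespace llvm;

        void example(int *P, ArrayRef<int *> Ps) {
          TinyPtrVector<int *> One(P);    // construct from a single element
          TinyPtrVector<int *> Many(Ps);  // construct from an ArrayRef<EltTy>
          (void)One; (void)Many;
        }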
      
      llvm-svn: 225055
      4c638994
    • Michael Gottesman's avatar
      Add a SmallMapVector class that is a MapVector with a Map of SmallDenseMap and a Vector of SmallVector. · 64670671
      Michael Gottesman authored
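
      A usage sketch, assuming the template parameters are key type, value
      type, and inline ("small") size as the description suggests:

        #include "llvm/ADT/MapVector.h"
        using namespace llvm;

        void example(int *Key) {
          // Preserves insertion order like MapVector, with inline storage for
          // both the SmallDenseMap and the SmallVector up to 4 entries.
          SmallMapVector<int *, unsigned, 4> MV;
          MV.insert({Key, 1u});
        }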
      
      llvm-svn: 225054
      64670671