  1. Feb 06, 2014
    • Nick Lewycky's avatar
      99384949
    • Manman Ren's avatar
      Set default of inlinecold-threshold to 225. · d4612449
      Manman Ren authored
      225 is the default value of inline-threshold. This change will make sure
      we have the same inlining behavior as prior to r200886.
      
      As Chandler points out, even though we don't have code in our testing
      suite that uses cold attribute, there are larger applications that do
      use cold attribute.
      
      r200886 + this commit intend to keep the same behavior as prior to r200886.
      We can later on tune the inlinecold-threshold.
      
      The main purpose of r200886 is to help performance of instrumentation based
      PGO before we actually hook up inliner with analysis passes such as BPI and BFI.
      For instrumentation based PGO, we try to increase inlining of hot functions and
      reduce inlining of cold functions by setting inlinecold-threshold.
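
A minimal sketch of the threshold choice described above. The struct and function names are hypothetical and this is not the inliner's actual code; only the two option names and the value 225 come from the commit message.

```cpp
// Hypothetical model of the two command-line options; not LLVM's inliner code.
struct InlineParams {
  int InlineThreshold = 225;     // default of inline-threshold
  int InlineColdThreshold = 225; // default of inlinecold-threshold after this commit
};

// Pick the threshold for a callee: cold callees use the cold threshold,
// everything else uses the regular one.
int pickInlineThreshold(const InlineParams &P, bool CalleeIsCold) {
  return CalleeIsCold ? P.InlineColdThreshold : P.InlineThreshold;
}
```

With both defaults equal, cold and non-cold callees get the same threshold, which is how the pre-r200886 behavior is kept; lowering only `InlineColdThreshold` would reduce inlining of cold functions.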
      
      Another option suggested by Chandler is to use a boolean flag that controls
      if we should use OptSizeThreshold for cold functions. The default value
      of the boolean flag should not change the current behavior. But it gives us
      less freedom in controlling inlining of cold functions.
      
      llvm-svn: 200898
      d4612449
    • Paul Robinson's avatar
      Disable most IR-level transform passes on functions marked 'optnone'. · af4e64d0
      Paul Robinson authored
      Ideally only those transform passes that run at -O0 remain enabled;
      in reality we get as close as we reasonably can.
      Passes are responsible for disabling themselves; it's not the job of
      the pass manager to do it for them.
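
A rough sketch of the "passes disable themselves" idea, using a toy model rather than LLVM's classes. `Function`, `hasAttr`, and `runTransform` are hypothetical names introduced for illustration.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for an IR function carrying attributes.
struct Function {
  std::set<std::string> Attrs;
  bool hasAttr(const std::string &A) const { return Attrs.count(A) != 0; }
};

// Hypothetical transform pass: returns true if it changed the function.
bool runTransform(Function &F) {
  // The pass checks for 'optnone' itself and bails out early;
  // the pass manager plays no part in the decision.
  if (F.hasAttr("optnone"))
    return false;
  F.Attrs.insert("transformed"); // placeholder for the real work
  return true;
}
```

The point of the design is that each pass carries its own early exit, so a pass that still needs to run on 'optnone' functions simply omits the check.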
      
      llvm-svn: 200892
      af4e64d0
  2. Feb 05, 2014
  3. Feb 04, 2014
  4. Feb 03, 2014
    • Reid Kleckner's avatar
      inalloca: Don't remove dead arguments in the presence of inalloca args · d47a59a4
      Reid Kleckner authored
      It disturbs the layout of the parameters in memory and registers,
      leading to problems in the backend.
      
      The plan for optimizing internal inalloca functions going forward is to
      essentially SROA the argument memory and demote any captured arguments
      (things that aren't trivially written by a load or store) to an indirect
      pointer to a static alloca.
      
      llvm-svn: 200717
      d47a59a4
  5. Feb 02, 2014
  6. Feb 01, 2014
    • Chandler Carruth's avatar
      [LPM] Apply a really big hammer to fix PR18688 by recursively reforming · 1665152c
      Chandler Carruth authored
      LCSSA when we promote to SSA registers inside of LICM.
      
      Currently, this is actually necessary. The promotion logic in LICM uses
      SSAUpdater which doesn't understand how to place LCSSA PHI nodes.
      Teaching it to do so would be a very significant undertaking. It may be
      worthwhile and I've left a FIXME about this in the code as well as
      starting a thread on llvmdev to try to figure out the right long-term
      solution.
      
      For now, the PR needs to be fixed. Short of using the promotion
      SSAUpdater to place both the LCSSA PHI nodes and the promoted PHI nodes,
      I don't see a cleaner or cheaper way of achieving this. Fortunately,
      LCSSA is relatively lazy and sparse -- it should only update
      instructions which need it. We can also skip the recursive variant when
      we don't promote to SSA values.
      
      llvm-svn: 200612
      1665152c
    • Eli Bendersky's avatar
      Remove some unused #includes · fc49d198
      Eli Bendersky authored
      llvm-svn: 200611
      fc49d198
    • Reid Kleckner's avatar
      Revert "[SLPV] Recognize vectorizable intrinsics during SLP vectorization ..." · a04504fe
      Reid Kleckner authored
      This reverts commit r200576.  It broke 32-bit self-host builds by
      vectorizing two calls to @llvm.bswap.i64, which we then fail to expand.
      
      llvm-svn: 200602
      a04504fe
  7. Jan 31, 2014
    • Chandler Carruth's avatar
      [SLPV] Recognize vectorizable intrinsics during SLP vectorization and · b3da389e
      Chandler Carruth authored
      transform accordingly. Based on similar code from Loop vectorization.
      Subsequent commits will include vectorization of function calls to
      vector intrinsics and form function calls to vector library calls.
      
      Patch by Raul Silvera! (Much delayed due to my not running dcommit)
      
      llvm-svn: 200576
      b3da389e
    • Chandler Carruth's avatar
      [vectorizer] Tweak the way we do small loop runtime unrolling in the · c12224cb
      Chandler Carruth authored
      loop vectorizer to not do so when runtime pointer checks are needed and
      share code with the new (not yet enabled) load/store saturation runtime
      unrolling. Also ensure that we only consider the runtime checks when the
      loop hasn't already been vectorized. If it has, the runtime check cost
      has already been paid.
      
      I've fleshed out a test case to cover the scalar unrolling as well as
      the vector unrolling and comment clearly why we are or aren't following
      the pattern.
      
      llvm-svn: 200530
      c12224cb
    • Bob Wilson's avatar
      Fix a bug in gcov instrumentation introduced by r195513. <rdar://15930350> · 055a0b4c
      Bob Wilson authored
      The entry block of a function starts with all the static allocas. The change
      in r195513 splits the block before those allocas, which has the effect of
      turning them into dynamic allocas. That breaks all sorts of things. Change to
      split after the initial allocas, and also add a comment explaining why the
      block is split.
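
A small sketch of choosing the split point after the leading allocas. It assumes a toy instruction model; `Kind` and `splitPointAfterAllocas` are hypothetical names and not the gcov instrumentation code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical instruction kinds; real IR has many more.
enum class Kind { Alloca, Other };

// Index of the first instruction after the leading run of allocas in the
// entry block. Splitting here keeps every static alloca in the entry block.
std::size_t splitPointAfterAllocas(const std::vector<Kind> &EntryBlock) {
  std::size_t I = 0;
  while (I < EntryBlock.size() && EntryBlock[I] == Kind::Alloca)
    ++I;
  return I;
}
```

Only the initial run counts: an alloca that appears after some other instruction does not move the split point.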
      
      llvm-svn: 200515
      055a0b4c
  8. Jan 29, 2014
    • Chandler Carruth's avatar
      [LPM] Fix PR18643, another scary place where loop transforms failed to · d4be9dc0
      Chandler Carruth authored
      preserve loop simplify of enclosing loops.
      
      The problem here starts with LoopRotation which ends up cloning code out
      of the latch into the new preheader it is building. This can create
      a new edge from the preheader into the exit block of the loop which
      breaks LoopSimplify form. The code tries to fix this by splitting the
      critical edge between the latch and the exit block to get a new exit
      block that only the latch dominates. This sadly isn't sufficient.
      
      The exit block may be an exit block for multiple nested loops. When we
      clone an edge from the latch of the inner loop to the new preheader
      being built in the outer loop, we create an exiting edge from the outer
      loop to this exit block. Despite breaking the LoopSimplify form for the
      inner loop, this is fine for the outer loop. However, when we split the
      edge from the inner loop to the exit block, we create a new block which
      is in neither the inner nor outer loop as the new exit block. This is
      a predecessor to the old exit block, and so the split itself takes the
      outer loop out of LoopSimplify form. We need to split every edge
      entering the exit block from inside a loop nested more deeply than the
      exit block in order to preserve all of the loop simplify constraints.
      
      Once we try to do that, a problem with splitting critical edges
      surfaces. Previously, we used a very brute-force approach to update LoopSimplify
      form by re-computing it for all exit blocks. We don't need to do this,
      and doing this much will sometimes but not always overlap with the
      LoopRotate bug fix. Instead, the code needs to specifically handle the
      cases which can start to violate LoopSimplify -- they aren't that
      common. We need to see if the destination of the split edge was a loop
      exit block in simplified form for the loop of the source of the edge.
      For this to be true, all the predecessors need to be in the exact same
      loop as the source of the edge being split. If the dest block was
      originally in this form, we have to split all of the edges back into
      this loop to recover it. The old mechanism of doing this was
      conservatively correct because at least *one* of the exiting blocks it
      rewrote was the DestBB and so the DestBB's predecessors were fixed. But
      this is a much more targeted way of doing it. Making it targeted is
      important, because ballooning the set of edges touched prevents
      LoopRotate from being able to split edges *it* needs to split to
      preserve loop simplify in a coherent way -- the critical edge splitting
      would sometimes find some edges in need of splitting but not
      others.
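
The predecessor condition described above can be sketched on a toy CFG. This assumes each block is tagged with the id of its innermost loop; `CFG` and `allPredsInLoopOf` are hypothetical names, and the real code works on LoopInfo rather than on maps.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical CFG model: each block maps to the id of its innermost loop
// (0 means "not in any loop") and to its list of predecessors.
struct CFG {
  std::map<std::string, int> LoopOf;
  std::map<std::string, std::vector<std::string>> Preds;
};

// True if every predecessor of Dest lies in exactly the same loop as Src.
// This is the condition under which Dest counted as an exit block in
// simplified form for Src's loop before the edge was split.
bool allPredsInLoopOf(const CFG &G, const std::string &Src,
                      const std::string &Dest) {
  int L = G.LoopOf.at(Src);
  for (const std::string &P : G.Preds.at(Dest))
    if (G.LoopOf.at(P) != L)
      return false;
  return true;
}
```

When the check holds, the edges into the destination from that loop are the ones that need splitting to recover the form; when it fails, the block was not in simplified form for that loop to begin with.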
      
      Many, *many* thanks for help from Nick reducing these test cases
      mightily. And helping lots with the analysis here as this one was quite
      tricky to track down.
      
      llvm-svn: 200393
      d4be9dc0
    • Chandler Carruth's avatar
      [LPM] Fix PR18642, a pretty nasty bug in IndVars that "never mattered" · 66f0b163
      Chandler Carruth authored
      because of the inside-out run of LoopSimplify in the LoopPassManager and
      the fact that LoopSimplify couldn't be "preserved" across two
      independent LoopPassManagers.
      
      Anyways, in that case, IndVars wasn't correctly preserving an LCSSA PHI
      node because it thought it was rewriting (via SCEV) the incoming value
      to a loop invariant value. While it may well be invariant for the
      current loop, it may be rewritten in terms of an enclosing loop's
      values. This in and of itself is fine, as the LCSSA PHI node in the
      enclosing loop for the inner loop value we're rewriting will have its
      own LCSSA PHI node if used outside of the enclosing loop. With me so
      far?
      
      Well, the current loop and the enclosing loop may share an exiting
      block and exit block, and when they do they also share LCSSA PHI nodes.
      In this case, it's not valid to RAUW through the LCSSA PHI node.
      
      Expected crazy test included.
      
      llvm-svn: 200372
      66f0b163
    • Arnold Schwaighofer's avatar
      LoopVectorizer: Don't count the induction variable multiple times · 1aab75ab
      Arnold Schwaighofer authored
      When estimating register pressure, don't count the induction variable multiple
      times. It is unlikely to be unrolled. This is currently disabled and hidden
      behind a flag ("enable-ind-var-reg-heur").
      
      llvm-svn: 200371
      1aab75ab
  9. Jan 28, 2014
    • Rafael Espindola's avatar
      Fix pr14893. · ab73c493
      Rafael Espindola authored
      When simplifycfg moves an instruction, it must drop metadata it doesn't know
      is still valid after the preconditions change. In particular, it must drop
      the range and tbaa metadata.
      
      The patch implements this with a utility function to drop all metadata not
      in a white list.
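
A minimal sketch of the whitelist idea, assuming metadata is modeled as a map from kind name to payload. `Metadata` and `dropUnknownMetadata` are hypothetical names, not the utility added by the patch, and which kinds belong in the list is left to the caller.

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical model: metadata attached to an instruction, keyed by kind name.
using Metadata = std::map<std::string, std::string>;

// Erase every metadata entry whose kind is not in the whitelist.
void dropUnknownMetadata(Metadata &MD, const std::set<std::string> &KnownKinds) {
  for (auto It = MD.begin(); It != MD.end();) {
    if (KnownKinds.count(It->first))
      ++It;               // kind is whitelisted, keep it
    else
      It = MD.erase(It);  // erase returns the next valid iterator
  }
}
```

A whitelist fails safe: a metadata kind introduced later is dropped by default until someone decides it stays valid when the instruction moves.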
      
      llvm-svn: 200322
      ab73c493
    • Chandler Carruth's avatar
      [vectorizer] Completely disable the block frequency guidance of the loop · b7836285
      Chandler Carruth authored
      vectorizer, placing it behind an off-by-default flag.
      
      It turns out that block frequency isn't what we want at all, here or
      elsewhere. This has been I think a nagging feeling for several of us
      working with it, but Arnold has given some really nice simple examples
      where the results are so comprehensively wrong that they aren't useful.
      
      I'm planning to email the dev list with a summary of why it's not really
      useful and a couple of ideas about how to better structure these types
      of heuristics.
      
      llvm-svn: 200294
      b7836285
    • Reid Kleckner's avatar
      Update optimization passes to handle inalloca arguments · 26af2cae
      Reid Kleckner authored
      Summary:
      I searched Transforms/ and Analysis/ for 'ByVal' and updated those call
      sites to check for inalloca if appropriate.
      
      I added tests for any change that would allow an optimization to fire on
      inalloca.
      
      Reviewers: nlewycky
      
      Differential Revision: http://llvm-reviews.chandlerc.com/D2449
      
      llvm-svn: 200281
      26af2cae
    • Chandler Carruth's avatar
      [LPM] Fix PR18616 where the shifts to the loop pass manager to extract · d84f776e
      Chandler Carruth authored
      LCSSA from it caused a crasher with the LoopUnroll pass.
      
      This crasher is really nasty. We destroy LCSSA form in a surprising way.
      When unrolling a loop into an outer loop, we not only need to restore
      LCSSA form for the outer loop, but for all children of the outer loop.
      This is somewhat obvious in retrospect, but hey!
      
      While this seems pretty heavy-handed, it's not that bad. Fundamentally,
      we only do this when we unroll a loop, which is already a heavyweight
      operation. We're unrolling all of these hypothetical inner loops as
      well, so their size and complexity is already on the critical path. This
      is just adding another pass over them to re-canonicalize.
      
      I have a test case from PR18616 that is great for reproducing this, but
      pretty useless to check in as it relies on many 10s of nested empty
      loops that get unrolled and deleted in just the right order. =/ What's
      worse is that investigating this has exposed another source of failure
      that is likely to be even harder to test. I'll try to come up with test
      cases for these fixes, but I want to get the fixes into the tree first
      as they're causing crashes in the wild.
      
      llvm-svn: 200273
      d84f776e
    • Arnold Schwaighofer's avatar
      LoopVectorize: Support conditional stores by scalarizing · 18865db3
      Arnold Schwaighofer authored
      The vectorizer takes a loop like this and widens all instructions except for the
      store. The stores are scalarized/unrolled and hidden behind an "if" block.
      
        for (i = 0; i < 128; ++i) {
          if (a[i] < 10)
            a[i] += val;
        }
      
        for (i = 0; i < 128; i+=2) {
          v = a[i:i+1];
          v0 = (extract v, 0) + val;
          v1 = (extract v, 1) + val;
          if ((extract v, 0) < 10)
            a[i] = v0;
          if ((extract v, 1) < 10)
            a[i+1] = v1;
        }
      
      The vectorizer relies on subsequent optimizations to sink instructions into the
      conditional block where they are anticipated.
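
As a sanity check on the transformation above, here is a hand-written C++ model of both loops with a vector width of 2. It is a sketch, not vectorizer output; the function names are made up for illustration and the "vector" operations are spelled out lane by lane.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 128;

// Original scalar loop from the commit message.
void scalarLoop(std::array<int, N> &A, int Val) {
  for (std::size_t I = 0; I < N; ++I)
    if (A[I] < 10)
      A[I] += Val;
}

// Model of the transformed loop: the load, compare, and add are done for
// both lanes up front ("widened"), while each store stays scalar and is
// guarded by its own lane's condition.
void predicatedStoreLoop(std::array<int, N> &A, int Val) {
  for (std::size_t I = 0; I < N; I += 2) {
    int V[2] = {A[I], A[I + 1]};           // v = a[i:i+1]
    bool Mask[2] = {V[0] < 10, V[1] < 10}; // widened compare
    int Sum[2] = {V[0] + Val, V[1] + Val}; // widened add, computed for both lanes
    if (Mask[0])
      A[I] = Sum[0];
    if (Mask[1])
      A[I + 1] = Sum[1];
  }
}
```

Both functions should leave the array in the same state for inputs where the adds do not overflow; the second one computes the add even for lanes whose store is skipped, which is the work the later sinking is expected to clean up.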
      
      The flag "vectorize-num-stores-pred" controls whether and how many stores to
      handle this way. Vectorization of conditional stores is disabled by default for
      now.
      
      This patch also adds a change to the heuristic when the flag
      "enable-loadstore-runtime-unroll" is enabled (off by default). It unrolls small
      loops until load/store ports are saturated. This heuristic uses TTI's
      getMaxUnrollFactor as a measure for load/store ports.
      
      I also added a second flag -enable-cond-stores-vec. It will enable vectorization
      of conditional stores. But there is no cost model for vectorization of
      conditional stores in place yet so this will not do much good at the moment.
      
      rdar://15892953
      
      Results for x86-64 -O3 -mavx +/- -mllvm -enable-loadstore-runtime-unroll
      -vectorize-num-stores-pred=1 (before the BFI change):
      
       Performance Regressions:
         Benchmarks/Ptrdist/yacr2/yacr2 7.35% (maze3() is identical but 10% slower)
         Applications/siod/siod         2.18%
       Performance Improvements:
         mesa                          -4.42%
         libquantum                    -4.15%
      
       With a patch that slightly changes the register heuristics (by subtracting the
       induction variable on both sides of the register pressure equation, as the
       induction variable is probably not really unrolled):
      
       Performance Regressions:
         Benchmarks/Ptrdist/yacr2/yacr2  7.73%
         Applications/siod/siod          1.97%
      
       Performance Improvements:
         libquantum                    -13.05% (we now also unroll quantum_toffoli)
         mesa                           -4.27%
      
      llvm-svn: 200270
      18865db3
    • Manman Ren's avatar
      PGO branch weight: keep halving the weights until they can fit into · f1cb16e4
      Manman Ren authored
      uint32.
      
      When folding branches to common destination, the updated branch weights
      can exceed uint32 by more than a factor of 2. We should keep halving the
      weights until they can fit into uint32.
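
A minimal sketch of the halving loop, assuming the weights are held as 64-bit values while being combined. `fitWeights` is a name chosen for illustration; the sketch is not copied from the patch.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Halve all weights together until each one fits in 32 bits.
// Halving every weight by the same factor roughly preserves their ratios.
void fitWeights(std::vector<std::uint64_t> &Weights) {
  const std::uint64_t Max = std::numeric_limits<std::uint32_t>::max();
  auto TooBig = [&] {
    for (std::uint64_t W : Weights)
      if (W > Max)
        return true;
    return false;
  };
  // A single halving is not enough when a weight exceeds the limit by
  // more than a factor of 2, so repeat until everything fits.
  while (TooBig())
    for (std::uint64_t &W : Weights)
      W /= 2;
}
```

For example, weights of 2^34 and 8 need three rounds of halving and end up as 2^31 and 1.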
      
      llvm-svn: 200262
      f1cb16e4
  10. Jan 27, 2014
    • Chandler Carruth's avatar
      [vectorize] Initial version of respecting PGO in the vectorizer: treat · e24f3973
      Chandler Carruth authored
      cold loops as-if they were being optimized for size.
      
      Nothing fancy here. Simple test case included. The nice thing is that we
      can now incrementally build on top of this to drive other heuristics.
      All of the infrastructure work is done to get the profile information
      into this layer.
      
      The remaining work necessary to make this a fully general purpose loop
      unroller for very hot loops is to make it a fully general purpose loop
      unroller. Things I know of but am not going to have time to benchmark
      and fix in the immediate future:
      
      1) Don't disable the entire pass when the target is lacking vector
         registers. This really doesn't make any sense any more.
      2) Teach the unroller at least and the vectorizer potentially to handle
         non-if-converted loops. This is trivial for the unroller but hard for
         the vectorizer.
      3) Compute the relative hotness of the loop and thread that down to the
         various places that make cost tradeoffs (very likely only the
         unroller makes sense here, and then only when dealing with loops that
         are small enough for unrolling to not completely blow out the LSD).
      
      I'm still dubious how useful hotness information will be. So far, my
      experiments show that if we can get the correct logic for determining
      when unrolling actually helps performance, the code size impact is
      completely unimportant and we can unroll in all cases. But at least
      we'll no longer burn code size on cold code.
      
      One somewhat unrelated idea that I've had forever but not had time to
      implement: mark all functions which are only reachable via the global
      constructors rigging in the module as optsize. This would also decrease
      the impact of any more aggressive heuristics here on code size.
      
      llvm-svn: 200219
      e24f3973
    • Benjamin Kramer's avatar
      ConstantHoisting: We can't insert instructions directly in front of a PHI node. · 9e709bce
      Benjamin Kramer authored
      Insert before the terminating instruction of the dominating block instead.
      
      llvm-svn: 200218
      9e709bce
    • Chandler Carruth's avatar
      [vectorizer] Add an override for the target instruction cost and use it · edfa37ef
      Chandler Carruth authored
      to stabilize a test that really is trying to test generic behavior and
      not a specific target's behavior.
      
      llvm-svn: 200215
      edfa37ef
    • Chandler Carruth's avatar
      [vectorizer] Simplify code to use existing helpers on the Function · 2bb03ba6
      Chandler Carruth authored
      object and fewer pointless variables.
      
      Also, add a clarifying comment and a FIXME because the code which
      disables *all* vectorization if we can't use implicit floating point
      instructions just makes no sense at all.
      
      llvm-svn: 200214
      2bb03ba6
    • Chandler Carruth's avatar
      [vectorizer] Teach the loop vectorizer's unroller to only unroll by · 147c2327
      Chandler Carruth authored
      powers of two. This is essentially always the correct thing given the
      impact on alignment, scaling factors that can be used in addressing
      modes, etc. Also, fix the management of the unroll vs. small loop cost
      to more accurately model things with this world.
      
      Enhance a test case to actually exercise more of the unroll machinery if
      using synthetic constants rather than a specific target model. Before
      this change, with the added flags this test will unroll 3 times instead
      of either 2 or 4 (the two sensible answers).
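
One way to restrict an unroll factor to powers of two is to round it down, sketched below. Rounding down rather than up is an assumption made for this example, and `powerOfTwoFloor` is a local helper written here, not a reference to a specific LLVM function.

```cpp
#include <cassert>

// Largest power of two that is <= N (N must be at least 1).
unsigned powerOfTwoFloor(unsigned N) {
  assert(N >= 1 && "unroll factor must be positive");
  unsigned P = 1;
  // Compare against N / 2 so that P * 2 can never overflow.
  while (P <= N / 2)
    P *= 2;
  return P;
}
```

Under this choice a computed factor of 3 becomes 2, and factors that are already powers of two are left unchanged.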
      
      While I don't expect this to make a huge difference, if there are lots
      of loops sitting right on the edge of hitting the 'small unroll' factor,
      they might change behavior. However, I've benchmarked moving the small
      loop cost up and down in many various ways and by a huge factor (2x)
      without seeing more than 0.2% code size growth. Small adjustments such
      as the series that led up here have led to about 1% improvement on some
      benchmarks, but it is very close to the noise floor so I mostly checked
      that nothing regressed. Let me know if you see bad behavior on other
      targets but I don't expect this to be a sufficiently dramatic change to
      trigger anything.
      
      llvm-svn: 200213
      147c2327
    • Chandler Carruth's avatar
      [vectorizer] Add some flags which are useful for conducting experiments · 7f90b453
      Chandler Carruth authored
      with the unrolling behavior in the loop vectorizer. No functionality
      changed at this point.
      
      These are a bit hack-y, but talking with Hal, there doesn't seem to be
      a cleaner way to easily experiment with different thresholds here and he
      was also interested in them so I wanted to commit them. Suggestions for
      improvement are very welcome here.
      
      llvm-svn: 200212
      7f90b453
    • Chandler Carruth's avatar
      [vectorizer] Fix a trivial oversight where we always requested the · 328998b2
      Chandler Carruth authored
      number of vector registers rather than toggling between vector and
      scalar register number based on VF. I don't have a test case as
      I spotted this by inspection and on X86 it only makes a difference if
      your target is lacking SSE and thus has *no* vector registers.
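
The fix amounts to picking the register file based on VF. A sketch under a toy target description follows; `TargetRegs` and `registersForVF` are hypothetical, and the register counts in any usage are illustrative rather than real target data.

```cpp
// Hypothetical target description; field names are illustrative only.
struct TargetRegs {
  unsigned NumScalarRegs;
  unsigned NumVectorRegs;
};

// Use the scalar register count when the loop is not vectorized (VF == 1),
// and the vector register count otherwise.
unsigned registersForVF(const TargetRegs &T, unsigned VF) {
  return VF > 1 ? T.NumVectorRegs : T.NumScalarRegs;
}
```

On a target with no vector registers, always asking for the vector count would report zero registers even for a scalar loop, which is the case the commit message calls out.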
      
      If someone wants to add a test case for this for ARM or somewhere else
      where this is more significant, that would be awesome.
      
      Also made the variable name a bit more sensible while I'm here.
      
      llvm-svn: 200211
      328998b2
    • Chandler Carruth's avatar
      [vectorizer] Clean up the handling of unvectorized loop unrolling in the · 56612b20
      Chandler Carruth authored
      LoopVectorize pass.
      
      The logic here doesn't make much sense. We *only* unrolled if the
      unvectorized loop was a reduction loop with a single basic block *and*
      small loop body. The reduction part in particular doesn't make much
      sense. Instead, if we just fall through to the vectorized unroll logic,
      we get the more sensible behavior of unrolling if there is a vectorized reduction that
      could be hacked on by the SLP vectorizer *or* if the loop is small.
      
      This is mostly a cleanup and nothing in the test suite really exercises
      this, but I did run benchmarks across this change and saw no really
      significant changes.
      
      llvm-svn: 200198
      56612b20
  11. Jan 25, 2014
    • Chandler Carruth's avatar
      [LPM] Conclude my immediate work by making the LoopVectorizer · 3aebcb99
      Chandler Carruth authored
      a FunctionPass. With this change the loop vectorizer no longer is a loop
      pass and can readily depend on function analyses. In particular, with
      this change we no longer have to form a loop pass manager to run the
      loop vectorizer which simplifies the entire pass management of LLVM.
      
      The next step here is to teach the loop vectorizer to leverage profile
      information through the profile information providing analysis passes.
      
      llvm-svn: 200074
      3aebcb99
    • Chandler Carruth's avatar
      [LPM] Make LCSSA a utility with a FunctionPass that applies it to all · 8765cf70
      Chandler Carruth authored
      the loops in a function, and teach LICM to work in the presence of
      LCSSA.
      
      Previously, LCSSA was a loop pass. That made passes requiring it also be
      loop passes and unable to depend on function analysis passes easily. It
      also caused outer loops to have a different "canonical" form from inner
      loops during analysis. Instead, we go into LCSSA form and preserve it
      through the loop pass manager run.
      
      Note that this has the same problem as LoopSimplify that prevents
      enabling its verification -- loop passes which run at the end of the loop
      pass manager and don't preserve these are valid, but the subsequent loop
      pass runs of outer loops that do preserve this pass trigger too much
      verification and fail because the inner loop no longer verifies.
      
      The other problem this exposed is that LICM was completely unable to
      handle LCSSA form. It didn't preserve it and it actually would give up
      on moving instructions in many cases when they were used by an LCSSA phi
      node. I've taught LICM to support detecting LCSSA-form PHI nodes and to
      hoist and sink around them. This may actually let LICM fire
      significantly more because we put everything into LCSSA form to rotate
      the loop before running LICM. =/ Now LICM should handle that fine and
      preserve it correctly. The down side is that LICM has to require LCSSA
      in order to preserve it. This is just a fact of life for LCSSA. It's
      entirely possible we should completely remove LCSSA from the optimizer.
      
      The test updates are essentially accommodating LCSSA phi nodes in the
      output of LICM, and the fact that we now completely sink every
      instruction in ashr-crash below the loop bodies prior to unrolling.
      
      With this change, LCSSA is computed only three times in the pass
      pipeline. One of them could be removed (and potentially a SCEV run and
      a separate LoopPassManager entirely!) if we had a LoopPass variant of
      InstCombine that ran InstCombine on the loop body but refused to combine
      away LCSSA PHI nodes. Currently, this also prevents loop unrolling from
      being in the same loop pass manager as rotate, LICM, and unswitch.
      
      There is one thing that I *really* don't like -- preserving LCSSA in
      LICM is quite expensive. We end up having to re-run LCSSA twice for some
      loops after LICM runs because LICM can undo LCSSA both in the current
      loop and the parent loop. I don't really see good solutions to this
      other than to completely move away from LCSSA and using tools like
      SSAUpdater instead.
      
      llvm-svn: 200067
      8765cf70
    • Juergen Ributzka's avatar
      Revert "Revert "Add Constant Hoisting Pass" (r200034)" · f26beda7
      Juergen Ributzka authored
      This reverts commit r200058 and adds the using directive for
      ARMTargetTransformInfo to silence two g++ overload warnings.
      
      llvm-svn: 200062
      f26beda7
    • Hans Wennborg's avatar
      Revert "Add Constant Hoisting Pass" (r200034) · 4d67a2e8
      Hans Wennborg authored
      This commit caused -Woverloaded-virtual warnings. The two new
      TargetTransformInfo::getIntImmCost functions were only added to the superclass,
      and to the X86 subclass. The other targets were not updated, and the
      warning highlighted this by pointing out that e.g. ARMTTI::getIntImmCost was
      hiding the two new getIntImmCost variants.
      
      We could pacify the warning by adding "using TargetTransformInfo::getIntImmCost"
      to the various subclasses, or turning it off, but I suspect that it's wrong to
      leave the functions unimplemented in those targets. The default implementations
      return TCC_Free, which I don't think is right e.g. for ARM.
      
      llvm-svn: 200058
      4d67a2e8
  12. Jan 24, 2014
    • Juergen Ributzka's avatar
      Add Constant Hoisting Pass · 4f3df4ad
      Juergen Ributzka authored
      Retry commit r200022 with a fix for the build bot errors. Constant expressions
      have (unlike instructions) module scope use lists and therefore may have users
      in different functions. The fix is to simply ignore these out-of-function uses.
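
A sketch of ignoring out-of-function uses, assuming each use records the function that contains it. `Use` and `usesInFunction` are hypothetical names for illustration, not the pass's data structures.

```cpp
#include <string>
#include <vector>

// Hypothetical model of a use: the using instruction and its enclosing function.
struct Use {
  std::string Inst;
  std::string Func;
};

// Keep only the uses located in the function currently being processed;
// a constant expression's use list may also contain users from other functions.
std::vector<Use> usesInFunction(const std::vector<Use> &All,
                                const std::string &Func) {
  std::vector<Use> Result;
  for (const Use &U : All)
    if (U.Func == Func)
      Result.push_back(U);
  return Result;
}
```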
      
      llvm-svn: 200034
      4f3df4ad