Skip to content
  1. Apr 02, 2012
    • Stepan Dyatkovskiy's avatar
      Fast fix for PR12343: · f62ffeca
      Stepan Dyatkovskiy authored
      http://llvm.org/bugs/show_bug.cgi?id=12343
      
      We have not trivial way for splitting edges that are goes from indirect branch. We can do it with some tricks, but it should be additionally discussed. And it is still dangerous due to difficulty of indirect branches controlling.
      
      Fix forbids this case for unswitching.
      
      llvm-svn: 153879
      f62ffeca
  2. Apr 01, 2012
  3. Mar 31, 2012
    • Chandler Carruth's avatar
      Give the always-inliner its own custom filter. It shouldn't have to pay · a88a0faa
      Chandler Carruth authored
      the very high overhead of the complex inline cost analysis when all it
      wants to do is detect three patterns which must not be inlined. Comment
      the code, clean it up, and leave some hints about possible performance
      improvements if this ever shows up on a profile.
      
      Moving this off of the (now more expensive) inline cost analysis is
      particularly important because we have to run this inliner even at -O0.
      
      llvm-svn: 153814
      a88a0faa
    • Chandler Carruth's avatar
      Remove a bunch of empty, dead, and no-op methods from all of these · edd2826f
      Chandler Carruth authored
      interfaces. These methods were used in the old inline cost system where
      there was a persistent cache that had to be updated, invalidated, and
      cleared. We're now doing more direct computations that don't require
      this intricate dance. Even if we resume some level of caching, it would
      almost certainly have a simpler and more narrow interface than this.
      
      llvm-svn: 153813
      edd2826f
    • Chandler Carruth's avatar
      Initial commit for the rewrite of the inline cost analysis to operate · 0539c071
      Chandler Carruth authored
      on a per-callsite walk of the called function's instructions, in
      breadth-first order over the potentially reachable set of basic blocks.
      
      This is a major shift in how inline cost analysis works to improve the
      accuracy and rationality of inlining decisions. A brief outline of the
      algorithm this moves to:
      
      - Build a simplification mapping based on the callsite arguments to the
        function arguments.
      - Push the entry block onto a worklist of potentially-live basic blocks.
      - Pop the first block off of the *front* of the worklist (for
        breadth-first ordering) and walk its instructions using a custom
        InstVisitor.
      - For each instruction's operands, re-map them based on the
        simplification mappings available for the given callsite.
      - Compute any simplification possible of the instruction after
        re-mapping, and store that back int othe simplification mapping.
      - Compute any bonuses, costs, or other impacts of the instruction on the
        cost metric.
      - When the terminator is reached, replace any conditional value in the
        terminator with any simplifications from the mapping we have, and add
        any successors which are not proven to be dead from these
        simplifications to the worklist.
      - Pop the next block off of the front of the worklist, and repeat.
      - As soon as the cost of inlining exceeds the threshold for the
        callsite, stop analyzing the function in order to bound cost.
      
      The primary goal of this algorithm is to perfectly handle dead code
      paths. We do not want any code in trivially dead code paths to impact
      inlining decisions. The previous metric was *extremely* flawed here, and
      would always subtract the average cost of two successors of
      a conditional branch when it was proven to become an unconditional
      branch at the callsite. There was no handling of wildly different costs
      between the two successors, which would cause inlining when the path
      actually taken was too large, and no inlining when the path actually
      taken was trivially simple. There was also no handling of the code
      *path*, only the immediate successors. These problems vanish completely
      now. See the added regression tests for the shiny new features -- we
      skip recursive function calls, SROA-killing instructions, and high cost
      complex CFG structures when dead at the callsite being analyzed.
      
      Switching to this algorithm required refactoring the inline cost
      interface to accept the actual threshold rather than simply returning
      a single cost. The resulting interface is pretty bad, and I'm planning
      to do lots of interface cleanup after this patch.
      
      Several other refactorings fell out of this, but I've tried to minimize
      them for this patch. =/ There is still more cleanup that can be done
      here. Please point out anything that you see in review.
      
      I've worked really hard to try to mirror at least the spirit of all of
      the previous heuristics in the new model. It's not clear that they are
      all correct any more, but I wanted to minimize the change in this single
      patch, it's already a bit ridiculous. One heuristic that is *not* yet
      mirrored is to allow inlining of functions with a dynamic alloca *if*
      the caller has a dynamic alloca. I will add this back, but I think the
      most reasonable way requires changes to the inliner itself rather than
      just the cost metric, and so I've deferred this for a subsequent patch.
      The test case is XFAIL-ed until then.
      
      As mentioned in the review mail, this seems to make Clang run about 1%
      to 2% faster in -O0, but makes its binary size grow by just under 4%.
      I've looked into the 4% growth, and it can be fixed, but requires
      changes to other parts of the inliner.
      
      llvm-svn: 153812
      0539c071
    • Benjamin Kramer's avatar
      Internalize: Remove reference of @llvm.noinline, it was replaced with the... · 53dc8733
      Benjamin Kramer authored
      Internalize: Remove reference of @llvm.noinline, it was replaced with the noinline attribute a long time ago.
      
      llvm-svn: 153806
      53dc8733
    • Hal Finkel's avatar
      Correctly vectorize powi. · 5cad8742
      Hal Finkel authored
      The powi intrinsic requires special handling because it always takes a single
      integer power regardless of the result type. As a result, we can vectorize
      only if the powers are equal. Fixes PR12364.
      
      llvm-svn: 153797
      5cad8742
  4. Mar 29, 2012
    • Jakob Stoklund Olesen's avatar
      Don't PRE compares. · 4e55044f
      Jakob Stoklund Olesen authored
      CodeGenPrepare sinks compare instructions down to their uses to prevent
      live flags and predicate registers across basic blocks.
      
      PRE of a compare instruction prevents that, forcing the i1 compare
      result into a general purpose register.  That is usually more expensive
      than the redundant compare PRE was trying to eliminate in the first
      place.
      
      llvm-svn: 153657
      4e55044f
  5. Mar 28, 2012
  6. Mar 27, 2012
    • Chandler Carruth's avatar
      Make a seemingly tiny change to the inliner and fix the generated code · b9e35fbc
      Chandler Carruth authored
      size bloat. Unfortunately, I expect this to disable the majority of the
      benefit from r152737. I'm hopeful at least that it will fix PR12345. To
      explain this requires... quite a bit of backstory I'm afraid.
      
      TL;DR: The change in r152737 actually did The Wrong Thing for
      linkonce-odr functions. This change makes it do the right thing. The
      benefits we saw were simple luck, not any actual strategy. Benchmark
      numbers after a mini-blog-post so that I've written down my thoughts on
      why all of this works and doesn't work...
      
      To understand what's going on here, you have to understand how the
      "bottom-up" inliner actually works. There are two fundamental modes to
      the inliner:
      
      1) Standard fixed-cost bottom-up inlining. This is the mode we usually
         think about. It walks from the bottom of the CFG up to the top,
         looking at callsites, taking information about the callsite and the
         called function and computing th expected cost of inlining into that
         callsite. If the cost is under a fixed threshold, it inlines. It's
         a touch more complicated than that due to all the bonuses, weights,
         etc. Inlining the last callsite to an internal function gets higher
         weighth, etc. But essentially, this is the mode of operation.
      
      2) Deferred bottom-up inlining (a term I just made up). This is the
         interesting mode for this patch an r152737. Initially, this works
         just like mode #1, but once we have the cost of inlining into the
         callsite, we don't just compare it with a fixed threshold. First, we
         check something else. Let's give some names to the entities at this
         point, or we'll end up hopelessly confused. We're considering
         inlining a function 'A' into its callsite within a function 'B'. We
         want to check whether 'B' has any callers, and whether it might be
         inlined into those callers. If so, we also check whether inlining 'A'
         into 'B' would block any of the opportunities for inlining 'B' into
         its callers. We take the sum of the costs of inlining 'B' into its
         callers where that inlining would be blocked by inlining 'A' into
         'B', and if that cost is less than the cost of inlining 'A' into 'B',
         then we skip inlining 'A' into 'B'.
      
      Now, in order for #2 to make sense, we have to have some confidence that
      we will actually have the opportunity to inline 'B' into its callers
      when cheaper, *and* that we'll be able to revisit the decision and
      inline 'A' into 'B' if that ever becomes the correct tradeoff. This
      often isn't true for external functions -- we can see very few of their
      callers, and we won't be able to re-consider inlining 'A' into 'B' if
      'B' is external when we finally see more callers of 'B'. There are two
      cases where we believe this to be true for C/C++ code: functions local
      to a translation unit, and functions with an inline definition in every
      translation unit which uses them. These are represented as internal
      linkage and linkonce-odr (resp.) in LLVM. I enabled this logic for
      linkonce-odr in r152737.
      
      Unfortunately, when I did that, I also introduced a subtle bug. There
      was an implicit assumption that the last caller of the function within
      the TU was the last caller of the function in the program. We want to
      bonus the last caller of the function in the program by a huge amount
      for inlining because inlining that callsite has very little cost.
      Unfortunately, the last caller in the TU of a linkonce-odr function is
      *not* the last caller in the program, and so we don't want to apply this
      bonus. If we do, we can apply it to one callsite *per-TU*. Because of
      the way deferred inlining works, when it sees this bonus applied to one
      callsite in the TU for 'B', it decides that inlining 'B' is of the
      *utmost* importance just so we can get that final bonus. It then
      proceeds to essentially force deferred inlining regardless of the actual
      cost tradeoff.
      
      The result? PR12345: code bloat, code bloat, code bloat. Another result
      is getting *damn* lucky on a few benchmarks, and the over-inlining
      exposing critically important optimizations. I would very much like
      a list of benchmarks that regress after this change goes in, with
      bitcode before and after. This will help me greatly understand what
      opportunities the current cost analysis is missing.
      
      Initial benchmark numbers look very good. WebKit files that exhibited
      the worst of PR12345 went from growing to shrinking compared to Clang
      with r152737 reverted.
      
      - Bootstrapped Clang is 3% smaller with this change.
      - Bootstrapped Clang -O0 over a single-source-file of lib/Lex is 4%
        faster with this change.
      
      Please let me know about any other performance impact you see. Thanks to
      Nico for reporting and urging me to actually fix, Richard Smith, Duncan
      Sands, Manuel Klimek, and Benjamin Kramer for talking through the issues
      today.
      
      llvm-svn: 153506
      b9e35fbc
  7. Mar 26, 2012
  8. Mar 25, 2012
    • Chandler Carruth's avatar
      Teach the function cloner (and thus the inliner) to simplify PHINodes · ef82cf5b
      Chandler Carruth authored
      aggressively. There are lots of dire warnings about this being expensive
      that seem to predate switching to the TrackingVH-based value remapper
      that is automatically updated on RAUW. This makes it easy to not just
      prune single-entry PHIs, but to fully simplify PHIs, and to recursively
      simplify the newly inlined code to propagate PHINode simplifications.
      
      This introduces a bit of a thorny problem though. We may end up
      simplifying a branch condition to a constant when we fold PHINodes, and
      we would like to nuke any dead blocks resulting from this so that time
      isn't wasted continually analyzing them, but this isn't easy. Deleting
      basic blocks *after* they are fully cloned and mapped into the new
      function currently requires manually updating the value map. The last
      piece of the simplification-during-inlining puzzle will require either
      switching to WeakVH mappings or some other piece of refactoring. I've
      left a FIXME in the testcase about this.
      
      llvm-svn: 153410
      ef82cf5b
    • Chandler Carruth's avatar
      Move the instruction simplification of callsite arguments in the inliner · 21211992
      Chandler Carruth authored
      to instead rely on much more generic and powerful instruction
      simplification in the function cloner (and thus inliner).
      
      This teaches the pruning function cloner to use instsimplify rather than
      just the constant folder to fold values during cloning. This can
      simplify a large number of things that constant folding alone cannot
      begin to touch. For example, it will realize that 'or' and 'and'
      instructions with certain constant operands actually become constants
      regardless of what their other operand is. It also can thread back
      through the caller to perform simplifications that are only possible by
      looking up a few levels. In particular, GEPs and pointer testing tend to
      fold much more heavily with this change.
      
      This should (in some cases) have a positive impact on compile times with
      optimizations on because the inliner itself will simply avoid cloning
      a great deal of code. It already attempted to prune proven-dead code,
      but now it will be use the stronger simplifications to prove more code
      dead.
      
      llvm-svn: 153403
      21211992
    • Chandler Carruth's avatar
      Add an asserting ValueHandle to the block simplification code which will · 0c72e3f4
      Chandler Carruth authored
      fire if anything ever invalidates the assumption of a terminator
      instruction being unchanged throughout the routine.
      
      I've convinced myself that the current definition of simplification
      precludes such a transformation, so I think getting some asserts
      coverage that we don't violate this agreement is sufficient to make this
      code safe for the foreseeable future.
      
      Comments to the contrary or other suggestions are of course welcome. =]
      The bots are now happy with this code though, so it appears the bug here
      has indeed been fixed.
      
      llvm-svn: 153401
      0c72e3f4
    • Chandler Carruth's avatar
      Don't form a WeakVH around the sentinel node in the instructions BB · 17fc6ef2
      Chandler Carruth authored
      list. This is a bad idea. ;] I'm hopeful this is the bug that's showing
      up with the MSVC bots, but we'll see.
      
      It is definitely unnecessary. InstSimplify won't do anything to
      a terminator instruction, we don't need to even include it in the
      iteration range. We can also skip the now dead terminator check,
      although I've made it an assert to help document that this is an
      important invariant.
      
      I'm still a bit queasy about this because there is an implicit
      assumption that the terminator instruction cannot be RAUW'ed by the
      simplification code. While that appears to be true at the moment, I see
      no guarantee that would ensure it remains true in the future. I'm
      looking at the cleanest way to solve that...
      
      llvm-svn: 153399
      17fc6ef2
  9. Mar 24, 2012
    • Chandler Carruth's avatar
      Refactor the interface to recursively simplifying instructions to be tad · cf1b585f
      Chandler Carruth authored
      bit simpler by handling a common case explicitly.
      
      Also, refactor the implementation to use a worklist based walk of the
      recursive users, rather than trying to use value handles to detect and
      recover from RAUWs during the recursive descent. This fixes a very
      subtle bug in the previous implementation where degenerate control flow
      structures could cause mutually recursive instructions (PHI nodes) to
      collapse in just such a way that From became equal to To after some
      amount of recursion. At that point, we hit the inf-loop that the assert
      at the top attempted to guard against. This problem is defined away when
      not using value handles in this manner. There are lots of comments
      claiming that the WeakVH will protect against just this sort of error,
      but they're not accurate about the actual implementation of WeakVHs,
      which do still track RAUWs.
      
      I don't have any test case for the bug this fixes because it requires
      running the recursive simplification on unreachable phi nodes. I've no
      way to either run this or easily write an input that triggers it. It was
      found when using instruction simplification inside the inliner when
      running over the nightly test-suite.
      
      llvm-svn: 153393
      cf1b585f
    • Francois Pichet's avatar
      Fix the MSVC build. · 4b9ab746
      Francois Pichet authored
      llvm-svn: 153366
      4b9ab746
    • Andrew Trick's avatar
      More IndVarSimplify cleanup. · 25553ab5
      Andrew Trick authored
      llvm-svn: 153362
      25553ab5
    • Kostya Serebryany's avatar
      add EP_OptimizerLast extension point · e505a5ab
      Kostya Serebryany authored
      llvm-svn: 153353
      e505a5ab
  10. Mar 23, 2012
  11. Mar 22, 2012
Loading