Commits · 98f2c0c384e2aa39578cd8e9c96fe3c45e4b68d2 · Roger Ferrer / llvm-epi-0.8

Apr 02, 2012

Stepan Dyatkovskiy authored Apr 02, 2012

http://llvm.org/bugs/show_bug.cgi?id=12343

We have not trivial way for splitting edges that are goes from indirect branch. We can do it with some tricks, but it should be additionally discussed. And it is still dangerous due to difficulty of indirect branches controlling.

Fix forbids this case for unswitching.

llvm-svn: 153879

f62ffeca

Apr 01, 2012

Belatedly address some code review from Chris. · 45ae88f5

Chandler Carruth authored Apr 01, 2012

As a side note, I really dislike array_pod_sort... Do we really still
care about any STL implementations that get this so wrong? Does libc++?

llvm-svn: 153834

45ae88f5

Fix a pretty scary bug I introduced into the always inliner with · c5bfb3c0

Chandler Carruth authored Apr 01, 2012

a single missing character. Somehow, this had gone untested. I've added
tests for returns-twice logic specifically with the always-inliner that
would have caught this, and fixed the bug.

Thanks to Matt for the careful review and spotting this!!! =D

llvm-svn: 153832

c5bfb3c0

Mar 31, 2012

Give the always-inliner its own custom filter. It shouldn't have to pay · a88a0faa

Chandler Carruth authored Mar 31, 2012

the very high overhead of the complex inline cost analysis when all it
wants to do is detect three patterns which must not be inlined. Comment
the code, clean it up, and leave some hints about possible performance
improvements if this ever shows up on a profile.

Moving this off of the (now more expensive) inline cost analysis is
particularly important because we have to run this inliner even at -O0.

llvm-svn: 153814

a88a0faa

Remove a bunch of empty, dead, and no-op methods from all of these · edd2826f

Chandler Carruth authored Mar 31, 2012

interfaces. These methods were used in the old inline cost system where
there was a persistent cache that had to be updated, invalidated, and
cleared. We're now doing more direct computations that don't require
this intricate dance. Even if we resume some level of caching, it would
almost certainly have a simpler and more narrow interface than this.

llvm-svn: 153813

edd2826f

Initial commit for the rewrite of the inline cost analysis to operate · 0539c071

Chandler Carruth authored Mar 31, 2012

on a per-callsite walk of the called function's instructions, in
breadth-first order over the potentially reachable set of basic blocks.

This is a major shift in how inline cost analysis works to improve the
accuracy and rationality of inlining decisions. A brief outline of the
algorithm this moves to:

- Build a simplification mapping based on the callsite arguments to the
  function arguments.
- Push the entry block onto a worklist of potentially-live basic blocks.
- Pop the first block off of the *front* of the worklist (for
  breadth-first ordering) and walk its instructions using a custom
  InstVisitor.
- For each instruction's operands, re-map them based on the
  simplification mappings available for the given callsite.
- Compute any simplification possible of the instruction after
  re-mapping, and store that back int othe simplification mapping.
- Compute any bonuses, costs, or other impacts of the instruction on the
  cost metric.
- When the terminator is reached, replace any conditional value in the
  terminator with any simplifications from the mapping we have, and add
  any successors which are not proven to be dead from these
  simplifications to the worklist.
- Pop the next block off of the front of the worklist, and repeat.
- As soon as the cost of inlining exceeds the threshold for the
  callsite, stop analyzing the function in order to bound cost.

The primary goal of this algorithm is to perfectly handle dead code
paths. We do not want any code in trivially dead code paths to impact
inlining decisions. The previous metric was *extremely* flawed here, and
would always subtract the average cost of two successors of
a conditional branch when it was proven to become an unconditional
branch at the callsite. There was no handling of wildly different costs
between the two successors, which would cause inlining when the path
actually taken was too large, and no inlining when the path actually
taken was trivially simple. There was also no handling of the code
*path*, only the immediate successors. These problems vanish completely
now. See the added regression tests for the shiny new features -- we
skip recursive function calls, SROA-killing instructions, and high cost
complex CFG structures when dead at the callsite being analyzed.

Switching to this algorithm required refactoring the inline cost
interface to accept the actual threshold rather than simply returning
a single cost. The resulting interface is pretty bad, and I'm planning
to do lots of interface cleanup after this patch.

Several other refactorings fell out of this, but I've tried to minimize
them for this patch. =/ There is still more cleanup that can be done
here. Please point out anything that you see in review.

I've worked really hard to try to mirror at least the spirit of all of
the previous heuristics in the new model. It's not clear that they are
all correct any more, but I wanted to minimize the change in this single
patch, it's already a bit ridiculous. One heuristic that is *not* yet
mirrored is to allow inlining of functions with a dynamic alloca *if*
the caller has a dynamic alloca. I will add this back, but I think the
most reasonable way requires changes to the inliner itself rather than
just the cost metric, and so I've deferred this for a subsequent patch.
The test case is XFAIL-ed until then.

As mentioned in the review mail, this seems to make Clang run about 1%
to 2% faster in -O0, but makes its binary size grow by just under 4%.
I've looked into the 4% growth, and it can be fixed, but requires
changes to other parts of the inliner.

llvm-svn: 153812

0539c071

Internalize: Remove reference of @llvm.noinline, it was replaced with the... · 53dc8733
Benjamin Kramer authored Mar 31, 2012
```
Internalize: Remove reference of @llvm.noinline, it was replaced with the noinline attribute a long time ago.

llvm-svn: 153806
```
53dc8733

Correctly vectorize powi. · 5cad8742

Hal Finkel authored Mar 31, 2012

The powi intrinsic requires special handling because it always takes a single
integer power regardless of the result type. As a result, we can vectorize
only if the powers are equal. Fixes PR12364.

llvm-svn: 153797

5cad8742

Mar 29, 2012

Don't PRE compares. · 4e55044f

Jakob Stoklund Olesen authored Mar 29, 2012

CodeGenPrepare sinks compare instructions down to their uses to prevent
live flags and predicate registers across basic blocks.

PRE of a compare instruction prevents that, forcing the i1 compare
result into a general purpose register.  That is usually more expensive
than the redundant compare PRE was trying to eliminate in the first
place.

llvm-svn: 153657

4e55044f

Mar 28, 2012

GlobalOpt: If we have an inbounds GEP from a ConstantAggregateZero global that... · aa9e4a5e

Benjamin Kramer authored Mar 28, 2012

GlobalOpt: If we have an inbounds GEP from a ConstantAggregateZero global that we just determined to be constant, replace all loads from it with a zero value.

llvm-svn: 153576

aa9e4a5e

Switch to WeakVHs in the value mapper, and aggressively prune dead basic · 772c88b8

Chandler Carruth authored Mar 28, 2012

blocks in the function cloner. This removes the last case of trivially
dead code that I've been seeing in the wild getting inlined, analyzed,
re-inlined, optimized, only to be deleted. Nukes a FIXME from the
cleanup tests.

llvm-svn: 153572

772c88b8

Fix 80-column violation. · bb2a6da4
Chad Rosier authored Mar 28, 2012
```
llvm-svn: 153556
```
bb2a6da4

Mar 27, 2012

Make a seemingly tiny change to the inliner and fix the generated code · b9e35fbc

Chandler Carruth authored Mar 27, 2012

size bloat. Unfortunately, I expect this to disable the majority of the
benefit from r152737. I'm hopeful at least that it will fix PR12345. To
explain this requires... quite a bit of backstory I'm afraid.

TL;DR: The change in r152737 actually did The Wrong Thing for
linkonce-odr functions. This change makes it do the right thing. The
benefits we saw were simple luck, not any actual strategy. Benchmark
numbers after a mini-blog-post so that I've written down my thoughts on
why all of this works and doesn't work...

To understand what's going on here, you have to understand how the
"bottom-up" inliner actually works. There are two fundamental modes to
the inliner:

1) Standard fixed-cost bottom-up inlining. This is the mode we usually
   think about. It walks from the bottom of the CFG up to the top,
   looking at callsites, taking information about the callsite and the
   called function and computing th expected cost of inlining into that
   callsite. If the cost is under a fixed threshold, it inlines. It's
   a touch more complicated than that due to all the bonuses, weights,
   etc. Inlining the last callsite to an internal function gets higher
   weighth, etc. But essentially, this is the mode of operation.

2) Deferred bottom-up inlining (a term I just made up). This is the
   interesting mode for this patch an r152737. Initially, this works
   just like mode #1, but once we have the cost of inlining into the
   callsite, we don't just compare it with a fixed threshold. First, we
   check something else. Let's give some names to the entities at this
   point, or we'll end up hopelessly confused. We're considering
   inlining a function 'A' into its callsite within a function 'B'. We
   want to check whether 'B' has any callers, and whether it might be
   inlined into those callers. If so, we also check whether inlining 'A'
   into 'B' would block any of the opportunities for inlining 'B' into
   its callers. We take the sum of the costs of inlining 'B' into its
   callers where that inlining would be blocked by inlining 'A' into
   'B', and if that cost is less than the cost of inlining 'A' into 'B',
   then we skip inlining 'A' into 'B'.

Now, in order for #2 to make sense, we have to have some confidence that
we will actually have the opportunity to inline 'B' into its callers
when cheaper, *and* that we'll be able to revisit the decision and
inline 'A' into 'B' if that ever becomes the correct tradeoff. This
often isn't true for external functions -- we can see very few of their
callers, and we won't be able to re-consider inlining 'A' into 'B' if
'B' is external when we finally see more callers of 'B'. There are two
cases where we believe this to be true for C/C++ code: functions local
to a translation unit, and functions with an inline definition in every
translation unit which uses them. These are represented as internal
linkage and linkonce-odr (resp.) in LLVM. I enabled this logic for
linkonce-odr in r152737.

Unfortunately, when I did that, I also introduced a subtle bug. There
was an implicit assumption that the last caller of the function within
the TU was the last caller of the function in the program. We want to
bonus the last caller of the function in the program by a huge amount
for inlining because inlining that callsite has very little cost.
Unfortunately, the last caller in the TU of a linkonce-odr function is
*not* the last caller in the program, and so we don't want to apply this
bonus. If we do, we can apply it to one callsite *per-TU*. Because of
the way deferred inlining works, when it sees this bonus applied to one
callsite in the TU for 'B', it decides that inlining 'B' is of the
*utmost* importance just so we can get that final bonus. It then
proceeds to essentially force deferred inlining regardless of the actual
cost tradeoff.

The result? PR12345: code bloat, code bloat, code bloat. Another result
is getting *damn* lucky on a few benchmarks, and the over-inlining
exposing critically important optimizations. I would very much like
a list of benchmarks that regress after this change goes in, with
bitcode before and after. This will help me greatly understand what
opportunities the current cost analysis is missing.

Initial benchmark numbers look very good. WebKit files that exhibited
the worst of PR12345 went from growing to shrinking compared to Clang
with r152737 reverted.

- Bootstrapped Clang is 3% smaller with this change.
- Bootstrapped Clang -O0 over a single-source-file of lib/Lex is 4%
  faster with this change.

Please let me know about any other performance impact you see. Thanks to
Nico for reporting and urging me to actually fix, Richard Smith, Duncan
Sands, Manuel Klimek, and Benjamin Kramer for talking through the issues
today.

llvm-svn: 153506

b9e35fbc

Mar 26, 2012
- 153465 was incorrect. In this code we wanted to check that the pointer operand... · a8f3562e
  Nadav Rotem authored Mar 26, 2012
```
153465 was incorrect. In this code we wanted to check that the pointer operand is of pointer type (and not vector type).

llvm-svn: 153468
```
  a8f3562e
- PR12357: The pointer was used before it was checked. · e63e59cc
  Nadav Rotem authored Mar 26, 2012
```
llvm-svn: 153465
```
  e63e59cc
- LSR ivchain bug fix: corner case with ConstantExpr. · 14779cc4
  Andrew Trick authored Mar 26, 2012
```
Fixes PR11950.

llvm-svn: 153463
```
  14779cc4
- comment typo · 356a8963
  Andrew Trick authored Mar 26, 2012
```
llvm-svn: 153462
```
  356a8963
- eliminate an unneeded branch, part of PR12357 · b1e2e1e0
  Chris Lattner authored Mar 26, 2012
```
llvm-svn: 153458
```
  b1e2e1e0
- Tidy. · 2b40fdf3
  Eric Christopher authored Mar 26, 2012
```
llvm-svn: 153456
```
  2b40fdf3
- Tidy. · f16bee86
  Eric Christopher authored Mar 26, 2012
```
llvm-svn: 153455
```
  f16bee86
- LSR cleanup: potential bug caught by PVS-Studio. · e51feea7
  Andrew Trick authored Mar 26, 2012
```
Thanks Andrey.

llvm-svn: 153451
```
  e51feea7
- [tsan] treat vtable pointer updates in a special way (requires tbaa); fix a... · 6f8a7760
  Kostya Serebryany authored Mar 26, 2012
```
[tsan] treat vtable pointer updates in a special way (requires tbaa); fix a bug (forgot to return true after instrumenting); make sure the tsan tests are run

llvm-svn: 153448
```
  6f8a7760
- Prune some includes and forward declarations. · 6e80c280
  Craig Topper authored Mar 26, 2012
```
llvm-svn: 153429
```
  6e80c280
Mar 25, 2012

Teach the function cloner (and thus the inliner) to simplify PHINodes · ef82cf5b

Chandler Carruth authored Mar 25, 2012

aggressively. There are lots of dire warnings about this being expensive
that seem to predate switching to the TrackingVH-based value remapper
that is automatically updated on RAUW. This makes it easy to not just
prune single-entry PHIs, but to fully simplify PHIs, and to recursively
simplify the newly inlined code to propagate PHINode simplifications.

This introduces a bit of a thorny problem though. We may end up
simplifying a branch condition to a constant when we fold PHINodes, and
we would like to nuke any dead blocks resulting from this so that time
isn't wasted continually analyzing them, but this isn't easy. Deleting
basic blocks *after* they are fully cloned and mapped into the new
function currently requires manually updating the value map. The last
piece of the simplification-during-inlining puzzle will require either
switching to WeakVH mappings or some other piece of refactoring. I've
left a FIXME in the testcase about this.

llvm-svn: 153410

ef82cf5b

Move the instruction simplification of callsite arguments in the inliner · 21211992

Chandler Carruth authored Mar 25, 2012

to instead rely on much more generic and powerful instruction
simplification in the function cloner (and thus inliner).

This teaches the pruning function cloner to use instsimplify rather than
just the constant folder to fold values during cloning. This can
simplify a large number of things that constant folding alone cannot
begin to touch. For example, it will realize that 'or' and 'and'
instructions with certain constant operands actually become constants
regardless of what their other operand is. It also can thread back
through the caller to perform simplifications that are only possible by
looking up a few levels. In particular, GEPs and pointer testing tend to
fold much more heavily with this change.

This should (in some cases) have a positive impact on compile times with
optimizations on because the inliner itself will simply avoid cloning
a great deal of code. It already attempted to prune proven-dead code,
but now it will be use the stronger simplifications to prove more code
dead.

llvm-svn: 153403

21211992

Add an asserting ValueHandle to the block simplification code which will · 0c72e3f4

Chandler Carruth authored Mar 25, 2012

fire if anything ever invalidates the assumption of a terminator
instruction being unchanged throughout the routine.

I've convinced myself that the current definition of simplification
precludes such a transformation, so I think getting some asserts
coverage that we don't violate this agreement is sufficient to make this
code safe for the foreseeable future.

Comments to the contrary or other suggestions are of course welcome. =]
The bots are now happy with this code though, so it appears the bug here
has indeed been fixed.

llvm-svn: 153401

0c72e3f4

Don't form a WeakVH around the sentinel node in the instructions BB · 17fc6ef2

Chandler Carruth authored Mar 24, 2012

list. This is a bad idea. ;] I'm hopeful this is the bug that's showing
up with the MSVC bots, but we'll see.

It is definitely unnecessary. InstSimplify won't do anything to
a terminator instruction, we don't need to even include it in the
iteration range. We can also skip the now dead terminator check,
although I've made it an assert to help document that this is an
important invariant.

I'm still a bit queasy about this because there is an implicit
assumption that the terminator instruction cannot be RAUW'ed by the
simplification code. While that appears to be true at the moment, I see
no guarantee that would ensure it remains true in the future. I'm
looking at the cleanest way to solve that...

llvm-svn: 153399

17fc6ef2

Mar 24, 2012

Refactor the interface to recursively simplifying instructions to be tad · cf1b585f

Chandler Carruth authored Mar 24, 2012

bit simpler by handling a common case explicitly.

Also, refactor the implementation to use a worklist based walk of the
recursive users, rather than trying to use value handles to detect and
recover from RAUWs during the recursive descent. This fixes a very
subtle bug in the previous implementation where degenerate control flow
structures could cause mutually recursive instructions (PHI nodes) to
collapse in just such a way that From became equal to To after some
amount of recursion. At that point, we hit the inf-loop that the assert
at the top attempted to guard against. This problem is defined away when
not using value handles in this manner. There are lots of comments
claiming that the WeakVH will protect against just this sort of error,
but they're not accurate about the actual implementation of WeakVHs,
which do still track RAUWs.

I don't have any test case for the bug this fixes because it requires
running the recursive simplification on unreachable phi nodes. I've no
way to either run this or easily write an input that triggers it. It was
found when using instruction simplification inside the inliner when
running over the nightly test-suite.

llvm-svn: 153393

cf1b585f

Fix the MSVC build. · 4b9ab746
Francois Pichet authored Mar 24, 2012
```
llvm-svn: 153366
```
4b9ab746
More IndVarSimplify cleanup. · 25553ab5
Andrew Trick authored Mar 24, 2012
```
llvm-svn: 153362
```
25553ab5
add EP_OptimizerLast extension point · e505a5ab
Kostya Serebryany authored Mar 23, 2012
```
llvm-svn: 153353
```
e505a5ab

Mar 23, 2012

Don't convert objc_retainAutoreleasedReturnValue to objc_retain if it · e3ed2b06
Dan Gohman authored Mar 23, 2012
```
is retaining the return value of an invoke that it immediately follows.

llvm-svn: 153344
```
e3ed2b06

It's not possible to insert code immediately after an invoke in the · 5c70fadc

Dan Gohman authored Mar 23, 2012

same basic block, and it's not safe to insert code in the successor
blocks if the edges are critical edges. Splitting those edges is
possible, but undesirable, especially on the unwind side. Instead,
make the bottom-up code motion to consider invokes to be part of
their successor blocks, rather than part of their parent blocks, so
that it doesn't push code past them and onto the edges. This fixes
PR12307.

llvm-svn: 153343

5c70fadc

When propagating equalities, eg replacing A with B in every basic block · a11ef6e4

Duncan Sands authored Mar 23, 2012

dominated by Root, check that B is available throughout the scope.  This
is obviously true (famous last words?) given the current logic, but the
check may be helpful if more complicated reasoning is added one day.

llvm-svn: 153323

a11ef6e4

Indentation. · 8f897dc8
Duncan Sands authored Mar 23, 2012
```
llvm-svn: 153322
```
8f897dc8

Mar 22, 2012
- Remove -enable-lsr-retry in time for 3.1. · e3502cb2
  Andrew Trick authored Mar 22, 2012
```
llvm-svn: 153287
```
  e3502cb2
- Remove -enable-lsr-nested in time for 3.1. · d97b83e3
  Andrew Trick authored Mar 22, 2012
```
Tests cases have been removed but attached to open PR12330.

llvm-svn: 153286
```
  d97b83e3
- Refactor the code for visiting instructions out into helper functions. · 817a7c6f
  Dan Gohman authored Mar 22, 2012
```
llvm-svn: 153267
```
  817a7c6f
- Remove unused simplifyIVUsers · 06549890
  Andrew Trick authored Mar 22, 2012
```
llvm-svn: 153262
```
  06549890
- Remove -enable-iv-rewrite, which has been unsupported since 3.0. · f47d0af5
  Andrew Trick authored Mar 22, 2012
```
llvm-svn: 153260
```
  f47d0af5