  1. Sep 14, 2012
    • Introduce a new SROA implementation. · 1b398ae0
      Chandler Carruth authored
      This is essentially a ground up re-think of the SROA pass in LLVM. It
      was initially inspired by a few problems with the existing pass:
      - It is subject to the bane of my existence in optimizations: arbitrary
        thresholds.
      - It is overly conservative about which constructs can be split and
        promoted.
      - The vector value replacement aspect is separated from the splitting
        logic, missing many opportunities where splitting and vector value
        formation can work together.
      - The splitting is entirely based around the underlying type of the
        alloca, despite this type often having little to do with the reality
        of how that memory is used. This is especially prevalent with unions
        and base classes where we tail-pack derived members.
      - When splitting fails (often due to the thresholds), the vector value
        replacement (again because it is separate) can kick in for
        preposterous cases where we simply should have split the value. This
        results in forming i1024 and i2048 integer "bit vectors" that
        tremendously slow down subsequent IR optimizations (due to large
        APInts) and impede the backend's lowering. (See the illustrative IR
        sketch after this list.)
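      
      As a concrete, purely illustrative sketch of that last problem (using
      the typed-pointer IR syntax of the period): instead of several small
      scalar allocas, the old pass could form a single huge integer for
      a 128-byte aggregate.
      
        define void @bitvector_example() {
          ; one giant integer "bit vector" standing in for the aggregate
          %buf = alloca i1024
          %whole = load i1024* %buf
          ret void
        }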
      
      The new design takes an approach that fundamentally is not susceptible
      to many of these problems. It is the result of a discussion between
      myself and Duncan Sands over IRC about how to preemptively avoid these
      types of problems and how to do SROA in a more principled way. Since
      then, it has evolved and grown, but this remains an important aspect: it
      fixes real world problems with the SROA process today.
      
      First, the transform of SROA actually has little to do with replacement.
      It has more to do with splitting. The goal is to take an aggregate
      alloca and form a composition of scalar allocas which can replace it and
      will be most suitable to the eventual replacement by scalar SSA values.
      The actual replacement is performed by mem2reg (and in the future
      SSAUpdater).
      
      The splitting is divided into four phases. The first phase is an
      analysis of the uses of the alloca. This phase recursively walks uses,
      building up a dense data structure representing the ranges of the
      alloca's memory actually used and checking for uses which inhibit any
      aspects of the transform such as the escape of a pointer.
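      
      A minimal sketch of what this phase records, with a hypothetical
      @escape callee: the store is a recorded use of a byte range, while the
      call lets the pointer escape and inhibits the transform.
      
        declare void @escape(i32*)
        
        define void @analysis_example() {
          %a = alloca { i32, i32 }
          %f0 = getelementptr { i32, i32 }* %a, i32 0, i32 0
          store i32 1, i32* %f0        ; recorded as a use of bytes [0, 4)
          call void @escape(i32* %f0)  ; pointer escapes: inhibits the pass
          ret void
        }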
      
      Once we have a mapping of the ranges of the alloca used by individual
      operations, we compute a partitioning of the used ranges. Some uses are
      inherently splittable (such as memcpy and memset), while scalar uses are
      not splittable. The goal is to build a partitioning that has the minimum
      number of splits while placing each unsplittable use in its own
      partition. Overlapping unsplittable uses belong to the same partition.
      This is the target split of the aggregate alloca, and it maximizes the
      number of scalar accesses which become accesses to their own alloca and
      candidates for promotion.
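      
      A sketch of the partitioning on { i32, i32 } (assuming the usual
      layout): the memset is splittable and can be carved up, while each
      scalar load is unsplittable and seeds its own partition, so we end up
      with two four-byte partitions.
      
        declare void @llvm.memset.p0i8.i64(i8*, i8, i64, i32, i1)
        
        define i32 @partition_example() {
          %a = alloca { i32, i32 }
          %raw = bitcast { i32, i32 }* %a to i8*
          ; splittable use covering bytes [0, 8)
          call void @llvm.memset.p0i8.i64(i8* %raw, i8 0, i64 8, i32 4, i1 false)
          %p0 = getelementptr { i32, i32 }* %a, i32 0, i32 0
          %p1 = getelementptr { i32, i32 }* %a, i32 0, i32 1
          %v0 = load i32* %p0    ; unsplittable use of bytes [0, 4)
          %v1 = load i32* %p1    ; unsplittable use of bytes [4, 8)
          %s = add i32 %v0, %v1
          ret i32 %s
        }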
      
      Third, we re-walk the uses of the alloca and assign each specific memory
      access to all the partitions touched so that we have dense use-lists for
      each partition.
      
      Finally, we build a new, smaller alloca for each partition and rewrite
      each use of that partition to use the new alloca. During this phase the
      pass will also work very hard to transform uses of an alloca into a form
      suitable for promotion, including forming vector operations, speculating
      loads through PHI nodes and selects, etc.
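      
      For instance, speculating a load through a select (a before/after
      fragment, not a complete module) removes the address use that would
      otherwise block promotion of both allocas:
      
        ; before: the loaded address depends on the select
        %p = select i1 %c, i32* %a, i32* %b
        %v = load i32* %p
        
        ; after: load both sides and select between the loaded values
        %v.a = load i32* %a
        %v.b = load i32* %b
        %v = select i1 %c, i32 %v.a, i32 %v.b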
      
      After splitting is complete, each newly refined alloca that is
      a candidate for promotion to a scalar SSA value is run through mem2reg.
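      
      Putting the phases together, a minimal end-to-end sketch: each
      independently used field lands in its own alloca, which mem2reg then
      promotes, so the function below should reduce to 'ret i32 %x'.
      
        define i32 @sroa_example(i32 %x) {
        entry:
          %agg = alloca { i32, float }
          %ip = getelementptr { i32, float }* %agg, i32 0, i32 0
          store i32 %x, i32* %ip
          %v = load i32* %ip
          ret i32 %v
        }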
      
      There are lots of reasonably detailed comments in the source code about
      the design and algorithms, and I'm going to be trying to improve them in
      subsequent commits to ensure this is well documented, as the new pass is
      in many ways more complex than the old one.
      
      Some of this is still a WIP, but the current state is reasonably stable.
      It has passed bootstrap, the nightly test suite, and Duncan has run it
      successfully through the ACATS and DragonEgg test suites. That said, it
      remains behind a default-off flag until the last few pieces are in
      place, and full testing can be done.
      
      Specific areas I'm looking at next:
      - Improved comments and some code cleanup from reviews.
      - SSAUpdater and enabling this pass inside the CGSCC pass manager.
      - Some data structure tuning and compile-time measurements.
      - More aggressive FCA splitting and vector formation.
      
      Many thanks to Duncan Sands for the thorough final review, as well as
      Benjamin Kramer for lots of review during the process of writing this
      pass, and Daniel Berlin for reviewing the data structures and algorithms
      and general theory of the pass. Also, several other people on IRC, over
      lunch tables, etc for lots of feedback and advice.
      
      llvm-svn: 163883
  3. Aug 29, 2012
    • Make MemoryBuiltins aware of TargetLibraryInfo. · 8bcc9711
      Benjamin Kramer authored
      This disables malloc-specific optimization when -fno-builtin (or -ffreestanding)
      is specified. This has been a problem for a long time but became more severe
      with the recent memory builtin improvements.
      
      Since the memory builtin functions are used everywhere, this required passing
      TLI in many places. This means that functions that now have an optional TLI
      argument, like RecursivelyDeleteTriviallyDeadInstructions, won't remove dead
      mallocs anymore if the TLI argument is missing. I've updated most passes to do
      the right thing.
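      
      A minimal sketch of the behavioral difference: with TLI available (and
      builtins enabled) the unused call below is a known, side-effect-free
      allocation and can be deleted; under -fno-builtin, @malloc may be
      arbitrary user code and must be kept.
      
        declare i8* @malloc(i64)
        
        define void @dead_malloc() {
          %p = call i8* @malloc(i64 16)  ; dead only if malloc is a builtin
          ret void
        }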
      
      Fixes PR13694 and probably others.
      
      llvm-svn: 162841
  11. Jun 29, 2012
    • Move llvm/Support/IRBuilder.h -> llvm/IRBuilder.h · aafe0918
      Chandler Carruth authored
      This was always part of the VMCore library out of necessity -- it deals
      entirely in the IR. The .cpp file in fact was already part of the VMCore
      library. This is just a mechanical move.
      
      I've tried to go through and re-apply the coding standard's preferred
      header sort, but at 40-ish files, I may have gotten some wrong. Please
      let me know if so.
      
      I'll be committing the corresponding updates to Clang and Polly, and
      Duncan has DragonEgg.
      
      Thanks to Bill and Eric for giving the green light for this bit of cleanup.
      
      llvm-svn: 159421
  13. Jun 27, 2012
    • Revert r159136 due to PR13124. · a5886231
      Matt Beaumont-Gay authored
      Original commit message:
      
      If a constant or a function has linkonce_odr linkage and unnamed_addr, mark it
      hidden. Being linkonce_odr guarantees that it is available in every DSO that
      needs it. Being a constant/function with unnamed_addr guarantees that the
      copies don't have to be merged.
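      
      A sketch of a global the reverted change would have affected; both
      properties hold, so (before the revert) the symbol could be given
      hidden visibility:
      
        ; before:
        @tbl = linkonce_odr unnamed_addr constant [2 x i32] [i32 1, i32 2]
        ; after the reverted transform:
        @tbl = linkonce_odr hidden unnamed_addr constant [2 x i32] [i32 1, i32 2]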
      
      llvm-svn: 159272
  16. Jun 23, 2012
    • Extend the IL for selecting TLS models (PR9788) · cbe34b4c
      Hans Wennborg authored
      This allows the user/front-end to specify a model that is better
      than what LLVM would choose by default. For example, a variable
      might be declared as
      
        @x = thread_local(initialexec) global i32 42
      
      if it will not be used in a shared library that is dlopen'ed.
      
      If the specified model isn't supported by the target, or if LLVM can
      make a better choice, a different model may be used.
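      
      For reference, the model keywords the extended syntax accepts, next to
      the default form:
      
        @gd = thread_local global i32 0                ; general-dynamic (default)
        @ld = thread_local(localdynamic) global i32 0  ; local-dynamic
        @ie = thread_local(initialexec) global i32 0   ; initial-exec
        @le = thread_local(localexec) global i32 0     ; local-exec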
      
      llvm-svn: 159077
  23. May 04, 2012
    • Move the CodeExtractor utility to a dedicated header file / source file, · 0fde0015
      Chandler Carruth authored
      and expose it as a utility class rather than as free function wrappers.
      
      The simple free-function interface works well for the bugpoint-specific
      pass's uses of code extraction, but for an upcoming patch doing more
      advanced code extraction, it simply doesn't expose a rich enough
      interface. I need to expose the various stages of the code extraction
      process and to query information in order to decide whether or not to
      actually complete the extraction or give up.
      
      Rather than build up a new predicate model and pass that into these
      functions, just take the class that was actually implementing the
      functions and lift it up into a proper interface that can be used to
      perform code extraction. The interface is cleaned up and re-documented
      to work better in a header. It is also now set up to accept the blocks to
      be extracted in the constructor rather than in a method.
      
      In passing this essentially reverts my previous commit here exposing
      a block-level query for eligibility of extraction. That is no longer
      necessary with the richer interface, as clients can query the
      extraction object for eligibility directly. This will reduce the number
      of walks of the input basic block sequence by quite a bit, which is
      useful if this enters the normal optimization pipeline.
      
      llvm-svn: 156163
  29. Mar 31, 2012
    • Give the always-inliner its own custom filter. It shouldn't have to pay · a88a0faa
      Chandler Carruth authored
      the very high overhead of the complex inline cost analysis when all it
      wants to do is detect three patterns which must not be inlined. Comment
      the code, clean it up, and leave some hints about possible performance
      improvements if this ever shows up on a profile.
      
      Moving this off of the (now more expensive) inline cost analysis is
      particularly important because we have to run this inliner even at -O0.
      
      llvm-svn: 153814
    • Remove a bunch of empty, dead, and no-op methods from all of these · edd2826f
      Chandler Carruth authored
      interfaces. These methods were used in the old inline cost system where
      there was a persistent cache that had to be updated, invalidated, and
      cleared. We're now doing more direct computations that don't require
      this intricate dance. Even if we resume some level of caching, it would
      almost certainly have a simpler and more narrow interface than this.
      
      llvm-svn: 153813
    • Initial commit for the rewrite of the inline cost analysis to operate · 0539c071
      Chandler Carruth authored
      on a per-callsite walk of the called function's instructions, in
      breadth-first order over the potentially reachable set of basic blocks.
      
      This is a major shift in how inline cost analysis works to improve the
      accuracy and rationality of inlining decisions. A brief outline of the
      algorithm this moves to:
      
      - Build a simplification mapping based on the callsite arguments to the
        function arguments.
      - Push the entry block onto a worklist of potentially-live basic blocks.
      - Pop the first block off of the *front* of the worklist (for
        breadth-first ordering) and walk its instructions using a custom
        InstVisitor.
      - For each instruction's operands, re-map them based on the
        simplification mappings available for the given callsite.
      - Compute any simplification possible of the instruction after
        re-mapping, and store that back into the simplification mapping.
      - Compute any bonuses, costs, or other impacts of the instruction on the
        cost metric.
      - When the terminator is reached, replace any conditional value in the
        terminator with any simplifications from the mapping we have, and add
        any successors which are not proven to be dead from these
        simplifications to the worklist.
      - Pop the next block off of the front of the worklist, and repeat.
      - As soon as the cost of inlining exceeds the threshold for the
        callsite, stop analyzing the function in order to bound cost.
      
      The primary goal of this algorithm is to perfectly handle dead code
      paths. We do not want any code in trivially dead code paths to impact
      inlining decisions. The previous metric was *extremely* flawed here, and
      would always subtract the average cost of two successors of
      a conditional branch when it was proven to become an unconditional
      branch at the callsite. There was no handling of wildly different costs
      between the two successors, which would cause inlining when the path
      actually taken was too large, and no inlining when the path actually
      taken was trivially simple. There was also no handling of the code
      *path*, only the immediate successors. These problems vanish completely
      now. See the added regression tests for the shiny new features -- we
      skip recursive function calls, SROA-killing instructions, and high cost
      complex CFG structures when dead at the callsite being analyzed.
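      
      A small sketch of the dead-path handling on a hypothetical callee:
      
        define i32 @callee(i1 %flag) {
        entry:
          br i1 %flag, label %fast, label %slow
        fast:
          ret i32 0
        slow:
          ; ... arbitrarily expensive code ...
          ret i32 1
        }
      
      At a callsite such as 'call i32 @callee(i1 true)', the argument mapping
      folds the branch in %entry, %slow is never pushed onto the worklist,
      and its cost never counts against inlining that callsite.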
      
      Switching to this algorithm required refactoring the inline cost
      interface to accept the actual threshold rather than simply returning
      a single cost. The resulting interface is pretty bad, and I'm planning
      to do lots of interface cleanup after this patch.
      
      Several other refactorings fell out of this, but I've tried to minimize
      them for this patch. =/ There is still more cleanup that can be done
      here. Please point out anything that you see in review.
      
      I've worked really hard to try to mirror at least the spirit of all of
      the previous heuristics in the new model. It's not clear that they are
      all correct any more, but I wanted to minimize the change in this single
      patch; it's already a bit ridiculous. One heuristic that is *not* yet
      mirrored is to allow inlining of functions with a dynamic alloca *if*
      the caller has a dynamic alloca. I will add this back, but I think the
      most reasonable way requires changes to the inliner itself rather than
      just the cost metric, and so I've deferred this for a subsequent patch.
      The test case is XFAIL-ed until then.
      
      As mentioned in the review mail, this seems to make Clang run about 1%
      to 2% faster in -O0, but makes its binary size grow by just under 4%.
      I've looked into the 4% growth, and it can be fixed, but requires
      changes to other parts of the inliner.
      
      llvm-svn: 153812