- Feb 06, 2014
-
-
Nick Lewycky authored
llvm-svn: 200907
-
Manman Ren authored
225 is the default value of inline-threshold. This change will make sure we have the same inlining behavior as prior to r200886. As Chandler points out, even though we don't have code in our testing suite that uses cold attribute, there are larger applications that do use cold attribute. r200886 + this commit intend to keep the same behavior as prior to r200886. We can later on tune the inlinecold-threshold. The main purpose of r200886 is to help performance of instrumentation based PGO before we actually hook up inliner with analysis passes such as BPI and BFI. For instrumentation based PGO, we try to increase inlining of hot functions and reduce inlining of cold functions by setting inlinecold-threshold. Another option suggested by Chandler is to use a boolean flag that controls if we should use OptSizeThreshold for cold functions. The default value of the boolean flag should not change the current behavior. But it gives us less freedom in controlling inlining of cold functions. llvm-svn: 200898
-
Paul Robinson authored
Ideally only those transform passes that run at -O0 remain enabled, in reality we get as close as we reasonably can. Passes are responsible for disabling themselves, it's not the job of the pass manager to do it for them. llvm-svn: 200892
-
- Feb 05, 2014
-
-
Manman Ren authored
Added command line option inlinecold-threshold to set threshold for inlining functions with cold attribute. Listen to the cold attribute when it would decrease the inline threshold. llvm-svn: 200886
-
- Feb 04, 2014
-
-
Benjamin Kramer authored
For the odd case of platforms with exp2 available but not ldexp. llvm-svn: 200795
-
Duncan P. N. Exon Smith authored
No functional change. Updated loops from: for (I = scc_begin(), E = scc_end(); I != E; ++I) to: for (I = scc_begin(); !I.isAtEnd(); ++I) for teh win. llvm-svn: 200789
-
Tim Northover authored
rdar://problem/13729466 llvm-svn: 200771
-
Kai Nacke authored
Add the missing transformation strchr(p, 0) -> p + strlen(p) to SimplifyLibCalls and remove the ToDo comment. Reviewer: Duncan P.N. Exan Smith llvm-svn: 200736
-
Nick Lewycky authored
Self-memcpy-elision and memcpy of constant byte to memset transforms don't care how many bytes you were trying to transfer. Sink that safety test after those transforms. Noticed by inspection. llvm-svn: 200726
-
- Feb 03, 2014
-
-
Reid Kleckner authored
It disturbs the layout of the parameters in memory and registers, leading to problems in the backend. The plan for optimizing internal inalloca functions going forward is to essentially SROA the argument memory and demote any captured arguments (things that aren't trivially written by a load or store) to an indirect pointer to a static alloca. llvm-svn: 200717
-
- Feb 02, 2014
-
-
Duncan P. N. Exon Smith authored
LowerExpectIntrinsic previously only understood the idiom of an expect intrinsic followed by a comparison with zero. For llvm.expect.i1, the comparison would be stripped by the early-cse pass. Patch by Daniel Micay. llvm-svn: 200664
-
Arnold Schwaighofer authored
unrolling heuristic per default Benchmarking on x86_64 (thanks Chandler!) and ARM has shown those options speed up some benchmarks while not causing any interesting regressions. llvm-svn: 200621
-
- Feb 01, 2014
-
-
Chandler Carruth authored
LCSSA when we promote to SSA registers inside of LICM. Currently, this is actually necessary. The promotion logic in LICM uses SSAUpdater which doesn't understand how to place LCSSA PHI nodes. Teaching it to do so would be a very significant undertaking. It may be worthwhile and I've left a FIXME about this in the code as well as starting a thread on llvmdev to try to figure out the right long-term solution. For now, the PR needs to be fixed. Short of using the promition SSAUpdater to place both the LCSSA PHI nodes and the promoted PHI nodes, I don't see a cleaner or cheaper way of achieving this. Fortunately, LCSSA is relatively lazy and sparse -- it should only update instructions which need it. We can also skip the recursive variant when we don't promote to SSA values. llvm-svn: 200612
-
Eli Bendersky authored
llvm-svn: 200611
-
Reid Kleckner authored
This reverts commit r200576. It broke 32-bit self-host builds by vectorizing two calls to @llvm.bswap.i64, which we then fail to expand. llvm-svn: 200602
-
- Jan 31, 2014
-
-
Chandler Carruth authored
transform accordingly. Based on similar code from Loop vectorization. Subsequent commits will include vectorization of function calls to vector intrinsics and form function calls to vector library calls. Patch by Raul Silvera! (Much delayed due to my not running dcommit) llvm-svn: 200576
-
Chandler Carruth authored
loop vectorizer to not do so when runtime pointer checks are needed and share code with the new (not yet enabled) load/store saturation runtime unrolling. Also ensure that we only consider the runtime checks when the loop hasn't already been vectorized. If it has, the runtime check cost has already been paid. I've fleshed out a test case to cover the scalar unrolling as well as the vector unrolling and comment clearly why we are or aren't following the pattern. llvm-svn: 200530
-
rdar://15930350Bob Wilson authored
The entry block of a function starts with all the static allocas. The change in r195513 splits the block before those allocas, which has the effect of turning them into dynamic allocas. That breaks all sorts of things. Change to split after the initial allocas, and also add a comment explaining why the block is split. llvm-svn: 200515
-
- Jan 29, 2014
-
-
Chandler Carruth authored
preserve loop simplify of enclosing loops. The problem here starts with LoopRotation which ends up cloning code out of the latch into the new preheader it is buidling. This can create a new edge from the preheader into the exit block of the loop which breaks LoopSimplify form. The code tries to fix this by splitting the critical edge between the latch and the exit block to get a new exit block that only the latch dominates. This sadly isn't sufficient. The exit block may be an exit block for multiple nested loops. When we clone an edge from the latch of the inner loop to the new preheader being built in the outer loop, we create an exiting edge from the outer loop to this exit block. Despite breaking the LoopSimplify form for the inner loop, this is fine for the outer loop. However, when we split the edge from the inner loop to the exit block, we create a new block which is in neither the inner nor outer loop as the new exit block. This is a predecessor to the old exit block, and so the split itself takes the outer loop out of LoopSimplify form. We need to split every edge entering the exit block from inside a loop nested more deeply than the exit block in order to preserve all of the loop simplify constraints. Once we try to do that, a problem with splitting critical edges surfaces. Previously, we tried a very brute force to update LoopSimplify form by re-computing it for all exit blocks. We don't need to do this, and doing this much will sometimes but not always overlap with the LoopRotate bug fix. Instead, the code needs to specifically handle the cases which can start to violate LoopSimplify -- they aren't that common. We need to see if the destination of the split edge was a loop exit block in simplified form for the loop of the source of the edge. For this to be true, all the predecessors need to be in the exact same loop as the source of the edge being split. If the dest block was originally in this form, we have to split all of the deges back into this loop to recover it. The old mechanism of doing this was conservatively correct because at least *one* of the exiting blocks it rewrote was the DestBB and so the DestBB's predecessors were fixed. But this is a much more targeted way of doing it. Making it targeted is important, because ballooning the set of edges touched prevents LoopRotate from being able to split edges *it* needs to split to preserve loop simplify in a coherent way -- the critical edge splitting would sometimes find the other edges in need of splitting but not others. Many, *many* thanks for help from Nick reducing these test cases mightily. And helping lots with the analysis here as this one was quite tricky to track down. llvm-svn: 200393
-
Chandler Carruth authored
because of the inside-out run of LoopSimplify in the LoopPassManager and the fact that LoopSimplify couldn't be "preserved" across two independent LoopPassManagers. Anyways, in that case, IndVars wasn't correctly preserving an LCSSA PHI node because it thought it was rewriting (via SCEV) the incoming value to a loop invariant value. While it may well be invariant for the current loop, it may be rewritten in terms of an enclosing loop's values. This in and of itself is fine, as the LCSSA PHI node in the enclosing loop for the inner loop value we're rewriting will have its own LCSSA PHI node if used outside of the enclosing loop. With me so far? Well, the current loop and the enclosing loop may share an exiting block and exit block, and when they do they also share LCSSA PHI nodes. In this case, its not valid to RAUW through the LCSSA PHI node. Expected crazy test included. llvm-svn: 200372
-
Arnold Schwaighofer authored
When estimating register pressure, don't count the induction variable mulitple times. It is unlikely to be unrolled. This is currently disabled and hidden behind a flag ("enable-ind-var-reg-heur"). llvm-svn: 200371
-
- Jan 28, 2014
-
-
Rafael Espindola authored
When simplifycfg moves an instruction, it must drop metadata it doesn't know is still valid with the preconditions changes. In particular, it must drop the range and tbaa metadata. The patch implements this with an utility function to drop all metadata not in a white list. llvm-svn: 200322
-
Chandler Carruth authored
vectorizer, placing it behind an off-by-default flag. It turns out that block frequency isn't what we want at all, here or elsewhere. This has been I think a nagging feeling for several of us working with it, but Arnold has given some really nice simple examples where the results are so comprehensively wrong that they aren't useful. I'm planning to email the dev list with a summary of why its not really useful and a couple of ideas about how to better structure these types of heuristics. llvm-svn: 200294
-
Reid Kleckner authored
Summary: I searched Transforms/ and Analysis/ for 'ByVal' and updated those call sites to check for inalloca if appropriate. I added tests for any change that would allow an optimization to fire on inalloca. Reviewers: nlewycky Differential Revision: http://llvm-reviews.chandlerc.com/D2449 llvm-svn: 200281
-
Chandler Carruth authored
LCSSA from it caused a crasher with the LoopUnroll pass. This crasher is really nasty. We destroy LCSSA form in a suprising way. When unrolling a loop into an outer loop, we not only need to restore LCSSA form for the outer loop, but for all children of the outer loop. This is somewhat obvious in retrospect, but hey! While this seems pretty heavy-handed, it's not that bad. Fundamentally, we only do this when we unroll a loop, which is already a heavyweight operation. We're unrolling all of these hypothetical inner loops as well, so their size and complexity is already on the critical path. This is just adding another pass over them to re-canonicalize. I have a test case from PR18616 that is great for reproducing this, but pretty useless to check in as it relies on many 10s of nested empty loops that get unrolled and deleted in just the right order. =/ What's worse is that investigating this has exposed another source of failure that is likely to be even harder to test. I'll try to come up with test cases for these fixes, but I want to get the fixes into the tree first as they're causing crashes in the wild. llvm-svn: 200273
-
Arnold Schwaighofer authored
The vectorizer takes a loop like this and widens all instructions except for the store. The stores are scalarized/unrolled and hidden behind an "if" block. for (i = 0; i < 128; ++i) { if (a[i] < 10) a[i] += val; } for (i = 0; i < 128; i+=2) { v = a[i:i+1]; v0 = (extract v, 0) + 10; v1 = (extract v, 1) + 10; if (v0 < 10) a[i] = v0; if (v1 < 10) a[i] = v1; } The vectorizer relies on subsequent optimizations to sink instructions into the conditional block where they are anticipated. The flag "vectorize-num-stores-pred" controls whether and how many stores to handle this way. Vectorization of conditional stores is disabled per default for now. This patch also adds a change to the heuristic when the flag "enable-loadstore-runtime-unroll" is enabled (off by default). It unrolls small loops until load/store ports are saturated. This heuristic uses TTI's getMaxUnrollFactor as a measure for load/store ports. I also added a second flag -enable-cond-stores-vec. It will enable vectorization of conditional stores. But there is no cost model for vectorization of conditional stores in place yet so this will not do good at the moment. rdar://15892953 Results for x86-64 -O3 -mavx +/- -mllvm -enable-loadstore-runtime-unroll -vectorize-num-stores-pred=1 (before the BFI change): Performance Regressions: Benchmarks/Ptrdist/yacr2/yacr2 7.35% (maze3() is identical but 10% slower) Applications/siod/siod 2.18% Performance improvements: mesa -4.42% libquantum -4.15% With a patch that slightly changes the register heuristics (by subtracting the induction variable on both sides of the register pressure equation, as the induction variable is probably not really unrolled): Performance Regressions: Benchmarks/Ptrdist/yacr2/yacr2 7.73% Applications/siod/siod 1.97% Performance Improvements: libquantum -13.05% (we now also unroll quantum_toffoli) mesa -4.27% llvm-svn: 200270
-
Manman Ren authored
uint32. When folding branches to common destination, the updated branch weights can exceed uint32 by more than factor of 2. We should keep halving the weights until they can fit into uint32. llvm-svn: 200262
-
- Jan 27, 2014
-
-
Chandler Carruth authored
cold loops as-if they were being optimized for size. Nothing fancy here. Simply test case included. The nice thing is that we can now incrementally build on top of this to drive other heuristics. All of the infrastructure work is done to get the profile information into this layer. The remaining work necessary to make this a fully general purpose loop unroller for very hot loops is to make it a fully general purpose loop unroller. Things I know of but am not going to have time to benchmark and fix in the immediate future: 1) Don't disable the entire pass when the target is lacking vector registers. This really doesn't make any sense any more. 2) Teach the unroller at least and the vectorizer potentially to handle non-if-converted loops. This is trivial for the unroller but hard for the vectorizer. 3) Compute the relative hotness of the loop and thread that down to the various places that make cost tradeoffs (very likely only the unroller makes sense here, and then only when dealing with loops that are small enough for unrolling to not completely blow out the LSD). I'm still dubious how useful hotness information will be. So far, my experiments show that if we can get the correct logic for determining when unrolling actually helps performance, the code size impact is completely unimportant and we can unroll in all cases. But at least we'll no longer burn code size on cold code. One somewhat unrelated idea that I've had forever but not had time to implement: mark all functions which are only reachable via the global constructors rigging in the module as optsize. This would also decrease the impact of any more aggressive heuristics here on code size. llvm-svn: 200219
-
Benjamin Kramer authored
Insert before the terminating instruction of the dominating block instead. llvm-svn: 200218
-
Chandler Carruth authored
to stabilize a test that really is trying to test generic behavior and not a specific target's behavior. llvm-svn: 200215
-
Chandler Carruth authored
object and fewer pointless variables. Also, add a clarifying comment and a FIXME because the code which disables *all* vectorization if we can't use implicit floating point instructions just makes no sense at all. llvm-svn: 200214
-
Chandler Carruth authored
powers of two. This is essentially always the correct thing given the impact on alignment, scaling factors that can be used in addressing modes, etc. Also, fix the management of the unroll vs. small loop cost to more accurately model things with this world. Enhance a test case to actually exercise more of the unroll machinery if using synthetic constants rather than a specific target model. Before this change, with the added flags this test will unroll 3 times instead of either 2 or 4 (the two sensible answers). While I don't expect this to make a huge difference, if there are lots of loops sitting right on the edge of hitting the 'small unroll' factor, they might change behavior. However, I've benchmarked moving the small loop cost up and down in many various ways and by a huge factor (2x) without seeing more than 0.2% code size growth. Small adjustments such as the series that led up here have led to about 1% improvement on some benchmarks, but it is very close to the noise floor so I mostly checked that nothing regressed. Let me know if you see bad behavior on other targets but I don't expect this to be a sufficiently dramatic change to trigger anything. llvm-svn: 200213
-
Chandler Carruth authored
with the unrolling behavior in the loop vectorizer. No functionality changed at this point. These are a bit hack-y, but talking with Hal, there doesn't seem to be a cleaner way to easily experiment with different thresholds here and he was also interested in them so I wanted to commit them. Suggestions for improvement are very welcome here. llvm-svn: 200212
-
Chandler Carruth authored
number of vector registers rather than toggling between vector and scalar register number based on VF. I don't have a test case as I spotted this by inspection and on X86 it only makes a difference if your target is lacking SSE and thus has *no* vector registers. If someone wants to add a test case for this for ARM or somewhere else where this is more significant, that would be awesome. Also made the variable name a bit more sensible while I'm here. llvm-svn: 200211
-
Chandler Carruth authored
LoopVectorize pass. The logic here doesn't make much sense. We *only* unrolled if the unvectorized loop was a reduction loop with a single basic block *and* small loop body. The reduction part in particular doesn't make much sense. Instead, if we just fall through to the vectorized unroll logic it makes more sense of unrolling if there is a vectorized reduction that could be hacked on by the SLP vectorizer *or* if the loop is small. This is mostly a cleanup and nothing in the test suite really exercises this, but I did run benchmarks across this change and saw no really significant changes. llvm-svn: 200198
-
- Jan 25, 2014
-
-
Chandler Carruth authored
a FunctionPass. With this change the loop vectorizer no longer is a loop pass and can readily depend on function analyses. In particular, with this change we no longer have to form a loop pass manager to run the loop vectorizer which simplifies the entire pass management of LLVM. The next step here is to teach the loop vectorizer to leverage profile information through the profile information providing analysis passes. llvm-svn: 200074
-
Chandler Carruth authored
the loops in a function, and teach LICM to work in the presance of LCSSA. Previously, LCSSA was a loop pass. That made passes requiring it also be loop passes and unable to depend on function analysis passes easily. It also caused outer loops to have a different "canonical" form from inner loops during analysis. Instead, we go into LCSSA form and preserve it through the loop pass manager run. Note that this has the same problem as LoopSimplify that prevents enabling its verification -- loop passes which run at the end of the loop pass manager and don't preserve these are valid, but the subsequent loop pass runs of outer loops that do preserve this pass trigger too much verification and fail because the inner loop no longer verifies. The other problem this exposed is that LICM was completely unable to handle LCSSA form. It didn't preserve it and it actually would give up on moving instructions in many cases when they were used by an LCSSA phi node. I've taught LICM to support detecting LCSSA-form PHI nodes and to hoist and sink around them. This may actually let LICM fire significantly more because we put everything into LCSSA form to rotate the loop before running LICM. =/ Now LICM should handle that fine and preserve it correctly. The down side is that LICM has to require LCSSA in order to preserve it. This is just a fact of life for LCSSA. It's entirely possible we should completely remove LCSSA from the optimizer. The test updates are essentially accomodating LCSSA phi nodes in the output of LICM, and the fact that we now completely sink every instruction in ashr-crash below the loop bodies prior to unrolling. With this change, LCSSA is computed only three times in the pass pipeline. One of them could be removed (and potentially a SCEV run and a separate LoopPassManager entirely!) if we had a LoopPass variant of InstCombine that ran InstCombine on the loop body but refused to combine away LCSSA PHI nodes. Currently, this also prevents loop unrolling from being in the same loop pass manager is rotate, LICM, and unswitch. There is one thing that I *really* don't like -- preserving LCSSA in LICM is quite expensive. We end up having to re-run LCSSA twice for some loops after LICM runs because LICM can undo LCSSA both in the current loop and the parent loop. I don't really see good solutions to this other than to completely move away from LCSSA and using tools like SSAUpdater instead. llvm-svn: 200067
-
Juergen Ributzka authored
This reverts commit r200058 and adds the using directive for ARMTargetTransformInfo to silence two g++ overload warnings. llvm-svn: 200062
-
Hans Wennborg authored
This commit caused -Woverloaded-virtual warnings. The two new TargetTransformInfo::getIntImmCost functions were only added to the superclass, and to the X86 subclass. The other targets were not updated, and the warning highlighted this by pointing out that e.g. ARMTTI::getIntImmCost was hiding the two new getIntImmCost variants. We could pacify the warning by adding "using TargetTransformInfo::getIntImmCost" to the various subclasses, or turning it off, but I suspect that it's wrong to leave the functions unimplemnted in those targets. The default implementations return TCC_Free, which I don't think is right e.g. for ARM. llvm-svn: 200058
-
- Jan 24, 2014
-
-
Juergen Ributzka authored
Retry commit r200022 with a fix for the build bot errors. Constant expressions have (unlike instructions) module scope use lists and therefore may have users in different functions. The fix is to simply ignore these out-of-function uses. llvm-svn: 200034
-