- Jun 01, 2020
-
-
Sanjay Patel authored
SimplifyDemandedVectorElts() bails out on ScalableVectorType anyway, but we can exit faster with the external check. Move this to a helper function because there are likely other vector folds that we can try here.
-
Matt Arsenault authored
In this awkward case, we have to emit custom pseudo-constrained FP wrappers. InstrEmitter concludes that since a mayRaiseFPException instruction had a chain, it can't add nofpexcept. Test deferred until mayRaiseFPException is really set on everything.
-
Vedant Kumar authored
This is per Adrian's suggestion in https://reviews.llvm.org/D80684.
-
Vedant Kumar authored
Summary: Instead of iterating over all VarLoc IDs in removeEntryValue(), just iterate over the interval reserved for entry value VarLocs. This changes the iteration order, hence the test update -- otherwise this is NFC.

This appears to give an ~8.5x wall time speed-up for LiveDebugValues when compiling sqlite3.c 3.30.1 with a Release clang (on my machine):

```
         ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
Before:  2.5402 ( 18.8%)   0.0050 (  0.4%)   2.5452 ( 17.3%)   2.5452 ( 17.3%)  Live DEBUG_VALUE analysis
After:   0.2364 (  2.1%)   0.0034 (  0.3%)   0.2399 (  2.0%)   0.2398 (  2.0%)  Live DEBUG_VALUE analysis
```

The change in removeEntryValue() is the only one that appears to affect wall time, but for consistency (and to resolve a pending TODO), I made the analogous changes for iterating over SpillLocKind VarLocs.

Reviewers: nikic, aprantl, jmorse, djtodoro

Subscribers: hiraditya, dexonsmith, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80684
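A rough, self-contained sketch of the interval idea: if entry-value VarLocs are assigned IDs from a reserved contiguous range, a sorted-set query can visit just that range instead of every open ID. The names and the range constants below are invented for illustration and are not the actual LiveDebugValues code.

```cpp
#include <cstdio>
#include <set>

// Hypothetical reserved ID range for "entry value" locations.
constexpr unsigned EntryValBegin = 1u << 20;
constexpr unsigned EntryValEnd   = 2u << 20;

int main() {
  std::set<unsigned> OpenRanges = {5, 42, EntryValBegin + 1, EntryValBegin + 7,
                                   3000000};

  // Instead of scanning every open VarLoc ID, jump straight to the reserved
  // interval with lower_bound (logarithmic to locate, then linear in the
  // number of entry-value IDs only).
  auto It = OpenRanges.lower_bound(EntryValBegin);
  auto End = OpenRanges.lower_bound(EntryValEnd);
  for (; It != End; ++It)
    std::printf("entry-value VarLoc ID: %u\n", *It);
}
```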
-
Matt Arsenault authored
The AMDGPU non-strict fdiv lowering needs to introduce an FP mode switch in some cases, and has custom nodes to provide chain/glue for the intermediate FP operations. We need to propagate nofpexcept here, but getNode was dropping the flags. Adding nofpexcept in the AMDGPU custom lowering is left to a future patch. Also fix a second case where flags were dropped, but in this case it seems it just didn't handle this number of operands. Test will be included in future AMDGPU patch.
-
Hiroshi Yamauchi authored
Summary: The working set size heuristics (ProfileSummaryInfo::hasHugeWorkingSetSize) under partial sample PGO may not be accurate, because the profile is partial and the number of hot profile counters in the ProfileSummary may not reflect the actual working set size of the program being compiled.

To improve this, the (approximated) ratio of the number of profile counters of the program being compiled to the number of profile counters in the partial sample profile is computed (called the partial profile ratio), and the working set size of the profile is scaled by this ratio to reflect the working set size of the program being compiled, which is then used for the working set size heuristics.

The partial profile ratio is approximated based on the number of basic blocks in the program and the NumCounts field in the ProfileSummary, and is computed through the thin LTO indexing. This means there is the limitation that the scaled working set size is available to the thin LTO post link passes only.

Reviewers: davidxl

Subscribers: mgorny, eraman, hiraditya, steven_wu, dexonsmith, arphaman, dang, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D79831
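A rough sketch of the scaling arithmetic described above, under assumed names; the field names and the example numbers are illustrative stand-ins, not the actual ProfileSummaryInfo/ThinLTO code.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical inputs; names are illustrative, not the real LLVM fields.
struct PartialProfileInfo {
  uint64_t NumCountsInProfile;    // counters recorded in the partial profile
  uint64_t NumBlocksInProgram;    // approximation for the program being compiled
  uint64_t ProfileWorkingSetSize; // hot counters per the ProfileSummary
};

// Scale the profile's working set size by the partial profile ratio so the
// heuristic reflects the program being compiled, not just the partial profile.
uint64_t scaledWorkingSetSize(const PartialProfileInfo &I) {
  if (I.NumCountsInProfile == 0)
    return I.ProfileWorkingSetSize;
  double Ratio = double(I.NumBlocksInProgram) / double(I.NumCountsInProfile);
  return uint64_t(I.ProfileWorkingSetSize * Ratio);
}

int main() {
  PartialProfileInfo I{/*NumCountsInProfile=*/1000,
                       /*NumBlocksInProgram=*/4000,
                       /*ProfileWorkingSetSize=*/2500};
  std::printf("scaled working set size: %llu\n",
              (unsigned long long)scaledWorkingSetSize(I));
}
```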
-
Matt Arsenault authored
-
Matt Arsenault authored
-
hsmahesha authored
Summary: While clustering mem ops, the AMDGPU target needs to consider the number of clustered bytes to decide on the max number of mem ops that can be clustered. This patch adds support for passing the number of clustered bytes to the target's mem ops clustering logic.

Reviewers: foad, rampitec, arsenm, vpykhtin, javedabsar

Reviewed By: foad

Subscribers: MatzeB, kzhuravl, jvesely, wdng, nhaehnle, yaxunl, dstuttard, tpr, t-tye, hiraditya, javed.absar, kerbowa, llvm-commits

Tags: #llvm

Differential Revision: https://reviews.llvm.org/D80545
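A minimal, self-contained sketch of the kind of decision the message describes, where the byte count participates alongside the op count. The function name and the limits are hypothetical stand-ins, not the actual TargetInstrInfo clustering hook.

```cpp
#include <cstdio>

// Hypothetical limits for illustration only.
constexpr unsigned MaxClusterSize = 4;    // max mem ops per cluster
constexpr unsigned MaxClusterBytes = 32;  // max total bytes per cluster

// Decide whether one more mem op may join the current cluster, taking the
// running byte count into account, not just the op count.
bool shouldAddToCluster(unsigned NumClusteredOps, unsigned NumClusteredBytes,
                        unsigned NewOpBytes) {
  return NumClusteredOps + 1 <= MaxClusterSize &&
         NumClusteredBytes + NewOpBytes <= MaxClusterBytes;
}

int main() {
  // Three 8-byte loads already clustered (24 bytes): a 16-byte load would
  // exceed the byte budget even though the op count still allows it.
  std::printf("%d\n", shouldAddToCluster(3, 24, 16)); // 0
  std::printf("%d\n", shouldAddToCluster(3, 24, 8));  // 1
}
```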
-
Matt Arsenault authored
The alignment value also needs to be scaled by the wave size.
-
Stanislav Mekhanoshin authored
Differential Revision: https://reviews.llvm.org/D79218
-
Sam Clegg authored
simd-2.C now compiles thanks to: https://github.com/WebAssembly/wasi-libc/pull/183 Differential Revision: https://reviews.llvm.org/D80930
-
Sanjay Patel authored
As discussed in https://bugs.llvm.org/show_bug.cgi?id=45951 and D80584, the name 'tmp' is almost always a bad choice, but we have a legacy of regression tests with that name because it was baked into utils/update_test_checks.py. This change makes -instnamer more consistent (it already uses "arg" and "bb", the common LLVM shorthand), and it avoids the conflict of telling users of the FileCheck script to run -instnamer to create a better regression test, only to have that cause a warn/fail in update_test_checks.py.
-
Ehud Katz authored
-
James Henderson authored
Reviewed by: clayborg, dblaikie, labath Differential Revision: https://reviews.llvm.org/D80799
-
James Henderson authored
This will ensure that nothing can ever start parsing data from a future sequence and part-read data will be returned as 0 instead. Reviewed by: aprantl, labath Differential Revision: https://reviews.llvm.org/D80796
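A self-contained sketch of the guarantee described above, using invented names; the real change lives in the DWARF debug line parser, but the idea is that a read confined to one sequence never pulls bytes from the data that follows it, and a truncated value comes back as 0.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical reader confined to one sequence's byte range.
struct SequenceReader {
  const std::vector<uint8_t> &Data;
  size_t Pos;
  size_t SequenceEnd; // first byte past the current sequence

  // Read a little-endian u32, never looking past the end of the sequence.
  // A partial (truncated) value is returned as 0 instead of being completed
  // with bytes from whatever comes next.
  uint32_t readU32() {
    if (Pos + 4 > SequenceEnd) {
      Pos = SequenceEnd; // consume the partial data
      return 0;
    }
    uint32_t V = 0;
    for (int I = 0; I < 4; ++I)
      V |= uint32_t(Data[Pos + I]) << (8 * I);
    Pos += 4;
    return V;
  }
};

int main() {
  std::vector<uint8_t> Bytes = {1, 0, 0, 0, 2, 0}; // second u32 is truncated
  SequenceReader R{Bytes, 0, Bytes.size()};
  std::printf("%u %u\n", R.readU32(), R.readU32()); // prints "1 0"
}
```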
-
Simon Pilgrim authored
They are implicitly included in TargetFrameLowering.h and only ever used in TargetFrameLowering override methods.
-
Igor Kudrin authored
For most tables, we already use commas in headers. This set of patches unifies dumping the remaining ones. Differential Revision: https://reviews.llvm.org/D80806
-
Igor Kudrin authored
For most tables, we already use commas in headers. This set of patches unifies dumping the remaining ones. Differential Revision: https://reviews.llvm.org/D80806
-
Igor Kudrin authored
For most tables, we already use commas in headers. This set of patches unifies dumping the remaining ones. Differential Revision: https://reviews.llvm.org/D80806
-
Ehud Katz authored
This is a reimplementation of the `orderNodes` function, as the old implementation didn't take all cases into account. The new implementation uses SCCs instead of Loops to account for irreducible loops.

Fixes PR41509.

Differential Revision: https://reviews.llvm.org/D79037
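For context, a minimal, self-contained SCC computation in the spirit of what "ordering by SCCs" relies on. This is generic Tarjan over a toy graph, not the actual orderNodes code (LLVM itself provides an scc_iterator utility for this kind of traversal); the point is that SCCs group an irreducible cycle even though it is not a single-entry natural loop.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Hypothetical graph: adjacency list indexed by node id.
struct Graph {
  std::vector<std::vector<int>> Succs;
};

// Tarjan's algorithm: emits SCCs in reverse topological order of the
// condensation graph.
struct TarjanSCC {
  const Graph &G;
  std::vector<int> Index, Low;
  std::vector<bool> OnStack;
  std::vector<int> Stack;
  std::vector<std::vector<int>> SCCs;
  int Counter = 0;

  explicit TarjanSCC(const Graph &G)
      : G(G), Index(G.Succs.size(), -1), Low(G.Succs.size(), 0),
        OnStack(G.Succs.size(), false) {
    for (int N = 0; N < (int)G.Succs.size(); ++N)
      if (Index[N] == -1)
        visit(N);
  }

  void visit(int N) {
    Index[N] = Low[N] = Counter++;
    Stack.push_back(N);
    OnStack[N] = true;
    for (int S : G.Succs[N]) {
      if (Index[S] == -1) {
        visit(S);
        Low[N] = std::min(Low[N], Low[S]);
      } else if (OnStack[S]) {
        Low[N] = std::min(Low[N], Index[S]);
      }
    }
    if (Low[N] == Index[N]) {
      SCCs.emplace_back();
      int M;
      do {
        M = Stack.back();
        Stack.pop_back();
        OnStack[M] = false;
        SCCs.back().push_back(M);
      } while (M != N);
    }
  }
};

int main() {
  // Irreducible region: the cycle {1, 2} has two entry points from node 0.
  Graph G;
  G.Succs = {{1, 2}, {2}, {1, 3}, {}};
  TarjanSCC T(G);
  // SCC grouping still captures {1, 2} as one unit.
  for (const auto &SCC : T.SCCs) {
    for (int N : SCC)
      std::printf("%d ", N);
    std::printf("\n");
  }
}
```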
-
Tim Northover authored
When a stack offset was too big to materialize in a single instruction, we were trying to do it in stages:

```
adds xD, sp, #imm
adds xD, xD, #imm
```

Unfortunately, if xD is xzr then the second instruction doesn't exist and wouldn't do what was needed if it did. Instead we can use a temporary register for all but the last addition.
-
Chen Zheng authored
-
Li Rong Yi authored
Summary: Exploit vabsd* for absolute difference of vectors on P9. For example:

```
void foo (char *restrict p, char *restrict q, char *restrict t)
{
  for (int i = 0; i < 16; i++)
    t[i] = abs (p[i] - q[i]);
}
```

This case should be matched to the HW instruction vabsdub.

Reviewed By: steven.zhang

Differential Revision: https://reviews.llvm.org/D80271
-
Matt Arsenault authored
-
- May 31, 2020
-
-
Craig Topper authored
Previously we walked the users of any vector binop looking for more binops with the same opcode or phis that eventually ended up in a reduction. While this is simple, it also means visiting the same nodes many times, since we do a forward walk for each BinaryOperator in the chain. It was also far more general than what we have tests for or expect to see.

This patch replaces the algorithm with a new method that starts at extract elements looking for a horizontal reduction. Once we find a reduction, we walk backwards through phis and adds to collect leaves that we can consider for rewriting. We only consider single-use Adds and Phis, except for a special case where the Add is used by a Phi that forms a loop back to the Add; other single-use Adds are included along the way to support unrolled loops.

Ultimately, I want to narrow the Adds, Phis, and final reduction based on the partial reduction we're doing. I still haven't figured out exactly what that looks like yet, but restricting the types of graphs we expect to handle seemed like a good first step, as does having all the leaves and the reduction at once.

Differential Revision: https://reviews.llvm.org/D79971
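A highly simplified sketch of the backward-collection idea over a toy def-use graph; the node structure and names are invented stand-ins for the actual X86 code, and the loop-phi special case is omitted.

```cpp
#include <cstdio>
#include <vector>

// Toy IR node: an Add or Phi with operands (defs it reads) and a use count.
struct Node {
  enum Kind { Add, Phi, Leaf } K;
  std::vector<Node *> Operands;
  unsigned NumUses = 0;
};

// Walk backwards from the value feeding the reduction, looking through
// single-use Adds and Phis, collecting the leaves that could be rewritten.
void collectLeaves(Node *Root, std::vector<Node *> &Leaves) {
  std::vector<Node *> Worklist{Root};
  while (!Worklist.empty()) {
    Node *N = Worklist.back();
    Worklist.pop_back();
    bool LookThrough =
        (N->K == Node::Add || N->K == Node::Phi) && N->NumUses == 1;
    if (!LookThrough) {
      Leaves.push_back(N); // anything we don't look through is a leaf
      continue;
    }
    for (Node *Op : N->Operands)
      Worklist.push_back(Op);
  }
}

int main() {
  Node A{Node::Leaf, {}, 1}, B{Node::Leaf, {}, 1};
  Node Add1{Node::Add, {&A, &B}, 1};
  Node C{Node::Leaf, {}, 1};
  Node Add2{Node::Add, {&Add1, &C}, 1}; // feeds the reduction
  std::vector<Node *> Leaves;
  collectLeaves(&Add2, Leaves);
  std::printf("%zu leaves\n", Leaves.size()); // 3
}
```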
-
Simon Pilgrim authored
-
Simon Pilgrim authored
This matches what we do for the full sized vector ops at the start of combineX86ShufflesRecursively, and helps getFauxShuffleMask extract more INSERT_SUBVECTOR patterns.
-
Matt Arsenault authored
I inverted the mask when I ported to the new form of G_PTRMASK in 8bc03d21. I don't think this really broke anything, since G_VASTART isn't handled for types with an alignment higher than the stack alignment.
-
Simon Pilgrim authored
As suggested on D79987.
-
Simon Pilgrim authored
Try to prevent future node creation issues (as detailed in PR45974) by making the SelectionDAG reference const, so it can still be used for analysis, but not node creation.
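A tiny illustration of the const-reference idea, with stand-in types rather than the real SelectionDAG API: passing by const reference keeps analysis available while turning accidental node creation into a compile error.

```cpp
// Stand-in for a DAG that can both be inspected and mutated.
struct Dag {
  int NumNodes = 0;
  int nodeCount() const { return NumNodes; } // analysis: fine on const
  void createNode() { ++NumNodes; }          // mutation: non-const
};

// Taking the DAG by const reference lets the helper analyze it but makes
// node creation impossible from inside this function.
int analyzeOnly(const Dag &DAG) {
  // DAG.createNode(); // would not compile: createNode() is non-const
  return DAG.nodeCount();
}

int main() {
  Dag D;
  D.createNode();
  return analyzeOnly(D) == 1 ? 0 : 1;
}
```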
-
Simon Pilgrim authored
Don't create nodes on the fly when decoding INSERT_SUBVECTOR as faux shuffles.
-
Simon Pilgrim authored
As detailed on PR45974 and D79987, getFauxShuffleMask is creating nodes on the fly to create shuffles with inputs the same size as the result, causing problems for hasOneUse() checks in later simplification stages. Currently only combineX86ShufflesRecursively benefits from these widened inputs so I've begun moving the functionality there, and out of getFauxShuffleMask. This allows us to remove the widening from VBROADCAST and *EXTEND* faux shuffle cases. This just leaves the INSERT_SUBVECTOR case in getFauxShuffleMask still creating nodes, which will require more extensive refactoring.
-
Florian Hahn authored
In some cases ScheduleDAGRRList has to add new nodes to resolve problems with interfering physical registers. When new nodes are added, it completely re-computes the topological order, which can take a long time but is unnecessary: we only add nodes one by one, and initially they do not have any predecessors, so we can just insert them at the end of the vector. Later we add predecessors, but the helper function properly updates the topological order much more efficiently.

With this change, the compile time for the program below drops from 300s to 30s on my machine.

```
define i11129 @test1() {
  %L1 = load i11129, i11129* undef
  %B30 = ashr i11129 %L1, %L1
  store i11129 %B30, i11129* undef
  ret i11129 %L1
}
```

This should be generally beneficial, as we can skip a large amount of work. Theoretically there are some scenarios where we might not save much, e.g. when we add a dependency between the first and last node; then we would have to shift all nodes. But we still do not have to spend the time re-computing the initial order.

Reviewers: MatzeB, atrick, efriedma, niravd, paquette

Reviewed By: paquette

Differential Revision: https://reviews.llvm.org/D59722
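A small sketch of why appending is safe here, with an invented data structure rather than the real ScheduleDAG topological-order class: a node with no predecessors can go at the end of any existing topological order without violating it.

```cpp
#include <cstdio>
#include <vector>

// Invented stand-in for a maintained topological order of DAG nodes.
struct TopoOrder {
  std::vector<int> Order;   // node ids in topological order
  std::vector<int> IndexOf; // position of each node in Order

  // A node that currently has no predecessors depends on nothing, so
  // appending it keeps the order valid and avoids recomputing from scratch.
  void addPredecessorFreeNode(int Node) {
    if ((int)IndexOf.size() <= Node)
      IndexOf.resize(Node + 1, -1);
    IndexOf[Node] = (int)Order.size();
    Order.push_back(Node);
  }
};

int main() {
  TopoOrder T;
  for (int N = 0; N < 3; ++N)
    T.addPredecessorFreeNode(N);
  // Edges added later would be handled by an incremental update (the
  // "helper function" mentioned in the commit message), not a full rebuild.
  for (int N : T.Order)
    std::printf("%d ", N);
  std::printf("\n");
}
```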
-
Jay Foad authored
Differential Revision: https://reviews.llvm.org/D80813
-
Changpeng Fang authored
Reviewers: rampitec, arsenm Differential Revision: https://reviews.llvm.org/D80853
-
Craig Topper authored
The types already match so TableGen is removing the bitconvert.
-
Craig Topper authored
[X86] Add DAG combine to turn (v2i64 (scalar_to_vector (i64 (bitconvert (mmx))))) to MOVQ2DQ. Remove unneeded isel patterns. We already had a DAG combine for (mmx (bitconvert (i64 (extractelement v2i64)))) to MOVDQ2Q. Remove patterns for MMX_MOVQ2DQrr/MMX_MOVDQ2Qrr that use scalar_to_vector/extractelement involving i64 scalar type with v2i64 and x86mmx.
-
Craig Topper authored
This code was repeated in two callers of CommitTargetLoweringOpt. But CommitTargetLoweringOpt is also called from TargetLowering, and we should print a message for those calls too. So sink the repeated code into CommitTargetLoweringOpt to catch those calls.
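A generic illustration of the "sink repeated code into the callee" refactor; the names below are made up and do not reflect the actual DAGCombiner code.

```cpp
#include <cstdio>

// Before: every caller duplicated the debug print around commit().
// After: the print lives inside commit(), so call sites that previously
// lacked it (e.g. calls made from another component) get it for free.
struct Committer {
  void commit(const char *What) {
    std::printf("committing: %s\n", What); // sunk from the callers
    // ... actual commit work ...
  }
};

void callerA(Committer &C) { C.commit("from caller A"); }
void callerB(Committer &C) { C.commit("from caller B"); }

int main() {
  Committer C;
  callerA(C);
  callerB(C);
}
```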
-
Craig Topper authored
-