- Sep 21, 2021
-
Sanjay Patel authored
-
Dmitry Preobrazhensky authored
Differential Revision: https://reviews.llvm.org/D109614
-
Jonas Paulsson authored
SystemZ adds the EXRL target instructions at the end of each file. This must be done before debug info emission, since that may end the text section, and therefore this is now done in emitConstantPools() (instead of in emitEndOfAsmFile). Review: Ulrich Weigand Differential Revision: https://reviews.llvm.org/D109513
-
Florian Hahn authored
-
Nicholas Guy authored
Enables the FuseAddress feature in the Cortex-A55 scheduling model. Differential Revision: https://reviews.llvm.org/D109323
-
Simon Pilgrim authored
If getAggregateElement() returns null for any element, early out, as otherwise we will assert when creating a new constant vector. Fixes PR51824; OSS-Fuzz: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=38057
-
Jay Foad authored
FMA_W_CHAIN is used when lowering fdiv f32. Prefer to select it to fmac if there are no source modifiers, just like we do for other mad/mac and fma/fmac cases. Differential Revision: https://reviews.llvm.org/D110074
-
Jay Foad authored
v_fmac with source modifiers forces VOP3 encoding, but it is strictly better to use the VOP3-only v_fma instead, because $dst and $src2 are not tied so it gives the register allocator more freedom and avoids a copy in some cases. This is the same strategy we already use for v_mad vs v_mac and v_fma_legacy vs v_fmac_legacy. Differential Revision: https://reviews.llvm.org/D110070
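For illustration (hand-written, not from the commit), "source modifiers" here means things like negation or absolute value applied to an operand; a minimal IR sketch:

  define float @example(float %a, float %b, float %c) {
    ; No source modifiers: a v_fmac (dst tied to src2) is fine.
    %r0 = call float @llvm.fma.f32(float %a, float %b, float %c)
    ; fneg is a source modifier; selecting the VOP3 v_fma avoids the tied
    ; operand and a possible extra copy.
    %nb = fneg float %b
    %r1 = call float @llvm.fma.f32(float %a, float %nb, float %r0)
    ret float %r1
  }
  declare float @llvm.fma.f32(float, float, float)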
-
David Green authored
-
David Stenberg authored
This fixes PR51730, a heap-use-after-free bug in replaceConditionalBranchesOnConstant(). With the attached reproducer we were left with a function looking something like this after replaceAndRecursivelySimplify():

  [...]
  cont2.i:
    br i1 %.not1.i, label %handler.type_mismatch3.i, label %cont4.i
  handler.type_mismatch3.i:
    %3 = phi i1 [ %2, %cont2.thread.i ], [ false, %cont2.i ]
    unreachable
  cont4.i:
    unreachable
  [...]

with both the branch instruction and PHI node being in the worklist. As a result of replacing the branch instruction with an unconditional branch, the PHI node in %handler.type_mismatch3.i would be removed. This then resulted in a heap-use-after-free bug due to accessing that removed PHI node in the next worklist iteration. This is solved by using a value handle worklist. I am unsure if this is the most idiomatic solution. Another solution could have been to produce a worklist containing just the interesting branch instructions, but I thought it was perhaps a bit cleaner to keep all worklist filtering in the loop that does the rewrites. Reviewed By: lebedev.ri Differential Revision: https://reviews.llvm.org/D109221
-
Amara Emerson authored
This is motivated by a pathological compile-time issue during unmerge combining. We should be able to use the AVF to do simplification. However, AMDGPU has a lot of codegen changes which I'm not sure how to evaluate. Differential Revision: https://reviews.llvm.org/D109748
-
Amara Emerson authored
For artifacts excluding G_TRUNC/G_SEXT, which have IR counterparts, we don't seem to have debug users of defs. However, in the legalizer we're always calling MachineInstr::eraseFromParentAndMarkDBGValuesForRemoval() which is expensive. In some rare cases, this contributes significantly to unreasonably long compile times when we have lots of artifact combiner activity. To verify this, I added asserts to that function when it actually replaced a debug use operand with undef for these artifacts. On CTMark with both -O0 and -Os and debug info enabled, I didn't see a single case where it triggered. In my measurements I saw around a 0.5% geomean compile-time improvement on -g -O0 for AArch64 with this change. Differential Revision: https://reviews.llvm.org/D109750
-
Max Kazantsev authored
The implication logic for two values that are both negative or non-negative says that it doesn't matter whether their predicate is signed or unsigned, but it only flips unsigned into signed for further inference. This patch adds support for flipping a signed predicate into unsigned as well. Differential Revision: https://reviews.llvm.org/D109959 Reviewed By: nikic
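A small hand-written illustration of the equivalence this builds on (not from the patch):

  define void @example(i32 %a, i32 %b) {
    ; Assume %a and %b are both known non-negative (sign bit clear); then
    ; signed and unsigned compares agree, so proving one implies the other:
    %c1 = icmp slt i32 %a, %b
    %c2 = icmp ult i32 %a, %b
    ; The same equivalence holds when both values are negative, which is what
    ; allows flipping a signed predicate into an unsigned one.
    ret void
  }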
-
Yonghong Song authored
In llvm, for non-alu32 mode, the stack alignment is 64bit, so there is only one 64bit spill per 64bit slot. For alu32 mode, the stack alignment is 32bit, so it is possible to have two 32bit spills per 64bit slot. Currently, the bpf kernel verifier does not preserve register states for 32bit spills. That is, a 32bit register may hold a constant value or a bounded range before the spill; after reload from the stack, the information is lost and sometimes this may cause verifier failure. For 64bit register spills, the verifier does try to preserve the register state for reloading. The current verifier can be modestly changed to handle one 32bit spill per 64bit stack slot with state-preserving reload. Handling two 32bit spills per 64bit stack slot would require substantial changes. This patch changes the stack alignment for alu32 to be 64bit. This way, for any 64bit slot in alu32 mode, only one 32bit or 64bit register value can be saved. Together with the previously mentioned verifier enhancement, 32bit spills can be handled with state preserving. Note that llvm stack slot coalescing seems to only do adjacent packing, which may leave some holes in the stack. For example:

  stack slot 8   <== 8 bytes
  stack slot 4   <== 8 bytes with 4 byte hole
  stack slot 8   <== 8 bytes
  stack slot 4   <== 4 bytes

Differential Revision: https://reviews.llvm.org/D109073
-
Max Kazantsev authored
When following a case of a switch instruction is guaranteed to lead to UB, we can safely break these edges and redirect those cases into a newly created unreachable block. As a result, the CFG becomes simpler and we can remove some Phi inputs, making further analyses easier. Patch by Dmitry Bakunevich! Differential Revision: https://reviews.llvm.org/D109428 Reviewed By: lebedev.ri
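A hand-written sketch of the shape of the transform (not taken from the patch):

  define void @example(i32 %x) {
  entry:
    ; Before, case 1 branched to a block guaranteed to reach UB; after the
    ; transform the edge is redirected into a fresh unreachable block, and
    ; any phis in the old destination drop the corresponding input.
    switch i32 %x, label %exit [
      i32 0, label %ok
      i32 1, label %unreachable.new
    ]
  ok:
    br label %exit
  unreachable.new:
    unreachable
  exit:
    ret void
  }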
-
Usman Nadeem authored
Differential Revision: https://reviews.llvm.org/D109808 Change-Id: I1a10d2bc33acbe0ea353c6cb3d077851391fe73e
-
Amara Emerson authored
For x86 Darwin, we have a stack checking feature which re-uses some of this machinery around stack probing on Windows. Renaming this to be more appropriate for a generic feature. Differential Revision: https://reviews.llvm.org/D109993
-
- Sep 20, 2021
-
Amara Emerson authored
This attribute calls a function instead of emitting a trap instruction. Differential Revision: https://reviews.llvm.org/D110098
-
Craig Topper authored
These are cases where the splat is in another basic block. CGP needs to sink it to expose the opportunity to SelectionDAG.
-
Paul Robinson authored
-
Craig Topper authored
If either of the multiplicands is a splat, we can sink it to use vfmacc.vf or similar.
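A rough hand-written illustration of the pattern (not from the commit, and simplified to fixed-width vectors):

  define void @example(ptr %p, float %x, i64 %n) {
  entry:
    %head = insertelement <4 x float> poison, float %x, i32 0
    %splat = shufflevector <4 x float> %head, <4 x float> poison, <4 x i32> zeroinitializer
    br label %loop
  loop:
    %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
    %addr = getelementptr inbounds float, ptr %p, i64 %i
    %v = load <4 x float>, ptr %addr, align 4
    ; %splat is defined in the entry block; sinking it into this block lets
    ; SelectionDAG fold the scalar %x directly into a .vf form such as vfmacc.vf.
    %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %v, <4 x float> %splat, <4 x float> %v)
    store <4 x float> %r, ptr %addr, align 4
    %i.next = add i64 %i, 4
    %done = icmp uge i64 %i.next, %n
    br i1 %done, label %exit, label %loop
  exit:
    ret void
  }
  declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)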
-
Craig Topper authored
This is another case of a splat being in another basic block preventing SelectionDAG from optimizing it.
-
Nikita Popov authored
We implement logic to convert a byte offset into a sequence of GEP indices for that offset in a number of places. This patch adds a DataLayout::getGEPIndicesForOffset() method, which implements the core logic. I've updated SROA, ConstantFolding and InstCombine to use it, and there are a few more places where it looks relevant. Differential Revision: https://reviews.llvm.org/D110043
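For intuition, a hand-written example of the conversion being centralized (not from the patch):

  %struct.S = type { i32, [4 x i32], i64 }
  define ptr @example(ptr %s) {
    ; Byte offset 12 into %struct.S (assuming a typical layout: the i32 at
    ; offset 0, the array at offset 4) becomes the index chain 0, 1, 2:
    %p = getelementptr inbounds %struct.S, ptr %s, i64 0, i32 1, i64 2
    ret ptr %p
  }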
-
Craig Topper authored
-
Craig Topper authored
[RISCV] Add test cases showing failure to use .vf vector operations when splat is in another basic block. NFC. We should have CGP copy the splats into the same basic block as the FP operation so that SelectionDAG can fold them.
-
Craig Topper authored
For strided accesses the loop vectorizer seems to prefer creating a vector induction variable with a start value of the form <i32 0, i32 1, i32 2, ...>. This value will be incremented each loop iteration by a splat constant equal to the length of the vector. Within the loop, arithmetic using splat values will be done on this vector induction variable to produce indices for a vector GEP. This pass attempts to dig through the arithmetic back to the phi to create a new scalar induction variable and a stride. We push all of the arithmetic out of the loop by folding it into the start, step, and stride values. Then we create a scalar GEP to use as the base pointer for a strided load or store using the computed stride. Loop strength reduce will run after this pass and can do some cleanups to the scalar GEP and induction variable. Reviewed By: frasercrmck Differential Revision: https://reviews.llvm.org/D107790
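Roughly the input pattern described, in a hand-written, simplified form (not from the commit):

  define void @example(ptr %base, i64 %n) {
  entry:
    br label %vector.body
  vector.body:
    %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
    %vec.ind = phi <4 x i64> [ <i64 0, i64 1, i64 2, i64 3>, %entry ], [ %vec.ind.next, %vector.body ]
    ; Arithmetic on the vector induction variable (here: stride 5) feeds a
    ; vector GEP and a gather; the pass digs back to the phi and rewrites this
    ; as a scalar base pointer plus a stride suitable for a strided load.
    %idx = mul <4 x i64> %vec.ind, <i64 5, i64 5, i64 5, i64 5>
    %ptrs = getelementptr inbounds i32, ptr %base, <4 x i64> %idx
    %g = call <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr> %ptrs, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> poison)
    %vec.ind.next = add <4 x i64> %vec.ind, <i64 4, i64 4, i64 4, i64 4>
    %index.next = add i64 %index, 4
    %done = icmp eq i64 %index.next, %n
    br i1 %done, label %exit, label %vector.body
  exit:
    ret void
  }
  declare <4 x i32> @llvm.masked.gather.v4i32.v4p0(<4 x ptr>, i32 immarg, <4 x i1>, <4 x i32>)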
-
Nikita Popov authored
Verify that !noalias, !alias.scope and llvm.experimental.noalias.scope arguments have the format specified in https://llvm.org/docs/LangRef.html#noalias-and-alias-scope-metadata. I've fixed up a lot of broken metadata used by tests in advance. Using a scope instead of the expected scope list is an especially common mistake. Differential Revision: https://reviews.llvm.org/D110026
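A small hand-written example of the expected format (per the LangRef section linked above, abridged): scopes live in a domain, and !alias.scope / !noalias must reference a list of scopes, not a bare scope.

  define void @example(ptr %p, ptr %q) {
    %v = load float, ptr %p, !alias.scope !2
    store float %v, ptr %q, !noalias !2
    ret void
  }
  !0 = !{!0}      ; a scope domain
  !1 = !{!1, !0}  ; a scope in that domain
  !2 = !{!1}      ; a scope *list* containing !1 -- instructions must reference
                  ; a list like this, not the bare scope !1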
-
Florian Hahn authored
Adds additional tests following comments from D109844. Also removes unused in.ptr arguments and places in the call tests that used loads instead of a getval call.
-
Alexey Bataev authored
Reworked the reordering algorithm. Originally, the compiler just tried to detect the most common order in the reorderable nodes (loads, stores, extractelements, extractvalues) and then fully rebuilt the graph in the best order. This was not efficient, since it required extra memory and time for building/rebuilding the tree and doubled the use of the scheduling budget, which could lead to missed vectorization due to exhausted scheduling resources.

The patch provides a 2-way approach to the graph reordering problem. First, all reordering is done in place; it does not require tree deletion/rebuilding, it just rotates the scalars/orders/reuses masks in the graph node. The first step (top-to-bottom) rotates the whole graph, similarly to the previous implementation. The compiler counts the number of the most-used orders of the graph nodes with the same vectorization factor and then rotates the subgraph with the given vectorization factor to the most-used order, if it is not empty. Then it repeats the same procedure for the subgraphs with the smaller vectorization factor. We can do this because we still need to reshuffle the smaller subgraph when building operands for the graph nodes with a larger vectorization factor; we can rotate just the subgraph, not the whole graph. The second step (bottom-to-top) scans through the leaves and tries to detect the users of the leaves which can be reordered. If the leaves can be reordered in the best fashion, they are reordered and their users too. This allows removing double shuffles to the same ordering of the operands in many cases and just reordering the user operations instead. Plus, it moves the final shuffles closer to the top of the graph and in many cases allows removing an extra shuffle, because the same procedure is repeated again and we can again merge some reordering masks and reorder user nodes instead of the operands.

Also, the patch improves the cost model for gathering of loads, which improves the x264 benchmark in some cases. Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264, +3% for 508.namd, and improves most other benchmarks. The compile and link times are almost the same, though in some cases they should be better (we're not doing extra instruction scheduling anymore), and we may vectorize more code for large basic blocks again because of the saved scheduling budget. Differential Revision: https://reviews.llvm.org/D105020
-
David Sherwood authored
In ValueTracking.cpp we use a function called computeKnownBitsFromOperator to determine the known bits of a value. For the vscale intrinsic, if the function carries the vscale_range attribute, we can use the maximum and minimum values of vscale to determine some known zero and one bits. This should help to improve code quality by allowing certain optimisations to take place. Tests added here: Transforms/InstCombine/icmp-vscale.ll Differential Revision: https://reviews.llvm.org/D109883
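A hand-written example of the kind of fact this exposes (my own, not from the patch):

  define i1 @f() vscale_range(1,16) {
    ; With vscale known to be in [1, 16], the high bits of %vs are known zero,
    ; so a compare like this can fold to false.
    %vs = call i64 @llvm.vscale.i64()
    %c = icmp ugt i64 %vs, 1024
    ret i1 %c
  }
  declare i64 @llvm.vscale.i64()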
-
Jay Foad authored
-
David Green authored
The vectorizer can sometimes make reverse shuffles from indices that count down. In MVE, we don't have a 128bit rev instruction, but we can select this to a VREV64 with some lane movs to swap the two halves. Ideally this would use VMOVD's, but only gets as far as VMOVS's at the moment. Differential Revision: https://reviews.llvm.org/D69510
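For reference, the kind of shuffle meant here (hand-written example):

  define <4 x i32> @example(<4 x i32> %v) {
    ; Reverse all lanes of a 128-bit vector; on MVE this can be selected to a
    ; VREV64 plus moves that swap the two halves.
    %rev = shufflevector <4 x i32> %v, <4 x i32> poison, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
    ret <4 x i32> %rev
  }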
-
Alex Richardson authored
Previously the script emitted output using plain CHECK directives. This can result in a test passing even if there are some instructions between CHECK directives that should have been removed. It also makes debugging tests that have the output in a different order more difficult since FileCheck can match with a later line and then complain about the "wrong" directive not being found. This will cause quite large diffs when updating existing tests, but I'm not sure we need an opt-in flag here. Depends on D109765 (pre-commit tests) Reviewed By: MaskRay Differential Revision: https://reviews.llvm.org/D109767
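An abridged, hand-written sketch of the difference (assuming a typical generated AArch64 test; not taken from the commit):

  ; CHECK-LABEL: foo:
  ; CHECK:         add w0, w0, #1
  ; CHECK-NEXT:    ret
  ; With a plain CHECK on the ret line, an unexpected extra instruction emitted
  ; between the add and the ret would still pass; CHECK-NEXT rejects it.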
-
Alex Richardson authored
Differential Revision: https://reviews.llvm.org/D109765
-
Petar Avramovic authored
Add eraseInstr(s) utility functions. Before deleting an instruction, they collect its use instructions; after deletion, they delete those use instructions that became trivially dead. This patch clears all dead instructions in existing legalizer mir tests. Differential Revision: https://reviews.llvm.org/D109154
-
Tim Northover authored
v8.4 says that normal 128-bit loads/stores are single-copy atomic if they're properly aligned (which all LLVM atomics are), so we no longer need to do a full RMW operation to guarantee we get a clean read.
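For illustration (my own example, not from the commit): an aligned 128-bit atomic load that, from v8.4 onwards, no longer needs a load/store-exclusive loop.

  define i128 @example(ptr %p) {
    ; Properly aligned 128-bit atomic load; a plain paired load suffices
    ; instead of an exclusive-access RMW sequence.
    %val = load atomic i128, ptr %p monotonic, align 16
    ret i128 %val
  }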
-
David Spickett authored
This reverts commit 734708e0. Due to build failures on the 2 stage SVE VLS bot. https://lab.llvm.org/buildbot/#/builders/176/builds/908/steps/11/logs/stdio
-
Max Kazantsev authored
All IndVars transforms have a prerequisite requirement of LCSSA and LoopSimplify form and rely on it. Added a test that shows that this actually holds.
-
Max Kazantsev authored
This reverts commit 6fec6552. The patch was reverted on the incorrect claim that it may break LCSSA form when the loop is not in simplified form. All IndVars transforms ensure that the loop is in simplified and LCSSA form, so if it wasn't broken before this transform, it will not be broken after it either.
-
Max Kazantsev authored
There is a piece of logic that uses the fact that signed and unsigned versions of the same predicate are equivalent when both values are non-negative. It's also true when both of them are negative. Differential Revision: https://reviews.llvm.org/D109957 Reviewed By: nikic
-