Commits · b9d01aa29e5d0aa433c2fc62ace709fe69c45ceb · Lorenzo Albano / LLVM bpEVL

Jul 11, 2018

[Power9] Add remaining __flaot128 builtin support for FMA round to odd · b9d01aa2

Stefan Pintilie authored Jul 11, 2018

Implement this as it is done on GCC:

__float128 a, b, c, d;
a = __builtin_fmaf128_round_to_odd (b, c, d);         // generates xsmaddqpo
a = __builtin_fmaf128_round_to_odd (b, c, -d);        // generates xsmsubqpo
a = - __builtin_fmaf128_round_to_odd (b, c, d);       // generates xsnmaddqpo
a = - __builtin_fmaf128_round_to_odd (b, c, -d);      // generates xsnmsubpqp

Differential Revision: https://reviews.llvm.org/D48218

llvm-svn: 336754

b9d01aa2

Jul 09, 2018

[Power9] Add __float128 builtins for Rounding Operations · 133acb22

Stefan Pintilie authored Jul 09, 2018

Added __float128 support for a number of rounding operations:

trunc
rint
nearbyint
round
floor
ceil

Differential Revision: https://reviews.llvm.org/D48415

llvm-svn: 336601

133acb22

[Power9] [LLVM] Add __float128 support for trunc to double round to odd · 58e3e0a8

Stefan Pintilie authored Jul 09, 2018

Add support for this builtin:
double builtin_truncf128_round_to_odd(float128)

Differential Revision: https://reviews.llvm.org/D48483

llvm-svn: 336595

58e3e0a8

[Power9] Add __float128 builtins for Round To Odd · 83a5fe14

Stefan Pintilie authored Jul 09, 2018

GCC has builtins for these round to odd instructions:

__float128 __builtin_sqrtf128_round_to_odd (__float128)
__float128 __builtin_{add,sub,mul,div}f128_round_to_odd (__float128, __float128)
__float128 __builtin_fmaf128_round_to_odd (__float128, __float128, __float128)

Differential Revision: https://reviews.llvm.org/D47550

llvm-svn: 336578

83a5fe14

[Power9] Add __float128 support for compare operations · 3d76326d

Stefan Pintilie authored Jul 09, 2018

Added handling for the select f128.

Differential Revision: https://reviews.llvm.org/D48294

llvm-svn: 336548

3d76326d

Jul 06, 2018

[Power9] Add __float128 library call for frem · b351f09c

Stefan Pintilie authored Jul 06, 2018

Power 9 does not have a hardware instruction for frem but we can call fmodf128.

Differential Revision: https://reviews.llvm.org/D48552

llvm-svn: 336406

b351f09c

Jul 05, 2018

[Power9] Add lib calls for float128 operations with no equivalent PPC instructions · 5612b906

Lei Huang authored Jul 05, 2018

Map the following instructions to the proper float128 lib calls:
  pow[i], exp[2], log[2|10], sin, cos, fmin, fmax

Differential Revision: https://reviews.llvm.org/D48544

llvm-svn: 336361

5612b906

[AArch64, PowerPC, x86] add tests for signbit bit hacks; NFC · bce899ff
Sanjay Patel authored Jul 05, 2018
```
llvm-svn: 336348
```
bce899ff

[Power9] Optimize codgen for conversions of int to float128 · 66e22c21

Lei Huang authored Jul 05, 2018

Optimize code sequences for integer conversion to fp128 when the integer is a result of:
  * float->int
  * float->long
  * double->int
  * double->long

Differential Revision: https://reviews.llvm.org/D48429

llvm-svn: 336316

66e22c21

[Power9][NFC] add back-end tests for passing homogeneous fp128 aggregates by value · 9d70afbb

Lei Huang authored Jul 05, 2018

Tests to verify that we are passing fp128 via VSX registers as per ABI.
These are related to clang commit rL336308.

Differential Revision: https://reviews.llvm.org/D48310

llvm-svn: 336314

9d70afbb

[Power9] Add tests for passing float128 in VSX reg for non-homogenous aggregates · 5bab646c
Lei Huang authored Jul 05, 2018
```
Add missing testcase for rL336310

llvm-svn: 336313
```
5bab646c

[Power9]Legalize and emit code for quad-precision convert from single-precision · d17c39cc

Lei Huang authored Jul 05, 2018

Legalize and emit code for quad-precision floating point operation conversion of
single-precision value to quad-precision.

Differential Revision: https://reviews.llvm.org/D47569

llvm-svn: 336307

d17c39cc

[Power9] Implement float128 parameter passing and return values · a26f3be4

Lei Huang authored Jul 05, 2018

This patch enable parameter passing and return by value for float128 types.
Passing aggregate/union which contain float128 members will be submitted in
subsequent patches.

Differential Revision: https://reviews.llvm.org/D47552

llvm-svn: 336306

a26f3be4

Jul 04, 2018

[Power9]Legalize and emit code for round & convert quad-precision values · 6270ab6c

Lei Huang authored Jul 04, 2018

Legalize and emit code for round & convert float128 to double precision and
single precision.

Differential Revision: https://reviews.llvm.org/D46997

llvm-svn: 336299

6270ab6c

[PowerPC] Replace the Post RA List Scheduler with the Machine Scheduler · cb4f0c5c

Stefan Pintilie authored Jul 04, 2018

  We want to run the Machine Scheduler instead of the List Scheduler after RA.
  Checked with a performance run on a Power 9 machine with SPEC 2006 and while
  some benchmarks improved and others degraded the geomean was slightly improved
  with the Machine Scheduler.

  Differential Revision: https://reviews.llvm.org/D45265

llvm-svn: 336295

cb4f0c5c

NFC - Various typo fixes in tests · da4a966e
Gabor Buella authored Jul 04, 2018
```
llvm-svn: 336268
```
da4a966e

Jul 02, 2018

[PowerPC] Don't make it as pre-inc candidate if displacement isn't 4's... · 3b2aa2b4

QingShan Zhang authored Jul 02, 2018

[PowerPC] Don't make it as pre-inc candidate if displacement isn't 4's multiple for i64 pre-inc load/store

For the below case, pre-inc prep think it's a good candidate to use pre-inc for the bucket, but 64bit integer load/store update (pre-inc) instruction on Power requires the displacement field should be DS-form (4's multiple). Since it can't satisfy the constraint, we have to do some fix ups later. As below, the original load/stores could be well-form, it makes things worse.

unsigned long long result = 0;
unsigned long long foo(char *p, unsigned long long n) {
  for (unsigned long long i = 0; i < n; i++) {
    unsigned long long x1 = *(unsigned long long *)(p - 50000 + i);
    unsigned long long x2 = *(unsigned long long *)(p - 61024 + i);
    unsigned long long x3 = *(unsigned long long *)(p - 62048 + i);
    unsigned long long x4 = *(unsigned long long *)(p - 64096 + i);
    result *= x1 * x2 * x3 * x4;
  }
  return result;
}

Patch by jedilyn(Kewen Lin).

Differential Revision: https://reviews.llvm.org/D48813 
--This line, and  those below, will be ignored--

M    lib/Target/PowerPC/PPCLoopPreIncPrep.cpp
A    test/CodeGen/PowerPC/preincprep-i64-check.ll

llvm-svn: 336074

3b2aa2b4

Jun 27, 2018

[DAGCombiner] restrict (float)((int) f) --> ftrunc with no-signed-zeros · d052de85

Sanjay Patel authored Jun 27, 2018

As noted in the D44909 review, the transform from (fptosi+sitofp) to ftrunc 
can produce -0.0 where the original code does not:

#include <stdio.h>
  
int main(int argc) {
  float x;
  x = -0.8 * argc;
  printf("%f\n", (float)((int)x));
  return 0;
}

$ clang -O0 -mavx fp.c ; ./a.out 
0.000000
$ clang -O1 -mavx fp.c ; ./a.out 
-0.000000

Ideally, we'd use IR/node flags to predicate the transform, but the IR parser 
doesn't currently allow fast-math-flags on the cast instructions. So for now, 
just use the function attribute that corresponds to clang's "-fno-signed-zeros" 
option.

Differential Revision: https://reviews.llvm.org/D48085

llvm-svn: 335761

d052de85

Jun 24, 2018

[DAGCombiner] eliminate setcc bool math when input is low-bit of some value · 962ee178

Sanjay Patel authored Jun 24, 2018

This patch has the same motivating example as D48466:
define void @foo(i64 %x, i32 %c.0282.in, i32 %d.0280, i32* %ptr0, i32* %ptr1) {
    %c.0282 = and i32 %c.0282.in, 268435455
    %a16 = lshr i64 32508, %x
    %a17 = and i64 %a16, 1
    %tobool = icmp eq i64 %a17, 0
    %. = select i1 %tobool, i32 1, i32 2
    %.286 = select i1 %tobool, i32 27, i32 26
    %shr97 = lshr i32 %c.0282, %.
    %shl98 = shl i32 %c.0282.in, %.286
    %or99 = or i32 %shr97, %shl98
    %shr100 = lshr i32 %d.0280, %.
    %shl101 = shl i32 %d.0280, %.286
    %or102 = or i32 %shr100, %shl101
    store i32 %or99, i32* %ptr0
    store i32 %or102, i32* %ptr1
    ret void
}

...but I'm trying to kill the setcc bool math sooner rather than later.

By matching a larger pattern that includes both the low-bit mask and the trailing add/sub, 
we can create a universally good fold because we always eliminate the condition code 
intermediate value.

Here are Alive proofs for these (currently instcombine folds the 'add' variants, but 
misses the 'sub' patterns):
https://rise4fun.com/Alive/Gsyp

Name: sub of zext cmp mask
  %a = and i8 %x, 1
  %c = icmp eq i8 %a, 0
  %z = zext i1 %c to i32
  %r = sub i32 C1, %z
  =>
  %optional_cast = zext i8 %a to i32
  %r = add i32 %optional_cast, C1-1

Name: add of zext cmp mask
  %a = and i32 %x, 1
  %c = icmp eq i32 %a, 0
  %z = zext i1 %c to i8
  %r = add i8 %z, C1
  =>
  %optional_cast = trunc i32 %a to i8
  %r = sub i8 C1+1, %optional_cast

All of the tests look like improvements or neutral to me. But it is possible that x86 
test+set+bitop is better than what we now show here. I suspect we could do better by 
adding another fold for the 'sub' variants.

We start with select-of-constant in IR in the larger motivating test, so that's why I 
included tests with selects. Proofs for those variants:
https://rise4fun.com/Alive/Bx1

Name: true const is bigger
Pre: C2 == (C1 + 1)
  %a = and i8 %x, 1
  %c = icmp eq i8 %a, 0
  %r = select i1 %c, i64 C2, i64 C1
  =>
  %z = zext i8 %a to i64
  %r = sub i64 C2, %z

Name: false const is bigger
Pre: C2 == (C1 + 1)
  %a = and i8 %x, 1
  %c = icmp eq i8 %a, 0
  %r = select i1 %c, i64 C1, i64 C2
  =>
  %z = zext i8 %a to i64
  %r = add i64 C1, %z

Differential Revision: https://reviews.llvm.org/D48466

llvm-svn: 335433

962ee178

Jun 23, 2018
- [PowerPC] add more tests for bit hacking opportunities with setcc; NFC · 0fe8ea56
  Sanjay Patel authored Jun 22, 2018
```
Missed cases where the input and output are the same size in rL335390.

llvm-svn: 335395
```
  0fe8ea56
Jun 22, 2018

[PowerPC] add tests for bit hacking opportunities with setcc; NFC · 6e505e43

Sanjay Patel authored Jun 22, 2018

We likely gave up on folding some select-of-constants patterns in 
IR with rL331486, and we need to recover those in the DAG.

The tests without select are based on our current DAGCombiner 
optimizations for select-of-constants.

llvm-svn: 335390

6e505e43

Jun 21, 2018

[DebugInfo] Make sure all DBG_VALUEs' reguse operands have IsDebug property · 42f7bc96

Mikael Holmen authored Jun 21, 2018

Summary:
In some cases, these operands lacked the IsDebug property, which is meant to signal that
they should not affect codegen. This patch adds a check for this property in the
MachineVerifier and adds it where it was missing.

This includes refactorings to use MachineInstrBuilder construction functions instead of
manually setting up the intrinsic everywhere.

Patch by: JesperAntonsson

Reviewers: aprantl, rnk, echristo, javed.absar

Reviewed By: aprantl

Subscribers: qcolombet, sdardis, nemanjai, JDevlieghere, atanasyan, llvm-commits

Differential Revision: https://reviews.llvm.org/D48319

llvm-svn: 335214

42f7bc96

Generalize MergeBlockIntoPredecessor. Replace uses of MergeBasicBlockIntoOnlyPred. · dfd14ade

Alina Sbirlea authored Jun 20, 2018

Summary:
Two utils methods have essentially the same functionality. This is an attempt to merge them into one.
1. lib/Transforms/Utils/Local.cpp : MergeBasicBlockIntoOnlyPred
2. lib/Transforms/Utils/BasicBlockUtils.cpp : MergeBlockIntoPredecessor

Prior to the patch:
1. MergeBasicBlockIntoOnlyPred
Updates either DomTree or DeferredDominance
Moves all instructions from Pred to BB, deletes Pred
Asserts BB has single predecessor
If address was taken, replace the block address with constant 1 (?)

2. MergeBlockIntoPredecessor
Updates DomTree, LoopInfo and MemoryDependenceResults
Moves all instruction from BB to Pred, deletes BB
Returns if doesn't have a single predecessor
Returns if BB's address was taken

After the patch:
Method 2. MergeBlockIntoPredecessor is attempting to become the new default:
Updates DomTree or DeferredDominance, and LoopInfo and MemoryDependenceResults
Moves all instruction from BB to Pred, deletes BB
Returns if doesn't have a single predecessor
Returns if BB's address was taken

Uses of MergeBasicBlockIntoOnlyPred that need to be replaced:

1. lib/Transforms/Scalar/LoopSimplifyCFG.cpp
Updated in this patch. No challenges.

2. lib/CodeGen/CodeGenPrepare.cpp
Updated in this patch.
i. eliminateFallThrough is straightforward, but I added using a temporary array to avoid the iterator invalidation.
ii. eliminateMostlyEmptyBlock(s) methods also now use a temporary array for blocks
Some interesting aspects:
- Since Pred is not deleted (BB is), the entry block does not need updating.
- The entry block was being updated with the deleted block in eliminateMostlyEmptyBlock. Added assert to make obvious that BB=SinglePred.
- isMergingEmptyBlockProfitable assumes BB is the one to be deleted.
- eliminateMostlyEmptyBlock(BB) does not delete BB on one path, it deletes its unique predecessor instead.
- adding some test owner as subscribers for the interesting tests modified:
test/CodeGen/X86/avx-cmp.ll
test/CodeGen/AMDGPU/nested-loop-conditions.ll
test/CodeGen/AMDGPU/si-annotate-cf.ll
test/CodeGen/X86/hoist-spill.ll
test/CodeGen/X86/2006-11-17-IllegalMove.ll

3. lib/Transforms/Scalar/JumpThreading.cpp
Not covered in this patch. It is the only use case using the DeferredDominance.
I would defer to Brian Rzycki to make this replacement.

Reviewers: chandlerc, spatel, davide, brzycki, bkramer, javed.absar

Subscribers: qcolombet, sanjoy, nemanjai, nhaehnle, jlebar, tpr, kbarton, RKSimon, wmi, arsenm, llvm-commits

Differential Revision: https://reviews.llvm.org/D48202

llvm-svn: 335183

dfd14ade

Jun 20, 2018

Allow binop C1, (select cc, CF, CT) -> select folding · 20279dc0

Stanislav Mekhanoshin authored Jun 20, 2018

Previously this folding was done only if select is a first operand.
However, for non-commutative operations constant may go before
select.

Differential Revision: https://reviews.llvm.org/D48223

llvm-svn: 335167

20279dc0

Jun 19, 2018

[PowerPC] Fix label address calculation for ppc32 · bb2b00bb

Strahinja Petrovic authored Jun 19, 2018

This patch fixes calculating address of label on ppc32 (for -fPIC).

Differential Revision: https://reviews.llvm.org/D46582

llvm-svn: 335043

bb2b00bb

If the arch is P9, we will select the DFLOADf32/DFLOADf64 pseudo instruction... · 9f0fe9a3

QingShan Zhang authored Jun 19, 2018

If the arch is P9, we will select the DFLOADf32/DFLOADf64 pseudo instruction when we are loading a floating,
and expand it post RA basing on the register pressure. However, we miss to do the add-imm peephole for these pseudo instruction.

Differential Revision: https://reviews.llvm.org/D47568
Reviewed By: Nemanjai

llvm-svn: 335024

9f0fe9a3

Jun 18, 2018
- Tests for dag combine select (binop) -> select. NFC. · 9347c7b9
  Stanislav Mekhanoshin authored Jun 18, 2018
```
Tests will be updated with https://reviews.llvm.org/D48223

llvm-svn: 334987
```
  9347c7b9
Jun 16, 2018

Utilize new SDNode flag functionality to expand current support for fma · 8e570c33

Michael Berg authored Jun 16, 2018

Summary: This patch originated from D47388 and is a proper subset of the originating changes, containing only the fmf optimization guard extensions.

Reviewers: spatel, hfinkel, wristow, arsenm, javed.absar, rampitec, nhaehnle, nemanjai

Reviewed By: rampitec, nhaehnle

Subscribers: tpr, nemanjai, wdng

Differential Revision: https://reviews.llvm.org/D47918

llvm-svn: 334876

8e570c33

Jun 08, 2018

propagate fast math flags via IR on fma and sub expressions · 77b5be7e

Michael Berg authored Jun 07, 2018

Summary: This change uses fmf subflags to guard fma optimizations as well as unsafe. These changes originated from D46483 and have been simplified via getNode.

Reviewers: spatel, arsenm, hfinkel, javed.absar

Reviewed By: spatel

Subscribers: nemanjai, wdng

Differential Revision: https://reviews.llvm.org/D47388

llvm-svn: 334242

77b5be7e

Jun 07, 2018

[PowerPC] avoid unprofitable Repl32 flag in BitPermutationSelector · 01ef4c2c

Hiroshi Inoue authored Jun 07, 2018

BitPermutationSelector sets Repl32 flag for bit groups which can be (potentially) benefit from 32-bit rotate-and-mask instructions with bit replication, i.e. rlwinm/rlwimi copies lower 32 bits into upper 32 bits on 64-bit PowerPC before rotation.
However, enforcing 32-bit instruction sometimes results in redundant generated code.
For example, the following simple code is compiled into rotldi + rlwimi while it can be compiled into only rldimi instruction if Repl32 flag is not set on the bit group for (a & 0xFFFFFFFF).

uint64_t func(uint64_t a, uint64_t b) {
	return (a & 0xFFFFFFFF) | (b << 32) ;
}

To avoid such problem, this patch checks the potential benefit of Repl32 flag before setting it. If a bit group does not require rotation (i.e. RLAmt == 0) and won't be merged into another group, we do not benefit from Repl32 flag on this group.

Differential Revision: https://reviews.llvm.org/D47867

llvm-svn: 334195

01ef4c2c

Jun 06, 2018

guard fsqrt with fmf sub flags · cc1c4b69

Michael Berg authored Jun 06, 2018

Summary:
This change uses fmf subflags to guard optimizations as well as unsafe. These changes originated from D46483.
It contains only context for fsqrt.


Reviewers: spatel, hfinkel, arsenm

Reviewed By: spatel

Subscribers: hfinkel, wdng, andrew.w.kaylor, wristow, efriedma, nemanjai

Differential Revision: https://reviews.llvm.org/D47749

llvm-svn: 334113

cc1c4b69

Jun 05, 2018

guard fneg with fmf sub flags · 96925fe0

Michael Berg authored Jun 05, 2018

Summary: This change uses fmf subflags to guard optimizations as well as unsafe. These changes originated from D46483.

Reviewers: spatel, hfinkel

Reviewed By: spatel

Subscribers: nemanjai

Differential Revision: https://reviews.llvm.org/D47389

llvm-svn: 334037

96925fe0

NFC: adding baseline fneg case for fmf · 8f6d6c81
Michael Berg authored Jun 05, 2018
```
llvm-svn: 334035
```
8f6d6c81

[PowerPC] reduce rotate in BitPermutationSelector · 955655f5

Hiroshi Inoue authored Jun 05, 2018

BitPermutationSelector builds the output value by repeating rotate-and-mask instructions with input registers.
Here, we may avoid one rotate instruction if we start building from an input register that does not require rotation.

For example of the test case bitfieldinsert.ll, it first rotates left r4 by 8 bits and then inserts some bits from r5 without rotation.
This can be executed by one rlwimi instruction, which rotates r4 by 8 bits and inserts its bits into r5.

This patch adds a check for rotation amounts in the comparator used in sorting to process the input without rotation first.

Differential Revision: https://reviews.llvm.org/D47765

llvm-svn: 334011

955655f5

May 29, 2018

[PowerPC] Fix the incorrect iterator inside peephole · 716103f1

Lei Huang authored May 29, 2018

Instruction selection can insert nodes into the underlying list after the root
node so iterating will thereby miss it. We should NOT assume that, the root node
is the last element in the DAG nodelist.

Patch by: steven.zhang (Qing Shan Zhang)

Differential Revision: https://reviews.llvm.org/D47437

llvm-svn: 333415

716103f1

May 28, 2018

[Power9]Legalize and emit code for HW/Byte vector extract and convert to QP · 651be449

Lei Huang authored May 28, 2018

Implemente patterns to extract HWord and Byte vector elements and convert to
quad-precision.

Differential Revision: https://reviews.llvm.org/D46774

llvm-svn: 333377

651be449

May 24, 2018

[PowerPC] Remove the match pattern in the definition of LXSDX/STXSDX · f4ec6782

Lei Huang authored May 24, 2018

The match pattern in the definition of LXSDX is xoaddr, so the Pseudo
instruction XFLOADf64 never gets selected. XFLOADf64 expands to LXSDX/LFDX post
RA based on the register pressure. To avoid ambiguity, we need to remove the
select pattern for LXSDX, same as what was done for LXSD. STXSDX also have
the same issue.

Patch by Qing Shan Zhang (steven.zhang).

Differential Revision: https://reviews.llvm.org/D47178

llvm-svn: 333150

f4ec6782

May 23, 2018

[Power9]Legalize and emit code for W vector extract and convert to QP · 8b0da65b

Lei Huang authored May 23, 2018

Implemente patterns to extract [Un]signed Word vector element and convert to
quad-precision.

Differential Revision: https://reviews.llvm.org/D46536

llvm-svn: 333115

8b0da65b

[Power9]Legalize and emit code for DW vector extract and convert to QP · 8990168a

Lei Huang authored May 23, 2018

Implemente patterns to extract [Un]signed DWord vector element and convert to
quad-precision.

Differential Revision: https://reviews.llvm.org/D46333

llvm-svn: 333112

8990168a

May 17, 2018

[PowerPC] preserve test intent by removing undef · 354842ab

Sanjay Patel authored May 16, 2018

We need to clean up the DAG floating-point undef logic.
This process is similar to how we handled integer undef
logic in D43141.

And as we did there, I'm trying to reduce the patch by
changing tests that would probably become meaningless
once we correct FP undef folding.

llvm-svn: 332549

354842ab