README.txt

Target Independent Opportunities:

//===---------------------------------------------------------------------===//

We should recognize idioms for add-with-carry and turn it into the appropriate
intrinsics.  This example:

unsigned add32carry(unsigned sum, unsigned x) {
 unsigned z = sum + x;
 if (sum + x < x)
     z++;
 return z;
}

Compiles to: clang t.c -S -o - -O3 -fomit-frame-pointer -m64 -mkernel

_add32carry:                            ## @add32carry
	addl	%esi, %edi
	cmpl	%esi, %edi
	sbbl	%eax, %eax
	andl	$1, %eax
	addl	%edi, %eax
	ret

with clang, but to:

_add32carry:
	leal	(%rsi,%rdi), %eax
	cmpl	%esi, %eax
	adcl	$0, %eax
	ret

with gcc.

//===---------------------------------------------------------------------===//

Dead argument elimination should be enhanced to handle cases when an argument is
dead to an externally visible function.  Though the argument can't be removed
from the externally visible function, the caller doesn't need to pass it in.
For example in this testcase:

  void foo(int X) __attribute__((noinline));
  void foo(int X) { sideeffect(); }
  void bar(int A) { foo(A+1); }

We compile bar to:

define void @bar(i32 %A) nounwind ssp {
  %0 = add nsw i32 %A, 1                          ; <i32> [#uses=1]
  tail call void @foo(i32 %0) nounwind noinline ssp
  ret void
}

The add is dead, we could pass in 'i32 undef' instead.  This occurs for C++
templates etc, which usually have linkonce_odr/weak_odr linkage, not internal
linkage.

//===---------------------------------------------------------------------===//

With the recent changes to make the implicit def/use set explicit in
machineinstrs, we should change the target descriptions for 'call' instructions
so that the .td files don't list all the call-clobbered registers as implicit
defs.  Instead, these should be added by the code generator (e.g. on the dag).

This has a number of uses:

1. PPC32/64 and X86 32/64 can avoid having multiple copies of call instructions
   for their different impdef sets.
2. Targets with multiple calling convs (e.g. x86) which have different clobber
   sets don't need copies of call instructions.
3. 'Interprocedural register allocation' can be done to reduce the clobber sets
   of calls.

//===---------------------------------------------------------------------===//

We should recognized various "overflow detection" idioms and translate them into
llvm.uadd.with.overflow and similar intrinsics.  Here is a multiply idiom:

unsigned int mul(unsigned int a,unsigned int b) {
 if ((unsigned long long)a*b>0xffffffff)
   exit(0);
  return a*b;
}

//===---------------------------------------------------------------------===//

Get the C front-end to expand hypot(x,y) -> llvm.sqrt(x*x+y*y) when errno and
precision don't matter (ffastmath).  Misc/mandel will like this. :)  This isn't
safe in general, even on darwin.  See the libm implementation of hypot for
examples (which special case when x/y are exactly zero to get signed zeros etc
right).

//===---------------------------------------------------------------------===//

Solve this DAG isel folding deficiency:

int X, Y;

void fn1(void)
{
  X = X | (Y << 3);
}

compiles to

fn1:
	movl Y, %eax
	shll $3, %eax
	orl X, %eax
	movl %eax, X
	ret

The problem is the store's chain operand is not the load X but rather
a TokenFactor of the load X and load Y, which prevents the folding.

There are two ways to fix this:

1. The dag combiner can start using alias analysis to realize that y/x
   don't alias, making the store to X not dependent on the load from Y.
2. The generated isel could be made smarter in the case it can't
   disambiguate the pointers.

Number 1 is the preferred solution.

This has been "fixed" by a TableGen hack. But that is a short term workaround
which will be removed once the proper fix is made.

//===---------------------------------------------------------------------===//

On targets with expensive 64-bit multiply, we could LSR this:

for (i = ...; ++i) {
   x = 1ULL << i;

into:
 long long tmp = 1;
 for (i = ...; ++i, tmp+=tmp)
   x = tmp;

This would be a win on ppc32, but not x86 or ppc64.

//===---------------------------------------------------------------------===//

Shrink: (setlt (loadi32 P), 0) -> (setlt (loadi8 Phi), 0)

//===---------------------------------------------------------------------===//

Reassociate should turn things like:

int factorial(int X) {
 return X*X*X*X*X*X*X*X;
}

into llvm.powi calls, allowing the code generator to produce balanced
multiplication trees.

First, the intrinsic needs to be extended to support integers, and second the
code generator needs to be enhanced to lower these to multiplication trees.

//===---------------------------------------------------------------------===//

Interesting? testcase for add/shift/mul reassoc:

int bar(int x, int y) {
  return x*x*x+y+x*x*x*x*x*y*y*y*y;
}
int foo(int z, int n) {
  return bar(z, n) + bar(2*z, 2*n);
}

This is blocked on not handling X*X*X -> powi(X, 3) (see note above).  The issue
is that we end up getting t = 2*X  s = t*t   and don't turn this into 4*X*X,
which is the same number of multiplies and is canonical, because the 2*X has
multiple uses.  Here's a simple example:

define i32 @test15(i32 %X1) {
  %B = mul i32 %X1, 47   ; X1*47
  %C = mul i32 %B, %B
  ret i32 %C
}


//===---------------------------------------------------------------------===//

Reassociate should handle the example in GCC PR16157:

extern int a0, a1, a2, a3, a4; extern int b0, b1, b2, b3, b4; 
void f () {  /* this can be optimized to four additions... */ 
        b4 = a4 + a3 + a2 + a1 + a0; 
        b3 = a3 + a2 + a1 + a0; 
        b2 = a2 + a1 + a0; 
        b1 = a1 + a0; 
} 

This requires reassociating to forms of expressions that are already available,
something that reassoc doesn't think about yet.


//===---------------------------------------------------------------------===//