  1. Nov 13, 2004
    • 049d33a7
      Chris Lattner authored
      shld is a very high latency operation. Instead of emitting it for shifts of
       two or three, open code the equivalent operation, which is faster on Athlon
       and P4 (by a substantial margin).
      
      For example, instead of compiling this:
      
      long long X2(long long Y) { return Y << 2; }
      
      to:
      
       X2:
              movl 4(%esp), %eax
              movl 8(%esp), %edx
              shldl $2, %eax, %edx
              shll $2, %eax
              ret
      
      Compile it to:
      
      X2:
              movl 4(%esp), %eax
              movl 8(%esp), %ecx
              movl %eax, %edx
              shrl $30, %edx
              leal (%edx,%ecx,4), %edx
              shll $2, %eax
              ret
      
      Likewise, for << 3, compile to:
      
      X3:
              movl 4(%esp), %eax
              movl 8(%esp), %ecx
              movl %eax, %edx
              shrl $29, %edx
              leal (%edx,%ecx,8), %edx
              shll $3, %eax
              ret
      
      This matches icc, except that icc open codes the shifts as adds on the P4.
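
       A rough C-level sketch of the open-coded form (the helper name and
       the split into two 32-bit halves are illustrative, not taken from
       the compiler): the new high word is the old high word shifted left
       by N, plus the N bits that spill out of the top of the low word.

       #include <stdint.h>

       /* Open-coded 64-bit shift left by a small constant n (2 or 3). */
       static uint64_t shl64_by_n(uint32_t lo, uint32_t hi, unsigned n) {
           uint32_t new_hi = (hi << n) + (lo >> (32 - n)); /* the lea pattern */
           uint32_t new_lo = lo << n;                      /* shll $n */
           return ((uint64_t)new_hi << 32) | new_lo;
       }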
      
      llvm-svn: 17707
    • Add missing check · ef6bd92a
      Chris Lattner authored
      llvm-svn: 17706
    • Compile: · 8d521bb1
      Chris Lattner authored
      long long X3_2(long long Y) { return Y+Y; }
      int X(int Y) { return Y+Y; }
      
      into:
      
      X3_2:
              movl 4(%esp), %eax
              movl 8(%esp), %edx
              addl %eax, %eax
              adcl %edx, %edx
              ret
      X:
              movl 4(%esp), %eax
              addl %eax, %eax
              ret
      
      instead of:
      
      X3_2:
              movl 4(%esp), %eax
              movl 8(%esp), %edx
              shldl $1, %eax, %edx
              shll $1, %eax
              ret
      
      X:
              movl 4(%esp), %eax
              shll $1, %eax
              ret
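
       A short C sketch of why one add plus one add-with-carry is enough
       (the helper name is illustrative): doubling the low word may carry
       out, and that carry is exactly what adcl folds into the high word.

       #include <stdint.h>

       static uint64_t double64(uint32_t lo, uint32_t hi) {
           uint32_t new_lo = lo + lo;          /* addl %eax, %eax */
           uint32_t carry  = new_lo < lo;      /* carry out of the low-word add */
           uint32_t new_hi = hi + hi + carry;  /* adcl %edx, %edx */
           return ((uint64_t)new_hi << 32) | new_lo;
       }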
      
      llvm-svn: 17705
  2. Oct 17, 2004
    • Don't print stuff out from the code generator. This broke the JIT horribly · 06855531
      Chris Lattner authored
      last night. :)  bork!
      
      llvm-svn: 17093
    • Rewrite support for cast uint -> FP. In particular, we used to compile this: · 839abf57
      Chris Lattner authored
      double %test(uint %X) {
              %tmp.1 = cast uint %X to double         ; <double> [#uses=1]
              ret double %tmp.1
      }
      
      into:
      
      test:
              sub %ESP, 8
              mov %EAX, DWORD PTR [%ESP + 12]
              mov %ECX, 0
              mov DWORD PTR [%ESP], %EAX
              mov DWORD PTR [%ESP + 4], %ECX
              fild QWORD PTR [%ESP]
              add %ESP, 8
              ret
      
       ... which basically zero-extends the value to 8 bytes, then does an fild of an
       8-byte signed int.
      
      Now we generate this:
      
      
      test:
              sub %ESP, 4
              mov %EAX, DWORD PTR [%ESP + 8]
              mov DWORD PTR [%ESP], %EAX
              fild DWORD PTR [%ESP]
              shr %EAX, 31
              fadd DWORD PTR [.CPItest_0 + 4*%EAX]
              add %ESP, 4
              ret
      
              .section .rodata
              .align  4
      .CPItest_0:
              .quad   5728578726015270912
      
      This does a 32-bit signed integer load, then adds in an offset if the sign
      bit of the integer was set.
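
       In C terms, the lowering amounts to roughly the following sketch
       (the table and helper names are illustrative, not the actual
       constant pool machinery):

       #include <stdint.h>

       /* 0.0f and 2^32 as floats, indexed by the sign bit of the input. */
       static const float fixup[2] = { 0.0f, 4294967296.0f };

       static double uint_to_double(uint32_t x) {
           double d = (double)(int32_t)x;   /* fild of a 32-bit signed int */
           return d + fixup[x >> 31];       /* add 2^32 back if the sign bit was set */
       }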
      
       It turns out that this is substantially faster than the preceding sequence.
      Consider this testcase:
      
      unsigned a[2]={1,2};
      volatile double G;
      
       int main() {
           int i;
           for (i = 0; i < 100000000; ++i)
               G += a[i&1];
           return 0;
       }
      
       On zion (a 3GHz P4 Xeon), this patch speeds up the testcase from 2.140s
      to 0.94s.
      
       On apoc, an Athlon MP 2100+, this patch speeds up the testcase from 1.72s
      to 1.34s.
      
      Note that the program takes 2.5s/1.97s on zion/apoc with GCC 3.3 -O3
      -fomit-frame-pointer.
      
      llvm-svn: 17083
    • Unify handling of constant pool indexes with the other code paths, allowing · 112fd88a
      Chris Lattner authored
       us to use index registers for CPIs
      
      llvm-svn: 17082
    • Give the asmprinter the ability to print memrefs with a constant pool index, · af19d396
      Chris Lattner authored
       index register, and scale
      
      llvm-svn: 17081
    • fold: · 653d8663
      Chris Lattner authored
        %X = and Y, constantint
        %Z = setcc %X, 0
      
      instead of emitting:
      
              and %EAX, 3
              test %EAX, %EAX
              je .LBBfoo2_2   # UnifiedReturnBlock
      
      We now emit:
      
              test %EAX, 3
              je .LBBfoo2_2   # UnifiedReturnBlock
      
       This triggers 581 times on 176.gcc, for example.
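
       A typical C-level source for this pattern (just an illustration of
       the kind of code that produces the and + setcc pair) would be:

       int is_aligned(unsigned x) {
           return (x & 3) == 0;   /* the and + compare-with-zero selects to a single test */
       }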
      
      llvm-svn: 17080
  3. Oct 06, 2004
    • Remove debugging code, fix encoding problem. This fixes the problems · 93867e51
      Chris Lattner authored
      the JIT had last night.
      
      llvm-svn: 16766
    • Codegen signed mod by 2 or -2 more efficiently. Instead of generating: · 6835dedb
      Chris Lattner authored
      t:
              mov %EDX, DWORD PTR [%ESP + 4]
              mov %ECX, 2
              mov %EAX, %EDX
              sar %EDX, 31
              idiv %ECX
              mov %EAX, %EDX
              ret
      
      Generate:
      t:
              mov %ECX, DWORD PTR [%ESP + 4]
      ***     mov %EAX, %ECX
              cdq
              and %ECX, 1
              xor %ECX, %EDX
              sub %ECX, %EDX
      ***     mov %EAX, %ECX
              ret
      
      Note that the two marked moves are redundant, and should be eliminated by the
      register allocator, but aren't.
      
      Compare this to GCC, which generates:
      
      t:
              mov     %eax, DWORD PTR [%esp+4]
              mov     %edx, %eax
              shr     %edx, 31
              lea     %ecx, [%edx+%eax]
              and     %ecx, -2
              sub     %eax, %ecx
              ret
      
      or ICC 8.0, which generates:
      
      t:
              movl      4(%esp), %ecx                                 #3.5
              movl      $-2147483647, %eax                            #3.25
              imull     %ecx                                          #3.25
              movl      %ecx, %eax                                    #3.25
              sarl      $31, %eax                                     #3.25
              addl      %ecx, %edx                                    #3.25
              subl      %edx, %eax                                    #3.25
              addl      %eax, %eax                                    #3.25
              negl      %eax                                          #3.25
              subl      %eax, %ecx                                    #3.25
              movl      %ecx, %eax                                    #3.25
              ret                                                     #3.25
      
      We would be in great shape if not for the moves.
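
       In C terms, the branchless sequence we now emit is roughly (helper
       name illustrative; assumes arithmetic right shift of signed ints,
       as on x86):

       int srem2(int x) {
           int sign = x >> 31;          /* cdq: 0 or -1 */
           int bit  = x & 1;            /* and %ECX, 1 */
           return (bit ^ sign) - sign;  /* xor/sub: negate when x is negative */
       }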
      
      llvm-svn: 16763
    • Fix a scary bug with signed division by a power of two. We used to generate: · 7bd8f133
      Chris Lattner authored
      s:   ;; X / 4
              mov %EAX, DWORD PTR [%ESP + 4]
              mov %ECX, %EAX
              sar %ECX, 1
              shr %ECX, 30
              mov %EDX, %EAX
              add %EDX, %ECX
              sar %EAX, 2
              ret
      
      When we really meant:
      
      s:
              mov %EAX, DWORD PTR [%ESP + 4]
              mov %ECX, %EAX
              sar %ECX, 1
              shr %ECX, 30
              add %EAX, %ECX
              sar %EAX, 2
              ret
      
       Hey, this also reduces register pressure :)
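
       For reference, the corrected sequence is the usual round-toward-zero
       bias before the arithmetic shift; roughly, in C (name illustrative;
       assumes arithmetic right shift of signed ints):

       int sdiv4(int x) {
           int bias = (int)((unsigned)(x >> 1) >> 30);  /* 3 if x < 0, else 0 */
           return (x + bias) >> 2;                      /* sar $2 after the fixup */
       }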
      
      llvm-svn: 16761
    • Codegen signed divides by 2 and -2 more efficiently. In particular · 147edd2f
      Chris Lattner authored
      instead of:
      
      s:   ;; X / 2
              movl 4(%esp), %eax
              movl %eax, %ecx
              shrl $31, %ecx
              movl %eax, %edx
              addl %ecx, %edx
              sarl $1, %eax
              ret
      
      t:   ;; X / -2
              movl 4(%esp), %eax
              movl %eax, %ecx
              shrl $31, %ecx
              movl %eax, %edx
              addl %ecx, %edx
              sarl $1, %eax
              negl %eax
              ret
      
      Emit:
      
      s:
              movl 4(%esp), %eax
              cmpl $-2147483648, %eax
              sbbl $-1, %eax
              sarl $1, %eax
              ret
      
      t:
              movl 4(%esp), %eax
              cmpl $-2147483648, %eax
              sbbl $-1, %eax
              sarl $1, %eax
              negl %eax
              ret
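
       The cmp/sbb pair just adds 1 to negative inputs before the shift so
       the arithmetic shift rounds toward zero like idiv would; the -2 case
       simply negates afterwards. As a C sketch (name illustrative; assumes
       arithmetic right shift of signed ints):

       int sdiv2(int x) {
           return (x + (x < 0)) >> 1;   /* sbb adds the bias, sar $1 divides */
       }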
      
      llvm-svn: 16760
    • Add some new instructions. Fix the asm string for sbb32rr · e9bfa5a2
      Chris Lattner authored
      llvm-svn: 16759