  1. Mar 05, 2013
  2. Mar 04, 2013
  3. Mar 02, 2013
    • ARM: Creating a vector from a lane of another. · a3c5c769
      Jim Grosbach authored
      The VDUP instruction source register doesn't allow a non-constant lane
      index, so make sure we don't construct an ARM::VDUPLANE node asking it to
      do so.
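      A portable sketch of the pattern behind the fix (a hypothetical reduction, not the PR's actual test case): splatting a lane whose index is only known at runtime, which VDUPLANE's immediate lane operand cannot encode.

      ```c
      #include <assert.h>

      /* Hypothetical reduction of the bug pattern (not the PR's test case):
       * splat one lane, chosen at runtime, of a 4-wide vector. Because `lane`
       * is not a compile-time constant, the splat must not be lowered to
       * ARM::VDUPLANE, whose lane index is an immediate operand. */
      static void splat_lane(const float src[4], float dst[4], int lane) {
          float v = src[lane];        /* extract with a non-constant index */
          for (int i = 0; i < 4; ++i) /* broadcast that lane to every element */
              dst[i] = v;
      }
      ```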
      
      rdar://13328063
      http://llvm.org/bugs/show_bug.cgi?id=13963
      
      llvm-svn: 176413
    • Clean up code format a bit. · c6f1914e
      Jim Grosbach authored
      llvm-svn: 176412
    • Tidy up. Trailing whitespace. · 54efea0a
      Jim Grosbach authored
      llvm-svn: 176411
    • ARM NEON: Fix v2f32 float intrinsics · 99cba969
      Arnold Schwaighofer authored
      Mark them as Expand; they are not legal, as our backend does not match them.
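      A sketch of what marking an operation as Expand looks like in the target's lowering setup; the specific opcodes listed here are an assumption for illustration, not necessarily the exact set this patch touched.

      ```cpp
      // In ARMISelLowering.cpp (sketch): tell the legalizer that these float
      // intrinsics have no NEON v2f32 pattern, so they are expanded (e.g. to
      // scalar libcalls) instead of being selected. The opcode list here is
      // an assumption for this sketch.
      setOperationAction(ISD::FSIN, MVT::v2f32, Expand);
      setOperationAction(ISD::FCOS, MVT::v2f32, Expand);
      setOperationAction(ISD::FPOW, MVT::v2f32, Expand);
      ```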
      
      llvm-svn: 176410
    • X86 cost model: Adjust cost for custom lowered vector multiplies · 20ef54f4
      Arnold Schwaighofer authored
      This matters, for example, in the following matrix multiply:
      
      int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
        int i, j, k, val;
        for (i=0; i<rows; i++) {
          for (j=0; j<cols; j++) {
            val = 0;
            for (k=0; k<cols; k++) {
              val += m1[i][k] * m2[k][j];
            }
            m3[i][j] = val;
          }
        }
        return m3;
      }
      
      Taken from the test-suite benchmark Shootout.
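      A self-contained, runnable reduction of the kernel above; fixing N = 2 and using 2-D arrays instead of `int **` are my simplifications for the sketch, not the benchmark's.

      ```c
      #include <assert.h>

      /* Fixed-size, runnable reduction of the Shootout mmult kernel above.
       * N = 2 and plain 2-D arrays are assumptions for this sketch. */
      #define N 2
      static void mmult(int m1[N][N], int m2[N][N], int m3[N][N]) {
          for (int i = 0; i < N; i++) {
              for (int j = 0; j < N; j++) {
                  int val = 0;                 /* dot product of row i, column j */
                  for (int k = 0; k < N; k++)
                      val += m1[i][k] * m2[k][j];
                  m3[i][j] = val;
              }
          }
      }
      ```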
      
      We estimated the cost of the multiply to be 2, while we actually generate 9
      instructions for it, and we end up quite a bit slower than the scalar
      version (48% slower on my machine).
      
      Also, properly differentiate between AVX1 and AVX2. On AVX1 we still split
      the vector into two 128-bit halves and handle the subvector multiplies as
      above, with 9 instructions each. Only on AVX2 do we have a cost of 9 for
      v4i64.
      
      I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use
      an add instead of a mul, because with a mul we now no longer vectorize. I
      did verify that the mul would indeed be more expensive when vectorized,
      using three kernels:
      
      for (i ...)
        r += a[i] * 3;
      for (i ...)
        m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
      and a matrix multiply.
      
      In each case the vectorized version was considerably slower.
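      The first two kernels above can be made runnable as follows; SIZE and the test data are assumptions for this sketch, not from the original measurements.

      ```c
      #include <assert.h>

      /* Runnable forms of the first two verification kernels above; SIZE
       * and the data are assumptions, not the original benchmark setup. */
      #define SIZE 8
      static int mul_reduce(const int *a) {   /* r += a[i] * 3 */
          int r = 0;
          for (int i = 0; i < SIZE; i++)
              r += a[i] * 3;
          return r;
      }
      static void mul_inplace(int *m1) {      /* m1[i] = m1[i] * 3, as in avx1.ll */
          for (int i = 0; i < SIZE; i++)
              m1[i] = m1[i] * 3;
      }
      ```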
      
      radar://13304919
      
      llvm-svn: 176403
    • Added FIXME for future Hexagon cleanup. · 63474629
      Andrew Trick authored
      llvm-svn: 176400
  4. Mar 01, 2013