    X86 cost model: Adjust cost for custom lowered vector multiplies · 20ef54f4
    Arnold Schwaighofer authored
    This matters, for example, in the following matrix multiply:
    
    int **mmult(int rows, int cols, int **m1, int **m2, int **m3) {
      int i, j, k, val;
      for (i=0; i<rows; i++) {
        for (j=0; j<cols; j++) {
          val = 0;
          for (k=0; k<cols; k++) {
            val += m1[i][k] * m2[k][j];
          }
          m3[i][j] = val;
        }
      }
      return(m3);
    }
    
    Taken from the test-suite benchmark Shootout.
    
    We estimate the cost of the multiply to be 2, but we generate 9 instructions
    for it, so the vectorized code ends up quite a bit slower than the scalar
    version (48% on my machine).
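
    For illustration only, here is a scalar sketch of where 9 instructions can
    come from. It assumes the multiply in question is a 64-bit lane multiply
    built from 32x32->64 multiplies (the kind PMULUDQ provides), shifts and
    adds. That breakdown (3 multiplies, 4 shifts, 2 adds) is an assumption, not
    something stated in this message, and mul64_via_32 is a hypothetical name.

```c
#include <stdint.h>

/* Hypothetical helper: one 64-bit lane of the multiply, computed modulo 2^64
   using only 32x32->64 multiplies, shifts and adds. */
uint64_t mul64_via_32(uint64_t a, uint64_t b) {
  uint64_t a_hi = a >> 32;                               /* shift 1 */
  uint64_t b_hi = b >> 32;                               /* shift 2 */

  /* Each operand below fits in 32 bits, so each product fits in 64 bits. */
  uint64_t lo_lo = (uint64_t)(uint32_t)a * (uint32_t)b;  /* multiply 1 */
  uint64_t hi_lo = a_hi * (uint32_t)b;                   /* multiply 2 */
  uint64_t lo_hi = (uint64_t)(uint32_t)a * b_hi;         /* multiply 3 */

  /* The cross products only contribute to the upper 32 bits of the result. */
  hi_lo <<= 32;                                          /* shift 3 */
  lo_hi <<= 32;                                          /* shift 4 */

  /* The a_hi * b_hi term is a multiple of 2^64 and drops out entirely. */
  uint64_t sum = lo_lo + hi_lo;                          /* add 1 */
  return sum + lo_hi;                                    /* add 2 */
}
```

    Under that assumption a cost of 2 undercounts the real work by a wide
    margin, which is consistent with the slowdown reported above.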
    
    Also, properly differentiate between AVX1 and AVX2. On AVX1 we still split
    the vector into two 128-bit halves and handle each subvector mul as above,
    with 9 instructions.
    Only on AVX2 will we have a cost of 9 for v4i64.
    
    I changed the test case in test/Transforms/LoopVectorize/X86/avx1.ll to use an
    add instead of a mul, because with a mul we no longer vectorize. I verified
    with 3 kernels that the mul is indeed more expensive when vectorized:
    
    for (i ...)
       r += a[i] * 3;
    for (i ...)
      m1[i] = m1[i] * 3; // This matches the test case in avx1.ll
    and a matrix multiply.
    
    In each case the vectorized version was considerably slower.
    
    radar://13304919
    
    llvm-svn: 176403