  1. Aug 04, 2014
    • [x86] Implement more aggressive use of PACKUS chains for lowering common
      patterns of v16i8 shuffles. · 06e6f1ca
      Chandler Carruth authored
      
      This implements one of the more important FIXMEs for the SSE2 support in
      the new shuffle lowering. We now generate the optimal shuffle sequence
      for truncate-derived shuffles which show up essentially everywhere.
      
      Unfortunately, this exposes a weakness in other parts of the shuffle
      logic -- we can no longer form PSHUFB here. I'll add the necessary
      support for that and other things in a subsequent commit.
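As an illustration of the truncate-derived pattern mentioned above, here is a minimal Python model of a PACKUS chain. It is a sketch and not code from the commit: `packuswb` models the SSE2 PACKUSWB instruction, while `bytes_to_words` and `take_every_fourth_byte` are hypothetical helper names introduced only for this example. It assumes the shuffle keeps every fourth byte of four input vectors, as an i32-to-i8 truncation would.

```python
def packuswb(a, b):
    """Model of SSE2 PACKUSWB: pack two vectors of eight signed 16-bit
    words into sixteen bytes with unsigned saturation."""
    def sat(w):
        # Interpret the word as signed 16-bit, then clamp into [0, 255].
        if w >= 0x8000:
            w -= 0x10000
        return max(0, min(255, w))
    return [sat(w) for w in a + b]


def bytes_to_words(v):
    """Reinterpret sixteen bytes as eight little-endian 16-bit words."""
    return [v[i] | (v[i + 1] << 8) for i in range(0, 16, 2)]


def take_every_fourth_byte(v0, v1, v2, v3):
    """Keep bytes 0, 4, 8, 12 of each input using two rounds of PACKUS."""
    # Clear everything except the low byte of each dword (a PAND with a
    # constant mask), so that no word can saturate during packing.
    masked = [[b if i % 4 == 0 else 0 for i, b in enumerate(v)]
              for v in (v0, v1, v2, v3)]
    words = [bytes_to_words(v) for v in masked]
    # First round: each result holds the kept bytes at even positions.
    lo = packuswb(words[0], words[1])
    hi = packuswb(words[2], words[3])
    # Second round: squeeze out the remaining zero bytes.
    return packuswb(bytes_to_words(lo), bytes_to_words(hi))
```

Each round halves the stride between the kept bytes, so a stride of 2^N needs N rounds of packing after a single masking step.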
      
      llvm-svn: 214702
      06e6f1ca
    • [x86] Handle single input shuffles in the SSSE3 case more intelligently. · 37a18821
      Chandler Carruth authored
      I spent some time looking into a better or more principled way to handle
      this. For example, by detecting arbitrary "unneeded" ORs... But really,
      there wasn't any point. It is better not to build blatantly wrong code so
      late in the pipeline than to add more stages and logic later on
      to fix it. Avoiding this is just too simple.
      
      llvm-svn: 214680
      37a18821
  2. Aug 02, 2014
    • [x86] Largely complete the use of PSHUFB in the new vector shuffle
      lowering with a small addition to it and adding PSHUFB combining. · 4c57955f
      Chandler Carruth authored
      
      There is one obvious place in the new vector shuffle lowering where we
      should form PSHUFBs directly: when without them we will unpack a vector
      of i8s across two different registers and do a potentially 4-way blend
      as i16s only to re-pack them into i8s afterward. This is the crazy
      expensive fallback path for i8 shuffles and we can just directly use
      pshufb here as it will always be cheaper (the unpack and pack are
      two instructions so even a single shuffle between them hits our
      three instruction limit for forming PSHUFB).
      
      However, this doesn't generate very good code in many cases, and it
      leaves a bunch of common patterns not using PSHUFB. So this patch also
      adds support for extracting a shuffle mask from PSHUFB in the X86
      lowering code, and uses it to handle PSHUFBs in the recursive shuffle
      combining. This allows us to combine through them, combine multiple ones
      together, and generally produce sufficiently high quality code.
      
      Extracting the PSHUFB mask is annoyingly complex because it could be
      either pre-legalization or post-legalization. At least this doesn't have
      to deal with re-materialized constants. =] I've added decode routines to
      handle the different patterns that show up at this level and we dispatch
      through them as appropriate.
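To make the idea of extracting a shuffle mask from PSHUFB concrete, here is a small Python sketch of the 128-bit instruction's control-byte semantics. It is an illustration only, not the decode routines added by the commit: the function names and the `ZERO` sentinel are hypothetical, and the control vector is assumed to be available as sixteen constant bytes.

```python
ZERO = -1  # hypothetical sentinel meaning "this result byte is zeroed"


def decode_pshufb_mask(control):
    """Turn sixteen PSHUFB control bytes into a shuffle mask.

    For the 128-bit form, a control byte with bit 7 set zeroes the
    result byte; otherwise its low four bits select a source byte.
    """
    mask = []
    for c in control:
        if c & 0x80:
            mask.append(ZERO)
        else:
            mask.append(c & 0x0F)
    return mask


def apply_mask(src, mask):
    """Apply a decoded mask to a sixteen-byte source vector."""
    return [0 if m == ZERO else src[m] for m in mask]
```

Once the control bytes are in this form, a PSHUFB can be treated like any other single-input shuffle by code that reasons about masks.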
      
      The two primary test cases are updated. For the v16 test case there is
      still a lot of room for improvement. Since I was going through it
      systematically I left behind a bunch of FIXME lines that I'm hoping to
      turn into ALL lines by the end of this.
      
      llvm-svn: 214628
      4c57955f
  3. Jul 10, 2014
    • [x86] Add another combine that is particularly useful for the new vector
      shuffle lowering: match shuffle patterns equivalent to an unpcklwd or
      unpckhwd instruction. · df8d0caa
      Chandler Carruth authored
      
      This allows us to use generic lowering code for v8i16 shuffles and match
      the unpack pattern late.
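The late match described above amounts to comparing a v8i16 shuffle mask against two fixed interleave patterns. The following Python sketch shows that comparison under stated assumptions: indices 0 to 7 name words of the first input, 8 to 15 name words of the second, and -1 marks an undefined lane. The function name `match_unpack` is hypothetical, and the actual combine in the commit operates on DAG nodes rather than plain lists.

```python
# Two-input masks for the word interleave instructions.
UNPCKLWD = [0, 8, 1, 9, 2, 10, 3, 11]
UNPCKHWD = [4, 12, 5, 13, 6, 14, 7, 15]


def match_unpack(mask):
    """Return the name of the unpack a v8i16 mask is equivalent to, or None.

    Undefined lanes (-1) are allowed to match any pattern element.
    """
    if len(mask) != 8:
        return None

    def matches(pattern):
        return all(m == -1 or m == p for m, p in zip(mask, pattern))

    if matches(UNPCKLWD):
        return "unpcklwd"
    if matches(UNPCKHWD):
        return "unpckhwd"
    return None
```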
      
      llvm-svn: 212705
      df8d0caa
    • [x86] Expand the target DAG combining for PSHUFD nodes to be able to
      combine into half-shuffles through unpack instructions that expand the
      half to a whole vector without messing with the dword lanes. · 853fa0ac
      Chandler Carruth authored
      
      This fixes some redundant instructions in splat-like lowerings for
      v16i8, which are now getting to be *really* nice.
      
      llvm-svn: 212695
      853fa0ac
    • [x86] Tweak the v16i8 single input special case lowering for shuffles
      that splat i8s into i16s. · a34a8e23
      Chandler Carruth authored
      
      Previously, we would try much too hard to arrange a sequence of i8s in
      one half of the input such that we could unpack them into i16s and
      shuffle those into place. This isn't always going to be a cheaper i8
      shuffle than our other strategies. The case where it is always going to
      be cheaper is when we can arrange all the necessary inputs into one half
      using just i16 shuffles. It happens that viewing the problem this way
      also makes it much easier to produce an efficient set of shuffles to
      move the inputs into one half and then unpack them.
      
      With this, our splat code gets one step closer to being not terrible
      with the new experimental lowering strategy. It also exposes two
      combines missing which I will add next.
      
      llvm-svn: 212692
      a34a8e23
    • [x86] Initial improvements to the new shuffle lowering for v16i8
      shuffles specifically for cases where a small subset of the elements in
      the input vector are actually used. · 7d2ffb54
      Chandler Carruth authored
      
      This is specifically targeted at improving the shuffles generated for
      trunc operations, but also helps out splat-like operations.
      
      There is still some really low-hanging fruit here that I want to address
      but this is a huge step in the right direction.
      
      llvm-svn: 212680
      7d2ffb54
  4. Jul 07, 2014
    • [x86] Teach the new vector shuffle lowering code to handle what is
      essentially a DAG combine that never gets a chance to run. · 0dcb3662
      Chandler Carruth authored
      
      We might typically expect DAG combining to remove shuffles-of-splats and
      other similar patterns, but we don't get a chance to run the DAG
      combiner when we recursively form sub-shuffles during the lowering of
      a shuffle. So instead hand-roll a really important combine directly into
      the lowering code to detect shuffles-of-splats, especially shuffles of
      an all-zero splat which needn't even have the same element width, etc.
      
      This lets the new vector shuffle lowering handle shuffles which
      implement things like zero-extension really nicely. This will become
      even more important when I wire the legalization of zero-extension to
      vector shuffles with the new widening legalization strategy.
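As a sketch of why shuffles of an all-zero splat matter, the Python below models zero-extension of the low eight bytes of a v16i8 vector as a two-input shuffle whose second input is a zero vector. This is an illustration under assumptions, not the lowering code itself: `shuffle`, `zext_low_bytes_mask`, and `fold_splat_lanes` are hypothetical names, and mask indices 16 to 31 are taken to refer to the second input.

```python
def shuffle(a, b, mask):
    """Two-input shuffle: indices below len(a) pick from a, the rest from b."""
    both = a + b
    return [both[m] for m in mask]


def zext_low_bytes_mask(n=16):
    """Mask that interleaves the low n/2 bytes of the first input with
    bytes of the second input (intended to be all zeros)."""
    mask = []
    for i in range(n // 2):
        mask += [i, n + i]
    return mask


def fold_splat_lanes(mask, n=16):
    """If the second input is a splat, every lane of it holds the same
    value, so all references to it can be rewritten to its first lane."""
    return [n if m >= n else m for m in mask]


def bytes_to_words(v):
    """Reinterpret bytes as little-endian 16-bit words."""
    return [v[i] | (v[i + 1] << 8) for i in range(0, len(v), 2)]
```

Reading the shuffled bytes back as little-endian words gives the original low bytes as zero-extended 16-bit values, and folding the splat lanes does not change the result.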
      
      llvm-svn: 212444
      0dcb3662
  5. Jun 28, 2014
    • [x86] Fix a bug in the v8i16 shuffling exposed by the new splat-like
      lowering for v16i8. · bd0717d7
      Chandler Carruth authored
      
      ASan and some bots caught this bug with existing test cases. Fixing it
      even fixed a miscompile with one of the test cases. I'm still a bit
      suspicious of this test case as I've not taken a proper amount of time
      to think about it, but the fix here is strict goodness.
      
      llvm-svn: 211976
      bd0717d7
    • [x86] Add handling for splat-like widenings of v16i8 shuffles. · 887c2c34
      Chandler Carruth authored
      These show up really frequently, not least with actual splats. =] We
      lowered these quite badly before. The new code path tries to widen i8
      shuffles to i16 shuffles in a splat-like way. There are still some
      inefficiencies in our i16 splat logic though, so we aren't really done
      here.
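One way to picture a splat-like widening is the following Python sketch, which splats a single byte by first duplicating bytes into 16-bit lanes and then shuffling those wider lanes. It is a simplified illustration, not the code path added by the commit: `punpcklbw` models the byte interleave instruction, `shuffle_words` and `splat_byte` are hypothetical helpers, and only bytes in the low half of the input are handled.

```python
def punpcklbw(a, b):
    """Model of PUNPCKLBW: interleave the low eight bytes of a and b."""
    out = []
    for i in range(8):
        out += [a[i], b[i]]
    return out


def shuffle_words(v, mask):
    """Shuffle sixteen bytes as eight two-byte lanes."""
    out = []
    for m in mask:
        out += v[2 * m:2 * m + 2]
    return out


def splat_byte(v, idx):
    """Splat byte idx (restricted to the low half in this sketch)."""
    assert 0 <= idx < 8
    # Unpacking the vector with itself turns byte i into word lane i,
    # with both bytes of that lane equal to the original byte.
    doubled = punpcklbw(v, v)
    # A word-level splat of lane idx now yields the byte in every position.
    return shuffle_words(doubled, [idx] * 8)
```

The point of the widening is that the remaining work is an i16 shuffle, for which SSE2 has much better instructions than it has for arbitrary i8 shuffles.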
      
      Also, for certain patterns (bit of a gather-and-splat) we still
      generate pretty silly code, and I've left a FIXME for addressing it.
      However, I'm not actually worried about this code pattern as much. The
      old shuffle lowering generates a 29 instruction monstrosity for it that
      should execute much more slowly.
      
      llvm-svn: 211974
      887c2c34
  6. Jun 27, 2014
    • [x86] Fix a miscompile in the new shuffle lowering uncovered by
      a bootstrap. · dd6470a9
      Chandler Carruth authored
      
      I managed to mis-remember how PACKUS worked on x86, and was using undef
      for the high bytes instead of zero. The fix is fairly obvious.
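The reason the high bytes must be zero is that PACKUSWB saturates rather than truncates. The Python model below is a sketch for illustration only (the `packuswb` function and the sample values are not from the commit): it packs the same low bytes twice, once with zero high bytes and once with arbitrary high bytes standing in for undef.

```python
def packuswb(a, b):
    """Model of SSE2 PACKUSWB: pack two vectors of eight signed 16-bit
    words into sixteen bytes with unsigned saturation."""
    def sat(w):
        # Interpret the word as signed 16-bit, then clamp into [0, 255].
        if w >= 0x8000:
            w -= 0x10000
        return max(0, min(255, w))
    return [sat(w) for w in a + b]


low_bytes = [0x41] * 8

# High byte zero: every word is already in [0, 255], so packing is exact.
zero_high = [0x0000 | b for b in low_bytes]
# High byte non-zero (what an undef might turn out to be): words saturate.
positive_junk = [0x1200 | b for b in low_bytes]
negative_junk = [0x8000 | b for b in low_bytes]

exact = packuswb(zero_high, zero_high)
clamped = packuswb(positive_junk, negative_junk)
```

With zero high bytes the result is the original bytes; with other high bytes the result clamps to 255 for positive words and to 0 for negative ones, so the low byte is lost.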
      
      llvm-svn: 211922
      dd6470a9
    • [x86] Begin a significant overhaul of how vector lowering is done in the
      x86 backend. · 83860cfc
      Chandler Carruth authored
      
      This sketches out a new code path for vector lowering, hidden behind an
      off-by-default flag while it is under development. The fundamental idea
      behind the new code path is to aggressively break down the problem space
      in ways that ease selecting the odd set of instructions available on
      x86, and carefully avoid scalarizing code even when forced to use older
      ISAs. Notably, this starts off restricting itself to SSE2 and implements
      the complete vector shuffle and blend space for 128-bit vectors in SSE2
      without scalarizing. The plan is to layer on top of this ISA extensions
      where we can bail out of the complex SSE2 lowering and opt for
      a cheaper, specialized instruction (or set of instructions). It also
      needs to be generalized to AVX and AVX512 vector widths.
      
      Currently, this does a decent but not perfect job for SSE2. There are
      some specific shortcomings that I plan to address:
      - We need a peephole combine to fold together shuffles where possible.
        There are cases where a previous shuffle could be modified slightly to
        arrange for elements to be in the correct position and a later shuffle
        eliminated. Doing this eagerly added quite a bit of complexity, and
        so my plan is to combine away these redundancies afterward.
      - There are a lot more clever ways to use unpck and pack that need to be
        added. This is essential for real world shuffles as it turns out...
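The first shortcoming in the list above, folding shuffles together, rests on the fact that two back-to-back single-input shuffles compose into one mask. The Python sketch below shows that composition under simplifying assumptions (single-input shuffles, no undef lanes); `apply_shuffle`, `compose`, and `is_identity` are hypothetical names, not functions from the patch.

```python
def apply_shuffle(v, mask):
    """Single-input shuffle: result lane i takes v[mask[i]]."""
    return [v[m] for m in mask]


def compose(first, second):
    """Mask equivalent to applying `first` and then `second`.

    Lane i of the final result reads lane second[i] of the intermediate
    vector, which in turn reads lane first[second[i]] of the input.
    """
    return [first[m] for m in second]


def is_identity(mask):
    """True if the mask leaves every lane in place."""
    return all(m == i for i, m in enumerate(mask))
```

When the composed mask is the identity, both shuffles can be removed; otherwise the pair can be replaced by whichever single instruction implements the composed mask, if one exists.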
      
      Once SSE2 is polished a bit I should be able to get interesting numbers
      on performance improvements on benchmarks conducive to vectorization.
      All of this will be off by default until it is functionally equivalent,
      of course.
      
      Differential Revision: http://reviews.llvm.org/D4225
      
      llvm-svn: 211888
      83860cfc