  1. Aug 04, 2014
    • [x86] Implement more aggressive use of PACKUS chains for lowering common · 06e6f1ca
      Chandler Carruth authored
      patterns of v16i8 shuffles.
      
      This implements one of the more important FIXMEs for the SSE2 support in
      the new shuffle lowering. We now generate the optimal shuffle sequence
      for truncate-derived shuffles which show up essentially everywhere.
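
      The truncate-derived patterns referred to here are easy to illustrate outside the compiler. Below is a minimal Python sketch of the PACKUSWB lane semantics (function names are mine, not LLVM's), showing why packing two masked v8i16 halves reproduces a lane truncation:

```python
def packuswb(a, b):
    # PACKUSWB: pack two 8-lane i16 vectors into one 16-lane u8 vector,
    # clamping each lane to 0..255 with unsigned saturation.
    sat = lambda x: max(0, min(255, x))
    return [sat(x) for x in a] + [sat(x) for x in b]

def truncate_lanes(v):
    # A truncating shuffle just keeps the low byte of each i16 lane.
    return [x & 0xFF for x in v]

v16i16 = [x * 100 for x in range(16)]
masked = [x & 0xFF for x in v16i16]  # mask first, so no lane saturates
assert packuswb(masked[:8], masked[8:]) == truncate_lanes(v16i16)
```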
      
      Unfortunately, this exposes a weakness in other parts of the shuffle
      logic -- we can no longer form PSHUFB here. I'll add the necessary
      support for that and other things in a subsequent commit.
      
      llvm-svn: 214702
    • [x86] Handle single input shuffles in the SSSE3 case more intelligently. · 37a18821
      Chandler Carruth authored
      I spent some time looking into a better or more principled way to handle
      this. For example, by detecting arbitrary "unneeded" ORs... But really,
      there wasn't any point. We just shouldn't build blatantly wrong code so
      late in the pipeline rather than adding more stages and logic later on
      to fix it. Avoiding this is just too simple.
      
      llvm-svn: 214680
    • [x86] Fix the test case added in r214670 and tweaked in r214674 further. · 7bbfd245
      Chandler Carruth authored
      Fundamentally, there isn't a really portable way to test the constant
      pool contents. Instead, pin this test to the bare-metal triple. This
      also makes it a 64-bit triple which allows us to only match a single
      constant pool rather than two. It can also just hard code the '.' prefix
      as the format should be stable now that it has a fixed triple. Finally,
      I've switched it to use CHECK-NEXT to be more precise in the instruction
      sequence expected and to use variables rather than hard coding decisions
      by the register allocator.
      
      llvm-svn: 214679
    • Account for possible leading '.' in label string. · 065cabf4
      Sanjay Patel authored
      llvm-svn: 214674
    • fix for PR20354 - Miscompile of fabs due to vectorization · 2ef67440
      Sanjay Patel authored
      This is intended to be the minimal change needed to fix PR20354 ( http://llvm.org/bugs/show_bug.cgi?id=20354 ). The check for a vector operation was wrong; we need to check that the fabs itself is not a vector operation.
      
      This patch will not generate the optimal code. A constant pool load and 'and' op will be generated instead of just returning a value that we can calculate in advance (as we do for the scalar case). I've put a 'TODO' comment for that here and expect to have that patch ready soon.
      
      There is a very similar optimization that we can do in visitFNEG, so I've put another 'TODO' there and expect to have another patch for that too.
      
      llvm-svn: 214670
  2. Aug 02, 2014
    • [x86] Give this test a bare metal triple so it doesn't use the weird · bec57b40
      Chandler Carruth authored
      Darwin x86 asm comment prefix designed to work around GAS on that
      platform. That makes the comment-matching of the test much more stable.
      
      llvm-svn: 214629
    • [x86] Largely complete the use of PSHUFB in the new vector shuffle · 4c57955f
      Chandler Carruth authored
      lowering with a small addition to it and adding PSHUFB combining.
      
      There is one obvious place in the new vector shuffle lowering where we
      should form PSHUFBs directly: when without them we will unpack a vector
      of i8s across two different registers and do a potentially 4-way blend
      as i16s only to re-pack them into i8s afterward. This is the crazy
      expensive fallback path for i8 shuffles and we can just directly use
      pshufb here as it will always be cheaper (the unpack and pack are
      two instructions so even a single shuffle between them hits our
      three instruction limit for forming PSHUFB).
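
      For reference, the per-byte PSHUFB behavior the commit relies on can be modeled in a few lines of Python (a sketch of the instruction's semantics, not LLVM code):

```python
def pshufb(src, mask):
    # PSHUFB: each output byte is zero when its mask byte's high bit is
    # set; otherwise it selects src[mask & 0x0F] from the same register.
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

src = list(range(10, 26))
assert pshufb(src, [0] * 16) == [10] * 16        # splat of byte 0
assert pshufb(src, [0x80] * 16) == [0] * 16      # zeroing mask
assert pshufb(src, list(range(15, -1, -1))) == src[::-1]
```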
      
      However, this doesn't generate very good code in many cases, and it
      leaves a bunch of common patterns not using PSHUFB. So this patch also
      adds support for extracting a shuffle mask from PSHUFB in the X86
      lowering code, and uses it to handle PSHUFBs in the recursive shuffle
      combining. This allows us to combine through them, combine multiple ones
      together, and generally produce sufficiently high quality code.
      
      Extracting the PSHUFB mask is annoyingly complex because it could be
      either pre-legalization or post-legalization. At least this doesn't have
      to deal with re-materialized constants. =] I've added decode routines to
      handle the different patterns that show up at this level and we dispatch
      through them as appropriate.
      
      The two primary test cases are updated. For the v16 test case there is
      still a lot of room for improvement. Since I was going through it
      systematically I left behind a bunch of FIXME lines that I'm hoping to
      turn into ALL lines by the end of this.
      
      llvm-svn: 214628
    • [x86] Teach the target shuffle mask extraction to recognize unary forms · 34f9a987
      Chandler Carruth authored
      of normally binary shuffle instructions like PUNPCKL and MOVLHPS.
      
      This detects cases where a single register is used for both operands
      making the shuffle behave in a unary way. We detect this and adjust the
      mask to use the unary form which allows the existing DAG combine for
      shuffle instructions to actually work at all.
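
      The mask adjustment is simple to state. Here is a hedged Python sketch (my own naming, not the LLVM API) of folding a two-input mask into its unary form when both operands turn out to be the same register:

```python
def to_unary_mask(mask, width, operands_equal):
    # A two-input shuffle indexes lanes 0..width-1 from the first input
    # and width..2*width-1 from the second. When both operands are the
    # same register, fold the second input's indices back into the first.
    if not operands_equal:
        return mask
    return [i - width if i >= width else i for i in mask]

# PUNPCKLBW on v16i8 interleaves the low bytes of its two inputs:
punpcklbw = [x for i in range(8) for x in (i, 16 + i)]
assert to_unary_mask(punpcklbw, 16, True) == \
    [x for i in range(8) for x in (i, i)]
```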
      
      As a consequence, this uncovered a number of obvious bugs in the
      existing DAG combine which are fixed. It also now canonicalizes several
      shuffles even with the existing lowering. These typically are trying to
      match the shuffle to the domain of the input where before we only really
      modeled them with the floating point variants. All of the cases which
      change to an integer shuffle here have something in the integer domain, so
      there are no more or fewer domain crosses here AFAICT. Technically, it
      might be better to go from a GPR directly to the floating point domain,
      but detecting floating point *outputs* despite integer inputs is a lot
      more code and seems unlikely to be worthwhile in practice. If folks are
      seeing domain-crossing regressions here though, let me know and I can
      hack something up to fix it.
      
      Also as a consequence, a bunch of missed opportunities to form pshufb
      now can be formed. Notably, splats of i8s now form pshufb.
      Interestingly, this improves the existing splat lowering too. We go from
      3 instructions to 1. Yes, we may tie up a register, but it seems very
      likely to be worth it, especially if splatting the 0th byte (the
      common case) as then we can use a zeroed register as the mask.
      
      llvm-svn: 214625
    • [x86] Make some questionable tests not spew assembly to stdout, which · 063f425e
      Chandler Carruth authored
      makes a mess of the lit output when they ultimately fail.
      
      The 2012-10-02-DAGCycle test is really frustrating because the *only*
      explanation for what it is testing is a rdar link. I would really rather
      that rdar links (which are not public or part of the open source
      project) were not committed to the source code. Regardless, the actual
      problem *must* be described as the rdar link is completely opaque. The
      fact that this test didn't check for any particular output further
      exacerbates the inability of any other developer to debug failures.
      
      The mem-promote-integers test has nice comments and *seems* to be
      a great test for our lowering... except that we don't actually check
      that any of the generated code is correct or matches some pattern. We
      just avoid crashing. It would be great to go back and populate this test
      with the actual expectations.
      
      llvm-svn: 214605
    • [X86] Simplify X87 stackifier pass. · 3516669a
      Akira Hatanaka authored
      Stop using ST registers for function returns and inline-asm instructions and use
      FP registers instead. This allows removing a large amount of code in the
      stackifier pass that was needed to track register liveness and handle copies
      between ST and FP registers and function calls returning floating point values.
      
      It also fixes a bug which manifests when an ST register defined by an
      inline-asm instruction was live across another inline-asm instruction, as shown
      in the following sequence of machine instructions:
      
      1. INLINEASM <es:frndint> $0:[regdef], %ST0<imp-def,tied5>
      2. INLINEASM <es:fldcw $0>
      3. %FP0<def> = COPY %ST0
      
      <rdar://problem/16952634>
      
      llvm-svn: 214580
  3. Aug 01, 2014
    • MS inline asm: Hide symbol to attempt to fix test failure on darwin · 6a2de900
      Reid Kleckner authored
      If the symbol comes from an external DSO, it apparently requires
      indirection through a register.
      
      llvm-svn: 214571
    • MS inline asm: Use memory constraints for functions instead of registers · 5b37c181
      Reid Kleckner authored
      This is consistent with how we parse them in a standalone .s file, and
      inline assembly shouldn't differ.
      
      This fixes errors about requiring more registers than available in
      cases like this:
        void f();
        void __declspec(naked) g() {
          __asm pusha
          __asm call f
          __asm popa
          __asm ret
        }
      
      There are no registers available to pass the address of 'f' into the asm
      blob.  The asm should now directly call 'f'.
      
      Tests will land in Clang shortly.
      
      llvm-svn: 214550
    • Explicitly report runtime stack realignment in StackMap section · 87c2b605
      Philip Reames authored
      This change adds code to explicitly mark a function which requires runtime stack realignment as not having a fixed frame size in the StackMap section. As it happens, this is not actually a functional change. The size that would be reported without the check is also "-1", but as far as I can tell, that's an accident. The code change makes this explicit.
      
      Note: There's a separate bug in handling of stackmaps and patchpoints in functions which need dynamic frame realignment. The current code assumes that offsets can be calculated from RBP, but realigned frames must use RSP. (There's a variable gap between RBP and the spill slots.) This change set does not address that issue.
      
      Reviewers: atrick, ributzka
      
      Differential Revision: http://reviews.llvm.org/D4572
      
      llvm-svn: 214534
  4. Jul 31, 2014
  5. Jul 30, 2014
  6. Jul 29, 2014
  7. Jul 28, 2014
  8. Jul 27, 2014
    • [x86] Add a much more powerful framework for combining x86 shuffle · 80c5bfd8
      Chandler Carruth authored
      instructions in the legalized DAG, and leverage it to combine long
      sequences of instructions to PSHUFB.
      
      Eventually, the other x86-instruction-specific shuffle combines will
      probably all be driven out of this routine. But the real motivation is
      to detect after we have fully legalized and optimized a shuffle to the
      minimal number of x86 instructions whether it is profitable to replace
      the chain with a fully generic PSHUFB instruction even though doing so
      requires either a load from a constant pool or tying up a register with
      the mask.
      
      While the Intel manuals claim it should be used when it replaces 5 or
      more instructions (!!!!) my experience is that it is actually very fast
      on modern chips, and so I've gone with a much more aggressive model of
      replacing any sequence of 3 or more instructions.
      
      I've also taught it to do some basic canonicalization to special-purpose
      instructions which have smaller encodings than their generic
      counterparts.
      
      There are still quite a few FIXMEs here, and I've not yet implemented
      support for lowering blends with PSHUFB (where its power really shines
      due to being able to zero out lanes), but this starts implementing real
      PSHUFB support even when using the new, fancy shuffle lowering. =]
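
      One way to see why an arbitrary chain of byte shuffles can always collapse into a single PSHUFB is that byte-shuffle masks compose. A small Python model (assumed semantics sketch, not the LLVM implementation):

```python
def pshufb(src, mask):
    # Per-byte PSHUFB model: the high bit of a mask byte zeroes the lane.
    return [0 if m & 0x80 else src[m & 0x0F] for m in mask]

def compose(outer, inner):
    # Applying `inner` then `outer` equals one shuffle whose mask is
    # `inner` indexed by `outer`, with zeroing lanes propagated through.
    return [0x80 if (o & 0x80) or (inner[o & 0x0F] & 0x80)
            else inner[o & 0x0F] for o in outer]

src = list(range(100, 116))
inner = list(range(15, -1, -1))       # reverse all 16 bytes
outer = [i ^ 1 for i in range(16)]    # swap adjacent byte pairs
assert pshufb(pshufb(src, inner), outer) == pshufb(src, compose(outer, inner))
```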
      
      llvm-svn: 214042
  9. Jul 26, 2014
    • Fix the failing test 'vector-idiv.ll'. · ec981058
      Joey Gouly authored
      On Darwin the comment character is ##.
      
      llvm-svn: 214028
    • [SDAG] When performing post-legalize DAG combining, run the legalizer · 411fb407
      Chandler Carruth authored
      over each node in the worklist prior to combining.
      
      This allows the combiner to produce new nodes which need to go back
      through legalization. This is particularly useful when generating
      operands to target specific nodes in a post-legalize DAG combine where
      the operands are significantly easier to express as pre-legalized
      operations. My immediate use case will be PSHUFB formation where we need
      to build a constant shuffle mask with a build_vector node.
      
      This also refactors the relevant functionality in the legalizer to
      support this, and updates relevant tests. I've spoken to the R600 folks
      and these changes look like improvements to them. The avx512 change
      needs to be investigated, I suspect there is a disagreement between the
      legalizer and the DAG combiner there, but it seems a minor issue so
      leaving it to be re-evaluated after this patch.
      
      Differential Revision: http://reviews.llvm.org/D4564
      
      llvm-svn: 214020
    • llvm/test/CodeGen/X86/vector-idiv.ll: Fix for -Asserts. · f2df3f59
      NAKAMURA Takumi authored
      llvm-svn: 214015
    • [x86] Fix PR20355 (for real). There are many layers to this bug. · 5896698e
      Chandler Carruth authored
      The tale starts with r212808 which attempted to fix inversion of the low
      and high bits when lowering MUL_LOHI. Sadly, that commit did not include
      any positive test cases, and just removed some operations from a test
      case where the actual logic being changed isn't fully visible from the
      test.
      
      What this commit did was two things. First, it reversed the low and high
      results in the formation of the MERGE_VALUES node for the multiple
      results. This is entirely correct.
      
      Second it changed the shuffles for extracting the low and high
      components from the i64 results of the multiplies to extract them
      assuming a big-endian-style encoding of the multiply results. This
      second change is wrong. There is no big-endian encoding in x86, the
      results of the multiplies are normal v2i64s: when cast to v4i32, the low
      i32s are at offsets 0 and 2, and the high i32s are at offsets 1 and 3.
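
      That layout claim is easy to check directly. A short Python demonstration of how a v2i64 reads back as a v4i32 on a little-endian target like x86:

```python
import struct

def v2i64_as_v4i32(v2i64):
    # Reinterpret two i64 lanes as four i32 lanes the way x86 does:
    # little-endian, so each i64's low half comes first in memory.
    raw = struct.pack("<2Q", *v2i64)
    return list(struct.unpack("<4I", raw))

lanes = v2i64_as_v4i32([0x1111111122222222, 0x3333333344444444])
# Low i32s land at offsets 0 and 2, high i32s at offsets 1 and 3.
assert lanes == [0x22222222, 0x11111111, 0x44444444, 0x33333333]
```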
      
      However, the first change wasn't enough to actually fix the bug, which
      is (I assume) why the second change was also made. There was another bug
      in the MERGE_VALUES formation: we weren't using a VTList, and so were
      getting a single result node! When grabbing the *second* result from the
      node, we got... well... could be anything. I think this *appeared* to
      invert things, but had to be causing other problems as well.
      
      Fortunately, I fixed the MERGE_VALUES issue in r213931, so we should
      have been fine, right? NOOOPE! Because the core bug was never addressed,
      the test in vector-idiv failed when I fixed the MERGE_VALUES node.
      Because there are essentially no docs for this node, I had to guess at
      how to fix it and tried swapping the operands, restoring the order of
      the original code before r212808. While this "fixed" the test case (in
      that we produced the right instructions) we were still extracting the
      wrong elements of the i64s, and thus PR20355 was still broken.
      
      This commit essentially reverts the big-endian-style extraction part of
      r212808 and goes back to the original masks which were correct. Now that
      the MERGE_VALUES node formation is also correct, everything works. I've
      also included a more detailed test from PR20355 to make sure this stays
      fixed.
      
      llvm-svn: 214011
    • [x86] Finish switching from CHECK to ALL. This was mistakenly included · 591c16a9
      Chandler Carruth authored
      in r214007 and then reverted when I backed that (very misguided) patch
      out. This recovers the test case cleanup which was good.
      
      llvm-svn: 214010
    • [x86] Revert r214007: Fix PR20355 ... · f6406ac5
      Chandler Carruth authored
      The clever way to implement signed multiplication with unsigned *is
      already implemented* and tested and working correctly. The bug is
      somewhere else. Re-investigating.
      
      This will teach me to not scroll far enough to read the code that did
      what I thought needed to be done.
      
      llvm-svn: 214009
    • [x86] Fix PR20355 (and dups) by not using unsigned multiplication when · 1bf4d191
      Chandler Carruth authored
      signed multiplication is requested. While there is not a difference in
      the *low* half of the result, the *high* half (used specifically to
      implement the signed division by these constants) certainly is used. The
      test case I've nuked was actively asserting wrong code.
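
      The distinction the commit relies on, that signed and unsigned multiplication share a low half but not a high half, can be checked numerically. A Python model of the MULHS/MULHU semantics (names assumed, taken from the DAG node naming convention):

```python
MASK32 = (1 << 32) - 1

def mulhu(a, b):
    # High 32 bits of the unsigned 64-bit product.
    return ((a & MASK32) * (b & MASK32)) >> 32

def mulhs(a, b):
    # High 32 bits of the signed product: sign-extend the inputs first.
    sext = lambda x: (x & MASK32) - (1 << 32) if x & 0x80000000 else x & MASK32
    return ((sext(a) * sext(b)) >> 32) & MASK32

a, b = 0xFFFFFFFF, 7                 # `a` is -1 when treated as signed i32
assert ((a * b) & MASK32) == ((-1 * 7) & MASK32)  # low halves agree
assert mulhu(a, b) == 6                           # but the high halves
assert mulhs(a, b) == MASK32                      # differ (-1 vs 6)
```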
      
      There is a delightful solution to doing signed multiplication even when
      we don't have it that Richard Smith has crafted, but I'll add the
      machinery back and implement that in a follow-up patch. This at least
      restores correctness.
      
      llvm-svn: 214007
    • [x86] Add coverage for PMUL* instruction testing on SSE2 as well as · 80adc640
      Chandler Carruth authored
      SSE4.1.
      
      llvm-svn: 214001
    • [x86] More cleanup for this test -- simplify the command line. · 8709cb4a
      Chandler Carruth authored
      llvm-svn: 213991
    • [x86] FileCheck-ize this test. · 6da2d97a
      Chandler Carruth authored
      llvm-svn: 213988
  10. Jul 25, 2014
    • [stack protector] Fix a potential security bug in stack protector where the · e5b6e0d2
      Akira Hatanaka authored
      address of the stack guard was being spilled to the stack.
      
      Previously the address of the stack guard would get spilled to the stack if it
      was impossible to keep it in a register. This patch introduces a new target
      independent node and pseudo instruction which gets expanded post-RA to a
      sequence of instructions that load the stack guard value. The register
      allocator can now just remat the value when it can't keep it in a register.
      
      <rdar://problem/12475629>
      
      llvm-svn: 213967
    • Recommit r212203: Don't try to construct debug LexicalScopes hierarchy for... · 2f040114
      David Blaikie authored
      Recommit r212203: Don't try to construct debug LexicalScopes hierarchy for functions that do not have top level debug information.
      
      Reverted by Eric Christopher (Thanks!) in r212203 after Bob Wilson
      reported LTO issues. Duncan Exon Smith and Aditya Nandakumar helped
      provide a reduced reproduction, though the failure wasn't too hard to
      guess, and even easier with the example to confirm.
      
      The assertion that the subprogram metadata associated with an
      llvm::Function matches the scope data referenced by the DbgLocs on the
      instructions in that function is not valid under LTO. In LTO, a C++
      inline function might exist in multiple CUs and the subprogram metadata
      nodes will refer to the same llvm::Function. In this case, depending on
      the order of the CUs, the first instance of the subprogram metadata may
      not be the one referenced by the instructions in that function and the
      assertion will fail.
      
      A test case (test/DebugInfo/cross-cu-linkonce-distinct.ll) is added, the
      assertion removed and a comment added to explain this situation.
      
      This was then reverted again in r213581 as it caused PR20367. The root
      cause of this was the early exit in LiveDebugVariables meant that
      spurious DBG_VALUE intrinsics that referenced dead variables were not
      removed, causing an assertion/crash later on. The fix is to have
      LiveDebugVariables strip all DBG_VALUE intrinsics in functions without
      debug info as they're not needed anyway. Test case added to cover this
      situation (that occurs when a debug-having function is inlined into a
      nodebug function) in test/DebugInfo/X86/nodebug_with_debug_loc.ll
      
      Original commit message:
      
      If a function isn't actually in a CU's subprogram list in the debug info
      metadata, ignore all the DebugLocs and don't try to build scopes, track
      variables, etc.
      
      While this is possibly a minor optimization, it's also a correctness fix
      for an incoming patch that will add assertions to LexicalScopes and the
      debug info verifier to ensure that all scope chains lead to debug info
      for the current function.
      
      Fix up a few test cases that had broken/incomplete debug info that could
      violate this constraint.
      
      Add a test case where this occurs by design (inlining a
      debug-info-having function in an attribute nodebug function - we want
      this to work because /if/ the nodebug function is then inlined into a
      debug-info-having function, it should be fine (and will work fine - we
      just stitch the scopes up as usual), but should the inlining not happen
      we need to not assert fail either).
      
      llvm-svn: 213952
    • DebugInfo: Fix up some test cases to have more correct debug info metadata. · 48af9c35
      David Blaikie authored
      * Add CUs to the named CU node
      * Add missing DW_TAG_subprogram nodes
      * Add llvm::Functions to the DW_TAG_subprogram nodes
      
      This cleans up the tests so that they don't break under a
      soon-to-be-made change that is more strict about such things.
      
      llvm-svn: 213951
    • [X86] Add comments to clarify some non-obvious lines in the stackmap-nops.ll · 98c3c0f3
      Lang Hames authored
      testcases.
      
      Based on code review from Philip Reames. Thanks Philip!
      
      llvm-svn: 213923
    • [SDAG] Introduce a combined set to the DAG combiner which tracks nodes · 9f4530b9
      Chandler Carruth authored
      which have successfully round-tripped through the combine phase, and use
      this to ensure all operands to DAG nodes are visited by the combiner,
      even if they are only added during the combine phase.
      
      This is critical to have the combiner reach nodes that are *introduced*
      during combining. Previously these would sometimes be visited and
      sometimes not be visited based on whether they happened to end up on the
      worklist or not. Now we always run them through the combiner.
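
      The "combined set" idea can be sketched abstractly (a toy worklist over a dictionary graph, not the DAGCombiner API): nodes introduced while combining still get queued because membership in the set, not presence on the initial worklist, decides whether a node has been visited.

```python
def combine_all(roots, combine, operands):
    # Run `combine` over every node reachable from `roots`, including
    # operands that only come into existence during combining.
    worklist = list(roots)
    combined = set()
    while worklist:
        node = worklist.pop()
        if node in combined:
            continue
        combined.add(node)
        combine(node)
        for op in operands(node):
            if op not in combined:
                worklist.append(op)
    return combined

# Hypothetical graph where combining "a" introduces a new operand "new".
graph = {"a": ["b"], "b": [], "new": []}
def combine(n):
    if n == "a":
        graph["a"] = ["b", "new"]   # node added mid-combine
assert combine_all(["a"], combine, lambda n: graph[n]) == {"a", "b", "new"}
```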
      
      This fixes quite a few bad codegen test cases lurking in the suite while
      also being more principled. Among these, the TLS codegeneration is
      particularly exciting for programs that have this in the critical path
      like TSan-instrumented binaries (although I think they engineer to use
      a different TLS that is faster anyways).
      
      I've tried to check for compile-time regressions here by running llc
      over a merged (but not LTO-ed) clang bitcode file and observed at most
      a 3% slowdown in llc. Given that this is essentially a worst case (none
      of opt or clang are running at this phase) I think this is tolerable.
      The actual LTO case should be even less costly, and the cost in normal
      compilation should be negligible.
      
      With this combining logic, it is possible to re-legalize as we combine
      which is necessary to implement PSHUFB formation on x86 as
      a post-legalize DAG combine (my ultimate goal).
      
      Differential Revision: http://reviews.llvm.org/D4638
      
      llvm-svn: 213898
    • [x86] Make vector legalization of extloads work more like the "normal" · 80b86946
      Chandler Carruth authored
      vector operation legalization with support for custom target lowering
      and fallback to expand when it fails, and use this to implement sext and
      anyext load lowering for x86 in a more principled way.
      
      Previously, the x86 backend relied on a target DAG combine to "combine
      away" sextload and extload nodes prior to legalization, or would expand
      them during legalization with terrible code. This is particularly
      problematic because the DAG combine relies on running over non-canonical
      DAG nodes at just the right time to match several common and important
      patterns. It used a combine rather than lowering because we didn't have
      good lowering support, and to expose some tricks being employed to more
      combine phases.
      
      With this change it becomes a proper lowering operation, the backend
      marks that it can lower these nodes, and I've added support for handling
      the canonical forms that don't have direct legal representations such as
      sextload of a v4i8 -> v4i64 on AVX1. With this change, our test cases
      for this behavior continue to pass even after the DAG combiner begins
      running more systematically over every node.
      
      There is some noise caused by this in the test suite where we actually
      use vector extends instead of subregister extraction. This doesn't
      really seem like the right thing to do, but is unlikely to be a critical
      regression. We do regress in one case where by lowering to the
      target-specific patterns early we were able to combine away extraneous
      legal math nodes. However, this regression is completely addressed by
      switching to a widening based legalization which is what I'm working
      toward anyways, so I've just switched the test to that mode.
      
      Differential Revision: http://reviews.llvm.org/D4654
      
      llvm-svn: 213897
  11. Jul 24, 2014
    • [X86] Optimize stackmap shadows on X86. · f49bc3f1
      Lang Hames authored
      This patch minimizes the number of nops that must be emitted on X86 to satisfy
      stackmap shadow constraints.
      
      To minimize the number of nops inserted, the X86AsmPrinter now records the
      size of the most recent stackmap's shadow in the StackMapShadowTracker class,
      and tracks the number of instruction bytes emitted since that stackmap
      instruction was encountered. Padding is emitted (if it is required at all)
      immediately before the next stackmap/patchpoint instruction, or at the end of
      the basic block.
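
      The padding rule reduces to a small computation. A hedged sketch (parameter names mine) of how much padding is owed at the point where it must be emitted:

```python
def nops_needed(shadow_size, bytes_since_stackmap):
    # A stackmap's shadow must be covered by instruction bytes; only the
    # shortfall is padded with nops, and only at the last possible
    # moment (the next stackmap/patchpoint or the end of the block).
    return max(0, shadow_size - bytes_since_stackmap)

assert nops_needed(8, 3) == 5    # 3 real bytes emitted, pad 5
assert nops_needed(8, 12) == 0   # shadow already covered, no nops
```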
      
      This optimization should reduce code-size and improve performance for people
      using the llvm stackmap intrinsic on X86.
      
      <rdar://problem/14959522>
      
      llvm-svn: 213892