  1. Feb 15, 2017
    • [AMDGPU] Revert failed scheduling · 582a5237
      Stanislav Mekhanoshin authored
      This patch reverts a region's schedule to the original untouched state
      in case we have decreased occupancy.
      
      In addition, it switches to using the TargetRegisterInfo occupancy
      callback for pressure limits instead of gradually increasing limits
      that were simply being passed. Since we are going to stay with the best
      schedule, we no longer need to tolerate worsened scheduling.
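      A minimal self-contained sketch of the revert-on-regression idea (the
      Region type and the callbacks below are hypothetical stand-ins, not the
      actual GCN scheduler API):

      #include <functional>
      #include <vector>

      // Hypothetical stand-in for a scheduling region: just an instruction order.
      struct Region {
        std::vector<unsigned> Instrs;
      };

      // Keep the new schedule only if occupancy did not decrease; otherwise
      // restore the original, untouched instruction order.  Occupancy comes
      // from a callback, mirroring the switch to asking the target rather
      // than tracking gradually increasing pressure limits.
      void scheduleOrRevert(Region &R,
                            const std::function<void(Region &)> &Schedule,
                            const std::function<unsigned(const Region &)> &Occupancy) {
        const std::vector<unsigned> Original = R.Instrs; // snapshot before scheduling
        const unsigned OccBefore = Occupancy(R);
        Schedule(R);
        if (Occupancy(R) < OccBefore)
          R.Instrs = Original;                           // revert failed scheduling
      }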
      
      Differential Revision: https://reviews.llvm.org/D29971
      
      llvm-svn: 295206
  2. Dec 11, 2016
    • [Verifier] Add verification for TBAA metadata · 3336f681
      Sanjoy Das authored
      Summary:
      This change adds some verification in the IR verifier around struct path
      TBAA metadata.
      
      Other than some basic sanity checks (e.g. that we get constant integers
      where we expect constant integers), this checks:
      
       - That by the time a struct access tuple `(base-type, offset)` is
         "reduced" to a scalar base type, the offset is `0`.  For instance, in
         C++ you can't start from, say, `("struct-a", 16)`, and end up with
         `("int", 4)` -- by the time the base type is `"int"`, the offset had
         better be zero.  In particular, a variant of this invariant is needed
         for `llvm::getMostGenericTBAA` to be correct.
      
       - That there are no cycles in a struct path.
      
       - That struct type nodes have their offsets listed in an ascending
         order.
      
       - That when generating the struct access path, you eventually reach the
         access type listed in the tbaa tag node.  (A self-contained sketch of
         these checks follows this list.)
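      As a rough illustration of the checks above, here is a self-contained
      model of the struct-path walk. It uses a toy TBAANode type rather than
      the real MDNode API, so it is only a sketch of the invariants, not the
      verifier code:

      #include <cstdint>
      #include <set>
      #include <utility>
      #include <vector>

      // Hypothetical model of struct-path TBAA type nodes.
      struct TBAANode {
        bool IsScalar = false;
        // For struct nodes: (field type, field offset) pairs, expected in
        // ascending offset order.
        std::vector<std::pair<const TBAANode *, uint64_t>> Fields;
      };

      // Walk (BaseTy, Offset) toward AccessTy, enforcing: no cycles, ascending
      // field offsets, offset == 0 once the base type is scalar, and the walk
      // must end at the access type.
      bool verifyAccessPath(const TBAANode *BaseTy, uint64_t Offset,
                            const TBAANode *AccessTy) {
        std::set<const TBAANode *> Visited;
        while (true) {
          if (!Visited.insert(BaseTy).second)
            return false;                          // cycle in the struct path
          if (BaseTy->IsScalar)
            return Offset == 0 && BaseTy == AccessTy;
          const TBAANode *Next = nullptr;
          uint64_t FieldOff = 0, PrevOff = 0;
          for (size_t I = 0; I < BaseTy->Fields.size(); ++I) {
            uint64_t Off = BaseTy->Fields[I].second;
            if (I > 0 && Off < PrevOff)
              return false;                        // offsets not ascending
            PrevOff = Off;
            if (Off <= Offset) {                   // last field at or before Offset
              Next = BaseTy->Fields[I].first;
              FieldOff = Off;
            }
          }
          if (!Next)
            return false;                          // no field covers this offset
          Offset -= FieldOff;                      // step into the field
          BaseTy = Next;
        }
      }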
      
      Reviewers: dexonsmith, chandlerc, reames, mehdi_amini, manmanren
      
      Subscribers: mcrosier, llvm-commits
      
      Differential Revision: https://reviews.llvm.org/D26438
      
      llvm-svn: 289402
  3. Nov 25, 2016
  4. Nov 23, 2016
    • AMDGPU: Fix MMO when splitting spill · 2669a76f
      Matt Arsenault authored
      The size and offset were wrong. The size of the whole object was being
      used for the size of the access, when the access here is really being
      split into 4-byte accesses. The underlying object size is set in the
      MachinePointerInfo, which also didn't have the offset set.
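      A minimal sketch of the fixed bookkeeping, using a toy MemOperand struct
      rather than the real MachineMemOperand/MachinePointerInfo API (the field
      names here are assumptions for illustration only):

      #include <cstdint>
      #include <vector>

      // Toy stand-in for a memory operand: the underlying frame object, the
      // offset of this particular access into it, and the access size.
      struct MemOperand {
        int FrameIndex;
        int64_t Offset;   // offset of this access within the object
        uint64_t Size;    // size of this access, not of the whole object
      };

      // Split an N-byte spill of a frame object into 4-byte accesses, giving
      // each piece its own offset and a 4-byte size (the bug fixed here was
      // using the whole object's size and no offset for every piece).
      std::vector<MemOperand> splitSpill(int FrameIndex, uint64_t ObjectSize) {
        std::vector<MemOperand> Pieces;
        for (uint64_t Off = 0; Off < ObjectSize; Off += 4)
          Pieces.push_back({FrameIndex, static_cast<int64_t>(Off), 4});
        return Pieces;
      }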
      
      llvm-svn: 287806
  5. Oct 13, 2016
    • Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled." · a81682aa
      Nirav Dave authored
      This reverts commit r284151, which appears to be triggering LTO
      failures on Hexagon.
      
      llvm-svn: 284157
    • In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. · 4b369572
      Nirav Dave authored
         Retrying after upstream changes.
      
         Simplify Consecutive Merge Store Candidate Search
      
         Now that address aliasing is much less conservative, push through a
         simplified store-merging search which only checks for parallel stores
         through the chain subgraph. This is cleaner, as it separates the
         handling of non-interfering loads/stores from the store-merging logic.
      
         When merging stores, we search up the chain through a single load and
         find all possible stores by looking down through a load and a
         TokenFactor to all stores visited. This improves the quality of the
         output SelectionDAG and generally the output CodeGen (with some
         exceptions).
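         A self-contained model of that candidate search (toy Node/Kind types
         rather than the real SDNode API; this is only a sketch of the chain
         walk, not DAGCombiner code):

         #include <vector>

         enum class Kind { Store, Load, TokenFactor, Other };
         struct Node {
           Kind K;
           std::vector<Node *> ChainOps;   // incoming chain operands
         };

         // Starting from a store, walk up its chain and collect stores hanging
         // off the same chain subgraph; these "parallel" stores are the merge
         // candidates.  Anything else ends the search, which is what keeps it
         // simpler than a general alias-analysis walk.
         void collectCandidateStores(Node *St, std::vector<Node *> &Candidates) {
           for (Node *Chain : St->ChainOps) {
             if (Chain->K == Kind::Store) {
               Candidates.push_back(Chain);      // directly chained parallel store
             } else if (Chain->K == Kind::Load || Chain->K == Kind::TokenFactor) {
               // Look through a single load, or fan out through a TokenFactor,
               // to the stores feeding its chain; visit those as candidates too.
               for (Node *N : Chain->ChainOps)
                 if (N->K == Kind::Store)
                   Candidates.push_back(N);
             }
             // Any other chain node ends the search along this edge.
           }
         }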
      
         Additional Minor Changes:
      
             1. Finishes removing unused AliasLoad code
             2. Unifies the chain aggregation in the merged stores across
             code paths
             3. Re-add the Store node to the worklist after calling
             SimplifyDemandedBits.
             4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
             arbitrary, but seemed sufficient to not cause regressions in
             tests.
      
         This finishes the change Matt Arsenault started in r246307 and
         jyknight's original patch.
      
         Many tests required some changes as memory operations are now
         reorderable. Some tests relying on the order were changed to use
         volatile memory operations.
      
         Noteworthy tests:
      
          CodeGen/AArch64/argument-blocks.ll -
            It's not entirely clear what the test_varargs_stackalign test is
            supposed to be asserting, but the new code looks right.
      
          CodeGen/AArch64/arm64-memset-inline.lli -
          CodeGen/AArch64/arm64-stur.ll -
          CodeGen/ARM/memset-inline.ll -
      
            The backend now generates *worse* code due to store merging
            succeeding, as we do not do a 16-byte constant-zero store efficiently.
      
          CodeGen/AArch64/merge-store.ll -
            Improved, but there still seems to be an extraneous vector insert
            from an element to itself?
      
          CodeGen/PowerPC/ppc64-align-long-double.ll -
            Worse code emitted in this case, due to the improved store->load
            forwarding.
      
          CodeGen/X86/dag-merge-fast-accesses.ll -
          CodeGen/X86/MergeConsecutiveStores.ll -
          CodeGen/X86/stores-merging.ll -
          CodeGen/Mips/load-store-left-right.ll -
            Restored correct merging of non-aligned stores
      
          CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
            Improved. Correctly merges buffer_store_dword calls
      
          CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
            Improved. Sidesteps loading a stored value and
            merges two stores
      
          CodeGen/X86/pr18023.ll -
            This test has been removed, as it was asserting incorrect
            behavior. Non-volatile stores *CAN* be moved past volatile loads,
            and now are.
      
          CodeGen/X86/vector-idiv.ll -
          CodeGen/X86/vector-lzcnt-128.ll -
            It's basically impossible to tell what these tests are actually
            testing. But it looks like the code got better due to the memory
            operations being recognized as non-aliasing.
      
          CodeGen/X86/win32-eh.ll -
            Both loads of the securitycookie are now merged.
      
          CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll -
            This test appears to work but no longer exhibits the spill behavior.
      
      Reviewers: arsenm, hfinkel, tstellarAMD, jyknight, nhaehnle
      
      Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, dsanders, resistor, tstellarAMD, t.p.northover, spatel
      
      Differential Revision: https://reviews.llvm.org/D14834
      
      llvm-svn: 284151
  6. Sep 28, 2016
    • Revert "In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled." · e524f508
      Nirav Dave authored
      This reverts commit r282600 due to test failures with MCJIT.
      
      llvm-svn: 282604
    • In visitSTORE, always use FindBetterChain, rather than only when UseAA is enabled. · e17e055b
      Nirav Dave authored
        Simplify Consecutive Merge Store Candidate Search
      
        Now that address aliasing is much less conservative, push through a
        simplified store-merging search which only checks for parallel stores
        through the chain subgraph. This is cleaner, as it separates the
        handling of non-interfering loads/stores from the store-merging logic.
      
        When merging stores, we search up the chain through a single load and
        find all possible stores by looking down through a load and a
        TokenFactor to all stores visited. This improves the quality of the
        output SelectionDAG and generally the output CodeGen (with some
        exceptions).
      
        Additional Minor Changes:
      
          1. Finishes removing unused AliasLoad code
          2. Unifies the chain aggregation in the merged stores across
             code paths
          3. Re-add the Store node to the worklist after calling
             SimplifyDemandedBits.
          4. Increase GatherAllAliasesMaxDepth from 6 to 18. That number is
             arbitrary, but seemed sufficient to not cause regressions in
             tests. (A sketch of this depth-limited search follows the list.)
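        A self-contained model of what that depth limit bounds (toy ChainNode
        type; the real GatherAllAliases walks SDNode chains with proper alias
        queries, so this only illustrates the cut-off):

        #include <vector>

        // Toy chain node: whether it may alias the root memory operation, and
        // its chain predecessors.
        struct ChainNode {
          bool MayAlias;
          std::vector<ChainNode *> Preds;
        };

        constexpr unsigned GatherAllAliasesMaxDepth = 18; // raised from 6 here

        // Walk chain predecessors of Root, collecting possible aliases, but
        // give up once more than MaxDepth nodes have been visited; the caller
        // must then stay conservative and keep the original chain.
        bool gatherAliases(ChainNode *Root, std::vector<ChainNode *> &Aliases) {
          std::vector<ChainNode *> Worklist = Root->Preds;
          unsigned Visited = 0;
          while (!Worklist.empty()) {
            if (++Visited > GatherAllAliasesMaxDepth)
              return false;                    // search too deep: give up
            ChainNode *N = Worklist.back();
            Worklist.pop_back();
            if (N->MayAlias)
              Aliases.push_back(N);            // becomes a new chain dependency
            else
              Worklist.insert(Worklist.end(), N->Preds.begin(), N->Preds.end());
          }
          return true;                         // whole chain explored in budget
        }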
      
        This finishes the change Matt Arsenault started in r246307 and
        jyknight's original patch.
      
        Many tests required some changes as memory operations are now
        reorderable. Some tests relying on the order were changed to use
        volatile memory operations.
      
        Noteworthy tests:
      
          CodeGen/AArch64/argument-blocks.ll -
            It's not entirely clear what the test_varargs_stackalign test is
            supposed to be asserting, but the new code looks right.
      
          CodeGen/AArch64/arm64-memset-inline.lli -
          CodeGen/AArch64/arm64-stur.ll -
          CodeGen/ARM/memset-inline.ll -
            The backend now generates *worse* code due to store merging
            succeeding, as we do not do a 16-byte constant-zero store efficiently.
      
          CodeGen/AArch64/merge-store.ll -
            Improved, but there still seems to be an extraneous vector insert
            from an element to itself?
      
          CodeGen/PowerPC/ppc64-align-long-double.ll -
            Worse code emitted in this case, due to the improved store->load
            forwarding.
      
          CodeGen/X86/dag-merge-fast-accesses.ll -
          CodeGen/X86/MergeConsecutiveStores.ll -
          CodeGen/X86/stores-merging.ll -
          CodeGen/Mips/load-store-left-right.ll -
            Restored correct merging of non-aligned stores
      
          CodeGen/AMDGPU/promote-alloca-stored-pointer-value.ll -
            Improved. Correctly merges buffer_store_dword calls
      
          CodeGen/AMDGPU/si-triv-disjoint-mem-access.ll -
            Improved. Sidesteps loading a stored value and merges two stores
      
          CodeGen/X86/pr18023.ll -
            This test has been removed, as it was asserting incorrect
            behavior. Non-volatile stores *CAN* be moved past volatile loads,
            and now are.
      
          CodeGen/X86/vector-idiv.ll -
          CodeGen/X86/vector-lzcnt-128.ll -
            It's basically impossible to tell what these tests are actually
            testing. But it looks like the code got better due to the memory
            operations being recognized as non-aliasing.
      
          CodeGen/X86/win32-eh.ll -
            Both loads of the securitycookie are now merged.
      
          CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot-compute.ll -
            This test appears to work but no longer exhibits the spill
            behavior.
      
      Reviewers: arsenm, hfinkel, tstellarAMD, nhaehnle, jyknight
      
      Subscribers: wdng, nhaehnle, nemanjai, arsenm, weimingz, niravd, RKSimon, aemerson, qcolombet, resistor, tstellarAMD, t.p.northover, spatel
      
      Differential Revision: https://reviews.llvm.org/D14834
      
      llvm-svn: 282600
  7. Aug 29, 2016
    • AMDGPU/SI: Implement a custom MachineSchedStrategy · 0d23ebe8
      Tom Stellard authored
      Summary:
      GCNSchedStrategy re-uses most of GenericScheduler; it just uses a
      different method to compute the excess and critical register pressure
      limits.
      
      It's not enabled by default; to enable it, pass -misched=gcn to llc.
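      A self-contained sketch of that relationship (toy classes, not the real
      MachineSchedStrategy interface; the limit formulas are illustrative
      assumptions only). The point is that only the limit computation is
      overridden, while the generic scheduling decisions are shared:

      #include <algorithm>

      class GenericStrategyModel {
      public:
        virtual ~GenericStrategyModel() = default;
        virtual unsigned getExcessPressureLimit() const { return 128; }   // placeholder
        virtual unsigned getCriticalPressureLimit() const { return 160; } // placeholder
        // Shared generic heuristic: react once pressure crosses the limits.
        bool shouldReduce(unsigned CurrentPressure) const {
          return CurrentPressure > std::min(getExcessPressureLimit(),
                                            getCriticalPressureLimit());
        }
      };

      class GCNStrategyModel : public GenericStrategyModel {
        unsigned TargetOccupancy; // desired waves per SIMD (assumed input)
      public:
        explicit GCNStrategyModel(unsigned Occ) : TargetOccupancy(Occ) {}
        // Derive the limits from the registers available at the desired
        // occupancy instead of using the generic defaults.
        unsigned getExcessPressureLimit() const override {
          return 256 / std::max(1u, TargetOccupancy);   // illustrative budget
        }
        unsigned getCriticalPressureLimit() const override {
          return getExcessPressureLimit();
        }
      };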
      
      Shader DB stats:
      
      32464 shaders in 17874 tests
      Totals:
      SGPRS: 1542846 -> 1643125 (6.50 %)
      VGPRS: 1005595 -> 904653 (-10.04 %)
      Spilled SGPRs: 29929 -> 27745 (-7.30 %)
      Spilled VGPRs: 334 -> 352 (5.39 %)
      Scratch VGPRs: 1612 -> 1624 (0.74 %) dwords per thread
      Code Size: 36688188 -> 37034900 (0.95 %) bytes
      LDS: 1913 -> 1913 (0.00 %) blocks
      Max Waves: 254101 -> 265125 (4.34 %)
      Wait states: 0 -> 0 (0.00 %)
      
      Totals from affected shaders:
      SGPRS: 1338220 -> 1438499 (7.49 %)
      VGPRS: 886221 -> 785279 (-11.39 %)
      Spilled SGPRs: 29869 -> 27685 (-7.31 %)
      Spilled VGPRs: 334 -> 352 (5.39 %)
      Scratch VGPRs: 1612 -> 1624 (0.74 %) dwords per thread
      Code Size: 34315716 -> 34662428 (1.01 %) bytes
      LDS: 1551 -> 1551 (0.00 %) blocks
      Max Waves: 188127 -> 199151 (5.86 %)
      Wait states: 0 -> 0 (0.00 %)
      
      Reviewers: arsenm, mareko, nhaehnle, MatzeB, atrick
      
      Subscribers: arsenm, kzhuravl, llvm-commits
      
      Differential Revision: https://reviews.llvm.org/D23688
      
      llvm-svn: 279995
  8. Jul 12, 2016
  9. Jun 13, 2016
  10. May 21, 2016
    • AMDGPU: Define priorities for register classes · 7f9eabd2
      Matt Arsenault authored
      Allocating larger register classes first should give better allocation
      results (and, more importantly for me, make the lit tests more stable
      with respect to scheduler changes).
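      A minimal model of the heuristic (the RegClassModel type and size-based
      priority numbers are assumptions for illustration; the actual change
      defines priorities for the AMDGPU register classes):

      #include <algorithm>
      #include <string>
      #include <vector>

      struct RegClassModel {
        std::string Name;
        unsigned NumRegsPerValue;   // e.g. 1 for a 32-bit class, 4 for 128-bit
        unsigned AllocationPriority = 0;
      };

      // Give wider register classes a higher priority so they are allocated
      // first, before smaller classes fragment the register file.
      void assignPriorities(std::vector<RegClassModel> &Classes) {
        for (RegClassModel &RC : Classes)
          RC.AllocationPriority = RC.NumRegsPerValue; // larger => higher priority
        std::sort(Classes.begin(), Classes.end(),
                  [](const RegClassModel &A, const RegClassModel &B) {
                    return A.AllocationPriority > B.AllocationPriority;
                  });
      }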
      
      Patch by Matthias Braun
      
      llvm-svn: 270312
  11. May 11, 2016
  12. Apr 30, 2016
  13. Apr 29, 2016
    • AMDGPU/SI: Assembler: Unify parsing/printing of operands. · 4f672a34
      Nikolay Haustov authored
      Summary:
      The goal is for each operand type to have its own parse function and at
      the same time share common code for tracking state, as different
      instruction types share operand types (e.g. glc/glc_flat).

      Introduce parseAMDGPUOperand, which can parse any optional operand.
      DPP and Clamp/OMod have custom handling for now. Sam also suggested
      having a class hierarchy for operand types instead of a table; this
      can be done in a separate change.
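      A self-contained sketch of a single entry point that handles any optional
      operand (toy types and operand names; the real parseAMDGPUOperand works
      on the MC assembly lexer, so this only models the shared default
      handling):

      #include <map>
      #include <string>

      struct ParsedOperands {
        std::map<std::string, int> Values;   // operand name -> parsed value
      };

      // Defaults applied when an optional operand is not written in the asm.
      static const std::map<std::string, int> OptionalDefaults = {
          {"glc", 0}, {"slc", 0}, {"tfe", 0}, {"clamp", 0}, {"omod", 0}};

      // Parse "name" or "name:value" tokens; anything unknown is left to the
      // caller's instruction-specific parsing.
      bool parseOptionalOperand(const std::string &Tok, ParsedOperands &Ops) {
        std::string Name = Tok;
        int Value = 1;                              // bare flag means "enabled"
        if (auto Colon = Tok.find(':'); Colon != std::string::npos) {
          Name = Tok.substr(0, Colon);
          Value = std::stoi(Tok.substr(Colon + 1));
        }
        if (!OptionalDefaults.count(Name))
          return false;                             // not an optional operand
        Ops.Values[Name] = Value;
        return true;
      }

      // After parsing, fill in defaults for optional operands never written.
      void applyDefaults(ParsedOperands &Ops) {
        for (const auto &KV : OptionalDefaults)
          Ops.Values.insert(KV);                    // insert() keeps parsed values
      }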
      
      Remove parseVOP3OptionalOps, parseDS*OptionalOps, parseFlatOptionalOps,
      parseMubufOptionalOps, parseDPPOptionalOps.
      Reduce the number of AsmOperand and MatchClass definitions by using a
      common base class.
      Rename AsmMatcher/InstPrinter methods accordingly.
      Print the immediate type when printing a parsed immediate operand.
      Use 'off' if the offset/index register is unused, instead of skipping it,
      to make the output more readable (also agreed with SP3).
      Update tests.
      
      Reviewers: tstellarAMD, SamWot, artem.tamazov
      
      Subscribers: qcolombet, arsenm, llvm-commits
      
      Differential Revision: http://reviews.llvm.org/D19584
      
      llvm-svn: 268015
  14. Apr 14, 2016
  15. Apr 06, 2016
  16. Feb 13, 2016
  17. Feb 12, 2016
    • AMDGPU: Set element_size in private resource descriptor · 24ee0785
      Matt Arsenault authored
      Introduce a subtarget feature for this, and leave the default as the
      current behavior, which assumes up to 16-byte loads/stores can be used.
      The field also seems to be able to be set to 2 bytes, but I'm not sure
      what that would be used for.
      
      llvm-svn: 260651
  18. Nov 30, 2015
    • AMDGPU: Rework how private buffer passed for HSA · 26f8f3db
      Matt Arsenault authored
      If we know we have stack objects, we reserve the registers in which
      the private buffer resource and wave offset are passed and use them
      directly.

      If not, reserve the last 5 SGPRs just in case we need to spill.
      After register allocation, try to pick the next available registers
      instead of the last SGPRs, and then insert copies from the inputs
      to the reserved registers in the prologue.

      This also selectively enables only the input registers which are
      really required, instead of always enabling them.
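      A self-contained sketch of the reservation decision (toy FrameInfoModel
      type, an assumed SGPR count, and plain register indices; the real code
      works with SIMachineFunctionInfo and physical registers):

      #include <vector>

      struct FrameInfoModel {
        bool HasStackObjects;
        std::vector<unsigned> InputSGPRs; // SGPRs the resource/offset arrive in
        unsigned NumSGPRs = 104;          // illustrative SGPR count
      };

      // Return the SGPRs reserved for the private buffer resource and wave
      // offset.  With known stack objects the input registers are used
      // directly; otherwise the last five SGPRs are held back in case a
      // spill appears later.
      std::vector<unsigned> reservePrivateBufferRegs(const FrameInfoModel &MFI) {
        if (MFI.HasStackObjects)
          return MFI.InputSGPRs;          // use the registers they were passed in
        std::vector<unsigned> Reserved;
        for (unsigned I = 0; I < 5; ++I)
          Reserved.push_back(MFI.NumSGPRs - 1 - I); // last 5 SGPRs, just in case
        return Reserved;
      }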
      
      llvm-svn: 254331
    • AMDGPU: Remove SIPrepareScratchRegs · 0e3d3893
      Matt Arsenault authored
      It does not work because of emergency stack slots.
      This pass was supposed to eliminate dummy registers for the
      spill instructions, but the register scavenger can introduce
      more during PrologEpilogInserter, so some would end up
      left behind if they were needed.
      
      The potential for spilling the scratch resource descriptor
      and offset register makes doing something like this
      overly complicated. Reserve registers to use for the resource
      descriptor and use them directly in eliminateFrameIndex.
      
      Also removes creating another scratch resource descriptor
      when directly selecting scratch MUBUF instructions.
      
      The choice of which registers are reserved is temporary.
      For now it attempts to pick the next available registers
      after the user and system SGPRs.
      
      llvm-svn: 254329
  19. Nov 06, 2015