  1. Jul 27, 2021
    • [OpenMP] Prototype opt-in new GPU device RTL · 67ab875f
      Johannes Doerfert authored
      The "old" OpenMP GPU device runtime (D14254) has served us well for many
      years but modernizing it has caused some pain recently. This patch
      introduces an alternative which is mostly written from scratch embracing
      OpenMP 5.X, C++, LLVM coding style (where applicable), and conceptual
      interfaces. This new runtime is opt-in through a clang flag (D106793).
      The new runtime is currently only built for nvptx and has "-new" in its
      name.
      
      The design is tailored towards middle-end optimizations rather than
      front-end code generation choices, a trend we already started in the old
      runtime a while back. In contrast to the old one, state is organized in
      a simple manner rather than a "smart" one. While this can induce costs,
      it helps optimizations. Our expectation is that the majority of codes can
      be optimized, so a "simple" design is preferable. The new runtime also
      avoids making users pay for things they do not use, especially with
      respect to memory. The unlikely case of nested parallelism is still
      supported, but it is made costly so that the more likely case uses fewer
      resources.
      
      The worksharing and reduction implementations have been taken from the
      old runtime and will be rewritten in the future if necessary.
      
      Documentation and debug features are still mostly missing and will be
      added over time.
      
      All external symbols start with `__kmpc` for legacy reasons but should
      be renamed once we switch over to a single runtime. All internal symbols
      are placed in appropriate namespaces (anonymous or `_OMP`) to avoid name
      clashes with user symbols.
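
      As an illustration of that naming scheme (a hypothetical sketch, not actual
      device RTL code), an external entry point keeps its `__kmpc` prefix while
      the implementation lives in the `_OMP` namespace:
      ```
      // Hypothetical sketch of the symbol layout, not actual device RTL code.
      namespace _OMP {
      namespace impl {
      // Internal helper lives in a namespace so it cannot clash with user symbols.
      inline int exampleQueryImpl() { return 0; /* target-specific logic */ }
      } // namespace impl
      } // namespace _OMP

      extern "C" {
      // The external entry point keeps the legacy __kmpc prefix for now.
      int __kmpc_example_query() { return _OMP::impl::exampleQueryImpl(); }
      }
      ```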
      
      Differential Revision: https://reviews.llvm.org/D106803
    • [AbstractAttributor] Fold __kmpc_parallel_level if possible · e97e0a4f
      Shilei Tian authored
      Similar to D105787, this patch tries to fold `__kmpc_parallel_level` if possible.
      Note that `__kmpc_parallel_level` doesn't take activeness into consideration;
      based on the current `deviceRTLs`, its return values are plain levels such as
      0, 1, 2, instead of values such as 0, 129, 130 that also encode activeness.
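
      For context, a rough illustration of the difference, assuming (based on the
      0/129/130 values above) that the old encoding used bit 128 to mark an active
      parallel region:
      ```
      #include <cstdint>

      // Illustration only; the folded __kmpc_parallel_level in the current
      // deviceRTLs returns the plain level without the assumed "active" bit.
      constexpr uint8_t ActiveLevelBit = 128; // assumed marker for "active"

      constexpr uint8_t plainLevel(uint8_t Encoded) {
        return Encoded & ~ActiveLevelBit; // 129 -> 1, 130 -> 2
      }
      constexpr bool isActive(uint8_t Encoded) {
        return (Encoded & ActiveLevelBit) != 0;
      }

      static_assert(plainLevel(129) == 1 && isActive(129), "encoding example");
      ```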
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106154
  2. Jul 26, 2021
  3. Jul 25, 2021
  4. Jul 23, 2021
    • [OpenMP] Fix bug 50022 · c2c43132
      Shilei Tian authored
      Bug 50022 [0] reports that target nowait fails in a certain case, which is
      added as a test in this patch. The root cause of the failure is that, when the
      second task is created, its parent's `td_incomplete_child_tasks` is not
      incremented because there is no parallel region here and the team is therefore
      serialized. As a result, when the initial thread waits for its unfinished
      child tasks, it thinks there is only one, the first task, because that one is
      a hidden helper task and is therefore tracked. The second task is only pushed
      to the queue once the first task finishes. However, when the first task
      finishes, it first decrements its parent's counter and then releases
      dependences. Once the counter is decremented, the waiting thread moves on
      because its counter is reset, even though the second task has not been
      executed at all. As a result, the main function then finishes and `libomp`
      starts to shut down. By the time the second task is pushed, some of the
      structures might already have been destroyed, and anything could happen.
      
      This patch simply moves `__kmp_release_deps` ahead of the decrement of the
      counter. In this way, we can make sure that the initial thread is aware of
      the existence of the other task(s) and therefore will not move on. In
      addition, in order to handle dependence chains starting with a hidden helper
      task, we force such a task to release its dependences when it is encountered.
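
      A rough sketch of the reordering (the helper names are illustrative, not the
      actual libomp internals):
      ```
      struct Task;
      void releaseDependences(Task *T);          // stands in for __kmp_release_deps
      void decrementParentChildCounter(Task *T); // td_incomplete_child_tasks--

      // Hypothetical sketch of the new ordering in the task-completion path.
      void onTaskFinished(Task *T) {
        // Release dependences first so dependent tasks are enqueued and counted
        // before the parent can observe a zero child-task counter and move on.
        releaseDependences(T);
        decrementParentChildCounter(T);
      }
      ```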
      
      Reference:
      [0] https://bugs.llvm.org/show_bug.cgi?id=50022
      
      Reviewed By: AndreyChurbanov
      
      Differential Revision: https://reviews.llvm.org/D106519
    • [Libomptarget] Add unroll flag to shared variables loop · e1dedeca
      Joseph Huber authored
      Unrolling this loop provides better performance in practice because it is
      executed on the device and is likely to be very small.
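
      Presumably this amounts to a loop hint along these lines (a sketch; the names
      and the exact loop in the runtime are assumptions):
      ```
      // Ask the compiler to fully unroll the (typically tiny) loop that forwards
      // shared-variable pointers.
      void copySharedArgs(void **Dst, void **Src, unsigned N) {
      #pragma unroll
        for (unsigned I = 0; I < N; ++I)
          Dst[I] = Src[I];
      }
      ```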
      
      Reviewed By: tianshilei1992
      
      Differential Revision: https://reviews.llvm.org/D106692
    • [OpenMP][Offloading] Fix data race in data mapping by using two locks · 18ce3d3f
      Shilei Tian authored
      This patch tries to partially fix one of the two data race issues reported in
      [1] by introducing a per-entry mutex. Additional discussion can also be found in
      D104418, which will also be refined to fix another data race problem.
      
      Here is how it works. As before, `DataMapMtx` is still used for mapping table
      lookup and update. In any case, we will get a table entry. If we need to make
      a data transfer (update the data on the device), we lock the entry right
      before releasing `DataMapMtx`, issue the data transfer after releasing
      `DataMapMtx`, and unlock the entry afterwards. This guarantees that: 1) the
      issue of the data movement is not in the critical region, so it does not hurt
      performance much and does not block other threads that don't touch the same
      entry; 2) if another thread accesses the same entry, the state of the data
      movement is consistent (which requires that a thread first take the entry
      lock before reading data movement information).
      
      For a target that doesn't support asynchronous data transfer, issuing the data
      movement is the data transfer itself. This two-lock design can potentially
      improve concurrency compared with a design that guards the data movement with
      `DataMapMtx` as well. For a target that supports asynchronous data movement,
      we could simply attach the event between issuing the data movement and
      unlocking the entry. A thread that wants to get the event must first take the
      lock. This also gets rid of the busy wait until the event pointer becomes
      valid.
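
      A simplified sketch of the locking order described above (illustrative types
      and names, not the actual libomptarget code):
      ```
      #include <mutex>

      struct Entry { std::mutex UpdateMtx; /* per-entry mapping state */ };
      void issueDataTransfer(Entry &E); // stands in for the actual H2D submission

      std::mutex DataMapMtx; // guards lookup/update of the mapping table

      void lookupAndMaybeTransfer(Entry &E, bool NeedsTransfer) {
        DataMapMtx.lock();      // 1) look up / update the mapping table
        if (NeedsTransfer)
          E.UpdateMtx.lock();   // 2) lock the entry before dropping the map lock
        DataMapMtx.unlock();
        if (NeedsTransfer) {
          issueDataTransfer(E); // 3) issue the copy outside the critical region
          E.UpdateMtx.unlock(); // 4) other threads now see a consistent state
        }
      }
      ```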
      
      Reference:
      [1] https://bugs.llvm.org/show_bug.cgi?id=49940
      
      Reviewed By: grokos
      
      Differential Revision: https://reviews.llvm.org/D104555
    • [OpenMP] Fix CUDA plugin build after 3817ba13. · f7c92995
      Abhinav Gaba authored
      The build was broken on machines that don't have the CUDA SDK installed.
      
      See https://reviews.llvm.org/D106627 for the original discussion.
    • [OpenMP] Simplify the ThreadStackTy for globalization fallback · d12ee28e
      Johannes Doerfert authored
      With D106496 we can make the globalization fallback stack much simpler
      and this version doesn't seem to experience the spurious failures and
      deadlocks we have seen before.
      
      Differential Revision: https://reviews.llvm.org/D106576
    • [OpenMP][NFC] Fix formatting in CUDA plugin · 76c0c0ca
      Joseph Huber authored
    • [OpenMP] Add environment variables to change stack / heap size in the CUDA plugin · 3817ba13
      Joseph Huber authored
      This patch adds support for two environment variables to configure the device.
      ``LIBOMPTARGET_STACK_SIZE`` sets the amount of memory in bytes that each thread
      has for its stack. ``LIBOMPTARGET_HEAP_SIZE`` sets the amount of heap memory
      that can be allocated using malloc / free on the device.
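
      A minimal sketch of how such variables could be applied through the CUDA
      driver API (an assumption about the mechanism; the actual plugin code may
      differ):
      ```
      #include <cstdlib>
      #include <cuda.h>

      // Read the environment variables and apply them as CUDA context limits;
      // error handling is omitted.
      static void applyEnvLimits() {
        if (const char *Env = std::getenv("LIBOMPTARGET_STACK_SIZE"))
          cuCtxSetLimit(CU_LIMIT_STACK_SIZE, std::strtoull(Env, nullptr, 10));
        if (const char *Env = std::getenv("LIBOMPTARGET_HEAP_SIZE"))
          cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, std::strtoull(Env, nullptr, 10));
      }
      ```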
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106627
    • [OpenMP] Renaming RT functions `GetNumberOfBlocksInKernel` and `GetNumberOfThreadsInBlock` · 68d6278a
      Jose M Monsalve Diaz authored
      These functions should follow the camel case convention. These are really easy to change
      and are needed for D106033.
      
      Reviewed By: JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D106390
  5. Jul 22, 2021
    • [libomptarget][amdgpu][nfc] Normalise license headers · 9e05c084
      Jon Chesterfield authored
      Reviewed By: gregrodgers, jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106581
    • [libomptarget][amdgpu][nfc] Replace use of gelf.h with libelf.h · 14e34a83
      Jon Chesterfield authored
      AMDGPU can assume Elf64, so it doesn't need to abstract over Elf32.

      Drop a few other unused headers at the same time. Now only the LLVM ELF
      headers and libelf are used by the plugin.
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106579
    • [libomptarget][amdgpu] Implement dlopen of libhsa · 1a965706
      Jon Chesterfield authored
      AMDGPU plugin equivalent of D95155, build without HSA installed locally
      
      Compiles a new file, plugins/amdgpu/dynamic_hsa/hsa.cpp, to an object file that
      exposes the same symbols that the plugin presently uses from hsa. The object
      file contains dlopen of hsa and cached dlsym calls. Also provides header files
      corresponding to the subset that is used.
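
      The general pattern looks roughly like this (a sketch, not the generated
      hsa.cpp itself):
      ```
      #include <dlfcn.h>

      using hsa_status_t = int; // stand-in for the real enum from hsa.h

      static void *getLibHSA() {
        static void *Handle = dlopen("libhsa-runtime64.so", RTLD_NOW);
        return Handle;
      }

      // Forwarding definition: resolve the real symbol once, cache the pointer,
      // and call through it on every use.
      extern "C" hsa_status_t hsa_init() {
        using FnTy = hsa_status_t (*)();
        static FnTy Fn = reinterpret_cast<FnTy>(dlsym(getLibHSA(), "hsa_init"));
        return Fn();
      }
      ```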
      
      This is behind a feature flag, LIBOMPTARGET_FORCE_DLOPEN_LIBHSA, default off.
      That allows developers to build against the dlopen/dlsym implementation, e.g.
      while testing this mode.
      
      Enabling by default will cause this plugin to build on a wider variety of
      machines than it does at present so may break some CI builds. That risk can
      be minimised by reviewing the header dependencies of the library and ensuring
      it doesn't use any libraries that are not already used by libomptarget.
      
      Separating the implementation from enabling by default in case the latter needs
      to be rolled back after wider CI results.
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106559
    • [libomptarget][nfc] Improve static assert message in dlwrap · 6e9cd3e9
      Jon Chesterfield authored
      Revision of D102858. Raise the dlwrap arity argument to a template argument
      so the concrete values are given in the error message, e.g. '2 == 1' instead
      of '2 == trait<>::nargs'.
      
      Arity higher than it should be:
      Before diff
      ```
      $/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: error:
            static_assert failed due to requirement '2 == trait<cudaError_enum (*)(unsigned int)>::nargs'
            "Arity Error"
      DLWRAP_INTERNAL(cuInit, 2);
      ^~~~~~~~~~~~~~~~~~~~~~~~~~
      ...
      $/include/dlwrap.h:166:3: note: expanded from macro
            'DLWRAP_COMMON'
        static_assert(ARITY == trait<decltype(&SYMBOL)>::nargs, "Arity Error");      \
      ```
      
      After diff
      ```
      In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
      $/include/dlwrap.h:131:3: error: static_assert failed due to
            requirement '2UL == 1UL' "Arity Error"
        static_assert(Requested == Required, "Arity Error");
        ^             ~~~~~~~~~~~~~~~~~~~~~
      $/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
            instantiation of function template specialization 'dlwrap::verboseAssert<2UL, 1UL>' requested
            here
      DLWRAP_INTERNAL(cuInit, 2);
      ```
      
      Arity lower than it should be:
      Before diff
      ```
      $/plugins/cuda/dynamic_cuda/cuda.cpp:131:10: error: no
            matching function for call to 'dlwrap_cuInit'
        return dlwrap_cuInit(X);
               ^~~~~~~~~~~~~
      $/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: candidate
            function not viable: requires 0 arguments, but 1 was provided
      DLWRAP_INTERNAL(cuInit, 0);
      ```
      
      After diff
      ```
      In file included from $/plugins/cuda/dynamic_cuda/cuda.cpp:16:
      $/include/dlwrap.h:131:3: error: static_assert failed due to
            requirement '0UL == 1UL' "Arity Error"
        static_assert(Requested == Required, "Arity Error");
        ^             ~~~~~~~~~~~~~~~~~~~~~
      $/plugins/cuda/dynamic_cuda/cuda.cpp:23:1: note: in
            instantiation of function template specialization 'dlwrap::verboseAssert<0UL, 1UL>' requested
            here
      DLWRAP_INTERNAL(cuInit, 0);
      ```
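
      The underlying trick (a minimal sketch, not the actual dlwrap.h contents) is
      to make both arities template parameters so the concrete values show up in
      the diagnostic:
      ```
      #include <cstddef>

      // A failing instantiation prints the concrete values, e.g. '2UL == 1UL'.
      template <std::size_t Requested, std::size_t Required>
      constexpr void verboseAssert() {
        static_assert(Requested == Required, "Arity Error");
      }

      // The wrapper macro would instantiate this with the declared arity and the
      // arity deduced from the function pointer's trait, e.g.:
      // verboseAssert<2, 1>(); // fails, with both values visible in the message
      ```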
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106543
    • [OpenMP] Fix warnings for uninitialized block counts · a158d366
      Joseph Huber authored
      Summary:
      Fixes some warnings given for uninitialized block counts if the execution mode is
      not recognized. This shouldn't happen in practice because the execution mode is
      checked when it's read from the device.
    • [libomptarget][amdgpu][nfc] Drop dead signal pool setup · dc1f6f8b
      Jon Chesterfield authored
      This class is instantiated once in rtl.cpp before hsa_init is
      called. The hsa_signal_create call therefore fails, leaving the pool empty.
      
      This signal pool is a legacy from ATMI where it was constructed after hsa_init.
      Moving the state into the rtl.cpp global class disabled the initial populating
      of the pool without noticeably changing performance. Just rechecked with a fix
      that allocates the signals after hsa_init and that also doesn't noticeably
      change performance.
      
      This patch therefore drops the initialisation. The only change from main is
      to drop a DEBUG_PRINT statement that would report the pool's initial size as
      zero.
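
      For reference, a sketch of the ordering constraint (illustrative, not the
      plugin's code): signals can only be created after `hsa_init` has succeeded,
      so a static constructor that runs earlier leaves the pool empty.
      ```
      #include "hsa.h"

      // Creating a signal is only valid once the HSA runtime is initialised.
      bool makeSignal(hsa_signal_t &Out) {
        if (hsa_init() != HSA_STATUS_SUCCESS)
          return false;
        return hsa_signal_create(/*initial_value=*/0, /*num_consumers=*/0,
                                 /*consumers=*/nullptr, &Out) == HSA_STATUS_SUCCESS;
      }
      ```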
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106515
    • [OpenMP] Add an option to disable function internalization · 4a668604
      Joseph Huber authored
      Function internalization can sometimes occur in situations where we want to
      keep the call sites intact. This patch adds an option to disable function
      internalization and prevents the device runtime from being internalized while
      creating the bitcode library.
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106438
    • [Libomptarget] Introduce new main thread ID runtime function · 1684012a
      Joseph Huber authored
      This patch introduces `__kmpc_is_generic_main_thread_id`, which splits the old
      comparison into its own runtime function. The purpose is to allow this part to
      be folded independently, so that once both this and `is_spmd_mode` are folded,
      the final function can be folded as well.
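
      Conceptually the split looks something like this (a sketch; apart from the
      new entry point, the helpers are illustrative stand-ins for the device RTL
      internals):
      ```
      // Sketch only; the stubs below stand in for the real device RTL helpers.
      static int getMainThreadId() { return 0; }               // illustrative stub
      extern "C" char __kmpc_is_spmd_exec_mode() { return 0; } // illustrative stub

      // New: the thread-id comparison alone, so it can be folded independently.
      extern "C" char __kmpc_is_generic_main_thread_id(int Tid) {
        return Tid == getMainThreadId();
      }

      // Once both pieces above are folded, the combined query folds away as well.
      extern "C" char __kmpc_is_generic_main_thread(int Tid) {
        return !__kmpc_is_spmd_exec_mode() && __kmpc_is_generic_main_thread_id(Tid);
      }
      ```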
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106437
    • [OpenMP] Add new execution mode for SPMD execution with Generic semantics · 7d576392
      Joseph Huber authored
      Qualified kernels can be transformed from generic-mode to SPMD mode using an
      optimization in OpenMPOpt. This patch introduces a new execution mode to
      indicate kernels that have been transformed from generic-mode to SPMD-mode.
      These kernels have SPMD-mode execution, but need generic-mode semantics for
      scheduling the blocks and threads. Without this, far too few blocks would be
      scheduled for a generic region, as SPMD mode expects the trip count to be
      divided by the number of threads.
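
      Roughly, the execution-mode flags gain a value for this combination (the
      names and values below are illustrative, not the actual enum):
      ```
      enum class ExecMode : unsigned {
        Generic = 0,     // generic-mode kernel
        SPMD = 1,        // true SPMD kernel: trip count split across threads
        GenericSPMD = 2, // SPMD execution with generic-mode scheduling semantics
      };
      ```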
      
      Reviewed By: ggeorgakoudis
      
      Differential Revision: https://reviews.llvm.org/D106460
    • [OpenMP] Change `__kmpc_free_shared` to include the paired allocation size · 754eb1c2
      Joseph Huber authored
      This patch changes `__kmpc_free_shared` to take an additional argument
      corresponding to the associated allocation's size. This makes it easier to
      implement the allocator in the runtime.
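
      The resulting interface presumably looks like this (a sketch of the paired
      signatures; the parameter names are assumptions):
      ```
      #include <cstddef>

      extern "C" {
      void *__kmpc_alloc_shared(std::size_t Bytes);
      // The size passed here must match the paired allocation above.
      void __kmpc_free_shared(void *Ptr, std::size_t Bytes);
      }
      ```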
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D106496
  6. Jul 21, 2021
  7. Jul 20, 2021
  8. Jul 19, 2021
  9. Jul 18, 2021
    • [OpenMP][Offloading] Add a CMake argument LIBOMPTARGET_LIT_ARGS to control... · 954711ed
      Shilei Tian authored
      [OpenMP][Offloading] Add a CMake argument LIBOMPTARGET_LIT_ARGS to control behavior of libomptarget lit test
      
      By default, `lit` uses all threads to invoke tests, which can easily cause
      out-of-memory failures on GPUs because most OpenMP offloading tests take about
      1GB of GPU memory, while a typical GPU only has 4-8GB. This patch introduces a
      CMake argument `LIBOMPTARGET_LIT_ARGS` to allow users to control the behavior
      of the `libomptarget` tests, similar to `LLVM_LIT_ARGS`.
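
      For example, configuring with something like
      `-DLIBOMPTARGET_LIT_ARGS="-sv -j4"` (assuming the usual lit options) limits
      the test runner to four parallel workers.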
      
      Reviewed By: JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D106236
    • [OpenMP][Offloading] Add -g when compiling deviceRTLs in debug mode · 4357cfc7
      Shilei Tian authored
      Currently, when we compile the project in debug mode, `-g` is not added to the
      compilation flags. The bc files generated in the two modes differ in size. When
      using GPU debuggers like `cuda-gdb`, a debug version of the bc library is
      expected to provide more information.
      
      Reviewed By: JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D106229
  10. Jul 17, 2021
    • [OpenMP] Codegen aggregate for outlined function captures · e9c7291c
      Giorgis Georgakoudis authored
      Parallel regions are outlined as functions with capture variables explicitly
      generated as distinct parameters in the function's argument list. That
      complicates the fork_call interface in the OpenMP runtime: (1) the fork_call
      is variadic since there is a variable number of arguments to forward to the
      outlined function, (2) wrapping/unwrapping arguments happens in the OpenMP
      runtime, which is sub-optimal, has been a source of ABI bugs, and has a
      hardcoded limit (16) on the number of arguments, (3) forwarded arguments must
      be cast to pointer types, which complicates debugging. This patch avoids those
      issues by aggregating captured arguments in a struct to pass to the fork_call.
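
      Schematically, instead of forwarding each capture as a separate variadic
      argument, the captures are packed into one struct whose pointer is passed to
      the outlined function (a sketch with made-up names):
      ```
      // Illustrative aggregate of the captured variables.
      struct OutlinedCaptures {
        int *A;   // captured by reference
        double B; // captured by value
      };

      // The outlined region takes a single aggregate pointer (plus the usual
      // thread-id parameters) instead of one parameter per capture, so the
      // fork_call no longer needs to be variadic.
      void outlinedRegion(int *GlobalTid, int *BoundTid, OutlinedCaptures *Args) {
        *Args->A += static_cast<int>(Args->B);
      }
      ```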
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D102107
  11. Jul 16, 2021
    • [NFC][OpenMP][Offloading] Replaced explicit parallel level computation with... · 97c8f60b
      Shilei Tian authored
      [NFC][OpenMP][Offloading] Replaced explicit parallel level computation with function `__kmpc_parallel_level`
      
      There are two places in the current `deviceRTLs` where the parallel level is
      computed explicitly, which is basically the functionality of
      `__kmpc_parallel_level`. Starting from D105787, we plan to introduce a series
      of function call foldings based on information that can be deduced at compile
      time. Computation of the parallel level is the next target. This patch takes
      steps towards that optimization.
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D105955
  12. Jul 15, 2021
  13. Jul 13, 2021
    • [libomptarget] Update device pointer only if needed · bb0166dc
      George Rokos authored
      Currently, libomptarget will always perform a host-to-device memory transfer in
      order to update the device pointer of a PTR_AND_OBJ entry. This is not always
      necessary because the device pointer may have been set to the correct pointee
      address already, so we can eliminate the redundant memory transfer.
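
      In effect the update becomes conditional, along these lines (a sketch with
      illustrative names):
      ```
      #include <cstdint>

      // Only submit the host-to-device copy of the pointer slot when it does not
      // already hold the expected pointee address.
      bool maybeUpdateDevicePtr(std::uintptr_t CurrentDevPtrValue,
                                std::uintptr_t ExpectedPointeeAddr) {
        if (CurrentDevPtrValue == ExpectedPointeeAddr)
          return false; // already correct: skip the redundant transfer
        // submitDataToDevice(...) would run here in the real code path.
        return true;
      }
      ```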