Skip to content
  1. Aug 26, 2020
  2. Aug 25, 2020
    • Shilei Tian's avatar
      [OpenMP] Pack first-private arguments to improve efficiency of data transfer · 0775c1df
      Shilei Tian authored
      In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive.
      
      Let's take the test case as example.
      ```
      int main() {
        int data1[3] = {1}, data2[3] = {2}, data3[3] = {3};
        int sum[16] = {0};
      #pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3)
        for (int i = 0; i < 16; ++i) {
          for (int j = 0; j < 3; ++j) {
            sum[i] += data1[j];
            sum[i] += data2[j];
            sum[i] += data3[j];
          }
        }
      }
      ```
      Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way:
      1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer `nullptr` to the `vector` for kernel arguments, and we will update them later.
      2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address.
      3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments `vector`.
      
      The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable?
      
      Reviewed By: ye-luo
      
      Differential Revision: https://reviews.llvm.org/D86307
      0775c1df
  3. Aug 24, 2020
  4. Aug 20, 2020
    • Shilei Tian's avatar
      [OpenMP] Introduce target memory manager · 02896967
      Shilei Tian authored
      Target memory manager is introduced in this patch which aims to manage target
      memory such that they will not be freed immediately when they are not used
      because the overhead of memory allocation and free is very large. For CUDA
      device, cuMemFree even blocks the context switch on device which affects
      concurrent kernel execution.
      
      The memory manager can be taken as a memory pool. It divides the pool into
      multiple buckets according to the size such that memory allocation/free
      distributed to different buckets will not affect each other.
      
      In this version, we use the exact-equality policy to find a free buffer. This
      is an open question: will best-fit work better here? IMO, best-fit is not good
      for target memory management because computation on GPU usually requires GBs of
      data. Best-fit might lead to a serious waste. For example, there is a free
      buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit,
      the free buffer will be returned, leading to a 760MB waste.
      
      The allocation will happen when there is no free memory left, and the memory
      free on device will take place in the following two cases:
      1. The program ends. Obviously. However, there is a little problem that plugin
      library is destroyed before the memory manager is destroyed, leading to a fact
      that the call to target plugin will not succeed.
      2. Device is out of memory when we request a new memory. The manager will walk
      through all free buffers from the bucket with largest base size, pick up one
      buffer, free it, and try to allocate immediately. If it succeeds, it will
      return right away rather than freeing all buffers in free list.
      
      Update:
      A threshold (8KB by default) is set such that users could control what size of memory
      will be managed by the manager. It can also be configured by an environment variable
      `LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`.
      
      Reviewed By: jdoerfert, ye-luo, JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D81054
      02896967
  5. Aug 19, 2020
  6. Aug 17, 2020
  7. Aug 16, 2020
    • Johannes Doerfert's avatar
      [OpenMP][CUDA] Keep one kernel list per device, not globally. · 5272d29e
      Johannes Doerfert authored
      Reviewed By: JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D86039
      5272d29e
    • Johannes Doerfert's avatar
      [OpenMP][CUDA] Cache the maximal number of threads per block (per kernel) · aa27cfc1
      Johannes Doerfert authored
      Instead of calling `cuFuncGetAttribute` with
      `CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation,
      we can do it for the first one and cache the result as part of the
      `KernelInfo` struct. The only functional change is that we now expect
      `cuFuncGetAttribute` to succeed and otherwise propagate the error.
      Ignoring any error seems like a slippery slope...
      
      Reviewed By: JonChesterfield
      
      Differential Revision: https://reviews.llvm.org/D86038
      aa27cfc1
    • Jon Chesterfield's avatar
      [libomptarget] Implement host plugin for amdgpu · d0b31295
      Jon Chesterfield authored
      [libomptarget] Implement host plugin for amdgpu
      
      Replacement for D71384. Primary difference is inlining the dependency on atmi
      followed by extensive simplification and bugfixes. This is the latest version
      from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with
      minor patches and a rename from hsa to amdgpu, on the basis that this can't be
      used by other implementations of hsa without additional work.
      
      This will not build unless the ROCM_DIR variable is passed so won't break other
      builds. That variable is used to locate two amdgpu specific libraries that ship
      as part of rocm:
      libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface
      libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime
      These libraries build from source. The build scripts in those repos are for
      shared libraries, but can be adapted to statically link both into this plugin.
      
      There are caveats.
      - This works well enough to run various tests and benchmarks, and will be used
        to support the current clang bring up
      - It is adequately thread safe for the above but there will be races remaining
      - It is not stylistically correct for llvm, though has had clang-format run
      - It has suboptimal memory management and locking strategies
      - The debug printing / error handling is inconsistent
      
      I would like to contribute this pretty much as-is and then improve it in-tree.
      This would be advantagous because the aomp12 branch that was in use for fixing
      this codebase has just been joined with the amd internal rocm dev process.
      
      Reviewed By: jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D85742
      d0b31295
  8. Aug 14, 2020
    • Joachim Protze's avatar
      [OpenMP] Fix releasing of stack memory · 66a3575c
      Joachim Protze authored
      Starting with 787eb0c6 I got spurious segmentation faults for some testcases. I could nail it down to `brel` trying to release the "memory" of the node allocated on the stack of __kmpc_omp_wait_deps. With this patch, you will see the assertion triggering for some of the tests in the test suite.
      
      My proposed solution for the issue is to just patch __kmpc_omp_wait_deps:
      ```
        __kmp_init_node(&node);
      -  node.dn.on_stack = 1;
      +  // the stack owns the node
      +  __kmp_node_ref(&node);
      ```
      
      What do you think?
      
      Reviewed By: AndreyChurbanov
      
      Differential Revision: https://reviews.llvm.org/D84472
      66a3575c
  9. Aug 06, 2020
  10. Aug 05, 2020
    • Joel E. Denny's avatar
      [OpenMP] Fix `target data` exit for array extension · 8c8bb128
      Joel E. Denny authored
      For example:
      
      ```
       #pragma omp target data map(tofrom:arr[0:100])
       {
         #pragma omp target exit data map(delete:arr[0:100])
         #pragma omp target enter data map(alloc:arr[98:2])
       }
      ```
      
      Without this patch, the transfer at the end of the target data region
      is broken and fails depending on the target device.  According to my
      read of the spec, the transfer shouldn't even be attempted because
      `arr[0:100]` isn't (fully) present there.  To fix that, this patch
      makes `DeviceTy::getTgtPtrBegin` return null for this case.
      
      Reviewed By: grokos
      
      Differential Revision: https://reviews.llvm.org/D85342
      8c8bb128
    • Joel E. Denny's avatar
      [OpenMP] Fix `present` diagnostic for array extension · 41b1aefe
      Joel E. Denny authored
      For example, without this patch, the following fails as expected with
      or without the `present` modifier, but the `present` modifier doesn't
      produce its usual diagnostic:
      
      ```
       #pragma omp target data map(alloc: arr[0:2])
       {
         #pragma omp target map(present, tofrom: arr[0:100]) // not fully present
         ;
       }
      ```
      
      Reviewed By: grokos, vzakhari
      
      Differential Revision: https://reviews.llvm.org/D85320
      41b1aefe
    • George Rokos's avatar
      [libomptarget][NFC] Replace `%ld` with PRId64 for data of type int64_t. · 40470eb2
      George Rokos authored
      The standard way of printing `int64_t` data is via the PRId64 macro, `ld`
      is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int`
      on all platforms. E.g. on Windows we get mismatch warnings.
      
      Differential Revision: https://reviews.llvm.org/D85353
      40470eb2
    • Alexey Bataev's avatar
      [LIBOMPTARGET]Fix order of mapper data for targetDataEnd function. · 6780d567
      Alexey Bataev authored
      targetDataMapper function fills arrays with the mapping data in the
      direct order. When this function is called by targetDataBegin or
      tgt_target_update functions, it works as expected. But targetDataEnd
      function processes mapped data in reverse order. In this case, the base
      pointer might be deleted before the associated data is deleted. Need to
      reverse data, mapped by mapper, too, since it always adds data that must
      be deleted at the end of the buffer.
      Fixes the test declare_mapper_target_update.cpp.
      Also, reduces the memry fragmentation by preallocation the memory
      buffers.
      
      Differential Revision: https://reviews.llvm.org/D85216
      6780d567
    • Joel E. Denny's avatar
      [OpenMP] Fix `omp target update` for array extension · 5ab43989
      Joel E. Denny authored
      OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states:
      
      > If the corresponding list item is not present in the device data
      > environment and there is no present modifier in the clause, then no
      > assignment occurs to or from the original list item.
      
      L10-11 states:
      
      > If a present modifier appears in the clause and the corresponding
      > list item is not present in the device data environment then an
      > error occurs and the program termintates.
      
      (OpenMP 5.0 also has the first passage but without mention of the
      present modifier of course.)
      
      In both passages, I assume "is not present" includes the case of
      partially but not entirely present.  However, without this patch, the
      target update directive misbehaves in this case both with and without
      the present modifier.  For example:
      
      ```
       #pragma omp target enter data map(to:arr[0:3])
       #pragma omp target update to(arr[0:5]) // might fail on data transfer
       #pragma omp target update to(present:arr[0:5]) // might fail on data transfer
      ```
      
      The problem is that `DeviceTy::getTgtPtrBegin` does not return a null
      pointer in that case, so `target_data_update` sees the data as fully
      present, and the data transfer then might fail depending on the target
      device.  However, without the present modifier, there should never be
      a failure.  Moreover, with the present modifier, there should always
      be a failure, and the diagnostic should mention the present modifier.
      
      This patch fixes `DeviceTy::getTgtPtrBegin` to return null when
      `target_data_update` is the caller.  I'm wondering if it should do the
      same for more callers.
      
      Reviewed By: grokos, jdoerfert
      
      Differential Revision: https://reviews.llvm.org/D85246
      5ab43989
    • Joel E. Denny's avatar
      [OpenMP] Fix `present` for exit from `omp target data` · 002d61db
      Joel E. Denny authored
      Without this patch, the following example fails but shouldn't
      according to OpenMP TR8:
      
      ```
       #pragma omp target enter data map(alloc:i)
       #pragma omp target data map(present, alloc: i)
       {
         #pragma omp target exit data map(delete:i)
       } // fails presence check here
      ```
      
      OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states:
      
      > If the map clause appears on a target, target data, target enter
      > data or target exit data construct with a present map-type-modifier
      > then on entry to the region if the corresponding list item does not
      > appear in the device data environment an error occurs and the
      > program terminates.
      
      There is no corresponding statement about the exit from a region.
      Thus, the `present` modifier should:
      
      1. Check for presence upon entry into any region, including a `target
         exit data` region.  This behavior is already implemented correctly.
      
      2. Should not check for presence upon exit from any region, including
         a `target` or `target data` region.  Without this patch, this
         behavior is not implemented correctly, breaking the above example.
      
      In the case of `target data`, this patch fixes the latter behavior by
      removing the `present` modifier from the map types Clang generates for
      the runtime call at the end of the region.
      
      In the case of `target`, we have not found a valid OpenMP program for
      which such a fix would matter.  It appears that, if a program can
      guarantee that data is present at the beginning of a `target` region
      so that there's no error there, that data is also guaranteed to be
      present at the end.  This patch adds a comment to the runtime to
      document this case.
      
      Reviewed By: grokos, RaviNarayanaswamy, ABataev
      
      Differential Revision: https://reviews.llvm.org/D84422
      002d61db
  11. Aug 04, 2020
  12. Aug 01, 2020
  13. Jul 31, 2020
  14. Jul 30, 2020
  15. Jul 29, 2020
  16. Jul 28, 2020
Loading