- Aug 26, 2020
-
-
Jon Chesterfield authored
-
AndreyChurbanov authored
Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86552
-
AndreyChurbanov authored
Patch by nihui (Ni Hui) Differential Revision: https://reviews.llvm.org/D76755
-
- Aug 25, 2020
-
-
Shilei Tian authored
In this patch, we pack all small first-private arguments, allocate and transfer them all at once to reduce the number of data transfer which is very expensive. Let's take the test case as example. ``` int main() { int data1[3] = {1}, data2[3] = {2}, data3[3] = {3}; int sum[16] = {0}; #pragma omp target teams distribute parallel for map(tofrom: sum) firstprivate(data1, data2, data3) for (int i = 0; i < 16; ++i) { for (int j = 0; j < 3; ++j) { sum[i] += data1[j]; sum[i] += data2[j]; sum[i] += data3[j]; } } } ``` Here `data1`, `data2`, and `data3` are three first-private arguments of the target region. In the previous `libomptarget`, it called data allocation and data transfer three times, each of which allocated and transferred 12 bytes. With this patch, it only calls allocation and transfer once. The size is `(12+4)*3=48` where 12 is the size of each array and 4 is the padding to keep the address aligned with 8. It is implemented in this way: 1. First collect all information for those *first*-private arguments. _private_ arguments are not the case because private arguments don't need to be mapped to target device. It just needs a data allocation. With the patch for memory manager, the data allocation could be very cheap, especially for the small size. For each qualified argument, push a place holder pointer `nullptr` to the `vector` for kernel arguments, and we will update them later. 2. After we have all information, create a buffer that can accommodate all arguments plus their paddings. Copy the arguments to the buffer at the right place, i.e. aligned address. 3. Allocate a target memory with the same size as the host buffer, transfer the host buffer to target device, and finally update all place holder pointers in the arguments `vector`. The reason we only consider small arguments is, the data transfer is asynchronous. Therefore, for the large argument, we could continue to do things on the host side meanwhile, hopefully, the data is also being transferred. The "small" is defined by that the argument size is less than a predefined value. Currently it is 1024. I'm not sure whether it is a good one, and that is an open question. Another question is, do we need to make it configurable via an environment variable? Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D86307
-
- Aug 24, 2020
-
-
Dimitry Andric authored
after cde8f4c1. Sort system includes, while here.
-
Dimitry Andric authored
Instead of copying and pasting the same `#ifdef` expressions in multiple places, define a type and a pair of macros in `kmp_os.h`, to handle whether `va_list` is pointer-like or not: * `kmp_va_list` is the type to use for `__kmp_fork_call()` * `kmp_va_deref()` dereferences a `va_list`, if necessary * `kmp_va_addr_of()` takes the address of a `va_list`, if necessary Also add FreeBSD to the list of OSes that has a non pointer-like va_list. This can now be easily extended to other OSes too. Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D86397
-
AndreyChurbanov authored
Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D86448
-
Shilei Tian authored
The issue mentioned has been fixed in D84996
-
- Aug 20, 2020
-
-
Shilei Tian authored
Target memory manager is introduced in this patch which aims to manage target memory such that they will not be freed immediately when they are not used because the overhead of memory allocation and free is very large. For CUDA device, cuMemFree even blocks the context switch on device which affects concurrent kernel execution. The memory manager can be taken as a memory pool. It divides the pool into multiple buckets according to the size such that memory allocation/free distributed to different buckets will not affect each other. In this version, we use the exact-equality policy to find a free buffer. This is an open question: will best-fit work better here? IMO, best-fit is not good for target memory management because computation on GPU usually requires GBs of data. Best-fit might lead to a serious waste. For example, there is a free buffer of size 1960MB, and now we need a buffer of size 1200MB. If best-fit, the free buffer will be returned, leading to a 760MB waste. The allocation will happen when there is no free memory left, and the memory free on device will take place in the following two cases: 1. The program ends. Obviously. However, there is a little problem that plugin library is destroyed before the memory manager is destroyed, leading to a fact that the call to target plugin will not succeed. 2. Device is out of memory when we request a new memory. The manager will walk through all free buffers from the bucket with largest base size, pick up one buffer, free it, and try to allocate immediately. If it succeeds, it will return right away rather than freeing all buffers in free list. Update: A threshold (8KB by default) is set such that users could control what size of memory will be managed by the manager. It can also be configured by an environment variable `LIBOMPTARGET_MEMORY_MANAGER_THRESHOLD`. Reviewed By: jdoerfert, ye-luo, JonChesterfield Differential Revision: https://reviews.llvm.org/D81054
-
- Aug 19, 2020
-
-
Shilei Tian authored
This patch contains the following changes: 1. Renamed the function `DeviceTy::data_exchange` to `DeviceTy::dataExchange`; 2. Changed the second argument `DeviceTy DstDev` to `DeviceTy &DstDev`; 3. Renamed the last argument. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D86238
-
Jon Chesterfield authored
-
- Aug 17, 2020
-
-
George Rokos authored
Differential Revision: https://reviews.llvm.org/D86082
-
- Aug 16, 2020
-
-
Johannes Doerfert authored
Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D86039
-
Johannes Doerfert authored
Instead of calling `cuFuncGetAttribute` with `CU_FUNC_ATTRIBUTE_MAX_THREADS_PER_BLOCK` for every kernel invocation, we can do it for the first one and cache the result as part of the `KernelInfo` struct. The only functional change is that we now expect `cuFuncGetAttribute` to succeed and otherwise propagate the error. Ignoring any error seems like a slippery slope... Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D86038
-
Jon Chesterfield authored
[libomptarget] Implement host plugin for amdgpu Replacement for D71384. Primary difference is inlining the dependency on atmi followed by extensive simplification and bugfixes. This is the latest version from https://github.com/ROCm-Developer-Tools/amd-llvm-project/tree/aomp12 with minor patches and a rename from hsa to amdgpu, on the basis that this can't be used by other implementations of hsa without additional work. This will not build unless the ROCM_DIR variable is passed so won't break other builds. That variable is used to locate two amdgpu specific libraries that ship as part of rocm: libhsakmt at https://github.com/RadeonOpenCompute/ROCT-Thunk-Interface libhsa-runtime64 at https://github.com/RadeonOpenCompute/ROCR-Runtime These libraries build from source. The build scripts in those repos are for shared libraries, but can be adapted to statically link both into this plugin. There are caveats. - This works well enough to run various tests and benchmarks, and will be used to support the current clang bring up - It is adequately thread safe for the above but there will be races remaining - It is not stylistically correct for llvm, though has had clang-format run - It has suboptimal memory management and locking strategies - The debug printing / error handling is inconsistent I would like to contribute this pretty much as-is and then improve it in-tree. This would be advantagous because the aomp12 branch that was in use for fixing this codebase has just been joined with the amd internal rocm dev process. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D85742
-
- Aug 14, 2020
-
-
Joachim Protze authored
Starting with 787eb0c6 I got spurious segmentation faults for some testcases. I could nail it down to `brel` trying to release the "memory" of the node allocated on the stack of __kmpc_omp_wait_deps. With this patch, you will see the assertion triggering for some of the tests in the test suite. My proposed solution for the issue is to just patch __kmpc_omp_wait_deps: ``` __kmp_init_node(&node); - node.dn.on_stack = 1; + // the stack owns the node + __kmp_node_ref(&node); ``` What do you think? Reviewed By: AndreyChurbanov Differential Revision: https://reviews.llvm.org/D84472
-
- Aug 06, 2020
-
-
Joel E. Denny authored
D85342 broke this case. The new test case presents an example. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D85369
-
- Aug 05, 2020
-
-
Joel E. Denny authored
For example: ``` #pragma omp target data map(tofrom:arr[0:100]) { #pragma omp target exit data map(delete:arr[0:100]) #pragma omp target enter data map(alloc:arr[98:2]) } ``` Without this patch, the transfer at the end of the target data region is broken and fails depending on the target device. According to my read of the spec, the transfer shouldn't even be attempted because `arr[0:100]` isn't (fully) present there. To fix that, this patch makes `DeviceTy::getTgtPtrBegin` return null for this case. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D85342
-
Joel E. Denny authored
For example, without this patch, the following fails as expected with or without the `present` modifier, but the `present` modifier doesn't produce its usual diagnostic: ``` #pragma omp target data map(alloc: arr[0:2]) { #pragma omp target map(present, tofrom: arr[0:100]) // not fully present ; } ``` Reviewed By: grokos, vzakhari Differential Revision: https://reviews.llvm.org/D85320
-
George Rokos authored
The standard way of printing `int64_t` data is via the PRId64 macro, `ld` is for `long int` and int64_t is not guaranteed to be typedef'ed as `long int` on all platforms. E.g. on Windows we get mismatch warnings. Differential Revision: https://reviews.llvm.org/D85353
-
Alexey Bataev authored
targetDataMapper function fills arrays with the mapping data in the direct order. When this function is called by targetDataBegin or tgt_target_update functions, it works as expected. But targetDataEnd function processes mapped data in reverse order. In this case, the base pointer might be deleted before the associated data is deleted. Need to reverse data, mapped by mapper, too, since it always adds data that must be deleted at the end of the buffer. Fixes the test declare_mapper_target_update.cpp. Also, reduces the memry fragmentation by preallocation the memory buffers. Differential Revision: https://reviews.llvm.org/D85216
-
Joel E. Denny authored
OpenMP TR8 sec. 2.15.6 "target update Construct", p. 183, L3-4 states: > If the corresponding list item is not present in the device data > environment and there is no present modifier in the clause, then no > assignment occurs to or from the original list item. L10-11 states: > If a present modifier appears in the clause and the corresponding > list item is not present in the device data environment then an > error occurs and the program termintates. (OpenMP 5.0 also has the first passage but without mention of the present modifier of course.) In both passages, I assume "is not present" includes the case of partially but not entirely present. However, without this patch, the target update directive misbehaves in this case both with and without the present modifier. For example: ``` #pragma omp target enter data map(to:arr[0:3]) #pragma omp target update to(arr[0:5]) // might fail on data transfer #pragma omp target update to(present:arr[0:5]) // might fail on data transfer ``` The problem is that `DeviceTy::getTgtPtrBegin` does not return a null pointer in that case, so `target_data_update` sees the data as fully present, and the data transfer then might fail depending on the target device. However, without the present modifier, there should never be a failure. Moreover, with the present modifier, there should always be a failure, and the diagnostic should mention the present modifier. This patch fixes `DeviceTy::getTgtPtrBegin` to return null when `target_data_update` is the caller. I'm wondering if it should do the same for more callers. Reviewed By: grokos, jdoerfert Differential Revision: https://reviews.llvm.org/D85246
-
Joel E. Denny authored
Without this patch, the following example fails but shouldn't according to OpenMP TR8: ``` #pragma omp target enter data map(alloc:i) #pragma omp target data map(present, alloc: i) { #pragma omp target exit data map(delete:i) } // fails presence check here ``` OpenMP TR8 sec. 2.22.7.1 "map Clause", p. 321, L23-26 states: > If the map clause appears on a target, target data, target enter > data or target exit data construct with a present map-type-modifier > then on entry to the region if the corresponding list item does not > appear in the device data environment an error occurs and the > program terminates. There is no corresponding statement about the exit from a region. Thus, the `present` modifier should: 1. Check for presence upon entry into any region, including a `target exit data` region. This behavior is already implemented correctly. 2. Should not check for presence upon exit from any region, including a `target` or `target data` region. Without this patch, this behavior is not implemented correctly, breaking the above example. In the case of `target data`, this patch fixes the latter behavior by removing the `present` modifier from the map types Clang generates for the runtime call at the end of the region. In the case of `target`, we have not found a valid OpenMP program for which such a fix would matter. It appears that, if a program can guarantee that data is present at the beginning of a `target` region so that there's no error there, that data is also guaranteed to be present at the end. This patch adds a comment to the runtime to document this case. Reviewed By: grokos, RaviNarayanaswamy, ABataev Differential Revision: https://reviews.llvm.org/D84422
-
- Aug 04, 2020
-
-
Adrian Pop authored
RTM Adaptive Locks are supported on msys2/mingw for clang and gcc. Differential Revision: https://reviews.llvm.org/D81776
-
AndreyChurbanov authored
Patch by mati865@gmail.com Differential Revision: https://reviews.llvm.org/D85210
-
David Blaikie authored
-
- Aug 01, 2020
-
-
Shilei Tian authored
This patch fixed the issue that target memory might be deallocated when they're still being used or before they're used. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84996
-
- Jul 31, 2020
-
-
Joachim Protze authored
This is to address the issue reported at: https://bugs.llvm.org/show_bug.cgi?id=46863 Since weak is meaningless for a shared library interface function, this patch disables the attribute, when the OpenMP library is built as shared library. ompt_start_tool is not an interface function, but a internally called function possibly implemented by an OMPT tool. This function needs to be weak if possible to allow overwriting ompt_start_tool with a function implementation built into the application. Differential Revision: https://reviews.llvm.org/D84871
-
Shilei Tian authored
Refactored the function `targetDataEnd` to make preparation of fixing the issue of ahead-of-time target memory deallocation. This patch only renamed `targetDataEnd` related variables and functions to conform with LLVM code standard. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84991
-
Shilei Tian authored
Refactored the function `target` to make preparation for fixing the issue of ahead-of-time device memory deallocation. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84816
-
- Jul 30, 2020
-
-
Alexey Bataev authored
Need to map the base pointer for all directives, not only target data-based ones. The base pointer is mapped for array sections, array subscript, array shaping and other array-like constructs with the base pointer. Also, codegen for use_device_ptr clause was modified to correctly handle mapping combination of array like constructs + use_device_ptr clause. The data for use_device_ptr clause is emitted as the last records in the data mapping array. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84767
-
Alexey Bataev authored
This reverts commit 142d0d3e to investigate undefined behavior revealed by buildbots.
-
Alexey Bataev authored
Need to map the base pointer for all directives, not only target data-based ones. The base pointer is mapped for array sections, array subscript, array shaping and other array-like constructs with the base pointer. Also, codegen for use_device_ptr clause was modified to correctly handle mapping combination of array like constructs + use_device_ptr clause. The data for use_device_ptr clause is emitted as the last records in the data mapping array. It applies only for global pointers. Differential Revision: https://reviews.llvm.org/D84767
-
- Jul 29, 2020
-
-
Joel E. Denny authored
This patch implements OpenMP runtime support for the OpenMP TR8 `present` motion modifier for `omp target update` directives. The previous patch in this series implements Clang front end support. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D84712
-
Shilei Tian authored
Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84799
-
Joel E. Denny authored
This reverts commit 2cb926a4. It depends on 3c3faae4, which is being reverted.
-
Shilei Tian authored
[NFC][OpenMP] Renamed all variable and function names in `target` to conform with LLVM code standard This patch only touched variables and functions in `target`. Reviewed By: ye-luo Differential Revision: https://reviews.llvm.org/D84797
-
Joel E. Denny authored
This patch implements OpenMP runtime support for the OpenMP TR8 `present` motion modifier for `omp target update` directives. The previous patch in this series implements Clang front end support. Reviewed By: grokos Differential Revision: https://reviews.llvm.org/D84712
-
- Jul 28, 2020
-
-
Jinsong Ji authored
This reverts commit bf544fa1. Fixed the typo in PPCInstrInfo.cpp.
-
Joel E. Denny authored
On runtime failures, D83963 causes the runtime to abort instead of merely exiting with a non-zero value, but many tests in the libomptarget test suite still expect the former behavior. This patch updates the test suite and was discussed in post-commit comments on D83963 and D84557.
-