- Apr 07, 2022
-
-
Michael Kruse authored
In a clean build directory, `check-openmp` or `check-libomptarget` will fail because of missing device RTL .bc files. Ensure that the new custom targets `omptarget.devicertl.nvptx` and `omptarget.devicertl.amdgpu` (corresponding to the plugin RTL targets `omptarget.rtl.cuda` and `omptarget.rtl.amdgpu`, respectively) are dependencies of the regression tests. Reviewed By: JonChesterfield Differential Revision: https://reviews.llvm.org/D123177
-
- Mar 29, 2022
-
-
Ron Lieberman authored
Differential Revision: https://reviews.llvm.org/D122658
-
Johannes Doerfert authored
-
Johannes Doerfert authored
-
Johannes Doerfert authored
If we decided to delete a mapping entry we did not act on it right away but first issued and waited for memory copies. In the meantime some other thread might reuse the entry. While there was some logic to avoid colliding on the actual "deletion" part, there were two races happening: 1) The data transfer back of the thread deleting the entry and the data transfer back of the thread taking over the entry raced. 2) The update to the shadow map happened regardless of whether the entry was actually reused by another thread, which left the shadow map in an inconsistent state. To fix both issues we will now update the shadow map and delete the entry only if we are sure the thread is responsible for deletion, that is, no other thread took over the entry and reused it. We also wait for a potential former data transfer from the device to finish before we issue another one that would race with it. Fixes https://github.com/llvm/llvm-project/issues/54216 Differential Revision: https://reviews.llvm.org/D121058
-
Johannes Doerfert authored
-
Johannes Doerfert authored
Inline assembly is scary but we need to support it for the OpenMP GPU device runtime. The new assumption expresses the fact that it may not have call semantics, that is, it will not call another function but simply perform an operation or side-effect. This is important for reachability in the presence of inline assembly. Differential Revision: https://reviews.llvm.org/D109986
-
- Mar 26, 2022
-
-
Shilei Tian authored
As we mentioned in the code comments for function `ResourcePoolTy::release`, at some point there could be two identical resources on the two sides of the `Next` mark. This is usually not an issue, except in the following case: 1. Some resources are not returned. 2. We need to iterate the pool and free the elements. That will cause a double free, which is the case for the event pool. Since we don't release events held by the data map, it can happen that the `Next` mark is not reset, and we have two identical items in the pool. When the pool is destroyed, we will call `cuEventDestroy` twice on the same event. In the best case, we only observe CUDA errors. In the worst case, it can cause internal failures in CUDART and a subsequent crash. This patch fixes the issue by tracking all resources that have been given out using an `unordered_set`. We don't remove a resource from the set when it is returned. When the pool is destroyed, we merge the pool (a `vector`) and the set. In this way, we can make sure that the set contains all resources allocated from the device. We just need to iterate the set and free the resources accordingly. For now, only the event pool is set to use it. The stream pool is not because we can make sure all streams are returned when the plugin is destroyed. Someone might wonder why we don't release all events held in the data map. That is because plugins are destroyed *before* `libomptarget`. If we can somehow make the plugin outlast `libomptarget`, life will be much easier. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D122014
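The tracking idea described above can be sketched as follows. This is a simplified illustration, not the plugin's code: `TrackedPool`, `CountingAllocator`, and `ResourceTy` are hypothetical names, the `Next` mark is omitted, and a plain heap allocation stands in for a CUDA event.

```cpp
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a device resource such as an event handle.
using ResourceTy = int *;

// Hypothetical allocator that counts calls so a double free would show up.
struct CountingAllocator {
  int NumCreated = 0;
  int NumDestroyed = 0;
  ResourceTy create() {
    ++NumCreated;
    return new int(0);
  }
  void destroy(ResourceTy R) {
    ++NumDestroyed;
    delete R;
  }
};

class TrackedPool {
  CountingAllocator &Alloc;
  std::vector<ResourceTy> Pool;         // resources currently available
  std::unordered_set<ResourceTy> Given; // every resource ever handed out

public:
  explicit TrackedPool(CountingAllocator &A) : Alloc(A) {}

  ResourceTy acquire() {
    ResourceTy R;
    if (Pool.empty()) {
      R = Alloc.create();
    } else {
      R = Pool.back();
      Pool.pop_back();
    }
    Given.insert(R); // remember it even if it is never returned
    return R;
  }

  // Returning a resource does not remove it from the tracking set.
  void release(ResourceTy R) { Pool.push_back(R); }

  ~TrackedPool() {
    // Merge the pool into the set; duplicates collapse, so each
    // resource is destroyed exactly once.
    Given.insert(Pool.begin(), Pool.end());
    for (ResourceTy R : Given)
      Alloc.destroy(R);
  }
};
```

Because the set deduplicates handles, a resource that sits in the pool and was also handed out earlier is freed once, and a resource that was never returned is still freed.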
-
Joseph Huber authored
This patch adds the necessary AMDGPU calling convention to the ctor / dtor kernels. These are fundamentally device kernels called by the host on image load. Without this calling convention information the AMDGPU plugin is unable to identify them. Depends on D122504 Fixes #54091 Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D122515
-
- Mar 25, 2022
-
-
Johannes Doerfert authored
This reverts commit b9fd8f34 as it accidentally contained a unit test change that is not finished (and unrelated).
-
Johannes Doerfert authored
-
Johannes Doerfert authored
-
Johannes Doerfert authored
This patch solves two problems with the `HostDataToTargetMap` (HDTT map) which caused races and crashes before: 1) Any access to the HDTT map needs to be exclusive access. This was not the case for the "dump table" traversals that could collide with updates by other threads. The new `Accessor` and `ProtectedObject` wrappers will ensure we have a hard time introducing similar races in the future. Note that we could allow multiple concurrent read-accesses but that feature can be added to the `Accessor` API later. 2) The elements of the HDTT map were `HostDataToTargetTy` objects which meant that they could be copied/moved/deleted as the map was changed. However, we sometimes kept pointers to these elements around after we gave up the map lock which caused potential races again. The new indirection through `HostDataToTargetMapKeyTy` will allow us to modify the map while keeping the (interesting part of the) entries valid. To offset potential cost we duplicate the ordering key of the entry which avoids an additional indirect lookup. We should replace more objects with "protected objects" as we go. Differential Revision: https://reviews.llvm.org/D121057
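A minimal sketch of the first point, assuming a wrapper that owns both the object and its mutex and only exposes the object through a lock-holding accessor. The class layout and the method name `getExclusiveAccessor` are illustrative assumptions here, not a copy of the patch.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Owns an object together with the mutex that guards it. The object is
// reachable only through an Accessor, which holds the lock for its lifetime.
template <typename T> class ProtectedObject {
  T Obj;
  std::mutex Mtx;

public:
  class Accessor {
    std::unique_lock<std::mutex> Lock; // acquired first, released last
    T &Ref;

  public:
    explicit Accessor(ProtectedObject &PO) : Lock(PO.Mtx), Ref(PO.Obj) {}
    T &operator*() { return Ref; }
    T *operator->() { return &Ref; }
  };

  // Illustrative name; blocks until exclusive access is available.
  Accessor getExclusiveAccessor() { return Accessor(*this); }
};
```

With this shape, a traversal such as a table dump cannot touch the map without first holding the lock, because there is no other way to reach the underlying object. Acquiring a second accessor on the same thread while the first is still alive would deadlock, so accessors should be kept in narrow scopes.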
-
- Mar 22, 2022
-
-
Joseph Huber authored
The unroll pragma did not work properly because the loop bound was not known when we optimized the runtime, and we then added an "unroll disable" metadata which prevented unrolling later when the bounds were known. For now we manually unroll to make sure up to 16 elements are handled nicely. This helps optimizations to look through the argument passing. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D109164
-
- Mar 17, 2022
-
-
Stanislav Mekhanoshin authored
Differential Revision: https://reviews.llvm.org/D120849
-
- Mar 12, 2022
-
-
Jon Chesterfield authored
-
- Mar 10, 2022
-
-
Shilei Tian authored
-
- Mar 09, 2022
-
-
Shilei Tian authored
Currently we set the context everywhere, but that causes many unnecessary function calls. For example, in the resource pool, if we need to resize the pool, we need to get resources from the allocator. Each call to allocate sets the current context once, which is unnecessary. In this patch, we set the context only in the entry interface functions, if needed. Ideally this would be implemented via RAII, but since `cuCtxSetCurrent` can return an error and we don't use exceptions, we can't stop the execution if the RAII object fails to set the context. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D121322
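The pattern of setting the context once at the entry point and letting internal helpers assume it is already set might look roughly like this. `FakeDriver` and `Plugin` are hypothetical stand-ins written for this sketch; no CUDA call is made, and a counter replaces `cuCtxSetCurrent` so the number of context switches is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the driver API; counts context switches and
// reports failure for a negative device id.
struct FakeDriver {
  int NumSetContextCalls = 0;
  bool setContext(int DeviceId) {
    ++NumSetContextCalls;
    return DeviceId >= 0;
  }
};

class Plugin {
  FakeDriver &Driver;
  std::vector<int> Pool;

  // Internal helper: assumes the caller already set the context,
  // so it does not set it again.
  int allocateOne() { return static_cast<int>(Pool.size()); }

public:
  explicit Plugin(FakeDriver &D) : Driver(D) {}

  // Entry interface function: sets the context exactly once and reports
  // failure through the return value instead of an exception.
  bool resizePool(int DeviceId, int NewSize) {
    if (!Driver.setContext(DeviceId))
      return false;
    while (static_cast<int>(Pool.size()) < NewSize)
      Pool.push_back(allocateOne());
    return true;
  }

  std::size_t poolSize() const { return Pool.size(); }
};
```

Returning a status from the entry function is what lets the caller stop on a failed context switch, which a constructor-based RAII guard could not signal without exceptions.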
-
Shilei Tian authored
This patch fixes the issue introduced in 14de0820 and D120089, that if dynamic libraries are used, the `CUmodule` array could be overwritten. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D121308
-
Johannes Doerfert authored
The modules vector was for some reason special, which could lead to it not being of the same size (= number of devices) as the others. The easiest solution is to treat it like all the other vectors.
-
- Mar 08, 2022
-
-
Johannes Doerfert authored
An event pool, similar to the stream pool, needs to be kept per device. For one, events are associated with CUDA contexts, which means we cannot destroy the former after the latter. Also, the CUDA documentation states that streams and events need to be associated with the same context, which we did not ensure at all. Differential Revision: https://reviews.llvm.org/D120142
-
Johannes Doerfert authored
There are two problems this patch tries to address: 1) We currently free resources in a random order with respect to plugin and libomptarget destruction. This patch should ensure the CUDA plugin is less fragile if something goes wrong during deinitialization. 2) We need to support (hard) pause runtime calls eventually. This patch allows us to free all associated resources, though we cannot reinitialize the device yet. A follow-up patch will associate one event pool per device/context. Differential Revision: https://reviews.llvm.org/D120089
-
Johannes Doerfert authored
Differential Revision: https://reviews.llvm.org/D121060
-
- Mar 07, 2022
-
-
Johannes Doerfert authored
This reverts commit ff50e81b as it broke the buildbots, see https://reviews.llvm.org/D121060#3362737.
-
Johannes Doerfert authored
Differential Revision: https://reviews.llvm.org/D121060
-
- Mar 06, 2022
-
-
Shilei Tian authored
`LIBOMPTARGET_LLVM_INCLUDE_DIRS` is currently checked and included multiple times redundantly. This patch is simply a cleanup. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D121055
-
- Mar 04, 2022
-
-
Joseph Huber authored
Libomptarget uses some shared variables to track certain internal state in the runtime. This causes problems when we have code that contains no OpenMP kernels. These variables are normally initialized upon kernel entry, but if there are no kernels we will see no initialization. Currently we load the runtime into each source file when not running in LTO mode, so these variables will be erroneously considered undefined or dead and removed, causing miscompiles. This patch temporarily works around the most obvious case, but others still exhibit this problem. We will need to fix this more soundly later. Fixes #54208. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D121007
-
- Mar 03, 2022
-
-
Aakanksha authored
Differential Revision: https://reviews.llvm.org/D120846
-
- Mar 02, 2022
-
-
Stanislav Mekhanoshin authored
This is target definition only. Differential Revision: https://reviews.llvm.org/D120688
-
- Feb 23, 2022
-
-
Shilei Tian authored
-
Joseph Huber authored
-
- Feb 18, 2022
-
-
Carlo Bertolli authored
[OpenMP][libomptarget] Delay restore of shadow pointers in structs to after H2D memory copies are completed When using asynchronous plugin calls, shadow pointer restore could happen before the D2H copy for the entire struct has completed, effectively leaving a device pointer in a host struct. This patch fixes the problem by delaying restores until after a synchronization happens (target regions) and by calling early synchronization (target update). Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119968
-
Joseph Huber authored
The runtime uses thread state values to indicate when we use an ICV or are in nested parallelism. This is done for OpenMP correctness, but is not needed in the majority of cases. The new flag added is `-fopenmp-assume-no-thread-state`. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D120106
-
- Feb 17, 2022
-
-
Shilei Tian authored
`bug49334.cpp` has an issue that causes the flaky results reported in #53730. The root cause is that `BlockedC` is never initialized, but in `BlockMatMul_TargetNowait` it is directly read and written (via `+=`). Fixes #53730. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D119988
-
- Feb 16, 2022
-
-
Johannes Doerfert authored
The `IsSPMD` global can only be read by threads other than the main thread *after* initialization is complete. To allow usage of `mapping::getBlockSize` before initialization is done, we can pass the `IsSPMD` state explicitly. This is similar to other APIs that take `IsSPMD` explicitly to avoid such a race, e.g., `mapping::isInitialThreadInLevel0(IsSPMD)`. Fixes https://github.com/llvm/llvm-project/issues/53857
-
- Feb 15, 2022
-
-
Joseph Huber authored
This patch adds a new target to the OpenMP CPU offloading tests. This tests the usage of the new driver for CPU offloading. If this all works then we can transition to the new driver as the default. Depends on D119613 Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119736
-
- Feb 14, 2022
-
-
Joseph Huber authored
Currently, whenever we compile the device runtime we get the following warning: 'Mapping.cpp:32:32: warning: inline function '_OMP::impl::getGridValue' is not defined [-Wundefined-inline]'. This can be silenced by removing the constexpr attribute from this function. Doing this doesn't change the generated bitcode at all but prevents the screen from getting filled with warnings whenever we build the runtime. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119747
-
- Feb 13, 2022
-
-
Shilei Tian authored
This patch fixes the issue that the for loop in `applyToShadowMapEntries` is infinite because `Itr` is not incremented in `CB`. Fixes #53727. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119471
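The class of bug described here, a `for` loop with no increment in its header whose body fails to advance the iterator on some path, can be illustrated with a small sketch. `applyToEntries`, `Action`, and the `std::map`-based `ShadowMapTy` are hypothetical and only mirror the shape of the problem, not the actual `applyToShadowMapEntries` code.

```cpp
#include <cassert>
#include <functional>
#include <map>

using ShadowMapTy = std::map<int, int>;
enum class Action { Keep, Erase };

// Visits every entry and erases those for which CB returns Action::Erase.
// Returns the number of entries visited.
int applyToEntries(ShadowMapTy &Map,
                   const std::function<Action(ShadowMapTy::iterator)> &CB) {
  int Visited = 0;
  for (auto Itr = Map.begin(); Itr != Map.end(); /* advanced in the body */) {
    ++Visited;
    if (CB(Itr) == Action::Erase)
      Itr = Map.erase(Itr); // erase returns the next valid iterator
    else
      ++Itr; // without this increment the loop never terminates
  }
  return Visited;
}
```

Every path through the body must either erase (and take the returned iterator) or increment; leaving out one of the two is enough to make the loop spin on the same element.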
-
- Feb 11, 2022
-
-
Shilei Tian authored
`bug49334.cpp` directly uses `!=` to compare two floating point values, which is almost always wrong. Reviewed By: jhuber6 Differential Revision: https://reviews.llvm.org/D119485
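A common alternative to exact comparison is a tolerance check. The helper below is a generic sketch; the name `almostEqual` and the default tolerances are assumptions chosen for illustration and are not taken from the test.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Returns true if A and B differ by no more than an absolute tolerance or
// a tolerance relative to the larger magnitude, whichever is greater.
bool almostEqual(float A, float B, float RelTol = 1e-5f,
                 float AbsTol = 1e-8f) {
  float Diff = std::fabs(A - B);
  float Scale = std::max(std::fabs(A), std::fabs(B));
  return Diff <= std::max(AbsTol, RelTol * Scale);
}
```

A result accumulated in a different order on the device can differ from the host reference in the last few bits, so a check like `almostEqual(Sum, Expected)` is more robust than `Sum != Expected`. The appropriate tolerance depends on the computation being verified.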
-
Shilei Tian authored
Currently we have a hard team limit, which is set to 65536. No matter whether the device can support more teams or the user requests more, any value above that hard limit is clamped to it when the kernel is launched. The limit is far below the actual hardware limit. For example, my workstation has a GTX2080, and the hardware limit of grid size is 2147483647, which is exactly the largest number an `int32_t` can represent. There is no limitation mentioned in the spec. This patch simply removes it. Reviewed By: jdoerfert Differential Revision: https://reviews.llvm.org/D119313
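The effect of the hard limit can be sketched with two small functions. Both function names are hypothetical, and the assumption that the device's grid size limit remains the only clamp after the patch is an illustration, not a statement about the plugin's exact logic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// The hard limit described in the commit message.
constexpr int32_t OldHardTeamLimit = 65536;

// Before: the request is clamped to the hard limit even if the device
// supports far more teams.
int32_t teamsWithHardLimit(int32_t Requested, int32_t DeviceMaxGridSize) {
  return std::min({Requested, DeviceMaxGridSize, OldHardTeamLimit});
}

// After (assumed): only the device limit bounds the request.
int32_t teamsWithoutHardLimit(int32_t Requested, int32_t DeviceMaxGridSize) {
  return std::min(Requested, DeviceMaxGridSize);
}
```

With a device limit of 2147483647, a request for 1000000 teams would be reduced to 65536 by the first function and passed through unchanged by the second.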
-