- May 05, 2017
-
-
Siddharth Bhat authored
This reverts commit 17a84e414adb51ee375d14836d4c2a817b191933.

The patches should have been submitted in the order:

1. D32852
2. D32854
3. D32431

I mistakenly pushed D32431 (3) first. Reverting to push in the correct order.

llvm-svn: 302217
-
Siddharth Bhat authored
Summary: When compiling for GPU, one can now choose to compile for OpenCL or CUDA with the corresponding polly-gpu-runtime flag (libopencl / libcudart). The GPURuntime library (GPUJIT) has been extended with the OpenCL runtime library for that purpose, correctly choosing the library calls corresponding to the option chosen when compiling (via different initialization calls). Additionally, a specific GPU target architecture can now be chosen with -polly-gpu-arch (only nvptx64 implemented thus far).

Reviewers: grosser, bollu, Meinersbur, etherzhhb, singam-sanjay

Reviewed By: grosser, Meinersbur

Subscribers: singam-sanjay, llvm-commits, pollydev, nemanjai, mgorny, yaxunl, Anastasia

Tags: #polly

Differential Revision: https://reviews.llvm.org/D32431

llvm-svn: 302215
-
- May 03, 2017
-
-
Siddharth Bhat authored
- Fixes breakage from commit 5536f.
- Interference with commit 764f3 caused the testcase to fail. Reverting 764f3 allows commit 5536f to succeed.
- Generated kernel code was slightly different due to 764f3, which caused the testcase to fail.

llvm-svn: 302021
-
- Apr 28, 2017
-
-
Siddharth Bhat authored
generation. This needs changes to GPURuntime to expose synchronization between host and device.

1. Needs better function naming; I want a better name than "getOrCreateManagedDeviceArray".
2. DeviceAllocations is used by both the managed-memory and the non-managed-memory path. This exploits the fact that the two code paths are never run together. I'm not sure this is the best design decision.

Reviewed by: PhilippSchaad

Tags: #polly

Differential Revision: https://reviews.llvm.org/D32215

llvm-svn: 301640
-
- Apr 25, 2017
-
-
Siddharth Bhat authored
Added a small change to the way pointer arguments are set in the kernel code generation. The way the pointer is now retrieved specifically requests the global address space annotation. This is necessary if the IR is to be run through NVPTX to generate OpenCL-compatible PTX. The changes do not affect the PTX strings generated for the CUDA target (nvptx64-nvidia-cuda), but are necessary for OpenCL (nvptx64-nvidia-nvcl). Additionally, the data layout has been updated to what the NVPTX backend requests/recommends.

Contributed-by: Philipp Schaad

Reviewers: Meinersbur, grosser, bollu

Reviewed By: grosser, bollu

Subscribers: jlebar, pollydev, llvm-commits, nemanjai, yaxunl, Anastasia

Tags: #polly

Differential Revision: https://reviews.llvm.org/D32215

llvm-svn: 301299
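For illustration, a minimal sketch of the idea under stated assumptions (addrspace(1) is the NVPTX global address space; the function and names below are invented and this is not Polly's actual kernel builder):

    #include "llvm/IR/DerivedTypes.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/LLVMContext.h"
    #include "llvm/IR/Module.h"

    // Build a kernel stub whose pointer parameter carries the NVPTX global
    // address space (1), so the emitted PTX is also acceptable for the
    // nvptx64-nvidia-nvcl (OpenCL) target.
    static llvm::Function *makeKernelStub(llvm::Module &M) {
      llvm::LLVMContext &Ctx = M.getContext();
      const unsigned GlobalAS = 1; // NVPTX: address space 1 == global
      auto *GlobalFloatPtrTy =
          llvm::PointerType::get(llvm::Type::getFloatTy(Ctx), GlobalAS);
      auto *FnTy = llvm::FunctionType::get(llvm::Type::getVoidTy(Ctx),
                                           {GlobalFloatPtrTy},
                                           /*isVarArg=*/false);
      return llvm::Function::Create(FnTy, llvm::GlobalValue::ExternalLinkage,
                                    "polly_kernel_stub", &M);
    }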
-
- Mar 07, 2017
-
-
Tobias Grosser authored
There is no point in optimizing unreachable code, hence our test cases should always return. This commit is part of a series that makes Polly more robust in the presence of unreachables. llvm-svn: 297147
-
- Mar 03, 2017
-
-
Tobias Grosser authored
Some Polly ACC test cases fail without a working NVPTX backend. We explicitly specify this dependence in REQUIRES. Alternatively, we could have marked polly-acc as supported only in case the NVPTX backend is available, but as we might use other backends in the future, this does not seem to be the best choice. For this to work, we also need to make the 'targets_to_build' information available.

Suggested-by: Michael Kruse <llvm@meinersbur.de>

llvm-svn: 296853
-
Tobias Grosser authored
Suggested-by: Michael Kruse <llvm@meinersbur.de>

llvm-svn: 296852
-
- Jan 16, 2017
-
-
Tobias Grosser authored
llvm-svn: 292124
-
- Nov 13, 2016
-
-
Tobias Grosser authored
LLVM recently changed the SCEV canonicalization which changed the output of one of our GPGPU test cases. llvm-svn: 286770
-
- Sep 18, 2016
-
-
Tobias Grosser authored
llvm-svn: 281850
-
Tobias Grosser authored
In case sequential kernels are found deeper in the loop tree than any parallel kernel, the overall scop is probably mostly sequential. Hence, run it on the CPU. llvm-svn: 281849
-
Tobias Grosser authored
Offloading to a GPU is only beneficial if there is a sufficient amount of compute that can be accelerated. Many kernels just have a very small amount of dynamic compute, which means GPU acceleration is not beneficial. We compute at run time an approximation of how many dynamic instructions will be executed and fall back to CPU code in case this number is not sufficiently large. To keep the run-time checking code simple, we over-approximate the number of instructions executed in each statement by computing the volume of the rectangular hull of its iteration space. llvm-svn: 281848
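As a rough illustration of that heuristic, a simplified host-side sketch (Polly emits the equivalent check as generated IR; the names and threshold here are invented):

    #include <cstdint>
    #include <utility>
    #include <vector>

    struct StmtApprox {
      // One [lower, upper) bound pair per loop dimension of the statement.
      std::vector<std::pair<int64_t, int64_t>> Bounds;
      int64_t InstsPerIteration; // static instruction count of the statement
    };

    // Over-approximate the dynamic instruction count by the volume of the
    // rectangular hull of each statement's iteration space.
    static int64_t approxDynamicInsts(const std::vector<StmtApprox> &Stmts) {
      int64_t Total = 0;
      for (const StmtApprox &S : Stmts) {
        int64_t Volume = 1;
        for (const auto &B : S.Bounds)
          Volume *= (B.second > B.first) ? (B.second - B.first) : 0;
        Total += Volume * S.InstsPerIteration;
      }
      return Total;
    }

    // Only offload if the approximated amount of work exceeds a threshold.
    static bool worthOffloading(const std::vector<StmtApprox> &Stmts,
                                int64_t Threshold) {
      return approxDynamicInsts(Stmts) >= Threshold;
    }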
-
Tobias Grosser authored
llvm-svn: 281847
-
- Sep 17, 2016
-
-
Tobias Grosser authored
We may generate GPU kernels that store into scalars in case we run some sequential code on the GPU because the remaining data is expected to already be on the GPU. For these kernels it is important to not keep the scalar values in thread-local registers, but to store them back to the corresponding device memory objects that back them up. We currently only store scalars back at the end of a kernel. This is only correct if precisely one thread is executed. In case more than one thread may be run, we currently invalidate the scop. To support such cases correctly, we would need to always load and store back from a corresponding global memory slot instead of a thread-local alloca slot. llvm-svn: 281838
-
Tobias Grosser authored
and pass these by value rather than by reference. llvm-svn: 281837
-
- Sep 15, 2016
-
-
Tobias Grosser authored
Our alias checks precisely check that the minimal and maximal accessed elements do not overlap in a kernel. Hence, we must ensure that our host <-> device transfers do not touch additional memory locations that are not covered in the alias check. To ensure this, we make sure that the data we copy for a given array is only the data from the smallest element accessed to the largest element accessed. We also adjust the size of the array according to the offset at which the array is actually accessed. An interesting result of this is: in case arrays are accessed with negative subscripts, e.g., A[-100], we automatically allocate and transfer _more_ data to cover the full array. This is important as such code indeed exists in the wild. llvm-svn: 281611
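A minimal host-side sketch of that transfer-range computation (simplified, with invented names; the real code emits this as part of the generated host IR):

    #include <cstddef>
    #include <cstdint>

    // Copy exactly the bytes covered by the alias check: from the smallest
    // accessed element to the largest accessed element. Negative minimum
    // indices (e.g. A[-100]) simply enlarge the transferred region.
    static void copyArrayToDevice(void *DevBase, const char *HostBase,
                                  int64_t MinIdx, int64_t MaxIdx,
                                  size_t ElemSize,
                                  void (*DevMemcpy)(void *, const void *,
                                                    size_t)) {
      const char *Src = HostBase + MinIdx * static_cast<int64_t>(ElemSize);
      size_t Bytes = static_cast<size_t>(MaxIdx - MinIdx + 1) * ElemSize;
      DevMemcpy(DevBase, Src, Bytes);
    }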
-
- Sep 13, 2016
-
-
Tobias Grosser authored
llvm-svn: 281305
-
Tobias Grosser authored
This prevents a compiler crash. llvm-svn: 281303
-
- Sep 12, 2016
-
-
Tobias Grosser authored
Instead of aborting, we now bail out gracefully in case the kernel IR we generate is invalid. This can currently happen in case the SCoP stores pointer values, which we model as arrays, as data values into other arrays. In this case, the original pointer value is not available on the device and can consequently not be stored. As detecting this ahead of time is not so easy, we detect these situations after the invalid IR has been generated and bail out. llvm-svn: 281193
-
- Sep 11, 2016
-
-
Tobias Grosser authored
llvm-svn: 281166
-
Tobias Grosser authored
If these arrays have never been accessed, we failed to derive an upper bound of the accesses and consequently a size for the outermost dimension. We now explicitly check for empty access sets and then just use zero as the size for the outermost dimension. llvm-svn: 281165
-
- Aug 10, 2016
-
-
Tobias Grosser authored
To do so, we change the way array extents are computed. Instead of the precise set of memory locations accessed, we now compute the extent as the range between the minimal and maximal address in the first dimension and the full extent defined by the sizes of the inner array dimensions. We also move the computation of the may_persist region after the construction of the arrays, as it relies on array information. Without arrays being constructed, no useful information is computed at all. llvm-svn: 278212
-
- Aug 09, 2016
-
-
Tobias Grosser authored
Ensure the right scalar allocations are used as the host location of data transfers. For the device code, we clear the allocation cache before device code generation to be able to generate new device-specific allocations, and we need to make sure to add back the old host allocations as soon as the device code generation is finished. llvm-svn: 278126
-
Tobias Grosser authored
This increases the readability of the IR and also clarifies that the GPU initialization is executed _after_ the scalar initialization, which needs to happen before the code of the transformed scop is executed. Besides increased readability, the IR should not change. Specifically, I do not expect any changes in program semantics due to this patch. llvm-svn: 278125
-
Tobias Grosser authored
llvm-svn: 278104
-
Tobias Grosser authored
After having generated the code for a ScopStmt, we run a simple dead-code elimination that drops all instructions that are known to be and remain unused. Until this change, we only considered instructions for dead-code elimination if they had a corresponding instruction in the original BB that belongs to the ScopStmt. However, when generating code we do not only copy code from the BB belonging to a ScopStmt, but also generate code for operands referenced from that BB. After this change, we now also consider code for dead-code elimination that does not have a corresponding instruction in the BB. This fixes a bug in Polly-ACC where such dead code referenced CPU code from within a GPU kernel, which is possible as we do not guarantee that all variables used in known-dead code are moved to the GPU. llvm-svn: 278103
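Conceptually, the clean-up now sweeps over all generated instructions, not just those with an original counterpart in the statement's BB. A simplified sketch using LLVM's triviality check (an assumed simplification, not Polly's exact code):

    #include "llvm/IR/Instruction.h"
    #include "llvm/Transforms/Utils/Local.h"
    #include <vector>

    // Repeatedly erase generated instructions that are (or become) unused.
    static void pruneDeadGeneratedCode(
        std::vector<llvm::Instruction *> &Generated) {
      bool Changed = true;
      while (Changed) {
        Changed = false;
        for (llvm::Instruction *&I : Generated) {
          if (I && llvm::isInstructionTriviallyDead(I)) {
            I->eraseFromParent();
            I = nullptr;
            Changed = true;
          }
        }
      }
    }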
-
Tobias Grosser authored
llvm-svn: 278100
-
- Aug 08, 2016
-
-
Tobias Grosser authored
When adding code that avoids passing values used in isl expressions and LLVM instructions twice, we forgot to make the single variable that is passed to the kernel available in the ValueMap that makes it usable for instructions that are not replaced with isl ast expressions. This change adds the variable that is passed to the kernel to the ValueMap to ensure it is available for such use cases as well. llvm-svn: 278039
-
Tobias Grosser authored
llvm-svn: 278026
-
- Aug 05, 2016
-
-
Tobias Grosser authored
Before this commit, we generated the array type in reverse order and also added the outermost dimension size to the new array declaration. This is incorrect, as Polly assumes an additional unsized outermost dimension, so we had an off-by-one error in the linearization of access expressions. llvm-svn: 277802
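For reference, a small sketch of the linearization convention this relies on (simplified; the outermost dimension is unsized, so its size never enters the offset computation):

    #include <cassert>
    #include <cstdint>
    #include <vector>

    // InnerSizes holds only the inner dimension sizes, e.g. {s1, s2} for an
    // array accessed as A[i][j][k] with an unsized outermost dimension.
    static int64_t linearizeAccess(const std::vector<int64_t> &Subscripts,
                                   const std::vector<int64_t> &InnerSizes) {
      assert(Subscripts.size() == InnerSizes.size() + 1);
      int64_t Offset = Subscripts[0]; // outermost dimension: no size needed
      for (size_t D = 0; D < InnerSizes.size(); ++D)
        Offset = Offset * InnerSizes[D] + Subscripts[D + 1];
      return Offset;
    }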
-
Tobias Grosser authored
llvm-svn: 277800
-
Tobias Grosser authored
These annotations ensure that the NVIDIA PTX assembler limits the number of registers used such that we can be certain the resulting kernel can be executed with the number of threads per thread block that we are planning to use. llvm-svn: 277799
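One way such a bound can be expressed at the NVVM level is via !nvvm.annotations metadata on the kernel, which lets the PTX assembler cap register usage for the intended block size. Whether Polly emits exactly this annotation key is an assumption of this sketch:

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/Function.h"
    #include "llvm/IR/Metadata.h"
    #include "llvm/IR/Module.h"
    #include "llvm/IR/Type.h"

    // Attach a per-block thread bound to a kernel function (illustrative).
    static void annotateMaxThreadsX(llvm::Function *Kernel,
                                    unsigned BlockDimX) {
      llvm::Module *M = Kernel->getParent();
      llvm::LLVMContext &Ctx = M->getContext();
      llvm::Metadata *Ops[] = {
          llvm::ValueAsMetadata::get(Kernel),
          llvm::MDString::get(Ctx, "maxntidx"), // assumed annotation key
          llvm::ConstantAsMetadata::get(
              llvm::ConstantInt::get(llvm::Type::getInt32Ty(Ctx), BlockDimX))};
      M->getOrInsertNamedMetadata("nvvm.annotations")
          ->addOperand(llvm::MDNode::get(Ctx, Ops));
    }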
-
- Aug 04, 2016
-
-
Tobias Grosser authored
llvm-svn: 277726
-
Tobias Grosser authored
llvm-svn: 277722
-
Tobias Grosser authored
llvm-svn: 277721
-
Tobias Grosser authored
Pass the content of scalar array references to the alloca on the kernel side and do not pass them additionally as normal LLVM scalar values. llvm-svn: 277699
-
Tobias Grosser authored
llvm-svn: 277697
-
- Aug 03, 2016
-
-
Tobias Grosser authored
Otherwise, we would try to re-optimize them with Polly-ACC and possibly even generate kernels that try to offload themselves, which does not work because the GPURuntime is not available on the accelerator, and also does not make any sense. llvm-svn: 277589
-
- Jul 28, 2016
-
-
Tobias Grosser authored
llvm-svn: 276964
-