[mlir][sparse][gpu] generate proper memcpy in/out host and device
The host registration is a convenient way to get CUDA kernels running, but it may be slow and does not work for all buffers (such as global constants). This revision uses proper alloc/copy/dealloc chains for buffers, using asynchronous chains to increase overlap. The host registration mechanism is kept under a flag for the output, just for experimentation purposes while this project ramps up.

Reviewed By: Peiming

Differential Revision: https://reviews.llvm.org/D148682
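
For illustration only, an asynchronous alloc/copy/dealloc chain of the kind this revision describes might look roughly like the following in the gpu dialect. This is a hand-written sketch, not output of the actual pass: the function name, buffer names, and element type are assumptions, and the kernel launch is elided.

```mlir
// Hypothetical sketch; @copy_in_out, %host, and %n are illustrative names.
func.func @copy_in_out(%host: memref<?xf64>, %n: index) {
  // Start an asynchronous chain.
  %t0 = gpu.wait async
  // Allocate a device buffer as part of the chain.
  %dev, %t1 = gpu.alloc async [%t0] (%n) : memref<?xf64>
  // Copy host -> device (destination operand comes first).
  %t2 = gpu.memcpy async [%t1] %dev, %host : memref<?xf64>, memref<?xf64>
  // A kernel launch that uses %dev would be chained on %t2 here.
  // Copy device -> host to retrieve the output.
  %t3 = gpu.memcpy async [%t2] %host, %dev : memref<?xf64>, memref<?xf64>
  // Release the device buffer.
  %t4 = gpu.dealloc async [%t3] %dev : memref<?xf64>
  // Block until the whole chain has completed.
  gpu.wait [%t4]
  return
}
```

Each operation consumes the token of its predecessor and produces a new one, so ordering is explicit. Independent buffers can get their own chains, which is what allows transfers to overlap.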