How to call a Thrust function in a stream from a kernel? - cuda

I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it from another host thread). However, thrust::cuda::par.on(stream) is a host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on the Turing architecture.
__global__ void async_scatter_kernel(float* first,
                                     float* last,
                                     int* map,
                                     float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}
I know Thrust uses dynamic parallelism to launch its kernels when called from the device; however, I couldn't find a way to specify the stream.

The following code compiles cleanly for me on CUDA 10.1.243:
$ cat t1518.cu
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>
__global__ void async_scatter_kernel(float* first,
                                     float* last,
                                     int* map,
                                     float* output)
{
    cudaStream_t stream;
    cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
    thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream);
}

int main(){
    float *first = NULL;
    float *last = NULL;
    float *output = NULL;
    int *map = NULL;
    async_scatter_kernel<<<1,1>>>(first, last, map, output);
    cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -rdc=true t1518.cu -o t1518
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$
The -arch=sm_35 (or similar) and -rdc=true are necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the -rdc=true switch, you get an error similar to what you describe:
$ nvcc -arch=sm_35 t1518.cu -o t1518
t1518.cu(11): error: calling a __host__ function("thrust::cuda_cub::par_t::on const") from a __global__ function("async_scatter_kernel") is not allowed
t1518.cu(11): error: identifier "thrust::cuda_cub::par_t::on const" is undefined in device code
2 errors detected in the compilation of "/tmp/tmpxft_00003a80_00000000-8_t1518.cpp1.ii".
$
So, for the example you have shown here, your compilation error can be eliminated by specifying the proper compile command line, by updating to the latest CUDA version, or both.

Related

CUDA: Forgetting kernel launch configuration does not result in NVCC compiler warning or error

When I try to call a CUDA kernel (a __global__ function) using a function pointer, everything appears to work just fine. However, if I forget to provide the launch configuration when calling the kernel, NVCC will not emit an error or warning; the program compiles, but it crashes if I attempt to run it.
__global__ void bar(float x) { printf("foo: %f\n", x); }
typedef void(*FuncPtr)(float);
void invoker(FuncPtr func)
{
    func<<<1, 1>>>(1.0);
}
invoker(bar);
cudaDeviceSynchronize();
Compile and run the above; everything works fine. Then remove the kernel's launch configuration (i.e., <<<1, 1>>>). The code still compiles, but it crashes when you try to run it.
Any idea what is going on? Is this a bug, or am I not supposed to pass around pointers to __global__ functions?
CUDA version: 8.0
OS version: Debian (Testing repo)
GPU: NVIDIA GeForce 750M
If we take a slightly more complex version of your repro, and look at the code emitted by the CUDA toolchain front-end, it becomes possible to see what is happening:
#include <cstdio>
__global__ void bar_func(float x) { printf("foo: %f\n", x); }
typedef void(*FuncPtr)(float);
void invoker(FuncPtr passed_func)
{
#ifdef NVCC_FAILS_HERE
    bar_func(1.0);
#endif
    bar_func<<<1,1>>>(1.0);
    passed_func(2.0);
    passed_func<<<1,1>>>(3.0);
}
So let's compile it a couple of ways:
$ nvcc -arch=sm_52 -c -DNVCC_FAILS_HERE invoker.cu
invoker.cu(10): error: a __global__ function call must be configured
i.e., the front-end can detect that bar_func is a __global__ function and requires launch parameters. Another attempt:
$ nvcc -arch=sm_52 -c -keep invoker.cu
As you note, this produces no compile error. Let's look at what happened:
void bar_func(float x) ;
# 5 "invoker.cu"
typedef void (*FuncPtr)(float);
# 7 "invoker.cu"
void invoker(FuncPtr passed_func)
# 8 "invoker.cu"
{
# 12 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : (bar_func)((1.0));
# 13 "invoker.cu"
passed_func((2.0));
# 14 "invoker.cu"
(cudaConfigureCall(1, 1)) ? (void)0 : passed_func((3.0));
# 15 "invoker.cu"
}
The standard kernel invocation syntax <<<>>> gets expanded into an inline call to cudaConfigureCall, and then a host wrapper function is called. The host wrapper has the API internals required to launch the kernel:
void bar_func( float __cuda_0)
# 3 "invoker.cu"
{__device_stub__Z8bar_funcf( __cuda_0); }
void __device_stub__Z8bar_funcf(float __par0)
{
if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
{ volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(float))bar_func));
(void)cudaLaunch(((char *)((void ( *)(float))bar_func)));
};
}
So the stub only handles arguments and launches the kernel via cudaLaunch. It does not handle the launch configuration.
The underlying reason for the crash (actually an undetected runtime API error) is that the kernel launch happens without a prior configuration. Obviously this happens because the CUDA front end (and C++, for that matter) can't do pointer introspection at compile time and detect that your function pointer points to the host stub for launching a kernel.
I think the only way to describe this is a "limitation" of the runtime API and compiler. I wouldn't say what you are doing is wrong, but I would probably be using the driver API and explicitly managing the kernel launch myself in such a situation.
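For illustration, here is a minimal, untested sketch of that driver API approach, assuming the kernel has been compiled to a PTX file (the file name bar.ptx and the mangled symbol _Z3barf are assumptions for this example). With cuLaunchKernel the launch configuration is a mandatory argument, so it cannot be silently dropped the way <<<...>>> can:
#include <cstdio>
#include <cuda.h>

void invoker(CUfunction func, float x)
{
    // the grid and block dimensions are explicit parameters of the launch call
    void *args[] = { &x };
    CUresult rc = cuLaunchKernel(func, 1, 1, 1, 1, 1, 1, 0, 0, args, 0);
    if (rc != CUDA_SUCCESS) fprintf(stderr, "launch failed: %d\n", (int)rc);
}

int main()
{
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoad(&mod, "bar.ptx");            // assumed PTX file name
    CUfunction f;   cuModuleGetFunction(&f, mod, "_Z3barf");  // assumed mangled name of bar(float)
    invoker(f, 1.0f);
    cuCtxSynchronize();
    return 0;
}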

Loading multiple modules in JCuda is not working

In JCuda one can load CUDA modules in PTX or CUBIN format and call (launch) __global__ functions (kernels) from Java.
With that in mind, I want to develop a framework with JCuda that receives a user's __device__ function in a .cu file at run time, then loads and runs it.
I have already implemented a __global__ function in which each thread finds the start point of its related data, performs some computation and initialization, and then calls the user's __device__ function.
Here is my kernel pseudo code:
extern "C" __device__ void userFunc(args);
extern "C" __global__ void kernel(){
// initialize
userFunc(args);
// rest of the kernel
}
And user's __device__ function:
extern "C" __device__ void userFunc(args){
// do something
}
On the Java side, here is the part where I load the modules (the modules are made from PTX files, which are successfully created from the CUDA files with this command: nvcc -m64 -ptx path/to/cudaFile -o cudaFile.ptx):
CUmodule kernelModule = new CUmodule(); // 1
CUmodule userFuncModule = new CUmodule(); // 2
cuModuleLoad(kernelModule, ptxKernelFileName); // 3
cuModuleLoad(userFuncModule, ptxUserFuncFileName); // 4
When I try to run it, I get an error at line 3: CUDA_ERROR_NO_BINARY_FOR_GPU. After some searching I gathered that my PTX file has some syntax error. After running this suggested command:
ptxas -arch=sm_30 kernel.ptx
I got:
ptxas fatal : Unresolved extern function 'userFunc'
Even when I swap lines 3 and 4 to load userFunc before kernel, I get the same error. I am stuck at this point. Is this the correct way to load multiple modules that need to be linked together in JCuda? Or is it even possible?
Edit:
Second part of the question is here
The really short answer is: No, you can't load multiple modules into a context in the runtime API.
You can do what you want, but it requires explicit setup and execution of a JIT linking call. I have no idea how (or even whether) that has been implemented in JCUDA, but I can show you how to do it with the standard driver API. Hold on...
If you have a device function in one file, and a kernel in another, for example:
// test_function.cu
#include <math.h>
__device__ float mathop(float &x, float &y, float &z)
{
    float res = sin(x) + cos(y) + sqrt(z);
    return res;
}
and
// test_kernel.cu
extern __device__ float mathop(float & x, float & y, float & z);
__global__ void kernel(float *xvals, float * yvals, float * zvals, float *res)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    res[tid] = mathop(xvals[tid], yvals[tid], zvals[tid]);
}
You can compile them to PTX as usual:
$ nvcc -arch=sm_30 -ptx test_function.cu
$ nvcc -arch=sm_30 -ptx test_kernel.cu
$ head -14 test_kernel.ptx
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19324607
// Cuda compilation tools, release 7.0, V7.0.27
// Based on LLVM 3.4svn
//
.version 4.2
.target sm_30
.address_size 64
// .globl _Z6kernelPfS_S_S_
.extern .func (.param .b32 func_retval0) _Z6mathopRfS_S_
At runtime, your code must create a JIT link session, add each PTX to the linker session, then finalise the linker session. This will give you a handle to a compiled cubin image which can be loaded as a module as usual. The simplest possible driver API code to put this together looks like this:
#include <cstdio>
#include <cuda.h>
#define drvErrChk(ans) { drvAssert(ans, __FILE__, __LINE__); }
inline void drvAssert(CUresult code, const char *file, int line, bool abort=true)
{
    if (code != CUDA_SUCCESS) {
        fprintf(stderr, "Driver API Error %04d at %s %d\n", int(code), file, line);
        exit(-1);
    }
}

int main()
{
    cuInit(0);

    CUdevice device;
    drvErrChk( cuDeviceGet(&device, 0) );

    CUcontext context;
    drvErrChk( cuCtxCreate(&context, 0, device) );

    CUlinkState state;
    drvErrChk( cuLinkCreate(0, 0, 0, &state) );
    drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_function.ptx", 0, 0, 0) );
    drvErrChk( cuLinkAddFile(state, CU_JIT_INPUT_PTX, "test_kernel.ptx", 0, 0, 0) );

    size_t sz;
    char * image;
    drvErrChk( cuLinkComplete(state, (void **)&image, &sz) );

    CUmodule module;
    drvErrChk( cuModuleLoadData(&module, image) );
    drvErrChk( cuLinkDestroy(state) );

    CUfunction function;
    drvErrChk( cuModuleGetFunction(&function, module, "_Z6kernelPfS_S_S_") );

    return 0;
}
You should be able to compile and run this as posted and verify it works OK. It should serve as a template for a JCUDA implementation, if they have JIT linking support implemented.
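For completeness, a hedged usage sketch of how the linked kernel could then be launched from that point with the driver API (untested; the 256-element buffers and one-block launch are arbitrary choices for this example):
    // ... continuing after the cuModuleGetFunction() call in the program above:
    const int N = 256;
    CUdeviceptr x, y, z, res;
    drvErrChk( cuMemAlloc(&x, N * sizeof(float)) );
    drvErrChk( cuMemAlloc(&y, N * sizeof(float)) );
    drvErrChk( cuMemAlloc(&z, N * sizeof(float)) );
    drvErrChk( cuMemAlloc(&res, N * sizeof(float)) );

    void *args[] = { &x, &y, &z, &res };
    drvErrChk( cuLaunchKernel(function, 1, 1, 1,   // grid dimensions
                              N, 1, 1,             // block dimensions
                              0, 0,                // shared memory bytes, stream
                              args, 0) );
    drvErrChk( cuCtxSynchronize() );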

How to compile multiple files in cuda?

I did it the way I do with gcc
nvcc a.cu ut.cu
but the compiler shows
ptxas fatal : Unresolved extern function '_Z1fi'
The problem only occurs when the function is a __device__ function.
[File ut.h]
__device__ int f(int);
[File ut.c]
#include "ut.h"
__device__ int f(int a){
    return a*a;
}
[File a.cu]
#include "ut.h"
__global__ void mk(){
    f(5);
}
int main(){
    mk<<<1,1>>>();
}
When a __device__ or __global__ function calls a __device__ function (or __global__ function, in the case of dynamic parallelism) in another translation unit (i.e. file), then it is necessary to use device linking. To enable device linking with your simple compile command, just add the -rdc=true switch:
nvcc -rdc=true a.cu ut.cu
That should fix the issue.
Note that in your compile command you list "ut.cu", but in your question you show "ut.c"; I assume that should be the file "ut.cu". If not, you will also need to rename the file from "ut.c" to "ut.cu".
You can read more about device linking in the nvcc manual.
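If you prefer separate compilation steps, an equivalent build (again assuming the file is named ut.cu) could look like this:
nvcc -dc a.cu ut.cu
nvcc a.o ut.o -o app
Here -dc compiles with relocatable device code (like -rdc=true -c), and the second nvcc invocation performs the device link followed by the normal host link.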

Call cublas in a kernel

I want to use Zgemv in parallel.
__global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    ....
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S+i*n*n, n, A+n*i, 1, &beta, B+i*n, 1);
}

void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    dim3 grid = dim3(1,1,1);
    dim3 block = dim3(32,1,1);
    S_Cphir<<<grid,block>>>(S,A,B,n,l);
}
my compile command is
nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device-code true
nvcc -o ./main.v2 time_propagation_cublas.o -lcublas
The first line works, but the second line fails:
In function`__sti____cudaRegisterAll_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356()';tmpxft_000032b7_00000000-3_time_propagation_cublas.cudafe1.cpp:(.text+0x17a4):
undefined reference to `__cudaRegisterLinkedBinary_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356'
collect2: ld returned 1 exit status
I search the "cudaRegisterLinkedBinary" but I have nothing!!
I know nvcc support to call cublas in kernel.
Use the CUBLAS Device Library sample code as your reference. On a standard CUDA 5.5 install, you'll find it at:
/usr/local/cuda/samples/7_CUDALibraries/simpleDevLibCUBLAS
Referring to the Makefile in that directory, your compile command should look like this:
nvcc -arch=sm_35 -rdc=true -o main.v2 time_propagation_cublas.cu -lcublas -lcublas_device -lcudadevrt
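For reference, the device-side call pattern used by that sample looks roughly like the sketch below. This is a minimal, untested illustration assuming the CUDA 5.x device cuBLAS library (cublas_device); the kernel name, the per-thread indexing, and the alpha/beta values are placeholders, not your actual code:
#include <cublas_v2.h>
#include <cuComplex.h>

__global__ void zgemv_batch(cuDoubleComplex *S, cuDoubleComplex *A,
                            cuDoubleComplex *B, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    // with the device library, the handle is created and destroyed inside the kernel
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return;

    cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                S + i*n*n, n, A + i*n, 1, &beta, B + i*n, 1);

    cublasDestroy(handle);
}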

Link .ll files generated by compiling .cu file with clang

I am compiling the following code using clang with:
clang++ -std=c++11 -emit-llvm -c -S $1 --cuda-gpu-arch=sm_30. This generates the files vectoradd-cuda-nvptx64-nvidia-cuda-sm_30.ll and vectoradd.ll. The goal is to run some LLVM analysis passes on the kernel, which would possibly instrument it, so I would like to link the post-analysis IR into an executable, but I am not sure how. When I try to link the .ll files with llvm-link I get the error Linking globals named '_Z9vectoraddPiS_S_i': symbol multiply defined!. I am not really sure how to achieve this, so any help is appreciated.
#include <vector>

#define THREADS_PER_BLOCK 512

__global__ void vectoradd(int *A, int *B, int *C, int N) {
    int gi = threadIdx.x + blockIdx.x * blockDim.x;
    if (gi < N) {
        C[gi] = A[gi] + B[gi];
    }
}

int main(int argc, char **argv) {
    int N = 10000, *d_A, *d_B, *d_C;
    /// allocate host memory
    std::vector<int> A(N);
    std::vector<int> B(N);
    std::vector<int> C(N);
    /// allocate device memory
    cudaMalloc((void **) &d_A, N * sizeof(int));
    cudaMalloc((void **) &d_B, N * sizeof(int));
    cudaMalloc((void **) &d_C, N * sizeof(int));
    /// populate host data
    for (size_t i = 0; i < N; ++i) {
        A[i] = i; B[i] = i;
    }
    /// copy to device
    cudaMemcpy(d_A, &A[0], N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, &B[0], N * sizeof(int), cudaMemcpyHostToDevice);
    dim3 block(THREADS_PER_BLOCK, 1, 1);
    dim3 grid((N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, 1, 1);
    vectoradd<<<grid,block>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    cudaMemcpy(&C[0], d_C, N * sizeof(int), cudaMemcpyDeviceToHost);
    return 0;
}
The CUDA compilation trajectory in Clang is rather complicated (as it is in the NVIDIA toolchain) and what you are trying to do cannot work. The LLVM IR from each branch of the compilation process must remain separate until directly linkable objects are available. As a result, there are many intermediate steps which you will need to perform manually.
LLVM IR for the GPU must first be compiled to PTX, and then assembled into a binary payload which can be linked against host object files.
So in your example, you first do something like:
clang++ -std=c++11 -emit-llvm -c -S test.cu --cuda-gpu-arch=sm_52
which emits two llvm IR files test-cuda-nvptx64-nvidia-cuda-sm_52.ll and test.ll. The GPU code then needs to be compiled to PTX (see more about the nvptx backend here):
llc -mcpu=sm_52 test-cuda-nvptx64-nvidia-cuda-sm_52.ll -o test.ptx
Now the PTX code can be assembled into an ELF file, which can later be linked by nvcc (or by the host linker with a couple of additional steps) in the normal way:
ptxas --gpu-name=sm_52 test.ptx -o test.ptx.o
fatbinary --cuda -64 --create test.fatbin --image=profile=sm_52,file=test.ptx.o
For the host code you do something like
llc test.ll
clang -m64 -c test.s
to produce assembler output from the LLVM IR and then assemble that to an object file.
Now, with a fatbin file containing the compiled CUDA code and an object file containing the compiled host code, you can perform linkage. I have not been able to test linking a host object file with a fatbinary using clang; that is something you will need to work out yourself. It will be instructive to study both the verbose output of clang during a CUDA compilation call, and also the nvcc documentation, to get a better feel for how the device code build system works.
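As a starting point for that study, both front ends can print the full sequence of commands they run (both flags are standard; adjust file names and architectures to your setup):
clang++ -### -std=c++11 -c test.cu --cuda-gpu-arch=sm_52
nvcc --dryrun -arch=sm_52 -c test.cu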