A follow up Q from: CUDA: Calling a __device__ function from a kernel
I'm trying to speed up a sort operation. A simplified pseudo version follows:
// some costly swap operation
__device__ void swap(float* ptrA, float* ptrB){
    float saveData;      // swap some
    saveData = *ptrA;    // big complex
    *ptrA = *ptrB;       // data chunk
    *ptrB = saveData;
}
// a rather simple sort operation
__global__ void sort(float data[]){
    for (int i = 0; i < limit; i++){
        // find left swap point
        // find right swap point
        swap<<<1,1>>>(left, right);
    }
}
(Note: This simple version doesn't show the reduction techniques in the blocks.)
The idea is that identifying the swap points is easy (fast), while the swap operation itself is costly (slow). So use one block to find the swap points and use other blocks to do the swap operations, i.e. do the actual swapping in parallel.
This sounds like a decent plan. But if the compiler in-lines the device calls, then there is no parallel swapping taking place.
Is there a way to tell the compiler to NOT in-line a device call?
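(For reference, CUDA also has a __noinline__ qualifier that asks the compiler not to inline a __device__ function. A minimal sketch of my own follows; note that it only affects code generation, since a __device__ call still executes on the calling thread, so it does not by itself make the swapping parallel.)
// __noinline__ asks nvcc not to inline this device function; the call
// still runs on the calling thread, so this alone adds no parallelism.
__device__ __noinline__ void swapNoInline(float* ptrA, float* ptrB)
{
    float saveData = *ptrA;
    *ptrA = *ptrB;
    *ptrB = saveData;
}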
It has been a long time since this question was asked. When I googled the same problem I ended up on this page, and it seems I have found a solution.
Solution:
I got [here][1] somehow and saw this neat approach for launching a kernel from within another kernel:
__global__ void kernel_child(float *var1, int N){
    //do data operations here
}

__global__ void kernel_parent(float *var1, int N)
{
    kernel_child<<<1,2>>>(var1,N);
}
Dynamic parallelism, available since CUDA 5.0, makes this possible. Also make sure you compile for the compute_35 architecture or above.
Compiling and running from the terminal
You can compile and run the above parent kernel (which in turn launches the child kernel) from the terminal. Verified on a Linux machine:
$ nvcc -arch=sm_35 -rdc=true yourFile.cu
$ ./a.out
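As a rough sketch of how this could map onto the swap idea in the question (my own illustration; it assumes each swap moves a contiguous chunk of floats whose offsets have already been found):
// Child: threads cooperatively swap one chunk of 'count' floats.
__global__ void swap_child(float *left, float *right, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        float tmp = left[i];
        left[i]  = right[i];
        right[i] = tmp;
    }
}

// Parent: once a pair of swap points has been identified, launch a child
// kernel so the copy itself runs with many threads. A parent kernel is not
// considered complete until all of its child kernels have finished.
__global__ void sort_parent(float *data, int chunk, int leftIdx, int rightIdx)
{
    swap_child<<<(chunk + 255) / 256, 256>>>(data + leftIdx, data + rightIdx, chunk);
}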
Hope it helps. Thank you!
[1]: http://developer.download.nvidia.com/assets/cuda/docs/TechBrief_Dynamic_Parallelism_in_CUDA_v2.pdf
Edit (2016):
Dynamic parallelism was introduced in the second generation of Kepler architecture GPUs. Launching kernels from the device is supported on devices of compute capability 3.5 and higher.
Original Answer:
You will have to wait until the end of the year when the next generation of hardware is available. No current CUDA devices can launch kernels from other kernels - it is presently unsupported.
Related
I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I encountered a situation where I had to print inside a kernel, and I noticed that std::cout inside the kernel does not work whereas printf does. For example, consider the following SYCL code.
This works:
#include <CL/sycl.hpp>
using namespace cl::sycl;

class dummyClass;

void print(float* A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler){
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx){
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying: "Accessing non-const global variable is not allowed within SYCL device code."
A similar thing happens with CUDA kernels.
This got me thinking: what is the difference between printf and std::cout that causes this behavior?
Also, suppose I wanted to implement a custom print function to be called from the GPU; how should I do it?
TIA
This got me thinking: what is the difference between printf and std::cout that causes this behavior?
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists at all in CUDA C). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But I decided I don't really like it, and its compilation is rather expensive, so instead I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals, the only way to do this is to build on printf() calls rather than on the C standard library or system calls you would use on the host side. You essentially need to implement your entire stream on top of that low-level primitive I/O facility. It is far from trivial.
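To make that concrete, here is a minimal sketch of my own (with hypothetical names) of a tiny device-side helper built on top of the device printf() primitive:
// Hypothetical helper: a device-side "logger" that routes everything
// through device printf(), the only formatted-output facility in kernel code.
__device__ void log_float(const char* label, float value)
{
    printf("[block %d, thread %d] %s = %f\n",
           blockIdx.x, threadIdx.x, label, value);
}

__global__ void demo_kernel(const float* data)
{
    log_float("data", data[threadIdx.x]);
}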
In SYCL you cannot use std::cout for output in code that is not running on the host, for reasons similar to those given in the answer above for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
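As a minimal sketch of my own (not taken from the guide) of what using the stream class looks like:
#include <CL/sycl.hpp>
using namespace cl::sycl;

void print_with_stream(float* A, size_t N){
    buffer<float, 1> Buffer{A, {N}};
    queue Queue;
    Queue.submit([&](handler& Handler){
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        // A stream is constructed with a total buffer size and a
        // per-work-item buffer size, and can only be written to on the device.
        stream Out(1024, 256, Handler);
        Handler.parallel_for<class stream_demo>(range<1>{N}, [=](id<1> idx){
            Out << accessor[idx[0]] << endl;
        });
    });
}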
There is no __device__ version of std::cout, so only printf can be used in device code.
I have used the following method, expecting to avoid a memcpy from host to device. Does the Thrust library ensure that there won't be a memcpy from host to device in the process?
#include <thrust/device_ptr.h>
#include <thrust/scan.h>

void EScanThrust(float * d_in, float * d_out)
{
    thrust::device_ptr<float> dev_ptr(d_in);
    thrust::device_ptr<float> dev_out_ptr(d_out);
    // 'size' is the number of elements, defined elsewhere in my code
    thrust::exclusive_scan(dev_ptr, dev_ptr + size, dev_out_ptr);
}
Here d_in and d_out are allocated using cudaMalloc, and d_in is filled with data using cudaMemcpy before calling this function.
Does thrust library ensure that there wont be a memcpy from host to device in the process?
The code you've shown shouldn't involve any host->device copying. (How could it? There are no references anywhere to any host data in the code you have shown.)
For actual codes, it's easy enough to verify the underlying CUDA activity using a profiler, for example:
nvprof --print-gpu-trace ./my_exe
If you keep your profiled code sequences short, it's pretty easy to line up the underlying CUDA activity with the thrust code that generated that activity. If you want to profile just a short segment of a longer sequence, then you can turn profiling on and off or else use NVTX markers to identify the desired range in the profiler output.
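For instance, here is a minimal sketch of my own (reusing the EScanThrust wrapper from the question) that brackets the scan with an NVTX range so it is easy to find in the profiler output:
#include <cuda_runtime.h>
#include <nvToolsExt.h>               // NVTX marker API; link with -lnvToolsExt

void EScanThrust(float* d_in, float* d_out);   // the wrapper from the question

int main()
{
    const int size = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  size * sizeof(float));
    cudaMalloc(&d_out, size * sizeof(float));
    // ... fill d_in with cudaMemcpy from a host buffer here ...

    nvtxRangePushA("exclusive_scan"); // activity until the pop is grouped under this name
    EScanThrust(d_in, d_out);
    nvtxRangePop();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}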
In my code, I want to push_back my data in a __global__ function, and it is hard to use a plain array here. So I want to know: is it possible to use the push_back method in a CUDA kernel?
Can I use std::vector in a __global__ function in some other way, or how can I use a thrust vector in a __global__ function?
Can somebody give me an example code?
It is not possible to use either std::vector or thrust::device_vector in CUDA kernel code. Thrust is a host-side abstraction for GPU arrays and algorithms which cannot be used inside CUDA kernels.
You should rethink your approach. push_back-style appending of data is a fundamentally serial operation which requires some sort of locking or atomic operation in data-parallel execution models. This almost always has a negative performance impact on GPU code.
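If you do need append-like behavior, the usual pattern (sketched below as my own illustration, not a drop-in replacement for push_back) is to preallocate an output buffer and reserve slots with atomicAdd on a counter:
// Each thread that wants to "push_back" a value reserves a slot in a
// preallocated buffer by atomically incrementing a global counter.
__global__ void collect(const float* in, int n, float* out, int* out_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {            // example predicate
        int slot = atomicAdd(out_count, 1); // reserve a unique index
        out[slot] = in[i];
    }
}
Note that out must be preallocated for the worst case, and the order of the appended elements is not deterministic.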
I need to solve a sorting problem using Quick Sort. When I try to run the code I get many errors, but the major one appears when the QuickSort kernel calls itself (the kernel launches itself twice). How can I solve this problem? My code is below; can anyone help me?
Note: I am new to programming in CUDA.
__global__ void QuickSort(int p, int r, char *c)
{
    if (p < r)
    {
        int q = Partition(p, r, c);
        QuickSort<<<5,5>>>(p, q-1, c);
        QuickSort<<<5,5>>>(q+1, r, c);
    }
}
Your GPU card (compute capability 3.0) does not support Dynamic Parallelism, which needs compute capability 3.5 or higher. Dynamic Parallelism is what supports this kind of recursive approach, with new resources allocated on the GPU. A Quicksort algorithm with a CUDA implementation and information about Dynamic Parallelism is shown here: http://blogs.nvidia.com/2012/09/how-tesla-k20-speeds-up-quicksort-a-familiar-comp-sci-code/ .
However, on your GPU I suggest employing a different way to implement Quicksort, as the implementation in the link above mainly demonstrates the benefits of Dynamic Parallelism rather than showing an algorithm with peak performance. You can refer to the paper "GPU-Quicksort: A Practical Quicksort Algorithm for Graphics Processors" for better performance on your card.
My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is OK, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking are done through makefiles with a lot of flags. I think the problem is finding the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu
with a program like this (to run on a Linux machine):
#include "cuPrintf.cu"   // cuPrintf and its host-side API come from the CUDA SDK samples

__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestions on why the kernel doesn't launch?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:
- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
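For example, a command along these lines (a sketch; adjust the values for your device) separates the virtual and real architectures explicitly:
nvcc -arch=compute_20 -code=sm_20 test.cu -o test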
In Visual Studio:
Right click on your project > Properies > Cuda C/C++ > Device
and add then following to Code Generation field
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all these architectures makes your build a bit slower and the binary larger. So eliminate them one by one to find which compute and sm values are required for your GPU.
But if you are shipping this to others, it is better to include all of these.
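On the command line, the equivalent (a sketch with just two of the architectures above and a placeholder output name) is one -gencode flag per architecture pair:
nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_61,code=sm_61 yourFile.cu -o yourApp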