using thrust device_vector as global variable - cuda

Why does the following code crash at the end of the main?
#include <thrust/device_vector.h>
thrust::device_vector<float4> v;
int main(){
v.resize(1000);
return 0;
}
The error is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unspecified driver error
If I use host_vector instead of device_vector the code run fine.
Do you think it's a Thrust bug, or am I doing something wrong here?
I tried it on ubuntu 10.10 with cuda 4.0 and on Windows 7 with cuda 6.5.
The Thrust version is 1.7 in both cases.
thanks

The problem is neither a bug in Thrust, nor are you doing something wrong. Rather, this is a limitation of the design of the CUDA runtime API.
The underlying reason for the crash is that the destructor for the thrust::vector is being called when the variable falls out of scope, which is happening after the CUDA runtime API context has been torn down. This will produce a runtime error (probably cudaErrorCudartUnloading) because the process is attempting to call cudaFree after it has already disconnected from the CUDA driver.
I am unaware of a workaround other than not using Thrust device containers declared at main() translation unit scope.

Related

Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in Cuda and SYCL. I have encountered a situation where I had to print inside the kernel and I noticed that std::cout inside the kernel does not work whereas printf works. For example, consider the following SYCL Codes -
This works -
void print(float*A, size_t N){
buffer<float, 1> Buffer{A, {N}};
queue Queue((intel_selector()));
Queue.submit([&Buffer, N](handler& Handler){
auto accessor = Buffer.get_access<access::mode::read>(Handler);
Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1>idx){
printf("%f", accessor[idx[0]]);
});
});
}
whereas if I replace the printf with std::cout<<accessor[idx[0]] it raises a compile time error saying - Accessing non-const global variable is not allowed within SYCL device code.
A similar thing happens with CUDA kernels.
This got me thinking that what may be the difference between printf and std::coout which causes such behavior.
Also suppose If I wanted to implement a custom print function to be called from the GPU, how should I do it?
TIA
This got me thinking that what may be the difference between printf and std::cout which causes such behavior.
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of of which is closed, if it at all exists in CUDA C). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But - I decided I don't really like it, and it compilation is rather expensive, so instead, I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals - the only way to do this is to use printf() calls instead of C standard library or system calls on the host side. You essentially need to modularize your entire stream over the low-level primitive I/O facilities. It is far from trivial.
In SYCL you cannot use std::cout for output on code not running on the host for similar reasons to those listed in the answer for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
There is no __device__ version of std::cout, so only printf can be used in device code.

Unable to call CUDA half precision functions from the host

I am trying to do some FP16 work that will have both CPU and GPU backend. I researched my options and decided to use CUDA's half precision converter and data types. The ones I intent to use are specified as both __device__ and __host__ which according to my understanding (and the official documentation) should mean that the functions are callable from both HOST and DEVICE code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>
int main() {
const float a = 32.12314f;
__half2 test = __float2half2_rn(a);
__half test2 = __float2half(a);
return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2 but i can't find an older version of it, nor can I find anything in the changelog. Ideas?
Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2

assert() in CUDA 5.5

I've just upgraded from CUDA 5.0 to 5.5 and all my VS2012 CUDA projects have stopped compiling due to a problem with assert(). To repro the problem, I created a new CUDA 5.5 project in VS 2012 and added the code straight from Programming Guide and got the same error.
__global__ void testAssert(void)
{
int is_one = 1;
int should_be_one = 0;
// This will have no effect
assert(is_one);
// This will halt kernel execution
assert(should_be_one);
}
This produces the following compiler error:
kernel.cu(22): error : calling a __host__ function("_wassert") from a __global__ function("testAssert") is not allowed
Is there something obvious that I'm missing?
Make sure you are including assert.h, and make sure you are targeting sm_20 or later. Also check you're not including Windows headers, and if you are then try without.

Does gpuocelot support dynamic memory allocation in CUDA device?

My algorithm (parallel multi-frontal Gaussian elimination) needs to dynamically allocate memory (tree building) inside CUDA kernel. Does anyone know if gpuocelot supports such things?
According to this: stackoverflow-link and CUDA programming guide I can do such things. But with gpuocelot I get errors during runtime.
Errors:
When I call malloc() inside kernel I get this error:(2.000239) ExternalFunctionSet.cpp:371: Assertion message: LLVM required to call external host functions from PTX.
solver: ocelot/ir/implementation/ExternalFunctionSet.cpp:371: void ir::ExternalFunctionSet::ExternalFunction::call(void*, const ir::PTXKernel::Prototype&): Assertion false' failed.
When I try to get or set malloc heap size (inside host code):solver: ocelot/cuda/implementation/CudaRuntimeInterface.cpp:811: virtual cudaError_t cuda::CudaRuntimeInterface::cudaDeviceGetLimit(size_t*, cudaLimit): Assertion `0 && "unimplemented"' failed.
Maybe I have to point (somehow) to compiler that I want to use device malloc()?
Any advice?
You can find the answer in gpu ocelot mailing list:
gpuocelot mailing list link

CUDA kernel doesn't launch

My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu with a such a program (to run on a linux machine):
__global__ void myKernel()
{
cuPrintf("Hello, world from the device!\n");
}
int main()
{
cudaPrintfInit();
myKernel<<<1,10>>>();
cudaPrintfDisplay(stdout, true);
cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, it returns no error. Any suggestion on why the kernel doesn't launch ?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see
Associated Host-Side API). It is circular and if more output is produced during kernel
execution than can fit in the buffer, older output is overwritten. It is flushed only
when one of these actions is performed:
Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the
CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as
well),
Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(),
cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(),
or cuEventSynchronize(),
Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>
__global__ void myKernel()
{
printf("Hello, world from the device!\n");
}
int main()
{
myKernel<<<1,10>>>();
cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
In Visual Studio:
Right click on your project > Properies > Cuda C/C++ > Device
and add then following to Code Generation field
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
generating code for all these architecture makes your code a bit slower. So eliminate one by one to find which compute and sm gen code is required for your GPU.
But if you are shipping this to others better include all of these.