I copied this code from the Thrust documentation:
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
int main()
{
    thrust::device_vector<int> vec0(100);
    thrust::device_vector<int> vec1(100);
    thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
    return 0;
}
When I run this in Debug mode (VS2012), my program crashes and I get the error "Debug Error! ... R6010 - abort() has been called". When I run this in Release mode, it still crashes and I get the message ".exe has stopped working".
However, copying from host to device works correctly:
thrust::host_vector<int> vec0(100);
thrust::device_vector<int> vec1(100);
thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
I am using a GeForce GTX 970; the CUDA driver and runtime version is 7.5, and deviceQuery runs without any problem. The host runtime library is in Multi-threaded (/MT) mode. Does anybody have an idea what might cause this problem?
There are a few similar questions, e.g. here.
To quote a comment:
"Thrust is known to not compile and run correctly when built for
debugging"
And from the docs:
"nvcc does not support device debugging Thrust code. Thrust functions
compiled with (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely
crash."
I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half-precision converter and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both HOST and DEVICE code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>
int main() {
    const float a = 32.12314f;
    __half2 test = __float2half2_rn(a);
    __half test2 = __float2half(a);
    return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical Linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf, you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2.
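If upgrading is not immediately possible, one workaround is to compile the host-side conversions only when the toolkit is new enough. A minimal sketch using the CUDART_VERSION macro from cuda_runtime_api.h (9020 corresponds to CUDA 9.2):

#include <cuda_fp16.h>
#include <cuda_runtime_api.h>  // defines CUDART_VERSION

int main() {
    const float a = 32.12314f;
#if CUDART_VERSION >= 9020
    // __host__ __device__ from CUDA 9.2 onwards, so host calls are legal
    __half2 test = __float2half2_rn(a);
    __half test2 = __float2half(a);
    (void)test; (void)test2;
#else
    // On CUDA 9.1 these intrinsics are __device__-only;
    // the CPU backend needs its own float-to-half conversion here
    (void)a;
#endif
    return 0;
}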
Why does the following code crash at the end of main?
#include <thrust/device_vector.h>
thrust::device_vector<float4> v;
int main(){
    v.resize(1000);
    return 0;
}
The error is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unspecified driver error
If I use host_vector instead of device_vector, the code runs fine.
Do you think it's a Thrust bug, or am I doing something wrong here?
I tried it on ubuntu 10.10 with cuda 4.0 and on Windows 7 with cuda 6.5.
The Thrust version is 1.7 in both cases.
Thanks.
The problem is neither a bug in Thrust, nor are you doing something wrong. Rather, this is a limitation of the design of the CUDA runtime API.
The underlying reason for the crash is that the destructor for the thrust::device_vector is called when the variable falls out of scope, which happens after the CUDA runtime API context has been torn down. This produces a runtime error (probably cudaErrorCudartUnloading) because the process is attempting to call cudaFree after it has already disconnected from the CUDA driver.
I am unaware of a workaround other than not declaring Thrust device containers at translation-unit (global) scope.
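To illustrate, moving the container into main() (or any function scope) lets its destructor run while the CUDA context is still alive, so the same code no longer crashes. A minimal sketch:

#include <thrust/device_vector.h>

int main(){
    // declared at function scope: destroyed before the CUDA context is torn down
    thrust::device_vector<float4> v;
    v.resize(1000);
    return 0;
}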
I'm using Visual Studio 2010 and a GTX480 with compute capability 2.0.
I have tried setting sm to 2.0, but when I attempt to use printf() in a kernel, I get:
error : calling a host function("printf") from a __device__/__global__
function("test") is not allowed
This is my code:
#include "util\cuPrintf.cu"
#include <cuda.h>
#include <iostream>
#include <stdio.h>
#include <conio.h>
#include <cuda_runtime.h>
__global__ void test(void)
{
    printf("Hello, world from the device!\n");
}

int main(void)
{
    test<<<1,1>>>();
    getch();
}
I found an example here: "CUDA_C_Programming_Guide", page 106, "B.16.4 Examples".
At last, it works for me :D Thank you.
#include "stdio.h"
#include <conio.h>
// printf() is only supported
// for devices of compute capability 2.0 and higher
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
#define printf(f, ...) ((void)(f, __VA_ARGS__),0)
#endif
__global__ void helloCUDA(float f)
{
printf("Hello thread %d, f=%f\n", threadIdx.x, f);
}
int main()
{
helloCUDA<<<1, 5>>>(1.2345f);
cudaDeviceSynchronize();
getch();
return 0;
}
To use printf in kernel code, you have to do three things (a combined example follows the list):

1. Make sure that cstdio or stdio.h is included in the kernel compilation unit. CUDA implements kernel printf by overloading, so you must include one of those headers.
2. Compile your code for compute capability 2.x or 3.x and run it on a supported GPU (so pass something like -arch=sm_20 to nvcc, or the IDE equivalent in Visual Studio or Nsight Eclipse Edition).
3. Ensure that the kernel has finished running by including an explicit or implicit synchronization point in your host code (cudaDeviceSynchronize, for example).
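Putting the three together, a minimal sketch (the file name hello.cu is an assumption; the sm_20 target matches the suggestion above):

// compile with: nvcc -arch=sm_20 hello.cu -o hello
#include <cstdio>  // required for kernel printf

__global__ void hello(void)
{
    printf("Hello from thread %d\n", threadIdx.x);
}

int main(void)
{
    hello<<<1, 4>>>();
    cudaDeviceSynchronize();  // flushes the device-side printf buffer
    return 0;
}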
You are probably compiling for an architecture that does not support printf(). By default the project is compiled for compute architecture 1.0. To change this, in VS open the project properties -> CUDA C/C++ -> Device and change the "Code Generation" property to "compute_20,sm_20".
You do not need #include "util\cuPrintf.cu". Please see this for details on how to use printf and how to flush the output so you actually see the result.
If you're getting that error, it probably means that your GPU does not have Compute capability 2.x or higher. This thread goes into more detail on what your options are for printing inside a kernel function.
My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu

with a program like this (to run on a Linux machine):
__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestions on why the kernel doesn't launch?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this:
The output buffer for printf() is set to a fixed size before kernel launch (see Associated Host-Side API). It is circular and if more output is produced during kernel execution than can fit in the buffer, older output is overwritten. It is flushed only when one of these actions is performed:

- Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as well),
- Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(), cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(), or cuEventSynchronize(),
- Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
- Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
- Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>

__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}

int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which is supported on all CUDA devices. If it still doesn't run, do a clean build and make sure there are no object files left anywhere, then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10. sm_20 is the real architecture and I believe should be used with the code= switch, not arch=.
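For reference, spelling both out explicitly with nvcc's -gencode switch looks like this (a sketch for a Fermi-class sm_20 device; the file name test.cu is an assumption):

nvcc -gencode arch=compute_20,code=sm_20 test.cu -o test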
In Visual Studio:

Right click on your project > Properties > CUDA C/C++ > Device

and add the following to the Code Generation field:

compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;

Generating code for all these architectures makes compilation slower and your binary larger, so eliminate them one by one to find which compute and sm values your GPU actually requires. But if you are shipping this to others, it is better to include all of them.
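To find out which values your own GPU needs, you can query its compute capability at runtime; a minimal sketch using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    // query the properties of device 0
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    // e.g. major=6, minor=1 means compute_61,sm_61
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    return 0;
}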
The following CUDA Thrust program crashes:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
int main(void)
{
    thrust::device_vector<int> vec;
    for (int i(0); i < 1000; ++i) {
        vec.push_back(i);
    }
    thrust::min_element(vec.begin(), vec.end());
}
The exception I get is:
Unhandled exception at 0x7650b9bc in test_thrust.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0017f178.

The crash occurs in checked_cudaMemcpy() in trivial_copy.inl.
If I add #include <thrust/sort.h> and replace min_element with sort, it does not crash.
I'm using CUDA 4.1 on Windows 7 64-bit, compute_20,sm_20 (Fermi), Debug build. In a Release build, I am not getting the crash and min_element finds the correct element.
Am I doing something wrong, or is there a bug in Thrust?
I can reproduce the error using debug mode targeting Compute Capability 2.0 (i.e., nvcc -G0 -arch=sm_20). The bug does not reproduce in release mode or when targeting Compute Capability 1.x devices, which generally suggests a code-generation problem rather than a bug in the library. Wherever the fault lies, I'd encourage you to submit a bug report so this issue gets the attention it deserves. In the meantime, I'd suggest compiling in release mode, which is more rigorously tested.
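For completeness, a release-mode build of the snippet above would look something like this (a sketch; the file name test_thrust.cu is taken from the exception message above):

nvcc -O2 -arch=sm_20 test_thrust.cu -o test_thrust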