I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half-precision conversion functions and data types. The ones I intend to use are declared as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both host and device code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>
int main() {
    const float a = 32.12314f;
    __half2 test = __float2half2_rn(a);
    __half test2 = __float2half(a);
    return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA version is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2, but produces the errors you show on CUDA 9.1. If you have CUDA 9.1 installed, then its documentation is already installed on your machine; on a typical Linux install, it is located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf, you will see that the corresponding functions are marked __device__ only, so this change was indeed made between CUDA 9.1 and CUDA 9.2.
This question already has an answer here:
cuda thrust::remove_if throws "thrust::system::system_error" for device_vector?
(1 answer)
Closed 7 years ago.
I copied this code from the Thrust documentation:
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
int main()
{
    thrust::device_vector<int> vec0(100);
    thrust::device_vector<int> vec1(100);
    thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
    return 0;
}
When I run this in Debug mode (VS2012), my program crashes and I get the error Debug Error! ... R6010 - abort() has been called. When I run this in Release mode, it still crashes and I get the message .exe has stopped working.
However, copying from host to device works correctly:
thrust::host_vector<int> vec0(100);
thrust::device_vector<int> vec1(100);
thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
I use GeForce GTX 970, CUDA driver version/runtime version is 7.5, deviceQuery runs without any problem. Host runtime library is in Multi-threaded (/MT) mode. Does anybody have an idea what might cause this problem?
There are a few similar questions e.g. here
To quote a comment :
"Thrust is known to not compile and run correctly when built for
debugging"
And from the docs:
"nvcc does not support device debugging Thrust code. Thrust functions
compiled with device debugging enabled (e.g., nvcc -G, nvcc --device-debug 0,
etc.) will likely crash."
Why does the following code crash at the end of the main?
#include <thrust/device_vector.h>
thrust::device_vector<float4> v;
int main(){
    v.resize(1000);
    return 0;
}
The error is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unspecified driver error
If I use host_vector instead of device_vector, the code runs fine.
Do you think it's a Thrust bug, or am I doing something wrong here?
I tried it on ubuntu 10.10 with cuda 4.0 and on Windows 7 with cuda 6.5.
The Thrust version is 1.7 in both cases.
Thanks.
The problem is neither a bug in Thrust, nor are you doing something wrong. Rather, this is a limitation of the design of the CUDA runtime API.
The underlying reason for the crash is that the destructor for the thrust::vector is being called when the variable falls out of scope, which is happening after the CUDA runtime API context has been torn down. This will produce a runtime error (probably cudaErrorCudartUnloading) because the process is attempting to call cudaFree after it has already disconnected from the CUDA driver.
I am unaware of a workaround other than not using Thrust device containers declared at main() translation unit scope.
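A simple way to apply that advice is to give the container a scope that ends while the context is still alive, e.g. declare it inside main(). A minimal sketch of the same program, restructured:

```
#include <thrust/device_vector.h>

int main() {
    // Declared inside main(), so the destructor (which calls cudaFree)
    // runs before the CUDA runtime context is torn down at process exit.
    thrust::device_vector<float4> v;
    v.resize(1000);
    return 0;
}
```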
I've just upgraded from CUDA 5.0 to 5.5 and all my VS2012 CUDA projects have stopped compiling due to a problem with assert(). To reproduce the problem, I created a new CUDA 5.5 project in VS2012, added the code straight from the Programming Guide, and got the same error.
__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;
    // This will have no effect
    assert(is_one);
    // This will halt kernel execution
    assert(should_be_one);
}
This produces the following compiler error:
kernel.cu(22): error : calling a __host__ function("_wassert") from a __global__ function("testAssert") is not allowed
Is there something obvious that I'm missing?
Make sure you are including assert.h, and make sure you are targeting sm_20 or later. Also check that you're not including any Windows headers; if you are, try without them.
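For reference, a minimal sketch of how a device-side assert should build and run once those conditions are met (the compile command is an assumption; the cudaDeviceSynchronize() call is what lets the assertion message reach the host before the process exits):

```
// compile with: nvcc -arch=sm_20 kernel.cu
#include <assert.h>
#include <cstdio>

__global__ void testAssert(void)
{
    int should_be_one = 0;
    assert(should_be_one);  // fails, halting the kernel and printing file/line info
}

int main()
{
    testAssert<<<1, 1>>>();
    cudaDeviceSynchronize();  // without this, the process may exit before the message appears
    return 0;
}
```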
My problem is very much like this one. I run the simplest CUDA program but the kernel doesn't launch. However, I am sure that my CUDA installation is ok, since I can run complicated CUDA projects consisting of several files (which I took from someone else) with no problems. In these projects, compilation and linking is done through makefiles with a lot of flags. I think the problem is in the correct flags to use while compiling. I simply use a command like this:
nvcc -arch=sm_20 -lcudart test.cu with a program such as this (to run on a Linux machine):
__global__ void myKernel()
{
    cuPrintf("Hello, world from the device!\n");
}
int main()
{
    cudaPrintfInit();
    myKernel<<<1,10>>>();
    cudaPrintfDisplay(stdout, true);
    cudaPrintfEnd();
}
The program compiles correctly. When I add cudaMemcpy() operations, they return no error. Any suggestions on why the kernel doesn't launch?
The reason it is not printing when using printf is that kernel launches are asynchronous and your program is exiting before the printf buffer gets flushed. Section B.16 of the CUDA (5.0) C Programming Guide explains this.
The output buffer for printf() is set to a fixed size before kernel launch (see
Associated Host-Side API). It is circular and if more output is produced during kernel
execution than can fit in the buffer, older output is overwritten. It is flushed only
when one of these actions is performed:
Kernel launch via <<<>>> or cuLaunchKernel() (at the start of the launch, and if the
CUDA_LAUNCH_BLOCKING environment variable is set to 1, at the end of the launch as
well),
Synchronization via cudaDeviceSynchronize(), cuCtxSynchronize(),
cudaStreamSynchronize(), cuStreamSynchronize(), cudaEventSynchronize(),
or cuEventSynchronize(),
Memory copies via any blocking version of cudaMemcpy*() or cuMemcpy*(),
Module loading/unloading via cuModuleLoad() or cuModuleUnload(),
Context destruction via cudaDeviceReset() or cuCtxDestroy().
For this reason, this program prints nothing:
#include <stdio.h>
__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}
int main()
{
    myKernel<<<1,10>>>();
}
But this program prints "Hello, world from the device!\n" ten times.
#include <stdio.h>
__global__ void myKernel()
{
    printf("Hello, world from the device!\n");
}
int main()
{
    myKernel<<<1,10>>>();
    cudaDeviceSynchronize();
}
Are you sure that your CUDA device supports the SM_20 architecture?
Remove the arch= option from your nvcc command line and rebuild everything. This compiles for the 1.0 CUDA architecture, which will be supported on all CUDA devices. If it still doesn't run, do a build clean and make sure there are no object files left anywhere. Then rebuild and run.
Also, arch= refers to the virtual architecture, which should be something like compute_10; sm_20 is a real architecture and is normally paired with the code= switch. (As a convenience, nvcc also accepts -arch=sm_20 as shorthand for arch=compute_20,code=sm_20.)
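To make the virtual/real pairing explicit, the long-form -gencode syntax can be used. A sketch (file names are hypothetical):

```
# Fat binary: SASS for sm_20 hardware plus PTX for forward compatibility
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_20,code=compute_20 \
     test.cu -o test

# Shorthand form, roughly equivalent to the first -gencode pair above
nvcc -arch=sm_20 test.cu -o test
```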
In Visual Studio:
Right click on your project > Properties > CUDA C/C++ > Device
and then add the following to the Code Generation field:
compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;compute_60,sm_60;compute_61,sm_61;compute_70,sm_70;compute_75,sm_75;
Generating code for all of these architectures makes compilation slower and the binary larger, so eliminate them one by one to find which compute and sm values your GPU requires. But if you are shipping this to others, it is better to include all of them.