I've just upgraded from CUDA 5.0 to 5.5, and all my VS2012 CUDA projects have stopped compiling due to a problem with assert(). To repro the problem, I created a new CUDA 5.5 project in VS 2012, added the code straight from the Programming Guide, and got the same error.
__global__ void testAssert(void)
{
    int is_one = 1;
    int should_be_one = 0;
    // This will have no effect
    assert(is_one);
    // This will halt kernel execution
    assert(should_be_one);
}
This produces the following compiler error:
kernel.cu(22): error : calling a __host__ function("_wassert") from a __global__ function("testAssert") is not allowed
Is there something obvious that I'm missing?
Make sure you are including assert.h, and make sure you are targeting sm_20 or later (device-side assert requires compute capability 2.0+). Also check that you're not including any Windows headers; if you are, try without them.
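For reference, here is a minimal sketch of a file that should compile and trip the device-side assert, assuming an sm_20 (or later) build target:

#include <assert.h>   // needed for the device-side assert

__global__ void testAssert(void)
{
    assert(1);  // no effect
    assert(0);  // halts kernel execution
}

int main()
{
    testAssert<<<1, 1>>>();
    cudaDeviceSynchronize();  // the device assert surfaces here
    return 0;
}

// Build with, e.g.: nvcc -arch=sm_20 kernel.cu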
I am trying to do some FP16 work that will have both CPU and GPU backends. I researched my options and decided to use CUDA's half-precision converters and data types. The ones I intend to use are specified as both __device__ and __host__, which according to my understanding (and the official documentation) should mean that the functions are callable from both host and device code. I wrote a simple test program:
#include <iostream>
#include <cuda_fp16.h>

int main() {
    const float a = 32.12314f;
    __half2 test = __float2half2_rn(a);
    __half test2 = __float2half(a);
    return 0;
}
However when I try to compile it I get:
nvcc cuda_half2.cu
cuda_half2.cu(6): error: calling a __device__ function("__float2half2_rn") from a __host__ function("main") is not allowed
cuda_half2.cu(7): error: calling a __device__ function("__float2half") from a __host__ function("main") is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_000013b8_00000000-4_cuda_half2.cpp4.ii".
The only thing that comes to mind is that my CUDA is 9.1 and I'm reading the documentation for 9.2, but I can't find an older version of it, nor can I find anything in the changelog. Ideas?
Switch to CUDA 9.2
Your code compiles without error on CUDA 9.2 but throws the errors you indicate on CUDA 9.1. If you have CUDA 9.1 installed, then the documentation for it is already installed on your machine. On a typical Linux install, it will be located in /usr/local/cuda-9.1/doc. If you look at /usr/local/cuda-9.1/doc/pdf/CUDA_Math_API.pdf you will see that the corresponding functions are only marked __device__, so this change was indeed made between CUDA 9.1 and CUDA 9.2.
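If upgrading is not an option, a possible workaround (just a sketch, assuming the converted values are ultimately needed around device work anyway) is to perform the conversion inside a trivial kernel, where the __device__ functions are callable on CUDA 9.1:

#include <cuda_fp16.h>

// Hypothetical helper kernel: converts a single float on the device.
__global__ void convertOnDevice(const float *in, __half *out)
{
    *out = __float2half(*in);
}

int main()
{
    const float a = 32.12314f;
    float *d_in;
    __half *d_out;
    cudaMalloc(&d_in, sizeof(float));
    cudaMalloc(&d_out, sizeof(__half));
    cudaMemcpy(d_in, &a, sizeof(float), cudaMemcpyHostToDevice);
    convertOnDevice<<<1, 1>>>(d_in, d_out);

    __half h;
    cudaMemcpy(&h, d_out, sizeof(__half), cudaMemcpyDeviceToHost); // implicit sync
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}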
I am using the ManagedCuda library in a C# project to utilise the GPU. I am currently following along with this tutorial regarding how to write code compatible between C# and C++, after failing to achieve it with OpenCV.
Everything seems to be working fine with my code: the kernel is found and built, and the method call executes. However, I am getting an error:
An unhandled exception of type 'ManagedCuda.CudaException' occurred in ManagedCuda.dll
Additional information: ErrorIllegalAddress: While executing a kernel, the device
encountered a load or store instruction on an invalid memory address.
The context cannot be used, so it must be destroyed (and a new one should be created).
I understand that C# is complaining that a valid address is not found when it attempts to pass the device pointer to the kernel. The only difference I can tell between my code and the post in the cited tutorial is that ManagedCuda seems to have recently had a facelift which allows users to use lambdas. I've done some reading and haven't found anything to clarify whether this is what's causing my problem:
static Func<int, int, int> cudaAdd = (a, b) =>
{
    // init output parameters
    CudaDeviceVariable<int> result_dev = 0;
    int result_host = 0;
    // run CUDA method
    addWithCuda.Run(a, b, result_dev.DevicePointer); // <-- code throws the error here
    // copy return to host
    result_dev.CopyToHost(ref result_host);
    return result_host;
};
In the original tutorial code the OP uses CudaDeviceVariable result_dev = 0;. Could this be the problem? I don't see why it would be, but maybe my cast is wrong?
For clarity here is the kernel which is being called:
__global__ void kernel(int a, int b, int *c)
{
    *c = (a + b) * (a + b);
}
TL;DR: The exception is related to the 32/64-bit settings in your C# project. Either set the platform target to x86 or, if you have it on Any CPU, make sure to tick Prefer 32-bit.
How I found out:
Made a solution in .NET 4.5 according to https://algoslaves.wordpress.com/2013/08/25/nvidia-cuda-hello-world-in-managed-c-and-f-with-use-of-managedcuda/ (same as OP)
used NuGet to add ManagedCuda 6.5 - Standalone
Worked fine
Backup point A.
Changed .NET version to 4.0, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Additional information: ErrorIllegalAddress: While executing a kernel, the device encountered a load or store instruction on an invalid memory address.
Switched .NET version to 4.5, used NuGet to remove and add ManagedCuda 6.5 - Standalone:
Exception thrown: Same as above.
Clean solution, Rebuild solution, Build solution:
Exception thrown: Same as above.
Manually deleted every single file/folder generated by Visual Studio:
Exception thrown: Same as above.
Reopen project backed up from point A:
Worked fine.
Manually deleted every single file/folder generated by Visual Studio in both the working and the not-working project.
Visually compared all files.
Every file was the same, except that the not-working version had an extra <Prefer32Bit>false</Prefer32Bit> in MangedCudaTest.csproj.
Deleted the lines with <Prefer32Bit>false</Prefer32Bit>.
The not-working version finally worked fine.
Made the same changes in my main project; it finally worked fine.
From the comments it seems that the thrown exception is due to running on different threads. In a single-threaded environment the sample code runs fine and returns the right result. In order to use CUDA in a multithreaded application, proper synchronization of the threads and binding the CUDA context to the currently active thread are necessary.
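To illustrate the context-binding part, here is a rough sketch in C against the CUDA driver API (which ManagedCuda wraps); the function name is my own:

#include <cuda.h>

// Hypothetical worker: every thread that issues CUDA work must first
// bind the shared context to itself.
void workerThread(CUcontext ctx)
{
    cuCtxSetCurrent(ctx);  // bind the context to the calling thread
    // ... it is now safe to launch kernels and copy memory from this thread ...
}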
I had the same issue, and the solution for me was to set the .NET project to 64-bit, NOT 32-bit as suggested in aeroson's answer. I think this may be because I'm using CUDA SDK 7.5, and I read somewhere about 32-bit being phased out.
Why does the following code crash at the end of the main?
#include <thrust/device_vector.h>
thrust::device_vector<float4> v;
int main() {
    v.resize(1000);
    return 0;
}
The error is:
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unspecified driver error
If I use host_vector instead of device_vector, the code runs fine.
Do you think it's a Thrust bug, or am I doing something wrong here?
I tried it on Ubuntu 10.10 with CUDA 4.0 and on Windows 7 with CUDA 6.5.
The Thrust version is 1.7 in both cases.
Thanks.
The problem is neither a bug in Thrust, nor are you doing something wrong. Rather, this is a limitation of the design of the CUDA runtime API.
The underlying reason for the crash is that the destructor for the thrust::device_vector is called when the variable falls out of scope, which happens after the CUDA runtime API context has been torn down. This produces a runtime error (probably cudaErrorCudartUnloading) because the process is attempting to call cudaFree after it has already disconnected from the CUDA driver.
I am unaware of a workaround other than not declaring Thrust device containers at file scope in the translation unit containing main().
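For completeness, a minimal sketch of that workaround, moving the container into main() so its destructor runs while the runtime context is still alive:

#include <thrust/device_vector.h>

int main()
{
    // Declared here rather than at file scope, so the underlying
    // cudaFree happens before the CUDA context is torn down.
    thrust::device_vector<float4> v;
    v.resize(1000);
    return 0;
}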
The following CUDA Thrust program crashes:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
int main(void)
{
    thrust::device_vector<int> vec;
    for (int i(0); i < 1000; ++i) {
        vec.push_back(i);
    }
    thrust::min_element(vec.begin(), vec.end());
}
The exception I get is:
Unhandled exception at 0x7650b9bc in test_thrust.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0017f178.
It is thrown in `checked_cudaMemcpy()` in `trivial_copy.inl`.
If I add #include <thrust/sort.h> and replace min_element with sort, it does not crash.
I'm using CUDA 4.1 on Windows 7 64-bit, compute_20,sm_20 (Fermi), Debug build. In a Release build, I am not getting the crash and min_element finds the correct element.
Am I doing something wrong, or is there a bug in Thrust?
I can reproduce the error using debug mode targeting Compute Capability 2.0 (i.e. nvcc -G0 -arch=sm_20). The bug does not reproduce in release mode or when targeting Compute Capability 1.x devices, which generally suggests a code-generation problem rather than a bug in the library. Wherever the fault lies, I'd encourage you to submit a bug report so this issue gets the attention it deserves. In the meantime, I'd suggest compiling in release mode, which is more rigorously tested.
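As an aside, wrapping the call in a try/catch for thrust::system_error at least turns the unhandled exception into a readable diagnostic while you test both build modes; this is only a sketch and does not fix the underlying code-generation issue:

#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <thrust/system_error.h>
#include <cstdio>

int main(void)
{
    try {
        thrust::device_vector<int> vec(1000);  // 1000 zero-initialized ints
        thrust::min_element(vec.begin(), vec.end());
    } catch (thrust::system_error &e) {
        std::fprintf(stderr, "Thrust error: %s\n", e.what());
        return 1;
    }
    return 0;
}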
I'm using a GeForce 9800 GX2. I installed the drivers and the CUDA SDK, and I wrote a simple program which looks like this:
#include <stdio.h>
#include <stdlib.h>
#include "cuPrintf.cu"  // cuPrintf utilities from the CUDA SDK samples

__global__ void myKernel(int *d_a)
{
    int tx = threadIdx.x;
    d_a[tx] += 1;
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    int *a = (int *)malloc(sizeof(int) * 10);
    int *d_a;
    int i;

    for (i = 0; i < 10; i++)
        a[i] = i;

    cudaPrintfInit();
    cudaMalloc((void **)&d_a, 10 * sizeof(int));
    cudaMemcpy(d_a, a, 10 * sizeof(int), cudaMemcpyHostToDevice);
    myKernel<<<1, 10>>>(d_a);
    cudaPrintfDisplay(stdout, true);
    cudaMemcpy(a, d_a, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaPrintfEnd();
    cudaFree(d_a);
    free(a);
    return 0;
}
The code compiles properly, but the kernel appears not to launch: no message is printed from the device side. What should I do to resolve this?
Given that in your comments you say you are getting "No CUDA-capable device", either you do not have a CUDA-capable GPU or you do not have the correct driver installed. Since you say you have both, I suggest you try reinstalling your driver to check.
Some other notes:
Are you trying to do this through Remote Desktop? That won't work: with RDP, Microsoft uses a dummy display device in order to forward the display remotely. Tesla GPUs support TCC, which allows RDP to work by making the GPU behave as a non-display device, but with display GPUs like GeForce this is not possible. Either run at the console, or log in at the console and use VNC.
Also try running the deviceQuery SDK code sample to check that it detects your GPU and the driver/runtime versions correctly.
You should check all CUDA API calls for errors (see the sketch after these notes).
Call cudaDeviceSynchronize() before cudaPrintfDisplay().
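On the error-checking note, a minimal sketch of such a check, assuming the standard runtime API (the macro name is my own):

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: wrap each CUDA runtime call and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void **)&d_a, 10 * sizeof(int)));
//   myKernel<<<1, 10>>>(d_a);
//   CUDA_CHECK(cudaGetLastError());       // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches execution errors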