cudaMemPrefetchAsync bug on GTX 1080 (Pascal)? - cuda

On my machine, the call to cudaMemPrefetchAsync in the code below returns 10 (cudaErrorInvalidDevice) rather than 0. The setup is an Alienware 17 laptop running Windows 10 with an NVIDIA GTX 1080 GPU and onboard Intel HD Graphics 530, using NVIDIA driver 376.19 (mobile driver).
I've compiled for compute_61, sm_61. Another user ran the same code on another Pascal card (a Titan X) and it returned 0 correctly. I've also tested this in both Debug and Release mode with the same result. Any ideas?
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
int main()
{
int* data;
size_t len = 10;
cudaError_t err = cudaSetDevice(0);
err = cudaMallocManaged(reinterpret_cast<void **>(&data), len, cudaMemAttachGlobal);
err = cudaMemPrefetchAsync(data, len, 0, 0);
}

There is a known bug confirmed by an NVIDIA employee for the Windows environment (see the bottom of the linked post).
On the other hand, there are reports that code like yours works fine under Linux, or with Maxwell cards.
I have the very same issue as you, but so far no solution, even using CUDA 9.0 RC. My advice is to use the regular (non-managed) memory approach for now, since the issue has been reported for more than a year with no fix.
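As a diagnostic, you can query whether the device supports concurrent managed access before prefetching; on Windows with a GeForce card this attribute is typically 0, which is consistent with the invalid-device error above. A minimal sketch using only the standard runtime API:
#include "cuda_runtime.h"
#include <cstdio>

int main()
{
    int device = 0;
    cudaSetDevice(device);

    // cudaMemPrefetchAsync is only supported on devices where
    // concurrentManagedAccess is non-zero.
    int concurrent = 0;
    cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, device);

    int* data = nullptr;
    size_t len = 10;
    cudaMallocManaged(reinterpret_cast<void**>(&data), len, cudaMemAttachGlobal);

    if (concurrent) {
        cudaError_t err = cudaMemPrefetchAsync(data, len, device, 0);
        std::printf("prefetch returned %d\n", static_cast<int>(err));
    } else {
        std::printf("concurrentManagedAccess == 0; skipping prefetch\n");
    }

    cudaFree(data);
    return 0;
}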

Related

cuda copy device data to host (again) [duplicate]

I recently met a problem when copying dynamically allocated data on the device to host memory. The data is allocated with new inside a kernel, and I copy it from device to host in host code. Here is the code:
#include <cuda.h>
#include <stdio.h>
#define N 100

__device__ int* d_array;

__global__ void allocDeviceMemory()
{
    // Allocates on the device runtime heap.
    d_array = new int[N];
    for (int i = 0; i < N; i++)
        d_array[i] = 123;
}

int main()
{
    allocDeviceMemory<<<1, 1>>>();
    cudaDeviceSynchronize();

    // Fetch the pointer value stored in the __device__ symbol
    // (modern CUDA takes the symbol itself, not a string name).
    int* d_a = NULL;
    cudaMemcpyFromSymbol(&d_a, d_array, sizeof(d_a), 0, cudaMemcpyDeviceToHost);
    printf("gpu address: %p\n", d_a);

    // This copy fails: d_a points into the device runtime heap,
    // which the host API cannot access (see the answer below).
    int* h_array = (int*)malloc(N * sizeof(int));
    cudaError_t err = cudaMemcpy(h_array, d_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_array: %d, %d\n", h_array[0], err);
    getchar();
    return 0;
}
Another poster had the same issue with CUDA 4.1, and some experts suggested that upgrading the CUDA driver and runtime to a newer version might solve it:
CUDA - Copy device data to host?
I have CUDA toolkit 4.2, the latest developer drivers, and a C2075, but the problem above still occurs. Please let me know how to solve it.
Unfortunately there is no way to do what you are trying to do in CUDA 4. The host API cannot copy from dynamically allocated addresses on the device runtime heap; only device code can access them. If you want to retrieve the data with the host API, you will first need to write it into an "output" buffer allocated with the host API, after which you are free to use cudaMemcpy to copy it to the host.
You can see confirmation of this limitation from Mark Harris of NVIDIA here.
Since this answer was posted in 2012, the restriction on host API interoperability appears to have been set in stone, and is explicitly documented in the CUDA programming guide.
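A sketch of the workaround described above, reusing N and d_array from the question (the copyToOutput kernel name is my own; device-side new requires compiling for sm_20 or newer):
#include <cuda.h>
#include <stdio.h>
#define N 100

__device__ int* d_array;

__global__ void allocDeviceMemory()
{
    d_array = new int[N];
    for (int i = 0; i < N; i++)
        d_array[i] = 123;
}

// Copy from the device-heap allocation into a buffer the host API can touch.
__global__ void copyToOutput(int* out)
{
    for (int i = 0; i < N; i++)
        out[i] = d_array[i];
}

int main()
{
    allocDeviceMemory<<<1, 1>>>();

    int* d_out = NULL;
    cudaMalloc((void**)&d_out, N * sizeof(int));  // host-API allocation
    copyToOutput<<<1, 1>>>(d_out);

    int h_array[N];
    cudaError_t err = cudaMemcpy(h_array, d_out, N * sizeof(int), cudaMemcpyDeviceToHost);
    printf("h_array[0] = %d, err = %d\n", h_array[0], err);

    cudaFree(d_out);
    return 0;
}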

Visual Studio 2017 and Cuda 9 RC still do not work together

Despite the announced support for Visual Studio 2017, I still get this error message:
nvcc fatal : Host compiler targets unsupported OS.
when I try to compile a simple test program like this
#include <stdio.h>

__global__ void kernel() {
    printf("hello world from GPU\n");
}

int main() {
    printf("hello world from CPU\n");
    kernel<<<1, 10>>>();
    cudaDeviceSynchronize();  // note: the API call is cudaDeviceSynchronize, not cudaDeviceSynchronized
    return 0;
}
even after updating to CUDA 9 RC.
Thanks for your help!
Apologies for the difficulties with VS 2017 and CUDA 9 RC.
Microsoft released VS 2017 Update 3 (15.3) on 8/14/2017, right after CUDA 9 RC was published. Unfortunately, this update results in an incompatibility with CUDA 9 RC. NVIDIA expects that the (future) CUDA 9 GA release will address this particular incompatibility. In the meantime, if you switch to using VS 2017 RTM (the very first release of VS 2017) with no updates, it should work with CUDA 9 RC. I'm not suggesting that switching is easy (in fact it may be impossible unless you've previously archived an offline installer), and I'm not providing exact steps to get the original VS 2017 RTM here.
In other respects, the supported environments are spelled out in the Windows installation guide that ships with CUDA 9 RC, which is also linked from the CUDA 9 RC download page on developer.nvidia.com. Based on this, the other options are to switch to VS 2015 (still available), or to use a VS 2015 toolchain within VS 2017, e.g. by pointing nvcc at the VS 2015 host compiler as sketched below.
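A hypothetical command line for that last option; the path is an assumption for a default VS 2015 install (use the amd64 subdirectory for 64-bit builds):
nvcc -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin" -o hello.exe hello.cu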
I managed to compile similar code with VS 2017 and CUDA 9.0. It looks like you forgot to include cuda_runtime.h in your file.
#include "cuda_runtime.h"
#include <stdio.h>
__global__ void kernel() {
printf("hello world from GPU \r\n");
}
int main() {
printf("hello world from CPU \r\n");
kernel <<< 1, 10 >>>();
return 0;
}

thrust::copy doesn't work for device_vectors [duplicate]

This question already has an answer here:
cuda thrust::remove_if throws "thrust::system::system_error" for device_vector?
(1 answer)
Closed 7 years ago.
I copied this code from the Thrust documentation:
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
int main()
{
    thrust::device_vector<int> vec0(100);
    thrust::device_vector<int> vec1(100);
    thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
    return 0;
}
When I run this in Debug mode (VS2012), my program crashes and I get the error Debug Error! ... R6010 - abort() has been called. When I run it in Release mode, it still crashes, and I get the message ".exe has stopped working".
However copying from host-to-device works correctly:
thrust::host_vector<int> vec0(100);
thrust::device_vector<int> vec1(100);
thrust::copy(vec0.begin(), vec0.end(), vec1.begin());
I use GeForce GTX 970, CUDA driver version/runtime version is 7.5, deviceQuery runs without any problem. Host runtime library is in Multi-threaded (/MT) mode. Does anybody have an idea what might cause this problem?
There are a few similar questions, e.g. here.
To quote a comment:
"Thrust is known to not compile and run correctly when built for debugging"
And from the docs:
"nvcc does not support device debugging Thrust code. Thrust functions compiled with device debugging enabled (e.g., nvcc -G, nvcc --device-debug 0, etc.) will likely crash."

Crash with thrust::min_element on thrust::device_vector (CUDA Thrust)

The following CUDA Thrust program crashes:
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
int main(void)
{
    thrust::device_vector<int> vec;
    for (int i(0); i < 1000; ++i) {
        vec.push_back(i);
    }
    thrust::min_element(vec.begin(), vec.end());
}
The exception I get is:
Unhandled exception at 0x7650b9bc in test_thrust.exe: Microsoft C++ exception: thrust::system::system_error at memory location 0x0017f178.
It is thrown in `checked_cudaMemcpy()` in `trivial_copy.inl`.
If I add #include <thrust/sort.h> and replace min_element with sort, it does not crash.
I'm using CUDA 4.1 on Windows 7 64-bit, compute_20, sm_20 (Fermi), Debug build. In a Release build I am not getting the crash, and min_element finds the correct element.
Am I doing something wrong, or is there a bug in Thrust?
I can reproduce the error using debug mode targeting Compute Capability 2.0 (i.e., nvcc -G -arch=sm_20). The bug does not reproduce in release mode or when targeting Compute Capability 1.x devices, which generally suggests a code-generation problem rather than a bug in the library. Wherever the fault lies, I'd encourage you to submit a bug report so this issue gets the attention it deserves. In the meantime, I'd suggest compiling in release mode, which is more rigorously tested.

CUDA kernel not launching

I'm using a GeForce 9800 GX2. I installed the drivers and the CUDA SDK, and wrote a simple program which looks like this:
#include <stdio.h>
#include <stdlib.h>
#include "cuPrintf.cu"  // cuPrintf ships with the (old) CUDA SDK samples

__global__ void myKernel(int *d_a)
{
    int tx = threadIdx.x;
    d_a[tx] += 1;
    cuPrintf("Hello, world from the device!\n");
}

int main()
{
    int *a = (int*)malloc(sizeof(int) * 10);
    int *d_a;
    int i;
    for (i = 0; i < 10; i++)
        a[i] = i;

    cudaPrintfInit();
    cudaMalloc((void**)&d_a, 10 * sizeof(int));
    cudaMemcpy(d_a, a, 10 * sizeof(int), cudaMemcpyHostToDevice);
    myKernel<<<1, 10>>>(d_a);
    cudaPrintfDisplay(stdout, true);
    cudaMemcpy(a, d_a, 10 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaPrintfEnd();
    cudaFree(d_a);
    free(a);
    return 0;
}
The code compiles properly, but the kernel appears not to be launching: no message is printed from the device side. What should I do to resolve this?
Given that in your comments you say you are getting "no CUDA-capable device", either you do not have a CUDA-capable GPU or you do not have the correct driver installed. Since you say you have both, I suggest reinstalling your driver to check.
Some other notes:
Are you trying to do this through Remote Desktop? That won't work: with RDP, Microsoft uses a dummy display device in order to forward the display remotely. Tesla GPUs support TCC, which allows RDP to work by making the GPU behave as a non-display device, but with display GPUs like GeForce this is not possible. Either run at the console, or log in at the console and use VNC.
Also try running the deviceQuery SDK code sample to check that it detects your GPU and driver/runtime version correctly.
You should check all CUDA API calls for errors; a minimal sketch of what that can look like follows below.
Call cudaDeviceSynchronize() before cudaPrintfDisplay().
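A minimal error-checking sketch (the CHECK macro is an illustration, not part of the CUDA API):
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a call fails.
#define CHECK(call) do { \
    cudaError_t e = (call); \
    if (e != cudaSuccess) { \
        fprintf(stderr, "CUDA error '%s' at %s:%d\n", \
                cudaGetErrorString(e), __FILE__, __LINE__); \
        exit(1); \
    } \
} while (0)

int main()
{
    // Fails with "no CUDA-capable device" if the GPU or driver is missing.
    int count = 0;
    CHECK(cudaGetDeviceCount(&count));
    printf("found %d CUDA device(s)\n", count);

    // After a kernel launch, check both the launch and the execution:
    //   myKernel<<<1, 10>>>(d_a);
    //   CHECK(cudaGetLastError());        // launch/configuration errors
    //   CHECK(cudaDeviceSynchronize());   // errors raised during execution
    return 0;
}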