Why is this not copying from device to host in Cuda? - cuda

I'm working through the examples of the "CUDA by Example" book. The following code doesn't give me an answer and work as it should. Where's the mistake?
Will appreciate your help and answers.
I get an output,which reads
Calculation done on GPU yields the answer: &d
Press enter to stop
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <iostream>
#include <stdio.h>
using namespace std;
__global__ void add_integers_cuda(int a, int b, int *c)
{
*c = a + b;
}
int main(void)
{
int c;
int *dev_ptr;
cudaMalloc((void **)&dev_ptr, sizeof(int)); //allocate sizeof(int) bytes of contiguous memory in the gpu device and return the address of first byte to dev_ptr.
// call the kernel
add_integers_cuda <<<1,1>>>(2,7,dev_ptr);
cudaMemcpy(&c, dev_ptr, sizeof(int), cudaMemcpyDeviceToHost);
printf("Calculation done on GPU yields the answer: &d\n",c );
cudaFree(dev_ptr);
printf("Press enter to stop.");
cin.ignore(255, '\n');
return 0;
}
"

&d is not a correct printf formatting character here:
printf("Calculation done on GPU yields the answer: &d\n",c );
You won't get the output you are expecting.
You should use %d instead:
printf("Calculation done on GPU yields the answer: %d\n",c );
This particular issue has nothing to do with CUDA of course.
You may also want to run CUDA codes with cuda-memcheck and/or use proper CUDA error checking if you are just learning and having trouble. Neither of those would have pointed out the above error, however.

Related

Increment and access global counter CUDA

I need to make my kernel communicate with the host. I tried to use a global counter (better ways are well accepted), but the following code prints always 0. What am I doing wrong? (I tried both commented and uncommented ways).
#include <stdio.h>
#include <cuda_runtime.h>
//__device__ int count[1] = {0};
__device__ int count = 0;
__global__ void inc() {
//count[0]++;
atomicAdd(&count, 1);
}
int main(void) {
inc<<<1,10>>>();
cudaDeviceSynchronize();
//int *c;
int c;
cudaMemcpyFromSymbol(&c, count, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", c);
return 0;
}
Anytime you are having trouble with a CUDA code, I strongly encourage you to use proper CUDA error checking and run your code with cuda-memcheck, before asking others for help. Even if you don't understand the error output, providing it in your question will be useful for those trying to help you.
If you had done so, you would have received a report that cudaMemcpyFromSymbol is throwing an invalid argument error.
If you study the documentation for that function call, you will see that the 4th parameter is not the direction parameter, but is the offset parameter. So providing cudaMemcpyDeviceToHost is incorrect for the offset parameter. Since cudaMemcpyFromSymbol is always a device->host transfer, providing the direction argument is redundant, and since it is provided a default, is unnecessary. Your code works correctly for me simply by eliminating that:
$ cat t1414.cu
#include <stdio.h>
#include <cuda_runtime.h>
//__device__ int count[1] = {0};
__device__ int count = 0;
__global__ void inc() {
//count[0]++;
atomicAdd(&count, 1);
}
int main(void) {
inc<<<1,10>>>();
cudaDeviceSynchronize();
//int *c;
int c;
cudaMemcpyFromSymbol(&c, count, sizeof(int));
printf("%d\n", c);
return 0;
}
$ nvcc -o t1414 t1414.cu
$ cuda-memcheck ./t1414
========= CUDA-MEMCHECK
10
========= ERROR SUMMARY: 0 errors
$

Limits on the cuSparse tridiagonal solver?

I have decided to rewrite one of my serial codes to CUDA. A large section of the code is to invert a large tridiagonal matrix for differing right hand sides. I then came across cusparseSgtsv from the cuSparse library. I got a sample code to work for small matrices, but when the matrix size got above 1024, nothing but nan. Did I miss something in the documentation?
Here is the sample code. For N=1024, the code works fine. For N=1025, it is nan all the way down.
#include<iostream>
#include<cuda_runtime.h>
#include<cusparse_v2.h>
using namespace std;
__global__ void assignMat(float *a,float *b,float *c,float *r)
{
int tid=threadIdx.x+blockDim.x*blockIdx.x;
a[tid]=0;
b[tid]=1;
c[tid]=0;
r[tid]=tid;
}
int main()
{
float *d_a,*d_b,*d_c,*d_r;
float *h_r;
int N=1025;
cusparseStatus_t status;
cusparseHandle_t handle=0;
status=cusparseCreate(&handle);
h_r=(float *)malloc(N*sizeof(float));
cudaMalloc((void **)&d_a,N*sizeof(float));
cudaMalloc((void **)&d_b,N*sizeof(float));
cudaMalloc((void **)&d_c,N*sizeof(float));
cudaMalloc((void **)&d_r,N*sizeof(float));
assignMat<<<1,N>>>(d_a,d_b,d_c,d_r);
status=cusparseSgtsv(handle,N,1,d_a,d_b,d_c,d_r,N);
if (status != CUSPARSE_STATUS_SUCCESS)
{
cout << status << endl;
}
else
{
cudaMemcpy(h_r,d_r,N*sizeof(float),cudaMemcpyDeviceToHost);
for (int i=0;i<N;i++)
cout << i << " " << h_r[i] << endl;
}
free(h_r);
cudaFree(d_a);cudaFree(d_b);cudaFree(d_c);cudaFree(d_r);
}
Did I miss something in the documentation?
Not in the cuSparse documentation, no.
However, there is a hard limit on the number of threads per block, so your assignMat kernel stops working once N > 1024. You can read about how to select legal kernel launch parameters here. If your code contained error checking or you ran the program with cuda-memcheck, you probably would have been able to detect the problem yourself at runtime.

cudaEventElapsedTime not expected behaviour

I'm trying to compute the total time taken in GPU to compute something. I'm using the cudaEventRecord and cudaEventElapsedTime to determine this, but I'm having a unexpected behavior, or at least, unexpected for me :) I wrote this example to understand what's happening and I'm still confused.
In the example below I was expecting to report the same time for the three iterations but the result is:
2.80342
1003
2005.6
Which means that the total time in considering the CPU sleep time.
Am I doing something wrong? If not, is it possible do what I want?
#include <iostream>
#include <thread>
#include <chrono>
#include <cuda.h>
#include <cuda_runtime.h>
#include "device_launch_parameters.h"
__global__ void kernel_test(int *a, int N) {
for(int i=threadIdx.x;i<N;i+=N) {
if(i<N)
a[i] = 1;
}
}
int main(int argc, char ** argv) {
cudaEvent_t start[3], stop[3];
for(int i=0;i<3;i++) {
cudaEventCreate(&start[i]);
cudaEventCreate(&stop[i]);
}
cudaStream_t stream;
cudaStreamCreate(&stream);
const int N = 1024 * 1024;
int *h_a = (int*)malloc(N * sizeof(int));
int *a = 0;
cudaMalloc((void**)&a, N * sizeof(int));
for(int i=0;i<3;i++) {
cudaEventRecord(start[i], stream);
cudaMemcpyAsync(a, h_a, N * sizeof(int), cudaMemcpyHostToDevice, stream);
kernel_test<<<1, 1024, 0, stream>>>(a, N);
cudaMemcpyAsync(h_a, a, N*sizeof(int), cudaMemcpyDeviceToHost, stream);
cudaEventRecord(stop[i], stream);
std::this_thread::sleep_for (std::chrono::seconds(i));
cudaEventSynchronize(stop[i]);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start[i], stop[i]);
std::cout<<milliseconds<<std::endl;
}
return 0;
}
I attach the nsight result to verify the behaviour of my example.
Windows 8.1
Geforce GTX 780 Ti
Nvidia drivers: 358.50
EDIT:
Added code to be complete
Attached nsight result
Added SO and drivers info
, ,
If you're running the program on Windows using the WDDM (in contrast to TCC with Tesla cards or Linux) this may be the issue:
With the WDDM kernels are not executed immediately after invocation but instead enqueued to a command buffer. Once the buffer is full it gets flushed and the enqueued commands are actually executed. Another option to force the command buffer to be explicitly flushed is to synchronize.
Now what happens is that you wait before the command buffer is acutally flushed...
Edit
Also see https://devtalk.nvidia.com/default/topic/548639/is-wddm-causing-this-/ for the problem and how cudaEventQuery(0) may help

Modulus computation of an array of cufftComplex data type in CUDA

I made a Dll file in visual C++ to compute modulus of an array of complex numbers in CUDA. The array is type of cufftComplex. I then called the Dll in LabVIEW to check the accuracy of the result. I'm receiving an incorrect result. Could anyone tell me what is wrong with the following code, please? I think there should be something wrong with my kernel function(the way I am retrieving the cufftComplex data should be incorrect).
#include <math.h>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cufft.h>
extern "C" __declspec(dllexport) void Modulus(cufftComplex *digits,float *result);
__global__ void ModulusComputation(cufftComplex *a, int N, float *temp)
{
int idx = blockIdx.x*blockDim.x + threadIdx.x;
if (idx<N)
{
temp[idx] = sqrt((a[idx].x * a[idx].x) + (a[idx].y * a[idx].y));
}
}
void Modulus(cufftComplex *digits,float *result)
{
#define N 1024
cufftComplex *d_data;
float *temp;
size_t size = sizeof(cufftComplex)*N;
cudaMalloc((void**)&d_data, size);
cudaMalloc((void**)&temp, sizeof(float)*N);
cudaMemcpy(d_data, digits, size, cudaMemcpyHostToDevice);
int blockSize = 16;
int nBlocks = N/blockSize;
if( N % blockSize != 0 )
nBlocks++;
ModulusComputation <<< nBlocks, blockSize >>> (d_data, N,temp);
cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);
cudaFree(d_data);
cudaFree(temp);
}
In the final cudaMemcpy in your code, you have:
cudaMemcpy(result, temp, size, cudaMemcpyDeviceToHost);
It should be:
cudaMemcpy(result, temp, sizeof(float)*N, cudaMemcpyDeviceToHost);
If you had included error checking for your cuda calls, you would have seen this cuda call (as originally written) throw an error.
There's other comments that could be made. For example your block size (16) should be an integral multiple of 32. But this does not prevent proper operation.
After the kernel call, when copying back the result, you are using size as the memory size. The third argument of cudaMemcpy should be N * sizeof(float).

CUDA - Copy device data to host?

I have device variable and in this variable, I allocate and fill an array in the device, but I have a problem to get data to host. cudaMemcpy() return cudaErrorInvalidValue error. how can I do it?
PS: The Code is just example, I know, that In this particular case I can use cudaMalloc because I know the size of the array, but In my REAL code, It computes the size of the array in the device and it needs immediately allocate memory.
PS2: I found a similar problem, but I still don't know, how can I solve it? - copy data which is allocated in device from device to host
PS3: I have updated code, but still doesn't work:{
PS4: I am just trying to run this code on a notebook with Nvidia GT 520MX(latest game driver) and doesn't work too :(
thx
#include <cuda.h>
#include <stdio.h>
#define N 400
__device__ int* d_array;
__global__ void allocDeviceMemory()
{
d_array = new int[N];
for(int i=0; i < N; i++)
d_array[i] = 123;
}
int main()
{
allocDeviceMemory<<<1, 1>>>();
cudaDeviceSynchronize();
int* d_a = NULL;
cudaMemcpyFromSymbol((void**)&d_a, "d_array", sizeof(d_a), 0, cudaMemcpyDeviceToHost);
printf("gpu adress: %lld\n", d_a);
int* h_array = (int*)malloc(N*sizeof(int));
cudaError_t errr = cudaMemcpy(h_array, d_a, N*sizeof(int), cudaMemcpyDeviceToHost);
printf("h_array: %d, %d\n", h_array[0], errr);
getchar();
return 0;
}
You need to synchronize (cudaDeviceSynchronize()) after launching the kernel to allocate the memory.
Can you also check the return value of the sync and all other CUDA API calls?
i have tested your code and there is no error here. I am running CUDA 4.0.