cudaMemcpy function usage - cuda

How will the cudaMemcpy function work in this case?
I have declared a matrix like this
float imagen[par->N][par->M];
and I want to copy it to the cuda device so I did this
float *imagen_cuda;
int tam_cuda=par->M*par->N*sizeof(float);
cudaMalloc((void**) &imagen_cuda,tam_cuda);
cudaMemcpy(imagen_cuda,imagen,tam_cuda,cudaMemcpyHostToDevice);
Will this copy the 2D array into a 1D array correctly?
And how can I copy it to another 2D array? Can I change the declaration to this, and will it work?
float **imagen_cuda;

It's not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expects an ordinary pointer for source and destination, not a pointer-to-pointer.
The simplest approach (I think) is to "flatten" the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates:
float imagen[par->N][par->M];
float *myimagen = &(imagen[0][0]);
float myval = myimagen[(rowsize*row) + col]; // rowsize is the number of elements per row, here par->M
You can then use ordinary cudaMemcpy operations to handle the transfers (using the myimagen pointer):
float *d_myimagen;
cudaMalloc((void **)&d_myimagen, (par->N * par->M)*sizeof(float));
cudaMemcpy(d_myimagen, myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyHostToDevice);
If you really want to handle dynamically sized (i.e. not known at compile time) doubly-subscripted arrays, you can review this question/answer.
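As a host-only illustration of the index arithmetic described above, here is a minimal sketch. The helper names flat_index and read_via_flat_pointer and the 3x4 matrix size are made up for this example; no CUDA calls are involved.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: row-major offset of element (row, col)
// in a matrix that has `cols` elements per row.
inline std::size_t flat_index(std::size_t row, std::size_t col, std::size_t cols) {
    assert(col < cols);
    return row * cols + col;
}

// Fills a small 3x4 matrix, then reads element (row, col) back through a
// flat pointer to its first element, the same idiom as myimagen above.
inline float read_via_flat_pointer(std::size_t row, std::size_t col) {
    constexpr std::size_t N = 3;  // rows (assumed size for the example)
    constexpr std::size_t M = 4;  // columns, i.e. the row size
    assert(row < N);
    float imagen[N][M];
    for (std::size_t r = 0; r < N; ++r) {
        for (std::size_t c = 0; c < M; ++c) {
            imagen[r][c] = static_cast<float>(r * 10 + c);
        }
    }
    const float *flat = &imagen[0][0];
    return flat[flat_index(row, col, M)];
}
```

The same row * cols + col expression would be used inside a kernel that receives the flat device pointer, provided the number of columns is passed along as an argument.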

Related

What is the most optimized way of getting a single element from a device array in cuda

I have a very long array on the device, and for a condition check I want to access only one element from the middle (say, the Nth element) on the host (CPU). What would be the most efficient way to do this?
Do I need to write a kernel that copies the Nth element of the source array into a single-element array, and then copy that single-element array to the host?
You can copy a single element of an array using cudaMemcpy.
Let's say you want to copy the N-th element of the array:
int *dSourceArray;
to the variable
int hTargetVariable;
You can apply pointer arithmetic to a device pointer on the host. All you need to do is advance the dSourceArray pointer by N elements and copy a single element:
cudaMemcpy(&hTargetVariable, dSourceArray + N, sizeof(int), cudaMemcpyDeviceToHost);
Keep in mind that if you use multiple streams, you should synchronize the device before transferring the data.
One addendum to the first answer: you may need to account for the number of bytes per element of your array. With the driver API, CUdeviceptr is an integer type, so an offset added to it is in bytes rather than elements. For example, for an array of arrays of various types on the device:
#ifdef CUDA_KERNEL
char* mgpu[ MAX_BUF ]; // Device array of pointers to arrays of various types.
#else
CUdeviceptr mgpu[ MAX_BUF ]; // On the host, each entry is a device pointer.
#endif
CUdeviceptr gpu (int n ) { return mgpu[n]; }
CUdeviceptr GPUpointer = m_Fluid.gpu(FGRIDOFF); // Device pointer to FGRIDOFF (int) array
cuMemcpyDtoH (&CPUelement, GPUpointer+(offset*sizeof(int)) , sizeof(int) ); // byte offset = element index * sizeof(int)
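The difference between element offsets and byte offsets can be checked on the host alone. The sketch below is only a model: FakeDevicePtr is a made-up stand-in for an integer address type such as CUdeviceptr, and byte_offset and offsets_agree are hypothetical helpers. It assumes a platform with a flat address space, where converting a pointer to an integer yields its byte address.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: byte offset of the n-th element of type T.
template <typename T>
constexpr std::size_t byte_offset(std::size_t n) {
    return n * sizeof(T);
}

// Made-up stand-in for an untyped, integer address such as CUdeviceptr.
// It only models the arithmetic; it does not refer to device memory.
using FakeDevicePtr = std::uintptr_t;

// Returns true when typed pointer arithmetic (buffer + n) and integer
// address arithmetic (base + n * sizeof(int)) reach the same location.
inline bool offsets_agree(std::size_t n) {
    static int buffer[16];
    assert(n < 16);
    const int *typed = buffer + n;  // compiler scales n by sizeof(int)
    const FakeDevicePtr base = reinterpret_cast<FakeDevicePtr>(buffer);
    const FakeDevicePtr raw = base + byte_offset<int>(n);  // scaled by hand
    return reinterpret_cast<FakeDevicePtr>(typed) == raw;
}
```

With a typed pointer such as int *, adding N is enough. With an integer address, the multiplication by sizeof(int) has to be written out, as in the cuMemcpyDtoH call above.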

cudaMemcpy2D setting values to 0

I'm attempting to copy a 2-dimensional array from host to device with cudaMallocPitch and cudaMemcpy2D, but I'm having a problem where it seems to be setting my value to 0.
I'll write the basics of my code in the browser. I know the value I print from the kernel is not 0. Any ideas?
__global__ void kernel(float **d_array) {
printf("%f", d_array[0][0]);
}
void kernelWrapper(int rows, int cols, float **array) {
float **d_array;
size_t pitch;
cudaMallocPitch((void**) &d_array, &pitch, rows*sizeof(float), cols);
cudaMemcpy2D(d_array, pitch, array, rows*sizeof(float), rows*sizeof(float), cols, cudaMemcpyHostToDevice);
kernel<<<1,1>>>(d_array);
}
For some reason, the kernel keeps printing 0.0000. I know that the first element is not 0 as I tested printing the first element of the host array. What is happening?
EDIT:
I tried this code as well but got invalid pointer errors.
cudaMalloc(d_array, rows*sizeof(float*));
for (int i = 0; i < rows; i++) {
cudaMalloc((void**) &d_array[i], cols*sizeof(float));
}
cudaMemcpy(d_array, array, rows*sizeof(float*), cudaMemcpyHostToDevice);
Despite its name, cudaMemcpy2D does not copy a doubly-subscripted C host array (**) to a doubly-subscripted (**) device array. You'll note that it expects single pointers (*) to be passed to it, not double pointers (**). cudaMemcpy2D is used for copying a flat, strided array, not a 2-dimensional array. There are 2 dimensions inherent in the concept of strided access, which is where the name comes from.
In general, copying a 2D array from host to device takes more than a single API call. You are advised to flatten your array so you can reference it with a single pointer (*); then the API calls will work. There are plenty of examples of proper usage of cudaMemcpy2D on SO; just search for them.
Also, whenever you are having difficulty with CUDA code, you should do CUDA error checking on all CUDA API calls and kernel launches.
If you really want to copy a 2D array directly, take a look at this question/answer for a worked example. It's not trivial.
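To make "flat, strided array" concrete, here is a host-only model of how the pitch, width and height arguments relate. It is a sketch of the argument semantics as I understand them, not CUDA's implementation; strided_copy and copy_to_pitched are hypothetical names, and the padded row length stands in for the pitch that cudaMallocPitch would return.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only model: copy `height` rows of `width_bytes` bytes each.
// Consecutive rows start `spitch` bytes apart in the source and
// `dpitch` bytes apart in the destination.
inline void strided_copy(void *dst, std::size_t dpitch,
                         const void *src, std::size_t spitch,
                         std::size_t width_bytes, std::size_t height) {
    assert(width_bytes <= dpitch && width_bytes <= spitch);
    auto *d = static_cast<unsigned char *>(dst);
    const auto *s = static_cast<const unsigned char *>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width_bytes);
    }
}

// Copies a tightly packed rows x cols matrix (one flat pointer) into a
// buffer whose rows are padded to `padded_cols` elements.
inline std::vector<float> copy_to_pitched(const std::vector<float> &flat,
                                          std::size_t rows, std::size_t cols,
                                          std::size_t padded_cols) {
    assert(flat.size() == rows * cols && padded_cols >= cols);
    std::vector<float> pitched(rows * padded_cols, 0.0f);
    strided_copy(pitched.data(), padded_cols * sizeof(float),  // dst, dpitch
                 flat.data(), cols * sizeof(float),            // src, spitch
                 cols * sizeof(float), rows);                  // width, height
    return pitched;
}
```

Note that both source and destination are single pointers, and that the width is the length of one row in bytes while the height is the number of rows.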

std::vector to array in CUDA

Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc and copy from host to device, what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word, no, there isn't. The CUDA API doesn't support deep copying, and it doesn't know anything about std::vector either. If you insist on having a vector of vectors as a host source, it will require doing something like this:
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin() ; it != information.end(); ++it) {
int *src = &((*it)[0]);
size_t sz = it->size();
cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy for each vector. If the vector of vectors is a "jagged" array, you will also want to store an index (for example, the starting offset of each row) somewhere for the GPU to use.
As far as I understand, the inner vectors of a vector of vectors do not need to reside in contiguous memory, i.e. they can be fragmented.
Depending on the amount of memory you need to transfer, I would do one of two things:
Reorder your memory to be a single vector, and then use a single cudaMemcpy.
Issue a series of cudaMemcpyAsync calls, where each copy handles a single vector in your vector of vectors, and then synchronize.
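The "single vector" option, combined with an index for jagged rows, could look roughly like the host-side sketch below. FlatRows, flatten and element_at are hypothetical names introduced for this example; the two resulting buffers are contiguous, so each could then be transferred with one ordinary cudaMemcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: all rows stored back to back, plus row offsets.
struct FlatRows {
    std::vector<int> values;           // concatenated row contents
    std::vector<std::size_t> offsets;  // row i spans [offsets[i], offsets[i + 1])
};

// Flattens a (possibly jagged) vector of vectors into one contiguous
// buffer and records where each row starts.
inline FlatRows flatten(const std::vector<std::vector<int>> &rows) {
    FlatRows out;
    out.offsets.reserve(rows.size() + 1);
    out.offsets.push_back(0);
    for (const auto &row : rows) {
        out.values.insert(out.values.end(), row.begin(), row.end());
        out.offsets.push_back(out.values.size());
    }
    return out;
}

// Looks up element (row, col) using the offsets, the same arithmetic a
// kernel would use on the device copies of the two buffers.
inline int element_at(const FlatRows &f, std::size_t row, std::size_t col) {
    assert(row + 1 < f.offsets.size());
    assert(f.offsets[row] + col < f.offsets[row + 1]);
    return f.values[f.offsets[row] + col];
}
```

Empty rows are handled naturally: they contribute no values, and their start and end offsets are equal.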

Copying data to "cufftComplex" data struct?

I have data stored as arrays of floats (single precision). I have one array for my real data, and one array for my imaginary data, which I use as the input to FFTs. I need to copy this data into the cufftComplex data type if I want to use the CUDA cufft library. From NVIDIA: "cufftComplex is a single‐precision, floating‐point complex data type that consists of interleaved real and imaginary components." Data to be operated on by cufft is stored in arrays of cufftComplex.
How do I quickly copy my data from a normal C array into an array of cufftComplex? I don't want to use a for loop because it's probably the slowest possible option. I don't know how to use memcpy on arrays of this type, because I do not know how they are stored in memory. Thanks!
You could do this as part of a host-to-device copy. Each copy would take one of the contiguous input arrays on the host and copy it in strided fashion to the device. The storage layout of complex data types in CUDA is compatible with the layout defined for complex types in Fortran and C++, i.e. a structure with the real part followed by the imaginary part.
float * real_vec; // host vector, real part
float * imag_vec; // host vector, imaginary part
float2 * complex_vec_d; // device vector, single-precision complex
float * tmp_d = (float *) complex_vec_d;
cudaStat = cudaMemcpy2D (tmp_d, 2 * sizeof(tmp_d[0]),
real_vec, 1 * sizeof(real_vec[0]),
sizeof(real_vec[0]), n, cudaMemcpyHostToDevice);
cudaStat = cudaMemcpy2D (tmp_d + 1, 2 * sizeof(tmp_d[0]),
imag_vec, 1 * sizeof(imag_vec[0]),
sizeof(imag_vec[0]), n, cudaMemcpyHostToDevice);
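For reference, the layout that the two strided copies produce can be modeled on the host as below. Complex32 is a made-up stand-in with the same two-float layout as float2 (x holds the real part, y the imaginary part), and interleave is a hypothetical helper. This is exactly the kind of host loop the answer avoids by folding the interleaving into the transfer, so treat it as a description of the result rather than a recommended method.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Made-up stand-in for the float2 / cufftComplex layout:
// real part first, imaginary part second.
struct Complex32 {
    float x;  // real part
    float y;  // imaginary part
};

static_assert(sizeof(Complex32) == 2 * sizeof(float),
              "expected two tightly packed floats");

// Builds the interleaved array element by element: the first strided copy
// corresponds to the writes to x, the second to the writes to y.
inline std::vector<Complex32> interleave(const std::vector<float> &real_vec,
                                         const std::vector<float> &imag_vec) {
    assert(real_vec.size() == imag_vec.size());
    std::vector<Complex32> out(real_vec.size());
    for (std::size_t i = 0; i < real_vec.size(); ++i) {
        out[i].x = real_vec[i];
        out[i].y = imag_vec[i];
    }
    return out;
}
```

In the cudaMemcpy2D calls above, each "row" is a single float, the destination pitch of two floats skips over the other component, and tmp_d + 1 starts the second copy at the first imaginary slot.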

Is there a better and faster way to copy from CPU memory to GPU using thrust?

Recently I have been using thrust a lot. I have noticed that in order to use thrust, one must always copy the data from the cpu memory to the gpu memory.
Let's see the following example :
int foo(int *foo)
{
host_vector<int> m(foo, foo+ 100000);
device_vector<int> s = m;
}
I'm not quite sure how the host_vector constructor works, but it seems like I'm copying the initial data, coming from *foo, twice: once to the host_vector when it is initialized, and again when the device_vector is initialized. Is there a better way of copying from CPU to GPU without making intermediate copies? I know I can use device_ptr as a wrapper, but that still doesn't fix my problem.
thanks!
One of device_vector's constructors takes a range of elements specified by two iterators. It's smart enough to understand the raw pointer in your example, so you can construct a device_vector directly and avoid the temporary host_vector:
void my_function_taking_host_ptr(int *raw_ptr, size_t n)
{
// device_vector assumes raw_ptrs point to system memory
thrust::device_vector<int> vec(raw_ptr, raw_ptr + n);
...
}
If your raw pointer points to CUDA memory, introduce a device_ptr:
void my_function_taking_cuda_ptr(int *raw_ptr, size_t n)
{
// wrap raw_ptr before passing to device_vector
thrust::device_ptr<int> d_ptr(raw_ptr);
thrust::device_vector<int> vec(d_ptr, d_ptr + n);
...
}
Using a device_ptr doesn't allocate any storage; it just encodes the location of the pointer in the type system.