cudaMemcpy2D setting values to 0 - cuda

I'm attempting to copy a 2-dimensional array from host to device with cudaMallocPitch and cudaMemcpy2D, but I'm having a problem where it seems to be setting my values to 0.
I'll sketch the basics of my code below. I know the value I print from the kernel is not 0. Any ideas?
__global__ void kernel(float **d_array) {
    printf("%f", d_array[0][0]);
}
void kernelWrapper(int rows, int cols, float **array) {
    float **d_array;
    size_t pitch;
    cudaMallocPitch((void**) &d_array, &pitch, rows*sizeof(float), cols);
    cudaMemcpy2D(d_array, pitch, array, rows*sizeof(float), rows*sizeof(float), cols, cudaMemcpyHostToDevice);
    kernel<<<1,1>>>(d_array);
}
For some reason, the kernel keeps printing 0.0000. I know that the first element is not 0 as I tested printing the first element of the host array. What is happening?
EDIT:
I tried this code as well but got invalid pointer errors.
cudaMalloc(d_array, rows*sizeof(float*));
for (int i = 0; i < rows; i++) {
    cudaMalloc((void**) &d_array[i], cols*sizeof(float));
}
cudaMemcpy(d_array, array, rows*sizeof(float*), cudaMemcpyHostToDevice);

Despite its name, cudaMemcpy2D does not copy a doubly-subscripted C host array (**) to a doubly-subscripted (**) device array. You'll note that it expects single pointers (*) to be passed to it, not double pointers (**). cudaMemcpy2D is used for copying a flat, strided array, not a 2-dimensional array. There are two dimensions inherent in the concept of strided access, which is where the name comes from.
In general, trying to copy a 2D array from host to device is more complicated than a single API call. You are advised to flatten your array so you can reference it with a single pointer (*); then the API calls will work. There are plenty of examples of proper usage of cudaMemcpy2D on SO; just search for them.
Also, you should do proper CUDA error checking on all CUDA API calls and kernel launches whenever you are having difficulty with CUDA code.
If you really want to copy a 2D array directly, take a look at this question/answer for a worked example. It's not trivial.
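For reference, here is a rough sketch of the flattened approach suggested above, with basic error checking; the flattened array parameter is assumed to hold rows*cols contiguous floats:
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernel(float *d_array) {
    printf("%f", d_array[0]);   // index with d_array[row*cols + col] in general
}

void kernelWrapper(int rows, int cols, float *array) {
    float *d_array;
    cudaError_t err = cudaMalloc(&d_array, rows * cols * sizeof(float));
    if (err != cudaSuccess) printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    err = cudaMemcpy(d_array, array, rows * cols * sizeof(float), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) printf("cudaMemcpy: %s\n", cudaGetErrorString(err));
    kernel<<<1,1>>>(d_array);
    cudaDeviceSynchronize();    // flush the kernel's printf output
    cudaFree(d_array);
}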

Related

What is the most optimized way of getting a single element from a device array in cuda

I have an array of huge length on the device, and for some condition check I want to access (on the host/CPU) only one element from the middle (say the Nth element). What would be the optimal way of doing this?
Do I need to write a kernel that writes the Nth location of the source array into a single-element array, and then copy that single-element array to the host?
You can copy a single element of an array using cudaMemcpy.
Let's say you want to copy the N-th element of the array:
int * dSourceArray
to variable
int hTargetVariable
You can apply device pointer arithmetic on the host. All you need to do is move the dSourceArray pointer by N elements and copy a single element:
cudaMemcpy(&hTargetVariable, dSourceArray+N, sizeof(int), cudaMemcpyDeviceToHost)
Keep in mind that if you use multiple streams, you may want to synchronize the device before transferring the data.
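A minimal end-to-end sketch of this approach (the fill kernel and the sizes here are made up for illustration):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fill(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = i * i;
}

int main() {
    const int n = 1000, N = 500;
    int *dSourceArray;
    cudaMalloc(&dSourceArray, n * sizeof(int));
    fill<<<(n + 255) / 256, 256>>>(dSourceArray, n);

    int hTargetVariable;
    cudaDeviceSynchronize();  // make sure the kernel (and any other streams) finished
    cudaMemcpy(&hTargetVariable, dSourceArray + N, sizeof(int), cudaMemcpyDeviceToHost);
    printf("element %d = %d\n", N, hTargetVariable);
    cudaFree(dSourceArray);
    return 0;
}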
One addendum to the answer above: you may need to take account of the bytes per element of your array, e.g. for an array of arrays of various types on the device:
#ifdef CUDA_KERNEL
    char* mgpu[ MAX_BUF ];        // device-side: array of pointers to arrays of various types
#else
    CUdeviceptr mgpu[ MAX_BUF ];  // host-side: each entry holds a device pointer
    CUdeviceptr gpu ( int n ) { return mgpu[n]; }
#endif

CUdeviceptr GPUpointer = m_Fluid.gpu(FGRIDOFF);  // device pointer to the FGRIDOFF (int) array
cuMemcpyDtoH( &CPUelement, GPUpointer + (offset * sizeof(int)), sizeof(int) );

cudaMemcpy function usage

How will the cudaMemcpy function work in this case?
I have declared a matrix like this:
float imagen[par->N][par->M];
and I want to copy it to the CUDA device, so I did this:
float *imagen_cuda;
int tam_cuda=par->M*par->N*sizeof(float);
cudaMalloc((void**) &imagen_cuda,tam_cuda);
cudaMemcpy(imagen_cuda,imagen,tam_cuda,cudaMemcpyHostToDevice);
Will this copy the 2D array into a 1D array correctly?
And how can I copy to another 2D array? Can I change the declaration to this, and will it work?
float **imagen_cuda;
It's not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expects an ordinary pointer for source and destination, not a pointer-to-pointer.
The simplest approach (I think) is to "flatten" the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates:
float imagen[par->N][par->M];                 // the original 2D declaration
float *myimagen = &(imagen[0][0]);            // single pointer to the same storage
float myval = myimagen[(rowsize*row) + col];  // rowsize == par->M here
You can then use ordinary cudaMemcpy operations to handle the transfers (using the myimagen pointer):
float *d_myimagen;
cudaMalloc((void **)&d_myimagen, (par->N * par->M)*sizeof(float));
cudaMemcpy(d_myimagen, myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyHostToDevice);
If you really want to handle dynamically sized (i.e. not known at compile time) doubly-subscripted arrays, you can review this question/answer.
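And for completeness, a small sketch of the same (rowsize*row)+col arithmetic inside a kernel; the kernel name and launch dimensions here are invented for illustration, with cols playing the role of rowsize:
__global__ void scale(float *d_myimagen, int rows, int cols) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols)
        d_myimagen[(cols * row) + col] *= 2.0f;  // simulated 2D indexing on the flat array
}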

Thrust device_malloc and device_new

What are the advantages of using Thrust's device_malloc instead of the normal cudaMalloc, and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on an array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
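As a rough sketch of what "the proper type of object" buys you: a device_ptr from device_malloc can be handed straight to Thrust algorithms, no wrapping or casting needed (assuming the usual Thrust headers):
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>
#include <thrust/fill.h>

int main() {
    const size_t n = 1024;
    thrust::device_ptr<float> d = thrust::device_malloc<float>(n);  // typed device pointer
    thrust::fill(d, d + n, 1.0f);  // Thrust algorithms accept it directly
    thrust::device_free(d);
    return 0;
}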
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new (device_ptr< void > p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct
one or many Ts.
Basically, this function can be used if memory has already been allocated. Only the default constructor will be called, and it returns a device_ptr cast to T's type.
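A small sketch of that "memory already allocated" case (assuming the device_malloc/device_new/device_free headers):
#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>

int main() {
    // allocate raw bytes first, then default-construct 10 ints in place
    thrust::device_ptr<void> raw = thrust::device_malloc(10 * sizeof(int));
    thrust::device_ptr<int> arr = thrust::device_new<int>(raw, 10);
    // ... use arr ...
    thrust::device_free(raw);
    return 0;
}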
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T >
device_ptr<T> thrust::device_new (const size_t n = 1)
So I think I found one good use for device_new.
It's basically a cleaner way of constructing an object directly on the device while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));

Is this way of allocating a device object "correct"?

So I asked a question before about how to allocate an object on the device directly, instead of the "normal":
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside it one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
#include <cstdio>

class Particle
{
public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();
    printf("Done!\n");
    return 0;
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine!
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point in creating a pointer on the host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of CUDA (CUDA 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it), but it's instructive to consider that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
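For instance, a quick sketch of inspecting that status, continuing the snippet above (the printf wording is mine):
cudaError_t err = cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));  // reports the failure described above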
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but it's nevertheless weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and the compiler warning.

Is there a better and faster way to copy from CPU memory to GPU using Thrust?

Recently I have been using Thrust a lot. I have noticed that in order to use Thrust, one must always copy the data from CPU memory to GPU memory.
Let's see the following example:
void foo(int *foo)
{
    thrust::host_vector<int> m(foo, foo + 100000);
    thrust::device_vector<int> s = m;
}
I'm not quite sure how the host_vector constructor works, but it seems like I'm copying the initial data, coming from *foo, twice: once into the host_vector when it is initialized, and again when the device_vector is initialized. Is there a better way of copying from CPU to GPU without making an intermediate copy of the data? I know I can use device_ptr as a wrapper, but that still doesn't fix my problem.
thanks!
One of device_vector's constructors takes a range of elements specified by two iterators. It's smart enough to understand the raw pointer in your example, so you can construct a device_vector directly and avoid the temporary host_vector:
void my_function_taking_host_ptr(int *raw_ptr, size_t n)
{
    // device_vector assumes raw_ptrs point to system memory
    thrust::device_vector<int> vec(raw_ptr, raw_ptr + n);
    ...
}
If your raw pointer points to CUDA memory, introduce a device_ptr:
void my_function_taking_cuda_ptr(int *raw_ptr, size_t n)
{
    // wrap raw_ptr before passing to device_vector
    thrust::device_ptr<int> d_ptr(raw_ptr);
    thrust::device_vector<int> vec(d_ptr, d_ptr + n);
    ...
}
Using a device_ptr doesn't allocate any storage; it just encodes the location of the pointer in the type system.
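To make that last point concrete, a tiny sketch of the wrap/unwrap round trip (no Thrust allocation happens here):
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>

int main() {
    int *raw_ptr;
    cudaMalloc(&raw_ptr, 100 * sizeof(int));
    thrust::device_ptr<int> d_ptr(raw_ptr);       // wrapping allocates nothing
    int *back = thrust::raw_pointer_cast(d_ptr);  // recovers the original raw pointer
    cudaFree(back);
    return 0;
}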