Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc it and copy it from host to device; what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word, no, there isn't. The CUDA API doesn't support deep copying, and it doesn't know anything about std::vector either. If you insist on having a vector of vectors as the host source, it will require doing something like this:
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin(); it != information.end(); ++it) {
    int *src = &((*it)[0]);
    size_t sz = it->size();
    cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
    dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy per inner vector. If the vector of vectors is a "jagged" array, you will also want to store an index (for example, per-row offsets) somewhere for the GPU to use, as in the sketch below.
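For the jagged case, a minimal sketch of building per-row offsets on the host and copying them to the device alongside the data (the names offsets and d_offsets are illustrative):
// prefix sum of row lengths: row i starts at d_information + offsets[i]
std::vector<int> offsets(information.size() + 1, 0);
for (size_t i = 0; i < information.size(); ++i)
    offsets[i + 1] = offsets[i] + (int)information[i].size();

int *d_offsets;
cudaMalloc((void**)&d_offsets, offsets.size() * sizeof(int));
cudaMemcpy(d_offsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);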
As far as I understand, the inner vectors of a vector of vectors do not need to reside in contiguous memory, i.e. they can be fragmented.
Depending on the amount of memory you need to transfer, I would do one of two things:
Reorder your memory into a single flat vector, and then use a single cudaMemcpy.
Create a series of cudaMemcpyAsync calls, where each copy handles a single vector in your vector of vectors, and then synchronize (see the sketch below).
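A rough sketch of the second option, assuming d_information has already been allocated as in the other answer (stream handling is left at the default):
int *dst = d_information;
for (size_t i = 0; i < information.size(); ++i) {
    // one asynchronous copy per inner vector, issued on the default stream
    cudaMemcpyAsync(dst, information[i].data(),
                    information[i].size() * sizeof(int),
                    cudaMemcpyHostToDevice);
    dst += information[i].size();
}
cudaDeviceSynchronize();   // wait for all copies to finish
Note that for these copies to actually overlap, the host vectors would need to use pinned memory; with ordinary pageable memory the calls behave essentially synchronously.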
I'm trying to optimize my code, which I have so far accelerated using OpenACC only.
Is it a good approach to insert CUDA such as in the example that follows?
In this case, u_device and v_device are used only by the device. Using cudaMalloc ensures that the memory is allocated in device memory and not on the host as well.
int size = NVAR * sizeof(double);
// Declare pointers that will point to the memory allocated on the device.
double* v_device;
double* u_device;
// Allocate memory on the device
cudaMalloc(&v_device, size);
cudaMalloc(&u_device, size);
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
    #pragma acc loop
    for (nv = 0; nv < NVAR; nv++) v_device[nv] = V[nv][k][j][i];
    PrimToCons(v_device, u_device);
    #pragma acc loop
    for (nv = 0; nv < NVAR; nv++) U[k][j][i][nv] = u_device[nv];
}
cudaFree(u_device);
cudaFree(v_device);
Previously, I would have used OpenACC only and written something like this:
double* v_device = (double*)malloc(size);
double* u_device = (double*)malloc(size);
#pragma acc enter data create(u_device[:size], v_device[:size])
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
    ...
}
#pragma acc exit data delete(u_device[:size], v_device[:size])
Is there a way with OpenACC to avoid host memory allocation?
Another doubt I have regarding cudaMalloc is whether the call can be placed inside the kernel region, in order to make the arrays private:
#pragma acc parallel loop private(v_device, u_device)
for (i = ibeg; i <= iend; i++){
    double* v_device;
    double* u_device;
    // Allocate memory on the device
    cudaMalloc(&v_device, size);
    cudaMalloc(&u_device, size);
    .
    .
    .
    cudaFree(u_device);
    cudaFree(v_device);
}
Writing it this way, I get the error:
182, Accelerator restriction: call to 'cudaMalloc' with no acc routine information
Is there a way with OpenACC to avoid host memory allocation?
You can use cudaMalloc, but for pure OpenACC, you'd use "acc_malloc" and "acc_free". For example: https://github.com/rmfarber/ParallelProgrammingWithOpenACC/blob/master/Chapter05/acc_malloc.c
Note the use of the "deviceptr" clause, which indicates that the pointer is a device pointer. Though here, you're wanting to privatize these arrays, so you can keep the private clause.
I've never used a device pointer in a private clause, but I just tried it and it seems to work. That makes sense, since all the compiler really needs is the size and type of the private array to make the private copies. In this case, since it's on the gang loop, the compiler will attempt to put the private arrays in shared memory, assuming they aren't too big to fit. I'd recommend using the triplet notation for the array, i.e. "private(v_device[:NVAR], ...)", so the compiler will know the size.
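Putting those pieces together, a sketch of what the privatized version might look like (this mirrors what the answer describes but is untested here; NVAR, ibeg, and iend come from the question):
double *v_device = (double *) acc_malloc(NVAR * sizeof(double));
double *u_device = (double *) acc_malloc(NVAR * sizeof(double));
#pragma acc parallel loop private(v_device[:NVAR], u_device[:NVAR])
for (i = ibeg; i <= iend; i++){
    ...
}
acc_free(v_device);
acc_free(u_device);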
Though I'm not sure there's much of an advantage to using device arrays here. The device memory you're allocating isn't actually going to be used (the compiler makes its own private copies), yet it still takes up space on the device. Device memory is often much smaller than host memory, so if you do need to waste space, it's probably better for that to be on the host. Plus, having to use acc_malloc or cudaMalloc limits the portability of the code. Not that there aren't cases where using device-only memory is beneficial; I just don't think it is in this case.
Note that you can call "malloc" within device code, but it's not recommended. Device-side mallocs get serialized, causing performance issues, and the default heap is also relatively small, which can lead to heap overflows. Granted, the heap size can be increased, either by calling cudaDeviceSetLimit or via the environment variable "NV_ACC_CUDA_HEAPSIZE".
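For example, the heap can be enlarged from the host before the first kernel launch (the 64 MB value is just illustrative):
// raise the device malloc heap limit before any kernels run
cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);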
How will the cudaMemcpy function work in this case?
I have declared a matrix like this
float imagen[par->N][par->M];
and I want to copy it to the CUDA device, so I did this:
float *imagen_cuda;
int tam_cuda=par->M*par->N*sizeof(float);
cudaMalloc((void**) &imagen_cuda,tam_cuda);
cudaMemcpy(imagen_cuda,imagen,tam_cuda,cudaMemcpyHostToDevice);
Will this copy the 2D array into a 1D array correctly?
And how can I copy it into another 2D array on the device? Can I just change the declaration to this, and will it work?
float **imagen_cuda;
It's not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expects an ordinary pointer for the source and destination, not a pointer-to-pointer.
The simplest approach (I think) is to "flatten" the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates:
float imagen[par->N][par->M];
float *myimagen = &(imagen[0][0]);
float myval = myimagen[(rowsize*row) + col];
You can then use ordinary cudaMemcpy operations to handle the transfers (using the myimagen pointer):
float *d_myimagen;
cudaMalloc((void **)&d_myimagen, (par->N * par->M)*sizeof(float));
cudaMemcpy(d_myimagen, myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyHostToDevice);
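A kernel can then use the same index arithmetic on the flattened array. A sketch, where the kernel name and launch configuration are illustrative and M stands for the row width par->M passed from the host:
__global__ void scale_image(float *img, int N, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < M)
        img[row * M + col] *= 2.0f;   // simulated 2D access into the 1D array
}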
If you really want to handle dynamically sized (i.e. not known at compile time) doubly-subscripted arrays, you can review this question/answer.
I need to perform a parallel reduction to find the min or max of an array on a CUDA device. I found a good library for this, called Thrust. It seems that you can only perform a parallel reduction on arrays in host memory. My data is in device memory. Is it possible to perform a reduction on data in device memory?
I can't figure out how to do this. Here is the documentation for Thrust: http://code.google.com/p/thrust/wiki/QuickStartGuide#Reductions. Thank you all.
You can do reductions in Thrust on arrays which are already in device memory. All that you need to do is wrap your device pointers inside thrust::device_ptr containers and call one of the reduction procedures, just as shown in the wiki you have linked to:
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// assume dmem is a valid device allocation holding N words of data
int *dmem;
// wrap the raw device pointer
thrust::device_ptr<int> dptr(dmem);
// use max_element for the reduction
thrust::device_ptr<int> dresptr = thrust::max_element(dptr, dptr + N);
// retrieve the result from the device (if required)
int max_value = dresptr[0];
Note that the return value is also a device_ptr, so you can use it directly in other kernels using thrust::raw_pointer_cast:
int * dres = thrust::raw_pointer_cast(dresptr);
If Thrust or any other library does not provide such a service, you can still write that kernel yourself.
Mark Harris has a great tutorial about parallel reduction and its optimisations in CUDA.
Following his slides, it is not that hard to implement and adapt to your needs; a sketch is below.
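A minimal sketch of a shared-memory max reduction in the style of those slides (one partial result per block; a second pass, or an atomic, is still needed to combine the per-block results):
#include <limits.h>

__global__ void max_reduce(const int *in, int *out, int n)
{
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // load one element per thread, padding with INT_MIN past the end
    sdata[tid] = (i < n) ? in[i] : INT_MIN;
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] = max(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = sdata[0];   // one partial max per block
}
// launch example: max_reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);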
Recently I have been using Thrust a lot. I have noticed that in order to use Thrust, one must always copy the data from CPU memory to GPU memory.
Let's look at the following example:
void foo(int *foo)
{
    host_vector<int> m(foo, foo + 100000);
    device_vector<int> s = m;
}
I'm not quite sure how the host_vector constructor works, but it seems like I'm copying the initial data, coming from *foo, twice: once into the host_vector when it is initialized, and again when the device_vector is initialized. Is there a better way of copying from CPU to GPU without making an intermediate copy? I know I can use device_ptr as a wrapper, but that still doesn't fix my problem.
thanks!
One of device_vector's constructors takes a range of elements specified by two iterators. It's smart enough to understand the raw pointer in your example, so you can construct a device_vector directly and avoid the temporary host_vector:
void my_function_taking_host_ptr(int *raw_ptr, size_t n)
{
    // device_vector assumes the raw pointer points to system (host) memory
    thrust::device_vector<int> vec(raw_ptr, raw_ptr + n);
    ...
}
If your raw pointer points to CUDA memory, introduce a device_ptr:
void my_function_taking_cuda_ptr(int *raw_ptr, size_t n)
{
    // wrap raw_ptr before passing to device_vector
    thrust::device_ptr<int> d_ptr(raw_ptr);
    thrust::device_vector<int> vec(d_ptr, d_ptr + n);
    ...
}
Using a device_ptr doesn't allocate any storage; it just encodes the location of the pointer in the type system.
I am making several cudaMemset calls in order to set my values to 0, as below:
void allocateByte(char **gStoreR, const int byte){
    char **cStoreR = (char **)malloc(N * sizeof(char*));
    for (int i = 0; i < N; i++){
        char *c;
        cudaMalloc((void**)&c, byte*sizeof(char));
        cudaMemset(c, 0, byte);
        cStoreR[i] = c;
    }
    cudaMemcpy(gStoreR, cStoreR, N * sizeof(char *), cudaMemcpyHostToDevice);
}
However, this is proving to be very slow. Is there a memset function on the GPU, since calling it from the CPU takes a lot of time? Also, does cudaMalloc((void**)&c, byte*sizeof(char)) automatically set the bits that c points to to 0?
Every cudaMemset call launches a kernel, so if N is large and byte is small, then you will have a lot of kernel launch overhead slowing down the code. There is no device-side memset, so the solution would be to write a kernel which traverses the allocations and zeroes the storage in a single launch, for example:
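A minimal sketch of such a kernel for the layout in the question (the kernel name and launch configuration are illustrative; the first argument is the device array of N pointers, each pointing to byte bytes):
__global__ void zero_all(char **ptrs, int nbytes)
{
    // one grid row (blockIdx.y) per allocation; threads stride over its bytes
    char *p = ptrs[blockIdx.y];
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < nbytes; j += blockDim.x * gridDim.x)
        p[j] = 0;
}
// launch example: dim3 grid(32, N); zero_all<<<grid, 256>>>(gStoreR, byte);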
As an aside, I would strongly recommend against using this sort of array-of-pointers structure in CUDA. It is a lot slower and much more complex to manage than achieving the same outcome using a single large block of linear memory and indexing into that memory. In your example, it would reduce the code to a single cudaMalloc call and a single cudaMemset call. On the device side, pointer indirection, which is slow, gets replaced by a few integer operations, which are very fast. If your source material on the host is an array of structures, I would recommend using something like the excellent thrust::zip_iterator to get the data into a GPU-friendly form on the device.
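A sketch of that flat layout for the code in the question (d_store is an illustrative name):
char *d_store;
cudaMalloc((void**)&d_store, (size_t)N * byte);   // one allocation covering all N buffers
cudaMemset(d_store, 0, (size_t)N * byte);         // one memset for everything
// byte j of logical buffer i lives at d_store[(size_t)i * byte + j]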