I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems doing it in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).
In short, my kernel writes its results into a two-dimensional array of shorts, where the first dimension is the number of threads, so that each thread writes to a different location.
Here is an example:
CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter
// Invoke kernel with pointer_dev as parameter. Now it should contain some results
CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel
cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction
What am I doing wrong?
EDIT:
Apparently, there are some limitations that don't allow device-allocated memory to be copied back to the host, as stated in this (and many other) threads: link
Any workaround? I'm using CUDA Toolkit v5.0
Here we copy a two-dimensional array of integers from the device to the host.
First, create a host array of device pointers, one per row (here blockSizeX rows), and allocate each row (size elements each) on the device:
CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
    hostDevicePointers[i] = new CUdeviceptr();
    cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
}
Allocate device memory for the array of pointers that point to the row arrays, and copy the array of pointers from the host to the device.
CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
Launch the kernel.
kernelLauncher.call(........, hostDevicePointersArray);
Transfer the output from the device to host.
int hostOutputData[] = new int[size * blockSizeX];
int sum = 0;
for (int i = 0; i < blockSizeX; i++)
{
    // copy each row back into its slice of the flat host array
    cuMemcpyDtoH(Pointer.to(hostOutputData).withByteOffset(i * size * Sizeof.INT), hostDevicePointers[i], size * Sizeof.INT);
}
for (int j = 0; j < size * blockSizeX; j++)
{
    sum = sum + hostOutputData[j];
}
Related
I am trying to use cuBLAS in C++ to rewrite a Python/TensorFlow script which operates on batches of input samples (of shape BxD, B: batch size, D: depth of the flattened 2D matrix).
As a first step, I decided to use the cuBLAS cublasSgemmBatched function to compute MatMul for batches of matrices.
I've found a couple of working code samples, such as the one in the link to the question, but what I want is to allocate one big contiguous device array to store batches of flattened, identically shaped matrices. I DO NOT want to store the batches separately in device memory (as they are in the sample code from the linked Stack Overflow question).
From what I can imagine, I somehow have to get a list of pointers to the starting element of each batch in device memory, something like this:
float **device_batch_ptr;
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
for(int i = 0 ; i < batch_size; i++ ) {
// set device_batch_ptr[i] to starting point of i'th batch on device memory array.
}
Note that cublasSgemmBatched expects a float** in which each float* points to the starting element of one batch of the given input matrix.
Any advice and suggestions will be greatly appreciated.
If your arrays are in contiguous linear memory (device_array) then all you need to do is calculate the offsets using standard pointer arithmetic and store the device addresses in a host array which you then copy to the device. Something like:
float** device_batch_ptr;
float** h_device_batch_ptr = new float*[batch_size];
cudaMalloc((void**)&device_batch_ptr, batch_size*sizeof(float *));
size_t nelementsperarray = N * N;
for(int i = 0 ; i < batch_size; i++ ) {
    // set h_device_batch_ptr[i] to the starting point of the i'th batch within device_array
    h_device_batch_ptr[i] = device_array + i * nelementsperarray;
}
cudaMemcpy(device_batch_ptr, h_device_batch_ptr, batch_size*sizeof(float *),
           cudaMemcpyHostToDevice);
[Obviously never compiled or tested, use at own risk]
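For completeness, the eventual cublasSgemmBatched call then takes those device pointer arrays directly. A rough sketch (not from the original answer; handle, device_A_ptrs, device_B_ptrs and device_C_ptrs are assumed names, each built exactly like device_batch_ptr above for the A, B and C batches):
#include <cublas_v2.h>

// Sketch only: device_A_ptrs / device_B_ptrs / device_C_ptrs are float** arrays
// on the device, built like device_batch_ptr above, each entry pointing at one
// N x N column-major matrix; `handle` comes from cublasCreate().
const float alpha = 1.0f, beta = 0.0f;
cublasSgemmBatched(handle,
                   CUBLAS_OP_N, CUBLAS_OP_N,
                   N, N, N,                            // m, n, k
                   &alpha,
                   (const float**)device_A_ptrs, N,    // lda
                   (const float**)device_B_ptrs, N,    // ldb
                   &beta,
                   device_C_ptrs, N,                   // ldc
                   batch_size);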
int main() {
    char** hMat, *dArr;
    hMat = new char*[10];
    for (int i=0;i<10;i++) {
        hMat[i] = new char[10];
    }
    cudaMalloc((void**)&dArr,100);
    // Copy from dArr to hMat here:
}
I have an array, dArr on the GPU, and I want to copy it into a 2D array hMat on the host, where the first 10 fields in the GPU array are copied to the first row in the host matrix, and the next 10 fields are copied to the second row, and so on.
There are some functions in the documentation, namely cudaMemcpy2D and cudaMemcpy2DFromArray, but I'm not quite sure how they should be used.
Your allocation scheme (an array of pointers, each separately allocated) has the potential to create a discontiguous allocation on the host, and there are no cudaMemcpy operations of any type (including the ones you mention) that can target an arbitrarily discontiguous area.
In a nutshell, then, your approach is troublesome. It can be made to work, but will require a loop to perform the copying -- essentially one cudaMemcpy operation per "row" of your "2D array". If you choose to do that, presumably you don't need help. It's quite straightforward.
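For reference (not part of the original answer), that per-row loop would look something like this with the hMat / dArr declarations from the question, untested:
// one cudaMemcpy per separately-allocated host row; dArr holds 10 rows of 10 chars
for (int i = 0; i < 10; i++)
    cudaMemcpy(hMat[i], dArr + i * 10, 10 * sizeof(char), cudaMemcpyDeviceToHost);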
What I will suggest is that you instead modify your host allocation to create an underlying contiguous allocation. Such a region can be handled by a single, ordinary cudaMemcpy call, but you can still treat it as a "2D array" in host code.
The basic idea is to create a single allocation of the correct overall size, then to create a set of pointers to specific places within the single allocation, where each "row" should start. You then reference into this pointer array using your initial double-pointer.
Something like this:
#include <stdio.h>
typedef char mytype;

int main(){
    const int rows = 10;
    const int cols = 10;
    mytype **hMat = new mytype*[rows];
    hMat[0] = new mytype[rows*cols];
    for (int i = 1; i < rows; i++) hMat[i] = hMat[i-1]+cols;
    // initialize "2D array"
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            hMat[i][j] = 0;
    mytype *dArr;
    cudaMalloc(&dArr, rows*cols*sizeof(mytype));
    // copy to device
    cudaMemcpy(dArr, hMat[0], rows*cols*sizeof(mytype), cudaMemcpyHostToDevice);
    // kernel call
    // copy from device
    cudaMemcpy(hMat[0], dArr, rows*cols*sizeof(mytype), cudaMemcpyDeviceToHost);
    return 0;
}
Here is my code:
//initialize the device_vector
int size = N;
thrust::device_vector<glm::vec3> value(size);
thrust::device_vector<int> key(size);
//get the device pointer of the device_vector
//so than I can write data to the device_vector in CUDA kernel
glm::vec3 * dv_value_ptr = thrust::raw_pointer_cast(&value[0]);
int* dv_key_ptr = thrust::raw_pointer_cast(&key[0]);
//run the kernel function
dim3 threads(16, 16);
dim3 blocks(iDivUp(m_width, threads.x), iDivUp(m_height, threads.y));
//the size of value and key is packed in dev_data
compute_one_i_all_j <<<blocks, threads >>>(dev_data, dv_key_ptr, dv_value_ptr);
//Finally, reduce the vector by its keys.
thrust::pair<thrust::device_vector<int>::iterator,
thrust::device_vector<glm::vec3>::iterator> new_last;
new_last = thrust::reduce_by_key(key.begin(), key.end(), value.begin(), output_key.begin(), output_value.begin());
//get the reduced vector size
int new_size = new_last.first - output_key.begin();
After all this code, I write output_key to a file, and I get many duplicated keys in the file.
So reduce_by_key() doesn't seem to work.
P.S. The CUDA kernel only writes part of key and value, so after the kernel some of the elements in key and value remain unchanged (likely 0).
As stated in the documentation:
For each group of consecutive keys in the range [keys_first, keys_last) that are equal, reduce_by_key copies the first element of the group to the keys_output. The corresponding values in the range are reduced using the plus and the result copied to values_output.
Every group of equal consecutive keys will be reduced.
So first of all you must rearrange all your keys and values so that all elements with equal keys are adjacent. The simplest way to do that is sort_by_key:
thrust::sort_by_key(key.begin(), key.end(), value.begin());
new_last = thrust::reduce_by_key(key.begin(), key.end(), value.begin(), output_key.begin(), output_value.begin());
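As a self-contained illustration (a toy example, not the asker's code: plain float values stand in for glm::vec3 and the keys are made up), the following shows that reduce_by_key only merges consecutive equal keys, and that sorting first is what yields one output element per distinct key:
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    int   keys_h[] = {2, 0, 2, 1, 0, 1};
    float vals_h[] = {1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
    thrust::device_vector<int>   key(keys_h, keys_h + 6);
    thrust::device_vector<float> value(vals_h, vals_h + 6);
    thrust::device_vector<int>   output_key(6);
    thrust::device_vector<float> output_value(6);

    // group equal keys together first ...
    thrust::sort_by_key(key.begin(), key.end(), value.begin());
    // ... then each run of equal keys reduces to a single output element
    auto new_last = thrust::reduce_by_key(key.begin(), key.end(), value.begin(),
                                          output_key.begin(), output_value.begin());
    int new_size = new_last.first - output_key.begin();
    printf("distinct keys: %d\n", new_size);   // prints 3; without the sort it would be 6
    return 0;
}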
I'm implementing my kernel in a multithreaded "host" program, where every host thread calls the kernel.
I've got a problem with the use of constant memory. Some parameters are placed in constant memory, but they are different for every thread.
I built a sample where the problem occurs, too.
This is the kernel:
__global__ void Kernel( int *aiOutput, int Length )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    int iValue = 0;
    // bound check
    if( id < Length )
    {
        if( id % 3 == 0 )
            iValue = c_iaCoeff[2];
        else if( id % 2 == 0 )
            iValue = c_iaCoeff[1];
        else
            iValue = c_iaCoeff[0];
        aiOutput[id] = iValue;
    }
    __syncthreads();
}
And a pthread calls this function:
void* WrapperCopy( void* params )
{
    // choose cuda device to perform on
    CUDA_CHECK_RETURN( cudaSetDevice( 0 ) );
    // cast of params
    SParams *_params = (SParams*)params;
    // copy coefficients to constant memory
    CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff, _params->h_piCoeff, 3*sizeof(int) ) );
    // loop kernel
    for( int i=0; i<100; i++ )
    {
        // perform kernel
        Kernel<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength );
    }
    // copy data back from gpu
    CUDA_CHECK_RETURN( cudaMemcpy(
        _params->h_piArray, _params->d_piArray, BLOCKSIZE*BLOCKCOUNT*sizeof(int), cudaMemcpyDeviceToHost ) );
    return NULL;
}
Constant memory is declared like this:
__constant__ int c_iaCoeff[ 3 ];
Every host thread has different values in h_piCoeff and copies them to constant memory.
Now I get the same results for every pthread call, because all of them get the same values in c_iaCoeff.
I think this is a consequence of how constant memory works and how it has to be declared in a context - in the sample there is only one c_iaCoeff declared for all calling pthreads, and the kernels launched by the pthreads will get the values of the last cudaMemcpyToSymbol. Is that right?
Now I've tried to change my constant memory into a two-dimensional array.
The second dimension holds the values as before, but the first is the index of the calling pthread.
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];
In the kernels it will be used this way:
iValue = c_iaCoeff2[iTId][2];
But I don't know if it's possible to use constant memory in this way, is it?
I also get an error when I try to copy data to constant memory:
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff[_params->iTId], _params->h_piCoeff, 3*sizeof(int) ) );
In general, is it possible to use constant memory as a two-dimensional array, and if yes, where is my mistake?
Yes, you should be able to use constant memory in the way you want to, but the cudaMemcpyToSymbol copy operation you are using is incorrect. The first argument to the call is a symbol, and the API does a lookup in the runtime symbol table to get the address of the constant memory symbol you request. So an address can't be passed to the call (although your code is actually passing an initialised host value to the call, why that is I will leave as an exercise to the reader).
What you may have missed is the optional fourth argument in the call, which is an offset into the memory pointed to by the symbol you request. So you should be able to do something like:
cudaMemcpyToSymbol( c_iaCoeff, // symbol to lookup
_params->h_piCoeff, // source location
3*sizeof(int), // number of bytes to copy
(3*_params->iTId)*sizeof(int) // Offset in bytes
);
[standard disclaimer: written in browser, untested. use at own risk]
The last argument is the offset in bytes from the start of the symbol. Your 2D array will be laid out in row major order, so you need to use the pitch of the rows multiplied by the row index as an offset for each copy operation.
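Putting the pieces together for the two-dimensional case, a sketch (using the THREADS constant and the SParams fields from the question; like the snippet above, written in the browser and untested) could look like:
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];

// in WrapperCopy(), copy this pthread's coefficients into its own row:
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff2,                          // symbol to look up
                                       _params->h_piCoeff,                  // host source
                                       3 * sizeof(int),                     // bytes to copy
                                       _params->iTId * 3 * sizeof(int) ) ); // byte offset = row pitch * row index

// in the kernel, index the row belonging to this launch, e.g.
//   iValue = c_iaCoeff2[ iTId ][ 2 ];
// where iTId has to be passed to the kernel as an ordinary argument.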
I have to extract sections of an array and write the chunk into another array.
For instance, I have a 2D array (in 1D format) like A[32 X 32]; there is another array B[64 X 64], and I want to copy an 8X8 chunk of B, starting from (0,8) of B, and place it at (8,8) of A.
At present, I would probably use something like the kernel below for getting a portion of data when offsets are passed. A similar one could also be used to set chunks into a larger array.
__global__ void get_chunk (double *data, double *sub, int xstart, int ystart, int rows, int cols, int subset)
{
    int i, j;
    i = blockIdx.x * blockDim.x + threadIdx.x;
    for (j = 0; j < subset; j++)
        sub[i*subset+j] = data[i*cols + (xstart*cols + ystart)+j];
}
I think the same could be done using a variant of cudaMemcpy* (perhaps cudaMemcpyArray(...)), but I am not sure how to do it. I need some code samples, or some directions on how it could be done.
P.S. I asked the exact same question on the NVIDIA forums and got no reply, so I'm trying here. http://forums.nvidia.com/index.php?showtopic=223386
Thank you.
There is no need for a kernel if you just want to copy data from one array to another on the device.
If you have your device pointers with your source data and your allocated target pointer in host code:
Pseudocode:
// source and target device pointers
double *source_d, *target_d;
// get offset source pointer (pointer arithmetic is in elements, not bytes)
double *offSource_d = source_d + offset;
// copy n elements from the offset source data to the target device pointer
cudaMemcpy(target_d, offSource_d, n * sizeof(double), cudaMemcpyDeviceToDevice);
It was not clear whether you just want to copy a range of a 1D array, or whether you want to copy a range of each row of a 2D array into the target rows of another 2D array.
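For the 2D case from the question (an 8x8 chunk of B placed into A), cudaMemcpy2D can do the strided copy in one call. A sketch, assuming A and B are device pointers to row-major double arrays of 32x32 and 64x64 respectively, and reading the (0,8) / (8,8) coordinates as (row, column):
size_t dpitch = 32 * sizeof(double);   // byte pitch of one row of A
size_t spitch = 64 * sizeof(double);   // byte pitch of one row of B
cudaMemcpy2D(A + 8 * 32 + 8, dpitch,   // destination: &A[8][8]
             B + 0 * 64 + 8, spitch,   // source:      &B[0][8]
             8 * sizeof(double),       // width of the copied region, in bytes
             8,                        // number of rows
             cudaMemcpyDeviceToDevice);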