thrust::reduce_by_key() does not seem to work - CUDA

Here is my code:
//initialize the device_vector
int size = N;
thrust::device_vector<glm::vec3> value(size);
thrust::device_vector<int> key(size);
//get the device pointer of the device_vector
//so that I can write data to the device_vector in a CUDA kernel
glm::vec3 * dv_value_ptr = thrust::raw_pointer_cast(&value[0]);
int* dv_key_ptr = thrust::raw_pointer_cast(&key[0]);
//run the kernel function
dim3 threads(16, 16);
dim3 blocks(iDivUp(m_width, threads.x), iDivUp(m_height, threads.y));
//the size of value and key is packed in dev_data
compute_one_i_all_j <<<blocks, threads >>>(dev_data, dv_key_ptr, dv_value_ptr);
//Finally, reduce the vector by its keys.
thrust::pair<thrust::device_vector<int>::iterator,
thrust::device_vector<glm::vec3>::iterator> new_last;
new_last = thrust::reduce_by_key(key.begin(), key.end(), value.begin(), output_key.begin(), output_value.begin());
//get the reduced vector size
int new_size = new_last.first - output_key.begin();
After all this code, I write output_key to a file and I get many duplicated keys in it.
So reduce_by_key() does not seem to work.
P.S. The CUDA kernel only writes part of the keys and values, so after the kernel some of the elements in key and value remain unchanged (likely 0).

As stated in documentation:
For each group of consecutive keys in the range [keys_first, keys_last) that are equal, reduce_by_key copies the first element of the group to the keys_output. The corresponding values in the range are reduced using the plus and the result copied to values_output.
Every group of equal consecutive keys will be reduced.
So first of all you must rearrange your keys and values so that all elements with equal keys are adjacent. The simplest way is to use sort_by_key:
thrust::sort_by_key(key.begin(), key.end(), value.begin());
new_last = thrust::reduce_by_key(key.begin(), key.end(), value.begin(), output_key.begin(), output_value.begin());
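For reference, here is a minimal self-contained sketch of this sort-then-reduce pattern. It uses plain int values and made-up inputs; glm::vec3 should work the same way as the value type as long as its operator+ is usable in device code.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    // made-up input: unsorted keys with duplicates
    int h_keys[]   = { 3, 1, 3, 2, 1, 2 };
    int h_values[] = { 10, 20, 30, 40, 50, 60 };
    thrust::device_vector<int> key(h_keys, h_keys + 6);
    thrust::device_vector<int> value(h_values, h_values + 6);
    thrust::device_vector<int> output_key(6);
    thrust::device_vector<int> output_value(6);

    // group equal keys together, keeping each value paired with its key
    thrust::sort_by_key(key.begin(), key.end(), value.begin());

    // equal keys are now consecutive, so one entry per distinct key is produced
    thrust::pair<thrust::device_vector<int>::iterator,
                 thrust::device_vector<int>::iterator> new_last =
        thrust::reduce_by_key(key.begin(), key.end(), value.begin(),
                              output_key.begin(), output_value.begin());
    int new_size = new_last.first - output_key.begin();

    for (int i = 0; i < new_size; i++)
        printf("key %d -> sum %d\n", (int)output_key[i], (int)output_value[i]);
    return 0;
}
// expected output: key 1 -> sum 70, key 2 -> sum 100, key 3 -> sum 40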

Related

(cudaBindTexture2D) How to bind a pitched-array from the middle

I am trying to bind only part of a pitched array, starting from the middle (not from the beginning of the array), like the following.
/* 1. allocate */
cudaMallocPitch((void**)&d_texinput, &FloatPitch, cols*sizeof(float), rows);
cudaMallocPitch((void**)&d_output, &FloatPitch, cols*sizeof(float), rows);
/* 2. set row-length of target region (i.e., dividing rows 10 times) */
int row_div_times = 10;
int part_rows = rows / row_div_times;
int part_offset = part_rows*FloatPitch/sizeof(float);
dim3 threads(16,16);
dim3 Part_Blocks((cols + threads.x - 1) / threads.x, (part_rows + threads.y - 1) / threads.y);
/* 3. process divided rows iteratively */
for (int i = 0; i < row_div_times; i++)
{
size_t offsetsize = i*part_offset;
/*computing values of "d_tex_input"*/
calibration << <Part_Blocks, threads, 0, stream[i] >> >
(d_texinput + i*part_offset );
/*
//###(QUESTION point!) I want to bind the device memory "d_texinput" to texture "tex_mem" only partly like below.
cudaBindTexture2D(0, tex_mem, &d_texinput[i*part_offset], channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code a;
,,, or something like,,,
cudaBindTexture2D(&offsetsize, tex_mem, &d_texinput, channelDesc_flt, cols, Part_rows, FloatPitch); //tentative code b;
*/
//final computation with texture
final_computationwithtexture << <Part_Blocks, threads, 0, stream[i] >> >
( d_output + i*part_offset );
cudaUnbindTexture(tex_mem);
}
Could you please advise how to bind only the target region of the device memory array, by revising the code above at the QUESTION point?
I tried to understand the first argument of cudaBindTexture2D as an "offset", but according to the documentation it is not a value, it is an address.
I still could not understand the documentation.
I hope I can understand what it means by seeing the proper way to pass it to cudaBindTexture2D.
The offset parameter is not an input, it is an output. That's why it is a pointer. The function will set the offset in bytes. If you want to bind in the middle of an allocation, you set the devPtr argument (third) appropriately and then the function will give you the offset required for texture accesses.
Here is how to understand this: Textures can only be bound with a certain alignment. Memory allocations are always properly aligned. Therefore it is not an issue in most cases. However, if you provide an arbitrary memory address, CUDA has to round down to the alignment and you have to apply the proper offset later on.
Let's say you bind &float[66], the proper alignment might be &float[64], so CUDA starts its texture at that offset and you have to add an offset of 8 bytes for each access to get the desired result. I'm picking random numbers here, I don't know the alignment requirements.
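Applied to the code above, a hedged sketch (same variable names as the question, untested) would look like this:

// receive the alignment offset instead of passing 0
size_t offset_bytes = 0;
cudaBindTexture2D(&offset_bytes,                  // out: offset in bytes
                  tex_mem,                        // texture reference
                  d_texinput + i * part_offset,   // start of this partition
                  channelDesc_flt,
                  cols, part_rows,
                  FloatPitch);

// Because d_texinput comes from cudaMallocPitch and part_offset is a whole
// number of pitched rows, the partition start should already be properly
// aligned and offset_bytes will normally be 0. If it is not, pass offset_bytes
// to the kernel and adjust the tex2D() fetch coordinates accordingly.
final_computationwithtexture<<<Part_Blocks, threads, 0, stream[i]>>>(
    d_output + i * part_offset /*, offset_bytes if needed */);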

Does this Cuda scan kernel only work within a single block, or across multiple blocks?

I am doing a homework assignment and have been given a CUDA kernel that performs a primitive scan operation. From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?
//Hillis & Steele: Kernel Function
//Altered by Jake Heath, October 8, 2013 (c)
// - KD: Changed input array to be unsigned ints instead of ints
__global__ void scanKernel(unsigned int *in_data, unsigned int *out_data, size_t numElements)
{
//we are creating an extra space for every numElement so the size of the array needs to be 2*numElements
//cuda does not like dynamic arrays in shared memory so it might be necessary to explicitly state
//the size of this memory allocation
__shared__ int temp[1024 * 2];
//instantiate variables
int id = threadIdx.x;
int pout = 0, pin = 1;
// // load input into shared memory.
// // Exclusive scan: shift right by one and set first element to 0
temp[id] = (id > 0) ? in_data[id - 1] : 0;
__syncthreads();
//for each thread, loop through each of the steps
//each step, move the next resultant addition to the thread's
//corresponding space to be manipulated in the next iteration
for (int offset = 1; offset < numElements; offset <<= 1)
{
//these switch so that data can move back and forth between the extra spaces
pout = 1 - pout;
pin = 1 - pout;
//IF: the number needs to be added to something, make sure to add those contents with the contents of
//the element offset number of elements away, then move it to its corresponding space
//ELSE: the number only needs to be dropped down, simply move those contents to its corresponding space
if (id >= offset)
{
//this element needs to be added to something; do that and copy it over
temp[pout * numElements + id] = temp[pin * numElements + id] + temp[pin * numElements + id - offset];
}
else
{
//this element just drops down, so copy it over
temp[pout * numElements + id] = temp[pin * numElements + id];
}
__syncthreads();
}
// write output
out_data[id] = temp[pout * numElements + id];
}
I would like to modify this kernel to work across multiple blocks. I want it to be as simple as changing the int id... to int id = threadIdx.x + blockDim.x * blockIdx.x, but shared memory is only visible within a block, so the scan kernels in different blocks cannot share the proper information.
From what I can tell this kernel will only do a scan of the data if a single block is used (because of the int id = threadIdx.x). Is this true?
Not exactly. This kernel will work regardless of how many blocks you launch, but all blocks will fetch the same input and compute the same output, because of how id is calculated:
int id = threadIdx.x;
This id is independent of blockIdx, and is therefore identical across blocks, no matter how many there are.
If I were to make a multi-block version of this scan without changing too much code, I would introduce an auxiliary array to store the per-block sums. Then, run a similar scan on that array, calculating per-block increments. Finally, run a last kernel to add those per-block increments to the block elements. If memory serves there is a similar kernel in the CUDA SDK samples.
Since Kepler the above code could be rewritten much more efficiently, notably through the use of __shfl. Additionally, changing the algorithm to work per-warp rather than per-block would get rid of the __syncthreads and may improve performance. A combination of both these improvements would allow you to get rid of shared memory and work only with registers for maximal performance.
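To make the multi-block scheme described above concrete, here is a minimal sketch (hypothetical kernel names, not the SDK sample). It assumes numElements is a multiple of blockDim.x and performs an exclusive scan.

// Phase 1: each block scans its own chunk and records its block total.
__global__ void scanBlocks(const unsigned int *in_data, unsigned int *out_data,
                           unsigned int *block_sums)
{
    extern __shared__ unsigned int temp[];   // 2 * blockDim.x elements
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int pout = 0, pin = 1;

    // exclusive scan within the block: shift right by one, first element is 0
    temp[tid] = (tid > 0) ? in_data[gid - 1] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset <<= 1)
    {
        pout = 1 - pout;
        pin = 1 - pout;
        if (tid >= offset)
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid]
                                          + temp[pin * blockDim.x + tid - offset];
        else
            temp[pout * blockDim.x + tid] = temp[pin * blockDim.x + tid];
        __syncthreads();
    }

    out_data[gid] = temp[pout * blockDim.x + tid];

    // last thread stores this block's total (exclusive prefix + last input)
    if (tid == blockDim.x - 1)
        block_sums[blockIdx.x] = temp[pout * blockDim.x + tid] + in_data[gid];
}

// Phase 3: add the scanned block sums (per-block increments) to every element.
__global__ void addBlockIncrements(unsigned int *out_data,
                                   const unsigned int *block_increments)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out_data[gid] += block_increments[blockIdx.x];
}

// Host-side orchestration (numBlocks = numElements / blockSize):
//   scanBlocks<<<numBlocks, blockSize, 2 * blockSize * sizeof(unsigned int)>>>(in_d, out_d, sums_d);
//   // Phase 2 (not shown): exclusive-scan sums_d itself, e.g. with one block of scanBlocks
//   addBlockIncrements<<<numBlocks, blockSize>>>(out_d, sums_d);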

JCuda: copy multidimensional array from device to host

I've been working with JCuda for some months now and I can't copy a multidimensional array from device memory to host memory. The funny thing is that I have no problems in doing so in the opposite direction (I can invoke my kernel with multidimensional arrays and everything works with the correct values).
In a few words, I put the results of my kernel in a bi-dimensional array of shorts, where the first dimension of such array is the number of threads, so that each one can write in different locations.
Here an example:
CUdeviceptr pointer_dev = new CUdeviceptr();
cuMemAlloc(pointer_dev, Sizeof.POINTER); // in this case, as an example, it's an array with one element (one thread), but it doesn't matter
// Invoke kernel with pointer_dev as parameter. Now it should contain some results
CUdeviceptr[] arrayPtr = new CUdeviceptr[1]; // It will point to the result
arrayPtr[0] = new CUdeviceptr();
short[] resultArray = new short[3]; // an array of 3 shorts was allocated in the kernel
cuMemAlloc(arrayPtr[0], 3 * Sizeof.SHORT);
cuMemcpyDtoH(Pointer.to(arrayPtr), pointer_dev, Sizeof.POINTER); // It seems, using the debugger, that the value of arrayPtr[0] isn't changed here!
cuMemcpyDtoH(Pointer.to(resultArray), arrayPtr[0], 3 * Sizeof.SHORT); // Not the expected values in resultArray, probably because of the previous instruction
What am I doing wrong?
EDIT:
Apparently, there are some limitations that don't allow device-allocated memory to be copied back to the host, as stated in this (and many more) threads: link
Any workaround? I'm using CUDA Toolkit v5.0
Here we are copying a two-dimensional array of integers from the device to the host.
First, create a host-side array of device pointers, one for each row (here blockSizeX rows), and allocate device memory for every row.
CUdeviceptr[] hostDevicePointers = new CUdeviceptr[blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
hostDevicePointers[i] = new CUdeviceptr();
cuMemAlloc(hostDevicePointers[i], size * Sizeof.INT);
}
Allocate device memory for the array of pointers that point to the other array, and copy array pointers from the host to the device.
CUdeviceptr hostDevicePointersArray = new CUdeviceptr();
cuMemAlloc(hostDevicePointersArray, blockSizeX * Sizeof.POINTER);
cuMemcpyHtoD(hostDevicePointersArray, Pointer.to(hostDevicePointers), blockSizeX * Sizeof.POINTER);
Launch the kernel.
kernelLauncher.call(........, hostDevicePointersArray);
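The kernel itself is not shown in the question; a hedged sketch of what a kernel receiving hostDevicePointersArray (as an int**) might look like:

// hypothetical kernel: each block fills its own row through the pointer array
extern "C" __global__ void fillRows(int **rows, int size)
{
    int row = blockIdx.x;                      // one row per block
    for (int j = threadIdx.x; j < size; j += blockDim.x)
        rows[row][j] = row * size + j;         // example values
}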
Transfer the output from the device to host.
int hostOutputData[] = new int[numberofelementsInArray * blockSizeX];
for (int i = 0; i < blockSizeX; i++)
{
cuMemcpyDtoH(Pointer.to(hostOutputData).withByteOffset(i * numberofelementsInArray * Sizeof.INT),
hostDevicePointers[i], numberofelementsInArray * Sizeof.INT);
}
for (int j = 0; j < size; j++)
{
sum = sum + hostOutputData[j];
}

CUDA: use constant memory as a two-dimensional array

I'm implementing my kernel in a multithreaded "host" program, where every host thread is calling the kernel.
I've got a problem with the use of constant memory. Some parameters are placed in constant memory, but they are different for every thread.
I built a sample where the problem occurs, too.
This is the kernel
__global__ void Kernel( int *aiOutput, int Length )
{
int id = threadIdx.x + blockIdx.x * blockDim.x;
int iValue = 0;
// bound check
if( id < Length )
{
if( id % 3 == 0 )
iValue = c_iaCoeff[2];
else if( id % 2 == 0 )
iValue = c_iaCoeff[1];
else
iValue = c_iaCoeff[0];
aiOutput[id] = iValue;
}
__syncthreads();
}
And a pthread is calling this function.
void* WrapperCopy( void* params )
{
// choose cuda device to perform on
CUDA_CHECK_RETURN( cudaSetDevice( 0 ) );
// cast of params
SParams *_params = (SParams*)params;
// copy coefficients to constant memory
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff, _params->h_piCoeff, 3*sizeof(int) ) );
// loop kernel
for( int i=0; i<100; i++ )
{
// perform kernel
Kernel<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength );
}
// copy data back from gpu
CUDA_CHECK_RETURN( cudaMemcpy(
_params->h_piArray, _params->d_piArray, BLOCKSIZE*BLOCKCOUNT*sizeof(int), cudaMemcpyDeviceToHost ) );
return NULL;
}
Constant memory is declared as this.
__constant__ int c_iaCoeff[ 3 ];
Every host thread has different values in h_piCoeff and copies them to constant memory.
Now I get the same results for every pthread call, because all of them got the same values in c_iaCoeff.
I think this is a consequence of how constant memory works and how it is scoped to a context - in the sample there is only one c_iaCoeff for all calling pthreads, and the kernels launched by the pthreads will get the values of the last cudaMemcpyToSymbol. Is that right?
Now I've tried to change my constant memory in a two-dimensional array.
The second dimension will be the values as before, but the first will be the index of the used pthread.
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];
In the kernels the use of it will be in this way.
iValue = c_iaCoeff2[iTId][2];
But I don't know if it's possible to use constant memory in this way, is it?
Also I get an error when I try to copy data to the constant memory like this:
CUDA_CHECK_RETURN( cudaMemcpyToSymbol( c_iaCoeff[_params->iTId], _params->h_piCoeff, 3*sizeof(int) ) );
In general, is it possible to use constant memory as a two-dimensional array and, if yes, where is my mistake?
Yes, you should be able to use constant memory in the way you want to, but the cudaMemcpyToSymbol copy operation you are using is incorrect. The first argument to the call is a symbol, and the API does a lookup in the runtime symbol table to get the address of the constant memory symbol you request. So an address can't be passed to the call (although your code is actually passing an initialised host value to the call, why that is I will leave as an exercise to the reader).
What you may have missed is the optional fourth argument in the call, which is an offset into the memory pointed to by the symbol you request. So you should be able to do something like:
cudaMemcpyToSymbol( c_iaCoeff2, // symbol to look up
_params->h_piCoeff, // source location
3*sizeof(int), // number of bytes to copy
(3*_params->iTId)*sizeof(int) // Offset in bytes
);
[standard disclaimer: written in browser, untested. use at own risk]
The last argument is the offset in bytes from the start of the symbol. Your 2D array will be laid out in row major order, so you need to use the pitch of the rows multiplied by the row index as an offset for each copy operation.
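Putting it together, a minimal sketch of the two-dimensional variant, assuming the pthread index iTId is simply passed to the kernel as an argument (it cannot be derived from threadIdx/blockIdx); Kernel2 is a hypothetical name:

#define THREADS 4   // number of host threads, made up for this sketch
__constant__ int c_iaCoeff2[ THREADS ][ 3 ];

__global__ void Kernel2( int *aiOutput, int Length, int iTId )
{
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    if( id < Length )
    {
        int iValue;
        if( id % 3 == 0 )
            iValue = c_iaCoeff2[iTId][2];
        else if( id % 2 == 0 )
            iValue = c_iaCoeff2[iTId][1];
        else
            iValue = c_iaCoeff2[iTId][0];
        aiOutput[id] = iValue;
    }
}

// per pthread on the host: copy this thread's coefficients into its own row,
// then launch with that row index
// cudaMemcpyToSymbol( c_iaCoeff2, _params->h_piCoeff, 3*sizeof(int),
//                     (3*_params->iTId)*sizeof(int) );
// Kernel2<<< BLOCKCOUNT, BLOCKSIZE >>>( _params->d_piArray, _params->iLength, _params->iTId );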

Extract and Set portions of an array in CUDA

I have to extract sections of an array and set the chunk to another array.
For instance, I have a 2d array (in 1d format) like A[32 X 32]; there is another array B[64 X 64] and I would want to copy an 8X8 chunk of B, starting from (0,8) of B and place it in (8,8) of A.
At present, I would probably use something like the kernel below for getting a portion of data when offsets are passed. A similar one could also be used for setting chunks into a larger array.
__global__ void get_chunk (double *data, double *sub, int xstart, int ystart, int rows, int cols, int subset)
{
int i,j;
i = blockIdx.x * blockDim.x + threadIdx.x;
for (j = 0; j < subset; j++)
sub[i*subset+j] = data[i*cols + (xstart*cols + ystart)+j];
}
I think the same could be done using a variant of cudaMemcpy (perhaps one of the cudaMemcpy*Array functions), but I am not sure how to do it. I need some code samples, or some directions on how it could be done.
PS I had the exact same question in nvidia forums, got no reply so trying here. http://forums.nvidia.com/index.php?showtopic=223386
Thank you.
There is no need for a kernel if you just want to copy data from one array to another on the device.
If you have your device pointers with your source data and your allocated target pointer in host code:
Pseudocode:
//source and target device pointers
double *source_d, *target_d;
//offset pointers into the source and target (pointer arithmetic counts elements, not bytes)
double *offSource_d = source_d + src_offset;
double *offTarget_d = target_d + dst_offset;
//copy n contiguous elements from the offset source to the offset target
cudaMemcpy(offTarget_d, offSource_d, n * sizeof(double), cudaMemcpyDeviceToDevice);
It was not clear if you just want to copy a range of a 1D array, or if you want to copy a range of each row of a 2D array into the target rows of another 2D array.
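If it is the 2D case from the question (an 8x8 chunk of B[64x64] starting at (0,8), copied into A[32x32] at (8,8)), cudaMemcpy2D can do it in a single call. A hedged sketch, assuming both arrays are plain row-major device allocations of doubles:

#include <cuda_runtime.h>

// A_d: 32x32 destination, B_d: 64x64 source, both already on the device
void copy_chunk(double *A_d, const double *B_d)
{
    const int A_cols = 32, B_cols = 64;
    const int chunk_rows = 8, chunk_cols = 8;

    const double *src = B_d + 0 * B_cols + 8;   // (row 0, col 8) of B
    double       *dst = A_d + 8 * A_cols + 8;   // (row 8, col 8) of A

    // pitch arguments are the full row widths in bytes,
    // width is the chunk row width in bytes, height is the number of rows
    cudaMemcpy2D(dst, A_cols * sizeof(double),
                 src, B_cols * sizeof(double),
                 chunk_cols * sizeof(double), chunk_rows,
                 cudaMemcpyDeviceToDevice);
}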