Questions about thrust remove_if [duplicate] - cuda

This question already has answers here:
How to know how many elements are in the result of thrust::partition_copy
(1 answer)
Segmentation error when using thrust::sort in CUDA
(2 answers)
Closed 1 year ago.
I have two important questions about how to use thrust's remove_if.
How can we know how many elements were removed, or the size of the new array once elements are removed? For example, if I have a 6-element array
int thearray[6] = {1, 0, 2, 0, 1, 3};
and I remove the 0's
int *new_end = thrust::remove_if(thearray, thearray + 6, is_zero());
I should get {1, 2, 1, 3}, but how do I know the result has 4 elements?
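For reference, a minimal sketch of the size computation (my addition, assuming an is_zero functor like the one implied by the question): thrust::remove_if returns an iterator to the new logical end of the range, so the new size is the difference between that iterator and the beginning.
#include <thrust/remove.h>

struct is_zero
{
    __host__ __device__
    bool operator()(int x) const { return x == 0; }
};

int main()
{
    int thearray[6] = {1, 0, 2, 0, 1, 3};
    // remove_if compacts the surviving elements to the front and
    // returns an iterator just past the last surviving element
    int *new_end = thrust::remove_if(thearray, thearray + 6, is_zero());
    int new_size = new_end - thearray; // 4 for this input
    return 0;
}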
I tried to use
thrust::remove_if(d_data, d_data+6, is_zero<int>());
where d_data is an int * with memory allocated in device memory (with cudaMalloc), but it gave a segmentation fault. This usually happens when trying to access device memory from the host, so that makes me think. I am trying to keep every operation in device memory (only downloading to the host at the end). Does thrust::remove_if first download the data to the host?
EDIT:
In this
int * d_data;
cudaMalloc((void**)&d_data, 6 * sizeof(int));
cudaMemcpy(d_data, h_data, 6 * sizeof(int), cudaMemcpyHostToDevice);
thrust::device_ptr<int> dev_ptr(d_data);
thrust::remove_if(dev_ptr, dev_ptr+6, is_zero<int>());
//thrust::remove_if(d_data, d_data+6, is_zero<int>()); //--> segmentation fault
you can see that I am wrapping d_data with a device_ptr. If I do that, it works without problems, but if I try to use remove_if on d_data itself, it crashes.
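For what it's worth, the likely explanation (my addition, not from the original thread) is that Thrust dispatches on the iterator type: a raw pointer is treated as a host iterator, so the algorithm runs on the host and dereferences device memory, hence the segfault. Wrapping it in a device_ptr tells Thrust to run on the device. A minimal sketch combining both questions, reusing names from the post:
#include <thrust/device_ptr.h>
#include <thrust/remove.h>
#include <cuda_runtime.h>

template <typename T>
struct is_zero
{
    __host__ __device__
    bool operator()(T x) const { return x == T(0); }
};

int main()
{
    int h_data[6] = {1, 0, 2, 0, 1, 3};
    int *d_data;
    cudaMalloc((void **)&d_data, 6 * sizeof(int));
    cudaMemcpy(d_data, h_data, 6 * sizeof(int), cudaMemcpyHostToDevice);

    // wrapping the raw pointer gives Thrust a device iterator,
    // so remove_if runs on the GPU and never dereferences on the host
    thrust::device_ptr<int> dev_ptr(d_data);
    thrust::device_ptr<int> new_end =
        thrust::remove_if(dev_ptr, dev_ptr + 6, is_zero<int>());

    int new_size = new_end - dev_ptr; // 4 for this input

    cudaFree(d_data);
    return 0;
}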

Related

Memory access in CUDA kernel functions (simple example)

I am a novice in GPU parallel computing and I'm trying to learn CUDA by looking at some examples in NVIDIA's "CUDA by Example" book.
I do not understand properly how threads access and change variables in such a simple example (dot product of two vectors).
The kernel function is defined as follows
__global__ void dot( float *a, float *b, float *c ) {
    __shared__ float cache[threadsPerBlock];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    float temp = 0;
    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    // set the cache values
    cache[cacheIndex] = temp;
I do not understand three things.
1. What is the sequence of execution of this function? Is there any ordering between threads? For example, do the threads from the first block run first, then the threads from the second block come into play, and so on? (This is connected to the question of why it is necessary to divide threads into blocks.)
2. Do all threads have their own copy of the "temp" variable or not (and if not, why is there no race condition)?
3. How does this work? What exactly goes into the variable temp in the while loop? The array cache stores the values of temp for different threads. How does the summation go on? It seems that temp already contains all the sums necessary for the dot product, because the variable tid goes from 0 to N-1 in the while loop.
Although the code you provide is incomplete, here are some clarifications about what you are asking:
The kernel code will be executed by all the threads in all the blocks. The way to "split the job" is to make each thread work on only one or a few elements.
For instance, if you have to process 100 integers with a specific algorithm, you probably want 100 threads to process 1 element each.
In CUDA the number of blocks and threads is defined at kernel launch on the host side:
myKernel<<<grid, threads>>>(...);
where grid and threads are dim3 values, which define the sizes in three dimensions.
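As an illustration (a sketch with made-up names, not from the original answer), the usual pattern is to round the grid size up and guard inside the kernel, so each thread handles at most one element:
__global__ void addOne(int *data, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)      // guard: threads past the end do nothing
        data[i] += 1;
}

// host side: round up so at least n threads are launched
int n = 100;
dim3 threads(128);
dim3 grid((n + threads.x - 1) / threads.x);
addOne<<<grid, threads>>>(d_data, n); // d_data: assumed device buffer of n ints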
There is no specific order in the execution of threads and blocks. As you can read here:
http://mc.stanford.edu/cgi-bin/images/3/34/Darve_cme343_cuda_3.pdf
On page 6 : "No specific order in which blocks are dispatched and executed".
Since the temp variable is declared in the kernel with no memory-space qualifier, it is not shared between threads; each thread has its own copy, stored in a register.
This is equivalent to what is done on the CPU side. So yes, each thread has its own "temp" variable.
The temp variable is updated in each iteration of the loop, using accesses to the device arrays.
Again, this is equivalent to what is done on the CPU side.
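To address the third question directly: the posted snippet stops just before the step where the per-thread partial sums are combined. In the book's dot-product example the kernel continues with a shared-memory tree reduction, roughly like this (a sketch from memory, not the asker's exact code):
    // after cache[cacheIndex] = temp; every thread holds one partial sum
    __syncthreads(); // wait until all partial sums are written

    // tree reduction: threadsPerBlock must be a power of two
    int i = blockDim.x / 2;
    while (i != 0) {
        if (cacheIndex < i)
            cache[cacheIndex] += cache[cacheIndex + i];
        __syncthreads();
        i /= 2;
    }

    if (cacheIndex == 0)
        c[blockIdx.x] = cache[0]; // one partial result per block; the
                                  // final sum over blocks is done on the host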
I think you should check whether you are comfortable enough with C/C++ programming on the CPU side before going further into GPU programming. Meaning no offense, it seems you have gaps in several fundamental topics.
Since CUDA lets you drive your GPU with C code, the difficulty is not in the syntax, but in the specifics of the hardware.

How to measure the time of the device functions when they are called in kernel function [duplicate]

This question already has answers here:
How to measure the inner kernel time in NVIDIA CUDA?
(2 answers)
Closed 5 years ago.
Hello, in my kernel function I use 3 device functions, and I want to calculate the time taken by each device function. Is there any way to time the device functions inside a kernel? Please kindly let me know. Thank you.
Quoting the CUDA C Programming Guide:
clock_t clock();
long long int clock64();
when executed in device code, returns the value of a per-multiprocessor counter that is
incremented every clock cycle. Sampling this counter at the beginning and at the end of
a kernel, taking the difference of the two samples, and recording the result per thread
provides a measure for each thread of the number of clock cycles taken by the device to
completely execute the thread, but not of the number of clock cycles the device actually
spent executing thread instructions. The former number is greater than the latter since
threads are time sliced.
This timing works pretty much like Matlab's tic and toc. There is a clock sample in the CUDA SDK. Basically, it works like this:
__global__ void max(..., int* time)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    clock_t start = clock();
    //device function call
    clock_t stop = clock();
    ...
    time[i] = (int)(stop - start);
}
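Applied to the question's setup of three device functions, the same pattern might look like this (func1/func2/func3 and the output layout are my assumptions; clock64() avoids the counter wrapping on long calls and requires compute capability 2.0 or later):
// hypothetical stand-ins for the asker's three device functions
__device__ void func1(float *d) { /* ... */ }
__device__ void func2(float *d) { /* ... */ }
__device__ void func3(float *d) { /* ... */ }

__global__ void timedKernel(float *d, long long *times)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    long long t0 = clock64();
    func1(d);
    long long t1 = clock64();
    func2(d);
    long long t2 = clock64();
    func3(d);
    long long t3 = clock64();

    // one duration per device function, per thread
    times[3 * i + 0] = t1 - t0;
    times[3 * i + 1] = t2 - t1;
    times[3 * i + 2] = t3 - t2;
}
Note these are cycle counts on that thread's multiprocessor, so divide by the clock rate for seconds, and be aware the compiler may reorder or inline the calls it times.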

How to properly add in global memory in CUDA?

I'm trying to implement sum of absolute differences in CUDA for a homework assignment, but am having trouble getting correct results.
I am given a Blocksize that represents the X and Y size (in pixels) of a square portion of the images I am given to compare. I am also given two images in YUV format. Below are the portions of the program I have to implement: the kernel that calculates the SAD and the setup for the size of the grid/blocks of threads. The rest of the program is provided and can be assumed to be correct.
Here I'm getting the x and y index of the current thread and using them to locate the pixel in the image arrays I'm dealing with in the current thread. I then calculate the absolute difference and wait for all the threads to finish that calculation. Then, if the current thread is within the block of the image we care about, the absolute difference is added to the sum in global memory with an atomicAdd to avoid a collision during the write.
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock, uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;
    int id = idx * cuda_Blocksize + idy;
    int AD = abs( cuda_curBlock[id] - cuda_refBlock[id] );
    __syncthreads();
    if( idx < cuda_Blocksize && idy < cuda_Blocksize ) {
        atomicAdd( cuda_SAD, AD );
    }
}
And this is how I'm setting up the grid and blocks for the kernel:
int grid_sizeX = Blocksize/2;
int grid_sizeY = Blocksize/2;
int block_sizeX = Blocksize/4;
int block_sizeY = Blocksize/4;
dim3 blocksInGrid(grid_sizeX, grid_sizeY);
dim3 threadsInBlock(block_sizeX, block_sizeY);
The given program calculates the SAD on the CPU as well and compares our result from the GPU with that one to check for correctness. Valid block sizes within the image are from 1 to 1000. My solution above gets correct results for Blocksize values from 10 to 91, but anything above 91 just returns 0 for the sum. What am I doing wrong?
Your grid and block size settings look odd.
Usually, for image pixels, we use settings similar to the following:
int imageROISize=1000;
dim3 threadInBlock(16,16);
dim3 blocksInGrid((imageROISize+15)/16, (imageROISize+15)/16);
You can refer to the following section of the CUDA programming guide for more information on how to distribute workloads to CUDA threads:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#thread-hierarchy
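With a rounded-up grid like that, the kernel itself then has to guard on the ROI size before touching memory. A sketch of how the posted kernel might be restructured (indexing and names adapted from the question; pixel and uint32 are the assignment's typedefs):
__global__ void gpuCounterKernel(pixel* cuda_curBlock, pixel* cuda_refBlock,
                                 uint32* cuda_SAD, uint32 cuda_Blocksize)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // guard BEFORE any memory access: the rounded-up grid launches
    // threads past the edge of the region of interest
    if (idx < cuda_Blocksize && idy < cuda_Blocksize) {
        int id = idy * cuda_Blocksize + idx; // row-major so x varies fastest
        atomicAdd(cuda_SAD, (uint32)abs(cuda_curBlock[id] - cuda_refBlock[id]));
    }
}
The __syncthreads() from the original is dropped here: no shared memory is involved, so there is nothing for the threads to synchronize on.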
You really should show all the code and identify the GPU you are running on, at least the portion that calls the kernel and allocates data for GPU use.
Are you doing proper cuda error checking on all cuda API calls and kernel calls?
Probably your kernel is not running at all because your threadsInBlock parameter is exceeding 512 threads total. You indicate that at Blocksize = 92 and above, things are not working. Let's do the math:
92/4 = 23 threads in X and Y dimensions
23 * 23 = 529 total threads requested per threadblock
529 exceeds 512 which is the limit for cc 1.x devices, so I'm guessing you're running on a cc 1.x device, and therefore your kernel launch is failing, so your kernel is not running, and so you get no computed results (i.e. 0). Note that at 91/4 = 22 threads in X and Y dimensions, you are requesting 484 total threads which does not exceed the 512 limit for cc 1.x devices.
If you were doing proper cuda error checking, the error report would have focused your attention on the cuda kernel launch failing due to incorrect launch parameters.
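For reference, "proper cuda error checking" typically means something like the following macro pattern (a sketch; the macro name is mine):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define cudaCheck(call)                                            \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// usage: wrap API calls, and check kernel launches explicitly
cudaCheck(cudaMalloc((void **)&d_buf, bytes));
kernel<<<blocksInGrid, threadsInBlock>>>(d_buf);
cudaCheck(cudaGetLastError());      // catches bad launch parameters
cudaCheck(cudaDeviceSynchronize()); // catches errors during execution
A launch with too many threads per block would show up at the cudaGetLastError() check, pointing straight at the configuration problem.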

Is it possible for cuda threads to have a local/private copy of an argument which can be updated without affecting the others

I have come across a situation whereby I need to provide a number of arrays as input to a global function. I need each thread to be able to perform operations on the array in such a manner that it will not affect any other thread's copy of the array. I provide the code below as an example of what I am trying to achieve.
__global__ void testLocalCopy(double *temper){
    int threadIDx = threadIdx.x + blockDim.x * blockIdx.x;
    // what I need is for each thread to set temper[3] to its id without affecting any other thread's copy
    // so thread id 0 will set its copy of temper[3] to 0 and thread id 3 will set it to 3 etc.
    temper[3]=threadIDx;
    printf("For thread %d the val in temper[3] is %lf \n",threadIDx,temper[3]);
}
Just to restate: is there a method whereby a given thread can be certain that no other thread is updating its value of temper[3]?
I initially believed I would be able to solve this problem by using constant memory, but as constant memory is read-only, it did not meet my needs.
I am using CUDA 4.0; please see the main function below.
int main(){
    double temper[4]={2.0,25.9999,55.3,66.6};
    double *dev_temper;
    int size=4;
    cudaMalloc( (void**)&dev_temper, size * sizeof(double) );
    cudaMemcpy( dev_temper, temper, size * sizeof(double), cudaMemcpyHostToDevice );
    testLocalCopy<<<2,2>>>(dev_temper);
    cudaFree(dev_temper);   // free before resetting the device
    cudaDeviceReset();
}
Thanks in advance,
Connor
Within your kernel function you can allocate memory as
int temper_per_thread[4];
Now each thread will have separate and unique access to this array within your kernel, e.g. the code below will populate temper_per_thread with the current thread index:
temper_per_thread[0]=threadIDx;
temper_per_thread[1]=threadIDx;
temper_per_thread[2]=threadIDx;
temper_per_thread[3]=threadIDx;
Of course, if you wish to transfer all these thread-specific arrays back to the CPU, you will need a different approach (see the sketch below):
1. Allocate a larger portion of global memory.
2. The size of this larger portion will be the number of threads multiplied by the number of elements unique to each thread.
3. Index the writes so that each thread always writes to a unique location within global memory.
4. Do a GPU-to-CPU memcpy after the kernel finishes.
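A sketch of that approach (kernel name, buffer names, and sizes are mine): each thread fills a private local array and then writes it to its own slice of a global output buffer.
#define PER_THREAD 4

__global__ void localCopyKernel(const double *temper, double *out, int nThreads)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid >= nThreads) return;

    // private copy: the compiler places this in registers or local memory,
    // so no other thread can see or modify it
    double local[PER_THREAD];
    for (int i = 0; i < PER_THREAD; ++i)
        local[i] = temper[i];

    local[3] = tid; // each thread updates only its own copy

    // write-back: thread tid owns out[tid*PER_THREAD .. tid*PER_THREAD+3]
    for (int i = 0; i < PER_THREAD; ++i)
        out[tid * PER_THREAD + i] = local[i];
}

// host side: out must hold nThreads * PER_THREAD doubles, copied back
// with cudaMemcpy(..., cudaMemcpyDeviceToHost) after the launch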

Memory Error in CUDA Program for Fermi GPU

I am facing the following problem on a GeForce GTX 580 (Fermi-class) GPU.
Just to give you some background, I am reading single-byte samples packed in the following manner in a file: Real(Signal 1), Imaginary(Signal 1), Real(Signal 2), Imaginary(Signal 2). (Each byte is a signed char, taking values between -128 and 127.) I read these into a char4 array and use the kernel given below to copy them to two float2 arrays corresponding to each signal. (This is just an isolated part of a larger program.)
When I run the program using cuda-memcheck, I get either an unspecified launch failure by itself, or the same message along with User Stack Overflow or Breakpoint Hit or Invalid __global__ write of size 8 at random thread and block indices.
The main kernel and launch-related code is reproduced below. The strange thing is that this code works (and cuda-memcheck throws no error) on a non-Fermi-class GPU that I have access to. Another thing that I observed is that the Fermi gives no error for N less than 16384.
#define N 32768
int main(int argc, char *argv[])
{
    char4 *pc4Buf_h = NULL;
    char4 *pc4Buf_d = NULL;
    float2 *pf2InX_d = NULL;
    float2 *pf2InY_d = NULL;
    dim3 dimBCopy(1, 1, 1);
    dim3 dimGCopy(1, 1);
    ...
    /* i do check for errors in the actual code */
    pc4Buf_h = (char4 *) malloc(N * sizeof(char4));
    (void) cudaMalloc((void **) &pc4Buf_d, N * sizeof(char4));
    (void) cudaMalloc((void **) &pf2InX_d, N * sizeof(float2));
    (void) cudaMalloc((void **) &pf2InY_d, N * sizeof(float2));
    ...
    dimBCopy.x = 1024; /* number of threads in a block, for my GPU */
    dimGCopy.x = N / 1024;
    CopyDataForFFT<<<dimGCopy, dimBCopy>>>(pc4Buf_d,
                                           pf2InX_d,
                                           pf2InY_d);
    ...
}
__global__ void CopyDataForFFT(char4 *pc4Data,
                               float2 *pf2FFTInX,
                               float2 *pf2FFTInY)
{
    int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    pf2FFTInX[i].x = (float) pc4Data[i].x;
    pf2FFTInX[i].y = (float) pc4Data[i].y;
    pf2FFTInY[i].x = (float) pc4Data[i].z;
    pf2FFTInY[i].y = (float) pc4Data[i].w;
    return;
}
One other thing I noticed in my program is that if I comment out either the first two or the last two char-to-float assignment statements in my kernel, there's no memory error. If I comment out one of the first two (pf2FFTInX) and one of the last two (pf2FFTInY), errors still crop up, but less frequently. The kernel uses 6 registers with all four assignment statements uncommented, and uses 4 registers with two assignment statements commented out.
I tried the 32-bit toolkit in place of the 64-bit toolkit, 32-bit compilation with the -m32 compiler option, running without X windows, etc., but the program behaviour is the same.
I use CUDA 4.0 driver and runtime (also tried CUDA 3.2) on RHEL 5.6. The GPU compute capability is 2.0.
Please help! I could post the entire code if anybody is interested in running it on their Fermi cards.
UPDATE: Just for the heck of it, I inserted a __syncthreads() between the pf2FFTInX and the pf2FFTInY assignment statements, and memory errors disappeared for N = 32768. But at N = 65536, I still get errors. <-- This didn't last long. Still getting errors.
UPDATE: In continuing with the weird behaviour, when I run the program using cuda-memcheck, I get these 16x16 blocks of multi-coloured pixels distributed randomly all over my screen. This does not happen if I run the program directly.
The problem was a bad GPU card (see the comments). [I'm adding this answer to remove the question from the unanswered list and make it more useful.]