Add multiple vectors concurrently in cuda - cuda

I want to design a kernel to add a matrix row pairs concurrently, but I don't know how to accomplish it.
For example, I have a data matrix, which size is (512, 1024), and I want to add its row pairs(row1+row2, row3+row4,...,row511+row512) at same time.
The reason I’m considering doing this is just for saving time.
Could you give me some advice?
Thanks!

Something like this may be useful:
const int width = 1024;
const int rows = 512;
template <typename T>
__global__ void row_add(const T * __restrict__ din, T * __restrict__ dout){
int idx = width*2*blockIdx.x + threadIdx.x;
if (dout == din)
dout[idx] += dout[idx+width];
else
dout[idx-blockIdx.x*width] = din[idx]+din[idx+width];
}
It depends on the width dimension being 1024 or less. You would launch it like this:
row_add<<<rows/2, width>>>(d_in, d_out);
If you pass it different pointers for d_in and d_out, it will assume you want the output written contiguously to a separate array. If you pass it the same pointer for d_in and d_out, it will assume you want the results of row 0+1 written to row 0, the results of row 2+3 written to row 2, and so on.
The rows dimension has to be an even number, obviously from your problem statement (adding rows pairwise).
coded in browser, not tested, may contain bugs

Related

CUDA number of data cannot be divide by the CUDA threads evenly

For example, there are two 4-threads, but I have 5 data, the first 0-3 can be mapped to the first 4-threads, how about the rest, it only says there might be a runtime error, but how to fix it?
I think I ask this question in the wrong direction, now suppose I have
perfromwork<<<2,2>>>;
Now my dataIndex calculated by this pseudocode is smaller than the number of data elements(N=5), so what to do with the last one (5-2x2=1)? If I use another block for it, it will come across the same problem, the <<<2, 2>>> block will create a larger dataIndex.
There are two canonical approaches here.
Size the grid to be larger than or equal to the data set size, and make sure to use a "thread check" that prevents unneeded extra threads from doing any work.
Use a grid-stride loop, which allows the grid size to be determined independently from the data set size (if you wish) while still providing correct results.
vector add example kernels for each:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size) // thread check
z[idx] = x[idx] + y[idx];
}
The above kernel does not use a grid-stride loop. It will require that you size the grid to be larger than or equal to the data set size, in order for all elements to be processed. That sizing code might look like this:
int size = MY_DATA_SET_SIZE;
dim3 block(256); // this is threads per block, the choice here is not critical for correctness, but must be 1 or larger and less than or equal to 1024;
dim3 grid((size+block.x-1)/block.x);
vectorAdd<<<grid,block>>>(...);
A kernel implementing a grid-stride loop to do the same thing might look like this:
__global__ void vectorAdd(float *x, float *y, float *z, int size){
for (int idx = threadIdx.x+blockDim.x*blockIdx.x; idx < size; idx += blockDim.x*gridDim.x)
z[idx] = x[idx] + y[idx];
}
In this case, grid sizing can be arbitrary (1 or larger) and still yield correct results.

dot_product with CUDA_CUB

__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
// --- Specialize BlockReduce for type float.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
// --- Allocate temporary storage in shared memory
__shared__ typename BlockReduceT::TempStorage temp_storage;
float result;
if(tid < N) result = BlockReduceT(temp_storage).Sum(indata[tid]);
// --- Update block reduction value
if(threadIdx.x == 0) outdata[blockIdx.x] = result;
return;
}
I have tested the reduction sum(as shown in above code snippet) with cuda cub successfully, I want to perform the inner product of two vectors based on this code. But I have some confusions about it:
We need two input vectors for the inner_product, need I to conduct a component-wise multiplication of this two input vectors before the reduction sum on the resulting new vector.
In the code examples of the cuda cub, the dimension of input vectors is equal to the blocknumber*threadnumber. what if we have a very large vector.
Yes, with cub, and assuming your vectors were stored separately (i.e. not interleaved), you would need to do an element-wise multiplication first. On the other hand, thrust transform_reduce could handle it in a single function call.
blocknumber*threadnumber should give you all the range you need. on a cc3.0 or higher GPU, blocknumber (i.e. gridDim.x) can range up to 2^31-1 and threadnumber (i.e. blockDim.x) can range up to 1024. This gives you the possibility to handle 2^40 elements. If each element is 4 bytes, this would constitute (i.e. require) 2^42 bytes. That is about 4TB (or double that if you are considering 2 input vectors), which is much larger than any GPU memory currently. So you will run out of GPU memory space before you run out of grid dimension.
Note that what you are showing is cub::BlockReduce. However if you are doing a vector dot product of two large vectors, you might want to use cub::DeviceReduce instead.

CUDA atomicAdd using thread.x not returning expected results

I've been experimenting with atomic operations in CUDA, but I can't get thread index numbers to be included in the operations, it looks like they are just treated as zeros as in the examples shown below:
Is there anything I'm doing wrong in the code below?
Code 1: adding thread index value to dest[10] (not working, dest[10] is 0 after running, I would expect it to be greater than 0 as it would add the value of the index to dest[10] each time)
__global__ void add_test(int* dest, float *a, float *b, float *c)
{
int ix = ((blockIdx.x * blockDim.x) + threadIdx.x);
int idx = threadIdx.x;
atomicAdd(dest+10,idx);
}
Code 2: if I use a constant, then it seems to work (at the end of the run dest[10]=2, but again I would expect it to be greater than 2 as it should add 2 for every running thread/block):
__global__ void add_test(int* dest, float *a, float *b, float *c)
{
int ix = ((blockIdx.x * blockDim.x) + threadIdx.x);
int idx = threadIdx.x;
atomicAdd(dest+10,2);
}
My test call looks like:
add_test<<<(1024,1,1), (41,1584,1)>>>
This isn't a valid kernel launch:
add_test<<<(1024,1,1), (41,1584,1)>>>
You cannot ask for thread block dimensions of (41,1584,1)
My guess is you are doing no proper cuda error checking and have not run your code with cuda-memcheck, as either of these would have indicated the error, and that your kernel is not running properly.
The maximum in either of the first two dimensions is either 512 or 1024, and the maximum combined dimensions (i.e. the product of the dimensions = total threads) is 512 or 1024 depending on GPU.
In the future, please provide a complete, compilable code if you are asking for help with a code that is not working. SO expects this and it is a valid close reason for your question if you don't.

CUDA: Thread and Array Allocation

I have read many times about CUDA Thread/Blocks and Array, but still don't understand point: how and when CUDA starts to run multithread for kernel function. when host calling kernel function, or inside kernel function.
For example I have this example, It just simple transpose an array. (so, it just copy value from this array to another array).
__global__
void transpose(float* in, float* out, uint width) {
uint tx = blockIdx.x * blockDim.x + threadIdx.x;
uint ty = blockIdx.y * blockDim.y + threadIdx.y;
out[tx * width + ty] = in[ty * width + tx];
}
int main(int args, char** vargs) {
/*const int HEIGHT = 1024;
const int WIDTH = 1024;
const int SIZE = WIDTH * HEIGHT * sizeof(float);
dim3 bDim(16, 16);
dim3 gDim(WIDTH / bDim.x, HEIGHT / bDim.y);
float* M = (float*)malloc(SIZE);
for (int i = 0; i < HEIGHT * WIDTH; i++) { M[i] = i; }
float* Md = NULL;
cudaMalloc((void**)&Md, SIZE);
cudaMemcpy(Md,M, SIZE, cudaMemcpyHostToDevice);
float* Bd = NULL;
cudaMalloc((void**)&Bd, SIZE); */
transpose<<<gDim, bDim>>>(Md, Bd, WIDTH); // CALLING FUNCTION TRANSPOSE
cudaMemcpy(M,Bd, SIZE, cudaMemcpyDeviceToHost);
return 0;
}
(I have commented all lines that not important, just have the line calling function transpose)
I have understand all lines in function main except the line calling function tranpose. Does it true when I say: when we call function transpose<<<gDim, bDim>>>(Md, Bd, WIDTH), CUDA will automatically assign each elements of array into one thread (and block), and when we calling "one time" tranpose, CUDA will running gDim * bDim times tranpose on gDim * bDim threads.
This point makes me feel frustrated so much, because it doesn't like multithread in java, when I use :( Please tell me.
Thanks :)
Your understanding is in essence correct.
transpose is not a function, but a CUDA kernel. When you call a regular function, it only runs once. But when you launch a kernel a single time, CUDA will automatically run the code in the kernel many times. CUDA does this by starting many threads. Each thread runs the code in your kernel one time. The numbers inside the tripple brackets (<<< >>>) is called the kernel execution configuration. It determines how many threads will be launched by CUDA and specifies some relationships between the threads.
The number of threads that will be started is calculated by multiplying up all the values in the grid and block dimensions inside the triple brackets. For instance, the number of threads will be 1,048,576 (16 * 16 * 64 * 64) in your example.
Each thread can read some variables to find out which thread it is. Those are the blockIdx and threadIdx structures at the top of the kernel. The values reflect the ones in the kernel execution configuration. So, if you run your kernel with a grid configuration of 16 x 16 (the first dim3 in the triple brackets, you will get threads that, when they each read the x and y values in the blockIdx structure, will get all possible combinations of x and y between 0 and 15.
So, as you see, CUDA does not know anything about array elements or any other data structures that are specific to your kernel. It just deals with threads, thread indexes and block indexes. You then use those indexes to to determine what a given thread should do (in particular, which values in your application specific data it should work on).

CUDA efficient division?

I would like to know if there is, by any chance an efficient way of dividing elements of an array. I am running with matrix values 10000x10000 and it a considerable amount of time in comparison with other kernels. Division are expensive operations, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
int row = blockIdx.x * blockDim.x + threadIdx.x;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if((row < N) && (col <= row) ){
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
kernel launched with
int N = 10000;
int threads = 32
int blocks = (N+threads-1)/threads
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
int k = blockIdx.x * blockDim.x + threadIdx.x;
int kmax = N*(N+1)/2
int i,j;
if(k< kmax){
row = (int)(sqrt(0.25+2.0*k)-0.5);
col = k - (row*(row+1))>>1;
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
launched with
int threads =192;
int totalThreadsNeeded = (N*(N+1)/2;
int blocks = ( threads + (totalThreadsNeeded)-1 )/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even if the threadIds are the correct one? what is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So a enormous amount of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using isn't allowing coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on a lower-triangle of a square array. If this is the case, it could be equally done using something like this:
__global__
void division(int N, float* A, int* B)
{
for(int row=blockIdx.x; row<N; row+=gridDim.x) {
for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
int val = max(1,B[row*N+col]);
A[row*N+col] /= (float)val;
}
}
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one dimension grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SM with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel. (a / b == a * 1 / b).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here, just exchange rows and columns for to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.