Read 4 chars per thread in one transaction in CUDA

I have recently started learning CUDA, and I have a question about memory transactions.
My understanding is that, in each transaction, 32 consecutive threads in the same block (a warp) can concurrently access 128 consecutive bytes of memory (32 single-precision words).
But in the examples I have seen, each thread always accesses a 4-byte word as one whole variable. So my question is: if my array in global memory is declared as char, can all 32 threads access this piece of memory and each read 4 consecutive chars at the same time?
So, for example, if I write the code:
__global__
void kernel(char *d_mask)
{
extern __shared__ char s_tmp[];
const unsigned int thId = threadIdx.x;
const unsigned int elementId = 4 * (threadIdx.x + blockDim.x * blockIdx.x);
// each thread copies its own 4 consecutive chars, one byte at a time
s_tmp[4 * thId] = d_mask[elementId];
s_tmp[4 * thId + 1] = d_mask[elementId + 1];
s_tmp[4 * thId + 2] = d_mask[elementId + 2];
s_tmp[4 * thId + 3] = d_mask[elementId + 3];
__syncthreads();
/* calculation */
}
Then, will each thread read its 4 bytes concurrently? And if not, how can I make that happen? Should I use some API like memcpy?

To get an efficient read, the bytes being read have to be combined into a single transaction; in general, we can't rely on that happening when the loads are spread across several lines of code.
To combine things into a single transaction, there are vector types which combine multiple elements into a single type. As long as we pay attention to proper alignment, we can treat char or unsigned char arrays as arrays of e.g. uchar4, which is a vector type that combines four characters into a single (32-bit) type. Proper alignment here means the address being reinterpreted must be a multiple of 4 bytes; pointers returned by cudaMalloc satisfy this. You can find lots more goodies in the CUDA header files vector_types.h and vector_functions.h.
Anyway, we could rewrite your sample like this, to take advantage of a "vector load":
__global__
void kernel(char *d_mask)
{
extern __shared__ char s_tmp[];
const unsigned int thId = threadIdx.x;
const unsigned int elementId = threadIdx.x + blockDim.x * blockIdx.x;
uchar4 *s_tmp_v = reinterpret_cast<uchar4 *>(s_tmp);
uchar4 *d_mask_v = reinterpret_cast<uchar4 *>(d_mask);
s_tmp_v[thId] = d_mask_v[elementId];
__syncthreads();
/* calculation */
}
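To see the vector load end to end, here is a minimal sketch that copies a char buffer through shared memory using uchar4. The kernel name copy4, the host helper run_copy4 and the block size of 128 are assumptions made for illustration, not part of the original answer; the sketch also assumes the buffer length is a multiple of 4.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread moves four consecutive chars with a single 32-bit load and store.
__global__ void copy4(const unsigned char *d_in, unsigned char *d_out, unsigned int n4)
{
    extern __shared__ uchar4 s_tmp_v[];   // sized at launch: blockDim.x * sizeof(uchar4)
    const unsigned int thId = threadIdx.x;
    const unsigned int elementId = threadIdx.x + blockDim.x * blockIdx.x;
    const uchar4 *d_in_v = reinterpret_cast<const uchar4 *>(d_in);
    uchar4 *d_out_v = reinterpret_cast<uchar4 *>(d_out);
    if (elementId < n4) s_tmp_v[thId] = d_in_v[elementId];    // vector load
    __syncthreads();
    if (elementId < n4) d_out_v[elementId] = s_tmp_v[thId];   // vector store
}

// Hypothetical host helper: returns true if the round trip preserved all n bytes.
// n must be a non-zero multiple of 4.
bool run_copy4(unsigned int n)
{
    std::vector<unsigned char> h_in(n), h_out(n, 0);
    for (unsigned int i = 0; i < n; ++i) h_in[i] = static_cast<unsigned char>(i % 251);

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n);    // cudaMalloc pointers are aligned well beyond 4 bytes
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    const unsigned int n4 = n / 4, threads = 128;
    const unsigned int blocks = (n4 + threads - 1) / threads;
    // Third launch parameter: bytes of dynamic shared memory per block.
    copy4<<<blocks, threads, threads * sizeof(uchar4)>>>(d_in, d_out, n4);

    cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_in == h_out;
}
```

Calling run_copy4(1024) from a host program launches two blocks of 128 threads, each thread handling one uchar4.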

Related

Only half of the shared memory array is assigned

When I step through the kernel with Nsight, just after s_f[sidx] = 5; I see that only half of the shared memory array has been assigned.
__global__ void BackProjectPixel(double* val,
double* projection,
double* focalPtPos,
double* pxlPos,
double* pxlGrid,
double* detPos,
double *detGridPos,
unsigned int nN,
unsigned int nS,
double perModDetAngle,
double perModSpaceAngle,
double perModAngle)
{
const double fx = focalPtPos[0];
const double fy = focalPtPos[1];
//extern __shared__ double s_f[64]; //
__shared__ double s_f[64]; //
unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
unsigned int idx = j*nN + i;
unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
unsigned int threadsPerSharedMem = 64;
if (sidx < threadsPerSharedMem)
{
s_f[sidx] = 5;
}
__syncthreads();
//double * angle;
//
if (sidx < threadsPerSharedMem)
{
s_f[idx] = TriPointAngle(detGridPos[0], detGridPos[1],fx, fy, pxlPos[idx*2], pxlPos[idx*2+1], nN);
}
}
Here is what I observed
I am wondering why there are only thirty-two 5s. Shouldn't there be sixty-four 5s in s_f? Thanks.
Threads are executed in groups (of 32 threads on current hardware) called warps. Warps group the threads in order. In your case one warp gets threads 0-31 and the other gets threads 32-63. In your debugging context, you are probably seeing the results of only the warp that contains threads 0-31.
I am wondering why there are only thirty-two 5?
There are 32 fives because, as mete says, the threads of a block execute in lockstep only within groups of 32, called warps in CUDA terminology. At your breakpoint only one of the two warps has executed the assignment.
Shouldn't there be sixty-four 5 in s_f?
There will be 64 fives after the synchronization barrier, i.e. __syncthreads(). So if you place your breakpoint on the first instruction after the __syncthreads() call, you'll see all the fives. That's because by that time all the warps of the block have finished executing all the code prior to __syncthreads().
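The effect of the barrier can also be checked without a debugger. The following is a small sketch under assumed names (count_fives and run_count_fives are hypothetical, and the 8x8 block shape is chosen only to give 64 threads): thread 0 counts the fives, but only after __syncthreads().

```cuda
#include <cuda_runtime.h>

// One block of 64 threads (two warps). Every thread writes a 5, then
// thread 0 counts the fives, but only after the barrier.
__global__ void count_fives(int *d_count)
{
    __shared__ double s_f[64];
    const unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    if (sidx < 64) s_f[sidx] = 5;
    __syncthreads();   // both warps have finished their writes past this point
    if (sidx == 0) {
        int count = 0;
        for (int k = 0; k < 64; ++k) count += (s_f[k] == 5.0);
        *d_count = count;
    }
}

// Hypothetical host helper: launches a single 8x8 block and returns the count.
int run_count_fives()
{
    int *d_count = nullptr;
    int h_count = -1;
    cudaMalloc(&d_count, sizeof(int));
    count_fives<<<1, dim3(8, 8)>>>(d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```

Without the barrier, thread 0 could run its counting loop before the second warp has written its half of s_f.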
How can I see all warps with Nsight?
You can see the values for all the threads easily by putting this into the watch field:
s_f[sidx]
However, the value of sidx may become undefined due to optimizations, so I would rather watch the expression it stands for:
s_f[threadIdx.y * blockDim.x + threadIdx.x]
And indeed, if you want to investigate the values for a particular warp, then, as Robert Crovella points out, you should use conditional breakpoints. If you want to break within the second warp, then something like this should work for a two-dimensional grid of two-dimensional blocks (which I presume you are using):
((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x) == 32
This works because 32 is the index of the first thread within the second warp. For other combinations of block and grid dimensions see this useful cheatsheet.

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
unsigned int num_elements)
{
const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
*error += pow(data[i] - recon[i], 2);
}
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Do I need to synchronize this somehow?
Currently I'm getting a bad result, so I'm guessing there is some issue.
Your current implementation is subject to race conditions, because all threads try to update the same global memory address at the same time. You could simply use an atomicAdd instead of *error += pow..., but that suffers from performance issues because every update is serialized. (Note also that the built-in atomicAdd for double requires compute capability 6.0 or higher.)
Instead you should do a reduction using shared memory, as follows:
__global__ void kSquaredError(double* data, double* recon, double* error, unsigned int num_elements) {
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int tid = threadIdx.x;
    // temporary storage of each thread's error; the size is not a compile-time
    // constant, so launch with blockDim.x * sizeof(double) bytes of dynamic shared memory
    extern __shared__ double serror[];
    double sum = 0.0; // shared memory is not zero-initialized, so accumulate in a register
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        sum += pow(data[i] - recon[i], 2);
    }
    serror[tid] = sum; // put each thread's value in shared memory
    __syncthreads();
    // reduction in shared memory; assumes blockDim.x is a power of two
    for (unsigned int i = blockDim.x >> 1; i > 0; i >>= 1) { // halve the threads each step
        if (tid < i) {
            serror[tid] += serror[tid + i];
        }
        // outside the if: every thread in the block must reach the barrier
        __syncthreads();
    }
    if (tid == 0) { // thread 0 updates the value in global memory
        atomicAdd(error, serror[0]); // same as *error += serror[0]; but atomic
    }
}
It works on the following principle: each thread has its own temporary variable in which it accumulates the sum of the error over all of its inputs. When they have finished, all threads converge at the __syncthreads() instruction to ensure that all the data is complete.
Now half of the threads in the block each take one value from the corresponding other half and add it to their own; the number of active threads is halved again and the step repeated until you are left with one thread (thread 0), which holds the total sum.
Thread 0 then updates the global memory with an atomicAdd to avoid a race condition with other blocks, if there are any.
If we just used the first example with an atomicAdd on every update, we would have num_elements atomic operations (one per element), all serialized. Now we have only gridDim.x atomic operations, which is a lot fewer.
See Optimizing Parallel Reduction in CUDA for further reading on how reduction using cuda works.
Edit
Added if statement in the reduction for loop, forgot that.
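For a version that can be run as is, here is a sketch of the same block-reduction-plus-one-atomic pattern in single precision, chosen because atomicAdd on float needs no special architecture flag. The names squared_error_f and run_squared_error, the launch shape (4 blocks of 128 threads) and the test data are assumptions for illustration only.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumes blockDim.x is a power of two and that *error starts at zero.
__global__ void squared_error_f(const float *data, const float *recon,
                                float *error, unsigned int num_elements)
{
    extern __shared__ float serror[];   // blockDim.x floats, sized at launch
    const unsigned int tid = threadIdx.x;
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < num_elements;
         i += blockDim.x * gridDim.x) {
        const float diff = data[i] - recon[i];
        sum += diff * diff;
    }
    serror[tid] = sum;
    __syncthreads();
    for (unsigned int s = blockDim.x >> 1; s > 0; s >>= 1) {
        if (tid < s) serror[tid] += serror[tid + s];
        __syncthreads();   // reached by every thread of the block
    }
    if (tid == 0) atomicAdd(error, serror[0]);   // one atomic per block
}

// Hypothetical host helper: data[i] = 3 and recon[i] = 1, so every term is 4.
float run_squared_error(unsigned int n)
{
    std::vector<float> h_data(n, 3.0f), h_recon(n, 1.0f);
    float *d_data = nullptr, *d_recon = nullptr, *d_error = nullptr;
    float h_error = -1.0f;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMalloc(&d_recon, n * sizeof(float));
    cudaMalloc(&d_error, sizeof(float));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_recon, h_recon.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_error, 0, sizeof(float));   // the accumulator must start at zero

    const unsigned int threads = 128, blocks = 4;
    squared_error_f<<<blocks, threads, threads * sizeof(float)>>>(d_data, d_recon, d_error, n);

    cudaMemcpy(&h_error, d_error, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_recon);
    cudaFree(d_error);
    return h_error;
}
```

With this launch shape the kernel issues four atomic additions in total, one per block, regardless of n.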

CUDA efficient division?

I would like to know if there is, by any chance, an efficient way of dividing the elements of an array. I am working with 10000x10000 matrices, and this kernel takes a considerable amount of time in comparison with my other kernels. Divisions are expensive operations, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
int row = blockIdx.x * blockDim.x + threadIdx.x;
int col = blockIdx.y * blockDim.y + threadIdx.y;
if((row < N) && (col <= row) ){
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
kernel launched with
int N = 10000;
int threads = 32;
int blocks = (N+threads-1)/threads;
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
int k = blockIdx.x * blockDim.x + threadIdx.x;
int kmax = N*(N+1)/2
int i,j;
if(k< kmax){
row = (int)(sqrt(0.25+2.0*k)-0.5);
col = k - (row*(row+1))>>1;
if( B[row*N+col] >0 )
A[row*N+col] /= (float)B[row*N+col];
}
}
launched with
int threads =192;
int totalThreadsNeeded = (N*(N+1)/2;
int blocks = ( threads + (totalThreadsNeeded)-1 )/threads;
division<<<blocks, threads >>>(N, A, B);
Why is option B giving a wrong result even though the thread IDs are correct? What is missing here?
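One thing worth checking in option B, offered here as an observation rather than as part of the answers that follow: in C and C++ the binary minus binds more tightly than >>, so k - (row*(row+1))>>1 is parsed as (k - row*(row+1)) >> 1. The kernel also declares i and j but assigns row and col. Below is a minimal sketch of the intended index mapping; the helper names triangular_index and check_triangular_index are hypothetical, and the integer correction step is an added guard against rounding in sqrt.

```cuda
#include <cmath>

// Maps a linear index k to (row, col) in the lower triangle (col <= row),
// where row r starts at k = r*(r+1)/2.
__host__ __device__ inline void triangular_index(long long k, int *row, int *col)
{
    long long r = static_cast<long long>(sqrt(0.25 + 2.0 * static_cast<double>(k)) - 0.5);
    // Guard against floating-point rounding in either direction.
    while (r * (r + 1) / 2 > k) --r;
    while ((r + 1) * (r + 2) / 2 <= k) ++r;
    *row = static_cast<int>(r);
    *col = static_cast<int>(k - r * (r + 1) / 2);   // note the parentheses
}

// Hypothetical host-side check: walks the first n_rows rows in order and
// confirms that consecutive k values map to consecutive (row, col) pairs.
bool check_triangular_index(int n_rows)
{
    long long k = 0;
    for (int r = 0; r < n_rows; ++r) {
        for (int c = 0; c <= r; ++c, ++k) {
            int row, col;
            triangular_index(k, &row, &col);
            if (row != r || col != c) return false;
        }
    }
    return true;
}
```

Inside the kernel the same helper could be called as triangular_index(k, &row, &col) after the k < kmax check.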
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then, because of the triangular nature of the access pattern in the kernel, roughly half of those threads never do anything productive. So an enormous number of GPU cycles is being wasted for no particularly good reason. Further, the access pattern you are using doesn't allow coalesced memory access, which further reduces the performance of the threads that are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on a lower-triangle of a square array. If this is the case, it could be equally done using something like this:
__global__
void division(int N, float* A, int* B)
{
for(int row=blockIdx.x; row<N; row+=gridDim.x) {
for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
int val = max(1,B[row*N+col]);
A[row*N+col] /= (float)val;
}
}
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one-dimensional grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SMs with a full complement of 8 blocks each and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate the reciprocals in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel (a / b == a * (1 / b)).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
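As an illustration of the reduced-precision route, here is a sketch of an element-wise division using __fdividef. It assumes B has already been converted to float, as suggested above; the kernel name fast_divide and the host helper run_fast_divide are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cmath>

// Element-wise a[i] /= b[i] using the fast, reduced-accuracy division intrinsic.
__global__ void fast_divide(float *a, const float *b, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && b[i] > 0.0f)
        a[i] = __fdividef(a[i], b[i]);
}

// Hypothetical host helper: divides x by y on the device and returns the result.
float run_fast_divide(float x, float y)
{
    float *d_a = nullptr, *d_b = nullptr;
    float h = 0.0f;
    cudaMalloc(&d_a, sizeof(float));
    cudaMalloc(&d_b, sizeof(float));
    cudaMemcpy(d_a, &x, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &y, sizeof(float), cudaMemcpyHostToDevice);
    fast_divide<<<1, 32>>>(d_a, d_b, 1);   // only thread 0 passes the range check
    cudaMemcpy(&h, d_a, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    return h;
}
```

Whether the accuracy of the intrinsic is acceptable depends on the algorithm, so it is worth comparing a few results against the ordinary / operator first.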
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here; just exchange rows and columns to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance is not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I thought I would use shared memory to avoid the memory latency. But something is apparently wrong, because my application has become really slow since I started using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
int j = 0;
// index GPU - Transaction-wise
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
// shared variable
extern __shared__ int shared_memory_data[];
extern __shared__ int shared_b1_data[];
extern __shared__ int shared_b2_data[];
// copy from global memory into shared memory and sync threads
shared_b1_data[tid] = b1_memory[tid];
shared_b2_data[tid] = b2_memory[tid];
__syncthreads();
// AND each int bitwise
for(j = 0; j < b1_size; j++)
shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
// write result for this block to global memory
_memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the size of b1 and b2, since they depend on the number of customers, which I only know at runtime (but both always have the same size).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
int* _memory_device;
int* b1_memory;
int* b2_memory;
int b1_size = b1.getIntSize();
// allocate memory on GPU
(cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
// copy values on GPU
(cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
dim3 dimBlock(1, 1);
dim3 dimGrid(1, 1);
AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
// return values
(cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));
// Free Memory
(cudaFree(b1_memory));
(cudaFree(b2_memory));
(cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits for each element. The number of elements depends on the number of customers. I also have a problem with the kernel's launch parameters: if I add more blocks or threads, AndBitwiseOperation() does not give me the correct result. With just 1 block and 1 thread per block the result is correct, but the kernel does not run in parallel.
Any advice is welcome :)
Thank you
I did not really understand what your kernel is supposed to do.
You should read more about CUDA and GPU programming.
I have tried to point out some of the mistakes.
Shared memory (sm) should reduce global memory reads.
Analyze your global memory (gm) read and write operations per thread.
a. You read global memory twice and write sm twice.
b. (Ignoring the nonsensical loop, which never uses its index) you read sm twice and write sm once.
c. You read sm once and write gm once.
So in total you have gained nothing. You could use the global memory directly.
You use all threads to write out one value at the block index "i".
You should only use one thread to write this data out.
It makes no sense to output the same data from multiple threads, which will get serialized.
You use a loop and don't use the loop counter at all.
You write at "tid" and read at "i" inconsistently.
This assignment is overhead.
unsigned int tid = threadIdx.x;
The results cannot be correct with more than one block, since tid == i holds only within a single block!
All the wrong indexing results in wrong calculations when using more than one block.
The shared memory at "i" was never written!
_memory_device[i] = shared_memory_data[i];
My assumption of what your kernel should do:
/*
* Call kernel with x-block usage and up to 3D Grid
*/
__global__ void bitwiseAnd(int* outData_g,
const long long int inSize_s,
const int* inData1_g,
const int* inData2_g)
{
//get unique block index
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//get unique thread index
const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
//check global unique thread range
if(threadId >= inSize_s)
return;
//output bitwise and
outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size in bytes of the extern __shared__ arrays, and defaults to 0.
You can declare more than one extern __shared__ array in your kernel, but all of them start at the same address, so they would overlap. An example in the Programming Guide declares a single extern __shared__ array and manually places arrays at offsets within it:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
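Applied to the bitwise AND from the question, the single-buffer approach might look like the sketch below. The names and_shared and run_and_shared and the block size of 128 are assumptions; as the other answer notes, shared memory brings no speed benefit for this particular operation, so the sketch only illustrates the partitioning and the Ns launch parameter.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One extern __shared__ buffer, split by hand into two int arrays of
// blockDim.x elements each.
__global__ void and_shared(int *out, const int *in1, const int *in2, int n)
{
    extern __shared__ int shared_buf[];
    int *s_b1 = shared_buf;                // first blockDim.x ints
    int *s_b2 = shared_buf + blockDim.x;   // next blockDim.x ints
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    const int tid = threadIdx.x;
    if (i < n) {
        s_b1[tid] = in1[i];
        s_b2[tid] = in2[i];
        // Each thread reads back only what it wrote itself, so no barrier is needed.
        out[i] = s_b1[tid] & s_b2[tid];
    }
}

// Hypothetical host helper: returns true if every output equals in1 & in2.
bool run_and_shared(int n)
{
    std::vector<int> h1(n), h2(n), h_out(n, 0);
    for (int i = 0; i < n; ++i) { h1[i] = i * 7 + 3; h2[i] = i * 13 + 5; }
    int *d1 = nullptr, *d2 = nullptr, *d_out = nullptr;
    const size_t bytes = n * sizeof(int);
    cudaMalloc(&d1, bytes);
    cudaMalloc(&d2, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d1, h1.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, h2.data(), bytes, cudaMemcpyHostToDevice);

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    // Ns = room for both partitions: 2 * threads ints per block.
    and_shared<<<blocks, threads, 2 * threads * sizeof(int)>>>(d_out, d1, d2, n);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(d_out);
    for (int i = 0; i < n; ++i)
        if (h_out[i] != (h1[i] & h2[i])) return false;
    return true;
}
```

Because the global index i includes blockIdx.x, the result stays correct when the launch uses more than one block.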

CUDA Dot Product

I'm trying to implement the classic dot-product kernel for double precision arrays, with atomic computation of the final sum across the various blocks. I used the atomicAdd for double precision as given on page 116 of the programming guide. I'm probably doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn't seem to work properly, since every time I run my kernel with the same data I get different results. I'd be grateful if somebody could spot the mistake or provide an alternative solution!
Here is my kernel:
__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
__shared__ double cache[threadsPerBlock]; //thread shared memory
int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
int i=0,cacheIndex=0;
double temp = 0;
cacheIndex = threadIdx.x;
while (global_tid < (*n)) {
temp += a[global_tid] * b[global_tid];
global_tid += blockDim.x * gridDim.x;
}
cache[cacheIndex] = temp;
__syncthreads();
for (i=blockDim.x/2; i>0; i>>=1) {
if (threadIdx.x < i) {
cache[threadIdx.x] += cache[threadIdx.x + i];
}
__syncthreads();
}
__syncthreads();
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
}
And here is my device function atomicAdd:
__device__ double cuda_atomicAdd(double *address, double val)
{
double assumed,old=*address;
do {
assumed=old;
old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
__double_as_longlong(assumed),
__double_as_longlong(val+assumed)));
}while (assumed!=old);
return old;
}
Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:
#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>
double do_dot_product(int n, double *a, double *b)
{
// wrap raw pointers to device memory with device_ptr
thrust::device_ptr<double> d_a(a), d_b(b);
// inner_product implements a mathematical dot product
return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
You are using the cuda_atomicAdd function incorrectly. This section of your kernel:
if (cacheIndex==0) {
*dot_res=cuda_atomicAdd(dot_res,cache[0]);
}
is the culprit. Here, you atomically add to dot_res, then non-atomically set dot_res to the value the function returns. The return value of this function is the previous value of the location being atomically updated, and it is supplied for "information" or local use by the caller only. You don't assign it to the location you are atomically updating; that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:
if (cacheIndex==0) {
double result=cuda_atomicAdd(dot_res,cache[0]);
}
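Put together, a complete version of the corrected pattern could look like the following sketch. The names cas_atomic_add, dot_kernel and run_dot, the block size and the launch shape are assumptions for illustration; the compare-and-swap loop follows the pattern shown in the Programming Guide, comparing the integer bit patterns rather than the doubles. The sketch also zeroes dot_res before the launch, which the accumulation relies on.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_BLOCK 128   // assumed block size, a power of two

// CAS-based double-precision add; returns the previous value at address.
__device__ double cas_atomic_add(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);   // integer comparison, so a NaN cannot loop forever
    return __longlong_as_double(old);
}

__global__ void dot_kernel(int n, const double *a, const double *b, double *dot_res)
{
    __shared__ double cache[THREADS_PER_BLOCK];
    double temp = 0.0;
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        cas_atomic_add(dot_res, cache[0]);   // return value deliberately not stored in dot_res
}

// Hypothetical host helper: a[i] = 1 and b[i] = 2, so the dot product is 2 * n.
double run_dot(int n)
{
    std::vector<double> h_a(n, 1.0), h_b(n, 2.0);
    double *d_a = nullptr, *d_b = nullptr, *d_res = nullptr;
    double h_res = -1.0;
    cudaMalloc(&d_a, n * sizeof(double));
    cudaMalloc(&d_b, n * sizeof(double));
    cudaMalloc(&d_res, sizeof(double));
    cudaMemcpy(d_a, h_a.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_res, 0, sizeof(double));   // the accumulator must start at zero
    dot_kernel<<<4, THREADS_PER_BLOCK>>>(n, d_a, d_b, d_res);
    cudaMemcpy(&h_res, d_res, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_res);
    return h_res;
}
```

For general input the order in which the blocks add their partial sums is not fixed, so the last bits of a floating-point result may still differ slightly between runs; the data in run_dot is chosen so that every partial sum is exact.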
I did not check your code in that much depth, but here is some advice.
I would only advise using Thrust if you only use your GPU for such generic tasks, since when a complex problem arises, people who rely on it have no idea how to program the GPU efficiently in parallel.
Start a new parallel reduction kernel to sum up the dot product.
Since the data is already on the device you won't see a decrease in performance from starting a new kernel.
Your kernel does not seem to scale across the maximum number of possible blocks on the newest GPUs. If it did, and your kernel were able to calculate the dot product of millions of values, the performance would decrease dramatically because of the serialized atomic operation.
Beginner mistake: are your input data and shared memory accesses range-checked? Or are you sure the input data size is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were due to this fault.
Optimise your parallel reduction: see my thesis or Mark Harris's optimisation slides.
Untested, I just wrote it down in Notepad:
/*
* @param inCount_s unsigned long long int Length of both input arrays
* @param inValuesA_g double* First value array
* @param inValuesB_g double* Second value array
* @param outDots_g double* Output dots of each block, length equals the number of blocks
*/
__global__ void dotProduct(const unsigned long long int inCount_s,
const double* inValuesA_g,
const double* inValuesB_g,
double* outDots_g)
{
//get unique block index in a possible 3D Grid
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//block dimension uses only x-coordinate
const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;
/*
* shared value pair products array, where BLOCK_SIZE power of 2
*
* To improve performance, increase its size by a multiple of BLOCK_SIZE, so that each thread loads more than 1 element!
* (the length of outDots_g decreases by the same factor, and you need to range check and initialize memory)
* -> see Harris's GPU optimisation / parallel reduction slides for more information.
*/
__shared__ double dots_s[BLOCK_SIZE];
/*
* initialize shared memory array and calculate dot product of two values,
* shared memory always needs to be initialized; it is never 0 by default, otherwise garbage is read later!
*/
if(tId < inCount_s)
dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
else
dots_s[threadIdx.x] = 0;
__syncthreads();
//do parallel reduction on shared memory array to sum up values
reductionAdd(dots_s, dots_s[0]); //see my thesis link
//output value: one partial dot product per block
if(threadIdx.x == 0)
outDots_g[blockId] = dots_s[0];
//start new parallel reduction kernel to sum up outDots_g!
}
Edit: removed unnecessary points.