Will the Blocks run in parallel? - cuda

The kernel will search the 2D array for the number 5.
E.g. the array has dimensions 100x100, so the total number of elements = 10000; if I divide that by 4, the Range = 2500.
BlockIdx.x = 0 then it will search (i = 0; i < 2500; i++)
BlockIdx.x = 1 then it will search (i = 2500 ; i < 5000; i++)
BlockIdx.x = 2 then it will search (i = 5000 ; i < 7500; i++)
BlockIdx.x = 3 then it will search (i = 7500 ; i < 10000; i++)
Kernel Code
__global__
void psearch(int *d_array)
{
    int blockID = blockIdx.x;
    int condition = (blockID + 1) * Range;   // Range = 2500 (elements per block)
    for (int i = blockID * Range; i < condition; i++)
    {
        if (d_array[i] == 5)
        {
            d_array[1] = 1992;   // mark that the value was found
        }
    }
}
As can be seen below, I am calling the kernel with 4 blocks of 1 thread each.
Kernel Call
psearch<<<4,1>>>(d_array);
My question is: since the kernel call launches 4 blocks of 1 thread each, can I say that all the blocks are running in parallel, and therefore that the array is being searched in parallel?
Device Name = Quadro FX 1800M
This program is for learning purposes, so if I made any mistakes, I would be glad if you could point them out.
Thank you for your consideration.

Yes, it will run in parallel; however, it will underperform greatly compared to what your card is actually capable of. In order to harness its full power you should have at least:
as many blocks as there are streaming multiprocessors (should be fine for your particular model),
at least 32 threads in a block, preferably more than 128.
GPUs get their performance when you have hundreds or thousands of threads. 4 is very, very little for a GPU.
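To make that concrete, here is a minimal sketch (not the poster's code) of the same search written with many threads per block and a grid-stride loop; the element count parameter and the launch shape are illustrative assumptions, and the "write 1992 into d_array[1] on a hit" convention is kept from the question.

__global__ void psearch_many(int *d_array, int n)
{
    // each thread starts at its own global index and strides over the array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
    {
        if (d_array[i] == 5)
        {
            d_array[1] = 1992;   // same "found" convention as in the question
        }
    }
}

// e.g. launch with plenty of blocks of 128 threads:
// psearch_many<<<64, 128>>>(d_array, 10000);

This way the number of blocks and threads is decoupled from the problem size, and each SM has enough warps in flight to stay busy.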

Related

CUDA shared memory bank conflict unexpected timing

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) access 32 integers of size 32-bits each in the following 2 scenarios:
When there is no bank conflict (offset=1)
When there is a bank conflict (offset=32, all threads are accessing bank 0)
Here is a sample of the code (only the kernel):
__global__ void kernel(int offset) {
    __shared__ uint32_t shared_memory[MEMORY_SIZE];

    // init shared memory
    if (threadIdx.x == 0) {
        for (int i = 0; i < MEMORY_SIZE; i++)
            shared_memory[i] = i;
    }
    __syncthreads();

    uint32_t index = threadIdx.x * offset;
    // 2048 / 32 = 64
    for (int i = 0; i < 64; i++)
    {
        shared_memory[index] += index * 10;
        index += 32;
        index %= MEMORY_SIZE;
        __syncthreads();
    }
}
I expected the version with offset=32 to run slower than the one with offset=1, as the accesses should be serialized, but found out that they have similar execution times. How is that possible?
You have only 1 active warp, so the biggest problem for your performance is that each (or most) GPU instruction waits for the previous one to finish. This hides most of the slowdown from shared memory bank conflicts. You also have a lot of work per shared memory access: how many small instructions does cosf expand into? Try simple integer arithmetic instead.
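For what it's worth, a minimal sketch of a benchmark reworked along those lines: several warps per block and only a trivial integer add per shared-memory access, so the serialized (offset = 32) case has far less independent work to hide behind. MEMORY_SIZE, the iteration count and the launch shape here are assumptions, not taken from the original benchmark.

#define MEMORY_SIZE 2048

__global__ void conflict_test(int offset, unsigned int *out)
{
    __shared__ unsigned int shared_memory[MEMORY_SIZE];

    // all threads help initialize
    for (int i = threadIdx.x; i < MEMORY_SIZE; i += blockDim.x)
        shared_memory[i] = i;
    __syncthreads();

    unsigned int index = (threadIdx.x * offset) % MEMORY_SIZE;
    unsigned int acc = 0;
    // many cheap shared-memory reads: with offset == 32 every lane of a
    // warp hits bank 0, so the accesses within a warp are serialized
    for (int i = 0; i < 1024; i++) {
        acc += shared_memory[index];
        index = (index + 32) % MEMORY_SIZE;
    }
    // keep the compiler from optimizing the loop away
    if (acc == 0xFFFFFFFFu)
        *out = acc;
}

// compare e.g. conflict_test<<<1, 256>>>(1, d_out) against conflict_test<<<1, 256>>>(32, d_out)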

find nearest non-zero element in another vector in CUDA

There are two M x N matrices, A and B (the actual size of each matrix is 512 x 4096).
In each row of A, the points to be processed are set to 1.
And each row of B contains values obtained through a specific operation.
Based on each row, I am going to do an operation to get the value of B that is closest to the point of 1 in A.
An example is shown in the figure, and the MATLAB code I wrote is included below.
Here's how I thought of it:
Pick the non-zero element indices of A with Thrust. Then, for each such element, the closest value is fetched from the corresponding row of B with a for loop.
(If there are several non-zero elements in A, it is expected to be slow.)
I want to make good use of the power of the GPU for this operation, do you have any more efficient ideas?
[idxY,idxX] = find(A == 1);
for Point = 1:length(idxY)
    pointBuf = find(B(:,idxY(Point)) == 1); % find non-zero elements in the row of B
    if ~isempty(pointBuf) % there are non-zero elements in the row of B
        [MinValue, MinIndex] = min(abs(pointBuf - idxY(Point)));
        C(idxY(Point),idxX(Point)) = B(pointBuf(MinIndex(1)),RangeInd(Point)); % get closest point in B
    else
        C(DopInd(Point),RangeInd(Point)) = 0; % if there are no non-zero elements in the row of B, just set to 0
    end
end
Just as a reference: a solution which shifts left and right 4095 times. It has similarities with bubble sort variants, which bubble up and down at the same time.
Its advantage is that it does not depend on the position of the non-null elements in B and can be easily parallelized between threads.
But the inner loop, which translates to 2 SASS instructions, is still just too slow (called too often): the program takes 26 ms on my notebook.
It would do so in the best and the absolute worst case of the input matrices.
Parts and methods of it probably can be reused, as it shows some CUDA programming methods.
So more or less for reference, in the end not a final (fast enough) solution:
__global__ void calcmatrix(bool* A, double* B, double* C)
{
    // calculate row number (one warp of 32 threads works on one row)
    int row = blockIdx.x * blockDim.y + threadIdx.y;
    if (row >= 512)
        return;
    // store index of valid double from B, this is moved up and down
    // those indices are for the current thread. Each thread is responsible for 128 consecutive columns.
    int indices[128];
    // prefill the indices with their own number (as if every double from B is valid)
    #pragma unroll
    for (int i = 0; i < 128; i++)
        indices[i] = threadIdx.x * 128 + i;
    // Store zero flags (4 * 32 bits) for our 128 elements
    unsigned int myzeroflags[4];
    // For efficiently loading data from memory, we distribute the data in another way: thread 0 gets columns 0, 32, 64, 96, ...; thread 1 gets columns 1, 33, 65, 97, ...; thread 2 gets columns 2, 34, 66, 98, ...; and so on
    #pragma unroll
    for (int i = 0; i < 128; i++) {
        // load value from B
        double in = B[row * 4096 + i * 32 + threadIdx.x];
        // compare to zero (!in) and combine all bool results from the 32 threads (__ballot_sync)
        unsigned int zeroflag = __ballot_sync(0xFFFFFFFF, !in);
        // store the ones, which belong to us
        if (threadIdx.x == i / 4)
            myzeroflags[i & 3] = zeroflag;
    }
    // go through our zero flags and set those indices to -1 (there is already a valid index "0", so we use a negative number to signify invalid)
    #pragma unroll
    for (int i = 0; i < 4; i++)
        #pragma unroll
        for (int j = 0; j < 32; j++)
            if (myzeroflags[i] & (1 << j))
                indices[i * 32 + j] = -1;
    // main loop, do 4095 times
    #pragma unroll 1
    for (int i = 0; i < 4095; i++) {
        // move all elements to the left (if the index there is invalid)
        // send index over thread boundaries
        int fromright = __shfl_down_sync(0xFFFFFFFF, indices[0], 1, 32);
        // if left index is -1, set it to one index to the right
        #pragma unroll
        for (int j = 0; j < 127; j++)
            if (indices[j] == -1)
                indices[j] = indices[j + 1];
        // move over thread boundaries (except for the rightmost thread)
        if (threadIdx.x != 31 && indices[127] == -1)
            indices[127] = fromright;
        // move to the right in the same way as to the left
        int fromleft = __shfl_up_sync(0xFFFFFFFF, indices[127], 1, 32);
        #pragma unroll
        for (int j = 127; j > 0; j--)
            if (indices[j] == -1)
                indices[j] = indices[j - 1];
        if (threadIdx.x != 0 && indices[0] == -1)
            indices[0] = fromleft;
    }
    // for the other distribution of elements for memory accesses, we have to redistribute the indices to the correct threads
    // To not have bank conflicts, we define the shared memory array with 33 instead of 32 elements in the last dimension, but use only 32. With this method we can put threadIdx.x into the last and previous to last dimension without bank conflicts
    __shared__ short2 distribidx[8][32][33];
    int indices2[128];
    // Redistribute first half; the index can go from 0..4095 (and also theoretically -1, if there was no non-null element in this row). This fits into a short, convert for faster transfer
    #pragma unroll
    for (int i = 0; i < 32; i++)
        distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i]), static_cast<short>(indices[i + 32]) };
    __syncwarp();
    #pragma unroll
    for (int i = 0; i < 32; i++) {
        short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
        indices2[4 * i + 0] = idxback.x;
        indices2[4 * i + 1] = idxback.y;
    }
    __syncwarp();
    // Redistribute second half
    #pragma unroll
    for (int i = 0; i < 32; i++)
        distribidx[threadIdx.y][threadIdx.x][i] = { static_cast<short>(indices[i + 64]), static_cast<short>(indices[i + 96]) };
    __syncwarp();
    #pragma unroll
    for (int i = 0; i < 32; i++) {
        short2 idxback = distribidx[threadIdx.y][i][threadIdx.x];
        indices2[4 * i + 2] = idxback.x;
        indices2[4 * i + 3] = idxback.y;
    }
    // Do final calculation
    #pragma unroll
    for (int i = 0; i < 128; i++) {
        // Default value is zero
        double result = 0;
        // Read only, if A is true and indices2 is valid
        if (A[row * 4096 + i * 32 + threadIdx.x] && indices2[i] != -1)
            // Read B with calculated index (this read is not optimized/coalesced, because the indices can be wild, but hopefully was or can be cached)
            result = B[row * 4096 + indices2[i]];
        // Store result in C
        C[row * 4096 + i * 32 + threadIdx.x] = result;
    }
}

int main()
{
    bool* A;
    double* B;
    double* C;
    cudaMalloc(&A, 2 * 512 * 4096);
    cudaMalloc(&B, 8 * 512 * 4096);
    cudaMalloc(&C, 8 * 512 * 4096);
    // called in this fashion
    calcmatrix<<<(512 + 7) / 8, dim3(32, 8)>>>(A, B, C);
    return 0;
}
This problem is really far from being simple to implement efficiently on a GPU. The main reason is that GPUs are designed to efficiently execute SIMD-friendly algorithms, while this problem can hardly be solved in a SIMD-friendly way.
The naive solution you propose will be very inefficient due to the many small kernels to execute (starting a kernel is expensive, and Thrust tends to run them synchronously by default AFAIK), not to mention the amount of parallelism of each kernel would be far too small for any modern GPU. I expect this solution to be slower than a naive CPU implementation.
First things first, one needs to find an efficient algorithm. The proposed solution runs in O(n m²), where n is the number of rows and m the number of columns. That being said, the solution should be fast (i.e. close to O(n m)) if most values are non-zero, which is not the case in the example.
A more efficient solution is to first iterate over the B matrix and find the locations of all the non-zero items so as to put them in an array L. Then you can iterate over A, track the non-zero values and search for the index of L closest to the location of the current item of A. If the number of items in L is big for the target row (e.g. >50), you can use a binary search to find the location faster (since the items of L are sorted). This solution runs in O(n m log m) time.
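As a sketch of that binary-search step (plain C++ host code with hypothetical names): given the sorted list L of non-zero column indices for one row of B, the entry closest to a query position can be found with std::lower_bound plus a look at the preceding element.

#include <algorithm>
#include <vector>

// Sketch: closest element of a sorted, non-empty index list L to position pos.
int closest_index(const std::vector<int>& L, int pos)
{
    auto it = std::lower_bound(L.begin(), L.end(), pos); // first element >= pos
    if (it == L.begin()) return *it;          // everything lies to the right of pos
    if (it == L.end())   return *(it - 1);    // everything lies to the left of pos
    int right = *it, left = *(it - 1);
    return (right - pos < pos - left) ? right : left;
}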
An even better solution is to iterate simultaneously over A and L like in a merge algorithm. Indeed, the non-zero indices of A and the items of L are both sorted, so the binary search is not even needed. When the index of the current non-zero item of A is bigger than the current item of L, you can iterate to the next value of L (and memorize the last value of L discarded, which is needed to compute the closest value). This algorithm runs in O(n m) (optimal). An efficient CPU implementation consists in computing chunks of rows in parallel across many threads.
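A per-row sketch of that merge-like scan (plain C++ with a hypothetical layout: a is the 0/1 row of A, bIdx the sorted column indices of the non-zero entries of the matching row of B, b the row of B, m the row length, c the output row):

// Sketch: O(m) scan of one row. For every position j with a[j] == 1, pick the
// value of b at the nearest index in bIdx, remembering the last index passed.
void nearest_per_row(const bool* a, const double* b,
                     const int* bIdx, int bCount, int m, double* c)
{
    int k = 0;                                  // next candidate in bIdx
    for (int j = 0; j < m; j++) {
        c[j] = 0.0;                             // default when B's row has no non-zero entry
        if (!a[j]) continue;
        while (k < bCount && bIdx[k] < j) k++;  // advance until bIdx[k] >= j
        int best = -1;
        if (k < bCount) best = bIdx[k];         // nearest index at or to the right of j
        if (k > 0 && (best < 0 || j - bIdx[k - 1] <= best - j))
            best = bIdx[k - 1];                 // the last index passed may be closer
        if (best >= 0) c[j] = b[best];
    }
}

On the GPU the same scan could be executed with one thread per row, which is what the next paragraph discusses.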
On a GPU, things are more complex since all the previously described algorithms are not SIMD-friendly. Computing a row in a SIMD-friendly way turns out to be complex and generally inefficient (the overhead can be higher than the serial algorithm on a CPU). One possible solution would be to compute rows in parallel (1 thread per row) and transpose the matrix block by block in shared memory so as to perform SIMD-friendly memory accesses after that (assuming there is enough space). The non-zero values of A and B certainly need to be extracted first so as to avoid thread divergence as much as possible. This solution works only if the number of non-zero values is relatively uniform between the rows (otherwise I doubt a GPU can actually be helpful). Note that the overhead of the transposition can be significant compared to the computation. Thus, I am not sure it will be faster than a CPU-based solution. In fact, if the data lies in CPU memory, then just transferring it to the GPU will certainly be more expensive than computing the result on the CPU in parallel.

Kmeans clustering acceleration in GPU(CUDA)

I am a fairly new CUDA user. I'm practicing on my first CUDA application, where I try to accelerate the k-means algorithm using a GPU (GTX 670).
Briefly, each thread works on a single point, which is compared to all cluster centers, and the point is assigned to the center with minimum distance (the kernel code can be seen below, with comments).
According to Nsight Visual Studio, I have an occupancy of 99.61% (1024 blocks, 1024 threads per block), 99.34% streaming multiprocessor activity, 79.98% warp issue efficiency, no shared memory bank conflicts, 18.4 GFLOPs single MUL and 55.2 GFLOPs single ADD (it takes about 14.5 ms to complete the kmeans kernel with the given parameters).
According to Wikipedia, the GTX 670's peak performance is 2460 GFLOPs. I am nowhere close to it. In addition to this, some papers claim they can achieve more than half of the peak performance. I cannot see how I can optimize this kernel code further. Is there any optimization that I can apply to the kernel? Any suggestion or help is appreciated, and I can give any additional information on demand.
Complete Code
Thanks in advance.
#define SIZE 1024*1024 //number of points
#define CENTERS 32 //number of cluster centroids
#define DIM 8 //dimension of each point and center
#define cudaTHREADSIZE 1024 //threads per block
#define cudaBLOCKSIZE SIZE/cudaTHREADSIZE //number of blocks for kernel
__global__ void kMeans(float *dp, float *dc, int *tag, int *membershipChangedPerBlock)
{
    //TOTAL NUMBER OF THREADS SHOULD BE EQUAL TO THE NUMBER OF POINTS, BECAUSE EACH THREAD WORKS ON A SINGLE POINT
    __shared__ unsigned char membershipChanged[cudaTHREADSIZE];
    __shared__ float dc_shared[CENTERS*DIM];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;
    membershipChanged[threadIdx.x] = 0;
    //move centers to shared memory, because each and every thread will call it(roughly + %10 performance here)
    while(threadID < CENTERS*DIM){
        dc_shared[threadID] = dc[threadID];
        threadID += blockDim.x;
    }
    __syncthreads();
    while(tid < SIZE){
        int index, prevIndex;
        float dist, min_dist;
        index = 0; //all initial point indices(centroid number) are assigned to 0.
        prevIndex = 0;
        dist = 0;
        min_dist = 0;
        //euclid distance for center 0
        for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
            min_dist += (dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[dimIdx*CENTERS]);
        }
        //euclid distance for other centers with distance comparison
        for(int centerIdx = 1; centerIdx < CENTERS; centerIdx++){
            dist = 0;
            for(int dimIdx = 0; dimIdx < DIM; dimIdx++){
                dist += (dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS])*(dp[tid + dimIdx*SIZE] - dc_shared[centerIdx + dimIdx*CENTERS]);
            }
            //compare distances, if found a shorter one, change index to that centroid number
            if(dist < min_dist){
                min_dist = dist;
                index = centerIdx;
            }
        }
        if (tag[tid] != index) { //if a point's cluster membership changes, flag it as changed in order to compute total membership changes later on
            membershipChanged[threadIdx.x] = 1;
        }
        tag[tid] = index;
        __syncthreads(); //sync before applying sum reduction to membership changes
        //sum reduction
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s) {
                membershipChanged[threadIdx.x] += membershipChanged[threadIdx.x + s];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            membershipChangedPerBlock[blockIdx.x] = membershipChanged[0];
        }
        tid += blockDim.x * gridDim.x;
    }
}
My advice is to compare your work with a more experienced GPU developer's work. I found a k-means implementation written by Bryan Catanzaro after watching this video. You can find the source code here:
https://github.com/bryancatanzaro/kmeans
I am also a beginner, but IMHO it is better to use libraries like Thrust. GPU programming is a really complicated issue and it is hard to achieve maximum performance; Thrust will help you with that.
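As one concrete example of what a library can take off your hands: the hand-written per-block membership-change reduction at the end of the kernel above could be replaced by a single Thrust call over one flag per point. A minimal sketch, assuming a hypothetical device array d_changed of SIZE ints (0 or 1) written by the kernel instead of the per-block sums:

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>

// Sketch: sum the "membership changed" flags (one int per point) on the device.
int count_membership_changes(int *d_changed, int size)
{
    thrust::device_ptr<int> p = thrust::device_pointer_cast(d_changed);
    return thrust::reduce(p, p + size, 0);   // runs on the GPU, returns the total to the host
}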
Check out rapids.ai cuML, which replicates the scikit-learn API.
Example from docs:
from cuml.cluster import KMeans
import cudf
import numpy as np
import pandas as pd

def np2cudf(df):
    # convert numpy array to cuDF dataframe
    df = pd.DataFrame({'fea%d' % i: df[:, i] for i in range(df.shape[1])})
    pdf = cudf.DataFrame()
    for c, column in enumerate(df):
        pdf[str(c)] = df[column]
    return pdf

a = np.asarray([[1.0, 1.0], [1.0, 2.0], [3.0, 2.0], [4.0, 3.0]],
               dtype=np.float32)
b = np2cudf(a)
print("input:")
print(b)

print("Calling fit")
kmeans_float = KMeans(n_clusters=2)
kmeans_float.fit(b)

print("labels:")
print(kmeans_float.labels_)
print("cluster_centers:")
print(kmeans_float.cluster_centers_)

Parallel Reduction in CUDA for calculating primes

I have a code to calculate primes which I have parallelized using OpenMP:
#pragma omp parallel for private(i,j) reduction(+:pcount) schedule(dynamic)
for (i = sqrt_limit+1; i < limit; i++)
{
    check = 1;
    for (j = 2; j <= sqrt_limit; j++)
    {
        if ( !(j&1) && (i&(j-1)) == 0 )
        {
            check = 0;
            break;
        }
        if ( j&1 && i%j == 0 )
        {
            check = 0;
            break;
        }
    }
    if (check)
        pcount++;
}
I am trying to port it to the GPU, and I want to reduce the count as I did in the OpenMP example above. The following is my code, which, apart from giving incorrect results, is also slower:
__global__ void sieve ( int *flags, int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x, j;
    __shared__ int s_flags[NTHREADS];
    if (gid > sqrootN && gid < N)
        s_flags[tid] = flags[gid];
    else
        return;
    __syncthreads();
    s_flags[tid] = 1;
    for (j = 2; j <= sqrootN; j++)
    {
        if ( gid%j == 0 )
        {
            s_flags[tid] = 0;
            break;
        }
    }
    //reduce
    for(unsigned int s=1; s < blockDim.x; s*=2)
    {
        if( tid % (2*s) == 0 )
        {
            s_flags[tid] += s_flags[tid + s];
        }
        __syncthreads();
    }
    //write results of this block to the global memory
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}
First of all, how do I make this kernel fast? I think the bottleneck is the for loop, and I am not sure how to replace it. Next, my counts are not correct. I did change the '%' operator and noticed some benefit.
In the flags array, I have marked the primes from 2 to sqroot(N), in this kernel I am calculating primes from sqroot(N) to N, but I would need to check whether each number in {sqroot(N),N} is divisible by primes in {2,sqroot(N)}. The o_flags array stores the partial sums for each block.
EDIT: Following the suggestion, I modified my code (I understand the comment about syncthreads better now); I realized that I do not need the flags array and that just the global indices work in my case. What concerns me at this point is the slowness of the code (more than correctness), which could be attributed to the for loop. Also, after a certain data size (100000), the kernel was producing incorrect results for subsequent data sizes. Even for data sizes less than 100000, the GPU reduction results are incorrect (a member of the NVIDIA forum pointed out that this may be because my data size is not a power of 2).
So there are still three (may be related) questions -
How could I make this kernel faster? Is it a good idea to use shared memory in my case where I have to loop over each tid?
Why does it produce correct results only for certain data sizes?
How could I modify the reduction?
__global__ void sieve ( int *o_flags, long int sqrootN, long int N )
{
    unsigned int gid = blockIdx.x*blockDim.x+threadIdx.x, tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];
    s_flags[tid] = 1;
    for (unsigned int j=2; j<=sqrootN; j++)
    {
        if ( gid % j == 0 )
            s_flags[tid] = 0;
    }
    __syncthreads();
    //reduce
    reduce(s_flags, tid, o_flags);
}
While I profess to know nothing about sieving for primes, there are a host of correctness problems in your GPU version which will stop it from working correctly irrespective of whether the algorithm you are implementing is correct or not:
__syncthreads() calls must be unconditional. It is incorrect to write code where branch divergence could leave some threads within the same warp unable to execute a __syncthreads() call. The underlying PTX is bar.sync and the PTX guide says this:
Barriers are executed on a per-warp basis as if all the threads in a warp are active. Thus, if any thread in a warp executes a bar instruction, it is as if all the threads in the warp have executed the bar instruction. All threads in the warp are stalled until the barrier completes, and the arrival count for the barrier is incremented by the warp size (not the number of active threads in the warp). In conditionally executed code, a bar instruction should only be used if it is known that all threads evaluate the condition identically (the warp does not diverge). Since barriers are executed on a per-warp basis, the optional thread count must be a multiple of the warp size.
Your code unconditionally sets s_flags to one after conditionally loading some values from global memory. Surely that cannot be the intent of the code?
The code lacks a synchronization barrier between the sieving code and the reduction, this can lead to a shared memory race and incorrect results from the reduction.
If you are planning on running this code on a Fermi class card, the shared memory array should be declared volatile to prevent compiler optimization from potentially breaking the shared memory reduction.
If you fix those things, the code might work. Performance is a completely different issue. Certainly on older hardware, the integer modulo operation was very, very slow and not recommended. I can recall reading some material suggesting that Sieve of Atkin was a useful approach to fast prime generation on GPUs.
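Putting those fixes together, a minimal sketch (not the poster's code) of what the trial-division kernel from the edit could look like: every thread writes its shared slot, the barriers are unconditional, there is a barrier between sieving and reduction, and the shared array is volatile. NTHREADS is assumed to equal the block size and to be a power of two; the per-block partial sums in o_flags still need a final summation on the host or in a second kernel.

__global__ void sieve(int *o_flags, long int sqrootN, long int N)
{
    long int gid = blockIdx.x * (long int)blockDim.x + threadIdx.x;
    int tid = threadIdx.x;
    volatile __shared__ int s_flags[NTHREADS];

    // every thread writes something, so the barriers below are reached by all threads
    int my_flag = 0;
    if (gid > sqrootN && gid < N) {
        my_flag = 1;
        for (long int j = 2; j <= sqrootN; j++) {
            if (gid % j == 0) { my_flag = 0; break; }
        }
    }
    s_flags[tid] = my_flag;
    __syncthreads();               // all sieving finished before the reduction starts

    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            s_flags[tid] += s_flags[tid + s];
        __syncthreads();           // unconditional: every thread reaches it
    }
    if (tid == 0)
        o_flags[blockIdx.x] = s_flags[0];
}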

CUDA reduction using registers

I need to calculate the mean values of N signals using reduction. The input is a 1D array of size M*N, where M is the length of each signal.
Originally I had additional shared memory to first copy the data and do the reduction on each signal. However, the original data is corrupted.
My program tries to minimize shared memory usage. So I was wondering how I can use registers to do a reduction sum on N signals. I have N threads and a shared memory array (float) s_m[N*M]; elements 0...M-1 are the first signal, and so on.
Do I need N registers (or just one) to store the mean values of the N different signals? (I know how to do it with sequential addition using multi-threaded programming and one register.) The next step I want to do is subtract each signal's mean from every value of that signal.
Your problem is very small (N = 32 and M < 128). However, some guidelines:
Assuming you are reducing across the M values of each of the N signals:
If N is very large (> 10s of thousands), just do the reductions over M sequentially in each thread.
If N is < 10s of thousands, consider using one warp or one thread block to perform each of the N reductions (a warp-shuffle sketch follows after this list).
If N is very small but M is very large, consider using multiple thread blocks per each of the N reductions.
If N is very small and M is very small (as your numbers are), only consider using the GPU for the reductions if the computations that generate and/or consume the input/output of the reductions are also running on the GPU.
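As a sketch of the "one warp (or block) per reduction" option above (not the poster's code; it assumes float data, one warp per signal and, for brevity, M ≤ 32):

// Sketch: one 32-thread block per signal; the sum is reduced entirely in
// registers with warp shuffles, so no shared memory is needed.
__global__ void warp_mean_subtract(float *signals, int M)
{
    int signal = blockIdx.x;                 // one warp per signal
    int lane   = threadIdx.x;                // 0..31
    float v = (lane < M) ? signals[signal * M + lane] : 0.0f;

    // tree reduction inside the warp
    float sum = v;
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xFFFFFFFF, sum, offset);
    float mean = __shfl_sync(0xFFFFFFFF, sum, 0) / M;   // broadcast lane 0's total

    if (lane < M)
        signals[signal * M + lane] = v - mean;          // subtract the signal's mean
}

// e.g. warp_mean_subtract<<<N, 32>>>(d_signals, M);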
Based on my understanding of the question, I say that you don't need N registers to store the mean values of N different signals.
If you already have N threads [given that each thread does the reduction of only one signal], then you don't need N registers to store the reduction of one signal. All you need is one register to store the mean value.
dim3 threads (N, 1);
reduction<<<1, threads>>>(signals); // one block of N threads; signals is the [N*M] array

__global__ void reduction (int *signals)
{
    int id = threadIdx.x;
    float meanValue = 0.0;
    for (int i = 0; i < M; i++)
        meanValue += signals[id*M + i];   // accumulate this signal's values
    meanValue = meanValue / M;
    // Then do the subtraction
    for (int i = 0; i < M; i++)
        signals[id*M + i] -= meanValue;
}
If you need to do a kind of global reduction of all the mean values of the N different signals, then you need to use 2 registers [one to store the local mean and another to store the global mean] and shared memory:
dim3 threads (N, 1);
reduction<<<1, threads>>>(signals); // one block of N threads; signals is the [N*M] array

__global__ void reduction (int *signals)
{
    __shared__ float means[N]; // shared value
    int id = threadIdx.x;
    float meanValue = 0.0;
    float globalMean = 0.0;
    for (int i = 0; i < M; i++)
        meanValue += signals[id*M + i];
    means[id] = meanValue / M;
    __syncthreads();
    // do the global reduction
    for (int i = 0; i < N; i++)
        globalMean += means[i];
    globalMean = globalMean / N;
    // Then do the subtraction
    for (int i = 0; i < M; i++)
        signals[id*M + i] -= globalMean;
}
I hope this helps you. Any doubts, let me know.