Number of threads calculating a single value - CUDA

I am using CUDA with compute capability 1.2. Each element of my output matrix is calculated as the addition of the corresponding elements of two other matrices, with one thread computing one element. I want to know whether it is possible to use 2 threads to calculate a single value. If it is possible, can anyone tell me how to use 2 different threads of the same block to calculate a single value?

If you need to calculate
q = m2[i][k] + m2[(k+1)][j] + p1[(i-1)]*p1[k]*p1[j];
by two cores, use a wider variable and fewer iterations, e.g. int2:
__shared__ int2 m2[N][N],p1[N],q;
could use two cores but not two threads. If you insist on two threads,
qThread1 = m2[i][k] + m2[(k+1)][j] //in a kernel
...
...
...
qThread2 = p1[(i-1)]*p1[k]*p1[j] //in another kernel
Then you simply add them into q with another thread. Synchronization, kernel launch overhead, and worse cache utilization can decrease performance, as can the reduced instruction-level parallelism. Occupancy may increase, but it is not clear that this compensates for the negatives above.
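For illustration, here is a minimal sketch (not the asker's actual kernel; the element-wise addition, the block size of 256, and all names are assumptions) in which two threads of the same block each compute one partial term of a single output element, exchange them through shared memory, and one thread of the pair writes the sum:

__global__ void pairAdd(const int *a, const int *b, int *out, int n)
{
    __shared__ int partial[2][128];            // [term][element within this block]; assumes blockDim.x == 256

    int pair = threadIdx.x / 2;                // which output element this thread helps compute
    int half = threadIdx.x % 2;                // 0 -> contributes a[i], 1 -> contributes b[i]
    int i = blockIdx.x * (blockDim.x / 2) + pair;

    if (i < n)
        partial[half][pair] = (half == 0) ? a[i] : b[i];   // each thread computes one partial term
    __syncthreads();                                       // make both terms visible to the pair

    if (i < n && half == 0)                                // one thread of the pair combines and writes
        out[i] = partial[0][pair] + partial[1][pair];
}

For an operation this cheap, the extra synchronization will almost certainly make this slower than one thread per element, for the reasons given above.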

Chisel Synchronization

Here is a description of one of the states in my state machine. What I would like to do is to go to the next state after the for loops.
is(s_multiplier){
  when(ready){state := s_ready}
  // Initialization of C memory to 0
  for(i <- 0 to matrixSize - 1){
    for(j <- 0 to matrixSize - 1){
      memC.write(i + j, 0.asSInt((2 * cellSize).W))
    }
  }
  // Objective 1 : Multiplication for the 128X128
  // Objective 2 : Multiplication for the n.m and m.p size parameters given
  for(i <- 0 to matrixSize - 1){
    for(j <- 0 to matrixSize - 1){
      sum := 0.asSInt(cellSize.W)
      for(k <- 0 to matrixSize - 1){
        sum = sum + memA.read(i * matrixSize + k, true.B) * memB.read(k * matrixSize + j, true.B)
      }
      memC.write(i * matrixSize + j, sum)
    }
  }
  ready := true.B
}
I just created a boolean variable ready that I set to true after the loops. But since everything is supposed to be executed in parallel, I don't think my code is correct :/
There is a fundamental difference between writing software algorithms and using Chisel to construct the hardware necessary to perform equivalent calculations.
Before discussing the matrix multiplication, consider (as a simpler example) your memory initialization loop. The way you have written it makes sense, but for hardware, every time the inner body of the loop is executed, the hardware necessary to initialize that memory cell is added to the hardware graph. That means you have created the wires necessary to initialize 16384 memory locations all at the same time. That's a lot of wires. Not only that, it would require a memory with 16384 write ports (you probably can't find one). Your hardware would initialize all this memory in one clock cycle, which is good, but at the cost of an enormous number of gates.
Typically one would initialize memory over a number of clock cycles, thereby reducing the amount of hardware required.
Similarly, in the matrix multiplication section you are generating all the hardware necessary to compute a matrix multiplication in one clock cycle. This is great for performance, but this approach requires 2,097,152 hardware multipliers plus a further large number of adders. Every * and + operation in the inner loop generates hardware. Multiplying two 32-bit numbers takes roughly 1024 gates.
The way to go about this is to figure out how to break the problem into stages. Maybe this would be a module that can multiply one row by one column and sum the total. You would then use registers to work your way through the matrix, keeping track of the row and column in order to compute the value at every point in the result matrix. To reduce the number of hardware elements, you instead perform the calculation over multiple clock cycles, keeping state information (indices of the rows and columns) about the progress of the calculation in registers or in memory.
There are a lot of ways to try to optimize a function like this, and Chisel is a great language for experimenting with and testing out tactics.
Maybe you want to make the memory very wide to accommodate getting multiple cell values at once.
Maybe you will unroll your loop a bit more to compute multiple cell values at once by having more than one cell calculator.
Clever iteration strategies can optimize your memory accesses for both reading and writing.
The point is that writing hardware is not necessarily harder than writing software (and Chisel helps there), but the approach is quite different.
I would recommend you spend a little more time with the Chisel bootcamp. The 2.3_control_flow page's section on sorting is pretty similar to the discussion above. You can write a one-cycle sorter, but the size of the hardware to do it grows rapidly; in practice it is necessary to break the problem into pieces and spread the calculation over multiple cycles.
Good luck.

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is easy-peasy: Just place all of thread i's data in shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of i's data in the same bank, which is also distinct from the other lanes' banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).
This is achievable in practice with the placement pattern analogous to the one the OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as arr[idx * 32 + i]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes of the array (utilizing all banks)
The higher 16 lanes obtain their data from the successive 32*4 bytes (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on @RobertCrovella's comments.)
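As a concrete (hypothetical) sketch of that placement - the array name, N, and the dummy values are assumptions for illustration:

#include <cstdint>

__global__ void interleaved64(uint64_t *out)
{
    const int N = 8;                                 // elements owned by each lane (assumed)
    __shared__ uint64_t arr[N * 32];                 // lane i's idx-th element lives at arr[idx * 32 + i]

    int lane = threadIdx.x % 32;

    for (int idx = 0; idx < N; idx++)
        arr[idx * 32 + lane] = lane * 100ull + idx;  // a full-warp access splits into two transactions
    __syncthreads();

    out[threadIdx.x] = arr[3 * 32 + lane];           // e.g. every lane reads its own element 3
}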
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit value, two 32-bit accesses are made, and the T is composed with something like:
// upper_half and lower_half are the two 32-bit words read from shared memory
uint64_t joined = (static_cast<uint64_t>(upper_half) << 32) | lower_half;
auto& my_t_value = reinterpret_cast<T&>(joined);
and the same in reverse when writing.
As the comments suggest, it is better to make 64-bit accesses, as described in the answer above.

CUDA: 2 threads from different warps but same block attempt to write into same SHARED memory position: dangerous?

Will this lead to inconsistencies in shared memory?
My kernel code looks like this (pseudocode):
__shared__ uint histogram[32][64];

uint threadLane = threadIdx.x % 32;

for (data) {
    histogram[threadLane][data]++;
}
Will this lead to collisions, given that, in a block with 64 threads, threads with id "x" and "(x + 32)" will very often write into the same position in the matrix?
This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.
It is dangerous. It leads to race conditions.
If you cannot guarantee that each thread within a block has unique write access to a location in shared memory, then you have a problem that you need to solve with synchronization.
Take a look at this paper for a correct and efficient way of using SM for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf
Note that there are plenty of libraries that allow you to compute histograms in one line, Thrust for instance.
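As a hedged sketch of one simple way to remove the race (not the technique from the linked paper): make the shared-memory increments atomic. Everything beyond the __shared__ declaration and the threadLane computation quoted in the question is an assumption here, the data values are assumed to be in [0, 64), and shared-memory atomics require compute capability 1.2 or newer.

__global__ void histogram64Kernel(const unsigned int *data, int n, unsigned int *globalHist)
{
    __shared__ unsigned int histogram[32][64];

    unsigned int threadLane = threadIdx.x % 32;

    // Cooperatively zero the shared sub-histograms.
    for (int i = threadIdx.x; i < 32 * 64; i += blockDim.x)
        ((unsigned int *)histogram)[i] = 0;
    __syncthreads();

    // atomicAdd on shared memory serializes the conflicting increments from
    // threads x and x + 32, so no updates are lost.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        atomicAdd(&histogram[threadLane][data[i]], 1u);   // values assumed to be < 64
    __syncthreads();

    // Merge the 32 per-lane sub-histograms into the global result.
    for (int bin = threadIdx.x; bin < 64; bin += blockDim.x)
    {
        unsigned int sum = 0;
        for (int lane = 0; lane < 32; lane++)
            sum += histogram[lane][bin];
        atomicAdd(&globalHist[bin], sum);
    }
}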

Is there a performance penalty for CUDA method not running in sync?

If I have a kernel which looks back over the last X minutes and calculates the average of all the values in a float[], would I experience a performance drop if all the threads are not executing the same line of code at the same time?
eg:
Say at x = 1500, there are 500 data points spanning the last 2-hour period.
At x = 1510, there are 300 data points spanning the last 2-hour period.
The thread at x = 1500 will have to look back 500 places, yet the thread at x = 1510 only looks back 300, so the latter thread will move on to the next position before the first thread is finished.
Is this typically an issue?
EDIT: Example code. Sorry, but it's in C# as I was planning to use CUDAfy.net. Hopefully it gives a rough idea of the type of programming structures I need to run (the actual code is more complicated, but similar in structure). Any comments regarding whether this is suitable for a GPU/coprocessor or just a CPU would be appreciated.
public void PopulateMeanArray(float[] data)
{
    float lookFwdDistance = 108000000000f;
    float lookBkDistance = 12000000000f;
    int counter = thread.blockIdx.x * 1000; //Ensures unique position in data is written to (assuming i have less than 1000 entries).
    float numberOfTicksInLookBack = 0;
    float sum = 0; //Stores the sum of difference between two time ticks during x min look back.

    //Note: Time difference between each time tick is not consistent, therefore different value of numberOfTicksInLookBack at each position.
    //Thread 1 could be working here.
    for (float tickPosition = SDS.tick[thread.blockIdx.x]; SDS.tick[tickPosition] < SDS.tick[(tickPosition + lookFwdDistance)]; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;

        //Thread 2 could be working here. Is this warp divergence?
        for (float pastPosition = tickPosition - 1; SDS.tick[pastPosition] > (SDS.tick[tickPosition - lookBkDistance]); pastPosition--)
        {
            sum += SDS.tick[pastPosition] - SDS.tick[pastPosition + 1];
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}
CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.
In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
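As a toy illustration (made-up names, not the code from the question), each thread below runs its loop a data-dependent number of times; the warp can only retire the loop when its slowest lane finishes, with already-finished lanes masked off in the meantime:

__global__ void divergentLoops(const int *tripCount, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;

    for (int i = 0; i < tripCount[tid]; i++)   // e.g. 300 iterations for one lane, 500 for its neighbour
        acc += 1.0f / (i + 1);

    out[tid] = acc;                            // lanes with small counts idled while the others finished
}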
In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
Your algorithm is very simple and, as it stands, does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way, though.
A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.

CUDA finding the max value in given array

I tried to develop a small CUDA program to find the max value in the given array,
int input_data[0...50] = 1,2,3,4,5....,50
max_value is initialized with the first value, input_data[0],
The final answer is stored in result[0].
The kernel is giving 0 as the max value. I don't know what the problem is.
I executed it with 1 block of 50 threads.
__device__ int lock = 0;

__global__ void max(float *input_data, float *result)
{
    float max_value = input_data[0];
    int tid = threadIdx.x;

    if (input_data[tid] > max_value)
    {
        do {} while (atomicCAS(&lock, 0, 1));
        max_value = input_data[tid];
        __threadfence();
        lock = 0;
    }
    __syncthreads();
    result[0] = max_value; //Final result of max value
}
Even though there are built-in functions, I am just practicing on small problems.
You are trying to set up a "critical section", but this approach on CUDA can lead to a hang of your whole program - try to avoid it whenever possible.
Why does your code hang?
Your kernel (__global__ function) is executed by groups of 32 threads, called warps. All threads inside a single warp execute in lockstep. So the warp will spin in your do{} while(atomicCAS(&lock,0,1)) until every thread of the warp succeeds in obtaining the lock. But the lock exists precisely to prevent several threads from executing the critical section at the same time, so the warp can never make progress past the loop. This leads to a hang.
Alternative solution
What you need is a "parallel reduction algorithm". You can start reading here:
Parallel prefix sum # wikipedia
Parallel Reduction # CUDA website
NVIDIA's Guide to Reduction
Your code has a potential race. I'm not sure whether you defined the max_value variable in shared memory or not, but both cases are wrong.
1) If max_value is just a local variable, then each thread holds a local copy of it, which is not the actual maximum value (it is just the maximum of input_data[0] and input_data[tid]). In the last line of the code, all threads write their own max_value to result[0], which results in undefined behavior.
2) If max_value is a shared variable, 49 threads will enter the if-statement block, and they will try to update max_value one at a time using the lock. But the order of execution among the 49 threads is not defined, and therefore some threads may overwrite the actual maximum value with smaller values. You would need to compare against the maximum again within the critical section.
Max is a 'reduction' - check out the Reduction sample in the SDK, and do max instead of summation.
The white paper's a little old but still reasonably useful:
http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf
The final optimization step is to use 'warp synchronous' coding to avoid unnecessary __syncthreads() calls.
It requires at least 2 kernel invocations - one to write a bunch of intermediate max() values to global memory, then another to take the max() of that array.
If you want to do it in a single kernel invocation, check out the threadfenceReduction SDK sample. That uses __threadfence() and atomicAdd() to track progress, then has 1 block do a final reduction when all blocks have finished writing their intermediate results.
Variables have different scopes depending on how they are declared. When you declare a variable with __device__, it is placed in GPU global memory and is accessible by all threads in the grid. __shared__ places the variable in the block's shared memory, where it is accessible only by the threads of that block. If you use no qualifier at all, as in float max_value, the variable is placed in thread registers and can be accessed only by that thread. In your code, each thread has its own local max_value and cannot see the copies held by other threads.
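For completeness, here is a minimal sketch of the kind of block-level reduction the answers describe (this is not the SDK sample itself; it assumes a single block whose size is a power of two and at least n, e.g. maxReduce<<<1, 64>>>(input_data, result, 50) for the 50-element input):

#include <cfloat>

__global__ void maxReduce(const float *input_data, float *result, int n)
{
    __shared__ float sdata[256];                             // assumes blockDim.x <= 256

    int tid = threadIdx.x;
    sdata[tid] = (tid < n) ? input_data[tid] : -FLT_MAX;     // pad the unused lanes with a very small value
    __syncthreads();

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (tid < stride)
            sdata[tid] = fmaxf(sdata[tid], sdata[tid + stride]);
        __syncthreads();
    }

    if (tid == 0)
        result[0] = sdata[0];                                // the block's maximum
}

For inputs larger than one block, you would either run this per block and then reduce the per-block maxima in a second kernel launch, or use the single-kernel __threadfence() approach mentioned above.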