CUDA: how to represent efficiently 2-D arrays on the GPU - cuda

I need to process a 2-D array with dimensions K x N on the GPU, where K is a small number (3, 4, or 5) and N has a value of millions to 100s of millions.
The processing will be done for one column of K elements at a time, such that each column will be processed by a separate invocation of a kernel.
What is the most efficient way to represent the K x N array on the GPU:
1) in a 1-D array, placing the K elements of a column in consecutive locations, so that each thread will process elements K*thread_id, K*thread_id + 1, ..., K*thread_id + K - 1
2) as K separate 1-D arrays, where each array stores 1 row of the original array;
3) something else
Thank you!

The option 2 is better for your case.
The data layout of your option 2 can be seen as the structure of arrays (SoA), while the option 1 is the array of structures (AoS).
Generally the SoA is better than the AoS for GPU programming. There are a lot of discussion on this topic showing why SoA performs better.
http://developer.download.nvidia.com/CUDA/training/introductiontothrust.pdf
http://my.safaribooksonline.com/book/-/9780123884268/chapter-6dot-efficiently-using-gpu-memory/st0045_b9780123884268000069
Since each thread accesses the K elements one by one, AoS layout in your option 1 leads to strided memory access issure and can hurt the performance, which is discussed as follows.
https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
Although this issue could be relaxed by a large enough L2 cache in your case, avoiding AoS is a more robust way to get higher performance.

Related

How to copy a small part of an array from global memory into the local memory of a thread

I have an nxn matrix A (which is stored row by row in an array) with floats. n is divisible by 4. I now want to output a matrix B which is (n/4)x(n/4). In B i want to store the arithmetic middle of all the values of a 4 by 4 sub matrix in A. The submatrixes do not overlap. E.g. If A were 16x16, B would be 4x4, since A has 4 submatrixes.
The thing is, that i want to you use 1 thread per submatrix and i want to have the memory access as coalescent as possible. My idea was to create a kernel and in thread 0 read the values A[0],A[1],A[2],A[3] and in thread 2 A[4],A[5],A[6],A[7]. The thing is, that if i execute these reads after one another (this code only works for the first row of the matrix) e.g.:
float * firstRow[4];
for (int i = 0; i<4; i++):
firstRow[i] = A[threadIdx.x * 4 + i];
the reads would not be coalescent. i would need something like cudaMemcpy but from global memory to a threads memory.
I am not sure if i could copy the whole array into shared memory since it is 4 times bigger than the number of threads. Nevertheless, i want to know the answer to my original question.

Chisel Synchronization

Here is a description of one of the states in my state machine. What I would like to do is to go to the next state after the for loops.
is(s_multiplier){
when(ready){state := s_ready}
// Initialization of C memory to 0
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
memC.write(i + j, 0.asSInt((2 * cellSize).W))
}
}
// Objective 1 : Multiplication for the 128X128
// Objective 2 : Multiplication for the n.m and m.p size parameters given
for(i <- 0 to matrixSize - 1){
for(j <- 0 to matrixSize - 1){
sum := 0.asSInt(cellSize.W)
for(k <- 0 to matrixSize - 1){
sum = sum + memA.read(i * matrixSize + k, true.B) * memB.read(k * matrixSize + j, true.B)
}
memC.write(i * matrixSize + j, sum)
}
}
ready := true.B
}
I just created a boolean variable ready that I put to true after the loops. But as everything is supposed to be executed in parallel, I Don't think that my code is correct :/
There is a fundamental difference between writing software algorithms and using chisel to construct the hardware necessary to perform equivalent calculations.
Before discussing the matrix multiplication, consider (as a simpler example) your memory initialization operation loop. The way you have done it makes sense, but for hardware every time the inner body of the loop is executed the hardware necessary to init that memory cell is added to the hardware graph. That means you have created the necessary wires to initialize 16384 memory locations all at the same time. That a lot of wires. Not only that, it would require a memory that has 16384 write ports (you probably can't find that). Your hardware would initialize all this memory in one clock cycle, which is good, but by devoting an enormous number of gates to do so.
Typically one would initialize memory over a number of clock cycles and in this way reducing the amount of hardware required.
Similarly in the matrix multiplication section you are generating all the hardware necessary to compute a matrix multiplication in 1 clock cycle. This is great for performance but the number of multiplications required for this approach is 2,097,152 hardware multipliers plus a further large number of adders. Every * and + operation in the inner loop generates hardware. The number of gates required to multiply two 32 bit numbers is roughly 1024 gates.
The way to go about this is to figure out a way of breaking down the problem into stages. Maybe this would be module that can multiply one row by one column and sum the total. You would then need to use registers to work your way through the matrix, keeping track of the row and columns in order to compute the value at every point in the result matrix. In order to reduce the number of hardware elements you instead perform the calculation over multiple clock cycles keeping state information (indices to the rows and columns) on the progress of the calculation in registers or in memory.
There's a lot of ways to try and optimize a function this and Chisel is a great language for experimenting and testing out tactics.
Maybe you want to make the memory very wide to accommodate getting multiple cell values at once.
Maybe you will unroll your loop a bit more to compute multiple cell values at once by having more than one cell calculator.
Clever iteration strategies can optimize your memory accesses for both reading and writing.
The point is that writing hardware is not necessary harder than writing software (and Chisel helps there) but it is pretty different in the approach.
I would recommend you spend a little more time with Chisel bootcamp. The 2.3_control_flow page's section on sorting is pretty similar with respect to the discussion above. You can write a one cycle sorter but the size of the hardware to do it grows rapidly, in practice it is necessary to break the problem into pieces and spread the calculation over multiple cycles.
Good luck.

Strategy for minimizing bank conflicts for 64-bit thread-separate shared memory

Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
auto thread_idx = determine_thread_index_into_its_own_array();
T value = calculate_value();
write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is is easy-peasy: Just place all of thread i's data in shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of i's data in the same bank, that's also distinct from the other lane's banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).
This is is achievable in practice by the analogous placement pattern OP described for 32-bit data. That is, for T* arr, have lane i read the idx'th element as T[idx + i * 32]. This will compile so that two transactions occur:
The lower 16 lanes obtain their data from the first 32*4 bytes in T (utilizing all banks)
The higher 16 obtain their data from the successive 32*4 bytes in T (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on #RobertCrovella's comments.)
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit values, 2 half-T accesses are made, and the T is composed with something like `
uint64_t joined =
reinterpret_cast<uint32_t&>(&upper_half) << 32 +
reinterpret_cast<uint32_t&>(&lower_half);
auto& my_t_value = reinterpret_cast<T&>(&joined);
and the same in reverse when writing.
As comments suggest, it is better to make 64-bit access, as described in this answer.

How do I calculate variance of gpu_array?

I am trying to compute the variance of a 2D gpu_array. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, this documentation implies that reduction kernels just reduce 2 arrays into 1 array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In matlab, the variance function on a 2D array returns a vector (1D-array) of values. But it sounds like you want a single-valued variance, so as others have already suggested, probably the first thing to do is to treat the 2D-array as 1D. In C we won't require any special steps to accomplish this. If you have a pointer to the array you can index into it as if it were a 1D array. I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming a function like variance(x)=sum((x[i]-mean(x))^2) where the sum is over all i, is what you're after (based on my read of the wikipedia article ). We can break this down into 3 steps:
compute the mean (this is a classical reduction - one value is produced for the data set - sum all elements then divide by the number of elements)
compute the value (x[i]-mean)^2 for all i - this is an element by element operation producing an output data set equal in size (number of elements) to the input data set
compute the sum of the elements produced in step 2 - this is another classical reduction, as one value is produced for the entire data set.
Both steps 1 and 3 are classical reductions which are summing all elements of an array. Rather than cover that ground here, I'll point you to Mark Harris' excellent treatment of the topic as well as some CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
#include <math.h>
__global__ void var(float *input, float *output, unsigned N, float mean){
unsigned idx=threadIdx.x+(blockDim.x*blockIdx.x);
if (idx < N) output[idx] = __powf(input[idx]-mean, 2);
}
Note that you will probably want to combine the reductions and the above code into a single kernel.

3 questions about alignment

The discussion is restricted to compute capability 2.x
Question 1
The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in the memory?
Question 2
The OP of Passing structs to CUDA kernels states that "the align part was unnecessary". But without alignment, access to that structure will be divided into two consecutive access to a and b. Right?
Question 3
struct
{
double x, y, z;
}Position
Suppose each thread is accessing the structure above:
int globalThreadID=blockIdx.x*blockDim.x+threadIdx.x;
Position positionRegister=positionGlobal[globalThreadID];
To optimize memory access, should I simply use three separate double variables x, y, z to replace the structure?
Thanks for your time!
(1) They are placed contiguously in memory.
(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
(3) Performance can often be improved by using an struct of arrays instead of an array of structs. This justs means that you pack all your x together in an array, then y and so on. This makes sense when you look at what happens when all 32 threads in a warp get to the point where, for instance, x is needed. By having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, this means that a single transaction can service all the threads if the value is a 32-bit word. The code example you gave might cause the compiler to keep the values in registers until they are needed.