How to copy a small part of an array from global memory into the local memory of a thread - CUDA

I have an n×n matrix A with float entries (stored row by row in an array). n is divisible by 4. I want to output an (n/4)×(n/4) matrix B in which each entry is the arithmetic mean of the values of one 4×4 submatrix of A. The submatrices do not overlap. E.g. if A were 16×16, B would be 4×4, since A contains 16 such submatrices.
The thing is that I want to use 1 thread per submatrix, and I want the memory accesses to be as coalesced as possible. My idea was to create a kernel in which thread 0 reads the values A[0], A[1], A[2], A[3] and thread 1 reads A[4], A[5], A[6], A[7]. The problem is that if I execute these reads one after another (this code only works for the first row of the matrix), e.g.:
float firstRow[4];
for (int i = 0; i < 4; i++)
    firstRow[i] = A[threadIdx.x * 4 + i];
the reads would not be coalesced. I would need something like cudaMemcpy, but from global memory into a thread's local memory.
I am not sure whether I could copy the whole array into shared memory, since it is 16 times bigger than the number of threads. Nevertheless, I want to know the answer to my original question.
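One way to keep the per-thread grouping while still coalescing the reads is a vectorized load: the four floats each thread needs are adjacent in memory, so they can be fetched with a single 16-byte float4 load. A minimal sketch, assuming A was allocated with cudaMalloc (and is therefore 16-byte aligned) and, like the snippet above, handling only a single row; the kernel and variable names are illustrative:

__global__ void averageRowSegments(const float *A, float *B, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n / 4) {
        // One 16-byte load brings A[4*tid] .. A[4*tid+3] into registers;
        // consecutive threads read consecutive float4s, so the warp's
        // accesses coalesce.
        float4 v = reinterpret_cast<const float4 *>(A)[tid];
        B[tid] = (v.x + v.y + v.z + v.w) * 0.25f;
    }
}

The full problem would repeat this load for the four rows of each 4×4 submatrix and average the four partial sums.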

Related

Number of Threads Calculating a single Value

I am using CUDA with compute capability 1.2. I am running CUDA code in which each element of a matrix is calculated by adding the corresponding elements of two other matrices. I am calculating the value of one element with one thread. I want to know whether it is possible to use 2 threads for calculating a single value. If it is possible, can anyone please tell me how to use 2 different threads of the same block to calculate a single value?
If you need to calculate
q = m2[i][k] + m2[(k+1)][j] + p1[(i-1)]*p1[k]*p1[j];
by two cores, use a wider variable and fewer iterations. With an int2:
__shared__ int2 m2[N][N],p1[N],q;
could use two cores but not two threads. If you insist on two threads,
qThread1 = m2[i][k] + m2[(k+1)][j];   // in one kernel
...
qThread2 = p1[(i-1)] * p1[k] * p1[j]; // in another kernel
Then you simply add them into q with another thread. Synchronization, kernel launch overhead, and cache utilization can decrease performance, as can the reduced instruction-level parallelism. Occupancy may increase, but it is not clear that this outweighs the negatives above.
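If the two terms are instead produced by two threads of the same block (rather than two kernels), they can be combined through shared memory. A minimal sketch, with hypothetical arrays a, b, c, d, e holding the operands of the additive and multiplicative halves of each q; the kernel name and launch shape are illustrative:

__global__ void pairSum(const int *a, const int *b,
                        const int *c, const int *d, const int *e,
                        int *q, int n)
{
    extern __shared__ int partial[];                    // one slot per thread
    int tid  = threadIdx.x;
    int pair = blockIdx.x * (blockDim.x / 2) + tid / 2; // output index

    int term = 0;
    if (pair < n)
        term = (tid % 2 == 0) ? a[pair] + b[pair]             // even thread: additive half
                              : c[pair] * d[pair] * e[pair];  // odd thread: multiplicative half
    partial[tid] = term;
    __syncthreads();

    if (pair < n && tid % 2 == 0)                       // one thread per pair combines
        q[pair] = partial[tid] + partial[tid + 1];
}

Launched, for example, as pairSum<<<(2 * n + 255) / 256, 256, 256 * sizeof(int)>>>(...), so each block of 256 threads produces 128 values of q. Whether this beats one thread per value still depends on the synchronization and occupancy trade-offs mentioned above.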

CUDA: how to represent efficiently 2-D arrays on the GPU

I need to process a 2-D array with dimensions K x N on the GPU, where K is a small number (3, 4, or 5) and N has a value of millions to 100s of millions.
The processing will be done for one column of K elements at a time, such that each column will be processed by a separate invocation of a kernel.
What is the most efficient way to represent the K x N array on the GPU:
1) in a 1-D array, placing the K elements of a column in consecutive locations, so that each thread will process elements K*thread_id, K*thread_id + 1, ..., K*thread_id + K - 1
2) as K separate 1-D arrays, where each array stores 1 row of the original array;
3) something else
Thank you!
Option 2 is better for your case.
The data layout of option 2 is a structure of arrays (SoA), while option 1 is an array of structures (AoS).
Generally, SoA is better than AoS for GPU programming. There is a lot of discussion on this topic showing why SoA performs better:
http://developer.download.nvidia.com/CUDA/training/introductiontothrust.pdf
http://my.safaribooksonline.com/book/-/9780123884268/chapter-6dot-efficiently-using-gpu-memory/st0045_b9780123884268000069
Since each thread accesses the K elements one by one, the AoS layout of option 1 leads to strided memory access and can hurt performance, as discussed here:
https://developer.nvidia.com/content/how-access-global-memory-efficiently-cuda-cc-kernels
Although this issue could be mitigated by a large enough L2 cache in your case, avoiding AoS is a more robust way to get higher performance.
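To make the difference concrete, here is a minimal sketch for K = 3 (the kernel and parameter names are illustrative, not from the question). With the option-2 layout each of the three loads is fully coalesced across a warp; with the option-1 layout consecutive threads read addresses 3 * sizeof(float) bytes apart, so every load is strided.

// Option 2 (SoA): one 1-D array per row; consecutive threads read
// consecutive floats, so every load coalesces.
__global__ void processSoA(const float *row0, const float *row1,
                           const float *row2, float *out, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < N)
        out[col] = row0[col] + row1[col] + row2[col];
}

// Option 1 (AoS): the K elements of a column sit in consecutive locations,
// so consecutive threads access addresses 12 bytes apart.
__global__ void processAoS(const float *data, float *out, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < N)
        out[col] = data[3 * col] + data[3 * col + 1] + data[3 * col + 2];
}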

Coalesce a set of odd numbers

I'm trying to understand coalescing global memory.
Say I'd like to load an odd set of floats to global memory. Each thread will process a set of 3 floats. Say these floats are A, B, and C.
A0, B0, C0
A1, B1, C1
A2, B2, C2
..
A19, B19, C19
So the threads would grab the data like such:
Thread 0: A0, B0, C0
Thread 1: A1, B1, C1
Thread 2: A2, B2, C2
..
Thread 19: A19, B19, C19
First approach:
I could load 3 arrays (float A[20]; float B[20]; float C[20];). I'd have to call cudaMemcpy() three different times to load the data into global memory. This approach would probably not coalesce very well.
Second approach:
A better approach would be something like:
struct dataPt { float A; float B; float C; };
dataPt data[20];
I could load the data with one cudaMemcpy(), but I'm not sure the memory access would coalesce very well.
Third approach:
struct dataPt2 { float A; float B; float C; float padding; };
dataPt2 data2[20];
or
struct __align__(16) dataPt3 { float A; float B; float C; };
dataPt3 data3[20];
I could load the data to global memory with a single cudaMemcpy(), and the thread access to data would be coalesced. (At the cost of wasted global memory.)
1) The 1st approach would not coalesce because each thread will probably need 3 bus cycles to load the input data.
2) The 2nd approach will coalesce for many of the threads, but there will be a few threads that will need two bus cycles to get the input data.
3) The 3rd approach will coalesce for all threads.
Is this accurate? Is there a significant difference between the 2nd and 3rd approaches? Is there an approach that uses the 3 thread dimensions (threadIdx.x, threadIdx.y, threadIdx.z)?
Just amplifying on talonmies' answer.
Let's assume our kernel looks like this:
__global__ void kern(float *a, float *b, float *c){
    float local_a, local_b, local_c;
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    local_a = a[idx];
    local_b = b[idx];
    local_c = c[idx];
}
ignoring optimizations (which would result in an empty kernel), and assuming we launch 1 block of 32 threads:
kern<<<1, 32>>>(d_a, d_b, d_c);
Then we have 32 threads (1 warp) executing in lock-step. That means each thread will process the following kernel code line:
local_a = a[idx];
at exactly the same time. The definition of a coalesced load (from global memory) is when a warp loads a sequence of data items that are all within a single 128-byte aligned boundary in global memory (for CC 2.0 devices). A perfectly coalesced load with 100% bandwidth utilization implies that each thread is using one unique 32 bit quantity within that 128 byte aligned region. If thread zero loads a[0], thread 1 loads a[1], etc, that may be a typical example of a coalesced load.
So in your first case, since the a[] array is contiguous and aligned, and a[0..31] fit within a 128-byte aligned region in global memory, we get a coalesced load: thread 0 reads a[0], thread 1 reads a[1], and so on.
In the second case, a[0] is not contiguous with a[1], and furthermore the elements a[0..31] (which are all loaded at the same code line) do not fit within a 128-byte aligned region in global memory. I'm going to let you parse what happens in your third case, but suffice it to say that, as in the second case, the elements a[0..31] are neither contiguous nor contained within a single 128-byte aligned region in global memory. While it's not necessary for data items to be contiguous to achieve some level of coalescing, a perfectly coalesced load (100% bandwidth utilization) from a 32-thread warp implies that each thread uses a unique 32-bit item, all of which are contiguous and contained within a single 128-byte aligned region in global memory.
A handy mental model is to contrast an Array of Structures (AoS), which corresponds to your cases 2 and 3, with a Structure of Arrays (SoA), which is essentially your first case. SoA usually presents better opportunities for coalescing than AoS. From the NVIDIA webinar page you may find this presentation interesting, especially slides 11-22 or so.
Some other relevant info from the Best Practices Guide:
For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which has 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).
The compiler flag -Xptxas -dlcm=cg will disable the L1 cache, i.e. use only L2, for poorly coalesced data.
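For completeness, a minimal sketch of reading the padded third-approach layout with one 16-byte vector load per thread (the kernel name and the final sum are illustrative). Each warp then requests four full 128-byte lines, all of which are needed, although a quarter of each line is padding:

struct __align__(16) dataPt3 { float A, B, C; };

__global__ void readAoS(const dataPt3 *data, float *out, int n)
{
    int idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < n) {
        // One float4 load fetches A, B, C and the padding in a single
        // 16-byte access per thread; the warp's accesses are contiguous.
        float4 v = *reinterpret_cast<const float4 *>(&data[idx]);
        out[idx] = v.x + v.y + v.z;
    }
}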

3 questions about alignment

The discussion is restricted to compute capability 2.x
Question 1
The size of a curandState is 48 bytes (measured by sizeof()). When an array of curandStates is allocated, is each element somehow padded (for example, to 64 bytes)? Or are they just placed contiguously in the memory?
Question 2
The OP of Passing structs to CUDA kernels states that "the align part was unnecessary". But without alignment, access to that structure will be divided into two consecutive accesses to a and b. Right?
Question 3
struct Position
{
    double x, y, z;
};
Suppose each thread is accessing the structure above:
int globalThreadID=blockIdx.x*blockDim.x+threadIdx.x;
Position positionRegister=positionGlobal[globalThreadID];
To optimize memory access, should I simply use three separate double arrays x, y, z to replace the structure?
Thanks for your time!
(1) They are placed contiguously in memory.
(2) If the array is in global memory, each memory transaction is 128 bytes, aligned to 128 bytes. You get two transactions only if a and b happen to span a 128-byte boundary.
(3) Performance can often be improved by using a struct of arrays instead of an array of structs. This just means that you pack all your x values together in one array, then the y values, and so on. This makes sense when you look at what happens when all 32 threads in a warp reach the point where, for instance, x is needed. By having all the values packed together, all the threads in the warp can be serviced with as few transactions as possible. Since a global memory transaction is 128 bytes, a single transaction can service all the threads if the value is a 32-bit word. The code example you gave might cause the compiler to keep the values in registers until they are needed.
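A minimal sketch of that struct-of-arrays layout for the position example (the kernel name and the final sum are illustrative):

__global__ void readPositionsSoA(const double *x, const double *y,
                                 const double *z, double *out, int n)
{
    int globalThreadID = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalThreadID < n) {
        double px = x[globalThreadID];  // each of these three loads coalesces:
        double py = y[globalThreadID];  // consecutive threads read consecutive
        double pz = z[globalThreadID];  // doubles from the same array
        out[globalThreadID] = px + py + pz;  // placeholder use of the values
    }
}

For 8-byte doubles a warp still needs two 128-byte transactions per array, but all of the fetched bytes are used, unlike with the 24-byte strided struct accesses.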

How to properly coalesce reads from global memory into shared memory with elements of type short or char (assuming one thread per element)?

I have a question about coalesced global memory loads in CUDA. Currently I need to be able to execute on a CUDA device with compute capability 1.1 or 1.3.
I am writing a CUDA kernel function which reads an array of type T from global memory into shared memory, does some computation, and then will write out an array of type T back to global memory. I am using the shared memory because the computation for each output element actually depends not only on the corresponding input element, but also on the nearby input elements. I only want to load each input element once, hence I want to cache the input elements in shared memory.
My plan is to have each thread read one element into shared memory, then __syncthreads() before beginning the computation. In this scenario, each thread loads, computes, and stores one element (although the computation depends on elements loaded into shared memory by other threads).
For this question I want to focus on the read from global memory into shared memory.
Assuming that there are N elements in the array, I have configured CUDA to execute a total of N threads. For the case where sizeof(T) == 4, this should coalesce nicely according to my understanding of CUDA, since thread K will read word K (where K is the thread index).
However, in the case where sizeof(T) < 4, for example if T=unsigned char or if T=short, then I think there may be a problem. In this case, my (naive) plan is:
Compute numElementsPerWord = 4 / sizeof(T)
if (K % numElementsPerWord == 0), then have thread K read the next full 32-bit word
store the 32 bit word in shared memory
after the shared memory has been populated (and __syncthreads() called), each thread K can work on computing output element K
My concern is that it will not coalesce because (for example, in the case where T=short)
Thread 0 reads word 0 from global memory
Thread 1 does not read
Thread 2 reads word 1 from global memory
Thread 3 does not read
etc...
In other words, thread K reads word K / numElementsPerWord. This would seem not to coalesce properly.
An alternative approach that I considered was:
Launch with number of threads = (N * sizeof(T) + 3) / 4, such that each thread is responsible for loading and processing 4/sizeof(T) elements (each thread processes one 32-bit word, i.e. 1, 2, or 4 elements depending on sizeof(T)). However, I am concerned that this approach will not be as fast as possible, since each thread must then do twice (if T = short) or even four times (if T = unsigned char) the amount of processing.
Can someone please tell me if my assumption about my plan is correct: i.e.: it will not coalesce properly?
Can you please comment on my alternative approach?
Can you recommend a more optimal approach that properly coalesces?
You are correct, you have to do loads of at least 32 bits in size to get coalescing, and the scheme you describe (having every other thread do a load) will not coalesce. Just shift the offset right by 2 bits and have each thread do a contiguous 32-bit load, and use conditional code to inhibit execution for threads that would operate on out-of-range addresses.
Since you are targeting SM 1.x, note also that 1) in order for coalescing to happen, the address accessed by thread 0 of a given warp (a collection of 32 threads) must be 64-, 128- or 256-byte aligned for 4-, 8- and 16-byte operands, respectively, and 2) once your data is in shared memory, you may want to unroll your loop by 2x (for short) or 4x (for char) so adjacent threads reference adjacent 32-bit words, to avoid shared memory bank conflicts.
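A minimal sketch of the suggested load scheme for T = short (kernel and variable names are illustrative; it assumes N is a multiple of blockDim.x, blockDim.x is even, and the input pointer is 4-byte aligned): the first half of each block performs contiguous 32-bit loads into shared memory, then every thread works on one element.

__global__ void processShorts(const short *in, short *out, int N)
{
    extern __shared__ short tile[];      // blockDim.x shorts
    int tid  = threadIdx.x;
    int base = blockIdx.x * blockDim.x;  // first element handled by this block

    // Coalesced phase: thread k (k < blockDim.x / 2) reads 32-bit word k,
    // i.e. two adjacent shorts, from consecutive global addresses.
    if (tid < blockDim.x / 2)
        reinterpret_cast<int *>(tile)[tid] =
            reinterpret_cast<const int *>(in + base)[tid];
    __syncthreads();

    // Compute phase: one thread per element (the real computation that uses
    // neighbouring elements is omitted here).
    out[base + tid] = tile[tid];
}

Launched, for example, as processShorts<<<N / 256, 256, 256 * sizeof(short)>>>(d_in, d_out, N). Note that the compute phase as written still has adjacent threads touching the same shared memory bank; the 2x unrolling mentioned above addresses that.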