Does CUDA automatically convert float4 arrays into a struct of arrays? - cuda

I have the following snippet of code:
#include <stdio.h>
struct Nonsense {
    float3 group;
    float other;
};
__global__ void coalesced(float4* float4Array, Nonsense* nonsenseArray) {
    float4 someCoordinate = float4Array[threadIdx.x];
    someCoordinate.x = 5;
    float4Array[threadIdx.x] = someCoordinate;
    Nonsense nonsenseValue = nonsenseArray[threadIdx.x];
    nonsenseValue.other = 3;
    nonsenseArray[threadIdx.x] = nonsenseValue;
}
int main() {
    float4* float4Array;
    cudaMalloc(&float4Array, 32 * sizeof(float4));
    // cudaMemset takes (pointer, value, byte count), in that order.
    cudaMemset(float4Array, 0, 32 * sizeof(float4));
    Nonsense* nonsenseArray;
    cudaMalloc(&nonsenseArray, 32 * sizeof(Nonsense));
    cudaMemset(nonsenseArray, 0, 32 * sizeof(Nonsense));
    coalesced<<<1, 32>>>(float4Array, nonsenseArray);
    cudaDeviceSynchronize();
    return 0;
}
When I run this through the Nvidia profiler in Nsight and look at the Global Memory Access Pattern, the float4Array has perfectly coalesced reads and writes. Meanwhile, the Nonsense array has a poor access pattern (due to it being an array of structs).
Does NVCC automatically convert a float4 array, which conceptually is an array of structs, into a struct of arrays for better memory access patterns?

No, it does not convert it into a struct of arrays. I think if you think about this carefully, you would conclude that it is nearly impossible for the compiler to reorganize data this way. After all, the thing that is being passed is a pointer.
There is only one array, and the elements of that array still have the struct elements in the same order:
float address (i.e. index): 0 1 2 3 4 5 ...
array element : a[0].x a[0].y a[0].z a[0].w a[1].x a[1].y ...
However the float4 array gives a better pattern because the compiler generates a single 16-byte load per thread. This is sometimes referred to as a "vector load" because we are loading a vector (float4 in this case) per thread. Therefore, adjacent threads are still reading adjacent data, and you have ideal coalescing behavior. In the above example, thread 0 would read a[0].x, a[0].y, a[0].z and a[0].w, thread 1 would read a[1].x, a[1].y etc. All of this would take place in a single request (i.e. SASS instruction) but may be split across multiple transactions. The splitting of a request into multiple transactions does not result in any loss of efficiency (in this case).
In the case of the Nonsense struct, the compiler does not recognize that that struct could also be loaded in a similar fashion, so under the hood it must generate 3 or 4 loads per thread:
one 8-byte load (or two 4-byte loads) to load the first two words of the float3 group
one 4-byte load to load the last word of the float3 group
one 4-byte load to load the float other
If you map out the above loads per thread, perhaps using the above diagram, you will see that each load involves a stride (unused elements between the items loaded per thread) and so results in lower efficiency.
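To make the "map out the loads" exercise concrete, here is a rough host-side sketch that estimates how much of the fetched data one load instruction actually uses across a warp. The function name `load_efficiency` is hypothetical, and the model rests on assumptions that are not part of the answer: 32 threads per warp, memory fetched in 32-byte sectors, an array that starts on a sector boundary, and no caching between loads.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Fraction of fetched bytes that a warp actually uses for ONE load instruction.
// Thread t reads `width` bytes starting at byte `offset + t * stride`.
// Assumptions (illustrative only): 32 threads per warp, 32-byte sectors,
// array base aligned to a sector, caching ignored.
double load_efficiency(std::size_t offset, std::size_t width, std::size_t stride) {
    assert(width > 0);
    const std::size_t kThreads = 32;
    const std::size_t kSector = 32;
    std::set<std::size_t> sectors;  // distinct sectors touched by the warp
    for (std::size_t t = 0; t < kThreads; ++t) {
        const std::size_t first = offset + t * stride;
        const std::size_t last = first + width - 1;
        for (std::size_t s = first / kSector; s <= last / kSector; ++s) {
            sectors.insert(s);
        }
    }
    const double requested = static_cast<double>(kThreads * width);
    const double fetched = static_cast<double>(sectors.size() * kSector);
    return requested / fetched;
}
```

Under these assumptions, `load_efficiency(0, 16, 16)` (one 16-byte load per thread, as for float4) gives 1.0, while the 8-byte and 4-byte pieces of a 16-byte Nonsense element, `load_efficiency(0, 8, 16)` and `load_efficiency(12, 4, 16)`, give 0.5 and 0.25.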
By using careful typecasting or a union definition in your struct, you can get the compiler to load your Nonsense struct in a single load.
This answer also covers some ideas related to AoS -> SoA conversion and the related efficiency gains.
This answer covers vector load details.

Related

how to use CUDA to create an array of indices of matches from another array? With atomic?

I have an array on the device that contains only 0s and 1s, like [0,1,0,0,1...]. I want to create a new array that holds the indices of the 1s.
I think it should use atomics, but I have no idea how to implement it.
This can be achieved using stream compaction. With Thrust it could look like this:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/copy.h>
#include <thrust/distance.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <vector>
struct IsOne {
    __host__ __device__
    bool operator()(int i) const {
        return i == 1;
    }
};
int main() {
    std::vector<int> h_array{1, 0, 1, 1, 0};
    thrust::device_vector<int> d_array = h_array;
    thrust::device_vector<int> d_indicesOfOnes(d_array.size());
    const int n = static_cast<int>(d_array.size());
    // Copy index i to the output whenever the stencil value d_array[i] is 1.
    auto end = thrust::copy_if(
        thrust::device,
        thrust::make_counting_iterator(0),
        thrust::make_counting_iterator(n),
        d_array.begin(),
        d_indicesOfOnes.begin(),
        IsOne{}
    );
    int numIndices = thrust::distance(d_indicesOfOnes.begin(), end);
    for (int i = 0; i < numIndices; i++) {
        std::cout << d_indicesOfOnes[i] << "\n";
    }
}
Output:
0
2
3
So you want to find the indices of all the 1s, producing an output array whose length is the count of ones?
Almost certainly better to work in private chunks than to have all threads (whatever you call them in CUDA) serialized by incrementing a shared atomic output position. That strategy would be disastrously slow, getting no parallelism except when scanning through 0s. Maybe you had some better idea for how to use atomic, but you didn't mention it.
Have each "thread" count 1s in its chunk, then prefix-sum those results across chunks to find the starting point in the output array for each chunk.
Then (after all earlier chunks are done) copy its local array of indices to the right position in the final shared array.
Collecting results between chunks might be done with atomic, but I think you'd want an array of start-positions instead of serializing things with atomic for only 1 chunk start-point at a time.
IDK whether it would be faster to generate a local array of indices during the first pass as you count, or to re-scan the original array to generate indices. That would let the first pass just be a simple sum (which is the same thing as counting the matches in this case). If 1s are very rare, writing a temporary array and copying it might be good. If they're common, keeping the first pass simple and light-weight is probably good, not costing a lot more memory access / cache footprint.
The overall problem is somewhat similar to the parallel prefix-sum problem for dependencies between chunks, but it's the starting position that isn't known here, instead of what offset you need to add.
The prefix-sum part of what I'm suggesting is just over chunks, one value per thread, so this step doesn't have to do a lot of work between waiting for earlier chunks and starting up the 2nd phase.
I've never actually used CUDA so I'm not going to attempt code, but this is how you can parallelize the dependencies inherent in this problem in a way that's probably friendly for what GPUs can do.
It would work well on a multi-core CPU where you'd maybe have an array of atomic<ssize_t> output counts / positions. Perhaps starting zero-initialized, and when a thread finishes its chunk it writes a count to the array biased by 1, so it's non-zero. Or perhaps chunk_counts[thread] = -(count+1); so it's definitely non-zero, and negative. Then it checks if it's the first thread, or if chunk_counts[thread-1] is non-zero and positive (else .wait(0) on it?), and if so writes its chunk_counts[thread] = chunk_counts[thread-1] + count; Actually you wouldn't need to store both the negative local count and the positive prefix-summed count, just one or the other depending on how you collect.
And on a CPU, probably better to have one thread responsible for collecting the results, prefix summing, and notifying all the other threads to wake again. Instead of serializing with a chain of each thread waiting for a previous thread to write its value. Perhaps also on a GPU. That one collector thread can notify workers to start phase 2 as it goes along in the prefix sum.
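The count / prefix-sum / write scheme described in this answer can be sketched on the host as follows. This is only a sequential illustration, not CUDA code: the function name `indices_of_ones` and the `num_chunks` parameter are hypothetical, and each chunk iteration stands in for the work one thread (or block) would do in parallel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices of all elements equal to 1, in ascending order.
// `num_chunks` is a hypothetical tuning parameter; each chunk models the
// private work of one thread or block.
std::vector<std::size_t> indices_of_ones(const std::vector<int>& flags,
                                         std::size_t num_chunks) {
    assert(num_chunks > 0);
    const std::size_t n = flags.size();
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks;  // ceil(n / num_chunks)

    // Phase 1: every chunk counts its own 1s (chunks are independent).
    std::vector<std::size_t> start(num_chunks + 1, 0);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t lo = (c * chunk < n) ? c * chunk : n;
        const std::size_t hi = (lo + chunk < n) ? lo + chunk : n;
        for (std::size_t i = lo; i < hi; ++i) {
            if (flags[i] == 1) ++start[c + 1];
        }
    }

    // Prefix sum over the per-chunk counts: afterwards start[c] is the first
    // output position of chunk c and start[num_chunks] is the total count.
    for (std::size_t c = 0; c < num_chunks; ++c) {
        start[c + 1] += start[c];
    }

    // Phase 2: every chunk re-scans its range and writes indices at its offset.
    std::vector<std::size_t> out(start[num_chunks]);
    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t lo = (c * chunk < n) ? c * chunk : n;
        const std::size_t hi = (lo + chunk < n) ? lo + chunk : n;
        std::size_t pos = start[c];
        for (std::size_t i = lo; i < hi; ++i) {
            if (flags[i] == 1) out[pos++] = i;
        }
    }
    return out;
}
```

This variant re-scans the input in the second phase instead of keeping a temporary per-chunk index array, which is one of the two options weighed above. Because every chunk knows its start offset before writing, no atomic counter is needed.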

cuda memory alignment

In my code I am using structures in order to facilitate the passing of arguments to functions (I don't use arrays of structures, but instead structures of arrays in general).
When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like
struct pt {
    int i;
    int j;
    int k;
};
even though I am not doing something complicated and it's obvious that the members should have the values appointed, I get...
Asked for position 0 of stack, stack only has 0 elements on it.
So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to
struct __align__(16) pt {
    int i;
    int j;
    int k;
};
but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following error:
error: expected unqualified-id before numeric constant error: expected
‘)’ before numeric constant error: expected constructor, destructor,
or type conversion before ‘;’ token
So, am I supposed to have two different definitions for host and device structures?
Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.
For example, how should the following two be aligned? Or how should a structure with 6 floats be aligned? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within the kernels or __device__ functions.
struct {
    int a;
    int b;
    int c;
    int d;
    float* el;
};
struct {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
Thank you in advance for any advice or hints
There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.
First, the reason your host compiler gives you errors is that it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.
#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif
So, am I supposed to have two different definitions for host and device structures?
No, just use MY_ALIGN(n), like this
struct MY_ALIGN(16) pt { int i, j, k; };
For example, how should the following two be aligned?
First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires that each thread reads words aligned to 1, 2, 4, 8 or 16 bytes. So...
struct MY_ALIGN(16) {
int a;
int b;
int c;
int d;
float* el;
};
In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment pads it to 32 bytes, which wastes 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine the struct takes 24 bytes, so it will waste only 8 bytes per struct, due to the 8-byte pointer. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff will be that the hardware will have to use three 8-byte loads instead of two 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
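The padding arithmetic for a struct of four ints plus one pointer can be checked with a small host-side sketch. The helper names `padded_size`, `wasted_bytes` and `loads_needed` are hypothetical, and the struct sizes are passed in as plain numbers (20 bytes with 4-byte pointers, 24 bytes with 8-byte pointers) rather than measured with sizeof, so the result does not depend on the machine running it.

```cpp
#include <cassert>
#include <cstddef>

// Size of one array element once it is rounded up to a multiple of `align`.
inline std::size_t padded_size(std::size_t size, std::size_t align) {
    assert(align > 0);
    return (size + align - 1) / align * align;
}

// Bytes of padding ("waste") added to each element.
inline std::size_t wasted_bytes(std::size_t size, std::size_t align) {
    return padded_size(size, align) - size;
}

// Number of loads of `align` bytes each needed to fetch one padded element.
inline std::size_t loads_needed(std::size_t size, std::size_t align) {
    return padded_size(size, align) / align;
}
```

For example, `wasted_bytes(20, 16)` is 12 and `wasted_bytes(24, 16)` is 8, while `wasted_bytes(24, 8)` is 0 at the cost of `loads_needed(24, 8)` being 3 instead of `loads_needed(24, 16)` being 2.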
struct MY_ALIGN(16) {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
In this case with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines, or 8 on 64-bit machines. It would require two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads. Again, tradeoffs.
or, how should a structure with 6 floats be aligned?
Again, tradeoffs: either waste 8 bytes per struct or require two loads per struct.
or 4 integers?
No tradeoff here. MY_ALIGN(16).
again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or __device__ functions.
Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about—it's another good reason to favor structures of arrays over arrays of structures.
These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and IIANM by nvcc as well. That should save you the need to resort to macros.
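As a minimal sketch of the alignas route, assuming a C++11 host compiler and an nvcc version that accepts alignas as suggested above, the same definition can be shared by host and device code without a macro. The names `pt_plain`, `pt_aligned` and `is_aligned` are hypothetical and exist only for this comparison; the size checks assume a 4-byte int.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same members as the `pt` struct in the question, without and with alignas.
struct pt_plain { int i, j, k; };
struct alignas(16) pt_aligned { int i, j, k; };

// True when address `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment > 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Layout facts checked at compile time (assuming a 4-byte int).
static_assert(sizeof(pt_plain) == 12, "three 4-byte ints, no padding");
static_assert(alignof(pt_aligned) == 16, "alignas(16) raises the alignment");
static_assert(sizeof(pt_aligned) == 16, "12 bytes of members padded up to 16");
```

The static_asserts show the cost discussed earlier: alignas(16) adds 4 bytes of padding per element, and in return every element of an array of `pt_aligned` starts on a 16-byte boundary.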

How best to transfer a large number of arrays of chars to the GPU?

I am new to CUDA and am trying to do some processing of a large number of arrays. Each array is an array of about 1000 chars (not a string, just stored as chars) and there can be up to 1 million of them, so about 1 GB of data to be transferred. This data is already all loaded into memory and I have a pointer to each array, but I don't think I can rely on all the data being sequential in memory, so I can't just transfer it all with one call.
I currently made a first go at it with thrust, and based my solution kind of on this message ... I made a struct with a static call that allocates all the memory, and then each individual constructor copies that array, and I have a transform call which takes in the struct with the pointer to the device array.
My problem is that this is obviously extremely slow, since each array is copied individually. I'm wondering how to transfer this data faster.
In this question (the question is mostly unrelated, but I think the user is trying to do something similar) talonmies suggests that they try and use a zip iterator but I don't see how that would help transfer a large number of arrays.
I also just found out about cudaMemcpy2DToArray and cudaMemcpy2D while writing this question, so maybe those are the answer, but I don't see immediately how these would work, since neither seem to take pointers to pointers as input...
Any suggestions are welcome...
One way to do this is as marina.k suggested, batching your transfers only as you need them. Since you said each array only contains about 1000 chars, you could assign each char to a thread (since on Fermi we can allocate 1024 threads per block) and have each array handled by one block. In this case you may be able to transfer all the arrays for one "round" in one call - can you use a FORTRAN style, where you make one gigantic array and to get the 5th element of the "third" 1000 char array you would go:
third_array[5] = big_array[5 + 2*1000]
so that the first 1000 char array makes up the first 1000 elements of big_array, the second 1000 char array makes up the second 1000 elements of big_array, etc.? In this case your chars would be contiguous in memory and you could move the set you were going to process with one kernel launch in only one memcpy. Then as soon as you launch one kernel, you refill big_array on the CPU side and copy it asynchronously to the GPU.
Within each kernel, you could simply handle each array within 1 block, so that block N handles the (N-1)-thousandth element up to the N-thousandth of d_big_array (where you copied all those chars to).
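The flat layout described above can be sketched on the host like this. The names `pack` and `element` are hypothetical, and the sketch assumes every source array has exactly `len` chars; the packed buffer is what would then be handed to a single host-to-device copy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Copies separately allocated arrays of `len` chars each into one contiguous
// buffer, so that array i occupies elements [i * len, (i + 1) * len).
std::vector<char> pack(const std::vector<const char*>& arrays, std::size_t len) {
    std::vector<char> big(arrays.size() * len);
    if (len == 0) return big;  // nothing to copy
    for (std::size_t i = 0; i < arrays.size(); ++i) {
        assert(arrays[i] != nullptr);
        std::memcpy(big.data() + i * len, arrays[i], len);
    }
    return big;
}

// Element j of array i inside the packed buffer: big[j + i * len].
inline char element(const std::vector<char>& big, std::size_t len,
                    std::size_t i, std::size_t j) {
    assert(j < len && i * len + j < big.size());
    return big[j + i * len];
}
```

With `len` set to 1000, `element(big, 1000, 2, 5)` reads the same value as `big_array[5 + 2*1000]` in the indexing example above.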
Did you try pinned memory? This may provide a considerable speed-up on some hardware configurations.
Try asynchronous transfers: assign the same job to several streams so that each stream processes a small part of the data, and transfers overlap with computation.
Here is the code, to be run for each stream index i:
cudaMemcpyAsync(
    inputDevPtr + i * size, hostPtr + i * size, size, cudaMemcpyHostToDevice, stream[i]
);
MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
cudaMemcpyAsync(
    hostPtr + i * size, outputDevPtr + i * size, size, cudaMemcpyDeviceToHost, stream[i]
);

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But as opposed to using malloc/free, when l_SrcIntegral is defined as a local array (as it appears in the commented line), it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
    int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
    if (l_ThreadIndex >= d_MaxParallelRows)
        return;
    int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
    if (l_Row <= d_MaxTreatedRow) {
        //float l_SrcIntegral[1700];
        float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
        for (int x=185; x<1407; x++) {
            for (int i=0; i<1700; i++)
                l_SrcIntegral[i] = i;
        }
        free(l_SrcIntegral);
    }
}
int main()
{
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(0);
    int l_ThreadsPerBlock = k_ThreadsPerBlock;
    int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
    int l_FirstRow = d_MinTreatedRow;
    while (l_FirstRow <= d_MaxTreatedRow) {
        printf("CUDA: FirstRow=%d\n", l_FirstRow);
        fflush(stdout);
        myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
        cudaDeviceSynchronize();
        l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
    }
    printf("CUDA: Done\n");
    return 0;
}
1.
As @aland said, you may encounter even worse performance when calculating just one row in each kernel call.
You have to think about processing the whole input in order to, at least in theory, use the power of massively parallel processing.
Why start multiple kernels with just 320 threads just to calculate one row?
How about using as many blocks as you have rows and letting the threads of each block process one row?
(320 threads per block is not a good choice; check out how to reach better occupancy.)
2.
If your fast resources, such as registers and shared memory, are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector in an arbitrary sized matrix.
Each block processes one column and the threads of that block store in a tile loop their elements in a shared memory array. When finished they calculate the sum using parallel reduction, just to start the next iteration.
At the end each block calculated the sum of its vector.
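The tile loop with its range check and per-tile reduction can be sketched on the host as below. This is a sequential stand-in, not a kernel: the function name `tiled_sum` is hypothetical, the `shared` vector plays the role of the shared memory array, and the inner loops over `t` model what the threads of one block would do in parallel. The sketch assumes the tile size is a power of two so the tree reduction halves cleanly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `column` tile by tile. `tile` plays the role of the block size and must
// be a power of two.
float tiled_sum(const std::vector<float>& column, std::size_t tile) {
    assert(tile > 0 && (tile & (tile - 1)) == 0);
    std::vector<float> shared(tile);  // stand-in for a __shared__ array
    float total = 0.0f;
    for (std::size_t base = 0; base < column.size(); base += tile) {
        // Load step: "thread" t fetches element base + t. The range check
        // pads the last, partial tile with zeros.
        for (std::size_t t = 0; t < tile; ++t) {
            const std::size_t i = base + t;
            shared[t] = (i < column.size()) ? column[i] : 0.0f;
        }
        // Tree reduction within the tile: halve the active range each step.
        for (std::size_t stride = tile / 2; stride > 0; stride /= 2) {
            for (std::size_t t = 0; t < stride; ++t) {
                shared[t] += shared[t + stride];
            }
        }
        total += shared[0];  // partial sum of this tile
    }
    return total;
}
```

In a real kernel the load step and each reduction step would be separated by a barrier, and the range check is what keeps the final tile from reading past the end of the column.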
You can still use dynamic array sizes using shared memory. Just pass a third argument in the <<<...>>> of the kernel call. That'd be the size of your shared memory per block.
Once you're there, just bring all relevant data into your shared array (you should still try to keep coalesced accesses) bringing one or several (if it's relevant to keep coalesced accesses) elements per thread. Sync threads after it's been brought (only if you need to stop race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (it's an array of 77.500 doubles) to be stored somewhere in cuda. Now, I need a big set of threads to sequentially do a bunch of operations with that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory and read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading many times would be quite useless... Any idea??
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;
__global__ void kernel(double *ptr)
{
    __shared__ double window[block_size];
    // cooperate with my block to load block_size elements
    window[threadIdx.x] = ptr[threadIdx.x];
    // wait until the window is full
    __syncthreads();
    // operate on the data
    ...
}
You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.