CUDA memory alignment

In my code I use structures to facilitate passing arguments to functions (I don't use arrays of structures, but structures of arrays in general).
When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like
struct pt{
    int i;
    int j;
    int k;
};
even though I am not doing something complicated and it's obvious that the members should have the values appointed, I get...
Asked for position 0 of stack, stack only has 0 elements on it.
So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to
struct __align__(16) pt{
    int i;
    int j;
    int k;
};
but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following errors:
error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token
So, am I supposed to have two different definitions for host and device structures?
Further, I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.
For example, how should the following two structures be aligned? Or a structure with 6 floats? Or 4 integers? Again, I'm not using arrays of those, but I still define lots of variables with these structures within kernels or __device__ functions.
struct {
    int a;
    int b;
    int c;
    int d;
    float* el;
};
struct {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
Thank you in advance for any advice or hints

There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.
First, the reason your host compiler gives you errors is that it doesn't know what __align__(n) means, so it reports a syntax error. What you need to do is put something like the following in a header for your project.
#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif
So, am I supposed to have two different definitions for host and device structures?
No, just use MY_ALIGN(n), like this:
struct MY_ALIGN(16) pt { int i, j, k; };
For example, how should the following two be aligned?
First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding is inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires that each thread read words aligned to 1, 2, 4, 8 or 16 bytes. So...
struct MY_ALIGN(16) {
    int a;
    int b;
    int c;
    int d;
    float* el;
};
In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes, and 16-byte alignment pads it to 32, wasting 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine, it wastes only 8 bytes per struct (24 bytes padded to 32, due to the 8-byte pointer). We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff is that the hardware will have to use three 8-byte loads instead of two 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
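As a quick sanity check (a hypothetical host-side snippet, not from the programming guide), the padded stride shows up directly in sizeof:

#include <cstdio>

#if defined(__CUDACC__) // same macro as above
#define MY_ALIGN(n) __align__(n)
#else
#define MY_ALIGN(n) __attribute__((aligned(n))) // GCC/Clang host build
#endif

struct MY_ALIGN(16) pt16 { int a, b, c, d; float* el; };
struct MY_ALIGN(8) pt8 { int a, b, c, d; float* el; };

int main() {
    // On a 64-bit machine this prints 32 and 24:
    // 8 bytes of padding per struct versus none.
    printf("%zu %zu\n", sizeof(pt16), sizeof(pt8));
    return 0;
}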
struct MY_ALIGN(16) {
    int a;
    int b;
    int c;
    int d;
    float* i;
    float* j;
    float* k;
};
In this case, with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines (28 bytes padded to 32), or 8 bytes on 64-bit machines (40 padded to 48). Loading requires two 16-byte loads (or three on a 64-bit machine). We could eliminate the waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads. Again, tradeoffs.
or, how should a structure with 6 floats be aligned?
Again, tradeoffs: a struct of 6 floats is 24 bytes, so either waste 8 bytes per struct (16-byte alignment pads it to 32, loaded as two 16-byte loads) or avoid the padding at the cost of three 8-byte loads per struct (8-byte alignment).
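For concreteness, the two choices look like this (hypothetical names):

struct MY_ALIGN(16) vec6f { float a, b, c, d, e, f; };       // 24 bytes padded to 32
struct MY_ALIGN(8) vec6f_packed { float a, b, c, d, e, f; }; // 24 bytes, no padding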
or 4 integers?
No tradeoff here. MY_ALIGN(16).
again, I'm not using arrays of those, but I still define lots of variables with these structures within kernels or __device__ functions.
Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about; it's another good reason to favor structures of arrays over arrays of structures (see the sketch below).
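For illustration, a minimal sketch of the structure-of-arrays layout (the names are hypothetical):

struct pts_soa {
    int* i; // all i components, contiguous in memory
    int* j;
    int* k;
};

__global__ void touch(pts_soa p, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n)
        p.i[idx] += 1; // coalesced: adjacent threads touch adjacent ints
}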

These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and, if I'm not mistaken, by nvcc as well. That should save you the need to resort to macros.
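For example, a minimal sketch using alignas on the struct from the question:

struct alignas(16) pt {
    int i, j, k; // 12 bytes of members, padded to a 16-byte stride
};
static_assert(alignof(pt) == 16, "pt should be 16-byte aligned");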

Related

How would I write a MIPS behavioral simulator for the machine code created using the assembler code provided?

This MIPS simulator will read in a text file consisting of LC3100 machine code instructions (represented as decimal values), execute the program, and display the values of the register file and memory after each instruction completes.
I do not understand how this can be done and simply need an outline of the steps I need to take in order to create the simulator. Do I write the code in C++ or in MIPS? How do I read files if it is in MIPS? Honestly, I'm just confused.
I do not know where to start from; that is what I am asking for help figuring out.
I'd imagine you'd want to create some global variables that represent your registers and memory:
int memory[0x80000000/4];
int reg_v0;
int reg_t0;
int* reg_pc;
// etc
And then define some functions that mimic the way MIPS behaves. You'll need to read up on how the CPU operates (which is why this example function may seem arbitrary, but it really isn't):
void MIPS_multu(int regA, int regB)
{
    // void because we're writing to the reg_hi/reg_lo globals.
    // Cast to unsigned before widening: a plain int * int multiply
    // would overflow before the result reaches 64 bits.
    // (uint64_t requires <stdint.h>.)
    uint64_t temp = (uint64_t)(uint32_t)regA * (uint32_t)regB;
    reg_hi = temp >> 32;
    reg_lo = (temp & 0x00000000FFFFFFFF);
}
Finally, you'll need to understand how MIPS instructions are encoded and create a routine that can unpack them and select the correct function (a sketch of the decode step follows the listing below).
int memory[0x80000000/4];
int reg_v0;
int reg_t0;
int* reg_pc;
// etc
int main()
{
    reg_pc = &memory[0];
    // comparing against the one-past-the-end address is valid C
    while (reg_pc < &memory[0x80000000/4])
    {
        int temp = *reg_pc;
        // use bitwise operators etc. to figure out what the instruction
        // represents, and switch cases to pick the functions
        reg_pc++;
    }
}
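To make the decode step concrete, here is a hypothetical sketch assuming the standard MIPS R-type encoding and a register-file array (rather than the named globals above):

#include <stdint.h>

int reg[32]; // hypothetical register file indexed by register number

void decode_and_execute(uint32_t instr)
{
    uint32_t opcode = instr >> 26;          // bits 31..26
    uint32_t rs     = (instr >> 21) & 0x1F; // bits 25..21
    uint32_t rt     = (instr >> 16) & 0x1F; // bits 20..16
    uint32_t funct  = instr & 0x3F;         // bits 5..0 (R-type only)

    if (opcode == 0 && funct == 0x19)       // R-type multu
        MIPS_multu(reg[rs], reg[rt]);
    // ... decode other opcodes/functs and dispatch to their functions
}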

Does CUDA automatically convert float4 arrays into a struct of arrays?

I have the following snippet of code:
#include <stdio.h>
struct Nonsense {
    float3 group;
    float other;
};

__global__ void coalesced(float4* float4Array, Nonsense* nonsenseArray) {
    float4 someCoordinate = float4Array[threadIdx.x];
    someCoordinate.x = 5;
    float4Array[threadIdx.x] = someCoordinate;

    Nonsense nonsenseValue = nonsenseArray[threadIdx.x];
    nonsenseValue.other = 3;
    nonsenseArray[threadIdx.x] = nonsenseValue;
}

int main() {
    float4* float4Array;
    cudaMalloc(&float4Array, 32 * sizeof(float4));
    cudaMemset(float4Array, 0, 32 * sizeof(float4)); // note: (ptr, value, count) argument order

    Nonsense* nonsenseArray;
    cudaMalloc(&nonsenseArray, 32 * sizeof(Nonsense));
    cudaMemset(nonsenseArray, 0, 32 * sizeof(Nonsense));

    coalesced<<<1, 32>>>(float4Array, nonsenseArray);
    cudaDeviceSynchronize();
    return 0;
}
When I run this through the Nvidia profiler in Nsight and look at the Global Memory Access Pattern, the float4Array has perfectly coalesced reads and writes. Meanwhile, the Nonsense array has a poor access pattern (due to it being an array of structs).
Does NVCC automatically convert a float4 array which conceptually is an array of structs into a struct of array for better memory access patterns?
No, it does not convert it into a struct of arrays. If you think about this carefully, you will conclude that it is nearly impossible for the compiler to reorganize the data this way. After all, the thing being passed is a pointer.
There is only one array, and the elements of that array still have the struct elements in the same order:
float address (i.e. index): 0 1 2 3 4 5 ...
array element : a[0].x a[0].y a[0].z a[0].w a[1].x a[1].y ...
However the float4 array gives a better pattern because the compiler generates a single 16-byte load per thread. This is sometimes referred to as a "vector load" because we are loading a vector (float4 in this case) per thread. Therefore, adjacent threads are still reading adjacent data, and you have ideal coalescing behavior. In the above example, thread 0 would read a[0].x, a[0].y, a[0].z and a[0].w, thread 1 would read a[1].x, a[1].y etc. All of this would take place in a single request (i.e. SASS instruction) but may be split across multiple transactions. The splitting of a request into multiple transactions does not result in any loss of efficiency (in this case).
In the case of the Nonsense struct, the compiler does not recognize that that struct could also be loaded in a similar fashion, so under the hood it must generate 3 or 4 loads per thread:
one 8-byte load (or two 4-byte loads) to load the first two words of the float3 group
one 4-byte load to load the last word of the float3 group
one 4-byte load to load the float other
If you map out the above loads per thread, perhaps using the above diagram, you will see that each load involves a stride (unused elements between the items loaded per thread) and so results in lower efficiency.
By using careful typecasting or a union definition in your struct, you can get the compiler to load your Nonsense struct in a single load.
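For example, here is a minimal sketch of the typecasting approach (my own illustration, not code from the question). It works because sizeof(Nonsense) is 16 bytes, we force 16-byte alignment, and cudaMalloc returns suitably aligned pointers:

struct __align__(16) Nonsense {
    float3 group;
    float other;
};

__global__ void vectorized(Nonsense* nonsenseArray) {
    // One 16-byte vector load per thread instead of 3-4 smaller loads.
    float4 v = reinterpret_cast<float4*>(nonsenseArray)[threadIdx.x];
    Nonsense nonsenseValue = *reinterpret_cast<Nonsense*>(&v);
    nonsenseValue.other = 3;
    // Store back the same way, as a single 16-byte vector store.
    reinterpret_cast<float4*>(nonsenseArray)[threadIdx.x] = *reinterpret_cast<float4*>(&nonsenseValue);
}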
This answer also covers some ideas related to AoS -> SoA conversion and the related efficiency gains.
This answer covers vector load details.

CUDA and addressing bits in parallel

I want to write a CUDA program that returns locations of a bigger array that hold a specific criteria.
The trivial way to do it is to write a kernel that returns an array of integers with 1 if the criteria was held, or 0 if it was not.
Another way might be to return only the indexes that were found, but that would be problematic based on my knowledge of GPU synchronization (it's equivalent to implementing a queue/linked list on the GPU).
The problem with the first idea is that the output array would be the size of the input.
Another way I thought about is to create an array of n/8 + 1 bytes (n = the number of items I check) and use 1 bit per item, holding a sort of compressed representation of the output.
The only thing I could not find out is whether CUDA supports addressing bits in parallel...
An example of how I am doing it now:
__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize, int *gputSearchSize, int *resultsGPU)
{
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (start_idx > *gputTextSize - *gputSearchSize){return;}
    unsigned int wrong=0;
    for(int i=0; i<*gputSearchSize; i++){
        wrong = calculationOnGpu(gpuText, gpuFind, start_idx, i, gputSearchSize);
    }
    resultsGPU[start_idx] = !wrong;
}
What I want to do is, instead of using int or char for the resultsGPU array, to use something else.
Thanks
A CUDA GPU can access items on boundaries of 1, 2, 4, 8, or 16 bytes. It does not have the ability to address individual bits in a byte.
Bits in a byte would be modified by reading a larger item, such as a char or int, modifying the bit(s) in a register, then writing that item back to memory. Thus it would be a read-modify-write operation.
In order to preserve adjacent bits in such a scenario with multiple threads, it would be necessary to update the item (char, int, etc.) atomically. There are no atomics that operate on char quantities, so the bits would need to be grouped into quantities of 32 and written, e.g., as int. Following that idiom, every thread would be doing an atomic operation.
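For example, a minimal sketch of that atomic read-modify-write idiom (hypothetical names, assuming one thread per input item):

__global__ void pack_bits(const int* found, unsigned int* bits, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= n) return;
    if (found[idx])
        atomicOr(&bits[idx / 32], 1u << (idx % 32)); // atomic RMW on the containing 32-bit word
}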
32 also happens to be the warp size currently, so a warp-based intrinsic might be a more efficient way to go here, in particular the warp vote __ballot() function. Something like this:
__global__ void test_kernel(char *gpuText, char *gpuFind, int *gputTextSize, int *gputSearchSize, int *resultsGPU)
{
    int start_idx = threadIdx.x + (blockIdx.x * blockDim.x);
    if (start_idx > *gputTextSize - *gputSearchSize){return;}
    unsigned int wrong=0;
    wrong = calculationOnGpu(gpuText, gpuFind, start_idx, 0, gputSearchSize);
    wrong = __ballot(wrong);
    if ((threadIdx.x & 31) == 0)
        resultsGPU[start_idx/32] = wrong;
}
You haven't provided complete code, so the above is just a sketch of how it might be done. I'm not sure the loop in your original kernel was an efficient approach anyway, and the above assumes one thread per data item to be searched. __ballot() should be safe even in the presence of inactive threads at one end or the other of the array being searched.

Passing a struct pointer to a CUDA kernel [duplicate]

Possible Duplicate:
Copying a struct containing pointers to CUDA device
I have a structure of device pointers, pointing to arrays allocated on the device.
like this
struct mystruct{
    int* dev1;
    double* dev2;
    .
    .
};
There are a large number of arrays in this structure. I started writing a CUDA kernel in which I passed a pointer to mystruct and then dereferenced it within the kernel code, like this: mystruct->dev1[i].
But after writing a few lines I realized this will not work, since by CUDA first principles you cannot dereference a host pointer (in this case, to mystruct) within a CUDA kernel.
But this is kind of inconvenient, since I would have to pass a large number of arguments to my kernels. Is there any way to avoid this? I would like to keep the argument list of my kernel calls as short as possible.
As I explain in this answer, you can pass your struct by value to the kernel, so you don't have to worry about dereferencing a host pointer:
__global__ void kernel(mystruct in)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    in.dev1[idx] *= 2;
    in.dev2[idx] += 3.14159;
}
There is the overhead of passing the struct by value to be aware of. However if your struct is not too large, it shouldn't matter.
If you pass the same struct to a lot of kernels, or repeatedly, you may consider copying the struct itself to global or constant memory instead as suggested by aland, or use mapped host memory as suggested by Mark Ebersole. But passing the struct by value is a much simpler way to get started.
(Note: please search StackOverflow before duplicating questions...)
You can copy your mystruct structure to global memory and pass its device address to the kernel.
From a performance viewpoint, however, it would be better to store mystruct in constant memory, since (I guess) there are a lot of random reads from it by many threads.
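A minimal sketch of that constant-memory variant (hypothetical host function and sizes, reusing the mystruct definition from the question):

__constant__ mystruct c_struct; // one copy, cached and broadcast to all threads

__global__ void kernel()
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c_struct.dev1[idx] *= 2; // the pointers themselves are read from constant memory
}

// Host side: fill a host copy with device pointers, then upload it once.
void setup(int N)
{
    mystruct h_struct;
    cudaMalloc(&h_struct.dev1, N * sizeof(int));
    cudaMalloc(&h_struct.dev2, N * sizeof(double));
    cudaMemcpyToSymbol(c_struct, &h_struct, sizeof(mystruct));
}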
You could also use page-locked (pinned) host memory and create the structure within that region, if your setup supports it. Please see section 3.2.4 of the CUDA programming guide.

Casting array to different data-type before reading causes intermittent issues

I have an int array in global memory. To read from global memory less often, I have been experimenting with reading via 64-bit datatypes and then using the high or low 32 bits as needed. For example, this gets the 3rd and 4th ints from the array:
__device__ void func1(int* arr)
{
    unsigned long long int val = *((unsigned long long int *) &arr[3]);
    // Now operate on the individual ints
}
Using this method to retrieve ints gives me undefined behavior, even though it seems like this should work. When it does work, reading values this way is quite a bit faster than individual integer reads. Has anyone run across this problem before?
Quantities need to be aligned to their size. I'm not sure how CUDA handles what you're doing, and it's possible that it's environment specific, but your use of:
*((unsigned long long int *) &arr[3])
Assuming arr is 8-byte aligned, this takes an 8-byte quantity that is only 4-byte aligned. This happens, of course, because:
arr = 8n // n is an integer
sizeof(int) = 4
&arr[3] = 8n + 3*4 // simplifies to 8(n+1) + 4
I know you will run into issues if you try to do the same thing on a processor using 32-bit and 16-bit integers (though I've never tried it with 64- and 32-bit ones).
You will need to homebrew some sort of accessor that figures out where the piece of data you are trying to access is. Consider the following situation, similar to yours:
int get32BitValueFrom(unsigned long long int longArray[], int index)
{
    // get the 64-bit int containing the 32-bit int we want
    unsigned long long int value = longArray[index >> 1];
    // if we wanted an odd index, return the high-order 32 bits,
    // otherwise return the low-order 32 bits
    return (int) ((index & 1) ? (value >> 32) : (value));
}
Edit: I know you're using CUDA, and I know to avoid branching, but I'm sure there is a way to write equivalent code using some sort of bitwise or mathematical trick that accomplishes the same thing.
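For instance, a branch-free variant of the accessor (a sketch; it shifts by 0 or 32 bits depending on the index parity):

int get32BitValueFrom(unsigned long long int longArray[], int index)
{
    unsigned long long int value = longArray[index >> 1];
    // (index & 1) * 32 is 0 for even indexes and 32 for odd ones,
    // so the half we want always ends up in the low 32 bits.
    return (int) (value >> ((index & 1) * 32));
}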