I would like to declare the alignment for a global device variable in CUDA. Specifically, I have a string declaration, like
__device__ char str1[] = "some pre-defined string";
In normal gcc, I can request alignment from the compiler as
__device__ char str1[] __attribute__ ((aligned (4))) = "some pre-defined string";
However, when I tried this with nvcc, the compiler ignored the request. I want to copy these strings into a buffer in my kernels, and copying a word at a time is much faster than copying a byte at a time, but word copies require the source string to be aligned. Can anyone please tell me how to request alignment from the nvcc compiler?
See section 5.3.2 "Size and Alignment Requirement" of the "CUDA C Programming Guide".
The alignment requirement is automatically fulfilled for the built-in types char, short, int, long, long long, float and double, and for the built-in vector types such as float2 or float4.
For structs, the size and alignment requirements can be enforced by the compiler using the alignment specifiers __align__(8) or __align__(16).
Example usage:
struct __align__(8) {
float r;
float i;
} complex_num;
Can you check if this works?
__device__ char __align__(4) str1[] = "some pre-defined string";
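For comparison, here is a minimal host-side C++ sketch of the same idea using the standard alignas specifier. It is not CUDA code: the helper is_aligned is a hypothetical name introduced for illustration, and whether nvcc honors the same declaration with __device__ added is an assumption you would need to check on your toolkit.

```cpp
#include <cstdint>

// Host-side stand-in for the device string. In CUDA the declaration would
// additionally carry __device__ (assumption: not verified here).
alignas(4) const char str1[] = "some pre-defined string";

// Hypothetical helper: true when p is at an address that is a multiple
// of `alignment` bytes.
bool is_aligned(const void* p, std::uintptr_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

Checking is_aligned(str1, 4) at run time is a cheap way to confirm that the compiler actually applied the request before relying on word-sized copies.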
This MIPS simulator will read in a text file consisting of LC3100 machine code instructions (represented as decimal values), and execute the program, then display the values of register files and memory after each instruction is completed.
I do not understand how this can be done and simply need a format for what steps I need to take in order to create the simulator in MIPS. Do I write code in C++ or write the code in MIPS? How do I read files if it is in MIPS? Honestly, just confused.
I do not know where to start. That is what I am asking for help with.
I'd imagine you'd want to create some global variables that represent your registers and memory:
int memory[0x80000000/4];
int reg_v0;
int reg_t0;
int* reg_pc;
// etc
And then define some functions that mimic the way MIPS behaves. You'll need to read up on how the CPU operates (which is why this example function may seem arbitrary but really it isn't.)
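void MIPS_multu(uint32_t regA, uint32_t regB)
{
// void because we're writing to global variables.
// Widen to 64 bits before multiplying so the high word is not lost.
uint64_t temp = (uint64_t)regA * regB;
reg_hi = (int)(temp >> 32);
reg_lo = (int)(temp & 0x00000000FFFFFFFF);
}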
Finally, you'll need to understand how MIPS instructions are encoded and create a routine that can unpack them and select the correct function.
int memory[0x80000000/4];
int reg_v0;
int reg_t0;
int* reg_pc;
// etc
int main()
{
reg_pc = &memory[0];
while (reg_pc < &memory[0x80000000/4])
// chances are this is either invalid C or just bad practice,
// but I can't think of a better way to express the idea
{
int temp = *reg_pc;
// use bitwise operators etc to figure out what the instruction represents,
// and switch cases to pick the functions.
reg_pc++;
}
}
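The unpack-and-select step could be sketched roughly as follows. This is only an illustration in C++ that assumes standard 32-bit MIPS R-type encoding (as the answer does, not the LC3100 format mentioned in the question). The names RType, decode_rtype, Op and select_op are hypothetical, and only two instructions are recognized.

```cpp
#include <cstdint>

// Fields of a 32-bit MIPS R-type instruction word.
struct RType {
    std::uint32_t opcode; // bits 31..26
    std::uint32_t rs;     // bits 25..21
    std::uint32_t rt;     // bits 20..16
    std::uint32_t rd;     // bits 15..11
    std::uint32_t shamt;  // bits 10..6
    std::uint32_t funct;  // bits 5..0
};

// Unpack the fields with shifts and masks.
RType decode_rtype(std::uint32_t word) {
    RType f;
    f.opcode = (word >> 26) & 0x3F;
    f.rs     = (word >> 21) & 0x1F;
    f.rt     = (word >> 16) & 0x1F;
    f.rd     = (word >> 11) & 0x1F;
    f.shamt  = (word >> 6)  & 0x1F;
    f.funct  = word & 0x3F;
    return f;
}

// Hypothetical set of operations the simulator knows about.
enum class Op { Multu, Addu, Unknown };

// Select the operation; a real simulator would call the matching
// function (e.g. MIPS_multu) instead of returning a tag.
Op select_op(std::uint32_t word) {
    RType f = decode_rtype(word);
    if (f.opcode != 0) return Op::Unknown; // not an R-type instruction
    switch (f.funct) {
        case 0x19: return Op::Multu;
        case 0x21: return Op::Addu;
        default:   return Op::Unknown;
    }
}
```

For example, the word 0x01090019 decodes to rs = 8, rt = 9, funct = 0x19, which is multu with registers 8 and 9 as operands.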
I could not find anything on the internet. Since it is possible to use printf in a __device__ function, I am wondering whether there is a sprintf-like function, given that printf presumably uses the result of sprintf to write to stdout.
No, there isn't anything built into CUDA for this.
Within CUDA the implementation of device printf is a special case and does not use the same mechanisms as the C library printf.
sprintf(), snprintf() and additional printf()-family functions are now available on the development branch of the CUDA Kernel Author's Toolkit, a.k.a. cuda-kat. Signatures:
namespace kat {
__device__ int sprintf(char* s, const char* format, ...);
__device__ int snprintf(char* s, size_t n, const char* format, ...);
}
... and they do exactly what you would expect. In particular, they support the C standard features which CUDA printf() does not, and then some (e.g. specifying a string argument's field width using an extra argument; format specifiers for size_t, and ptrdiff_t, and printing in base-2).
Caveat: I am the author of cuda-kat, so I'm biased...
Always prefer snprintf(), which takes the buffer size, over sprintf(), which might overflow.
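To illustrate why the size argument matters, here is a small host-side sketch using the standard C library std::snprintf (not the cuda-kat device version, whose behavior is only assumed to be analogous). The helper name format_value is hypothetical.

```cpp
#include <cstddef>
#include <cstdio>

// Formats "value=<n>" into buf, which has room for cap bytes.
// std::snprintf never writes more than cap bytes (including the
// terminating NUL) and returns the length the full output would have had,
// so truncation can be detected by comparing that length with cap.
bool format_value(char* buf, std::size_t cap, int n) {
    int needed = std::snprintf(buf, cap, "value=%d", n);
    return needed >= 0 && static_cast<std::size_t>(needed) < cap;
}
```

With an 8-byte buffer and n = 12345 the call reports that the text did not fit, and the buffer holds the truncated, still NUL-terminated "value=1". The same input passed to sprintf would have written past the end of the buffer.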
I tracked a bug to the use of a __m128 (SSE vector) as a value in a std::unordered_map.
This causes a runtime segmentation fault with mingw32 g++ 4.7.2.
Please see the example below.
Is there any reason why this should fail?
Or, might there be a workaround? (I tried wrapping the value in a class but it did not help.)
Thanks.
#include <unordered_map>
#include <xmmintrin.h> // __m128
#include <iostream>
int main()
{
std::unordered_map<int,__m128> m;
std::cerr << "still ok\n";
m[0] = __m128();
std::cerr << "crash in previous statement\n";
return 0;
}
Compilation settings:
g++ -march=native -std=c++11
There are 2 issues regarding alignment:
Does the ABI ensure that __m128 variables are always aligned on the stack?
Does the global new operator return memory suitably aligned for the __m128 type? i.e., returns memory with a 16-byte alignment.
C++ currently doesn't handle dynamic allocation of over-aligned types. With usual x86 ABIs, standard alignment is 8 and __m128 has an alignment of 16 bytes, so it is overaligned. With usual x86_64 ABIs, the standard alignment is 16 which makes __m128 safe (but __m256 is unsafe again with its 32-byte alignment).
See this paper for a possible change in the next standard that would make things "just work":
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3396.htm
In the meantime, you can specify your own allocator, for instance based on aligned_alloc (C11), posix_memalign (unix), _aligned_malloc (Microsoft), etc.
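A custom allocator along these lines might look like the sketch below. It assumes a POSIX system (posix_memalign is not available under mingw, where _aligned_malloc and _aligned_free would take its place), and it uses a hypothetical 32-byte-aligned struct Vec8 as a stand-in for a SIMD type so that the example does not depend on x86 intrinsics. AlignedAllocator, AlignedMap and stored_value_is_aligned are names made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <new>
#include <unordered_map>
#include <utility>
#include <stdlib.h> // posix_memalign (POSIX only)

// Hypothetical stand-in for an over-aligned SIMD type such as __m256.
struct alignas(32) Vec8 { float v[8]; };

// Minimal allocator that hands out memory aligned to Alignment bytes.
template <typename T, std::size_t Alignment>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() noexcept = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // An explicit rebind is required because of the non-type parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        void* p = nullptr;
        if (posix_memalign(&p, Alignment, n * sizeof(T)) != 0)
            throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { std::free(p); }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&,
                const AlignedAllocator<U, A>&) noexcept { return false; }

using AlignedMap =
    std::unordered_map<int, Vec8, std::hash<int>, std::equal_to<int>,
                       AlignedAllocator<std::pair<const int, Vec8>, 32>>;

// Inserts one element and checks that it landed on a 32-byte boundary.
bool stored_value_is_aligned() {
    AlignedMap m;
    m[0] = Vec8();
    return reinterpret_cast<std::uintptr_t>(&m[0]) % 32 == 0;
}
```

The map rebinds the allocator to its internal node type, so every node, and therefore the value stored inside it, comes from the aligned allocation path.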
I have a 2D matrix SIZE x SIZE, which I'm trying to copy to the GPU.
I allocate the matrix this way:
#define SIZE 1024
float (*a)[SIZE] = (float(*)[SIZE]) malloc(SIZE * SIZE * sizeof(float));
And I have this on my ACC region:
void mmul_acc(float a[restrict][SIZE],
float b[restrict][SIZE],
float c[restrict][SIZE]) {
#pragma acc data copyin(a[0:SIZE][0:SIZE], b[0:SIZE][0:SIZE]) \
copyout(c[0:SIZE][0:SIZE])
{
... code here...
}
}
When compiling with the PGI compiler, using -Minfo=acc, the compiler tells me:
Generating copyin(a[0:1024][0:])
What does a[0:1024][0:] mean? Why not a[0:1024][0:1024] ???
If instead of declaring matrices I declare arrays with size SIZE*SIZE, doing
#pragma acc data copyin(a[0:SIZE*SIZE])
Generates the following compiler message
Generating copyin(a[0:16777216])
The code actually works the same way, same performance, same result.
Apparently in both ways the compiler generates the same code, as it should be, but the message is not straightforward.
I'm using the PGI accelerator 12.8, in a Linux64 machine. I'm compiling with -Minfo=acc
Note: this question was edited and now it doesn't really make much sense, but maybe it can be useful to more people.
This issue is fixed in the latest PGI compiler, 12.9.0. The compiler now returns the following message:
Generating copyin(a[0:1024][0:1024])
In my code I am using structures in order to facilitate the passing of arguments to functions (I don't use arrays of structures, but instead structures of arrays in general).
When I am in cuda-gdb and I examine the point in a kernel where I give values to a simple structure like
struct pt{
int i;
int j;
int k;
};
even though I am not doing anything complicated and it's obvious that the members should have the values assigned, I get...
Asked for position 0 of stack, stack only has 0 elements on it.
So I am thinking that even though it's not an array, maybe there is a problem with the alignment of memory at that point. So I change the definition in the header file to
struct __align__(16) pt{
int i;
int j;
int k;
};
but then, when the compiler tries to compile the host-code files that use the same definitions, it gives the following errors:
error: expected unqualified-id before numeric constant
error: expected ‘)’ before numeric constant
error: expected constructor, destructor, or type conversion before ‘;’ token
so, am I supposed to have two different definitions for host and device structures ???
Further I would like to ask how to generalize the logic of alignment. I am not a computer scientist, so the two examples in the programming guide don't help me get the big picture.
For example, how should the following two be aligned? or, how should a structure with 6 floats be aligned? or 4 integers? again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or __device__ functions.
struct {
int a;
int b;
int c;
int d;
float* el;
} ;
struct {
int a;
int b;
int c;
int d;
float* i;
float* j;
float* k;
} ;
Thank you in advance for any advice or hints
There are a lot of questions in this post. Since the CUDA programming guide does a pretty good job of explaining alignment in CUDA, I'll just explain a few things that are not obvious in the guide.
First, your host compiler gives you errors because it doesn't know what __align__(n) means, so it reports a syntax error. What you need is to put something like the following in a header for your project.
#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif
So, am I supposed to have two different definitions for host and device structures?
No, just use MY_ALIGN(n), like this
struct MY_ALIGN(16) pt { int i, j, k; };
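As a quick sanity check, the macro and the struct can be put in one translation unit and verified at compile time. This is a sketch for a host compiler (GCC, Clang or MSVC). The unaligned twin pt_plain is a hypothetical name added only for comparison.

```cpp
#include <cstddef>

#if defined(__CUDACC__) // NVCC
#define MY_ALIGN(n) __align__(n)
#elif defined(__GNUC__) // GCC and Clang
#define MY_ALIGN(n) __attribute__((aligned(n)))
#elif defined(_MSC_VER) // MSVC
#define MY_ALIGN(n) __declspec(align(n))
#else
#error "Please provide a definition for MY_ALIGN macro for your host compiler!"
#endif

// Same members, with and without the alignment request.
struct MY_ALIGN(16) pt { int i, j, k; };
struct pt_plain { int i, j, k; };

// The aligned struct starts on a 16-byte boundary and its size is
// rounded up to a multiple of 16, so arrays of it stay aligned.
static_assert(alignof(pt) == 16, "pt should be 16-byte aligned");
static_assert(sizeof(pt) % 16 == 0, "pt size should be a multiple of 16");
static_assert(sizeof(pt_plain) <= sizeof(pt), "alignment can only add padding");
```

If host and device code both include this header, both sides agree on the layout of pt, which is what matters when the struct is passed to a kernel.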
For example, how should the following two be aligned?
First, __align__(n) (or any of the host compiler flavors) enforces that the memory for the struct begins at an address that is a multiple of n bytes. If the size of the struct is not a multiple of n, then in an array of those structs, padding will be inserted to ensure each struct is properly aligned. To choose a proper value for n, you want to minimize the amount of padding required. As explained in the programming guide, the hardware requires that each thread read words aligned to 1, 2, 4, 8 or 16 bytes. So...
struct MY_ALIGN(16) {
int a;
int b;
int c;
int d;
float* el;
};
In this case let's say we choose 16-byte alignment. On a 32-bit machine, the pointer takes 4 bytes, so the struct takes 20 bytes. 16-byte alignment pads it to 32 bytes and so wastes 16 * ceil(20/16) - 20 = 12 bytes per struct. On a 64-bit machine, it will waste only 8 bytes per struct, due to the 8-byte pointer. We can reduce the waste by using MY_ALIGN(8) instead. The tradeoff will be that the hardware will have to use 3 8-byte loads instead of 2 16-byte loads to load the struct from memory. If you are not bottlenecked by the loads, this is probably a worthwhile tradeoff. Note that you don't want to align smaller than 4 bytes for this struct.
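The padding arithmetic can be written down as a tiny helper. The sketch below only redoes the calculation for the sizes discussed here. The function names padded_size and wasted_bytes are hypothetical.

```cpp
#include <cstddef>

// Size of a struct after rounding up to a multiple of its alignment.
constexpr std::size_t padded_size(std::size_t size, std::size_t align) {
    return (size + align - 1) / align * align;
}

// Bytes of padding added per struct.
constexpr std::size_t wasted_bytes(std::size_t size, std::size_t align) {
    return padded_size(size, align) - size;
}

// 4 ints + one 4-byte pointer = 20 bytes (32-bit machine).
static_assert(wasted_bytes(20, 16) == 12, "16-byte alignment, 32-bit");
static_assert(wasted_bytes(20, 8) == 4, "8-byte alignment, 32-bit");
// 4 ints + one 8-byte pointer = 24 bytes (64-bit machine).
static_assert(wasted_bytes(24, 16) == 8, "16-byte alignment, 64-bit");
static_assert(wasted_bytes(24, 8) == 0, "8-byte alignment, 64-bit");
```

The number of loads per struct is padded_size divided by the load width, which gives the 2 versus 3 loads mentioned above for the 64-bit case.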
struct MY_ALIGN(16) {
int a;
int b;
int c;
int d;
float* i;
float* j;
float* k;
};
In this case with 16-byte alignment you waste only 4 bytes per struct on 32-bit machines, or 8 on 64-bit machines. It would require two 16-byte loads (or 3 on a 64-bit machine). We could eliminate waste entirely with 4-byte alignment (8-byte on 64-bit machines), but this would result in excessive loads. Again, tradeoffs.
or, how should a structure with 6 floats be aligned?
Again, tradeoffs: with 16-byte alignment you waste 8 bytes per struct but need only two 16-byte loads; with 8-byte alignment you waste nothing but need three 8-byte loads.
or 4 integers?
No tradeoff here. MY_ALIGN(16).
again, I'm not using arrays of those, but still I define lots of variables with these structures within the kernels or __device__ functions.
Hmmm, if you are not using arrays of these, then you may not need to align at all. But how are you assigning to them? As you are probably seeing, all that waste is important to worry about—it's another good reason to favor structures of arrays over arrays of structures.
These days, you should use the C++11 alignas specifier, which is supported by GCC (including the versions compatible with current CUDA), by MSVC since the 2015 version, and IIANM by nvcc as well. That should save you the need to resort to macros.
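A minimal sketch of the alignas form, applied to the "4 integers" and "6 floats" cases from the question. The struct names are hypothetical, and the size checks assume 4-byte int and float, which holds on the usual host platforms.

```cpp
#include <cstddef>

// Four ints: 16 bytes of data, 16-byte alignment, no padding.
struct alignas(16) Int4 { int a, b, c, d; };

// Six floats: 24 bytes of data.
struct alignas(16) Float6By16 { float v[6]; }; // padded to 32 bytes
struct alignas(8)  Float6By8  { float v[6]; }; // stays at 24 bytes

static_assert(alignof(Int4) == 16, "Int4 is 16-byte aligned");
static_assert(sizeof(Int4) == 16, "assumes 4-byte int");
static_assert(sizeof(Float6By16) == 32, "assumes 4-byte float");
static_assert(sizeof(Float6By8) == 24, "assumes 4-byte float");
```

Because alignas is part of the language, the same header compiles unchanged with any C++11 host compiler, which is the point of dropping the macro.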