I was wondering whether it is possible to extend the CUDA.@atomic operation to a custom type.
Here is an example of what I am trying to do:
using CUDA
struct Dual
    x
    y
end
cu0 = CuArray([Dual(1, 2), Dual(2, 3)])
cu1 = CuArray([Dual(1, 2), Dual(2, 3)])
indexes = CuArray([1, 1])
function my_kernel(dst, src, idx)
    index = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    @inbounds if index <= length(idx)
        CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
    end
    return nothing
end
@cuda threads=100 my_kernel(cu0, cu1, indexes)
The problem with this code is that the CUDA.@atomic call only supports basic types like Int, Float or Real.
I need it to work with my own struct.
It would be nice if someone had an idea of how this could be made to work.
The underlying PTX instruction set for CUDA provides a subset of atomic store, exchange, add/subtract, increment/decrement, min/max, and compare-and-set operations for global and shared memory locations (not all architectures support all operations with all POD types, and there is evidence that not all operations are implemented in hardware on all architectures).
What all these instructions have in common is that they execute only one operation atomically. I am completely unfamiliar with Julia, but if
CUDA.@atomic dst[idx[index]] = dst[idx[index]] + src[index]
means "atomically add src[].x and src[].y to dst[].x and dst[].y" then that isn't possible because that implies two additions on separate memory locations in one atomic operation. If the members of your structure could be packed into a compatible type (a 32 bit or 64 bit unsigned integer, for example), you could perform atomic store, exchange or compare-and-set in CUDA. But not arithmetic.
If you consult this section of the programming guide, you can see an example of a brute force double precision add implementation using compare-and-set in a tight loop. If your structure can be packed into something which can be manipulated with compare-and-set, then it might be possible to roll your own atomic add for a custom type (limited to a maximum of 64 bits).
How you might approach that in Julia is definitely an exercise left to the reader.
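As a rough illustration of the compare-and-set approach described above, here is a host-side C++ sketch that uses std::atomic<std::uint64_t> as a stand-in for a 64-bit word in GPU memory. The Dual layout (two 32-bit integers), the pack/unpack helpers and atomic_add are all assumptions made for this example; on the device the analogous primitive would be atomicCAS on a 64-bit unsigned integer.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical payload: two 32-bit fields that together fit in one 64-bit word.
struct Dual {
    std::int32_t x;
    std::int32_t y;
};

// Pack a Dual into one 64-bit word (x in the low half, y in the high half).
inline std::uint64_t pack(Dual d) {
    return (static_cast<std::uint64_t>(static_cast<std::uint32_t>(d.y)) << 32) |
           static_cast<std::uint32_t>(d.x);
}

// Inverse of pack().
inline Dual unpack(std::uint64_t w) {
    return Dual{static_cast<std::int32_t>(static_cast<std::uint32_t>(w)),
                static_cast<std::int32_t>(static_cast<std::uint32_t>(w >> 32))};
}

// Atomically add `delta` to the Dual stored in `cell` using a compare-and-swap loop.
// Returns the value that was stored before the addition.
inline Dual atomic_add(std::atomic<std::uint64_t>& cell, Dual delta) {
    std::uint64_t observed = cell.load();
    for (;;) {
        Dual current = unpack(observed);
        Dual updated{current.x + delta.x, current.y + delta.y};
        // On failure, compare_exchange_weak refreshes `observed` with the
        // value currently stored, so the next iteration retries from there.
        if (cell.compare_exchange_weak(observed, pack(updated))) {
            return current;
        }
    }
}
```

The whole structure is replaced in a single compare-and-swap, which is why it must fit in the 64 bits that the primitive can handle.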
Suppose I have a full warp of threads in a CUDA block, and each of these threads is intended to work with N elements of type T, residing in shared memory (so we have warp_size * N = 32 N elements total). The different threads never access each other's data. (Well, they do, but at a later stage which we don't care about here). This access is to happen in a loop such as the following:
for(int i = 0; i < big_number; i++) {
    auto thread_idx = determine_thread_index_into_its_own_array();
    T value = calculate_value();
    write_to_own_shmem(thread_idx, value);
}
Now, the different threads may have different indices each, or identical - I'm not making any assumptions this way or that. But I do want to minimize shared memory bank conflicts.
If sizeof(T) == 4, then this is easy-peasy: Just place all of thread i's data in shared memory addresses i, 32+i, 64+i, 96+i etc. This puts all of i's data in the same bank, which is also distinct from the other lanes' banks. Great.
But now - what if sizeof(T) == 8? How should I place my data and access it so as to minimize bank conflicts (without any knowledge about the indices)?
Note: Assume T is plain-old-data. You may even assume it's a number if that makes your answer simpler.
tl;dr: Use the same kind of interleaving as for 32-bit values.
On later-than-Kepler micro-architectures (up to Volta), the best we could theoretically get is 2 shared memory transactions for a full warp reading a single 64-bit value (as a single transaction provides 32 bits to each lane at most).
This is achievable in practice with the placement pattern analogous to the one OP described for 32-bit data. That is, for T* arr, have lane i read its idx'th element as arr[i + idx * 32], so that lane i's elements sit at i, 32+i, 64+i and so on. This will compile so that two transactions occur:
The lower 16 lanes obtain their data in the first transaction of 32*4 bytes (utilizing all banks)
The higher 16 lanes obtain their data in the second transaction of 32*4 bytes (utilizing all banks)
So the GPU is smarter/more flexible than trying to fetch 4 bytes for each lane separately. That means it can do better than the simplistic "break up T into halves" idea the earlier answer proposed.
(This answer is based on @RobertCrovella's comments.)
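To check the bank arithmetic behind the interleaved layout, here is a small host-side C++ sketch. It assumes 32 banks that are 4 bytes wide and models only the address-to-bank mapping, not the hardware's transaction logic; the function names are made up for this example.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;   // lanes per warp
constexpr std::size_t kNumBanks = 32;   // assumed number of shared memory banks
constexpr std::size_t kBankBytes = 4;   // assumed bank width in bytes

// Element index of lane `lane`'s idx-th value under the interleaved layout.
constexpr std::size_t interleaved_index(std::size_t lane, std::size_t idx) {
    return lane + idx * kWarpSize;
}

// First bank touched by an element of `elem_bytes` bytes at `element_index`.
constexpr std::size_t first_bank(std::size_t element_index, std::size_t elem_bytes) {
    return (element_index * elem_bytes / kBankBytes) % kNumBanks;
}

// True if the 16 lanes starting at `first_lane` touch 32 distinct banks when
// each lane reads one 8-byte element at its own, arbitrary, idx.
inline bool half_warp_conflict_free(std::size_t first_lane,
                                    const std::array<std::size_t, kWarpSize>& idx) {
    std::array<bool, kNumBanks> used{};
    for (std::size_t lane = first_lane; lane < first_lane + kWarpSize / 2; ++lane) {
        std::size_t bank = first_bank(interleaved_index(lane, idx[lane]), 8);
        // An 8-byte element occupies two consecutive 4-byte banks.
        for (std::size_t b = bank; b < bank + 2; ++b) {
            if (used[b % kNumBanks]) return false;
            used[b % kNumBanks] = true;
        }
    }
    return true;
}
```

Because idx is multiplied by 32, it drops out of the bank computation: lane i always lands on banks 2i and 2i+1 (mod 32), whatever index it uses.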
On Kepler GPUs, this had a simple solution: Just change the bank size! Kepler supported setting the shared memory bank size to 8 instead of 4, dynamically. But alas, that feature is not available in later microarchitectures (e.g. Maxwell, Pascal).
Now, here's an ugly and sub-optimal answer for more recent CUDA microarchitectures: Reduce the 64-bit case to the 32-bit case.
Instead of each thread storing N values of type T, it stores 2N values, each consecutive pair being the low and the high 32-bits of a T.
To access a 64-bit value, two half-T accesses are made, and the T is composed with something like
// upper_half and lower_half are the two uint32_t words read from shared memory
uint64_t joined =
    (static_cast<uint64_t>(upper_half) << 32) |
    static_cast<uint64_t>(lower_half);
T my_t_value;
memcpy(&my_t_value, &joined, sizeof(T)); // reinterpret the 64 bits as a T
and the same in reverse when writing.
As the comments suggest, it is better to make 64-bit accesses, as described in this answer.
I am trying to use CUDA in order to parallelize the simulated annealing algorithm. The GPU I am using is an NVIDIA GTX660. I am trying to speed the program up, and in order to do so I am considering replacing this
int r = rand();
if (condition)
{
    r += 1;
}
with
int r = rand() + (condition)*1;
I understand that jump/branch instructions (like if-then-else statements) are the slowest to execute, but unless my understanding is incorrect, typecasting involves a memory access and then copying the number to a new location as an int before accessing it. Could the result of 'condition' be stored in a register and fed into the ALU without modification? If so, wouldn't that be a faster way to calculate the value of variable r? The above runs on every thread.
Generally, you'd try very hard to avoid branching on GPUs, since that's classically the point where the GPU needs to halt all threads that don't go through that branch, execute those that do, then halt these, and execute the other branch.
That being said, the branching doesn't happen because you write if; it happens because you use e.g. <, which assigns a value to a register based on what you're comparing. How that plays out depends heavily on your actual condition and on the language/architecture you're on. My knowledge is from first-generation CUDA and might not fully apply anymore.
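For reference, the two forms from the question can be written side by side as plain C++ functions (the names with_branch and without_branch are made up here). Both compute the same value; whether the compiler emits an actual branch, a predicated instruction or a select for either of them is its decision, so the source-level difference may not survive compilation.

```cpp
// Form with an explicit if, as in the question.
inline int with_branch(int r, bool condition) {
    if (condition) {
        r += 1;
    }
    return r;
}

// Arithmetic form: converting a bool to int yields 0 or 1.
inline int without_branch(int r, bool condition) {
    return r + static_cast<int>(condition);
}
```

Comparing the generated code for both forms is the only reliable way to see whether the rewrite changes anything on a given target.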
I mean, interpreters work on a list of instructions, which seem to be composed more or less by sequences of bytes, usually stored as integers. Opcodes are retrieved from these integers, by doing bit-wise operations, for use in a big switch statement where all operations are located.
My specific question is: How do the object values get stored/retrieved?
For example, let's (non-realistically) assume:
Our instructions are unsigned 32 bit integers.
We've reserved the first 4 bits of the integer for opcodes.
If I wanted to store data in the same integer as my opcode, I'm limited to a 28 bit integer (32 bits minus the 4 opcode bits). If I wanted to store it in the next instruction, I'm limited to a 32 bit value.
Values like Strings require lots more storage than this. How do most interpreters get away with this in an efficient manner?
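To make the hypothetical layout concrete, here is a small C++ sketch of packing and unpacking such an instruction word. It assumes the opcode sits in the top 4 bits, which leaves 28 bits for an inline operand; the helper names are invented for this example.

```cpp
#include <cstdint>

constexpr std::uint32_t kOpcodeBits = 4;
constexpr std::uint32_t kOperandBits = 32 - kOpcodeBits;            // 28 bits remain
constexpr std::uint32_t kOperandMask = (1u << kOperandBits) - 1u;   // low 28 bits set

// Build an instruction word; `opcode` must fit in 4 bits, and `operand`
// is truncated to its low 28 bits.
constexpr std::uint32_t encode(std::uint32_t opcode, std::uint32_t operand) {
    return (opcode << kOperandBits) | (operand & kOperandMask);
}

// Extract the opcode from the top 4 bits.
constexpr std::uint32_t opcode_of(std::uint32_t instr) {
    return instr >> kOperandBits;
}

// Extract the inline operand from the low 28 bits.
constexpr std::uint32_t operand_of(std::uint32_t instr) {
    return instr & kOperandMask;
}
```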
I'm going to start by assuming that you're interested primarily (if not exclusively) in a byte-code interpreter or something similar (since your question seems to assume that). An interpreter that works directly from source code (in raw or tokenized form) is a fair amount different.
For a typical byte-code interpreter, you basically design some idealized machine. Stack-based (or at least stack-oriented) designs are pretty common for this purpose, so let's assume that.
So, first let's consider the choice of 4 bits for op-codes. A lot here will depend on how many data formats we want to support, and whether we're including that in the 4 bits for the op code. Just for the sake of argument, let's assume that the basic data types supported by the virtual machine proper are 8-bit and 64-bit integers (which can also be used for addressing), and 32-bit and 64-bit floating point.
For integers we pretty much need to support at least: add, subtract, multiply, divide, and, or, xor, not, negate, compare, test, left/right shift/rotate (right shifts in both logical and arithmetic varieties), load, and store. Floating point will support the same arithmetic operations, but remove the logical/bitwise operations. We'll also need some branch/jump operations (unconditional jump, jump if zero, jump if not zero, etc.) For a stack machine, we probably also want at least a few stack oriented instructions (push, pop, dupe, possibly rotate, etc.)
That gives us a two-bit field for the data type, and at least 5 (quite possibly 6) bits for the op-code field. Instead of conditional jumps being special instructions, we might want to have just one jump instruction, and a few bits to specify conditional execution that can be applied to any instruction. We also pretty much need to specify at least a few addressing modes:
Optional: small immediate (N bits of data in the instruction itself)
large immediate (data in the 64-bit word following the instruction)
implied (operand(s) on top of stack)
Absolute (address specified in 64 bits following instruction)
relative (offset specified in or following instruction)
I've done my best to keep everything about as minimal as is at all reasonable here -- you might well want more to improve efficiency.
Anyway, in a model like this, an object's value is just some locations in memory. Likewise, a string is just some sequence of 8-bit integers in memory. Nearly all manipulation of objects/strings is done via the stack. For example, let's assume you had some classes A and B defined like:
class A {
    int x;
    int y;
};
class B {
    int a;
    int b;
};
...and some code like:
A a {1, 2};
B b {3, 4};
a.x += b.a;
The initialization would mean values in the executable file loaded into the memory locations assigned to a and b. The addition could then produce code something like this:
push immediate a.x // put &a.x on top of stack
dupe // copy address to next lower stack position
load // load value from a.x
push immediate b.a // put &b.a on top of stack
load // load value from b.a
add // add two values
store // store back to a.x using address placed on stack with `dupe`
Assuming one byte for each instruction proper, we end up around 23 bytes for the sequence as a whole, 16 bytes of which are addresses. If we use 32-bit addressing instead of 64-bit, we can reduce that by 8 bytes (i.e., a total of 15 bytes).
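A toy interpreter for just the five operations used in that sequence might look like the following C++ sketch. Everything here is an assumption made for illustration: instructions are kept as structs rather than encoded bytes, memory is a vector of 64-bit integers, and addresses are indices into that vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Op : std::uint8_t { PushImm, Dupe, Load, Add, Store };

struct Instr {
    Op op;
    std::int64_t imm;  // used only by PushImm
};

// Runs `program` against `memory`; addresses are indices into `memory`.
inline void run(const std::vector<Instr>& program, std::vector<std::int64_t>& memory) {
    std::vector<std::int64_t> stack;
    auto pop = [&stack] {
        std::int64_t v = stack.back();
        stack.pop_back();
        return v;
    };
    for (const Instr& in : program) {
        switch (in.op) {
            case Op::PushImm:
                stack.push_back(in.imm);
                break;
            case Op::Dupe: {
                std::int64_t top = stack.back();
                stack.push_back(top);
                break;
            }
            case Op::Load: {
                std::int64_t addr = pop();
                stack.push_back(memory[static_cast<std::size_t>(addr)]);
                break;
            }
            case Op::Add: {
                std::int64_t rhs = pop();
                std::int64_t lhs = pop();
                stack.push_back(lhs + rhs);
                break;
            }
            case Op::Store: {
                std::int64_t value = pop();
                std::int64_t addr = pop();
                memory[static_cast<std::size_t>(addr)] = value;
                break;
            }
        }
    }
}

// The sequence above for a.x += b.a, with a.x at address `ax` and b.a at address `ba`.
inline std::vector<Instr> add_field_program(std::int64_t ax, std::int64_t ba) {
    return {{Op::PushImm, ax}, {Op::Dupe, 0}, {Op::Load, 0},
            {Op::PushImm, ba}, {Op::Load, 0}, {Op::Add, 0}, {Op::Store, 0}};
}
```

With memory laid out as {a.x, a.y, b.a, b.b} = {1, 2, 3, 4}, running add_field_program(0, 2) leaves 4 in a.x and the other three slots unchanged.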
The most obvious thing to keep in mind is that the virtual machine implemented by a typical byte-code interpreter (or similar) isn't all that different from a "real" machine implemented in hardware. You might add some instructions that are important to the model you're trying to implement (e.g., the JVM includes instructions to directly support its security model), or you might leave out a few if you only want to support languages that don't include them (e.g., I suppose you could leave out a few like xor if you really wanted to). You also need to decide what sort of virtual machine you're going to support. What I've portrayed above is stack-oriented, but you can certainly do a register-oriented machine if you prefer.
Either way, most of object access, string storage, etc., comes down to them being locations in memory. The machine will retrieve data from those locations into the stack/registers, manipulate as appropriate, and store back to the locations of the destination object(s).
Bytecode interpreters that I'm familiar with do this using constant tables. When the compiler is generating bytecode for a chunk of source, it is also generating a little constant table that rides along with that bytecode. (For example, if the bytecode gets stuffed into some kind of "function" object, the constant table will go in there too.)
Any time the compiler encounters a literal like a string or a number, it creates an actual runtime object for the value that the interpreter can work with. It adds that to the constant table and gets the index where the value was added. Then it emits something like a LOAD_CONSTANT instruction that has an argument whose value is the index in the constant table.
Here's an example:
static void string(Compiler* compiler, int allowAssignment)
{
    // Define a constant for the literal.
    int constant = addConstant(compiler, wrenNewString(compiler->parser->vm,
        compiler->parser->currentString, compiler->parser->currentStringLength));
    // Compile the code to load the constant.
    emit(compiler, CODE_CONSTANT);
    emit(compiler, constant);
}
At runtime, to implement a LOAD_CONSTANT instruction, you just decode the argument, and pull the object out of the constant table.
Here's an example:
CASE_CODE(CONSTANT):
    PUSH(frame->fn->constants[READ_ARG()]);
    DISPATCH();
For things like small numbers and frequently used values like true and null, you may devote dedicated instructions to them, but that's just an optimization.
I recently wanted to use a simple CUDA matrix-vector multiplication. I found a suitable function in the cuBLAS library: cublas<t>gbmv. Here is the official documentation.
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't turn up much useful information on this question, mostly references to different versions of the documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this very function, could you explain to me how to use it?
Maybe cublas library is somewhat extraordinary and everyone uses something more popular, better documented and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of architectural limitations of the day (this was 1979, and the API was defined using names of 6 characters or fewer to ensure it could widely compile), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO.. where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general - eg, just a dense matrix you can't say anything else about; GB for a general banded matrix, SY for symmetric matrices, etc), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication
y = α op(A) x + β y
where A is an m x n matrix stored in column-major format, x and y
are vectors, and α and β are scalars.
where op(A) can be A, A^T, or A^H. So if you just want y = Ax, as is the common case, then α = 1, β = 0, and trans == CUBLAS_OP_N.
incx is the stride between different elements in x; there's lots of situations where this would come in handy, but if x is just a simple 1d array containing the vector, then the stride would be 1.
And that's about all you need for SGEMV.
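To make the roles of the arguments concrete, here is a plain C++ reference loop for the non-transposed case. This is not cuBLAS code and not how an optimized BLAS is written; gemv_n is a made-up name, and the sketch only handles positive strides. It shows how column-major storage with a leading dimension lda and the strides incx and incy are indexed.

```cpp
#include <cstddef>

// Reference y = alpha * A * x + beta * y for the non-transposed case.
// A is m x n, column-major with leading dimension lda, so A(i, j) is a[i + j * lda].
// Element j of x lives at x[j * incx]; element i of y lives at y[i * incy].
inline void gemv_n(std::size_t m, std::size_t n, float alpha,
                   const float* a, std::size_t lda,
                   const float* x, std::size_t incx,
                   float beta, float* y, std::size_t incy) {
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            acc += a[i + j * lda] * x[j * incx];
        }
        y[i * incy] = alpha * acc + beta * y[i * incy];
    }
}
```

A stride other than 1 is useful when the vector is, for example, one row of a column-major matrix: consecutive elements of that row are lda entries apart in memory.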
What do you mean by Atomic instructions?
How does the following become Atomic?
TestAndSet
int TestAndSet(int *x){
    register int temp = *x;
    *x = 1;
    return temp;
}
From a software perspective, if one does not want to use non-blocking synchronization primitives, how can one ensure the atomicity of an instruction? Is it possible only in hardware, or can some assembly-level directive or optimization be used?
Some machine instructions are intrinsically atomic - for example, reading and writing properly aligned values of the native processor word size is atomic on many architectures.
This means that hardware interrupts, other processors and hyper-threads cannot interrupt the read or store and read or write a partial value to the same location.
More complicated things such as reading and writing together atomically can be achieved by explicit atomic machine instructions e.g. LOCK CMPXCHG on x86.
Locking and other high-level constructs are built on these atomic primitives, which typically only guard a single processor word.
Some clever concurrent algorithms can be built using just the reading and writing of pointers e.g. in linked lists shared between a single reader and writer, or with effort, multiple readers and writers.
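As a sketch of how the question's TestAndSet can be made atomic in portable code, the version below uses std::atomic<int>::exchange, which stores the new value and returns the old one as a single indivisible step. The SpinLock class built on top of it is a hypothetical minimal example, not a production lock.

```cpp
#include <atomic>

// Atomic counterpart of TestAndSet: store 1 and return the previous
// value in one indivisible step.
inline int test_and_set(std::atomic<int>& x) {
    return x.exchange(1);
}

// A minimal spinlock built on test_and_set.
class SpinLock {
public:
    void lock() {
        // Keep trying until the previous value was 0, meaning we took the lock.
        while (test_and_set(flag_) == 1) {
        }
    }
    // Single attempt; returns true if the lock was acquired.
    bool try_lock() { return test_and_set(flag_) == 0; }
    void unlock() { flag_.store(0); }

private:
    std::atomic<int> flag_{0};
};
```

The plain C version in the question is not atomic because another thread can run between the read of *x and the write of 1; the exchange call closes that gap.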
Below are some of my notes on Atomicity that may help you understand the meaning. The notes are from the sources listed at the end and I recommend reading some of them if you need a more thorough explanation rather than point-form bullets as I have. Please point out any errors so that I may correct them.
Definition :
From the Greek meaning "not divisible into smaller parts"
An "atomic" operation is always observed to be done or not done, but never halfway done.
An atomic operation must be performed entirely or not performed at all.
In multi-threaded scenarios, a variable goes from unmutated to mutated directly, with no "halfway mutated" values
Example 1 : Atomic Operations
Consider the following integers used by different threads :
int X = 2;
int Y = 1;
int Z = 0;
Z = X; //Thread 1
X = Y; //Thread 2
In the above example, two threads make use of X, Y, and Z
Each read and write is atomic
The threads will race :
If thread 1 wins, then Z = 2
If thread 2 wins, then Z = 1
Z will definitely be one of those two values
Example 2 : Non-Atomic Operations : ++/-- Operations
Consider the increment/decrement expressions :
i++; //increment
i--; //decrement
The operations translate to :
Read i
Increment/decrement the read value
Write the new value back to i
The operations are each composed of 3 atomic operations, and are not atomic themselves
Two attempts to increment i on separate threads could interleave such that one of the increments is lost
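The lost increment can be reproduced deterministically by spelling out the three steps and choosing the interleaving by hand. The following single-threaded C++ sketch does that; the IncrementSteps type is invented for the illustration and simply models each thread's private copy of the value it read.

```cpp
// Models one thread executing i++ as three separate steps on a shared int.
struct IncrementSteps {
    int local = 0;                                   // the thread's private copy
    void read(const int& shared) { local = shared; }
    void increment() { local += 1; }
    void write(int& shared) const { shared = local; }
};

// Both threads read before either one writes: one increment is lost.
inline int interleaved_increments() {
    int i = 0;
    IncrementSteps a, b;
    a.read(i);
    b.read(i);       // b sees the same 0 that a saw
    a.increment();
    b.increment();
    a.write(i);      // i == 1
    b.write(i);      // i == 1 again, a's update is overwritten
    return i;
}

// Each thread finishes all three steps before the other starts.
inline int sequential_increments() {
    int i = 0;
    IncrementSteps a, b;
    a.read(i); a.increment(); a.write(i);
    b.read(i); b.increment(); b.write(i);
    return i;
}
```

interleaved_increments() returns 1 while sequential_increments() returns 2, which is exactly the difference an atomic increment is meant to rule out.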
Example 3 - Non-Atomic Operations : Values greater than 4-Bytes
Consider the following immutable struct :
struct MyLong
{
    public readonly int low;
    public readonly int high;
    public MyLong(int low, int high)
    {
        this.low = low;
        this.high = high;
    }
}
We create fields with specific values of type MyLong :
MyLong X = new MyLong(0xAAAA, 0xAAAA);
MyLong Y = new MyLong(0xBBBB, 0xBBBB);
MyLong Z = new MyLong(0xCCCC, 0xCCCC);
We modify our fields in separate threads without thread safety :
X = Y; //Thread 1
Y = Z; //Thread 2
In .NET, when copying a value type, the CLR doesn't call a constructor - it moves the bytes one atomic operation at a time
Because of this, the operations in the two threads are now four atomic operations
If there is no thread safety enforced, the data can be corrupted
Consider the following execution order of operations :
X.low = Y.low; //Thread 1 - X = 0xAAAABBBB
Y.low = Z.low; //Thread 2 - Y = 0xBBBBCCCC
Y.high = Z.high; //Thread 2 - Y = 0xCCCCCCCC
X.high = Y.high; //Thread 1 - X = 0xCCCCBBBB <-- corrupt value for X
Reading and writing values greater than 32-bits on multiple threads on a 32-bit operating system without adding some sort of locking to make the operation atomic is likely to result in corrupt data as above
Processor Operations
On all modern processors, you can assume that reads and writes of naturally aligned native types are atomic as long as :
1 : The memory bus is at least as wide as the type being read or written
2 : The CPU reads and writes these types in a single bus transaction, making it impossible for other threads to see them in a half-completed state
On x86 and x64 there is no guarantee that reads and writes larger than eight bytes are atomic
Processor vendors define the atomic operations for each processor in a Software Developer's Manual
In single processors / single core systems it is possible to use standard locking techniques to prevent CPU instructions from being interrupted, but this can be inefficient
Disabling interrupts is another more efficient solution, if possible
In multiprocessor / multicore systems it is still possible to use locks but merely using a single instruction or disabling interrupts does not guarantee atomic access
Atomicity can be achieved by ensuring that the instructions used assert the 'LOCK' signal on the bus to prevent other processors in the system from accessing the memory at the same time
Language Differences
C#
C# guarantees that reads and writes of any built-in value type that takes up to 4 bytes are atomic
Reads and writes of value types that take more than four bytes (double, long, etc.) are not guaranteed to be atomic
The CLI guarantees that reads and writes of variables of value type that are the size (or smaller) of the processor's natural pointer size are atomic
Ex - running C# on a 64-bit OS in a 64-bit version of the CLR performs reads and writes of 64-bit doubles and long integers atomically
Creating atomic operations :
.NET provides the Interlocked Class as part of the System.Threading namespace
The Interlocked Class provides atomic operations such as increment, compare, exchange, etc.
using System.Threading;
int unsafeCount;
int safeCount;
unsafeCount++;
Interlocked.Increment(ref safeCount);
C++
C++ standard does not guarantee atomic behavior
All C / C++ operations are presumed non-atomic unless otherwise specified by the compiler or hardware vendor - including 32-bit integer assignment
Creating atomic operations :
The C++11 concurrency library includes the Atomic Operations Library (<atomic>)
The Atomic library provides atomic types as a class template, std::atomic<T>, to use with any trivially copyable type T
Operations on atomic types are atomic and thus thread-safe
#include <atomic>

struct AtomicCounter
{
    std::atomic<int> value{0}; // start from a defined value
    void increment() {
        ++value;
    }
    void decrement() {
        --value;
    }
    int get() {
        return value.load();
    }
};
Java
Java guarantees that reads and writes of any primitive type that takes up to 4 bytes are atomic
Assignments to volatile longs and doubles are also guaranteed to be atomic
Java provides a small toolkit of classes that support lock-free thread-safe programming on single variables through java.util.concurrent.atomic
This provides atomic lock-free operations based on low-level atomic hardware primitives such as compare-and-swap (CAS) - also called compare and set :
CAS form - boolean compareAndSet(expectedValue, updateValue );
This method atomically sets a variable to the updateValue if it currently holds the expectedValue - reporting true on success
import java.util.concurrent.atomic.AtomicInteger;
public class Counter
{
    private AtomicInteger value = new AtomicInteger();
    public int increment() {
        return value.incrementAndGet();
    }
    public int getValue() {
        return value.get();
    }
}
Sources
http://www.evernote.com/shard/s10/sh/c2735e95-85ae-4d8c-a615-52aadc305335/99de177ac05dc8635fb42e4e6121f1d2
Atomic comes from the Greek ἄτομος (atomos) which means "indivisible". (Caveat: I don't speak Greek, so maybe it's really something else, but most English speakers citing etymologies interpret it this way. :-)
In computing, this means that the operation, well, happens. There isn't any intermediate state that's visible before it completes. So if your CPU gets interrupted to service hardware (IRQ), or if another CPU is reading the same memory, it doesn't affect the result, and these other operations will observe it as either completed or not started.
As an example... let's say you wanted to set a variable to something, but only if it has not been set before. You might be inclined to do this:
if (foo == 0)
{
    foo = some_function();
}
But what if this is run in parallel? It could be that the program will fetch foo, see it as zero, meanwhile thread 2 comes along and does the same thing and sets the value to something. Back in the original thread, the code still thinks foo is zero, and the variable gets assigned twice.
For cases like this, the CPU provides some instructions that can do the comparison and the conditional assignment as an atomic entity. Hence, test-and-set, compare-and-swap, and load-linked/store-conditional. You can use these to implement locks (your OS and your C library has done this.) Or you can write one-off algorithms that rely on the primitives to do something. (There's cool stuff to be done here, but most mere mortals avoid this for fear of getting it wrong.)
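The "set it only if it has not been set before" case maps directly onto compare-and-swap. Here is a minimal C++ sketch using std::atomic; set_if_unset is a made-up helper name, and it assumes 0 means "not set yet", as in the snippet above.

```cpp
#include <atomic>

// Stores `value` into `foo` only if `foo` still holds 0.
// Returns true if this call performed the store, false if another
// caller had already set it.
inline bool set_if_unset(std::atomic<int>& foo, int value) {
    int expected = 0;
    // The comparison with `expected` and the store of `value` happen as
    // one atomic step, so two threads cannot both succeed.
    return foo.compare_exchange_strong(expected, value);
}
```

One design consequence: the new value has to be computed before the attempt, so two threads may both call some_function(), but only one of the results is ever stored.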
Atomicity is a key concept when you have any form of parallel processing (including different applications cooperating or sharing data) that includes shared resources.
The problem is well illustrated with an example. Let's say you have two programs that want to create a file, but only if the file doesn't already exist. Either of the two programs can create the file at any point in time.
If you do (I'll use C since it's what's in your example):
...
f = fopen ("SYNCFILE","r");
if (f == NULL) {
    f = fopen ("SYNCFILE","w");
}
...
you can't be sure that the other program hasn't created the file between your open for read and your open for write.
There's no way you can do this on your own; you need help from the operating system, which usually provides synchronization primitives for this purpose, or another mechanism that is guaranteed to be atomic (for example a relational database where the lock operation is atomic, or a lower-level mechanism like the processor's "test and set" instructions).
Atomicity can only be guaranteed by the OS. The OS uses the underlying processor features to achieve this.
So creating your own testandset function is impossible. (Although I'm not sure whether one could use an inline asm snippet and use the testandset mnemonic directly. It could be that this instruction can only be executed with OS privileges.)
EDIT:
According to the comments below this post, making your own 'bittestandset' function using an ASM directive directly is possible (on Intel x86). However, whether these tricks also work on other processors is not clear.
I stand by my point: if you want to do atomic things, use the OS functions and don't do it yourself.