I'm pretty new to CUDA language, and I need to perform a simulation on particles which get updated at each time step by adding a random value to their position (different from each other, but following the same distribution).
My idea is to give every particle a different curandState (with a different seed), and at each time step simply do a curand(curandState[particle_id]).
I was thinking I could store the random states and particle ids in constant memory on the GPU. But I havent seen anyone do that anywhere, would that raise memory problems? Can this speed up the program?
Thank you for your help :)
I don't think that makes sense. __constant__ memory is constant, and can't be modified directly by threads running on the GPU. curandState, however, needs to be modified each time a random number is generated by a thread (otherwise, you will get the same number generated, over and over).
There's nothing wrong with giving every particle it's own state; that would be the typical usage for this scenario.
Since the retrieval and usage of curandState and the generation of random numbers is being done by an NVIDIA library on the GPU, you can assume that the NVIDIA engineers have done a reasonably good job of optimizing memory accesses so as to be efficient and coalesced, during the operation of retrieving and updating state, and generating random numbers.
__constant__ memory also has the characteristic that it services only one 32 bit value per SM per clock, so it's useful when all threads are accessing the same data element (i.e. broadcast) but not generally useful when each thread is accessing a different element (e.g. separate curandState) even if that access would normally coalesce, e.g. if it were in ordinary global memory.
Related
Note: I don't have my computer and GPU with me so this me typing from memory. I timed this and compiled it correctly so ignore any odd typos should they exist.
I don't know if the overhead of what I'm going to describe below is the problem, or if I'm doing this wrong, or why launching kernels in kernels is slower than one big kernel that has a lot of threads predicate off and not get used. Maybe this is because I'm not swamping the GPU with work that I don't notice the saturation.
Suppose we're doing something simple for the sake of this example, like multiplying all the values in a square matrix by two. The matrices can be any size, but they won't be larger than 16x16.
Now suppose I have 200 matrices all in the device memory ready to go. I launch a kernel like
// One matrix given to each block
__global__ void matrixFunc(Matrix** matrices)
{
Matrix* m = matrices[blockIdx.x];
int area = m->width * m->height;
if (threadIdx.x < area)
// Heavy calculations
}
// Assume 200 matrices, no larger than 16x16
matrixFunc<<<200, 256>>>(ptrs);
whereby I'm using one block per matrix, and an abundance of threads such that I know I'm never going to have less threads per block than cells in a matrix.
The above runs in 0.17 microseconds.
This seems wasteful. I know that I have a bunch of small matrices (so 256 threads is overkill when a 2x2 matrix can function on 4 threads), so why not launch a bunch of them dynamically from a kernel to see what the runtime overhead is? (for learning reasons)
I change my code to be like the following:
__device__ void matrixFunc(float* matrix)
{
// Heavy calculations (on threadIdx.x for the cell)
}
__global__ void matrixFuncCaller(Matrix** matrices)
{
Matrix* m = matrices[threadIdx.x];
int area = m->width * m->height;
matrixFunc<<<1, area>>>(m.data);
}
matrixFuncCaller<<<1, 200>>>(ptrs);
But this performs a lot worse at 11.3 microseconds.
I realize I could put them all on a stream, so I do that. I then change this to make a new stream:
__global__ void matrixFuncCaller(Matrix** matrices)
{
Matrix* m = matrices[threadIdx.x];
int area = m->width * m->height;
// Create `stream`
matrixFunc<<<1, area, 0, stream>>>(m.data);
// Destroy `stream`
}
This does better, it's now 3 microseconds instead of 11, but it's still much worse than 0.17 microseconds.
I want to know why this is worse.
Is this kernel launching overhead? I figure that maybe my examples are small enough such that the overhead drowns out the work seen here. In my real application which I cannot post, there is a lot more work done than just "2 * matrix", but it still is probably small enough that there might be decent overhead.
Am I doing anything wrong?
Put it shortly: the benchmark is certainly biased and the computation is latency bound.
I do not know how did you measure the timings but I do not believe "0.17 microseconds" is even possible. In fact the overhead of launching a kernel is typically few microseconds (I never saw an overhead smaller than 1 microsecond). Indeed, running a kernel should typically require a system call that are expensive and known to take an overhead of at least about 1000 cycles. An example of overhead analysis can be found in this research paper (confirming that it should takes several microseconds). Not to mention current RAM accesses should take at least 50-100 ns on mainstream x86-64 platforms and the one one of GPU requires several hundreds of cycles. While everything may fit in both the CPU and GPU cache is possible this is very unlikely to be the case regarding the kernels (and the fact the GPU may be used for other tasks during multiple kernel executions). For more information about this, please read this research paper. Thus, what you measure has certainly nothing to do with the kernel execution. To measure the overhead of the kernel, you need to care about synchronizations (eg. call cudaDeviceSynchronize) since kernels are launched asynchronously.
When multiple kernels are launched, you may pay the overhead of an implicit synchronization since the queue is certainly bounded (for sake of performance). In fact, as pointed out by #talonmies in the comments, the number of concurrent kernels is bounded to 16-128 (so less than the number of matrices).
Using multiple streams reduces the need for synchronizations hence the better performance results but there is certainly still a synchronization. That being said, for the comparison to be fair, you need to add a synchronization in all cases or measure the execution time on the GPU itself (without taking care of the launching overhead) still in all cases.
Profilers like nvvp help a lot to understand what is going on in such a case. I strongly advise you to use them.
As for the computation, please note that GPU are designed for heavy computational SIMT-friendly kernels, not low-latency kernel operating on small variable-sized matrices stored in unpredictable memory locations. In fact, the overhead of a global memory access is so big that it should be much bigger than the actual matrix computation. If you want GPUs to be useful, then you need to submit more work to them (so to provide more parallelism to them and so to overlap the high latencies). If you cannot provide more work, then the latency cannot be overlapped and if you care about microsecond latencies then GPUs are clearly not suited for the task.
By the way, not that Nvidia GPUs operate on warp of typically 32 threads. Threads should perform coalesced memory loads/stores to be efficient (otherwise they are split in many load/store requests). Operating on very small matrices like this likely prevent that. Not to mention most threads will do nothing. Flattening the matrices and sorting them by size as proposed by #sebastian in the comments help a bit but the computations and memory access will still be very inefficient for a GPU (not SIMT-friendly). Note that using less thread and make use of unrolling should also be a bit more efficient (but still far from being great). CPUs are better suited for such a task (thanks to a higher frequency, instruction-level parallelism combined with an out-of-order execution). For fast low-latency kernels like this FPGAs can be even better suited (though they are hard to program).
If many threads in a warp want to read an address in global memory, this data is broadcasted, is that right?
If many threads in a warp want to write into an address in global memory, there is a serialization, but is not possible to predict the order, is that right?
But, the first question: If many threads in a different warps, in different blocks, want to write into an address in global memory? What the GPU gonna do? Serializes all the access to this address? Is there any guarantee of data consistence?
With Hyper-Q it is possible to launch a lot of streams containing kernels. If I have a position in the memory, and a number of threads in different kernels wants to write or read this address, what the GPU gonna do? Serializes the accesses of all threads from different kernels, or does the GPU do nothing and some inconsistencies gonna happen? Is there any guarantee of data consistence when multiple kernels are reading/writing into the same address?
It's preferred that you ask one question per question.
If many threads in a warp want to read an adress in global memory, this data is broadcasted, is that right?
Yes this is true for Fermi (CC2.0) and beyond.
If many threads in a warp want to write into an adress in global memory, there is a serialization, but is not possible to predict the order, is that right?
Correct. The order is undefined.
If many threads in a different warps, in different blocks, want to write into an adress in global memory? What the GPU gonna do? Serializes all the access to this address?
If the accesses are simultaneous, they are serialized. Again, order is undefined.
Is there any guarantee of data consistence?
Not sure what you mean by data consistence. Anyway, what else could the GPU do except serialize simultaneous writes? I'm surprised this is such a difficult concept, as there appears to me to be no obvious alternative.
If I have a possition in the memory, and a number of threads in different kernels wants to write or read this address, what the GPU gonna do? Serializes the access of all threads from different kernels, or the GPU do nothing and some inconsistences gonna happen? Is there any guarantee of data consistence when multiple kernels are reading/writing into the same address?
It does not matter what is the origin of simultaneous writes to global memory, whether from the same warp, or different warps, in different blocks, in different kernels. Simultaneous writes are serialized, in an undefined order. Again, for "data consistence" I'd like to know what you mean by that. Simultaneous reads and writes are also going to produce undefined behavior. The reads may return a value including the initial value of the memory location or any of the values that were written.
The final result of simultaneous writes to any GPU memory location is undefined. If all simultaneous writes are writing the same value, then the final value in that location will reflect that. Otherwise, the final value will reflect one of the values that got written. Which value is undefined. Beyond that, most of your questions and statements don't make sense to me. (What do you mean by data consistence?) You should not expect anything rational from such programming behavior. The GPU should be programmed as a distributed independent work machine, not a globally synchronous machine. Note that "undefined" also means that results may vary from one run of a kernel to the next, even if the input data is identical.
Simultaneous or nearly simultaneous reading and writing of global memory from different blocks (whether from the same or different kernels) is especially hazardous on Fermi (cc2.x) devices due to the independent non-coherent L1 caches that are interposed between the SMs (where the threadblocks execute) and the L2 cache (which is device-wide, and therefore coherent). Attempting to create synchronized behavior between threadblocks using global memory as a vehicle is difficult at best, and discouraged. It is suggested to consider ways to recast your algorithm to structure the work independently.
I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.
I am really new to programming and Cuda. Basically I have a C function that reads a list of data and then checks each item against a hashmap (I'm using uthash for this in C). It works well but I want to run this process in Cuda (once it gets the value for the hash key then it does a lot of processing), but I'm unsure the best way to create a read only hash function that's as quick as possible in Cuda.
Background
Basically I'm trying to value a very very large batch of portfolio as quickly as possible. I get several million portfolio constantly that are in the form of two lists. One has the stock name and the other has the weight. I then use the stock name to look up a hashtable to get other data(value, % change,etc..) and then process it based on the weight. On a CPU in plain C it takes about 8 minutes so I am interesting in trying it on a GPU.
I have read and done the examples in cuda by example so I believe I know how to do most of this except the hash function(there is one in the appendix but it seems focused on adding to it while I only really want it as a reference since it'll never change. I might be rough around the edges in cuda for example so maybe there is something I'm missing that is helpful for me in this situation, like using textual or some special form of memory for this). How would I structure this for best results should each block have its own access to the hashmap or should each thread or is one good enough for the entire GPU?
Edit
Sorry just to clarify, I'm only using C. Worst case I'm willing to use another language but ideally I'd like something that I can just natively put on the GPU once and have all future threads read to it since to process my data I'll need to do it in several large batches).
This is some thoughts on potential performance issues of using a hash map on a GPU, to back up my comment about keeping the hash map on the CPU.
NVIDIA GPUs run threads in groups of 32 threads, called warps. To get good performance, each of the threads in a warp must be doing essentially the same thing. That is, they must run the same instructions and they must read from memory locations that are close to each other.
I think a hash map may break with both of these rules, possibly slowing the GPU down so much that there's no use in keeping the hash map on the GPU.
Consider what happens when the 32 threads in a warp run:
First, each thread has to create a hash of the stock name. If these names differ in length, this will involve a different number of rounds in the hashing loop for the different lengths and all the threads in the warp must wait for the hash of the longest name to complete. Depending on the hashing algorithm, there might different paths that the code can take inside the hashing algorithm. Whenever the different threads in a warp need to take different paths, the same code must run multiple times (once for each code path). This is called warp divergence.
When all the threads in warp each have obtained a hash, each thread will then have to read from different locations in slow global memory (designated by the hashes). The GPU runs optimally when each of the 32 threads in the warp read in a tight, coherent pattern. But now, each thread is reading from an essentially random location in memory. This could cause the GPU to have to serialize all the threads, potentially dropping the performance to 1/32 of the potential.
The memory locations that the threads read are hash buckets. Each potentially containing a different number of hashes, again causing the threads in the warp to have to do different things. They may then have to branch out again, each to a random location, to get the actual structures that are mapped.
If you instead keep the stock names and data structures in a hash map on the CPU, you can use the CPU to put together arrays of information that are stored in the exact pattern that the GPU is good at handling. Depending on how busy the CPU is, you may be able to do this while the GPU is processing the previously submitted work.
This also gives you an opportunity to change the array of structures (AoS) that you have on the CPU to a structure of arrays (SoA) for the GPU. If you are not familiar with this concept, essentially, you convert:
my_struct {
int a;
int b;
};
my_struct my_array_of_structs[1000];
to:
struct my_struct {
int a[1000];
int b[1000];
} my_struct_of_arrays;
This puts all the a's adjacent to each other in memory so that when the 32 threads in a warp get to the instruction that reads a, all the values are neatly laid out next to each other, causing the entire warp to be able to load the values very quickly. The same is true for the b's, of course.
There is a hash_map extension for CUDA Thrust, in the cuda-thrust-extensions library. I have not tried it.
Because of your hash map is so large, I think it can be replaced by a database, mysql or other products will all be OK, they probably will be fast than hash map design by yourself. And I agree with Roger's viewpoint, it is not suitable to move it to GPU, it consumes too large device memory (may be not capable to contain it) and it is terribly slow for kernel function access global memory on device.
Further more, which part of your program takes 8 minutes, finding in hash map or process on weight? If it is the latter, may be it can be accelerated by GPU.
Best regards!
I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?
Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do this? Would copying just 1 integer after every kernel launch slow down my program?
Let me first answer your last question:
Would copying just 1 integer after every kernel launch slow down my program?
A bit - yes. Issuing the command, waiting for GPU to respond, etc, etc... The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would be probably better to do the whole task on a CPU)
what is the best way to do this?
Well, I would suggest simply trying it yourself. As you said: you can either use mapped-pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to host could be hidden by the execution of the kernel.
If you use the first method, you will have to call cudaThreadsynchronize() to make sure the kernel ended its execution. Kernel calls are asynchronous.
You can use cudaMemcpyAsync which is also asynchronous, but GPU cannot have kernel running and having cudaMemcpyAsync executed in parallel, unless you use streams.
I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will need however to somehow prevent the loop from iterating too many times, CUDA events may be helpful.
Why not use pinned memory? If your system supports it -- see CUDA C Programming Guide's section on pinned memory.
Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value then this will result in very slow performance, don't do it.
What you are describing sounds like a serial algorithm, your algorithm needs to be parallelised in order to make it worth doing using CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, single write of multiple data back to CPU; then your algorithm should be done on CPU.
If you need the value computed in the previous kernel call to launch the next one then is serialized and your choice is to cudaMemcpy(dst,src, size =1, ...);
If all the kernel launch parameters do not depend on the previous launch then you can store all the result of each kernel invocation in GPU memory and then download all the results at once.