Question on CUDA programming model - cuda

Hi I am new to CUDA programming and I had 2 questions on the CUDA programming model.
In brief, the model says there is a memory hierarchy in terms of thread, blocks and then grids. Threads within a block have shared memory and are able to communicate with each other easily, but cannot communicate if they are in different blocks. There is also a global memory on the GPU device.
My questions are:
(1)Why do we need to have such a memory hierarchy consisting of threads and then blocks?
That way any two threads can communicate with each other if needed and hence probably simplify programming effort.
(2) Why is there a restriction of setting up threads only upto 3D configuations and not beyond?
Thank you.

1) This allows you to have a generalized programming model that supports hardware with different numbers of processors. It is also a reflection of the underlying GPU hardware which treats thread within a block differently from threads in different blocks WRT to memory access and synchronization.
Threads can communicate via global memory, or shared memory depending on their block affinity. You can also use synchronization primatives, like __syncthreads.
2) This is part of the programming model. I suspect is largely due to user demand to allow data decomposition for 3 dimensional problems and there was little demand for further dimension support.
The Cuda Programming Guide covers a lot of this sort of stuff. There are also a couple of books available. There's a good discussion in Programming Massively Parallel Processors: A Hands-on Approach that goes into why GPU hardware is the way it is and how that has been reflected in the programming model.

(1) Local memory is used to store local values that doesn't fit into registers. Shared memory is used to store common data, that is shared by threads. Local memory + registers compose execution context of thread, and shared memory is storage for data to be processed.
(2) You can easily use 1D to represent any D. For example if you have 1D index you can convert it to 2D space by using: x = i % width, y = i / width and inverse is i = y*width + x. 2D and 3D were added for your convenience. It is pretty same as N-D arrays are implemented in C++.

Related

What are Cuda shared memory capabilities

I am a newbie in cuda. According to my knowledge I must use global memory to make blocks communicate with each other, but my understanding of the stream concept and memory capabilities stuck somewhere. After searching I figured that streams queue multiple kernels in sequence and can be used to apply different kernels on different blocks.
Now I NEED to exchange arrays between 2 blocks or more. Can kernel be used to swap or exchange data within shared memory between blocks without involving global/device memory.
if I allocated block for each sub population to calculate fitness using some kernel and shared memory, can I transfer data between blocks
No. Shared memory has block scope. It is not portable between blocks. Global memory or heap memory is portable and could potentially be used to hold data to be accessed by multiple blocks.
However, the standard execution model in CUDA doesn't support grid level synchronization. Since CUDA 9, and with the newest generations of hardware, there is support for a grid level synchronization mechanism if you use cooperative groups, however neither PyCUDA nor Numba expose that facility as far as I am aware.

Any guarantees that Torch won't mess up with an already allocated CUDA array?

Assume we allocated some array on our GPU through other means than PyTorch, for example by creating a GPU array using numba.cuda.device_array. Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array? In general, since PyTorch and Numba use the same CUDA runtime and thus I assume the same mechanism for memory management, are they automatically aware of memory regions used by other CUDA programs or does each one of them see the entire GPU memory as his own? If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
EDIT: figured this would be an important assumption: assume that all allocations are done by the same process.
Will PyTorch, when allocating later GPU memory for some tensors, accidentally overwrite the memory space that is being used for our first CUDA array?
No.
are they automatically aware of memory regions used by other CUDA programs ...
They are not "aware", but each process gets its own separate context ...
... or does each one of them see the entire GPU memory as his own?
.... and contexts have their own address spaces and isolation. So neither, but there is no risk of memory corruption.
If it's the latter, is there a way to make them aware of allocations by other CUDA programs?
If by "aware" you mean "safe", then that happens automatically. If by "aware" you imply some sort of interoperability, then that is possible on some platforms, but it is not automatic.
... assume that all allocations are done by the same process.
That is a different situation. In general, the same process implies a shared context, and shared contexts share a memory space, but all the normal address space protection rules and facilities apply, so there is not a risk of loss of safety.

Splitting an array on a multi-GPU system and transferring the data across the different GPUs

I'm using CUDA on a double GPU system using NVIDIA GTX 590 cards and I have an array partitioned according to the figure below.
If I'm going to use CudaSetDevice() to split the sub-arrays across the GPUs, will they share the same global memory? Could the first device access the updated data on the second device and, if so, how?
Thank you.
Each device memory is separate, so if you call cudaSetDevice(A) and then cudaMalloc() then you are allocating memory on device A. If you subsequently access that memory from device B then you will have a higher access latency since the access has to go through the external PCIe link.
An alternative strategy would be to partition the result across the GPUs and store all the input data needed on each GPU. This means you have some duplication of data but this is common practice in GPU (and indeed any parallel method such as MPI) programming - you'll often hear the term "halo" applied to the data regions that need to be transferred between updates.
Note that you can check whether one device can access another's memory using cudaDeviceCanAccessPeer(), in cases where you have a dual GPU card this is always true.

How to adjust the cuda number of block and of thread to get optimal performances

I've tested empirically for several values of block and of thread, and the execution time can be greatly reduced with specific values.
I don't see what are the differences between blocks and thread. I figure that it may be that thread in a block have specific cache memory but it's quite fuzzy for me. For the moment, I parallelize my functions in N parts, which are allocated on blocks/threads.
My goal could be to automaticaly adjust the number of blocks and thread regarding to the size of the memory that I've to use. Could it be possible? Thank you.
Hong Zhou's answer is good, so far. Here are some more details:
When using shared memory you might want to consider it first, because it's a very much limited resource and it's not unlikely for kernels to have very specific needs that constrain
those many variables controlling parallelism.
You either have blocks with many threads sharing larger regions or blocks with fewer
threads sharing smaller regions (under constant occupancy).
If your code can live with as little as 16KB of shared memory per multiprocessor
you might want to opt for larger (48KB) L1-caches calling
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
Further, L1-caches can be disabled for non-local global access using the compiler option -Xptxas=-dlcm=cg to avoid pollution when the kernel accesses global memory carefully.
Before worrying about optimal performance based on occupancy you might also want to check
that device debugging support is turned off for CUDA >= 4.1 (or appropriate optimization options are given, read my post in this thread for a suitable compiler
configuration).
Now that we have a memory configuration and registers are actually used aggressively,
we can analyze the performance under varying occupancy:
The higher the occupancy (warps per multiprocessor) the less likely the multiprocessor will have to wait (for memory transactions or data dependencies) but the more threads must share the same L1 caches, shared memory area and register file (see CUDA Optimization Guide and also this presentation).
The ABI can generate code for a variable number of registers (more details can be found in the thread I cited). At some point, however, register spilling occurs. That is register values get temporarily stored on the (relatively slow, off-chip) local memory stack.
Watching stall reasons, memory statistics and arithmetic throughput in the profiler while
varying the launch bounds and parameters will help you find a suitable configuration.
It's theoretically possible to find optimal values from within an application, however,
having the client code adjust optimally to both different device and launch parameters
can be nontrivial and will require recompilation or different variants of the kernel to be deployed for every target device architecture.
I believe to automatically adjust the blocks and thread size is a highly difficult problem. If it is easy, CUDA would most probably have this feature for you.
The reason is because the optimal configuration is dependent of implementation and the kind of algorithm you are implementing. It requires profiling and experimenting to get the best performance.
Here are some limitations which you can consider.
Register usage in your kernel.
Occupancy of your current implementation.
Note: having more threads does not equate to best performance. Best performance is obtained by getting the right occupancy in your application and keeping the GPU cores busy all the time.
I've a quite good answer here, in a word, this is a difficult problem to compute the optimal distribution on blocks and threads.

Multi-GPU programming strategies using CUDA

I need some advice on a project that I am going to undertake. I am planning to run simple kernels (yet to decide, but I am hinging on embarassingly parallel ones) on a Multi-GPU node using CUDA 4.0 by following the strategies listed below. The intention is to profile the node, by launching kernels in different strategies that CUDA provide on a multi-GPU environment.
Single host thread - multiple devices (shared context)
Single host thread - concurrent execution of kernels on a single device (shared context)
Multiple host threads - (Equal) Multiple devices (independent contexts)
Single host thread - Sequential kernel execution on one device
Multiple host threads - concurrent execution of kernels on one device (independent contexts)
Multiple host threads - sequential execution of kernels on one device (independent contexts)
Am I missing out any categories? What is your opinion about the test categories that I have chosen and any general advice w.r.t multi-GPU programming is welcome.
Thanks,
Sayan
EDIT:
I thought that the previous categorization involved some redundancy, so modified it.
Most workloads are light enough on CPU work that you can juggle multiple GPUs from a single thread, but that only became easily possible starting with CUDA 4.0. Before CUDA 4.0, you would call cuCtxPopCurrent()/cuCtxPushCurrent() to change the context that is current to a given thread. But starting with CUDA 4.0, you can just call cudaSetDevice() to set the current context to correspond to a given device.
Your option 1) is a misnomer, though, because there is no "shared context" - the GPU contexts are still separate and device memory and objects such as CUDA streams and CUDA events are affiliated with the GPU context in which they were created.
Multiple host threads - equal multiple devices, independent contexts is a winner if you can get away with it. This is assuming that you can get truly independent units of work. This should be true since your problem is embarassingly parallel.
Caveat emptor: I have not personally built a large scale multi-GPU system. I have built a successful single GPU system w/ 3 orders of magnitude acceleration relative to CPUs. Thus, the advice is generalization of the synchronization costs I've seen as well as discussion with my colleagues who have built multi-GPU systems.