How does the speed of a CUDA program scale with the number of blocks?

I am working on a Tesla C1060, which contains 240 processor cores with compute capability 1.3. Knowing that each group of 8 cores is controlled by a single multiprocessor, and that each block of threads is assigned to a single multiprocessor, I would expect that launching a grid of 30 blocks should take the same execution time as a single block. However, things don't scale that nicely, and I never got this scaling even with 8 threads per block. Going to the other extreme of 512 threads per block, I get approximately the same time as a single block only when the grid contains at most 5 blocks. This was disappointing when I compared the performance against the same task parallelized with MPI on an 8-core CPU machine.
Can someone explain this to me?
By the way, the computer actually contains two of these Tesla cards, so does it distribute blocks between them automatically, or do I have to take further steps to ensure that both are fully exploited?
EDIT:
Regarding my last question, if I launch two independent MPI processes on the same computer, how can I make each work on a different graphics card?
EDIT2: Based on Pedro's request, here is a plot depicting the total time on the vertical axis, normalized to 1, versus the number of parallel blocks. The number of threads per block is 512. The numbers are rough, since I observed quite a large variance in the times for large numbers of blocks.

The speed is not a simple linear function of the number of blocks. It depends on many factors, such as memory usage, the number of instructions executed per block, and so on.
If you want to do multi-GPU computing, you need to modify your code; otherwise you can only use one GPU card.

It seems to me that you have simply taken a C program and compiled it with CUDA without much thought.
This is not the way to go. You have to design your code to take advantage of the fact that CUDA cards have a different internal architecture than regular CPUs. In particular, take the following into account:
memory access patterns - there are several memory systems on a GPU, and each requires consideration on how to use it best
thread divergence - performance will only be good if most of your threads follow the same code path most of the time (a small made-up kernel illustrating this follows the list)
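A tiny made-up kernel to illustrate the divergence point (not from your code): when even and odd threads of the same warp take different branches, the warp executes both branches one after the other.
#include <cuda_runtime.h>

__global__ void divergent(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)              /* even and odd threads of one warp diverge here,  */
            out[i] = in[i] * 2.0f;   /* so the warp runs the two branches serially      */
        else
            out[i] = in[i] + 1.0f;
    }
}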
If your system has 2 GPUs, you can use both to accelerate some (suitable) problems. The catch is that the memory spaces of the two cards are separate and not easily 'visible' to each other - you have to design your algorithm to take this into account.
A typical C program written in the pre-GPU era will often not port easily unless it was originally written with MPI in mind.
To make each CPU MPI process work with a different GPU card, you can use cudaSetDevice().
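A minimal sketch of this, assuming one MPI rank per GPU on a single node (error checking omitted):
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, ndev;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);

    cudaSetDevice(rank % ndev);   /* rank 0 -> device 0, rank 1 -> device 1, ... */

    /* ... allocate memory and launch kernels as usual; they run on this rank's device ... */

    MPI_Finalize();
    return 0;
}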

Related

How to process a task of arbitrary size using CUDA?

I'm starting to learn CUDA, and have to dive straight into a project, so I currently am lacking a solid theoretical background; I'll be picking it up along the way.
While I understand that the way the hardware is built requires the programmer to deal with thread blocks and grids, I haven't been able to find an answer to the following questions in my introductory book:
What happens when the task size is greater than the amount of threads a GPU can process at a time? Will the GPU then proceed through the array the same way a CPU would, i.e. sequentially?
Thus, should I worry if the number of thread blocks that a given task requires exceeds the number that can run simultaneously on the GPU? I've found the notion of a "thread block limit" so far, and it's obviously higher than what a GPU can be processing at a given moment in time; is that the real (and only) limit I should be concerned with?
Other than choosing the right block size for the given hardware, are there any problems to consider when setting up a kernel for execution? I'm at a loss regarding launching a task of arbitrary size. I even considered going with OpenCL instead of CUDA, because there appears to be no explicit block-size calculation involved when launching a kernel to execute over an array.
I'm fine with this being closed as duplicate in case it is, just be sure to point at the original question.
The number of thread blocks can be arbitrary. The hardware can handle them sequentially if the number is large. This link gives you a basic view.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#scalable-programming-model
On the other hand, you could use a limited number of threads to handle tasks of arbitrary size by increasing the work per thread. This link shows you how to do that and why it is better.
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/
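The core pattern from that article is the grid-stride loop, roughly:
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    /* grid-stride loop: each thread processes every (blockDim.x * gridDim.x)-th
       element, so any grid size can cover an array of any size n */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}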
You may want to read the following two for a full answer.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

Is it possible to set a limit on the number of cores to be used in CUDA programming for a given code?

Assume I have an Nvidia K40, and for some reason I want my code to use only a portion of the CUDA cores (i.e. instead of using all 2880, only use 400 cores, for example). Is this possible? Is it even sensible to do this?
In addition, is there any way to see how many cores are being used by the GPU when I run my code? In other words, can we check during execution how many cores are being used by the code, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. That is not to say it couldn't be useful for something. For example, you might want to run multiple kernels on the same GPU and, for some reason, allocate a certain number of Streaming Multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think that in 99% of cases manual shared memory methods would be better).
The way you could do this is to read the PTX identifiers %nsmid and %smid and put a conditional at the start of each kernel. You would have to launch only one block per Streaming Multiprocessor (SM) and then have each block return early depending on which kernel you want to run on which SMs.
I would caution that this method should be reserved for very experienced CUDA programmers, and only used as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so the behavior would have to be measured before implementation and could be hardware- and CUDA-version dependent. However, since you asked, and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs:
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly:
https://gist.github.com/allanmac/4751080
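The essence of that gist is an inline-PTX wrapper around %smid; a sketch of how you might use it (the SM-selection condition is only an illustration):
#include <cuda_runtime.h>

__device__ __forceinline__ unsigned int smid(void)
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));   /* which SM is this block running on? */
    return id;
}

__global__ void only_on_some_sms(float *out, unsigned int max_sm)
{
    if (smid() >= max_sm)     /* blocks that landed on other SMs exit immediately */
        return;
    /* ... real work for the SMs you kept ... */
}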
I am not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature to partition GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
As to question 1, I don't know of such methods, but I would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can run in parallel, you want to load the GPU as fully and efficiently as possible. But it seems the GPU can, on its own, occupy all SMs with single blocks of one kernel. That is, if you have a kernel with a 30-block grid and 30 SMs, that kernel can occupy the entire GPU; I believe I have seen this effect. That kernel itself will indeed be faster (maybe 1.5x compared with 4 blocks of 256 threads per SM), but it is not efficient when you have other work to run.
The GPU can't know whether we are going to run another kernel after this 30-block one, i.e. whether it would be more effective to spread it across all SMs or not, so some manual way of saying this should exist.
As to question 3, I suppose GPU profiling tools should show this - the Visual Profiler and the newer Nsight tools (Nsight Compute and Nsight Systems) - but I haven't tried them. It won't be a task manager, but rather statistics for the kernels that were executed by your program.
As to the possibility of moving thread blocks between SMs when necessary, @ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and cannot be migrated to other SMs in GPU (except during preemption, debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVidia advertised it roughly like this: say you made a game that runs some heavy kernels (for graphics rendering, for example), and then something unusual happens and you need to execute some lighter kernel as fast as possible. With preemption you can somehow suspend the running kernels and execute this high-priority one, which greatly reduces its latency.
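The closest mechanism I know of in the runtime API is stream priorities; here is a minimal sketch (both kernels are placeholders, and whether the high-priority work actually preempts the other stream depends on the hardware):
#include <cuda_runtime.h>

__global__ void heavy_kernel(float *d)  { /* long-running background work (placeholder) */ }
__global__ void urgent_kernel(float *d) { /* latency-critical work (placeholder) */ }

int main(void)
{
    int least, greatest;
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    cudaStream_t low, high;
    cudaStreamCreateWithPriority(&low,  cudaStreamNonBlocking, least);
    cudaStreamCreateWithPriority(&high, cudaStreamNonBlocking, greatest);

    float *d;
    cudaMalloc(&d, 1024 * sizeof(float));

    heavy_kernel<<<1024, 256, 0, low>>>(d);   /* the "rendering"-style background work */
    urgent_kernel<<<1, 256, 0, high>>>(d);    /* scheduled ahead of the low-priority stream */

    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}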
I also found the following:
CUDA Graphs present a new model for work submission in CUDA. A graph is a series of operations, such as kernel launches, connected by dependencies, which is defined separately from its execution. This allows a graph to be defined once and then launched repeatedly. Separating out the definition of a graph from its execution enables a number of optimizations: first, CPU launch costs are reduced compared to streams, because much of the setup is done in advance; second, presenting the whole workflow to CUDA enables optimizations which might not be possible with the piecewise work submission mechanism of streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe kernel invocations take a lot of time (at least when launching a stream of kernels without waiting for results in between). If you call several kernels, it seems possible to send all the necessary data for all of them while the first kernel is executing on the GPU. So I believe NVidia means that it runs several kernels in parallel and performs some smart load balancing between SMs.
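For reference, a minimal capture-and-replay sketch of the CUDA Graphs API quoted above (step_kernel is a placeholder; the cudaGraphInstantiate signature varies slightly across CUDA versions):
#include <cuda_runtime.h>

__global__ void step_kernel(float *d, int n) { /* one step of work (placeholder) */ }

int main(void)
{
    int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    /* record a sequence of launches once... */
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    for (int i = 0; i < 10; ++i)
        step_kernel<<<256, 256, 0, s>>>(d, n);
    cudaStreamEndCapture(s, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

    /* ...then replay the whole sequence with a single launch call per iteration */
    for (int iter = 0; iter < 100; ++iter)
        cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaFree(d);
    return 0;
}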

Strong scaling on GPUs

I'd like to investigate the strong scaling of my parallel GPU code (written with OpenACC). The concept of strong scaling with GPUs is - at least as far as I know - more murky than with CPUs. The only resource I found regarding strong scaling on GPUs suggests fixing the problem size and increasing the number of GPUs. However, I believe there is some amount of strong scaling within GPUs, for example scaling over streaming multiprocessors (in the Nvidia Kepler architecture).
The intent of OpenACC and CUDA is to abstract the hardware away from the parallel programmer, constraining her to their three-level programming model with gangs (thread blocks), workers (warps) and vectors (SIMT groups of threads). It is my understanding that the CUDA model aims at offering scalability with respect to its thread blocks, which are independent and are mapped to SMXs. I therefore see two ways to investigate strong scaling with the GPU:
Fix the problem size, and set the thread block size and number of threads per block to an arbitrary constant number. Scale the number of thread blocks (grid size).
Given additional knowledge on the underlying hardware (e.g. CUDA compute capability, max warps/multiprocessor, max thread blocks/multiprocessor, etc.), set the thread block size and number of threads per block such that a block occupies an entire and single SMX. Therefore, scaling over thread blocks is equivalent to scaling over SMXs.
My questions are: is my train of thought regarding strong scaling on the GPU correct/relevant? If so, is there a way to do #2 above within OpenACC?
GPUs do strong scale, but not necessarily in the way that you're thinking, which is why you've only been able to find information about strong scaling to multiple GPUs. With a multi-core CPU you can trivially decide exactly how many CPU cores you want to run on, so you can fix the work and adjust the degree of threading across the cores. With a GPU the allocation across SMs is handled automatically and is completely out of your control. This is by design, because it means that a well-written GPU code will strong scale to fill whatever GPU (or GPUs) you throw at it without any programmer or user intervention.
You could run on some small number of OpenACC gangs/CUDA threadblocks and assume that 14 gangs will run on 14 different SMs, but there's a couple of problems with this. First, 1 gang/threadblock will not saturate a single Kepler SMX. No matter how many threads, no matter what the occupancy, you need more blocks per SM in order to fully utilize the hardware. Second, you're not really guaranteed that the hardware will choose to schedule the blocks that way. Finally, even if you find the optimal number of blocks or gangs per SM on the device you have, it won't scale to other devices. The trick with GPUs is to expose as much parallelism as possible so that you can scale from devices with 1 SM up to devices with 100, if they ever exist, or to multiple devices.
If you want to experiment with how varying the number of OpenACC gangs for a fixed amount of work affects performance, you'd do that with either the num_gangs clause, if you're using a parallel region, or the gang clause, if you're using kernels. Since you're trying to force a particular mapping of the loops, you're really better off using parallel, since that's the more prescriptive directive. What you'd want to do is something like the following:
#pragma acc parallel loop gang vector num_gangs(/* vary this number */) vector_length(/* fix this number */)
for (int i = 0; i < N; i++) {
    /* do something */
}
This tells the compiler to vectorize the loop using some provided vector length and then partition the loop across OpenACC gangs. What I'd expect is that as you add gangs you'll see better performance up until some multiple of the number of SMs, at which point performance would become roughly flat (with outliers of course). As I said above, fixing the number of gangs at the point where you see optimal performance is not necessarily the best idea, unless this is the only device you're interested in. Instead, by either letting the compiler decide how to decompose the loop, which allows the compiler to make smart decisions based on the architecture you tell it to build for, or by exposing as many gangs as possible, which gives you additional parallelism that will strong scale to larger GPUs or multiple GPUs, you'd have more portable code.
For occupying a complete SMX I would suggest using shared memory as the limiting resource for occupancy. Write a kernel that consumes all 32 kB of shared memory and the block will occupy the entire SMX, because the SMX will be out of resources for another block. Then you can scale your blocks up from 1 to 13 (for a K20c) and the scheduler will (hopefully) schedule each block to a different SMX. Then you can first scale the threads per block up to 192 to get each CUDA core busy, and then go further to keep the warp schedulers happy. GPUs provide performance through latency hiding, so eventually you have to move on from one block occupying an SMX to N blocks per SMX. You can do that by using less shared memory, and again scale up your warps to cover latency hiding.
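A hedged sketch of that idea, using dynamic shared memory requested at launch time (it assumes roughly 48 kB of shared memory per Kepler SMX, so a 32 kB request leaves no room for a second resident block):
#include <cuda_runtime.h>

__global__ void one_block_per_smx(float *out)
{
    extern __shared__ float smem[];               /* size set by the launch below */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    smem[threadIdx.x] = (float)i;                 /* touch shared memory so it is actually used */
    __syncthreads();
    out[i] = smem[threadIdx.x];
}

void launch(float *d_out, int numBlocks)
{
    /* 32 kB of dynamic shared memory per block: with ~48 kB per SMX,
       only one block can be resident on each SMX at a time */
    one_block_per_smx<<<numBlocks, 192, 32 * 1024>>>(d_out);
}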
I have never touched OpenACC, so if you really want full control over your experimental code, use CUDA instead of OpenACC. You cannot see inside the OpenACC compiler or what it is doing with the pragmas in your code.

Utilizing GPU worth it?

I want to compute the trajectories of particles subject to certain potentials, a typical N-body problem. I've been researching methods for utilizing a GPU (CUDA for example), and they seem to benefit simulations with large N (20000). This makes sense since the most expensive calculation is usually finding the force.
However, my system will have "low" N (less than 20), many different potentials/factors, and many time steps. Is it worth it to port this system to a GPU?
Based on the Fast N-Body Simulation with CUDA article, it seems that it is efficient to have different kernels for different calculations (such as acceleration and force). For systems with low N, it seems that the cost of copying to/from the device is actually significant, since for each time step one would have to copy and retrieve data from the device for EACH kernel.
Any thoughts would be greatly appreciated.
If you have less than 20 entities that need to be simulated in parallel, I would just use parallel processing on an ordinary multi-core CPU and not bother about using GPU.
Using a multi-core CPU would be much easier to program and avoid the steps of translating all your operations into GPU operations.
Also, as you already suggested, the performance gain using GPU will be small (or even negative) with this small number of processes.
There is no need to copy results from the device to host and back between time steps. Just run your entire simulation on the GPU and copy results back only after several time steps have been calculated.
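A sketch of that structure (compute_forces, integrate, and the launch sizes here are placeholders, not taken from your code):
#include <cuda_runtime.h>

__global__ void compute_forces(const float3 *pos, float3 *force, int n) { /* placeholder */ }
__global__ void integrate(float3 *pos, float3 *vel, const float3 *force,
                          float dt, int n)                               { /* placeholder */ }

void run_simulation(float3 *d_pos, float3 *d_vel, float3 *d_force, float3 *h_pos,
                    int n, int numSteps, int copyInterval, float dt)
{
    for (int step = 0; step < numSteps; ++step) {
        compute_forces<<<1, 32>>>(d_pos, d_force, n);        /* n < 20 fits in a single block */
        integrate<<<1, 32>>>(d_pos, d_vel, d_force, dt, n);
        if ((step + 1) % copyInterval == 0)                  /* copy back only occasionally */
            cudaMemcpy(h_pos, d_pos, n * sizeof(float3), cudaMemcpyDeviceToHost);
    }
}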
For how many different potentials do you need to run simulations? Enough to just use the structure from the N-body example and still load the whole GPU?
If not, and assuming the potential calculation is expensive, I'd think it would be best to use one thread for each pair of particles in order to make the problem sufficiently parallel. If you use one block per potential setting, you can then write out the forces to shared memory, __syncthreads(), and use a subset of the block's threads (one per particle) to sum the forces. __syncthreads() again, and continue for the next time step.
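A rough sketch of that layout, assuming one block per potential setting and one thread per ordered particle pair (compute_force() and the parameter array are placeholders):
#include <cuda_runtime.h>

__device__ float3 compute_force(float3 a, float3 b, float potParam)
{
    /* placeholder: evaluate your potential for the pair (a, b) here */
    return make_float3(0.0f, 0.0f, 0.0f);
}

__global__ void pair_forces(const float3 *pos, const float *potParams, float3 *force, int n)
{
    extern __shared__ float3 pairF[];        /* n * n entries, sized at launch */
    float p = potParams[blockIdx.x];         /* one potential setting per block */
    int tid = threadIdx.x;

    if (tid < n * n) {
        int i = tid / n, j = tid % n;
        pairF[tid] = (i == j) ? make_float3(0.0f, 0.0f, 0.0f)
                              : compute_force(pos[i], pos[j], p);
    }
    __syncthreads();

    if (tid < n) {                           /* one thread per particle sums its row */
        float3 f = make_float3(0.0f, 0.0f, 0.0f);
        for (int j = 0; j < n; ++j) {
            f.x += pairF[tid * n + j].x;
            f.y += pairF[tid * n + j].y;
            f.z += pairF[tid * n + j].z;
        }
        force[blockIdx.x * n + tid] = f;     /* one force slice per potential setting */
    }
}
Launch it with one block per potential setting, at least n*n threads per block, and n*n*sizeof(float3) bytes of dynamic shared memory.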
If the potential calculation is not expensive, it might be worth exploring first where the main cost of your simulation is.

Code to limit GPU usage

Is there a command/function/variable that can be set in CUDA code that limits the GPU usage percentage? I'd like to modify an open-source project called Flam4CUDA so that this option exists. The way it is now, it uses as much of every GPU present as possible, with the effect that the temperatures skyrocket (obviously). In an effort to keep temperatures down over long periods of computing, I'd like to be able to tell the program to use, say, 50% of each GPU (or even have different percentages for different GPUs, or maybe also be able to select which GPU(s) to use). Any ideas?
If you want to see the code, it's available with "svn co https://flam4.svn.sourceforge.net/svnroot/flam4 flam4".
There is no easy way to do what you are asking to do. CPU usage is controlled via time-slicing of context switches, while GPUs do not have such fine-grained context switching. GPUs are cooperatively multitasked. This is why the nvidia-smi tool for workstation- and server-class boards has "exclusive" and "prohibited" modes to control the number of GPU contexts that are allowed on a given board.
Messing with the number of threads/block or blocks in a grid, as has been suggested, will break applications that are passing metadata to the kernel (not easily inferred by your software) that depends on the expected block and grid size.
You can look at the GPU properties using CUDA and find the number of multiprocessors and the number of cores per multiprocessor. What you basically need to do is change the block size and grid size of your kernel launches so that you use half of the total number of cores.
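For the "look at the GPU properties" step, a minimal query might look like this (note that the number of cores per multiprocessor is not reported directly by the runtime; it depends on the compute capability):
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d multiprocessors, max %d threads per block\n",
           prop.name, prop.multiProcessorCount, prop.maxThreadsPerBlock);
    return 0;
}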