Artificially downgrade CUDA compute capabilities to simulate other hardware

I am developing software that should run on a variety of CUDA GPUs with different amounts of memory and compute capabilities. It has happened more than once that a customer would report a reproducible problem on their GPU that I couldn't reproduce on my machine. Maybe because I have 8 GB of GPU memory and they have 4 GB, maybe because of compute capability 3.0 rather than 2.0, things like that.
Thus the question: can I temporarily "downgrade" my GPU so that it would pretend to be a lesser model, with smaller amount of memory and/or with less advanced compute capability?
Per the comments, here is a clarification of what I'm asking.
Suppose a customer reports a problem running on a GPU with compute capability C with M gigs of GPU memory and T threads per block. I have a better GPU on my machine, with higher compute capability, more memory, and more threads per block.
Can I run my program on my GPU restricted to M gigs of GPU memory? The answer to this one seems to be "yes, just allocate (whatever mem you have) - M at startup and never use it; that would leave only M until your program exits."
Can I reduce the size of the blocks on my GPU to no more than T threads for the duration of runtime?
Can I reduce compute capability of my GPU for the duration of runtime, as seen by my program?

I originally wanted to make this a comment but it was getting far too big for that scope.
As @RobertCrovella mentioned, there is no native way to do what you are asking for. That said, you can take the following measures to minimize the bugs you see on other architectures.
0) Try to get the output of cudaGetDeviceProperties for the CUDA GPUs you want to target. You could crowdsource this from your users or the community.
1) To restrict memory, you can either implement a memory manager and manually keep track of the memory being used, or use cudaMemGetInfo to get a fairly close estimate. Note: this function also reports memory used by other applications.
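As a minimal sketch of the "reserve everything above M" idea mentioned in the question (the function and variable names here are mine, not part of any API):
#include <cuda_runtime.h>

// Reserve all free memory beyond M bytes at startup and never free it, so the
// rest of the program only ever sees roughly M bytes until it exits.
// Note: cudaMemGetInfo also reflects memory used by other applications, and the
// allocation may have to be shrunk slightly if cudaMalloc fails due to overhead.
static void* g_reservation = NULL;

void restrict_gpu_memory_to(size_t M_bytes)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    if (free_bytes > M_bytes)
        cudaMalloc(&g_reservation, free_bytes - M_bytes);
}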
2) Have a wrapper macro for launching kernels in which you can explicitly check whether the number of blocks / threads fits in the current profile. That is, instead of launching
kernel<float><<<blocks, threads>>>(a, b, c);
You'd do something like this:
LAUNCH_KERNEL((kernel<float>), blocks, threads, a, b, c);
Where you can have the macro be defined like this:
// do/while(0) keeps the macro safe inside un-braced if/else statements
#define LAUNCH_KERNEL(kernel, blocks, threads, ...)  \
    do {                                             \
        check_blocks(blocks);                        \
        check_threads(threads);                      \
        kernel<<<blocks, threads>>>(__VA_ARGS__);    \
    } while (0)
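The check helpers themselves are not shown above; one possible sketch compares the requested configuration against a target profile recorded in step 0 (the struct and its default values are hypothetical):
#include <cstdio>
#include <cuda_runtime.h> // for dim3

// Hypothetical target profile, e.g. filled from the cudaGetDeviceProperties
// output collected in step 0 for the customer's GPU.
struct TargetProfile {
    unsigned int maxThreadsPerBlock;
    unsigned int maxGridSize[3];
};

static TargetProfile g_target = { 512, { 65535, 65535, 1 } }; // e.g. a cc 1.x class device

inline void check_threads(dim3 threads)
{
    unsigned int total = threads.x * threads.y * threads.z;
    if (total > g_target.maxThreadsPerBlock)
        fprintf(stderr, "Launch uses %u threads per block; target allows %u\n",
                total, g_target.maxThreadsPerBlock);
}

inline void check_blocks(dim3 blocks)
{
    if (blocks.x > g_target.maxGridSize[0] || blocks.y > g_target.maxGridSize[1] ||
        blocks.z > g_target.maxGridSize[2])
        fprintf(stderr, "Grid exceeds the target device's maximum grid size\n");
}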
3) Reducing the compute capability is not possible, but you can compile your code for various compute capabilities (architectures) and make sure your kernels have backward-compatible code paths in them. If a certain part of your kernel fails on an older architecture, you can do something like this:
#if !defined(TEST_FALLBACK) && __CUDA_ARCH__ >= 300 // Or any other newer compute
// Implement using new fancy feature
#else
// Implement a fallback version
#endif
You can define TEST_FALLBACK whenever you want to test your fallback code and ensure it works on older compute capabilities.
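For example, to force the fallback path at compile time (the file name and target architecture here are just placeholders):
nvcc -DTEST_FALLBACK -arch=sm_30 my_kernels.cu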

Related

Is it possible to set a limit on the number of cores to be used in Cuda programming for a given code?

Assume I have an Nvidia K40 and, for some reason, I want my code to use only a portion of the CUDA cores (i.e., instead of using all 2880, only use 400 cores, for example). Is it possible? Is it even logical to do this?
In addition, is there any way to see how many cores the GPU is using when I run my code? In other words, can we check during execution how many cores the code is using, with a report like Task Manager in Windows or top in Linux?
It is possible, but the concept in a way goes against fundamental best practices for CUDA. Not to say it couldn't be useful for something, for example if you want to run multiple kernels on the same GPU and for some reason want to allocate some number of streaming multiprocessors to each kernel. Maybe this could be beneficial for L1 caching of a kernel that does not have perfect memory access patterns (I still think for 99% of cases manual shared memory methods would be better).
The way you could do this would be to access the PTX identifiers %nsmid and %smid and put a conditional on the original launching of the kernels. You would have to have only one block per streaming multiprocessor (SM) and then return from each kernel based on which kernel you want running on which SMs.
I would warn that this method should be reserved for very experienced CUDA programmers, and only done as a last resort for performance. Also, as mentioned in my comment, I remember reading that a thread block could migrate from one SM to another, so behavior would have to be measured before implementation and could be hardware and CUDA version dependent. However, since you asked and since I do believe it is possible (though not recommended), here are some resources to accomplish what you mention.
PTX registers for the SM index and the number of SMs:
http://docs.nvidia.com/cuda/parallel-thread-execution/#identifiers
and how to use them in a CUDA kernel without writing PTX directly:
https://gist.github.com/allanmac/4751080
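For reference, the usual pattern for reading those registers from CUDA C++ with inline PTX (the same approach as in the gist above; the launch-filtering logic sketched here is mine):
// Read the SM index and the number of SMs via inline PTX. These values are
// informational: the hardware is free to schedule blocks on any SM.
__device__ unsigned int get_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__device__ unsigned int get_nsmid()
{
    unsigned int n;
    asm volatile("mov.u32 %0, %%nsmid;" : "=r"(n));
    return n;
}

__global__ void work_on_selected_sms(unsigned int first_sm, unsigned int last_sm)
{
    // Assumes one resident block per SM; blocks that land outside the chosen
    // range of SM indices simply exit without doing any work.
    unsigned int sm = get_smid();
    if (sm < first_sm || sm > last_sm)
        return;
    // ... actual kernel work goes here ...
}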
Not sure whether it works with the K40, but for newer Ampere GPUs there is the MIG (Multi-Instance GPU) feature for partitioning GPUs.
https://docs.nvidia.com/datacenter/tesla/mig-user-guide/
I don't know of such methods, but I would like to learn about them.
As to question 2, I suppose this can sometimes be useful. When you have complicated execution graphs with many kernels, some of which can execute in parallel, you want to load the GPU as fully and effectively as possible. But it seems that, left on its own, the GPU can occupy all SMs with single blocks of one kernel. I.e., if you have a kernel with a 30-block grid and 30 SMs, that kernel can occupy the entire GPU. I believe I have seen such an effect. That kernel really will be faster (maybe 1.5x versus four 256-thread blocks per SM), but this is not effective when you have other work.
The GPU can't know whether we are going to run another kernel after this 30-block one or not, i.e. whether it would be more effective to spread it across all SMs or not, so some manual way of saying so should exist.
As to question 3, I suppose GPU profiling tools should show this: Visual Profiler and the newer Parallel Nsight and Nsight Compute. But I haven't tried them. It will not be a task manager, but rather statistics for the kernels that your program executed.
As to the possibility of moving thread blocks between SMs when necessary,
@ChristianSarofeen, I can't find any mention that this is possible. Quite the contrary:
Each CUDA block is executed by one streaming multiprocessor (SM) and
cannot be migrated to other SMs in GPU (except during preemption,
debugging, or CUDA dynamic parallelism).
https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/
Although, starting from some architecture, there is such a thing as preemption. As I remember, NVIDIA advertised it in the following way. Let's say you made a game that runs some heavy kernels (say for graphics rendering), and then something unusual happens: you need to execute some not-so-heavy kernel as fast as possible. With preemption you can somehow unload the running kernels and execute this high-priority one. This greatly improves the response time of that high-priority kernel.
I also found this:
CUDA Graphs present a new model for work submission in CUDA. A graph
is a series of operations, such as kernel launches, connected by
dependencies, which is defined separately from its execution. This
allows a graph to be defined once and then launched repeatedly.
Separating out the definition of a graph from its execution enables a
number of optimizations: first, CPU launch costs are reduced compared
to streams, because much of the setup is done in advance; second,
presenting the whole workflow to CUDA enables optimizations which
might not be possible with the piecewise work submission mechanism of
streams.
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs
I do not believe that kernel invocations take a lot of time (at least for a stream of kernels where you don't wait for results in between). If you call several kernels, it seems possible to send all the necessary data for all the kernels while the first kernel is executing on the GPU. So I believe NVIDIA means that it runs several kernels in parallel and performs some smart load balancing between SMs.
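For what it's worth, a minimal sketch of building such a graph via stream capture (the kernels here are hypothetical placeholders):
#include <cuda_runtime.h>

// Placeholder kernels standing in for a real workload.
__global__ void kernelA(float* x) { }
__global__ void kernelB(float* x) { }

void run_with_graph(float* d_x, int iterations)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Record the launches once...
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<64, 256, 0, stream>>>(d_x);
    kernelB<<<64, 256, 0, stream>>>(d_x);
    cudaStreamEndCapture(stream, &graph);

    // ...then instantiate and replay the whole graph repeatedly.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);
    for (int i = 0; i < iterations; ++i)
        cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
}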

Running parallel CUDA tasks

I am about to create a GPU-enabled program using CUDA technology. It will be either C# with Emgu or C++ with the CUDA Toolkit (not yet decided).
I need to use all the GPU's power (I have a card with 16 GPU cores). How do I run 16 tasks in parallel?
First off: 16 GPU cores is, on the pre-6xx series, equal to 16*8 = 128 cores. On the 6xx series it is 16*32 = 512 cores. That does not mean you should limit yourself to 128/512 tasks.
Second: Emgu seems to be an OpenCV wrapper for .NET and is related to image processing. It generally has nothing to do with GPU programming. Some algorithms may have been GPU-accelerated, but I don't know anything about that. The alternative to CUDA here is OpenCL, not OpenCV. If you will be using CUDA technology like you say, you have no alternative to CUDA, as only CUDA is CUDA.
When it comes to starting tasks, you only tell the GPU how many threads you wish to run. Actually, you tell the GPU how many blocks, and how many threads per block, you wish to run. This is done when you call the CUDA kernel itself. You don't want to limit yourself to 128/512 threads either; experiment.
I don't know your level of knowledge of GPGPU programming, but remember that you cannot run tasks as you do on the CPU. You cannot run 128 different tasks; all threads have to run the exact same instructions (except when branching, which should generally be avoided).
Generally speaking, you want sufficient threads to fill all the streaming multiprocessors. At a minimum that is .25 * MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR.
Specifically in CUDA now, suppose you have some CUDA kernel __global__ void square_array(float *a, int N)...
Now when you launch the kernel you specify the number of blocks and the number of threads per block
square_array <<< n_blocks, n_threads_per_block >>> (a, N);
Note: you need to get more familiar with the CUDA parallel programming model, as you are not approaching it in a manner that will use all your GPU's power. Consider reading Programming Massively Parallel Processors: A Hands-on Approach.
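For completeness, a minimal self-contained version of such a launch; the kernel body below is a plausible sketch, not the original:
#include <cuda_runtime.h>

// Plausible body for the square_array kernel mentioned above.
__global__ void square_array(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] * a[idx];
}

void square_on_gpu(float *a, int N) // a is assumed to point to device memory
{
    int n_threads_per_block = 256;
    int n_blocks = (N + n_threads_per_block - 1) / n_threads_per_block; // round up
    square_array<<<n_blocks, n_threads_per_block>>>(a, N);
    cudaDeviceSynchronize();
}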

CUDA development on different cards?

I'm just starting to learn how to do CUDA development (using version 4) and was wondering whether it is possible to develop on a different card than the one I plan to use. As I learn, it would be nice to know this so I can keep an eye out for differences that are going to impact me.
I have a mid-2010 MacBook Pro with an NVIDIA GeForce 320M graphics card (it's a pretty basic integrated laptop card), but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if it's possible to develop locally on my laptop and then run it on EC2 with minimal changes (I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question: I heard that recursion is supported on newer cards (and maybe not on my laptop's). What if I run a recursion on my laptop GPU? Will it kick out an error, or will it run but not utilize the hardware features? (I don't need a specific answer to this, but this is the kind of thing I'm getting at.)
If this is going to be a problem, are there emulators for the features not available on my current card? Or will the SDK emulate them for me?
Sorry if this question is too basic.
Yes, it's pretty common practice to use different GPUs for development and production. NVIDIA GPU generations are backward-compatible, so if your program runs on the older card (that is, the 320M, CC 1.3), it will certainly run on the M2070 (CC 2.0).
If you want to get maximum performance, however, you should profile your program on the same architecture you are going to run it on, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provides a much worse view of what's going on than running on a GPU, no matter how old.
Regarding recursion: an attempt to compile a program with obvious recursion for the 1.3 architecture produces a compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is at detecting recursion), but it certainly won't work: in the 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend you avoid recursion at any cost: it goes against the GPGPU programming paradigm and would certainly lead to very poor performance. Most algorithms are easily rewritten without recursion, and that is a much more preferable way to implement them, not only on the GPU but on the CPU as well.
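As a trivial illustration of such a rewrite, the recursive factorial rejected above can be expressed as a loop, which compiles for any compute capability:
// Iterative replacement for the recursive factorial rejected by sm_13 above.
__device__ int factorial(int n)
{
    int result = 1;
    for (int i = 2; i <= n; ++i)
        result *= i;
    return result;
}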
The CUDA version at first is not that important. More important is the compute capability of your card.
If you program your kernels using cc 1.0 and they scale for the future, you won't have any problems.
Choose for yourself the minimum cc level you need for your application.
Calculate the necessary parameters using the device properties and use PTX JIT compilation:
If your kernel can handle arbitrarily sized input data and your kernel launch configuration scales across thousands of threads, it will scale across future versions.
In my projects all my kernels used a fixed number of threads per block, equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor, to reach 100% occupancy (a sketch of this calculation follows below).
Some kernels need the number of threads per block to be a multiple of two, so I handled this case as well, since the above equation did not guarantee such a block size for all cc versions.
Some kernels used shared memory, and its size was also derived from the cc-level properties.
This data was obtained using cudaGetDeviceProperties in a utility class, and with PTX JIT compilation my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on the latest CUDA cards without any changes!
All kernels were programmed to work with 64-bit-length input data and to utilize all dimensions of the 3D grid. (I am pretty sure I will still be working on this project in a year, so this was necessary.)
All my kernels except one stayed within the cc 1.0 register limit while having 100% occupancy. So if the card's cc was below 1.2, I added a maxregcount option to my kernel build to still enforce 100% occupancy.
This does not guarantee the best possible performance!
For the best possible performance, each kernel should be analyzed regarding its parameters and resources.
This may not be practicable for all applications and requirements.
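A sketch of the block-size calculation described above (note that maxBlocksPerMultiProcessor only appears in cudaDeviceProp on recent toolkits; on older toolkits these values had to be tabulated per compute capability):
#include <cuda_runtime.h>

int pick_block_size(int device)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    // Resident threads per SM divided by resident blocks per SM, as described above.
    int threads = prop.maxThreadsPerMultiProcessor / prop.maxBlocksPerMultiProcessor;
    // Some kernels may additionally require rounding (e.g. to a multiple of two),
    // as noted above; that adjustment is omitted here.
    return threads;
}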
The NVIDIA Kepler K20 GPU, available in Q4 2012 with CUDA 5, will support recursive algorithms.

Code to limit GPU usage

Is there a command/function/variable that can be set in CUDA code that limits the GPU usage percentage? I'd like to modify an open-source project called Flam4CUDA so that that option exists. The way it is now, it uses as much of every GPU present as possible, with the effect that the temperatures skyrocket (obviously). In an effort to keep temperatures down over long periods of computing, I'd like to be able to tell the program to use, say, 50% of each GPU (or even different percentages for different GPUs, or maybe also be able to select which GPU(s) to use). Any ideas?
If you want to see the code, it's available with "svn co https://flam4.svn.sourceforge.net/svnroot/flam4 flam4".
There is no easy way to do what you are asking to do. CPU usage is controlled via time-slicing of context switches, while GPUs do not have such fine-grained context switching. GPUs are cooperatively multitasked. This is why the nvidia-smi tool for workstation- and server-class boards has "exclusive" and "prohibited" modes to control the number of GPU contexts that are allowed on a given board.
Messing with the number of threads/block or blocks in a grid, as has been suggested, will break applications that are passing metadata to the kernel (not easily inferred by your software) that depends on the expected block and grid size.
You can look at the GPU properties using CUDA and find the number of multiprocessors and the number of cores per multiprocessor. What you basically need to do is change the block size and grid size of the kernel functions so that you use half of the total number of cores.
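As a rough sketch of that suggestion (the kernel is a hypothetical placeholder, and, per the previous answer's caveat, this only reduces the amount of work submitted rather than pinning work to specific SMs):
#include <cuda_runtime.h>

__global__ void flame_iteration(float* buf) { } // hypothetical placeholder

void launch_on_half_the_gpu(float* d_buf)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Submit only half a GPU's worth of blocks (one block per SM as a rough proxy).
    int n_blocks = prop.multiProcessorCount / 2;
    flame_iteration<<<n_blocks, 256>>>(d_buf);
}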

HELP! CUDA kernel will no longer launch after using too much memory

I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory. Thus the maximum possible number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
If you compile your code like this:
nvcc -Xptxas="-v" [other compiler options]
the assembler will report the amount of local memory that the code requires. This can be a useful diagnostic for seeing what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread heap memory that a kernel will try to consume during execution.
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory accesses, including buffer overflows and illegal memory usage. It might be that your code is overflowing a buffer somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
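For reference, on current toolkits the equivalent of cudaThreadSetLimit is cudaDeviceSetLimit; the sizes below are arbitrary examples:
#include <cuda_runtime.h>

void shrink_per_thread_memory()
{
    // Heap used by in-kernel malloc/new (8 MB here, purely as an example).
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
    // Per-thread stack size (4 KB here, purely as an example).
    cudaDeviceSetLimit(cudaLimitStackSize, 4 * 1024);
}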
I got it! NVIDIA Nsight 2.0, which supposedly supports CUDA 4, changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.