Simple Compute-Intensive CUDA Program - cuda

I'm preparing an acceptance test for a new machine with Nvidia graphics cards and I'd like a simple CUDA program that will fully exercise the GPU for a full day. The intent is to generate large amounts of heat and ensure the new machine is stable under the load. I'd like the code to be very easy to compile and run (no dependencies, no large input data sets), and also very easy to verify (small amounts of output). Also, I'd like it to be command-line only, no GUI (the test will have to be automated).
I was originally thinking of repeatedly running Vector Dot Products of large vectors. However, that's mostly memory-intensive. So if the GPUs are constantly waiting on memory accesses, then they probably aren't generating as much heat as they could.
I'm running on a CentOS Linux machine.
Does anyone have any suggestions?

You didn't mention which OS you are on.
Ideally, you would want to stress the floating point units, the logic/integer units, the GPU memory, the GPU voltage regulators (VRMs) and the main PSU. I don't think there is any single utility out there that does that.
Memory:
http://sourceforge.net/projects/cudagpumemtest/
Integer (?):
http://sourceforge.net/projects/cudalucas/
PSU and VRMs (In the past, this program could cause GPUs to run out-of-spec, breaking the card. I don't think that's the case anymore):
http://www.ozone3d.net/benchmarks/fur/

Related

TensorFlow strange memory usage

I'm on an Ubuntu 19.10 machine (with KDE desktop environment) with 8GB of RAM, an i5 8250u and an MX130 gpu (2GB VRAM), running a Jupyter Notebook with tensorflow-gpu.
I was just training some models to test their memory usage, and I can't see any sense in what I'm looking at. I used KSysGUARD and NVIDIA System Monitor (https://github.com/congard/nvidia-system-monitor) to monitor my system during training.
As I hit "train", on NVIDIA S.M. show me that memory usage is 100% (or near 100% like 95/97%) the GPU usage is fine.
Always in NVIDIA S.M., I look at the processes list and "python" occupies only around 60MB of vram space.
In KSysGUARD, python's memory usage is always around 700mb.
There might be some explanation for that, the problem is that the gpu's memory usage hits 90% with a model with literally 2 neurons (densely connected of course xD), just like a model with 200million parameters does. I'm using a batch size of 128.
I thought around that mess, and if I'm not wrong, a model with 200million parameters should occupy 200000000*4bytes*128 bytes, which should be 1024gb.
That means I'm definitely wrong on something, but I'm too selfless to keep that riddle for me, so I decided to give you the chance to solve this ;D
PS: English is not my main language.
Tensorflow by default allocates all available VRAM in the target GPU. There is an experimental feature called memory growth that let's you control that, basically stops the initialization process from allocating all VRAM and does it when there is a need for it.
https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth

Does nvidia gpu work less efficiently when it is the only gpu in PC?

I want to assemble a new computer mainly for CUDA applications. When it comes to CPU I have to choose between AMD and Intel.
Most of the AMD's processors don't have integrated gpu while Intel's processors do.
My question is:
If the nvidia gpu would be the only graphic processing unit in the whole PC (without integrated one),
would its efficiency for CUDA programs be worse as it has to produce some graphics on a desktop (while using for example Matlab)?
The anwer is yes, efficiency would be slightly lower due to the GPU doing display tasks, like moving the cursor around or scrolling a display in a .pdf browser.
however if you are aiming for a reasonably mid-to-high-end GPU, the loss of efficiency is marginal. If you have enough money, you will buy dedicated GPU, but if not, then just don't bother. It might be like 1% or less.
A bigger problem is that the display takes up RAM, that (a) becomes unavailable to CUDA applications and (b) the CUDA manual states that the display driver is allowed to dis-own the CUDA application from it's memory at any time without warning (!).
If you ask me if that does really happen (display driver taking over the CUDA app memory), then yes, I have experienced it, with the prime example being when you change the resolution of your display.
So definetely don't do any banking with GPUs or you might see your accounts being randomly infused with millions :-)
That's why 'proffesional' CUDA cards (the tesla variety) have no display outputs - just in case.

Kernel time increases for same number of particles

I am trying to run my code on NVIDIA's K10 GPU. I am using 5.0 CUDA Driver and 4.2 CUDA runtime. The problem is that the time taken by the kernel increases with iterations, where each iteration uses the same number of sources and targets (or particles). Because of this, the kernel eventually takes very large times, and the code crashes with runtime error, which says something like "GPU fallen off the bus".
The plot showing the behavior of increasing kernel run time with number of iterations can be seen here:
https://docs.google.com/open?id=0B5QLL4ig3LVqODdmVjNBTlp5UFU
I tried to run the NVIDIA "nbody" example to understand if the same thing happens here too, and yes it does. For the number of particles/bodies (Np) = 1e5 and 10 iterations, code runs fine. For Np=1e5 and iterations= 100, OR Np=1e6 and iterations = 10, code goes into a mode where it hangs the entire system.
When I run my own kernel as well as NVIDIA's nbody example on a different machine with Tesla C2050 NVIDIA card (CUDA Driver version: 3.2, and runtime version: 3.2), there is no problem, and kernel takes the same amount of time for every iteration.
I am trying to understand whats going on in the machine with the K10 GPU. I have tried different combinations of CUDA driver and runtime versions on this machine, and here is what I get:
For 5.0 CUDA Driver, 4.2 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
For 4.2 CUDA Driver, 4.2 Runtime, the codes (nbody as well as my code) crash with error: "CUDA Runtime API error 39: uncorrectable ECC error encountered."
For 5.0 CUDA Driver, 5.0 Runtime, it just hangs and sometimes says "GPU fallen off the bus".
This is a 64-bit linux machine, which we have recently assembled with NVIDIA K10 GPU card. I am using gfortran44 and gcc44.
Please let me know if any other info. is required to track the problem.
Thanks in advance for the help!
M
I'm mostly just creating an answer so we can call this question closed, but I'll try to add a few details.
Tesla GPUs come in 2 distinct categories: those with a fan, and those without. Those with a fan carry (at this time) the "C" designation, although the K20 product family naming will be slightly different:
These are not exhaustive lists:
Tesla GPUs with a Fan: C870, C1060, C2050, C2070, C2075, K20c ("C Class")
Tesla GPUs without a Fan: M1060, M2050, M2070, M2075, M2090, K10, K20, K20X ("M class")
(note that there is currently no K10 type product with a fan or "C" designation)
Tesla GPUs with a fan are designed to be plugged into a wide variety of PC boxes and chassis, including various workstation and server variants. Since they have their own fan, they require a supply of inlet air that is below a certain temperature level, but given that, they will keep themselves cool. As the workload increases, and the generated heat increases, they will spin up their own fan to keep themselves cool. The main ways you can screw up this process are by either restricting the inlet air flow or by putting it in an ambient air environment that is hotter than its max inlet spec.
Tesla GPUs without a fan have something called a passive heatsink and they cannot keep themselves cool independently and take a passive role in the cooling process. They still have a temperature sensor, but it becomes the responsibility of the server BMC (baseboard management controller) to monitor this temperature sensor (this is done directly at the hardware/firmware level, independent of any OS or any activity being directed at the GPU), and to direct a level of airflow over the card that is sufficient to keep the card cool based on it's indicated temperature. The BMC does this by ramping up whatever fans are designed into the server chassis that control airflow over the GPU. Normally there will be shrouding/ducting within the chassis to aid in this process. Server manufacturers integrating these cards have a variety of responsibilities and must follow various technical specifications from NVIDIA in order to make this work.
If you happen to get your hands on a Tesla GPU without a fan and just slap it in some random chassis, you're pretty much guaranteed to have the behavior as described in this question. For this reason, Tesla "M" series and "K" series GPUs are normally only sold to OEMs who have undergone the qualification process.
Since the average sysadmin/system assembler is not likely to devise a suitable closed loop fan control system and normally does not have easy access to the necessary specifications defining the temperature sensor and access method, the only klugey workaround if you have one of these that you simply must play with, is to direct a high level of continuous airflow over the card, in whatever setting you put it. Be advised, that this will most likely be noisy. If you don't have a noisy level of airflow, you probably do not have enough airflow to keep a card cool that is in a high workload situation. In addition, you should probably keep an eye on GPU temps. Note that the nvidia-smi method for monitoring GPU temps does not work for all M class GPUs (i.e. GPUs without a fan). Unfortunately, the method of temperature sensor access in Fermi and prior for the M class GPUs (different than the C class GPUs) was such that it could not be readily monitored in-system via the nvidia-smi command, so in these cases you will get no temperature reading from nvidia-smi, making this approach even harder to manage. Things changed with the Kepler generation, so now the temperature can be monitored both by the nvidia-smi method and by the server BMC at the hardware/firmware level.
C class products with a fan have a temperature that can be monitored with nvidia-smi, regardless of generation. But this is normally not necessary since the card has it's own control system to keep itself cool.
As mentioned in the comments, all GPUs also have a variety of protection mechanisms, none of which are guaranteed to prevent damage. (If you throw the card in a fire, there's nothing to be done about that.) But the first typical mechanism is thermal throttling. At some predefined high temperature near the maximum safe operating range of the GPU, the GPU firmware will independently reduce its clocks to attempt to prevent further temperature rise. (If the card is clocked slower, then generally it's ability to generate heat is also somewhat reduced.) This is a crude mechanism, and when this thermal throttling occurs, something in the cooling arena is already wrong. The card is designed to not enter thermal throttling ever, under normal operating conditions. If temperatures continue to rise (and there is not much headroom at this point), the card will enter it's final protection mode which is to halt itself. At this point the GPU has become unresponsive to the system, and at the OS level, messages like "gpu has fallen of the bus" are typical. This means cooling has failed and protection mechanisms have failed.

right way to report CUDA speedup

I would like to compare the performance of a serial program running on a CPU and a CUDA program running on a GPU. But I'm not sure how to compare the performance fairly. For example, if I compare the performance of an old CPU with a new GPU, then I will have immense speedup.
Another question: How can I compare my CUDA program with another CUDA program reported in a paper (both run on different GPUs and I cannot access the source code).
For fairness, you should include the data transfer times to get the data into and out of the GPU. It's not hard to write a blazing fast CUDA function. The real trick is in figuring out how to keep it fed, or how to hide the cost of data transfer by overlapping it with other necessary work. Unless your routine is 100% compute-bound, including data transfer in your units-of-work-done-per-unit-of-time is critical to understanding how your implementation would handle, say, a lot more units of work.
For cross-device comparisons, it might be useful to report units of work performed per unit of time per processor core. The per processor core will help normalize large differences between, say, a 200 core and a 2000 core CUDA device.
If you're talking about your algorithm (not just output), it is useful to describe how you broke the problem down for parallel execution - your block/thread distribution, for example.
Make sure you are not measuring performance on a debug build, or running in a debugger. Debugging adds overhead.
Make sure that your work sample is large enough that it is significantly above the "noise floor". A test run that takes a few seconds to complete will be measuring more of your function and less of the ambient noise of the environment than a test run that completes in milliseconds. You can always divide the units of work by the test execution time to arrive at a sexy "units per nanosecond" figure, but you don't actually measure it that way.
The speed of cuda program on different GPUs depends on many factors of the GPU like memory bandwidth, core clock speed, cores, number of threads/registers/shared memory available. so it is difficult to compare the performance in different GPUs

CUDA development on different cards?

I'm just starting to learn how to do CUDA development(using version 4) and was wondering if it was possible to develop on a different card then I plan to use? As I learn, it would be nice to know this so I can keep an eye out if differences are going to impact me.
I have a mid-2010 macbook pro with a Nvidia GeForce 320M graphic cards(its a pretty basic laptop integrated card) but I plan to run my code on EC2's NVIDIA Tesla “Fermi” M2050 GPUs. I'm wondering if its possible to develop locally on my laptop and then run it on EC2 with minimal changes(I'm doing this for a personal project and don't want to spend $2.4 for development).
A specific question is, I heard that recursions are supported in newer cards(and maybe not in my laptops), what if I run a recursion on my laptop gpu? will it kick out an error or will it run but not utilize the hardware features? (I don't need the specific answer to this, but this is kind of the what I'm getting at).
If this is going to be a problem, is there emulators for features not avail in my current card? or will the SDK emulate it for me?
Sorry if this question is too basic.
Yes, it's a pretty common practice to use different GPUs for development and production. nVidia GPU generations are backward-compatible, so if your program runs on older card (that is if 320M (CC1.3)), it would certainly run on M2070 (CC2.0)).
If you want to get maximum performance, you should, however, profile your program on same architecture you are going to use it, but usually everything works quite well without any changes when moving from 1.x to 2.0. Any emulator provide much worse view of what's going on than running on no-matter-how-old GPU.
Regarding recursion: an attempt to compile a program with obvious recursion for 1.3 architecture produces compile-time error:
nvcc rec.cu -arch=sm_13
./rec.cu(5): Error: Recursive function call is not supported yet: factorial(int)
In more complex cases the program might compile (I don't know how smart the compiler is in detecting recursions), but certainly won't work: in 1.x architecture there was no call stack, and all function calls were actually inlined, so recursion is technically impossible.
However, I would strongly recommend you to avoid recursion at any cost: it goes against GPGPU programming paradigm, and would certainly lead to very poor performance. Most algorithms are easily rewritten without the use of recursion, and it is much more preferable way to utilize them, not only on GPU, but on CPU as well.
The Cuda Version at first is not that important. More important are the compute capabilities of your card.
If you programm your kernels using cc 1.0 and they are scalable for the future you won't have any problems.
Choose yourself your minimum cc level you need for your application.
Calculate necessary parameters using properties and use ptx jit compilation:
If your kernel can handle arbitrary input sized data and your kernel launch configuration scales across thousands of threads it will scale across future versions.
In my projects all my kernels used a fixed number of threads per block which was equal to the number of resident threads per streaming multiprocessor divided by the number of resident blocks per streaming multiprocessor to reach 100% occupancy.
Some kernels need a multiple of two number of threads per block so I handled this case also since not for all cc versions the above equation guaranteed a multiple of two block size.
Some kernels used shared memory and its size was also deducted by the cc level properties.
This data was received using (cudaGetDeviceProperties) in a utility class and using ptx jit compiling my kernels worked without any changes on all devices. I programmed on a cc 1.1 device and ran tests on latest cuda cards without any changes!
All kernels were programmed to work with 64-bit length input data and utilizing all dimensions of the 3D Grid. (I am pretty sure in a year I will continue working on this project so this was necessary)
All my kernels except one did not exceeded the cc 1.0 register limit while having 100% occ. So if the used card cc was below 1.2 I added a maxregcount command to my kernel to still enforce 100% occ.
This does not guarantees best possible performance!
For possible best performance each kernel should be analyzed regarding its parameters and resources.
This maybe is not practicable for all applications and requirements
The NVidia Kepler K20 GPU available in Q4 2012 with CUDA 5 will support recursive algorithms.