TensorFlow strange memory usage - deep-learning

I'm on an Ubuntu 19.10 machine (with the KDE desktop environment) with 8 GB of RAM, an i5-8250U and an MX130 GPU (2 GB VRAM), running a Jupyter Notebook with tensorflow-gpu.
I was just training some models to test their memory usage, and I can't make sense of what I'm seeing. I used KSysGuard and NVIDIA System Monitor (https://github.com/congard/nvidia-system-monitor) to monitor my system during training.
As soon as I hit "train", NVIDIA S.M. shows memory usage at 100% (or close to it, around 95-97%), while GPU utilization looks fine.
Still in NVIDIA S.M., the process list shows "python" occupying only around 60 MB of VRAM.
In KSysGuard, python's memory usage is always around 700 MB.
There might be some explanation for that. The problem is that GPU memory usage hits 90% with a model with literally 2 neurons (densely connected of course xD), just as it does with a model with 200 million parameters. I'm using a batch size of 128.
I thought about that mess, and if I'm not wrong, a model with 200 million parameters should occupy 200000000 * 4 bytes * 128, which should be about 102.4 GB.
That means I'm definitely wrong about something, but I'm too selfless to keep that riddle to myself, so I decided to give you the chance to solve it ;D
PS: English is not my main language.

By default, TensorFlow allocates nearly all available VRAM on the target GPU as soon as it initializes the device, regardless of how small the model is. There is an experimental feature called memory growth that lets you control this: it stops the initialization process from allocating all the VRAM up front and instead allocates memory only as it is needed.
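A minimal sketch of enabling memory growth, assuming TensorFlow 2.1 or later (where `tf.config.list_physical_devices` is available). The helper name `enable_memory_growth` is made up for illustration. The call has to happen before any GPU has been initialized, so it belongs at the very top of the notebook.

```python
def enable_memory_growth():
    """Turn on memory growth for every visible GPU.

    Returns the number of GPUs found, or None if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return None

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        try:
            # Allocate VRAM on demand instead of grabbing it all at startup.
            tf.config.experimental.set_memory_growth(gpu, True)
        except RuntimeError as err:
            # Raised when the GPU was already initialized earlier in the session.
            print(f"Could not set memory growth for {gpu.name}: {err}")
    return len(gpus)


found = enable_memory_growth()
print("GPUs configured:", found)
```

With this in place, the monitor should show usage closer to what the model actually needs, although TensorFlow generally does not hand memory back once it has been allocated.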
https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
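As for the estimate in the question: weights are stored once, not once per sample in the batch, so the batch size should not multiply the parameter count. A rough sketch of the arithmetic, assuming float32 weights and ignoring activations, gradients and optimizer state (all of which add to the total):

```python
BYTES_PER_FLOAT32 = 4


def weight_bytes(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory needed to store the weights alone."""
    return num_params * bytes_per_param


params = 200_000_000
batch_size = 128

weights = weight_bytes(params)
# The product written in the question, which multiplies weights by batch size.
question_product = weights * batch_size

print(f"weights only:       {weights / 1e9:.1f} GB")
print(f"question's product: {question_product / 1e9:.1f} GB")
```

So the weights alone come to about 0.8 GB. What does scale with the batch size is the memory for activations, which depends on the layer output sizes rather than on the parameter count.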

Related

How to extend tensorflow's GPU memory from system RAM

I want to train Google's object detection model, Faster R-CNN with ResNet-101, on the MS COCO dataset. I used only 10,000 images for training. My graphics card is a GeForce 930M/PCIe/SSE2 with NVIDIA driver version 384.90. Here is the picture of my GeForce.
I have 8 GB of RAM, but TensorFlow reports only 1.96 GB for the GPU.
How can I extend my GPU's RAM? I want to use the full system memory.
You can train on the CPU to take advantage of the RAM on your machine. However, to run something on the GPU, it has to be loaded onto the GPU first. You can swap memory in and out, because not all intermediate results are needed at every step, but you pay for that with a very long training time, so I would rather advise you to reduce the batch size. Nevertheless, details about this process and its implementation can be found here: https://medium.com/#Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.
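One common way to force CPU-only training without changing the model code is to hide the GPU from the process through the `CUDA_VISIBLE_DEVICES` environment variable. A small sketch, assuming it runs before the framework is imported:

```python
import os

# Hide every CUDA device from this process. This has to be set before the
# deep-learning framework is imported, because devices are enumerated at startup.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# Import TensorFlow (or another CUDA-based framework) only after this point;
# it should then fall back to the CPU and use system RAM.
print("CUDA_VISIBLE_DEVICES =", repr(os.environ["CUDA_VISIBLE_DEVICES"]))
```

Training will then be limited by system RAM rather than by the 2 GB of VRAM, at the cost of speed.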

Why mxnet's GPU version cost more memory than CPU version?

I made a very simple network using mxnet (two fc layers with a dimension of 512).
By switching between ctx = mx.cpu() and ctx = mx.gpu(0), I ran the same code on both the CPU and the GPU.
The memory cost of the GPU version is much bigger than that of the CPU version (I checked this using 'top' rather than 'nvidia-smi').
This seems strange: the GPU version already has memory on the GPU, so why does it need even more host memory?
(line 1 is CPU program / line 2 is GPU program)
It may be due to differences in how much time each process was running.
Looking at your screenshot, CPU process has 5:48.85 while GPU has 9:11.20 - so the GPU training was running almost double the time which could be the reason.
When running on the GPU, you are also loading a bunch of lower-level libraries (CUDA, cuDNN, etc.), which are loaded into your RAM first. If your network is very small, as in your current case, the overhead of loading these libraries into RAM will be higher than the cost of storing the network weights in RAM.
For any more sizable network running on the CPU, the amount of memory used by the weights will be significantly larger than that used by the loaded libraries.
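The library overhead can be checked from inside the process itself. The sketch below measures how much the peak resident memory grows when a module is imported. It assumes a Unix system (the `resource` module is not available on Windows), and `decimal` is only a stand-in module chosen so that the example runs anywhere; the helper names are made up for illustration.

```python
import importlib
import resource


def peak_rss():
    """Peak resident set size of this process (KiB on Linux, bytes on macOS)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def rss_growth_from_import(module_name):
    """Import a module and return how much the peak RSS grew as a result."""
    before = peak_rss()
    importlib.import_module(module_name)
    return peak_rss() - before


# "decimal" is a placeholder; substitute the framework you want to measure.
growth = rss_growth_from_import("decimal")
print("peak RSS growth:", growth)
```

For a GPU framework, the CUDA libraries may only be loaded once a GPU context is actually used, so measuring again after the first GPU operation is likely to be more telling than measuring right after the import.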

Does nvidia gpu work less efficiently when it is the only gpu in PC?

I want to assemble a new computer mainly for CUDA applications. When it comes to CPU I have to choose between AMD and Intel.
Most of AMD's processors don't have an integrated GPU, while Intel's processors do.
My question is:
If the nvidia gpu would be the only graphic processing unit in the whole PC (without integrated one),
would its efficiency for CUDA programs be worse as it has to produce some graphics on a desktop (while using for example Matlab)?
The answer is yes: efficiency would be slightly lower because the GPU also handles display tasks, like moving the cursor around or scrolling a document in a PDF viewer.
However, if you are aiming for a reasonably mid-to-high-end GPU, the loss of efficiency is marginal. If you have enough money, you can buy a second, dedicated GPU; if not, just don't bother. The loss might be 1% or less.
A bigger problem is the display's use of GPU memory: (a) the memory it takes becomes unavailable to CUDA applications, and (b) the CUDA manual states that the display driver is allowed to disown the CUDA application from its memory at any time without warning (!).
If you ask me whether that really happens (the display driver taking over the CUDA app's memory), then yes, I have experienced it, the prime example being when you change the resolution of your display.
So definitely don't do any banking with GPUs, or you might see your accounts being randomly infused with millions :-)
That's why 'professional' CUDA cards (the Tesla variety) have no display outputs, just in case.
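To see how much GPU memory the desktop is already holding before a CUDA application starts, the driver's `nvidia-smi` tool can be queried. A small sketch that wraps it from Python, assuming `nvidia-smi` is on the PATH (the helper name `gpu_memory_mib` is made up for illustration):

```python
import shutil
import subprocess


def gpu_memory_mib():
    """Return [(used_mib, total_mib), ...] per GPU, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None

    rows = []
    for line in result.stdout.strip().splitlines():
        try:
            used, total = (int(field.strip()) for field in line.split(","))
        except ValueError:
            # Some GPUs report a non-numeric placeholder; skip those rows.
            continue
        rows.append((used, total))
    return rows


usage = gpu_memory_mib()
print("GPU memory (used, total) in MiB:", usage)
```

Running this on an idle desktop gives a rough idea of how much memory the display is taking away from CUDA applications.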

GPU Memory Allocation under CUDA 8 and Pascal Architecture

Pascal Architecture has brought an amazing feature for CUDA developers by upgrading the unified memory behavior, allowing them to allocate GPU memory way bigger than available on the system.
I am just curious about how this is implemented under the hood. I have tested it out by "cudaMallocManaging" a huge buffer and nvidia-smi isn't showing anything (unless the buffer size is under the available GDDR).
First of all I suggest you do proper CUDA error checking on all CUDA API calls. It would seem from your description that you are not.
Demand paging in unified memory (UM), which allows allocations larger than the GPU's physical DRAM, only works with:
Pascal (or future) GPUs
CUDA 8 (or future) toolkit
Other than that, your setup should probably work. If it's not working for you with CUDA 8 (not CUDA 8RC) and a Pascal GPU, make sure that you meet the requirements (e.g. OS) for UM and also do proper error checking. Rather than trying to infer what is happening from nvidia-smi, run an actual test on the allocated memory.
For a more general description of the feature I refer you to this blog article.

Simple Compute-Intensive CUDA Program

I'm preparing an acceptance test for a new machine with Nvidia graphics cards and I'd like a simple CUDA program that will fully exercise the GPU for a full day. The intent is to generate large amounts of heat and ensure the new machine is stable under the load. I'd like the code to be very easy to compile and run (no dependencies, no large input data sets), and also very easy to verify (small amounts of output). Also, I'd like it to be command-line only, no GUI (the test will have to be automated).
I was originally thinking of repeatedly running Vector Dot Products of large vectors. However, that's mostly memory-intensive. So if the GPUs are constantly waiting on memory accesses, then they probably aren't generating as much heat as they could.
I'm running on a CentOS Linux machine.
Does anyone have any suggestions?
You didn't mention which OS you are on.
Ideally, you would want to stress the floating point units, the logic/integer units, the GPU memory, the GPU voltage regulators (VRMs) and the main PSU. I don't think there is any single utility out there that does that.
Memory:
http://sourceforge.net/projects/cudagpumemtest/
Integer (?):
http://sourceforge.net/projects/cudalucas/
PSU and VRMs (In the past, this program could cause GPUs to run out-of-spec, breaking the card. I don't think that's the case anymore):
http://www.ozone3d.net/benchmarks/fur/