How to extend TensorFlow's GPU memory from system RAM

I want to train Google's object detection (Faster R-CNN with ResNet-101) on the MSCOCO dataset, using only 10,000 images for training. My graphics card is a GeForce 930M/PCIe/SSE2, NVIDIA driver version 384.90.
I have 8 GB of system RAM, but TensorFlow reports only 1.96 GB of GPU memory.
How can I extend my GPU's memory? I want to use the full system memory.

You can train on the CPU to take advantage of the RAM on your machine. However, to run something on the GPU it has to be loaded onto the GPU first. You can swap memory in and out, because not all results are needed at every step, but you pay for it with a much longer training time, and I would rather advise you to reduce the batch size. Nevertheless, details about this process and an implementation can be found here: https://medium.com/@Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.
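If you do want to try the CPU route, a minimal sketch (assuming a TF 1.x setup, which matches driver 384.90) is to hide the GPU so the graph runs entirely in system RAM:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # hide the GPU so TensorFlow falls back to the CPU

import tensorflow as tf

with tf.device("/cpu:0"):
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    c = tf.matmul(a, b)

with tf.Session() as sess:
    print(sess.run(c).shape)   # this computation lives in system RAM, not VRAM

For the object detection API specifically, the more practical fix is usually to shrink the model's memory footprint via the pipeline .config file (e.g. a smaller image_resizer resolution) so that it fits into the roughly 2 GB of the 930M.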

Related

TensorFlow strange memory usage

I'm on an Ubuntu 19.10 machine (with KDE desktop environment) with 8GB of RAM, an i5 8250u and an MX130 gpu (2GB VRAM), running a Jupyter Notebook with tensorflow-gpu.
I was just training some models to test their memory usage, and I can't make any sense of what I'm looking at. I used KSysGuard and NVIDIA System Monitor (https://github.com/congard/nvidia-system-monitor) to monitor my system during training.
As soon as I hit "train", NVIDIA System Monitor shows memory usage at 100% (or close to it, like 95-97%), while GPU utilization looks fine.
Still in NVIDIA System Monitor, the process list shows "python" occupying only around 60 MB of VRAM.
In KSysGuard, python's memory usage is always around 700 MB.
There might be some explanation for that; the problem is that the GPU's memory usage hits 90% with a model of literally 2 neurons (densely connected of course xD), just like a model with 200 million parameters does. I'm using a batch size of 128.
I thought about that mess, and if I'm not wrong, a model with 200 million parameters should occupy 200,000,000 * 4 bytes * 128, which would be around 100 GB.
That means I'm definitely wrong about something, but I couldn't keep this riddle to myself, so I decided to give you the chance to solve it ;D
PS: English is not my main language.
TensorFlow by default allocates all available VRAM on the target GPU, which is why a two-neuron model and a 200-million-parameter model both show near-100% memory usage. There is an experimental feature called memory growth that lets you control this: it stops the initialization process from allocating all VRAM up front and instead allocates it as it is needed.
https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
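For reference, a minimal sketch of turning it on (the TF 2.x API from the page linked above); it has to run before the GPU is first used, e.g. in the first notebook cell:

import tensorflow as tf

# Allocate VRAM on demand instead of grabbing (almost) all of it up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

With this enabled, nvidia-smi and similar monitors should report an allocation close to what the model actually needs, rather than ~100% of VRAM regardless of model size.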

Why is learning on the GPU slower than on the CPU

I have:
GPU : GeForce RTX 2070 8GB.
CPU : AMD Ryzen 7 1700 Eight-Core Processor.
RAM : 32GB.
Driver Version: 418.43.
CUDA Version: 10.1.
On my project the GPU is also slower than the CPU, but here I will use the documentation example.
from catboost import CatBoostClassifier
import time

start_time = time.time()
train_data = [[0, 3],
              [4, 1],
              [8, 1],
              [9, 1]]
train_labels = [0, 0, 1, 1]
# task_type is either "CPU" or "GPU"; run once with each value to compare
model = CatBoostClassifier(iterations=1000, task_type="GPU")
model.fit(train_data, train_labels, verbose=False)
print(time.time() - start_time)
Training time on gpu: 4.838074445724487
Training time on cpu: 0.43390488624572754
Why is the training time on gpu more than on cpu?
Be careful: I have no experience with CatBoost, so the following is from a CUDA point of view.
Data transfer. Launching a kernel (a function called by the host, i.e. the CPU, and executed by the device, i.e. the GPU) requires the data to be transferred from host to device first, and the transfer time grows with the data size. By default, host memory is non-pinned (pageable); see https://www.cs.virginia.edu/~mwb7w/cuda_support/pinned_tradeoff.html for the trade-off against pinned memory.
Kernel launch overhead. Each time the host calls a kernel, it enqueues the kernel into the device's work queue, i.e. for each iteration the host instantiates a kernel and adds it to the queue. Before the introduction of CUDA Graphs (whose documentation also points out that launch overhead can be significant when a kernel has a short execution time), the overhead of each kernel launch could not be avoided. I don't know how CatBoost handles iterations, but given the difference between the execution times, it does not seem to have resolved the launch overhead (IMHO).
CatBoost uses different techniques for small datasets (rows < 50k or columns < 10) that reduce overfitting but take more time. Try training with a much larger dataset, for instance the Epsilon dataset; see https://github.com/catboost/catboost/issues/505.
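As a rough way to check this yourself, here is a minimal sketch (not from the original answers) that repeats the comparison on a larger synthetic dataset, where the per-iteration GPU work is big enough to amortize the transfer and launch overhead; the dataset shape and iteration count are arbitrary choices:

import time
import numpy as np
from catboost import CatBoostClassifier

# Synthetic data: 200k rows, 50 numeric features (arbitrary sizes)
X = np.random.rand(200000, 50)
y = (X[:, 0] > 0.5).astype(int)

for task_type in ("CPU", "GPU"):
    model = CatBoostClassifier(iterations=200, task_type=task_type)
    start_time = time.time()
    model.fit(X, y, verbose=False)
    print(task_type, "training time:", time.time() - start_time)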

Why mxnet's GPU version cost more memory than CPU version?

I made a very simple network using MXNet (two FC layers with a dimension of 512).
By changing ctx = mx.cpu() to ctx = mx.gpu(0), I run the same code on both CPU and GPU.
The host memory cost of the GPU version is much bigger than that of the CPU version (I checked this with 'top', not 'nvidia-smi').
It seems strange: the GPU version already keeps its data in GPU memory, so why does it still need more host memory?
(In the screenshot, line 1 is the CPU program and line 2 is the GPU program.)
It may be due to differences in how long each process had been running.
Looking at your screenshot, the CPU process shows 5:48.85 while the GPU process shows 9:11.20, so the GPU training had been running for almost twice as long, which could be the reason.
When running on the GPU you load a bunch of lower-level libraries into memory (CUDA, cuDNN, etc.), and these are allocated in your RAM first. If your network is very small, as in your current case, the overhead of loading these libraries into RAM is higher than the cost of storing the network weights in RAM.
For any more sizable network, when running on CPU the amount of memory used by the weights will be significantly larger than the libraries loaded in memory.
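For completeness, a minimal sketch of the kind of network described in the question (the exact architecture is an assumption on my part); switching ctx between mx.cpu() and mx.gpu(0) and watching 'top' should show the extra host RAM taken by the CUDA/cuDNN libraries in the GPU run:

import mxnet as mx
from mxnet.gluon import nn

ctx = mx.cpu()   # switch to mx.gpu(0) and compare host memory in 'top'

# Two fully connected layers of width 512, as in the question
net = nn.Sequential()
net.add(nn.Dense(512, activation='relu'),
        nn.Dense(512))
net.initialize(ctx=ctx)

x = mx.nd.random.uniform(shape=(128, 512), ctx=ctx)
out = net(x)
out.wait_to_read()   # force execution so memory is actually allocated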

High cpu utilization while using 3D convolution on gpu in theano 0.9

My cpu utilization is 100% when using theano.tensor.nnet.conv3d
I am adding conv3d_fft, convgrad3d_fft and convtransp3d_fft to my compiling mode in the theano function.
The interesting part is that my GPU utilization is also high, and my data is all in GPU memory.
Any ideas about why the CPU utilization is so high?
Thanks!
Update: I tried the same convolution in Keras with the Theano backend. It is shockingly faster than my plain Theano version, despite the fact that it also uses conv3d.
To invoke a task on the GPU, the CPU has to do a non-negligible amount of work. If the tasks on the GPU are small, the CPU-side work becomes comparatively large.
By the way, memory transfers are done by the DMA engine and do not cause CPU load beyond invoking the transfer.
I once had a CUDA program where the CPU invoked a GPU task every ~100 µs. Profiling the program, I found that it was CPU-bound, even though the CPU did nothing except launch GPU tasks.
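To see the launch overhead in isolation, here is a minimal sketch using Numba's CUDA bindings (not Theano; just an illustration of the effect): thousands of tiny kernel launches keep one CPU core busy enqueueing work even though the GPU does almost nothing per launch.

import time
import numpy as np
from numba import cuda

@cuda.jit
def tiny_kernel(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

x = cuda.to_device(np.zeros(32, dtype=np.float32))

start = time.time()
for _ in range(10000):        # many tiny launches: CPU-bound, GPU mostly idle
    tiny_kernel[1, 32](x)
cuda.synchronize()
print("10000 tiny launches took", time.time() - start, "seconds")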

How are multiple gpus utilized in Caffe?

I want to know how Caffe utilizes multiple GPUs so that I can decide whether to upgrade to a new, more powerful card or just buy the same card again and run in SLI.
For example, am I better off buying one Titan X 12 GB, or two GTX 1080 8 GB cards?
If I SLI the 1080s, will my effective memory be doubled? I mean, can I run a network that needs 12 GB or more of VRAM using them, or am I left with only 8 GB?
Again, how is memory utilized in such scenarios?
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)
For example, am I better off buying one Titan X 12 GB, or two GTX 1080 8 GB cards? If I SLI the 1080s, will my effective memory be doubled? I mean, can I run a network that needs 12 GB or more of VRAM using them, or am I left with only 8 GB?
No, the effective memory size with two GPUs of 8 GB each will still be 8 GB, but the effective batch size will be doubled, which leads to more stable/faster training.
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970.)
I think you will be limited by the weaker card and may have problems with drivers, so I don't recommend trying this configuration.
Also, from the documentation:
Current implementation has a "soft" assumption that the devices being used are homogeneous. In practice, any devices of the same general class should work together, but performance and total size is limited by the smallest device being used. e.g. if you combine a TitanX and a GTX980, performance will be limited by the 980. Mixing vastly different levels of boards, e.g. Kepler and Fermi, is not supported.
Summing up: with a GPU that has lots of RAM you can train deeper models; with multiple GPUs you can train a single model faster, or train separate models, one per GPU. I would choose the single GPU with more memory (Titan X), because deep networks nowadays are mostly RAM-bound (e.g. ResNet-152 or semantic segmentation networks), and more memory gives you the opportunity to run deeper networks with a larger batch size. Otherwise, if your task fits on a single GPU (GTX 1080), you can buy 2 or 4 of them just to make things faster.
Also, here is some info about multi-GPU support in Caffe:
The current implementation uses a tree reduction strategy. e.g. if there are 4 GPUs in the system, 0:1, 2:3 will exchange gradients, then 0:2 (top of the tree) will exchange gradients, 0 will calculate the updated model, 0->2, and then 0->1, 2->3.
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
I don't believe Caffe supports SLI mode. The two GPUs are treated as separate cards.
When you run Caffe and add the '-gpu' flag (assuming you are using the command line), you can specify which GPU to use (-gpu 0 or -gpu 1, for example). You can also specify multiple GPUs (-gpu 0,1,3), including using all GPUs (-gpu all).
When you execute using multiple GPUs, Caffe will run the training across all of the GPUs and then merge the training updates across the models. This effectively doubles (or more, if you have more than 2 GPUs) the batch size for each iteration.
In my case, I started with an NVIDIA GTX 970 (4 GB card) and then upgraded to an NVIDIA GTX Titan X (Maxwell version with 12 GB) because my models were too large to fit in the GTX 970. I can run some of the smaller models across both cards (even though they are not the same) as long as the model fully fits into the 4 GB of the smaller card. Using the standard ImageNet model, I could execute across both cards and cut my training time in half.
If I recall correctly, other frameworks (TensorFlow and maybe Microsoft's CNTK) support splitting a model among different nodes to effectively increase the available GPU memory, like what you are describing. Although I haven't personally tried either one, I understand you can define on a per-layer basis where the layer executes.
Patrick
Perhaps a late answer, but Caffe supports GPU parallelism, which means you can indeed fully utilize both GPUs. However, I recommend getting two GPUs of equal memory size, since I don't think Caffe lets you select the batch size per GPU.
As for how memory is utilized: with multiple GPUs, each GPU gets a batch of the size specified in your train_val.prototxt, so if your batch size is, for example, 16 and you're using 2 GPUs, you'd have an effective batch size of 32.
Finally, I know that for things such as gaming, SLI tends to be much less effective and often much more problematic than having a single, more powerful GPU. So if you are planning to use the GPUs for more than deep learning alone, I'd still recommend going for the Titan X.