Too much time to train one epoch - deep-learning

I use a workstation with an RTX 3060 12GB GPU, 16GB of DDR4 RAM and an Intel Core i5 10400F CPU. I also mounted an external HDD and ran the script p2ch11.prepcache from the repository referenced below in order to cache… I tried from zero to 8 workers and batch sizes ranging from 32 to 1024. It still takes approximately 13.5 hours to train for one epoch (with batch size 1024 and 4 workers). I still haven't figured out what's wrong… It looks like I cannot utilize the GPU for some reason…
Code pulled from the repository: https://github.com/deep-learning-with-pytorch/dlwpt-code
-> p2ch11.training.py (https://github.com/deep-learning-with-pytorch/dlwpt-code/blob/master/p2ch11/training.py)

The images are large, so you need to do some preprocessing first. I think this will help.
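Before changing the preprocessing, it can help to confirm that the time really goes into loading and preparing data rather than into the training step. The following is a minimal sketch, not code from the repository: `split_step_time`, `batches` and `train_step` are hypothetical names, and it assumes the training loop can be reduced to "fetch a batch, then run one step".

```python
import time


def split_step_time(batches, train_step):
    """Return (seconds spent waiting for batches, seconds spent in train_step)."""
    load_s = 0.0
    step_s = 0.0
    iterator = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data loading/preprocessing
        except StopIteration:
            break
        t1 = time.perf_counter()
        # Note: GPU work is usually launched asynchronously, so for a fair
        # number the step should wait for the device to finish
        # (in PyTorch, for example, by calling torch.cuda.synchronize()).
        train_step(batch)
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s
```

If the first number dominates, the GPU is mostly waiting for data, which would be consistent with the advice above to reduce the per-image work.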

Related

TensorFlow strange memory usage

I'm on an Ubuntu 19.10 machine (with KDE desktop environment) with 8GB of RAM, an i5 8250u and an MX130 gpu (2GB VRAM), running a Jupyter Notebook with tensorflow-gpu.
I was just training some models to test their memory usage, and I can't see any sense in what I'm looking at. I used KSysGUARD and NVIDIA System Monitor (https://github.com/congard/nvidia-system-monitor) to monitor my system during training.
As soon as I hit "train", NVIDIA S.M. shows that memory usage is at 100% (or near 100%, like 95-97%); the GPU usage is fine.
Still in NVIDIA S.M., when I look at the process list, "python" occupies only around 60MB of VRAM.
In KSysGUARD, python's memory usage is always around 700MB.
There might be some explanation for that; the problem is that the GPU's memory usage hits 90% with a model with literally 2 neurons (densely connected of course xD), just like a model with 200 million parameters does. I'm using a batch size of 128.
I thought about that mess, and if I'm not wrong, a model with 200 million parameters should occupy 200000000*4*128 bytes, which should be about 102GB.
That means I'm definitely wrong about something, but I'm too selfless to keep that riddle to myself, so I decided to give you the chance to solve it ;D
PS: English is not my main language.
By default, TensorFlow allocates all available VRAM on the target GPU. There is an experimental feature called memory growth that lets you control this: it stops the initialization process from allocating all VRAM and allocates memory only when there is a need for it.
https://www.tensorflow.org/api_docs/python/tf/config/experimental/set_memory_growth
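As a rough illustration of both points, here is a small sketch. The weights of a model are stored once, not once per sample in the batch, so 200 million float32 parameters need about 800MB for one copy (gradients, optimizer state and activations come on top of that, and only activations scale with the batch size). The helper names are hypothetical, and `enable_memory_growth` assumes a TensorFlow version that provides `tf.config.list_physical_devices` (2.1 or later); it has to run before the GPU is first used.

```python
BYTES_PER_FLOAT32 = 4


def weights_bytes(num_params, bytes_per_param=BYTES_PER_FLOAT32):
    """Memory for one copy of the weights; this does not depend on batch size."""
    return num_params * bytes_per_param


def enable_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand; returns the number of GPUs found."""
    # Imported lazily so the arithmetic above can be used without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU has been initialized by TensorFlow.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


# One copy of the weights for the 200-million-parameter model, in MB.
print(weights_bytes(200_000_000) / 1e6)
```

With memory growth enabled, the usage reported by a monitoring tool should track what the model actually needs instead of sitting near 100% regardless of model size.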

How to extend tensorflow's GPU memory from system RAM

I want to train Google's object detection with faster_rcnn with resnet101 using the mscoco dataset code. I used only 10,000 images for training. My graphics card is a GeForce 930M/PCIe/SSE2 with NVIDIA driver version 384.90.
I have 8GB of RAM, but tensorflow-gpu shows only 1.96GB.
Now, how can I extend my GPU's RAM? I want to use the full system memory.
You can train on the CPU to take advantage of the RAM on your machine. However, to run something on the GPU, it has to be loaded onto the GPU first. You can swap memory in and out, because not all the results are needed at every step, but you pay for that with a very long training time, so I would rather advise you to reduce the batch size. Nevertheless, details about this process and its implementation can be found here: https://medium.com/@Synced/how-to-train-a-very-large-and-deep-model-on-one-gpu-7b7edfe2d072.

Running text classification - CNN on GPU

Based on this GitHub repository, https://github.com/dennybritz/cnn-text-classification-tf , I want to classify my datasets on Ubuntu 16.04 on a GPU.
To run on the GPU, I changed line 23 of text_cnn.py to this: with tf.device('/gpu:0'), tf.name_scope("embedding"):
My first training dataset has 9000 documents and its size is about 120M, and
the second one has 1300 documents and its size is about 1M.
After running on my Titan X server with the GPU, I got errors.
Please guide me: how can I solve this issue?
Thanks.
You are getting an Out of Memory error, so the first thing to try is a smaller batch size
(the default is 64). I would start with:
./train.py --batch_size 32
Most of the memory is used to hold the embedding parameters and the convolution parameters. I would suggest reducing:
EMBEDDING_DIM
NUM_FILTERS
BATCH_SIZE
Try embedding_dim=16, batch_size=16 and num_filters=32; if that works, increase them 2x at a time.
Also, if you are using a Docker virtual machine to run TensorFlow, you might be limited to only 1G of memory by default, even though you have 16G of memory in your machine. See here for more details.
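The "start small, then double" procedure suggested above can be written down as a short sketch. The helper names are hypothetical, and the `--embedding_dim` and `--num_filters` flags are an assumption modelled on the `--batch_size` flag shown in the answer; check the flag names that train.py actually defines before using the generated commands.

```python
def doubling_schedule(start, rounds):
    """Yield `rounds` settings dicts, doubling every value after each round."""
    settings = dict(start)
    for _ in range(rounds):
        yield dict(settings)
        settings = {name: value * 2 for name, value in settings.items()}


def as_command(settings, script="./train.py"):
    """Format one settings dict as a command line (flag names are assumed)."""
    flags = " ".join(f"--{name} {value}" for name, value in settings.items())
    return f"{script} {flags}"


start = {"embedding_dim": 16, "batch_size": 16, "num_filters": 32}
for settings in doubling_schedule(start, rounds=3):
    print(as_command(settings))
```

Run the commands in order and stop at the last one that trains without an Out of Memory error.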

How are multiple gpus utilized in Caffe?

I want to know how Caffe utilizes multiple GPUs so that I can decide whether to upgrade to a new, more powerful card or just buy the same card and run them in SLI.
For example, am I better off buying one TitanX 12 GB, or two GTX 1080 8 GB?
If I run the 1080s in SLI, will my effective memory get doubled? I mean, can I run a network which takes 12 or more GB of VRAM using them? Or am I left with only 8 GB?
Again, how is memory utilized in such scenarios?
What would happen if two different cards are installed (both NVIDIA)? Does Caffe utilize the available memory the same way? (Suppose one 980 and one 970!)
For example, am I better off buying one TitanX 12 GB, or two GTX 1080
8 GB? If I run the 1080s in SLI, will my effective memory get doubled? I
mean, can I run a network which takes 12 or more GB of VRAM using them?
Or am I left with only 8 GB?
No. With two GPUs of 8 GB each, the effective memory size is still 8 GB, but the effective batch size is doubled, which leads to more stable and faster training.
What would happen if two different cards are installed (both NVIDIA) ?
Does caffe utilize the memory available the same? (suppose one 980 and
one 970!)
I think you will be limited by the lower-end card and may have problems with drivers, so I don't recommend trying this configuration.
Also, from the documentation:
Current implementation has a "soft" assumption that the devices being
used are homogeneous. In practice, any devices of the same general
class should work together, but performance and total size is limited
by the smallest device being used. e.g. if you combine a TitanX and a
GTX980, performance will be limited by the 980. Mixing vastly
different levels of boards, e.g. Kepler and Fermi, is not supported.
Summing up: with a GPU that has lots of RAM you can train deeper models; with multiple GPUs you can train a single model faster, and you can also train separate models per GPU. I would choose a single GPU with more memory (TitanX), because deep networks nowadays are more memory-bound (e.g. ResNet-152 or some semantic segmentation networks), and more memory gives you the opportunity to run deeper networks with a larger batch size. Otherwise, if your task fits on a single GPU (GTX 1080), you can buy 2 or 4 of them just to make things faster.
Also here is some info about multi GPU support in Caffe:
The current implementation uses a tree reduction strategy. e.g. if
there are 4 GPUs in the system, 0:1, 2:3 will exchange gradients, then
0:2 (top of the tree) will exchange gradients, 0 will calculate
updated model, 0->2, and then 0->1, 2->3.
https://github.com/BVLC/caffe/blob/master/docs/multigpu.md
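The tree reduction quoted above can be simulated in a few lines. This is only a sketch of the communication pattern, not Caffe's implementation: each "GPU" holds a single number standing in for its gradients, the broadcast step copies the summed value instead of an updated model, and the function name and the power-of-two restriction are assumptions made to keep the example short.

```python
def tree_reduce_and_broadcast(grads):
    """Sum per-device values up a binary tree, then copy the total back down."""
    n = len(grads)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of devices")
    values = list(grads)
    log = []

    # Reduce: pairs (0,1), (2,3), ... first, then (0,2), ... up to device 0.
    stride = 1
    while stride < n:
        for dst in range(0, n, 2 * stride):
            src = dst + stride
            values[dst] += values[src]
            log.append(f"reduce {src}->{dst}")
        stride *= 2

    # Broadcast: device 0 sends down the tree in the reverse order.
    stride = n // 2
    while stride >= 1:
        for src in range(0, n, 2 * stride):
            dst = src + stride
            values[dst] = values[src]
            log.append(f"broadcast {src}->{dst}")
        stride //= 2
    return values, log


values, log = tree_reduce_and_broadcast([1.0, 2.0, 3.0, 4.0])
print(values)
print(log)
```

For four devices the log ends with 0->2, then 0->1 and 2->3, which is the order described in the documentation excerpt.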
I don't believe Caffe supports SLI mode. The two GPUs are treated as
separate cards.
When you run Caffe and add the '-gpu' flag (assuming you are using the
command line), you can specify which GPU to use (-gpu 0 or -gpu 1 for
example). You can also specify multiple GPUs (-gpu 0,1,3) including
using all GPUs (-gpu all).
When you execute using multiple GPUs, Caffe will execute the training
across all of the GPUs and then merge the training updates across the
models. This is effectively doubling (or more if you have more than 2
GPUs) the batch size for each iteration.
In my case, I started with a NVIDIA GTX 970 (4GB card) and then
upgraded to a NVIDIA GTX Titan X (Maxwell version with 12 GB) because
my models were too large to fit in the GTX 970. I can run some of the
smaller models across both cards (even though they are not the same)
as long as the model will fully fit into the 4GB of the smaller card.
Using the standard ImageNet model, I could execute across both cards
and cut my training time in half.
If I recall correctly, other frameworks (TensorFlow and maybe the
Microsoft CNTK) support splitting a model among different nodes to
effectively increase the available GPU memory like what you are
describing. Although I haven't personally tried either one, I
understand you can define on a per-layer basis where the layer
executes.
Patrick
Perhaps a late answer, but Caffe supports GPU parallelism, which means you can indeed fully utilize both GPUs. I do recommend getting two GPUs of equal memory size, though, since I don't think Caffe lets you select the batch size per GPU.
As for how memory is utilized: when using multiple GPUs, each GPU gets a batch of the size specified in your train_val.prototxt, so if your batch size is, for example, 16 and you're using 2 GPUs, you'd have an effective batch size of 32.
Finally, I know that for things such as gaming, SLI seems to be much less effective and often much more problematic than having a single, more powerful GPU. So if you are planning on using the GPUs for more than only Deep Learning, I'd recommend you still go for the Titan X.
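The two rules that come up repeatedly in the answers above (batch sizes add up across GPUs, memory does not) can be summarized in a small sketch. The function name and the returned keys are hypothetical; the "limited by the smallest device" rule is taken from the documentation excerpt quoted earlier.

```python
def multi_gpu_summary(per_gpu_batch, gpu_memory_gb):
    """Summarize data-parallel training over the GPUs listed in gpu_memory_gb."""
    if not gpu_memory_gb:
        raise ValueError("at least one GPU is required")
    return {
        # Every GPU processes its own batch, so batches add up.
        "effective_batch_size": per_gpu_batch * len(gpu_memory_gb),
        # Every GPU holds a full copy of the model, so memory is not pooled
        # and the smallest card sets the limit.
        "memory_limit_per_model_gb": min(gpu_memory_gb),
    }


print(multi_gpu_summary(16, [8, 8]))
print(multi_gpu_summary(16, [12, 4]))
```

For two 8 GB cards and a batch size of 16 in train_val.prototxt, this gives an effective batch size of 32 and an 8 GB limit for the model.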

EthereumJ EthereumFactory.createEthereum() taking long hours

We are very new to Ethereum. We are trying to create a sample application as a POC using EthereumJ.
Ethereum ethereum = EthereumFactory.createEthereum();
The line above has been executing for a long time; it has been almost 6 hours now.
Is this normal? Are we missing something? How can we minimise it? (We are OK with using a small test network.)
Any help will be greatly appreciated.
Any blockchain-related processing is inherently compute- and resource-hungry.
To start with, it would help if we understood your operating environment and specifications,
e.g. hardware (CPU, RAM, etc.), software (OS, runtime, etc.) and versions (OS, runtime, EthereumJ, etc.).
Model Name: MacBook Pro
Model Identifier: MacBookPro10,2
Processor Name: Intel Core i5
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 3 MB
Memory: 8 GB