I am new to using training neural networks. I have access to GPU cluster and I am fine-tuning a version of Alex-Net for scene classification.
I have access to two GPUs right now and I want to use both of them for training. nvidia-smi command gives me the id of the GPUs (which are 0 and 1).
This is how I do the training to use both of the GPUs:
caffe.set_mode_gpu()
caffe.set_device([0,1])
Is this the right way to use it?
Python allows you to choose a single GPU using set_device(). Multi-GPU is only supported on the C++ interface. The --gpu flag used for this purpose is discussed here. The GPUs to be used for training can be set with the --gpu flag on the command line to the Caffe tool. For example,
build/tools/caffe train --solver=models/bvlc_alexnet/solver.prototxt --gpu=0,1
will train on GPUs 0 and 1.
In the terminal you can type like that:
python yourpythonfile.py CUDA_VISIBLE_DEVICES=0, 1
Related
I am using google colab, and I need to know if it uses any GPU for training if I don't do model.to('cuda') and data.to('cuda')?
If you do not use model.to(torch.device('cuda')) and data.to(torch.device('cuda')) your model and all of your tensors will be remaining on default device which is CPU, so they do not understand the existence of GPU. PyTorch uses CPU for its work.
You can see this Link for more information about torch.device.
Question:
How to improve model latency for web deployment without retraining the models? What is the checklist that I should mark to improve the model speed?
Context:
I have multiple models that process a video sequentially on one machine with one K80 GPU; each model takes around 5 mins to process a video that is 1 min long. What ideas and suggestions should I try to improve each model latency without changing the model architecture? How should I structure my thinking about this problem?
Sampling frames is the easiest technique if it fits your usecase. Picking every 5th frame for inference will cut your inference time by ~5x(theoretically). Caveat is if you are working on tasks like object tracking you will have reduced accuracy.
fp32 to fp16 might increase your inference speed.
Batch Inference always lowers inference time by a decent bit. Ref: https://github.com/ultralytics/yolov5/issues/1806#issuecomment-752834571
Multiprocess Concurrent Inference is basically spinning up more than 1 instances of the same model on seperate processes and infer parallely. torch has a multiprocessing module torch.multiprocessing. I havent ever used this but i assume the setup would be somewhat significant and complex.
Nvidia Tesla K80 is quite an old GPU (2014), so that's probably the reason why the processing time is so long. If your machine has a modern Intel CPU and/or iGPU you could try OpenVINO. It's a heavily optimized toolkit for inference. Here are some performance benchmarks.
You can find a full tutorial on how to convert the PyTorch model here.
Some snippets below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[pytorch,onnx]
Save your model to ONNX
OpenVINO cannot convert PyTorch model directly for now but it can do it with ONNX model. This sample code assumes the model is for computer vision.
dummy_input = torch.randn(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=11)
Use Model Optimizer to convert ONNX model
The Model Optimizer is a command line tool which comes from OpenVINO Development Package so be sure you have installed it. It converts the ONNX model to IR, which is a default format for OpenVINO. It also changes the precision to FP16 for even better performance (there shouldn't be much accuracy drop). Run in command line:
mo --input_model "model.onnx" --input_shape "[1,3, 224, 224]" --mean_values="[123.675, 116.28 , 103.53]" --scale_values="[58.395, 57.12 , 57.375]" --data_type FP16 --output_dir "model_ir"
Run the inference on the CPU
The converted model can be loaded by the runtime and compiled for a specific device e.g. CPU. Use AUTO if you want to deploy on the best device you have.
# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO")
# Get output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
Disclaimer: I work on OpenVINO.
Please forgive what may be an insane question.
My setup: two machines, one with a GTX1080, one with an RX Vega 64. Retraining/fine tuning a bvlc_googlenet model on the GTX1080.
If I build Caffe for the Vega 64, then can I take a snapshot from the GTX1080 machine and restart training on the Vega 64? Would this work in the sense that the training would continue in a normal manner?
What if I moved the GTX1080 snapshot to a Volta V100 in AWS? Would this work?
I know Caffe will to some degree abstract the hardware, but I don't know how well it can do that. I need the GTX 1080 for something else...
Thanks in advance!
To my knowledge, this should work without a problem. Weight files and training snapshots are just blobs of numbers, that you should be able to resume on other hardware (e.g. CPU/GPU), a different machine with a different operating system, or between 32 and 64 bit processes.
I wish to run Caffe on a 32 core machine.
Does caffe scale up to available number of cores to utilize them the best?
Although there are 32 cores, can I make caffe use only a selected number of cores?
Generally caffe doesn't support multiple CPUs/cores in its source code, but it uses BLAS routines.
Thus answers to your questions are the following:
Yes, but only through BLAS configuration, i. e. your BLAS version should be compiled with multithreading support (see related discussions: here or here - at the second link you can also find some modifications for caffe itself).
Also through BLAS (if it was compiled with openmp support, you can define OMP_NUM_THREADS to desired value).
caffe does not, but you can you Intel caffe which is optimized for CPU and supports multi node
https://github.com/intel/caffe/wiki/Multinode-guide
I managed to successfully run CUDA programs on a GeForce GTX 750 Ti while using a AMD Radeon HD 7900 as the rendering device (actually connected to the display) using this guide; for instance, the Vector Addition sample runs nicely. However, I can only run applications that do not produce visual output. For example, the Mandelbrot CUDA sample does not run and fails with an error:
Error: failed to get minimal extensions for demo:
Missing support for: GL_ARB_pixel_buffer_object
This sample requires:
OpenGL version 1.5
GL_ARB_vertex_buffer_object
GL_ARB_pixel_buffer_object
The error originates from asking glewIsSupported() for these extensions. Is there any way to run an application, like these CUDA samples, so that the CUDA operations are run on the GTX as usual but the Window is drawn on the Radeon card? I tried to convince Nsight Eclipse to run a remote debugging session, with my own PC as the remote host, but something else failed right away. Is this supposed to actually work? Could it be possible to use VirtualGL?
Some of the NVIDIA CUDA samples that involve graphics, such as the Mandelbrot sample, implement an efficient rendering strategy: they bind OpenGL data structures - Pixel Vertex Objects in the case of Mandelbrot - to the CUDA arrays containing the simulation data and render them directly from the GPU. This avoids copying the data from the device to the host at end of each iteration of the simulation, and results in a lightning fast rendering phase.
To answer your question: NVIDIA samples as they are need to run the rendering phase on the same GPU where the simulation phase is executed, otherwise, the GPU that handles the graphics would not have the data to be rendered in its memory.
This does not exclude that the samples can be modified to work with multiple GPUs. It should be possible to copy the simulation data back to the host at end of each iteration, and then render it using a custom method or even send it over the network. This would require to (1) modify the code, by separating and making independent simulation and rendering phases, and (2) accept the big loss in frame per second that would result from this.