Can I train a deep convolutional network without GPUs? - deep-learning

I am thinking of building a convolutional neural network as a tracking system application. I get the feeling that all deep network applications require GPUs. Is it necessary to use GPUs for a task like mine? What are the minimum PC requirements my laptop should have?

It all depends on the size and depth of your CNN. If your CNN has one convolutional layer and one fully connected layer, and the input images are 64x64, you will be able to train your network on your laptop in a reasonable time. If you use GoogLeNet with hundreds of layers and train on the entire ImageNet set, then even with a video card it will take you a week, so on a CPU it will never finish training.
For most practical applications, however, it is desirable to have a GPU to train a convolutional network. Note that on AWS you can get GPU-enabled instances for a rather reasonable price, especially if you use spot instances, so you don't necessarily need a GPU locally.
Last note: most of the frameworks (theano, torch, caffe, mxnet, tensorflow) let you execute the same model on CPU and on GPU with minor or no modifications to the code, so you can prototype locally on the CPU with a small set of images, and then, once your model works, train it on AWS on a GPU instance.
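For example, here is a minimal PyTorch sketch of device-agnostic code (the model and sizes are illustrative, not from the question) that runs unchanged on a laptop CPU or a GPU instance:

```python
import torch
import torch.nn as nn

# Pick the GPU if one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A small CNN for 64x64 RGB images and 10 classes (numbers are illustrative).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                 # 64x64 -> 32x32
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).to(device)

x = torch.randn(8, 3, 64, 64, device=device)   # dummy batch as a smoke test
print(model(x).shape)                           # identical code on laptop CPU or AWS GPU
```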

Related

How do I train a deep learning model across multiple virtual machines using PyTorch?

I'm trying to train a model by purchasing multiple virtual machines on a cloud service provider (3-4 instances, probably on AWS). I would like to load the model onto each VM, run the training process, then update the models on each VM. The problem is that once each model has made its forward pass, I don't know how to accumulate the gradients from each model, and if I could, I'm not sure whether I should sum the gradients or average them. I've been using DataParallel on single multi-GPU VMs, so I haven't had to keep track of multiple gradients before this point. I'm unsure if there is a PyTorch package that could help with GPU data parallelism across multiple VMs. I've seen PyTorch Lightning, but it isn't clear which modules to use to communicate between VMs. I'm very new to machine learning, and this is my first training process beyond a single machine. Any advice and tips on this matter would be appreciated, including packages or architectural ideas.
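For reference, a minimal sketch of what multi-VM data parallelism looks like with torch.distributed and DistributedDataParallel, which all-reduces (averages) the gradients for you. The model, sizes, and launch details are placeholders; each VM would run the script via torchrun with its own rank:

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Expects MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE to be set, e.g. by torchrun.
    dist.init_process_group(backend="nccl")        # use "gloo" on CPU-only VMs
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # stand-in for the real model
    model = DDP(model, device_ids=[local_rank])        # gradients are averaged across VMs
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(10):
        x = torch.randn(32, 128).cuda(local_rank)
        y = torch.randint(0, 10, (32,)).cuda(local_rank)
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()    # DDP all-reduces the gradients here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```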

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection that is very different from ImageNet or COCO (the pictures are much darker and represent completely different, industry-related things).
The model currently used is a Faster R-CNN which extracts features with a ResNet backbone.
Could training the backbone of the model from scratch in one stage and then training the whole network in another stage be beneficial for the task, instead of loading the network pretrained on COCO and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your train set is not big enough to train the detector from scratch (though it depends on the network configuration; fasterrcnn+resnet18 can work). It is better to use a network pre-trained on ImageNet (see the sketch after this list);
the domain the network was pre-trained on is not really that important. The network, especially a big one, needs to learn all those arcs, circles, and other primitive shapes in order to use that knowledge for detecting more complex objects;
the brightness of your train images can be important but is not something to stop you from using a pre-trained network;
training from scratch requires many more epochs and much more data. The longer the training, the more sophisticated your LR control algorithm should be. At a minimum, it should not be constant and should change the LR based on the cumulative loss. The initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I played a lot with fasterrcnn+resnet (with various numbers of layers) and other networks. I recommend using maskrcnn instead of fasterrcnn. Just configure it not to use the masks and not to do the segmentation. I don't know why, but it gives much better results;
don't spend your time on mobilenet; with your train set size you will not be able to train it to a reasonable AP and AR. Start with maskrcnn + a resnet18 backbone.
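To illustrate the first point, here is a minimal sketch (following the standard torchvision detection API; num_classes and hyperparameters are placeholders) of starting from the COCO-pretrained Faster R-CNN with a ResNet-50 FPN backbone, replacing only the box predictor, and fine-tuning all layers in one stage:

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 5  # placeholder: your object classes + 1 for background

# Start from the detector pretrained on COCO instead of training from scratch.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Swap the classification head for your own classes; the pretrained backbone
# and RPN weights are kept and fine-tuned together with the new head.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# All parameters stay trainable, with a decaying (non-constant) learning rate.
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
```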

PyTorch: Move Weights Between GPU and CPU on the fly

I have a large architecture which does not fit into GPU memory, but it has a nice property: only a subset of the architecture runs at any given time, for a stretch of time. Therefore, I would like to dynamically move the weights of layers that are not being utilized between the CPU and GPU. How can this be achieved?
The first thing one might try is to call .cpu() or .cuda() on the parameters I wish to move. Unfortunately, that causes training problems with the optimizer, as stated in the docs:
cuda(device=None)
Moves all model parameters and buffers to the GPU.
This also makes associated parameters and buffers different objects. So it should be called before constructing optimizer if the module will live on GPU while being optimized.
One example use case would be implementing ProxylessNAS; however, only final trained models are available at the time of writing, and the architecture search implementation is not available.
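One workaround consistent with that caveat is to move each sub-module together with its own optimizer's state, and to keep the forward, backward, and optimizer step on the GPU before moving the block back to the CPU. A minimal sketch (the expert-style blocks and sizes are hypothetical, not from the question):

```python
import torch
import torch.nn as nn

def move_with_optimizer(module, optimizer, device):
    # Move a sub-module's parameters together with the matching optimizer
    # state (e.g. Adam's exp_avg buffers), so both live on the same device.
    module.to(device)
    for state in optimizer.state.values():
        for k, v in state.items():
            if torch.is_tensor(v):
                state[k] = v.to(device)

# One block per "subset" of the architecture, each with its own optimizer.
blocks = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(8)])
optims = [torch.optim.Adam(b.parameters()) for b in blocks]

def train_step(idx, x, target, loss_fn):
    block, opt = blocks[idx], optims[idx]
    move_with_optimizer(block, opt, "cuda")   # load only the weights we need
    loss = loss_fn(block(x.cuda()), target.cuda())
    opt.zero_grad()
    loss.backward()
    opt.step()
    move_with_optimizer(block, opt, "cpu")    # free GPU memory again
    return loss.item()
```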

Hardware requirements for pre-trained convolutional neural network for image recognition

I know that for training a deep neural network for image recognition a good GPU or GPUs are required since they are more suited for this task than CPUs.
It's all clear and there are lots of various tutorials using various libraries on how to do that.
However, when I have trained my deep neural network, what are the hardware requirements for running it to recognize images in some web application located on a server? Do I still need powerful GPUs on the server for that? Which hardware matters most for running a pre-trained deep neural network: RAM, CPU, or storage?
Can I run a pre-trained network in an Android app for image recognition? Is that a good idea?
Sorry if my questions are too vague and broad, but I couldn't find any proper and detailed comment on this topic.
Of course, the answer depends on a lot of factors, including the size of your model, the number of CNN layers, the type of activation functions, etc.
But once my model (3 convolutional layers and a 256-node fully connected layer) was trained, making a single-image prediction was possible on my regular MacBook Pro (16GB RAM, 2.7 GHz Intel Core i5 processor) with no GPU. The prediction happened almost instantaneously (< 1 sec).
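For completeness, a minimal sketch of CPU-only inference in PyTorch (the architecture, checkpoint path, and image file are placeholders):

```python
import torch
from torchvision import models, transforms
from PIL import Image

device = torch.device("cpu")                 # no GPU needed for inference
model = models.resnet18(num_classes=10)      # placeholder architecture
model.load_state_dict(torch.load("my_model.pt", map_location=device))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

img = preprocess(Image.open("example.jpg")).unsqueeze(0)  # add a batch dimension
with torch.no_grad():                         # no gradients needed at inference time
    probs = model(img).softmax(dim=1)
print(probs.argmax(dim=1))
```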
Hope that answers your question.

DenseNet without convolution?

The recent paper Densely Connected Convolutional Networks https://arxiv.org/abs/1608.06993 has shown that their DenseNet deep learning architecture outperforms state-of-the-art ResNet architectures. Are there similar papers / repositories for similar architectures but without convolution (RNN/just dense)?
No.
The simple answer is that convolution itself provides regularization by exploiting the spatial locality present in most images. This is also the key to building deeper networks, which are crucial for richer representations.
Another critical reason is that a dense layer the size of the input (usually 224x224) would hog most of your GPU memory, so there is little chance today of building a fully dense network for images of this size that is more than a few layers deep. Maybe with 10x the GPU RAM you could try to pull that off... Convolution is simply more economical.
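A rough back-of-the-envelope computation of that memory argument (the layer sizes are illustrative):

```python
# Parameter-count comparison for a 224x224 RGB input (illustrative numbers).
inputs = 224 * 224 * 3                         # 150,528 input values

# A single dense layer mapping the input to a same-sized layer:
dense_weights = inputs * inputs                # ~2.3e10 weights
print(dense_weights * 4 / 1e9, "GB in fp32")   # ~90 GB for one layer

# A single 3x3 convolution with 64 output channels, shared across positions:
conv_weights = 3 * 3 * 3 * 64                  # 1,728 weights
print(conv_weights * 4 / 1e3, "KB in fp32")    # ~7 KB
```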