Backbone network in Object detection - deep-learning

I am trying to understand the training process of an object detection deep learning algorithm, and I am having trouble understanding how the backbone network (the network that performs feature extraction) is trained.
I understand that it is common to use CNNs like AlexNet, VGGNet, and ResNet, but I don't understand whether these networks are pre-trained or not. If they are not pre-trained, what does the training consist of?

We directly use a pre-trained VGGNet or ResNet backbone. Although the backbone is pre-trained for a classification task, its hidden layers learn features that are also useful for object detection. The initial layers learn low-level features such as lines, dots, and curves. The later layers learn high-level features, built on top of the low-level ones, that respond to objects and larger shapes in the image.
The last layers are then modified to output bounding-box coordinates rather than just a class label.
There are also backbones designed specifically for object detection. See these papers:
DetNet: A Backbone network for Object Detection
CBNet: A Novel Composite Backbone Network Architecture for Object Detection
DetNAS: Backbone Search for Object Detection
High-Resolution Network: A universal neural architecture for visual recognition
Lastly, the pre-trained weights are useful only if you apply them to similar images. For example, weights trained on ImageNet will be useless on ultrasound medical image data. In that case we would rather train from scratch.
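The idea of keeping the pre-trained feature extractor and replacing only the last layers can be sketched in plain Python. This is a toy illustration under stated assumptions, not a real detector: `BACKBONE_W`, `backbone`, `head`, `train_box_head` and the data are all hypothetical, and a real setup would load a framework's pre-trained VGGNet or ResNet instead of a 2x3 matrix.

```python
import random

# Hypothetical "pre-trained" backbone weights: 3 inputs -> 2 features.
BACKBONE_W = [[0.5, -0.2, 0.1],
              [0.3, 0.8, -0.5]]

def backbone(x):
    """Frozen feature extractor: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in BACKBONE_W]

def head(features, head_w):
    """New task-specific head: one linear output (think: one box coordinate)."""
    return sum(w * f for w, f in zip(head_w, features))

def mse(data, head_w):
    """Mean squared error of the head's prediction over the dataset."""
    return sum((head(backbone(x), head_w) - y) ** 2 for x, y in data) / len(data)

def train_box_head(data, steps=200, lr=0.1):
    """Gradient descent on the head only; BACKBONE_W is never updated."""
    head_w = [0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0]
        for x, y in data:
            f = backbone(x)
            err = head(f, head_w) - y
            for i in range(2):
                grad[i] += 2 * err * f[i] / len(data)
        head_w = [w - lr * g for w, g in zip(head_w, grad)]
    return head_w

random.seed(0)
inputs = [[random.random() for _ in range(3)] for _ in range(20)]
# Toy regression targets that are a linear function of the backbone features.
data = [(x, 2.0 * backbone(x)[0] - 1.0 * backbone(x)[1]) for x in inputs]

loss_before = mse(data, [0.0, 0.0])
head_w = train_box_head(data)
loss_after = mse(data, head_w)
print(loss_before, loss_after)
```

Only `head_w` changes during training, which mirrors the common practice of freezing the backbone and fitting new output layers. Fine-tuning the backbone as well would mean also updating `BACKBONE_W`, usually with a smaller learning rate.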

Related

Is it possible to do transfer learning on different observation and action space for Actor-Critic?

I have been experimenting with actor-critic algorithms such as SAC and TD3 on continuous control tasks, and I am trying to do transfer learning by using the trained network on another task with a smaller observation and action space.
Would it be possible to do so if I were to save the weights in a dictionary and then load them in the new environment? The Actor-Critic network requires an input state with different dimensions and must also output an action with different dimensions.
I have some experience with fine-tuning transformer models by adding another classifier head and fine-tuning it, but how would I do this with Actor-Critic networks if the initial and final layers do not match those of the learned agent?
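One way to picture the "save the weights in a dictionary and load them" idea when the first and last layers differ is to copy only the layers whose names and shapes match, and leave the rest at their fresh initialisation. The sketch below is a plain-Python illustration, not a tested recipe for SAC or TD3: the layer names, the sizes, and the helper `transfer_matching_layers` are all hypothetical, and whether the reused hidden layers actually help on the new task is not guaranteed.

```python
def shape_of(matrix):
    """Return (rows, cols) of a list-of-lists weight matrix."""
    return (len(matrix), len(matrix[0]))

def filled(rows, cols, value):
    """Build a rows x cols matrix filled with a constant (stand-in for real weights)."""
    return [[value] * cols for _ in range(rows)]

def transfer_matching_layers(source_weights, target_weights):
    """Copy layers whose name and shape match; keep the target's own
    (freshly initialised) values for every other layer."""
    merged, copied, skipped = {}, [], []
    for name, target_w in target_weights.items():
        source_w = source_weights.get(name)
        if source_w is not None and shape_of(source_w) == shape_of(target_w):
            merged[name] = [row[:] for row in source_w]
            copied.append(name)
        else:
            merged[name] = [row[:] for row in target_w]
            skipped.append(name)
    return merged, copied, skipped

# Hypothetical source actor: 8-dim observation, 4 hidden units, 3-dim action.
source = {
    "input": filled(4, 8, 0.5),
    "hidden": filled(4, 4, 0.7),
    "output": filled(3, 4, 0.9),
}
# Hypothetical target actor: 5-dim observation, same hidden width, 2-dim action.
target = {
    "input": filled(4, 5, 0.0),
    "hidden": filled(4, 4, 0.0),
    "output": filled(2, 4, 0.0),
}

merged, copied, skipped = transfer_matching_layers(source, target)
print(copied)   # ['hidden']
print(skipped)  # ['input', 'output']
```

The skipped layers play the same role as the new classifier head in the transformer case: they start from scratch and are trained on the new task, while the copied layers can be either frozen or fine-tuned.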

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from ImageNet or COCO (the pictures are much darker and represent completely different, industry-related things).
The model currently used is a FasterRCNN which extracts features with a ResNet used as a backbone.
Could training the backbone of the model from scratch in one stage and then training the whole network in another stage be beneficial for the task, instead of loading the network pre-trained on COCO and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your train set is not big enough to train the detector from scratch (though this depends on the network configuration; fasterrcnn+resnet18 can work). Better to use a network pre-trained on ImageNet;
the domain the network was pre-trained on is not really that important. The network, especially a big one, needs to learn all those arches, circles, and other primitive figures in order to use that knowledge for detecting more complex objects;
the brightness of your train images can be important but is not something to stop you from using a pre-trained network;
training from scratch requires many more epochs and much more data. The longer the training, the more complex your LR control algorithm should be. At a minimum, the LR should not be constant but should change based on the cumulative loss. The initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I played a lot with fasterrcnn+resnet (with various numbers of layers) and other networks. I recommend using maskrcnn instead of fasterrcnn. Just configure it not to use the masks and not to do the segmentation. I don't know why, but it gives much better results;
don't spend your time on mobilenet; with your train set size you will not be able to train it to a reasonable AP and AR. Start with a maskrcnn+resnet18 backbone.
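The point about a non-constant learning rate that reacts to the loss can be illustrated with a small "reduce on plateau" rule. This is a minimal sketch assuming one loss value per epoch; the class name `PlateauLR`, its defaults, and the loss values are hypothetical. PyTorch ships a scheduler built on the same idea, `torch.optim.lr_scheduler.ReduceLROnPlateau`.

```python
class PlateauLR:
    """Multiply the learning rate by `factor` when the loss has not
    improved for more than `patience` consecutive epochs."""

    def __init__(self, lr, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, epoch_loss):
        """Record one epoch's loss and return the LR to use next."""
        if epoch_loss < self.best:
            self.best = epoch_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr

# Made-up loss curve with two plateaus.
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.65, 0.65, 0.66, 0.65]
scheduler = PlateauLR(lr=0.01)
lr_history = [scheduler.step(loss) for loss in losses]
print(lr_history)
# [0.01, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005, 0.005, 0.005, 0.0025]
```

The LR is halved after the third non-improving epoch in each plateau. For long from-scratch runs this is often combined with a warm-up phase, which is not shown here.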

Object Detection from Image Classification

Can I use a model trained for image classification to do object detection? I have already spent a lot of time collecting the images and sorting each class into its own folder.
You can use your classification model as an initialized backbone for a detection model (e.g. Faster-RCNN), but it might not help that much compared to training your detector from scratch.
You will need to add detection layers (e.g. ROI pooling) to your backbone to perform detection.
While you can try unsupervised object detection, usually you will need extra labels such as object bounding-boxes to train your object detector.

Convolutional Layers for non-image data

I often see guides and examples using Convolutional Layers when implementing Deep Q-Networks. This makes sense for some scenarios, typically where you do not have access to the state in, for example, an array representation.
In my case, I have a game environment which gives me complete access to the state, in the form of a 2D array. This 2D array is later interpreted by a graphics engine and drawn to the screen.
I have been recommended to use Convolutional Layers for interpreting images, but I have yet to see any recommendations about flattening the 2D state representation directly and utilizing dense layers instead.
Does it make any sense to use Convolutional Networks/Layers for data which are not an image?

A simple Convolutional neural network code

I am interested in convolutional neural networks (CNNs) as an example of a computationally intensive application that is suitable for acceleration using reconfigurable hardware (e.g. an FPGA).
In order to do that, I need to examine simple CNN code that I can use to understand how CNNs are implemented, how the computations in each layer take place, and how the output of each layer is fed to the input of the next one. I am familiar with the theoretical part (http://cs231n.github.io/convolutional-networks/).
However, I am not interested in training the CNN; I want complete, self-contained CNN code that is pre-trained, with all the weight and bias values known.
I know that there are plenty of CNN libraries, e.g. Caffe, but the problem is that there is no trivial example code that is self-contained. Even for the simplest Caffe example, "cpp_classification", many libraries are invoked, the architecture of the CNN is expressed as a .prototxt file, and other types of inputs such as .caffemodel and .binaryproto are involved. The OpenCV2 libraries are invoked too. There are layers and layers of abstraction and different libraries working together to produce the classification outcome.
I know that those abstractions are needed to generate a "usable" CNN implementation, but for a hardware person who needs bare-bones code to study, this is too much "unrelated work".
My question is: can anyone guide me to a simple and self-contained CNN implementation that I can start with?
I can recommend tiny-cnn. It is simple, lightweight (e.g. header-only) and CPU only, while providing several layers frequently used in the literature (for example pooling layers, dropout layers, or local response normalization layers). This means that you can easily explore an efficient implementation of these layers in C++ without requiring knowledge of CUDA and without digging through the I/O and framework code, as required by frameworks such as Caffe. The implementation lacks some comments, but the code is still easy to read and understand.
The provided MNIST example is quite easy to use (I tried it myself some time ago) and trains efficiently. After training and testing, the weights are written to a file. You then have a simple pre-trained model from which you can start; see the provided examples/mnist/test.cpp and examples/mnist/train.cpp. It can easily be loaded for testing (or recognizing digits), so that you can debug the code while executing a learned model.
If you want to inspect a more complicated network, have a look at the Cifar-10 example.
This is the simplest implementation I have seen: DNN McCaffrey
Also, the source code for this by Karpathy looks pretty straightforward.
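To make the layer-by-layer data flow concrete before opening any of the projects above, here is a bare-bones forward pass in plain Python with every weight written out by hand. It is a sketch for study only and is not taken from tiny-cnn, Caffe, or the other linked code: the weights are hand-picked rather than trained, and the function names and the 5x5 input are hypothetical.

```python
def conv2d(image, kernel, bias):
    """Valid cross-correlation of one 2D input with one 2D kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[bias + sum(image[r + i][c + j] * kernel[i][j]
                        for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(fmap):
    """Element-wise max(0, x) on a 2D feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 (assumes even height and width)."""
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]

def flatten(fmap):
    """Row-major flattening of a 2D feature map into a vector."""
    return [v for row in fmap for v in row]

def dense(vector, weights, biases):
    """Fully connected layer: one dot product plus bias per output unit."""
    return [b + sum(w * v for w, v in zip(row, vector))
            for row, b in zip(weights, biases)]

# 5x5 input containing a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-picked (not trained) parameters.
conv_kernel = [[-1.0, 1.0],
               [-1.0, 1.0]]            # responds to dark-to-bright vertical edges
conv_bias = 0.0
dense_weights = [[1.0, -1.0, 1.0, -1.0],   # class 0: edge in the left half
                 [-1.0, 1.0, -1.0, 1.0]]   # class 1: edge in the right half
dense_biases = [0.0, 0.0]

# The output of each layer is the input of the next one.
conv_out = conv2d(image, conv_kernel, conv_bias)   # 4x4
activated = relu(conv_out)                         # 4x4
pooled = max_pool_2x2(activated)                   # 2x2
features = flatten(pooled)                         # length 4
scores = dense(features, dense_weights, dense_biases)
predicted_class = scores.index(max(scores))

print(pooled)           # [[2.0, 0.0], [2.0, 0.0]]
print(scores)           # [4.0, -4.0]
print(predicted_class)  # 0
```

Each function is a plain nested loop over multiply-accumulate operations, which is the part a hardware implementation would parallelise. A real network adds multiple input and output channels per convolution, but the structure of the loops stays the same.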