What happens in each layers of a Generator in a Generative Adversarial Network? - generator

So I understand that in a classifying deep neural network or MLP, each nodes in earlier layers picks up rudimentary shapes and nodes in later layers detect more complex shapes by combining rudimentary shapes (in the example depicted in the image below). However, I absolutely fail to see what's happening in each layer of Generator (aka decoder) in a Generative Adversarial network. Thank you very much, your effort is much appreciated.
enter image description here

Related

Backbone network in Object detection

I am trying to understand the training process of a object deetaction deeplearng algorithm and I am having some problems understanding how the backbone network (the network that performs feature extraction) is trained.
I understand that it is common to use CNNs like AlexNet, VGGNet, and ResNet but I don't understand if these networks are pre-trained or not. If they are not trained what does the training consist of?
We directly use a pre-trained VGGNet or ResNet backbone. Although the backbone is pre-trained for classification task, the hidden layers learn features which can be used for object detection also. Initial layers will learn low level features such as lines, dots, curves etc. Next layer will learn learn high-level features that are built on top of low-level features to detect objects and larger shapes in the image.
Then the last layers are modified to output the object detection coordinates rather than class.
There are object detection specific backbones too. Check these papers:
DetNet: A Backbone network for Object Detection
CBNet: A Novel Composite Backbone Network Architecture for Object Detection
DetNAS: Backbone Search for Object Detection
High-Resolution Network: A universal neural architecture for visual recognition
Lastly, the pretrained weights will be useful only if you are using them for similar images. E.g.: weights trained on Image-net will be useless on ultrasound medical image data. In this case we would rather train from scratch.

Pretrained model or training from scratch for object detection?

I have a dataset composed of 10k-15k pictures for supervised object detection which is very different from Imagenet or Coco (pictures are much darker and represent completely different things, industrial related).
The model currently used is a FasterRCNN which extracts features with a Resnet used as a backbone.
Could train the backbone of the model from scratch in one stage and then train the whole network in another stage be beneficial for the task, instead of loading the network pretrained on Coco and then retraining all the layers of the whole network in a single stage?
From my experience, here are some important points:
your train set is not big enough to train the detector from scratch (though depends on network configuration, fasterrcnn+resnet18 can work). Better to use a pre-trained network on the imagenet;
the domain the network was pre-trained on is not really that important. The network, especially the big one, need to learn all those arches, circles, and other primitive figures in order to use the knowledge for detecting more complex objects;
the brightness of your train images can be important but is not something to stop you from using a pre-trained network;
training from scratch requires much more epochs and much more data. The longer the training is the more complex should be your LR control algorithm. At a minimum, it should not be constant and change the LR based on the cumulative loss. and the initial settings depend on multiple factors, such as network size, augmentations, and the number of epochs;
I played a lot with fasterrcnn+resnet (various number of layers) and the other networks. I recommend you to use maskcnn instead of fasterrcnn. Just command it not to use the masks and not to do the segmentation. I don't know why but it gives much better results.
don't spend your time on mobilenet, with your train set size you will not be able to train it with some reasonable AP and AR. Start with maskrcnn+resnet18 backbone.

Can Overfeat work on ResNet or Inception network architectures

I am familiar with the principal how Overfeat works to not only classify but also localize an object in an image by only using convolutional layers instead of fully connected layers at the end. However, each tutorial or explanation that I read talks about alexnet or a very basic neural network consisting of a few consecutive convolutional layers followed by 2-3 Fully connected layers to classify an image. However my question goes as follow, is it possible to modify a more complex network such as ResNet or Inception which don't use the standard consecutive convolutional layer techniques as in Alexnet or VGG?
Thanks
Welcome, and yes. Looking at a very simplified diagram like this, everything to the left of the split "FC" ('fully connected', or 'dense') arrows can be any kind of (what is typically called an) image classification network, such as those in Keras Applications, which includes VGG, ResNet, Inception, Xception, etc. For these kinds of networks, the input is obviously an image, and the output is sometimes called a 'feature map' (although that's a bit silly---have a look at the output and you'll understand---as it's typically far more akin to a post-modernist map than to a cartographic one).
So the answer to your question is yes: put any kind of network you want before the 'overfeat' ending thing, whether custom or otherwise, but know that it's intended to be some general convolutional reductionist model like ResNet, Inception, etc. Any kind of network that takes an image in and spits out a pooled or flattened (1 dimensional) form of a 'feature map' of 3 dimensions is what's apparently intended for this 'overfeat' concept.

What is the role of fully connected layer in deep learning?

What is the role of fully connected layer (FC) in deep learning? I've seen some networks have 1 FC and some have 2 FC and some have 3 FC. Can anyone explain to me?
Thanks a lot
The fully connected layers are able to very effectively learn non-linear combinations of input features. Let's take a convolutional neural network for example.
The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-linear) function in that space.

Which is best for object localization among R-CNN, fast R-CNN, faster R-CNN and YOLO

what is the difference between R-CNN, fast R-CNN, faster R-CNN and YOLO in terms of the following:
(1) Precision on same image set
(2) Given SAME IMAGE SIZE, the run time
(3) Support for android porting
Considering these three criteria which is the best object localization technique?
R-CNN is the daddy-algorithm for all the mentioned algos, it really provided the path for researchers to build more complex and better algorithm on top of it.
R-CNN, or Region-based Convolutional Neural Network
R-CNN consist of 3 simple steps:
Scan the input image for possible objects using an algorithm called Selective Search, generating ~2000 region proposals
Run a convolutional neural net (CNN) on top of each of these region proposals
Take the output of each CNN and feed it into a) an SVM to classify the region and b) a linear regressor to tighten the bounding box of the object, if such an object exists.
Fast R-CNN:
Fast R-CNN was immediately followed R-CNN. Fast R-CNN is faster and better by the virtue of following points:
Performing feature extraction over the image before proposing regions, thus only running one CNN over the entire image instead of 2000 CNN’s over 2000 overlapping regions
Replacing the SVM with a softmax layer, thus extending the neural network for predictions instead of creating a new model
Intuitively it makes a lot of sense to remove 2000 conv layers and instead take once Convolution and make boxes on top of that.
Faster R-CNN:
One of the drawbacks of Fast R-CNN was the slow selective search algorithm and Faster R-CNN introduced something called Region Proposal network(RPN).
Here’s is the working of the RPN:
At the last layer of an initial CNN, a 3x3 sliding window moves across the feature map and maps it to a lower dimension (e.g. 256-d)
For each sliding-window location, it generates multiple possible regions based on k fixed-ratio anchor boxes (default bounding boxes)
Each region proposal consists of:
an “objectness” score for that region and
4 coordinates representing the bounding box of the region
In other words, we look at each location in our last feature map and consider k different boxes centered around it: a tall box, a wide box, a large box, etc. For each of those boxes, we output whether or not we think it contains an object, and what the coordinates for that box are. This is what it looks like at one sliding window location:
The 2k scores represent the softmax probability of each of the k bounding boxes being on “object.” Notice that although the RPN outputs bounding box coordinates, it does not try to classify any potential objects: its sole job is still proposing object regions. If an anchor box has an “objectness” score above a certain threshold, that box’s coordinates get passed forward as a region proposal.
Once we have our region proposals, we feed them straight into what is essentially a Fast R-CNN. We add a pooling layer, some fully-connected layers, and finally a softmax classification layer and bounding box regressor. In a sense, Faster R-CNN = RPN + Fast R-CNN.
YOLO:
YOLO uses a single CNN network for both classification and localising the object using bounding boxes. This is the architecture of YOLO :
In the end you will have a tensor of shape 1470 i.e 7*7*30 and the structure of the CNN output will be:
The 1470 vector output is divided into three parts, giving the probability, confidence and box coordinates. Each of these three parts is also further divided into 49 small regions, corresponding to the predictions at the 49 cells that form the original image.
In postprocessing steps, we take this 1470 vector output from the network to generate the boxes that with a probability higher than a certain threshold.
I hope you get the understanding of these networks, to answer your question on how the performance of these network differs:
On the same dataset: 'You can be sure that the performance of these networks are in the order they are mentioned, with YOLO being the best and R-CNN being the worst'
Given SAME IMAGE SIZE, the run time: Faster R-CNN achieved much better speeds and a state-of-the-art accuracy. It is worth noting that although future models did a lot to increase detection speeds, few models managed to outperform Faster R-CNN by a significant margin. Faster R-CNN may not be the simplest or fastest method for object detection, but it is still one of the best performing. However researchers have used YOLO for video segmentation and by far its the best and fastest when it comes to video segmentation.
Support for android porting: As far as my knowledge goes, Tensorflow has some android APIs to port to android but I am not sure how these network will perform or even will you be able to port it or not. That again is subjected to hardware and data_size. Can you please provide the hardware and the size so that I will be able to answer it clearly.
The youtube video tagged by #A_Piro gives a nice explanation too.
P.S. I borrowed a lot of material from Joyce Xu Medium blog.
If your are interested in these algorithms you should take a look into this lesson which go through the algoritmhs you named : https://www.youtube.com/watch?v=GxZrEKZfW2o.
PS: There is also a Fast YOLO if I remember well haha !
I have been working with YOLO and FRCNN a lot. To me the YOLO has the best accuracy and speed but if you want to do research on image processing, I will suggest FRCNN as many previous works are done with it, and to do research you really want to be consistent.
For Object detection, I am trying SSD+ Mobilenet. It has a balance of accuracy and speed So it can also be ported to android devices easily with good fps.
It has less accuracy compared to faster rcnn but more speed than other algorithms.
It also has good support for android porting.