I'm trying to understand why, for example, the MATLAB page describes AlexNet as:
AlexNet is a convolutional neural network that is 8 layers deep.
After using analyzeNetwork() to check the architecture, there are clearly 25 layers.
How are 25 layers related to 8 layers deep? What is the difference between those two values?
I'm sure that I'm missing something, but I don't know what it is.
The MATLAB documentation is probably not clear enough. It may help to talk about blocks instead (personally, I prefer that word). If you look at the figure:
Many of the "layers" end with a number that indicates the block they belong to.
The term "layer" is often ambiguous: some people consider a convolution + activation + batch norm to be a single layer, and there is no consensus. In MATLAB's case, only the layers that have learnable weights are counted, which for AlexNet gives 5 convolutional layers + 3 fully connected layers = 8.
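To make the counting concrete, here is a rough Keras analogue of the same convention (an assumption on my part, since AlexNet itself doesn't ship with Keras; VGG16, which is "16 layers deep" by the same logic, stands in):

    from tensorflow.keras.applications import VGG16

    model = VGG16(weights=None)
    total = len(model.layers)                                     # every layer object, pooling and all
    weighted = sum(1 for layer in model.layers if layer.weights)  # only layers with learnable weights
    print(total, weighted)  # weighted comes out to 16, matching "16 layers deep"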
I am familiar with the principle of how OverFeat works: it not only classifies but also localizes an object in an image by using only convolutional layers instead of fully connected layers at the end. However, every tutorial or explanation I read talks about AlexNet or a very basic neural network consisting of a few consecutive convolutional layers followed by 2-3 fully connected layers to classify an image. My question is this: is it possible to modify a more complex network such as ResNet or Inception, which don't use the standard stack of consecutive convolutional layers as in AlexNet or VGG?
Thanks
Welcome, and yes. Looking at a very simplified diagram like this, everything to the left of the split "FC" ('fully connected', or 'dense') arrows can be any kind of what is typically called an image classification network, such as those in Keras Applications, which include VGG, ResNet, Inception, Xception, etc. For these kinds of networks, the input is obviously an image, and the output is sometimes called a 'feature map' (a slightly silly name; have a look at the output and you'll understand, as it's typically far more akin to a post-modernist map than to a cartographic one).
So the answer to your question is yes: put any kind of network you want before the OverFeat-style ending, whether custom or otherwise, but know that it's intended to be some general convolutional feature extractor like ResNet, Inception, etc. Any kind of network that takes an image in and spits out a pooled or flattened (one-dimensional) form of a three-dimensional 'feature map' is what's apparently intended for the OverFeat concept.
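Here is a minimal sketch of that swap (assuming TensorFlow/Keras; the layer sizes and the 10-class output are hypothetical): any Keras Applications backbone feeding an OverFeat-style fully convolutional head, with 1x1 convolutions standing in for the FC layers.

    import tensorflow as tf
    from tensorflow.keras import layers

    # Any Keras Applications classifier works here: ResNet50, InceptionV3, Xception, ...
    backbone = tf.keras.applications.ResNet50(include_top=False, weights=None)

    inputs = tf.keras.Input(shape=(None, None, 3))            # fully convolutional: any image size
    features = backbone(inputs)                               # 3-D feature map
    x = layers.Conv2D(1024, 1, activation="relu")(features)   # 1x1 conv in place of an FC layer
    outputs = layers.Conv2D(10, 1)(x)                         # per-location class scores (10 is hypothetical)
    model = tf.keras.Model(inputs, outputs)

Because every layer is convolutional, the same model can be run on larger images to get a grid of class scores, which is the core of the OverFeat localization trick.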
What is the role of a fully connected (FC) layer in deep learning? I've seen networks with 1 FC layer, some with 2, and some with 3. Can anyone explain this to me?
Thanks a lot
Fully connected layers are able to learn non-linear combinations of input features very effectively. Take a convolutional neural network as an example.
The output from the convolutional layers represents high-level features in the data. While that output could be flattened and connected to the output layer, adding a fully-connected layer is a (usually) cheap way of learning non-linear combinations of these features.
Essentially the convolutional layers are providing a meaningful, low-dimensional, and somewhat invariant feature space, and the fully-connected layer is learning a (possibly non-linear) function in that space.
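As a minimal Keras sketch of that arrangement (the layer sizes here are arbitrary): the convolutions extract the features, and the FC layer recombines them non-linearly before the output layer.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, activation="relu"),   # feature extraction
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),   # higher-level features
        layers.Flatten(),                          # flatten the feature map
        layers.Dense(128, activation="relu"),      # FC: non-linear combinations of features
        layers.Dense(10, activation="softmax"),    # output layer
    ])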
I have a small dataset collected from ImageNet (7 classes, each with 1000 training images). I tried to train the AlexNet model on it, but somehow the accuracy just can't go any higher (about 68% maximum). I removed the conv4 and conv5 layers to prevent the model from overfitting, and also decreased the number of neurons in each layer (conv and FC). Here is my setup.
Did I do anything wrong that makes the accuracy so low?
I want to sort out a few terms:
(1) A perceptron is an individual cell in a neural net.
(2) In a CNN, we generally focus on the kernel (filter) as a unit; this is the square matrix of perceptrons that forms a pseudo-visual unit.
(3) The only place it usually makes sense to focus on an individual perceptron is in the FC layers. When you talk about removing some of the perceptrons, I think you mean kernels.
The most important part of training a model is to make sure that your model is properly fitted to the problem at hand. AlexNet (and CaffeNet, the BVLC implementation) is fitted to the full ImageNet data set. Alex Krizhevsky and his colleagues spent a lot of research effort in tuning their network to the problem. You are not going to get similar accuracy -- on a severely reduced data set -- by simply removing layers and kernels at random.
I suggested that you start from CONVNET (the CIFAR-10 net) because it's much better tuned to this scale of problem. Most of all, I strongly recommend that you make constant use of your visualization tools, so that you can detect when the various kernel layers begin to learn their patterns, and to see the effects of small changes in the topology.
You need to run some experiments to tune and understand your topology. Record the kernel visualizations at chosen times during the training -- perhaps at intervals of 10% of expected convergence -- and compare the visual acuity as you remove a few kernels, or delete an entire layer, or whatever else you choose.
For instance, I expect that if you do this with your current amputated CaffeNet, you'll find that the severe losses in depth and breadth greatly change the feature recognition it's learning. The current depth of building blocks is not enough to recognize edges, then shapes, then full body parts. However, I could be wrong -- you do have three remaining layers. That's why I asked you to post the visualizations you got, to compare with published AlexNet features.
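In case it helps, here is a rough sketch of the kind of visualization I mean (assuming a trained Keras model and matplotlib; the function name and layout are my own): save the first conv layer's filters at training checkpoints so you can compare them across runs.

    import matplotlib.pyplot as plt

    def save_first_conv_filters(model, path, n_filters=16):
        conv = next(layer for layer in model.layers if layer.weights)  # first weighted (conv) layer
        w = conv.get_weights()[0]                        # shape (h, w, in_channels, out_channels)
        w = (w - w.min()) / (w.max() - w.min() + 1e-8)   # normalize to [0, 1] for display
        n = min(n_filters, w.shape[-1])
        fig, axes = plt.subplots(1, n, figsize=(n, 1.5))
        for i in range(n):
            axes[i].imshow(w[:, :, :3, i])               # first 3 input channels shown as RGB
            axes[i].axis("off")
        fig.savefig(path)
        plt.close(fig)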
edit: CIFAR VISUALIZATION
CIFAR is much better differentiated between classes than is ILSVRC-2012. Thus, the training requires less detail per layer and fewer layers. Training is faster, and the filters are not nearly as interesting to the human eye. This is not a problem with the Gabor (not Garbor) filter; it's just that the model doesn't have to learn so many details.
For instance, for CONVNET to discriminate between a jonquil and a jet, we just need a smudge of yellow inside a smudge of white (the flower). For AlexNet to tell a jonquil from a cymbidium orchid, the network needs to learn about petal count or shape.
In a neural network, when talking about layers, does the output count as one layer? Based on my reading, it seems some people count it and others don't. For example, the author says this is a 2-layer network. Shouldn't input, hidden, and output make at least 3 layers?
The article is here: http://karpathy.github.io/2016/05/31/rl/.
Depending on your definition, it counts. You could define it as a 1-layer network, but most people I talk to would say this is a 3-layer network. The definition that sticks is the one most people use. Programmatically, this is also written as three layers in Keras or any other framework.
It depends on who uses the term, but usually one is interested in the number of layers which have weights. (Since I think of the weights as sitting between the layers, this is a bit strange, but anyway.)
By this logic, the network above is a 2-layer network: two weighted/learned layers.
You could also say the network has one hidden layer (and one input and one output layer).
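A minimal Keras version of such a network (the sizes are arbitrary) shows where the count of 2 comes from:

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(4,)),                 # input layer: no weights
        layers.Dense(8, activation="relu"),      # hidden layer: weight set #1
        layers.Dense(1, activation="sigmoid"),   # output layer: weight set #2
    ])
    print(sum(1 for layer in model.layers if layer.weights))  # 2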
The recent paper Densely Connected Convolutional Networks (https://arxiv.org/abs/1608.06993) has shown that their DenseNet deep learning architecture outperforms state-of-the-art ResNet architectures. Are there similar papers/repositories for comparable architectures without convolutions (RNNs or just dense layers)?
No.
The simple answer is that convolution itself provides regularization by exploiting data locality, which holds for most images. This is also the key to building deeper networks, which are crucial for deeper representations.
Another critical reason is that a dense layer matching the size of the input (usually 224*224) would hog most of your GPU memory, so there is little chance today of building a dense network for images of this size that is more than a few layers deep. Maybe if you had 10x the GPU RAM you could try to pull that off... Convolution is simply more economical.
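Some back-of-the-envelope arithmetic behind that claim (the sizes are assumed): a single dense layer mapping a 224x224 RGB image to a same-sized output, versus a single 3x3 convolution with 64 filters.

    n = 224 * 224 * 3                    # 150,528 input values
    dense_params = n * n                 # ~22.7 billion weights for one dense layer
    conv_params = 3 * 3 * 3 * 64 + 64    # 1,792 weights for one 3x3 conv with 64 filters
    print(f"{dense_params:,} vs {conv_params:,}")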