Are CRF (Conditional Random Fields) still actively used in semantic segmentation tasks or do the current deep neural networks made them unnecessary ?
I've seen both of the answers in academic papers and, since it seems quite complicated to implement and infer, I would like to have opinions on them before trying them out.
Thank you
The CRFs are still used for the tasks of image labeling and semantic image segmentation along with the DNNs. In fact, CRFs and DNNs are not self-excluding techniques and a lot of recent publications use both of them.
CRFs are based on probabilistic graphical models, where graph nodes and edges represent random variables, initialized with potential functions. DNN can be used as such potential function:
Conditional Random Fields Meet Deep Neural Networks for Semantic Segmentation
Conditional Random Fields as Recurrent Neural Networks
Brain Tumor Segmentation with Deep Neural Network (Future Work Section)
DCNN may be used for the feature extraction process, which is an essential step in applying CRFs:
Environmental Microorganism Classification Using Conditional Random Fields and Deep Convolutional Neural Networks
Conditional Random Field and Deep Feature Learning for Hyperspectral Image Segmentation
There are also toolkits, combining both CRFs and DNNs:
Direct graphical models C++ library
Related
I am a student and currently studying deep learning by myself. Here I would like to ask for clarification regarding the transfer learning.
For example MobileNetv2 (https://keras.io/api/applications/mobilenet/#mobilenetv2-function), if the weights parameter is set to None, then I am not doing transfer learning as the weights are random initialized. If I would like to do transfer learning, then I should set the weights parameter to imagenet. Is this concept correct?
Clarification and explanation regarding deep learning
Yes, when you initialize the weights with random values, you are just using the architecture and training the model from scratch. The goal of transfer learning is to use the previously gained knowledge by another trained model to get better results or to use less computational resources.
There are different ways to use transfer learning:
You can freeze the learned weights of the base model and replace the last layer of the model base on your problem and just train the last layer
You can start with the learned weights and fine-tune them (let them change in the learning process). Many people do that because sometimes it makes the training faster and gives better results because the weights already contain so much information.
You can use the first layers to extract basic features like colors, edges, circles... and add your desired layers after them. In this way, you can use your resources to learn high-level features.
There are more cases, but I hope it could give you an idea.
pretty new to deep learning, but couldn't seem to find/figure out what are backend weights such as
full_yolo_backend.h5
squeezenet_backend.h5
From what I have found and experimented, these backend weights have fundamentally different model architectures such as
yolov2 model has 40+ layers but the backend only 20+ layers (?)
you can build on top of the backend model with your own networks (?)
using backend models tend to yield poorer results (?)
I was hoping to seek some explanation on backend weights vs actual models for learning purposes. Thank you so much!
I'm note sure which implementation you are using but in many applications, you can consider a deep model as a feature extractor whose output is more or less task-agnostic, followed by a number of task-specific heads.
The choice of backend depends on your specific constraints in terms of tradeoff between accuracy and computational complexity. Examples of classical but time-consuming choices for backends are resnet-101, resnet-50 or VGG that can be coupled with FPN (feature pyramid networks) to yield multiscale features. However, if speed is your main concern then you can use smaller backends such as different MobileNet architectures or even the vanilla networks such as the ones used in the original Yolov1/v2 papers (tinyYolo is an extreme case).
Once you have chosen your backend (you can use a pretrained one), you can load its weights (that is what your *h5 files are). On top of that, you will add a small head that will carry the tasks that you need: this can be classification, bbox regression, or like in MaskRCNN forground/background segmentation. For Yolov2, you can just add very few, for example 3 convolutional layers (with non-linearities of course) that will output a tensor of size
BxC1xC2xAxP
#B==batch size
#C1==number vertical of cells
#C2==number of horizontal cells
#C3==number of anchors
#C4==number of parameters (i.e. bbx parameters, class prediction, confidence)
Then, you can just save/load the weights of this head separately. When you are happy with your results though, training jointly (end-to-end) will usually give you a small boost in accuracy.
Finally, to come back to your last questions, I assume that you are getting poor results with the backends because you are only loading backend weights but not the weights of the heads. Another possibility is that you are using a head trained with a backends X but that you are switching the backend to Y. In that case since the head expects different features, it's natural to see a drop in performance.
I am fairly new to Deep Learning and get quite overwhelmed by the many different Nets and their field of application. Thus, I want to know if there is some kind of overview which kind of different networks exist, what there key-features are and what kind of purpose they have.
For example I know abut LeNet, ConvNet, AlexNet - and somehow they are the same but still differ?
There are basically two types of neural networks, supervised and unsupervised learning. Both need a training set to "learn". Imagine training set as a massive book where you can learn specific information. In supervised learning, the book is supplied with answer key but without the solution manual, in contrast, unsupervised learning comes without answer key or solution manual. But the goal is the same, which is that to find patterns between the questions and answers (supervised learning) and questions (unsupervised learning).
Now we have differentiate between those two, we can go into the models. Let's discuss about supervised learning, which basically has 3 main models:
artificial neural network (ANN)
convolutional neural network (CNN)
recurrent neural network (RNN)
ANN is the simplest of all three. I believe that you have understand it, so we can move forward to CNN.
Basically in CNN all you have to do is to convolve our input with feature detectors. Feature detectors are matrices which have the dimension of (row,column,depth(number of feature detectors). The goal of convolving our input is to extract informations related to spatial data. Let's say you want to distinguish between cats and dogs. Cats have whiskers but dogs does not. Cats also have different eyes than dogs and so on. But the downside is, the more convolution layers will result in slower computation time. To mitigate that, we do some kind of processing called pooling or downsampling. Basically, this reduce the size of feature detectors while minimizing lost features or information. Then the next step would be flattening or squashing all those 3d matrix into (n,1) dimension so you can input it into ANN. Then the next step is self explanatory, which is normal ANN. Because CNN is inherently able to detect certain features, it mostly(maybe always) used for classification, for example image classification, time series classification, or maybe even video classification. For a crash course in CNN, check out this video by Siraj Raval. He's my favourite youtuber of all time!
Arguably the most sophisticated of all three, RNN is bestly described as neural networks that have "memory" by introducing "loops" within them which allow information to persist. Why is this important? As you are reading this, your brain use previous memory to comprehend all of this information. You don't seem to rethink everything from scratch again and this is what traditional neural networks do, which is to forget everything and re-learn again. But native RNN aren't effective so when people talk about RNN they mostly refer to LSTM which stands for Long Short-Term Memory. If that seems confusing to you, Cristopher Olah will give you in depth explanation in a very simple way. I advice you to check out his link for complete understanding about how RNN, especially LSTM variant
As for unsupervised learning, I'm so sorry that I haven't got the time to learn them, so this is the best I can do. Good luck and have fun!
They are the same type of Networks. Convolutional Neural Networks. The problem with the overview is that as soon as you post something it is already outdated. Most of the networks you describe are already old, even though they are only a few years old.
Nevertheless you can take a look at the networks supplied by caffe (https://github.com/BVLC/caffe/tree/master/models).
In my personal view the most important concepts in deep Learning are recurrent networks (https://keras.io/layers/recurrent/), residual connections, inception blocks (see https://arxiv.org/abs/1602.07261). The rest are largely theoretical concepts, which would not fit in a stack overflow answer.
I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressible than I once thought.
This lead me to wonder if it is possible to visualize NN's decision process, at least in simple cases.
For example, if my input is 20x20 greyscale image (i.e. total 400 features) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this may be implemented as visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activation of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparant: the last layer is a radial basis function layer (with gaussian functions), i. e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
However, CNNs are artificial neural networks. But the layers are not fully connected and some neurons share the same weights.
Maybe it doesn't answer the question directly but I found this interesting piece in this Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin,
Kai Chen and
Greg Corrado paper (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take a neuron that is best-performing at recognizing faces and
select images from the dataset that it cause it to output highest confidence;
mathematically find an image (not in dataset) that would get highest condifence.
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, here are generated “optimal input” images for cat heads and human bodies:
I do know that feedforward multi-layer neural networks with backprop are used with Reinforcement Learning as to help it generalize the actions our agent does. This is, if we have a big state space, we can do some actions, and they will help generalize over the whole state space.
What do recurrent neural networks do, instead? To what tasks are they used for, in general?
Recurrent Neural Networks, RNN for short (although beware that RNN is often used in the literature to designate Random Neural Networks, which effectively are a special case of Recurrent NN), come in very different "flavors" which causes them to exhibit various behaviors and characteristics. In general, however these many shades of behaviors and characteristics are rooted in the availability of [feedback] input to individual neurons. Such feedback comes from other parts of the network, be it local or distant, from the same layer (including in some cases "self"), or even on different layers (*). Feedback information it treated as "normal" input the neuron and can then influence, at least in part, its output.
Unlike back propagation which is used during the learning phase of a Feed-forward Network for the purpose of fine-tuning the relative weights of the various [Feedfoward-only] connections, FeedBack in RNNs constitute true a input to the neurons they connect to.
One of the uses of feedback is to make the network more resilient to noise and other imperfections in the input (i.e. input to the network as a whole). The reason for this is that in addition to inputs "directly" pertaining to the network input (the types of input that would have been present in a Feedforward Network), neurons have the information about what other neurons are "thinking". This extra info then leads to Hebbian learning, i.e. the idea that neurons that [usually] fire together should "encourage" each other to fire. In practical terms this extra input from "like-firing" neighbor neurons (or no-so neighbors) may prompt a neuron to fire even though its non-feedback inputs may have been such that it would have not fired (or fired less strongly, depending on type of network).
An example of this resilience to input imperfections is with associative memory, a common employ of RNNs. The idea is to use the feeback info to "fill-in the blanks".
Another related but distinct use of feedback is with inhibitory signals, whereby a given neuron may learn that while all its other inputs would prompt it to fire, a particular feedback input from some other part of the network typically indicative that somehow the other inputs are not to be trusted (in this particular context).
Another extremely important use of feedback, is that in some architectures it can introduce a temporal element to the system. A particular [feedback] input may not so much instruct the neuron of what it "thinks" [now], but instead "remind" the neuron that say, two cycles ago (whatever cycles may represent), the network's state (or one of its a sub-states) was "X". Such ability to "remember" the [typically] recent past is another factor of resilience to noise in the input, but its main interest may be in introducing "prediction" into the learning process. These time-delayed input may be seen as predictions from other parts of the network: "I've heard footsteps in the hallway, expect to hear the door bell [or keys shuffling]".
(*) BTW such a broad freedom in the "rules" that dictate the allowed connections, whether feedback or feed-forward, explains why there are so many different RNN architectures and variations thereof). Another reason for these many different architectures is that one of the characteristics of RNN is that they are not readily as tractable, mathematically or otherwise, compared with the feed-forward model. As a result, driven by mathematical insight or plain trial-and-error approach, many different possibilities are being tried.
This is not to say that feedback network are total black boxes, in fact some of the RNNs such as the Hopfield Networks are rather well understood. It's just that the math is typically more complicated (at least to me ;-) )
I think the above, generally (too generally!), addresses devoured elysium's (the OP) questions of "what do RNN do instead", and the "general tasks they are used for". To many complement this information, here's an incomplete and informal survey of applications of RNNs. The difficulties in gathering such a list are multiple:
the overlap of applications between Feed-forward Networks and RNNs (as a result this hides the specificity of RNNs)
the often highly specialized nature of applications (we either stay in with too borad concepts such as "classification" or we dive into "Prediction of Carbon shifts in series of saturated benzenes" ;-) )
the hype often associated with neural networks, when described in vulgarization texts
Anyway, here's the list
modeling, in particular the learning of [oft' non-linear] dynamic systems
Classification (now, FF Net are also used for that...)
Combinatorial optimization
Also there are a lots of applications associated with the temporal dimension of the RNNs (another area where FF networks would typically not be found)
Motion detection
load forecasting (as with utilities or services: predicting the load in the short term)
signal processing : filtering and control
There is an assumption in the basic Reinforcement Learning framework that your state/action/reward sequence is a Markov Decision Process. That basically means that you do not need to remember any information about previous states from this episode to make decisions.
But this is obviously not true for all problems. Sometimes you do need to remember some recent things to make informed decisions. Sometimes you can explicitly build the things that need to be remembered into the state signal, but in general we'd like our system to learn what it needs to remember. This is called a Partially Observable Markov Decision Process (POMDP), and there are a variety of methods used to deal with it. One possibly solution is to use a recurrent neural network, since they incorporate details from previous time steps into the current decision.