Why neural network tends to output 'mean value'? - regression

I am using keras to build a simple neural network for a regression task.
But the output is always tends to the 'mean value' of ground truth y data.
See the first figure, blue is ground truth, red is predicted value (very close to the constant mean of ground truth).
Also the model stops learning very early even though I set a learning epoch=100.
Anyone have ideas under what kinds of conditions the neural network will stop learning early and why the regression output tends to 'the mean' of ground truth?
Thanks!

Possibly because the data are unpredictable....? Do you know for certain that the data set has N order predictability of some kind?
Just eyeballing your data set, it lacks periodicity, lacks homoscedasticity, it lacks any slope or skew or trend or pattern... I can't really tell if there is anything wrong with your 'net. In the absence of any pattern, the mean is always the best prediction... and it is entirely possible (although not certain) that the neural net is doing its job.
I suggest you find an easier data set, and see if you can tackle that first.

The model is not learning from the data. Think of a basic linear regression - the 'null' prediction, the prediction if you didn't have any predictors at all, is just the expected value; i.e. the mean. It could be caused by many different issues, but initialization comes to mind - bad initialization leads to no learning. This blog post has good practical advice that may help.

Related

In deep learning, can I change the weight of loss dynamically?

Call for experts in deep learning.
Hey, I am recently working on training images using tensorflow in python for tone mapping. To get the better result, I focused on using perceptual loss introduced from this paper by Justin Johnson.
In my implementation, I made the use of all 3 parts of loss: a feature loss that extracted from vgg16; a L2 pixel-level loss from the transferred image and the ground true image; and the total variation loss. I summed them up as the loss for back propagation.
From the function
yˆ=argminλcloss_content(y,yc)+λsloss_style(y,ys)+λTVloss_TV(y)
in the paper, we can see that there are 3 weights of the losses, the λ's, to balance them. The value of three λs are probably fixed throughout the training.
My question is that does it make sense if I dynamically change the λ's in every epoch(or several epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges drastically in the first several epochs yet the pixel-level l2 loss converges fairly slow. So maybe the weight λs should be higher for the content loss, let's say 0.9, but lower for others. As the time passes, the pixel-level loss will be increasingly important to smooth up the image and to minimize the artifacts. So it might be better to adjust it higher a bit. Just like changing the learning rate according to the different epochs.
The postdoc supervises me straightly opposes my idea. He thought it is dynamically changing the training model and could cause the inconsistency of the training.
So, pro and cons, I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, dynamic loss should not really have that much effect and may have opposite effect altogether.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following in order to see if there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (way too time consuming) but in essence, it's best to keep it constant and also avoid angering off your supervisor too :D
If you are running a different ML or DL library, there are optimizer for each, just Google them. It may be best to run these on a cluster and overnight, but they usually give you a good enough optimized version of your model.
Hope that helps and good luck!

How do I verify that my model is actually functioning correctly in deep learning?

I have a dataset of around 6K chemical formulas which I am preprocessing via Keras' tokenization to perform binary classification. I am currently using a 1D convolutional neural network with dropouts and am obtaining an accuracy of 82% and validation accuracy of 80% after only two epochs. No matter what I try, the model just plateaus there and doesn't seem to be improving at all. Those same exact accuracies are reached with a vanilla LSTM too. What else can I try to improve my accuracies? Losses only have a difference of 0.04... Anyone have any ideas? Both models use an embedding layer and changing the output dimension isn't having an effect either.
According to your answer, I believe your model has a high bias and low variance (see this link for further details). Thus, your model is not fitting your data very well and it is causing underfitting. So, I suggest you 3 things:
Train your model a little longer: I believe two epoch are too few to give a chance to your model understand the patterns in the data. Try to minimize learning rate and increase the number of epochs.
Try a different architecture: you may change the amount of convolutions, filters and layers, You can also use different activation functions and other layers like max pooling.
Make an error analysis: once you finished your training, apply your model to test set and take a look into the errors. How much false positives and false negatives do you have? Is your model better to classify one class than the other? You can see a pattern in the errors that may be related to your data?
Finally, if none of these suggestions helped you, you may also try to increase the number of features, if possible.

Adding noise to image for deep learning, yes or no?

I've been thinking that adding noise to an image can prevent overfitting and also "increase" the dataset by adding variations to it. I'm only trying to add some random 1s to images that has shape (256,256,3) which uses uint8 to represent its color. I don't think that can affect the visualization at all (I showed both images with matplotlib and they seems almost the same) and has only ~0.01 mean difference in the sum of their values.
But it doesn't look to have its advances. After training for a long time it's still not as good as the one doesn't use noises.
Has anyone tried to use noise for image classification tasks like this? Is it eventually better?
I wouldn't go to add noise to your data. Some papers employ input deformations during training to increase robutness and convergence speed of models. However, these deformations are statistically inefficient (not just on image but any kind of data).
You can read Intriguing properties of Neural Networks from Szegedy et al. for more details (and refer to references 9 & 13 for papers that uses deformations).
If you want to avoid overfitting, you might be interested to read about regularization instead.
Yes you may add noise to extend your dataset and avoid overfitting your training set but make sure it is random otherwise your network will take this noise as something it should learn (and that's not something you want). I wouldn't use this method first to do that, I would first rotate and/or flip my samples.
However, your network should perform better or, at least, as well as your previous network.
First thing I would check is : How do you measure your performances ? What were your performances before and after ? And did you change anything else ?
There are a couple of works that deal with this problem. Because you make the training set harder the training error will be lower, however your generalization might be better. It has been shown that adding noise can have stability effects for training Generative Adversarial Networks (Adversarial Training).
For classification tasks it is not that cut and dry. Not many works have actually dealt with this topic. The closest one is to my best knowledge is this one from google (https://arxiv.org/pdf/1412.6572.pdf), where they show the limitation of using training without noise. They do report a regularization effect, but not actual better results than using other methods.

Any visualizations of neural network decision process when recognizing images?

I'm enrolled in Coursera ML class and I just started learning about neural networks.
One thing that truly mystifies me is how recognizing something so “human”, like a handwritten digit, becomes easy once you find the good weights for linear combinations.
It is even crazier when you understand that something seemingly abstract (like a car) can be recognized just by finding some really good parameters for linear combinations, and combining them, and feeding them to each other.
Combinations of linear combinations are much more expressible than I once thought.
This lead me to wonder if it is possible to visualize NN's decision process, at least in simple cases.
For example, if my input is 20x20 greyscale image (i.e. total 400 features) and the output is one of 10 classes corresponding to recognized digits, I would love to see some kind of visual explanation of which cascades of linear combinations led the NN to its conclusion.
I naïvely imagine that this may be implemented as visual cue over the image being recognized, maybe a temperature map showing “pixels that affected the decision the most”, or anything that helps to understand how neural network worked in a particular case.
Is there some neural network demo that does just that?
This is not a direct answer to your question. I would suggest you take a look at convolutional neural networks (CNN). In CNNs you can almost see the concept that is learned. You should read this publication:
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner: Gradient-Based Learning Applied to Document Recognition, Proceedings of the IEEE, 86(11):2278-2324, November 1998
CNNs are often called "trainable feature extractors". In fact, CNNs implement 2D filters with trainable coefficients. This is why the activation of the first layers are usually shown as 2D images (see Fig. 13). In this paper the authors use another trick to make the networks even more transparant: the last layer is a radial basis function layer (with gaussian functions), i. e. the distance to an (adjustable) prototype for each class is calculated. You can really see the learned concepts by looking at the parameters of the last layer (see Fig. 3).
However, CNNs are artificial neural networks. But the layers are not fully connected and some neurons share the same weights.
Maybe it doesn't answer the question directly but I found this interesting piece in this Andrew Ng, Jeff Dean, Quoc Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin,
Kai Chen and
Greg Corrado paper (emphasis mine):
In this section, we will present two visualization techniques to verify if the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus
...
These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown [below], confirm that the tested neuron indeed learns the concept of faces.
In other words, they take a neuron that is best-performing at recognizing faces and
select images from the dataset that it cause it to output highest confidence;
mathematically find an image (not in dataset) that would get highest condifence.
It's fun to see that it actually “captures” features of the human face.
The learning is unsupervised, i.e. input data didn't say whether an image is a face or not.
Interestingly, here are generated “optimal input” images for cat heads and human bodies:

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can give predictions on how long does it take for a target human to make a move in a given Sudoku situation.
Same input doesn't always map to same outcome. It takes different times for the human to make a move with the same situation, but my hypothesis is that there's a tendency in the resulting probability distribution. (My educated guess is that it is ~normal.)
I have ideas about the factors that influence the distribution (like #empty slots) but would preferably leave it to the system to figure these patterns out. Please notice, that I'm not interested in the patterns, just the model.
I can generate sample and test data easily by running sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really want to do want to output a distribution then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work because I agree with your intuition about features such as the number of empty slots being important. There are also other obvious features, such as the number of empty slots per row/column that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting that the learning algorithm will infer something similar on its own.
The monte carlo method seems like it would work well here but would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if it's not in the database you could start to calculate how long it would take to make a move based on some algorithm. Though I know you had specified you wanted machine learning to do this it might be worth segmenting the problem into something a little smaller then building on it.
If you have some guesstimate as to what influences the function (# of empty cell, etc), try to train a classifier on a vector of features, and not on the 81 cells vector (0/1 or 0..9, doesn't really matter for my argument).
I think that your claim:
we wouldn't have to necessary know the underlying patterns, the "trained patterns" in a learning system automatically encodes these sometimes quite delicate and subtle patterns inside them -- that's one of their great power
is wrong. you do have to give the network the right domain. for example, when trying to detect object in an image, working in the pixel domain is pointless. you'll only get results if you first run some feature detection to detect edges, corners, etc.
Theoretically, with enough non-linearity (in NN - enough layers in the network) it can detect such things, but in practice, I have never seen that work, without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 (or a much smaller feature space as I suggest) to R (response time between 0 and Inf) or some discretization of that. So NN and other classifiers can do that.