Train neural network with unlimited training data - deep-learning

Does fitting a neural network with generated, and thus infinite training samples work?
Will the batch size still matter? Should training samples be repeated over some batches or is it ok to use every sample only once?
Does the term "epoch" make any sense if there is no complete dataset to iterate over?
Does validation make any sense if every sample from the training dataset is already a new one? If not, will the training loss behave like a validation loss would?

Does fitting a neural network with generated, and thus infinite training samples work?
Yes, it is completely fine. In fact, it will likely be a much better setup than the one you are used to.
Will the batch size still matter?
Yes. Batch size controls the noise in the gradient estimate: the bigger the batch, the smaller the error.
Should training samples be repeated over some batches or is it ok to use every sample only once?
If you can avoid repeating them and just keep generating, you will be in a cleaner mathematical setup. In practice it likely won't matter much.
Does the term "epoch" make any sense if there is no complete dataset to iterate over?
The term "epoch" is one of the big mistakes that we made as a community; it really is meaningless even when the dataset is finite. Avoiding it completely will simplify your life: just think in terms of gradient updates or samples consumed and forget about epochs.
Does validation make any sense if every sample from the training dataset is already a new one? If not, will the training loss behave like a validation loss would?
It still makes sense as an additional check that you are making progress. Just make sure you generate the validation set once, up front, and do not "generate" it anew during training. That being said, it is much less important than in other cases, as long as your test scenario is also going to be generated in the same way. This is one reason why many RL papers (especially from the Atari era) did not have validation sets: the training and test "environments" were exactly the same.
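As a minimal sketch of the setup described above (not part of the original answer): the toy generator, the linear model, and every name below are assumptions made purely for illustration. Each step draws fresh samples, progress is counted in gradient updates and samples consumed rather than epochs, and the validation set is generated once before training starts.

```python
import numpy as np

def generate_batch(rng, batch_size):
    """Hypothetical data generator: noisy samples of y = 3x - 1."""
    x = rng.uniform(-1.0, 1.0, size=(batch_size, 1))
    y = 3.0 * x - 1.0 + rng.normal(0.0, 0.1, size=(batch_size, 1))
    return x, y

def mse(w, b, x, y):
    """Mean squared error of the linear model w * x + b."""
    return float(np.mean((x * w + b - y) ** 2))

def train(num_steps=2000, batch_size=32, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # Fixed validation set, generated once with its own random stream.
    x_val, y_val = generate_batch(np.random.default_rng(seed + 1), 1000)
    w, b = 0.0, 0.0
    history = []  # (gradient steps, samples consumed, validation MSE)
    for step in range(1, num_steps + 1):
        x, y = generate_batch(rng, batch_size)  # fresh samples, never reused
        err = x * w + b - y
        w -= lr * float(np.mean(2 * err * x))   # gradient of MSE w.r.t. w
        b -= lr * float(np.mean(2 * err))       # gradient of MSE w.r.t. b
        if step % 500 == 0:
            history.append((step, step * batch_size, mse(w, b, x_val, y_val)))
    return w, b, history
```

Calling `train()` returns the fitted parameters and a short log indexed by steps and samples seen. Because every training batch is new, the training loss on a batch is itself an estimate of the loss on unseen data, which is why the fixed validation set serves mainly as a sanity check here.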

Related

In deep learning, can I change the weight of loss dynamically?

Call for experts in deep learning.
Hey, I have recently been working on training a model on images for tone mapping, using TensorFlow in Python. To get better results, I focused on using the perceptual loss introduced in this paper by Justin Johnson.
In my implementation, I use all 3 parts of the loss: a feature loss extracted from VGG16; an L2 pixel-level loss between the transferred image and the ground-truth image; and the total variation loss. I sum them up as the loss for backpropagation.
From the function
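$$\hat{y} = \arg\min_{y} \; \lambda_c \, \mathrm{loss}_{\mathrm{content}}(y, y_c) + \lambda_s \, \mathrm{loss}_{\mathrm{style}}(y, y_s) + \lambda_{TV} \, \mathrm{loss}_{\mathrm{TV}}(y)$$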
in the paper, we can see that there are 3 loss weights, the λ's, that balance the terms. The values of the three λ's are presumably fixed throughout training.
My question is: does it make sense to change the λ's dynamically every epoch (or every few epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges quickly in the first several epochs, yet the pixel-level L2 loss converges fairly slowly. So maybe the weight λ should be higher for the content loss, let's say 0.9, and lower for the others. As time passes, the pixel-level loss becomes increasingly important for smoothing the image and minimizing artifacts, so it might be better to raise its weight a bit, just like changing the learning rate over the epochs.
The postdoc who supervises me strongly opposes my idea. He thinks it dynamically changes the training model and could make the training inconsistent.
So, pros and cons: I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, changing the loss weights dynamically should not have much effect and may even have the opposite effect altogether.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following to see whether there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (it is way too time consuming), but in essence, it's best to keep the weights constant and also avoid angering your supervisor :D
If you are running a different ML or DL library, there are optimizers for each; just Google them. It may be best to run these on a cluster overnight, but they usually give you a good enough optimized version of your model.
Hope that helps and good luck!
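For reference, the schedule proposed in the question could be sketched as below. This is only an illustration under assumptions: the function names, the linear interpolation, and all weight values other than the 0.9 mentioned in the question are hypothetical, and nothing here comes from the paper.

```python
def loss_weights(step, total_steps,
                 start=(0.9, 0.05, 0.05), end=(0.5, 0.45, 0.05)):
    """Linearly interpolate (feature, pixel, tv) weights over training.

    `start` and `end` are illustrative values, not recommendations.
    """
    t = min(max(step / total_steps, 0.0), 1.0)  # progress clamped to [0, 1]
    return tuple((1.0 - t) * s + t * e for s, e in zip(start, end))

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(w * l for w, l in zip(weights, losses))
```

Note that once the weights change, the objective itself changes, so total-loss values recorded at different steps are no longer directly comparable. Logging each unweighted term separately keeps the curves interpretable.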

Adding noise to image for deep learning, yes or no?

I've been thinking that adding noise to an image can prevent overfitting and also "increase" the dataset by adding variations to it. I'm only trying to add some random 1s to images of shape (256,256,3) that use uint8 to represent color. I don't think that affects the visual appearance at all (I showed both images with matplotlib and they seem almost the same), and there is only a ~0.01 mean difference in the sum of their values.
But it doesn't seem to bring any advantage. After training for a long time, the model is still not as good as the one trained without noise.
Has anyone tried to use noise for image classification tasks like this? Is it eventually better?
I wouldn't add noise to your data. Some papers employ input deformations during training to increase the robustness and convergence speed of models. However, these deformations are statistically inefficient (not just for images but for any kind of data).
You can read Intriguing properties of neural networks by Szegedy et al. for more details (and refer to references 9 & 13 for papers that use deformations).
If you want to avoid overfitting, you might be interested to read about regularization instead.
Yes, you may add noise to extend your dataset and avoid overfitting your training set, but make sure it is random; otherwise your network will treat this noise as something it should learn (and that's not something you want). I wouldn't reach for this method first, though: I would first rotate and/or flip my samples.
However, your network should perform better than, or at least as well as, your previous network.
The first things I would check are: How do you measure your performance? What was your performance before and after? And did you change anything else?
There are a couple of works that deal with this problem. Because you make the training set harder, the training error will typically be higher; however, your generalization might be better. It has been shown that adding noise can have a stabilizing effect when training Generative Adversarial Networks (adversarial training).
For classification tasks it is not that cut and dried. Not many works have actually dealt with this topic. The closest one, to the best of my knowledge, is this one from Google (https://arxiv.org/pdf/1412.6572.pdf), where they show the limitations of training without noise. They do report a regularization effect, but not actually better results than with other methods.
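A minimal sketch of the kind of augmentation discussed in this thread, assuming NumPy and uint8 images of shape (H, W, 3); the function names and the `max_delta` parameter are hypothetical. Two details matter: the noise is drawn fresh on every call, so the network cannot learn a fixed pattern, and the image is widened before the addition, because adding 1 to a uint8 value of 255 would otherwise wrap around to 0.

```python
import numpy as np

def add_noise(image, rng, max_delta=1):
    """Add fresh integer noise in [-max_delta, max_delta] to a uint8 image."""
    noise = rng.integers(-max_delta, max_delta + 1, size=image.shape)
    # Widen to a signed type first so 255 + 1 does not wrap around to 0.
    noisy = image.astype(np.int16) + noise
    return np.clip(noisy, 0, 255).astype(np.uint8)

def random_flip(image, rng):
    """Flip the image horizontally with probability 0.5."""
    return image[:, ::-1, :] if rng.random() < 0.5 else image
```

In a training loop, both functions would be applied to each image every time it is loaded, for example `random_flip(add_noise(img, rng), rng)` with `rng = np.random.default_rng()`, rather than once when building the dataset.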

deep learning: How do I know my net is not memorizing

I have a convolutional neural network, and my input data are 10,000 images of the same object from different views (angles in 3D around the image). My network converges, but I am not sure whether the network has memorized all the different angles / views or not. Since I only have one object, I cannot really test it with different data.
My training / test plot looks like this (red training, green test):
Since the test curve is lower than the training curve, should I expect that the network has learned all the images by heart, even though I have 10,000 somewhat different images?
First, "memorize" is not a term we apply to the learning process, since it's not exact regurgitation of prior examples.
This is a matter of your experimental process. You get to define the success criteria. Is 95% accuracy good enough for your intended application? What, to you, is good enough performance to declare success?
One way to build a more convincing argument is to make the typical third partition: besides training and test sets, save part of your data for validation. You do the training and testing as you've already done. When the model has converged, you apply it to the validation set to predict results. If that test passes your success criterion, then you have a finished model.
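The three-way partition described above could be sketched as follows, assuming NumPy; the function name and the 70/15/15 fractions are illustrative assumptions. Following the answer's terminology, the third partition is the one set aside and only used once the model has converged.

```python
import numpy as np

def three_way_split(n_samples, fractions=(0.7, 0.15, 0.15), seed=0):
    """Shuffle sample indices and split them into train / test / validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_train = round(fractions[0] * n_samples)
    n_test = round(fractions[1] * n_samples)
    train = idx[:n_train]
    test = idx[n_train:n_train + n_test]
    val = idx[n_train + n_test:]  # held out until the model has converged
    return train, test, val
```

With 10,000 images, `three_way_split(10000)` yields three disjoint index arrays that can be used to select images and labels for each partition.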

How we know when to stop training a model on a pre-trained model?

My apologies if this sounds like a stupid question, but I am quite new to deep learning and Caffe.
How can we determine how many iterations are required to fine-tune a pre-trained model on our own dataset? For example, I am running fcn32 on my own data with 5 classes. When can I stop the fine-tuning process, judging by the loss and accuracy of the training phase?
Many thanks
You shouldn't do it by looking at the loss or accuracy of the training phase. In theory, the training accuracy should keep increasing (which also means the training loss should keep decreasing), because you train the network to decrease the training loss. But a high training accuracy doesn't necessarily mean a high test accuracy; that is what we refer to as the overfitting problem. So what you need to find is the point where the accuracy on the test set (or the validation set, if you have one) stops increasing. You can do this by specifying a relatively large number of iterations at first and then monitoring the test accuracy or test loss: if the test accuracy has not increased (or the loss has not decreased) for N consecutive iterations (or epochs), where N could be 10 or another number you specify, stop the training process.
The best thing to do is to track training and validation accuracy and store snapshots of the weights every k iterations. To compute validation accuracy you need a separate set of held-out data which you do not use for training.
Then, you can stop once the validation accuracy stops increasing or starts decreasing. This is called early stopping in the literature. Keras, for example, provides functionality for this: https://keras.io/callbacks/#earlystopping
Also, it's good practice to plot the above quantities, because it gives you important insights into the training process. See http://cs231n.github.io/neural-networks-3/#accuracy for a great illustration (not specific to early stopping).
Hope this helps
Normally you converge to a specific validation accuracy for your model. In practice you normally stop training if the validation loss has not decreased for x epochs. Depending on your epoch duration, x most commonly varies between 5 and 20.
Edit:
In ML terms, an epoch is one pass over your training dataset. You do not seem to have a validation set. Normally the data is split into training and validation data, so you can see how well your model performs on unseen data and decide which model to keep by looking at this data. You might want to take a look at http://caffe.berkeleyvision.org/gathered/examples/mnist.html to see the usage of a validation set, even though they call it a test set.
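The stopping rule described in these answers can be sketched in plain Python. This is an illustrative sketch, not the Keras or Caffe implementation; the function name and the `patience` parameter are assumptions. It takes the validation losses recorded at each evaluation and returns where training would stop and which snapshot was best.

```python
def early_stopping_point(val_losses, patience=10):
    """Return (stop_index, best_index) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive evaluations.
    """
    best = float("inf")
    best_idx = -1
    waited = 0
    for i, loss in enumerate(val_losses):
        if loss < best:
            best, best_idx, waited = loss, i, 0  # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:
                return i, best_idx
    return len(val_losses) - 1, best_idx  # never triggered: ran to the end
```

The returned `best_idx` is the reason for storing weight snapshots: after stopping, you restore the weights saved at that evaluation rather than keeping the last ones.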

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output of the machine learning process is a model that can predict how long it takes a target human to make a move in a given Sudoku situation.
Same input doesn't always map to same outcome. It takes different times for the human to make a move with the same situation, but my hypothesis is that there's a tendency in the resulting probability distribution. (My educated guess is that it is ~normal.)
I have ideas about the factors that influence the distribution (like the number of empty slots) but would prefer to leave it to the system to figure these patterns out. Please note that I'm not interested in the patterns, just the model.
I can generate sample and test data easily by running sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work because I agree with your intuition about features such as the number of empty slots being important. There are also other obvious features, such as the number of empty slots per row/column that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting that the learning algorithm will infer something similar on its own.
The Monte Carlo method seems like it would work well here, but it would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if it's not in the database you could start to calculate how long it would take to make a move based on some algorithm. Though I know you had specified you wanted machine learning to do this it might be worth segmenting the problem into something a little smaller then building on it.
If you have some guesstimate as to what influences the function (number of empty cells, etc.), try to train a classifier on a vector of features, and not on the 81-cell vector (0/1 or 0..9, it doesn't really matter for my argument).
I think that your claim:
we wouldn't have to necessary know the underlying patterns, the "trained patterns" in a learning system automatically encodes these sometimes quite delicate and subtle patterns inside them -- that's one of their great power
is wrong. You do have to give the network the right domain. For example, when trying to detect an object in an image, working in the pixel domain is pointless. You'll only get results if you first run some feature detection to detect edges, corners, etc.
Theoretically, with enough non-linearity (in NN - enough layers in the network) it can detect such things, but in practice, I have never seen that work, without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 (or a much smaller feature space as I suggest) to R (response time between 0 and Inf) or some discretization of that. So NN and other classifiers can do that.
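To make the feature-based suggestion from the answers concrete, here is a minimal sketch assuming NumPy and a 9x9 board array in which 0 marks an empty cell. The particular features, the function names, and the use of ordinary least squares are illustrative assumptions; the move times would come from your own measurements.

```python
import numpy as np

def board_features(board):
    """Hand-crafted features for a 9x9 board (0 = empty cell)."""
    empty = np.asarray(board) == 0
    return np.array([
        1.0,                      # bias term
        empty.sum(),              # total number of empty cells
        empty.sum(axis=1).min(),  # fewest empty cells in any row
        empty.sum(axis=0).min(),  # fewest empty cells in any column
    ], dtype=float)

def fit_linear(features, times):
    """Least-squares fit of move times on the feature matrix."""
    coef, *_ = np.linalg.lstsq(features, times, rcond=None)
    return coef

def predict(coef, board):
    """Predicted (mean) move time for a single board."""
    return float(board_features(board) @ coef)
```

This gives a single best-guess time per position, as the linear regression answer describes. A crude way to get a distribution would be to pair that prediction with the standard deviation of the training residuals and treat the result as roughly normal, in line with the asker's hypothesis, though that is an assumption to check against the data.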