In Caffe's solver.prototxt, if I set test_iter * test_batchsize = val_imgs_number, is the Accuracy layer's output the accuracy over the whole validation set? [duplicate]

On Caffe, I am trying to implement a Fully Convolutional Network for semantic segmentation. I was wondering if there is a specific strategy to set up your 'solver.prototxt' values for the following hyper-parameters:
test_iter
test_interval
iter_size
max_iter
Does it depend on the number of images you have for your training set? If so, how?

In order to set these values in a meaningful manner, you need to have a few more bits of information regarding your data:
1. Training set size: the total number of training examples you have; let's call this quantity T.
2. Training batch size: the number of training examples processed together in a single batch. This is usually set by the input data layer in 'train_val.prototxt'; for example, in this file the train batch size is set to 256. Let's denote this quantity by tb.
3. Validation set size: the total number of examples you set aside for validating your model; let's denote this by V.
4. Validation batch size: the value set in batch_size for the TEST phase. In this example it is set to 50. Let's call this vb.
Now, during training, you would like to get an unbiased estimate of the performance of your net every once in a while. To do so, you run your net on the validation set for test_iter iterations. To cover the entire validation set you need test_iter = V/vb.
How often would you like to get this estimate? It's really up to you. If you have a very large validation set and a slow net, validating too often will make the training process too long. On the other hand, not validating often enough may prevent you from noticing whether and when your training process fails to converge. test_interval determines how often you validate: usually for large nets you set test_interval on the order of 5K; for smaller and faster nets you may choose lower values. Again, it's all up to you.
In order to cover the entire training set (completing an "epoch") you need to run T/tb iterations. Usually one trains for several epochs, thus max_iter=#epochs*T/tb.
Regarding iter_size: this allows you to average gradients over several training mini-batches; see this thread for more information.
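For concreteness, here is a minimal sketch of turning these quantities into solver values; the numbers chosen for T, tb, V, vb and the number of epochs are illustrative assumptions, not values from the question:
import math

T = 50000        # training set size (illustrative)
tb = 256         # train batch size from train_val.prototxt (TRAIN phase)
V = 10000        # validation set size (illustrative)
vb = 50          # batch_size of the TEST phase data layer
epochs = 50      # how many passes over the training set you want

test_iter = math.ceil(V / vb)          # covers the whole validation set: 200
iters_per_epoch = math.ceil(T / tb)    # one epoch: 196 iterations
max_iter = epochs * iters_per_epoch    # 9800 iterations in total
test_interval = iters_per_epoch        # e.g. validate once per epoch

print("test_iter:", test_iter)
print("test_interval:", test_interval)
print("max_iter:", max_iter)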

Related

How does the number of Gibbs sampling iterations impact Latent Dirichlet Allocation?

The documentation of MALLET mentions the following:
--num-iterations [NUMBER]
The number of sampling iterations should be a trade off between the time taken to complete sampling and the quality of the topic model.
MALLET provides furthermore an example:
// Run the model for 50 iterations and stop (this is for testing only,
// for real applications, use 1000 to 2000 iterations)
model.setNumIterations(50);
It is obvious that too few iterations lead to bad topic models.
However, does increasing the number of Gibbs sampling iterations necessarily benefit the quality of the topic model (measured by perplexity, topic coherence or on a downstream task)?
Or is it possible that the model quality decreases with the --num-iterations set to a too high value?
On a personal project, averaged over 10-fold cross-validation, increasing the number of iterations from 100 to 1000 did not affect the average accuracy (measured as Mean Reciprocal Rank) on a downstream task. However, within the cross-validation splits the performance varied significantly, even though the random seed was fixed and all other parameters were kept the same. What background knowledge about Gibbs sampling am I missing to explain this behavior?
I am using a symmetric prior for alpha and beta without hyperparameter optimization and the parallelized LDA implementation provided by MALLET.
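For reference, here is a minimal sketch of how one might compare model quality across iteration counts. It uses gensim's LdaModel, which is based on variational inference rather than Gibbs sampling, purely as an illustration with a toy corpus, so it is only a rough analogue of MALLET's --num-iterations:
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus, purely illustrative.
docs = [["apple", "banana", "fruit"],
        ["dog", "cat", "pet"],
        ["fruit", "salad", "apple", "banana"],
        ["cat", "dog", "leash", "pet"]]
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

for n_iter in (50, 200, 1000):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
                   iterations=n_iter, passes=5, random_state=42)
    # log_perplexity returns a per-word likelihood bound (higher is better);
    # on real data you would evaluate on held-out documents instead.
    print(n_iter, lda.log_perplexity(corpus))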
The 1000 iteration setting is designed to be a safe number for most collection sizes, and also to communicate "this is a large, round number, so don't think it's very precise". It's likely that smaller numbers will be fine. I once ran a model for 1000000 iterations, and fully half the token assignments never changed from the 1000 iteration model.
Could you be more specific about the cross validation results? Was it that different folds had different MRRs, which were individually stable over iteration counts? Or that individual fold MRRs varied by iteration count, but they balanced out in the overall mean? It's not unusual for different folds to have different "difficulty". Fixing the random seed also wouldn't make a difference if the data is different.

Neural Network : Epoch and Batch Size

I am trying to train a neural network to classify words into different categories.
I notice two things:
When I use a smaller batch_size (like 8, 16, 32) the loss does not decrease, but rather varies sporadically. When I use a larger batch_size (like 128, 256), the loss goes down, but very slowly.
More importantly, when I use a larger EPOCH value, my model does a good job of reducing the loss. However, I'm using a really large value (EPOCHS = 10000).
Question:
How to get the optimal EPOCH and batch_size values?
There is no way to decide on these values based on fixed rules. Unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.
When you train a network, you calculate a gradient that would reduce the loss, and to do that you need to backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, so that the resulting gradient takes every sample into account. In practice, this is not possible because of the computational cost: every update would require a forward pass over all of your samples. That case corresponds to batch_size = N, where N is the total number of data points you have.
Therefore, we use a small batch_size as an approximation: instead of considering all the samples, we compute the gradient on a small subset of them, at the cost of losing some information about the true gradient.
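To make this concrete, here is a minimal numpy sketch (the data and model are illustrative, not taken from the question) comparing the full-batch gradient with a mini-batch estimate for a simple squared-error loss:
import numpy as np

rng = np.random.default_rng(0)
N = 10000
X = rng.normal(size=(N, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)
w = np.zeros(3)

def grad(Xb, yb, w):
    # gradient of the mean squared error 0.5 * mean((Xb @ w - yb) ** 2) w.r.t. w
    return Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)                       # batch_size = N: exact but expensive
batch = rng.choice(N, size=32, replace=False)   # batch_size = 32: cheap but noisy
mini_grad = grad(X[batch], y[batch], w)
print(full_grad)
print(mini_grad)                                # close to full_grad, but not identical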
Rule of thumb:
Smaller batch sizes give noisy gradients, but they converge faster because you get more updates per epoch. If your batch size is 1 you will have N updates per epoch; if it is N, you will only have 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly.
That is the reason why for smaller batch sizes, you observe varying losses because the gradient is noisy. And for larger batch sizes, your gradient is informative but you need a lot of epochs since you update less frequently.
The ideal batch size is one that gives you informative gradients but is also small enough that you can train the network efficiently. In practice, you can only find it by experimenting.
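As a starting point, here is a minimal Keras sketch (the data, model, and sizes are illustrative assumptions): rather than hard-coding EPOCHS = 10000, set a generous upper bound and let early stopping on the validation loss decide when to stop, while trying a few batch sizes.
import numpy as np
import tensorflow as tf

x = np.random.rand(5000, 20).astype("float32")   # placeholder features
y = np.random.randint(0, 4, size=5000)           # placeholder class labels (4 categories)

for batch_size in (16, 64, 256):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                                  restore_best_weights=True)
    history = model.fit(x, y, batch_size=batch_size,
                        epochs=1000,              # upper bound; early stopping decides
                        validation_split=0.2,
                        callbacks=[early_stop], verbose=0)
    print(batch_size, len(history.history["loss"]),
          min(history.history["val_loss"]))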

How to deal with large amount of data in caffe HDF5 and how to set test_iter? [duplicate]

Recurrent NNs: what's the point of parameter sharing? Doesn't padding do the trick anyway?

The following is how I understand the point of parameter sharing in RNNs:
In regular feed-forward neural networks, every input unit is assigned an individual parameter, which means that the number of input units (features) corresponds to the number of parameters to learn. In processing e.g. image data, the number of input units is the same over all training examples (usually constant pixel size * pixel size * rgb frames).
However, sequential input data like sentences can come in highly varying lengths, which means that the number of parameters will not be the same depending on which example sentence is processed. That is why parameter sharing is necessary for efficiently processing sequential data: it makes sure that the model always has the same input size regardless of the sequence length, as it is specified in terms of transition from one state to another. It is thus possible to use the same transition function with the same weights (input to hidden weights, hidden to output weights, hidden to hidden weights) at every time step. The big advantage is that it allows generalization to sequence lengths that did not appear in the training set.
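For concreteness, here is a minimal numpy sketch of that shared transition function (the sizes and names are illustrative): the same weight matrices are reused at every time step, so the parameter count does not depend on the sequence length.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 5, 8, 3
W_xh = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_hh = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden weights
W_hy = rng.normal(size=(n_out, n_hidden))     # hidden-to-output weights
b_h = np.zeros(n_hidden)
b_y = np.zeros(n_out)

def run_rnn(xs):
    h = np.zeros(n_hidden)
    ys = []
    for x_t in xs:                                 # works for any sequence length
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)   # same weights at every time step
        ys.append(W_hy @ h + b_y)
    return ys

short_seq = rng.normal(size=(4, n_in))
long_seq = rng.normal(size=(30, n_in))
print(len(run_rnn(short_seq)), len(run_rnn(long_seq)))  # 4 30, with the same parameters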
My questions are:
Is my understanding of RNNs, as summarized above, correct?
In the actual code example in Keras I looked at for LSTMs, they padded the sentences to equal lengths beforehand. Doesn't doing so defeat the whole purpose of parameter sharing in RNNs?
Parameter Sharing
Being able to efficiently process sequences of varying length is not the only advantage of parameter sharing. As you said, you can achieve that with padding. The main purpose of parameter sharing is to reduce the number of parameters that the model has to learn. This is the whole purpose of using an RNN.
If you learned a different network for each time step and fed the output of the first model to the second, and so on, you would end up with a regular feed-forward network. For 20 time steps, you would have 20 models to learn. In convolutional nets, parameters are shared by the convolutional filters because we can assume that similar interesting patterns occur in different regions of the picture (for example a simple edge). This drastically reduces the number of parameters we have to learn. Analogously, in sequence learning we can often assume that similar patterns occur at different time steps. Compare 'Yesterday I ate an apple' and 'I ate an apple yesterday'. These two sentences mean the same thing, but the 'I ate an apple' part occurs at different time steps. By sharing parameters, you only have to learn what that part means once. Otherwise, you'd have to learn it for every time step at which it could occur in your model.
There is a drawback to sharing the parameters. Because our model applies the same transformation to the input at every time step, it now has to learn a transformation that makes sense for all time steps. So, it has to remember what word came at which time step, i.e. 'chocolate milk' should not lead to the same hidden and memory state as 'milk chocolate'. But this drawback is small compared to using a large feed-forward network.
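A minimal Keras sketch (the layer sizes are illustrative) makes this reduction visible: the recurrent layer's trainable parameter count depends only on the input and hidden sizes, not on how many time steps it is unrolled over.
import tensorflow as tf

for timesteps in (10, 100):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(timesteps, 32)),  # (sequence length, features per step)
        tf.keras.layers.SimpleRNN(64),
        tf.keras.layers.Dense(5, activation="softmax"),
    ])
    print(timesteps, model.count_params())  # identical count for both sequence lengths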
Padding
As for padding the sequences: the main purpose is not directly to let the model predict sequences of varying length. Like you said, that can be achieved by parameter sharing. Padding is used for efficient training, specifically to keep the number of computational graphs needed during training low. Without padding, we have two options for training:
We unroll the model for each training sample. So, when we have a sequence of length 7, we unroll the model to 7 time steps, feed the sequence, do back-propagation through the 7 time steps, and update the parameters. This seems intuitive in theory, but in practice it is inefficient because TensorFlow's computational graphs don't allow recurrence; they are feed-forward.
The other option is to create the computational graphs before starting training. We let them share the same weights and create one computational graph for every sequence length in our training data. But when our dataset has 30 different sequence lengths, this means 30 different graphs during training, so for large models this is not feasible.
This is why we need padding. We pad all sequences to the same length and then only need to construct one computational graph before starting training. When you have both very short and very long sequences (5 and 100, for example), you can use bucketing and padding. This means you pad the sequences to different bucket lengths, for example [5, 20, 50, 100], and then create a computational graph for each bucket. The advantage is that you don't have to pad a sequence of length 5 up to length 100, which would waste a lot of time on "learning" the 95 padding tokens in there.
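To tie this back to the Keras example, here is a minimal sketch (the token ids, vocabulary size, and layer sizes are illustrative assumptions): sequences are padded to one common length and the padding is masked, so the LSTM still shares its weights across time steps while ignoring the pad positions.
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[5, 8, 2], [9, 1, 4, 7, 3, 6], [2, 2]]   # token ids of varying length
padded = pad_sequences(sequences, padding="post")     # pad with 0 to one common length
print(padded.shape)                                   # (3, 6)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16, mask_zero=True),  # 0 = pad token
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")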

Difference between training and testing phase in caffe

this might seem like a silly question, but I am trying to understand to what extent the testing phase in caffe is important for good results. Of course the training phase is important, but is the testing phase simply to test out how much loss is obtained periodically on a set that is not trained? If this is the case, does the size of my test set really matter? Does testing even matter at all? I ask because I currently have some serious overfit problems. If I have a large dataset (>50 000 images), how should I go about splitting them between test and train?
Caffe never uses the results on the test set during training to modify any parameters or fix issues like overfitting.
The purpose of a validation set (the test set used during training) is to let us see whether the model overfits the data, by looking at the accuracy or loss values, plotting them, or inspecting the outputs.
For example, if the loss on the training set keeps decreasing at every iteration while the loss on the test set keeps increasing, this is a clear case of the model overfitting the training set. For such conclusions to be valid, the images selected for the test set must not overlap with those of the training set. It is ideal to keep roughly a 1:10 test-to-train ratio of image counts. If the test set were a subset of the training set, its loss would also decrease and we might not detect the overfitting behaviour of the model.
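As a sketch of the split itself (the file names and list-file format are assumptions, following the style of Caffe's ImageData layer), you could shuffle the image list once and hold out roughly 10% as a disjoint test set:
import random

# Each line of 'all_images.txt' is assumed to be "<image_path> <label>",
# the list-file format read by Caffe's ImageData layer.
with open("all_images.txt") as f:
    lines = f.read().splitlines()

random.seed(42)
random.shuffle(lines)

n_test = len(lines) // 10                  # hold out ~10%, never used for training
test_lines, train_lines = lines[:n_test], lines[n_test:]

with open("train.txt", "w") as f:
    f.write("\n".join(train_lines) + "\n")
with open("test.txt", "w") as f:
    f.write("\n".join(test_lines) + "\n")

print(len(train_lines), len(test_lines))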