When we build a model and train it, the initial weights are randomly initialized unless we fix them with a seed.
As we know, there are a variety of hyperparameters we can adjust, like epochs, optimizers, batch_size, etc., to find the "best" model.
The concept I have trouble with is this: even if we do find the best model after tuning, the weights will be different on the next run, yielding a different model and different results. So the best model from this run might not be the best if we compiled and ran it again with the "best parameters". If we fix the seed for reproducibility, we don't know whether those initial weights are the best ones. On the other hand, if we tune the initial weights, then the "best parameters" won't be the best parameters anymore? I am stuck in a loop. Is there a general guideline on which parameters to tune first?
Or is this whole logic flawed somewhere and I am way overthinking?
We initialize weights randomly so that each node behaves differently from the others (symmetry breaking).
The weights are then updated for as long as training runs, which depends on hyperparameters such as the number of epochs and the batch size. In the end, the updated weights are what we call the model.
The seed controls the randomness of the initialization. If I'm not wrong, a good learning algorithm (objective function and optimizer) converges irrespective of the seed value.
Again, getting a good model means tuning all the hyperparameters and making sure that the model is not underfitting.
On the other hand, the model shouldn't overfit either.
There is no such thing as the best parameters (weights, biases); we need to keep tuning the model until the results are satisfactory, and most of the work lies in data processing.
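One practical way to step out of the loop described in the question is to score each hyperparameter setting over several seeds instead of a single run, and compare the mean and spread. The following is only a minimal sketch under stated assumptions: the toy linear-regression task and the helper names train_once and evaluate_config are made up for illustration and stand in for a real model and training loop.

```python
import random
import statistics

def train_once(learning_rate, epochs, seed):
    """Fit y = w*x + b by SGD from a random start; return the final mean squared error."""
    rng = random.Random(seed)  # the seed controls both initialization and shuffling
    data = [(x / 10.0, 2.0 * (x / 10.0) + 1.0) for x in range(-10, 11)]
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y
            w -= learning_rate * err * x
            b -= learning_rate * err
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

def evaluate_config(learning_rate, epochs, seeds=(0, 1, 2, 3, 4)):
    """Score one hyperparameter setting by its mean and spread over several seeds."""
    losses = [train_once(learning_rate, epochs, s) for s in seeds]
    return statistics.mean(losses), statistics.stdev(losses)

for lr in (0.01, 0.1):
    mean, spread = evaluate_config(lr, epochs=20)
    print(f"lr={lr}: mean MSE={mean:.6f}, std={spread:.6f}")
```

If one setting is better on average and the spread across seeds is small compared with the gap between settings, the choice of seed is not what makes it the "best".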
I am a student currently studying deep learning by myself. I would like to ask for clarification regarding transfer learning.
Take MobileNetV2 as an example (https://keras.io/api/applications/mobilenet/#mobilenetv2-function): if the weights parameter is set to None, then I am not doing transfer learning, as the weights are randomly initialized. If I would like to do transfer learning, then I should set the weights parameter to "imagenet". Is this understanding correct?
Clarification and explanation regarding deep learning
Yes, when you initialize the weights with random values, you are just using the architecture and training the model from scratch. The goal of transfer learning is to reuse the knowledge previously gained by another trained model, either to get better results or to use fewer computational resources.
There are different ways to use transfer learning:
You can freeze the learned weights of the base model, replace the last layer of the model based on your problem, and train just that last layer.
You can start with the learned weights and fine-tune them (let them change during training). Many people do that because it sometimes makes training faster and gives better results, since the weights already contain so much information.
You can use the first layers to extract basic features like colors, edges, circles... and add your desired layers after them. This way, you can spend your resources on learning high-level features.
There are more cases, but I hope this gives you an idea.
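The first option (frozen base, new trainable last layer) can be sketched without any framework. This is a toy illustration under loud assumptions: BASE_WEIGHTS is a hypothetical stand-in for a pretrained network such as MobileNetV2 loaded with weights="imagenet", and the data and the function names are invented for the example.

```python
import math
import random

# Stand-in for a pretrained base: fixed weights that are never updated (frozen).
BASE_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical "learned" filters

def base_features(x):
    """Frozen feature extractor: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in BASE_WEIGHTS]

def train_head(samples, epochs=200, lr=0.5, seed=0):
    """Train only a new logistic-regression head on top of the frozen features."""
    rng = random.Random(seed)
    head = [rng.uniform(-0.1, 0.1) for _ in BASE_WEIGHTS]  # the head starts random
    bias = 0.0
    for _ in range(epochs):
        for x, label in samples:
            feats = base_features(x)  # the base is only read, never modified
            z = sum(h * f for h, f in zip(head, feats)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - label  # gradient of the log loss with respect to z
            head = [h - lr * grad * f for h, f in zip(head, feats)]
            bias -= lr * grad
    return head, bias

def predict(head, bias, x):
    """Return 1 if the head's score on the frozen features is positive, else 0."""
    z = sum(h * f for h, f in zip(head, base_features(x))) + bias
    return 1 if z > 0 else 0

samples = [([0.0, 0.0], 0), ([1.0, 1.0], 1), ([0.2, 0.1], 0), ([0.9, 0.8], 1)]
head, bias = train_head(samples)
print([predict(head, bias, x) for x, _ in samples])
```

Fine-tuning, the second option, would differ only in that the base weights are also updated, usually with a smaller learning rate.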
When using the reinforcement learning model DDPG, the input is sequence data with a high-dimensional (21-dimensional) state and a low-dimensional (1-dimensional) action. Does this have any negative impact on training the model? If so, how can it be solved?
In general, in any machine learning scenario, dimensionality per se is not a problem; it is mostly a matter of how much variability there is in the input data. Of course, higher-dimensional data can have much higher variability than lower-dimensional data.
Even considering this, the problem can "easily" be solved by feeding more data to the ML algorithm and increasing the complexity that it is allowed to represent (i.e. more nodes and/or layers in a neural network).
In RL, this is even less of a problem because you don't really have a restriction on how much data you actually have. You can always run your agent some more on the environment to get more sample trajectories to train on. The only issue you might find here is that your computing time grows a lot (depending on how much more you need to train on the environment for this problem).
I'm pretty new to deep learning and couldn't find or figure out what backend weights are, such as
full_yolo_backend.h5
squeezenet_backend.h5
From what I have found and experimented with, these backend weights differ fundamentally from the full model architectures, for example:
the yolov2 model has 40+ layers but the backend has only 20+ layers (?)
you can build on top of the backend model with your own networks (?)
using backend models tend to yield poorer results (?)
I was hoping to seek some explanation on backend weights vs actual models for learning purposes. Thank you so much!
I'm not sure which implementation you are using, but in many applications you can consider a deep model as a feature extractor whose output is more or less task-agnostic, followed by a number of task-specific heads.
The choice of backend depends on your specific constraints in terms of tradeoff between accuracy and computational complexity. Examples of classical but time-consuming choices for backends are resnet-101, resnet-50 or VGG that can be coupled with FPN (feature pyramid networks) to yield multiscale features. However, if speed is your main concern then you can use smaller backends such as different MobileNet architectures or even the vanilla networks such as the ones used in the original Yolov1/v2 papers (tinyYolo is an extreme case).
Once you have chosen your backend (you can use a pretrained one), you can load its weights (that is what your *.h5 files are). On top of that, you add a small head that carries out the tasks you need: this can be classification, bbox regression, or, as in Mask R-CNN, foreground/background segmentation. For Yolov2, you can add just a few layers, for example 3 convolutional layers (with non-linearities of course), that output a tensor of size
BxC1xC2xAxP
#B==batch size
#C1==number of vertical cells
#C2==number of horizontal cells
#A==number of anchors
#P==number of parameters per anchor (i.e. bbox parameters, class prediction, confidence)
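As a quick sanity check of that output size, here is a small sketch. It assumes the usual YOLOv2 convention that each anchor predicts 4 box parameters, 1 confidence score and one score per class; the function name yolo_head_output_shape is made up for the example.

```python
def yolo_head_output_shape(batch_size, grid_rows, grid_cols, num_anchors, num_classes):
    """Shape (B, C1, C2, A, P) of a YOLOv2-style head output.

    Assumes P = 4 box parameters + 1 confidence score + num_classes class scores.
    """
    params_per_anchor = 4 + 1 + num_classes
    return (batch_size, grid_rows, grid_cols, num_anchors, params_per_anchor)

# Example: a 13x13 grid, 5 anchors and 20 classes.
shape = yolo_head_output_shape(batch_size=8, grid_rows=13, grid_cols=13,
                               num_anchors=5, num_classes=20)
print(shape)                # (8, 13, 13, 5, 25)
print(shape[3] * shape[4])  # 125 channels in the last convolutional layer
```

In practice the last convolution outputs A*P channels per cell, which are then reshaped into the separate A and P axes.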
Then, you can just save/load the weights of this head separately. When you are happy with your results though, training jointly (end-to-end) will usually give you a small boost in accuracy.
Finally, to come back to your last question, I assume that you are getting poor results with the backends because you are only loading the backend weights but not the weights of the heads. Another possibility is that you are using a head trained with backend X but have switched the backend to Y. In that case, since the head expects different features, it's natural to see a drop in performance.
Call for experts in deep learning.
Hey, I have recently been working on training a tone-mapping model on images using TensorFlow in Python. To get better results, I focused on the perceptual loss introduced in this paper by Justin Johnson.
In my implementation, I use all 3 parts of the loss: a feature loss extracted from VGG16; an L2 pixel-level loss between the transferred image and the ground-truth image; and the total variation loss. I sum them up as the loss for backpropagation.
From the function
$$\hat{y} = \arg\min_{y} \; \lambda_c \, \mathrm{loss}_{content}(y, y_c) + \lambda_s \, \mathrm{loss}_{style}(y, y_s) + \lambda_{TV} \, \mathrm{loss}_{TV}(y)$$
in the paper, we can see that there are 3 weights on the losses, the λ's, to balance them. The values of the three λ's are probably fixed throughout training.
My question is: does it make sense to dynamically change the λ's every epoch (or every few epochs) to adjust the importance of these losses?
For instance, the perceptual loss converges quickly in the first several epochs, yet the pixel-level L2 loss converges fairly slowly. So maybe the weight λ should be higher for the content loss, let's say 0.9, but lower for the others. As time passes, the pixel-level loss becomes increasingly important for smoothing the image and minimizing artifacts, so it might be better to raise its weight a bit. This is just like changing the learning rate across epochs.
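To make the idea concrete, here is a minimal framework-free sketch of such a schedule. The linear interpolation, the start and end values (apart from the 0.9 mentioned above) and the names scheduled_weights and total_loss are all assumptions chosen for illustration, not something taken from the paper.

```python
def scheduled_weights(epoch, num_epochs, start, end):
    """Linearly interpolate each loss weight from its start value to its end value."""
    t = epoch / max(1, num_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return {name: (1.0 - t) * start[name] + t * end[name] for name in start}

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical schedule: the feature loss dominates early, the pixel loss gains weight later.
start = {"feature": 0.9, "pixel": 0.05, "tv": 0.05}
end = {"feature": 0.5, "pixel": 0.45, "tv": 0.05}

for epoch in range(5):
    weights = scheduled_weights(epoch, 5, start, end)
    print(epoch, {name: round(value, 3) for name, value in weights.items()})
```

Note that once the weights change, the summed loss is no longer directly comparable from one epoch to the next, so each term would need to be tracked separately.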
The postdoc who supervises me strongly opposes my idea. He thinks it amounts to dynamically changing the model being trained and could make the training inconsistent.
So, pros and cons? I need some ideas...
Thanks!
It's hard to answer this without knowing more about the data you're using, but in short, dynamically changing the loss weights is unlikely to have much effect and may even have the opposite of the intended effect.
If you are using Keras, you could simply run a hyperparameter tuner similar to the following in order to see if there is any effect (change the loss accordingly):
https://towardsdatascience.com/hyperparameter-optimization-with-keras-b82e6364ca53
I've only done this on smaller models (it is way too time consuming otherwise), but in essence, it's best to keep the weights constant and also avoid angering your supervisor :D
If you are using a different ML or DL library, there are hyperparameter optimizers for each; just search for them. It may be best to run these on a cluster overnight, but they usually give you a good enough optimized version of your model.
Hope that helps and good luck!
I have a dataset of around 6K chemical formulas which I am preprocessing via Keras' tokenization to perform binary classification. I am currently using a 1D convolutional neural network with dropouts and am obtaining an accuracy of 82% and validation accuracy of 80% after only two epochs. No matter what I try, the model just plateaus there and doesn't seem to be improving at all. Those same exact accuracies are reached with a vanilla LSTM too. What else can I try to improve my accuracies? Losses only have a difference of 0.04... Anyone have any ideas? Both models use an embedding layer and changing the output dimension isn't having an effect either.
According to your question, I believe your model has high bias and low variance (see this link for further details). Thus, your model is not fitting your data very well, i.e. it is underfitting. So, I suggest 3 things:
Train your model a little longer: I believe two epochs are too few to give your model a chance to learn the patterns in the data. Try lowering the learning rate and increasing the number of epochs.
Try a different architecture: you may change the number of convolutions, filters and layers. You can also use different activation functions and other layers like max pooling.
Do an error analysis: once you have finished training, apply your model to the test set and look at the errors. How many false positives and false negatives do you have? Is your model better at classifying one class than the other? Can you see a pattern in the errors that may be related to your data?
Finally, if none of these suggestions help, you may also try to increase the number of features, if possible.
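The error analysis suggested above can be started with a few lines of plain Python. This is a sketch with made-up labels and predictions; confusion_counts and per_class_recall are hypothetical helper names, and label 1 is assumed to be the positive class.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    for truth, pred in zip(y_true, y_pred):
        if truth == 1:
            counts["tp" if pred == 1 else "fn"] += 1
        else:
            counts["fp" if pred == 1 else "tn"] += 1
    return counts

def per_class_recall(counts):
    """Fraction of each class that the model classifies correctly."""
    positives = counts["tp"] + counts["fn"]
    negatives = counts["tn"] + counts["fp"]
    return {
        "positive": counts["tp"] / positives if positives else 0.0,
        "negative": counts["tn"] / negatives if negatives else 0.0,
    }

# Made-up test labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(dict(counts))              # {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 2}
print(per_class_recall(counts))  # {'positive': 0.5, 'negative': 0.75}
```

A large gap between the two recall values, as in this toy case, would suggest the model handles one class much better than the other.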