How to implement fixed parameters in caffe?

Suppose there are parameters in the network that I would like to change manually in pycaffe, rather than have updated automatically by the solver. For example, suppose we would like to penalize dense activations; this can be implemented as an additional loss layer. Over the course of training, we would like to change the strength of this penalty by multiplying the loss by a coefficient that evolves in a pre-specified way. What would be a good way to do this in caffe? Is it possible to specify this in the prototxt definition? In the pycaffe interface?
Update: I suppose setting lr_mult and decay_mult to 0 might be a solution, but it seems like a clumsy one. Maybe a DummyDataLayer providing the parameters as a blob would be a better option, but there is so little documentation that it's quite a struggle to write for someone new to caffe.

Maybe this is a trivial question, but just in case someone else is interested, here is a successful implementation I ended up using.
In the layer proto definition, set lr_mult and decay_mult to 0, which means that we want neither to learn nor to decay the parameters. Use a filler to set the initial values. To change the parameters from Python during training of the network, use a statement like
net.params['name'][index].data[...] = something
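For concreteness, here is a minimal sketch of what this can look like. The layer, blob, and function names below are made up for illustration, and a Scale layer is used only because it is a convenient layer type with a single learnable blob:

layer {
  name: "penalty_scale"          # hypothetical layer name
  type: "Scale"
  bottom: "sparsity_penalty"     # hypothetical blob holding the raw penalty
  top: "weighted_penalty"
  param {
    lr_mult: 0                   # the solver will not update this blob
    decay_mult: 0                # and applies no weight decay to it
  }
  scale_param {
    num_axes: 0                  # a single scalar coefficient
    filler { type: "constant" value: 1.0 }   # initial value of the coefficient
  }
}

Then, from pycaffe during training (assuming an existing solver object and a hypothetical schedule function):

# update the coefficient by hand, e.g. once per iteration
solver.net.params['penalty_scale'][0].data[...] = schedule(iteration)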

Related

is there a way to compress a NN in pytorch

Assume I have a trained torch.nn.Module model, and from now on I only need it for evaluation.
Is PyTorch passing my data through all the layers or is it compressing the model so that it only calculates an equivalent function?
If not, is there a way to do it in order to make the calculation faster and the model lighter in memory terms?
I have been looking on the internet for a similar question and didn't find any suitable answer.
You should do two things during inference: set your model to evaluation mode with model.eval(), and wrap the inference code in torch.no_grad(), which disables gradient computation. This will make your code faster and more memory-efficient.
In practice this will look like
model.eval()
with torch.no_grad():
    # your inference code here
There are many options and it depends on your specific case.
One option is to convert to TorchScript.
Another option is to do quantization on the model.
Third, you could perform knowledge distillation, transferring the knowledge from your existing model to a smaller one.
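As a rough illustration of the first two options (the toy model and input shape below are stand-ins; which layers to quantize and the actual input shape depend on your architecture):

import torch

model = torch.nn.Sequential(torch.nn.Linear(128, 10))   # stand-in for your trained model

# Dynamic quantization: weights of the listed layer types are stored as int8,
# which shrinks the model and can speed up CPU inference.
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# TorchScript: trace the model into a static graph that can be saved and
# reloaded without the original Python class definition.
example_input = torch.randn(1, 128)                     # shape matches the stand-in model
traced = torch.jit.trace(model.eval(), example_input)
traced.save("model_traced.pt")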

LSTM prediction restriction

Is it possible to restrict the prediction of LSTM neural network to a finite set of values? For example if I have the following sequence 1,4,5,3,2,1,3,2, ... and I know that the next value will be in the set {1,2,3,4,5}, is it possible to feed that somehow to the network so it will always output one of these values?
While this is technically possible (you would have to write your own LSTM implementation or extend the existing one, however), it doesn't seem like a good approach towards this problem. If you find yourself in the situation where you're feeding data to the network that you don't want it to process, you should just preprocess your input to only hold relevant data.
Show the network what you want it to see. If you want the network to output specific behavior in certain situations, you can encode that behavior by modifying the labels to reflect this behavior. Finally, note that your use case suggests that you should be working with categorical labels and softmax output, i.e. framing this as a classification instead of a regression problem.
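To make the classification framing concrete, here is a small sketch in PyTorch; the sizes, names, and architecture are illustrative assumptions, not something prescribed by the question. Each allowed value is mapped to a class index, the network outputs logits over those classes, and the argmax is therefore always one of the allowed values:

import torch
import torch.nn as nn

NUM_VALUES = 5  # the allowed set {1, 2, 3, 4, 5}, encoded as classes 0..4

class SeqClassifier(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(NUM_VALUES, 8)
        self.lstm = nn.LSTM(8, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, NUM_VALUES)   # logits over the 5 allowed values

    def forward(self, x):                 # x: (batch, seq_len) of class indices
        out, _ = self.lstm(self.embed(x))
        return self.head(out[:, -1])      # predict the next value from the last time step

model = SeqClassifier()
loss_fn = nn.CrossEntropyLoss()               # softmax is applied inside the loss
seq = torch.tensor([[0, 3, 4, 2, 1, 0, 2]])   # the sequence 1,4,5,3,2,1,3 shifted to 0-based
target = torch.tensor([1])                    # next value 2 -> class index 1
logits = model(seq)
loss = loss_fn(logits, target)                # train on this as usual
prediction = logits.argmax(dim=-1) + 1        # always one of {1, 2, 3, 4, 5}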

deep learning concept - hyperparameter tuning weights RNN/LSTM

When we build a model and train it, the initial weights are randomly initialized, unless specified (seed).
As we know, there are a variety of parameters we can adjust like epochs, optimizers, batch_size, etc to find the "best" model.
The concept I have trouble with is: even if we do find the best model after tuning, the weights will be different on the next run, yielding different models and results. So the model that was best this time might not be the best if we compiled and ran it again with the "best parameters". If we seed the weights for reproducibility, we don't know whether those would be the best weights. On the other hand, if we tune the weights, then the "best parameters" won't be the best parameters anymore? I am stuck in a loop. Is there a general guideline on which parameters to tune first, as opposed to others?
Or is this whole logic flawed somewhere and I am way overthinking?
We initialize weights randomly to break symmetry, so that each node behaves differently from the others.
Depending on the hyperparameters (epochs, batch size, number of iterations, etc.), the weights are updated until training finishes; the final set of updated weights is what we call the model.
The seed controls the randomness of the initialization. If I'm not wrong, a good learning setup (objective function and optimizer) converges to comparable results irrespective of the seed value.
Again, a good model means tuning all the hyperparameters and making sure that the model is not underfitting.
On the other hand, the model shouldn't overfit either.
There is no such thing as the single best set of parameters (weights, biases); we need to keep tuning the model until the results are satisfactory, and data processing is the main part of the work.
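As a side note on the seeding part: reproducibility across runs is usually handled by fixing the random seeds rather than by trying to pick the weights yourself. A minimal sketch, assuming a PyTorch/NumPy setup for concreteness (Keras/TensorFlow have analogous utilities):

import random
import numpy as np
import torch

SEED = 42                  # arbitrary; it only needs to be the same across runs
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)    # fixes weight initialization, dropout masks, etc. on the torch side

With the seeds fixed, two runs with the same hyperparameters start from the same weights, so comparisons between hyperparameter settings are at least made on equal footing.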

How to change the learning rate of specific layer from the solver prototxt (CAFFE)

Does anybody know how to change the learning rate lr_mult of a specific layer in CAFFE from the solver prototxt? I know there's base_lr; however, I would like to target the rate of a specific layer, and to do it from the solver instead of the network prototxt.
Thanks!
Every layer that requires learning (e.g. convolutional, fully-connected, etc.) has a specific lr_mult parameter that can be controlled individually for that layer. lr_mult is a "multiplier on the global learning rate for this parameter."
Simply define or change the lr_mult for your layer in train_val.prototxt.
This is useful for fine-tuning, where you might want to have increased learning rate only for the new layer.
For more info check the caffe fine-tuning tutorial. (Note: it is a bit outdated and the deprecated term blobs_lr is used there instead of lr_mult)
EDIT: To the best of my knowledge it is not possible to define a layer-specific learning rate from solver.prototxt. Hence, assuming the solver.prototxt restriction is not strict, setting lr_mult in the network prototxt as described above achieves the same result.
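For reference, a minimal sketch of what the per-layer setting looks like in the network prototxt; the layer and blob names are hypothetical, and the common fine-tuning convention of a higher multiplier for the new layer is used here only as an example:

layer {
  name: "my_new_fc"                        # hypothetical new layer being fine-tuned
  type: "InnerProduct"
  bottom: "pool5"                          # hypothetical bottom blob
  top: "my_new_fc"
  param { lr_mult: 10  decay_mult: 1 }     # weights: 10x the global base_lr
  param { lr_mult: 20  decay_mult: 0 }     # bias: 20x the global base_lr, no weight decay
  inner_product_param {
    num_output: 128
    weight_filler { type: "xavier" }
  }
}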

What kind of learning algorithm would you use to build a model of how long it takes a human to solve a given Sudoku situation?

I don't have much experience in machine learning, pattern recognition, data mining, etc. and in their underlying theory and systems.
I would like to develop an artificial model of the time it takes a human to make a move in a given Sudoku puzzle.
So what I'm looking for as an output from the machine learning process is a model that can predict how long it takes a target human to make a move in a given Sudoku situation.
The same input doesn't always map to the same outcome: the human takes different amounts of time to make a move in the same situation, but my hypothesis is that there is a tendency in the resulting probability distribution. (My educated guess is that it is roughly normal.)
I have ideas about the factors that influence the distribution (like the number of empty slots) but would preferably leave it to the system to figure these patterns out. Please note that I'm not interested in the patterns, just the model.
I can generate sample and test data easily by running sudoku puzzles and measuring the times it takes to make the moves.
What kind of learning algorithm would you suggest to use for this?
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
If I understand this correctly you have an input vector of length 81, which contains 1 if the square is filled in and 0 otherwise. You want to learn a function which returns a probability distribution which models the response time of a human to that board position.
My first response would be that this is a regression problem and you should try straightforward linear regression. This will not provide you with a distribution of response times, but a single 'best-guess' response time.
I'm not clear on why you want to model a distribution of response times. However, if you really do want to output a distribution, then it sounds like you want to look at Bayesian methods. I'm not really an expert on Bayesian inference, so I can't help you much further here.
However, I don't really think your approach is going to work because I agree with your intuition about features such as the number of empty slots being important. There are also other obvious features, such as the number of empty slots per row/column that are likely to be important. Explicitly putting these features in your representation will probably be much more successful than expecting that the learning algorithm will infer something similar on its own.
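To make the regression suggestion concrete, here is a small sketch; the particular features, the synthetic placeholder data, and all names are illustrative assumptions. The idea is to fit an ordinary linear regression on a short feature vector per board rather than on the raw 81 cells:

import numpy as np
from sklearn.linear_model import LinearRegression

def board_features(board):
    # board: length-81 array, 0 for an empty cell, 1-9 for a filled cell
    grid = np.asarray(board).reshape(9, 9)
    empty = (grid == 0)
    return [empty.sum(), empty.sum(axis=1).max(), empty.sum(axis=0).max()]

# placeholder data standing in for your recorded positions and measured response times
rng = np.random.default_rng(0)
boards = [rng.integers(0, 10, size=81) for _ in range(200)]
times = rng.uniform(1.0, 30.0, size=200)                     # seconds

X = np.array([board_features(b) for b in boards])
y = np.array(times)
model = LinearRegression().fit(X, y)
print(model.predict([board_features(boards[0])]))            # 'best-guess' response time for one board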
The Monte Carlo method seems like it would work well here, but it would require a stack of solutions the size of the moon to really do it. And it wouldn't give you the time per person, just the time on average.
My understanding of it, tenuous as it is, is that you have a database with a board position and the time it took a human to make the next move. At the very least you have a starting point for most moves. Even if a position is not in the database, you could start to calculate how long it would take to make a move based on some algorithm. Though I know you specified you wanted machine learning to do this, it might be worth segmenting the problem into something a little smaller and then building on it.
If you have some guesstimate as to what influences the function (# of empty cell, etc), try to train a classifier on a vector of features, and not on the 81 cells vector (0/1 or 0..9, doesn't really matter for my argument).
I think that your claim:
we wouldn't have to necessarily know the underlying patterns, the "trained patterns" in a learning system automatically encode these sometimes quite delicate and subtle patterns inside them -- that's one of their great powers
is wrong. You do have to give the network the right domain. For example, when trying to detect objects in an image, working in the pixel domain is pointless; you'll only get results if you first run some feature detection to find edges, corners, etc.
Theoretically, with enough non-linearity (in a NN, enough layers in the network) it can detect such things, but in practice I have never seen that work without giving the classifier the right features to work with.
I was thinking NNs, but I'm not sure if they can have the desired property of giving weighted random outcomes for the same input.
You're just trying to learn a function from 2^81 or 10^81 (or a much smaller feature space, as I suggest) to R (response time between 0 and Inf), or some discretization of that. So NNs and other classifiers can do that.