LSTM prediction restriction - deep-learning

Is it possible to restrict the prediction of an LSTM neural network to a finite set of values? For example, if I have the sequence 1,4,5,3,2,1,3,2, ... and I know that the next value will be in the set {1,2,3,4,5}, is it possible to feed that to the network somehow so that it always outputs one of these values?

While this is technically possible (you would have to write your own LSTM implementation or extend an existing one, however), it doesn't seem like a good approach to this problem. If you find yourself feeding the network data that you don't want it to process, you should preprocess your input so that it only holds relevant data.
Show the network what you want it to see. If you want the network to output specific behavior in certain situations, encode that behavior in the labels. Finally, note that your use case suggests you should be working with categorical labels and a softmax output, i.e. framing this as a classification problem instead of a regression problem.
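To make the classification framing concrete, here is a minimal sketch (using Keras, which is not part of the original question; layer sizes and hyperparameters are arbitrary) of an LSTM whose softmax output can only ever be one of the five allowed values:

```python
# Minimal sketch (assumes Keras/TensorFlow): frame the task as 5-way classification,
# so the prediction is always one of the values in {1, 2, 3, 4, 5}.
import numpy as np
from tensorflow.keras import layers, models

vocab = [1, 2, 3, 4, 5]                     # the finite set of possible values
seq = np.array([1, 4, 5, 3, 2, 1, 3, 2])    # example sequence from the question

# inputs are the previous values (as class indices 0..4), targets are the next values
x = (seq[:-1] - 1).reshape(1, -1)           # shape: (batch, time steps)
y = (seq[1:] - 1).reshape(1, -1)            # same shape, shifted by one step

model = models.Sequential([
    layers.Embedding(input_dim=len(vocab), output_dim=8),
    layers.LSTM(32, return_sequences=True),
    layers.Dense(len(vocab), activation="softmax"),  # one probability per allowed value
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=10, verbose=0)

# the argmax of the softmax is always one of the five classes, i.e. a value in {1,...,5}
next_value = vocab[int(np.argmax(model.predict(x, verbose=0)[0, -1]))]
```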

Related

Why does the backprop algorithm store the inputs to the non-linearity of the hidden layers?

I have been reading the Deep Learning book by Ian Goodfellow and it mentions in Section 6.5.7 that
The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer.
I understand that backprop stores the gradients in a fashion similar to dynamic programming so as not to recompute them. But I am confused as to why it stores the inputs as well.
Backpropagation is a special case of reverse mode automatic differentiation (AD).
In contrast to the forward mode, the reverse mode has the major advantage that you can compute the derivative of an output w.r.t. all inputs of a computation in one pass.
However, the downside is that you need to store all intermediate results of the algorithm you want to differentiate in a suitable data structure (like a graph or a Wengert tape) for as long as you are computing its Jacobian with reverse mode AD, because you're basically "working your way backwards" through the algorithm.
Forward mode AD does not have this disadvantage, but you need to repeat its calculation for every input, so it only makes sense if your algorithm has a lot more output variables than input variables.
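To make the stored-input point concrete, here is a minimal NumPy sketch (a toy network, not from the book) of a forward and backward pass: the gradient of the ReLU needs the pre-activation z1 computed on the way forward, so it has to be kept around until the backward pass reaches that layer.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # the derivative of the nonlinearity needs the pre-activation z saved on the forward pass
    return (z > 0).astype(z.dtype)

# toy network: h1 = relu(W1 x + b1), y = W2 h1 + b2, L = 0.5 * ||y - t||^2
rng = np.random.default_rng(0)
x = rng.normal(size=3)
t = rng.normal(size=2)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)

# forward pass: z1 (the input to the nonlinearity) is kept for the backward pass
z1 = W1 @ x + b1
h1 = relu(z1)
y = W2 @ h1 + b2
loss = 0.5 * np.sum((y - t) ** 2)

# backward pass (reverse mode): each step reuses a quantity stored during the forward pass
dy = y - t                      # needs y
dW2 = np.outer(dy, h1)          # needs h1
dh1 = W2.T @ dy
dz1 = dh1 * relu_grad(z1)       # needs z1 -- the stored "input to the nonlinearity"
dW1 = np.outer(dz1, x)          # needs x
```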

Is it theoretically reasonable to use CNN for data like categorical and numeric data?

I'm trying to use a CNN to do binary classification.
As CNNs show their strength in feature extraction, they have seen many uses for pattern data like images and voice.
However, the dataset I have is not image or voice data, but categorical and numerical data, which is a different case.
My questions are as follows.
In this situation, is it theoretically reasonable to use a CNN for data in this configuration?
If it is, would it be reasonable to artificially arrange my dataset in a two-dimensional form and apply a 2D CNN?
I often see examples of CNN classifiers on Kaggle and in various media, applied not only to images and voice, but also to numerical and categorical data like mine.
I really wonder whether this is theoretically a problem, and I would appreciate any recommendations for related papers or research.
I'm looking forward to hearing any advice about this situation. Thank you for your answer.
CNNs for images apply kernels to neighboring pixels and blocks of the image. CNNs for audio work on spectrograms, i.e. they also exploit proximity in the input data.
If your data inputs have some sort of closeness (e.g. time series, graphs, ...), then a CNN might be useful.
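As an illustration of what "closeness" buys you, here is a minimal sketch (using Keras, which is an assumption; the window length and layer sizes are arbitrary) of a 1D CNN on a time series, where the kernel slides over adjacent time steps. For unordered tabular columns there is no such neighbourhood for the kernel to exploit, which is why forcing such data into a 2D grid rarely helps.

```python
# Sketch (assumes Keras): a 1D CNN is meaningful here only because
# neighbouring time steps of the series are related.
from tensorflow.keras import layers, models

n_steps, n_channels = 128, 1
model = models.Sequential([
    layers.Input(shape=(n_steps, n_channels)),
    layers.Conv1D(16, kernel_size=5, activation="relu"),  # kernel spans 5 adjacent time steps
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),                # binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy")
```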

How to combine the probability (soft) output of different networks and get the hard output?

I have trained three different models separately in Caffe, and I can get the probability of belonging to each class for semantic segmentation. I want to get an output based on the three probabilities I am getting (for example, the argmax of the three probabilities). This can be done by running inference with the net model and deploy.prototxt files. Then, based on the final soft output, the hard output gives the final segmentation.
My questions are:
How do I get the ensemble output of these networks?
How do I do end-to-end training of the ensemble of three networks? Are there any resources to get help?
How do I get the final segmentation based on the final probability (e.g., the argmax of the three probabilities), which is a soft output?
My questions may sound very basic, and my apologies for that. I am still trying to learn step by step. I really appreciate your help.
There are (at least) two ways that I know of to solve (1):
One is to use the pycaffe interface: instantiate the three networks, forward an input image through each of them, fetch the outputs, and perform whatever operation you desire to combine all three probabilities. This is especially useful if you intend to combine them using more complex logic.
The alternative (far less elegant) is to use caffe test and process all your inputs separately through each network, saving the probabilities to files, then combine the probabilities from the files later.
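Here is a rough sketch of the first (pycaffe) option, which also covers the hard-output part of question (3) via a per-pixel argmax. The deploy/weight file names and the 'data'/'prob' blob names are assumptions; substitute whatever your own prototxt files define.

```python
# Sketch: average the soft outputs of three Caffe nets and take a per-pixel argmax.
import numpy as np
import caffe

caffe.set_mode_cpu()
nets = [caffe.Net(proto, weights, caffe.TEST)
        for proto, weights in [('deploy1.prototxt', 'model1.caffemodel'),
                               ('deploy2.prototxt', 'model2.caffemodel'),
                               ('deploy3.prototxt', 'model3.caffemodel')]]

def ensemble_segment(image):
    """Average the three soft outputs and return the hard segmentation map."""
    probs = []
    for net in nets:
        net.blobs['data'].data[...] = image            # image already preprocessed to the net's input shape
        net.forward()
        probs.append(net.blobs['prob'].data[0].copy())  # shape: (num_classes, H, W)
    avg_prob = np.mean(probs, axis=0)                   # soft ensemble output
    return np.argmax(avg_prob, axis=0)                  # hard output: class index per pixel
```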
Regarding your second question, I have never trained more than two weight-sharing CNNs (siamese networks). From what I understood, your networks don't share weights, only the architecture. If you want to train all three end-to-end, please take a look at this tutorial made for siamese networks. The authors define both paths/branches in their prototxt, connect each branch's layers to the input Data layer, and join them at the end with a loss layer.
In your case you would define the three branches (one for each of your networks), connect them to input data layers (check whether each branch processes the same input or different inputs, for example the same image pre-processed differently), and unite them with a loss, similarly to the tutorial.
Now, for the last question, it seems Caffe has an ArgMax layer that may be what you are looking for. If you are familiar with Python, you could also use a Python layer, which allows you to define with great flexibility how to combine the output probabilities.
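For reference, a Caffe Python layer that averages three probability blobs and emits the per-pixel argmax could look roughly like the sketch below. The class name and blob shapes are illustrative; the layer would be referenced from the prototxt as a layer with type: "Python" pointing at this module and class.

```python
# Sketch of a Caffe Python layer: takes three probability blobs of shape
# (N, num_classes, H, W) as bottoms and outputs a (N, 1, H, W) label map.
import numpy as np
import caffe

class EnsembleArgMaxLayer(caffe.Layer):
    def setup(self, bottom, top):
        if len(bottom) != 3:
            raise Exception("Expects the three probability blobs as bottoms.")

    def reshape(self, bottom, top):
        n, _, h, w = bottom[0].data.shape
        top[0].reshape(n, 1, h, w)           # one label per pixel

    def forward(self, bottom, top):
        avg = (bottom[0].data + bottom[1].data + bottom[2].data) / 3.0
        top[0].data[...] = np.argmax(avg, axis=1)[:, np.newaxis, ...]

    def backward(self, top, propagate_down, bottom):
        pass                                  # argmax is not differentiable; no gradient to pass back
```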

How can I create a classifier using the feature map of a CNN?

I intend to make a classifier using the feature map obtained from a CNN. Can someone suggest how I can do this?
Would it work if I first train the CNN using +ve and -ve samples (and hence obtain the weights), and then every time I need to classify an image, I apply the conv and pooling layers to obtain the feature map? The problem I find with this is that the image I want to classify may not have a similar feature map, and hence I wouldn't be able to compute the distance correctly, as the order of the features may be different in the layer.
You can use the same CNN for classification if you trained it with (for example) the cross-entropy loss (also known as softmax with loss). In this case, you should take the argmax of your last layer (the node with the highest score), and that is the class given by the network. However, all architectures used in machine learning expect inputs at test time that are similar to those used during training.
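For instance, with pycaffe (the file names and the 'data'/'prob' blob names are assumptions), the predicted class is simply the index of the largest softmax output:

```python
import numpy as np
import caffe

net = caffe.Net('deploy.prototxt', 'trained.caffemodel', caffe.TEST)

# the test image must be preprocessed exactly as during training;
# a zero array of the right shape is used here only as a stand-in
image = np.zeros(net.blobs['data'].data.shape[1:], dtype=np.float32)
net.blobs['data'].data[...] = image
net.forward()
predicted_class = int(np.argmax(net.blobs['prob'].data[0]))
```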

How to implement fixed parameters in caffe?

Suppose there are parameters in the network that I would like to change manually in pycaffe, rather than have them updated automatically by the solver. For example, suppose we would like to penalize dense activations; this can be implemented as an additional loss layer. Across the training process, we would like to change the strength of this penalty by multiplying the loss by a coefficient that evolves in a pre-specified way. What would be a good way to do this in Caffe? Is it possible to specify this in the prototxt definition? In the pycaffe interface?
Update: I suppose setting lr_mult and decay_mult to 0 might be a solution, but that seems like a clumsy one. Maybe a DummyDataLayer providing the parameters as a blob would be a better option. But there is so little documentation that it's quite a struggle for someone new to Caffe to write.
Maybe this is a trivial question, but just in case someone else is interested, here is a successful implementation I ended up using.
In the layer proto definition, set lr_mult and decay_mult to 0, which means we want to neither learn nor decay the parameters. Use a filler to set the initial values. To change the parameters in Python during training of the network, use a statement like
net.params['name'][index].data[...] = something
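Putting it together, a training loop along these lines could update the frozen coefficient by hand between solver steps. The 'solver.prototxt' file name, the 'penalty' layer name, and the linear ramp schedule are assumptions for illustration; only the lr_mult/decay_mult trick and the net.params access come from the answer above.

```python
# Sketch: the coefficient blob is frozen for the solver (lr_mult: 0, decay_mult: 0
# in the prototxt), so we set its value manually on a pre-specified schedule.
import caffe

caffe.set_mode_cpu()
solver = caffe.SGDSolver('solver.prototxt')

for it in range(10000):
    coeff = min(1.0, it / 5000.0)                        # e.g. linearly ramp the penalty strength
    solver.net.params['penalty'][0].data[...] = coeff    # the solver never updates this blob
    solver.step(1)
```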