sequence to sequence model using pytorch - deep-learning

I have dataset (sequence to sequence), each sample input is seq of charterers (combination from from 20 characters and max length 2166) and out is list of charterers (combination of three characters G,H,B). for example OIREDSSSRTTT ----> GGGHHHHBHBBB
I would like to do simple pytorch model that work in that type of dataset. Model that can predict sequence of classes. I would appreciate any suggestions or links for simple mode that do the same?
Thanks

If the output sequence always has the same length as the input sequence, you might want to use transformer encoder, because it basically transforms the inputs with attention to the context. Also you can try to use anything that is used to tagging: BiLSTM, BiGRU, etc.
If you want your model to be able to predict sequences of different length (not necessary the same as input length), look at some encoder-decoder models, such as vanilla transformer.

You can start with the sequence tagging model from PyTorch tutorial https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html .
As #Ilya Fedorov said, you can move to transformer models for potentially better performance.

Related

Is BertTokenizer similar to word embedding?

The idea of using BertTokenizer from huggingface really confuses me.
When I use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode_plus("Hello")
Does the result is somewhat similar to when I pass
a one-hot vector representing "Hello" to a learning embedding matrix?
How is
BertTokenizer.from_pretrained("bert-base-uncased")
different from
BertTokenizer.from_pretrained("bert-**large**-uncased")
and other pretrained?
The encode_plus and encode functions tokenize your texts and prepare them in a proper input format of the BERT model. Therefore you can see them similar to the one-hot vector in your provided example.
The encode_plus returns a BatchEncoding consisting of input_ids, token_type_ids, and attention_mask.
The pre-trained model differs based on the number of encoder layers. The base model has 12 encoders, and the large model has 24 layers of encoders.

Does any H2O algorithm support multi-label classification?

Is deep learning model supports multi-label classification problem or any other algorithms in H2O?
Orginal Response Variable -Tags:
apps, email, mail
finance,freelancers,contractors,zen99
genomes
gogovan
brazil,china,cloudflare
hauling,service,moving
ferguson,crowdfunding,beacon
cms,naytev
y,combinator
in,store,
conversion,logic,ad,attribution
After mapping them on the keys of the dictionary:
Then
Response variable look like this:
[74]
[156, 89]
[153, 13, 133, 40]
[150]
[474, 277, 113]
[181, 117]
[15, 87, 8, 11]
Thanks
No, H2O only contains algorithms that learn to predict a single response variable at a time. You could turn each unique combination into a single class and train a multi-class model that way, or predict each class with a separate model.
Any algorithm that creates a model that gives you "finance,freelancers,contractors,zen99" for one set of inputs, and "cms,naytev" for another set of inputs is horribly over-fitted. You need to take a step back and think about what your actual question is.
But in lieu of that, here is one idea: train some word embeddings (or use some pre-trained ones) on your answer words. You could then average the vectors for each set of values, and hope this gives you a good numeric representation of the "topic". You then need to turn your, say, 100 dimensional averaged word vector into a single number (PCA comes to mind). And now you have a single number that you can give to a machine learning algorithm, and that it can predict.
You still have a problem: having predicted a number, how do you turn that number into a 100-dim vector, and from there in to a topic, and from there into topic words? Tricky, but maybe not impossible.
(As an aside, if you turn the above "single number" into a factor, and have the machine learning model do a categorization, to predicting the most similar topic to those it has seen before... you've basically gone full circle and will get a model identical to the one you started with that has too many classes.)

Named Entity Recognition confidence

I need to get confidence about each extracted entity (not to print it but to get it), however, I can't find a method that returns confidences.
Firstly, I have tried using Stanford Named Entity Recognizer library on Java and this solution:
Display Stanford NER confidence score
but it doesn't work (I guess getCliqueTree method is not available). I also have tried using NLTK in Python and Stanford NER model to extract entities, but again couldn't find a way to get confidences.
I know how to do it on Spacy:
https://github.com/explosion/spaCy/issues/831
but as the author says it's inefficient.
So, can you please advise me, how to get the probabilities of each extracted entity?
Usually NER is a token level classification task.
Confidences are usually derived from each prediction, which is commonly the output of some type of softmax.
The issue then become, how can I get a confidence for a sequence of confidences?
There are multiple ways:
Entropy [Confidence is amount of information]
Average (Mean) [Confidence is the average]
Min/Max of confidences [Confidence is the min/max]
All of these give different answers, none are "better" and it really depends on your use case.
If you would like to order possible entity types, you can start with the following:
Get confidences assuming same label for each token
Get entropy for confidence (probability) sequence
Sort by entropy

How to perform multi labeling classification (for CNN)?

I am currently looking into multi-labeling classification and I have some questions (and I couldn't find clear answers).
For the sake of clarity let's take an example : I want to classify images of vehicles (car, bus, truck, ...) and their make (Audi, Volkswagen, Ferrari, ...).
So I thought about training two independant CNN (one for the "type" classification and one fore the "make" classifiaction) but I thought it might be possible to train only one CNN on all the classes.
I read that people tend to use sigmoid function instead of softmax to do that. I understand that sigmoid does not sum up to 1 like softmax does but I dont understand in what doing that enables to do multi-labeling classification ?
My second question is : Is it possible to take into account that some classes are completly independant ?
Thridly, in term of performances (accuracy and time to give the classification for a new image), isn't training two independant better ?
Thank you for those who could give my some answers or some ideas :)
Softmax is a special output function; it forces the output vector to have a single large value. Now, training neural networks works by calculating an output vector, comparing that to a target vector, and back-propagating the error. There's no reason to restrict your target vector to a single large value, and for multi-labeling you'd use a 1.0 target for every label that applies. But in that case, using a softmax for the output layer will cause unintended differences between output and target, differences that are then back-propagated.
For the second part: you define the target vectors; you can encode any sort of dependency you like there.
Finally, no - a combined network performs better than the two halves would do independently. You'd only run two networks in parallel when there's a difference in network layout, e.g. a regular NN and CNN in parallel might be viable.

How to massage inputs into Keras framework?

I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token-vector and label-vector (of same length), using a token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different length you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most used.
If you want to use future information when using RNNs you want to use a bidirectional model where you concatenate two RNN's moving in opposite directions. RNN will use a representation of all previous information when e.g. predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that. Fx 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem but it's not tested so it might not run. Instead of using integers to represent your data I suggest one-hot encoding if it is possible and then use binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, TimeDistributed
from keras.preprocessing import sequence
# Make sure all sequences are of same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN
lstm_forward = LSTM(128)(inputs)
lstm_backward = LSTM(128, go_backwards=True)(inputs)
bidirectional_lstm = merge([lstm_forward, lstm_backward], mode='concat', concat_axis=2)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)