Determining the probability of a sequence generated by T5 model by HuggingFace - deep-learning

I am using T5-Large by HuggingFace for inference. Given a premise and a hypothesis, I need to determine whether they are related or not. So, if I feed a string "mnli premise: This game will NOT open unless you agree to them sharing your information to advertisers. hypothesis: Personal data disclosure is discussed." the model is supposed to return either entailment, neutral, or contradiction.
Though I am able to determine the result, I am unable to determine the probability of the sequence generated. For instance, consider the model will generate entailment for the example given above. I also want to know what is the probability of entailment. So far, I have been using the following code,
from transformers import T5Tokenizer, T5ForConditionalGeneration
def is_entailment(premise, hypothesis):
entailment_premise = premise
entailment_hypothesis = hypothesis
token_output = tokenizer("mnli premise: " + entailment_premise + " hypothesis: " + entailment_hypothesis,
return_tensors="pt", return_length=True)
input_ids = token_output.input_ids
output = model.generate(input_ids, output_scores=True, return_dict_in_generate=True, max_new_tokens=15)
entailment_ids = output["sequences"]
entailment = tokenizer.decode(entailment_ids[0], skip_special_tokens=True)
return entailment
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True)
premise = "This game will NOT open unless you agree to them sharing your information to advertisers."
hypothesis = "Personal data disclosure is discussed."
print(is_entailment(premise, hypothesis))
I have tried using the scores we get as output, but not sure how to calculate the probability from them. Same goes for the last hidden states that can be fetched as the output from the generate(). I saw in another question on Stack Overflow that suggested using a softmax function on the last hidden states but I am unsure how to do it.
How can I calculate the probability of the sequence being generated? That is, if I get entailment for a pair of hypothesis and premise, what would be the P(entailment)?

What you get as the scores are output token distributions before the softmax, so-called logits. You can get the probabilities of generated tokens by normalizing the logits and taking respective token ids. You can get them from the field sequences from what the generate method returns.
These are, however, not the probabilities you are looking for because T5 segments your output words into smaller units (e.g., "entailment" gets segmented to ['▁', 'en', 'tail', 'ment'] using the t5-small tokenizer). This is even trickier because different answers get split into a different number of tokens. You can get an approximate score by averaging the token probabilities (this is typically used during beam search). Such scores do not sum up to one.
If you want a normalized score, the only way is to feed all three possible answers to the decoder, get their scores, and normalize them to sum to one.

Related

Validation Loss with non in time-series classification

Case 1:
I am feeding a variable-length input time-series window to the GRU model. Sometimes there may be 900 samples in the window, and sometimes there may be only 16. I fed into the RNN model (GRU) since I learned that RNN methods work better on long sequences. I utilize one GRU layer and get hidden sequences across all the time stamps in order to get maximum information of all the time stamps. Then, I used average pooling on GRU output to bring representation into fixed-length. The intuition of using average-pooling instead of max-pooling is that it may achieve summarized information of all the timestamps. Here is the code of the model:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
recurrent_initializer=tf.keras.initializers.Orthogonal(), dropout=0.5, recurrent_dropout=0.5, return_sequences=True
)(input_mask)
AP = tf.keras.layers.GlobalAveragePooling1D()(gru_l5)
gru_fm = tf.keras.layers.Dropout(0.3)(AP)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
From this model, I am obtaining better performance on validation set while on training data, performance increased by 100% (going for worst), however, the major issue is that validation loss is "nan." This issue is currently being explored on GitHub and StackOverflow.
I tried nearly all of the options provided here, here and here. But unable to resolve this validation_loss = non issue.
Case 2:
Then I decided not to get all of the GRU's hidden states but rather to retrieve only the last hidden state, which would provide a fixed-length representation and eliminate the requirement for pooling. Here, the validation loss as "nan" probelm is fixed, but the test data performance is drastically reduced. Here is this model's source code:
input_layer = tf.keras.Input(shape=input_shape, name="time_series_activity")
input_mask = tf.keras.layers.Masking(mask_value=0.00000)(input_layer)
gru_l5 = tf.keras.layers.GRU(64, activation='tanh', recurrent_activation='sigmoid',
recurrent_initializer=tf.keras.initializers.Orthogonal(), dropout=0.5, recurrent_dropout=0.5)(input_mask)
gru_fm = tf.keras.layers.Dropout(0.3)(gru_l5)
output_layer = tf.keras.layers.Dense(total_classes, activation="softmax")(gru_fm)
return tf.keras.models.Model(inputs=input_layer, outputs=output_layer)
We can observe the results of both Cases. In Case 1, I have the feeling that the vanishing gradient problem occurs with longer sequences. Any thoughts or discussions on resolving this "nan" issue and achieving high performance would be much appreciated.

One Hot Encoding dimension - Model Compexity

I will explain my problem:
I have around 50.000 samples, each of one described by a list of codes representing "events"
The number of unique codes are around 800.
The max number of codes that a sample could have is around 600.
I want to represent each sample using one-hot encoding. The representation should be, if we consider the operation of padding for those samples that has fewer codes, a 800x600 matrix.
Giving this new representation as input of a network, means to flatten each matrix to a vector of size 800x600 (460.000 values).
At the end the dataset should consist in 50.000 vectors of size 460.000 .
Now, I have two considerations:
How is it possible to handle a dataset of that size?(I tried data generator to obtain the representation on-the-fly but they are really slow).
Having a vector of size 460.000 as input for each sample, means that the complexity of my model( number of parameters to learn ) is extremely high ( around 15.000.000 in my case ) and, so, I need an huge dataset to train the model properly. Doesn't it?
Why do not you use the conventional model used in NLP?
These events can be translated as you say by embedding matrix.
Then you can represent the chains of events using LSTM (or GRU or RNN o Bilateral LSTM), the difference of using LSTM instead of a conventional network is that you use the same module repeated by N times.
So your input really is not 460,000, but internally an event A indirectly helps you learn about an event B. That's because the LSTM has a module that repeats itself for each event in the chain.
You have an example here:
https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras
Broadly speaking what I would do would be the following (in Keras pseudo-code):
Detect the number of total events. I generate a unique list.
unique_events = list (set ([event_0, ..., event_n]))
You can perform the translation of a sequence with:
seq_events_idx = map (unique_events.index, seq_events)
Add the necessary pad to each sequence:
sequences_pad = pad_sequences (sequences, max_seq)
Then you can directly use an embedding to carry out the transfer of the event to an associated vector of the dimension that you consider.
input_ = Input (shape = (max_seq,), dtype = 'int32')
embedding = Embedding (len(unique_events),
                    dimensions,
                    input_length = max_seq,
                    trainable = True) (input_)
Then you define the architecture of your LSTM (For example):
lstm = LSTM (128, input_shape = (max_seq, dimensions), dropout = 0.2, recurrent_dropout = 0.2, return_sequences = True) (embedding)
Add the dense and the result you want:
out = Dense (10, activation = 'softmax') (lstm)
I think that this type of model can help you and give better results.

Understanding stateful LSTM [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 2 years ago.
Improve this question
I'm going through this tutorial on RNNs/LSTMs and I'm having quite a hard time understanding stateful LSTMs. My questions are as follows :
1. Training batching size
In the Keras docs on RNNs, I found out that the hidden state of the sample in i-th position within the batch will be fed as input hidden state for the sample in i-th position in the next batch. Does that mean that if we want to pass the hidden state from sample to sample we have to use batches of size 1 and therefore perform online gradient descent? Is there a way to pass the hidden state within a batch of size >1 and perform gradient descent on that batch ?
2. One-Char Mapping Problems
In the tutorial's paragraph 'Stateful LSTM for a One-Char to One-Char Mapping' were given a code that uses batch_size = 1 and stateful = True to learn to predict the next letter of the alphabet given a letter of the alphabet. In the last part of the code (line 53 to the end of the complete code), the model is tested starting with a random letter ('K') and predicts 'B' then given 'B' it predicts 'C', etc. It seems to work well except for 'K'. However, I tried the following tweak to the code (last part too, I kept lines 52 and above):
# demonstrate a random starting point
letter1 = "M"
seed1 = [char_to_int[letter1]]
x = numpy.reshape(seed, (1, len(seed), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed1[0]], "->", int_to_char[index])
letter2 = "E"
seed2 = [char_to_int[letter2]]
seed = seed2
print("New start: ", letter1, letter2)
for i in range(0, 5):
x = numpy.reshape(seed, (1, len(seed), 1))
x = x / float(len(alphabet))
prediction = model.predict(x, verbose=0)
index = numpy.argmax(prediction)
print(int_to_char[seed[0]], "->", int_to_char[index])
seed = [index]
model.reset_states()
and these outputs:
M -> B
New start: M E
E -> C
C -> D
D -> E
E -> F
It looks like the LSTM did not learn the alphabet but just the positions of the letters, and that regardless of the first letter we feed in, the LSTM will always predict B since it's the second letter, then C and so on.
Therefore, how does keeping the previous hidden state as initial hidden state for the current hidden state help us with the learning given that during test if we start with the letter 'K' for example, letters A to J will not have been fed in before and the initial hidden state won't be the same as during training ?
3. Training an LSTM on a book for sentence generation
I want to train my LSTM on a whole book to learn how to generate sentences and perhaps learn the authors style too, how can I naturally train my LSTM on that text (input the whole text and let the LSTM figure out the dependencies between the words) instead of having to 'artificially' create batches of sentences from that book myself to train my LSTM on? I believe I should use stateful LSTMs could help but I'm not sure how.
Having a stateful LSTM in Keras means that a Keras variable will be used to store and update the state, and in fact you could check the value of the state vector(s) at any time (that is, until you call reset_states()). A non-stateful model, on the other hand, will use an initial zero state every time it processes a batch, so it is as if you always called reset_states() after train_on_batch, test_on_batch and predict_on_batch. The explanation about the state being reused for the next batch on stateful models is just about that difference with non-stateful; of course the state will always flow within each sequence in the batch and you do not need to have batches of size 1 for that to happen. I see two scenarios where stateful models are useful:
You want to train on split sequences of data because these are very long and it would not be practical to train on their whole length.
On prediction time, you want to retrieve the output for each time point in the sequence, not just at the end (either because you want to feed it back into the network or because your application needs it). I personally do that in the models that I export for later integration (which are "copies" of the training model with batch size of 1).
I agree that the example of an RNN for the alphabet does not really seem very useful in practice; it will only work when you start with the letter A. If you want to learn to reproduce the alphabet starting at any letter, you would need to train the network with that kind of examples (subsequences or rotations of the alphabet). But I think a regular feed-forward network could learn to predict the next letter of the alphabet training on pairs like (A, B), (B, C), etc. I think the example is meant for demonstrative purposes more than anything else.
You may have probably already read it, but the popular post The Unreasonable Effectiveness of Recurrent Neural Networks shows some interesting results along the lines of what you want to do (although it does not really dive into implementation specifics). I don't have personal experience training RNN with textual data, but there is a number of approaches you can research. You can build character-based models (like the ones in the post), where your input and receive one character at a time. A more advanced approach is to do some preprocessing on the texts and transform them into sequences of numbers; Keras includes some text preprocessing functions to do that. Having one single number as feature space is probably not going to work all that well, so you could simply turn each word into a vector with one-hot encoding or, more interestingly, have the network learn the best vector representation for each for, which is what they call en embedding. You can go even further with the preprocessing and look into something like NLTK, specially if you want to remove stop words, punctuation and things like that. Finally, if you have sequences of different sizes (e.g. you are using full texts instead of excerpts of a fixed size, which may or may not be important for you) you will need to be a bit more careful and use masking and/or sample weighting. Depending on the exact problem, you can set up the training accordingly. If you want to learn to generate similar text, the "Y" would be the similar to the "X" (one-hot encoded), only shifted by one (or more) positions (in this case you may need to use return_sequences=True and TimeDistributed layers). If you want to determine the autor, your output could be a softmax Dense layer.
Hope that helps.

How to massage inputs into Keras framework?

I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token-vector and label-vector (of same length), using a token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just label based on previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different length you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most used.
If you want to use future information when using RNNs you want to use a bidirectional model where you concatenate two RNN's moving in opposite directions. RNN will use a representation of all previous information when e.g. predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that. Fx 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem but it's not tested so it might not run. Instead of using integers to represent your data I suggest one-hot encoding if it is possible and then use binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, TimeDistributed
from keras.preprocessing import sequence
# Make sure all sequences are of same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN
lstm_forward = LSTM(128)(inputs)
lstm_backward = LSTM(128, go_backwards=True)(inputs)
bidirectional_lstm = merge([lstm_forward, lstm_backward], mode='concat', concat_axis=2)
# Output each timestep into a fully connected layer with linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Dense(n_classes, activation='sigmoid') if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)

Dynamic Topic model output - Blei format

I am working with the Dynamic Topic Models package that was developed by Blei. I am new to LDA however I understand it.
I would like to know what does the output by the name of
lda-seq/topic-000-var-obs.dat store?
I know that lda-seq/topic-001-var-e-log-prob.dat stores the log of the variational posterior and by applying the exponential over it, I get the probability of the word within Topic 001.
Thanks
Topic-000-var-e-log-prob.dat store the log of the variational posterior of the topic 1.
Topic-001-var-e-log-prob.dat store the log of the variational posterior of the topic 2.
I have failed to find a concrete answer anywhere. However, since the documentation's sample.sh states
The code creates at least the following files:
- topic-???-var-e-log-prob.dat: the e-betas (word distributions) for topic ??? for all times.
...
- gam.dat
without mentioning the topic-000-var-obs.dat file, suggests that it is not imperative for most analyses.
Speculation
obs suggest observations. After a little dig around in the example/model_run results, I plotted the sum across epochs for each word/token using:
temp = scan("dtm/example/model_run/lda-seq/topic-000-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))
and the result is something like:
The general trend of the non-negative values is decreasing and many values are floored (in this case to -11.00972 = log(1.67e-05)) Suggesting that these values are weightings or some other measure of influence on the model. The model removes some tokens and the influence/importance of the others tapers off over the index. The later trend may be caused by preprocessing such as sorting tokens by tf-idf when creating the dictionary.
Interestingly the row sum values varies for both the floored tokens and the set with more positive values:
temp = scan("~/Documents/Python/inference/project/dtm/example/model_run/lda-seq/topic-009-var-obs.dat")
temp.matrix = matrix(temp, ncol = 10, byrow = TRUE)
plot(rowSums(temp.matrix))