How do I know which parameters to use with a pretrained Tokenizer? - deep-learning

I must be missing something ...
I want to use a pretrained model with HuggingFace:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
transformer_name = "Geotrend/distilbert-base-fr-cased" # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (padding, truncation, max_length) to use with my Tokenizer.
Some examples just call the tokenizer tokenizer(data), others use truncation only tokenizer(data, truncation=True), and others will use many parameters tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512).
As I am reloading a pretrained Tokenizer, I would have loved it to use the same parameters as in the original training process. How do I know which parameters to use?
My understanding is that I always need to truncate my data and leave max_length set to None, so that my sequence lengths never exceed the model's maximum length. Is that it? Does leaving max_length as None make it fall back on the model's maximum length?
And what should I do with padding? As I am using a Trainer object for training with a DataCollatorWithPadding, should I set padding to False to reduce the memory impact and let the collator pad my batches?
Final question: what should I do if I use a TextClassificationPipeline for inference? Should I specify these parameters (padding, etc.)? Will the pipeline handle it for me?

The choice of whether to use padding and truncation depends on the model you are fine-tuning and on your training process, not on the pretrained tokenizer.
Transformer-based models have a constraint on the number of tokens they can process, so generally yes, that's it. And yes, when max_length is None, the maximum acceptable input length for the model is used (see the docs).
Correct, you should not pad the input sequences yourself if you use DataCollatorWithPadding, since it pads each batch dynamically. More about it in this video.
As you already noticed, you have to specify them yourself when you pass your input text to the pipeline.
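A minimal sketch of that setup, assuming the model name from the question and a couple of placeholder sentences (the variable names are only illustrative): truncation is handled by the tokenizer, while padding is deferred to DataCollatorWithPadding, which pads each batch to the length of its longest sequence.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding)

transformer_name = "Geotrend/distilbert-base-fr-cased"
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)

# Truncate to the model's maximum length (max_length=None falls back to it),
# but do not pad here: the collator pads each batch dynamically.
texts = ["Un exemple court.", "Un autre exemple un peu plus long que le premier."]
encoded = [tokenizer(t, truncation=True) for t in texts]

collator = DataCollatorWithPadding(tokenizer=tokenizer)
batch = collator(encoded)        # pads to the longest sequence in this batch
print(batch["input_ids"].shape)  # e.g. torch.Size([2, <longest length in batch>])
The same collator can then be passed to the Trainer through its data_collator argument.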

Related

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformer Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and if so, how to implement them, or whether I would have to write my own training loop to enable them.
I am looking to apply Curriculum Learning to my training strategy, as well as evaluating the model at regular intervals, and therefore would like to enable the following:
choose the order in which the model sees training samples at each epoch (the data passed to the train_dataset argument seem to be shuffled automatically by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs: I may want to start training the model on easy samples for a few epochs, and then pass a random shuffle of all the data for later epochs)
run custom evaluation at integer multiples of a fixed number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed*, and the user can decide how to generate the metrics from these. However, I'd like a finer level of control, for example changing the maximum sequence length for the tokenizer when doing evaluation as opposed to training, which would require including some explicit evaluation code inside compute_metrics that accesses the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this, and I'm not sure it would work with the Trainer if it doesn't have this signature:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    ...
You can pass a custom compute_metrics function to the Trainer, and control how often evaluation runs through the training arguments.
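A rough sketch of how those pieces fit together in the standard API (the accuracy metric, the step count, and the commented-out model/dataset names are placeholders): the metric function is passed to the Trainer itself, while the evaluation cadence is set in TrainingArguments.
import numpy as np
from transformers import Trainer, TrainingArguments

# compute_metrics receives an EvalPrediction: (predictions, label_ids)
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",  # evaluate every eval_steps training steps
    eval_steps=500,
)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   compute_metrics=compute_metrics)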

sequence to sequence model using pytorch

I have a sequence-to-sequence dataset: each sample input is a sequence of characters (drawn from 20 possible characters, with a maximum length of 2166) and the output is a sequence of characters drawn from three classes (G, H, B). For example: OIREDSSSRTTT ----> GGGHHHHBHBBB
I would like to build a simple PyTorch model that works on this type of dataset, i.e. a model that can predict a sequence of classes. I would appreciate any suggestions or links to a simple model that does this.
Thanks
If the output sequence always has the same length as the input sequence, you might want to use a transformer encoder, because it basically transforms the inputs with attention to the context. You can also try anything that is used for tagging: a BiLSTM, BiGRU, etc.
If you want your model to be able to predict sequences of a different length (not necessarily the same as the input length), look at encoder-decoder models, such as the vanilla transformer.
You can start with the sequence tagging model from the PyTorch tutorial: https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html .
As Ilya Fedorov said, you can move to transformer models for potentially better performance.
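As a starting point, here is a rough sketch of such a tagging model in PyTorch (the BiLSTM option mentioned above); the layer sizes and the random batch are only illustrative, not taken from the question.
import torch
import torch.nn as nn

# Character-level BiLSTM tagger: 20 possible input characters,
# 3 output classes (G, H, B), one label per position.
class CharTagger(nn.Module):
    def __init__(self, n_chars=20, n_classes=3, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x):               # x: (batch, seq_len) of character indices
        emb = self.embedding(x)         # (batch, seq_len, emb_dim)
        out, _ = self.lstm(emb)         # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(out)     # (batch, seq_len, n_classes)

model = CharTagger()
x = torch.randint(0, 20, (4, 100))      # fake batch: 4 sequences of length 100
y = torch.randint(0, 3, (4, 100))       # fake per-position labels
logits = model(x)                       # (4, 100, 3)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3), y.reshape(-1))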

PyTorch transformer argument "dim_feedforward"

I would like to understand what exactly is going on with this argument.
I have read that the feed-forward sub-layer inside the transformer layer is a "pointwise" feed-forward layer. What does "pointwise" mean in this context?
A feed-forward layer takes two arguments: the number of input features and the number of output features.
This argument can't be the output features, since no matter what value I use for it the output of the transformer layer always has the same shape. It also can't be the input features, since those are determined by the self-attention sublayer.
MOST IMPORTANTLY: where is the argument for the size of the tensors used for attention, the ones that project the input into queries, keys and values?
"Position-wise", or "Point-wise", means the feed forward network (FFN) takes each position of a sequence, say, each word of a sentence, as its input. So point-wise FFN is a shared FFN that inputs each word one by one.
(and 3.) That's right. It is neither input features (determined by the self attention sublayer) nor output features (the same value as input features). It is actually the hidden features. The thing is, this particular FFN in transformer encoder has two linear layers, according to the implementation of TransformerEncoderLayer :
# Implementation of Feedforward model
self.linear1 = Linear(d_model, dim_feedforward, **factory_kwargs)
self.dropout = Dropout(dropout)
self.linear2 = Linear(dim_feedforward, d_model, **factory_kwargs)
So dim_feedforward is the number of features in the hidden layer of the FFN. Usually its value is set to several times d_model (2048 by default).
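A quick sanity check (not from the original answer) is to build the layer with two different values of dim_feedforward and confirm that the output shape does not change:
import torch
import torch.nn as nn

# Changing dim_feedforward only changes the hidden width of the internal FFN;
# the layer's output keeps the shape (seq_len, batch, d_model) either way.
x = torch.randn(10, 2, 512)  # (seq_len, batch, d_model)
small = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=64)
large = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048)
print(small(x).shape, large(x).shape)  # both torch.Size([10, 2, 512])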

Does PyTorch support variable with dynamic dimension?

I've updated my question based upon the variable dimension of variables.
Suppose the input tensor stores 3D points with dimension 10x3, where 10 is the number of points and 3 is the feature dimension (say the x, y, z coordinates). The dimension of the variable depends on the input tensor; say its dimension is 10x10. When the input tensor changes its dimension to 50x3, the dimension of the variable also has to change, to 50x50.
I know that in TensorFlow, if the input dimension is changing/unknown, we can declare it as tf.placeholder(None, 3). However, I have never met the situation where the size of a variable is changing/unknown; it seems that a variable always has a fixed dimension.
I am currently learning PyTorch and don't know whether PyTorch supports this function. Any information would be appreciated!
========= Original question ========
I have a variable whose size changes when the input dimension changes. For example, if the input is 10x2, then the variable should be 10x10. If the input is 25x2, then the variable should be 25x25. As I understand it, a variable is used to store weights, which normally have a fixed dimension. However, in my case the dimension of the variable depends on the input data, which can change. Does PyTorch currently support this kind of functionality?
Thanks!
Your question is a little ambiguous. When you say your input is, say, 10x2, you need to define what the input tensor contains.
I am assuming you are talking about torch.autograd.Variable. If you want to use PyTorch's functionality, what you need to do is to provide your input through a tensor in the desired shape of the target function.
For example, if you want to use RNN implemented in PyTorch for an input sentence of length 10 where each word is represented by a 300 dimensional vector (e.g., word embedding), then you can do as follows.
import torch
import torch.nn as nn
from torch.autograd import Variable

rnn = nn.RNN(300, 256, 2)                  # emb_size=300, hidden_size=256, num_layers=2
input = Variable(torch.randn(10, 1, 300))  # sent_length=10, batch_size=1, emb_size=300
h0 = Variable(torch.randn(2, 1, 256))      # num_layers=2, batch_size=1, hidden_size=256
output, hn = rnn(input, h0)
If you have more than one sentence, you can provide them as a batch. In that case you need to pad them to handle the variable lengths. As you can see, the RNN doesn't care about the sentence length; it can handle variable lengths, but to provide many sentences in a batch you need padding. You can explore the related functionality in the official documentation.
Since you didn't mention what your input actually is, I am assuming you need variables with a variable number of timesteps; in that case, PyTorch can serve your purpose. PyTorch provides all the basic functionality required to build deep neural network architectures.
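To make the dynamic-dimension point concrete, here is a small sketch (using the current PyTorch API, where Variable is no longer needed) showing the same module processing inputs of different lengths without being redefined:
import torch
import torch.nn as nn

# The same module accepts inputs of different sequence lengths at run time;
# no static placeholder declaration is needed.
rnn = nn.RNN(300, 256, 2)
out10, _ = rnn(torch.randn(10, 1, 300))  # sequence of length 10
out50, _ = rnn(torch.randn(50, 1, 300))  # sequence of length 50, same module
print(out10.shape, out50.shape)          # torch.Size([10, 1, 256]) torch.Size([50, 1, 256])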

How to massage inputs into Keras framework?

I am new to keras and despite reading the documentation and the examples folder in keras, I'm still struggling with how to fit everything together.
In particular, I want to start with a simple task: I have a sequence of tokens, where each token has exactly one label. I have a lot of training data like this - practically infinite, as I can generate more (token, label) training pairs as needed.
I want to build a network to predict labels given tokens. The number of tokens must always be the same as the number of labels (one token = one label).
And I want this to be based on all surrounding tokens, say within the same line or sentence or window -- not just on the preceding tokens.
How far I got on my own:
created the training numpy vectors, where I converted each sentence into a token vector and a label vector (of the same length), using token-to-int and label-to-int mappings
wrote a model using categorical_crossentropy and one LSTM layer, based on https://github.com/fchollet/keras/blob/master/examples/lstm_text_generation.py.
Now struggling with:
All the input_dim and input_shape parameters... since each sentence has a different length (different number of tokens and labels in it), what should I put as input_dim for the input layer?
How to tell the network to use the entire token sentence for prediction, not just one token? How to predict a whole sequence of labels given a sequence of tokens, rather than just a label based on the previous tokens?
Does splitting the text into sentences or windows make any sense? Or can I just pass a vector for the entire text as a single sequence? What is a "sequence"?
What are "time slices" and "time steps"? The documentation keeps mentioning that and I have no idea how that relates to my problem. What is "time" in keras?
Basically I have trouble connecting the concepts from the documentation like "time" or "sequence" to my problem. Issues like Keras#40 didn't make me any wiser.
Pointing to relevant examples on the web or code samples would be much appreciated. Not looking for academic articles.
Thanks!
If you have sequences of different lengths you can either pad them or use a stateful RNN implementation in which the activations are saved between batches. The former is the easiest and most commonly used.
If you want to use future information with RNNs, you want a bidirectional model, where you concatenate two RNNs moving in opposite directions. An RNN uses a representation of all previous information when, e.g., predicting.
If you have very long sentences it might be useful to sample a random sub-sequence and train on that, e.g. 100 characters. This also helps with overfitting.
Time steps are your tokens. A sentence is a sequence of characters/tokens.
I've written an example of how I understand your problem, but it's not tested so it might not run. Instead of using integers to represent your data, I suggest one-hot encoding it if possible, and then using binary_crossentropy instead of mse.
from keras.models import Model
from keras.layers import Input, LSTM, TimeDistributed, Dense, merge
from keras.preprocessing import sequence
# Make sure all sequences are of the same length
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
# The input shape is your sequence length and your token embedding size (which is 1)
inputs = Input(shape=(maxlen, 1))
# Build a bidirectional RNN; return_sequences=True keeps one output per timestep.
# Note that go_backwards=True emits its outputs in reversed time order, so you may
# want to reverse the backward outputs before merging to keep timesteps aligned.
lstm_forward = LSTM(128, return_sequences=True)(inputs)
lstm_backward = LSTM(128, return_sequences=True, go_backwards=True)(inputs)
bidirectional_lstm = merge([lstm_forward, lstm_backward], mode='concat', concat_axis=2)
# Output each timestep into a fully connected layer with a linear
# output to map to an integer
sequence_output = TimeDistributed(Dense(1, activation='linear'))(bidirectional_lstm)
# Use Dense(n_classes, activation='sigmoid') instead if you want to classify
model = Model(inputs, sequence_output)
model.compile('adam', 'mse')
model.fit(X_train, y_train)