Is there any other validation metrics for trainer? - allennlp

I use the seq2seq model and it can compute BLEU score (a NMT score) every epoch. However, I cannot set BLEU score as validation metric so it cannot early stop in training. I read the source code, but there are no hints as to what kind of string could be added to the validation metrics except for "+loss". Please save me, thanks!

The default validation_metric is actually "-loss", not "+loss". The "-" means this is a metric that should be minimized, not maximized.
So to use BLEU score instead, set the validation_metric to "+BLEU".
In general, you can use any metric that's returned by your model's .get_metric() method. The name of the metric you use for validation_metric just has to match the corresponding key from the dictionary returned by .get_metric().
In your case, presumably your model's .get_metric() method returns something like this: {"BLEU": ...}, which is why validation_metric should be set to "+BLEU".

Related

How do I know which parameters to use with a pretrained Tokenizer?

I must be missing something ...
I want to use a pretrained model with HuggingFace:
transformer_name = "Geotrend/distilbert-base-fr-cased" # Or whatever model
model = AutoModelForSequenceClassification.from_pretrained(transformer_name, num_labels=5)
tokenizer = AutoTokenizer.from_pretrained(transformer_name)
Now that I have my model and my tokenizer, I need to tokenize my dataset, but I don't know which parameters (padding, truncation, max_length) to use with my Tokenizer.
Some examples just call the tokenizer tokenizer(data), others use truncation only tokenizer(data, truncation=True), and others will use many parameters tokenizer(data, padding=True, truncation=True, return_tensors='pt', max_length=512).
As I am reloading a pretrained Tokenizer, I would have love it to use the same parameters as in the original training process. How do I know which parameters to use ?
My understanding is that I always need to truncate my data and leave max_length to None so that my sequences length will always be lower than the model's maximum length. Is that it ? Does leaving max_length to None makes it backup on the model's maximum length ?
And what should I do with padding ? As I am using a Trainer object for training with a DataCollatorWithPadding should I set padding to False to reduce the memory impact and let the collator pad my batches ?
Final question : what should I do if I use a TextClassificationPipeline for inference ? Should I specify these parameters (padding, etc.) ? Will the pipeline handle it for me ?
The choice on whether to use padding and truncation depends on the model you are fine-tuning and on your training process, and not on the pretrained tokenizer.
Tranformer-based models have a constraint on the number of tokens the model can process, so generally yes that's it. Yes, when max_length is None then the maximum acceptable input length for the model is considered. (see docs).
Yes, you should not pad the input sequence if you use DataCollatorWithPadding. More about it in this video.
As you already noticed, you have to specify them yourself when you pass your input text to the pipeline.

Can HuggingFace `Trainer` be customised for curriculum learning?

I have been looking for certain features in the HuggingFace transformer Trainer object (in particular Seq2SeqTrainer) and would like to know whether they exist and if so, how to implement them, or whether I would have to write my own training loop to enable them.
I am looking to apply Curriculum Learning to my training strategy, as well as evaluating the model at regular intervals, and therefore would like to enable the following
choose in which order the model sees training samples at each epoch (it seems that the data passed onto the train_dataset argument are automatically shuffled by some internal code, and even if I managed to stop that, I would still need to pass differently ordered data at different epochs, as I may want to start training the model from easy samples for a few epochs, and then pass a random shuffle of all data for later epochs)
run custom evaluation at integer multiples of a fix number of steps. The standard compute_metrics argument of the Trainer takes a function to which the predictions and labels are passed* and the user can decide how to generate the metrics given these. However I'd like a finer level of control, for example changing the maximum sequence length for the tokenizer when doing the evaluation, as opposed to when doing training, which would require me including some explicit evaluation code inside compute_metrics which needs to access the trained model and the data from disk.
Can these two points be achieved by using the Trainer on a multi-GPU machine, or would I have to write my own training loop?
*The function often looks something like this and I'm not sure it would work with the Trainer if it doesn't have this configuration
def compute_metrics(eval_pred):
predictions, labels = eval_pred
...
You can pass custom functions to compute metrics in the training arguments

Why are metrics in AllenNLP calculated with tensors? Can I define a metric based on strings?

I'm using AllenNLP in my project, and I'm confused by the Metric: all of the metrics are calculated with tensors, include bleu and rouge. However sometime I may want to calculate the metric with strings tokenized with white spaces. The built-in metrics are calculated in a token level tokenized by BertTokenizer, and it may have a different result because of the difference of tokenization.
Currently I'm converting the tensor to tokens, putting them into a string and calculate my defined Metric in forward. The code is working now, but I wonder this may not be the right way.
What does not seem to be exactly the AllenNLP way of calculating metrics is getting them in the forward call. AllenNLP train_loop features a special blueprint function get_metrics for this.
However, if you mean that you update your metrics in forward and reset them in get_metrics, there seems to be no better way to do it.

Named Entity Recognition confidence

I need to get confidence about each extracted entity (not to print it but to get it), however, I can't find a method that returns confidences.
Firstly, I have tried using Stanford Named Entity Recognizer library on Java and this solution:
Display Stanford NER confidence score
but it doesn't work (I guess getCliqueTree method is not available). I also have tried using NLTK in Python and Stanford NER model to extract entities, but again couldn't find a way to get confidences.
I know how to do it on Spacy:
https://github.com/explosion/spaCy/issues/831
but as the author says it's inefficient.
So, can you please advise me, how to get the probabilities of each extracted entity?
Usually NER is a token level classification task.
Confidences are usually derived from each prediction, which is commonly the output of some type of softmax.
The issue then become, how can I get a confidence for a sequence of confidences?
There are multiple ways:
Entropy [Confidence is amount of information]
Average (Mean) [Confidence is the average]
Min/Max of confidences [Confidence is the min/max]
All of these give different answers, none are "better" and it really depends on your use case.
If you would like to order possible entity types, you can start with the following:
Get confidences assuming same label for each token
Get entropy for confidence (probability) sequence
Sort by entropy

Reinforce Learning: Do I have to ignore hyper parameter(?) after training done in Q-learning?

Learner might be in training stage, where it update Q-table for bunch of epoch.
In this stage, Q-table would be updated with gamma(discount rate), learning rate(alpha), and action would be chosen by random action rate.
After some epoch, when reward is getting stable, let me call this "training is done". Then do I have to ignore these parameters(gamma, learning rate, etc) after that?
I mean, in training stage, I got an action from Q-table like this:
if rand_float < rar:
action = rand.randint(0, num_actions - 1)
else:
action = np.argmax(Q[s_prime_as_index])
But after training stage, Do I have to remove rar, which means I have to get an action from Q-table like this?
action = np.argmax(self.Q[s_prime])
Once the value function has converged (values stop changing), you no longer need to run Q-value updates. This means gamma and alpha are no longer relevant, because they only effect updates.
The epsilon parameter is part of the exploration policy (e-greedy) and helps ensure that the agent visits all states infinitely many times in the limit. This is an important factor in ensuring that the agent's value function eventually converges to the correct value. Once we've deemed the value function converged however, there's no need to continue randomly taking actions that our value function doesn't believe to be best; we believe that the value function is optimal, so we extract the optimal policy by greedily choosing what it says is the best action in every state. We can just set epsilon to 0.
Although the answer provided by #Nick Walker is correct, here it's some additional information.
What you are talking about is closely related with the concept technically known as "exploration-exploitation trade-off". From Sutton & Barto book:
The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action
selections in the future. The dilemma is that neither exploration nor
exploitation can be pursued exclusively without failing at the task.
The agent must try a variety of actions and progressively favor those
that appear to be best.
One way to implement the exploration-exploitation trade-off is using epsilon-greedy exploration, that is what you are using in your code sample. So, at the end, once the agent has converged to the optimal policy, the agent must select only those that exploite the current knowledge, i.e., you can forget the rand_float < rar part. Ideally you should decrease the epsilon parameters (rar in your case) with the number of episodes (or steps).
On the other hand, regarding the learning rate, it worths noting that theoretically this parameter should follow the Robbins-Monro conditions:
This means that the learning rate should decrease asymptotically. So, again, once the algorithm has converged you can (or better, you should) safely ignore the learning rate parameter.
In practice, sometimes you can simply maintain a fixed epsilon and alpha parameters until your algorithm converges and then put them as 0 (i.e., ignore them).