Using the encoder part only from T5 model

Using the encoder part only from T5 model - deep-learning

I want to build a classification model that needs only the encoder part of language models. I have tried Bert, Roberta, xlnet, and so far I have been successful.
I now want to test the encoder part only from T5, so far, I found encT5 https://github.com/monologg/EncT5
And T5EncoderModel from HuggingFace.
Can anyone help me understand if T5EncoderModel is what I am looking for or not?
It says in the description: The bare T5 Model transformer outputting encoder’s raw hidden-states without any specific head on top.
This is slightly confusing to me, especially that encT5 mentioned that they implemented the encoder part only because it didn't exist in HuggingFace which is what makes me more doubtful here.
Please note that I am a beginner in deep learning, so please go easy on me I understand that ny questions can be naive to most of you.
Thank you

Load T5 encoder checkpoint only:
from transformers import T5EncoderModel
T5EncoderModel._keys_to_ignore_on_load_unexpected = ["decoder.*"]
auto_model = T5EncoderModel.from_pretrained("t5-base")
Note that T5 doesn't have CLS token so you should use another strategy (mean pooling, etc) for your classification task

Related

Is FastAI performing transfer learning when calling a vision learner?

learn = vision_learner(dls, models.resnet18)
In the above code snippet, I am calling a Vision Learner Resnet 18 model using FastAI and passing in a Dataloader containing my data.
I wonder if this process is performing any transfer learning within this call? As I am passing in my data to the vision learner.
It is important for the task I am carrying out that none is being performed at this stage.

FastAI's vision_learner has a pretrained argument designed specifically for that purpose. By default it is set to True, so in your case you would want to disable it:
learn = vision_learner(dls, models.resnet18, pretrained=False)

When you create a learner, which is a fastai object that combines the data and a model for training, and uses transfer learning to fine tune a pretrained model in just two lines of code:
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
If you want to make a prediction on a new image, you can use learn.predict
If normalize and pretrained are True, this function adds a Normalization transform to the dls

How does the finetune on transformer (t5) work?

I am using pytorch lightning to finetune t5 transformer on a specific task. However, I was not able to understand how the finetuning works. I always see this code :
tokenizer = AutoTokenizer.from_pretrained(hparams.model_name_or_path) model = AutoModelForSeq2SeqLM.from_pretrained(hparams.model_name_or_path)
I don't get how the finetuning is done, are they freezing the whole model and training the head only, (if so how can I change the head) or are they using the pre-trained model as a weight initializing? I have been looking for an answer for couple days already. Any links or help are appreciated.

If you are using PyTorch Lightning, then it won't freeze the head until you specify it do so. Lightning has a callback which you can use to freeze your backbone and training only the head module. See Backbone Finetuning
Also checkout Ligthning-Flash, it allows you to quickly build model for various text tasks and uses Transformers library for backbone. You can use the Trainer to specify which kind of finetuning you want to apply for your training.
Thanks

Pretraining a language model on a small custom corpus

I was curious if it is possible to use transfer learning in text generation, and re-train/pre-train it on a specific kind of text.
For example, having a pre-trained BERT model and a small corpus of medical (or any "type") text, make a language model that is able to generate medical text. The assumption is that you do not have a huge amount of "medical texts" and that is why you have to use transfer learning.
Putting it as a pipeline, I would describe this as:
Using a pre-trained BERT tokenizer.
Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
Generating text that resembles the text within the small custom corpus.
Does this sound familiar? Is it possible with hugging-face?

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use-case, you have basically two options:
Further training BERT (-base/-large) model on your own corpus. This process is called domain-adaption as also described in this recent paper. This will adapt the learned parameters of BERT model to your specific domain (Bio/Medical text). Nonetheless, for this setting, you will need quite a large corpus to help BERT model better update its parameters.
Using a pre-trained language model that is pre-trained on a large amount of domain-specific text either from the scratch or fine-tuned on vanilla BERT model. As you might know, the vanilla BERT model released by Google has been trained on Wikipedia and BookCorpus text. After the vanilla BERT, researchers have tried to train the BERT architecture on other domains besides the initial data collections. You may be able to use these pre-trained models which have a deep understanding of domain-specific language. For your case, there are some models such as: BioBERT, BlueBERT, and SciBERT.
Is it possible with hugging-face?
I am not sure if huggingface developers have developed a robust approach for pre-training BERT model on custom corpora as claimed their code is still in progress, but if you are interested in doing this step, I suggest using Google research's bert code which has been written in Tensorflow and is totally robust (released by BERT's authors). In their readme and under Pre-training with BERT section, the exact process has been declared. This will provide you with Tensorflow checkpoint, which can be easily converted to Pytorch checkpoint if you'd like to work with Pytorch/Transformers.

It is entirely possible to both pre-train and further pre-train BERT (or almost any other model that is available in the huggingface library).
Regarding the tokenizer - if you are pre-training on a a small custom corpus (and therefore using a trained bert checkpoint), then you have to use the tokenizer that was used to train Bert. Otherwise, you will just confuse the model.
If your use case is text generation (from some initial sentence/part of sentence), then I can advise you to check gpt-2 (https://huggingface.co/gpt2). I haven't used GPT-2, but with some basic research I think you can do:
from transformers import GPT2Tokenizer, TFGPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2Model.from_pretrained('gpt2')
and follow this tutorial: https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171 on how to train the gpt-2 model.
Note: I am not sure if DeBERTa-V3, for example, can be pre-trained as usual. I have checked their github repo and it seems that for V3 there is no official pre-training code (https://github.com/microsoft/DeBERTa/issues/71). However, I think that using huggingface we can actually do it. Once I have time I will run a pre-training script and verify this.

Mask R-CNN annotation tool

I’m new to deep learning and I was reading some state of art papers and I found that mask r-cnn is utterly used in segmentation and classification of images. I would like to apply it to my MSc project but I got some questions that you may be able to answer. I apologize if this isn’t the right place to do it.
First, I would like to know what are the best strategy to get the annotations. It seems kind of labor intensive and I’m not understanding if there is any easy way. Following that, I want to know if you know any annotation tool for mask r-cnn that generates the binary masks that are manually done by the user.
I hope this can turn into a productive and informative thread so any suggestion, experience would be highly appreciated.
Regards

You can use MASK-RCNN, I recommend it, is a two-stage framework, first you can scan the image and generate areas likely contain an object. And the second stage classifies the proposal drawing bounding boxes.
But the two-big question
how to train a model from scratch? And What happens when we want to
train our own dataset?
You can use annotations downloaded from the internet, or you can start creating your own annotations, this takes a lot of time!
You have tools like:
VIA GGC image annotator
http://www.robots.ox.ac.uk/~vgg/software/via/via_demo.html
it's online and you don't have to download any program. It is the one that I recommend you, save the images in a .json file, and so you can use the class of ballons that comes by default in SAMPLES in the framework MASK R-CNN, you would only have to put your json file and your images and to train your dataset.
But there are always more options, you have labellimg which is also used for annotation and is very well known but save the files in xml, you will have to make a few changes to your Class in python. You also have labelme, labelbox, etc.

Weka: Limitations on what one can output as source?

I was consulting several references to discover how I may output trained Weka models into Java source code so that I may use the classifiers I am training in actual code for research applications I have been developing.
As I was playing with Weka 3.7, I noticed that while it does output Java code to its main text buffer when use simpler classification (supervised in my case this time) methods such as J48 decision tree, it removes the option (rather, it voids it by removing the ability to checkmark it and fades the text) to output Java code for RandomTree and RandomForest (which are the ones that give me the best performance in my situation).
Note: I am clicking on the "More Options" button and checking "Output source code:".
Does Weka not allow you to output RandomTree or RandomForest as Java code? If so, why? Or if it does and just doesn't put it in the output buffer (since RF is multiple decision trees which I imagine it doesn't want to waste buffer space), how does one go digging up where in the file system Weka outputs java code by default?
Are there any tricks to get Weka to give me my trained RandomForest as Java code? Or is Serialization of the output *.model files my only hope when it comes to RF and RandomTree?
Thanks in advance to those who provide help.
NOTE: (As an addendum to the answer provided below) If you run across a similar situation (requiring you to use your trained classifier/ML model in your code), I recommend following the links posted in the answer that was provided in response to my question. If you do not specifically need the Java code for the RandomForest, as an example, de-serializing the model works quite nicely and fits into Java application code, fulfilling its task as a trained model/hardened algorithm meant to predict future unlabelled instances.

RandomTree and RandomForest can't be output as Java code. I'm not sure for the reasoning why, but they don't implement the "Sourceable" interface.
This explains a little about outputting a classifier as Java code: Link 1
This shows which classifiers can be output as Java code: Link 2
Unfortunately I think the easiest route will be Serialization, although, you could maybe try implementing "Sourceable" for other classifiers on your own.
Another, but perhaps inconvenient solution, would be to use Weka to build the classifier every time you use it. You wouldn't need to load the ".model" file, but you would need to load your training data and relearn the model. Here is a starters guide to building classifiers in your own java code http://weka.wikispaces.com/Use+WEKA+in+your+Java+code.

Solved the problem for myself by turning the output of WEKA's -printTrees option of the RandomForest classifier into Java source code.
http://pielot.org/2015/06/exporting-randomforest-models-to-java-source-code/
Since I am using classifiers with Android, all of the existing options had disadvantages:
shipping Android apps with serialized models didn't reliably work across devices
computing the model on the phone took too much resources
The final code will consist of three classes only: the class with the generated model + two classes to make the classification work.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008