Pretraining a language model on a small custom corpus - deep-learning

I was curious whether it is possible to use transfer learning for text generation, i.e. to re-train/pre-train a model on a specific kind of text.
For example, starting from a pre-trained BERT model and a small corpus of medical (or any other "type" of) text, build a language model that is able to generate medical text. The assumption is that you do not have a huge amount of medical text, which is why you have to use transfer learning.
Putting it as a pipeline, I would describe this as:
Using a pre-trained BERT tokenizer.
Obtaining new tokens from my new text and adding them to the existing pre-trained language model (i.e., vanilla BERT).
Re-training the pre-trained BERT model on the custom corpus with the combined tokenizer.
Generating text that resembles the text within the small custom corpus.
Does this sound familiar? Is it possible with hugging-face?

I have not heard of the pipeline you just mentioned. In order to construct an LM for your use case, you basically have two options:
Further training a BERT (-base/-large) model on your own corpus. This process is called domain adaptation, as also described in this recent paper. It will adapt the learned parameters of the BERT model to your specific domain (bio/medical text). Note, however, that for this setting you will still need quite a large corpus to help the BERT model update its parameters meaningfully.
Using a language model that has already been pre-trained on a large amount of domain-specific text, either from scratch or by continuing from the vanilla BERT checkpoint. As you might know, the vanilla BERT model released by Google was trained on Wikipedia and BookCorpus text. Since then, researchers have trained the BERT architecture on other domains beyond those initial collections, and you may be able to use these pre-trained models, which have a deep understanding of domain-specific language. For your case, there are models such as BioBERT, BlueBERT, and SciBERT.
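These domain-specific checkpoints can usually be loaded straight from the Hugging Face Hub. A minimal sketch, assuming BioBERT is published under the dmis-lab/biobert-v1.1 identifier (double-check the exact name on the Hub before use):

# Hedged example: loading a domain-specific BERT checkpoint from the Hugging Face Hub.
# The model identifier below is an assumption; verify it on huggingface.co.
from transformers import AutoModel, AutoTokenizer

model_id = "dmis-lab/biobert-v1.1"  # assumed BioBERT checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)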
Is it possible with hugging-face?
I am not sure whether the Hugging Face developers have a robust approach for pre-training a BERT model on custom corpora, as their code was still claimed to be a work in progress, but if you are interested in this step I suggest using Google Research's BERT code, which is written in TensorFlow and is quite robust (it was released by BERT's authors). In their README, under the "Pre-training with BERT" section, the exact process is described. This will give you a TensorFlow checkpoint, which can easily be converted to a PyTorch checkpoint if you'd like to work with PyTorch/Transformers.
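As a rough sketch of that last conversion step (the paths are placeholders for the output of Google's pre-training script; double-check the exact filenames it produces):

# Hedged sketch: loading a TensorFlow BERT checkpoint into PyTorch/Transformers.
# from_tf=True tells from_pretrained to read TensorFlow weights; a config must be supplied.
from transformers import BertConfig, BertForPreTraining

config = BertConfig.from_json_file("pretraining_output/bert_config.json")  # placeholder path
model = BertForPreTraining.from_pretrained(
    "pretraining_output/model.ckpt.index",  # placeholder TF checkpoint index file
    from_tf=True,
    config=config,
)
model.save_pretrained("bert-custom-pytorch")  # now stored as a regular PyTorch checkpoint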

It is entirely possible to both pre-train and further pre-train BERT (or almost any other model available in the Hugging Face library).
Regarding the tokenizer: if you are pre-training on a small custom corpus (and therefore using a trained BERT checkpoint), then you have to use the tokenizer that was used to train BERT. Otherwise, you will just confuse the model.
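As a rough sketch of what further pre-training (masked language modelling) on a custom corpus can look like with the Trainer API (the file path and hyperparameters are placeholders, not recommendations, and LineByLineTextDataset may be deprecated in newer transformers versions in favour of the datasets library):

# Minimal hedged sketch of further pre-training BERT with the MLM objective.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# One sentence/document per line in a plain-text file (placeholder path).
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="medical_corpus.txt",
    block_size=128,
)

# Standard MLM objective: mask 15% of tokens.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="./bert-domain-adapted",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()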
If your use case is text generation (from some initial sentence or part of a sentence), then I would advise you to check out GPT-2 (https://huggingface.co/gpt2). I haven't used GPT-2 myself, but from some basic research I think you can do:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

# The LM-head variant is needed for generation; the bare TFGPT2Model only returns hidden states.
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = TFGPT2LMHeadModel.from_pretrained('gpt2')
and follow this tutorial on how to train the GPT-2 model: https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171
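Once you have a (fine-tuned) checkpoint, generation itself goes through generate(); a small hedged example, where the prompt and sampling settings are arbitrary placeholders:

# Hedged example: sampling a continuation from GPT-2 with the TF classes used above.
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = TFGPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The patient presented with", return_tensors="tf")
output_ids = model.generate(
    inputs["input_ids"],
    max_length=50,    # total length including the prompt
    do_sample=True,   # sample instead of greedy decoding
    top_p=0.95,       # nucleus sampling
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))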
Note: I am not sure if DeBERTa-V3, for example, can be pre-trained as usual. I have checked their GitHub repo and it seems that for V3 there is no official pre-training code (https://github.com/microsoft/DeBERTa/issues/71). However, I think that using Hugging Face we can actually do it. Once I have time, I will run a pre-training script and verify this.

Related

Is FastAI performing transfer learning when calling a vision learner?

learn = vision_learner(dls, models.resnet18)
In the above code snippet, I am creating a vision learner with a ResNet-18 model using fastai and passing in a DataLoaders object containing my data.
I wonder whether this call performs any transfer learning, since I am passing my data to the vision learner.
It is important for the task I am carrying out that none is performed at this stage.
FastAI's vision_learner has a pretrained argument designed specifically for that purpose. By default it is set to True, so in your case you would want to disable it:
learn = vision_learner(dls, models.resnet18, pretrained=False)
When you create a learner (a fastai object that combines the data and a model for training), it uses transfer learning to fine-tune a pretrained model in just two lines of code:
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
If you want to make a prediction on a new image, you can use learn.predict.
If normalize and pretrained are True, this function adds a Normalization transform to the dls.
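Putting the two together, a minimal sketch (assuming the dls from the question) of training with no transfer learning at all would be to disable the pretrained weights and fit from random initialization:

# Hedged sketch: no transfer learning, weights start from random initialization.
from fastai.vision.all import *

# dls is assumed to be the DataLoaders object from the question.
learn = vision_learner(dls, resnet18, pretrained=False, metrics=error_rate)
learn.fit_one_cycle(5)  # plain training schedule; fine_tune() is aimed at pretrained models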

How does fine-tuning a transformer (T5) work?

I am using PyTorch Lightning to fine-tune a T5 transformer on a specific task. However, I was not able to understand how the fine-tuning works. I always see this code:
tokenizer = AutoTokenizer.from_pretrained(hparams.model_name_or_path)
model = AutoModelForSeq2SeqLM.from_pretrained(hparams.model_name_or_path)
I don't get how the fine-tuning is done: are they freezing the whole model and training only the head (if so, how can I change the head), or are they using the pre-trained model as a weight initialization? I have been looking for an answer for a couple of days already. Any links or help are appreciated.
If you are using PyTorch Lightning, nothing is frozen by default: the pre-trained weights serve as the initialization, and all parameters are updated during training unless you tell it otherwise. Lightning has a callback you can use to freeze your backbone and train only the head module; see BackboneFinetuning.
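A minimal sketch of that callback, assuming your LightningModule exposes the pretrained T5 as a self.backbone attribute (that attribute name is what BackboneFinetuning expects):

# Hedged sketch of Lightning's BackboneFinetuning callback.
# It freezes `model.backbone` at the start of training and unfreezes it at the given epoch.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import BackboneFinetuning

finetuning_cb = BackboneFinetuning(unfreeze_backbone_at_epoch=10)
trainer = pl.Trainer(callbacks=[finetuning_cb], max_epochs=20)
# trainer.fit(model)  # `model` is your LightningModule with a `self.backbone`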
Also check out Lightning Flash; it allows you to quickly build models for various text tasks and uses the Transformers library for the backbone. You can use its Trainer to specify which finetuning strategy you want to apply for your training.

Using OpenVINO pre-trained models with AWS SageMaker

I'm looking to deploy a pre-trained model for real-time pedestrian and/or vehicle detection using the AWS SageMaker workflow; in particular, I want to use SageMaker Neo to compile the model and deploy it on the edge. I want to use one of OpenVINO's prebuilt models from their model zoo, but when I download the model it is already in their Intermediate Representation (IR) format for their own optimizer.
Is there a way to get an OpenVINO pre-trained model that is not in IR format so that I can use it in SageMaker? Or is there any way to containerize the OpenVINO model for use in SageMaker?
If not, are there any free pre-trained models (using any of the popular frameworks like PyTorch, TensorFlow, ONNX, etc.) that I can use for vehicle detection from a traffic-camera point of view? AWS Marketplace does not seem to have much to offer in this regard.
Answers to the query are below:
No, they are only available in Intermediate Representation (IR) format.
There are a few OpenVINO pre-trained models available for vehicle detection. Check out the lists of object detection models that are relevant for vehicle detection on these GitHub pages:
https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/intel/index.md
https://github.com/openvinotoolkit/open_model_zoo/blob/master/models/public/index.md

How to use pre-trained BERT question-answering model for text extraction in Python?

So, let's say I have the following CSV dataset. I have to use a pre-trained BERT question-answering model to train, predict, and finally evaluate. As I am new to this, it would be helpful to see a similar project to understand and work on my own project; any guidance would be helpful too.
I already tried running the model individually (checking one article at a time), and that works. I need guidance on how to work with a CSV dataset and on evaluation.
Here is the format of the dataset
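Since the dataset format is not reproduced here, the following is only a hedged sketch of looping a pre-trained extractive QA model over a CSV with pandas; the column names question, context, and answer, the file path, and the checkpoint name are assumptions you would need to adapt:

# Hedged sketch: batch predictions and a crude exact-match score over a CSV.
import pandas as pd
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

df = pd.read_csv("dataset.csv")  # placeholder path; column names are assumptions
predictions = [
    qa(question=row["question"], context=row["context"])["answer"]
    for _, row in df.iterrows()
]

# Crude exact-match evaluation against a gold "answer" column, if one exists.
exact_match = sum(
    p.strip().lower() == str(g).strip().lower()
    for p, g in zip(predictions, df["answer"])
) / len(df)
print(f"Exact match: {exact_match:.2%}")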

Named entity recognition with deep Learning model

How can I do named entity recognition using deep learning? I want to build a DL model for named entity recognition.
There are many pre-trained models/libraries for Named Entity Recognition (NER); you can use Hugging Face pre-trained models, spaCy, or NLTK for this.
If you want to dive deeper and train a deep learning model from scratch, you should explore BERT.
I would also recommend going through Kaggle notebooks about Named Entity Recognition.
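As a concrete starting point for the pre-trained route above, a hedged example with the Hugging Face pipeline (aggregation_strategy requires a reasonably recent transformers version; the example sentence is arbitrary):

# Hedged example: off-the-shelf NER with the Hugging Face pipeline.
# With no model specified, the pipeline falls back to its default English NER checkpoint.
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Angela Merkel met researchers at the Mayo Clinic in Rochester, Minnesota.")
for ent in entities:
    print(ent["entity_group"], ent["word"], round(ent["score"], 3))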