BertModel or BertForPreTraining - deep-learning

I want to use Bert only for embedding and use the Bert output as an input for a classification net that I will build from scratch.
I am not sure if I want to do finetuning for the model.
I think the relevant classes are BertModel or BertForPreTraining.
BertForPreTraining head contains two "actions":
self.predictions is MLM (Masked Language Modeling) head is what gives BERT the power to fix the grammar errors, and self.seq_relationship is NSP (Next Sentence Prediction); usually refereed as the classification head.
class BertPreTrainingHeads(nn.Module):
def __init__(self, config):
super().__init__()
self.predictions = BertLMPredictionHead(config)
self.seq_relationship = nn.Linear(config.hidden_size, 2)
I think the NSP isn't relevant for my task so I can "override" it.
what does the MLM do and is it relevant for my goal or should I use the BertModel?

You should be using BertModel instead of BertForPreTraining.
BertForPreTraining is used to train bert on Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks. They are not meant for classification.
BERT model simply gives the output of the BERT model, you can then finetune the BERT model along with the classifier that you build on top of it. For classification, if its just a single layer on top of BERT model, you can directly go with BertForSequenceClassification.
In anycase, if you just want to take the output of BERT model and learn your classifier (without fine-tuning BERT model), then you can freeze the Bert model weights using:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
for param in model.bert.bert.parameters():
param.requires_grad = False
The above code is borrowed from here

Related

How can you change (copy) the weights of 1 model (detectron2 based) to another model (HuggingFace based) in Pytorch?

I am working on Object detection in document images. I have a model working with DETR-Resnet-50 on custom data
I also found out that Layout Parser Faster RCNN - Mask RCNN models use Resnet 50 as a backbone. And since these are already trained on document images, fine tuning these would give better results.
I think one way to do this would be to do like:
model = torch.hub.load('facebookresearch/detr:main', 'detr_resnet50', pretrained=True)
for name, param in model.named_parameters():
print(name, param.shape)
#param.data = weight
and when the layers name seem similar, you change the weights manually.
Now the problem is that Layout Parser models are in detectron2 and my DETR models are in HuggingFace. How can I change the weights of backbone (resnet50) in DETR so that they are initialised with those weights instead of imagenet ones?

Is BertTokenizer similar to word embedding?

The idea of using BertTokenizer from huggingface really confuses me.
When I use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode_plus("Hello")
Does the result is somewhat similar to when I pass
a one-hot vector representing "Hello" to a learning embedding matrix?
How is
BertTokenizer.from_pretrained("bert-base-uncased")
different from
BertTokenizer.from_pretrained("bert-**large**-uncased")
and other pretrained?
The encode_plus and encode functions tokenize your texts and prepare them in a proper input format of the BERT model. Therefore you can see them similar to the one-hot vector in your provided example.
The encode_plus returns a BatchEncoding consisting of input_ids, token_type_ids, and attention_mask.
The pre-trained model differs based on the number of encoder layers. The base model has 12 encoders, and the large model has 24 layers of encoders.

Proper way to extract embedding weights for CBOW model?

I'm currently trying to implement the CBOW model on managed to get the training and testing, but am facing some confusion as to the "proper" way to finally extract the weights from the model to use as our word embeddings.
Model
class CBOW(nn.Module):
def __init__(self, config, vocab):
self.config = config # Basic config file to hold arguments.
self.vocab = vocab
self.vocab_size = len(self.vocab.token2idx)
self.window_size = self.config.window_size
self.embed = nn.Embedding(num_embeddings=self.vocab_size, embedding_dim=self.config.embed_dim)
self.linear = nn.Linear(in_features=self.config.embed_dim, out_features=self.vocab_size)
def forward(self, x):
x = self.embed(x)
x = torch.mean(x, dim=0) # Average out the embedding values.
x = self.linear(x)
return x
Main process
After I run my model through a Solver with the training and testing data, I basically told the train and test functions to also return the model that's used. Then I assigned the embedding weights to a separate variable and used those as the word embeddings.
Training and testing was conducted using cross entropy loss, and each training and testing sample is of the form ([context words], target word).
def run(solver, config, vocabulary):
for epoch in range(config.num_epochs):
loss_train, model_train = solver.train()
loss_test, model_test = solver.test()
embeddings = model_train.embed.weight
I'm not sure if this is the correct way of going about extracting and using the embeddings. Is there usually another way to do this? Thanks in advance.
Yes, model_train.embed.weight will give you a torch tensor that stores the embedding weights. Note however, that this tensor also contains the latest gradients. If you don't want/need them, model_train.embed.weight.data will give you the weights only.
A more generic option is to call model_train.embed.parameters(). This will give you a generator of all the weight tensors of the layer. In general, there are multiple weight tensors in a layer and weight will give you only one of them. Embedding happens to have only one, so here it doesn't matter which option you use.

Randomness in VGG16 prediction

I am using the VGG-16 network available in pytorch out of the box to predict some image index. I found out that for same input file, if i predict multiple time, I get different outcome. This seems counter-intuitive to me. Once the weights are predicted ( since I am using the pretrained model) there should not be any randomness at any step, and hence multiple run with same input file shall return same prediction.
Here is my code:
import torch
import torchvision.models as models
VGG16 = models.vgg16(pretrained=True)
def VGG16_predict(img_path):
transformer = transforms.Compose([transforms.CenterCrop(224),transforms.ToTensor()])
data = transformer(Image.open(img_path))
output = softmax(VGG16(data.unsqueeze(0)), dim=1).argmax().item()
return output # predicted class index
VGG16_predict(image)
Here is the image
Recall that many modules have two states for training vs evaluation: "Some models use modules which have different training and evaluation behavior, such as batch normalization. To switch between these modes, use model.train() or model.eval() as appropriate. See train() or eval() for details." (https://pytorch.org/docs/stable/torchvision/models.html)
In this case, the classifier layers include dropout, which is stochastic during training. Run VGG16.eval() if you want the evaluations to be non-random.

Negative Sampling Skip Gram model without one hot vector input

I have pairs of movie witch contains 2783 features.
The vector is defined as: if the feature is in the movie it's 1 otherwise it's 0.
Example :
movie 1 = [0,0,1,0,1,0,1 ...] & movie 2 = [1,0,1,1,1,0,1 ...]
Each pair has for label 1 or 0.
movie1,movie2=0
movie1,movie4=1
movie2,movie150=0
The input is similar to SGNS (Skip gram negative sampling) word2vec model.
My goal is to find similarity between programs and learn embedding of each movie.
I'd to build a kind of 'SGNS implementation with keras'. However my input is not one hot and I can't use the Embedding layers. I tried to use Dense layers and merge them with a dot product. I'm not sure about the model architecture and I got errors.
from keras.layers import Dense,Input,LSTM,Reshape
from keras.models import Model,Sequential
n_of_features = 2783
n_embed_dims = 20
# movie1 vectors
word= Sequential()
word.add(Dense(n_embed_dims, input_dim=(n_words,)))
# movie2 vectors
context = Sequential()
context.add(Dense(n_embed_dims, input_dim=n_words,))
model = Sequential()
model.add(keras.layers.dot([word, context], axes=1))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop',
loss='mean_squared_error')
If someone has an idea how to implement it.
If you're not wedded to Keras, you could probably model this by turning each movie into a synthetic 'document' with tokens for each feature-that-is-present. Then, use a 'Paragraph Vectors' implementation in pure PV-DBOW mode to learn a vector for each movie.
(In pure PV-DBOW, dense doc-vectors are learned to predict each word in a document, without regard to order/word-adjacency/etc. It is a bit like skip-gram, but the training pairs are not 'word to every nearby word' but 'doc-token to every in-doc word'.)
In gensim, the Doc2Vec class with initialization parameter dm=0 uses pure PV-DBOW mode.