Clarification on the use of Vocab file in NER - deep-learning

I am learning Named Entity Recognition, and I see that the training script uses a variable called vocab, which looks like this:
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"
My guess is that the model is supposed to learn all the characters present in the text, like abcd... etc. What I don't understand is the purpose of characters like \n and \t. What are they for, and what is this variable for in general?
Thanks in advance.

This string is the vocabulary. In the context of NLP, a vocabulary is the list of all words or characters used in the training set. In your example the vocabulary is a list of characters. Specifically, \n is a newline and \t is a tab; \r, \x0b and \x0c are the carriage return, vertical tab and form feed. These whitespace characters occur in raw text, so they are presumably listed so that the model can represent them like any other character.
For NER and other NLP tasks, we usually use a vocabulary to produce an embedding for each token (word or character), and these embeddings are fed to the machine learning model (nowadays, neural network architectures such as LSTMs are used to get the best results). Character-based embeddings have an advantage over word-based embeddings for OOV (out-of-vocabulary) words, i.e. words that do not appear in the training set but are encountered during inference.
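To make this concrete, here is a minimal sketch of how such a vocabulary string is typically turned into integer IDs before the embedding lookup. The names char_to_id, encode and UNK_ID, and the choice of reserving index 0 for unknown characters, are my own assumptions and not taken from your training script.

```python
# The vocabulary string from the question: every character the model knows about.
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

# Assumed convention: index 0 is reserved for characters outside the vocabulary.
UNK_ID = 0
char_to_id = {ch: i + 1 for i, ch in enumerate(vocab)}

def encode(text):
    """Map each character of `text` to its integer ID (UNK_ID if unknown)."""
    return [char_to_id.get(ch, UNK_ID) for ch in text]

# Whitespace such as "\n" gets its own ID, just like letters and digits.
ids = encode("New York\n")
```

Each ID would then select one row of an embedding matrix, so the newline and tab characters end up with embeddings of their own.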

Related

sequence to sequence model using pytorch

I have a sequence-to-sequence dataset. Each input sample is a sequence of characters (drawn from an alphabet of 20 characters, with a maximum length of 2166), and the output is a sequence of characters drawn from three labels (G, H, B). For example: OIREDSSSRTTT ----> GGGHHHHBHBBB
I would like to build a simple PyTorch model that works on this type of dataset, i.e. a model that can predict a sequence of classes. I would appreciate any suggestions or links to a simple model that does this.
Thanks
If the output sequence always has the same length as the input sequence, you might want to use a transformer encoder, because it transforms each input position while attending to its context. You can also try any model that is used for tagging: BiLSTM, BiGRU, etc.
If you want your model to be able to predict sequences of a different length (not necessarily the same as the input length), look at encoder-decoder models, such as the vanilla transformer.
You can start with the sequence tagging model from PyTorch tutorial https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html .
As @Ilya Fedorov said, you can move to transformer models for potentially better performance.
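For the same-length case, a BiLSTM tagger could look roughly like the sketch below. The class name BiLSTMTagger, the layer sizes and the use of index 0 for padding are assumptions for illustration only; the character-to-ID mapping is built from the single example in the question, not from the real 20-character alphabet.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Predicts one label per input position (sequence tagging)."""

    def __init__(self, n_chars, n_labels, emb_dim=32, hidden_dim=64):
        super().__init__()
        # Index 0 is reserved for padding, so real characters use 1..n_chars.
        self.embed = nn.Embedding(n_chars + 1, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Forward and backward states are concatenated, hence 2 * hidden_dim.
        self.out = nn.Linear(2 * hidden_dim, n_labels)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) -> logits: (batch, seq_len, n_labels)
        hidden, _ = self.lstm(self.embed(char_ids))
        return self.out(hidden)

# The single example from the question, encoded as integer IDs.
sample_in, sample_out = "OIREDSSSRTTT", "GGGHHHHBHBBB"
char_to_id = {ch: i + 1 for i, ch in enumerate(sorted(set(sample_in)))}
label_to_id = {"G": 0, "H": 1, "B": 2}

x = torch.tensor([[char_to_id[c] for c in sample_in]])    # shape (1, 12)
y = torch.tensor([[label_to_id[c] for c in sample_out]])  # shape (1, 12)

model = BiLSTMTagger(n_chars=20, n_labels=3)
logits = model(x)

# Cross-entropy over every position: flatten batch and time dimensions.
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 3), y.reshape(-1))
loss.backward()
```

For real training you would batch padded sequences and mask the padded positions in the loss (for example with the ignore_index argument of CrossEntropyLoss); that part is left out here.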

Is BertTokenizer similar to word embedding?

The idea of using BertTokenizer from huggingface really confuses me.
When I use
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode_plus("Hello")
Is the result somewhat similar to what I get when I pass
a one-hot vector representing "Hello" to a learned embedding matrix?
How is
BertTokenizer.from_pretrained("bert-base-uncased")
different from
BertTokenizer.from_pretrained("bert-large-uncased")
and the other pretrained models?
The encode_plus and encode functions tokenize your text and convert it into the input format expected by the BERT model. The resulting input_ids are integer indices into the tokenizer's vocabulary, so you can think of them as playing the role of the one-hot vector in your example. The embedding lookup itself happens inside the model, not in the tokenizer.
encode_plus returns a BatchEncoding containing input_ids, token_type_ids, and attention_mask.
The pre-trained models differ in size. The base model has 12 encoder layers and the large model has 24 (with hidden sizes of 768 and 1024, respectively).
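To illustrate how a token ID relates to the one-hot vector from the question, here is a small sketch in plain Python. The three-entry vocabulary and the embedding values are made up for the example and have nothing to do with BERT's real vocabulary or weights.

```python
# Toy vocabulary and embedding matrix (made-up values, one row per token ID).
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
embedding = [
    [0.0, 0.0],   # row 0: [UNK]
    [0.1, 0.2],   # row 1: hello
    [0.3, 0.4],   # row 2: world
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Row vector times matrix, written out explicitly."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

token_id = vocab["hello"]   # this integer is what a tokenizer hands to the model

via_one_hot = vec_times_matrix(one_hot(token_id, len(vocab)), embedding)
via_lookup = embedding[token_id]   # direct row lookup, same result
```

Multiplying the one-hot vector by the matrix and looking up row token_id give the same vector, which is why the tokenizer only needs to produce the integer.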

How is the self-attention mechanism in Transformers able to learn how the words are related to each other?

Given the sentence The animal didn't cross the street because it was too tired, how is self-attention able to give a higher score to the word animal than to the word street?
I'm wondering if that might be a consequence of the word embedding vectors fed into the network, which somehow already encode some degree of distance among the words.
Word embeddings are first added to the positional encoding, which adds information about each word's position in the sequence. Then, in each layer of the encoder stack (6 layers in the original Transformer), the embeddings undergo multiple transformations and are refined into better representations before they are passed on to the decoder.
The transformations applied to the embeddings as they pass through the encoder stack are learned. Sometimes it may appear that some attention heads in the top layers are doing something that looks like coreference resolution, which is what you pointed out in your example. Attending more to the word "animal" simply results in a better representation than attending to "street".
How do we know which representations are better? The one that minimizes the loss or produces a better output of course!
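As a rough sketch of the scoring step itself, scaled dot-product attention can be written in a few lines of plain Python. The query, key and value vectors below are hand-picked toy numbers, not learned ones; in a real Transformer they come from learned projections of the embeddings, and it is those projections that training adjusts.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights) per query."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score each key against this query, scaled by sqrt(d_k).
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

tokens = ["animal", "street", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]      # toy numbers
values = keys
queries = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]]   # query for "it" points towards "animal"

outputs, weights = attention(queries, keys, values)
it_weights = dict(zip(tokens, weights[2]))       # how much "it" attends to each token
```

Here "it" attends more to "animal" only because I chose the numbers that way. During training, gradient descent on the loss is what pushes the projections towards producing such scores.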

One-hot encoding to word2vec embedding

I'm trying to create vectors for categorical information that I have at hand. This information is intended to be used for aiding seq2seq network for NLP purposes (like summarization).
To get the idea, maybe an example would be of help:
Sample Text: shark attacks off Florida in a 1-hour span
And suppose that we have this hypothetical categorical information:
1. [animal, shark, sea, ocean]
2. [animal, tiger, jungle, mountains]
...
19. [animal, eagle, sky, mountains]
I want to feed sample text to an LSTM network token-by-token (like seq2seq networks). I'm using pre-trained GloVe embeddings as my original embeddings which are fed into the network, but also want to concatenate a dense vector to each token denoting its category.
For now, I know that I can simply use the one-hot embeddings (0-1 binary). So, for example, the first input (for shark) to the RNN network would be:
# GloVe embeddings of shark + one-hot encoding for shark, + means concatenation
[-0.323 0.213 ... -0.134 0.934 0.031 ] + [1 0 0 0 0 ... 0 0 1]
The problem is that I have an extremely large number of categories (around 20,000). After searching the Internet, it seems that people suggest using word2vec instead of one-hot vectors. But I can't grasp the underlying idea of how word2vec can represent the categorical features in this case. Does anybody have a clearer idea?
Word2Vec on its own can't be used for classification. It is just the underlying algorithm that learns the word vectors.
For classification you can use Doc2Vec or something similar.
It basically takes a list of documents, each with a unique ID assigned to it. After training, it captures relations between the documents similar to those that word2vec captures between words. Then, when you give it an unknown document, it can tell you the top n most similar documents, and if your documents have previously defined tags, you can assume that the unknown document can be labeled the same way.
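The last step (label an unknown document from its most similar neighbours) can be sketched as follows. The document vectors and tags are made up for illustration and are not real Doc2Vec output. The function name label_by_neighbours and the majority vote are my own assumptions about how you might combine the top n results.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def label_by_neighbours(query_vec, tagged_vectors, top_n=3):
    """Return the most common tag among the top_n most similar documents."""
    ranked = sorted(tagged_vectors,
                    key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    tags = [tag for _, tag in ranked[:top_n]]
    return Counter(tags).most_common(1)[0][0]

# Made-up document vectors with previously defined tags.
tagged_vectors = [
    ([0.9, 0.1], "sea"),
    ([0.8, 0.2], "sea"),
    ([0.7, 0.3], "sea"),
    ([0.1, 0.9], "jungle"),
    ([0.2, 0.8], "jungle"),
]

# Vector for an unknown document (in practice, inferred by the trained model).
unknown = [0.85, 0.15]
predicted = label_by_neighbours(unknown, tagged_vectors, top_n=3)
```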

Best way to handle OOV words when using pretrained embeddings in PyTorch

I am using pretrained word2vec embeddings in PyTorch (following code here). However, they do not seem to handle unseen words. Is there a good way to solve this?
FastText builds character n-gram vectors as part of model training. When it finds an OOV word, it sums the vectors of the character n-grams in the word to produce a vector for the word. You can find more detail here.
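A simplified sketch of that idea is below. The n-gram table holds made-up vectors, and the function names char_ngrams and oov_vector are hypothetical. The real fastText library also hashes n-grams into a fixed number of buckets, which is omitted here, so treat this as an illustration of the principle rather than of the library's implementation.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in '<' and '>' boundary markers."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

def oov_vector(word, ngram_vectors, dim, n_min=3, n_max=6):
    """Sum the vectors of all n-grams of `word` found in the table."""
    total = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        vec = ngram_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

# Made-up n-gram vectors, as if learned from other words during training.
ngram_vectors = {
    "<wh": [1.0, 0.0],
    "whe": [0.0, 1.0],
    "re>": [0.5, 0.5],
}

grams = char_ngrams("where", 3, 3)
vector = oov_vector("where", ngram_vectors, dim=2, n_min=3, n_max=3)
```

Even if "where" was never seen as a whole word, it still gets a non-zero vector because some of its n-grams were seen in other words.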