Word2Vec based on Hierarchies - nltk

This is a conceptual/algorithmic question about how to move a hierarchical word structure into Word2Vec.
I have (luckily) a well-structured source, where each string comes with its hierarchical context, i.e. the strings directly above and below it. I would like Word2Vec to learn which terms are close to each other. For example, an entry can look like:
String N+1: steering assembly
String N: wheel
String N-1: tire, wheel rim, bolts
My experience with Word2Vec is limited to running text, where the analysis slides over the text itself.
Any directions would be very much appreciated.
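One possible direction, sketched under assumptions: Word2Vec only needs lists of tokens that occur together, so each hierarchy entry could be turned into a pseudo-sentence made of a parent and its children. The dictionary layout and the helper names below are hypothetical, and multi-word terms are joined with underscores so that each term stays a single token.

```python
# Hypothetical hierarchy: each key is a parent term, each value lists its children.
hierarchy = {
    "steering assembly": ["wheel"],
    "wheel": ["tire", "wheel rim", "bolts"],
}

def to_token(term):
    """Join multi-word terms so each term becomes a single vocabulary entry."""
    return term.strip().lower().replace(" ", "_")

def hierarchy_to_sentences(hierarchy):
    """Emit one pseudo-sentence per parent: the parent followed by its children."""
    sentences = []
    for parent, children in hierarchy.items():
        sentences.append([to_token(parent)] + [to_token(c) for c in children])
    return sentences

sentences = hierarchy_to_sentences(hierarchy)
```

Lists of token lists like this are the input format that gensim's Word2Vec accepts. Whether the resulting vectors are useful depends on how many entries the source has; a handful of entries will not be enough.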

Related

Find dependencies between words in a sentence

I need to find connections between words in a sentence, like the dependency parse that the spaCy library produces.
How can I achieve these results with deep learning?
I don't really understand how Hugging Face transformers work, because the library relies on a "self-attention" mechanism, which is quite a mystery to me.
Maybe I should stick to an RNN, but I don't know what kind of properties (words, lemmas, morphemes) I should pass to the NN, or how to vectorize them.
I created a sample dataset in which I store each word, its POS, tense, gender, case, number (0 if the word doesn't have the property), and the word's parent (0 if the word is the sentence root).
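A minimal sketch of how one sentence in such a dataset could be represented, assuming 1-based word positions and a parent index of 0 for the root. The `Token` class, the example sentence, and the `is_valid_tree` check are illustrative and not taken from any library.

```python
from dataclasses import dataclass

@dataclass
class Token:
    index: int  # 1-based position in the sentence
    word: str
    pos: str
    head: int   # index of the parent token, 0 for the sentence root

def is_valid_tree(tokens):
    """Check that exactly one token is the root and every token reaches it."""
    heads = {t.index: t.head for t in tokens}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            # A repeated node means a cycle; an unknown node means a dangling parent.
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True

# Hypothetical three-word sentence: "The" -> "animal" -> "sleeps" (root).
sample = [
    Token(1, "The", "DET", 2),
    Token(2, "animal", "NOUN", 3),
    Token(3, "sleeps", "VERB", 0),
]
```

With this layout, the learning target for each word is simply its `head` index, and a check like `is_valid_tree(sample)` can catch annotation mistakes before training.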
I have got a few questions:
What would be an appropriate dataset size for this problem, measured in sentences?
What kind of model do I need to solve this, and how does this model learn?
I can't figure it out, so please describe everything in as much detail as possible. Thank you!

Model suggestion: Keyword spotting

I want to detect occurrences of the word "repeat" in speech, along with the word's approximate duration. For this task, I'm planning to build a deep learning model. I have around 50 positive and 50 negative utterances (I couldn't collect more).
Initially I searched for pretrained keyword-spotting models, but couldn't find a good one.
Then I tried speech recognition models (Deep Speech), but they couldn't reliably recognize the word "repeat" because my speakers have an Indian accent. I also felt that using a full ASR model for this task would be overkill.
Now I've split each audio recording into 1-second chunks with 50% overlap and tried binary classification on each chunk, i.e. whether the chunk contains the word "repeat" or not. To build the classifier, I computed MFCC features and built a sequence model on top of them. Nothing seems to work for me.
If anyone has already worked on this kind of task, please point me to a suitable method or resources for building a DL model for it. Thanks in advance!
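For reference, the chunking step described above can be sketched as follows. This is only an illustration of the windowing, assuming the audio is already loaded as a flat list of samples; the function name and the toy sample rate are made up for the example.

```python
def split_into_chunks(samples, sample_rate, chunk_seconds=1.0, overlap=0.5):
    """Return (start_time_seconds, chunk) pairs for fixed-length overlapping windows."""
    chunk_len = int(chunk_seconds * sample_rate)
    # With 50% overlap the window advances by half a chunk each step.
    hop = max(1, int(chunk_len * (1.0 - overlap)))
    chunks = []
    for start in range(0, len(samples) - chunk_len + 1, hop):
        chunks.append((start / sample_rate, samples[start:start + chunk_len]))
    return chunks

# Toy signal: 3 seconds at a hypothetical 8 samples per second.
signal = list(range(24))
chunks = split_into_chunks(signal, sample_rate=8)
```

Keeping the start time of every chunk makes it possible to merge consecutive positive chunks afterwards and read off an approximate position and duration for the keyword.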

U-Net segmentation without having mask

I am new to deep learning and Semantic segmentation.
I have a dataset of medical images (CT) in DICOM format, in which I need to segment the tumours and the organs involved. The organs have been contoured by our physician; we call these contours the RT structure, and they are also stored in DICOM format.
As far as I know, people usually use a "mask". Does that mean I need to convert all the contoured structures in the RT structure to masks? Or can I use the information from the RT structure (.dcm) directly as my input?
Thanks for your help.
There is a dedicated library called pydicom that you need to install before you can decode and later visualise the CT images.
Now, since you want to apply semantic segmentation and you want to segment the tumours, the solution to this is to create a neural network which accepts as input a pair of [image,mask], where, say, all the locations in the mask are 0 except for the zones where the tumour is, which are marked with 1; practically your ground truth is the mask.
Of course for this you will have to implement your CustomDataGenerator() which must yield at every step a batch of [image,mask] pairs as stated above.
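To illustrate what converting a contour to a mask means, here is a minimal pure-Python sketch that rasterises one closed polygon into a 0/1 grid using a ray-casting test. It assumes the contour points are already expressed in pixel coordinates; contours read from an RT structure are normally in patient coordinates and would first have to be mapped onto the image grid. The function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at (x, y) cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def contour_to_mask(polygon, height, width):
    """Return a height x width grid of 0/1, sampling each pixel at its centre."""
    return [
        [1 if point_in_polygon(col + 0.5, row + 0.5, polygon) else 0
         for col in range(width)]
        for row in range(height)
    ]

# Hypothetical square contour given as (x, y) points in pixel coordinates.
contour = [(1, 1), (4, 1), (4, 4), (1, 4)]
mask = contour_to_mask(contour, height=5, width=5)
```

In practice an image library would do this much faster, but the result is the same kind of object: one binary mask per slice, which is then paired with the image as the ground truth.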

How is the self-attention mechanism in Transformers able to learn how the words are related to each other?

Given the sentence "The animal didn't cross the street because it was too tired", how is self-attention able to give a higher score to the word "animal" instead of the word "street" when processing "it"?
I'm wondering if that might be a consequence of the word embedding vectors fed into the network, which somehow already encode some notion of distance between the words.
Word Embeddings are first added to a Positional Encoding, which supplies information about each word's position in the sequence. Then, in each layer of the Encoder stack (6 layers, to be precise), the Embeddings undergo further transformations and are refined into better representations before they are passed on to the decoder.
The modifications applied to the Embeddings as they pass through the Encoder stack are learnable. Sometimes it may appear that some Attention heads in the top layers are doing something that looks like coreference resolution, which is what you pointed out in your example. Attending more to the word "animal" simply results in a better representation than attending to "street".
How do we know which representations are better? The one that minimizes the loss or produces a better output of course!
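To make the scoring step concrete, here is a small sketch of scaled dot-product attention in plain Python. It is a simplification: a real Transformer first multiplies the embeddings by learned query, key, and value matrices, whereas this example uses the raw vectors for all three. The 2-d vectors are hand-picked for illustration, not learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, attention_weights)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors: "it" is deliberately placed close to "animal".
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[2]))
```

Here "it" attends more to "animal" than to "street" only because the vectors were chosen that way. During training, the projections are adjusted by gradient descent until such weightings emerge wherever they reduce the loss.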

Clarification on the use of Vocab file in NER

I am learning Named Entity Recognition, and I see that the training script uses a variable called vocab, which looks like this:
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"
My guess is that the model is supposed to learn all the characters present in the text, like abcd... etc. What I don't understand is the purpose of characters like \n and \t, and of this variable in general.
Thanks in advance.
This string is the vocabulary. In the context of NLP, vocabulary is a list of all words or characters used in the training set. In your example the vocabulary is a list of characters. Specifically \n is a newline, and \t a tab.
For NER and other NLP tasks, we usually use a vocabulary to produce an embedding for each token (word or character), and these embeddings are fed to the machine learning model (nowadays, neural network architectures such as LSTMs are used to get the best results). Character-based embeddings have an advantage over word-based embeddings for OOV (out-of-vocabulary) words, i.e. words that do not appear in the training set but are encountered during inference.
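As a rough illustration of how such a vocabulary string is typically used, the sketch below maps every character to an integer id, which is what an embedding layer would then look up. Reserving id 0 for unknown characters is an assumption of this example, not something taken from the training script in the question.

```python
vocab = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ\'-/\t \n\r\x0b\x0c:"

# Reserve id 0 for characters outside the vocabulary (an assumption of this sketch).
UNK_ID = 0
char_to_id = {ch: i + 1 for i, ch in enumerate(vocab)}

def encode(text):
    """Map each character to its integer id, falling back to UNK_ID."""
    return [char_to_id.get(ch, UNK_ID) for ch in text]

ids = encode("a b\n")
```

Whitespace characters such as the space, \t, and \n get ids of their own, so the model still sees where words and lines begin and end instead of having those positions dropped or treated as unknown.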