Latent Dirichlet Allocation Solution Example - lda

I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop some intuition for LDA. However, I still don't fully understand the various calculations that go on inside it. Could someone show me the calculations using a very small corpus (say 3-5 sentences and 2-3 topics)?

Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:
I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
Then he does some "calculations"
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
And takes guesses at the topics:
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
at which point, you could interpret topic A to be about food
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
at which point, you could interpret topic B to be about cute animals
Your question is: how did he come up with those numbers? Which words in these sentences carry "information"?
broccoli, bananas, smoothie, breakfast, munching, eat
chinchilla, kitten, cute, adopted, hamster
Now let's go sentence by sentence, counting the words from each topic:
food 3, cute 0 --> food
food 5, cute 0 --> food
food 0, cute 3 --> cute
food 0, cute 2 --> cute
food 2, cute 2 --> 50% food + 50% cute
So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" towards the food topic.
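To make the arithmetic explicit, here is a minimal Python sketch that reproduces those sentence-by-sentence counts. The two keyword lists are hand-picked assumptions for this illustration (including "spinach", which has to be counted to reach 5 in sentence 2); real LDA infers word-topic distributions rather than taking them as input.

# Hand-picked "informative" words (an assumption for illustration only).
food = {"eat", "ate", "broccoli", "bananas", "banana", "spinach",
        "smoothie", "breakfast", "munching"}
cute = {"chinchillas", "kittens", "kitten", "cute", "adopted", "hamster"}

sentences = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

for s in sentences:
    words = s.lower().replace(".", "").split()
    f = sum(w in food for w in words)
    c = sum(w in cute for w in words)
    total = f + c
    print(f"food {f}, cute {c} -> {f/total:.0%} food, {c/total:.0%} cute")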
We made two calculations in our heads:
to look at the sentences and come up with 2 topics in the first place. LDA does this by treating each document as a "mixture" of topics and estimating the parameters of each topic.
to decide which words are important. A common way to capture this notion is "term-frequency/inverse-document-frequency" weighting, which down-weights words that appear in most documents.

LDA Procedure
Step 1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand).
Step 2: This random assignment already gives topic representations of all documents and word distributions of all topics, albeit not very good ones.
So, to improve upon them:
For each document d, go through each word w and compute:
p(topic t | document d): the proportion of words in document d that are currently assigned to topic t
p(word w | topic t): the proportion of assignments to topic t, over all documents, that come from word w
Step 3: Reassign word w a new topic t', where we choose topic t' with probability
p(topic t' | document d) * p(word w | topic t')
This is the probability, under the generative model, that topic t' generated word w.
We iterate this last step many times over every document in the corpus until the assignments reach a steady state.
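A minimal Python sketch of this procedure (a plain collapsed-Gibbs-style loop). The smoothing constants alpha and beta, the iteration count, and the toy corpus built from Chen's example words are assumptions for illustration, not part of the description above.

import random
from collections import defaultdict

def lda_gibbs(docs, K, iters=50, alpha=0.1, beta=0.01):
    """docs: list of token lists. K: number of topics."""
    vocab = {w for doc in docs for w in doc}
    V = len(vocab)

    # Step 1: random initial topic assignment for every word occurrence.
    assign = [[random.randrange(K) for _ in doc] for doc in docs]

    # Count tables implied by that assignment.
    doc_topic = [defaultdict(int) for _ in docs]        # counts of topic t in document d
    topic_word = [defaultdict(int) for _ in range(K)]   # counts of word w in topic t
    topic_total = [0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assign[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    # Steps 2-3: repeatedly reassign each word.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assign[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # p(topic k | document d) * p(word w | topic k), with smoothing.
                weights = [
                    (doc_topic[d][k] + alpha) / (len(doc) - 1 + K * alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(K)
                ]
                t_new = random.choices(range(K), weights=weights)[0]
                # Record the new assignment.
                assign[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1
    return doc_topic, topic_word

docs = [["eat", "broccoli", "bananas"], ["banana", "smoothie", "breakfast"],
        ["chinchillas", "kittens", "cute"], ["adopted", "kitten"],
        ["cute", "hamster", "munching", "broccoli"]]
doc_topic, topic_word = lda_gibbs(docs, K=2)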
Solved calculation
Let's say you have two documents.
Doc i: “The bank called about the money.”
Doc ii: “The bank said the money was approved.”
After removing stop words, capitalization, and punctuation, the unique words in the corpus are:
bank, called, money, said, approved
Next, randomly assign each word in both documents to one of the two topics (k=1 or k=2), and tally the counts: for each document, how many of its words are assigned to each topic, and for each topic, how many times each word is assigned to it.
Then we select a word from doc i (say the word bank, currently assigned to topic 1), remove that assignment from the counts, and calculate the probability of each possible new assignment.
For topic k=1: compute p(topic 1 | doc i) and p(bank | topic 1)
For topic k=2: compute p(topic 2 | doc i) and p(bank | topic 2)
Now we calculate the product of those two probabilities for each topic.
Topic 2 is a better fit for both the document and the word (in the usual picture each product is drawn as the area of a rectangle, and topic 2's area is greater), so our new assignment for the word bank will be topic 2.
Now we update the counts to reflect the new assignment.
Then we repeat the same reassignment step, iterating through every word of the whole corpus.
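As a concrete illustration, since the original count tables (images in the original post) are not reproduced here, the numbers below come from one assumed random initialization, just to make the arithmetic visible:

# Assumed initial assignments (doc, word) -> topic; any other random start works the same way.
assign = {("i", "bank"): 1, ("i", "called"): 2, ("i", "money"): 2,
          ("ii", "bank"): 2, ("ii", "said"): 1, ("ii", "money"): 2, ("ii", "approved"): 2}

# Resample "bank" in doc i: drop its current assignment from the counts first.
others = {k: t for k, t in assign.items() if k != ("i", "bank")}

for k in (1, 2):
    # p(topic k | doc i): share of doc i's remaining words assigned to topic k.
    doc_i = [t for (d, _), t in others.items() if d == "i"]
    p_topic_given_doc = doc_i.count(k) / len(doc_i)
    # p(bank | topic k): share of topic k's assignments (across both docs) that are "bank".
    topic_k = [w for (_, w), t in others.items() if t == k]
    p_word_given_topic = topic_k.count("bank") / len(topic_k)
    print(k, p_topic_given_doc * p_word_given_topic)

# -> topic 1: 0.0 * 0.0 = 0.0, topic 2: 1.0 * 0.2 = 0.2, so "bank" moves to topic 2.
# (Real implementations add small smoothing constants so no probability is exactly zero.)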

Related

Looking for dataset for sentiment analysis that consists of sentences with slang words

I am developing a machine learning model to predict the sentiment polarity of customers' comments about a product.
Currently, I use the pretrained twitter-roberta-base-sentiment as the base model.
It works well most of the time, except when the text contains slang words.
For example, it wrongly predicts "The product is idiot proof." as Negative.
So I want to add some labeled example sentences containing slang words to the training dataset, in order to improve the model's performance on such sentences.
For example:
[
{"doc":"I am having a blast with this game.", "sentiment": "Postive"},
{"doc":"This game is like pigeon chess", "sentiment": "Negative"},
...
]
I found SlangSD, a sentiment lexicon of slang words. For my project, it has two drawbacks as a training dataset:
it has only words, not sentences, in each entry;
it contains not only slang words but also many ordinary words, such as "have", "project", "dictionary", etc.
I don't know what degree of slang you are targeting, but by removing from SlangSD every term that also appears in a common English dictionary you should get a list of true slang terms.
Then, scraping a movie/game/forum website and keeping only the comments/posts that contain terms from your new slang list could do the trick, I believe (giving you a set of sentences with slang terms). For the labels, it would be imperfect, but quite viable I think, to give each sentence the same label as the SlangSD word it contains.
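A minimal sketch of that filtering and labeling step. The file names and formats are hypothetical: slangsd.txt as "word<TAB>score" lines, words.txt as one ordinary English word per line, and scraped.txt as one scraped sentence per line.

slangsd = {}
with open("slangsd.txt", encoding="utf-8") as f:
    for line in f:
        word, score = line.rstrip("\n").split("\t")
        slangsd[word.lower()] = float(score)

with open("words.txt", encoding="utf-8") as f:
    common = {line.strip().lower() for line in f}

# Keep only terms that are NOT ordinary dictionary words.
true_slang = {w: s for w, s in slangsd.items() if w not in common}

# Label each scraped sentence with the polarity of the slang terms it contains.
dataset = []
with open("scraped.txt", encoding="utf-8") as f:
    for sentence in f:
        words = sentence.lower().split()
        hits = [true_slang[w] for w in words if w in true_slang]
        if hits:
            score = sum(hits) / len(hits)
            label = "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"
            dataset.append({"doc": sentence.strip(), "sentiment": label})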

I need to turn the texts into vectors then feed the vectors into a classifier

I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, with 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s for training and 0s for testing), and then feed the vectors into a classifier such as a random forest.
For such a task you'll need to parse your data first with different tools. First lower-case all your sentences. Then delete all stop words (the, and, or, ...). Tokenize (an introduction here: https://medium.com/@makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming in order to keep only the root of each word, which can be helpful for sentiment classification.
Then you'll assign an index to each word of your vocabulary and replace the words in your sentences by these indexes:
Imagine your vocabulary is : ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['None'] = 0 #in case a new word is not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1 2 3].
However, you have to define a maximum length max_len for your sentences, and when a sentence contains fewer words than max_len you pad the vector with zeros up to size max_len.
In the previous example, if max_len = 5 then [1 2 3] -> [1 2 3 0 0].
This is a basic approach. Feel free to check preprocessing tools provided by libraries such as NLTK, Pandas ...
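A minimal end-to-end sketch of that pipeline with scikit-learn. Assumptions: movie_reviews.csv has two comma-separated columns (label, text) with no header, the stop-word list is a tiny example, and a random train/test split is used instead of the label-based split proposed in the question (a classifier needs both classes during training).

# Read the CSV, build a word index, encode + pad, then train a random forest.
import csv
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

stopwords = {"the", "and", "or", "a", "an", "of", "to", "is", "it", "for"}

labels, token_lists = [], []
with open("movie_reviews.csv", encoding="utf-8") as f:
    for label, text in csv.reader(f):
        tokens = [w for w in text.lower().split() if w not in stopwords]
        labels.append(int(label))
        token_lists.append(tokens)

# Index 0 is reserved for padding / unknown words.
vocab = {w for tokens in token_lists for w in tokens}
index = {w: i + 1 for i, w in enumerate(sorted(vocab))}

max_len = 30
def encode(tokens):
    ids = [index.get(w, 0) for w in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))          # pad with zeros up to max_len

X = [encode(t) for t in token_lists]
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))

Feeding padded word indices into the forest follows the encoding described above; a bag-of-words representation (e.g. scikit-learn's CountVectorizer) is a common alternative for tree-based classifiers.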

How to build deep learning model that picks words from serval distinct bags and forms a meaningful sentence [closed]

[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after another, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
To understand what a bag is,
consider a vocabulary of 10,000 words.
The first bag contains the words Hello, India, Manager.
That is, Bag 1 will have 1's at the indices of the words present in the bag.
For example, Bag 1 will be of size 10000*1.
If Hello's index is 1, India's index is 2, and Manager's is 4,
it will be
[0, 1, 1, 0, 1, 0, 0, 0, 0, .........]
I don't have a model yet.
I'm thinking of using story books, but it's still kind of abstract for me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1),
word 2 (word from bag 2), and so on, and they must form a MEANINGFUL sentence in their numerical order!
First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means at this stage we need to decide what we are teaching the computer to begin with (i.e. what a verb, noun, or grammar is), but I will assume we just dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert words to numbers (one way would be to number them alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time!) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of weight matrices to multiply into the keys of the dictionary, then discard all outputs beyond the range of dictionary keys; the rest can be used to look up values in the dictionary and make a sentence. Optionally, you can also apply a threshold to all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): you pick the 50 most meaningful sentences (might be hard at first) and use their matrices as the basis for the next 100 (the easiest way would be to take 100 randomly weighted means of the 50 matrices), as sketched below.
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that it will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices), and/or I recently heard of a different technique that dynamically changes the network, but I can't really help with that.
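A toy sketch of that evolutionary loop, with a stand-in scoring function where the human judgement would go. The vocabulary, sentence length, seed keys, and the scoring stub are all assumptions made up for illustration.

import numpy as np

rng = np.random.default_rng(0)

vocab = ["i", "love", "india", "hello", "manager", "games", "play"]  # toy dictionary
key_of = {i + 1: w for i, w in enumerate(vocab)}                     # keys start at 1
sent_len = 4

def decode(matrix, seed_keys):
    """Multiply the seed keys by a weight matrix, clip to valid keys, look up words."""
    out = matrix @ np.array(seed_keys, dtype=float)
    keys = np.clip(np.rint(out).astype(int), 1, len(vocab))
    return " ".join(key_of[k] for k in keys)

def human_score(sentence):
    # Stand-in for a person rating how meaningful the sentence is (0..1).
    return rng.random()

seed = [1, 2, 3, 4]                                   # arbitrary input keys
population = [rng.normal(size=(sent_len, sent_len)) for _ in range(100)]

for generation in range(20):
    scored = sorted(population, key=lambda m: human_score(decode(m, seed)), reverse=True)
    best = scored[:50]
    # Breed: each child is a randomly weighted mean of the 50 best matrices.
    population = []
    for _ in range(100):
        w = rng.random(50)
        w /= w.sum()
        population.append(sum(wi * m for wi, m in zip(w, best)))

print(decode(population[0], seed))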
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad them to the desired length, one long enough to contain all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as string:
indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences,length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before the 10 words in the bag are satisfied. You may decide to finish the process with a shorter sentence then, especially if other word probabilities are low.
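Putting those last steps together, a sketch of the generation loop. It reuses newModel, word_dic, reverse_dic and dictionary_size from the steps above; the helper pick_from_bag and the bags array (shape (10, dictionary_size), one multi-hot row per bag) are names introduced here for the sketch.

import numpy as np

def pick_from_bag(probs, bag):
    """Zero out words that are not in the bag, then take the most probable one."""
    masked = probs.reshape(-1) * bag
    return int(masked.argmax())

sentence = []
newModel.reset_states()
current = np.zeros((1, 1, dictionary_size))
current[0, 0, word_dic["_start_"]] = 1.0

for bag in bags:
    probs = newModel.predict(current)            # shape (1, 1, dictionary_size)
    idx = pick_from_bag(probs, bag)
    if idx == word_dic["_end_"]:
        break                                    # model thinks the sentence is finished
    sentence.append(reverse_dic[idx])
    current = np.zeros((1, 1, dictionary_size))  # feed the chosen word back in
    current[0, 0, idx] = 1.0

print(" ".join(sentence))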

Find similarity of a sentence with 6 basic emotions using wordnet

I'm working on a project, and a part of it needs to detect the emotion of the text we work on.
For example:
He is happy to go home.
I'll be taking two words from the above sentence, i.e. happy and home.
I'll have a table containing 6 basic emotions (Happy, Sad, Fear, Anger, Disgust, Surprise).
Each of these emotions will have some synsets associated with them.
I need to find the similarity between these synsets and the word happy, and then the similarity between these synsets and the word home.
I tried to use WordNet for this purpose but couldn't understand how WordNet works, as I'm new to this.
I think you want to find words in the sentence that are similar to any of the words representing the 6 basic emotions. If so, you can use the following approach.
First extract the synset of each word sense representing the 6 basic emotions. Now form a vectorized representation of each of these synsets (collections of synonymous words). You can do this using the word2vec tool available at https://code.google.com/archive/p/word2vec/ . For example:
Suppose "happy" has the word senses a1, a2, a3 as its synonymous words. Then:
1. First train the word2vec tool on any large English corpus, e.g. the Bojar corpus.
2. Then, using the trained word2vec model, obtain word embeddings (vectorized representations) of each synonymous word a1, a2, a3.
3. The vectorized representation of the synset of "happy" is then the average of the vectorized representations of a1, a2, a3.
4. In this way you can get a vectorized synset representation for each of the 6 basic emotions.
Now, for a given sentence, find the vectorized representation of each word using the trained word2vec vocabulary. You can then use cosine similarity
(https://en.wikipedia.org/wiki/Cosine_similarity) to find the distance (similarity) of each word from the synsets of the 6 basic emotions. In this way you can determine the (basic-level) emotion of the sentence.
Source of the technique: the research paper "Unsupervised Most Frequent Sense Detection using Word Embeddings" by Sudha et al. (http://www.aclweb.org/anthology/N15-1132)
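A minimal sketch of those steps using gensim and NLTK's WordNet interface. Assumptions: pretrained word2vec vectors are loaded from a placeholder file path instead of training from scratch, and NLTK's wordnet data has been downloaded.

import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Placeholder path to any pretrained word2vec vectors in the standard binary format.
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def synset_vector(word):
    """Average the embeddings of all synonyms (lemmas) in the word's synsets."""
    lemmas = {l.lower() for s in wn.synsets(word) for l in s.lemma_names()}
    vecs = [kv[l] for l in lemmas if l in kv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emotions = ["happy", "sad", "fear", "anger", "disgust", "surprise"]
emotion_vecs = {e: synset_vector(e) for e in emotions}

sentence = "He is happy to go home."
for w in sentence.lower().rstrip(".").split():
    if w not in kv:
        continue
    sims = {e: cosine(kv[w], v) for e, v in emotion_vecs.items() if v is not None}
    best = max(sims, key=sims.get)
    print(w, "->", best, round(sims[best], 3))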

topic modeling using keywords for topics

I need to do topic modeling in the following manner, e.g.:
I need to extract 5 topics from a document (a single document). I have keywords for the 5 topics, and related to these 5 keyword sets I need to extract the topics.
The keywords for the 5 topics are:
keyword 1 - (car, motorsport, ...)
keyword 2 - (accident, insurance, ...)
......
The corresponding output should be:
Topic 1 - (vehicle, torque, speed, ...)
Topic 2 - (claim, amount, ...)
How could this be done?
A good place to start would be this LDA topic modelling library written for use with NodeJS.
https://www.npmjs.org/package/lda
var lda = require('lda');
// Example document.
var text = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';
// Extract sentences.
var documents = text.match( /[^\.!\?]+[\.!\?]+/g );
// Run LDA to get terms for 2 topics (5 terms each).
var result = lda(documents, 2, 5);
The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"):
Topic 1
cats (0.21%)
dogs (0.19%)
small (0.1%)
mice (0.1%)
chase (0.1%)
Topic 2
dogs (0.21%)
cats (0.19%)
big (0.11%)
eat (0.1%)
bones (0.1%)
That should get you started down the path. Please note, you will likely have to play with the number of topics and documents to tune them for the amount of information you are looking to extract.
This isn't magic.
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation