I need to do topic modeling in the following manner:
eg:
I need to extract 5 topics from a single document. I have keywords for each of the 5 topics, and related to these keywords I need to extract the topic terms.
The keywords for 5 topics being:
keyword 1-(car,motorsport,...)
keyword 2-(accident,insurance,...)
......
The corresponding output should be:
Topic 1-(vehicle,torque,speed...)
Topic 2-(claim,amount,....)
How could this be done?
A good place to start would be this LDA topic modelling library written for use with NodeJS.
https://www.npmjs.org/package/lda
var lda = require('lda');
// Example document.
var text = 'Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.';
// Extract sentences.
var documents = text.match( /[^\.!\?]+[\.!\?]+/g );
// Run LDA to get terms for 2 topics (5 terms each).
var result = lda(documents, 2, 5);
The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"):
Topic 1
cats (0.21%)
dogs (0.19%)
small (0.1%)
mice (0.1%)
chase (0.1%)
Topic 2
dogs (0.21%)
cats (0.19%)
big (0.11%)
eat (0.1%)
bones (0.1%)
That should get you started down the path. Please note, you will likely have to play with the number of topics and documents to tune them for the amount of information you are looking to extract.
This isn't magic.
http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
Related
I want to get the topic coherence for the LDA model. Let's say I have two LDA models, one with a bag of words and the second one with a bag of phrases. How can I get the coherence for these two models and then compare them on the basis of coherence?
For two separate models you can just check coherence separately. You should post some code, but this is how to check coherence:
from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=processed_docs, dictionary=dictionary, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
If you want a comparison, check out the elbow method for optimizing coherence. I hope this helps.
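For example, a minimal gensim sketch comparing the two (the names lda_bow, lda_phrases, texts_bow, texts_phrases, dict_bow and dict_phrases are placeholders for your own trained models, tokenized corpora and dictionaries):
from gensim.models import CoherenceModel

# Placeholders: two trained LDA models plus the matching tokenized texts and dictionaries
# (one pair built on a bag of words, the other on a bag of phrases).
def coherence(model, texts, dictionary):
    cm = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
    return cm.get_coherence()

print('Bag of words coherence:   ', coherence(lda_bow, texts_bow, dict_bow))
print('Bag of phrases coherence: ', coherence(lda_phrases, texts_phrases, dict_phrases))
For the elbow method you would compute this score over a range of num_topics values and look for the point where coherence stops improving.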
[Image: bags and how to choose from them]
Imagine I have 10 bags, ordered one after another, i.e. Bag 1, Bag 2, ..., Bag n.
Each bag has a distinct set of words.
To understand what a bag is, consider a vocabulary of 10,000 words.
The first bag contains the words Hello, India, Manager.
That is, Bag 1 will have 1's at the indices of the words present in the bag.
ex: Bag 1 will be of size 10000*1.
If Hello's index was 1, India's index was 2 and Manager's was 4,
it will be
[0, 1, 1, 0, 1, 0, 0, 0, 0, .........]
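In code, that indicator vector might look like this (a minimal numpy sketch; the indices are the ones assumed in the example above):
import numpy as np

vocab_size = 10000
bag1_word_indices = [1, 2, 4]        # assumed indices: Hello -> 1, India -> 2, Manager -> 4

bag1 = np.zeros(vocab_size)          # size 10000*1
bag1[bag1_word_indices] = 1          # 1's at the indices of the words present in the bag

print(bag1[:10])                     # [0. 1. 1. 0. 1. 0. 0. 0. 0. 0.]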
*I don't have a model yet.
*I'm thinking of using story books, but it's still kind of abstract for me.
A word has to be chosen from each bag and assigned a number: word 1 (word from bag 1), word 2 (word from bag 2), and so on, and they must form a MEANINGFUL sentence in their numerical order!
First, we need a way for the computer to recognise a word, otherwise it cannot pick the correct one. That means at this stage we need to decide what we are teaching the computer to begin with (i.e. what is a verb, a noun, grammar), but I will assume we will dump a dictionary into it and give no information except the words themselves.
So that the computer can compute what sentences are, we need to convert them to numbers (one way would be to work alphabetically starting at 1, using the numbers as keys for a dictionary (a digital one this time!) and the words as the values). Now we can apply the same linear algebra techniques to this problem as to any other problem.
So we need to make generations of matrices of weights to multiply into the keys of the dictionary, then remove all the weights beyond the range of dictionary keys; the rest can be used to look up values in the dictionary and make a sentence. Optionally, you can also subtract a threshold value from all the outputs of the matrix multiplication.
Now for the hard part: learning. Once you have a few (say 100) matrices, we need to "breed" the best ones (this is where human intervention is needed): you need to pick the 50 most meaningful sentences (might be hard at first) and use them as the basis for your next 100 (the easiest way would be to combine the 50 matrices with random weights in a weighted mean, 100 times).
And the boring bit: keep running the generations over and over until you get to a point where your sentences are meaningful most of the time (of course there is no guarantee that they will always be meaningful, but that's the nature of ANNs).
If you find it doesn't work, you can use more layers (more matrices), and/or I recently heard of a different technique that dynamically changes the network, but I can't really help with that.
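As a very rough sketch of that loop (not a working language model): the toy dictionary, sentence length, population size and the score() function below are all placeholder assumptions, with score() standing in for the human judgement step.
import numpy as np

words = sorted(["cats", "dogs", "like", "to", "chase", "mice", "eat", "bones"])
word_key = {w: i + 1 for i, w in enumerate(words)}   # alphabetical keys starting at 1
key_word = {k: w for w, k in word_key.items()}

SENT_LEN, POP_SIZE, KEEP = 4, 100, 50

def decode(matrix, seed_keys):
    # Multiply the weight matrix into the keys, drop outputs outside the key range,
    # and map the rest back to words.
    keys = np.rint(matrix @ seed_keys).astype(int)
    return " ".join(key_word[k] for k in keys if k in key_word)

def score(sentence):
    # Placeholder for "pick the 50 most meaningful sentences" by hand.
    return len(sentence.split())

seed = np.array([word_key[w] for w in ("cats", "like", "to", "chase")], dtype=float)
population = [np.random.randn(SENT_LEN, SENT_LEN) for _ in range(POP_SIZE)]

for generation in range(20):
    ranked = sorted(population, key=lambda m: score(decode(m, seed)), reverse=True)
    parents = ranked[:KEEP]
    population = []
    for _ in range(POP_SIZE):                        # breed 100 children as random weighted means
        w = np.random.rand(KEEP)
        population.append(sum(wi * p for wi, p in zip(w / w.sum(), parents)))

best = max(population, key=lambda m: score(decode(m, seed)))
print(decode(best, seed))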
Have a database with thousands/millions of valid sentences.
Create a dictionary where each word represents a number (reserve 0 for "nothing", 1 for "start of sentence" and 2 for "end of sentence").
word_dic = { "_nothing_": 0, "_start_": 1, "_end_": 2, "word1": 3, "word2": 4, ...}
reverse_dic = {v:k for k,v in word_dic.items()}
Remember to add "_start_" and "_end_" at the beginning and end of all sentences in the database, and "_nothing_" after the end to pad to a desired length capable of containing all sentences. (Ideally, work with sentences of 10 or fewer words, so your model won't try to create bigger sentences.)
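For example, one way to do that padding (a small sketch; raw_sentences and max_len are assumptions):
import numpy as np

raw_sentences = ["word1 word2 word3"]   # placeholder: your list of sentence strings
max_len = 12                            # 10 words + _start_ + _end_

def pad_sentence(words, max_len):
    s = ["_start_"] + words + ["_end_"]
    return s + ["_nothing_"] * (max_len - len(s))

database = np.array([pad_sentence(s.split(), max_len) for s in raw_sentences])
# database now has shape (sentences, length) as strings, matching the code below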
Transform all your sentences into sequences of indices:
#supposing you have an array of shape (sentences, length) as strings:
indices = []
for word in database.reshape((-1,)):
    indices.append(word_dic[word])
indices = np.array(indices).reshape((sentences,length))
Transform this into categorical words with the keras function to_categorical()
cat_sentences = to_categorical(indices) #shape (sentences,length,dictionary_size)
Hint: keras has lots of useful text preprocessing functions here.
Separate training input and output data:
#input is the sentences except for the last word
x_train = cat_sentences[:,:-1,:]
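#output is the same sentences shifted one word ahead (the next word to predict)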
y_train = cat_sentences[:,1:,:]
Let's create an LSTM based model that will predict the next words from the previous words:
model = Sequential()
model.add(LSTM(dontKnow,return_sequences=True,input_shape=(None,dictionary_size)))
model.add(.....)
model.add(LSTM(dictionary_size,return_sequences=True,activation='sigmoid'))
#or a Dense(dictionary_size,activation='sigmoid')
Compile and fit this model with x_train and y_train:
model.compile(....)
model.fit(x_train,y_train,....)
Create an identical model using stateful=True in all LSTM layers:
newModel = ......
Transfer the weights from the trained model:
newModel.set_weights(model.get_weights())
Create your bags in a categorical way, shape (10, dictionary_size).
Use the model to predict one word from the _start_ word.
#reset the states of the stateful model before you start a 10 word prediction:
newModel.reset_states()
firstWord = newModel.predict(startWord) #startword is shaped as (1,1,dictionary_size)
The firstWord will be a vector with size dictionary_size telling (sort of) the probabilities of each existing word. Compare to the words in the bag. You can choose the highest probability, or use some random selecting if the probabilities of other words in the bag are also good.
#example taking the most probable word:
firstWord = np.array(firstWord == firstWord.max(), dtype=np.float32)
Do the same again, but now input firstWord in the model:
secondWord = newModel.predict(firstWord) #respect the shapes
Repeat the process until you get a sentence. Notice that you may find _end_ before all 10 bags are satisfied. You may decide to finish the process with a shorter sentence then, especially if the other word probabilities are low.
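Putting that loop together, a rough sketch of generating one sentence constrained to the bags (newModel, word_dic, reverse_dic and dictionary_size are as defined above; bags is the categorical array of shape (10, dictionary_size) built earlier):
import numpy as np

newModel.reset_states()
current = np.zeros((1, 1, dictionary_size))
current[0, 0, word_dic["_start_"]] = 1

sentence = []
for bag in bags:
    probs = newModel.predict(current)[0, -1]       # probabilities over the whole dictionary
    if int(probs.argmax()) == word_dic["_end_"]:
        break                                      # the model wants to end the sentence early
    word_index = int((probs * bag).argmax())       # most probable word that is in this bag
    sentence.append(reverse_dic[word_index])
    current = np.zeros((1, 1, dictionary_size))
    current[0, 0, word_index] = 1                  # feed the chosen word back in

print(" ".join(sentence))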
First, thank you everyone for taking the time to help with my issue.
I have a table which consists of id and content
An example set of data is:
id content
9 With astonishing insight and poignant precision
8 Whether they find themselves hacking rattles from the tails of snakes or nesting rattles in the hands of their babies.
4 This is a clear-eyed, deeply poignant collection.
12 This book is a dynamic compilation of snapshot tales, each of which encompasses its own sensory-rich world and can be read.
12 This book is an appropriate title for her collection-the prose poems feel revealed.
12 This book is a collection of ekphrastic vignettes set against surreal backdrops fraught with eerie characters faking normalcy.
What I need to do is find the shortest-length content for each unique id.
For example, the output of the above should be:
id content
9 With astonishing insight and poignant precision
8 Whether they find themselves hacking rattles from the tails of snakes or nesting rattles in the hands of their babies.
4 This is a clear-eyed, deeply poignant collection.
12 This book is an appropriate title for her collection-the prose poems feel revealed.
What I have so far, and is obviously wrong is:
SELECT DISTINCT id, content FROM t
GROUP BY id;
Thank you for your help!
One method is to calculate the shortest length and then join the results back in:
select t.id, t.content
from (select id, min(length(content)) as minl
      from t
      group by id
     ) tmin join
     t
     on t.id = tmin.id and length(t.content) = tmin.minl;
Note: if two contents have the same length and are the minimum, this will return both of them. Your question doesn't specify what to do in this case.
I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However, I still haven't got a complete understanding of the various calculations that go on in it. I am wondering whether someone can show me the calculations using a very small corpus (say 3-5 sentences and 2-3 topics).
Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:
I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
Then he does some "calculations"
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
And take guesses of the topics:
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
at which point, you could interpret topic A to be about food
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
at which point, you could interpret topic B to be about cute animals
Your question is how did he come up with those numbers? Which words in these sentences carry "information":
broccoli, bananas, smoothie, breakfast, munching, eat
chinchilla, kitten, cute, adopted, hamster
Now let's go sentence by sentence getting words from each topic:
food 3, cute 0 --> food
food 5, cute 0 --> food
food 0, cute 3 --> cute
food 0, cute 2 --> cute
food 2, cute 2 --> 50% food + 50% cute
So my numbers differ slightly from Chen's. Maybe he includes the word "piece" in "piece of broccoli" as counting towards food.
We made two calculations in our heads:
to look at the sentences and come up with 2 topics in the first place. LDA does this by considering each sentence as a "mixture" of topics and guessing the parameters of each topic.
to decide which words are important. LDA uses "term-frequency/inverse-document-frequency" to understand this.
LDA Procedure
Step1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand)
Step2: This random assignment gives topic representations of all documents and word distributions of all the topics, albeit not very good ones
So, to improve upon them:
For each document d, go through each word w and compute:
p(topic t | document d): proportion of words in document d that are assigned to topic t
p(word w| topic t): proportion of assignments to topic t, over all documents d, that come from word w
Step3: Reassign word w a new topic t’, where we choose topic t’ with probability
p(topic t’ | document d) * p(word w | topic t’)
This generative model predicts the probability that topic t’ generated word w.
We will iterate this last step multiple times for each document in the corpus, to reach a steady state.
Solved calculation
Let's say you have two documents.
Doc i: “The bank called about the money.”
Doc ii: “The bank said the money was approved.”
After removing stop words, capitalization, and punctuation, the unique words in the corpus are:
bank called money said approved
Next, randomly assign each word in both documents to one of the two topics and tabulate the word-topic and document-topic counts.
Then we randomly select a word from doc i (the word "bank", with topic assignment 1), remove its current assignment from the counts, and calculate the probability of each possible new assignment, i.e. the product p(topic k | document) * p("bank" | topic k):
For the topic k=1: p(topic 1 | doc i) * p("bank" | topic 1)
For the topic k=2: p(topic 2 | doc i) * p("bank" | topic 2)
Topic 2 turns out to be a better fit for both the document and the word (its product is larger), so our new assignment for the word "bank" is topic 2.
Now, we update the counts to reflect the new assignment.
Then we repeat the same reassignment step, iterating through each word of the whole corpus.
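To make the whole procedure concrete, here is a minimal collapsed Gibbs sampling sketch over the two bank documents. The smoothing values alpha and beta, the random seed and the iteration count are my own illustrative assumptions; the worked example above used plain proportions.
import numpy as np

docs = [["bank", "called", "money"],             # doc i without stop words
        ["bank", "said", "money", "approved"]]   # doc ii without stop words
vocab = sorted({w for d in docs for w in d})
w2i = {w: i for i, w in enumerate(vocab)}
K, V = 2, len(vocab)
alpha, beta = 0.1, 0.1
rng = np.random.default_rng(0)

# Step 1: random topic assignment for every word, plus the count tables.
z = [[int(rng.integers(K)) for _ in d] for d in docs]
ndk = np.zeros((len(docs), K))   # document-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # words per topic
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        k = z[d][i]
        ndk[d, k] += 1; nkw[k, w2i[w]] += 1; nk[k] += 1

# Steps 2-3: remove each word's assignment and resample it with probability
# proportional to p(topic | document) * p(word | topic).
for _ in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] -= 1; nkw[k, w2i[w]] -= 1; nk[k] -= 1
            p_topic_doc = (ndk[d] + alpha) / (len(doc) - 1 + K * alpha)
            p_word_topic = (nkw[:, w2i[w]] + beta) / (nk + V * beta)
            p = p_topic_doc * p_word_topic
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, w2i[w]] += 1; nk[k] += 1

for k in range(K):
    top = [vocab[i] for i in np.argsort(-nkw[k])[:3]]
    print("topic", k, ":", top)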
I'm not a Natural Language Processing student, yet I know it's not a trivial strcmp(n1,n2).
Here's what I've learned so far:
comparing personal names can't be solved 100%
there are ways to achieve a certain degree of accuracy.
the answer will be locale-specific, and that's OK.
I'm not looking for spelling alternatives! The assumption is that the input's spelling is correct.
For example, all the names below can refer to the same person:
Berry Tsakala
Bernard Tsakala
Berry J. Tsakala
Tsakala, Berry
I'm trying to:
build (or copy) an algorithm which grades the relationship between 2 input names
find an indexing method (for names in my database, for hash tables, etc.)
note:
My task isn't about finding names in text, but about comparing 2 names, e.g.
name_compare( "James Brown", "Brown, James", "en-US" ) ---> 99.0%
I used Tanimoto Coefficient for a quick (but not super) solution, in Python:
"""
Formula:
Na = number of set A elements
Nb = number of set B elements
Nc = number of common items
T = Nc / (Na + Nb - Nc)
"""
def tanimoto(a, b):
    c = [v for v in a if v in b]
    return float(len(c)) / (len(a) + len(b) - len(c))

def name_compare(name1, name2):
    return tanimoto(name1, name2)
>>> name_compare("James Brown", "Brown, James")
0.91666666666666663
>>> name_compare("Berry Tsakala", "Bernard Tsakala")
0.75
>>>
Edit: A link to a good and useful book.
Soundex is sometimes used to compare similar names. It doesn't deal with first name/last name ordering, but you could probably just have your code look for the comma to solve that problem.
We've just been doing this sort of work non-stop lately, and the approach we've taken is to have a look-up table or alias list. If you can discount misspellings/misheard/non-English names then the difficult part is taken away. In your examples we would assume that the first word and the last word are the forename and the surname. Anything in between would be discarded (middle names, initials). Berry and Bernard would be in the alias list, and when Tsakala did not match Berry we would flip the word order around and then get the match.
One thing you need to understand is the database/people lists you are dealing with. In the English speaking world middle names are inconsistently recorded. So you can't make or deny a match based on the middle name or middle initial. Soundex will not help you with common name aliases such as "Dick" and "Richard", "Berry" and "Bernard" and possibly "Steve" and "Stephen". In some communities it is quite common for people to live at the same address and have 2 or 3 generations living at that address with the same name. The only way you can separate them is by date of birth. Date of birth may or may not be recorded. If you have the clout then you should probably make the recording of date of birth mandatory. A lot of "people databases" either don't record date of birth or won't give them away due to privacy reasons.
Effectively, people-name matching is not that complicated. It's entirely based on the quality of the data supplied. What happens in practice is that a lot of records remain unmatched, and even a human looking at them can't resolve the mismatch. A human may notice name aliases not recorded in the alias list or may be able to look up details of the person on the internet, but you can't really expect your programme to do that.
Banks, credit rating organisations and the government have a lot of detailed information about us. Previous addresses, date of birth etc. And that helps them join up names. But for us normal programmers there is no magic bullet.
Analyzing name order and the existence of middle names/initials is trivial, of course, so it looks like the real challenge is knowing common name alternatives. I doubt this can be done without using some sort of nickname lookup table. This list is a good starting point. It doesn't map Bernard to Berry, but it would probably catch the most common cases. Perhaps an even more exhaustive list can be found elsewhere, but I definitely think that a locale-specific lookup table is the way to go.
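A small sketch of that approach, combining the order flip, middle-name removal and an alias table with the Tanimoto coefficient from the other answer (the alias table here is only illustrative):
# Illustrative alias table; a real one would be much larger and locale-specific.
ALIASES = {"berry": "bernard", "dick": "richard", "steve": "stephen"}

def normalize(name):
    """Return the canonical {forename, surname} set: handle "Last, First",
    drop middle names/initials, and map known aliases."""
    if "," in name:
        last, first = [p.strip() for p in name.split(",", 1)]
        parts = first.split() + [last]
    else:
        parts = name.split()
    tokens = [parts[0], parts[-1]]                  # keep forename and surname only
    return {ALIASES.get(t.lower().rstrip("."), t.lower().rstrip(".")) for t in tokens}

def name_match(n1, n2):
    a, b = normalize(n1), normalize(n2)
    common = a & b
    return len(common) / (len(a) + len(b) - len(common))   # Tanimoto on the token sets

print(name_match("Berry J. Tsakala", "Tsakala, Bernard"))   # 1.0
print(name_match("James Brown", "Brown, James"))            # 1.0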
I had real problems with the Tanimoto approach using UTF-8.
What works for languages that use diacritical signs is difflib.SequenceMatcher().
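For example, using only the standard library (the names are from the question):
import difflib

def name_similarity(a, b):
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(name_similarity("Berry Tsakala", "Bernard Tsakala"))    # ~0.86
print(name_similarity("Łukasz Kowalski", "Lukasz Kowalski"))  # works directly on unicode strings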