What is the main purpose of text.common_contexts() in NLTK? I have searched and read as much as I could, but I still don't understand it. Could someone please explain with an example? Thank you.
Example to understand:
Let's first define our input text. I will just copy and paste the first paragraph of the Game of Thrones Wikipedia page:
input_text = "Game of Thrones is an American fantasy drama television series \
created by David Benioff and D. B. Weiss for HBO. It is an adaptation of A Song \
of Ice and Fire, George R. R. Martin's series of fantasy novels, the first of \
which is A Game of Thrones. The show was filmed in Belfast and elsewhere in the \
United Kingdom, Canada, Croatia, Iceland, Malta, Morocco, Spain, and the \
United States.[1] The series premiered on HBO in the United States on April \
17, 2011, and concluded on May 19, 2019, with 73 episodes broadcast over \
eight seasons. Set on the fictional continents of Westeros and Essos, Game of \
Thrones has several plots and a large ensemble cast, and follows several story \
arcs. One arc is about the Iron Throne of the Seven Kingdoms, and follows a web \
of alliances and conflicts among the noble dynasties either vying to claim the \
throne or fighting for independence from it. Another focuses on the last \
descendant of the realm's deposed ruling dynasty, who has been exiled and is \
plotting a return to the throne, while another story arc follows the Night's \
Watch, a brotherhood defending the realm against the fierce peoples and \
legendary creatures of the North."
To apply NLTK functions, we need to convert our text from type 'str' to 'nltk.text.Text'.
import nltk
text = nltk.Text( input_text.split() )
text.similar()
The similar() method takes an input word and returns other words that appear in a similar range of contexts in the text.
For example, let's see which words are used in contexts similar to the word 'game' in our text:
text.similar('game') #output: song web
text.common_contexts()
The common_contexts() method allows you to examine the contexts that are shared by two or more words. Let's see in which contexts both 'game' and 'web' were used in the text:
text.common_contexts(['game', 'web']) #outputs a_of
This means that in the text we'll find 'a game of' and 'a web of'.
These methods are especially useful when your text is large (a book, a magazine, ...).
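To make the idea concrete, here is a minimal pure-Python sketch of what "shared contexts" means. It is not NLTK's implementation; the helper names contexts and shared_contexts and the "*START*"/"*END*" markers are assumptions chosen for illustration. A context is taken to be the pair (previous word, next word), compared in lower case.

```python
def contexts(tokens, word):
    """Collect the (previous word, next word) pairs around each occurrence of `word`."""
    word = word.lower()
    found = set()
    for i, tok in enumerate(tokens):
        if tok.lower() == word:
            left = tokens[i - 1].lower() if i > 0 else "*START*"
            right = tokens[i + 1].lower() if i + 1 < len(tokens) else "*END*"
            found.add((left, right))
    return found


def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    return set.intersection(*(contexts(tokens, w) for w in words))


tokens = "A Game of Thrones follows a web of alliances".split()
for left, right in sorted(shared_contexts(tokens, ["game", "web"])):
    print(f"{left}_{right}")  # same left_right style as the a_of output above
```

With the short token list above, both 'game' and 'web' sit between 'a' and 'of', so that is the only shared context.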
Consider the example below; it should make this clear:
>>> text1.concordance("tiger")
of miles you wade knee - deep among Tiger - lilies -- what is the
one charm wa but nurse the cruellest fangs : the tiger of
Bengal crouches in spiced groves e would be more hideous than a
caged tiger , then . I could not endure the sigh
>>> text1.concordance("bird")
o the winds when that storm - tossed bird is on the wing . The three
correspon , Ahab seemed not to mark this wild bird ; nor ,
indeed , would any one else nd incommoding Tashtego there ; this
bird now chanced to intercept its broad f his hammer frozen there
; and so the bird of heaven , with archangelic shrieks
text1.common_contexts(["tiger","bird"])
the_of
I am using a HuggingFace summarization pipeline. I train a model for 3 epochs and then evaluate all 3 epoch checkpoints with fixed random seeds. The results differ depending on whether I restart the Python console for each checkpoint or load the checkpoints one after another onto the same summarizer object in a loop. I would like to understand why this happens.
While my results are based on ROUGE score on a large dataset, I have made this small reproducible example to show this issue. Instead of using the weights of the same model at different training epochs, I decided to demonstrate using two different summarization models, but the effect is the same. Grateful for any help.
Notice that in the first run I first use the facebook/bart-large-cnn model and then the lidiya/bart-large-xsum-samsum model without restarting the Python interpreter. In the second run I use only the lidiya/bart-large-xsum-samsum model and get a different output (which should not be the case).
NOTE: this example should be reproduced on a GPU. On a CPU machine it does not seem to be sensitive to torch.use_deterministic_algorithms(True) and might give different results on every run.
FIRST RUN
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
# random text taken from UK news website
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model.eval()
summarizer = pipeline(
"summarization", model=model, tokenizer=tokenizer,
num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
"summarization", model=model, tokenizer=tokenizer,
num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)
output from lidiya/bart-large-xsum-samsum model should be
[{'summary_text': 'The UK economy is in crisis because of inflation. The government has been slow to react to it. Boris Johnson is on holiday.'}]
SECOND RUN (you must restart python to conduct the experiment)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
"summarization", model=model, tokenizer=tokenizer,
num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)
output should be
[{'summary_text': 'The government has been slow to deal with inflation. Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation.'}]
Why is the first output different from the second one?
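You need to re-seed after running the bart-large-cnn pipeline. Otherwise the first pipeline consumes random numbers from the seeded generator, so the lidiya model starts from a different generator state than it does in the script that runs it alone. Because you sample (do_sample=True), a different generator state gives a different output.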
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
import torch
# random text taken from UK news website
text = """
The veteran retailer Stuart Rose has urged the government to do more to shield the poorest from double-digit inflation, describing the lack of action as “horrifying”, with a prime minister “on shore leave” leaving a situation where “nobody is in charge”.
Responding to July’s 10.1% headline rate, the Conservative peer and Asda chair said: “We have been very, very slow in recognising this train coming down the tunnel and it’s run quite a lot of people over and we now have to deal with the aftermath.”
Attacking a lack of leadership while Boris Johnson is away on holiday, he said: “We’ve got to have some action. The captain of the ship is on shore leave, right, nobody’s in charge at the moment.”
Lord Rose, who is a former boss of Marks & Spencer, said action was needed to kill “pernicious” inflation, which he said “erodes wealth over time”. He dismissed claims by the Tory leadership candidate Liz Truss’s camp that it would be possible for the UK to grow its way out of the crisis.
"""
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model.eval()
summarizer = pipeline(
"summarization", model=model, tokenizer=tokenizer,
num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
seed = 42
torch.cuda.manual_seed_all(seed)
torch.use_deterministic_algorithms(True)
tokenizer = AutoTokenizer.from_pretrained("lidiya/bart-large-xsum-samsum")
model = AutoModelForSeq2SeqLM.from_pretrained("lidiya/bart-large-xsum-samsum")
model.eval()
summarizer = pipeline(
"summarization", model=model, tokenizer=tokenizer,
num_beams=5, do_sample=True, no_repeat_ngram_size=3, device=0
)
output = summarizer(text, truncation=True)
print(output)
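The effect can be reproduced without any model. The sketch below uses Python's standard random module as a stand-in for the torch generator (an assumption made so it runs anywhere); the function name sample_after and the "prior draws" are hypothetical and only mimic the first pipeline using up random numbers.

```python
import random


def sample_after(prior_draws, seed=42, reseed=False):
    """Seed once, consume `prior_draws` numbers, then return the next draw."""
    random.seed(seed)
    for _ in range(prior_draws):
        random.random()  # stands in for sampling done by the first pipeline
    if reseed:
        random.seed(seed)  # reset the generator before the second "pipeline"
    return random.random()


alone = sample_after(0)                          # second model run on its own
after_first = sample_after(3)                    # second model run after the first
after_reseed = sample_after(3, reseed=True)      # re-seeded in between

print(alone == after_first)   # False: the generator state has moved on
print(alone == after_reseed)  # True: re-seeding restores the same state
```

The same reasoning applies to the seeded CUDA generator in the scripts above: seeding fixes the starting state, not the state at the moment the second pipeline runs.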
I have a csv file named movie_reviews.csv and the data inside looks like this:
1 Pixar classic is one of the best kids' movies of all time.
1 Apesar de representar um imenso avanço tecnológico, a força
1 It doesn't enhance the experience, because the film's timeless appeal is down to great characters and wonderful storytelling; a classic that doesn't need goggles or gimmicks.
1 As such Toy Story in 3D is never overwhelming. Nor is it tedious, as many recent 3D vehicles have come too close for comfort to.
1 The fresh look serves the story and is never allowed to overwhelm it, leaving a beautifully judged yarn to unwind and enchant a new intake of young cinemagoers.
1 There's no denying 3D adds extra texture to Pixar's seminal 1995 buddy movie, emphasising Buzz and Woody's toy's-eye- view of the world.
1 If anything, it feels even fresher, funnier and more thrilling in today's landscape of over-studied demographically correct moviemaking.
1 If you haven't seen it for a while, you may have forgotten just how fantastic the snappy dialogue, visual gags and genuinely heartfelt story is.
0 The humans are wooden, the computer-animals have that floating, jerky gait of animated fauna.
1 Some thrills, but may be too much for little ones.
1 Like the rest of Johnston's oeuvre, Jumanji puts vivid characters through paces that will quicken any child's pulse.
1 "This smart, scary film, is still a favorite to dust off and take from the ""vhs"" bin"
0 All the effects in the world can't disguise the thin plot.
The first column, containing 0s and 1s, is my label.
I want to first turn the texts in movie_reviews.csv into vectors, then split my dataset based on the labels (all 1s to train and 0s to test), and then feed the vectors into a classifier such as a random forest.
For such a task you'll need to preprocess your data first. Lower-case all your sentences, then remove stopwords (the, and, or, ...) and tokenize (an introduction here: https://medium.com/#makcedward/nlp-pipeline-word-tokenization-part-1-4b2b547e6a3). You can also use stemming to keep only the root of each word, which can be helpful for sentiment classification.
Then assign an index to each word of your vocabulary and replace the words in each sentence with these indexes:
Imagine your vocabulary is: ['i', 'love', 'keras', 'pytorch', 'tensorflow']
index['None'] = 0 #in case a new word is not in your vocabulary
index['i'] = 1
index['love'] = 2
...
Thus the sentence 'I love Keras' will be encoded as [1 2 3].
However, you have to define a maximum length max_len for your sentences; when a sentence contains fewer words than max_len, you pad its vector with zeros up to size max_len.
In the previous example, if max_len = 5 then [1 2 3] -> [1 2 3 0 0].
This is a basic approach. Feel free to check preprocessing tools provided by libraries such as NLTK, Pandas ...
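The indexing and padding steps above can be sketched in a few lines of plain Python. The function names build_index and encode are hypothetical, and splitting on whitespace is a simplification of real tokenization. As in the description, 0 is used both for unknown words and for padding.

```python
def build_index(vocabulary):
    """Map each vocabulary word to an index starting at 1 (0 is reserved)."""
    return {word: i for i, word in enumerate(vocabulary, start=1)}


def encode(sentence, index, max_len):
    """Lower-case, split on whitespace, map words to indexes, then pad or truncate."""
    tokens = sentence.lower().split()
    ids = [index.get(tok, 0) for tok in tokens][:max_len]  # unknown words -> 0
    return ids + [0] * (max_len - len(ids))                # pad with zeros


vocabulary = ["i", "love", "keras", "pytorch", "tensorflow"]
index = build_index(vocabulary)

print(encode("I love Keras", index, max_len=5))  # [1, 2, 3, 0, 0]
```

The resulting fixed-length vectors can then be stacked into a matrix and passed to a classifier.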
I'm making an API call and using Cheshire to parse the JSON:
(defn fetch_headlines [source]
(let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
source
"&apiKey=a688e6494c444902b1fc9cb93c61d6987")]
(-> articlesUrl
client/get
generate-string
parse-string)))
The JSON payload:
{"status" 200, "headers" {"access-control-allow-headers" "x-api-key,
authorization", "content-type" "application/json; charset=utf-8",
"access-control-allow-origin" "*", "content-length" "7434",
"connection" "close", "pragma" "no-cache", "expires" "-1",
"access-control-allow-methods" "GET", "date" "Thu, 28 Mar 2019
20:22:16 GMT", "x-cached-result" "false", "cache-control" "no-cache"},
"body"
"{\"status\":\"ok\",\"totalResults\":10,\"articles\":[{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Trump:
Mueller probe was 'attempted takeover' of government - CNN
Video\",\"description\":\"In a Fox News interview with Sean Hannity,
President Trump called special counsel Robert Mueller's probe an
\\"attempted takeover of our
government.\\"\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/trump-mueller-probe-attempted-takeover-hannity-cpt-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324191527-06-trump-mueller-reaction-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:09:04.1891948Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"James
Clapper reacts to call he should be investigated - CNN
Video\",\"description\":\"Former Director of National Intelligence
James Clapper reacts to White House press secretary Sarah Sanders
saying he and other former intelligence officials should be
investigated after special counsel Robert Mueller did not establish
collusion between the
Tr…\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/26/james-clapper-reponse-mueller-report-sarah-sanders-criticism-bts-ac360-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190325211210-james-clapper-ac360-03252019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T20:08:43.1736236Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Maegan
Vazquez, CNN\",\"title\":\"Trump set for first rally since Mueller
investigation ended\",\"description\":\"President Donald Trump, making
his first appearance before supporters since Robert Mueller ended his
investigation, is set to speak during a rally in Grand Rapids,
Michigan Thursday
night.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/donald-trump-grand-rapids-rally/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190321115403-07-donald-trump-lead-image-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:49:26Z\",\"content\":\"Washington
(CNN)President Donald Trump, making his first appearance before
supporters since Robert Mueller ended his investigation, is set to
speak during a rally in Grand Rapids, Michigan Thursday
night.\r\nThe rally follows a chaotic week in Washington, preci…
[+2099
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Katelyn
Polantz, CNN\",\"title\":\"Judge orders Justice Dept. to turn over
Comey memos\",\"description\":\"A federal judge has ordered that the
James Comey memos are turned over, in a court case brought by CNN and
other media organizations for access to the documents memorializing
former FBI Director's interactions with President Donald
Trump.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/james-comey-memo-lawsuit/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/181209143047-comey-1207-super-tease.jpg\",\"publishedAt\":\"2019-03-28T19:14:45Z\",\"content\":\"Washington
(CNN)A federal judge has ordered that the Justice Department and FBI
submit James Comey's memos in full to the court under seal, in a court
case brought by CNN and other media organizations for access to the
documents memorializing the former FBI d… [+1043
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Clare
Foran and Manu Raju, CNN\",\"title\":\"Pelosi calls AG's summary of
Mueller report 'arrogant'\",\"description\":\"House Speaker Nancy
Pelosi on Thursday criticized Attorney General William Barr's summary
of special counsel Robert Mueller's report, calling it
\\"condescending\\" and \\"arrogant\\" and saying \\"it wasn't
the right thing to
do.\\"\",\"url\":\"http://us.cnn.com/2019/03/28/politics/pelosi-mueller-report-congress-barr-summary/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328130240-02-nancy-pelosi-03282019-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:48:25Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Analysis
by Chris Cillizza, CNN Editor-at-large\",\"title\":\"The 43 most
outrageous lines from Donald Trump's phone interview with Sean
Hannity\",\"description\":\"There's no \\"reporter\\" that President
Donald Trump likes more than Fox News' Sean Hannity -- largely due to
Hannity's unwavering, puppy dog-like support for the President. Trump
likes to reward people who play nice with him, which brings us to the
45-minute
ph…\",\"url\":\"http://us.cnn.com/2019/03/28/politics/sean-hannity-donald-trump-mueller/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328140149-01-hannity-trump-file-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:44:21Z\",\"content\":\"(CNN)There's
no \\"reporter\\" that President Donald Trump likes more than Fox
News' Sean Hannity -- largely due to Hannity's unwavering, puppy
dog-like support for the President. Trump likes to reward people who
play nice with him, which brings us to the 45-minu… [+14785
chars]\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":null,\"title\":\"Puerto
Rico Gov.: I'll punch the bully in the mouth - CNN
Video\",\"description\":\"In an exclusive interview with CNN, Puerto
Rico Governor Ricardo Rosselló said he would not sit back and allow
his officials to be bullied by the White
House.\",\"url\":\"http://us.cnn.com/videos/politics/2019/03/28/ricardo-rossello-trump-bully-puerto-rico-sot-vpx.cnn\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190328123504-puerto-rico-gov-ricardo-rosello-super-tease.jpg\",\"publishedAt\":\"2019-03-28T18:08:33.7312458Z\",\"content\":\"Chat
with us in Facebook Messenger. Find out what's happening in the world
as it
unfolds.\"},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb, Manu Raju and Ted Barrett, CNN\",\"title\":\"Jared Kushner
interviewed by Senate Intelligence
Committee\",\"description\":\"President Donald Trump's son-in-law
Jared Kushner returned to the Senate Intelligence Committee for a
closed door interview Thursday as part of the committee's Russia
investigation.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/jared-kushner-senate-intelligence/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180302124221-30-jared-kushner-super-tease.jpg\",\"publishedAt\":\"2019-03-28T16:21:29Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jeremy
Herb and Laura Jarrett, CNN\",\"title\":\"Mueller report more than 300
pages, sources say\",\"description\":\"Special counsel Robert
Mueller's confidential report on the Russia investigation is more than
300 pages, according to a Justice Department official and a second
source with knowledge of the
matter.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/mueller-report-pages/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/190324130054-05-russia-investigation-0324-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:52:01Z\",\"content\":null},{\"source\":{\"id\":\"cnn\",\"name\":\"CNN\"},\"author\":\"Jim
Acosta and Kevin Liptak, CNN\",\"title\":\"Exclusive: Puerto Rico
governor warns White House over funding\",\"description\":\"Tensions
are escalating between President Donald Trump and Puerto Rico's
governor over disaster relief efforts that have been slow in coming
for the still-battered island after Hurricane
Maria.\",\"url\":\"http://us.cnn.com/2019/03/28/politics/ricardo-rossell-donald-trump-puerto-rico-funding/index.html\",\"urlToImage\":\"https://cdn.cnn.com/cnnnext/dam/assets/180920230539-pr-storm-of-controversy-rossello-trump-super-tease.jpg\",\"publishedAt\":\"2019-03-28T15:19:39Z\",\"content\":null}]}",
"trace-redirects"
["https://newsapi.org/v2/top-headlines?sources=cnn&apiKey=a688e6494c444902b1fc9cb93c61d687"]}
I'd like to extract the URLs from the returned JSON payload. I've tried this:
(defn fetch_headlines [source]
(let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
source
"&apiKey=a688e6494c444902b1fc9cb93c61d697")]
(-> articlesUrl
client/get
generate-string
parse-string
(get-in ["source" "url"]))))
But I get a nil result. Any ideas?
SOLUTION based on user feedback:
(defn fetch-headlines [source]
(let [articlesUrl (str "https://newsapi.org/v2/top-headlines?sources="
source
"&apiKey=a688e6494c444902b1fc9cb93c61d697")]
(-> articlesUrl
client/get
:body
parse-string
(get-in ["articles" 0 "url"]))))
What you need is inside the body key, but the value corresponding to that key is still a string and not yet a Clojure map. When you look for source, you get nil back because that key doesn't exist at the top level of the response (it is nested inside body and only reachable after that string has been parsed as JSON).
Once you've properly parsed the body value, it should be something like:
(let [index-of-article 0]
(get-in response ["body" "articles" index-of-article "url"]))
where index-of-article is the positional index of the article you want, since articles contains a vector of articles.
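The same two-level structure can be illustrated outside Clojure. The Python sketch below is only an analogy, not Cheshire or clj-http code: the response dictionary, the example.com URLs and the helper article_url are made up for illustration. It shows that the body is a JSON string that must be parsed before the articles can be indexed.

```python
import json

# A made-up response shaped like the payload above: "body" holds a JSON *string*.
response = {
    "status": 200,
    "body": json.dumps({
        "status": "ok",
        "articles": [
            {"source": {"id": "cnn"}, "url": "http://example.com/a"},
            {"source": {"id": "cnn"}, "url": "http://example.com/b"},
        ],
    }),
}


def article_url(response, index_of_article):
    """Parse the body string first, then walk into articles -> index -> url."""
    body = json.loads(response["body"])
    return body["articles"][index_of_article]["url"]


print(response.get("source"))    # None: there is no top-level "source" key
print(article_url(response, 0))  # http://example.com/a
```

To collect every URL rather than one, iterate over the parsed articles instead of indexing a single position.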
I am trying to learn about Latent Dirichlet Allocation (LDA). I have basic knowledge of machine learning and probability theory, and based on this blog post http://goo.gl/ccPvE I was able to develop the intuition behind LDA. However, I still don't fully understand the various calculations that go into it. Could someone show me the calculations using a very small corpus (say 3-5 sentences and 2-3 topics)?
Edwin Chen (who works at Twitter btw) has an example in his blog. 5 sentences, 2 topics:
I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.
Then he does some "calculations"
Sentences 1 and 2: 100% Topic A
Sentences 3 and 4: 100% Topic B
Sentence 5: 60% Topic A, 40% Topic B
And he guesses the topics:
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, …
at which point, you could interpret topic A to be about food
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, …
at which point, you could interpret topic B to be about cute animals
Your question is: how did he come up with those numbers? First, which words in these sentences carry "information"?
broccoli, bananas, smoothie, breakfast, munching, eat
chinchilla, kitten, cute, adopted, hamster
Now let's go sentence by sentence, counting the words from each topic:
food 3, cute 0 --> food
food 5, cute 0 --> food
food 0, cute 3 --> cute
food 0, cute 2 --> cute
food 2, cute 2 --> 50% food + 50% cute
So my numbers differ slightly from Chen's. Maybe he counts the word "piece" in "piece of broccoli" towards food.
We made two calculations in our heads:
to look at the sentences and come up with 2 topics in the first place. LDA does this by considering each sentence as a "mixture" of topics and guessing the parameters of each topic.
to decide which words are important. LDA has no explicit weighting step such as "term-frequency/inverse-document-frequency" for this; it works from word counts and learns, for each topic, a probability for every word. Uninformative stop words are usually removed beforehand.
LDA Procedure
Step1: Go through each document and randomly assign each word in the document to one of K topics (K is chosen beforehand)
Step2: This random assignment gives topic representations of all documents and word distributions of all the topics, albeit not very good ones
So, to improve upon them:
For each document d, go through each word w and compute:
p(topic t | document d): proportion of words in document d that are assigned to topic t
p(word w| topic t): proportion of assignments to topic t, over all documents d, that come from word w
Step3: Reassign word w a new topic t’, where we choose topic t’ with probability
p(topic t’ | document d) * p(word w | topic t’)
Under the generative model, this product is proportional to the probability that topic t’ generated word w.
We iterate this last step many times for each document in the corpus until the assignments reach a steady state.
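The procedure above can be sketched as a small sampler in plain Python. This is a minimal illustration, not a reference implementation: the function name lda_gibbs is hypothetical, and the smoothing constants alpha and beta are assumptions added so that no topic ever gets probability zero (the steps above describe plain proportions). The denominator of p(topic t | document d) is the same for every topic, so it is left out of the weights.

```python
import random


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # Step 1: randomly assign every word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    # Step 2: count tables implied by that assignment.
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d with topic t
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics                              # all words with topic t
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    # Step 3: repeatedly reassign each word.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight of topic k: p(topic k | document d) * p(word w | topic k).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + beta * len(vocab))
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                # Record the new assignment.
                assignments[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return assignments, doc_topic, topic_word


docs = [["bank", "called", "money"], ["bank", "said", "money", "approved"]]
assignments, doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
print(assignments)  # one topic id per word, per document
print(doc_topic)    # per-document topic counts
```

Dividing each row of doc_topic by the document length gives the topic mixture of that document, and normalizing each topic_word table gives the word distribution of that topic.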
Solved calculation
Let's say you have two documents.
Doc i: “The bank called about the money.”
Doc ii: “The bank said the money was approved.”
After removing the stop words, capitalization, and punctuation, the unique words in the corpus are:
bank called about money boat approved
Next, each word is randomly assigned to one of the two topics, as in Step 1.
Then we randomly select a word from doc i (the word bank, currently assigned to topic 1), remove its topic assignment, and calculate the probabilities for its new assignment.
For the topic k=1
For the topic k=2
Now we will calculate the product of those two probabilities as given below:
Topic 2 is a better fit than topic 1 for both the document and the word (its product, shown as the larger area, is greater). So the new assignment for the word bank will be topic 2.
Now we update the counts to reflect the new assignment.
We then repeat the same reassignment step, iterating through each word of the whole corpus.
I am parsing some sentences (from the inaugural speeches in the NLTK corpus) with the rule S -> NP VP, and I want to make sure I parsed them correctly. Do these sentences follow that format? Sorry if this question seems trivial; English is not my first language. If anyone has questions about why a given sentence follows NP VP, ask me and I will give my reasons for picking it and its parse tree.
god bless you
our capacity remains undiminished
their memories are short
they are serious
these things are true
the capital was abandoned
they are many
god bless the united states of america
the enemy was advancing
all this we can do
all this we will do
Thanks in advance.
The first 9 are NP VP. In the last two, "all this" is the direct object, which is part of the VP.
god bless you
NP- VP-------
our capacity remains undiminished
NP---------- VP------------------
their memories are short
NP------------ VP-------
they are serious
NP-- VP---------
these things are true
NP---------- VP------
the capital was abandoned
NP--------- VP-----------
they are many
NP-- VP------
god bless the united states of america
NP- VP--------------------------------
the enemy was advancing
NP------- VP-----------
all this we can do
VP------ NP VP----
all this we will do
VP------ NP VP-----
Note that the last two sentences are semantically equivalent to "We can do all this" and "We will do all this", an order that makes the subject/predicate (NP/VP) breakdown easier.