Using NLTK, how can I tell the difference between Bus, Public and Karak?

I have a paragraph
Public buses operating on all internal lines in Karak governorate have
been on strike yesterday to protest against the decision to remove
working buses that are over 12 years old. Bus drivers and owners said
the new government's decision to remove working buses, which are over
12 years of age, would mean large financial losses to owners of these
buses, most of whom suffer from high debt because of their purchase.
"The government is not aware of what it is doing, especially in the
case of the cancellation of thousands of buses operating in various
parts of the Kingdom, which bought hard-earned through the banks and
at great financial costs." He pointed out that "buses will remain idle
until the government review the decision as unfair to thousands of
families in the Kingdom." For his part, the head of the office of the
Karak Transport Regulatory Authority, Mahmoud Al-Sarayra, did not
answer Al Ghad's calls for a response to the complaints of drivers
and bus owners.
Running the following code on the paragraph:
import nltk
import numpy as np
sentences = [x.replace('.','').replace('"','') for x in nltk.sent_tokenize(paragraph)]
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = [x for x in nltk.ne_chunk_sents(tagged_sentences)]
entities = np.unique(np.array([x for s in chunked_sentences for x in s if type(x) == nltk.tree.Tree])).tolist()
The NLTK function ne_chunk_sents gives me back the following named entities:
[Tree('GPE', [('Bus', 'NNP')]),
Tree('GPE', [('Karak', 'NNP')]),
Tree('GPE', [('Public', 'NNP')]),
Tree('ORGANIZATION', [('Karak', 'NNP'), ('Transport', 'NNP'), ('Regulatory', 'NNP'), ('Authority', 'NNP')]),
Tree('ORGANIZATION', [('Kingdom', 'NNP')]),
Tree('PERSON', [('Al', 'NNP'), ('Ghad', 'NNP')]),
Tree('PERSON', [('Mahmoud', 'NNP'), ('Al-Sarayra', 'NNP')])]
GPE stands for "Geopolitical Entity". I'm not sure that "Public" and "Bus" qualify. I know that Karak is what I'm looking for. What's the easiest way in NLTK to distinguish common English words such as Public and Bus from words which are not English and are most likely place names?
NOTE: This is similar to this question from 2 years ago that didn't get a definitive answer.

So following the lead of the similar question from 2 years ago, here is a solution:
e2 = [(x.label(), ' '.join(y for y, z in x.leaves())) for x in entities]
e3 = [y for x, y in e2 if x == 'GPE']
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
e4 = [x for x in e3 if x.lower() not in english_vocab]
Then e4 is the list
['Karak']
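For reference, the vocabulary-filter idea can be sketched without the NLTK corpus downloads by substituting a toy vocabulary (the real code draws english_vocab from nltk.corpus.words; the toy set and the filter_non_english helper below are illustrative only):

```python
# Toy stand-in for nltk.corpus.words (which needs the "words" download).
english_vocab = {"public", "bus", "strike", "government"}

def filter_non_english(candidates, vocab):
    """Keep only candidate entity strings whose lowercased form is not
    an ordinary vocabulary word -- likely proper nouns / place names."""
    return [c for c in candidates if c.lower() not in vocab]

# The GPE candidates ne_chunk_sents produced above.
gpe_candidates = ["Bus", "Karak", "Public"]
print(filter_non_english(gpe_candidates, english_vocab))  # ['Karak']
```

Note that this heuristic will also drop genuine place names that happen to be English words (e.g. "Kingdom"), so it errs on the side of precision.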

Related

Is it possible to use the prediction of a CNN for images not belonging to the model classes?

I am using photos of the faces of individuals aged 10, 12, 14, 16, 18, and 20 years old. The question I am trying to answer is: how does the face "evolve"? Will it remain more or less similar until a certain threshold where it suddenly changes?
To answer this question, I trained a CNN on 2 classes, the 10 year old photos (labelled as "0") and the 20 year old photos (labelled as "1").
I used this model to predict the categories that do not belong to the model's classes (the 12 to 18 year olds), and computed the average prediction for each age group. The result is shown in the figure, where each value is the mean prediction for the 12, 14, 16, and 18 year olds, respectively.
My question here is: does it make sense to use this model to predict other age groups and say, for example, "The 12 year olds have a mean prediction of 0.2, which means they are more similar to the 10 year old faces than to the 20 year old faces"?
As the values increase with age, can I say that the faces are getting more similar to the 20 year old faces? And are there any references to articles using a model to predict images belonging to none of the model's classes?
Thank you!
This is quite an interesting question. Let me answer this in two ways.
Q1: Can CNN predict image whose class is not included in the training set?
A1: Yes. However, instead of training the model and comparing the output probabilities, we use a method called "few-shot learning" or "zero-shot learning" to figure this out. The basic idea is: first, we train the model so that it recognizes the high-level features underlying the data (e.g., the edges or the shape of the eyes in your example). Then we apply the model to a new dataset, relying on its generalizability. This research is also closely related to transfer learning.
As a starting point, here is a good paper.
Generalizing from a Few Examples: A Survey on Few-shot Learning
Q2: If the probability is higher, can we say the faces are more similar to 20 year old faces?
A2: The short answer is: yes. The reason is that we only have two classes in your training data; if the probability is higher, it means the model has more confidence that the picture belongs to class 1 (i.e., the 20 year old photos). But we cannot be sure how the model makes this prediction. You may want to visualize the outputs of intermediate layers to see which features the model finds. You can take a look at this blog.
Understanding CNN (Convolutional Neural Network)
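A minimal sketch of the group-mean analysis under discussion, with all per-image probabilities invented for illustration (the real values would come from the trained CNN's sigmoid output for each photo):

```python
# Hypothetical class-1 ("20 year old") probabilities per photo,
# keyed by the age of the photographed face. Numbers are made up.
predictions = {
    12: [0.15, 0.25, 0.20],
    14: [0.30, 0.40],
    16: [0.55, 0.65],
    18: [0.80, 0.90],
}

def mean_prediction(probs):
    """Average probability that the model assigns to class 1."""
    return sum(probs) / len(probs)

group_means = {age: mean_prediction(p) for age, p in predictions.items()}
# Monotonically increasing means would support "faces drift toward the
# 20-year-old class as age increases" -- but only relative to the two
# trained classes, not as an absolute similarity measure.
```

The caveat in the comment is the key point of A2: the number is a confidence relative to the two endpoints of the training data, not a calibrated similarity score.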

Confused about Rewards in David Silver Lecture 2

While watching the Reinforcement Learning course by David Silver on YouTube (and the slides: Lecture 2, MDP), I found the "Reward" and "Value Function" really confusing.
I tried to understand the "given rewards" marked on the slide (P11), but I cannot figure out why they are what they are. For example, "Class 1: R = -2" but "Pub: R = +1".
Why the negative reward for Class and the positive reward for Pub? Why the different values?
How to calculate the reward with the Discount Factor? (P17 and P18)
I think a lack of intuition for Reinforcement Learning is the main reason why I have encountered this kind of problem...
So, I'd really appreciate it if someone can give me a little hint.
You usually set the rewards and the discount so that RL drives the agent to solve a task.
In the student example the goal is to pass the exam. The student can spend his time attending a class, sleeping, on Facebook or at the pub. Attending a class is "boring", so the student doesn't see the immediate benefit of doing it; hence the negative reward. On the contrary, going to the pub is fun and gives a positive reward. However, only by attending all 3 classes can the student pass the exam and get the big final reward.
Now the question is: how much does the student value immediate vs future rewards? The discount factor tells you that: a small discount gives more importance to immediate rewards, because future rewards just "fade" in the long run. If we use a small discount, the student may prefer to always go to the pub or to sleep. With a discount close to 0, all rewards get close to 0 after just one step, so at each state the student will try to maximize the immediate reward, because after that "nothing else matters".
On the contrary, high discounts (max 1) value long-term rewards more: in this case the optimal student will attend all classes and pass the exam.
Choosing the discount can be tricky, especially if there is no terminal state (in this case "sleep" is terminal), because with a discount of 1 the agent may ignore the number of steps used to reach the highest reward. For instance, if classes gave a reward of -1 instead of -2, it would make no difference to the agent whether it spent time alternating between "class" and "pub" forever before eventually passing the exam, because with discount 1 the rewards never fade, so even after 10 years the student would still get +10 for passing the exam.
Think also of a virtual agent having to reach a goal position. With discount 1, the agent would not learn to reach it in the least number of steps: as long as it reaches the goal, it's all the same to it.
Besides that, there is also a numerical problem with a discount of 1. Since the goal is to maximize the cumulative sum of the discounted rewards, if rewards are not discounted (and the horizon is infinite) the sum will not converge.
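The discount trade-off described above can be sketched numerically (reward values borrowed from the student example, trajectories simplified to fixed reward sequences):

```python
# Discounted return G = r0 + g*r1 + g^2*r2 + ... for a reward sequence.
def discounted_return(rewards, gamma):
    return sum(r * gamma**t for t, r in enumerate(rewards))

# "Study" path: three boring classes (-2 each), then pass the exam (+10).
study = [-2, -2, -2, 10]
# "Pub" path: immediate fun (+1 each), no exam reward.
pub = [1, 1, 1, 0]

# With a small gamma the early rewards dominate, so the pub wins;
# with gamma = 1 the exam reward is not faded and studying wins.
print(discounted_return(study, 0.1), discounted_return(pub, 0.1))  # -2.21 1.11
print(discounted_return(study, 1.0), discounted_return(pub, 1.0))  # 4.0 3.0
```

The crossover as gamma grows is exactly the "immediate vs future rewards" preference the discount factor encodes.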
Q1) First of all, you should not forget that the rewards are given by the environment. The actions taken by the agent do not change the rewards the environment defines, but of course they affect the reward collected along the followed trajectory.
In the example, these +1 and -2 are just playful choices :) "As a student" you get bored during class, so its reward is -2, while you have fun at the pub, so its reward is +1. Don't get confused about the reasons behind these numbers; they are given by the environment.
Q2) Let's do the calculation for the state with the value 4.1 in "Example: State-Value Function for Student MRP (2)":
v(s) = (-2) + 0.9 * [(0.4 * 1.9) + (0.6 * 10)] = (-2) + 6.084 ≈ 4.1
Here David is using the Bellman Equation for MRPs. You can find it on the same slide.
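The same computation, spelled out as a minimal script (the values are read off the slide; the Bellman evaluation is v(s) = R(s) + gamma * sum over s' of P(s'|s) * v(s')):

```python
# One-step Bellman evaluation for the state with value 4.1 in the
# "State-Value Function for Student MRP (2)" example.
gamma = 0.9
reward = -2.0                            # immediate reward in this state
successors = [(0.4, 1.9), (0.6, 10.0)]   # (transition prob, v(next state))

v = reward + gamma * sum(p * v_next for p, v_next in successors)
print(round(v, 1))  # 4.1
```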

Creating a corpus out of texts stored in JSON files in R

I have several JSON files with texts grouped into date, body and title fields. As an example, consider:
{"date": "December 31, 1990, Monday, Late Edition - Final", "body": "World stock markets begin 1991 facing the threat of a war in the Persian Gulf, recessions or economic slowdowns around the world, and dismal earnings -- the same factors that drove stock markets down sharply in 1990. Finally, there is the problem of the Soviet Union, the wild card in everyone's analysis. It is a country whose problems could send stock markets around the world reeling if something went seriously awry. With Russia about to implode, that just adds to the risk premium, said Mr. Dhar. LOAD-DATE: December 30, 1990 ", "title": "World Markets;"}
{"date": "December 30, 1992, Sunday, Late Edition - Final", "body": "DATELINE: CHICAGO Gleaming new tractors are becoming more familiar sights on America's farms. Sales and profits at the three leading United States tractor makers -- Deere & Company, the J.I. Case division of Tenneco Inc. and the Ford Motor Company's Ford New Holland division -- are all up, reflecting renewed agricultural prosperity after the near-depression of the early and mid-1980's. But the recovery in the tractor business, now in its third year, is fragile. Tractor makers hope to install computers that can digest this information, then automatically concentrate the application of costly fertilizer and chemicals on the most productive land. Within the next 15 years, that capability will be commonplace, predicted Mr. Ball. LOAD-DATE: December 30, 1990 ", "title": "All About/Tractors;"}
I have three different newspapers, with separate files containing all the texts produced for the period 1989 - 2016. My ultimate goal is to combine all the texts into a single corpus. I have done it in Python using the pandas library, and I am wondering if it could be done similarly in R. Here is my Python code with the loop:
import json
import pandas as pd

appended_data = []
for i in range(1989, 2017):
    df0 = pd.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pd.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pd.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)
Use jsonlite::stream_in to read your files and jsonlite::rbind.pages to combine them.
There are many options in R to read JSON files and convert them to a data.frame/data.table.
Here is one using jsonlite and data.table:
library(data.table)
library(jsonlite)
res <- lapply(1989:2016, function(i) {
  ff <- c('NYT_%d.json', 'USAT_%d.json', 'WP_%d.json')
  list_files_paths <- sprintf(ff, i)
  rbindlist(lapply(list_files_paths, fromJSON))
})
Here res is a list of data.tables. If you want to aggregate all the data.tables into a single data.table:
rbindlist(res)
Use ndjson::stream_in to read them in faster and flatter than jsonlite::stream_in :-)

scrapy xpath not returning desired results. Any idea?

Please look at this page http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you would have guessed, I am trying to scrape all the fields on this page. All fields are yielded properly except the Answer field. What I find odd is that the page structure for the question and the answer is almost the same (Table[1] and Table[2]); the question scrapes perfectly but the Answer does not. Here are my XPaths:
question:
['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()
works perfectly.
Answer:
['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()
returns a blank. I have reproduced the full XPath, as returned by/verified in XPath Helper and the console.
What am I overlooking? What am I not able to see?
It seems like your XPath has a problem. Check out this demo from the scrapy shell:
In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]:
[u'\r\n\r\n',
u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY (SHRI PIYUSH GOYAL)\r\n\r\n ',
u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and above the \r\nfixed reserve price of Rs. 100/- per tonne.\r\n\r\n',
u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value. ',
u'\r\n\r\n\r\n']
When you are dealing with tables, this sometimes happens; for more information you can refer to this.
At least part of the source of your difficulty lies in the fact that the code you see in the console is not the source html that your spider gets as a response (and on which the selectors operate).
In particular, it is extremely common for a <table> to not include a <tbody>; but when your browser translates the html to the DOM tree, it slaps in <tbody> tags. And there was a time when much of the layout of webpages was actually accomplished with (crazily) nested tables. As a result, the DOM of such a website will typically have many more <tbody> elements than the html source.
What this means in practical terms is that:
It is generally a good idea to find a relatively simple xpath (or CSS selector, or ...) for the element(s) you want to select -- not the behemoth you sometimes get from your developer tools.
It is generally a bad idea to include /tbody in your xpath (unless there is an associated attribute, indicating that the tag exists in the source html).
For the site in question,
response.xpath('//td[@class="griditemq"]').extract()
returns a list whose first element is the question and whose second element is the answer.
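The tbody pitfall is easy to reproduce with the standard library alone; here is a sketch using xml.etree.ElementTree on a simplified, invented fragment of the page's markup (only the class names come from the question):

```python
import xml.etree.ElementTree as ET

# Source HTML often has no <tbody>, even though the browser's DOM
# (and hence your developer tools' "copy XPath") shows one.
source_html = """
<table id="ctl00_ContPlaceHolderMain_GridView2">
  <tr><td class="mainheaderq">ANSWER</td></tr>
  <tr><td class="griditemq">The answer text.</td></tr>
</table>
"""

root = ET.fromstring(source_html)
# Browser-style path with tbody: finds nothing in the raw source.
assert root.findall('./tbody/tr/td') == []
# Attribute-based selection: works regardless of tbody.
cells = root.findall('.//td[@class="griditemq"]')
print([c.text for c in cells])  # ['The answer text.']
```

The same contrast holds for scrapy's selectors: a short attribute-based XPath survives markup quirks that a long copied tag chain does not.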

string manipulation explode mysql

I am dealing with SCORM package data and here is just one of the many nasty column data that I need to manipulate:
;~;VarQuestion_0016=kl%2Fklkl;VarReflectiveWriting_0001=I%20fink%20that%20aw%20childwens%20should%20be%20wuvved%20and%20pwotected.;VarQuestion_0005=D.%20%20Radio%20public%20service%20announcements%20aired%20during%20Child%20Abuse%20Prevention%20Month%20in%20April;VarQuestion_0004=D.%20%20Societal%20approach;VarQuestion_0003=B.%20%20Free%20respite%20child%20care%20offered%20to%20any%20family%20needing%20a%20break%20in%20order%20to%20reduce%20stress%2CC.%20%20Court-ordered%20substance%20abuse%20prevention%20classes%20for%20parents%20involved%20with%20Child%20Protective%20Services;VarQuestion_0001=B.%20%20A%20treatment%20program%20for%20parents%20identified%20by%20Child%20Protective%20Services%20as%20having%20abused%20their%20children%2CC.%20%20A%20parent%20education%20class%20open%20to%20all%20parents;VarQuestion_0009=Sexual%20abuse%20prevention%20training%20for%20children%20or%20adults;VarQuestion_0011=Community%20Volunteer;VarQuestion_0013=I%20am%20very%20familiar%20with%20the%20research%20and%20with%20community-based%20approaches%3B%20I%20could%20teach%20others%20about%20it.;
I want it to look more like this:
QUESTION ANSWER
Question 1 B. A treatment program for parents identified by Child Protective Services as having abused their children,C. A parent education class open to all parents
Question 3 B. Free respite child care offered to any family needing a break in order to reduce stress,C. Court-ordered substance abuse prevention classes for parents involved with Child Protective Services
Question 4 D. Societal approach
Question 5 D. Radio public service announcements aired during Child Abuse Prevention Month in April
Question 9 Sexual abuse prevention training for children or adults
Question 11 Community Volunteer
Question 13 I am very familiar with the research and with community-based approaches; I could teach others about it.
Question 16 I fink that aw childwens should be wuvved and pwotected.
Steps to solve:
urldecode
remove first few digits
explode the string by 'VarQuestion_', explode those strings by '=', select the first element for column 1 and the last element for column 2 (trimming each to remove leading '0's and the excess trailing ';')
MySQL hurdles to solve:
find a function for urldecode
find a function to explode the data
edit/manipulate the array produced by the explode function
output the array into two columns for reporting
It seems simple on paper, but it is an absolute nightmare in MySQL. Are there any packages/procedures/functions that you can recommend for each step?
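For comparison, the steps above take only a few lines in a general-purpose language, which is why this transformation is often done outside the database. A sketch in Python, handling only the VarQuestion_ keys (the parse_scorm helper and sample below are illustrative):

```python
from urllib.parse import unquote

def parse_scorm(raw):
    """Turn a SCORM suspend-data blob into (question, answer) rows.

    Follows the steps in the question: strip the ';~;' prefix, split
    on ';', split each piece on '=', and url-decode the value last
    (decoding first would turn encoded '%3B' into ';' and break
    the split)."""
    rows = []
    for piece in raw.strip(';').split(';'):
        if not piece.startswith('VarQuestion_'):
            continue
        key, _, value = piece.partition('=')
        number = int(key[len('VarQuestion_'):])  # '0016' -> 16
        rows.append((number, unquote(value)))
    return [('Question %d' % n, v) for n, v in sorted(rows)]

sample = ';~;VarQuestion_0004=D.%20%20Societal%20approach;VarQuestion_0011=Community%20Volunteer;'
for question, answer in parse_scorm(sample):
    print(question, answer)
```

Keys with other prefixes (such as VarReflectiveWriting_) would need their own branch, but the decode-after-split pattern is the part that matters.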