pos_tag in NLTK does not tag sentences correctly

I have used this code:
# Step 1 : TOKENIZE
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
# Step 2 : POS DISAMBIG
from nltk.tag import pos_tag
tags = pos_tag(words)
to tag two sentences:
John is very nice. Is John very nice?
John in the first sentence was tagged NN, while in the second it was tagged VB! So, how can we correct the pos_tag function without training back-off taggers?
Modified question:
I have seen the demonstration of NLTK taggers here: http://text-processing.com/demo/tag/. When I tried the option "English Taggers & Chunkers: Treebank" or "Brown Tagger", I got the correct tags. So how can I use, for example, the Brown tagger without training it?

Short answer: you can't. Slightly longer answer: you can override specific words using a manually created UnigramTagger. See my answer for custom tagging with nltk for details on this method.
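For illustration, here is a minimal sketch of that override idea, assuming NLTK 3; the word/tag choices in the model dict are made up for this example:
import nltk
from nltk.tag import UnigramTagger, DefaultTagger

# Pin tags for specific words via the model dict; anything not in the
# dict falls through to the backoff tagger (here a crude DefaultTagger).
backoff = DefaultTagger('NN')
tagger = UnigramTagger(model={'John': 'NNP', 'nice': 'JJ'}, backoff=backoff)
tokens = nltk.word_tokenize("Is John very nice?")
print(tagger.tag(tokens))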

I tried to reproduce the bug using NLTK v3.0, and I think nltk.pos_tag() is now fixed. As @Jacob mentioned, you can use the Brown Corpus to train a tagger as follows:
import nltk
from nltk.corpus import brown
train_sents = brown.tagged_sents()
unigram_tagger = nltk.UnigramTagger(train_sents)
tokens = nltk.word_tokenize("Is John very nice?")
tagged = unigram_tagger.tag(tokens)
tagged
But note that the tag set depends on the corpus that was used to train the tagger. The default tagger used by nltk.pos_tag() uses the Penn Treebank tag set.
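If you want the two taggers' outputs to be comparable despite the different tag sets, one option (a sketch, assuming the universal_tagset resource has been downloaded) is to map both to the Universal tagset:
import nltk
from nltk.corpus import brown

# Map the Brown tags to the coarse Universal tagset while training...
train_sents = brown.tagged_sents(tagset='universal')
unigram_tagger = nltk.UnigramTagger(train_sents)
# ...and ask pos_tag for the same tagset, so the outputs line up.
tokens = nltk.word_tokenize("Is John very nice?")
print(unigram_tagger.tag(tokens))
print(nltk.pos_tag(tokens, tagset='universal'))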

Related

Nltk Wordnet not lemmatizing word even with POS tag

When I do wnl.lemmatize('promotional','a') or wnl.lemmatize('promotional',wordnet.ADJ), I get merely 'promotional' when it should return 'promotion'. I supplied the correct POS, so why isn't it working? What can I do?
Lemmatization only changes between inflected forms, so the noun "promotion" isn't a lemma of the adjective "promotional".
Note that your noun is included as a pertainym for the lemma.
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('promotional')[0].lemmas()[0]
Lemma('promotional.a.01.promotional')
>>> wn.synsets('promotional')[0].lemmas()[0].pertainyms()
[Lemma('promotion.n.01.promotion')]
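If what you actually need is the related noun, a small sketch along these lines could work; the helper name and the first-pertainym choice are assumptions for illustration:
from nltk.corpus import wordnet as wn

def related_noun(adjective):
    # Hypothetical helper: return the first pertainym noun WordNet
    # lists for the adjective, falling back to the input itself.
    for synset in wn.synsets(adjective, pos=wn.ADJ):
        for lemma in synset.lemmas():
            if lemma.pertainyms():
                return lemma.pertainyms()[0].name()
    return adjective

print(related_noun('promotional'))  # -> 'promotion'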

R Stargazer table ASCII text output formatting (line break, alignment & reference group)

For better or worse, I don't use LaTeX (yet). I like producing stargazer formatted tables on the fly for class examples in both HTML and in the console. However, I'm having trouble with 3 formatting elements; so far I've found solutions for LaTeX and some in HTML, but the ASCII console text eludes me.
The 3 challenges are:
Breaking a line so that a variable name can wrap instead of increasing the table width.
Aligning coefficients & std. errors at the decimal, even when there are p-value stars.
Making space in the covariate labels & coefficients to allow for a reference group.
Let's start with some reproducible data & outputs to reference.
library(stargazer)
set.seed(3); x1 <- factor(sample(letters[1:4], 1000, replace=TRUE))
set.seed(4); x2 <- runif(1000, -10, 10)
set.seed(5); x3 <- rbinom(1000, size = 1, prob = 0.13)
set.seed(6); y <- runif(1000, -10, 10)
model <- lm(y ~ x1 + x2 + x3)
stargazer(model, align=TRUE,
          #type="html", out="SO_stargazer.html",
          type="text", out="SO_stargazer.txt",
          title="Example Title Goes Here",
          dep.var.caption="",
          dep.var.labels="This is my long title for the Dependent Variable Y",
          covariate.labels=c("X1 Group B",
                             "X1 Group C",
                             "X1 Group D",
                             "X2 with a super ridiculous and annoyingly long name",
                             "X3"))
Line break
My default approach is to use \n in the character string. For example, I might try to break the DV caption:
dep.var.labels="This is my long title for \n the Dependent Variable Y",
But that generates the following error message:
Error in if (nchar(text.matrix[r, c]) > max.length[real.c]) { : missing value where TRUE/FALSE needed
Found a couple of posts about this issue (here, which references here), but the poster on the first did not provide much of an example to follow, and the second pertained to an underscore that I don't have, or gave LaTeX solutions. The only change that broke what already worked was the addition of the \n. I did try the TeX \\ escape, but that didn't do anything useful for text output.
I am able to get line breaks using <br> in the string for the html output file version.
This post also mentions the tex and html solutions, but not text.
Alignment on the decimal
When there are no statistical significance stars on coefficients, both the coefficients and std. errors align nicely, centered on the decimal point. However, once the stars appear, they 'push' the coefficient to the left. This happens in both the text and HTML output. This is not so bad with 1 star, but 3 stars can make quite a difference. How can I coerce it back to align on the decimal in both formats? This issue persists even if I use the single.row=TRUE option. This post answer by @Marco Doe has a great visual of what I'm talking about, but the centering noted there is for TeX. Found a LaTeX solution, but no mention of the other formats on that post. I've tinkered with the align and float options to no avail (inspired by these quasi-related TeX solution posts here and here). The latter post hinted at using xtable or post-processing edits, but that was more than 5 years ago, so I'm hoping for an updated, viable solution.
This image is from Marco Doe's solution and shows LaTeX output, but it does a good job of illustrating the output format I get (left) and what I would like to have (right).
Reference categories
Found a LaTeX solution that 'pushes' the covariate & coefficient data down a row, making room for a reference group to be printed in the covariate column; however, the solution is in TeX. How can I replicate this for the text output? Can I replicate it for the HTML version as part of the R code, without having to get surgical with the HTML output code?
@Giac posted the images (linked above) to illustrate the have (left) and want (right). Although these images show TeX output, how could I get the right-hand result in text and HTML?

beautifulsoup findAll resulting null with div tag

I am trying to scrape the comments from the following link:
https://www.kickstarter.com/projects/175927790/tupi-2d-animation-software-for-everyone/comments
Using code :
import urllib2
from bs4 import BeautifulSoup

urlcomments = url + "/comments"
htmlcomments = urllib2.urlopen(urlcomments).read()
commentsoup = BeautifulSoup(htmlcomments, "html.parser")
commentable = commentsoup.findAll('section', attrs={"class_": "js-could-have-comments js-project-comments-content js-project-content project-content"})
I have tried both urllib and urllib2, but neither is working; the result of the findAll is [].
I have also tried different tags within the HTML; div with a class is not working either.
This piece of code is part of a class, so if anybody needs more info on it, please let me know.
Any suggestions will be helpful.
Thanks in advance.
The problem is that if you want to use keyword arguments in findAll, you need to use class_, but when you use attrs={}, there is no need for the extra underscore at the tail of class: 'class' is simply a key in the dictionary. So just change your code to this:
commentable = commentsoup.findAll('section', attrs={"class": "js-could-have-comments js-project-comments-content js-project-content project-content"})
Besides that, this page is rendered by JavaScript, so you cannot get the JavaScript-rendered part with urllib or requests; I recommend using Selenium.
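A minimal sketch of that Selenium route, assuming ChromeDriver is installed; the selector is the same class list from the question:
from selenium import webdriver
from bs4 import BeautifulSoup

# Let the browser execute the JavaScript, then hand the rendered
# page source to BeautifulSoup.
driver = webdriver.Chrome()
driver.get("https://www.kickstarter.com/projects/175927790/tupi-2d-animation-software-for-everyone/comments")
soup = BeautifulSoup(driver.page_source, "html.parser")
comments = soup.findAll('section', attrs={"class": "js-could-have-comments js-project-comments-content js-project-content project-content"})
driver.quit()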
It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:
soup.find_all("a", class_="sister")
soup.find_all("a", attrs={"class": "sister"})
BeautifulSoup Doc
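As a quick self-contained check of the two equivalent forms quoted above:
from bs4 import BeautifulSoup

# Both lookups match the same element; class_ is the keyword-argument
# spelling, attrs={"class": ...} the dictionary spelling.
soup = BeautifulSoup('<a class="sister" href="#">Elsie</a>', "html.parser")
print(soup.find_all("a", class_="sister"))
print(soup.find_all("a", attrs={"class": "sister"}))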

In HandsomeSoup, how do I select the inner html of an element?

Say the html I'm parsing consists of the anchor tag:
<a href="/here">this is what I want</a>
Using the example in the package documentation I can get the href string "/here" by:
links <- runX $ doc >>> Text.HandsomeSoup.css "a" ! "href"
But how do I get the inner html? The following is in the spirit of what I'm looking for but does not work:
links <- runX $ doc >>> Text.HandsomeSoup.css "a" ! "value"
I have looked through the HandsomeSoup documentation thoroughly and at this point am wondering if this is even possible. Any help would be much appreciated.
HandsomeSoup builds on top of hxt, so you can use the (vast) API of hxt as well. More specifically, I believe that
getChildren >>> isText >>> getText
will extract the text contents from the elements, i.e. something like:
links <- runX $ doc >>> Text.HandsomeSoup.css "a" >>> getChildren >>> isText >>> getText
Here are the documentation entries for getChildren, isText and getText. I suspect you'll also want something like hasAttrValue to better specify which anchors you are interested in.

HTML XPath: Extracting text mixed in with multiple tags?

Goal: Extract text from a particular element (e.g. li), while ignoring the various mixed in tags, i.e. flatten the first-level child and simply return the concatenated text of each flattened child separately.
Example:
<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
<ol>
<li>Central <a href="#">Intelligence Agency</a>.</li>
<li><a href="#">Culinary Institute of America</a>.</li>
</ol>
</div>
desired text:
Central Intelligence Agency
Culinary Institute of America
Except that the anchor tags surrounding the text prevent a simple retrieval.
To return each li tag separately, we use the straightforward:
//div[contains(@id,"mw-content-text")]/ol/li
but that also includes surrounding anchor tags, etc. And
//div[contains(@id,"mw-content-text")]/ol/li/text()
returns only the text elements that are direct children of li, i.e. 'Central','.'...
It seemed logical then to look for text elements of self and descendants
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
but that returns nothing at all!
Any suggestions? I'm using Python, so I'm open to using other modules for post-processing.
(I am using the Scrapy HtmlXPathSelector which seems XPath 1.0 compliant)
You were almost there. There is a small problem in:
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text]
The corrected expression is:
//div[contains(@id,"mw-content-text")]/ol/li[descendant-or-self::text()]
However, there is a simpler expression that produces exactly the wanted concatenation of all text-nodes under the specified li:
string(//div[contains(@id,"mw-content-text")]/ol/li)
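For instance, evaluated with lxml (an assumption; the question says other Python modules are fine), string() returns the concatenated text of the first matching li:
from lxml import etree

# string() yields one string: all text nodes under the first li
# the path matches, concatenated in document order.
doc = etree.HTML('<div id="mw-content-text"><ol>'
                 '<li>Central <a href="#">Intelligence Agency</a>.</li>'
                 '</ol></div>')
print(doc.xpath('string(//div[contains(@id,"mw-content-text")]/ol/li)'))
# -> Central Intelligence Agency.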
I think the following would return the correct result:
//div[contains(@id,"mw-content-text")]/ol/li//text()
Note the double slash before text(). This means text nodes on any level below li must be returned.
The string concatenation is tricky. Here's a quick solution using lxml:
>>> from lxml import etree
>>> doc = etree.HTML("""<div id="mw-content-text"><h2><span class="mw-headline">CIA</span></h2>
... <ol>
... <li>Central <a href="#">Intelligence Agency</a>.</li>
... <li><a href="#">Culinary Institute of America</a>.</li>
... </ol>
...
... </div>""")
>>> for element in doc.xpath('//div[@id="mw-content-text"]/ol/li'):
... print "".join(element.xpath('descendant-or-self::text()'))
...
Central Intelligence Agency.
Culinary Institute of America.
Please note that // can perform poorly and match more than intended, so it should be avoided where possible, though that is difficult to do with this example HTML fragment.
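Since the question mentions Scrapy's HtmlXPathSelector, a rough sketch with that (old, XPath 1.0) API might look like this; constructing the selector from a raw string is an assumption about the Scrapy version in use:
from scrapy.selector import HtmlXPathSelector

html = '''<div id="mw-content-text"><ol>
<li>Central <a href="#">Intelligence Agency</a>.</li>
<li><a href="#">Culinary Institute of America</a>.</li>
</ol></div>'''

hxs = HtmlXPathSelector(text=html)
for li in hxs.select('//div[contains(@id,"mw-content-text")]/ol/li'):
    # .//text() collects text nodes at any depth under each li
    print("".join(li.select('.//text()').extract()))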