how to determine past perfect tense from POS tags - nltk

The past perfect form of 'I love.' is 'I had loved.' I am trying to identify such past perfects from POS tags (using NLTK, spaCy, Stanford CoreNLP). What POS tag should I be looking for? Or should I instead be looking for a past form of the word 'have', and would that be exhaustive?
I PRP PRON
had VBD VERB
loved VBN VERB
. . PUNCT

The complete POS tag list used by CoreNLP (and I believe all the other libraries trained on the same data) is available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
I think your best bet is to let the library annotate a list of sentences where you want to identify a specific verbal form and manually derive a series of rules (e.g., sequences of POS tags) that match what you need. For example, you could be looking for VBD ("I loved"), VBD VBN ("I had loved"), VBD VBG ("I was loving somebody"), etc.
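For instance, here is a minimal rule-based sketch with NLTK (the has_past_perfect helper is hypothetical, and it assumes the 'punkt' and 'averaged_perceptron_tagger' data are downloaded):
import nltk

def has_past_perfect(sentence):
    """Rough check: past-tense 'had' followed (possibly via adverbs) by a VBN."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for i, (word, tag) in enumerate(tagged):
        if tag == "VBD" and word.lower() == "had":
            # scan forward, skipping adverbs ("had never loved"),
            # until we hit a past participle or anything else
            for _, later_tag in tagged[i + 1:]:
                if later_tag == "VBN":
                    return True
                if later_tag != "RB":
                    break
    return False

print(has_past_perfect("I had loved."))        # True
print(has_past_perfect("I had never loved."))  # True
print(has_past_perfect("I loved."))            # False
Rules like this will still miss inversions ("Had he loved?") and other forms of "have", so treat them as a starting point rather than an exhaustive test.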

Related

How to Identify HTML attributes (Web Scraping)

I'm building a web scraper and I am at a loss for how to discern element attributes.
Currently, I'm moving a "scanning head" from left to right across the code, and I look for certain character strings to flag an attribute. For example, up until this point, I look for =" to decide where an attribute may exist.
Problems have started arising though, because there are multiple "valid" ways to write HTML.
For example, from this home depot page, the source code has two particular elements:
<a href="...sp=vanity...">Learn More about the RYOBI Platform Here</a>
<a href=https://www.homedepot.com/c/electronics_recycling_programs style=color:#F96302; target=_blank>Click here for more information on Electronic Recycling Programs</a>
This causes a headache for the scraper. The first element scrapes, but the second element does not have any =" to find. I can't just look for = either, because that would give false positives: in the first element, for example, there is ...sp=vanity..., which wouldn't parse correctly.
How can I handle these multiple syntaxes of HTML?
Edit: I have been using C++ up to this point
Postulated answer
I have a thought, but I'm not sure yet how well it will work. I could try pulling substrings based on the whitespace (" "), then look for the first instance of an =.
For example:
<a href="foo" target="bar"> -> pull substrings based on " " to get attList: ['href="foo"', 'target="bar"']
Then foreach(string att, attList), set key = att.left(att.indexOf('=')) and value = att.right(att.indexOf('='))
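A quick sketch of that idea (in Python rather than the question's C++, purely for illustration; parse_attributes is a hypothetical helper):
def parse_attributes(tag):
    # strip the angle brackets, then split off the tag name
    inner = tag.strip("<>").split(None, 1)
    attrs = {}
    if len(inner) < 2:
        return attrs
    # split the rest on whitespace, then on the first '=' in each chunk
    for chunk in inner[1].split():
        if "=" in chunk:
            key, _, value = chunk.partition("=")
            attrs[key] = value.strip('"')
    return attrs

print(parse_attributes('<a href="foo" target="bar">'))
# {'href': 'foo', 'target': 'bar'}
This happens to cope with the unquoted second element too (modulo the stray semicolon after the style value), but it breaks as soon as a quoted value contains a space (e.g. style="color: red"), which is why a real HTML parser is the safer long-term answer.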

How to mark a position in the stream in pyparsing to come back to it later?

Background:
I am trying to implement a simple (?) markup language to be used to write novels.
This is quite different from the usual markups because the semantics are centered on different primitives; in particular, direct speech is far from being similar to a list.
The basic structure is well-known: #part{title}, #chapter{title} and #scene[{title}] have the usual meanings and double-\n indicates a paragraph break.
Specific features include:
#speech[speaker]{utterance, possibly complex}
#stress{something that should be visually enhanced}
#standout{some part that should have a different visual enhancement}
#quotation[original author]{possibly long block quotation}
This should be parsed and translated to different output formats (e.g.: html and LaTeX).
I have a pyparsing grammar able to parse a non-trivial input.
Problem is generation of paragraphs for HTML:
As said, a paragraph ends with a double newline, but it essentially starts at the end of the previous paragraph, unless some top-level construct (e.g.: #chapter) intervenes to break the sequence.
A first naive attempt was to accumulate text fragments in a global buffer and to emit them at selected points; this would logically work, but pyparsing seems to call its parse actions multiple times, so my global buffer ends up holding the same fragment duplicated.
I have not found a way either to avoid such duplication or to mark the "start of paragraph" in such a way that I can come back to it later to generate the well-known <p>Long line, maybe containing #speech{possibly nested with #standout{!} and other constructs}</p> (of course #standout should map to <b>!</b> and #speech to some specific <div class="speech"></div>).
What is the "best practice" to handle this kind of problems?
Note: LaTeX code generation is much less problematic because paragraphs are simply terminated (like in the markup) either with a blank line or with \par.
Is it possible for you to recast this not as a "come back to the beginning later" problem but as a "read ahead as far as I need to get the whole thing" problem?
I think nestedExpr might be a way for you to read ahead to the next full markup, and then have a parse action re-parse the contents in order to process any nested markup directives. nestedExpr returns its parsed input as a nested list, but to get everything as a flattened string, wrap it in originalTextFor.
Here is a rework of the simpleWiki.py example from the pyparsing examples:
import pyparsing as pp

wiki_markup = pp.Forward()

# a method that will construct and return a parse action that will
# do the proper wrapping in opening and closing HTML, and recursively call
# wiki_markup.transformString on the markup body text
def convert_markup_to_html(opening, closing):
    def conversionParseAction(s, l, t):
        return opening + wiki_markup.transformString(t[1][1:-1]) + closing
    return conversionParseAction

# use a nestedExpr with originalTextFor to parse nested braces, but return the
# parsed text as a single string containing the outermost nested braces instead
# of a nested list of parsed tokens
markup_body = pp.originalTextFor(pp.nestedExpr('{', '}'))

italicized = ('ital' + markup_body).setParseAction(convert_markup_to_html("<I>", "</I>"))
bolded = ('bold' + markup_body).setParseAction(convert_markup_to_html("<B>", "</B>"))

# another markup and parse action to parse links - again using transformString
# to recursively parse any markup in the link text
def convert_link_to_html(s, l, t):
    t['link_text'] = wiki_markup.transformString(t['link_text'])
    return '<A href="{url}">{link_text}</A>'.format_map(t)

urlRef = ('link'
          + '{' + pp.SkipTo('->')('link_text') + '->' + pp.SkipTo('}')('url') + '}'
          ).setParseAction(convert_link_to_html)

# now inject all the markup bits as possible markup expressions
wiki_markup <<= urlRef | italicized | bolded
Try it out!
wiki_input = """
Here is a simple Wiki input:
ital{This is in italics}.
bold{This is in bold}!
bold{This is in ital{bold italics}! But this is just bold.}
Here's a URL to link{Pyparsing's bold{Wiki Page}!->https://github.com/pyparsing/pyparsing/wiki}
"""
print(wiki_markup.transformString(wiki_input))
Prints:
Here is a simple Wiki input:
<I>This is in italics</I>.
<B>This is in bold</B>!
<B>This is in <I>bold italics</I>! But this is just bold.</B>
Here's a URL to <A href="https://github.com/pyparsing/pyparsing/wiki">Pyparsing's <B>Wiki Page</B>!</A>
Given your markup examples, I think this approach may get you further along.
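Adapting the same pattern to the question's own markup could look like the following (an untested sketch continuing the example above; the HTML tags chosen for #stress and #standout are assumptions):
stress = ('#stress' + markup_body).setParseAction(convert_markup_to_html("<em>", "</em>"))
standout = ('#standout' + markup_body).setParseAction(convert_markup_to_html("<b>", "</b>"))
wiki_markup <<= urlRef | italicized | bolded | stress | standout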

What is the regex expression for a string after I use Nokogiri to scrape?

I have this string, and it is in an HTML document with 100 other names that are formatted the same:
<li>Physical education sed<span class="meta"><ul><li>15184745922</li></ul></span>
</li>
And I want to save 'Physical education sed' under a name column and '15184745922' under a number column.
I was wondering how to do this in Ruby.
In nokogiri I can get only the li's by doing this:
puts page.css("ul li").text
but then it comes out all in one word: "Physical education sed15184745922"
I was thinking regex is the way to go but I am stumped with that.
I did split it on the li
full_contact = page.css("ul li")[22]
split_contact_on_li = full_contact.to_s.split(/(\W|^)li(\W|$)/).map(&:to_sym)
puts split_contact_on_li
and I get this
<
>
Physical education sed<span class="meta"><ul>
<
>
15184745922<
/
>
</ul></span>
<
/
>
The same number of lines will be shown for each contact_info; the name is always on the third line, before the span class, and the number is always on the 6th line.
There is an instance where there might be an email address instead on the 6th line, but not often.
So should I match the second and third angle brackets and pull the information up to the third and fourth brackets, then shove it into arrays called name and number?
You shouldn't use a regex to parse XHTML, since the regex engine might mess things up; you should use an HTML parser instead. However, if you want to use a regex, you can use one like this:
<li>(.*?)<.*?<li>(.*?)<
Working demo
The idea behind this regex is to use capturing groups (using parentheses) to capture the content you want. So, for your sample input, the match information is:
MATCH 1
Group 1. [4-26] `Physical education sed`
Group 2. [53-64] `15184745922`
For example:
#!/usr/bin/env ruby
string = "<li>Physical education sed<span class=\"meta\"><ul><li>15184745922</li></ul></span></li>"
one, two = string.match(/<li>(.*?)<.*?<li>(.*?)</i).captures
p one #=> "Physical education sed"
p two #=> "15184745922"
Why don't you just do a regex on the string "physical education sed15184745922"? You can match on the first digit, and get back the number and the preceding text.
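Sketched in Python for illustration (the regex itself carries straight over to Ruby):
import re

# lazily take everything up to the first digit, then the trailing digits
m = re.match(r"(\D+?)(\d+)$", "Physical education sed15184745922")
print(m.group(1))  # "Physical education sed"
print(m.group(2))  # "15184745922"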
I don't know how to use Ruby, but if I understand your question correctly I would take advantage of the gsub function (or Ruby's equivalent). It might not be the prettiest approach, but since we just want the text in one variable and the numbers in another, we can just replace the characters we don't want with empty values.
v1 = page.css('ul li').text
v2 = v1.gsub(/\d/, '')    # strip the digits, keeping the text
v3 = v1.gsub(/[^\d]/, '') # strip everything but the digits
v1 gets the full text value, v2 replaces all numeric characters with '', and v3 replaces all non-numeric characters with '', giving us two new variables to put wherever we please.
Again, I don't know how to use Ruby, but in R I know that I could get all of the values from the page using the CSS selector you provided ("ul li") into a vector and then loop across the vector, performing the above steps on each element. I'm not sure if that adequately answers your question, but hopefully the gsub function gets you closer to what you want.
You need to use your HTML parser (Nokogiri) and regular expressions together. First, use Nokogiri to traverse down to the first parent node that contains all the text you need, then regex the text to get what you need.
Also, consider using .xpath instead of .css; it provides much more functionality to search and scrape just what you want. Given your example, you could do it like so:
page.xpath("//span[@class='meta']/parent::li").map do |i|
  i.text.scan(/^([a-z\s]+)(\d+)$/i).flatten
end
#=> [['Physical education sed', '15184745922'], ['the next string', '1234567890'], ...]
And now you have a two-dimensional array you can iterate over and save each pair.
This bit of XPath business, "//span[@class='meta']/parent::li", is doing what .css can't do: returning the parent node that has the text and the specific children nodes you want to scrape against.
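For comparison, the same parent::li trick works in any XPath-capable parser. Here is an illustrative sketch in Python with lxml (the library choice and the regex are my assumptions, not part of the answer above):
import re
from lxml import html

fragment = ('<li>Physical education sed<span class="meta">'
            '<ul><li>15184745922</li></ul></span></li>')
doc = html.fromstring(fragment)
for li in doc.xpath("//span[@class='meta']/parent::li"):
    # split the flattened text back into (name, number)
    print(re.findall(r'^([A-Za-z\s]+?)(\d+)$', li.text_content()))
#=> [('Physical education sed', '15184745922')]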

What do the abbreviations in POS tagging etc mean?

Say I have the following Penn Tree:
(S (NP-SBJ the steel strike)
(VP lasted
(ADVP-TMP (ADVP much longer)
(SBAR than
(S (NP-SBJ he)
(VP anticipated
(SBAR *?*))))))
.)
What do abbreviations like VP and SBAR mean? Where can I find their definitions? What are these abbreviations called?
Those are the Penn Treebank tags, for example, VP means "Verb Phrase". The full list can be found here
The full list of Penn Treebank POS tags (so-called tagset) including examples can be found on https://www.sketchengine.eu/penn-treebank-tagset/
If you are interested in detail information on POS tag or POS tagging, see a brief manual for beginners on https://www.sketchengine.co.uk/pos-tags/
VP means verb phrase. These are standard abbreviations in the Penn Treebank.
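If you load the bracketed tree with NLTK, you can also list these labels programmatically; a minimal sketch (assuming NLTK is installed):
from nltk import Tree

t = Tree.fromstring("""(S (NP-SBJ the steel strike)
  (VP lasted
    (ADVP-TMP (ADVP much longer)
      (SBAR than
        (S (NP-SBJ he)
          (VP anticipated
            (SBAR *?*))))))
  .)""")
# collect every phrase label that occurs in the tree
print(sorted({subtree.label() for subtree in t.subtrees()}))
# ['ADVP', 'ADVP-TMP', 'NP-SBJ', 'S', 'SBAR', 'VP']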

How can I retrieve a collection of values from nested HTML-like elements using RegExp?

I have a problem creating a regular expression for the following task:
Suppose we have HTML-like text of the kind:
<x>...<y>a</y>...<y>b</y>...</x>
I want to get a collection of values inside <y></y> tags located inside a given <x> tag, so the result of the above example would be a collection of two elements ["a","b"].
Additionally, we know that:
<y> tags cannot be enclosed in other <y> tags
... can include any text or other tags.
How can I achieve this with RegExp?
This is a job for an HTML/XML parser. You could do it with regular expressions, but it would be very messy. There are examples in the page I linked to.
I'm taking your word on this:
"y" tags cannot be enclosed in other "y" tags
input looks like: <x>...<y>a</y>...<y>b</y>...</x>
and the fact that everything else is also not nested and correctly formatted. (Disclaimer: If it is not, it's not my fault.)
First, find the contents of any X tags with a loop over the matches of this:
<x[^>]*>(.*?)</x>
Then (in the loop body) find any Y tags within match group 1 of the "outer" match from above:
<y[^>]*>(.*?)</y>
Pseudo-code:
input = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"
for each x_match in input.match_all(x_re)
    for each y_match in x_match.group(1).value.match_all(y_re)
        print y_match.group(1).value
    next y_match
next x_match
Pseudo-output:
a
b
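The pseudo-code translates almost directly into, say, Python (an illustrative sketch only; any language with a find-all-matches call works the same way):
import re

input_text = "<x>...<y>a</y>...<y>b</y>...</x>"
x_re = "<x[^>]*>(.*?)</x>"
y_re = "<y[^>]*>(.*?)</y>"

# outer loop over X elements, inner loop over Y elements within each match
for x_match in re.finditer(x_re, input_text, re.DOTALL):
    for y_match in re.finditer(y_re, x_match.group(1), re.DOTALL):
        print(y_match.group(1))
# Prints:
# a
# b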
Further clarification in the comments revealed that there is an arbitrary number of Y elements within any X element. This means there can be no single regex that matches them all and extracts their contents.
Short and simple: Use XPath :)
It would help if we knew what language or tool you're using; there's a great deal of variation in syntax, semantics, and capabilities. Here's one way to do it in Java:
String str = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>";
String regex = "<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\\b))*+</x>)(.*?)</y>";
Matcher m = Pattern.compile(regex).matcher(str);
while (m.find())
{
    System.out.println(m.group(1));
}
Once I've matched a <y>, I use a lookahead to affirm that there's a </x> somewhere up ahead, but there's no <x> between the current position and it. Assuming the pseudo-HTML is reasonably well-formed, that means the current match position is inside an "x" element.
I used possessive quantifiers heavily because they make things like this so much easier, but as you can see, the regex is still a bit of a monster. Aside from Java, the only regex flavors I know of that support possessive quantifiers are PHP and the JGsoft tools (RegexBuddy/PowerGREP/EditPad Pro). On the other hand, many languages provide a way to get all of the matches at once, but in Java I had to code my own loop for that.
So it is possible to do this job with one regex, but a very complicated one, and both the regex and the enclosing code have to be tailored to the language you're working in.
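For what it's worth, Python's re module gained possessive quantifiers in version 3.11, so the same regex can now be tried there as well (a minimal sketch):
import re  # possessive quantifiers (*+, ++, ?+) require Python 3.11+

text = "<y>c</y>...<x>...<y>a</y>...<y>b</y>...</x>...<y>d</y>"
regex = r"<y[^>]*+>(?=(?:[^<]++|<(?!/?+x\b))*+</x>)(.*?)</y>"
for m in re.finditer(regex, text):
    print(m.group(1))
# Prints:
# a
# b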