What do the abbreviations in POS tagging etc mean? - language-agnostic

Say I have the following Penn Tree:
(S (NP-SBJ the steel strike)
   (VP lasted
       (ADVP-TMP (ADVP much longer)
                 (SBAR than
                       (S (NP-SBJ he)
                          (VP anticipated
                              (SBAR *?*))))))
   .)
What do abbreviations like VP and SBAR mean? Where can I find their definitions? What are these abbreviations called?

Those are Penn Treebank tags; for example, VP means "verb phrase". The full list can be found in the Penn Treebank tagset documentation.

The full list of Penn Treebank POS tags (the so-called tagset), including examples, can be found at https://www.sketchengine.eu/penn-treebank-tagset/
If you are interested in more detailed information on POS tags or POS tagging, see the brief manual for beginners at https://www.sketchengine.co.uk/pos-tags/

VP means "verb phrase". These are standard abbreviations in the Penn Treebank.
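As a quick illustration, the constituent labels can be pulled out of a bracketed Penn-style tree string with plain Python. The small glossary below contains informal paraphrases of a few common labels (my own wording, not the official Penn Treebank definitions):

```python
import re

# Informal paraphrases of a few common Penn Treebank labels
# (not the official definitions).
GLOSSARY = {
    "S": "simple declarative clause",
    "NP": "noun phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "SBAR": "subordinate clause introduced by a complementizer",
    "SBJ": "subject (function tag)",
    "TMP": "temporal (function tag)",
}

def phrase_labels(tree: str) -> list[str]:
    """Return the label that opens each bracket in a Penn-style
    tree string, in order of appearance."""
    return [m.group(1) for m in re.finditer(r"\(([A-Z]+(?:-[A-Z]+)*)", tree)]

tree = """(S (NP-SBJ the steel strike)
  (VP lasted
    (ADVP-TMP (ADVP much longer)
      (SBAR than
        (S (NP-SBJ he)
          (VP anticipated
            (SBAR *?*)))))))"""

for label in dict.fromkeys(phrase_labels(tree)):  # deduplicate, keep order
    core, *function_tags = label.split("-")
    parts = [GLOSSARY.get(core, "?")] + [GLOSSARY.get(t, "?") for t in function_tags]
    print(f"{label}: {', '.join(parts)}")
```

Note that hyphenated suffixes such as -SBJ and -TMP are function tags attached to the core constituent label, which is why NP-SBJ above decomposes into "noun phrase" plus "subject".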

Related

Nltk Wordnet not lemmatizing word even with POS tag

When I do wnl.lemmatize('promotional','a') or wnl.lemmatize('promotional',wordnet.ADJ), I merely get back 'promotional', when it should return 'promotion'. I supplied the correct POS, so why isn't it working? What can I do?
Lemmatization only changes between inflected forms, so the noun "promotion" isn't a lemma of the adjective "promotional".
Note that your noun is included as a pertainym for the lemma.
from nltk.corpus import wordnet as wn
wn.synsets('promotional')[0].lemmas()[0]
# Lemma('promotional.a.01.promotional')
wn.synsets('promotional')[0].lemmas()[0].pertainyms()
# [Lemma('promotion.n.01.promotion')]

how to determine past perfect tense from POS tags

The past perfect form of 'I love.' is 'I had loved.' I am trying to identify such past perfects from POS tags (using NLTK, spaCy, Stanford CoreNLP). What POS tag should I be looking for? Or should I instead be looking for a past form of the word 'have', and will that be exhaustive?
I PRP PRON
had VBD VERB
loved VBN VERB
. . PUNCT
The complete POS tag list used by CoreNLP (and I believe all the other libraries trained on the same data) is available at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
I think your best bet is to let the library annotate a list of sentences where you want to identify a specific verbal form, and to manually derive a series of rules (e.g., sequences of POS tags) that match what you need. For example, you could be looking for VBD ("I loved"), VBD VBN ("I had loved"), VBD VBG ("I was loving somebody"), etc.
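Such a rule can be sketched over (token, tag) pairs in plain Python. The helper below is hypothetical (not part of NLTK, spaCy, or CoreNLP) and is a heuristic, not an exhaustive grammar; it assumes Penn Treebank tags like those shown above:

```python
def has_past_perfect(tagged):
    """Detect a past-perfect construction in a list of (token, tag)
    pairs: a past-tense form of 'have' (tagged VBD) followed, possibly
    after intervening adverbs (RB*), by a past participle (VBN).

    A heuristic sketch only; it will miss or over-match edge cases."""
    for i, (token, tag) in enumerate(tagged):
        if tag == "VBD" and token.lower() in {"had", "'d"}:
            for _, next_tag in tagged[i + 1:]:
                if next_tag == "VBN":
                    return True
                if not next_tag.startswith("RB"):  # allow adverbs in between
                    break
    return False

print(has_past_perfect([("I", "PRP"), ("had", "VBD"), ("loved", "VBN"), (".", ".")]))       # True
print(has_past_perfect([("I", "PRP"), ("had", "VBD"), ("never", "RB"), ("loved", "VBN")]))  # True
print(has_past_perfect([("I", "PRP"), ("loved", "VBD"), (".", ".")]))                       # False
```

Checking the token "had" in addition to the VBD tag is what distinguishes the past perfect from a plain simple past like "I loved", where the VBD verb is not an auxiliary.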

Skip a subgoal while proving in Isabelle

I am trying to prove a theorem but got stuck at a subgoal (which I would prefer to skip and prove later). How can I skip it and prove the others?
First, I tried oops and sorry, but they both abort the entire proof (instead of only the subgoal). I also tried putting the subgoal into a dummy lemma (assumed proven with sorry) and then using it with apply (rule [my dummy lemma]), but it applies the dummy lemma to every other subgoal (not only the first one).
It mostly depends on whether you are using the archaic (sorry for that ;)) apply-style or proper structured Isar for proving. I will give a small example to cover both styles. Assume you wanted to prove
lemma "A & B"
where A and B just serve as placeholders for potentially huge formulas.
As structured proof you would do something like:
proof
show "A" sorry
next
show "B" sorry
qed
I.e., in this style you can use sorry to omit proofs for subgoals.
In apply-style you could do
apply (rule conjI)
defer -- "moves the first subgoal to the last position"
apply (*proof for subgoal "B"*)
apply (*proof for subgoal "A"*)
There is also the apply-style command prefer n which moves subgoal n to the front.

Idiomatic Proof by Contradiction in Isabelle?

So far I wrote proofs by contradiction in the following style in Isabelle (using a pattern by Jeremy Siek):
lemma "<expression>"
proof -
{
assume "¬ <expression>"
then have False sorry
}
then show ?thesis by blast
qed
Is there a way that works without the nested raw proof block { ... }?
There is the rule ccontr for classical proofs by contradiction:
have "<expression>"
proof (rule ccontr)
assume "¬ <expression>"
then show False sorry
qed
It may sometimes help to use by contradiction to prove the last step.
There is also the rule classical (which looks less intuitive):
have "<expression>"
proof (rule classical)
assume "¬ <expression>"
then show "<expression>" sorry
qed
For further examples using classical, see $ISABELLE_HOME/src/HOL/Isar_Examples/Drinker.thy
For better understanding of rule classical it can be printed in structured Isar style like this:
print_statement classical
Output:
theorem classical:
obtains "¬ thesis"
Thus the pure evil to intuitionists appears a bit more intuitive: in order to prove some arbitrary thesis, we may assume that its negation holds.
The corresponding canonical proof pattern is this:
notepad
begin
have A
proof (rule classical)
assume "¬ ?thesis"
then show ?thesis sorry
qed
end
Here ?thesis is the concrete thesis of the above claim of A, which may be an arbitrarily complex statement. This quasi abstraction via the abbreviation ?thesis is typical for idiomatic Isar, to emphasize the structure of reasoning.

Can OpenNLP use HTML tags as part of the training?

I'm creating a training set for the TokenNameFinder using HTML documents converted into plain text, but my precision is low and I want to use the HTML tags as part of the training, like words in bold and sentences with different margin sizes.
Will OpenNLP accept and use those tags to create rules?
Is there another way to make use of those tags to improve precision?
It is not clear what you mean by using HTML tags to train OpenNLP.
The training input is annotated, tokenized sentences:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:company> Elsevier N.V. <END> , the Dutch publishing group .
To train an OpenNLP model using the standard tooling, your annotations need to follow this convention. Note that the annotations do not follow the XML standard.
You can embed the annotations directly into the HTML documents you will use for training. It might even help the classifier with the extra context, but I've never read any experimental results about it.
Keep in mind that the training data should be tokenized: you should include whitespace between words and punctuation, as well as between text elements and HTML tags:
<p> <i> Mr . <START:person> Vinken <END> </i> is chairman of <b> <START:company> Elsevier N.V. <END> </b>, the Dutch publishing group .
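Preparing such whitespace-tokenized lines can be partly automated. Below is a rough preprocessing sketch in Python (a hypothetical helper, not part of OpenNLP) that pads HTML tags with spaces while leaving OpenNLP's <START:...>/<END> markers alone; it deliberately does not split periods, since abbreviations like "N.V." make that step harder than a regex:

```python
import re

def tokenize_for_opennlp(line: str) -> str:
    """Insert spaces around HTML tags and split trailing commas,
    semicolons, and colons off words, so tokens are whitespace-separated
    as the OpenNLP trainer expects. OpenNLP's <START:...>/<END>
    annotation markers are left untouched; periods are NOT split here."""
    # Pad HTML tags like <p>, </i>, <b>, skipping <START:...> and <END>.
    line = re.sub(r"(</?(?!START|END)[a-zA-Z][^>]*>)", r" \1 ", line)
    # Split trailing punctuation off words: "group," -> "group ,"
    line = re.sub(r"(?<=\S)([,;:])(?=\s|$)", r" \1", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())

raw = "<p><i><START:person> Vinken <END></i> is chairman of <b>Elsevier N.V.</b>, the Dutch publishing group"
print(tokenize_for_opennlp(raw))
```

A full pipeline would still need a proper tokenizer for sentence-final periods and other punctuation, but padding the markup this way already yields lines in the shape the trainer expects.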