Princeton WordNet database - two different synset identifiers? - mysql

I am trying to make sense of the different identifiers in the Princeton WordNet database. I am using version 3.1. You can read about the structure here, but my focus is on the synsets table.
The Synset Table
The synsets table is one of the most important tables in the database. It houses all the definitions within WordNet. Each row in the synsets table has a synsetid, a definition, a pos (part-of-speech) field, and a lexdomainid (which links to the lexdomain table). There are 117,373 synsets in the WordNet database.
When I search for the word joy in my senses table, I see four different results (2 nouns and 2 verbs). From there, I can identify the sense/meaning I am looking for, which is the one with the definition:
"the emotion of great happiness"
So I have now found the result I am looking for. The synsetid of this result is 107542591, and I can search on this id to find other words with the same sense/meaning.
However, when I use some online versions of WordNet and search for words in the synset "the emotion of great happiness", I see a different type of identifier: 07527352-n.
For example, you can see it in the top-left corner of this site. On that same site, in the address bar you'll see that the identifier is referred to as the synset id: &synset=07527352-n.
I would like to know how to retrieve the second type of identifier for a given synset. I've read through the documentation here and searched through the raw data files, but I cannot figure it out.
Thank you!

There are two things going on.
First, MySQL integer columns drop a leading 0, so the SQL ids are given a part-of-speech prefix digit instead: nouns get a 1 prefix, verbs 2, adjectives 3, and adverbs 4 (see the WordNet identifiers section at http://wordnet-rdf.princeton.edu/).
Second, 07542591 is the offset from WordNet 3.1 (I've checked both the raw WordNet files and the SQL files; both use this).
"07527352" is from an older version of WordNet. In the case of Chinese WordNet I believe they use WordNet 3.0. http://compling.hss.ntu.edu.sg/cow/
Additional: https://stackoverflow.com/a/33348009/841830 has more information. Strangely, I've not been able to track down a simple 3.0-to-3.1 conversion table yet... but I'm sure I've seen one.
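To go from the MySQL synsetid back to the offset-plus-pos form (for the same WordNet version), you can strip the prefix digit and map it to a pos letter. A minimal sketch in Python, following the prefix scheme above:

POS_BY_PREFIX = {'1': 'n', '2': 'v', '3': 'a', '4': 'r'}

def synsetid_to_offset(synsetid):
    # Convert a MySQL synsetid such as 107542591 to '07542591-n'
    s = str(synsetid)
    return '%s-%s' % (s[1:].zfill(8), POS_BY_PREFIX[s[0]])

print(synsetid_to_offset(107542591))  # -> 07542591-n

Note this still gives the 3.1 offset; matching the 3.0 identifier (07527352-n) additionally needs a version mapping.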

Related

Can an OCR run in a split-second if it is highly targeted? (Small dictionary)

I am looking for an open-source OCR (maybe Tesseract) that uses a dictionary to match words against. For example, I know that this OCR will only be used to search for certain names. Imagine I have a master guest list (written) and I want to scan this list in under a second with the OCR and check it against a database of names.
I understand that a traditional OCR can attempt to read every letter and then I could just cross-reference the results with the 100 names, but this takes too long. If the OCR were focusing on just those 100 words and nothing else, it should be able to do all this in a split second. I.e., there is no point in guessing that a word might be "Jach", since "Jach" isn't a name in my database; the OCR should be able to infer that it is "Jack", since that is an actual name in the database.
Is this possible?
It should be possible. Think of it this way: instead of having your OCR look for 'J', it could look for 'Jack' directly, treating the whole word, more or less, as an individual symbol.
So when you train / calibrate your OCR, train it with images of whole words, similar to how you would for an individual symbol.
(If this feature is not directly available in your OCR, first map images of whole words to a unique symbol, then transform that symbol into the final word string.)
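If whole-word training isn't an option, a cheaper route in the same spirit is to run a normal OCR pass and then snap each output word to the nearest name on the list. A rough sketch in Python using difflib from the standard library (the name list and cutoff are placeholders):

import difflib

guest_names = ['Jack', 'Jill', 'Jane']  # your 100-name master list

def snap_to_name(ocr_word, names=guest_names, cutoff=0.8):
    # Return the closest known name, or None if nothing is close enough
    matches = difflib.get_close_matches(ocr_word, names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(snap_to_name('Jach'))  # -> 'Jack'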

WordNet 3.0 Curse Words

I'm developing a system where keywords are extracted from plain text.
The requirements for a keyword are:
Between 1 and 45 letters long
Word must exist within the WordNet database
Must not be a "common" word
Must not be a curse word
I have fulfilled requirements 1-3; however, I can't find a way to distinguish curse words. How do I filter them?
I know this would not be a definitive method of filtering out all the curse words; what happens is that all keywords are first set to a "pending" state before being "approved" by a moderator. But if I can get WordNet to filter out most of the curse words, it would make the moderator's job much easier.
It's strange: the Unix command-line version of WordNet (wn) will give you the desired information with the -domn (domain) option:
wn ass -domnn (-domnv for a verb)
...
>>> USAGE->(noun) obscenity#2, smut#4, vulgarism#1, filth#4, dirty word#1
>>> USAGE->(noun) slang#2, cant#3, jargon#1, lingo#1, argot#1, patois#1, vernacular#1
However, the equivalent method in NLTK just returns an empty list:
from nltk.corpus import wordnet
a = wordnet.synsets('ass')
for s in a:
    for l in s.lemmas():  # lemmas is a method in NLTK 3
        print(l.usage_domains())
[]
[]
...
As an alternative, you could try to filter words that have "obscene", "coarse" or "slang" in their synset's definition. But it's probably much easier to filter against a fixed list, as suggested before (like the one at noswearing.com).
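A rough sketch of that definition-based filter with NLTK's WordNet interface (the marker words are just a starting set, not an exhaustive list):

from nltk.corpus import wordnet

MARKERS = ('obscene', 'coarse', 'slang', 'vulgar')

def looks_like_curse(word):
    # Flag a word if any of its synset definitions contain a marker term
    return any(marker in s.definition()
               for s in wordnet.synsets(word)
               for marker in MARKERS)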
For the 4th requirement, it would be better and more effective to collect a list of curse words and remove them in an iterative pass.
To achieve this you can check out this blog; I'll summarize it here.
1. Load a swear-words text file from here.
2. Compare it with the text and remove each word that matches.
curseWords = {'fuck'}  # a set loaded from the swear-words file above

def remove_curse_words(text):
    # Keep only the words that are not on the curse-word list (case-insensitive)
    return ' '.join(word for word in text.split() if word.lower() not in curseWords)

print(remove_curse_words('Hey Bro Fuck you'))
The output would be:
Hey Bro you

Using NLTK to recognise dates as named entities?

I'm trying to use the NLTK named entity tagger to identify various named entities. In the book Natural Language Processing with Python they provide a list of commonly used named entities (Table 7.4, if anyone is curious), which includes DATE (June, 2008-06-29) and TIME (two fifty a m, 1:30 p.m.). So I got the impression that this could be done with NLTK's named entity tagger.
However, when I run the tagger, it doesn't seem to pick up dates or times at all, the way it does people or organizations. Does the NLTK named entity tagger not handle these date/time cases, or does it only pick up a specific date/time format? If it doesn't handle them, does anybody know of a system that does? Or is creating my own the only solution?
Thanks!
You should check out NLTK's contrib repository - it contains a module called timex.py. You can also download it here:
https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
From the first line of the module:
# Code for tagging temporal expressions in text
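A minimal usage sketch, assuming you've saved timex.py next to your script; tag() is the top-level function in the linked source, and as Python 2-era code it may need small fixes on Python 3:

import timex

text = 'The meeting was moved from June to 1:30 p.m. next Friday'
print(timex.tag(text))  # temporal expressions come back wrapped in TIMEX tags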

Named Entity Recognition using NLTK. Relevance of extracted keywords

I was checking out the Named Entity Recognition feature of NLTK. Is it possible to find out which of the extracted keywords is most relevant to the original text? Also, is it possible to know the type (Person / Organization) of the extracted keywords?
If you have a trained tagger, you can first tag your text and then use the NE classifier that comes with NLTK.
The tagged text should be presented as a list of (token, tag) tuples:
sentence = 'The U.N.'
tagged_sentence = [('The','DT'), ('U.N.', 'NNP')]
Then the NE classifier would be called like this:
nltk.ne_chunk(tagged_sentence)
It returns a Tree; the classified words appear as subtrees inside the main structure. The result indicates whether each entity is a PERSON, ORGANIZATION, or GPE.
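For example, a minimal end-to-end sketch (the sample sentence is made up; you'll need the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words resources from nltk.download()):

import nltk

sentence = 'The U.N. criticized Acme Corp. in New York'
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

# Named entities appear as subtrees; plain words stay as (token, tag) tuples
for node in tree:
    if isinstance(node, nltk.Tree):
        entity = ' '.join(token for token, tag in node.leaves())
        print(entity, node.label())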
To find out the most relevant terms, you have to define a measure of "relevance". Usually tf-idf is used, but if you are considering only one document, raw frequency could be enough.
Computing the frequency of each word within a document is easy with NLTK. First load your corpus; once you have it loaded and have a Text object, simply call:
# FreqDist.keys() is not frequency-sorted in NLTK 3, so use most_common()
relevant_terms_sorted_by_freq = [w for w, _ in nltk.FreqDist(corpus).most_common()]
Finally, you could filter out all words in relevant_terms_sorted_by_freq that don't belong to a NE list of words.
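Putting the two ideas together, a rough sketch (one-document case, raw frequency as the relevance measure):

import nltk

def relevant_entities(text):
    # Rank the named entities in the text by how often they occur
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    entities = [' '.join(tok for tok, tag in node.leaves())
                for node in tree if isinstance(node, nltk.Tree)]
    return nltk.FreqDist(entities).most_common()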
NLTK offers a complete book online, which I find a good place to start.

List of idiomatic word pairs

I remember seeing somewhere a dictionary of idiomatic word pairs for use in programming.
Like get-set, open-close, allocate-free and so on.
Does anyone remember a URL?
Building on ergosys' answer:
From Code Complete 2, Chapter 11, p. 264:
Common Opposites in Variable Names
begin/end
first/last
locked/unlocked
min/max
next/previous
old/new
opened/closed
visible/invisible
source/target
source/destination
up/down
There are two short lists of such pairs in Code Complete, one for function names and one for variable names. You can search for "Use Opposites Precisely" using Amazon's Look Inside feature if you don't have the book.
I've never seen a list aimed at programming in general; however, PowerShell has such a list: cmdlet verbs. Pairings are highlighted for each verb where they exist.
And while much of PowerShell's drive for consistency on the command line comes from standardizing those verbs, some of the pairings may be appropriate in other contexts as well.
English is not my first language, but aren't those antonyms rather than idiomatic pairs?
On Linux you can use WordNet to search for antonyms:
sudo apt-get install wordnet
wn open -antsv
-ants requests antonyms, and the trailing v selects verbs; you can also use n, a, or r to search nouns, adjectives, or adverbs.
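The same lookup works through NLTK if you'd rather stay in Python (antonyms hang off lemmas, not synsets):

from nltk.corpus import wordnet

for synset in wordnet.synsets('open', pos=wordnet.VERB):
    for lemma in synset.lemmas():
        for antonym in lemma.antonyms():
            print(lemma.name(), '->', antonym.name())  # e.g. open -> close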