Using NLTK RegexpParser to find subject, object, verb combinations - nltk

I'm trying to extract subject-object-verb combinations using the NLTK toolkit. This is my code so far. How would I be able to do it?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

grammar = r"""
NP:
    {<.*>+}              # Chunk everything
    }<VBD|VBZ|VBP|IN>+{  # Chink sequences of verb tags and IN
"""
cp = nltk.RegexpParser(grammar)

s = "This song is the best song in the world. I really love it."
for t in sent_tokenize(s):
    text = nltk.pos_tag(word_tokenize(t))
    print(cp.parse(text))

One approach you can try is to chunk the sentences into NPs (noun phrases) and VPs (verb phrases) and then build a rule-based system (RBS) on top of this to establish the chunk roles. For example, if the VP is in active voice, the subject should be the chunk in front of the VP; if it is in passive voice, it should be the NP that follows.
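For example, here is a minimal sketch of that idea (the chunk grammar, the extract_svo helper and the active-voice-only heuristic below are my own illustrative assumptions, not a ready-made NLTK feature):
import nltk
from nltk.tokenize import word_tokenize

svo_grammar = r"""
NP: {<DT|PRP\$?|JJ|NN.*>+}     # noun phrase: determiners, adjectives, nouns, pronouns
VP: {<MD>?<RB.*>*<VB.*>+}      # verb phrase: optional modal/adverbs plus verbs
"""
chunker = nltk.RegexpParser(svo_grammar)

def extract_svo(sentence):
    # Very rough active-voice heuristic: the NP before the first VP is the subject,
    # the NP right after it is the object.
    tree = chunker.parse(nltk.pos_tag(word_tokenize(sentence)))
    chunks = [st for st in tree.subtrees() if st.label() in ("NP", "VP")]
    for i, chunk in enumerate(chunks):
        if chunk.label() == "VP":
            subj = chunks[i - 1] if i > 0 and chunks[i - 1].label() == "NP" else None
            obj = chunks[i + 1] if i + 1 < len(chunks) and chunks[i + 1].label() == "NP" else None
            return subj, chunk, obj
    return None

print(extract_svo("I really love this song."))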
You can also have a look at Pattern.en. Its parser has relation extraction included: http://www.clips.ua.ac.be/pages/pattern-en#parser

Related

NLTK: Is there a term for this procedure?

I was reading some stuff about NLTK and came across a procedure that turns a word such as "you're" into the two tokens "you" and "are". I can't remember the source. Is there a term for this or something?
What you are describing is usually called expanding contractions (contraction expansion); the contractions library does exactly that:
pip install contractions
# import library
import contractions

# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''

# creating an empty list
expanded_words = []
for word in text.split():
    # using contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))

expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)
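For comparison, note that NLTK's default word tokenizer does not expand contractions; it splits them into clitic tokens instead, which may be the behaviour you saw:
import nltk

# The Treebank-style tokenizer splits "You're" into "You" + "'re" and
# "aren't" into "are" + "n't" rather than expanding them to full words.
print(nltk.word_tokenize("You're late, aren't you?"))
# ['You', "'re", 'late', ',', 'are', "n't", 'you', '?']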

Detect language/script from pdf with python

I am trying to create a Python script that detects the language(s)/script(s) inside a not-yet-OCRed PDF with the help of pytesseract, before doing the 'real' OCR by passing in the correctly detected language(s).
I have around 10000 PDFs, not always in standard English and sometimes 1000 pages long. In order to do the real OCR I need to autodetect the language first.
So it is a sort of two-step OCR, if you will, both of which tesseract can perform:
Detecting the language/script on some pages around the center of the document
Performing the real OCR with the found language/script over all pages
Any tips to fix/improve this script? All I want is the detected language(s) on the given pages returned.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
    print(pytesseract.image_to_osd(im, lang='osd', config='psm=0 pandas_config=None', nice=0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
I am getting the following error atm:
TypeError: Unsupported image object
Assuming that you have converted your PDF file to text using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect():
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
output: fr
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
output: zh-cn
There are other libraries, such as polyglot, useful if you have mixed languages.
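To tie this back to the PDF pipeline, here is a rough sketch of the two-step idea (assuming a recent PyMuPDF plus Pillow, pytesseract and langdetect; the sampling window and the 2x render scaling are arbitrary choices, not requirements):
import fitz  # PyMuPDF
import pytesseract
from PIL import Image
from langdetect import detect_langs

doc = fitz.open("myunknown.pdf")
center = doc.page_count // 2

sample_text = []
for page_number in range(max(0, center - 2), min(doc.page_count, center + 3)):
    pix = doc[page_number].get_pixmap(matrix=fitz.Matrix(2, 2))  # render the page to a bitmap
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    # First pass: OCR with the default model just to obtain text for language detection
    sample_text.append(pytesseract.image_to_string(img))

# detect_langs returns candidate languages with probabilities, handy for mixed documents
print(detect_langs(" ".join(sample_text)))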

Is it possible to give text format hints in google vision api?

I'm trying to detect handwritten dates isolated in images.
In the cloud vision api, is there a way to give hints about type?
example: the only text present will be dd/mm/yy, with d, m and y being digits
The only thing I found is language hints in the documentation.
Sometimes I get results that include letters like O instead of 0.
There is not a way to give hints about the type, but you can filter the output using the client libraries. I downloaded detect.py and requirements.txt from here and modified detect.py (in def detect_text, after line 283 of that file):
response = client.text_detection(image=image)
texts = response.text_annotations

# Import regular expressions
import re

print('Date:')
dateStr = texts[0].description
# Test case for letters replacement
# dateStr = "Z3 OZ/l7"
# print(dateStr)
dateStr = dateStr.replace("O", "0")
dateStr = dateStr.replace("Z", "2")
dateStr = dateStr.replace("l", "1")
dateList = re.split(' |;|,|/|\n', dateStr)
dd = dateList[0]
mm = dateList[1]
yy = dateList[2]
date = dd + '/' + mm + '/' + yy
print(date)

# for text in texts:
#     print('\n"{}"'.format(text.description))
#     print('Hello you!')
#     vertices = (['({},{})'.format(vertex.x, vertex.y)
#                  for vertex in text.bounding_poly.vertices])
#     print('bounds: {}'.format(','.join(vertices)))
# [END migration_text_detection]
# [END def_detect_text]
Then I launched detect.py inside the virtual environment using this command line:
python detect_dates.py text qAkiq.png
And I got this:
23/02/17
There are a few letters that can be mistaken for numbers, so using str.replace("letter", "number") should fix the wrong identifications. I added the most common cases for this example.
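Building on that, here is a small hypothetical post-processing helper (the confusion table and the normalize_date name are my own choices) that maps commonly confused characters to digits and validates the expected dd/mm/yy shape:
import re

# Characters the OCR commonly confuses with digits (an illustrative, not exhaustive, table)
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "Z": "2", "l": "1", "I": "1", "S": "5"})

def normalize_date(raw):
    cleaned = raw.strip().translate(CONFUSABLE)
    # Accept one- or two-digit day/month, two- or four-digit year, and a few separators
    match = re.fullmatch(r"(\d{1,2})\s*[/\-., ]\s*(\d{1,2})\s*[/\-., ]\s*(\d{2,4})", cleaned)
    if not match:
        return None
    dd, mm, yy = match.groups()
    return "{:02d}/{:02d}/{}".format(int(dd), int(mm), yy[-2:])

print(normalize_date("Z3 OZ/l7"))  # -> 23/02/17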

How can I make nltk.NaiveBayesClassifier.train() work with my dictionary

I'm currently making a simple spam/ham email filter using Naive Bayes.
For you to understand my algorithm's logic: I have a folder with lots of files, which are examples of spam/ham emails. I also have two other files in this folder, one containing the titles of all my ham examples and the other the titles of all my spam examples. I organized it like this so I can open and read these emails properly.
I'm putting all the words I judge to be important in a dictionary structure, with a label "spam" or "ham" depending on which kind of file I extracted them from.
Then I'm using nltk.NaiveBayesClassifier.train() so I can train my classifier, but I'm getting the error:
for featureset, label in labeled_featuresets:
ValueError: too many values to unpack
I don't know why this is happening. When I looked for a solution, I found that lists are not hashable, and since I was using a list I turned it into a dictionary, which is hashable as far as I know, but I keep getting this error.
Does anyone know how to solve it? Thanks!
All my code is listed below:
import nltk
import re
import random

stopwords = nltk.corpus.stopwords.words('english')  # Words I should avoid since they have weak value for classification

my_file = open("spam_files.txt", "r")  # my_file now has the name of each file that contains a spam email example
word = {}  # a dictionary where I will store all the words and which value they have (spam or ham)

for lines in my_file:  # for each file name (represented by LINES) in my_file
    with open(lines.rsplit('\n')[0]) as email:  # I open the file pointed to by LINES and read the email example inside it
        for phrase in email:  # After that, I take every phrase of this email example I just opened
            try:  # and I'll try to tokenize it
                tokens = nltk.word_tokenize(phrase)
            except:
                continue  # I will ignore non-ascii elements
            for c in tokens:  # for each token
                regex = re.compile('[^a-zA-Z]')  # I will also exclude numbers
                c = regex.sub('', c)
                if (c):  # If there is any element left
                    if (c not in stopwords):  # And if this element is not a stopword
                        c.lower()
                        word.update({c: 'spam'})  # I put this element in my dictionary. Since I'm analysing spam examples, variable C is labeled "spam".
my_file.close()
email.close()

# The same logic is used for the ham emails. Since my ham emails contain only ascii elements, I don't test it with TRY
my_file = open("ham_files.txt", "r")
for lines in my_file:
    with open(lines.rsplit('\n')[0]) as email:
        for phrase in email:
            tokens = nltk.word_tokenize(phrase)
            for c in tokens:
                regex = re.compile('[^a-zA-Z]')
                c = regex.sub('', c)
                if (c):
                    if (c not in stopwords):
                        c.lower()
                        word.update({c: 'ham'})
my_file.close()
email.close()

# And here I train my classifier
classifier = nltk.NaiveBayesClassifier.train(word)
classifier.show_most_informative_features(5)
nltk.NaiveBayesClassifier.train() expects "a list of tuples (featureset, label)" (see the documentation of the train() method).
What is not mentioned there is that each featureset should be a dict mapping feature names to feature values.
So, in a typical spam/ham classification with a bag-of-words model, the labels are 'spam'/'ham' (or 1/0, or True/False);
the feature names are the occurring words and the values are the number of times each word occurs.
For example, the argument to the train() method might look like this:
[({'greetings': 1, 'loan': 2, 'offer': 1}, 'spam'),
({'money': 3}, 'spam'),
...
({'dear': 1, 'meeting': 2}, 'ham'),
...
]
If your dataset is rather small, you might want to replace the actual word counts with 1, to reduce data sparsity.
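Here is a minimal sketch of what that looks like in code (the email_features helper and the tiny example texts are illustrative placeholders for your real email contents):
import nltk
from collections import Counter

def email_features(text):
    # Bag-of-words featureset: word -> count (replace the counts with 1 for small datasets)
    tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]
    return dict(Counter(tokens))

# spam_texts / ham_texts stand in for the raw contents of your example emails
spam_texts = ["Greetings! Claim your loan offer now", "Money money money"]
ham_texts = ["Dear team, the meeting is moved to Monday"]

labeled_featuresets = ([(email_features(t), 'spam') for t in spam_texts] +
                       [(email_features(t), 'ham') for t in ham_texts])

classifier = nltk.NaiveBayesClassifier.train(labeled_featuresets)
classifier.show_most_informative_features(5)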

How to create a context free grammar based on both "lexicon" and "rules" in NLTK

I have two text files for a CFG grammar: one contains the "rules" (e.g. S -> NP VP) and the other contains just the "lexical symbols" (e.g. "these": Det). Does anyone know how I can give these two files to NLTK as my grammar? The second file is also known as a "lexicon", because it just contains the categories of real words. In summary, I just need to provide a lexicon for a specific grammar; otherwise, I would have to write the lexicon as several new rules in my rules file. Due to the large volume of the lexicon, it is not practical to convert the second file to rules by hand and merge it with the first file. So I am completely stuck here... Any help/idea would be appreciated.
Take a look at the tutorial, it's a little outdated but the idea is there: http://www.nltk.org/book/ch08.html
Then take a look at this question and answer: CFG using POS tags in NLTK
Lastly, here's an example:
from nltk import CFG, ChartParser

grammar_string = """
S -> NP VP
NP -> DT NN | NNP
VP -> VB NP | VBS
VBS -> 'sleeps'
VB -> 'loves' | 'sleeps_with'
NNP -> 'John' | 'Mary'
"""
grammar = CFG.fromstring(grammar_string)
sentence = 'John loves Mary'.split()
parser = ChartParser(grammar)
for tree in parser.parse(sentence):
    print(tree)
[out]:
(S (NP (NNP John)) (VP (VB loves) (NP (NNP Mary))))
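To address the two-file setup directly, here is a hedged sketch that reads a rules file and a lexicon file with lines like "these": Det, turns each lexicon entry into a terminal production, and builds one grammar from both (the file names, the assumed lexicon line format and the load_grammar name are my assumptions about your data):
import nltk

def load_grammar(rules_path="rules.txt", lexicon_path="lexicon.txt"):
    with open(rules_path) as f:
        rules = f.read()                      # e.g. lines like  S -> NP VP
    lexical_rules = []
    with open(lexicon_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            word, category = line.split(":")  # e.g.  "these": Det
            word = word.strip().strip('"')
            lexical_rules.append("{} -> '{}'".format(category.strip(), word))
    return nltk.CFG.fromstring(rules + "\n" + "\n".join(lexical_rules))

grammar = load_grammar()
parser = nltk.ChartParser(grammar)            # then parse sentences as in the example above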