NLTK: Is there a term for this procedure? - nltk

I was reading some material about NLTK and came across a procedure that turns a word such as "you're" into the two tokens "you" and "are". I can't remember the source. Is there a term for this procedure?

pip install contractions

# import the library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''
# create an empty list
expanded_words = []
for word in text.split():
    # use contractions.fix to expand the shortened words
    expanded_words.append(contractions.fix(word))
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded text: ' + expanded_text)
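For reference, this procedure is usually called expanding contractions. As a side note, contractions.fix also accepts a whole string, so the word-by-word loop can typically be collapsed into a single call (a minimal sketch using the same library):
import contractions

text = "I'll be there within 5 min. Shouldn't you be there too?"
# fix() expands every contraction it recognises in the full string
print(contractions.fix(text))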

Related

Detect language/script from pdf with python

I am trying to create a Python script that detects the language(s)/script(s) inside a not-yet-OCRed pdf with the help of pytesseract, before doing the 'real' OCR by passing the detected language(s).
I have about 10000 pdfs, not always standard English and sometimes 1000 pages long. In order to do the real OCR I need to autodetect the language first.
So it is a sort of two-step OCR, if you will, both of which tesseract can perform:
Detecting the language/script on some pages around the center of the document
Performing the real OCR with the found language(s)/script(s) over all pages
Any tips to fix/improve this script? All I want is the language(s) detected on the given pages returned.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz
pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
    print(pytesseract.image_to_osd(im, lang='osd', config='psm=0 pandas_config=None', nice=0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
I am getting the following error atm:
TypeError: Unsupported image object
Assuming that you have converted your pdf file to text using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect:
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
Output:
fr
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
Output:
zh-cn
There are other libraries, such as polyglot, that are useful if you have mixed languages.
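If you want to stay inside Python for the whole pipeline, here is a hedged sketch of the two-step idea: render a few sample pages with PyMuPDF to PIL images (which pytesseract accepts, avoiding the TypeError above), OCR them, and let langdetect guess the language for the real pass. The method names assume a recent PyMuPDF version.
import sys
import fitz  # PyMuPDF; page_count/get_pixmap assume a recent version
import pytesseract
from PIL import Image
from langdetect import detect

doc = fitz.open(sys.argv[1])
center = doc.page_count // 2
surround = 2
sample_text = ''
for page_number in range(max(0, center - surround), min(doc.page_count, center + surround + 1)):
    # render the page and convert it to a PIL image for pytesseract
    pix = doc[page_number].get_pixmap(dpi=150)
    img = Image.frombytes('RGB', (pix.width, pix.height), pix.samples)
    sample_text += pytesseract.image_to_string(img)
print(detect(sample_text))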

Use NLTK or similar tool to tell sentence boundary

I know how to split sentences with the NLTK PunktSentenceTokenizer.
However, I have another request: I have a text converted from pdf where page breaks split sentences. Is there any way of using NLTK to tell whether the end of a string is a sentence boundary or not? If it is not a sentence boundary, I can concatenate the string with the next string.
For example, here are my strings:
"I have a text converted"
"Is there any way to save human kind?"
The first one is not a sentence end and the second is.
If you are working with English, nltk already provides a pretrained model for you: english.pickle.
import nltk.data
text = '''
(How does it deal with this parenthesis?) "It should be part of the
previous sentence." "(And the same with this one.)" ('And this one!')
"('(And (this)) '?)" [(and this. )]
'''
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Output:
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this. )]
Read more in nltk.tokenize
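For the page-break problem itself, one possible heuristic (a sketch; ends_sentence is a helper name made up here, not an NLTK API) is to join a fragment with the string that follows it and check whether Punkt places a boundary exactly at the junction:
import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

def ends_sentence(fragment, following):
    # If Punkt splits exactly between the two strings, the fragment
    # very likely ends a sentence; otherwise concatenate them.
    sentences = sent_detector.tokenize(fragment + ' ' + following)
    return sentences[0] == fragment

print(ends_sentence('I have a text converted', 'Is there any way?'))        # False
print(ends_sentence('Is there any way to save human kind?', 'I hope so.'))  # True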

Is it possible to give text format hints in google vision api?

I'm trying to detect handwritten dates isolated in images.
In the cloud vision api, is there a way to give hints about type?
example: the only text present will be dd/mm/yy, with d, m and y being digits
The only thing I found is language hints in the documentation.
Sometimes I get results that include letters like O instead of 0.
There is no way to give hints about type, but you can filter the output using the client libraries. I downloaded detect.py and requirements.txt from here and modified detect.py (in def detect_text, after line 283):
response = client.text_detection(image=image)
texts = response.text_annotations

# Import regular expressions
import re

print('Date:')
dateStr = texts[0].description
# Test case for letter replacement
# dateStr = "Z3 OZ/l7"
# print(dateStr)
dateStr = dateStr.replace("O", "0")
dateStr = dateStr.replace("Z", "2")
dateStr = dateStr.replace("l", "1")
dateList = re.split(' |;|,|/|\n', dateStr)
dd = dateList[0]
mm = dateList[1]
yy = dateList[2]
date = dd + '/' + mm + '/' + yy
print(date)
# for text in texts:
#     print('\n"{}"'.format(text.description))
#     vertices = (['({},{})'.format(vertex.x, vertex.y)
#                  for vertex in text.bounding_poly.vertices])
#     print('bounds: {}'.format(','.join(vertices)))
# [END migration_text_detection]
# [END def_detect_text]
Then I launched the modified script (saved as detect_dates.py) inside the virtual environment using this command line:
python detect_dates.py text qAkiq.png
And I got this:
23/02/17
There are a few letters that can be mistaken for numbers, so using str.replace("letter", "number") should fix the wrong identifications. I added the most common cases for this example.
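If you want to be stricter than blind replacements, a small post-filter can validate the dd/mm/yy shape after cleaning (a sketch; clean_date is a hypothetical helper, not part of the client library):
import re

def clean_date(raw):
    # Replace commonly confused letters, then require a dd/mm/yy shape
    cleaned = raw.replace("O", "0").replace("Z", "2").replace("l", "1")
    match = re.search(r'\b(\d{1,2})[/., ](\d{1,2})[/., ](\d{2})\b', cleaned)
    if match is None:
        return None
    return '/'.join(match.groups())

print(clean_date("Z3 OZ/l7"))  # -> 23/02/17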

Python - Why isn't this specific text being found by findall regex?

I am trying to print all of the URL links of watches on a website. I have all of them printing fine except one, even though that one has the exact same regex conditions as the others. Can someone explain why this isn't printing, please? Have I messed up some syntax somewhere? The following code can be pasted into a Python editor (e.g. IDLE) and run.
## Import required modules
from urllib import urlopen
from re import findall
import re
## Provide URL
dennisov_url = 'https://denissov.ru/en/'
## Open and read URL as string named 'dennisov_html'
dennisov_html = urlopen(dennisov_url).read()
## Find all of the links when each watch is clicked (those with the designated
## preceding text 'window.open', then any character that occurs zero or more
## times, then the text '/en/'). Remove matches with the word "history" and
## any " symbols in the URL.
watch_link_urls = findall('window.open.*(/en/[^history][^"]*/)', dennisov_html)
## For every URL, add the domain and print it on a new line
for link in watch_link_urls:
    link = 'https://denissov.ru' + link
    ## Print out the full URL
    print link
## This code should show the link https://denissov.ru/en/speedster/ yet
## it isn't showing. It has the exact same preceding text as the other links
## that are printing and is in the same div container. If you inspect the
## website and search 'en/barracuda_mechanical/' and then 'en/speedster/'
## you will see that the speedster link is only a few lines below barracuda
## mechanical and there is nothing different about the two's preceding
## text, so speedster should be printing.
You can try this code with this pattern:
from urllib2 import urlopen
import re

url = 'https://denissov.ru/en/'
data = urlopen(url).read()
sub_urls = re.findall(r'window.open\(\'(/.*?)\'', data)
# take everything without deleting duplicates
# final_urls = [k for k in sub_urls if '/history' not in k and k != '']
# Or: remove duplicates
final_urls = set(k for k in sub_urls if '/history' not in k)
for k in final_urls:
    link = 'https://denissov.ru' + k
    print link
Will output something like this:
https://denissov.ru/eng/denissovdesign/index.html
https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/
The reason speedster is skipped is that [^history] is a character class, not the word history: it matches exactly one character that is not any of h, i, s, t, o, r or y, so every link whose path starts with one of those letters (such as speedster) is rejected. If you want a regex that gets all URLs that don't contain the word history and start with en/, then you should use a tempered greedy solution, like this:
en\/(?:(?!history).)*?\/
(?:(?!history).)*? is a tempered dot which will match any character which doesn't have history as a lookahead.
(?!history) is a negative lookahead to ensure that.
The ?: has been added to indicate that the group is a non-capturing one.
The *? indicates a non-greedy match, so that it will match only up to the first /.
Regex101 Demo
Change the python code like this:
watch_link_urls = findall('window.open.*(/en\/(?:(?!history).)*?\/)', dennisov_html)
Output:
https://denissov.ru/en/barracuda_limited/
https://denissov.ru/en/barracuda_chronograph/
https://denissov.ru/en/barracuda_mechanical/
https://denissov.ru/en/speedster/
https://denissov.ru/en/free_rider/
https://denissov.ru/en/nau_automatic/
https://denissov.ru/en/lady_flower/
https://denissov.ru/en/enigma/
https://denissov.ru/en/number_one/
Read more about tempered greedy here.
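Note that urllib/urllib2 and the print statement are Python 2 only; a rough Python 3 equivalent of the fixed script (a sketch, untested against the live site) would be:
import re
from urllib.request import urlopen

data = urlopen('https://denissov.ru/en/').read().decode('utf-8')
# deduplicate with a set, then print each full URL
links = set(re.findall(r'window.open.*?(/en/(?:(?!history).)*?/)', data))
for sub_url in sorted(links):
    print('https://denissov.ru' + sub_url)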

Using NLTK RegexpParser to find subject, object, verb combinations

I'm trying to extract subject-object-verb combinations using the NLTK toolkit. This is my code so far. How would I be able to do it?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

grammar = r"""
NP:
{<.*>+}              # Chunk everything
}<VBD|VBZ|VBP|IN>+{  # Chink sequences of VBD, VBZ, VBP and IN
"""
cp = nltk.RegexpParser(grammar)
s = "This song is the best song in the world. I really love it."
for t in sent_tokenize(s):
    text = nltk.pos_tag(word_tokenize(t))
    print cp.parse(text)
One approach you can try is to chunk the sentences into NPs (noun phrases) and VPs (verb phrases) and then build an RBS (rule-based system) on top of this to establish the chunk roles. For example, if the VP is in active voice, the subject should be the chunk in front of the VP; if it is in passive voice, it should be the NP following it.
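A minimal sketch of that chunking step with nltk.RegexpParser (the grammar below is a toy illustration to get NP/VP chunks, not a complete subject-object-verb extractor):
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Toy grammar: determiners/adjectives/nouns become NPs and verb
# sequences become VPs; a rule-based layer on top of the resulting
# tree can then assign subject and object roles.
grammar = r"""
NP: {<DT|PRP\$>?<JJ>*<NN.*|PRP>+}
VP: {<MD>?<VB.*>+}
"""
cp = nltk.RegexpParser(grammar)
s = "This song is the best song in the world. I really love it."
for t in sent_tokenize(s):
    print(cp.parse(nltk.pos_tag(word_tokenize(t))))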
You can also have a look at Pattern.en. The parser has relation extraction included: http://www.clips.ua.ac.be/pages/pattern-en#parser