NLTK Word Tokenize doesn't return anything - nltk

I am trying to tokenize a sentence, and I believe that the code is correct but there is no output. What could be the problem? Here is the code.
import nltk
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
nltk.pos_tag(text)
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
nltk.pos_tag(text)

It seems the following NLTK data packages are missing:
punkt
averaged_perceptron_tagger
Note: you need to download them the first time you use them.
Try this:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
text = word_tokenize("And now for something completely different")
print(nltk.pos_tag(text))
text = word_tokenize("They refuse to permit us to obtain the refuse permit")
print(nltk.pos_tag(text))
print("----End of execution----")
Run this in your IDE or as a script; note that the print() calls are needed there, since outside an interactive shell the results of nltk.pos_tag() are not displayed on their own.
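If you want to avoid re-downloading on every run, one option (a minimal sketch; the resource paths 'tokenizers/punkt' and 'taggers/averaged_perceptron_tagger' are the standard locations inside nltk_data) is to check for each resource first and download it only when it is missing:

import nltk

# Download each resource only if it is not already present in nltk_data.
for resource_path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger'),
]:
    try:
        nltk.data.find(resource_path)  # raises LookupError if the resource is missing
    except LookupError:
        nltk.download(package)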

Related

When I use tesseract to recognize a table, I can't get any output; how can I recognize the table?

Recently I have wanted to build a tool for table recognition. I have tried Tesseract OCR, but I can't get any output. Can anyone give me an answer?
I highly recommend PaddleOCR for table recognition! It can output a text file and an Excel file using just a few lines of code.
import os
import cv2
from paddleocr import PPStructure, save_structure_res

# Table recognition only (layout analysis disabled), running on CPU
table_engine = PPStructure(layout=False, show_log=True, use_gpu=False)

save_folder = './output'
img_path = 'PaddleOCR_pub/ppstructure/docs/table/table.jpg'
img = cv2.imread(img_path)

result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')  # drop the raw image array so the remaining fields are printable
    print(line)
The output files written to the save folder should help you further.
You can try it out here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/quickstart_en.md#214-table-recognition

Why is pytesseract not identifying this image?

I am trying to identify single digits in python with tesseract.
My code is this:
import numpy as np
from PIL import Image
from PIL import ImageOps
import pytesseract
import cv2

def predict(imageArray):
    # Path to the Tesseract executable on Windows
    pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    newImageArray = Image.open(imageArray)
    # --psm 10: treat the image as a single character; the whitelist restricts output to digits
    number = pytesseract.image_to_string(newImageArray, lang='eng',
                                         config='--psm 10 --oem 1 -c tessedit_char_whitelist=0123456789')
    return number
It has no problem recognising one image as an 8, but it does not recognise another image as a 4.
My images are just digits 0-9.
This is just one example; there are other instances where it struggles to identify "obvious/clear" digits.
Currently the only thing I am doing to my starting image is converting the colour, using the following:
cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
Is there a way I can improve the accuracy? All of my images are clear, computer-typed images, so I feel the accuracy should be a lot higher than it is.
You did not provide any information about your Tesseract version or the language model you used.
The "best" (LSTM) model identifies the '4' in your image without any preprocessing.
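As a rough sketch of what that suggests (the tessdata folder and image file name below are placeholders, not taken from the question): download eng.traineddata from the tessdata_best repository into a folder and point Tesseract at it with --tessdata-dir, keeping the LSTM engine (--oem 1):

import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Hypothetical folder containing eng.traineddata from the tessdata_best repository
tessdata_dir = r"C:\tessdata_best"

img = Image.open("digit.png")  # placeholder file name
number = pytesseract.image_to_string(
    img,
    lang='eng',
    config=f'--tessdata-dir "{tessdata_dir}" --oem 1 --psm 10 '
           '-c tessedit_char_whitelist=0123456789',
)
print(number)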

Python 2 to 3: telling 2to3 "I got this"

With linters or coverage.py, you can tell the tool to ignore certain parts of your code.
For example, # pragma: no cover tells coverage not to count an exception branch as missing:
except (Exception,) as e:  # pragma: no cover
    if cpdb(): pdb.set_trace()
    raise
Now, I know I can exclude specific fixers from 2to3. For example, to avoid fixing imports below, I can use 2to3 test_import_stringio.py -x imports.
But can I use code annotations/directives to keep the fixer active, except at certain locations? For example, this bit of code is already adjusted to work in both 2 and 3.
# this import should work fine in 2 AND 3.
try:
    from io import StringIO
except ImportError:
    # pragma-for-2to3: skip
    from StringIO import StringIO
But 2to3 "helpfully" converts it anyway, because no such directive/pragma exists, and now this won't work in 2:
# this import should work fine in 2 and 3.
try:
    from io import StringIO
except ImportError:
    # pragma-for-2to3: skip
    from io import StringIO
The reason I am asking is that I want to avoid a big-bang approach. I intend to refactor the code bit by bit, starting with the unittests, to run under 2 and 3.
I am guessing this is not possible, just looking at my options. What I'll probably end up doing is to run the converter only on imports with -f imports for example, check what it ended up doing, do that manually myself on the code and then exclude imports from future consideration with -x imports.

Does nltk contain Arabic stop words? If not, how can I add them?

I tried this but it doesn't work
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
print(stopwords_list)
Update [January 2018]: The nltk data repository has included Arabic stopwords since October, 2017, so this issue no longer arises. The above code will work as expected.
As of October, 2017, the nltk includes a collection of Arabic stopwords. If you ran nltk.download() after that date, this issue will not arise. If you have been a user of nltk for some time and you now lack the Arabic stopwords, use nltk.download() to update your stopwords corpus.
If you call nltk.download() without arguments, you'll find that the stopwords corpus is shown as "out of date" (in red). Download the current version that includes Arabic.
Alternately, you can simply update the stopwords corpus by running the following code once, from the interactive prompt:
>>> import nltk
>>> nltk.download("stopwords")
Note:
Looking words up in a list is really slow. Use a set, not a list. E.g.,
arb_stopwords = set(nltk.corpus.stopwords.words("arabic"))
Original answer (still applicable to languages that are not included)
Why don't you just check what the stopwords collection contains:
>>> from nltk.corpus import stopwords
>>> stopwords.fileids()
['danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian',
'italian', 'norwegian', 'portuguese', 'russian', 'spanish', 'swedish',
'turkish']
So no, there's no list for Arabic. I'm not sure what you mean by "add it", but the stopwords lists are just lists of words. They don't even do morphological analysis, or other things you might want in an inflecting language. So if you have (or can put together) a list of Arabic stopwords, just put them in a set() and you're one step ahead of where you'd be if your code worked.
There's an Arabic stopword list here:
https://github.com/mohataher/arabic-stop-words/blob/master/list.txt
If you save this file under the filename arabic in the stopwords folder of your nltk_data directory (corpora/stopwords/), you will then be able to load it with nltk using your code above, which was:
from nltk.corpus import stopwords
stopwords_list = stopwords.words('arabic')
(Note that the possible locations of your nltk_data directory can be seen by typing nltk.data.path in your Python interpreter).
You can also use alexis' suggestion to check if it is found.
Do heed his advice to convert the stopwords list to a set: stopwords_set = set(stopwords.words('arabic')), as it can make a real difference to performance.
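A minimal sketch of that workflow, assuming the stopwords corpus has already been downloaded into the first entry of nltk.data.path, and deriving the raw-file URL from the GitHub link above:

import os
import urllib.request
import nltk
from nltk.corpus import stopwords

# Assumed location of the stopwords corpus inside the first nltk_data path
stopwords_dir = os.path.join(nltk.data.path[0], 'corpora', 'stopwords')
url = 'https://raw.githubusercontent.com/mohataher/arabic-stop-words/master/list.txt'

# Save the word list under the filename 'arabic' so stopwords.words('arabic') can find it
urllib.request.urlretrieve(url, os.path.join(stopwords_dir, 'arabic'))

arb_stopwords = set(stopwords.words('arabic'))  # use a set for fast lookup
print(len(arb_stopwords))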
You can also use the library called Arabic-Stopwords. Here is the pip command for it:
pip install Arabic-Stopwords
Once it is installed, it can be imported with:
import arabicstopwords.arabicstopwords as stp
It is much better than the one in nltk.
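A short usage sketch; the function names stopwords_list() and is_stop() are how I recall the package's API and should be checked against its documentation:

import arabicstopwords.arabicstopwords as stp

# Assumed API: stopwords_list() returns the full word list, is_stop() tests a single word
print(len(stp.stopwords_list()))
print(stp.is_stop(u'ثم'))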

How can I load a single text file as a corpus using the PlaintextCorpusReader module?

I can only do something like this:
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
If I have just a single file as my corpus, is there an efficient way to select that file directly, rather than this method, which is meant for a corpus of many text files?
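A minimal sketch, assuming the single file is /usr/share/dict/words (any directory and filename will do): PlaintextCorpusReader also accepts an explicit list of file names instead of a regular expression, so you can point it at exactly one file.

from nltk.corpus import PlaintextCorpusReader

corpus_root = '/usr/share/dict'   # directory containing the file (assumed path)
single_file = 'words'             # the one file that makes up the corpus

corpus = PlaintextCorpusReader(corpus_root, [single_file])

print(corpus.fileids())                # ['words']
print(corpus.words(single_file)[:10])  # first ten tokens of that file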