Tesseract returns nothing for Arabic words/letters - ocr

I have installed pytesseract and it works perfectly on French/English text and also on numbers, but when I try to read any Arabic text/letter it returns nothing.
Here is the code I have used:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
print(pytesseract.image_to_string(Image.open('maroc.jpg'), lang='ara'))
Here is the letter I'm trying to read د:
If someone has been able to read it using another method, please help. Thanks!

Code:
from pytesseract import image_to_string
from PIL import Image
import pytesseract
print(pytesseract.image_to_pdf_or_hocr('test.png', lang='ara', extension='hocr'))
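As a hedged usage note for the snippet above (file names are just placeholders): image_to_pdf_or_hocr returns bytes, so the hOCR output can be written to a file instead of printed:

hocr = pytesseract.image_to_pdf_or_hocr('test.png', lang='ara', extension='hocr')
with open('test.hocr', 'wb') as f:
    f.write(hocr)  # hOCR is returned as bytes, hence the 'wb' mode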
Take new Arabic tessdata from here:

If you want to recognise Arabic words, download the Arabic trained model from the link below, then save it in the location that matches your Tesseract folder:
C:\Program Files\Tesseract-OCR\tessdata
or
C:\Program Files (x86)\Tesseract-OCR\tessdata
arabic_tesseract_trained
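As a hedged sketch (the install paths are assumptions; adjust them to your setup), once ara.traineddata is in the tessdata folder you can point pytesseract at it explicitly with Tesseract's --tessdata-dir option:

from PIL import Image
import pytesseract

# Assumed install location; use the (x86) path if that is where Tesseract lives
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
tessdata_dir = r'--tessdata-dir "C:\Program Files\Tesseract-OCR\tessdata"'
print(pytesseract.image_to_string(Image.open('maroc.jpg'), lang='ara', config=tessdata_dir))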

For Raspberry Pi 4, just download the model from Eliyaz KL's answer and put it in this path:
/usr/share/tesseract-ocr/4.00/tessdata/
I don't know which operating system you use; I answered for my own case.

Related

encoding jsonl file as utf-8 failing in Python 2.7

I am currently working with Python 2.7.7 and I am trying to read in a jsonl file from standard input via the console:
python my_single_jsonl.jsonl | my_python_code.py
This is the code I am using to read std_in:
# coding=utf-8
from __future__ import unicode_literals
import os
import json
import sys

def extract_jsonl_data():
    json_as_list = []
    for line in sys.stdin:
        json_as_list.append(json.loads(line))
    return json_as_list
I am getting the following error:
SyntaxError: Non-ASCII character '\xe2' in file single_jsonl.jsonl on
line 2, but no encoding declared; see
http://python.org/dev/peps/pep-0263/ for details
The error appears to be caused by a non-ASCII '-' character in the source jsonl file. I have looked at similar questions around Python 2.x encoding, and they suggest the following:
Add a magic comment at the head of the .py file, similar to:
# coding=utf-8
Import unicode handling from Py3:
from __future__ import unicode_literals
Convert the imported jsonl string on the fly:
myjson_str.encoding('utf-8')
I've tried all three approaches and nothing seems to change this error. Are there any other approaches? Editing the source file isn't possible in this case.
I'm working on a macbook, using Atom as my source code editor.
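A minimal, hedged sketch (not from the original thread; Python 2.7 assumed): wrap stdin in a UTF-8 reader so each line is decoded before json.loads sees it. Note that the SyntaxError names the .jsonl file itself, which usually means the data file, rather than the script, was passed to the Python interpreter; the invocation below is the assumed correct form.

# coding=utf-8
# Assumed invocation: cat my_single_jsonl.jsonl | python my_python_code.py
from __future__ import unicode_literals
import codecs
import json
import sys

def extract_jsonl_data():
    # Decode the byte stream from stdin as UTF-8 before parsing each line
    reader = codecs.getreader('utf-8')(sys.stdin)
    return [json.loads(line) for line in reader if line.strip()]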

Odoo 10 .csv file import

I'm trying to import data into a module I created in Odoo. However, here is what comes out of it:
Do you know the reason?
It is a file containing 400 lines; I tried reducing the import to 50 lines and got the same error.
Thank you
This error normally happens when the source code of Odoo has been modified or the import fields are not correct. You can refer to this link.
Please try to import sample data with a few fields first.

name 'nltk' is not defined

The nltk module is running with other libraries in the corpus folder.
My Code
I've already tried putting 'import nltk' at the top, but the result is still the same, and I've also tried 'from nltk.tokenize import PunktSentenceTokenizer'. I don't know why the Python shell can't find the definition of nltk. How should I address this? I am still learning how to write and code in Python.
First, install the nltk package by typing...
pip install nltk
Then you need to import it...
import nltk
You misspelled the name of the package in your file: you used ntlk instead of nltk.
change
tagged = ntlk.pos_tag(words)
to
tagged = nltk.pos_tag(words)
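A minimal, hedged example putting the fix together (the sample sentence is an assumption; pos_tag also needs its tagger model downloaded once):

import nltk

nltk.download('punkt')                       # tokenizer data
nltk.download('averaged_perceptron_tagger')  # POS tagger model

words = nltk.word_tokenize("This is a small test sentence.")
tagged = nltk.pos_tag(words)
print(tagged)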

How to use trained data with pytesseract?

Using this tool, http://trainyourtesseract.com/, I would like to be able to use new fonts with pytesseract. The tool gives me a file called *.traineddata.
Right now I'm using this simple script:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract as tes

results = tes.image_to_string(Image.open('./test.jpg'), boxes=True)
file = open('parsing.text', 'a')
file.write(results)
print(results)
How do I use my traineddata file so that I'm able to read the new font with the Python script?
Thanks!
Edit #1: I understand that *.traineddata can be used with Tesseract as a command-line program, so my question stays the same: how do I use the traineddata with Python?
Edit #2: the answer to my question is here: How to access the command line for Tesseract from Python?
Below is a sample of pytesseract.image_to_string() with options.
pytesseract.image_to_string(Image.open("./imagesStackoverflow/xyz-small-gray.png"),
                            lang="eng", boxes=False,
                            config="--psm 4 --oem 3 -c tessedit_char_whitelist=-01234567890XYZ:")
To use your own trained language data, just replace "eng" in lang="eng" with your language name, i.e. the base name of your .traineddata file.
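As a hedged sketch (the font name and paths are assumptions): if the file from trainyourtesseract.com is called myfont.traineddata and you copy it into Tesseract's tessdata folder, pass its base name as lang, optionally pointing Tesseract at that folder with --tessdata-dir:

from PIL import Image
import pytesseract as tes

# Assumed locations; adjust to wherever myfont.traineddata was copied
config = r'--tessdata-dir "C:\Program Files (x86)\Tesseract-OCR\tessdata"'
results = tes.image_to_string(Image.open('./test.jpg'), lang='myfont', config=config)
print(results)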

How to load jar dependenices in IPython Notebook

This page inspired me to try out spark-csv for reading .csv files in PySpark.
I found a couple of posts, such as this one, describing how to use spark-csv.
But I am not able to initialize the IPython instance by including either the .jar file or the package extension in the start-up, the way it can be done through spark-shell.
That is, instead of
ipython notebook --profile=pyspark
I tried out
ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
but it is not supported.
Please advise.
You can simply pass it in the PYSPARK_SUBMIT_ARGS variable. For example:
export PACKAGES="com.databricks:spark-csv_2.11:1.3.0"
export PYSPARK_SUBMIT_ARGS="--packages ${PACKAGES} pyspark-shell"
This property can also be set dynamically in your code before the SparkContext / SparkSession and the corresponding JVM have been started:
import os  # required for os.environ below

packages = "com.databricks:spark-csv_2.11:1.3.0"
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages {0} pyspark-shell".format(packages)
)
I believe you can also add this as a variable to your spark-defaults.conf file. So something like:
spark.jars.packages com.databricks:spark-csv_2.10:1.3.0
This will load the spark-csv library into PySpark every time you launch the driver.
Obviously zero's answer is more flexible because you can add these lines to your PySpark app before you import the PySpark package:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
from pyspark import SparkContext, SparkConf
This way you are only importing the packages you actually need for your script.
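To round this out, here is a hedged end-to-end sketch (Spark 1.x / spark-csv era; the app name and the data.csv path are placeholders) that sets the submit args before any PySpark import and then reads a CSV file through the package:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.3.0 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="spark-csv-example")   # placeholder app name
sqlContext = SQLContext(sc)
df = (sqlContext.read
      .format("com.databricks.spark.csv")        # reader provided by the package
      .option("header", "true")
      .load("data.csv"))                         # placeholder input path
df.show()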