How to use trained data with pytesseract? - ocr

Using this tool http://trainyourtesseract.com/ I would like to be able to use new fonts with pytesseract. the tool give me a file called *.traineddata
Right now I'm using this simple script :
try:
import Image
except ImportError:
from PIL import Image
import pytesseract as tes
results = tes.image_to_string(Image.open('./test.jpg'),boxes=True)
file = open('parsing.text','a')
file.write(results)
print(results)
How to I use my traineddata file so I'm able to read new font with the python script ?
thanks !
edit#1 : so I understand that *.traineddata can be used with Tesseract as a command-line program. so my question still the same, how do I use traineddata with python ?
edit#2 : the answer to my question is here How to access the command line for Tesseract from Python?

Below is a sample of pytesseract.image_to_string() with options.
pytesseract.image_to_string(Image.open("./imagesStackoverflow/xyz-small-gray.png"),
lang="eng",boxes=False,
config="--psm 4 --oem 3
-c tessedit_char_whitelist=-01234567890XYZ:"))
To use your own trained language data, just replace "eng" in lang="eng" with you language name(.traineddata).

Related

How to import data from json file to mongodb atlas collection

I wanted to import data to my collection in mongodb atlas, and I was following the documentation: https://docs.mongodb.com/compass/beta/import-export/ but there is no "ADD DATA" and I don't know if Im using some other version or Im doing something else wrongly.
I need to import whole file which is json array.
The docs you referenced are for a future version of Compass. If you want to import from EJSON at the command line you can use mongoimport.
Here's the simplest syntax, but there are many variations possible.
mongoimport --db=users --collection=contacts --file=contacts.json

Tesseract returns nothing for Arabic words/letters

I have installed Pytesseract and it's working perfectly on French/English text and also in numbers. But when I try to read any Arabic text/letter it doesn't return anything.
Here is the code I have used:
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
print(pytesseract.image_to_string(Image.open('maroc.jpg'), lang='ara'))
Here is the letter I'm trying to read د:
If someone was able to read it using another method please help, thanks!
Code :
from pytesseract import image_to_string
from PIL import Image
import pytesseract
print(pytesseract.image_to_pdf_or_hocr('test.png', lang='ara', extension='hocr'))
Take new Arabic tessdata from here:
if you want to recognise arabic words download the arabic trained model from the link below then save it in the location according to your Tesseract folder
C:\Program Files\Tesseract-OCR\tessdata
or
C:\Program Files (x86)\Tesseract-OCR\tessdata
arabic_tesseract_trained
for raspberry pi 4 just download module from Eliyaz KL answer and put in this path
/usr/share/tesseract-ocr/4.00/tessdata/
i don't know which operating system use i answerd in my case

Tesseract 3.05 : Failed loading language `eng` when training Chinese

I am trying to use tesseract to train a Chinese model, here's my script:
./tesstrain.sh \
--lang chi_sim
--langdata_dir ../../langdata
--tessdata_dir ../ # root directory of tesseract
--output_dir ../../output
At the first, everything works fine, but when it comes to phase E: Extracting features, something went wrong:
Failed loading language 'eng'
Tesseract couldn't load any languages!
Couldn't initialize tesseract
I don't understand, I am trying to train a Chinese model, why it comes to look for eng language, and how do I resolve this problem? thanks!

How to create a uzn file for tesseract

I need to build an OCR application that scans passports and so I have chosen tesseract for start. From what I have read there should be a .uzn file that I define, but I can't find any documentation on it. How can I create such a template for tesseract to use.
you can rather use uzn file or let tesseract do the segmentation itself.
anyway checkout the folowing link if you need more informations about uzn file format :
https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format

How to convert sas7bdat file to csv?

I want to convert a .sas7bdat file to a .csv/txt format so that I can upload it into a hive table.
I'm receiving the .sas7bdat file from an outside server and do not have SAS on my machine.
Use one of the R foreign packages to read the file and then convert to CSV with that tool.
http://cran.r-project.org/doc/manuals/R-data.pdf
Pg 12
Using the SAS7BDAT package instead. It appears to ignore custom formatted, reading the underlying data.
In SAS:
proc format;
value agegrp
low - 12 = 'Pre Teen'
13 -15 = 'Teen'
16 - high = 'Driver';
run;
libname test 'Z:\Consulting\SAS Programs';
data test.class;
set sashelp.class;
age2=age;
format age2 agegrp.;
run;
In R:
install.packages(sas7bdat)
library(sas7bdat)
x<-read.sas7bdat("class.sas7bdat", debug=TRUE)
x
The python package sas7bdat, available here, includes a library for reading sas7bdat files:
from sas7bdat import SAS7BDAT
with SAS7BDAT('foo.sas7bdat') as f:
for row in f:
print row
and a command-line program requiring no programming
$ sas7bdat_to_csv in.sas7bdat out.csv
I recently wrote this package that allows you convert sas7bdat to csv using Hadoop/Spark. It's able to split giant sas7bdat file thus achieving high parallelism. The parsing also uses parso as suggested by #Ashpreet
https://github.com/saurfang/spark-sas7bdat
If this is a one-off, you can download the SAS system viewer for free from here (after registering for an account, which is also free):
http://support.sas.com/downloads/package.htm?pid=176
You can then open the sas dataset using the viewer and save it as a csv file. There is no CLI as far as I can tell, but if you really wanted to you could probably write an autohotkey script or similar to convert SAS datasets to csv.
It is also possible to use the SAS provider for OLE DB to read SAS datasets without actually having SAS installed, and that's available here:
http://support.sas.com/downloads/browse.htm?fil=0&cat=64
However, this is rather complicated - some documentation is available here if you want to get an idea:
http://support.sas.com/documentation/cdl/en/oledbpr/59558/PDF/default/oledbpr.pdf
Thanks for your help. I ended us using the parso utility in java and it worked like a charm. The utility returns the rows as object arrays which i wrote into a text file.
I referred to the utility from: http://lifescience.opensource.epam.com/parso.html