Tesseract 3.05: Failed loading language `eng` when training Chinese

I am trying to use Tesseract to train a Chinese model. Here is my script (../ is the root directory of Tesseract):
./tesstrain.sh \
  --lang chi_sim \
  --langdata_dir ../../langdata \
  --tessdata_dir ../ \
  --output_dir ../../output
At first everything works fine, but in phase E (Extracting features) something goes wrong:
Failed loading language 'eng'
Tesseract couldn't load any languages!
Couldn't initialize tesseract
I don't understand: I am training a Chinese model, so why does it look for the eng language, and how do I resolve this problem? Thanks!

Related

Tesseract 3 training new font "Error: traineddata file must contain at least (a unicharset fileand inttemp) OR an lstm file."

I followed the instructions on this website https://pretius.com/blog/ocr-tesseract-training-data/ to train an OCR for a font that it doesn't work too well on; however, on the last step of creating the tesseract OCR traineddata, this error occurs:
Error message and my tesseract version:
I've searched online and have not found where the inttemp file is created or how it is created. Thanks in advance!

Tesseract 4.0 fails to detect the less-than "<" symbol

I'm using Tesseract 4.0.0-rc2.
When I try to extract data from an image of a passport MRZ, it gives me output like this:
PNKHMHORKKEN<KK<KLLLLLLLLLLLLLLLLLLLLLLLRLRK
NO06370803KHM9410132M2609201N0000714729<<<58
This is not exactly what I want. Please help me find the correct solution.
Thanks in advance.
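One way to at least spot misreads like these after OCR is to validate the MRZ check digits. The sketch below assumes the ICAO 9303 scheme (repeating weights 7, 3, 1; digits count as their own value, A to Z as 10 to 35, and "<" as 0) and the usual field positions of the second line of a passport MRZ. The function names are illustrative, and this only flags suspicious fields; it does not make Tesseract recognize "<" any better.

```python
def mrz_char_value(ch):
    """Map one MRZ character to its numeric value for the check-digit sum."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"not an MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Compute the check digit of a field using the repeating weights 7, 3, 1."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


def field_is_valid(field, check_char):
    """True when check_char is a digit that matches the computed check digit."""
    return "0" <= check_char <= "9" and mrz_check_digit(field) == int(check_char)


if __name__ == "__main__":
    line2 = "NO06370803KHM9410132M2609201N0000714729<<<58"
    print("document number ok:", field_is_valid(line2[0:9], line2[9]))
    print("date of birth ok:  ", field_is_valid(line2[13:19], line2[19]))
    print("expiry date ok:    ", field_is_valid(line2[21:27], line2[27]))
```

On the second line quoted above, the two date fields pass while the document-number field fails, which suggests a misread character in that field.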

Training Tesseract-OCR with JTessBoxEditor

I have a problem: I want to download jTessBoxEditor to train a Tesseract language file, but when I try to download it, the OCR program VietOCR.NET is downloaded instead. How can I download jTessBoxEditor and install it on Windows?
Can anyone help me? It is very important.
Go to the training page:
http://vietocr.sourceforge.net/training.html
Click the link on the first word on the page, jTessBoxEditor.
This should bring you to the download site; the files are in ZIP format.
The ZIP file contains jTessBoxEditor, which is run from there.

How to create a uzn file for tesseract

I need to build an OCR application that scans passports, so I have chosen Tesseract as a starting point. From what I have read, there should be a .uzn file that I define, but I can't find any documentation on it. How can I create such a template for Tesseract to use?
You can either use a uzn file or let Tesseract do the segmentation itself.
Anyway, check out the following link if you need more information about the uzn file format:
https://github.com/OpenGreekAndLatin/greek-dev/wiki/uzn-format

How to use trained data with pytesseract?

Using this tool http://trainyourtesseract.com/ I would like to be able to use new fonts with pytesseract. The tool gives me a file called *.traineddata.
Right now I'm using this simple script:
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract as tes

# Note: the boxes argument exists only in older pytesseract releases;
# newer ones provide image_to_boxes() instead.
results = tes.image_to_string(Image.open('./test.jpg'), boxes=True)
with open('parsing.text', 'a') as file:
    file.write(results)
print(results)
How do I use my traineddata file so that I can read the new font with the Python script?
Thanks!
Edit #1: I understand that *.traineddata can be used with Tesseract as a command-line program, so my question stays the same: how do I use traineddata with Python?
Edit #2: The answer to my question is here: How to access the command line for Tesseract from Python?
Below is a sample of pytesseract.image_to_string() with options.
pytesseract.image_to_string(Image.open("./imagesStackoverflow/xyz-small-gray.png"),
                            lang="eng", boxes=False,
                            config="--psm 4 --oem 3 "
                                   "-c tessedit_char_whitelist=-01234567890XYZ:")
To use your own trained language data, just replace "eng" in lang="eng" with your language name (the name of the .traineddata file without the extension).
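Putting that together, a minimal sketch might look like the following. It assumes your file is called myfont.traineddata and lives in a directory named tessdata; both names are hypothetical, as are the helper function names. --tessdata-dir is the Tesseract command-line option for pointing at a directory of .traineddata files, and pytesseract passes the config string through to the command line. Paths containing spaces may need extra care and are not handled here.

```python
from pathlib import Path


def lang_from_traineddata(traineddata_path):
    """Derive the lang value from a file name such as myfont.traineddata."""
    path = Path(traineddata_path)
    if path.suffix != ".traineddata":
        raise ValueError(f"expected a .traineddata file, got {path.name!r}")
    return path.stem


def tessdata_config(traineddata_path, psm=4):
    """Build a config string that points Tesseract at the file's directory."""
    return f"--tessdata-dir {Path(traineddata_path).parent} --psm {psm}"


def ocr_with_custom_model(image_path, traineddata_path):
    """Run OCR with a custom model; requires pytesseract, Pillow, and Tesseract."""
    # Imported here so the helpers above can be used without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(
        Image.open(image_path),
        lang=lang_from_traineddata(traineddata_path),
        config=tessdata_config(traineddata_path),
    )


if __name__ == "__main__":
    # Hypothetical file name; only the derived arguments are printed here.
    print(lang_from_traineddata("tessdata/myfont.traineddata"))
    print(tessdata_config("tessdata/myfont.traineddata"))
```

With those assumptions, ocr_with_custom_model("./test.jpg", "tessdata/myfont.traineddata") would run the recognition with the custom font model. Alternatively, copying the file into Tesseract's own tessdata directory makes the --tessdata-dir option unnecessary.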