Issue training Tesseract-OCR 4 - Empty shape table - ocr

I am trying to train Tesseract 4 on specific pictures (to read 7-segment multimeter displays).
Please note that I am aware of the already trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.
In order to train Tesseract, I followed different tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/),
but I always run into a problem when running the shapeclustering command with my own data.
(With example data such as https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)
Indeed, when I run the shapeclustering command I get this output: screenshot
Then my shape_table is empty and the training can't be effective...
With the example data it works fine and the shape_table is properly filled.
I am guessing that I have an issue with box file generation; here is my process to create the box files:
I use the
tesseract imageFileName.tif imageFileName batch.nochop makebox
command to generate the box file, and then I edit it with jTessBoxEditor.
So I can't see what's wrong with my .box/.tif data pair.
Have a good day & thanks for helping me
Adrien
Here is my full batch script for training after having generated and edited box files.
set name=sev7.exp0
set shortName=sev7
echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe %name%.box
shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause

OK, so I finally managed to train Tesseract.
The solution is to add a --psm parameter when using the command
tesseract.exe %name%.tif %name% nobatch box.train
as in
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train
Note that the possible psm values are:
REM pagesegmode values are:
REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Found at https://github.com/tesseract-ocr/tesseract/issues/434
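For completeness, here is a minimal sketch of how the box-file generation and training lines can carry the same --psm setting; the value 7 (single text line) is only an assumption that happens to fit single-line multimeter displays, so adjust it to your images.
REM Hypothetical values; adjust to your own data
set name=sev7.exp0
set typeFile=tif
set psm=7
REM Generate the box file with an explicit page segmentation mode
tesseract.exe %name%.%typeFile% %name% --psm %psm% batch.nochop makebox
REM Run the training pass with the same page segmentation mode
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train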

Related

Failed in generating Tesseract traineddata

I'm using Tesseract v5.0.1.20220118 on Windows 10, training a font that only has the letters "P" and "Q".
When I get to the step
mftraining -F font_properties.txt -U unicharset -O normal.unicharset pq.normal.exp0.tr
The pffmtable file is not generated.
And when I run cntraining pq.normal.exp0.tr, it shows me:
Reading pq.normal.exp0.tr ...
Clustering ...
N == sizeof(Cluster->Mean):Error:Assert failed:in file ../../../src/classify/cluster.cpp, line 2526
Why does it go wrong? How can I fix it?
Only inttemp and shapetable are generated, but the tutorial says there should be four files, including shapetable, inttemp, pffmtable and normproto. I wonder whether this is because the font only has the letters "P" and "Q", but I have no idea how to solve it.
Please read the docs:
https://tesseract-ocr.github.io/tessdoc/#training-for-tesseract-5
Use the right tools:
https://github.com/tesseract-ocr/tesstrain
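For reference, tesstrain drives LSTM training through a Makefile rather than the old shapeclustering/mftraining tools; a rough sketch of its workflow (the model name, start model and tessdata path below are placeholders, not values from this question) looks like this:
# Get the training harness and run the Makefile-based workflow
git clone https://github.com/tesseract-ocr/tesstrain.git
cd tesstrain
# Ground truth goes into data/pq-ground-truth as image + .gt.txt line pairs
make training MODEL_NAME=pq START_MODEL=eng TESSDATA=/path/to/tessdata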

How to Create Traineddata file For Tesseract 4.1.0

I want to recognize the characters of a number plate.
How do I train tesseract-ocr for number plates on Ubuntu 16.04?
Since I'm not familiar with training, please help me to create a 'traineddata' file for recognizing number plates.
I have 1000 images of number plates.
Please look into it.
Any help would be appreciated.
So I have tried the following commands
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
tesseract eng.arial.plate3655.png eng.arial.plate3655 batch.nochop makebox
But it gives an error.
Tesseract Open Source OCR Engine v4.1.0-rc1-56-g7fbd with Leptonica
Error, cannot read input file eng.arial.plate3655.png: No such file or directory
Error during processing.
After that I tried
tesseract plate4.png eng.arial.plate4 batch.nochop makebox
It works, but only on some plates.
Now in Step 2 I am getting an error.
Screenshot is attached.
Plate 4 image for training
Step 1 and Step 2 display in terminal
File generated after Step 1 and Step 2
Content of file generated after Step 1 and Step 2
Creating .traineddata for Tesseract 4
{*Note: After installing Tesseract, open cmd and do the following.}
Step 1:
Make box files for images that we want to train
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 batch.nochop makebox
{*Note: After making the box files we have to correct any wrongly identified characters in them.}
Step 2:
Create a .tr file (combining the image file and the box file)
Syntax:
tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg:
tesseract own.arial.exp0.jpg own.arial.exp0 box.train
step 3:
Extract the character set from the box files (the output of this command is the unicharset file)
Syntax:
unicharset_extractor [langname].[fontname].[expN].box
Eg:
unicharset_extractor own.arial.exp0.box
step 4:
Create a font_properties file based on our needs.
Syntax:
echo "[fontname] [italic (0 or 1)] [bold (0 or 1)] [monospace (0 or 1)] [serif (0 or 1)] [fraktur (0 or 1)]" > font_properties
Eg:
echo "arial 0 0 1 0 0" > font_properties
Step 5:
Training the data.
Syntax:
mftraining -F font_properties -U unicharset -O [langname].unicharset [langname].[fontname].[expN].tr
Eg:
mftraining -F font_properties -U unicharset -O own.unicharset own.arial.exp0.tr
Step 6:
Syntax:
cntraining [langname].[fontname].[expN].tr
Eg:
cntraining own.arial.exp0.tr
{*Note: After Step 5 and Step 6, four files are created (shapetable, inttemp, pffmtable, normproto).}
Step 7:
Rename the four files (shapetable, inttemp, pffmtable, normproto) to ([langname].shapetable, [langname].inttemp, [langname].pffmtable, [langname].normproto)
Syntax:
rename filename1 filename2
Eg:
rename shapetable own.shapetable
rename inttemp own.inttemp
rename pffmtable own.pffmtable
rename normproto own.normproto
Step 8:
Create .traineddata file
Syntax:
combine_tessdata [langname].
Eg:
combine_tessdata own.
{*Note: I use only one image (exp0) for creating the traineddata. If you want to train on more than one image, you can add more, i.e. exp1, exp2, ..., expN.}
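Once own.traineddata exists, a quick way to sanity-check it is to drop it into a tessdata directory and run OCR against a test plate; the tessdata path and image name below are only examples:
# Copy the new language file into the tessdata directory (path is an example)
sudo cp own.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
# OCR a test plate with the new model; --psm 7 assumes a single line of text
tesseract plate_test.png output -l own --psm 7
cat output.txt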

How to Convert Open Image Dataset to LMDB [duplicate]

I am relatively new to machine learning/python/ubuntu.
I have a set of images in .jpg format where half contain a feature I want Caffe to learn and half don't. I'm having trouble finding a way to convert them to the required LMDB format.
I have the necessary text input files.
My question is can anyone provide a step by step guide on how to use convert_imageset.cpp in the ubuntu terminal?
Thanks
A quick guide to Caffe's convert_imageset
Build
The first thing you must do is build Caffe and Caffe's tools (convert_imageset is one of these tools).
After installing Caffe and building it with make, make sure you ran make tools as well.
Verify that a binary file convert_imageset is created in $CAFFE_ROOT/build/tools.
Prepare your data
Images: put all images in a folder (I'll call it here /path/to/jpegs/).
Labels: create a text file (e.g., /path/to/labels/train.txt) with one line per input image. For example:
img_0000.jpeg 1
img_0001.jpeg 0
img_0002.jpeg 0
In this example the first image is labeled 1 while the other two are labeled 0.
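If you still need to create that labels file, a simple sketch for generating it from two folders of images (the positive/ and negative/ folder names are just an assumed layout, one folder per class) could be:
# Build train.txt: label 1 for images with the feature, 0 for those without
cd /path/to/jpegs/
for f in positive/*.jpg; do echo "$f 1"; done >  /path/to/labels/train.txt
for f in negative/*.jpg; do echo "$f 0"; done >> /path/to/labels/train.txt
The listed paths are then relative to /path/to/jpegs/, which is the root folder passed to convert_imageset below.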
Convert the dataset
Run the binary in shell
~$ GLOG_logtostderr=1 $CAFFE_ROOT/build/tools/convert_imageset \
--resize_height=200 --resize_width=200 --shuffle \
/path/to/jpegs/ \
/path/to/labels/train.txt \
/path/to/lmdb/train_lmdb
Command line explained:
GLOG_logtostderr=1, set before calling convert_imageset, tells the logging mechanism to redirect log messages to stderr.
--resize_height and --resize_width resize all input images to the same size, 200x200.
--shuffle randomly changes the order of images and does not preserve the order in the /path/to/labels/train.txt file.
The following arguments are the path to the images folder, the labels text file and the output name. Note that the output name should not exist prior to calling convert_imageset, otherwise you'll get a scary error message.
Other flags that might be useful:
--backend - allows you to choose between an lmdb dataset or levelDB.
--gray - converts all images to grayscale.
--encoded and --encoded_type - keep image data in encoded (jpg/png) compressed form in the database.
--help - shows some help, see all relevant flags under Flags from tools/convert_imageset.cpp
You can check out $CAFFE_ROOT/examples/imagenet/convert_imagenet.sh for an example of how to use convert_imageset.
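As an illustration of those optional flags, the same conversion could be run in grayscale and written to LevelDB instead of LMDB (the paths are the same placeholders as above); this is only a sketch:
# Same conversion, but grayscale and stored as LevelDB instead of LMDB
GLOG_logtostderr=1 $CAFFE_ROOT/build/tools/convert_imageset \
    --gray --backend=leveldb --shuffle \
    /path/to/jpegs/ \
    /path/to/labels/train.txt \
    /path/to/leveldb/train_leveldb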

Tesseract training achieves a worse result than without training

I used the Windows installer of tesseract-ocr 3.02.02 (I didn't find a newer one for 3.04). My image is a JPEG with a resolution of 600 dpi (3507x4960), a scanned blank "certificate of incapacity for work". The OCR result without training is much more accurate than after training. So what am I doing wrong?
This is how I build my box file:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu batch.nochop makebox
Using jTessBoxEditor I fixed every box by hand. Then I started the training:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg %TESSLANG% -l deu nobatch box.train
unicharset_extractor %TESSLANG%.box
shapeclustering -F font_properties -U unicharset %TESSLANG%.tr
mftraining -F font_properties -U unicharset -O %LANG%.unicharset %TESSLANG%.tr
cntraining %TESSLANG%.tr
MOVE inttemp %LANG%.inttemp
MOVE normproto %LANG%.normproto
MOVE pffmtable %LANG%.pffmtable
MOVE shapetable %LANG%.shapetable
combine_tessdata %LANG%.
COPY %LANG%.traineddata %TESSERACT_HOME%\tessdata /Y
The OCR without training (achieving the best results) is done like this:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg without_training -l deu
Using the traineddata:
SET LANG=arbeitsunfaehigkeit
SET FONTNAME=hausarzt
SET TESSLANG=%LANG%.%FONTNAME%.exp0
tesseract %TESSLANG%.jpg with_training -l %LANG%
Maybe I am wrong but I expect a perfect result (I use the same JPEG for training and OCRing).
Here is the first part of without_training.txt:
Paul Albrechts Verlag, 22952 Lütjensee Bei verspäteter Vorlage droht Krankengeldverlust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen—Nr. Versicherten—Nr. Status
Betriebsstätten-Nr. Arzt—Nr. Datum
And the first part of with_training.txt:
Pau/A/brechrs Ver/ag, 22952 Lüfjensee Be! verspäteter vor!age droht Krankenge!dver!ust!
Krankenkasse bzw. Kostenträger
Name, Vorname des Versicherten
geb. am
Kassen-Nr. Versicherten-Nr. status
Betriebsstätten-Nr. Arzt-Nr. Datum
In my case adding the language "deu" did the trick:
tesseract %TESSLANG%.jpg with_training -l %LANG%+deu
instead of
tesseract %TESSLANG%.jpg with_training -l %LANG%

Use LibreOffice to convert HTML to PDF from the Mac command line?

I'm trying to convert an HTML file to a PDF using the Mac terminal.
I found a similar post and used the code they provided, but I kept getting nothing. I did not find the output file anywhere when I issued this command:
./soffice --headless --convert-to pdf --outdir /home/user ~/Downloads/*.odt
I'm using Mac OS X 10.8.5.
Can someone show me a terminal command line that I can use to convert HTML to PDF?
OK, here is an alternative way to convert (X)HTML to PDF on the Mac command line. It does not use LibreOffice at all and should work on all Macs.
This method (ab)uses a filter from the Mac's print subsystem, called xhtmltopdf. This filter is usually not meant to be used by end-users but only by the CUPS printing system.
However, if you know about it, know where to find it and know how to run it, there is no problem with doing so:
The first thing to know is that it is not in any desktop user's $PATH. It is in /usr/libexec/cups/filter/xhtmltopdf.
The second thing to know is that it requires a specific syntax and order of parameters to run, otherwise it won't. Called with no parameters at all (or with the wrong number of parameters), it will emit a small usage hint:
$ /usr/libexec/cups/filter/xhtmltopdf
Usage: xhtmltopdf job-id user title copies options [file]
Most of these parameter names show that the tool is clearly related to printing. The command requires at least 5 parameters, plus an optional 6th. If only 5 parameters are given, it reads its input from <stdin>, otherwise from the 6th parameter, a file name. It always emits its output to <stdout>.
The only CLI params which are interesting to us are number 5 (the "options") and the (optional) number 6 (the input file name).
When we run it on the command line, we have to supply 5 dummy or empty parameters first, before we can put the input file's name. We also have to redirect the output to a PDF file.
So, let's try it:
/usr/libexec/cups/filter/xhtmltopdf "" "" "" "" "" my.html > my.pdf
Or, alternatively (this is faster to type and easier to check for completeness, using 5 dummy parameters instead of 5 empty ones):
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html > my.pdf
While we are at it, we could try to apply some other CUPS print subsystem filters to the output: /usr/libexec/cups/filter/cgpdftopdf looks like one that could be interesting. This additional filter expects the same sort of parameter count and order, like all CUPS filters.
So this should work:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 "" \
> my.pdf
However, piping the output of xhtmltopdf into cgpdftopdf is only interesting if we try to apply some "print options". That is, we need to come up with some settings in parameter no. 5 which achieve something.
Looking up the CUPS command line options on the CUPS web page suggests a few candidates:
-o number-up=4
-o page-border=double-thick
-o number-up-layout=tblr
do look like they could be applied while doing a PDF-to-PDF transformation. Let's try:
/usr/libexec/cups/filter/xhtmltopdf 1 2 3 4 5 my.html \
| /usr/libexec/cups/filter/cgpdftopdf 1 2 3 4 5 \
"number-up=4 page-border=double-thick number-up-layout=tblr" \
> my.pdf
Here are two screenshots of results I achieved with this method. Both used as input two HTML files which were identical apart from one line: the line which referenced the CSS file used for rendering the HTML.
As you can see, the xhtmltopdf filter is able to (at least partially) take into account CSS settings when it converts its input to PDF:
Starting with 3.6.0.1, you would need unoconv on the system to convert documents.
Using unoconv with MacOS X
LibreOffice 3.6.0.1 or later is required to use unoconv under MacOS X. This is the first version distributed with an internal python script that works. No version of OpenOffice for MacOS X (3.4 is the current version) works because the necessary internal files are not included inside the application.
I just had the same problem, but I found this LibreOffice help post. It seems that headless mode won't work if you've got LibreOffice (the usual GUI version) running too. The fix is to add an -env option, e.g.
libreoffice "-env:UserInstallation=file:///tmp/LibO_Conversion" \
--headless \
--invisible \
--convert-to csv file.xls
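Applied to the original HTML-to-PDF question, the same -env workaround would look roughly like this (the output directory and input file name are just examples):
# Convert an HTML file to PDF even while the LibreOffice GUI is running
./soffice "-env:UserInstallation=file:///tmp/LibO_Conversion" \
    --headless \
    --convert-to pdf \
    --outdir ~/Desktop \
    ~/Downloads/page.html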