Detection of vertical text (container BIC codes) with Tesseract OCR fails

I'm trying to use the Tesseract Open Source OCR Engine to detect the text of intermodal (shipping) container codes in BIC format. I'm using Tesseract through pytesseract, and I preprocess the input photos with a few standard OpenCV filters (heavy rescaling, blurring, denoising, binarization).
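For reference, a minimal sketch of the kind of preprocessing pipeline I mean (the exact kernel sizes and thresholds below are illustrative, not the values I actually use):
import cv2

# illustrative preprocessing: rescale, blur, denoise, binarize
img = cv2.imread('container.jpg', cv2.IMREAD_GRAYSCALE)
img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)    # heavy rescaling
img = cv2.GaussianBlur(img, (3, 3), 0)                                    # light blurring
img = cv2.fastNlMeansDenoising(img, h=10)                                 # denoising
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization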
I tuned tesseract (version: tesseract 5.0.0-alpha-647-g4a00) in this way:
config = (
    # only a set of characters
    ' -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ' +
    # no language model
    ' -c load_system_dawg=0' +
    ' -c load_freq_dawg=0' +
    ' -c enable_new_segsearch=1' +
    ' -c language_model_penalty_non_freq_dict_word=1' +
    ' -c language_model_penalty_non_dict_word=1' +
    # select segmentation mode
    ' --psm 11'
)
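The config string is then passed straight to pytesseract, along the lines of (a sketch, where img is the preprocessed image from above):
text = pytesseract.image_to_string(img, config=config)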
I get promising results when the codes are horizontally aligned, as in this case:
but I have trouble with vertical text, as in the example shown in this photo:
In this last case Tesseract doesn't produce a useful result. Why does Tesseract fail even if the input image seems "good"? Any suggestions on how to improve recognition?

Try running with --psm 5:
pagesegmode values are:
5 = Assume a single uniform block of vertically aligned text.
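With pytesseract that would look something like the following (a minimal sketch; the file name and whitelist are placeholders):
import cv2
import pytesseract

img = cv2.imread('vertical_code.jpg')
# psm 5: assume a single uniform block of vertically aligned text
config = '--psm 5 -c tessedit_char_whitelist=0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(pytesseract.image_to_string(img, config=config))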


NLTK: Is there a term for this procedure?

I was reading some material about NLTK and came across a procedure that turns a word such as "you're" into the two tokens "you" and "are". I can't remember the source. Is there a term for this procedure?
This is usually just called expanding contractions. One option is the contractions library:
pip install contractions
# import library
import contractions
# contracted text
text = '''I'll be there within 5 min. Shouldn't you be there too?
I'd love to see u there my dear. It's awesome to meet new friends.
We've been waiting for this day for so long.'''
# creating an empty list
expanded_words = []
for word in text.split():
    # use contractions.fix to expand each shortened word
    expanded_words.append(contractions.fix(word))
expanded_text = ' '.join(expanded_words)
print('Original text: ' + text)
print('Expanded_text: ' + expanded_text)
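Note that contractions.fix can also be applied to the whole string in one call, so the word-by-word loop is optional:
import contractions
expanded_text = contractions.fix(text)  # expands every contraction in the string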

Issue training Tesseract-OCR 4 - Empty shape table

I am trying to train Tesseract 4 on particular pictures (to read multimeters with 7-segment displays).
Please note that I am aware of the already trained data from Arthur Augusto at https://github.com/arturaugusto/display_ocr, but I need to train Tesseract on my own data.
In order to train Tesseract, I followed different tutorials (such as https://robipritrznik.medium.com/recognizing-vehicle-license-plates-on-images-using-tesseract-4-ocr-with-custom-trained-models-4ba9861595e7 or https://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/),
but I always get a problem when running the shapeclustering command with my own data.
(With the example data from https://github.com/tesseract-ocr/tesseract/issues/1174#issuecomment-338448972, everything works fine.)
Indeed, when I run the shapeclustering command I get this output: screenshot
Then my shape_table is empty and the training can't be effective...
With the example data it works fine and the shape_table is properly filled.
I am guessing that I have an issue with box file generation; here is my process to create the box files:
I use the
tesseract imageFileName.tif imageFileName batch.nochop makebox
command to generate the box file, and then I edit it with jTessBoxEditor.
So I can't see where I'm wrong with my .box/.tif data pair.
Have a good day & thanks for helping me.
Adrien
Here is my full batch script for training after having generated and edited box files.
set name=sev7.exp0
set shortName=sev7
echo Run Tesseract for Training..
tesseract.exe %name%.tif %name% nobatch box.train
echo Compute the Character Set..
unicharset_extractor.exe %name%.box
shapeclustering -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
mftraining -F font_properties -U unicharset -O %shortName%.unicharset %name%.tr
echo Clustering..
cntraining.exe %name%.tr
echo Rename Files..
rename normproto %shortName%.normproto
rename inttemp %shortName%.inttemp
rename pffmtable %shortName%.pffmtable
rename shapetable %shortName%.shapetable
echo Create Tessdata..
combine_tessdata.exe %shortName%.
echo. & pause
OK, so I finally managed to train Tesseract.
The solution is to add a --psm parameter when running the command
tesseract.exe %name%.tif %name% nobatch box.train
as in
tesseract.exe %name%.%typeFile% %name% --psm %psm% nobatch box.train
Note that all the psm values are:
REM pagesegmode values are:
REM 0 = Orientation and script detection (OSD) only.
REM 1 = Automatic page segmentation with OSD.
REM 2 = Automatic page segmentation, but no OSD, or OCR
REM 3 = Fully automatic page segmentation, but no OSD. (Default)
REM 4 = Assume a single column of text of variable sizes.
REM 5 = Assume a single uniform block of vertically aligned text.
REM 6 = Assume a single uniform block of text.
REM 7 = Treat the image as a single text line.
REM 8 = Treat the image as a single word.
REM 9 = Treat the image as a single word in a circle.
REM 10 = Treat the image as a single character.
REM 11 = Sparse text. Find as much text as possible in no particular order.
REM 12 = Sparse text with OSD.
REM 13 = Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Found at https://github.com/tesseract-ocr/tesseract/issues/434
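Once the combined sev7.traineddata has been copied into Tesseract's tessdata directory, it can be used like any other language. A minimal sketch with pytesseract (the image path and psm value are illustrative; since this is legacy-format training data, the legacy engine may need to be selected with --oem 0):
import pytesseract
from PIL import Image

# assumes sev7.traineddata has been copied into the tessdata directory
img = Image.open('multimeter.png')
print(pytesseract.image_to_string(img, lang='sev7', config='--oem 0 --psm 7'))  # single text line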

Detect language/script from pdf with python

I am trying to create a Python script that detects the language(s)/script(s) inside a not-yet-OCRed PDF with the help of pytesseract, before doing the 'real' OCR by passing the correctly detected language(s).
I have about 10000 PDFs, not always in standard English and sometimes 1000 pages long. In order to do the real OCR I need to autodetect the language first.
So it's a sort of two-step OCR, if you will, both of which Tesseract can perform:
Detecting the language/script on some centered pages
Performing the real OCR with the found language(s)/script(s) over all pages
Any tips to fix/improve this script? All I want is the detected language(s) on the given pages returned.
#!/usr/bin/python3
import sys
import pytesseract
from wand.image import Image
import fitz
pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.pageCount / 2)
surround = 2
with Image(filename=pdffilename + '[' + str(center_page - surround) + '-' + str(center_page + surround) + ']') as im:
print(pytesseract.image_to_osd(im, lang='osd', config='psm=0 pandas_config=None', nice=0, timeout=0))
I run the script as follows:
script_detect.py myunknown.pdf
I am currently getting the following error:
TypeError: Unsupported image object
Assuming that you have converted your PDF file using some tool (OCR or other), you can use langdetect. Sample your text and feed it to detect:
from langdetect import detect
lang = detect("je suis un petit chat")
print(lang)
output: fr
or
from langdetect import detect
lang = detect("我是法国人")
print(lang)
output: zh-cn
There are other libraries, such as polyglot, useful if you have mixed languages.
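As for the TypeError in the original script: pytesseract expects a PIL image, a numpy array, or a file path, not a wand Image. A minimal sketch of the OSD step, assuming a recent PyMuPDF and Pillow (the dpi and the choice of page are illustrative):
import io
import sys

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

pdffilename = sys.argv[1]
doc = fitz.open(pdffilename)
center_page = round(doc.page_count / 2)

# render the center page to a PNG and hand pytesseract a PIL image
pix = doc.load_page(center_page).get_pixmap(dpi=300)
img = Image.open(io.BytesIO(pix.tobytes("png")))
print(pytesseract.image_to_osd(img))  # reports orientation and script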

Setting Octave interpreter to 'tex'

Such a simple question to have wasted an hour or two of my time. The Octave docs allude to setting the interpreter to tex but never say how to do it. I've looked online and through Stack Overflow and haven't found how to do this. I've also looked at the .octaverc files and have seen nothing that would indicate how to turn on the tex edit function. I am using Debian GNU Octave version 4.0.0. Please help.
Gary Roach
The interpreter property is set to "tex" by default for axes, line, text, patch and surface objects. So changing the interpreter only makes sense if you want to switch to "none":
set (findobj (gcf, "-property", "interpreter"), "interpreter", "none")
This sets "interpreter" = "none" for all children of the current figure.
If you want some fancy LaTeX stuff in your plots, and not only simple TeX commands, you can render it with LaTeX:
close all
graphics_toolkit fltk
sombrero ();
title ("The sombrero function:")
fcn = "$z = \\frac{\\sin\\left(\\sqrt{x^2 + y^2}\\right)}{\\sqrt{x^2 + y^2}}$";
text (0.5, -10, 1.8, fcn, "fontsize", 20);
print -depslatexstandalone sombrero
## process the generated files with latex
system ("latex sombrero.tex");
## dvi to ps
system ("dvips sombrero.dvi");
## convert to png
system ("gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r100x100 -dEPSCrop -sOutputFile=sombrero.png sombrero.ps")
which gives:

gnuplot giving warning: Skipping data file with no valid points

I am using the gnuplot script
#qe.conf
set terminal png truecolor
set output "qe.png"
set xrange ["400" : "700"]
set yrange ["0" : "1"]
set style data lines
plot "qe.txt" using 1:2 title "%Red", '' using 1:3 title "%G-r", '' using 1:4 title "%G-b", '' using 1:5 title "%R"
I am executing the gnuplot script qe.conf through a shell script.
It gives me the following error:
gnuplot> plot "qe.txt" using 1:2 title "%Red", '' using 1:3 title "%G-r", '' using 1:4 title "%G-b", '' using 1:5 title "%R"
         line 0: warning: Skipping data file with no valid points
(the same warning is printed four times)
But when I execute qe.conf manually, it works fine.
The datafile is here:
400.0 0.3625060772
410.0 0.445987595886
420.0 0.503862994331
430.0 0.534251869841
440.0 0.576047041939
450.0 0.594211326218
460.0 0.58079588866
470.0 0.506666961836
480.0 0.495652452097
490.0 0.426107864611
500.0 0.342632041157
510.0 0.251232093174
520.0 0.178015786221
530.0 0.140803848655
540.0 0.120063881639
550.0 0.0995420648319
560.0 0.080193952073
570.0 0.0730989150532
580.0 0.0708069989426
590.0 0.0688014659014
600.0 0.0597099385221
610.0 0.0481330987744
620.0 0.042010859344
630.0 0.0425115579982
640.0 0.0460125024438
650.0 0.0515227545961
660.0 0.0559745367996
670.0 0.0629981328342
680.0 0.0573046109671
690.0 0.0688715871636
700.0 0.0742304568215
Can anyone suggest a solution?
Hi all, after hours of trying I still don't have the answer.
I tried the following things: I tried giving absolute paths for the datafile, the gnuplot script and the shell script.
The command gnuplot qe.conf works fine when run from the Linux command prompt, but when run through the shell script it gives this error:
line 10: warning: Skipping data file with no valid points
Request for help.
I get this error every time I try to plot a .csv (comma-separated values) file; I forget that sometimes gnuplot needs to be reminded what the delimiter is. Usually I get the same error you mention, sometimes I get no error, but in either case no data is plotted until the delimiter is defined properly.
gnuplot defaults to whitespace as a delimiter, but perhaps you overrode that and set it to a comma or something. Try telling gnuplot what your delimiter is:
set datafile separator " "
or
set datafile separator whitespace
and then of course for comma try "," and for tab try "\t".
I find it best to keep putting set datafile separator " " at the top of my scripts, to remind myself.
You could also try checking the encoding of your data file.
I've just come across this exact problem as well when trying to plot a data file. It turned out that gnuplot was unable to understand the data file due to its encoding (which was UTF-16LE).
When I changed the file's encoding to UTF-8, gnuplot was able to read it without issues.
Since this post is already a bit old, you've probably managed to solve it by now, but I thought this might help anyone else who runs into the same problem.
The problem here is that you're trying to plot the 3rd, 4th, and 5th columns of a two-column dataset. If you change your plot command to drop everything that uses 1:3 or higher, i.e. plot "qe.txt" using 1:2 title "%Red", it should work just fine. The error message is telling you that the data file really is empty (in the higher columns).
Recently, I had the same issue using gnuplot 5.0.4. As Aendur suggested, the encoding could be problematic. What fixed it for me was using TextWrangler to change the line breaks, in my case from "Mac Classic (CR)" to "Unix (LF)", without having to change the encoding of the file to UTF-8.