Tesseract confuses two numbers - ocr

I'm writing an application to scan numbers from an image.
The numbers are using the OCR-B font and may also contain + and > characters.
This is my source image:
The scans using Tesseract weren't very good, even when limiting the character set to the mentioned characters. As I didn't find any OCR-B training files for Tesseract, I decided to train it myself.
I created this training image and made a box file from it. The box file is correct, all letters are matched correctly.
Then I did all steps described here to create the other necessary files.
Using this newly trained OCR-B tessdata-set, I get pretty good results on the source image, with one little bug: All 1s are mistaken for 8s and vice-versa. The command used to process the image was
$ tesseract esr2c.tif ocrb-esr2c -l ocrb
and the output for the source image was
0800000001456>8 00000195731208 8 01050008 023+ 08 0301226>20
If you swap all 1s and 8s and compare it to the source image, the output would be correct (except for the last two characters, which I can ignore).
How could this happen? Did I make a mistake in the training process? How can I fix it?

It's likely that your box file has incorrect values (characters) for 1 and 8 somewhere. You can verify this using the jTessBoxEditor program. If so, correct the entries, regenerate the language data file, and try again.
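For context, each box file line pairs one glyph with its bounding box, in the form <char> <left> <bottom> <right> <top> <page> (Tesseract 3.x; older versions omit the page column). The coordinates here are invented for illustration:
1 36 92 52 109 0
8 60 92 78 109 0
If a line's first field reads 8 where the image actually shows a 1 (or vice versa), the trained data will learn exactly the swap you're seeing.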

I trained Tesseract 2.04 for the OCR-A Extended font after a month of effort. It works very well, showing above 90% accuracy with font size 14.
The training image should be a high-contrast image.
Use the GIMP image editor and do the following:
Menu Colors -> Info -> Histogram: read the Std Deviation value.
Colors -> Threshold: enter that Std Deviation value as the threshold value.
Save the image.
Use it for training.
Check and edit your box file using qt-box-editor-1.06.exe. It is very easy to use.
Check all the boxes and the characters in them.
This is very important: somewhere in your box file the characters for 1 and 8 are incorrect.
Then run the other training commands.
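If you'd rather script that preprocessing than do it by hand in GIMP, a rough equivalent with Pillow and NumPy might look like this (file names are placeholders, and whether a global std-deviation threshold matches GIMP's result exactly will depend on the image):
from PIL import Image
import numpy as np

# Load the training image as 8-bit grayscale
img = np.asarray(Image.open("training.tif").convert("L"), dtype=np.float64)

# GIMP's Colors -> Info -> Histogram "Std Dev" value
thresh = img.std()

# Global threshold at that value, as in Colors -> Threshold
binary = np.where(img > thresh, 255, 0).astype(np.uint8)
Image.fromarray(binary).save("training-highcontrast.tif")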

Related

Getting a surface graph from a json file in gnuplot

I'm not a dev; I'm doing this for a school project. I'm trying to put the following dataset into a surface plot in gnuplot on Windows (qt terminal, if that's important).
https://files.catbox.moe/nbc6l1.json
As you can see, it's a huge set of data, pulled directly from an image into a CSV file, which I then converted to JSON.
When I type in "splot 'C:\Users\tyler\ESRP Data\sampleOutput.json'", this is what I get.
As you can see, there's only a single line, when there should be something approaching an intensity chart in a three-dimensional space. Is it a problem with the data? Do I need a specific command to do this?
It would help if you attached an example of your image data to the question, and also if you provided a link to a plot similar to the one you are trying to create. There are many different styles one might use to represent a surface. I will attempt to guess at a possible solution.
Input image (scribbled in GIMP and saved as a png image):
Gnuplot surface plot:
set border -1
unset tics
# surface represented by colored lines in 3D
# down-sample by 4x in each dimension to get an interpretable surface
set palette defined (0 "blue", 1 "white")
splot 'scribble.png' binary filetype=png every 4:4:4 using 1:2:3:3 with lines lc palette

How to process my images to help Tesseract?

I have some images containing only digits, and a semicolon.
Example:
You can see more here: https://imgur.com/a/54dsl6h
They seem pretty clean and straightforward to me, but Tesseract treats them as empty "pages" (Empty page!!).
I tried both --oem 1 and --oem 0, with a character whitelist:
tesseract processed/35.0.png stdout -c tessedit_char_whitelist=0123456789: --oem 0
tesseract processed/35.0.png stdout
What can I do to get Tesseract to recognize the characters better?
Tesseract still gives me pretty bad results overall, but making the text bolder with a simple dilation pass helped a bit.
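A minimal sketch of such a dilation step with OpenCV (the kernel size and iteration count are guesses to tune per image; the file name matches the command above):
import cv2
import numpy as np

# Binarize so the glyphs are white on black before dilating
img = cv2.imread("processed/35.0.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Dilate with a small square kernel to make the strokes bolder
kernel = np.ones((3, 3), np.uint8)
bolder = cv2.dilate(binary, kernel, iterations=1)

# Back to black-on-white for Tesseract
cv2.imwrite("processed/35.0.bold.png", cv2.bitwise_not(bolder))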
In the end, since the font is really square, I used a trick: I defined a bunch of segments for each digit, and depending on which segments intersect (or don't intersect) with the digit, I can determine with 99% accuracy which digit it is.
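A rough sketch of that segment-probing idea in Python (the sample points and the digit table are illustrative; they would have to be measured from the actual font):
# Probe points in a normalized 0..1 digit cell, one per "segment",
# in the spirit of a seven-segment display.
SEGMENTS = {
    "top":   (0.5, 0.1), "middle": (0.5, 0.5), "bottom": (0.5, 0.9),
    "top_l": (0.1, 0.3), "top_r":  (0.9, 0.3),
    "bot_l": (0.1, 0.7), "bot_r":  (0.9, 0.7),
}

# Which segments are "on" for each digit (only two shown here).
DIGITS = {
    frozenset({"top", "top_l", "top_r", "bot_l", "bot_r", "bottom"}): "0",
    frozenset({"top_r", "bot_r"}): "1",
    # ... remaining digits follow the same pattern
}

def classify(cell):
    """cell: 2D list of booleans (True = ink) holding exactly one digit."""
    h, w = len(cell), len(cell[0])
    lit = frozenset(
        name for name, (fx, fy) in SEGMENTS.items()
        if cell[int(fy * (h - 1))][int(fx * (w - 1))]
    )
    return DIGITS.get(lit, "?")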

Best method to train Tesseract 3.02

I'm wondering what the best method is to train Tesseract (kind of text/TIFF and so on) for a particular kind of document, with these particularities:
the structure and main text of the documents is always the same
the only things that change are 5 alphanumeric codes (THESE ARE THE REALLY IMPORTANT THINGS TO DETECT!)
some of these codes are bold
At the moment I use the standard trained data: I detect the entire text and extract the codes with some regular expressions.
It's okay, but I sometimes get errors, for example:
0 / O
L / I / 1
Does anyone know some "tricks" to improve precision?
Thanks!
During the training part of Tesseract, you have to manually create a file to give to the engine in order to specify ambiguous characters.
For more information, look at the "unicharambigs" part of the Tesseract documentation.
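As a sketch, a minimal unicharambigs file covering the 0/O and 1/l/I confusions might look like this (fields are tab-separated: length and characters of the ambiguous sequence, then length and characters of the replacement, then 0 for an optional substitution or 1 for a mandatory one; verify the exact format against your Tesseract version's docs):
v1
1 0 1 O 0
1 O 1 0 0
1 1 1 l 0
1 l 1 I 0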
Best Regards.

Can I test tesseract ocr in windows command line?

I am new to Tesseract OCR. I tried converting an image to TIF and running it through Tesseract from cmd in Windows to see the output, but I couldn't get it to work. Can you help me? What would the command be?
Here is my sample image:
The simplest tesseract.exe syntax is tesseract.exe inputimage output-text-file.
The assumption here, is that tesseract.exe is added to the PATH environment variable.
You can add the -psm N argument if your text is particularly hard to recognize.
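For instance, to force page segmentation mode 6 (a single uniform block of text) with the English data, the call would look something like:
tesseract.exe ECL8R.png out -l eng -psm 6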
I see that the regular syntax (without any -psm switches) works well enough with the image you attached, unless you need a higher level of accuracy.
Note that non-english characters (such as the symbol next to prescription) are not recognized; my default installation only contains the English training data.
Here's the tesseract syntax description:
C:\Users\vish\Desktop>tesseract.exe
Usage:tesseract.exe imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
pagesegmode values are:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
-l lang and/or -psm pagesegmode must occur before any configfile.
Single options:
-v --version: version info
--list-langs: list available languages for tesseract engine
And here's the output for your image (NOTE: When I downloaded it, it converted to a PNG image):
C:\Users\vish\Desktop>tesseract.exe ECL8R.png out.txt
Tesseract Open Source OCR Engine v3.02 with Leptonica
C:\Users\vish\Desktop>type out.txt.txt
1 Project Background
A prescription (R) is a written order by a physician or medical doctor to a pharmacist in the form of
medication instructions for an individual patient. You can't get prescription medicines unless someone
with authority prescribes them. Usually, this means a written prescription from your doctor. Dentists,
optometrists, midwives and nurse practitioners may also be authorized to prescribe medicines for you.
It can also be defined as an order to take certain medications.
A prescription has legal implications; this means the prescriber must assume his responsibility for the
clinical care ofthe patient.
Recently, the term "prescription" has known a wider usage being used for clinical assessments,

Programmatically obtaining the number of colors used in an image

Question:
Given an image in PNG format, what is the simplest way to programmatically obtain the number of colors used in the image?
Constraints:
The solution will be integrated into a shell script running under Linux, so any solution that fits in such an environment will do.
Please note that the "color capacity of the image file" does not necessarily correspond to the "colors used". Example: in an image file with a theoretical color capacity of 256 colors, only, say, 7 colors might be in actual use. I want to obtain the number of colors actually used.
Why write your own program?
If you're doing this with a shell script, you can use the netpbm utilities:
count=`pngtopnm png_file | ppmhist -noheader | wc -l`
The Image.getcolors method in Python Imaging Library seems to do exactly what you want.
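A minimal sketch of that approach (the file name is a placeholder):
from PIL import Image

img = Image.open("input.png").convert("RGB")
w, h = img.size
# getcolors returns None if there are more than maxcolors distinct
# colors, so pass the pixel count as a safe upper bound
colors = img.getcolors(maxcolors=w * h)
print(len(colors))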
Fun. There doesn't appear to be any guaranteed method of doing this; in the worst case you'll need to scan the image and interpret every pixel; in the best possible case the PNG will be using a palette and you can just check there.
Even in the palette case, though, you're not guaranteed that every entry is used -- so you're (at best) getting an upper bound.
The PNG spec is here:
http://www.libpng.org/pub/png/spec/1.1/PNG-Contents.html
...and the chunk info here:
http://www.libpng.org/pub/png/spec/1.1/PNG-Chunks.html
Alnitak's solution is nice :) I really should get to know netpbm and imagemagick etc. better some time.
Just FYI, as a simple and very general solution: loop through each pixel in the image, getting the r,g,b color values as a single integer. Look for that integer in a list; if it's not there, add it. When finished with all the pixels, print the number of colors in the list.
If you want to count occurrences, use a hashmap/dictionary instead of a simple list, incrementing the key's value (a counter) if the key is already in the dictionary. If not found, add it with a starting counter value of 1.
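A sketch of that approach with Pillow, using a dictionary so both variants (distinct colors and occurrence counts) fall out of one pass (the file name is a placeholder):
from PIL import Image
from collections import Counter

img = Image.open("input.png").convert("RGB")
counts = Counter()
for r, g, b in img.getdata():
    # Pack r,g,b into a single integer key
    counts[(r << 16) | (g << 8) | b] += 1

print(len(counts), "distinct colors")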