tesseract unable to detect characters in simple two-word image - ocr

I'm having trouble getting tesseract to recognize any characters in the following image:
When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought that the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) as well as augmenting the existing dictionary with these additional words (LAO and CAUD). Neither of those approaches worked. I have tried tesseract versions 3, 4, and have built version 5 from source on a Mac computer. All have given the same result.
Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:
The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?
Thanks in advance.

It's common problem. You probably will need preprocess the image, with rescaling, filters, etc.
Here are some ref on how to do that:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://docparser.com/blog/improve-ocr-accuracy/

The solution was to use the right page segmentation method (PSM). In my case, PSM 6, which is for a single block of text, did the trick.

Related

tesseract: Recognize plain multi-digit number

for some strange reason, tesseract is not able to recognize the following image. I tried various config options such as:
--psm 13: "Treat the image as a single text line"
tessedit_char_whitelist=012345678iI': Only allow numbers (and i's that can be replaced later).
This is the image:
Maybe it's my preprocessing, but to me the picture looks good (I also tried increasing the borders around the number). Any advice would be highly appreciated! Couldn't find anything helpfull neither Google or SO.
Thanks!
Figured it out: pytesseract.image_to_string(img, config='digits')

Limit space size in Tesseract

I write in Python, using pytesseract or direct Popen calls if needed.
I try to OCR a document with irregular structure, a letter looking like this:
The problem is in the .hocr file generated by Tesseract I get lines consisting of left and right column glued together like "Recipient: Sender:"
What I'd like to achieve is output from the left and right column separated. Using third party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find a good documentation on these and probably there is a right combination of values but there are 57 options with "space" in name... any insight on these would be much appreciated.
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...

Tesseract confuses "-" and "7" in a single-line image

This image is recognized as
08787365076858, instead of
0878-3650-6858
I have a list of 50 similar image files, and in each all "-" chars are matched as "7".
Default settings were used, even with installing tesseract to clear system.
Also tried to use -psm=7/8 (single line/word) and set whitelist characters.
What can be the reason of this issue and how can I overcome it?
I know about training, but it's interesting, why accurate (in most cases) tesseract confuses so different chars.
Rescaling to 300DPI would help get those dashes in the image.

Recognizing superscript characters using OCR

I've started a simple project in which it must get an image containing text with superscripts and then by using OCR (currently I'm using tesseract) it has to recognize the superscript characters + the normal ones.
For example, we have a chemical equation such as Cl², but when I use the tesseract to recognize it, it gives me Cl2 (all in one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First of all, to make sure you are NOT overlooking the functionality even though it may be there on an OCR system. Make sure to look at your result test not in plain TXT format, but in some kind of rich text capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if OCR were to give you correct characters, your viewer could have converted it to display it. If you are accessing text result programatically, that is less of an issue because you are supposed to get a proper subscript character value when accessing it directly. Just note that viewers must support it for you to actually see it. If you eliminated this possible post-processing conversion and made sure that no subscript is returned from OCR, then it probably does not support it.
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR will see subscript as any other normal character, if they can see it at all. OCR of your use needs to have technical capability to actually produce superscripts/subscripts, and many of them do, but they tend to be commercial OCR systems not surprisingly.
I made a small testcase before answering this letter. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) .
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
And processed this image through OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: When you are interested to extract superscript/subscript characters, pay separate attention to your image quality, more than you would with a typical text. Those characters are tiny and you need sufficient details and resolution to achieve descent OCR quality. Even scanned at 300 dpi images sometimes have issues with tiny characters due to too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.

Converting Source ASCII Files to JPEGs

I publish technical books, in print, PDF, and Kindle/MOBI, with EPUB on the way.
The Kindle does not support monospace fonts, which are kinda useful for source code listings. The only way to do monospace fonts is to convert the text (Java source, HTML, XML, etc.) into JPEG images. More specifically, due to pagination issues, a given input ASCII file needs to be split into slices of ~6 lines each, with each slice turned into a JPEG, so listings can span a screen. This is a royal pain.
My current mechanism to do that involves:
Running expand to set a consistent 2-space tab size, which pipes to...
a2ps, which pipes to...
A small Perl snippet to add a "%%LanguageLevel: 3\n" line, which pipes to...
ImageMagick's convert, to take the (E)PS and make a JPEG out it, with an appropriate background, cropped to 575x148+5+28, etc.
That used to work 100% of the time. It now works 95% of the time. The rest of the time, I get convert: geometry does not contain image errors, which I cannot seem to get rid of, in part because I don't understand what the problem is.
Before this process, I used to use a pretty-print engine (source-highlight) to get HTML out of the source code...but then the only thing I could find to convert the HTML into JPEGs was to automate screen-grabs from an embedded Gecko engine. Reliability stank, which is why I switched to my current mechanism.
So, if you were you, and you needed to turn source listings into JPEG images, in an automated fashion, how would you do it? Bonus points if it offers some sort of pretty-print process (e.g., bolded keywords)!
Or, if you know what typically causes convert: geometry does not contain image, that might help. My current process is ugly, but if I could get it back to 100% reliability, that'd be just fine for now.
Thanks in advance!
You might consider html2ps and then imagemagick's convert.
A thought: if your target (Kindle?) supports PNG, use that in preference to JPEG for this text rendering.
html2ps is an excellent program -- I used it to produce a 1300-page book once, but it's overkill if you just want plain text -> postscript. Consider enscript instead.
Because the question of converting HTML to JPG has been answered, I will offer a suggestion on the pretty printer. I've found Pygments to be pretty awesome. It supports different themes and has lexers for pretty much any language out there (they advertise the fact that it even highlights brainfuck). There's a command line tool and it's available on most Linux distros.
Your Linux distribution may include pango-view and an assortment of fonts.
This works on my FC6 system:
pango-view --font=DejaVuLGCSansMono --dpi=200 --output=/tmp/text.jpg -q /tmp/text
You'll need to identify a monospaced font that is installed on your system. Look around /usr/share/fonts/.
Pango supports Unicode.
Leave off the -q while you're experimenting, it'll display to a window instead of to a file.
Don't use jpeg. It's optimized for photographs and does a terrible job with text and line art. Use gif or png instead. My understanding is that gif is now patent-free, so I would just use that.