Tesseract confuses "-" and "7" in a single-line image - ocr

This image is recognized as
08787365076858, instead of
0878-3650-6858
I have a list of 50 similar image files, and in each all "-" chars are matched as "7".
Default settings were used, even after installing tesseract on a clean system.
I also tried -psm 7/8 (single line/word) and setting a whitelist of characters.
What can be the reason for this issue and how can I overcome it?
I know about training, but I'm curious why tesseract, which is accurate in most cases, confuses such different characters.

Rescaling the image to 300 DPI should help tesseract pick up those dashes.
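A minimal sketch of that rescaling step, assuming the Pillow library is available and that you know (or can guess) the resolution the image was captured at. The helper names and the 72 DPI figure in the usage note are hypothetical, not something from the question.

```python
def size_for_target_dpi(width, height, source_dpi, target_dpi=300):
    """Return the pixel size an image needs to correspond to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    scale = target_dpi / source_dpi
    # Never return a zero-sized dimension, even for tiny inputs.
    return max(1, round(width * scale)), max(1, round(height * scale))


def rescale_image(path_in, path_out, source_dpi, target_dpi=300):
    """Resample an image file so that its pixel size matches target_dpi."""
    # Pillow is imported lazily so the size helper above has no dependency.
    from PIL import Image

    with Image.open(path_in) as img:
        new_size = size_for_target_dpi(img.width, img.height, source_dpi, target_dpi)
        resized = img.resize(new_size, Image.LANCZOS)
        # Also record the new resolution in the file metadata.
        resized.save(path_out, dpi=(target_dpi, target_dpi))
```

For a screenshot assumed to be 72 DPI, rescale_image("in.png", "out.png", source_dpi=72) enlarges it by roughly 4.2x before it is passed to tesseract.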

How to properly display Hebrew in text widget?

I'm using Manjaro Linux KDE and the most recent versions of Tcl and Tk, and am attempting to display Hebrew in a text widget. In testing, the Hebrew text was pasted into the Tcl script in the Kate text editor and appears in the correct order, right to left with compound characters.
Without using a specific font in Tcl/Tk, the text prints from left to right and separates the components of compound characters, such that the vowel points and cantillation marks appear as separate characters. After using the SBL Hebrew font, the words look better but the vowel points are not located properly and they are still written from left to right. I tried using the \u200f and \u200e marks but it made no difference; I really don't know what I'm doing there and simply tried prefixing and suffixing them to the Hebrew word. Reversing the string helps but the vowel points are not combined with the consonants.
I'm not using Tkinter but this older SO post seems to indicate that it is a Linux issue with Tcl.
If I extract Hebrew from SQLite using Tcl and write it to the command line using puts, it displays correctly. Also, if I copy the reversed text from the Tk text widget and paste it in this SO question, it is displayed in the correct order. To clarify, by reversed here, I don't mean using string reverse but simply that it appears reversed in Tk but when pasted in this SO box, it displays correctly.
Would you please tell me what I'm doing wrong and how to get it to display properly?
I tried to follow this document on internationalization in Tcl and encoding but don't follow how this affects displaying Hebrew in a text widget. I also came across a web site that has code for a Unicode editor that displays several languages including Hebrew, but I can't follow that code either. I tried running the code and, if I select the Hebrew language, it writes right to left but I don't see vowel points or cantillation marks; then again, I don't know much about typing the Hebrew language.
Thank you.
.tw tag configure heb -font {"SBL Hebrew" 18 normal}
.tw insert end "בְּרֵאשִׁ֖ית" "heb"
# Also tried "בְּרֵאשִׁ֖ית\u200f" and "\u200fבְּרֵאשִׁ֖ית".
# and "בְּרֵאשִׁ֖ית\u200e" and "\u200eבְּרֵאשִׁ֖ית".
# Tried .tw insert end [string reverse $h] "heb", which orders the
# consonants, but the vowel points and cantillation marks are not correct.
This is the correct rendering.
This is from Tk. The first is in normal order and the second using string reverse. It can be observed that the vowel points are not "on" the consonants and the cantillation marks are not correct. I know little about Hebrew but I can tell they don't match and appear to be printed as separate characters instead of combined. I think what looks like a "t" under the Hebrew letter that looks similar to a "W" is two characters on top of each other-- a dot and the symbol sort of similar to a left parenthesis in the correct rendering.
I don't know why but after rebooting and installing the next batch of updates, not that they have anything to do with Tk, the rendering is different when a font is not set. However, once the SBL Hebrew font is set, then the characters are separated as displayed above.
I can tell you now that the text renders very close to correctly with Tk on macOS (I'm not sure how much is just font differences, and there's a bit of clipping of the descender decorations that I don't like, but I don't think that's Tk itself doing the wrong thing).
That means that it's definitely a rendering bug that you're seeing. I suspect it might relate to the size of chunks of characters fed into the renderer; if the low levels of the renderer are only being given a character at a time, then they've got no chance to get the overall placement correct or to apply any character combining. I'm guessing that the real issue is that TkpDrawCharsInContext() just calls Tk_DrawChars(), if my reading of the comments is right. (By contrast, the macOS renderer does something different here.)
I don't have a workaround.
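As a side note on why string reverse misplaces the points: it reverses code points, so each combining mark ends up in front of its consonant instead of behind it. The sketch below (in Python, purely for illustration; the function names are hypothetical) groups each base character with the marks that follow it and reverses the groups instead. It only illustrates the ordering problem and does not address the renderer issue described above.

```python
import unicodedata


def clusters(text):
    """Split text into base characters, each with its trailing combining marks."""
    groups = []
    for ch in text:
        # combining() is non-zero for marks such as Hebrew points and accents.
        if groups and unicodedata.combining(ch):
            groups[-1] += ch
        else:
            groups.append(ch)
    return groups


def reverse_keeping_marks(text):
    """Reverse the order of clusters while leaving each cluster intact."""
    return "".join(reversed(clusters(text)))


# bet + sheva + dagesh, then resh, written with escapes to avoid editor issues
word = "\u05d1\u05b0\u05bc\u05e8"
naive = word[::-1]                      # marks now precede the bet
grouped = reverse_keeping_marks(word)   # marks still follow the bet
```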

tesseract: Recognize plain multi-digit number

For some strange reason, tesseract is not able to recognize the following image. I tried various config options such as:
--psm 13: "Treat the image as a single text line"
tessedit_char_whitelist=012345678iI': Only allow numbers (and i's that can be replaced later).
This is the image:
Maybe it's my preprocessing, but to me the picture looks good (I also tried increasing the borders around the number). Any advice would be highly appreciated! I couldn't find anything helpful on either Google or SO.
Thanks!
Figured it out: pytesseract.image_to_string(img, config='digits')
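A rough sketch of how that call might be wrapped, assuming pytesseract is installed and img is an image object it accepts. The helper names are hypothetical. The cleanup function covers the approach from the question (whitelisting i/I and replacing them later); with the digits config it should normally have nothing to replace.

```python
import re


def digits_from_ocr(raw_text):
    """Map i/I look-alikes to 1, then drop everything that is not 0-9."""
    replaced = raw_text.translate(str.maketrans({"i": "1", "I": "1"}))
    return re.sub(r"[^0-9]", "", replaced)


def read_number(img):
    """OCR an image with the 'digits' config and return only the digits."""
    # Imported lazily so the cleanup helper can be used without pytesseract.
    import pytesseract

    raw = pytesseract.image_to_string(img, config="digits")
    return digits_from_ocr(raw)
```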

tesseract unable to detect characters in simple two-word image

I'm having trouble getting tesseract to recognize any characters in the following image:
When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought that the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) as well as augmenting the existing dictionary with these additional words (LAO and CAUD). Neither of those approaches worked. I have tried tesseract versions 3, 4, and have built version 5 from source on a Mac computer. All have given the same result.
Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:
The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?
Thanks in advance.
It's a common problem. You will probably need to preprocess the image, with rescaling, filters, etc.
Here are some references on how to do that:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://docparser.com/blog/improve-ocr-accuracy/
The solution was to use the right page segmentation method (PSM). In my case, PSM 6, which is for a single block of text, did the trick.
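A minimal sketch of selecting the PSM from a script, assuming the tesseract binary is on the PATH and is version 4 or later (version 3 spells the flag -psm). The function names are hypothetical. The optional -c settings correspond to the load_system_dawg and load_freq_dawg flags mentioned in the question.

```python
import subprocess


def tesseract_command(image_path, psm=6, disable_dictionaries=False):
    """Build the argument list for a tesseract run that prints to stdout."""
    cmd = ["tesseract", image_path, "stdout", "--psm", str(psm)]
    if disable_dictionaries:
        # Turn off the word lists, as tried in the question.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


def run_ocr(image_path, psm=6):
    """Run tesseract on one image and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

When the right mode is not obvious, calling run_ocr in a loop over a few PSM values (for example 6, 7 and 8) is a quick way to compare the results.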

Recognizing superscript characters using OCR

I've started a simple project that takes an image containing text with superscripts and uses OCR (currently tesseract) to recognize both the superscript characters and the normal ones.
For example, we have a chemical equation such as Cl², but when I use tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches more advanced features of any OCR system.
First of all, make sure you are NOT overlooking functionality that may already be there in your OCR system. Look at your result text not in plain TXT format, but in some kind of rich-text-capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you the correct characters, your viewer could have converted them for display. If you are accessing the text result programmatically, that is less of an issue, because you should get the proper superscript/subscript character value when accessing it directly. Just note that viewers must support it for you to actually see it. If you have eliminated this possible post-processing conversion and made sure that no superscript/subscript is returned from the OCR, then it probably does not support it.
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR systems will see a superscript/subscript as any other normal character, if they can see it at all. The OCR you use needs the technical capability to actually produce superscripts/subscripts. Many do, but, not surprisingly, they tend to be commercial OCR systems.
I made a small test case before answering this question. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) ).
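To check the returned text programmatically rather than by eye, something like the following sketch could be used. It relies only on Python's standard unicodedata module; the function name is hypothetical. Note that it only finds dedicated Unicode superscript/subscript characters. Rich formats such as .DOC usually store superscript as formatting on an ordinary character, which this check would not see.

```python
import unicodedata


def script_variants(text):
    """List (char, kind, plain) for each superscript/subscript character."""
    found = []
    for ch in text:
        # e.g. the decomposition of U+00B2 is "<super> 0032"
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith("<super>"):
            kind = "super"
        elif decomp.startswith("<sub>"):
            kind = "sub"
        else:
            continue
        # Rebuild the plain character(s) from the hex code points.
        plain = "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
        found.append((ch, kind, plain))
    return found
```

An empty list for text that should contain a superscript suggests the OCR flattened it to a normal character.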
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay particular attention to your image quality, more than you would with typical text. Those characters are tiny and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters due to too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.

Rendering barcodes in HTML with Code 128 font

Is it possible to render correct bar-codes in HTML using the Code 128 font?
The main content of the bar-code is fine in the browser (Firefox), but when I try to add the start code character I just get this character in the browser:
Ñ
This is ASCII code 209. I'm wondering if it even has a bar representation.
I'm using MVC but this is really just an HTML/CSS problem I think.
Thanks
This isn't quite what you asked for, but you can make barcodes using CSS: see http://unixshell.jcomeau.com/src/barcodes/memberships.html. I'm using code39 for this, but most other linear codes can be done the same way.
Are you sure that the client is going to have barcode font installed?
Server side image generation seems to be a better solution.
You may want to try Barcode.dll for barcode rendering.
It includes ASP.NET barcode control - just drag & drop.
Please note that this is a commercial product I developed.
I know this is years too late, but looking again at the question, I'm pretty sure you're just not using the right numeric code for your font. There is no single "Code 128 font". While 209 is shown by Wikipedia to be the correct "common" code for Start B, in various fonts I found online this is not the case. In one, Start B is 236; in another, it's 204. Use the right code for your particular font, and you should get what you want.
A code point not encoded by the barcode font will be rendered by a default font, which is why you're seeing the N tilde character.
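To make the font dependence concrete, here is a rough sketch that builds a Code 128 (code set B) symbol sequence and maps it to characters. The symbol values (Start B = 104, Stop = 106, checksum modulo 103) come from the Code 128 scheme itself. The character mapping is an assumption: values 0 to 94 go to ASCII 32 to 126, and higher values are shifted by a font-specific high_offset (a hypothetical parameter). Some fonts do not use a simple offset, and many remap value 0 away from the space character, so check your font's documentation.

```python
START_B, STOP, MODULUS = 104, 106, 103


def code128b_values(data):
    """Return the symbol values: Start B, data, checksum, Stop."""
    values = []
    for ch in data:
        value = ord(ch) - 32
        if not 0 <= value <= 94:
            raise ValueError(f"character {ch!r} is outside printable ASCII")
        values.append(value)
    # Checksum: start value plus each data value weighted by its 1-based position.
    weighted = sum(pos * v for pos, v in enumerate(values, start=1))
    checksum = (START_B + weighted) % MODULUS
    return [START_B] + values + [checksum, STOP]


def to_font_text(values, high_offset=105):
    """Map symbol values to characters; high_offset is font-specific (assumed)."""
    return "".join(
        chr(v + 32) if v <= 94 else chr(v + high_offset) for v in values
    )
```

With high_offset=105, Start B becomes character 209 (the Ñ from the question); with high_offset=100 it becomes 204. Whichever offset is used has to match the font actually applied in the CSS.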