For some strange reason, tesseract is not able to recognize the following image. I tried various config options such as:
--psm 13: "Treat the image as a single text line"
tessedit_char_whitelist=012345678iI: Only allow digits (and i's, which can be replaced later).
This is the image:
Maybe it's my preprocessing, but to me the picture looks good (I also tried increasing the borders around the number). Any advice would be highly appreciated! I couldn't find anything helpful on either Google or SO.
Thanks!
Figured it out: pytesseract.image_to_string(img, config='digits')
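A minimal sketch of how that call might be wrapped, assuming pytesseract and Pillow are installed and that the local Tesseract install ships the `digits` config file. The helper names (`build_digit_config`, `read_digits`) and the default PSM are illustrative assumptions, not part of the original fix:

```python
def build_digit_config(psm=7, use_digits_file=True):
    """Build a Tesseract config string for digit-only recognition."""
    parts = [f"--psm {psm}"]
    if use_digits_file:
        # 'digits' is a config file name; it must come after the options.
        parts.append("digits")
    else:
        # Alternative: restrict the character set explicitly.
        parts.append("-c tessedit_char_whitelist=0123456789")
    return " ".join(parts)


def read_digits(image_path, psm=7):
    """Run OCR on one image with the digit-only config (needs Tesseract)."""
    # Imported lazily so build_digit_config works without the OCR stack.
    from PIL import Image
    import pytesseract

    img = Image.open(image_path)
    text = pytesseract.image_to_string(img, config=build_digit_config(psm))
    return text.strip()
```

Calling `read_digits("number.png")` (a hypothetical file name) would then return the recognized string with surrounding whitespace removed.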
Related
I'm having trouble getting tesseract to recognize any characters in the following image:
When I run tesseract from the command line on this image, I get "Empty page!!" - that is, no results - returned. Based on my reading of the Improving Quality section of the wiki, I thought that the issue might be that the words in this image are not dictionary words. With that in mind, I have tried both disabling the tesseract dictionaries altogether (using the load_system_dawg and load_freq_dawg config flags) as well as augmenting the existing dictionary with these additional words (LAO and CAUD). Neither of those approaches worked. I have tried tesseract versions 3, 4, and have built version 5 from source on a Mac computer. All have given the same result.
Curiously, if I type the exact words from that image into a word processor and take a screenshot, it works: the resulting image is readable by tesseract. It correctly parses each character. Here is that image:
The only difference between the two images is that the first one is of a slightly lower resolution/quality. Am I then to believe that tesseract is unable to recognize characters in a slightly inferior quality image like that? Is there anything I can do to improve that image quality? Is there something else I'm missing?
Thanks in advance.
It's a common problem. You will probably need to preprocess the image with rescaling, filters, etc.
Here are some references on how to do that:
https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
https://docparser.com/blog/improve-ocr-accuracy/
The solution was to use the right page segmentation method (PSM). In my case, PSM 6, which is for a single block of text, did the trick.
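Since the right PSM depends on the image, one way to find it is to run the same image through a few candidate modes and compare the results. This is only a sketch, assuming pytesseract and Pillow are available; the candidate list and the function names are assumptions chosen for illustration:

```python
# A few page segmentation modes that are commonly worth trying on small crops.
PSM_CANDIDATES = {
    6: "single uniform block of text",
    7: "single text line",
    8: "single word",
    13: "raw line",
}


def psm_configs(candidates=PSM_CANDIDATES):
    """Map each PSM number to the config string passed to Tesseract."""
    return {psm: f"--psm {psm}" for psm in candidates}


def ocr_with_each_psm(image_path):
    """Return {psm: recognized text} for every candidate mode."""
    # Imported lazily so psm_configs works without the OCR stack.
    from PIL import Image
    import pytesseract

    img = Image.open(image_path)
    return {
        psm: pytesseract.image_to_string(img, config=cfg).strip()
        for psm, cfg in psm_configs().items()
    }
```

An empty string for one mode and a sensible result for another is a strong hint that segmentation, not image quality, was the problem.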
I write in Python, using pytesseract or direct Popen calls if needed.
I try to OCR a document with irregular structure, a letter looking like this:
The problem is that in the .hocr file generated by Tesseract, I get lines consisting of the left and right columns glued together, like "Recipient: Sender:".
What I'd like to achieve is output with the left and right columns separated. Using third-party Python utilities to pre-process the image is an acceptable solution if explained in reasonable detail. The script must be autonomous and somehow detect this issue, as not all the letters have such strange formatting.
Tried/ideas:
Using --psm 1 to allow input format detection - no improvement over default, likely because structure is too complicated.
Tweaking some config file options like gapmap_use_ends and textord_words_maxspace - I couldn't find good documentation on these. There is probably a right combination of values, but there are 57 options with "space" in the name... Any insight on these would be much appreciated.
Editing the .hocr - not sure how to write appropriate grouping rules for the word boxes that do not interfere with normal text everywhere else...
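One possible grouping rule for the last idea is to look for a wide vertical gutter that no word box crosses, and split only when such a gutter exists. The sketch below is an assumption-laden illustration, not a tested solution for real letters: it expects the word boxes of a single block (for example, just the header area), as could be taken from the .hocr bounding boxes or from `pytesseract.image_to_data`, and the `Word` type, the function names, and the 40-pixel threshold are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Word(NamedTuple):
    text: str
    left: int
    top: int
    width: int
    height: int


def find_gutter(words: List[Word], min_gap: int) -> Optional[Tuple[int, int]]:
    """Return the widest x-interval that no word covers, if >= min_gap px."""
    spans = sorted((w.left, w.left + w.width) for w in words)
    if not spans:
        return None
    best = None
    covered_to = spans[0][1]  # right edge of everything seen so far
    for start, end in spans[1:]:
        gap = start - covered_to
        if gap >= min_gap and (best is None or gap > best[1] - best[0]):
            best = (covered_to, start)
        covered_to = max(covered_to, end)
    return best


def split_columns(words: List[Word], min_gap: int = 40) -> List[List[Word]]:
    """Split a block into [left, right] columns, or return it unchanged."""
    gutter = find_gutter(words, min_gap)
    if gutter is None:
        return [list(words)]  # ordinary single-column text
    middle = (gutter[0] + gutter[1]) / 2
    left = [w for w in words if w.left + w.width <= middle]
    right = [w for w in words if w.left >= middle]
    return [left, right]


def column_text(words: List[Word]) -> str:
    """Join words in reading order (top to bottom, then left to right)."""
    return " ".join(w.text for w in sorted(words, key=lambda w: (w.top, w.left)))
```

Because ordinary paragraphs have word gaps that do not line up from line to line, their x-projection is usually fully covered and the block comes back unsplit, which is what lets the check run on every letter.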
This bug was captured on a Windows machine with Chrome (on my Windows machine, instead of a box it's an "L"). I don't understand why these symbols are appearing.
This is my html code:
<p> ... are linked to specific <strong>disease</strong> or physical <strong>traits</strong>. Other sections of DNA ... </p>
Is this a browser-specific or user-specific issue? Or is this an issue with my code (for example, should I add another fallback font)?
Any ideas, suggestions, direction would be greatly appreciated.
Thank you!
From looking at it, it seems like you might be using non-breaking spaces. It is unusual for that to be a problem for a font, but you might want to just not use them. Most editors can highlight such "invisible characters" in one way or another; it is worth searching for how to do that in your editor.
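If the editor does not make them visible, a short script can locate them instead. This is a small sketch that assumes the stray characters are literal U+00A0 code points in the source (it does not look for `&nbsp;` entities); the function names are made up for the example:

```python
NBSP = "\u00a0"  # non-breaking space


def find_nbsp(text):
    """Return the character offsets of every non-breaking space in text."""
    return [i for i, ch in enumerate(text) if ch == NBSP]


def replace_nbsp(text):
    """Replace non-breaking spaces with ordinary spaces."""
    return text.replace(NBSP, " ")
```

Running `find_nbsp` over the HTML source shows whether the markup really contains them before anything is changed.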
This image is recognized as
08787365076858, instead of
0878-3650-6858
I have a list of 50 similar image files, and in each of them every "-" character is recognized as "7".
Default settings were used, even after installing tesseract on a clean system.
I also tried -psm=7/8 (single line/word) and setting whitelist characters.
What can be the reason for this issue, and how can I overcome it?
I know about training, but I'm curious why tesseract, which is accurate in most cases, confuses such different characters.
Rescaling the image to 300 DPI should help tesseract pick up those dashes.
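A rough sketch of that rescaling step, assuming Pillow is installed. The source resolution is usually not stored reliably in the file, so the 72 DPI default here is only a guess typical of screenshots, and the function names are illustrative:

```python
def target_size(width, height, current_dpi, target_dpi=300):
    """Pixel dimensions after scaling from current_dpi to target_dpi."""
    scale = target_dpi / current_dpi
    return round(width * scale), round(height * scale)


def upscale_for_ocr(image_path, current_dpi=72, target_dpi=300):
    """Return a resized copy of the image, ready to pass to the OCR call."""
    # Imported lazily so target_size works without Pillow.
    from PIL import Image

    img = Image.open(image_path)
    new_size = target_size(img.width, img.height, current_dpi, target_dpi)
    # LANCZOS resampling tends to keep thin strokes such as dashes intact.
    return img.resize(new_size, Image.LANCZOS)
```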
I'm trying to set up validation for an image's alternate text, and here's what I think should be checked for so far. It's a pretty simple RegEx, but I have yet to start learning that topic.
Double quotes
< and > characters to prevent HTML input
Is there anything else you would add to this?
Would text length ever be an issue?
I appreciate your help and if someone could provide this simple RegEx I'd be really grateful :)
That sounds like a good place to start for me. Max size: pick something sane, unless you want it to be valid to post a dissertation as an alt text - though it is probably possible. As for a regex to validate it:
/^[^"<>&\\]{0,XXX}$/
where XXX is the maximum size you want. Or get rid of the {0,XXX} altogether and replace it with * to mean "zero or more". Syntax depends on language, of course.
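As an illustration of the language-dependent part, here is how the same pattern might look in Python. It is a sketch only: the 125-character default is an arbitrary stand-in for XXX, and `make_alt_validator` is a hypothetical helper name. `fullmatch` is used in place of the `^...$` anchors because `$` in Python also matches just before a trailing newline.

```python
import re


def make_alt_validator(max_len=125):
    """Build a checker that rejects ", <, >, & and \\ and overlong text."""
    # Same character class as the regex above; max_len plays the role of XXX.
    pattern = re.compile(r'[^"<>&\\]{0,%d}' % max_len)

    def is_valid(alt_text):
        return pattern.fullmatch(alt_text) is not None

    return is_valid
```

For example, `make_alt_validator(20)("A red bicycle")` is accepted, while any text containing a double quote or an angle bracket is rejected.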
Also found this, looked interesting:
http://www.cs.tut.fi/~jkorpela/html/alt.html
Update:
Yeah, you two make a good point. As long as the quotes used around the alt-text aren't themselves single-quotes, then they should be fine.
And as per other answers below, possibly also & and \. Though you may need to be careful with how many slashes, whether they are before things that matter. And also, whether and such things are allowed in the text itself.