pytesseract - Invalid resolution 0 dpi - ocr

I am using pytesseract with Tesseract v5.0. I rotate the image with OpenCV and then pass it to pytesseract.image_to_osd(). Some images work with image_to_osd, but for others the program gives me the following error:
TesseractError: (1, 'Tesseract Open Source OCR Engine v5.0.0-alpha.20201127 with Leptonica Warning: Invalid resolution 0 dpi. Using 70 instead. Estimating resolution as 179 Warning. Invalid resolution 0 dpi. Using 70 instead. Too few characters. Skipping this page Error during processing.')
I am using python 3.9.5.
Please share the solution / sample code to fix this issue.

I had been facing this error for quite a long time before I finally realized the reason.
Tesseract OSD seems to detect the image rotation correctly only if it is 0, 90, 180, or 270 degrees.
If you're using OpenCV or Pillow to read your image, you're likely to get the error above.
If you look at the Tesseract parameters, you will notice one called min_characters_to_try, which is the minimum number of characters needed to run OSD. It's set to 50 by default, which might be too many for your image, so we have to reduce it.
What you can do is crop the background so that your object sits at one of the angles stated above. Then pass your image file directly to Tesseract and reduce min_characters_to_try, like this:
osd = pytesseract.image_to_osd(r'D:\image.jpg',config='--psm 0 -c min_characters_to_try=5')
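A slightly fuller sketch of the call above, assuming pytesseract is installed and the Tesseract binary is on the PATH. The helper names (build_osd_config, parse_rotation, detect_rotation) are hypothetical and not part of pytesseract. It catches TesseractError so that a page with too little text returns None instead of stopping the run:

```python
import re


def build_osd_config(min_chars=5):
    """Build the config string for OSD with a lowered character threshold."""
    return f"--psm 0 -c min_characters_to_try={min_chars}"


def parse_rotation(osd_text):
    """Return the 'Rotate' value in degrees from OSD output, or None if absent."""
    match = re.search(r"Rotate:\s*(\d+)", osd_text)
    return int(match.group(1)) if match else None


def detect_rotation(image_path, min_chars=5):
    """Run OSD on an image file; return None if Tesseract cannot process it."""
    import pytesseract  # imported here so the helpers above work without it

    try:
        osd = pytesseract.image_to_osd(image_path, config=build_osd_config(min_chars))
    except pytesseract.TesseractError:
        # Raised, for example, on "Too few characters. Skipping this page"
        return None
    return parse_rotation(osd)
```

Calling detect_rotation(r'D:\image.jpg') would then give the suggested rotation in degrees, or None when OSD fails even with the lower threshold.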

Related

How to make heartbeats for AutoCAD from a plugin (C#)

I'm developing a plugin for AutoCAD on Forge. Via a custom command (provided by the plugin), it publishes a big PNG (25000x20000, for example), and sometimes this causes a timeout and the workitem fails.
<report.txt>
...
[03/02/2022 22:10:41] Save changes to page setup [Yes/No]? <N> N
[03/02/2022 22:10:41] Proceed with plot [Yes/No] <Y>: Y
[03/02/2022 22:10:41] Effective plotting area: 21212.94 wide by 20000.00 high
[03/02/2022 22:10:41] Plotting viewport 2.
[03/02/2022 22:11:42] Error: AutoCAD Core Console is shut down due to timeout.
[03/02/2022 22:11:42] End script phase.
[03/02/2022 22:11:42] Error: An unexpected error happened during phase CoreEngineExecution of job.
...
I'm guessing the timeout is by design (Work Item Heartbeat), but I haven't succeeded in finding a way to make the long plot survive.
Is there anyone who can help me?
I tried the suggested solution, HeartBeat.cs, but it doesn't seem to work in my case (because the engine I'm using is AutoCAD?), even though the documentation says it will be OK if my plugin prints something (to stdout (= report.txt?) or to trace (the suggested one)) before one minute of silence.
Additionally, the actual PLOT command is issued from a scr file that is loaded and executed inside my plugin.
PS1.
The only way I have found to print something to report.txt is
Document doc = Application.DocumentManager.MdiActiveDocument;
doc.Editor.WriteMessage("somthing");
and it seems the doc doesn't work with threading.
PS2.
The limitProcessingTimeSec for the workitem has been changed to 300, but it doesn't seem to be related.

OCR detecting E as £

I am using pytesseract (with version 5 of Tesseract) to scan an image. I have converted the image to black and white to remove the noise, but E is still being detected as £, giving £196893.
I have also tried setting the language, dpi and psm values, as most people suggest. Below are the settings I am using now. Please suggest a fix.
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng")
One of the sample pictures is shown below. For some samples it works fine, but for others it gives such strange characters.
A solution to this issue is to limit the characters that Tesseract looks for. To do so:
Create a file with an arbitrary name (e.g. "whitelist") in the Tesseract config directory. On Linux that directory is usually /usr/share/tesseract/tessdata/configs.
Add a line to that file containing only the characters that you want to search for in the text:
tessedit_char_whitelist *list_of_characters*
Then call Tesseract using the whitelist config:
tesseract input.tif output nobatch whitelist
In this case the parameters must be set in your Python script as:
pytesseract.image_to_string(Image.open(impath), config=" --dpi 120 --psm 6 -l eng nobatch whitelist")
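As an alternative to a separate config file, the same parameter can be passed inline with -c. This is a minimal sketch, assuming a Tesseract build whose active engine honours tessedit_char_whitelist (the LSTM engine in 4.0 reportedly ignores it). The helper names and the default character set are illustrative and not from the original answer:

```python
import string


def build_whitelist_config(allowed, dpi=120, psm=6):
    """Build a config string that restricts recognition to the characters in `allowed`."""
    if any(ch.isspace() for ch in allowed):
        # pytesseract splits the config string shell-style, so a space
        # inside the whitelist would break the -c argument apart
        raise ValueError("whitelist must not contain whitespace")
    return f"--dpi {dpi} --psm {psm} -c tessedit_char_whitelist={allowed}"


def read_text(image_path, allowed=string.ascii_letters + string.digits):
    """OCR an image while only allowing the given characters (assumed set)."""
    import pytesseract  # imported lazily; requires the Tesseract binary
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), lang="eng", config=build_whitelist_config(allowed)
    )
```

With letters and digits only, a symbol such as £ is no longer a candidate, so the engine has to choose among the allowed characters.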

caffe could not open or find file

I'm new to caffe and after successfully running an example I'm trying to use my own data. However, when trying to either write my data into the lmdb data format or directly trying to use the solver, in both cases I get the error:
E0201 14:26:00.450629 13235 io.cpp:80] Could not open or find file ~/Documents/ChessgameCNN/input/train/731_1.bmp 731
The path is right, but it's weird that the label 731 is part of this error message. That implies that it's reading it as part of the path instead of as a label. The text file looks like this:
~/Documents/ChessgameCNN/input/train/731_1.bmp 731
Is it because the labels are too high? Or maybe because the labels don't start with 0? I've searched for this error and all I found were examples with relatively few labels (about 1-5), but I have about 4096 classes, and I don't always have examples of each in the training data. Maybe this is a problem, too (certainly for learning, at least, but I didn't expect it to give me an actual error message). Usually, the label does not seem to be part of this error message.
For the creation of the lmdb file, I use the create_imagenet.sh from the caffe examples. For solving, I use:
~/caffe/build/tools/caffe train --solver ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/solver_1.prototxt 2>&1 | tee ~/Documents/ChessgameCNN/caffe_models/caffe_model_1/model_1_train.log
I tried different image data types, too: PNG, JPEG and BMP. So this isn't the culprit, either.
If it is really because of my choice of labels, what would be a viable workaround for this problem?
Thanks a lot for your help!
I had the same issue. Check that the lines in your text file don't have spaces at the end.
I was facing a similar problem with convert_imageset. I solved it just by removing the trailing spaces in the text file that contains the labels.
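The cleanup both answers describe can be scripted. A minimal sketch, assuming the listing file has one "path label" pair per line; the function names and the file names in the usage note are hypothetical:

```python
def clean_listing_lines(lines):
    """Strip trailing whitespace from each 'path label' line and drop blank lines."""
    cleaned = []
    for line in lines:
        stripped = line.rstrip()  # removes trailing spaces, tabs, \r and \n
        if stripped:
            cleaned.append(stripped)
    return cleaned


def clean_listing_file(src_path, dst_path):
    """Write a cleaned copy of the listing file to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        cleaned = clean_listing_lines(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(cleaned) + "\n")
```

For example, clean_listing_file("train.txt", "train_clean.txt") writes a copy without trailing spaces, which can then be passed to the conversion script in place of the original.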

Tesseract OCR - recognize checkboxes as word

For a customer, I want to teach Tesseract to recognize checkboxes as a word. It worked fine when Tesseract only had to recognize an empty checkbox.
This command, in combination with this tutorial, worked like a charm, and Tesseract was able to find empty checkboxes and interpret them as "[_]":
tesseract -psm 10 deu2.unchecked1.exp0.JPG deu2.unchecked1.exp0.box nobatch box.train
Here is my command to successfully analyze a document:
tesseract test.png test -l deu1+deu2
Then I tried to train a checked checkbox, but got this error:
Tesseract Open Source OCR Engine v3.04.00 with Leptonica
FAIL!
APPLY_BOXES: boxfile line 1/[X] ((60,30),(314,293)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 1
Boxes failed resegmentation: 1
Found 0 good blobs.
Generated training data for 0 words
Does anyone have an idea how to teach Tesseract to recognize checked checkboxes as well?
Thank you in advance!
After many more tries I figured out that it is of course possible to teach Tesseract different kinds of letters. But as far as I know today, there is no way to teach Tesseract a sign that does not conform to some "visual rules" of a letter. For example, a letter is always one connected line of ink, or at most a combination of ink and "something outside it" (for example: i, ä, ö, ü). The problem here is that no letter resembles a checked checkbox (one object inside another object). This confuses Tesseract and leads to crashes.

Tesseract OCR on URL image gives me 100% error file

When I run Tesseract on a PNG image containing only URLs, it gives me 100% erroneous output,
like:
Jcâa\râcL7mpnmeVr
Jevuusdwvmceranr
pmmyhemnﬁr
nnnnnysaaan
ﬁmﬁasmunﬁr
Is there a way to get a better result? The image is clean and readable.
I tried GOCR, and the result is about 70% correct (which is still not good enough for me).
Is there any chance of getting better results with a Linux command-line OCR tool?