What does the Tesseract OCR library require of an image to be able to accurately extract text?

I am using the Tesseract library to extract text from images. The language is Vietnamese. I have two images. The first one is from a website. The second is a screenshot taken from the Wordpad program. They are shown in links below:
1
2
The first one has 95% accuracy.
Bán căn hộ tầng 5 khu tập thể Thành công Bắc, DT 28m2, gần chợ ThànhCông,
số
đỏ, chính chủ, giá 800 triệu.LH:A.Châu, 0979622551,0905685336
The second image is much larger, but the accuracy is only about 60%.
Bặn căn hộ tầng ậ khu tập thể Ỉhành gông
Băc. llĩ 28 m2. gân chợ ĩllành Bông. sũ Ilỏ.
chính l:lIlì. giá 800 lriệu. l.ll: A.BhâU,
0979622551, 0905685336
What do I need to fix in the second image to get text as accurate as from the first one?

As stated by #user898678 in "image processing to improve tesseract OCR accuracy", the following operations can improve OCR accuracy:
fix the DPI (if needed); 300 DPI is the minimum
fix the text size (e.g. 12 pt should be OK)
try to fix the text lines (deskew and dewarp the text)
try to fix the illumination of the image (e.g. no dark parts in the image)
binarize and de-noise the image
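The DPI step in the list above comes down to simple arithmetic, sketched here in Python. The function name and the 800x200 screenshot tagged at 96 DPI are assumptions made for illustration; the code only computes the pixel size to resample to and does not resample anything itself.

```python
def size_at_target_dpi(width_px, height_px, dpi_x, dpi_y, target_dpi=300):
    """Return the pixel size that keeps the physical size but at target_dpi."""
    if dpi_x <= 0 or dpi_y <= 0:
        raise ValueError("source DPI must be positive")
    # physical size in inches = pixels / dpi; new pixels = inches * target_dpi
    new_w = round(width_px * target_dpi / dpi_x)
    new_h = round(height_px * target_dpi / dpi_y)
    return new_w, new_h


# Hypothetical screenshot: 800x200 pixels, tagged at 96 DPI.
print(size_at_target_dpi(800, 200, 96, 96))  # (2500, 625)
```

The resulting size can then be passed to whatever resampling routine your imaging library provides.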

Related

Can I control the output image quality/size with Puppeteer export to PDF?

When using Puppeteer to print a page as PDF, Puppeteer may convert images in that page to a different format.
For example, printing a JPEG image will result in a PDF with (roughly) the same size as the image. That means Puppeteer is using the same exact JPEG image in the generated PDF. Same happens with other formats like PNG and SVG (the output size matches the size of the original images).
However, printing a WebP image will result in a PDF with a much bigger size (10x more than expected). This seems to be because Puppeteer is converting the WebP image into a JPEG/PNG image before generating the PDF.
I am guessing this is because WebP is not supported (maybe not even by the PDF standard and that may be the reason Puppeteer converts the WebP image in the first place).
Is there a way to control this image conversion? In particular, is it possible to set the target format (ideally JPEG) and quality (ideally < 100) to try to maintain the output size of the PDF in the same range as the input WebP image size?

It may help to look, at two levels, at what happens to images when they are saved as PDF. Bear in mind this is a basic demo, not a real-world case, meant only to explain the considerations.
At upper left we have a 5x5-pixel image: screen rendering applies blurring so that the image does not look "sharp", whereas at upper right a PDF viewer tries to maintain vector sharpness.
So what about different formats? GIF, TIF and PNG (middle row) are lossless and behave in a roughly similar fashion. All should maintain colour pixel fidelity in a PDF.
However, in the lower row, JPEG is lossy and poor at maintaining colour fidelity because it spreads the colours between adjoining pixels, which is "perfect" for fuzzy text or photographs but not much good for PDF colours.
OK, moving on: your focus is the input to the PDF, so what do those images look like when stored?
Each may be written in many ways, but let's focus on the most versatile, PNG.
%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj
2 0 obj<</Type/Pages/Count 1/Kids[3 0 R]>>endobj
3 0 obj<</Type/Page/MediaBox[0 0 3.75 3.75]/Rotate 0/Resources<</XObject<</Img3 6 0 R>>>>/Contents 5 0 R/Parent 2 0 R>>endobj
5 0 obj<</Length 34>>
stream
q
3.75 0 0 3.75 0 0 cm
/Img3 Do
Q
endstream
endobj
6 0 obj<</Length 75/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB>>
stream
ÿ ÿÿÿ   ÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿ###ÿÿÿ ÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿÿ ÿÿÿÿÿÿ ÿÿÿ
endstream
endobj
7 0 obj
<</Length 5/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 1/ColorSpace/DeviceGray>>
stream
ÿÿÿÿÿ
endstream
endobj
xref
0 7
0000000000 65535 f
0000000016 00000 n
0000000062 00000 n
0000000114 00000 n
0000000334 00000 n
0000000472 00000 n
0000000555 00000 n
trailer
<</Size 7/Info<</Producer(Me)>>/Root 1 0 R>>
startxref
684
%%EOF
Again, for illustration, this is a non-typical stream because it shows the bitmap uncompressed, but note that the main image is defined by
6 0 obj<</Length 75/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB>>
Of interest: it is 5 pixels wide by 5 pixels high, with NO hint of how many inches that is. It is just 8 bits R, 8 bits G and 8 bits B (only 3 colour channels); the alpha is in a separate image (the SMask, object 7). So 3x5x5 = 75 bytes is the uncompressed storage. We can compress it in many ways, such as "Flate" (similar to what is used in a zip file),
which will convert the stream from lots of ÿs into a more compact form.
Again, there are many encodings. If we wish to keep the PDF as text, for editing in a text editor, we can first write the Flate-compressed stream as hex:
/Length 72
/Filter [ /ASCIIHexDecode /FlateDecode ]
>>
stream
789cfbcfc0f0ffff7f8605601244a002b0888383038c892a091582e86500009c
663a28>
endstream
Well, that was not much compression: down from 75 to 72!
Let's use something better by not using plain text.
6 0 obj
<</Length 36/Type/XObject/Subtype/Image/Width 5/Height 5/BitsPerComponent 8/SMask 7 0 R/ColorSpace/DeviceRGB/Filter/FlateDecode>>
stream
xœû¯ ðÿÿ…`D °ˆƒƒŒ‰*©€P¤   ÞF;è
endstream
OK, much better: we halved the storage from 72 down to 36. Good, it's small, compact and well formed.
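The size arithmetic above can be reproduced with Python's standard zlib module. This is only a sketch: the bitmap below is a made-up 5x5 image (all white with one black pixel), not the one in the PDF listing, so its compressed length will not necessarily match the 36 bytes quoted. It does show why the hex-encoded form is about twice the size of the binary Flate stream.

```python
import zlib

WIDTH, HEIGHT, CHANNELS = 5, 5, 3  # 8 bits per channel, RGB

# Hypothetical 5x5 RGB bitmap: all white except one black pixel.
pixels = bytearray(b"\xff" * (WIDTH * HEIGHT * CHANNELS))
pixels[0:3] = b"\x00\x00\x00"
raw = bytes(pixels)

# Binary stream, as stored with /Filter/FlateDecode.
flate = zlib.compress(raw, 9)

# Text stream, as stored with /Filter [ /ASCIIHexDecode /FlateDecode ]:
# two hex digits per byte plus the ">" end-of-data marker.
hex_text = flate.hex().encode("ascii") + b">"

print(len(raw))  # 75
print(len(flate), len(hex_text))

# Decoding applies the filters in order: hex first, then Flate.
decoded = zlib.decompress(bytes.fromhex(hex_text[:-1].decode("ascii")))
```

On an image this small the Flate header and checksum take up a noticeable share of the stream, which is part of why the saving looks modest.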
So, what about keeping the JPEG structure? Ahhhh! Maintaining its lossy nature needs 730 bytes:
<</Filter/DCTDecode/Type/XObject/Subtype/Image/BitsPerComponent 8/Width 5/Height 5/ColorSpace/DeviceRGB/Length 730>>
stream
ÿØÿà JFIF ` ` ÿÛ C
ÿÛ C
ÿÀ " ÿÄ
ÿÄ µ } !1AQa "q2‘¡#B±ÁRÑð$3br‚
%&'()*456789:CDEFGHIJSTUVWXYZcdefghijstuvwxyzƒ„…†‡ˆ‰Š’“”•–—˜™š¢£¤¥¦§¨©ª²³´µ¶·¸¹ºÂÃÄÅÆÇÈÉÊÒÓÔÕÖ×ØÙÚáâãäåæçèéêñòóôõö÷øùúÿÄ
ÿÄ µ w !1AQ aq"2B‘¡±Á #3RðbrÑ
$4á%ñ&'()*56789:CDEFGHIJSTUVWXYZcdefghijstuvwxyz‚ƒ„…†‡ˆ‰Š’“”•–—˜™š¢£¤¥¦§¨©ª²³´µ¶·¸¹ºÂÃÄÅÆÇÈÉÊÒÓÔÕÖ×ØÙÚâãäåæçèéêòóôõö÷øùúÿÚ ? ëÿ j¯Ú[Á_ >|×õ¯„šgŽ¡ñ–™ý¥mg¨.˜ƒJ
§é€Æ„é®¬
´`©þ¬ ãŒ¢Šõiæ¸ì,ëáèUq„*ÖŒRÑ%³I}Ëüõ<ìFŽ*£¯QËšVnÓšWi_E$—ÉÿÙ
endstream
endobj
So this test piece is not real-world, but it may help in deciding on the best storage format for different inputs.
My preference is to use PNG where possible for charts and document text, and to use JPEG only when essential, for photos or fuzzy OCR.
For your offered sample, JPEG is necessary, but even with quality set to high, reducing the size from the maximum can cause collateral damage.
However, it's not very noticeable unless you zoom in closer than the point where the image looks blobby anyway; here it is at 4x zoom:
Source 58-59 KB
Slightly reduced 50-51 KB

Color limits from the HERE Map Image REST API

We've signed up for the Pro plan, and now we need to create a report using the Map Image REST API to generate heatmaps with multiple colors (more than 4).
I saw in the documentation that there is a limit of 4 levels and colors. Is it possible to use more colors to meet our requirements?
Do you have plans to increase the limits, or is there a beta version that doesn't have them?
For instance, we need to create 6 areas, each with a different color, and 6 levels on the same map, as shown in the following image. I should be able to use 6 different colors, but only 4 show up.
Map image example with 6 areas
Here is the request:
GET https://image.maps.ls.hereapi.com/mia/1.6/heat
?apiKey={{API_KEY}}
# Area 1 - Yellow
&a0=49.27,-123.48
&rad0=1900
&l0=0
# Area 2 - Red
&a1=49.25,-123.38
&rad1=1500
&l1=1
# Area 3 - Blue
&a2=49.18,-123.342144
&rad2=1500
&l2=2
# Area 4 - Green
&a3=49.28,-123.35
&rad3=1000
&l3=3
# Area 5 - Orange
&a4=49.21,-123.55
&rad4=1800
&l4=4
# Area 6 - White
&a5=49.30,-123.60
&rad5=1000
&l5=5
#
&z=11
&w=900
&h=900
&plt=FCFF00,EB2501,001EFF,1FE80C,FF8C0D,FFFFFF
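The commented listing above is not a literal URL, so here is a minimal Python sketch of how the same query string could be assembled. The helper name `build_heat_url` and the `"KEY"` placeholder are made up for illustration; the parameter names and values are copied from the request. It only builds the URL and does nothing to lift the 4-color limit.

```python
from urllib.parse import urlencode

BASE_URL = "https://image.maps.ls.hereapi.com/mia/1.6/heat"

# (center "lat,lon", radius, level) for each area, copied from the request above.
AREAS = [
    ("49.27,-123.48", 1900, 0),      # Area 1 - Yellow
    ("49.25,-123.38", 1500, 1),      # Area 2 - Red
    ("49.18,-123.342144", 1500, 2),  # Area 3 - Blue
    ("49.28,-123.35", 1000, 3),      # Area 4 - Green
    ("49.21,-123.55", 1800, 4),      # Area 5 - Orange
    ("49.30,-123.60", 1000, 5),      # Area 6 - White
]
PALETTE = ["FCFF00", "EB2501", "001EFF", "1FE80C", "FF8C0D", "FFFFFF"]


def build_heat_url(api_key, areas, palette, zoom=11, width=900, height=900):
    """Assemble the heat-map request URL (hypothetical helper)."""
    params = [("apiKey", api_key)]
    for i, (center, radius, level) in enumerate(areas):
        params += [(f"a{i}", center), (f"rad{i}", radius), (f"l{i}", level)]
    params += [("z", zoom), ("w", width), ("h", height),
               ("plt", ",".join(palette))]
    # Keep commas unescaped so the URL matches the listing above.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


url = build_heat_url("KEY", AREAS, PALETTE)
print(url)
```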
Thanks!

I can't speak to our plans for this API. I can, however, raise a ticket internally asking that this be considered. My guess is that the limit exists for performance as well as "length of URL" concerns, but at a minimum I can ask.

IRAF imalign will not shift images, incorrectly reports unequal number of input and output images, why?

I recently started working in IRAF as I have the need for image data reduction.
I tried to stack .fit images using the imalign function, but I get this error message:
This was a test, so I have only 4 images in input and output lists, and I have 4 shifts in shiftlist.txt. These are my files - input list:
NGC7286-0001_B.fit
NGC7286-0003_B.fit
NGC7286-0004_B.fit
NGC7286-0005_B.fit
Output list:
sh-NGC7286-0001_B.fit
sh-NGC7286-0003_B.fit
sh-NGC7286-0004_B.fit
sh-NGC7286-0005_B.fit
Shiftlist:
0.0 0.0
3.751 4.55
3.997 9.273
3.107 15.243
List of coordinates of reference stars:
618.58 666.96
1136.19 711.39
1288.88 942.79
1417.72 927.84
1004.71 1517.73
1053.39 1756.91
532.16 1794.60
Why do I get this error message? Do you see anything wrong with my files?
If I use shiftlist I calculated, do I need to change bigbox (20) and/or boxsize (7)? Thank you in advance.

Found the solution, though I don't know why my IRAF has that problem.
I can't have the "_" character in the names of my images.
Strange, but now aligning works, I guess.
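For anyone hitting the same thing, a small Python sketch of producing underscore-free names for the input and output lists follows. The helper name and the choice of "-" as the replacement are assumptions; the original names already contain hyphens, but whether another character suits your setup is untested here.

```python
def underscore_free(name, replacement="-"):
    """Return name with every underscore replaced (hypothetical helper)."""
    return name.replace("_", replacement)


inputs = [
    "NGC7286-0001_B.fit",
    "NGC7286-0003_B.fit",
    "NGC7286-0004_B.fit",
    "NGC7286-0005_B.fit",
]

new_inputs = [underscore_free(n) for n in inputs]
new_outputs = ["sh-" + n for n in new_inputs]

# Preview the mapping; apply it with os.rename(old, new) once it looks right.
for old, new in zip(inputs, new_inputs):
    print(old, "->", new)
```

The new names then go into the input and output list files in place of the old ones, keeping both lists the same length.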

Google Vision OCR: DOCUMENT_TEXT_DETECTION produces strange results when TEXT_DETECTION is fine

I'm playing around with the quick start guide: https://cloud.google.com/vision/docs/quickstart and I noticed there were wildly different results when using the same image for DOCUMENT_TEXT_DETECTION vs TEXT_DETECTION.
For reference, this is the image I'm using (plug this in for imageUri):
https://storage.googleapis.com/random-resources/receipt.jpg
When using TEXT_DETECTION, the description seems to give a good summary of the image but when I use DOCUMENT_TEXT_DETECTION, the result is a bunch of text that is found nowhere on the image:
"OLZ-E\niNO N WHL\nL8' G7 WY NINId\n9E'V\nDG'S\n78' 8\nSD 177\nXel [ 101\n3L VON VW IS\nXVI\n11/ans\nas\na new set \"\nHe same time in more\nS8' 9p\nS8' 9p\n98' 9p\nGD' Ot\nGD'Or\nIIIA AHNIH\nIIIA ANNIH !\nIIIA AHNIH L\nLNI ALII I\nLNI ALII !\nement on to\nSee more money were more women\none more time we came as we are on memo sense we need some more money when we see some moment as a team\nwdE:6\nI Ssang\n200E YOay)\nLL |602 SOL Jed H3NNIS\nNNS\nEEZ a qel\nsame was one or more was a\nmoment\nto earn and a time when\nwe are seen\n909t-G88-9\nOIL 6 PD 'OISION VX: NVS\nINNIZAV SSN NVA 906\nBIY SWINd\nO\nSNOH\n"
Any ideas?

tesseract didn't get the little labels

I've installed tesseract on my linux environment.
It works when I execute something like
# tesseract myPic.jpg /output
But my picture has some small labels, and tesseract didn't see them.
Is there an option to set a pitch or something like that?
Example of text labels:
With this pic, tesseract doesn't recognize any value...
But with this pic:
I have the following output:
J8
J7A-J7B P7 \
2
40 50 0 180 190
200
P1 P2 7
110 110
\ l
For example, in this case, the 90 (at top left) is not seen by tesseract...
I think it's just an option to define or something like that, no?
Thanks

In order to get accurate results from Tesseract (as well as any OCR engine) you will need to follow some guidelines as can be seen in my answer on this post:
Junk results when using Tesseract OCR and tess-two
Here is the gist of it:
Use a high-resolution image; 300 DPI is the minimum
Make sure there are no shadows or bends in the image
If there is any skew, you will need to fix the image in code prior to OCR
Use a dictionary to help get good results
Adjust the text size (12 pt font is ideal)
Binarize the image and use image processing algorithms to remove noise
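As an illustration of the binarization step in the list above, here is a minimal pure-Python sketch of Otsu's method, one common way to pick a global threshold. It is not the algorithm behind any particular SDK's auto-binarize command, and the sample gray values are made up; in practice the pixels would come from your imaging library.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold (0-255) for a sequence of 8-bit gray values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # pixels at or below the candidate threshold
    sum_bg = 0  # sum of their gray values
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor).
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map gray values above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


# Made-up sample: three dark "ink" pixels and three light "paper" pixels.
sample = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(sample)
print(t, binarize(sample, t))  # 12 [0, 0, 0, 255, 255, 255]
```

A single global threshold works best on evenly lit images, which is one reason the shadow and illumination items appear in the same list.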
It is also recommended to spend some time training the OCR engine to receive better results as seen in this link: Training Tesseract
I took the 2 images that you shared and ran some image processing on them using the LEADTOOLS SDK (disclaimer: I am an employee of this company). With the processed images I was able to get better results than you were getting, but since the original images aren't the greatest, it still was not 100%. Here is the code I used to try and fix the images:
// initialize the codecs class
using (RasterCodecs codecs = new RasterCodecs())
{
    // load the file
    using (RasterImage img = codecs.Load(filename))
    {
        // run the image processing sequence, starting by resizing the image to 300 DPI
        double newWidth = (img.Width / (double)img.XResolution) * 300;
        double newHeight = (img.Height / (double)img.YResolution) * 300;
        SizeCommand sizeCommand = new SizeCommand((int)newWidth, (int)newHeight, RasterSizeFlags.Resample);
        sizeCommand.Run(img);

        // binarize the image
        AutoBinarizeCommand autoBinarize = new AutoBinarizeCommand();
        autoBinarize.Run(img);

        // change it to 1 bit per pixel
        ColorResolutionCommand colorResolution = new ColorResolutionCommand();
        colorResolution.BitsPerPixel = 1;
        colorResolution.Run(img);

        // save the image as PNG
        codecs.Save(img, outputFile, RasterImageFormat.Png, 0);
    }
}
Here are the output images from this process:
