Is it possible to extract text from any PDF or Image with 100% accuracy? - ocr

I tried pytesseract, but its results are poor on some PDFs and images. Then I tried easyocr, but it does not work well with low memory. Please suggest a way to extract text from a PDF or image with 100% accuracy.
Limitations:
My project is deployed on an AWS server with limited memory (about 4 GB).
Some tools take a long time to extract text from a PDF; I need the extraction to be fast.
I want to extract text with 100% accuracy.
Please provide me a solution to extract text from a PDF or image with 100% accuracy, or a list of OCR tools that would solve my problem. I can't use AWS Textract.
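No OCR engine guarantees 100% accuracy on arbitrary inputs; the practical pattern is to combine an engine's per-word confidence scores with a manual-review queue for the uncertain words. pytesseract's `image_to_data(..., output_type=Output.DICT)` returns parallel `text`/`conf` lists; a minimal sketch of the review-queue idea (the threshold of 80 is an arbitrary assumption, and the sample dict below is hand-made to mirror that format, not real OCR output):

```python
def split_by_confidence(data, threshold=80):
    """Split OCR words into accepted text and words needing human review.

    `data` mirrors the dict shape of pytesseract.image_to_data():
    parallel lists under 'text' and 'conf' (conf is -1 for non-word boxes).
    The threshold of 80 is an illustrative choice, not a tuned value."""
    accepted, review = [], []
    for word, conf in zip(data["text"], data["conf"]):
        if not word.strip() or int(conf) < 0:
            continue  # skip empty boxes and layout-only entries
        (accepted if int(conf) >= threshold else review).append(word)
    return accepted, review

# Hand-made sample in image_to_data's shape (not real OCR output)
sample = {"text": ["Invoice", "", "No.", "1O234"], "conf": [96, -1, 91, 42]}
good, needs_review = split_by_confidence(sample)
```

Words that land in the review list (here the suspicious "1O234") get checked by a person; that human-in-the-loop step, not the engine alone, is what gets a pipeline to near-100% accuracy.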

Related

ocr - optical character recognition image size/quality decrease

I have a question about OCR (parsing text from an image). I am building an Android application that uses the Google Cloud API to parse text from images. The problem is that sending/uploading the image takes too much time, so I thought I could resize the image or decrease its quality. But in that case the OCR results usually suffer. Can anyone tell me the best way to do this? Maybe someone knows how WhatsApp compresses images, or the best image file format (JPG, PNG), or the best quality/size reduction ratio.
Thanks in advance.
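One common approach before uploading: convert to grayscale, cap the longest side, and re-encode as JPEG with a moderate quality setting, then check how much the OCR results degrade at each setting. A rough sketch using Pillow (the 1600 px / quality 85 numbers are starting guesses to tune against your own images, not recommended values):

```python
from io import BytesIO

from PIL import Image

def compress_for_upload(img, max_side=1600, quality=85):
    """Shrink an image for faster upload while trying to keep OCR usable.

    max_side and quality are illustrative starting points; test them
    against your own images and OCR results before settling on values."""
    img = img.convert("L")               # grayscale: OCR rarely needs color
    img.thumbnail((max_side, max_side))  # downscale in place, keeps aspect ratio
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality, optimize=True)
    return buf.getvalue()
```

Measure rather than guess: re-run OCR on the compressed bytes and compare the word count or edit distance against the uncompressed baseline before committing to a setting.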

What is Blob in Tesseract OCR

I am learning Tesseract OCR and reading this article, which is based on this article. From the first article:
First step is Adaptive Thresholding, which converts the image into
binary images. Next step is connected component analysis which is
used to extract character outlines. This method is very useful
because it does the OCR of image with white text and black background.
Tesseract was probably first to provide this kind of
processing. Then after, the outlines are converted into Blobs.
Blobs are organized into text lines, and the lines and
regions are analyzed for some fixed area or equivalent text
size.
Could anyone explain what is Blob?
From https://tesseract-ocr.repairfaq.org/tess_glossary.html :
Blob
Isolated, small region of the scanned image. It's delineated by the outline. Tesseract 'juggles' the blobs to see if they can be split further into something that improved the confidence of recognition. Sometimes, blobs are 'combined' if that gives a better result. See pithsync.cpp, for example.
Generally a blob (also called a Connected Component) is a connected piece (i.e. not broken) in a binary image. In other words, it's a solid element in a binary image.
Blob finders are a key step in any system that aims at extracting/measuring data from digital images.
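The "connected component" definition above can be made concrete with a few lines of labeling code. A toy sketch on a binary image stored as a list of 0/1 rows, using 4-connectivity (real engines like Tesseract work on outlines and are far more involved):

```python
from collections import deque

def label_blobs(grid):
    """Label 4-connected components ('blobs') in a binary image.

    grid is a list of rows of 0/1 values; returns (count, label grid),
    where pixels in the same blob share the same positive label."""
    h, w = len(grid), len(grid[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] and not labels[y][x]:
                count += 1                      # found a new, unlabeled blob
                queue = deque([(y, x)])
                labels[y][x] = count
                while queue:                    # flood-fill its pixels
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and grid[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return count, labels

# Two separate "characters": a top-left pair and a bottom-right L-shape
grid = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 1, 1]]
count, labels = label_blobs(grid)
```

Each distinct label here corresponds to one blob; a character recognizer would then try to classify (or split/merge) each labeled region, which is the "juggling" the glossary entry describes.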

Image Comparison using CBIR and OCR

Working on a project to retrieve content from a given image, compare it with other images in the repository, and list the matching images.
What would be the right approach so that the search won't slow down eventually?
What I was planning to do as a first level of filtering was to use image querying (a CBIR technique) to retrieve images matching the pattern of the given image.
Then do OCR to get the image content and check for a match.
Please let me know if there is a better approach for this.
Steps done
Software used:
1. Tesseract OCR
2. ImageMagick - for image cleaning
3. Textcleaner script
Found the image orientation using ImageMagick. The convert tool can read the orientation from EXIF data, but that was not useful here.
Instead, the image was rotated by 90 degrees three times and the OCR output of each rotation was compared with the others to find the correct orientation (the rotation producing the most words wins).
OCRed the image to get the text and applied filtering to extract the bill number, date and amount.
On success, the details are stored in the DB for future search.
On failure, 10 different images were created with different filters (grayscale mode and sharpening applied), all of them were OCRed, and the required data was extracted from the combined results.
Saved data is used by the future search feature to eliminate duplication.
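The rotate-and-count trick in the steps above generalizes to any OCR backend: try all four 90-degree rotations and keep the one whose output has the most words. A backend-agnostic sketch where `rotate` and `ocr` are caller-supplied callables (e.g. a PIL rotate and a pytesseract wrapper; both are assumptions here and are stubbed in the demo):

```python
def best_orientation(image, rotate, ocr):
    """Return the rotation (0/90/180/270) whose OCR output has the most words.

    rotate(image, degrees) and ocr(image) -> str are supplied by the
    caller, e.g. a PIL rotate and a pytesseract wrapper; they stay
    abstract here so the heuristic itself is the whole sketch."""
    def word_count(deg):
        return len(ocr(rotate(image, deg)).split())
    return max((0, 90, 180, 270), key=word_count)

# Stub demo: pretend the page reads sensibly only when rotated 270 degrees
fake_page = {0: "x", 90: "xy z", 180: "", 270: "the total amount is 42"}
angle = best_orientation(fake_page,
                         rotate=lambda img, deg: img[deg],
                         ocr=lambda text: text)
```

The heuristic costs four OCR passes per image, so it is worth caching the winning angle alongside the stored bill data rather than re-detecting it on every search.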

Minimalistic way to read TIFF image format pixels

We are participating in RoboCup 2015, organized by the German Aerospace Center, in October. Before the tournament we will get a 30x30 pixel TIFF image representing a low-resolution heightmap. My task is to write fast, lightweight, dependency-free code that reads this TIFF image and does some algorithmic work on it.
I googled the TIFF image format and it seems there are some powerful libraries, but is there a simple way of reading just the color values from the file?
I remember a format, I don't know which, where I just skipped the first 30 bytes and could then read the color values as RGB. Do you have any code that could do that, or an idea/explanation of how I could achieve it?
As I said, I do not need the filename, image-size data, etc. I actually don't even know why they chose the TIFF format, since it's just a normal grayscale heightmap.
Every bit of help is appreciated.
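The "skip a short header, then read raw values" format the asker remembers sounds like netpbm (PGM for grayscale, PPM for RGB). One dependency-free route is to convert the TIFF once up front with ImageMagick (`convert heightmap.tif heightmap.pgm`) and then parse the PGM at runtime, which for the binary P5 variant takes only a few lines. A sketch assuming maxval <= 255 and no `#` comment lines in the header:

```python
import re

def read_pgm(data):
    """Parse a binary (P5) PGM: an ASCII header 'P5 <width> <height>
    <maxval>' separated by whitespace, then one raw byte per pixel.

    Assumes maxval <= 255 and no '#' comment lines in the header."""
    m = re.match(rb"P5\s+(\d+)\s+(\d+)\s+(\d+)\s", data)
    if not m:
        raise ValueError("not a binary (P5) PGM")
    width, height, maxval = (int(g) for g in m.groups())
    raw = data[m.end():m.end() + width * height]
    # Reshape the flat byte string into rows of pixel values
    rows = [list(raw[y * width:(y + 1) * width]) for y in range(height)]
    return width, height, rows

# Hand-made 4x3 test image with pixel values 0..11 (stand-in for the
# real 30x30 heightmap)
pgm = b"P5\n4 3\n255\n" + bytes(range(12))
width, height, rows = read_pgm(pgm)
```

For the real 30x30 heightmap the grayscale byte at `rows[y][x]` is the height value directly, so no further decoding is needed.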

Text Recognition in Sikuli-java-api (not sikuliX-api)

SikuliX (Sikuli Script) has Region.text(), which returns the text from an on-screen image using Tesseract OCR.
Is there something similar in the Sikuli-java-api?
I need to verify some text on screen and am trying to decide which of the two APIs to use. Thanks in advance for your help!
No. Only after setting up Tesseract OCR would you be able to read/validate text. If you are not able to download the OCR engine directly during installation, try its offline version and copy it to your local machine.
Depending on what you wish to accomplish by recognizing "text", Sikuli can still recognize images of text. Text images are treated the same as any other images displayed. Sikuli on its own can't interpret the text in an image, however if you know what text you wish to see, and have an image of it for comparison, you can still validate whether or not it appears. Keep in mind that font and resolution changes will likely cause unreliable results.