Text Recognition in Sikuli-java-api (not sikuliX-api) - OCR

SikuliX (Sikuli Script) has Region.text(), which returns the text value from an image on screen using Tesseract OCR.
Is there something similar in Sikuli-java-api?
I need to verify some text on screen and am trying to decide which of the two APIs should be used. Thanks for your help in advance!

No. Only after setting up Tesseract OCR will you be able to read/validate text. If you are not able to download the OCR package directly during installation, try the offline version and copy it to your local machine.
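For reference, a minimal sketch of what reading text looks like in SikuliX once the OCR data is in place. The region coordinates are placeholders, the Settings flags are the SikuliX 1.x ones, and the sikuli-java-api itself has no direct equivalent to Region.text():

```java
import org.sikuli.basics.Settings;
import org.sikuli.script.Region;
import org.sikuli.script.Screen;

public class ReadTextExample {
    public static void main(String[] args) {
        // SikuliX (not sikuli-java-api) usage; requires the Tesseract
        // language data to be installed and found by SikuliX.
        Settings.OcrTextRead = true;    // enable OCR for Region.text()
        Settings.OcrTextSearch = true;  // enable OCR-based text search

        Screen screen = new Screen();
        // Region of the screen where the text is expected (placeholder coordinates).
        Region area = new Region(100, 200, 300, 50);
        String recognized = area.text();
        System.out.println("OCR result: " + recognized);
    }
}
```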

Depending on what you wish to accomplish by recognizing "text", Sikuli can still recognize images of text. Images of text are treated the same as any other image displayed on screen. Sikuli on its own can't interpret the text in an image; however, if you know what text you expect to see and have an image of it for comparison, you can still validate whether or not it appears. Keep in mind that font and resolution changes will likely cause unreliable results.
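A minimal sketch of that image-comparison approach using the SikuliX classes (the sikuli-java-api offers analogous region/target calls); the image file name, similarity threshold, and timeout are placeholders:

```java
import org.sikuli.script.Match;
import org.sikuli.script.Pattern;
import org.sikuli.script.Screen;

public class VerifyTextImage {
    public static void main(String[] args) {
        Screen screen = new Screen();
        // "expected_text.png" is a placeholder screenshot of the text to verify.
        Pattern expected = new Pattern("expected_text.png").similar(0.9);

        Match match = screen.exists(expected, 2);  // wait up to 2 seconds
        if (match != null) {
            System.out.println("Expected text is visible at " + match);
        } else {
            System.out.println("Expected text was not found on screen.");
        }
    }
}
```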

Related

OCR - optical character recognition image size/quality decrease

I have a question about OCR, parsing text from an image. I am making an Android application that uses the Google Cloud API to parse text from images. The problem is that sending/uploading the image takes too much time, so I thought I could resize the image or decrease its quality. But in that case, the OCR detection results usually suffer. Can anyone please tell me the best way to do it? Maybe someone knows how WhatsApp compresses images, or the best image file format (JPG, PNG), or the best quality/size reduction ratio, or something like that.
Thanks in advance!
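No answer was given here, but the compromise the question describes is to downscale and re-encode as JPEG before uploading, then check whether accuracy is still acceptable for your images. A rough Android sketch; the target width and JPEG quality are arbitrary starting points, not recommendations:

```java
import android.graphics.Bitmap;
import java.io.ByteArrayOutputStream;

public class OcrUploadPrep {
    /**
     * Downscales a bitmap to a maximum width and re-encodes it as JPEG.
     * Both values are placeholders to experiment with; too much reduction
     * will hurt OCR accuracy.
     */
    public static byte[] prepareForUpload(Bitmap source) {
        int maxWidth = 1024;   // placeholder target width
        int jpegQuality = 80;  // placeholder quality (0-100)

        Bitmap scaled = source;
        if (source.getWidth() > maxWidth) {
            int scaledHeight = source.getHeight() * maxWidth / source.getWidth();
            scaled = Bitmap.createScaledBitmap(source, maxWidth, scaledHeight, true);
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        scaled.compress(Bitmap.CompressFormat.JPEG, jpegQuality, out);
        return out.toByteArray();
    }
}
```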

How can a scanned page be divided into words like the reCaptcha project?

I would like to digitize a book in a similar way to the reCaptcha project. Is there already a system for inputting an image and then outputting little images cropped around words? Any ideas on how to do this?
You should look into the Tesseract OCR project on which reCaptcha was probably based. It has the capability to output the coordinates of recognized words. Then you crop the page to those coords and you are done.
If you just want to split the image into multiple images, one word each, you could try to find the word bounding boxes and then use those coordinates for the splitting. This can be done by taking histograms/projections of the document in the horizontal direction and then, for each line, in the vertical direction. An example algorithm with some pictures describing the idea can be found in the paper "Document Page Decomposition by the Bounding-Box Projection Technique" (http://haralick.org/conferences/71281119.pdf). You could implement this in OpenCV.
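A rough sketch of the horizontal-projection step using the OpenCV Java bindings, assuming a reasonably clean scan with dark text on a light background (the file path is a placeholder); splitting each detected line into words repeats the same idea on the vertical projection:

```java
import org.opencv.core.Core;
import org.opencv.core.CvType;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class LineProjection {
    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);

        Mat gray = Imgcodecs.imread("page.png", Imgcodecs.IMREAD_GRAYSCALE); // placeholder path
        Mat binary = new Mat();
        // Otsu threshold, inverted so text pixels become white (255).
        Imgproc.threshold(gray, binary, 0, 255, Imgproc.THRESH_BINARY_INV + Imgproc.THRESH_OTSU);

        // Sum each row: rows that contain text have a non-zero projection.
        Mat rowSum = new Mat();
        Core.reduce(binary, rowSum, 1, Core.REDUCE_SUM, CvType.CV_32S);

        boolean inLine = false;
        int lineStart = 0;
        for (int y = 0; y < rowSum.rows(); y++) {
            boolean hasInk = rowSum.get(y, 0)[0] > 0;
            if (hasInk && !inLine) {
                inLine = true;
                lineStart = y;
            } else if (!hasInk && inLine) {
                inLine = false;
                System.out.println("Text line from row " + lineStart + " to row " + (y - 1));
                // binary.submat(lineStart, y, 0, binary.cols()) would give the line image.
            }
        }
    }
}
```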
Alternatively, you can use Tesseract as mentioned by beppe9000. Perhaps this helps: Getting the bounding box of the recognized words using python-tesseract.
But then you take on the whole complexity of an OCR engine even though you only want the bounding boxes.
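As a concrete alternative to python-tesseract, here is a rough Java sketch using Tess4J (a Tesseract wrapper) to get word-level boxes and crop them out; the image path and tessdata path are assumptions:

```java
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;

public class WordBoxes {
    public static void main(String[] args) throws Exception {
        BufferedImage page = ImageIO.read(new File("page.png"));  // placeholder path

        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/tessdata"); // placeholder tessdata path

        // Ask for one box per recognized word.
        List<Word> words = tesseract.getWords(page, ITessAPI.TessPageIteratorLevel.RIL_WORD);
        for (Word word : words) {
            Rectangle box = word.getBoundingBox();
            BufferedImage crop = page.getSubimage(box.x, box.y, box.width, box.height);
            ImageIO.write(crop, "png", new File("word_" + box.x + "_" + box.y + ".png"));
            System.out.println(word.getText() + " -> " + box);
        }
    }
}
```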

OCR algorithm - distinguish between textual image and object image

I am writing a program that extracts the contents from the logos of different websites. I am using OCR to extract the text from the logo, but I want to optimize the program and apply OCR only to those logos which contain text. I don't know how to determine whether a logo contains text or not. Is there any method?
This is a case where we need to know whether an image has text in it, which is different from OCR.
The algorithm widely considered the best to date is the Stroke Width Transform, designed by Epshtein at Microsoft in 2010. It does not rely on machine learning.
You can get more details from this paper : Detecting Text in Natural Scenes with Stroke Width Transform
Or watch a video about this.
There is an implementation of this algorithm here.
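Implementing the full Stroke Width Transform is non-trivial. As a much cruder pre-filter (explicitly not SWT), you could binarize the logo and count connected components whose size looks character-like, and only run OCR when enough of them appear. A rough OpenCV Java sketch with arbitrary thresholds:

```java
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.imgcodecs.Imgcodecs;
import org.opencv.imgproc.Imgproc;

public class TextPresenceHeuristic {
    // Crude heuristic, NOT the Stroke Width Transform: counts character-sized
    // connected components. All thresholds below are arbitrary guesses.
    public static boolean probablyHasText(String path) {
        Mat gray = Imgcodecs.imread(path, Imgcodecs.IMREAD_GRAYSCALE);
        Mat binary = new Mat();
        Imgproc.threshold(gray, binary, 0, 255, Imgproc.THRESH_BINARY_INV + Imgproc.THRESH_OTSU);

        Mat labels = new Mat(), stats = new Mat(), centroids = new Mat();
        int n = Imgproc.connectedComponentsWithStats(binary, labels, stats, centroids);

        int characterLike = 0;
        for (int i = 1; i < n; i++) {  // label 0 is the background
            double w = stats.get(i, Imgproc.CC_STAT_WIDTH)[0];
            double h = stats.get(i, Imgproc.CC_STAT_HEIGHT)[0];
            boolean plausibleSize = h > 8 && h < gray.rows() * 0.8 && w < h * 3;
            if (plausibleSize) {
                characterLike++;
            }
        }
        // Guess: a handful of letter-sized blobs suggests the logo contains text.
        return characterLike >= 3;
    }

    public static void main(String[] args) {
        System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
        System.out.println(probablyHasText("logo.png"));  // placeholder path
    }
}
```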

How to make tesseract give relevant results in the presence of noise?

I am using tesseract 3.0.0 and I bumped into the following problem:
When there is something too small for Tesseract to recognize, it seems to get merged with other fragments, and as a result nothing relevant is returned.
The image below shows 3 cases. Only the rectangle with the dashed line is passed to Tesseract. Above the rectangle is the result (V over T means new line).
The last case is the problematic one. Is there some way to improve Tesseract in situations like this?
As far as I know, Tesseract does not have proper image segmentation yet (or Document Analysis, as it is called in commercial OCR applications). Typically, before OCR is done, the image gets split into separate areas that contain text, pictures, barcodes, lines and so on. Then you apply OCR only to the text areas and don't face the problems you have just described.
Earlier versions of Tesseract did not have that functionality at all, and Tesseract was supposed to be used as a line recognizer only, or as a so-called field-level recognizer, when you use it on small snippets of text cut from a bigger image.
I have not followed thoroughly what was introduced in 3.0; probably it is already partially there, but obviously it does not work as expected, as you have just found out.
There is another open-source project, OCRopus, that approached this problem exactly as I described: first Document Analysis (aka segmentation) and only then OCR. Their earlier versions actually used Tesseract for OCR after the analysis step finished, but later they introduced their own OCR (which is still not very good) and moved Tesseract plugin support down the priority list.
Here's what you actually can do to address your problem:
If your images have a very typical structure, you can try to do some dumb segmentation and cut the text from the image yourself before passing it to Tesseract (see the sketch after this answer). However, if you expect to support a wide variety of images, just forget it.
You can check OCRopus and see whether its segmentation works for your images. If yes, then you can spend some time making OCRopus + Tesseract work together.
Well, if what you are doing is not just for fun and you value your time, I would recommend looking into a real OCR engine like ABBYY. You will get much higher accuracy of both segmentation and OCR out of the box, and professional customer support, of course.
Disclaimer: I work for ABBYY
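For the "dumb segmentation" option above, a minimal sketch of cropping a known region and handing only that snippet to Tesseract via Tess4J; the coordinates, file paths, and the single-line page segmentation mode are assumptions about your layout:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.Tesseract;

public class FieldLevelOcr {
    public static void main(String[] args) throws Exception {
        BufferedImage full = ImageIO.read(new File("screenshot.png"));  // placeholder path

        // Assumed coordinates of the region that contains the text of interest.
        BufferedImage snippet = full.getSubimage(40, 120, 300, 30);

        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/tessdata");  // placeholder tessdata path
        tesseract.setPageSegMode(7);  // 7 = treat the snippet as a single text line

        String text = tesseract.doOCR(snippet);
        System.out.println("Recognized: " + text.trim());
    }
}
```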

PDF to web page

I get a .pdf complete with images, fancy fonts, styles, gradients and what have you. Basically it's handed off to me with the message, "Make me a web page that looks exactly like this." I've tried a few PDF-to-HTML tools and they all look terrible. I figure I've only got two options and I hate them both.
convert the PDF to one big image and use an image map to add the links.
use the screen copy tool that comes with Acrobat Reader to chop the file up into its parts (buttons, logos, etc.).
She uses Quark to make this PDF. I've never used it, but I hear it is very popular. Are these really my only two options? Someone tell me I'm wrong, please.
Grab what text you can out of the PDF and clean it up. Pull the PDF into Photoshop and slice out the graphical elements you want to use. Rebuild the page using the images and put your text in HTML format.
Make a slice of the gradients and use them as background images with repeat.
Try to explain to your client why the fancy font is unsuitable for this medium.
Edit:
If it's just going to be a screen shot, you might as well just put the PDF up in the first place. At least people can zoom in.
Do not use one big image map. The more content you can convert from image to text, the better (more efficient) your HTML page will be.
Chop up the PDF into parts: make the logos and other graphics into images, make the text plain text, and make the buttons actual button controls.
Exactly like what Diodeus said, except:
Find the fancy font and check how much it will cost to license or buy it. Build two bills and send them to your client, one with the fancy font and one with a standard font. Then see if she still wants the fancy font. It will show that you take your job seriously and may get you less strict project conditions.
No, they are not:
Adobe's online PDF to HTML service
or
pdftohtml