POC Help - OCR exploit to run code gathered from JPEG

I have been thinking about security concerns regarding OCR programs such as Tesseract.
My theory is that malicious code printed out in plain text can be photographed and saved as an image file. (This leaves the hex and headers free from any change.)
Then, using OCR, the JPEG could be converted to greyscale and the characters read and executed, perhaps via an exploit within the OCR application.
Looking back at the way certain worms could self-execute in Windows via preview, perhaps something similar could be done using the above method.
I imagine this is one of the key security concerns for a company developing an OCR application, so it may be very hard to provide a proof of concept.
If anyone would like to explore this concept, or perhaps explain why it is, or indeed is not, possible, I would appreciate it.
This is my first post, so sorry if I have missed any forum rules.

Related

How to improve OCR results

I tried to improve the results of open-source OCR software. I'm using Tesseract because I find it still produces better results than GOCR, but with bad-quality input it has huge problems. So I tried to preprocess the image with various tools I found on the internet:
unpaper
Fred's ImageMagick Scripts: TEXTCLEANER
manually using GIMP
But I was not able to get good results with this bad test document (really just for testing; I don't need the content of this file):
http://9gag.com/gag/aBrG8w2/employee-handbook
This online service works surprisingly well with this test document:
http://www.onlineocr.net/
I'm wondering if it is possible, using smart preprocessing, to get similar results with Tesseract. Are the open-source OCR engines really so bad compared to commercial ones? Even Google uses Tesseract to scan documents, so I was expecting more...
Tesseract's recognition precision is a little lower than that of the best commercial engine (ABBYY FineReader), but it's more flexible because of its nature.
This flexibility sometimes entails some preprocessing, because it's not possible for Tesseract to handle every situation.
It's actually used by Google because Google is its main sponsor!
The first thing you could do is try to enlarge the text so that characters are at least 20 pixels wide. Since Tesseract uses the main segments of the characters' borders as features, it needs larger characters than other algorithms do.
Another thing you could try, again referring to the test document you mentioned, is to binarize your image with an adaptive thresholding method (you can find some info about that at https://dsp.stackexchange.com/a/2504), because there are changes in illumination across the image. Tesseract binarizes the image internally, but this could be a case where it fails to do that (it's similar to the example in Improving the quality of the output with Tesseract, where you can also find some other useful information).
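Something along these lines, as a rough sketch in Python (assuming opencv-python and pytesseract are installed; the filename, scale factor and threshold parameters are placeholders you would tune for your scans):

    # Upscale so characters are large enough for Tesseract, then binarize
    # with adaptive thresholding to cope with uneven illumination.
    import cv2
    import pytesseract

    img = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)

    # Enlarge the image: Tesseract's border-segment features need bigger glyphs.
    img = cv2.resize(img, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)

    # Adaptive (local) thresholding handles illumination changes across the page.
    img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 31, 11)

    print(pytesseract.image_to_string(img))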

handwriting recognition with simple training

I've been reading (and trying) OCR programs suggested in previous answers but I'm still without a clear answer to my problem.
I need to recognize handwritten English text. The text would be multiple lines, but each line is only one or two words long. The text is from a different person each time. I could ask that person to provide a training file (e.g. with the alphabet and the digits 0-9), but I cannot really ask for much more complicated training than this.
I need to integrate the recognition as part of another (Java) application but the solution doesn't need to be Java. I can just execute it from Java and get the results from a text file.
Any recommendations?
I've already tested Tesseract (bad results without training, and training looks quite complex). Java OCR looked like the perfect solution (simple training, open source and Java), but it doesn't work well even with their own examples (has anybody had a better experience?). GOCR does not seem very active.
Of course I prefer free solutions, but this is not a MUST (though the problem I see with a commercial option is that I must be able to integrate it into my own app, which will be offered as SaaS).
From my experience ABBYY is one of the best for handwriting recognition, even without training. (It's possibly one of the most expensive too, though...) They have an SDK for Java.
http://www.abbyy.com
With a free trial, it's definitely worth a look!
I am on the lookout for handwritten text recognition software. So far the only one giving better results than even ABBYY 11 has been SimpleOCR, using the same text for both; it is freeware for OCR but a 14-day trial for HCR!
I know I am answering after nearly 6 years, but if anyone's still looking, try using TensorFlow. Their website has a simple example for handwritten digit recognition (MNIST). You can use this example and adapt it for handwritten alphabet recognition (you need training data for this; I used NIST Special Database 19 to get that data).
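For reference, a compact sketch of that kind of MNIST example using TensorFlow's Keras API; for letters you would load your own dataset (e.g. NIST Special Database 19) and change the number of output classes:

    # Minimal handwritten-digit classifier (MNIST) with TensorFlow/Keras.
    import tensorflow as tf

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(10, activation="softmax"),  # one class per digit
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5)
    print(model.evaluate(x_test, y_test))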

How can I get started on programmatically analyzing web site content?

I've been looking for a new hobby programming project, and I think it would be interesting to dabble in ways to programmatically gather information from websites and then analyze that data to aggregate or filter it. For example, I might want to write an application that could take Craigslist listings and then display only the ones matching a specific city, not just a geographical area. That's just a simple example, but you could go as advanced and sophisticated as how Google analyzes a site's content to know how to rank it.
I know next to nothing about that subject and I think it would be fun to learn more about it, or hopefully do a very modest programming project in that topic. My problem is, I know so little that I don't even know how to find more information about the subject.
What are these types of programs called? What are some useful keywords to use when searching on Google? Where can I get some introductory reading material? Are there interesting papers I should read?
All I need is someone to disabuse me of my ignorance, so that I can do some research on my own.
cURL (http://en.wikipedia.org/wiki/CURL) is a good tool to fetch a website's contents and hand it off to a processor.
If you are proficient with a particular language, see if it supports cURL. If not, PHP (php.net) may be a good place to start.
When you have retrieved a website's content via cURL, you can use the language's text processing functionality to parse the data. You can use regular expressions (http://www.regular-expressions.info/) or functions such as PHP's strstr() to find and extract the particular data you seek.
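As a rough illustration of that fetch-then-parse loop (shown in Python rather than PHP; the URL and the regular expression are placeholders):

    # Fetch a page, then extract pieces of interest with a regular expression.
    import re
    import urllib.request

    with urllib.request.urlopen("https://example.com/listings") as resp:
        html = resp.read().decode("utf-8", "replace")

    # Example: pull the text of every <h2> heading on the page.
    for heading in re.findall(r"<h2[^>]*>(.*?)</h2>", html, re.DOTALL):
        print(heading.strip())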
Programs that "scan" other sites are usually called web crawlers or spiders.
I recently completed a project that uses Google Search Appliance that basically crawls the whole .com domain of the web server.
GSA is a very powerful tool that pretty much indexes all the URLs it encounters and serves the results.
http://code.google.com/apis/searchappliance/documentation/60/xml_reference.html

Reversing an old file format

I'm trying to reverse engineer an old medical imaging format called Stentor for interoperability. It was designed by a company of the same name, which was subsequently bought by Philips. But Philips has forgotten how to read Stentor files. I have a Windows program which exports JPEGs from Stentor files, but it's closed source. I'd like to automate this process in order to tackle hundreds of files in this format.
The program is a late-1990s Win32 or MFC executable. It runs alongside an ActiveX (.ocx) file which I've been able to interop with, but that file doesn't contain the export method. I'm looking for suggestions on how to disassemble the binary in order to unearth the algorithm used to convert Stentor to JPEG. I looked through the Stentor files in a hex editor and didn't find any evidence of JPEG (although hints on finding that would be appreciated too), so I think the program has a couple of tricks up its sleeve.
Thanks in advance.
Kyle
Few programmers implement complex routines such as image recoding themselves. Instead, they tend to license libraries that do it. A very smart way to start would be to search for text strings and see if you can discover the libraries they use. This will subsequently give you a lot of insight into how the data is encoded.
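A rough sketch of that string-hunting step in Python, roughly what the Unix `strings` utility does (the executable name is a placeholder):

    # Dump printable ASCII runs from the closed-source exporter and look for
    # library names, copyright notices, or codec identifiers.
    import re

    with open("exporter.exe", "rb") as f:
        data = f.read()

    # Runs of 6 or more printable ASCII characters.
    for match in re.finditer(rb"[ -~]{6,}", data):
        print(match.group().decode("ascii"))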
Another good strategy would be to build a program that simply drives the GUI of your export program by sending mouse and keyboard events directly to it. Let this run for a few days to complete your export. Reverse engineering the file format is going to be slow and expensive, so for a one-time gig it's probably not worthwhile.

Reliably extracting identity fields from scanned documents / images?

I have to pull two pre-printed (not hand-written) fields out of a paper form, such that it can be automatically routed after being scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".
I've tried the following software:
Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)
I've tried the following settings:
Scanning at resolutions of 300dpi and 600dpi
Different fonts, including OCR-A and OCR-B
In all cases, output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers differ by only a single character, like "123G" and "123C".
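As a sketch of that fuzzy-matching idea (using Python's difflib; the list of known identifiers is invented for illustration, and as noted it cannot reliably separate IDs that differ by a single character):

    # Match noisy OCR output against a list of known identifiers.
    import difflib

    known_ids = ["GG-9192", "EPN/245G", "EPN/245C"]

    def best_match(ocr_text, cutoff=0.8):
        matches = difflib.get_close_matches(ocr_text, known_ids, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(best_match("GG-9l92"))  # OCR misread "1" as "l"; still resolves to GG-9192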
Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?
Edit: the containing application is in Java, so any recommendations for which there are free or cheap Java-based APIs would help.
Edit 2: if anyone is interested... without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results... none of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.
If you have control over the fields, why use a human-readable format in the first place? For scanning, it seems like a QR code or something similar would be best. It is marked for orientation and has some built-in error correction.
http://en.wikipedia.org/wiki/QR_Code
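For example, generating the code is only a couple of lines with the third-party qrcode package (an assumption here, not something from the original post); reading a scanned page back would need a decoder such as pyzbar or ZXing:

    # Encode a batch/item identifier as a QR code instead of printed text.
    import qrcode  # pip install qrcode[pil]

    img = qrcode.make("GG-9192")   # includes orientation marks and error correction
    img.save("GG-9192.png")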
I started digging for products starting with Tomato's suggestion. I tried ABBYY and CVISION. Both have products that can automate OCR:
CVISION Maestro Recognition Server 4.0
ABBYY Recognition Server 2.0
In addition, ABBYY has SDKs for various platforms, and CVISION has an SDK that appears to work with at least VB/VC++.
I haven't tried either SDK yet, and am not sure it's necessary for my project. All I need is PDFs coming in that I can extract the text from. I did however try CVISION's server product and with the OCR on its most accurate settings, it worked really well. I haven't tried ABBYY's server product yet because I have to go through a reseller to get a trial. I'm in the process of doing so, but if it starts getting annoying I'm probably going to go with CVISION. I did try ABBYY's FineReader standalone product, and it worked very well, so I assume that their server product would also.