I need to identify handwritten text (ICR). I don't need to handle arbitrary writing; I can instruct my users to write very clearly, with separate letters and so on. Still, there will be some amount of difference between any training set and the real letters.
I am hoping to train tesseract for this purpose. Has anyone tried this? Any hope in this path?
You will need fonts that resemble the handwritten letters. You can create them with any font-design tool (a sample is here), then follow the training process as described here.
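In case it is useful, here is a rough sketch of the classic Tesseract 3.x training steps once you have rendered training pages with your handwriting-like font (the language code "hnd" and the file names are just placeholders):

tesseract hnd.myfont.exp0.tif hnd.myfont.exp0 batch.nochop makebox   # generate a .box file, then correct it by hand
tesseract hnd.myfont.exp0.tif hnd.myfont.exp0 box.train              # produce hnd.myfont.exp0.tr
unicharset_extractor hnd.myfont.exp0.box
# font_properties is a small text file describing the font (see the training docs)
mftraining -F font_properties -U unicharset -O hnd.unicharset hnd.myfont.exp0.tr
cntraining hnd.myfont.exp0.tr
# rename inttemp, pffmtable, shapetable and normproto with the hnd. prefix, then pack:
combine_tessdata hnd.                                                 # produces hnd.traineddata for your tessdata folder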
I have a Flash application and I am trying to create a handwriting effect. It will have to draw the outline of the text, so I need the outline of the text. I could have gotten it if the fonts were predefined, but my users can upload fonts too, so is there any way I can extract the outline of a text at run time? Any help is appreciated.
I know about readGraphicsData, but it won't help!
We had this question before (can't find it). It boils down to this:
There's afaik no trivial way to access the vector outline of text. Yes, you can read the font file (like any other file) and just parse it. Any font description file obviously includes the font outline in some way. You'd have to read the specification of that font file and extract the desired information accordingly.
For example, according to the OpenType specification's Font File Tables, the Tables Related to TrueType Outlines contain a 'glyf' (glyph data) table wherein you can find numberOfContours, xCoordinates[ ], yCoordinates[ ], etc.
I haven't come across a library that reads font description files and extracts that data conveniently, so you'd have to parse it yourself, which is - just to be clear here - an insane amount of work. You could try to find a library in some other language like C and see if you can somehow use it in As3. That might be less work, but it's complicated.
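If you just want to see what that outline data looks like for a particular font (outside of Flash), the ttx tool from the Python fontTools package can dump the glyf table to readable XML. This won't help you at run time in AS3, but it shows the structure you would otherwise have to parse yourself; the font name below is only an example:

pip install fonttools
ttx -t glyf -o UploadedFont.ttx UploadedFont.ttf   # writes each glyph's contours and point coordinates as XML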
However, there's a flaw in your premise anyway:
I have a flash application and i am trying to create a hand write effect, it will have to draw the outline of the text
There's pretty much no way to accomplish this. Knowing the outline is not enough: each glyph is written by hand with a number of strokes, and there is no general way to recover those strokes from the outline alone.
If you know the outline of an
A
nothing tells you that the steps to handwrite it are
/
\
––
Fonts might have additional lines. If I wanted to use a service that produces handwriting from a font that I supply, I'd be inclined to use a calligraphic font with plenty of additional loops, curves, etc. There's no trivial way to automate this.
I am using Microsoft OCR Library for reading text.
The Microsoft OCR library works perfectly. However, I want to read the list of characters shown at http://www.ict4u.net/databases/database-images/micr.jpg . Is there a way to train the OCR library to read these characters, or a language pack that can read them?
[Microsoft OCR crew here] We don't yet support training OCR to customize it for your use-cases. However, we do actively keep an eye on stackoverflow to see what developers need, so we can keep improving the OCR engine.
I have been working with Microsoft OCR for a while now.
Compared with Tesseract it has very basic functionality.
For example Microsoft OCR returns the words and lines.
But the lines are nonsense. Randomly 2 or 3 words are grouped together as a "line" but they are not a real line. And the "lines" are completely unordered. In this aspect it is worse than Tesseract. You have to take the coordinates of each word and order them on your own.
Microsoft does not return the rectangles of characters and there is absolutely no way to configure or train Microsoft OCR in any way. You can add languages with Windows Update for "Basic Typing" = OCR (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10), but you cannot train your own language data.
MSDN says that the following 25 languages are supported with different accuracy:
Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
Very good: Chinese Simplified, Greek, Japanese, Russian and Turkish.
Good: Chinese Traditional and Korean.
The recognition quality is very similar to Tesseract. It even has exactly the same problems as Tesseract: some single characters are not recognized (isolated symbols like a lone '$'), and it has the same huge problem with asterisks. It also inserts spaces in the wrong places, just as Tesseract does. So I ask myself whether Microsoft is using Tesseract under the hood.
However, Microsoft OCR has an advantage over Tesseract: the image preprocessing is much better. It does not matter whether you have red text on a yellow background or white text on black. This is a weak spot for Tesseract, which needs a good-quality black-and-white image as input.
One thing applies to both OCR libraries: if you have recognition problems, try enlarging (upscaling) the image. Even blurring the image may be very helpful because it removes noise.
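As a rough illustration of that kind of preprocessing (file names and numbers are only examples), ImageMagick can enlarge, de-noise and binarize the image before it goes to the OCR engine:

convert input.jpg -colorspace Gray -resize 300% -blur 0x1 -threshold 60% prepared.tif   # upscale, grayscale, smooth noise, binarize
tesseract prepared.tif output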
I've started a simple project that takes an image containing text with superscripts and then, using OCR (currently Tesseract), has to recognize the superscript characters as well as the normal ones.
For example, we have a chemical equation such as Cl², but when I use Tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
A very good question that touches on more advanced features of any OCR system.
First of all, make sure you are not overlooking functionality that may already be there in an OCR system. Look at your test result not in plain TXT format but in some kind of rich-text-capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you correct characters, your viewer may have flattened them for display. If you access the text result programmatically, that is less of an issue, because you should get the proper subscript character value when reading it directly; just note that a viewer must support it for you to actually see it. If you have eliminated this possible post-processing conversion and made sure that no subscript is returned from the OCR, then it probably does not support it.
Just like in this text box, in your original question you tried to give us a superscript character example, but this text box did not accept it even though you could copy/paste it from elsewhere.
Many OCR engines will treat a subscript like any other normal character, if they can see it at all. The OCR you use needs the technical capability to actually produce superscripts/subscripts, and many do, but not surprisingly they tend to be commercial OCR systems.
I made a small test case before answering. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) ).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I then processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the resulting file here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you want to extract superscript/subscript characters, pay more attention to image quality than you would for typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters because there are too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.
Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:
Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.
Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.
I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."
I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg02157.html), if that's relevant.
Is there anything out there that I can use to do this simple formatting classification?
Edit:
Is there anything out there that will do this without costing me an arm and a leg?
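For what it's worth, here is a rough sketch of the size information I can already pull out of Tesseract's word-level output (assuming a Tesseract build that supports TSV output; in that format column 10 is the word height and column 12 the recognized text):

tesseract menu.png menu tsv                                              # writes menu.tsv, one row per word with its bounding box
awk -F'\t' 'NR > 1 && $12 != "" { print $10, $12 }' menu.tsv | sort -n   # list each word's height next to it, smallest first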
I'm not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, the ABBYY OCR SDK can not only identify bold/italic font styles, it can also determine the proper font face to use in the output.
Based on what you describe, I guess you are trying to determine the document's style hierarchy, like header levels etc. ABBYY FineReader Engine provides this functionality, so you don't have to build your own routine that infers text purpose from font size and style. Besides, it provides the best OCR quality and it's free to try. Consider it if you plan commercial software. I work @ ABBYY and can provide more info on our OCR SDK if necessary.
Best regards.
I publish technical books, in print, PDF, and Kindle/MOBI, with EPUB on the way.
The Kindle does not support monospace fonts, which are kinda useful for source code listings. The only way to do monospace fonts is to convert the text (Java source, HTML, XML, etc.) into JPEG images. More specifically, due to pagination issues, a given input ASCII file needs to be split into slices of ~6 lines each, with each slice turned into a JPEG, so listings can span a screen. This is a royal pain.
My current mechanism to do that involves (sketched as a single pipeline after this list):
Running expand to set a consistent 2-space tab size, which pipes to...
a2ps, which pipes to...
A small Perl snippet to add a "%%LanguageLevel: 3\n" line, which pipes to...
ImageMagick's convert, to take the (E)PS and make a JPEG out of it, with an appropriate background, cropped to 575x148+5+28, etc.
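Strung together, it looks roughly like this (the file name and the Perl insertion shown are just illustrative; the crop geometry is the one from above):

expand -t 2 Listing.java \
  | a2ps -1 -B -o - \
  | perl -pe 'print "%%LanguageLevel: 3\n" if $. == 2' \
  | convert -density 150 ps:- -background white -flatten -crop 575x148+5+28 slice.jpg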
That used to work 100% of the time. It now works 95% of the time. The rest of the time, I get convert: geometry does not contain image errors, which I cannot seem to get rid of, in part because I don't understand what the problem is.
Before this process, I used to use a pretty-print engine (source-highlight) to get HTML out of the source code...but then the only thing I could find to convert the HTML into JPEGs was to automate screen-grabs from an embedded Gecko engine. Reliability stank, which is why I switched to my current mechanism.
So, if you were me, and you needed to turn source listings into JPEG images in an automated fashion, how would you do it? Bonus points if it offers some sort of pretty-print process (e.g., bolded keywords)!
Or, if you know what typically causes convert: geometry does not contain image, that might help. My current process is ugly, but if I could get it back to 100% reliability, that'd be just fine for now.
Thanks in advance!
You might consider html2ps and then ImageMagick's convert.
A thought: if your target (Kindle?) supports PNG, use that in preference to JPEG for this text rendering.
html2ps is an excellent program -- I used it to produce a 1300-page book once, but it's overkill if you just want plain text -> postscript. Consider enscript instead.
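For instance, something along these lines (options and file names purely illustrative) goes straight from source to PostScript to an image, with enscript's own syntax highlighting thrown in:

enscript -B --color -Ejava -o - Listing.java | convert -density 150 ps:- listing.png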
Because the question of converting HTML to JPG has been answered, I will offer a suggestion on the pretty printer. I've found Pygments to be pretty awesome. It supports different themes and has lexers for pretty much any language out there (they advertise the fact that it even highlights brainfuck). There's a command line tool and it's available on most Linux distros.
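Pygments can even render images directly through its image formatter (it needs PIL/Pillow installed for that); a minimal sketch, with the font and file names only as examples:

pygmentize -f png -O "font_name=DejaVu Sans Mono,line_numbers=False" -o listing.png Listing.java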
Your Linux distribution may include pango-view and an assortment of fonts.
This works on my FC6 system:
pango-view --font=DejaVuLGCSansMono --dpi=200 --output=/tmp/text.jpg -q /tmp/text
You'll need to identify a monospaced font that is installed on your system. Look around /usr/share/fonts/.
Pango supports Unicode.
Leave off the -q while you're experimenting; it'll display to a window instead of writing to a file.
Don't use jpeg. It's optimized for photographs and does a terrible job with text and line art. Use gif or png instead. My understanding is that gif is now patent-free, so I would just use that.