I'm working on cleaning OCR-ed documents produced by a commercial OCR engine. The quality of its output is poor: because of the very noisy background it often produces both undefined characters and misspellings within the text.
Our first approach is to apply some spelling correction with regular expressions.
Our second approach is to create a library of errors:

| Image segmentation of the misspelled word | Misspelled word | Human correction |
|--------------------------------------------|-----------------|------------------|
| Image segment of the word                  | clcar           | clear            |
The idea of the second approach is similar to the Google reCAPTCHA project. We will ask a lot of people to proofread the OCR-ed text. To speed up this process, only the misspelled words (recognised through a spell-checker) will be selected. The proofreaders will be given an image segment of the misspelled word from the PDF file and will have to correct it manually.
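For illustration, the selection step might look like the sketch below; pyspellchecker and the file name ocr_output.txt are only assumptions for the example, not what we actually use.

import re
from spellchecker import SpellChecker  # assumed spell-checker, for illustration only

spell = SpellChecker()

# Read the plain-text OCR output (hypothetical file name) and pull out the words.
with open("ocr_output.txt", encoding="utf-8") as f:
    words = re.findall(r"[A-Za-z]+", f.read())

# Words the dictionary does not know - the candidates to show to proofreaders.
misspelled = spell.unknown(words)
print(sorted(misspelled))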
However, we don't know how to map a misspelled word in the OCR-ed document back to its position in the original PDF file.
What are the best practices for this problem? Are there open-source implementations that do this sort of thing (OpenCV, algorithms, etc.)?
The general question here is: how do you mark up text for translation on an HTML page when the position of the line breaks has to look eye-pleasing (as opposed to the line break always happening after a specific word)?
I have a web page I want to translate into 5 different languages. In some places, I have text like "Enjoyed by 10,000 happy users" under a small icon that needs to be displayed in an eye-pleasing way. This looks good because the noun phrase is on its own line and each line has about the same number of letters:
<icon>
Enjoyed by
10,000 happy users
Do I send this text to be translated like this?
Enjoyed by <br> 10,000 happy users
Problems:
By adding markup to the text, it becomes unlikely that I can reuse the string elsewhere, but I can't see any other option.
How do I cope with where I place the <br> in the translated text, given that the translated text will have a different number of letters (e.g. "Genossen von 10.000 glückliche Benutzer" in German)? Do I just review how each one renders on the page manually and adjust the <br> myself after the translations come back?
I can't see any clean way to do this. I could remove the markup and try to write some server code that adds the break in a nice place, but I can't see how that could be automated (e.g. putting noun phrases on their own line if possible when the previous line has enough letters). CSS has even fewer options for doing this.
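For concreteness, a naive length-balancing heuristic might look like the sketch below (purely illustrative - it knows nothing about noun phrases, which is exactly the problem):

# Split a string into two lines at the word boundary that makes the
# two line lengths as even as possible.
def balance_two_lines(text):
    words = text.split()
    best = None
    for i in range(1, len(words)):
        line1, line2 = " ".join(words[:i]), " ".join(words[i:])
        diff = abs(len(line1) - len(line2))
        if best is None or diff < best[0]:
            best = (diff, line1 + "<br>" + line2)
    return best[1] if best else text

print(balance_two_lines("Enjoyed by 10,000 happy users"))
# Prints "Enjoyed by 10,000<br>happy users" - it happily splits the
# noun phrase "10,000 happy users" across the two lines.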
Your question is somewhat subjective, but I think your choices are to either trust your translators to format the HTML, or trust them to come up with copy that fits your design. Trying to engineer your way to a "clean" solution with server code sounds like it will achieve the exact opposite.
Make sure your design is good enough to cope with a reasonable range of word lengths. If your layout lives and dies by the text being exactly X characters long, then it isn't well designed. You can always ask your translators to try to keep a translation within a maximum number of characters. This is why we still have human translators - they are also copywriters :)
I need to identify handwritten text (ICR). There is no need to understand arbitrary text - I am able to instruct my users to write very clearly, with separated letters and so on. However, there will still be some amount of difference between any training set and the real letters.
I am hoping to train tesseract for this purpose. Has anyone tried this? Any hope in this path?
You must have fonts similar to those handwritten letters. You can create them with any font-design tool (a sample is here). Then you can follow the training process as described here.
I've started a simple project which must take an image containing text with superscripts and then, using OCR (currently I'm using tesseract), recognize the superscript characters as well as the normal ones.
For example, we have a chemical formula such as Cl², but when I use tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches on the more advanced features of any OCR system.
First of all, make sure you are not overlooking functionality that may already be there in the OCR system. Look at your test result not in plain TXT format, but in some kind of rich-text-capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you correct characters, your viewer could have converted them for display. If you are accessing the text result programmatically, that is less of an issue, because you should get the proper subscript character value when accessing it directly. Just note that the viewer must support it for you to actually see it. If you have ruled out this post-processing conversion and made sure that no subscript is returned from the OCR, then it probably does not support it.
Just like in this text box: in your original question you tried to give us a superscript character example, but this text box did not accept it, even though you could copy/paste it from elsewhere.
Many OCR engines will see superscript/subscript characters as any other normal characters, if they can see them at all. The OCR you use needs to have the technical capability to actually produce superscripts/subscripts, and many do, but unsurprisingly they tend to be commercial OCR systems.
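If your engine only returns plain characters, one rough post-processing workaround (not a feature of any product mentioned here, just an illustration using pytesseract's word-level boxes and arbitrary thresholds) is to flag words whose boxes are noticeably smaller and raised relative to the rest of their line:

import pytesseract
from PIL import Image
from pytesseract import Output

def flag_possible_superscripts(image_path):
    # Word-level boxes from tesseract: left/top/width/height for each word.
    data = pytesseract.image_to_data(Image.open(image_path), output_type=Output.DICT)

    # Group word indices by the text line they belong to.
    lines = {}
    for i, word in enumerate(data["text"]):
        if word.strip():
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            lines.setdefault(key, []).append(i)

    flagged = []
    for indices in lines.values():
        if len(indices) < 2:
            continue
        heights = sorted(data["height"][i] for i in indices)
        bottoms = sorted(data["top"][i] + data["height"][i] for i in indices)
        typical_h = heights[len(heights) // 2]   # median word height on this line
        baseline = bottoms[len(bottoms) // 2]    # median word bottom on this line
        for i in indices:
            small = data["height"][i] < 0.7 * typical_h
            raised = data["top"][i] + data["height"][i] < baseline - 0.2 * typical_h
            if small and raised:                 # small and sitting above the baseline
                flagged.append(data["text"][i])
    return flagged

print(flag_possible_superscripts("superscript_subscript_test_page.tif"))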
I made a small test case before answering this question. I generated an image with a few superscript/subscript examples for my testing (of course EMC2 was the first example that came to mind :) .
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exporting to a rich text format, such as MS Word .DOC.
You can find the OCR result here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you are interested in extracting superscript/subscript characters, pay more attention to your image quality than you would with typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters due to too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.
I am working with text which is, unfortunately, given in ALL CAPS. The default nltk.pos_tag function does not do a very good job on this text (it thinks everything is a proper noun).
What is the best way to deal with this issue?
The best would be to apply truecasing to your text before POS-tagging.
If that is too much effort for you, you can transform your Python string x to lower case using x.lower(); that should at least avoid the problem of getting only proper-noun tags (although you may then see the opposite confusion, with too few proper-noun tags).
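A minimal sketch with NLTK (assuming the punkt and averaged_perceptron_tagger resources are already downloaded):

import nltk

text = "THE EUROPEAN COMMISSION MET IN BRUSSELS ON MONDAY."

# On the raw ALL-CAPS text almost every token comes back tagged NNP.
print(nltk.pos_tag(nltk.word_tokenize(text)))

# Lowercasing first gives the tagger normal-looking input, at the cost of
# losing the capitalisation cues for genuine proper nouns.
print(nltk.pos_tag(nltk.word_tokenize(text.lower())))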
You could also train a POS tagger on a tagged corpus that has been transformed to lower case beforehand, but if you want the best results you probably want to go for truecasing.
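A rough sketch of that corpus idea, assuming the treebank corpus is available (a simple unigram tagger keeps the example short; the same transformation applies to any trainable tagger):

import nltk
from nltk.corpus import treebank

# Lowercase the training corpus so the tagger never relies on capitalisation.
train = [[(w.lower(), t) for w, t in sent] for sent in treebank.tagged_sents()]
tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("NN"))

tokens = nltk.word_tokenize("THE EUROPEAN COMMISSION MET IN BRUSSELS ON MONDAY.")
print(tagger.tag([w.lower() for w in tokens]))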
Does anyone know of a library or bit of code that converts British English to American English and vice versa?
I don't imagine there are too many differences (some examples that come to mind are doughnut/donut, colour/color, grey/gray, localised/localized), but it would be nice to be able to provide localised site content.
I've been working on one to convert US English to UK English. As I've discovered, it's actually a lot harder to write something to convert the other way, but I hope to get around to providing a reverse conversion one day.
This isn't perfect, but it's not a bad effort (even if I do say so myself). It'll convert most US spellings to UK ones but there are some words where UK English retains the US spelling (e.g. "program" where this refers to computer software). It won't convert words like pants to trousers because my main goal was simply to make the spelling uniform across the whole document.
There are also words such as practice and license where UK English uses either those or practise & licence, depending on whether the word is being used as a verb or a noun. For those two examples the conversion tool will highlight them, and an explanatory note pops up on the lower left of the screen when you hover your mouse over them. All word patterns which are converted are underlined in red, and the output is shown in a side-by-side comparison with your original input.
It'll do quite large blocks of text quite quickly, but I prefer to use it for just a couple of paragraphs at a time, copying them in from a Word doc.
It's still a work in progress so if anyone has any comments or suggestions then I'd appreciate feedback I can use to improve it.
http://www.us2uk.eu/
The difference between UK and US English is far greater than just a difference in spelling. There is also the hood/bonnet, sidewalk/pavement, pants/trousers idea.
Guess it depends how far you need to take it.
I looked forever to find a solution to this but couldn't find one, so I wrote my own bit of code for it, using a master list of ~20,000 different spellings that were freely available from the VarCon project and the language experts at Wordsworldwide:
https://github.com/HoldOffHunger/convert-british-to-american-spellings
Since I had two source lists, I used each one to cross-check the other, and I found numerous errors and typos (VarCon lists "preexistent"'s British equivalent as "preaexistent"). It is possible that I accidentally introduced typos too, but since I didn't do any wordsmithing here, I don't believe that to be the case.
Example:
require('AmericanBritishSpellings.php');
$american_british_spellings = new AmericanBritishSpellings();
$text = "Axiomatically ax that door, would you, my neighbour?";
$text = $american_british_spellings->SwapBritishSpellingsForAmericanSpellings(['text'=>$text]);
print($text); // output: Axiomatically axe that door, would you, my neighbor?
If you're thinking of converting from American English to British English, I personally wouldn't bother. Britain is very Americanised anyway; we accept silly Yank spellings on the net :)
I had a similar problem recently. I discovered the following tool, called VarCon. I haven't tested it out, but I needed a rough converter for some text data. Here's an example.
echo "I apologise for my colourful tongue ." | ./translate british american
# >> I apologize for my colorful tongue .
It looks like it works for various dialects. Be sure to read the README and proceed with caution.
Note: this will only correct spelling variations.