How to make Google's Custom Document Extractor OCR parse text in the right order when the text isn't perfectly straight? - ocr

I am working on a project where I need to parse information from a reference book. To do this, I am using Google's Custom Document Extractor. I have been annotating my first few scanned documents, but I have noticed a problem.
The problem is that when I select text to annotate, the OCRed text isn't detected in the right order. For example, if the actual text is "I walked to my home", it could be read as "my home I walked to". My guess is that Google's OCR determines the order of text by looking at which text is higher up on the page. Therefore, when a line isn't perfectly straight and the OCR parses the line as two separate blocks, it can put the second block first because the slant of the line makes the second block higher than the first one.
Here is a screenshot showing the problem (it's in French):
As you can see, the block I highlighted starts with "c) chaque réservoir ..." while the OCR parses it as "protection contre les ..." which is in fact at the end of the line (but higher up on the document technically because the text is slightly inclined).
This makes it extremely hard to put the text back in the right order afterwards, as there is no way to tell that it was wrongly interpreted without a human looking at the sentence and seeing that it makes no sense.
Note:
I thought of maybe rotating the PDF slightly the other way, so that the left-hand side is also higher up on the page, but this doesn't look too reliable.
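As an illustration of the ordering issue only, here is a minimal Python sketch that re-sorts words after removing the page skew. It does not call Document AI: the word dictionaries (keys "text", "x", "y", "h") and the baseline angle are hypothetical inputs that you would have to derive yourself from the bounding boxes in the OCR output, and the tolerance value is a guess.

```python
import math

def reading_order(words, baseline_angle_deg, line_tolerance=0.6):
    """Return the text of each line, top to bottom, after removing skew.

    words: list of dicts with hypothetical keys "text", "x", "y" (centre of
           the word box, y growing downwards) and "h" (box height).
    baseline_angle_deg: angle of the text baseline in image coordinates;
           negative when lines rise towards the right.
    """
    a = math.radians(baseline_angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)

    # Rotate every word centre by -angle so that baselines become horizontal.
    deskewed = []
    for w in words:
        x = w["x"] * cos_a + w["y"] * sin_a
        y = -w["x"] * sin_a + w["y"] * cos_a
        deskewed.append((y, x, w))
    deskewed.sort(key=lambda item: item[0])

    # Group words whose corrected y values are close into the same line.
    lines = []
    for y, x, w in deskewed:
        if lines and abs(y - lines[-1]["y"]) <= line_tolerance * w["h"]:
            lines[-1]["items"].append((x, w["text"]))
        else:
            lines.append({"y": y, "items": [(x, w["text"])]})

    # Inside each line, read left to right.
    return [" ".join(text for _, text in sorted(line["items"])) for line in lines]
```

With a line that rises by 2 degrees, the last word of one line can sit higher on the page than the first word of the same line; after the rotation both end up with nearly the same y value and are grouped together.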

Related

How to translate text/HTML that has stylistic line breaks?

The general question here is how do you mark text up for translation on an HTML page when the positions of the line breaks have to look eye pleasing (as opposed to the line break always happening after a specific word)?
I have a web page I want to translate into 5 different languages. In some places, I have text like "Enjoyed by 10,000 happy users" under a small icon that needs to be displayed in an eye pleasing way. This looks good as the noun phrase is on its own line and each line has about the same number of letters:
<icon>
Enjoyed by
10,000 happy users
Do I send this text to be translated as this?
Enjoyed by <br> 10,000 happy users
Problems:
By adding markup to the text it makes it unlikely I can reuse the string elsewhere but I can't see any other options.
How do I cope with where I place the <br> in the translated text, given that the translated text will have a different number of letters (e.g. "Genossen von 10.000 glückliche Benutzer" in German)? Just review how each one renders on the page manually and adjust the <br> myself after the translations come back?
I can't see any clean way to do this. I could remove the markup and try to write some server code that will add the break in a nice place, but I can't see how it's possible to automate (e.g. putting noun phrases on their own line if possible when the previous line has enough letters). CSS has even fewer options to do this.
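To make the difficulty concrete, here is a minimal Python sketch of the only part that is easy to automate: picking the break that makes the two lines closest in length. The function name balanced_break is made up for this illustration, and the code knows nothing about noun phrases.

```python
def balanced_break(text):
    """Split text at the space that makes the two lines closest in length."""
    words = text.split()
    if len(words) < 2:
        return text, ""

    def imbalance(i):
        # Difference in length between the first i words and the rest.
        return abs(len(" ".join(words[:i])) - len(" ".join(words[i:])))

    best = min(range(1, len(words)), key=imbalance)
    return " ".join(words[:best]), " ".join(words[best:])
```

For "Enjoyed by 10,000 happy users" this picks "Enjoyed by 10,000" / "happy users", which is better balanced by letter count than the hand-made break but splits the noun phrase. That is the part a length-based rule cannot see.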
Your question is somewhat subjective, but I think your choices are to either trust your translators to format the HTML, or trust them to come up with copy that fits your design. Trying to engineer your way to a "clean" solution with server code sounds like it will achieve the exact opposite.
Make sure your design is good enough to cope with a reasonable range of word lengths. If your layout lives and dies by the text being exactly X characters long, then it isn't well designed. You can always ask your translators to try and write a translation in less than a maximum number of characters. This is why we still have human translators - they are also copywriters :)

Markup-less way to title and link abbreviations/acronyms to glossary entries

Background: I'm writing a DocBook 5 document (and including in it some already-written text) with the intention of generating HTML from it. I would like to get the semantic markup correct from the beginning so I don't need to re-do it later, but the standard way does not seem to generate what I'm looking for, so I'm not sure if I should deviate from it or not, depending on what is possible with XSL.
Current setup: My glossary only has abbreviated items. It consists of glossentrys each containing a full-spelling glossterm and some non-zero amount of acronyms and/or abbrevs. I suppose it could all just be glossterms instead. It doesn't matter to me. Suppose for example I have this:
<glossentry xml:id="ff"><glossterm>Firefox</glossterm>
<acronym>FF</acronym>
<glossdef>
<para>The web browser made by Mozilla.</para>
</glossdef>
</glossentry>
Ideal
Suppose wherever I want to refer to Firefox, I put FF in the text. Ideally, without any additional markup, wherever "FF" (case sensitive) appeared as a plain whole word in a paragraph (or title, but not, for example, code or programlisting or inside a URL inside an attribute...) in my DocBook file, it would come out in HTML as the text "FF" but marked up as a link to the glossary entry, but not with the standard link CSS, and furthermore with a title attribute having the value Firefox. That way a reader can hover to get the acronym/abbreviation spelled out for them, and if that is insufficient, they can click to be taken to a fuller definition. Meanwhile I would style it black and underlined, so that they know this feature is there, but it doesn't distract one's attention like a normal link does, especially with how often it occurs in the text.
Main question: is such replacement of plaintext, markup-less terms even possible in XSL (without creating something like the Scunthorpe problem)? If so, can it do this for every acronym or abbrev found in the glossary, automatically?
I could not figure out how to do this directly, but that is still my goal. Meanwhile I've tried other things:
Approach 1
Set up a keyboard macro so I can type ff and have that be transformed while I'm typing into <xref linkend="ff"/>.
Pros:
links to the glossary
spells out the abbreviation
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)...workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for a.xref
even with the above two worked around, the title comes out as FFFirefox instead of just Firefox (and with others that have more than one synonym, the mashing-together continues)...solution: put an alternate xml:id on your preferred acronym/abbrev, and then make the links in it refer to that id in their endterm attribute (as well as the linkend referring to the first id)
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
Approach 2
Using <xsl:param name="glossterm.auto.link" select="1"/> and a keyboard macro to change FF into <glossterm>FF</glossterm>.
Pros:
links to the glossary
Cons:
spells out the abbreviation (it would be nice to keep it short to read, not just short to type)...workaround: make the acronym into a glossterm and put it first in the glossentry (loss of semantics, but maybe that's OK here?)
links to the glossary (I would like it styled differently)...solution: CSS for .glossterm
even with the above two worked around, no title attribute is given
I have to remember to use the keyboard macro rather than just typing and letting the system do the work; any imported text then has to have text replacements done on it for each glossary entry
Approach 1 so far seems better after the workarounds, but is there a way to achieve the ideal I outlined?
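For reference, the whole-word, case-sensitive matching described under "Ideal" can be sketched outside of XSL. The following Python is only an illustration on a plain string: the function name, the glossary dictionary shape, and the "glossary" CSS class are assumptions, and it does not know how to skip code, programlisting, or attribute values in a real DocBook tree.

```python
import re
from html import escape

def link_acronyms(text, glossary):
    """Wrap whole-word acronyms in links carrying the spelled-out term.

    glossary maps an acronym to (glossary entry id, spelled-out term),
    for example {"FF": ("ff", "Firefox")}. Matching is case sensitive.
    """
    if not glossary:
        return escape(text)

    # Longest acronyms first so that a longer entry wins over its prefix.
    names = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    out = []
    # With one capturing group, re.split alternates plain text and matches.
    for i, part in enumerate(pattern.split(text)):
        if i % 2 == 1:
            entry_id, full = glossary[part]
            out.append('<a class="glossary" href="#%s" title="%s">%s</a>'
                       % (escape(entry_id), escape(full), escape(part)))
        else:
            out.append(escape(part))
    return "".join(out)
```

The \b word boundaries are what avoid the Scunthorpe-style false positives: "FF" is linked, while "FFmpeg" and "OFF" are left alone.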

MVC 4 - Displaying HTML vs. Straight Text

I have an MVC 4 View where the user can enter straight text or HTML into a text area control. When this text is displayed, I use @Html.Raw() to display it. If the user entered HTML, everything displays based on the HTML. If he/she didn't, all the line breaks are ignored and the text just runs together.
So, what I would like to do is to somehow test to see if the user entered HTML or straight text. If straight text, when displaying the text, I'd like to replace all the line break characters with an HTML break tag to maintain the formatting.
Is there a somewhat reliable way to detect if the text contains HTML?
Is there a better/easier way to do what I'm trying to do?
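For the plain-text case, the conversion itself is small. Here is a minimal sketch in Python (used only for illustration, not Razor/C#; the function name is made up): encode the text first, then turn line breaks into break tags. It does not answer the detection question.

```python
from html import escape

def plain_text_to_html(text):
    """Encode plain text for HTML and keep its line breaks visible."""
    # Normalise Windows and old Mac line endings to "\n" first.
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # Escape before inserting <br>, otherwise the tags would be escaped too.
    return escape(normalized).replace("\n", "<br>\n")
```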
Is there a somewhat reliable way to detect if the text contains HTML?
Not really. The problem becomes hardest when someone is writing a plain text entry that discusses HTML.
Is there a better/easier way to do what I'm trying to do?
I quite like Stackoverflow's approach. Just use markdown and provide clear instructions on how to use it beside the editing window.

Shortened HTML text and malformed tags

In my web application I intend to shorten a lengthy string of HTML formatted text if it is more than 300 characters long and then display the 300 characters and a Read More link on the page.
The issue I came across is when the 300 character limit is reached inside an HTML tag, example: (look for HERE)
<a hreHERE="somewhere">link</a>
<a hre="somewhere">liHEREnk</a>
When this happens, the entire page could become ill-formatted because everything after the HERE in the previous example is removed and the HTML tag is kept open.
I'm thinking of using CSS to hide any overflow beyond a certain limit and create the "Read More" link if the text is beyond a certain length, but this would entail including all the text on the page.
I've also thought about splitting the text at . to ensure that it's split at the end of a sentence, but that would mean I would include more characters than I needed.
Is there a better way to accomplish this?
Note: I have not specified a server side language because this is more of a general question, but I'm using ASP.NET/C# .
Extract the plaintext from the HTML, and display that. There are libraries (like the HTML Agility Pack for .NET) that make this easy, and it's not too hard to do it yourself with an XML parser. Trying to fix a truncated HTML snippet is a losing cause.
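A minimal sketch of the plaintext approach, written in Python with the standard library's html.parser instead of the HTML Agility Pack, purely as an illustration. The function name summarize and the "..." suffix are assumptions, and the text inside script or style elements is not filtered out.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects only the text between tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def summarize(html_text, limit=300):
    """Return at most `limit` characters of plain text taken from html_text."""
    parser = _TextExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace so the limit counts visible characters.
    text = " ".join("".join(parser.chunks).split())
    if len(text) <= limit:
        return text
    return text[:limit].rstrip() + "..."
```

Because the tags are gone before the cut, the limit can never fall inside one.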
One option I can think of is to cut it off at 300 characters and make sure the last index of '<' is less than the last index of '>'. If it is not, the cut landed inside a tag, so truncate the string right before the last instance of '<'. Then use a library like HTML Tidy to fix tags that are orphaned (like the </a> in the example).
There are problems with this, though. One is that if the first 300 characters are nothing but HTML, your summary will be displayed as empty.
If you do not need the HTML to be displayed, it's far easier to simply extract the plain text and use that instead.
EDIT: Added using something like HTML Tidy for orphaned tags. The original answer only solved cutting mid-tag, rather than between an opening and a closing tag.
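The mid-tag check can be sketched like this in Python (illustration only; the function name is made up). It cuts before the last '<' whenever that '<' comes after the last '>', and it leaves the repair of unclosed tags to a separate tidy step that is not shown.

```python
def cut_outside_tag(html_text, limit=300):
    """Truncate html_text to `limit` characters without ending inside a tag."""
    snippet = html_text[:limit]
    last_open = snippet.rfind("<")
    last_close = snippet.rfind(">")
    # A '<' after the last '>' means the cut landed inside a tag.
    if last_open > last_close:
        snippet = snippet[:last_open]
    return snippet
```

In the second example from the question the cut falls inside the link text, so the result still contains an unclosed <a> tag, which is what the tidy step is for.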

Find and Replace and a WYSIWYG Editor

My problem is as follows:
I have a column: ProductName. Now, the text entered here is entered from tinyMCE so it has all kinds of tags. The user wants to be able to do a Find-And-Replace on all products, and it has to support coloring.
For example - let's say this is a portion of a ProductName:
other text.. <strong>text text <font color="#ff6600">colortext®</font></strong> ..other text
Now, the user wants to replace this:
<font color="#ff6600">colortext®</font>
The original name has the <strong> tag in it, so it appears bold. So the user makes the search text bold too, and now the text he is searching for is:
<strong><font color="#ff6600">colortext®</font></strong>
Obviously I'm not going to find it. Plus there's the matter of spaces: in one place it has a space in another it doesn't.
Is there a way to overcome this?
Strip the HTML tags from the search text and do a plain text search first. Then, part by part (i.e., text node by text node), take the element path of the search text's parts, and compare these with their counterparts in the found text. If the paths for all parts match, you're done.
Edit: By path, I meant something similar to XPath, or the path notion of the TinyMCE editor. Example: the plain text part of the search text is "colortext®". The path of this text node in the search text is <strong>/<font color="#ff6600">. Search for the same plain text in the text body (trivial), and take its path, which is also <strong>/<font color="#ff6600">. (Compare this with the path of "other text..", which is /, and of "text text", which is <strong>.) The two paths are the same, so this is a real match. If you have a DOM tree representation, determining the path shouldn't be difficult.
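A rough Python sketch of the path comparison, using the standard library's html.parser rather than a real DOM. The names text_paths and same_formatting are made up for this illustration, and matching is simplified to whole text nodes.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}

class _PathCollector(HTMLParser):
    """Records every non-empty text node with the stack of open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # open elements as (tag, sorted attributes)
        self.nodes = []   # (text, path) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append((tag, tuple(sorted(attrs, key=lambda kv: kv[0]))))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.nodes.append((data.strip(), tuple(self.stack)))

def text_paths(fragment):
    parser = _PathCollector()
    parser.feed(fragment)
    parser.close()
    return parser.nodes

def same_formatting(search_fragment, body_fragment):
    """True if every text node of the search occurs in the body with the same path."""
    found = text_paths(body_fragment)
    return all(node in found for node in text_paths(search_fragment))
```

With the ProductName from the question, the search text wrapped in <strong> and <font> matches, while the same text wrapped in <font> alone does not, because its path differs.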
You're asking for several related, but discrete, abilities:
Search and Replace content
Search and Replace formatting
Search and Replace similar (ie, ignore trivial differences in whitespace)
You should take this in steps. Otherwise it becomes overwhelming, and a single search algorithm won't be able to do all three without intense effort and difficult-to-maintain code.
First, look at the similar problem. Make a search that ignores spaces and case. You might want to get into Lucene or another search engine technology if you also need to deal with "bowl" vs "bowls" and "intelligent" vs "smart" - though I expect this is beyond your current needs.
Once you have that working, it becomes one layer in your stack of searches.
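A minimal sketch of that first layer in Python, assuming the search runs on plain text with the tags already removed. The function name loose_find is made up, and this ignores whitespace completely, which may be looser than you want.

```python
import re

def loose_find(needle, haystack):
    """Return (start, end) spans where needle occurs, ignoring case and whitespace."""
    chars = [c for c in needle if not c.isspace()]
    if not chars:
        return []
    # Allow any amount of whitespace between consecutive characters.
    pattern = r"\s*".join(re.escape(c) for c in chars)
    return [m.span() for m in re.finditer(pattern, haystack, flags=re.IGNORECASE)]
```

The spans refer to positions in the original text, so a later layer can check the formatting of exactly those characters.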
Second, look at the formatting search. This is typically done using tokens or tags - which you already have in the form of HTML. However, you have to be able to deal with things out of sequence - so <b><i>text</i></b> needs to be caught in a search for <i><b>text</b></i>, as does the malformed representation where tags aren't nested properly, such as <b><i>text</b></i>.
One way to do this is to pre-parse the string and apply the formatting styles to each character, so you'd have a t that's bold and italic, an e that's bold and italic, and so on. To make this easier and faster, use a hash to represent each style combination: read the first character, figure out what style it has (keep track of this by turning styles on and off as you find tags), and if that style combination already exists in the hash, assign its hash number to the letter. If it doesn't, create a new hash number and assign that.
Now you can compare the letter and its style hash against your search and get format and content matches. Stack that on top of your similar match and you have what you need.
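A small Python sketch of the per-character style idea, using html.parser from the standard library. The function name styled_chars is made up, the style table is a plain dict shared between calls, and tag attributes (such as the font color) are ignored to keep it short.

```python
from html.parser import HTMLParser

class _StyledChars(HTMLParser):
    """Tags each character with a number identifying its set of open tags."""

    def __init__(self, style_ids):
        super().__init__(convert_charrefs=True)
        self.active = []            # tags currently switched on
        self.style_ids = style_ids  # frozenset of tags -> small integer
        self.chars = []             # (character, style number)

    def handle_starttag(self, tag, attrs):
        self.active.append(tag)

    def handle_endtag(self, tag):
        # Switch the style off even if the tags are closed out of order.
        if tag in self.active:
            index = len(self.active) - 1 - self.active[::-1].index(tag)
            del self.active[index]

    def handle_data(self, data):
        style = frozenset(self.active)
        number = self.style_ids.setdefault(style, len(self.style_ids))
        self.chars.extend((c, number) for c in data)

def styled_chars(fragment, style_ids):
    """Return [(character, style number), ...] for fragment."""
    parser = _StyledChars(style_ids)
    parser.feed(fragment)
    parser.close()
    return parser.chars
```

Passing the same style_ids dict to every call keeps the numbers comparable, so <b><i>text</i></b>, <i><b>text</b></i> and the badly nested <b><i>text</b></i> all produce the same list.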
-Adam
If it's valid XML, an XSLT would be trivial for this kind of exercise.
Use an identity template, and then add an XPath to find the specific node you want:
<xsl:template match="//strong/font">
<xsl:copy>
<!-- Insert the replacement text here -->
</xsl:copy>
</xsl:template>
When working with XML, this would be a maintainable, extensible solution.
I'm not sure I understand everything you said, but regular expressions seem like a good way to overcome the problem you're describing.