Optical recognition of text and analysis of its structure (title, subtitle, text body) - deep-learning

We wish to analyze scans of documents containing (non-handwritten) text and images, with a very broad range of arrangements/structures, in different languages. The first problem we are trying to solve is extracting the text and identifying and separating titles, subtitles, and text bodies.
At the moment we are doing literature research. There is plenty of literature about deep learning, computer vision, optical character recognition, and natural language processing, but none of it is actually focused on optical recognition of the structure of text.
We wonder: what is the name of the discipline/field that deals with optical recognition of the structure of text?
What are the state-of-the-art approaches and tools for solving these problems?

Optical Layout Recognition (OLR), also commonly called document layout analysis. A good example of an open-source tool for Layout Analysis and Region Extraction can be found here.
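As a rough illustration of what classical layout analysis does, here is a minimal sketch in Python with OpenCV; the file name, kernel size, and height thresholds are illustrative assumptions, and dedicated OLR tools are considerably more sophisticated:

# Minimal layout-analysis sketch: merge characters into text blocks,
# then classify blocks as title/subtitle/body by relative height.
import cv2

image = cv2.imread("scan.png")  # hypothetical input file
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
# Invert and binarize so text becomes white on black for morphology.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
# Dilate horizontally so neighboring characters merge into block regions.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 5))
blocks = cv2.dilate(binary, kernel, iterations=2)
contours, _ = cv2.findContours(blocks, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

regions = [cv2.boundingRect(c) for c in contours if cv2.contourArea(c) > 100]
heights = sorted(h for _, _, _, h in regions)
median_h = heights[len(heights) // 2] if heights else 0

for x, y, w, h in sorted(regions, key=lambda r: r[1]):  # top-to-bottom
    # Crude heuristic: noticeably taller regions are titles/subtitles.
    if h > 2.0 * median_h:
        label = "title"
    elif h > 1.3 * median_h:
        label = "subtitle"
    else:
        label = "body"
    print(label, (x, y, w, h))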

Related

Recognize MICR font using an OCR engine?

I am using the Microsoft OCR Library for reading text.
The Microsoft OCR library works perfectly. However, I want to read the list of characters given in this link: http://www.ict4u.net/databases/database-images/micr.jpg. Is there a way I can train the OCR library to read these characters, or is there a language pack that can read them?
[Microsoft OCR crew here] We don't yet support training OCR to customize it for your use cases. However, we do actively keep an eye on Stack Overflow to see what developers need, so we can keep improving the OCR engine.
I have been working with Microsoft OCR for a while now.
Compared with Tesseract it has very basic functionality.
For example, Microsoft OCR returns words and lines.
But the lines are nonsense: randomly, 2 or 3 words are grouped together as a "line", yet they do not form a real line, and the "lines" are completely unordered. In this respect it is worse than Tesseract. You have to take the coordinates of each word and order them on your own.
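The regrouping described here might look roughly like this (a Python sketch; the word records with text/x/y/height fields are hypothetical stand-ins for the bounding rectangles your OCR library actually returns):

# Regroup unordered OCR words into reading-order lines.
def group_into_lines(words, y_tolerance=0.6):
    lines = []
    for word in sorted(words, key=lambda w: w["y"]):
        for line in lines:
            # Same line if tops are close relative to the word height.
            if abs(line[0]["y"] - word["y"]) < y_tolerance * word["height"]:
                line.append(word)
                break
        else:
            lines.append([word])
    # Words left-to-right within each line; lines are already top-to-bottom.
    return [sorted(line, key=lambda w: w["x"]) for line in lines]

words = [
    {"text": "world", "x": 120, "y": 11, "height": 20},
    {"text": "Hello", "x": 10, "y": 10, "height": 20},
]
for line in group_into_lines(words):
    print(" ".join(w["text"] for w in line))  # -> "Hello world"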
Microsoft does not return the rectangles of characters and there is absolutely no way to configure or train Microsoft OCR in any way. You can add languages with Windows Update for "Basic Typing" = OCR (see http://www.thewindowsclub.com/install-uninstall-languages-windows-10), but you cannot train your own language data.
MSDN says that the following 25 languages are supported with different accuracy:
Excellent: Czech, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Polish, Portuguese, Romanian, Serbian Cyrillic, Serbian Latin, Slovak, Spanish and Swedish.
Very good: Chinese Simplified, Greek, Japanese, Russian and Turkish.
Good: Chinese Traditional and Korean.
The recognition quality is very similar to Tesseract's. It even has exactly the same problems as Tesseract: some single characters are not recognized (isolated symbols like a lone '$'), and it has the same huge problem with asterisks that Tesseract has. It also inserts spaces in the wrong places, just as Tesseract does. So I ask myself whether Microsoft is using Tesseract under the hood.
However, Microsoft OCR has an advantage over Tesseract: the image preprocessing is much better. It does not matter whether you have red text on a yellow background or white text on black. This is a catch with Tesseract, which needs a good-quality black-and-white image as input.
The following applies to both OCR libraries: if you have recognition problems, try to enlarge the image. Even blurring the image may be very helpful, because this removes noise from the image.
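For instance, a minimal preprocessing sketch along those lines (using OpenCV; the parameter values are illustrative assumptions, not vendor recommendations) might be:

# Upscale, denoise, and binarize an image before feeding it to OCR.
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # hypothetical input
# Enlarge: small glyphs are recognized better at higher resolution.
img = cv2.resize(img, None, fx=2.0, fy=2.0, interpolation=cv2.INTER_CUBIC)
# A slight blur removes speckle noise without destroying strokes.
img = cv2.GaussianBlur(img, (3, 3), 0)
# Otsu binarization yields the black-and-white input Tesseract prefers.
_, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
cv2.imwrite("preprocessed.png", img)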

training tesseract for handwritten text

I need to identify handwritten text (ICR). There is no need to understand arbitrary text: I am able to instruct my users to write very clearly, with separated letters, etc. However, there will still be some amount of difference between any training set and the real letters.
I am hoping to train tesseract for this purpose. Has anyone tried this? Any hope in this path?
You must have fonts similar to those handwritten letters. You may create them with any font-design tool (a sample is here). Then you can follow the training process as described here.
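Assuming the training succeeds and produces a handwriting.traineddata file in Tesseract's tessdata directory, using it from Python via pytesseract is then a one-liner (both the language name "handwriting" and the use of pytesseract are my assumptions for illustration):

# Run Tesseract with a custom trained language pack.
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open("form.png"), lang="handwriting")
print(text)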

OCR and Distinguishing Between 2 or 3 Fonts

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:
Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.
Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.
I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."
I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg02157.html), if that's relevant.
Is there anything out there that I can use to do this simple formatting classification?
Edit:
Is there anything out there that will do this without costing me an arm and a leg?
I'm not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, the ABBYY OCR SDK can not only identify bold/italic font styles, it can also determine the proper font face to use in the output.
Based on what you describe, I guess you are trying to determine the document's style hierarchy, such as header levels. ABBYY FineReader Engine provides this functionality, so you don't have to engage in font size-and-style heuristics to infer each text's purpose. Besides, it provides the best OCR quality and it's free to try. Consider trying it out if you are planning commercial software. I work at ABBYY and can provide you more info about our OCR SDK if necessary.
Best regards.
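If a commercial SDK costs an arm and a leg, a rough do-it-yourself alternative is to cluster Tesseract's own word bounding-box heights; a minimal sketch (assuming pytesseract, with an illustrative threshold rather than a tuned one):

# Split OCR'd words into "fancy/large" vs "plain/small" by box height.
import pytesseract
from PIL import Image

data = pytesseract.image_to_data(
    Image.open("menu.png"), output_type=pytesseract.Output.DICT
)
words = [
    (data["text"][i], data["height"][i])
    for i in range(len(data["text"]))
    if data["text"][i].strip()
]
heights = sorted(h for _, h in words)
median = heights[len(heights) // 2]

for text, h in words:
    # Words clearly taller than the median are likely the fancy title font.
    style = "fancy" if h > 1.5 * median else "plain"
    print(f"{style:5s} {text}")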

What are situations with western languages where you'd use HTML 5's Ruby element?

HTML 5 is introducing a new element: <ruby>; here's the W3C's description:
The ruby element allows one or more spans of phrasing content to be marked with ruby annotations. Ruby annotations are short runs of text presented alongside base text, primarily used in East Asian typography as a guide for pronunciation or to include other annotations. In Japanese, this form of typography is also known as furigana.
They then go on to give a few examples of ruby annotations in use for Chinese and Japanese text. I'm wondering, though: is this element going to be useful only for East Asian HTML documents, or are there good semantic applications for the <ruby> element in Western languages like English, German, Spanish, etc.?
id-ee-oh-SINK-ruh-sees
Could be useful for people learning English, as our writing system has many idiosyncrasies that make it somewhat less than phonetic.
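In markup, that example could be expressed with <ruby> and <rt>. A minimal sketch that just assembles the string in Python (my illustration, not from the answer; <rt> holds the pronunciation shown above the base text, <rp> provides fallback parentheses for browsers that do not render ruby):

def ruby(base, annotation):
    return f"<ruby>{base}<rp>(</rp><rt>{annotation}</rt><rp>)</rp></ruby>"

print(ruby("idiosyncrasies", "id-ee-oh-SINK-ruh-sees"))
# -> <ruby>idiosyncrasies<rp>(</rp><rt>id-ee-oh-SINK-ruh-sees</rt><rp>)</rp></ruby>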
As a linguist, I can see the benefits in using <ruby> for marking up linguistic examples with various theoretical notational conventions. One example that comes to mind is indicating tonal levels in autosegmental phonology. Here's a quick example I threw together that can be seen in the latest Webkit/Chromium (at least):
http://miketaylr.com/code/western_ruby.html
Currently, this type of notation is left to LaTeX and friends, and if on the web, it is generally a non-accessible image.
As I understand it, ruby annotations are not really relevant in Western languages because Western alphabets are (more or less) phonetic. In Japanese they are used to give a pronunciation guide for logographic characters which don't have obvious pronunciations (unless you've memorized them). I suppose the Western analog would be IPA notation in brackets following a word, but those are rarely used and I don't know if Ruby annotations would be appropriate for them.
My list:
theoretical notational conventions (miketaylr's answer): http://miketaylr.com/code/western_ruby.html
language learning (Adam Bellaire's answer): "id-ee-oh-SINK-ruh-sees" aligned over "idiosyncrasies", made with ASCII 'nbsp' art
abbreviation, acronym, initialism (possibly; why hover?)
learning technical terms of English origin that were translated into your non-English native language
I'm often forced to do the latter at university. While the translated terminology is often consistent, very often it is not at all self-explanatory, or not as much as the original English term.
Also the same term may have been translated using several translation systems by different authors/groups.
Another problem arises when, for example, queue, row, and series (and sometimes tuple) are all translated to the very same word in your language.
For a Western language with fewer speakers, and given the low percentage of technical people in the population, this actually makes it much easier to learn the topic directly from English and then learn the translations in a second step.
Ruby could be a tool to transform this into a one-step process, providing either the translations or the original as a "Furigana".

How does Google use HTML tags to enhance the search engine?

I know that Google's search algorithm is mainly based on PageRank. However, it also analyzes and uses the structure of the document (H1, H2, title, and other HTML tags) to enhance the search results.
What is the name of this technique "using the document structure to enhance the search results"?
And are there any academic papers to help me study this area?
The fact that Google takes the HTML structure into account is well covered in SEO articles; however, I could not find it in academic papers.
I think it's called "Semantic Markup"
[...] semantic markup is markup that is descriptive enough to allow us and the machines we program to recognize it and make decisions about it. In other words, markup means something when we can identify it and do useful things with it. In this way, semantic markup becomes more than merely descriptive. It becomes a brilliant mechanism that allows both humans and machines to “understand” the same information. http://www.digital-web.com/articles/writing_semantic_markup/
A more practical article here
http://robertnyman.com/2007/10/29/explaining-semantic-mark-up/
SEO has become almost a religion to some people where they obsess about minutiae. Frankly, I'm not convinced that all this effort is justified.
My advice? Ignore what so-called pundits say and just follow Google's guidelines.
You might be looking for an academic answer but honestly, this isn't an academic question beyond the very basics of how Web indexing works. The reality of a modern page indexing and ranking algorithm is far more complex.
You may want to look at one of the earlier works on search engines. Note the authors' names. You may also want to read Google Patent application 20050071741.
These general principles aside, Google's search algorithm is constantly tweaked based on actual and desired results. The exact workings are a closely guarded secret just to make it harder for people to game the system. Much of the "advice" or descriptions on how Google's search algorithm works is pure supposition.
So, apart from having a title and having well-formed and valid HTML, I don't think you're going to find what you're looking for.
Google very deliberately doesn't give away too much information about its search algorithm, so it's unlikely you will find a definitive answer or academic paper that confirms this. If you're interested from an SEO point of view, just write your pages so they are good for humans, and the robots will like them too.
To make a page good for humans, you SHOULD use tags such as h1, h2 and so on to create a hierarchical page layout... a bit like this...
h1 "Contact Us"
...h2 "Contact Details"
......h3 "Telephone Numbers"
......h3 "Email Addresses"
...h2 "How To Find Us"
......h3 "By Car"
......h3 "By Train"
The difficulty with your question is that if you put something in your h1 tag hoping it would increase your position in Google, but it didn't match up with other content on your page, you could look like you are spamming. Similarly, if your page is made up of too many headings and not enough actual content, you could look like you are spamming. It's not as simple as adding an h1 and an h2 tag and you'll go up! That's why you need to write websites for humans, not robots.
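To see why that hierarchy matters to a machine as well as to a human, here is a minimal sketch (my illustration, using Python's standard html.parser; the sample HTML mirrors the outline above) of how an indexing program could recover a page's heading outline:

# Extract an indented heading outline from HTML.
from html.parser import HTMLParser

class OutlineParser(HTMLParser):
    HEADINGS = {"h1": 0, "h2": 1, "h3": 2}
    def __init__(self):
        super().__init__()
        self.depth = None
    def handle_starttag(self, tag, attrs):
        # Remember nesting depth while inside a heading tag.
        self.depth = self.HEADINGS.get(tag)
    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.depth = None
    def handle_data(self, data):
        if self.depth is not None and data.strip():
            print("  " * self.depth + data.strip())

OutlineParser().feed(
    "<h1>Contact Us</h1><h2>Contact Details</h2>"
    "<h3>Telephone Numbers</h3><h3>Email Addresses</h3>"
    "<h2>How To Find Us</h2><h3>By Car</h3><h3>By Train</h3>"
)
# Prints the same indented outline as the example above.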
I have found this paper:
A New Study on Using HTML Structures to Improve Retrieval
However, it is an old paper (1999); I am still looking for more recent ones.
Check out
http://jcmc.indiana.edu/vol12/issue3/pan.html
http://www.springerlink.com/content/l22811484243r261/
Some time spent on scholar.google.com might help you find what you are looking for
You can also try searching the 'Computer Science' section of arXiv: http://arxiv.org for "search engine" and the various terms that others have suggested.
It contains many academic papers, all freely available... hopefully some of them will be relevant to your research. (Of course the caveat of validating any paper's content applies.)
Like cletus said, follow the Google guidelines.
I did a few tests and came to the conclusion that the title, image alt, and h tags are the most important. Also worth mentioning is Google AdSense: I had the feeling that if you implement these, the rank of your site increases.
I believe what you are interested in is called structural fingerprinting, and it is often used to determine the similarity of two structures. In Google's case, this means applying a weight to different tags and feeding them to a secret algorithm that (probably) uses the frequencies of the different elements in the fingerprint (a toy sketch follows at the end of this answer). This is deeply rooted in information theory; if you are looking for academic papers on information theory, I would start with "A Mathematical Theory of Communication" by Claude Shannon.
I would also suggest looking at microformats and RDF. Both are used to enhance searching. These are mostly search-engine agnostic, but there are some Google-specific things as well. For Google-specific guidelines for HTML content, read this link.
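As the toy illustration promised above: a tag-frequency "structural fingerprint" and a cosine similarity between two documents (the representation and the unweighted counting are my assumptions for illustration; Google's actual weighting is, as noted, secret):

# Fingerprint HTML documents by tag frequency and compare them.
from collections import Counter
from html.parser import HTMLParser
import math

class TagCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.counts = Counter()
    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1

def fingerprint(html):
    parser = TagCounter()
    parser.feed(html)
    return parser.counts

def cosine(a, b):
    # Cosine similarity of two sparse tag-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = "<html><body><h1>T</h1><p>x</p><p>y</p></body></html>"
doc2 = "<html><body><h1>T</h1><p>x</p></body></html>"
print(cosine(fingerprint(doc1), fingerprint(doc2)))  # close to 1.0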
In short: very carefully. At length:
A quote from "The Anatomy of a Large-Scale Hypertextual Web Search Engine":
[...] This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font. [...]
It goes on:
[...] Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic and companies which deliberately manipulating search engines for profit become a serious problem. This problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. [...]
The paper "Challenges in Web Search Engines" addresses these issues in a more modern fashion:
[...] Web pages in HTML fall into the middle of this continuum of structure in documents, being neither close to free text nor to well-structured data. Instead HTML markup provides limited structural information, typically used to control layout but providing clues about semantic information. Layout information in HTML may seem of limited utility, especially compared to information contained in languages like XML that can be used to tag content, but in fact it is a particularly valuable source of meta-data in unreliable corpora such as the web. The value in layout information stems from the fact that it is visible to the user [...]
And adds:
[...] HTML tags can be analyzed for what semantic information can be inferred. In addition to the header tags mentioned above, there are tags that control the font face (bold, italic), size, and color. These can be analyzed to determine which words in the document the author thinks are particularly important. One advantage of HTML, or any markup language that maps very closely to how the content is displayed, is that there is less opportunity for abuse: it is difficult to use HTML markup in a way that encourages search engines to think the marked text is important, while to users it appears unimportant. For instance, the fixed meaning of the h1 tag means that any text in an h1 context will appear prominently on the rendered web page, so it is safe for search engines to weigh this text highly. However, the reliability of HTML markup is decreased by Cascading Style Sheets, which separate the names of tags from their representation. There has been research in extracting information from what structure HTML does possess. For instance, [Chakrabarti et al., 2001; Chakrabarti, 2001] created a DOM tree of an HTML page and used this information to increase the accuracy of topic distillation, a link-based analysis technique.
There are a number of issues a modern search engine needs to combat, for example web spam and black-hat SEO schemes:
Combating Web Spam with TrustRank
Web Spam Taxonomy
Detecting Spam Web Pages through Content Analysis
But even in a perfect world, e.g. after eliminating the bad apples from the index, the web is still an utter mess, because no two sites have identical structures. There are maps, games, videos, photos (Flickr), and lots and lots of user-generated content. In other words, the web is still very unpredictable.
Resources
Hypertext and the web:
Extracting knowledge from the World Wide Web
Rich media and web 2.0
Thresher: automating the unwrapping of semantic content from the World Wide Web
Information retrieval
To keep it painfully simple: make your information architecture logical. If the most important elements for user comprehension are highlighted with headings and grouped logically, then the document is easier to interpret using information-processing algorithms. Magically, it will also be easier for users to interpret. Remember, the search engine algorithms were written by people trying to interpret language.
The Basic Process is:
Write well-structured HTML, using header tags to indicate the most critical elements on the page. Use logical tags based on the structure of your information: lists for lists, headers for major topics.
Supply relevant alt attributes and names for any visual elements, and then use simple CSS to arrange these elements.
If the site works well for users and contains relevant information, you don't risk becoming a black listed spammer, and search engine algorithms will favor your page.
I really enjoyed the book Transcending CSS for a clean explanation of properly structured HTML.
I suggest trying Google Scholar as one of your avenues when looking for academic articles.
semantic search
I found it interesting that, with no meta keywords or description provided, in a scenario like this:
<p>Some introduction</p>
<h1>headline 1</h1>
<p>text for section one</p>
the "text for section one" is always shown on the search results page.
A new tag called CANONICAL (rel="canonical") from Google can now also be used; click HERE.