I'm looking for algorithms, papers, or software to enhance faxes, images from cell phone cameras, and other similar sources for readability and OCR.
I'm mainly interested in simple enhancements (e.g. things you could do using ImageMagick), but I'm also interested in more sophisticated techniques. I'm already talking to vendors, so for this question I'm mostly looking for algorithms or open source software.
To further clarify: I'm not looking for OCR software or algorithms; I'm looking for algorithms to clean up the image so it looks more readable to the human eye, and can possibly be used for OCR.
I had a similar problem when I was writing some software to do book scanning; floating around on the internet is a program called pagetools that straightens (deskews) scanned-in pages using a fairly clever mathematical trick called the Radon transform.
I also wrote a small routine that would white out the blank space on the page; OCR algorithms tend to do a lot better when they don't have to contend with background noise. What I did was look for light-colored pixels that were more than a small radius away from dark-colored ones, and then boost those up to pure white.
It's been a few years, though, so I don't have the exact implementation details handy.
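Roughly, though, the idea looks like the sketch below (assuming Pillow, NumPy and SciPy are available; the radius and the dark/light thresholds are placeholders, not the values from the original routine):

```python
import numpy as np
from PIL import Image
from scipy.ndimage import minimum_filter

def whiten_background(path, radius=5, dark=100, light=180):
    """Push light pixels that have no dark (ink) pixel nearby up to pure white."""
    gray = np.asarray(Image.open(path).convert("L"))
    # darkest value within `radius` pixels of each position; if even that is
    # not dark, the pixel is background noise rather than part of a glyph
    local_min = minimum_filter(gray, size=2 * radius + 1)
    cleaned = gray.copy()
    cleaned[(gray >= light) & (local_min >= dark)] = 255
    return Image.fromarray(cleaned)

# whiten_background("page.png").save("page_clean.png")
```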
One simple image filter to look into is the "median filter", which is very straightforward, easy to implement yourself, and helps clean up scanned/photographed text: http://en.wikipedia.org/wiki/Median_filter
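With Pillow, for example, it is a one-liner (the window size of 3 here is an arbitrary starting point):

```python
from PIL import Image, ImageFilter

img = Image.open("scan.png")
# replace each pixel with the median of its 3x3 neighbourhood to knock out speckle noise
cleaned = img.filter(ImageFilter.MedianFilter(size=3))
cleaned.save("scan_median.png")
```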
As requested, link to Wikipedia: Optical character recognition
Microsoft Research: Optical character recognition papers
CiteSeerX: Papers on optical character recognition
Many programmers prefer the idea of building presentations with just the help of a text editor instead of using a GUI tool.
With the help of the JavaScript code provided by impress.js, it is possible to create an animated presentation using HTML/CSS. The project provides an "impress"ing example, and many people have written enthusiastic comments about it.
However, I am a bit worried about the inherent limitations of this approach: lengths in HTML/CSS are specified in a raster graphics model. Therefore, one must know the resolution of the projector (screen or other display device) to design the presentation adequately. Scaling the presentation afterwards (and porting it to different devices) seems to be difficult.
Doesn't it make more sense to use vector graphic tools like the LaTeX Beamer class or the Inkscape plugins JessyInk or Sozi, which allow for better scaling?
Any comments and opinions are appreciated.
Is there any Canvas library that is like d3.js (which is an SVG library)? I have a website here and I coded a graph with SVG elements; however, it is not efficient in smartphone browsers and runs very slowly. I now want to replace it with a 2D canvas version and see whether that is better or not. Can you suggest a canvas library that is useful for this purpose?
Thanks...
D3 is not necessarily an SVG-only library - SVG is used in many cases, but the library can produce any kind of representation you would like to make. See this example of parallel coordinates using canvas in D3, by Kai Chang: http://bl.ocks.org/2409451
Also see here for some discussion on performance issues, etc, that might be helpful: https://groups.google.com/d/topic/d3-js/mtlTsGCULVQ/discussion
I know I am late to the party, but times have changed and I believe this question deserves an updated answer. SVG performance has improved a lot over the years, and especially for non-trivial graph-like visualizations it often gives superior performance; but it really depends on the exact use case: if the visualization is simple and consists of thousands of elements, especially on mobile, Canvas may be the faster option. If the visualization is almost trivial, WebGL gives the best performance and beats Canvas hands down - especially on mobile!
However, WebGL especially, and also Canvas, are a bit harder to use than the declarative approach that SVG uses. Things like CSS animations and transitions are easy to do with SVG and give good performance, since they are hardware accelerated and totally independent of JavaScript performance. Canvas and WebGL always require JavaScript.
If you take a look at the commercial graph drawing library yFiles for HTML you will see that it offers all three technologies at the same time. This is because all three can be the best choice, depending on the exact use-case.
There is a blog entry that compares the performance of SVG, Canvas, and WebGL specifically in the context of graph visualization. It compares various graph sizes and categories of devices. The "conclusion" is that there is no clear winner; often the combination of all three technologies gives the best results. For smaller graphs, though, SVG gives very good results most of the time and is a pleasure to work with. That is also the reason, I would say, why d3.js focuses on SVG rather than Canvas and WebGL.
There is an interactive demo linked from that blog entry that lets you play with the various technologies and see their strengths and weaknesses. Of course, the demo mainly compares the three technologies as used in that specific library, so your results may vary, but they spent a lot of time optimizing all three technologies in that library, so I think the results are not too biased.
Disclaimer: I work for the company that creates the above mentioned library, but I do not represent my employer here on SO. I think what I said should be valid not just for that library.
For the Samsung Olympic Genome Project Facebook app, we used http://thejit.org to make the force-directed graph style animation for the app. It's heavily modified by me and others on my team, of course, and only plays a very small part in the app, but it's quite a powerful framework.
Chart.js is a JavaScript library that just came out and creates charts using HTML5 for rendering. It's not as feature-rich as D3, but it is working to become exactly that in the future. http://www.chartjs.org/
Take a look at Cytoscape.js, which uses an HTML5 canvas for rendering. At the time of writing it is in its infancy, but the project seems promising. According to its wiki the library supports both desktop and mobile browsers:
Cytoscape.js is easily integrated into your webapp, especially since Cytoscape.js supports both desktop browsers, like Chrome, and mobile browsers, like on the iPad.
Requirements:
entirely client-side (except maybe conversion to image)
export to image
able to print chart
user interactivity (hover annotation)
multiple axes
price < $300 per site
IE6/7/8 compatibility optional
I've looked at the following:
Highchart
rGraph
Zingchart
infoVis toolkit
jQuery Flot
Protovis
jqPlot
Which would you recommend based on your (or your team's) experiences?
Considering the following aspects:
ease of use/learning curve
ease of extension/customizability
range of available charts/themes, aesthetics
level of support/bugginess
Not to be a pain and sidestep you, but - and I say this as a Canvas lover - the best charting package I've used is gRaphael, which uses SVG/VML and not Canvas.
http://g.raphaeljs.com/
You tagged this as "canvas" and "html5" only, but gRaphael fulfills most of your requirements. It is especially easy to use, and the learning curve is gentler, as SVG generally requires a lot less code to get a rich user experience than Canvas-based libraries do.
Here is the plugin for exporting-to-image for raphael-based apps
I'm not sure about the printing situation, but since it is SVG you ought to be able to print with less fuss than if you were using Canvas, though I don't think raphael has anything additional built in to deal with printing.
Of course, using SVG means that performance will suffer more if you plan on making a highly complex/large app with lots of animation and interactivity, but that is pretty unlikely in the world of charting, unless you're trying to win a "most nauseating way to present information" award or something.
I earnestly think you should start prototyping your app with gRaphael first. You should be able to get something up more quickly than with a Canvas library, which will let you evaluate fairly soon whether it is a good fit or not.
@Xerion - I'm on the ZingChart team. ZingChart should fit the bill pretty well, as it renders in HTML5 Canvas, SVG, VML, and/or Flash for compatibility and various scenarios. Simon had a great point about SVG -- more complex charts (data, features or otherwise) tend to cause SVG to fall behind Canvas in performance. See different scenarios here: http://www.zingchart.com/#speedtest.
Feel free to contact me abegin[at]zingchart.com with any questions, or mention/follow us at twitter.com/zingchart.
Thanks.
There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?
Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.
Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.
Postscript the second: To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem of separating cruft (mostly site-added boilerplate and promotional material) from meat (the content that the kind of people who think the page might be interesting in fact find relevant). To address the state of the art, new answers need to address the cruft-from-meat problem explicitly.
Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.
Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurrence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and voilà! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved-accessibility angle, whereas I was an early search engine guy, but we both came to the same conclusion:
Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a professor at UC Berkeley). Her PhD thesis and other papers on subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):
In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.
One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs its own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style - the first paragraph or two has the meat of who/what/where/when, and the following paragraphs expand on that. Like I said, this is the easy stuff.
How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.
Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.
Strangely enough, a big reason for Google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link text represents a sort of mini-summary, done by a human, of the site/page it's linking to, which is exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
/mini-rant-and-trip-down-memory-lane
One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98%/99% extraction accuracy (avg/median).
Demo: http://boilerpipe-web.appspot.com/
Code: http://code.google.com/p/boilerpipe/
Presentation: http://videolectures.net/wsdm2010_kohlschutter_bdu/
Dataset and slides: http://www.l3s.de/~kohlschuetter/boilerplate/
PhD thesis: http://www.kohlschutter.com/pdf/Dissertation-Kohlschuetter.pdf
Also quite language independent (today, I've learned it works for Nepali, too).
Disclaimer: I am the author of this work.
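For readers who just want the flavour of the approach without reading the thesis: boilerpipe classifies text blocks using shallow features such as word count and link density. Here is a toy sketch of that idea in Python (not the actual boilerpipe code, and the thresholds are invented for illustration):

```python
def keep_content_blocks(blocks, min_words=15, max_link_density=0.33):
    """blocks: list of (block_text, anchor_text_within_block) pairs,
    e.g. produced by walking the DOM and grouping adjacent text nodes."""
    kept = []
    for text, anchor_text in blocks:
        words = len(text.split())
        anchor_words = len(anchor_text.split())
        link_density = anchor_words / words if words else 1.0
        # long blocks with few linked words tend to be article text;
        # short, link-heavy blocks tend to be navigation or boilerplate
        if words >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```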
Have you seen boilerpipe? Found it mentioned in a similar question.
I have come across http://www.keyvan.net/2010/08/php-readability/
Last year I ported Arc90's Readability to use in the Five Filters project. It's been over a year now and Readability has improved a lot, thanks to Chris Dary and the rest of the team at Arc90.
As part of an update to the Full-Text RSS service I started porting a more recent version (1.6.2) to PHP and the code is now online.
For anyone not familiar, Readability was created for use as a browser addon (a bookmarklet). With one click it transforms web pages for easy reading and strips away clutter. Apple recently incorporated it into Safari Reader.
It's also very handy for content extraction, which is why I wanted to port it to PHP in the first place.
There are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose, which was open-sourced by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
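If you are working in Python rather than Scala/Java, there is also a Python port (goose3); assuming its documented interface, usage looks roughly like this:

```python
from goose3 import Goose  # Python port of Gravity's goose extractor

g = Goose()
article = g.extract(url="http://example.com/some-article")
print(article.title)
print(article.cleaned_text)  # boilerplate-free article body
```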
I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.
Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.
To rank articles, even with the simplest metric of article length, you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to the large volumes characteristic of Google and similar sources. I've found that, as a rule of thumb, 30% or more of mined articles break these packages.
This has caused us to refocus on developing very low level, intelligent, character-based parsers to separate the raw text from the markup and scripting. The more fine-grained your parsing (i.e. partitioning of content), the more intelligent (and hand-made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service-based information delivery as opposed to "shrink-wrapped" applications.
Looking back over the years there appears to have been very few scholarly papers written about the low level mechanics (i.e. the "practice of the former" you refer to) of such extraction, probably because it's so domain and content specific.
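To make the "character-based parser" point concrete, here is a deliberately tiny illustration of the idea (my own sketch, not the parsers described above): a single character-level pass that drops tags and the bodies of script/style elements without building a DOM. It is nowhere near production quality, but it shows why this style of parser does not fall over on malformed markup the way a strict parser can:

```python
def strip_markup(raw):
    """Single character-level pass: drop tags and the bodies of script/style
    elements. Deliberately naive (no comment, CDATA, or entity handling)."""
    out, i, n = [], 0, len(raw)
    skipping = None                          # "script" or "style" while inside one
    while i < n:
        if raw[i] == "<":
            end = raw.find(">", i)
            if end == -1:                    # truncated tag at end of input
                break
            inner = raw[i + 1:end].strip().lower()
            parts = inner.lstrip("/").split()
            name = parts[0] if parts else ""
            if skipping:
                if inner.startswith("/") and name == skipping:
                    skipping = None
            elif name in ("script", "style"):
                skipping = name
            i = end + 1
        else:
            if not skipping:
                out.append(raw[i])
            i += 1
    return "".join(out)

# print(strip_markup("<p>Hello <script>var x = 1;</script><b>world</b></p>"))  # -> "Hello world"
```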
Beautiful Soup is a robust HTML parser written in Python.
It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access `<foo><bar/></foo>` using `doc.foo.bar`), and seamless Unicode handling.
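For example, with the current bs4 package (the interface is essentially the same as in older releases):

```python
from bs4 import BeautifulSoup

html = "<html><body><p class='story'>Hello <a href='/x'>world</a></p>"  # note: unclosed tags
soup = BeautifulSoup(html, "html.parser")

print(soup.body.p.a["href"])      # dot-notation child access -> /x
for a in soup.find_all("a"):      # search / iteration
    print(a.get_text())           # -> world
```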
If you are out to extract content from pages that make heavy use of JavaScript, Selenium Remote Control can do the job. It works for more than just testing. The main downside of doing this is that you'll end up using a lot more resources. The upside is that you'll get a much more accurate data feed from rich pages/apps.
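A minimal sketch using the present-day WebDriver API (the answer above refers to the older Remote Control interface, but the idea is the same: drive a real browser and read the DOM after scripts have run). It assumes a matching browser driver is on your PATH:

```python
from selenium import webdriver

driver = webdriver.Firefox()          # or webdriver.Chrome()
try:
    driver.get("https://example.com/js-heavy-page")
    html = driver.page_source         # DOM serialized *after* JavaScript has executed
finally:
    driver.quit()
```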
One challenging topic in computer vision is processing document scans. Typically this involves a number of steps, like noise removal, color analysis, binarization, text block identification, OCR, and then maybe some context analysis and correction.
I'm curious whether anyone understands, knows of, or can point me to literature on how Google identifies text blocks prior to the OCR stage. Any insights?
I believe Google uses the Tesseract OCR engine in conjunction with another tool called Ocropus, both of which are open-source. I don't know anything about how they work but you may be interested in checking out the code, available at the above links.
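You can experiment with Tesseract yourself, for instance via the third-party pytesseract wrapper (which, of course, is not what Google runs internally); it also exposes Tesseract's own block/paragraph/line segmentation:

```python
from PIL import Image
import pytesseract

img = Image.open("scan.png")
print(pytesseract.image_to_string(img))                  # plain OCR text

# per-word results including Tesseract's block/paragraph/line numbers
data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
for word, block in zip(data["text"], data["block_num"]):
    if word.strip():
        print(block, word)
```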
This is second-hand information from the digitization specialist in my library, but it seems that Google's approach is to just throw everything through the automated process, OCR anything that looks like text, and not fuss too much about cropping individual images or doing much semantic analysis to look for image captions, etc. They may be doing subtle things that aren't obvious, but on the surface they are definitely gunning for quantity over quality, which is smart for them to do for their purposes, IMO.