ML Kit Text Recognition - How to manipulate text blocks of a paragraph in newspaper content - firebase-mlkit

I am developing an OCR application using ML Kit Text Recognition. The OCR results are good, and it also provides a boundingBox for every word.
My question is that I want to scan a newspaper with a proper block-detection technique.
From what I have seen, ML Kit scans the text in a plain horizontal manner and does not give the exact extent of each paragraph; it mixes multiple paragraphs together when it creates TextBlocks.
Hope you understand my question.
Thanks.

The next ML Kit release (in August) will have a brand-new version of the Text API that has drastic improvements in paragraphing. It should tackle most of these issues. Stay tuned.
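Until that release lands, one workaround is to ignore ML Kit's own block grouping and regroup the recognized lines yourself using their bounding boxes, for example by clustering lines into columns. Below is a rough sketch of the idea, assuming the firebase-ml-vision on-device recognizer (API names may differ slightly between SDK versions, and the column tolerance is just a placeholder to tune):

import android.graphics.Bitmap;
import android.graphics.Rect;
import com.google.firebase.ml.vision.FirebaseVision;
import com.google.firebase.ml.vision.common.FirebaseVisionImage;
import com.google.firebase.ml.vision.text.FirebaseVisionText;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class NewspaperColumns {

    // Crude stand-in for real layout analysis: bucket lines into columns by the
    // left edge of their bounding boxes, then read each column top to bottom.
    static List<List<FirebaseVisionText.Line>> groupIntoColumns(FirebaseVisionText result,
                                                                int columnTolerancePx) {
        List<FirebaseVisionText.Line> lines = new ArrayList<>();
        for (FirebaseVisionText.TextBlock block : result.getTextBlocks()) {
            for (FirebaseVisionText.Line line : block.getLines()) {
                if (line.getBoundingBox() != null) {   // bounding box may be missing; skip those lines
                    lines.add(line);
                }
            }
        }
        lines.sort(Comparator.comparingInt(l -> l.getBoundingBox().left));

        List<List<FirebaseVisionText.Line>> columns = new ArrayList<>();
        for (FirebaseVisionText.Line line : lines) {
            Rect box = line.getBoundingBox();
            boolean newColumn = columns.isEmpty()
                    || Math.abs(box.left - columns.get(columns.size() - 1).get(0).getBoundingBox().left) > columnTolerancePx;
            if (newColumn) {
                columns.add(new ArrayList<>());
            }
            columns.get(columns.size() - 1).add(line);
        }
        for (List<FirebaseVisionText.Line> column : columns) {
            column.sort(Comparator.comparingInt(l -> l.getBoundingBox().top));
        }
        return columns;
    }

    static void recognize(Bitmap bitmap) {
        FirebaseVisionImage image = FirebaseVisionImage.fromBitmap(bitmap);
        FirebaseVision.getInstance().getOnDeviceTextRecognizer()
                .processImage(image)
                .addOnSuccessListener(result -> {
                    for (List<FirebaseVisionText.Line> column : groupIntoColumns(result, 100)) {
                        for (FirebaseVisionText.Line line : column) {
                            System.out.println(line.getText() + " " + line.getBoundingBox());
                        }
                    }
                });
    }
}

This will not recover true paragraph boundaries, but grouping by column before reading top to bottom already keeps separate newspaper columns from being interleaved.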

Related

Recommended HTML readability transcoding libraries in .Net [closed]

Background
I'm trying to read and analyze content from web pages, with focus on the main content of the page - without menus, sidebars, scripts, and other HTML clutter.
What have I tried?
I've tried NReadability, but it throws exceptions and fails in too many cases. Other than that, it is a good solution.
HTML Agility Pack is not what I need here, because I do want to get rid of non-content code.
EDIT: I'm looking for a library that actually sifts through the content and gives me only the "relevant" text from the page (i.e. for this page, the words "review", "chat", "meta", "about", and "faq" from the top bar would not show, nor would "user contributions licensed under").
So, do you know any other stable .Net library for extracting content from websites?
I don't know if this is still relevant, but this is an interesting question I run into a lot, and I haven't seen much material on the web that covers it.
I've implemented a tool that does this myself, over the span of several months.
Due to contract obligations, I cannot share this tool freely. However, I'm free to share some advice about what you can do.
The Sad Truth :(
I can assure you that we tried every option before undertaking the task of creating a readability tool ourselves. At the moment, no such tool exists that was satisfactory for what we needed.
So, you want to extract content?
Great! You will need a few things:
A tool for handling the page's HTML. I use CsQuery, which is what Jamie suggested in another answer. It works great for selecting elements.
A programming language (that's C# in this example; any .NET language will do!)
A tool that lets you download the pages themselves. CsQuery can do this on its own with CreateFromUrl. You can create your own helper class for downloading the page if you want to pre-process it and get finer-grained control over the headers. (Try playing with the user agent, looking for mobile versions, etc.)
Ok, I'm all set up, what's next?
There is surprisingly little research in the field of content extraction. A piece that stands out is Boilerplate Detection using Shallow Text Features. You can also read this answer on Stack Overflow from the paper's author to see how Readability works and what some of the approaches are.
Here are some more papers I enjoyed:
Extracting Article Text from the Web with Maximum Subsequence Segmentation
Text Extraction from the Web via Text-to-Tag Ratio
The Easy Way to Extract Text from HTML
I'm done reading, what's done in practice?
From my experience the following are good strategies for extracting content:
Simple heuristics: filtering <header> and <nav> tags, removing lists with only links, removing the entire <head> section, and giving a negative/positive score to elements based on their name and removing the ones with the lowest score (for example, divs with a class name that contains "navigation" might get a lower score). This is how Readability works.
Meta-content. Analyzing the density of links to text is a powerful tool on its own: you can compare the amount of link text to the amount of HTML text and work with that, since the densest text is usually where the content is. CsQuery lets you easily compare the amount of text in an element to the amount of text in its nested link tags (see the sketch after this list).
Templating. Crawl several pages on the same website and analyze the differences between them; the constants are usually the page layout, navigation, and ads. You can usually filter based on the similarities. This 'template'-based approach is very effective. The trick is to come up with an efficient algorithm to keep track of templates and to detect the template itself.
Natural language processing. This is probably the most advanced approach here; with natural-language-processing tools it is relatively simple to detect paragraphs and text structure, and thus where the actual content starts and ends.
Learning. Learning is a very powerful concept for this sort of task. In its most basic form, this involves creating a program that 'guesses' which HTML elements to remove, checks its guesses against a set of pre-defined results for a website, and learns which patterns are OK to remove. From my experience, this approach works best with a separate model per site.
Fixed list of selectors. Surprisingly, this is extremely potent and people tend to forget about it. If you are scraping a specific few sites, using selectors and manually extracting the content is probably the fastest thing to do. Keep it simple if you can :)
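A rough sketch of the link-density measurement mentioned above (jsoup is used here purely to keep the example short, even though this answer's own tooling is CsQuery/C#; the candidate selectors, length cutoff, and scoring are arbitrary placeholders you would tune):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkDensitySketch {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; point this at a real article page.
        Document doc = Jsoup.connect("http://example.com/article").get();

        Element best = null;
        double bestScore = 0;
        for (Element candidate : doc.select("div, section, article, td")) {
            int ownText = 0;
            int ownLinkText = 0;
            // Only count text in direct <p> children, so a wrapper element
            // does not inherit the score of everything nested inside it.
            for (Element child : candidate.children()) {
                if (!child.tagName().equals("p")) continue;
                ownText += child.text().length();
                for (Element a : child.select("a")) {
                    ownLinkText += a.text().length();
                }
            }
            if (ownText < 200) continue;                      // skip tiny fragments (arbitrary cutoff)

            double linkDensity = (double) ownLinkText / ownText;
            double score = ownText * (1.0 - linkDensity);     // lots of text, few links = likely content
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        if (best != null) {
            System.out.println(best.text());
        }
    }
}

In practice you would combine this score with the other strategies above rather than trusting it on its own.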
In Practice
Mix and match: a good solution usually involves more than one strategy, combining a few of the above. We ended up with something quite complex because we use it for a complex task. In practice, content extraction is a really complicated task. Don't try to create something that is very general; stick to the content you need to scrape. Test a lot; unit tests and regression tests are very important for this sort of program. Always compare against and read the code of Readability; it's pretty simple and it'll probably get you started.
Best of luck, let me know how this goes.
CsQuery: https://github.com/jamietre/csquery
It's a .NET 4 jQuery port. Getting rid of non-content nodes can be done in a number of ways: use the .Text method to just grab everything as a string, or filter for text nodes, e.g.
var dom = CQ.CreateFromUrl(someUrl);
// or var dom = CQ.Create(htmlText);
IEnumerable<string> allTextStrings = dom.Select("*")
.Contents()
.Where(el => el.NodeType == NodeType.TEXT_NODE)
.Select(el => el.NodeValue);
It works the same as jQuery, except, of course, you also have the .NET framework and LINQ to make your life easier. The Select selects all nodes in the DOM, then Contents selects all children of each (including text nodes). That's it for CsQuery; then with LINQ the Where filters for only text nodes, and the Select gets the actual text out of each node.
This will include a lot of whitespace, since it returns everything. If you simply want a blob of text for the whole page, just
string text = dom.Select("body").Text();
will do it. The Text method coalesces whitespace so there will be a single space between each piece of actual text.

Good PDF to HTML Converter for Mobiles

We have multiple PDFs that contain account tables and balance sheets. We have tried many converters, but the results are not satisfactory. Can anybody please suggest a good converter that would replicate the contents of the PDF in the exact same structure in HTML? If there is a good paid converter, please suggest that as well.
This is the PDF we want to convert and show in HTML: "http://www.marico.com/html/investor/pdf/Quarterly_Updates/Consolidated%20Financial%20Results%20-%20Q3FY11.pdf"
Have you looked into this? http://pdftohtml.sourceforge.net/
It's open source as well, so it's free and can be modified if necessary.
There's even a demo showing the before PDF and the after HTML version. Not bad if you ask me.
If you're having issues specifically with tables in PDFs, perhaps the issue is the tables themselves and whatever program is being used to generate them. Not all PDFs are created equal.
ALSO: Be aware that all PDFs that I've created and come across over the years have had lots of issues when it comes to copying/pasting blocks/lines of text that have other blocks/lines of text at equal or greater height on any given page. I think Acrobat lacks the ability to define a "sequence order" of which block is selected after which (or most programs don't use it properly), so the system sort of moves through the content top-down and left-to-right, even if that means jumping over large blank areas or grabbing lines from multiple columns at once when you wouldn't expect it. This may be part of your tabular-data issue. Your weak link here is the PDF format itself, and I think you may be expecting too much from it. Turning anything into a PDF is pretty much a one-way street, especially when you start putting lots of editable text into it.
Have you tried http://www.jpedal.org/html_index.php? There is also a free online version.

OCR text+markings

Is there a free OCR library out there that can extract text as well as detect some markings on the text? I realize this is an extremely vague proposition and such functionality would be highly dependent on what type of "markings" I want to detect.
But as far as I can tell, no such thing even exists, except for a few commercial packages that claim to convert scanned pages into editable files while preserving some semblance of the original page layout. I'm looking, rather, for a LIBRARY that I can program against.
My specific application of such a library would be this:
Print a page.
Use a pencil to underline key words.
Scan the page.
Run a program that converts the scanned page image into some text format that marks each of the underlined words. For example, an RTF file where each pencil-underlined word has been bolded.
Best free OCR tool is probably still Tesseract. You'd have to modify the code yourself to identify your markup's positioning relative to the scanned text.
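To sketch what that modification might look like, here is one rough approach using the tess4j Java wrapper around Tesseract: take word-level bounding boxes from the OCR result and look for a dark horizontal strip just below each word. The data path, image path, and every threshold below are placeholders, not tested values:

import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.Word;

import javax.imageio.ImageIO;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.List;

public class UnderlineDetector {
    public static void main(String[] args) throws Exception {
        BufferedImage page = ImageIO.read(new File("scan.png"));   // placeholder image path

        Tesseract ocr = new Tesseract();
        ocr.setDatapath("/usr/share/tesseract-ocr/tessdata");      // placeholder tessdata path

        // Word-level recognition results, each with its text and bounding box.
        List<Word> words = ocr.getWords(page, ITessAPI.TessPageIteratorLevel.RIL_WORD);

        for (Word word : words) {
            if (isUnderlined(page, word.getBoundingBox())) {
                System.out.println("**" + word.getText() + "**");  // mark underlined words, e.g. as bold
            } else {
                System.out.println(word.getText());
            }
        }
    }

    // Crude check: look for a mostly-dark horizontal strip just below the word's box.
    // The strip height, darkness threshold, and coverage ratio are arbitrary and need tuning.
    static boolean isUnderlined(BufferedImage img, Rectangle box) {
        int yStart = Math.min(box.y + box.height + 2, img.getHeight() - 1);
        int yEnd = Math.min(yStart + 8, img.getHeight());
        int dark = 0, total = 0;
        for (int y = yStart; y < yEnd; y++) {
            for (int x = box.x; x < Math.min(box.x + box.width, img.getWidth()); x++) {
                int rgb = img.getRGB(x, y);
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < 128) dark++;
                total++;
            }
        }
        return total > 0 && (double) dark / total > 0.3;
    }
}

From there you could emit RTF or HTML, bolding the words flagged as underlined.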
When I last checked a couple of years ago, good, free OCR libraries were thin on the ground. Even closed-source offerings are generally not worth the bother, unless you want to spend $$$ on them.

Recommended approach for presenting formatted text in Android?

I am developing an application that will provide instructions for making a product. The text will have bullets and/or numbered steps as well as regular text paragraphs. I may have headings for various sections. The text will be placed into a scrollable TextView.
I was originally planning on loading the text from a resource text file and then applying formatting via xml. However, I just learned about WebView and the ability to load local html files. I could easily format the text in html and load it into a WebView for the various activities.
My question is, is there a performance issue with using WebView vs. TextView? Are there other ways to easily format text for a TextView?
Thanks,
WebView definitely takes longer to load the first time into your process. It also is not designed to go in a ScrollView, since it scrolls itself. OTOH, you get excellent HTML support.
TextView can display limited HTML, converted into a SpannedString via Html.fromHtml(). Here is a blog post where I list the HTML tags supported by the Android 2.1 edition of fromHtml(). Note that these are undocumented, and so the roster of tags may be different in other Android releases.
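For reference, the TextView route is only a couple of lines. A minimal sketch (the layout, view id, and string resource names here are hypothetical):

import android.app.Activity;
import android.os.Bundle;
import android.text.Html;
import android.text.method.LinkMovementMethod;
import android.widget.TextView;

public class InstructionsActivity extends Activity {
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.instructions);                   // hypothetical layout

        TextView text = (TextView) findViewById(R.id.instructions_text);     // hypothetical view id
        // Convert limited HTML (<b>, <i>, <p>, <a>, ...) into a Spanned the TextView can render.
        text.setText(Html.fromHtml(getString(R.string.instructions_html)));  // hypothetical string resource
        text.setMovementMethod(LinkMovementMethod.getInstance());            // only needed if the HTML has links
    }
}

If bullet or list tags don't render on your target release, you may have to write the bullet characters into the string yourself, since fromHtml()'s tag support is limited (see the blog post mentioned above).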

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract the news title, news abstract (first paragraph), etc.
I plugged into the WebKit parser code to easily navigate the web page as a tree. To eliminate navigation and other non-news content, I take the text version of the article (minus the HTML tags; WebKit provides an API for this). Then I run a diff algorithm comparing the text of various articles from the same website, which results in the similar text being eliminated. This gives me the content minus the common navigation content, etc.
Despite the above approach, I am still getting quite a lot of junk in my final text, which results in an incorrect news abstract being extracted. The error rate is 5 in 10 articles, i.e. 50%. Error as in
Can you
Suggest an alternative strategy for the extraction of pure content,
Would/Can learning Natural Language Processing help in extracting a correct abstract from these articles?
How would you approach the above problem?
Are there any research papers on the same?
Regards
Ankur Gupta
You might have a look at my boilerpipe project on Google Code and test it on pages of your choice using the live web app on Google AppEngine (linked from there).
I am researching this area and have written some papers about content extraction/boilerplate removal from HTML pages. See for example "Boilerplate Detection using Shallow Text Features" and watch the corresponding video on VideoLectures.net. The paper should give you a good overview of the state of the art in this area.
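For completeness, boilerpipe's quick-start usage looks roughly like this (the URL is a placeholder; ArticleExtractor is the extractor tuned for news articles):

import de.l3s.boilerpipe.extractors.ArticleExtractor;
import java.net.URL;

public class BoilerpipeQuickStart {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://example.com/some-news-article");  // placeholder article URL
        // Fetches the page and returns the main article text, with navigation and boilerplate removed.
        String mainText = ArticleExtractor.INSTANCE.getText(url);
        System.out.println(mainText);
    }
}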
Cheers,
Christian
For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.
For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.
For question (3), the basic way to create abstracts from machine learning would be to:
Create a corpus of existing abstracts
Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.
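To make step 3 concrete, here is a toy sketch of scoring and selecting sentences, with hand-set weights standing in for whatever a trained classifier would learn from the annotated corpus (all features and weights here are illustrative placeholders):

import java.util.ArrayList;
import java.util.List;

public class SentenceSelector {

    // Toy feature-based scorer. The weights are placeholders for what a trained
    // classifier would learn from an annotated corpus of abstracts.
    static double score(String sentence, int position, String title) {
        double positionFeature = 1.0 / (1 + position);                       // earlier sentences matter more
        double lengthFeature = Math.min(sentence.split("\\s+").length / 20.0, 1.0);
        double titleOverlap = 0;
        for (String w : title.toLowerCase().split("\\s+")) {
            if (w.length() > 3 && sentence.toLowerCase().contains(w)) titleOverlap += 1;
        }
        return 2.0 * positionFeature + 1.0 * lengthFeature + 0.5 * titleOverlap;
    }

    // Select the k highest-scoring sentences, keeping the original order.
    static List<String> selectAbstract(String title, List<String> sentences, int k) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) indices.add(i);
        indices.sort((a, b) -> Double.compare(score(sentences.get(b), b, title),
                                              score(sentences.get(a), a, title)));
        List<Integer> top = new ArrayList<>(indices.subList(0, Math.min(k, indices.size())));
        top.sort(Integer::compare);

        List<String> result = new ArrayList<>();
        for (int i : top) result.add(sentences.get(i));
        return result;
    }
}

A real system would replace score() with the trained classifier's decision function.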
My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).
For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.
I don't know how it works, but check out Readability. It does exactly what you wanted.