Scraper: distinguishing meaningful text from meaningless items, hadoop - html

I'm trying to build a crawler and scraper in Apache Nutch to find all the pages containing a section talking about a particular word-topic (e.g. "election","elections", "vote", etc).
Once I have crawled, Nutch cleans the HTML from stop words, and tags, but it doesn't take out menu voices (that are in every pages of the website).
So it could happen that when you look for all the pages talking about elections, you could retrieve a whole website because it has the word "elections" in its menu and therefore in every page.
I was wondering if techniques that analyze multiple pages of the website to understand what is the main template of a page, exist. Useful papers and/or implementations/libraries.
I was thinking about creating some kind of hadoop Job that analyzed similarities between multiple pages to extract a template. But the same website could have multiple templates, so it is hard to think of an effective way to do that.
E.G.
WEBPage 1:
MENU HOME VOTE ELECTION NEWS
meaningful text... elections ....
WebPage 2:
MENU HOME VOTE ELECTION NEWS
meaningful text... talking about swimming pools ....

You didn't mention which branch of Nutch (1.x/2.x) are you using, but at the moment I can think of a couple of approaches:
Take a look at NUTCH-585 which will be helpful if you are not crawling many different sites and if you can specify which nodes of your HTML content you want to exclude from the indexed content.
If you're working with different sites and the previous approach is not feasible take a look at NUTCH-961 which uses the boilerplate feature inside Apache Tika to guess what texts matter from your HTML content. This library uses some algorithms and provides several extractors, you could try it and see what works for you. In my experience I've had some issues with news sites that had a lot of comments and some of the comments ended up being indexed alone with the main article content, but it was a minor issue after all. In any case this approach could work very well for a lot of cases.
Also you can take a peek at NUTCH-1870 which let you specify XPath expressions to extract certain specific parts of the webpage as separated fields, using this with the right boost parameters in Solr could improve your precision.

Related

Using a list of dynamic links throughout website

By "dynamic links", I mean a list of links that will constantly be updated.
To illustrate my question, I have a website that I am constantly writing new articles for. I currently have about 10 articles. If someone is to read article #5, there is a list of links to all 10 articles in the right panel of the page. As I update the site, and article #1 becomes out of date, I'd like to replace article #1 with article #11. Rather than updating the links within every article (so 10 times), is there a way to update the links once and have them all update simultaneously to every page?? Could I create an iframe for this??
Thanks for any and all help!
What's your goal? Do you want to learn to be a web developer? Or are you mostly concerned with getting your articles published?
If you want to be a web developer, I'd recommend steering clear of large CMS system like Wordpress or Drupal. Those are great products. But you want to learn the basics first. I think starting a PHP tutorial is the way to go.
If you just want to publish your articles, I'd recommend you find a nice place to create a blog. There are so many to choose from. It all depends on how much you want to spend.
Feel free to ask follow up questions. Web development sounds simple. But it's really a complex topic. I can't imagine what is must be like starting out these days with so many choices and competing technologies.
One way to do it would be to use Server-side includes. (Wikipedia) They work like this:
<!--#include file="some-content.html" -->
or
<!--#include virtual="some-folder/some-content.html" -->
The difference is file="" finds a file relative to the current page, whereas virtual="" finds it from the domain root. Either way, this method can use any type of regular text file as a source. The actual addition of the content is done by the server (hence the name) so its contents will be parsed as regular HTML and all CSS will apply to it as if the file were part of your page. I don't know about compatibility with different hosts, but if your web server supports it, this is probably the easiest way to go.

cloud based "knowledge base" approach with links, snippets and excerpts for google chrome

I'm working as a web developer and have lots (hundreds) of links with hacks, tutorials and code snippets that I don't want to memorize. I am currently using evernote to save the content of my links as snippets and have them searchable and always available (even if the source site is down).
I spend a lot of time on tagging, sorting, evaluating and saving stuff to evernote and I'm not quite happy with the outcome. I ended up with a multitude of tags and keep reordering and renaming tags while retagging saved articles.
My Requirements
web based
saving web content as snippets with rich styling (code sections, etc.)
interlinked entries possible
chrome plugin for access to content
chrome plugin for content generation
web app or desktop client for faster sorting / tagging / batch processing
good and flexible search mechanism
(bonus) google search integration (search results from KnowledgeBase within google search results)
I had a look at kippt but that doesn't seem to be a solution for me. If I don't find a better solution, I'm willing to stay with evernote as it meets nearly all my needs but I need a good plan to sort through my links/snippets once and get them in order.
Which solutions do you use and how do you manage your knowledge base?
I'm a big Evernote fan but a stern critic of all my tools. I've stuck with Evernote because I'm happy enough with its fundamental information structures. I am, however, currently working on some apps to provide visualisations and hopefully better ways to navigate complex sets of notes.
A few tips, based on years of using Evernote and wiki's for collaboration and software project management:
you can't get away from the need to curate things, regardless of your tool
don't over-think using tags, tags in combination with words are a great way to search (you do know you can say tag:blah in a search to combine that with word searches?)
build index pages for different purposes - I'm using a lot more of the internal note links to treat Evernote like a wiki
refactor into smaller notebooks if you use mobile clients a lot, allowing you to choose to have different collections of content with you at different times

What is the state of the art in HTML content extraction?

There's a lot of scholarly work on HTML content extraction, e.g., Gupta & Kaiser (2005) Extracting Content from Accessible Web Pages, and some signs of interest here, e.g., one, two, and three, but I'm not really clear about how well the practice of the latter reflects the ideas of the former. What is the best practice?
Pointers to good (in particular, open source) implementations and good scholarly surveys of implementations would be the kind of thing I'm looking for.
Postscript the first: To be precise, the kind of survey I'm after would be a paper (published, unpublished, whatever) that discusses both criteria from the scholarly literature, and a number of existing implementations, and analyses how unsuccessful the implementations are from the viewpoint of the criteria. And, really, a post to a mailing list would work for me too.
Postscript the second To be clear, after Peter Rowell's answer, which I have accepted, we can see that this question leads to two subquestions: (i) the solved problem of cleaning up non-conformant HTML, for which Beautiful Soup is the most recommended solution, and (ii) the unsolved problem or separating cruft (mostly site-added boilerplate and promotional material) from meat (the contentthat the kind of people who think the page might be interesting in fact find relevant. To address the state of the art, new answers need to address the cruft-from-meat peoblem explicitly.
Extraction can mean different things to different people. It's one thing to be able to deal with all of the mangled HTML out there, and Beautiful Soup is a clear winner in this department. But BS won't tell you what is cruft and what is meat.
Things look different (and ugly) when considering content extraction from the point of view of a computational linguist. When analyzing a page I'm interested only in the specific content of the page, minus all of the navigation/advertising/etc. cruft. And you can't begin to do the interesting stuff -- co-occurence analysis, phrase discovery, weighted attribute vector generation, etc. -- until you have gotten rid of the cruft.
The first paper referenced by the OP indicates that this was what they were trying to achieve -- analyze a site, determine the overall structure, then subtract that out and Voila! you have just the meat -- but they found it was harder than they thought. They were approaching the problem from an improved accessibility angle, whereas I was an early search egine guy, but we both came to the same conclusion:
Separating cruft from meat is hard. And (to read between the lines of your question) even once the cruft is removed, without carefully applied semantic markup it is extremely difficult to determine 'author intent' of the article. Getting the meat out of a site like citeseer (cleanly & predictably laid out with a very high Signal-to-Noise Ratio) is 2 or 3 orders of magnitude easier than dealing with random web content.
BTW, if you're dealing with longer documents you might be particularly interested in work done by Marti Hearst (now a prof at UC Berkely). Her PhD thesis and other papers on doing subtopic discovery in large documents gave me a lot of insight into doing something similar in smaller documents (which, surprisingly, can be more difficult to deal with). But you can only do this after you get rid of the cruft.
For the few who might be interested, here's some backstory (probably Off Topic, but I'm in that kind of mood tonight):
In the 80's and 90's our customers were mostly government agencies whose eyes were bigger than their budgets and whose dreams made Disneyland look drab. They were collecting everything they could get their hands on and then went looking for a silver bullet technology that would somehow ( giant hand wave ) extract the 'meaning' of the document. Right. They found us because we were this weird little company doing "content similarity searching" in 1986. We gave them a couple of demos (real, not faked) which freaked them out.
One of the things we already knew (and it took a long time for them to believe us) was that every collection is different and needs it's own special scanner to deal with those differences. For example, if all you're doing is munching straight newspaper stories, life is pretty easy. The headline mostly tells you something interesting, and the story is written in pyramid style - the first paragraph or two has the meat of who/what/where/when, and then following paras expand on that. Like I said, this is the easy stuff.
How about magazine articles? Oh God, don't get me started! The titles are almost always meaningless and the structure varies from one mag to the next, and even from one section of a mag to the next. Pick up a copy of Wired and a copy of Atlantic Monthly. Look at a major article and try to figure out a meaningful 1 paragraph summary of what the article is about. Now try to describe how a program would accomplish the same thing. Does the same set of rules apply across all articles? Even articles from the same magazine? No, they don't.
Sorry to sound like a curmudgeon on this, but this problem is genuinely hard.
Strangely enough, a big reason for google being as successful as it is (from a search engine perspective) is that they place a lot of weight on the words in and surrounding a link from another site. That link-text represents a sort of mini-summary done by a human of the site/page it's linking to, exactly what you want when you are searching. And it works across nearly all genre/layout styles of information. It's a positively brilliant insight and I wish I had had it myself. But it wouldn't have done my customers any good because there were no links from last night's Moscow TV listings to some random teletype message they had captured, or to some badly OCR'd version of an Egyptian newspaper.
/mini-rant-and-trip-down-memory-lane
One word: boilerpipe.
For the news domain, on a representative corpus, we're now at 98% / 99% extraction accuracy (avg/median)
Demo: http://boilerpipe-web.appspot.com/
Code: http://code.google.com/p/boilerpipe/
Presentation: http://videolectures.net/wsdm2010_kohlschutter_bdu/
Dataset and slides: http://www.l3s.de/~kohlschuetter/boilerplate/
PhD thesis: http://www.kohlschutter.com/pdf/Dissertation-Kohlschuetter.pdf
Also quite language independent (today, I've learned it works for Nepali, too).
Disclaimer: I am the author of this work.
Have you seen boilerpipe? Found it mentioned in a similar question.
I have come across http://www.keyvan.net/2010/08/php-readability/
Last year I ported Arc90′s Readability
to use in the Five Filters project.
It’s been over a year now and
Readability has improved a lot —
thanks to Chris Dary and the rest of
the team at Arc90.
As part of an update to the Full-Text
RSS service I started porting a more
recent version (1.6.2) to PHP and the
code is now online.
For anyone not familiar, Readability
was created for use as a browser addon
(a bookmarklet). With one click it
transforms web pages for easy reading
and strips away clutter. Apple
recently incorporated it into Safari
Reader.
It’s also very handy for content
extraction, which is why I wanted to
port it to PHP in the first place.
there are a few open source tools available that do similar article extraction tasks.
https://github.com/jiminoc/goose which was open source by Gravity.com
It has info on the wiki as well as the source you can view. There are dozens of unit tests that show the text extracted from various articles.
I've worked with Peter Rowell down through the years on a wide variety of information retrieval projects, many of which involved very difficult text extraction from a diversity of markup sources.
Currently I'm focused on knowledge extraction from "firehose" sources such as Google, including their RSS pipes that vacuum up huge amounts of local, regional, national and international news articles. In many cases titles are rich and meaningful, but are only "hooks" used to draw traffic to a Web site where the actual article is a meaningless paragraph. This appears to be a sort of "spam in reverse" designed to boost traffic ratings.
To rank articles even with the simplest metric of article length you have to be able to extract content from the markup. The exotic markup and scripting that dominates Web content these days breaks most open source parsing packages such as Beautiful Soup when applied to large volumes characteristic of Google and similar sources. I've found that 30% or more of mined articles break these packages as a rule of thumb. This has caused us to refocus on developing very low level, intelligent, character based parsers to separate the raw text from the markup and scripting. The more fine grained your parsing (i.e. partitioning of content) the more intelligent (and hand made) your tools must be. To make things even more interesting, you have a moving target as web authoring continues to morph and change with the development of new scripting approaches, markup, and language extensions. This tends to favor service based information delivery as opposed to "shrink wrapped" applications.
Looking back over the years there appears to have been very few scholarly papers written about the low level mechanics (i.e. the "practice of the former" you refer to) of such extraction, probably because it's so domain and content specific.
Beautiful Soup is a robust HTML parser written in Python.
It gracefully handles HTML with bad markup and is also well-engineered as a Python library, supporting generators for iteration and search, dot-notation for child access (e.g., access <foo><bar/></foo>' usingdoc.foo.bar`) and seamless unicode.
If you are out to extract content from pages that heavily utilize javascript, selenium remote control can do the job. It works for more than just testing. The main downside of doing this is that you'll end up using a lot more resources. The upside is you'll get a much more accurate data feed from rich pages/apps.

Methods for preventing search engines from indexing irrelevant content on a page

I'm looking for ways to prevent indexing of parts of a page. Specifically, comments on a page, since they weigh up entries a lot based on what users have written. This makes a Google search on the page return lots of irrelevant pages.
Here are the options I'm considering so far:
1) Load comments using JavaScript to prevent search engines from seeing them.
2) Use user agent sniffing to simply not output comments for crawlers.
3) Use search engine-specific markup to hide parts of the page. This solution seems quirky at best, though. Allegedly, this can be done to prevent Yahoo! indexing specific content:
<div class="robots-nocontent">
This content will not be indexed!
</div>
Which is a very ugly way to do it. I read about a Google solution that looks better, but I believe it only works with Google Search Appliance (can someone confirm this?):
<!--googleoff: all-->
This content will not be indexed!
<!--googleon: all-->
Does anyone have other methods to recommend? Which of the three above would be the best way to go? Personally, I'm leaning towards #2 since while it might not work for all search engines, it's easy to target the biggest ones. And it has no side-effect on users, unless they're deliberately trying to impersonate a web crawler.
I would go with your JavaScript option. It has two advantages:
1) bots don't see it
2) it would speed up your page load time (load the comments asynchronously and unobtrusively, e.g. via jQuery) ... page load times have a much underrated positive effect on your search rankings
Javascript is an option but engines are getting better at reading javascript, to be honest I think your thinking too much into it, Engines love unique content, the more content you have on each page the better and if the users are providing it... its the holy grail.
Just because your commenter made a reference to star wars on your toaster review doesn't mean your not going to rank for the toaster model, it just means you might rank for star wars toaster.
Another idea would be, you could only show comments to people who are logged in, collegehumor do the same I believe, they show the amount of comments a post has but you have to login to see them.
googleoff and googleon are for the Google Search Appliance, which is a search engine they sell to companies that need to search through their own internal documents. It's not effective for the live Google site.
I think number 1 is the best solution, actually. The search engines doesn't like when you give them other material than you give your users so number 2 could get you kicked out from the search listings altogether.
This is the first I have heard that search engines provide a method for informing them that part of a page is irrelevant.
Google has a feature for web masters to declare parts of their site for a web search engine to use to find pages when crawling.
http://www.google.com/webmasters/
http://www.sitemaps.org/protocol.php
You might be able to relatively de-emphasize some things on the page by specifying the most relevant keywords using META tag(s) in the HEAD section of your HTML pages. I think that is more in line with the engineering philosophy used to architect search engines in the first place.
Look at Google's Search Engine Optimization tips. They spell out clearly what they will and will not let you do to influence how they index your site.

Extracting pure content / text from HTML Pages by excluding navigation and chrome content

I am crawling news websites and want to extract News Title, News Abstract (First Paragraph), etc
I plugged into the webkit parser code to easily navigate webpage as a tree. To eliminate navigation and other non news content I take the text version of the article (minus the html tags, webkit provides api for the same). Then I run the diff algorithm comparing various article's text from same website this results in similar text being eliminated. This gives me content minus the common navigation content etc.
Despite the above approach I am still getting quite some junk in my final text. This results in incorrect News Abstract being extracted. The error rate is 5 in 10 article i.e. 50%. Error as in
Can you
Suggest an alternative strategy for extraction of pure content,
Would/Can learning Natural Language rocessing help in extracting correct abstract from these articles ?
How would you approach the above problem ?.
Are these any research papers on the same ?.
Regards
Ankur Gupta
You might have a look at my boilerpipe project on Google Code and test it on pages of your choice using the live web app on Google AppEngine (linked from there).
I am researching this area and have written some papers about content extraction/boilerplate removal from HTML pages. See for example "Boilerplate Detection using Shallow Text Features" and watch the corresponding video on VideoLectures.net. The paper should give you a good overview of the state of the art in this area.
Cheers,
Christian
For question (1), I am not sure. I haven't done this before. Maybe one of the other answers will help.
For question (2), automatic creation of abstracts is not a developed field. It is usually referred to as 'sentence selection', because the typical approach right now is to just select entire sentences.
For question (3), the basic way to create abstracts from machine learning would be to:
Create a corpus of existing abstracts
Annotate the abstracts in a useful way. For example, you'd probably want to indicate whether each sentence in the original was chosen and why (or why not).
Train a classifier of some sort on the corpus, then use it to classify the sentences in new articles.
My favourite reference on machine learning is Tom Mitchell's Machine Learning. It lists a number of ways to implement step (3).
For question (4), I am sure there are a few papers because my advisor mentioned it last year, but I do not know where to start since I'm not an expert in the field.
I don't know how it works, but check out Readability. It does exactly what you wanted.