Extracting the same data from various HTML documents - html

Let's say I have several HTML pages from unrelated websites, but that contain the same overall information. I want to extract that information in a flexible manner, i.e. I want to only have to write a small number of data extractors for all of the pages (ideally, one). Say the fields are (to use a blog example) author, date, title, text. The classes of the HTML tags that denote these could be totally different for each page, but still display on the page in roughly the same way. For example, take this post from CNN and this post from Gawker. Both contain the same information - the information that I want - somewhere on the page when it is actually displayed. Is there a nice way to extract that data? Writing separate extractors is an option, but not a good one; there are about a thousand styles of documents in the dataset I want to use.

The only way you can do that is by finding a common element in all of those websites (e.g. they share the same DOM structure, or have the same ID, or are preceded by the same content in a previous tag like an <h1>).
Otherwise, you need to write different rules or regular expressions for each case.
Unless, of course, you write an algorithm intelligent enough to recognize the content's intention/meaning even with different HTML - which is neither simple nor quick to write.
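A minimal sketch of the "separate extractor per site" idea in Python with BeautifulSoup, kept cheap by reducing each extractor to a small map of CSS selectors; the domains and selectors below are made up for illustration and would have to be read off each site's actual markup:

from bs4 import BeautifulSoup

# Hypothetical per-site rules: field name -> CSS selector (illustrative only).
RULES = {
    "cnn.com": {
        "author": ".byline__name",
        "date": ".timestamp",
        "title": "h1.headline",
        "text": ".article__content p",
    },
    "gawker.com": {
        "author": ".author",
        "date": "time",
        "title": "h1.entry-title",
        "text": ".entry-content p",
    },
}

def extract(domain, html):
    """Apply the selector map for `domain` to raw `html` and return the fields."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for field, selector in RULES.get(domain, {}).items():
        nodes = soup.select(selector)
        fields[field] = " ".join(n.get_text(" ", strip=True) for n in nodes) or None
    return fields

This only reduces the per-site cost to one selector map; it does not remove the need to inspect each of the thousand document styles.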

Related

How can I get all the ACM Computing Classification System tags as strings?

So I'm writing an app where there's an HTML tag which should have all the ACM CCS tags/fields as options, and basically users should be able to select one or more of these tags and assign them.
However, how can I go about getting all the ACM CCS tags as strings, so that I can create the element with the proper options in the first place?
Getting them manually isn't really an option as there are probably thousands of them by the looks of it.
Considering they're part of the academic culture I thought they'd already been extracted at some point by someone in a certain format and then shared somewhere but I couldn't find any such thing.
For reference, if you don't know what I'm talking about, I want all the names of all the CCS tags in no particular order from this site: https://dl.acm.org/ccs
So for example, in the end I want to obtain strings such as "General and reference", "Hardware", etc.
I've looked into web crawlers and the like, but I seriously have no clue how to even use one, let alone build one. As far as I've seen, all these names are stored inside <li> tags, so if I had something that'd "parse" the whole site with all its pages and then store the content of all the <li> tags it encounters as strings, that would work.
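A rough sketch of that "grab every <li>" idea with requests and BeautifulSoup, under the assumption that the page you fetch (or a saved copy of it) contains the concept names in plain <li> elements; if https://dl.acm.org/ccs builds its tree with JavaScript, you would need to save the fully rendered page first or drive a real browser instead:

import requests
from bs4 import BeautifulSoup

def ccs_names(url="https://dl.acm.org/ccs"):
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Collect the visible text of every <li>, dropping empty entries.
    names = [li.get_text(strip=True) for li in soup.find_all("li")]
    return [n for n in names if n]

# Expected output: strings such as "General and reference", "Hardware", ...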

Detecting the changing areas in a web page

I'm trying to write a crawler that gets raw html data and finds the title, price, update date, photo, etc. fields and writes them to a database. This is a classic and old way to crawl data.
I think I can do this job in another way.
If I crawl all the pages (maybe more than 1000) on the web site and compare them all, I can find the specific areas.
I mean the html tags will always be the same; only specific areas like the title, image, etc. will change.
So, what is the best way to determine changed areas?
compare them all I can find the specific areas
what is the best way to determine changed areas?
In your question you describe a scraping/crawling approach of comparing pages' parts and getting the data from specific areas. This smells like a regex approach. Do not use it; it is a very inefficient approach. Rather, use XPath, operating on XML structures.
So, be simple:
Get html
Make it DOM
Make DOM a valid XML
Apply XPath queries to the XML
Believe me, xml libraries are well able to handle huge structures (including idle html tags) and traverse them. A classical example of using xpath is in this post of mine.
To determine the data node paths you just use the web inspector tools (F12 in Chrome and IE, Ctrl+Shift+I in Firefox) to see the html tags containing the useful info.
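A minimal sketch of that html -> DOM -> XPath pipeline in Python with lxml, whose parser tolerates messy HTML; the URL and the class names in the XPath expressions are placeholders you would replace with whatever the web inspector shows for your site:

import requests
from lxml import html

page = requests.get("https://example.com/item/123", timeout=30)   # placeholder URL
doc = html.fromstring(page.text)   # lenient parsing: broken HTML still becomes a usable tree

title  = doc.xpath("string(//h1)")
price  = doc.xpath("string(//*[@class='price'])")                    # placeholder class
photos = doc.xpath("//img[contains(@class, 'product-photo')]/@src")  # placeholder class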

Storing site data in columns or rows

This is a question about the best practice for storing data from a web page, like texts/image-urls/links, etc.
I have a CMS where you can create web pages. Here you can edit texts and upload images. In the future it would also be nice to "add new elements", add links to a-tags, etc.
I need a robust and flexible solution that also has good performance, in both getting/receiving this data.
Let's say I have 1000 pages, each with around 25 elements that can be updated and stored in the database.
Alternative 1)
Create a table with one column for each element on these pages, for example columns like:
title_1, title_2, image_1, image_2.
Here we have a set of columns that we can update, these we can use on the web page.
Alternative 2)
Create 1 table with the columns (id, namespace, page_id, data)
And for each element on the page I add the namespace in association with the page_id to make the data output unique. In the data column I can store any kind of information: text, links, etc.
What do you suggest as a good solution for this issue? I'm of course also open to other alternatives.
Thanks!
I would recommend option two, with the addition of a column identifying the element id and/or type, if indeed the element id is somehow comparable. That is to say, if anchor text (say) is always stored as element id = 4, then you would want that element id so that you could compare anchor texts across multiple documents.
If, on the other hand (and this is the scenario I imagine is more likely), you may have 1-25 elements on a page and each of them could be different (eg document one has three anchor texts and four images, document two has one anchor text and no images, etc) it would make sense to add an element_type_id table that stores a bit of information about the element types. This is assuming that you ever have any interest in comparing (say) images across multiple documents, or anchor texts across multiple documents, etc.
Another thing to consider: if you are likely to see the same element over and over again, it actually makes more sense to effectively parameterize those elements by way of a lookup table. So basically store each (say) unique anchor text in one table and reference its id in your actual data table.
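As a sketch only (table and column names are illustrative, not a prescribed schema), option two plus the element-type lookup suggested above could look roughly like this, shown with sqlite3 purely for concreteness:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE element_type (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL              -- e.g. 'title', 'image', 'anchor_text'
);

CREATE TABLE page_element (
    id              INTEGER PRIMARY KEY,
    page_id         INTEGER NOT NULL,
    namespace       TEXT    NOT NULL,        -- unique per element within a page
    element_type_id INTEGER REFERENCES element_type(id),
    data            TEXT,                    -- text, image url, link target, ...
    UNIQUE (page_id, namespace)
);
""")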
If I may add one additional thing: SO may not be the best place for the particular question you are asking. I'm not totally sure of that and maybe I'm wrong... but I would poke around the Stack Exchange network and see if other forums more closely deal with the type of question you are asking. At the very least, I'd observe that your question is fairly vague, and the goal of achieving a "robust and flexible solution that also has good performance, in both getting/receiving this data" is not likely to be accomplished simply by asking for advice on SO. There is a LOT that goes into data architecture, and certainly many of the details I would consider important in designing this myself are not present in your question. And if you're not sure what those details are, I am not sure SO is really the best place to set about learning them. I think https://softwareengineering.stackexchange.com/ may be a better fit for this question.
Just my opinion, and I could be wrong. Either way, I would consider learning a bit about database normal forms (http://www.bkent.net/Doc/simple5.htm or Google it) as well as do a little research on the types of design considerations that go into building a database (an old but still good SO article on that is here: What are the most important considerations when designing a database?)

Adding ids to HTML tags for QA automation

I have a query. In our application we have lots of HTML tags. During development many tags were not given any id because there was no requirement for one. Now the QA team wants to automate the test cases using QTP. In most cases this tool doesn't recognize elements because it does not find ids for most of the HTML tags. Now we are asked to add ids to all the HTML tags.
I want to know if there will be any effect of adding an id attribute to these tags. Even positive impacts are welcome.
I do not think there will be either a positive or a negative effect: maybe the size of the HTML page will increase a bit, but probably not that much.
Still, are you sure you need to put "id" attributes on every HTML tag of your pages? Wouldn't only a few of those be enough? Like on form fields, on links, on error messages; and that's probably about it?
One thing you must take care of, though, is that "id"s, as in "identifiers", must be unique; which implies it might be good, before you start adding them, to define some kind of "id policy", to say, for instance, that "ids for elements of that kind should be named that way".
And, for your next projects: have developers add those while they're developing ;-)
(And following the policy, of course)
Now that I'm thinking about it: a positive effect might be that it'll be easier to write JavaScript code interacting with your HTML document -- but that'll be true for your next projects, or for evolutions of this one, when those ids are already present in the HTML at the time the developers put the JS code in place...
Since there are no QTP-related answers yet:
GUI recognition in QTP is object-oriented. In order to identify an object, QTP needs a unique combination of the object's properties, and checking them should be as fast as possible - that is why an HTML ID would be ideal.
Where it is especially critical is for objects that do not have other unique identifiers. The most typical example is html tables: their contents are dynamic, and their number on the page may vary. By adding an HTML ID you allow the recognition mechanism to get straight to the right table.
Objects with other unique properties can be recognized well without HTML ID. For example, if you have a single "submit" link on the page QTP will successfully recognize it by inner text.
So the context-specific answer: don't start adding ids to every single tag. Ask the automation guys to prepare a list of the objects they have problems with, and add ids to those objects.
PS. It also depends on the automation programming skills. There are descriptive programming and dynamic recognition methods that allow retrieving the right objects even when no ids are provided.
As Albert said, QTP doesn't rely solely on elements' ids; in fact, because many web applications generate different ids for each session, the id property (as far as I remember) isn't part of the default description for most web test objects.
QTP is pretty good at recognizing most simple web controls, and if you're facing problems it may be the case that a Web Extensibility project will help you bridge the gap between the semantics of your web application and the raw HTML it is created in. If a complex control is recognized by QTP as a WebElement (which is actually the div that contains the span that drives the code), you will understandably have object recognition problems, since there are many divs on the page but probably far fewer complex controls.
If you are talking about side effects - NO. Adding ids won't cause any problems (apart from taking up some extra bytes, of course).
If you really have the need to add ids, go ahead and add them.
http://www.w3.org/TR/html4/struct/links.html#anchors-with-id says: The id and name attributes share the same name space. This means that they cannot both define an anchor with the same name in the same document. It is permissible to use both attributes to specify an element's unique identifier for the following elements: A, APPLET, FORM, FRAME, IFRAME, IMG, and MAP. When both attributes are used on a single element, their values must be identical.

Programmatically detecting "most important content" on a page

What work, if any, has been done to automatically determine the most important data within an html document? As an example, think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
Note: Ideally, the method would work with both well-formed markup and terrible markup - whether somebody uses paragraph tags to make paragraphs, or a series of breaks.
Readability does a decent job of exactly this.
It's open source and posted on Google Code.
UPDATE: I see (via HN) that someone has used Readability to mangle RSS feeds into a more useful format, automagically.
think of your standard news/blog/magazine-style website, containing navigation (possibly with submenus), ads, comments, and the prize - our article/blog/news body.
How would you determine what information on a news/blog/magazine is the primary data in an automated fashion?
I would probably try something like this:
open URL
read in all links to same website from that page
follow all links and build a DOM tree for each URL (HTML file)
this should help you come up with redundant contents (included templates and such)
compare DOM trees for all documents on same site (tree walking)
strip all redundant nodes (i.e. repeated, navigational markup, ads and such things)
try to identify similar nodes and strip if possible
find largest unique text blocks that are not to be found in other DOMs on that website (i.e. unique content)
add as candidate for further processing
This approach seems pretty promising because it would be fairly simple to do, but still has good potential to be adaptive, even to complex Web 2.0 pages that make excessive use of templates, because it would identify similar HTML nodes across all the pages on the same website.
This could probably be further improved by simply using a scoring system to keep track of DOM nodes that were previously identified to contain unique content, so that these nodes are prioritized for other pages.
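A simplified sketch of that idea in Python with BeautifulSoup: instead of full tree diffing, treat each page as a set of text blocks, count how often each block occurs across the site's pages, and keep the blocks that occur on only one page as content candidates (the 40-character threshold is an arbitrary choice):

from collections import Counter
from bs4 import BeautifulSoup

def text_blocks(html):
    soup = BeautifulSoup(html, "html.parser")
    blocks = [t.strip() for t in soup.stripped_strings]
    return [b for b in blocks if len(b) > 40]   # ignore tiny fragments

def unique_content(pages):
    """pages: dict of url -> raw html for many pages of the same site."""
    blocks_by_url = {url: text_blocks(html) for url, html in pages.items()}
    freq = Counter(b for blocks in blocks_by_url.values() for b in set(blocks))
    # Blocks appearing on many pages are template/navigation markup; drop them.
    return {url: [b for b in blocks if freq[b] == 1]
            for url, blocks in blocks_by_url.items()}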
Sometimes there's a CSS media section defined as 'Print'. Its intended use is for 'Click here to print this page' links. Usually people use it to strip a lot of the fluff and leave only the meat of the information.
http://www.w3.org/TR/CSS2/media.html
I would try to read this style, and then scrape whatever is left visible.
You can use support vector machines to do text classification. One idea is to break pages into different sections (say, treat each structural element like a div as a document), gather some properties of each, and convert it to a vector. (As other people suggested, this could be the number of words, number of links, number of images; the more the better.)
First start with a large set of documents (100-1000) that you already choose which part is the main part. Then use this set to train your SVM.
And for each new document you just need to convert it to vector and pass it to SVM.
This vector model is actually quite useful in text classification, and you do not necessarily need to use an SVM. You can use a simpler Bayesian model as well.
And if you are interested, you can find more details in Introduction to Information Retrieval. (Freely available online)
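A hedged sketch of the SVM idea with scikit-learn and BeautifulSoup: every <div> becomes a small feature vector (only the three features mentioned above; a real system would use many more), and the hand-labelled training set supplies the "is main content" labels:

from bs4 import BeautifulSoup
from sklearn.svm import SVC

def div_features(div):
    text = div.get_text(" ", strip=True)
    return [len(text.split()),          # number of words
            len(div.find_all("a")),     # number of links
            len(div.find_all("img"))]   # number of images

def featurize(html):
    soup = BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div")
    return divs, [div_features(d) for d in divs]

# X, y built from the 100-1000 hand-labelled documents (1 = main content, 0 = not):
# clf = SVC(kernel="linear").fit(X, y)
# For a new document:
# divs, feats = featurize(new_html)
# main = [d for d, f in zip(divs, feats) if clf.predict([f])[0] == 1]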
I think the most straightforward way would be to look for the largest block of text without markup. Then, once it's found, figure out the bounds of it and extract it. You'd probably want to exclude certain tags from "not markup" like links and images, depending on what you're targeting. If this will have an interface, maybe include a checkbox list of tags to exclude from the search.
You might also look for the lowest level in the DOM tree and figure out which of those elements is the largest, but that wouldn't work well on poorly written pages, as the dom tree is often broken on such pages. If you end up using this, I'd come up with some way to see if the browser has entered quirks mode before trying it.
You might also try using several of these checks, then coming up with a metric for deciding which is best. For example, still try to use my second option above, but give its result a lower "rating" if the browser would enter quirks mode normally. Going with this would obviously impact performance.
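A crude sketch of the "largest block of text without markup" check with BeautifulSoup: look at every text node, skip the ones sitting inside tags you want to exclude, and return the element owning the longest run of plain text:

from bs4 import BeautifulSoup

def largest_text_block(html, exclude=("a", "script", "style")):
    soup = BeautifulSoup(html, "html.parser")
    best, best_len = None, 0
    for node in soup.find_all(string=True):
        if node.parent.name in exclude:
            continue
        length = len(node.strip())
        if length > best_len:
            best, best_len = node.parent, length
    return best   # the element containing the longest contiguous piece of text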
I think a very effective algorithm for this might be, "Which DIV has the most text in it that contains few links?"
Seldom do ads have more than two or three sentences of text. Look at the right side of this page, for example.
The content area is almost always the area with the greatest width on the page.
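One way to score that "most text, few links" heuristic, shown as a rough sketch that ranks each <div> by its total text length divided by the length of the text inside its links (it is naive about nesting, so a real version would also penalize very large wrapper divs):

from bs4 import BeautifulSoup

def best_div(html):
    soup = BeautifulSoup(html, "html.parser")
    def score(div):
        text = div.get_text(" ", strip=True)
        link_text = " ".join(a.get_text(" ", strip=True) for a in div.find_all("a"))
        return len(text) / (len(link_text) + 1.0)
    divs = soup.find_all("div")
    return max(divs, key=score) if divs else None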
I would probably start with Title and anything else in a Head tag, then filter down through heading tags in order (ie h1, h2, h3, etc.)... beyond that, I guess I would go in order, from top to bottom. Depending on how it's styled, it may be a safe bet to assume a page title would have an ID or a unique class.
I would look for sentences with punctuation. Menus, headers, footers, etc. usually contain separate words, but not sentences containing commas and ending in a period or equivalent punctuation.
You could look for the first and last element containing sentences with punctuation and take everything in between. Headers are a special case since they usually don't have punctuation either, but you can typically recognize them as Hn elements immediately before sentences.
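A short sketch of that punctuation test: a block counts as prose if it contains sentence-ending punctuation and a reasonable number of words (the 8-word threshold is arbitrary), and the content is everything between the first and last such block:

import re

SENTENCE_END = re.compile(r"[.!?](\s|$)")

def looks_like_prose(block, min_words=8):
    return bool(SENTENCE_END.search(block)) and len(block.split()) >= min_words

def content_span(blocks):
    """blocks: the page's text blocks in document order."""
    idx = [i for i, b in enumerate(blocks) if looks_like_prose(b)]
    # Headers between the first and last prose block come along for free.
    return blocks[idx[0]: idx[-1] + 1] if idx else []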
While this is obviously not the answer, I would assume that the important content is located near the center of the styled page and usually consists of several blocks interrupted by headlines and such. The structure itself may be a give-away in the markup, too.
A diff between articles / posts / threads would be a good filter to find out what content distinguishes a particular page (obviously this would have to be augmented to filter out random crap like ads, "quote of the day"s or banners). The structure of the content may be very similar for multiple pages, so don't rely on structural differences too much.
Instapaper does a good job with this. You might want to check Marco Arment's blog for hints about how he did it.
Today most news/blog websites are using a blogging platform.
So I would create a set of rules by which I would search for content.
For example, two of the most popular blogging platforms are WordPress and Google Blogspot.
Wordpress posts are marked by:
<div class="entry">
...
</div>
Blogspot posts are marked by:
<div class="post-body">
...
</div>
If the search by css classes fails you could turn to the other solutions, identifying the biggest chunk of text and so on.
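Those per-platform rules could be kept in a small ordered table of CSS selectors and tried in turn, as sketched below with BeautifulSoup; only the WordPress and Blogspot class names come from the answer above, anything else you add is site-specific:

from bs4 import BeautifulSoup

PLATFORM_SELECTORS = [
    ("wordpress", "div.entry"),
    ("blogspot",  "div.post-body"),
]

def extract_post(html):
    soup = BeautifulSoup(html, "html.parser")
    for platform, selector in PLATFORM_SELECTORS:
        node = soup.select_one(selector)
        if node:
            return platform, node.get_text(" ", strip=True)
    return None, None   # fall back to the biggest-chunk-of-text heuristics above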
As Readability is not available anymore:
If you're only interested in the outcome, you can use Readability's successor Mercury, a web service.
If you're interested in some code how this can be done and prefer JavaScript, then there is Mozilla's Readability.js, which is used for Firefox's Reader View.
If you prefer Java, you can take a look at Crux, which also does a pretty good job.
Or if Kotlin is more your language, then you can take a look at Readability4J, a port of the above Readability.js.