Find out which div contains the "main content" with a crawler

We have a crawler that crawls hundreds of thousands of pages per week. Currently, to get the data out of the crawled HTML, we look at the HTML manually and note that "OK, Data A is within <div class="info-list"> and Data B is inside <h1>", and then we use a parser to extract the data from those divs.
I guess this is the most common way to parse crawled HTML for most people, but it means that we have to know the HTML structure of all the pages and domains that we crawl. So it is not very scalable.
If we could just figure out what div the "main content" is, so that we can ignore other things such as "Relevant products" or "Relevant articles", or "Main menu" and so on, we could easily parse the data in the same way as we do now but without having to specify the exact div names and position of each data.
So... How do we figure out which is the "main div" of a page?
I'm pretty sure that Google does this. They definitely know position of elements on a page, and if something is positioned in the "main content" or in the footer for example. How can they know this?
The methods that I can see for doing this at large scale are:
Render the page, look for the largest divs, and start from there. But rendering hundreds of thousands or millions of pages is neither cheap nor efficient.
Try to figure it out from the content of every div. For example, the div with the most links inside it is probably the menu, and the div with the most text inside it is probably the main content. But this gets really tricky if the content looks like this:
<body>
  <div class="maincontent">
    <div class="post-header">
      <h1>Header of post</h1>
    </div>
    <div class="short-description">
      Hello World!
    </div>
    <div class="long-description">
      Hello New World!
    </div>
  </div>
</body>
Obviously the div we want to identify as the "main content" is <div class="maincontent">. But if we look for the div that has the "most text", we would get .long-description instead.
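One possible way around the "most text" trap, offered as a sketch rather than a proven method: count text per subtree instead of per div, then descend from the root for as long as a single child still holds most of the page's text. The names TreeBuilder and main_container and the 0.8 threshold are illustrative assumptions, and link text is ignored on the assumption that it is mostly navigation.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs, parent):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.text_len = 0  # non-link text characters in this whole subtree


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root", [], None)
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        length = len(data.strip())
        if length == 0 or self.current.tag in ("script", "style"):
            return
        ancestors = []
        node = self.current
        while node is not None:
            ancestors.append(node)
            node = node.parent
        if any(n.tag == "a" for n in ancestors):
            return  # link text is treated as navigation, not content
        for n in ancestors:
            n.text_len += length


def main_container(html, threshold=0.8):
    """Return the deepest node that still holds `threshold` of the page's text."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    total = builder.root.text_len
    node = builder.root
    while total:
        # A threshold above 0.5 guarantees that at most one child can qualify.
        heavy = [c for c in node.children if c.text_len >= threshold * total]
        if not heavy:
            break
        node = heavy[0]
    return node
```

On the example above, the descent stops at the .maincontent div, because none of its three children holds 80% of the text on its own. The threshold would need tuning against real pages.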
This is starting to become quite a long question, but my point is that it is really hard to figure out which part of a page is the "main content". I'm asking any smart people out there to help me come up with a decent way to find the div or divs that probably contain the most important content of the page.
EDIT: I guess one way of rendering is not to render every single page, but to render once per URL pattern within a domain. For example, if the URL structure is http://example.com/post/1-post-name/, I can save a render of that, and the next time I find a page such as http://example.com/post/2-post-name/ I know that it probably has the same HTML template as the first one, and the "largest div" is probably the same.
So what technique could I use to do this server side? I mean rendering the page and saving the sizes and positions of all the elements. This seems like a pretty decent way of doing it at large scale.
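The "render once per URL pattern" idea from the EDIT could be sketched roughly like this. The function names, the layout_cache dictionary, and the rule that any path segment containing a digit is a per-page slug are all assumptions made for illustration; real sites will need smarter pattern detection.

```python
import re
from urllib.parse import urlsplit


def template_key(url):
    """Collapse a URL into a rough key for pages built from the same template."""
    parts = urlsplit(url)
    segments = []
    for segment in parts.path.split("/"):
        if not segment:
            continue
        # Assumption: a path segment containing a digit is a per-page slug or ID.
        segments.append("*" if re.search(r"\d", segment) else segment)
    return parts.netloc + "/" + "/".join(segments)


# Hypothetical cache: template key -> whatever the one expensive render found.
layout_cache = {}


def lookup_or_render(url, render):
    """Call the expensive `render` callback only for the first URL of a template."""
    key = template_key(url)
    if key not in layout_cache:
        layout_cache[key] = render(url)
    return layout_cache[key]
```

With this, http://example.com/post/1-post-name/ and http://example.com/post/2-post-name/ both map to the key example.com/post/*, so only the first of them triggers a render.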

I would try multiple approaches. For example start with the obvious - is there an id="content" or class="main_content" ? Use it! Look for ids and classes that are common for big content blocks and if they exist then use them. If not then move on to less certain tests.
Next, try narrowing things down. Is there a <header> or <nav> tag? Ignore that and everything above it. Ignore a <footer> or a class="sidebar" as well.
Make some rules, let them run, and then manually inspect what comes back, looking for patterns where you are pulling in too much or leaving things out. Adjust your rules and write new ones based on that.
At that point you might even let the pages that get past all your tests go to a short list where you check them by hand and create domain-specific rules that point out the exact div you want to use. You can still be very efficient with some human intervention, and visually looking over 8 sites out of 50 is still a pretty good deal.
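A minimal sketch of the first and last steps above: look for an obvious id or class, and send everything that has none to a manual review list. The hint names in CONTENT_HINTS and the functions find_content_hint and triage are made up for illustration; the actual list should come from inspecting your own crawl results.

```python
from html.parser import HTMLParser

# Illustrative guesses only; extend this set from what you see in real pages.
CONTENT_HINTS = {"content", "main", "main_content", "main-content", "maincontent"}


class HintFinder(HTMLParser):
    """Remember the first element whose id or class looks like a content block."""

    def __init__(self):
        super().__init__()
        self.match = None

    def handle_starttag(self, tag, attrs):
        if self.match is not None:
            return
        attrs = dict(attrs)
        # Valueless attributes come through as None, hence the `or ""`.
        names = [attrs.get("id") or ""] + (attrs.get("class") or "").split()
        for name in names:
            if name.lower() in CONTENT_HINTS:
                self.match = (tag, name)
                return


def find_content_hint(html):
    finder = HintFinder()
    finder.feed(html)
    finder.close()
    return finder.match  # (tag, matched name) or None


def triage(pages):
    """Split {url: html} into pages with an obvious hint and pages for manual review."""
    matched, review = {}, []
    for url, html in pages.items():
        hint = find_content_hint(html)
        if hint:
            matched[url] = hint
        else:
            review.append(url)
    return matched, review
```

The review list is the short list to check by hand; any domain-specific rule you write for it can then be added as another test ahead of the manual step.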

I haven't really found a great way to decide which div is the "main content" yet. However, I found PhantomJS, which lets you render the page you are crawling on the server side and use JavaScript and jQuery to get the sizes and positions of elements on that page.
So with PhantomJS you can definitely find out which div is the "largest" and which div is at the top, bottom, or center, which already goes a long way toward finding out which div on a page is the "main content".

Related

Why put an <img> inside a container (like a <div>, for instance)?

I am just learning to code and have been looking everywhere for an answer on this one and for some reason cannot find anything.
I noticed that it seems to be common practice to put an image inside of a container or wrapper. For instance, rather just having:
<img src="url"/>
Everyone seems to be in agreement that it needs to be this way:
<div class="container">
  <img src="url"/>
</div>
What is the purpose of wrapping the img inside of a div in this way? It seems to have something to do with "responsive design", but I'm not 100% sure. Is it just so that we have something to size the image relative to, rather than using definite sizing like pixels on the image selector in css? The more I think about it as I write this, the more it seems to be the right answer, but I'm not sure if there's something else I'm missing on this one.
Any insight would be very much appreciated. Thank you.
Unfortunately, there is no single correct answer for this.
There can be many reasons why one would wrap any element in another element; it is not specific to <img /> tags :)
In your question I read something like this (converted to real world example):
I see that it is common practice to put a frame around a photo.
Where the "frame" would be the wrapper element, and the photo would be the <img />.
Looking at it this way might make it clearer. The photo is the most important part; technically speaking, you don't need a frame to show the photo. But if you have just a photo, you won't be able to hang it on your wall without damaging it by driving a nail through the top or applying some tape. If you have a frame, though, you can make that photo take up any amount of the available space within it, you can use the clip to hang it on a wall, and if you put multiple photos in the frame, you can move them all at once since they are in the same frame.
The reason most people put that image in a "container" is that they get some sort of advantage out of it over using an image alone. This could range from aspect-ratio locks to relative positioning. In some cases, a wrapper is also required to achieve certain (notably more complex) animations.
Websites are built out of "logical" pieces that, together, form a website. The individual pieces are all "frames" that "flow" together to create any page layout you see on every website.
It is merely a structural way of thinking. If the purpose of that image were to be a background image for the entire page, a better alternative would be to use the CSS background-image property on the <body> tag and not use an <img> at all. But if the image is meant to be part of a smaller part of your website, it should probably be contained as appropriate.
This answer is in no way a guide to go by, nor a ruleset or anything like that; these are just the thoughts of another developer. There are countless reasons for wrapping an element and this answer doesn't even cover 0.0000001% of those cases. I'm just saying -- there's no specific reason to do it or not to do it here.

Marking up the "BBC pattern" in HTML5

I'm looking at the BBC site, and putting together something following a similar overall pattern, and determining how to mark it up appropriately is stumping me somewhat.
The BBC consists of several what could be considered sites in their own right:
http://www.bbc.co.uk
http://www.bbc.co.uk/comedy/
http://www.bbc.co.uk/news/
www.bbc.co.uk/[a-lot-more-stuff]
(these could all be subdomains instead - indeed, this is the case for me - but the URLs are not important)
Each of these is essentially self-contained, with its own content, menu and look and feel. However all of them are tied together by the use of the (slightly variable but mostly) static header bar. This contains the header "BBC" along with links to all of the various sub-sites.
So the question is, how should this be marked up. I see several different options:
1) The main BBC header is the site's main <header> and <nav>. This is sort of correct, because that is what it is, but it ends up de-emphasising the importance of the sub-site's actual content. When it boils down to it (to use the examples above), the title "Comedy" and its associated menu are the main content of the page, not the BBC bar.
2) Make the sub-sites' header and navigation the ones that are marked up within <header> and <nav>. This feels better, but it then raises the question of what the BBC bar now is. One option is to use an <aside>, which then contains its own <header> and <nav>. As far as I know, this is fine for the header, but having that other <nav> element is still weird. Is this a better option than the above?
3) Do the same as number 1 (the BBC bar has the main <header> and <nav>), but mark up the rest of the page inside an <article> element. The spec indicates that the article element is to be used for items which make sense on their own, which is the case here. It would also make sense for it to have its own <header> (and <nav>? Is this pushing it somewhat?). But this seems to be stretching the definition of an 'article' rather further than its dictionary definition allows.
To me, having given it some thought and thrown some ideas back and forth on Twitter, number 2 seems the best of these options. However the idea of essentially putting the contents of an <aside> as the top element on the page (visually and in markup, since it seems to make most logical sense this way) doesn't quite sit right with me.
Am I overlooking an obvious solution, or is this an unusual enough pattern that it really is as difficult as it seems? And surely I can't be the only one to puzzle over this?
Thanks for any thoughts.
The main header should be, as you pointed out, marked up in <header> and <nav>.
I would then mark up each additional piece of page content in an <article> containing its own <header> and <nav>. Ignore the dictionary definition of article; it doesn't really apply here. It's fine to have more than one <nav> element on a page: as long as its contents navigate within the site, that makes sense.
Putting the top header in an <aside> also doesn't seem correct to me, as the content isn't standalone.
Just my thoughts on the subject!

is it bad to use many div's in a single page?

This is the first time I am properly coding in HTML and CSS. In my code I have used a whole lot of divs to position things and to put the content in place. I am not sure if I am coding the right way. I have loads of content in a single page too. Here is the link to the code I have used:
http://jsfiddle.net/32ShZ/
Can you please make suggestions? Is it really bad in structure and shape?
Absolutely not. You don't want to go overboard though (it's called "div soup" when you do). If you find that a div has no purpose but to hold a background image, or to clear a float, etc that means you've done something wrong. By using wrappers (e.g. 3 levels deep of div tags for a content area that has some backgrounds, etc is OK), you can properly achieve any layout that you need without resorting to "div soup". Take a look at http://www.digitalperfections.net/ for an example of good (x)HTML with a lot of div tags.
To further expand, and answer the question about your code specifically, I noticed one thing right off the bat: <div id="divider"></div> - this is bad because you're using this div purely for non-semantic purposes (for decoration only).
The general principle is to use as little HTML for layout as possible, and to style your page with CSS instead. So if a minimal number of divs can achieve your task, you should go for it. This helps make the page lighter and more maintainable. But how small a structure (HTML) you can get away with in your page depends on your experience and your design.

How to break up HTML documents into pages for ebook?

For an iPhone ebook application I need to break arbitrarily long HTML documents up into pages which fit exactly on one screen. If I simply use UIWebView for this, the bottom-most lines tend to get displayed only partly: the rest disappears off the edge of the view.
So I assume I would need to know how many complete lines (or characters) would be displayed by the UIWebView, given the source HTML, and then feed it exactly the right amount of data. This probably involves lots of calculation, and the user also needs to be able to change fonts and sizes.
I have no idea if this is even possible, although apps like Stanza take HTML (epub) files and paginate them nicely. It's a long time since I looked at JavaScript, would that be an option worth looking at?
Any suggestions very much appreciated!
update
So I've hit upon a possible solution, using JavaScript to annotate the DOM-tree with sizes and positions of each element. It should then be possible to restructure the tree (using built-in XSLT or JavaScript), cutting it up in pages which fit exactly on the screen.
Remaining problem here is that this always breaks the page on paragraph-boundaries, since there is no access to the text at a lower level than the P-element. Perhaps this can be remedied by parsing the text into words, encapsulating each word in a SPAN-tag, repeating the measurement procedure above, and then only displaying the SPAN elements that fit onto the screen, inserting the remaining ones at the front of the next page.
All this sounds rather complicated. Am I talking any sense? Is there a simpler way?
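The page-splitting step of the word-span idea can be sketched on its own, assuming the measurement has already been done. This is Python for illustration only; in the app the (top, bottom) pixel values would come from JavaScript measuring each SPAN in the web view, and the function name paginate is made up.

```python
def paginate(boxes, page_height):
    """Group word indices into pages.

    boxes[i] is the (top, bottom) pixel position of the i-th word span,
    measured in document order. A page is closed as soon as the next word's
    bottom edge would fall below the page's available height.
    """
    pages = []
    current = []
    page_top = None
    for index, (top, bottom) in enumerate(boxes):
        if current and bottom - page_top > page_height:
            pages.append(current)
            current = []
        if not current:
            page_top = top  # the new page starts at this word's line
        current.append(index)
    if current:
        pages.append(current)
    return pages
```

Because words on the same line share the same top and bottom, a whole line always moves to the next page together, so no line is cut in half. Changing the font or size only means measuring again and re-running the grouping.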
You should look at the CSS Paged Media module: http://www.w3.org/TR/css3-page/
CSS3 also supports multi-column layouts (google for "css3-multicol". I don't have enough Karma to include a second link here :-)
About your update: how about doing the layout of one single page, then use a DIV with overflow:hidden for the text part. Next thing would be to overlay a transparent item on top of that, that would programmatically scroll the inner content of the DIV PAGE_HEIGHT pixels up or down according to some navigation controls (or gestures).
The other option is to have a parent <div> with multiple css3 columns: link1, link2.
This works on Android:
<style type='text/css'>
div {
  width: 1024px; /* calculated */
  -webkit-column-gap: 0px;
  -webkit-column-width: 320px; /* calculated */
}
p {
  text-align: justify;
  padding: 10px;
}
</style>
The CSS multicol suggestions are very interesting! However, and I hope it's ok to respond with another question: how would you go from splitting one or more long <p> elements into columns to having one particular of these columns being rendered in a WebView? The DOM hasn't changed, so you can't pick out an element and render it. What am I missing?

How to make a web page load from bottom to top?

Every web page loads from top to bottom, meaning the header is loaded first, then the content, and finally the footer. How can I make it load from bottom to top instead, meaning first the footer, then the content, and finally the header?
Do you get what I am trying to say?
Or, alternatively, how can I make it load from right to left, or from left to right?
This is probably one of the more bizarre questions I've seen here...
You cannot change the order in which the browser loads the file, it will always start at the beginning and read to the end. However, if you change the order of the file such that the footer is first and the header is last, the browser will render it in that order. As long as the CSS places each element in the correct place, it should work.
This will probably have some strange side effects since the browser will have to rerender or move elements several times as it moves the footer down the page to make room for the elements above it.
Is there really a need for this? Web pages generally load fast enough that users won't notice what direction they load in, and if your page isn't loading that fast, then I would focus on finding out why instead of trying to render it in a different order.
A web page is HTML + additional files.
The HTML file is loaded and read start-to-finish. When the browser gets to a point in the file that references another file (such as CSS, .JS, an image, etc.), it sends a request to get that file.
You have control over that in that you can rearrange your HTML any way you want to.
What you don't have control over is how long it takes to request and then retrieve each of the individual files.
If you want full control, then you pretty much need to load everything but keep it hidden, and then reveal the items in the order you want them to appear via javascript and CSS.
All that said, though, the better answer is "No. You can't. That's just how the web works".
If this is for some kind of cool effect on your page, you could check out Page Transitions. These only work in IE though. If that is the case, it looks like you want the Wipe effect.
If you just want it to look like it's loading from bottom to top, then you could hide everything with CSS in the head and then have JavaScript unhide the elements starting from the bottom of the page, but I really don't know why you'd want to do this. Can you give us some more information on the effect you're trying to create?
Visually, you could get the sort of effect where one would see the content before the header by putting the header after the content in the HTML output then use CSS to make the header appear first visually.
If you want to scroll your content in somehow, I'd check out jquery and animations.
Assumption 1: Load content before styles/javascript.
In this assumption you care about the page loading first THEN the css/javascript executing thus allowing the user to get the content before all scripts/styles load and thus speed up the usability of the page.
To accomplish this put the style/script tags as the last elements in your body.
Assumption 2: Bizarro-world loading.
In this assumption you want the footer loaded/displayed first, then content, then header in that exact order.
1) The html head element will load before the body. No way to change that. Header = page header in my wording.
2) Use the following html pseudocode
<html>
<head></head>
<body>
<div id="footer"></div>
<div id="content"></div>
<div id="header"></div>
</body>
</html>
And in your CSS, reverse the visual order of the body's children. Floating full-width elements would only stack them in source order, so a column-reverse flex container is used here instead. The page then arrives footer first, but once displayed, the header is on top and the footer at the bottom.
body { display: flex; flex-direction: column-reverse; }