How would you pick the best image from a webpage in a crawler? - html

If you were given a random webpage on the internet and had only the HTML source, what method would you use to pick the image that best describes that page? Assume that there are no meta tags or hints.
Facebook does something similar when you post a link, but it gives you a choice of n images to choose from; it doesn't actually pick one unless the page has the meta tags on it.

Try to analyze the structure of the page. Most web pages have, roughly, a header, a content area and a footer. The content area is most likely to contain images related to the subject of the page, so that's what you're looking for.
Find the content area
Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.
The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
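A minimal sketch of the "look for an id or class named content" guess, using only Python's standard html.parser. The hint list CONTENT_HINTS and the function name find_content_candidates are illustrative assumptions, not an established vocabulary; you would extend the hints after looking at real pages.

```python
from html.parser import HTMLParser

# Hypothetical id/class tokens that often mark the main content area.
CONTENT_HINTS = {"content", "main", "article", "post"}


class ContentAreaFinder(HTMLParser):
    """Records elements whose id or class suggests they hold the content."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # (tag, id, class) in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # A class attribute may hold several space-separated tokens.
        tokens = set((attrs.get("class") or "").lower().split())
        tokens.add((attrs.get("id") or "").lower())
        if tokens & CONTENT_HINTS:
            self.candidates.append((tag, attrs.get("id"), attrs.get("class")))


def find_content_candidates(html):
    finder = ContentAreaFinder()
    finder.feed(html)
    return finder.candidates
```

The first candidate is only a starting guess; counting h1/h2 headings inside each candidate, as described above, could be used to rank them.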
Find the header and footer
Another approach is to identify the header and footer. Headers usually contain a hint to the logo of the site, such as an image, CSS class name or link to the root of the site. Footers are most likely to contain things like copyright statements.
You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
Once you have the header and footer, the content is usually in between :)
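The link-analysis idea can be sketched roughly as follows: for each direct child of body, count links to the same host versus links to other hosts. The names LinkProfile and link_profile are hypothetical, and the depth counting assumes reasonably well-formed markup (unclosed tags such as a bare p will throw it off).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LinkProfile(HTMLParser):
    """Counts internal and outgoing links per direct child of <body>."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.host = urlparse(page_url).netloc
        self.depth = None  # None until <body> is seen, 0 directly inside it
        self.blocks = []   # [label, internal_links, outgoing_links]

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self.depth = 0
            return
        if self.depth is None:
            return
        if self.depth == 0:
            # A new top-level block starts here; label it by id or tag name.
            self.blocks.append([attrs.get("id") or tag, 0, 0])
        if tag == "a" and attrs.get("href"):
            host = urlparse(urljoin(self.page_url, attrs["href"])).netloc
            self.blocks[-1][1 if host == self.host else 2] += 1
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1


def link_profile(html, page_url):
    parser = LinkProfile(page_url)
    parser.feed(html)
    return [tuple(block) for block in parser.blocks]
```

Blocks with many internal links and no outgoing ones are header/footer suspects; what sits between them is a content suspect.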
Find an image
Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
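As a rough sketch of "first image in the content area, skipping small ones": the parser below assumes the content element has already been identified by its id, and it only looks at declared width/height attributes (an image without them is accepted, and you would have to download it to know its real size). MIN_SIDE and the function name first_content_image are assumptions for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
MIN_SIDE = 100  # assumed threshold in pixels; tune against real pages


class FirstContentImage(HTMLParser):
    def __init__(self, content_id):
        super().__init__()
        self.content_id = content_id
        self.depth = 0     # > 0 while inside the content element
        self.image = None  # src of the first acceptable image

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.depth == 0:
            # Outside the content area: only watch for its opening tag.
            if attrs.get("id") == self.content_id and tag not in VOID_TAGS:
                self.depth = 1
            return
        if tag == "img" and self.image is None and self._big_enough(attrs):
            self.image = attrs.get("src")
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1

    @staticmethod
    def _big_enough(attrs):
        # Reject only when a declared dimension is below the threshold.
        for name in ("width", "height"):
            value = attrs.get(name)
            if value and value.isdigit() and int(value) < MIN_SIDE:
                return False
        return True


def first_content_image(html, content_id="content"):
    parser = FirstContentImage(content_id)
    parser.feed(html)
    return parser.image
```

The depth counter relies on matching open and close tags, so badly nested markup would need a more forgiving parser.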
Fall back to an educated guess
If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.
In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.

This is best-guess stuff, but:
ignoring anything hosted in another domain will eliminate most ads
once you've grabbed the images, you can get their size; the biggest is probably the one to use.
images that are inside an <a> that points to the root of the domain are probably logos. Example: the SO logo on this page is inside such a link.
Edited to add:
It's true that large sites use auxiliary servers for their images. But you could probably make up a couple of simple parsing rules that will get 80% of cases, picking out g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
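The first and third rules, plus the allow-list from the edit, could be sketched like this. TRUSTED_IMAGE_HOSTS holds just the two hosts named above; the class and function names are hypothetical. Measuring the downloaded images and picking the biggest is left out, since it needs network access and an image library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Off-domain hosts treated as non-ad image servers (examples from the answer).
TRUSTED_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}


class ImageCollector(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.page_host = urlparse(page_url).netloc
        self.in_root_link = False  # True while inside <a href="/">-style links
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            target = urlparse(urljoin(self.page_url, attrs.get("href") or ""))
            self.in_root_link = (target.netloc == self.page_host
                                 and target.path in ("", "/"))
        elif tag == "img" and attrs.get("src"):
            src = urljoin(self.page_url, attrs["src"])
            host = urlparse(src).netloc
            same_site = host == self.page_host or host in TRUSTED_IMAGE_HOSTS
            # Skip off-site images (likely ads) and images linking home (likely logos).
            if same_site and not self.in_root_link:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_root_link = False


def candidate_images(html, page_url):
    collector = ImageCollector(page_url)
    collector.feed(html)
    return collector.images
```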

If you find an og:image meta property, you can use it quite safely: it is part of the Open Graph specification, which Facebook uses to pick images for shared links.
Example of format:
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
...
</head>
...
</html>
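Pulling that value out in a crawler might look like the sketch below, again with Python's standard html.parser. The function name og_image is an assumption; it simply returns the content of the first og:image meta tag, or None if there isn't one.

```python
from html.parser import HTMLParser


class OpenGraphImage(HTMLParser):
    def __init__(self):
        super().__init__()
        self.url = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Keep only the first og:image; later ones are ignored in this sketch.
        if (tag == "meta" and self.url is None
                and attrs.get("property") == "og:image"):
            self.url = attrs.get("content")


def og_image(html):
    parser = OpenGraphImage()
    parser.feed(html)
    return parser.url
```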

Well, I would try to look for divs/spans/h1s with a class or id like "logo" or "top". Almost every page has its logo at the top of the page. Just look at the Stack Overflow logo :)
I do it this way in my crawler and it works fine :)

Related

Is there valid HTML5 markup when the article header is above the site sidebar?

A classic site design for an individual article these days has a full width article header - containing say, the title, sharing links, a banner image etc - as that is the most prominent part of the page.
Underneath that, the article content appears with a site sidebar to its right.
Am I correct that this is basically impossible to achieve with correct HTML5 article / aside markup?
The header and content are both part of the article, so (without using position absolute, which is usually impossible due to variable heights) the article tag must wrap around both.
However, this also means the aside must be within the article tag, while the HTML5 specs say that site sidebars, not directly corresponding to the article, should be outside of the article tag.
Edit - for reference, since the answer is probably yes, it's impossible, I'm wondering what people are choosing as the least evil alternative, since this seems a common design.
You are correct. I have dealt with this for a long time. The rest of the development community seems to put the header above and outside the article and does not question it.
I currently put the header inside the article and use JavaScript to place the sidebar inside the article and under the header.
CSS Grid won't fix the issue.
Recently I have been thinking it's easier to pull the header out and put it above the article with JavaScript.

How do I add a boilerplate header and footer to every page of a printed HTML document?

I'm writing a script to automatically transform plain text test documents into HTML tables with proper column width, step numbering, etc. Each page will have some part of a big, long table in it. I also need to have a header and footer on each page to comply with FDA regulations in this area; it has simple information about copyright, page number, part numbers, etc.
I have noticed some of the prescribed CSS/HTML tools for this task don't seem to work.
1) The @page rule with margins, and the ability to put content into those margins, would solve this problem pretty neatly, but I don't think it was ever implemented.
2) Many of the suggested answers to these questions on SO:
How to add a header and footer to each printed page of a web document (without browser restriction)?
Is there a way to get a web page header/footer printed on every page?
end up with some combination of: the footer/header rendering on top of the text, or somehow breaking the page-break-inside rendering of the body so that half of a line renders on each page.
Is it still (given the answers and my attempts to use them) impossible to do this in a clean way? Or at all? I don't mind what browser I have to use to print them correctly either.

html4 header tag position

In the XHTML source code of all my websites, navigation and breadcrumbs appear below the content of the page, yet visually they appear above it. I do this because I believe that search engines then find the content more relevant.
In all the HTML5 examples I've seen, the order is classical:
header, body section, footer.
From an SEO point of view, when working on an HTML5 page, is it better to use the classical tag order or the one I have used until now in XHTML?
Unfortunately, this is more or less outdated advice.
Both Google and Bing have for many years now had the ability to render the DOM of the page and determine the actual layout of the page regardless of how the code is structured.
The old theory behind this technique was that search engines would only index the first 100kb or so of a page and typically that could be taken up by templated boilerplate code in some instances. This isn't a restriction that really exists anymore and to be honest if your pages are reaching that kind of filesize you probably have other things that you want to consider.
I think it is better when the content with keywords appears earlier in the source code. For the general link structure it doesn't matter where the main navigation links are placed.
But maybe search engines can weight structure differently when you use standard semantic ids like navigation, breadcrumbs, content and footer? In that case the position would not matter. Isn't the semantic thing one of the big advantages of HTML5?!

Should all presentational images be defined in CSS?

I've been learning (X)HTML & CSS recently, and one of the main principles is that HTML is for structure and CSS for presentation.
With that in mind, it seems to me that a fair number of images on most sites are just for presentation and as such should be in the CSS (with a div or span to hold them in the HTML) - for example logos, header images, backgrounds.
However, while the examples in my book put some images in CSS, they are still often in the HTML. (I'm just talking about 'presentational' images, not 'structural' ones which are a key part of the content, for example photos in a photo site).
Should all such images be in CSS? Or are there technical or logical reasons to keep them in the HTML?
Thanks,
Grant
If an image is "content", say the editorial image in a newspaper article, then use an img tag. If it is part of your UI, theme or skin or whatever the name is, then yes, put it in CSS.
Suggested readings
Designing with Web Standards (Zeldman)
Bulletproof Web Design (Dan Cederholm)
CSS Mastery (Andy Budd, Cameron Moll, Simon Collison)
One reason to put those images in CSS might be to serve different browsers from the same web site, just by changing the CSS: for example, if you detect a mobile/embedded/pocket browser you could give them the same HTML but with a CSS that doesn't include images.
I put them in CSS if possible. One reason is that I think they belong there, like you mentioned, and the other is the possibility of using sprites. This can reduce the loading time of your page significantly.
The src attribute of an img tag is required according to the HTML 4.01/XHTML 1.0 DTD. That is why it should always be included in the HTML.
You can specify it in the CSS for skinning purposes, but most images in most cases are static and unchanging, so putting them in CSS is an unnecessary step.
Well, it depends. For example, if you want to do some effects when the mouse is over an image, it must be in the HTML. When you put the image in the HTML you can position it more freely than in CSS. Also, as far as I know, images included via CSS are not crawled (you may want your company's logo to be crawled by search engines).
If you think about accessibility, images embedded in the HTML can have alt and title information. So, for example, when you put the mouse over your company's logo, the browser could show your company's motto if you embed it with a title="motto" attribute in the img tag. You can't do that with CSS.
Also, people are used to putting images in the HTML, not the CSS, and habits are a hard thing to change.
In conclusion, it depends on your needs: if CSS isn't flexible enough, you should put the images in the HTML. But if CSS fits your needs for UI images, then CSS is the better idea.
Sometimes loading UI images using CSS also prevents users from downloading your UI images to their drives when saving a page.
Of course there are other ways to save them; this is just a point to add.
And browsers tend to prioritize CSS over HTML, so loading images through CSS might be a little faster than through HTML.

How Does Facebook Know What Image To Parse Out of An Article?

First off, I want to say that I wasn't really sure where to post this, but it is very much programming related. If it is in the wrong spot I apologize; please let me know where I should post it instead.
When sharing an article on a friend's wall, Facebook will grab a thumbnail of the article. How do they always get the right thumbnail from articles?
It doesn't grab the logo img element of http://www.nytimes.com/2010/06/07/world/asia/07convoys.html?hp, for example, but rather grabs the correct image element that corresponds with the article.
I'm looking to do something similar and was wondering about a good way to parse the HTML to find the image, given this example. Thanks.
Actually, Facebook's way of finding thumbnails isn't so magical. It searches for a set of <meta> and <link> tags which specify which title, description, and image to use.
If it cannot find any of the <meta> and <link> tags it is looking for, it basically asks the user to choose whichever <img> tag fits.
In the case of the NY Times, it uses the following:
<meta name="thumbnail" content="whatever.jpg" />
Facebook recommends you use a <link> tag instead for the thumbnail.
<meta name="title" content="title" />
<meta name="description" content="description " />
<link rel="image_src" href="thumbnail_image" />
Source: Facebook Share/Specifying Meta Tags
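If you want your own scraper to honor the same hints, a sketch along these lines would do. It reads the meta and link tags shown above; preferring link rel="image_src" over meta name="thumbnail" is my assumption based on Facebook's stated recommendation, and the names ShareHints and share_image are made up for the example.

```python
from html.parser import HTMLParser


class ShareHints(HTMLParser):
    """Collects the title/description/thumbnail hints a page declares."""

    def __init__(self):
        super().__init__()
        self.hints = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") in ("title", "description", "thumbnail"):
            # setdefault keeps the first occurrence of each hint.
            self.hints.setdefault(attrs["name"], attrs.get("content"))
        elif tag == "link" and (attrs.get("rel") or "").lower() == "image_src":
            self.hints.setdefault("image_src", attrs.get("href"))


def share_image(html):
    parser = ShareHints()
    parser.feed(html)
    # Assumed preference order: the <link> hint first, then the <meta> one.
    return parser.hints.get("image_src") or parser.hints.get("thumbnail")
```

When this returns None, you are back to guessing from the img tags on the page.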
They don't always grab the correct image, even though there's certainly some good logic in place.
In many cases, I've seen a list of thumbnails to choose from, meaning Facebook's parser considered them equally relevant.
I would guess they (probably among other things) look at the dom structure and find images close to content that looks "shareable".
UPDATE:
After some empirical testing, it seems that image dimensions play a big role. Images that are too small or too wide are not considered as thumbnails. If your logo is the right size, though, expect it to show up as one of the thumbnails. Try sharing something on http://www.e24.se for example.
These are just guesses as I don't have any knowledge of Facebook's internal operations, but if I were parsing thumbnails from a page I would consider several things:
Size of the image, as previously stated
Relevant keywords in the src or alt attributes
Location of the <img> tag on page, the closer to relevant content the better, but may not always work for complicated layouts
Absence of ad-related keywords in the <img> tag or nearby tags (doubleclick comes to mind)
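To make the list above concrete, here is one way those signals could be combined into a single score. Everything numeric here (MIN_SIDE, MAX_ASPECT, the way keywords and position are weighted) is an arbitrary assumption for illustration, not anything Facebook is known to use, and the Candidate fields are assumed to have been filled in by an earlier parsing and measuring step.

```python
from dataclasses import dataclass

AD_KEYWORDS = ("doubleclick", "adserver", "banner")  # illustrative only
MIN_SIDE = 50      # assumed: anything smaller is an icon or spacer
MAX_ASPECT = 3.0   # assumed: wider than 3:1 looks like a banner or logo strip


@dataclass
class Candidate:
    src: str
    alt: str
    width: int
    height: int
    position: int  # 0 for the first image in the content area, 1 for the next, ...


def score(image, keywords):
    """Returns 0.0 for rejected images, otherwise a larger-is-better score."""
    if min(image.width, image.height) < MIN_SIDE:
        return 0.0
    if image.width / image.height > MAX_ASPECT:
        return 0.0
    text = (image.src + " " + image.alt).lower()
    if any(word in text for word in AD_KEYWORDS):
        return 0.0
    points = float(image.width * image.height)                  # size
    points *= 1 + sum(1 for k in keywords if k.lower() in text)  # keyword hits
    points /= 1 + image.position                                 # earlier is better
    return points


def best_image(images, keywords):
    ranked = [image for image in images if score(image, keywords) > 0]
    return max(ranked, key=lambda image: score(image, keywords), default=None)
```

The keywords would typically come from the page title or headings.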
Also, as far as I know, the Facebook meta tags are fairly new, so my guess is that the link page scraper is still grabbing images the hard way ;) However, if you're running a site and want Facebook to grab the right information when it scrapes your pages, I highly suggest implementing them.