How Does Facebook Know What Image To Parse Out of An Article? - html

First off I want to say that I wasn't really sure where to post this but it is very much programming related. If it is in the wrong spot I apologize and please let me know where I should post it instead.
When sharing an article on a friend's wall, Facebook grabs a thumbnail of the article. How do they always get the right thumbnail from articles?
For example, it doesn't grab the logo img element of http://www.nytimes.com/2010/06/07/world/asia/07convoys.html?hp but rather the image element that corresponds to the article.
I'm looking to do something similar and was wondering about a good way to parse the HTML to find the image in this example. Thanks.

Actually, Facebook's way of finding thumbnails isn't so magical. It searches for a set of <meta> and <link> tags which specify which title, description, and image to use.
If it cannot find any of the <meta> and <link> tags it is looking for, it basically asks the user to choose whichever <img> tag fits.
In the case of the NY Times, it uses the following:
<meta name="thumbnail" content="whatever.jpg" />
Facebook recommends you use a <link> tag instead for the thumbnail.
<meta name="title" content="title" />
<meta name="description" content="description " />
<link rel="image_src" href="thumbnail_image" />
Source: Facebook Share/Specifying Meta Tags
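If you want to do the same lookup yourself, here is a minimal Python sketch using only the standard library. It is not Facebook's actual code; the names ShareTagParser and find_share_info are made up for illustration, and the priority order (the <link> tag first, then <meta name="thumbnail">) is my assumption based on the recommendation above.

```python
from html.parser import HTMLParser


class ShareTagParser(HTMLParser):
    """Collects the share-related <meta> and <link> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}         # name -> content, first occurrence wins
        self.image_src = None  # href of the first <link rel="image_src">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.meta.setdefault(attrs["name"].lower(), attrs["content"].strip())
        elif tag == "link" and (attrs.get("rel") or "").lower() == "image_src":
            if self.image_src is None:
                self.image_src = attrs.get("href")


def find_share_info(html):
    parser = ShareTagParser()
    parser.feed(html)
    return {
        "title": parser.meta.get("title"),
        "description": parser.meta.get("description"),
        # Assumed priority: the <link> tag first, then <meta name="thumbnail">.
        "image": parser.image_src or parser.meta.get("thumbnail"),
    }
```

If "image" comes back as None, you are in the fallback case where the user has to choose among the page's <img> tags.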

They don't always grab the correct image, even though there's certainly some good logic in place.
In many cases, I've seen a list of thumbnails to choose from, meaning Facebook's parser considered them equally relevant.
I would guess they (probably among other things) look at the dom structure and find images close to content that looks "shareable".
UPDATE:
After some empirical testing, it seems that image dimensions play a big role. Images that are too small or too wide are not considered thumbnails. If your logo is the right size, though, expect it to show up as one of the thumbnails. Try sharing something on http://www.e24.se for example.

These are just guesses as I don't have any knowledge of Facebook's internal operations, but if I were parsing thumbnails from a page I would consider several things:
Size of the image, as previously stated
Relevant keywords in the src or alt attributes
Location of the <img> tag on page, the closer to relevant content the better, but may not always work for complicated layouts
Absence of ad-related keywords in the <img> tag or nearby tags (doubleclick comes to mind)
Also, as far as I know the Facebook meta tags are fairly new, so my guess is that the link page scraper is still grabbing images the hard way ;) However if you're running a site and want Facebook to grab the right information when it scrapes your pages I highly suggest implementing them.
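To make the guesses listed above concrete, here is a rough Python sketch of how such a scorer might look. Everything in it is an assumption for illustration: the Image fields, the AD_HINTS list, the thresholds (min_side, max_aspect) and the weights are invented, not anything Facebook has published.

```python
from dataclasses import dataclass

AD_HINTS = ("doubleclick", "adserver", "banner")  # hypothetical blocklist


@dataclass
class Image:
    src: str
    alt: str
    width: int
    height: int
    distance_to_content: int  # hypothetical: DOM nodes between image and main text


def score_image(img, keywords, min_side=50, max_aspect=3.0):
    """Return a relevance score, or None if the image is rejected outright."""
    text = (img.src + " " + img.alt).lower()
    if any(hint in text for hint in AD_HINTS):
        return None  # looks ad-related
    if min(img.width, img.height) < min_side:
        return None  # too small
    if img.width / img.height > max_aspect:
        return None  # too wide
    score = img.width * img.height / 10_000                # bigger is better
    score += 5 * sum(k.lower() in text for k in keywords)  # keyword hits
    score -= img.distance_to_content                       # closer is better
    return score


def pick_thumbnail(images, keywords):
    scored = [(score_image(i, keywords), i) for i in images]
    scored = [(s, i) for s, i in scored if s is not None]
    return max(scored, key=lambda pair: pair[0])[1] if scored else None
```

The weights would need tuning against real pages; the point is only that each bullet becomes one cheap check.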

Related

What is <meta> in html and how it can be useful?

I don't know what the meta element in HTML is or what it is used for. What is the purpose of the name and content attributes, and how will this element affect my webpage?
I have seen it a couple of times and I tried to learn from a book, but I couldn't understand it.
<html>
<head>
<title>What is meta?</title>
<meta>
</head>
<body>
</body>
</html>
Meta is another word for self-referential, which means that meta(data) tags provide information about the HTML document (i.e. the webpage) itself.
w3schools has a good description on the HTML meta tag:
They won't be displayed on the page, but will be machine parsable.
For example, common meta tag attributes are description (what is this document about?) and author (who does this belong to?) which are used for machines like search engines.
Beyond this, you can also set things like character encoding and the viewport which is commonly used for responsive web design, so you can probably guess that it can be widely useful for your webpage!
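To show what "machine parsable" means in practice, here is a small Python sketch of how a program might read these tags without rendering the page. The names MetaCollector and read_meta are invented for this example, and it only handles the charset and name/content forms mentioned above.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Gathers <meta> tags the way a crawler might (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.named = {}      # name -> content, e.g. "description", "author"
        self.charset = None  # from <meta charset="...">

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            self.charset = attrs["charset"]
        elif attrs.get("name"):
            self.named[attrs["name"].lower()] = attrs.get("content") or ""


def read_meta(html):
    collector = MetaCollector()
    collector.feed(html)
    return collector.charset, collector.named
```

Fed a page whose head contains a description and a viewport tag, read_meta returns those values as a plain dictionary, which is roughly the view a search engine has of them.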
Good luck in your learning.
Other resources:
https://smallbusiness.chron.com/meta-tags-used-promote-accessibility-search-engine-opimization-74918.html
https://metatags.io
https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta
It helps a bit with your site's SEO and makes it easier for search engines to know more about your site. Some tags, however, are visible in search results (e.g. the description can be seen in Google under the link).

Meta Tags in Website

I have a website, and I need to figure out a few things:
Where to put the meta tag?
How many meta tags do I need?
Can I put all the webpages in 1 meta tag or do I need multiple?
As for my website, there are over 1000 things you can do, so an example would be "John is looking for a poker player." On my website, if you go under board games and click cards, you could add a classify OR if you do a search, you can look for members who play poker/card games. This is one example of thousands of activities.
My question is: do I need to create 1 meta tag for keywords of poker, friend, activity to show up on an SEO, OR can I create 1 meta tag that will hold 1000+ keywords on 1000+ different topics?
My website was created in C#. I get confused when I look up meta tags on YouTube and find them written out in Notepad as HTML.
You should not use a meta tag for keywords!
The Keywords Meta Tag
A long time ago in a galaxy far, far away, the “keywords” meta tag was
a critical element for early search engines. Much like the dinosaurs,
this tag is a fossil from ancient search engine times.
The only search engine that looks at the keywords anymore is
Microsoft's Bing – and they use it to help detect spam. To avoid
hurting your site, your best option is to never add this tag.
Or, if that's too radical for you to stomach, at least make sure you
haven't stuffed 300 keywords in the hopes of higher search rankings.
It won't work. Sorry.
If you already have keyword meta tags on your website, but they aren't
spammy, there's no reason to spend the next week hurriedly taking them
out. It's OK to leave them for now – just take them out as you're
able, to reduce page weight and load times.
Check this link for the crucial parts of your SEO!
This website can show you where your SEO falls short!
It is also worth checking how fast your website responds. You can check this link
The last two links give detailed information on how to fix the problems you have.
Meta tags should be in <head>, CSS also in <head>, and JavaScript, if possible, at the end of <body>.
You can also try Google's web speed test.
EDIT:
Here are the meta description and title. If your website is written in C#, this is probably located in Site.Master!
<head>
<title>Not a Meta Tag, but required anyway </title>
<meta name="description" content="Awesome Description Here">
</head>
1) Meta tags are always in the <head> element of the page.
2) It depends on what metadata you want to add to your page.
3) You will need 1 <meta> tag for each meta type. So 1 tag will be enough for your keywords.
You can find more about meta tag on W3Schools.

Proper use of og meta tags on website

I have a website that has deep linking enabled. When I try to add the image link to Facebook, it comes up with one of three images to choose from for the share.
I have the proper meta tags added for Open Graph. However, because I want the ability to share any one of the 40 images on the site, is it OK to add a
<meta property="og:image" content="thumbnail_image" />
tag for every image, or is there a wiser way to solve this issue?
Additionally, should I/should I not add individual description tags, too?
You can add multiple og:image tags, but the share dialog will only show three of them for the user to choose from.
And if you have individual pieces of content, those should of course have individual descriptions as well (where appropriate).
More information on general Open Graph tags can be found here: https://developers.facebook.com/docs/opengraph/creating-custom-stories#objecttypes-properties and here: http://ogp.me/
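As a rough illustration of emitting several og:image tags plus a per-page description, here is a short Python sketch. The helper name og_tags and the example.com URLs are made up; on a real site this would live in whatever templating layer you already use.

```python
from html import escape


def og_tags(description, image_urls):
    """Emit one og:description tag plus one og:image tag per image URL."""
    tags = [f'<meta property="og:description" content="{escape(description)}" />']
    tags += [f'<meta property="og:image" content="{escape(url)}" />' for url in image_urls]
    return "\n".join(tags)


# Hypothetical usage for one deep-linked page:
print(og_tags("Gallery", ["http://example.com/a.jpg", "http://example.com/b.jpg"]))
```

Since the share dialog only offers three images, it probably makes sense to pass the images most relevant to the deep-linked page first.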

What is <link rel="image_src">

Today I came across a <link rel="image_src"> tag. I didn't know about it, so I searched Google, which told me that this tag is similar to og:image. So I went to the Open Graph main site, http://ogp.me/, to read about it, but I found nothing about link rel="image_src". Is this tag a replacement for meta property="og:image", or is it a special tag from another specification? How is this tag used, and what is it used for?
The rel attribute specifies the type of the link, i.e. the kind of the relationship between the document and the linked resource. Usually just a few keywords, like stylesheet and icon, are used. Although many other keywords have been proposed and registered, most of them are write-only: they are meant to express something, but nobody cares (no software uses the information).
The extension mechanisms of HTML5 include, in the description of link types, a somewhat obscure mechanism that allows, in theory, anyone to register their favorite keyword in the existing rel values wiki to make documents using it as a rel value “conforming”.
And image_src has indeed been registered there, with the information that it is used to “specify a Webpage Icon for use by Facebook, Yahoo, Digg, etc.”. No specification has been identified, but an article about it is linked to, and it is “probably redundant with rel=icon”.
You can use this tag to specify the image used as the thumbnail when a link is shared.
When someone posts a link to your site on social media, such as Facebook, the image that is displayed with your link is usually the first one in your code. This may not be the image that best defines your site, and it may not fit well in the small box that Facebook posts. The link rel="image_src" tag lets you control what image (or images, you can have more than one by stacking separate references) is displayed alongside your link.

How would you pick the best image from a webpage in a crawler?

Suppose you were given any random webpage on the internet and had only the HTML source. What method would you use to find the image that best describes that webpage? Assume that there are no meta tags or hints.
Facebook does something similar when you post a link, but it gives you a choice of n images to choose from; it doesn't actually pick one unless the page has the meta tags.
Try to analyze the structure of the page. The majority of web pages roughly have a header, content and footer area. The content area is most likely to contain images related to the subject of the page, so that's what you're looking for.
Find the content area
Most content areas are div elements with an ID or class named content, so that's always a good first guess. There may be alternative descriptors of the content element, so you'll need to do some research to find common patterns.
The content area will also contain multiple h1 or h2 headings in most cases, so that's another indicator to look for.
Find the header and footer
Another approach is to identify the header and footer. Headers usually contain a hint to the logo of the site, such as an image, CSS class name or link to the root of the site. Footers are most likely to contain things like copyright statements.
You can also find the header and footer by analyzing the links on the page. Most internal links will be in the header and footer, while the content has relatively more outgoing links, if any.
Once you have the header and footer, the content is usually in between :)
Find an image
Once you've identified the content area, the first image is usually your best pick. You should, however, ignore images with a small width and/or height, as these will likely be decorative images.
You could also double-check the images against any included CSS files, to make sure you're not picking an image that's related to the design of the page.
Fall back to an educated guess
If you cannot reliably guess the content area of the page, just use the biggest image on the page, as egrunin suggested. Again, you can check this image against the CSS files, to rule out any design-related images.
In the fall-back case, you could log the URL and review those pages to improve your image detection algorithms.
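The steps above can be sketched in Python roughly like this. It is a simplification under stated assumptions: the content area is a div whose id or class is exactly "content", image sizes come from the width/height attributes (a real crawler would have to download the images or read the CSS), and min_side=50 is an arbitrary threshold. The names ImageFinder and pick_image are invented.

```python
from html.parser import HTMLParser


def _size(attrs):
    """Read width/height attributes; missing or non-numeric values count as 0."""
    def num(value):
        return int(value) if value and value.isdecimal() else 0
    return num(attrs.get("width")), num(attrs.get("height"))


class ImageFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.div_depth = 0        # > 0 while inside the content <div>
        self.content_images = []  # (src, width, height) inside the content area
        self.all_images = []      # every image on the page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            names = (attrs.get("id") or "").split() + (attrs.get("class") or "").split()
            if self.div_depth or "content" in names:
                self.div_depth += 1
        elif tag == "img" and attrs.get("src"):
            image = (attrs["src"], *_size(attrs))
            self.all_images.append(image)
            if self.div_depth:
                self.content_images.append(image)

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def pick_image(html, min_side=50):
    finder = ImageFinder()
    finder.feed(html)
    # First sufficiently large image inside the content area, if any.
    for src, width, height in finder.content_images:
        if min(width, height) >= min_side:
            return src
    # Fallback: the biggest image anywhere on the page.
    if finder.all_images:
        return max(finder.all_images, key=lambda i: i[1] * i[2])[0]
    return None
```

The header/footer detection and the CSS cross-check are left out here; they would slot in as extra filters before the fallback.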
This is best-guess stuff, but:
ignoring anything hosted in another domain will eliminate most ads
once you've grabbed the images, you can get their size; the biggest is probably the one to use.
images that are inside <a> and point to the root of the domain are probably logos. Example: the SO logo on this page is inside such a link.
Edited to add:
It's true that large sites use auxiliary servers for their images. But you could probably make up a couple of simple parsing rules that will get 80% of cases, picking out g-ecx.images-amazon.com and static.ak.fbcdn.net as non-ad servers.
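Here is a small Python sketch of the domain and logo rules, using urllib.parse from the standard library. The function names are made up, the trusted-host set contains only the two servers mentioned above, and hosts are compared exactly, so www.example.com and example.com would count as different domains in this simplified version.

```python
from urllib.parse import urljoin, urlparse

# Image hosts named above as non-ad servers; extend as needed.
TRUSTED_IMAGE_HOSTS = {"g-ecx.images-amazon.com", "static.ak.fbcdn.net"}


def is_probably_ad(page_url, img_src):
    """Treat images hosted on another domain as ads, unless the host is trusted."""
    page_host = urlparse(page_url).hostname
    img_host = urlparse(urljoin(page_url, img_src)).hostname
    return img_host != page_host and img_host not in TRUSTED_IMAGE_HOSTS


def is_probably_logo(page_url, link_href):
    """An image wrapped in a link to the site root is probably the logo."""
    target = urlparse(urljoin(page_url, link_href))
    return target.hostname == urlparse(page_url).hostname and target.path in ("", "/")
```

urljoin resolves relative src and href values against the page URL first, so "/img/photo.png" is correctly treated as same-domain.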
If you find an og:image meta property, you can use that quite safely, as it is part of the Open Graph specification, which is used to provide images for Facebook links.
Example of format:
<html xmlns:og="http://opengraphprotocol.org/schema/">
<head>
<title>The Rock (1996)</title>
<meta property="og:image" content="http://ia.media-imdb.com/rock.jpg"/>
...
</head>
...
</html>
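A crawler could pull that property out with something like the following Python sketch (standard library only; OpenGraphParser and og_images are hypothetical names, not part of any Open Graph library):

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []  # og:image values in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("property") == "og:image" and attrs.get("content"):
            self.images.append(attrs["content"])


def og_images(html):
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.images
```

An empty list means the page has no og:image and you are back to guessing from the <img> tags.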
Well, I would try to look for divs/spans/h1s with something like class or id = "logo" or "top". Almost every page has its logo at the top of the page. Just look at the Stack Overflow logo :)
I do it this way in my crawler and it works fine :)