How to get the most representative image of a webpage?

There are cases when you want to get the most representative image of a
web page; for example, Pocket tries to attach an image when you save a
web page.
How would you define, in a programmatic way, which image is the key
image?
What would be the most appropriate way to do so?

Most websites that are seeking to be shared on sites like Facebook or Pocket will have an Open Graph protocol image. This is a meta tag in the page's head that uses the format <meta property="og:image" content="http://URL-TO-YOUR-IMAGE" />. The Open Graph protocol is used and looked for by companies such as Facebook, Pocket, and Reddit, and has become fairly widespread.
For websites that do not follow such a standard, developers will often use a third-party tool such as Embedly, which has already solved the problem. Simply feed it a URL and it will return information about which content would make a good thumbnail image.
If you want to create your own engine, you may want to study DOM-position heuristics and develop your own algorithm by scraping many, many articles and web pages to find good patterns.

Study scraper.py to see how reddit uses BeautifulSoup to find representative images from links submitted to it.
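A rough, simplified sketch of the same idea in Python (this is not reddit's actual code, and the function name is made up): prefer the page's og:image, and fall back to the <img> with the largest declared width × height.

import urllib.request
from bs4 import BeautifulSoup

def find_representative_image(url):
    # Download and parse the page
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # 1. Use the Open Graph image if the page declares one
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        return og["content"]

    # 2. Otherwise, pick the <img> with the largest declared area
    best_src, best_area = None, 0
    for img in soup.find_all("img", src=True):
        try:
            area = int(img.get("width", 0)) * int(img.get("height", 0))
        except (ValueError, TypeError):
            continue  # skip images with non-numeric dimensions like "100%"
        if area > best_area:
            best_src, best_area = img["src"], area
    return best_src

A real implementation would also fetch the candidate images to measure their true dimensions, since many pages don't declare width/height attributes.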

Related

How to enable other websites to embed my website's content using only a content link?

I have spent the past week searching the Internet for at least one hint about this, and there are no tutorials or even SO questions available. What I am trying to find out is this: when a website uses a library like oEmbed to embed content from other websites, it fetches the embed code from the content's link. For example, when you post a YouTube link on Facebook or another social network, the embed code is fetched automatically. I know how to fetch embed code, but what I don't know is how to provide embed code that other websites can fetch using a link to my website's content.
I want my article to be embedded in a particular way, not in that website's default layout. So is there any META tag or something in HTML where I can put embed code for other websites to use?
I don't think what you want is possible. You can use special meta tags that specific sites (e.g. Facebook, Twitter, LinkedIn) will interpret, and that will help you customize the share a little (still using the "host site" style). But as far as I know, there's nothing you can do to provide style/code of your own.
And it makes sense from a security point of view: embedding external code from an unknown source is potentially dangerous and no site would/should allow you to do it. Even if they do allow it, they should pre-process the code and sanitize it (adapting your style/code to their style/code) to prevent possible threats.
As suggested by Alvaro Montoro, I searched the Internet for how to become an oEmbed provider. These are the links I found:
https://timnash.co.uk/becoming-oembed-provider/
http://freear.org.uk/content/5-steps-being-oembed-provider
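In rough outline (all URLs and names below are placeholders, not a real endpoint): each content page advertises your oEmbed endpoint with a discovery link in its head, and that endpoint returns a JSON payload whose html field contains the embed code other sites will use.

<link rel="alternate" type="application/json+oembed"
      href="http://example.com/oembed?url=http%3A%2F%2Fexample.com%2Farticle%2F42&format=json"
      title="Example article oEmbed" />

The endpoint would then answer with something like:

{
  "version": "1.0",
  "type": "rich",
  "html": "<iframe src=\"http://example.com/embed/article/42\" width=\"600\" height=\"400\"></iframe>",
  "width": 600,
  "height": 400,
  "title": "Example article",
  "provider_name": "Example Site",
  "provider_url": "http://example.com/"
}

Keep in mind that even with this in place, consuming sites still decide for themselves whether and how to render the html you return.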
You may want to use the CSS !important directive.
http://css-tricks.com/when-using-important-is-the-right-choice/

Is there an effective way to serve specific thumbnail images to fetcher bots?

My site uses some aggressive caching techniques to keep requests to a minimum, among them being:
.htaccess redirects to cached HTML files;
Automatic merging of content images into CSS sprites.
This works great for human traffic, but when an article is posted on Facebook, Pinterest, Google+, Reddit, etc the bot fails to find a suitable thumbnail since the page images are all big sprite JPEGs.
One solution would be .htaccess rules that bypass the cache when a bot is making the request, preferably without having to specifically name every possible bot user-agent, but I am unsure how to accomplish that.
Another solution would be to embed one good thumbnail image on every page that a bot would download but a real web browser would not. Any ideas on how to accomplish that?
Other suggestions are welcome. If all else fails I'll rework my script to exclude the first image of every post from the autosprites, but that will effectively double the number of image requests my poor overworked server must accommodate.
Showing different things to bots than to humans is a very bad approach regardless of the problem you're trying to solve. Google will sometimes even punish sites that do this with a low search ranking. A better way to do this would be to go to each bot's website and see if there is a way to tell that bot to display an image that is relevant to that page.
For example, Facebook accepts the following meta tag in the head of your html to tell it an image that is relevant to your page:
<meta property="og:image" content="[url to the image]">

How should I structure the code so Facebook and Google+ fetch the right elements when someone shares a URL?

I was wondering which elements are fetched when a user shares a URL on Facebook or Google+...
For example, how can I make sure the description of the post will be the description I want shared, and the image will be the image I want shared?
The title is pretty obvious, so I skipped that.
Facebook suggests the Open Graph protocol: http://ogp.me
It works reliably and can be checked with the Facebook URL linter: http://developers.facebook.com/tools/lint/
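For example (the URLs and text below are placeholders), the tags that control the shared title, description, and image look roughly like this:

<meta property="og:title" content="The title you want shared" />
<meta property="og:description" content="The description you want shared" />
<meta property="og:image" content="http://example.com/images/share-thumbnail.jpg" />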
Funny, I just wrote a blog post about this very topic this week. It seems to me that there's no reliable way of knowing how either social network site will parse your web page to get the "status" version of it. Not only does each site do it differently (i.e. FB vs. LinkedIn vs. G+), but they're liable to change it at a whim.
So currently the short answer is that you can't know this for sure. You have to reverse engineer each social network site's behavior and hope it doesn't change too often. That is, until the industry smartens up and decides on some markup to convey, for example, which image from a page is considered the cardinal "share" image, and so on.

Prevent site configuration info from showing up on Google

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason, all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google, without having that Flash player config info show up?
And can I customize what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I want to show up, I can probably do most of the edits myself, I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!
If you don't want Google, or other crawlers, to access certain parts of your website, you should use a robots.txt file. Inside it you specify which parts are accessible and which aren't; when crawlers get to your website they will always look for this file for instructions.
You can check some documentation on how to do it here and here
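A minimal robots.txt, placed at the root of the site, might look like this (the paths are placeholders for wherever the player config and iframe content actually live):

User-agent: *
Disallow: /player-config/
Disallow: /iframe-content/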
In order to influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
Some more information from Google on controlling parts of a page. Apparently there are googleoff/googleon tags.
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
Hope this helps.
If you want Google to index only part of your pages, you can't follow normal SEO routines. You would have to provide a mechanism to detect whether the current client (requester) is a robot or not, and if it is, not render that part. This is the only way; otherwise, a robot either gets the whole rendered content, or doesn't have access at all based on the robots.txt file (Robots Exclusion Protocol).
Another way (which is not really smart, and can't be guaranteed to work) is to dynamically inject your content into the page via JavaScript, because, as far as I know, robots don't run JavaScript.
As search spiders won't render JavaScript-generated markup (the JS is only run client-side in the browser), a quick fix would be to not output any of the Flash markup initially in the HTML document, and then use JS to add the Flash content on load.
Note: as far as I'm aware, Google is currently testing a JS-reading spider, so this may not work long term.
Google is returning this data because it simply can't find any content where it normally would. Search engines require content - they're not advanced enough to process your multimedia to determine what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (of which there are only iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines) - ensure it's not some dinky text like "download flash here", but a lengthy, descriptive piece of content about your site or media, close to what visitors would normally experience if they could.
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and inline JS on your site (I'd recommend WP-Minify since it's so obvious that you love plugins).

Web site as image/clip art library with reference?

As a software developer, I have built many web applications and have been blogging about my programming experiences. I would like to use pictures in many cases. A picture is worth a thousand words, and pictures are a universal language!
You could create your own clip art images or download graphics (actually, many open clip art/image libraries are available, the Open Clip Art Library for example). However, your time and art skill are limited, and you can only keep a limited library of images.
I wish there were open art/image library web sites with permanent references available, so that you could just add a simple reference in your HTML page like this, the same way you can use other people's or other web sites' graphics:
<img src="http://OpenArtLibray.net/icon/work/DoItYourself.png".../>
In this way, there is no need to waste time downloading and uploading images, and no disk space is wasted on your computer or anyone else's (no duplication). Just one place with a huge variety of images, open for people to use freely or for some reasonable fee. People could vote on the popularity of the art/images as well.
Is there any such kind of web site available?
Typically sites discourage this. What this really does is shift the bandwidth cost to the hosting site. There have been cases where sites with pictures have analyzed the referrer to determine if images are linked to from other sites, then served an image with text claiming the image is being 'stolen'.
The point of that is that the idea isn't very well liked.
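For what it's worth, that referrer check is usually just a few mod_rewrite rules; a minimal sketch, assuming Apache and placeholder domain/image names:

RewriteEngine On
# Let requests with an empty referrer through (direct visits, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let requests from the site's own pages through (placeholder domain)
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?example\.com/ [NC]
# Never rewrite the warning image itself, or the rule would loop
RewriteCond %{REQUEST_URI} !hotlink-warning\.png
# Everything else asking for an image gets the "this image is being stolen" placeholder
RewriteRule \.(jpe?g|png|gif)$ /images/hotlink-warning.png [L]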
However, some sites, like the W3C, allow you to link to their certification images. It all depends on what you are linking to.
It is hard to think of a business doing this, as there doesn't seem to be a revenue aspect.
Even if some users were charged fees, there's a lot of work involved in checking/verifying who has paid via the referrer. Maybe you have a new business plan.
Update:
Oh, I have a friend who always sends me emails with links to Flickr. Maybe their license lets you link to images on their site. Something for you to check out.
Update:
This text, "photo hosting sites", makes for an interesting, relevant Google search.
Thanks for Chris's explanation; I could accept it as an answer. However, I raised this question because I really don't like to "steal" images. I can see it is hard to charge fees, but there are so many open resources available on the web. Actually, I found the Open Clip Art Library, which allows people to contribute and share images. I found many good pictures there and downloaded them. I may contribute some when I create images for my blog, so I'll let other people use mine.
Flickr is an open social place for people to store and share pictures. As long as pictures are shared there, especially by individual people, I think you can use and link to images there. Still, you have to do the work: creating and uploading. Actually, I tried another site, Dropbox. I can create a public folder there and add my pictures for sharing. All those sites have one common problem: they are tied to a personal account and may not be available if the account is inactive for a certain period of time (90 days for Dropbox?).
That's why I asked this question here on Stack Overflow. I hope some people may know of available hosts or other alternative options. Maybe it is just as Chris said: "the idea isn't very well liked".
Actually, I realize that the Open Clip Art Library I mentioned in my previous post does provide an image-hosting-like service. If you click the download link on anyone's picture, it opens a new tab or window displaying the graphic, and that display has its own URL. I created a new user name and submitted my picture. It works well; I can include the graphics in my test web page. I'm not sure how long the URL will stay there, but it looks like a permanent one.
Try searching for Creative Commons licensed works. People will often upload and share photos on places such as Flickr under a Creative Commons license, which allows you to remix, reference, or use them in your own projects, blogs, or sites.
There are different types of CC licenses, with some asking you not to use the works if you're going to make money from them or engage in commercial activity.
You just have to nod back to the original author when using items under CC, and if you link back to them, that's just good karma.