Is there an effective way to serve specific thumbnail images to fetcher bots? - html

My site uses some aggressive caching techniques to keep requests to a minimum, among them being:
.htaccess redirects to cached HTML files;
Automatic merging of content images into CSS sprites.
This works great for human traffic, but when an article is posted on Facebook, Pinterest, Google+, Reddit, etc., the bots fail to find a suitable thumbnail because the page images are all big sprite JPEGs.
One solution would be .htaccess rules that bypass the cache when a bot makes the request, preferably without having to name every possible bot user-agent. I am unsure exactly how to accomplish that; a rough sketch of what I have in mind is at the end of this question.
Another solution would be to embed one good thumbnail image on every page that a bot would download but a real web browser would not. Any ideas how to accomplish that?
Other suggestions are welcome. If all else fails I'll rework my script to exclude the first image of every post from the autosprites, but that will effectively double the number of image requests my poor overworked server must accommodate.
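For illustration, this is roughly the kind of .htaccess rule I have in mind (untested, and it still means listing the bot user-agents I happen to know about):

# Untested sketch: let known sharing bots through to the real page
# instead of the cached copy. The user-agent tokens are examples only.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (facebookexternalhit|Pinterest|Googlebot|redditbot) [NC]
RewriteRule ^ - [L]
# ...the existing rewrite to the cached HTML files would follow here...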

Showing different things to bots than to humans is a very bad approach regardless of the problem you're trying to solve; Google will sometimes even punish sites that do this with a lower search ranking. A better approach is to check each service's documentation for a way to tell its bot which image is relevant to the page.
For example, Facebook accepts the following meta tag in the head of your HTML to tell it which image is relevant to your page:
<meta property="og:image" content="[url to the image]">

Related

How to get the most representative image of a webpage?

There are some cases where you want to get the most representative image of a web page; for example, Pocket tries to add an image when you save a page.
How would you determine, programmatically, which image is the key image?
What would be the most appropriate way to do so?
Most websites that want to be shared on sites like Facebook or Pocket provide an Open Graph protocol image. This is a meta tag in the page head of the form <meta property="og:image" content="http://URL-TO-YOUR-IMAGE" />. The Open Graph protocol is recognized by Facebook, Pocket, Reddit, and others, and has become fairly widespread.
For websites that do not follow such a standard, developers often use a third-party tool such as Embedly, which has already solved this problem: feed it a URL and it returns information about which content would make a good thumbnail.
If you want to build your own engine, study DOM position and size heuristics, and derive your own algorithm by scraping many, many articles and web pages to find reliable patterns.
Study scraper.py to see how reddit uses BeautifulSoup to find representative images from links submitted to it.
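As a rough sketch of that kind of heuristic (this is not reddit's actual code; the packages used, the size threshold, and the fallback logic are illustrative assumptions):

import requests
from bs4 import BeautifulSoup

def representative_image(url):
    # Prefer an explicit og:image, else fall back to the first reasonably large <img>.
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # 1. Explicit hint from the page author (Open Graph).
    og = soup.find("meta", property="og:image")
    if og and og.get("content"):
        return og["content"]

    # 2. Fallback: first <img> that declares a reasonably large size.
    for img in soup.find_all("img", src=True):
        try:
            width = int(img.get("width", 0))
            height = int(img.get("height", 0))
        except ValueError:
            continue  # skip sizes like "100%"
        if width >= 200 and height >= 200:
            return img["src"]
    return None

In practice you would add more signals (aspect ratio, position in the DOM, filtering out banner ads), which is exactly the kind of tuning a large scraped corpus helps with.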

Pictures sometimes don't load on my website

I've made a website that displays images hosted on other sites using <img src="http://......"> tags; however, sometimes some of the images won't load. This appears somewhat random, and I don't think it is a problem with the links themselves.
I display a lot of images, so I am wondering whether this is a common problem when loading many thumbnails from another site. Is the best solution to host all the thumbnails on my own server, and if so, is there an efficient way to do this (so I don't have to manually download and link to every image)?
Thanks
It is much better to host them on your own server.
If they all come from other servers, the browser must connect to every one of those servers and download the images.
That leads to worse response times and increases how long the page takes to load.
As for downloading the images and fixing the links - I think it is possible; search for an advanced HTML page downloader. I used one that worked exactly the way you want, but I can't remember the name..
(also sorry for my bad English)
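If you would rather script the mirroring yourself, here is a rough sketch (the directory name and the URL list are hypothetical placeholders, not a drop-in solution):

import os
import urllib.request
from urllib.parse import urlparse

LOCAL_DIR = "thumbnails"  # served from your own site, e.g. /thumbnails/
os.makedirs(LOCAL_DIR, exist_ok=True)

def mirror(remote_url):
    # Download a remote image once and return the local path to use as the new src.
    filename = os.path.basename(urlparse(remote_url).path) or "image"
    local_path = os.path.join(LOCAL_DIR, filename)
    if not os.path.exists(local_path):  # fetch each image only once
        urllib.request.urlretrieve(remote_url, local_path)
    return "/" + local_path

# Example usage with a hypothetical list of remote image URLs:
# for url in remote_image_urls:
#     print(url, "->", mirror(url))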

How to hide content from File2HD?

There is a website called file2hd.com which can download any type of content from your website, including audio, movies, links, applications, objects and style sheets. Of course this doesn't work for high-profile websites such as Google, but is there a method I can use to cloak content on my website and prevent this?
E.g. using HTML, or an .htaccess rule?
Answers are appreciated. :)
If you hide something from that software, you also hide it from regular users, unless you have a password-protected part of your website. But even then, users with passwords will be able to fetch all loaded content - HTML is transparent. And since you didn't say what kind of content you are trying to hide, it's hard to give a more accurate answer.
One thing you can do, though it works only for certain file types, is to serve just small portions of a file at a time. For example, if you have a video on your page, you fetch 5-second chunks of it from the server every 5 seconds. That way, in order for someone to download the whole thing, they'd have to get all the chunks (by watching the whole thing) and then find a way to join the parts... and it's usually just not worth it. Think of Google Maps; Google uses this or a similar technique on a few other products as well.

How should I structure the code so Facebook and Google+ fetch the right elements when someone shares a URL?

I was wondering which elements are fetched when a user shares a URL on Facebook or Google+.
For example: how can I make sure the description of the post will be the description I want shared, and the image will be the image I want shared?
Title is pretty obvious, so I skipped that.
Facebook suggests the Open Graph protocol: http://ogp.me
It works reliably and can be checked with the Facebook URL linter: http://developers.facebook.com/tools/lint/
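A typical set of tags in the page head looks something like this (the values are placeholders for your own content):

<meta property="og:title" content="Your article title" />
<meta property="og:description" content="The description you want shared" />
<meta property="og:image" content="http://example.com/path/to/thumbnail.jpg" />
<meta property="og:url" content="http://example.com/article" />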
Funny, I just wrote a blog post about this very topic this week. It seems to me that there's no reliable way of knowing how each social network will parse your web page to get the "status" version of it. Not only does each site do it differently (i.e. FB vs. LinkedIn vs. G+), but they're liable to change it at a whim.
So currently the short answer is that you can't know this for sure. You have to reverse engineer each social network's behavior and hope it doesn't change too often - that is, until the industry smartens up and agrees on some markup to convey, for example, which image from a page is considered the cardinal "share" image, and so on.

Prevent site configuration info from showing up on Google

I have a site that's running WordPress.
The main page has an embedded Flash player and an embedded iframe, and for some reason, all the configuration info from the Flash player is showing up on Google for my site, and nothing else.
How can I have the main site information show up on Google, without having that Flash player config info show up?
And can I customize what shows up at all?
If there's some way to tag the info I don't want to show up, or tag the info I do want to show up, I can probably do most of the edits myself; I just don't know where to start...
EDIT: I tried most of the suggestions below, and I didn't get anywhere...
Any other ideas?
Thanks a lot!
If you don't want Google or other crawlers to access certain parts of your website, you should use a robots.txt file. In it you specify which parts are accessible and which aren't; when crawlers reach your site, they look for this file for instructions.
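As a minimal sketch, assuming the Flash player config and the iframe content live under paths of their own (the paths below are made up):

# robots.txt - block crawlers from the parts you don't want indexed
User-agent: *
Disallow: /player-config/
Disallow: /iframe-content/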
To influence what text is used in the Google search result, try putting this within your head tags:
<meta name="description" content="WHATEVER YOU WANT DISPLAYED ON GOOGLE">
Source: http://static.googleusercontent.com/external_content/untrusted_dlcp/www.google.com/en/us/webmasters/docs/search-engine-optimization-starter-guide.pdf
Some more information from Google on controlling which parts of a page are indexed - apparently there are googleoff/googleon tags.
http://perishablepress.com/press/2009/08/23/tell-google-to-not-index-certain-parts-of-your-page/
Hope this helps.
If you want Google to index only part of your pages, you can't rely on normal SEO routines. You would need a mechanism to determine whether the current client (requester) is a robot, and if so, not render that part. This is the only way; otherwise a robot either gets the whole rendered content, or has no access at all based on the robots.txt file (Robots Exclusion Protocol).
Another way (which is not really smart, and can't be guaranteed to work) is to dynamically inject your content into the page via JavaScript, because AFAIK robots don't run JavaScript.
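A minimal sketch of that idea (the element id and the injected markup are placeholders):

<div id="player-area"></div>
<script>
// Inserted after load, so crawlers that don't execute JavaScript never see it.
window.addEventListener("load", function () {
  document.getElementById("player-area").innerHTML =
    "<object data='player.swf' type='application/x-shockwave-flash'></object>";
});
</script>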
As search spiders won't render JavaScript-generated markup (JS is not run, since it is client-side in the browser), a quick fix would be to not output any of the Flash markup in the initial HTML document, and then use JS to add the Flash stuff on load.
Note: as far as I'm aware, Google is currently testing a JS reading spider so this may not work long term.
Google is returning this data because it simply can't find any content where it normally would. Search engines require content - they're not advanced enough to process your multimedia to determine what it's all about.
Google will IGNORE your meta description if it doesn't feel that it reflects your page content (which here is only iframes and JS).
Use SWFObject to provide alternate content for users without Flash (including search engines) - make sure it's not some dinky text like "download flash here", but a lengthy, descriptive piece of content about your site or media, close to what visitors would experience if they could run the Flash (a sketch follows at the end of this answer).
Use robots.txt or <meta name="robots" content="noindex,follow"> for the iframe content to prevent it from being indexed.
For the love of all things holy, please look at reducing the number of JS files and inline JS on your site (I'd recommend WP-Minify, since it's obvious that you love plugins).
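As promised above, a hedged sketch of the SWFObject dynamic-publishing pattern (the file names, element id, and dimensions are placeholders):

<div id="player">
  <p>Several descriptive paragraphs about the media on this page - this is
  what crawlers and Flash-less visitors will see instead of the movie.</p>
</div>
<script src="swfobject.js"></script>
<script>
// Replaces the contents of #player with the movie only when Flash 9+ is available.
swfobject.embedSWF("player.swf", "player", "640", "360", "9.0.0");
</script>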