I've created an HTML page where I use the body's onLoad callback to fetch content from a Servlet via an Ajax call and insert it into a div (the content contains info about books; each book is rendered as a table containing title, tags, author, etc.).
Now I wonder: when I submit this page to a search engine, will the bot be able to crawl this Ajax content?
Any help/suggestions appreciated!
No. Search engines, in general, do not crawl Ajax content. The only exception is Google's crawlable Ajax proposal, which you apparently did not implement, and its use is discouraged anyway. So your website is definitely not search-engine friendly.
What you should have done is built the site to work without JavaScript and then used progressive enhancement to make it work better with JavaScript enabled.
Related
I've just noticed that the long, convoluted Facebook URLs that we're used to now look like this:
http://www.facebook.com/example.profile#!/pages/Another-Page/123456789012345
As far as I can recall, earlier this year it was just a normal URL-fragment-like string (starting with #), without the exclamation mark. But now it's a shebang or hashbang (#!), which I've previously only seen in shell scripts and Perl scripts.
The new Twitter URLs now also feature the #! symbols. A Twitter profile URL, for example, now looks like this:
http://twitter.com/#!/BoltClock
Does #! now play some special role in URLs, for example for a certain Ajax framework, since the new Facebook and Twitter interfaces are now largely Ajaxified?
Would using this in my URLs benefit my Web application in any way?
This technique is now deprecated.
This used to tell Google how to index the page.
https://developers.google.com/webmasters/ajax-crawling/
This technique has mostly been supplanted by the ability to use the JavaScript History API that was introduced alongside HTML5. For a URL like www.example.com/ajax.html#!key=value, Google will check the URL www.example.com/ajax.html?_escaped_fragment_=key=value to fetch a non-AJAX version of the contents.
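As a rough illustration of the mapping described above (the function name and logic here are mine, not part of any API, and the real scheme also percent-escapes some characters in the fragment, which this sketch skips):

// Build the "_escaped_fragment_" URL that Google's crawler would request
// in place of a hash-bang URL. Purely illustrative.
function escapedFragmentUrl(url) {
  const parts = url.split('#!');
  if (parts.length < 2) return url;                    // no hash-bang, nothing to map
  const base = parts[0];
  const fragment = parts.slice(1).join('#!');
  const separator = base.indexOf('?') === -1 ? '?' : '&';
  return base + separator + '_escaped_fragment_=' + fragment;
}

console.log(escapedFragmentUrl('http://www.example.com/ajax.html#!key=value'));
// -> http://www.example.com/ajax.html?_escaped_fragment_=key=value

The server was then expected to answer that second URL with a plain HTML snapshot of the state named by the fragment.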
The octothorpe/number-sign/hashmark has a special significance in a URL: it normally identifies the name of a section of a document. The precise term is that the text following the hash is the anchor portion of a URL. If you use Wikipedia, you will see that most pages have a table of contents, and you can jump to sections within the document with an anchor, such as:
https://en.wikipedia.org/wiki/Alan_Turing#Early_computers_and_the_Turing_test
https://en.wikipedia.org/wiki/Alan_Turing identifies the page and Early_computers_and_the_Turing_test is the anchor. The reason that Facebook and other Javascript-driven applications (like my own Wood & Stones) use anchors is that they want to make pages bookmarkable (as suggested by a comment on that answer) or support the back button without reloading the entire page from the server.
In order to support bookmarking and the back button, you need to change the URL. However, if you change the page portion (with something like window.location = 'http://raganwald.com';), whether to a different URL or to the same URL without an anchor, the browser will load the entire page from that URL. Try this in Firebug or Safari's Javascript console. Load http://minimal-github.gilesb.com/raganwald. Now in the Javascript console, type:
window.location = 'http://minimal-github.gilesb.com/raganwald';
You will see the page refresh from the server. Now type:
window.location = 'http://minimal-github.gilesb.com/raganwald#try_this';
Aha! No page refresh! Type:
window.location = 'http://minimal-github.gilesb.com/raganwald#and_this';
Still no refresh. Use the back button to see that these URLs are in the browser history. The browser notices that we are on the same page but just changing the anchor, so it doesn't reload. Thanks to this behaviour, we can have a single Javascript application that appears to the browser to be on one 'page' but to have many bookmarkable sections that respect the back button. The application must change the anchor when a user enters different 'states', and likewise if a user uses the back button or a bookmark or a link to load the application with an anchor included, the application must restore the appropriate state.
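Here is a minimal sketch of that pattern, assuming a hypothetical restoreState hook that renders whatever 'state' the anchor names:

// Record the application 'state' in the anchor when the user enters it.
function enterState(name) {
  window.location.hash = name;            // changes the URL without a page reload
}

// Hypothetical application hook: render the state named by the anchor.
function restoreState(anchor) {
  console.log('Restoring state for', anchor || 'the default view');
}

// Restore state whenever the anchor changes (back button, bookmark, link).
window.addEventListener('hashchange', function () {
  restoreState(window.location.hash.slice(1));
});

// Also restore state on the initial load, in case the URL already has an anchor.
restoreState(window.location.hash.slice(1));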
So there you have it: Anchors provide Javascript programmers with a mechanism for making bookmarkable, indexable, and back-button-friendly applications. This technique has a name: It is a Single Page Interface.
p.s. There is a fourth benefit to this technique: Loading page content through AJAX and then injecting it into the current DOM can be much faster than loading a new page. In addition to the speed increase, further tricks like loading certain portions in the background can be performed under the programmer's control.
p.p.s. Given all of that, the 'bang' or exclamation mark is a further hint to Google's web crawler that the exact same page can be loaded from the server at a slightly different URL. See Ajax Crawling. Another technique is to make each link point to a server-accessible URL and then use unobtrusive Javascript to change it into an SPI with an anchor.
Here's the key link again: The Single Page Interface Manifesto
First of all: I'm the author of The Single Page Interface Manifesto cited by raganwald.
As raganwald has explained very well, the most important aspect of the Single Page Interface (SPI) approach used in Facebook and Twitter is the use of the hash (#) in URLs.
The ! character is added only for Google's benefit; this notation is a Google "standard" for crawling AJAX-intensive web sites (in the extreme, Single Page Interface web sites). When Google's crawler finds a URL with #!, it knows that an alternative conventional URL exists that provides the same page "state", but in this case rendered at load time.
Although the #! combination is very interesting for SEO, it is only supported by Google (as far as I know). With some JavaScript tricks, however, you can build SPI web sites that are SEO-compatible with any web crawler (Yahoo, Bing...).
The SPI Manifesto and demos do not use Google's ! notation in hashes; this notation could easily be added, and SPI crawling could be even easier (UPDATE: the ! notation is now used and remains compatible with other search engines).
Take a look at this tutorial: it is an example of a simple ItsNat SPI site, but you can pick up some ideas for other frameworks. This example is SEO-compatible with any web crawler.
The hard problem is generating any (or selected) "AJAX page state" as plain HTML for SEO. In ItsNat this is very easy and automatic: the same site is, at the same time, SPI-based and page-based for SEO (or for accessibility when JavaScript is disabled). With other web frameworks you can always follow the double-site approach: one site is SPI-based and another is page-based for SEO; Twitter, for instance, uses this "double site" technique.
I would be very careful if you are considering adopting this hashbang convention.
Once you hashbang, you can’t go back. This is probably the stickiest issue. Ben’s post put forward the point that when pushState is more widely adopted then we can leave hashbangs behind and return to traditional URLs. Well, fact is, you can’t. Earlier I stated that URLs are forever, they get indexed and archived and generally kept around. To add to that, cool URLs don’t change. We don’t want to disconnect ourselves from all the valuable links to our content. If you’ve implemented hashbang URLs at any point then want to change them without breaking links the only way you can do it is by running some JavaScript on the root document of your domain. Forever. It’s in no way temporary, you are stuck with it.
You really want to use pushState instead of hashbangs, because making your URLs ugly and possibly broken -- forever -- is a colossal and permanent downside to hashbangs.
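A minimal sketch of the pushState alternative, assuming the server can also serve each linked URL as a full page on its own (the data-spa attribute and the #content container are placeholders of mine):

// Intercept clicks on opted-in links, load the content with Ajax, and put the
// real, clean URL in the address bar with pushState instead of a hashbang.
document.addEventListener('click', function (event) {
  const link = event.target.closest('a[data-spa]');   // hypothetical opt-in attribute
  if (!link) return;
  event.preventDefault();
  loadInto('#content', link.href);
  history.pushState({}, '', link.href);
});

// Handle the back/forward buttons by re-rendering the URL from the history.
window.addEventListener('popstate', function () {
  loadInto('#content', location.href);
});

// Assumes the server returns the content for that URL (a fragment here, for simplicity).
function loadInto(selector, url) {
  fetch(url)
    .then(function (response) { return response.text(); })
    .then(function (html) { document.querySelector(selector).innerHTML = html; });
}

Because every link still points at a URL the server can render on its own, nothing breaks for crawlers or for users without JavaScript.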
As a good follow-up to all this, Twitter - one of the pioneers of hashbang URLs and the single-page interface - admitted that the hashbang system was slow in the long run and that they have actually started reversing the decision and returning to old-school links.
Article about this is here.
I always assumed the ! just indicated that the hash fragment that followed corresponded to a URL, with ! taking the place of the site root or domain. It could be anything, in theory, but it seems the Google AJAX Crawling API likes it this way.
The hash, of course, just indicates that no real page reload is occurring, so yes, it’s for AJAX purposes. Edit: Raganwald does a lovely job explaining this in more detail.
On our site we load identical content via Ajax calls (when users click the menu, to avoid reloading the entire page and to improve the user experience).
This works well, but this Ajax-loaded content is actually a copy of the original content.
Can I prevent Google from indexing this content?
http://dinox-h.hu/en/gallery.php
In the left menu you can see the links:
For example:
http://dinox-h.hu/puffer_tartalyok_galeria.php?ajax=1
Try adding the following on your Ajax-delivered pages:
<meta name="robots" content="noindex,nofollow" />
This will tell site crawlers to not crawl the page. You could also add the pages in robots.txt, like this:
User-agent: *
Disallow: /*?ajax=1
That would block any URL with ?ajax=1 from being indexed (provided a robot honours your robots.txt). A better solution would also involve creating a sitemap and telling the various search engines about it.
Edit
A better way of delivering Ajax content IMO would be to send the following header when requesting your pages via Ajax:
X-Requested-With: XMLHttpRequest
jQuery will do this by default, so provided you check for it on the server side, you can deliver your usual content without the template, for example. You can then very easily deliver different content from the same URL depending on the type of request. This should also solve your crawling issue, as I doubt a crawler would stumble across it.
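As a sketch of that idea (the site in the question is PHP, so this Node/Express handler and its placeholder responses are only illustrative):

const express = require('express');
const app = express();

app.get('/gallery', function (req, res) {
  // jQuery sets this header on its Ajax requests by default.
  const isAjax = req.get('X-Requested-With') === 'XMLHttpRequest';
  if (isAjax) {
    // Return only the content fragment, without the page template, and tell
    // crawlers not to index this variant in case they ever reach it.
    res.set('X-Robots-Tag', 'noindex, nofollow');
    res.send('<div id="gallery">...gallery items only...</div>');
  } else {
    // Normal navigation or a crawler: return the full page.
    res.send('<!DOCTYPE html><html><body><div id="gallery">...gallery items...</div></body></html>');
  }
});

app.listen(3000);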
How do you validate (for example using http://validator.w3.org/) a multipage jQuery Mobile site? For example, if I navigate away from index.html, the page is only a div without a header or body.
"It depends".
Validation only makes sense in the context of HTML documents, and if you are modifying a document with JavaScript you only have the initial state to validate.
You could use a tool such as Selenium to drive the site and take snapshots of the DOM (serialising it to HTML) when it is in different states, then validate those snapshots. (The markup validation service has an API you can call programmatically, so you could combine those.)
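For example, a rough sketch with selenium-webdriver in Node (the URL, the link selector, and the jQuery Mobile ui-page-active class are assumptions on my part):

const fs = require('fs');
const { Builder, By, until } = require('selenium-webdriver');

(async function snapshot() {
  const driver = await new Builder().forBrowser('chrome').build();
  try {
    await driver.get('http://example.com/index.html');                 // placeholder URL
    // Drive the site into a different state, the way a user would.
    await driver.findElement(By.css('a[href="#page2"]')).click();      // hypothetical link
    await driver.wait(until.elementLocated(By.css('#page2.ui-page-active')), 5000);
    // Serialise the current DOM and save it; this file is what you validate.
    const html = await driver.executeScript('return document.documentElement.outerHTML;');
    fs.writeFileSync('snapshot-page2.html', '<!DOCTYPE html>\n' + html);
  } finally {
    await driver.quit();
  }
})();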
If you are generating fragments of HTML on the server (instead of sending pure, structured data to the client), then you can embed those fragments in skeleton HTML documents and validate those. You should have such documents for most views anyway (since you don't want to repeat Gawker's mistake of having a fragile site completely dependent on Ajax).
See also Progressive Enhancement and Unobtrusive JavaScript.
OK guys, what I can't seem to grasp is how, on a one-page website, you link to certain pages/divs while using the scrollTo function.
If you look at Ultranoir.com,
you can see the site is built in the one-page format, but if you watch the URL field, it navigates to subfolders etc. while still loading all content dynamically. How do they achieve this effect while keeping it so clean and ordered? On my current site everything stays at www.url.com/index.html even when I navigate between pages. Any help? Thanks!
They are using hash fragments to load different parts of their pages dynamically, e.g. if you add index.html#!/blog or index.html#!/about,
you can parse the URL client-side using JavaScript and load the correct content through Ajax based on it.
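A small sketch of that idea (the fragment paths and the #content element are made up):

// Read the part after '#!' and load the matching fragment with Ajax.
function loadFromHash() {
  const path = window.location.hash.replace(/^#!\/?/, '') || 'home';
  // e.g. index.html#!/blog -> fetch 'fragments/blog.html' (hypothetical layout)
  fetch('fragments/' + path + '.html')
    .then(function (response) { return response.text(); })
    .then(function (html) {
      document.getElementById('content').innerHTML = html;
    });
}

window.addEventListener('hashchange', loadFromHash);
window.addEventListener('load', loadFromHash);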
Check out this page to see an example implementation of this functionality using PHP and jQuery: http://www.queness.com/post/328/a-simple-ajax-driven-website-with-jqueryphp
They do it by abusing the fragment identifier. A modern approach would make use of pushState.
Is there a widely used, standard way to index Ajax-loaded content (for search engines)?
For example, indexing HTML content that would dynamically be inserted into a page.
Thanks
You may want to consider using some sort of sitemap generator that aggregates all the content you normally load through AJAX.
Sitemaps are particularly beneficial on websites where:
Some areas of the website are not available through the browsable interface, or
Webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines.
From Wikipedia - Sitemaps
Remember that:
Because most web crawlers do not execute JavaScript code, publicly indexable web applications should provide an alternative means of accessing the content that would normally be retrieved with Ajax, to allow search engines to index it.
From Wikipedia - AJAX Drawbacks
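As a sketch of the sitemap suggestion above (the URLs and file name are placeholders of mine), a tiny Node script that lists the server-rendered versions of content you normally pull in with Ajax and writes them into a sitemap.xml that crawlers can read even though they never run your JavaScript:

const fs = require('fs');

// Server-accessible URLs for content that is normally Ajax-loaded.
const ajaxLoadedPages = [
  'http://www.example.com/articles/1.html',
  'http://www.example.com/articles/2.html',
];

const sitemap =
  '<?xml version="1.0" encoding="UTF-8"?>\n' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
  ajaxLoadedPages.map(function (url) {
    return '  <url><loc>' + url + '</loc></url>';
  }).join('\n') +
  '\n</urlset>\n';

fs.writeFileSync('sitemap.xml', sitemap);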
In addition you may be interested in checking out the following articles:
Official Google Webmaster Central Blog - A proposal for making AJAX crawlable
SoftwareDeveloper.com - How to: Get Google and AJAX to Play Nice
Crawling Ajax-driven Web 2.0 Applications
One way of doing this is using JS fallbacks for dialog boxes like Thickbox: clicking a link opens a dialog box that loads the Ajax content, while the fallback href='...' points to a search-engine-readable representation of that content (i.e. the HTML snippet that the Ajax function would load, but surrounded by the necessary HTML body basics).
Example (I pulled rel='box' out of my arse, this is supposed to be the anchor for the box plugin, like rel=thickbox):
<a href='/encyclopedia/definition/mushroom.html' rel='box'>Definition of Mushroom</a>
Clicking on the link in an Ajax/JS-enabled browser will open a nice dialog box with the article.
Clicking on the link without JS (or as a search engine) will lead to a new page containing the article (which needs some server-side intelligence to detect which channel the request came from).
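A sketch of that fallback pattern (the rel='box' hook comes from the example above; the dialog itself is only stubbed out, and openDialog is a name I made up):

// With JavaScript: intercept the click and open the snippet in a dialog.
document.addEventListener('click', function (event) {
  const link = event.target.closest('a[rel="box"]');
  if (!link) return;
  event.preventDefault();
  fetch(link.href, { headers: { 'X-Requested-With': 'XMLHttpRequest' } })
    .then(function (response) { return response.text(); })
    .then(function (snippet) { openDialog(snippet); });
});

// Stand-in for Thickbox or any other dialog plugin.
function openDialog(html) {
  const box = document.createElement('div');
  box.className = 'dialog';
  box.innerHTML = html;
  document.body.appendChild(box);
}

// Without JavaScript (or for a crawler) the click simply follows the href,
// which points at a full, indexable HTML page with the same content.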
That's all that comes to my mind in this direction. Otherwise, Ajax and search engines remain a largely uncharted field.
Have Javascript fallbacks. Have a look at Amazon Diamond Search with and without Javascript enabled. Read up on http://www.seroundtable.com/archives/006889.html
I don't really know the answer, but it seems to me that Ajax-loaded content won't help to improve search engine positions, because a search engine can't refer to Ajax-loaded content. In other words, a search engine can't say: "Hey, go here and then click the 3rd button from the top to see the content you're interested in."
I think a good idea is to put this content into XML and link to that XML with a <link> tag (like a URL to an RSS feed)...
What about serving alternative content to JS-disabled clients (search engines)? I think there is no other way of letting search engines index your AJAX site properly.
I think only Google actually implements a specification for indexing AJAX content.
It's the Google AJAX crawling specification.
We have used that for our website; there is an example on our technical blog of how to do that with Django in a clean way.