Prevent Google from indexing Ajax-loaded content - html

On our site we load identical content via Ajax calls (when users click on the menu, to avoid reloading the entire page and so improve the user experience).
This works well, but the Ajax-loaded content is a copy of the original content.
Can I prevent Google from indexing this duplicate content?
http://dinox-h.hu/en/gallery.php
In the left menu you can see the links, for example:
http://dinox-h.hu/puffer_tartalyok_galeria.php?ajax=1

Try adding the following on your Ajax-delivered pages:
<meta name="robots" content="noindex,nofollow" />
This tells crawlers not to index the page. You could also block the pages in robots.txt, like this:
User-agent: *
Disallow: /*?ajax=1
That would block any URL containing ?ajax=1 from being crawled (provided a robot honours your robots.txt). A better solution would also involve creating a sitemap and telling the various search engines about it.
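For instance, a minimal sitemap.xml in the document root might list only the canonical, non-Ajax URLs (the entries below are just illustrations based on the links above):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- list the canonical pages only, not their ?ajax=1 duplicates -->
  <url><loc>http://dinox-h.hu/en/gallery.php</loc></url>
  <url><loc>http://dinox-h.hu/puffer_tartalyok_galeria.php</loc></url>
</urlset>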
Edit
A better way of delivering Ajax content IMO would be to send the following header when requesting your pages via Ajax:
X-Requested-With: XMLHttpRequest
jQuery will do this by default, so provided you can check for it on the server side, you could deliver your usual content e.g. without the template. You could then very easily deliver different content from the same URL depending on what the type of request is. This should also solve your crawling issue as I doubt a crawler would stumble across it.
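A minimal PHP sketch of that server-side check (the two render functions are hypothetical placeholders):
<?php
// jQuery sends this header by default on Ajax requests.
// It can be spoofed, so treat it as a hint, not as security.
$isAjax = isset($_SERVER['HTTP_X_REQUESTED_WITH'])
    && strtolower($_SERVER['HTTP_X_REQUESTED_WITH']) === 'xmlhttprequest';

if ($isAjax) {
    render_fragment();   // hypothetical: just the content, no template
} else {
    render_full_page();  // hypothetical: the same content inside the template
}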

Related

Redirecting unless client has come from PayPal

I'm trying to have the page check where a client came from, so that it can only be accessed through a specific link; say this link comes from PayPal after a purchase. If a client doesn't come through PayPal, they should be redirected to the home page of my website, in this case home.com (not really).
My Code:
if (!isset($_SERVER['HTTP_REFERER'])) {
    <meta http-equiv="Refresh" content="0; url='https://bypassdetected!'" />
    header('location:../index.php');
    exit;
}
You would need to check whether the contents of HTTP_REFERER include 'paypal.com', although this is a dumb sort of check since it's easily spoofed and accomplishes little of value.
Regarding the action your code then takes: you can't combine HTTP header location redirects with HTML meta redirects, it's one or the other, and if you send a Location header it has to be set before any body content is output.
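A minimal sketch of such a check in PHP (again: the referer is trivially spoofed, so this is a convenience, not a security measure):
<?php
// Must run before any HTML output, or header() will fail.
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$host = parse_url($referer, PHP_URL_HOST);

// Easily spoofed: does the referring host look like paypal.com?
if ($host !== 'paypal.com' && substr((string) $host, -11) !== '.paypal.com') {
    header('Location: ../index.php'); // one redirect method only, no meta tag
    exit;
}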
Redirecting over to PayPal should be avoided in general. You should switch to a PayPal integration that does not use any redirects at all, such as this one: https://developer.paypal.com/demo/checkout/#/pattern/client -- then, your site always stays loaded in the background, which is a far better modern web experience

Serve different resource depending on full URL of requesting page

Let's say that we have two pages:
https://www.example.com/first/firstpage.html
https://www.example.com/second/secondpage.html
that both load the resource https://www.example.com/resource.js
If I want the server that serves resource.js to be able to serve a different version of resource.js depending on which page the request is coming from, is there a reliable header upon which the full URL of the requesting page can be determined (or maybe there is some other way to determine this)?
I know that there is an Origin header, but from my understanding this just represents the domain (and any subdomains) without the full URL and query string. Is there any way for the server to know the full URL and query string that the request for the resource is coming from?
If this isn't possible, I know it would be easy to include that info in the JS script tag as follows:
<script src="/resource.js?origin=/first/firstpage.html"></script>
But I don't want to have to modify the script tag for each page. Is there some other way to have the page automatically include its own URL in the query string of the resource request (without having to dynamically load the resource using my own JS script, HTML only please!), or just any unique identifier, so that the script tag doesn't have to be modified individually on each page?
There's the Referer header that you can use.
Make sure that your response uses Vary: Referer. Otherwise, browsers are going to cache this resource as if the referring page URL didn't matter.
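A minimal PHP sketch of serving the resource that way (the variant file names are made up):
<?php
// Pick a script body based on the referring page, and tell caches
// that the response varies by Referer.
header('Content-Type: application/javascript');
header('Vary: Referer');

$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
$path = parse_url($referer, PHP_URL_PATH);

if ($path === '/first/firstpage.html') {
    readfile('resource-first.js');   // hypothetical variant
} else {
    readfile('resource-second.js');  // hypothetical default
}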
I'd plead with you not to do this at all, though. You're going to create a rabbit hole of problems, as not all browsers or proxy servers are well behaved. Some are going to aggressively cache this anyway, no matter what you do with the Vary header.

How to keep the search engines away from some pages on my domain

I've built an admin control panel for my website. I don't want the control panel app to end up in a search engine, since there's really no need for it. I did some research and I've found that by using the following tag, I can probably achieve my goal:
<meta name="robots" content="noindex,nofollow">
Is this true? Are there other, more reliable methods? I'm asking because I'm scared I could mess things up by using the wrong method, and I do want search engines to index my site, just not the control panel...
Thanks
This is true, but on top of doing that, for even more safety, you can send the same instruction as an HTTP header. In an .htaccess file scoped to the control panel (for example one placed in its directory; the /admin/ path below is just an illustration), set:
Header set X-Robots-Tag "noindex, nofollow"
And you should create a new file in the root of your domain, named robots.txt, with this content:
User-agent: *
Disallow: /admin/
Note that Disallow: / would keep crawlers away from your entire site, which is not what you want here. With the directives scoped to the control panel, you can be reasonably sure that they won't index it ;)
Google will honor the meta tag by completely dropping the page from their index (source) however other crawlers might just simply decide to ignore it.
In that particular sense the meta tag is more reliable with Google: with robots.txt alone, any external page that explicitly links to your admin page (for whatever reason) can still make the URL appear in Google's index, though without any content, which will probably result in some SERP leeching.

Prevent Search Engines from Indexing a Web Page

I have a members control panel page which I don't want search engine to index.
I did the following:
The page is secured: if there is no session or password provided, the user is redirected to the main page, as follows:
header("Location: HOME PAGE");
exit();
I put only one meta tag, with the following attributes:
<meta name="robots" content="noindex,nofollow">
Is this solution good enough?
If the page is secured, its content cannot be indexed. Even if it were, it would be ranked low.
How do you redirect the user? Using a response header, or an HTML meta redirect / JavaScript?
You can use robots.txt; for more info see http://www.google.com/support/webmasters/bin/answer.py?answer=156449
The meta data is irrelevant, if the page redirects any unidentified user to a different page, then the redirecting URL will never be indexed.
Most search engine robots (all legitimate ones) will respect this instruction. However, all you have really done is ask the robot nicely not to index your page. It does not force any sort of behaviour, merely requests it.
You could...
Require auth to view the page. The robots will not be authenticated, and therefore cannot view the page to index it.
Return a 404 error to any request where the User-Agent: string is in a list of known search engine robots. There are plenty of sites out there (such as this one) that will easily allow you to compile such a list.
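A rough PHP sketch of that second option (the bot list here is tiny and purely illustrative; real lists are much longer):
<?php
// Illustrative only: match the User-Agent against a few known crawlers.
$bots = array('Googlebot', 'Bingbot', 'Slurp', 'DuckDuckBot');
$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';

foreach ($bots as $bot) {
    if (stripos($ua, $bot) !== false) {
        http_response_code(404); // pretend the page doesn't exist
        exit;
    }
}
// ...normal control panel code continues here...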

Indexing Ajax-loaded content

Is there a widely used, standard way to get Ajax-loaded content indexed (by search engines)?
For example, indexing HTML content that would dynamically be inserted into a page.
Thanks
You may want to consider using some sort of sitemap generator that aggregates all the content you normally load through AJAX.
Sitemaps are particularly beneficial on websites where: some areas of the website are not available through the browsable interface, or webmasters use rich Ajax, Silverlight, or Flash content that is not normally processed by search engines.
From Wikipedia - Sitemaps
Remember that:
Because most web crawlers do not execute JavaScript code, publicly indexable web applications should provide an alternative means of accessing the content that would normally be retrieved with Ajax, to allow search engines to index it.
From Wikipedia - AJAX Drawbacks
In addition you may be interested in checking out the following articles:
Official Google Webmaster Central Blog - A proposal for making AJAX crawlable
SoftwareDeveloper.com - How to: Get Google and AJAX to Play Nice
Crawling Ajax-driven Web 2.0 Applications
One way of doing this is using JS fallbacks for dialog boxes like Thickbox: a link would point to the dialog box loading Ajax content, while the fallback href='...' points to a search-engine-readable representation of that content (i.e. the HTML snippet that the Ajax function would load, but surrounded by the necessary HTML body basics).
Example (I pulled rel='box' out of my arse; it's supposed to be the anchor for the box plugin, like rel='thickbox'):
<a href='/encyclopedia/definition/mushroom.html' rel='box'>Definition of Mushroom</a>
Clicking on the link in an Ajax/JS-enabled browser will open a nice dialog box with the article.
Clicking on the link without JS (or as a search engine) will lead to a new page containing the article (which needs some server-side intelligence to detect which channel the request came from).
That's all that comes to my mind in this direction. Ajax and search engines is a widely uncharted field otherwise.
Have JavaScript fallbacks. Have a look at Amazon's Diamond Search with and without JavaScript enabled. Read up on http://www.seroundtable.com/archives/006889.html
I don't really know the answer, but it seems to me that Ajax-loaded content won't help to improve search engine positions, because a search engine can't refer to Ajax-loaded content. In other words, a search engine can't say: "Hey, go here and then click the 3rd button from the top to see the content you're interested in."
I think a good idea is to put this content into XML and link to that XML with a <link> tag in the page head (like the URL to an RSS feed)...
What about using alternative content for JS-disabled clients (search engines)? I think there is no other way to let search engines index your Ajax site properly.
I think actually only Google really implements a specification for indexing AJAX content: the Google AJAX crawling specification.
We have used it for our website; there is an example on our technical blog of how to do that with Django in a clean way.
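Under that (since deprecated) specification, a URL like example.com/page#!gallery is fetched by the crawler as example.com/page?_escaped_fragment_=gallery, and the server is expected to answer with a plain HTML snapshot. A minimal PHP sketch (the two render functions are hypothetical):
<?php
// Google rewrites "#!state" to "?_escaped_fragment_=state" when crawling.
if (isset($_GET['_escaped_fragment_'])) {
    $state = $_GET['_escaped_fragment_'];
    render_html_snapshot($state); // hypothetical: pre-rendered, crawler-friendly HTML
} else {
    render_ajax_app();            // hypothetical: the normal JS-driven page
}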