How to tell Jekyll to hide one page from search engines?

I have a website consisting of my public profile, made in Jekyll.
It also contains one page, say 'details.html', which contains more personal information about me. I want only those people to see this page whom I give out the link to. In particular, I'd like to hide it from search engines.
How do I best do this? I've heard I can add a robots.txt file or include a meta tag 'nofollow' or 'noindex'.
Which is the usual solution here?
If the way to go is to add a meta tag, how do I add it in only one page given a standard Jekyll setup?

The robots.txt file is the standard way of telling search engine crawlers which paths to visit and which to skip (not just for Jekyll, but for websites in general).
Just create a file called robots.txt in the root of your Jekyll site, listing the paths that should not be indexed.
For example:
User-agent: *
Disallow: /2017/02/11/post-that-should-not-be-indexed/
Disallow: /page-that-should-not-be-indexed/
Allow: /
Jekyll will automagically copy the robots.txt to the folder where the site gets generated.
You can also test your robots.txt to make sure it is working the way you expect: https://support.google.com/webmasters/answer/6062598?hl=en
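The rules can also be checked locally before deploying. The following is a minimal sketch using Python's standard urllib.robotparser module; the host example.com and the helper name is_allowed are placeholders for illustration, not part of the original answer.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /2017/02/11/post-that-should-not-be-indexed/
Disallow: /page-that-should-not-be-indexed/
Allow: /
"""


def is_allowed(url, robots_txt=ROBOTS_TXT, agent="*"):
    """Return True if `agent` may crawl `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# example.com is a placeholder host.
print(is_allowed("https://example.com/page-that-should-not-be-indexed/"))  # False
print(is_allowed("https://example.com/about/"))  # True
```

This only reports how Python's parser reads the rules; individual search engines may interpret edge cases differently.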
Update 2021-08-02 - Google-specific settings:
You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.
There are two ways to implement noindex: as a meta tag and as an HTTP response header. They have the same effect; choose the method that is more convenient for your site.
<meta> tag
To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex">
To prevent only Google web crawlers from indexing a page:
<meta name="googlebot" content="noindex">
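To confirm that the tag really ended up in one generated page and not in the others, the built HTML can be inspected. This is a small sketch using Python's standard html.parser; the class and function names are made up for illustration, and it assumes you pass in the HTML of the built page (for example the details.html that Jekyll writes to its output folder).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            # "noindex, nofollow" -> {"noindex", "nofollow"}
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html):
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


print(has_noindex('<head><meta name="robots" content="noindex"></head>'))  # True
print(has_noindex("<head><title>Public profile</title></head>"))  # False
```

The sketch only looks at name="robots"; a crawler-specific tag such as name="googlebot" would need an extra check.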
HTTP response header
Instead of a meta tag, you can also return an X-Robots-Tag header with a value of either noindex or none in your response. Here's an example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
(...)
X-Robots-Tag: noindex
(...)
More details: https://developers.google.com/search/docs/advanced/crawling/block-indexing
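The header variant only helps if you control the server's responses. As a rough sketch of the idea, here is a tiny handler built on Python's standard http.server that adds the header for selected paths. The path /details.html comes from the question; NOINDEX_PATHS, extra_headers, and run are hypothetical names, and a real site would configure this in its web server instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical set of paths that should carry the noindex header.
NOINDEX_PATHS = {"/details.html"}


def extra_headers(path):
    """Return the extra response headers for a request path."""
    if path in NOINDEX_PATHS:
        return {"X-Robots-Tag": "noindex"}
    return {}


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<!doctype html><title>demo</title>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        for name, value in extra_headers(self.path).items():
            self.send_header(name, value)
        self.end_headers()
        self.wfile.write(body)


def run(port=8000):
    """Serve on localhost until interrupted (not called automatically)."""
    HTTPServer(("127.0.0.1", port), Handler).serve_forever()
```

Calling run() and requesting /details.html should then show X-Robots-Tag: noindex among the response headers, while other paths are served without it.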

Try adding this to the page's front matter:
---
layout:
sitemap: false
---
Whenever you include the line sitemap: false in a page's front matter, that page is excluded from your sitemap.
This relies on the jekyll-sitemap plugin, so check that it is set up:
Add gem 'jekyll-sitemap' to your site’s Gemfile and run bundle.
Add the following to your site’s _config.yml:
plugins:
- jekyll-sitemap
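After rebuilding the site, you can verify that the page is gone from the generated sitemap. The sketch below parses sitemap XML with Python's standard xml.etree.ElementTree; the sample document and its URLs are invented for illustration, and in practice you would read the sitemap.xml from your build output instead.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Invented sample standing in for a generated sitemap.xml.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


urls = sitemap_urls(SAMPLE)
print("https://example.com/details.html" in urls)  # False
```

Note that leaving a page out of the sitemap only stops advertising it; it does not by itself tell crawlers not to index the page.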

A robots.txt file is a great solution, but .htaccess might be better for this purpose. Also, make sure you have a private repository!
Note that hosting your code on CloudCannon (paid account) allows you to set up all these things easily from within their interface.

Related

What is the right way to link to a plain text sitemap file in HTML?

Will this code correctly instruct Google to index my sitemap (or make it aware it exists)?
<link rel="sitemap" href="./sitemap.txt" type="text/plain" title="Sitemap" />
Google states in their instructions that plain text files simply listing URLs are permitted as sitemap format, but I could not find any verified solution as to how to link to this kind of file in the HTML <head>.
I modified the solution in this answer by changing the type attribute. Is this an accepted way to link to a plain text sitemap file?
I realize I can submit the file to Google directly, eg.
https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP (Source)
But I'd like to include it in markup so other search engines (and whoever wants it) can possibly find it too.

To answer your question: HTML5 defines the values that you are allowed to use in rel, and sitemap is not one of them, so the validator does not recognise it. So the short answer is: it wouldn't work. See also the wiki page listing which values are allowed.
Basically, the best way to let other search engines know that you have a sitemap is to add the sitemap to your robots.txt file.
Therefore, create a robots.txt file in your web server's root directory, so that it is reachable at example.com/robots.txt
Then add the following to the file:
Sitemap: http://www.example.com/sitemap.txt
User-agent: *
Disallow:
The contents of the file tell search engines what pages to crawl (and what pages not to crawl) and also which search engines have permission to crawl your site. It is important that you have this file because when a search engine bot enters your site, it will look for your robots.txt before doing anything else.
To clarify the commands:
User-agent: Defines which crawlers the rules that follow apply to. The * means that the rules apply to all crawlers. However, "bad" crawlers may ignore the file and crawl your site anyway.
Disallow: With this statement you can define which directories of your website should not be crawled by the search engines, e.g. /photos/. An empty value, as in the file above, disallows nothing.
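To see how a robots.txt-aware client reads this file, here is a short sketch using Python's standard urllib.robotparser (the site_maps method needs Python 3.8 or later). The variable names are arbitrary, and the /photos/ URL is only a probe to show that the empty Disallow blocks nothing.

```python
from urllib.robotparser import RobotFileParser

# The file contents from the answer above.
ROBOTS_WITH_SITEMAP = """\
Sitemap: http://www.example.com/sitemap.txt
User-agent: *
Disallow:
"""

sitemap_parser = RobotFileParser()
sitemap_parser.parse(ROBOTS_WITH_SITEMAP.splitlines())

# Sitemap lines are independent of the User-agent groups.
print(sitemap_parser.site_maps())  # ['http://www.example.com/sitemap.txt']

# An empty Disallow value means nothing is blocked.
print(sitemap_parser.can_fetch("*", "http://www.example.com/photos/"))  # True
```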
Hope I could help!

Is the HTML <link rel="canonical"> about URL or content or both?

Does the HTML <link rel="canonical"> refer to URL or content or both?
I have a part of a website done in pure HTML + CSS + JavaScript and no server side. When a user enters the site with the root URL, the /index.htm is loaded. The root index.htm redirects to /site1/index.htm.
I would like to indicate that the canonical URL for /site1/index.htm should be /index.htm, and the canonical URL for that, in turn, should be /, so if needed at a later time, the redirect can go elsewhere. In this sense, specifying a canonical URL is intended to indicate that users should always enter the site through the specified path if possible when arriving at /site1/index.htm.
I'm wondering if specifying <link rel="canonical" href="/index.htm"> in /site1/index.htm, and <link rel="canonical" href="/"> in /index.htm would accomplish this. (I'm aware that absolute URLs are recommended, but this may not always be possible.)
The web server could be IIS, Apache, or other. I can't touch the server config or headers or htaccess.
Can this be done in HTML or possibly JavaScript? (I'm aware that JavaScript won't affect SEO, but it may have something to do with the redirect. Currently, the redirect is done using both meta refresh and JavaScript location = '', with a fallback link for the user to click. As mentioned, can't touch headers, or server config.)
Further, if <link rel="canonical"> is used in said fashion, would search engines index the content of the target in place of the specifying page? For example, would search engines assume the content of /site1/index.htm is the same as /index.htm, so that the URL /site1/index.htm would get associated with the actual contents of /index.htm?

I'm new here, so I don't know whether this is off topic, but I'll try to answer the question.
The <link rel="canonical"> is fairly straightforward. It works like this.
When a search engine spider crawls your page, the tag tells it which URL should be indexed for that particular page. It's very useful when the same content can be reached through several URLs (one example among others is the non-www and www versions of a URL).
Example: You have multiple product pages for a specific category on your website because you use pagination. In this case you will have several URLs for your paginated content: page 1, page 2, page 3, etc. Adding a <link rel="canonical"> tag pointing to the first page on all of these pages tells the search engine that it should index only the first page instead of all the paginated pages.
Basically, you're telling the spider: don't index this URL, index that other URL instead.
In your particular case, /index.htm just redirects to /site1/index.htm. The risk is that Google won't index your page, because you are telling it not to index the content on /site1/index.htm and to index /index.htm instead, but that page has no content because it only redirects.
I'm aware that you stated you have no access to the .htaccess file, but the only way I can think of without touching the folder structure on your FTP is to use .htaccess to rewrite /site1/index.htm to /index.htm, and then add the canonical tag just to be safe, because having a canonical tag is good practice.
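Since the question uses relative href values, it may help to see which absolute URL a canonical link ends up pointing to. This is an illustrative sketch with Python's standard html.parser and urllib.parse.urljoin; the class and function names and the www.example.com host are assumptions, and real crawlers may apply additional rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalParser(HTMLParser):
    """Remember the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.href is None:
            self.href = attrs.get("href")


def canonical_url(page_url, html):
    """Resolve the page's canonical link against its own URL."""
    parser = CanonicalParser()
    parser.feed(html)
    return urljoin(page_url, parser.href) if parser.href else None


print(canonical_url(
    "http://www.example.com/site1/index.htm",
    '<link rel="canonical" href="/index.htm">',
))  # http://www.example.com/index.htm
```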

Code to redirect atom feed

Is it possible to define a redirect for an atom feed? If so, how does one do this? I attempted to simply replace my old atom.xml file with a 301 redirect HTML file, like so:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta rel="canonical" http-equiv="refresh" content="0;url=http://io.carlboettiger.info/2014/atom.xml" />
<link rel="canonical" href="http://io.carlboettiger.info/2014/atom.xml"/>
</head>
</html>
and saved this as atom.xml, but (perhaps because of the misleading xml extension?) this doesn't seem to redirect either in the browser or any existing feed reader.
For the RSS format, it looks like the rssboard suggests that an html redirect or an XML-based redirect like so
<redirect>
<newLocation>
http://weblog.infoworld.com/udell/rss.xml
</newLocation>
</redirect>
should do it, but I cannot find any such advice for the atom format. So, how does one define redirects for atom files? (Note: I'm not looking for a hack using .htaccess rewrite rules, since the site is deployed via Github's gh-pages and I don't have the option of custom .htaccess rules anyhow)

Is it possible to define a redirect for an atom feed?
Yes, but only through HTTP redirection. In general, this is the best way to redirect a resource on the web.
the site is deployed via Github's gh-pages
As "301 redirect for site hosted at github?" indicates, there's no way to specify HTTP redirection for a GH-pages-hosted site.
The Atom spec assumes that you have control over the server, and doesn't define any additional redirection mechanism.

Unfortunately, I don't know of any standard (not even a de facto one) to achieve this. As far as I know, the only way to do so is at the HTTP level, which you don't control when using GitHub Pages.
Both approaches you're trying are documented, but I don't know of any reader that actually implements them. At Superfeedr, we have also seen redirects using the iTunes pattern: <itunes:new-feed-url>.
We've been able to do it using services like Cloudflare, which act as proxies and allow you to set up rules for specific pages or addresses.

What meta tags can I use on my page to stop it getting indexed?

Are there any meta tags for this? Google keeps indexing my logon and register pages. I tried putting something in robots.txt, but Google doesn't seem to have checked it for a while. I just want to be sure and add a meta tag if there is one.
Thanks,

You're looking for a robots.txt entry, not meta tags. You can then exclude explicit resource paths, e.g.:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /login.html
Disallow: /register.html
Add this text to a file called "robots.txt" and put it in the root of your site.

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
This would cause the page not to be indexed, and the crawler would not follow its links.
However, it's up to the search engine's algorithm to decide whether it uses this information as you intended. There are no guarantees for this meta tag or for robots.txt.

How to prevent search engines from indexing a single page of my website?

I don't want the search engines to index my imprint page. How could I do that?

You can also add the following meta tag to the HEAD of that page:
<meta name="robots" content="noindex,nofollow" />

You need a simple robots.txt file. Basically, it's a text file that tells search engines not to index particular pages.
You don't need to include it in the header of your page; as long as it's in the root directory of your website it will be picked up by crawlers.
Create it in the root folder of your website and put the following text in:
User-Agent: *
Disallow: /imprint-page.htm
Note that you'd replace imprint-page.htm in the example with the actual name of the page (or the directory) that you wish to keep from being indexed.
That's it! If you want to get more advanced, you can check out here, here, or here for a lot more info. Also, you can find free tools online that will generate a robots.txt file for you (for example, here).

You can set up a robots.txt file to tell search engines to ignore certain directories.
See here for more info.
Basically:
User-agent: *
Disallow: /[directory or file here]

<meta name="robots" content="noindex, nofollow">
Just include this line in the <head> of your page. I recommend this because if you use a robots.txt file to hide URLs such as login pages or other protected URLs that you don't want to show to other people or to search engines, you end up listing them publicly.
Anyone can open the robots.txt file directly on your website and see which URLs you consider secret. So what is the point of hiding them in robots.txt?
The better way is to include the meta tag above, which keeps those URLs out of a publicly readable list.
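To illustrate how little effort it takes to read those paths, here is a small sketch in plain Python that lists every Disallow entry in a robots.txt text. The function name is made up, and the sample reuses the login and register example from earlier on this page.

```python
def disallowed_paths(robots_txt):
    """Return every non-empty Disallow path found in a robots.txt text."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


SAMPLE_ROBOTS = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /login.html
Disallow: /register.html
"""

print(disallowed_paths(SAMPLE_ROBOTS))  # ['/login.html', '/register.html']
```

Anything printed here is equally visible to any visitor who requests /robots.txt.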

Nowadays, the best method is to use a robots meta tag and set it to noindex,follow:
<meta name="robots" content="noindex, follow">

Create a robots.txt file and set the controls there.
Here are the docs for Google:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html

Suppose a robot wants to visit a website URL, say http://www.example.com/welcome.html. Before it does so, it first checks http://www.example.com/robots.txt.
In that file you can explicitly disallow a page:
User-agent: *
Disallow: /~joe/junk.html
Please see the robots.txt documentation for details.
