Code to redirect atom feed - html

Is it possible to define a redirect for an atom feed? If so, how does one do this? I attempted to simply replace my old atom.xml file with a 301-redirect HTML file, like so:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<meta rel="canonical" http-equiv="refresh" content="0;url=http://io.carlboettiger.info/2014/atom.xml" />
<link rel="canonical" href="http://io.carlboettiger.info/2014/atom.xml"/>
</head>
</html>
and saved this as atom.xml, but (perhaps because of the misleading xml extension?) this doesn't seem to redirect either in the browser or in any existing feed reader.
For the RSS format, it looks like the rssboard suggests that an HTML redirect or an XML-based redirect like the following
<redirect>
  <newLocation>
    http://weblog.infoworld.com/udell/rss.xml
  </newLocation>
</redirect>
should do it, but I cannot find any such advice for the Atom format. So, how does one define redirects for Atom files? (Note: I'm not looking for a hack using .htaccess rewrite rules, since the site is deployed via GitHub's gh-pages and I don't have the option of custom .htaccess rules anyhow.)

Is it possible to define a redirect for an atom feed?
Yes, but only through HTTP redirection. In general, this is the best way to redirect a resource on the web.
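For context, HTTP redirection simply means the server answers requests for the old feed URL with a 3xx response pointing at the new location, which browsers and feed readers follow. Roughly, as an illustrative response using the new URL from the question:
HTTP/1.1 301 Moved Permanently
Location: http://io.carlboettiger.info/2014/atom.xml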
the site is deployed via Github's gh-pages
As "301 redirect for site hosted at github?" indicates, there's no way to specify HTTP redirection for a GH-pages-hosted site.
The Atom spec assumes that you have control over the server, and doesn't define any additional redirection mechanism.

Unfortunately, I don't know of any standard (even a de facto one) to achieve this. As far as I know, the only way to do so is to find ways to do it at the HTTP level, which you don't control when using GitHub Pages.
Both approaches you're trying are documented, but I don't know of any reader that actually implements them. At Superfeedr, we have also seen redirects using the iTunes pattern: <itunes:new-feed-url>.
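For reference, that element lives inside the feed itself and points subscribers at the new address. A minimal sketch for an RSS feed (the URL is just the new address from the question; the namespace declaration is required) would look like:
<rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
  <channel>
    <itunes:new-feed-url>http://io.carlboettiger.info/2014/atom.xml</itunes:new-feed-url>
  </channel>
</rss>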
We've been able to do it using services like Cloudflare, which act as proxies and let you set up rules for specific pages or addresses.

Does the HTML <link rel="canonical"> refer to URL or content or both?

I have a part of a website done in pure HTML + CSS + JavaScript and no server side. When a user enters the site with the root URL, the /index.htm is loaded. The root index.htm redirects to /site1/index.htm.
I would like to indicate that the canonical URL for /site1/index.htm should be /index.htm, and that the canonical URL for /index.htm, in turn, should be /, so that the redirect can be pointed elsewhere later if needed. In this sense, specifying a canonical URL is meant to indicate that users arriving at /site1/index.htm should, if possible, enter the site through the specified path.
I'm wondering if specifying <link rel="canonical" href="/index.htm"> in /site1/index.htm, and <link rel="canonical" href="/"> in /index.htm would accomplish this. (I'm aware that absolute URLs are recommended, but this may not always be possible.)
The web server could be IIS, Apache, or other. I can't touch the server config or headers or htaccess.
Can this be done in HTML or possibly JavaScript? (I'm aware that JavaScript won't affect SEO, but it may have something to do with the redirect. Currently, the redirect is done using both meta refresh and JavaScript location = '', with a fallback link for the user to click. As mentioned, can't touch headers, or server config.)
Further, if <link rel="canonical"> is used in said fashion, would search engines index the content of the target in place of the specifying page? For example, would search engines assume the content of /site1/index.htm is the same as /index.htm, so that the URL /site1/index.htm would get associated with the actual contents of /index.htm?
I'm new here, so I don't know whether this is off-topic or not, but I'll try to answer the question.
The <link rel="canonical"> tag is fairly straightforward. It works like this.
When a search engine spider crawls your page, the tag tells it which URL should be indexed for that particular page. It's very useful when the same content can be reached through several different URLs (one example among others: non-www and www URLs).
Example: you have multiple product pages for a specific category on your website because you use pagination. In this case you will have several URLs for your paginated content: page 1, page 2, page 3, and so on. Adding a <link rel="canonical"> tag pointing to the first page on all of these pages tells the search engine to index the first page only instead of indexing every paginated page.
Basically, you're telling the spider: don't index this URL, index that other URL instead.
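As an illustrative sketch (the paths and domain are made up), page 2 and page 3 of such a category would each carry something like this in their <head>:
<!-- in /shoes/page2.html and /shoes/page3.html -->
<link rel="canonical" href="https://example.com/shoes/" />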
In your particular case, /index.htm is most probably a 301 redirect to /site1/index.htm. The risk is that Google won't index your page: you're telling it not to index the content on /site1/index.htm and to index /index.htm instead, but that page has no content of its own because it only performs a redirect.
I'm aware you stated that you have no access to the .htaccess file, but the only way I can think of, without touching the folder structure on your FTP, is to use .htaccess to redirect /site1/index.htm to /index.htm and then add the canonical tag just to be safe, since having a canonical tag is good practice.
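For completeness, a minimal sketch of that redirect, assuming an .htaccess file in the document root on Apache with mod_alias available, would be:
# hypothetical .htaccess in the document root
Redirect 301 /site1/index.htm /index.htm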

How to tell Jekyll to hide one page from search engines?

I have a website consisting of my public profile, made in Jekyll.
It also contains one page, say 'details.html', which contains more personal information about me. I want only those people to see this page whom I give out the link to. In particular, I'd like to hide it from search engines.
How do I best do this? I've heard I can add a robots.txt file or include a meta tag 'nofollow' or 'noindex'.
Which is the usual solution here?
If the way to go is to add a meta tag, how do I add it in only one page given a standard Jekyll setup?
The robots.txt is the standard way of telling search engines what to index and what not to (not just for Jekyll, but for websites in general).
Just create a file called robots.txt in the root of your Jekyll site, with the paths that should not be indexed.
e.g.
User-agent: *
Disallow: /2017/02/11/post-that-should-not-be-indexed/
Disallow: /page-that-should-not-be-indexed/
Allow: /
Jekyll will automagically copy the robots.txt to the folder where the site gets generated.
You can also test your robots.txt to make sure it is working the way you expect: https://support.google.com/webmasters/answer/6062598?hl=en
Update 2021-08-02 - Google-specific settings:
You can prevent a page from appearing in Google Search by including a noindex meta tag in the page's HTML code, or by returning a noindex header in the HTTP response.
There are two ways to implement noindex: as a meta tag and as an HTTP response header. They have the same effect; choose the method that is more convenient for your site.
<meta> tag
To prevent most search engine web crawlers from indexing a page on your site, place the following meta tag into the <head> section of your page:
<meta name="robots" content="noindex">
To prevent only Google web crawlers from indexing a page:
<meta name="googlebot" content="noindex">
HTTP response header
Instead of a meta tag, you can also return an X-Robots-Tag header with a value of either noindex or none in your response. Here's an example of an HTTP response with an X-Robots-Tag instructing crawlers not to index a page:
HTTP/1.1 200 OK
(...)
X-Robots-Tag: noindex
(...)
More details: https://developers.google.com/search/docs/advanced/crawling/block-indexing
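If you specifically want that noindex meta tag on just one page of a standard Jekyll site, a common pattern (the include path and front-matter key here are illustrative, not part of any particular theme) is to drive it from front matter in whatever include builds your <head>:
{% if page.noindex %}
<meta name="robots" content="noindex">
{% endif %}
Then set noindex: true in the front matter of details.html only, and the tag is emitted on that page alone.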
Try:
---
layout:
sitemap: false
---
So, whenever you include the sitemap: false line in a page's front matter, that page is excluded from your sitemap.
To set this up:
add gem 'jekyll-sitemap' to your site’s Gemfile and run bundle
add the following to your site’s _config.yml:
plugins:
- jekyll-sitemap
A robots.txt file is a great solution, but .htaccess might be better for this purpose. Also, make sure you have a private repository!
Note that hosting your code on CloudCannon (paid account) allows you to set up all these things easily from within their interface.

How to get rid of .html extension when serving webpages with node.js?

I am a beginner with node.js and am using Express with the EJS layout, and I want to know how to get rid of the .html extension when putting up a page. For example, if I go to localhost:3000/about.html, that works, but I want it to show up as just /about. I'm also having trouble figuring out how to change the favicon from the Express default, if anyone knows how to do that quickly.
Any help would be great thanks.
(I realise this question is old, but it appears high in Google search results, and the accepted answer isn't the best solution.)
The best solution for serving up static content in express.js is express.static. To avoid having to specify file extensions in URLs you can configure it with a list of default file extensions that it will use when searching for static files:
app.use(express.static(pathToBaseFolderOfStaticContent, {
  extensions: ['html', 'htm'],
  ... // Other options here
}));
This will serve up pathToBaseFolderOfStaticContent/somePage.html or pathToBaseFolderOfStaticContent/somePage.htm in response to a GET request to http://www.example.com/somePage, which is what you want. For example, if you visit https://arcade.ly/star-castle, the file it serves up is just a static file called star-castle.html. I haven't had to add any special routing for this, or any other static file - it's all just handled by express.static.
I only need to add specific routes for content that requires active work on the server to return. A big advantage here is that I can use a CDN to cache more of my content (or nginx if I were running an internal line of business app), thus reducing load on my server.
You can obviously configure as many default file extensions as you like, although I'd tend to keep the list short. I only use it for resources where the URL is likely to appear in the address bar, which generally means HTML files, although not always.
Have a look at the following documentation on serving static content with express.js:
http://expressjs.com/en/starter/static-files.html
http://expressjs.com/en/4x/api.html (the express.static documentation is at the top)
This is also answered at In express what is the common way to associate a default file extension with static content requests?.
The favicon.ico issue can be solved by dropping your favicon into the root folder from which you serve static content, as well as implementing +Costa's solution where you reference it using a <link> in the <head> of your documents.
In theory you shouldn't need to put the favicon in the root folder but, in practice, some browsers will still ask for it from the site root even though it's referenced in the <head> of your document. This leads to a spurious 404 error that you'll be able to see in client-side debugging tools (e.g., Chrome dev tools).
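A minimal sketch of that setup (the public folder name is just an assumption) would be:
// favicon.ico sits at public/favicon.ico, so express.static answers the
// browser's automatic request for /favicon.ico from the site root
app.use(express.static('public'));
together with the <link> tag in each page's <head>, as described in the other answer.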
The favicon issue is usually a caching problem. As long as you have this code in your base HTML layout:
<link rel="shortcut icon" type="image/x-icon" href="/images/favicon.ico">
Then just navigate to wherever that image is with your browser, and that should force your cache to update.
I figured it out. I looked at the post "Render basic HTML view?", which solved the problem I was having.
var express = require('express');
var app = express();
// let EJS render plain .html templates from the default ./views folder
app.engine('html', require('ejs').renderFile);
app.get('/', function (req, res) {
  res.render('index.html');
});
And this all goes in the app.js or whatever file you are running.

How to prevent search engines from indexing a single page of my website?

I don't want the search engines to index my imprint page. How could I do that?
You can also add the following meta tag in the <head> of that page:
<meta name="robots" content="noindex,nofollow" />
You need a simple robots.txt file. Basically, it's a text file that tells search engines not to index particular pages.
You don't need to include it in the header of your page; as long as it's in the root directory of your website it will be picked up by crawlers.
Create it in the root folder of your website and put the following text in:
User-Agent: *
Disallow: /imprint-page.htm
Note that you'd replace imprint-page.htm in the example with the actual name of the page (or the directory) that you wish to keep from being indexed.
That's it! If you want to get more advanced, you can check out here, here, or here for a lot more info. Also, you can find free tools online that will generate a robots.txt file for you (for example, here).
You can setup a robots.txt file to try and tell search engines to ignore certain directories.
See here for more info.
Basically:
User-agent: *
Disallow: /[directory or file here]
<meta name="robots" content="noindex, nofollow">
Just include this line in your <head> tag. The reason I recommend this: if you use a robots.txt file to hide URLs, those might be login pages or other protected URLs that you don't want to show to anyone else or to search engines.
Anyone can access the robots.txt file directly on your website and see which URLs you consider secret. So what is the logic behind the robots.txt file?
The better way is to include the meta tag above and keep yourself safe from anyone.
Nowadays, the best method is to use a robots meta tag and set it to noindex,follow:
<meta name="robots" content="noindex, follow">
Create a robots.txt file and set the controls there.
Here are the docs for google:
http://code.google.com/web/controlcrawlindex/docs/robots_txt.html
A robot that wants to visit a website URL, say http://www.example.com/welcome.html, first checks for http://www.example.com/robots.txt. There you can explicitly disallow a path:
User-agent: *
Disallow: /~joe/junk.html
Please see the link below for details:
robots.txt

301 Redirect directly in HTML file

I have changed some file names on my website and now I want to set up a "301 Moved Permanently" redirect from the old files to the new ones.
The problem is that my site is made up entirely of static HTML pages, and all the 301 redirect tutorials describe how to do it in PHP, ASP, .htaccess, etc. I would like to write the redirect directly into the old HTML files. Is this possible, or do I have to contact my web provider and solve the redirect on the server?
The only thing I know about the server is that it runs on Windows and I have no server knowledge.
EDIT: My web hosting is using Microsoft IIS 7.0, so I assume using .htaccess is not possible here?
EDIT #2: Just now my server admin wrote me that even though I use only static HTML pages, I can still use a web.config file to redirect individual HTML files. This is very nice.
You cannot alter the HTTP status code with HTML.
But if you’re using an Apache webserver, you could use mod_rewrite or mod_alias to redirect such requests to the new address:
# mod_rewrite
RewriteEngine on
RewriteRule ^old\.html$ /new.html [L,R=301]
# mod_alias
RedirectMatch 301 ^/old\.html$ /new.html
Edit: As you've now clarified that you're using IIS 7, take a look at its <httpRedirect> element for HTTP redirects.
I guess you could use JavaScript and/or a meta refresh (as suggested by Gumbo) to redirect users from your old pages to the new ones.
Something like:
<html>
  <head>
    <meta http-equiv="refresh" content="0;url=http://YourServer/NewFile.html" />
    <script type="text/javascript">
      location.replace('http://YourServer/NewFile.html');
    </script>
  </head>
  <body>
    This page has moved. <a href="http://YourServer/NewFile.html">Click here</a> for the new location.
  </body>
</html>
No, it isn't possible. HTML is not processed by the server, so it cannot set HTTP headers.
You should look at Apache configuration instead (e.g. with .htaccess).
At its simplest you could do:
Redirect 301 /old.html http://example.com/new/
Redirect 301 /other-old.html http://example.com/newer/
Redirecting individual pages in IIS is a simple affair, done in your web.config file (the <location> element below sits inside the root <configuration> element):
<location path="products.htm">
  <system.webServer>
    <httpRedirect enabled="true" destination="http://yourserver/products" httpResponseStatus="Permanent" />
  </system.webServer>
</location>