Which one is better for removed content, 301 or 404?

We were fed up with Joomla's duplicate links, so we converted our website from Joomla to plain HTML. In Joomla we have only about 80 pages, but Google indexed 1,970(!) pages, which means duplicate content as far as Google is concerned. So we converted the site to HTML. But what can we do about the old pages?
Our new link structure is domainname.com/article.html
But the old structure was domainname.com/index.php/article.php
So, which is better for the old pages, 301 redirect or 404 not found? What should we do?

If the content has moved then it is 301 Moved Permanently (coupled with a Location header to say where it has moved to).
If the content has been removed then it is 410 Gone.
404 Not Found is for content that never existed or can't be found for unknown reasons.
It sounds like you want to 301 all the URLs where the duplicate content used to be available to the one place that it is now available.
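For example, a minimal .htaccess sketch of that mapping, assuming Apache with mod_rewrite and that every old URL follows the /index.php/article.php pattern described above:

RewriteEngine On
# Match the original request, including path-info style URLs such as /index.php/article.php,
# and 301 it to the matching static page (/article.html)
RewriteCond %{REQUEST_URI} ^/index\.php/(.+)\.php$ [NC]
RewriteRule ^ /%1.html [R=301,L]

If some old URLs have no one-to-one replacement, let them return 404 (or 410) rather than funnelling everything to the home page.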

404 Not Found: The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address. This status code is commonly used when the server does not wish to reveal exactly why the request has been refused, or when no other response is applicable.
301 Moved Permanently: The requested resource has been assigned a new permanent URI and any future references to this resource SHOULD use one of the returned URIs. Clients with link editing capabilities ought to automatically re-link references to the Request-URI to one or more of the new references returned by the server, where possible. This response is cacheable unless indicated otherwise.
In my opinion, if the page you are removing has a suitable alternative page on your web site, then 301 it. Do not always 301 the page to your home page. If there is no suitable alternative, and by suitable I mean a page that is very similar to the one you are removing, then 404 the page.
301 if there is a related and similar page to the page you are removing. 404 if there is not.
You can see the complete blog post here: http://www.seroundtable.com/404-301-web-page-16773.html
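As a tiny illustration of that rule of thumb (the paths are made up, and this assumes Apache's mod_alias):

# A removed page that has a close substitute gets a 301...
Redirect 301 /green-widgets-2011.html /green-widgets.html
# ...a removed page with no substitute needs no rule at all: just let it return 404.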

Related

How do I direct all traffic or searches going to duplicate (or similar URLs) to one URL on our website?

I'll try to keep this as simple as possible, as I don't quite understand how to frame the question entirely correctly myself.
We have a report on our website indicating duplicate meta titles and descriptions, which look very much (almost exactly) like the following, although I have used an example domain below:
http://example.com/green
https://example.com/green
http://www.example.com/green
https://www.example.com/green
But, only one of these actually exists as an HTML file on our server, which is:
https://www.example.com/green
As I understand it, I need to somehow tell Google and other search engines which of these URLs is correct, and this should be done by specifying a 'canonical' link or URL.
My problem is that the canonical reference must apparently be added to any duplicate pages that exist, not to the actual main canonical page. But we don't actually have any other pages beyond the one mentioned just above, so there is nowhere to set these canonical rel references?
I'm sure there must be a simple explanation for this that I am completely missing?
So it turns out that these were duplicate URLs that occur because our website is served from the www subdomain of our domain. Any traffic that arrives at example.com (our bare domain) needs a permanent redirect to https://www.example.com, by way of a redirect in the .htaccess file.
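A sketch of that .htaccess redirect, assuming Apache with mod_rewrite and that TLS terminates on the same server (adjust the condition if you sit behind a proxy):

RewriteEngine On
# Anything that is not already https://www.example.com/... gets a single 301 to it
RewriteCond %{HTTPS} off [OR]
RewriteCond %{HTTP_HOST} !^www\.example\.com$ [NC]
RewriteRule ^ https://www.example.com%{REQUEST_URI} [R=301,L]

With all four scheme/hostname variants collapsing onto one URL, the duplicate-title report should clear up on its own; a canonical link becomes a belt-and-braces extra rather than a requirement.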

Which robots.txt for forwarded subdomain?

In theory I have two subdomains set up in my hosting:
subdomain1.mydomain.com
subdomain2.mydomain.com
subdomain2 has a CNAME record pointing to an external service.
mydomain.com has a robots.txt that allows indexing everything.
subdomain2.mydomain.com has a robots.txt that allows indexing nothing due to the CNAME record.
If I set up a forward from subdomain1.mydomain.com to subdomain2.mydomain.com, which robots.txt would be used if accessing a link to subdomain1.mydomain.com? Does the domain forward work in the same way as a CNAME record when it comes to robots.txt?
This depends on your server setup.
Take the following config, for example:
server {
    server_name subdomainA.example.com;
    listen 80;
    return 302 http://subdomainB.example.com$request_uri;
}
In this case, we're redirecting everything from subdomainA.example.com to subdomainB.example.com. This will include your robots.txt file.
However, if your configuration is set up to only redirect certain parts, your robots.txt file will only be redirected if it's on your list. This would be the case if you were redirecting only, say, /someFolder.
Note that if you don't return a 302 but just use a different root (e.g. subdomainA and subdomainB are different subdomains but serve the same content), your robots.txt content will be determined by the root directory.
So, therefore, if I'm understanding your config correctly, subdomain1 will use the robots.txt from subdomain2.
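For contrast, a sketch of the partial-redirect case mentioned above, where only /someFolder is forwarded and /robots.txt is still answered by subdomainA itself:

server {
    server_name subdomainA.example.com;
    listen 80;

    # Only this path is redirected; everything else, including /robots.txt, stays local
    location /someFolder/ {
        return 302 http://subdomainB.example.com$request_uri;
    }
}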
The challenge you're running into is that you're looking at things from the standpoint of whatever software you're trying to configure, but search engines and other robots only see the document they load from a URL (just like any other user with a web browser would). That is, search engines will try to load http://subdomain1.mydomain.com/robots.txt and http://subdomain2.mydomain.com/robots.txt, and it's up to you (through configuring whatever software your server is running) to ensure that those are in fact serving what you want.
A CNAME is just an extra level of indirection when resolving a domain name to an IP address. A robot will follow it when resolving the name to find the "real" IP to connect to, but it has no further bearing on what the GET /robots.txt request returns once the robot connects to the server.
In terms of "forwarding", that term can mean different things, so you'd need to know what a browser or robot would receive when it requested the page. If it's doing a 301 or 302 redirection to send the client to another URL, you'll probably get different results from different search engines on how they may honor that, particularly if it's being redirected to an entirely different domain. I probably would try to avoid it, just because a lot of robots are poorly written. Some search engines have tools to help you determine how their crawlers are reading your robots.txt URLs, such as Google's tool.

Custom 404 Not Found page, how to make it?

While making updates to my website, I found a lot of leftover pages that I no longer use, so I deleted them.
Unfortunately the search engines have already done some indexing, so when you type the name of my website, non-existent pages still appear in the results.
I need to create a custom 404 Not Found page that appears every time people visit a page that doesn't exist, respecting Google's SEO policy and W3C standards.
Unfortunately I can't manage it.
Could someone teach me, please?
Make an .html document in your web server or website's directory in the htdocs, then make a new folder called "err" and upload all your error pages into it (for example 404, "The page cannot be found", and 403, "Forbidden"), renaming them after their error codes. If you are using cPanel, then search (using the search bar) or browse the listed tools to find "Error Pages", and go from there with the instructions cPanel gives you.
Hope this helps you,
Jay Salway (13 year old developer)
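If the site is running on Apache, the same result can come from a one-line ErrorDocument directive in .htaccess; a minimal sketch, assuming the error page from the "err" folder described above is reachable as /err/404.html:

# Serve the custom page for anything that no longer exists;
# Apache keeps the 404 status code, which is what search engines need to see
ErrorDocument 404 /err/404.html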
Create a static page containing your custom message and anything else you want (e.g. site layout) and save it somewhere appropriate within your site (e.g. from the root: /errors/404.asp). Within that page, make sure you write a 404 response header (e.g. Response.Status = "404 Page Not Found").
In IIS (a similar option will be available under Apache if you are running that), open up the settings for your website and choose 'Error Pages', then look for the status code 404 (by default there should be one, but you may need to create it). Open that up, choose the option 'Execute a URL on this site', and enter the URL chosen above (/errors/404.asp).
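If you prefer to configure it in web.config rather than clicking through the IIS manager, a sketch of the same setting (assuming IIS 7 or later and the /errors/404.asp page created above):

<configuration>
  <system.webServer>
    <httpErrors errorMode="Custom">
      <remove statusCode="404" />
      <!-- Execute the custom page server-side so the response still carries a 404 status -->
      <error statusCode="404" path="/errors/404.asp" responseMode="ExecuteURL" />
    </httpErrors>
  </system.webServer>
</configuration>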

Cloudfront Custom Origin Is Causing Duplicate Content Issues

I am using CloudFront to serve images, css and js files for my website using the custom origin option with subdomains CNAMEd to my account. It works pretty well.
Main site: www.mainsite.com
static1.mainsite.com
static2.mainsite.com
Sample page: www.mainsite.com/summary/page1.htm
This page calls an image from static1.mainsite.com/images/image1.jpg
If CloudFront has not already cached the image, it gets the image from www.mainsite.com/images/image1.jpg
This all works fine.
The problem is that Google Alerts has reported the page as being found at both:
http://www.mainsite.com/summary/page1.htm
http://static1.mainsite.com/summary/page1.htm
The page should only be accessible from the www. site. Pages should not be accessible from the CNAME domains.
I have tried to put a mod_rewrite rule in the .htaccess file and I have also tried to put an exit() in the main script file.
But when CloudFront does not find the static1 version of the file in its cache, it requests it from the main site and then caches it.
Questions then are:
1. What am I missing here?
2. How do I prevent my site from serving pages instead of just static components to cloudfront?
3. How do I delete the pages from CloudFront? Just let them expire?
Thanks for your help.
Joe
[I know this thread is old, but I'm answering it for people like me who see it months later.]
From what I've read and seen, CloudFront does not consistently identify itself in requests. But you can get around this problem by overriding robots.txt at the CloudFront distribution.
1) Create a new S3 bucket that only contains one file: robots.txt. That will be the robots.txt for your CloudFront domain.
2) Go to your distribution settings in the AWS Console and click Create Origin. Add the bucket.
3) Go to Behaviors and click Create Behavior:
Path Pattern: robots.txt
Origin: (your new bucket)
4) Set the robots.txt behavior at a higher precedence (lower number).
5) Go to invalidations and invalidate /robots.txt.
Now abc123.cloudfront.net/robots.txt will be served from the bucket and everything else will be served from your domain. You can choose to allow/disallow crawling at either level independently.
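For reference, the robots.txt you drop into that bucket can be as simple as a blanket disallow (assuming you want nothing indexed under the CloudFront hostname):

User-agent: *
Disallow: /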
Another domain/subdomain will also work in place of a bucket, but why go to the trouble?
You need to add a robots.txt file and tell crawlers not to index content under static1.mainsite.com.
In CloudFront you can control the hostname with which CloudFront will access your server. I suggest giving CloudFront a specific hostname that is different from your regular website hostname. That way you can detect a request to that hostname and serve a robots.txt which disallows everything (unlike your regular website's robots.txt).
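A sketch of that idea for an Apache origin, assuming a hypothetical dedicated hostname cdn-origin.mainsite.com is given only to CloudFront and a separate robots-cdn.txt file contains the blanket disallow:

RewriteEngine On
# When CloudFront fetches through its dedicated origin hostname,
# hand it the restrictive robots file instead of the normal one
RewriteCond %{HTTP_HOST} ^cdn-origin\.mainsite\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-cdn.txt [L]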

How to prevent deep linking to files on my website

I own a website which contains a lot of freeware to download. The problem I'm facing is that people from around the world are taking the direct links
to the files (for example .zip files) and posting them on their websites and general forums. I am using a huge amount of bandwidth, and that's OK, but the number of pages visited is low. Is there a way, or a script I can attach to the links, so that when someone clicks the link from a foreign website, a page from my website opens instead, which then lets them download the file, so I get more visits?
For example, this is a address from my website:
http://sy-stu.org/stu/PublicFiles/StdLibrary/Exam.zip
When anyone clicks it, the download starts directly.
Your site is hosted by an Apache web server, so you should be able to do the following in your site's httpd.conf (or virtual host block):
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/ [NC]
RewriteRule ^/PublicFiles/ /page-about-direct-links.html
That is basically saying:
Turn the mod_rewrite engine on
If the HTTP Referrer is not blank…
And doesn't contain my domain name (with or without “www.”)…
Redirect any requests for anything under /PublicFiles/ to /page-about-direct-links.html
More information on mod_rewrite can be found here: mod_rewrite - Apache HTTP Server
If you are using PHP, you can have a script that links the user to the download, but only if $_SERVER['HTTP_REFERER'] is from your site. If it is not, you redirect to your site.
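A rough PHP sketch of that check (download.php, the redirect target, and the file path are made-up names; note that the Referer header is easily spoofed or stripped, so treat this as a deterrent, not real protection):

<?php
// download.php - only stream the file when the visitor arrived from our own pages
$referer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

if (strpos($referer, 'sy-stu.org') === false) {
    // Not from our site (or no referer at all): send them to the site instead
    header('Location: http://sy-stu.org/', true, 302);
    exit;
}

$file = __DIR__ . '/PublicFiles/StdLibrary/Exam.zip'; // assumed location on disk
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="Exam.zip"');
header('Content-Length: ' . filesize($file));
readfile($file);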
Don't provide a direct link to the file you are serving. Instead, provide a script that streams the content once the user hits the submit button.
Do a web search for sending files through CGI.
Why not just make the links dynamic and indirect? For example:
on page X (static): a link to page Y labelled "SuperNeat Program"
on page Y (dynamically generated): "Click here to download", linked as
<a href="Z.php?timestamp={timestamp}&counter={counter}&hash={hash}">SuperNeat Program</a>
and replace timestamp with the current time in milliseconds since 1970, counter with a counter that you increment once per download, and hash with the MD5 hash of concatenate(timestamp, counter, secret salt), where the secret salt is any value you keep secret.
Then on page Z.php, you just recalculate the hash from the counter and timestamp in the query string, check that it matches the hash in the query string, and that the timestamp is recent (e.g. from the previous 30 minutes or 60 minutes or whatever). If it is, then serve the file in question. If it isn't, throw an error message. This gives someone only a brief time period to direct-link to your file. If you don't even want that, then keep track of the counter values received in the Z.php query string and don't accept them more than once.
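A rough PHP sketch of Z.php along those lines (the salt, the file location, and the 30-minute window are all assumptions):

<?php
// Z.php - verify the signed link before serving the download
$secretSalt = 'replace-with-your-own-secret';

$timestamp = isset($_GET['timestamp']) ? (int)$_GET['timestamp'] : 0;
$counter   = isset($_GET['counter'])   ? (int)$_GET['counter']   : 0;
$hash      = isset($_GET['hash'])      ? $_GET['hash']           : '';

// Recalculate the hash exactly as page Y did when it built the link
$expected = md5($timestamp . $counter . $secretSalt);

// Timestamp is in milliseconds since 1970; accept links up to 30 minutes old
$fresh = abs((time() * 1000) - $timestamp) < 30 * 60 * 1000;

if ($fresh && hash_equals($expected, $hash)) {
    header('Content-Type: application/zip');
    header('Content-Disposition: attachment; filename="SuperNeatProgram.zip"');
    readfile(__DIR__ . '/downloads/SuperNeatProgram.zip'); // assumed location
} else {
    http_response_code(403);
    echo 'This download link is invalid or has expired.';
}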
I'm not at all a web expert, but I was thinking about the following pointer: if you're using ASP.NET, HTTP handlers or modules configured at the web-site level could help (there's lots of information on those on the web; I looked it up recently for some work).
The idea is to intercept the request before it reaches the target file and redirect it to the page you wish to show. For example, if someone browses to the URL you've posted ("http://sy-stu.org/stu/PublicFiles/StdLibrary/Exam.zip"), intercept that call, use some lookup to find the page you wish to display, and redirect the request there. I'm guessing users following a link won't be too annoyed (unless they have done "save target as", which would result in them saving some HTML and not a ZIP).
However, there's a "hole" in my plan: how do you actually provide a link that works from your own page? I believe you can differentiate between requests coming from your web site and ones coming from others', which you could check in the handler/module by examining the request object.