How to prevent deep linking to files on my website - html

I own a website that contains a lot of freeware to download. The problem I'm facing is that people from around the world are taking the direct links
to the files (for example, .zip files) and posting them on their own websites and forums. I'm serving a huge amount of bandwidth, which is OK, but the number of pages visited is low. Is there a way, or a script I can add to the links, so that when someone follows the link from a foreign website, a page from my website opens instead, which then lets them download the file, so that I get more visits?
For example, this is an address from my website:
http://sy-stu.org/stu/PublicFiles/StdLibrary/Exam.zip
When anyone clicks it, the download starts directly.

Your site is hosted by an Apache web server, so you should be able to do the following in your site's httpd.conf (or virtual host block):
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^http://(www\.)?yourdomain\.com/ [NC]
RewriteRule ^/PublicFiles/ /page-about-direct-links.html [R,L]
That is basically saying:
Turn the mod_rewrite engine on
If the HTTP Referrer is not blank…
And doesn't contain my domain name (with or without “www.”)…
Redirect any requests for anything under /PublicFiles/ to /page-about-direct-links.html
More information on mod_rewrite can be found here: mod_rewrite - Apache HTTP Server

If you are using PHP, you can have a script that links the user to the download, but only if $_SERVER['HTTP_REFERER'] is from your site; if it is not, redirect to your site.
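A minimal sketch of that (the domain and file path are placeholders; note the Referer header can be empty or forged, so this is a deterrent rather than real protection):
<?php
// download.php - sends the visitor to the file only when the referer is ours
$host = parse_url($_SERVER['HTTP_REFERER'] ?? '', PHP_URL_HOST);

if ($host === 'yourdomain.com' || $host === 'www.yourdomain.com') {
    header('Location: /PublicFiles/StdLibrary/Exam.zip');  // the real download
} else {
    header('Location: /page-about-direct-links.html');     // back to your site
}
exit;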

Don't provide a direct link to the files you are serving. Instead, provide a script that sends the file's content once the user hits the submit button.
Do a web search for sending files through CGI.
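In PHP, for example, a rough sketch (the directory name is an assumption; the point is that it sits outside the web root, so the files have no direct URL at all):
<?php
// getfile.php?file=Exam.zip - streams a file that has no public URL
$file = basename($_GET['file'] ?? '');          // basename() strips any ../ tricks
$path = '/var/www/private-downloads/' . $file;  // a directory outside the web root

if ($file === '' || !is_file($path)) {
    http_response_code(404);
    exit('No such file.');
}

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="' . $file . '"');
header('Content-Length: ' . filesize($path));
readfile($path);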

Why not just make the links dynamic and indirect, for example:
on page X: (static)
<a href="Y.php">SuperNeat Program</a>
on page Y: (dynamically generated)
Click here to download
<a href="Z.php?timestamp={timestamp}&counter={counter}&hash={hash}">SuperNeat Program</a>
and replace timestamp with the current time in msec since 1970, counter = a counter that you increment once per download, and hash = the MD5 hash of concatenate(timestamp, counter, secret salt), where the secret salt is any favorite code you keep secret.
Then on page Z.php, you just recalculate the hash from the counter and timestamp in the query string, check that it matches the hash in the query string, and that the timestamp is recent (e.g. from the previous 30 minutes or 60 minutes or whatever). If it is, then serve the file in question. If it isn't, throw an error message. This gives someone only a brief time period to direct-link to your file. If you don't even want that, then keep track of the counter values received in the Z.php query string and don't accept them more than once.
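A rough sketch of Z.php under those assumptions (the salt, time window, and file path are placeholders):
<?php
// Z.php - hypothetical implementation of the scheme described above.
// Page Y would build the link with something like:
//   $ts   = round(microtime(true) * 1000);      // msec since 1970
//   $hash = md5($ts . $counter . SECRET_SALT);
define('SECRET_SALT', 'replace-with-your-own-secret');

$timestamp = $_GET['timestamp'] ?? '';
$counter   = $_GET['counter']   ?? '';
$hash      = $_GET['hash']      ?? '';

// Recalculate the hash exactly as page Y did and compare.
$expected = md5($timestamp . $counter . SECRET_SALT);

// Only accept links generated within the last 30 minutes.
$fresh = is_numeric($timestamp)
      && (microtime(true) * 1000 - (float)$timestamp) < 30 * 60 * 1000;

if (!$fresh || !hash_equals($expected, $hash)) {
    http_response_code(403);
    exit('This download link has expired.');
}

// Checks passed: serve the file (path and name are placeholders).
header('Content-Type: application/zip');
header('Content-Disposition: attachment; filename="Exam.zip"');
header('Content-Length: ' . filesize('/var/www/private/Exam.zip'));
readfile('/var/www/private/Exam.zip');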

I'm not at all a web expert, but I was thinking about the following pointer:
if you're using ASP.NET, HTTP handlers or modules configured at the web-site level could help (there's lots of information on those on the web; I looked it up recently for some work).
The idea is to intercept the request before it reaches the target file and redirect it to the page you wish to show. For example, if someone browses to the URL you've posted (http://sy-stu.org/stu/PublicFiles/StdLibrary/Exam.zip), intercept the call, use some lookup to find the page you wish to display, and redirect the request there. I'm guessing users following a link won't be too annoyed (unless they have done "save target as", which would result in them saving some HTML and not a ZIP).
However, there's a hole in my plan: how do you actually provide a link that works from your own page? I believe you can differentiate between requests coming from your own website and ones coming from others', which you could check in the handler/module by examining the request object.

Related

How can I make a webpage expire after a certain date and become not accessible?

Is there a way in Apache to make URLs accessible only during certain times, or inaccessible after a specific point in time? I am looking for a solution using Apache only; I know it can be done manually or by scheduling a cron job to remove the file.
Example:
I have a web page accessible via http://example.com/aboutthisproject.html which I want to send to clients via email. I want that link to expire and the page to become inaccessible after, let's say, one week. So when someone who has the link types http://example.com/aboutthisproject.html into their browser, they should get a 404 error.
What options do I have besides manually moving or renaming the file? I want to be able to set an expiry date for that page and forget about it, rather than having to remember to go back and rename or move the file.
Have you looked into Apache mod_rewrite with the date and time server variables? Using a RewriteCond based on date and time, you can return a 404.
see https://httpd.apache.org/docs/2.4/mod/mod_rewrite.html
example:
RewriteEngine On
# between 3am and 5am (TIME_HOUR is 03 or 04)
RewriteCond %{TIME_HOUR} >02
RewriteCond %{TIME_HOUR} <05
RewriteRule ^index\.html$ /morning/index.html
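And a sketch closer to the expiry-date case (the date is a placeholder; %{TIME} is the server time as a fixed-width YYYYMMDDHHMMSS string, so the lexical > comparison works, and R=404 requires Apache 2.4):
RewriteEngine On
# after 23:59:59 on 30 June 2024, answer this page with a 404
RewriteCond %{TIME} >20240630235959
RewriteRule ^aboutthisproject\.html$ - [R=404,L]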

Which robots.txt for forwarded subdomain?

In theory I have two subdomains set up in my hosting:
subdomain1.mydomain.com
subdomain2.mydomain.com
subdomain2 has a CNAME record pointing to an external service.
mydomain.com has a robots.txt that allows indexing everything.
subdomain2.mydomain.com has a robots.txt that allows indexing nothing due to the CNAME record.
If I set up a forward from subdomain1.mydomain.com to subdomain2.mydomain.com, which robots.txt would be used if accessing a link to subdomain1.mydomain.com? Does the domain forward work in the same way as a CNAME record when it comes to robots.txt?
This depends on your server setup.
Take the following config, for example:
server {
    server_name subdomainA.example.com;
    listen 80;
    return 302 http://subdomainB.example.com$request_uri;
}
In this case, we're redirecting everything from subdomainA.example.com to subdomainB.example.com. This will include your robots.txt file.
However, if your configuration is set up to only redirect certain parts, your robots.txt file will only be redirected if it's on your list. This would be the case if you were redirecting only, say, /someFolder.
Note that if you don't return a 302 but just use a different root (e.g. subdomainA and subdomainB are different subdomains but serve the same content), your robots.txt content will be determined by the root directory.
So, if I'm understanding your config correctly, subdomain1 will use the robots.txt from subdomain2.
The challenge you're running into is you're looking at things from the standpoint of whatever software you're trying to configure, but search engines and other robots only see the document they load from a URL (just like any other user with a web browser would). That is, search engines will try to load http://subdomain1.mydomain.com/robots.txt and http://subdomain2.mydomain.com/robots.txt, and it's up to you (through configuring whatever software your server is running) to ensure that those are in fact serving what you want.
A CNAME is just an alias consulted during DNS resolution. A robot will use it when resolving the name to find the "real" IP to connect to, but it has no further bearing on what the GET /robots.txt request returns once the robot connects to the server.
In terms of "forwarding", that term can mean different things, so you'd need to know what a browser or robot would receive when it requested the page. If it's doing a 301 or 302 redirection to send the client to another URL, you'll probably get different results from different search engines on how they may honor that, particularly if it's being redirected to an entirely different domain. I probably would try to avoid it, just because a lot of robots are poorly written. Some search engines have tools to help you determine how their crawlers are reading your robots.txt URLs, such as Google's tool.

Need to have many different URLs resolve to a single web page

And I don't want to use GET params.
Here are the details:
the user has a bunch of photos, and each photo must be shown by itself and have a unique URL of the form
www.mysite.com/theuser/photoNumber-N
I create a unique URL for a user each time they add a new photo to their gallery
the web page that displays the user's photo is the same code for every user and every photo -- only the photo itself is different.
the user gives a URL to Person-A but then Person-A has one URL to that one photo and cannot see the user's other photos (because each photo has a unique URL and Person-A was given only one URL for one photo)
I want the following URLS to (somehow) end up loading only one web page with only the photo contents being different:
www.mysite/user-Terry/terryPhoto1
www.mysite/user-Terry/terryPhoto2
www.mysite/user-Jackie/JackiesWeddingPhoto
www.mysite/user-Jackie/JackiesDogPhoto
What I'm trying to avoid is this: having many copies of the same web page on my server, with the only difference being the .jpeg filename.
If I have 200 users and each has 10 photos -- and I fulfill my requirement that each photo is on a page by itself with a distinct URL -- right now I've got 2,000 web pages, each displaying a unique photo and taking up space on my web server, every page identical, redundant, disk-space-wasting HTML code, the only difference being the .jpeg file name of the photo to display.
Is there something I can do to avoid wasting diskspace and still meet my requirement that each photo has a unique URL?
Again I cannot use GET with parameters.
If you are on an Apache server, you can use Apache's mod_rewrite to accomplish just that. While the script you are writing will ultimately still be fetching GET variables (www.mysite.com/photos.php?id=photo-id), mod_rewrite will convert all the URLs served into the format you choose (www.mysite.com/user-name/photo-id).
There are plenty of implementation guides online, and the actual documentation on the module itself is part of the Apache HTTP Server docs.
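As a sketch of that approach (the rule pattern, script name, and markup are assumptions, not your actual layout), an .htaccess rule in the web root might be:
RewriteEngine On
# /user-Terry/terryPhoto1 is served by /photos.php?user=Terry&photo=terryPhoto1,
# internally, without the GET parameters ever appearing in the address bar
RewriteRule ^user-([^/]+)/([^/]+)$ photos.php?user=$1&photo=$2 [L,QSA]
photos.php is then the one shared page that drops in the right image:
<?php
// photos.php - hypothetical; adjust to however the photos are stored
$user  = basename($_GET['user']  ?? '');
$photo = basename($_GET['photo'] ?? '');  // basename() blocks path tricks
$src   = '/photos/' . rawurlencode($user) . '/' . rawurlencode($photo) . '.jpg';
?>
<html><body><img src="<?php echo htmlspecialchars($src); ?>" alt=""></body></html>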
Go to IIS Manager. Go to the site hosted in IIS. Add an additional binding for each URL.
This will route all of those requests to the same location.

Why doesn't Wikipedia have extensions?

Looking at a random Wikipedia article like http://en.wikipedia.org/wiki/Impostor_syndrome, I see that there's no .html attached to the end of the address. In fact, if I do try to put .html after it, Wikipedia tells me "Wikipedia does not have an article with this exact name." How come it doesn't need any file extensions?
There is no law saying that an HTML file has to end in .html or .htm, and since the wiki generates pages from a database, there is no actual file there anyway (except in a cache).
Not having .htm or .php is more sensible - why should you care what technology they use when you ask for a URL? It would be like having to put the operating system of the recipient at the end of their email address.
If you make a call to a website, it probably looks like:
www.example.com/siteA/index.html
This request just tells the web server you want to see a resource called index.html in siteA.
The website that runs on this server has to determine what you want to see and how the data is loaded.
index.html could be a file in the siteA directory,
or
it could be a row with the key "index.html" in the siteA table in your database.
So the part siteA/index.html is just a resource identifier. The grammar of this resource identifier is completely free and is determined per website.
URL rewriting is also common, to make URLs easier to read and remember.
For example, there could be a rewrite rule to accomplish the following:
if the user enters something like
www.example.com/download/demo.zip
rewrite it so your website sees it as:
www.example.com/download.php?file=demo.zip
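With Apache's mod_rewrite, for instance, a sketch of that rule in an .htaccess file could be:
RewriteEngine On
# the browser asks for /download/demo.zip; the site sees /download.php?file=demo.zip
RewriteRule ^download/(.+)$ download.php?file=$1 [L,QSA]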
Wikipedia's servers map the URL to the page you want. .html is just a naming convention that today is mostly historical, from the period of static pages when URLs actually were names of files on the server. In fact, there may be no file at all: the server queries the database and a web framework sends out the HTML on the fly.
Wikipedia is most likely using the Apache module mod_rewrite in order to not have to link paths directly to a file system path.
See: http://en.wikipedia.org/wiki/Rewrite_engine#Web_frameworks
However, web frameworks can also take control of incoming URLs and return data depending on the structure of the link, according to some set of rules; for example, the Django web framework employs a URL dispatcher.
That's because Wikipedia uses MediaWiki's short URL feature.
Actually, when you search for something, it really loads a PHP file. Try searching for a word that doesn't exist, for example "Pazaz". The URL is http://en.wikipedia.org/w/index.php?title=Special%3ASearch&search=pazaz . Notice index.php in the URL.
To tell the truth, it's not so much a MediaWiki feature as an Apache one. For further info: http://www.mediawiki.org/wiki/Manual:Short_URL .
URL routing is your answer. For example, in ASP.NET (the following is quoted from the ASP.NET MVC documentation):
The ASP.NET MVC framework includes a flexible URL routing system that enables you to define URL mapping rules within your applications. The routing system has two main purposes:
Map incoming URLs to the application and route them so that the right Controller and Action method executes to process them
Construct outgoing URLs that can be used to call back to Controllers/Actions (for example: form posts, links, and AJAX calls)
I would suggest that sites like this use some sort of Model-View-Controller framework, similar to Ruby on Rails, where the URL 'directories' form part of a request/URL route...
In frameworks that are MVC-based, the URL 'directories' can dictate which view/controller to use, as well as what action should be taken with the data.
e.g.: shop.com/product/carrots
Here product is a view/controller and carrots is the data. The framework then analyses which action/route to take. The default could be viewing the product information and price of the carrots.

web-development: how do you usually handle the "under construction" page?

I was wondering what's the best way to switch a website to a temporary "under construction" page and then switch it back to the new version.
For example, on one website my customer decided to switch from Joomla to Drupal, so I had to create a subfolder for the new CMS and then move all the content to the root folder.
1) Moving all the content back to the root folder always creates problems with file permissions, links, etc.
2) Creating a rewrite rule in .htaccess or forwarding with PHP is not a solution, because a different URL, including the subfolder, is shown.
3) Many hosting services do not allow changing the root directory, so this is not an option, since I don't have access to the Apache config file.
Thanks
Update: could I maybe forward only the domain (i.e. www.example.com) and leave the IP (i.e. 123.24.214.22) pointing at the root folder, so that access ends up different for me and for other people? Can I do this in the .htaccess file?
One thing to consider is that you don't want search engines to cache your under-construction page - and you also don't want them to drop your homepage from the search index (hence just adding a "noindex" meta tag isn't the perfect solution).
A good way to deal with this is do a 302 redirect (temporarily moved) from your homepage to your under construction page - that way the search engine does not cache your homepage as an under construction page, does not index your under construction page (assuming it has a NOINDEX meta tag), and does not drop your homepage from the search index either.
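With a PHP homepage, that could be as simple as the following sketch (file names are placeholders; under-construction.html would carry the NOINDEX meta tag):
<?php
// index.php - while the site is being rebuilt: 302 = "temporarily moved",
// so search engines neither cache this state nor drop the homepage
header('Location: /under-construction.html', true, 302);
exit;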
One way would be to use an include on your template page.
When you want the construction page to show, you set a redirect in the include to take all traffic to the construction page.
When you are done, you remove the redirect.
What about hijacking your index.php file?
Something simple, along the lines of
<?php
if (SITE_OFFLINE) {
    include 'under_construction.html';
} else {
    // normal content of your index page
}
?>
where you would naturally define SITE_OFFLINE in an appropriate place for your needs.
What I did when I used PHP for websites was to configure Apache to direct all requests to a front controller. You then have full access to all requests, no matter where they point. Then in your front controller (a PHP file, static HTML file, etc.), you do whatever you need to do.
I believe you need to configure path_info in Apache and some other settings; it has been about 3 years since I used that approach. But this approach is also good for developing your own CMS or application, because it gives you full control over security.
You have to do something similar to this:
http://www.phpwact.org/pattern/front_controller
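As a very rough sketch (the routes and pages here are made up), a front controller boils down to one entry script that inspects the requested path:
<?php
// index.php - hypothetical front controller: the web server routes every
// request here, and this script decides what to serve
$path = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

switch ($path) {
    case '/':
        echo 'Home page';
        break;
    case '/about':
        echo 'About page';
        break;
    default:
        http_response_code(404);
        echo 'Not found';
}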
I am looking for more details, I know my configuration had more to it than that.
This is part of what I'm looking for too:
http://httpd.apache.org/docs/2.0/mod/core.html
Enabling path_info passes path information to the script, so all requests now go through a single point of entry. Let me find my configuration, I know vaguely how this works, but I'm sure it looks like a lot of hand waving.
Also, keep in mind that because all requests are going through this single PHP file, you are responsible for serving images, JavaScript, CSS, etc. So, if a request comes in for /css/default.css, it will go through your PHP script (index.php, most likely), and you'll need to determine how to handle it. Serving static files is trivial, but it is a little more work.
If you don't want to go that route, you could possibly do something with mod_rewrite so that it only matches .html and .htm pages, or however you have your site configured. For me, I don't use extensions, so that made my regex a little more difficult. I also wanted to secure access to all files. path_info was the solution for me, but if you don't need that granularity, then writing a front controller might be a bit too much work.
Walter