Crawling data or using API - html

How do these sites gather all their data - questionhub, bigresource, thedevsea, developerbay?
Is it legal to show data in a frame the way bigresource does?

#amazed
EDITED : fixed some spelling issues 20110310
How do these sites gather all the data - questionhub, bigresource ...
Here's a very general sketch of what is probably happening in the background at a website like questionhub.com:
Spider program (google "spider program" to learn more)
a. configured to start reading web pages at stackoverflow.com (for example)
b. run the program so it goes to the home page of stackoverflow.com and starts visiting all links that it finds on those pages
c. returns HTML data from all of those pages
Search Index Program
Reads the HTML data returned by the spider and creates a search index,
storing the words that it found AND the URL those words were found at
User Interface web page
Provides a feature-rich user interface so you can search the sites that have been spidered.
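A very rough sketch of such a spider in Python (using the requests and beautifulsoup4 libraries; stackoverflow.com is just the example from above, and a real spider would also honor robots.txt and be far more robust):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

def crawl(start_url, max_pages=50):
    """Very small breadth-first spider: fetch pages, collect links, return HTML per URL."""
    seen, queue, pages = set(), [start_url], {}
    domain = urlparse(start_url).netloc
    while queue and len(pages) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = response.text                      # raw HTML for the index step
        soup = BeautifulSoup(response.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:         # stay on the same site
                queue.append(link)
    return pages

# The search index step would then tokenize pages[url] and store word -> URL mappings.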
Is it legal to show data in a frame the way bigresource does?
To be technical, "it all depends" ;-)
Normally, websites want to be visible in Google, so why not in other search engines too? Just as Google displays part of the text that was found when a site was spidered, questionhub.com (or others) has chosen to show more of the text found on the original page, possibly keeping the formatting that was in the original HTML OR changing the formatting to fit their standard visual styling.
A remote site can 'request' that spiders do NOT go through some or all of its web pages by adding rules to a well-known file called robots.txt. Spiders do not have to honor robots.txt, but a vigilant website will track the IP addresses of spiders that do not honor its robots.txt file and then block those IP addresses from looking at anything on the website. You can find plenty of information about robots.txt here on Stack Overflow OR by running a query on Google.
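For example, a robots.txt file at the root of a site might look like this (the /private/ path is just an illustration):
# Ask all crawlers to skip the /private/ section of the site
User-agent: *
Disallow: /private/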
There are several industries (besides Google) built around what you are asking. There are tags on Stack Overflow for search-engine and search; read some of those questions/answers. Lucene/Solr are open-source search engine components. There is a companion open-source spider, but the name eludes me right now. Good luck.
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer. This goes for your other posts here too ;-)

Related

How to make a link url go through another page when clicked HTML

I'm sorry I do not know how to word that title better. I have tried searching google but my terminology isn't helping my results.
Let me explain the context. When you're on a news website or blog and you're on their homepage, like www.homepage.co.uk/, and then you click an article, it will go somewhere like www.homepage.co.uk/2017/article/. How do they make the 2017 appear? Because if you remove the /article/ from the URL, it takes you to an archive of all the links in that year. I don't understand - is there a process to this?
When I click a link in my website it goes to: www.website.co.uk/link
I want to be able to have that 2017/link/ in the URL so visitors can find the archive for that year, just like on those websites.
How do I do this?
I am sorry if I am not explaining this very well.
I understand changing my filenames to "2017/article.html" might work, but I do not believe that is the correct way of doing it.
Thanks a lot for your time and suggestions!
You're asking about a couple of things. One is the taxonomy of the site. Taxonomy, if you don't know, is the "shape" of your site, or how it is organized. News sites, for instance, are usually organized by date and perhaps topic (Health and Leisure, Politics, Entertainment, etc.). The other aspect of your question concerns what you might call RESTful "hacking" of URLs. One of the tenets of REST is that URLs (URIs, to be accurate) are supposed to be hackable. A news site might have /2017/10/10 to display all articles for Oct 10. Maybe you remove the last "10", and get all the articles for October so far.
If you are not using a site platform that does this for you, you will have to maintain that taxonomy yourself and manually write all the links. Systems such as Drupal and Joomla, among others, will translate your taxonomy into automatically maintained links. In editing a page on one of these platforms, you typically only refer to the system's internal name of the page (which could be a shortened version of the article's title in the above example), and the underlying engine takes care of reconstructing the URL for you (in case the page moves, or its tags/taxonomy change).
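For completeness, here is a minimal sketch of that kind of URL rewriting, assuming an Apache server with mod_rewrite and hypothetical article.php / archive.php scripts; CMS platforms do the equivalent for you automatically:
RewriteEngine On
# /2017/my-article/ -> article.php?year=2017&slug=my-article (the pretty URL stays in the browser)
RewriteRule ^([0-9]{4})/([a-z0-9-]+)/?$ article.php?year=$1&slug=$2 [L,QSA]
# /2017/ alone -> a yearly archive page
RewriteRule ^([0-9]{4})/?$ archive.php?year=$1 [L,QSA]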
This is a big topic, and I encourage you to do some further reading:
http://searchcontentmanagement.techtarget.com/feature/Building-a-website-taxonomy-in-eight-steps
https://www.drupal.org/docs/7/organizing-content-with-taxonomies/organizing-content-with-taxonomies

What are dpuf (extension) files?

I have seen this extension in some urls and I would like to know what they are used for.
It seems odd, but I couldn't find any information about them. I think they are specific for some plug-in.
It seems to be connected to 'Share This'-buttons on the websites.
I found this page which gives a quite comprehensive explanation:
This tag is mainly used for tracking URL sharing on various social networks, so every time someone copies your blog content there, they get a URL ending with #sthash and a .dpuf or .dpbs extension.

How to edit the google description of your site? [closed]

I know that <meta name="Description" content="[description here]" /> can be used but I wonder how to make a description like the one in facebook.
Does this description use the <meta> tag as well? Or is there some other secret behind it?
Edit: I code my site by myself (no wordpress and stuff) :)
I believe this is how it happens.
Google primarily displays multi link listings when they feel a query
has a strong chance of being navigational in nature. I think they can
determine that something is navigational in nature based on linkage
data and click streams. If the domain is well aligned with the term
that could be another signal to consider.
If you have 10,000 legit links for a term that nobody else has more
than a few dozen external citations for then odds are pretty good that
your site is the official brand source for that term. I think overall
relevancy as primarily determined by link reputation is the driving
factor for whether or not they post mini site map links near your
domain.
This site ranks for many terms, but for most of them I don't get the
multi link map love. For the exceptionally navigational type terms
(like seobook or seo book) I get multi links.
The mini site maps are query specific. For Aaron Wall I do not get the
mini site map. Most people usually refer to the site by its domain
name instead of my name.
Google may also include subdomains in their mini sitemaps. In some
cases they will list those subdomains as part of the mini site map and
also list them in the regular search results as additional results.
Michael Nguyen put together a post comparing the mini site maps to
Alexa traffic patterns. I think that the mini site maps may roughly
resemble traffic patterns, but I think the mini links may also be
associated with internal link structure.
For instance, I have a sitewide link to my sales letter page which I
use the word testimonials as the anchor text. Google lists a link to
the sales letter page using the word testimonials.
When I got sued the page referencing the lawsuit got tons and tons of
links from many sources, which not only built up a ton of linkage
data, but also sent tons of traffic to that specific page. That page
was never listed on the Google mini site map, which would indicate
that if they place heavy emphasis on external traffic or external
linkage data either they try to smooth the data out over a significant
period of time and / or they have a heavy emphasis on internal
linkage.
My old site used to also list the monthly archives on the right side
of each page, and the February 2004 category used to be one of the
mini site map links in Google.
You should present the pages you want people to visit the most to
search bots the most often as well. If you can get a few extra links
to some of your most important internal pages and use smart channeling
of internal linkage data then you should be able to help control which
pages Google picks as being the most appropriate matches for your mini
site map.
Sometimes exceptionally popular sites will get mini site map
navigational links for broad queries. SEO Chat had them for the term
SEO, but after they ticked off some of their lead moderators they
stopped being as active and stopped getting referenced as much. The
navigational links may ebb and flow like that on broad generic
queries. For your official brand term it may make sense to try to get
them, but for broad generic untargeted terms in competitive markets
the amount of effort necessary to try to get them will likely exceed
the opportunity cost for most webmasters.
Source.
Hope this helps.
It depends on the website popularity.
Google does it, you don't.
Google may do it, but you can persuade them. Also check this out: sub-sitelinks in Google search results.
For starters, be sure you have a “sitemap.xml” file. This is a file
that tells the search engine about the pages on your site and makes
it easier for its spiders to crawl and understand it. Your
webmaster or website provider or Content Management System (like
WordPress) should have handled this for you, but it’s worth
checking. If you’re not a master of website technical stuff,
whoever is your technical support person will be able to tell you if
that page is there, and properly set up.
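For illustration, a minimal sitemap.xml following the standard sitemap protocol looks roughly like this (URLs and dates are placeholders):
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2017-01-01</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/concierge-services/</loc>
  </url>
</urlset>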
You should register your site with Google Webmaster Tools, if you
haven’t already. The exact process changes from time to time, but
basically, you’ll give Google the URL of your Sitemap file, which
you’ll have from the previous step. You’ll have to put a “Site
Verification Code” on your site to prove to them that you own the
site, and there are a few other simple steps.
Whenever you link one page to another in your site, use anchor text
and alt text that’s descriptive, and as succinct as possible, and
consistent. For example, you’ve linked to your “concierge services”
page from another page using the anchor text “concierge services.”
That’s perfect. Now, don’t link from another page using “guest
services.” You don’t want to be confusing the poor Google spider,
after all.

cloud based "knowledge base" approach with links, snippets and excerpts for google chrome

I'm working as a web developer and have lots (hundreds) of links with hacks, tutorials and code snippets that I don't want to memorize. I am currently using evernote to save the content of my links as snippets and have them searchable and always available (even if the source site is down).
I spend a lot of time on tagging, sorting, evaluating and saving stuff to evernote and I'm not quite happy with the outcome. I ended up with a multitude of tags and keep reordering and renaming tags while retagging saved articles.
My Requirements
web based
saving web content as snippets with rich styling (code sections, etc.)
interlinked entries possible
chrome plugin for access to content
chrome plugin for content generation
web app or desktop client for faster sorting / tagging / batch processing
good and flexible search mechanism
(bonus) google search integration (search results from KnowledgeBase within google search results)
I had a look at kippt but that doesn't seem to be a solution for me. If I don't find a better solution, I'm willing to stay with evernote as it meets nearly all my needs but I need a good plan to sort through my links/snippets once and get them in order.
Which solutions do you use and how do you manage your knowledge base?
I'm a big Evernote fan but a stern critic of all my tools. I've stuck with Evernote because I'm happy enough with its fundamental information structures. I am, however, currently working on some apps to provide visualisations and hopefully better ways to navigate complex sets of notes.
A few tips, based on years of using Evernote and wiki's for collaboration and software project management:
you can't get away from the need to curate things, regardless of your tool
don't over-think using tags, tags in combination with words are a great way to search (you do know you can say tag:blah in a search to combine that with word searches?)
build index pages for different purposes - I'm using a lot more of the internal note links to treat Evernote like a wiki
refactor into smaller notebooks if you use mobile clients a lot, allowing you to choose to have different collections of content with you at different times

How do I prevent site scraping? [closed]

I have a fairly large music website with a large artist database. I've been noticing other music sites scraping our site's data (I enter dummy Artist names here and there and then do google searches for them).
How can I prevent screen scraping? Is it even possible?
Note: Since the complete version of this answer exceeds Stack Overflow's length limit, you'll need to head to GitHub to read the extended version, with more tips and details.
In order to hinder scraping (also known as Webscraping, Screenscraping, Web data mining, Web harvesting, or Web data extraction), it helps to know how these scrapers work and, by extension, what prevents them from working well.
There are various types of scrapers, and each works differently:
Spiders, such as Google's bot or website copiers like HTTrack, which recursively follow links to other pages in order to get data. These are sometimes used for targeted scraping to get specific data, often in combination with an HTML parser to extract the desired data from each page.
Shell scripts: Sometimes, common Unix tools are used for scraping: Wget or Curl to download pages, and Grep (Regex) to extract the data.
HTML parsers, such as ones based on Jsoup, Scrapy, and others. Similar to shell-script regex based ones, these work by extracting data from pages based on patterns in HTML, usually ignoring everything else.
For example: If your website has a search feature, such a scraper might submit a request for a search, and then get all the result links and their titles from the results page HTML, in order to specifically get only search result links and their titles. These are the most common.
Screenscrapers, based on eg. Selenium or PhantomJS, which open your website in a real browser, run JavaScript, AJAX, and so on, and then get the desired text from the webpage, usually by:
Getting the HTML from the browser after your page has been loaded and JavaScript has run, and then using a HTML parser to extract the desired data. These are the most common, and so many of the methods for breaking HTML parsers / scrapers also work here.
Taking a screenshot of the rendered pages, and then using OCR to extract the desired text from the screenshot. These are rare, and only dedicated scrapers who really want your data will set this up.
Webscraping services such as ScrapingHub or Kimono. In fact, there are people whose job it is to figure out how to scrape your site and pull out the content for others to use.
Unsurprisingly, professional scraping services are the hardest to deter, but if you make it hard and time-consuming to figure out how to scrape your site, these (and people who pay them to do so) may not be bothered to scrape your website.
Embedding your website in other sites' pages with frames, and embedding your site in mobile apps.
While not technically scraping, mobile apps (Android and iOS) can embed websites, and inject custom CSS and JavaScript, thus completely changing the appearance of your pages.
Human copy-paste: People will copy and paste your content in order to use it elsewhere.
There is a lot of overlap between these different kinds of scrapers, and many scrapers will behave similarly, even if they use different technologies and methods.
These tips are mostly my own ideas, various difficulties that I've encountered while writing scrapers, as well as bits of information and ideas from around the interwebs.
How to stop scraping
You can't completely prevent it, since whatever you do, determined scrapers can still figure out how to scrape. However, you can stop a lot of scraping by doing a few things:
Monitor your logs & traffic patterns; limit access if you see unusual activity:
Check your logs regularly, and in case of unusual activity indicative of automated access (scrapers), such as many similar actions from the same IP address, you can block or limit access.
Specifically, some ideas:
Rate limiting:
Only allow users (and scrapers) to perform a limited number of actions in a certain time - for example, only allow a few searches per second from any specific IP address or user. This will slow down scrapers, and make them ineffective. You could also show a captcha if actions are completed too fast or faster than a real user would.
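A minimal sketch of per-IP rate limiting, shown here as a Flask hook (Flask, the window size, and the threshold are my assumptions, not requirements):
import time
from collections import defaultdict, deque
from flask import Flask, request, abort

app = Flask(__name__)
WINDOW_SECONDS = 10          # look at the last 10 seconds
MAX_REQUESTS = 20            # allow at most 20 requests per IP in that window
hits = defaultdict(deque)    # ip -> timestamps of recent requests

@app.before_request
def rate_limit():
    now = time.time()
    q = hits[request.remote_addr]
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()                      # drop timestamps outside the window
    q.append(now)
    if len(q) > MAX_REQUESTS:
        abort(429)                       # Too Many Requests; a captcha page would also work here
In a real deployment you'd keep these counters in shared storage such as Redis rather than in process memory.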
Detect unusual activity:
If you see unusual activity, such as many similar requests from a specific IP address, someone looking at an excessive number of pages or performing an unusual number of searches, you can prevent access, or show a captcha for subsequent requests.
Don't just monitor & rate limit by IP address - use other indicators too:
If you do block or rate limit, don't just do it on a per-IP address basis; you can use other indicators and methods to identify specific users or scrapers. Some indicators which can help you identify specific users / scrapers include:
How fast users fill out forms, and where on a button they click;
You can gather a lot of information with JavaScript, such as screen size / resolution, timezone, installed fonts, etc; you can use this to identify users.
HTTP headers and their order, especially User-Agent.
As an example, if you get many requests from a single IP address, all using the same User Agent and screen size (determined with JavaScript), and the user (scraper in this case) always clicks on the button in the same way and at regular intervals, it's probably a screen scraper; and you can temporarily block similar requests (eg. block all requests with that user agent and screen size coming from that particular IP address), and this way you won't inconvenience real users on that IP address, eg. in case of a shared internet connection.
You can also take this further, as you can identify similar requests, even if they come from different IP addresses, indicative of distributed scraping (a scraper using a botnet or a network of proxies). If you get a lot of otherwise identical requests, but they come from different IP addresses, you can block. Again, be aware of not inadvertently blocking real users.
This can be effective against screenscrapers which run JavaScript, as you can get a lot of information from them.
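A sketch of combining several such indicators into a single fingerprint, so the same client can be recognized across IP addresses (the screen-size value is assumed to come from your own JavaScript; the hashing scheme is only illustrative):
import hashlib

def client_fingerprint(user_agent, accept_language, screen_size):
    """Combine several weak signals (not the IP) into one key, so the same
    scraper can be recognized even when it rotates IP addresses."""
    raw = "|".join([user_agent or "", accept_language or "", screen_size or ""])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Usage sketch: count requests per fingerprint and per (fingerprint, ip) pair,
# and block or show a captcha when either counter looks like automation.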
Related questions on Security Stack Exchange:
How to uniquely identify users with the same external IP address? for more details, and
Why do people use IP address bans when IP addresses often change? for info on the limits of these methods.
Instead of temporarily blocking access, use a Captcha:
The simple way to implement rate-limiting would be to temporarily block access for a certain amount of time, however using a Captcha may be better, see the section on Captchas further down.
Require registration & login
Require account creation in order to view your content, if this is feasible for your site. This is a good deterrent for scrapers, but is also a good deterrent for real users.
If you require account creation and login, you can accurately track user and scraper actions. This way, you can easily detect when a specific account is being used for scraping, and ban it. Things like rate limiting or detecting abuse (such as a huge number of searches in a short time) become easier, as you can identify specific scrapers instead of just IP addresses.
In order to avoid scripts creating many accounts, you should:
Require an email address for registration, and verify that email address by sending a link that must be opened in order to activate the account. Allow only one account per email address.
Require a captcha to be solved during registration / account creation.
Requiring account creation to view content will drive users and search engines away; if you require account creation in order to view an article, users will go elsewhere.
Block access from cloud hosting and scraping service IP addresses
Sometimes, scrapers will be run from web hosting services, such as Amazon Web Services or GAE, or VPSes. Limit access to your website (or show a captcha) for requests originating from the IP addresses used by such cloud hosting services.
Similarly, you can also limit access from IP addresses used by proxy or VPN providers, as scrapers may use such proxy servers to avoid many requests being detected.
Beware that by blocking access from proxy servers and VPNs, you will negatively affect real users.
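A sketch of such a check using Python's standard ipaddress module (the ranges below are documentation placeholders; the real ones are published by AWS, Google Cloud, and others and change over time):
import ipaddress

# Placeholder ranges; real lists are published by the cloud providers and change over time.
CLOUD_RANGES = [ipaddress.ip_network(net) for net in ("203.0.113.0/24", "198.51.100.0/24")]

def is_cloud_ip(ip_string):
    ip = ipaddress.ip_address(ip_string)
    return any(ip in net for net in CLOUD_RANGES)

# e.g. if is_cloud_ip(request_ip): show a captcha instead of the normal page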
Make your error message nondescript if you do block
If you do block / limit access, you should ensure that you don't tell the scraper what caused the block, thereby giving them clues as to how to fix their scraper. So a bad idea would be to show error pages with text like:
Too many requests from your IP address, please try again later.
Error, User Agent header not present !
Instead, show a friendly error message that doesn't tell the scraper what caused it. Something like this is much better:
Sorry, something went wrong. You can contact support via helpdesk@example.com, should the problem persist.
This is also a lot more user friendly for real users, should they ever see such an error page. You should also consider showing a captcha for subsequent requests instead of a hard block, in case a real user sees the error message, so that you don't block and thus cause legitimate users to contact you.
Use Captchas if you suspect that your website is being accessed by a scraper.
Captchas ("Completely Automated Test to Tell Computers and Humans apart") are very effective against stopping scrapers. Unfortunately, they are also very effective at irritating users.
As such, they are useful when you suspect a possible scraper, and want to stop the scraping, without also blocking access in case it isn't a scraper but a real user. You might want to consider showing a captcha before allowing access to the content if you suspect a scraper.
Things to be aware of when using Captchas:
Don't roll your own; use something like Google's reCaptcha: it's a lot easier than implementing a captcha yourself, it's more user-friendly than some blurry and warped text solution you might come up with yourself (users often only need to tick a box), and it's also a lot harder for a scripter to solve than a simple image served from your site.
Don't include the solution to the captcha in the HTML markup: I've actually seen one website which had the solution for the captcha in the page itself, (although quite well hidden) thus making it pretty useless. Don't do something like this. Again, use a service like reCaptcha, and you won't have this kind of problem (if you use it properly).
Captchas can be solved in bulk: There are captcha-solving services where actual, low-paid, humans solve captchas in bulk. Again, using reCaptcha is a good idea here, as they have protections (such as the relatively short time the user has in order to solve the captcha). This kind of service is unlikely to be used unless your data is really valuable.
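If you do use reCaptcha, server-side verification looks roughly like this (a sketch using the requests library; the secret key is an assumption, and the siteverify endpoint and g-recaptcha-response form field are described in Google's reCAPTCHA documentation):
import requests

RECAPTCHA_SECRET = "your-secret-key"   # assumption: obtained from the reCAPTCHA admin console

def captcha_passed(recaptcha_response, client_ip=None):
    """Ask Google whether the captcha token submitted with the form is valid."""
    payload = {"secret": RECAPTCHA_SECRET, "response": recaptcha_response}
    if client_ip:
        payload["remoteip"] = client_ip
    result = requests.post("https://www.google.com/recaptcha/api/siteverify", data=payload, timeout=10)
    return result.json().get("success", False)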
Serve your text content as an image
You can render text into an image server-side, and serve that to be displayed, which will hinder simple scrapers extracting text.
However, this is bad for screen readers, search engines, performance, and pretty much everything else. It's also illegal in some places (due to accessibility, eg. the Americans with Disabilities Act), and it's also easy to circumvent with some OCR, so don't do it.
You can do something similar with CSS sprites, but that suffers from the same problems.
Don't expose your complete dataset:
If feasible, don't provide a way for a script / bot to get all of your dataset. As an example: You have a news site, with lots of individual articles. You could make those articles be only accessible by searching for them via the on site search, and, if you don't have a list of all the articles on the site and their URLs anywhere, those articles will be only accessible by using the search feature. This means that a script wanting to get all the articles off your site will have to do searches for all possible phrases which may appear in your articles in order to find them all, which will be time-consuming, horribly inefficient, and will hopefully make the scraper give up.
This will be ineffective if:
The bot / script does not want / need the full dataset anyway.
Your articles are served from a URL which looks something like example.com/article.php?articleId=12345. This (and similar things) will allow scrapers to simply iterate over all the articleIds and request all the articles that way.
There are other ways to eventually find all the articles, such as by writing a script to follow links within articles which lead to other articles.
Searching for something like "and" or "the" can reveal almost everything, so that is something to be aware of. (You can avoid this by only returning the top 10 or 20 results).
You need search engines to find your content.
Don't expose your APIs, endpoints, and similar things:
Make sure you don't expose any APIs, even unintentionally. For example, if you are using AJAX or network requests from within Adobe Flash or Java Applets (God forbid!) to load your data, it is trivial to look at the network requests from the page, figure out where those requests are going, and then reverse engineer and use those endpoints in a scraper program. Make sure you obfuscate your endpoints and make them hard for others to use, as described.
To deter HTML parsers and scrapers:
Since HTML parsers work by extracting content from pages based on identifiable patterns in the HTML, we can intentionally change those patterns in order to break these scrapers, or even screw with them. Most of these tips also apply to other scrapers like spiders and screenscrapers too.
Frequently change your HTML
Scrapers which process HTML directly do so by extracting contents from specific, identifiable parts of your HTML page. For example: If all pages on your website have a div with an id of article-content, which contains the text of the article, then it is trivial to write a script to visit all the article pages on your site, and extract the content text of the article-content div on each article page, and voilà, the scraper has all the articles from your site in a format that can be reused elsewhere.
If you change the HTML and the structure of your pages frequently, such scrapers will no longer work.
You can frequently change the ids and classes of elements in your HTML, perhaps even automatically. So, if your div.article-content becomes something like div.a4c36dda13eaf0, and changes every week, the scraper will work fine initially, but will break after a week. Make sure to change the length of your ids / classes too, otherwise the scraper will use div.[any-14-characters] to find the desired div instead. Beware of other similar holes too.
If there is no way to find the desired content from the markup, the scraper will do so from the way the HTML is structured. So, if all your article pages are similar in that every div inside a div which comes after a h1 is the article content, scrapers will get the article content based on that. Again, to break this, you can add / remove extra markup to your HTML, periodically and randomly, eg. adding extra divs or spans. With modern server side HTML processing, this should not be too hard.
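A sketch of how the id / class rotation could be automated so markup changes weekly without manual work (the naming scheme is my own illustration; the same helper has to be used for both the templates and the generated CSS):
import datetime
import hashlib

def rotated_class(base_name):
    """Derive a class name that changes every ISO week, e.g. 'article-content' -> 'c-3fa9b2c41d'."""
    year, week, _ = datetime.date.today().isocalendar()
    digest = hashlib.md5(f"{base_name}-{year}-{week}".encode()).hexdigest()
    length = 8 + int(digest[0], 16) % 6        # vary the length too (8-13 chars), per the advice above
    return "c-" + digest[:length]

# In the template: <div class="{{ rotated_class('article-content') }}"> ... </div>
# and the CSS rule for that class is generated with the same helper.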
Things to be aware of:
It will be tedious and difficult to implement, maintain, and debug.
You will hinder caching. Especially if you change ids or classes of your HTML elements, this will require corresponding changes in your CSS and JavaScript files, which means that every time you change them, they will have to be re-downloaded by the browser. This will result in longer page load times for repeat visitors, and increased server load. If you only change it once a week, it will not be a big problem.
Clever scrapers will still be able to get your content by inferring where the actual content is, eg. by knowing that a large single block of text on the page is likely to be the actual article. This makes it possible to still find & extract the desired data from the page. Boilerpipe does exactly this.
Essentially, make sure that it is not easy for a script to find the actual, desired content for every similar page.
See also How to prevent crawlers depending on XPath from getting page contents for details on how this can be implemented in PHP.
Change your HTML based on the user's location
This is sort of similar to the previous tip. If you serve different HTML based on your user's location / country (determined by IP address), this may break scrapers which are delivered to users. For example, if someone is writing a mobile app which scrapes data from your site, it will work fine initially, but break when it's actually distributed to users, as those users may be in a different country, and thus get different HTML, which the embedded scraper was not designed to consume.
Frequently change your HTML, actively screw with the scrapers by doing so!
An example: You have a search feature on your website, located at example.com/search?query=somesearchquery, which returns the following HTML:
<div class="search-result">
<h3 class="search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"search-result-link" href="/stories/story-link">Read more</a>
</div>
(And so on, lots more identically structured divs with search results)
As you may have guessed this is easy to scrape: all a scraper needs to do is hit the search URL with a query, and extract the desired data from the returned HTML. In addition to periodically changing the HTML as described above, you could also leave the old markup with the old ids and classes in, hide it with CSS, and fill it with fake data, thereby poisoning the scraper. Here's how the search results page could be changed:
<div class="the-real-search-result">
<h3 class="the-real-search-result-title">Stack Overflow has become the world's most popular programming Q & A website</h3>
<p class="the-real-search-result-excerpt">The website Stack Overflow has now become the most popular programming Q & A website, with 10 million questions and many users, which...</p>
<a class"the-real-search-result-link" href="/stories/story-link">Read more</a>
</div>
<div class="search-result" style="display:none">
<h3 class="search-result-title">Visit Example.com now, for all the latest Stack Overflow related news !</h3>
<p class="search-result-excerpt">Example.com is so awesome, visit now !</p>
<a class"search-result-link" href="http://example.com/">Visit Now !</a>
</div>
(More real search results follow)
This will mean that scrapers written to extract data from the HTML based on classes or IDs will continue to seemingly work, but they will get fake data or even ads, data which real users will never see, as they're hidden with CSS.
Screw with the scraper: Insert fake, invisible honeypot data into your page
Adding on to the previous example, you can add invisible honeypot items to your HTML to catch scrapers. An example which could be added to the previously described search results page:
<div class="search-result" style="display:none">
<h3 class="search-result-title">This search result is here to prevent scraping</h3>
<p class="search-result-excerpt">If you're a human and see this, please ignore it. If you're a scraper, please click the link below :-)
Note that clicking the link below will block access to this site for 24 hours.</p>
<a class"search-result-link" href="/scrapertrap/scrapertrap.php">I'm a scraper !</a>
</div>
(The actual, real, search results follow.)
A scraper written to get all the search results will pick this up, just like any of the other, real search results on the page, and visit the link, looking for the desired content. A real human will never even see it in the first place (due to it being hidden with CSS), and won't visit the link. A genuine and desirable spider such as Google's will not visit the link either because you disallowed /scrapertrap/ in your robots.txt.
You can make your scrapertrap.php do something like block access for the IP address that visited it or force a captcha for all subsequent requests from that IP.
Don't forget to disallow your honeypot (/scrapertrap/) in your robots.txt file so that search engine bots don't fall into it.
You can / should combine this with the previous tip of changing your HTML frequently.
Change this frequently too, as scrapers will eventually learn to avoid it. Change the honeypot URL and text. You may also want to consider changing the inline CSS used for hiding, and use an ID attribute and external CSS instead, as scrapers will learn to avoid anything which has a style attribute with CSS used to hide the content. Also try only enabling it sometimes, so the scraper works initially, but breaks after a while. This also applies to the previous tip.
Malicious people can prevent access for real users by sharing a link to your honeypot, or even embedding that link somewhere as an image (eg. on a forum). Change the URL frequently, and make any ban times relatively short.
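A sketch of what the honeypot endpoint itself could do, standing in for the scrapertrap.php mentioned above (shown as a small Flask app; the in-memory ban store and the 24-hour duration are assumptions):
import time
from flask import Flask, request, abort

app = Flask(__name__)
banned = {}                      # ip -> time the ban was issued
BAN_SECONDS = 24 * 60 * 60       # keep bans short-ish; see the note about shared links above

@app.before_request
def enforce_ban():
    issued = banned.get(request.remote_addr)
    if issued and time.time() - issued < BAN_SECONDS:
        abort(403)

@app.route("/scrapertrap/")
def scrapertrap():
    # Only scrapers that ignore robots.txt and follow CSS-hidden links end up here.
    banned[request.remote_addr] = time.time()
    return "", 403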
Serve fake and useless data if you detect a scraper
If you detect what is obviously a scraper, you can serve up fake and useless data; this will corrupt the data the scraper gets from your website. You should also make it impossible to distinguish such fake data from real data, so that scrapers don't know that they're being screwed with.
As an example: you have a news website; if you detect a scraper, instead of blocking access, serve up fake, randomly generated articles, and this will poison the data the scraper gets. If you make your fake data indistinguishable from the real thing, you'll make it hard for scrapers to get what they want, namely the actual, real data.
Don't accept requests if the User Agent is empty / missing
Often, lazily written scrapers will not send a User Agent header with their request, whereas all browsers as well as search engine spiders will.
If you get a request where the User Agent header is not present, you can show a captcha, or simply block or limit access. (Or serve fake data as described above, or something else..)
It's trivial to spoof, but as a measure against poorly written scrapers it is worth implementing.
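A sketch of that check as a framework hook (Flask here; most web servers can also do this purely in their configuration):
from flask import Flask, request, abort

app = Flask(__name__)

@app.before_request
def require_user_agent():
    ua = request.headers.get("User-Agent", "").strip()
    if not ua:
        # No User-Agent at all: almost certainly not a real browser or search engine spider.
        abort(403)   # or redirect to a captcha page, or serve fake data instead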
Don't accept requests if the User Agent is a common scraper one; blacklist ones used by scrapers
In some cases, scrapers will use a User Agent which no real browser or search engine spider uses, such as:
"Mozilla" (Just that, nothing else. I've seen a few questions about scraping here, using that. A real browser will never use only that)
"Java 1.7.43_u43" (By default, Java's HttpUrlConnection uses something like this.)
"BIZCO EasyScraping Studio 2.0"
"wget", "curl", "libcurl",.. (Wget and cURL are sometimes used for basic scraping)
If you find that a specific User Agent string is used by scrapers on your site, and it is not used by real browsers or legitimate spiders, you can also add it to your blacklist.
If it doesn't request assets (CSS, images), it's not a real browser.
A real browser will (almost always) request and download assets such as images and CSS. HTML parsers and scrapers won't as they are only interested in the actual pages and their content.
You could log requests to your assets, and if you see lots of requests for only the HTML, it may be a scraper.
Beware that search engine bots, ancient mobile devices, screen readers and misconfigured devices may not request assets either.
Use and require cookies; use them to track user and scraper actions.
You can require cookies to be enabled in order to view your website. This will deter inexperienced and newbie scraper writers; however, it is easy for a scraper to send cookies. If you do use and require them, you can track user and scraper actions with them, and thus implement rate-limiting, blocking, or showing captchas on a per-user instead of a per-IP basis.
For example: when the user performs a search, set a unique identifying cookie. When the results pages are viewed, verify that cookie. If the user opens all the search results (you can tell from the cookie), then it's probably a scraper.
Using cookies may be ineffective, as scrapers can send the cookies with their requests too, and discard them as needed. You will also prevent access for real users who have cookies disabled, if your site only works with cookies.
Note that if you use JavaScript to set and retrieve the cookie, you'll block scrapers which don't run JavaScript, since they can't retrieve and send the cookie with their request.
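A sketch of the search-cookie idea using an HMAC-signed token from the standard library (the secret, cookie name, and lifetime are assumptions):
import hmac, hashlib, time

SECRET = b"change-me"            # assumption: a server-side secret key

def make_search_token(ip):
    ts = str(int(time.time()))
    sig = hmac.new(SECRET, f"{ip}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"         # set this as a cookie when the search form page is served

def token_valid(token, ip, max_age=3600):
    try:
        ts, sig = token.split(":")
        age = time.time() - int(ts)
    except ValueError:
        return False
    expected = hmac.new(SECRET, f"{ip}:{ts}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected) and age < max_age

# When a results page is requested without a valid token, show a captcha or rate-limit harder.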
Use JavaScript + Ajax to load your content
You could use JavaScript + AJAX to load your content after the page itself loads. This will make the content inaccessible to HTML parsers which do not run JavaScript. This is often an effective deterrent to newbie and inexperienced programmers writing scrapers.
Be aware of:
Using JavaScript to load the actual content will degrade user experience and performance
Search engines may not run JavaScript either, thus preventing them from indexing your content. This may not be a problem for search results pages, but may be for other things, such as article pages.
Obfuscate your markup, network requests from scripts, and everything else.
If you use Ajax and JavaScript to load your data, obfuscate the data which is transferred. As an example, you could encode your data on the server (with something as simple as base64 or something more complex), and then decode and display it on the client after fetching it via Ajax. This will mean that someone inspecting network traffic will not immediately see how your page works and loads data, and it will be tougher for someone to directly request data from your endpoints, as they will have to reverse-engineer your descrambling algorithm.
If you do use Ajax for loading the data, you should make it hard to use the endpoints without loading the page first, eg by requiring some session key as a parameter, which you can embed in your JavaScript or your HTML.
You can also embed your obfuscated data directly in the initial HTML page and use JavaScript to deobfuscate and display it, which would avoid the extra network requests. Doing this will make it significantly harder to extract the data using an HTML-only parser which does not run JavaScript, as the one writing the scraper will have to reverse engineer your JavaScript (which you should obfuscate too).
You might want to change your obfuscation methods regularly, to break scrapers who have figured it out.
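A sketch of the simplest server-side half of this, base64-encoding the JSON payload before it is sent (the client would decode it with atob() before rendering; base64 is trivially reversible, the point is only to force a scraper to notice and reverse the descrambling step):
import base64
import json

def obfuscated_payload(records):
    """Return the Ajax response as a base64-encoded JSON string instead of plain JSON."""
    plain = json.dumps(records)
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")

# Client side (conceptually): JSON.parse(atob(responseText))
# Swap base64 for something home-grown (and rotate it) if you want scrapers to have to
# reverse-engineer the descrambling step described above.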
There are several disadvantages to doing something like this, though:
It will be tedious and difficult to implement, maintain, and debug.
It will be ineffective against scrapers and screenscrapers which actually run JavaScript and then extract the data. (Most simple HTML parsers don't run JavaScript though)
It will make your site nonfunctional for real users if they have JavaScript disabled.
Performance and page-load times will suffer.
Non-Technical:
Tell people not to scrape, and some will respect it
Find a lawyer
Make your data available, provide an API:
You could make your data easily available and require attribution and a link back to your site. Perhaps charge $$$ for it.
Miscellaneous:
There are also commercial scraping protection services, such as the anti-scraping offerings by Cloudflare or Distil Networks (details on how it works here), which do these things, and more, for you.
Find a balance between usability for real users and scraper-proofness: Everything you do will impact user experience negatively in one way or another, find compromises.
Don't forget your mobile site and apps. If you have a mobile app, that can be screenscraped too, and network traffic can be inspected to determine the REST endpoints it uses.
Scrapers can scrape other scrapers: If there's one website which has content scraped from yours, other scrapers can scrape from that scraper's website.
Further reading:
Wikipedia's article on Web scraping. Many details on the technologies involved and the different types of web scraper.
Stopping scripters from slamming your website hundreds of times a second. Q & A on a very similar problem - bots checking a website and buying things as soon as they go on sale. A lot of relevant info, esp. on Captchas and rate-limiting.
I will presume that you have set up robots.txt.
As others have mentioned, scrapers can fake nearly every aspect of their activities, and it is probably very difficult to identify the requests that are coming from the bad guys.
I would consider:
Set up a page, /jail.html.
Disallow access to the page in robots.txt (so the respectful spiders will never visit).
Place a link on one of your pages, hiding it with CSS (display: none).
Record IP addresses of visitors to /jail.html.
This might help you to quickly identify requests from scrapers that are flagrantly disregarding your robots.txt.
You might also want to make your /jail.html an entire website with the same exact markup as normal pages, but with fake data (/jail/album/63ajdka, /jail/track/3aads8, etc.). This way, the bad scrapers won't be alerted to "unusual input" until you have the chance to block them entirely.
Sue 'em.
Seriously: If you have some money, talk to a good, nice, young lawyer who knows their way around the Internets. You might really be able to do something here. Depending on where the sites are based, you could have a lawyer write up a cease & desist or its equivalent in your country. You may be able to at least scare the bastards.
Document the insertion of your dummy values. Insert dummy values that clearly (but obscurely) point to you. I think this is common practice with phone book companies, and here in Germany, I think there have been several instances when copycats got busted through fake entries they copied 1:1.
It would be a shame if this would drive you into messing up your HTML code, dragging down SEO, validity and other things (even though a templating system that uses a slightly different HTML structure on each request for identical pages might already help a lot against scrapers that always rely on HTML structures and class/ID names to get the content out.)
Cases like this are what copyright laws are good for. Ripping off other people's honest work to make money with is something that you should be able to fight against.
Provide an XML API to access your data; in a manner that is simple to use. If people want your data, they'll get it, you might as well go all out.
This way you can provide a subset of functionality in an effective manner, ensuring that, at the very least, the scrapers won't guzzle up HTTP requests and massive amounts of bandwidth.
Then all you have to do is convince the people who want your data to use the API. ;)
There is really nothing you can do to completely prevent this. Scrapers can fake their user agent, use multiple IP addresses, etc., and appear as normal users. The only thing you can do is make the text not available at the time the page is loaded - make it an image, Flash, or load it with JavaScript. However, the first two are bad ideas, and the last one would be an accessibility issue if JavaScript is not enabled for some of your regular users.
If they are absolutely slamming your site and rifling through all of your pages, you could do some kind of rate limiting.
There is some hope though. Scrapers rely on your site's data being in a consistent format. If you could randomize it somehow it could break their scraper. Things like changing the ID or class names of page elements on each load, etc. But that is a lot of work to do and I'm not sure if it's worth it. And even then, they could probably get around it with enough dedication.
Sorry, it's really quite hard to do this...
I would suggest that you politely ask them to not use your content (if your content is copyrighted).
If it is and they don't take it down, then you can take further action and send them a cease and desist letter.
Generally, whatever you do to prevent scraping will probably end up with a more negative effect, e.g. accessibility, bots/spiders, etc.
Okay, as all the posts say, if you want to make it search engine-friendly, then bots can scrape for sure.
But you can still do a few things, and it may be effective against 60-70% of scraping bots.
Make a checker script like the one sketched below.
If a particular IP address is visiting very fast, then after a few visits (5-10) put its IP address + browser information in a file or database.
The next step
(This would be a background process, running all the time or scheduled every few minutes.) Make another script that keeps checking those suspicious IP addresses.
Case 1. If the user agent is that of a known search engine like Google, Bing, or Yahoo (you can find more information on user agents by googling it), then check http://www.iplists.com/ and try to match the IP against that list's patterns. If it seems like a faked user agent, ask the visitor to fill in a CAPTCHA on the next visit. (You need to research bot IP addresses a bit more. I know this is achievable; also try a whois of the IP address. It can be helpful.)
Case 2. No user agent of a search bot: simply ask the visitor to fill in a CAPTCHA on the next visit.
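A rough sketch of those two pieces in Python (the thresholds, the reverse-DNS suffixes, and flag_for_captcha are assumptions for illustration; real bot verification should follow each search engine's published guidance):
import socket
import time
from collections import defaultdict

SEARCH_BOT_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com", ".crawl.yahoo.net")
visits = defaultdict(list)          # ip -> recent visit timestamps
suspects = {}                       # ip -> user agent, filled by the front-end piece

def record_visit(ip, user_agent, window=10, threshold=5):
    """Front-end piece: after a few fast visits, remember the IP + browser info."""
    now = time.time()
    visits[ip] = [t for t in visits[ip] if now - t < window] + [now]
    if len(visits[ip]) >= threshold:
        suspects[ip] = user_agent

def check_suspects():
    """Background piece: verify claimed search-engine bots via reverse DNS; others get a captcha."""
    for ip, user_agent in suspects.items():
        claims_search_bot = any(name in user_agent.lower() for name in ("googlebot", "bingbot", "slurp"))
        if claims_search_bot:
            try:
                host = socket.gethostbyaddr(ip)[0]
            except OSError:
                host = ""
            if not host.endswith(SEARCH_BOT_SUFFIXES):
                flag_for_captcha(ip)     # faked user agent
        else:
            flag_for_captcha(ip)

def flag_for_captcha(ip):
    print(f"show captcha to {ip} on next visit")   # placeholder for your own logic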
Late answer - and also this answer probably isn't the one you want to hear...
I myself have already written many (many tens of) different specialized data-mining scrapers (just because I like the "open data" philosophy).
There is already a lot of advice in the other answers - now I will play the devil's advocate role and extend and/or correct their effectiveness.
First:
if someone really wants your data
you can't effectively (technically) hide your data
if the data is supposed to be publicly accessible to your "regular users"
Trying to use technical barriers isn't worth the trouble they cause:
to your regular users, by worsening their user experience
to regular and welcome bots (search engines)
etc...
Plain HTML - the easiest way is to parse plain HTML pages with a well-defined structure and CSS classes. E.g. it is enough to inspect the element with Firebug and use the right XPaths and/or CSS paths in my scraper.
You could generate the HTML structure dynamically, and you could also generate the CSS class names dynamically (and the CSS itself too), e.g. by using some random class names - but
you want to present the information to your regular users in a consistent way
e.g. again - it is enough to analyze the page structure once more to set up the scraper
and it can be done automatically by analyzing some "already known content"
once someone already knows (from an earlier scrape), e.g.:
what the page about "phil collins" contains
it is enough to display the "phil collins" page and (automatically) analyze how the page is structured "today" :)
You can't change the structure for every response, because your regular users will hate you. Also, this will cause more trouble for you (maintenance), not for the scraper. The XPath or CSS path is determinable by the scraping script automatically from the known content.
Ajax - a little bit harder at the start, but many times it speeds up the scraping process :) - why?
When analyzing the requests and responses, I just set up my own proxy server (written in Perl) and point my Firefox at it. Of course, because it is my own proxy, it is completely hidden - the target server sees it as a regular browser. (So, no X-Forwarded-For and such headers.)
Based on the proxy logs, it is mostly possible to determine the "logic" of the Ajax requests, e.g. I can skip most of the HTML scraping and just use the well-structured Ajax responses (mostly in JSON format).
So, Ajax doesn't help much...
Somewhat more complicated are pages which use heavily packed JavaScript functions.
Here it is possible to use two basic methods:
unpack and understand the JS and create a scraper which follows the JavaScript logic (the hard way)
or (the one I prefer myself) - just use Mozilla with Mozrepl to scrape. E.g. the real scraping is done in a full-featured, JavaScript-enabled browser, which is programmed to click on the right elements and just grab the "decoded" responses directly from the browser window.
Such scraping is slow (the scraping is done as in a regular browser), but it is
very easy to set up and use
and it is nearly impossible to counter it :)
and the "slowness" is needed anyway to counter the "blocking of rapid requests from the same IP"
User-Agent based filtering doesn't help at all. Any serious data miner will set it to some correct value in his scraper.
Requiring login doesn't help. The simplest way to beat it (without any analysis and/or scripting of the login protocol) is just logging into the site as a regular user, using Mozilla, and then running the Mozrepl-based scraper...
Remember, requiring login helps against anonymous bots, but doesn't help against someone who wants to scrape your data. They just register themselves on your site as a regular user.
Using frames isn't very effective either. This is used by many live movie services and it is not very hard to beat. The frames are simply more HTML/JavaScript pages that need to be analyzed... If the data is worth the trouble, the data miner will do the required analysis.
IP-based limiting isn't effective at all - there are too many public proxy servers, and there is also TOR... :) It doesn't slow down the scraping (for someone who really wants your data).
It is very hard to scrape data hidden in images (e.g. by simply converting the data into images server-side). Employing "tesseract" (OCR) helps many times - but honestly, the data must be worth the trouble for the scraper (and many times it isn't).
On the other side, your users will hate you for this. I myself (even when not scraping) hate websites which don't allow copying the page content to the clipboard (because the information is in images, or (the silly ones) try to bind some custom JavaScript event to the right click). :)
The hardest are the sites which use Java applets or Flash, where the applet uses secure HTTPS requests itself internally. But think twice - how happy will your iPhone users be... ;). Therefore, currently very few sites use them. I myself block all Flash content in my browser (in regular browsing sessions) - and never use sites which depend on Flash.
Your milestones could be..., so you can try this method - just remember - you will probably lose some of your users. Also remember, some SWF files are decompilable. ;)
Captcha (the good ones - like reCaptcha) helps a lot - but your users will hate you... - just imagine how your users will love you when they need to solve captchas on every page showing information about the music artists.
Probably no need to continue - you already get the picture.
Now what you should do:
Remember: It is nearly impossible to hide your data if, at the same time, you want to publish it (in a friendly way) to your regular users.
So,
make your data easily accessible - by some API
this allows easy data access
e.g. it offloads your server from scraping - good for you
set up the right usage rights (e.g. require citing the source)
remember, much data isn't copyrightable - and is hard to protect
add some fake data (as you have already done) and use legal tools
as others have already said, send a "cease and desist letter"
other legal actions (suing and the like) are probably too costly and hard to win (especially against non-US sites)
Think twice before you try to use technical barriers.
Rather than trying to block the data miners, just put more effort into your website's usability. Your users will love you. The time (& energy) invested into technical barriers usually isn't worth it - it is better to spend the time making an even better website...
Also, data thieves aren't like normal thieves.
If you buy an inexpensive home alarm and add a warning "this house is connected to the police", many thieves will not even try to break in, because one wrong move and they go to jail...
So, you invest only a few bucks, but the thief invests and risks much.
But the data thief hasn't got such risks. Just the opposite - if you make one wrong move (e.g. if you introduce some BUG as a result of technical barriers), you will lose your users. If the scraping bot does not work the first time, nothing happens - the data miner will just try another approach and/or debug the script.
In this case, you need to invest much more - and the scraper invests much less.
Just think about where you want to invest your time & energy...
PS: English isn't my native language - so forgive my broken English...
Things that might work against beginner scrapers:
IP blocking
use lots of ajax
check referer request header
require login
Things that will help in general:
change your layout every week
robots.txt
Things that will help but will make your users hate you:
captcha
I have done a lot of web scraping and summarized some techniques to stop web scrapers on my blog based on what I find annoying.
It is a tradeoff between your users and scrapers. If you limit IPs, use CAPTCHAs, require login, etc., you make life difficult for the scrapers. But this may also drive away your genuine users.
From a tech perspective:
Just model what Google does when you hit them with too many queries at once. That should put a halt to a lot of it.
From a legal perspective:
It sounds like the data you're publishing is not proprietary. Meaning you're publishing names and stats and other information that cannot be copyrighted.
If this is the case, the scrapers are not violating copyright by redistributing your information about artist name etc. However, they may be violating copyright when they load your site into memory because your site contains elements that are copyrightable (like layout etc).
I recommend reading about Facebook v. Power.com and seeing the arguments Facebook used to stop screen scraping. There are many legal ways you can go about trying to stop someone from scraping your website. They can be far reaching and imaginative. Sometimes the courts buy the arguments. Sometimes they don't.
But, assuming you're publishing public domain information that's not copyrightable, like names and basic stats... you should just let it go in the name of free speech and open data. That is what the web's all about.
Your best option is unfortunately fairly manual: Look for traffic patterns that you believe are indicative of scraping and ban their IP addresses.
Since you're talking about a public site, making the site search-engine friendly will also make the site scraping-friendly. If a search engine can crawl and scrape your site, then a malicious scraper can as well. It's a fine line to walk.
Sure it's possible. For 100% success, take your site offline.
In reality you can do some things that make scraping a little more difficult. Google does browser checks to make sure you're not a robot scraping search results (although this, like most everything else, can be spoofed).
You can do things like require several seconds between the first connection to your site, and subsequent clicks. I'm not sure what the ideal time would be or exactly how to do it, but that's another idea.
I'm sure there are several other people who have a lot more experience, but I hope those ideas are at least somewhat helpful.
No, it's not possible to stop (in any way)
Embrace it. Why not publish as RDFa and become super search engine friendly and encourage the re-use of data? People will thank you and provide credit where due (see musicbrainz as an example).
It is not the answer you probably want, but why hide what you're trying to make public?
There are a few things you can do to try and prevent screen scraping. Some are not very effective, while others (a CAPTCHA) are, but hinder usability. You have to keep in mind too that it may hinder legitimate site scrapers, such as search engine indexes.
However, I assume that if you don't want it scraped that means you don't want search engines to index it either.
Here are some things you can try:
Show the text in an image. This is quite reliable, and is less of a pain on the user than a CAPTCHA, but means they won't be able to cut and paste and it won't scale prettily or be accessible.
Use a CAPTCHA and require it to be completed before returning the page. This is a reliable method, but also the biggest pain to impose on a user.
Require the user to sign up for an account before viewing the pages, and confirm their email address. This will be pretty effective, but not totally - a screen-scraper might set up an account and might cleverly program their script to log in for them.
If the client's user-agent string is empty, block access. A site-scraping script will often be lazily programmed and won't set a user-agent string, whereas all web browsers will.
You can set up a black list of known screen scraper user-agent strings as you discover them. Again, this will only help the lazily-coded ones; a programmer who knows what he's doing can set a user-agent string to impersonate a web browser.
Change the URL path often. When you change it, make sure the old one keeps working, but only for as long as one user is likely to have their browser open. Make it hard to predict what the new URL path will be. This will make it difficult for scripts to grab it if their URL is hard-coded. It'd be best to do this with some kind of script.
If I had to do this, I'd probably use a combination of the last three, because they minimise the inconvenience to legitimate users. However, you'd have to accept that you won't be able to block everyone this way and once someone figures out how to get around it, they'll be able to scrape it forever. You could then just try to block their IP addresses as you discover them I guess.
Method One (Small Sites Only):
Serve encrypted / encoded data. I scrape the web using Python (urllib, requests, BeautifulSoup, etc.) and have found many websites that serve encrypted / encoded data that is not decryptable in any programming language, simply because the decryption method is not available to the scraper.
I achieved this on a PHP website by encrypting and minimizing the output (WARNING: this is not a good idea for large sites); the scraped response was always jumbled content.
Example of minimizing output in PHP (How to minify php page html output?):
<?php
function sanitize_output($buffer) {
    $search = array(
        '/\>[^\S ]+/s',  // strip whitespaces after tags, except space
        '/[^\S ]+\</s',  // strip whitespaces before tags, except space
        '/(\s)+/s'       // shorten multiple whitespace sequences
    );
    $replace = array('>', '<', '\\1');
    $buffer = preg_replace($search, $replace, $buffer);
    return $buffer;
}
ob_start("sanitize_output");
?>
Method Two:
If you can't stop them, screw them over: serve fake / useless data as the response.
Method Three:
Block common scraping user agents. You'll see this on major / large websites, as it is impossible to scrape them with "python3.4" as your User-Agent.
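A minimal sketch of that check, assuming a Flask app; the blocklist entries are the default User-Agent fragments of a few well-known tools, not an exhaustive list:
from flask import Flask, request, abort

app = Flask(__name__)

# Substrings that appear in the default User-Agent of common scraping tools.
BLOCKED_UA_FRAGMENTS = ("python-requests", "python-urllib", "curl/", "wget/", "scrapy")

@app.before_request
def block_scraper_agents():
    ua = request.headers.get("User-Agent", "").lower()
    if not ua or any(fragment in ua for fragment in BLOCKED_UA_FRAGMENTS):
        abort(403)  # empty or known-tool User-Agent: refuse (or serve fake data instead)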
Method Four:
Make sure all the headers you receive are valid. I sometimes provide as many headers as possible to make my scraper seem like an authentic user, and some of them are not even true or valid, like en-FU :).
Here is a list of some of the headers I commonly provide.
headers = {
    "Requested-URI": "/example",
    "Request-Method": "GET",
    "Remote-IP-Address": "656.787.909.121",
    "Remote-IP-Port": "69696",
    "Protocol-version": "HTTP/1.1",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-FU,en;q=0.8",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "Dnt": "1",
    "Host": "http://example.com",
    "Referer": "http://example.com",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.111 Safari/537.36"
}
Rather than blacklisting bots, maybe you should whitelist them. If you don't want to kill your search results for the top few engines, you can whitelist their user-agent strings, which are generally well-publicized. The less ethical bots tend to forge user-agent strings of popular web browsers. The top few search engines should be driving upwards of 95% of your traffic.
Identifying the bots themselves should be fairly straightforward, using the techniques other posters have suggested.
A quick approach to this would be to set up a booby trap / bot trap.
Make a page that, if it's opened a certain number of times or even opened at all, will collect certain information like the IP address and whatnot (you can also look for irregularities or patterns, but this page shouldn't need to be opened at all).
Make a link to this page that is hidden with CSS (display:none; or position:absolute; left:-9999px;). Try to place it somewhere less likely to be ignored, such as within your content rather than in your footer, as bots can sometimes choose to skip certain parts of a page.
In your robots.txt file, set a whole bunch of Disallow rules for pages you don't want friendly bots (LOL, like they have happy faces!) to gather information from, and make this page one of them.
Now, if a friendly bot comes through, it should ignore that page. Right, but that still isn't good enough. Make a couple more of these pages, or somehow re-route a page so it accepts different names, then place more Disallow rules for these trap pages in your robots.txt file alongside the pages you want ignored.
Collect the IP addresses of the bots, or of anyone who enters these pages. Don't ban them, but make a function that displays noodled text in your content - random numbers, copyright notices, specific text strings, scary pictures, basically anything to obscure your good content. You can also set links that point to a page that takes forever to load, e.g. in PHP you can use the sleep() function. This will fight back against a crawler that has some sort of detection for skipping pages that take too long to load, since some well-written bots are set to process a certain number of links at a time.
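Here is a minimal sketch of such a trap route, assuming a Flask app; the path, log file name and sleep length are all arbitrary choices of mine:
import time
from flask import Flask, request

app = Flask(__name__)
TRAP_LOG = "trap_hits.log"        # hypothetical log of IPs that touched the hidden page

@app.route("/specials-archive")   # made-up path; list it under Disallow in robots.txt
def bot_trap():
    # Record who wandered in: time, IP, and whatever User-Agent they claim.
    with open(TRAP_LOG, "a", encoding="utf-8") as f:
        f.write(f"{time.time()}\t{request.remote_addr}\t{request.headers.get('User-Agent', '')}\n")
    time.sleep(30)                # stall the crawler (this also ties up one worker)
    return "<p>Nothing to see here.</p>"
The hidden display:none link and the robots.txt Disallow line described above would both point at this path, so only bots that ignore robots.txt ever reach it.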
If you have made specific text strings/sentences, why not go to your favorite search engine and search for them? It might show you where your content is ending up.
Anyway, if you think tactically and creatively this could be a good starting point. The best thing to do would be to learn how a bot works.
I'd also think about scrambling some IDs, or the way attributes on page elements are displayed:
<a class="someclass" href="../xyz/abc" rel="nofollow" title="sometitle">
so that it changes its form every time, as some bots might be set to look for specific markup patterns or targeted elements in your pages:
<a title="sometitle" href="../xyz/abc" rel="nofollow" class="someclass">
id="p-12802" > id="p-00392"
You can't stop normal screen scraping. For better or worse, it's the nature of the web.
You can make it so no one can access certain things (including music files) unless they're logged in as a registered user. It's not too difficult to do in Apache. I assume it wouldn't be too difficult to do in IIS as well.
Most of this has already been said, but have you considered CloudFlare's protection? I mean the browser check / challenge page that CloudFlare can put in front of a site before a visitor gets through.
Other companies probably do this too; CloudFlare is the only one I know of.
I'm pretty sure that would complicate their work. I also once got my IP banned automatically for 4 months when I tried to scrape data from a site protected by CloudFlare, due to rate limiting (I used a simple AJAX request loop).
One way would be to serve the content as XML attributes, URL encoded strings, preformatted text with HTML encoded JSON, or data URIs, then transform it to HTML on the client. Here are a few sites which do this:
Skechers: XML
<document
filename=""
height=""
width=""
title="SKECHERS"
linkType=""
linkUrl=""
imageMap=""
href="http://www.bobsfromskechers.com"
alt="BOBS from Skechers"
title="BOBS from Skechers"
/>
Chrome Web Store: JSON
<script type="text/javascript" src="https://apis.google.com/js/plusone.js">{"lang": "en", "parsetags": "explicit"}</script>
Bing News: data URL
<script type="text/javascript">
//<![CDATA[
(function()
{
var x;x=_ge('emb7');
if(x)
{
x.src='data:image/jpeg;base64,/*...*/';
}
}() )
Protopage: URL Encoded Strings
unescape('Rolling%20Stone%20%3a%20Rock%20and%20Roll%20Daily')
TiddlyWiki : HTML Entities + preformatted JSON
<pre>
{"tiddlers":
{
"GettingStarted":
{
"title": "GettingStarted",
"text": "Welcome to TiddlyWiki,
}
}
}
</pre>
Amazon: Lazy Loading
amzn.copilot.jQuery=i;amzn.copilot.jQuery(document).ready(function(){d(b);f(c,function() {amzn.copilot.setup({serviceEndPoint:h.vipUrl,isContinuedSession:true})})})},f=function(i,h){var j=document.createElement("script");j.type="text/javascript";j.src=i;j.async=true;j.onload=h;a.appendChild(j)},d=function(h){var i=document.createElement("link");i.type="text/css";i.rel="stylesheet";i.href=h;a.appendChild(i)}})();
amzn.copilot.checkCoPilotSession({jsUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-js/cs-copilot-customer-js-min-1875890922._V1_.js', cssUrl : 'http://z-ecx.images-amazon.com/images/G/01/browser-scripts/cs-copilot-customer-css/cs-copilot-customer-css-min-2367001420._V1_.css', vipUrl : 'https://copilot.amazon.com'
XMLCalabash: Namespaced XML + Custom MIME type + Custom File extension
<p:declare-step type="pxp:zip">
<p:input port="source" sequence="true" primary="true"/>
<p:input port="manifest"/>
<p:output port="result"/>
<p:option name="href" required="true" cx:type="xsd:anyURI"/>
<p:option name="compression-method" cx:type="stored|deflated"/>
<p:option name="compression-level" cx:type="smallest|fastest|default|huffman|none"/>
<p:option name="command" select="'update'" cx:type="update|freshen|create|delete"/>
</p:declare-step>
If you view source on any of the above, you see that scraping will simply return metadata and navigation.
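If you want to experiment with the same trick, here is a tiny sketch of the server side in Python: URL-encode the sensitive fragment and stash it in a data attribute, leaving a few lines of client-side JavaScript (not shown) to decodeURIComponent() it into the DOM after load. The function names and the data-payload attribute are my own inventions:
from urllib.parse import quote
from html import escape

def encode_fragment(html_fragment):
    # URL-encode the real markup so view-source or a naive scraper sees only gibberish.
    return quote(html_fragment, safe="")

def wrap_for_client(html_fragment):
    # A small inline script on the client would decodeURIComponent() this attribute
    # and inject the result into the page after load.
    payload = encode_fragment(html_fragment)
    return f'<div class="deferred" data-payload="{escape(payload)}"></div>'

print(wrap_for_client("<h2>Artist: Example Band</h2><p>Plays: 1,234,567</p>"))
As with the sites above, anything running a real browser engine (headless or not) still gets the decoded content.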
I agree with most of the posts above, and I'd like to add that the more search-engine friendly your site is, the more scrape-able it will be. You could try to do a couple of things that are very out there and make it harder for scrapers, but it might also affect your search-ability... It depends on how well you want your site to rank on search engines, of course.
Putting your content behind a captcha would mean that robots would find it difficult to access your content. However, humans would be inconvenienced so that may be undesirable.
If you want to see a great example, check out http://www.bkstr.com/. They use a JavaScript algorithm to set a cookie, then reload the page so it can use the cookie to validate that the request is being run within a browser. A desktop app built to scrape could definitely get by this, but it would stop most cURL-type scraping.
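A rough sketch of that cookie-then-reload pattern, assuming a Flask app; the cookie name and the trivial check are placeholders - the real site presumably computes something far less guessable:
from flask import Flask, request, make_response

app = Flask(__name__)
CHALLENGE_COOKIE = "js_ok"   # hypothetical cookie name

CHALLENGE_PAGE = """<!doctype html>
<script>
  // Only a real browser runs this, sets the cookie, and reloads the page.
  document.cookie = "js_ok=1; path=/";
  location.reload();
</script>
<noscript>Please enable JavaScript.</noscript>"""

@app.route("/catalog")
def catalog():
    if request.cookies.get(CHALLENGE_COOKIE) != "1":
        return make_response(CHALLENGE_PAGE, 200)   # first hit: serve the JS challenge
    return "<h1>The real catalog page</h1>"          # cookie present: serve content
As the answer says, a headless browser or a scraper that sets the cookie by hand still gets through; this mainly filters out plain cURL / requests traffic.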
Screen scrapers work by processing HTML, and if they are determined to get your data there is not much you can do technically, because anything a human eyeball can read, a scraper can eventually process too. Legally, as has already been pointed out, you may have some recourse, and that would be my recommendation.
However, you can hide the critical parts of your data by using non-HTML-based presentation logic:
Generate a Flash file for each artist/album, etc.
Generate an image for each artist's content. Maybe just an image of the artist name, etc. would be enough. Do this by rendering the text onto a JPEG/PNG file on the server and linking to that image (a sketch of this follows below).
Bear in mind that this would probably affect your search rankings.
Generate the HTML, CSS and JavaScript. It is easier to write generators than parsers, so you could generate each served page differently. You can no longer use a cache or static content then.
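For the render-text-to-an-image idea above, here is a minimal sketch using Pillow; the font path and sizes are assumptions, and you'd link to the saved PNG from your page instead of printing the name as text:
# pip install Pillow
from PIL import Image, ImageDraw, ImageFont

def text_to_png(text, out_path, font_path="DejaVuSans.ttf", size=24, padding=10):
    font = ImageFont.truetype(font_path, size)
    # Measure the text so the image fits it exactly.
    left, top, right, bottom = font.getbbox(text)
    width, height = right - left + 2 * padding, bottom - top + 2 * padding
    img = Image.new("RGB", (width, height), "white")
    ImageDraw.Draw(img).text((padding - left, padding - top), text, font=font, fill="black")
    img.save(out_path)

text_to_png("Artist: Example Band", "artist_name.png")
As noted, this hurts search rankings and accessibility, and an OCR-equipped scraper can still read it, but it raises the bar well above copy-and-paste.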