How do deal with bots using the in-site search and overflowing the SQL with too many requests? - mysql

What is the best practise to not annoy users with flood limits, but yet block off bots doing automated searches?
What is going on:
I am been more aware of odd search behaviour and I finally had the time, to catch who it is. It is 157.55.39.* also known as Bing. Which is odd, because when _GET['q'] is detected, noindex is added.
Problem however is, that they are slowing down the SQL server, as there is just too many instances of requests coming in.
What I have done so far:
I have implemented searching flood limit. But since I did it with a session-cookie, checking and calculating from the last search timestamp -- bing obviously ignores cookies and continues on.
Worst case scenario is to add reCAPTHA, but I don't want the "Are you human?" tickbox everytime you search. It should appear only, when flood is detected. So basically, the real question is, how to detect too many requests from client to trigger some sort of recaptcha to stop requests..
EDIT #1:
I handled the situation currently, with:
<?
# Get end IP
define('CLIENT_IP', (filter_var(#$_SERVER['HTTP_X_FORWARDED_IP'], FILTER_VALIDATE_IP) ? #$_SERVER['HTTP_X_FORWARDED_IP'] : (filter_var(#$_SERVER['HTTP_X_FORWARDED_FOR'], FILTER_VALIDATE_IP) ? #$_SERVER['HTTP_X_FORWARDED_FOR'] : $_SERVER['REMOTE_ADDR'])));
# Detect BING:
if (substr(CLIENT_IP, 0, strrpos(CLIENT_IP, '.')) == '157.55.39') {
# Tell them not right now:
Header('HTTP/1.1 503 Service Temporarily Unavailable');
# ..and block the request
die();
}
It works. But it seems like another temp solution to a more systematic problem.
I would like to mention, that I still would like search engines, including Bing to index /search.html, just not to actually search there. There is no "latest searches" or anything like that, so its a mystery where they are getting the queries from.
EDIT #2 -- How I solved it
If someone else in the future has these problems, I hope this helps.
First of all, it turns out that Bing has the same URL parameter feature, that Google has. So I was able to tell Bing to ignore URL parameter "q".
Based on the correct answer, I added disallow rows for parameter q to robots.txt:
Disallow: /*?q=*
Disallow: /*?*q=*
I also told inside the bing webmaster console, to not bother us on peak traffic.
Overall, this right away showed positive feedback from server resource usage. I will however, implement overall flood limit for identical queries, specifically where _GET is involved. So in case Bing should ever decide to visit an AJAX call (example: ?action=upvote&postid=1).

Spam is a problem that all website owners struggle to deal with.
And there are a lot of ways to build good protection, starting from very easy ways and finishing with very hard and strong protection mechanisms.
But for you right now I see one simple solution.
Use robots.txt and disallow Bing spider to crawl your search page.
You can do this very easy.
Your robots.txt file would look like:
User-agent: bingbot
Disallow: /search.html?q=
But this will totally block search engine spider from crawling your search results.
If you want just to limit such requests, but not totally block them, try this:
User-agent: bingbot
crawl-delay: 10
This will force Bing to crawl your website pages only every 10 seconds.
But with such delay, it will crawl only 8,640 pages a day (which is very small amount of requests per/day).
If you good with this, then you ok.
But, what if you want manually control this behavior by the server itself, protecting search form not only from web crawlers, but also from hackers?
They could send to your server over 50,000 requests per/hour with the ease.
In this case, I would recommend you 2 solutions.
Firstly, connect CloudFlare to your website, and don't forget to check if your server real IP is still available via services like ViewDNS IP History, cuz many websites with CF protection lack on this (even popular once).
If your active server IP is visible in the history, then you may consider changing it (highly recommended).
Secondly, you could use MemCached to store flood data and detect if a certain IP is querying too much (i.e. 30 q/min).
And if they do, block their opportunity to use perform (via MemCached) for some time.
Of course, this is not the best solution you could use, but it will work and will cost not much for your server.

Related

Are google's search results influenced by our data?

I have always wondered that.
For example, If I search for the term "composer" or "what is composer", it shows the php package manager. Why does it show programmer-related results? Obviously, it makes sense that it does that, since the results I get are much more relevant to me.
What if an aspiring composer googles that? What results will they get?
Another example is, if I enter the word "spring" to the search engine, it shows the spring framework, instead of, let's say, the season.
So, my question(s):
Does google actually use the data it collects to show relevant search results? (I am not talking about ads, but search results)
If yes, why doesn't incognito mode work?
How can I avoid google using other parameters, besides the very term I typed in, to affect the search results?
Yes. This is the very core of Google's business model. The same data that influences search results is also applied to ad placement (see their real-time bidding system); when you do searches, it's likely you will see ads about the same subjects fairly soon afterwards.
Incognito mode is a very limited form of anonymisation; it's really not very anonymous at all. If you visit a page in a browser that has some google-controlled element (e.g. Google Analytics, a CDN JS library, or a font), then shortly afterwards perform a google search, there will be very many points in common that allow google to match you as very likely the same person (e.g. your IP, time of day, recent similar requests, user agent string, window size, fonts available) even if it blocks cookies that would identify you explicitly. This form of fingerprinting is quite hard to avoid, though Safari is a lot better at it than Chrome. Tor provides much more robust anonymisation by normalising many fingerprintable elements, as well as hiding your IP.
That's difficult because making use of all this information will indeed lead to generally more relevant search results, so it's in Google's interests to use whatever it can (within technical and mostly legal limits). Tor will disconnect the search results from you, but it may instead provide you with results linked to whoever else might have been using the same Tor exit node as you recently, which might not be pleasant! The same would apply to using VPN services.

Best practice for email links that will set a DB flag?

Our business wants to email our customers a survey after they work with support. For internal reasons, we want to ask them the first question in the body of the email. We'd like to have a link for each answer. The link will go to a web service, which will store the answer, then present the rest of the survey.
So far so good.
The challenge I'm running into: making a server-side changed based on an HTTP GET is bad practice, but you can't do a POST from a link. Options seem to be:
Use an HTTP GET instead, even though that's not correct and could cause problems (https://twitter.com/rombulow/status/990684453734203392)
Embed an HTML form in the email and style some buttons to look like links (likely not compatible with a number of email platforms)
Don't include the first question in the email (not possible for business reasons)
Use HTTP GET, but have some sort of mechanism which prevents a link from altering the server state more than once
Does anybody have any better recommendations? Googling hasn't turned up much about this specific situation.
One thing to keep in mind is that HTTP is specifying semantics, not implementation. If you want to change the state of your server on receipt of a GET request, you can. See RFC 7231
This definition of safe methods does not prevent an implementation from including behavior that is potentially harmful, that is not entirely read-only, or that causes side effects while invoking a safe method. What is important, however, is that the client did not request that additional behavior and cannot be held accountable for it. For example, most servers append request information to access log files at the completion of every response, regardless of the method, and that is considered safe even though the log storage might become full and crash the server. Likewise, a safe request initiated by selecting an advertisement on the Web will often have the side effect of charging an advertising account.
Domain agnostic clients are going to assume that GET is safe, which means your survey results could get distorted by web spiders crawling the links, browsers pre-loading resource to reduce the perceived latency, and so on.
Another possibility that works in some cases is to treat the path through the graph as the resource. Each answer link acts like a breadcrumb trail, encoding into itself the history of the clients answers. So a client that answered A and B to the first two questions is looking at /survey/questions/questionThree?AB where the user that answered C to both is looking at /survey/questions/questionThree?CC. In other words, you aren't changing the state of the server, you are just guiding the client through a pre-generated survey graph.

Counting views of any element on website

I am using such MySQL request for measuring views count
UPDATE content SET views=views+1 WHERE id='$id'
For example if I want to check how many times some single page has been viewed I've just putting it on top of page code. Unfortunately I always receiving about 5-10x bigger amount than results in Google Analytics.
If I am correct one refresh should increase value in my data base about +1. Doesn't "Views" in Google Analytics works in the same way?
If e.g. Google Analytics provides me that single page has been viewed 100x times and my data base says it was e.g. 450x times. How such simple request could generate additional 350 views? And I don't mean visits or unique visits. Just regular views.
Is it possible that Google Analytics interprates such data in a little bit different way and my data base result is correct?
There are quite a few reasons why this could be occurring. The most usual culprit is bots and spiders. As soon as you use a third-party API like Google Analytics, or Facebook's API, you'll get their bots making hits to your page.
You need to examine each request in more detail. The user agent is a good place to start, although I do recommend researching this area further - discriminating between human and bot traffic is quite a deep subject.
In Google Analytics the data is provided by the user, for example:
A user view a page on your domain, now he is on charge to comunicate to Google The PageView, if something fails in the road, the data will no be included in the reports.
In the other case , the SQL sistem that you have is a Log Based Analytic, the data is collected by your system reducing the data collection failures.
If we see this in that way, that means taht some data can be missed with the slow conections and users that dont execute javascriopt (Adbloquers or bots), or the HTML page is not properly printed***.
Now 5x times more it's a huge discrepancy, in my experiences must be near 8-25% of discrepancy. (tested over transaction level, maybe in Pageview can be more)
What i recomend you is:
Save device, browser information, the ip, and some other metadata information that can be useful and dont forget the timesatmp, so in that way yo can isolate the problem, maybe are robots or users with adblock, in the worst case you code is not properly implemented ( located in the Footer as example)
*** i added this because one time i had a huge discrepancy, but it was a server error, the HTML code was not properly printed showing to the user a empty HTTP. The MYSQL was no so fast to save the information and process the HTML code. I notice it when the effort test (via Screaming frog) showed a lot of 500x errors. ( Wordpress Blog with no cache)

Anyone ever tried to use Twitter to replace comments sections on web apps?

Here's the scenario I'm imagining.
Simple blog, users typically post comments in a comments form at the bottom of each blog article. Instead of that, using the Twitter API, pull tweets based on a hashtag. Base the hashtag on the article id (i.e. #site10201) where site is a prefix and the number is the article id.
Then provide a link to post a tweet using the hashtag., which would then get picked up in your twitter api pull.
I'm imagining horrible spam issues, but other than that, bad idea?
Has some drawbacks to more run-of-the-mill database systems:
Additional network overhead. Most self-hosted blogs would typically rely on database and blog being on the same server (physical or virtual) so db-lookup is fast (and reliable) compared to twitter API requests.
Caching issues. One host is only allowed X requests of twitter at a time (the next request is going to end up a 404), and how are you going to manage that from your website for a scenario which becomes steadily more complex as multiple articles are added? Presumably you need to authenticate so the easy-way out is a security liability. (The easy way out being to use JavaScript on the at the browser to perform the actual request, neatly circumvents the problem in 20/80 fashion.) Granted most blogs don't get that kind of traffic. ;)
You tie your precious or not so precious comments to the mercy of the fail whale. Which is kind of odd considering a self-hosted blog basically means you want to have such control in the first place by not using a service like blogger.
Is it possible to ensure unicity of hash tags --in the general case? What are you going to do if someone had the same bright idea, only took the name of the tag 5ms before you did? Would you end up pulling the drivel of someone else's blog comments rather than the brilliance you have come to expect from yours? ;)
Lesser point: you rely on others to have a twitter account. Anonymous replies are off the table.
TOS and other considerations that may be imposed on you by twitter, either now or in future. (2) is actually a major item of Twitter's TOS.

How do web spiders differ from Wget's spider?

The next sentence caught my eye in Wget's manual
wget --spider --force-html -i bookmarks.html
This feature needs much more work for Wget to get close to the functionality of real web spiders.
I find the following lines of code relevant for the spider option in wget.
src/ftp.c
780: /* If we're in spider mode, don't really retrieve anything. The
784: if (opt.spider)
889: if (!(cmd & (DO_LIST | DO_RETR)) || (opt.spider && !(cmd & DO_LIST)))
1227: if (!opt.spider)
1239: if (!opt.spider)
1268: else if (!opt.spider)
1827: if (opt.htmlify && !opt.spider)
src/http.c
64:#include "spider.h"
2405: /* Skip preliminary HEAD request if we're not in spider mode AND
2407: if (!opt.spider
2428: if (opt.spider && !got_head)
2456: /* Default document type is empty. However, if spider mode is
2570: * spider mode. */
2571: else if (opt.spider)
2661: if (opt.spider)
src/res.c
543: int saved_sp_val = opt.spider;
548: opt.spider = false;
551: opt.spider = saved_sp_val;
src/spider.c
1:/* Keep track of visited URLs in spider mode.
37:#include "spider.h"
49:spider_cleanup (void)
src/spider.h
1:/* Declarations for spider.c
src/recur.c
52:#include "spider.h"
279: if (opt.spider)
366: || opt.spider /* opt.recursive is implicitely true */
370: (otherwise unneeded because of --spider or rejected by -R)
375: (opt.spider ? "--spider" :
378: (opt.delete_after || opt.spider
440: if (opt.spider)
src/options.h
62: bool spider; /* Is Wget in spider mode? */
src/init.c
238: { "spider", &opt.spider, cmd_boolean },
src/main.c
56:#include "spider.h"
238: { "spider", 0, OPT_BOOLEAN, "spider", -1 },
435: --spider don't download anything.\n"),
1045: if (opt.recursive && opt.spider)
I would like to see the differences in code, not abstractly. I love code examples.
How do web spiders differ from Wget's spider in code?
A real spider is a lot of work
Writing a spider for the whole WWW is quite a task --- you have to take care about many "little details" such as:
Each spider computer should receive data from a few thousand servers in parallel in order to make efficient use of the connection bandwidth. (asynchronous socket i/o).
You need several computers that spider in parallel in order to cover the vast amount of information on the WWW (clustering; partitioning the work)
You need to be polite to the spidered web sites:
Respect the robots.txt files.
Don't fetch a lot of information too quickly: this overloads the servers.
Don't fetch files that you really don't need (e.g. iso disk images; tgz packages for software download...).
You have to deal with cookies/session ids: Many sites attach unique session ids to URLs to identify client sessions. Each time you arrive at the site, you get a new session id and a new virtual world of pages (with the same content). Because of such problems, early search engines ignored dynamic content. Modern search engines have learned what the problems are and how to deal with them.
You have to detect and ignore troublesome data: connections that provide a seemingly infinite amount of data or connections that are too slow to finish.
Besides following links, you may want to parse sitemaps to get URLs of pages.
You may want to evaluate which information is important for you and changes frequently to be refreshed more frequently than other pages. Note: A spider for the whole WWW receives a lot of data --- you pay for that bandwidth. You may want to use HTTP HEAD requests to guess whether a page has changed or not.
Besides receiving, you want to process the information and store it. Google builds indices that list for each word the pages that contain it. You may need separate storage computers and an infrastructure to connect them. Traditional relational data bases don't keep up with the data volume and performance requirements of storing/indexing the whole WWW.
This is a lot of work. But if your target is more modest than reading the whole WWW, you may skip some of the parts. If you just want to download a copy of a wiki etc. you get down to the specs of wget.
Note: If you don't believe that it's so much work, you may want to read up on how Google re-invented most of the computing wheels (on top of the basic Linux kernel) to build good spiders. Even if you cut a lot of corners, it's a lot of work.
Let me add a few more technical remarks on three points
Parallel connections / asynchronous socket communication
You could run several spider programs in parallel processes or threads. But you need about 5000-10000 parallel connections in order to make good use of your network connection. And this amount of parallel processes/threads produces too much overhead.
A better solution is asynchronous input/output: process about 1000 parallel connections in one single thread by opening the sockets in non-blocking mode and use epoll or select to process just those connections that have received data. Since Linux kernel 2.4, Linux has excellent support for scalability (I also recommend that you study memory-mapped files) continuously improved in later versions.
Note: Using asynchronous i/o helps much more than using a "fast language": It's better to write an epoll-driven process for 1000 connections written in Perl than to run 1000 processes written in C. If you do it right, you can saturate a 100Mb connection with processes written in perl.
From the original answer:
The down side of this approach is that you will have to implement the HTTP specification yourself in an asynchronous form (I am not aware of a re-usable library that does this). It's much easier to do this with the simpler HTTP/1.0 protocol than the modern HTTP/1.1 protocol. You probably would not benefit from the advantages of HTTP/1.1 for normal browsers anyhow, so this may be a good place to save some development costs.
Edit five years later:
Today, there is a lot of free/open source technology available to help you with this work. I personally like the asynchronous http implementation of node.js --- it saves you all the work mentioned in the above original paragraph. Of course, today there are also a lot of modules readily available for the other components that you need in your spider. Note, however, that the quality of third-party modules may vary considerably. You have to check out whatever you use. [Aging info:] Recently, I wrote a spider using node.js and I found the reliability of npm modules for HTML processing for link and data extraction insufficient. For this job, I "outsourced" this processing to a process written in another programming language. But things are changing quickly and by the time you read this comment, this problem may already a thing of the past...
Partitioning the work over several servers
One computer can't keep up with spidering the whole WWW. You need to distribute your work over several servers and exchange information between them. I suggest to assign certain "ranges of domain names" to each server: keep a central data base of domain names with a reference to a spider computer.
Extract URLs from received web pages in batches: sort them according to their domain names; remove duplicates and send them to the responsible spider computer. On that computer, keep an index of URLs that already are fetched and fetch the remaining URLs.
If you keep a queue of URLs waiting to be fetched on each spider computer, you will have no performance bottlenecks. But it's quite a lot of programming to implement this.
Read the standards
I mentioned several standards (HTTP/1.x, Robots.txt, Cookies). Take your time to read them and implement them. If you just follow examples of sites that you know, you will make mistakes (forget parts of the standard that are not relevant to your samples) and cause trouble for those sites that use these additional features.
It's a pain to read the HTTP/1.1 standard document. But all the little details got added to it because somebody really needs that little detail and now uses it.
I am not sure exactly what the original author of the comment was referring to, but I can guess that wget is slow as a spider, since it appears to only use a single thread of execution (at least by what you have shown).
"Real" spiders such as heritrix use a lot of parallelism and tricks to optimize their crawling speed, while simultaneously being nice to the website they are crawling. This typically means limiting hits to one site at a rate of 1 per second (or so), and crawling multiple websites at the same time.
Again this is all just a guess based on what I know of spiders in general, and what you posted here.
Unfortunately, many of the more well-known 'real' web spiders are closed-source, and indeed closed-binary. However there are a number of basic techniques wget is missing:
Parallelism; you're never going to be able to keep up with the entire web without retrieving multiple pages at a time
Prioritization; some pages are more important to spider than others
Rate limiting; you'll be banned quickly if you keep pulling down pages as quickly as you can
Saving to something other than a local filesystem; the Web is big enough that it's not going to fit in a single directory tree
Rechecking pages periodically without restarting the entire process; in practice, with a real spider you'll want to recheck 'important' pages for updates frequently, while less interesting pages can go for months.
There are also various other inputs that can be used such as sitemaps and the like. Point is, wget isn't designed to spider the entire web, and it's not really a thing that can be captured in a small code sample, as it's a problem of the whole overall technique being used, rather than any single small subroutine being wrong for the task.
I'm not going to go into details of how to spider the internet, I think that wget comment is regarding to spidering one website which is still a serious challenge.
As a spider you need to figure out when to stop, not go into recursive crawls just because the URL changed like date=1/1/1900 to 1/2/1900 and so
Even bigger challenge to sort out URL Rewrite (I have no clue what so ever how google or any other handles this). It's pretty big challenge to crawl enough but not too much. And how one can automatically recognise URL Rewrite with some random parameters and random changes in the content?
You need to parse Flash / Javascript at least up to some level
You need to consider some crazy HTTP issues like base tag. Even parsing the HTML is not easy, considering most of the websites are not XHTML and browsers are so flexible in the syntax.
I don't know how much of these implemented or considered in wget but you might want to take a look at httrack to understand the challenges of this task.
I'd love to give you some code examples but this is big tasks and a decent spider will be about 5000 loc without 3rd party libraries.
+ Some of them already explained by #yaakov-belch so I'm not going to type them again