Prevent MediaWiki from being spammed

Prevent MediaWiki from being spammed - mediawiki

My MediaWiki site is currently under the spammers attack. I get around 10 spam pages registered daily.
What I've I already done:
Only users with confirmed emails can create/edit pages.
ReCAPTCHA widget.
Captcha displayed on the actions:
'edit' - triggered on every attempted page save
'create' - triggered on page creation
'addurl' - triggered on a page save that would add one or more URLs to the page
'createaccount' - triggered on creation of a new account
Proxy blocker
SpamBlacklist
What else can I do to stop the spam?

It's counter-intuitive, but I have found this combination very effective:
Disable new signups or if you think that is too extreme, install SecurePages
Install SimpleAntiSpam
Install SpamBlacklist and TitleBlacklist
Allow anonymous edits
Always block the IP addresses that spam is posted from
Install User Merge and Delete and use that to clear out the existing spammer accounts.
#1 is the most important step. It's easy for spammers to create throwaway accounts.
A CAPTCHA makes only a small difference, not worth the extra bandwidth cost for the images.
The hundreds of throwaway accounts are almost as big a problem as the spam postings.
#2 reduces the volume of spam by at least 1/3.
The only robots that get past SimpleAntiSpam are those specially designed for MediaWiki, not the ones that fill in all textareas in every web page everywhere.
Similarly if your site has SSL, SecurePages (or its predecessor HttpsLogin) thwarts some bots that don't have SSL support.
#3 will stop you getting the same spam posting (or variants of it) repeatedly. If you update the blacklist regularly that should reduce the volume of spam by another 10-20%.
And remember the spammers will run out of paying customers (you eliminate one for every domain you block links to) long before they run out of public proxies/zombies to post from.
#4 does not increase the volume of spam as much as you might expect. There's a popular MediaWiki-spamming bot that never attempts to post anonymously - it gives up when it cannot find the "create account" link.
And if you don't do this, you don't have a wiki anymore (you just have a static website using MediaWiki as a CMS.)
There is a small bonus - it makes it easier to find (and block) the spammers' IP addresses. Of course you can get the IP addresses using CheckUser or by reading the database directly, but it's much easier when the IP address is in plain sight.
#5 is the least effective measure, but it's still worth doing. Spammers do re-use IP addresses. They may be cheap but they are not infinite, and sometimes you will catch one of those runaway robots that posts a spam page every 5 minutes.
#6 doesn't prevent spam, but it allows you to clean up your user list page once you have other anti-spam measures in place.

Maybe you can check IPs used for spamming?
Or use special questions instead of standard CAPTCHA? (for example, one of NetHack (roguelike) related sites is asking for symbol of ring/spellbok/potion - trivial for NetHack players, impossible for bots/hired spam solvers).

I used to have a HUGE problem with spam attacks on my wiki. I used to have to go through the wiki everyday and manually delete spam posts and then block the addresses but this was a never-ending battle. Restricting editing to registered users didn't help as the spammers just got tyhemselves registered. So I finally had to shut the site down.
I started a new wiki where I have managed to block all spam.
My wiki is for a particular professional group so what I did was add in a username/password that had to be used to access the wiki directory. The username was displayed on my home page so no secrets there. BUT the password was the answer to a cryptic question selected carefully so the answer was easy for people in my professional group to answer but very hard for a spammer and certainly not something a a bot could work out. The question was selected so the answer could not be found by a Google search on any of the words - I had a mis-spelling and a non-standard abbrevaiation in the question. As it turned out about 1% of my target audience (mostly non-English speakers) found the question troo cryptic so the alternative was for them to contact me by email using a professional organisation email address (not gmail or hotmail). The answer was one word all in lowercase.
I thought I would have to change the password every so often BUT after several years there has been not a single spam message posted so I've just left the same question.

Related

Difference between CAPTCHA and reCAPTCHA

What are the differences between CAPTCHA and reCAPTCHA?
What is the best situation to opt for reCAPTCHA?

CAPTCHA is the human validation test (usually the blurry squiglly letters that need to be deciphered) used by many sites to prevent spam.
reCAPTCHA is a reversed CAPTCHA - the same test, used not only to prevent spam but to help in the book digitazion project. In other words, the reCAPTCHA tests are not meaningless combination of words, but excerpts from books that undergo digitation, while CAPTCHA uses several human validation methods including math or general knowledge questions, visual puzzles and even chess puzzles.
Google purchased reCAPTCHA several years ago, and now it is also used to collect street view data

A CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a program that can tell whether its user is a human or a computer.
The process involves a computer asking a user to complete a simple test which generated by computer. Because other computers are unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. Sometimes it is described as a reverse Turing test, because it is done by a machine and targeted to a human. reCAPTCHA does exactly that by channeling the effort spent solving CAPTCHAs online into "reading" books.
reCaptcha is hosted by Google, and one of the more interesting things about it is that it is used to digitize text of old newspapers and books. That’s why there are two “sections” of a reCaptcha instead of the single series of characters for CAPTCHA - one is known text, the other is not. If you get the known one correct, it assumes you got the second one. Then the next time it offers up that same “unknown” text, it is considered possibly known.
A few more times with the same result for the “unknown” text, and it becomes “known” and the text it originated from can be correctly digitized. Clever, eh?
Also, because of frequent updates, I would expect reCAPTCHA to be slightly better at preventing bots from solving them.
Reference: https://anydifferencebetween.com/difference-between-captcha-and-recaptcha/

reCAPTCHA is a type of CAPTCHA. It is easier for humans and relatively difficult for bots to crack, which is the sole purpose of having a CAPTCHA in forms. If you need a CAPTCHA, go for reCAPTCHA.

I think I remember CAPTCHA from a few years ago for Gmail.
reCAPTCHA is easier for people. I usually had some difficulty getting past the old CAPTCHA, because I couldn't quite read the letters and had a hard time seeing the difference between "O" and "Q" or other similar looking letters like "mn" or "nm".
I recommend reCAPTCHA over CAPTCHA since it's easier for actual users, and you can always add a firewall to help block bots.

How to defend against users with Multiple Accounts?

We have a service where we literally give away free money.
Naturally said service is ripe for abuse. To defend against this we do the following:
log ip address
use unique email addresses (only 1 acct/email addy)
collect more info like st. address, phone number, etc.
use signup captcha
BHOs (I've seen poker rooms use these)
Now, let's get real here -- NONE of this will stop a determined user.
Obviously ip addresses can be changed via a proxy (which could be blacklisted via akismet) but change anyways if the user has a dynamic ip or if more than one user is behind a NAT'd network (can we say almost everyone?)
I can sign up for thousands of unique email addresses each hour -- this is no defense.
I can put in fake information taken from lists for street addresses and phone numbers.
I can buy captchas from captcha solving services (1k for $5).
bhos seem only effective for downloadable software -- this is a website
What are some other ways to prevent multiple users from abusing the service? How do all the PPC people control click fraud?
I know we could actually call the person but I don't think we are trying to do that anytime soon.
Thanks,

It's pretty difficult to generate lots of fake phone numbers that can send and receive SMS messages. SMS verification could go a long way towards cutting down on fraud. Of course, it also limits you to giving away free money to cell phone owners.

I think only way is to bind your users accounts to 'real world' information, like his/her passport number, for instance. Of course, you'll need to make sure that information is securely stored and to find some way to validate it.

Re: signing up for new email accounts...
A user doesn't even need to do that. Please feel free to send your mail to brian_s#mailinator.com, or feydr.asks.a.question#spamherelots.com, or stackoverflow#safetymail.info, or my_arbitrary_username#zippymail.info. I haven't registered any of those email addresses, but all of them will work.
Those domains are owned by ManyBrain, and they (and probably others as well) set the domain to accept any email user. ManyBrain in particular then makes the inboxes for those emails publicly accessible without any registration (stripping everything by text from the email and deleting old mail). Check it out: admin#mailinator.com's email inbox!
Others have mentioned ways to try and keep user identities unique. This is just one more reason to not trust email addresses.

First, I suppose (hope) that you don't literally give away free money but rather give it to use your service or something like that.
That matters as there is a big difference between users trying to just get free money from you they can spend on buying expensive cars vs only spending on your service which would be much more limited.
Obviously many more user will try to fool the system in the former than in the latter case.
Why it matters? Because it is all about the balance between your control vs your user annoyance. I see many answers concentrating on the control part, so let's go through annoyance, shall we?
Log IP address. What if I am the next guy on the computer in say internet shop and the guy before me already used that IP? The other guy left your hot page that I now see but I am screwed because the IP is blocked. Yes, I can go to another computer but it is annoyance and I may have other things to do.
Collecting physical Adresses. For what??? Are you going to visit me? Or start sending me spam letters? Let me guess, more often than not you get addresses with misprints at best and fake ones at worst. In fact, it is much less hassle for me to give you fake address and not dealing with whatever possible spam letters I'll have to recycle in environment-friendly way. :)
Collecting phone numbers. Again, why shall I trust your site? This is the real story. I gave my phone nr to obscure site, then later I started receiving occasional messages full of nonsense like "hit the fly". That I simply deleted. Only later and by accident to discover that I was actually charged 2 euros to receive each of those messages!!! Do I want to get those hassles? Obviously not! So no, buddy, sorry to disappoint but I will not give your site my phone number unless your company is called Facebook or Google. :)
Use signup captcha. I love that :). So what are we trying to achieve here? Will the user who is determined to abuse your service, have problems to type in a couple of captchas? I doubt it. But what about the "good user"? Are you aware how annoying captchas are for many users??? What about users with impaired vision? But even without it, most captchas are so bad that they make you feel like you have impaired vision! The best advice I can give - if you care about user experience, avoid captchas as plague! If you have any doubts, do your online research first!
See here more discussion about control vs annoyance and here some more thoughts about being user-friendly.

You have to bind their information to something that is 'real world', as Rubens says. Of course, you also need to be able to verify this information (I can just make up passport numbers all day if you don't check to make sure they're correct).
How do you deliver the money? Perhaps you can index this off the paypal account, mailing address, or whatever you're sending the money to?

Sometimes the only way to prevent people abusing a system is to not have the system in the first place.
If you're doing what you say you're doing, "giving away money to people", then surprise surprise, there will be tons of people with more time available to try to find ways to game the system than you will have to fix it.

I guess it will never be possible to have an identification system which identifies fake identities that is:
cheap to run (I think it's called "operational cost"?)
cheap to implement (ideally one time cost - how do you call that?)
has no Type-I/Type-II errors
is scalable
But I think you could prevent users from having too many (to say a quite random number: more than 50) accounts.
You might combine the following approaches:
IP address: can be bypassed with VPN
CAPTCHA: can be bypassed with human farms (see this article, for example - although they claim that their test can't be that easily passed to other humans, I doubt this is true)
Ability-based identification: can be faked when you know what is stored and how exactly the identification works by randomly (but with a given distribution) acting (example: brainauth.com)
Real-world interaction: Although this might be the best one, but I guess it is expensive and not many users will accept it. Also, for some users/countries it might not be possible. (example: Postident in Germany, where the Post wants to see your identity card. I guess this can only be faced in massive scale by the government.)
Other sites/resources: This basically transforms the problem for other sites. You can use services, where it is not allowed/uncommon/expensive to have much more than 1 account
Email
Phone number: e.g. by using SMS, see Multi-factor authentication
Bank account: PayPal; transfer not much money or ask them to transfer a random (small) amount to you (which you will send back).
Social based
When you take the social graph (vertices are people, edges are connections), you will expect some distribution. You know that you are a single human and you know some other people. So you have a "network of trust" (in quotes, because I think this might be used in other context as well). Now you might not trust people / networks how interact heavily with your service, but are either isolated (no connection) or who connect a large group with another large group ("articulation points"). You also might not trust fast growing, heavily interacting new, isolated graphs.
When a user provides content that is liked by many other users (who you trust), this might be an indicator that there is a real human creating it.

We had a similar issue recently on our website, it is really a hassle to solve this issue if you are providing a business over one time or monthly recurring free credits system.
We are using a fraud detection solution https://fraudradar.io for a while and that helped us a lot to clean out most of the spam activities. It is pretty customizable with:
IP checks
Email domain validity
Regex rules
Whitelisting options per IP, email domain etc.
Simple API to communicate through
I would suggest to check that out.

Anyone ever tried to use Twitter to replace comments sections on web apps?

Here's the scenario I'm imagining.
Simple blog, users typically post comments in a comments form at the bottom of each blog article. Instead of that, using the Twitter API, pull tweets based on a hashtag. Base the hashtag on the article id (i.e. #site10201) where site is a prefix and the number is the article id.
Then provide a link to post a tweet using the hashtag., which would then get picked up in your twitter api pull.
I'm imagining horrible spam issues, but other than that, bad idea?

Has some drawbacks to more run-of-the-mill database systems:
Additional network overhead. Most self-hosted blogs would typically rely on database and blog being on the same server (physical or virtual) so db-lookup is fast (and reliable) compared to twitter API requests.
Caching issues. One host is only allowed X requests of twitter at a time (the next request is going to end up a 404), and how are you going to manage that from your website for a scenario which becomes steadily more complex as multiple articles are added? Presumably you need to authenticate so the easy-way out is a security liability. (The easy way out being to use JavaScript on the at the browser to perform the actual request, neatly circumvents the problem in 20/80 fashion.) Granted most blogs don't get that kind of traffic. ;)
You tie your precious or not so precious comments to the mercy of the fail whale. Which is kind of odd considering a self-hosted blog basically means you want to have such control in the first place by not using a service like blogger.
Is it possible to ensure unicity of hash tags --in the general case? What are you going to do if someone had the same bright idea, only took the name of the tag 5ms before you did? Would you end up pulling the drivel of someone else's blog comments rather than the brilliance you have come to expect from yours? ;)
Lesser point: you rely on others to have a twitter account. Anonymous replies are off the table.
TOS and other considerations that may be imposed on you by twitter, either now or in future. (2) is actually a major item of Twitter's TOS.

Do we really need email confirmation?

I've gotten into a habit of using the standard register->send activation email->activate account process for every site that supports user authentication and free registration without questioning if I really need this.
What are your thoughts on this? If I have captcha on the registration form is the email confirmation process really necessary?
EDIT:
OK, so the general consensus seems to be that by getting the users to confirm the email they entered I'll keep them away from putting someone else's email in there.
What about when I let users edit their profile/settings and they enter another email?
If I need to keep them away from entering other people's addresses then I'd need to confirm that email address (by temporarily deactivating their accoun)t every time they change it.

Captcha+activation prevents bots AND spoofed people
Well basically it is since each part prevents one problematic scenario:
Captcha prevents (if you use strong captcha like reCaptcha) bots from registering new users
Email activation prevents people from registering other people (by their email address)
I guess this is a valid everyday pattern for registration that's widely acknowledged by IT community.
EDIT
Yes. When you want to prevent users from changing their email address, you'd have to repeat email activation procedure to make it robust.
But you don't have to deactivate their account while doing it. All you have to do is having a pending email-change email activation active. If it gets activated, you change email address at that point (not when they change it), otherwise the old one is still used.

If you don't confirm an e-mail, you're supposing that the user registering that service owns that email account. How can you start sending a lot of system e-mails, reset passwords and etc to a person that has nothing to do with your system? I would be really pissed of if it was my e-mail.
Another scenario: what if the register mispelled his e-mail when registering? Suppose he doesn't check his "account settins" in your application, doesn't change his email, and needs to reset his password. If the e-mail is registered in a wrong way, it's your fault for not checking it before.
Of course, I'm just saying this to services that would REALLY demand an account to be created. Avoid the login barrier when possible, or use openid when your service isn't so critical.

You should give serious consideration to supporting OpenID. http://openid.net/get-an-openid/what-is-openid/
The key benefit for OpenID is that it reduces the complexity for your user. There is no reason to force people to remember login credentials for hundreds of sites when a viable alternative exists. There is no worldwide netizen database - and there likely never will be - but OpenID simplifies the situation greatly.
I know that as a user I found the registration process for Stack Overflow to be painless and easy. I wish more sites used OpenID.

It's the lowest-level attempt at identity validation. It encourages users to re-use the same account when they return (by having a common, shared identifier you and they can use to reconnect), and it prevents impersonation, because it requires access to the claimed identity as proof.
It's not perfect, but something by definition works infinitely better than nothing.
If identity doesn't matter on your site (e.g. your service is throwaway after each use) then you don't need email activation. Otherwise, you probably want it.

On my site, I let users sign-up and do everything non-public until they confirm their email address. Because I run a gaming website, it means users can earn medals, post scores, just not post in the forum or post comments in the blog until they verify their email address.
I find it works pretty well. I have 16,000 registered users.

I find it both unnecessary and annoying. If I can, I avoid doing this.
However, I do do this if 1) email will be sent by the program, so I can test if the email address is valid, or 2) this is a very large, public-facing website, in which case I want to filter out as many potential problems as possible.

For most basic sites, I don't bother with either. Both email activation and captcha are relatively easy for dedicated spammers to bypass and overcome and do little but cause an annoyance to most of the users, driving away at least a certain percentage who might have otherwise signed up. I've found in my experience, focusing more on spam filters for member posted content has a better ROI overall.
For sites with more serious content, you'll typically have more serious users. In cases like that, I'll throw everything I've reasonably got available at it to counter the spam.

I find it useful when an email is sent for confirmation. This makes sure that I am the one who has registered with that email address.
Even with captcha you can register someone else email address although he may or may not approve that confirmation.

You only seem to need e-mail confirmation to confirm identity, not to send useful content by e-mail. But e-mail confirmation is only one means to an end. You may consider others, preferablly less intrusive ones.
Generally you can check something that
you are (e.g. fingerprint, iris scan)
you have (e.g. token, creditcard, key, access to an e-mail account)
you know (e.g. PIN, password, your mom's weight, name of your favorite deceased pet, the optimistic length of your most private bodyparts measured in inches)
Also, you can delegate the check to others; the creditcard company, phone company, someone's friends.
Example: GoogleMail could not ask
for a confirmation e-mail address
upon creation of your GMail account.
Instead, the early adopters had a limited supply of
"invites" to share with friends.
So - unless you actually need me to receive information you'd e-mail, which I generally hate anyway - you might be inclined to resort to more creative/fun means.

Should you worry about fake accounts/logins on a website?

I'm specifically thinking about the BugMeNot service, which provides user name and password combos to a good number of sites. Now, I realize that pay-for-content sites might be worried about this (and I would suspect that most watch for shared accounts), but how about other sites? Should administrators be on the lookout for these accounts? Should web developers do anything differently to take them into account (and perhaps prevent their use)?

I think it depends on the aim of your site. If usage analytics are all-important, then this is something you'd have to watch out for. If advertising is your only revenue stream, then does it really matter which username someone uses?
Probably the best way to discourage use of bugmenot accounts is to make it worthwhile to have an actual account. E.g.: No one would use that here, since we all want rep and a profile, or if you're sending out useful emails, people want to receive them.

Ask yourself the question "Why do we require users to register to access my site?" Once you have business reason for this requirement, then you can try to work out what the effect of having some part of that bypassed by suspect account information.
Work on the basis that at least 10 to 15 percent of account information will be rubbish - and if people using the site can't see any benefit to them personally for registering, and if the registration process is even remotely tedious or an imposition, then accept that you will be either driving more potential visitors away, or increasing your "crap to useful information" ratio.

Not make registration mandatory to read something? i.e. Ask people to register when you are providing some functionality for them that 'saves' some settings, data, etc. I would imagine site like stackoverflow gets less fake registrations (reading questions doesn't require an account) than say New York Times, where you need to have an account to read articles.
If that is not upto your control, you may consider removing dormant accounts. i.e. Removing accounts after a certain amount of inactivity.

That entirely depends.
Most sites that find themselves listed in bugmenot.com tend to be the ones that require registration for in order to access otherwise-free content.
If registration is required in order to interact with the site (ie, add comments/posts/etc), then chances are most people would rather create their own account than use one that has been made public.
So before considering whether to do things like automatically check bugmenot - think about whether their are problems with your business model.
There are a few situations where pay-to-access content sites (I'm thinking things like, ahem, 'adult' sites) end up with a few user accounts being published publically (usually because someone has brute-forced some account details), and in that case there may be a argument for putting significant effort into it.

From an administrator viewpoint absolutely. That registration is required for a reason, even if it's something just as simple as user tracking/profile maintaining. Several thousand people using that login entirely defeats the purpose. IP tracking could help mitigate this problem, but it would definitely be hard to eliminate entirely.

No need to worry about BugMeNot: http://www.bugmenot.com/report.php

With bugmenot, keep in mind that this service is not actually there to harm the sites, but rather to make using them easier. You can request to block your site if it is pay-per-view, community-based (i.e. a forum or Wiki) or the account includes sensible information (like banking). This means in virtually all situations where you would think that bugmenot is a bad thing, bugmenot does not want to be used. So maybe things are not as bad as you might think.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008