Many users and forum programs, in an attempt to make automatic e-mail address harvesting harder, conceal addresses via obfuscation: # is replaced with "at" and . is replaced with "dot", so
team#stackoverflow.com
now becomes
team at stackoverflow dot com
I'm not an expert in regular expressions and I'm really curious - does such obfuscation really make automatic harvesting harder? Is it really much harder to automatically identify such obfuscated addresses?
Definitely!
I read this article a while ago which shows how effective (as well as the relative degree) the various methods can be.
Reversing an already reversed string seems to be fairly decent protection at the moment.
The following code sample:
<style type="text/css">
span.codedirection { unicode-bidi:bidi-override; direction: rtl; }
</style>
<p><span class="codedirection">moc.etalllit#7raboofnavlis</span></p>
Will output the email so it's readable at least.
That said, it is almost an arms race. But as long as you're ahead of the curve, it will take more effort to harvest your address than ordinary un-obfuscated ones.
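As a rough illustration of that arms race, the sketch below shows how little code a harvester would need to also check reversed text. The pattern and the sample address (written with the usual @ sign) are assumptions made for this example and are not taken from any real harvester.

```python
import re

# Deliberately loose pattern for something@domain.tld (an assumption for this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_addresses(text):
    """Look for addresses in the text as written and in the reversed text."""
    return EMAIL.findall(text) + EMAIL.findall(text[::-1])


# The reversed markup text is not an address as written, but its mirror image is.
print(find_addresses("moc.elpmaxe@eman"))
```

Once a harvester adds a step like this, the reversal trick stops helping, which is why staying ahead of the curve matters more than any single technique.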
Obfuscation techniques fall into the same category as captchas. They are not reliable and tend to hurt regular users more than bots.
JavaScript obfuscation seems to be praised, but it is no silver bullet: it is not that hard today to automate a browser for email sniffing. If it can be displayed in a browser, it can be harvested. You could even imagine a bot that takes screenshots of a browser window and uses OCR to extract addresses to beat your million-dollar obfuscation technique.
Depending on where and why you want to obfuscate emails, these techniques could be useful:
Restrict email visibility : you may hide emails on your website/forum to anonymous users, to new users (with little to no activity or posts to date) or even hide them completely and replace email contact between members with a built-in private messaging feature.
Use a dedicated spam-filtered email : you will get spammed, but it will be limited to this particular address. This is a good trade-off when you need to expose the email address to any user.
Use a contact form : while bots are pretty good at filling forms, it turns out that they are too good at filling forms. Hidden field techniques can filter most of the spam coming through your contact form.
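The hidden-field idea can be sketched as follows. This is a minimal example under assumptions: the field name "website" is hypothetical, the field is assumed to be hidden from humans with CSS, and the form is assumed to arrive as a plain dictionary.

```python
def looks_like_bot(form, honeypot_field="website"):
    """Return True when the hidden honeypot field was filled in.

    Humans never see the field, so they leave it empty; bots that fill
    every input they find give themselves away.
    """
    return bool(form.get(honeypot_field, "").strip())


human = {"name": "Ann", "message": "Hello", "website": ""}
bot = {"name": "x", "message": "buy now", "website": "http://spam.example"}
print(looks_like_bot(human), looks_like_bot(bot))
```

Submissions flagged this way can be dropped silently, so the bot gets no hint about why it failed.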
When I see this type of obfuscation I also immediately think of regular expressions. It's a piece of cake to harvest emails "obfuscated" in this manner.
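To show how small that piece of cake is, here is a sketch of such a harvester. The patterns are assumptions chosen for illustration; the code writes addresses with the usual @ sign, and it will produce false positives on ordinary sentences that contain "at" and "dot".

```python
import re

# Common spellings of the separators: "at", "[at]", "(at)" and the same for "dot".
AT = re.compile(r"\s*(?:\[at\]|\(at\)|\bat\b)\s*", re.IGNORECASE)
DOT = re.compile(r"\s*(?:\[dot\]|\(dot\)|\bdot\b)\s*", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def harvest(text):
    """Undo the word substitutions, then match plain addresses."""
    normalized = DOT.sub(".", AT.sub("@", text))
    return EMAIL.findall(normalized)


print(harvest("Mail team at stackoverflow dot com today"))
```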
I once came up with an idea to publish my email address in this way:
You can mail me here:
string myEmail = new System.Text.StringBuilder()
    .Append("myname")
    .Append("#")
    .Append("domain")
    .Append(".")
    .Append("com")
    .ToString();
Whoever cannot work it out has failed my basic intelligence test.
It will be difficult for the spammers as well as your users to identify the email address.
A nice article from wikipedia on Email obfuscation or address munging
One common way of hiding email from bots and spammers is to create an image containing the email address. Facebook does this, for instance. Now, using images for email is inherently bad for accessibility, because text readers will not be able to read it. But even otherwise, there are several free character recognition programs that do a pretty good job of decoding such email-images.
From here
I'm not sure if it really helps with spam - but I've learned to love the Escape Encode Obfuscation for mailto: tags/emails. An example tag:
team#stackoverflow.com
Mails team#stackoverflow.com
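A sketch of what such a tag could look like, assuming "escape encoding" here means percent-encoding every character of the address inside the mailto: link. The helper name and the link text are made up for this example, and the address is written with the usual @ sign.

```python
from urllib.parse import unquote


def escape_encode(address):
    """Percent-encode every byte of the address, including plain letters."""
    return "".join(f"%{byte:02X}" for byte in address.encode("utf-8"))


address = "team@stackoverflow.com"
href = "mailto:" + escape_encode(address)
print(f'<a href="{href}">Mail us</a>')

# Decoding is one standard-library call, so this only stops naive scrapers.
print(unquote(href))
```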
It's analogous to putting a "protected by ADT" sticker on your front door.
Will that prevent a talented burglar from entering your house? Of course not.
Will it make the house next door with an unlocked door and an iPod in the window a more compelling target? Pretty likely.
A simple unobfuscated email scraper is going to get TONS of emails as it is. Maybe a very simple regex to pick up very common obfuscation methods is worth the effort. Past that, you're spending a lot of time trying to decipher an increasingly small percentage of emails.
All that to say, having some clever obfuscation is probably worth it.
For the record, my email has been on my public resume in plain text for years now, because I use gmail, which has a spam filter that works.
I was wondering why nobody has mentioned A List Apart's (ALA's) solution so far.
Roel Van Gils wrote an article about Graceful Email Obfuscation in 2007.
Graceful Email Obfuscation is simply a JavaScript email obfuscation technique with a contact form fallback.
Email addresses are obfuscated by converting them into a URL pointing to a contact form and applying a ROT13 transform:
mailto:mail#example.com → contact/mail+example+com → contact/znvy+rknzcyr+pbz
Via javascript contact/znvy+rknzcyr+pbz is converted back to mailto:mail#example.com
If no javascript is available, the browser will open contact/znvy+rknzcyr+pbz as a fallback. The contact form will know where to send the email because of the url.
http://www.alistapart.com/articles/gracefulemailobfuscation/
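The transformation described above can be sketched in a few lines. This is only an illustration of the mapping, not the article's own code: the function names are made up, the address is written with the usual @ sign, and a "+" inside the local part would break this simple round trip.

```python
import codecs


def obfuscate(address):
    """mail@example.com -> contact/znvy+rknzcyr+pbz"""
    local, _, domain = address.partition("@")
    plain = "+".join([local] + domain.split("."))
    return "contact/" + codecs.encode(plain, "rot_13")


def deobfuscate(path):
    """contact/znvy+rknzcyr+pbz -> mailto:mail@example.com"""
    encoded = path[len("contact/"):]
    # ROT13 is its own inverse, so encoding again decodes.
    parts = codecs.encode(encoded, "rot_13").split("+")
    return "mailto:" + parts[0] + "@" + ".".join(parts[1:])


print(obfuscate("mail@example.com"))
print(deobfuscate("contact/znvy+rknzcyr+pbz"))
```

The server-side contact form would apply the same decoding to the URL path to find the recipient.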
It does make it harder but there are so many really smart scrapers that it probably doesn't help a lot, since the big spammers are using the high quality spam tools.
How to fight spammers? Make the email address less recognizable to something without a brain (i.e. a computer).
Non-English speakers are your friends: if your user base is a non-English-speaking community, switch to obfuscating in another language: team_małpa_stackoverflow_kropka_com or team_Affenschwanz_stackoverflow_Punkt_com are perfectly recognizable email addresses for Polish- and German-speaking communities respectively. Some email harvesters know Polish or German, but chances are that most harvesters understand only English.
If you cannot leave English, then switch to descriptive phrases, like: “in order to send us a message, write team in your address field, then put the symbol AT, then write the name of our site!”.
To provide a literal answer, yes, harvesting obfuscated addresses is harder than harvesting standardized addresses. The real question is whether the extra effort will be put in by harvesters and if the (major? minor?) barrier to the harvesters is worth the possible problems for your users.
If you are going to scramble addresses or otherwise transpose them away from the standard form, you should avoid being consistent in how you do so – at least on the same site.
For example, if every email address on a large community site is reversed in the markup and rendered properly with CSS, or token-replaced (# becomes 'at'), or any other predictable method, the harvesters will just write a thin adapter for your site.
Think of it this way: if it only takes you one line of code to "scramble" them sitewide, it will only take the harvester one line of code to "unscramble" them for your site. Roughly speaking.
In my opinion, spam has become such a problem and so many DBs have been turned over that we're beyond hiding our addresses. Instead, consider looking at Defensio and Akismet, etc, to help classify and block spam.
I have a solution, well, more of a theory.
The problem is that the bots parse the page. They can get the text, even if it's being put
into the page in some sophisticated way through JavaScript.
So, just use a CSS3 pseudo-element! It won't be a link, but your email will be visible and will never be actual text in the page. Something like this:
.email::after{ content:'myemail#gmail.com'; }
Again, it's a theory. I've no idea how far these evil people will go to get it, but I think this should be pretty safe (unless they parse the CSS files, which I don't think they do).
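For what it's worth, parsing the CSS would not take much if a harvester decided to do it. A minimal sketch, assuming the stylesheet text has already been downloaded and using a made-up address with the usual @ sign:

```python
import re

# Matches content:'...' or content:"..." declarations; not a full CSS parser.
CONTENT = re.compile(r"""content\s*:\s*(['"])(.*?)\1""")


def css_content_strings(css):
    """Return the string values of all content declarations in the stylesheet."""
    return [match.group(2) for match in CONTENT.finditer(css)]


print(css_content_strings(".email::after{ content:'myemail@example.com'; }"))
```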
It does make it harder to a degree, but the simple ones used by users even today (the [dot] and [at]) are obsolete and can be captured easily using a simple regex by spammers.
Using something as simple as an image would be helpful and readable for the intended human reader without effort to 'decrypt' the encoded email id.
Contact email:
If you are still paranoid about spam bots equipped with character recognition, then something like this would be effective.
It uses an optical illusion to its advantage: the human mind completes letters that computer vision cannot easily make out. Applying a CAPTCHA-like overlay can also help, but I doubt you need to go that far.
Related
I was wondering if there was any way to dynamically obfuscate HTML on a live server but not offline, so that as soon as my website was visited the source would be obfuscated rather than in plain text.
Since the client (browser) will have to parse it into a sensible DOM tree, this is pretty much fruitless. These days it's a lot more common to inspect a site using Firebug/Webkit Inspector, which provides a nicely formatted, navigable tree. Most people won't even notice that the HTML is "obfuscated", much less be stopped by it.
Executable code can be obfuscated by minimizing variable names and such without changing the result. HTML is the result though, if you change anything about it, the result will change. So "obfuscation" would mostly be limited to creative use of spacing anyway.
The real question you should ask yourself is "why do I need to obfuscate HTML?". If you're hiding sensitive information, then you should be either encrypting that data, or never presenting it to the client.
Most sensitive information or transactions should take place on the server, and the client only receives a token, or encrypted information, or a unique transaction identifier that can be passed back and forth.
Let me put it this way: There's no way to dynamically obfuscate the HTML on your site such that any reasonably competent person couldn't get it anyway.
You could use JavaScript to attempt to obfuscate it, but you'd have to do it in a way that didn't actually affect the DOM.
You could generate the contents of the page itself with JavaScript, but that is likely to damage accessibility, and once again the DOM will have to be in a condition the browser can use.
You could insert massive amounts of whitespace into the source, but that is easily overcome as well.
All this, and you make it harder and more annoying to manage your site. Minification has its purpose, but obfuscation here is lose-lose.
You could search for and remove all tabs, newlines, extra spaces, and comments.
If you are using PHP, IonCube has a plugin. It can be found here: http://www.ioncube.com/html_encoder.php. It turns your HTML page into minified JavaScript.
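A rough sketch of that kind of stripping, using regular expressions rather than a real HTML parser. It is an assumption-laden example: it would mangle whitespace-sensitive content such as pre blocks, and removing the space between inline tags can change how a page renders.

```python
import re


def strip_html(source):
    """Remove comments and collapse whitespace in an HTML string."""
    without_comments = re.sub(r"<!--.*?-->", "", source, flags=re.DOTALL)
    collapsed = re.sub(r"\s+", " ", without_comments)
    # Drop whitespace that sits only between two tags.
    return re.sub(r">\s+<", "><", collapsed).strip()


page = "<p>\n\tHello   world\n</p>\n<!-- note -->\n<p>Bye</p>"
print(strip_html(page))
```

As the other answers point out, the result is smaller, but any browser inspector will still show it as a tidy tree.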
I know that spam bots scour web sites and harvest emails, however I wasn't sure about the extent of information that they search for (for instance, names, physical addresses, phone numbers, etc.)
In essence, my question boils down to:
"Do spam bots search web pages for physical addresses, and am I helping them through the use of the <address> HTML tag?"
EDIT: I should have been more specific with my question. If I use proper techniques to obfuscate the sensitive information so they wouldn't detect it otherwise, would enclosing that content inside the address tag be like offering it up to spam bots on a silver platter?
It depends entirely on how the bot is programmed to recognize addresses. Some probably pull anything in an <address> tag and assume that it's an address. Others might ignore the html tags entirely. The big ones probably use a combination of techniques, including showing suspected addresses to humans and having them help it recognize the actual address therein.
I'd just assume that any information you post publicly on the Internet can be gathered either by bots or by humans and put to ill use. If you don't want the information to be public, protect it with passwords, encryption, and such.
Spam bots will search for anything that looks like the information that they want to find, and not just information that is tagged up properly. Avoiding a specific tag will not make any difference, spammers don't play by any rules.
ct... ess
It's said to be the same as:
<a href="mailto:myaddress#mydomain.com">contact</a>
but supposedly it works against email-harvesting robots.
They're numeric character entities, trying to trick spiders into not seeing "mailto" or characters in the form of an email address. And as an anti-harvesting strategy, it probably hasn't worked since 1997 or so. :-)
It assumes that spambot spiders treat webpages as text to regex match against instead of performing the most basic HTML parsing.
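The point about basic HTML parsing is easy to demonstrate: decoding numeric character entities is one standard-library call. The encoder below is a hypothetical helper written for this example, and the address uses the usual @ sign.

```python
import html


def entity_encode(text):
    """Replace every character with its decimal numeric character entity."""
    return "".join(f"&#{ord(ch)};" for ch in text)


encoded = entity_encode("mailto:myaddress@mydomain.com")
print(encoded)

# Any spider that decodes entities, as every HTML parser does, sees the plain address.
print(html.unescape(encoded))
```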
This:
ct... ess
makes email harvesting a little more difficult than its counterpart:
<a href="mailto:myaddress#mydomain.com">contact</a>
However, there are ways to decode even that, so it is not very useful in practice :(
Internationalizing web apps always seems to be a chore. No matter how much you plan ahead for pluggable languages, there's always issues with encoding, funky phrasing that doesn't fit your templates, and other problems.
I think it would be useful to get the SO community's input for a set of things that programmers should look out for when deciding to internationalize their web apps.
Internationalization is hard. Here are a few things I've learned from working on two websites that were in over 20 different languages:
Use UTF-8 everywhere. No exceptions. HTML, server-side language (watch out for PHP especially), database, etc.
No text in images unless you want a ton of work. Use CSS to put text over images if necessary.
Separate configuration from localization. That way localizers can translate the text and you can deal with different configurations per locale (features, layout, etc). You don't want localizers to have the ability to mess with your app.
Make sure your layouts can deal with text that is 2-3 times longer than English. And also 50% less than English (Japanese and Chinese are often shorter).
Some languages need larger font sizes (Japanese, Chinese)
Colors are locale-specific also. Red and green don't mean the same thing everywhere!
Add a classname that is the locale name to the body tag of your documents. That way you can specify a specific locale's layout in your CSS file easily.
Watch out for variable substitution. Don't split your strings. Leave them whole like this: "You have X new messages" and replace the 'X' with the number.
Different languages have different pluralization rules, with separate forms for ranges such as 0, 1, 2-4, 5-7, 7-infinity. Hard to deal with.
Context is difficult. Sometimes localizers need to know where/how a string is used to make sure it's translated correctly.
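The substitution and pluralization points can be sketched together. This is a toy example under assumptions: the message tables and function names are made up, the Polish strings are illustrative translations, and a real project would normally take plural rules from its i18n library rather than hand-code them.

```python
def plural_en(n):
    return "one" if n == 1 else "other"


def plural_pl(n):
    # Polish uses a separate form for numbers ending in 2-4, except 12-14.
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


RULES = {"en": plural_en, "pl": plural_pl}

# Whole sentences with a named placeholder; nothing is glued together from fragments.
MESSAGES = {
    "en": {
        "one": "You have {count} new message",
        "other": "You have {count} new messages",
    },
    "pl": {
        "one": "Masz {count} nową wiadomość",
        "few": "Masz {count} nowe wiadomości",
        "many": "Masz {count} nowych wiadomości",
    },
}


def new_messages(locale, count):
    category = RULES[locale](count)
    return MESSAGES[locale][category].format(count=count)


for n in (1, 3, 5, 22):
    print(new_messages("en", n), "|", new_messages("pl", n))
```

Because each translation is a complete sentence, the translator can move the placeholder wherever the target language needs it.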
Resources:
http://interglacial.com/~sburke/tpj/as_html/tpj13.html
http://www.ryandoherty.net/2008/05/26/quick-tips-for-localizing-web-apps/
http://ed.agadak.net/2007/12/one-potato-two-potato-three-potato-four
In my company all our strings are stored in *.properties files. Our build tools build a "test language" copy of the properties files, which replaces a string like this:
Click here
with something like this:
[~~ Çļïčк н∑ѓё ~~ タウ ~~]
Now, when we set the language to "test" in our config files, these properties files are used. (And of course we don't ship the test language files).
This allows us to:
Make sure that Unicode characters are displayed correctly, including Japanese/Chinese/Korean.
Make sure that the layout scales appropriately for languages with longer words (German in particular has longer words on average than English).
Spot any hard-coded strings (as they will be in plain-English).
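A minimal sketch of how such a "test language" string could be generated. The character table, the padding text and the function name are assumptions for illustration and do not reproduce our actual build tool.

```python
# Look-alike replacements for a few ASCII letters (an illustrative subset).
LOOKALIKES = {
    "C": "Ç", "a": "å", "c": "č", "e": "ё", "h": "н", "i": "ï",
    "k": "к", "l": "ļ", "o": "ö", "r": "ѓ", "u": "ü",
}


def pseudo_localize(text, padding="タウ"):
    """Swap letters for look-alikes and pad the string to simulate longer languages."""
    translated = "".join(LOOKALIKES.get(ch, ch) for ch in text)
    return "[~~ " + translated + " ~~ " + padding + " ~~]"


print(pseudo_localize("Click here"))
```

The brackets make truncated strings easy to spot, and any text that still appears in plain English on screen was never passed through the properties files.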
As for the actual translation, this is done by professional translators, not developers.
As an English person living abroad I have become frustrated by many web applications' approach to internationalization and have blogged about my frustrations.
My tips would be:
think about how you show an international version of a page
using geolocation might work for many users, but as my examples show for many it will not
why not use the Accept-Language header for determining which language to serve
if a user accesses a page via a search engine then don't redirect them somewhere else e.g. to a homepage in a different language
it's extremely annoying to change language and have a different page reload - either serve the same page or warn the user that the current content is not available in a different language before redirecting them
English is a very common language, so perhaps default to that
But make sure the change language option is clear on the GUI (I like what Google Maps are doing, as shown in the post)
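The Accept-Language suggestion could look roughly like this on the server. It is a simplified sketch: the function names and the set of supported languages are assumptions, and it ignores wildcards and other details of full language-range matching.

```python
def parse_accept_language(header):
    """Return (language tag, quality) pairs, best first."""
    preferences = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        preferences.append((tag.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep the order the browser sent.
    return sorted(preferences, key=lambda pair: pair[1], reverse=True)


def choose_language(header, supported, default="en"):
    """Pick the first acceptable language the site supports, else the default."""
    for tag, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if tag in supported:
            return tag
        primary = tag.split("-")[0]
        if primary in supported:
            return primary
    return default


print(choose_language("de-CH,de;q=0.9,en;q=0.8", {"en", "de"}))
```

Whatever the header says, the visible language switch should still override it, since the browser setting is only a hint.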
All I see on the Web is companies getting internationalization wrong. Getting it right from a user's perspective is tricky indeed.
I have a couple apps that are "bilingual"
I used resource files in ASP.NET 1.1.
There is also something called the String Resource Tool.
Basically you put all your strings in a .RES file for each language and then determine which file to read from based on the Culture or on whether someone clicked a link for the language.
The biggest gotcha is making sure the translations are done correctly.
What guidelines can you give for rich HTML formatting in emails while maintaining good visual stability across many clients and web based email interfaces?
An unrelated answer on a question on Stack Overflow suggested:
http://www.campaignmonitor.com/blog/archives/2008/05/2008_email_design_guidelines.html
Which contains the following guidelines:
Place stylesheet in <body> instead of <head>
Some email clients will strip CSS out of the head, but leave it if the style block is (invalidly) in the body.
Use inline styles wherever possible
Gmail will strip any stylesheet, whether in the <head> or in the <body>, but honor inline styles assigned using the style="" attribute
Return to tables
Email standards have actually taken a giant step backwards in recent years thanks to Outlook 2007 using the Microsoft Word rendering engine. Unlearn most of what you learned about positioning without stylesheets.
Don't rely on images
Most clients and most web based email clients will not display images unless the user specifically requests them to be displayed.
I also have a few "unconfirmed" truths that I don't remember where I read them.
Don't use more than two levels of nesting in tables
Is this true? What is likely to happen if I do? Are there any particular clients that choke on this?
Be careful of nesting background images in cells/tables
As I understand you may encounter situations where the background image is applied in the descending table/cell completely anew, and not just "shining through". Again, true or not? Which clients?
I would like to flesh out this list with more guidelines and experiences from the trenches.
Can you offer any further suggestions?
Update: I'm specifically asking for guidelines for the design part in HTML and consistency thereof. Questions about general guidelines for avoiding spam filters, and common courtesy, are already on SO.
It's actually really hard to make a decent HTML email, if you approach it from a 'modern HTML and CSS' perspective.
For best results, imagine it's 1999.
Go back to tables for layout (or preferably - don't attempt any complex layout)
Be afraid of background images (they break in Outlook 2007 and Gmail).
The style-tag-in-the-body thing is because Hotmail used to accept it that way - I'm pretty sure they strip it out now though. Use inline styles with the style attribute if you must use CSS.
Forget entirely about float
Remember your images will probably be blocked - use background and text colour to your advantage - make sure there is some readable text with images disabled
Be very careful with links, be especially wary of anything that looks like a URL in the link text - you will anger 'phishing' filters (eg www.someotherdomain.tld is bad)
Remember that the "fold" on webmail clients tends to be extremely high up the page (on a 1024x768 screen, most interfaces won't show more than a hundred pixels or so) - get your identity stuff in right at the top so the recipient knows who you are.
Recent versions of Outlook have a "portrait" preview pane which is significantly narrower than you may be expecting. Be very wary of fixed-width layouts; if you must use them, make them as narrow as you can.
Don't even think about flash, Javascript, SVG, canvas, or anything like that.
Test, a lot. Make sure you test in a recent Outlook (things have changed a lot! It now uses Word as its HTML rendering engine, and it's crippled: Word 2007 HTML/CSS support). Gmail is pretty finicky also. Surprisingly Yahoo's webmail is extremely good, with nice CSS support.
Good luck ;)
Update to answer further questions:
Don't use more than two levels of nesting in tables
I believe this is an older guideline pertaining to Lotus Notes. Nested tables should be okay, but really, if you have a layout that's complicated enough to need them, you're probably going to have trouble anyway. Keep your layout simple.
Be careful of nesting background images in cells/tables
This may be related to the above, and the same applies, if you're getting that complicated then you will have problems. Recent versions of Outlook don't support background images at all, so you'd be best advised to forget about them entirely.
Always use multipart MIME and provide a plain-text alternative.
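As one way to do that, here is a sketch using Python's standard email package. The addresses, subject and body text are placeholders, and sending the message is left out.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, text, html_body):
    """Build a multipart/alternative message: plain text first, HTML second."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(text)                           # text/plain part
    msg.add_alternative(html_body, subtype="html")  # text/html part
    return msg


msg = build_message(
    "news@example.com",
    "reader@example.com",
    "Monthly update",
    "Hello,\n\nThis is the plain-text version.",
    "<p>Hello,</p><p>This is the <b>HTML</b> version.</p>",
)
print(msg.get_content_type())
print(msg["MIME-Version"])
```

Clients that cannot or will not render HTML show the first part, and the library adds the MIME-Version and Content-Type headers itself.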
The folks behind Campaign Monitor also started an Email Standards Project web site with a lot of good information.
Take a look at this boilerplate, it is like html5boilerplate, but for emails:
http://htmlemailboilerplate.com/
I think this is lower level than the question you are asking, but if you really want an html email to be correctly viewed by as many clients as possible, make sure it's using valid MIME. In particular, for an email to be considered as valid MIME, the headers MUST (in the RFC sense of the word) contain both of these headers:
MIME-Version:
Content-Type:
Very strict clients will display your HTML as raw text if one or the other of these is missing. You'd be surprised how many large online vendors who should know better have screwed this up (notably, I've gotten HTML emails w/ missing MIME-Version: headers from Amazon and the ACM in the past)
Background images are not reliable.
Practically a no-brainer, but no javascript.
Use an editor that lets you send the current file/buffer as an email, or at the very least, find a program that lets you send the contents of a file as an HTML email. Do not test your emails by copying the HTML and pasting it into Outlook (or any other mail program, for that matter).
Three words of advice: test, test, test.
Check out LitmusApp.com's email testing service. You send them a message and they render it in a bunch of clients and show you screenshots of the results. It's not perfect, but it's pretty good.
(Lotus Notes prior to 8.0 really, really stinks for HTML mail, by the way)
Also, beyond just inline CSS styles, I recommend switching to tags wherever possible.
Embed your images, don't link to them.
This is bad :
<img src="http://myserver.com/myImage.jpg" alt="Lolkat"/>
This is good :
<img src="cid:myImage" alt="Lolkat"/>
Yeah, it looks weird but check out this guide regarding embedding images in emails.
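For illustration, this is roughly how a cid: reference is wired up with Python's standard email package. The image bytes, the example.com domain and the function name are placeholders; a real message would read an actual image file.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_message_with_inline_image(image_bytes):
    """Attach the image to the HTML part and reference it by Content-ID."""
    msg = EmailMessage()
    msg["Subject"] = "Inline image example"
    msg.set_content("Plain-text fallback for clients that do not render HTML.")

    # make_msgid returns the id wrapped in angle brackets; the src uses it without them.
    cid = make_msgid(domain="example.com")
    msg.add_alternative(
        f'<p><img src="cid:{cid[1:-1]}" alt="Lolkat"/></p>', subtype="html"
    )

    # The HTML part becomes multipart/related and carries the image with it.
    html_part = msg.get_payload()[1]
    html_part.add_related(image_bytes, maintype="image", subtype="jpeg", cid=cid)
    return msg, cid


msg, cid = build_message_with_inline_image(b"\xff\xd8\xff placeholder bytes")
print(msg.get_payload()[1].get_content_type())
```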
If you're including a style block don't begin any new line with ".classname" or "." anything. Put a brace or something before the period. If you don't do this some web mail systems will not properly display your style sheets.
Many people have incorrectly assumed they cannot use CSS blocks in emails because of this behavior... IIRC a line containing only "." marks the end of the message body in SMTP. Systems tend to escape leading periods in their mail stores to prevent part of one message from being misread as the end of the message. The way this is handled tends to break any style starting on a new line with a period.
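The escaping in question is usually called dot-stuffing: when sending, any line that starts with a period gets an extra period, and the receiver removes it again. A sketch of both directions, with made-up function names and assuming CRLF line endings:

```python
def dot_stuff(body):
    """Prefix an extra '.' to every line that starts with '.' (sender side)."""
    lines = body.split("\r\n")
    return "\r\n".join("." + line if line.startswith(".") else line for line in lines)


def dot_unstuff(body):
    """Remove the extra leading '.' again (receiver side)."""
    lines = body.split("\r\n")
    return "\r\n".join(line[1:] if line.startswith(".") else line for line in lines)


css = "p { margin: 0 }\r\n.note { color: red }"
stuffed = dot_stuff(css)
print(stuffed)
print(dot_unstuff(stuffed) == css)
```

A system that stuffs without unstuffing leaves "..note" in the stylesheet, and one that strips a period that was never doubled leaves "note"; either way the selector no longer matches, which fits the advice above to avoid starting a line with a period.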