What's the principle behind this? - html

ct... ess
It's said to be the same as:
<a href="mailto:myaddress@mydomain.com">contact</a>
But it can work against email-harvesting robots.

They're numeric character entities, trying to trick spiders into not seeing "mailto" or characters in the form of an email address. And as an anti-harvesting strategy, it probably hasn't worked since 1997 or so. :-)

It assumes that spambot spiders treat webpages as text to regex match against instead of performing the most basic HTML parsing.
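For illustration, here is a minimal Python sketch of that entity trick (the address, link text, and helper name are placeholders, not from the original post); it just rewrites every character of the mailto: URL and label as a numeric character reference:

# Rewrite every character as a numeric character reference (&#NNN;),
# so the raw page source never contains "mailto:" or a plain address.
def entity_encode(text):
    return "".join("&#{};".format(ord(ch)) for ch in text)

address = "myaddress@mydomain.com"  # placeholder address
print('<a href="{}">{}</a>'.format(entity_encode("mailto:" + address),
                                   entity_encode("contact")))

A browser decodes the references and renders a normal link, which is exactly why the trick only fools the most naive scrapers.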

This:
ct... ess
makes email harvesting a little more difficult than its counterpart:
<a href="mailto:myaddress@mydomain.com">contact</a>
However, there are ways to decode even that, so it's not that useful in practice :(


Way to validate HTML code for Gmail?

I got some auto-generated HTML code working. I made sure it is correctly parsed by http://validator.w3.org/, and it is working HTML 4.01 Strict.
Now, when I embed this code in an email and send it to Gmail, the result is quite adverse (it messes up the formatting).
The code is quite long, and apparently only happens when it has such size. This tells me two things:
not worth putting a code snippet here
it is probably some conflicting tag, but still considered valid by the parser
Would you guys know of any tool even more strict to validate my HTML? Maybe even something specific for gmail?
Or maybe, just some pro tip on what usually screws the code for gmail.
PS: the code, although long, is also quite simple - only a few tables and styles. I took some care to make sure I used only "email friendly" tags and formats.
Did you say styles? Oh, boy! Email clients all do things differently, and even if you get it working for Gmail, it may not work for Yahoo.
You may want to look at something like Email CSS guide to start, but really you also want to use some of the Inbox Analysis services (e.g. Inbox Inspector from MailChimp) to get a better picture.
I haven't done this myself (yet), but I have seen it mentioned over and over that this is one area you can lose your hair over.
You have to code like it's 1999 and not worry about being so tied to HTML conformity.
Unfortunately valid HTML just doesn't work for some (most) email clients. Even Gmail will strip or ignore things probably for security reasons. The best bet for an email is basically HTML 3. Some inline styles for fonts. I know that <p> tags break in Gmail, and in general colspan and rowspan won't work as intended and you have to use nested tables. Those are just a few things I can think of off the top of my head.
All the answers here helped, but the actual problem was elsewhere.
The problem was in the HTML, as I thought, but not exactly in my HTML.
It turns out the email client will wrap lines that are too long before handing them to the renderer, regardless of whether they are HTML code or not - more exactly, breaking tags in the middle. That explains why it was happening only when the report reached a certain length.
What tipped me off was looking at the code generated by MailChimp (suggested by Alexandre Rafalovitch) and noticing it was formatted as quoted-printable, cropped at exactly 75 chars for every line.
After that it was easy enough to do the same in my own code generator. Well, actually, I didn't even format it as quoted-printable, I only made sure it would wrap overly long lines by itself.
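If it helps anyone else, here is a rough Python sketch of that fix (assuming Python's standard quopri module; the report string is just a stand-in): encoding the body as quoted-printable keeps every line under the limit, so the client never has to break a tag itself.

import quopri

# Stand-in for the long auto-generated report.
html_report = "<table>" + "<tr><td>cell</td></tr>" * 500 + "</table>"

# Quoted-printable wraps output at 76 characters with soft line breaks ("="),
# so no tag is ever split by the mail client's own wrapping.
qp_body = quopri.encodestring(html_report.encode("utf-8")).decode("ascii")
print(max(len(line) for line in qp_body.splitlines()))  # <= 76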
Apart from that, for all I can tell, an HTML 4.01 Strict document will work pretty much fine in the Gmail client.
Hope it helps post-1999 generations.
cheers.

Regular Expressions vs XPath when parsing HTML text

I want to parse an HTML document and find specific parts - for example, the text in the 3rd div of the 1st row and 2nd column of a table. I have two options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each?
thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution — the very one shown above, in fact. To go off writing some massively overengineered behemoth to do a full parse and set up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of wacky things you aren’t planning for just cries out for leveraging someone else’s hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can and has done this, I, unlike 99.9999% of the responders here, have the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion,
here,
here,
here,
here,
here, and
here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from non-wizards. That difficulty may chase you away, which is OK, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With regexes, it will be up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parsers work if they're not using regexp?
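As a rough illustration of the XPath side (a sketch assuming Python with the lxml package; the markup and the exact path are invented to match the "3rd div, 1st row, 2nd column" example from the question):

from lxml import html

doc = html.fromstring("""
<table>
  <tr>
    <td>row 1, column 1</td>
    <td><div>a</div><div>b</div><div>target text</div></td>
  </tr>
</table>
""")

# XPath indexes are 1-based: 1st row, 2nd column, 3rd div.
print(doc.xpath("//tr[1]/td[2]/div[3]/text()"))  # ['target text']

The equivalent regex would have to account for attribute order, quoting styles, whitespace, and nesting, which is exactly the bookkeeping XPath does for you.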
XPath is less likely to break if the web developer does any minor changes. That would be my choice.
Here is the canonical Stackoverflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.

Is this URL invalid and not good practice?

I have a URL in this format:
http://www.example.com/manchester united
Note the space between manchester and united. Is this bad practice, or is it perfectly fine? I just wanted to know before I proceed. Thanks.
The space is not a valid character in URIs; you have to replace it with %20. It may also be considered bad practice. Replacing the space with -, + or _ is preferable; it is both “prettier” and doesn't require escaping of the URI.
Most browsers will still try to parse URIs with a space; but that's highly ambiguous.
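A quick Python sketch of the two options (standard library only; the URL is the one from the question):

from urllib.parse import quote

path = "manchester united"
print("http://www.example.com/" + quote(path))              # .../manchester%20united
print("http://www.example.com/" + path.replace(" ", "-"))   # .../manchester-united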
It's bad practice not only because browsers are required to turn the space into %20, obfuscating your users' address bars, but because it would be difficult to communicate the URL to anyone.
Furthermore, what about all of those "find links in text" regexes that are around Stack Overflow? You'd effectively break them all!
It will be displayed in the address bar as http://www.example.com/manchester%20united, which I personally think is far uglier than the alternative http://www.example.com/manchester_united.
I believe spaces in URLs are replaced with %20 by many browsers.
You will need to add %20 instead of the space; however, the browser will do it for you. I would rather not have any spaces in the URI.
Technically this will work. The browser will replace the space with a %20, and the server will translate it back.
But ... it's not generally a good idea because it can lead to ambiguity, or difficulty in communicating the URL to others, particularly in an advertising setting where you're expecting someone to type in a URL they've seen in print.
Maybe a question for: https://webmasters.stackexchange.com/
But...
If you enter that into a browser, it will add %20 between manchester and united. Technically you should do this in your HTML page, but most modern browsers can handle it. Common practice is to separate the words with a hyphen, i.e. http://www.example.com/manchester-united.
Look at the URL of this question for an example of this in action.
You can do that, but apparently it's bad style.
See the following: http://www.blooberry.com/indexdot/html/topics/urlencoding.htm

Does e-mail obfuscation really make automatic harvesting harder?

Many users and forum programs, in an attempt to make automatic e-mail address harvesting harder, conceal addresses via obfuscation - @ is replaced with "at" and . is replaced with "dot", so
team@stackoverflow.com
now becomes
team at stackoverflow dot com
I'm not an expert in regular expressions and I'm really curious - does such obfuscation really make automatic harvesting harder? Is it really much harder to automatically identify such obfuscated addresses?
Definitely!
I read this article a while ago, which shows how effective the various methods are, and to what relative degree.
Reversing an already reversed string seems to be fairly decent protection at the moment.
The following code sample:
<style type="text/css">
span.codedirection { unicode-bidi:bidi-override; direction: rtl; }
</style>
<p><span class="codedirection">moc.etalllit@7raboofnavlis</span></p>
Will output the email so that it's at least readable.
That said, it is almost an arms race. But as long as you're ahead of the curve, it'll be more effort to harvest your address than ordinary un-obfuscated ones.
Obfuscation techniques fall into the same category as CAPTCHAs. They are not reliable and tend to hurt regular users more than bots.
JavaScript obfuscation seems to be praised, but it is no silver bullet: it is not that hard today to automate a browser for email sniffing. If it can be displayed in a browser, it can be harvested. You could even imagine a bot taking screenshots of a browser window and using OCR to extract addresses to beat your million-dollar obfuscation technique.
Depending on where and why you want to obfuscate emails, these techniques could be useful:
Restrict email visibility: you may hide emails on your website/forum from anonymous users, from new users (with little to no activity or posts to date), or even hide them completely and replace email contact between members with a built-in private messaging feature.
Use a dedicated spam-filtered email: you will get spammed, but it will be limited to this particular address. This is a good trade-off when you need to expose the email address to any user.
Use a contact form: while bots are pretty good at filling forms, it turns out that they are too good at filling forms. Hidden-field techniques can filter most of the spam coming through your contact form.
When I see this type of obfuscation I also immediately think of regular expressions. It's a piece of cake to harvest emails "obfuscated" in this manner.
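For example, a single Python regex along these lines (a sketch that only targets this particular "at"/"dot" style) already recovers such addresses:

import re

text = "Contact us: team at stackoverflow dot com"
pattern = re.compile(
    r"([\w.+-]+)\s+(?:at|\[at\]|\(at\))\s+([\w-]+)\s+(?:dot|\[dot\]|\(dot\))\s+(\w+)",
    re.IGNORECASE)
for user, domain, tld in pattern.findall(text):
    print("{}@{}.{}".format(user, domain, tld))  # team@stackoverflow.com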
I once came up with an idea to publish my email address in this way:
You can mail me here:
// System.Text.StringBuilder: the address never appears as a single literal string.
var myEmail = new StringBuilder()
    .Append("myname")
    .Append("@")
    .Append("domain")
    .Append(".")
    .Append("com")
    .ToString();
Whoever does not figure it out has failed my basic intelligence test.
It will be difficult for the spammers as well as your users to identify the email address.
A nice article from Wikipedia on email obfuscation, or address munging.
One common way of hiding email from bots and spammers is to create an image containing the email address. Facebook does this, for instance. Now, using images for email is inherently bad for accessibility, because text readers will not be able to read it. But even otherwise, there are several free character-recognition programs that do a pretty good job of decoding such email-images.
From here
I'm not sure if it really helps with spam - but I've learned to love the Escape Encode Obfuscation for mailto: tags/emails. An example tag:
team@stackoverflow.com
Mails team@stackoverflow.com
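In case the idea isn't clear from the example above, the trick is just to percent-escape the characters of the address inside the mailto: href. A minimal Python sketch (the address is the example one, not a real mailbox):

address = "team@stackoverflow.com"
escaped = "".join("%{:02X}".format(ord(ch)) for ch in address)
# Browsers decode the percent escapes, so the link still mails the right address,
# but the plain text "team@stackoverflow.com" never appears in the page source.
print('<a href="mailto:{}">contact</a>'.format(escaped))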
It's analogous to putting a "protected by ADT" sticker on your front door.
Will that prevent a talented burglar from entering your house? Of course not.
Will it make the house next door with an unlocked door and an iPod in the window a more compelling target? Pretty likely.
A simple unobfuscated email scraper is going to get TONS of emails as it is. Maybe a very simple regex to pick up very common obfuscation methods is worth the effort. Past that, you're spending a lot of time trying to decipher an increasingly small percentage of emails.
All that to say, having some clever obfuscation is probably worth it.
For the record, my email has been on my public resume in plain text for years now, because I use gmail, which has a spam filter that works.
I was wondering why nobody has mentioned ALA's solution so far.
Roel Van Gils wrote an article about Graceful Email Obfuscation in 2007.
Graceful Email Obfuscation is simply a JavaScript email obfuscation technique with a contact-form fallback.
Email addresses are obfuscated by converting them into a URL pointing to a contact form and applying a ROT13 transform:
mailto:mail@example.com → contact/mail+example+com → contact/znvy+rknzcyr+pbz
Via JavaScript, contact/znvy+rknzcyr+pbz is converted back to mailto:mail@example.com.
If no JavaScript is available, the browser will open contact/znvy+rknzcyr+pbz as a fallback. The contact form will know where to send the email because of the URL.
http://www.alistapart.com/articles/gracefulemailobfuscation/
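The server-side half of that scheme is tiny; here is a rough Python sketch (the +-separated path format follows the article's example, and the rot_13 codec does the transform):

import codecs

address = "mail@example.com"
slug = address.replace("@", "+").replace(".", "+")   # mail+example+com
obfuscated = codecs.encode(slug, "rot_13")           # znvy+rknzcyr+pbz
print("contact/" + obfuscated)

# The client-side JavaScript applies ROT13 again (it is its own inverse) to
# rebuild the mailto: link; without JavaScript the URL just opens the contact form.
restored = codecs.decode(obfuscated, "rot_13").replace("+", "@", 1).replace("+", ".")
print("mailto:" + restored)                          # mailto:mail@example.com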
It does make it harder, but there are so many really smart scrapers that it probably doesn't help a lot, since the big spammers are using high-quality spam tools.
How to fight spammers? Make the email address less recognizable to something without a brain (i.e. a computer).
Non-English speakers are your friends: if your user base is a non-English-speaking community, switch to obfuscating using other languages: team_małpa_stackoverflow_kropka_com or team_Affenschwanz_stackoverflow_Punkt_com are perfectly recognizable email addresses for Polish- and German-speaking communities respectively. Some email harvesters know Polish or German, but chances are most harvesters will understand only English.
If you cannot leave English, then switch to some descriptive phrase, like: "in order to send us a message, write team in your address field, then put the symbol AT, then write the name of our site!".
To provide a literal answer, yes, harvesting obfuscated addresses is harder than harvesting standardized addresses. The real question is whether the extra effort will be put in by harvesters and if the (major? minor?) barrier to the harvesters is worth the possible problems for your users.
If you are going to scramble addresses or otherwise transpose them away from the standard form, you should avoid being consistent in how you do so – at least on the same site.
For example, if every email address on a large community site is reversed in the markup and rendered properly with CSS, or token-replaced (@ becomes 'at'), or any other predictable method, the harvesters will just write a thin adapter for your site.
Think of it this way: if it only takes you one line of code to "scramble" them sitewide, it will only take the harvester one line of code to "unscramble" them for your site. Roughly speaking.
In my opinion, spam has become such a problem and so many DBs have been turned over that we're beyond hiding our addresses. Instead, consider looking at Defensio and Akismet, etc, to help classify and block spam.
I have a solution - well, more of a theory.
The problem is, the bots parse the page. They can get the text, even if it's being put into the page in some sophisticated way through JavaScript.
So, just use a CSS3 pseudo-element! It won't be a link, but your email will be visible, and will never be actual text. Something like this:
.email::after{ content:'myemail@gmail.com'; }
Again, it's a theory; I've no idea how far these evil people can go to get it, but I think this is pretty safe (unless they parse the CSS files, which I don't think they do).
It does make it harder to a degree, but the simple obfuscations users still rely on today (the [dot] and [at]) are obsolete and can be captured easily by spammers using a simple regex.
Using something as simple as an image would be helpful and readable for the intended human reader, with no effort needed to 'decrypt' the encoded email address.
Contact email: [image of the email address]
If you are still paranoid about spam bots equipped with character recognition, then something like this would be effective.
It uses optical illusion to its advantage, letting the human mind complete letters that computer vision cannot easily make out. Applying a CAPTCHA-like overlay can also help, but I doubt you need to go that far.

Removing Javascript from HREFs

We want to allow "normal" href links to other webpages, but we don't want to allow anyone to sneak in client-side scripting.
Is searching for "javascript:" within the HREF and onclick/onmouseover/etc. events good enough? Or are there other things to check?
It sounds like you're allowing users to submit content with markup. As such, I would recommend taking a look at a few articles about preventing cross-site scripting which would cover a bit more than simply preventing javascript from being inserted into an HREF tag. Below is one I found that might be useful:
http://weblogs.java.net/blog/gmurray71/archive/2006/09/preventing_cros.html
You'll have to use a whitelist of allowed protocols to be completely safe. If you use a blacklist, sooner or later you'll miss something like "telnet://" or "shell:" or some exploitable browser-specific thing you've never heard of...
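As a rough sketch of that whitelist idea (Python standard library; the allowed set and function name are just illustrative), the check boils down to "no scheme at all, or a scheme we explicitly trust":

from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https", "mailto"}

def is_safe_href(href):
    scheme = urlparse(href.strip()).scheme
    # Relative links have no scheme; anything else must be explicitly whitelisted.
    return scheme == "" or scheme in ALLOWED_SCHEMES

print(is_safe_href("https://example.com/page"))        # True
print(is_safe_href("/relative/link"))                  # True
print(is_safe_href("javascript:alert('Haxored!!!')"))  # False

Note that a check like this has to run after the href has been fully decoded (HTML entities, URL encoding), as the next answer points out.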
Nope, there's a lot more that you need to check.
First of all, the URL could be encoded (using HTML entities or URL encoding, or a mixture of both).
Secondly, you need to check for malformed HTML, which the browser might guess at and end up allowing some script in.
Thirdly, you need to check for CSS-based script, e.g. background: url(javascript:...) or width:expression(...)
There's probably more that I've missed - you need to be careful!
You have to be extremely careful when taking user input. You'll want to do a whitelist as mentioned, but not just with the href. Example:
<img src="nosuchimage.blahblah" onerror="alert('Haxored!!!');" />
or
click meh
One option would be to disallow HTML altogether and use the same sort of formatting that some forums use. Just replace
[url="xxx"]yyy[/url]
with
<a href="xxx">yyy</a>
That'll get you around the issues with mouseover etc. Then just make sure the link starts off with a whitelisted protocol, and doesn't have a quote in it (a " character, or anything that might be decoded into one by PHP or the browser).
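A sketch of that approach in Python (the pattern, function name, and protocol list are illustrative; a real filter would also escape the surrounding text):

import html
import re

ALLOWED_PREFIXES = ("http://", "https://")

def bbcode_links(text):
    def repl(match):
        url, label = match.group(1), match.group(2)
        if not url.lower().startswith(ALLOWED_PREFIXES):
            return html.escape(match.group(0))       # refuse non-whitelisted protocols
        return '<a href="{}">{}</a>'.format(url, html.escape(label))
    # Group 1 cannot contain a quote, so the generated href attribute cannot be broken out of.
    return re.sub(r'\[url="([^"\]]*)"\](.*?)\[/url\]', repl, text, flags=re.DOTALL)

print(bbcode_links('See [url="https://example.com"]my site[/url]'))
print(bbcode_links('[url="javascript:alert(1)"]click meh[/url]'))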
Sounds like you're looking for the companion function to PHP's strip_tags, which is strip_attributes. Unfortunately, it hasn't been written yet. (Hint, hint.)
There is, however, an interesting-looking suggestion in the strip_tags documentation, here:
http://www.php.net/manual/en/function.strip-tags.php#85718
In theory this will strip anything that isn't an href, class, or ID from submitted links; seems like you probably want to lock it down even further and just take hrefs.