Regular Expressions vs XPath when parsing HTML text - html

I want to parse an HTML text and find special parts. For example, the text in the 3rd div of the 1st row and 2nd column of a table. I have 2 options for parsing: Regular Expressions and XPath. What are the advantages and disadvantages of each one?
Thanks

It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100 through 200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution — the very one shown above, in fact. To go off writing some massively overengineered behemoth to do a full parse to set up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of whacky things you aren’t planning for just cries out for leveraging someone else’s hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can and has done this, I, unlike 99.9999% of the responders here, have the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion, here, here, here, here, here, and here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is ok, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.

I think XPath is the primary option for traversing XML-like documents. With RegExp, it will be up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
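To make the contrast concrete, here is a minimal sketch of the XPath approach in Python using lxml (the file name and the exact paths are made up for illustration; any DOM/XPath library will look much the same):

from lxml import html

doc = html.fromstring(open("page.html").read())

# e.g. the text of the cell in the 1st row, 2nd column of the first table
cells = doc.xpath("//table[1]/tr[1]/td[2]")
if cells:
    print(cells[0].text_content())

# selecting by attribute value, something that gets painful with regexes
print(doc.xpath("//a[@class='external']/@href"))

Note how the query states what you want structurally, while a regex would have to anticipate every way the same markup can be written.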
EDIT: See also How do HTML parsers work if they're not using regexp?

XPath is less likely to break if the web developer does any minor changes. That would be my choice.

Here is the canonical Stack Overflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.

Related

How should I format my markup?

When it comes to my markup, I'm anal. It always has to be perfectly indented, easily readable to me, and 100% valid with the W3C. Often, when viewing the markup of other websites, I'm appalled at the lack of effort by the developer to try and keep their markup in the browser clean, organized, and valid.
On the flip side, there's a lot of people who will force all their markup onto one continuous line for the size-saving benefits. This annoys me as well, though not to the same extent because it is done with a purpose. But for the most part, it seems like no developer ever actually looks at their markup in the browser and does anything about it.
Understanding that, to the parser in the browser, indents and spaces (usually) don't matter, how should I be handling my markup? Is it worth the extra time to get my markup perfectly easily readable to humans as well as the browser? Are all my \t's and \n's being used in vain?
Some browsers have bugs that render indented, well-formed HTML completely wrong, such as some versions of Internet Explorer with tables and images.
Other than that, I try to keep sane indentation. I don't spend too much time on it, just enough to make it easy to debug.
Is it worth the extra time to get my markup perfectly easily readable
My answer is no. The arguments:
Whoever looks at the code will probably want to modify it, so for editing the code you need a good code editor with code formatting (e.g. NetBeans). You'll very soon need other features too, like syntax coloring.
Some users might prefer a different type of formatting than you do.
Anyone interested in readable HTML may use Tidy (or the Tidy extension for Firefox) to format it.
It's a performance issue too: formatting adds overhead, while stripping whitespace (and minifying when possible) will speed up the site. That's very important for sites with high traffic.
It's worth the effort imho since it helps you understand what exactly is going on in your html page, and that's definitely worth something.
If we want to write clean, elegant code in general, then we should want to generate nice, clean, elegant HTML as well, no?
Not sure if this answers your question, but as long as the code is valid by W3C, it is structured as intended. As far as the viewability of the code (like view source) goes, that's really up to you, but I would not add too much clutter (comments, etc.). Use the correct DOCTYPE for your markup and you should be fine with that. I don't see any reason to "waste" time on making the source code in the browser "book" readable. View source would only be beneficial to you, so you can quickly see what's happening at a glance.
I like to correctly format my markup, and I think it makes it easier to manage when I do.
Then again, I use ASP.NET and a lot of markup is generated through various controls and classes. In this case, I've decided it is not worth trying to track down each mis-aligned markup and see if something can be done to get the associated control to produce the correct result.
In short, nicely formatted markup is worth it if it can be accomplished without a huge effort.
Yes, in my opinion it is worth it. It will be easier to maintain, for you and for other colleagues, now and in the future.
About the disadvantage of lower performance: why not develop a well-indented and commented source file and generate a minimized version to run on the server? It can be achieved with a simple series of regex replacements.
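A simple series of regex replacements along those lines might look like the following rough sketch in Python (it only collapses whitespace between tags and runs of spaces, and will happily mangle <pre> blocks and inline scripts, so treat it as an illustration of the idea rather than a production minifier):

import re

def naive_minify(markup):
    # drop whitespace between tags
    markup = re.sub(r">\s+<", "><", markup)
    # collapse runs of spaces and tabs
    markup = re.sub(r"[ \t]{2,}", " ", markup)
    # collapse blank lines
    markup = re.sub(r"\n{2,}", "\n", markup)
    return markup.strip()

print(naive_minify("<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"))
# -> <ul><li>one</li><li>two</li></ul>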

What language should I use for editing documents?

Document editors are nice but they have their limitations.
What is a good alternative to them?
I already know HTML and CSS and while they can do the job, they are ill-suited for printed documents.
I was thinking of learning LaTeX, because many scholars use it. But I wonder if someone would recommend another language, such as PostScript.
LaTeX is fine. You don't want to write postscript by hand.
I’m using LaTeX almost exclusively nowadays, at least for text documents (everything from CV over letters to manuals).
For quick one-off notes, I’m actually using Markdown (without a renderer; I just think that Markdown preserves document structure quite nicely even when used in text-only mode).
For presentations and spreadsheets, I use appropriate applications, though. In particular, I don’t think LaTeX is that well-suited to do the former (depending on your style of presentations, obviously. Mine have next to no text though …).
I finally got a chance to write an entire paper in LaTeX for my final semester of college and found it to be easier than I thought it would be. A couple of the nice things I found about it were:
A fairly lightweight syntax for most things (tables being the only real offender, but no one can get text tables right).
An extremely wide array of syntax for doing anything from automatically marking up a chemical formula to writing inline lists.
Beautiful output automatically.
Extremely easy to write modular documents where I might store a chapter in a file and then simply \include{} it in another. One particularly nice use I found for this was to include code that I had written in the document simply by referencing the files.
Wonderful support for footnotes and bibliographic references.
Libraries for just about anything you can imagine.
The major drawbacks are, IMHO:
A lack of any real direction or life in the language. It feels dead, and not because it's done.
A frustrating build process, although there are tools to help with that, from a simple bash script to a full fledged make file.
If you're interested in learning LaTeX, I would recommend starting out by reading the Not So Short Introduction to LaTeX 2e PDF.
However, I decided against using LaTeX for most things that I write these days, specifically because it feels dead and has a frustrating build process. I instead switched over to MultiMarkdown, as it is well supported and can be transformed into a large array of other formats, including LaTeX, which can then be hand-massaged if you really need to get it into the format expected by some publication. If you haven't played with MultiMarkdown or Markdown before, then I highly recommend checking them out. The syntax is extremely lightweight and natural, even compared to LaTeX. I find that except for some of the higher-level typographical constructs, MultiMarkdown supports everything I need on a regular basis.
My 2 cents.
It depends on what you want to do. If you are planning to write a formal document, maybe for printing too, just go for LaTeX.
It is not as difficult as it may appear at the very beginning, but it is professional and fulfilling.
If the Web is your goal, go for HTML / CSS.
OpenOffice or Word would do the trick in most cases; do not underestimate them, and if you are going to use them (for example, for work), take the time to learn them.
To expand on zzzzBov's comment, LaTeX is SUPPOSED to allow the writer to concentrate on the content and let the compiler/documentclass handle formatting (and that usually is true). If you use HTML/CSS to format, you will probably spend more time (rather than less) doing formatting. Imagine that the LaTeX documentclass is the CSS, only it is already written for you, and your LaTeX source is the content, only the tags are more functional (such as italics or equations) than the glue between the HTML and the CSS (<div ...>). I recommend the LaTeX wikibook as an easy way to start, and the short-math-guide if you need mathematics. Enjoy!

Rails - Escaping HTML using the h() AND excluding specific tags

I was wondering, and have so far been unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "<strong>Hello</strong>, I am a <i>string</i>."
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain HTML tags instead of all of them? Like either whitelisting or blacklisting tags?
An example of what I'm trying to say might look like this:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag especially can be inserted in very many different ways - detecting them all is very tricky.
If at all possible, don't implement this yourself.
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (seems better, never tried it though).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth; it might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. I can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since then, as that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
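For comparison, the whitelist idea looks roughly like this in Python's bleach library (shown here only to illustrate the shape of the API; Sanitize in Ruby exposes an equivalent whitelist configuration, and the example string is just the one from the question with a script tag added):

import bleach

my_string = "<strong>Hello</strong>, I am a <i>string</i>.<script>alert('xss')</script>"

# only the listed tags survive; everything else is stripped
# (or escaped if strip=False, the default)
safe = bleach.clean(my_string, tags=["strong", "i"], strip=True)
print(safe)

The important property is the default-deny behaviour: anything you did not explicitly allow is neutralised, which is exactly what you want for user-supplied markup.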
Preventing XSS attacks is serious business; follow hrnt's advice, and consider that there are probably an order of magnitude more exploits than that, possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating Sanitize vs XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord. But Nokogiri, and specifically Loofah, seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use Nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.

So what if custom HTML attributes aren't valid XHTML?

I know that is the reason some people don't approve of them, but does it really matter? I think that the power that they provide, in interacting with JavaScript and storing and sending information from and to the server, outweighs the validation concern. Am I missing something? What are the ramifications of "invalid" HTML? And wouldn't a custom DTD resolve them anyway?
The ramification is that w3c comes along in 2, 5, 10 years and creates an attribute with the same name. Now your page is broken.
HTML5 is going to provide a data attribute type for legal custom attributes (like data-myattr="foo") so maybe you could start using that now and be reasonably safe from future name collisions.
Finally, you may be overlooking that custom logic is the rationale behind the class attribute. Although it is generally thought of as a style attribute, it is in reality a legal way to set custom meta-properties on an element. Unfortunately you are basically limited to boolean properties, which is why HTML5 is adding the data prefix.
BTW, by "basically boolean" I mean in principle. In reality there is nothing to stop you using a separator in your class name to define custom values as well as attributes.
class="document docId.56 permissions.RW"
Yes you can legally add custom attributes by using "data".
For example:
<div id="testDiv" data-myData="just testing"></div>
After that, just use the latest version of jquery to do something like:
alert($('#testDiv').data('myData'))
or to set a data attribute:
$('#testDiv').data('myData', 'new custom data')
And since jQuery works in almost all browsers, you shouldn't have any problems ;)
update
data-myData may be converted to data-mydata in some browsers, as far as the javascript engine is concerned. Best to keep it lowercase all the way.
Validation is not an end in itself, but a tool to be used to help catch mistakes early, and reduce the number of mysterious rendering and behavioural issues that your web pages may face when used on multiple browser types.
Adding custom attributes will not affect either of these issues now, and is unlikely to do so in the future, but because they don't validate, when you come to assess the output of a validation of your page you will need to carefully pick between the validation issues that matter and the ones that don't. Each time you change your page and revalidate, you have to repeat this operation. If your page validates entirely, then you get a nice green PASS message, and you can move on to the next stage of testing, or to the next change that needs to be made.
I've seen people obsessed with validation doing far worse/weird things than using a simple custom attribute:
<base href="http://example.com/" /><!--[if IE]></base><![endif]-->
In my opinion, custom attributes really don't matter. As other say, it may be good to watch out for future additions of attributes in the standards. But now we have data-* attributes in HTML5, so we're saved.
What really matters is that you have properly nested tags, and properly quoted attribute values.
I even use the new tag names introduced by HTML5 (header, footer, etc.), but these ones have problems in IE.
By the way, I often find it ironic how all those validation zealots bow before Google's clever tricks, like iframe uploads.
Instead of using custom attributes, you can associate your HTML elements with the attributes using JSON:
var customAttributes = { 'Id1': { 'custAttrib1': '', ... }, ... };
And as for the ramifications, see SpliFF's answer.
Storing multiple values in the class attribute is not correct code encapsulation and is just a convoluted, hacky way of doing things. Take a custom ad rotator, for instance, that uses jQuery. It is much cleaner on the page to do
<div class="left blue imagerotator" AdsImagesDir="images/ads/" startWithImage="0" endWithImage="10" rotatorTimerSeconds="3" />
and let some simple jquery code do the work from here.
Any developer or web designer can now work on the ad rotator and change values when asked without much ado.
Coming back to a project a year later, or coming into a new one where the previous developer split and went to an island somewhere in the Pacific, can be hell when trying to figure out intentions from code written in an unclear, cryptic manner like this:
<div class="left blue imagerotator dir:images-ads endwith:10 t:3 tf:yes" />
When we write code in C# and other languages, we don't put all custom properties into one property as a space-delimited string and end up having to parse that string every time we need to access or write to it. Think about the next person who will work on your code.
The thing with validation is that TODAY it may not matter, but you cannot know if it's going to matter tomorrow (and, by Murphy's law, it WILL matter tomorrow).
It's just better to choose a future-proof alternative. If they don't exist (they do in this particular case), the way to go is to invent a future-proof alternative.
Using custom attributes is probably harmless, but still, why choose a potentially harmful solution just because you think (you can never be sure) it will cause no harm? It might be worth discussing this further if the future-proof alternative were too costly or unwieldy, but this is certainly not the case.
Old discussion, but nevertheless: in my opinion, since HTML is a markup language and not a programming language, it should always be interpreted with leniency for markup 'errors'. A browser is perfectly able to do so, and I don't think this will or should ever change. Therefore, the only important practical criterion is that your HTML is displayed correctly by most browsers and will continue to be in, say, a few years. After that time, your HTML will probably be redesigned anyway.
Just to add my ingredient to the mix: validation is also important when you need to create content that can be post-processed with automated tools. If your content is valid, you can much more easily convert markup from one format to another. For example, converting valid XHTML to XML with a specific schema is much easier when parsing data that you know, and can verify, follows a predictable format.
I, for example, NEED my content to be valid XHTML because very often it is converted into XML for various jobs and then converted back without data loss or unexpected rendering results.
Well, it depends on your client/boss/etc. Do they require it to be valid XHTML?
Some people say there are a lot of workarounds, and depending on the scenario, they can work great. These include adding classes, leveraging the rel attribute, and someone has even written their own parser to extract JSON from HTML comments.
HTML5 provides a standard way to do this: prefix your custom attributes with "data-". I would recommend doing this now anyway, as there is a chance you may otherwise use an attribute that will be used down the track in standard XHTML.
Using non-standard HTML could make the browser render the page in "quirks mode", in which case some other parts of the page may render differently, and other things like positioning may be slightly different. Using a custom DTD may get around this, though.
Because they're not standard, you have no idea what might happen, either now or in the future. As others have said, the W3C might start using those same names in the future. But what's even more dangerous is that you don't know what the developers of "browser xxx" have done when they encounter them.
Maybe the page is rendered in quirks mode, maybe the page doesn't render at all on some obscure mobile browser, maybe the browser will leak memory, maybe a virus killer will choke on your page, etc, etc, etc.
I know that following the standards religiously might seem like snobbery. However, once you have experienced problems due to not following them, you tend to stop thinking like that. But by then it's mostly too late, and you need to start your application from scratch with a different framework...
I think developers validate just to validate, but there is something to be said for the fact that it keeps markup clean. However, because every (exaggeration warning!) browser displays everything differently, there really is no standard. We try to follow standards because it makes us feel like we at least have some direction. Some people argue that keeping code standard will prevent issues and conflicts in the future. My opinion: screw that, nobody implements standards correctly and fully today anyway, so you might as well assume all your code will fail eventually. If it works, it works; use it, unless it's messy or you're just trying to ignore standards to stick it to the W3C or something. I think it's important to remember that standards are implemented very slowly; has the web changed all that much in 5 years? I'm sure anyone will have years of notice when they need to fix a potential conflict. There's no reason to plan for compatibility with future standards when you can't even rely on today's standards.
Oh I almost forgot, if your code doesn't validate 10 baby kittens will die. Are you a kitten killer?
jQuery's .html(markup) doesn't work if markup is invalid.
Validation
You shouldn't need custom attributes to provide validation. A better approach is to add validation based on the field's actual task.
Assign meaning by using classes. I have class names like:
date (Dates)
zip (Zip code)
area (Areas)
ssn (Social security number)
Example markup:
<input class="date" name="date" value="2011-08-09" />
Example javascript (with jQuery):
$('.date').validate(); // use your custom function/framework etc here.
If you need special validators for a certain field or scenario, you just invent new classes (or use selectors) for your special case:
Example for checking if two passwords match:
<input id="password" />
<input id="password-confirm" />
if($('#password').val() != $('#password-confirm').val())
{
// do something if the passwords don't match
}
(This approach works quite seamlessly with both jQuery validation and the ASP.NET MVC framework, and probably others too.)
Bonus: You can assign multiple classes separated with a space class="ssn custom-one custom-two"
Sending information "from and to the server"
If you need to pass data back, use <input type="hidden" />. They work out of the box.
(Make sure you don't pass any sensitive data with hidden inputs since they can be modified by the user with almost no effort at all)

Writing XSS Filter for (X)HTML Based on White List

I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use the existing high-quality filters written in PHP because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white list of options for these tags. For example, typical HTML input can consist of <b> and <i> tags and an <a> tag with href. But a straightforward implementation is not good enough, because even allowed, simple links may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
Many other examples can be found there. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src -- so I always need to check that the value starts with (https?|ftp)://
Questions:
Are these assumptions good enough for most purposes? Meaning, if I do not allow options for style tags and check src/href using a white list of prefixes, does it solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML in order to write a simple parser that would clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, A bit extra :
You can potentially come up with a very strict white listing. It should be structured well, pretty tight and not very flexible. When you combine flexibility, many tags, many attributes and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the img tag); then you need to do whitelisting on the attribute values, as you stated: (https?|ftp):// for href/src, or specific style properties like color|background-color, etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also you need to think about some character whitelisting or some UTF-8 normalization, because different encodings can cause awkward issues, such as newlines in attributes or invalid UTF-8 sequences.
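To make the attribute-value checks concrete, here is a rough sketch of the scheme whitelist for href/src (in Python purely for brevity; the same logic is straightforward in C++, and the names and lists are illustrative rather than a complete filter):

import re
from urllib.parse import urlparse

ALLOWED_ATTRS = {"b": set(), "i": set(), "a": {"href"}, "img": {"src"}}
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def attribute_allowed(tag, attr, value):
    if attr not in ALLOWED_ATTRS.get(tag, set()):
        return False
    if attr in ("href", "src"):
        # strip whitespace/control characters that browsers tolerate
        # (e.g. "java\nscript:" tricks) before looking at the scheme
        cleaned = re.sub(r"[\s\x00-\x1f]", "", value).lower()
        return urlparse(cleaned).scheme in ALLOWED_SCHEMES
    return True

print(attribute_allowed("a", "href", "javascript:alert(1)"))   # False
print(attribute_allowed("a", "href", "https://example.com/x")) # True

Note that this rejects relative URLs (empty scheme); whether you want that depends on your application.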
All the details of HTML parsing are specified in HTML 5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly with all its corner cases. At worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into doing a conversion of one of the established versions to C++, as opposed to trying to write your own from scratch. There are so many tricks to worry about, that I think you'd be better off standing on the shoulders of others that have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ won't be able to duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it'd definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained. There's a comparison document where the author discusses some differences between his approach and others'; it's probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!