I want to parse an HTML text and find special parts. For example, the text in the 3rd div of the 1st row and 2nd column of a table. I have 2 options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each one?
thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution (the very one shown above, in fact). To go off writing some massively overengineered behemoth that does a full parse and sets up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of wacky things you aren’t planning for just cries out for leveraging someone else’s hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
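To make that concrete, here is a hedged Python sketch of just that building block: a pattern for a plain open tag, assuming the gotchas above (comments, CDATA, <script> bodies, exotic encodings, broken markup) are already out of the picture. It is an illustration, not a parser.

    import re

    # Matches a simple HTML open tag with optionally quoted attributes.
    # Deliberately ignores comments, doctypes, CDATA, and malformed input.
    OPEN_TAG = re.compile(r"""
        <
        (?P<name>[a-zA-Z][a-zA-Z0-9]*)          # tag name
        (?:
            \s+ [a-zA-Z][\w:-]*                 # attribute name
            (?: \s*=\s*                         # optional value:
                (?: "[^"]*"                     #   double-quoted
                  | '[^']*'                     #   single-quoted
                  | [^\s>]+                     #   or unquoted
                )
            )?
        )*
        \s* /?                                  # optional self-closing slash
        >
    """, re.VERBOSE)

    print(bool(OPEN_TAG.match('<a href="http://example.com" target=_blank>')))  # True
    print(bool(OPEN_TAG.match('<br/>')))                                        # True

If that already looks more involved than you expected, that is rather the point.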
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can and has done this, I, unlike 99.9999% of the responders here, have the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion, here, here, here, here, here, and here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is OK, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With regular expressions, it is up to you to handle all the different ways of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parsers work if they're not using regexp?
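For instance, a minimal sketch in Python using lxml (one XPath-capable library among many; the paths assume the simple table layout from the question):

    from lxml import html

    page = "<html><body><table><tr><td>r1c1</td><td>r1c2</td></tr></table></body></html>"
    doc = html.fromstring(page)

    # "1st row, 2nd column" of the first table, as in the question. Note that
    # some parsers insert an implicit <tbody>, in which case the path becomes
    # //table[1]/tbody/tr[1]/td[2].
    cells = doc.xpath("//table[1]/tr[1]/td[2]")
    print(cells[0].text_content() if cells else "not found")   # -> r1c2

None of the quoting or whitespace variations in the markup would change this code.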
XPath is less likely to break if the web developer does any minor changes. That would be my choice.
Here is the canonical Stackoverflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.
I was wondering, and was as of yet, unable to find any answers online, how to accomplish the following.
Let's say I have a string that contains the following:
my_string = "Hello, I am a string."
(in the preview window I see that this is actually formatting in BOLD and ITALIC instead of showing the "strong" and "i" tags)
Now, I would like to make this secure, using the html_escape() (or h()) method/function.
So I'd like to prevent users from inserting any javascript and/or stylesheets, however, I do still want to have the word "Hello" shown in bold, and the word "string" shown in italic.
As far as I can see, the h() method does not take any additional arguments, other than the piece of text itself.
Is there a way to escape only certain HTML tags, instead of all of them? Like either white- or blacklisting tags?
An example of what I'm trying to say might look like:
h(my_string, :except => [:strong, :i]) # => so basically, escape everything, but leave "strong" and "i" tags alone, do not escape these.
Is there any method or way I could accomplish this?
Thanks in advance!
Excluding specific tags is actually a pretty hard problem. The script tag in particular can be inserted in very many different ways; detecting them all is very tricky.
If at all possible, don't implement this yourself.
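To give a feel for why, here are a few classic evasion patterns (all long documented on public XSS cheat sheets) that naive tag filters tend to miss:

    # Classic evasion patterns that a naive "remove <script>" filter misses:
    vectors = [
        "<scr<script>ipt>alert(1)</script>",          # one-pass removal re-creates <script>
        "<img src=x onerror=alert(1)>",               # event handler, no <script> tag at all
        "<a href='java\tscript:alert(1)'>click</a>",  # whitespace-obfuscated javascript: URL
    ]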
Use the white list plugin or a modified version of it. It's superb!
You can have a look at Sanitize as well (it seems better, though I've never tried it).
Have you considered using RedCloth or BlueCloth instead of actually allowing HTML? These methods provide quite a bit of formatting options and manage parsing for you.
Edit 1: I found this message when browsing around for how to remove HTML using RedCloth, might be of some use. Also, this page shows you how version 2.0.5 allows you to remove HTML. Can't seem to find any newer information, but a forum post found a vulnerability. Hopefully it has been fixed since that was from 2006, but I can't seem to find a RedCloth manual or documentation...
I would second Sanitize for removing HTML tags. It works really well. It removes everything by default and you can specify a whitelist for tags you want to allow.
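For illustration, the same whitelist idea sketched in Python with the bleach library (the Ruby Sanitize API is analogous): everything gets escaped except the tags you explicitly allow, which is exactly the h(my_string, :except => [:strong, :i]) behaviour the question describes.

    import bleach  # pip install bleach

    my_string = '<strong>Hello</strong>, I am a <i>string</i>. <u>no underline</u>'

    # Escape everything except an explicit whitelist of tags.
    safe = bleach.clean(my_string, tags=["strong", "i"], strip=False)
    print(safe)
    # -> <strong>Hello</strong>, I am a <i>string</i>. &lt;u&gt;no underline&lt;/u&gt;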
Preventing XSS attacks is serious business. Follow hrnt's advice, and consider that there are probably an order of magnitude more exploits possible due to obscure browser quirks. Although html_escape will lock things down pretty tightly, I think it's a mistake to use anything homegrown for this type of thing. You simply need more eyeballs and peer review for any kind of robustness guarantee.
I'm in the process of evaluating Sanitize vs. XssTerminate at the moment. I prefer the xss_terminate approach for its robustness: scrubbing at the model level will be quite reliable in a regular Rails app where all user input goes through ActiveRecord. But Nokogiri, and specifically Loofah, seem to be a little more performant, more actively maintained, and definitely more flexible and Ruby-ish.
Update: I've just implemented a fork of ActsAsTextiled called ActsAsSanitiled that uses Sanitize (which has recently been updated to use Nokogiri, by the way) to guarantee safety and well-formedness of the RedCloth output, all without needing any helpers in your templates.
I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white list of attributes for those tags. For example, typical HTML input can consist of <b> and <i> tags, and <a> tags with href. But a straightforward implementation is not good enough, because even allowed, simple links may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
Many other examples can be found out there. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src, so I always need to check that the value starts with (https?|ftp)://.
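That prefix check might look something like this (a Python sketch for brevity, since the real filter is C++; the helper name is illustrative). Note it has to run after entity decoding and whitespace stripping, or obfuscated schemes slip through:

    from urllib.parse import urlparse

    ALLOWED_SCHEMES = {"http", "https", "ftp"}   # the (https?|ftp):// whitelist above

    def href_allowed(url):
        # Hypothetical helper: accept only whitelisted schemes; relative
        # URLs (empty scheme) are rejected here too, deliberately.
        return urlparse(url.strip()).scheme.lower() in ALLOWED_SCHEMES

    print(href_allowed("https://example.com"))        # True
    print(href_allowed("javascript:alert('XSS')"))    # False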
Questions:
Are these assumptions good enough for most purposes? That is, if I do not allow style attributes and I check src/href against a white list of prefixes, does that solve the XSS problem? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML, in order to write a simple parser that would clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, a bit extra:
You can potentially come up with a very strict white list. It should be well structured, pretty tight, and not very flexible. When you combine flexibility, lots of tags and attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the img tag), and then you need to do white listing on the attribute values, as you stated: http|https|ftp for href/src, style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
Also you need to think about some character white listing, or some UTF-8 normalization, because different encodings can cause awkward issues, such as newlines in attributes or invalid UTF-8 sequences.
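A data-driven version of that per-tag rule set might look like this (Python for brevity; the tag and attribute lists are just the ones from the discussion above):

    # Illustrative per-tag whitelist: tag -> set of allowed attributes.
    ALLOWED = {
        "b": set(), "i": set(), "li": set(), "h1": set(),
        "a": {"href"},
        "img": {"src"},
    }

    def keep(tag, attr):
        # Unknown tags are rejected outright; known tags keep only listed attributes.
        return tag in ALLOWED and attr in ALLOWED[tag]

    print(keep("a", "href"))     # True
    print(keep("a", "onclick"))  # False: event handlers never make the list
    print(keep("x", "style"))    # False: unknown tag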
All the details of HTML parsing are specified in HTML 5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly, with all its corner cases: at worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
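That "parse first, then sanitize the DOM" shape, sketched in Python with lxml purely for illustration (a C++ filter would walk whatever DOM its own parser produces):

    from lxml import html

    ALLOWED_TAGS = {"b", "i", "a", "li", "h1"}

    def sanitize(fragment):
        # Parse with a real HTML parser first, then rewrite the DOM.
        root = html.fragment_fromstring(fragment, create_parent="div")
        for el in list(root.iter()):          # list(): we mutate while walking
            if el is root:
                continue
            if not isinstance(el.tag, str) or el.tag.lower() not in ALLOWED_TAGS:
                el.drop_tag()                 # drop the element, keep its text
                # (a real filter would drop_tree() script/style to kill the text too)
            else:
                el.attrib.clear()             # per-tag attribute whitelisting goes here
        return html.tostring(root, encoding="unicode")

    print(sanitize("Hi <b>there</b> <script>alert(1)</script>"))
    # -> <div>Hi <b>there</b> alert(1)</div>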
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex of an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive, I would look into converting one of the established implementations to C++ rather than trying to write your own from scratch. There are so many tricks to worry about that I think you'd be better off standing on the shoulders of others who have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ can't duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it'd still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained; there's a comparison document where the author discusses some differences between his approach and others', which is probably worth reading.
Whatever you come up with, definitely test it with all the examples you linked, and make sure it passes them all. Good luck!
I'm using google docs, and some templates we are using were created using MS-Office.
The resulting HTML is fat and ugly, and the 500KB per doc limitation on google makes some cleanup mandatory.
I was able to find redundant "style" attributes and move them to a CSS class, and rename the most redundant class names to shorter ones, which saves about 50% of the original size.
Are you aware of any existing tools/scripts/libraries which could do this painful job for me, or at least help me write this magic tool?
Thanks in advance !
EDIT: I gave tidy, demoronizer, and a manual rewrite a try:
- Input : 140Kb
- Tidy'ed : 110Kb
- Demoronized : 135Kb
So my favorite answer will be "rewrite it!"
Thanks !
MS-Office makes crappy HTML, period. You're better off spending time rebuilding the HTML from the original text than trying to walk through that minefield.
I made a few macros that do some search/replace functions on Word to do basic things like wrap <p> tags around paragraphs and stuff like that, then re-markup the whole thing from scratch.
You could try tidy; it will clean up many things.
Without commenting on its name, I could mention demoronizer, which the author describes as:
...a Perl program available for downloading from this site which corrects numerous errors and incompatibilities in HTML generated by, or edited with, Microsoft applications.
YMMV.
One of my favourite utilities now is actually Windows Live Writer; it does a neat job of stripping rubbish out of Word doc files. Some might disagree, but I use it quite often!
In a web application, is it acceptable to use HTML in your code (in non-scripting languages such as Java or .NET)?
There are two major sub questions:
Should you use code to print HTML, or otherwise directly create HTML that is displayed?
Should you mix code within your HTML pages?
Generally, it's better to keep presentation (HTML) separate from logic ("back-end" code). Your code is decoupled and easier to maintain this way.
As long as your HTML-writing code is separate from your application logic, and the HTML is guaranteed to be well-formed somehow, you should be okay.
The only code that should be mixed into markup-based pages (i.e., those that contain literal HTML) is the code used for formatting the HTML (e.g., a loop for writing out a list).
There are trade-offs whether you put the code in with the HTML or you use pure code to write the HTML out using quoted string literals.
No, if you want to build good and maintainable software, and to achieve loose coupling.
If I understand the question right, you're asking whether it's a good practice to mix markup with back-end code. No. While this is commonly done, it's still a bad idea.
You should read up on the MVC paradigm, as well as on existing questions on the matter, such as What is the best way to migrate an existing messy webapp to elegant MVC? and Best practices for refactoring classic ASP?
The point is to keep the display logic separate from the rest of the code. In any complex site you'll have code mixed in with your HTML, but the code should be for display purposes only. It shouldn't be doing any complex calculations.
For example, templates will contain loops and conditionals. Plus you'll probably have a library of HTML-specific routines, like printing out an <option> list based on a list object.
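For instance, such a display-only helper might look like this (a Python sketch; the name and shape are only illustrative):

    from html import escape

    def options_html(items, selected=None):
        # Display helper: render an <option> list from (value, label) pairs.
        # All of the HTML formatting lives here, none of it in the business logic.
        return "\n".join(
            '<option value="{}"{}>{}</option>'.format(
                escape(str(value), quote=True),
                " selected" if value == selected else "",
                escape(str(label)),
            )
            for value, label in items
        )

    print(options_html([("us", "United States"), ("fr", "France")], selected="fr"))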
Imagine you were writing an application that has two output modes: HTML and something else. How would you write it, to avoid duplicating code? That will probably point you in the right direction.
The HTML that makes up the view has to get sent to the browser in some way. In .NET, each server control emits its own HTML markup as part of the page lifecycle. So yes, it is OK to use HTML in server-side code.
Perhaps you should try following the ASP.NET pattern: create a bunch of controls that represent UI elements and make them responsible for emitting their own HTML based on their state.
It's fugly, and not type-safe, but people do it without consequence. I'd prefer using a DOM or, at a minimum, classes designed to write HTML using type-safe semantics. Also, it's not all that good to mix UI with logic...
If I need methods that generate HTML I usually isolate them in an HtmlHelpers class. That way you keep some level of separation. The ASP.NET MVC Framework does this quite successfully.
If you mean printing out HTML in your code, then no. Unless you have a good reason not to, you should use templates.
Even if you think you don't need this now, there's always a good chance you'll need it later. Maybe you'll want to output in a different format than HTML, or you'll want a different presentation for the same data. The need for these things usually shows up further down the road, so it's best to use templates from the start.
I hate when developers print() a bunch of html. It's completely unnecessary and looks ugly in any text editor that shows print/echo strings in red.
I agree with everyone else that you should try as hard as you can to separate the HTML/XHTML markup from the application logic. However, sometimes you do need to generate HTML/XHTML in the application logic for various reasons.
In these cases, what I try to do is ensure that the bare minimum of presentation code is mixed in with the application logic, and migrate everything else over to the presentation layer. It is worth noting that in some cases everything could be moved over to the presentation layer, but it might be a bit easier to generate the markup as part of the application logic. In those cases, your best bet is to go the route that makes the most sense in terms of time.
I don't think there's any excuse for generating HTML inside your business logic. Don't even do it when it's just a "quick fix" or when you'll "go back and fix it later", because that never happens.
To reiterate my position from other questions, using some control logic (conditionals, loops) within HTML to construct it is OK. Do NOT do any data massaging or business logic in the HTML. You have to be disciplined, but it's worth it. Maintenance is much easier if your concerns (like logic and display) are separated.
Ideally you are aiming for a separation of concerns between your presentation (UI) code and your domain (business logic) code.
The reason why you should avoid coupling these two concerns (in either direction) is simple...
You should only have one reason to change a piece of code: whether the change comes from structural or styling changes in your HTML design, or from your business rules changing, you should only have to make the change in one place.
To a lesser extent, although many purists would disagree, by sprinkling HTML code through your domain code or vice versa you are creating noise for the next developer who comes along to read/maintain it.
I try to avoid using code to print HTML "directly". It is difficult to maintain, edit, add styles to, and so on. In some cases, like generating an HTML email in code, I create a text or HTML file with markers like [name], [verification code], etc. I load this from the code and replace those markers. This way, you can edit the style of the email without recompiling your code; a sketch follows below. Separating "presentation" and "logic" is a good practice in my opinion.
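A minimal sketch of that marker-replacement idea (Python; the file name and markers are hypothetical, taken from the example above):

    # Load a template with [marker]-style placeholders and fill them at send time.
    template = open("verification_email.html", encoding="utf-8").read()  # hypothetical file

    replacements = {
        "[name]": "Alice",
        "[verification code]": "483-201",
    }
    for marker, value in replacements.items():
        template = template.replace(marker, value)

    # 'template' is now the finished email body; styling changes only ever
    # touch the HTML file, never the compiled code.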
Mixing code into HTML is generally not a good practice, for reasons similar to those in #1. However, I do use code in HTML for things like simple dynamic strings that are displayed multiple times on a page. I think this is better than creating multiple server controls for exactly the same values. Since this is not code "logic" mixed into the HTML, I think it is OK.