How should I format my markup? - html

When it comes to my markup, I'm anal. It always has to be perfectly indented, easily readable to me, and 100% valid with the W3C. Often, when viewing the markup of other websites, I'm appalled by the lack of effort by the developer to keep their markup in the browser clean, organized, and valid.
On the flip side, there are a lot of people who will force all their markup onto one continuous line for the size-saving benefits. This annoys me as well, though not to the same extent, because it is done with a purpose. But for the most part, it seems like no developer ever actually looks at their markup in the browser and does anything about it.
Understanding that, to the parser in the browser, indents and spaces (usually) don't matter, how should I be handling my markup? Is it worth the extra time to make my markup perfectly readable to humans as well as to the browser? Are all my \t's and \n's being used in vain?

Some browsers have bugs that render indented, well-formed HTML completely wrong, such as some versions of Internet Explorer with tables and images.
Other than that, I try to keep sane indentation. I don't spend too much time on it, just enough to make it easy to debug.

Is it worth the extra time to make my markup perfectly readable
My answer is no. The arguments:
Whoever looks at the code will probably want to modify it, so for editing the code you need a good code editor with code formatting (e.g. NetBeans). You'll very soon need other features too, like syntax coloring.
Some users might prefer a different style of formatting than you do.
Anyone interested in readable HTML can use Tidy (or the Tidy extension for Firefox) to format it.
It's a performance issue too: formatting adds extra bytes, and stripping whitespace (and minifying where possible) will speed up the site. That's very important for sites with high traffic.

It's worth the effort IMHO, since it helps you understand what exactly is going on in your HTML page, and that's definitely worth something.
If we want to write clean, elegant code in general, shouldn't we want to generate nice, clean, elegant HTML as well?

Not sure if this answers your question, but as long as the code is valid per the W3C, it is structured as intended. As far as the view-ability of the code goes (like view source), that's really up to you, but I would not add too much clutter (comments etc.). Use the correct DOCTYPE for your markup and you should be fine. I don't see any reason to "waste" time on making the source code from the browser "book" readable. View source is only beneficial to you so you can quickly see what's happening at a glance.

I like to correctly format my markup, and I think it makes it easier to manage when I do.
Then again, I use ASP.NET, and a lot of markup is generated through various controls and classes. In that case, I've decided it is not worth trying to track down each piece of mis-aligned markup and see whether something can be done to get the associated control to produce the correct result.
In short, nicely formatted markup is worth it if it can be accomplished without a huge effort.

Yes, in my opinion it is worth it. It will be easier to maintain, for you and for other colleagues, now and in the future.
As for the disadvantage of lower performance, why not develop a well-indented and commented source file and generate a minified version to run on the server? That can be achieved with a simple series of regex replacements.
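As a minimal sketch of what that series of replacements might look like (shown in Python purely for illustration; it is naive, and not safe for <pre>, <textarea>, or other whitespace-sensitive content):

    import re

    def minify_html(html: str) -> str:
        # Naive minifier: drop whitespace between adjacent tags and
        # squeeze remaining runs of whitespace. Not safe for <pre>,
        # <textarea>, or inline scripts that care about newlines.
        html = re.sub(r">\s+<", "><", html)   # whitespace between tags
        html = re.sub(r"\s{2,}", " ", html)   # runs of spaces/newlines
        return html.strip()

    print(minify_html("<ul>\n    <li>item</li>\n</ul>"))  # -> <ul><li>item</li></ul>

A real build step would special-case the whitespace-sensitive elements, but the idea is the same: author the readable version, serve the stripped one.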


Regular Expressions vs XPath when parsing HTML text

I want to parse an HTML text and find specific parts, for example the text in the 3rd div of the 1st row and 2nd column of a table. I have two options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each one?
Thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
To remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution, the very one shown above, in fact. To go off writing some massively overengineered behemoth that does a full parse and builds the entire parse tree just to do the simple edit shown above is quite simply wrong. It's also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of whacky things you aren't planning for just cries out for leveraging someone else's hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
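For a sense of scale, here is roughly what that single building block looks like: a hedged, illustrative Python pattern that matches an opening tag only, and only once the gotchas above are assumed away:

    import re

    # Illustrative only: an opening tag with attribute values in double
    # quotes, single quotes, or no quotes. Assumes none of the gotchas
    # above (CDATA, <script> bodies, odd encodings, broken quoting).
    TAG = re.compile(r"""
        <(?P<name>[A-Za-z][\w:-]*)              # tag name
        (?P<attrs>(?:\s+[\w:-]+                 # attribute name
            (?:\s*=\s*                          # optional value...
                (?:"[^"]*"|'[^']*'|[^\s>]+)     # ...quoted or bare
            )?
        )*)
        \s*/?>                                  # optional self-closing slash
    """, re.VERBOSE)

    m = TAG.search('<a href="http://example.com" class=link>click</a>')
    print(m.group("name"))   # -> a

Even this much is more than most regex users get right on the first try, which is the point being made here.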
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can and has done this, I, unlike 99.9999% of the responders here, have the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, there are a handful of other answers out there with more, um, “heroic” pattern matching, including recursion.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is ok, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With regexes, it will be up to you to handle the different ways of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parsers work if they're not using regexp?
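To make the question's own example concrete, here is a hedged sketch (assuming Python with lxml is available) of grabbing the text of the 3rd div inside the 2nd column of the 1st table row:

    from lxml import html

    page = """
    <html><body>
      <table>
        <tr>
          <td>first cell</td>
          <td><div>one</div><div>two</div><div>three</div></td>
        </tr>
      </table>
    </body></html>
    """
    doc = html.fromstring(page)

    # 1st row, 2nd column, 3rd div -- XPath indexes are 1-based.
    print(doc.xpath("//table/tr[1]/td[2]/div[3]/text()"))  # ['three']

The equivalent regex would have to anticipate every quoting and whitespace variation in that markup, which is exactly the burden XPath takes off your hands.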
XPath is less likely to break if the web developer does any minor changes. That would be my choice.
Here is the canonical Stack Overflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.

When "viewing source", some sites have neat markup, some sites don't. Why? (pic attached)

Notice how in the 'ugly' side, the doctype is all the way indented and some of the meta lines extend past the left indent.
How can I get my markup looking neat when viewing source in a browser? Is there a certain way to encode the code while using an editor? I use Notepad++ by the way.
Large blocks of unindented code like you see on the left-hand side are probably being written out server-side, so although the tag that creates them is nicely indented in your HTML, the server script output will not honour that.
It's not about encoding, it's about writing neat source code, haha. If you are outputting from PHP or something, you can keep track of how far to indent each thing, or you can use some sort of template output function that keeps track of how many tags are open for you and indents the correct amount each time. But there is no point in having neat HTML; the only important thing is that it's valid. Developer tools will make it neat for you when you're trying to debug, and actually removing all that whitespace used to make it neat can reduce your page size quite a bit.
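If you do want that kind of indent-tracking output helper, a minimal sketch might look like this (a hypothetical Python helper, not part of any framework, shown only to illustrate the bookkeeping):

    class HtmlWriter:
        # Tiny sketch: indents each line by how many tags are currently open.
        def __init__(self, indent="    "):
            self.lines, self.depth, self.indent = [], 0, indent

        def open(self, tag):
            self.lines.append(self.indent * self.depth + f"<{tag}>")
            self.depth += 1

        def text(self, s):
            self.lines.append(self.indent * self.depth + s)

        def close(self, tag):
            self.depth -= 1
            self.lines.append(self.indent * self.depth + f"</{tag}>")

        def render(self):
            return "\n".join(self.lines)

    w = HtmlWriter()
    w.open("ul")
    w.open("li"); w.text("item"); w.close("li")
    w.close("ul")
    print(w.render())   # a neatly indented <ul> with one <li>

The same bookkeeping translates directly to whatever templating layer you actually emit HTML from.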
The ugly ones probably look pretty in the underlying PHP or other source. Once generated into HTML it looks ugly, and very few programmers will try to make that pretty too - it's not worth it.
It's funny that what you list as "ugly" seems properly indented to me... at least from what I can tell from the screenshot.
In any case, it doesn't matter. Most of the time these days, sites are made with something dynamic, and a lot of the HTML formatting isn't explicitly output.
If you were to view the source on many of my sites, it is all rammed together on one line, as that is how I echo it out. I don't see the point in wasting bytes on line feeds. Especially these days with all of the browser tools available that reformat the source while debugging.
I use Eclipse to do my coding and I can use Source->Format to clean up my code and format it nicely.
For Notepad++, I believe you can use HTML tidy as per: Formatting code in Notepad++
TextFX -> HTML Tidy -> Tidy: Reindent XML
You really want your HTML code to look like this:
view-source:http://lightningsoul.com/
As it uses the minimum amount of data to present itself to the browser. Remember that indents and whitespace consume data just like any other character.

To what point is making an HTML page valid worth it?

Ever since I found out about the W3C Validator a long time ago, I have made sure every HTML document I make is valid HTML.
However, I think sometimes it just isn't necessary to spend the time making it valid. Of course, it may be important for actual Internet pages, but is it worth it for pages on an intranet, or little front-ends that are used with other programs, when the HTML page already renders correctly in the most-used browsers (not necessarily counting IE 6 and 7)?
I think I'm mostly talking about little improvements to the code, such as wrapping every displayed element of the page in <p> or <div> tags.
Making a page validate for its own sake is not really a business proposition. What happens for end-users (with their cranky browsers) is the real test.
That said, validating periodically will help you debug. It'll catch the more salient errors like unclosed tags. Which, in turn, does affect end-users. So treat validation like compiler warnings -- good for discipline.
It's the best practice, but it really comes down to an organizational requirement/desire. Is it important enough that standards add value for your organization? Or is it simply enough that it displays correctly? Often with intranets it's the latter.
Making an HTML page "valid" is worth it if you intend to be future friendly. That is, when browsers begin to strip out deprecated or vendor specific tags, you will find your page displaying incorrectly.
Web standards are there for a reason - to ensure consistent display/output among web browsers and interpreters. Choosing to write your pages in non-compliant HTML is your decision. It is also, to take an old adage, your "funeral".
What happens when the browser of choice for the intranet changes? There really isn't a way to guarantee that the code you have will render correctly in EVERY browser, but in a lot of cases browsers will be reasonably close to the standard. I think it also depends on how complex the page is, because the chances of it rendering differently in different browsers increase as the complexity of the CSS and the tag depth increase. The best approach is to write valid cross-browser code and test against your target browsers. It's silly to think that write-once, render-the-same-everywhere is possible for all browsers, but adhering to the standards is the best way to get close.

Where to draw the line between efficiency and practicality

I understand very well the need for websites' front ends to be coded and compressed as much as possible, however, I feel like I have more lax standards than others when it comes to practical applications.
For instance, while I understand why some would, I don't see anything wrong with putting selectors in the <html> or <body> tags on a website with an expected small visitation rate. I would only do this for a cheap website for a small client, because I can't really justify the cost of time otherwise.
So, that said, do you think it's okay to draw a line? Where do you draw yours?
Some best practices can be safely ignored if you know what you are doing and why you are doing it.
Don't cut corners because you are lazy, but don't over engineer a 2 page website. Use your judgment.
But if you delude yourself into thinking you are better than you are, either you or a future maintainer will be cursing your existence.
For instance, while I understand why some would, I don't see anything wrong with putting selectors in the <html> or <body> tags on a website with an expected small visitation rate
I assume you mean putting inline CSS into those tags. Well, there's nothing wrong with that per se. As far as I'm concerned, everybody is allowed to do that to their heart's content (as long as I don't have to maintain it.) But a practice that puts all the CSS into a separate style sheet, so that the HTML file consists really only of a skeleton and the actual content, is just cleaner, easier to maintain and a joy to the eyes.
I would only do this for a cheap website for a small client, because I can't really justify the cost of time otherwise.
I don't think this reasoning is correct. A cleanly separated structure is equally expensive to build when you've got the hang of it, and cheaper to maintain in the long run.
A small client who doesn't have a lot of money to spend is going to be extremely angry when he asks you to change some color and it turns out that will take you two hours because it's specified in a bunch of in-line styles rather than in a css file.
I would also argue that if you get in the habit of using an external stylesheet and just applying styles in your HTML, you will find that it's actually faster than in-line css.
Where I draw the line is going to depend on the project. You're always going to have to choose a balance between readability and efficiency.
For example, it's possible to make HTML and JavaScript very efficient by making it unreadable: stripping whitespace, shortening element, variable, and function names, and so on. To evaluate whether or not to do so, I would calculate the delta in hardware costs plus the opportunity cost of the heavier file, and compare it to the cost of writing a generator that takes clean, easy-to-read code and turns it into terse, easy-to-load code. Whichever solution costs less is the one to use.
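As a back-of-the-envelope illustration of that comparison (every number below is hypothetical, made up only to show the shape of the calculation):

    # Hypothetical figures -- substitute your own traffic and rates.
    bytes_saved_per_page = 15_000          # whitespace stripped per page
    pageviews_per_month  = 2_000_000
    cost_per_gb          = 0.09            # $ per GB transferred, assumed

    monthly_saving = bytes_saved_per_page * pageviews_per_month / 1e9 * cost_per_gb
    generator_cost = 8 * 75                # 8 hours of work at $75/hr, assumed

    print(f"${monthly_saving:.2f}/month saved vs ${generator_cost} to build the generator")
    # -> $2.70/month saved vs $600 to build the generator

With numbers like these the generator never pays for itself; multiply the traffic by a couple of orders of magnitude and the answer flips.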
Best practices are so called for a reason. Your work will always reflect on you, and although these things may seem small and insignificant, they will aid maintainability, readability, etc. when you come back to modify something. The profitability argument is one oft cited by freelancers: if you don't always do it, you may never do it. In time you'll realise it's actually quicker to "do it right" than to bodge it, and you'll be proud of your work.
Always adhere to standards and best practices!

Writing XSS Filter for (X)HTML Based on White List

I need to implement a simple and efficient XSS filter in C++ for CppCMS. I can't use existing high-quality filters
written in PHP, because CppCMS is a high-performance framework that uses C++.
The basic idea is to provide a filter that has a white list of HTML tags and a white
list of attributes for these tags. For example, typical HTML input can consist of
<b> and <i> tags and an <a> tag with href. But a straightforward implementation is not
good enough, because even simple allowed links may include XSS:
<a href="javascript:alert('XSS')">Click On Me</a>
There are many other examples that can be found there. So I also thought about the possibility of creating a white list of prefixes for attributes like href/src -- so I always need to check if it starts with (https?|ftp)://
Questions:
Are these assumptions good enough for most purposes? Meaning, if I do not allow style options for tags
and I check src/href using a white list of prefixes, does that solve the XSS problems? Are there problems that can't be fixed this way?
Is there a good reference for a formal grammar of HTML/XHTML, in order to write a simple
parser that would clean up all incorrect or forbidden tags like <script>?
You can take a look at the AntiSamy project, which is trying to accomplish the same thing. It's Java and .NET though.
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project#.NET_version
http://www.owasp.org/index.php/Category:OWASP_AntiSamy_Project_.NET
Edit 1, A bit extra :
You can potentially come up with a very strict white list. It should be well structured and pretty tight, not very flexible. When you combine flexibility, lots of tags and attributes, and different browsers, you generally end up with an XSS vulnerability.
I don't know what your requirements are, but I'd go with strict and simple tag support (only b, li, h1, etc.) and then strict attribute support based on the tag (for example, src is only valid on the img tag). Then you need to do whitelisting on the attribute values, as you stated: http|https|ftp, or style="color|background-color", etc.
Consider this one:
<x style="express/**/ion:(alert(/bah!/))">
You also need to think about character whitelisting or some UTF-8 normalization, because different encodings can cause awkward issues, such as newlines in attributes or invalid UTF-8 sequences.
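To make the tag/attribute/value whitelisting concrete, here is a hedged sketch in Python rather than C++ (purely to illustrate the logic; the tag set, the per-tag attribute map, and the scheme list are assumptions you would adapt to your own white list):

    from html.parser import HTMLParser
    from html import escape

    ALLOWED = {"b": set(), "i": set(), "p": set(), "a": {"href"}}
    SAFE_SCHEMES = ("http://", "https://", "ftp://")

    class WhitelistFilter(HTMLParser):
        # Sketch only: keeps whitelisted tags/attributes, escapes all other text.
        def __init__(self):
            super().__init__()
            self.out = []

        def handle_starttag(self, tag, attrs):
            if tag not in ALLOWED:
                return  # drop the tag itself, keep its text content
            kept = []
            for name, value in attrs:
                ok = (name in ALLOWED[tag] and value
                      and value.lower().lstrip().startswith(SAFE_SCHEMES))
                if ok:
                    kept.append(f' {name}="{escape(value, quote=True)}"')
            self.out.append(f"<{tag}{''.join(kept)}>")

        def handle_endtag(self, tag):
            if tag in ALLOWED:
                self.out.append(f"</{tag}>")

        def handle_data(self, data):
            self.out.append(escape(data))

    f = WhitelistFilter()
    f.feed('<a href="javascript:alert(1)">Click On Me</a><b>ok</b>')
    print("".join(f.out))  # -> <a>Click On Me</a><b>ok</b>

A production filter would also have to drop the contents of disallowed <script>/<style> elements, normalize encodings, and survive malformed nesting, which is exactly the hard part the other answers warn about.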
All the details of HTML parsing are specified in HTML5. However, implementing it is quite a lot of work, and it doesn't matter whether you parse HTML exactly, with all corner cases. At worst you'll end up with a different DOM, but you have to sanitize the DOM anyway.
As you mentioned, there are various PHP implementations of this, but I don't know of any in C++, since that's not a language typically applied to web development. Overall, it's going to depend on how complex of an implementation you want to come up with.
A very restrictive whitelist is probably the "simplest" way, but if you want to be really comprehensive I would look into doing a conversion of one of the established versions to C++, as opposed to trying to write your own from scratch. There are so many tricks to worry about, that I think you'd be better off standing on the shoulders of others that have already gone through all that.
I don't know anything about using C++ for web development, but converting PHP to it doesn't seem like it would be a particularly difficult task; PHP doesn't really have any magical capabilities that C++ won't be able to duplicate. I'm sure there will be some small hitches, but overall, if you want to go the more complex route, it would definitely still be faster to do a conversion than a full design from scratch.
HTML Purifier seems like a strong PHP implementation that is still actively maintained; there's a comparison document where the author discusses some differences between his approach and others', which is probably worth reading.
Whatever you come up with, definitely test it with all the examples you link, and make sure it passes all those. Good luck!