Related
I want to parse a HTML text and find special parts. For example a text in 3rd div of 1st row and 2nd column of a table. I have 2 options to parse: Regular Expressions and XPath. What is advantages and disadvantages of each one?
thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
To remove the breaks from lines 200–300.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out it, sight unseen.
The first calls for a regex solution — the very one shown above, in fact. To go off writing some massively overengineered behemoth to do a fall parse to set up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of whacky things you aren’t planning for just cries out for leveraging someone else’s hard work intead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when things gotchas like alternate encodings and CDATA sections and redefined entitities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone has actually can and has done this, I unlike 99.9999% of the responders here the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion,
here,
here,
here,
here,
here, and
here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much moreso than you can expect in nonwizards. That difficulty may chase you away, which is ok, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With RegExp, it will be up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, in one line, in multi-lines, with inner data, without inner data, etc). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting simblings, and MANY others).
See how powerfull it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parses work if they're not using regexp?
XPath is less likely to break if the web developer does any minor changes. That would be my choice.
Here is the canonical Stackoverflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.
I am trying to create a bracket system using HTML. I've found other solutions, however, most require lots of absolute/relative positioning or tables.
I'm looking for a way to make it flexible, so I can just change the HTML to change it from a 16-man bracket to a 64-man bracket.
[404 - link removed]
Now, I don't see much wrong with my current example, however, I'm just curious if there is anyone out there has some suggestions on improving or completely changing the way I am doing it.
I'd rather stay away from tables, and definitely stay away from any sort of positioning (this is meant to be flexible).
If you have any ideas, that would be great. :)
Thanks,
Andrew
That actually looks fairly good. What I would do to improve it is encapsulate the logic in a bit of Javascript, supply the bracket information in some sort of text format, and have the Javascript parse the text format to generate the bracket as deeply as you need it.
When it comes to my markup, I'm anal. It always has to be perfectly indented, easily readable to me, and 100% valid with the W3C. Often time, when viewing the markup of other websites, I'm appalled with the lack of effort by the developer to try to and keep their markup in the browser clean, organized, and valid.
On the flip side, there's a lot of people who will force all their markup on to one, continuous line for the size saving benefits. This annoys me as well, though not to the same extent because it is done with a purpose. But for the most part, it seems like no developer ever actually looks at their markup in the browser and does anything about it.
Understanding that, to the parser in the browser, indents and spaces (usually) don't matter, how should I be handling my markup? Is it worth the extra time to get my markup perfectly easily readable to humans as well as the browser? Are all my \t's and \n's being used in vain?
There are some browsers who has bugs that renders indented well formed html completely wrong. Such as some versions of Internet explorer with tables and images.
Other than that, i try to keep sane indention, I don't spend to much time with it, just enough to make it easy to debug.
Is it worth the extra time to get my markup perfectly easily readable
My answer is no. The arguments:
Whoever tries to look at the code probably will want modify it so, for editing the code you need good code editor with code formatting (e.g. Netbeans). You'll very soon need other features like, syntax coloring.
Some users might prefer other type of formatting than you.
Anyone interested in readable HTML may use Tidy (of Tidy extension to Firefox) to format it.
It's a performance issue too: additional overload of formatting + stripping whitespace (and minifying when possible) will speed up the site. It's very important for sites with high traffic.
It's worth the effort imho since it helps you understand what exactly is going on in your html page, and that's definitely worth something.
If we want to write clean, elegant code in general this means we should want to generate nice, clean elegant html as well, not?
Not sure if this answers your question, but as long as the code is valid by W3C, is structured as intended. As far as your view-ability of the code (like view source) structure, that's really up to you, but I would not add too much clutter (comments etc). Use the correct DOCTYPE for your markup and you should be fine with that. I don't see any reason to "waste" time on making the source code from the browser "book" readable. The view source would only be beneficial to you so you can quickly see what's happening at a glance through source view.
I like to correctly format my markup, and I think it makes it easier to manage when I do.
Then again, I use ASP.NET and a lot of markup is generated through various controls and classes. In this case, I've decided it is not worth trying to track down each mis-aligned markup and see if something can be done to get the associated control to produce the correct result.
In short, nicely formatted markup is worth it if it can be accomplished without a huge effort.
Yes, in my opinion it is worth. It will be easier to maintain, for you and for other collegues, now and in the future.
About the disadvantage of lower performance, why not to develop a well indented and commented source file and to generate a minimized version to run on the server? It can be acheived with a simple series of regex replacements.
This question already has answers here:
Closed 13 years ago.
I am developing a "modern" website, and I'm having a lot of trouble getting the CSS to make everything line up properly. I feel like they layout would be a lot easier if I just used a table, but I've been avoiding <table> tags, because I've been told that they are "old-fashioned" and not the right way to do things.
Is it okay to use tables? How do I decide when a table is appropriate, and when I should use CSS instead? Do I just do whatever is easier?
The answer is yes, it's fine to use tables. The general rule of thumb is that if you are displaying tabular data, a table is probably a good way to go. You should generally try to style your table with css as much as you can though.
Also, this pie graph might help you:
alt text http://www.ratemyeverything.net/image/7292/0/Time_Breakdown_of_Modern_Web_Design.ashx
EDIT: Tables are fine. For displaying data. Just like my second sentence stated. The question was "is it ok to use tables". The answer is - yes, it is ok to use tables. It is not illegal.
Since even though it's implied to use tables for data in my general rule of thumb, apparently I must also state that the corollary is that it's not ok to use tables for anything else, even though the poster already seemed to grasp this concept. So, for the record, the general rule of thumb is to not use tables for laying out your site.
Tables should be used to represent tabular data. CSS should be used for presentation and layout.
This question has also been exhaustively answered here:
Why not use tables for layout in HTML?
Essentially - if you have tabular data, then use a table. There's really no need now to use tables for layout - sure, they were often considered 'easier' but semantically the page is horrid, they were often considered inaccessible.
See some discussion:
css-discuss
and a particularly comical URL - shouldiusetablesforlayout.com
In the 'modern' approach of tables it is not about using table tags or div tags, but about using the right tag for the right purpose.
The table tag is used for tabular data. There is nothing wrong with using it for that!
For using CSS, there are a lot of tutorials and guides (good and bad) around. Indicators of a bad tutorial are: lot of use of blocks (divs) that only make sense for the layout and not for the content. Good signs are the ones that advise to use the right tags for the right content and teach you how to make up that tags.
Tables are only appropriate for tabular data. Imagine you have to add some spreadsheet like data, where you have clear row/column headers, and some data inside those rows.
A product comparison, for example, is also a valid table item.
I believe that tables are OK for display of rectilinear data of arbitrary rows and/or columns. That's about it. Tables should not be used for layout purposes anymore.
In general, HTML markup should describe the structure and content of a web page—it should not be used to control presentational aspects such as layout and styling (that's what CSS is for). A <table> tag, like most have already said, should represent tabular data—something that would appear as a table of information.
The reason why people rag on tables so much is that in the old days, there was no such thing as CSS—all page layout was done directly in HTML. Tags were not thought of as describing content—all anyone really cared about was how a tag would make things look in a web browser. As a result of this, people figured that, since they could organize things into rows and columns, tables must be good for laying out elements of a web page. This became a really popular technique—in fact, I'd wager that using tables was considered the preferred method of laying out web pages for quite some time.
So when people tell you that tables are "old-fashioned," they are specifically referring to this abuse of the <table> tag that was so popular back in the old days. Like I said, there's nothing wrong with HTML tables themselves, but using them for web page layout just doesn't make sense nowadays.
(Plus, from a purely pragmatic standpoint, layouts done with HTML tables are very inflexible and hard to maintain.)
its ok to use tables when you are showing data in a grid / tabular format. however, for general structure of the site, its highly recommended that you use css driven div, ul, li elements to give you more lucid website.
If you anyways decide to work with tables, you must consider the following cons :
they are not SEO friendly
they are quite rigid in terms of their structure and at times difficult to maintain as well
you may be spending little extra time on div based website, but its worth every minute spent.
The whole "anti-Table" movement is a reaction to a time when deeply nested tables were the only method to layout pages, leading to HTML that was very hard to understand.
Tables are a valid method for tabular (data) layout, and if a table is the easiest way to implement a layout, then by any means use a table.
Table is always the right choice when you have the need to present data in a grid.
Quoting Sitepoints's book HTML Utopia: Designing Without Tables using CSS
If you have tabular data and the appearance of that data is less important than its appropriate display in connection with other portions of the same data set, then a table is in order. If you have information that would best be displayed in a spreadsheet such as Excel, you have tabular data.
I would say no for using tables to construct your layout. Tables make sense only for actual tabular data you need to represent. If you spend enough time figuring the CSS out you will find its easier then using tables for a layout. Just remember: Tables for displaying data. CSS for page layouts.
Tables are just that: Tables.
They are frowned upon because they should not be used for layout, as has been the fashionable thing to do before browsers could position stuff properly.
If what you want to markup is, in fact, a table, then use a table. Other than that, try to stay away.
One small thing: Aligning two parts of text to the exact same line that won't move apart (think, username and post date). There using a table is IMHO an option.
First get it working. Then get it perfect.
Get the layout done in some way before making it perfect or better.
How many people per day will go to the page you are working on? A million? or 20 ?
How much time are you going to spend on CSS issues instead of other issues? Does your boss want you to spend this much time on the issue? Does he/she know what you are doing?
Absolutely. I don't know where CSS zealots invented the idea that tables are not naturally used for "layout". Tables have been used for laying things out since their invention, whether those things be numbers, words, or pretty pictures. That's what they do. Moreover, table is part of all versions of (X)HTML so there are no deprecation concerns.
Absolutely.
All that HTML offers was originally intended for you to define the markup of your page. In my book, absolute and relative positions of elements on a page belong to markup. So both divs and tables are very much suited for this task. Pick up what works best for your particular need.
CSS adds many styling possibilities and also layout tricks but it complements HTML options not replaces them.
There is actually a very fine line between seeing something as a markup or styling issue. CSS proponents would say that with CSS you can relocate and reshuffle completely all big and little pieces of a page. I cannot however imagine putting header below, footer above and making things appear in reverse order.
Take an example. You design a notebook. You know where to place major components, mainboard, cooling system, keyboard, display and ports. You may certainly wish to rearrange a little bit port connectors, on whic side and in which sequence they appear, but you don't really expect to put display where the keyboard is, put keyboard on the lid, make fans blow to your face and have all connectors on the botom to be reached through holes on your desk.
Using tables can make it slightly difficult to rearrange elements on a page. This might be true. However, in most cases you know in advance how approximately your page should look like and you would not want to change everything drastically. if you can't say it before your begin your work you probably have no clear idea what you are doing and what for.
Moreover, only tables possess elastic properties, which allows the to stretch to the width/height of their content. Nothing else of HTML/CSS can be used to do that.
CSS design on one side allows you to create quite adjustable designs. On the other hand, it locks you out from designing a page adjustable to its content. Both wins and losses.
Table is also the only tool to make very complex and precise interfaces. For example, the page SO is very simple. It probably can be done with pure CSS. In the meantime, have you seen any enterprise-class software like CRMs, SRMs etc? That multitude of buttons, text field, check boxes, dropdownlists all precisely located on a screen? Good luck achieving that kind of complexity with just CSS. And these layouts migrate from desktop applications into web each day (keyword: software-as-a-service).
So choose what suits best your current need and don't trust those CSS lovers. Actually don't trust any fanatics at all.
Let's say I have a string holding a mess of text and (x)HTML tags. I want to remove all instances of a given tag (and any attributes of that tag), leaving all other tags and text along. What's the best Regex to get this done?
Edited to add: Oh, I appreciate that using a Regex for this particular issue is not the best solution. However, for the sake of discussion can we assume that that particular technical decision was made a few levels over my pay grade? ;)
Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead, there should be one available for your chosen language.
You might be able to get away with something like this:
</?tag[^>]*?>
But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
Use a parser instead :)
I think there is some serious anti-regex bigotry happening here. There are lots of times when you may want to strip a particular tag out of some markup when it doesn't make sense to use a full blown parser.
Of course there are times when a parser might be the best option, but if you are looking for a regex then:
<script[^>]*?>[\s\S]*?<\/script>
That would remove script tags and their contents. Make sure that you use case-insensitive matching.
If you don't want to remove the contents of the tag then you can use:
<\/?script[^>]*?>
An example of usage in javascript would be:
function stripScripts(markup) {
return markup.replace(/<script[^>]*?>[\s\S]*?<\/script>/gi, '');
}
var safeText = stripScripts(textarea.value);
I think it might be Raymond Chen (blogs.msdn.com/oldnewthing) that I'm paraphrasing (badly!) here... But, you want a Regular Expression? "Now you have two problems" ... :=)
If the string is well-formed (X)HTML, could you load it up into a parser (HTML/XML) and use this to remove any nodes of the offending variety? If it's not well-formed, then it becomes a bit more tricky, but, I suspect that a RegEx isn't the best way to go about this...
There are just TOO many ways a single tag can appear, not to mention encodings, variants, etc.
I strongly suggest you rethink this approach.... you really shouldnt have to be handling HTML directly, anyway.
Off the top of my head, I'd say this will get you started in the right direction.
s/<TAG[^>]*>([^<]*)</TAG[^>]*>/\1
Basically find the starting tag, any text in between the tags, and then the ending tag. Replace the whole thing with whatever was in between the tags.
Corrected answer:
</?TAG\b[^>]*?>
Because Dans answer would remove <br />, but you want only <b>
Here's a regex I wrote for this purpose, it works in a few more situations:
</?(?(?=b|img|a|script)notag|[a-zA-Z0-9]+)(?:\s[a-zA-Z0-9\-]+=?(?:(["",']?).*?\1?)?)*\s*/?>
While using regexes for parsing HTML is generally frowned upon or looked down on, you almost certainly don't want to write your own parser.
You could however use some inbuilt or library functions to achieve what you need.
JavaScript has getElementsByTagName and getElementById, not to mention jQuery.
PHP has the DOM extension.
Python has the awesome Beautiful Soup
...and many more.