Differences in query algorithms between XPath and CSS - html

I'm wondering why someone would want to use CSS selectors rather than XPath selectors, or vice versa, when either one is available. I think that understanding the algorithms that process the two languages would answer my question.
There's a lot of documentation on XPath and CSS selectors individually, but I've found very few comparisons. Also, I don't use CSS selectors that much.
Here's what I've read about the differences. (These three references discuss the use of XPath and CSS selectors in Selenium to query HTML, but my question is general.)
XPath allows traversal from child to parent
CSS selectors have features specific to HTML
CSS selectors are faster when you're using Internet Explorer in Selenium
It looks like CSS selection algorithms are somehow optimized for HTML, but I don't know how.
Is there a paper on how CSS and XPath query algorithms work and how they differ?
Are there other abstract differences between the languages that I'm missing?

The main difference lies in how stable the document structure you target is:
XPath is a good query language when the structure matters and/or is stable. You usually specify a path, conditions, an exact offset, and so on. It is also a good query language for retrieving a set of similar objects, and because of that it has an intimate relationship with XQuery. Here the document has a stable structure and you must retrieve repeated or similar sections.
CSS selectors are better suited to CSS stylesheets. Stylesheets do not care much about the document structure, because it changes a lot. Think of one CSS stylesheet applied to all the HTML pages of a website: the content and structure of every page are different. CSS selectors are better here because of that changing structure. You will notice that access is more tag based. Most CSS syntax specifies a set of elements, attributes, IDs, classes and so on, and not so much their structure. Here you must locate sections that do not have a clear location within the document structure but are marked with certain attributes.
Update: After a closer look at your question I realized that you are more interested in the current implementations than in the nature of the query languages. In that case I cannot give you the answer you are looking for. I can only suppose that the reason is still that one is more dependent on the structure than the other.
For example, in XPath you must keep track of the structure of the document you are working on. CSS selectors, on the other hand, are triggered when a specific tag shows up, and it usually does not matter what came before it. I can imagine that it is much easier to implement a CSS selector algorithm that works as you read a document, while XPath has more cases where you really need the full document and/or a strict record of what has been read (because the history and context of what you are reading matter more).
Now, do not take my update too seriously. I am only guessing here: I have some background in language parsing, but I do not actually have experience with languages designed for data querying.
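As a rough illustration of the guess in the update above (not a description of how any real engine works), here is a minimal Python sketch that matches a simple `tag.class` selector while the document is still being read. The class name `StreamingMatcher` and the sample markup are made up for this example; only the standard library's `html.parser` is used.

```python
from html.parser import HTMLParser


class StreamingMatcher(HTMLParser):
    """Match a simple `tag.class` selector while the document streams in.

    Hypothetical helper for illustration; real engines work on a DOM tree.
    """

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.matches = []  # (line, column) of every matching start tag

    def handle_starttag(self, tag, attrs):
        # The decision needs only the tag that just showed up and its
        # attributes, not anything that came before or after it.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.matches.append(self.getpos())


matcher = StreamingMatcher("p", "note")
# The document can arrive in pieces; matches are recorded as tags are seen.
matcher.feed('<div><p class="note big">a</p><p>b</p></div>')
matcher.feed('<section><p class="note">c</p></section>')
match_count = len(matcher.matches)
```

An expression that depends on what comes later, such as XPath's `p[last()]`, cannot be decided at the moment the start tag is seen, which may be the kind of case the answer is guessing at.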

Related

What happened to the "Use efficient CSS selectors" rule?

There was a recommendation by Google PageSpeed that asked web developers to Use efficient CSS selectors:
Avoiding inefficient key selectors that match large numbers of
elements can speed up page rendering.
Details
As the browser parses HTML, it constructs an internal document tree
representing all the elements to be displayed. It then matches
elements to styles specified in various stylesheets, according to the
standard CSS cascade, inheritance, and ordering rules. In Mozilla's
implementation (and probably others as well), for each element, the
CSS engine searches through style rules to find a match. The engine
evaluates each rule from right to left, starting from the rightmost
selector (called the "key") and moving through each selector until it
finds a match or discards the rule. (The "selector" is the document
element to which the rule should apply.)
According to this system, the fewer rules the engine has to evaluate
the better. [...]. After that, for pages that contain large numbers of
elements and/or large numbers of CSS rules, optimizing the definitions
of the rules themselves can enhance performance as well. The key to
optimizing rules lies in defining rules that are as specific as
possible and that avoid unnecessary redundancy, to allow the style
engine to quickly find matches without spending time evaluating rules
that don't apply.
This recommendation has been removed from current Page Speed Insights rules. Now I am wondering why this rule was removed. Did browsers get efficient at matching CSS rules in the meantime? And is this recommendation valid anymore?
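The right-to-left evaluation described in the quoted recommendation can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the `Node` class is a made-up stand-in for a DOM element, and only tag names, classes and the descendant combinator are handled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Tiny stand-in for a DOM element (hypothetical, for illustration)."""
    tag: str
    classes: frozenset = field(default_factory=frozenset)
    parent: Optional["Node"] = None


def matches_simple(node, simple):
    """Match one simple selector: 'tag' or '.class'."""
    if simple.startswith("."):
        return simple[1:] in node.classes
    return node.tag == simple


def matches_descendant_selector(node, selector):
    """Match a descendant-combinator selector such as 'ul .item a'.

    Evaluation runs right to left: the key (rightmost) part is tested
    against the element itself, then the remaining parts are looked for
    among its ancestors.
    """
    parts = selector.split()
    if not matches_simple(node, parts[-1]):
        return False  # key selector fails: the rule is discarded at once
    ancestor = node.parent
    for simple in reversed(parts[:-1]):
        # Climb until some ancestor satisfies this part of the selector.
        while ancestor is not None and not matches_simple(ancestor, simple):
            ancestor = ancestor.parent
        if ancestor is None:
            return False
        ancestor = ancestor.parent
    return True


ul = Node("ul")
li = Node("li", frozenset({"item"}), parent=ul)
link = Node("a", parent=li)
stray = Node("a", parent=Node("div"))

link_matches = matches_descendant_selector(link, "ul .item a")
stray_matches = matches_descendant_selector(stray, "ul .item a")
li_matches = matches_descendant_selector(li, "ul .item a")
```

The `li` element is rejected after a single comparison because the key selector `a` fails, whereas the stray `a` passes the key test and is rejected only after the walk up its ancestors. That is why the recommendation singles out key selectors that match large numbers of elements.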
In Feb 2011, WebKit core developer Antti Koivisto made several improvements to CSS selector performance in WebKit.
Antti Koivisto taught the CSS Style Selector to skip over sibling selectors and faster sorting, which bring some minor improvements, after which he landed two more awesome patches: one which enables ancestor identifier filtering for tree building, halving the remaining time in style matching over a typical page load, and a fast path for simple selectors that speed up matching up another 50% on some websites.
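The idea behind ancestor identifier filtering can be sketched as a cheap pre-check that runs before any walk up the tree. This is only an approximation under stated assumptions: the tuple representation of ancestors and both function names are invented for the example, and an exact Python set stands in for whatever compact structure a real engine uses.

```python
def ancestor_identifiers(ancestors):
    """Collect every tag, id and class seen on the ancestor chain.

    `ancestors` is a list of (tag, id, classes) tuples, a made-up
    representation chosen for this sketch.
    """
    seen = set()
    for tag, element_id, classes in ancestors:
        seen.add(tag)
        if element_id:
            seen.add("#" + element_id)
        seen.update("." + name for name in classes)
    return seen


def may_match(required_ancestor_parts, seen):
    """Cheap pre-check run before the full right-to-left walk.

    If any identifier the selector requires on an ancestor was never
    seen, the selector cannot match and the walk is skipped.
    """
    return all(part in seen for part in required_ancestor_parts)


ancestors = [("body", None, []), ("div", "main", ["page"]), ("ul", None, ["menu"])]
seen = ancestor_identifiers(ancestors)

# Selector '#sidebar .menu a': ancestors must supply '#sidebar' and '.menu'.
skip_sidebar_rule = not may_match(["#sidebar", ".menu"], seen)
# Selector '#main .menu a': both identifiers are present, so do the full match.
check_main_rule = may_match(["#main", ".menu"], seen)
```

A positive answer from `may_match` only means the full match still has to be done; the filter can rule selectors out but never proves a match.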
CSS Selector Performance has changed! (For the better) by Nicole Sullivan runs through these improvements in greater detail. In summary -
According to Antti, direct and indirect adjacent combinators can still be slow, however, ancestor filters and rule hashes can lower the impact as those selectors will only rarely be matched. He also says that there is still a lot of room for webkit to optimize pseudo classes and elements, but regardless they are much faster than trying to do the same thing with JavaScript and DOM manipulations. In fact, though there is still room for improvement, he says:
“Used in moderation pretty much everything will perform just fine from the style matching perspective.”
While browsers are much faster at matching CSS selectors, it's worth reiterating that CSS selectors should still be optimised (eg. kept as 'flat' as possible) to reduce file sizes and avoid specificity issues.
Here's a thorough article (which is dated early 2014)
I am quoting Benjamin Poulain, a WebKit Engineer who had a lot to say about the CSS selectors performance test:
~10% of the time is spent in the rasterizer.
~21% of the time is spent on the first layout.
~48% of the time is spent in the parser and DOM tree creation.
~8% is spent on style resolution.
~5% is spent on collecting the style – this is what we should be testing and what should take most of the time.
(The remaining time is spread over many, many little functions.)
And he continues:
“I completely agree it is useless to optimize selectors upfront, but
for completely different reasons:
It is practically impossible to predict the final performance impact
of a given selector by just examining the selectors. In the engine,
selectors are reordered, split, collected and compiled. To know the
final performance of a given selectors, you would have to know in
which bucket the selector was collected, how it is compiled, and
finally what does the DOM tree looks like.
All of that is very different between the various engines, making the
whole process even less predictable.
The second argument I have against web developers optimizing selectors
is that they will likely make things worse. The amount of
misinformation about selectors is larger than correct cross-browser
information. The chance of someone doing the right thing is pretty
low.
In practice, people discover performance problems with CSS and start
removing rules one by one until the problem go away. I think that is
the right way to go about this, it is easy and will lead to correct
outcome.”
There are approaches, BEM for example, that keep the CSS as flat as possible to minimize dependency on the DOM hierarchy and to decouple web components, so that they can be "moved" around the DOM and still work.
Maybe the rule was removed because writing CSS for CMSes or frameworks is more common now, and it is hard to avoid general CSS selectors there if you want to limit the complexity of the stylesheet.
Also, modern browsers are really fast at rendering CSS. Even with huge stylesheets on IE9, rendering did not feel slow. (I must admit I tested on a good computer. Maybe there are benchmarks out there.)
Anyway, I think you would have to write very inefficient CSS to slow down Chrome or Firefox...
There's a two-year-old post on performance: Which CSS selectors or rules can significantly affect front-end layout / rendering performance in the real world?
I like his one-line conclusion: anything within the limits of "yeah, this CSS makes sense" is okay.

Why does CSS work with fake elements?

In my class, I was playing around and found out that CSS works with made-up elements.
Example:
imsocool {
color:blue;
}
<imsocool>HELLO</imsocool>
When my professor first saw me using this, he was a bit surprised that made-up elements worked and recommended I simply change all of my made-up elements to paragraphs with IDs.
Why doesn't my professor want me to use made-up elements? They work effectively.
Also, why didn't he know that made-up elements exist and work with CSS? Are they uncommon?
Why does CSS work with fake elements?
(Most) browsers are designed to be (to some degree) forward compatible with future additions to HTML. Unrecognised elements are parsed into the DOM, but have no semantics or specialised default rendering associated with them.
When a new element is added to the specification, sometimes CSS, JavaScript and ARIA can be used to provide the same functionality in older browsers (and the elements have to appear in the DOM for those languages to be able to manipulate them to add that functionality).
(There is a specification for custom elements, but they have specific naming requirements and require registering using JavaScript.)
Why doesn't my professor want me to use made-up elements?
They are not allowed by the HTML specification
They might conflict with future standard elements with the same name
There is probably an existing HTML element that is better suited to the task
Also; why didn't he know that made-up elements existed and worked with CSS. Are they uncommon?
Yes. People don't use them because they have the above problems.
TL;DR
Custom tags are invalid in HTML. This may lead to rendering issues.
Makes future development more difficult since code is not portable.
Valid HTML offers a lot of benefits such as SEO, speed, and professionalism.
Long Answer
There are some arguments that code with custom tags is more usable.
However, it leads to invalid HTML, which is not good for your site.
The Point of Valid CSS/HTML | StackOverflow
Google prefers it so it is good for SEO.
It makes your web page more likely to work in browsers you haven't tested.
It makes you look more professional (to some developers at least)
Compliant browsers can render valid HTML faster
It points out a bunch of obscure bugs you've probably missed that affect things you probably haven't tested e.g. the codepage or language set of the page.
Why Validate | W3C
Validation as a debugging tool
Validation as a future-proof quality check
Validation eases maintenance
Validation helps teach good practices
Validation is a sign of professionalism
YADA (yet another (different) answer)
Edit: Please see the comment from BoltClock below regarding type vs tag vs element. I usually don't worry about semantics but his comment is very appropriate and informative.
Although there are already a bunch of good replies, you indicated that your professor prompted you to post this question so it appears you are (formally) in school. I thought I would expound a little bit more in depth about not only CSS but also the mechanics of web browsers. According to Wikipedia, "CSS is a style sheet language used for describing ... a document written in a markup language." (I added the emphasis on "a") Notice that it doesn't say "written in HTML" much less a specific version of HTML. CSS can be used on HTML, XHTML, XML, SGML, XAML, etc. Of course, you need something that will render each of these document types that will also apply styling. By definition, CSS does not know / understand / care about specific markup language tags. So, the tags may be "invalid" as far as HTML is concerned, but there is no concept of a "valid" tag/element/type in CSS.
Modern visual browsers are not monolithic programs. They are an amalgam of different "engines" that have specific jobs to do. At a bare minimum I can think of three engines: the rendering engine, the CSS engine, and the JavaScript engine/VM. I'm not sure whether the parser is part of the rendering engine (or vice versa) or whether it is a separate engine, but you get the idea.
Whether or not a visual browser (others have already addressed the fact that screen readers might have other challenges dealing with invalid tags) applies the formatting depends on whether the parser leaves the "invalid" tag in the document and then whether the rendering engine applies styles to that tag. Since it would make them more difficult to develop and maintain, CSS engines are not written to understand that "This is an HTML document so here is the list of valid tags / elements / types." CSS engines simply find tags / elements / types and then tell the rendering engine, "Here are the styles you should apply." Whether or not the rendering engine decides to actually apply the styles is up to it.
Here is an easy way to think of the basic flow from engine to engine: parser -> CSS -> rendering. In reality it is much more convoluted but this is good enough for starters.
This answer is already too long so I will end there.
Unknown elements are kept in the DOM by modern browsers and rendered as generic elements (by default they behave like inline elements, so closer to a span than a div). That's why they work. This is part of the upcoming HTML5 standard, which introduces a modular structure to which new elements can be added.
In older browsers (I think IE7 and below) you can apply a JavaScript trick, after which they will work as well.
Here is a related question I found when looking for an example.
Here is a question about the JavaScript fix. It turns out it is indeed IE7 that doesn't support these elements out of the box.
Also; why didn't he know that made-up tags existed and worked with CSS. Are they uncommon?
Yes, quite. But above all, they don't serve any additional purpose. And they are new to HTML5; in earlier versions of HTML an unknown tag was invalid.
Also, teachers seem to have gaps in their knowledge, sometimes. This might be due to the fact that they need to teach students the basics about a given subject, and it doesn't really pay off to know all the ins and outs and be really up to date.
I once got detention because a teacher thought I programmed a virus, just because I could make a computer play music using the play command in GWBasic. (True story, and yes, long ago.) But whatever the reason, I think the advice not to use custom elements is a sound one.
Actually you can use custom elements. Here is the W3C spec on this subject:
http://w3c.github.io/webcomponents/spec/custom/
And here is a tutorial explaining how to use them:
http://www.html5rocks.com/en/tutorials/webcomponents/customelements/
As pointed out by @Quentin, this is a draft specification in the early days of development, and it imposes restrictions on what the element names can be.
There are a few things about the other answers that are either just poorly phrased or perhaps a little incorrect.
FALSE(ish): Non-standard HTML elements are "not allowed", "illegal", or "invalid".
Not necessarily. They're "non-conforming". What's the difference? Something can "not conform" and still be "allowed". The W3C aren't going to send the HTML police to your home and haul you away.
The W3C left things this way for a reason. Conformance and specifications are defined by a community. If you happen to have a smaller community consuming HTML for more specific purposes and they all agree on some new elements they need to make things easier, they can have what the W3C refers to as "other applicable specifications". (This is a gross oversimplification, obviously, but you get the idea.)
That said, strict validators will declare your non-standard elements to be "invalid", but that's because the validator's job is to ensure conformance to whatever spec it's validating for, not to ensure "legality" for the browser or for use.
FALSE(ish): Non-standard HTML elements will result in rendering issues
Possibly, but unlikely. (replace "will" with "might") The only way this should result in a rendering issue is if your custom element conflicts with another specification, such as a change to the HTML spec or another specification being honored within the same system (such as SVG, Math, or something custom).
In fact, the reason CSS can style non-standard tags is because the HTML specification clearly states that:
User agents must treat elements and attributes that they do not understand as semantically neutral; leaving them in the DOM (for DOM processors), and styling them according to CSS (for CSS processors), but not inferring any meaning from them
Note: if you want to use a custom tag, just remember a change to the HTML spec at a later time could blow your styling up, so be prepared. It's really unlikely that the W3C will implement the <imsocool> tag, however.
Non-standard tags and JavaScript (via the DOM)
The reason you can access and alter custom elements using JavaScript is because the specification even talks about how they should be handled in the DOM, which is the (really horrible) API that allows you to manipulate the elements on your page.
The HTMLUnknownElement interface must be used for HTML elements that are not defined by this specification (or other applicable specifications).
TL;DR: Conforming to the spec is done for purposes of communication and safety. Non-conformance is still allowed by everything but a validator, whose sole purpose is to enforce conformity, but whose use is optional.
For example:
var wee = document.createElement('wee');
console.log(wee.toString()); //[object HTMLUnknownElement]
(I'm sure this will draw flames, but there's my 2 cents)
According to the specs:
CSS
A type selector is the name of a document language element type written using the syntax of CSS qualified names
I thought this was called the element selector, but apparently it is actually the type selector. The spec goes on to talk about CSS qualified names which put no restriction on what the names actually are. That is to say that as long as the type selector matches CSS qualified name syntax it is technically correct CSS and will match the element in the document. There is no CSS-specific restriction on elements that do not exist in a particular spec -- HTML or otherwise.
HTML
There is no official restriction on including any tags in the document that you want. However, the documentation does say
Authors must not use elements, attributes, or attribute values for purposes other than their appropriate intended semantic purpose, as doing so prevents software from correctly processing the page.
And it later says
Authors must not use elements, attributes, or attribute values that are not permitted by this specification or other applicable specifications, as doing so makes it significantly harder for the language to be extended in the future.
I'm not sure specifically where or if the spec says that unknown elements are allowed, but it does talk about the HTMLUnknownElement interface for unrecognized elements. Some browsers may not even recognize elements that are in the current spec (IE8 comes to mind).
There is a draft for custom elements, though, but I doubt it is implemented anywhere yet.
This is possible with HTML5, but you need to take older browsers into consideration.
If you do decide to use them, make sure to COMMENT your HTML! Some people may have trouble figuring out what the tags are, so a comment could save them a ton of time.
Something like this:
<!-- Custom tags in use, refer to their CSS for aid -->
When you make your own custom tags/elements, older browsers will have no clue what they are, just as with HTML5 elements like nav/section.
If you are interested in this concept, then I recommend doing it the right way.
Getting started
Custom Elements allow web developers to define new types of HTML
elements. The spec is one of several new API primitives landing under
the Web Components umbrella, but it's quite possibly the most
important. Web Components don't exist without the features unlocked by
custom elements:
Define new HTML/DOM elements
Create elements that extend from other elements
Logically bundle together custom functionality into a single tag
Extend the API of existing DOM elements
There is a lot you can do with it, and it does make your script beautiful, as this article likes to put it: Custom Elements, defining new elements in HTML.
So let's recap:
Pros
Very elegant and easy to read.
It is nice to not see so many divs. :p
Allows a unique feel to the code
Cons
Older browser support is a strong thing to consider.
Other developers may have no clue what to do if they don't know about custom tags. (Explain to them or add comments to inform them)
Lastly, one thing to take into consideration, though I am unsure about it, is block versus inline elements. By using custom tags you are going to end up writing more CSS, because a custom tag won't have a default style of its own.
The choice is entirely up to you and you should base it on what the project is asking for.
Update 1/2/2014
Here is a very helpful article I found and figured I would share, Custom Elements.
Learn the tech
Why Custom Elements?
Custom Elements let authors define their own elements. Authors associate JavaScript code with custom tag names, and then use those custom tag names as they would any standard tag.
For example, after registering a special kind of button called super-button, use the super button just like this:
<super-button></super-button>
Custom elements are still elements. We can create, use, manipulate, and compose them just as easily as any standard <div> or <span> today.
This seems like a very good library to use but I did notice it didn't pass Window's Build status. This is also in a pre-alpha I believe so I would keep an eye on this while it develops.
Why doesn't he want you to use them? They are not common nor part of the HTML5 standard.
Technically, they are not allowed. They are a hack.
I like them myself, though. You may be interested in XHTML5. It allows you to define your own tags and use them as part of the standard.
Also, as others have pointed out, they are invalid and thus not portable.
Why didn't he know that they exist? I don't know, except that they are not common. Possibly he was just not aware that you could.
Made-up tags are hardly ever used, because it's unlikely that they will work reliably in every current browser, and every future browser.
A browser has to parse the HTML code into elements that it knows, so made-up tags will be converted into something else to fit into the document object model (DOM). As the web standards don't cover how to handle everything that is outside of the standards, web browsers tend to handle non-standard code in different ways.
Web development is tricky enough with a bunch of different browsers that have their own quirks, without adding another element of uncertainty. The best bet is to stick with things that are actually in the standards; that is what the browser vendors try to follow, so it has the best chance of actually working.
I think made-up tags are just potentially more confusing or unclear than p's with IDs (some block of text generally). We all know a p with an ID is a paragraph, but who knows what made-up tags are intended for? At least that's my thought. :) Therefore this is more of a style / clarity issue than one of functionality.
Others have made excellent points, but it's worth noting that if you look at a framework such as AngularJS, there is a very valid case for custom elements and attributes. These not only convey better semantic meaning in the markup, but can also provide behavior, look and feel for the web page.
CSS is a style sheet language that can be used to present XML documents, not only (X)HTML documents. Your snippet with the made-up tags could be part of a legal XML document; it would be one if you enclose it in a single root element. Probably you already have a <html> ...</html> around it? Any current browser can display XML documents.
Of course it is not a very good XML document, it lacks a grammar and an XML declaration. If you use an HTML declaration header instead (and probably a server configuration that sends the correct mime type) it would instead be illegal HTML.
(X)HTML has advantages over plain XML as elements have a semantic meaning that is useful in the context of a web page presentation. Tools can work with this semantics, other developers know the meaning, it is less error prone and better to read.
But in other contexts it is better to use CSS with XML and/or XSLT to do the presentation. This is what you did. As this wasn't your task, you didn't know what you were doing, and HTML/CSS is the better way to go most of the time, you should stick to HTML/CSS in your scenario.
You should add an (X)HTML header to your document so tools can give you meaningful error messages.
...I simply change all of my made up tags to paragraphs with ID's.
I actually take issue with his suggestion of how to do it properly.
A <p> tag is for paragraphs. I see people using it all the time instead of a div -- simply for spacing purposes or because it seems gentler. If it's not a paragraph, don't use it.
You don't need or want to stick IDs on everything unless you need to target it specifically (e.g. with JavaScript). Use classes or just a straight-up div.
From its early days CSS was designed to be markup agnostic, so it can be used with any markup language that produces tree-like DOM structures (SVG, for example). Any tag that complies with the name token production is perfectly valid in CSS. So your question is about HTML rather than CSS itself.
Elements with custom tags are supported by the HTML5 specification. HTML5 standardizes how unknown elements must be parsed into the DOM. So, strictly speaking, HTML5 is the first HTML specification that enables custom elements. You just need to use the HTML5 doctype <!DOCTYPE html> in your document.
As for custom tag names themselves...
This document http://www.w3.org/TR/custom-elements/ recommends that the custom tag names you choose contain at least one '-' (dash) symbol. This way they will not conflict with future HTML elements. Therefore you'd better change your doc to something like this:
<style>
so-cool {
color:blue;
}
</style>
<body>
<so-cool>HELLO</so-cool>
</body>
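The naming advice above can be checked mechanically. The Python sketch below is a simplified approximation, not the specification's full grammar: it accepts only lowercase ASCII names, while the specification also permits a range of non-ASCII characters. The function name and the regular expression are assumptions made for this example; the reserved names are the hyphenated element names already used by SVG and MathML.

```python
import re

# Hyphenated names already taken by SVG and MathML, which the custom
# elements specification excludes.
RESERVED_NAMES = {
    "annotation-xml", "color-profile", "font-face", "font-face-src",
    "font-face-uri", "font-face-format", "font-face-name", "missing-glyph",
}

# Approximation: lowercase ASCII letter first, at least one '-',
# then lowercase letters, digits, '.', '_' or '-'.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9._]*-[a-z0-9._-]*")


def looks_like_custom_element_name(name):
    """Return True if `name` fits the simplified custom element naming rule."""
    return name not in RESERVED_NAMES and NAME_PATTERN.fullmatch(name) is not None


so_cool_ok = looks_like_custom_element_name("so-cool")
imsocool_ok = looks_like_custom_element_name("imsocool")
font_face_ok = looks_like_custom_element_name("font-face")
```

Under this check `so-cool` passes, while `imsocool` fails for lack of a dash and `font-face` fails because the name is reserved.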
Surprisingly, nobody (including my past self) mentioned accessibility. Another reason to use valid tags instead of custom ones is compatibility with the widest range of software, including screen readers and other tools that people need for accessibility purposes. Moreover, accessibility laws and guidelines such as those from the W3C's WAI require accessible websites, which generally means using valid markup.
Apparently nobody mentioned it, so I will.
This is a by-product of browser wars.
Back in the 1990s, when the Internet was first starting to go mainstream, competition increased in the browser market. To stay competitive and draw users, some browsers (most notably Internet Explorer) tried to be helpful and “user-friendly” by attempting to figure out what page designers meant, and thus allowed markup that is incorrect (e.g., <b><i>foobar</b></i> would correctly render as bold-italics).
This made sense to some degree because if one browser kept complaining about syntax errors while another ate anything you threw at it and spit out a (more-or-less) correct result, then people would naturally flock to the latter.
While many thought the browser wars were over, a new war between browser vendors has reignited in the past few years since Chrome was released, Apple started growing again and pushing Safari, and IE lost its dominance. (You could call it a “cold war” due to the perceived cooperation and support of standards by browser vendors.) Therefore, it is not a surprise that even contemporary browsers which supposedly conform strictly to web standards actually try to be “clever” and allow standard-breaking behavior such as this in order to try to gain an advantage as before.
Unfortunately, this permissive behavior led to a massive (some might even say cancerous) growth of poorly marked up webpages. Because IE was the most lenient and popular browser, and due to Microsoft’s continued flouting of standards, IE became infamous for encouraging and promoting bad design and propagating and perpetuating broken pages.
You may be able to get away with using quirks and exploits like that on some browsers for now, but other than the occasional puzzle or game or something, you should always stick to web standards when creating web pages and sites to ensure they display correctly and avoid them becoming broken (possibly completely ignored) with a browser update.
While browsers will generally relate CSS to HTML tags regardless of whether or not they are valid, you should ABSOLUTELY NOT do this.
There is technically nothing wrong with this from a CSS perspective. However, using made up tags is something you should NEVER do in HTML.
HTML is a markup language, which means that each tag corresponds to a specific type of information.
Your made-up tags don't correspond to any type of information. This will create problems for web crawlers, such as Google.
Read more information on the importance of correct markup.
Edit
Divs refer to groups of multiple related elements that are meant to be displayed in block form and can be manipulated as such.
Spans refer to elements that are to be styled differently from the context they are currently in and are meant to be displayed inline, not as a block. An example is a few words in a sentence that need to be all caps.
Custom tags do not correlate to any standards, and thus span/div should be used with class/ID properties instead.
There are very specific exceptions to this, such as AngularJS.
Although CSS has a thing called a "tag selector," it doesn't actually know what a tag is. That's left for the document's language to define. CSS was designed to be used not just with HTML, but also with XML, where (assuming you're not using a DTD or other validation scheme) the tags can be just about anything. You could use it with other languages too, though you would need to come up with your own semantics for exactly what things like "tags" and "attributes" correspond to.
Browsers generally apply CSS to unknown tags in HTML, because this is considered better than breaking completely: at least they can display something. But it is very bad practice to use "fake" tags deliberately. One reason for this is that new tags do get defined from time to time, and if one is defined that looks sort of like your fake tag but doesn't quite work the same way, that can cause problems with your site on new browsers.
Why does CSS work with fake elements? Because it doesn't hurt anyone because you're not supposed to use them anyways.
Why doesn't my professor want me to use made-up elements? Because if that element is defined by a specification in the future your element will have an unpredictable behavior.
Also, why didn't he know that made-up elements exist and work with CSS? Are they uncommon? Because he, like most other web developers, understands that we shouldn't use things that might break randomly in the future.

Parse HTML with CSS or XPath selectors?

My goal is to parse HTML with lxml, which supports both XPath and CSS selectors.
I can tie my model properties either to CSS or XPath, but I'm not sure which one would be the best, e.g. less fuss when HTML layout is changed, simpler expressions, greater extraction speed.
What would you choose in such a situation?
Which are you more comfortable with? Most people tend to find CSS selectors easier, and if others will maintain your work, you should take this into account. One reason for this might be that there's less worrying about XML namespaces, which are the source of many a bug. CSS selectors tend to be more compact than the equivalent XPath, but only you can decide whether that's a relevant factor or not. I would note that it's not an accident that jQuery's selection language is modelled on CSS selectors and not on XPath.
On the other hand, XPath is a more expressive language for general DOM manipulation. For example, there's no CSS selector equivalent of the "parent" or "ancestor" axes, nor is there a way to directly address text nodes equivalent to "text()" in XPath. In contrast, I can't think of any DOM path that can be expressed in CSS selectors but not in XPath, although E[foo~="warning"] and E[lang|="en"] are distinctly tricky in XPath.
What CSS selectors do have that XPath doesn't are pseudo-classes, though if you're doing server side DOM manipulation, these are not likely to be of use to you.
As for which results in greater extraction speed, I don't know lxml, but I would expect equivalent paths to have very similar performance characteristics.
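As a rough illustration of the child-to-parent step mentioned above, here is a minimal sketch. It uses Python's standard-library xml.etree.ElementTree, which implements only a small XPath subset, rather than lxml. The markup, ids and class names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup used only for illustration.
doc = ET.fromstring(
    "<html><body>"
    "<div id='a'><span class='price'>10</span></div>"
    "<div id='b'><span class='name'>x</span></div>"
    "</body></html>"
)

# Child-to-parent step: select the parent of every span with class "price".
# CSS selectors have no equivalent of this upward move.
parents = doc.findall(".//span[@class='price']/..")
parent_ids = [el.get("id") for el in parents]

# ElementTree's XPath subset has no text() node test, so the text is read
# from the matched element instead of being selected by the path itself.
prices = [el.text for el in doc.findall(".//span[@class='price']")]

print(parent_ids, prices)
```

With a full XPath 1.0 engine the same idea would normally be written with the parent or ancestor axes and text(); the subset above is only meant to show the direction of travel.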

CSS semantics; selecting elements directly or via order

Perhaps this question has been asked elsewhere, but I'm unable to find it. With HTML5 and CSS3 modules inching closer, I'm getting interested in a discussion about the way we write CSS.
Something like this where selection is done via element order and pseudo-classes is particularly fascinating. The big advantage to this method seems to be complete modularization of HTML and CSS to make tweaks and redesigns simpler.
At the same time, semantic IDs and classes seem advantageous for sundry reasons. Particularly, direct linking, JS targeting, and shorter CSS selectors. Also, it seems selector length might be an issue. For instance, I just wrote the following, which would be admittedly easier using some semantic HTML5 elements:
body>div:nth-child(2)>div:nth-child(2)>ul:nth-child(2)>li:last-child
So what say you, Stack Overflow? Is the future of CSS writing focused on element order and pseudo-classes? Or are IDs and classes and the current ways here to stay?
(I'm well aware the IDs and classes have their place, although I am interested to hear more ways you think they'll continue to be necessary. I don't want to misrepresent this or frame it as "Are pseudo-classes ID killers?" The discussion I'm interested in is bigger-picture and the ways writing CSS is changing.)
I think that's an unreadable abomination which will mysteriously stop working when the HTML changes.
Order-based selectors are completely non-self-documenting.
If someone else takes over the project, and the HTML changes, he will have no idea what the selector is supposed to select, and will be hard-pressed to fix it correctly.
This is especially important if any part of the HTML is automatically generated.
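The fragility described above can be sketched in a few lines. This is only an illustration under assumptions: it uses Python's standard-library xml.etree.ElementTree with positional path steps as a stand-in for order-based CSS selectors (the two are not exact equivalents), and the markup and the "last-item" class are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page before and after someone inserts a banner div.
before = ET.fromstring(
    "<body><div/><div><ul><li>a</li><li class='last-item'>b</li></ul></div></body>"
)
after = ET.fromstring(
    "<body><div/><div class='banner'/>"
    "<div><ul><li>a</li><li class='last-item'>b</li></ul></div></body>"
)

by_position = "./div[2]/ul/li[last()]"   # depends on element order
by_class = ".//li[@class='last-item']"   # depends on a semantic marker

found_before = before.find(by_position)  # matches the intended <li>
found_after = after.find(by_position)    # second div is now the banner
stable = after.find(by_class)            # still matches after the change

print(found_before.text, found_after, stable.text)
```

The positional path silently stops matching once the layout shifts, while the class-based path keeps working and also says what it is meant to select.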

Writing Efficient CSS

Ok so in another question something was being discussed, and this link was mentioned:
https://developer.mozilla.org/en/Writing_Efficient_CSS
In that article, they say some things I didn't know, but before I ask about them, I should ask this... Does that apply to CSS interpreted by Firefox? Forgive my noobness, but I wasn't sure what they meant by Mozilla UI. (don't hurt me!)
If it does apply, when they say:
Avoid the descendant selector!
The descendant selector is the most expensive selector in CSS. It is dreadfully expensive, especially if a rule using the selector is in the tag or universal category. Frequently what is really desired is the child selector. The use of the descendant selector is banned in UI CSS without the explicit approval of your skin's module owner.
* BAD - treehead treerow treecell { }
* BETTER, BUT STILL BAD (see next guideline) - treehead > treerow > treecell { }
The descendant selector is just a space? And then what would the difference be between child and descendant? Child is an element inside another, but isn't that the same as descendant? As I'm writing I think I might have figured it out. A descendant could be a child/grandchild/great-grandchild/etc? And child is only one deep?
Sorry again for the stupid level of my question... just wondering, because I have been constantly using descendants in my CSS for my site. But yeah, if this isn't about Firefox then this whole question is pointless...
If it's not about Firefox, does anyone have a link to an article explaining efficiency for Firefox or browsers in general?
A descendant could be a child/grandchild/great-grandchild/etc? And child is only one deep?
Yes, exactly. Since a child can only be one deep, there's a much smaller space that the rendering engine has to recursively search to check if the rule matches or not.
And yes, that article is about both Firefox and browsers in general. Most (all?) of what is in it applies to any page rendering engine.
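To make the child/descendant distinction concrete, here is a small sketch. It assumes Python's standard-library xml.etree.ElementTree and a made-up XUL-like fragment; the "./" and ".//" path steps are used as rough analogues of the CSS ">" and space combinators, not as how a browser actually matches rules.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one treecell directly under treerow,
# and one nested a level deeper inside a box.
tree = ET.fromstring(
    "<treehead><treerow><treecell/><box><treecell/></box></treerow></treehead>"
)
row = tree.find("treerow")

# Roughly "treerow > treecell": only direct children are considered.
children = row.findall("./treecell")

# Roughly "treerow treecell": children, grandchildren and so on.
descendants = row.findall(".//treecell")

print(len(children), len(descendants))
```

The child form only has to look one level down, whereas the descendant form may have to consider the whole subtree.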
First of all - the suggestions in this article are not for html pages - they are specifically for the Mozilla UI - XUL, so it may be best practice for XUL, but not for html.
Applying the CSS on an average HTML page is one of the quickest things that happen while loading the page.
Also, the article may suggest the fastest way to apply css rules, but at what cost? For example, they suggest not having more than one class per rule:
BAD - .treecell.indented { }
GOOD - .treecell-indented { }
That is almost outrageous. It may lead to quicker CSS, but who cares? Assuming you already have .treecell and .indented, following these suggestions leads to complicated logic, harder maintenance, duplicated CSS rules, harder JavaScript (which costs a lot more than CSS), etc.
They suggest not using the full richness of CSS selectors and replacing these selectors with flat classes, which is a shame.
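For readers unsure what the two forms above actually match, here is a minimal sketch of the semantics only. It is written in Python with the standard-library xml.etree.ElementTree; has_classes is a hypothetical helper, not part of any library, and the markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical elements using the class names from the quoted rules.
doc = ET.fromstring(
    "<tree>"
    "<treecell class='treecell indented'/>"
    "<treecell class='treecell'/>"
    "<treecell class='treecell-indented'/>"
    "</tree>"
)

def has_classes(el, *names):
    """Return True if every name is one of the element's class tokens."""
    tokens = set((el.get("class") or "").split())
    return set(names) <= tokens

# ".treecell.indented": both tokens must be present on the same element.
two_classes = [el for el in doc.iter() if has_classes(el, "treecell", "indented")]

# ".treecell-indented": one flat token, unrelated to the two above.
flat_class = [el for el in doc.iter() if has_classes(el, "treecell-indented")]

print(len(two_classes), len(flat_class))
```

The flat class needs a single token test per element, but every combination of existing classes then needs its own name and its own rules, which is the maintenance cost described above.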
...as I'm writing I think I might have figured it out. A descendant could be a child/grandchild/great-grandchild/etc? And child is only one deep?
Indeed.
One thing I can add on the efficiency side of things is: Don't use * unless you really mean it. It's pretty intensive as rules go and most people could get away just specifying the elements they really want to target.
A "parent > child" is only one step down, whereas an "ancestor descendant" could be one or more steps down.
Even better is to use "#id" selectors wherever possible, so that there is less DOM searching.
The UI CSS is for styling the internals of the browser - the settings dialog, extensions interfaces etc.
Descendants and children are different: child selectors are much more specific, so far fewer elements have to be considered.
The problem with the child selector is that it's not as well supported. Of course, this might've been fixed on newer IE browsers.
In any case, when writing CSS for a webpage it isn't going to be that big of a deal. I doubt the fractions of seconds you'd save in page load would even be noticed. This article seems more directed towards people writing stuff for the actual browser, not websites.
O'Reilly's "Even Faster Web Sites" has a whole chapter on this entitled "Simplifying CSS Selectors". It references your link on Mozilla.
I think two points are worth bearing in mind.
Yes, if you did this as far as possible, your HTML and CSS would be a mess of styles and possibly even more inefficient due to added file size. It is up to the developer to pick the best balance. Don't agonize over optimizing every line as you write it, get it working then see what can be beneficial.
As another commenter noted, it takes the browser milliseconds to figure out how to apply your styles on page load. However, where this can have a much bigger impact is with DHTML. Every time you change the DOM, the browser re-applies your whole style sheet to the page. In this scenario many inefficient selectors could make a visible impact on your page (perceived lagginess/unresponsiveness).
The documentation for Google's Page Speed (a Firefox/Firebug add-on) includes a good page on efficient CSS.