Parse HTML with CSS or XPath selectors? - html

My goal is to parse HTML with lxml, which supports both XPath and CSS selectors.
I can tie my model properties either to CSS or XPath, but I'm not sure which one would be the best, e.g. less fuss when HTML layout is changed, simpler expressions, greater extraction speed.
What would you choose in such a situation?

Which are you more comfortable with? Most people tend to find CSS selectors easier and if others will maintain your work, you should take this into account. One reason for this might be that there's less worrying about XML namespaces which are the source of many a bug. CSS selectors tend to be more compact than the equivalent XPath, but only you can decide whether that's relevant factor or not. I would note that it's not an accident that jquery's selection language is modelled on CSS selectors and not on XPath.
On the other hand, XPath is a more expressive language for general DOM manipulation. For example, there's no CSS selector equivalent of the "parent" or "ancestor" axes, nor is there a way to directly address text nodes equivalent to "text()" in XPath. In contrast, I can't think of any DOM path that can be expressed in CSS selectors but not in XPath, although E[foo~="warning"] and E[lang|="en"] are distinctly tricky in XPath.
What CSS selectors do have that XPath doesn't are pseudo-classes, though if you're doing server side DOM manipulation, these are not likely to be of use to you.
As for which results in greater extraction speed, I don't know lxml, but I would expect equivalent paths to have very similar performance characteristics.

Related

What are the performance of every selector's types?

In CodeceptJs, you have different way to select elements and manipulate them for e2e testing :
CSS selector
Xpath selector
Semantic locator
Locator builder
But I have some questions about their performance. Which are the best locator type ?
What is the ratio between them ?
Currently, I use Locator builder but I don't know if they're efficient like CSS selector or ID selector.
There are no performance metrics between type of locators for CodeceptJS, AFAIK.
About locator types
Locator builder just creates xpath locator.
CSS and xpath locator's has similar logic for helpers (WebDriver, Puppeteer, Playwright, etc).
Difference in speed appears in browser Driver level.
AFAIK, IE11 with WebDriver works faster with xpath, but Chrome and Firefox with css.
For Nightmare, for Puppeteer and others it may differ.
For semantic locators - it depends on type of locator.
It tries to find element with one strategy, than with others.
So, technically, there are high probability, that semantic locators will be slower.
CodeceptJS tip
And I don't know how really do you need performance tricks, but I know one tip (works for current and older versions, for <= 2.5.0):
If you specify type of selector, like I.click({ css: "button#some-selector" });, it will be a bit faster, than use of none specified selector I.click("button#some-selector") due to implementation of semantic (fuzzy) locators.
For non described locator type, Codecept will try to "guess" what type of locator do you use, and, in some cases, it will run same as semantic locator - with search a couple of times with different strategies.
UPD added info about locator types

What happened to the "Use efficient CSS selectors" rule?

There was a recommendation by Google PageSpeed that asked web developers to Use efficient CSS selectors:
Avoiding inefficient key selectors that match large numbers of
elements can speed up page rendering.
Details
As the browser parses HTML, it constructs an internal document tree
representing all the elements to be displayed. It then matches
elements to styles specified in various stylesheets, according to the
standard CSS cascade, inheritance, and ordering rules. In Mozilla's
implementation (and probably others as well), for each element, the
CSS engine searches through style rules to find a match. The engine
evaluates each rule from right to left, starting from the rightmost
selector (called the "key") and moving through each selector until it
finds a match or discards the rule. (The "selector" is the document
element to which the rule should apply.)
According to this system, the fewer rules the engine has to evaluate
the better. [...]. After that, for pages that contain large numbers of
elements and/or large numbers of CSS rules, optimizing the definitions
of the rules themselves can enhance performance as well. The key to
optimizing rules lies in defining rules that are as specific as
possible and that avoid unnecessary redundancy, to allow the style
engine to quickly find matches without spending time evaluating rules
that don't apply.
This recommendation has been removed from current Page Speed Insights rules. Now I am wondering why this rule was removed. Did browsers get efficient at matching CSS rules in the meantime? And is this recommendation valid anymore?
In Feb 2011, Webkit core developer Antti Koivisto made several improvements to CSS selector performance in Webkit.
Antti Koivisto taught the CSS Style Selector to skip over sibling selectors and faster sorting, which bring some minor improvements, after which he landed two more awesome patches: one which enables ancestor identifier filtering for tree building, halving the remaining time in style matching over a typical page load, and a fast path for simple selectors that speed up matching up another 50% on some websites.
CSS Selector Performance has changed! (For the better) by Nicole Sullivan runs through these improvements in greater detail. In summary -
According to Antti, direct and indirect adjacent combinators can still be slow, however, ancestor filters and rule hashes can lower the impact as those selectors will only rarely be matched. He also says that there is still a lot of room for webkit to optimize pseudo classes and elements, but regardless they are much faster than trying to do the same thing with JavaScript and DOM manipulations. In fact, though there is still room for improvement, he says:
“Used in moderation pretty much everything will perform just fine from the style matching perspective.”
While browsers are much faster at matching CSS selectors, it's worth reiterating that CSS selectors should still be optimised (eg. kept as 'flat' as possible) to reduce file sizes and avoid specificity issues.
Here's a thorough article (which is dated early 2014)
I am quoting Benjamin Poulain, a WebKit Engineer who had a lot to say about the CSS selectors performance test:
~10% of the time is spent in the rasterizer. ~21% of the time is spent
on the first layout. ~48% of the time is spent in the parser and DOM
tree creation ~8% is spent on style resolution ~5% is spent on
collecting the style – this is what we should be testing and what
should take most of the time. (The remaining time is spread over many
many little functions)
And he continues:
“I completely agree it is useless to optimize selectors upfront, but
for completely different reasons:
It is practically impossible to predict the final performance impact
of a given selector by just examining the selectors. In the engine,
selectors are reordered, split, collected and compiled. To know the
final performance of a given selectors, you would have to know in
which bucket the selector was collected, how it is compiled, and
finally what does the DOM tree looks like.
All of that is very different between the various engines, making the
whole process even less predictable.
The second argument I have against web developers optimizing selectors
is that they will likely make things worse. The amount of
misinformation about selectors is larger than correct cross-browser
information. The chance of someone doing the right thing is pretty
low.
In practice, people discover performance problems with CSS and start
removing rules one by one until the problem go away. I think that is
the right way to go about this, it is easy and will lead to correct
outcome.”
There are approaches, like BEM for example, which models the CSS as flat as possible, to minimize DOM hierarchy dependency and to decouple web components so they could be "moved" across the DOM and work regardless.
Maybe because doing CSS for CMSes or frameworks is more common now and it's hard then to avoid using general CSS selectors. This to limit the complexity of the stylesheet.
Also, modern browsers are really fast at rendering CSS. Even with huge stylesheets on IE9, it did not feel like the rendering was slow. (I must admit I tested on a good computer. Maybe there are benchmarks out there).
Anyway, I think you must write very inefficient CSS to slow down Chrome or Firefox...
There's a 2 years old post on performance # Which CSS selectors or rules can significantly affect front-end layout / rendering performance in the real world?
I like his one-liner conclusion : Anything within the limits of "yeah, this CSS makes sense" is okay.

Differences in query algorithms between XPath and CSS

I'm wondering why someone would want to use CSS selectors rather than XPath selectors, or vice-versa, if he could use either one. I think that understanding the algorithms that process the languages will resolve my wonder.
There's a lot of documentation on XPath and CSS selectors individually, but I've found very few comparisons. Also, I don't use CSS selectors that much.
Here's what I've read about the differences. (These three references discuss the use of XPath and CSS selectors in Selenium to query HTML, but my wonder is general.)
XPath allows traversal from child to parent
CSS selectors have features specific to HTML
CSS selectors are faster when you're using Internet Explorer in Selenium
It looks like CSS selection algorithms are somehow optimized for HTML, but I don't know how.
Is there a paper on how CSS and XPath query algorithms work and how they differ?
Are there other abstract differences between the languages that I'm missing?
The main difference is in how stable is the document structure you target:
XPath is a good query language when the structure matters and/or is stable. You usually specify path, conditions, exact offset... it is also a good query language to retrieve a set of similar objects and because of that, it has an intimate relationship with XQuery. Here the document has a stable structure and you must retrieve repeated/similar sections
CSS selectors suits better CSS stylesheets. These do not care about the document structure because this changes a lot. Think of one CSS stylesheet applied to all the HTML pages of a website. The content and structure of every page is different. Here CSS selectors are better because of that changing structure. You will notice that access is more tag based. Most CSS syntax specify a set of elements, attributes, id, classes... and not so much their structure. Here you must locate sections that do not have a clear location within a document structure but are marked with certain attributes.
Update: After a closer look to your question I realized that you are more interested in the current implementation, not the nature of the the query languages. In that case I cannot give you the answer you are looking for. I can only suppose that the reason is still that one is more dependent on the structure than the other.
For example, in XPath you must keep track of the structure of the document you are working on. On the other hand CSS selectors are triggered when a specific tag shows up, and it usually does not matter what came before it. I can imagine that it will be much easier to implement a CSS selector algorithm that work as you read a document, while XPath has more cases where you really need the full document and/or strict track of what it is reading (because the history and background of what you are reading is more important)
Now, do not take me too serious on my update. I am only guessing here because I had some background on language parsing, but I actually do not have experience with the ones designed for data querying.

Use CSS selectors to collect HTML elements from a streaming parser (e.g. SAX stream)

How to parse CSS (CSS3) selector and use it (in jQuery-like way) to collect HTML elements not from DOM (from tree structure), but from stream (e.g. SAX), i.e. using sequential access event based parser?
By the way, are there any CSS selectors (or their combination) that need access to DOM (Wikipedia SAX page says that XPath selectors "need to be able to access any node at any time in the parsed XML tree")?
I am most interested in implementing selector combinators, e.g. 'A B' descendant selector.
I prefer solutions describing algorithm, or in Perl (for HTML::Zoom).
I would do it with regular expressions.
First, convert the selector into a regular expression that matches a simple top-to-bottom list of opening tags representing a given parser stack state. To explain, here are some simple selectors and their corresponding regexen:
A becomes /<A[^>]*>$/
A#someid becomes /<A[^>]*id="someid"[^>]*>$/
A.someclass becomes /<A[^>]*class="[^"]*(?<= |")someclass(?= |")[^"]*"[^>]*>$/
A > B becomes /<A[^>]*><B[^>]*>$/
A B becomes /<A[^>]*>(?:<[^>]*>)*<B[^>]*>$/
And so on. Note that the regular expressions all end with $, but do not start with ^; this corresponds with the way CSS selectors do not have to match from the root of the document. Also note that there is some lookbehind and lookahead stuff in the class matching code, which is necessary so that you don't accidentally match against "someclass-super-duper" when you want the quite distinct class "someclass".
If you need more examples, please let me know.
Once you've constructed the selector regex, you're ready to begin parsing. As you parse, maintain a stack of tags which currently apply; update this stack whenever you descend or ascend. To check for selector matching, convert that stack to a list of tags which can match the regular expression. For example, consider this document:
<x><a>Stuff goes here</a><y id="boo"><z class="bar">Content here</z></y></x>
Your stack state string would go through the following values in order as you enter each element:
<x>
<x><a>
<x><y id="boo">
<x><y id="boo"><z class="bar">
The matching process is simple: whenever the parser descends into a new element, update the state string and check if it matches the selector regex. If the regex matches, then the selector matches that element!
Issues to watch out for:
Double quotes inside attributes. To get around this, apply html entity encoding to attribute values when creating the regex, and to attribute values when creating the stack state string.
Attribute order. When building both the regex and the state string, use some canonical order for the attributes (alphabetical is easiest). Otherwise, you might find that your regex for the selector a#someid.someclass which expects <a id="someid" class="someclass"> unfortunately fails when your parser goes into <a class="someclass" id="someid">.
Case sensitivity. According to the HTML spec, the class and id attributes match case sensitively (notice the 'CS' marker on the corresponding sections). So, you must use case-sensitive regex matching. However, in HTML, element names are not case sensitive, although they are in XML. If you want HTML-like case-insensitive element name matching, then canonicalize element names to either upper case or lower case in both the selector regex and the state stack string.
Additional magic is necessary to deal with the selector patterns that involve presence or absence of element siblings, namely A:first-child and A + B. You might accomplish these by adding a special attribute to the tag containing the name of the tag immediately prior, or "" if this tag is the first child. There's also the general sibling selector, A ~ B; I'm not quite sure how to deal with that one.
EDIT: If you dislike regular expression hackery, you can still use this approach to solve the problem, only using your own state machine instead of the regex engine. Specifically, a CSS selector can be implemented as a nondeterministic finite state machine, which is an intimidating-sounding term, but just means the following in practical terms:
There might be more than one possible transition from any given state
The machine tries one of them, and if that doesn't work out, then it backtracks and tries the other
The easiest way to implement this is to keep a stack for the machine, which you push onto whenever you follow a path and pop from whenever you need to backtrack. It comes down to the same sort of thing you'd use to do a depth-first search.
The secret behind nearly all of the awesomeness of regular expressions is in its use of this style of state machine.
Check out nokogiri. From their page:
Nokogiri is an HTML, XML, SAX, and Reader parser. Among Nokogiri’s many features is the ability to search documents via XPath or CSS3 selectors.".
It's in Ruby, but you said you wanted an algorithm and Ruby is great for reading. Or just call it from whatever you are working in.
What does a browser do to create the DOM out of a stream? I guess there lies the answer to your question because it must be storing the discovered elements in a form which facilitates a CSS selector query. If you can afford reading the source code for an open source browser parser, then I think you can reuse it.
I would not do so, honestly. Rather I would reuse an existing SAX based parser (maybe you rewrite another with perl), and that go through the entire string.
When handlers fire, use them to construct an in memory database for elements. Create a virtual "table" for every element with it's #number [for references] , tagName, parent #number, next #number and its opening tag char offset in the source mother string.
As well, create a table for every attribute ever found, and fill it with a record for each tag with that attribute value.
Now it's all about the process of creating a database, tables and indices.

CSS semantics; selecting elements directly or via order

Perhaps this question has been asked elsewhere, but I'm unable to find it. With HTML5 and CSS3 modules inching closer, I'm getting interested in a discussion about the way we write CSS.
Something like this where selection is done via element order and pseudo-classes is particularly fascinating. The big advantage to this method seems to be complete modularization of HTML and CSS to make tweaks and redesigns simpler.
At the same time, semantic IDs and classes seem advantageous for sundry reasons. Particularly, direct linking, JS targeting, and shorter CSS selectors. Also, it seems selector length might be an issue. For instance, I just wrote the following, which would be admittedly easier using some semantic HTML5 elements:
body>div:nth-child(2)>div:nth-child(2)>ul:nth-child(2)>li:last-child
So what say you, Stack Overflow? Is the future of CSS writing focused on element order and pseudo-classes? Or are IDs and classes and the current ways here to stay?
(I'm well aware the IDs and classes have their place, although I am interested to hear more ways you think they'll continue to be necessary. I don't want to misrepresent this or frame it as "Are pseudo-classes ID killers?" The discussion I'm interested in is bigger-picture and the ways writing CSS is changing.)
I think that's an unreadable abomination which will mysteriously stop working when the HTML changes.
Order-based selectors are completely non-self-documenting.
If someone else takes over the project, and the HTML changes, he will have no idea what the selector is supposed to select, and will be hard-pressed to fix it correctly.
This is especially important if any part of the HTML is automatically generated.