Different elements with the same XPath? How and why can it happen? - html

I noticed that, even on a single page, two HTML elements with different CSS paths, placed in completely different page areas, have exactly the same XPath. How and why can this happen? Can somebody explain it to me?
Example: http://goo.gl/P4oZmW
First element: a select with the default value Standard.
<div class="list-sorting">
<select data-current-sorting="" name="sort" id="sort">
...</select>
</div>
XPath: /html/body/div[3]/div[1]/div[1]/div[2]
CSSPath: body > div.page > div.page-content > div.list-page-header > div.list-sorting
Second element: a text block at the bottom of the page
<div class="list mmkcontent">...</div>
XPath: /html/body/div[3]/div[1]/div[2]/div[2]
CSSPath: body > div.page > div.page-content > div.right-section > div.list.mmkcontent
I tried to get the XPath and CSS path with Chrome Dev Tools and with Firefox+Firebug: the XPath was the same everywhere. Only Firebug with its extension gave me the CSS path .list.mmkcontent for the second element, which I was finally able to use to accomplish my mission.
But I still don't understand how completely different elements can have the same XPath: an XPath should be the path from the top of the DOM tree down to the element... How can elements located in different places have the same path to them through the DOM tree?

that two HTML elements with different CSS paths, placed in completely different page areas, have exactly the same XPath. How and why can this happen?
Yes, this is quite easy to see. XPath is a flexible language and can select zero or more elements. CSS is also a flexible language and can select zero or more elements, but their syntax differs:
<p class="foo bar">
<div>test</div>
</p>
Here the two different CSS selectors .foo div and .bar div will select the same element. And these different XPath expressions, /p/div, /p[1]/div[1], /p[div]/*[1]/../div and /p[@class="foo bar"]/div all point to the same <div> element, but are very different.
There is a way in both CSS and in XPath to define an exact path. In CSS the only sure way is with #id syntax, assuming IDs are unique, or through nodename:nth-child(x) syntax. The syntax div.a > div.b is not guaranteed unique with CSS.
In XPath the usual way is /foo[x]/bar[y], which is unambiguous if x and y are numeric. Each such path will select one unique element, or nothing.
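A quick way to convince yourself of this in the browser console, as a minimal sketch using the first path from the question: a fully indexed XPath can match at most one node.
const result = document.evaluate(
  '/html/body/div[3]/div[1]/div[1]/div[2]',   // fully indexed path
  document,
  null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
  null
);
console.log(result.snapshotLength);    // 0 or 1, never more
console.log(result.snapshotItem(0));   // the single matching element, or null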
If I look at your question, you wrote:
/html/body/div[3]/div[1]/div[1]/div[2]
and
/html/body/div[3]/div[1]/div[2]/div[2]
as being the same XPath, but they are not. Also, they do not fully follow the foo[x]/bar[y] syntax (html and body carry no index), though I may assume that there is only one html and one body element, in which case it does not matter.
The first selects <div class="list-sorting"> on your page, the second selects <div class="list mmkcontent"> on your page.
But I still don't understand how completely different elements can have the same XPath: an XPath should be the path from the top of the DOM tree down to the element... How can elements located in different places have the same path to them through the DOM tree?
One XPath can select multiple elements, in which case you can argue that one XPath selects both. But you suggest that one XPath, which selects one element, selects a different element at another time, which isn't possible unless there is a bug in the XPath implementation, or the page is dynamic and changed between two invocations.
I didn't see that for your page. The XPaths are different, and the browsers (Chrome, Firefox) show the correct path.
When I try and select "Copy XPath" in the browsers, I get this:
/html/body/div[3]/div[1]/div[1]/div[2]
/html/body/div[3]/div[1]/div[2]/div[2]
Which are different XPaths.
Of course, if you used some plugin, or some other means of constructing the XPath dynamically, it is entirely possible that the plugin has a bug. But the XPaths you showed are different, and trying to reproduce your issue also yields different XPaths. Perhaps it was just a slight oversight?

Related

Add html element that is "invisible" or skipped by CSS selector rules

I want to build an external GUI that operates on a generic HTML piece that comes with associated CSS. In order to enable some functionalities of the GUI, I would need to create some "meta" HTML elements to contain parts of content and associate them with data.
Example:
<div id="root">
<foo:meta data-source="document:1111" data-xref="...">
sometext
<p class="quote">...</p>
</foo:meta>
<p class="other">...</p>
</div>
This HTML is auto-generated starting from already existing HTML that has associated CSS:
<div id="root">
sometext
<p class="quote">...</p>
<p class="other">...</p>
</div>
#root>p {
color:green;
}
#root>p+p {
color:red;
}
The problem is that adding the <foo:meta> element breaks CSS child and sibling selectors. I am looking for a way to keep the CSS selectors working when encapsulating content in this way. We have tried a foo\:meta { display: contents } style, but although it works in terms of hiding the meta element from the box renderer, it doesn't hide it from the selector matcher. We do not produce the HTML/CSS to be processed, so writing them in a certain way before processing is not an option. They come as they are: generic HTML documents with associated CSS.
Is there a way to achieve what we are looking for using HTML/CSS?
To restate, we are looking for a way to dynamically encapsulate parts of content in non-visual elements without breaking child and sibling CSS selectors. The elements should only be available to DOM traversal such as document.getElementsByTagName('foo:meta')
If I understood your problem correctly, I would suggest using a space (the descendant combinator) between the ancestor and the child instead of '>'. Also, note that #root is an id selector and not a class selector.
The '>' combinator you used only selects direct children, but using a space in between lets you select grandchildren too.
So what you have to do is this:
#root .quote {
color:green;
}
Let me know if this helped.
A working CSS example is here.
So, after much fiddling and research, we came to the conclusion that this can't be done, even with Shadow DOM, as even that would require massive CSS rewrites that might not preserve semantics.
However, for anyone stumbling upon this question, we came to the same end by employing the following (I'll be short, pointers only):
using two comments to mark where the tag would start/end, instead of an XML tag (eg. <!--<foo:bar data-source="1111">-->...content...<!--</foo:bar>-->)
these pointers work more or less like the markup equivalent of a DOM Range and they can work together with it.
this approach has the interesting advantage (as opposed to a single node) that it can start and end in different nodes, so it can span subtrees.
But this also breaks the XML structure when you try to recompose it. It's also quite easy, through manipulation, to end up with the range end moving before the range start, multiple ranges overlapping, etc.
In order to recompose it (to send to a downstream XML processor or NoSQL XML database for cross-referencing), we need to make sure we avoid the XML-breaking manipulations described above; then one only needs to convert the encapsulated tags back to regular tags by string manipulation on the document's (X)HTML (innerHTML, outerHTML, XMLSerializer) to get clean XML which can be mined and cross-referenced for content.
We used the TreeWalker API to scan the document for comments; you might need it, although scanning the document for comments this way can be slow (it works for us, though). If you are bolder you can try XPath, i.e. document.evaluate('//comment()', document); it seems to work, but we don't trust all browsers to comply.
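For reference, a minimal sketch of that comment scan with a TreeWalker, using the <foo:bar> markers from the example above:
const walker = document.createTreeWalker(
  document.body,
  NodeFilter.SHOW_COMMENT,   // visit comment nodes only
  null
);
const markers = [];
while (walker.nextNode()) {
  const text = walker.currentNode.nodeValue.trim();
  // collect start/end marker comments, e.g. <!--<foo:bar data-source="1111">--> and <!--</foo:bar>-->
  if (text.startsWith('<foo:bar') || text.startsWith('</foo:bar')) {
    markers.push(walker.currentNode);
  }
}
// markers now holds the delimiting comment nodes, ready to be paired into DOM Ranges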

Extracting entire CSS affecting a DIV on a web page

I wish to extract the entire CSS affecting a div that is highlighted. With complex designs the CSS is made up of many classes, each adding some styles. I'm looking for a technique that can strip out and perhaps concatenate all of these together, like Inspect Element but cleaner.
For example, on this Adobe Experience Design page (http://www.adobe.com/uk/products/experience-design.html), I wish to select the article div "A new experience in user experience." and then pull out all the CSS affecting everything inside it that is attached to a class.
There is an ExtractCSS tool that does something similar, but I'm looking for something a bit more intuitive, one that also ignores all the struck-through (overridden) rules.
The simplest way is:
Select your element in the developer tools
Run window.getComputedStyle($0).cssText on the Js console
where $0 represents the currently selected DOM element.
Alternatively, if you want to target a specific element with a given class, do
window.getComputedStyle( document.getElementsByClassName('hero2-align-2 hero2-basis-0')[0] ).cssText
Querying elements by class name might return more than one element; the [0] is there to guarantee only the first one is processed.
Or by id
window.getComputedStyle( document.getElementById('yourID') ).cssText
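The snippets above cover a single element. For the "everything inside it" part of the question, here is a rough sketch along the same lines, run in the Chromium dev tools console with the target div selected as $0 (note that cssText on a computed style is Chrome-specific and may be empty in other browsers):
const parts = [];
for (const el of [$0, ...$0.querySelectorAll('*')]) {
  // label each block with the element's tag and classes for readability
  const label = el.tagName.toLowerCase() +
    (el.classList.length ? '.' + [...el.classList].join('.') : '');
  parts.push('/* ' + label + ' */\n' + window.getComputedStyle(el).cssText);
}
console.log(parts.join('\n\n'));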

Using CSS selectors to get first child from a nested element

I am trying to understand CSS selectors better and am fiddling around with Google/Gmail. When you go to Google's home page and enter "gmail", it will automatically present you with search results for that term. I want to write a CSS selector that will find the first one (that is, the link to Gmail, since it should always be the first result). The HTML for these results looks like:
<div class="srg">
<div class="g">
<h3 class="r">
Gmail - Google
...
Based on what I could gather from the W3schools CSS docs, it seems like I want the first <a> child of a class named r, or:
h3.r a:first-child
However, the tool I'm using doesn't recognize this as the first link. So I ask: is this a correct selector for the Gmail (first) link, or did I go awry somewhere?
Well, the anchor element you're referring to is the only child of the h3.r parent.
So :first-child, :last-child and :only-child would all apply.
A simple h3.r > a (child selector) or h3.r a (descendant selector) should suffice, assuming it's unique in the document.
Your selector – h3.r a:first-child – should, technically speaking, work as well.
Based on the markup in the original screenshot (an anchor carrying a data-href attribute), an attribute selector may also work:
h3.r a[data-href="https://mail.google.com/"]
More information: https://www.w3.org/TR/css3-selectors/#selectors
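If you want to sanity-check the selector outside of your tool, here is a quick console sketch (assuming Google's result markup still wraps each hit in h3.r):
const firstLink = document.querySelector('h3.r > a');   // first match in document order
console.log(firstLink && firstLink.href);               // expected: the Gmail result's URL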
Within Geb, you can also use
$("h3.r").find("a")[0]
to select the first child.
Using :first-of-type is very similar to :nth-child, but there is a critical difference: it is less specific.
In the example that article gives (an <article> whose first child is not a <p>), if we had used p:nth-child(1), nothing would match, because the paragraph is not the first child of its parent (the <article>). This reveals the power of :first-of-type: it targets a particular type of element in a particular arrangement relative to similar siblings, not all siblings.
Reference: https://css-tricks.com/almanac/selectors/f/first-of-type/
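To see the difference the quoted passage describes, here is a small sketch with markup along the lines of that example (an <article> whose first child is not a <p>):
const article = document.createElement('article');
article.innerHTML = '<div>intro</div><p>first paragraph</p>';
const p = article.querySelector('p');
console.log(p.matches('p:nth-child(1)'));    // false: the <div> is the first child
console.log(p.matches('p:first-of-type'));   // true: it is the first <p> among its siblings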

How can I find a certain element that comes right after another element with Capybara?

I am trying to learn Capybara for a scraping task I have. I have heretofore only used it for testing. There are a million things I want to learn, but at the most basic level I want to know: how do I find a certain element that is a sibling of, and comes after, another element that I am able to find?
Take a page like this:
<body>
<h3>Name1</h3>
<table>
...
</table>
<h3>Name2</h3>
<table>
...
</table>
<h3>Name3</h3>
<table>
...
</table>
</body>
I want to return the <table> element that comes after the <h3> element having text Name2.
I know how to loop through elements with all, and I know how to use first instead of find, but I don't know how to "Find the first element X following specific element Y".
CSS
In CSS you could use a sibling selector. These allow you to select sibling elements, i.e. those at the same nesting level with the same parent element. There are two types of sibling selectors (sketched just below):
'+' the adjacent sibling selector
'~' the general sibling selector (adjacent or non-adjacent siblings)
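A quick browser-console sketch of the two, assuming the <h3>/<table> structure from the question:
document.querySelector('h3 + table');     // the first <table> that immediately follows an <h3>
document.querySelectorAll('h3 ~ table');  // every <table> preceded by an <h3> sibling (all three here)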
It's usually ideal to avoid matching by text whenever possible. (This makes your specs easier to write and also means that textual changes are less likely to break your specs.) In that ideal world your 'h3' elements might have IDs on them and we could just:
find('h3#name2+table')
However, in your example they don't have IDs so let's connect a couple of queries to scope to what we want.
find('h3', text: 'Name2').find('+table')
First we found the correct 'h3' element (using text matching) and then with that query as a basis we request the sibling 'table' element.
You may also note that if you used the general sibling selector '~' you would get an ambiguous element error; Capybara found all the 'table' elements rather than just the adjacent one.
XPath
Sometimes XPath is actually easier to use if you're really forced to do textual element selection. So you could instead:
find(:xpath, "//h3[contains(text(),'Name2')]/following-sibling::table")
More difficult to read, but it does the same thing: first find an 'h3' with text 'Name2', then select its sibling 'table' element.
Capybara now has sibling and ancestor finders (Capybara >= 2.15.0)
[Updated 2018/09]
As @trueunlessfalse pointed out in the comments, the original answer using the sibling finder will return an ambiguous match error if there are multiple matches, so please consider using XPath in that case.
The following code, using XPath, will return what the OP wanted:
find('h3', text: 'Name2').first(:xpath, './following-sibling::table')
The following code, on the other hand, will raise an ambiguous match error (every 'table' on the page is a sibling of the matched 'h3'):
find('h3', text: 'Name2').sibling('table')
You can check the details here:
https://www.rubydoc.info/github/jnicklas/capybara/Capybara/Node/Finders:sibling
Using
find('h3#name2+table')
didn't work for me. I needed to add :css, then it worked:
find(:css, 'h3#name2+table')

Proper xpath specification between two children within the same parent?

So, on a given website, for example, there is a div element. I want to properly specify the XPath for a given subset of the main content of the page, found in:
<div id="content">, otherwise known as div[3]
Specifically, I want the xpath for the content between the second horizontal-rule (hr) tag and the third horizontal-rule (hr) tag. Which I believe should be, respectively:
'//div[@id="content"]/hr[2]' AND '//div[@id="content"]/hr'
I have been reading the XPath Tutorial and trying to figure out if the two hr tags are siblings or not, which I believe they are. However, Python does not seem to be recognizing them as such. I have tried every derivation of:
"following-sibling" and "preceding:: and not(preceding::)"
to the point that I no longer know which is which, and what is what. I do know that I am confused, and I believe the script is being confounded by the fact that the second hr of interest is not being numbered/identified as the third hr within the content/div (does not follow logically in numbering) as it 'should' be... according to what Firebug has been telling me.
The bottom line is: How do I properly specify this xpath? Again, these horizontal-rule tags appear to be siblings to me so I would think it would follow a structure such as following-sibling & preceding-sibling to specify the content between these two tags.
If you have access to XPath 2.0, you can use the intersect operator to select all nodes between the two:
//hr[2]/following-sibling::node()
intersect
//hr[3]/preceding-sibling::node()
If you only have access to XPath 1.0 functions, you can use this wonderful workaround to achieve the same result:
//hr[2]/following-sibling::node()[
count(.| //hr[3]/preceding-sibling::node())
=
count(//hr[3]/preceding-sibling::node())
]
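For what it's worth, here is a browser-console sketch of that XPath 1.0 workaround applied to the question's markup, assuming the <hr> elements are direct children of <div id="content">:
const xpath =
  '//div[@id="content"]/hr[2]/following-sibling::node()[' +
  '  count(. | //div[@id="content"]/hr[3]/preceding-sibling::node())' +
  '  = count(//div[@id="content"]/hr[3]/preceding-sibling::node())' +
  ']';
const result = document.evaluate(xpath, document, null,
  XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
for (let i = 0; i < result.snapshotLength; i++) {
  console.log(result.snapshotItem(i));   // each node between the 2nd and 3rd <hr>
}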