Proper xpath specification between two children within the same parent? - html

So, on a given website there is, for example, a div element. I want to properly specify the XPath for a given sub-set of the main content of the page, found in:
<div id="content">, otherwise known as <div[3]>
Specifically, I want the XPath for the content between the second horizontal-rule (hr) tag and the third horizontal-rule (hr) tag, which I believe should be, respectively:
'//div[@id="content"]/hr[2]' **AND** '//div[@id="content"]/hr[3]'
I have been reading the XPath Tutorial and trying to figure out if the two hr tags are siblings or not, which I believe they are. However, Python does not seem to be recognizing them as such. I have tried every derivation of:
"following-sibling" and "preceding:: and not(preceding::)"
to the point that I no longer know which is which, and what is what. I do know that I am confused, and I believe the script is being confounded by the fact that the second hr of interest is not being numbered/identified as the third hr within the content div (it does not follow logically in the numbering), as it 'should' be according to what Firebug has been telling me.
The bottom line is: How do I properly specify this xpath? Again, these horizontal-rule tags appear to be siblings to me so I would think it would follow a structure such as following-sibling & preceding-sibling to specify the content between these two tags.

If you have access to XPath 2.0 you can use the intersect operator to select all nodes between the two:
//hr[2]/following-sibling::node()
intersect
//hr[3]/preceding-sibling::node()
If you only have access to XPath 1.0, you can use this wonderful workaround (the well-known "Kayessian" node-set intersection) to achieve the same result:
//hr[2]/following-sibling::node()[
count(.| //hr[3]/preceding-sibling::node())
=
count(//hr[3]/preceding-sibling::node())
]
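Since the question mentions Python: if lxml is not available to run these expressions, the same "between two markers" idea can be sketched with the standard library alone by slicing the parent's child list. This is only a sketch, and the sample markup below is invented for illustration (note it only collects child elements, not the text nodes that the XPath `node()` would also select):

```python
# Sketch: select the elements between the 2nd and 3rd <hr> by slicing
# the parent's child list -- the same set-intersection idea as the
# XPath above (lxml would be needed to run the XPath itself).
import xml.etree.ElementTree as ET

html = """
<div id="content">
  <p>a</p><hr/><p>b</p><hr/><p>c</p><p>d</p><hr/><p>e</p>
</div>
"""

content = ET.fromstring(html)
children = list(content)
# positions of every <hr> among the div's children
hr_positions = [i for i, el in enumerate(children) if el.tag == "hr"]
start, end = hr_positions[1], hr_positions[2]  # 2nd and 3rd <hr>
between = children[start + 1:end]
print([el.text for el in between])  # texts of the two <p> elements in between
```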

Related

Selecting both links and anchors using XPath

I am using screaming frog and I want to do this using XPath.
Extract all links and anchors containing a certain class from the main body content
but I want to exclude all links within div.list
Right now I am trying this but it's not working too well, plus I want it to spit it out in text form if possible.
//div[@class="page-content"]/*[not(class="list")]//a[@data-wpel-link="internal"]
Anyone got an idea?
This XPath,
//a[@data-wpel-link="internal"][not(ancestor::div[@class="list"])]
will select all a elements with the given attribute value that do not have an ancestor div of the given class.
You can, of course, prefix any ancestry you like to restrict the selection, e.g.:
//div[@class="page-content"]//a[@data-wpel-link="internal"]
[not(ancestor::div[@class="list"])]
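If you want to sanity-check the exclusion logic outside Screaming Frog, here is a rough stdlib-Python sketch of the same ancestor test. The markup is invented; ElementTree has no ancestor axis, so a child-to-parent map is built by hand:

```python
# Sketch of the ancestor-exclusion idea with the stdlib (a real crawl
# would use lxml or Screaming Frog's own XPath engine).
import xml.etree.ElementTree as ET

html = """
<div class="page-content">
  <p><a data-wpel-link="internal" href="/keep">keep</a></p>
  <div class="list">
    <a data-wpel-link="internal" href="/skip">skip</a>
  </div>
</div>
"""

root = ET.fromstring(html)
# child -> parent lookup, since ElementTree elements don't know their parent
parent = {child: p for p in root.iter() for child in p}

def has_list_ancestor(el):
    while el in parent:
        el = parent[el]
        if el.tag == "div" and el.get("class") == "list":
            return True
    return False

links = [a.get("href") for a in root.iter("a")
         if a.get("data-wpel-link") == "internal"
         and not has_list_ancestor(a)]
print(links)  # only the link outside div.list survives
```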

Add html element that is "invisible" or skipped by CSS selector rules

I want to build an external GUI that operates on a generic HTML piece that comes with associated CSS. In order to enable some functionalities of the GUI, I would need to create some "meta" HTML elements to contain parts of content and associate them with data.
Example:
<div id="root">
<foo:meta data-source="document:1111" data-xref="...">
sometext
<p class="quote">...</p>
</foo:meta>
<p class="other">...</p>
</div>
This HTML is auto-generated starting from already existing HTML that has associated CSS:
<div id="root">
sometext
<p class="quote">...</p>
<p class="other">...</p>
</div>
#root>p {
color:green;
}
#root>p+p {
color:red;
}
The problem is, when adding the <foo:meta> element, this breaks CSS child and sibling selectors. I am looking for a way for the CSS selectors to keep working when encapsulating content in this way. We have tried foo\:meta{display:contents} style, but, although it works in terms of hiding the meta element from the box renderer, it doesn't hide it from the selector matcher. We do not produce the HTML/CSS to be processed, so writing them in a certain way before processing is not an option. They come as they are, generic HTML documents with associated CSS.
Is there a way to achieve what we are looking for using HTML/CSS?
To restate, we are looking for a way to dynamically encapsulate parts of content in non-visual elements without breaking child and sibling CSS selectors. The elements should only be available to DOM traversal such as document.getElementsByTagName('foo:meta')
If I understood your problem correctly, I would suggest using a space (the descendant combinator) between the ancestor and the descendant instead of '>'. Also, #root is an id selector, not a class.
The selector you have used only selects the next level down, that is, the direct children. But adding the space in between enables you to select grandchildren too!
So all you have to do is this:
#root .quote {
color:green;
}
Let me know if this helped.
A working css is here
So, after much fiddling and research, we came to the conclusion that this can't be done, even with Shadow DOM, as even that would require massive CSS rewrites that might not preserve semantics.
However, for anyone stumbling upon this question, we came to the same end by employing the following (I'll be short, pointers only):
using two comments to mark where the tag would start/end, instead of an XML tag (eg. <!--<foo:bar data-source="1111">-->...content...<!--</foo:bar>-->)
these pointers work more or less like the markup equivalent of a DOM Range and they can work together with it.
this approach has the interesting advantage (as opposed to a single node) that it can start and end in different nodes, so it can span subtrees.
But this also breaks the XML structure when you try to recompose it. Also, it's quite easy, through manipulation, to end up with the range end moving before the range start, with multiple ranges overlapping, etc.
In order to recompose it (to send to a next XML processor or noSQL XML database for cross-referencing), we need to make sure we avoid the XML-breaking manipulations described above; then, one only needs to convert encapsulated tags to regular tags by using string manipulation on the document (X)HTML (innerHtml, outerHtml, XMLSerializer) to get a clean XML which can be mined and cross-referenced for content.
We used the TreeWalker API for scanning the document for comments; you might need it, although scanning the document for comments this way can be slow (it works for us, though). If you are bolder you can try using XPath, i.e. document.evaluate('//comment()', document); it seems to work, but we don't trust that all browsers comply.
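For what it's worth, the comment-marker scan can also be reproduced outside the browser. Here is a rough Python sketch using the stdlib html.parser; the `<foo:bar>` markers and sample markup are illustrative only:

```python
# Sketch: locate <!--<foo:bar ...>--> ... <!--</foo:bar>--> comment
# delimiters and collect the text between them (in the browser we used
# TreeWalker / document.evaluate instead).
from html.parser import HTMLParser

class RangeScanner(HTMLParser):
    def __init__(self):
        super().__init__()
        self.inside = False     # are we between the two markers?
        self.captured = []      # text found inside the range

    def handle_comment(self, data):
        data = data.strip()
        if data.startswith("<foo:bar"):
            self.inside = True
        elif data == "</foo:bar>":
            self.inside = False

    def handle_data(self, data):
        if self.inside and data.strip():
            self.captured.append(data.strip())

scanner = RangeScanner()
scanner.feed('<div>before'
             '<!--<foo:bar data-source="1111">-->sometext'
             '<p class="quote">quoted</p>'
             '<!--</foo:bar>-->after</div>')
print(scanner.captured)  # only the text between the two comment markers
```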

How to find a child html element by id with jsoup?

I am parsing the html code of one site with Jsoup. I need to find some html elements that have a specific id, but their parents' tree is complicating the task. So I would like to know whether it is possible to search for a specific html element without having to search first through all of its parents.
For instance, I am doing the following:
Elements el=elements.select(".scroller.context-inplay").select(".zone.grid-1-1").select(".grid-1").select(".module-placeholder");
I would like to know if there is a simpler way to get the same element I get with this code by searching by its id.
The id of an html element should be unique within the page. Some html that you find in the wild unfortunately breaks this requirement, though. However, if your html source follows the standard, you can simply use the # css operator to select the element in question:
Element el = doc.select("#someID").first();
Alternatively you can directly use Jsoup's getElementById method:
Element el = doc.getElementById("someID");
Also, if you decide to go by class names as you suggest in your question, it is easy to combine all selects into one selector:
Elements els = elements.select(".scroller.context-inplay .zone.grid-1-1 .grid-1 .module-placeholder");
The spaces in the CSS selector mean that any subselector right of the space must be a child of the stuff on the left side.

Different elements with same XPath? How and why can it happens?

I noticed, even within one page, that two HTML elements with different CSS paths, which are placed in fully different page areas, have exactly the same XPath. How and why can this happen? Can somebody explain it to me?
Example: http://goo.gl/P4oZmW
First Element: a select with default value Standard.
<div class="list-sorting">
<select data-current-sorting="" name="sort" id="sort">
...</select>
</div>
XPath: /html/body/div[3]/div[1]/div[1]/div[2]
CSSPath: body > div.page > div.page-content > div.list-page-header > div.list-sorting
Second element: a text block on the page's bottom
<div class="list mmkcontent">...</div>
XPath: /html/body/div[3]/div[1]/div[2]/div[2]
CSSPath: body > div.page > div.page-content > div.right-section > div.list.mmkcontent
I tried to get the XPath and CSSPath with Chrome Dev Tools and with Firefox+Firebug: the XPath was the same everywhere. Only Firebug with its extension gave me the CSSPath .list.mmkcontent for the second element, which I was finally able to use to accomplish my mission.
But I still don't understand how fully different elements can have the same XPath: the XPath should be the path from the top of the DOM tree to the element... How can elements located in different places have the same way to them through the DOM tree?
that two HTML elements with different CSSPathes, which are placed in fully different page areas, have the exactly same XPath. How and why can it happen?
Yes, this is quite easy to see. XPath is a flexible language and can select zero or more elements. CSS is also a flexible language and can select zero or more elements, but their syntax differs:
<p class="foo bar">
<div>test</div>
</p>
Here the two different CSS selectors .foo div and .bar div will select the same element. And these different XPath expressions, /p/div, /p[1]/div[1], /p[div]/*[1]/../div and /p[@class="foo bar"]/div, all point to the same <div> element, but are very different.
There is a way in both CSS and in XPath to define an exact path. In CSS the only sure way is with the #id syntax, assuming ids are unique, or through nodename:nth-child(x) syntax. The syntax div.a > div.b is not guaranteed to be unique in CSS.
In XPath the usual way is /foo[x]/bar[y], which is indisputable if x and y are numeric. Each such path will select one unique element, or nothing.
If I look at your question, you wrote:
/html/body/div[3]/div[1]/div[1]/div[2]
and
/html/body/div[3]/div[1]/div[2]/div[2]
as being the same XPath, but they are not. Also, they do not follow the foo[x]/bar[y] syntax, though I may assume that there is only one html and one body element, in which case it does not matter.
The first selects <div class="list-sorting"> on your page, the second selects <div class="list mmkcontent"> on your page.
But I still don't understand how fully different elements can have the same XPath: the XPath should be the path from the top of the DOM tree to the element... How can elements located in different places have the same way to them through the DOM tree?
One XPath can select multiple elements, in which case you can argue that one XPath selects both. But you suggest that one XPath, which selects one element at one time, selects another element another time, which isn't possible, unless there is a bug in the XPath implementation, or you use a dynamic page that changed between the two invocations.
I didn't see that for your page. The XPaths are different, and the browsers (Chrome, Firefox) show the correct path.
When I try and select "Copy XPath" in the browsers, I get this:
/html/body/div[3]/div[1]/div[1]/div[2]
/html/body/div[3]/div[1]/div[2]/div[2]
Which are different XPaths.
Of course, if you used some plugin, or other means of constructing the XPath dynamically, it is entirely possible that the plugin has some bug. But the XPaths you showed are different and trying to repro your issue shows different XPaths. Perhaps it was just a slight oversight?
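To illustrate the difference between an unindexed path and a fully numeric one, here is a small stdlib-Python sketch; the markup mirrors the <p>/<div> example above and is otherwise made up:

```python
# Sketch: a path of purely numeric predicates selects at most one
# element, while an unindexed path can match several.
import xml.etree.ElementTree as ET

body = ET.fromstring(
    "<body>"
    "<p class='foo bar'><div>first</div></p>"
    "<p><div>second</div></p>"
    "</body>"
)

print(len(body.findall("p/div")))        # unindexed path: both divs match
print(len(body.findall("p[1]/div[1]")))  # numeric path: exactly one match
```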

CSS selector select by div class attributes

<div class="thumbnail-popular" style="background: url('http://images.gogoanime.tv/images/upload/Go!.Princess.Precure.full.1812487.jpg');"></div>
I am trying to get the url component of this div class but I seem to be unable to fetch that specific data in the div class.
I have looked into making use of attributes but my attempts have been unsuccessful so far.
Usage of this CSS selector is through Kimonolabs.
div.thumbnail-popular should get you the element you're looking for — unless there is more than one such element, in which case you will need to narrow down your selector.
For example you will need to find out if this particular element belongs to a specific parent, or is the first, second, ... nth child, or any other information about the surrounding elements in the page that you're working with.
The background URL is in a style attribute on this element, so you will need to extract that attribute as described here. However you will still need to parse the declarations inside the style value in order to get the URL; I am not sure if it is possible to do this through kimono as I am not familiar with it (I'm not sure what its advanced mode really does, and it's difficult to tell from the lone screenshot that is provided in that help article).
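If you can get the raw style attribute value out, parsing the url(...) declaration is straightforward. As a rough Python sketch (assuming the attribute value has already been extracted, e.g. by kimono):

```python
# Sketch: pull the URL out of a CSS url(...) declaration with a regex,
# tolerating optional quotes and whitespace.
import re

style = ("background: url('http://images.gogoanime.tv/images/upload/"
         "Go!.Princess.Precure.full.1812487.jpg');")

match = re.search(r"url\(\s*['\"]?([^'\")]+)['\"]?\s*\)", style)
url = match.group(1) if match else None
print(url)
```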