XPath for href link - html

I have, for example, links whose text content is
News
More
What would be the most efficient way of extracting the href based on the value between <a> and </a> (the link text), using XPath?

Question interpretation:
"Most efficient" is taken to mean in a programmer's time sense, not
in a performance sense.
"The value between" is taken to mean the string between the a tags.
This XPath selects all a elements:
//a
This XPath selects all a elements whose string value is "News":
//a[.='News']
This XPath selects the href attributes of all a elements whose string value is "News" [1]:
//a[.='News']/@href
[1] Credit: @localghost posted the correct answer in the comments.
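As a quick sanity check, here is a minimal sketch using Python's lxml (the module mentioned further down in this thread). The markup is hypothetical, since the question's original anchors were not preserved above:

from lxml import html

# Hypothetical markup standing in for the question's "News" and "More" links
doc = html.fromstring('<div><a href="/news">News</a> <a href="/more">More</a></div>')

print(len(doc.xpath("//a")))               # 2  (all a elements)
print(doc.xpath("//a[.='News']/@href"))    # ['/news']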

Related

Why is XPath contains(text(),'substring') not working as expected?

Let's say I have a piece of HTML like this:
<a>Ask Question<other/>more text</a>
I can match this piece of XPath:
//a[text() = 'Ask Question']
Or...
//a[text() = 'more text']
Or I can use dot to match the whole thing:
//a[. = 'Ask Questionmore text']
This post describes the difference between . (dot) and text(): in short, the former gives a single string value, whereas the latter gives a list of text nodes. But this is where it gets a bit weird to me: while text() can be used to match either of the nodes in that list, this is not the case when it comes to the XPath function contains(). If I do this:
//a[contains(text(), 'Ask Question')]
...I get the following error:
Error: Required cardinality of first argument of contains() is one or zero
How can it be that text() works when using a full match (equals), but doesn't work on partial matches (contains)?
For this markup,
<a>Ask Question<other/>more text</a>
notice that the a element has a text node child ("Ask Question"), an empty element child (other), and a second text node child ("more text").
Here's how to reason through what's happening when evaluating //a[contains(text(),'Ask Question')] against that markup:
1. contains(x,y) expects x to be a string, but text() matches two text nodes.
2. In XPath 1.0, the rule for converting multiple nodes to a string is this:
   A node-set is converted to a string by returning the string-value of
   the node in the node-set that is first in document order. If the
   node-set is empty, an empty string is returned. [Emphasis added]
3. In XPath 2.0+, it is an error to provide a sequence of text nodes to a function expecting a string, so contains(text(),'substr') will cause an error for more than one matching text node.
In your case...
XPath 1.0 would treat contains(text(),'Ask Question') as
contains('Ask Question','Ask Question')
which is true. On the other hand, be sure to notice that contains(text(),'more text') will evaluate to false in XPath 1.0. Without knowing points (1)-(3) above, this can be counter-intuitive.
XPath 2.0 would treat it as an error.
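To make the XPath 1.0 behaviour concrete, here is a small sketch with Python's lxml (a libxml2-based XPath 1.0 engine, also mentioned in another answer below):

from lxml import etree

a = etree.fromstring("<a>Ask Question<other/>more text</a>")

# XPath 1.0 converts the text() node-set to the string value of its first
# node ("Ask Question"), so this predicate is true:
print(len(a.xpath("//a[contains(text(), 'Ask Question')]")))   # 1

# ...while this one is false, even though "more text" is plainly there:
print(len(a.xpath("//a[contains(text(), 'more text')]")))      # 0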
Better alternatives
If the goal is to find all a elements whose string value contains the substring, "Ask Question":
//a[contains(.,'Ask Question')]
This is the most common requirement.
If the goal is to find all a elements with an immediate text node child equal to "Ask Question":
//a[text()='Ask Question']
This can be useful when you wish to exclude strings coming from descendant elements of a, such as if you want this a,
<a>Ask Question<other/>more text</a>
but not this a:
<a>more text before <not>Ask Question</not> more text after</a>
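A short sketch of both alternatives against the two markup samples, again with Python's lxml for illustration:

from lxml import etree

doc = etree.fromstring(
    "<root>"
    "<a>Ask Question<other/>more text</a>"
    "<a>more text before <not>Ask Question</not> more text after</a>"
    "</root>")

# String-value containment matches both a elements:
print(len(doc.xpath("//a[contains(., 'Ask Question')]")))    # 2

# Immediate-text-node equality matches only the first one:
print(len(doc.xpath("//a[text()='Ask Question']")))          # 1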
See also
How contains() handles a nodeset first arg
How to use XPath contains() for specific text?
Testing text() nodes vs string values in XPath
The reason for this is that the contains() function doesn't accept a node-set as input; it only accepts a string. (Well, it may be engine-dependent, because it works in Python's lxml module. According to the specification, it should convert the value of the first node in the set to a string and act on that. See also: XPath contains(text(),'some string') doesn't work when used with a node with more than one text subnode.)
//a[text() = 'Ask Question'] is matching any a elements which contain a text node which equals Ask Question.
//a[text() = 'more text'] is matching any a elements which contain a text node which equals more text.
So both of these expressions match the same a element.
You can re-work your query to //a[text()[contains(., 'Ask Question')]] so that the contains method will only act on a single text node at a time.
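For completeness, a quick check of the re-worked query with lxml (an XPath 1.0 engine, so no cardinality error is raised here):

from lxml import etree

a = etree.fromstring("<a>Ask Question<other/>more text</a>")

# Each text node is tested individually, so both substrings are now found:
print(bool(a.xpath("//a[text()[contains(., 'Ask Question')]]")))   # True
print(bool(a.xpath("//a[text()[contains(., 'more text')]]")))      # True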

Selecting element based on attribute order in XPath?

I am working on a project using the Html-Agility-Pack and I need to build a list of each link that has an href attribute as its first attribute. What XPath expression would be used for this?
Example (I would want to only select the first):
<a href="http://someurl.com"/>
<a id="someid" href="http://someurl.com"/>
No, don't do that.
You really don't want to select elements based upon the ordering of their attributes, because attribute order is arbitrary in HTML and XML. Find another criterion to limit your selections:
attribute presence or attribute value
child element presence or string value
preceding element value, possibly a label
etc.
You want to choose a criterion that's invariant across all instances of the HTML/XML documents you may encounter. Attribute order is not such a criterion.
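If the underlying requirement is really just "anchors that have an href", attribute presence is an invariant criterion. A minimal sketch with Python's lxml, using the question's two sample anchors (the same XPath works with Html Agility Pack's SelectNodes):

from lxml import etree

doc = etree.fromstring('<div><a href="http://someurl.com"/>'
                       '<a id="someid" href="http://someurl.com"/></div>')

# Both anchors match, regardless of how their attributes happen to be ordered:
print(len(doc.xpath("//a[@href]")))                # 2

# If the real intent was "href but no id", say that explicitly instead:
print(len(doc.xpath("//a[@href and not(@id)]")))   # 1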

Why won't my XPath select link/button based on its label text?

<a href="javascript:void(0)" title="home">
<span class="menu_icon">Maybe more text here</span>
Home
</a>
So for the above code, when I write //a as the XPath, the element gets highlighted, but when I write //a[contains(text(), 'Home')], it does not. I think this is simple and should have worked.
Where's my mistake?
Other answers have missed the actual problem here:
Yes, you could match on @title instead, but that's not why OP's XPath is failing where it may have worked previously.
Yes, XML and XPath are case sensitive, so Home is not the same as home, but there is a Home text node as a child of a, so OP is right to use Home if he doesn't trust @title to be present.
Real Problem
OP's XPath,
//a[contains(text(), 'Home')]
says to select all a elements whose first text node contains the substring Home. Yet, the first text node contains nothing but whitespace.
Explanation: text() selects all child text nodes of the context node, a. When contains() is given multiple nodes as its first argument, it takes the string value of the first node, but Home appears in the second text node, not the first.
Instead, OP should use this XPath,
//a[text()[contains(., 'Home')]]
which says to select all a elements with any text child whose string value contains the substring Home.
If there weren't surrounding whitespace, this XPath could be used to test for equality rather than substring containment:
//a[text()[.='Home']]
Or, with surrounding whitespace, this XPath could be used to trim it away:
//a[text()[normalize-space()='Home']]
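A minimal reproduction of the whitespace issue with Python's lxml (an XPath 1.0 engine, so contains(text(), ...) silently tests only the first, whitespace-only text node instead of raising an error):

from lxml import etree

a = etree.fromstring(
    '<a href="javascript:void(0)" title="home">\n'
    '<span class="menu_icon">Maybe more text here</span>\n'
    'Home\n'
    '</a>')

print(bool(a.xpath("//a[contains(text(), 'Home')]")))             # False: first text node is whitespace
print(bool(a.xpath("//a[text()[contains(., 'Home')]]")))          # True
print(bool(a.xpath("//a[text()[normalize-space() = 'Home']]")))   # True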
See also:
Testing text() nodes vs string values in XPath
Why is XPath unclean constructed? Why is text() not needed in predicate?
XPath: difference between dot and text()
Yes, you are making two mistakes: you're writing Home with an uppercase H when you want to match home with a lowercase h, and you're checking the text content when you want to check the "title" attribute. Correct those two, and you get:
//a[contains(@title, 'home')]
However, if you want to match the exact string home, instead of any a that has home anywhere in the title attribute, use @zsbappa's code.
You can try this XPath. It just selects the element by its attribute value:
//a[@title='home']

Using XPath to get the first child for every child of a node

I'm trying to parse some HTML with the following structure. How can I extract the first <a> element of every <li> element using XPath?
<ul>
<li>
<a>
<span>
<a>
</li>
<li>
<a>
<span>
<a>
</li>
...
</ul>
@Mathias: You are correct, I apologize. //li/a[1] did not work because it wasn't a direct child (there is an article tag in between, which I omitted for simplicity).
Then let me post this as a solution with some more explanation.
If, as you have described, //li/a[1] does not return anything while (//li//a)[1] does, then the HTML sample you show is not representative of your actual document. In that case, a would be a descendant of li, but not a direct child of it.
A correct XPath expression in this case is
//li//a[1]
but only use it if the level of nesting varies, i.e. if there could be other elements nested between li and a:
<li>
<article>
<other>
<a/>
If the nesting is consistent, but it is not always the article element which sits between li and a, then use
//li/*/a[1]
which avoids the // shortcut (descendant-or-self), computationally more expensive than /.
Finally, if you know that the a elements you are interested in are always grandchildren of li elements and if it is always the article element in between them, use
//li/article/a[1]
When I correct the expression to be //li/article/a[1], I get the first a for the first li.
//li/article/a[1] returns several results if there are several a elements that are children of article and grandchildren of li. If this only returns a single result, then either
you invoke this XPath expression in a context where only a single result is expected, e.g. if you use an XPath library in a programming language, or
the structure of your input document is even more intricate.
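A small sketch of the grandchild case with Python's lxml, using a made-up document with an article element between li and a as described:

from lxml import etree

doc = etree.fromstring(
    "<ul>"
    "<li><article><a>first-1</a><span/><a>second-1</a></article></li>"
    "<li><article><a>first-2</a><span/><a>second-2</a></article></li>"
    "</ul>")

# //li/article/a[1] keeps the first a of each article, i.e. one per li:
print([a.text for a in doc.xpath("//li/article/a[1]")])    # ['first-1', 'first-2']

# By contrast, (//li//a)[1] selects a single node from the whole document:
print([a.text for a in doc.xpath("(//li//a)[1]")])          # ['first-1']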
I think that the XPath to accomplish that would be .//ul/li/a[position()=1].
Explanation:
The reason I spell it all out as .//ul/li/a is that, when you use the XPath and there is an error, your stack trace will reveal exactly what the locator pointed at, and is less vague. But you can obviously shorten it if you don't care: .//a
Using the position clause, you can do =1 or >1, or whatever. I would choose [position()=1] over [1] because XPath doesn't use 0-based indexing, which might confuse others looking at your locator. I mean, position=0, by logic, would mean null, right?
I start my locator with a . because, personally, I sometimes like to chain my locators together. You don't really need to start with the dot character, but since I use the // wildcard in this case, it's effectively the same as starting without a dot, with the additional ability to be chained.
Answer tested on http://the-internet.herokuapp.com/

How to parse HTML/XML tags according to NOT conditions in [r]

Dearest StackOverflow homies,
I'm playing with HTML that was output by EverNote and need to parse the following:
Note Title
Note anchor (hyperlink identities of the notes themselves)
Note Creation Date
Note Content, and
Intra-notebook hyperlinks (the
links within the content of a note to another note's anchor)
According to examples by Duncan Temple Lang, author of the R XML package, and an SO answer by @jdharrison, I have been able to parse the Note Title, Note anchor, and Note Creation Dates with relative ease. For those who may be interested, the commands to do so are
require("XML")
rawHTML <- paste(readLines("EverNotebook.html"), collapse="\n") #Yes... this is noob code
doc = htmlTreeParse(rawHTML,useInternalNodes=T)
#Get Note Titles
html.titles<-xpathApply(doc, "//h1", xmlValue)
#Get Note Title Anchors
html.tAnchors<-xpathApply(doc, "//a[@name]", xmlGetAttr, "name")
#Get Note Creation Date
html.Dates<-xpathApply(doc, "//table[@bgcolor]/tr/td/i", xmlValue)
Here's a fiddle of an example HTML EverNote export.
I'm stuck on parsing 1. Note Contents and 2. Intra-notebook hyperlinks.
Taking a closer look at the code, it is apparent that the solution for the first part is to return every upper-most* div that does NOT include a table with attribute bgcolor="#D4DDE5". How is this accomplished?
Duncan says that it is possible to use XPath to parse XML according to NOT conditions:
"It allows us to express things such as "find me all nodes named a" or "find me all nodes named a that have no attribute named b" or "nodes a that >have an attribute b equal to 'bob'" or "find me all nodes a which have c as >an ancestor node"
However he does not go on to describe how the XML package can parse exclusions... so I'm stuck there.
Addressing the second part, consider the format of anchors to other notes in the same notebook:
<a href="#13178">
The goal with these is to procure their number and yet this is difficult because they are solely distinguished from www links by the # prefix. Information on how to parse for these particular anchors via partial matching of their value (in this case #) is sparse - maybe even requiring grep(). How can one use the XML package to parse for these special hrefs? I describe both problems here since it's possible a solution to the first part may aid the second... but perhaps I'm wrong. Any advice?
UPDATE 1
By upper-most div I meant to say outer-most div. The contents of every note in an EverNote HTML export are within the DOM's outer-most divs. Thus the interest is to return every outer-most div that does NOT include a table with attribute bgcolor="#D4DDE5".
"....to return every upper-most div that does NOT include a table with attribute bgcolor="#D4DDE5." How is this accomplished?"
One possible way, ignoring 'upper-most' as I don't know exactly how you would define it:
//div[not(table[@bgcolor='#D4DDE5'])]
The above XPath reads: select all <div> elements not having a child <table> element whose bgcolor attribute equals #D4DDE5.
I'm not sure about what you mean by "parse" in the 2nd part of the question. If you simply want to get all of those links having a special href, you can partially match the href attribute using starts-with() or contains():
//a[starts-with(@href, '#')]
//a[contains(@href, '#')]
UPDATE:
Taking "outer-most" div into consideration:
//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]
Side note: I don't know exactly how XPath's not() is defined, but if it works like negation in general (and it does here, as confirmed by the OP in a comment), you can apply one of De Morgan's laws:
"not (A or B)" is the same as "(not A) and (not B)".
so that the updated XPath can be slightly simplified to:
//div[not(table[@bgcolor='#D4DDE5'] or ancestor::div)]
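For a quick check of these expressions outside of R (purely illustrative; Python's lxml and R's XML package both sit on libxml2's XPath 1.0 engine, so the results carry over to xpathApply), here is a stripped-down, hypothetical stand-in for the EverNote structure:

from lxml import etree

doc = etree.fromstring(
    "<body>"
    "<div><table bgcolor='#D4DDE5'><tr><td>note header</td></tr></table></div>"
    "<div>note content <div>nested</div> <a href='#13178'>note link</a>"
    " <a href='http://example.com'>www link</a></div>"
    "</body>")

# Outer-most divs that do not contain the header table:
print(len(doc.xpath("//div[not(table[@bgcolor='#D4DDE5']) and not(ancestor::div)]")))   # 1

# Intra-notebook links only (href starting with '#'):
print(doc.xpath("//a[starts-with(@href, '#')]/@href"))   # ['#13178']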