With XPath I am trying to get the following values.
HTML:
<ul class="listVideoAttributes alpha only">
<li class="alpha only">
<span>Categories:</span>
<ul>
<li class="psi alpha">
<a>Cinema</a>
</li>
<li class="omega">
<a>HD</a>
</li>
</ul>
</li>
</ul>
Categories are not always labelled "Categories"; sometimes they are called "Tags".
I would like an XPath that locates the Categories block and returns the category values,
such as Cinema and HD.
For now, I'm using:
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]
It returns the values, but also the text 'Categories:'.
I would like to do something like:
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]/ul
But it does not seem to work.
Your XPath expression did not work because the inner <ul/> is not a direct child of the outer <ul/>. Use the descendant-or-self step //ul instead of the child axis step /ul at the end of your expression. If you're sure the markup will not change, it is better to use only child axis steps: /li/ul/li/a.
Another problem is that the @class attribute does not equal listVideoAttributes; it only contains it. You should never compare HTML class attributes with equals; always use contains.
Anyway, I'd be as specific as possible while searching for the "headline", otherwise you could find false positives when the content of any "listVideoAttributes" list contains "Categories" or "Tags":
//ul[contains(@class, 'listVideoAttributes')]/li[contains(span, 'Categories') or contains(span, 'Tags')]//a
You might want to append /text() if you cannot read the string value from the programming language you're using. Reading the string value is usually preferable: when a link contains bold text, as in <a href="..."><strong>foo</strong></a>, text() would not return "foo", whereas the string value would.
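To make the text() versus string value point concrete, here is a minimal Python sketch. It uses the standard library's ElementTree rather than a full XPath engine, so `a.text` and `itertext()` are only rough stand-ins for `a/text()` and the XPath string value, and the fragment itself is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one link holds plain text, the other wraps it in <strong>.
fragment = ET.fromstring(
    '<ul>'
    '<li><a href="#">Cinema</a></li>'
    '<li><a href="#"><strong>HD</strong></a></li>'
    '</ul>'
)

links = fragment.findall(".//a")

# Roughly what a/text() sees: only text sitting directly inside <a>.
direct_text = [a.text for a in links]

# Roughly the XPath string value: all descendant text concatenated.
string_values = ["".join(a.itertext()) for a in links]

print(direct_text)    # ['Cinema', None]
print(string_values)  # ['Cinema', 'HD']
```

The second link has no direct text at all, which is why reading the string value is the safer default.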
You can try the XPath below:
//ul[contains(@class,'listVideoAttributes') and contains(.//span,'Categories')]//a/text()
output:
Cinema
HD
There are two problems with
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]/ul
First, the class of the outer ul is not equal to "listVideoAttributes"; it only contains that as a substring. Second, the inner ul is not a direct child of the outer one; it's a grandchild. How about
//ul[contains(@class, 'listVideoAttributes')][contains(., 'Categories')]/li/ul/li/a
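If you end up doing the filtering in code, the same logic can be sketched in Python with the standard library. This is only an illustration under assumptions: the markup is modelled on the question with the values placed inside <a> elements, and `has_class` and `category_values` are hypothetical helper names. Note that it matches whole class tokens, which is slightly stricter than a substring contains().

```python
import xml.etree.ElementTree as ET

# Markup modelled on the question; the <a> wrappers are an assumption.
html = """
<div>
<ul class="listVideoAttributes alpha only">
  <li class="alpha only">
    <span>Categories:</span>
    <ul>
      <li class="psi alpha"><a href="#">Cinema</a></li>
      <li class="omega"><a href="#">HD</a></li>
    </ul>
  </li>
</ul>
</div>
"""

def has_class(element, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in element.get("class", "").split()

def category_values(root, headings=("Categories", "Tags")):
    values = []
    for ul in root.iter("ul"):
        if not has_class(ul, "listVideoAttributes"):
            continue
        for li in ul.findall("li"):  # child step: /li
            span = li.find("span")
            label = (span.text or "") if span is not None else ""
            if any(heading in label for heading in headings):
                # child steps only: /ul/li/a
                values += [(a.text or "").strip() for a in li.findall("ul/li/a")]
    return values

result = category_values(ET.fromstring(html))
print(result)  # ['Cinema', 'HD']
```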
The result from my scrapy project looks like this:
<div class="news_li">...</div>
<div class="news_li">...</div>
<div class="news_li">...</div>
...
<div class="news_li">...</div>
And each "news_li" class looks like this:
<div class="news_li">
<div class="a">
<a href="aaa">
<div class="a1"></div>
</a>
</div>
<a href="xxx">
<div class="b">
<div class="b1"></div>
<div class="b2"></div>
<div class="b3"></div>
</div>
</a>
</div>
I am trying to extract information one at a time in the scrapy shell by the following command:
response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']").extract()
response.xpath("//div[@class='news_li ']/descendant::div[@class='a1']").extract()
But these commands return every "a1" div from all the "news_li" divs.
I have 2 questions:
How do I get the child div information one "news_li" at a time?
How do I get the two <a> elements (<a href="aaa"> and <a href="xxx">) separately? (The difference is that the first one is wrapped in a parent div and the second one stands by itself.)
Many thanks in advance.
Edit: To be specific, how can I extract the information relative to the parent/root node? I looked up XPath axes and tried 'descendant', but it does not work.
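Here's what you can try:
response.xpath("(//div[@class='news_li'])[1]").xpath(".//div[@class='a1']").extract()
Put the index directly in the XPath. XPath positions start at 1, so [1] is the first match, and the second expression starts with . so that it stays inside that match.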
It is very likely that when combining XPath expressions like so:
response.xpath("//div[@class='news_li']")[0].xpath("//div[@class='a1']").extract()
if the second expression starts with a double slash //, then elements are selected anywhere in the document, regardless of what was selected before. Put another way: even if the first expression:
//div[@class='news_li']
selects only div elements with a certain class attribute, the next one:
//div[@class='a1']
selects all div elements where @class='a1' in the entire document. That seems to be your problem.
Solution: Use a relative path
One possible solution is to use a relative path expression that does not start with //:
response.xpath("//div[@class='news_li']")[0].xpath(".//div[@class='a1']").extract()
General remarks
Depending on the structure of your actual documents and if you can make certain assumptions, better solutions may be possible.
Also, in general, to process results "one at a time", you should
write an XPath expression that selects all of those desired elements and return them as a list
process each item in this list individually, for example with Python code
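Those two steps can be sketched as follows. This is a standalone illustration using Python's standard library ElementTree instead of Scrapy, so it runs without a crawl; the sample page and its text content are made up. In Scrapy the loop would look the same, with `response.xpath(...)` yielding selectors and `item.xpath(".//...")` applied to each one.

```python
import xml.etree.ElementTree as ET

# Made-up page with two "news_li" blocks shaped like the one in the question.
page = ET.fromstring("""
<body>
  <div class="news_li">
    <div class="a"><a href="aaa"><div class="a1">first a1</div></a></div>
    <a href="xxx"><div class="b"><div class="b1">first b1</div></div></a>
  </div>
  <div class="news_li">
    <div class="a"><a href="bbb"><div class="a1">second a1</div></a></div>
    <a href="yyy"><div class="b"><div class="b1">second b1</div></div></a>
  </div>
</body>
""")

rows = []
# Step 1: select all the container elements as a list.
for item in page.findall(".//div[@class='news_li']"):
    # Step 2: relative paths are evaluated against `item`, not the whole document.
    a1 = item.find(".//div[@class='a1']")
    wrapped_link = item.find("div[@class='a']/a")  # <a> inside the wrapper div
    direct_link = item.find("a")                   # <a> that is a direct child
    rows.append((a1.text, wrapped_link.get("href"), direct_link.get("href")))

for row in rows:
    print(row)
```

Each tuple holds the values for exactly one "news_li" block, and the two links are told apart by where they sit relative to that block.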
Try the following.
# first link
response.xpath("(//div[@class='news_li']//a)[1]").extract()
# second link
response.xpath("(//div[@class='news_li']//a)[2]").extract()
Edit 1:
# change the X value in the XPath below to get the first link
//div[@class='news_li'][X]/descendant::div[@class='a1']/parent::a
# change the X value in the XPath below to get the second link (direct
# link) based on the child div
//div[@class='news_li'][X]/descendant::a[div[@class='b']]
I'm trying to parse some HTML with the following structure, how can I extract the first <a> element of every <li> element using xpath?
<ul>
<li>
<a>
<span>
<a>
</li>
<li>
<a>
<span>
<a>
</li>
...
</ul>
@Mathias: You are correct, I apologize. //li/a[1] did not work because it wasn't a direct child (there is an article tag in between, which I omitted for simplicity).
Then let me post this as a solution with some more explanation.
If, as you have described, //li/a[1] does not return anything while (//li//a)[1] does, then the HTML sample you show is not representative for your actual document. Then, a would be a descendant of li, but not a direct child of it.
A correct XPath expression in this case is
//li//a[1]
but only use it if the level of nesting varies, i.e. if there could be other elements nested between li and a:
<li>
<article>
<other>
<a/>
If the nesting is consistent, but it is not always the article element that sits between li and a, then use
//li/*/a[1]
which avoids //, a step that is usually more expensive to evaluate than /.
Finally, if you know that the a elements you are interested in are always grandchildren of li elements and if it is always the article element in between them, use
//li/article/a[1]
When I correct the expression to be //li/article/a[1], I get the first a for the first li.
//li/article/a[1] returns several results if there are several a elements that are the first a child of an article and grandchildren of li. If it only returns a single result, then either
you invoke this XPath expression in a context where only a single result is expected, e.g. if you use an XPath library in a programming language or
the structure of your input document is even more intricate
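To see how the variants discussed in this answer differ, here is a small Python sketch using the standard library's ElementTree, which supports this subset of XPath. The document is invented, and the leading . is needed only because ElementTree evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Invented document: every <a> is a grandchild of <li>, with <article> in between.
doc = ET.fromstring("""
<ul>
  <li><article><a>one</a><span/><a>two</a></article></li>
  <li><article><a>three</a><span/><a>four</a></article></li>
</ul>
""")

def texts(path):
    """Text of every element matched by `path`, in document order."""
    return [a.text for a in doc.findall(path)]

direct_child = texts(".//li/a[1]")        # a is not a direct child of li here
any_depth = texts(".//li//a[1]")          # any nesting depth below li
one_level = texts(".//li/*/a[1]")         # exactly one element in between
exact = texts(".//li/article/a[1]")       # article in between

print(direct_child)  # []
print(any_depth)     # ['one', 'three']
print(one_level)     # ['one', 'three']
print(exact)         # ['one', 'three']
```

Each of the last three returns one a per li, not a single element for the whole document.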
I think that the XPath to accomplish that would be .//ul/li/a[position()=1].
Explanation:
The reason I spell it all out as .//ul/li/a is that, if there is an error, the stack trace will reveal exactly what the locator pointed at, which is less vague. But you can obviously shorten it if you don't care: .//a.
With the position clause you can use =1, >1, or whatever. I would choose [position()=1] over [1] because XPath doesn't use 0-based indexing, which might confuse others looking at your locator. I mean, position=0, by logic, means null, right?
I start my locator with a . because I sometimes like to chain my locators together. You don't really need the leading dot, but since I use // in this case, it is effectively the same as starting without a dot, with the additional ability to be chained.
Answer tested on http://the-internet.herokuapp.com/
I am aware that I can directly use:
driver.FindElement(By.XPath("//ul[3]/li/ul/li[7]")).Text
to get the text, but I am trying to get the text by using XPath with a combination of different features like text(), contains(), etc.
//ul[3]/li/ul/li//[text()='My Data']
Please suggest different ways I can handle this, other than the one I mentioned.
<li class="ng-binding ng-scope selectedTreeElement" ng-click="orgSelCtrl.selectUserSessionOrg(child);" ng-class="{selectedTreeElement: child.organizationId == orgSelCtrl.SelectedOrg.organizationId}" ng-repeat="child in node.childOrgs" style="background-color: transparent;"> My Data </li>
It looks like you have an extra "/" in your XPath and are missing the leading dot:
//ul[3]/li/ul/li//[text()='My Data']
Try this:
.//ul[3]/li/ul/li[text()='My Data']
But XPath is used here only to find elements, not to read their attributes. If you need to read an attribute or the text inside an element, do that with Selenium after the search.
.Text of a WebElement would just return you the text of an element.
If you want to match on the text, check text() inside the XPath expression, e.g.:
//ul[3]/li/ul/li[text()='My Data']
or, using contains():
//ul[3]/li/ul/li[contains(text(), 'My Data')]
There are other functions you can make use of, see Functions - XPath.
You can also combine it with other conditions. For instance:
//ul[3]/li/ul/li[contains(@class, 'selectedTreeElement') and contains(text(), 'My Data')]
I have an XML document with this specific structure:
<ul>
<li>
the
<a>dog</a>
is black
</li>
<li >
the
<a>cat</a>
is white
</li>
</ul>
But I also have this:
<ul>
<li>
the bird is blue
</li>
<li >
the
<a>frog</a>
</li>
</ul>
I don't know whether there is an <a> in my <li>, or where it is.
I would like the XPath query to return sentences like "the dog is black", "the cat is white", "the bird is blue" and "the frog".
Thanks!
If you're bound to XPath 1.0, you cannot get the sentences as separate strings. You can get all text nodes in all list items using
//ul//text()
but for the first HTML snippet this will return something like "the dog is black the cat is white".
If you need the sentences separated, retrieve the list items and put the sentences together outside XPath (e.g. in PHP, Java, or whatever you're using). How to do this differs from language to language; have a look at the reference, or refine this question or ask another one.
//ul/li
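As one possible illustration of putting the sentences together outside XPath, here is a minimal Python sketch using the standard library. Python is just an example host language here, and the markup is modelled on the question with <a> elements assumed around some words.

```python
import xml.etree.ElementTree as ET

# Markup modelled on the question; where the <a> elements sit is an assumption.
doc = ET.fromstring("""
<ul>
  <li>
    the
    <a href="#">dog</a>
    is black
  </li>
  <li>the bird is blue</li>
  <li>
    the
    <a href="#">frog</a>
  </li>
</ul>
""")

sentences = []
for li in doc.findall("li"):      # plays the role of //ul/li for this root
    raw = "".join(li.itertext())  # all text nodes of the item, in document order
    sentences.append(" ".join(raw.split()))  # collapse runs of whitespace

print(sentences)  # ['the dog is black', 'the bird is blue', 'the frog']
```

The same idea carries over to other languages: select the list items, then concatenate and normalize each item's text on the host side.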
With XPath 2.0 you have more luck and can use one of these queries:
//ul/li/data(.)
//ul/li/string-join(.//text(), ' ')
If the first one returns what you need, use it. If there are problems with whitespace (whitespace handling differs between implementations, but can usually be configured), go for the more flexible second query and adjust it as needed.
Thanks for your reply. I use XPath in an iOS application with an HTML parser: hpple (https://github.com/topfunky/hpple)
I think it uses XPath 1.0, because the log tells me the string-join function isn't recognized.
//ul//text()
works, but it returns the text word by word, not line by line.
I am using XPath to query HTML sites, which has worked pretty well so far, but now I have hit a (brick) wall and can't find a solution :-)
The html looks like this:
<ul>
<li><a>Text1<span>AnotherText1</span></a></li>
<li><a>Text2<span>AnotherText2</span></a></li>
<li><a>Text3<span>AnotherText3</span></a></li>
</ul>
I want to select the "TextX" part, but NOT the AnotherTextX part in the <span></span>
So far I couldn't come up with any (pure) XPath solution to do that (and in my setup I unfortunately need a pure XPath solution).
This selects kind of what I want, but it results in "TextXAnotherTextX" and I only need "TextX".
/ul/li/a
Any hints? :-)
This gets you the first direct text node child of <a>:
/ul/li/a/text()[1]
and this would get you any direct text node child (separately):
/ul/li/a/text()
Both of the above return "TextX", but if you had:
<li><a>Text4<span>AnotherText3</span>TrailingText</a></li>
then the latter would return: ["Text4", "TrailingText"], while the former would return "Text4" only.
Your expression /ul/li/a gets the string value of <a>, which is defined as the concatenation of the string value of all the children of <a>, so you get "TextXAnotherTextX".
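The difference between the three results can also be checked outside XPath. The following Python sketch uses the standard library's ElementTree data model, where `.text` and `.tail` only approximate XPath text nodes; it is meant to illustrate the semantics, not to replace the pure XPath answer above.

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>Text4<span>AnotherText3</span>TrailingText</a>")

# Like text()[1] in this example: the text before the first child element.
first_text = a.text

# Like text(): every text chunk that is a direct child of <a>.
direct_texts = [t for t in [a.text] + [child.tail for child in a] if t]

# Like the string value of <a>: all descendant text concatenated.
string_value = "".join(a.itertext())

print(first_text)    # Text4
print(direct_texts)  # ['Text4', 'TrailingText']
print(string_value)  # Text4AnotherText3TrailingText
```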