I have an XML document with this specific structure:
<ul>
<li>
the
dog
is black
</li>
<li >
the
cat
is white
</li>
</ul>
But I also have this:
<ul>
<li>
the bird is blue
</li>
<li >
the
frog
</li>
</ul>
I don't know whether there is an <a> in my <li>, or where it is.
I would like an XPath query that returns sentences like "the dog is black", "the cat is white", "the bird is blue" and "the frog".
Thanks!
If you're bound to XPath 1.0, you cannot get the sentences as separate tokens. You can get all text in all list elements using
//ul//text()
, but for the first HTML snippet this will return something like "the dog is black the cat is white".
If you need the sentences separated, retrieve the list items and put the sentences together outside XPath (e.g. PHP, Java, or whatever you're using). How to do this differs from language to language; have a look at the reference, or refine the question or ask another one.
//ul/li
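As a rough illustration of "putting the sentences together outside XPath", here is a minimal Python sketch using only the standard library. The sample markup, the <a> element inside the first item, and the function name sentences are assumptions made for the example, not part of the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <a> element is an assumption based on the question.
SAMPLE = """<ul>
<li>
the
<a href="#">dog</a>
is black
</li>
<li>
the bird is blue
</li>
</ul>"""

def sentences(xml_text):
    """Return one whitespace-normalized string per <li>."""
    root = ET.fromstring(xml_text)
    result = []
    for li in root.findall(".//li"):
        # itertext() yields every text fragment below the <li>, in document order.
        text = "".join(li.itertext())
        # split() without arguments collapses runs of spaces and newlines.
        result.append(" ".join(text.split()))
    return result

print(sentences(SAMPLE))
```

The selection of the list items stays trivial (every li), and the joining and whitespace handling happen in the host language, which is the split of work the answer suggests for XPath 1.0.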
With XPath 2.0 you have more luck: you can use one of these queries:
//ul/li/data(.)
//ul/li/string-join(.//text(), ' ')
If the first one returns what you need, use it. If there are problems with whitespace (whitespace handling differs between implementations, but can usually be configured), go for the more flexible second query and adjust it as needed.
Thanks for your reply. I use XPath in an iOS application with an HTML parser: hpple (https://github.com/topfunky/hpple)
I think it uses XPath 1.0, because the log tells me the string-join function isn't recognized.
//ul//text()
works, but it returns the text word by word, not line by line.
In Google Sheets, for my own amusement, I'm trying to display "Galleon in Valley of the Four Winds." as one string from the code below.
I want to do this for about 600 pages, all of which have identical structure in their HTML (without IDs). I'm only ever going to be interested in the first list between the UL tags.
<h3>Source:</h3>
<ul>
<li>
<a href='http://www.wowhead.com/npc=62346' target='_blank'>Galleon</a> in Valley of the Four Winds.
</li>
<li>
<a href='/bmah.php'>The Black Market Auction House</a> (rarely)
</li>
</ul>
There are many many lists in the source code & not always in the same order, which makes something like IMPORTHTML(B2,"list",3) hard to use.
I can get "Galleon" by itself using this
=IMPORTXML(URL, "//a[@href[starts-with(., 'http://www.wowhead.com/npc')]]")
I tried adding a "//li | " but it brought back all of the lists & not the text that I hoped for, which made sense but I'm at a loss on how to proceed further with this.
=IMPORTXML(URL, "//li | //a[@href[starts-with(.,'http://www.wowhead.com/npc')]]")
I've tried reading through guides & guidelines, but at this point I'm just floundering and a bit lost.
Hope that all made sense, many thanks in advance for the replies.
This one is working on your sample
xmllint --html --xpath 'string(//li[a[@href[starts-with(., "http://www.wowhead.com/npc")]]])' test.html
Galleon in Valley of the Four Winds.
Thank you!
I've rebuilt it for Google Sheets & it posts each part in a different cell, however a quick concatenation has built a complete string.
=IMPORTXML(A5,"//li[a[@href[starts-with(.,""wowhead.com/npc"")]]]")
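For comparison, the same idea (pick the list item that contains an NPC link, then flatten it to one string) can be sketched outside of Sheets. This is only an illustration in Python with the standard library; the function name npc_source_lines is hypothetical, and the sample is the snippet from the question, which happens to be well-formed enough for an XML parser. Real pages would likely need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ul>
<li>
<a href='http://www.wowhead.com/npc=62346' target='_blank'>Galleon</a> in Valley of the Four Winds.
</li>
<li>
<a href='/bmah.php'>The Black Market Auction House</a> (rarely)
</li>
</ul>"""

PREFIX = "http://www.wowhead.com/npc"

def npc_source_lines(xml_text, prefix=PREFIX):
    """Return the flattened text of every <li> that has an <a> child linking to an NPC."""
    root = ET.fromstring(xml_text)
    lines = []
    for li in root.iter("li"):
        # Mirrors li[a[@href[starts-with(., prefix)]]]: a direct <a> child with a matching href.
        if any(a.get("href", "").startswith(prefix) for a in li.findall("a")):
            # Join link text and trailing text, then collapse whitespace.
            lines.append(" ".join("".join(li.itertext()).split()))
    return lines

print(npc_source_lines(SAMPLE))
```

Flattening the whole li in one step avoids the per-cell split and the extra concatenation.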
The best I can do (so far) with XPath is to extract the following node:
<li class="List-guests">
<span class="icon guests"/>
3
</li>
I actually need just to extract the number 3. Is there a way to do this in XPath? I really don't want to start using some complicated regex if I can avoid it.
You should be able to use the text() node test.
Both the normalized text following the class="icon guests" span,
normalize-space(//span[@class="icon guests"]/following-sibling::text())
and the normalized text of the class='List-guests' li,
normalize-space(//li[@class='List-guests'])
for your shown XML will be 3, as requested.
Note: This is the string 3. You can wrap either of the above XPath expressions in number() if you actually need the number 3.
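To make the string-versus-number distinction concrete, here is a small Python sketch (standard library only) that does by hand what normalize-space() followed by a numeric conversion does. The function name guest_count is hypothetical, and int() stands in for XPath's number(), which would yield a floating-point value.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<li class="List-guests">
<span class="icon guests"/>
3
</li>"""

def guest_count(xml_text):
    """Return the guest count as an integer."""
    li = ET.fromstring(xml_text)
    # Joining all text and collapsing whitespace plays the role of normalize-space(//li[...]).
    text = " ".join("".join(li.itertext()).split())
    # Converting the string "3" to the number 3 is the step number() performs in XPath.
    return int(text)

print(guest_count(SAMPLE))
```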
I want to extract a few values from HTML using Nokogiri in this Ruby script:
#!/usr/bin/ruby
require 'nokogiri'
doc = Nokogiri::HTML(<<-END_OF_HTML)
<html>
<meta content="text/html; charset=UTF-8"/>
<body style='margin:20px'>
<p>The following user has registered a device, click on the link below to review the user and make any changes if necessary.</p>
<ul style='list-style-type:none; margin:25px 15px;'>
<li><b>User name:</b> Test User</li>
<li><b>User email:</b> test@abc.com</li>
<li><b>Identifier:</b> abc123def132afd1213afas</li>
<li><b>Description:</b> Tom's iPad</li>
<li><b>Model:</b> iPad 3</li>
<li><b>Platform:</b> </li>
<li><b>App:</b> Test app name</li>
<li><b>UserID:</b> </li>
</ul>
<p>Review user: https://cirrus.app47.com/users?search=test@abc.com</p> <hr style='height=2px; color:#aaa'/>
<p>We hope you enjoy the app store experience!</p>
<p style='font-size:18px; color:#999'>Powered by App47</p>
<img src='https://cirrus.app47.com/notifications/562506219ac25b1033000904/img' alt=''/></body></html>
END_OF_HTML
Specifically, I want to get the values of some of the list members, like "Identifier:" and "User name:", and store them in strings.
I'm sure I need to use xpath but that's about it. My understanding is that xpath does node selection.
What do I need to specify with xpath and then how do I get the selection into some variables?
Full Solution
Ultimately I was really asking two questions.
Question 1 (implicit): How can I see the results of a search using xpath?
doc.xpath("SPECIFY_SEARCH_HERE").each do |node|
  puts node
end
This works because xpath returns a NodeSet, an array-like collection that you can iterate over and then do what you want with the results (in my case, print).
Question 2: How do I get the value of a particular list item?
str = doc.xpath("//ul/li[contains(b, 'Identifier')]/text()").to_s.strip
My analysis of this line is limited, but it looks like it does this:
1. Find the li elements with: //ul/li
2. Keep the li whose bolded key (b) contains 'Identifier'
3. Extract the value of the selection from #2: /text()
4. .to_s.strip converts the selection to a string and removes leading/trailing whitespace
For anyone better versed in HTML/Ruby/Xpath, feel free to update the explanation for precision.
This will return both values you asked for:
//ul/li[contains(b, 'Identifier') or contains(b, 'User name')]/text()
Of course, you can modify the XPath to get only one value at a time:
//ul/li[contains(b, 'Identifier')]/text()
With XPath I am trying to get the following values:
html:
<ul class="listVideoAttributes alpha only">
<li class="alpha only">
<span>Categories:</span>
<ul>
<li class="psi alpha">
Cinema
</li>
<li class="omega">
HD
</li>
</ul>
</li>
</ul>
Categories are not always named as categories, sometimes they call it Tags.
I would like an XPath that locates Categories and gets the category values, such as Cinema and HD.
For now, I'm using:
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]
and it returns values but also the text 'categories:'.
I would like to do something like:
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]/ul
But it seems not to work.
Your XPath expression did not work because the inner <ul/> is not a direct child of the outer <ul/>. Use the descendant-or-self axis step //ul instead of the child axis step /ul at the end of your expression. If you're sure the markup will not change, it is better to use only child axis steps: /li/ul/li/a.
Another problem is that the @class attribute does not equal listVideoAttributes; it only contains it. You should never compare HTML class attributes with equals; always use contains.
Anyway, I'd be as specific as possible while searching for the "headline", otherwise you could find false positives when the content of any "listVideoAttributes"-list contains one "Categories" or "Tags":
//ul[contains(@class, 'listVideoAttributes')]/li[contains(span, 'Categories') or contains(span, 'Tags')]//a
You might want to add a /text() if you cannot read the string value from the programming language you're using, which would usually be preferred (e.g. when a link contains bold text like <a href="..."><strong>foo</strong></a>, text() wouldn't return the string value).
You can try the below Xpath
//ul[contains(@class,'listVideoAttributes') and contains(.//span,'Categories')]//a/text()
output:
Cinema
HD
There are two problems with
//ul[@class="listVideoAttributes"][contains(., 'Categories:')]/ul
First, the outer ul's class is not equal to "listVideoAttributes"; it only contains that as a substring. Second, the inner ul is not a direct child of the outer one; it's a grandchild. How about:
//ul[contains(@class, 'listVideoAttributes')][contains(., 'Categories')]/li/ul/li/a
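The two fixes discussed in these answers (matching the class loosely rather than with equals, and walking li/ul/li/a from the outer list) can be illustrated with a short Python sketch using only the standard library. The sample below assumes each category value is wrapped in an <a> element, as the answers imply; the href values and the function name category_values are made up for the example. The class check here matches whole class tokens, which is somewhat stricter than XPath's contains().

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the <a> wrappers follow the answers, the hrefs are invented.
SAMPLE = """<ul class="listVideoAttributes alpha only">
<li class="alpha only">
<span>Categories:</span>
<ul>
<li class="psi alpha"><a href="#">Cinema</a></li>
<li class="omega"><a href="#">HD</a></li>
</ul>
</li>
</ul>"""

def category_values(xml_text, labels=("Categories", "Tags")):
    """Return the link texts listed under a 'Categories' or 'Tags' heading."""
    root = ET.fromstring(xml_text)
    values = []
    for ul in root.iter("ul"):
        # Match one class token instead of comparing the whole attribute string.
        if "listVideoAttributes" not in ul.get("class", "").split():
            continue
        for li in ul.findall("li"):
            span = li.find("span")
            heading = "".join(span.itertext()) if span is not None else ""
            if any(label in heading for label in labels):
                # Child steps only: ul/li/a below the matching outer <li>.
                values += ["".join(a.itertext()).strip() for a in li.findall("ul/li/a")]
    return values

print(category_values(SAMPLE))
```

Checking the heading on the span of each outer li, rather than on the text of the whole list, keeps the false positives mentioned above out of the result.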
I need a regular expression in Notepad++ to find a search query that is NOT inside an anchor (hyperlink) tag in an HTML file. It should find any given search query (a word or a couple of words, like "test" or "Ask a question") that is not linked, and ignore the occurrences that are linked.
Keep in mind that the query could be linked directly, that the tags might not be directly before and after the query, and that the link might span more than one line.
Example:
<p>any text here, something else..</p>
<p>more
test
to find through other test. With much
<a href="http://www.site.com/folder/filename45.html">
<font color="#800000">Ask a question</font></a> more test</p>
<p>and test to Ask a question here.</p>
There is no perfect solution with regular expressions. It would be better to do this with a programming language and a DOM parser.
Here is about the best you can get:
test(?!((?!<a\W).)*</a)
It uses two negative lookaheads to match test if there is no </a before the next opening <a. Make sure to check ". matches newline" and to update to Notepad++ 6.
This will start to fail if you have <a or </a in comments or within attribute strings, not to mention invalid HTML.
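The pattern can be tried outside Notepad++ as well. The following Python sketch applies the same expression to the example from the question, with re.DOTALL playing the role of the ". matches newline" option. The function name unlinked_matches is hypothetical, and the same caveats about comments, attribute strings and invalid HTML apply.

```python
import re

SAMPLE = """<p>any text here, something else..</p>
<p>more
test
to find through other test. With much
<a href="http://www.site.com/folder/filename45.html">
<font color="#800000">Ask a question</font></a> more test</p>
<p>and test to Ask a question here.</p>"""

def unlinked_matches(query, html):
    """Return the start offsets of query occurrences that are not inside <a>...</a>."""
    # Reject a match if a closing </a can be reached without first passing an opening <a.
    pattern = re.compile(re.escape(query) + r"(?!((?!<a\W).)*</a)", re.DOTALL)
    return [m.start() for m in pattern.finditer(html)]

print(len(unlinked_matches("test", SAMPLE)))
print(len(unlinked_matches("Ask a question", SAMPLE)))
```

On this sample, every occurrence of "test" is outside a link, while only the second "Ask a question" is unlinked.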