Fetching text with XPath in a dynamic HTML structure

I have a lot of HTML and want to process it with XPath. The text can occur in one of two ways:
<div>
The Text
</div>
<!-- OR -->
<div>
<span>The Text</span>
</div>
<!-- BUT NOT -->
<div> other text
<span>The Text</span>
</div> other text
Is there a way I can fetch "The Text" with a single xpath expression?
edit:
concrete structure:
<div id="content">
<h1>...</h1>
<div>
...
</div>
<div>
<span>The Text</span>
</div>
I'm getting the content node via //div[@id='content'][1] and reuse it for other purposes. On this context node, I tried to execute ./div[2]/span/text() | ./div[not(span)][2]/text(). It works if there is no span, but returns blank/null if there is a span. I'm using the Java XPath implementation. The div is always the second one of the content node.

div/span/text() | div[not(span)]/text()
should do the trick. This selects text nodes that are children of the <span> (if there is a <span>), as well as text nodes that are children of the <div> if there is no <span>.
You'll have to modify the div parts to reflect the context from which you're evaluating the XPath expression. If you want to do this with all <div> elements in the document, then change div to //div.
Update:
Based on the new context information you posted, the above XPath should be modified to:
./div[2]/span/text() | ./div[2][not(span)]/text()
However, I don't see why your version returns no text when there is a <span> element. Can you give more context: the Java code that evaluates the XPath, and maybe a more detailed snippet of your input HTML? Is the sample input HTML really exactly representative of your actual input? Could there be another </div> in there that's going unnoticed?
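For illustration only, the same fallback can be sketched in Python with the standard-library ElementTree module rather than the asker's Java code. ElementTree implements only a subset of XPath (no text() node test and no | union), so the choice between the <span> and the bare <div> is made in Python here. The helper name second_div_text and the two sample documents are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

def second_div_text(content):
    """Return the text of the second child <div>, with or without a wrapping <span>."""
    div = content.find("./div[2]")
    if div is None:
        return None
    span = div.find("span")
    # Prefer the <span> text when a <span> exists, otherwise use the <div>'s own text.
    text = span.text if span is not None else div.text
    return text.strip() if text else None

# Hypothetical inputs modelled on the structure in the question.
with_span = ET.fromstring(
    '<div id="content"><h1>t</h1><div>x</div><div><span>The Text</span></div></div>'
)
without_span = ET.fromstring(
    '<div id="content"><h1>t</h1><div>x</div><div>\nThe Text\n</div></div>'
)

print(second_div_text(with_span))
print(second_div_text(without_span))
```

Both calls should yield the same string, which is the behaviour the union expression is meant to provide in a full XPath 1.0 engine.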

Related

XPath for parent's sibling descendants

I have the following HTML I need to scrape, but the only reliable handle is a stable description of a text field. From there, I need to go to its parent, find that parent's next sibling, and then get the descendants (unfortunately the data-automation-id selector repeats in every such iteration of this snippet on the site). I put together the XPath below, but my RPA tool is unable to find it in the document.
XPath
div[contains(text(),'STABLE TEXT HANDLE')]/following-sibling::div/div/div/span[data-automation-id="SOMETHING"]
HTML:
<ul>
<li>
<div>
<label>STABLE TEXT HANDLE</label>
</div>
<div>
<div>
<div>
<span></span>
<span data-automation-id="something">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
<span data-automation-id="somethingelse">
<div>
<div>
<div>
DYNAMIC TEXT I WANT TO SCRAPE
</div>
</div>
</div>
</span>
</div>
</div>
</div>
</li>
</ul>
EDIT:
After further testing, it seems the issue starts with the contains(text(),'STABLE TEXT HANDLE'), which fails to find that particular node (be it the label, or its parent div).
Please try this:
//label[contains(text(),'STABLE TEXT HANDLE')]/../..//span[@data-automation-id="something"]
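As a rough check of the navigation idea (start at the label, climb two levels to the <li>, then search downward for the span), here is a sketch using Python's standard-library ElementTree. It is not the asker's RPA tool. ElementTree has no contains(), so the sketch uses an exact-text predicate, which needs Python 3.7 or later. The trimmed-down sample markup is an assumption based on the HTML in the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<ul>
  <li>
    <div><label>STABLE TEXT HANDLE</label></div>
    <div><div><div>
      <span></span>
      <span data-automation-id="something"><div><div><div>
        DYNAMIC TEXT I WANT TO SCRAPE
      </div></div></div></span>
    </div></div></div>
  </li>
</ul>"""

root = ET.fromstring(SAMPLE)

# label -> parent <div> -> grandparent <li>, then any matching <span> below the <li>.
path = ".//label[.='STABLE TEXT HANDLE']/../..//span[@data-automation-id='something']"
span = root.find(path)

# Join all nested text and collapse the surrounding whitespace.
scraped = " ".join("".join(span.itertext()).split())
print(scraped)
```

Anchoring on the label rather than on its parent div matters because the stable text is a child of the label, not a direct text node of the div.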

XPath for an element that follows some specific paragraph text nested in a div?

I'm trying to select the text "Part Sun, Sun" and "Herb", "Houseplant" from the html below.
The <div class="specifics"> has more of these "row" divs and the text I'm interested in always comes after certain paragraph tags containing specific text like "Light:", and "Type:" below.
Edit: To clarify out of all the "value" divs I'm only interested in ones that have specific "names". So I want to check the text of paragraphs nested inside <div class="name"> elements and if it's what I'm interested in then select the text inside the subsequent <div class="value"> element.
<div class="specifics">
<div class="row">
<div class="name">
<p>Light:</p>
</div>
<div class="value">
<p>Part Sun, Sun</p>
</div>
</div>
<div class="row">
<div class="name">
<p>Type:</p>
</div>
<div class="value">
<p>
Herb, Houseplant
</p>
</div>
</div>
...more rows...
</div>
I've tried this (using Scrapy):
trait = response.xpath("//div[@class='specifics']")
trait.xpath(".//div[@class='row']/div[@class='name']/p[text()='Light:']/../../div[@class='value']/p/text()[normalize-space()]")
The first line is ok but the second one is returning \n \n
Apologies for poor editing originally, below is what the paragraph element actually looks like.
Second Edit: There are a bunch of empty lines and when I select just /p without text() I still get back just a bunch of \n without any of the text? Tried normalize-space as above.
<p>
Part Sun,
Sun
</p>
To select the elements you need, you can do something like this:
/div[@class='specifics']/div[@class='row']/div[@class='value']/p
Adding /text() on the end will grab the Part Sun, Sun in your first row, but because your second row has additional nested elements in it, that text won't be picked up.
Instead you can use /string(), which will also extract text from children: /div[@class='specifics']/div[@class='row']/div[@class='value']/p/string()
If you also need to strip out whitespace, you can use either normalize-space() or translate(input, charsToReplace, replacement).
/div[@class='specifics']/div[@class='row']/div[@class='value']/p/normalize-space(string()). Using this tool I get output of String='Part Sun, Sun' and String='Herb, Houseplant'
/div[@class='specifics']/div[@class='row']/div[@class='value']/p/translate(string(), '&#10;', '') where &#10; is the newline character, but you could also add other characters you need removed.
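A function call used as a path step, such as /string(), is XPath 2.0 syntax, so an XPath 1.0 engine may reject it. As an engine-independent sketch of the same idea (pick the rows whose name matches, then collapse whitespace in the value), here is a version using Python's standard-library ElementTree. The helper names normalize_space and read_traits are assumptions for this example, and normalize_space only approximates the XPath function of the same name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div class="specifics">
  <div class="row">
    <div class="name"><p>Light:</p></div>
    <div class="value"><p>
      Part Sun,
      Sun
    </p></div>
  </div>
  <div class="row">
    <div class="name"><p>Type:</p></div>
    <div class="value"><p>
      Herb, Houseplant
    </p></div>
  </div>
</div>"""

def normalize_space(text):
    """Approximate XPath normalize-space(): trim and collapse runs of whitespace."""
    return " ".join(text.split())

def read_traits(specifics, wanted):
    """Map each wanted name (e.g. 'Light:') to the cleaned text of its value cell."""
    traits = {}
    for row in specifics.findall("./div[@class='row']"):
        name_p = row.find("./div[@class='name']/p")
        value_p = row.find("./div[@class='value']/p")
        if name_p is None or value_p is None:
            continue
        name = normalize_space("".join(name_p.itertext()))
        if name in wanted:
            # itertext() also picks up text inside nested elements.
            traits[name] = normalize_space("".join(value_p.itertext()))
    return traits

traits = read_traits(ET.fromstring(SAMPLE), {"Light:", "Type:"})
print(traits)
```

Filtering on the name cell first addresses the clarification in the question, where only values with specific names are wanted.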

Match page source tags with regex

I am trying to catch a tag from a page source with regex.
After a lot of trying, I am finding it very hard to do.
Here is an example of an HTML source:
<div class="searchBx">
<div>
<li>somthing</li>
</div>
</div>
<div>
<li>somthing2</li>
</div>
I am trying to catch only the (div class="searchBx") tag and the tags inside it.
It is hard because the match always catches the div tag after it as well.
The result should be:
<div class="searchBx">
<div>
<li>somthing</li>
</div>
</div>
Thanks in advance.
A regular expression cannot reliably match the div you describe.
Because the div contains another div, the pattern cannot tell the inner </div> apart from the </div> that closes the div you want to match.
<div class="searchBx">
<div>
<li>somthing</li>
</div> <!-- This -->
</div> <!-- and this are the same to regex -->
<div>
<li>somthing2</li>
</div>
Here's what happens: http://regexr.com/3d0jn
For what you need to do, use a DOM parser in whichever language you are working in.
Besides, using regex to parse HTML is poor practice, although many people do it anyway.
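To make the parser suggestion concrete, here is a minimal sketch in Python (the question does not say which language is in use, so this choice is an assumption). It uses the standard-library html.parser and counts <div> nesting depth, which is the bookkeeping a plain regex cannot do. The class name DivExtractor is hypothetical, and the sketch rebuilds the markup from parser events instead of building a full DOM.

```python
from html.parser import HTMLParser

class DivExtractor(HTMLParser):
    """Collect the markup of the first <div> with a given class, nested divs included."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # <div> nesting depth inside the target; 0 means outside
        self.parts = []
        self.done = False

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if tag == "div" and self.target_class in classes:
                self.depth = 1
                self.parts.append(self.get_starttag_text())
            return
        if tag == "div":
            self.depth += 1
        self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> never change the depth.
        if self.depth and not self.done:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        self.parts.append(f"</{tag}>")
        if tag == "div":
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # the matching closing tag has been reached

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)

SOURCE = """<div class="searchBx">
<div>
<li>somthing</li>
</div>
</div>
<div>
<li>somthing2</li>
</div>"""

parser = DivExtractor("searchBx")
parser.feed(SOURCE)
parser.close()
result = "".join(parser.parts)
print(result)
```

The depth counter is what lets the extractor skip the inner </div> and stop only at the one that closes the searchBx div.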

<div> tags inside <div> using importXML Xpath query, in Google Spreadsheet

I'm using XPath in Google Docs to get the text inside a <div>.
I want to save the text inside <div id="job_description"> in one cell of a Google Docs spreadsheet, but it shows each <div> in a separate cell.
<div id="job_description">
<div>
<strong>
Basic Purpose:
</strong>
<br></br>
</div>
<div>
Work closely with developers, product owners and Q…
<br></br>
</div>
<div>
The Test Analyst is accountable for the developmen…
<br></br>
</div>
<div>
<strong>
Duties and Responsibilities:
</strong>
</div>
<ul>
<li></li>
<li></li>
</ul>
<div>
<strong>
Requirements:
</strong>
<br></br>
</div>
<ul>
<li></li>
<li></li>
</ul>
</div>
Image:
http://i.stack.imgur.com/K0mAY.png
and this is the code I wrote:
=IMPORTXML(E4,"//div[@id='job_description']")
Could you help me put all of the text (including <div>, <ul>, ...) inside the <div id="job_description"> into a single cell?
Using JOIN is a good start, but you can make it a single operation.
You did not show the URL of the page you're importing, so I can only give you an example with another page. For instance, if you are importing www.w3.org and looking for a div where @class='event closed expand_block', use
=JOIN(CHAR(10),IMPORTXML("http://www.w3.org/","//div[@class='event closed expand_block']//text()"))
Notice that I also modified the XPath expression: //text() makes sure only descendant text nodes are retrieved, that is, all the text.
EDIT: Responding to your comment:
May I know what is CHAR(10) referring to?
Yes, of course. CHAR returns a character and takes a number as input. In the case of CHAR(10), a newline character is returned (I assume because of &#10;).
In the formula, CHAR(10) is used as the first argument of JOIN, which is the delimiter of the objects that are to be joined.
For now I have found a solution. I'll put it here so that others can see my answer, but if there is any other solution, please let us know.
I used JOIN to combine the separate cells (L3:X3) into a single cell:
=Trim(JOIN(" ",L3:X3))
You can also use REGEXREPLACE to remove the line breaks, with
=REGEXREPLACE(IMPORTXML(E4,"//div[@id='job_description']"),"\n","")
This should wrap it all into one cell for you.

XPath searching multiple nested elements

I have the following html document
<div class="books">
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<div>
<div>
<span>mybooktext</span>
</div>
</div>
</div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
</div>
<div class="book">
<div>
there are many deep nested elements here, somewhere there will be one span with some text e.g. 'mybooktext' within these
<div>
<span>mybooktext</span>
</div>
<div>
</div>
<div>
there are also many nested elements here, somewhere there will be a link with a class called 'mylinkclass' within these. (this is the element i want to find)
<div>
<a class="mylinkclass">Bla</a>
</div>
</div>
</div>
<div class="book">
same as above
</div>
</div>
I want to find the link element (link has class called 'mylinkclass') within the book element, this will be based on the text of the span within the same book element.
So it would be something like:
-Find span with text 'mybooktext'
-Navigate up Book div
-Find link with class 'mylinkclass' within book div
This should be done using one xpath statement
In my view, this is what you are looking for:
//span[contains(text(),'mybooktext')]
/ancestor::div[@class='book']
//a[@class='mylinkclass']
//span[contains(text(),'mybooktext')] Find the span containing "mybooktext"
/ancestor::div[@class='book'] Navigate up to the book div (at any depth)
//a[@class='mylinkclass'] Find the link with class 'mylinkclass' within the book div (at any depth)
Update:
Change the first condition to
//span[text()='mybooktext'] if mybooktext is the only text in the span
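The three steps (find the span, climb to its book div, find the link inside that div) can also be sketched with Python's standard-library ElementTree. It does not support the ancestor:: axis, so the sketch walks the book divs from the top down and tests each one instead. The function name find_book_link, the shortened sample markup, the second book's text and the href values are all assumptions added so the result is easy to tell apart.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<div class="books">
  <div class="book">
    <div><div><div><span>mybooktext</span></div></div></div>
    <div><div><a class="mylinkclass" href="#first">Bla</a></div></div>
  </div>
  <div class="book">
    <div><span>otherbooktext</span></div>
    <div><a class="mylinkclass" href="#second">Bla</a></div>
  </div>
</div>"""

def find_book_link(root, book_text):
    """Return the 'mylinkclass' link of the first book whose span text equals book_text."""
    for book in root.iterfind(".//div[@class='book']"):
        # Look for the span at any depth inside this book.
        has_text = any(
            (span.text or "").strip() == book_text for span in book.iter("span")
        )
        if has_text:
            return book.find(".//a[@class='mylinkclass']")
    return None

link = find_book_link(ET.fromstring(SAMPLE), "otherbooktext")
print(link.get("href"))
```

In an engine with full XPath 1.0 support, the single expression from the answer does the same job without the Python loop.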