Select text adjacent to element using xpath - html

I'm a beginner at writing XPath expressions, and I'm having trouble capturing the text that comes right after a <b> tag (its sibling text node), as in:
<div id="product-desc" class="green-box">
<p class="ref">
<b class="">Mfr Part#:</b>
"STM6520AQRRDG9F"
<br class="">
<b class="">Mounting Method:</b>
"Surface Mount"
<br class="">
<b class="">Package Style:</b>
"TDFN-8"
<br class="">
<b class="">Packaging:</b>
"REEL"
<br class="">
</p>
</div>
In the HTML above, what XPath expression gets the text ("STM6520AQRRDG9F") next to the <b> element? I tried the following:
//*[@id="product-desc"]/p[2]/b[1]/following-sibling::text()
Can anyone suggest the correct XPath expression for getting this text?
Thanks in advance.

As hek2mgl has mentioned, the text you'd like to find is in the first p element of that div. Also, to avoid any surprising results, you should select only the first following text node that is a sibling.
One way to do it is
//*[@id="product-desc"]/p[1]/b[1]/following-sibling::text()[1]
and the result will be
[EMPTY LINE]
"STM6520AQRRDG9F"
[EMPTY LINE]
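For readers without an XPath engine that supports the following-sibling axis, here is a minimal sketch of the same idea using Python's standard-library ElementTree, where the text node right after an element is exposed as that element's tail. It assumes a well-formed copy of the question's markup (<br> written as <br/>) and assumes the surrounding quotes are part of the text; the helper name following_text is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Well-formed, shortened copy of the markup from the question (assumption:
# <br> is self-closed so that the XML parser accepts it).
html = """
<div id="product-desc" class="green-box">
<p class="ref">
<b class="">Mfr Part#:</b>
"STM6520AQRRDG9F"
<br class=""/>
<b class="">Mounting Method:</b>
"Surface Mount"
<br class=""/>
</p>
</div>
"""

root = ET.fromstring(html.strip())


def following_text(b):
    """Return the text directly after element b (hypothetical helper)."""
    # .tail is the text between this element's end tag and the next tag,
    # which plays the role of following-sibling::text()[1] here.
    return (b.tail or "").strip().strip('"')


# Text after the first <b> in the first <p>.
part_number = following_text(root.find("./p[1]/b[1]"))

# The same idea applied to every label in the paragraph.
specs = {b.text: following_text(b) for b in root.findall("./p[1]/b")}

print(part_number)
print(specs)
```

Stripping whitespace is what removes the empty lines that the raw text node carries around the value.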

Related

XPath for an element that follows some specific paragraph text nested in a div?

I'm trying to select the text "Part Sun, Sun" and "Herb", "Houseplant" from the html below.
The <div class="specifics"> has more of these "row" divs and the text I'm interested in always comes after certain paragraph tags containing specific text like "Light:", and "Type:" below.
Edit: To clarify out of all the "value" divs I'm only interested in ones that have specific "names". So I want to check the text of paragraphs nested inside <div class="name"> elements and if it's what I'm interested in then select the text inside the subsequent <div class="value"> element.
<div class="specifics">
<div class="row">
<div class="name">
<p>Light:</p>
</div>
<div class="value">
<p>Part Sun, Sun</p>
</div>
</div>
<div class="row">
<div class="name">
<p>Type:</p>
</div>
<div class="value">
<p>
Herb, Houseplant
</p>
</div>
</div>
...more rows...
</div>
I've tried this (using Scrapy):
trait = response.xpath("//div[@class='specifics']")
trait.xpath(".//div[@class='row']/div[@class='name']/p[text()='Light:']/../../div[@class='value']/p/text()[normalize-space()]")
The first line is ok but the second one is returning \n \n
Apologies for poor editing originally, below is what the paragraph element actually looks like.
Second Edit: There are a bunch of empty lines and when I select just /p without text() I still get back just a bunch of \n without any of the text? Tried normalize-space as above.
<p>
Part Sun,
Sun
</p>
To select the elements you need, you can do something like this:
/div[@class='specifics']/div[@class='row']/div[@class='value']/p
Adding /text() on the end will grab the Part Sun, Sun in your first row, but because your second row has additional nested elements in it, that text won't be picked up.
Instead you can use /string(), which also extracts text from children:
/div[@class='specifics']/div[@class='row']/div[@class='value']/p/string()
If you also need to strip out whitespace, you can use either normalize-space() or translate(input, charsToReplace, replacement):
/div[@class='specifics']/div[@class='row']/div[@class='value']/p/normalize-space(string())
Using this tool I get output of String='Part Sun, Sun' and String='Herb, Houseplant'.
/div[@class='specifics']/div[@class='row']/div[@class='value']/p/translate(string(), '&#10;', '')
where &#10; is the newline character; you could also add any other characters you need removed.
Note that a function call as the last step of a path, such as /string(), is XPath 2.0 syntax; an XPath 1.0 engine needs the function wrapped around the path instead.
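The question also asks for only the values whose name matches a given label, so here is a rough sketch of that filtering step. It uses Python's standard-library ElementTree rather than Scrapy, on a well-formed copy of the sample markup; normalize_space and values_by_name are hypothetical helper names, and normalize_space only approximates XPath's normalize-space().

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample rows, including the stray line breaks
# described in the second edit of the question.
html = """
<div class="specifics">
  <div class="row">
    <div class="name"><p>Light:</p></div>
    <div class="value"><p>
      Part Sun,
      Sun
    </p></div>
  </div>
  <div class="row">
    <div class="name"><p>Type:</p></div>
    <div class="value"><p>
      Herb, Houseplant
    </p></div>
  </div>
</div>
"""


def normalize_space(text):
    """Trim the ends and collapse inner whitespace runs to single spaces."""
    return " ".join(text.split())


def values_by_name(root, wanted):
    """Map each wanted label to the text of the value div in the same row."""
    result = {}
    for row in root.findall("./div[@class='row']"):
        name_div = row.find("./div[@class='name']")
        # itertext() gathers text from all descendants, like string() does.
        name = normalize_space("".join(name_div.itertext()))
        if name in wanted:
            value_div = row.find("./div[@class='value']")
            result[name] = normalize_space("".join(value_div.itertext()))
    return result


root = ET.fromstring(html.strip())
traits = values_by_name(root, {"Light:", "Type:"})
print(traits)
```

Comparing the normalized label, instead of testing text()='Light:' directly, keeps the match working when the paragraph contains extra line breaks.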

How do I find SPAN tag containing DIV tag with notepad++ regex for W3C Validation?

I'm trying to fix my HTML views for W3C validation. One error is that I had a few div or other structural tags inside a span tag. Here's a made-up example based on my HTML code:
<div style="margin-left:10px;">
<h2>Sub Title</h2>
<span><span class="bold_text">Phones : </span> 000-000-000000 / 000-000-000000 </span>
<br/>
<span><span class="bold_text">Email : </span>
<ul>
<li>For Support use <a href="mailto:support@email.com" >support@email.com</a></li>
<li>For CopyRights use <a href="mailto:copyright@email.com" >copyright@email.com</a></li>
<li>For Technical issue use <a href="mailto:staff@email.com" >staff@email.com</a></li>
</ul>
</span>
<span>
<span class="bold_text">Location : </span>
<div class="address_container">#0, City, Region, Country</div>
</span>
<div class="map_container" style="margin-top:10px;display:inline-block;width:90%;height:400px;" >
@yield('map_member')
</div>
I'm playing with regex101 and so far I got this :
<span[^>]*>[.\s\S]*<div[\s\S]*<\/div>[\s\S]*<\/span> /gm
It must match newlines and spaces. But this selects from the first span and finishes at the last closing span tag. I want it to match only:
<span>
<span class="bold_text">Location : </span>
<div class="address_container">#0, City, Region, Country</div>
</span>
So you want to replace the DIVs inside a SPAN, even when there are other SPANs nested inside that SPAN?
One can assume that if a block ends with a closing SPAN, it also started with a SPAN.
So this regex just uses a positive lookahead to check that the DIV is followed by zero or more complete DIV or SPAN elements and then a closing SPAN.
\s*<div[^<>]*>[^<>]*</div>(?=(?:\s*<(div|span)[^<>]*>[^<>]*</\1>)*[^<>]*</span>)
Replace with nothing and it'll be spick-and-span.
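To see what that replacement does outside Notepad++, here is a small sketch that runs the same pattern through Python's re module on a shortened copy of the example. The map container's contents are placeholder text, and the variable names are made up for illustration.

```python
import re

# Shortened copy of the example; the second div is NOT inside a span.
html = """<span>
<span class="bold_text">Location : </span>
<div class="address_container">#0, City, Region, Country</div>
</span>
<div class="map_container">
map goes here
</div>"""

# The pattern from the answer, unchanged: a div with no nested tags,
# followed (lookahead only) by optional complete div/span elements
# and then a closing </span>.
pattern = re.compile(
    r"\s*<div[^<>]*>[^<>]*</div>"
    r"(?=(?:\s*<(div|span)[^<>]*>[^<>]*</\1>)*[^<>]*</span>)"
)

# What would be matched, and the document after replacing with nothing.
matches = [m.group(0).strip() for m in pattern.finditer(html)]
cleaned = pattern.sub("", html)

print(matches)
print(cleaned)
```

Note that replacing with nothing removes the div together with its text, so the address itself disappears; to keep the text you would need a capture group around the div's contents and a replacement that reuses it.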

xPath: How to get 'title' text from table?

I am using xPath to try to get the title text from the following section of a table:
<td class="title" title="if you were in a job and then one day, the work..." data-id="3198695">
<span id="thread_3198695" class="titleline threadbit">
<span class="prefix">
</span>
<a id="thread_title_3198695" href="showthread.php?t=3198695">would this creep you out?</a>
<span class="thread-pagenav">(Pgs:
<span>1</span> <span>2</span> <span>3</span> <span>4</span>)</span>
</span>
<span class="byline">
by
<a href="member.php?u=1687137" data-id="3198695" class="username">
damoni
</a>
</span>
</td>
The output I want is: "if you were in a job and then one day, the work..."
I have been trying various expressions in Scrapy (python) to try and get the title. It outputs a weird text such as: '\n\n \r \r \n \n\n\r'
response.xpath("//tr[3]/td[@class='title']/text()")
I know that the following part is correct, at least (I verified that it locates the correct table element using Chrome's developer tools):
//tr[3]/td
# (This is the above snippet)
Any idea as to how I can extract the title?
You want:
response.xpath("//tr[3]/td[@class='title']/@title")
Note that text() selects the text content of a node, whereas @attribute selects the value of an attribute. Since the desired text is stored in the title attribute, you need to use @title.
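The difference can be sketched without Scrapy using Python's standard-library ElementTree on a trimmed, well-formed copy of the cell. The wrapping table and tr are added only so that the snippet parses, and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cell from the question, wrapped so it is well-formed.
html = """
<table>
<tr>
<td class="title" title="if you were in a job and then one day, the work..." data-id="3198695">
<span id="thread_3198695" class="titleline threadbit">
<a id="thread_title_3198695" href="showthread.php?t=3198695">would this creep you out?</a>
</span>
</td>
</tr>
</table>
"""

root = ET.fromstring(html.strip())
cell = root.find(".//td[@class='title']")

# Attribute value: the counterpart of .../@title.
title_attr = cell.get("title")

# Direct text children of the <td>: the counterpart of .../text().
# These are only the line breaks between the child elements.
text_children = [cell.text] + [child.tail for child in cell]

print(repr(title_attr))
print(text_children)
```

The direct text children of the cell are only whitespace, which is why text() returned the odd string of \n and \r characters.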

<div> tags inside <div> using importXML Xpath query, in Google Spreadsheet

I'm using Xpath in Google docs to get the text inside <div>.
I want to save the text inside <div id="job_description"> in one cell of Google doc spreadsheet, but it shows each <div> in separate cell.
<div id="job_description">
<div>
<strong>
Basic Purpose:
</strong>
<br></br>
</div>
<div>
Work closely with developers, product owners and Q…
<br></br>
</div>
<div>
The Test Analyst is accountable for the developmen…
<br></br>
</div>
<div>
<strong>
Duties and Responsibilities:
</strong>
</div>
<ul>
<li></li>
<li></li>
</ul>
<div>
<strong>
Requirements:
</strong>
<br></br>
</div>
<ul>
<li></li>
<li></li>
</ul>
</div>
Image:
http://i.stack.imgur.com/K0mAY.png
and this is the code I wrote:
=IMPORTXML(E4,"//div[@id='job_description']")
Could you help me put all of the text (including the nested <div>, <ul>, ...) inside <div id="job_description"> into a single cell?
Using JOIN is a good start, but you can make it a single operation.
You did not show the URL of the page you're importing, so I can only give you an example with another page. For instance, if you are importing www.w3.org and looking for a div where @class='event closed expand_block', use
=JOIN(CHAR(10),IMPORTXML("http://www.w3.org/","//div[@class='event closed expand_block']//text()"))
Notice that I also modified the XPath expression: //text() makes sure only descendant text nodes are retrieved, that is, all the text.
EDIT: Responding to your comment:
May I know what is CHAR(10) referring to?
Yes, of course. CHAR returns a character and takes a number as input. In the case of CHAR(10), a newline character is returned (character code 10 is the line feed).
In the formula, CHAR(10) is used as the first argument of JOIN, which is the delimiter of the objects that are to be joined.
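Outside of Sheets, the same two steps (collect every descendant text node, then join the pieces with character 10) look roughly like this in Python. The markup is a shortened stand-in with placeholder sentences, not the real job page, and dropping whitespace-only pieces is an extra step the formula does not perform.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the job description block (placeholder content).
html = """
<div id="job_description">
<div><strong>Basic Purpose:</strong></div>
<div>Work closely with developers</div>
<ul><li>First duty</li><li>Second duty</li></ul>
</div>
"""

root = ET.fromstring(html.strip())

# Counterpart of //div[@id='job_description']//text():
# every piece of text below the element, in document order.
pieces = [t.strip() for t in root.itertext() if t.strip()]

# Counterpart of JOIN(CHAR(10), ...): chr(10) is the line feed character.
cell_text = chr(10).join(pieces)

print(cell_text)
```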
For now I have found a solution. I'll put it here so that others can see it, but if there is any other solution, please let us know.
I used JOIN to put the separate cells (L3:X3) into one single cell:
=TRIM(JOIN(" ",L3:X3))
You can also use REGEXREPLACE to remove the line breaks:
=REGEXREPLACE(IMPORTXML(E4,"//div[@id='job_description']"),"\n","")
This should wrap it all into one cell for you.

Fetching text with xpath in dynamic html structure

I have a lot of html and want to process it via xpath. There are two possible ways text can occur:
<div>
The Text
</div>
<!-- OR -->
<div>
<span>The Text</span>
</div>
<!-- BUT NOT -->
<div> other text
<span>The Text</span>
</div> other text
Is there a way I can fetch "The Text" with a single xpath expression?
edit:
concrete structure:
<div id="content">
<h1>...</h1>
<div>
...
</div>
<div>
<span>The Text</span>
</div>
I'm getting the content node via //div[@id='content'][1] and reuse it for other purposes. On this context node, I tried to execute ./div[2]/span/text() | ./div[not(span)][2]/text(). It works if there is no span, but returns blank/null if there is a span. I'm using the Java XPath implementation. The div is always the second one of the content node.
div/span/text() | div[not(span)]/text()
should do the trick. This selects text nodes that are children of the <span> (if there is a <span>), as well as text nodes that are children of the <div> if there is no <span>.
You'll have to modify the div parts to reflect the context from which you're evaluating the XPath expression. If you want to do this with all <div> elements in the document, then change div to //div.
Update:
Based on the new context information you posted, the above XPath should be modified to:
./div[2]/span/text() | ./div[2][not(span)]/text()
However I don't see why your version is returning no text when there is a <span> element. Can you give more context -- your java code that's evaluating the XPath; maybe a more detailed snippet of your input HTML? Is the sample input HTML really exactly representative of your actual input? Could there be another </div> in there that's going unnoticed?
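For comparison, the logic that the union expresses (take the span's text when a span is present, otherwise the div's own text) can be sketched in a few lines of Python with the standard-library ElementTree. This is an illustration of the selection rule only, not the Java XPath API from the question, and the function name the_text is made up.

```python
import xml.etree.ElementTree as ET


def the_text(div):
    """Return the span's text if the div has a span child, else the div's text."""
    span = div.find("span")
    if span is not None:
        # Counterpart of div/span/text()
        return (span.text or "").strip()
    # Counterpart of div[not(span)]/text()
    return (div.text or "").strip()


# The two allowed shapes from the question.
plain = ET.fromstring("<div>\nThe Text\n</div>")
wrapped = ET.fromstring("<div>\n<span>The Text</span>\n</div>")

print(the_text(plain))
print(the_text(wrapped))
```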