xPath: How to get 'title' text from table? - html

I am using xPath to try to get the title text from the following section of a table:
<td class="title" title="if you were in a job and then one day, the work..." data-id="3198695">
<span id="thread_3198695" class="titleline threadbit">
<span class="prefix">
</span>
<a id="thread_title_3198695" href="showthread.php?t=3198695">would this creep you out?</a>
<span class="thread-pagenav">(Pgs:
<span>1</span> <span>2</span> <span>3</span> <span>4</span>)</span>
</span>
<span class="byline">
by
<a href="member.php?u=1687137" data-id="3198695" class="username">
damoni
</a>
</span>
</td>
The output I want is: "if you were in a job and then one day, the work..."
I have been trying various expressions in Scrapy (Python) to get the title. They return only whitespace, such as: '\n\n \r \r \n \n\n\r'
response.xpath("//tr[3]/td[@class='title']/text()")
I know that the following part is correct, at least (I verified that it locates the correct table element using Chrome's developer tools):
//tr[3]/td
# (This is the above snippet)
Any idea as to how I can extract the title?

You want:
response.xpath("//tr[3]/td[@class='title']/@title")
Note that text() selects the text nodes that are children of an element, whereas @attribute selects the value of an attribute. Since the desired text is stored in the title attribute, you need to use @title.
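The difference between the two expressions can be checked outside Scrapy. The following is a minimal sketch using lxml (an assumption; Scrapy selectors accept the same XPath 1.0 expressions). The surrounding <table> and <tr> are hypothetical wrappers added so that the trimmed fragment parses on its own.

```python
from lxml import html

# Trimmed copy of the cell from the question. The <table>/<tr> wrapper is
# hypothetical and only there so the fragment parses on its own.
SNIPPET = """
<table><tr>
  <td class="title" title="if you were in a job and then one day, the work..." data-id="3198695">
    <span class="titleline"><a href="showthread.php?t=3198695">would this creep you out?</a></span>
  </td>
</tr></table>
"""

tree = html.fromstring(SNIPPET)

# text() returns only the text nodes that are direct children of <td>.
# Here those are just the whitespace between the tags.
text_nodes = tree.xpath("//td[@class='title']/text()")

# @title returns the value of the title attribute.
titles = tree.xpath("//td[@class='title']/@title")

print(text_nodes)
print(titles)
```

In Scrapy itself, the same expression would typically be followed by .get() or .getall() on the selector to obtain the string value.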

Related

Xpath - select first occurence of node with specific type

I am trying to select all of the first occurrences of a specific type in the following structure:
<div class="jobs-list">
<div class="job-listing">
<h3>Title1</h3>
<span class="organization">
Org1
</span>
<span class="location">Loc1</span>
<div class="description">
desc1
https://www.domain1-1.org/
<span class="list-date">Posted on: 01/19/2022</span>
</div>
</div>
<div class="job-listing">
<h3>Title2</h3>
<span class="organization">
Org2
</span>
<span class="location">Loc2</span>
<div class="description">
desc2
https://www.domain2.org/
<span class="list-date">Posted on: 01/18/2022</span>
</div>
</div>
<div class="job-listing">
<h3>Title3</h3>
<span class="organization">
Org3
</span>
<span class="location">Loc3</span>
<div class="description">
desc3
user@domain3.org
<span class="list-date">Posted on: 01/19/2022</span>
</div>
</div>
<div class="job-listing">
<h3>TItle4</h3>
<span class="organization">Org4</span>
<span class="location">Loc4</span>
<div class="description">
desc4
user@domain4.org
https://www.domain4.org/
https://www.domain4-1.org/
<span class="list-date">Posted on: 01/06/2022</span>
</div>
</div>
</div>
Specifically, I need the result to be the following:
https://www.domain1.org/
https://www.domain2.org/
https://www.domain3.org/
https://www.domain4.org/
This should be the first a/@href under each div[@class='job-listing'], but I'm not sure how to express that. Some things to note:
The <a> is always two nodes under the root (job-listing)
The first <a> isn't always correct (I'm only looking for http), but I can filter those out easily enough; I'm stuck on how to select the node, not on filtering the content.
I need the value of a/@href, not the contents of <a>.
Thanks!
//div[@class='job-listing']/descendant::a[1] gives you the first a descendant of each of those divs. If you want to add the check, use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1].
If you need the href attribute node, use //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href. Note that some default serializations for XSLT or XQuery don't allow you to serialize a sequence of standalone attribute nodes, but in XPath 2 or 3 you can use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href/string() to get a sequence of attribute values instead.
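As a rough check of the descendant::a[...][1] form, here is a small sketch using lxml (an assumption; any XPath 1.0 engine should behave the same). The <a> elements and their link text are hypothetical, since the markup in the question shows the links only as plain text.

```python
from lxml import html

# Hypothetical reconstruction of two listings. The <a> tags and their
# link text are assumptions, not taken from the question.
SNIPPET = """
<div class="jobs-list">
  <div class="job-listing">
    <h3>Title1</h3>
    <div class="description">
      <a href="https://www.domain1.org/">site</a>
      <a href="https://www.domain1-1.org/">other</a>
    </div>
  </div>
  <div class="job-listing">
    <h3>Title3</h3>
    <div class="description">
      <a href="mailto:user@domain3.org">mail</a>
      <a href="https://www.domain3.org/">site</a>
    </div>
  </div>
</div>
"""

tree = html.fromstring(SNIPPET)

# For each job-listing div: walk the descendant axis, keep only <a>
# elements whose href starts with "http", then take the first survivor.
# The position test [1] is applied per div, not across the whole document.
first_links = tree.xpath(
    "//div[@class='job-listing']"
    "/descendant::a[starts-with(@href, 'http')][1]/@href"
)

print(first_links)
# → ['https://www.domain1.org/', 'https://www.domain3.org/']
```

The order of the two predicates matters: filtering on the href first and taking [1] afterwards skips a leading mailto link, whereas [1] followed by the filter would drop that listing entirely.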
I'd suggest a more class-based selector:
//span[@class="organization"]//a/@href
|
//div[@class="description"][not(preceding-sibling::span/a)]
//a[contains(@href,"http")][1]/@href
This selects the links under organization (A), plus the first http link under any description whose listing has no link matching A.

How do you use xpath to find an element with two specific descendants?

I have an unordered list of list items containing elements for labels and values that are dynamically generated. I am trying to validate that the list contains a specific label with a specific value.
I am attempting to write an XPath that will allow me to find the parent element that contains the defined label and value with Protractor's element(by.xpath). Given a list, I need to be able to find any single li by the combination of two descendants with specific attributes. For example, an li element that contains any descendant with class=label and text=Color AND any descendant with text=Blue.
<ul>
<li>
<span class='label'> Car </span>
<p> Ford </p>
</li>
<li>
<span class='label'> Color </span>
<p> <span>My favorite color is</span> : <webl>Blue</webl></p>
</li>
<li>
<span class='label'> Name </span>
<p> Meri </p>
</li>
<li>
<span class='label'> Pet </span>
<p> Cats <span>make the best pets</span> </p>
</li>
I have tried several variations on the following pattern:
//li[.//*[@class="label" | contains(text(), 'Color')] | .//*[contains(text(), 'Blue')]
This is the closest I think I have come and it's coming back as not a valid xpath. I've been looking at references, cheatsheets, and SO questions for several hours now and I am no closer to understanding what I am doing wrong. Eventually I will need to replace the text with variables, but right now I just need to get my head around this.
a list item that contains, at any depth,
any tag with a class of 'label' and text of x
AND
any tag with text y
Can anyone tell me what I am doing wrong? Am I just making it too complex?
The reason you are getting an invalid XPath is:
The |, or union, operator returns the union of its two operands,
which must be node-sets.
However, you have used it inside a predicate on operands that are not node-sets (a comparison and a contains() call), where and is the operator you need. The outer predicate on li is also never closed. To meet your requirement, the XPath below will work:
//*[@class="label" and contains(text(),'Color')]//ancestor::li//*[contains(text(), 'Blue')]
Given the HTML you have shared, to locate the <li> element that contains a descendant with class='label' and text=Color AND any descendant with text=Blue, you can use the following XPath-based locator strategy:
//li[./span[@class='label' and contains(., 'Color')]][.//webl[contains(., 'Blue')]]
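Since the label and value will eventually come from variables, the two-predicate pattern can be sketched as follows. This uses Python with lxml rather than Protractor (an assumption made only to keep the example self-contained), and the helper name find_items is hypothetical. The $label and $value placeholders rely on lxml's support for XPath variables passed as keyword arguments.

```python
from lxml import html

# Shortened copy of the list from the question.
SNIPPET = """
<ul>
  <li><span class='label'> Car </span><p> Ford </p></li>
  <li><span class='label'> Color </span>
      <p> <span>My favorite color is</span> : <webl>Blue</webl></p></li>
  <li><span class='label'> Name </span><p> Meri </p></li>
</ul>
"""

tree = html.fromstring(SNIPPET)


def find_items(root, label, value):
    """Return every <li> that has both a matching label and a matching value.

    Two predicates chained on the same step act as a logical AND:
    the first requires a descendant with class 'label' containing `label`,
    the second requires any descendant whose text contains `value`.
    """
    return root.xpath(
        "//li[.//*[@class='label' and contains(., $label)]]"
        "[.//*[contains(text(), $value)]]",
        label=label,
        value=value,
    )


matches = find_items(tree, "Color", "Blue")
print(len(matches))                                  # → 1
print(len(find_items(tree, "Color", "Ford")))        # → 0
```

Passing the strings as XPath variables avoids building the expression by string concatenation, which would break on values containing quotes.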

Select optional nodes with XPath

I have an HTML fragment:
<td>
<span class="x-cell">something</span>
<span class="y-cell">something</span>
<span class="z-cell">something</span>
A text
<span class="foo"/>
Another text
<span class="bar"/>
Also text
</td>
I am trying to select all nodes following the <span class="z-cell"/> to move them into another node. But all the nodes within td are optional: I can have zero to three <span class="*-cell"/>, the text is optional, and there could be further <span> nodes at the beginning, middle or end of the text, or none at all.
In short, I have to move all nodes except the <span class="*-cell"/> into another node. I tried this XPath to select the nodes:
td/span[contains(@class,"-cell")][last()]/following-sibling::*
but it doesn't work if there aren't any <span class="*-cell"/> nodes. How could I solve that?
Have your xpath expression exclude all elements you do not want:
td/(*[not(contains(@class,"-cell"))]|text())
If you only want to copy elements without the intervening text, this simplifies to
td/*[not(contains(@class,"-cell"))]
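The exclusion approach can be tried with lxml (an assumption; the question does not say which processor is used). lxml implements XPath 1.0, where a parenthesized step such as td/(...) is not allowed, so this sketch writes the union with td as the context node instead. The [normalize-space()] filter on text nodes and the helper names movable_nodes and summarize are additions for illustration.

```python
from lxml import etree

# Shortened copy of the fragment from the question, parsed as XML because
# it uses self-closing <span/> elements.
WITH_CELLS = """
<td>
<span class="x-cell">something</span>
<span class="z-cell">something</span>
A text
<span class="foo"/>
Another text
</td>
""".strip()

# Hypothetical variant with no *-cell spans at all.
WITHOUT_CELLS = '<td>Just text<span class="bar"/></td>'


def movable_nodes(td):
    """Select every child of td except the *-cell spans.

    Elements are kept unless their class contains '-cell'; text nodes are
    kept unless they are whitespace only (that filter is an addition here).
    The union is returned in document order.
    """
    return td.xpath(
        "*[not(contains(@class, '-cell'))] | text()[normalize-space()]"
    )


def summarize(nodes):
    # Text nodes come back as str subclasses, elements as lxml elements.
    return [n.strip() if isinstance(n, str) else n.get("class") for n in nodes]


print(summarize(movable_nodes(etree.fromstring(WITH_CELLS))))
print(summarize(movable_nodes(etree.fromstring(WITHOUT_CELLS))))
```

Because nothing in the expression depends on a *-cell span being present, the second call still returns the text and the remaining span.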

html and combining span ID's into one span ID

I'm working on an eBook which requires me to create an overlay. All is working fine except in some cases I have a drop cap combined with the rest of the word which need to be highlighted at the same time.
The code below is my current problem. I need to have the two span IDs combined into one without destroying the HTML.
Any ideas?
<p class="ParaOverride-1"><span id="_idTextSpan017" class="DropCap-color CharOverride-6" style="position:absolute;top:-109.78px;left:26.39px;">W</span><span id="_idTextSpan018" class="PageText-v1 CharOverride-7" style="position:absolute;top:0px;left:1626.19px;letter-spacing:-2.6px;">hat </span>
You need a nested <span>:
<span id="myID">
<span id="x">
</span>
<span id="y">
</span>
</span>

Select text adjacent to element using xpath

I'm a beginner at writing XPath expressions, and I'm having trouble capturing the text that follows a <b> tag as its sibling, as in:
<div id="product-desc" class="green-box">
<p class="ref">
<b class="">Mfr Part#:</b>
"STM6520AQRRDG9F"
<br class="">
<b class="">Mounting Method:</b>
"Surface Mount"
<br class="">
<b class="">Package Style:</b>
"TDFN-8"
<br class="">
<b class="">Packaging:</b>
"REEL"
<br class="">
</p>
</div>
In the above HTML, how should I write an XPath expression to get the text (i.e. "STM6520AQRRDG9F") next to the <b> element? I tried the following:
//*[@id="product-desc"]/p[2]/b[1]/following-sibling::text()
Can anyone suggest the correct XPath expression for getting this text?
Thanks in advance.
As hek2mgl has mentioned, the text you'd like to find is in the first p element of that div. Also, to avoid any surprising results, you should select only the first following text node that is a sibling.
One way to do it is
//*[@id="product-desc"]/p[1]/b[1]/following-sibling::text()[1]
and the result will be
[EMPTY LINE]
"STM6520AQRRDG9F"
[EMPTY LINE]
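The empty lines around the value are the whitespace inside the text node itself. One way to drop them is to wrap the expression in normalize-space(). The sketch below uses lxml (an assumption) on a shortened copy of the markup, and also pairs each <b> label with the text that follows it. The variable names are illustrative only.

```python
from lxml import html

# Shortened copy of the markup from the question.
SNIPPET = """
<div id="product-desc" class="green-box">
<p class="ref">
<b class="">Mfr Part#:</b>
"STM6520AQRRDG9F"
<br class="">
<b class="">Mounting Method:</b>
"Surface Mount"
<br class="">
</p>
</div>
"""

tree = html.fromstring(SNIPPET)

# The first text node that follows the first <b>, whitespace included.
raw = tree.xpath(
    '//*[@id="product-desc"]/p[1]/b[1]/following-sibling::text()[1]'
)

# normalize-space() trims the leading and trailing whitespace.
clean = tree.xpath(
    'normalize-space(//*[@id="product-desc"]/p[1]/b[1]'
    '/following-sibling::text()[1])'
)

# Pair every label with the text node directly after it.
pairs = {
    b.text: b.xpath("normalize-space(following-sibling::text()[1])")
    for b in tree.xpath('//*[@id="product-desc"]/p[1]/b')
}

print(clean)
print(pairs)
```

The double quotes remain part of the value here because they are literal characters in this markup.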