Xpath query for HTML - what am I doing wrong?

I have this snippet of HTML inside a <BODY> that I'm trying to select with scrapy:
<section class="content">
<div class="social clearfix">
<div class="profile profile-nano pull-left">
<img src="/xxx" class="avatar" height="48" width="48" title="xxx" alt="xxx">
</div>
<p class="byline pull-left text-left"><strong>BY <a class="text-uppercase" href="https://xxx">xxx</a><br />
September 07, 2015</strong> </p>
This is the xpath selector I'm using to get the date:
response.selector.xpath('//p/@byline/text()')
Which returns a null result.
What am I doing wrong in my XPath selector?

//p/@byline/text() matches nothing because @byline selects an attribute named byline, and the p element has no such attribute; byline is one of the values of its class attribute.
You can get the text node that follows the a element inside the p element that has the byline class:
In [1]: response.xpath("//p[contains(@class, 'byline')]//a/following-sibling::text()").extract()[0].strip()
Out[1]: u'September 07, 2015'
Alternatively, get all the text nodes from that p element and keep the one that matches a regular expression, using the re:test() function:
In [2]: response.xpath("//p[contains(@class, 'byline')]//text()[re:test(., '\w+ \d{2}, \d{4}')]").extract()[0].strip()
Out[2]: u'September 07, 2015'
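The difference between an attribute name and a class value can be checked without Scrapy. The following is a minimal sketch using only Python's standard library, not the answer's Scrapy code. xml.etree.ElementTree supports only a subset of XPath (no contains() and no following-sibling axis), so the class test is done in Python by a hypothetical helper, has_class, and the markup is a well-formed copy of the question's snippet with the img self-closed and the open elements closed.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the question's markup (assumption: <img> self-closed
# and the trailing </div></section> added so the XML parser accepts it).
SNIPPET = """
<section class="content">
<div class="social clearfix">
<div class="profile profile-nano pull-left">
<img src="/xxx" class="avatar" height="48" width="48" title="xxx" alt="xxx"/>
</div>
<p class="byline pull-left text-left"><strong>BY <a class="text-uppercase" href="https://xxx">xxx</a><br />
September 07, 2015</strong> </p>
</div>
</section>
"""


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the element's classes."""
    return name in element.get("class", "").split()


root = ET.fromstring(SNIPPET)

# Select the <p> by a value of its class attribute.
byline = next(p for p in root.iter("p") if has_class(p, "byline"))

# There is no attribute called "byline", which is why //p/@byline is empty.
byline_attribute = byline.get("byline")

# ElementTree stores the text that follows an element in its .tail,
# so the date is the tail of the <br/> inside <strong>.
date_text = byline.find("./strong/br").tail.strip()

print(byline_attribute)
print(date_text)
```

Here the tail of the br element plays the role that following-sibling::text() plays in the XPath version.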

Related

Traversing the DOM with querySelector

I'm using the statement document.querySelector("[data-testid='people-menu'] div:nth-child(4)") in the console to give me the below HTML snippet:
<div>
<span class="jss1">
<div class="jss2">
<p class="jss3">Owner</p>
</div>
</span>
<div class="jss4">
<div class="5" title="User Title">
<p class="jss6">UT</p>
</div>
<div class="jss7">
<p class="jss82">User Title</p>
<span class="jss9">Project Manager</span>
</div>
</div>
</div>
I'd like to extend the statement in the console to extract the title "User Title" but can't figure out what combination of nth-child or nextSibling (or something else) to use. The closest I've gotten is:
document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1)")
which gives me the span with class jss1.
I expected document.querySelector("[data-testid='people-menu'] div:nth-child(4) span:nth-child(1).nextSibling") to give me the div with class jss4, but it returns null.
I can't use class selectors because those are generated dynamically at build.

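The .nextSibling attempt returns null because querySelector only understands CSS selectors: .nextSibling is parsed as a class selector, and no span has that class. Why not add [title] to your selector instead?
document.querySelector("[data-testid='people-menu'] div:nth-child(4) [title]")
You can then read the title from the matched element, for example through its title property. This assumes title is a unique attribute in this section of the HTML.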

Using Nokogiri's CSS method to get all elements within an alt tag

I am trying to use Nokogiri's CSS method to get some names from my HTML.
This is an example of the HTML:
<section class="container partner-customer padding-bottom--60">
<div>
<div>
<a id="technologies"></a>
<h4 class="center-align">The Team</h4>
</div>
</div>
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
<div class="product">
<img class="" src="https://v0003.jpg" alt="Person 3"/></a>
<p>Person 3<br>Product</p>
</div>
<div class="Human Resources & Admin">
<img class="" src="https://v0004.jpg" alt="Person 4"/></a>
<p>Person 4<br>People & Places</p>
</div>
<div class="alliances">
<img class="" src="https://v0005.jpg" alt="Person 5"/></a>
<p>Person 5<br>VP of Alliances</p>
</div>
What I have so far in my people.rake file is the following:
staff_site = Nokogiri::HTML(open("https://www.website.com/company/team-all"))
all_hands = staff_site.css("div.consultant").map(&:text).map(&:squish)
I am having a little trouble getting the value of each alt="" attribute (the name of the person), as the img is nested under a few divs.
Currently, using div.consultant, it gets all the names plus the roles, i.e. Person 1Founder, Chairman & CTO, instead of just the person's name in alt=.
How could I get just the value of alt?

Your desired output isn't clear and the HTML is broken.
Start with this:
require 'nokogiri'
doc = Nokogiri::HTML('<html><body><div class="consultant"><img alt="foo"/><img alt="bar" /></div></body></html>')
doc.search('div.consultant img').map{ |img| img['alt'] } # => ["foo", "bar"]
Using text on the output of css isn't a good idea. css returns a NodeSet, and text against a NodeSet concatenates the text of every node. That often mangles the content and forces you to figure out how to pull it apart again, which ends in horrible code:
doc = Nokogiri::HTML('<html><body><p>foo</p><p>bar</p></body></html>')
doc.search('p').text # => "foobar"
This behavior is documented in NodeSet#text:
"Get the inner text of all contained Node objects"
Instead, use text (AKA inner_text or content) against the individual nodes. Node#text is documented as:
"Returns the content for this Node"
That gives you the exact text for each node, which you can then join as you want:
doc.search('p').map(&:text) # => ["foo", "bar"]
See also "How to avoid joining all text from Nodes when scraping".
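For comparison, the same per-node idea (read the alt attribute from each img instead of taking the container's text) can be sketched with Python's standard library. This is not Nokogiri: AltCollector is a hypothetical helper written for this illustration, and the markup is a shortened copy of the question's HTML, stray </a> tag included.

```python
from html.parser import HTMLParser

# Shortened copy of the question's markup, including a stray </a>.
HTML = """
<div class="consultant list-across wrap">
<div class="engineering">
<img class="" src="https://v0001.jpg" alt="Person 1"/>
<p>Person 1<br>Founder, Chairman & CTO</p>
</div>
<div class="engineering">
<img class="" src="https://v0002.png" alt="Person 2"/></a>
<p>Person 2<br>Founder, VP of Engineering</p>
</div>
</div>
"""


class AltCollector(HTMLParser):
    """Hypothetical helper: collect alt values of <img> tags that sit
    inside a <div> whose class list contains "consultant"."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # <div> nesting depth, counted from the consultant div
        self.names = []  # one entry per <img>, never concatenated

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "div":
            classes = (attributes.get("class") or "").split()
            if self.depth or "consultant" in classes:
                self.depth += 1
        elif tag == "img" and self.depth and attributes.get("alt"):
            self.names.append(attributes["alt"])

    def handle_endtag(self, tag):
        # Unmatched tags such as the stray </a> are simply ignored.
        if tag == "div" and self.depth:
            self.depth -= 1


collector = AltCollector()
collector.feed(HTML)
collector.close()
names = collector.names
print(names)
```

Because each name is appended separately, nothing has to be split apart afterwards, which is the same point the answer makes about map(&:text).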

XPath to select link containing text?

I tried to use this XPath:
//*[contains(normalize-space(text()),'Jira')]
Also tried:
//*[contains(text(),'Jira')]
In the below HTML example, there is space before and after text "Jira". I am not able to click on the link:
<a href="#/crm/usergroup-edit?id=572a3c84e4b07f6189958700"
ng-repeat="gp in groups | filter : userGroupSearch | orderBy:'-name':1"
class="ng-scope">
<div class="inventoryPanel" ng-style="myStyle" style="width: 15.8%;">
<h4 class="ng-binding">
<div class="groupIcon G">
<div class="text ng-binding">P</div>
</div>Jira
</h4>
</div>
</a>

The following XPath selects all a elements whose string value contains the substring Jira:
//a[contains(.,'Jira')]
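The attempts in the question fail because, in XPath 1.0, contains(text(), 'Jira') converts the node-set text() to a string by taking only its first node. The first text node of the h4 is the whitespace before the inner div, and Jira sits in a later text node. The string value of an element (.) is the concatenation of all of its descendant text nodes, so it does contain Jira. A minimal sketch of that difference, using Python's standard library rather than an XPath engine, on a trimmed, well-formed copy of the markup (assumption: the Angular attributes are dropped for brevity):

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the question's markup.
SNIPPET = """
<a href="#/crm/usergroup-edit?id=572a3c84e4b07f6189958700" class="ng-scope">
<div class="inventoryPanel" style="width: 15.8%;">
<h4 class="ng-binding">
<div class="groupIcon G">
<div class="text ng-binding">P</div>
</div>Jira
</h4>
</div>
</a>
"""

link = ET.fromstring(SNIPPET)
h4 = link.find(".//h4")

# .text is the text before the first child element, i.e. the first text
# node of <h4>; this is what contains(text(), ...) effectively inspects.
first_text_node = h4.text or ""

# Joining itertext() gives all descendant text in document order, which
# corresponds to the XPath string value used by contains(., ...).
string_value = "".join(link.itertext())

print(repr(first_text_node))
print("Jira" in first_text_node)
print("Jira" in string_value)
```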

HtmlUnit - Unable to get anchors from div

The divs of the HTML page I am targeting look like this:
<div class="white-row1">
<div class="results">
<div class="profile">
<a href="hrefThatIWant.com" class>
<img src = "http://imgsource.jpg" border="0" width="150" height="150 alt>
</a>
</div>
</div>
</div>
<div class="white-row2">
// same content as the div above
</div>
I want to collect the href from each div into a list.
This is my current code:
List<HtmlAnchor> profileDivLinks = (List)htmlPage.getByXPath("//div[@class='profile']//@href");
for(HtmlAnchor link:profileDivLinks)
{
System.out.println(link.getHrefAttribute());
}
This is the error I am receiving (thrown on the first line of the for statement):
Exception in thread "main" java.lang.ClassCastException: com.gargoylesoftware.htmlunit.html.DomAttr cannot be cast to com.gargoylesoftware.htmlunit.html.HtmlAnchor
What do you think the issue is?

The issue is that your XPath selects an attribute (a DomAttr), and you are then casting that attribute to an anchor. The solution with the minimal change to your code is to modify the XPath so that it returns the anchors:
htmlPage.getByXPath("//div[@class='profile']//a");

Try
//div[@class='profile']//data(@href)

Selenium locate <img> nested in <div> class

I have the following html code:
<span id="spanId" class="myThumbnails">
<div class="Thumbnail" style="margin-bottom:12px;text-align:center;">
<img id="thumbl00_cph_Img1" style="border-width:0px;" src="http://someImg.jpg"></img>
<input id="thumbl00_cph_Img1" type="hidden" value="http://someImg.jpg"></input>
</div>
<div class="Thumbnail" style="margin-bottom:12px;text-align:center;"></div>
<div class="Thumbnail" style="margin-bottom:12px;text-align:center;"></div>
<div class="Thumbnail" style="margin-bottom:12px;text-align:center;"></div>
</span>
I've extracted the span using XPath and then called findElements by className, but now I need the src attribute of the inner <img>. Since the id is generated, I can't use it. Is there a way to extract the img?

WebElement has a getAttribute method, which does exactly what you want. Your code could look something like this:
driver.findElement(By.xpath("//div[@class=\"Thumbnail\"]/img")).getAttribute("src")

You can use a jQuery selector:
$('span#spanId img').attr('src')
This works when there is only one img tag. If there is more than one, loop over the matches to get each src attribute.
