Scrapy writing XPath expression for unknown depth - html

I have an html file which is like:
<div id='author'>
<div>
<div>
...
<a> John Doe </a>
I do not know how many divs will be nested under the author div. The depth may differ from page to page.
So what would be the XPath expression for this kind of XML?
By the way, I tried:
//div[@id = "author"]/*/a/text()
but this only seems to work when the a is a grandchild of the author div.

Use a double slash (the descendant shorthand) to find an a element at any depth inside the div element with id="author":
//div[@id = "author"]//a/text()
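To see the difference between /*/ and // outside of Scrapy, here is a minimal sketch using only Python's standard library. The page markup is a made-up example, and xml.etree.ElementTree supports only a subset of XPath (no text() step), so the text is read from the matched elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: the <a> sits three levels below the author div.
PAGE = """
<html><body>
  <div id="author">
    <div>
      <div>
        <a> John Doe </a>
      </div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# "/*/a" skips exactly one level, so it only finds grandchildren.
fixed_depth = root.findall(".//div[@id='author']/*/a")

# "//a" matches an <a> at any depth below the author div.
any_depth = root.findall(".//div[@id='author']//a")

names = [a.text.strip() for a in any_depth]
print(len(fixed_depth), names)
```

In a Scrapy callback the same expression would normally be passed to response.xpath(...), which does support text().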

Related

How to select <div class="ok">.....<a href="soft://an.id/">...</div> nodes?

A document has several <div class="ok"> tags. I am able to select all of them with
"//*[@class="ok"]" (I don't have to specify div, because only div tags have this class). I get a list of 6 nodes matching this.
Now, I need
either to test each node in order to see if it includes the tag <a href="soft://an.id/">. This inclusion is not direct. I mean, the <div> includes a <table> with many <tr> and <td> and <span>, and the <a..> (only one, or none) somewhere before </div>.
or to directly select only (div) nodes of class="ok" that include this <a> tag.
I have tried many things, and they all fail, including escaping the "/" in the href test (is that required?).
I am quite familiar with regular expressions, but I must confess that I find XPath syntax even harder to understand. And the W3C reference documents are hard to follow without examples.
Any hints are welcome.
In order to select only the <div class="ok"> elements that contain an <a href="soft://an.id/"> descendant element, you can use the following XPath locator:
"//div[@class='ok' and .//a[@href='soft://an.id/']]"
If I understand you correctly, you have an a element nested somewhere under the div with class "ok", right?
In XPath, a single / selects only direct children of the current node. If you are looking for the a anywhere under the matched div, you need to use //:
//div[@class="ok"]//a[@href="soft://an.id/"]
This selects the a itself, so you then need to check whether it exists by using some kind of assertion.
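The "test each node" approach can be sketched with Python's standard library as below. The markup and the id attributes are invented for illustration. ElementTree does not accept a nested path such as .//a inside a predicate, so the descendant check is done in Python. Note that the slashes in the href need no escaping, because they sit inside a quoted string.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: two div.ok blocks, only the first holds the link.
PAGE = """
<html><body>
  <div class="ok" id="first">
    <table><tr><td><span><a href="soft://an.id/">link</a></span></td></tr></table>
  </div>
  <div class="ok" id="second">
    <table><tr><td><span>no link here</span></td></tr></table>
  </div>
</body></html>
"""

TARGET_HREF = "soft://an.id/"

def divs_containing_link(root, href):
    """Return div.ok elements that have a descendant <a> with the given href."""
    matches = []
    for div in root.findall(".//div[@class='ok']"):
        # Element.iter("a") walks every descendant <a>, at any depth.
        if any(a.get("href") == href for a in div.iter("a")):
            matches.append(div)
    return matches

root = ET.fromstring(PAGE)
matched_ids = [div.get("id") for div in divs_containing_link(root, TARGET_HREF)]
print(matched_ids)
```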

How can I get the element of a-tag in the div class with selenium?

I am working on a project where I have to get an element from a specific website.
I want to get the text shown in the markup below.
<div class="block-content">
<div class="block-heading">
<a href="https://www~~~~~~">
<i class="fa fa-map">
::before
</i>
"Text I want to get"
</a>
</div>
</div>
I have been trying to solve this for a while, but I could not find anything working fine.
I would love you if you could help me.
Thank you.
According to the information you provided, the text you are looking for is inside the a element, so the XPath for this element is something like:
//a[contains(@href,'https://www')]
But since there is also an i element inside it, getting the text from the a element will give you both the text contained in a itself and the text inside the i.
So you should get the text from the i (which looks like just a space here) and remove it from the text you receive from the a.
In case you want to perform this action on all the a elements that have an href and an i element inside them, you can use the following XPath:
//a[@href and ./i]
If there are more specific requirements for the elements you are looking for, the XPath above should be updated accordingly.
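The idea of separating the link's own text from the text of its i child can be illustrated with a small standard-library sketch. This is not Selenium code; the href and the "icon" placeholder text are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup modelled on the question; href and icon text are made up.
SNIPPET = """
<div class="block-content">
  <div class="block-heading">
    <a href="https://example.com/"><i class="fa fa-map">icon</i>Text I want to get</a>
  </div>
</div>
"""

def own_text(element):
    """Join the text that belongs directly to element, skipping child elements' text."""
    parts = [element.text or ""]
    # In ElementTree, text that follows a child element is stored in child.tail.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()

root = ET.fromstring(SNIPPET)
link = root.find(".//a")

# itertext() includes the text of the nested <i>; own_text() leaves it out.
full_text = "".join(link.itertext()).strip()
link_text = own_text(link)
print(full_text)
print(link_text)
```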
From your comment, I understood that you would like to extract that text. So here is the code for you which would extract the text you want.
Selenium::WebDriver::Wait
  .new(timeout: 60)
  .until { !driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text.empty? }
p driver.find_element(xpath: "//i[@class='fa fa-map-marker']/..").text[/(?<=before \")\w+ \w+ \w+ \w+ \w+/]
output
"Text I want to get"
I couldn't get the elements that I wanted directly, so here's what I did.
I fetched the enclosing block and extracted the text from it with a few string methods.
def seller_name
  shop_info_elements = @driver.find_elements(:class_name, "block-content")
  shop_info_text = shop_info_elements.first.text
  # The first line of the block's visible text is the seller name
  shop_info_text_array = shop_info_text.lines
  seller_name = shop_info_text_array.first.chomp
  seller_name
end
It is not beautiful, but it works for other pages on the same site.

CSS: Select elements with only one parent matching attribute selector

I'm a bit of a noob to coding so sorry if this is a dumb question, but I'm trying to write a general purpose scraper for getting some product data using the "schema.org/Product" HTML microdata.
However, I came into an issue when testing (on this page in particular where the name was being set as "Electronics" from the Breadcrumbs schema) as there were ancestor elements with different itemtypes/schema.
I first have this variable declared to check if the page has an element using the Product schema microdata.
var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');
I then wanted to select for all elements with the itemprop attribute. e.g.
productMicrodata.querySelectorAll('[itemprop]');
The issue however is that I want to ignore any elements that have other ancestors with different itemtypes/schema attributes, as in this instance the Breadcrumbs and ListItem schema data is still being included.
I figured I would then just be able to do something like this:
productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');
However this is still returning matches for the child elements having ancestor elements with different itemscope attributes (e.g. breadcrumbs).
I'm sure I'm just missing something super obvious, but any help on how I can achieve only selecting elements that have only the one ancestor with itemtype="http://schema.org/Product" attribute would be much appreciated.
EDIT: For clarification of where the element(s) are that I'm trying to avoid matching with are, here's what the DOM looks like on the example page linked. I'm trying to ignore the elements that have any ancestors with itemtype attributes.
EDIT 2: changed incorrect use of parent to ancestor. Apologies, I am still new to this :|
EDIT 4/SOLUTION: I've found a non-CSS solution for what I'm trying to achieve using the javascript Element.closest() method. e.g.
let productMicrodata = document.querySelectorAll('[itemprop]');
let itemProp = {};
const productTypes = ["http://schema.org/Product", "https://schema.org/Product"];
for (let i = 0; i < productMicrodata.length; i++) {
  // closest() starts at the element itself and returns null if nothing matches
  const scope = productMicrodata[i].closest('[itemtype]');
  if (scope && productTypes.includes(scope.getAttribute('itemtype'))) {
    itemProp[productMicrodata[i].getAttribute('itemprop')] = productMicrodata[i].textContent;
  }
}
console.log(itemProp);
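For anyone doing the same filtering outside the browser, the "nearest ancestor with an itemtype" rule can be sketched in Python with the standard library. The markup below is a made-up example, and the helper name product_properties is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a Product scope that nests a breadcrumb scope.
PAGE = """
<div itemscope="" itemtype="https://schema.org/Product">
  <span itemprop="name">Widget</span>
  <ol itemscope="" itemtype="https://schema.org/BreadcrumbList">
    <li itemprop="name">Electronics</li>
  </ol>
</div>
"""

PRODUCT_TYPES = {"http://schema.org/Product", "https://schema.org/Product"}

def product_properties(root):
    """Collect itemprop values whose nearest itemtype ancestor is a Product."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    props = {}
    for elem in root.iter():
        if "itemprop" not in elem.attrib:
            continue
        # Like Element.closest(), start the search at the element itself.
        scope = elem
        while scope is not None and "itemtype" not in scope.attrib:
            scope = parent_of.get(scope)
        if scope is not None and scope.get("itemtype") in PRODUCT_TYPES:
            props[elem.get("itemprop")] = (elem.text or "").strip()
    return props

props = product_properties(ET.fromstring(PAGE))
print(props)
```

The breadcrumb's "Electronics" entry is skipped because its nearest itemtype ancestor is the BreadcrumbList, not the Product.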
:not([itemscope]) [itemprop] means:
An element with an itemprop attribute that has any ancestor without an itemscope attribute.
So:
<div>
<div itemscope>
<div itemprop> <!-- this one -->
</div>
</div>
</div>
… would match because while the parent element has the itemscope attribute, the grandparent does not.
You need to use the child combinator to eliminate elements whose parent element matches (note that this only checks the direct parent, not ancestors further up):
:not([itemscope]) > [itemprop]
[...] help on how I can achieve only selecting elements that have only
the itemtype="http://schema.org/Product" attribute would be much
appreciated.
Attribute selectors can take explicit values:
[myAttribute="myValue"]
So the syntax for this would be:
var productMicrodata = document.querySelectorAll('[itemtype="http://schema.org/Product"]');

Finding XPath for text in div following input

I have an issue reading a value with XPath and need some help/advice from experts.
Part of my HTML is below:
<div class="input required_field">
<div class="rounded_corner_error">
<input id="FnameInput" class="ideField" type="text" value="" name="first_name">
<div class="help-tooltip">LOGIN BACK TO MAIN</div>
<div class="error-tooltip">
I need to find the XPath of the text message (LOGIN BACK TO MAIN)
Using Firebug I find the XPath
("//html/body/div/div[5]/div/div/form/fieldset/div/div[2]/div[2]/div/div");
But using the above XPath I can only read class = help-tooltip, whereas I need to read LOGIN BACK TO MAIN.
Try adding /text() on the end of the xpath you have.
It does not really look like your XPath matches your XHTML element.
You should try something simpler and more generic, such as:
//div[@class="help-tooltip"]/text()
See Selecting a css class with xpath.
I would use:
# Selecting the div element
//input[@id="FnameInput"]/following-sibling::div[@class="help-tooltip"]
# Selecting the text content of the div
//input[@id="FnameInput"]/following-sibling::div[@class="help-tooltip"]/text()
…since id attribute values are unique in a valid HTML document, and as such that's a pretty strong anchor point.
Note that the latter expression will select the text node, not the text string content of that node; you need to extract the value of the text node if you want the string. How you do that depends on what tools you are using:
In JavaScript/DOM that would be the .nodeValue property of the text node.
For Nokogiri that would be the .content method.
…but I have no idea what technology you are using your XPath with.
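As one more illustration, assuming Python and only its standard library, the same "div after the input" lookup might look like the sketch below. ElementTree has no following-sibling axis, so the siblings are walked by hand; the markup is a closed, well-formed version of the snippet in the question, and tooltip_after is a hypothetical helper name. Here the string is read from the element's .text attribute.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed version of the markup in the question.
FORM = """
<div class="input required_field">
  <div class="rounded_corner_error">
    <input id="FnameInput" class="ideField" type="text" value="" name="first_name"/>
    <div class="help-tooltip">LOGIN BACK TO MAIN</div>
    <div class="error-tooltip"></div>
  </div>
</div>
"""

def tooltip_after(root, input_id):
    """Return the text of the first help-tooltip div that follows the given input."""
    for parent in root.iter():
        children = list(parent)
        for index, child in enumerate(children):
            if child.tag == "input" and child.get("id") == input_id:
                # Only siblings that come after the input are considered.
                for sibling in children[index + 1:]:
                    if sibling.tag == "div" and sibling.get("class") == "help-tooltip":
                        return sibling.text
    return None

root = ET.fromstring(FORM)
message = tooltip_after(root, "FnameInput")
print(message)
```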

Jsoup: <div> within an <a>

According to this answer:
HTML 4.01 specifies that <a> elements
may only contain inline elements. A
<div> is a block element, so it may
not appear inside an <a>.
But...
HTML5 allows <a> elements to contain
blocks.
Well, I just tried selecting a <div class="m"> within an <a> block, using:
Elements elems = a.select("m");
and elems comes back empty, despite the div being there.
So I am thinking: Either I am not using the correct syntax for selecting a div within an a or... Jsoup doesn't support this HTML5-only feature?
What is the right Jsoup syntax for selecting a div within an a?
Update: I just tried
Elements elems = a.getElementsByClass("m");
And Jsoup had no problems with it (i.e. it returns the correct number of such divs within a).
So my question now is: Why?
Why does a.getElementsByClass("m") work whereas a.select("m") doesn't?
Update: I just tried, per @Delan Azabani's suggestion:
Elements elems = a.select(".m");
and it worked. So basically the a.select() works but I was missing the . in front of the class name.
The select function takes a CSS selector. If you pass 'm' as the argument, it'll try to find m elements that are descendants of the a element. You need to pass '.m' as the argument, which will find elements with the m class under the a element.
The current version of jsoup (1.5.2) does support div tags nested within a tags.
In situations like this I suggest printing out the parse tree, to check that jsoup has parsed the HTML the way you expect and, if it hasn't, to work out the correct selector to use.
E.g.:
Document doc = Jsoup.parse("<a href='./'><div class=m>Check</div></a>");
System.out.println("Parse tree:\n" + doc);
Elements divs = doc.select("a .m");
System.out.println("\nDiv in A:\n" + divs);
Gives:
Parse tree:
<html>
<head></head>
<body>
<a href="./">
<div class="m">
Check
</div></a>
</body>
</html>
Div in A:
<div class="m">
Check
</div>