I am currently experimenting with crawlers and how they work.
With that, I am stuck finding the right selector call for Scrapy - neither XPath nor CSS works.
Here is the source code:
<body data-new-gr-c-s-check-loaded="14.980.0">
  <div id="__next">
    <div class="layout layout--public">
      <section class="container-fluid container-section coach-list-section">
        <div class="single-coach">
          <div class="row">
            <div class="col-md-4">...</div>
            <div class="col-md-3 coach-main-content">
              <div class="coach-info">...</div>
              <h2>
                Name, Age
              </h2>
I want to retrieve the "Name, Age".
With the following code I only get an empty list back, but I don't know why:
response.xpath('//body/div[contains(@id, "__next")]
    /div[contains(@class, "layout layout--public")]
    /section[contains(@class, "container-fluid container-section coach-list-section")]
    /div[contains(@class, "single-coach")]').getall()
Note: the code is actually on one line; I split it across multiple lines here just for readability.
EDIT:
So I used the dev console in the browser and worked out the right XPath, which is:
//body/div[contains(@id, "__next")]//div[contains(@class, "single-coach")]//div[contains(@class, "main-content")]/h2/a
In the dev console, I can also see the right element highlighted.
Trying this XPath in Scrapy doesn't work. I entered the following code in the Scrapy shell...
response.xpath('//body/div[contains(@id,"__next")]//div[contains(@class,"single-coach")]//div[contains(@class,"main-content")]/h2/a').getall()
... and still received an empty list --> []
The trailing .getall() should work, since the XPath //body does return the desired information.
SECOND EDIT
The content is loaded dynamically on the website, which is why my selectors kept coming back empty.
For everyone who runs into the same problem, I suggest looking it up in the Scrapy documentation: https://docs.scrapy.org/en/latest/topics/dynamic-content.html
Try this:
response.xpath('//section//h2/a/text()').getall()
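If it helps, here is a minimal standalone sketch (using requests and parsel, the selector library Scrapy builds on; the URL is a placeholder) to confirm whether the text is present in the HTML the server actually returns:

import requests
from parsel import Selector  # the selector library Scrapy uses internally

# Hypothetical URL standing in for the coach-list page from the question.
url = "https://example.com/coaches"

html = requests.get(url, timeout=10).text
sel = Selector(text=html)

# If the list is built client-side, this prints [] even though the browser's
# dev tools show the <h2><a> elements.
print(sel.xpath('//section//h2/a/text()').getall())

# Quick sanity check: is the text anywhere in the raw HTML at all?
print("Name, Age" in html)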
Related
I'm trying to parse HTML from my firm using XPath. Here is the sample HTML structure of my target website:
<div class='my_target' id='A'>
  This is a sample website HTML!
  <span>APPLE</span>
  <span>BANANA</span>
  <span>ORANGE</span>
  <span>IGNORE_1</span>
  <span>IGNORE_2</span>
</div>
<div class='not_my_target' id='B'>
  This is a sample website HTML!
  <span>APPLE</span>
  <span>BANANA</span>
  <span>ORANGE</span>
  <span>IGNORE_1</span>
  <span>IGNORE_2</span>
</div>
And here are the elements I want to get:
<div class='my_target' id='A'>
  This is a sample website HTML!
  <span>APPLE</span>
  <span>BANANA</span>
  <span>ORANGE</span>
</div>
I've tried code like this:
//div[@id='A' and (not(self::span and contains(text(), "IGNORE_1")) or not(self::span and contains(text(), "IGNORE_2"))]
But it didn't work Q_Q
Did I write the wrong syntax? Could anyone help?
Thanks 
Try this:
//div[@id='A']/span[not(contains(text(),'IGNORE_1')) and not(contains(text(),'IGNORE_2'))]
This selects the div with the ID value A and then keeps only the span children that contain neither IGNORE_1 nor IGNORE_2.
Problem with your case:
You put the not(...) conditions on the div itself, asking that the div not be a span containing IGNORE_1 or IGNORE_2, instead of filtering its span children. That's why you are unable to get the desired result.
//div[@id='A' and (not(self::span and contains(text(), "IGNORE_1")) or not(self::span and contains(text(), "IGNORE_2"))]
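A minimal lxml sketch (using an abridged copy of the sample HTML from the question) showing the corrected expression picking up only the wanted spans:

from lxml import html

# Abridged sample markup from the question.
doc = html.fromstring("""
<div class='my_target' id='A'>
    This is a sample website HTML!
    <span>APPLE</span>
    <span>BANANA</span>
    <span>ORANGE</span>
    <span>IGNORE_1</span>
    <span>IGNORE_2</span>
</div>
""")

spans = doc.xpath(
    "//div[@id='A']/span"
    "[not(contains(text(),'IGNORE_1')) and not(contains(text(),'IGNORE_2'))]"
)
print([s.text for s in spans])  # ['APPLE', 'BANANA', 'ORANGE']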
I'm trying to scrape a webpage to retrieve a piece of information I found using inspect element.
A search results page contains the following div:
<div id="se-search-serp">
...
</div>
Using inspect element, I can find the desired tags inside this element:
<div class="sc-fznyAO kChtze">
<div data-area="main" class="sc-fznKkj sc-pcxhi kVWWZA">
<h1 class="sc-qXFOy hZbUVv automation-total-books-found">500 results in Books</h1>
<div class="sc-pQfvp hBInHv">
<div class="sc-qXHHN kITAEN"><ul id="search-results-tabs" role="tablist" data-test="tabs" class="sc-pAyMl frsGKv">
<li role="presentation">
<a role="tab" id="search-results-tabs_tabheader_0" data-test="search-results-tabs_tabheader_0" href="#search-results-tabs_tabpanel_0" data-index="0" tabindex="-1" class="sc-pITNg jukMUo">All Matches</a></li>
...
I would like to access this content through the Python Requests library. However, when I request the page and print out the text, the contents of this div are missing. Here's the string of relevant text in the very long output:
...
<div id="se-search-serp"></div>
...
I have checked that this is the div that contained the content in inspect element. It has the same place in the hierarchy, and no other divs with the same id exist on the page.
What is going on here? I've observed that many of the tags within <div id="se-search-serp"> belong to classes with illegible names ("sc-fznyAO", "sc-qXFOy", etc.) like the ones shown above. I cannot find any tags like these in the printed page text, although they do show in inspect element. Could this be relevant?
And finally, is there any way to access this content through the requests library?
Edit: The site url I am using is: https://www.chegg.com/search/the%20art%20of%20electronics/#p=1
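For what it's worth, a quick check with requests (using the URL from the edit; the browser-like User-Agent header is an assumption, since some sites serve different content to the default python-requests agent) makes the symptom visible: the container arrives empty and the generated class names never show up in the raw HTML, which is what you would expect if the results are rendered client-side by JavaScript:

import requests

# URL from the question's edit.
url = "https://www.chegg.com/search/the%20art%20of%20electronics/#p=1"

# Browser-like User-Agent (an assumption; some sites block or vary content
# for the default python-requests agent).
headers = {"User-Agent": "Mozilla/5.0"}
text = requests.get(url, headers=headers, timeout=10).text

# If the results are rendered client-side, the container is present but empty
# and the styled-components class names (sc-...) are absent from the raw HTML.
print('<div id="se-search-serp">' in text)
print("sc-fznyAO" in text)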
I'm trying to build a basic documentation site using asciidoctor. One thing I am struggling with is how I can inject a standard site header (a banner with a logo and top-level navigation) into each documentation page.
I render my asciidoc directly to html from the command line. I have tried to find a way to somehow inject an extra div element and position it fixed at the top, but I can't figure out what mechanism to use for this. My early attempts involved using docinfo.html but that gets injected into the html in the <head> element, instead of the <body>.
I am aware that full-on publication systems like Jekyll allow me to configure "front matter" that can probably take care of this, but I was hoping there was a mechanism or trick using vanilla asciidoctor that can achieve this.
Ted Bergeron on the Mailing List mentioned a simple project:
Demo website created just using Asciidoctor.
Check the corresponding repo to see the files and how to create the site (just using one command).
In summary: simply create a header asciidoc file that includes your site nav (in the demo site this is done using table markup), without including a level-0 (document) title. Include this header file right at the top of every page of your site. Then render by just running asciidoctor *.txt on your project.
--embedded option + simple post processing
With this option, asciidoctor generates only the interior part of the <body> element.
For example:
main.adoc
= Super title
== Second level
asdf
== Also second level
qwer
then:
asciidoctor --embedded main.adoc
produces:
main.html
<div class="sect1">
<h2 id="_second_level">Second level</h2>
<div class="sectionbody">
<div class="paragraph">
<p>asdf</p>
</div>
</div>
</div>
<div class="sect1">
<h2 id="_also_second_level">Also second level</h2>
<div class="sectionbody">
<div class="paragraph">
<p>qwer</p>
</div>
</div>
</div>
You can then just cat a header and closing footer, and you are done.
Tested with Asciidoctor 2.0.10.
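If you prefer to keep that post-processing step in a script rather than a shell one-liner, here is a minimal sketch (the file names header.html and footer.html are assumptions) that stitches the embedded output into a full page:

from pathlib import Path

# Assumed layout: header.html opens <html>/<body> and carries the site banner
# and navigation, main.html is asciidoctor's --embedded output, footer.html
# closes the page.
parts = ["header.html", "main.html", "footer.html"]

page = "".join(Path(name).read_text(encoding="utf-8") for name in parts)
Path("index.html").write_text(page, encoding="utf-8")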
html:
<div class="view doc">
<div class="view-doc-heading-dec mt10 ng-binding" id="docSummaryHeader"> Document Title </div>
<div class="view-doc-inner mt11 ng-binding" id="docBodyHeader">
</div>
I want to retrieve 'Document Title' from the above elements with XPath:
$x('//*[@id=docSummaryHeader]')[0]
works in the Chrome console,
but
element(by.xpath('//*[@id=docSummaryHeader]'))
in Protractor doesn't allow [0].
If I use
element(by.xpath('//*[@id=docSummaryHeader]'))
it matches multiple elements in the current HTML.
Find all elements and get the desired one by index:
element.all(by.xpath('//*[#id="docSummaryHeader"]')).get(0);
or:
element.all(by.xpath('//*[#id="docSummaryHeader"]')).first();
Or you can use XPath indexing (1-based); note the parentheses, which make the [1] apply to the whole result set rather than per parent:
element(by.xpath('(//*[@id="docSummaryHeader"])[1]'))
Actually, you don't need XPath here:
$$('#docSummaryHeader').first();
Consider using a CSS selector instead.
For my question, you can refer to the udacity.com main page.
I'm trying to access the text "The Udacity Difference" somewhere in the middle of the page.
I tried this:
d3.select("div.banner-content.h2.h-slim")
d3.select("div.banner-content.h-slim")
Neither of the above works, so I looked in the dev tools Elements panel for this page.
There I could hover and see that:
div.banner-content contains, nested inside it:
  div.container
    div.row
      div.col-xs-12.text-center
        h2.h-slim
Then I thought I should at least try to get the "container", but even this
d3.select("div.banner-content.div.container")
OR
d3.select("div.banner-content.container")
doesn't work!
Where's the fault in my logic?
You can find nested elements using d3's chained syntax as shown below.
HTML:
<div class="banner-content">
<div class="container">
<h2 class="h-slim">Header</h2>
</div>
</div>
Code:
d3.select("div.banner-content").select("div.container").select("h2.h-slim")
EDIT: There's no need to type such long syntax. Just separate the selectors with spaces: a space means "descendant", whereas chaining classes with dots (as in div.banner-content.container) requires a single element to carry both classes, which is why your earlier selectors matched nothing.
d3.select("div.banner-content div.container h2.h-slim")
The code below will also give you the same result in this case. You can add the tag names explicitly if necessary.
d3.select("div.banner-content .container .h-slim")