I'm working on a Scrapy project in which I try to extract data on sponsored TripAdvisor listings (https://www.tripadvisor.com/Hotels-g189541-Copenhagen_Zealand-Hotels.html).
This is what the HTML looks like:
<div class="listing_title ui_columns is-gapless is-mobile is-multiline">
<div class="ui_column is-narrow">
<span class="ui_merchandising_pill sponsored_v2">Sponsored</span>
</div>
<div class="ui_column is-narrow title_wrap">
<a target="_blank" href="/Hotel_Review-g189541-d206753-Reviews-Scandic_Front-Copenhagen_Zealand.html" id="property_206753" class="property_title prominent " data-clicksource="HotelName" onclick="return false;" dir="ltr"> Scandic Front</a>
</div>
</div>
I was able to retrieve elements such as the link, id and name with constructs such as response.css(".listing_title").css("a::text").extract().
However, I have trouble retrieving anything from the "Sponsored" tag attached to the accommodation listings: the result is an empty list, even though two listings on the page carry the "Sponsored" tag.
I tried response.css(".sponsored_v2").css("::text").extract() without any success.
What can I do?
It looks like you have a typo, try changing .exctract to .extract, you have an extra c.
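If the typo is not the cause, it can help to confirm that the selector logic itself works on the markup from the question. The following is a minimal sketch using Python's standard-library ElementTree as a stand-in for Scrapy's selectors; `has_class` and `sponsored_titles` are hypothetical helper names, and the snippet assumes the "Sponsored" span is present in the HTML the spider receives. If the span is added by JavaScript after page load, it will not be in the response and any selector will return an empty list.

```python
import xml.etree.ElementTree as ET

# Markup copied from the question (it happens to be well-formed XML).
LISTING = """
<div class="listing_title ui_columns is-gapless is-mobile is-multiline">
  <div class="ui_column is-narrow">
    <span class="ui_merchandising_pill sponsored_v2">Sponsored</span>
  </div>
  <div class="ui_column is-narrow title_wrap">
    <a target="_blank" href="/Hotel_Review-g189541-d206753-Reviews-Scandic_Front-Copenhagen_Zealand.html" id="property_206753" class="property_title prominent " data-clicksource="HotelName" onclick="return false;" dir="ltr"> Scandic Front</a>
  </div>
</div>
"""


def has_class(element, name):
    """Return True if `name` is one of the element's space-separated classes."""
    return name in element.get("class", "").split()


def sponsored_titles(root):
    """Collect the titles of listings that contain a 'sponsored_v2' span."""
    titles = []
    for listing in root.iter("div"):
        if not has_class(listing, "listing_title"):
            continue
        if any(has_class(span, "sponsored_v2") for span in listing.iter("span")):
            titles.extend(
                a.text.strip()
                for a in listing.iter("a")
                if has_class(a, "property_title") and a.text
            )
    return titles


if __name__ == "__main__":
    print(sponsored_titles(ET.fromstring(LISTING)))
```

If this works on the snippet but the spider still returns nothing, comparing response.text with the browser's rendered DOM is a reasonable next step.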
I need to build an XPath to an element that comes before another element on the page. I have a string (FIO) that I can locate with XPath, and I need to anchor on it. I don't understand how to do that.
This is the XPath with which I can find the string on the page:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
Looking at the screenshot, I need to find string 1, which has this XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/a
image
String 2, which contains my parameter (FIO), has this XPath:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]/ul/li[4]/ul/li[1]/div/div/div[1]/span
I shortened it and inserted a variable:
/html/body/div[1]/div[2]/section/div/div[1]/div/ul/li[2]//div[1]/span[contains(., '"+FIO+"')]
How can I build an XPath to element 2 that is anchored on element 1? Maybe with following-sibling?
Sorry, I can't copy the markup properly, only like this:
</div>
</div>
<ul>
<li>
<div class="structure2__item1">
<div class="structure2__item2" style="">
<a class="structure2__position" href=https://**>
"String 2"
</a>
<div class="structure2__name" style="">
<span>String_FIO</span>
</div>
</div>
</div>
</li>
<li>
//div[child::span[contains(text(), "String_FIO")]]/preceding-sibling::a
This selects the a element that precedes the div containing the matching span.
(From next time - please look out for the standards mentioned in the comments.)
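To illustrate the idea of anchoring on the span and walking to the link, here is a minimal sketch in Python. It uses the standard-library ElementTree, which does not support the preceding-sibling axis, so it swaps in an equivalent route: select the div whose span has the given text, step up to the parent, then take its a child. With lxml or Scrapy the expression from the answer can be used as written. The function name `position_for` and the href value are placeholders, and the markup is a cleaned-up copy of the fragment from the question.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the fragment from the question.
# The href is a placeholder; the original URL was not shown.
SAMPLE = """
<ul>
  <li>
    <div class="structure2__item1">
      <div class="structure2__item2">
        <a class="structure2__position" href="https://example.invalid/">String 2</a>
        <div class="structure2__name">
          <span>String_FIO</span>
        </div>
      </div>
    </div>
  </li>
</ul>
"""


def position_for(root, fio):
    """Return the text of the link that sits next to the span holding `fio`.

    Assumes `fio` contains no single quote, since it is spliced into the path.
    """
    # div[span='...'] -> the div whose span child has exactly this text
    # ..              -> its parent (the element that also holds the link)
    # a               -> the link itself
    link = root.find(f".//div[span='{fio}']/../a")
    return None if link is None else (link.text or "").strip()


if __name__ == "__main__":
    print(position_for(ET.fromstring(SAMPLE), "String_FIO"))
```

Unlike contains(), the ElementTree predicate matches the span text exactly, so a partial name would need a Python-side comparison instead.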
OK, I've been trying to figure this out for a while, and I'm not sure it's possible in pure CSS.
I'm trying to add some custom styling to a page of FileRun links that I send to clients. I send a set of subfolders of large TIFF images (I split them up to make the download manageable). Most clients work out that they should go into each subfolder and download them individually. However, the "download all" button appears on the main page of the link, and plenty of less tech-savvy clients send me angry emails complaining that they hit the "download all" button and can't open or download the 5GB zip file that FileRun creates from the entire main folder.
An example of a page is here:
https://demo.filerun.co/wl/?id=T2Gv5oGiGMxO3welkXbaqs92fZ6meJmU
The main limitation is that FileRun is encoded with IonCube, so I only have access to the CSS file and cannot add JavaScript or PHP code.
I've been trying to find a way to hide the DOWNLOAD ALL button <a class="actionBtn"> with .actionBtn {display:none;} on the main link page, but not in any subfolders. I have found that you can tell you are on a subfolder page when there is a breadcrumb with two or more levels containing a caret.
E.g. in the 'elf' subfolder, this can be detected by the presence of the > in the breadcrumb, i.e. <span class="bcSep">></span>
Is there any way to change the styling of actionBtn or the right div, depending on the presence of the <span class="bcSep">></span> or the number of elements in the breadcrumb?
The nesting order in the header div on the root page is:
<div class="left">
<a class="breadCrumb">xxx</a>
</div>
<div class="right">
<a class="actionBtn">DOWNLOAD ALL</a>
</div>
On any subfolders it is:
<div class="left">
<a class="breadCrumb">xxx</a>
<span class="bcSep">></span>
<a class="breadCrumb">xxx</a>
...
</div>
<div class="right">
<a class="actionBtn">DOWNLOAD ALL</a>
</div>
I've tried child selectors, but can't find a way to target the actionBtn or right element from the breadCrumb or left element... Any ideas or am I asking for the impossible from pure CSS?
Since all three of your products (colored, samba and skaven) as well as the DOWNLOAD ALL anchor link have unique URLs, you can use the href attribute value to select only the anchor tag on the homepage, with a CSS attribute selector like this:
a[href="http://someUniqueURL.com/"].actionBtn {
display:none;
}
Check and run the following Code Snippet for a practical example of the above approach:
/* CSS */
a[href="https://demo.filerun.co/?module=weblinks&section=public&multidownload=1&id=T2Gv5oGiGMxO3welkXbaqs92fZ6meJmU"].actionBtn {
display:none;
}
<!-- HTML -->
<p>Homepage Link</p>
Download All
<hr/>
<p>Product 1</p>
Product 1
<hr/>
<p>Product 2</p>
Product 2
<hr/>
<p>Product 3</p>
Product 3
<hr/>
I asked a similar question here,
Trouble getting correct Xpath
but it only got me so far.
I need to grab the links, and I understand that Scrapy needs to verify the HTML. This is the HTML:
<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls ">
<a class="stp-control stp-left stp-hidden"><</a>
<div class="stp-inner">
<div class="stp-slide" style="left: -0%">
<a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878713">
</a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878708">
</a>
So I tried
for widget in response.xpath("//div[@class='shopthepost-widget']"):
    print response.xpath('.//*[@class="shopthepost-widget"]//a/@href').extract()
This returns nothing, but if I replace href with text() it returns everything inside the HTML. That is not what I need: I want only the links, and I need to pass them to an item.
This has got me completely stumped. All help will be met with near infinite thanks.
The answer is the same as for your previous question:
When you load the site in your browser, the JavaScript inside the divs with @class='shopthepost-widget' is executed and fills them with content.
When you load the site with Scrapy, the JavaScript is not executed and the markup stays as it was served, so there are no a tags inside those divs and you get no results.
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script>
<br>
<div class="rs-adblock">
<img onerror="this.parentNode.innerHTML='Disable your ad blocking software to view this content.'" src="//assets.rewardstyle.com/images/search/350.gif" style="height: 15px; width: 15px;"><noscript>JavaScript is currently disabled in this browser. Reactivate it to view this content.</noscript>
</div>
</div>
So your XPath returns nothing because the elements you are looking for do not exist in the response.
However, you could use Chrome, for example, to look at the XHR requests that are sent when the site loads. They seem to contain the results you are looking for. Once you find the right request, you can emulate it by sending it as a Request from your spider and then parsing the response.
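The difference between what the browser shows and what Scrapy receives can be checked offline. The sketch below uses only Python's standard-library html.parser (not Scrapy) to collect href values inside the widget div from two strings: a shortened copy of the unrendered markup from this answer, and a shortened copy of the rendered markup from the question. `WidgetLinkParser` and `extract_links` are hypothetical names, and the loader script body is elided.

```python
from html.parser import HTMLParser

# What Scrapy receives: the loader script has not run (script body elided).
RAW_HTML = """
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">/* loader script elided */</script>
<br>
<div class="rs-adblock"><noscript>JavaScript is currently disabled in this browser.</noscript></div>
</div>
"""

# What the browser shows after the script has filled the widget.
RENDERED_HTML = """
<div class="shopthepost-widget" data-widget-id="708473">
<div class="stp-outer stp-no-controls ">
<a class="stp-control stp-left stp-hidden">&lt;</a>
<div class="stp-inner"><div class="stp-slide">
<a href="http://rstyle.me/iA-n/zzhv34c_" class="stp-product "></a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" class="stp-product "></a>
</div></div>
</div>
</div>
"""


class WidgetLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside the widget div."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth; > 0 while inside the widget
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth > 0:
                self.depth += 1
            elif "shopthepost-widget" in (attrs.get("class") or "").split():
                self.depth = 1
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def extract_links(html):
    parser = WidgetLinkParser()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(extract_links(RAW_HTML))       # no links in the unrendered markup
    print(extract_links(RENDERED_HTML))  # links exist only after rendering
```

Running the same kind of check against response.text in a spider is a quick way to tell whether a selector is wrong or the content simply is not there.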
This is just the HTML part of my PHP code, which produces a listing of products. I want each product image to be clickable and redirect to the product detail page, but the anchor tag does not seem to work.
Here is my code:
<div class="container">
<div class="row">
<a href="/ProductUrl" class="grid-item"> //This code is basically under a loop which results in 6 products
<img src="/img1.jpg" alt="gem">
</a>
</div>
</div>
Right-click the HTML document, view its page source, click the link in the href, and check whether the page is there. The URL might be wrong.
@TheWell has given a good solution. You can also give the URL like this:
<a href="../ProductUrl">
I have a block of markup and need to get data out of it. I have tried different XPath expressions, but with no success.
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>1<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
<div>
<div class="some_class">
<a title="id" href="some_href">
<nobr>2<br>
</a>
</div>
<div class="some_other_class">
<a title="name" href="some_href">
<nobr>John<br>
</a>
</div>
</div>
// and many blocks like this
So these div blocks are identical except for the content of their sub-elements. I need an XPath query that gets John's href from the block whose <a title="id"> is equal to 1.
I've tried something like this:
//div[./div/nobr='1' AND ./div/nobr='John']
to get only the div that contains the data I need; from there it wouldn't be hard to get John's href.
I've also managed to get John's href with:
//a[./nobr='John'][@title='name']/@href
but that does not depend on the value of the <a title="id"...> element, and it has to.
Any suggestions?
I think what you want is
//div/div[a/@title='id']/following-sibling::div[1]/a/@href
which, given a well-formed input document, will return (individual results separated by --------):
href="some_href"
-----------------------
href="some_href"
You did not explain it very clearly though, as kjhughes has noted, and perhaps your sample HTML is not ideal.
Regarding your attempted path expressions, as the input is HTML, it is hard to know whether
<nobr>John<br>
means that "John" is inside the nobr element or not.
Thanks Mathias, your example was helpful, but since there are many elements with @title='id', it isn't a reliable solution that will always match the right elements.
I've found a workaround: first select the whole div, then extract the href I need.
//div[./div/a[@title='name']/nobr='John' and ./div/a[@title='id']/nobr='1']
//a[./nobr='John'][@title='name']/@href
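The two-step workaround (pick the enclosing div by both conditions, then read the href) can be sketched in Python. This uses the standard-library ElementTree, whose XPath subset cannot express the combined predicate above, so the two conditions are checked in Python instead. The sample is a well-formed variant of the markup from the question, assuming the text sits inside nobr; the distinct href values and the function name `href_for` are invented for illustration so the result is visible.

```python
import xml.etree.ElementTree as ET

# Well-formed variant of the question's markup. The href values are made up
# (the original used "some_href" everywhere) so the selected one is visible.
SAMPLE = """
<root>
  <div>
    <div class="some_class"><a title="id" href="id_href_1"><nobr>1</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_href_1"><nobr>John</nobr></a></div>
  </div>
  <div>
    <div class="some_class"><a title="id" href="id_href_2"><nobr>2</nobr></a></div>
    <div class="some_other_class"><a title="name" href="john_href_2"><nobr>John</nobr></a></div>
  </div>
</root>
"""


def href_for(root, wanted_id, wanted_name):
    """Return the name link's href from the block whose id cell matches."""
    for block in root.findall("div"):
        # Step 1: look at the whole block and test both conditions.
        id_cell = block.find("div/a[@title='id']/nobr")
        name_link = block.find("div/a[@title='name']")
        if id_cell is None or name_link is None:
            continue
        name_cell = name_link.find("nobr")
        if name_cell is None:
            continue
        if ((id_cell.text or "").strip() == wanted_id
                and (name_cell.text or "").strip() == wanted_name):
            # Step 2: extract the href from the matching block only.
            return name_link.get("href")
    return None


if __name__ == "__main__":
    print(href_for(ET.fromstring(SAMPLE), "1", "John"))
```

Because the lookup is scoped to one block at a time, a "John" in another block cannot be picked up by accident.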