Difficulty grabbing links inside HTML with scrapy


I asked a similar question here ("Trouble getting correct Xpath"), but it only got me so far.
I need to grab the links, and I understand that Scrapy needs to verify the HTML. This is the HTML:
<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls ">
<a class="stp-control stp-left stp-hidden"><</a>
<div class="stp-inner">
<div class="stp-slide" style="left: -0%">
<a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878713">
</a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878708">
</a>
So I tried
for widget in response.xpath("//div[@class='shopthepost-widget']"):
    print(response.xpath('.//*[@class="shopthepost-widget"]//a/@href').extract())
This yields nothing, but if I replace @href with text() it returns all the content inside the HTML. That is not what I need. I want only the links, and I need them to be passed to an item.
This has got me completely stumped. All help will be met with near infinite thanks.

Again, I can only repeat what I told you on your previous question:
When you load the site in your browser, the JavaScript inside the divs with @class='shopthepost-widget' is executed and fills them with the links.
When you load the site with Scrapy, the JavaScript is not executed, so the markup stays as it was delivered and there are no a tags inside those divs. This is what Scrapy receives:
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script>
<br>
<div class="rs-adblock">
<img onerror="this.parentNode.innerHTML='Disable your ad blocking software to view this content.'" src="//assets.rewardstyle.com/images/search/350.gif" style="height: 15px; width: 15px;"><noscript>JavaScript is currently disabled in this browser. Reactivate it to view this content.</noscript>
</div>
</div>
So your XPath returns nothing because the elements it targets do not exist in the response Scrapy receives.
However, you could use Chrome's developer tools, for example, to look at the XHR requests sent when the page loads. They seem to contain the results you are looking for. Once you have found the right request, you can emulate it by sending it as a Request from your spider and then parse its response.
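As a rough sketch of the parsing step: assuming the emulated request returns markup shaped like the rendered widget quoted in the question (an assumption, since the real response format was not shown and may well be JSON or something else), the links could be pulled out as below. The names ProductLinkParser, extract_product_links and SAMPLE are hypothetical, and the standard-library HTMLParser is used here instead of Scrapy selectors only so the example runs without Scrapy or a network connection.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect href values from <a> tags whose class list contains 'stp-product'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute in the question has trailing whitespace, so split it.
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if "stp-product" in classes and href:
            self.links.append(href)


def extract_product_links(html_text):
    """Return the product link URLs found in html_text, in document order."""
    parser = ProductLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Sample copied from the rendered markup in the question; a real response may differ.
SAMPLE = """
<div class="stp-slide" style="left: -0%">
<a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878713">
</a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878708">
</a>
<a class="stp-control stp-left stp-hidden">&lt;</a>
</div>
"""

links = extract_product_links(SAMPLE)
print(links)
```

Inside a Scrapy callback the same selection could probably be written as response.xpath('//a[contains(@class, "stp-product")]/@href').extract(), with each resulting URL then assigned to an item field.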

Related

CSS Selector / XPath needed for accessing a <span>

I'm working on a Scrapy project in which I try to extract data on sponsored TripAdvisor listings (https://www.tripadvisor.com/Hotels-g189541-Copenhagen_Zealand-Hotels.html).
This is what the HTML looks like:
<div class="listing_title ui_columns is-gapless is-mobile is-multiline">
<div class="ui_column is-narrow">
<span class="ui_merchandising_pill sponsored_v2">Sponsored</span>
</div>
<div class="ui_column is-narrow title_wrap">
<a target="_blank" href="/Hotel_Review-g189541-d206753-Reviews-Scandic_Front-Copenhagen_Zealand.html" id="property_206753" class="property_title prominent " data-clicksource="HotelName" onclick="return false;" dir="ltr"> Scandic Front</a>
</div>
</div>
I was able to successfully retrieve elements such as the link, id, name with constructs such as response.css(".listing_title").css("a::text").extract().
However, I have trouble retrieving anything from the "Sponsored" tag attached to the accommodation listings: the result is an empty list, even though two listings on the page carry the "Sponsored" tag.
I tried response.css(".sponsored_v2").css("::text").extract() without any success.
What can I do?

It looks like you have a typo: change .exctract to .extract (there is an extra c).

Modify an HTML file based on a Google Doc/spreadsheet to embed Instagram

I am rather new to this, so please forgive me if this is obvious or makes no sense. I have not been able to find a way to embed an Instagram page's feed the way you can with Twitter or Facebook. The only options I have found go through a third-party site that either takes up the full page or requires a monthly subscription.
I am trying to build an HTML file with a script that pulls a set of two links from a Google spreadsheet or Google Doc and loads them into link and image tags, without having to modify the HTML file every time. Currently I take the source link from Instagram and the image link, create a linked image from them, and display that file in an iframe. This was my attempt at a hopefully simple workaround, and I would like to know if it is possible. This is my current basic setup:
<a target="_parent" href="##link##">
<img class="image" src="##link##" style="height:auto; width:100%;">
</a>
My thinking was that if I could connect either a Google Sheet or a Google Doc, I could use a script along these lines:
<a id="1" target="_parent" href=""><img id="1" class="image" src="" style="height:auto; width:100%;"></a>
<a id="2" target="_parent" href=""><img id="2" class="image" src="" style="height:auto; width:100%;"></a>
<script>
var linkA = a, linkB = b, L2=1, L1=1;
for(a=0; a<100; a++){
document.getElementById(L1).src = [A(L1)]; //not sure if this is the proper way to locate the coordinates from the sheet
L1++;}
for(b=0; b<100; b++){
document.getElementById(L2).href = [B(L2)];
L2++;}
</script>
The issue I have run into is how to get the data from the sheet, and then how to use that data in a script so that it replaces the links. Ideally I would simply add the two links to the sheet and the script would load them onto the page the next time it loads.

regex adding target="_blank" to all links but exclude ones that already have target="_blank" or links that have <a name="..."> and <a href="#...">

So I have this regex that I designed, but I can't seem to exclude links that already have target="_blank", or anchors such as <a name="..."> or <a href="#...">. How would I exclude links that already have target="_blank" and avoid adding target="_blank" to anchor links?
Find: <a href=(".*)|([^#][^"]*)\\s>(\w.*)(</a>)
Replace: <a href=$1 target="_blank"$2$3

Regex is notoriously the wrong tool for this job.
HTML is structured data that regex doesn't understand, which means you run into exactly the sort of issues you're having: for any non-trivial problem, the many allowed variations in HTML structure make it very difficult to parse using string manipulation techniques.
DOM methods are designed for manipulating that sort of data, so use them instead. The following will loop through every <a> tag in the document, exclude those with no href attribute, those whose href begins with '#', or those with a name attribute, and set the 'target' attribute on the rest.
Array.from(document.getElementsByTagName('a')).forEach(function(a) {
  if (
    a.getAttribute("href") &&
    a.getAttribute("href").indexOf('#') !== 0 &&
    a.getAttribute("name") === null
  ) {
    a.setAttribute('target', '_blank'); // on links that already have this attribute this will do nothing
  }
});
// Just to confirm:
console.log(document.getElementById('container').innerHTML)
<div id="container">
test
test2
test3
<a name="foo">test4</a>
</div>

Link that forwards a page with params jsp

I'm building a website for a school project. On one page I need a link to another page (profile.jsp) with parameters, because I need to run a query on that page. I've tried a forward in JSP, but it opens the page right away; I need to trigger it with a link or a button. Here I'm using window.location, but as far as I know it doesn't let me pass parameters.
<div class="mask_container">
<h2><%out.println(title);%></h2>
<p onClick="JavaScript:window.location='profile.jsp';"><%out.println(autor);%></p>
<img src="img/like-white.png" class="social">
<img src="img/comment-white.png" class="social">
<img src="img/download-white.png" class="social">
</div>

How about passing your parameter values in the query string, as shown below? Instead of the <p> tag, use an <a> tag to link to the other page:
<a href="profile.jsp?param=<%=autor%>"><%out.println(autor);%></a>
On the other page you can read the value with request.getParameter("param").
Hope this helps.

Simple HTML anchor tag is not redirecting on click

This is just the HTML part of my PHP code, which produces a listing of products. I want the product image to be clickable and redirect to the product detail page, but the anchor tag does not seem to work.
Here is my code:
<div class="container">
<div class="row">
<a href="/ProductUrl" class="grid-item"> //This code is basically under a loop which results in 6 products
<img src="/img1.jpg" alt="gem">
</a>
</div>
</div>

Right-click the HTML document, view the page source, and click the link in the href to check whether the target exists. It might be pointing to the wrong URL.

@TheWell has given a good solution. You can also give the URL like this:
<a href="../ProductUrl">

