Selecting href of link with image inside using xpath - html

I'm using scrapy to write a scraper that finds links with images inside them and grabs the link's href. The page I'm scraping is populated with image thumbnails, and when you click on the thumbnail it links to a full size version of the image. I'd like to grab the full size images.
The html looks somewhat like this:
<a href="example.com/full_size_image.jpg">
<img src="example.com/image_thumbnail.jpg">
</a>
And I want to grab "example.com/full_size_image.jpg".
My current method of doing so is
img_urls = scrapy.Selector(response).xpath('//a/img/..').xpath("@href").extract()
But I'd like to reduce that to a single xpath expression, as I plan to allow the user to enter their own xpath expression string.

You can test whether an element has a particular child element with a predicate:
response.xpath('//a[img]/@href').extract()
Note that I'm using the response.xpath() shortcut and providing a single XPath expression.
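To see the a[img] predicate in isolation, here is a minimal sketch that runs without Scrapy. It uses Python's standard xml.etree.ElementTree, which supports only a subset of XPath: the a[img] predicate works, but attribute selection does not, so the href is read with .get() instead. The sample markup and the function name are made up for illustration, and the markup has to be well-formed XML (note the self-closing img tag).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup; it must be well-formed XML for ElementTree.
SAMPLE = """
<div>
  <a href="example.com/full_size_image.jpg"><img src="example.com/image_thumbnail.jpg"/></a>
  <a href="example.com/about.html">About</a>
</div>
"""


def hrefs_of_image_links(markup):
    """Return the href of every <a> element that has a direct <img> child."""
    root = ET.fromstring(markup)
    # The [img] predicate keeps only <a> elements with an <img> child.
    return [a.get("href") for a in root.findall(".//a[img]")]


print(hrefs_of_image_links(SAMPLE))
```

The text-only link is skipped because its a element has no img child.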

Related

Scrape the content inside of a div tag, which is not displayed as text

I am scraping amazon reviews and they give an unique identifier to each review which I would like to scrape. However the identifier is never displayed as text but just exists in the following form:
<div id="R2XLFP626GRWEM" data-hook="review" class="a-section review aok-relative">
I want "R2XLFP626GRWEM" to be returned.
When using
response.xpath('.//div[@data-hook="review"]').extract()
I get the whole content of the div tag, which is quite a lot, considering that the whole review is embedded in it.
You can get the id values with a CSS selector instead of XPath:
response.css('.a-section.review::attr(id)').extract()
or with XPath:
response.xpath('//*[@class="a-section review aok-relative"]/@id').extract()
or by extending your original XPath query:
response.xpath('.//div[@data-hook="review"]/@id').extract()
To select an attribute with XPath, prefix its name with @.
For example, in your case:
response.xpath(".//div[@class='a-section review aok-relative']/@id").extract()

How to output background-image: url this image.jpg

I want to get the image URL (image.jpg) out of this element's background-image style, for example with XPath.
I have been trying for a few days with no luck. Is this possible?
<div class="vjs-poster" tabindex="-1" aria-disabled="false" style="background-image: url('https://image.jpg');"></div>
First, get the HTML element whose background image you want to extract (for example with document.getElementById or jQuery), then read the style property that holds the background image:
element.style.backgroundImage
After that you can manipulate the string however you want, either by splitting on "url" or with a regular expression.
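The regular-expression step can be sketched as follows. The answer above is about JavaScript in the browser; this sketch uses Python only to show the string handling, and it assumes the style value has the usual url(...) form with optional single or double quotes. The function name is hypothetical.

```python
import re

# Matches url(...), with or without quotes around the address.
URL_PATTERN = re.compile(r"""url\((["']?)(.*?)\1\)""")


def extract_background_url(style):
    """Return the address inside the first url(...) of a style string, or None."""
    match = URL_PATTERN.search(style)
    return match.group(2) if match else None


print(extract_background_url('background-image: url("https://image.jpg");'))
```

The backreference makes the closing quote match the opening one, so an unquoted url(...) value is handled by the same pattern.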

Convert to CSS Selector

I'm trying to convert the XPath for the image button below into a CSS selector. I want to click the button, but the click does not register when I use XPath.
HTML:
<img src="../../../../imagepool/transparent%21tmlservicedesk?cid=1"
id="reg_img_304316340" aralttxt="1" artxt="Show Application List"
arimgcenter="1" alt="Show Application List" title="Show Application List"
class="btnimg" style="top:0px; left:0px; width:23px; height:140px;">
XPath generated for it:
//div[@class='btnimgdiv']/img[@id='reg_img_304316340']/@src
I've read in some articles that CSS selectors work much better than XPath for image buttons, and I'd like to know how to convert this to a CSS selector.
"Image button which I want to click but not getting clicked while using XPath"
This is likely because you are locating the element by its id attribute, whose value looks dynamically generated.
"Read some of the articles that for image buttons CSS selector is much better than XPath"
Yes, locating an element with a CSS selector is generally faster than with XPath.
"wanted to know how to convert the html to CSS selector."
Locate the element by an attribute value that is unique and stable. For example, you can use this CSS selector:
img.btnimg[title='Show Application List']
Reference links:
http://www.w3schools.com/cssref/css_selectors.asp
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors

Partial HTML Selection Using Jsoup

I was wondering if there is a way to find the element that contains a specific string that you know exists on an HTML page as part of an attribute. For example, I know that "Apr-16-2015" is somewhere in an attribute on the HTML page. If I go look for it, it's part of the title attribute:
<a title="Apr-16-2015 5:04 AM"
However, I do not have the information about the exact time, i.e. the "5:04 AM". I was wondering if there is a way to partially search an attribute in order for it to return the full element.
This is my code:
org.jsoup.nodes.Element links = lastPage.select("[title=\"Apr-16-2015\"]").first();
Again, it doesn't work because I did not enter the full title attribute, as given above. My question: is there any way to make this selector work without the full value, since I won't have the latter part of the attribute at my disposal?
You can use it in the following way:
lastPage.select("[title^=\"Apr-16-2015\"]").first();
As described in the jsoup documentation:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes
that start with, end with, or contain the value, e.g. [href*=/path/]
References:
http://jsoup.org/cookbook/extracting-data/selector-syntax
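To make the "starts with" semantics of [attr^=value] concrete, here is a rough Python sketch using only the standard html.parser module. It is not jsoup and not a CSS selector engine; it just collects every start tag whose title attribute begins with a given prefix. The class name and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class TitlePrefixFinder(HTMLParser):
    """Collect (tag, title) for start tags whose title starts with a prefix."""

    def __init__(self, prefix):
        super().__init__()
        self.prefix = prefix
        self.matches = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        title = dict(attrs).get("title")
        if title is not None and title.startswith(self.prefix):
            self.matches.append((tag, title))


# Hypothetical sample markup with one matching and one non-matching link.
finder = TitlePrefixFinder("Apr-16-2015")
finder.feed(
    '<a title="Apr-16-2015 5:04 AM">first</a>'
    '<a title="Apr-17-2015 9:00 AM">second</a>'
)
print(finder.matches)
```

Only the first link is collected, because only its title begins with the known date; the unknown time part does not need to be supplied.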

how to get page to a specific part of page?

I'm trying to create a page where a user clicks on a link on the left and is taken to a specific section on the page.
Here is an example. I've added as much of the code I'm using as I can.
What you're trying to do works with the id or name attribute.
To elaborate: the anchor tag that you're rendering as the target the page should jump to should be:
<a id="myId"></a>
or
<a name="myId"></a>
or both.
When you build a link to another part of the page, you need two parts, the link (that you click), and the target (that the page scrolls to).
The link's href attribute needs to start with a '#'. This signifies that the link is 'internal' to the page, and not another, external page.
The target can be either a named anchor <a name="something"></a> or an element with an ID: <div id="something">. You don't include the '#' in the name or the ID.
That's the key part you're missing. Take the '#' off the front of your <a name=""> values and it will work.
Let us know if that works, and we can help you develop this further: There's a lot more polish you could add, but let's get the basics working first.