Scraping one html element into another - html

I'm trying to collect URLs from a list in a table using R. However, the table is an HTML element embedded inside the web page, so my XPath doesn't work as expected. I get the following result:
> doc<-read_html(url("http://www.bibliotecanacional.gov.co/rnbp/directorio-de-bibliotecas-publicas"))
> v<-toString(xml_find_all(doc, xpath='//*[@id="ContentPlaceHolder1_Ejemplo2_GridviewConCSSFriendly1_GridViewJedis_LinkButton1_0"]'))
> v
[1] ""
I obtained the XPath by inspecting the URL element in the browser.
I would be grateful for your help. Thanks.

That page contains an iframe, so you need to switch to the iframe before you can get an element inside it.
The iframe has the title: Libros digitales y aplicaciones producidas BNC
I'm not sure how to do that with the tools you're using, but you should be able to look it up easily.
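One way to picture the "switch to the iframe" step without a browser driver: read the iframe's src from the outer page and request that document separately. Below is a minimal sketch in Python (not R), using only the standard library. The sample HTML and the `/rnbp/tabla.aspx` path are made up for illustration; the real page's iframe src may look quite different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class IframeFinder(HTMLParser):
    """Collect the src attribute of every <iframe> in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "iframe":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def iframe_urls(page_url, html):
    """Return absolute URLs of the documents embedded via <iframe>."""
    finder = IframeFinder()
    finder.feed(html)
    return [urljoin(page_url, src) for src in finder.sources]


# Hypothetical outer page: the iframe path below is invented for this example.
SAMPLE_PAGE = """
<html><body>
  <h1>Directorio</h1>
  <iframe title="example" src="/rnbp/tabla.aspx"></iframe>
</body></html>
"""

urls = iframe_urls(
    "http://www.bibliotecanacional.gov.co/rnbp/directorio-de-bibliotecas-publicas",
    SAMPLE_PAGE,
)
print(urls)
```

Each URL returned this way can then be fetched and parsed on its own, and the XPath applied to that inner document rather than to the outer page.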

Related

Fetching tags with selenium href within a class name

I am new to Selenium and am using it to extract the links from Google search results. I want to collect all the result links. This is what the HTML looks like; the link I want to extract is in the <a href= >:
<div class='r'>
<a href="https://www.linkedin.com/in/thu-huong-trish-nguyen-7bba5722" ping="/url?sa=t&source=web&rct=j&url=https://www.linkedin.com/in/thu-huong-trish-nguyen-7bba5722&ved=2ahUKEwiqw5D0qt3rAhVG7J4KHd3GBbQQFjAAegQIAxAB"><br><h3 class="LC20lb DKV0Md">Thu-Huong (Trish) Nguyen - Research Data Analyst II - LinkedIn</h3><div class="TbwUpd NJjxre"><cite class="iUh30 gBIQub bc tjvcx">www.linkedin.com<span class="eipWBe"> › ... </span></cite></div></a>
The rest of the results have exactly the same classes and structure; I essentially want the https://www.linkedin.com link. This was my attempt:
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))
This works well and extracts pretty much every link on the Google results page. The only problem is that I want only the link type I specified; my code returns the links I want plus a great many that I don't.
A solution I thought would work is to use the fact that all these links sit inside a div with class r.
I tried incorporating the r class into driver.find_elements but have not found a solution online.
Any ideas?
This XPath will get all the a tags whose href contains https://www.linkedin.com:
//div[@class='g']//div[@class='r']/a[contains(@href, 'https://www.linkedin.com')]
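To see the effect of scoping the search to div class r without opening a browser, here is a rough sketch using Python's standard-library ElementTree on a small, made-up, well-formed sample. ElementTree supports only a subset of XPath (no contains()), so the substring check is done in Python instead; with Selenium or lxml the full expression above can be used directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified results page; real Google markup is far messier.
SAMPLE = """
<html><body>
  <div class="g">
    <div class="r"><a href="https://www.linkedin.com/in/example-profile">Profile</a></div>
  </div>
  <div class="g">
    <div class="r"><a href="https://example.com/other">Other result</a></div>
  </div>
  <a href="https://www.linkedin.com/help">Footer link outside the results</a>
</body></html>
""".strip()

root = ET.fromstring(SAMPLE)

# Step 1: only anchors that are direct children of <div class="r">.
anchors = root.findall(".//div[@class='r']/a")

# Step 2: emulate contains(@href, ...) with a plain substring test.
links = [
    a.get("href")
    for a in anchors
    if "https://www.linkedin.com" in a.get("href", "")
]
print(links)
```

The footer link is skipped because it is not inside a div with class r, and the second result is skipped because its href does not contain the LinkedIn prefix.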

Scrape the content inside of a div tag, which is not displayed as text

I am scraping Amazon reviews, and each review is given a unique identifier that I would like to scrape. However, the identifier is never displayed as text; it only exists in the following form:
<div id="R2XLFP626GRWEM" data-hook="review" class="a-section review aok-relative">
I want "R2XLFP626GRWEM" to be returned.
When using
response.xpath('.//div[@data-hook="review"]').extract()
I get the whole content of the div tag, which is quite a lot, considering that the whole review is embedded in it.
You can get the id values by using CSS selectors instead of XPath, as below (note there is no space between the two classes, because both are on the same div).
response.css('.a-section.review::attr(id)').extract()
or by using XPath
response.xpath('//*[@class="a-section review aok-relative"]/@id').extract()
or by modifying the original XPath query
response.xpath('.//div[@data-hook="review"]/@id').extract()
To collect attribute data using XPath, use @ followed by the attribute name.
For example, in your case:
response.xpath(".//div[@class='a-section review aok-relative']/@id").extract()
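For anyone who wants to try the idea outside Scrapy, here is a small sketch with Python's standard-library ElementTree. It cannot evaluate `/@id` directly, so the element is selected first and the attribute is read with `.get()`. The second review id in the sample is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified sample; the second id is hypothetical.
SAMPLE = """
<div id="reviews">
  <div id="R2XLFP626GRWEM" data-hook="review" class="a-section review aok-relative">
    <span>Long review body ...</span>
  </div>
  <div id="R0HYPOTHETICAL1" data-hook="review" class="a-section review aok-relative">
    <span>Another review body ...</span>
  </div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# Select the review containers, then read only their id attribute
# instead of serialising the whole element.
review_ids = [div.get("id") for div in root.findall(".//div[@data-hook='review']")]
print(review_ids)
```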

Scraping HTML elements between ::before and ::after with scrapy and xpath

I am trying to scrape some links from a webpage in Python with Scrapy and XPath, but the elements I want to scrape sit between ::before and ::after, so XPath can't see them: they do not exist in the HTML but are created dynamically with JavaScript. Is there a way to scrape those elements?
::before
<div class="well-white">...</div>
<div class="well-white">...</div>
<div class="well-white">...</div>
::after
This is the actual page http://ec.europa.eu/research/participants/portal/desktop/en/opportunities/amif/calls/amif-2018-ag-inte.html#c,topics=callIdentifier/t/AMIF-2018-AG-INTE/1/1/1/default-group&callStatus/t/Forthcoming/1/1/0/default-group&callStatus/t/Open/1/1/0/default-group&callStatus/t/Closed/1/1/0/default-group&+identifier/desc
I can't replicate your exact document state.
However, if you load the page, you can see some template markup in the same format as your example data.
Also, if you check the XHR tab of the network inspector, you can see AJAX requests being made for JSON data.
So you can download all the data you are looking for in handy JSON format here:
http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json
scrapy shell "http://ec.europa.eu/research/participants/portal/data/call/amif/amif_topics.json"
> import json
> data = json.loads(response.text)
> data['topicData']['Topics'][0]
{'topicId': 1259874, 'ccm2Id': 31081390, 'subCallId': 910867, ...
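Once the JSON body is in hand, pulling fields out of it needs no HTML parsing at all. Here is a minimal sketch, assuming the response keeps the `topicData` / `Topics` shape shown above. The sample body is a hand-written stand-in: the first entry copies the values from the shell output, the second entry is invented.

```python
import json

# Stand-in for the response body; only the first entry mirrors real output.
SAMPLE_BODY = """
{"topicData": {"Topics": [
  {"topicId": 1259874, "ccm2Id": 31081390, "subCallId": 910867},
  {"topicId": 1, "ccm2Id": 2, "subCallId": 3}
]}}
"""


def topic_ids(body):
    """Return the topicId of every entry under topicData -> Topics."""
    data = json.loads(body)
    return [topic["topicId"] for topic in data["topicData"]["Topics"]]


ids = topic_ids(SAMPLE_BODY)
print(ids)
```

In a spider, the same function could be fed `response.text` from the JSON endpoint.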
Very easy!
You can combine "Absolute XPath" and "Relative XPath" (https://www.guru99.com/xpath-selenium.html). With this trick you can get past ::before (and maybe ::after). For example, in your case, supposing that
//div[@id='"+FindField+"']//following::td[@class='KKKK'] comes before your "div":
FindField = 'your "id" associated with the "div"'
driver.find_element_by_xpath("//div[@id='" + FindField + "']//following::td[@class='KKKK']/div")
NOTE: only one "/" must be used.
You can also use only "Absolute XPath" for the whole address (note: the address must start with "//").

Selecting href of link with image inside using xpath

I'm using scrapy to write a scraper that finds links with images inside them and grabs the link's href. The page I'm scraping is populated with image thumbnails, and when you click on the thumbnail it links to a full size version of the image. I'd like to grab the full size images.
The html looks somewhat like this:
<a href="example.com/full_size_image.jpg">
<img src="example.com/image_thumbnail.jpg">
</a>
And I want to grab "example.com/full_size_image.jpg".
My current method of doing so is
img_urls = scrapy.Selector(response).xpath('//a/img/..').xpath("@href").extract()
But I'd like to reduce that to a single xpath expression, as I plan to allow the user to enter their own xpath expression string.
You can check whether an element has a particular child element this way:
response.xpath('//a[img]/@href').extract()
Note that I'm using the response.xpath() shortcut and providing a single XPath expression.
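The `a[img]` predicate can be tried out without Scrapy, since Python's standard-library ElementTree also understands `[tag]` child predicates. A small sketch on a made-up sample follows; ElementTree cannot return `@href` directly, so the attribute is read with `.get()`.

```python
import xml.etree.ElementTree as ET

# Made-up page fragment: one link wraps an image, the other is plain text.
SAMPLE = """
<div>
  <a href="example.com/full_size_image.jpg"><img src="example.com/image_thumbnail.jpg" /></a>
  <a href="example.com/about.html">About</a>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

# a[img] keeps only the <a> elements that have an <img> child.
hrefs = [a.get("href") for a in root.findall(".//a[img]")]
print(hrefs)
```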

Partial HTML Selection Using Jsoup

I was wondering if there is a way to find the element that contains a specific string that you know exists on an HTML page as part of an attribute. For example, I know that "Apr-16-2015" is somewhere in an attribute on the HTML page. If I go looking for it, it's part of the title attribute:
<a title="Apr-16-2015 5:04 AM"
However, I do not have the information about the exact time, i.e. the "5:04 AM". I was wondering if there is a way to partially search an attribute in order for it to return the full element.
This is my code:
org.jsoup.nodes.Element links = lastPage.select("[title=\"Apr-16-2015\"]").first();
Again, it doesn't work because I did not enter the full title attribute, as given above. My question: is there any way to make this selector work without entering the full value, since I won't have the latter part of the attribute at my disposal?
You can use it in the following way:
lastPage.select("[title^=\"Apr-16-2015\"]").first();
As described in the jsoup documentation:
[attr^=value], [attr$=value], [attr*=value]: elements with attributes
that start with, end with, or contain the value, e.g. [href*=/path/]
References:
http://jsoup.org/cookbook/extracting-data/selector-syntax