Web scraping an embedded table using R - html

I am currently working on a project to scrape the contents of the Performance Characteristics table on this website:
https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund
The value I want to extract from this table is the 12m trailing yield of 3.43%.
The code I wrote to do this is:
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
read_html() %>%
html_nodes(xpath='//*[#id="fundamentalsAndRisk"]/div') %>%
html_table()
etf_Data <- etf_Data[[1]]
which returned an empty list and the error message 'Error in etf_Data[[1]] : subscript out of bounds'.
Using Chrome's Inspect tool I have tried various XPaths, including reading the node in with html_text():
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund"
etf_Data <- url %>%
read_html() %>%
html_nodes(xpath='//*[#id="fundamentalsAndRisk"]/div/div[4]/span[2]') %>%
html_text()
etf_Data <- etf_Data[[1]]
However, I have had no success.
Having gone through other Stack Overflow answers I have not been able to solve my issue. Would someone be able to assist?
Thank you,
C

A couple of things:
The content you want lives at a different URI, which you normally reach by manually accepting certain conditions on the page.
The data you want is not within a table.
You can add a query string with the siteEntryPassthrough=true parameter to get to the right URI, then use :contains and an adjacent sibling combinator to select the desired value:
library(rvest)
library(magrittr)
url <- "https://www.ishares.com/uk/individual/en/products/251795/ishares-ftse-100-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true"
trailing_12m_yield <- url %>%
  read_html() %>%
  html_element('.caption:contains("12m Trailing Yield") + .data') %>%
  html_text2()
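If you need the yield as a number rather than the displayed string, a minimal follow-on step (assuming the extracted text looks like "3.43%") could be:
# Strip the "%" sign and any surrounding whitespace, then coerce to numeric ("3.43%" -> 3.43)
yield_numeric <- as.numeric(gsub("%", "", trimws(trailing_12m_yield)))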

Related

Web-Scraping Search Result

I am trying to scrape a Google search result but my element keeps showing up empty. Here is my code:
library(rvest)
url = ("https://www.google.com/search?q=most+popular+place+in+france&rlz=1C1CHBF_enUS875US875&sxsrf=AOaemvIURBlj0HQnAmvIlMyjdcP1WPeh5Q%3A1637686399981&ei=fxydYc2IO42jqtsPkZOVuAw&oq=most+popular+place+in+fr&gs_lcp=Cgdnd3Mtd2l6EAMYADIFCAAQgAQyBQgAEIAEMgYIABAWEB4yCQgAEMkDEBYQHjIGCAAQFhAeMgYIABAWEB4yBggAEBYQHjIGCAAQFhAeMgYIABAWEB4yBggAEBYQHjoECCMQJzoFCAAQkQI6CwguEIAEEMcBENEDOggIABCABBCxAzoLCAAQgAQQsQMQgwE6EQguEIAEELEDEIMBEMcBEKMCOgsILhCABBCxAxCDAToECC4QQzoFCC4QgAQ6CwguEIAEEMcBEK8BOgsILhCABBDHARCjAjoECAAQQzoHCAAQyQMQQzoKCAAQgAQQhwIQFDoICAAQyQMQkQI6DQgAEIAEEIcCELEDEBQ6CwgAEIAEELEDEMkDOgUIABCSAzoICAAQgAQQyQNKBAhBGABQAFiBJGDlLGgBcAJ4AIAB8QGIAa8ckgEGMC4yNC4xmAEAoAEBwAEB&sclient=gws-wiz")
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "///div[1]/div/div/span/div[1]/a/div/div[2]/span[1]") %>%
html_text()
I have searched "most popular place in france" into google and am only trying to scrape in the "Eiffel tower" search result. Please help.

Scraping URLs from wikipedia in R yielding only half the URL

I am currently trying to extract the URLs from the Wikipedia page with a list of chief executive officers; the code then opens those URLs and copies the text into .txt files for me to use. The trouble is that the allurls object only contains the latter half of each URL. For example, allurls[1] gives "/wiki/Pierre_Nanterme". Thus when I run this code
library("xml2")
library("rvest")
url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"
allurls <- url %>% read_html() %>% html_nodes("td:nth-child(2) a") %>%
html_attr("href") %>%
.[!duplicated(.)]%>%lapply(function(x)
read_html(x)%>%html_nodes("body"))%>%
Map(function(x,y)
write_html(x,tempfile(y,fileext=".txt"),options="format"),.,
c(paste("tmp",1:length(.))))
allurls[1]
I get the following error:
Error: '/wiki/Pierre_Nanterme' does not exist.
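The hrefs extracted from the table are relative paths, so read_html() treats them as local files and fails. A minimal sketch of one way to fix this (resolving each href against the Wikipedia base URL with xml2::url_absolute() before reading the pages):
library(xml2)
library(rvest)

base_url <- "https://en.wikipedia.org/wiki/List_of_chief_executive_officers"

# Collect the relative hrefs, drop duplicates, then resolve them to full URLs
allurls <- base_url %>%
  read_html() %>%
  html_nodes("td:nth-child(2) a") %>%
  html_attr("href") %>%
  unique() %>%
  url_absolute(base = base_url)

allurls[1]
# e.g. "https://en.wikipedia.org/wiki/Pierre_Nanterme"
The full URLs can then be fed into the rest of the pipeline (the lapply/Map steps) unchanged.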

Web scraping using R - Table of many pages

I have this website, which has a table that spans many pages. Can someone help me read all pages of that table into R?
Website:
https://www.fdic.gov/bank/individual/failed/banklist.html
You can scrape the entire HTML table using the rvest package; see the code below. It automatically identifies the table and reads in all 555 entries.
library(rvest)

URL <- "https://www.fdic.gov/bank/individual/failed/banklist.html"
failed_banks <- URL %>%
  read_html() %>%
  html_table() %>%
  as.data.frame()
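Note that html_table() returns a list of data frames, one per table in the page. If as.data.frame() alters the column names (e.g. replacing spaces with dots), an alternative sketch is to extract the single table explicitly:
failed_banks <- URL %>%
  read_html() %>%
  html_table() %>%
  .[[1]]   # keep the first (and only) table as a data frame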

Scraping data with rvest from a form

I would like to scrape data from this link: http://dati.acs.beniculturali.it/CPC/CPC.detail.html?A00001
library(rvest)
library(dplyr)

url <- 'http://dati.acs.beniculturali.it/CPC/CPC.detail.html?A00001'
read_html(url) %>%
  html_node(xpath = '//*[@id="dataContainer"]/div[1]') %>%
  html_text()
As far as I understand, the problem seems to be that the data is not in table format; otherwise I could use html_table() to extract what I need. Looking at the HTML structure, the div is nested in a series of divs that sit inside a form. I also tried:
read_html(url) %>%
  html_node('form') %>%
  html_text()
But I only get a series of \n. What am I missing?

Get span content using rvest

I'm trying to scrape a set of web pages with the rvest package. It works for getting the content of the web pages, but I can't get the create time of the first floor, which is 2017-08-17 01:47 for this web page.
url <- read_html("http://tieba.baidu.com/p/5275787419", encoding = "UTF-8")
# This works
contents <- url %>% html_nodes(".d_post_content_firstfloor .clearfix") %>% html_text()
# This doesn't work
create_time <- url %>% html_nodes(".d_post_content_firstfloor li+ li span") %>% html_text()
create_time
character(0)
I want to get the create time of the first floor on the page, but I don't know how to access it.
One way to achieve this could be:
create_time <- url %>%
  html_nodes(xpath = '//*[@id="j_p_postlist"]/div[1]') %>%
  xml_attr("data-field")
gsub(".*date\\\":\\\"(.*)\\\",\\\"vote_crypt.*", "\\1", create_time)
Output is:
[1] "2017-08-17 01:47"
Hope this helps!
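If the regular expression feels brittle, an alternative sketch is to parse the data-field attribute with jsonlite (assuming the attribute holds valid JSON; the exact nesting of the date element is an assumption here, so inspect the parsed structure first):
library(rvest)
library(jsonlite)

# url is the read_html() result from the question above
post_field <- url %>%
  html_nodes(xpath = '//*[@id="j_p_postlist"]/div[1]') %>%
  html_attr("data-field")

# Parse the attribute's JSON payload into a nested list
post_data <- fromJSON(post_field)

# Inspect the structure to find where the date lives, e.g.:
str(post_data)
# post_data$content$date   # hypothetical path that should contain "2017-08-17 01:47"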