Looping over HTML files with different row numbers in R

I wonder if you could give me a hint on how to get over a problem I encountered when trying to extract data from HTML files. I looked through other questions regarding the issue but still cannot figure out exactly what changes I should make. I have five HTML files in a folder. From each of them, I want to extract HTML links which I will later use. First, I extracted this data without any trouble by reading each HTML file separately and creating a separate data frame for each one containing the links I need (/item.asp?id=). Then I used rbind to merge the columns from each data frame. The key here is that the first three HTML pages have 20 rows of the data I need, the fourth HTML has 16 rows, and the fifth and last has 9 rows.
The looping code works just fine when I loop over the first three pages, which have 20 rows each, but it fails for the fourth and fifth HTML pages because their row numbers differ. I get this error:
Error in `[[<-.data.frame`(`*tmp*`, i, value = c("/item.asp?id=22529120", : replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
library(rvest)

path <- "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file <- ""
file.names <- dir(path, pattern = ".html")

for (i in 1:length(file.names))
{
  page <- read_html(file.names[i])
  links <- page %>% html_nodes("a") %>% html_attr("href")
  ## get all links into a dataframe
  df <- as.data.frame(links)
  ## keep only links which contain /item.asp
  page_article <- df[grep("/item.asp", df$links), ]
  ## for each HTML file save a separate data frame with a links column
  java[i] <- as.data.frame(page_article)
  ## save the number of the page this link comes from
  page_num[i] <- paste(toString(i))
  ## save the id of the person this page belongs to
  id[i] <- as.character(file.names[i])
}
Can anyone give a bit of advice on how to solve this issue? If this works, I should then be able to create a single column with the links, another column with the id, and another with the number of the HTML page.

Write a function which returns a dataframe after reading from each HTML file.
library(rvest)

read_html_files <- function(filename) {
  page <- read_html(filename)
  links <- page %>% html_nodes("a") %>% html_attr("href")
  ## keep only the links which contain /item.asp
  page_article <- grep("/item.asp", links, value = TRUE)
  data.frame(filename, page_article)
}
Use purrr::map_df to apply this function to every file and combine the output into one dataframe (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result
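For reference, a minimal sketch of tidying the combined output, assuming the column names produced above: the id column created by .id = 'id' holds each file's position in file.names, which can serve as the page number, while filename identifies the person.

library(dplyr)

# Purely cosmetic renames so the columns match the names used in the question
result <- result %>%
  rename(page_num = id, links = page_article)
head(result)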

Related

Web Scraping - Unable to determine the node or text heading argument to extract data from a URL via the functions html_node / html_nodes in the rvest package

I aim to extract three data points from a URL. I am able to locate the specific top and individual CSS nodes and xpaths using SelectorGadget, and I aim to use the html_node function (html_node(read_html(url), CSS)) to extract the elements I am interested in.
I have used the main CSS node ("._2t2gK1hs") and was able to extract the first element as a string. The top CSS node appears to embed only the first element, not the other two, although all three elements (one text and two numeric) share the same CSS node pattern (for all three, a heading in "._39sLqIkw" followed by a value in "._1NHwuRzF").
[Snapshot of CSS and SelectorGadget output for the specific data points I would like to extract.]
In attempting to extract the data I tried:
page0_url<-read_html ("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir
Ripoll_Province_of_Girona_Catalonia.html")
html_node(page0_url, "._2t2gK1hs")```
#Resulting in a string with the top element I aim to extract embedded.
{html_node}
<div class="_2t2gK1hs" data-tab="TABS_ABOUT" data-section-signature="about" id="ABOUT_TAB">
[1] <div>\n<div class="_39sLqIkw">PRICE RANGE</div>\n<div class="_1NHwuRzF">€124<!-- --> - <!-- -->€222<!-- --> <!-- -->(Based on Average Rates for a Standard Room) ...
#Failed to extract the remaining two elements by selecting their individual CSSs or xpaths.
library(rvest)
page0_url<-read_html ("https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html")
html_nodes(xpath = "//*[contains(concat( " ", #class, " " ), concat( " ", "_1NHwuRzF", " " ))]") %>%
html_text(trim = TRUE)```
#Tried, without success, passing the specific element node followed/preceded by #PRICE RANGE, #LOCATION, #NUMBER OF ROOMS.
#I wonder how I should pass the argument and what node(s) to use in the above function.
#Expected result
PRICE RANGE
122 222
LOCATION
Spain Catalonia Province of Gerona Ripoll
NUMBER OF ROOMS
5
Thank you
Those classes look dynamic. Here is a hopefully more robust selector strategy, based on the relationship between more stable-looking elements and avoiding the likely dynamic class values:
library(rvest)
library(magrittr)
page0_url <- read_html('https://www.tripadvisor.com/Hotel_Review-g1063979-d1902679-Reviews-Mas_El_Mir-Ripoll_Province_of_Girona_Catalonia.html')

data <- page0_url %>%
  html_nodes('.in-ssr-only [data-tab=TABS_ABOUT] div[class]') %>%
  html_text()
data
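If the selector returns the headings and values as one alternating character vector, a sketch under that assumption (adjust the indexing if the page layout differs) for pairing them up into a named vector:

# Assumes data alternates heading, value, heading, value, ...
stats <- setNames(data[seq(2, length(data), by = 2)],
                  data[seq(1, length(data), by = 2)])
stats["PRICE RANGE"]
stats["NUMBER OF ROOMS"]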

How to scrape ordered and unordered lists in Wikipedia using rvest, relative to a header

I'm wanting to scrape the events for several countries from Wikipedia and place each individual event into a row of a table. A certain date can have one event (where there is a single main bullet point) or multiple events (where there are "sub bullet points").
What I'm having trouble with is how to grab both the ordered and unordered lists at once and separate them cleanly. The code below will grab the "sub bullets", but not the "main" ones. And if I change the code to exclude the /li then it places the "sub bullets" into a single cell. I was wondering if there was a way to separate the "main" and "sub bullet points" more easily.
There appear to be slight differences in the html layout for pages that contain events for different countries. Is it possible to specify an xml path based on a header (rather than a relative or absolute position) and then grab the elements after that? Unfortunately, being so new to html, I'm not quite sure how to do that or if it is even possible. Is it possible to find the header "Events by month", find the header "January" and then get all bullet points and sub bullet points in separate cells of a table?
Any help would be appreciated.
Thank you
# This gets the sub bullet points of the events, but not the main ones
page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")
month_data <- page %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[5]/div[1]/ul[3]/li") %>%
  html_text()
This webpage has no structure; it is just one long list of tags without clearly separated sections.
This is a partial solution:
library(rvest)
library(xml2)
library(dplyr)

page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")
lineitems <- page %>% html_nodes(xpath = "//html/body/div[3]/div[3]/div[5]/div[1]/ul[3]/li")

# Count the number of child nodes of each nested ul
subcount <- lineitems %>% html_node("ul") %>% xml_length()

output <- lapply(1:length(subcount), function(i) {
  if (subcount[i] == 0) {
    # No sub bullets: take the text of the line item itself
    out <- lineitems[i] %>% html_text()
  } else {
    # Sub bullets present: take the text of each nested li
    out <- lineitems[i] %>% html_node("ul") %>%
      html_nodes(xpath = ".//li") %>% html_text()
  }
  out
})

# Name the list items with the date
names(output) <- lineitems %>% html_node("a") %>%
  html_attr("title")

# A list entry for each date
output
I didn't have the time or patience to refine this. You may have an easier time trying to select the nodes based on the available attributes instead of the particular html/xml tags.
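As a sketch of the header-based approach asked about above, assuming the usual Wikipedia markup where each month heading contains a span with an id such as "January" (the exact structure may differ between pages), you could anchor the XPath on the heading and take the first list that follows it:

library(rvest)
library(xml2)

page <- xml2::read_html("https://en.wikipedia.org/wiki/2020_in_the_United_States")

# Assumption: the month heading renders roughly as <h3><span id="January">...</span></h3>;
# grab every li (main and sub bullets) in the first ul after that heading.
january_items <- page %>%
  html_nodes(xpath = "//h3[span[@id='January']]/following-sibling::ul[1]//li") %>%
  html_text()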

Getting data from an html page using R

Can anyone help me figure out why the code below does not return any data for the selected table?
library('httr')
library('rvest')
url <- read_html("http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB")
table <- html_node(url, "table#f05v5-sorting-table.border-top2.border-allside.clearboth")
Thanks!
You are missing some steps. Your workflow should look like this:
library(rvest)

dat_html <- read_html(
  "http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)
dat_nodes <- html_nodes(dat_html, xpath = "xxxx")
dat <- html_table(dat_nodes)
dat will be a list, so if you want a data frame, you could do something like:
dat_df <- as.data.frame(dat)
Or, if you like tibbles:
dat_tbl <- as_tibble(dat)
I cannot find the table you are interested in on that webpage, so you will have to replace "xxxx" with the xpath of that table.
To find the xpath, if you are inspecting the page from Chrome or Chromium, you can right-click on the node in the inspector window and look for Copy, then Copy XPath.
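As a quick sanity check before hunting for an XPath, a sketch that simply counts every table in the downloaded source; if it returns 0, the table is most likely built client-side by JavaScript and read_html will never see it:

library(rvest)

dat_html <- read_html(
  "http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)
# Number of <table> elements present in the static HTML
length(html_nodes(dat_html, "table"))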

Accessing html Tables with rvest

So I am wanting to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
The problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the webpage; however, rvest doesn't seem to recognize it as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table via something along the lines of
table = html_nodes(webpage,'#div_four_factors')
but this doesn't seem to work, as I am getting just an empty list. How can I access the four factors table?
I am by no means an HTML expert, but it appears that the table you are interested in is commented out in the source code, and the comment is then overridden at some point before the page is rendered.
If we assume that the Home team is always listed second, we can just use positional arguments and scrape another table on the page:
table <- html_nodes(webpage, '#bottom_nav_container')
teams <- html_text(table[1]) %>%
  stringr::str_split("Schedule\n")

away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution but such is life in the world of web scraping
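Since the tables appear to be hidden inside HTML comments, another option, sketched here rather than tested against the live page, is to pull out the comment nodes with xml2 and re-parse any that contain a table:

library(rvest)
library(xml2)

webpage <- read_html("https://www.basketball-reference.com/boxscores/201710180BOS.html")

# Find every comment node, keep the ones that contain a <table>,
# and parse each one as its own HTML fragment.
comments <- xml_find_all(webpage, "//comment()")
hidden_tables <- lapply(comments, function(node) {
  txt <- xml_text(node)
  if (grepl("<table", txt, fixed = TRUE)) html_table(read_html(txt)) else NULL
})
hidden_tables <- Filter(Negate(is.null), hidden_tables)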

R to change the values in html form and scrape web data

I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.
I am using the code given in this link Using r to navigate and scrape a webpage with drop down html forms.
However, I am not able to get the data, probably due to a change in the structure of the page. In the code from the link above, pgform <- html_form(pgsession)[[3]] was used to change the values of the form. I was not able to find a similar form in my case.
url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)
pgsource <- read_html(url)
pgform <- html_form(pgsession)
The result in my case:
> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':
Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:
library(rvest)

page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

station_id <- page %>% html_nodes('button#cityname + ul a') %>%
  html_attr('onclick') %>%    # If you need names, grab the `href` attribute, too.
  sub(".*'(.*)'.*", '\\1', .)
which can then be put into expand.grid with the months and years to generate all the necessary combinations:
df <- expand.grid(station_id = station_id,
                  month = sprintf('%02d', 1:12),
                  year = 2014:2016)
(Note that if you want 2017 data, you'll need to construct those combinations separately and rbind them on, so as not to construct months that haven't happened yet; a sketch follows.)
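A minimal sketch of that, assuming (for example) you only want data up to June 2017:

df_2017 <- expand.grid(station_id = station_id,
                       month = sprintf('%02d', 1:6),   # months through June 2017, for example
                       year = 2017)
df <- rbind(df, df_2017)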
The combinations can then be paste0ed into URLs:
urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
               df$station_id, '_', df$year, df$month, '.csv')
which can be lapplyed across to download all the files:
# Warning! This will download a lot of files! Make sure you're in a clean directory.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})
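Once the files are downloaded, a minimal sketch for stacking them into one data frame, assuming every CSV shares the same column layout (in practice the station files can differ slightly, in which case you may need to harmonise columns first):

# Read each downloaded CSV from the working directory and bind the rows together
files <- list.files(pattern = '^DAILYDATA_.*\\.csv$')
weather <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))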