How to scrape with table class name with R? - html

I am tryng to scrape several web pages, particulaty some tables in the pages.
But the problem is the places of tables change with respect to each page.
Here is my code.
url <- paste0("https://en.wikipedia.org/wiki/2011%E2%80%9312_Welsh_Premier_League")
webpage <- read_html(url)
j<-webpage%>% html_node(xpath='//*[#id="mw-content-text"]/div[1]/table') %>%html_table(fill=T)
This code works fine, but I want to scrape the other seaons, too. The place of table changes in every season.
My question is I found that the table class that I want to scrape is named as "wikitable plainrowheaders", as below. I would like to know how to scrape with table class name.
How to scrape all tables with table class named as "wikitable plainrowheaders" in a wikipedia page?
<table class="wikitable plainrowheaders" style="text-align:center;font-size:100%;">

Since you know the table class name, just change the corresponding xpath.
library(rvest)
url <- paste0("https://en.wikipedia.org/wiki/2011%E2%80%9312_Welsh_Premier_League")
webpage <- read_html(url)
j <- webpage %>%
html_nodes(xpath="//table[#class='wikitable plainrowheaders']") %>%
html_table(fill=T)

Related

Looping: with different row number in R

I wonder if you could give me a hint on how to get over the problem I encountered when trying to extract data from HTML files. I looked through other questions regarding the issue but still cannot figure out what changes exactly should I make. I have five HTML files in a folder. From each of them, I want to extract HTML links which I will later use. First, I extracted this data without any effort reading every HTML separately and creating a separate data frame for each HTML with much-needed links (/item.asp?id=). Then I used a 'rbind' function to merge columns from each data frame. The key here is that the first three HTML pages have 20 rows of the data I need, the fourth HTML has 16 rows, and the fifth and the last has 9 rows.
The looping code works just fine when I loop over the first three pages for which I have 20 rows each, but the code doesn't allow me to do the same for the fourth and fifth HTML pages because there the row number is different. I get the problem:
Error in [[<-.data.frame(*tmp*, i, value = c("/item.asp?id=22529120", : replacement has 16 rows, data has 20
The code is as follows:
#LOOP over others
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
out.file<-""
file.names <- dir(path, pattern =".html")
for (i in 1:length(file.names))
{
page <- read_html(file.names[i])
links <- page %>% html_nodes("a") %>% html_attr("href")
##get all links into a dataframe
df <-as.data.frame(links)
##get links which contain /item.asp
page_article <- df[grep("/item.asp", df$links), ]
##for each HTML save a separate data frame with links column
java[i] <-as.data.frame(page_article)
##save number of a page where this link is
page_num[i] <- paste(toString(i))
##save id of a person this page belongs to
id[i] <- as.character(file.names[i])
}
Can anyone give a bit of advice on how to solve this issue? If I am successful, I then must be capable to create a single column with links, another column with an id and a number of the HTML page.
Write a function which returns a dataframe after reading from each HTML file.
read_html_files <- function(filename) {
page <- read_html(filename)
links <- page %>% html_nodes("a") %>% html_attr("href")
page_article <- grep("/item.asp", links, value = TRUE)
data.frame(filename, page_article)
}
Use purrr::map_df and pass this function to every file and combine the output in one dataframe (result).
path = "C:/Users/Dasha/Downloads/R STUDIO/RECTORS/test retrieve"
file.names <- list.files(path, pattern ="\\.html$", full.names = TRUE)
result <- purrr::map_df(file.names, read_html_files, .id = 'id')
result

Getting data from an html page using R

Anyone can help me why the below code doe not have any data for the selected table?
library('httr')
library('rvest')
url= read_html("http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB")
table = html_node(url,"table#f05v5-sorting-table.border-top2.border-allside.clearboth")
Thanks!
You are missing some steps. Your workflow should look like this:
dat_html <- read_html(
"http://projects.worldbank.org/search?lang=en&searchTerm=&sectorcode_exact=AB"
)
dat_nodes <- html_nodes(dat_html, xpath = "xxxx")
dat <- html_table(dat_nodes)
dat will be a list, so if you want a data frame, you could do something like:
dat_df <- as.data.frame(dat)
Or, if you like tibbles:
dat_tbl <- as_tibble(dat)
I cannot find the table you are interested in on that webpage, so you have to replace "xxxx" by the xpath of the table you are interested in.
To find the xpath, if you are inspecting the page from chrome or chromium, you can right click on the node in the inspector window, and look for Copy, then Copy XPath.

child number changes when web scraping for different patents

I am relatively new to web scraping.
I am having problems with child numbers when web scraping for multiple patents. The child number changes accordingly to the the location of the table in the web page. Sometimes the child is "div:nth-child(17)" and other times it is "div:nth-child(18)" when searching for different patents.
My line of code is this one:
IPCs <-sapply("http://www.sumobrain.com/patents/us/Sonic-pulse-echo-method-apparatus/4202215.html", function(url1){
tryCatch(url1 %>%
as.character() %>%
read_html() %>%
html_nodes("#inner_content2 > div:nth-child(17) > div.disp_elm_value3 > table") %>%
html_table(),
error = function(e){NA}
)
})
When I search for another patent (for example: "http://www.sumobrain.com/patents/us/Method-apparatus-quantitative-depth-differential/4982090.html") the child number changes to (18).
I am planning to analyse more than a thousand patents so I would need a code that work for both child numbers. Is there a CSS selector which allows me to select more children? I have tried the "div:nth-child(n)" and "div:nth-child(*)" but they do not work.
I am also open to using a different method. Does anybody have any suggestions?
Try this pseudo classes :
It's a range between 17 and 18.
nth-child(17):nth-child(-n+18)

Accessing html Tables with rvest

So I am wanting to scrape some NBA data. The following is what I have so far, and it is perfectly functional:
install.packages('rvest')
library(rvest)
url = "https://www.basketball-reference.com/boxscores/201710180BOS.html"
webpage = read_html(url)
table = html_nodes(webpage, 'table')
data = html_table(table)
away = data[[1]]
home = data[[3]]
colnames(away) = away[1,] #set appropriate column names
colnames(home) = home[1,]
away = away[away$MP != "MP",] #remove rows that are just column names
home = home[home$MP != "MP",]
the problem is that these tables don't include the team names, which is important. To get this information, I was thinking I would scrape the four factors table on the webpage, however, rvest doesnt seem to be recognizing this as a table. The div that contains the four factors table is:
<div class="overthrow table_container" id="div_four_factors">
And the table is:
<table class="suppress_all sortable stats_table now_sortable" id="four_factors" data-cols-to-freeze="1"><thead><tr class="over_header thead">
This made me think that I could access the table via something along the lines of
table = html_nodes(webpage,'#div_four_factors')
but this doesnt seem to work as I am getting just an empty list. How can I access the four factors table?
I am by no means an HTML expert but it appears that the table you are interested in is commented out in the source code then the comment is overridden at some point before being rendered.
If we assume that the Home team is always listed second, we can just use positional arguments and scrape another table on the page:
table = html_nodes(webpage,'#bottom_nav_container')
teams <- html_text(table[1]) %>%
stringr::str_split("Schedule\n")
away$team <- trimws(teams[[1]][1])
home$team <- trimws(teams[[1]][2])
Obviously not the cleanest solution but such is life in the world of web scraping

R to change the values in html form and scrape web data

I would like to scrape the historical weather data from this page http://www.weather.gov.sg/climate-historical-daily.
I am using the code given in this link Using r to navigate and scrape a webpage with drop down html forms.
However, I am not able to get the data probably due to change in structure of the page. In the code from the above link pgform <-html_form(pgsession)[[3]] was used to change the values of the form. I was not able to find a similar form in my case.
url <- "http://www.weather.gov.sg/climate-historical-daily"
pgsession <- html_session(url)
pgsource <- read_html(url)
pgform <- html_form(pgsession)
result in my case
> pgform
[[1]]
<form> 'searchform' (GET http://www.weather.gov.sg/)
<button submit> '<unnamed>
<input text> 's':
Since the page has a CSV download button and the links it provides follow a pattern, you can generate and download a set of URLs. You'll need a set of the station IDs, which you can scrape from the dropdown itself:
library(rvest)
page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()
station_id <- page %>% html_nodes('button#cityname + ul a') %>%
html_attr('onclick') %>% # If you need names, grab the `href` attribute, too.
sub(".*'(.*)'.*", '\\1', .)
which can then be put into expand.grid with the months and years to generate all the necessary combinations:
df <- expand.grid(station_id,
month = sprintf('%02d', 1:12),
year = 2014:2016)
(Note if you want 2017 data, you'll need to construct those separately and rbind so as not to construct months that haven't happened yet.)
The combinations can then be paste0ed into URLs:
urls <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_',
df$station_id, '_', df$year, df$month, '.csv')
which can be lapplyed across to download all the files:
# Warning! This will download a lot of files! Make sure you're in a clean directory.
lapply(urls, function(url){download.file(url, basename(url), method = 'curl')})