httr GET function read table - json

I want to scrape this website and get the data from the table.
I use GET from the httr package; my code is below:
library(httr)

url <- 'http://datacenter.mep.gov.cn/report/water/water.jsp?'
year <- 2016
wissue <- 2
res <- GET(url,
           query = list(year = year,
                        wissue = wissue))
resC <- content(res, as = 'text', encoding = 'utf-8')
But what I got is not a JSON string but something very strange, like below:
"\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n<html>\r\n\t<head>\r\n\t\t<title>中华人民共和国环境保护部--政府网站数据中心</title>\r\n\t\t<meta http-equiv=\"content-type\" content=\"text/html;
Is there any way to parse this format?

The rowspan attribute is going to make dealing with this table pretty interesting. You have a few choices, two of which are:
use html_table() on the target <table> using fill=TRUE and perform surgery on the resultant data frame (a sketch of this option appears after this list)
attack it at the <tr>-level and build the data frame from the ground up
This answer does the latter.
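For comparison, here is a minimal, untested sketch of the first option; it assumes res is the response from the GET() call in the question, and the spanned column still needs manual repair afterwards (for example, carrying values down with tidyr::fill()):
library(httr)
library(rvest)

# Option 1 (sketch): let html_table() pad the merged cells, then fix them by hand
content(res, as = 'text', encoding = 'utf-8') %>%
  read_html() %>%
  html_node("table#report1") %>%
  html_table(fill = TRUE) -> raw_tbl
# raw_tbl still carries the artifacts of the rowspan cells and needs that "surgery"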
library(rvest)
library(purrr)
library(dplyr)
First, we get the content in a form we can perform XML/HTML surgery on:
content(res, as = 'text', encoding = 'utf-8') %>%
read_html() -> pg
Next, we target and extract the table node with the report:
tab <- html_nodes(pg, "table#report1")
Here's the tricky bit. We first target all the <tr> elements that have @rowspan attributes but no <td> elements with a @colspan attribute:
html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>%
Next, we process those individually:
map_df(function(x) {
We get the number of rows the <tr> spans:
html_nodes(x, xpath=".//td[@rowspan]") %>%
html_attr("rowspan") %>%
as.numeric() -> row_ct
Find all the sibling <tr> elements and reduce the set to the remaining ones in this <tr> "block":
rows <- html_nodes(x, xpath=".//following-sibling::tr")
rows <- rows[1:(row_ct-1)]
Make a data frame from that first block row:
html_nodes(x, xpath=".//td") %>%
html_text() %>%
setNames(sprintf("X%d", 1:13)) %>%
as.list() %>%
flatten_df() -> first
Go through all filtered sibling rows and do the same, leaving room to fill in the spanned column:
map_df(rows, ~html_nodes(., xpath=".//td") %>%
html_text() %>%
setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>%
as.list()) %>%
mutate(X3=first$X3) %>%
select(X1, X2, X3, everything()) -> rest
bind_rows(first, rest)
}) -> h2o_df
dplyr::glimpse(h2o_df)
I can't paste the output of that here since SO's JavaScript text filter thinks the post is spam just because it contains Chinese characters.
Here's all the code in a contiguous chunk:
tab <- html_nodes(pg, "table#report1")

html_nodes(tab, xpath=".//tr[td[not(@colspan) and @rowspan]]") %>%
  map_df(function(x) {

    html_nodes(x, xpath=".//td[@rowspan]") %>%
      html_attr("rowspan") %>%
      as.numeric() -> row_ct

    rows <- html_nodes(x, xpath=".//following-sibling::tr")
    rows <- rows[1:(row_ct-1)]

    html_nodes(x, xpath=".//td") %>%
      html_text() %>%
      setNames(sprintf("X%d", 1:13)) %>%
      as.list() %>%
      flatten_df() -> first

    map_df(rows, ~html_nodes(., xpath=".//td") %>%
             html_text() %>%
             setNames(c("X1", "X2", sprintf("X%d", 4:13))) %>%
             as.list()) %>%
      mutate(X3=first$X3) %>%
      select(X1, X2, X3, everything()) -> rest

    bind_rows(first, rest)

  }) -> h2o_df

Related

Web scraping table R

I'm trying to get the data from the rating column on this site https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/, but I'm having problems with "{xml_nodeset (0)}".
my attempt:
library("rvest")
`%>%` <- magrittr::`%>%`
page <- read_html("https://www.ratingraph.com/tv-shows/one-piece-ratings-17673/")
table <- page %>%
  html_nodes("table")
df <- table[2] %>%
  html_table()
These are the data I want (the values in the rating column of the episode table).
By inspecting the page and looking at the "Network" tab, you can see the call it makes to create the table.
The response is JSON, which is easily parsed into an R list.
Much of the query string below is probably unnecessary for your purpose, so you can shorten it.
If you want more than 25 rows, increase the length=25 parameter, or take it out.
page <- httr::GET(
paste0("https://www.ratingraph.com/show-episodes-list/17673/?draw=1&columns%5B0%5D%5Bdata%5D=trend&",
"columns%5B0%5D%5Bname%5D=&columns%5B0%5D%5Bsearchable%5D=false&columns%5B0%5D%5Borderable%5D=true&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=season&",
"columns%5B1%5D%5Bname%5D=&columns%5B1%5D%5Bsearchable%5D=false&columns%5B1%5D%5Borderable%5D=true&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=episode&",
"columns%5B2%5D%5Bname%5D=&columns%5B2%5D%5Bsearchable%5D=false&columns%5B2%5D%5Borderable%5D=true&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=name&",
"columns%5B3%5D%5Bname%5D=&columns%5B3%5D%5Bsearchable%5D=false&columns%5B3%5D%5Borderable%5D=true&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B4%5D%5Bdata%5D=start&",
"columns%5B4%5D%5Bname%5D=&columns%5B4%5D%5Bsearchable%5D=false&columns%5B4%5D%5Borderable%5D=true&columns%5B4%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B4%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B5%5D%5Bdata%5D=total_votes&",
"columns%5B5%5D%5Bname%5D=&columns%5B5%5D%5Bsearchable%5D=false&columns%5B5%5D%5Borderable%5D=true&columns%5B5%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B5%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B6%5D%5Bdata%5D=average_rating&",
"columns%5B6%5D%5Bname%5D=&columns%5B6%5D%5Bsearchable%5D=false&columns%5B6%5D%5Borderable%5D=true&columns%5B6%5D%5Bsearch%5D%5Bvalue%5D=&columns%5B6%5D%5Bsearch%5D%5Bregex%5D=false&order%5B0%5D%5Bcolumn%5D=1&",
"order%5B0%5D%5Bdir%5D=asc&order%5B1%5D%5Bcolumn%5D=2&order%5B1%5D%5Bdir%5D=asc&start=0&length=25&search%5Bvalue%5D=&search%5Bregex%5D=false&_=", Sys.time() %>% as.numeric() %>% paste0("000")))
table <- page %>% httr::content(as = 'parsed')
avg_ratings <- sapply(table$data, `[[`, 'average_rating') %>% as.numeric()
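As a side note, and purely as a sketch, the same response body can be parsed with jsonlite instead of content(as = 'parsed'); this assumes the payload keeps the data field used above:
library(httr)
library(jsonlite)

# Parse the raw JSON text; fromJSON() turns the array of episode records
# into a data frame in one step.
episodes <- fromJSON(content(page, as = "text", encoding = "UTF-8"))$data
avg_ratings <- as.numeric(episodes$average_rating)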

What should the "i"/category be in this for loop and how can I ensure it is in my working directory?

I am running a web-scraping project and running into some difficulty using the urls for search results from an initial scrape to scrape information from the search results themselves.
My first loop provides the back halves of the urls I need, after the / (for example, from yelp.com/abd I have abd), which I collect in a nested list. However, when I combine that nested list, like so:
profile_url_lst <- list()
for(page_num in 1:73){
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>% html_nodes("body") %>% html_children() %>% html_children() %>% .[2] %>% html_children() %>%
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>%
    html_attr("href")
  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)
}
profile_url_lst
profiles <- cbind(profile_urls)
profiles
I only receive the urls from the last page of results.
I pasted the domain name onto those urls with paste0, which worked fine, but then I ran into another problem: when I use the variable name in a for loop, R returns "variable name is not in your working directory".
complete_urls <- paste0('https://www.theeroticreview.com', profiles)
complete <- cbind(complete_urls)
complete
TED_lst <- list()
for(complete_urls in 1:73) {
  html_content1 <- read_html('complete_urls')
  TED <- html_content1 %>% html_nodes("'") %>% html_text()
  TED_lst[i] <- TEDs
  Sys.sleep(2)
}
How do I paste the domain name to all the collected urls and bind them, and what should the category be in the for loop?
Assuming you intend to read_html from each url within complete_urls, you want to avoid overwriting that variable by using it as the loop variable, as well as referencing it as a string literal. You could instead seq_along the items and index in. Here I print rather than read_html:
complete_urls <- c('A', 'B')
for(i in seq_along(complete_urls)){
  print(complete_urls[[i]])
}
It is probably better to write a custom function to apply to each url and pass that into a tidyverse function, possibly something that lets you take advantage of parallel/async running; see the sketch below.
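A minimal sketch of that idea, assuming complete_urls is the character vector of full profile urls built above; the CSS selector is a placeholder and must be swapped for the real one:
library(rvest)
library(purrr)

# Hypothetical helper: scrape one profile page and return its text.
scrape_profile <- function(u) {
  Sys.sleep(2)                  # be polite between requests
  read_html(u) %>%
    html_nodes("p") %>%         # placeholder selector, adjust to the page
    html_text()
}

# map() applies the helper to every url and keeps the results in a list
TED_lst <- map(complete_urls, scrape_profile)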

How to retrieve multiple tables from a webpage using R

I want to extract all the vaccine tables, with the description on the left and the details inside each table, using R.
The webpage is https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro.
I tried using the XML package, but I wasn't successful. I used:
vup <- readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which = 5)
I get an error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: ''
How to do this?
This webpage does not use HTML tables, which is the reason for your error. Due to the multiple subsections and hidden text, the formatting of the page is quite complicated and requires finding the nodes of interest individually.
I prefer the "rvest" and "xml2" packages for their easier and more straightforward syntax.
This is not a complete solution, but it should get you moving in the correct direction.
library(rvest)
library(xml2)
library(dplyr)

#read the page (URL from the question)
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

#find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath="//div[@id='vaccines_intro']") %>% xml_parent()

#find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()

#get the vaccine type for each section
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>%
  html_node('div.is_h3') %>% html_text()

#determine the number of vaccines in each category
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length() %>% sum()

#make a vector of the correct length
VaccineType <- rep(vaccinetypes, each=lengthvector)

answer <- data.frame(VaccineType, company, product, phase)
head(answer)
Generating this code involved reading the html source and identifying the correct nodes and the unique attributes for the desired information.
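As a rough sketch building on the objects above (libraries already loaded), you can do that exploration from R itself by listing the attributes a node exposes before writing the xpath:
# Peek at the attributes (class names, ids, roles) of the first vaccine row
# and of its children to decide which xpath to write.
vaccines[1] %>% html_attrs()
vaccines[1] %>% html_children() %>% html_attrs()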

Reading off links on a site and storing them in a list

I am trying to read off the urls to data from StatsCan as follows:
library(rvest)

# 2015
url <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2015/18122"
x1 <- read_html(url) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")

# 2014
url2 <- "https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/crude-oil-prices-2014/16993"
x2 <- read_html(url2) %>%
  html_nodes(xpath = '//*[@class="col-md-4"]/ul/li/ul/li/a') %>%
  html_attr("href")
Doing so returns two empty lists; I am confused, as this worked for this link: https://www.nrcan.gc.ca/our-natural-resources/energy-sources-distribution/clean-fossil-fuels/crude-oil/oil-pricing/18087. Ultimately I want to loop over the list and read off the tables on each page like so:
for (i in 1:length(x2)){
  out.data <- read_html(x2[i]) %>%
    html_table(fill = TRUE) %>%
    `[[`(1) %>%
    as_tibble()
  write.xlsx(out.data, str_c(destination, i, ".xlsx"))
}
In order to extract all the urls, I recommend using the css selector ".field-item li a" and subsetting according to a pattern (str_subset is from stringr):
library(stringr)

links <- read_html(url) %>%
  html_nodes(".field-item li a") %>%
  html_attr("href") %>%
  str_subset("fuel-prices/crude")
Your XPath needs to be fixed. You can use the following one:
//strong[contains(.,"Oil")]/following-sibling::ul//a
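A sketch of plugging that XPath into the original pipeline (the same pattern applies to url2):
library(rvest)

x1 <- read_html(url) %>%
  html_nodes(xpath = '//strong[contains(.,"Oil")]/following-sibling::ul//a') %>%
  html_attr("href")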

Cannot find numbers of pages of website in web scraping

I want to get the number of pages from a web site. I tried to do it as shown in a tutorial, using this function:
get_last_page <- function(html){

  pages_data <- html %>%
    # The '.' indicates the class
    html_nodes('.pagination-page') %>%
    # Extract the raw text as a list
    html_text()

  # The second to last of the buttons is the one
  pages_data[(length(pages_data)-1)] %>%
    # Take the raw string
    unname() %>%
    # Convert to number
    as.numeric()
}
first_page <- read_html(url)
(latest_page_number <- get_last_page(first_page))
For the website
url <- 'http://www.trustpilot.com/review/www.amazon.com'
it works fine. When I tried to do it with
url <- 'https://energybase.ru/en/oil-gas-field/index'
I got integer(0). I changed
html_nodes('.pagination-page')
to
html_nodes('.html_nodes('data-page')')
and it failed. How can I change my code to make it work?
I think you have to go about this a little differently here.
The energybase.ru URL isn't organized quite the same way as the TrustPilot URL.
For our purposes here, we're interested in the fact that the last page has its own node .last. From there, you just have to extract the value of the data-page attribute and increment it by 1.
library("rvest")
library("magrittr")
url <- 'https://energybase.ru/en/oil-gas-field/index'
read_html(url) %>% html_nodes(".last") %>% html_children() %>% html_attr("data-page") %>% as.numeric()+1
# [1] 21
Edit: note that you can always intercept the piping at html_children() (by adding a %>% html_attrs() to it) to find out what attributes are available there.
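A minimal sketch of that inspection, using the url and libraries already loaded above:
read_html(url) %>%
  html_nodes(".last") %>%
  html_children() %>%
  html_attrs()   # lists every attribute of those children, including data-page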
You could use the rel=last attribute=value node and extract the number from the href (str_match_all comes from stringr):
library("rvest")
library("magrittr")
library("stringr")

pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
number_of_pages <- str_match_all(pg %>% html_node("[rel=last]") %>% html_attr("href"), 'page=(\\d+)')[[1]][,2] %>% as.numeric()
Or, there are a number of ways you could calculate it, given that there are more pages than the visible pagination shows. One way is to get the total count from the appropriate li in the drop-down and divide it by the results-per-page count.
library(rvest)
library(magrittr)
pg <- read_html('https://energybase.ru/en/oil-gas-field/index')
total_sites <- strtoi(pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount'), base = 0L)
# or use: total_sites <- pg %>% html_node('#navbar-facilities > li:nth-child(13)') %>% html_attr('data-amount') %>% as.numeric()
sites_per_page <- length(pg %>% html_nodes('.index-list-item'))
number_of_pages <- ceiling(total_sites/sites_per_page)
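Once the page count is known, the individual page urls can be built, for example, like this (assuming the site keeps using the page= query parameter seen in the rel=last href):
page_urls <- paste0('https://energybase.ru/en/oil-gas-field/index?page=', seq_len(number_of_pages))
head(page_urls)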