Creating a data frame from a JSON file

I want to create a proper data frame by reading from a JSON file. I can view the created data frame fine, but the dplyr function group_by does not work on it. This is probably because str() on the created data frame shows every column as a list of strings rather than a vector of strings. I am trying the following:
require(jsonlite)
train_file = 'train.json'
train_data <- fromJSON(train_file)
rb = data.frame(sapply(train_data,c), stringsAsFactors = FALSE)
rbs = rb %>% slice(1:10)
rbsg = rbs %>%
group_by(colname)
This gives the following error:
Error: cannot group column colname, of class 'list'
Very specifically, the file that I am trying to read is the train.json file from this Kaggle competition:
https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data

You need to unnest() the column of interest before operating on it (e.g. before using group_by() or other dplyr verbs):
library(jsonlite)
library(tidyverse)
rbs <- fromJSON("train.json") %>%
  bind_rows()
rbsg <- rbs %>%
  unnest(bedrooms) %>%
  group_by(bedrooms)
rbs_filtered <- rbs %>%
  unnest(bathrooms) %>%
  filter(bathrooms > 5)
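As a minimal illustration of why the list-column fails and how flattening it fixes things (the colname column below is made up; it just mimics the list-columns that sapply(train_data, c) produces):
library(dplyr)
# hypothetical data frame with a list-column, mimicking the structure built above
df <- tibble::tibble(colname = list("a", "b", "a"))
# df %>% group_by(colname)   # fails: cannot group a column of class 'list'
df %>%
  mutate(colname = unlist(colname)) %>%  # flatten to a plain character vector
  group_by(colname) %>%
  tally()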


Scrape object from html with rvest

I am new to web scraping with R and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with html and css, so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
We can convert the text obtained from the Daily situation update into a tibble using the vroom package:
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is:
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns all the values associated with the nodes, whereas html_node only returns the first node. Thus, the former fetches you all the nodes, which is really helpful.
html_text vs html_text2
html_text2 retains the breaks in strings, usually \n and \b. These are helpful when working with strings.
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
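A minimal self-contained sketch of both differences, using an inline HTML fragment instead of the live page:
library(rvest)
# a tiny document with two <p> nodes; the first contains a <br> line break
frag <- minimal_html("<div><p>Normal care: 57<br>Intensive care: 23</p><p>New deaths: 1</p></div>")
frag %>% html_element("p") %>% html_text()    # first <p> only, <br> collapsed
frag %>% html_elements("p") %>% html_text2()  # all <p> nodes, <br> kept as "\n"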
There is probably a much more elegant way to do this efficiently, but when I need to brute-force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and a lookahead regex to get the exact piece of data I need. It basically takes the form (?<=text_right_before).+?(?=text_right_after)
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
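The two calls above return character strings; converting them gives the numbers directly (the values in the comments are just examples based on the page output quoted in the previous answer):
as.numeric(normal_care)     # e.g. 57
as.numeric(intensive_care)  # e.g. 23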
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract it.
If you want it as a DataFrame (resembling the one on the webpage) you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. It is a bit more effort, but as you already have other answers covering rvest I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
  data <- pdf_text(filename)
  text <- data[[1]] %>% strsplit(split = "\n{2,}")
  web_data <- text[[1]][3:12]
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)
  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }
  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
                      Metric = clean_col2(Metric))
  return(df)
}
View(get_latest_daily_df(filename))
Output: (screenshot of the resulting data frame not reproduced here)
Alternate:
If you simply want to pull the items and then process them, you could extract each column as an item in a list. Replace the br elements so that the content within them ends up in a comma-separated list:
library(rvest)
library(purrr)
library(magrittr)
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
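As a small follow-up sketch, you can name each list element by its panel heading (the first line of each item), which makes the stats easier to look up:
stats <- map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
# use the first line of each panel (its heading) as the list name
names(stats) <- map_chr(stats, 1)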

How to parse addresses from website specifying class in R?

I would like to parse addresses of all stores on the following website:
https://www.carrefour.fr/magasin/region/ looping through the regions. So starting for example with the region "auvergne-rhone-alpes-84", hence full url = https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions afterwards, I just want to make it work with one for now.
library(httr)
library(rvest)
library(xml2)
carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector = c()
for (current_region in c("auvergne-rhone-alpes-84")) {
  current_region_url = paste(carrefour, current_region, "/", sep = "")
  x <- GET(url = current_region_url)
  html_doc <- read_html(x) %>%
    html_nodes("[class = 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
  addresses_vector <- c(addresses_vector, html_doc %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all(".//div[contains(@class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
    rvest::html_text())
}
I also tried with x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the whole xpath by hand) or x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text() and several other ways, but I always get a character(0) element returned.
Any help is appreciated!
You could write a couple of custom functions to help, then use purrr to map the store-data function over the output of the first helper function.
First, extract the region urls along with the region names and region ids. Store these in a tibble. This is the first helper function, get_regions.
Then use another function, get_store_info, to extract the store info from these region urls. The info is stored in an attribute of a div tag, from which it is extracted dynamically when JavaScript runs in the browser, but not when using rvest.
Apply the function that extracts the store info over the list of region urls and region ids.
If you use map2_dfr to pass both the region id and the region link to the function which extracts the store data, you then have the region id to link back on, joining the result of map2_dfr to the region tibble generated earlier.
Then do some column cleaning, e.g. drop the columns you don't want (a sketch of this follows the code below).
library(rvest)
library(purrr)
library(dplyr)
library(stringr)
library(readr)
library(jsonlite)
get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = T),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>%
    mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>% as.integer())
  return(t)
}
get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = T) %>%
    as_tibble()
  store_data$region_id <- r_id
  return(store_data)
}
region_df <- get_regions()
store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)
final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this.
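As an example of the final column cleaning mentioned above (only region, link and region_id are known from the code; the store columns coming out of the JSON are not shown here, so adjust to what you actually get):
final_df <- final_df %>%
  select(-link) %>%               # drop the raw region url
  relocate(region_id, region)     # keep the region identifiers up front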

How to download an html table with an inconsistent number of columns in R?

I'm currently trying to download a table from the following URL:
url1<-"http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=de"
I downloaded and saved the file as .xls because I thought it was an Excel file, using the following code:
temp <- paste0(tempfile(), ".xls")
download.file(url1, destfile = temp, mode = "wb")
First I tried to read it into R as an Excel file, but it seems to be html (it can be read by Excel though):
dfAMS <- read_excel(path = temp, sheet = "Sheet1", range = "I7:I37")
Therefore:
df <- read_html(temp)
Now unfortunately I'm stuck because the following lines of code won't give me the intended result (a nice table, or at least column I7:I37 of the .xls):
dfAMS <- html_node(df, "table") %>% html_table(fill = T) %>% tibble::as_tibble()
dplyr::glimpse(df)
I'm pretty sure the solution is rather simple, but I'm currently stuck and can't find it...
Thanks in advance!
Klamsi, the url points to an html file renamed to have a ".xls" extension. This is somewhat common practice among webmasters. Try it yourself by renaming the ".xls" extension to ".html".
A second problem is that the html has a very messy table configuration. The table of interest is the fifth table in the document.
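To confirm that, you can list all the tables in the document and check their dimensions first (a quick sketch; url is the same one used in the main snippet below):
library(rvest)
url <- "http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=en"
tables <- read_html(url) %>% html_table(header = TRUE, fill = TRUE)
length(tables)       # how many tables the document contains
lapply(tables, dim)  # dimensions of each; the fifth one holds the data of interest here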
This is a workaround to obtain the values for the overall population (or "range A7:B37, I7:K37"):
library(rvest)
library(dplyr)
url <- "http://iambweb.ams.or.at/ambweb/showcognusServlet?tabkey=3643193&regionDisplay=%C3%96sterreich&export=html&outputLocale=en"
df <- read_html(url) %>%
  html_table(header = TRUE, fill = TRUE) %>%
  .[[5]] %>%          # extract the fifth table in the list
  as.data.frame() %>%
  .[, c(1:11)] %>%
  select(1:2, 9:11)
names <- unlist(df[1, ])
names[1:2] <- c("item", "Bundesland")
colnames(df) <- names
df <- df[-1, ]
df %>% head()
item Bundesland Bestand Veränderung zum VJ absolut Veränderung zum VJ in %
2 Arbeitslosigkeit Bgld 7119 -973 -0.120242214532872
3 Arbeitslosigkeit Ktn 16564 -2160 -0.115359965819269
4 Arbeitslosigkeit NÖ 46342 -6095 -0.116234719758949
5 Arbeitslosigkeit OÖ 29762 -4649 -0.135102147569091
6 Arbeitslosigkeit Sbg 11173 -643 -0.0544177386594448
7 Arbeitslosigkeit Stmk 28677 -5602 -0.1634236704688

R highcharter get data from plots saved as html

I plot data with the highcharter package in R and save the plots as html to keep the interactive features. In most cases I plot more than one graph, so I bring them together as a canvas.
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
I want to extract the data from plots that I have saved as html in the past, just like these. Normally I would do hc_list[[1]]$x$hc_opts$series, but when I import html into R and try to do the same, I get an error. It won't do the job.
> hc_imported <- htmltools::includeHTML("test_grid.html")
> hc_imported[[1]]$x$hc_opts$series
Error in hc_imported$x : $ operator is invalid for atomic vectors
If I would be able to write a function like
get_my_data(my_imported_highcharter,3) # get data from 3rd plot
it would be the best. Regards.
You can use the code below:
require(highcharter)
hc_list <- lapply(list(sin,cos,tan,tanh),mapply,seq(1,5,by = 0.1)) %>%
lapply(function(x) highchart() %>% hc_add_series(x))
hc_grid <- hw_grid(hc_list,ncol = 2)
htmltools::browsable(hc_grid) # print
htmltools::save_html(hc_grid,"test_grid.html") # save
# hc_imported <- htmltools::includeHTML("test_grid.html")
# hc_imported[[1]]$x$hc_opts$series
library(jsonlite)
library(RCurl)
library(XML)
get_my_data <- function(my_imported_highcharter, n) {
  # read the saved html and parse it into a tree
  webpage <- readLines(my_imported_highcharter)
  pagetree <- htmlTreeParse(webpage, error = function(...){})
  body <- pagetree$children$html$children$body
  # navigate to the n-th chart container inside the grid
  divbodyContent <- body$children$div$children[[n]]
  # its second child is the script node carrying the widget's JSON payload
  script <- divbodyContent$children[[2]]
  data <- as.character(script$children[[1]])[6]
  # parse the JSON and pull out the series data
  data <- fromJSON(data, simplifyVector = FALSE)
  data <- data$x$hc_opts$series[[1]]$data
  return(data)
}
get_my_data("test_grid.html",3)
get_my_data("test_grid.html",1)

How to use tidyjson inside dplyr

I have a data frame, called data_df, which has one column containing JSON strings; the column name is json_response.
I want to access a very specific key-value pair from it. An example of one of the JSON strings is below. I want to know how many times success is true in the string.
x = "[{\"s\":\"D\",\"success\":true,\"start.time\":\"2016-01-27 19:27:27\",\"stop.time\":\"2016-01-27 19:27:30\",\"status_code\":200,\"called\":true,\"milliseconds\":3738.6858,\"_row\":\"DataX\"},{\"s\":\"C\",\"success\":true,\"start.time\":\"2016-01-27 19:27:30\",\"stop.time\":\"2016-01-27 19:27:32\",\"status_code\":200,\"called\":true,\"milliseconds\":1815.1433,\"_row\":\"Clarity\"}]"
If I only want to use tidyjson, I can do it as follows, which works as I want.
library(dplyr)
library(tidyjson)
x %>% gather_array %>%
spread_values(called = jstring("called")) %>%
summarize(x = sum(called == "TRUE"))
Now, if I want to do it for the whole column, how should I do it? I don't want to use a loop.
Following is the code I tried:
data_df %>%
transmute(
test = json_response %>% gather_array %>%
spread_values(called = jstring("called")) %>%
summarize(x = sum(called=="TRUE"))
)
Following is the error I got when I ran the above code:
Error: not compatible with STRSXP
Instead of using tidyjson, you can use rjson combined with dplyr like this:
library(dplyr)
library(rjson)
data_df$test <- data_df %>%
  rowwise %>%
  do(test = .$json_response %>% as.character %>% fromJSON %>% sapply(`[[`, "called") %>% sum) %>%
  as.data.frame
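As a quick sanity check on the single example string x from the question (both records in x have called = TRUE, so the sum is 2):
x %>% as.character %>% fromJSON %>% sapply(`[[`, "called") %>% sum   # 2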
You can use tidyjson for this: simply convert data_df into a tbl_json object, and then proceed as before:
data_df %>%
  as.tbl_json(json.column = "json_response") %>%
  # track each document if you don't already have an ID
  mutate(rownum = 1:n()) %>%
  gather_array %>%
  # use jlogical for the correct type
  spread_values(success = jlogical("success")) %>%
  group_by(rownum) %>%
  summarize(num.successes = sum(success))
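If you want those per-row counts attached back onto data_df itself, a sketch (assuming data_df has no existing row-id column) is to store the summary and join it on the row number:
counts <- data_df %>%
  as.tbl_json(json.column = "json_response") %>%
  mutate(rownum = 1:n()) %>%
  gather_array %>%
  spread_values(success = jlogical("success")) %>%
  group_by(rownum) %>%
  summarize(num.successes = sum(success))
# attach the per-row counts back onto the original data frame by row number
data_df %>%
  mutate(rownum = 1:n()) %>%
  left_join(as.data.frame(counts), by = "rownum")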