how to use tidyjson inside dplyr - json

I have a dataframe called data_df, which has one column, json_response, containing JSON strings.
I want to access a very specific key-value pair from it. An example of one of the JSON strings follows; I want to know how many times success is true in the string.
x = "[{\"s\":\"D\",\"success\":true,\"start.time\":\"2016-01-27 19:27:27\",\"stop.time\":\"2016-01-27 19:27:30\",\"status_code\":200,\"called\":true,\"milliseconds\":3738.6858,\"_row\":\"DataX\"},{\"s\":\"C\",\"success\":true,\"start.time\":\"2016-01-27 19:27:30\",\"stop.time\":\"2016-01-27 19:27:32\",\"status_code\":200,\"called\":true,\"milliseconds\":1815.1433,\"_row\":\"Clarity\"}]"
If I only want to use tidyjson on a single string, I can do it as follows, which works as I want:
library(dplyr)
library(tidyjson)
x %>% gather_array %>%
spread_values(called = jstring("called")) %>%
summarize(x = sum(called == "TRUE"))
Now, how should I do this for the whole column? I don't want to use a loop.
Here is the code I tried:
data_df %>%
  transmute(
    test = json_response %>% gather_array %>%
      spread_values(called = jstring("called")) %>%
      summarize(x = sum(called == "TRUE"))
  )
Following is the error I got when I ran the above code:
Error: not compatible with STRSXP

Instead of using tidyjson, you can use rjson combined with dplyr, like this:
library(rjson)

data_df$test <- data_df %>%
  rowwise %>%
  do(test = .$json_response %>% as.character %>% fromJSON %>% sapply(`[[`, "called") %>% sum) %>%
  as.data.frame

You can use tidyjson for this: simply convert data_df into a tbl_json object and then proceed as before:
data_df %>%
  as.tbl_json(json.column = "json_response") %>%
  # track each document if you don't already have an ID
  mutate(rownum = 1:n()) %>%
  gather_array %>%
  # use jlogical for correct type
  spread_values(success = jlogical("success")) %>%
  group_by(rownum) %>%
  summarize(num.successes = sum(success))


How to parse addresses from website specifying class in R?

I would like to parse the addresses of all stores on the following website:
https://www.carrefour.fr/magasin/region/, looping through the regions. So, starting for example with the region "auvergne-rhone-alpes-84", the full url = https://www.carrefour.fr/magasin/region/auvergne-rhone-alpes-84. Note that I can add more regions afterwards; I just want to make it work with one for now.
library(httr)
library(rvest)
library(xml2)

carrefour <- "https://www.carrefour.fr/magasin/region/"
addresses_vector = c()

for (current_region in c("auvergne-rhone-alpes-84")) {
  current_region_url = paste(carrefour, current_region, "/", sep = "")
  x <- GET(url = current_region_url)
  html_doc <- read_html(x) %>%
    html_nodes("[class = 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2']")
  addresses_vector <- c(addresses_vector, html_doc %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all(".//div[contains(@class, 'ds-body-text ds-store-card__details--content ds-body-text--size-m ds-body-text--color-standard-2')]") %>%
    rvest::html_text())
}
I also tried with x %>% read_html() %>% rvest::html_nodes(xpath = "/html/body/main/div[1]/div/div[2]/div[2]/ol/li[1]/div/div[1]/div[2]/div[2]") %>% rvest::html_text() (copying the whole xpath by hand), or x %>% read_html() %>% html_nodes("div.ds-body-text.ds-store-card__details--content.ds-body-text--size-m.ds-body-text--color-standard-2") %>% html_text(), and several other ways, but I always get a character(0) element returned.
Any help is appreciated!
You could write a couple of custom helper functions, then use purrr to map the store-data function over inputs taken from the output of the first helper.
First, extract the region urls, region names and region ids, and store these in a tibble. This is the first helper function, get_regions.
Then use another function, get_store_info, to extract the store info from each region url. The info is stored in an attribute of a div tag, from which JavaScript extracts it dynamically in the browser; rvest does not run JavaScript, but it can read the attribute directly.
Apply the function that extracts the store info over the list of region urls and region ids.
If you use map2_dfr to pass both the region id and the region link to the function that extracts the store data, you keep the region id in the output and can join the result of map2_dfr back onto the region tibble generated earlier.
Then do some column cleaning, e.g. drop the columns you don't want (a short sketch follows the code below).
library(rvest)
library(purrr)
library(dplyr)
library(readr)
library(stringr)
library(jsonlite)

get_regions <- function() {
  url <- "https://www.carrefour.fr/magasin"
  page <- read_html(url)
  regions <- page %>% html_nodes(".store-locator-footer-list__item > a")
  t <- tibble(
    region = regions %>% html_text(trim = T),
    link = regions %>% html_attr("href") %>% url_absolute(url),
    region_id = NA_integer_
  ) %>%
    mutate(region_id = str_match(link, "-(\\d+)$")[, 2] %>% as.integer())
  return(t)
}

get_store_info <- function(region_url, r_id) {
  region_page <- read_html(region_url)
  store_data <- region_page %>%
    html_node("#store-locator") %>%
    html_attr(":context-stores") %>%
    parse_json(simplifyVector = T) %>%
    as_tibble()
  store_data$region_id <- r_id
  return(store_data)
}

region_df <- get_regions()
store_df <- map2_dfr(region_df$link, region_df$region_id, get_store_info)
final_df <- inner_join(region_df, store_df, by = 'region_id') # now clean columns within this.
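As a rough sketch of that final cleaning step (purely illustrative; link is the only column we know exists from get_regions, and which store columns you keep depends on what the :context-stores JSON contains):
final_df <- final_df %>%
  select(-link)   # drop the region url that was only needed for scraping
# inspect names(final_df) to decide which remaining store columns to keep or rename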

What should the "i"/category be in this for loop and how can I ensure it is in my working directory?

I am running a web-scraping project and am having some difficulty using the urls of the search results from an initial scrape to scrape information from the search results themselves.
My first loop gives me the back halves of the urls I need, after the / (for example, for yelp.com/abd I have abd), which I have in a nested list. However, when I summarize that nested list, like so:
library(rvest)

profile_url_lst <- list()
for (page_num in 1:73) {
  main_url <- paste0("https://www.theeroticreview.com/reviews/newreviewsList.asp?searchreview=1&gCity=region1%2Dus%2Drhode%2Disland&gCityName=Rhode+Island+%28State%29&SortBy=3&gDistance=0&page=", page_num)
  html_content <- read_html(main_url)
  profile_urls <- html_content %>% html_nodes("body") %>% html_children() %>% html_children() %>% .[2] %>% html_children() %>%
    html_children() %>% .[3] %>% html_children() %>% .[4] %>% html_children() %>% html_children() %>% html_children() %>%
    html_attr("href")
  profile_url_lst[[page_num]] <- profile_urls
  Sys.sleep(2)
}
profile_url_lst
profiles <- cbind(profile_urls)
profiles
I only receive the urls from the last page of results.
I pasted the domain name onto those urls with paste0, which worked fine, but I then encounter another problem: when I use the variable name in a for loop, R returns "variable name is not in your working directory".
complete_urls <- paste0('https://www.theeroticreview.com', profiles)
complete <- cbind(complete_urls)
complete
TED_lst <- list()
for(complete_urls in 1:73) {
html_content1 <- read_html('complete_urls')
TED <- html_content1 %>% html_nodes("'") %>% html_text()
TED_lst[i] <- TEDs
Sys.sleep(2)
How do I paste the domain name to all the collected urls and bind them, and what should the category be in the for loop?
Assuming you intend to read_html from each url within complete_urls, you want to avoid overwriting that variable by using it as the loop variable, as well as referencing it as a string literal. You could instead seq_along the items and index in. Here I print rather than read_html:
complete_urls <- c('A', 'B')
for(i in seq_along(complete_urls)){
print(complete_urls[[i]])
}
It is probably better to write a custom function to apply to each url and pass that into a tidyverse mapping function, possibly something where you can take advantage of parallel/async running.
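For illustration, a minimal sketch of that idea (the CSS selector passed to html_nodes is a placeholder you would need to replace with whatever identifies the nodes you actually want):
library(rvest)
library(purrr)

# hypothetical helper: read one url and return the text of the nodes of interest
scrape_page <- function(url) {
  Sys.sleep(2)                                # be polite between requests
  read_html(url) %>%
    html_nodes("your-css-selector-here") %>%  # placeholder selector
    html_text()
}

TED_lst <- map(complete_urls, scrape_page)    # one list element per url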

How to retrieve a multiple tables from a webpage using R

I want to extract all vaccine tables, with the description on the left and their descriptions inside the table, using R.
The webpage is https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro (the question included a screenshot of how the first table looks on the page).
I tried using the XML package but wasn't successful; I used:
vup<-readHTMLTable("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro", which=5)
I get an error:
Error in (function (classes, fdef, mtable) :
unable to find an inherited method for function ‘readHTMLTable’ for signature ‘"NULL"’
In addition: Warning message:
XML content does not seem to be XML: ''
How to do this?
This webpage does not use tables, hence your error. Due to the multiple subsections and hidden text, the formatting on the page is quite complicated and requires finding the nodes of interest individually.
I prefer using the "rvest" and "xml2" packages for their easier and more straightforward syntax.
This is not a complete solution, but it should get you moving in the correct direction.
library(rvest)
library(xml2)
library(dplyr)

#read the page
page <- read_html("https://milken-institute-covid-19-tracker.webflow.io/#vaccines_intro")

#find the top of the vaccine section
parentvaccine <- page %>% html_node(xpath = "//div[@id='vaccines_intro']") %>% xml_parent()

#find the vaccine rows
vaccines <- parentvaccine %>% html_nodes(xpath = ".//div[@class='chart_row for_vaccines']")

#find info on each one
company <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_developer w-richtext']") %>% html_text()
product <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_vaccines w-richtext']") %>% html_text()
phase <- vaccines %>% html_node(xpath = ".//div[@class='is_h5-2 is_stage']") %>% html_text()
misc <- vaccines %>% html_node(xpath = ".//div[@class='chart_row-expanded for_vaccines']") %>% html_text()

#get the vaccine type
vaccinetypes <- parentvaccine %>% html_nodes(xpath = './/div[@class="chart-section for_vaccines"]') %>%
  html_node('div.is_h3') %>% html_text()

#determine the number of vaccines in each category
lengthvector <- parentvaccine %>% html_nodes(xpath = './/div[@role="list"]') %>% xml_length() %>% sum()

#make a vector of the correct length
VaccineType <- rep(vaccinetypes, each = lengthvector)

answer <- data.frame(VaccineType, company, product, phase)
head(answer)
Generating this code involved reading the HTML and identifying the correct nodes and the unique attributes for the desired information.

Apply custom function (creating HTML tables) to every column of a data frame using purrr / map

I have a custom function which generates HTML tables with DT::datatable and works well when applied without iterative approaches.
function_datatable <- function(df, input_var) {
  df %>%
    group_by(!!enquo(input_var)) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(!is.na(!!enquo(input_var))) %>%
    DT::datatable()
}
However, it does not work when applied iteratively, for example with purrr::map or map2. My goal is to apply this function to every variable of a data frame / tibble, in an R Markdown document where a variety of variables should be displayed one after the other. So the syntax should be something like this (the code obviously does not work):
df %>% map(function_datatable)
I got a couple of different error messages when trying out different approaches and combinations. Some were related to group_by and character, others to the datatable/htmlwidget class.
Here's an example with mtcars, removing the last line of your function.
require(tidyverse)
dataf <- tibble(mtcars)

function_datatable <- function(col_name, df) {
  return(df %>%
    group_by_at(col_name) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(!is.na(!!enquo(col_name))))
}

map(names(dataf), function_datatable, dataf)
Updated code ("state of the art" tidyverse syntax): some tidyeval was needed to program the function correctly. Thanks @Manuel Zambelli for the map() answer!
library(tidyverse)

function_datatable <- function(input_var, df) {
  df %>%
    group_by(across(all_of(input_var))) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    filter(!is.na(!!rlang::sym(input_var))) %>%
    DT::datatable()
}

map(names(mtcars), function_datatable, mtcars)
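A follow-up note on the R Markdown goal: map() returns a list of htmlwidgets, and widgets built inside a list or loop do not print on their own in a knitted document. One commonly used approach (a sketch, assuming the htmltools package is available) is to wrap the list in htmltools::tagList() so all the tables render one after another:
library(htmltools)

# render every per-variable datatable from a single R Markdown chunk
tagList(map(names(mtcars), function_datatable, mtcars))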

Scraping web table using R and rvest

I'm new to web scraping using R.
I'm trying to scrape the table generated by this link:
https://gd.eppo.int/search?k=saperda+tridentata.
In this specific case there is just one record in the table, but there could be more (I am actually interested in the first column, but the whole table is OK).
I tried to follow the suggestion by Allan Cameron given here (rvest, table with thead and tbody tags), as the issue seems to be exactly the same, but with no success, maybe because of my limited knowledge of how webpages work. I always get a "no data" table. Maybe I am not correctly following the suggested step "# Get the JSON as plain text from the link generated by Javascript on the page".
Where can I get this link? In this specific case I used "https://gd.eppo.int/media/js/application/zzsearch.js?7"; is this the one?
Below you have my code.
Thank you in advance!
library(httr)
library(rlist)
library(rvest)
library(jsonlite)
library(dplyr)
pest.name <- "saperda+tridentata"
url <- paste("https://gd.eppo.int/search?k=",pest.name, sep="")
resp <- GET(url) %>% content("text")
json_url <- "https://gd.eppo.int/media/js/application/zzsearch.js?7"
JSON <- GET(json_url) %>% content("text", encoding = "utf8")
table_contents <- JSON %>%
{gsub("\\\\n", "\n", .)} %>%
{gsub("\\\\/", "/", .)} %>%
{gsub("\\\\\"", "\"", .)} %>%
strsplit("html\":\"") %>%
unlist %>%
extract(2) %>%
substr(1, nchar(.) -2) %>%
paste0("</tbody>")
new_page <- gsub("</tbody>", table_contents, resp)
read_html(new_page) %>%
html_nodes("table") %>%
html_table()
The data comes from another endpoint, which you can see in the browser's network tab when refreshing the page. You can send a request with your search phrase in the params and then extract the json you need from the response.
library(httr)
library(rvest)
library(jsonlite)

params <- list('k' = 'saperda tridentata', 's' = 1, 'm' = 1, 't' = 0)
r <- httr::GET(url = 'https://gd.eppo.int/ajax/search', query = params)
data <- jsonlite::parse_json(r %>% read_html() %>% html_node('p') %>% html_text())
print(data[[1]]$e)
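As a possible follow-up (not part of the original answer): if you want the whole table rather than just the first e field, one sketch is to let jsonlite simplify the same response text into a data frame. Whether this reproduces the table exactly as shown on the page depends on the structure of the JSON the endpoint returns:
# sketch: parse the same response text, letting jsonlite simplify records into a data frame
txt <- r %>% read_html() %>% html_node('p') %>% html_text()
tab <- jsonlite::fromJSON(txt)
head(tab)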