I am trying to convert the following multi-document JSON file into a data.frame.
x = '[
{"name": "Bob","groupIds": ["kwt6x61", "yiahf43"]},
{"name": "Sally","groupIds": "yiahf43"}
]'
I'm almost there by using
y = x %>% gather_array() %>%
spread_values(
name = jstring("name"),
groupIds = jstring("groupIds")
)
print(y)
Which returns:
document.id array.index name groupIds
1 1 1 Bob list("kwt6x61", "yiahf43")
2 1 2 Sally yiahf43
Can someone help spread the groupsIds into addtional rows?
This is an interesting problem. The issue stems from the fact that an array of 1 is stored as a string. Otherwise, enter_object('groupIds') %>% gather_array %>% append_values_string would work nicely. tidyjson does not seem to handle this situation nicely. I wonder whether this would even be considered valid JSON, since in one case groupIds is a string, and in another it is an array.
In any case, although this is not an ideal solution, you can use json_types() to illustrate the difference and then conditionally treat each. I converted to a tbl_df (i.e. dropped JSON component) for future processing when done parsing.
library(tidyjson)
library(dplyr)
library(tidyr)
x = '[
{"name": "Bob","groupIds": ["kwt6x61", "yiahf43"]},
{"name": "Sally","groupIds": "yiahf43"}
]'
## Show the different types
z <- x %>% gather_array() %>% spread_values(
name=jstring('name')
) %>% enter_object('groupIds') %>% json_types()
## Conditionally treat each
final <- bind_rows(
z[z$type=='array',] %>% gather_array('id') %>% append_values_string('groupId')
, z[z$type=='string',] %>% append_values_string('groupId') %>% mutate(id=1)
) %>% tbl_df
## Spread them out, maybe? Depends on what you're looking for
final %>% spread('id','groupId')
Related
I am new in web scraping with r and I am trying to get a daily updated object which is probably not text. The url is
here and I want to extract the daily situation table in the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with html and css so if you have any useful source or advice on how I can extract objects from a webpage I would really appreciate it, since SelectorGadget in that case indicate "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)
url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes( source , '.cmp-gridStat__item-container' ) %>% html_node( '.number' ) %>% html_text() %>% toString()
We can convert the text obtained from Daily situation update using vroom package
library(rvest)
library(vroom)
url = 'https://covid19.public.lu/en.html'
df = url %>%
read_html() %>%
html_nodes('.cmp-gridStat__item-container') %>%
html_text2()
vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
X1
<chr>
1 369 People tested positive for COVID-19
2 Per 100.000 inhabitants: 58,13
3 Unvaccinated: 91,20
Edit:
html_element vs html_elemnts
The pout of html_elemnts (html_nodes) is,
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node)` is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see html_nodes returns all value associated with the nodes whereashtml_node only returns the first node. Thus, the former fetches you all the nodes which is really helpful.
html_text vs html_text2
The html_text2retains the breaks in strings usually \n and \b. These are helpful when working with strings.
More info is in rvest documentation,
https://cran.r-project.org/web/packages/rvest/rvest.pdf
There is probably a much more elegant way to do this efficiently, but when I need brute force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and lookahead regex to get the exact piece of data I need. It basically takes the form of "?<=text_right_before).+?(?=text_right_after)
library(httr)
library(stringr)
r <- GET("https://covid19.public.lu/en.html")
html<-content(r, "text")
normal_care=str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care=str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info) you can use the API to extract.
If you want as a DataFrame (resembling as per webpage) you can write a user defined function, with the help of pdftools, to reconstruct the table from the pdf. Bit more effort but as you already have other answers covering using rvest thought I'd have a look at this. I looked at tabularize but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication I use e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)
r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier
filename <- "covd_daily.pdf"
download.file(latest_daily_covid_pdf, filename, mode = "wb")
get_latest_daily_df <- function(filename) {
data <- pdf_text(filename)
text <- data[[1]] %>% strsplit(split = "\n{2,}")
web_data <- text[[1]][3:12]
df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
unlist() %>%
matrix(nrow = 10, ncol = 5, byrow = T) %>%
as_tibble()
colnames(df) <- text[[1]][2] %>%
strsplit(split = "\\s{2,}") %>%
map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
unlist()
title <- text[[1]][1] %>%
strsplit(split = "\n") %>%
unlist() %>%
tail(1) %>%
gsub("\\s+", " ", .) %>%
gsub(" TOTAL", "", .)
colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
colnames(df)[1] <- "Metric"
clean_col <- function(x) {
gsub("\\s+|,", "", x) %>% as.numeric()
}
clean_col2 <- function(x) {
gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
}
df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
Metric = clean_col2(Metric)
)
return(df)
}
View(get_latest_daily_df(filename))
Output:
Alternate:
If you simply want to pull items then process you could extract each column as an item in a list. Replace br elements such that the content within those end up in a comma separated list:
library(rvest)
library(magrittr)
library(stringi)
library(xml2)
page <- read_html("https://covid19.public.lu/en.html")
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",") #This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_remove()
columns <- page %>% html_elements(".cmp-gridStat__item")
map(columns, ~ .x %>%
html_elements("p") %>%
html_text(trim = T) %>%
gsub("\n\\s{2,}", " ", .)
%>%
stri_remove_empty())
I want to load the table at the bottom of the following webpage into R, either as a dataframe or table: https://www.lawschooldata.org/school/Yale%20University/18. My first instinct was to use the readHTMLTable function in the XML package
library(XML)
url <- "https://www.lawschooldata.org/school/Yale%20University/18"
##warning message after next line
table <- readHTMLTable(url)
table
However, this returns an empty list and gives me the following warning:
Warning message:XML content does not seem to be XML: ''
I also tried adapting code I found here Scraping html tables into R data frames using the XML package. This worked for 5 of the 6 tables on the page, but just returned the header row and one row with values from the header row for the 6th table, which is the one I am interested in. Code below:
library(XML)
library(RCurl)
library(rlist)
theurl <- getURL("https://www.lawschooldata.org/school/Yale%20University/18",.opts = list(ssl.verifypeer = FALSE) )
tables <- readHTMLTable(theurl)
##generates a list of the 6 tables on the page
tables <- list.clean(tables, fun = is.null, recursive = FALSE)
##takes the 6th table, which is the one I am interested in
applicanttable <- tables[[6]]
##the problem is that this 6th table returns just the header row and one row of values
##equal to those the header row
head(applicanttable)
Any insights would be greatly appreciated! For reference, I have also consulted the following posts that appear to have similar goals, but could not find a solution there:
Scraping html tables into R data frames using the XML package
Extracting html table from a website in R
The data is dynamically pulled from a nested JavaScript array, within a script tag when JavaScript runs in the browser. This doesn't happen when you use rvest to retrieve the non-rendered content (as seen in view-source).
You can regex out the appropriate nested array and then re-construct the table by splitting out the rows, adding the appropriate headers and performing some data manipulations on various columns e.g. some columns contain html which needs to be parsed to obtain the desired value.
As some columns e.g. Name contain values which could be interpreted as file paths , when using read_html, I use htmltidy to ensure handling as valid html.
N.B. If you use RSelenium then the page will render and you can just grab the table direct without reconstructing it.
TODO:
There are still some data type manipulations you could choose to apply to a few columns.
There is some more logic to be applied to ensure only Name is returned in Name column. Take the case of df$Name[10], this returns "Character and fitness issues" instead of Anxiousboy, due to the required value actually sitting in element.nextSibling.nextSibling of the p tag which is actually selected. These, infrequent, edge cases, need some additional logic built in. In this case, you might test for a particular string being returned then resort to re-parsing with an xpath expression.
R:
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(stringr)
library(htmltidy)
#> Warning: package 'htmltidy' was built under R version 4.0.3
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
get_value <- function(input) {
value <- tidy_html(input) %>%
read_html() %>%
html_node("a, p, span") %>%
html_text(trim = T)
result <- ifelse(is.na(value), input, value)
return(result)
}
tidy_result <- function(result) {
return(gsub("<.*", "", result))
}
page <- read_html("https://www.lawschooldata.org/school/Yale%20University/18")
s <- page %>% toString()
headers <- page %>%
html_nodes("#applicants-table th") %>%
html_text(trim = T)
s <- stringr::str_extract(s, regex("DataTable\\(\\{\n\\s+data:(.*\\n\\]\\n\\])", dotall = T)) %>%
gsub("\n", "", .)
rows <- stringr::str_extract_all(s, regex("(\\[.*?\\])", dotall = T))[[1]] %>% as.list()
df <- sapply(rows, function(x) {
stringr::str_match_all(x, "'(.*?)'")[[1]][, 2]
}) %>%
t() %>%
as_tibble(.name_repair = "unique")
#> New names:
#> * `` -> ...1
#> * `` -> ...2
#> * `` -> ...3
#> * `` -> ...4
#> * `` -> ...5
#> * ...
names(df) <- headers
df <- df %>%
rowwise() %>%
mutate(across(c("Name", "GRE", "URM", "$$$$"), .f = get_value)) %>%
mutate_at(c("Result"), tidy_result)
write.csv(df, "Yale Applications.csv")
Created on 2021-06-23 by the reprex package (v0.3.0)
Sample output:
I have a custom function, which generates HTML tables with DT:datatable and works well when applied without iterative approaches.
function_datatable <- function(df, input_var) {
df %>%
group_by(!!enquo(input_var)) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
filter(! is.na(!!enquo(input_var))) %>%
DT::datatable()
}
However, it does not work when applied to an iterative approach for example with purrr::map or map2. My goal is to apply this function to every variable of a data frame / tibble. The goal is to apply this in an R Markdown, where a variety of variables should be displayed one after the other. So the syntax should be something like this (code obviously does not work)
df %>% map(function_datatable)
I got a couple of different error message when trying out different approaches and combinations. Some were related to group_by and character others to the datatable/htmlwidget class.
Here's an example with mtcars and removing last line of your function.
require(tidyverse)
dataf <- tibble(mtcars)
function_datatable <- function(col_name, df) {
return(df %>%
group_by_at(col_name) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
filter(! is.na(!!enquo(col_name))))
}
map(names(dataf), function_datatable, dataf)
Updated code ("state of the art" tidyverse syntax). Some tidyeval stuff was needed in order to deal with the correct way of programming the function. Thanks #Manuel Zambelli for the map() answer!
library(tidyverse)
function_datatable <- function(input_var, df) {
df %>%
group_by(across(all_of(input_var))) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
filter(! is.na(!!rlang::sym(input_var))) %>%
DT::datatable()
}
map(names(mtcars), function_datatable, mtcars)
I want to extract data from the json object in R
R Package used tidyjson, magrittr, jsonlite
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
gather_array %>% # stack as an array
spread_values(USER_ID = jstring("KEYS.USER_ID"),
MOBILE_NO = jstring("KEYs.MOBILE_NO") )
Output of this code is not as required. Anyone with suggestions.
document.id array.index USER_ID MOBILE_NO
1 1 1 <NA> <NA>
Expected output:
document.id array.index USER_ID MOBILE_NO
1 1 1266 9000000000
tidyjson uses multi-parameter paths, rather than "dot-separated" paths, as you attempted. You can really tackle this two ways:
Recommended, as it does not throw away the rest of the object:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
gather_array %>% # stack as an array
spread_values(USER_ID = jstring('KEYS','USER_ID'),
MOBILE_NO = jstring('KEYS','MOBILE_NO'))
Can also use enter_object if preferred or necessary:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
gather_array %>% # stack as an array
enter_object('KEYS') %>%
spread_values(USER_ID = jstring('USER_ID'),
MOBILE_NO = jstring('MOBILE_NO'))
I have dataframe, called data_df, which has one column which contain json string, column name is json_response.
I want access very specific key-value from it. Example of one of json string as follows. I want to know how many times success is true in string.
x = "[{\"s\":\"D\",\"success\":true,\"start.time\":\"2016-01-27 19:27:27\",\"stop.time\":\"2016-01-27 19:27:30\",\"status_code\":200,\"called\":true,\"milliseconds\":3738.6858,\"_row\":\"DataX\"},{\"s\":\"C\",\"success\":true,\"start.time\":\"2016-01-27 19:27:30\",\"stop.time\":\"2016-01-27 19:27:32\",\"status_code\":200,\"called\":true,\"milliseconds\":1815.1433,\"_row\":\"Clarity\"}]"
If I only want to use tidyjson, I can do it as follows, which works as I want.
library(dplyr)
library(tidyjson)
x %>% gather_array %>%
spread_values(called = jstring("called")) %>%
summarize(x = sum(called == "TRUE"))
Now if I want to do it for whole column, how should I do it? I don't want to use a loop.
Following is my code which I tried to use.
data_df %>%
transmute(
test = json_response %>% gather_array %>%
spread_values(called = jstring("called")) %>%
summarize(x = sum(called=="TRUE"))
)
Following is the error I got when I ran the above code:
Error: not compatible with STRSXP
Instead of using tidyjson you can use rjson combined with dplyr in a way like this:
data_df$test <- data_df %>% rowwise %>%
do(test = .$json_response %>% as.character %>% fromJSON %>% sapply(`[[`, "called") %>% sum) %>%
as.data.frame
You can use tidyjson for this, simply convert data_df into a tbl_json object, and then proceed as before:
data_df %>%
as.tbl_json(json.column = "json_response") %>%
# track each document if you don't already have an ID
mutate(rownum = 1:n()) %>%
gather_array %>%
# use jlogical for correct type
spread_values(success = jlogical("success")) %>%
group_by(rownum) %>%
summarize(num.successes = sum(success))