Extract data in columnar format from JSON in R

I want to extract data from a JSON object in R.
R packages used: tidyjson, magrittr, jsonlite.
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  spread_values(USER_ID = jstring("KEYS.USER_ID"),
                MOBILE_NO = jstring("KEYS.MOBILE_NO"))
The output of this code is not what I need. Does anyone have suggestions?
  document.id array.index USER_ID MOBILE_NO
1           1           1    <NA>      <NA>
Expected output:
  document.id array.index USER_ID  MOBILE_NO
1           1           1    1266 9000000000

tidyjson uses multi-parameter paths rather than the "dot-separated" paths you attempted. You can tackle this in two ways:
Recommended, as it does not throw away the rest of the object:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  spread_values(USER_ID = jstring('KEYS', 'USER_ID'),
                MOBILE_NO = jstring('KEYS', 'MOBILE_NO'))
You can also use enter_object() if preferred or necessary:
trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
trial %>%
  gather_array %>% # stack as an array
  enter_object('KEYS') %>%
  spread_values(USER_ID = jstring('USER_ID'),
                MOBILE_NO = jstring('MOBILE_NO'))
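As a side note (not part of the original answer), since jsonlite is among the packages the question loads, fromJSON() with flatten = TRUE yields a similar one-row data frame directly:
library(jsonlite)

trial <- '[{ "KEYS": {"USER_ID": "1266", "MOBILE_NO": "9000000000"}}]'
fromJSON(trial, flatten = TRUE)
#>   KEYS.USER_ID KEYS.MOBILE_NO
#> 1         1266     9000000000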

Related

How can I filter out numbers from an html table in R?

I am currently working on a forecasting model, and to do this I would like to import data from an HTML website into R and save the values part of the data set into a new list.
I have used the following approach in R:
library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))
removeNodes(getNodeSet(document, "//*/comment()"))
doc.tables <- readHTMLTable(document)

# show BID/ASK block:
doc.tables[2]
Which (doc.tables[2]) gives me in this case the result:
$`NULL`
  Bid 0,765
1 Ask  0,80
How can I filter out the numbers (0,765 and 0,80) from the table and save them in a list?
The issue is that 0,765 is actually the name of a column of your data frame, doc.tables[[2]].
You can grab the name by calling names(doc.tables[[2]])[2]
and store it as a variable, like name <- names(doc.tables[[2]])[2].
Then you can grab the 0,80 by using doc.tables[[2]][[2]], storing that as a variable too if you like.
The final code should look like... my_list <- list(name, doc.tables[[2]][[2]])
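For completeness, those steps as a small runnable sketch (assuming the doc.tables object from the question is available):
# The Bid value hides in the column name; the Ask value is in the data
name <- names(doc.tables[[2]])[2]  # "0,765"
value <- doc.tables[[2]][[2]]      # "0,80"
my_list <- list(name, value)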
Here is a way with rvest, not package XML.
The code below uses two more packages, stringr and readr, to extract the values and their names.
library(httr)
library(rvest)
library(dplyr)

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)

tbl <- page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  stringr::str_replace_all(",", ".")

tibble(name = stringr::str_extract(tbl, "Ask|Bid"),
       value = readr::parse_number(tbl))
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8
Created on 2022-03-26 by the reprex package (v2.0.1)
Without saving the pipe result to a temporary object, tbl, the pipe can continue as below.
library(httr)
library(rvest)
library(stringr)
suppressPackageStartupMessages(library(dplyr))

link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
page <- read_html(link)

page %>%
  html_elements("tr") %>%
  html_text() %>%
  .[3:4] %>%
  str_replace_all(",", ".") %>%
  tibble(name = str_extract(., "Ask|Bid"),
         value = readr::parse_number(.)) %>%
  .[-1]
#> # A tibble: 2 x 2
#>   name  value
#>   <chr> <dbl>
#> 1 Bid   0.765
#> 2 Ask   0.8
Created on 2022-03-27 by the reprex package (v2.0.1)
This is building on Jahi Zamy’s observation that some of your data are showing up as column names and on the example code in the question.
library(httr)
library(XML)

# getting website data:
link <- "https://www.tradegate.de/orderbuch.php?isin=US13200M5085"
document <- htmlParse(GET(link, user_agent("Mozilla")))

# readHTMLTable() assumes tables have a header row by default,
# but these tables do not, so use header=FALSE
doc.tables <- readHTMLTable(document, header = FALSE)

# Extract column from BID/ASK table
BidAsk <- doc.tables[[2]][, 2]

# Replace commas with point decimal separator and convert to numeric
BidAsk <- as.numeric(gsub(",", ".", BidAsk))
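To save the values into a list as originally asked, here is a minimal sketch, assuming the first column of that table holds the "Bid"/"Ask" labels:
# Pair each numeric value with its label from the first column
bid_ask <- setNames(as.list(BidAsk), doc.tables[[2]][, 1])
bid_ask$Bid
#> [1] 0.765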

Scrape object from html with rvest

I am new to web scraping with R, and I am trying to get a daily updated object which is probably not text. The url is
https://covid19.public.lu/en.html and I want to extract the daily situation table at the end of the page. The class of this object is
class="aem-GridColumn aem-GridColumn--default--12 aem-GridColumn--offset--default--0"
I am not really experienced with HTML and CSS, so if you have any useful source or advice on how to extract objects from a webpage I would really appreciate it, since SelectorGadget in this case indicates "No valid path found."
Without getting into the business of writing web scrapers, I think this should help you out:
library(rvest)

url = 'https://covid19.public.lu/en.html'
source = read_html(url)
selection = html_nodes(source, '.cmp-gridStat__item-container') %>%
  html_node('.number') %>%
  html_text() %>%
  toString()
We can convert the text obtained from the Daily situation update into a tibble using the vroom package:
library(rvest)
library(vroom)

url = 'https://covid19.public.lu/en.html'
df = url %>%
  read_html() %>%
  html_nodes('.cmp-gridStat__item-container') %>%
  html_text2()

vroom(df, delim = '\\n', col_names = F)
# A tibble: 22 x 1
   X1
   <chr>
 1 369 People tested positive for COVID-19
 2 Per 100.000 inhabitants: 58,13
 3 Unvaccinated: 91,20
Edit:
html_element vs html_elements
The output of html_elements (html_nodes) is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
[2] "4 625 Number of PCR tests performed\n\nPer 100.000 inhabitants: 729\n\nPositivity rate in %: 7,98\n\nReproduction rate: 0,97"
[3] "80 Hospitalizations\n\nNormal care: 57\nIntensive care: 23\n\nNew deaths: 1\nTotal deaths: 890"
[4] "6 520 Vaccinations per day\n\nDose 1: 785\nDose 2: 468\nComplementary dose: 5 267"
[5] "960 315 Total vaccines administered\n\nDose 1: 452 387\nDose 2: 395 044\nComplementary dose: 112 884"
and that of html_element (html_node) is
[1] "369 People tested positive for COVID-19\n\nPer 100.000 inhabitants: 58,13\n\nUnvaccinated: 91,20\n\nVaccinated: 41,72\n\nRatio Unvaccinated / Vaccinated: 2,19\n\n "
As you can see, html_nodes returns all values associated with the nodes, whereas html_node only returns the first node. Thus, the former fetches you all the nodes, which is really helpful.
html_text vs html_text2
html_text2 retains the breaks in strings, usually \n and \b. These are helpful when working with strings.
More info is in the rvest documentation:
https://cran.r-project.org/web/packages/rvest/rvest.pdf
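A minimal sketch of the html_text()/html_text2() difference, using rvest's minimal_html() on a made-up snippet rather than the page above:
library(rvest)

snippet <- minimal_html("<p>Normal care: 57<br>Intensive care: 23</p>")
snippet %>% html_element("p") %>% html_text()  # <br> contributes no text
#> [1] "Normal care: 57Intensive care: 23"
snippet %>% html_element("p") %>% html_text2() # <br> becomes "\n"
#> [1] "Normal care: 57\nIntensive care: 23"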
There is probably a much more elegant way to do this efficiently, but when I need to brute force something like this, I try to break it down into small parts.
Use the httr library to get the raw html.
Use str_extract from the stringr library to extract the specific piece of data from the html.
I use both a positive lookbehind and a lookahead regex to get the exact piece of data I need. It basically takes the form of (?<=text_right_before).+?(?=text_right_after)
library(httr)
library(stringr)

r <- GET("https://covid19.public.lu/en.html")
html <- content(r, "text")

normal_care = str_extract(html, regex("(?<=Normal care: ).+?(?=<br>)"))
intensive_care = str_extract(html, regex("(?<=Intensive care: ).+?(?=</p>)"))
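The values extracted this way are character strings; a small follow-up sketch (allowing for the page's comma decimal separator, where present) converts them to numeric:
normal_care    <- as.numeric(gsub(",", ".", normal_care))
intensive_care <- as.numeric(gsub(",", ".", intensive_care))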
I wondered if you could get the same data from any of their public APIs. If you simply want a pdf with that table (plus lots of other tables of useful info), you can use the API to extract it.
If you want it as a DataFrame (resembling the webpage), you can write a user-defined function, with the help of pdftools, to reconstruct the table from the pdf. A bit more effort, but as you already have other answers covering rvest, I thought I'd have a look at this. I looked at tabulizer but that wasn't particularly effective.
More than likely, you could pull several of the API datasets together to get the full content without the need to parse the pdf publication I use; e.g. there is an Excel spreadsheet that gives the case numbers.
N.B. There are a few bottom calcs from the webpage not included below. I have only processed the testing info table from the pdf.
Rapports journaliers:
https://data.public.lu/en/datasets/covid-19-rapports-journaliers/#_
https://download.data.public.lu/resources/covid-19-rapports-journaliers/20211210-165252/coronavirus-rapport-journalier-10122021.pdf
API datasets:
https://data.public.lu/api/1/datasets/#
library(tidyverse)
library(jsonlite)
## https://data.library.virginia.edu/reading-pdf-files-into-r-for-text-mining/
# install.packages("pdftools")
library(pdftools)

r <- jsonlite::read_json("https://data.public.lu/api/1/datasets/#")
report_index <- match(TRUE, map(r$data, function(x) x$slug == "covid-19-rapports-journaliers"))
latest_daily_covid_pdf <- r$data[[report_index]]$resources[[1]]$latest # coronavirus-rapport-journalier

filename <- "covd_daily.pdf"

download.file(latest_daily_covid_pdf, filename, mode = "wb")

get_latest_daily_df <- function(filename) {
  data <- pdf_text(filename)
  text <- data[[1]] %>% strsplit(split = "\n{2,}")
  web_data <- text[[1]][3:12]
  df <- map(web_data, function(x) strsplit(x, split = "\\s{2,}")) %>%
    unlist() %>%
    matrix(nrow = 10, ncol = 5, byrow = T) %>%
    as_tibble()
  colnames(df) <- text[[1]][2] %>%
    strsplit(split = "\\s{2,}") %>%
    map(function(x) gsub("(.*[a-z])\\d+", "\\1", x)) %>%
    unlist()
  title <- text[[1]][1] %>%
    strsplit(split = "\n") %>%
    unlist() %>%
    tail(1) %>%
    gsub("\\s+", " ", .) %>%
    gsub(" TOTAL", "", .)
  colnames(df)[2:3] <- colnames(df)[2:3] %>% paste(title, ., sep = " ")
  colnames(df)[4:5] <- colnames(df)[4:5] %>% paste("TOTAL", ., sep = " ")
  colnames(df)[1] <- "Metric"
  clean_col <- function(x) {
    gsub("\\s+|,", "", x) %>% as.numeric()
  }
  clean_col2 <- function(x) {
    gsub("\n", " ", gsub("([a-z])(\\d+)", "\\1", x))
  }
  df <- df %>% mutate(across(.cols = -c(colnames(df)[1]), clean_col),
                      Metric = clean_col2(Metric))
  return(df)
}

View(get_latest_daily_df(filename))
Alternate:
If you simply want to pull items and then process them, you could extract each column as an item in a list. Replace the br elements so that the content within them ends up in a comma-separated list:
library(rvest)
library(magrittr)
library(purrr)
library(stringi)
library(xml2)

page <- read_html("https://covid19.public.lu/en.html")

# This method from https://stackoverflow.com/a/46755666 #hrbrmstr
xml_find_all(page, ".//br") %>% xml_add_sibling("span", ",")
xml_find_all(page, ".//br") %>% xml_remove()

columns <- page %>% html_elements(".cmp-gridStat__item")

map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  gsub("\n\\s{2,}", " ", .) %>%
  stri_remove_empty())
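As a hypothetical follow-up (not part of the original answer), the resulting list can be labelled with each column's headline figure, assuming the first <p> of every column holds that headline:
stats <- map(columns, ~ .x %>%
  html_elements("p") %>%
  html_text(trim = T) %>%
  stri_remove_empty())
names(stats) <- map_chr(stats, 1)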

Passing a character variable into a function in R (Tidyjson)

I'm messing around with tidyjson (the latest from GitHub, published by Jeremy Stanley). I wanted to automate searching for and extracting nested arrays. The two examples below produce the output I want.
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object("name") %>%
gather_keys %>%
append_values_string
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(name) %>%
gather_keys %>%
append_values_string
These both give the same output:
# A tbl_json: 2 x 3 tibble with a "JSON" attribute
  `attr(., "JSON")` document.id key   string
  <chr>                   <int> <chr> <chr>
1 "bob"                       1 first bob
2 "jones"                     1 last  jones
However, if I declare a character variable beforehand and pass it along, it fails.
object_name <- "name"
'{"name": {"first": "bob", "last": "jones"}, "age": 32}' %>%
enter_object(list(name="name")) %>%
gather_keys %>%
append_values_string
Error: Path components must be single names or character strings
Any ideas why this would happen?
If you are familiar with Hadley's book Advanced R, this is a piece of non-standard evaluation that unfortunately does not presently have a workaround in pure tidyjson (I would prefer an enter_object_ that uses standard evaluation, more like dplyr). I am hopeful of that functionality at some point becoming available, because as you suggest, it would be nice to vectorize and automate these sorts of programs.
The Non-Standard Evaluation is the "magic" that allows you to pass in the un-quoted name and still get good results in your second example (instead of the program looking for an object called name). The hazard is it does not resolve objects like object_name in your case.
That said, it seems you can work around this with do.call and a list of parameters (I fixed your example, as I think it went a bit awry):
library(tidyjson)

json <- "{\"name\": {\"first\": \"bob\", \"last\": \"jones\"}, \"age\": 32}"
object_name <- "name"

do.call(enter_object, args = list(json, object_name)) %>%
  gather_object %>%
  append_values_string
#> # A tbl_json: 2 x 3 tibble with a "JSON" attribute
#>   `attr(., "JSON")` document.id name  string
#>   <chr>                   <int> <chr> <chr>
#> 1 "\"bob\""                   1 first bob
#> 2 "\"jones\""                 1 last  jones
I definitely recommend checking out some of the new features / functionality in the development version of tidyjson with devtools::install_github('jeremystan/tidyjson'), but unfortunately no support for standard evaluation in "path"s yet.
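Building on the do.call() workaround, a small wrapper makes it easy to loop over several object names; extract_object below is a hypothetical helper, not part of tidyjson:
extract_object <- function(json, object_name) {
  do.call(enter_object, args = list(json, object_name)) %>%
    gather_object %>%
    append_values_string
}

# e.g. apply it across a character vector of object names
lapply(c("name"), extract_object, json = json)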

Trouble spreading values using tidyjson

I am trying to convert the following multi-document JSON file into a data.frame.
x = '[
  {"name": "Bob", "groupIds": ["kwt6x61", "yiahf43"]},
  {"name": "Sally", "groupIds": "yiahf43"}
]'
I'm almost there by using
y = x %>%
  gather_array() %>%
  spread_values(
    name = jstring("name"),
    groupIds = jstring("groupIds")
  )
print(y)
Which returns:
  document.id array.index name  groupIds
1           1           1 Bob   list("kwt6x61", "yiahf43")
2           1           2 Sally yiahf43
Can someone help spread the groupIds into additional rows?
This is an interesting problem. The issue stems from the fact that a length-one array is stored as a plain string; otherwise, enter_object('groupIds') %>% gather_array %>% append_values_string would work nicely. tidyjson does not seem to handle this situation gracefully. I wonder whether this would even be considered valid JSON, since in one case groupIds is a string, and in another it is an array.
In any case, although this is not an ideal solution, you can use json_types() to surface the difference and then conditionally treat each case. I converted to a tbl_df (i.e. dropped the JSON component) for future processing once parsing is done.
library(tidyjson)
library(dplyr)
library(tidyr)
x = '[
  {"name": "Bob", "groupIds": ["kwt6x61", "yiahf43"]},
  {"name": "Sally", "groupIds": "yiahf43"}
]'
## Show the different types
z <- x %>%
  gather_array() %>%
  spread_values(name = jstring('name')) %>%
  enter_object('groupIds') %>%
  json_types()

## Conditionally treat each
final <- bind_rows(
  z[z$type == 'array', ] %>% gather_array('id') %>% append_values_string('groupId'),
  z[z$type == 'string', ] %>% append_values_string('groupId') %>% mutate(id = 1)
) %>% tbl_df

## Spread them out, maybe? Depends on what you're looking for
final %>% spread('id', 'groupId')
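A small follow-up note on newer tidyr (an assumption about current versions, not part of the original answer): spread() has since been superseded, and pivot_wider() performs the same reshaping here:
final %>% tidyr::pivot_wider(names_from = id, values_from = groupId)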

Tidyjson: is there an 'exit_object()' equivalent?

I'm using package tidyjson to parse a json string and extract the key values into columns. The json in nested, and while I can drill down at a node, I can't figure out a way to go up to the previous level. The code is below:
library(tidyjson)
library(data.table)
library(dplyr)
input <- '{
  "name": "Bob",
  "age": 30,
  "social": {
    "married": "yes",
    "kids": "no"
  },
  "work": {
    "title": "engineer",
    "salary": 5000
  }
}'
output <- input %>% as.tbl_json() %>%
  spread_values(name = jstring("name"),
                age = jnumber("age")) %>%
  enter_object("social") %>%
  spread_values(married = jstring("married"),
                kids = jstring("kids")) %>%
  #### I would need an exit_object() here
  enter_object("work") %>%
  spread_values(title = jstring("title"),
                salary = jnumber("salary"))
There's a note in the documentation:
"Note that there are often situations where there are multiple arrays
or objects of differing types that exist at the same level of the JSON
hierarchy. In this case, you need to use enter_object() to enter each
of them in separate pipelines to create separate data.frames that can
then be joined relationally."
As such I've been staging my tidyjson commands and putting the outputs together with merge, e.g.:
# first the high-level values
output_table <- input_tbl_json %>%
  spread_values(val1 = jstring('val1'),
                val2 = jnumber('val2'))

# then enter an object and get something from inside, merging it as a new column
output_table <- merge(output_table,
                      input_tbl_json %>%
                        enter_object('thing') %>%
                        spread_values(val3 = jstring('thing1')),
                      by = c('document.id'))

The output table columns should look like | document.id | val1 | val2 | val3 |
That workflow may fall over with operations like gather_keys() that add rows, but I haven't had call to test it.
I think an overlooked piece of functionality within tidyjson is the ability to use more complex paths in the jnumber, jstring, etc. functions.
You can do something like the following without "entering an object." I find this to be a very satisfying solution, for the most part. Perhaps more satisfying than multiple enter/exits.
input <- '{
  "name": "Bob",
  "age": 30,
  "social": {
    "married": "yes",
    "kids": "no"
  },
  "work": {
    "title": "engineer",
    "salary": 5000
  }
}'
output <- input %>% as.tbl_json() %>%
  spread_values(
    name = jstring('name'),
    age = jnumber('age'),
    married = jstring('social', 'married'),
    kids = jstring('social', 'kids'),
    title = jstring('work', 'title'),
    salary = jnumber('work', 'salary')
  )