fast JSON to data.frame/data.table

I have this R code to convert JSON data to a data.frame. It works fine, but it is rather slow for huge JSON files. What's a more efficient way to do this (I wouldn't mind having a data.table output)?
library(jsonlite)
json_data <- fromJSON(json_dt_url)
json_data <- json_data[['data']]
my_df <- data.frame()
for (i in seq_along(json_data)) {
  my_df <- rbind(my_df, as.data.frame(json_data[[i]]))
}

If you are looking for fast JSON parsing, take a look at RcppSimdJson.
library(RcppSimdJson)
jsonfile <- system.file("jsonexamples", "small", "demo.json", package="RcppSimdJson")
res <- fload(jsonfile)
str(res)
#> List of 1
#> $ Image:List of 6
#> ..$ Width : int 800
#> ..$ Height : int 600
#> ..$ Title : chr "View from 15th Floor"
#> ..$ Thumbnail:List of 3
#> .. ..$ Url : chr "http://www.example.com/image/481989943"
#> .. ..$ Height: int 125
#> .. ..$ Width : int 100
#> ..$ Animated : logi FALSE
#> ..$ IDs : int [1:4] 116 943 234 38793
Created on 2020-08-05 by the reprex package (v0.3.0)
Using the benchmarking code from the package, we can compare different parsing approaches:
file <- system.file("jsonexamples", "mesh.json", package = "RcppSimdJson")
res <- bench::mark(
  RcppSimdJson = RcppSimdJson::fload(file),
  jsonlite = jsonlite::fromJSON(file),
  jsonify = jsonify::from_json(file),
  RJSONIO = RJSONIO::fromJSON(file),
  ndjson = ndjson::stream_in(file),
  check = FALSE
)
res
#> # A tibble: 5 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 RcppSimdJson 1.51ms 1.67ms 582. 5.82MB 5.98
#> 2 jsonlite 44.68ms 48.95ms 18.8 2.74MB 22.6
#> 3 jsonify 9.76ms 11.34ms 87.5 1.12MB 43.7
#> 4 RJSONIO 33.11ms 35.17ms 28.6 2.93MB 3.82
#> 5 ndjson 136.35ms 138.67ms 7.21 9.41MB 30.6
Created on 2020-08-05 by the reprex package (v0.3.0)
We see that RcppSimdJson is by far the fastest.
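To also get a data.table, as the question allows: instead of growing a data.frame with rbind() inside a loop (which re-copies the accumulated rows on every iteration), parse once and bind all records in a single pass. A minimal sketch, assuming json_data[['data']] is a list of flat records as in the question (json_file stands in for your file path, or a URL if your RcppSimdJson version supports remote files):
library(RcppSimdJson)
library(data.table)

# parse with simdjson, then bind every record at once with rbindlist()
json_data <- fload(json_file)[['data']]
my_dt <- rbindlist(lapply(json_data, as.data.table), fill = TRUE)
fill = TRUE keeps records with missing fields from breaking the bind.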

Alternatively, jsonlite can flatten nested structures for you:
data2 <- fromJSON("data.json", flatten = TRUE)
Reference: https://rdrr.io/cran/jsonlite/f/vignettes/json-apis.Rmd
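For instance, with nested records, flatten = TRUE replaces nested data frames with dot-separated top-level columns. A small sketch with an inline JSON string:
library(jsonlite)
json <- '{"data":[{"id":1,"geo":{"lat":1.5,"lon":2.5}},{"id":2,"geo":{"lat":3.5,"lon":4.5}}]}'
fromJSON(json, flatten = TRUE)$data
#   id geo.lat geo.lon
# 1  1     1.5     2.5
# 2  2     3.5     4.5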

Try it this way:
library(jsonlite)
json_data <- read_json("data.json", simplifyVector = TRUE)
Include the sample input so that I can test the solution myself!


Can't update role of mutated variables

Background
I'm using the R package {recipes} for data preprocessing. Assume that I would like to transform some variable and then declare the transformed variable as the outcome variable for modeling.
Problem and minimal example:
However, when I try this, an error is thrown:
library(tidymodels)
rec <- recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2) |>
  update_role(mpg2, new_role = "outcome")
#> Error in `update_role()`:
#> ! Can't subset columns that don't exist.
#> ✖ Column `mpg2` doesn't exist.
Created on 2023-01-15 with reprex v2.0.2
What I've tried
The help pages of step_mutate() and update_role() do not mention the case of updating the role of a mutated variable. When I update the role of a variable without having mutated it, no error is thrown.
There are SO questions around with a similar error message (such as here, here, or here), but those questions seem to tap into different aspects.
Session info
sessionInfo()
#> R version 4.2.1 (2022-06-23)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur ... 10.16
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] yardstick_1.1.0 workflowsets_1.0.0 workflows_1.1.2 tune_1.0.1
#> [5] tidyr_1.2.1 tibble_3.1.8 rsample_1.1.1 recipes_1.0.4
#> [9] purrr_1.0.1 parsnip_1.0.3 modeldata_1.0.1 infer_1.0.4
#> [13] ggplot2_3.4.0 dplyr_1.0.10 dials_1.1.0 scales_1.2.1
#> [17] broom_1.0.2 tidymodels_1.0.0
#>
#> loaded via a namespace (and not attached):
#> [1] foreach_1.5.2 splines_4.2.1 R.utils_2.12.2
#> [4] prodlim_2019.11.13 assertthat_0.2.1 highr_0.10
#> [7] GPfit_1.0-8 yaml_2.3.6 globals_0.16.2
#> [10] ipred_0.9-13 pillar_1.8.1 backports_1.4.1
#> [13] lattice_0.20-45 glue_1.6.2 digest_0.6.31
#> [16] hardhat_1.2.0 colorspace_2.0-3 htmltools_0.5.4
#> [19] Matrix_1.5-3 R.oo_1.25.0 timeDate_4022.108
#> [22] pkgconfig_2.0.3 lhs_1.1.6 DiceDesign_1.9
#> [25] listenv_0.9.0 gower_1.0.1 lava_1.7.1
#> [28] timechange_0.2.0 styler_1.8.1 generics_0.1.3
#> [31] ellipsis_0.3.2 furrr_0.3.1 withr_2.5.0
#> [34] nnet_7.3-18 cli_3.6.0 survival_3.5-0
#> [37] magrittr_2.0.3 evaluate_0.19 R.methodsS3_1.8.2
#> [40] fs_1.5.2 fansi_1.0.3 future_1.30.0
#> [43] parallelly_1.34.0 R.cache_0.16.0 MASS_7.3-58.1
#> [46] class_7.3-20 tools_4.2.1 lifecycle_1.0.3
#> [49] stringr_1.5.0 munsell_0.5.0 reprex_2.0.2
#> [52] compiler_4.2.1 rlang_1.0.6 grid_4.2.1
#> [55] iterators_1.0.14 rstudioapi_0.14 rmarkdown_2.19
#> [58] gtable_0.3.1 codetools_0.2-18 DBI_1.1.3
#> [61] R6_2.5.1 lubridate_1.9.0 knitr_1.41
#> [64] fastmap_1.1.0 future.apply_1.10.0 utf8_1.2.2
#> [67] stringi_1.7.12 parallel_4.2.1 Rcpp_1.0.9
#> [70] vctrs_0.5.1 rpart_4.1.19 tidyselect_1.2.0
#> [73] xfun_0.36
This behavior is currently not properly documented.
The reason you are having problems is that add_role(), update_role(), and remove_role() can only be applied to the variables passed to recipe(), and they are all executed before the step functions.
This means that the following two snippets of code return the same result (the same error):
recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2) |>
  update_role(mpg2, new_role = "outcome")

recipe(~ ., data = mtcars) |>
  update_role(mpg2, new_role = "outcome") |>
  step_mutate(mpg2 = mpg * 2)
Reference (https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132):
Roles can only be changed on the original data supplied to recipe()
More talk here https://github.com/tidymodels/recipes/issues/437.
The role argument of step_mutate() allows you to specify the role of the variables it creates:
library(recipes)
recipe(~ ., data = mtcars) |>
  step_mutate(mpg2 = mpg * 2, role = "outcome") |>
  prep() |>
  summary()
#> # A tibble: 12 × 4
#> variable type role source
#> <chr> <list> <chr> <chr>
#> 1 mpg <chr [2]> predictor original
#> 2 cyl <chr [2]> predictor original
#> 3 disp <chr [2]> predictor original
#> 4 hp <chr [2]> predictor original
#> 5 drat <chr [2]> predictor original
#> 6 wt <chr [2]> predictor original
#> 7 qsec <chr [2]> predictor original
#> 8 vs <chr [2]> predictor original
#> 9 am <chr [2]> predictor original
#> 10 gear <chr [2]> predictor original
#> 11 carb <chr [2]> predictor original
#> 12 mpg2 <chr [2]> outcome derived
Additionally, it is not recommended that you try to create or modify the outcome inside a recipe. Such modifications should happen beforehand, preferably before data splitting.
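For example, a minimal sketch of doing the transformation up front (the object names are just for illustration):
library(tidymodels)

mtcars2 <- dplyr::mutate(mtcars, mpg2 = mpg * 2)
split <- initial_split(mtcars2)
rec <- recipe(mpg2 ~ ., data = training(split)) |>
  step_rm(mpg)  # drop the raw variable so it doesn't linger among the predictors
Because mpg2 exists in the data before the recipe is created, it can take the outcome role directly via the formula.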

Webscraping does not identify the full HTML link

My problem is very similar to this one. I want to identify all the links in this website so I can then open each link and download the tables.
The problem is that when I create the extract_links function as pointed out in that answer, I get a list of all the links, but these are not complete.
To make it more clear:
If you press "Junio" in year "2022", the real link is the following:
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=7658
but the link that I recover from the source of the website lacks the last bit (&ia=7658):
http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022
This does not direct me to the table I want.
The problem is that these numbers do not seem to follow any logic and change between year/month links. Any help on how to retrieve the full links will be greatly appreciated. If you also happen to know how I can retrieve the year/month of the file to add as an extra column, that would also be great.
Thanks to the help of @margusl I was able to realize that rvest follows redirects automatically, which solves that part of my problem.
I am trying to use the following code to loop over different links to obtain the tables, store them in a data frame and then download them:
library(rvest)

yr.list <- seq(2019, 2020)
mes.list <- seq(1, 12)
combined_df <- data.frame()
for (yr in yr.list) {
  for (mes in mes.list) {
    root <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21"
    # full link
    url <- paste(root, "&m=", mes, "&a=", yr, sep = "")
    # parse HTML file
    file <- read_html(url, encoding = "latin1")
    file <- rvest::html_table(file)
    str(file)
    # this is the relevant table
    table <- as.data.frame(file[[1]])
    # add the tables read in the loop to combined_df
    combined_df <- rbind(combined_df, table)
  }
}
It does not work because read_html with that encoding works only for some years, not for all. For example, when running:
url <- "http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015"
file<-read_html(url, encoding = "latin1")
It does not recover the tables with names/surnames that it recovers for the other months, but something else. Why doesn't this work on all the sub-pages? Is this an encoding problem again?
If you open that last page you had issues with, you'll see that it serves a sort of submenu with 2 more links - http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015 . Meaning that it's not enough to just generate links for each month & year and extract the first table of each page; all those pages should be checked for content, and exceptions should be handled.
I took a somewhat opportunistic approach, and it happened to work with the URL range defined in the question plus those few odd samples, but there could be other surprises down the road. I switched to httr for making requests, as it allows collecting and monitoring response headers; separating content retrieval from parsing also seems to work around encoding issues, at least in this case. Collecting first and parsing later simplifies debugging too: you can check whether certain responses or headers differ from the rest (e.g. a response length 10x smaller than average, or a final, redirected URL that differs from the others), and it's easy to change content handling / parsing for a small subset of responses if needed. If you are not sure what rvest has retrieved, you can always save the response to an HTML file and check it with a browser or editor, something like:
html <- read_html(url_or_text_content); write(as.character(html), "dump.html")
library(rvest)
library(httr)
library(purrr)
library(dplyr)
library(tidyr)
library(stringr)
yr.list <- seq(2019, 2020)
mes.list <- seq(1, 12)

# combine mes.list & yr.list
url.params <- expand.grid(mes = mes.list, yr = yr.list)
# a few extra samples:
url.params <- rbind(url.params,
                    list(mes = 6, yr = 2022), # here rvest struggles with correct encoding
                    list(mes = 3, yr = 2015)) # returns a page with sub-categories
url.list <- str_glue("http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m={url.params$mes}&a={url.params$yr}")
url.list
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=1&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=2&a=2019
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2019
#> ...
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=11&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=12&a=2020
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=6&a=2022
#> http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# url list for input; output is a tibble with all responses (incl. "url", "date",
# "status_code", header details and the response body)
fetch_urls <- function(url.list) {
  # collect all responses into a list with httr, parse them later;
  # add a progress bar - requests take a while
  resp.list <- vector(mode = "list", length = length(url.list))
  pb <- txtProgressBar(max = length(url.list), style = 3)
  for (i in seq_along(url.list)) {
    resp.list[[i]] <- GET(url.list[i])
    setTxtProgressBar(pb, i)
  }
  close(pb)

  # turn responses into a tibble to check urls, response sizes and status codes
  resp.tibble <- bind_cols(
    map_df(resp.list, ~ .[c("url", "date", "status_code")], .id = "req_id"),
    map_df(resp.list, headers) %>% rename_with(~ paste0("header_", .x)),
    map_chr(resp.list, content, as = "text") %>% tibble(html_doc = .)
  )
  return(resp.tibble)
}
resp.tibble <- fetch_urls(url.list)
# check the resulting table without the html_doc column:
# View(resp.tibble[-ncol(resp.tibble)])
resp.tibble %>%
  select(req_id:status_code, `header_content-length`) %>%
  arrange(`header_content-length`)
#> # A tibble: 26 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 14 http://transparencia.uantof.cl/in… 2022-10-31 17:29:12 200 21371
#> 2 26 http://transparencia.uantof.cl/in… 2022-10-31 17:31:45 200 2230
#> 3 24 http://transparencia.uantof.cl/in… 2022-10-31 17:31:21 200 24035
#> 4 21 http://transparencia.uantof.cl/in… 2022-10-31 17:30:42 200 24173
#> 5 20 http://transparencia.uantof.cl/in… 2022-10-31 17:30:29 200 24183
#> 6 23 http://transparencia.uantof.cl/in… 2022-10-31 17:31:08 200 24184
#> 7 22 http://transparencia.uantof.cl/in… 2022-10-31 17:30:55 200 24207
#> 8 18 http://transparencia.uantof.cl/in… 2022-10-31 17:30:04 200 24405
#> 9 16 http://transparencia.uantof.cl/in… 2022-10-31 17:29:38 200 24715
#> 10 7 http://transparencia.uantof.cl/in… 2022-10-31 17:27:32 200 24716
#> # … with 16 more rows, and abbreviated variable names ¹​status_code,
#> # ²​`header_content-length`
# response 26 is kind of suspicious:
# 25 http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=6&a=2022&ia=76…
# 26 http://transparencia.uantof.cl/index.php?action=plantillas_selec_archivo&ig=21&m=3&a=2015
# looks like there has been no redirection and its header_content-length is about 10x smaller than for other responses
# checking it more closely reveals that the page includes a "submenu" instead of table(s):
# <p class="subMenu_interiores">
# <b>2015 - Marzo</b>
# ABRIL 2015
# Marzo 2015
# </p>
# let's collect the urls that were NOT redirected from our tibble and harvest links from the stored html:
suburl.list <- resp.tibble %>%
  # urls that do NOT include "plantillas_generar_plantilla"
  filter(!str_detect(url, "plantillas_generar_plantilla")) %>%
  pull(html_doc) %>%
  # rvest does not like lists, thus let's map()
  map(~ read_html(.x) %>% html_elements("#columna1_interiores a") %>% html_attr("href")) %>%
  unlist() %>%
  paste0("http://transparencia.uantof.cl/", .)
suburl.list
#> [1] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=772"
#> [2] "http://transparencia.uantof.cl/index.php?action=plantillas_generar_plantilla&ig=21&m=3&a=2015&ia=648"
# fetch content from those submenu urls
subresp.tibble <- fetch_urls(suburl.list)

# sanity check:
subresp.tibble %>%
  select(req_id:status_code, `header_content-length`)
#> # A tibble: 2 × 5
#> req_id url date statu…¹ heade…²
#> <chr> <chr> <dttm> <int> <chr>
#> 1 1 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:52 200 25385
#> 2 2 http://transparencia.uantof.cl/ind… 2022-10-31 17:31:59 200 25332
#> # … with abbreviated variable names ¹​status_code, ²​`header_content-length`
# better, sizes align with previous results.

# collect all relevant responses
table_1 <- resp.tibble %>%
  filter(str_detect(url, "plantillas_generar_plantilla")) %>%
  bind_rows(subresp.tibble) %>%
  # extract html (as strings)
  pull(html_doc) %>%
  # rvest does not like lists, thus let's map(); pluck(1) extracts the first table from each page
  map(~ read_html(.x) %>% html_table() %>% pluck(1)) %>%
  # a first attempt to bind rows fails, apparently column types differ:
  # change all non-character columns to character
  map(~ mutate(.x, across(!where(is.character), as.character))) %>%
  # bind all tables by rows
  bind_rows()
# columns vary across tables, so the number of NA fields in the final result is rather high
Final result for 26 pages, a 10,987 × 30 tibble:
table_1
#> # A tibble: 10,987 × 30
#> Nº PLANTA PATERNO MATERNO NOMBRES G TITULO CARGO REGION ASIGN…¹
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1 DIRECTIVO ABARZA CASTRO JESSIC… 6 ADMIN… JEFE… SEGUN… (1)(8)…
#> 2 2 PROFESIONAL ABASOLO QUINTE… NURY D… 12 EDUCA… PROF… SEGUN… (4)(8)…
#> 3 3 ACADEMICO ACOSTA PENA ROXANA… 11 EDUCA… PROF… SEGUN… (2)(8)…
#> 4 4 AUXILIARES ACOSTA PIZARRO ROBERT… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 5 5 DIRECTIVO AGENO SIERRA ROSELL… 4 MATRO… DIRE… SEGUN… (1)(8)…
#> 6 6 AUXILIARES AGUIRRE LAZO RENE G… 16 LICEN… AUXI… SEGUN… (7)(8)…
#> 7 7 TECNICOS ALAMOS MARIN SERGIO… 13 TECNI… TECN… SEGUN… (5)(8)…
#> 8 8 AUXILIARES ALAYANA CORTES CHRIST… 23 LICEN… AUXI… SEGUN… (7)(8)…
#> 9 9 ACADEMICO ALCOTA AGUIRRE PATRIC… 9 ING. … PROF… SEGUN… (2)(8)…
#> 10 10 ADMINISTRATI… ALFARO BARRAZA MARIA … 23 LICEN… ADMI… SEGUN… (6)(8)…
#> # … with 10,977 more rows, 20 more variables: `UNID MONETARIA` <chr>,
#> # `REMUNERACION MENSUAL BRUTA` <chr>, HORAS <chr>, `CANT. HORAS` <chr>,
#> # `MONTO HORAS EXTRAS` <chr>, `FECHA DE INGRESO` <chr>, `F. HASTA` <chr>,
#> # OBSERVACIONES <chr>, GRADO <chr>, ESTAMENTO <chr>,
#> # `Apellido Paterno` <chr>, `Apellido Materno` <chr>, Nombres <chr>,
#> # `Grado ERUA` <chr>, `CALIFICACION PROFESIONAL O FORMACION` <chr>,
#> # `CARGO O FUNCION` <chr>, `R BRUTA` <chr>, `Horas Extras` <chr>, …
Created on 2022-10-31 with reprex v2.0.2
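As for the year/month columns asked about in the question: the m= and a= query parameters survive in each final response URL, so they can be pulled out with a regular expression and attached before binding. A hedged sketch building on the tibbles above (the mes/anio column names are just for illustration):
table_2 <- resp.tibble %>%
  filter(str_detect(url, "plantillas_generar_plantilla")) %>%
  bind_rows(subresp.tibble) %>%
  mutate(mes  = str_match(url, "m=(\\d+)")[, 2],
         anio = str_match(url, "a=(\\d+)")[, 2],
         # parse each page's first table, coerced to character as above
         tbl  = map(html_doc, ~ read_html(.x) %>% html_table() %>% pluck(1) %>%
                      mutate(across(!where(is.character), as.character)))) %>%
  select(mes, anio, tbl) %>%
  unnest(tbl)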

tidy() function can't process last_fit() objects

Functions like last_fit() from the tune package produce last_fit objects, which are large nested lists containing the fit results. I tried to transform them into data.frames using the tidy() function from the broom package, but this resulted in the following error:
MRE:
library(tidymodels)
library(tidyverse)

data <- mtcars
split <- initial_split(data)  # `split` was used but not defined in the original MRE

model_default <-
  parsnip::boost_tree(mode = "regression") %>%
  set_engine('xgboost', objective = 'reg:squarederror')

wf <- workflow() %>%
  add_model(model_default) %>%
  add_recipe(recipe(mpg ~ ., data))

lf <- last_fit(wf, split)
tidy_lf <- tidy(lf)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
6: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
7: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
Question: How can I use tidy() with a last_fit() output?
The object that last_fit() creates is a tibble (containing metrics, predictions, etc.), not a model that can be tidied. You can use extract_workflow() to extract the fitted workflow from the object created by last_fit(), and that object can be tidied:
library(tidymodels)

car_split <- initial_split(mtcars)

wf <- workflow() %>%
  add_model(linear_reg()) %>%
  add_recipe(recipe(mpg ~ ., mtcars))

lf <- last_fit(wf, car_split)
lf
#> # Resampling results
#> # Manual resampling
#> # A tibble: 1 × 6
#> splits id .metrics .notes .predictions .workflow
#> <list> <chr> <list> <list> <list> <list>
#> 1 <split [24/8]> train/test split <tibble> <tibble> <tibble [8 × 4]> <workflow>
lf %>%
  extract_workflow() %>%
  tidy()
#> # A tibble: 11 × 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -33.6 36.0 -0.935 0.367
#> 2 cyl -0.0296 1.34 -0.0221 0.983
#> 3 disp 0.0252 0.0269 0.934 0.367
#> 4 hp -0.00539 0.0319 -0.169 0.868
#> 5 drat -0.167 2.54 -0.0659 0.948
#> 6 wt -5.69 2.79 -2.04 0.0623
#> 7 qsec 3.32 1.76 1.89 0.0820
#> 8 vs -4.40 3.80 -1.16 0.268
#> 9 am 2.54 2.67 0.950 0.360
#> 10 gear 2.69 2.28 1.18 0.259
#> 11 carb -0.0486 1.11 -0.0439 0.966
Created on 2022-03-23 by the reprex package (v2.0.1)
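If you need the underlying model object rather than broom output (e.g. the xgb.Booster from the question's xgboost workflow, for which broom has no tidier), tune also ships extractors for that. A brief sketch reusing lf from above:
# the parsnip model wrapper, and the raw engine fit inside it
fit_parsnip <- extract_fit_parsnip(lf)
fit_engine  <- extract_fit_engine(lf)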

R: Missing values when scraping a webpage

When scraping data from a webpage, some elements/values are not returned.
Specifically, I use the rvest package to scrape.
The webpage that contains the information I want is https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/ - however, when I scrape the data, the columns with prices only return "$-".
Sample code:
library(rvest)
webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
tbls_ls <- webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls) - 2)] %>%
  html_table()
Output of the table list (first data frame shown):
List of 22
 $ :'data.frame': 7 obs. of 6 variables:
  ..$ Instance                                     : chr [1:7] "B1L" "B1S" "B2S" "B1MS" ...
  ..$ Cores                                        : int [1:7] 1 1 2 1 2 4 8
  ..$ RAM                                          : chr [1:7] "0.50 GiB" "1.00 GiB" "4.00 GiB" "2.00 GiB" ...
  ..$ Temporary Storage                            : chr [1:7] "1 GiB" "2 GiB" "8 GiB" "4 GiB" ...
  ..$ Price                                        : chr [1:7] "$-" "$-" "$-" "$-" ...
  ..$ Prices with Azure Hybrid Benefit1 (% savings): chr [1:7] "$-" "$-" "$-" "$-" ...
What can I do to get the whole value of these specific elements?
The page ships a single set of price data in each price cell's data-amount attribute, irrespective of the filter. So you need to take that attribute's value and parse the JSON.
library(rvest)

webpage <- read_html("https://azure.microsoft.com/en-us/pricing/details/virtual-machines/windows/")
tbls <- html_nodes(webpage, "table")
webpage %>%
  html_nodes("table") %>%
  .[1:(length(tbls) - 2)] %>%
  html_table()

# each price cell carries the full regional price JSON in its data-amount attribute
ss <- webpage %>% html_nodes("table span.price-data") %>% html_attr('data-amount')
lapply(ss, function(x) { data.frame(jsonlite::fromJSON(x)) })
Sample output:
[[176]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 1.496 1.496 1.376 1.376
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 1.488 1.464 1.448 1.373 1.504
regional.us.west regional.us.west.2
1 1.376 1.248
[[177]]
regional.asia.pacific.southeast regional.australia.east regional.canada.central regional.canada.east
1 4.464 4.464 4.224 4.224
regional.europe.west regional.japan.east regional.united.kingdom.south regional.us.east.2 regional.usgov.virginia
1 4.448 4.4 4.368 4.365 4.48
regional.us.west regional.us.west.2
1 4.224 3.968
You then need to match each entry to its table row and take the price for the region you are interested in.
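If you want all of them in one table, one option is to bind the parsed one-row data frames with dplyr::bind_rows(), which fills regions missing from a given entry with NA. A small sketch, assuming ss as defined above:
library(dplyr)
prices <- ss %>%
  lapply(function(x) data.frame(jsonlite::fromJSON(x))) %>%
  bind_rows()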

Transform list cell in data frame into rows

I'm sorry there is no code to replicate; I can only provide a picture. See it below, please.
A data frame with Facebook insights data prepared from JSON contains a column "values" with list values. For the next manipulation I need to have only one value per row in that column. So row 3 in the picture should be transformed into two rows (with the list content or the value directly):
post_story_adds_by_action_type_unique lifetime list(like = 38)
post_story_adds_by_action_type_unique lifetime list(share = 11)
If there are 3 or more values in a list cell, it should produce 3 or more single-value rows.
Do you know how to do it?
I use this code to get the JSON and the data frame:
library(jsonlite)
i <- fromJSON(post.request.url)
i <- as.data.frame(i$insights$data)
Edit:
There will be no deeper nesting, just this one level.
The list is not needed in the result; I just need the values and their names.
Let's assume you're starting with something that looks like this:
mydf <- data.frame(a = c("A", "B", "C", "D"), period = "lifetime")
mydf$values <- list(list(value = 42), list(value = 5),
                    list(value = list(like = 38, share = 11)),
                    list(value = list(like = 38, share = 13)))
str(mydf)
## 'data.frame': 4 obs. of 3 variables:
## $ a : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
## $ period: Factor w/ 1 level "lifetime": 1 1 1 1
## $ values:List of 4
## ..$ :List of 1
## .. ..$ value: num 42
## ..$ :List of 1
## .. ..$ value: num 5
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 11
## ..$ :List of 1
## .. ..$ value:List of 2
## .. .. ..$ like : num 38
## .. .. ..$ share: num 13
## NULL
Instead of retaining lists in your output, I would suggest flattening the data, perhaps using a function like this:
library(data.table)

myFun <- function(indt, col) {
  if (!is.data.table(indt)) indt <- as.data.table(indt)
  other_names <- setdiff(names(indt), col)
  list_col <- indt[[col]]
  # number of leaf values each row expands to
  rep_out <- sapply(list_col, function(x) length(unlist(x, use.names = FALSE)))
  flat <- {
    if (is.null(names(list_col))) names(list_col) <- seq_along(list_col)
    # split the names that unlist() builds (e.g. "3.value.like") into columns
    # and attach the leaf values as "val"
    setDT(tstrsplit(names(unlist(list_col)), ".", fixed = TRUE))[
      , val := unlist(list_col, use.names = FALSE)][]
  }
  # repeat each original row once per leaf value, drop the list column,
  # and bind the flattened names/values alongside
  cbind(indt[rep(1:nrow(indt), rep_out)][, (col) := NULL], flat)
}
Here's what it does with the "mydf" I shared:
myFun(mydf, "values")
## a period V1 V2 V3 val
## 1: A lifetime 1 value NA 42
## 2: B lifetime 2 value NA 5
## 3: C lifetime 3 value like 38
## 4: C lifetime 3 value share 11
## 5: D lifetime 4 value like 38
## 6: D lifetime 4 value share 13
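With current tidyr (unnest_longer() arrived in tidyr 1.0, after this answer was written), a similar flattening takes less code. A sketch on the same mydf; note that for the unnamed scalar cells the index column falls back to the element's position:
library(tidyr)
mydf %>%
  unnest_longer(values, indices_to = "field") %>%  # peel off the "value" wrapper
  unnest_longer(values, values_to = "val", indices_to = "name")
This yields one row per leaf value, with like/share in "name" and the numbers in "val".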