Unable to convert JSON to dataframe

I want to convert a JSON file into a data frame in R. With the following code:
link <- 'https://www.dropbox.com/s/ckfn1fpkcix1ccu/bevingenbag.json'
document <- fromJSON(file = link, method = 'C')
bev <- do.call("cbind", document)
I get this:
type features
1 FeatureCollection list(type = "Feature", geometry = list(type = "Point", coordinates = c(6.54800000288927, 52.9920000044505)), properties = list(gid = "1496600", yymmdd = "19861226", lat = "52.992", lon = "6.548", mag = "2.8", depth = "1.0", knmilocatie = "Assen", baglocatie = "Assen", tijd = "74751"))
which is the first row of a matrix. All the other rows have the same structure. I'm interested in the properties = list(gid = "1496600", yymmdd = "19861226", lat = "52.992", lon = "6.548", mag = "2.8", depth = "1.0", knmilocatie = "Assen", baglocatie = "Assen", tijd = "74751") part, which should be converted into a dataframe with the columns gid, yymmdd, lat, lon, mag, depth, knmilocatie, baglocatie, tijd.
I searched for and tried several solutions, but none of them worked. I used the rjson package for this. I also tried the RJSONIO and jsonlite packages, but was unable to extract the desired information.
Does anyone have an idea how to solve this problem?

Here's a way to obtain the data frame:
library(rjson)
document <- fromJSON(file = "bevingenbag.json", method = 'C')
dat <- do.call(rbind, lapply(document$features,
function(x) data.frame(x$properties)))
Edit: How to replace empty values with NA:
dat$baglocatie[dat$baglocatie == ""] <- NA
The result:
head(dat)
gid yymmdd lat lon mag depth knmilocatie baglocatie tijd
1 1496600 19861226 52.992 6.548 2.8 1.0 Assen Assen 74751
2 1496601 19871214 52.928 6.552 2.5 1.5 Hooghalen Hooghalen 204951
3 1496602 19891201 52.529 4.971 2.7 1.2 Purmerend Kwadijk 200914
4 1496603 19910215 52.771 6.914 2.2 3.0 Emmen Emmen 21116
5 1496604 19910425 52.952 6.575 2.6 3.0 Geelbroek Ekehaar 102631
6 1496605 19910808 52.965 6.573 2.7 3.0 Eleveld Assen 40114
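Since the original file may no longer be downloadable, here is a minimal sketch of the same idea on a hand-built list. It assumes the parsed JSON has the shape shown in the question (a `features` list whose elements each carry a `properties` list). The second feature's empty `baglocatie` is made up purely to illustrate the NA replacement.

```r
# Stand-in for the list that rjson::fromJSON is assumed to return
document <- list(
  type = "FeatureCollection",
  features = list(
    list(type = "Feature",
         properties = list(gid = "1496600", yymmdd = "19861226", baglocatie = "Assen")),
    list(type = "Feature",
         properties = list(gid = "1496601", yymmdd = "19871214", baglocatie = ""))  # made-up empty value
  )
)

# One single-row data frame per feature, stacked with rbind
dat <- do.call(rbind, lapply(document$features,
                             function(x) data.frame(x$properties, stringsAsFactors = FALSE)))

# Replace empty strings with NA
dat$baglocatie[dat$baglocatie == ""] <- NA
dat
```

Passing `stringsAsFactors = FALSE` keeps the columns as character vectors regardless of the R version.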

This is just another, quite similar, approach.
@SvenHohenstein's approach creates a data frame at each step, which is an expensive process. It is much faster to build vectors and convert the whole result to a data frame once at the end. Also, in R versions before 4.0, Sven's original code makes each column a factor, which might or might not be what you want. The approach below runs roughly 170 times faster by the median timings in the benchmark. This can be important if you intend to do this repeatedly. Finally, you will need to convert the columns lon, lat, mag, and depth to numeric.
library(microbenchmark)
library(rjson)
document <- fromJSON(file = "bevingenbag.json", method = 'C')
json2df.1 <- function(json){   # @SvenHohenstein's approach
  df <- do.call(rbind, lapply(json$features,
                function(x) data.frame(x$properties, stringsAsFactors = FALSE)))
  return(df)
}
json2df.2 <- function(json){
  # bind the property lists into one matrix, then convert to a data frame once
  df <- do.call(rbind, lapply(json[["features"]], function(x) c(x$properties)))
  df <- data.frame(apply(df, 2, as.character), stringsAsFactors = FALSE)
  return(df)
}
microbenchmark(x<-json2df.1(document), y<-json2df.2(document), times=10)
# Unit: milliseconds
# expr min lq median uq max neval
# x <- json2df.1(document) 2304.34378 2654.95927 2822.73224 2977.75666 3227.30996 10
# y <- json2df.2(document) 13.44385 15.27091 16.78201 18.53474 19.70797 10
identical(x,y)
# [1] TRUE
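The final step mentioned above, converting lon, lat, mag, and depth to numeric, could look like the following sketch. The two rows are copied from the output shown earlier. The name `num_cols` is just an illustrative helper, and `as.character()` is included so the conversion also behaves if the columns happen to be factors.

```r
# Two rows taken from the result above, stored as character like the parsed JSON
dat <- data.frame(
  gid   = c("1496600", "1496601"),
  lat   = c("52.992", "52.928"),
  lon   = c("6.548", "6.552"),
  mag   = c("2.8", "2.5"),
  depth = c("1.0", "1.5"),
  stringsAsFactors = FALSE
)

# Columns to convert (illustrative name)
num_cols <- c("lat", "lon", "mag", "depth")

# as.character() first, so factor columns yield their labels rather than level codes
dat[num_cols] <- lapply(dat[num_cols], function(col) as.numeric(as.character(col)))
str(dat)
```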

Related

How can I adjust my pcor model for confounders and do it for many models at one time?

I have a dataset with many columns. First column is the outcome (Test)(Dependent variable, y). Columns 2-32 are confounders. Finally, columns 33-54 are miRNAs (expression)(Independent variable, x).
I want to do a partial correlation (to obtain p-value and estimate) between each one of the independent variables with the dependent variable, adjusting by confounders. Since my variables don't follow a normal distribution, I want to use Spearman method.
I don't want to put all of them in the same model, I want different models, one by one. That is:
Model 1: Test vs miRNA1 by confounders
Model 2: Test vs miRNA2 by confounders
[...]
Model 21: Test vs miRNA21 by confounders
I tried with an auxiliary function. But it doesn't work. Any help? Thanks :)
The script is here:
#data
n <- 10000
nc <- 30
nm <- 20
y <- rnorm(n = n)
X <- matrix(rnorm(n = n*(nc+nm)), ncol = nc + nm)
df <- data.frame(y = y, X)
#variable names
confounders <- colnames(df)[2:31]
mirnas <- colnames(df)[32:51]
#auxiliar regression function
pcor_fun <- function(data, y_col, X_cols) {
formula <- as.formula(paste(y_col, X_cols))
pcor <- pcor.test(formula = formula, data = data, method = "spearman")
pcor_summary <- summary(pcor)$coef
return(pcor_summary)
}
#simple linear regressions
lm_list1 <- lapply(X = mirnas, FUN = pcor_fun, data = df, y_col = "y")
lm_list1[[1]]
#adjusting by confounders
lm_list2 <- lapply(X = mirnas, FUN = function(x) pcor_fun(data = df, y_col = "y", X_cols = c(confounders, x)))
lm_list2[[1]]
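One possible direction, sketched in base R rather than with pcor.test: a partial Spearman correlation can be computed by rank-transforming all variables, regressing the ranks of y and of one miRNA on the ranked confounders, and correlating the residuals. The helper name `partial_spearman` is hypothetical, the data are a smaller simulated version of the set-up above, and the t-test with n - 2 - k degrees of freedom is, to my understanding, the same approximation ppcor uses, so treat the p-values as a sketch rather than a validated method.

```r
# Hypothetical helper: partial Spearman correlation of y and x given confounders Z
partial_spearman <- function(y, x, Z) {
  Z <- as.matrix(Z)
  n <- length(y)
  k <- ncol(Z)
  # Pearson correlation on ranks is Spearman correlation
  ry <- rank(y)
  rx <- rank(x)
  rZ <- apply(Z, 2, rank)
  # Residuals after removing the (ranked) confounders
  ey <- resid(lm(ry ~ rZ))
  ex <- resid(lm(rx ~ rZ))
  est <- cor(ey, ex)
  # Approximate t-test with n - 2 - k degrees of freedom
  stat <- est * sqrt((n - 2 - k) / (1 - est^2))
  p <- 2 * pt(-abs(stat), df = n - 2 - k)
  c(estimate = est, p.value = p)
}

# Smaller simulated data with the same layout: y, confounders, miRNAs
set.seed(1)
n <- 200; nc <- 3; nm <- 4
df <- data.frame(y = rnorm(n), matrix(rnorm(n * (nc + nm)), ncol = nc + nm))
confounders <- colnames(df)[2:(1 + nc)]
mirnas <- colnames(df)[(2 + nc):(1 + nc + nm)]

# One model per miRNA, each adjusted for all confounders
res <- t(sapply(mirnas, function(m) partial_spearman(df$y, df[[m]], df[confounders])))
res
```

Each row of `res` corresponds to one "Test vs miRNA by confounders" model, with the estimate and p-value as columns.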

tbl_uvregression for lme4 objects

I am trying to get a univariate regression table using tbl_uvregression from gtsummary. I am fitting these regression models with lme4 and I am not sure where and how to specify the random effect. Here's an example using the trial data (which ships with gtsummary).
library(lme4)
#> Loading required package: Matrix
library(gtsummary)
library(survival)
data(trial)
trial %>%
tbl_uvregression(
method = glmer,
y = response,
method.args = list(family = binomial),
exponentiate = TRUE,
pvalue_fun = function(x) style_pvalue(x, digits = 2),
formula = "{y} ~ {x}+ {1|grade}"
)
#> Error: Problem with `mutate()` input `formula_chr`.
#> x object 'grade' not found
#> i Input `formula_chr` is `glue(formula)`.
Created on 2020-09-28 by the reprex package (v0.3.0)
Please help
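For the random effect (RE) in the model, do not wrap it in {}; use () instead. As the error message shows, the formula string is passed through glue(), which evaluates anything inside {} as R code, so {1|grade} fails with "object 'grade' not found".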
library(lme4)
#> Loading required package: Matrix
library(gtsummary)
library(survival)
data(trial)
trial %>%
tbl_uvregression(
method = glmer,
y = response,
method.args = list(family = binomial),
exponentiate = TRUE,
pvalue_fun = function(x) style_pvalue(x, digits = 2),
formula = "{y} ~ {x}+ (1|grade)"
)

R: Selecting certain from a JSON file

I've imported a JSON file into R from ( http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json ) and I'm trying to select only counties in Kansas.
Right now I have all the data into one variable and I'm trying to make subdata of this that is just counties of Kansas. I'm not sure how to go about this.
What you have there is GeoJSON, which can be read directly by library(sf) to give you an sf object, which is also a data.frame. Then you can use the usual data.frame subsetting operations:
library(sf)
sf <- sf::read_sf("http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json")
sf[sf$NAME == "Kansas", ]
# Simple feature collection with 1 feature and 5 fields
# geometry type: MULTIPOLYGON
# dimension: XY
# bbox: xmin: -102.0517 ymin: 36.99308 xmax: -94.58993 ymax: 40.00316
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# GEO_ID STATE NAME LSAD CENSUSAREA geometry
# 30 0400000US20 20 Kansas 81758.72 MULTIPOLYGON(((-99.541116 3...
And seeing as you want the individual counties, you need to use the counties data set:
sf_counties <- sf::read_sf("http://eric.clst.org/wupl/Stuff/gz_2010_us_050_00_500k.json")
sf_counties[sf_counties$STATE == 20, ]
To stay with a JSON workflow, you can try jqr:
library(jqr)
url <- 'http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json'
download.file(url, (f <- tempfile(fileext = ".json")))
res <- paste0(readLines(f), collapse = " ")
out <- jq(res, '.features[] | select(.properties.NAME == "Kansas")')
You can then map the result easily, like so:
library(leaflet)
leaflet() %>%
addTiles() %>%
addGeoJSON(out) %>%
setView(-98, 38, 6)
library(rjson)
lst=fromJSON(file = 'http://eric.clst.org/wupl/Stuff/gz_2010_us_040_00_20m.json')
index = which(sapply(lapply(lst$features,"[[",'properties'),'[[','NAME')=='Kansas')
subdata = lst$features[[index]]
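A variant of the same list-based filtering, sketched on a small hand-built list so that it runs without downloading anything. The list only mimics the shape rjson::fromJSON is assumed to return for this file, and the Nebraska entry is a stand-in for the other features. Subsetting with a logical vector and single brackets also copes with zero or several matches, whereas `[[index]]` expects exactly one.

```r
# Stand-in for the parsed GeoJSON (geometry omitted for brevity)
lst <- list(
  type = "FeatureCollection",
  features = list(
    list(type = "Feature", properties = list(NAME = "Kansas",   STATE = "20")),
    list(type = "Feature", properties = list(NAME = "Nebraska", STATE = "31"))
  )
)

# TRUE for every feature whose NAME property is "Kansas"
is_kansas <- vapply(lst$features,
                    function(f) identical(f$properties$NAME, "Kansas"),
                    logical(1))

# Single-bracket subsetting keeps a list of matching features
subdata <- lst$features[is_kansas]
length(subdata)
```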

Convert JSON into CSV in R programming

I have JSON of the form:
{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}
I need to convert above format JSON to CSV file in R.
CSV format :
ds y
123 45600
378 78689
343 23456
I'm using R library rjson to do so. I'm doing something like this:
jsonFile <- fromJSON(file=fileName)
json_data_frame <- as.data.frame(jsonFile)
but it doesn't produce the format I need.
You can use jsonlite::fromJSON to read the data into a list, though you'll need to pull it apart to assemble it into a data.frame:
abc <- jsonlite::fromJSON('{"abc":
{
"123":[45600],
"378":[78689],
"343":[23456]
}
}')
abc <- data.frame(ds = names(abc[[1]]),
y = unlist(abc[[1]]), stringsAsFactors = FALSE)
abc
#> ds y
#> 123 123 45600
#> 378 378 78689
#> 343 343 23456
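To finish the conversion to CSV, the assembled data frame can be handed to write.csv. The sketch below builds the nested list by hand as a stand-in for what jsonlite::fromJSON is assumed to return for this input, and writes to a temporary file. The name `csv_path` is only an illustrative choice.

```r
# Stand-in for the parsed JSON: a named list of single values under "abc"
parsed <- list(abc = list("123" = 45600, "378" = 78689, "343" = 23456))

# Keys become the ds column, values the y column
out <- data.frame(ds = names(parsed[[1]]),
                  y  = unlist(parsed[[1]], use.names = FALSE),
                  stringsAsFactors = FALSE)

# Write without row names so the file has only the ds and y columns
csv_path <- tempfile(fileext = ".csv")   # illustrative output location
write.csv(out, csv_path, row.names = FALSE)
readLines(csv_path)
```

Using `use.names = FALSE` drops the row names that would otherwise repeat the keys, and `row.names = FALSE` keeps them out of the file.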
I believe you picked the right JSON reader, the fromJSON function:
df <- data.frame( do.call(rbind, rjson::fromJSON( '{"a":true, "b":false, "c":null}' )) )
The code below reads my Google Location History (JSON) archive from https://takeout.google.com. This applies if you have enabled a 'Timeline' (location tracking) in Google Maps on your phone. Credit to http://rpubs.com/jsmanij/131030 for the original code. Note that JSON files like this can be quite large, and in my experience plyr::llply was much more efficient than lapply at parsing the list. data.table gives me the more efficient 'rbindlist' to turn the list into a data.table. Google logs between 350 and 800 GPS calls each day for me! A multi-year location history is converted to quite a sizeable list by 'fromJSON':
format(object.size(doc1),units="MB")
[1] "962.5 Mb"
I found 'do.call(rbind..)' un-optimized. The timestamp, lat, and long needed some work to be useful to Google Earth Pro, but I am getting carried away. At the end, I use 'write.csv' to write the data.table to CSV. That is all the OP wanted here.
ts lat long latitude longitude
1: 1416680531900 487716717 -1224893214 48.77167 -122.4893
2: 1416680591911 487716757 -1224892938 48.77168 -122.4893
3: 1416680668812 487716933 -1224893231 48.77169 -122.4893
4: 1416680728947 487716468 -1224893275 48.77165 -122.4893
5: 1416680791884 487716554 -1224893232 48.77166 -122.4893
library(data.table)
library(rjson)
library(plyr)
doc1 <- fromJSON(file="LocationHistory.json", method="C")
object.size(doc1)
timestamp <- function(x) {as.list(x$timestampMs)}
timestamps <- as.list(plyr::llply(doc1$locations,timestamp))
timestamps <- rbindlist(timestamps)
latitude <- function(x) {as.list(x$latitudeE7)}
latitudes <- as.list(plyr::llply(doc1$locations,latitude))
latitudes <- rbindlist(latitudes)
longitude <- function(x) {as.list(x$longitudeE7)}
longitudes <- as.list(plyr::llply(doc1$locations,longitude))
longitudes <- rbindlist(longitudes)
datageoms <- setnames(cbind(timestamps,latitudes,longitudes),c("ts","lat","long")) [order(ts)]
write.csv(datageoms,"datageoms.csv",row.names=FALSE)

R - MLR - Classifier Calibration - Benchmark Results

I've run a benchmark experiment with nested cross validation (tuning + performance measurement) for a classification problem and would like to create calibration charts.
If I pass a benchmark result object to generateCalibrationData, what does plotCalibration do? Is it averaging? If so how?
Does it make sense to have an aggregate = FALSE option to understand variability across folds as per generateThreshVsPerfData for ROC curves?
In response to @Zach's request for a reproducible example, I (the OP) edit my original post as follows:
Edit: Reproducible Example
# Practice Data
library("mlr")
library("ROCR")
library(mlbench)
data(BreastCancer)
dim(BreastCancer)
levels(BreastCancer$Class)
head(BreastCancer)
BreastCancer <- BreastCancer[, -c(1, 6, 7)]
BreastCancer$Cl.thickness <- as.factor(unclass(BreastCancer$Cl.thickness))
BreastCancer$Cell.size <- as.factor(unclass(BreastCancer$Cell.size))
BreastCancer$Cell.shape <- as.factor(unclass(BreastCancer$Cell.shape))
BreastCancer$Marg.adhesion <- as.factor(unclass(BreastCancer$Marg.adhesion))
head(BreastCancer)
# Define Nested Cross-Validation Strategy
cv.inner <- makeResampleDesc("CV", iters = 2, stratify = TRUE)
cv.outer <- makeResampleDesc("CV", iters = 6, stratify = TRUE)
# Define Performance Measures
perf.measures <- list(auc, mmce)
# Create Task
bc.task <- makeClassifTask(id = "bc",
data = BreastCancer,
target = "Class",
positive = "malignant")
# Create Tuned KSVM Learner
ksvm <- makeLearner("classif.ksvm",
predict.type = "prob")
ksvm.ps <- makeParamSet(makeDiscreteParam("C", values = 2^(-2:2)),
makeDiscreteParam("sigma", values = 2^(-2:2)))
ksvm.ctrl <- makeTuneControlGrid()
ksvm.lrn = makeTuneWrapper(ksvm,
resampling = cv.inner,
measures = perf.measures,
par.set = ksvm.ps,
control = ksvm.ctrl,
show.info = FALSE)
# Create Tuned Random Forest Learner
rf <- makeLearner("classif.randomForest",
predict.type = "prob",
fix.factors.prediction = TRUE)
rf.ps <- makeParamSet(makeDiscreteParam("mtry", values = c(2, 3, 5)))
rf.ctrl <- makeTuneControlGrid()
rf.lrn = makeTuneWrapper(rf,
resampling = cv.inner,
measures = perf.measures,
par.set = rf.ps,
control = rf.ctrl,
show.info = FALSE)
# Run Cross-Validation Experiments
bc.lrns = list(ksvm.lrn, rf.lrn)
bc.bmr <- benchmark(learners = bc.lrns,
tasks = bc.task,
resampling = cv.outer,
measures = perf.measures,
show.info = FALSE)
# Calibration Charts
bc.cal <- generateCalibrationData(bc.bmr)
plotCalibration(bc.cal)
Produces the following:
[Figure: Aggregated Calibration Plot]
Attempting to un-aggregate leads to:
> bc.cal <- generateCalibrationData(bc.bmr, aggregate = FALSE)
Error in generateCalibrationData(bc.bmr, aggregate = FALSE) :
unused argument (aggregate = FALSE)
>
> sessionInfo()
R version 3.2.3 (2015-12-10)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mlbench_2.1-1 ROCR_1.0-7 gplots_3.0.1 mlr_2.9
[5] stringi_1.1.1 ParamHelpers_1.10 ggplot2_2.1.0 BBmisc_1.10
loaded via a namespace (and not attached):
[1] digest_0.6.9 htmltools_0.3.5 R6_2.2.0 splines_3.2.3
[5] scales_0.4.0 assertthat_0.1 grid_3.2.3 stringr_1.0.0
[9] bitops_1.0-6 checkmate_1.8.2 gdata_2.17.0 survival_2.38-3
[13] munsell_0.4.3 tibble_1.2 randomForest_4.6-12 httpuv_1.3.3
[17] parallelMap_1.3 mime_0.5 DBI_0.5-1 labeling_0.3
[21] chron_2.3-47 shiny_1.0.0 KernSmooth_2.23-15 plyr_1.8.4
[25] data.table_1.9.6 magrittr_1.5 reshape2_1.4.1 kernlab_0.9-25
[29] ggvis_0.4.3 caTools_1.17.1 gtable_0.2.0 colorspace_1.2-6
[33] tools_3.2.3 parallel_3.2.3 dplyr_0.5.0 xtable_1.8-2
[37] gtools_3.5.0 backports_1.0.4 Rcpp_0.12.4
No, plotCalibration doesn't do any averaging, though it can plot a smooth.
If you call generateCalibrationData on a benchmark result object, it treats each iteration of your resampled predictions as exchangeable and computes the calibration across all resampled predictions for each bin.
Yes, it probably would make sense to have an option to generate an unaggregated calibration data object and be able to plot it. You are welcome to open an issue on GitHub to that effect, but this is going to be low on my priority list, to be honest.
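To make the difference concrete, here is a small base R sketch of pooled versus per-fold calibration on simulated predictions. It is not mlr's implementation: the data, the column names (`iter`, `prob`, `truth`, `bin`), and the five equal-width bins are all assumptions chosen for illustration.

```r
set.seed(42)
n <- 600

# Simulated resampled predictions: outer fold id and predicted P(positive)
pred <- data.frame(
  iter = rep(1:6, each = n / 6),
  prob = runif(n)
)
# Simulated labels, well calibrated by construction
pred$truth <- rbinom(n, 1, pred$prob)

# Bin the predicted probabilities into five equal-width bins
pred$bin <- cut(pred$prob, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)

# Pooled: all folds treated as exchangeable, one observed rate per bin
pooled <- tapply(pred$truth, pred$bin, mean)

# Per fold: one observed rate per (fold, bin) cell, showing fold-to-fold variability
per_fold <- tapply(pred$truth, list(pred$iter, pred$bin), mean)

pooled
per_fold
```

The pooled vector corresponds to the aggregated view described above, while the rows of `per_fold` are what an unaggregated option would expose.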