I've got some data in JSON format that I want to do some visualization on. The data (approximately 10MB of JSON) loads pretty fast, but reshaping it into a usable form takes a couple of minutes for just under 100,000 rows. I have something that works, but I think it can be done much better.
It may be easiest to understand by starting with my sample data.
Assuming you run the following command in /tmp:
curl http://public.west.spy.net/so/time-series.json.gz \
| gzip -dc - > time-series.json
You should be able to see my desired output (after a while) here:
require(rjson)

trades <- fromJSON(file="/tmp/time-series.json")$rows

data <- do.call(rbind,
                lapply(trades,
                       function(row)
                         data.frame(date=strptime(unlist(row$key)[2], "%FT%X"),
                                    price=unlist(row$value)[1],
                                    volume=unlist(row$value)[2])))

someColors <- colorRampPalette(c("#000099", "blue", "orange", "red"),
                               space="Lab")
days <- seq(min(data$date), max(data$date), by='month')
smoothScatter(data, colramp=someColors, xaxt="n")
axis(1, at=days,
     labels=strftime(days, "%F"),
     tick=FALSE)
You can get a roughly 40x speedup over the original approach by using plyr. Here are the code and the benchmark comparison. The conversion to date can be done once you have the data frame, so I have removed it from all versions to keep the comparison apples-to-apples. I am sure an even faster solution exists.
f_ramnath = function(n) plyr::ldply(trades[1:n], unlist)[, -c(1, 2)]

f_dustin = function(n) do.call(rbind, lapply(trades[1:n],
  function(row) data.frame(
    date   = unlist(row$key)[2],
    price  = unlist(row$value)[1],
    volume = unlist(row$value)[2]))
)

f_mrflick = function(n) as.data.frame(do.call(rbind, lapply(trades[1:n],
  function(x) {
    list(date = x$key[2], price = x$value[1], volume = x$value[2]) })))

f_mbq = function(n) data.frame(
  t(sapply(trades[1:n], '[[', 'key')),
  t(sapply(trades[1:n], '[[', 'value')))

rbenchmark::benchmark(f_ramnath(100), f_dustin(100), f_mrflick(100), f_mbq(100),
                      replications = 50)
            test  elapsed    relative
  f_ramnath(100)    0.144    3.692308
   f_dustin(100)    6.244  160.102564
  f_mrflick(100)    0.039    1.000000
      f_mbq(100)    0.074    1.897436
EDIT: MrFlick's solution below gives an additional ~3.5x speedup over the plyr version. I have updated the benchmarks above accordingly.
I received another transformation from MrFlick on IRC that was significantly faster and is worth mentioning here:
data <- as.data.frame(do.call(rbind,
                              lapply(trades,
                                     function(x) {list(date=x$key[2],
                                                       price=x$value[1],
                                                       volume=x$value[2])})))
It is significantly faster largely because it does not build a data.frame for every row; it builds plain lists and converts to a data frame only once at the end.
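One caveat worth adding (my note, not part of MrFlick's snippet): after this rbind of lists, each column is a list column rather than an atomic vector, so you may want to flatten and parse afterwards. A minimal sketch, assuming the data built just above:

# Flatten the list columns and parse the timestamps once, vectorized.
data$price  <- as.numeric(data$price)
data$volume <- as.numeric(data$volume)
data$date   <- as.POSIXct(strptime(unlist(data$date), "%FT%X"))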
You are applying vectorized operations to single elements one at a time, which is very inefficient. Price and volume can be extracted for all rows at once like this:

t(sapply(trades, '[[', 'value'))
And the dates like this (the recycled logical index c(FALSE, TRUE) keeps every second element of the flattened keys, i.e. the timestamps):

strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X')
Now add some sugar, and the complete code looks like this:

data <- data.frame(
  strptime(sapply(trades, '[[', 'key')[c(FALSE, TRUE)], '%FT%X'),
  t(sapply(trades, '[[', 'value')))
names(data) <- c('date', 'price', 'volume')
On my notebook the whole set gets converted in about 0.7 s, while just the first 10k rows (10%) take around 8 s using the original algorithm.
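For reference, the same idea with type-checked extraction, using vapply instead of sapply. This is a sketch under the same assumptions about the structure of trades, not part of the original answer:

# vapply enforces the expected type and length of each extracted element.
data <- data.frame(
  date   = as.POSIXct(strptime(vapply(trades, function(x) x$key[[2]],
                                      character(1)), '%FT%X')),
  price  = vapply(trades, function(x) x$value[[1]], numeric(1)),
  volume = vapply(trades, function(x) x$value[[2]], numeric(1)))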
Is batching an option? Process, say, 1000 rows at a time, depending on how deeply nested your JSON is. Do you really need to transform all the data? I am not sure about R or exactly what you are dealing with, but I am thinking of a generic approach; a sketch of the idea follows below.
Also take a look at Jackson (http://jackson.codehaus.org/), a high-performance JSON processor for the JVM.
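To make the batching idea concrete in R (my sketch, not part of the original suggestion; it assumes the trades list from the question and an arbitrary chunk size of 1000):

# Build one data.frame per 1000-row chunk instead of one per row,
# then bind the much smaller number of pieces together at the end.
chunks <- split(trades, ceiling(seq_along(trades) / 1000))
pieces <- lapply(chunks, function(chunk)
  data.frame(date   = sapply(chunk, function(x) unlist(x$key)[2]),
             price  = sapply(chunk, function(x) unlist(x$value)[1]),
             volume = sapply(chunk, function(x) unlist(x$value)[2])))
data <- do.call(rbind, pieces)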
Related
I am trying to visualize my longitudinal data with graphs using ggplot in a for loop.
for (i in colnames(dat_longC)[c(3, 5:10, 14, 17:19, 30:39)]) {
  print(ggplot(dat_longC, aes(x = exam, y = i, group = zz_nr)) + geom_point() +
          geom_line() + xlab("Examination") + ylab(i))
}
When I use the ggplot command in a for loop, I only get a single line extending between examination times. If I use the same command on a single variable outside the loop, it works and gives me trajectory graphs. What do you think the problem could be?
Your problem is that i is a character string holding the column name, so aes(y = i) treats it as a literal constant and ggplot does not know what you are actually trying to plot. Unfortunately, a bare string does not work as a variable name in aes() for ggplot2; you will likely need !!sym(i) to turn the string into a symbol. I can't really test without your data, but here is some example code to help guide you.
library(tidyverse)
map(colnames(mtcars)[2:4],
    \(x) ggplot(mtcars, aes(!!sym(x), mpg)) +
      geom_point() +
      ggtitle(x))
#> (returns a list of three ggplot objects; each prints as a separate plot)
For the original loop, the same thing can also be written with aes_string() and quoted variable names (note that aes_string() is deprecated in recent ggplot2 releases):

for (i in colnames(dat_longC)[c(3, 5:10, 14, 17:19, 30:39)]) {
  print(ggplot(dat_longC, aes_string(x = "exam", y = i, group = "zz_nr")) + geom_point() +
          geom_line() + xlab("Examination") + ylab(i))
}
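A non-deprecated alternative for the same loop (my sketch, using the question's assumed data and column selection) is the .data pronoun:

# .data[[i]] looks the string i up as a column of dat_longC.
for (i in colnames(dat_longC)[c(3, 5:10, 14, 17:19, 30:39)]) {
  print(ggplot(dat_longC, aes(x = exam, y = .data[[i]], group = zz_nr)) +
          geom_point() + geom_line() +
          xlab("Examination") + ylab(i))
}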
Thank you very much for your reply! I used aes_string() and added quotes to the variable names, and it worked.
I am using a data set called sleep (found here: https://drive.google.com/file/d/15ZnsWtzbPpUBQN9qr-KZCnyX-0CYJHL5/view) to run a three-way within-subject ANOVA comparing Performance based on Stimulation, Deprivation, and Time. I have successfully done this before using anova_test from rstatix. I want to look at the sphericity output, but it doesn't appear. I have gotten it to appear with other three-way within-subject datasets, so I'm not sure why this is happening. Here is my code:
anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
I also tried saving the result to an object and using get_anova_table, but that didn't look any different.
sleep_aov <- anova_test(data = sleep, dv = Performance, wid = Subject, within = c(Stimulation, Deprivation, Time))
get_anova_table(sleep_aov, correction = "GG")
This is an ideal dataset I pulled from the internet, so I'm starting to think the data had a W of 1 (perfect sphericity) and so rstatix is skipping this output. Is this something anova_test does?
Here, for comparison, is my code using a dataset that does return Mauchly's test:
weight_loss_long <- pivot_longer(data = weightloss, cols = c(t1, t2, t3),
                                 names_to = "time", values_to = "loss")
weight_loss_long$time <- factor(weight_loss_long$time)
anova_test(data = weight_loss_long, dv = loss, wid = id,
           within = c(diet, exercises, time))
Not an expert at all, but it might be because your factors have only two levels.
From anova_summary() help:
"Value
return an object of class anova_test: a data frame containing the ANOVA table for independent-measures ANOVA. However, for repeated/mixed-measures ANOVA, a list containing the following components is returned:
ANOVA: a data frame containing ANOVA results
Mauchly's Test for Sphericity: If any within-Ss variables with more than 2 levels are present, a data frame containing the results of Mauchly's test for Sphericity. Only reported for effects that have more than 2 levels because sphericity necessarily holds for effects with only 2 levels.
Sphericity Corrections: If any within-Ss variables are present, a data frame containing the Greenhouse-Geisser and Huynh-Feldt epsilon values, and corresponding corrected p-values. "
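So with only two-level within-subject factors you would expect just the ANOVA component. As a sketch (assuming the weight_loss_long example above, where time has three levels), the extra tables can be pulled out of the returned list by name:

res <- anova_test(data = weight_loss_long, dv = loss, wid = id,
                  within = c(diet, exercises, time))
res$ANOVA                               # always present
res$`Mauchly's Test for Sphericity`     # only for effects with more than 2 levels
res$`Sphericity Corrections`            # GG and HF epsilons and corrected p-values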
I want to get a descriptive table in HTML format for all variables in a data frame. For continuous variables I need the mean and standard deviation; for categorical variables, the frequency (absolute count) and the percentage of each category. I also need the count of missing values to be included.
Let's use this data:
data("ToothGrowth")
df<-ToothGrowth
df$len[2]<-NA
df$supp[5]<-NA
I want to get a table in HTML format that looks like this:
----------------------------------------------------------------------
 Variables        N (missing)        Mean (SD) / %
----------------------------------------------------------------------
 len              59 (1)             18.9 (7.65)
 supp
   OJ             30                 50%
   VC             29                 48.33%
   NA             1                  1.67%
 dose             60                 1.17 (0.629)
I also need to set the number of digits shown after the decimal point. If you know a better way to display this information in HTML, please provide your solution.
Here's a programmatic way to create separate summary tables for the numeric and factor columns. Note that this doesn't flag the NAs in the table as you requested, but it does ignore NAs when calculating the summary stats, as you did. It's a starting point, anyway; from here you could combine the tables and format the headers however you want.
If you knit this code within an R Markdown document with HTML output, kable will automatically generate the HTML table and CSS will format the table nicely with horizontal rules. Note that there's also a booktabs option to kable that makes prettier tables, like the LaTeX booktabs package. Otherwise, see the documentation for knitr::kable for options.
library(dplyr)
library(tidyr)
library(knitr)

data("ToothGrowth")
df <- ToothGrowth
df$len[2] <- NA
df$supp[5] <- NA

numeric_cols <- dplyr::select_if(df, is.numeric) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable) %>%
  summarize(count = n(),
            mean = mean(value, na.rm = TRUE),
            sd = sd(value, na.rm = TRUE))

factor_cols <- dplyr::select_if(df, is.factor) %>%
  gather(key = "variable", value = "value") %>%
  group_by(variable, value) %>%
  summarize(count = n()) %>%
  mutate(p = count / sum(count, na.rm = TRUE))

knitr::kable(numeric_cols)
knitr::kable(factor_cols)
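Regarding the decimal-digits requirement: kable accepts a digits argument, so a minimal sketch (2 decimal places is an arbitrary choice) would be:

# Round all numeric columns of the summary to 2 decimal places.
knitr::kable(numeric_cols, digits = 2)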
I found the R package table1, which does what I want. Here is the code:
library(table1)

data("ToothGrowth")
df <- ToothGrowth
df$len[2] <- NA
df$supp[5] <- NA

# reformulate(colnames(df)) builds the one-sided formula ~ len + supp + dose
table1(reformulate(colnames(df)), data = df)
I have worked through all the tutorials and searched for "load csv tensorflow" but just can't get the logic of it all. I'm not a total beginner, but I don't have much time to complete this, and I've suddenly been thrown into TensorFlow, which is unexpectedly difficult.
Let me lay it out:
A very simple CSV file of 184 columns, all float numbers. A row is simply today's price, three buy signals, and the previous 180 days' prices:
close = tf.placeholder(tf.float32, name='close')
signals = tf.placeholder(tf.bool, shape=[3], name='signals')
previous = tf.placeholder(tf.float32, shape=[180], name='previous')
This article covers loading pretty well: https://www.tensorflow.org/guide/datasets. It even has a section on converting to NumPy arrays, which is what I need to train and test the net. However, as the author says in the article leading to this page, it is pretty complex. It seems like everything is geared toward doing data manipulation, whereas we have already normalized our data (nothing has really changed in AI since 1983 in terms of inputs, outputs, and layers).
Here is a way to load it, but it does not get the data into NumPy, and there is no example of simply reading the data without manipulating it:
import csv

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    with open('/BTC1.csv') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        line_count = 0
        for row in csv_reader:
            ?????????
            line_count += 1
I need to know how to get the CSV file into the

close = tf.placeholder(tf.float32, name='close')
signals = tf.placeholder(tf.bool, shape=[3], name='signals')
previous = tf.placeholder(tf.float32, shape=[180], name='previous')
so that I can follow the tutorials to train and test the net.
Your question is not entirely clear to me. If I understand correctly (tell me if I'm wrong), you are asking how to feed data into your model. There are several ways to do so:
Use placeholders with feed_dict during the session. This is the most basic and easiest way, but it often suffers from training performance issues. For further explanation, check this post.
Use queues. Hard to implement and badly documented; I don't suggest it, because it has been superseded by the third method.
Use the tf.data API.
...
So, to answer your question with the first method:
import csv
import numpy as np
import tensorflow as tf

# get your array outside the session
with open('/BTC1.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    dataset = np.asarray([data for data in csv_reader], dtype=np.float32)
    close_col = dataset[:, 0]        # today's price
    signal_cols = dataset[:, 1:4]    # the three buy signals
    previous_cols = dataset[:, 4:]   # the previous 180 days' prices

# let's say you load 100 rows each time for training
batch_size = 100

# define placeholders as you did
...

with tf.Session() as sess:
    ...
    for i in range(number_iter):
        start = i * batch_size
        end = (i + 1) * batch_size
        sess.run(train_operation, feed_dict={close: close_col[start:end],
                                             signals: signal_cols[start:end],
                                             previous: previous_cols[start:end]})
And with the third method:
# retrieve your columns as before
...

# let's say you load 100 rows each time for training
batch_size = 100

# construct your input pipeline
batch = tf.data.Dataset.from_tensor_slices((close_col, signal_cols, previous_cols))
# shuffle the data --> assemble batches, ready to inject into the model
batch = batch.shuffle(close_col.shape[0]).batch(batch_size)
iterator = batch.make_initializable_iterator()
iter_init_operation = iterator.initializer
# get-next-batch operations, advanced automatically at each run within the session
c_it, s_it, p_it = iterator.get_next()

# replace the close, signals, previous placeholders in your model
# by c_it, s_it, p_it when you define the model
...

with tf.Session() as sess:
    # you need to initialize the variables and the iterator
    sess.run([tf.global_variables_initializer(), iter_init_operation])
    ...
    for i in range(number_iter):
        sess.run(train_operation)
Good luck!
Working with the Mapquest directions API to plot thousands of routes using ggplot2 in R.
Basic code theory: I have a list of end locations and a single start location. For each end location, a call to fromJSON returns routing coordinates from Mapquest. From there, I have already vectorized the assignment of the coordinates (read as lists of lists) to the geom_path geom of ggplot2.
Right now, running this on a location set of ~1200 records takes ~4 minutes. I would love to get that down. Any thoughts on how to vectorize the call to fromJSON (which returns a list of lists)?
Windows 7, 64-bit, R 2.14.2
libraries: plyr, ggplot2, rjson, mapproj, XML
k = 0
start_loc = "263+NORTH+CENTER+ST.,+MESA+ARIZ."
end_loc = funder_trunc[, length(funder_trunc)]
route_urls = paste(mapquest_baseurl, "&from=", start_loc, "&to=", end_loc,
                   "&ambiguities=ignore", sep="")

for (n in route_urls) {
  route_legs = fromJSON(file = url(n))$route$legs[[1]]$maneuvers
  lats = unlist(lapply(route_legs, function(x) return(x$startPoint[[2]])))
  lngs = unlist(lapply(route_legs, function(x) return(x$startPoint[[1]])))
  frame = data.frame(cbind(lngs, lats))
  path_added = geom_path(aes(lngs, lats), data = frame)
  p = p + path_added
  k = k + 1
  print(paste("Processed ", k, " of ", nrow(funder_trunc), " records in set.", sep=""))
}
Going out on a limb here, since I don't use rjson or mapproj, but it seems like calling the server thousands of times is the real culprit. If the Mapquest server doesn't have an API that allows you to send multiple requests in one go, you are in trouble. If it does, then you need to find out how to use or modify rjson and/or mapproj to call it.
As @Chase said, you might be able to call it in parallel, but the server won't like getting too many parallel requests from the same client; it might ban you. By the way, it might not even like getting thousands of serial requests in rapid succession from the same client either, but apparently your current code works, so I guess it doesn't mind.
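Independent of the network cost, adding ~1200 geom_path layers one at a time is also slow on the plotting side. As a sketch (not a tested fix; it reuses the question's own variable names, and the id grouping column is my addition), you could collect all routes into one data frame and draw them with a single layer:

# Accumulate all route coordinates, then add one grouped layer to the plot.
all_routes <- do.call(rbind, lapply(seq_along(route_urls), function(k) {
  legs <- fromJSON(file = url(route_urls[k]))$route$legs[[1]]$maneuvers
  data.frame(id   = k,  # hypothetical grouping column, one value per route
             lngs = sapply(legs, function(x) x$startPoint[[1]]),
             lats = sapply(legs, function(x) x$startPoint[[2]]))
}))
p <- p + geom_path(aes(lngs, lats, group = id), data = all_routes)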