Dynamic, General Indexing in R

Is there a way to index across many lists, data frames, etc., in R? That is to say, generally? For example, you could retrieve a list of the second element of the second element of lists a & b via c(a[[2]][[2]],b[[2]][[2]]), but how can you do this without writing the names of each list and the respective indexing brackets?
Input:
l1 <- as.list(c(1,2,3,4,5))
l2 <- as.list(c(6,7,8,9,10))
a <- list(l1,l2)
l4 <- as.list(c(1,2,3,4,5))
l5 <- as.list(c(6,7,8,9,10))
b <- list(l4,l5)
Desired output:
[1] 7 7
I know that you could create a list of only upper nested lists - assuming the same naming convention - with this:
nol <- objects()
nol <- grep("^[a-z]$", nol, value=TRUE)
I just don't know how to apply across this list.

You can do this via vapply as follows:
vapply(mget(nol), function(x) x[[2]][[2]], FUN.VALUE = double(1), USE.NAMES = FALSE)
The idea here is that mget gives you a list of your objects; you could also create it with list(a, b). The anonymous function function(x) x[[2]][[2]] then returns the value you want from each one.
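If the index path changes from case to case, the same idea can be written with a single path vector, since [[ indexes recursively into a list when given a vector. A minimal sketch, where pluck_path is a hypothetical helper name and the lists mirror the ones from the question:

```r
l1 <- as.list(c(1, 2, 3, 4, 5))
l2 <- as.list(c(6, 7, 8, 9, 10))
a <- list(l1, l2)
b <- list(l1, l2)

# Hypothetical helper: follow the index path `path` into every object in `objs`
pluck_path <- function(objs, path) {
  vapply(objs, function(x) x[[path]], FUN.VALUE = double(1), USE.NAMES = FALSE)
}

res <- pluck_path(list(a, b), c(2, 2))
res
# [1] 7 7
```

Passing mget(nol) instead of list(a, b) would apply the same path to every object named in nol.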

Related

Dealing with Duplicates in a Combined Dataframe

First of all, a big shout-out and a big thank you to everyone who helped answer my questions. You guys are amazing.
I need your help once again with coding in R.
The situation involves two data frames: Dataframe1 describes a Portuguese class and Dataframe2 describes a Math class. I want to find the duplicates (there are some, since one student takes both classes) and, rather than deleting them, update the column "Class" to indicate that the student is in both classes, something like "Math+Portuguese".
I tried to simplify my data frames (in reality they are much bigger, but the final approach should be the same) by creating two new ones. There is one duplicate (the student whose parents are both doctors). I just want him to appear once in the data frame, with "Math+Portuguese" in the column "Class".
For the identification of the duplicates, the column "Grades" has to be ignored.
Thank you very much for your help.
All the best,
Alexander
# Creation of Dataset 1 (Portuguese students)
school <- c(rep("S1",7),rep("S2",3))
Age <- c(18,18,19,19,20,20,21,21,22,22)
professionf <- c(rep("teacher",9),rep("doctor",1))
professionm <- c(rep("police",9),rep("doctor",1))
Class <- rep("Portuguese",10)
Grade <- round(runif(10,1,5),0)
DataframeP <- cbind(school, Age, professionf,professionm,Grade,Class)
View(DataframeP)
#Creation of Dataset 2 (Math students)
school <- c(rep("S1",7),rep("S2",3))
Age <- c(18,18,19,19,20,20,21,21,22,22)
professionf <- c(rep("lawyer",9),rep("doctor",1))
professionm <- c(rep("police",9),rep("doctor",1))
Class <- rep("Math",10)
Grade <- round(runif(10,1,5),0)
DataframeM <- cbind(school, Age, professionf,professionm,Grade,Class)
View(DataframeM)
#Combination of the two Dataframes, where the identification of the duplicates should take place
DF_All <- rbind(DataframeM,DataframeP)
View(DF_All)
That should do it, dear Alexander!
library(data.table)
require(dplyr)
df_merged <- merge(x = DataframeP, y = DataframeM, by = c("school", "Age", "professionf", "professionm"), all = TRUE)
df_merged <- within(df_merged, Class.x[Class.x == 'Portuguese' & Class.y == 'Math'] <- 'Portuguese + Math')
df_merged$Class.x = coalesce(df_merged$Class.x, df_merged$Class.y)
df_merged$Grade.x = coalesce(df_merged$Grade.x, df_merged$Grade.y)
df_merged <- df_merged[1:(length(df_merged)-2)]
setnames(df_merged, old = c('Grade.x','Class.x'), new = c('Grade','Class'))
df_merged
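For comparison, here is a base-R variant of the same merge-then-relabel idea. It is only a sketch under assumptions: the two small data frames are invented stand-ins for DataframeP and DataframeM, and the names port, math, keys and result are placeholders.

```r
# Invented stand-ins for the Portuguese and Math tables
port <- data.frame(school = c("S1", "S2"), Age = c(18, 22),
                   professionf = c("teacher", "doctor"),
                   professionm = c("police", "doctor"),
                   Grade = c(3, 4), Class = "Portuguese",
                   stringsAsFactors = FALSE)
math <- data.frame(school = c("S1", "S2"), Age = c(18, 22),
                   professionf = c("lawyer", "doctor"),
                   professionm = c("police", "doctor"),
                   Grade = c(2, 5), Class = "Math",
                   stringsAsFactors = FALSE)

# Columns that identify a student; Grade is deliberately left out
keys <- c("school", "Age", "professionf", "professionm")

# Full outer join: students in only one class get NA on the other side
m <- merge(port, math, by = keys, all = TRUE, suffixes = c(".p", ".m"))

in_both <- !is.na(m$Class.p) & !is.na(m$Class.m)
m$Class <- ifelse(in_both, "Math+Portuguese",
                  ifelse(is.na(m$Class.p), m$Class.m, m$Class.p))
# Keep the Portuguese grade when there is one, otherwise the Math grade
m$Grade <- ifelse(is.na(m$Grade.p), m$Grade.m, m$Grade.p)

result <- m[, c(keys, "Grade", "Class")]
result
```

Which grade to keep for a student in both classes is a choice the question leaves open; the sketch simply prefers the Portuguese one.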

Edit multiple JSON files in R and saving them to CSV (Alternative for "for" loop)

I have multiple JSON files containing Tweets from Twitter. I want to import and edit them in R one by one.
For a single file my code looks like this:
data <- fromJSON("filename.json")
data <- data[c(1:3,13,14)]
data$lang <- ifelse(data$lang!="de",NA,data$lang)
data <- na.omit(data)
write_as_csv(data,"filename.csv")
Now I want to apply this code to multiple files. I found a "for" loop code here:
Loop in R to read many files
Applied to my problem it should look something like this:
setwd("~/Documents/Elections")
ldf <- list()
listjson <- dir(pattern = "*.json")
for (k in 1:length(listjson)){
data[k] <- fromJSON(listjson[k])
data[k] <- data[k][c(1:3,13,14)]
data[k]$lang <- ifelse(data[k]$lang!="de",NA,data[k]$lang)
data[k] <- na.omit(data[k])
filename <- paste(k, ".csv")
write_as_csv(listjson[k],filename)
}
But the first line in the loop already doesn't work.
> data[k] <- fromJSON(listjson[k])
Warning message:
In `[<-.data.frame`(`*tmp*`, k, value = list(createdAt = c(1505935036000, :
provided 35 variables to replace 1 variables
I can't figure out why. Also, I wonder if there is a nicer way to realize this problem without using a for loop. I read about the apply family, I just don't know how to apply it to my problem. Thanks in advance!
This is an example how my data looks:
https://drive.google.com/file/d/19cRS6p_mHbO6XXprfvc6NPZWuf_zG7jr/view?usp=sharing
It should work like this:
setwd("~/Documents/Elections")
# pattern is a regular expression, so anchor it on the file extension
listjson <- dir(pattern = "\\.json$")
for (k in seq_along(listjson)) {
  # Load the JSON file that corresponds to the k-th element of your list of files
  data <- fromJSON(listjson[k])
  # Select the relevant columns from the data frame
  data <- data[, c(1:3, 13, 14)]
  # Keep only the German-language tweets
  data$lang <- ifelse(data$lang != "de", NA, data$lang)
  data <- na.omit(data)
  # paste0 avoids the space that paste() would insert before ".csv"
  filename <- paste0(listjson[k], ".csv")
  write_as_csv(data, filename)
}
For the second part of the question: apply applies a function over the rows or columns of a matrix or data frame. That is not your case, since you are looping over a character vector of filenames; lapply or vapply over listjson would be the closer fit.
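A loop-free version could look roughly like the following. This is a sketch, not a drop-in replacement: it assumes fromJSON comes from jsonlite and returns a flat data frame, convert_one is a hypothetical helper name, and base write.csv stands in for write_as_csv.

```r
library(jsonlite)  # assumption: the question's fromJSON() is jsonlite's

# Hypothetical helper: filter one tweet file and write a CSV next to it
convert_one <- function(path, keep = c(1:3, 13, 14)) {
  data <- fromJSON(path)
  data <- data[, keep]
  # Same filter as in the loop: non-German rows become NA and are dropped
  data$lang <- ifelse(data$lang != "de", NA, data$lang)
  data <- na.omit(data)
  out <- paste0(tools::file_path_sans_ext(path), ".csv")
  write.csv(data, out, row.names = FALSE)
  out
}

listjson <- list.files(pattern = "\\.json$")
# One call per file; the result is the vector of CSV paths that were written
csv_files <- vapply(listjson, convert_one, character(1))
```

Wrapping the per-file steps in a function also makes it easy to test them on a single file before running them on the whole directory.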

Is it possible, in R, to access the values of a list with a for loop on the names of the fields?

I have a big json file, containing 18 fields, some of which contain some other subfields. I read the file in R in the following way:
json_file <- "daily_profiles_Bnzai_20150914_20150915_20150914.json"
data <- fromJSON(sprintf("[%s]", paste(readLines(json_file), collapse=",")))
This gives me a giant list with all the fields contained in the json file. I want to make it into a data.frame and do some operations in the meantime. For example if I do:
doc_length <- data.frame(t(apply(as.data.frame(data$doc_lenght_map), 1, unlist)))
os <- data.frame(t(apply(as.data.frame(data$operating_system), 1, unlist)))
navigation <- as.data.frame(data$navigation)
monday <- data.frame(t(apply(navigation[,grep("Monday",names(data$navigation))],1,unlist)))
Monday <- data.frame(apply(monday, 1, sum))
works fine, I get what I want, with all the right subfields and then I want to join them in a final data.frame that I will use to do other operations.
Now, I'd like to do something like that on the subset of fields where I don't need to do operations. So, for example, the days of the week contained in navigation are not included. I'd like to have something like (suppose I have a data.frame df):
for(name in names(data))
{
df <- cbind(df, data.frame(t(apply(as.data.frame(data$name), 1, unlist))))
}
The above loop gives me errors. So what I want is a way to access all the fields of the list automatically, as in the loop, where the iterator "name" takes on each field of the list in turn, without having to call them individually, and then do some operations with those fields. I also tried
for(name in names(data))
{
df <- cbind(df, data.frame(t(apply(as.data.frame(data[name]), 1, unlist))))
}
but it doesn't take all of the subfields. I also tried with
data[, name]
but it doesn't work either. So I think I need to use the "$" operator.
Is it possible to do something like that?
Thank you a lot!
Davide
Like the other commenters, I am confused, but I will throw this out to see if it might point you in the right direction.
# make mtcars a list as an example
data <- lapply(mtcars,identity)
do.call(
  cbind,
  lapply(
    names(data),
    function(name) {
      data.frame(data[name])
    }
  )
)
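One likely reason the loops in the question fail: data$name looks for a field literally called "name", while data[[name]] uses the value stored in the variable. A small sketch with a made-up list (the field names alpha and beta are placeholders):

```r
data <- list(alpha = 1:3, beta = c("x", "y", "z"))

name <- "alpha"
data$name      # NULL, because there is no field called "name"
data[[name]]   # the contents of data$alpha

# Build one data frame from all fields, looking each one up by its name
cols <- lapply(names(data), function(name) data[[name]])
names(cols) <- names(data)
df <- data.frame(cols, stringsAsFactors = FALSE)
```

The same lookup works inside a for loop over names(data), so the "$" operator is not needed there.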

Parallel programming in R

I have a file that consists of multiple JSON objects. I need to read through these files and extract certain fields from the JSON objects. To complicate things, some of the objects do not contain all the fields. I am dealing with a large file of over 200,000 JSON objects. I would like to split job across multiple cores. I have tried to experiment with doSNOW, foreach, and parallel and really do not understand how to do this. The following is my code that I would like to make more efficient.
foreach (i in 2:length(linn)) %dopar% {
  json_data <- fromJSON(linn[i])
  if(names(json_data)[1]=="info")
    next
  mLocation <- ifelse('location' %!in% names(json_data$actor),'NULL',json_data$actor$location$displayName)
  mRetweetCount <- ifelse('retweetCount' %!in% names(json_data),0,json_data$retweetCount)
  mGeo <- ifelse('geo' %!in% names(json_data),c(-0,-0),json_data$geo$coordinates)
  tweet <- rbind(tweet,
    data.frame(
      record.no = i,
      id = json_data$id,
      objecttype = json_data$actor$objectType,
      postedtime = json_data$actor$postedTime,
      location = mLocation,
      displayname = json_data$generator$displayName,
      link = json_data$generator$link,
      body = json_data$body,
      retweetcount = mRetweetCount,
      geo = mGeo)
  )
}
Rather than trying to parallelize an iteration, I think you're better off trying to vectorize (hmm, actually most of the below is still iterating...). For instance here we get all our records (no speed gain yet, though see below...)
json_data <- lapply(linn, fromJSON)
For location we pre-allocate a vector of NAs to represent records for which there is no location, then find records that do have a location (maybe there's a better way of doing this...) and update them
mLocation <- rep(NA, length(json_data))
idx <- sapply(json_data, function(x) "location" %in% names(x$actor))
mLocation[idx] <- sapply(json_data[idx], function(x) x$actor$location$displayName)
Finally, create the 200,000-row data frame in a single call. Your 'copy and append' pattern makes a copy of the first row, then the first and second rows, then the first three, and so on, so it copies on the order of N-squared rows in total, in addition to recreating factors and other data.frame-specific expenses; this is likely where you spend most of your time.
data.frame(i=seq_along(json_data), location=mLocation)
The idea would be to accumulate all the columns, and then do just one call to data.frame(). I think you could cheat on parsing line-at-a-time by pasting everything into a single string representing a JSON array, and parsing it in one call
json_data <- fromJSON(sprintf("[%s]", paste(linn, collapse=",")))
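Putting the pieces together, the "accumulate columns, then one data.frame() call" pattern might look like this. It is a sketch under assumptions: the three hand-written records stand in for the parsed lines of the file, and only a few of the question's fields are shown.

```r
# Hand-written records standing in for lapply(linn, fromJSON)
json_data <- list(
  list(id = "a", actor = list(location = list(displayName = "Oslo")),
       retweetCount = 3),
  list(id = "b", actor = list()),
  list(id = "c", actor = list(location = list(displayName = "Lima")))
)

# Each column is extracted as a whole vector; absent fields become NA or 0
location <- vapply(json_data, function(x) {
  loc <- x$actor$location$displayName
  if (is.null(loc)) NA_character_ else loc
}, character(1))

retweets <- vapply(json_data, function(x) {
  n <- x$retweetCount
  if (is.null(n)) 0 else n
}, numeric(1))

ids <- vapply(json_data, function(x) x$id, character(1))

# A single data.frame() call instead of rbind() inside the loop
tweets <- data.frame(record.no = seq_along(json_data), id = ids,
                     location = location, retweetcount = retweets,
                     stringsAsFactors = FALSE)
```

Because every column is built independently, the extraction steps could later be handed to parallel workers if parsing turns out to be the real bottleneck.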

Using \Sexpr{} in LaTeX tabular environment

I am trying to use \Sexpr{} to include values from my R objects in a LaTeX table. I am essentially trying to replicate the summary output of an lm object in R, because xtable's built-in methods xtable.lm and xtable.summary.lm don't seem to include the F-statistic, adjusted R-squared, etc. (all the stuff at the bottom of the summary printout of the lm object in the R console). So I tried to accomplish this by building a matrix that replicates the xtable.summary.lm output and then constructing a data frame of the relevant extra information, so that I can refer to the values using \Sexpr{}. I did this by using add.to.row to append a \multicolumn{} command that merges all columns of the last row of the LaTeX table, and then passing all the information I need into that cell of the table.
The problem is that I get an "Undefined control sequence" for the \Sexpr{} expression in the \multicolumn{} expression. Are these two not compatible? If so, what am I doing wrong and if not does anyone know how to do what I am trying to do?
Thanks,
Here is the relevant part of my code:
<<Test, results=tex>>=
model1 <- lm(stndfnl ~ atndrte + frosh + soph)
# Build matrix to replicate xtable.summary.lm output
x <- summary(model1)
colnames <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
rownames <- c("(Intercept)", attr(x$terms, "term.labels"))
fpval <- pf(x$fstatistic[1],x$fstatistic[2], x$fstatistic[3], lower.tail=FALSE)
mat1 <- matrix(coef(x), nrow=length(rownames), ncol=length(colnames), dimnames=list(rownames,colnames))
# Make a data frame for extra information to be called by \Sexpr in last row of table
residse <- x$sigma
degf <- x$df[2]
multr2 <- x$r.squared
adjr2 <- x$adj.r.squared
fstat <- x$fstatistic[1]
fstatdf1 <- x$fstatistic[2]
fstatdf2 <- x$fstatistic[3]
extradat <- data.frame(v1 = round(residse,4), v2 =degf, v3=round(multr2,4), v4=round(adjr2,4),v5=round(fstat,3), v6=fstatdf1, v7=fstatdf2, v8=round(fpval,6))
addtorow<- list()
addtorow$pos <-list()
addtorow$pos[[1]] <- dim(mat1)[1]
addtorow$command <-c('\\hline \\multicolumn{5}{l}{Residual standard error:\\Sexpr{extradat$v1}} \\\\ ')
print(xtable(mat1, caption="Summary Results for Regression in Equation \\eqref{model1} ", label="tab:model1"), add.to.row=addtorow, sanitize.text.function=NULL, caption.placement="top")
You don't need to have \Sexpr in your R code; the R code can use the expressions directly. \Sexpr is not a LaTeX command, even though it looks like one; it's a Sweave command, so it doesn't work when it is produced as output from R code.
Try
addtorow$command <-paste('\\hline \\multicolumn{5}{l}{Residual standard error:',
extradat$v1, '} \\\\ ')
Also, no need to completely recreate the matrix used by xtable, you can just build on the default output. Building on what you have above, something like:
mytab <- xtable(model1, caption="Summary Results", label="tab:model1")
addtorow$pos[[1]] <- dim(mytab)[1]
print(mytab, add.to.row=addtorow, sanitize.text.function=NULL,
caption.placement="top")
See http://people.su.se/~lundh/reproduce/sweaveintro.pdf for an example which you might be able to use as is.
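The other summary lines can go into the table the same way, by formatting each value straight into the LaTeX string. A sketch only: it uses the built-in mtcars data in place of the question's variables, and footer is a hypothetical name whose value would be assigned to addtorow$command.

```r
model <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(model)

# Doubled backslashes in R source become single backslashes in the string
footer <- sprintf(
  "\\hline \\multicolumn{5}{l}{Residual standard error: %.4f on %d degrees of freedom} \\\\ ",
  s$sigma, s$df[2]
)
cat(footer, "\n")
```

Since the values are inserted by R before xtable prints the table, no \Sexpr is involved at all.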