R dataframe with list of dictionaries as field - json

I have a data frame with a column called identifiers which contains product identifiers data as a string which is a list of dictionaries.
test_data <- data.frame(
identifiers = c(
"[{\"type\":\"ISBN\",\"value\":\"9781231027073\"}]",
"[{\"type\":\"EAN\",\"value\":\"5055266202847\"},{\"type\":\"EAN\",\"value\":\"4053162095984\"}]"),
id = c(1,2), stringsAsFactors = FALSE)
> test_data
identifiers id
1 [{"type":"ISBN","value":"9781231027073"}] 1
2 [{"type":"EAN","value":"5055266202847"},{"type":"EAN","value":"4053162095984"}] 2
What I would like to achieve is:
output_test_data <- data.frame(
type = c("ISBN", "EAN", "EAN"),
value = c("9781231027073","5055266202847","4053162095984"),
id = c(1,2,2), stringsAsFactors = FALSE)
> output_test_data
type value id
1 ISBN 9781231027073 1
2 EAN 5055266202847 2
3 EAN 4053162095984 2
The closest I got to a solution is applying the fromJSON function from jsonlite:
jsonlite::fromJSON(test_data$identifiers[1])
or with a loop like this:
for (i in test_data$identifiers) {
print(jsonlite::fromJSON(i))
}
However I am struggling to:
1) get it applied to all rows.
2) preserve the information about id, from original data into the results.
Could anyone help with this?

You could do this:
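# Iterate over row indices; apply() would coerce every row to character,
# so id would no longer be numeric in the result
df_result <- lapply(seq_len(nrow(test_data)), function(i) {
  df_out <- jsonlite::fromJSON(test_data$identifiers[i])
  df_out$id <- test_data$id[i]
  df_out
})
# Stack the per-row data frames into one
df_result <- do.call("rbind", df_result)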

Related

Is there an R-function to extract specific string variables

I have a df in R where one of the columns is a long HTML string that contains a lot of arguments. I want to extract specific values into new columns:
This is what I have:
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )
df <- data.frame(name, bio)
This is what I want:
name <- c("John", "Max")
status <- c(1,1)
profession <- c("Revisor", "Tech")
df <- data.frame(name, status, profession)
It is possible using regular expressions with the R package stringr. Here is an example:
library(stringr)
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>", "<status>1</status><born>19.06.1995</born><profession>Tech</professio>" )
# str_extract returns a character vector (str_extract_all would return a list)
status <- stringr::str_extract(bio, pattern = "<status>\\d</status>")
status <- stringr::str_replace(status, pattern = "(<status>)(\\d)(</status>)", "\\2")
profession <- stringr::str_extract(bio, pattern = "<profession>[:alpha:]*</professio>")
profession <- stringr::str_replace(profession, pattern = "(<profession>)([:alpha:]*)(</professio>)", "\\2")
df <- data.frame(name, status, profession)
This can be done easily with extract:
library(tidyr)
df %>%
extract(bio,
into = c("status", "profession"),
regex = "<status>(\\d+)</status>.*<profession>(\\w+)</professio>")
name status profession
1 John 1 Revisor
2 Max 1 Tech
The regex part describes the strings from which elements should be extracted in full, while capturing the elements of interest in capture groups defined by ().
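The same capture-group idea can be sketched in base R, assuming no extra packages are available. regexec() and regmatches() are base functions; the names pattern and m are only illustrative.

```r
name <- c("John", "Max")
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")
df <- data.frame(name, bio, stringsAsFactors = FALSE)

# Same pattern as above: one capture group per field of interest
pattern <- "<status>(\\d+)</status>.*<profession>(\\w+)</professio>"

# regexec() locates the full match and each group; regmatches() extracts the text
m <- regmatches(df$bio, regexec(pattern, df$bio))

# Element 1 of each result is the full match, elements 2 and 3 are the groups
df$status <- sapply(m, `[`, 2)
df$profession <- sapply(m, `[`, 3)
df$bio <- NULL
df
```

Unlike extract, this leaves status as a character column; as.numeric() can convert it if needed.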
Alternatively, you can use str_extract:
library(stringr)
df$status <- str_extract(bio, pattern = "(?<=<status>)\\d+(?=</status>)")
df$profession <- str_extract(bio, pattern = "(?<=<profession>)\\w+(?=</professio>)")
Here we are making use of lookarounds to conditionally match, for example:
(?<=<status>): positive lookbehind to assert that the match must be preceded by the literal string <status>
\\d+: one or more digits
(?=</status>): positive lookahead to assert that the match must be followed by the literal string </status>
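For comparison, here is a minimal base R sketch of the same lookarounds, assuming stringr is not available. It relies on perl = TRUE, since lookarounds are not supported by R's default regex engine.

```r
bio <- c("<status>1</status><profession>Revisor</professio>",
         "<status>1</status><born>19.06.1995</born><profession>Tech</professio>")

# Lookbehind and lookahead keep the surrounding tags out of the match itself
status <- regmatches(bio, regexpr("(?<=<status>)\\d+(?=</status>)", bio, perl = TRUE))
profession <- regmatches(bio, regexpr("(?<=<profession>)\\w+(?=</professio>)", bio, perl = TRUE))

status
profession
```

Note that regmatches() silently drops elements without a match, so the result can be shorter than the input, whereas str_extract returns NA for those elements.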

Loop for regression over multiple factors

I am struggling to get a loop to run several regressions and store the coefficients and intercepts. I have data similar to this:
data <- data.frame(y = rnorm(10), x1 = rnorm(10)*2, ID = c(rep(1,10), rep(2,10)), group = c(rep(3,5), rep(4,5)))
Where ID and group are factors, therefore:
data$ID <- as.factor(data$ID)
data$group <- as.factor(data$group)
So far I tried 2 approaches.
First I did the following:
for (i in unique(data$ID)){
for (j in unique(data$group)){
fit <- glm(y ~ x1, data=data[data$ID == i & data$group == j, ])
}
}
Afterwards I did the following:
myfun <- function(data) {
step(glm(y ~ x1, data = data), trace=0)
}
fcomb <- unique(data[,c("ID","group")])
mod <- list()
for(i in 1:nrow(fcomb)) {
mod <- c(mod,list(myfun(subset(data,ID==fcomb$ID[i] & group==fcomb$group[i]))))
}
In the end I would like to have a dataset in which for each ID and group I would have the intercept and the beta for the effect of x1 in y.
When I tried the second strategy I got something, but the betas and intercepts are the same for every model (which should not be possible), and I still don't know how to store the values.
set.seed(1839)
data <- data.frame(
y = rnorm(10),
x1 = rnorm(10) * 2,
ID = c(rep(1, 10), rep(2, 10)),
group = c(rep(3, 5), rep(4, 5))
)
grid <- expand.grid(ID = unique(data$ID), group = unique(data$group))
results <- lapply(1:nrow(grid), function(x) {
lm(y ~ x1, data[data$ID == grid[x, 1] & data$group == grid[x, 2], ])$coef
})
results <- t(do.call(cbind, results))
results <- cbind(grid, results)
results
Returns:
ID group (Intercept) x1
1 1 3 -0.454072247 1.0295731
2 2 3 -0.454072247 1.0295731
3 1 4 0.007800405 -0.1832663
4 2 4 0.007800405 -0.1832663
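A possible explanation for the identical rows for ID 1 and ID 2, not stated in the answer itself: y, x1 and group each have 10 values while ID has 20, so data.frame() recycles the shorter columns. The sketch below checks this on the example data; the names n_rows, same_y and same_x1 are only illustrative.

```r
set.seed(1839)
data <- data.frame(
  y = rnorm(10),
  x1 = rnorm(10) * 2,
  ID = c(rep(1, 10), rep(2, 10)),
  group = c(rep(3, 5), rep(4, 5))
)

# ID has 20 values, so the 10-value columns are recycled to 20 rows
n_rows <- nrow(data)

# Rows 11-20 (ID 2) repeat the y and x1 values of rows 1-10 (ID 1)
same_y <- identical(data$y[1:10], data$y[11:20])
same_x1 <- identical(data$x1[1:10], data$x1[11:20])

c(n_rows = n_rows, same_y = same_y, same_x1 = same_x1)
```

If that is the cause, drawing 20 values for y and x1 (for example rnorm(20)) would give each ID its own data and therefore different coefficients.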

Convert list to string of all dataframe rows

I have the following dataframe:
json = '[
{"id":"1","list":["A","B"]},
{"id":"2","list":["C","D"]}
]'
df <- jsonlite::fromJSON(json)
df
Output:
id list
1 1 c("A", "B")
2 2 c("C", "D")
Now, I want the list to be a string like this:
id list
1 1 "A, B"
2 2 "C, D"
So, I've tried the following but nothing changes:
df$list <- paste(df$list, sep = ", ")
I've also tried the following, but it concatenates the two lists in every row:
df$list <- toString(df$list)
# Output
id list
1 1 c("A", "B"), c("C", "D")
2 2 c("A", "B"), c("C", "D")
Is there a way to change every row separately?
Another solution would be to import the JSON arrays directly to a given format, is this possible?
Thanks!
We need to loop over the column and apply toString to each element:
df$list <- sapply(df$list, toString)
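A self-contained sketch of the same idea, assuming the list column is built by hand instead of with fromJSON. It uses vapply, a stricter variant of sapply that guarantees one string per row.

```r
# Rebuild the example without jsonlite: a data frame with a list column
df <- data.frame(id = c("1", "2"), stringsAsFactors = FALSE)
df$list <- list(c("A", "B"), c("C", "D"))

# toString() is called once per row, so each vector is collapsed separately
df$list <- vapply(df$list, toString, character(1))
df
```

Calling toString(df$list) directly fails because it receives the whole column at once, whereas the element-wise loop hands it one vector at a time.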

Assign each aggregate value to separate variable in R and display it in HTML

I am using the following R script to calculate a monthly CpK number:
mydf <- read.csv('file.csv', header = TRUE, sep=",")
date <- strptime(mydf$PDATETIME, "%Y/%m/%d %H:%M:%S")
plot(date,mydf$MEAS_AVG,xlab='Date',ylab='MEAS_AVG',main='year')
abline(h=mydf$TARG_MIN,col=3,lty=1)
abline(h=mydf$TARG_MAX,col=3,lty=1)
grid(NULL,NULL,col="black")
legend("topright", legend = c(" ", " "), text.width = strwidth("1,000,000"), lty = 1:2, xjust = 1, yjust = 1, title = "Data")
myavg <-mean(mydf$MEAS_AVG, na.rm=TRUE)
newds <- (mydf$MEAS_AVG - myavg)^2
newsum <- sum(newds, na.rm=TRUE)
N <- length(mydf$MEAS_AVG) - 1
newN <- 1/N
total <- newN*newsum
sigma <- total^(1/2)
USL <- mean(mydf$TARG_MAX, na.rm=TRUE)
LSL <- mean(mydf$TARG_MIN, na.rm=TRUE)
cpk <- min(((USL-myavg)/(3*sigma)),((myavg-LSL)/(3*sigma)))
cpkmonthly <- aggregate(mydf$MEAS_AVG, na.rm=TRUE, list(month=months(as.Date(mydf$PDATETIME))), mean)
monthlycpk <- by(mydf$MEAS_AVG, na.rm=TRUE, list(month=months(as.Date(mydf$PDATETIME))), mean)
cpk # variable storing the entire year's CpK number
cpkmonthly # variable storing each month's mean CpK number
So far, the above script runs correctly and assigns values to the cpkmonthly and cpk variables. Their outputs are as follows:
> cpk
[1] 0.5892231
> cpkmonthly
month x
1 April 0.2456467
2 August 0.2415564
3 July 0.2456895
4 June 0.2541071
5 March 0.1234333
6 May 0.4321418
Question: How do I break apart the aggregated "cpkmonthly" variable and assign a separate variable to each entry? Ideally, I would like each to go into an array, because I would like the final output variable to be an HTML display string.
Pseudocode:
cpkmonth[1] = April
cpkvalue[1] = .245...
cpkmonth[2] = August
cpkvalue[2] = .2415...
...
I would like the final output variable to hold an HTML table in this format:
<tr><td>"Total Cpk"</td><td>cpkmonth[0]</td><td>cpkmonth[1]</td><td>...</td></tr>
<tr><td>"cpk"</td><td>cpkvalue[0]</td><td>cpkvalue[1]</td><td>...</td></tr>
For the HTML, I have tried using toJSON/RJSON,R2HTML,HTMLUtil, and a few others, but I am simply looking for one output variable. Is this possible?
You should be able to access both of these columns using the $ syntax:
cpkmonth <- cpkmonthly$month
cpkvalue <- cpkmonthly$x  # aggregate() names the value column x, as in the output above
You can also use [[ to extract a column as a vector:
cpkmonth <- cpkmonthly[["month"]]
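Once the columns are vectors, the two table rows can be assembled into a single string with paste0. This is a minimal sketch, assuming a two-month stand-in for cpkmonthly and four decimal places for the values; the names month_cells, value_cells and html are only illustrative.

```r
# Stand-in for the aggregate() result shown in the question
cpkmonthly <- data.frame(
  month = c("April", "August"),
  x = c(0.2456467, 0.2415564),
  stringsAsFactors = FALSE
)

# Wrap every entry in <td>...</td> and join the cells into one string per row
month_cells <- paste0("<td>", cpkmonthly$month, "</td>", collapse = "")
value_cells <- paste0("<td>", sprintf("%.4f", cpkmonthly$x), "</td>", collapse = "")

# One output variable holding both table rows
html <- paste0(
  "<tr><td>Total Cpk</td>", month_cells, "</tr>\n",
  "<tr><td>cpk</td>", value_cells, "</tr>"
)
cat(html, "\n")
```

Because paste0 is vectorised, the same code works unchanged for any number of months.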

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z){
a <- subset(x, (x$var1 == y))
b <- subset(a, (a$var2 == z))
n=nrow(b)
return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar. Instead of the above examples, I would like to submit df to foobar and get the same results as above (2, 1, 1)
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments like foobar(foo, df) and use y[, c(var1)] and y[, c(var2)] instead of the two parameters y and z, it still doesn't work. How can I do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
match_df <- function(x, match) {
  vars <- names(match)
  # Build one comparable key per row from the shared columns
  x_key <- do.call(paste, c(x[vars], sep = "\r"))
  match_key <- do.call(paste, c(match[vars], sep = "\r"))
  # Return the first row of x that matches each row of match
  x[match(match_key, x_key, nomatch = 0), ]
}
match_df(foo, df)
#   var1 var2
# 1    1    A
# 3    1    B
# 6    3    C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at trying to figure out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that c() is not needed here: you can reference a column by placing its name in quotes OR by its number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has a length of three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your data frame to grab a single value, process it, and then go to the next row. That is what the apply function does. It is roughly equivalent to writing for (i in seq_len(nrow(df))) but is a more idiomatic way of handling such issues.
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat these as numeric, in my opinion. Compare:
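The explicit loop that apply stands in for can be sketched as follows, using the foo, df and foobar definitions from the question. The counts vector is an illustrative name, and stringsAsFactors = FALSE is an assumption made to keep the comparisons on plain strings.

```r
foo <- data.frame(var1 = c(1, 1, 1, 2, 2, 3),
                  var2 = c("A", "A", "B", "B", "C", "C"),
                  stringsAsFactors = FALSE)
df <- data.frame(var1 = c(1, 1, 3), var2 = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

foobar <- function(x, y, z) {
  a <- subset(x, x$var1 == y)
  b <- subset(a, a$var2 == z)
  nrow(b)
}

# One call to foobar per row of df, each time with a single var1/var2 pair
counts <- integer(nrow(df))
for (i in seq_len(nrow(df))) {
  counts[i] <- foobar(foo, df$var1[i], df$var2[i])
}
counts
```

Indexing the columns directly also keeps var1 numeric, whereas apply first converts each row to character.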
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary - apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
.dofun <- function(y, z){
a <- subset(x, x$var1==y)
b <- subset(a, a$var2==z)
n <- nrow(b)
return (n)
}
ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
names(ans) <- NULL
return(ans)
}