Convert list to string of all dataframe rows - json

I have the following dataframe:
library(jsonlite)
json <- '[
  {"id":"1","list":["A","B"]},
  {"id":"2","list":["C","D"]}
]'
df <- fromJSON(json)
df
Output:
id list
1 1 c("A", "B")
2 2 c("C", "D")
Now, I want the list to be a string like this:
id list
1 1 "A, B"
2 2 "C, D"
So, I've tried the following but nothing changes:
df$list <- paste(df$list, sep = ", ")
I've also tried the following, but it concatenates both lists and puts the result in every row:
df$list <- toString(df$list)
# Output
id list
1 1 c("A", "B"), c("C", "D")
2 2 c("A", "B"), c("C", "D")
Is there a way to change every row separately?
Another solution would be to import the JSON arrays directly to a given format, is this possible?
Thanks!

You need to loop over the list column and apply toString to each element. paste(df$list, sep = ", ") does nothing here because sep only separates multiple arguments passed to paste, and toString(df$list) flattens the whole column at once:
df$list <- sapply(df$list, toString)
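As for importing the JSON arrays directly in the desired format: as far as I know, fromJSON() has no option that collapses arrays into comma-separated strings, so converting right after parsing is the closest one-step approach. A minimal sketch, assuming jsonlite (vapply is just a type-safe variant of sapply):
library(jsonlite)
json <- '[
  {"id":"1","list":["A","B"]},
  {"id":"2","list":["C","D"]}
]'
df <- fromJSON(json)
# collapse each list element into one comma-separated string
df$list <- vapply(df$list, toString, character(1))
df
#   id list
# 1  1 A, B
# 2  2 C, D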


Grouping CSV file by ID and extracting JSON column

I currently have a CSV like this:
A B C
1 10 {"a":"one","b":"two","c":"three"}
1 10 {"a":"four","b":"five","c":"six"}
1 10 {"a":"seven","b":"eight","c":"nine"}
1 10 {"a":"ten","b":"eleven","c":"twelve"}
2 10 {"a":"thirteen","b":"fourteen","c":"fifteen"}
2 10 {"a":"sixteen","b":"seventeen","c":"eighteen"}
2 10 {"a":"nineteen","b":"twenty","c":"twenty-one"}
3 10 {"a":"twenty-two","b":"twenty-three","c":"twenty-four"}
3 10 {"a":"twenty-five","b":"twenty-six","c":"twenty-seven"}
3 10 {"a":"twenty-eight","b":"twenty-nine","c":"thirty"}
3 10 {"a":"thirty-one","b":"thirty-two","c":"thirty-three"}
I want to group by column A, ignore column B, and take only the "b" field in C, and get an output like:
A C
1 ['two','five','eight','eleven']
2 ['fourteen','seventeen','twenty']
3 ['twenty-three','twenty-six','twenty-nine','thirty-two']
Can I do this? I have pandas if that will be useful! Also, I would like the output file to be tab-delimited.
Try this:
import pandas as pd
import json
# read the file, which looks exactly as given above (whitespace-separated)
df = pd.read_csv("file.csv", sep=r"\s+")
# drop the 'B' column
del df['B']
# 'C' starts life as a string: parse the JSON and keep only the 'b' field
df['C'] = df['C'].map(lambda x: json.loads(x)['b'])
# 'C' now holds just the 'b' values; group them together per 'A'
df = df.groupby('A').C.apply(list)
print(df)
# write the result tab-delimited, as requested ("output.tsv" is a placeholder name)
df.to_csv("output.tsv", sep="\t")
This returns:
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...
IIUC, assuming the C column has already been parsed into dictionaries with json.loads:
df.groupby('A').C.apply(lambda x: [y['b'] for y in x])
A
1 [two, five, eight, eleven]
2 [fourteen, seventeen, twenty]
3 [twenty-three, twenty-six, twenty-nine, thirty...
Name: C, dtype: object

R dataframe with list of dictionaries as field

I have a data frame with a column called identifiers, which contains product identifier data as a string representing a list of dictionaries.
test_data <- data.frame(
  identifiers = c(
    "[{\"type\":\"ISBN\",\"value\":\"9781231027073\"}]",
    "[{\"type\":\"EAN\",\"value\":\"5055266202847\"},{\"type\":\"EAN\",\"value\":\"4053162095984\"}]"),
  id = c(1, 2), stringsAsFactors = FALSE)
> test_data
identifiers id
1 [{"type":"ISBN","value":"9781231027073"}] 1
2 [{"type":"EAN","value":"5055266202847"},{"type":"EAN","value":"4053162095984"}] 2
What I would like to achieve is:
output_test_data <- data.frame(
  type = c("ISBN", "EAN", "EAN"),
  value = c("9781231027073", "5055266202847", "4053162095984"),
  id = c(1, 2, 2), stringsAsFactors = FALSE)
> output_test_data
type value id
1 ISBN 9781231027073 1
2 EAN 5055266202847 2
3 EAN 4053162095984 2
The closest I got to a solution is to apply the fromJSON function from jsonlite:
jsonlite::fromJSON(test_data$identifiers[1])
or with a loop like this:
for (i in test_data$identifiers) {
  print(jsonlite::fromJSON(i))
}
However, I am struggling to: 1) apply it to all rows, and 2) preserve the id information from the original data in the results.
Could anyone help with this?
You could do this:
df_result <- apply(test_data, 1, function(x) {
  id_tmp <- x[2]
  df_out <- jsonlite::fromJSON(x[1])
  df_out$id <- id_tmp
  return(df_out)
})
df_result <- do.call("rbind", df_result)
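One caveat: apply coerces each row to a character vector, so id comes back as character in df_result. A sketch that preserves the original column types, using only base R, jsonlite, and the test_data defined above:
library(jsonlite)
# parse each identifiers string, attach the matching id, then stack the pieces
parsed <- lapply(seq_len(nrow(test_data)), function(i) {
  df_out <- fromJSON(test_data$identifiers[i])
  df_out$id <- test_data$id[i]
  df_out
})
df_result <- do.call(rbind, parsed)
df_result
#   type         value id
# 1 ISBN 9781231027073  1
# 2  EAN 5055266202847  2
# 3  EAN 4053162095984  2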

write items from a list to csv file column by column using pandas dataframe.to_csv

I have a list named items
items = ['a', 'b', 'c']
Code is:
import pandas
df = pandas.DataFrame(items)
df.to_csv("myfile.csv", header=False, index=False)
The values are written to the file in different rows but the same column (vertically written).
But I want the values to be written as a b c, i.e. in the same row but in different columns.
Help please!
You get each element in a different row because that is how you load the DataFrame.
If you want them in different columns, I would suggest transposing,
df = df.T
or you can load the list as one row like below:
items = [['a', 'b', 'c']]
df = pd.DataFrame(items)
df
Out[22]:
0 1 2
0 a b c
And then write the output to CSV, e.g.:
df = pd.DataFrame(items)
df = df.T
df.to_csv("myfile.csv", header=False, index=False)
df = pd.DataFrame(items)
df
Out[5]:
0
0 a
1 b
2 c
df.T
Out[11]:
0 1 2
0 a b c

How to write a JSON object from R dataframe with grouping

In general, I feel there is a need to make JSON objects by folding multiple columns. There is no direct way to do this AFAIK; please point it out if there is.
I have data of this form:
A B C
1 a x
1 a y
1 c z
2 d p
2 f q
2 f r
How do I write JSON that looks like
{'query':'1', 'type':[{'name':'a', 'values':[{'value':'x'}, {'value':'y'}]}, {'name':'c', 'values':[{'value':'z'}]}]}
and similarly for 'query':'2'
I am looking to output them in the mongoimport/mongoexport individual JSON-lines format.
Any pointers are also appreciated.
You've got a slightly "non-standard" thing going with the repeated "value" keys (I don't know if this is legal JSON), as you can see here:
(js <- jsonlite::fromJSON('{"query":"1", "type":[{"name":"a", "values":[{"value":"x"}, {"value":"y"}]}, {"name":"c", "values":[{"value":"z"}]}]}'))
## $query
## [1] "1"
##
## $type
## name values
## 1 a x, y
## 2 c z
... with a data.frame cell containing a list of data.frames:
js$type$values[[1]]
## value
## 1 x
## 2 y
class(js$type$values[[1]])
## [1] "data.frame"
If you can accept your "values" field containing a plain vector instead of a list of {"value": ...} objects, then perhaps the following code will suffice:
jsonlite::toJSON(lapply(unique(dat[, 'A']), function(a1) {
  list(query = a1,
       type = lapply(unique(dat[dat$A == a1, 'B']), function(b2) {
         list(name = b2,
              values = dat[(dat$A == a1) & (dat$B == b2), 'C'])
       }))
}))
## [{"query":[1],"type":[{"name":["a"],"values":["x","y"]},{"name":["c"],"values":["z"]}]},{"query":[2],"type":[{"name":["d"],"values":["p"]},{"name":["f"],"values":["q","r"]}]}]

Subsetting a data frame in a function using another data frame as parameter

I would like to submit a data frame to a function and use it to subset another data frame.
This is the basic data frame:
foo <- data.frame(var1= c(1, 1, 1, 2, 2, 3), var2=c('A', 'A', 'B', 'B', 'C', 'C'))
I use the following function to find out the frequencies of var2 for specified values of var1.
foobar <- function(x, y, z) {
  a <- subset(x, x$var1 == y)
  b <- subset(a, a$var2 == z)
  n <- nrow(b)
  return(n)
}
Examples:
foobar(foo, 1, "A") # returns 2
foobar(foo, 1, "B") # returns 1
foobar(foo, 3, "C") # returns 1
This works. But now I want to submit a data frame of values to foobar instead of the individual values above, i.e. submit df to foobar and get the same results as above (2, 1, 1):
df <- data.frame(var1=c(1, 1, 3), var2=c("A", "B", "C"))
When I change foobar to accept two arguments, like foobar(foo, df), and use y[, c(var1)] and y[, c(var2)] instead of the parameters y and z, it still doesn't work. How can I do this?
edit1: last paragraph clarified
edit2: var1 type corrected
Try this:
library(plyr)
match_df <- function(x, match) {
  vars <- names(match)
  # Compute row ids on the pooled data so the ids of the two frames are comparable
  # (calling id() on each frame separately produces codes that don't line up)
  keys <- id(rbind(x[vars], match[vars]))
  x_id <- keys[seq_len(nrow(x))]
  match_id <- keys[nrow(x) + seq_len(nrow(match))]
  # Match identifiers and return the subsetted data frame
  x[match(match_id, x_id, nomatch = 0), ]
}
match_df(foo, df)
#   var1 var2
# 1    1    A
# 3    1    B
# 6    3    C
Your function foobar is expecting three arguments, and you only supplied two arguments to it with foobar(foo, df). You can use apply to get what you want:
apply(df, 1, function(x) foobar(foo, x[1], x[2]))
And in use:
> apply(df, 1, function(x) foobar(foo, x[1], x[2]))
[1] 2 1 1
To respond to your edit:
I'm not entirely sure what y[, c(var1)] means, but here's an attempt at figuring out what you are trying to do.
What I think you were trying to do was: foobar(foo, y = df[, "var1"], z = df[, "var2"]).
First, note that the use of c() is not needed here: you can reference the columns you want by placing the name of the column in quotes, OR reference the column by number (as I did above). Secondly, df[, "var1"] returns all of the rows for the column named var1, which has length three:
> length(df[, "var1"])
[1] 3
The function you defined is not set up to deal with vectors of length greater than 1. That is why we need to iterate through each row of your data frame, grab a single value, process it, and then go on to the next row. That is what the apply function does. It is equivalent to something along the lines of for (i in 1:nrow(df)), but is a more idiomatic way of handling such issues.
Finally, is there a reason you generated var1 as a factor? It probably makes more sense to treat it as numeric, in my opinion. Compare:
> str(df)
'data.frame': 3 obs. of 2 variables:
$ var1: Factor w/ 2 levels "1","3": 1 1 2
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
Versus
> df2 <- data.frame(var1=c(1,1,3), var2=c("A", "B", "C"))
> str(df2)
'data.frame': 3 obs. of 2 variables:
$ var1: num 1 1 3
$ var2: Factor w/ 3 levels "A","B","C": 1 2 3
In summary, apply is the function you are after here. You may want to spend some time thinking about whether your data should be numeric or a factor, but apply is still what you want.
foobar2 <- function(x, df) {
  .dofun <- function(y, z) {
    a <- subset(x, x$var1 == y)
    b <- subset(a, a$var2 == z)
    n <- nrow(b)
    return(n)
  }
  ans <- mapply(.dofun, as.character(df$var1), as.character(df$var2))
  names(ans) <- NULL
  return(ans)
}
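For completeness, a quick check with the foo and df defined in the question should reproduce the expected counts:
foobar2(foo, df)
# [1] 2 1 1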