I am reading a data table from a CSV file. One of the columns contains JSON-formatted data, for example:
   user_id     tv_sec action_info
1:   47074 1426791420 {"foo": {"bar":12345,"baz":309}, "type": "type1"}
2:   47074 1426791658 {"foo": {"bar":23409,"baz":903}, "type": "type2"}
3:   47074 1426791923 {"foo": {"bar":97241,"baz":218}, "type": "type3"}
I would like to flatten out the action_info column and add the data as columns, as follows:
   user_id     tv_sec   bar baz  type
1:   47074 1426791420 12345 309 type1
2:   47074 1426791658 23409 903 type2
3:   47074 1426791923 97241 218 type3
I am not sure how to achieve this. I found a library to convert strings to JSON in R (RJSONIO) but I'm having a hard time figuring out what to do next. When I experiment with just trying to convert all rows in the action_info column to JSON with the command userActions[,.(fromJSON(action_info))] I basically get a data table with what seems like all the values accumulated in some way that's not entirely clear to me. For example, running that with my (non-example) data I get:
V1
1: 2.188603e+12,2.187628e+12,2.186202e+12,1.164000e+03
2: type1
Warning messages:
1: In if (is.na(encoding)) return(0L) :
the condition has length > 1 and only the first element will be used
2: In if (is.na(i)) { :
the condition has length > 1 and only the first element will be used
So, I'm trying to figure out:
how to operate on the column to convert it from JSON to values (I think I am doing this correctly though, but I'm not certain)
how to get the values and create columns out of them in either the current or new data table.
Rather ugly but should work:
library(dplyr)
library(data.table)
lapply(as.character(df$action_info), RJSONIO::fromJSON) %>%
lapply(function(e) list(bar=e$foo[1], baz=e$foo[2], type=e$type)) %>%
rbindlist() %>%
cbind(df) %>%
select(-action_info)
Data:
library(data.table)
df <- data.table(structure(list(user_id = c(47074L, 47074L, 47074L), tv_sec = c(1426791420L,
1426791658L, 1426791923L), action_info = c("{\"foo\": {\"bar\":12345,\"baz\":309}, \"type\": \"type1\"}",
"{\"foo\": {\"bar\":23409,\"baz\":903}, \"type\": \"type2\"}",
"{\"foo\": {\"bar\":97241,\"baz\":218}, \"type\": \"type3\"}"
)), .Names = c("user_id", "tv_sec", "action_info"), row.names = c(NA,
-3L), class = "data.frame"))
Here's one way to do it with data.table:
df[, c('bar', 'baz', 'type'):=as.list(unlist(fromJSON(action_info[1]))),
by=action_info]
How it works:
The by=action_info essentially makes sure we just call fromJSON once per unique action_info (once per row in your case); this is because fromJSON doesn't work on vectorised input.
The fromJSON(action_info[1]) parses the JSON string in action_info into an R list (the [1] is there on the off chance that you have multiple rows with the same action_info, since fromJSON doesn't work on vector input).
The unlist flattens the nested "foo: {bar...}" (do fromJSON(df$action_info[1]) and unlist(fromJSON(df$action_info[1])) to see what I mean).
The as.list converts the result back into a list, with one element per "column" (data.table needs this to do the multiple assignment)
Then the c('bar', 'baz', 'type'):= assigns the output back out to the columns.
Note that we don't match by name, so 'bar' is always the first part of the JSON, 'baz' is always the second, and so on. If your action_info could contain {bar: ..., baz: ...} as well as {baz: ..., bar: ...}, the baz of the second would be assigned to the bar column. If you want to assign by name, you will need something cleverer (for example, as.list(...)[c('foo.bar', 'foo.baz', 'type')] ensures the elements are in the right order before assigning).
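For comparison, the assign-by-name idea can be sketched in Python using only the standard library. This is a minimal sketch, not a translation of the data.table answer: the `flatten` and `expand` helpers are hypothetical, the dotted key names (`foo.bar`, `foo.baz`) are chosen to mirror what `unlist` produces in R, and the key order in the second row is swapped on purpose to show that lookup by name does not depend on order.

```python
import json


def flatten(obj, prefix=""):
    """Flatten nested dicts into one dict with dotted keys, e.g. 'foo.bar'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Rows modelled on the example data. The second row lists baz before bar
# (an assumption made for illustration, not present in the original data).
rows = [
    {"user_id": 47074, "tv_sec": 1426791420,
     "action_info": '{"foo": {"bar":12345,"baz":309}, "type": "type1"}'},
    {"user_id": 47074, "tv_sec": 1426791658,
     "action_info": '{"foo": {"baz":903,"bar":23409}, "type": "type2"}'},
    {"user_id": 47074, "tv_sec": 1426791923,
     "action_info": '{"foo": {"bar":97241,"baz":218}, "type": "type3"}'},
]


def expand(row):
    """Replace the action_info string with bar, baz and type columns."""
    flat = flatten(json.loads(row["action_info"]))
    return {
        "user_id": row["user_id"],
        "tv_sec": row["tv_sec"],
        "bar": flat["foo.bar"],   # looked up by name, not by position
        "baz": flat["foo.baz"],
        "type": flat["type"],
    }


table = [expand(r) for r in rows]
for r in table:
    print(r)
```

Because each value is fetched by its flattened name, the swapped second row still ends up with bar = 23409 and baz = 903.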
I want JuliaDB.loadtable() to read a CSV (really a bunch of CSVs, but for simplicity let's try just one), where all columns are parsed as String.
Here's what I've tried:
using CSV
using DataFrames
using JuliaDB
df1 = DataFrame(
[['a', 'b', 'c'], [1, 2, 3]],
["name", "id"]
)
CSV.write("df1.csv", df1)
# This works, but if I have 10+ columns it would get unwieldy
df1 = loadtable("df1.csv"; colparsers=Dict(:name=>String, :id=>String),)
# This doesn't work
df1 = loadtable("df1.csv"; colparsers=String,)
# MethodError: no method matching iterate(::Type{String})
Here's how it's done in R:
df1 = read.csv("df1.csv", colClasses = "character")
If you know the number of columns (or just an upper bound on it), you can use types, I should think (from CSV.jl documentation):
types: a Vector or Dict of types to be used for column types; a Dict can map column index Int, or name Symbol or String to type for a column, i.e. Dict(1=>Float64) will set the first column as a Float64, Dict(:column1=>Float64) will set the column named column1 to Float64 and, Dict("column1"=>Float64) will set the column1 to Float64; if a Vector is provided, it must match the # of columns provided or detected in header
json_data = [{'User_Info':[{'Name':'John'},{'Name':'Ashly'},
{'Name':'Herbert'}]},
{'User_Info':[{'Name':''}]},
{'User_Info':[{'Name':'Lee'},{'Name':'Patrick'},{'Name':'Herbert'}]},
{'User_Info':[{'Name':'Benjamine'}]}]
I have JSON data and the length of the data is 5. I'd like to use loops to find names from that data. I've tried the code below but didn't get the expected outputs:
names_outputs = []
for ppl in json_data:
    for i in ppl['User_Info']:
        names_outputs.append(i['Name'])
print(names_outputs)
>>['John','Ashly','Herbert','Lee','Patrick','Walter','Steve','Benjamine']
However, my expected outputs should be like this:
[['John','Ashly','Herbert'],[],['Lee','Patrick','Herbert'],['Walter','Steve'],['Benjamine']]
You can use a nested list comprehension for that:
>>> [[name["Name"] for name in people] for people in [d["User_Info"] for d in json_data]]
[['John', 'Ashly', 'Herbert'], [''], ['Lee', 'Patrick', 'Herbert'], ['Benjamine']]
If you want to eliminate empty strings, use filter (wrapped in list on Python 3, where filter returns an iterator):
>>> [list(filter(None, [name["Name"] for name in people])) for people in [d["User_Info"] for d in json_data]]
[['John', 'Ashly', 'Herbert'], [], ['Lee', 'Patrick', 'Herbert'], ['Benjamine']]
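Since the question asks for loops, the same result can also be written as an explicit nested loop. This is a sketch using the four records shown in the question; the key change from the original attempt is that a fresh inner list is created for each record and appended once the inner loop finishes.

```python
json_data = [
    {'User_Info': [{'Name': 'John'}, {'Name': 'Ashly'}, {'Name': 'Herbert'}]},
    {'User_Info': [{'Name': ''}]},
    {'User_Info': [{'Name': 'Lee'}, {'Name': 'Patrick'}, {'Name': 'Herbert'}]},
    {'User_Info': [{'Name': 'Benjamine'}]},
]

names_outputs = []
for entry in json_data:
    group = []  # one inner list per top-level record
    for person in entry['User_Info']:
        if person['Name']:  # skip empty names
            group.append(person['Name'])
    names_outputs.append(group)  # append the whole group, not each name

print(names_outputs)
# [['John', 'Ashly', 'Herbert'], [], ['Lee', 'Patrick', 'Herbert'], ['Benjamine']]
```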
Below is the code I use to convert a CSV file to JSON in Python.
I have two fields, 'recommendation' and 'rating'. I need to set the rating field from the recommendation value: if recommendation is 1 then rating should be 1, and likewise for other values. With the answer I got, the output contains only one record instead of all of them, so I think something is being overwritten. Do I need to create a separate list and append each record to it to get the output for all records?
Here's the updated code:
def main(input_file):
    csv_rows = []
    with open(input_file, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter='|')
        title = reader.fieldnames
        for row in reader:
            entry = OrderedDict()
            for field in title:
                entry[field] = row[field]
            [c.update({'RATING': c['RECOMMENDATIONS']}) for c in reader]
            csv_rows.append(entry)
    with open(json_file, 'w') as f:
        json.dump(csv_rows, f, sort_keys=True, indent=4, ensure_ascii=False)
        f.write('\n')
I want to create the nested format like the below:
"rating": {
    "user_rating": {
        "rating": 1
    },
    "recommended": {
        "rating": 1
    }
}
After you've read the file in, using the csv.DictReader, you'll have a list of dicts. Since you want to set the values now, it's a simple dict manipulation. There are several ways, of which one is:
[c.update({'rating': c['recommendation']}) for c in read_csvDictReader]
Hope that helps.
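To get the nested shape from the question, one option is to build each record explicitly and append it inside the row loop. This is a minimal sketch, assuming pipe-delimited input with a RECOMMENDATIONS column holding integers; the sample rows, the NAME column, and the `build_records` helper are made up for illustration and are not part of the original code.

```python
import csv
import io
import json

# Hypothetical pipe-delimited sample standing in for the real input file.
SAMPLE = "NAME|RECOMMENDATIONS\nalpha|1\nbeta|0\n"


def build_records(csvfile):
    """Read every row and attach a nested 'rating' entry to each one."""
    reader = csv.DictReader(csvfile, delimiter='|')
    records = []
    for row in reader:
        entry = dict(row)  # copy every CSV field as-is
        value = int(row['RECOMMENDATIONS'])
        # Nest the rating under both keys, as in the desired output.
        entry['rating'] = {
            'user_rating': {'rating': value},
            'recommended': {'rating': value},
        }
        records.append(entry)  # one append per row, so nothing is overwritten
    return records


records = build_records(io.StringIO(SAMPLE))
print(json.dumps(records, indent=4, sort_keys=True))
```

The reader is iterated exactly once here. Iterating over it a second time inside the row loop would exhaust it after the first row, which would leave only one record in the output.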
Let's suppose I have a simple JSON array like this:
[
  {
    "name": "Alex",
    "age": 12
  },
  {
    "name": "Peter"
  }
]
Notice that the second object doesn't have an age field.
I'm using JSON4S to query JSON (using the for-comprehension style to extract values):
for {
  JArray(persons) <- json
  JObject(person) <- persons
  JField("name", JString(name)) <- person
  JField("age", JString(age)) <- person
} yield new Person(name, age)
The problem for me is that this expression will skip the second object (the one with the missing age field). I don't want to skip such objects; I need to get it as null or better as None.
This answer gives an example of how to deal with null values in JSON using custom extractors, but it works only if the field is present and if its value is null.
Deconstructing objects in json4s can be inconvenient, as you can no longer use the fancy \ and \\ queries.
I prefer to do something like this:
for {
  JArray(persons) <- json
  person@JObject(_) <- persons
  JString(name) <- person \ "name"
  age = (person \ "age").extractOpt[Int]
} yield (name, age)
res7: List[(String, Option[Int])] = List(("Alex", Some(12)), ("Peter", None))
This example also illustrates two alternative ways to extract object fields (you can also use name = (person \ "name").extract[String] instead).
I have a big json file, containing 18 fields, some of which contain some other subfields. I read the file in R in the following way:
json_file <- "daily_profiles_Bnzai_20150914_20150915_20150914.json"
data <- fromJSON(sprintf("[%s]", paste(readLines(json_file), collapse=",")))
This gives me a giant list with all the fields contained in the json file. I want to make it into a data.frame and do some operations in the meantime. For example if I do:
doc_length <- data.frame(t(apply(as.data.frame(data$doc_lenght_map), 1, unlist)))
os <- data.frame(t(apply(as.data.frame(data$operating_system), 1, unlist)))
navigation <- as.data.frame(data$navigation)
monday <- data.frame(t(apply(navigation[,grep("Monday",names(data$navigation))],1,unlist)))
Monday <- data.frame(apply(monday, 1, sum))
This works fine: I get what I want, with all the right subfields. I then want to join them into a final data.frame that I will use for other operations.
Now, I'd like to do something like that on the subset of fields where I don't need to do operations. So, for example, the days of the week contained in navigation are not included. I'd like to have something like (suppose I have a data.frame df):
for(name in names(data))
{
  df <- cbind(df, data.frame(t(apply(as.data.frame(data$name), 1, unlist))))
}
The above loop gives me errors. What I want is a way to access all the fields of the list automatically, as in the loop, where the iterator "name" takes on each field of the list in turn, so that I can operate on those fields without having to call them individually. I even tried
for(name in names(data))
{
  df <- cbind(df, data.frame(t(apply(as.data.frame(data[name]), 1, unlist))))
}
but it doesn't take all of the subfields. I also tried with
data[, name]
but it doesn't work either. So I think I need to use the "$" operator.
Is it possible to do something like that?
Thank you a lot!
Davide
Like the other commenters, I am confused, but I will throw this out to see if it might point you in the right direction.
# make mtcars a list as an example
data <- lapply(mtcars, identity)
do.call(
  cbind,
  lapply(
    names(data),
    function(name) {
      data.frame(data[name])
    }
  )
)