R: Converting a nested list with empty elements to data.frame (from json) - json

I have imported a json file like this one:
library(rjson)
json_str <- '[{"id": 1, "code": 7909, "text": [{"col1": "a", "col2": "some text"}], "date": "2015-12-01"}, {"id": 2, "code": 7651, "text": [], "date": "2015-12-01"}, {"id": 3, "code": 4768, "text": [{"col1": "aaa", "col2": "Blah, blah"}, {"col1": "bbb", "col2": "Blah, blah, blah"}], "date": "2015-12-01"}]'
my.list <- fromJSON(json_str)
str(my.list)
Needless to say the real file is much longer.
As a result I get a nested list of 3 elements where each element is a list of 4, and then, the element $text is a list of variable length from nothing to any number of elements, in my case, usually no more than 3.
After some research I have found several answers about converting a list to data.frame, for example here and here. However, none of them work when one or more of the nested lists in '$text` is empty.
do.call(rbind, lapply(my.list, data.frame, stringsAsFactors=FALSE))
library(data.table)
rbindlist(my.list, fill=TRUE)
Both return an error.
I would like to either convert the list in $text to several columns of the data.frame or just one (pasting the content).
Another option would be to be able to skip some elements (say $text) and convert the rest of the list, then in a separate line convert those elements (say $text) to a different data.frame. I think I could somehow relate one data.frame to the other.
Can anyone give me any idea on how to do this.
Thanks

By the sounds of it, something like the following should work:
do.call(rbind.data.frame, lapply(my.list, function(x) {
x[["text"]] <- toString(unlist(x[["text"]]))
x
}))
## id code text date
## 2 1 7909 a, some text 2015-12-01
## 21 2 7651 2015-12-01
## 3 3 4768 aaa, Blah, blah, bbb, Blah, blah, blah 2015-12-01
This follows your idea of pasting the values together (here using toString) to form a single column in the data.frame.

Related

Postgres json to view

I have a table like this (with an jsonb column):
https://dbfiddle.uk/lGuHdHEJ
If I load this json with python into a dataframe:
import pandas as pd
import json
data={
"id": [1, 2],
"myyear": [2016, 2017],
"value": [5, 9]
}
data=json.dumps(data)
df=pd.read_json(data)
print (df)
I get this result:
id myyear value
0 1 2016 5
1 2 2017 9
How can a get this result directly from the json column via sql in a postgres view?
Note: This assumes that your id, my_year, and value array are consistent and have the same length.
This answer uses PostgresSQL's json_array_elements_text function to explode array elements to the rows.
select jsonb_array_elements_text(payload -> 'id') as "id",
jsonb_array_elements_text(payload -> 'bv_year') as "myyear",
jsonb_array_elements_text(payload -> 'value') as "value"
from main
And this gives the below output,
id myyear value
1 2016 5
2 2017 9
Although this is not the best design to store the properties in jsonb object and could lead to data inconsistencies later. If it's in your control I would suggest storing the data where each property's mapping is clear. Some suggestions,
You can instead have separate columns for each property.
If you want to store it as jsonb only then consider [{id: "", "year": "", "value": ""}]

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json In order to get a Dataframe as follow
ddf.compute()
id owner pet_id
0 1 "Charlie" "pet_1"
1 2 "Charlie" "pet_2"
3 4 "Buddy" "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
columns=list(["id", "owner", "pet_id"]])
).astype({
"id":int,
"owner":"object",
"pet_id": "object"
})
ddf = dd.read_json(f"mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files looks like
[
{
"id": 1,
"owner": "Charlie",
"pet_id": "pet_1"
},
{
"id": 2,
"owner": "Charlie",
"pet_id": "pet_2"
}
]
As far I understand the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it the meta= argument
PD:
I also tried doing it in the following way
{
"id": [1, 2],
"owner": ["Charlie", "Charlie"],
"pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
id owner pet_id
0 [1, 2] ["Charlie", "Charlie"] ["pet_1", "pet_2"]
1 [4] ["Buddy"] ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
blocksize=None, orient="records",
lines=False)
which can be largely gleaned from the docstring.
meta looks OK from your code
blocksize must be None, since you have a whole JSON object per file and cannot split the file
orient "records" means list of objects
lines=False means this is not a line-delimited JSON file, which is the more common case for Dask (you are not assuming that a newline character means a new record)
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.

How to take any CSV file and convert it to JSON?(with python as a script engine) [Novice user trying to learn NiFi]

1) There is a CSV file containing the following information (the first row is the header):
first,second,third,total
1,4,9,14
7,5,2,14
3,8,7,18
2) I would like to find the sum of individual rows and generate a final file with a modified header. The final file should look like this:
[
{
"first": 1,
"second": 4,
"third": 9,
"total": 14
},
{
"first": 7,
"second": 5,
"third": 2,
"total": 14
},
{
"first": 3,
"second": 8,
"third": 7,
"total": 18
}
]
But it does not work and I am not sure how to fix this. Can anyone provide me an understanding on how to approach this problem?
NiFi flow:
Although i'm not into Python, by just googling around i think this might do it:
import csv
with open("YOURFILE.csv") as f:
reader = csv.DictReader(f)
data = [r for r in reader]
import json
with open('result.json', 'w') as outfile:
json.dump(data, outfile)
You can use Query Record processor and add new property as
total
select first,second,third,first+second+third total from FLOWFILE
Configure the CsvReader controller service with matching avro schema with int as datatype for all the fields and Json Setwriter controller service,Include total field name so that the output from Query Record processor will be all the columns and the sum of the columns as total.
Connect total relationship from Query Record processor for further processing
Refer to these links regarding Query Record and Configure Record Reader/Writer

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models and I figured that I'd do it with something that I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value in a json.
The datasets I'm using come off of the kaggle. You can grab it and run it for yourself.
https://www.kaggle.com/datasnaek/league-of-legends
I have json file of the form: (it's actually must bigger, but I shortened it)
{
"type": "champion",
"version": "7.17.2",
"data": {
"1": {
"title": "the Dark Child",
"id": 1,
"key": "Annie",
"name": "Annie"
},
"2": {
"title": "the Berserker",
"id": 2,
"key": "Olaf",
"name": "Olaf"
}
}
}
and dataframe of the form
print df
gameDuration t1_champ1id
0 1949 1
1 1851 2
2 1493 1
3 1758 1
4 2094 2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframe, then I could use the merge option.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas
df = pandas.read_csv("lol_file.csv",header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
for j in df:
if df.loc[j,('t1_champ1id')] == i:
df.loc[j,('t1_champ1id')] = champ[0][i]['name']
I get the below error:
the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
player_names = {v['id']:v['name'] for v in json_file['data'].itervalues()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
# gameDuration t1_champ1id
# 0 1949 Annie
# 1 1851 Olaf
# 2 1493 Annie
# 3 1758 Annie
# 4 2094 Olaf
Created a dataframe from the 'data' in the json file (also transposed the resulting dataframe and then set the index to what you want to map, the id) then mapped that to the original df.
import json
with open('champion_info.json') as data_file:
champ_json = json.load(data_file)
champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id',inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])

R list(structure(list())) to data frame

I have a JSON data source providing a list of hashes:
[
{ "a": "foo",
"b": "sdfshk"
},
{ "a": "foo",
"b": "ihlkyhul"
}
]
I use fromJSON() in the rjson package to convert that to an R data structure. It returns:
list(
structure(list(a = "foo", b = "sdfshk"), .Names = c("a", "b")),
structure(list(a = "foo", b = "ihlkyhul"), .Names = c("a", "b"))
)
I need to get this into an R data frame, but data.frame() turns that into a single-row data frame with four columns instead of a 2x2 data frame as expected. I lack the R-fu to do the transform from one to the other, though it looks like it should be straightforward.
Bonus points:
The actual problem is a bit more complex, because the JSON data source isn't as regular as I show above. The objects it returns vary in type. That is, the field set in each can be one of a few different types:
[
{ "a": "foo",
"b": "asdfhalsdhfla"
},
{ "a": "bar",
"c": "akjdhflakjhsdlfkah",
"d": "jfhglskhfglskd",
},
{ "a": "foo",
"b": "dfhlkhldsfg"
}
]
As you can see, the "a" field in each object is a type tag, indicating which other fields the object will have.
I'm not too particular how the solution copes with this.
It wouldn't be horrible if the two object types were just mooshed together, so you get columns a, b, c, and d, and the rows simply have N/A or NULL values where the JSON source object doesn't have a value for a given field. I believe I can clean the resulting data frame with subset(df, a == "foo"). I'll end up with some empty columns that way, but it won't matter to my program.
It would be better if the solution provides a way to select which JSON source rows go into the data frame and which get rejected, so the result has only the columns and rows actually required.
If you have a jagged list you want converted to a data.frame, you could use Hadley's plyr's rbind.fill. Saved my neck on a couple of occasions. Let me know if this is what you're looking for. Notice that I modified your first example to include only "b" in the third element to make it jagged.
> x <- list(
+ structure(list(a = "foo", b = "sdfshk"), .Names = c("a", "b")),
+ structure(list(a = "foo", b = "ihlkyhul"), .Names = c("a", "b")),
+ structure(list(b = "asdf"), .Names = "b")
+ )
>
> library(plyr)
> do.call("rbind.fill", lapply(x, as.data.frame))
a b
1 foo sdfshk
2 foo ihlkyhul
3 <NA> asdf