I have a pandas dataframe that looks like this
df_in = pd.DataFrame(data = {'another_col': ['a', 'x', '4'], 'json': [
[{"Key":"firstkey", "Value": 1.4}, {"Key": "secondkey", "Value": 6}],
[{"Key":"firstkey", "Value": 5.4}, {"Key": "secondkey", "Value": 11}],
[{"Key":"firstkey", "Value": 1.6}, {"Key": "secondkey", "Value": 9}]]}
)
which when printed looks like
another_col json
0 a [{'Key': 'firstkey', 'Value': 1.4}, {'Key': 's...
1 x [{'Key': 'firstkey', 'Value': 5.4}, {'Key': 's...
2 4 [{'Key': 'firstkey', 'Value': 1.6}, {'Key': 's...
I need to transform it and parse each row of json into columns. I want the resulting dataframe to look like
another_col firstkey secondkey
0 a 1.4 6
1 x 5.4 11
2 4 1.6 9
How do I do this? I have been trying with pd.json_normalize with no success.
A secondary concern is speed... I have to apply this on ~5mm rows...but first let's get it working. :-)
You can explode the list column, convert it to a dataframe and unstack, then join:
u = df_in['json'].explode()
out = df_in[['another_col']].join(
    pd.DataFrame(u.tolist(), index=u.index)
      .set_index('Key', append=True)['Value']
      .unstack())
print(out)
another_col firstkey secondkey
0 a 1.4 6.0
1 x 5.4 11.0
2 4 1.6 9.0
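Since the question mentions ~5mm rows, it may be worth noting that a plain Python comprehension building one dict per row often outperforms explode/unstack at that scale; a sketch under the same sample data (not benchmarked here):

```python
import pandas as pd

df_in = pd.DataFrame(data={'another_col': ['a', 'x', '4'], 'json': [
    [{"Key": "firstkey", "Value": 1.4}, {"Key": "secondkey", "Value": 6}],
    [{"Key": "firstkey", "Value": 5.4}, {"Key": "secondkey", "Value": 11}],
    [{"Key": "firstkey", "Value": 1.6}, {"Key": "secondkey", "Value": 9}]]})

# Build one {Key: Value} dict per row in plain Python, then
# construct all new columns in a single DataFrame call.
parsed = pd.DataFrame(
    [{d['Key']: d['Value'] for d in row} for row in df_in['json']],
    index=df_in.index)
out = df_in[['another_col']].join(parsed)
```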
I have a dataframe df:
d = {'col1': [1, 2,0,55,12,3], 'col3': ['A','A','A','B','B','B'] }
df = pd.DataFrame(data=d)
df
col1 col3
0 1 A
1 2 A
2 0 A
3 55 B
4 12 B
5 3 B
and want to build a Json from it, as the results looks like this :
json_result = { 'A' : [1,2,0], 'B': [55,12,3] }
Basically, for each group in col3 I would like to get an array of its corresponding values from the dataframe.
Aggregate the values into lists, then use Series.to_json:
print(df.groupby('col3')['col1'].agg(list).to_json())
{"A":[1,2,0],"B":[55,12,3]}
or, if you need a dictionary, use Series.to_dict:
print(df.groupby('col3')['col1'].agg(list).to_dict())
{'A': [1, 2, 0], 'B': [55, 12, 3]}
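A self-contained version of the above, round-tripped through json to confirm the shape; Series.to_json also accepts a file path if the result should go straight to disk (the 'result.json' name below is just illustrative):

```python
import json
import pandas as pd

d = {'col1': [1, 2, 0, 55, 12, 3], 'col3': ['A', 'A', 'A', 'B', 'B', 'B']}
df = pd.DataFrame(data=d)

# Collect each group's values into a list, then serialize.
s = df.groupby('col3')['col1'].agg(list)
json_str = s.to_json()
json_result = s.to_dict()

# To write directly to a file instead:
# s.to_json('result.json')
```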
I have a two-fold issue and am looking for clues as to how to approach it.
I have a json file that is formatted as such:
{
    "code": 2000,
    "data": {
        "1": {
            "attribute1": 40,
            "attribute2": 1.4,
            "attribute3": 5.2,
            "attribute4": 124,
            "attribute5": "65.53%"
        },
        "94": {
            "attribute1": 10,
            "attribute2": 4.4,
            "attribute3": 2.2,
            "attribute4": 12,
            "attribute5": "45.53%"
        },
        "96": {
            "attribute1": 17,
            "attribute2": 9.64,
            "attribute3": 5.2,
            "attribute4": 62,
            "attribute5": "51.53%"
        }
    },
    "message": "SUCCESS"
}
My goals are to:
I would first like to sort the data by any of the attributes.
There are around 100 of these; I would like to grab the top 5 (depending on how they are sorted), then...
Output the data in a table e.g.:
These are sorted by: attribute5
---
attribute1 | attribute2 | attribute3 | attribute4 | attribute5
40         | 1.4        | 5.2        | 124        | 65.53%
17         | 9.64       | 5.2        | 62         | 51.53%
10         | 4.4        | 2.2        | 12         | 45.53%
*also, attribute5 above is a string value
Admittedly, my knowledge here is very limited.
I attempted to mimic the method used here:
python sort list of json by value
I managed to open the file and I can extract the key values from a sample row:
import json

jsonfile = "path-to-my-file.json"
with open(jsonfile) as j:
    data = json.load(j)

k = data["data"]["1"].keys()
print(k)

total = data["data"]
for row in total:
    v = data["data"][row].values()
    print(v)
this outputs:
dict_keys(['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5'])
dict_values([40, 1.4, 5.2, 124, '65.53%'])
dict_values([10, 4.4, 2.2, 12, '45.53%'])
dict_values([17, 9.64, 5.2, 62, '51.53%'])
Any point in the right direction would be GREATLY appreciated.
Thanks!
If you don't mind using pandas, you could do it like this:
import pandas as pd

rows = list(data["data"].values())
df = pd.DataFrame(rows)

# Choose ascending or descending order with the `ascending` keyword;
# head() shows the top 5 rows.
df.sort_values('attribute1', ascending=True).head()
This will allow you to sort by any attribute you need at any time and print out a table.
Which will produce output like this depending on what you sort by
attribute1 attribute2 attribute3 attribute4 attribute5
0 40 1.40 5.2 124 65.53%
1 10 4.40 2.2 12 45.53%
2 17 9.64 5.2 62 51.53%
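One caveat with this approach: attribute5 is a string like '65.53%', so sort_values on it compares lexicographically ('9.00%' would sort above '65.53%'). A sketch that strips the sign and casts before sorting, using a trimmed copy of the sample data:

```python
import pandas as pd

rows = [
    {"attribute1": 40, "attribute2": 1.4, "attribute3": 5.2,
     "attribute4": 124, "attribute5": "65.53%"},
    {"attribute1": 10, "attribute2": 4.4, "attribute3": 2.2,
     "attribute4": 12, "attribute5": "45.53%"},
    {"attribute1": 17, "attribute2": 9.64, "attribute3": 5.2,
     "attribute4": 62, "attribute5": "51.53%"},
]
df = pd.DataFrame(rows)

# Strip the '%' and cast to float so the sort is numeric, not lexicographic.
df['attribute5_num'] = df['attribute5'].str.rstrip('%').astype(float)
top = df.sort_values('attribute5_num', ascending=False).head(5)
```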
I'll leave this answer here in case you don't want to use pandas, but the answer from @MatthewBarlowe is way less complicated and I recommend that.
For sorting by a specific attribute, this should work:
import json
SORT_BY = "attribute4"
with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = sorted(items, key=lambda key: items[key][SORT_BY], reverse=True)
Now, sorted_keys is a list of the keys in order of the attribute they were sorted by.
Then, to print this as a table, I used the tabulate library. The final code for me looked like this:
from tabulate import tabulate
import json

SORT_BY = "attribute4"
with open("test.json") as j:
    data = json.load(j)

items = data["data"]
sorted_keys = sorted(items, key=lambda key: items[key][SORT_BY], reverse=True)

print(f"\nSorted by: {SORT_BY}")
print(
    tabulate(
        [[key, *items[key].values()] for key in sorted_keys],
        headers=["Column", *items["1"].keys()],
    )
)
When sorting by 'attribute5', this outputs:
Sorted by: attribute5
Column attribute1 attribute2 attribute3 attribute4 attribute5
-------- ------------ ------------ ------------ ------------ ------------
1 40 1.4 5.2 124 65.53%
96 17 9.64 5.2 62 51.53%
94 10 4.4 2.2 12 45.53%
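One detail the snippet above leaves open: the question asked for the top 5 of ~100 entries, and sorted_keys is already in order, so slicing it before building the rows is enough. A minimal sketch with inline data standing in for the file (attributes trimmed for brevity):

```python
SORT_BY = "attribute4"
data = {"data": {
    "1": {"attribute1": 40, "attribute4": 124, "attribute5": "65.53%"},
    "94": {"attribute1": 10, "attribute4": 12, "attribute5": "45.53%"},
    "96": {"attribute1": 17, "attribute4": 62, "attribute5": "51.53%"},
}}
items = data["data"]
sorted_keys = sorted(items, key=lambda key: items[key][SORT_BY], reverse=True)

# Keep only the first 5 keys before building the table rows.
top5 = sorted_keys[:5]
rows = [[key, *items[key].values()] for key in top5]
```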
I am wondering how to convert the dataframe to JSON format.
name    | type      | count
'james' | 'message' | 4
'kane'  | 'text'    | 3
'james' | 'text'    | 2
'kane'  | 'message' | 3
----------------------------result--------------------------------
dataframe to JSON format
data = [
    {'name': 'james', 'message': 4, 'text': 2},
    {'name': 'kane', 'message': 3, 'text': 3}
]
How do I change the dataframe to JSON data?
You can use the to_json and collect_list functions (PySpark):
import pyspark.sql.functions as f

df1 = df.withColumn('json', f.struct('name', 'type', 'count')) \
    .groupBy().agg(f.collect_list('json').alias('data')) \
    .withColumn('data', f.to_json(f.struct(f.col('data'))))
df1.show(10, False)
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"data":[{"name":"james","type":"message","count":4.0},{"name":"kane","type":"text","count":3.0},{"name":"james","type":"text","count":2.0},{"name":"kane","type":"message","count":3.0}]}|
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
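If the frame is a plain pandas DataFrame rather than a Spark one, the exact shape the question shows (one record per name, one key per type) is a pivot followed by to_dict(orient='records'); a sketch assuming the sample data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['james', 'kane', 'james', 'kane'],
                   'type': ['message', 'text', 'text', 'message'],
                   'count': [4, 3, 2, 3]})

# Pivot so each name becomes one row with a column per type,
# then emit one dict per row.
data = (df.pivot(index='name', columns='type', values='count')
          .reset_index()
          .to_dict(orient='records'))
```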
I have JSON data as below:
{
"X": "abc",
"Y": 1,
"Z": 4174,
"t_0":
{
"M": "bm",
"T": "sp",
"CUD": 4,
"t_1": '
{
"CUD": "1",
"BBC": "09",
"CPR": -127
},
"EVV": "10.7000",
"BBC": -127,
"CMIX": "25088"
},
"EYR": "sp"
}
The problem is that converting to a Python dataframe creates two columns with the same name, CUD. One is under t_0 and another is under t_1, but they are different events. How can I include the json tag names in the column names so that I can differentiate the two columns, e.g. t_0_CUD and t_1_CUD?
My code is below:
df = pd.io.json.json_normalize(json_data)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
If you use only the first part of the solution, it returns what you need, except that . is used instead of _:
df = pd.io.json.json_normalize(json_data)
print (df)
X Y Z EYR t_0.M t_0.T t_0.CUD t_0.t_1.CUD t_0.t_1.BBC t_0.t_1.CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0.EVV t_0.BBC t_0.CMIX
0 10.7000 -127 25088
If you need _:
df.columns = df.columns.str.replace('.', '_', regex=False)
print (df)
X Y Z EYR t_0_M t_0_T t_0_CUD t_0_t_1_CUD t_0_t_1_BBC t_0_t_1_CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0_EVV t_0_BBC t_0_CMIX
0 10.7000 -127 25088
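Worth noting: json_normalize accepts a sep argument, so the underscore-joined names can be produced in one step without the replace (pd.json_normalize is the non-deprecated spelling of pd.io.json.json_normalize); a sketch with the sample data:

```python
import pandas as pd

json_data = {
    "X": "abc", "Y": 1, "Z": 4174,
    "t_0": {"M": "bm", "T": "sp", "CUD": 4,
            "t_1": {"CUD": "1", "BBC": "09", "CPR": -127},
            "EVV": "10.7000", "BBC": -127, "CMIX": "25088"},
    "EYR": "sp",
}

# sep controls the separator used when flattening nested keys.
df = pd.json_normalize(json_data, sep='_')
```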
I have a large pandas tabular dataframe to convert into JSON.
The standard .to_json() function does not produce a compact format.
How can I get JSON output formatted like this, using pandas only?
{"index": [ 0, 1 ,3 ],
"col1": [ "250", "1" ,"3" ],
"col2": [ "250", "1" ,"3" ]
}
This is a much more compact JSON format for tabular data.
(I could loop over the rows... but I'd rather not.)
It seems you need to_dict first and then convert the dict to JSON:
df = pd.DataFrame({"index": [ 0, 1 ,3 ],
"col1": [ "250", "1" ,"3" ],
"col2": [ "250", "1" ,"3" ]
})
print (df)
col1 col2 index
0 250 250 0
1 1 1 1
2 3 3 3
print (df.to_dict(orient='list'))
{'col1': ['250', '1', '3'], 'col2': ['250', '1', '3'], 'index': [0, 1, 3]}
import json
print (json.dumps(df.to_dict(orient='list')))
{"col1": ["250", "1", "3"], "col2": ["250", "1", "3"], "index": [0, 1, 3]}
Because it is not implemented yet:
print (df.to_json(orient='list'))
ValueError: Invalid value 'list' for option 'orient'
EDIT:
If index is not column, add reset_index:
df = pd.DataFrame({"col1": [250, 1, 3],
"col2": [250, 1, 3]})
print (df)
col1 col2
0 250 250
1 1 1
2 3 3
print (df.reset_index().to_dict(orient='list'))
{'col1': [250, 1, 3], 'index': [0, 1, 2], 'col2': [250, 1, 3]}
You can use to_dict and json (and add the index as extra column if required via assign):
import json
df = pd.DataFrame({"col1": [250, 1, 3],
"col2": [250, 1, 3]})
json_dict = df.assign(index=df.index).to_dict(orient="list")
print(json.dumps(json_dict))
>>> '{"index": [0, 1, 2], "col1": [250, 1, 3], "col2": [250, 1, 3]}'
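If a built-in orient is preferred over the to_dict round-trip, orient='split' is the closest to_json itself offers: column names appear only once, though the values are packed row-wise under 'data' rather than column-wise; a sketch:

```python
import json
import pandas as pd

df = pd.DataFrame({"col1": [250, 1, 3],
                   "col2": [250, 1, 3]})

# 'split' separates column names, index, and values.
parsed = json.loads(df.to_json(orient='split'))
```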