Json to pandas dataframe with slight modification - json

I have a json data as below:
{
"X": "abc",
"Y": 1,
"Z": 4174,
"t_0":
{
"M": "bm",
"T": "sp",
"CUD": 4,
"t_1": '
{
"CUD": "1",
"BBC": "09",
"CPR": -127
},
"EVV": "10.7000",
"BBC": -127,
"CMIX": "25088"
},
"EYR": "sp"
}
The problem is converting to python data-frame creates two columns of same name CUD. One is under t_0 and another is under t_1. But both are different events. How can I append json tag name to column names so that I can differentiate two columns of same name. Something like t_0_CUD , t_1_CUD.
My code is below:
df = pd.io.json.json_normalize(json_data)
df.columns = df.columns.map(lambda x: x.split(".")[-1])

If use only first part of solution it return what you need, only instead _ are used .:
df = pd.io.json.json_normalize(json_data)
print (df)
X Y Z EYR t_0.M t_0.T t_0.CUD t_0.t_1.CUD t_0.t_1.BBC t_0.t_1.CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0.EVV t_0.BBC t_0.CMIX
0 10.7000 -127 25088
If need _:
df.columns = df.columns.str.replace('\.','_')
print (df)
X Y Z EYR t_0_M t_0_T t_0_CUD t_0_t_1_CUD t_0_t_1_BBC t_0_t_1_CPR \
0 abc 1 4174 sp bm sp 4 1 09 -127
t_0_EVV t_0_BBC t_0_CMIX
0 10.7000 -127 25088

Related

Sort and Select Top 5 JSON values

I have a two-fold issue and looking for clues as to how to approach it.
I have a json file that is formatted as such:
{
"code": 2000,
"data": {
"1": {
"attribute1": 40,
"attribute2": 1.4,
"attribute3": 5.2,
"attribute4": 124
"attribute5": "65.53%"
},
"94": {
"attribute1": 10,
"attribute2": 4.4,
"attribute3": 2.2,
"attribute4": 12
"attribute5": "45.53%"
},
"96": {
"attribute1": 17,
"attribute2": 9.64,
"attribute3": 5.2,
"attribute4": 62
"attribute5": "51.53%"
}
},
"message": "SUCCESS"
}
My goals are to:
I would first like to sort the data by any of the attributes.
There are around 100 of these, I would like to grab the top 5 (depending on how they are sorted), then...
Output the data in a table e.g.:
These are sorted by: attribute5
---
attribute1 | attribute2 | attribute3 | attribute4 | attribute5
40 |1.4 |5.2|124|65.53%
17 |9.64|5.2|62 |51.53%
10 |4.4 |2.2|12 |45.53%
*also, attribute5 above is a string value
Admittedly, my knowledge here is very limited.
I attempted to mimick the method used here:
python sort list of json by value
I managed to open the file and I can extract the key values from a sample row:
import json
jsonfile = path-to-my-file.json
with open(jsonfile) as j:
data=json.load(j)
k = data["data"]["1"].keys()
print(k)
total=data["data"]
for row in total:
v = data["data"][str(row)].values()
print(v)
this outputs:
dict_keys(['attribute1', 'attribute2', 'attribute3', 'attribute4', 'attribute5'])
dict_values([1, 40, 1.4, 5.2, 124, '65.53%'])
dict_values([94, 10, 4.4, 2.2, 12, '45.53%'])
dict_values([96, 17, 9.64, 5.2, 62, '51.53%'])
Any point in the right direction would be GREATLY appreciated.
Thanks!
If you don't mind using pandas you could do it like this
import pandas as pd
rows = [v for k,v in data["data"].items()]
df = pd.DataFrame(rows)
# then to get the top 5 values by attribute can choose either ascending
# or descending with the ascending keyword and head prints the top 5 rows
df.sort_values('attribute1', ascending=True).head()
This will allow you to sort by any attribute you need at any time and print out a table.
Which will produce output like this depending on what you sort by
attribute1 attribute2 attribute3 attribute4 attribute5
0 40 1.40 5.2 124 65.53%
1 10 4.40 2.2 12 45.53%
2 17 9.64 5.2 62 51.53%
I'll leave this answer here in case you don't want to use pandas but the answer from #MatthewBarlowe is way less complicated and I recommend that.
For sorting by a specific attribute, this should work:
import json
SORT_BY = "attribute4"
with open("test.json") as j:
data = json.load(j)
items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))
Now, sorted_keys is a list of the keys in order of the attribute they were sorted by.
Then, to print this as a table, I used the tabulate library. The final code for me looked like this:
from tabulate import tabulate
import json
SORT_BY = "attribute4"
with open("test.json") as j:
data = json.load(j)
items = data["data"]
sorted_keys = list(sorted(items, key=lambda key: items[key][SORT_BY], reverse=True))
print(f"\nSorted by: {SORT_BY}")
print(
tabulate(
[
[sorted_keys[i], *items[sorted_keys[i]].values()]
for i, _ in enumerate(items)
],
headers=["Column", *items["1"].keys()],
)
)
When sorting by 'attribute5', this outputs:
Sorted by: attribute5
Column attribute1 attribute2 attribute3 attribute4 attribute5
-------- ------------ ------------ ------------ ------------ ------------
1 40 1.4 5.2 124 65.53%
96 17 9.64 5.2 62 51.53%
94 10 4.4 2.2 12 45.53%

Pandas converts string-typed JSON value to INT

I have list of objects as JSON. Each object has two properties: id(string) and arg(number).
When I use pandas.read_json(...), the resulting DataFrame has the id interpreted as number as well, which causes problems, since information is lost.
import pandas as pd
json = '[{ "id" : "1", "arg": 1 },{ "id" : "1_1", "arg": 2}, { "id" : "11", "arg": 2}]'
df = pd.read_json(json)
I'd expect to have a DataFrame like this:
id arg
0 "1" 1
1 "1_1" 2
2 "11" 2
I get
id arg
0 1 1
1 11 2
2 11 2
and suddenly, the once unique id is not so unique anymore.
How can I tell pandas to stop doing that?
My search so far only yielded results, where people where trying to achive the opposite - having columns of string beeing interpreted as numbers. I totally don't want to achive that in this case!
If you set the dtype parameter to False, read_json will not infer the types automatically:
df = pd.read_json(json, dtype=False)
Use dtype parameter for preventing cast id to numbers:
df = pd.read_json(json, dtype={'id':str})
print (df)
id arg
0 1 1
1 1_1 2
2 11 2
print (df.dtypes)
id object
arg int64
dtype: object

rotate pandas DataFrame with rows of json to plain DataFrame

I have a pandas dataframe that looks like this
df_in = pd.DataFrame(data = {'another_col': ['a', 'x', '4'], 'json': [
[{"Key":"firstkey", "Value": 1.4}, {"Key": "secondkey", "Value": 6}],
[{"Key":"firstkey", "Value": 5.4}, {"Key": "secondkey", "Value": 11}],
[{"Key":"firstkey", "Value": 1.6}, {"Key": "secondkey", "Value": 9}]]}
)
which when printed looks like
another_col json
0 a [{'Key': 'firstkey', 'Value': 1.4}, {'Key': 's...
1 x [{'Key': 'firstkey', 'Value': 5.4}, {'Key': 's...
2 4 [{'Key': 'firstkey', 'Value': 1.6}, {'Key': 's...
I need to transform it and parse each row of json into columns. I want the resulting dataframe to look like
another_col firstkey secondkey
0 a 1.4 6
1 x 5.4 11
2 4 1.6 9
How do I do this? I have been trying with pd.json_normalize with no success.
A secondary concern is speed... I have to apply this on ~5mm rows...but first let's get it working. :-)
You can convert to dataframe and unstack , then join:
u = df_in['json'].explode()
out = df_in[['another_col']].join(pd.DataFrame(u.tolist(),index=u.index)
.set_index('Key',append=True)['Value'].unstack())
print(out)
another_col firstkey secondkey
0 a 1.4 6.0
1 x 5.4 11.0
2 4 1.6 9.0

fromJSON encoding issue

I try to convert a json object to R dataframe, here is the json object:
json <-
'[
{"Name" : "a", "Age" : 32, "Occupation" : "凡达"},
{"Name" : "b", "Age" : 21, "Occupation" : "打蜡设计费"},
{"Name" : "c", "Age" : 20, "Occupation" : "的拉斯克奖飞"}
]'
then I use fromJSON, mydf <- jsonlite::fromJSON(json), the result is
Name Age Occupation
1 a 32 <U+51E1><U+8FBE>
2 b 21 <U+6253><U+8721><U+8BBE><U+8BA1><U+8D39>
3 c 20 <U+7684><U+62C9><U+65AF><U+514B><U+5956><U+98DE>
I was wondering how this happens, and is there any solution?
Using the package rjson can solve the problem, but the output is a list, but I want a dataframe output.
Thank you.
I've tried Sys.setlocale(locale = "Chinese"), well the characters are indeed Chinese,but the results are still weird like below:
Name Age Occupation
1 a 32 ·²´ï
2 b 21 ´òÀ¯Éè¼Æ·Ñ
3 c 20 µÄÀ­Ë¹¿Ë½±·É

read mixed data into R

I have a text file '\t' separated. First two columns are text and third one is in JSON format like {type: [{a: a1, timestamp: 1}, {a:a2, timestamp: 2}]}
How can i put it into DF correctly?
I would like to parse line like factor1\tparam1\t{type: [{a: a1, timestamp: 1}, {a:a2, timestamp: 2}]} into DF like
factor_column param_column a_column ts_column
factor1 param1 a1 1
factor1 param1 a2 2
I have saved that one line of text you have provided into a file called 'parseJSON.txt'. You can then read the file in as per usual using read.table, then make use of library(jsonlite) to parse the 3rd column.
I've also formatted the line of text to include quotes around the JSON code:
factor1 param1 {"type": [{"a": "a1", "timestamp": 1}, {"a":"a2", "timestamp": 2}]}
library(jsonlite)
dat <- read.table("parseJSON.txt",
sep="\t",
header=F,
quote="")
#parse 3rd column using jsonlite
js <- fromJSON(as.character(dat[1,3]))
js is now a list
> js
$type
a timestamp
1 a1 1
2 a2 2
which can be combined with the first two columns of dat
res <- cbind(dat[,1:2],js$type)
names(res) <- c("factor_column", "param_column", "a_column", "ts_column")
which gives
> res
factor_column param_column a_column ts_column
1 factor1 param1 a1 1
2 factor1 param1 a2 2