My input JSON file has a list of objects as below:
[
    {
        "name": "Hariharan",
        "place": "Chennai",
        "items": [
            {"item_name": "This is a book shelf", "item_level": 1},
            {"item_name": "Introduction", "item_level": 1},
            {"item_name": "ABCDEF", "item_level": 2},
            {"item_name": "grains", "item_level": 3},
            {"item_name": "market place", "item_level": 1},
            {"item_name": "Vegentables", "item_level": 1},
            {"item_name": "Fruits", "item_level": 4},
            {"item_name": "EFGHIJ", "item_level": 2},
            {"item_name": "Conclusion", "item_level": 1}
        ],
        "descriptions": [
            {"item_name": "Books"}
        ]
    }
]
I want to read this JSON file and load the data into a dataframe.
The dataframe columns should be name, place, items and descriptions; both items and descriptions are in turn lists of objects.
I am able to read the JSON:
import json

with open('data/test.json', 'r', encoding='utf8') as fp:
    data = json.load(fp)

print(type(data))  # <class 'list'>
I get an error when I try to load the list into a dataframe:
df = pd.read_json(data)
ValueError: Invalid file path or buffer object type: <class 'list'>
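pd.read_json expects a file path, URL, or JSON string, not an already-parsed list, which is exactly what the ValueError says. Since json.load has already produced a Python list of dicts, it can go straight into the DataFrame constructor; a minimal sketch with an abbreviated version of the data above:

```python
import pandas as pd

# abbreviated version of the parsed JSON (normally data = json.load(fp))
data = [
    {
        "name": "Hariharan",
        "place": "Chennai",
        "items": [{"item_name": "Introduction", "item_level": 1}],
        "descriptions": [{"item_name": "Books"}],
    }
]

# pd.DataFrame accepts a list of dicts directly: each key becomes a column,
# and list-valued fields like items/descriptions are kept as lists in the cells
df = pd.DataFrame(data)
print(df.columns.tolist())  # ['name', 'place', 'items', 'descriptions']
```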
I am trying to parse a nested JSON and collect data into a list under a certain condition.
Input JSON as below:
[
    {
        "name": "Thomas",
        "place": "USA",
        "items": [
            {"item_name": "This is a book shelf", "level": 1},
            {"item_name": "Introduction", "level": 1},
            {"item_name": "parts", "level": 2},
            {"item_name": "market place", "level": 3},
            {"item_name": "books", "level": 1},
            {"item_name": "pens", "level": 1},
            {"item_name": "pencils", "level": 1}
        ],
        "descriptions": [
            {"item_name": "Books"}
        ]
    },
    {
        "name": "Samy",
        "place": "UK",
        "items": [
            {"item_name": "This is a cat house", "level": 1},
            {"item_name": "Introduction", "level": 1},
            {"item_name": "dog house", "level": 3},
            {"item_name": "cat house", "level": 1},
            {"item_name": "cat food", "level": 2},
            {"item_name": "cat name", "level": 1},
            {"item_name": "Samy", "level": 2}
        ],
        "descriptions": [
            {"item_name": "cat"}
        ]
    }
]
I am reading the JSON as below:
with open('test.json', 'r', encoding='utf8') as fp:
    data = json.load(fp)

for i in data:
    if i['name'] == "Thomas":
        # collect "item_name" values in a list (my_list) if "level": 1
        # my_list = []
Expected output:
my_list = ["This is a book shelf", "Introduction", "books", "pens", "pencils"]
Since it's a nested, complex JSON, I am not able to collect the data into a list as mentioned above. Please let me know how to collect the data from the nested JSON.
Try:
import json

with open("test.json", "r", encoding="utf8") as fp:
    data = json.load(fp)

my_list = [
    i["item_name"]
    for d in data
    for i in d["items"]
    if d["name"] == "Thomas" and i["level"] == 1
]
print(my_list)
This prints:
['This is a book shelf', 'Introduction', 'books', 'pens', 'pencils']
Or without list comprehension:
my_list = []
for d in data:
    if d["name"] != "Thomas":
        continue
    for i in d["items"]:
        if i["level"] == 1:
            my_list.append(i["item_name"])
print(my_list)
Once we have the data, we iterate over the outermost list of objects.
We check whether the object's name equals "Thomas"; if so, we apply the filter method with a lambda function on the items list, with a condition of level == 1.
This gives us a list of the item objects that have level == 1.
To extract the item_name values we use a comprehension, so the final value in final_list will be as you expected:
["This is a book shelf", "Introduction", "books", "pens", "pencils"]
import json

def get_final_list():
    with open('test.json', 'r', encoding='utf8') as fp:
        data = json.load(fp)
    final_list = []
    for obj in data:
        if obj.get("name") == "Thomas":
            x = list(filter(lambda item: item['level'] == 1, obj.get("items")))
            final_list = final_list + x
    final_list = [i.get("item_name") for i in final_list]
    return final_list

print(get_final_list())
Considering the dataframe below:
timestamp coordinates
0 [402, 404] [[2.5719,49.0044], [2.5669,49.0043]]
1 [345, 945] [[2.5719,49.0044], [2.5669,49.0043]]
I'd like to generate a json file like below:
[
    {
        "vendor": 1,
        "path": [
            [2.5719, 49.0044],
            [2.5669, 49.0043]
        ],
        "timestamps": [402, 404]
    },
    {
        "vendor": 1,
        "path": [
            [2.5719, 49.0044],
            [2.5669, 49.0043]
        ],
        "timestamps": [345, 945]
    }
]
To do so, my idea is:
1. For each row of my df, generate a new column geometry containing the row's JSON data.
2. Then append all the geometries into one JSON file.
However, my function below doesn't work:
df["geometry"] = df.apply(
    lambda row: {
        "vendor": 1,
        "path": row["coordinates"],
        "timestamps": row["timestamp"],
    },
    axis=1,
)
Indeed, the result is, for example (note the quote marks (') around the arrays in path):
{
    'vendor': 1,
    'path': ['[2.5719,49.0044]', '[2.5669,49.0043]'],
    'timestamps': [402, 404]
}
Any idea?
Thanks
Presumably the values in the coordinates column are of type string. You can use ast.literal_eval to convert them to lists:
from ast import literal_eval

df["geometry"] = df.apply(
    lambda row: {
        "vendor": 1,
        "path": literal_eval(row["coordinates"]),
        "timestamps": row["timestamp"],
    },
    axis=1,
)
print(df)
Prints:
timestamp coordinates geometry
0 [402, 404] [[2.5719,49.0044], [2.5669,49.0043]] {'vendor': 1, 'path': [[2.5719, 49.0044], [2.5669, 49.0043]], 'timestamps': [402, 404]}
1 [345, 945] [[2.5719,49.0044], [2.5669,49.0043]] {'vendor': 1, 'path': [[2.5719, 49.0044], [2.5669, 49.0043]], 'timestamps': [345, 945]}
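To get from the geometry column to the JSON file shown in the question, the column can be dumped as a plain list; a sketch under the same string-coordinates assumption (the frame and file name here are illustrative):

```python
import json
from ast import literal_eval

import pandas as pd

# hypothetical frame matching the question; coordinates held as one string per row
df = pd.DataFrame(
    {
        "timestamp": [[402, 404], [345, 945]],
        "coordinates": ["[[2.5719,49.0044], [2.5669,49.0043]]"] * 2,
    }
)

df["geometry"] = df.apply(
    lambda row: {
        "vendor": 1,
        "path": literal_eval(row["coordinates"]),
        "timestamps": row["timestamp"],
    },
    axis=1,
)

# Series.tolist() gives a plain list of dicts, which json.dump serializes cleanly
with open("geometries.json", "w") as f:
    json.dump(df["geometry"].tolist(), f, indent=4)
```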
From a DataFrame, I want to have a JSON output file with one key holding a list.
Expected output:
[
    {
        "model": "xx",
        "id": 1,
        "name": "xyz",
        "categories": [1, 2]
    },
    {
        ...
    }
]
What I have:
[
    {
        "model": "xx",
        "id": 1,
        "name": "xyz",
        "categories": "1,2"
    },
    {
        ...
    }
]
The actual code is:
df = pd.read_excel('data_threated.xlsx')
result = df.reset_index(drop=True).to_json("output_json.json", orient='records')
parsed = json.dumps(result)
jsonfile = open("output_json.json", 'r')
data = json.load(jsonfile)
How can I achieve this easily?
EDIT:
print(df['categories'].unique().tolist())
['1,2,3', 1, nan, '1,2,3,6', 9, 8, 11, 4, 5, 2, '1,2,3,4,5,6,7,8,9']
You can use:
df = pd.read_excel('data_threated.xlsx').reset_index(drop=True)
df['categories'] = df['categories'].apply(lambda x: [int(i) for i in x.split(',')] if isinstance(x, str) else '')
df.to_json('output.json', orient='records', indent=4)
Content of output.json
[
    {
        "model":"xx",
        "id":1,
        "name":"xyz",
        "categories":[
            1,
            2
        ]
    }
]
Note you can also use:
df['categories'] = pd.eval(df['categories'])
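Given the mixed types in the edit above (comma-separated strings, bare ints, and NaN), the lambda maps every non-string to an empty string. A sketch of a converter that instead normalizes every cell to a list of ints (the sample values are taken from the edit; mapping NaN to an empty list is an assumption):

```python
import math

import pandas as pd

def to_int_list(x):
    """Normalize a categories cell to a list of ints."""
    if isinstance(x, str):
        return [int(i) for i in x.split(",")]
    if isinstance(x, float) and math.isnan(x):
        return []  # assumed behavior for missing cells
    return [int(x)]

df = pd.DataFrame({"categories": ["1,2,3", 1, float("nan"), "9"]})
df["categories"] = df["categories"].apply(to_int_list)
print(df["categories"].tolist())  # [[1, 2, 3], [1], [], [9]]
```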
(Re-post with accurate data sample)
I have a JSON dictionary where each value is in turn a list of dicts, as follows:
"Parent_Key_A": [{"a": 1.0, "b": 2.0}, {"a": 5.1, "c": 10}, {"b": 20.3, "a": 1.0}]
I am trying to remove both duplicate keys and values so that each element of the json has unique values. So for the above example, I am looking for output something like this:
"Parent_Key_A": {"a":[1.0,5.1], "b":[2.0,20.3], "c":[10]}
Then I need to write this output to a json file. I tried using set to handle duplicates but set is not json serializable.
Any suggestions on how to handle this?
A solution using the itertools.chain() and itertools.groupby() functions:
import itertools, json

input_d = {"Parent_Key_A": [{"a": 1.0, "b": 2.0}, {"a": 5.1, "c": 10}, {"b": 20.3, "a": 1.0}]}

# flatten all (key, value) pairs from the inner dicts into one iterable
items = itertools.chain.from_iterable(d.items() for d in input_d["Parent_Key_A"])

# group the sorted pairs by key; set() drops duplicate (key, value) pairs
input_d["Parent_Key_A"] = {k: [i[1] for i in sorted(set(g))]
                           for k, g in itertools.groupby(sorted(items), key=lambda x: x[0])}
print(input_d)
The output:
{'Parent_Key_A': {'a': [1.0, 5.1], 'b': [2.0, 20.3], 'c': [10]}}
Writing to a JSON file:
with open('output.json', 'w') as f:
    json.dump(input_d, f, indent=4)
output.json contents:
{
    "Parent_Key_A": {
        "a": [
            1.0,
            5.1
        ],
        "b": [
            2.0,
            20.3
        ],
        "c": [
            10
        ]
    }
}
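The same merge can also be sketched without itertools, accumulating values per key with collections.defaultdict and skipping duplicates as they arrive:

```python
import json
from collections import defaultdict

input_d = {"Parent_Key_A": [{"a": 1.0, "b": 2.0}, {"a": 5.1, "c": 10}, {"b": 20.3, "a": 1.0}]}

merged = defaultdict(list)
for d in input_d["Parent_Key_A"]:
    for k, v in d.items():
        if v not in merged[k]:  # keep each value only once per key
            merged[k].append(v)

# defaultdict is a dict subclass, so json.dumps accepts it directly
print(json.dumps(merged))  # {"a": [1.0, 5.1], "b": [2.0, 20.3], "c": [10]}
```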
I have the following example.json. How can I parse it to CSV in order to get the mean values (marked between ** mean_value ** below)?
I want something like in example.csv:
305152,277504,320512
[
    {
        "name": "stats",
        "columns": [
            "time",
            "mean"
        ],
        "points": [
            [
                1444038496000,
                **305152**
            ],
            [
                1444038494000,
                **277504**
            ],
            [
                1444038492000,
                **320512**
            ]
        ]
    }
]
In Python it looks like this:
import json

results = []
with open('example.json', 'r') as f:
    content = json.loads(f.read())

for element in content:
    results.append(','.join([str(y[1]) for y in element['points']]))

with open('example.csv', 'w') as f:
    f.write('\n'.join(results))
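The manual ','.join works because the values are plain numbers; if the fields could ever contain commas or quotes, the stdlib csv module handles the quoting for you. A self-contained sketch with the example data inlined:

```python
import csv

# the same structure as example.json, inlined to keep the sketch self-contained
content = [
    {
        "name": "stats",
        "columns": ["time", "mean"],
        "points": [
            [1444038496000, 305152],
            [1444038494000, 277504],
            [1444038492000, 320512],
        ],
    }
]

with open('example.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for element in content:
        # the second entry of each point is the mean value
        writer.writerow([y[1] for y in element['points']])

print(open('example.csv').read().strip())  # 305152,277504,320512
```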