Read JSON to pandas dataframe - Getting ValueError: Mixing dicts with non-Series may lead to ambiguous ordering

I am trying to read the JSON structure below into a pandas dataframe, but it throws the error message:
ValueError: Mixing dicts with non-Series may lead to ambiguous ordering.
JSON data:
'''
{
    "Name": "Bob",
    "Mobile": 12345678,
    "Boolean": true,
    "Pets": ["Dog", "cat"],
    "Address": {
        "Permanent Address": "USA",
        "Current Address": "UK"
    },
    "Favorite Books": {
        "Non-fiction": "Outliers",
        "Fiction": {"Classic Literature": "The Old Man and the Sea"}
    }
}
'''
How do I get this right? I have tried the script below...
'''
j_df = pd.read_json('json_file.json')
j_df

with open(j_file) as jsonfile:
    data = json.load(jsonfile)
'''

Read the JSON from the file first, then pass it to json_normalize and DataFrame.explode:
import json
import pandas as pd

with open('json_file.json') as data_file:
    data = json.load(data_file)

df = pd.json_normalize(data).explode('Pets').reset_index(drop=True)
print(df)
  Name    Mobile  Boolean Pets Address.Permanent Address  \
0  Bob  12345678     True  Dog                       USA
1  Bob  12345678     True  cat                       USA

  Address.Current Address Favorite Books.Non-fiction  \
0                      UK                   Outliers
1                      UK                   Outliers

  Favorite Books.Fiction.Classic Literature
0                   The Old Man and the Sea
1                   The Old Man and the Sea
EDIT: To write the values into a sentence, select the necessary columns, remove duplicates, convert to a NumPy array and loop:
for x, y in df[['Name', 'Favorite Books.Fiction.Classic Literature']].drop_duplicates().to_numpy():
    print(f"{x}’s favorite classic literature book is {y}.")
Bob’s favorite classic literature book is The Old Man and the Sea.

Related

Dask how to open json with list of dicts

I'm trying to open a bunch of JSON files using read_json in order to get a dataframe as follows:
ddf.compute()
   id      owner   pet_id
0   1  "Charlie"  "pet_1"
1   2  "Charlie"  "pet_2"
3   4    "Buddy"  "pet_3"
but I'm getting the following error
_meta = pd.DataFrame(
    columns=["id", "owner", "pet_id"]
).astype({
    "id": int,
    "owner": "object",
    "pet_id": "object"
})
ddf = dd.read_json("mypets/*.json", meta=_meta)
ddf.compute()
*** ValueError: Metadata mismatch found in `from_delayed`.
My JSON files look like:
[
    {
        "id": 1,
        "owner": "Charlie",
        "pet_id": "pet_1"
    },
    {
        "id": 2,
        "owner": "Charlie",
        "pet_id": "pet_2"
    }
]
As far as I understand, the problem is that I'm passing a list of dicts, so I'm looking for the right way to specify it in the meta= argument.
PS:
I also tried laying the data out in the following way:
{
    "id": [1, 2],
    "owner": ["Charlie", "Charlie"],
    "pet_id": ["pet_1", "pet_2"]
}
But Dask is wrongly interpreting the data
ddf.compute()
       id                   owner              pet_id
0  [1, 2]  ["Charlie", "Charlie"]  ["pet_1", "pet_2"]
1     [4]               ["Buddy"]           ["pet_3"]
The invocation you want is the following:
dd.read_json("data.json", meta=meta,
             blocksize=None, orient="records",
             lines=False)
which can be largely gleaned from the docstring.
- meta looks OK from your code
- blocksize must be None, since each file holds one whole JSON object and the file cannot be split
- orient="records" means a list of objects
- lines=False means this is not a line-delimited JSON file (line-delimited is the more common case for Dask); you are not assuming that a newline character means a new record
So why the error? Probably Dask split your file on some newline character, and so a partial record got parsed, which therefore did not match your given meta.
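Putting those pieces together, a minimal end-to-end sketch, assuming the same mypets/*.json files and schema as in the question:

import dask.dataframe as dd
import pandas as pd

# Empty frame describing the expected schema (same idea as _meta in the question)
meta = pd.DataFrame(columns=["id", "owner", "pet_id"]).astype(
    {"id": int, "owner": "object", "pet_id": "object"}
)

# blocksize=None -> one partition per file, so Dask never splits on newlines;
# orient="records" + lines=False -> each file is a single JSON array of objects
ddf = dd.read_json("mypets/*.json", meta=meta,
                   blocksize=None, orient="records", lines=False)
print(ddf.compute())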

pandas json normalize key error with a particular json attribute

I have a JSON like this:
mytestdata = {
    "success": True,
    "message": "",
    "data": {
        "totalCount": 95,
        "goal": [
            {
                "user_id": 123455,
                "user_email": "john.smith#test.com",
                "user_first_name": "John",
                "user_last_name": "Smith",
                "people_goals": [
                    {
                        "goal_id": 545555,
                        "goal_name": "test goal name",
                        "goal_owner": "123455",
                        "goal_narrative": "",
                        "goal_type": {
                            "id": 1,
                            "name": "Team"
                        },
                        "goal_create_at": "1595874095",
                        "goal_modified_at": "1595874095",
                        "goal_created_by": "123455",
                        "goal_updated_by": "123455",
                        "goal_start_date": "1593561600",
                        "goal_target_date": "1601424000",
                        "goal_progress": "34",
                        "goal_progress_color": "#ff9933",
                        "goal_status": "1",
                        "goal_permission": "internal,team",
                        "goal_category": [],
                        "goal_owner_full_name": "John Smith",
                        "goal_team_id": "766754",
                        "goal_team_name": "",
                        "goal_workstreams": []
                    }
                ]
            }
        ]
    }
}
I am trying to display all the details in "people_goals" along with "user_last_name", "user_first_name", "user_email" and "user_id" with json_normalize.
So far I am able to display "people_goals", "user_first_name" and "user_email" with the code below:
df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
                        meta=[['goal', 'user_first_name'], ['goal', 'user_last_name'], ['goal', 'user_email']],
                        errors='ignore')
However, I am having an issue when trying to include ['goal', 'user_id'] in meta=[].
The error is:
TypeError Traceback (most recent call last)
<ipython-input-192-b7a124a075a0> in <module>
7 df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'],
8 meta=[['goal','user_first_name'], ['goal','user_last_name'], ['goal','user_email'], ['goal','user_id']],
----> 9 errors='ignore')
10
11 # df2 = pd.json_normalize(data=mytestdata['data'], record_path=['goal', 'people_goals'])
The only difference I see for 'user_id' is that it is not a string
Am I missing something here?
Your code works on my platform. I've migrated away from using the record_path and meta parameters for two reasons: a) they are difficult to work out, and b) there are compatibility issues between versions of pandas. Therefore I now use the approach of calling json_normalize() multiple times to progressively expand the JSON, or using pd.Series. I have included both as examples.
df = pd.json_normalize(data=mytestdata['data']).explode("goal")
df = pd.concat([df, df["goal"].apply(pd.Series)], axis=1).drop(columns="goal").explode("people_goals")
df = pd.concat([df, df["people_goals"].apply(pd.Series)], axis=1).drop(columns="people_goals")
df = pd.concat([df, df["goal_type"].apply(pd.Series)], axis=1).drop(columns="goal_type")
df.T

df2 = pd.json_normalize(pd.json_normalize(
    pd.json_normalize(data=mytestdata['data']).explode("goal").to_dict(orient="records")
).explode("goal.people_goals").to_dict(orient="records"))
df2.T
print(df.T.to_string())
output

                                        0
totalCount                             95
user_id                            123455
user_email            john.smith#test.com
user_first_name                      John
user_last_name                      Smith
goal_id                            545555
goal_name                  test goal name
goal_owner                         123455
goal_narrative
goal_create_at                 1595874095
goal_modified_at               1595874095
goal_created_by                    123455
goal_updated_by                    123455
goal_start_date                1593561600
goal_target_date               1601424000
goal_progress                          34
goal_progress_color               #ff9933
goal_status                             1
goal_permission             internal,team
goal_category                          []
goal_owner_full_name           John Smith
goal_team_id                       766754
goal_team_name
goal_workstreams                       []
id                                      1
name                                 Team

Fetch all values irrespective of keys from a column of JSON type in a Spark dataframe using Spark with Scala

I was trying to load a TSV file containing some metadata on movies using Spark. This TSV file contains genre info on the movies in JSON format (the last column in every row).
Sample File
975900 /m/03vyhn Ghosts of Mars 2001-08-24 14010832 98.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}
3196793 /m/08yl5d Getting Away with Murder: The JonBenét Ramsey Mystery 2000-02-16 95.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}
I tried the code below, which lets me access a particular value from the genre JSON:
// SessionCreator.createSession is a helper that creates a Spark session and returns it
val ss = SessionCreator.createSession("DataCleaning", "local[*]")
val headerInfoRb = ResourceBundle.getBundle("conf.headerInfo")
// DataReader.readFromTsv is a helper that reads a TSV file; it takes the SparkSession and a file path and returns a dataframe
val movieDF = DataReader.readFromTsv(ss, "D:/Utility/Datasets/MovieSummaries/movie.metadata.tsv")
  .toDF(headerInfoRb.getString("metadataReader").split(',').toSeq: _*)
movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", get_json_object(col("mv_genre"), "$./m/02kdv5l"))
  .show(1, false)
Output
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|wiki_mv_id|mv_nm |mv_genre |genre_frmttd|
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|975900 |Ghosts of Mars|{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}|Action |
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
only showing top 1 row
I want the genre_frmttd column to look like the snippet below for every row in the dataframe (shown for the first sample row):
[Thriller, Fiction, Horror, Adventure, Supernatural, Action, Space Western]
I am a bit of a rookie in Scala and Spark; please suggest a way to list out the values.
1. parse the JSON using from_json
2. cast it to MapType(StringType, StringType)
3. extract only the values using map_values
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", map_values(from_json(col("mv_genre"), MapType(StringType, StringType))))
  .show(1, false)
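For readers coming from Python, a rough PySpark sketch of the same three steps, assuming the same movieDF already exists:

from pyspark.sql.functions import col, from_json, map_values
from pyspark.sql.types import MapType, StringType

# Parse the JSON genre column into a map<string,string>, then keep only its values
genre_df = (movieDF
    .select("wiki_mv_id", "mv_nm", "mv_genre")
    .withColumn("genre_frmttd",
                map_values(from_json(col("mv_genre"),
                                     MapType(StringType(), StringType())))))
genre_df.show(1, truncate=False)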

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models and I figured I'd do it with something I'm familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value from a JSON file.
The dataset I'm using comes from Kaggle. You can grab it and run this yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print df
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the looked-up value from the JSON.
If both of these were dataframes, I could use merge.
This is what I've tried; I don't know if this is the best way to read in the JSON file.
import pandas
df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, ('t1_champ1id')] == i:
            df.loc[j, ('t1_champ1id')] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file into a dictionary, and then map it to df['t1_champ1id'].
# itervalues() is Python 2 (matching the question's `print df`); use .values() in Python 3
player_names = {v['id']: v['name'] for v in json_file['data'].itervalues()}
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
#    gameDuration t1_champ1id
# 0          1949       Annie
# 1          1851        Olaf
# 2          1493       Annie
# 3          1758       Annie
# 4          2094        Olaf
Created a dataframe from the 'data' in the JSON file (also transposed the resulting dataframe and set the index to the id, which is what you want to map), then mapped that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])
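Since champs is now an ordinary dataframe indexed by id, the merge the questioner mentioned also works; a sketch using the same champs and df as above:

# Join the lookup table's name column onto the main frame via the champion id
df = df.merge(champs[['name']], left_on='t1_champ1id',
              right_index=True, how='left')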

compare input to fields in a json file in ruby

I am trying to create a function that takes an input, which in this case is a tracking code, looks that tracking code up in a JSON file, and then returns the matching record as output. The JSON file is as follows:
[
    {
        "tracking_number": "IN175417577",
        "status": "IN_TRANSIT",
        "address": "237 Pentonville Road, N1 9NG"
    },
    {
        "tracking_number": "IN175417578",
        "status": "NOT_DISPATCHED",
        "address": "Holly House, Dale Road, Coalbrookdale, TF8 7DT"
    },
    {
        "tracking_number": "IN175417579",
        "status": "DELIVERED",
        "address": "Number 10 Downing Street, London, SW1A 2AA"
    }
]
I have started with this function:
def compare_content(tracking_number)
  File.open("pages/tracking_number.json", "r") do |file|
    file.print()
  end
end
I'm not sure how I would compare the input to the JSON file. Any help would be much appreciated.
You can use the built-in JSON module.
require 'json'

def compare_content(tracking_number)
  # Loads ENTIRE file into a string. Will not be effective on very large files
  json_string = File.read("pages/tracking_number.json")

  # Uses the JSON module to create an array from the JSON string
  array_from_json = JSON.parse(json_string)

  # Iterates through the array of hashes
  array_from_json.each do |tracking_hash|
    if tracking_number == tracking_hash["tracking_number"]
      # If this code runs, tracking_hash has the data for the number you are looking up
    end
  end
end
This will parse the JSON supplied into an array of hashes which you can then compare to the number you are looking up.
If you are the one generating the JSON file and this method will be called a lot, consider mapping the tracking numbers directly to their data so that lookups can run much faster. For example,
{
    "IN175417577": {
        "status": "IN_TRANSIT",
        "address": "237 Pentonville Road, N1 9NG"
    },
    "IN175417578": {
        "status": "NOT_DISPATCHED",
        "address": "Holly House, Dale Road, Coalbrookdale, TF8 7DT"
    },
    "IN175417579": {
        "status": "DELIVERED",
        "address": "Number 10 Downing Street, London, SW1A 2AA"
    }
}
That would parse into a hash, where you could much more easily grab the data:
require 'json'

def compare_content(tracking_number)
  json_string = File.read("pages/tracking_number.json")
  hash_from_json = JSON.parse(json_string)

  if hash_from_json.key?(tracking_number)
    tracking_hash = hash_from_json[tracking_number]
  else
    # Tracking number does not exist
  end
end