Table from nested list, struct - pyarrow

I have this json data:
consumption_json = """
{
"count": 48,
"next": null,
"previous": null,
"results": [
{
"consumption": 0.063,
"interval_start": "2018-05-19T00:30:00+0100",
"interval_end": "2018-05-19T01:00:00+0100"
},
{
"consumption": 0.071,
"interval_start": "2018-05-19T00:00:00+0100",
"interval_end": "2018-05-19T00:30:00+0100"
},
{
"consumption": 0.073,
"interval_start": "2018-05-18T23:30:00+0100",
"interval_end": "2018-05-18T00:00:00+0100"
}
]
}
"""
and I would like to convert the results list to an Arrow table.
I have managed this by first converting the JSON to a Python data structure, using Python's json library, and then converting that to an Arrow table:
import json
import pyarrow as pa

consumption_python = json.loads(consumption_json)
results = consumption_python['results']
table = pa.Table.from_pylist(results)
print(table)
pyarrow.Table
consumption: double
interval_start: string
interval_end: string
----
consumption: [[0.063,0.071,0.073]]
interval_start: [["2018-05-19T00:30:00+0100","2018-05-19T00:00:00+0100","2018-05-18T23:30:00+0100"]]
interval_end: [["2018-05-19T01:00:00+0100","2018-05-19T00:30:00+0100","2018-05-18T00:00:00+0100"]]
But, for reasons of performance, I'd rather just use pyarrow exclusively for this.
I can use pyarrow's JSON reader to make a table:
import pyarrow.json  # read_json lives in the pyarrow.json submodule

reader = pa.BufferReader(bytes(consumption_json, encoding='ascii'))
table_from_reader = pa.json.read_json(reader)
And 'results' is a struct nested inside a list. (Actually, everything seems to be nested).
print(table_from_reader['results'].type)
list<item: struct<consumption: double, interval_start: timestamp[s], interval_end: timestamp[s]>>
How do I turn this into a table directly?
Following this answer https://stackoverflow.com/a/72880717/3617057, I can get closer...
import pyarrow.compute as pc
flat = pc.list_flatten(table_from_reader["results"])
print(flat)
[
-- is_valid: all not null
-- child 0 type: double
[
0.063,
0.071,
0.073
]
-- child 1 type: timestamp[s]
[
2018-05-18 23:30:00,
2018-05-18 23:00:00,
2018-05-18 22:30:00
]
-- child 2 type: timestamp[s]
[
2018-05-19 00:00:00,
2018-05-18 23:30:00,
2018-05-18 23:00:00
]
]

flat is a ChunkedArray whose underlying arrays are StructArrays. To convert it to a table, you need to convert each chunk to a RecordBatch and concatenate the batches into a table:
pa.Table.from_batches(
    [
        pa.RecordBatch.from_struct_array(s)
        for s in flat.iterchunks()
    ]
)
If flat is just a StructArray (not a ChunkedArray), you can call:
pa.Table.from_batches(
    [pa.RecordBatch.from_struct_array(flat)]
)
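Putting the pieces together, here is a minimal end-to-end sketch (the variable names are mine, and it assumes a pyarrow version that provides pyarrow.json and RecordBatch.from_struct_array):
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.json

# Parse the JSON document straight into an Arrow table
reader = pa.BufferReader(consumption_json.encode('ascii'))
table_from_reader = pa.json.read_json(reader)

# Flatten the list<struct<...>> column; the result is a ChunkedArray of StructArrays
flat = pc.list_flatten(table_from_reader['results'])

# Build one RecordBatch per chunk and concatenate them into a table
results_table = pa.Table.from_batches(
    [pa.RecordBatch.from_struct_array(chunk) for chunk in flat.iterchunks()]
)
print(results_table.column_names)  # ['consumption', 'interval_start', 'interval_end']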

Related

Pulling specific Parent/Child JSON data with Python

I'm having a difficult time figuring out how to pull specific information from a json file.
So far I have this:
# Import json library
import json

# Open json database file
with open('jsondatabase.json', 'r') as f:
    data = json.load(f)

# Assign variables from json data and convert to usable information
identifier = str(data['ID'])
name = str(data['name'])

# Collect data from user to compare with data in json file
print("Please enter your numerical identifier and name: ")
user_id = input("Numerical identifier: ")
user_name = input("Name: ")

if user_id == identifier and user_name == name:
    print("Your inputs matched. Congrats.")
else:
    print("Your inputs did not match our data. Please try again.")
And that works great for a simple JSON file like this:
{
    "ID": "123",
    "name": "Bobby"
}
But ideally I need to create a more complex JSON file and can't find deeper information on how to pull specific information from something like this:
{
    "Parent": [
        {
            "Parent_1": [
                {
                    "Name": "Bobby",
                    "ID": "123"
                }
            ],
            "Parent_2": [
                {
                    "Name": "Linda",
                    "ID": "321"
                }
            ]
        }
    ]
}
Here is an example that you might be able to pick apart.
You could either:
Make a custom de-jsonify object_hook as shown below and do something with it. There is a good tutorial here.
Just gobble up the whole dictionary that you get without a custom de-jsonify, drill down into it, and make a list or set of the results (a sketch of this is shown after the example).
Example:
import json
from collections import namedtuple

data = '''
{
    "Parents": [
        {
            "Name": "Bobby",
            "ID": "123"
        },
        {
            "Name": "Linda",
            "ID": "321"
        }
    ]
}
'''
Parent = namedtuple('Parent', ['name', 'id'])

def dejsonify(json_str: dict):
    if json_str.get("Name"):
        parent = Parent(json_str.get('Name'), int(json_str.get('ID')))
        return parent
    return json_str
res = json.loads(data, object_hook=dejsonify)
print(res)
# then we can do whatever... if you need lookups by name/id,
# we could put the result into a dictionary
all_parents = {(p.name, p.id) : p for p in res['Parents']}
lookup_from_input = ('Bobby', 123)
print(f'found match: {all_parents.get(lookup_from_input)}')
Result:
{'Parents': [Parent(name='Bobby', id=123), Parent(name='Linda', id=321)]}
found match: Parent(name='Bobby', id=123)
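For the second option above (no object_hook), here is a minimal sketch that drills into the more complex Parent structure from the question; nested_json is a hypothetical variable assumed to hold that document as a string:
import json

# nested_json is assumed to contain the Parent/Parent_1/Parent_2 JSON from the question
nested = json.loads(nested_json)

# "Parent" maps to a list of dicts whose values are lists of records
people = {}
for block in nested['Parent']:
    for records in block.values():
        for record in records:
            people[record['Name']] = record['ID']

print(people)  # {'Bobby': '123', 'Linda': '321'}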

How to convert dataframe output to json format and then Normalize the data?

I am running a SQL query and reading the output as a pandas df. Now I need to convert the data into JSON and normalize it. I tried to_json but this gives only a partial solution.
Dataframe output:
| SalesPerson | ContactID |
|-------------|-----------|
| 12345       | Tom       |
| 12345       | Robin     |
| 12345       | Julie     |
Expected JSON:
{"SalesPerson": "12345", "ContactID":"Tom","Robin","Julie"}
Please see the code below, which I tried:
q = "SELECT COL1, SalesPerson, ContactID FROM table;"
df = pd.read_sql(q, sqlconn)
df1 = df.iloc[:, 1:2]
df2 = df1.to_json(orient='records')
to_json also produces brackets in the result, which I don't need either.
Try this:
df.groupby('SalesPerson').apply(lambda x: pd.Series({
    'ContactID': x['ContactID'].values
})).reset_index().to_json(orient='records')
Output (pretty printed):
[
    {
        "SalesPerson": 1,
        "ContactID": ["Tom", "Robin", "Julie"]
    },
    {
        "SalesPerson": 2,
        "ContactID": ["Jack", "Mike", "Mary"]
    }
]
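A shorter equivalent (a sketch, assuming a reasonably recent pandas) collects each group's contacts into a list directly with agg:
import pandas as pd

# Hypothetical input mirroring the question's dataframe
df = pd.DataFrame({'SalesPerson': [12345, 12345, 12345],
                   'ContactID': ['Tom', 'Robin', 'Julie']})

# One row per SalesPerson, with the contacts gathered into a list
out = df.groupby('SalesPerson', as_index=False).agg({'ContactID': list})
print(out.to_json(orient='records'))
# [{"SalesPerson":12345,"ContactID":["Tom","Robin","Julie"]}]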

Python: JSON to Dictionary

Two examples for a JSON request. Both examples should have the correct JSON syntax, yet only the second version seems to be translatable to a dictionary.
# doesn't work
string_js3 = """{"employees": [
    {
        "FNAME": "FTestA",
        "LNAME": "LTestA",
        "SSN": 6668844441
    },
    {
        "FNAME": "FTestB",
        "LNAME": "LTestB",
        "SSN": 6668844442
    }
]}
"""
# works
string_js4 = """[
    {
        "FNAME": "FTestA",
        "LNAME": "LTestA",
        "SSN": 6668844441
    },
    {
        "FNAME": "FTestB",
        "LNAME": "LTestB",
        "SSN": 6668844442
    }
]"""
This gives an error, while the same code with string_js4 works:
L1 = json.loads(string_js3)
print(L1[0]['FNAME'])
So I have 2 questions:
1) Why doesn't the first version work?
2) Is there a simple way to make the first version also work?
Both of these strings are valid JSON. Where you are getting stuck is in how you are accessing the resulting data structures.
L1 (from string_js3) is a (nested) dict;
L2 (from string_js4) is a list of dicts.
Walkthrough:
import json
string_js3 = """{
"employees": [{
"FNAME": "FTestA",
"LNAME": "LTestA",
"SSN": 6668844441
},
{
"FNAME": "FTestB",
"LNAME": "LTestB",
"SSN": 6668844442
}
]
}"""
string_js4 = """[{
"FNAME": "FTestA",
"LNAME": "LTestA",
"SSN": 6668844441
},
{
"FNAME": "FTestB",
"LNAME": "LTestB",
"SSN": 6668844442
}
]"""
L1 = json.loads(string_js3)
L2 = json.loads(string_js4)
The resulting objects:
L1
{'employees': [{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]}
L2
[{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]
type(L1), type(L2)
(dict, list)
1) Why doesn't the first version work?
Because calling L1[0] is trying to return the value from the key 0, and that key doesn't exist. From the docs, "It is an error to extract a value using a non-existent key." L1 is a dictionary with just one key:
L1.keys()
dict_keys(['employees'])
2) Is there a simple way to make the first version also work?
There are several ways, but it ultimately depends on what your larger problem looks like. I'm going to assume you want to modify the Python code rather than the JSON files/strings themselves. You could do:
L3 = L1['employees'].copy()
You now have a list of dictionaries that resembles L2:
L3
[{'FNAME': 'FTestA', 'LNAME': 'LTestA', 'SSN': 6668844441},
{'FNAME': 'FTestB', 'LNAME': 'LTestB', 'SSN': 6668844442}]

How to Change a value in a Dataframe based on a lookup from a json file

I want to practice building models and I figured that I'd do it with something that I am familiar with: League of Legends. I'm having trouble replacing an integer in a dataframe with a value in a json.
The dataset I'm using comes from Kaggle. You can grab it and run this for yourself:
https://www.kaggle.com/datasnaek/league-of-legends
I have a JSON file of the form (it's actually much bigger, but I shortened it):
{
    "type": "champion",
    "version": "7.17.2",
    "data": {
        "1": {
            "title": "the Dark Child",
            "id": 1,
            "key": "Annie",
            "name": "Annie"
        },
        "2": {
            "title": "the Berserker",
            "id": 2,
            "key": "Olaf",
            "name": "Olaf"
        }
    }
}
and a dataframe of the form:
print df
   gameDuration  t1_champ1id
0          1949            1
1          1851            2
2          1493            1
3          1758            1
4          2094            2
I want to replace the ID in t1_champ1id with the lookup value in the json.
If both of these were dataframe, then I could use the merge option.
This is what I've tried. I don't know if this is the best way to read in the json file.
import pandas
df = pandas.read_csv("lol_file.csv", header=0)
champ = pandas.read_json("champion_info.json", typ='series')
for i in champ.data[0]:
    for j in df:
        if df.loc[j, 't1_champ1id'] == i:
            df.loc[j, 't1_champ1id'] = champ[0][i]['name']
I get the below error:
'the label [gameDuration] is not in the [index]'
I'm not sure that this is the most efficient way to do this, but I'm not sure how to do it at all either.
What do y'all think?
Thanks!
for j in df: iterates over the column names in df, which is unnecessary, since you're only looking to match against the column 't1_champ1id'. A better use of pandas functionality is to condense the id:name pairs from your JSON file (loaded into a dict called json_file here) into a dictionary, and then map it to df['t1_champ1id'].
player_names = {v['id']: v['name'] for v in json_file['data'].values()}  # use .itervalues() on Python 2
df.loc[:, 't1_champ1id'] = df['t1_champ1id'].map(player_names)
#    gameDuration t1_champ1id
# 0          1949       Annie
# 1          1851        Olaf
# 2          1493       Annie
# 3          1758       Annie
# 4          2094        Olaf
Another approach: create a dataframe from the 'data' in the JSON file (transpose the result and set the index to the field you want to map on, the id), then map that onto the original df.
import json
import pandas as pd

with open('champion_info.json') as data_file:
    champ_json = json.load(data_file)

champs = pd.DataFrame(champ_json['data']).T
champs.set_index('id', inplace=True)
df['champ_name'] = df.t1_champ1id.map(champs['name'])
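As an aside, if the lookup dictionary covers every id in the column (as player_names above does for this data), Series.replace is an equivalent one-liner; unlike map, it leaves unmatched values as-is instead of turning them into NaN:
# Sketch: same lookup via replace; ids missing from player_names stay unchanged
df['t1_champ1id'] = df['t1_champ1id'].replace(player_names)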

In Apache Spark how could I merge multiple SQL columns from an exploded JSON Array?

I'm reading multiple JSON files from a directory; each JSON document has multiple items 'cars' in an array. I'm trying to explode and merge the discrete values of each car item into one dataframe.
A JSON file looks like:
{
    "cars": {
        "items": [
            {
                "latitude": 42.0001,
                "longitude": 19.0001,
                "name": "Alex"
            },
            {
                "latitude": 42.0002,
                "longitude": 19.0002,
                "name": "Berta"
            },
            {
                "latitude": 42.0003,
                "longitude": 19.0003,
                "name": "Chris"
            },
            {
                "latitude": 42.0004,
                "longitude": 19.0004,
                "name": "Diana"
            }
        ]
    }
}
My approaches to explode and merge the values to just one dataframe are:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
/* Approach 1 */
// User-defined function to 'zip' two columns
val zip = udf((xs: Seq[Double], ys: Seq[Double]) => xs.zip(ys))
jsonDF.withColumn("vars", explode(zip($"cars.items.latitude", $"cars.items.longitude"))).select($"cars.items.name", $"vars._1".alias("varA"), $"vars._2".alias("varB"))
/* Apporach 2 */
val df = jsonData.select($"cars.items.name", $"cars.items.latitude", $"cars.items.longitude").toDF("name", "latitude", "longitude")
val df1 = df.select(explode(df("name")).alias("name"), df("latitude"), df("longitude"))
val df2 = df1.select(df1("name").alias("name"), explode(df1("latitude")).alias("latitude"), df1("longitude"))
val df3 = df2.select(df2("name"), df2("latitude"), explode(df2("longitude")).alias("longitude"))
As you can see, the result of Approach 1 is a dataframe in which only the two zipped parameters are merged; the 'name' column still holds the whole array:
+--------------------+---------+---------+
| name| varA| varB|
+--------------------+---------+---------+
|[Leo, Britta, Gor...|48.161079|11.556778|
|[Leo, Britta, Gor...|48.124666|11.617682|
|[Leo, Britta, Gor...|48.352043|11.788091|
|[Leo, Britta, Gor...| 48.25184|11.636337|
The result for Approach 2 is as follows:
+----+---------+---------+
|name| latitude|longitude|
+----+---------+---------+
| Leo|48.161079|11.556778|
| Leo|48.161079|11.617682|
| Leo|48.161079|11.788091|
| Leo|48.161079|11.636337|
| Leo|48.161079|11.560595|
| Leo|48.161079|11.788632|
(The result is a Cartesian product: each 'name' is paired with each 'latitude' and each 'longitude'.)
The result should be as follows:
+------+---------+---------+
|  name|     varA|     varB|
+------+---------+---------+
|   Leo|48.161079|11.556778|
|Britta|48.124666|11.617682|
| Gorch|48.352043|11.788091|
Do you know how to read the files and split and merge the values so that each row is just one object?
Thank you very much for your help!
To get the expected result, you can try the following approach:
// Read JSON files
val jsonData = sqlContext.read.json(s"/mnt/$MountName/.")
// To sqlContext to DataFrame
val jsonDF = jsonData.toDF()
// Approach
val df1 = jsonDF.select(explode(jsonDF("cars.items")).alias("items"))
val df2 = df1.select("items.name", "items.latitude", "items.longitude")
The above approach will give you the following result:
+-----+--------+---------+
| name|latitude|longitude|
+-----+--------+---------+
| Alex| 42.0001| 19.0001|
|Berta| 42.0002| 19.0002|
|Chris| 42.0003| 19.0003|
|Diana| 42.0004| 19.0004|
+-----+--------+---------+
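For reference, here is a PySpark sketch of the same approach (the session setup and the directory path are assumptions, not from the original post):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Read all JSON files in the directory (placeholder path)
json_df = spark.read.json("/mnt/MountName/")

# Explode the items array, then flatten the struct into columns
df1 = json_df.select(explode(col("cars.items")).alias("items"))
df2 = df1.select("items.name", "items.latitude", "items.longitude")
df2.show()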