Python Pandas Nested JSON to Dataframe

I am attempting to convert nested JSON data to a flat table:
(I have edited this: I thought I had a working solution and was asking for advice on optimisation; it turns out I don't have it working...)
import pandas as pd
import json
from collections import OrderedDict
# https://stackoverflow.com/questions/36720940/parsing-nested-json-into-dataframe
def flatten_json(json_object, container=None, name=''):
    if container is None:
        container = OrderedDict()
    if isinstance(json_object, dict):
        for key in json_object:
            flatten_json(json_object[key], container=container, name=name + key + '_')
    elif isinstance(json_object, list):
        for n, item in enumerate(json_object, 1):
            flatten_json(item, container=container, name=name + str(n) + '_')
    else:
        container[str(name[:-1])] = str(json_object)
    return container
data = '{"page":1,"pages":2,"totaItems":22,"data":[{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":false},{"id":4,"value":false},{"id":5,"value":"2056-04-05T00:00:00Z"}]},{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":false},{"id":4,"value":false}]}]}'
json_data = json.loads(data)
dataframes = list()
for record in json_data['data']:
    out = pd.DataFrame(flatten_json(record), index=[0])
    dataframes.append(out)
frame = pd.concat(dataframes)
print(frame)
However, I can't help but feel this might be overly complicated for what I am trying to achieve. This script is the result of a few hours' research and it's the best I can come up with. Does anyone have any pointers/advice to refine this?
I'm essentially completely flattening the JSON data (under the data record) into a dataframe to later be exported to CSV.
Ideal output:
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
| bId | cId | cName | customFields_3 | customFields_4 | customFields_5 | eId | fname | minutesExcludingViolations | sId | snme | totalMinutes | vType |
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
| 29802 | 21 | Regional | FALSE | FALSE | 2056-04-05T00:00:00Z | 38344 | Adon | 590 | 15 | CD | 590 | None |
| 29802 | 21 | Regional | FALSE | FALSE | null | 38344 | Adon | 590 | 15 | CD | 590 | None |
+-------+-----+----------+----------------+----------------+----------------------+-------+-------+----------------------------+-----+------+--------------+-------+
EDIT: It turns out this solution doesn't work after all. I've added my idealised output and shortened the input data slightly to make it easier to work with for now.
EDIT2: Possible solution... Gives the right output.
main_frame = pd.DataFrame(json_data['data'])
del main_frame['customFields']
frames = list()
for record in json_data['data']:
    out = pd.DataFrame.from_records(record['customFields']).T
    out = out.reset_index(drop=True)
    out.columns = out.iloc[0]
    out = out.reindex(out.index.drop(0))
    frames.append(out)
custom_fields_frame = pd.concat(frames).reset_index(drop=True)
main_frame = main_frame.join(custom_fields_frame)
print(main_frame)
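A shorter sketch that builds the same idealised output directly, by folding each record's customFields into suffixed columns, could look like this (it assumes every customFields entry carries an id and a value, and lets pandas fill missing ids with NaN):
flat_records = []
for record in json_data['data']:
    # keep the scalar fields, then add one customFields_<id> column per custom field
    flat = {key: value for key, value in record.items() if key != 'customFields'}
    for field in record.get('customFields', []):
        flat['customFields_{}'.format(field['id'])] = field['value']
    flat_records.append(flat)
simple_frame = pd.DataFrame(flat_records)
print(simple_frame)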
Thanks,

This solution should do the job efficiently, converting the nested JSON to a dataframe:
nested_json=[{"page":1,"pages":2,"totaItems":22,"data":[{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":"false"},{"id":4,"value":"false"},{"id":5,"value":"true"},{"id":6,"value":"false"},{"id":7,"value":"false"},{"id":14,"value":"2056-04-05T00:00:00Z"},{"id":15,"value":"Tester"}]},{"eId":38344,"bId":29802,"fname":"Adon","cId":21,"cName":"Regional","vType":"None","totalMinutes":590,"minutesExcludingViolations":590,"sId":15,"snme":"CD","customFields":[{"id":3,"value":"false"},{"id":5,"value":"true"},{"id":6,"value":"false"},{"id":7,"value":"false"},{"id":14,"value":"2056-04-05T00:00:00Z"},{"id":15,"value":"Tester"},{"id":16,"value":"false"},{"id":17,"value":"false"}]}]}]
from pandas.io.json import json_normalize
json_df = json_normalize(nested_json)
json_columns = list(json_df.columns.values)
# just picks the column_name instead of something.something.column_name
for w in range(len(json_columns)):
    json_columns[w] = json_columns[w].split('.')[-1].lower()
json_df.columns = json_columns
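As a usage note: in pandas 1.0 and later json_normalize is exposed at the top level and the pandas.io.json import is deprecated, so the same normalisation can be written as:
import pandas as pd
# equivalent call on pandas >= 1.0; the column-renaming logic above is unchanged
json_df = pd.json_normalize(nested_json)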

Related

to_html() formatters, float and string together

I'm trying to use to_html(formatters={column: function}) in order to make visual changes.
df = pd.DataFrame({'name':['Here',4.45454,5]}, index=['A', 'B', 'C'])
def _form(val):
    value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
    return value
df.to_html(formatters={'name':_form})
I get
| | name |
| -------- | -------------- |
| A | STRING |
| B | 4.45454 |
| C | 5 |
instead of
| | name |
| -------- | -------------- |
| A | STRING |
| B | 4 |
| C | 5 |
The problem here is that the float value doesn't change.
On the other hand, when all the values are floats or integers, it gives the desired result:
df = pd.DataFrame({'name':[323.322,4.45454,5]}, index=['A', 'B', 'C'])
def _form(val):
    value = 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)
    return value
df.to_html(formatters={'name':_form})
How can it be fixed?
Thank you.
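One workaround sketch (not necessarily the only fix): apply the formatting to the mixed-type column yourself before rendering, so to_html only has to emit strings that are already formatted. The _form helper is the same one from the question.
import pandas as pd

df = pd.DataFrame({'name': ['Here', 4.45454, 5]}, index=['A', 'B', 'C'])

def _form(val):
    return 'STRING' if isinstance(val, str) else '{:0.0f}'.format(val)

# pre-format the column with map(), then render without formatters
html = df.assign(name=df['name'].map(_form)).to_html()
print(html)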

Django Database transaction rollback in Loop

A user may import an Excel file and I want to check whether the data is correct.
# Excel Data
| id | item | qty |
|:---- |:------:| -----:|
| 1 | item A | 10 |
| 2 | item B | 20 |
| 3 | item C | 30 |
| 4 | item D | 40 | <-- For example, not enough qty in the database to subtract (only 1 available)
| 5 | item E | 50 |
# Database
| id | item | qty |
|:---- |:------:| -----:|
| 1 | item A | 100 |
| 2 | item B | 200 |
| 3 | item C | 300 |
| 4 | item D | 1 | <-- For example, not enough qty to subtract (the Excel row needs 40)
| 5 | item E | 500 |
I need to check whether the database has enough qty of each item to subtract. If yes, save the changes; if not, roll back all data changed by this Excel import (back to the state before the import) and return error details to the user.
def myFunction(self, request):
    try:
        error_details = []
        with transaction.atomic():
            for data in excel_data:
                # this call raises: "You can't execute queries until the end of the 'atomic' block"
                result = checkIfVerify(data)
                if result is True:
                    serializer = modelSerializer(data)
                    serializer.save()
                else:
                    error_details.append("some explanation...")
            if len(error_details) > 0:
                transaction.set_rollback(True)
                raise CustomError
    except CustomError:
        pass
    return Response(....)

# checkIfVerify(data)
def checkIfVerify(data):
    # this SQL needs to join many tables, which is hard to express with the ORM
    sql = ....
    results = []
    with connection.cursor() as cursor:
        cursor.execute(sql)
        results = cursor.fetchall()
        cursor.close()
        connection.close()
    if results .....:
        return True
    else:
        return False
But the problem seems to be that I cannot execute raw SQL inside the transaction.atomic() block. If I put transaction.atomic() inside the loop, after the checking function, it is not able to roll back all the data.
How should I do this?
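One likely culprit (hedged, based on how Django marks a transaction as broken): calling connection.close() inside the atomic block flags the transaction as needing rollback, so the very next query raises exactly that TransactionManagementError. Below is a sketch of a restructure that keeps the raw SQL check but leaves the shared connection open; excel_data, modelSerializer, CustomError and the elided SQL are the question's own placeholders.
from django.db import connection, transaction

def checkIfVerify(data):
    # raw SQL is fine inside atomic(); just do not close the shared connection here
    sql = "..."  # the original multi-table join, elided as in the question
    with connection.cursor() as cursor:
        cursor.execute(sql)
        results = cursor.fetchall()
    return bool(results)  # stand-in for the original "if results ....." condition

def myFunction(self, request):
    error_details = []
    try:
        with transaction.atomic():
            for data in excel_data:
                if checkIfVerify(data):
                    serializer = modelSerializer(data)
                    serializer.save()
                else:
                    error_details.append("some explanation...")
            if error_details:
                # raising inside atomic() rolls back every save() above
                raise CustomError
    except CustomError:
        pass
    return Response(...)  # elided as in the question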

How to import CSV file into Octave and keep the column headers

I am trying to import a CSV file so that I can use it with the k-means clustering algorithm. The file contains 6 columns and over 400 rows. Here is a picture of the Excel document I used (before exporting it into a CSV file). In essence, I want to be able to use the column header names in my code when plotting and clustering the data.
I looked into some other documentation and came up with this code, but nothing came out as output when I put it into the command window:
[Player BA OPS RBI OBP] = CSVIMPORT( 'MLBdata.csv', 'columns', {'Player', 'BA', 'OPS', 'RBI', 'OBP'} )
The only thing that has worked for me so far is the dlmread function, but it returns 0 when there is a string of words:
N = dlmread('MLBdata.csv')
Given file data.csv with the following contents:
Player,Year,BA,OPS,RBI,OBP
SandyAlcantara,2019,0.086,0.22,4,0.117
PeteAlonso,2019,0.26,0.941,120,0.358
BrandonLowe,2019,0.27,0.85,51,0.336
MikeSoroka,2019,0.077,0.22,3,0.143
Open an octave terminal and type:
pkg load io
C = csv2cell( 'data.csv' )
resulting in the following cell array:
C =
{
[1,1] = Player
[2,1] = SandyAlcantara
[3,1] = PeteAlonso
[4,1] = BrandonLowe
[5,1] = MikeSoroka
[1,2] = Year
[2,2] = 2019
[3,2] = 2019
[4,2] = 2019
[5,2] = 2019
[1,3] = BA
[2,3] = 0.086000
[3,3] = 0.2600
[4,3] = 0.2700
[5,3] = 0.077000
[1,4] = OPS
[2,4] = 0.2200
[3,4] = 0.9410
[4,4] = 0.8500
[5,4] = 0.2200
[1,5] = RBI
[2,5] = 4
[3,5] = 120
[4,5] = 51
[5,5] = 3
[1,6] = OBP
[2,6] = 0.1170
[3,6] = 0.3580
[4,6] = 0.3360
[5,6] = 0.1430
}
From there on, you can collect that data into arrays or structs as you like and continue working (see the sketch after the table below). One nice option is Andrew Janke's 'tablicious' package:
octave:13> pkg load tablicious
octave:14> T = cell2table( C(2:end,:), 'VariableNames', C(1,:) );
octave:15> prettyprint(T)
-------------------------------------------------------
| Player | Year | BA | OPS | RBI | OBP |
-------------------------------------------------------
| SandyAlcantara | 2019 | 0.086 | 0.22 | 4 | 0.117 |
| PeteAlonso | 2019 | 0.26 | 0.941 | 120 | 0.358 |
| BrandonLowe | 2019 | 0.27 | 0.85 | 51 | 0.336 |
| MikeSoroka | 2019 | 0.077 | 0.22 | 3 | 0.143 |
-------------------------------------------------------
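If you prefer plain arrays instead of a table, here is a small sketch of pulling columns out of the cell array returned by csv2cell (column positions taken from the header row in the example above):
headers = C(1,:);               % {'Player','Year','BA','OPS','RBI','OBP'}
players = C(2:end, 1);          % cell array of player names
BA  = cell2mat(C(2:end, 3));    % numeric column vectors
OPS = cell2mat(C(2:end, 4));
RBI = cell2mat(C(2:end, 5));
OBP = cell2mat(C(2:end, 6));
% e.g. plot batting average against on-base percentage, labelled via the headers
scatter(BA, OBP);
xlabel(headers{3}); ylabel(headers{6});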

MySQL many-many JSON aggregation merging duplicate keys

I'm having trouble returning a JSON representation of a many-many join. My plan was to encode the columns returned using the following JSON format
{
    "dog": [
        "duke"
    ],
    "location": [
        "home",
        "scotland"
    ]
}
This format would handle duplicate keys by aggregating the results into a JSON array; however, all of my attempts at aggregating this structure so far have just removed duplicates, so the arrays only ever have a single element.
Tables
Here is a simplified table structure I've made for the purposes of explaining this query.
media
| media_id | sha256 | filepath |
| 1 | 33327AD02AD09523C66668C7674748701104CE7A9976BC3ED8BA836C74443DBC | /photos/cat.jpeg |
| 2 | 323b5e69e72ba980cd4accbdbb59c5061f28acc7c0963fee893c9a40db929070 | /photos/dog.jpeg |
| 3 | B986620404660DCA7B3DEC4EFB2DE80C0548AB0DE243B6D59DA445DE2841E474 | /photos/dog2.jpeg |
| 4 | 1be439dd87cd87087a425c760d6d8edc484f126b5447beb2203d21e09e2a8f11 | /photos/balloon.jpeg |
media_metadata_labels_has_media (for many-many joins)
| media_metadata_labels_label_id | media_media_id |
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 1 | 2 |
| 4 | 2 |
| 5 | 2 |
| 1 | 3 |
| 6 | 3 |
| 7 | 3 |
| 8 | 4 |
| 9 | 4 |
media_metadata_labels
| label_id | label_key | label_value |
| 2 | cat | lily |
| 4 | dog | duke |
| 6 | dog | rex |
| 1 | pet size | small |
| 3 | location | home |
| 7 | location | park |
| 8 | location | scotland |
| 9 | location | sky |
| 5 | location | studio |
My current attempt
My latest attempt at querying this data uses JSON_MERGE_PRESERVE with two arguments, the first is just an empty JSON object and the second is an invalid JSON document. It's invalid because there are duplicate keys, but I was hoping that JSON_MERGE_PRESERVE would merge them. It turns out JSON_MERGE_PRESERVE will only merge duplicates if they're not in the same JSON argument.
For example, this won't merge two keys
SET @key_one = '{}';
SET @key_two = '{"location": ["home"], "location": ["scotland"]}';
SELECT JSON_MERGE_PRESERVE(@key_one, @key_two);
-- returns {"location": ["scotland"]}
but this will
SET @key_one = '{"location": ["home"] }';
SET @key_two = '{"location": ["scotland"]}';
SELECT JSON_MERGE_PRESERVE(@key_one, @key_two);
-- returns {"location": ["home", "scotland"]}
So anyway, here's my current attempt
SELECT
m.media_id,
m.filepath,
JSON_MERGE_PRESERVE(
'{}',
CAST(
CONCAT(
'{',
GROUP_CONCAT(CONCAT('"', l.label_key, '":["', l.label_value, '"]')),
'}'
)
AS JSON)
)
as labels
FROM media AS m
LEFT JOIN media_metadata_labels_has_media AS lm ON lm.media_media_id = m.media_id
LEFT JOIN media_metadata_labels AS l ON l.label_id = lm.media_metadata_labels_label_id
GROUP BY m.media_id, m.filepath
-- HAVING JSON_CONTAINS(labels, '"location"', CONCAT('$.', '"home"')); -- this would let me filter on labels once they're in the correct JSON format
After trying different combinations of JSON_MERGE, JSON_OBJECTAGG, JSON_ARRAYAGG, CONCAT and GROUP_CONCAT this still leaves me scratching my head.
Disclaimer: since posting this question I've started using MariaDB instead of Oracle MySQL. The function below should work for MySQL too, but if it doesn't, any changes required will likely be small syntax fixes.
I solved this by creating a custom aggregation function
DELIMITER //
CREATE AGGREGATE FUNCTION JSON_LABELAGG (
    json_key TEXT,
    json_value TEXT
) RETURNS JSON
BEGIN
    DECLARE complete_json JSON DEFAULT '{}';
    DECLARE current_jsonpath TEXT;
    DECLARE current_jsonpath_value_type TEXT;
    DECLARE current_jsonpath_value JSON;
    DECLARE CONTINUE HANDLER FOR NOT FOUND RETURN complete_json;

    main_loop: LOOP
        FETCH GROUP NEXT ROW;
        SET current_jsonpath = CONCAT('$.', json_key); -- the jsonpath to our json_key
        SET current_jsonpath_value_type = JSON_TYPE(JSON_EXTRACT(complete_json, current_jsonpath)); -- the json object type at the current path
        SET current_jsonpath_value = JSON_QUERY(complete_json, current_jsonpath); -- the json value at the current path

        -- if this is the first label value with this key then place it in a new array
        IF (ISNULL(current_jsonpath_value_type)) THEN
            SET complete_json = JSON_INSERT(complete_json, current_jsonpath, JSON_ARRAY(json_value));
            ITERATE main_loop;
        END IF;

        -- confirm that an array is at this jsonpath, otherwise that's an exception
        CASE current_jsonpath_value_type
            WHEN 'ARRAY' THEN
                -- check if our json_value is already within the array and don't push a duplicate if it is
                IF (ISNULL(JSON_SEARCH(JSON_EXTRACT(complete_json, current_jsonpath), "one", json_value))) THEN
                    SET complete_json = JSON_ARRAY_APPEND(complete_json, current_jsonpath, json_value);
                END IF;
                ITERATE main_loop;
            ELSE
                SIGNAL SQLSTATE '45000'
                    SET MESSAGE_TEXT = 'Expected JSON label object to be an array';
        END CASE;
    END LOOP;
    RETURN complete_json;
END //
DELIMITER ;
and editing my query to use it
SELECT
m.media_id,
m.filepath,
JSON_LABELAGG(l.label_key, l.label_value) as labels
FROM media AS m
LEFT JOIN media_metadata_labels_has_media AS lm ON lm.media_media_id = m.media_id
LEFT JOIN media_metadata_labels AS l ON l.label_id = lm.media_metadata_labels_label_id
GROUP BY m.media_id, m.filepath
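With the labels aggregated into that shape, the filter hinted at in the commented-out HAVING clause in the question becomes a one-liner; a sketch, assuming the standard JSON_CONTAINS(document, candidate, path) signature available in both MariaDB and MySQL:
-- keep only media whose aggregated labels contain location = 'home'
SELECT
    m.media_id,
    m.filepath,
    JSON_LABELAGG(l.label_key, l.label_value) AS labels
FROM media AS m
LEFT JOIN media_metadata_labels_has_media AS lm ON lm.media_media_id = m.media_id
LEFT JOIN media_metadata_labels AS l ON l.label_id = lm.media_metadata_labels_label_id
GROUP BY m.media_id, m.filepath
HAVING JSON_CONTAINS(labels, '"home"', '$.location');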

Pyspark - getting values from an array that has a range of min and max values

I'm trying to write a query in PySpark that will get the correct value from an array.
For example, I have dataframe called df with three columns, 'companyId', 'companySize' and 'weightingRange'. The 'companySize' column is just the number of employees. The column 'weightingRange' is an array with the following in it
[ {"minimum":0, "maximum":100, "weight":123},
{"minimum":101, "maximum":200, "weight":456},
{"minimum":201, "maximum":500, "weight":789}
]
so the dataframe looks like this (weightingRange is as above; it's truncated in the example below for clearer formatting)
+-----------+-------------+------------------------+--+
| companyId | companySize | weightingRange | |
+-----------+-------------+------------------------+--+
| ABC1 | 150 | [{"maximum":100, etc}] | |
| ABC2 | 50 | [{"maximum":100, etc}] | |
+-----------+-------------+------------------------+--+
So for an entry with company size = 150, I need to return the weight 456 into a column called 'companyWeighting'.
So it should show the following
+-----------+-------------+------------------------+------------------+
| companyId | companySize | weightingRange | companyWeighting |
+-----------+-------------+------------------------+------------------+
| ABC1 | 150 | [{"maximum":100, etc}] | 456 |
| ABC2 | 50 | [{"maximum":100, etc}] | 123 |
+-----------+-------------+------------------------+------------------+
I've had a look at
df.withColumn("tmp",explode(col("weightingRange"))).select("tmp.*")
and then joining, but applying that would produce a Cartesian product of the data.
Suggestions appreciated!
You can approach it like this.
First, create a sample dataframe:
import pyspark.sql.functions as F
df = spark.createDataFrame([
    ('ABC1', 150, [{"min": 0, "max": 100, "weight": 123},
                   {"min": 101, "max": 200, "weight": 456},
                   {"min": 201, "max": 500, "weight": 789}]),
    ('ABC2', 50, [{"min": 0, "max": 100, "weight": 123},
                  {"min": 101, "max": 200, "weight": 456},
                  {"min": 201, "max": 500, "weight": 789}])],
    ['companyId', 'companySize', 'weightingRange'])
Then, create a UDF and apply it to each row to get the new column:
def get_weight(wt, wt_rnge):
    for _d in wt_rnge:
        if _d['min'] <= wt <= _d['max']:
            return _d['weight']
get_weight_udf = F.udf(lambda x,y: get_weight(x,y))
df = df.withColumn('companyWeighting', get_weight_udf(F.col('companySize'), F.col('weightingRange')))
df.show()
You get the output as,
+---------+-----------+--------------------+----------------+
|companyId|companySize| weightingRange|companyWeighting|
+---------+-----------+--------------------+----------------+
| ABC1| 150|[Map(weight -> 12...| 456|
| ABC2| 50|[Map(weight -> 12...| 123|
+---------+-----------+--------------------+----------------+
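As an alternative sketch that avoids a Python UDF (assuming Spark 2.4+, where the filter higher-order function is available; this is written against the map-typed sample dataframe above, so it uses bracket access rather than r.min style struct access):
import pyspark.sql.functions as F

# keep only the range whose min/max bracket companySize, then take its weight
df = df.withColumn(
    'companyWeighting',
    F.expr("filter(weightingRange, r -> companySize >= r['min'] AND companySize <= r['max'])[0]['weight']")
)
df.show()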