Dataframe from nested JSON in Python

Looking for help with pd.json_normalize for this JSON. I can't figure out how to handle the "itemsToMake.itemReference" and "salesOrderLineItemReference" fields.
Here is what I have so far.
df1 = pd.json_normalize(data['data']['jobs']['items'],
                        ['itemsToMake', 'operations'],
                        ['createdUtc', 'number',
                         ['itemsToMake', 'quantityToMake'],
                         ['itemsToMake', 'itemReference'],
                         'salesOrderLineItemReference'])
Here is the JSON
{
"data": {
"jobs": {
"items": [
{
"createdUtc": "2021-07-01T00:03:34.520Z",
"number": 11229,
"itemsToMake": [
{
"operations": [
{
"estimatedSetupTimeInSeconds": 600,
"estimatedDurationTimeInSeconds": 0,
"operation": {
"name": "Pull Material"
}
},
{
"estimatedSetupTimeInSeconds": 900,
"estimatedDurationTimeInSeconds": 720,
"operation": {
"name": "Cut Material any Shear (2 person operation)"
}
},
{
"estimatedSetupTimeInSeconds": 900,
"estimatedDurationTimeInSeconds": 810,
"operation": {
"name": "Folding"
}
}
],
"quantityToMake": 18,
"itemReference": {
"name": "ANGLE-TRIM|2op|8\"Max",
"id": "5f496b3bcb66432ca0c471f6",
"description": "Angle trim with 2 operations - up to 8\" Max SO | 10' sections (6 Sections per Sheet)\n032, 040, 050, 063, 26ga, 24ga, 22ga, 20ga"
}
}
],
"salesOrderLineItemReference": {
"sONumber": 5308
}
}
]
}
}
}
Here is what I get so far.
estimatedSetupTimeInSeconds estimatedDurationTimeInSeconds operation.name createdUtc number itemsToMake.quantityToMake itemsToMake.itemReference salesOrderLineItemReference
0 600 0 Pull Material 2021-07-01T00:03:34.520Z 11229 18 {'name': 'ANGLE-TRIM|2op|8"Max', 'id': '5f496b... {'sONumber': 5308}
1 900 720 Cut Material any Shear (2 person operation) 2021-07-01T00:03:34.520Z 11229 18 {'name': 'ANGLE-TRIM|2op|8"Max', 'id': '5f496b... {'sONumber': 5308}
2 900 810 Folding 2021-07-01T00:03:34.520Z 11229 18 {'name': 'ANGLE-TRIM|2op|8"Max', 'id': '5f496b... {'sONumber': 5308}

json_normalize has its limitations; it doesn't seem possible to flatten everything in a single call here. You can get the rest of the way by adding one line after your json_normalize call:
# your code
df = pd.json_normalize(data=data['data']['jobs']['items'],
                       record_path=['itemsToMake', 'operations'],
                       meta=['createdUtc',
                             'number',
                             ['itemsToMake', 'quantityToMake'],
                             ['itemsToMake', 'itemReference'],
                             'salesOrderLineItemReference'])
# expand the two dict-valued columns into one column per key
res = df.join([df.pop(x).apply(pd.Series)
               for x in ['itemsToMake.itemReference', 'salesOrderLineItemReference']])
Output:
estimatedSetupTimeInSeconds estimatedDurationTimeInSeconds operation.name createdUtc number itemsToMake.quantityToMake name id description sONumber
0 600 0 Pull Material 2021-07-01T00:03:34.520Z 11229 18 ANGLE-TRIM|2op|8"Max 5f496b3bcb66432ca0c471f6 Angle trim with 2 operations - up to 8" Max SO | 10' sections (6 Sections per Sheet)\n032, 040, 050, 063, 26ga, 24ga, 22ga, 20ga 5308
1 900 720 Cut Material any Shear (2 person operation) 2021-07-01T00:03:34.520Z 11229 18 ANGLE-TRIM|2op|8"Max 5f496b3bcb66432ca0c471f6 Angle trim with 2 operations - up to 8" Max SO | 10' sections (6 Sections per Sheet)\n032, 040, 050, 063, 26ga, 24ga, 22ga, 20ga 5308
2 900 810 Folding 2021-07-01T00:03:34.520Z 11229 18 ANGLE-TRIM|2op|8"Max 5f496b3bcb66432ca0c471f6 Angle trim with 2 operations - up to 8" Max SO | 10' sections (6 Sections per Sheet)\n032, 040, 050, 063, 26ga, 24ga, 22ga, 20ga 5308
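If you prefer to stay within json_normalize for the expansion step, the same two columns can be unpacked with it instead of apply(pd.Series); a minimal sketch on the df built above (.tolist() keeps the default RangeIndex, so the join aligns by position):

import pandas as pd

# Expand each dict-valued column into one column per key via json_normalize.
res = df.join([pd.json_normalize(df.pop(col).tolist())
               for col in ['itemsToMake.itemReference', 'salesOrderLineItemReference']])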

Related

jq sort by version as string

I'm trying to sort the following JSON response to pick the latest version:
[
{
"TagVersion": "1.0.11"
},
{
"TagVersion": "1.1.8"
},
{
"TagVersion": "1.0.10",
},
{
"TagVersion": "1.0.9",
},
{
"TagVersion": "1.0.77"
}
]
Correct sorting should be:
[
{
"TagVersion": "1.0.9"
},
{
"TagVersion": "1.0.10"
},
{
"TagVersion": "1.0.11"
},
{
"TagVersion": "1.0.77"
},
{
"TagVersion": "1.1.8"
}
]
I'm currently able to do part of the job. It works for simple cases (where every version part, i.e. major/minor/patch, has the same number of digits):
jq -r '[.[]] | max_by(.TagVersion | split(".") | map(tonumber))'
The best way to do it, in my mind, would be to multiply each part by a weight. Example:
# On the condition that every "part" has a maximum of 2 digits. It won't work with 3 digits
# Version 1.23.87
1 * 1000 + 23 * 10 + 87 = 1317
# Version 3.0.0
3 * 1000 + 0 * 10 + 0 = 3000
# Version 1.89.78
1 * 1000 + 89 * 10 + 78 = 1968
Does anybody have an idea to implement this? 🙂
Turn each component into a number, then sort on the array of integers; jq compares arrays element by element, which is exactly the version ordering you want.
jq 'sort_by(.TagVersion|split(".")|map(tonumber))'
Output:
[
{
"TagVersion": "1.0.9"
},
{
"TagVersion": "1.0.10"
},
{
"TagVersion": "1.0.11"
},
{
"TagVersion": "1.0.77"
},
{
"TagVersion": "1.1.8"
}
]
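For reference, the same idea as a quick Python sanity check (a sketch; tuples of ints compare element by element, playing the role of jq's array comparison):

import json

data = json.loads('''[{"TagVersion": "1.0.11"}, {"TagVersion": "1.1.8"},
{"TagVersion": "1.0.10"}, {"TagVersion": "1.0.9"}, {"TagVersion": "1.0.77"}]''')

ordered = sorted(data, key=lambda d: tuple(map(int, d["TagVersion"].split("."))))
print([d["TagVersion"] for d in ordered])
# ['1.0.9', '1.0.10', '1.0.11', '1.0.77', '1.1.8']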

Normalize/Flatten a very deeply nested JSON (in which names and properties are the same across levels)

I'm trying to flatten or normalize this very nested JSON into a dataframe with pandas.
Problem is: at every level, the names and the properties are the same.
I haven't found any pandas questions similar to this one. I did find two similar questions, but they're in R and JavaScript:
Normalize deeply nested objects
and Normalize deeply nested objects
I don't know if you can draw inspiration from those.
My original file is 40 MB, so here's a sample of it:
data = [
{
"id": "haha",
"type": "table",
"composition": [
{
"id": "AO",
"type": "basket",
},
{
"id": "KK",
"type": "basket",
# "isAutoDiv": false,
"composition": [
{
"id": "600",
"type": "apple",
"num": 1.116066714
},
{
"id": "605",
"type": "apple",
"num": 1.1166976714
}
]
}
]
},
{
"id": "hoho",
"type": "table",
"composition": [
{
"id": "KT",
"type": "basket"
},
{
"id": "OT",
"type": "basket"
},
{
"id": "CL",
"type": "basket",
# "isAutoDiv": false,
"composition": [
{
"id": "450",
"type": "apple"
},
{
"id": "630",
"type": "apple"
},
{
"id": "023",
"type": "index",
"composition": [
{
"id": "AAOAAOAOO",
"type": "applejuice"
},
{
"id": "MMNMMNNM",
"type": "applejuice"
},
]
}
]
}
]
}
]
You see? Names and properties are the same at every level.
I used this line to normalize it, but I don't know how to keep normalizing the objects nested inside nested objects when they have the same names and properties:
df = pd.json_normalize(data, record_path=['composition'], meta=['id', 'type'], record_prefix='compo_')
compo_composition compo_id compo_type id type
0 NaN AO basket haha table
1 [{'id': '600', 'type': 'apple', 'num': 1.11606... KK basket haha table
2 NaN KT basket hoho table
3 NaN OT basket hoho table
4 [{'id': '450', 'type': 'apple'}, {'id': '630',... CL basket hoho table
You can see that the "compo_composition" column still contains nested objects.
Now I want it to have these columns:
compo_compo_compo_id compo_compo_compo_type compo_compo_id compo_compo_type compo_id compo_type id type
Tons of thanks. This has frustrated me for days and I haven't found an answer anywhere.
You have to write a custom parser. This one assumes that (a) your JSON is arbitrarily deep and (b) every type along a path is unique (e.g. table > basket > index, not table > table > basket):
# Make a copy so we do not change the original data
tmp = data.copy()
compositions = []
while len(tmp) > 0:
    item = tmp.pop(0)
    if 'composition' in item:
        # If a level has children, add that level's `id` to the path and
        # queue its children. Copy the path dict so siblings and deeper
        # levels do not mutate each other's paths.
        path = dict(item.get('path', {}))
        path[item['type'] + '_id'] = item['id']
        children = [
            {'path': path, **child} for child in item['composition']
        ]
        tmp += children
    else:
        # If a level has no children, it is a finished record
        compositions.append(item)
And the final dataframe:
df = pd.DataFrame([c['path'] for c in compositions]) \
       .join(pd.DataFrame(compositions)) \
       .drop(columns='path')
Result:
table_id basket_id index_id id type num
0 haha NaN NaN AO basket NaN
1 hoho NaN NaN KT basket NaN
2 hoho NaN NaN OT basket NaN
3 haha KK NaN 600 apple 1.116067
4 haha KK NaN 605 apple 1.116698
5 hoho CL NaN 450 apple NaN
6 hoho CL NaN 630 apple NaN
7 hoho CL 023 AAOAAOAOO applejuice NaN
8 hoho CL 023 MMNMMNNM applejuice NaN
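The same idea can also be written recursively; a minimal sketch (depth-first rather than breadth-first, so the row order differs, but the ancestor-id columns come out the same):

import pandas as pd

def flatten(node, path=None):
    # Carry the id of every ancestor level down to the leaves.
    path = dict(path or {})
    if 'composition' in node:
        path[node['type'] + '_id'] = node['id']
        for child in node['composition']:
            yield from flatten(child, path)
    else:
        yield {**path, **node}  # leaf: ancestor ids plus the leaf's own fields

df = pd.DataFrame([rec for item in data for rec in flatten(item)])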

How to search nested JSON in MySQL

I am using MySQL 5.7+ with the native JSON data type. Sample data:
[
{
"code": 2,
"stores": [
{
"code": 100,
"quantity": 2
},
{
"code": 200,
"quantity": 3
}
]
},
{
"code": 4,
"stores": [
{
"code": 300,
"quantity": 4
},
{
"code": 400,
"quantity": 5
}
]
}
]
Question: how do I extract the array element where code = 4?
The following (working) query has the position of the data I want to extract and the search criterion hardcoded:
SELECT JSON_EXTRACT(data_column, '$[0]')
FROM json_data_table
WHERE data_column->'$[1].code' = 4
I tried using a wildcard (data_column->'$[*].code' = 4) but I get no results.
In MySQL 5.7 you can at least filter the rows whose array contains a matching element:
SELECT JSON_EXTRACT(data_column, '$[*]')
FROM json_data_table
WHERE JSON_CONTAINS(data_column->'$[*].code', '4');
Extracting only the matching element is the hard part: 5.7's JSON path syntax has no per-element filter, so that last step has to happen outside the query ... though this would all be much easier to work with if this wasn't an unindexed array of objects at the top level. You may want to consider some adjustments to the schema.
Note:
If the array has multiple elements, specifying "$[i]" picks that one element; it does not search across them. With your dataset, "$[1].code" always evaluates to the code of that single element.
Essentially, you were saying:
$ the JSON document
[1] the second object in the array
.code the attribute labeled "code"
...and since there is only ever one value at that path, and here it happens to be 4, your WHERE clause always evaluated to:
WHERE 4 = 4
Alternate data structure, if possible
Since the entire purpose of "code" is to act as a key, make it the key:
{
"code2":{
"stores": [
{
"code": 100,
"quantity": 2
},
{
"code": 200,
"quantity": 3
}
]
},
"code4": {
"stores": [
{
"code": 300,
"quantity": 4
},
{
"code": 400,
"quantity": 5
}
]
}
}
Then, all it would require would be:
SELECT data_column->'$.code4' AS code4
FROM json_data_table
This is what you are looking for:
SELECT data_column->'$[*]'
FROM json_data_table
WHERE data_column->'$[*].code' LIKE '%4%'
The extracted value has [] around it when you select from an array (e.g. [2, 4]), so data_column->'$[*].code' = 4 can never match. Note that the LIKE match is loose; it would also match codes such as 14 or 40.
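If you can post-process the fetched column client-side, the per-element filter that the path syntax can't express is a one-liner; a minimal sketch in Python on the sample document:

import json

doc = '''[
{"code": 2, "stores": [{"code": 100, "quantity": 2}, {"code": 200, "quantity": 3}]},
{"code": 4, "stores": [{"code": 300, "quantity": 4}, {"code": 400, "quantity": 5}]}
]'''

# Keep only the top-level array elements whose "code" matches.
matches = [obj for obj in json.loads(doc) if obj.get("code") == 4]
print(matches[0]["stores"])  # [{'code': 300, 'quantity': 4}, {'code': 400, 'quantity': 5}]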

Am I duplicating data in my API?

My gradepoints and percents objects hold the same grades under different keys. Please take a look at my JSON below and let me know if I'm doing this right. Is there a way to optimize this API?
I could send the percents together with the gradepoints after a comma, like "a1": "10,90", but then I would have to split them apart in client-side JS, which I'd rather avoid.
{
"gradepoints": [
{
"a1": 10
},
{
"a1": 10
},
{
"c2": 5
},
{
"e1": "eiop"
},
{
"d": 4
},
{
"b1": 8
}
],
"percents": [
{
"a1": 90
},
{
"a1": 90
},
{
"c2": 45
},
{
"e1": "eiop"
},
{
"d": 36
},
{
"b1": 72
}
],
"gpa": 7.4,
"overall": 70.3,
"eiop": 2
}
I would do it something like this:
{
grades: [
{ name: "a1",
gradepoint: 10,
percent: 90
},
{ name: "a1",
gradepoint: 10,
percent: 90
},
{ name: "c2",
gradepoint: 5,
percent: 45
},
...
],
gpa: 7.4,
overall: 70.3,
eiop: 2
}
Related data should be kept together in an object.
If it weren't for the duplicate a1 entries, I would probably make grades an object with the names as keys. But an object can't have duplicate keys, so the names have to go into the values.
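A minimal sketch of producing that merged shape from the original payload (assuming, as in the sample, that gradepoints and percents line up pairwise and each entry holds exactly one key):

gradepoints = [{"a1": 10}, {"a1": 10}, {"c2": 5}, {"e1": "eiop"}, {"d": 4}, {"b1": 8}]
percents = [{"a1": 90}, {"a1": 90}, {"c2": 45}, {"e1": "eiop"}, {"d": 36}, {"b1": 72}]

grades = []
for gp, pc in zip(gradepoints, percents):
    (name, gradepoint), = gp.items()  # each entry holds exactly one key/value pair
    grades.append({"name": name, "gradepoint": gradepoint, "percent": pc[name]})

print(grades[0])  # {'name': 'a1', 'gradepoint': 10, 'percent': 90}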

SQL percentage SUM different than 100

I'm using this query to grab the usage percentage of stickers (PDO):
SELECT
id_sticker,
(COUNT(*) / :stickers_count * 100) AS percentage
FROM user_sticker AS sticker_total
WHERE id_user_to = :id_user
GROUP BY id_sticker
ORDER BY percentage DESC
This is the final result:
{
"data": [
{
"id_sticker": 2,
"percentage": 28.5714285714
},
{
"id_sticker": 1,
"percentage": 14.2857142857
},
{
"id_sticker": 3,
"percentage": 14.2857142857
},
{
"id_sticker": 5,
"percentage": 14.2857142857
},
{
"id_sticker": 6,
"percentage": 14.2857142857
},
{
"id_sticker": 7,
"percentage": 14.2857142857
}
]
}
The total sum of the percentages is 99.9999999999 when it should be 100 (and that is triggering an error in the pie-chart component I'm using). Any ideas? Thanks!
SOLUTION
I ended up adding this PHP fix after fetching the data:
$dif = 100;
// Subtract every slice from 100 to find the rounding residue
foreach ($result as $item) {
    $dif = $dif - $item['percentage'];
}
if ($dif > 0) {
    // Total fell short of 100: pad the first slice
    $result[0]['percentage'] += $dif;
} elseif ($dif < 0) {
    // Total overshot 100: trim the last slice
    $result[count($result) - 1]['percentage'] += $dif;
}
It's just a rounding error. If you need the values to add up to exactly 100, round them to 1 or 2 decimal places (plenty for a pie chart) and recalculate the last one as 100 - sum(1..(n-1)) (that's pseudocode, by the way).
You can't seriously be expecting perfect precision with floating arithmetic, can you?
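A minimal sketch of that round-then-absorb fix in Python, using the percentages from the result above:

raw = [28.5714285714, 14.2857142857, 14.2857142857,
       14.2857142857, 14.2857142857, 14.2857142857]

rounded = [round(p, 2) for p in raw]
rounded[-1] = round(100 - sum(rounded[:-1]), 2)  # last slice absorbs the residue
print(rounded, round(sum(rounded), 2))
# [28.57, 14.29, 14.29, 14.29, 14.29, 14.27] 100.0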