Unnesting a JSON string stored in a column [BigQuery]

I have a table with one of the columns containing a raw JSON string as follows:
Sample JSON stored in order_lines:
{
"STR_BLK_002":{
"amount":167,
"type":"part spare",
"total_discount":0,
"color":"Black",
"is_out_of_stock":false,
"variable_fields":{
"Size":"XL",
"trueColor":"Black"
},
"category_id":"44356721",
"status_list":[
{
"id":1,
"time":"2021-04-01T15:01:54.746Z",
"status":"ORDER PLACED"
},
{
"id":2,
"time":"2021-04-02T10:31:00.397Z",
"status":"PACKED"
},
{
"id":3,
"time":"2021-04-04T10:31:01.719Z",
"status":"SHIPPED"
},
{
"id":3,
"time":"2021-04-04T18:12:06.896Z",
"status":"SHIPPED"
}
],
"product_id":270,
"price_per_quantity":167,
"quantity":1,
"maximum_quantity":10,
"variant_name":"Helmet strap",
"current_status":30,
"estimated_delivery":"09 Apr 2021",
"total_before_discount":167,
"delivery_statuses":[
{
"time":"2021-04-01T15:10:13.594Z",
"status":"FULFILLABLE"
},
{
"time":"2021-04-02T10:31:00.397Z",
"status":"PACKED"
},
{
"time":"2021-04-03T10:31:01.197Z",
"status":"READY_TO_SHIP"
},
{
"time":"2021-04-04T10:31:01.719Z",
"status":"DISPATCHED"
},
{
"time":"2021-04-04T18:12:06.896Z",
"status":"SHIPPED"
}
],
"sku_code":"STR_BLK_002"
}
}
I want to unnest this string so that the key-value pairs can be accessed individually. Also, the sku_code ('STR_BLK_002' in the sample above) is not available in any other column, and the string can contain more than one SKU. If two SKUs correspond to an order, the JSON string would be:
{
"STR_BLK_002":{
"amount":167,
"type":"part spare",
"total_discount":0,
"color":"Black",
"is_out_of_stock":false,
"variable_fields":{
"Size":"XL",
"trueColor":"Black"
},
"category_id":"44356721",
"status_list":[
{
"id":1,
"time":"2021-04-01T15:01:54.746Z",
"status":"ORDER PLACED"
},
{
"id":2,
"time":"2021-04-02T10:31:00.397Z",
"status":"PACKED"
},
{
"id":3,
"time":"2021-04-04T10:31:01.719Z",
"status":"SHIPPED"
},
{
"id":3,
"time":"2021-04-04T18:12:06.896Z",
"status":"SHIPPED"
}
],
"product_id":270,
"price_per_quantity":167,
"quantity":1,
"maximum_quantity":10,
"variant_name":"Helmet strap",
"current_status":3,
"estimated_delivery":"09 Apr 2021",
"total_before_discount":167,
"delivery_statuses":[
{
"time":"2021-04-01T15:10:13.594Z",
"status":"FULFILLABLE"
},
{
"time":"2021-04-02T10:31:00.397Z",
"status":"PACKED"
},
{
"time":"2021-04-03T10:31:01.197Z",
"status":"READY_TO_SHIP"
},
{
"time":"2021-04-04T10:31:01.719Z",
"status":"DISPATCHED"
},
{
"time":"2021-04-04T18:12:06.896Z",
"status":"SHIPPED"
}
],
"sku_code":"STR_BLK_002"
},
"STR_BLK_008":{
"amount":590,
"type":"accessory",
"total_discount":0,
"color":"blue",
"is_out_of_stock":false,
"variable_fields":{
"Size":"XL",
"trueColor":"prussian blue"
},
"category_id":"65577970",
"status_list":[
{
"id":1,
"time":"2021-04-06T15:01:54.746Z",
"status":"ORDER PLACED"
},
{
"id":2,
"time":"2021-04-07T10:31:00.397Z",
"status":"PACKED"
},
{
"id":3,
"time":"2021-04-07T10:31:01.719Z",
"status":"SHIPPED"
},
{
"id":3,
"time":"2021-04-08T18:12:06.896Z",
"status":"SHIPPED"
}
],
"product_id":276,
"price_per_quantity":590,
"quantity":1,
"maximum_quantity":5,
"variant_name":"Car Perfume",
"current_status":3,
"estimated_delivery":"09 Apr 2021",
"total_before_discount":590,
"delivery_statuses":[
{
"time":"2021-04-06T15:10:13.594Z",
"status":"FULFILLABLE"
},
{
"time":"2021-04-07T10:31:00.397Z",
"status":"PACKED"
},
{
"time":"2021-04-07T10:31:01.197Z",
"status":"READY_TO_SHIP"
},
{
"time":"2021-04-08T10:31:01.719Z",
"status":"DISPATCHED"
},
{
"time":"2021-04-10T18:12:06.896Z",
"status":"SHIPPED"
}
],
"sku_code":"STR_BLK_008"
}
}
I want to break this information into separate columns, so that I can fetch the corresponding values for each SKU.

So basically, what I think you want to do is first transform your column into an array of structs, so that instead of having this:
{
"STR_BLK_002": {...},
"STR_BLK_003": {...}
}
You have something like this:
[
{
"amount":167,
"type":"part spare",
"total_discount":0,
...
},
{
"amount":590,
"type":"accessory",
"total_discount":0,
...
}
]
With the data in that format you can leverage UNNEST to turn each entry into its own row, and then use JSON functions such as JSON_EXTRACT_SCALAR to pull fields out into their own columns.
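As a rough illustration of that reshaping outside BigQuery, here is a minimal Python sketch. The function name split_order_lines and the trimmed-down sample string are assumptions made for this example; they are not part of the original question or answer.

```python
import json

def split_order_lines(order_lines):
    """Return one JSON string per SKU entry of an order_lines value."""
    obj = json.loads(order_lines)
    # The top-level keys are SKU codes. Only the values are kept,
    # because each value already carries its own "sku_code" field.
    return [json.dumps(entry) for entry in obj.values()]

# Trimmed-down sample with two SKUs (most fields omitted for brevity).
sample = json.dumps({
    "STR_BLK_002": {"amount": 167, "type": "part spare", "sku_code": "STR_BLK_002"},
    "STR_BLK_008": {"amount": 590, "type": "accessory", "sku_code": "STR_BLK_008"},
})

# Each element now behaves like one "row" from which fields can be extracted.
rows = [json.loads(line) for line in split_order_lines(sample)]
for row in rows:
    print(row["sku_code"], row["amount"], row["type"])
```

Each list element plays the role of one unnested row, and the dictionary lookups stand in for the JSON extraction calls.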
To do this, I built a JavaScript UDF that finds the keys in the object and then iterates through each key to build an array of JSON strings, one per SKU.
CREATE TEMP FUNCTION format_json(str STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS r"""
var obj = JSON.parse(str);
var keys = Object.keys(obj);
var arr = [];
for (var i = 0; i < keys.length; i++) {
arr.push(JSON.stringify(obj[keys[i]]));
}
return arr;
""";
SELECT
JSON_EXTRACT_SCALAR(formatted_json,'$.amount') as amount
,JSON_EXTRACT_SCALAR(formatted_json,'$.type') as type
,JSON_EXTRACT_SCALAR(formatted_json,'$.total_discount') as total_discount
,JSON_EXTRACT_SCALAR(formatted_json,'$.color') as color
,JSON_EXTRACT_SCALAR(formatted_json,'$.is_out_of_stock') as is_out_of_stock
,JSON_EXTRACT_SCALAR(formatted_json,'$.sku_code') as sku_code
from
testing.json_test
left join unnest(format_json(order_lines)) as formatted_json
Which results in one row per SKU, with amount, type, total_discount, color, is_out_of_stock and sku_code as columns.

The query below should give you a good start:
select
json_extract_scalar(line, '$.sku_code') as sku_code,
json_extract_scalar(line, '$.amount') as amount,
json_extract_scalar(line, '$.type') as type,
json_extract_scalar(line, '$.total_discount') as total_discount,
json_extract_scalar(line, '$.color') as color,
json_extract_scalar(line, '$.variable_fields.Size') as Size,
json_extract_scalar(line, '$.variable_fields.trueColor') as trueColor,
from `project.dataset.table`,
unnest(split(regexp_replace(regexp_replace(order_lines, r'\s', ''), r'"STR_BLK_\d+":{', '"STR_BLK":{'),'"STR_BLK":')) order_line with offset,
unnest([struct('{' || trim(order_line, ',{}}') || '}' as line)])
where offset > 0
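Applied to the first example in your question, the output is a single row for STR_BLK_002; applied to the second example, it is one row per SKU (STR_BLK_002 and STR_BLK_008). Note that the inner regexp_replace strips all whitespace, including spaces inside string values such as "part spare".
I hope you can extend this example to whatever final goal you have in mind.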

How to do custom window function on JSON object with pandas?

I have a rather nested JSON object below, and I am trying to find the user (i.e. 'profileId') with the most events (i.e. the length of the 'parameters' key).
I have the code below to get the length of the parameters, but I now want that calculation to be correct for each record; the way I have it set up now assigns the same value to every record. I looked into pandas window functions https://pandas.pydata.org/docs/user_guide/window.html but am having trouble getting to the correct outcome.
response = response.json()
df = pd.json_normalize(response['items'])
df['calcfield'] = len(df["events"].iloc[0][0].get('parameters'))
the output of df['arrayfield'] is below:
[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
] }, {
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"C02f6wppb"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"nancy.admin@hyenacapital.net",
"profileId":"100230688039070881323"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"app_name",
"value":"Zapier"
},
{
"name":"client_type",
"value":"WEB"
}
]
ORIGINAL JSON BLOB I READ IN
{
"kind":"admin#reports#activities",
"etag":"\"5g8\"",
"nextPageToken":"A:1651795128914034:-4002873813067783265:151219070090:C02f6wppb",
"items":[
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:59:39.421Z",
"uniqueQualifier":"5526793068617678141",
"applicationName":"token",
"customerId":"cds"
},
"etag":"\"jkYcURYoi8\"",
"actor":{
"email":"blah@blah.net",
"profileId":"1323"
},
"ipAddress":"107.178.193.87",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"admin"
},
{
"name":"method_name",
"value":"directory.users.list"
},
{
"name":"client_id",
"value":"722230783769-dsta4bi9fkom72qcu0t34aj3qpcoqloq.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"7158"
},
{
"name":"product_bucket",
"value":"GSUITE_ADMIN"
},
{
"name":"app_name",
"value":"Untitled project"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
},
{
"kind":"admin#reports#activity",
"id":{
"time":"2022-05-05T23:58:48.914Z",
"uniqueQualifier":"-4002873813067783265",
"applicationName":"token",
"customerId":"df"
},
"etag":"\"5T53xK7dpLei95RNoKZd9uz5Xb8LJpBJb72fi2HaNYM/9DTdB8t7uixvUbjo4LUEg53_gf0\"",
"actor":{
"email":"blah.blah@bebe.net",
"profileId":"1324"
},
"ipAddress":"54.80.168.30",
"events":[
{
"type":"auth",
"name":"activity",
"parameters":[
{
"name":"api_name",
"value":"gmail"
},
{
"name":"method_name",
"value":"gmail.users.messages.list"
},
{
"name":"client_id",
"value":"927538837578.apps.googleusercontent.com"
},
{
"name":"num_response_bytes",
"intValue":"2"
},
{
"name":"product_bucket",
"value":"GMAIL"
},
{
"name":"client_type",
"value":"WEB"
}
]
}
]
}
]
}
Use:
df.groupby('actor.profileId')['events'].apply(lambda x: [len(x.iloc[i][0]['parameters']) for i in range(len(x))])
which returns, for each profileId, a list with the parameter count of each of its records. Output for the sample data:
actor.profileId
1323 [7]
1324 [7]
Name: events, dtype: object
It's not entirely clear what you are asking, and df['arrayfield'] isn't in the example you provided. However, if you look at the events column after json_normalize, you can use the following line to pull out the length of each parameters key. The blob you gave as an example was assigned to response...
df = pd.json_normalize(response['items'])
df['calcfield'] = df['events'].str[0].str.get('parameters').str.len()
Because each parameters key has 7 elements, it's tough to say whether this is what you really want.
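If the goal is the profileId with the most parameters overall, the per-record counts still need to be summed per user. Below is a minimal sketch in plain Python, assuming a hypothetical, heavily trimmed stand-in for response['items'] (the real records carry many more fields); it is an illustration, not the answerers' code.

```python
from collections import defaultdict

# Hypothetical, trimmed-down stand-in for response['items'].
items = [
    {"actor": {"profileId": "1323"},
     "events": [{"parameters": [{"name": "api_name"}, {"name": "method_name"}]}]},
    {"actor": {"profileId": "1324"},
     "events": [{"parameters": [{"name": "api_name"}]}]},
    {"actor": {"profileId": "1323"},
     "events": [{"parameters": [{"name": "client_id"}]}]},
]

totals = defaultdict(int)
for item in items:
    # Count parameters across every event of the record, not just the first.
    n_params = sum(len(event.get("parameters", [])) for event in item["events"])
    totals[item["actor"]["profileId"]] += n_params

# The profileId whose records carry the most parameters in total.
top_profile = max(totals, key=totals.get)
print(dict(totals), top_profile)
```

Unlike the one-liners above, this sums over all events of a record rather than reading only events[0], which may or may not be what the question intends.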

Return selected JSON object from mongo find method

Here is the sample JSON:
[
{
"_id": "123456789",
"YEAR": "2019",
"VERSION": "2019.Version",
"QUESTION_GROUPS": [
{
"QUESTIONS": [
{
"QUESTION_NAME": "STATE_CODE",
"QUESTION_VALUE": "MH"
},
{
"QUESTION_NAME": "COUNTY_NAME",
"QUESTION_VALUE": "IN"
}
]
},
{
"QUESTIONS": [
{
"QUESTION_NAME": "STATE_CODE",
"QUESTION_VALUE": "UP"
},
{
"QUESTION_NAME": "COUNTY_NAME",
"QUESTION_VALUE": "IN"
}
]
}
]
}
]
The query I am using:
db.collection.find({},
{
"QUESTION_GROUPS.QUESTIONS.QUESTION_NAME": "STATE_CODE"
})
My requirement is to retrieve every QUESTION_VALUE whose QUESTION_NAME equals STATE_CODE.
Thanks in Advance.
If I understand you correctly, what you are trying to do is something like:
db.collection.find(
{
"QUESTION_GROUPS.QUESTIONS.QUESTION_NAME": "STATE_CODE"
},
{
"QUESTION_GROUPS.QUESTIONS.QUESTION_VALUE": 1
})
Attention: you will get ALL the "QUESTION_VALUE" entries of ANY document that has a QUESTION_GROUPS.QUESTIONS.QUESTION_NAME with that value.
Attention 2: you will also get the _id field, which is returned by default.
To avoid both issues, you may need to use an aggregation and unwind "QUESTION_GROUPS" and then "QUESTIONS". This way you can drop both the irrelevant results and the _id field.
It sounds like you want to unwind the arrays and get back only the question values.
Try this:
db.collection.aggregate([
{
$unwind: "$QUESTION_GROUPS"
},
{
$unwind: "$QUESTION_GROUPS.QUESTIONS"
},
{
$match: {
"QUESTION_GROUPS.QUESTIONS.QUESTION_NAME": "STATE_CODE"
}
},
{
$project: {
"QUESTION_GROUPS.QUESTIONS.QUESTION_VALUE": 1
}
}
])
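To make the stages of that pipeline concrete, here is a small plain-Python sketch of what the two $unwind stages, the $match and the $project amount to on the sample document. The helper name state_code_values is an assumption for illustration only; this is not MongoDB driver code.

```python
# The sample document from the question, reduced to the fields used here.
docs = [{
    "_id": "123456789",
    "QUESTION_GROUPS": [
        {"QUESTIONS": [
            {"QUESTION_NAME": "STATE_CODE", "QUESTION_VALUE": "MH"},
            {"QUESTION_NAME": "COUNTY_NAME", "QUESTION_VALUE": "IN"},
        ]},
        {"QUESTIONS": [
            {"QUESTION_NAME": "STATE_CODE", "QUESTION_VALUE": "UP"},
            {"QUESTION_NAME": "COUNTY_NAME", "QUESTION_VALUE": "IN"},
        ]},
    ],
}]

def state_code_values(documents, wanted_name="STATE_CODE"):
    """Collect QUESTION_VALUE for every question with the wanted name."""
    values = []
    for doc in documents:
        for group in doc.get("QUESTION_GROUPS", []):        # first $unwind
            for question in group.get("QUESTIONS", []):     # second $unwind
                if question.get("QUESTION_NAME") == wanted_name:  # $match
                    values.append(question["QUESTION_VALUE"])     # $project
    return values

result = state_code_values(docs)
print(result)
```

The nested loops mirror the unwinding: each inner iteration corresponds to one document produced by the second $unwind, so the filter applies to individual questions rather than whole documents.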

Mongodb insert with multiple conditions

I have multiple documents in a collection, and each document has this data structure:
{
_id: "some object id",
data1: [
{
data2_id : 13233,
data2: [
{
sub_data1: "text1",
sub_data2: "text2",
sub_data3: "text3",
},
{
sub_data1: "text4",
sub_data2: "text5",
sub_data3: "text6",
}
]
},
{
data2_id : 53233,
data2: [
{
sub_data1: "text4",
sub_data2: "text5",
sub_data3: "text6",
}
...
]
},
{
data2_id : 56233,
data2: [
{
sub_data1: "text7",
sub_data2: "text8",
sub_data3: "text9",
}
...
]
},
{
data2_id : 53236,
data2: [
{
sub_data1: "text10",
sub_data2: "text22",
sub_data3: "text33",
}
...
]
}
]
}
I'd like to update a set of documents whose ids match some condition, changing only the sub-object within each document.
I've tried this:
db.collection.update({
"$and": [
{
"_id": {
"$in": [
{
"$id": "54369aca9bc25af3ca8b4568"
},
{
"$id": "54369aca9bc25af3ca8b4562"
}
]
}
},
{
"data1.data2": {
"$elemMatch": {
"sub_data1": "text4",
"sub_data2": "text5"
}
}
}
]
},
{
"data1.data2.$.sub_data3" : "text updated"
}
)
But I get the following error:
Update of data into MongoDB failed: dev.**.com:27017: cannot use the part (data2 of data1.data2.0.sub_data3) to traverse the element...
Any Ideas?
There is an open issue here that imposes a limitation when trying to update elements of an array nested within another array.
Besides, there are some improvements you can do here:
For your query you don't need the $and
db.collection.update(
{
"_id": {
"$in": [
{"$id": "54369aca9bc25af3ca8b4568"},
{"$id": "54369aca9bc25af3ca8b4562"}
]},
"data1.data2": {
"$elemMatch": {
"sub_data1": "text4",
"sub_data2": "text5"
}
}
},
{..update...})
You might want to use $set:
db.collection.update(query,{ $set:{"name": "Mike"} })
Otherwise, the update document replaces the whole matched document and you lose the rest of its data.
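The difference can be pictured with a toy model in plain Python. The helper apply_update below is hypothetical and is not a MongoDB driver function; it handles only top-level field names and exists purely to show replace-versus-$set behaviour under that assumption.

```python
def apply_update(document, update):
    """Toy model of an update() call: hypothetical helper, not a driver API."""
    if "$set" in update:
        # $set changes only the named fields and keeps everything else.
        updated = dict(document)
        updated.update(update["$set"])
        return updated
    # Without an operator, the update document replaces the stored one
    # (only _id is preserved).
    replaced = {"_id": document["_id"]}
    replaced.update(update)
    return replaced

stored = {"_id": 1, "name": "Bob", "age": 30}
with_set = apply_update(stored, {"$set": {"name": "Mike"}})
without_set = apply_update(stored, {"name": "Mike"})
print(with_set)      # "age" is still present
print(without_set)   # "age" is gone
```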

Return a field that has an array in mongoDB, and return the first and last value in that array

Scenario: Consider the document present in the MongoDB in collection named twitCount.
{
"_id" : ObjectId("53d1340478441a1c0d25c40c"),
"items" : [
{
"date" : ISODate("2014-07-22T22:18:05.000Z"),
"value" : 4,
"_id" : ObjectId("53d134048b3956000063aa72")
},
{
"date" : ISODate("2014-07-21T22:09:20.000Z"),
"value" : 10,
"_id" : ObjectId("53d134048b3956000063aa71")
}
...
],
"ticker" : "OM:A1M"
}
I only want to fetch the first and last date inside "items". I've tried a lot of different queries, but I cannot get it right. The "ticker" is unique.
The following query is the only one that returns something, but it returns everything (which is expected).
twitCount.aggregate([{ $match : { ticker: theTicker}} ], function(err, result){
if (err) {
console.log(err);
return;
}
console.log(result)
})
So, in the end, I want the query to return something like this: [2013-02-01, 2014-07-24].
I really need help with this, all links on manual/core/aggregation are purple and I don't know where to get more information.
It is hard to tell whether your intent here is to work with a single document or with multiple documents that match your condition. As suggested, a single document would really just involve using the shift and pop methods native to JavaScript on the singular result to get the first and last elements of the array. You might also need to sort the array first:
twitCount.findOne({ "ticker": "OM:A1M" },function(err,doc) {
doc.items = doc.items.sort(function(a,b) {
return ( a.date.valueOf() > b.date.valueOf() ) ? 1
: ( a.date.valueOf() < b.date.valueOf() ) ? -1 : 0;
});
doc.items = [doc.items.shift(),doc.items.pop()];
console.log( doc );
})
The other suggestions don't really apply, as operators like $pop permanently modify the array in updates. The $slice operator, which can be used in a query, would only be of use if the array contents were already sorted, and you would additionally need two queries to return the first and last elements, which is not what you want.
But if you really are looking to do this over multiple documents then the aggregation framework is the answer. The key area to understand when working with arrays is that you must use an $unwind pipeline stage on the array first. This "de-normalizes" to a form where a copy of the document is effectively produced for each array element:
twitCount.aggregate([
// Match your "documents" first
{ "$match": { "ticker": "OM:A1M" } },
// Unwind the array
{ "$unwind": "$items" },
// Sort the values
{ "$sort": { "items.date": 1 } },
// Group with $first and $last items
{ "$group": {
"_id": "$ticker",
"first": { "$first": "$items" },
"last": { "$last": "$items" }
}}
],function(err,result) {
// work with result here
});
If you really want "items" back as an array then you can just do things a little differently:
twitCount.aggregate([
// Match your "documents" first
{ "$match": { "ticker": "OM:A1M" } },
// Unwind the array
{ "$unwind": "$items" },
// Sort the values
{ "$sort": { "items.date": 1 } },
// Group with $first and $last items
{ "$group": {
"_id": "$ticker",
"first": { "$first": "$items" },
"last": { "$last": "$items" },
"type": { "$first": { "$const": [true,false] } }
}},
// Unwind the "type"
{ "$unwind": "$type" },
// Conditionally push to the array
{ "$group": {
"_id": "$_id",
"items": {
"$push": {
"$cond": [
"$type",
"$first",
"$last"
]
}
}
}}
],function(err,result) {
// work with result here
});
Or if your $match statement is just intended to select and you want the "first" and "last" from each document "_id" then you just change the key in the initial $group to "$_id" rather than the "$ticker" field value:
twitCount.aggregate([
// Match your "documents" first
{ "$match": { "ticker": "OM:A1M" } },
// Unwind the array
{ "$unwind": "$items" },
// Sort the values
{ "$sort": { "items.date": 1 } },
// Group with $first and $last items
{ "$group": {
"_id": "$_id",
"ticker": { "$first": "$ticker" },
"first": { "$first": "$items" },
"last": { "$last": "$items" },
"type": { "$first": { "$const": [true,false] } }
}},
// Unwind the "type"
{ "$unwind": "$type" },
// Conditionally push to the array
{ "$group": {
"_id": "$_id",
"ticker": { "$first": "$ticker" },
"items": {
"$push": {
"$cond": [
"$type",
"$first",
"$last"
]
}
}
}}
],function(err,result) {
// work with result here
});
In that last case, you would get something like this, based on the data you have provided:
{
"_id" : ObjectId("53d1340478441a1c0d25c40c"),
"ticker" : "OM:A1M",
"items" : [
{
"date" : ISODate("2014-07-21T22:09:20Z"),
"value" : 10,
"_id" : ObjectId("53d134048b3956000063aa71")
},
{
"date" : ISODate("2014-07-22T22:18:05Z"),
"value" : 4,
"_id" : ObjectId("53d134048b3956000063aa72")
}
]
}
You can find the Full List of Aggregation Operators in the documentation. It is worth getting to know how these function as depending on what you are doing the aggregation framework can be a very useful tool.
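For intuition about what the per-document pipeline computes, here is a plain-Python sketch of the same match, sort and first/last selection. The function name first_and_last is an assumption for this example, the ObjectId values are written as plain strings, and nothing here calls MongoDB.

```python
from datetime import datetime

# The sample document, with ObjectIds simplified to strings.
docs = [{
    "_id": "53d1340478441a1c0d25c40c",
    "ticker": "OM:A1M",
    "items": [
        {"date": datetime(2014, 7, 22, 22, 18, 5), "value": 4},
        {"date": datetime(2014, 7, 21, 22, 9, 20), "value": 10},
    ],
}]

def first_and_last(documents, ticker):
    """Keep only the earliest and latest item of each matching document."""
    results = []
    for doc in documents:
        if doc["ticker"] != ticker or not doc["items"]:      # $match
            continue
        # Sorting the embedded array stands in for $unwind + $sort.
        ordered = sorted(doc["items"], key=lambda item: item["date"])
        results.append({
            "_id": doc["_id"],
            "ticker": doc["ticker"],
            "items": [ordered[0], ordered[-1]],              # $first and $last
        })
    return results

result = first_and_last(docs, "OM:A1M")
print([item["date"].date().isoformat() for item in result[0]["items"]])
```

On the two-item sample this yields the 2014-07-21 item followed by the 2014-07-22 item, matching the order of the expected output above.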

Loading TreeStore with JSON that has different children fields

I have JSON data like the following:
{
"divisions": [{
"name": "division1",
"id": "div1",
"subdivisions": [{
"name": "Sub1Div1",
"id": "div1sub1",
"schemes": [{
"name": "Scheme1",
"id": "scheme1"
}, {
"name": "Scheme2",
"id": "scheme2"
}]
}, {
"name": "Sub2Div1",
"id": "div1sub2",
"schemes": [{
"name": "Scheme3",
"id": "scheme3"
}]
}
]
}]
}
I want to read this into a TreeStore, but I cannot change the subfields (divisions, subdivisions, schemes) to be the same (e.g. children).
How can I achieve this?
When nested JSON is loaded into a TreeStore, the child nodes are essentially loaded through recursive calls between the TreeStore.fillNode() method and NodeInterface.appendChild().
The actual retrieval of each node's children field is done within TreeStore.onNodeAdded() on this line:
dataRoot = reader.getRoot(data);
The getRoot() of the reader is dynamically created in the reader's buildExtractors() method, which is what you'll need to override in order to deal with varying children fields within nested JSON. Here is how it's done:
Ext.define('MyVariJsonReader', {
extend: 'Ext.data.reader.Json',
alias : 'reader.varijson',
buildExtractors : function()
{
var me = this;
me.callParent(arguments);
me.getRoot = function ( aObj ) {
// Special cases
switch( aObj.name )
{
case 'Bill': return aObj[ 'children' ];
case 'Norman': return aObj[ 'sons' ];
}
// Default root is `people`
return aObj[ 'people' ];
};
}
});
This will be able to interpret such JSON:
{
"people":[
{
"name":"Bill",
"expanded":true,
"children":[
{
"name":"Kate",
"leaf":true
},
{
"name":"John",
"leaf":true
}
]
},
{
"name":"Norman",
"expanded":true,
"sons":[
{
"name":"Mike",
"leaf":true
},
{
"name":"Harry",
"leaf":true
}
]
}
]
}
See this JsFiddle for fully working code.
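The core idea of the overridden getRoot(), choosing the children field per node, can also be sketched independently of ExtJS. The Python below is only an illustration applied to the divisions JSON from the question; the names CHILD_KEYS, get_children and walk are assumptions and have no counterpart in the ExtJS API.

```python
# Field names that may hold a node's children, in the question's JSON.
CHILD_KEYS = ("divisions", "subdivisions", "schemes")

def get_children(node):
    """Return the child list under whichever children field the node has."""
    for key in CHILD_KEYS:
        if key in node:
            return node[key]
    return []

def walk(node, depth=0):
    """Yield (depth, name) for every named node, parents before children."""
    if "name" in node:
        yield depth, node["name"]
        depth += 1
    for child in get_children(node):
        yield from walk(child, depth)

data = {"divisions": [{"name": "division1", "id": "div1", "subdivisions": [
    {"name": "Sub1Div1", "id": "div1sub1", "schemes": [
        {"name": "Scheme1", "id": "scheme1"},
        {"name": "Scheme2", "id": "scheme2"}]},
    {"name": "Sub2Div1", "id": "div1sub2", "schemes": [
        {"name": "Scheme3", "id": "scheme3"}]},
]}]}

names = list(walk(data))
for depth, name in names:
    print("  " * depth + name)
```

Here the lookup is driven by which key is present rather than by the node's name, which suits the question's data because each level uses a different field.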