Exploding a JSON column in Athena using a Presto stored procedure

The Scenario
I chose an S3 folder location to create the table from a CSV file which has one column in JSON format. This column needs to be exploded in a way that creates many entries for one particular user & event.
The Problem
The Athena table looks something like this:
agenda_data, event_id, partner_id, record_last_updated, user_id
"{'enclosed_data': {'task_active': 'true', 'status': 'completed'}, 'Agenda-1': {'currentProgress': '', 'timelines': '30/4/2020'}, 'Agenda-2': {'currentProgress': ' ', 'timelines': '25/4/2020'}, 'Agenda-3': {'currentProgress': ' ', 'timelines': '25/4/2020'}, 'Agenda-4': {'currentProgress': ' ', 'timelines': '28/4/2020'}, 'meta': {'foo': 'bar'}, 'Summary': {'finYear': '2020'}}, 'event_id': '20200407181839', 'record_last_updated': '2020-04-07T18:24:44.557362Z','user_id': '121000'}",20200407181839,Actionable,2020-04-06T13:20:31.114397Z,121000
"{'enclosed_data': {'consolidator': {'task_active': 'true', 'status': 'completed'},'Agenda-1': {'currentProgress': '', 'timelines': '25/4/2020'},'Agenda-2': {'currentProgress': 'On Going', 'timelines': '20/4/2020'},'Agenda-3': {'currentProgress': 'Completed', 'timelines': '07/4/2020'},'Agenda-4': {'currentProgress': ' ', 'timelines': '13/4/2020'},'meta': {'foo': 'bar'}, 'Summary': {'finYear': '2020'}}, event_id': '20200407202551',record_last_updated': '2020-04-07T20:32:48.215545Z', user_id': '12354'}",20200407202551,Actionable,2020-04-07T20:32:48.215545Z,12354
The column agenda_data contains JSON data which needs to be exploded. To put it clearly, here is the minimized JSON structure:
{
    "enclosed_data": {
        "task_active": "true",
        "status": "completed"
    },
    "Agenda-1": {
        "currentProgress": "",
        "timelines": "25/4/2020"
    },
    "Agenda-2": {
        "currentProgress": "On Going",
        "timelines": "20/4/2020"
    },
    "meta": {
        "foo": "bar"
    },
    "Summary": {
        "finYear": "2020"
    },
    "event_id": "20200407202551",
    "record_last_updated": "2020-04-07T20:32:48.215545Z",
    "user_id": "121000"
}
I need to project only the Agenda data when exploding. I worked through multiple blog posts and found these documents the most sensible:
link1: helps very little
link2: doesn't apply, since I don't have arrays here
link3: couldn't get this one to work either
The Expected output
The expected output is as follows:
event_id, partner_id, record_last_updated, user_id, agenda, currentProgress, timelines
20200407181839, Actionable, 2020-04-06T13:20:31.114397Z, 121000, Agenda-1, "", "30/4/2020"
20200407181839, Actionable, 2020-04-06T13:20:31.114397Z, 121000, Agenda-2, " ", "25/4/2020"
20200407181839, Actionable, 2020-04-06T13:20:31.114397Z, 121000, Agenda-3, " ", "25/4/2020"
20200407181839, Actionable, 2020-04-06T13:20:31.114397Z, 121000, Agenda-4, " ", "28/4/2020"
20200407202551, Actionable, 2020-04-07T20:32:48.215545Z, 12354, Agenda-1, " ", "25/4/2020"
20200407202551, Actionable, 2020-04-07T20:32:48.215545Z, 12354, Agenda-2, "On Going", "20/4/2020"
20200407202551, Actionable, 2020-04-07T20:32:48.215545Z, 12354, Agenda-3, "Completed", "07/4/2020"
20200407202551, Actionable, 2020-04-07T20:32:48.215545Z, 12354, Agenda-4, " ", "13/4/2020"
EDIT #1
Some success so far: I managed to parse the JSON using Presto functions as follows:
QUERY
with meeting_data AS (
    SELECT '{
        "enclosed_data": {
            "task_active": "true",
            "status": "completed"
        },
        "Agenda-1": {
            "currentProgress": "",
            "timelines": "25/4/2020"
        },
        "Agenda-2": {
            "currentProgress": "On Going",
            "timelines": "20/4/2020"
        },
        "meta": {
            "foo": "bar"
        },
        "Summary": {
            "finYear": "2020"
        },
        "event_id": "20200407202551",
        "record_last_updated": "2020-04-07T20:32:48.215545Z",
        "user_id": "121000"
    }' AS blob
)
SELECT
    json_extract(blob, '$["Agenda-1"]') AS agenda1,
    json_extract(blob, '$.enclosed_data.status') AS m_status,
    json_extract(blob, '$.Summary.finYear') AS finYear
FROM meeting_data
OUTPUT
agenda1, m_status, finYear
{"currentProgress":"","timelines":"25/4/2020"}, "completed", "2020"
OPEN QUESTIONS
I understand that I can access the JSON when I paste it in manually; how do I fetch it from the column, row by row (do I need a loop)?
Once that works, how do I explode it and get the expected output, repeating the other column values which aren't in JSON format?
Can this be achieved by writing a function/stored procedure in Presto?
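On the open questions: Athena has no user-defined stored procedures, but none should be needed here. Presto can cast a parsed JSON object to a map and UNNEST it, which repeats the non-JSON columns once per map entry. Here is a minimal sketch, assuming the Agenda-* objects sit at the top level of agenda_data as in the EDIT #1 query (rows where they are nested inside enclosed_data would need an extra json_extract hop first), that the stored blob is valid double-quoted JSON (json_parse would reject the single-quoted sample rows as-is), and a hypothetical table name my_table:
-- Sketch: explode the top-level "Agenda-*" entries of agenda_data into rows.
SELECT
    event_id,
    partner_id,
    record_last_updated,
    user_id,
    t.agenda,
    json_extract_scalar(t.agenda_json, '$.currentProgress') AS currentProgress,
    json_extract_scalar(t.agenda_json, '$.timelines') AS timelines
FROM my_table
CROSS JOIN UNNEST(
    CAST(json_parse(agenda_data) AS map(varchar, json))
) AS t (agenda, agenda_json)
WHERE t.agenda LIKE 'Agenda-%'
UNNEST over a map yields one row per (key, value) pair, so the scalar columns repeat exactly as in the expected output, and the WHERE clause drops the non-Agenda keys (enclosed_data, meta, Summary, and the scalar fields).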

Related

How to use SQL FOR JSON PATH - dot notation for Custom JSON output

I'm trying to output SQL query results in a custom JSON format. I've tried several dot-notation formats (which I believe are necessary) to get the desired format.
The table has test data like
Status = 'Test Status'
Type = 'Test Type'
Code = 'Test Code'
What I've tried:
SELECT
    [Status] AS [id:950 .VALUE]
    ,[Type] AS [id:951 .VALUE]
    ,[Code] AS [id:952 .VALUE]
FROM MyTable
FOR JSON PATH, ROOT('fieldval')
Which gets me close with this:
{
    "fieldval": [
        {
            "id:950 ": {
                "VALUE": "Test Status"
            },
            "id:951 ": {
                "VALUE": "Test Type"
            },
            "id:952 ": {
                "VALUE": "Test Code"
            }
        }
    ]
}
But I need it in this format
{
    "type": "CustomJSON",
    "fieldval": [
        {
            "id": "950",
            "value": "Test Status",
            "fieldName": "Status"
        },
        {
            "id": "951",
            "value": "Test Type",
            "fieldName": "Type"
        },
        {
            "id": "952",
            "value": "Test Code",
            "fieldName": "Code"
        }
    ]
}
What do I need to add/change? Thanks
You need to use JSON_QUERY() to add arrays of data to the outer JSON, e.g.:
create table dbo.Example (
    ExampleID nvarchar(3), --<<-- nvarchar since the required JSON has strings here, not numbers.
    ExampleValue nvarchar(11),
    ExampleFieldName nvarchar(6)
);
insert dbo.Example (ExampleID, ExampleValue, ExampleFieldName)
values
    (N'950', N'Test Status', N'Status'),
    (N'951', N'Test Type', N'Type'),
    (N'952', N'Test Code', N'Code');
select
    N'CustomJSON' as [type],
    json_query((
        select
            ExampleID as [id],
            ExampleValue as [value],
            ExampleFieldName as [fieldName]
        from dbo.Example
        for json path
    )) as [fieldval]
for json path, without_array_wrapper;
Which yields the desired result:
{
    "type": "CustomJSON",
    "fieldval": [
        {
            "id": "950",
            "value": "Test Status",
            "fieldName": "Status"
        },
        {
            "id": "951",
            "value": "Test Type",
            "fieldName": "Type"
        },
        {
            "id": "952",
            "value": "Test Code",
            "fieldName": "Code"
        }
    ]
}

How to convert json to csv with single header and multiple values?

I have input
data = [
    {
        "details": [
            {
                "health": "Good",
                "id": "1",
                "timestamp": 1579155574
            },
            {
                "health": "Bad",
                "id": "1",
                "timestamp": 1579155575
            }
        ]
    },
    {
        "details": [
            {
                "health": "Good",
                "id": "2",
                "timestamp": 1588329978
            },
            {
                "health": "Good",
                "device_id": "2",
                "timestamp": 1588416380
            }
        ]
    }
]
Now I want to convert it to CSV, something like below:
id,health
1,Good - 1579155574,Bad - 1579155575
2,Good - 1588329978,Good - 1588416380
Is this possible?
Currently I am converting it to a simple CSV; my code and the response are below:
f = csv.writer(open("test.csv", "w", newline=""))
f.writerow(["id", "health", "timestamp"])
for data in data:
    for details in data['details']:
        f.writerow([details['id'],
                    details["health"],
                    details["timestamp"],
                    ])
Response:
id,health,timestamp
1,Good,1579155574
1,Bad,1579155575
2,Good,1588329978
2,Good,1588416380
So how could I get the expected output? I am using Python 3.
You have almost done the job; I don't think you need the csv module here. CSV doesn't mean anything special to the computer; it's just a name that tells people what the file is. CSV, TXT, and JSON are all the same to a computer: text that records data. I don't know the full shape of your data, but you can build the output you want like this:
output = 'id,health\n'
for entry in data:
    output += f'{entry["details"][0]["id"]},'
    for d in entry["details"]:
        if 'health' in d:
            output += f'{d["health"]} - {d["timestamp"]},'
        else:
            # fall back for records that use 'battery_health' instead of 'health'
            output += f'{d["battery_health"]} - {d["timestamp"]},'
    output = output[:-1] + '\n'
with open('test.csv', 'w') as op:
    op.write(output)

How to use a JSON subquery in a MySQL JSON Statement

The scenario: I have a JSON data column named 'address' in my SQL table, into which I inserted an array of objects. Later I need to pick a whole object out of the array based on which type is passed in the request.
So if I pass { "type": "shipping" } to my REST API, I would like to get all the data in the object marked with type "shipping".
Just one of the different JSON formats I tried:
{
    "address": [
        {
            "type": "shipping",
            "street": "streetName",
            "streetNumber": "1"
        },
        {
            "type": "billing",
            "street": "streetName",
            "streetNumber": "2"
        },
        {
            "type": "custom",
            "street": "streetName",
            "streetNumber": "3"
        }
    ]
}
This is how the data gets stored in the single column:
[
    {"type": "custom", "street": "streetName", "streetNumber": "1"},
    {"type": "shipping", "street": "streetName", "streetNumber": "2"},
    {"type": "billing", "street": "streetName", "streetNumber": "3"}
]
So I tried to do some magic with MySQL's JSON functions. Queries like
SELECT JSON_SEARCH(address, 'one', 'shipping')
FROM user_data;
work fine, but that is not the result I need. So I thought about using subqueries:
SELECT JSON_EXTRACT(address, (SELECT JSON_SEARCH(address, 'one', 'shipping')))
FROM user_data
WHERE user_id = 1;
but this just ends up with "invalid JSON path expression" (Error Code: 3143).
I also tried different queries and JSON data formats; for example, instead of using an array, I tried to insert the data like this:
"address": {
"shipping": {
"street": ...
.
},
"billing": {
"street": ...
.
}
}
Now I am just confused, and the MySQL documentation does not help right now...
Your query generates an invalid JSON path expression because the result contains quotes:
SELECT JSON_SEARCH(address, 'one', 'shipping')
FROM user_data;
Result: "$[0].type"
You can strip those quotes with the REPLACE function:
SELECT REPLACE( JSON_SEARCH(address, 'one', 'shipping'), '"', '')
FROM user_data;
Result: $[0].type
To get the object path, you should also strip '.type' from the end of the result:
SELECT REPLACE(REPLACE( JSON_SEARCH(address, 'one', 'shipping'), '"', ''), '.type', '')
FROM user_data;
Result: $[0]
Final query looks like this:
SELECT
    JSON_EXTRACT(
        address,
        REPLACE(
            REPLACE(
                (SELECT JSON_SEARCH(address, 'one', 'billing') FROM user_data),
                '"',
                ''),
            '.type',
            '')
    )
FROM user_data
WHERE user_id = 1;
Example here: https://www.db-fiddle.com/f/mYBXEs3M4xFDGeQvGwUgqQ/0
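For what it's worth, a slightly tidier variant of the same idea (assuming MySQL 5.7+, where these JSON functions live) uses JSON_UNQUOTE() instead of the quote-stripping REPLACE, leaving only the '.type' suffix to remove:
-- Sketch: JSON_UNQUOTE removes the quotes JSON_SEARCH adds around the path.
SELECT JSON_EXTRACT(
    address,
    REPLACE(JSON_UNQUOTE(JSON_SEARCH(address, 'one', 'shipping')), '.type', '')
)
FROM user_data
WHERE user_id = 1;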

Bulk importing JSON into SQL Server

I need to bulk import data into a SQL Server database.
My JSON looks like this:
$projectsJSON = #"
[
{
"id": 35,
"created_at": "2016-01-12T11:40:36+01:00",
"customer_id": 34,
"name": ".com",
"note": "COMXXXX-",
"updated_at": "2016-07-15T12:13:54+02:00",
"archived": false,
"customer_name": "PMName"
},
{
"id": 23,
"created_at": "2010-01-11T12:58:50+01:00",
"customer_id": 43,
"name": "PN",
"note": "{\r\n \"Billable\": 1\r\n}\r\n",
"updated_at": "2017-11-24T15:49:31+01:00",
"archived": false,
"customer_name": "MSM"
}
]
"#
$projects = $projectsJSON |ConvertFrom-Json
$dt2 = New-Object system.Data.DataTable
$dt2 = $projects|select-object id, created_at, customer_id, name, note, updated_at, archived, customer_name
$bulkCopy = New-Object Data.SqlClient.SqlBulkCopy($DestConnStr, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
$bulkCopy.BulkCopyTimeout = 600
$bulkCopy.DestinationTableName = "project"
$bulkCopy.WriteToServer($dt2)
Unfortunately, I keep getting this error:
Cannot convert argument "rows", with value: "System.Object[]", for
"WriteToServer" to type "System.Data.DataRow[]": "Cannot convert the
"#{id=35; created_at=2016-01-12T11:40:36+01:00; customer_id=34;
name=.com; note=COMXXXX-; updated_at=2016-07-15T12:13:54+02:00;
archived=False; customer_name=PMName}" value of type
"Selected.System.Management.Automation.PSCustomObject" to type
"System.Data.DataRow"." At P:\PsideSync\sync.ps1:261 char:3
+ $bulkCopy.WriteToServer($dt2)
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodException
+ FullyQualifiedErrorId : MethodArgumentConversionInvalidCastArgument
What would be the right way to import lots of JSON Data into SQL Server?
TIA
As @IRon notes, you are assigning $dt2 twice. I assume you want to go with the second assignment.
You could get Out-DataTable from here: https://gallery.technet.microsoft.com/scriptcenter/4208a159-a52e-4b99-83d4-8048468d29dd
Then it's just a matter of calling:
$projects = $projectsJSON |ConvertFrom-Json
$dt2 = $projects|select-object id, created_at, customer_id, name, note, updated_at, archived, customer_name
$bulkCopy = New-Object Data.SqlClient.SqlBulkCopy($DestConnStr, [System.Data.SqlClient.SqlBulkCopyOptions]::KeepIdentity)
$bulkCopy.BulkCopyTimeout = 600
$bulkCopy.DestinationTableName = "project"
$bulkCopy.WriteToServer($dt2 | Out-DataTable)
Disclaimer: I haven't tried this myself.
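Another route worth sketching, since the goal is ultimately to get JSON into SQL Server: on SQL Server 2016+ you can pass the whole JSON string as an nvarchar(max) parameter and let OPENJSON shred it server-side. The column types below are assumptions about the dbo.project table, which the question doesn't show:
-- Sketch: shred the JSON array into rows with OPENJSON (SQL Server 2016+).
INSERT INTO dbo.project (id, created_at, customer_id, name, note, updated_at, archived, customer_name)
SELECT id, created_at, customer_id, name, note, updated_at, archived, customer_name
FROM OPENJSON(@projectsJSON)
WITH (
    id            int             '$.id',
    created_at    datetimeoffset  '$.created_at',
    customer_id   int             '$.customer_id',
    name          nvarchar(200)   '$.name',
    note          nvarchar(max)   '$.note',
    updated_at    datetimeoffset  '$.updated_at',
    archived      bit             '$.archived',
    customer_name nvarchar(200)   '$.customer_name'
);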

How to Get JSON values Python

I'm still learning. Here is the code to get the data in JSON format:
# ...
cursor.execute("SELECT * FROM user")
response = {
    "version": "5.2",
    "user_type": "online",
    "user": list(cursor),
}
response = json.dumps(response, sort_keys=False, indent=4, separators=(',', ': '))
print(response)
# ...
This produces output as
{
    "version": "5.2",
    "user_type": "online",
    "user": [
        {
            "name": "John",
            "id": 50
        },
        {
            "name": "Mark",
            "id": 57
        }
    ]
}
print(response["user"]) raises TypeError: string indices must be integers.
How do I access the values in the JSON?
json.dumps returns a string, so you need to convert it back into a dict before indexing into it.
Solution:
response = json.loads(response)
print(response['user'][0]['id'])