Error in querying embedded json using apache drill - apache-drill

donutTest.json (in my local system at /home/dev):
{
"id":"0001",
"type":"donut",
"name":"Cake",
"batter":{
"id":"1001",
"type":"Regular"
},
"topping":[
{ "id":"5001", "type":"None"},
{ "id":"5002", "type":"Glazed"}
]
}
This query works fine:
select topping[0].id as topping_id, topping[3].type as topping_type from dfs.`/home/dev/donutTest.json`;
But when I tried:
select batter.id as batter_id, batter.type as batter_type from dfs.`/home/dev/donutTest.json`;
It shows an error:
Table 'batter' not found
Both topping[0] and batter are embedded documents, yet only the batter query fails.

Try using a table alias and then reference that in the select statement.
select donut.batter.id as batter_id, donut.batter.type as batter_type from dfs.`/home/dev/donutTest.json` as donut;
This way Drill has a reference to the actual table alias and then the nested structure underneath.

Related

Can I index and search nested object keys in JSON on postgres?

If I have a table called configurations where rows contain a jsonb column called data with values similar to the following:
{
"US": {
"1234": {
"id": "ABCD"
}
},
"CA": {
"5678": {
"id": "WXYZ"
}
}
}
My hope is to be able to write a query akin to the following:
select * from configurations where data->'$.*.*.id' = 'WXYZ'
(Please note: I'm aware that the SQL above is not correct, treat it as pseudo.)
Questions:
What is the correct syntax to perform the query I've written above?
What type of index would I need to create to ensure I'm not scanning the entire table using any query from my previous question?
You can turn your pseudo code into real jsonpath code:
select * from configurations where data @@ '$.*.*.id == "WXYZ"'
And this can use a default gin index on "data":
create index on configurations using gin (data);
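To make clear what the path '$.*.*.id == "WXYZ"' is checking, here is a minimal Python sketch that emulates the predicate on the sample document from the question. It is only an illustration: the function name is hypothetical, it ignores jsonpath's lax-mode array unwrapping, and PostgreSQL (version 12 or later, where the jsonpath operators were introduced) evaluates the real expression natively.

```python
# Plain-Python emulation of the jsonpath predicate '$.*.*.id == "WXYZ"'.
# Hypothetical helper for illustration only; not part of any PostgreSQL API.

def matches_two_levels_down(doc, wanted_id):
    """Return True if any object two levels below the root has id == wanted_id."""
    if not isinstance(doc, dict):
        return False
    for level1 in doc.values():          # first '.*' (e.g. "US", "CA")
        if not isinstance(level1, dict):
            continue
        for level2 in level1.values():   # second '.*' (e.g. "1234", "5678")
            if isinstance(level2, dict) and level2.get("id") == wanted_id:
                return True
    return False


# Sample value of the "data" column, copied from the question.
sample = {
    "US": {"1234": {"id": "ABCD"}},
    "CA": {"5678": {"id": "WXYZ"}},
}

print(matches_two_levels_down(sample, "WXYZ"))
print(matches_two_levels_down(sample, "NOPE"))
```

A row is returned by the SQL query exactly when this kind of check succeeds for its data value; the GIN index lets PostgreSQL avoid running the check against every row.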

Can you SQL populate a BigQuery table and set the table column modes in the same API call?

I'm using Google Apps Script to migrate data through BigQuery, and I've run into an issue: the SQL I'm using to perform a WRITE_TRUNCATE load causes the destination table to be recreated with column modes of NULLABLE rather than their previous mode of REQUIRED.
Attempting to change the modes to REQUIRED after the data is loaded using a metadata patch causes an error even though the columns don't contain any null values.
I considered working around the issue by dropping the table and recreating it again with the same REQUIRED modes, then loading the data using WRITE_APPEND instead of WRITE_TRUNCATE. But this isn't possible because a user wants to have the same source and destination table in their SQL.
Does anyone know if it's possible to define a BigQuery.Jobs.insert request that includes the output schema information/metadata?
If it's not possible, the only alternative I can see is to use my original workaround of a WRITE_APPEND but add a temporary table into the process, so that the destination table can appear in the source SQL. But if this can be avoided, that would be nice.
Additional Information:
I did experiment with different ways of setting the schema information but when they didn't return an error message the schema seemed to get ignored.
I.e. this is the json I'm passing into BigQuery.Jobs.insert
jsnConfig =
{
"configuration":
{
"query":
{
"destinationTable":
{
"projectId":"my-project",
"datasetId":"sandbox_dataset",
"tableId":"hello_world"
},
"writeDisposition":"WRITE_TRUNCATE",
"useLegacySql":false,
"query":"SELECT COL_A, COL_B, '1' AS COL_C, COL_TIMESTAMP, COL_REQUIRED FROM `my-project.sandbox_dataset.hello_world_2` ",
"allowLargeResults":true,
"schema":
{
"fields":
[
{
"description":"Desc of Column A",
"type":"STRING",
"mode":"NULLABLE",
"name":"COL_A"
},
{
"description":"Desc of Column B",
"type":"STRING",
"mode":"REQUIRED",
"name":"COL_B"
},
{
"description":"Desc of Column C",
"type":"STRING",
"mode":"REPEATED",
"name":"COL_C"
},
{
"description":"Desc of Column Timestamp",
"type":"INTEGER",
"mode":"NULLABLE",
"name":"COL_TIMESTAMP"
},
{
"description":"Desc of Column Required",
"type":"STRING",
"mode":"REQUIRED",
"name":"COL_REQUIRED"
}
]
}
}
}
}
var job = BigQuery.Jobs.insert(jsnConfig, "my-project");
The result is that the new or existing hello_world table is truncated and loaded with the data specified in the query (so part of the json package is being read), but the column descriptions and modes aren't added as defined in the schema section. They're just blank and NULLABLE in the table.
More
When I tested the REST request above using Google's API page for BigQuery.Jobs.insert, it highlighted the "schema" property in the request as invalid. It appears the schema can be defined if you're loading the data from a file, i.e. BigQuery.Jobs.Load, but that functionality doesn't seem to be supported when the data comes from an SQL source.
See the documentation here: https://cloud.google.com/bigquery/docs/schemas#specify-schema-manual-python
You can pass a schema object with your load job, which means you can set fields to mode=REQUIRED.
This is the command you should use:
bq --location=[LOCATION] load --source_format=[FORMAT] [PROJECT_ID]:[DATASET].[TABLE] [PATH_TO_DATA_FILE] [PATH_TO_SCHEMA_FILE]
As @Roy answered, this is done via load only. Can you output the logs of this command?
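For the [PATH_TO_SCHEMA_FILE] argument in the command above, the schema file is a JSON array of field objects. The sketch below generates that JSON from the column definitions given in the question. The helper names are hypothetical, and it only produces the file contents; it does not call BigQuery.

```python
import json

# Field definitions copied from the "schema" section of the question.
SCHEMA_FIELDS = [
    {"name": "COL_A", "type": "STRING", "mode": "NULLABLE",
     "description": "Desc of Column A"},
    {"name": "COL_B", "type": "STRING", "mode": "REQUIRED",
     "description": "Desc of Column B"},
    {"name": "COL_C", "type": "STRING", "mode": "REPEATED",
     "description": "Desc of Column C"},
    {"name": "COL_TIMESTAMP", "type": "INTEGER", "mode": "NULLABLE",
     "description": "Desc of Column Timestamp"},
    {"name": "COL_REQUIRED", "type": "STRING", "mode": "REQUIRED",
     "description": "Desc of Column Required"},
]


def schema_json(fields):
    """Serialize the field list as the JSON array expected in a schema file."""
    return json.dumps(fields, indent=2)


def required_columns(fields):
    """List the names of columns whose mode is REQUIRED, as a sanity check."""
    return [field["name"] for field in fields if field.get("mode") == "REQUIRED"]


print(schema_json(SCHEMA_FIELDS))
print(required_columns(SCHEMA_FIELDS))
```

Saving the printed JSON to a file and passing its path as the last argument of bq load is one way to keep the REQUIRED modes, under the assumption that the data can be loaded from a file rather than produced by a query.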

Query for Power BI Output from Azure Stream Analytics with JSON data

I'm having an issue extracting data from IOT Hub to Azure Stream Analytics to Power BI.
Here is the data coming from Stream Analytics:
{
"header":{
"version":1
},
"data":{
"treatmentId":"1",
"machineData":[
{
"recordId":3,
"records":[
{
"fields":[
{
"value":"+182",
"key":"VP"
}
],
"group":"PR"
}
]
}
]
},
"EventProcessedUtcTime":"2018-12-05T16:52:43.6450807Z",
"PartitionId":0,
"EventEnqueuedUtcTime":"2018-12-05T16:38:47.1900000Z",
"IoTHub":{
"CorrelationId":null
}
}
Using the following query:
SELECT *
INTO DataPowerBI
FROM iothub;
I am getting the following output in PowerBI:
I am not able to get the child level data under "data", like treatment id, machine data, groups, keys. Can I get a query for pushing all levels of the data, both parent and children?
Thanks in advance!
Raj
With select *, you only get the top-level fields back. If you want the nested data, you need to select it explicitly.
select data.treatmentid will get you the treatmentId.
I am not sure how this works with nesting within nesting. You could try
select data.machinedata.recordId to get the recordId.
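Since machineData, records and fields are arrays, a plain dotted path may not be enough; in the Stream Analytics query language, arrays are typically expanded with CROSS APPLY and GetArrayElements. As a rough illustration of the flat rows such a query would aim to produce for Power BI, here is a Python sketch run against the sample event from the question. The function name and the output column names are assumptions, and this is not Stream Analytics code.

```python
def flatten_event(event):
    """Yield one flat row per entry in data.machineData[].records[].fields[]."""
    data = event.get("data", {})
    for machine in data.get("machineData", []):
        for record in machine.get("records", []):
            for field in record.get("fields", []):
                yield {
                    "treatmentId": data.get("treatmentId"),
                    "recordId": machine.get("recordId"),
                    "group": record.get("group"),
                    "key": field.get("key"),
                    "value": field.get("value"),
                }


# Sample event, reduced to the nested part shown in the question.
event = {
    "header": {"version": 1},
    "data": {
        "treatmentId": "1",
        "machineData": [
            {
                "recordId": 3,
                "records": [
                    {"fields": [{"value": "+182", "key": "VP"}], "group": "PR"}
                ],
            }
        ],
    },
}

rows = list(flatten_event(event))
print(rows)
```

Each nested loop corresponds to one array level that has to be expanded before the values can appear as columns in a flat table.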

How do I query a complex JSONB field in Django 1.9

I have a table item with a field called data of type JSONB. I would like to query all items that have text that equals 'Super'. I am trying to do this currently by doing this:
Item.objects.filter(Q(data__areas__texts__text='Super'))
Django debug toolbar is reporting the query used for this is:
WHERE "item"."data" #> ARRAY['areas', 'texts', 'text'] = '"Super"'
But I'm not getting back any matching results. How can I query this using Django? If it's not possible in Django, then how can I query this in Postgresql?
Here's an example of the contents of the data field:
{
"areas": [
{
"texts": [
{
"text": "Super"
}
]
},
{
"texts": [
{
"text": "Duper"
}
]
}
]
}
Try Item.objects.filter(data__areas__0__texts__0__text='Super')
This is not an exact answer, but it may clarify some jsonb filter features; also read the Django docs.
I am not sure what you want to achieve with this structure, but I was only able to get the desired result with a strange raw query, which can look like this:
Item.objects.raw("SELECT id, data FROM (SELECT id, data, jsonb_array_elements(\"table_name\".\"data\" #> '{areas}') as areas_data from \"table_name\") foo WHERE areas_data #> '{texts}' @> '[{\"text\": \"Super\"}]'")
Don't forget to change table_name in the query (in your case it should be yourappname_item).
I'm not sure you can use this query in real programs, but it may help you find a better solution.
Also, there is a very good intro to jsonb query syntax.
Hope it will help you
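The original filter finds nothing because a key-only path such as areas, texts, text cannot step through the arrays without explicit indexes. An alternative worth trying is jsonb containment (the @> operator), which Django's JSONField exposes as the contains lookup, e.g. Item.objects.filter(data__contains=pattern) with the pattern shown below. The following Python sketch emulates containment in a simplified way to show why that pattern matches the sample document; the function is hypothetical and skips some corner cases of PostgreSQL's real rules.

```python
def jsonb_contains(left, right):
    """Simplified emulation of PostgreSQL jsonb containment (left @> right)."""
    if isinstance(right, dict):
        # Every key/value pair on the right must be contained on the left.
        return isinstance(left, dict) and all(
            key in left and jsonb_contains(left[key], value)
            for key, value in right.items()
        )
    if isinstance(right, list):
        # Every right-hand element must be contained in some left-hand element;
        # order and position in the array do not matter.
        return isinstance(left, list) and all(
            any(jsonb_contains(candidate, wanted) for candidate in left)
            for wanted in right
        )
    # Scalars are compared by plain equality in this sketch.
    return left == right


# Contents of the data field, copied from the question.
data = {
    "areas": [
        {"texts": [{"text": "Super"}]},
        {"texts": [{"text": "Duper"}]},
    ]
}

# Containment pattern: "some area has some text equal to 'Super'".
pattern = {"areas": [{"texts": [{"text": "Super"}]}]}

print(jsonb_contains(data, pattern))
print(jsonb_contains(data, {"areas": [{"texts": [{"text": "Missing"}]}]}))
```

Unlike the indexed lookup data__areas__0__texts__0__text, the containment pattern matches regardless of where in the arrays the text appears.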

MySQL query to retrieve tabular data in json format

I have a table like the one below in a MySQL database:
user-name mail
ganesh g@g.com
gani gani@gani.com
gan gan@gan.com
I need a query to retrieve the above table in JSON format.
Example:
[{
user-name:"ganesh",
mail:"g@g.com"
},
{
user-name:"gani",
mail:"gani@gani.com"
},
{
user-name:"gan",
mail:"gan@gan.com"
}
]
I need help doing the above.
It's not recommended to do such things in the DBMS; do it in the script that loads the data instead. If you're wrapping legacy code you can't edit, wrap it with more code to format the data.
If all that fails do something like this: http://www.thomasfrank.se/mysql_to_json.html
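-- Builds the JSON array by hand; keys and values are double-quoted so the result is valid JSON.
-- Assumes a table users with columns username and email, and values containing no quotes or backslashes.
SELECT
CONCAT('[',
GROUP_CONCAT(
CONCAT('{"username":"', username, '","email":"', email, '"}')
)
,']')
AS json FROM users;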
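The first suggestion above, formatting the rows in the script that loads the data, can be sketched in Python as follows. To keep the example self-contained it uses an in-memory SQLite database as a stand-in for MySQL; the table name users and the helper name are assumptions, while the column names and rows come from the question. With a MySQL driver that follows the Python DB-API, the same helper should work on its cursor.

```python
import json
import sqlite3


def rows_to_json(cursor):
    """Serialize all rows of an executed DB-API cursor as a JSON array of objects."""
    columns = [column[0] for column in cursor.description]
    return json.dumps([dict(zip(columns, row)) for row in cursor.fetchall()])


# In-memory SQLite stands in for the MySQL table described in the question.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE users ("user-name" TEXT, mail TEXT)')
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("ganesh", "g@g.com"), ("gani", "gani@gani.com"), ("gan", "gan@gan.com")],
)

cursor = conn.execute('SELECT "user-name", mail FROM users ORDER BY rowid')
result = rows_to_json(cursor)
print(result)
```

Letting the json module do the quoting avoids the escaping problems that hand-built string concatenation in SQL runs into.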