I'm trying to load a public dataset from Google Cloud into BigQuery (quickdraw_dataset). The data is in JSON format as below:
{
  "key_id": "5891796615823360",
  "word": "nose",
  "countrycode": "AE",
  "timestamp": "2017-03-01 20:41:36.70725 UTC",
  "recognized": true,
  "drawing": [[[129,128,129,129,130,130,131,132,132,133,133,133,133,...]]]
}
The issue I'm running into is that the "drawing" field is a nested array, and I gather from other posts that BigQuery can't load nested arrays like this. One suggestion I found is to read the array in as a string. But when I use the following schema, I get the error shown below it:
[
  {
    "name": "key_id",
    "type": "STRING"
  },
  {
    "name": "word",
    "type": "STRING"
  },
  {
    "name": "countrycode",
    "type": "STRING"
  },
  {
    "name": "timestamp",
    "type": "STRING"
  },
  {
    "name": "recognized",
    "type": "BOOLEAN"
  },
  {
    "name": "drawing",
    "type": "STRING"
  }
]
Error while reading data, error message: JSON parsing error in row starting at position 0: Array specified for non-repeated field: drawing.
Is there a way to read this dataset into BigQuery?
Thanks in advance!
Load each whole row as a single CSV column, then parse inside BigQuery.
Load (a tab delimiter keeps each JSON line intact in one column, since the lines contain commas but no tabs):
bq load --source_format=CSV -F '\t' temp.eraser gs://quickdraw_dataset/full/simplified/eraser.ndjson row
Query:
SELECT JSON_EXTRACT_SCALAR(row, '$.countrycode') AS countrycode
  , JSON_EXTRACT_SCALAR(row, '$.word') AS word
  , JSON_EXTRACT_ARRAY(row, '$.drawing')[OFFSET(0)] AS first_stroke
FROM temp.eraser
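To go further, the remaining fields can be parsed the same way and materialized into a typed table; a sketch in the same GoogleSQL, where the destination table name temp.eraser_parsed is chosen here for illustration:
-- Sketch: parse the staged rows into typed columns (destination name is illustrative).
CREATE TABLE temp.eraser_parsed AS
SELECT
  JSON_EXTRACT_SCALAR(row, '$.key_id') AS key_id,
  JSON_EXTRACT_SCALAR(row, '$.word') AS word,
  JSON_EXTRACT_SCALAR(row, '$.countrycode') AS countrycode,
  SAFE_CAST(JSON_EXTRACT_SCALAR(row, '$.recognized') AS BOOL) AS recognized,
  -- Each array element is one stroke, still a JSON-formatted string.
  JSON_EXTRACT_ARRAY(row, '$.drawing') AS drawing
FROM temp.eraser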
Using MarkLogic version 10.0-4.2, I am trying to validate a simple JSON record against a simple JSON schema.
JSON Schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "SourceSystemName": {
      "type": "string"
    },
    "BatchDtTm": {
      "type": "string"
    },
    "SubjectArea": {
      "type": "string"
    },
    "DocumentType": {
      "type": "string"
    },
    "LastUpdatedDt": {
      "type": "string"
    },
    "required": [
      "SourceSystemName",
      "BatchDtTm",
      "SubjectArea",
      "DocumentType",
      "LastUpdatedDt",
    ]
  }
}
Code being run in Query Console:
let jsonRecord = {"SourceSystemName":"ODH","BatchDtTm":"09/17/21 08:51:48:472723","SubjectArea":"Customer","DocumentType":"Preference","LastUpdatedDt":"09/17/21 03:59:53:629707"};
xdmp.jsonValidate(jsonRecord, cts.doc('/schemas/NewSchema.json').toString());
When I run the above code, I get this error:
XDMP-JSVALIDATEBADSCHEMA: Invalid schema "": ""
I'm not really sure what is 'invalid' about my schema. Can someone offer some insight into what MarkLogic is viewing as 'invalid'?
The second parameter, $schema, is supposed to be the URI of the schema document:
$schema — URI of the JSON schema to use for validation.
You are passing in the stringified schema content instead.
Try:
xdmp.jsonValidate(jsonRecord, '/schemas/NewSchema.json');
And ensure that the schema document is inserted into the Schemas database, not the content database.
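A minimal sketch of inserting the schema document, assuming you run it in Query Console with the Schemas database selected; the schema body is elided here, and the URI must match the one passed to xdmp.jsonValidate:
// Run with the *Schemas* database selected in Query Console,
// otherwise the document lands in the content database.
declareUpdate();
xdmp.documentInsert('/schemas/NewSchema.json', xdmp.toJSON({
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": { /* ... as above ... */ },
  "required": [ /* ... */ ]
}));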
I have a Logic App which runs a stored procedure. That procedure returns some results, which are turned into a CSV table, saved as a blob file, and then e-mailed out. One column in the results is just a Date field, but somewhere along the way Azure has added a timestamp to the values, which I want to remove. How can I do this?
For example: 2018-07-18T00:00:00
Below is the schema from the Parse JSON step.
{
  "properties": {
    "Table1": {
      "items": {
        "properties": {
          "BusinessArea": {
            "type": "string"
          },
          "CaseID": {
            "type": "string"
          },
          "CaseOpenedDate": {
            "type": "string"
          },
          "Lead": {
            "type": "string"
          }
        },
        "type": "object"
      },
      "type": "array"
    }
  },
  "type": "object"
}
I amended the stored procedure to CAST(CaseOpenedDate AS VARCHAR(10)) and that worked: the timestamp is no longer added and the data looks like a normal date.
Thank you Frank Gong for helping me work this out.
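For reference, a minimal sketch of that change inside the procedure's final SELECT; the column names are taken from the Parse JSON schema above, and the table name dbo.Cases is illustrative:
-- Assuming CaseOpenedDate is a DATE column, CAST(... AS VARCHAR(10))
-- yields 'yyyy-MM-dd', so the connector no longer appends T00:00:00.
SELECT
  BusinessArea,
  CaseID,
  CAST(CaseOpenedDate AS VARCHAR(10)) AS CaseOpenedDate,
  [Lead]
FROM dbo.Cases;  -- illustrative table name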
(Cannot summarize the problem in a single statement, hence the ambiguous title)
I create a JSON structure via Angular/TypeScript; when a user interacts with certain parts of the component, the JSON structure gets updated.
Steps
Initially, the JSON under consideration is by default set to the following:
{
  "keyword": {
    "value": "product",
    "type": "main"
  }
}
For example, a user chooses some parameter Name. Once the user completes certain steps in the UI, the JSON structure gets updated to the following:
{
  "keyword": {
    "value": "product",
    "type": "main"
  },
  "Name": {
    "value": " <hasProperty> Name",
    "type": "dataprop"
  }
}
Once the user selects a numeric value for a parameter like dryingTime, the JSON gets updated to the following:
{
  "20": { // WHY WOULD 20 be here?
    "value": "<hasValue> 20",
    "type": "fValue"
  },
  "keyword": {
    "value": "Varnish",
    "type": "main"
  },
  "Name": {
    "value": " <hasProperty> Name",
    "type": "dataprop"
  },
  "dryingTime": {
    "value": " <hasProperty> dryingTime",
    "type": "dataprop"
  }
}
I understand that a JSON object is an unordered data structure. But a previous implementation of something similar worked well: the key here was 20.0 before instead of 20, and it was displayed after dryingTime in my JSON.
The order is critical for me, as I parse all the keys of the above JSON in a for loop and store them in an array, and that array needs to list the keys in the order of user interaction.
Where am I going wrong if I decide to stay with JSON rather than an array to store such interactions?
Yes, JSON object members are unordered; a JSON array is ordered. The reason 20 jumps to the front is JavaScript's property-ordering rule: keys that look like array indices (such as "20") are iterated first, in ascending numeric order, before other string keys, while a key like "20.0" is an ordinary string key and keeps its insertion position.
If you want to keep the order of inserted elements, you could build your JSON like so:
{
  "keyword": {
    "value": "Varnish",
    "type": "main"
  },
  "props": [
    {
      "name": "dryingTime",
      "value": 20
    },
    {
      "name": "anotherOrderedField",
      "value": "fieldValue"
    }
  ]
}
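A sketch of how the component could maintain that structure in TypeScript; the Prop, InteractionState, and addProp names are illustrative, not from the original code:
// An ordered list of props preserves interaction order,
// regardless of what the keys look like.
interface Prop {
  name: string;
  value: string | number;
  type: string;
}

interface InteractionState {
  keyword: { value: string; type: string };
  props: Prop[];
}

const state: InteractionState = {
  keyword: { value: "Varnish", type: "main" },
  props: [],
};

// Each user interaction appends to the array, so iteration order
// is exactly the order of interaction.
function addProp(name: string, value: string | number, type: string): void {
  state.props.push({ name, value, type });
}

addProp("Name", " <hasProperty> Name", "dataprop");
addProp("dryingTime", " <hasProperty> dryingTime", "dataprop");
addProp("20", "<hasValue> 20", "fValue");

const keys = state.props.map(p => p.name); // ["Name", "dryingTime", "20"]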
I'm trying to write a PowerShell script that creates a new Stream Analytics job in my Azure account, with an IoT Hub as the input source and a blob storage account as the output.
To do so, I'm using the AzureRM cmdlet New-AzureRMStreamAnalyticsJob and JSON files.
My problem: I have not found any documentation or example of a JSON file where the input source is an IoT Hub, only an Event Hub.
What parameters do I need to give in the JSON file? Can anyone show an example JSON file with an IoT Hub as the input source for a Stream Analytics job?
I got the answer eventually: the required field I had to add to the inputs Olivier posted (below) is:
"endpoint": "messages/events"
I added it under the DataSource Properties section, and it works fine!
Thanks Olivier
To come back to the error message you are seeing: to add to Olivier's sample, you need a property named endpoint, which corresponds to the endpoint in IoT Hub. If you are looking for telemetry messages, this will be:
"endpoint": "messages/events"
This can be found in the schema for Azure ARM: https://github.com/Azure/azure-rest-api-specs/blob/current/specification/streamanalytics/resource-manager/Microsoft.StreamAnalytics/2016-03-01/examples/Input_Create_Stream_IoTHub_Avro.json
So, to complete Olivier's example when using API version 2016-03-01:
"Inputs": [
{
"Name": "Hub",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('asaConsumerGroup')]",
"iotHubNamespace": "[parameters('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]",
"endpoint": "messages/events"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
],
For reference, Olivier's original sample (before adding endpoint) looked like the following for the inputs part of the ASA resource:
"Inputs": [
{
"Name": "IoTHubStream",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('CGName')]",
"iotHubNamespace": "[variables('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
]
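Once the full job definition containing these inputs is saved to a JSON file, the job can be created with the cmdlet the question mentions; a sketch, where the resource group, job name, and file path are all illustrative:
# All names here are illustrative; substitute your own.
New-AzureRMStreamAnalyticsJob -ResourceGroupName "myResourceGroup" `
    -Name "myStreamingJob" `
    -File "C:\asa\jobDefinition.json"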
I need help with schema extraction by property.
For example, I have a JSON schema:
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "description": "A simple address format",
  "type": "object",
  "properties": {
    "street-name": { "type": "string" },
    "locality": { "type": "string" },
    "region": { "type": "string" },
    "postal-code": { "type": "integer" },
    "country-name": { "type": "string" }
  },
  "required": ["locality", "region", "country-name"]
}
I have a use case where I need to extract the schema corresponding to each property and send it to another service, which will validate it against the value and save it in a database. Here is a sample object I need to send to the other service:
{
  "propertyName": "street-name",
  "value": "19, Canton street",
  "schema": { "type": "string" }
}
The questions are:
How do we extract the schema for a particular property from a given JSON schema?
Given the property path, is there any Node.js module that does this schema extraction, or any other existing solution?
The scenario above is very simple, but it gets complicated with array, anyOf, and oneOf types; a naive lookup for the simple case is sketched below.
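For the flat schema above, extraction is just a property lookup; a sketch in plain Node.js (no module), which deliberately ignores the hard cases, and where the file name ./address-schema.json is illustrative:
// Naive extraction: works only for flat object schemas like the address example.
// Arrays, anyOf/oneOf, $ref, and nested objects need real path resolution.
function extractPropertySchema(schema, propertyName) {
  return (schema.properties || {})[propertyName];
}

const addressSchema = require('./address-schema.json'); // illustrative path
console.log(extractPropertySchema(addressSchema, 'street-name')); // { type: 'string' }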
Thanks in advance! Please let me know if the question is not clear!
sadish