Azure Data Factory: ingest CSV with a full stop in the header

I have a Copy Data activity in Azure Data Factory which reads in a CSV file. This CSV file is produced by a third party, so I cannot change it. One of the headings has a full stop (or period) in it: 'foo.bar'. When I run the activity I get the error message:
Failure happened on 'Source' side. ErrorCode=JsonInvalidDataFormat,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error occurred when deserializing source JSON file 'foo;bar'.
Check if the data is in valid JSON object format.,Source=Microsoft.DataTransfer.ClientLibrary,'
The CSV looks like this:
state,sys_updated_on,foo.bar,sys_id
New,03/06/2021 12:42:18,S Services,xxx
Resolved,03/06/2021 12:35:06,MS Services,yyy
New,03/06/2021 12:46:18,S Services,zzz
The source dataset looks like this:
{
    "name": "my_dataset",
    "properties": {
        "linkedServiceName": {
            "referenceName": "my_linked_service",
            "type": "LinkedServiceReference"
        },
        "annotations": [],
        "type": "DelimitedText",
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "fileName": "i.csv",
                "folderPath": "Temp/exports/widgets",
                "container": "blah"
            },
            "columnDelimiter": ",",
            "escapeChar": "\\",
            "firstRowAsHeader": true,
            "quoteChar": "\""
        },
        "schema": []
    }
}

You can skip the header like this (see the sketch below):
Don't set first row as header in the dataset.
Set Skip line count to 1 in the copy activity's source settings.
Alternatively, you could use a Data Flow with a Derived Column transformation to create a new schema that replaces the column foo.bar.
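A minimal sketch of those two changes, assuming the same dataset shown above (treat the exact property names as assumptions to verify against your factory):

In the dataset, turn the header row off:

    "typeProperties": {
        "location": {
            "type": "AzureBlobStorageLocation",
            "fileName": "i.csv",
            "folderPath": "Temp/exports/widgets",
            "container": "blah"
        },
        "columnDelimiter": ",",
        "escapeChar": "\\",
        "firstRowAsHeader": false,
        "quoteChar": "\""
    }

In the copy activity, skip the original header line on the source side:

    "source": {
        "type": "DelimitedTextSource",
        "formatSettings": {
            "type": "DelimitedTextReadSettings",
            "skipLineCount": 1
        }
    }

With the header row skipped, the copy activity no longer tries to use foo.bar as a column name.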

Related

Can Filebeat parse JSON fields instead of the whole JSON object into Kibana?

I am able to get a single JSON object into Kibana by having this in the filebeat.yml file:
output.elasticsearch:
  hosts: ["localhost:9200"]
How can I get the individual elements in the JSON string? Say I wanted to compare all the "pseudorange" fields of all my JSON objects. How would I:
Select the "pseudorange" field from all my JSON messages to compare them.
Compare them visually in Kibana. At the moment I can't even find the message, let alone the individual fields, in the visualisation tab...
I have heard of people using Logstash to parse the string somehow, but is there no way of doing this simply with Filebeat? If there isn't, then what do I do with Logstash to help filter the individual fields in the JSON, instead of having my message be just one big JSON string that I cannot interact with?
I get the following output from output.console; note I am putting some information in <> to hide it:
"#timestamp": "2021-03-23T09:37:21.941Z",
"#metadata": {
"beat": "filebeat",
"type": "doc",
"version": "6.8.14",
"truncated": false
},
"message": "{\n\t\"Signal_data\" : \n\t{\n\t\t\"antenna type:\" : \"GPS\",\n\t\t\"frequency type:\" : \"GPS\",\n\t\t\"position x:\" : 0.0,\n\t\t\"position y:\" : 0.0,\n\t\t\"position z:\" : 0.0,\n\t\t\"pseudorange:\" : 20280317.359730639,\n\t\t\"pseudorange_error:\" : 0.0,\n\t\t\"pseudorange_rate:\" : -152.02620448094211,\n\t\t\"svid\" : 18\n\t}\n}\u0000",
"source": <ip address>,
"log": {
"source": {
"address": <ip address>
}
},
"input": {
"type": "udp"
},
"prospector": {
"type": "udp"
},
"beat": {
"name": <ip address>,
"hostname": "ip-<ip address>",
"version": "6.8.14"
},
"host": {
"name": "ip-<ip address>",
"os": {
<ubuntu info>
},
"id": <id>,
"containerized": false,
"architecture": "x86_64"
},
"meta": {
"cloud": {
<cloud info>
}
}
}
In Filebeat, you can leverage the decode_json_fields processor in order to decode a JSON string and add the decoded fields into the root object:
processors:
  - decode_json_fields:
      fields: ["message"]
      process_array: false
      max_depth: 2
      target: ""
      overwrite_keys: true
      add_error_key: false
Credit to Val for this. His answer worked; however, as he suggested, my JSON string had a \000 at the end, which stops it from being valid JSON and prevented the decode_json_fields processor from working as it should...
Upgrading to version 7.12 of Filebeat (also ensure version 7.12 of Elasticsearch and Kibana because mismatched versions between them can cause issues) allows us to use the script processor: https://www.elastic.co/guide/en/beats/filebeat/current/processor-script.html.
Credit to Val here again, this script removed the null terminator:
  - script:
      lang: javascript
      id: trim
      source: >
        function process(event) {
          event.Put("message", event.Get("message").trim());
        }
After the null terminator was removed, the decode_json_fields processor did its job as Val suggested, and I was able to extract the individual elements of the JSON field, which allowed the Kibana visualisation to look at the elements I wanted!
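For reference, here are the two processors combined into one minimal processors section, a sketch assuming the script processor runs first so the null terminator is stripped before decoding:

processors:
  - script:
      lang: javascript
      id: trim
      source: >
        function process(event) {
          event.Put("message", event.Get("message").trim());
        }
  - decode_json_fields:
      fields: ["message"]
      process_array: false
      max_depth: 2
      target: ""
      overwrite_keys: true
      add_error_key: false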

Loading nested array into BigQuery from public Google Cloud dataset

I'm trying to load a public dataset from Google Cloud into BigQuery (quickdraw_dataset). The data is in JSON format as below:
{
    "key_id": "5891796615823360",
    "word": "nose",
    "countrycode": "AE",
    "timestamp": "2017-03-01 20:41:36.70725 UTC",
    "recognized": true,
    "drawing": [[[129,128,129,129,130,130,131,132,132,133,133,133,133,...]]]
}
The issue that I'm running into is that the "drawing" field is a nested array. I gather from reading other posts that you can't read arrays into BigQuery? This post suggests that one way around this issue is to read in the array as a string. But, when I use the following schema, I get this error:
[
    {
        "name": "key_id",
        "type": "STRING"
    },
    {
        "name": "word",
        "type": "STRING"
    },
    {
        "name": "countrycode",
        "type": "STRING"
    },
    {
        "name": "timestamp",
        "type": "STRING"
    },
    {
        "name": "recognized",
        "type": "BOOLEAN"
    },
    {
        "name": "drawing",
        "type": "STRING"
    }
]
Error while reading data, error message: JSON parsing error in row starting at position 0: Array specified for non-repeated field: drawing.
Is there a way to read this dataset into BigQuery?
Thanks in advance!
Load each whole line as a single-column CSV row (using a field delimiter, such as tab, that doesn't appear in the data), then parse the JSON inside BigQuery.
Load:
bq load -F '\t' temp.eraser gs://quickdraw_dataset/full/simplified/eraser.ndjson row
Query:
SELECT JSON_EXTRACT_SCALAR(row, '$.countrycode') a
, JSON_EXTRACT_SCALAR(row, '$.word') b
, JSON_EXTRACT_ARRAY(row, '$.drawing')[OFFSET(0)] c
FROM temp.eraser
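Going one step further, a sketch (assuming the same temp.eraser table and row column) that unnests the drawing array so each stroke lands on its own output row:

SELECT
  JSON_EXTRACT_SCALAR(row, '$.word') AS word,
  stroke
FROM temp.eraser,
  UNNEST(JSON_EXTRACT_ARRAY(row, '$.drawing')) AS stroke

Each stroke is still a JSON string, which you can keep parsing with the same JSON functions if needed.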

Copy JSON Array data from REST data factory to Azure Blob as is

I have used REST to get data from an API, and the JSON output contains arrays. When I try to copy the JSON as-is to blob using a copy activity, I only get the first object's data and the rest is ignored.
The documentation says we can copy JSON as-is by skipping the schema section on both the dataset and the copy activity. I did that, and I am getting the output below.
https://learn.microsoft.com/en-us/azure/data-factory/connector-rest#export-json-response-as-is
I tried the copy activity without a schema, using the first row as header, and output the files to blob as .json and .txt.
Sample REST output:
{
    "totalPages": 500,
    "firstPage": true,
    "lastPage": false,
    "numberOfElements": 50,
    "number": 0,
    "totalElements": 636,
    "columns": {
        "dimension": {
            "id": "variables/page",
            "type": "string"
        },
        "columnIds": [
            "0"
        ]
    },
    "rows": [
        {
            "itemId": "1234",
            "value": "home",
            "data": [
                65
            ]
        },
        {
            "itemId": "1235",
            "value": "category",
            "data": [
                92
            ]
        },
    ],
    "summaryData": {
        "totals": [
            157
        ],
        "col-max": [
            123
        ],
        "col-min": [
            1
        ]
    }
}
The blob output as text is below, which is only the first object's data:
totalPages,firstPage,lastPage,numberOfElements,number,totalElements
500,True,False,50,0,636
If you want to write the JSON response as-is, you can use an HTTP connector. However, please note that the HTTP connector doesn't support pagination.
If you want to keep using the REST connector and write a CSV file as output, can you please specify how you want the nested objects and arrays to be written?
In CSV files, we cannot write arrays. You could always use a custom activity or an Azure Function activity to call the REST API, parse it the way you want, and write to a CSV file.
Hope this helps.
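For the "copy the JSON response as-is" route, a rough sketch of the copy activity: the key point is that neither dataset defines a schema or mapping and the sink dataset is JSON format. The dataset names here are placeholders, and the property names should be checked against your factory:

{
    "name": "CopyRestResponseAsIs",
    "type": "Copy",
    "inputs": [ { "referenceName": "RestApiDataset", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BlobJsonDataset", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "RestSource" },
        "sink": { "type": "JsonSink" }
    }
}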

Defining Azure Stream Analytics IoT Hub input source through PowerShell

I'm trying to write a PowerShell script that creates a new Stream Analytics job in my Azure account, with an IoT Hub as the input source and a blob storage account as the output.
To do so, I'm using the AzureRM New-StreamAnalyticsJob command and JSON files.
My problem is: I have not seen any documentation or example of a JSON file where the input source is an IoT Hub, only an Event Hub.
What parameters do I need to give in the JSON file? Can anyone show an example of a JSON file for a Stream Analytics job whose input source is an IoT Hub?
I got the answer eventually: the required field I had to add to the inputs Olivier posted earlier here is:
"endpoint": "messages/events"
I added it under the DataSource Properties section, and it works fine!
Thanks Olivier
To come back to the error message you are seeing: to add to Olivier's sample, you need a property named endpoint which corresponds to the endpoint in IoT Hub. If you are looking for telemetry messages, this will be:
"endpoint": "messages/events"
This can be found in the schema for Azure ARM: https://github.com/Azure/azure-rest-api-specs/blob/current/specification/streamanalytics/resource-manager/Microsoft.StreamAnalytics/2016-03-01/examples/Input_Create_Stream_IoTHub_Avro.json
So to complete Olivier's example, when using API version '':
"Inputs": [
{
"Name": "Hub",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('asaConsumerGroup')]",
"iotHubNamespace": "[parameters('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]",
"endpoint": "messages/events"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
],
That'd look like the following for the inputs part of the ASA resource:
"Inputs": [
{
"Name": "IoTHubStream",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('CGName')]",
"iotHubNamespace": "[variables('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
]
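Both fragments above are ARM-template resources (note the [variables(...)] and [parameters(...)] expressions), so from PowerShell a common way to apply them is to embed the Inputs block in a full Stream Analytics job template and deploy it; a sketch, where the resource group and file names are assumptions:

New-AzureRmResourceGroupDeployment `
    -ResourceGroupName "my-resource-group" `
    -TemplateFile ".\asa-job.template.json" `
    -TemplateParameterFile ".\asa-job.parameters.json"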

Creating Ext.data.Model using JSON config

In the app we're developing, we create all the JSON on the server side using dynamically generated configs (JSON objects). We use that for stores (and other stuff, like GUIs), with a dynamically generated list of data fields.
With a JSON like this:
{
    "proxy": {
        "type": "rest",
        "url": "/feature/163",
        "timeout": 600000
    },
    "baseParams": {
        "node": "163"
    },
    "fields": [
        { "name": "id", "type": "int" },
        { "name": "iconCls", "type": "auto" },
        { "name": "text", "type": "string" },
        { "name": "name", "type": "auto" }
    ],
    "xtype": "jsonstore",
    "autoLoad": true,
    "autoDestroy": true
}, ...
Ext will gently create an "implicit model" which I'll be able to work with, load into forms, save, delete, etc.
What I want is to specify through a JSON config not the fields, but the model itself. Is this possible?
Something like:
{
    model: {
        name: 'MiClass',
        extends: 'Ext.data.Model',
        "proxy": {
            "type": "rest",
            "url": "/feature/163",
            "timeout": 600000
        },
        etc...
    },
    "autoLoad": true,
    "autoDestroy": true
}, ...
That way I would be able to create a whole JSON from the server without having to glue stuff using JS statements on the client side.
Best regards,
I don't see why not. The syntax to create a model class is similar to that of stores and components:
Ext.define('MyApp.model.MyClass', {
    extend: 'Ext.data.Model',
    fields: [..]
});
So if you take this apart, you could call Ext.define(className, config), where className is a string and config is a JSON object, and both are generated on the server (see the sketch below).
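A minimal sketch of that idea, assuming a hypothetical /model/163 endpoint that returns something like { "className": "MyApp.model.MyClass", "config": { "extend": "Ext.data.Model", "fields": [...], "proxy": {...} } }:

// Fetch the server-built model definition and define the class at runtime
Ext.Ajax.request({
    url: '/model/163', // hypothetical endpoint returning the model definition
    success: function (response) {
        var def = Ext.decode(response.responseText); // { className: '...', config: {...} }
        Ext.define(def.className, def.config);       // creates the model class from the server config
        // the new class can now be referenced by name, e.g. as a store's "model"
    }
});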
There's no way to achieve what I want.
The only way you can do it is by defining the fields of the Ext.data.Store and having it generate the implicit model from the fields configuration.