BigQuery: importing a map's JSON data into a table

I have a JSON file with data from a POJO, which is basically:
String
Map<String, Set<OBJ>>
where OBJ has:
int
int
String
timestamp
Here is a sample row:
{"key":"123qwe123","mapData":{"3539":[{"id":36,"type":1,"os":"WINDOWS","lastSeenDate":"2015-06-03 22:46:38 UTC"}],"16878":[{"id":36,"type":1,"os":"WINDOWS","lastSeenDate":"2015-06-03 22:26:34 UTC"}],"17312":[{"id":36,"type":1,"os":"WINDOWS","lastSeenDate":"2015-06-03 22:26:48 UTC"}]}}
I tried the following schema, but it's not working:
[
  {
    "name": "key",
    "type": "string"
  },
  {
    "name": "mapData",
    "type": "record",
    "mode": "repeated",
    "fields": [
      {
        "name": "some_id",
        "type": "record",
        "mode": "repeated",
        "fields": [
          {
            "name": "id",
            "type": "integer",
            "mode": "nullable"
          },
          {
            "name": "type",
            "type": "integer",
            "mode": "nullable"
          },
          {
            "name": "os",
            "type": "string",
            "mode": "nullable"
          },
          {
            "name": "lastSeenDate",
            "type": "timestamp",
            "mode": "nullable"
          }
        ]
      }
    ]
  }
]
When I run it I get: repeated record must be imported as a JSON array
I know something is up with the schema, but I haven't figured it out yet.
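For context, a REPEATED RECORD column expects a JSON array at that position, so a map keyed by arbitrary IDs generally has to be flattened into an array of key/value records before loading. A minimal sketch of that transformation, assuming the flattened column keeps the name mapData and using illustrative field names some_id and entries:
# Sketch: flatten the map-style JSON into the array shape a REPEATED RECORD
# column expects. The field names "some_id" and "entries" are illustrative only.
import json

with open("input.json") as src, open("flattened.json", "w") as dst:
    for line in src:
        if not line.strip():
            continue
        row = json.loads(line)
        row["mapData"] = [
            {"some_id": key, "entries": values}   # one record per map key
            for key, values in row["mapData"].items()
        ]
        dst.write(json.dumps(row) + "\n")         # newline-delimited JSON for BigQuery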

Related

Convert Nested JSON Schema to PySpark Schema

I have a schema which has nested fields. When I try to convert it with:
jtopy = json.dumps(schema_message['SchemaDefinition'])  # json.dumps takes a dictionary as input and returns a string as output
print(jtopy)
dict_json = json.loads(jtopy)  # json.loads takes a string as input and returns a dictionary as output
print(dict_json)
new_schema = StructType.fromJson(dict_json)
print(new_schema)
It returns the error:
return StructType([StructField.fromJson(f) for f in json["fields"]])
TypeError: string indices must be integers
The schema definition described below is what I'm passing:
{
  "type": "record",
  "name": "tags",
  "namespace": "com.tigertext.data.events.tags",
  "doc": "Schema for tags association to accounts (role,etc..)",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record",
        "name": "eventHeader",
        "namespace": "com.tigertext.data.events",
        "doc": "Metadata about the event record.",
        "fields": [
          {
            "name": "topic",
            "type": "string",
            "doc": "The topic this record belongs to. e.g. messages"
          },
          {
            "name": "server",
            "type": "string",
            "doc": "The server that generated this event. e.g. xmpp-07"
          },
          {
            "name": "service",
            "type": "string",
            "doc": "The service that generated this event. e.g. erlang-producer"
          },
          {
            "name": "environment",
            "type": "string",
            "doc": "The environment this record belongs to. e.g. dev, prod"
          },
          {
            "name": "time",
            "type": "long",
            "doc": "The time in epoch this record was produced."
          }
        ]
      }
    },
    {
      "name": "eventType",
      "type": {
        "type": "enum",
        "name": "eventType",
        "symbols": [
          "CREATE",
          "UPDATE",
          "DELETE",
          "INIT"
        ]
      },
      "doc": "event type"
    },
    {
      "name": "tagId",
      "type": "string",
      "doc": "Tag ID for the tag"
    },
    {
      "name": "orgToken",
      "type": "string",
      "doc": "org ID"
    },
    {
      "name": "tagName",
      "type": "string",
      "doc": "name of the tag"
    },
    {
      "name": "colorId",
      "type": "string",
      "doc": "color id"
    },
    {
      "name": "colorName",
      "type": "string",
      "doc": "color name"
    },
    {
      "name": "colorValue",
      "type": "string",
      "doc": "color value e.g. #C8C8C8"
    },
    {
      "name": "entities",
      "type": [
        "null",
        {
          "type": "array",
          "items": {
            "type": "record",
            "name": "entity",
            "fields": [
              {
                "name": "entityToken",
                "type": "string"
              },
              {
                "name": "entityType",
                "type": "string"
              }
            ]
          }
        }
      ],
      "default": null
    }
  ]
}
Above is the schema of the Kafka topic that I want to parse into a PySpark schema.
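As a side note on the error itself: if schema_message['SchemaDefinition'] is already a JSON string, then json.dumps followed by json.loads simply round-trips it back to a string, so dict_json ends up being a str and json["fields"] triggers the string-indices error. A minimal sketch of parsing the string once, assuming that is the situation; note also that the definition above is an Avro schema, which StructType.fromJson (expecting Spark's own schema-JSON layout) will not accept as-is:
# Sketch: if SchemaDefinition is a JSON string, parse it once instead of
# dumps()-then-loads(), which would just hand back the original string.
import json

schema_definition = schema_message['SchemaDefinition']  # assumed to be a JSON string
dict_json = json.loads(schema_definition)                # now a real dict
print(type(dict_json))                                   # dict, not str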

Dynamic avro schema creation

How to create an Avro schema for the below JSON code?
{
  "_id": "xxxx",
  "name": "xyz",
  "data": {
    "abc": {
      "0001": "gha",
      "0002": "bha"
    }
  }
}
Here, the key/value pairs
"0001": "gha",
"0002": "bha"
would be dynamic.
Maybe a schema like this does what you want?
{
  "type": "record",
  "name": "MySchema",
  "namespace": "my.name.space",
  "fields": [
    {
      "name": "_id",
      "type": "string"
    },
    {
      "name": "name",
      "type": "string"
    },
    {
      "name": "data",
      "type": {
        "type": "record",
        "name": "Data",
        "fields": [
          {
            "name": "abc",
            "type": {
              "type": "map",
              "values": "string"
            }
          }
        ]
      }
    }
  ]
}
It's not dynamic, but you can add as many key-value pairs to the map as you like. Field names starting with a numeric value aren't allowed in Avro.
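To sanity-check the map approach, here is a small sketch assuming the third-party fastavro package is available; the record below mirrors the sample document from the question:
# Sketch: validate a sample record against the map-based schema above.
# Requires the fastavro package (pip install fastavro).
from fastavro import parse_schema
from fastavro.validation import validate

schema = {
    "type": "record",
    "name": "MySchema",
    "namespace": "my.name.space",
    "fields": [
        {"name": "_id", "type": "string"},
        {"name": "name", "type": "string"},
        {"name": "data", "type": {
            "type": "record",
            "name": "Data",
            "fields": [
                {"name": "abc", "type": {"type": "map", "values": "string"}}
            ]
        }}
    ]
}

record = {
    "_id": "xxxx",
    "name": "xyz",
    "data": {"abc": {"0001": "gha", "0002": "bha"}},  # arbitrary keys are fine in a map
}

validate(record, parse_schema(schema))  # raises a ValidationError if the record does not match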

How to add a field in JSON structure automatically

I am using BigQuery and getting the schema details using the below command:
bq show --schema --format=prettyjson [PROJECT_ID]:[DATASET].[TABLE]
The above command gives me something like the below structure
[
  {
    "mode": "NULLABLE",
    "name": "flt_date",
    "type": "DATE"
  },
  {
    "mode": "NULLABLE",
    "name": "month",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "year",
    "type": "INTEGER"
  }
]
I would like to add an extra field called "description" to each element of the array. Can anyone please help me with how to do this? Any script or command would be helpful. The new structure would be:
[
  {
    "description": " ",
    "mode": "NULLABLE",
    "name": "flt_date",
    "type": "DATE"
  },
  {
    "description": " ",
    "mode": "NULLABLE",
    "name": "month",
    "type": "INTEGER"
  },
  {
    "description": " ",
    "mode": "NULLABLE",
    "name": "year",
    "type": "INTEGER"
  }
]
Note: my existing table doesn't have any descriptions for the column names, which I need to add later using Terraform. Any suggestions would be greatly appreciated.
Iterate through each element and use .update():
schema = [  # schema as returned by bq show --schema
    {
        "mode": "NULLABLE",
        "name": "flt_date",
        "type": "DATE"
    },
    {
        "mode": "NULLABLE",
        "name": "month",
        "type": "INTEGER"
    },
    {
        "mode": "NULLABLE",
        "name": "year",
        "type": "INTEGER"
    },
]

for each in schema:
    each.update({"description": " "})
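If the schema lives in a file (for example, one produced by redirecting the bq show --schema output), a minimal end-to-end sketch with illustrative file names:
# Sketch: add an empty "description" to every column in a schema file.
# File names are illustrative; adjust to your environment.
import json

with open("schema.json") as f:
    schema = json.load(f)

for field in schema:
    field.setdefault("description", " ")  # don't clobber existing descriptions

with open("schema_with_descriptions.json", "w") as f:
    json.dump(schema, f, indent=2)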

How to parse a dynamic JSON - Power Automate

I'm getting an HTTP response from Azure Log Analytics; the response is JSON like this:
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        {
          "name": "TimeGenerated",
          "type": "datetime"
        },
        {
          "name": "DestinationIP",
          "type": "string"
        },
        {
          "name": "DestinationUserName",
          "type": "string"
        },
        {
          "name": "country_name",
          "type": "string"
        },
        {
          "name": "country_iso_code",
          "type": "string"
        },
        {
          "name": "AccountCustomEntity",
          "type": "string"
        }
      ],
      "rows": [
        [
          "2021-05-17T14:07:01.878Z",
          "158.000.000.33",
          "luis",
          "United States",
          "US",
          "luis"
        ]
      ]
    }
  ]
}
I will never get the same columns, and sometimes I will get more rows, with data like this:
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        {
          "name": "Account",
          "type": "string"
        },
        {
          "name": "Computer",
          "type": "string"
        },
        {
          "name": "IpAddress",
          "type": "string"
        },
        {
          "name": "AccountType",
          "type": "string"
        },
        {
          "name": "Activity",
          "type": "string"
        },
        {
          "name": "LogonTypeName",
          "type": "string"
        },
        {
          "name": "ProcessName",
          "type": "string"
        },
        {
          "name": "StartTimeUtc",
          "type": "datetime"
        },
        {
          "name": "EndTimeUtc",
          "type": "datetime"
        },
        {
          "name": "ConnectinCount",
          "type": "long"
        },
        {
          "name": "timestamp",
          "type": "datetime"
        },
        {
          "name": "AccountCustomEntity",
          "type": "string"
        },
        {
          "name": "HostCustomEntity",
          "type": "string"
        },
        {
          "name": "IPCustomEntity",
          "type": "string"
        }
      ],
      "rows": [
        [
          "abc\\abc",
          "EQ-DC02.abc.LOCAL",
          "0.0.0.0",
          "User",
          "4624 - An account was successfully logged on.",
          "10 - RemoteInteractive",
          "C:\\Windows\\System32\\svchost.exe",
          "2021-05-17T15:02:25.457Z",
          "2021-05-17T15:02:25.457Z",
          2,
          "2021-05-17T15:02:25.457Z",
          "abc\\abc",
          "EQ-DC02.abc.LOCAL",
          "0.0.0.0"
        ],
        [
          "abc\\eona",
          "EQPD-SW01.abc.LOCAL",
          "0.0.0.0",
          "User",
          "4624 - An account was successfully logged on.",
          "10 - RemoteInteractive",
          "C:\\Windows\\System32\\svchost.exe",
          "2021-05-17T15:21:45.993Z",
          "2021-05-17T15:21:45.993Z",
          1,
          "2021-05-17T15:21:45.993Z",
          "abc\\abc",
          "EQPD-SW01.abc.LOCAL",
          "0.0.0.0"
        ]
      ]
    }
  ]
}
I'm using Power Automate to parse this kind of JSON into an object, or to build a response from it.
The question is: how can I parse these "Columns" and "Rows" into an object?
A similar discussion happened in the community forum, and the solution identified there was to
parse the JSON, transform it to XML, and then search for keys with XPath in Flow.
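Outside of Power Automate, the transformation itself is just pairing each row with the column names. A minimal Python sketch of the idea (the response variable stands in for the already-parsed HTTP body shown above):
# Sketch: turn the columns/rows layout into a list of row objects.
def rows_to_objects(response):
    objects = []
    for table in response["tables"]:
        names = [column["name"] for column in table["columns"]]
        for row in table["rows"]:
            objects.append(dict(zip(names, row)))  # one dict per row, keyed by column name
    return objects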

Importing a nested JSON as a table with multiple nesting

In our data we have JSON fields that include repeated sections, as well as arbitrarily deep nesting (the samples I have so far are quite simplistic). After seeing BQ repeated fields and records, I decided to try restructuring the data into repeated record fields, since our use case is analytics, and I then wanted to test different use cases for the data to see which approach is more efficient (time/cost/difficulty) for the analysis we intend to do on it. I have created a sample JSON record that I want to upload to BQ, which uses all the features I think we would need (I have validated it using http://jsonlint.com/):
{
  "aid": "6dQcrgMVS0",
  "hour": "2016042723",
  "unixTimestamp": "1461814784",
  "browserId": "BdHOHp2aL9REz9dXVeKDaxdvefE3Bgn6NHZcDQKeuC67vuQ7PBIXXJda3SOu",
  "experienceId": "EXJYULQOXQ05",
  "experienceVersion": "1.0",
  "pageRule": "V1XJW61TPI99UWR",
  "userSegmentRule": "67S3YVMB7EMQ6LP",
  "branch": [{
    "branchId": "1",
    "branchType": "userSegments",
    "itemId": "userSegment67S3YVMB7EMQ6LP",
    "headerId": "null",
    "itemMethod": "null"
  }, {
    "branchId": "1",
    "branchType": "userSegments",
    "itemId": "userSegment67S3YVMB7EMQ6LP",
    "headerId": "null",
    "itemMethod": "null"
  }],
  "event": [{
    "eventId": "546",
    "eventName": "testEvent",
    "eventDetails": [{
      "key": "a",
      "value": "1"
    }, {
      "key": "b",
      "value": "2"
    }, {
      "key": "c",
      "value": "3"
    }]
  }, {
    "eventId": "547",
    "eventName": "testEvent2",
    "eventDetails": [{
      "key": "d",
      "value": "4"
    }, {
      "key": "e",
      "value": "5"
    }, {
      "key": "f",
      "value": "6"
    }]
  }]
}
I am using the BQ interface to upload this JSON into a table with the following structure:
[
  {
    "name": "aid",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "hour",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "unixTimestamp",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "browserId",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "experienceId",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "experienceVersion",
    "type": "FLOAT",
    "mode": "NULLABLE"
  },
  {
    "name": "pageRule",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "userSegmentRule",
    "type": "STRING",
    "mode": "NULLABLE"
  },
  {
    "name": "branch",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "branchId",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "branchType",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "itemId",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "headerId",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "itemMethod",
        "type": "STRING",
        "mode": "NULLABLE"
      }
    ]
  },
  {
    "name": "event",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      {
        "name": "evenId",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "eventName",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "eventDetails",
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [
          {
            "name": "key",
            "type": "STRING",
            "mode": "NULLABLE"
          },
          {
            "name": "value",
            "type": "STRING",
            "mode": "NULLABLE"
          }
        ]
      }
    ]
  }
]
My jobs fail with:
JSON parsing error in row starting at position 0 in <file_id>. Expected key (error code: invalid)
It is possible that I can't have multiple levels of nesting in a table, but the error reads more as if there were an issue with parsing the JSON itself. I was able to generate and successfully import a JSON file with a simple repeated record (see example below):
{
  "eventId": "546",
  "eventName": "testEvent",
  "eventDetails": [{
    "key": "a",
    "value": "1"
  }, {
    "key": "b",
    "value": "2"
  }, {
    "key": "c",
    "value": "3"
  }]
}
Any advice is appreciated.
There doesn't seem to be anything problematic with your schema, so BigQuery should be able to load your data with your schema.
First, make sure you are uploading newline-delimited JSON to BigQuery. Your example row has many newline characters in the middle of your JSON row, and the parser is trying to interpret each line as a separate JSON row.
Second, it looks like your schema has the key "evenId" in the "event" record, but your example row has the key "eventId".
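As an illustration of the first point, a small sketch that collapses a pretty-printed record into the single-line, newline-delimited form BigQuery's JSON loader expects (file names are illustrative):
# Sketch: convert a pretty-printed JSON record into newline-delimited JSON
# (one record per line), the format required for BigQuery JSON loads.
import json

with open("record_pretty.json") as src, open("record_ndjson.json", "w") as dst:
    record = json.load(src)               # the whole pretty-printed object
    dst.write(json.dumps(record) + "\n")  # re-serialized onto a single line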