I have a roughly 10 GB JSON file in which each line contains exactly one JSON document. What is the best way to convert this to Avro? Ideally I would like to keep several million documents (say 10M) per file. I believe Avro supports storing multiple records in the same file.
You should be able to use Avro tools' fromjson command (see the Avro tools documentation for more information and examples). You will probably want to split your file into chunks of 10M lines beforehand, for example using split(1).
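A minimal sketch of that pipeline, assuming the input has one JSON document per line and roughly 10M lines per chunk (record.avsc and all file names here are placeholders):

split -l 10000000 input.json chunk_
for f in chunk_*; do
  # each chunk becomes one Avro container file holding ~10M records
  java -jar avro-tools-1.7.7.jar fromjson --schema-file record.avsc "$f" > "$f.avro"
done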
The easiest way to convert a large JSON file to Avro is with avro-tools, available from the Avro website.
After creating a simple schema, the file can be converted directly:
java -jar avro-tools-1.7.7.jar fromjson --schema-file cpc.avsc --codec deflate test.1g.json > test.1g.deflate.avro
The example schema:
{
"type": "record",
"name": "cpc_schema",
"namespace": "com.streambright.avro",
"fields": [{
"name": "section",
"type": "string",
"doc": "Section of the CPC"
}, {
"name": "class",
"type": "string",
"doc": "Class of the CPC"
}, {
"name": "subclass",
"type": "string",
"doc": "Subclass of the CPC"
}, {
"name": "main_group",
"type": "string",
"doc": "Main-group of the CPC"
}, {
"name": "subgroup",
"type": "string",
"doc": "Subgroup of the CPC"
}, {
"name": "classification_value",
"type": "string",
"doc": "Classification value of the CPC"
}, {
"name": "doc_number",
"type": "string",
"doc": "Patent doc_number"
}, {
"name": "updated_at",
"type": "string",
"doc": "Document update time"
}],
"doc:": "A basic schema for CPC codes"
}
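For reference, an input line satisfying this schema would look something like the following (all values invented for illustration):

{"section": "A", "class": "01", "subclass": "B", "main_group": "33", "subgroup": "00", "classification_value": "I", "doc_number": "US9000000", "updated_at": "2015-06-01T00:00:00Z"}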
Related
I have a JSON schema and a CSV file. The CSV file has 2,511 data rows plus one header row (2,512 rows total), and each row has 43 columns. I was able to convert the CSV to JSON using one of the myriad online converters, but the result is what I believe is termed a 'flat' JSON file.
Here is the CSV header row:
F1,F2,F3.1.F1,F3.1.F2,F3.1.F3,F3.1.F4,...F3.10.F1,F3.10.F2,F3.10.F3,F3.10.F4,F4
Here is my JSON schema:
{
"$schema": "http://json-schema.org/schema#",
"$id": "./.schema.json",
"title": "",
"description": "",
"type": "object",
"properties": {
"F1": {
"description": "",
"type": "string"
},
"F2": {
"description": "",
"type": "string"
},
"F3": {
"description": "",
"type": "array",
"items": {
"description": "",
"type": "object",
"properties": {
"F3.F1": {
"description": "",
"type": "string"
},
"F3.F2": {
"description": "",
"type": "string"
},
"F3.F3": {
"description": "",
"type": "string"
},
"F3.F4": {
"description": "",
"type": "string"
}
},
"required": [
"F3.F1",
"F3.F2",
"F3.F3",
"F3.F4"
]
},
"numItems": 10,
"unique": false
},
"F4": {
"description": "",
"type": "string"
}
},
"required": [
"F1",
"F2",
"F3",
"F4"
],
"additionalProperties": false
}
From the CSV->JSON conversion, my JSON file looks like:
[
{
"F1": 2429546524130460000,
"F2": 2429519276857919500,
"F3.1.F1": 2428316170619109000,
"F3.1.F2": 0.0690932185744956,
"F3.1.F3": 2.6355498567408557,
"F3.1.F4": 0.4369495787854096,
...
"F3.10.F1": 2429415922764859400,
"F3.10.F2": 0.15328371980044203,
"F3.10.F3": 2.677944208300451,
"F3.10.F4": 0.31036472544281585,
"F4": 0.16889514829995647
},
... //repeated 2,509 times
{
"F1": 1143081876266241000,
"F2": 1143588785487818100,
"F3.1.F1": 1141377392726037800,
"F3.1.F2": 1.332366799133926,
"F3.1.F3": 0.24878185970548322,
"F3.1.F4": 1.560443994684636,
...
"F3.10.F1": "XXX",
"F3.10.F2": "XXX",
"F3.10.F3": "XXX",
"F3.10.F4": "XXX",
"F4": 2.2916768389567497
}
]
Clearly, making the necessary changes 2,511 times by hand is impractical, so I am hoping there is a way to make them automatically. I can code, but I could not find any existing solution for going from a CSV to JSON output that matches a specific JSON schema. Preferably, I would like a solution that is not restricted to converting this one set of data to this one specific format, i.e., a general solution that could be used with a different CSV and a different JSON schema.
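This is not the fully general converter asked for, but as a starting point, a jq sketch for this particular shape: it regroups the flat F3.N.FM columns into the nested F3 array the schema expects. It assumes exactly ten F3 groups (as in the header) and that flat.json holds the converter output shown above; both file names are placeholders.

map(. as $row | {
    F1: ($row["F1"] | tostring),
    F2: ($row["F2"] | tostring),
    # fold the ten flat F3.<n>.F1..F4 column groups into an array of objects
    F3: [ range(1; 11) as $i |
          { "F3.F1": ($row["F3.\($i).F1"] | tostring),
            "F3.F2": ($row["F3.\($i).F2"] | tostring),
            "F3.F3": ($row["F3.\($i).F3"] | tostring),
            "F3.F4": ($row["F3.\($i).F4"] | tostring) } ],
    F4: ($row["F4"] | tostring)
})

Saved as, say, regroup.jq, it can be run with jq -f regroup.jq flat.json > nested.json. tostring is used throughout because the schema declares every field a string.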
I am creating a Microsoft.DataFactory/factories/datasets resource via Terraform. Here is the template from the official docs:
{
"type": "Microsoft.DataFactory/factories/datasets",
"apiVersion": "2018-06-01",
"name": "string",
"properties": {
"annotations": [ object ],
"description": "string",
"folder": {
"name": "string"
},
"linkedServiceName": {
"parameters": {},
"referenceName": "string",
"type": "LinkedServiceReference"
},
"parameters": {},
"schema": {},
"structure": {},
"type": "string"
// For remaining properties, see Dataset objects
}
}
I am stuck on the proper format for structure. The first few entries are fine.
"structure": [
{
"name": "id",
"type": "String"
},
{
"name": "partitionKey",
"type": "String"
},
{
"name": "createdDate",
"type": "Int64"
}
But then I need to translate a complex object and I am not sure how to go about it. The object looks something like:
"properties": {
"title": "Blah",
"tags": [
"A tag",
"Another tag"
],
"description": [
"So",
"Many",
"Words"
]
}
How do I parse this?
"name": "properties",
"type": "Object"
Will that suffice? Do I need to go into the nested keys? Any pointers would be much appreciated!
First and foremost: why are you trying to deploy datasets from Terraform? Terraform is an infrastructure-as-code (IaC) tool and is intended to deploy infrastructure only (e.g. the ADF instance, Azure SQL Server, endpoints, etc.).
ADF pipelines, datasets, and linked services are application-level objects, and they should be deployed separately, after the infrastructure.
You can use #adftools to deploy all ADF objects directly from your artefact or code.
I have a newline-delimited JSON file. Is it possible to generate a schema from it using a tool like jq? I've had some success with jq in the past but haven't done anything as complicated as this.
Here's the format of the schema I'm aiming for: https://cloud.google.com/bigquery/docs/nested-repeated#example_schema. Notice that nesting is handled with a fields key on the parent, and arrays are handled with "mode": "repeated". (Any help toward some sort of schema is greatly appreciated; I can then massage it into this format.)
Copying from the link above, I'd like to generate from this:
{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
...to...
[
{
"name": "id",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "first_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "last_name",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "dob",
"type": "DATE",
"mode": "NULLABLE"
},
{
"name": "addresses",
"type": "RECORD",
"mode": "REPEATED",
"fields": [
{
"name": "status",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "address",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "city",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "state",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "zip",
"type": "STRING",
"mode": "NULLABLE"
},
{
"name": "numberOfYears",
"type": "STRING",
"mode": "NULLABLE"
}
]
}
]
(See the question "BigQuery autodetect doesn't work with inconsistent json?", which shows that I can't use BigQuery's autodetect because the items aren't all the same. I'm fairly confident I can merge the resulting schemas manually to create a superset.)
Here's a simple recursive function that may help if you decide to roll your own:
def schema:
  # treat any string containing a YYYY-MM-DD substring as a date
  def isdate($v): $v | test("[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]");
  def array($k;$v): {"name":$k, "type":"RECORD", "mode":"REPEATED", "fields":($v[0] | schema)};
  def date($k):     {"name":$k, "type":"DATE",   "mode":"NULLABLE"};
  def string($k):   {"name":$k, "type":"STRING", "mode":"NULLABLE"};
  def item($k;$v):
    $v | if   type == "array" then array($k;$v)    # recurse into the first array element
         elif type == "string" and isdate($v) then date($k)
         elif type == "string" then string($k)
         else empty end;                           # other types (numbers, booleans) are dropped
  [ to_entries[] | item(.key;.value) ]
;
schema
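Assuming the filter above is saved as a hypothetical schema.jq, it can be run against the first document of the newline-delimited file like so (input.json is a placeholder):

head -n 1 input.json | jq -f schema.jq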
Any help with some sort of schema is greatly appreciated and I then can massage into this format
There is a schema-inference module written in jq at http://gist.github.com/pkoppstein/a5abb4ebef3b0f72a6ed, but the inferred schemas are "structural": they mirror the input JSON. For your sample, the inferred schema is shown below. As you can see, it would be quite easy to transform this into the format you have in mind, except that extra work would be required to infer the mode values.
Please note that the above-mentioned module infers the "common schema" from an arbitrarily large "sample" of JSON documents. That is, it is a schema inference engine rather than simply a "schema generator".
The above link references a companion schema-checker named JESS, also written in jq. The "E" in "JESS" stands for "extended", signifying that the JESS schema language allows complex constraints to be included.
{
"id": "string",
"first_name": "string",
"last_name": "string",
"dob": "string",
"addresses": [
{
"status": "string",
"address": "string",
"city": "string",
"state": "string",
"zip": "string",
"numberOfYears": "string"
}
]
}
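To illustrate how little work that transformation is, here is a minimal jq sketch (my own, not part of the gist above) that rewrites such a structural schema into BigQuery-style fields; it defaults every mode to NULLABLE and makes no attempt to infer DATE:

def bq:
  [ to_entries[]
    | if (.value | type) == "array"
      # an array of objects becomes a REPEATED RECORD; recurse on the first element
      then {name: .key, type: "RECORD", mode: "REPEATED", fields: (.value[0] | bq)}
      else {name: .key, type: (.value | ascii_upcase), mode: "NULLABLE"}
      end ];
bq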
I am loading a JSON file into a table in a BigQuery dataset. A sample JSON document in that file is:
{"a": "string_a","b": "string_b","c": 4.42,"d_list":["x","y","z"]}
I defined the schema fields as:
a:string, b:string, c:float, d_list:string
This gives the import error: Field:d_list, array specified for non-repeated field
I think d_list should be defined as:
{
"type": "STRING",
"name": "d_list",
"mode": "repeated"
}
Is that right? If so, how can I use the Web UI to define it this way? Any relevant help will be appreciated.
The Web UI also accepts the schema in JSON form, as noted by the helper icon, so you can define the fields as a JSON array and paste it directly into the Web UI:
[
{
"type": "STRING",
"name": "a",
"mode": "nullable"
},
{
"type": "STRING",
"name": "b",
"mode": "nullable"
},
{
"type": "FLOAT",
"name": "c",
"mode": "nullable"
},
{
"type": "STRING",
"name": "d_list",
"mode": "repeated"
}
]
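The same schema can also be supplied on the command line. A sketch using the bq tool, assuming the array above is saved as schema.json and the newline-delimited data as data.json (both names are placeholders):

bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.mytable data.json schema.json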
I have several different JSON documents which need to be inserted into BigQuery. To avoid writing the schema manually, I am using the online JSON Schema generation tools that are available, but the schemas they generate are not accepted by the BigQuery Load Data wizard.
For example, for JSON data like this:
{"_id":100,"actor":"KK","message":"CCD is good in Pune",
"comment":[{"actor":"Subho","message":"CCD is not as good in Kolkata."},
{"actor":"bisu","message":"CCD is costly too in Kolkata"}]
}
The schema generated by the online tool is:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"description": "Generated from c:jsonccd.json with shasum a003286a350a6889b152
b3e33afc5458f3771e9c",
"type": "object",
"required": [
"_id",
"actor",
"message",
"comment"
],
"properties": {
"_id": {
"type": "integer"
},
"actor": {
"type": "string"
},
"message": {
"type": "string"
},
"comment": {
"type": "array",
"minItems": 1,
"uniqueItems": true,
"items": {
"type": "object",
"required": [
"actor",
"message"
],
"properties": {
"actor": {
"type": "string"
},
"message": {
"type": "string"
}
}
}
}
}
}
But when I put it into BigQuery in the Load Data wizard, it fails with errors.
How can this be mitigated?
Thanks.
The schema generated by that tool is way more complex than what BigQuery requires.
Look at the sample in the docs:
"schema": {
"fields": [
{"name":"f1", "type":"STRING"},
{"name":"f2", "type":"INTEGER"}
]
},
https://developers.google.com/bigquery/loading-data-into-bigquery?hl=en#loaddatapostrequest
Meanwhile, the tool mentioned in the question adds fields like $schema, description, type, required, and properties that are unnecessary and confuse the BigQuery schema parser.
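For comparison, a hand-written BigQuery schema for the sample document in the question (a sketch, not the output of any tool) would look something like this:

[
  {"name": "_id",     "type": "INTEGER", "mode": "NULLABLE"},
  {"name": "actor",   "type": "STRING",  "mode": "NULLABLE"},
  {"name": "message", "type": "STRING",  "mode": "NULLABLE"},
  {"name": "comment", "type": "RECORD",  "mode": "REPEATED", "fields": [
    {"name": "actor",   "type": "STRING", "mode": "NULLABLE"},
    {"name": "message", "type": "STRING", "mode": "NULLABLE"}
  ]}
]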