Transform Avro file with Wrangler into JSON in Cloud Data Fusion

I am trying to read an Avro file, apply a basic transformation (remove records with name = Ben) using Wrangler, and write the result as a JSON file into Google Cloud Storage.
The Avro file has the following schema:
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    {
      "type": "string",
      "name": "name"
    }
  ]
}
The transformation in Wrangler is the following:
transformation
The following is the output schema for the JSON file:
output schema
When I run the pipeline, it runs successfully and the JSON file is created in Cloud Storage, but the JSON output is empty.
When trying a preview run, I get the following message:
warning message
Why is the JSON output file in Cloud Storage empty?

When using Wrangler to make transformations, the default values for the GCS source are format: text and body: string (data type). However, to work properly with an Avro file in Wrangler you need to change that: set the format to blob and the body data type to bytes, as follows:
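(The screenshot from the original answer is not included here. As a sketch: the source's output schema in the Data Fusion UI is an Avro-style JSON record, so with body set to bytes it would look roughly like this:)
{
  "type": "record",
  "name": "etlSchemaBody",
  "fields": [
    { "name": "body", "type": "bytes" }
  ]
}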
After that, the preview for your pipeline should produce output records. You can see my working example next:
Sample data
Transformations
Input records preview for GCS sink (final output)
Edit:
You need to set format: blob and the output schema to body: bytes if you want to parse the file as Avro within Wrangler, as described above, because Wrangler needs the content of the file in binary form.
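As a rough sketch, the Wrangler recipe for this path could be as small as the two directives below; parse-as-avro-file and filter-rows-on both appear in the CDAP Wrangler directive set, but treat the exact arguments as an assumption that may vary by version:
parse-as-avro-file :body
filter-rows-on condition-true name == 'Ben'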
On the other hand, if you only want to apply filters (within Wrangler), you could do the following:
1. Open the file using format: avro, see img.
2. Set the output schema according to the fields your Avro file has, in this case name with string data type, see img.
3. Use only filters in Wrangler (no parsing to Avro here), see img; a sketch of such a filter directive follows below.
And this way you can also get the desired result.
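For reference, the filter step by itself would then be a single directive along these lines (same caveat on exact syntax as above):
filter-rows-on condition-true name == 'Ben'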

Data Factory Copy Data Source as Body of Sink Object

I've been trying to create an ADF pipeline to move data from one of our databases into an Azure Storage folder, but I can't seem to get the transformation to work correctly.
I'm using a Copy Data task and have the source and sink set up as datasets. Data is flowing from one to the other; it's just the format that's bugging me.
In our database we have a single field that contains a JSON object. This needs to be mapped into the sink object, but it doesn't have a column name; it is simply the base object.
So, for example, the source looks like this:
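(The original example of the source is missing here. Judging from the Copy Data output shown further down, it is presumably a single VARCHAR column holding the JSON text, something like:)
Json
------------------------------------------------------
{"ID": 123,"Field":"Hello World","AnotherField":"Foo"}
{"ID": 456,"Field":"Don't Panic","AnotherField":"Bar"}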
and my output needs to look like this:
[
  {
    "ID": 123,
    "Field": "Hello World",
    "AnotherField": "Foo"
  },
  {
    "ID": 456,
    "Field": "Don't Panic",
    "AnotherField": "Bar"
  }
]
However, the Copy Data task only seems to accept a direct Source -> Sink mapping, and it also treats the SQL Server field as VARCHAR (which I suppose it is). As a result I'm getting this out the other side:
[
  {
    "Json": "{\"ID\": 123,\"Field\":\"Hello World\",\"AnotherField\":\"Foo\"}"
  },
  {
    "Json": "{\"ID\": 456,\"Field\":\"Don't Panic\",\"AnotherField\":\"Bar\"}"
  }
]
I've tried using the internal #json() parse function on the source field but this causes errors in the pipeline. I also can't get the sink to just map directly as an object inside the output array.
I have a feeling I just shouldn't be using Copy Data, or that Copy Data doesn't support the level of transformation I'm trying to do. Can anybody set me on the right path?
Using a JSON dataset as a source in your data flow allows you to set five additional settings. These settings can be found under the JSON settings accordion in the Source Options tab. For the Document Form setting, you can select one of the Single document, Document per line, and Array of documents types.
Select Document form as Array of documents.
Refer to https://learn.microsoft.com/en-us/azure/data-factory/format-json
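For reference, in the generated data flow script this maps to the documentForm setting on the JSON source; a minimal sketch, where the source name JsonSource is an arbitrary placeholder:
source(allowSchemaDrift: true,
    validateSchema: false,
    documentForm: 'arrayOfDocuments') ~> JsonSource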

Convert JSON to Avro in NiFi

I want to convert JSON data to Avro.
I have used GenerateFlowFile and put in the dummy JSON value [{"firstname":"prathik","age":21},{"firstname":"arun","age":22}].
I have then used the ConvertRecord processor and set JsonTreeReader and AvroRecordSetWriter with an AvroSchemaRegistry which has the following schema: AvroSchema
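(The schema screenshot is not included; a registry schema matching the sample data would look roughly like this, where the record name person is a placeholder:)
{
  "type": "record",
  "name": "person",
  "fields": [
    { "name": "firstname", "type": "string" },
    { "name": "age", "type": "int" }
  ]
}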
But I am getting this as my output: Output (Avro Data)
I am new to Apache NiFi.
Thanks in advance.
But I am getting this as my output: Output (Avro Data)
That's to be expected. Avro is a binary file format, and what you see is an attempt to make that data viewable in a text format. It's supposed to act like that.

In Greenplum, PXF external table gets empty string while fetching element from JSON array of objects

While accessing JSON data through an external table created with the PXF JSON plugin (as in the multiline JSON table example), using the following column definition:
"coordinates.values[0]" INTEGER,
easily fetches 8 from the JSON below:
"coordinates":{
"type":"Point",
"values":[
8,
52
]
}
But if we change the JSON to something like this:
"coordinates": {
  "type": "geoloc",
  "values": [
    {
      "latitude": 72,
      "longtitue": 80
    }
  ]
}
and change the column definition to this:
"coordinates.values[0].latitude" INTEGER,
it fetches an empty string.
Unfortunately, the JSON profile in PXF does not support accessing JSON objects inside arrays. However, Greenplum has very good support for JSON, and you can achieve the same result by doing the following:
CREATE EXTERNAL TABLE pxf_read_json (j1 json)
LOCATION ('pxf://tmp/file.json?PROFILE=hdfs:text:multi&FILE_AS_ROW=true')
FORMAT 'CSV';
The pxf_read_json table will access JSON files on the external system. Each file is read as multi-line text, and each file represents a single table row in Greenplum. You can then query the external data as follows:
SELECT v->>'latitude' AS latitude, v->>'longtitue' AS longitude
FROM pxf_read_json
JOIN LATERAL json_array_elements(j1->'coordinates'->'values') AS v
ON true;
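Against the sample document above, that query should return something like:
 latitude | longitude
----------+-----------
 72       | 80
(1 row)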
With this approach, you can still take advantage of PXF's support for accessing external systems while also leveraging the powerful JSON support in Greenplum.
Additional information about reading a multi-line text file into a single table row can be found here, and information about Greenplum's support for JSON can be found here.

ConvertRecord processor issue with XML to JSON conversion in Apache NiFi

I have a requirement where I am converting data from an API to JSON format. The output of the API is initially XML, so I am using the XMLReader controller service to read the XML data and the JSONRecordSetWriter controller service to convert it to JSON format, in Apache NiFi 1.9.2.
When I use the ConvertRecord processor with these controller services, my output merely shows the Avro schema and not the expected data. I have tried many options, like using the AvroSchemaRegistry controller service, but only the schema is seen and null values are passed. Can anyone explain this behavior?
XML flowfile output:
<field1 value="AAAA"/>
<field2 value="BBBB"/>
<field3 value="male"/>
JSON output:
[ {
  "field1" : null,
  "field2" : null,
  "field3" : null
} ]
The documentation specifies that "Records are expected in the second level of XML data, embedded in an enclosing root tag." Your input file appears to be a list of XML tags with no root tag enclosing them.
You could use ReplaceText to wrap the XML in a root tag, then the XMLReader should parse the fields as expected.
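For example, after wrapping (the tag name root is an arbitrary placeholder), the flowfile content would look like this:
<root>
  <field1 value="AAAA"/>
  <field2 value="BBBB"/>
  <field3 value="male"/>
</root>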

How to define nested array to ingest data and convert?

I am using Firehose and Glue to ingest data and convert JSON to Parquet files in S3.
I was able to achieve this with flat JSON (not nested, no arrays), but it fails for a nested JSON array. Here is what I have done:
The JSON structure:
{
  "class_id": "test0001",
  "students": [{
    "student_id": "xxxx",
    "student_name": "AAAABBBCCC",
    "student_gpa": 123
  }]
}
The Glue schema:
class_id : string
students : ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>
I receive this error:
The schema is invalid. Error parsing the schema: Error: type expected at the position 0 of 'ARRAY<STRUCT<student_id:STRING,student_name:STRING,student_gpa:INT>>' but 'ARRAY' is found.
Any suggestion is appreciated.
I ran into this because I created the schemas manually in the AWS console. The problem is that the console shows some help text next to the form for entering your nested data which capitalizes everything, but Parquet can only work with lowercase definitions.
So, despite the example given by AWS, write:
array<struct<student_id:string,student_name:string,student_gpa:int>>
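Applied to the schema from the question, the column definitions become:
class_id : string
students : array<struct<student_id:string,student_name:string,student_gpa:int>>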