Using NiFi to ingest JSON data into HBase

I'm trying to write a fairly simple XML file stored in HDFS to HBase. I'd like to transform the XML file into JSON format and create one row in HBase for each element within the JSON array. The XML structure is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<customers>
<customer customerid="1" name="John Doe"></customer>
<customer customerid="2" name="Tommy Mels"></customer>
</customers>
And here are the desired HBase output rows:
1 {"customerid":"1","name":"John Doe"}
2 {"customerid":"2","name":"Tommy Mels"}
I've tried out many different processors for my flow, but this is what I have now: GetHDFS -> ConvertRecord -> SplitJson -> PutHBaseCell. The ConvertRecord processor is working fine and converts the XML file to JSON properly, but I can't manage to split the JSON into two records. Here is what I've managed to write in HBase so far (with a different processor combination):
c5927a55-d217-4dc1-af04-0aff743cfe4e column=person:rowkey, timestamp=1574329272237, value={"customerid":"1","name":"John Doe"}\x0A{"customerid":"2","name":"Tommy Mels"}
For the SplitJson processor I'm using the following JsonPath expression: $.*
As of now, I'm getting an IllegalArgumentException in the PutHBaseCell processor stating that the row length is 0.
Any hints?

I think the issue is that SplitJson isn't working properly because technically the content of your flow file is multiple JSON documents, one per line. SplitJson would expect them to be inside an array like:
[
{"customerid"="1","name"="John Doe"},
{"customerid"="2","name"="Tommy Mels"}
]
One option is to use SplitRecord with a JsonTreeReader, which should be able to understand the JSON-per-line format.
Another option is to avoid splitting altogether and go from ConvertRecord -> PutHBaseRecord with a JsonTreeReader, as sketched below.
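A rough sketch of the record-based flow (the property names follow the stock PutHBaseRecord processor, but the table name, client service, and row id field below are assumptions for this example rather than values from the question; the column family is taken from the scan output above):

GetHDFS -> ConvertRecord -> PutHBaseRecord
PutHBaseRecord (sketch):
  Record Reader             = JsonTreeReader
  HBase Client Service      = HBase_1_1_2_ClientService
  Table Name                = customers
  Column Family             = person
  Row Identifier Field Name = customerid

Each record then becomes one HBase row keyed by customerid, and the remaining fields become column qualifiers in the person family.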

Related

How to remove escaped characters when parsing XML to JSON with the Copy Data activity in Azure Data Factory?

I have an ADF pipeline exporting from an XML dataset (ADLS) to a JSON dataset (ADLS) with a Copy Data activity. Due to the complex XML structure, I need to parse the nested XML to nested JSON and then use T-SQL to parse the nested JSON into a Synapse table.
However, the nested output has double backslashes (they look like escape characters) at nodes which contain a comma. You can see a sample of the XML input and JSON output below:
XML input:
<Address2>test, test</Address2>
JSON output:
"Address2":"test\\, test"
How can I remove the double backslash in the output JSON with the Copy Data activity in Azure Data Factory?
Unfortunately, there is no such provision in the Copy Data activity.
However, I tried with just the lines you provided as sample source and sink with a Copy Data activity, and it copies them as is; I don't see any \\. Perhaps you could share the exact pipeline you have, with details of the nested XML, JSON and T-SQL that you are using.
Repro: (with all default settings and properties)

Where can I get the FFProbe JSON schema definition?

I am using FFProbe to get media file information in JSON format.
I am looking for a complete schema definition for the JSON output option in FFProbe.
See: https://ffmpeg.org/ffprobe.html#json
Without the schema I find that different files produce different output, and I have to add more serialization logic by hand as I discover more properties and more tags in the JSON.
Something equivalent to MkvToolNix's full JSON schema definition, but for FFProbe:
See: https://gitlab.com/mbunkus/mkvtoolnix/-/blob/master/doc/json-schema/mkvmerge-identification-output-schema-v12.json
Any ideas if such a schema exists for FFProbe?
There isn't one, but there is an XML schema which you could try to convert. It's at https://github.com/FFmpeg/FFmpeg/blob/master/doc/ffprobe.xsd
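For reference, a typical invocation that produces the JSON output in question looks like this (the input file name is just a placeholder); the XSD above describes the equivalent XML writer output, so it is the closest thing to an authoritative description of the same sections:

ffprobe -v quiet -print_format json -show_format -show_streams input.mp4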

Is the format always JSON when SELECTing from a stage?

Snowflake supports multiple file types via CREATE FILE FORMAT (Avro, JSON, CSV, etc.).
Now I have tested SELECTing from a Snowflake stage (S3) with both:
*.avro files (generated from a NiFi processor batching 10k records from a source Oracle table).
*.json files (one JSON document per line).
When I run SELECT $1 FROM @myStg, Snowflake expands as many rows as there are records in the Avro or JSON files (cool), but the $1 variant is in JSON format in both cases, and now I wonder: whatever Snowflake FILE_FORMAT we use, do records always arrive as JSON in the $1 variant?
I haven't tested CSV or other Snowflake file formats.
Or I wonder if I get JSON from the Avro files (from the Oracle table) because maybe the NiFi processor creates Avro files which internally use a JSON-like structure.
Maybe I'm confusing myself here.. I know Avro files contain both:
an Avro schema - described in a JSON-like key/value language.
compressed data (binary).
Thanks,
Emanuel O.
I tried with CSV. When it comes to CSV, Snowflake parses each record in the file into separate positional columns ($1, $2, ...), rather than a single variant.
When it comes to JSON, it treats one complete JSON document as one record, so it is displayed in JSON format in the $1 variant.
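A minimal sketch of the difference (the file format names here are assumptions for this example; the stage name is taken from the question):

-- JSON or Avro: each record arrives as one VARIANT in $1, rendered as JSON
SELECT $1 FROM @myStg (FILE_FORMAT => 'my_json_format');

-- CSV: each record is split into positional columns instead of one variant
SELECT $1, $2, $3 FROM @myStg (FILE_FORMAT => 'my_csv_format');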

How to output the original JSON along with the transformed JSON in JoltTransformJson

I want to save both the transformed and the original JSON into HBase using the same key. I am using JoltTransformJson + EvaluateJsonPath to transform the JSON and extract an element from the transformed JSON. I want to use this element to save both the transformed and the original JSON.
If I can get the original JSON along with the transformed JSON, then I can save both of them using the same key.
Thanks,
Ani
The JoltTransformJson processor only has success and failure relationships, and success is going to be the flow file with the content after the transform. So the only way to get the original content is to route the flow file from before JoltTransformJson so that it goes both to an HBase processor and to the JoltTransformJson processor.
You could also first insert the original JSON into HBase and then continue on to the transform, something like:
Source -> PutHBaseJson -> JoltTransformJson -> PutHBaseJson
The first PutHBaseJson inserts the original JSON and the second inserts the transformed JSON. As long as you use the same row id, they'll be part of the same row, as sketched below.
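A rough sketch of what the resulting HBase row could look like after both inserts run with the same row id (the row id, column family, and field names here are assumptions for illustration; PutHBaseJson writes each JSON field as its own column):

ROW    COLUMN+CELL
123    column=data:customerid, value=1          (from the original JSON)
123    column=data:name,       value=John Doe   (from the original JSON)
123    column=data:fullName,   value=John Doe   (from the transformed JSON)

Since both inserts target row 123, a single get or scan on that row returns the original and transformed fields together.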

How to insert multiple JSON documents into HBase using NiFi?

Please tell me how to insert multiple JSON documents into HBase using NiFi.
The PutHBaseJson and PutHBaseCell output screenshots show what happens when we try to insert more than one id or object.
This is the file which I have tried with PutHBaseCell:
{"id" : "1334134","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
{"id" : "412","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}
(Screenshot of the PutHBaseCell processor settings)
PutHBaseJson expects each flow file to contain one JSON document, which becomes a row in HBase. The row id can be specified in the processor using expression language, or it can come from one of the fields in the JSON. The other field/value pairs in the JSON become the columns/values of the row in HBase, as sketched below.
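A small sketch using the first document from your file (the table, column family "f", and the choice of "id" as the Row Identifier Field Name are assumptions for this example; whether the id field is also written back as a column depends on configuration):

Flow file content (one JSON document):
{"id" : "1334134","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}

Resulting HBase row:
1334134  column=f:name, value=Apparel Fabric
1334134  column=f:path, value=Arts, Crafts & Sewing/Fabric/Apparel Fabric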
If you want to use PutHBaseJson, you just need to split up your data in NiFi before it reaches this processor. There are many ways to do this: SplitJson, SplitText, SplitContent, ExecuteScript, or a custom processor.
Alternatively, there is a PutHBaseRecord processor which can use a record reader to read records from a flow file and send them all to HBase. In your case you would need a JSON record reader. The data also has to be in a format that is understood by the record reader, and I believe for JSON it would need to be an array of documents, as shown below.
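For reference, a minimal sketch of your two records rearranged as a single JSON array that a JsonTreeReader should accept (the field names are taken from the file above; whether the reader infers the schema or is given one explicitly depends on how it is configured):

[
  {"id" : "1334134", "name" : "Apparel Fabric", "path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
  {"id" : "412", "name" : "Apparel Fabric", "path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}
]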