How to insert multiple JSON documents into HBase using NiFi? - json

Please tell me how to insert multiple JSON documents into HBase using NiFi.
(Image: PutHBaseJson output)
(Image: PutHBaseCell output)
These are the results when we try to insert more than one id or object.
This is the file I tried with PutHBaseCell:
{"id" : "1334134","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
{"id" : "412","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}
(Image: PutHBaseCell processor)

PutHBaseJson expects each flow file to contain one JSON document, which becomes a row in HBase. The row id can be specified in the processor using expression language, or it can come from one of the fields in the JSON. The other field/value pairs in the JSON become the columns/values of the row in HBase.
If you want to use PutHBaseJson, you just need to split up your data in NiFi before it reaches this processor. There are many ways to do this: SplitJson, SplitText, SplitContent, ExecuteScript, or a custom processor.
Alternatively, there is a PutHBaseRecord processor, which can use a record reader to read records from a flow file and send them all to HBase. In your case you would need a JSON record reader. The data also has to be in a format that the record reader understands, and I believe for JSON it would need to be an array of documents.
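To make the two shapes concrete, here is a minimal Python sketch (outside NiFi, so it is only an illustration, not a NiFi API) that takes the comma-separated objects from the question, wraps them into a JSON array (the shape a JSON record reader would likely need), and also splits them into one document per item (the shape PutHBaseJson expects per flow file). The helper names to_json_array and split_documents are made up for this example. Something similar could probably be adapted for ExecuteScript, but that wiring is not shown here.

```python
import json

def to_json_array(raw: str) -> list:
    """Parse comma-separated JSON objects by wrapping them in [ ]."""
    # Assumption: the input is a sequence of objects separated by commas,
    # exactly like the sample file in the question.
    return json.loads("[" + raw.strip().rstrip(",") + "]")

def split_documents(raw: str) -> list:
    """Return one JSON string per object (one per would-be flow file)."""
    return [json.dumps(doc) for doc in to_json_array(raw)]

raw = '''{"id" : "1334134","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"},
{"id" : "412","name" : "Apparel Fabric","path" : "Arts, Crafts & Sewing/Fabric/Apparel Fabric"}'''

docs = split_documents(raw)
for doc in docs:
    print(doc)
```

Each printed line is a standalone JSON document whose "id" field could serve as the row id.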

Related

Efficient way to parse a file with different json schemas in spark

I am trying to find the best way to parse a JSON file with an inconsistent schema (although the schema for each type is known and consistent) in Spark, in order to split it by "type" and store it in Parquet:
{"type":1, "data":{"data_of_type1" : 1}}
{"type":2, "data":{"data_of_type2" : "can be any type"}}
{"type":3, "data":{"data_of_type3" : "value1", "anotherone": 1}}
I also want to reduce the I/O because I am dealing with huge volumes, so I don't want to do a first split (by type) and then process each type independently.
Current idea (not working):
Load the JSON and parse only the type ("data" is loaded as a string)
Attach to each row the corresponding schema (a DDL string in a new column)
Try to parse "data" with the DDL from the previous column (using from_json)
=> Throwing error : Schema should be specified in DDL format as a string literal or output of the schema_of_json/schema_of_csv functions instead of schema
Do you have any idea if it's possible?

Spark from_avro 2nd argument is a constant string, any way to obtain schema string from some column of each record?

Suppose we are developing an application that pulls Avro records from a source
stream (e.g. Kafka/Kinesis/etc.), parses them into JSON, then further processes that
JSON with additional transformations. Further assume these records can have a
varying schema (which we can look up and fetch from a registry).
We would like to use Spark's built-in from_avro function, but it is pretty clear that
from_avro wants you to hard-code a fixed schema into your code. It doesn't seem
to allow the schema to vary from one incoming row to the next.
That sort of makes sense if you are parsing the Avro into the internal row format: one would need
a consistent structure for the dataframe. But what if we wanted something like
from_avro that grabbed the bytes from some column in the row, grabbed the string
representation of the Avro schema from some other column in the row, and then parsed that Avro
into a JSON string?
Does such a built-in method exist? Or is such functionality available in a third-party library?
Thanks!

ADF: Split a JSON file with an Array of Objects into Single JSON files containing One Element in Each

I'm using Azure Data Factory and trying to convert a JSON file that is an array of JSON objects into separate JSON files, each containing one element. For example, the input:
[
{"Animal":"Cat","Colour":"Red","Age":12,"Visits":[{"Reason":"Injections","Date":"2020-03-15"},{"Reason":"Check-up","Date":"2020-01-02"}]},
{"Animal":"Dog","Colour":"Blue","Age":1,"Visits":[{"Reason":"Check-up","Date":"2020-02-08"}]},
{"Animal":"Guinea Pig","Colour":"Green","Age":5,"Visits":[{"Reason":"Injections","Date":"2019-12-01"},{"Reason":"Check-up","Date":"2020-02-26"}]}
]
I've tried to use Data Flow to split this array into single files, each containing one element of the JSON array, but cannot work it out. Ideally I would also like to name each file dynamically, e.g. Cat.json, Dog.json and "Guinea Pig.json".
Is Data Flow the correct tool for this with Azure Data Factory (version 2)?
Data Flows should do it for you. Your JSON snippet above will generate 3 rows. Each of those rows can be sent to a single sink. Set the Sink as a JSON sink with no filename in the dataset. In the Sink transformation, use the 'File Name Option' of 'As Data in Column'. Add a Derived Column before that which sets a new column called 'filename' with this expression:
Animal + '.json'
In the sink, use the column name 'filename' for the 'As Data in Column' option.
You'll get a separate file for each row.
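To make the expected result concrete, the same split can be sketched in plain Python. This is not ADF code and does not use any ADF API; the function name split_array_to_files and the temporary output directory are assumptions made for this illustration. It builds the file name with the same rule as the derived column (Animal + '.json') and writes one file per array element.

```python
import json
import os
import tempfile

def split_array_to_files(json_text, out_dir, name_field="Animal"):
    """Write each element of a JSON array to '<name_field value>.json'."""
    written = []
    for row in json.loads(json_text):
        # Same rule as the derived column expression: Animal + '.json'
        filename = row[name_field] + ".json"
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as fh:
            json.dump(row, fh)
        written.append(filename)
    return written

sample = '''[
{"Animal":"Cat","Colour":"Red","Age":12,"Visits":[{"Reason":"Injections","Date":"2020-03-15"},{"Reason":"Check-up","Date":"2020-01-02"}]},
{"Animal":"Dog","Colour":"Blue","Age":1,"Visits":[{"Reason":"Check-up","Date":"2020-02-08"}]},
{"Animal":"Guinea Pig","Colour":"Green","Age":5,"Visits":[{"Reason":"Injections","Date":"2019-12-01"},{"Reason":"Check-up","Date":"2020-02-26"}]}
]'''

out_dir = tempfile.mkdtemp()  # throwaway directory for the example
files = split_array_to_files(sample, out_dir)
print(files)
```

Note that this sketch assumes every element has a distinct Animal value; two elements with the same value would overwrite each other's file.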

Losing data from Source to Sink in Copy Data

In my Azure Data Factory, I have a REST API connection to a nested JSON dataset.
The Source "Preview data" shows all the data (7 orders from the online store).
In the Copy Data activity, under the "Mapping" tab, I map the JSON fields to the sink SQL table columns. If I select None under "Collection Reference", all 7 orders are copied over.
But if I want the nested metadata, I select the meta field in "Collection Reference". I then get my nested data as multiple order lines, each with one metadata point, but I only get data from 1 order, not 7.
I think I have found the reason for my problem: one of the fields in the nested metadata is sometimes a string and sometimes an array. But I still don't have a solution.
(Image: screenshot of the metadata)
Your sense is right: it is caused by the nested structure of your metadata. According to the description of the Collection Reference property:
If you want to iterate and extract data from the objects inside an
array field with the same pattern and convert to per row per object,
specify the JSON path of that array to do cross-apply. This property
is supported only when hierarchical data is source.
"Same pattern" is the key point here, I think. However, as your screenshot shows, the items inside your metadata array do not all follow the same pattern.
My workaround is to use Azure Blob Storage as an intermediate step: REST API ---> Azure Blob Storage ---> your sink dataset. Inside the Blob Storage dataset, you could flatten the incoming JSON data with the "Cross-apply nested JSON array" setting.
You could refer to this blog to learn about this feature. Then you could copy the flattened data into your destination.
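As a rough illustration of what a cross-apply produces, and why a value that is sometimes a string and sometimes an array breaks the "same pattern" requirement, here is a small Python sketch. It is not how ADF implements the feature. The field names ("id", "meta", "key", "value") and the sample orders are hypothetical, since the real payload is only visible in the screenshot. The sketch emits one row per item of the nested array and serializes non-string values so every row ends up with the same column type.

```python
import json

def cross_apply(orders, array_field="meta"):
    """Emit one flat row per element of the nested array in each order."""
    rows = []
    for order in orders:
        # Columns of the parent order are repeated on every emitted row.
        parent = {k: v for k, v in order.items() if k != array_field}
        for item in order.get(array_field, []):
            row = dict(parent)
            for key, value in item.items():
                # Normalise mixed types: keep strings, serialise anything else.
                row[f"{array_field}.{key}"] = (
                    value if isinstance(value, str) else json.dumps(value)
                )
            rows.append(row)
    return rows

# Hypothetical orders: "value" is a string in some items and an array in another.
orders = [
    {"id": 1, "meta": [{"key": "size", "value": "M"},
                       {"key": "tags", "value": ["a", "b"]}]},
    {"id": 2, "meta": [{"key": "size", "value": "L"}]},
]

rows = cross_apply(orders)
for row in rows:
    print(row)
```

With the types normalised, both orders contribute rows, which is the outcome the intermediate flattening step is meant to achieve.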

Neo4j node property containing raw json as metadata

Is it possible to store a raw JSON string as a node property and filter on it with Cypher?
I have a node with some defined properties and metadata (a raw JSON string).
I would like to select or filter on properties inside that metadata.
Something like this:
START movie=node:TYPE_INDEX(Type = 'MOVIE') // Start with the reference
MATCH movie-[t:TAG]->tag
WHERE collect(movie.Metadata).RatingPress > 3
RETURN distinct movie.Label
And the metadata looks something like this:
{"RatingPress" : "0","RatingSpectator" : "3"}
I expected to be able to use the collect function to access the property like this:
collect(movie.Metadata).RatingPress
But of course it fails...
Is there a way to access a JSON string stored in a node property from Cypher?
Thanks for your help
That's going against the principles of properties. Why not set the properties in the JSON metadata directly on the node?
But to answer your question:
No, Cypher has no knowledge of JSON.
We treat the entire Node as a JSON blob. Since Neo4j doesn't support hierarchical properties, we flatten out the JSON into delimited property names on save and unflatten them on read. You can then form Cypher queries on (for example) property name "foo.bar.baz". The queries tend to look a bit funky because you'll need to quote them using single back quotes, but it works.
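The flatten-on-save / unflatten-on-read idea can be sketched in a few lines of Python. This is only an illustration of the technique, not the poster's actual code. The function names and the choice of "." as the delimiter are assumptions, and the sketch assumes keys never contain the delimiter and that values are either nested objects or plain scalars.

```python
def flatten(doc, sep=".", prefix=""):
    """Turn nested dicts into a single-level dict with delimited keys."""
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, sep, name))  # recurse into nested object
        else:
            flat[name] = value
    return flat

def unflatten(flat, sep="."):
    """Rebuild the nested dict from delimited keys."""
    doc = {}
    for name, value in flat.items():
        parts = name.split(sep)
        node = doc
        for part in parts[:-1]:
            node = node.setdefault(part, {})  # create intermediate levels
        node[parts[-1]] = value
    return doc

original = {"Label": "Some movie", "foo": {"bar": {"baz": 1}}}
props = flatten(original)      # what would be stored as node properties
restored = unflatten(props)    # what would be rebuilt on read
print(props)
print(restored)
```

A query would then refer to the flattened property name in backticks, for example n.`foo.bar.baz`.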