Possibilities for structuring ingested json data using Nifi - json

Is it possible, using Nifi, to load a json file into a structured table?
I've pulled the following weather forecast data (from 6000 weather stations), which I'm currently loading into HDFS. It all appears on one line:
{"SiteRep":{"Wx":{"Param":[{"name":"F","units":"C","$":"Feels Like Temperature"},{"name":"G","units":"mph","$":"Wind Gust"},{"name":"H","units":"%","$":"Screen Relative Humidity"},{"name":"T","units":"C","$":"Temperature"},{"name":"V","units":"","$":"Visibility"},{"name":"D","units":"compass","$":"Wind Direction"},{"name":"S","units":"mph","$":"Wind Speed"},{"name":"U","units":"","$":"Max UV Index"},{"name":"W","units":"","$":"Weather Type"},{"name":"Pp","units":"%","$":"Precipitation Probability"}]},"DV":{"dataDate":"2017-01-12T22:00:00Z","type":"Forecast","Location":[{"i":"14","lat":"54.9375","lon":"-2.8092","name":"CARLISLE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"50.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"WNW","F":"-3","G":"25","H":"67","Pp":"0","S":"13","T":"2","V":"EX","W":"1","U":"1","$":"720"}}},{"i":"22","lat":"53.5797","lon":"-0.3472","name":"HUMBERSIDE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"24.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"NW","F":"-2","G":"43","H":"63","Pp":"3","S":"25","T":"4","V":"EX","W":"3","U":"1","$":"720"}}}, .....
Ideally, I want this structured into a 6000-row table.
I've tried writing a schema to pass the above into Pig, but haven't been successful, probably because I'm not familiar enough with JSON to translate it correctly.
Casting around for an easy way to add some structure to the data, I've spotted that there's a PutHBaseJson processor in Nifi.
Can anyone advise if this PutHBaseJson processor would work with the above data structure? And if so, can anyone point me towards a decent tutorial to give me a starting point on the configuration?
Greatly appreciate any guidance.

You probably want to use the SplitJson processor to split the 6000-record JSON structure into 6000 individual flowfiles. If you need to "inject" the parameter definitions from the top-level response, you can use a ReplaceText or JoltTransformJSON operation to manipulate the individual JSON records. There is a good article by Yolanda Davis describing how to perform Jolt transforms (JSON -> JSON) in NiFi.
Once you have the individual flowfiles containing a single JSON record, putting them into HBase is very easy. Bryan Bende wrote an article describing the necessary configurations for the PutHBaseJson processor.
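To see what the split produces, here is a minimal Python sketch (outside NiFi) of the same operation, run against a trimmed-down version of the payload above; the JsonPath expression $.SiteRep.DV.Location is an assumption about how you would configure SplitJson for this structure:

```python
import json

# Trimmed-down stand-in for the weather payload shown in the question.
raw = """{"SiteRep": {"DV": {"Location": [
  {"i": "14", "name": "CARLISLE AIRPORT",
   "Period": {"Rep": {"T": "2", "S": "13"}}},
  {"i": "22", "name": "HUMBERSIDE AIRPORT",
   "Period": {"Rep": {"T": "4", "S": "25"}}}
]}}}"""

# SplitJson pointed at $.SiteRep.DV.Location emits one flowfile
# per element of this array; here, one printed line per station.
locations = json.loads(raw)["SiteRep"]["DV"]["Location"]
for loc in locations:
    print(json.dumps(loc))
```

With 6000 stations, the real flow yields 6000 flowfiles, each a single-record JSON ready for PutHBaseJson.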

Related

Loading Raw JSON Into Delta Lake (Like in Snowflake)

I am testing Delta Lake for a simple use case that is very easy in Snowflake, but I'm having a heck of a time understanding if it can be done, much less actually doing it.
I want to be able to load a JSON file "raw," without specifying a schema, and I want to be able to query and flatten it later. In Snowflake, I can create a column of type VARIANT and load the JSON text there, and later I can ask for the different parts by using :: and lateral flatten, etc.
The examples I've seen so far about Delta Lake have had "schema inference" or "autoloading" stipulations, and with those it seems that even if I don't specify a schema, one is created for me and then I still have to guess (or look up) what columns Delta Lake created for me so I can query those parts of the JSON. It seems a little too complicated.
This page has the following comment:
When ingesting data, you may need to keep it in a JSON string, and some data may not be in the correct data type.
... but it provides no example of how to do that. To me this suggests that you can somehow store the raw JSON and query it later, but I don't know how. Just make a STRING column and insert the JSON as string? Can someone post an example?
Am I trialing the wrong tool for what I need, or am I missing something? Thank you for your help.
As far as I'm aware, there is no direct equivalent in Delta Lake to Snowflake's VARIANT column. What that page is suggesting is storing the data as a string, and then using the semi-structured access operators to parse it as JSON on the fly.
e.g. given a table named devices with a column named specifications of type string with value
"""{
"device": "potato phone",
"sku": "POTATO0001",
}"""
Then you can query it like this:
SELECT specifications:device, specifications:sku FROM devices
edit: to address some of your other questions
This doesn't do schema enforcement. It's possible to create a Struct column in Delta Lake that can store structured data, but all the data in that column needs to be compatible with the Struct schema. If you are querying a JSON string column, you are on your own for schema management.
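To make the store-as-string pattern concrete, here is a plain-Python sketch (no Spark required); json.loads plays the role that the : accessor plays in Spark SQL, and the devices/specifications names just mirror the example above:

```python
import json

# One row per device; the raw JSON is kept in a plain string column.
devices = [
    {"specifications": '{"device": "potato phone", "sku": "POTATO0001"}'},
]

# Query time: parse the string on the fly, the moral equivalent of
# SELECT specifications:device, specifications:sku FROM devices
rows = [
    (json.loads(r["specifications"])["device"],
     json.loads(r["specifications"])["sku"])
    for r in devices
]
```

No schema was declared at ingest; the structure is only interpreted when queried, which is exactly the trade-off described above.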

neo4j insert complicated json with relationship between nodes for same

This is going to be a little complex.
I am trying to save JSON with a nested array structure. The JSON I am trying to save is here:
JSON LINK
Is it possible to save the above JSON with a Cypher query? I previously tried the py2neo library for Python, but it is based on model definitions, and the nested JSON above has somewhat dynamic keys.
Here is what I actually tried:
query = '''
CREATE (part:Part PARTJSON)
MERGE (part) - [:LINKED_TO] - (general:General GENERALJSON)
MERGE (general) - [:LINKED_TO] - (bom:Bom BOMJSON )
MERGE (general) - [:LINKED_TO] - (generaldata:GeneralData GENERALDATAJSON )
.......
'''
Is it possible to write a single Cypher query that saves it all in one go?
If so, any ideas would be helpful, and hopefully useful for other neo4j users hitting the same roadblocks.
Thanks in advance.
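One possible approach can be sketched in Python: since Neo4j node properties must be primitive values, a nested document has to be split into scalar fields (the node's properties) and sub-documents (candidate child nodes) before a parameterized Cypher statement can absorb the dynamic keys via SET n = $map. The doc value below is a hypothetical stand-in for the linked JSON, and split_props is a made-up helper name:

```python
def split_props(d):
    """Separate scalar fields (legal node properties) from nested
    sub-documents (candidate child nodes), since Neo4j properties
    cannot themselves be maps."""
    scalars = {k: v for k, v in d.items() if not isinstance(v, (dict, list))}
    children = {k: v for k, v in d.items() if isinstance(v, (dict, list))}
    return scalars, children

# Hypothetical stand-in for the linked JSON.
doc = {
    "part_no": "P-100",
    "rev": 3,
    "general": {"desc": "bracket", "material": "steel"},
    "bom": {"qty": 4},
}

part_props, children = split_props(doc)

# One parameterized statement; SET n = $map copies whatever keys the
# JSON happens to have, so no model definition is needed up front.
# (Run via the official neo4j Python driver or py2neo's run method.)
query = """
CREATE (part:Part) SET part = $part
MERGE (part)-[:LINKED_TO]->(general:General)
SET general = $general
"""
params = {"part": part_props, "general": children["general"]}
```

Deeper levels (Bom, GeneralData, ...) would repeat the same split-and-SET step per sub-document.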

Python: Dump JSON Data Following Custom Format

I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file, I obviously get all the data on a single line. However, I want the data dumped into the file in a format of my choosing. For example (I had to use a picture to get the point across):
My custom JSON format
I've looked up questions on custom JSONEncoders, but they all seem to deal with datatypes that aren't JSON serializable. I never found a solution for my specific need, which is having everything laid out in the manner I want. Basically, I want each list element on a separate row, but all of a dict's items on the same row. Do I need to write my own custom encoder, or is there some other approach I should take? Thanks!
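Assuming the picture shows a "one list element per row, each dict on a single row" layout, a small recursive formatter is enough; no custom JSONEncoder subclass is needed. A sketch, with dump_custom as a made-up name:

```python
import json

def dump_custom(data, indent=2):
    """Lay lists out one element per line; keep dicts (and
    scalars) compact on a single line via json.dumps."""
    def fmt(obj, level):
        pad = " " * (indent * level)
        if isinstance(obj, list):
            inner = ",\n".join(pad + " " * indent + fmt(item, level + 1)
                               for item in obj)
            return "[\n" + inner + "\n" + pad + "]"
        return json.dumps(obj)  # dicts and scalars stay on one line
    return fmt(data, 0)

print(dump_custom([{"a": 1}, {"b": 2}]))
```

The output is still valid JSON, so json.loads can read it back unchanged.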

Pig Json Multistorage?

Using Pig (0.14), I'm interested in the following use case: I wish to process my raw JSON into multiple output directories based upon a key and store the result (aggregated data) as JSON. The JSON has an evolving (dynamic) schema which is read in with elephant-bird, and (so far) this has not caused any problems.
I can either store the output in the correct directories (using MultiStorage) or as JSON (using JsonStorage), but not both. As far as I can tell, there is no publicly available UDF for this purpose.
Have I missed something, or is it just a case of writing my own UDF? This seems like a simple use case that I would have expected to be supported.
For those who are looking for an answer to this: a UDF is required.
It is possible (and relatively straightforward) to combine the piggybank UDFs JsonStorage and MultiStorage to create a pseudo "JsonMultiStorage" class.
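For anyone writing that UDF, the semantics are easy to state. Here is a Python sketch (not Pig) of the combined behaviour, with json_multi_store as a made-up name: bucket records by a key field, then write each bucket as newline-delimited JSON under its own directory, as MultiStorage does for the paths and JsonStorage does for the serialization:

```python
import json
import os
import tempfile
from collections import defaultdict

def json_multi_store(records, key, out_dir):
    """Bucket records by a key field, then write each bucket as
    newline-delimited JSON under its own subdirectory."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[key]].append(rec)
    for k, recs in buckets.items():
        d = os.path.join(out_dir, str(k))
        os.makedirs(d, exist_ok=True)
        with open(os.path.join(d, "part-00000.json"), "w") as f:
            for rec in recs:
                f.write(json.dumps(rec) + "\n")

out = tempfile.mkdtemp()
json_multi_store(
    [{"site": "uk", "t": 2}, {"site": "us", "t": 4}, {"site": "uk", "t": 3}],
    "site", out)
```

A real StoreFunc has to do the same thing through Hadoop's OutputFormat machinery, but the partition-then-serialize logic is the whole trick.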

Convert a list to a JSON Object in erlang (mochijson)

I would really appreciate any help.
I would like to convert this list
[[{id1,1},{id2,2},{id3,3},{id4,4}],[{id1,5},{id2,6},{id3,7},{id4,8}],[...]]
to a JSON object.
Need some inspiration :)
please help.
Thank you.
Since you asked for inspiration, I can imagine two directions you could take.
You can write code to hand-roll your own JSON, which, if your need is modest enough, can be a very lightweight and appropriate solution. It would be pretty simple Erlang to take that one data structure and convert it to the JSON:
"[[{\"id1\":1},{\"id2\":2},{\"id3\":3},{\"id4\":4}],[{\"id1\":5},{\"id2\":6} {\"id3\":7},{\"id4\":8}]]"
You can produce a data structure that mochiweb's mochijson:encode/1 and decode/1 can handle. I took your list and hand-coded it to JSON, getting:
X = "[[{\"id1\":1},{\"id2\":2},{\"id3\":3},{\"id4\":4}],[{\"id1\":5},{\"id2\":6},{\"id3\":7},{\"id4\":8}]]".
Then I used mochijson:decode(X) to see what structure mochiweb uses to represent JSON (too lazy to look at the documentation).
Y = mochijson:decode(X).
{array,[{array,[{struct,[{"id1",1}]},
                {struct,[{"id2",2}]},
                {struct,[{"id3",3}]},
                {struct,[{"id4",4}]}]},
        {array,[{struct,[{"id1",5}]},
                {struct,[{"id2",6}]},
                {struct,[{"id3",7}]},
                {struct,[{"id4",8}]}]}]}
So, if you can create this slightly more elaborate data structure than the one you are using, then you can get the JSON by using mochijson:encode/1. Here is an example embedded in an io:format statement so that it prints as a string; often you would use io_lib:format/X depending on your application.
io:format("~s~n",[mochijson:encode(Y)]).
[[{"id1":1},{"id2":2},{"id3":3},{"id4":4}],[{"id1":5},{"id2":6},{"id3":7},{"id4":8}]]