neo4j insert complicated json with relationship between nodes for same - json

This is going to be little complex
I am trying to save a json with nested array structure. I have the json like below which I am trying to save
JSON LINK
Is there a possibility to save the above json with cypher query, because previously I tried with py2neo library for pyhton, which is based on model definition, Therefore the above json with nested structure will be going to have a dynamic keys a bit.
What I actually tried is, I'm leaving it below
query = '''
CREATE (part:Part PARTJSON)
MERGE (part) - [:LINKED_TO] - (general:General GENERALJSON)
MERGE (general) - [:LINKED_TO] - (bom:Bom BOMJSON )
MERGE (general) - [:LINKED_TO] - (generaldata:GeneralData GENERALDATAJSON )
.......
'''
Is there a possibility to write a cypher query to save it in one flash.
If so, can help with possible ideas so that it will be useful for neo4j users with roadblocks.
Thanks in advance.

Related

Entity Framework Queries For Complicated JSON Documents (npgsql)

I am handling legacy (old) JSON files that we are now uploading to a database that was built using code-first EF Core (with the JSON elements saved as a jsonb field in a postgresql db, represented as JsonDocument properties in the EF classes). We want to be able to query these massive documents against any of the JSON's many properties. I've been very interested in the excellent docs here https://www.npgsql.org/efcore/mapping/json.html?tabs=data-annotations%2Cpoco, but the problem in our case is that our JSON has incredibly complicated hierarchies.
According to the npgsql/EF doc above, a way to do this for "shallow" json hierarchies would be something like:
myDbContext.MyClass
.Where(e => e.JsonDocumentField.RootElement.GetProperty("FieldToSearch").GetString() == "SearchTerm")
.ToList();
But that only works if is directly under the root of the JSONDocument. If the doc is structed like, say
{"A": {
...
"B": {
...
"C": {
...
"FieldToSearch":
<snip>
Then the above query won't work. There is an alternative to map our JSON to an actual POCO model, but this JSON structure (a) may change and (b) is truly massive and would result in some ridiculously complicated objects.
Right now, I'm building SQL strings using field configurations where I save strings to find the fields I want using psql's JSON querying language
Example:
"(JSONDocumentField->'A'->'B'->'C'->>'FieldToSearch')"
and then using that sql against the DB using
myDbContext.MyClass.FromSqlRaw(sql).ToList();
This is hacky and I'd much rather do it in a method call. Is there a way to force JsonDocument's GetProperty call to drill down into the hierarchy to find the first/any instance of the property name in question (or another method I'm not aware of)?
Thanks!

Merging and/or Reading 88 JSON Files into Dataframe - different datatypes

I basically have a procedure where I make multiple calls to an API and using a token within the JSON return pass that pack to a function top call the API again to get a "paginated" file.
In total I have to call and download 88 JSON files that total 758mb. The JSON files are all formatted the same way and have the same "schema" or at least should do. I have tried reading each JSON file after it has been downloaded into a data frame, and then attempted to union that dataframe to a master dataframe so essentially I'll have one big data frame with all 88 JSON files read into.
However the problem I encounter is roughly on file 66 the system (Python/Databricks/Spark) decides to change the file type of a field. It is always a string and then I'm guessing when a value actually appears in that field it changes to a boolean. The problem is then that the unionbyName fails because of different datatypes.
What is the best way for me to resolve this? I thought about reading using "extend" to merge all the JSON files into one big file however a 758mb JSON file would be a huge read and undertaking.
Could the other solution be to explicitly set the schema that the JSON file is read into so that it is always the same type?
If you know the attributes of those files, you can define the schema before reading them and create an empty df with that schema so you can to a unionByName with the allowMissingColumns=True:
something like:
from pyspark.sql.types import *
my_schema = StructType([
StructField('file_name',StringType(),True),
StructField('id',LongType(),True),
StructField('dataset_name',StringType(),True),
StructField('snapshotdate',TimestampType(),True)
])
output = sqlContext.createDataFrame(sc.emptyRDD(), my_schema)
df_json = spark.read.[...your JSON file...]
output.unionByName(df_json, allowMissingColumns=True)
I'm not sure this is what you are looking for. I hope it helps

Python: Dump JSON Data Following Custom Format

I'm working on some Python code for my local billiard hall and I'm running into problems with JSON encoding. When I dump my data into a file I obviously get all the data in a single line. However, I want my data to be dumped into the file following the format that I want. For example (Had to do picture to get point across),
My custom JSON format
. I've looked up questions on custom JSONEncoders but it seems they all have to do with datatypes that aren't JSON serializable. I never found a solution for my specific need which is having everything laid out in the manner that I want. Basically, I want all of the list elements to on a separate row but all of the dict items to be in the same row. Do I need to write my own custom encoder or is there some other approach I need to take? Thanks!

Possibilities for structuring ingested json data using Nifi

Is it possible, using Nifi, to load a json file into a structured table?
I've called the following weather forecast data (from 6000 weather stations), which i'm currently loading into HDFS. It all appears on one line:
{"SiteRep":{"Wx":{"Param":[{"name":"F","units":"C","$":"Feels Like Temperature"},{"name":"G","units":"mph","$":"Wind Gust"},{"name":"H","units":"%","$":"Screen Relative Humidity"},{"name":"T","units":"C","$":"Temperature"},{"name":"V","units":"","$":"Visibility"},{"name":"D","units":"compass","$":"Wind Direction"},{"name":"S","units":"mph","$":"Wind Speed"},{"name":"U","units":"","$":"Max UV Index"},{"name":"W","units":"","$":"Weather Type"},{"name":"Pp","units":"%","$":"Precipitation Probability"}]},"DV":{"dataDate":"2017-01-12T22:00:00Z","type":"Forecast","Location":[{"i":"14","lat":"54.9375","lon":"-2.8092","name":"CARLISLE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"50.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"WNW","F":"-3","G":"25","H":"67","Pp":"0","S":"13","T":"2","V":"EX","W":"1","U":"1","$":"720"}}},{"i":"22","lat":"53.5797","lon":"-0.3472","name":"HUMBERSIDE AIRPORT","country":"ENGLAND","continent":"EUROPE","elevation":"24.0","Period":{"type":"Day","value":"2017-01-13Z","Rep":{"D":"NW","F":"-2","G":"43","H":"63","Pp":"3","S":"25","T":"4","V":"EX","W":"3","U":"1","$":"720"}}}, .....
Ideally, I want the schema structuring into a 6000 row table.
I've tried writing a schema to pass the above into Pig, but haven't been successful, probably because I'm not familiar enough with json to translate this correctly.
Casting around for an easy way to add some structure to the data, I've spotted that there's a PutHBaseJson processor in Nifi.
Can anyone advise if this PutHBaseJson processor would work with the above data structure? And if so, can anyone point me towards a decent tutorial to give me a starting point on the configuration?
Greatly appreciate any guidance.
You probably want to use the SplitJson processor to split the 6000 record JSON structure into 6000 individual flowfiles. If you need to "inject" the parameter definitions from the top-level response, you can do a ReplaceText or JoltTransformJSON operation to manipulate the individual JSON records. Here is a good article by Yolanda Davis describing how to perform Jolt transforms (JSON -> JSON) in NiFi.
Once you have the individual flowfiles containing a single JSON record, putting them into HBase is very easy. Bryan Bende wrote an article describing the necessary configurations for the PutHBaseJson processor.

Read small json files with declared schema from S3Bucket - Spark Performance Issue

I have huge number (35k) of small (16kb) json files stored on S3Bucket. I need to load them into DataFrame for futher processing, here is my code for extract:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data into temp table and use for futher operations (exploading nested structures into separate data frames)
jsonData.registerTempTable("jsonData")
So now I have autogenerated schema for this deeply nested dataframe.
With above code I have terrible performance issues I presume its caused by not using sc.parallelize during bucket load, moreover I'm pretty sure that autogeneration of schema in read.json() method is taking a lot of time.
Questions part:
How should my bucket load look like, to be more efficient and faster?
Is there any way to declare this schema in advance (I need to work around Case Class tuple problem thou) to avoid auto-generation?
Does filtering data during load make sense or I should simply load all and filter data after?
Found so far:
sqlContext.jsonRdd(rdd, schema)
It did the part with auto generated schema, but InteliJ screams about depreciated method, is there any alternative for it?
As an alternative to case class, use a custom class that implements the Product interface, and then DataFrame will use the schema exposed by your class members without the case class constraints. See in-line comment here http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
If your json is composed of unrooted fragments you could use s3distcp to group the files and concatenate them into fewer files. Also try s3a protocol as it is better performance than s3n.