Loading an entity-relation triple CSV as nodes

Suppose I have a CSV file with data in the format (Subject, relation, Object).
Is it possible to load this into Neo4j as a graph modeled such that the subject and object become nodes and the relation between them is the relation from the triple?
Essentially, while loading from the CSV, I want to create the subject and object as individual nodes, with the relation as the relationship joining them.
(subject)-[:relation]->(object)
My CSV is in the format:
ent1,state,ent2
a,is,b
.
.
.

Yes, it's possible. You need to install the APOC plugin in Neo4j and then use apoc.merge.relationship.
Refer to the following query to load the data, adjusting the file path and labels as required:
LOAD CSV FROM "file:///path-to-file" AS line
MERGE (sub:Subject {name:line[0]})
MERGE (obj:Object {name:line[2]})
WITH sub, obj, line
CALL apoc.merge.relationship(sub,line[1],{},{},obj) YIELD rel
RETURN COUNT(*);
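If your file includes the header row shown in the question (ent1,state,ent2), a variant using LOAD CSV WITH HEADERS avoids importing that header row as data. This is only a sketch: the column names are taken from the question's header and the file path is a placeholder.
LOAD CSV WITH HEADERS FROM "file:///path-to-file" AS row
MERGE (sub:Subject {name: row.ent1})
MERGE (obj:Object {name: row.ent2})
WITH sub, obj, row
CALL apoc.merge.relationship(sub, row.state, {}, {}, obj) YIELD rel
RETURN COUNT(*);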

Related

Neo4j: load 1000 JSON documents using APOC

I'm trying to load JSON documents with a complex structure into Neo4j... I cannot use CSV format...
Can someone please tell me how to load 1000+ JSON documents using APOC?
I know how to load one document; I'm trying to find out how to load the 1000 documents in a loop.
If you want to import from multiple JSON documents serially, you can do something like this (assuming that urls is passed to the Cypher query as a parameter):
UNWIND $urls AS url
CALL apoc.load.json(url) YIELD value
...
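For instance, a fuller version of that serial pattern might look like the sketch below; the Document label, the id and name properties, and the contents of $urls are assumptions for illustration, not part of the question.
UNWIND $urls AS url
CALL apoc.load.json(url) YIELD value
// value is the parsed JSON document; adapt the property mapping to your own structure
MERGE (d:Document {id: value.id})
SET d.name = value.name
RETURN COUNT(*);
In the Neo4j Browser you could supply the parameter with something like :param urls => ['file:///doc1.json', 'file:///doc2.json'], or pass the list in from your driver.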
If it makes sense to perform the imports concurrently (e.g., there is no risk of deadlocks because different files would not write-lock the same nodes or relationships), you could consider using one of the APOC periodic execution procedures.
Assuming that your files are in the "import" directory, the following Cypher should work:
CALL apoc.load.directory()
YIELD value as files
UNWIND files as file
CALL apoc.periodic.iterate('
CALL apoc.load.json($file)
YIELD value as line
', '
CREATE (pc:SomeNode{type: line.type}) //Process line here
', {batchSize:10000, parallel:true, params: {file: file}})
YIELD batches, total, updateStatistics
RETURN batches, total, updateStatistics
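One configuration point to check when loading local files this way: apoc.load.json typically requires file import to be enabled, and the files must be readable from the configured import directory. A minimal sketch of the setting (it lives in apoc.conf on recent versions, neo4j.conf on older ones, which is an assumption about your setup):
apoc.import.file.enabled=true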

ADF: Split a JSON file with an Array of Objects into Single JSON files containing One Element in Each

I'm using Azure Data Factory and trying to convert a JSON file that is an array of JSON objects into separate JSON files, each containing one element, e.g. the input:
[
{"Animal":"Cat","Colour":"Red","Age":12,"Visits":[{"Reason":"Injections","Date":"2020-03-15"},{"Reason":"Check-up","Date":"2020-01-02"}]},
{"Animal":"Dog","Colour":"Blue","Age":1,"Visits":[{"Reason":"Check-up","Date":"2020-02-08"}]},
{"Animal":"Guinea Pig","Colour":"Green","Age":5,"Visits":[{"Reason":"Injections","Date":"2019-12-01"},{"Reason":"Check-up","Date":"2020-02-26"}]}
]
However, I've tried using a Data Flow to split this array into single files, each containing one element of the JSON array, but cannot work it out. Ideally I would also want to name each file dynamically, e.g. Cat.json, Dog.json and "Guinea Pig.json".
Is Data Flow the correct tool for this with Azure Data Factory (version 2)?
Data Flows should do it for you. Your JSON snippet above will generate 3 rows. Each of those rows can be sent to a single sink. Set the sink as a JSON sink with no filename in the dataset. In the Sink transformation, use the 'File Name Option' of 'As Data in Column'. Before the sink, add a Derived Column that sets a new column called 'filename' with this expression:
Animal + '.json'
Use the column name 'filename' as the 'data in column' setting in the sink.
You'll get a separate file for each row.

Losing data from Source to Sink in Copy Data

In my MS Azure Data Factory, I have a REST API connection to a nested JSON dataset.
The Source "Preview data" shows all data. (7 orders from the online store)
In the Copy Data activity, the "Mapping" tab is where I map JSON fields to the sink SQL table columns. If I select None under "Collection Reference", all 7 orders are copied over.
But if I want the nested metadata and select the meta field as the "Collection Reference", I get my nested data as multiple order lines, each with one metadata point, but I only get data from 1 order, not 7.
I think I have found the reason for my problem: one of the fields in the nested metadata is both a string and an array. But I still don't have a solution.
(screenshot of the metadata structure)
Your sense is right: it is caused by the nested structure of your metadata. Based on the description of the Collection Reference property:
If you want to iterate and extract data from the objects inside an array field with the same pattern and convert to per row per object, specify the JSON path of that array to do cross-apply. This property is supported only when hierarchical data is source.
"Same pattern" is the key point here, I think. However, the objects inside your metadata array do not follow the same pattern, as your screenshot shows.
My workaround is to use Azure Blob Storage as an intermediate step: REST API ---> Azure Blob Storage ---> your sink dataset. Inside the Blob Storage dataset, you can flatten the incoming JSON data with the 'Cross-apply nested JSON array' setting.
You can refer to this blog to learn about this feature. Then you can copy the flattened data into your destination.

Importing massive dataset in Neo4j where each entity has differing properties

I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity with a common set of properties, e.g.:
label
description
date
In addition to these, there are zero or more properties specific to the Entity type. For example, if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import, which works over CSV files with specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities, but I believe I would end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
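As a rough sketch (file names, labels, header fields and the relationship type here are made up from the Book/Car examples above, and the flag syntax follows Neo4j 4.x; 3.x uses the older --nodes:Book books.csv form), books.csv could have the header entityId:ID,label,description,date,author,firstPublished and cars.csv the header entityId:ID,label,description,date,make,model, and the import command would point at all of the files:
neo4j-admin import \
  --nodes=Book=books.csv \
  --nodes=Car=cars.csv \
  --relationships=RELATED_TO=relationships.csv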
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: one for the property name and one for the property value.
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
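As an illustration (the values are made up), a CSV line such as
label,The Hobbit,author,J. R. R. Tolkien,date,1937
would produce a node like (:Entity {label: 'The Hobbit', author: 'J. R. R. Tolkien', date: '1937'}).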
NOTE: the above code stores all values as strings. If you want some property values to be integers, booleans, etc., you can do something like this instead (this is probably only sensible if the same property occurs frequently; if the property does not exist on a line, no property is created on the node, but the conversion still wastes some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', TOINTEGER(data.foo)) AS data
CREATE (e:Entity)
SET e = data;

How to load big files (JSON or CSV) in Spark once

I run more than one select on two registered tables in Spark, loaded from JSON and CSV.
But for every select the two files are loaded again. Can I load them into a global object once?
You can use persist() with the StorageLevel MEMORY_AND_DISK:
import org.apache.spark.storage.StorageLevel
dataFrame.persist(StorageLevel.MEMORY_AND_DISK)
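// Subsequent queries against dataFrame now reuse the cached data
// instead of re-reading the JSON/CSV source files on every select.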
Check the Spark documentation on RDD/DataFrame persistence for the other available storage levels.
Note: this option is most useful when you have performed some aggregations/transformations on the input dataset and are about to run further transformations on the result.