Cypher MERGE being too slow - csv

I have a CSV file with about 15 million rows. I am trying to import them with LOAD CSV, but it's taking too long.
When I import them with CREATE, they load in a reasonable amount of time, but that creates a lot of duplicates. So I tried MERGE instead, but it is very slow: the query ran for more than 10 hours before I terminated it. I then tried importing just a few columns and waited more than 30 minutes before terminating that query too. Here is the code for importing just those few columns:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber:line1.CompanyNumber,
companyName:line1.CompanyName,
uri:line1.URI
})
My question is: does MERGE usually behave like this or am I doing something wrong?

Based on the name of your input file (companydata.csv) and the columns used in the MERGE, it looks like you're misusing MERGE, unless the URI is really part of a composite primary key. The usual pattern is:
1) Match on a unique key
2) Set the properties
Otherwise, you'll be creating duplicate nodes, as a MERGE either finds a match on all criteria or creates a new node.
The query should probably look like
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI;
The uniqueness constraint will also create an index, speeding up lookups.
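One way to check that lookups on companyNumber go through the index is to ask for the query plan of a single lookup. This is only a sketch: the company number below is a made-up value, and the exact operator names in the plan vary between Neo4j versions. What you would hope to see is an index seek rather than a scan over every :Company node.

```cypher
// '01234567' is a made-up company number used only for illustration.
// EXPLAIN shows the plan without running the query.
EXPLAIN
MATCH (c:Company {companyNumber: '01234567'})
RETURN c;
```

If the plan shows a label scan followed by a filter, the constraint (and its index) is probably missing or not yet online.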
Update
There's a comment about how to address nodes which don't have a unique property, such as a CompanyInfo node related to the Company and containing extra properties: the CompanyInfo might not have a unique property, but it has a unique relationship to a unique Company, and that's enough to identify it.
A more complete query would then look like:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI
// This pattern is unique
MERGE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo)
SET companyInfo.category = line1.CompanyCategory,
companyInfo.status = line1.CompanyStatus,
companyInfo.countryOfOrigin = line1.CountryofOrigin;
Note that if companydata.csv is the first file imported, and there are no duplicates, you could simply use CREATE instead of MERGE and get rid of the SET, which looks like your initial query:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
CREATE (company:Company {companyNumber: line1.CompanyNumber,
companyName: line1.CompanyName,
uri: line1.URI})
CREATE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo {category: line1.CompanyCategory,
status: line1.CompanyStatus,
countryOfOrigin: line1.CountryofOrigin});
However, I noticed that you mentioned trying CREATE and getting duplicates, so I guess you do have duplicates in the file. In that case, you can stick to the MERGE query, but you might want to adjust your deduplication strategy by replacing SET (last entry wins) with ON CREATE SET (first entry wins) or ON CREATE SET / ON MATCH SET (full control).
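As a rough sketch of those two alternatives, reusing the file and property names from the queries above: the policy in the second variant (keep the first name, let later rows overwrite the URI) is only an assumed example of "full control", not something taken from your data.

```cypher
// Variant A: first entry wins.
// Properties are written only when the node is created, so later
// duplicate rows leave the existing values untouched.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
ON CREATE SET company.companyName = line1.CompanyName,
              company.uri = line1.URI;

// Variant B: full control (assumed policy, for illustration only).
// The first row sets both properties; later duplicates only refresh the URI.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
ON CREATE SET company.companyName = line1.CompanyName,
              company.uri = line1.URI
ON MATCH SET company.uri = line1.URI;
```

Run only one of the two variants; they are alternatives, not consecutive steps.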

Related

Neo4j Cypher - Load Tree Structure from CSV

New to cypher, and I'm trying to load in a csv of a tree structure with 5 columns. For a single row, every item is a node, and every node in column n+1 is a child of the node in column n.
Example:
Csv columns: Level1, Level2, Level3, Level4, Level5
Structure: Level1_thing <--child_of-- Level2_thing <--child_of-- Level3_thing etc...
The database is non-normalized, so there are many repetitions of node names in all the levels except the lowest ones. What's the best way to load in this csv using cypher and create this tree structure from the csv?
Apologies if this question is poorly formatted or asked, I'm new to both stack overflow and graph DBs.
What you are looking for is the MERGE command.
For efficient execution, split the script into two phases:
1) Create nodes if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MERGE (l5:Node {value:row.Level5})
MERGE (l4:Node {value:row.Level4})
MERGE (l3:Node {value:row.Level3})
MERGE (l2:Node {value:row.Level2})
MERGE (l1:Node {value:row.Level1})
2) Create relationships if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MATCH (l5:Node {value:row.Level5})
MATCH (l4:Node {value:row.Level4})
MATCH (l3:Node {value:row.Level3})
MATCH (l2:Node {value:row.Level2})
MATCH (l1:Node {value:row.Level1})
MERGE (l5)-[:child_of]->(l4)
MERGE (l4)-[:child_of]->(l3)
MERGE (l3)-[:child_of]->(l2)
MERGE (l2)-[:child_of]->(l1)
Before anything else, create a uniqueness constraint on the node property so that MERGE can look nodes up through an index. For my example, that would be:
CREATE CONSTRAINT ON (n:Node) ASSERT n.value IS UNIQUE;
IIUC, you can use LOAD CSV in Cypher to load both nodes and relationships. In your case you can use MERGE to take care of duplicates. Your example should work along these lines, with a bit of pseudo-code:
LOAD CSV WITH HEADERS FROM "your_path" AS row
MERGE (l1:Label {prop: row.Level1})
...
MERGE (l5:Label {prop: row.Level5})
MERGE (l1)<-[:CHILD_OF]-(l2)<-...-(l5)
Basically you can create on-the-fly nodes and relationships while reading from the .csv file with headers. Hope that helps.
If the csv-file does not have a header line, and the column sequence is fixed, then you can solve the problem like this:
LOAD CSV from "file:///path/to/tree.csv" AS row
// WITH row SKIP 1 // If there is a header line, you can skip it this way
// Pass on the columns:
UNWIND RANGE(0, size(row)-2) AS i
MERGE (P:Node {id: row[i]})
MERGE (C:Node {id: row[i+1]})
MERGE (P)<-[:child_of]-(C)
RETURN *
And yes, before that it's really worth adding an index:
CREATE INDEX ON :Node(id)

cypher - load multiple csv files

I have many CSV files named 0_0.csv, 0_1.csv, 0_2.csv, ..., 1_0.csv, 1_1.csv, ..., z_17.csv.
How can I import them in a loop or something similar?
I would also like to know whether my approach is sound (each file is 50 MB and the total size is about 100 GB).
This is my code:
create index on :name(v)
create index on :value(v)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.txt" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
You could handle multiple files by constructing a file name. Unfortunately this seems to break when using the USING PERIODIC COMMIT query hint so it won't be a good option for you. You could create a script to wrap it up and send the commands to bin/cypher-shell though.
UNWIND ['0','1','z'] as outer
UNWIND range(0,17) as inner
LOAD CSV WITH HEADERS FROM 'file:///'+ outer +'_' + toString(inner) + '.csv' AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
As for your actual load query: do your name and value nodes come up multiple times in the files? If they are unique, you would be better off loading the data in multiple passes. Load the nodes first without the indexes; then add the indexes once the nodes are loaded; and then do the relationships as the last step.
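A minimal sketch of that multi-pass load for a single file, assuming every name and every value really does occur only once (otherwise the CREATE in the first pass would produce duplicates). The file name 0_0.csv is taken from the question's file list; the same three passes would be repeated or scripted for the other files.

```cypher
// Pass 1: nodes only, with no indexes in place yet.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
CREATE (:name {v: csv.name})
CREATE (:value {v: csv.value});

// Pass 2: add the indexes once all nodes are loaded.
CREATE INDEX ON :name(v);
CREATE INDEX ON :value(v);

// Pass 3: relationships, looking both ends up through the new indexes.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
MATCH (n:name {v: csv.name})
MATCH (m:value {v: csv.value})
CREATE (n)-[:kind {v: csv.kind}]->(m);
```

When loading many files this way, it would make sense to run pass 1 for all files first, then pass 2 once, then pass 3 for all files.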
Using CREATE for the :kind relationship will result in multiple relationships even if it is the same value for csv.kind. You might want to use MERGE instead if that is the case.
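For illustration, here is the load query from the question with only the last clause changed, so that a repeated (name, value, kind) row reuses the existing relationship instead of adding another one. The file name 0_0.csv is an assumption based on the file list in the question.

```cypher
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v: csv.name})
MERGE (m:value {v: csv.value})
// MERGE matches on type, direction and the v property together,
// so two rows with different csv.kind values still yield two relationships.
MERGE (n)-[:kind {v: csv.kind}]->(m)
```

MERGE on relationships costs an extra lookup per row, so CREATE remains the faster choice when the rows are known to be distinct.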
For 100 GB of data though if you are starting with an empty database and are looking for speed, I would take a look at using bin/neo4j-admin import.

neo4j Model Creation from flat CSV

I am doing a proof of concept and need a little guidance. I have a flat file that contains the following attributes: ID, Name, Email, Gender, StreetAddress, City, State, Zip, Phone, AltPhone, SSN (all fake data.)
I want to import this in a way that each Person is a node, each address is a node, each ssn is a node, and each phone/altphone is a node. This is to mimic many of the examples of fraud ring detection. How can I load this CSV file, creating these relationships? There will be duplicate addresses and phone numbers but where duplicates exist, only 1 node should exist.
Is there a way to do this using the standard LOAD CSV, or do I need to break all this data up, relationally, outside of Neo4j?
You can do this using the MERGE Cypher command. MERGE will look for a pattern and create it if it doesn't exist, but will not create duplicate data.
First, define uniqueness constraints based on your data model. You should define a uniqueness constraint for any Label property used in a MERGE statement:
CREATE CONSTRAINT ON (p:Person) ASSERT p.personID IS UNIQUE;
CREATE CONSTRAINT ON (phone:PhoneNumber) ASSERT phone.number IS UNIQUE;
...
Then, using MERGE with LOAD CSV will look something like this (depending on the data model you want):
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MERGE (p:Person {personID: row.ID})
SET p.name = row.Name,
p.email = row.Email,
p.gender = row.Gender
MERGE (phone:PhoneNumber {number: row.Phone})
MERGE (altPhone:PhoneNumber {number: row.AltPhone})
MERGE (ssn:SSN {number: row.SSN})
MERGE (address:StreetAddress {address: row.StreetAddress})
MERGE (city:City {name: row.City})
MERGE (state:State {name: row.State})
MERGE (p)-[:HAS_SSN]->(ssn)
MERGE (p)-[:HAS_PHONE]->(phone)
MERGE (p)-[:HAS_ALT_PHONE]->(altPhone)
MERGE (p)-[:HAS_ADDRESS]->(address)
MERGE (address)-[:IS_IN]->(city)
MERGE (city)-[:IS_IN]->(state)
The general recommendation is to break up the Cypher statements in LOAD CSV and accomplish what you need in steps. Put simply, you wouldn't do everything in a single LOAD CSV statement. You might create the SSN nodes first, then the Address nodes, and so on.
You will also want to create indexes and use MERGE for values that may be duplicated across rows.
Here is a good article on things to consider when using LOAD CSV.
And a post from Mark on loading data. Mark has a ton of great posts, so I would encourage you to poke around his blog.
Lastly, check out the docs for MERGE.
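As a rough sketch of the step-by-step approach, here is how the Person and SSN part of the model could be loaded in separate passes. The labels, property names and the file name flat_file.csv are borrowed from the single-statement example earlier in this thread, and the sketch assumes uniqueness constraints on :Person(personID) and :SSN(number) already exist. The other node types would follow the same pattern.

```cypher
// Step 1: Person nodes only.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MERGE (p:Person {personID: row.ID})
SET p.name = row.Name,
    p.email = row.Email,
    p.gender = row.Gender;

// Step 2: SSN nodes only; MERGE collapses repeated numbers into one node.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MERGE (:SSN {number: row.SSN});

// Step 3: relationships between nodes that now exist.
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MATCH (p:Person {personID: row.ID})
MATCH (ssn:SSN {number: row.SSN})
MERGE (p)-[:HAS_SSN]->(ssn);
```

Each pass reads the file again, but every statement stays small and only touches one kind of node or relationship at a time.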

Uploading CSV in neo4j

I am trying to load the following CSV file (https://www.dropbox.com/s/95j774tg13qsdxr/out.csv?dl=0) into Neo4j with the following command:
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/out.csv"
AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (Prokaryotes_Proteins: Prokaryotes_Proteins{UniprotID: csvimport.DBUni, ProteinID: csvimport.ProteinID, IdentityPercentage: csvimport.IdentityPercentage, AlignedLength:csvimport.al, Mismatches:csvimport.mm, QueryStart:csvimport.qs, QueryEnd: csvimport.qe, SubjectStrat: csvimport.ss, SubjectEnd: csvimport.se, Evalue: csvimport.evalue, BitScore: csvimport.bs})
merge (uniprotid)-[:BlastResults]->(Prokaryotes_Proteins)
I used "match" command in the LOAD CSV command in order to match with the "Uniprot_ID's" of previously loaded CSV.
I have first loaded ReactomeDB.csv (https://www.dropbox.com/s/9e5m1629p3pi3m5/Reactomesample.csv?dl=0) with the following cypher
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/Reactomesample.csv"
AS csvimport
merge (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (reactionname: ReactionName{ReactionName: csvimport.ReactionName, ReactomeID: csvimport.ReactomeID})
merge (uniprotid)-[:ReactionInformation]->(reactionname)
into Neo4j, which was successful.
After that, I loaded out.csv.
Both CSV files have a Uniprot_ID column, and some of the IDs appear in both files. Even so, the query returns no rows.
Any solutions?
Thanks in Advance
Pavan Kumar Alluri
Just a few tips:
only use ONE label and ONE property for MERGE
set the others with ON CREATE SET ...
try to create nodes and rels separately, otherwise you might get into memory issues
you should be consistent with the spelling and upper/lowercase of properties and labels, otherwise you will spend hours debugging (labels, rel-types and property-names are case-sensitive)
you probably don't need merge for relationships, create should do fine
for your statement:
CREATE CONSTRAINT ON (up:UniprotID) assert up.Uniprotid is unique;
CREATE CONSTRAINT ON (pp:Prokaryotes_Proteins) assert pp.UniprotID is unique;
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
merge (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
ON CREATE SET pp.ProteinID=csvimport.ProteinID,
pp.IdentityPercentage=csvimport.IdentityPercentage, ...
;
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
match (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
merge (uniprotid)-[:BlastResults]->(pp);

CSV LOAD and updating existing nodes / creating new ones

I might be on the wrong track, so I could use some helpful input. I receive data from other systems as CSV files, which I can import into my DB with LOAD CSV. So far so good.
I got stuck when I needed to reload the CSV to pick up updates. I cannot delete the existing data because additional user input might already be attached to it, so I need a query that imports the CSV data, matches each row against the DB and, when it finds the node, just uses SET to overwrite the existing properties. What I am unsure about is how to catch the cases where there is no node in the DB (a new record) and a node has to be created.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MATCH (c:Customer {code:"ABC"})
SET c.name = name: csvLine[0]
***OPTIONAL MATCH // Here I am unsure how to express when the node is not found***
MERGE (c:Customer { name: csvLine[0], code: csvLine[1]})
So ideally Cypher would check whether the node is there and update it by SETting the new properties from the CSV or, if the node cannot be found, create a new one with the CSV data.
And as a side note: how would I find nodes that are in the DB but not in the CSV file, in order to mark them as obsolete? (This might not be possible during the import, but maybe someone has an idea how to solve it so the DB stays clean of deleted records, which can only be detected by comparison with the latest CSV import. I am happy for every idea.)
Any idea or hint on how to write the query for updating the graph while importing?
You need to use MERGE's ON MATCH and/or ON CREATE handlers, see http://neo4j.com/docs/stable/query-merge.html#_use_on_create_and_on_match. I assume the customer code in the second column is the identifier, so the name in column one might change on updates:
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code:csvLine[1]})
ON CREATE SET c.name = csvLine[0]
ON MATCH SET c.name = csvLine[0]
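For the side note about obsolete records, one possible approach (an assumption on my part, not something built into LOAD CSV) is to stamp every node touched by an import with a run identifier and then flag whatever was not stamped. The properties lastImport and obsolete and the value 'import-42' are hypothetical names chosen for this sketch.

```cypher
// Pass 1: upsert each row and stamp it with this run's identifier.
// 'import-42' is a made-up identifier; use a new value for every import.
LOAD CSV FROM "file:xxx.csv" AS csvLine
MERGE (c:Customer {code: csvLine[1]})
SET c.name = csvLine[0],
    c.lastImport = 'import-42',
    c.obsolete = false;

// Pass 2: customers not stamped by this run were absent from the file.
MATCH (c:Customer)
WHERE c.lastImport IS NULL OR c.lastImport <> 'import-42'
SET c.obsolete = true;
```

Because the nodes are only flagged and never deleted, any user input attached to them stays in place, and a customer that reappears in a later file is un-flagged by pass 1.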