Import relationships from CSV to Neo4J - cannot create multiple relationships

I have simple, large CSV file, with no headers, of the structure:
name1, name2
name3, name4
name2, name4
...
I'm trying to import it all to Neo4J and create the relationships at the same time. First I've added the constraint CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE and then I ran:
USING PERIODIC COMMIT
LOAD CSV FROM '${file}' AS line
WITH line LIMIT 50000
MERGE (u:User {name: line[0]})-[:connected_to]->(q:User {name: line[1]})
The graph I get is just isolated connected pairs. I cannot find a single node that has more than one relationship (even though many names appear several times in both the left and right columns). I also expected to see some clusters.
Clearly I'm doing something wrong with my insertion. I assume I can run down the file twice and create all nodes and then create all relationships, but I feel like I'm missing something simple that can do it all in one operation.
Correction: Had one of the property names as 'number' - they are both 'name'.

You need to create the entries individually first. MERGE looks for the entire pattern and, if it cannot find it, creates the entire pattern. As a result, you only get pairs matching each row of your file.
If you MERGE each name in the line first and then MERGE the relationship afterwards, you will get the connected graph you want. Note that the relationship MERGE is undirected. This ensures that only a single relationship is created between two particular nodes, regardless of the order in the file or the number of occurrences.
USING PERIODIC COMMIT
LOAD CSV FROM '${file}' AS line
WITH line LIMIT 50000
MERGE (u:User {name: trim(line[0])} )
MERGE (q:User {name: trim(line[1])} )
MERGE (u)-[:connected_to]-(q)
If the data contains entries like the following, where the same pair repeats in a different order, and you want relationships created in both directions, then you can make the relationship MERGE directed,
...
name1, name2
name2, name1
...
as in the following example:
USING PERIODIC COMMIT
LOAD CSV FROM '${file}' AS line
WITH line LIMIT 50000
MERGE (u:User {name: trim(line[0])} )
MERGE (q:User {name: trim(line[1])} )
MERGE (u)-[:connected_to]->(q)
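If you want to sanity-check the result after re-importing, a quick degree query will show whether clusters actually formed. This is a minimal sketch, assuming the User label and connected_to relationship type used above:
// Nodes with more than one connection indicate the pairs have merged into clusters
MATCH (u:User)-[r:connected_to]-()
WITH u, count(r) AS degree
WHERE degree > 1
RETURN u.name AS name, degree
ORDER BY degree DESC
LIMIT 25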

Related

Neo4j load csv creating more relationships than lines in CSV

I have a file, people.csv, with values (NPI, FirstName, LastName). This Cypher query populates the database with as many Person nodes as there are lines in the csv.
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
CREATE (:Person {id: toInteger(row.NPI), first_name: row.FirstName, last_name: row.LastName});
CREATE INDEX FOR (p:Person) ON (p.id);
There is another file, refers.csv, with values (ReferNPI, ReferredNPI, NumReferrals). This query produces many thousands of times more relationships than there are lines in the file, even though each line is intended to represent one relationship.
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///refers.csv' AS row
MATCH (refer:Person {id: toInteger(row.ReferNPI)})
MATCH (referred:Person {id: toInteger(row.ReferredNPI)})
CREATE (refer)-[:REFERS {num_refers: toInteger(row.NumReferrals)}]->(referred)
It would appear that my understanding of Cypher's semantics is incorrect; perhaps it's creating every possible combination of nodes that match these two patterns. How can I ensure that only one pair of nodes is connected per line of the csv?
In your people.csv import, NPI is not enforced as a unique key; the only truly unique identifier is the internal id generated by neo4j. So if an NPI value is duplicated in people.csv, you get duplicate Person nodes, and the refers.csv import will then create duplicate relationships.
For example, People.csv contains:
p1: NPI: 123 (generated internal id 789)
p2: NPI: 123 (generated internal id 790)
Refers.csv contains:
ReferNPI: 123, ReferredNPI: 345
Instead of getting one relationship (id:789)-[:REFERS]->(NPI:345), you also create another relationship (id:790)-[:REFERS]->(NPI:345), because both person1 and person2 have the same NPI.
Edit (to answer the comment below):
To show that Person has duplicate nodes, try running this cypher query:
MATCH (p:Person)
WITH p.id AS pNPI, collect(p) AS person
RETURN pNPI, size(person) AS counts
ORDER BY counts DESC LIMIT 5
I would suggest that you do the following:
Change CREATE to MERGE so that it will not create duplicates (a relationship-level sketch of the same idea follows the constraint below)
Use ID consistently instead of id; neo4j labels, relationship types and property names are case-sensitive
Create a CONSTRAINT on ID; this will prevent you from creating duplicate Person nodes with the same ID (NPI):
CREATE CONSTRAINT ON (n:Person) assert n.ID IS UNIQUE;
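As mentioned in point 1, once the Person nodes are unique you can also make the relationship load idempotent. This is only a sketch, assuming the nodes were reloaded with an ID property as suggested above (adjust the property name if you keep id):
:auto USING PERIODIC COMMIT LOAD CSV WITH HEADERS FROM 'file:///refers.csv' AS row
MATCH (refer:Person {ID: toInteger(row.ReferNPI)})
MATCH (referred:Person {ID: toInteger(row.ReferredNPI)})
// MERGE instead of CREATE: re-running the load cannot duplicate the relationship
MERGE (refer)-[r:REFERS]->(referred)
ON CREATE SET r.num_refers = toInteger(row.NumReferrals);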

Neo4j Cypher - Load Tree Structure from CSV

New to cypher, and I'm trying to load in a csv of a tree structure with 5 columns. For a single row, every item is a node, and every node in column n+1 is a child of the node in column n.
Example:
Csv columns: Level1, Level2, Level3, Level4, Level5
Structure: Level1_thing <--child_of-- Level2_thing <--child_of-- Level3_thing etc...
The database is non-normalized, so there are many repetitions of node names in all the levels except the lowest ones. What's the best way to load in this csv using cypher and create this tree structure from the csv?
Apologies if this question is poorly formatted or asked, I'm new to both stack overflow and graph DBs.
What you are looking for is the MERGE command.
For an optimal execution, you should do this in two phases:
1) Create nodes if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MERGE (l5:Node {value:row.Level5})
MERGE (l4:Node {value:row.Level4})
MERGE (l3:Node {value:row.Level3})
MERGE (l2:Node {value:row.Level2})
MERGE (l1:Node {value:row.Level1})
2) Create relationships if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MATCH (l5:Node {value:row.Level5})
MATCH (l4:Node {value:row.Level4})
MATCH (l3:Node {value:row.Level3})
MATCH (l2:Node {value:row.Level2})
MATCH (l1:Node {value:row.Level1})
MERGE (l5)-[:child_of]->(l4)
MERGE (l4)-[:child_of]->(l3)
MERGE (l3)-[:child_of]->(l2)
MERGE (l2)-[:child_of]->(l1)
And before all of this, you need to create a constraint on your nodes to speed up the work of the MERGE. In this example it would be:
CREATE CONSTRAINT ON (n:Node) ASSERT n.value IS UNIQUE;
IIUC, you can use the LOAD CSV clause in Cypher to load both nodes and relationships, and in your case you can use MERGE to take care of the duplicates. Your example should work in this way, with a bit of pseudo-code:
LOAD CSV WITH HEADERS FROM "your_path" AS row
MERGE (l1:Label {prop: row.Level1})
...
MERGE (l5:Label {prop: row.Level5})
MERGE (l1)<-[:CHILD_OF]-(l2)<-...<-(l5)
Basically you can create nodes and relationships on the fly while reading from the .csv file with headers. Hope that helps.
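Spelled out in full, the single-pass version might look like the sketch below. It assumes the Level1..Level5 headers from the question and reuses the Node label and value property from the two-phase answer above; merging each node and each relationship separately avoids the whole-pattern MERGE pitfall from the first question on this page.
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
// Nodes first, so repeated names collapse into a single node
MERGE (l1:Node {value: row.Level1})
MERGE (l2:Node {value: row.Level2})
MERGE (l3:Node {value: row.Level3})
MERGE (l4:Node {value: row.Level4})
MERGE (l5:Node {value: row.Level5})
// Then one relationship per adjacent pair of levels
MERGE (l2)-[:child_of]->(l1)
MERGE (l3)-[:child_of]->(l2)
MERGE (l4)-[:child_of]->(l3)
MERGE (l5)-[:child_of]->(l4)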
If the csv-file does not have a header line, and the column sequence is fixed, then you can solve the problem like this:
LOAD CSV from "file:///path/to/tree.csv" AS row
// WITH row SKIP 1 // If there is a header line, you can skip it
// Walk over adjacent column pairs:
UNWIND RANGE(0, size(row)-2) AS i
MERGE (P:Node {id: row[i]})
MERGE (C:Node {id: row[i+1]})
MERGE (P)<-[:child_of]-(C)
RETURN *
And yes, before that it's really worth adding an index:
CREATE INDEX ON :Node(id)
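Alternatively, a uniqueness constraint (old-style syntax, like the other answers on this page) gives you the same index plus protection against duplicate ids; this is just an alternative sketch:
CREATE CONSTRAINT ON (n:Node) ASSERT n.id IS UNIQUE;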

Cypher MERGE being too slow

I have a CSV file with about 15 million rows. I am trying to import them with CSV IMPORT but it's taking too long.
When I try to import them with CREATE they get imported in a decent amount of time, but that creates a lot of duplicates. So I tried to use MERGE instead, but it's taking a lot of time: the query was running for more than 10 hours before I terminated it. After that I tried to import just a few columns and waited more than 30 minutes before I terminated the query. Here is the code for importing just those few columns:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber:line1.CompanyNumber,
companyName:line1.CompanyName,
uri:line1.URI
})
My question is: does MERGE usually behave like this or am I doing something wrong?
Based on the name of your input file (companydata.csv) and the columns used in the MERGE, it looks like you're misusing MERGE, unless the URI is really part of a composite primary key. Instead you should:
Match on a unique key
Set the other properties
Otherwise, you'll be creating duplicate nodes, since MERGE either finds a match on all of the given criteria or creates a new node.
The query should probably look like
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI;
The uniqueness constraint will also create an index, speeding up lookups.
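If you want to confirm that the index is actually used, you can PROFILE a single lookup and check that the plan shows an index seek rather than a label scan. A small sketch (the company number here is just a made-up example value):
// Expect an index seek operator (e.g. NodeUniqueIndexSeek) rather than NodeByLabelScan in the plan
PROFILE MATCH (c:Company {companyNumber: '12345678'}) RETURN c;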
Update
There's a comment about how to address nodes which don't have a unique property, such as a CompanyInfo node related to the Company and containing extra properties: the CompanyInfo might not have a unique property, but it has a unique relationship to a unique Company, and that's enough to identify it.
A more complete query would then look like:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI
// This pattern is unique
MERGE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo)
SET companyInfo.category = line1.CompanyCategory,
companyInfo.status = line1.CompanyStatus,
companyInfo.countryOfOrigin = line1.CountryofOrigin;
Note that if companydata.csv is the first file imported, and there are no duplicates, you could simply use CREATE instead of MERGE and get rid of the SET, which looks like your initial query:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
CREATE (company:Company {companyNumber: line1.CompanyNumber,
companyName: line1.CompanyName,
uri: line1.URI})
CREATE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo {category: line1.CompanyCategory,
status: line1.CompanyStatus,
countryOfOrigin: line1.CountryofOrigin});
However, I noticed that you mentioned trying CREATE and getting duplicates, so I guess you do have duplicates in the file. In that case, you can stick to the MERGE query, but you might want to adjust your deduplication strategy by replacing SET (last entry wins) with ON CREATE SET (first entry wins) or ON CREATE SET / ON MATCH SET (full control).
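For instance, the "first entry wins" variant is just a matter of moving the property assignment to ON CREATE SET. A sketch only, using the same file and columns as above:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
// Properties are only written the first time this company number is seen
ON CREATE SET company.companyName = line1.CompanyName,
              company.uri = line1.URI;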

neo4j Model Creation from flat CSV

I am doing a proof of concept and need a little guidance. I have a flat file that contains the following attributes: ID, Name, Email, Gender, StreetAddress, City, State, Zip, Phone, AltPhone, SSN (all fake data.)
I want to import this in a way that each Person is a node, each address is a node, each ssn is a node, and each phone/altphone is a node. This is to mimic many of the examples of fraud ring detection. How can I load this CSV file, creating these relationships? There will be duplicate addresses and phone numbers but where duplicates exist, only 1 node should exist.
Is there a way to do this using the standard LOAD CSV or do I need to break all this data up, relationally, outside of neoj4?
You can do this using the MERGE Cypher command. MERGE will look for a pattern and create it if it doesn't exist, but will not create duplicate data.
First, define uniqueness constraints based on your data model. You should define a uniqueness constraint for any Label property used in a MERGE statement:
CREATE CONSTRAINT ON (p:Person) ASSERT p.personID IS UNIQUE;
CREATE CONSTRAINT ON (phone:Phone) ASSERT phone.number IS UNIQUE;
...
Then, using MERGE with LOAD CSV will look something like this (depending on the data model you want):
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MERGE (p:Person {personID: row.ID})
SET p.name = row.Name,
p.email = row.Email,
p.gender = row.Gender
MERGE (phone:PhoneNumber {number: row.Phone})
MERGE (altPhone:PhoneNumber {number: row.AltPhone})
MERGE (ssn:SSN {number: row.SSN})
MERGE (address:StreetAddress {address: row.StreetAddress})
MERGE (city:City {name: row.City})
MERGE (state:State {name: row.State})
MERGE (p)-[:HAS_SSN]->(ssn)
MERGE (p)-[:HAS_PHONE]->(phone)
MERGE (p)-[:HAS_ALT_PHONE]->(altPhone)
MERGE (p)-[:HAS_ADDRESS]->(address)
MERGE (address)-[:IS_IN]->(city)
MERGE (city)-[:IS_IN]->(state)
The general recommendation is to break up the Cypher statements and accomplish what you need in steps; simply put, you wouldn't do everything in a single LOAD CSV statement. You might create the SSN nodes first, then the Address nodes, and so on, as sketched below.
Moreover, you will want to look at creating indexes and at using MERGE for rows that may be duplicated.
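For example (a sketch only, reusing the hypothetical flat_file.csv headers and the labels from the query above), one pass could create just the SSN nodes and a later pass just the Person-to-SSN relationships:
// Pass 1: SSN nodes only
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MERGE (:SSN {number: row.SSN});

// Pass 2: relationships only (assumes the Person nodes were created in an earlier pass)
LOAD CSV WITH HEADERS FROM "file:///flat_file.csv" AS row
MATCH (p:Person {personID: row.ID})
MATCH (ssn:SSN {number: row.SSN})
MERGE (p)-[:HAS_SSN]->(ssn);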
Here is a good article on things to consider when using LOAD CSV.
And a post from Mark on loading data. Mark has a ton of great posts, so I would encourage you to poke around his blog.
Lastly, check out the docs for MERGE.

Uploading CSV in neo4j

I am trying to upload the following csv file (https://www.dropbox.com/s/95j774tg13qsdxr/out.csv?dl=0) into neo4j with the following command
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/out.csv"
AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (Prokaryotes_Proteins: Prokaryotes_Proteins{UniprotID: csvimport.DBUni, ProteinID: csvimport.ProteinID, IdentityPercentage: csvimport.IdentityPercentage, AlignedLength:csvimport.al, Mismatches:csvimport.mm, QueryStart:csvimport.qs, QueryEnd: csvimport.qe, SubjectStrat: csvimport.ss, SubjectEnd: csvimport.se, Evalue: csvimport.evalue, BitScore: csvimport.bs})
merge (uniprotid)-[:BlastResults]->(Prokaryotes_Proteins)
I used the MATCH clause in this LOAD CSV command in order to match against the Uniprot_ID values from the previously loaded CSV.
I have first loaded ReactomeDB.csv (https://www.dropbox.com/s/9e5m1629p3pi3m5/Reactomesample.csv?dl=0) with the following cypher
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/Reactomesample.csv"
AS csvimport
merge (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (reactionname: ReactionName{ReactionName: csvimport.ReactionName, ReactomeID: csvimport.ReactomeID})
merge (uniprotid)-[:ReactionInformation]->(reactionname)
into neo4j which was successful.
Later on I am uploading out.csv
Both CSV files contain a Uniprot_ID column and some of those IDs are the same. Yet even though some Uniprot_ID values are common to both files, neo4j is not returning any rows.
Any solutions?
Thanks in advance,
Pavan Kumar Alluri
Just a few tips:
only use ONE label and ONE property for MERGE
set the others with ON CREATE SET ...
try to create nodes and rels separately, otherwise you might get into memory issues
you should be consistent with your spelling and upper/lowercase of properties and labels, otherwise you will spend hours debugging (labels, rel-types and property names are case-sensitive)
you probably don't need merge for relationships, create should do fine
for your statement:
CREATE CONSTRAINT ON (up:UniprotID) assert up.Uniprotid is unique;
CREATE CONSTRAINT ON (pp:Prokaryotes_Proteins) assert pp.UniprotID is unique;
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
merge (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
ON CREATE SET pp.ProteinID=csvimport.ProteinID,
pp.IdentityPercentage=csvimport.IdentityPercentage, ...
;
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
match (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
merge (uniprotid)-[:BlastResults]->(pp);
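If the relationship MERGE still produces nothing, it is worth checking how many rows of out.csv actually match an existing UniprotID node. A small diagnostic sketch, using the same file and property names as above:
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
// OPTIONAL MATCH keeps every CSV row, so matched vs. total can be compared
OPTIONAL MATCH (u:UniprotID {Uniprotid: csvimport.Uniprot_ID})
RETURN count(*) AS totalRows, count(u) AS matchedRows;
If matchedRows comes back as 0, look for case or whitespace differences in the Uniprot_ID values (trimming the values, as in the first answer on this page, often helps).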