Uploading CSV in neo4j

I am trying to upload the following CSV file (https://www.dropbox.com/s/95j774tg13qsdxr/out.csv?dl=0) into Neo4j with the following command:
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/out.csv"
AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (Prokaryotes_Proteins:Prokaryotes_Proteins {UniprotID: csvimport.DBUni, ProteinID: csvimport.ProteinID,
  IdentityPercentage: csvimport.IdentityPercentage, AlignedLength: csvimport.al, Mismatches: csvimport.mm,
  QueryStart: csvimport.qs, QueryEnd: csvimport.qe, SubjectStart: csvimport.ss, SubjectEnd: csvimport.se,
  Evalue: csvimport.evalue, BitScore: csvimport.bs})
merge (uniprotid)-[:BlastResults]->(Prokaryotes_Proteins)
I used "match" command in the LOAD CSV command in order to match with the "Uniprot_ID's" of previously loaded CSV.
I have first loaded ReactomeDB.csv (https://www.dropbox.com/s/9e5m1629p3pi3m5/Reactomesample.csv?dl=0) with the following cypher
LOAD CSV WITH HEADERS FROM
"file:/home/pavan637/Neo4jDemo/Reactomesample.csv"
AS csvimport
merge (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
merge (reactionname: ReactionName{ReactionName: csvimport.ReactionName, ReactomeID: csvimport.ReactomeID})
merge (uniprotid)-[:ReactionInformation]->(reactionname)
That import into Neo4j was successful.
Later on I upload out.csv with the first statement above.
Both CSV files contain a Uniprot_ID column, and some of those IDs are the same. Even though some Uniprot_IDs are common to both files, Neo4j is not returning any rows.
Any solutions?
Thanks in advance,
Pavan Kumar Alluri

Just a few tips:
only use ONE label and ONE property for MERGE
set the others with ON CREATE SET ...
try to create nodes and rels separately, otherwise you might run into memory issues
be consistent with your spelling and upper/lowercase of properties and labels, otherwise you will spend hours debugging (labels, rel-types and property names are case-sensitive)
you probably don't need merge for relationships, create should do fine
for your statement:
CREATE CONSTRAINT ON (up:UniprotID) ASSERT up.Uniprotid IS UNIQUE;
CREATE CONSTRAINT ON (pp:Prokaryotes_Proteins) ASSERT pp.UniprotID IS UNIQUE;
USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
merge (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
ON CREATE SET pp.ProteinID=csvimport.ProteinID,
pp.IdentityPercentage=csvimport.IdentityPercentage, ...
;
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
match (uniprotid:UniprotID{Uniprotid: csvimport.Uniprot_ID})
match (pp: Prokaryotes_Proteins {UniprotID: csvimport.DBUni})
merge (uniprotid)-[:BlastResults]->(pp);
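If the relationship pass still produces no rows after that, it is worth checking the raw ID values themselves; stray whitespace or case differences between the two files are a common cause of silent non-matches. A quick diagnostic sketch (the trim() call is my addition, not part of the original statements):
LOAD CSV WITH HEADERS FROM "file:/home/pavan637/Neo4jDemo/out.csv" AS csvimport
// Whitespace-tolerant lookup; 'found' is false for IDs with no matching node
OPTIONAL MATCH (u:UniprotID {Uniprotid: trim(csvimport.Uniprot_ID)})
RETURN csvimport.Uniprot_ID AS id, count(u) > 0 AS found
LIMIT 20;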

Related

LOAD CSV: multiple MERGE and Eager operator

I have a large CSV file that contains multiple nodes per line. I would like to use LOAD CSV to MERGE the nodes and set some properties. However, I always get the "Eager operator" warning for this query:
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: line[0]})
ON CREATE SET c.language = line[1]
MERGE (l:LexicalForm {lui: line[3]})
ON CREATE SET l.status = line[2];
When I remove the ON CREATE part it works but I want to merge on a specific ID, not on the combination of the ID and the other properties.
Is it possible to rephrase this somehow to avoid the Eager operator? I would like to create 6 different nodes from a single line and the alternative would be to iterate the file 6 times.
You have an RRF file, and LOAD CSV expects CSV-formatted lines (here with '|' as the field terminator). When you reference a field from the line, cast it to the intended data type via your alias (line). You should also give USING PERIODIC COMMIT an explicit number of rows per commit, and your file needs to be in the import folder of your database.
For example:
USING PERIODIC COMMIT 5000
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: toInteger(line[0]), language: toString(line[1])})
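As for avoiding the Eager operator itself: the fallback mentioned in the question, iterating the file once per node label, is the usual way to sidestep it, since each statement then has a single MERGE target and can stream under USING PERIODIC COMMIT. A sketch for the two labels above:
// Pass 1: Concept nodes only
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (c:Concept {cui: line[0]})
ON CREATE SET c.language = line[1];

// Pass 2: LexicalForm nodes only
USING PERIODIC COMMIT
LOAD CSV FROM 'file:///MRCONSO.RRF' AS line FIELDTERMINATOR '|'
MERGE (l:LexicalForm {lui: line[3]})
ON CREATE SET l.status = line[2];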

Neo4j Cypher - Load Tree Structure from CSV

New to cypher, and I'm trying to load in a csv of a tree structure with 5 columns. For a single row, every item is a node, and every node in column n+1 is a child of the node in column n.
Example:
Csv columns: Level1, Level2, Level3, Level4, Level5
Structure: Level1_thing <--child_of-- Level2_thing <--child_of-- Level3_thing etc...
The database is non-normalized, so there are many repetitions of node names in all the levels except the lowest ones. What's the best way to load in this csv using cypher and create this tree structure from the csv?
Apologies if this question is poorly formatted or asked, I'm new to both stack overflow and graph DBs.
What you are searching for is the MERGE command.
For optimal execution, you should do your script in two phases:
1) Create nodes if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MERGE (l5:Node {value:row.Level5})
MERGE (l4:Node {value:row.Level4})
MERGE (l3:Node {value:row.Level3})
MERGE (l2:Node {value:row.Level2})
MERGE (l1:Node {value:row.Level1})
2) Create relationships if they don't already exist
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///my_file.csv" AS row
MATCH (l5:Node {value:row.Level5})
MATCH (l4:Node {value:row.Level4})
MATCH (l3:Node {value:row.Level3})
MATCH (l2:Node {value:row.Level2})
MATCH (l1:Node {value:row.Level1})
MERGE (l5)-[:child_of]->(l4)
MERGE (l4)-[:child_of]->(l3)
MERGE (l3)-[:child_of]->(l2)
MERGE (l2)-[:child_of]->(l1)
Before all of that, you need to create a constraint on your nodes to speed up the work of the MERGE. For my example it will be:
CREATE CONSTRAINT ON (n:Node) ASSERT n.value IS UNIQUE;
IIUC, you can use the LOAD CSV function in Cypher to load both nodes and relationships. In your case you can use MERGE to take care of duplicates. Your example should work in this way, with a bit of pseudo-code:
LOAD CSV with HEADERS from "your_path" AS row
MERGE (l1:Label {prop: row.Level1})
...
MERGE (l5:Label {prop: row.Level5})
MERGE (l1)<-[:CHILD_OF]-(l2)<- ... -(l5)
Basically you can create on-the-fly nodes and relationships while reading from the .csv file with headers. Hope that helps.
If the csv-file does not have a header line, and the column sequence is fixed, then you can solve the problem like this:
LOAD CSV from "file:///path/to/tree.csv" AS row
// WITH row SKIP 1 // If there is a header row, you can skip it like this
// Pass on the columns:
UNWIND RANGE(0, size(row)-2) AS i
MERGE (P:Node {id: row[i]})
MERGE (C:Node {id: row[i+1]})
MERGE (P)<-[:child_of]-(C)
RETURN *
And yes, before that it's really worth adding an index:
CREATE INDEX ON :Node(id)

cypher - load multiple csv files

I have many csv files with names 0_0.csv , 0_1.csv , 0_2.csv , ... , 1_0.csv , 1_1.csv , ... , z_17.csv.
I wanted to know how I can import them in a loop or something like that.
I also wanted to know whether I am doing it right (each file is 50 MB and the total size of all files is about 100 GB).
This is my code:
create index on :name(v)
create index on :value(v)
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.txt" AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
You could handle multiple files by constructing the file name. Unfortunately this seems to break when using the USING PERIODIC COMMIT query hint, so it won't be a good option for you. You could, however, write a script that wraps the per-file statements up and sends them to bin/cypher-shell.
UNWIND ['0','1','z'] as outer
UNWIND range(0,17) as inner
LOAD CSV WITH HEADERS FROM 'file:///'+ outer +'_' + toString(inner) + '.csv' AS csv
FIELDTERMINATOR ','
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value})
CREATE (n)-[:kind {v:csv.kind}]->(m)
As far as your actual load query goes: do your name and value nodes come up multiple times in the files? If they are unique, you would be better off loading the data in multiple passes, as sketched below. Load the nodes first without the indexes; then add the indexes once the nodes are loaded; and then do the relationships as the last step.
Using CREATE for the :kind relationship will result in multiple relationships even if it is the same value for csv.kind. You might want to use MERGE instead if that is the case.
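A sketch of that multi-pass approach for one file (MERGE is used on the relationship on the assumption that duplicate name/kind/value rows may occur):
// Pass 1: nodes only
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
MERGE (n:name {v:csv.name})
MERGE (m:value {v:csv.value});

// Create the indexes here, between the two passes:
// CREATE INDEX ON :name(v);
// CREATE INDEX ON :value(v);

// Pass 2: relationships, now backed by the indexes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///0_0.csv" AS csv
MATCH (n:name {v:csv.name})
MATCH (m:value {v:csv.value})
MERGE (n)-[:kind {v:csv.kind}]->(m);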
For 100 GB of data, though, if you are starting with an empty database and are looking for speed, I would take a look at using bin/neo4j-admin import.

Cypher MERGE being too slow

I have a CSV file with about 15 million rows that I am trying to import with LOAD CSV, but it is taking too long.
When I import them with CREATE they get imported in a decent amount of time, but that creates a lot of duplicates. So I tried to use MERGE instead, but it takes a lot of time: the query had been running for more than 10 hours before I terminated it. After that I tried to import just a few columns and waited more than 30 minutes before I terminated the query. Here is the code for loading just those few columns:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber:line1.CompanyNumber,
companyName:line1.CompanyName,
uri:line1.URI
})
My question is: does MERGE usually behave like this or am I doing something wrong?
Based on the name of your input file (companydata.csv) and the columns used in the MERGE, it looks like you're misusing MERGE, unless the URI is really part of a composite primary key. You should instead:
Match on a unique key
Set the other properties
Otherwise, you'll be creating duplicate nodes, as a MERGE either finds a match on all the given criteria or creates a new node.
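To illustrate the pitfall (a hypothetical example, not from the original data), suppose an earlier row already created a node and a later row carries a different URI for the same company number:
// Existing node: (:Company {companyNumber: '123', uri: 'http://a'})
MERGE (c:Company {companyNumber: '123', uri: 'http://b'})
// No node matches ALL the listed properties, so a duplicate :Company is created.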
The query should probably look like
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI;
The unicity constraint will also create an index, speeding up lookups.
Update
There's a comment about how to address nodes which don't have a unique property, such as a CompanyInfo node related to the Company and containing extra properties: the CompanyInfo might not have a unique property, but it has a unique relationship to a unique Company, and that's enough to identify it.
A more complete query would then look like:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
SET company.companyName = line1.CompanyName,
company.uri = line1.URI
// This pattern is unique
MERGE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo)
SET companyInfo.category = line1.CompanyCategory,
companyInfo.status = line1.CompanyStatus,
companyInfo.countryOfOrigin = line1.CountryofOrigin;
Note that if companydata.csv is the first file imported, and there are no duplicates, you could simply use CREATE instead of MERGE and get rid of the SET, which looks like your initial query:
CREATE CONSTRAINT ON (c:Company) ASSERT c.companyNumber IS UNIQUE;
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
CREATE (company:Company {companyNumber: line1.CompanyNumber,
companyName: line1.CompanyName,
uri: line1.URI})
CREATE (company)-[:BASIC_INFO]->(companyInfo:CompanyInfo {category: line1.CompanyCategory,
status: line1.CompanyStatus,
countryOfOrigin: line1.CountryofOrigin});
However, I noticed that you mentioned trying CREATE and getting duplicates, so I guess you do have duplicates in the file. In that case, you can stick to the MERGE query, but you might want to adjust your deduplication strategy by replacing SET (last entry wins) by ON CREATE SET (first entry wins) or ON CREATE SET / ON MATCH SET (full control).
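For instance, the first-entry-wins variant of the query above would look like this (sketch):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///companydata.csv' AS line1
MERGE (company:Company {companyNumber: line1.CompanyNumber})
// Properties are written only when the node is first created;
// later duplicate rows leave it untouched
ON CREATE SET company.companyName = line1.CompanyName,
              company.uri = line1.URI;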

Neo4J load CSV and create relationships doesn't import all data

I am using the following LOAD CSV Cypher statement to import a CSV file with about 3.5m records, but it only imports about 3.2m, so about 300,000 records are not imported.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
MATCH (from_stop:Stop)-[r:RELATES]->(to_stop:Stop) WHERE toInt(line.route_id) in r.routes
CREATE (from_stop)-[:CONNECTS {ticket_id: ID(ticket)}]->(to_stop)
Note that the Stop nodes were already created in a separate import statement.
When I only created nodes, without creating relationships, it was able to import all the data. The same import statement works fine with a smaller set of CSV data in the same format.
I tried twice just to make sure it wasn't terminated accidentally.
Is there a node-to-relationship limit in Neo4j? Or what else could be the reason?
Neo4j version: 3.0.3; the size of the database directory is 5.31 GiB.
This is probably because whenever the MATCH does not succeed for a line, the entire query for that line (including the first CREATE) also fails.
On the other hand, the failure of an OPTIONAL MATCH would not abort the entire query for a line. Try this:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM ("file:///path/to/csvfile.csv") as line
CREATE (ticket:Ticket {id: line.transaction_hash, from_stop: toInt(line.from_stop), to_stop: toInt(line.to_stop), ride_id: toInt(line.ride_id), price: toFloat(line.price)})
OPTIONAL MATCH (from:Stop)-[r:RELATES]->(to:Stop)
WHERE toInt(line.route_id) in r.routes
FOREACH(x IN CASE WHEN from IS NULL THEN NULL ELSE [1] END |
CREATE (from)-[:CONNECTS {ticket_id: ID(ticket)}]->(to)
);
The FOREACH clause uses a somewhat roundabout technique to only CREATE the relationship if the OPTIONAL MATCH succeeded for a line.
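Stripped of the specifics, the idiom reduces to the following pattern (a generic sketch; cond, a, and b are placeholders that must be bound earlier in the query):
// Run the CREATE only when cond is true; iterating an empty list does nothing
FOREACH (x IN CASE WHEN cond THEN [1] ELSE [] END |
  CREATE (a)-[:REL]->(b)
)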