Importing 64k tuples into neo4j - csv

I'm trying to import roughly 64 thousand rows into a Neo4j graph. During the import I'm converting some attributes into nodes with relationships (using MERGE), since those values are shared by other rows as well.
This is my cypher query:
USING PERIODIC COMMIT 150
LOAD CSV WITH HEADERS FROM "http://example.com/some.csv" as csvline
MERGE (gem:Gemeente { name: csvline.GEMEENTE})
MERGE (cbs:CBS { name: csvline.CBSCODE})
CREATE (obj:Object { id: toInt(csvline.NUMMER),
prop2: toInt(csvline.PROP2)
})
CREATE (obj)-[:IN_GEMEENTE]->(gem)
CREATE (obj)-[:CBS_CODE]->(cbs)
When I manually truncate the csv-file to 10 rows, this Cypher runs perfectly: I get a nice graph with the appropriate relationships.
But when the script runs over every row in my csv-file, the server just stalls with an error/warning.
In the dashboard at port 7474 I'm just getting a plain error without any further information, while in the neo4j shell I'm getting the following:
Error occurred in server thread; nested exception is:
java.lang.OutOfMemoryError: Java heap space
So it appears I'm running out of memory. I tried reducing the commit number, but this has no effect.
Of course I have indexes on both :Gemeente(naam) and :CBS(naam).
A solution could be to split up the file into 'affordable' chunks, but that's of course a lot of work :) and not a real solution.
How can I resolve this issue?

You're probably running into the "eager" issue. It's discussed in these posts:
http://jexp.de/blog/2014/10/load-cvs-with-success/
http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
It will probably work better like this:
USING PERIODIC COMMIT 150
LOAD CSV WITH HEADERS FROM "http://example.com/some.csv" as csvline
MERGE (gem:Gemeente { name: csvline.GEMEENTE});
USING PERIODIC COMMIT 150
LOAD CSV WITH HEADERS FROM "http://example.com/some.csv" as csvline
MERGE (cbs:CBS { name: csvline.CBSCODE});
USING PERIODIC COMMIT 150
LOAD CSV WITH HEADERS FROM "http://example.com/some.csv" as csvline
MATCH
(gem:Gemeente { name: csvline.GEMEENTE}),
(cbs:CBS { name: csvline.CBSCODE})
CREATE (obj:Object { id: toInt(csvline.NUMMER),
prop2: toInt(csvline.PROP2)
})
CREATE (obj)-[:IN_GEMEENTE]->(gem)
CREATE (obj)-[:CBS_CODE]->(cbs)
You may not need to split it up as much as that, though. Also, since you'd need to load the CSV file at least twice, you might want to save it locally and run the CSV import from disk. The syntax is LOAD CSV WITH HEADERS FROM "file:///path/to/file" as csvline (I had lots of trouble finding an example when I first tried it: it's file:// followed by the path. My example is a unix path, but it can also be followed by a windows path, I believe).
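If you want to check whether a statement will hit the eager behaviour before running it, one option (assuming your server version supports EXPLAIN, i.e. 2.2+) is to prefix the statement and look for an Eager operator in the plan; Eager pulls in all rows at once and effectively defeats PERIODIC COMMIT:
// Plans only, executes nothing, so it is safe to run against the full file.
EXPLAIN
LOAD CSV WITH HEADERS FROM "http://example.com/some.csv" as csvline
MERGE (gem:Gemeente { name: csvline.GEMEENTE})
MERGE (cbs:CBS { name: csvline.CBSCODE})
CREATE (obj:Object { id: toInt(csvline.NUMMER) });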

Related

Can neo4j-admin import skip CSV rows where there is an import error?

I am trying to use neo4j-admin import to populate a neo4j database with CSV input data. According to the documentation, escaping quotation marks with \" is not supported, but my input has these and other formatting anomalies. Hence neo4j-admin import fails on my input CSV:
neo4j-admin import --mode=csv --id-type=INTEGER \
    --high-io=true \
    --ignore-missing-nodes=true \
    --ignore-duplicate-nodes=true \
    --nodes:user="import/headers_users.csv,import/users.csv"
Neo4j version: 3.5.11
Importing the contents of these files into /data/databases/graph.db:
Nodes:
:user
/var/lib/neo4j/import/headers_users.csv
/var/lib/neo4j/import/users.csv
Available resources:
Total machine memory: 15.58 GB
Free machine memory: 598.36 MB
Max heap memory : 17.78 GB
Processors: 8
Configured max memory: -2120992358.00 B
High-IO: true
IMPORT FAILED in 97ms.
Data statistics is not available.
Peak memory usage: 0.00 B
Error in input data
Caused by:ERROR in input
data source: BufferedCharSeeker[source:/var/lib/neo4j/import/users.csv, position:91935, line:866]
in field: company:string:3
for header: [user_id:ID(user), login:string, company:string, created_at:string, type:string, fake:string, deleted:string, long:string, lat:string, country_code:string, state:string, city:string, location:string]
raw field value: yyeshua
original error: At /var/lib/neo4j/import/users.csv # position 91935 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Universidad Pedagógica Nacional \"F'
My question is whether it is possible to skip or ignore poorly formatted rows of the CSV file for which neo4j-admin import throws an error. No such option seems to be available in the docs. I understand that solutions exist using LOAD CSV, and that CSVs ought to be preprocessed prior to import. Note that I am able to import the CSV successfully when I fix the formatting issues.
Perhaps it's worth describing the differences between the bulk importer and LOAD CSV.
LOAD CSV does a transactional load of your data into the database - this means you get all of the ACID goodness, etc. The side effect of this is that it's not the fastest way to load data.
The bulk importer assumes that the data is in a database-ready format: that you've dealt with duplicates, done whatever processing was needed to get it into the right form, and so on. It will just pull the data in as-is and form it as specified into the database. This is not a transactional load, and because it assumes the data being loaded is already 'database ready', it is ingested very quickly indeed.
There are other options to import data, but generally, if you need to do some sort of row skipping/correction on import, you don't really want to be doing it via the offline bulk importer. I would suggest you either do some form of pre-processing on your CSV prior to using neo4j-admin import, or look at one of the other import options available where you can dictate how to handle any poorly formatted rows.
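As a rough illustration of the pre-processing route (a sketch, assuming the only anomaly is backslash-escaped quotes and that no field legitimately contains the two-character sequence \"), you could rewrite that escaping into the doubled-quote style the importer accepts, then point --nodes:user at the fixed file:
# rewrite \" as "" (RFC 4180 quoting), which neo4j-admin import can parse
sed 's/\\"/""/g' import/users.csv > import/users_fixed.csv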

Loading a graph into Neo4j from a CSV file. Issue: "Couldn't load the external resource"

I'm using Neo4j 3.3.0 on Ubuntu, hosted (via VirtualBox) on Windows.
I've tried the Cypher query below. The data (more than 30000 rows) contains 3 columns in a text-relation-text layout. However, it says
Couldn't load the external resource at
file:///home/bharath/Desktop/neo4j/node_relations.csv
Data:
abandon, Antonym, maintain
abapical, Antonym, apical
abase, Antonym, exalt
Code:
LOAD CSV WITH HEADERS FROM
"file:///home/bharath/Desktop/neo4j/node_relations.csv" AS line
FIELDTERMINATOR ','
CREATE (t1:node1 {text: line[0] }),
(t2:node2 {text: line[2] }),
(r:rel {text: line[1]}),
(t1)-[:r]->(t2)
RETURN line
LIMIT 5;
I'm looking for some help regarding this: is there another approach, or do I have to change the query? Thanks in advance!
Try copying your file to the import directory and using:
LOAD CSV WITH HEADERS FROM "file:///node_relations.csv" AS line
(...)
The import directory for Linux installations is <neo4j-home>/import, or /var/lib/neo4j/import if you are using a Debian package.
Take a look at the file locations docs.
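As a side note, the query will likely still fail once the file is found: it uses WITH HEADERS but then accesses line[0], and with headers line is a map keyed by column name (the sample data also has no header row). In addition, [:r] creates relationships with the literal type r; plain Cypher cannot take a relationship type from a column, so one common workaround is to store the relation name as a property. A sketch along those lines (the :Term label and :REL type are illustrative names, not from the question):
LOAD CSV FROM "file:///node_relations.csv" AS line FIELDTERMINATOR ','
// line is a list here: line[0] = source word, line[1] = relation name, line[2] = target word
// trim() strips the spaces that follow the commas in the sample data
MERGE (t1:Term {text: trim(line[0])})
MERGE (t2:Term {text: trim(line[2])})
CREATE (t1)-[:REL {text: trim(line[1])}]->(t2);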

Very weird behaviour in Neo4j load CSV

What I'm trying to import is a CSV file with phone calls, representing phone numbers as nodes and each call as a relationship.
The file is separated by pipes.
I have tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and it worked as expected: one node (the outgoing number) is created for each row.
After that I could test what I've done with a simple
Match (a) return a
So the next step I've tried is creating the second node of the call (the receiver):
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I receive no answer (I'm using the browser GUI at localhost:7474/browser), and if I try to perform any query on this server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to bring it back to life is stopping the server and starting it again.
Any ideas?
It is possible that opening that big file twice causes the problem, because how big files are handled depends heavily on the operating system.
In any case, if you accidentally ran it without the 'limit 5' clause then this can happen, since you would be trying to load the 26GB in a single transaction.
Since LOAD CSV is meant for medium-sized datasets, I recommend two solutions:
- Using the neo4j-import tool (see the sketch after the query below), or
- Splitting the file up into smaller parts, and using periodic commit to prevent the out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
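For the neo4j-import route, the invocation looks roughly like this (a sketch only: the file names are made up, the tool ships with Neo4j 2.2+, and it expects header rows with :ID/:START_ID/:END_ID/:TYPE columns as described in its docs):
# offline bulk load into a fresh store; the server must be stopped first
neo4j-import --into data/graph.db \
    --nodes numbers.csv \
    --relationships calls.csv \
    --delimiter "|"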

Cassandra RPC Timeout on import from CSV

I am trying to import a CSV into a column family in Cassandra using the following syntax:
copy data (id, time, vol, speed, occupancy, status, flags) from 'C:\Users\Foo\Documents\reallybig.csv' with header = true;
The CSV file is about 700 MB, and for some reason when I run this command in cqlsh I get the following error:
"Request did not complete within rpc_timeout."
What is going wrong? There are no errors in the CSV, and it seems to me that Cassandra should suck this CSV in without a problem.
The Cassandra installation folder has a .yaml file with a setting for the rpc timeout value, "rpc_timeout_in_ms"; you could raise the value and restart Cassandra.
Another way is to cut your big CSV into multiple files and import them one by one.
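For reference, the change would look something like this in conf/cassandra.yaml (10000 is just an illustrative value; newer Cassandra versions replace this single setting with per-operation timeouts such as read_request_timeout_in_ms):
# conf/cassandra.yaml
# how long the coordinator waits for an RPC to complete before timing out
rpc_timeout_in_ms: 10000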
This actually ended up being my own misinterpretation of COPY FROM, as the CSV was about 17 million rows. In this case the best option was to follow the bulk loader example and run sstableloader. However, the answer above would certainly work if I wanted to break the CSV into 17 different CSVs, which is an option.

neo4j shell -creating edges from .csv

I am using comma-separated value files to create nodes and edges in a Neo4j database. The commands which create nodes run with no issue, but the attempt to create edges fails with this error:
Exception in thread "GC-Monitor" java.lang.OutOfMemoryError: GC
overhead limit exceeded
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "GC-Monitor"
Further, the output from the commands contained this:
neo4j-sh (?)$ using periodic commit 400 load csv with headers from 'file://localhost/tmp/vm2set3.csv' as line match (u:VM {id: line.vm_id}),(s:VNIC {id: line.set3_id}) create (u)-[:VNIC]->(s);
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$ using periodic commit 400 load csv with headers from 'file://localhost/tmp/unix2switch.csv' as line match (u:UNIX {id: line.intf_id}),(s:switch {id: line.set2a_id}) create (u)-[:cable]->(s);
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$
My shell script is:
cat /home/ES2Neo/2.1/neo4j_commands.cql | /export/neo4j-community-2.1.4/bin/neo4j-shell -path /export/neo4j-community-2.1.4/data/graph.db > /tmp/na.out
The commands are like this:
load csv WITH HEADERS from 'file://localhost/tmp/intf.csv' AS line CREATE (:UNIX {id: line.id, MAC: line.MAC ,BIA: line.BIA ,host: line.host,name: line.name});
for nodes, and
using periodic commit 400 load csv with headers from 'file://localhost/tmp/unix2switch.csv' as line match (u:UNIX {id: line.intf_id}),(s:switch {id: line.set2a_id}) create (u)-[:cable]->(s);
for edges.
The csv input files look like this:
"intf_id","set2a_id"
"100321","6724919"
"125850","6717849"
"158249","6081895"
"51329","5565380"
"57248","6680663"
"235196","6094139"
"229242","4800249"
"225630","6661742"
"183281","4760022"
Is there something I am doing wrong? Is there something in the Neo4j configuration I need to check? Thanks.
The problem is that you're running out of memory for loading the data into the database.
Take a look at this blog post, which goes into a number of details about how to load CSV data successfully.
In particular, here's the key bit from the blog post you should pay attention to.
The more memory you have the faster it will import your data.
So make sure to edit conf/neo4j-wrapper.conf and set:
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
In conf/neo4j.properties set:
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M
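It also usually helps to create indexes on the label/property pairs used in the MATCH lookups, so that each CSV row does not trigger a full label scan. For the labels and properties in your commands (2.x syntax), run these once before the relationship-creating statements:
// schema indexes speed up the per-row MATCH during the edge load
CREATE INDEX ON :UNIX(id);
CREATE INDEX ON :switch(id);
CREATE INDEX ON :VM(id);
CREATE INDEX ON :VNIC(id);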