neo4j to csv java heap space error

I'm trying to export some query results from my Neo4j DB graph to a CSV file.
I have Neo4j version 2.2.6 and I'm hitting a java.lang.OutOfMemoryError: Java heap space error while trying to get all of my nodes with some properties (~1M nodes, ~4M relationships) into a CSV on neo4j-shell via import-cypher/export-cypher. When I change the wrapper settings (wrapper.java.maxmemory, wrapper.java.minmemory to 4g), as I've seen suggested in another post, the error remains, and when I change dbms.pagecache.memory to 3g the server crashes before it even starts.

To test if it's a heap space problem you do not have to change the page cache, only the heap.
According to http://neo4j.com/docs/stable/server-performance.html the initmemory and maxmemory properties take the heap size in MB.
# neo4j-wrapper.conf
wrapper.java.initmemory=4000
wrapper.java.maxmemory=4000
In addition to the heap settings, you can add USING PERIODIC COMMIT to your LOAD CSV query. This commits in batches to avoid memory errors: http://neo4j.com/docs/stable/query-periodic-commit.html
USING PERIODIC COMMIT 1000
LOAD CSV FROM ...
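A minimal sketch of what that could look like (the file name, label, and property are placeholders, not taken from the question):
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CREATE (:Node {name: row.name});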

Related

mariadb c-connector bind / execute mess up memory allocation

I'm using the MariaDB C connector with prepare, bind, and execute. It usually works, but one case ends with "corrupted unsorted chunks" and a core dump when freeing the bind buffer. I suspect the whole malloc organisation is messed up after calling mysql_stmt_execute(). My test program MysqlDynamic.c shows:
the problem is connected only to the x509cert variable bound by bnd[9]
freeing memory only fails if bnd[9].is_null = 0; if is_null is set, execute ends normally
freeing memory (using FreeStmt()) after bind and before execute ends normally
printing bnd[9].buffer before execute shows the (void*) points to the correct string buffer
the behavior is the same whether bnd[9].buffer_length is set to STMT_INDICATOR_NTS or to strlen()
other similar bindings (picture, bnd[10]) do not lead to corrupted memory and a core dump
I defined a C structure, test, for the test data in my test program MysqlDynamic.c, which is bound into a MYSQL_BIND structure.
The bindings for x509cert (a string buffer), see bindInsTest():
bnd[9].buffer_type = MYSQL_TYPE_STRING;
bnd[9].buffer_length = STMT_INDICATOR_NTS;
bnd[9].is_null = &para->x509certI;
bnd[9].buffer = (void*) para->x509cert;
Please get the details from the source file MysqlDynamic.c. Please adapt the defines in the source to your environment, verify the content, and run it. You will find compile info in the source code. MysqlDynamic -c will create the table, MysqlDynamic -i will insert 3 records each run, and MysqlDynamic -d will drop the table again.
MysqlDynamic -vc shows:
session set autocommit to <0>
connection id: 175
mariadb server ver:<100408>, client ver:<100408>
connected on localhost to db test by testA
>> if program get stuck - table is locked
table t_test created
mysql connection closed
pgm ended normaly
MysqlDynamic -i shows:
ins2: BufPara <92> name<master> stamp<> epoch<1651313806000>
cert is cert<(nil)> buf<(nil)> null<1>
picure is pic<0x5596a0f0c220> buf<0x5596a0f0c220> null<0> length<172>
ins1: BufPara <91> name<> stamp<2020-04-30> epoch<1650707701123>
cert is cert<0x5596a0f181d0> buf<0x5596a0f181d0> null<0>
picure is pic<(nil)> buf<(nil)> null<1> length<0>
ins0: BufPara <90> name<gugus> stamp<1988-10-12T18:43:36> epoch<922337203685477580>
cert is cert<(nil)> buf<(nil)> null<1>
picure is pic<(nil)> buf<(nil)> null<1> length<0>
free(): corrupted unsorted chunks
Aborted (core dumped)
Checking the t_test table content shows that all records are inserted as expected.
You can disable loading of x509cert and/or picture by commenting out the defines on lines 57/58; the program then ends normally. You can also comment out line 208; the buffers are then indicated as NULL.
Questions:
Is there a generic coding mistake in the program causing this behavior?
Can you run the program in your environment without a core dump? I'm currently using version 10.04.08.
Any improvement to the code is welcome.

Can neo4j-admin import skip CSV rows where there is an import error?

I am trying to use neo4j-admin import to populate a Neo4j database with CSV input data. According to the documentation, escaping quotation marks with \" is not supported, but my input has these and other formatting anomalies. Hence neo4j-admin import obviously fails for my input CSV:
> neo4j-admin import --mode=csv --id-type=INTEGER \
> --high-io=true \
> --ignore-missing-nodes=true \
> --ignore-duplicate-nodes=true \
> --nodes:user="import/headers_users.csv,import/users.csv"
Neo4j version: 3.5.11
Importing the contents of these files into /data/databases/graph.db:
Nodes:
:user
/var/lib/neo4j/import/headers_users.csv
/var/lib/neo4j/import/users.csv
Available resources:
Total machine memory: 15.58 GB
Free machine memory: 598.36 MB
Max heap memory : 17.78 GB
Processors: 8
Configured max memory: -2120992358.00 B
High-IO: true
IMPORT FAILED in 97ms.
Data statistics is not available.
Peak memory usage: 0.00 B
Error in input data
Caused by:ERROR in input
data source: BufferedCharSeeker[source:/var/lib/neo4j/import/users.csv, position:91935, line:866]
in field: company:string:3
for header: [user_id:ID(user), login:string, company:string, created_at:string, type:string, fake:string, deleted:string, long:string, lat:string, country_code:string, state:string, city:string, location:string]
raw field value: yyeshua
original error: At /var/lib/neo4j/import/users.csv # position 91935 - there's a field starting with a quote and whereas it ends that quote there seems to be characters in that field after that ending quote. That isn't supported. This is what I read: 'Universidad Pedagógica Nacional \"F'
My question is whether it is possible to skip or ignore poorly formatted rows of the CSV file for which neo4j-admin import throws an error. No such option seems to be available in the docs. I understand that solutions exist using LOAD CSV and that CSVs ought to be preprocessed prior to import. Note that I am able to import the CSV successfully when I fix the formatting issues.
Perhaps it's worth describing the differences between the bulk importer and LOAD CSV.
LOAD CSV does a transactional load of your data into the database - this means you get all of the ACID goodness, etc. The side effect of this is that it's not the fastest way to load data.
The bulk importer assumes that the data is in a database-ready format, that you've dealt with duplicates and any processing you needed to get it into the right form, etc., and will just pull the data in as is and shape it as specified into the database. This is not a transactional load of the data, and because it assumes the data being loaded is already 'database ready', it is ingested very quickly indeed.
There are other options to import data, but generally if you need to do some sort of row skipping/correction on import, you don't really want to be doing it via the offline bulk importer. I would suggest you either do some form of pre-processing on your CSV prior to using neo4j-admin import, or look at one of the other import options available where you can dictate how to handle any poorly formatted rows.
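For the LOAD CSV route, a minimal sketch (the file name and columns are assumed from the header in the question, and the filter is just an illustration): rows you can detect as bad in Cypher, such as rows missing an id, can be dropped with a WHERE clause, though CSV-level parsing errors like the unescaped quotes still have to be fixed before the file is parsed:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row
WITH row WHERE row.user_id IS NOT NULL
MERGE (u:user {user_id: toInteger(row.user_id)})
SET u.login = row.login, u.company = row.company;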

Very weird behaviour in Neo4j load CSV

What I'm trying to import is a CSV file with phone calls, representing each phone number as a node and each call as an arrow.
The file is separated by pipes.
I have tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and it worked as expected: one node (the outgoing number) is created for each row.
After that I could test what I've done with a simple
Match (a) return a
So the next step I've tried is creating the second node of the call (the receiver):
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I get no response from the operation (I'm using the browser GUI at localhost:7474/browser), and if I try to perform any query on this server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to bring it back to life is stopping the server and starting it again.
Any ideas?
It is possible that opening that big file twice causes the problem, because handling big files depends heavily on the operating system.
Anyway, if you accidentally ran it without the LIMIT 5 clause then it can happen, since you would be trying to load the 26 GB in a single transaction.
Since LOAD CSV is meant for medium-sized datasets, I recommend two solutions:
- Use the neo4j-import tool, or
- Split the file up into smaller parts, and use periodic commit to prevent the out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
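For instance, a sketch reusing the query from the question, batched with periodic commit (the batch size is only an example):
USING PERIODIC COMMIT 100000
LOAD CSV FROM 'file:///com.csv' AS line FIELDTERMINATOR '|'
MERGE (a:line {number: COALESCE(line[1], "")})
MERGE (b:line {number: COALESCE(line[2], "")});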

neo4j shell -creating edges from .csv

I am using a comma-separated-value file to create nodes and edges in a Neo4j database. The commands that create nodes run with no issue. The attempt to create edges fails with this error:
Exception in thread "GC-Monitor" java.lang.OutOfMemoryError: GC
overhead limit exceeded
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread "GC-Monitor"
Further, the output from the commands contained this:
neo4j-sh (?)$ using periodic commit 400 load csv with headers from 'file://localhost/tmp/vm2set3.csv' as line match (u:VM {id: line.vm_id}),(s:VNIC {id: line.set3_id}) create (u)-[:VNIC]->(s);
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$ using periodic commit 400 load csv with headers from 'file://localhost/tmp/unix2switch.csv' as line match (u:UNIX {id: line.intf_id}),(s:switch {id: line.set2a_id}) create (u)-[:cable]->(s);
SystemException: Kernel has encountered some problem, please perform neccesary action (tx recovery/restart)
neo4j-sh (?)$
My shell script is:
cat /home/ES2Neo/2.1/neo4j_commands.cql | /export/neo4j-community-2.1.4/bin/neo4j-shell -path /export/neo4j-community-2.1.4/data/graph.db > /tmp/na.out
The commands are like this:
load csv WITH HEADERS from 'file://localhost/tmp/intf.csv' AS line CREATE (:UNIX {id: line.id, MAC: line.MAC ,BIA: line.BIA ,host: line.host,name: line.name});
for nodes, and
using periodic commit 400 load csv with headers from 'file://localhost/tmp/unix2switch.csv' as line match (u:UNIX {id: line.intf_id}),(s:switch {id: line.set2a_id}) create (u)-[:cable]->(s);
for edges.
The csv input files look like this:
"intf_id","set2a_id"
"100321","6724919"
"125850","6717849"
"158249","6081895"
"51329","5565380"
"57248","6680663"
"235196","6094139"
"229242","4800249"
"225630","6661742"
"183281","4760022"
Is there something I am doing wrong? Is there something in Neo4j configuration I need to check? Thanks.
The problem is that you're running out of memory for loading the data into the database.
Take a look at this blog post which goes into a number of details about how to load CSV data in successfully.
In particular, here's the key bit from the blog post you should pay attention to.
The more memory you have the faster it will import your data.
So make sure to edit conf/neo4j-wrapper.conf and set:
# Uncomment these lines to set specific initial and maximum
# heap size in MB.
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
In conf/neo4j.properties set:
# Default values for the low-level graph engine
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=100M
neostore.propertystore.db.strings.mapped_memory=100M
neostore.propertystore.db.arrays.mapped_memory=0M

loading a subset of a file using Pig

I am playing with the Hortonworks sandbox to learn Hadoop etc.
I am trying to load a file on a single-machine "cluster":
A = LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t')
AS (ngram:chararray, year:int, count1:int, count2:int, count3:int);
B = LIMIT A 10;
Dump B;
Unfortunately the file is slightly too big for the RAM that I have on my VM.
I am wondering if it's possible to LOAD a subset of the .csv file?
Is something like this possible:
LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t') LOAD ONLY FIRST 100MB?
Why exactly do you need to load the entire file into RAM? You should be able to run the whole file regardless of how much memory you have. Try adding this to the top of your script:
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
Your pig script will now read as:
--avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
set io.sort.mb 10;
A = LOAD 'googlebooks-eng-all-3gram-20090715-0.csv' using PigStorage('\t')
AS (ngram:chararray, year:int, count1:int, count2:int, count3:int);
B = LIMIT A 10;
Dump B;
Assuming you're just getting an OutOfMemoryError when you are running your script, this should solve your problem.
What you describe is not possible within Hadoop itself; however, you can achieve your objective in the OS shell rather than the Hadoop shell. In a Linux shell you can write a script that reads the first 100 MB from the source file, saves it to the local file system, and then uses it as the Pig source.
#Script .sh
# Read file and save 100 MB content in file system
# Create N files of 100MB each
# write a pig_script to process your data as shown below
# Launch Pig script and pass the N files as parameter as below:
pig -f pigscript.pig -param inputparm=/user/currentuser/File1.File2,..,FileN
#pigscript.pig
A = LOAD '$inputparm' using PigStorage('\t') AS (ngram:chararray, year:int, count1:int, count2:int, count3:int);
B = LIMIT A 10;
Dump B;
In the general case, multiple files can be passed in the Hadoop shell by name, so you can call out the file names from the Hadoop shell as well.
The key here is that Pig has no default way to read the first x MB of a file and process it; it is all or nothing, so you may need to find other ways to achieve your objective.