Neo4j: Import data from CSV (PostgreSQL) does not complete

I want to move one table with a self-reference from PostgreSQL to Neo4j.
PostgreSQL:
COPY (SELECT * FROM "public".empbase) TO '/tmp/empbase.csv' WITH CSV header;
Result:
$ cat /tmp/empbase.csv | head
e_id,e_name,e_bossid
1,emp_no_1,
2,emp_no_2,
3,emp_no_3,
4,emp_no_4,
5,emp_no_5,3
6,emp_no_6,2
7,emp_no_7,3
8,emp_no_8,1
9,emp_no_9,4
Size:
$ du -h /tmp/empbase.csv
631M /tmp/empbase.csv
I import data to neo4j with:
neo4j-sh (?)$ USING PERIODIC COMMIT 1000
> LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
> CREATE (:EmpBase:_EmpBase { neo_eb_id: toInt(row.e_id),
> neo_eb_bossID: toInt(row.e_bossid),
> neo_eb_name: row.e_name});
and this works fine:
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 20505764
Properties set: 61517288
Labels added: 41011528
846284 ms
The Neo4j console says:
Location:
/home/neo4j/data/graph.db
Size:
5.54 GiB
But then I want to add the relationship that each emp has a boss, i.e. a simple emp->bossid SELF reference.
Now I do it like this:
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
MATCH (employee:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase:_EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);
But this runs for 5-6 hours and then breaks with system failures; it freezes the system.
I think this might be terribly wrong.
1. Am I doing something wrong, or is it a bug in Neo4j?
2. Why does a 631 MB CSV turn into 5.5 GB?
EDIT1:
$ du -h /home/neo4j/data/
20K /home/neo4j/data/graph.db/index
899M /home/neo4j/data/graph.db/schema/index/lucene/1
899M /home/neo4j/data/graph.db/schema/index/lucene
899M /home/neo4j/data/graph.db/schema/index
27M /home/neo4j/data/graph.db/schema/label/lucene
27M /home/neo4j/data/graph.db/schema/label
925M /home/neo4j/data/graph.db/schema
6,5G /home/neo4j/data/graph.db
6,5G /home/neo4j/data/
SOLUTION:
Wait until :schema in the console says ONLINE, not POPULATING
Change the log size in the config file
Add USING PERIODIC COMMIT 1000 to the second CSV import
Index on only one label

Only match on one Label: MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
Did you create the index: CREATE INDEX ON :EmpBase(neo_eb_id);
then wait for the index to come online (check :schema in the browser)
OR, if it is a unique id: CREATE CONSTRAINT ON (e:EmpBase) ASSERT e.neo_eb_id IS UNIQUE;
Otherwise your match will scan all nodes in the database for each MATCH.
For your second question, I think it's the transaction log files,
you can limit their size in conf/neo4j.properties with
keep_logical_logs=100M size
The actual nodes and properties files shouldn't be that large. Also you don't have to store the boss-id in the database. That's actually handled by the relationship :)
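Putting those pieces together, the relationship pass could look roughly like this. This is only a sketch: it assumes the :EmpBase(neo_eb_id) index above is already ONLINE and that rows with an empty e_bossid should simply be skipped.
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:/tmp/empbase.csv" AS row
// skip top-level employees with no boss id (covers both null and empty-string fields)
WITH row WHERE row.e_bossid IS NOT NULL AND row.e_bossid <> ""
MATCH (employee:EmpBase {neo_eb_id: toInt(row.e_id)})
MATCH (manager:EmpBase {neo_eb_id: toInt(row.e_bossid)})
MERGE (employee)-[:REPORTS_TO]->(manager);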

Related

Query of load csv not completing even after 12 hours

I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns.
This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE= line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);
I tried:
Using LIMIT 0 before running the query, as per Michael Hunger's suggestion in a post about 'Loading Large datasets'.
Used a single MERGE per statement (this is the first MERGE and there are 4 other merges to be used), as suggested by Michael again in another post.
Tried CALL apoc.periodic.iterate and apoc.cypher.parallel, but they don't work with LOAD CSV (they seem to work only with MERGE and CREATE queries without LOAD CSV).
I get following error with CALL apoc.periodic.iterate(""):
Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))
Increased the max heap size to 16G, as my laptop has 16GB RAM. By the way, I am finding it difficult to write this post because I tried running the query again with PROFILE and it has now been running for an hour.
I need help loading this 5-million-row dataset. Any help would be highly appreciated. Thanks in advance! I am using Neo4j 3.5.1 on a PC.
MOST IMPORTANT: Create Index/Constraint on the key property.
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;
Don't set the max heap size to the full system RAM; set it to about 50%.
Try ON CREATE SET instead of SET.
You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine.
Importantly, if you are using USING PERIODIC COMMIT and the query is not finishing or is running out of memory, it is likely because of DISTINCT. Avoid DISTINCT, as duplicate transactions will be handled by MERGE anyway.
NOTE: if you use apoc.periodic.iterate to MERGE nodes/relationships with parallel=true, it can fail with a NullPointerException, so use it carefully.
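If you do go the apoc.periodic.iterate route, note that both statements are passed as strings, so the double quotes around the file URL need a different outer quote character (or escaping); the reported Invalid input 'f' error looks typical of that kind of quoting problem. A rough sketch with parallel disabled, as per the note above, and with the property list shortened:
CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line RETURN line',
  'MERGE (t:Transaction {id: line.TRANSACTION_ID}) ON CREATE SET t.base_currency = line.BASE_CURRENCY, t.cd_code = line.CREDIT_DEBIT_CODE',
  {batchSize: 1000, parallel: false});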
Questioner edit: removing DISTINCT on the third line for the Transaction node and re-running the query worked!
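For reference, a non-APOC version that applies the advice above (constraint in place, no DISTINCT, ON CREATE SET) might look roughly like this; it is a sketch with the property list trimmed, not the questioner's exact final query:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id: line.TRANSACTION_ID})
ON CREATE SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
              transaction.base_currency = line.BASE_CURRENCY,
              transaction.year = toInteger(date[2]),
              transaction.month = toInteger(date[1]),
              transaction.day = toInteger(date[0]);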

Neo4j: Relationships from CSV imported extremely slow

I have some issues importing a large set of relationships (2M records) from a CSV file.
I'm running Neo4j 2.1.7 on Mac OSX (10.9.5), 16GB RAM.
The file has the following schema:
user_id, shop_id
1,230
1,458
1,783
2,942
2,123
etc.
As mentioned above - it contains about 2M records (relationships).
Here is the query I'm running using the browser UI (I was also trying to do the same with a REST call):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
This query takes ages to run, about 800 seconds. I do have indexes on :User(id) and :Shop(id). Created them with:
CREATE INDEX ON :User(id)
CREATE INDEX ON :Shop(id)
Any ideas on how to increase the performance?
Thanks
Remove the space before shop_id
try to run:
LOAD CSV WITH HEADERS FROM "file:test.csv" AS r
return r.user_id, r.shop_id limit 10;
to see if it is loaded correctly. On your original data r.shop_id is null, because the column name is actually " shop_id" with a leading space.
Also check whether you stored the ids as numeric values in the first place; if you did, you have to use toInt(r.shop_id).
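If the node ids really were stored as integers, the conversion would slot into the original statement along these lines (a sketch of the same query, not a new one):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
// the CSV values come in as strings, so convert them to match the numeric node ids
MATCH (user:User {id: toInt(relation.user_id)})
MATCH (shop:Shop {id: toInt(relation.shop_id)})
MERGE (user)-[:LIKES]->(shop);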
Try to profile your statement in Neo4j Browser (2.2.) or in Neo4j-Shell.
Remove the PERIODIC COMMIT for that purpose and limit the rows:
PROFILE
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
WITH relation LIMIT 10000
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)

Create Neo4j database using CSV files

I have 2 CSV files which I want to convert into a Neo4j database. They look like this:
first file:
name,enzyme
Aminomonas paucivorans,M1.Apa12260I
Aminomonas paucivorans,M2.Apa12260I
Bacillus cellulosilyticus,M1.BceNI
Bacillus cellulosilyticus,M2.BceNI
second file
name,motif
Aminomonas paucivorans,GGAGNNNNNGGC
Aminomonas paucivorans,GGAGNNNNNGGC
Bacillus cellulosilyticus,CCCNNNNNCTC
As you can see, the common factor is the name of the organism. Each Organism will have a few Enzymes and each Enzyme will have 1 Motif. Motifs can be the same between enzymes. I used the following statements to create my database:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file1.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(e:Enzyme { name: csvLine.enzyme})
CREATE (o)-[:has_enzyme]->(e) //or maybe CREATE UNIQUE?
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file2.csv" AS csvLine
MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})
CREATE (o)-[:has_motif]->(m) //or maybe CREATE UNIQUE?
This gives me errors on the very first line at USING PERIODIC COMMIT, which says Invalid input 'S': expected. If I get rid of it, the next error I get is WITH is required between CREATE and LOAD CSV (line 6, column 1)
"MATCH (o:Organism { name: csvLine.name}),(m:Motif { name: csvLine.motif})". I googled this issue, which led me to this answer. I tried the answer given there (refreshing the browser cache) but the problem persists. What am I doing wrong here? Is the query correct? Is there another solution to this issue? Any help will be greatly appreciated.
Your queries have two issues at once:
You can't refer to a local file just with "file1.csv", because neo4j is expecting a URL
You're using MATCH in cases where the data may not originally exist; you need to use MERGE there instead, which basically acts like the create unique comment you added.
I don't know what the source of your specific error message is, but as written it doesn't look like these queries could possibly work. Here are your queries reformulated so that they will work (I tested them on my machine with your CSV samples):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file1.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (e:Enzyme { name: csvLine.enzyme})
MERGE (o)-[:has_enzyme]->(e);
Notice the three MERGE statements here (MERGE basically does MATCH plus CREATE if the pattern doesn't already exist), and the fact that I've used a file: URL.
The second query gets formulated basically the same way:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/home/myuser/tmp/file2.csv" AS csvLine
MERGE (o:Organism { name: coalesce(csvLine.name, "No Name")})
MERGE (m:Motif { name: csvLine.motif})
MERGE (o)-[:has_motif]->(m);
EDIT: I added coalesce() on the Organism's name property. If you have null values for name in the CSV, the query would otherwise fail. coalesce guarantees that if csvLine.name is null, you'll get back "No Name" instead.
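One thing the answer doesn't mention, but which the other threads here rely on: for files of any real size, these MERGEs benefit from uniqueness constraints (or indexes) on the merged properties, created before the load and allowed to come online first. A sketch, assuming Neo4j 2.x syntax and that the name/motif values are meant to be unique:
CREATE CONSTRAINT ON (o:Organism) ASSERT o.name IS UNIQUE;
CREATE CONSTRAINT ON (e:Enzyme) ASSERT e.name IS UNIQUE;
CREATE CONSTRAINT ON (m:Motif) ASSERT m.name IS UNIQUE;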

Most effective way to push data from a SQL Server database into a Greenplum database?

Greenplum Database version:
PostgreSQL 8.2.15 (Greenplum Database 4.2.3.0 build 1)
SQL Server Database version:
Microsoft SQL Server 2008 R2 (SP1)
Our current approach:
1) Export each table to a flat file from SQL Server
2) Load the data into Greenplum with pgAdmin III using PSQL Console's psql.exe utility
Benefits...
Speed: OK, but is there anything faster? We load millions of rows of data in minutes
Automation: OK, we call this utility from an SSIS package using a Shell script in VB
Pitfalls...
Reliability: ETL is dependent on the file server to hold the flat files
Security: Lots of potentially sensitive data on the file server
Error handling: It's a problem. psql.exe never raises an error that we can catch even if it does error out and loads no data or a partial file
What else we have tried...
.Net Providers\Odbc Data Provider: We have configured a System DSN using DataDirect 6.0 Greenplum Wire Protocol. Good performance for a DELETE. Dog awful slow for an INSERT.
For reference, this is the aforementioned VB script in SSIS...
Public Sub Main()
Dim v_shell
Dim v_psql As String
v_psql = "C:\Program Files\pgAdmin III\1.10\psql.exe -d "MyGPDatabase" -h "MyGPHost" -p "5432" -U "MyServiceAccount" -f \\MyFileLocation\SSIS_load\sql_files\load_MyTable.sql"
v_shell = Shell(v_psql, AppWinStyle.NormalFocus, True)
End Sub
This is the contents of the "load_MyTable.sql" file...
\copy MyTable from '\\MyFileLocation\SSIS_load\txt_files\MyTable.txt' with delimiter as ';' csv header quote as '"'
If you're getting your data load done in minutes, then the current method is probably good enough. However, if you find yourself having to load larger volumes of data (terabyte scale for instance), the usual preferred method for bulk-loading into Greenplum is via gpfdist and corresponding EXTERNAL TABLE definitions. gpload is a decent wrapper that provides abstraction over much of this process and is driven by YAML control files. The general idea is that gpfdist instance(s) are spun up at the location(s) where your data is staged, preferably as CSV text files, and then the EXTERNAL TABLE definition within Greenplum is made aware of the URIs for the gpfdist instances. From the admin guide, a sample definition of such an external table could look like this:
CREATE READABLE EXTERNAL TABLE students (
name varchar(20), address varchar(30), age int)
LOCATION ('gpfdist://<host>:<portNum>/file/path/')
FORMAT 'CUSTOM' (formatter=fixedwidth_in,
name=20, address=30, age=4,
preserve_blanks='on',null='NULL');
The above example expects to read text files whose fields from left to right are a 20-character (at most) string, a 30-character string, and an integer. To actually load this data into a staging table inside GP:
CREATE TABLE staging_table AS SELECT * FROM students;
For large volumes of data, this should be the most efficient method since all segment hosts are engaged in the parallel load. Do keep in mind that the simplistic approach above will probably result in a randomly distributed table, which may not be desirable. You'd have to customize your table definitions to specify a distribution key.

MySQL LOAD DATA INFILE slows down after initial insert using raw sql in django

I'm using the following custom handler for doing bulk insert using raw sql in django with a MySQLdb backend with innodb tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)
    subject = Subject.objects.get(id=subject_pi)

    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()
        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)
        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)
        transaction.commit_unless_managed()

    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering if it is normal for load times to get longer as files of constant size (same number of rows) are added, and whether there is any way to improve the performance of these bulk inserts.
I found the main issue with bulk inserting to my innodb table was a mysql innodb setting I had overlooked.
The setting innodb_buffer_pool_size defaults to 8M for my version of MySQL and was causing a huge slowdown as my table size grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to the articles is 70 to 80 percent of the memory if using a dedicated mysql server. After increasing the buffer pool size, my inserts went from an hour+ to less than 10 minutes with no other changes.
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks #f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
It turns out this is related to using Ubuntu and this bug.
1) Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
2) Run sudo aa-complain /usr/sbin/mysqld
3) Try to load a file from your designated loading directory: load data infile '/var/opt/mysql-load/import.csv' into table ...
4) Run sudo aa-logprof. aa-logprof will identify the access violation triggered by the 'load data infile ...' query, and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole apparmor process.)
5) Run sudo aa-enforce /usr/sbin/mysqld
6) Try to load your file again. It should work this time.