LOAD CSV query not completing even after 12 hours - csv

I have been using Neo4j for quite a while now. I ran this query successfully before my computer crashed 7 days ago, but somehow I am unable to run it now. I need to create a graph database out of a CSV of bank transactions. The original dataset has around 5 million rows and around 60 columns.
This is the query I used, starting from the 'Export CSV from real data' demo by Nicole White:
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH DISTINCT line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id:line.TRANSACTION_ID})
SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region = line.REGION,
transaction.scope = line.SCOPE,
transaction.COUNTRY_RISK_SCORE = line.COUNTRY_RISK_SCORE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);
I tried:
Using LIMIT 0 before running the query, as per Michael Hunger's suggestion in a post about 'Loading Large datasets'.
Using a single MERGE per statement (this is the first MERGE; there are 4 other MERGEs still to come), as suggested by Michael in another post.
Trying CALL apoc.periodic.iterate and apoc.cypher.parallel, but they don't seem to work with LOAD CSV (they seem to work only with MERGE and CREATE queries that don't involve LOAD CSV).
I get the following error with CALL apoc.periodic.iterate(""):
Neo.ClientError.Statement.SyntaxError: Invalid input 'f': expected whitespace, '.', node labels, '[', "=~", IN, STARTS, ENDS, CONTAINS, IS, '^', '*', '/', '%', '+', '-', '=', '~', "<>", "!=", '<', '>', "<=", ">=", AND, XOR, OR, ',' or ')' (line 2, column 29 (offset: 57))
Increased the max heap size to 16G, as my laptop has 16GB of RAM. By the way, I am finding it hard to finish this post: I tried running the query again just now with PROFILE and it has been running for an hour.
I need help loading this 5-million-row dataset. Any help would be highly appreciated. Thanks in advance! I am using Neo4j 3.5.1 on a PC.

MOST IMPORTANT: Create an index/constraint on the key property.
CREATE CONSTRAINT ON (t:Transaction) ASSERT t.id IS UNIQUE;
Don't set the max heap size to the full system RAM; set it to about 50%.
Try ON CREATE SET instead of SET.
You can also use apoc.periodic.iterate to load the data, but USING PERIODIC COMMIT is also fine.
Importantly, if you are using USING PERIODIC COMMIT and the query is not finishing or is running out of memory, it is likely because of the DISTINCT. Avoid DISTINCT; duplicate transactions will be handled by MERGE anyway (see the sketch after these notes).
NOTE: if you use apoc.periodic.iterate to MERGE nodes/relationships with parallel:true, it fails with a NullPointerException, so use it carefully.
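Putting those points together, here is a minimal sketch of a revised load, assuming the unique constraint above is in place (property names are taken from the original query; the remaining SET lines are omitted for brevity):

USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
WITH line, SPLIT(line.VALUE_DATE, "/") AS date
WHERE line.TRANSACTION_ID IS NOT NULL AND line.VALUE_DATE IS NOT NULL
MERGE (transaction:Transaction {id: line.TRANSACTION_ID})
ON CREATE SET transaction.base_currency_amount = toInteger(line.AMOUNT_IN_BASE_CURRENCY),
              transaction.base_currency = line.BASE_CURRENCY,
              transaction.year = toInteger(date[2]),
              transaction.month = toInteger(date[1]),
              transaction.day = toInteger(date[0]);

If you do want to try apoc.periodic.iterate instead, it can wrap LOAD CSV as the outer (iterate) query. A sketch, with parallel: false because of the MERGE issue in the NOTE above (the 'Invalid input' error shown in the question usually points to a quoting problem: the two inner queries must be passed as strings whose quotes don't collide with the quotes used inside them):

CALL apoc.periodic.iterate(
  'LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line RETURN line',
  'MERGE (t:Transaction {id: line.TRANSACTION_ID}) ON CREATE SET t.base_currency = line.BASE_CURRENCY',
  {batchSize: 1000, parallel: false});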
Questioner edit: Removing DISTINCT on the 3rd line for the Transaction node and re-running the query worked!

Related

Neo4j: Relationships from CSV imported extremely slow

I have some issues importing a large set of relationships (2M records) from a CSV file.
I'm running Neo4j 2.1.7 on Mac OSX (10.9.5), 16GB RAM.
The file has the following schema:
user_id, shop_id
1,230
1,458
1,783
2,942
2,123
etc.
As mentioned above - it contains about 2M records (relationships).
Here is the query I'm running using the browser UI (I was also trying to do the same with a REST call):
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
This query takes ages to run, about 800 seconds. I do have indexes on :User(id) and :Shop(id). Created them with:
CREATE INDEX ON :User(id)
CREATE INDEX ON :Shop(id)
Any ideas on how to increase the performance?
Thanks
Remove the space before shop_id in the header row.
try to run:
LOAD CSV WITH HEADERS FROM "file:test.csv" AS r
return r.user_id, r.shop_id limit 10;
to see if it is loaded correctly. With your original data, r.shop_id is null because the column name is actually " shop_id" (with a leading space).
Also make sure that you didn't store the ids as numeric values in the first place; if you did, you have to use toInt(r.shop_id) (see the sketch after the PROFILE example).
Try profiling your statement in the Neo4j Browser (2.2) or in Neo4j-Shell.
Remove the PERIODIC COMMIT for that purpose and limit the rows:
PROFILE
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
WITH relation LIMIT 10000
MATCH (user:User {id: relation.user_id})
MATCH (shop:Shop {id: relation.shop_id})
MERGE (user)-[:LIKES]->(shop)
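If the node ids were in fact stored as numbers, a minimal sketch of the relationship load with that conversion applied (this assumes the header has been fixed to user_id,shop_id with no stray space):

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://path/to/my/file.csv" AS relation
MATCH (user:User {id: toInt(relation.user_id)})
MATCH (shop:Shop {id: toInt(relation.shop_id)})
MERGE (user)-[:LIKES]->(shop)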

Wordnet MySQL statement doesn't complete

I'm using the Wordnet SQL database from here: http://wnsqlbuilder.sourceforge.net
It's all built fine and users with appropriate privileges have been set.
I'm trying to find synonyms of words and have tried to use the two example statements at the bottom of this page: http://wnsqlbuilder.sourceforge.net/sql-links.html
SELECT synsetid,dest.lemma,SUBSTRING(src.definition FROM 1 FOR 60) FROM wordsXsensesXsynsets AS src INNER JOIN wordsXsensesXsynsets AS dest USING(synsetid) WHERE src.lemma = 'option' AND dest.lemma <> 'option'
SELECT synsetid,lemma,SUBSTRING(definition FROM 1 FOR 60) FROM wordsXsensesXsynsets WHERE synsetid IN ( SELECT synsetid FROM wordsXsensesXsynsets WHERE lemma = 'option') AND lemma <> 'option' ORDER BY synsetid
However, they never complete, at least not in any reasonable amount of time, and I have had to cancel all of the queries. All other queries seem to work fine, and when I break up the second SQL example, I can get the individual parts to work and complete in reasonable time (about 0.40 seconds).
When I try to run the full statement, however, the MySQL command line client just hangs.
Is there a problem with this syntax? What is causing it to take so long?
EDIT: output of EXPLAIN SELECT ... and of EXPLAIN EXTENDED ...; SHOW WARNINGS; (screenshots not reproduced here).
I did some more digging into the statements involved and found that the problem was in the IN clause.
MySQL re-evaluates that subquery for every single row in the table. This is the cause of the hang, as it had to run through hundreds of thousands of records.
My remedy was to split the command into two separate database calls: first getting the synset ids, and then dynamically creating a bound SQL string to look up the words in those synsets.
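A minimal sketch of that two-step approach (the ids in the second statement are placeholders for whatever the first statement returns):

-- Step 1: fetch the synset ids for the source word
SELECT synsetid FROM wordsXsensesXsynsets WHERE lemma = 'option';

-- Step 2: build the second query with the ids from step 1 bound in
SELECT synsetid, lemma, SUBSTRING(definition FROM 1 FOR 60)
FROM wordsXsensesXsynsets
WHERE synsetid IN (123456789, 123456790)  -- placeholder ids from step 1
  AND lemma <> 'option'
ORDER BY synsetid;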

TOS DI variable in tMySqlInput

I'm relatively new to Talend Open Studio for Data Integration (TOS DI). I managed to run simple queries in MySQL with the tMySqlInput component. However, today I have a more ambitious query and I am having some trouble making it work.
I need a query where each result row depends on the previous row. I got it working in MySQL Workbench but not in Talend. Example: the delay between two dates.
Here is the query:
SET @var = NULL;
SELECT id, start_date, end_date, @var precedent, UNIX_TIMESTAMP(TIMEDIFF(start_date, @var)) AS diff, @var := start_date AS temp
FROM ma_table
ORDER BY start_date;
and the errors are:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SELECT id, start_date, end_date, id_process_type, @var precedent, UNIX_TIMESTAMP' at line 2
...Not very useful. Is this syntax forbidden in Talend? Are there other ways to do such queries in Talend (for example, the delay between two dates), or maybe another component? I am currently looking at tMysqlRow.
Thanks for any ideas!
As @Gabriele B mentions, you might want to consider doing this in a more "Talend" way.
I'd personally make use of the tMemorizeRows component to do this though.
To simplify this I've gone and made the start and end dates as integers but it should be trivial to handle this using proper dates.
If we have some data that shows the start and end date of a process and we want to work out the delay between finishing the last one and starting the next process we can read all of the data in and then use the tMemorizeRows component to remember the last 2 rows:
We then access the memorized data by array index. So here we go to a tJavaRow component that has an extra output column, startDelay, and we calculate it by subtracting the last process's end date from the current process's start date:
output_row.id = input_row.id;
output_row.startdate = input_row.startdate;
output_row.enddate = input_row.enddate;
// index 0 of the memorized arrays is the current row, index 1 is the previous row
if (id_tMemorizeRows_1[0] != 1) {
    output_row.startDelay = startdate_tMemorizeRows_1[0] - enddate_tMemorizeRows_1[1];
} else {
    // first row: there is no previous process to compare against
    output_row.startDelay = 0;
}
The conditional statement is there to avoid null pointer errors on the first row of the data, as enddate_tMemorizeRows_1[1] will be null at that point. You could of course handle the null in other ways.
This process is reasonably easy to understand and maintain (although there is that small bit of Java code in there) and has the benefit of only needing to load the data once while keeping only a small part of it in memory at any one time. It should also be very fast.
You should consider refactoring the statement to do it in a "Talend" way; it may be a little slower, but it is more portable and robust.
If your table is not huge, for example, I would recommend loading it into memory using tCacheOutput/tCacheInput (you can find them on Talend Exchange) and this design:
tMySqlLoad -----> tCacheOutput_1
     |
 OnSubjobOk
     |
     v
tCacheInput_1 -----> tMap_1 -----+
                                 |
                               tJoin ------> tMap_3 ------> [output]
                                 |
tCacheInput_2 -----> tMap_2 -----'
First of all, you dump your table into a memory buffer.
Then you read this buffer twice. It's in memory, so it won't hurt performance.
In tMap_1 you add an auto-increment index using a Numeric.sequence.
You do the same in tMap_2, but with a starting number of 2 (basically, you shift the index).
Then you auto-join the table using these brand new columns.
Finally, in tMap_3 you release your payload (i.e. compute the diff); see the expression sketch after this list.
This is going to be a verbose but robust solution if your table is small. If it's not, and performance is not an issue, you can try an even more verbose solution like Prepared Statements.
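A rough sketch of the expressions involved, under the assumption that the index columns and joins are set up as described above (the flow names row_current and row_previous are illustrative, not from the original design):

// tMap_1: index expression for the "current" flow
Numeric.sequence("s1", 1, 1)

// tMap_2: index expression for the "previous" flow, started at 2 to shift it by one row
Numeric.sequence("s2", 2, 1)

// tJoin: join on tMap_1.index == tMap_2.index, so each row is paired with the one before it

// tMap_3: the payload, i.e. the delay between the previous end and the current start
row_current.start_date - row_previous.end_date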

SQL Bulk Insert CSV

I have a comma-separated CSV file containing hundreds of thousands of records in the following format:
3212790556,1,0.000000,,0
3212790557,2,0.000000,,0
Now, using the SQL Server Import Flat File method works just dandy. I can edit the SQL so that the table name and column names are something meaningful, and I also change the data types from the default varchar(50) to int or decimal. This all works fine and the SQL import succeeds.
However I am unable to do this same task using the Bulk Insert Query which is as follows:
BULK
INSERT temp1
FROM 'c:\filename.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '\n'
)
GO
This query returns the following 3 errors which I have no idea how to resolve:
Msg 4866, Level 16, State 1, Line 1
The bulk load failed. The column is too long in the data file for row 1, column 5. Verify that the field terminator and row terminator are specified correctly.
Msg 7399, Level 16, State 1, Line 1
The OLE DB provider "BULK" for linked server "(null)" reported an error. The provider did not give any information about the error.
Msg 7330, Level 16, State 2, Line 1
Cannot fetch a row from OLE DB provider "BULK" for linked server "(null)".
The purpose of my application is that there are multiple CSV files in a folder that all need to go into a single table so that I can query for the sum of values. At the moment I was thinking of writing a program in C# that would execute the BULK INSERT in a loop (once per file) and then return my results. I am guessing I don't need to write code and that I can just write a script that does all of this - can anyone guide me down the right path? :)
Many thanks.
Edit: I just added
ERRORFILE = 'C:\error.log'
to the query and I am getting 5221 rows inserted. Sometimes it's 5222, but it just fails beyond this point. I don't know what the issue is - the CSV is perfectly fine.
SOB. WTF!!!
I can't believe that replacing \n with "0x0A" in the ROWTERMINATOR worked!!! I mean seriously, I just tried it and it worked. Total WTF moment.
What is a bit interesting, though, is that the SQL Import wizard took only about 10 seconds to import, while the BULK INSERT query took well over a minute. Any guesses??
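For reference, here is a sketch of the statement with both of those changes (the hex row terminator and the error file) applied to the original query:

BULK INSERT temp1
FROM 'c:\filename.csv'
WITH
(
FIELDTERMINATOR = ',',
ROWTERMINATOR = '0x0A',
ERRORFILE = 'C:\error.log'
)
GO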

Updating the db 6000 times will take few minutes?

I am writing a test program with Ruby and ActiveRecord that reads a document
which is about 6000 words long. It then just tallies up the words with
recordWord = Word.find_by_s(word)
if recordWord.nil?
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
and so this part loops 6000 times... and it takes at least a few minutes to
run using sqlite3. Is that normal? I was expecting it to run
within a couple of seconds... can MySQL speed it up a lot?
With 6000 calls writing to the database, you're going to see speed issues. I would keep the various tallies in memory and save them to the database once at the end, not 6000 times along the way (see the sketch below).
Take a look at AR:Extensions as well to handle the bulk insertions.
http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/
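A minimal sketch of the in-memory approach, assuming a Word model with columns s and count as in the question (words stands for the list of words read from the document):

counts = Hash.new(0)
words.each { |word| counts[word] += 1 }  # tally in memory first

# one pass over the distinct words at the end, instead of one write per occurrence
counts.each do |s, n|
  record = Word.find_by_s(s) || Word.new(:s => s, :count => 0)
  record.count = record.count.to_i + n
  record.save
end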
I wrote up some quick code in perl that simply does:
Create the database
Insert a record that only contains a single integer
Retrieve the most recent record and verify that it returns what it inserted
And it does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than having an entire object/relational bridge. For this trivial case with SQLite it still took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on "traditional hardware."
Using a monitor I verified that it was primarily disk activity that was slowing it down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:
Do what people have suggested and find a way around the requirement
Try buying some solid state disks.
I think #1 is a good way to start :)
Code:
#!/usr/bin/perl
use warnings;
use strict;
use DBI;
my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');
create_database($dbh);
insert_data($dbh);
sub insert_data {
    my ($dbh) = @_;
    my $insert_sql = "INSERT INTO test_table (test_data) values (?)";
    my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";
    my $insert_sth = $dbh->prepare($insert_sql);
    my $retrieve_sth = $dbh->prepare($retrieve_sql);
    my $i = 0;
    while (++$i <= 6000) {
        $insert_sth->execute(($i));
        $retrieve_sth->execute(($i));
        my $hash_ref = $retrieve_sth->fetchrow_hashref;
        die "bad data!" unless $hash_ref->{'test_data'} == $i;
    }
}
sub create_database {
    my ($dbh) = @_;
    my $status = $dbh->do("DROP TABLE test_table");
    # warn if the DROP TABLE failed (e.g. the table does not exist yet)
    if (!defined $status) {
        print "DROP TABLE failed";
    }
    my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
    $create_statement .= "test_data varchar(255)\n";
    $create_statement .= ");";
    $status = $dbh->do($create_statement);
    # die if the CREATE resulted in an error
    if (!defined $status) {
        die "CREATE failed";
    }
}
What kind of database connection are you using? Some databases allow you to connect 'directly' rather than using a TCP network connection that goes through the network stack. In other words, if you're making a network connection and sending data that way, it can slow things down.
Another way to boost performance of a database connection is to group SQL statements together in a single command.
For example, if you make a single 6,000-line SQL statement that looks like this
"update words set count = count + 1 where word = 'the'
update words set count = count + 1 where word = 'in'
...
update words set count = count + 1 where word = 'copacetic'"
and run that as a single command, performance will be a lot better. By default, MySQL has a 'packet size' limit of 1 megabyte, but you can change that in the my.ini file to be larger if you want.
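The setting in question is the max_allowed_packet server variable; a sketch of what the my.ini change might look like (16M is just an example value):

[mysqld]
max_allowed_packet = 16M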
Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.
Another thing you could do would be to keep a count of words in memory, and then only insert the final total into the database, rather than doing an update every time you come across a word. That will probably cut down a lot on the number of writes, because if you do an update every time you come across the word 'the', that's a huge, huge waste. Words have a 'long tail' distribution and the most common words are hugely more common than the more obscure ones. Then the underlying SQL would look more like this:
"update words set count = 300 where word = 'the'
update words set count = 250 where word = 'in'
...
update words set count = 1 where word = 'copacetic'"
If you're worried about taking up too much memory, you could count words and periodically 'flush' them: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, you should consider issuing the SQL commands in batches directly.
Without knowing about Ruby and Sqlite, some general hints:
create a unique index on Word.s (you did not state whether you have one)
define a default for Word.count in the database ( DEFAULT 1 )
optimize assignment of count:
recordWord = Word.find_by_s(word)
if recordWord.nil?
  recordWord = Word.new
  recordWord.s = word
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
Use BEGIN TRANSACTION before your updates, then COMMIT at the end (see the ActiveRecord sketch below).
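In ActiveRecord terms, that last hint is a single transaction wrapping the whole loop, so SQLite commits once instead of 6000 times. A sketch (words stands for the word list read from the document):

Word.transaction do
  words.each do |word|
    recordWord = Word.find_by_s(word)
    if recordWord.nil?
      recordWord = Word.new
      recordWord.s = word
      recordWord.count = 1
    else
      recordWord.count += 1
    end
    recordWord.save
  end
end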
OK, I found some general rules:
1) use a hash to keep the counts first, not the db
2) at the end, wrap all the inserts or updates in one transaction, so that it won't hit the db 6000 times.