I'm having an issue with an app I'm working on.
The app allows a user to upload a CSV file, which the app processes and, in turn, uses to create records in a number of tables. To improve performance, for one of the tables it produces a new CSV file so that it can use MySQL's LOAD DATA INFILE functionality.
Instead, this seems to be increasing the time it takes to process.
I'm pushing all processing into the background using Sidekiq. It creates the CSV without any problems; however, when I execute the LOAD DATA query it just sits there and I have no idea what it's doing.
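One way to see what that session is actually doing while it sits there is to watch it from a second connection (a diagnostic sketch, not part of the job itself):
SHOW FULL PROCESSLIST;       -- state of every session, e.g. whether it's waiting for a table lock
SHOW ENGINE INNODB STATUS;   -- active transactions and any lock waits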
My processing function does the following:
CSV.open(output_path, 'w+', { force_quotes: true }) do |writer|
  writer << headers
  while rows.count > 0
    ....
    data_sets.each do |ds|
      writer << [UUIDTools::UUID.random_create, resp, row[set], ds.id, now, now]
      set += 1
    end
    resp += 1
  end
end
sql = "LOAD DATA LOCAL INFILE '#{output_path}'
INTO TABLE data_set_responses
FIELDS TERMINATED BY ',' ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
(id, response_number, response, data_set_id, created_at, updated_at)"
con = ActiveRecord::Base.connection
con.execute("SET autocommit = 0;")
con.execute("SET unique_checks = 0;")
con.execute("SET foreign_key_checks = 0;")
con.execute("LOCK TABLES data_set_responses WRITE;")
con.execute(sql)
con.execute("UNLOCK TABLES;")
con.execute("COMMIT;")
con.execute("SET autocommit = 1;")
con.execute("SET unique_checks = 1;")
con.execute("SET foreign_key_checks = 1;")
As of right now, my Sidekiq process has been running for 22 minutes and still hasn't finished. It should be inserting around 700k rows, which shouldn't take anywhere near this long!
The table I'm inserting into has a binary field for its primary key (a UUID), so I don't know if that's slowing it down?
Any ideas?
I ended up changing my data structure to one that didn't require the vast number of rows that this one did. I've got it down to a matter of seconds :)
I use parameterized queries for normal inserts/updates for security.
How do I do that for queries like this:
LOAD DATA INFILE '/filepath' INTO TABLE mytable
In my case, the path to the file would be different every time (for different requests). Is it fine to proceed like this (since I am not getting any data from outside; the file comes from the server itself):
path = /filepath
"LOAD DATA INFILE" + path + "INTO TABLE mytable"
Since LOAD DATA is not listed in SQL Syntax Allowed in Prepared Statements, you can't prepare something like
LOAD DATA INFILE ? INTO TABLE mytable
But SET is listed. So a workaround could be to prepare and execute
SET @filepath = ?
And then execute
LOAD DATA INFILE @filepath INTO TABLE mytable
Update
In Python with MySQLdb the following query should work
LOAD DATA INFILE %s INTO TABLE mytable
since no prepared statement is used.
To answer your "is it fine to proceed like this" question: your example code will fail because the resulting query will be missing quotes around the filename. If you changed it to the following it could run, but it is still a bad idea IMO:
path = "/filepath"
sql = "LOAD DATA INFILE '" + path + "' INTO TABLE mytable" # note the single quotes
While you may not be accepting outside input today, code has a way of sticking around and getting reused/copied, so you should use the API in a way that will escape your parameters:
sql = "LOAD DATA INFILE %s INTO TABLE mytable"
cursor.execute(sql, (path,))
And don't forget to commit if autocommit is not enabled.
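Putting the pieces together, a minimal MySQLdb sketch might look like this (connection details, table name, and path are placeholders):

import MySQLdb

# No LOCAL keyword here, so the MySQL server itself must be able to read the file.
db = MySQLdb.connect(host="localhost", user="user", passwd="secret", db="mydb")
cursor = db.cursor()

path = "/filepath"  # placeholder
sql = "LOAD DATA INFILE %s INTO TABLE mytable"
cursor.execute(sql, (path,))  # MySQLdb quotes and escapes the path client-side

db.commit()  # needed unless autocommit is enabled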
I am attempting to set up a large database on my desktop (~100 GB), and one of the CSV files is about 40 GB. MySQL Workbench executes the query for about 30-60 minutes, then loses the connection and reports error code 2013.
What is the typical upload time per GB?
Do I need to modify my InnoDB option file or other parameters? I cannot seem to figure out the right settings... below I have listed my LOAD DATA statement for reference.
LOAD DATA LOCAL INFILE '/Users/ED/desktop/mirror2/CHARTEVENTS.csv'
INTO TABLE CHARTEVENTS
FIELDS TERMINATED BY ',' ESCAPED BY '\\' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(@ROW_ID,@SUBJECT_ID,@HADM_ID,@ICUSTAY_ID,@ITEMID,@CHARTTIME,@STORETIME,@CGID,@VALUE,@VALUENUM,@VALUEUOM,@WARNING,@ERROR,@RESULTSTATUS,@STOPPED)
SET
ROW_ID = @ROW_ID,
SUBJECT_ID = @SUBJECT_ID,
HADM_ID = IF(@HADM_ID='', NULL, @HADM_ID),
ICUSTAY_ID = IF(@ICUSTAY_ID='', NULL, @ICUSTAY_ID),
ITEMID = @ITEMID,
CHARTTIME = @CHARTTIME,
STORETIME = IF(@STORETIME='', NULL, @STORETIME),
CGID = IF(@CGID='', NULL, @CGID),
VALUE = IF(@VALUE='', NULL, @VALUE),
VALUENUM = IF(@VALUENUM='', NULL, @VALUENUM),
VALUEUOM = IF(@VALUEUOM='', NULL, @VALUEUOM),
WARNING = IF(@WARNING='', NULL, @WARNING),
ERROR = IF(@ERROR='', NULL, @ERROR),
RESULTSTATUS = IF(@RESULTSTATUS='', NULL, @RESULTSTATUS),
STOPPED = IF(@STOPPED='', NULL, @STOPPED);
I don't know the details of the connection between your local machine and the MySQL server, but the connection could be getting cut off for any number of reasons. One simple workaround would be to upload the 40 GB file directly to the machine running MySQL and then use LOAD DATA (without LOCAL). With this approach the LOAD DATA statement should take orders of magnitude less time to read the input file, since there is no longer any network latency to slow things down.
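Concretely, once the file is sitting on the database host, the load would look something like this (a sketch; the directory is only an example, and on many installs the file must live under the server's secure_file_priv directory):

LOAD DATA INFILE '/var/lib/mysql-files/CHARTEVENTS.csv'
INTO TABLE CHARTEVENTS
FIELDS TERMINATED BY ',' ESCAPED BY '\\' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;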
I have a LOAD DATA MySQL query I'm trying to run and need help fixing one thing.
Here is the query I'm running:
$query = "LOAD DATA LOCAL INFILE '$file_app'
INTO TABLE tbl_user_tmp
LINES STARTING BY '{'
TERMINATED BY '/>'
(@name)
set
name=SUBSTRING_INDEX(@name,'\"',-2),
activity=SUBSTRING_INDEX(SUBSTRING_INDEX(@name,'}',1),'/',1),
class=SUBSTRING_INDEX(SUBSTRING_INDEX(@name,'}',1),'/',-1),
user = '$user'" ;
Here is an example of the data from the file I'm trying to load:
<item component="ComponentInfo{com.apps.aaa.roadside/com.apps.aaa.roadside.Splash}" drawable="aaa_roadside1" />
With the above query I get the following:
name=aaa_roadside1"
activity=com.apps.aaa.roadside
class=com.apps.aaa.roadside.Splash
Everything is correct except name. I need to remove that last ".
I thought the query below would work, but it does not. Any ideas?
This was working before, when I had TERMINATED BY set to '\" />'.
However, this will not account for entries that might not have that space at the end, like this:
<item component="ComponentInfo{com.apps.aaa.roadside/com.apps.aaa.roadside.Splash}" drawable="aaa_roadside1"/>
I need to account for both.
$query = "LOAD DATA LOCAL INFILE '$file_app'
INTO TABLE tbl_user_tmp
LINES STARTING BY '{'
TERMINATED BY '/>'
(@name)
set
name=SUBSTRING(SUBSTRING_INDEX(@name,'\"',-2),-1),
activity=SUBSTRING_INDEX(SUBSTRING_INDEX(@name,'}',1),'/',1),
class=SUBSTRING_INDEX(SUBSTRING_INDEX(@name,'}',1),'/',-1),
user = '$user'" ;
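One approach that should handle both endings (with and without the space before '/>') is to split on the quotes again instead of cutting a fixed number of characters. A sketch, not tested against your data, replacing only the name= line:

name=SUBSTRING_INDEX(SUBSTRING_INDEX(@name,'\"',-2),'\"',1),

The inner call keeps everything after the second-to-last quote (aaa_roadside1 plus the trailing quote), and the outer call then keeps everything before that remaining quote.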
I'm using the following custom handler for doing bulk inserts using raw SQL in Django, with a MySQLdb backend and InnoDB tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)

    subject = Subject.objects.get(id=subject_pi)

    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()

        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)

        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)

        transaction.commit_unless_managed()

    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering if it is normal for load times to get longer as files of constant size (same number of rows) are added, and whether there is any way to improve the performance of these bulk inserts.
I found that the main issue with bulk inserting into my InnoDB table was a MySQL InnoDB setting I had overlooked.
The setting innodb_buffer_pool_size defaults to 8M for my version of MySQL, and it was causing a huge slowdown as my table size grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to the articles is 70 to 80 percent of available memory if using a dedicated MySQL server. After increasing the buffer pool size, my inserts went from an hour-plus to less than 10 minutes, with no other changes.
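For reference, the change is a one-liner in the MySQL option file (the value below is only illustrative; size it to your own machine and restart mysqld afterwards):

[mysqld]
innodb_buffer_pool_size = 4G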
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks @f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
It turns out this is related to using Ubuntu and this bug.
1) Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
2) sudo aa-complain /usr/sbin/mysqld
3) Try to load a file from your designated loading directory: load data infile '/var/opt/mysql-load/import.csv' into table ...
4) sudo aa-logprof
aa-logprof will identify the access violation triggered by the 'load data infile ...' query and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole AppArmor process.)
5) sudo aa-enforce /usr/sbin/mysqld
6) Try to load your file again. It should work this time.
I am writing a test program with Ruby and ActiveRecord that reads a document which is about 6,000 words long. Then I just tally up the words with:
recordWord = Word.find_by_s(word);
if (recordWord.nil?)
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
and so this part loops 6,000 times... and it takes at least a few minutes to run using sqlite3. Is that normal? I was expecting it to run within a couple of seconds... can MySQL speed it up a lot?
With 6000 calls to write to the database, you're going to see speed issues. I would save the various tallies in memory and save to the database once at the end, not 6000 times along the way.
Take a look at AR:Extensions as well to handle the bulk insertions.
http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/
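As a rough sketch of the "tally in memory, write once" idea (assuming words is the parsed word list and a Word model with columns s and count; the exact finder method depends on your Rails version):

counts = Hash.new(0)
words.each { |w| counts[w] += 1 }   # tally everything in memory first

Word.transaction do                 # one transaction instead of 6000 separate commits
  counts.each do |w, n|
    record = Word.find_or_initialize_by(s: w)
    record.count = (record.count || 0) + n
    record.save!
  end
end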
I wrote up some quick code in Perl that simply does the following:
Create the database
Insert a record that only contains a single integer
Retrieve the most recent record and verify that it returns what it inserted
And it does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than having an entire object/relational bridge. For this trivial case with SQLite it still took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on "traditional hardware."
Using the monitor, I verified that it was primarily disk activity that was slowing it down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:
Do what people have suggested and find a way around the requirement
Try buying some solid state disks.
I think #1 is a good way to start :)
Code:
#!/usr/bin/perl
use warnings;
use strict;

use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');

create_database($dbh);
insert_data($dbh);

sub insert_data {
    my ($dbh) = @_;

    my $insert_sql   = "INSERT INTO test_table (test_data) values (?)";
    my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";

    my $insert_sth   = $dbh->prepare($insert_sql);
    my $retrieve_sth = $dbh->prepare($retrieve_sql);

    my $i = 0;
    while (++$i < 6000) {
        $insert_sth->execute(($i));
        $retrieve_sth->execute(($i));

        my $hash_ref = $retrieve_sth->fetchrow_hashref;
        die "bad data!" unless $hash_ref->{'test_data'} == $i;
    }
}

sub create_database {
    my ($dbh) = @_;

    my $status = $dbh->do("DROP TABLE test_table");
    # warn if the DROP resulted in an error (e.g. the table does not exist yet)
    if (!defined $status) {
        print "DROP TABLE failed";
    }

    my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
    $create_statement .= "test_data varchar(255)\n";
    $create_statement .= ");";

    $status = $dbh->do($create_statement);
    # die if the CREATE resulted in an error
    if (!defined $status) {
        die "CREATE failed";
    }
}
What kind of database connection are you using? Some databases allow you to connect 'directly' rather than using a TCP network connection that goes through the network stack. In other words, if you're making an internet connection and sending data through that way, it can slow things down.
Another way to boost performance of a database connection is to group SQL statements together in a single command.
For example, if you make a single 6,000-line SQL statement that looks like this
"update words set count = count + 1 where word = 'the'
update words set count = count + 1 where word = 'in'
...
update words set count = count + 1 where word = 'copacetic'"
and run it as a single command, performance will be a lot better. By default, MySQL has a 'packet size' limit of 1 megabyte, but you can change that in the my.ini file to be larger if you want.
Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.
Another thing you could do would be to keep a count of words in memory, and then only insert the final total into the database, rather than doing an update every time you come across a word. That will probably cut down a lot on the number of writes, because if you do an update every time you come across the word 'the', that's a huge waste. Words have a 'long tail' distribution, and the most common words are hugely more common than the more obscure words. Then the underlying SQL would look more like this:
"update words set count = 300 where word = 'the'
update words set count = 250 where word = 'in'
...
update words set count = 1 where word = 'copacetic'"
If you're worried about taking up too much memory, you could count words and periodically 'flush' them: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, you should consider issuing SQL commands in batches directly.
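A batched version of those word updates can even be collapsed into a single statement, assuming a unique index on the word column (a sketch using MySQL's INSERT ... ON DUPLICATE KEY UPDATE):

INSERT INTO words (word, count)
VALUES ('the', 300), ('in', 250), ('copacetic', 1)
ON DUPLICATE KEY UPDATE count = count + VALUES(count);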
Without knowing much about Ruby and SQLite, some general hints:
create a unique index on Word.s (you did not state whether you have one)
define a default for Word.count in the database ( DEFAULT 1 )
optimize assignment of count:
recordWord = Word.find_by_s(word);
if (recordWord.nil?)
  recordWord = Word.new
  recordWord.s = word
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
Use BEGIN TRANSACTION before your updates, then COMMIT at the end.
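In SQLite terms, those hints look roughly like this (table and column names are assumed from the model):

CREATE UNIQUE INDEX IF NOT EXISTS index_words_on_s ON words (s);
-- the default belongs in the table definition, e.g.
-- CREATE TABLE words (id INTEGER PRIMARY KEY, s TEXT, count INTEGER DEFAULT 1);

BEGIN TRANSACTION;
-- ... the 6000 inserts/updates ...
COMMIT;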
OK, I found some general rules:
1) Use a hash to keep the count first, not the DB.
2) At the end, wrap all inserts or updates in one transaction, so that it doesn't hit the DB 6,000 times.