MySQL: push index into memory

I use MySQL (MyISAM) with a table of over 8M rows and a primary index on 'id'.
My application shows:
first run: 55 req/sec,
second run: ~120 req/sec,
third run: ~1200 req/sec,
fourth run: ~4500 req/sec,
fifth run: ~9999 req/sec
After restarting mysql-server the same thing happens again.
How can I place the whole index in memory at once after the database server starts?
In my.cnf
key_buffer_size=2000M
Code sample:
import datetime
import random

# 'connection' is an existing MySQL connection (e.g. from MySQLdb/Django)
now = datetime.datetime.now()
cursor = connection.cursor()
for x in xrange(1, 10000):
    id = random.randint(10, 100000)  # random ids from the first 100000 records, to warm the cache
    cursor.execute("""SELECT num, manufacturer_id
                      FROM product WHERE id=%s LIMIT 1""", [id])
    cursor.fetchone()
td = datetime.datetime.now() - now
sec = td.seconds + td.days * 24 * 3600
print "%.2f operation/sec" % (float(x) / float(sec))

I think two caches are at work here. One is the key (index) cache, and that can be preloaded with LOAD INDEX INTO CACHE.
The other is the query cache, and I think that in your case this is where most of the performance is gained. AFAIK that can't be preloaded with any MySQL command.
What you could do is replay the last N queries that ran before the restart; those queries would then populate the cache. Or keep a file of some realistic queries to run at start-up.
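If it is mainly the key cache you want to warm, a minimal sketch (reusing the connection object from the question; LOAD INDEX INTO CACHE itself is a standard MySQL statement for MyISAM tables) could run once at application start-up:

# Preload the product table's MyISAM indexes into the key buffer.
# IGNORE LEAVES loads only the non-leaf index blocks, which keeps the
# preload fast; drop it to pull in the whole index.
cursor = connection.cursor()
cursor.execute("LOAD INDEX INTO CACHE product IGNORE LEAVES")
for row in cursor.fetchall():
    print row  # one status row per preloaded table

Note that the key buffer only holds index blocks; the data rows are still served from the OS page cache (and the query cache mentioned above), which is why repeated warm-up queries keep helping beyond this.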

Related

SQLAlchemy timing out when executing large query

I have a large query to execute through SQLAlchemy which returns approximately 2.5 million rows. It's connecting to a MySQL database. When I do:
transactions = Transaction.query.all()
it eventually times out after around ten minutes with this error: sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query')
I've tried setting different parameters when calling create_engine, like:
create_engine(connect_args={'connect_timeout': 30})
What do I need to change so the query will not timeout?
I would also be fine if there is a way to paginate the results and go through them that way.
Solved by pagination:
page_size = 10000  # get this many items at a time
step = 0
while True:
    start, stop = page_size * step, page_size * (step + 1)
    transactions = sql_session.query(Transaction).slice(start, stop).all()
    if not transactions:  # .all() returns an empty list, never None, once we run out of rows
        break
    for t in transactions:
        f.write(str(t))
        f.write('\n')
    if len(transactions) < page_size:
        break
    step += 1
f.close()
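If you would rather stream than paginate, SQLAlchemy's Query.yield_per is another option; a rough sketch, reusing the sql_session, Transaction and f names from the snippet above:

# Fetch rows in batches of 10000 instead of materialising all 2.5M at once.
for t in sql_session.query(Transaction).yield_per(10000):
    f.write(str(t))
    f.write('\n')
f.close()

Be aware that with some MySQL drivers the whole result set is still buffered on the client unless server-side cursors are enabled (e.g. via execution_options(stream_results=True)), so the pagination approach above can be the safer bet.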

MySQL - How to optimize long running query

I have a query that takes a very long time to complete. I am wondering if there is any way to optimize it so it runs quicker. Currently the table has around 15 million rows and has an index on expirationDate. This query has been running for 8000+ seconds.
UPDATE LOW_PRIORITY items
SET    expire = 1
WHERE  expirationdate < CURDATE()
  AND  expirationdate != '0000-00-00'
  AND  expire = '0'
  AND  ( `submit_id` != '742457'
         OR submit_id IS NULL )
Never mind your table structure (I requested it before this edit); I have a solution for you. I've done this several times with my DBAs on our production MySQL databases, even during peak hours of the day. The approach you need to take is to turn your one massive UPDATE query (which will lock the table while it executes) into many individual updates. Those individual updates will not have a material impact on your database, especially if you break them down and run them in smaller batches.
To generate those individual queries, run this (assuming the items table's primary key is named "id"):
SELECT CONCAT(
    'UPDATE items SET expire = 1 WHERE id = ',
    items.id,
    ';') AS ''
FROM items
WHERE expirationdate < CURDATE()
  AND expirationdate != '0000-00-00'
  AND expire = '0'
  AND ( `submit_id` != '742457'
        OR submit_id IS NULL )
This will generate SQL that looks like this:
UPDATE items SET expire = 1 WHERE id = 1234;
UPDATE items SET expire = 1 WHERE id = 2345;
.....
What I personally like to do is put the above query in a text file and run this from the command line:
cat theQuery.sql | mysql -u yourusername [+ command line args with db host, etc] -p > outputSQLCommands.sql
This will pass the query to MySQL and dump the results to an output file. If you noticed, we aliased the output column to '' (nothing); that makes the output nicely formatted in the text file so it can be piped straight back into MySQL to run the commands. In fact, you could do it all in one fell swoop like this:
cat theQuery.sql | mysql -u yourusername [+ command line args with db host, etc] -p | mysql -u yourusername [+ command line args with db host, etc] -p
Then you are definitely ballin' SQL style.
After you have the output SQL file though, if it's too many queries to run one after another, you can use a command line tool like "split" to split the file into a bunch of smaller files.
I would recommend testing to find the optimum batch size based on your DB's hardware/performance/etc. For example, try running 1000 updates first, if that works with no material impact try 5000, etc. Then you can split the rest of your file into chunks you are comfortable with and run them through.
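If you would rather not go through an intermediate SQL file, the same idea can be done in a small script. Here is a rough Python/MySQLdb sketch (connection details are placeholders; the batch size is something you would tune as described above) that collects the matching ids and updates them in IN (...) batches instead of one statement per row:

import MySQLdb

BATCH = 1000  # tune this batch size as suggested above

conn = MySQLdb.connect(host='localhost', user='yourusername',
                       passwd='yourpassword', db='yourdb')
cur = conn.cursor()

# Grab the ids that need flipping; the big UPDATE never runs, so the
# table is only ever locked for one small batch at a time.
cur.execute("""SELECT id FROM items
               WHERE expirationdate < CURDATE()
                 AND expirationdate != '0000-00-00'
                 AND expire = '0'
                 AND (`submit_id` != '742457' OR submit_id IS NULL)""")
ids = [row[0] for row in cur.fetchall()]

for i in range(0, len(ids), BATCH):
    chunk = ids[i:i + BATCH]
    cur.execute("UPDATE items SET expire = 1 WHERE id IN (%s)"
                % ','.join(str(x) for x in chunk))
    conn.commit()

conn.close()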
Good luck!

Does Sphinx auto-update its index when you add data to your SQL?

I am curious as to whether or not Sphinx will auto update its index when you add new SQL data or whether you have to tell it specifically to reindex your db.
If it doesn't, does anyone have an example of how to automate this process when the database data changes?
The answer is no; you need to tell Sphinx to reindex your db.
There are some steps and requirements you need to know about:
A main and a delta index are required.
On the first run you need to build your main index.
After the first run, you can index the delta by rotating it (so that the service keeps running and the data already on the web stays usable in the meantime).
Before you go any further, you need to create a table to record your "last indexed rows". The last indexed row ID is used for the next delta indexing run and for merging the delta into main.
You need to merge your delta index into the main index, as described in the Sphinx documentation: http://sphinxsearch.com/docs/current.html#index-merging
Restart the Sphinx service.
TIPS: Create your own program that runs the indexing, in C# or any other language, and put it on a schedule (the Windows Task Scheduler works too); a minimal sketch follows the conf below.
Here is my conf:
source Main
{
    type     = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = password
    sql_db   = table1
    sql_port = 3306 # optional, default is 3306
    sql_query_pre = REPLACE INTO table1.sph_counter SELECT 1, MAX(PageID) FROM table1.pages;
    sql_query = \
        SELECT pd.`PageID`, pd.Status FROM table1.pages pd \
        WHERE pd.PageID>=$start AND pd.PageID<=$end \
        GROUP BY pd.`PageID`
    sql_attr_uint   = Status
    sql_query_info  = SELECT * FROM table1.`pages` pd WHERE pd.`PageID`=$id
    sql_query_range = SELECT MIN(PageID), MAX(PageID) \
        FROM table1.`pages`
    sql_range_step  = 1000000
}
source Delta : Main
{
    sql_query_pre = SET NAMES utf8
    sql_query = \
        SELECT PageID, Status FROM pages \
        WHERE PageID>=$start AND PageID<=$end
    sql_attr_uint   = Status
    sql_query_info  = SELECT * FROM table1.`pages` pd WHERE pd.`PageID`=$id
    sql_query_range = SELECT (SELECT MaxDoc FROM table1.sph_counter WHERE ID = 1) MinDoc, MAX(PageID) FROM table1.`pages`;
    sql_range_step  = 1000000
}
index Main
{
    source       = Main
    path         = C:/sphinx/data/Main
    docinfo      = extern
    charset_type = utf-8
}
index Delta : Main
{
    source       = Delta
    path         = C:/sphinx/data/Delta
    charset_type = utf-8
}
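As a concrete version of the TIPS above, here is a minimal Python sketch that rotates the delta and then merges it into main using the stock indexer tool; the config path is an assumption, the index names match the conf above:

# Rebuild the Delta index and fold it into Main without stopping searchd.
# Run this from cron / the Windows Task Scheduler as often as needed.
import subprocess

CONF = "C:/sphinx/sphinx.conf"  # assumed location of the conf shown above

# Re-index the delta while searchd keeps serving the old files (--rotate).
subprocess.check_call(["indexer", "--config", CONF, "--rotate", "Delta"])

# Merge the freshly built delta into the main index (see #index-merging above).
subprocess.check_call(["indexer", "--config", CONF, "--merge", "Main", "Delta", "--rotate"])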
As described in the Sphinx documentation on real-time indexes:
Real-time indexes (or RT indexes for brevity) are a new backend that lets you insert, update, or delete documents (rows) on the fly.
So to update an index on the fly you would just need to make a query like
{INSERT | REPLACE} INTO index [(column, ...)]
VALUES (value, ...)
[, (...)]
To expand on Anne's answer - if you're using SQL indices, it won't update automatically. You can manage the process of reindexing after every change - but that can be expensive. One way to get around this is have a core index with everything, and then a delta index with the same structure that indexes just the changes (this could be done by a boolean or timestamp column).
That way, you can just reindex the delta index (which is smaller, and thus faster) on a super-regular basis, and then process both core and delta together less regularly (but still, best to do it at least daily).
But otherwise, the new RT indices are worth looking at - you still need to update things yourself, and it's not tied to the database, so it's a different mindset. Also: RT indices don't have all the features that SQL indices do, so you'll need to decide what's more important.
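For completeness, pushing a row into an RT index can be done over the MySQL protocol (SphinxQL). Here is a minimal sketch, assuming searchd has a SphinxQL listener on port 9306 and an RT index called rt_pages with a status attribute (both names are made up here):

import pymysql

# SphinxQL speaks the MySQL wire protocol, so any MySQL client library works.
conn = pymysql.connect(host='127.0.0.1', port=9306, user='', password='')
cur = conn.cursor()
cur.execute("REPLACE INTO rt_pages (id, content, status) VALUES (%s, %s, %s)",
            (1234, 'page body text', 1))
cur.close()
conn.close()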

MySQL LOAD DATA INFILE slows down after initial insert using raw SQL in Django

I'm using the following custom handler to do bulk inserts using raw SQL in Django, with a MySQLdb backend and InnoDB tables:
def handle_ttam_file_for(f, subject_pi):
    import datetime
    write_start = datetime.datetime.now()
    print "write to disk start: ", write_start
    destination = open('temp.ttam', 'wb+')
    for chunk in f.chunks():
        destination.write(chunk)
    destination.close()
    print "write to disk end", (datetime.datetime.now() - write_start)
    subject = Subject.objects.get(id=subject_pi)
    def my_custom_sql():
        from django.db import connection, transaction
        cursor = connection.cursor()
        statement = "DELETE FROM ttam_genotypeentry WHERE subject_id=%i;" % subject.pk
        del_start = datetime.datetime.now()
        print "delete start: ", del_start
        cursor.execute(statement)
        print "delete end", (datetime.datetime.now() - del_start)
        statement = "LOAD DATA LOCAL INFILE 'temp.ttam' INTO TABLE ttam_genotypeentry IGNORE 15 LINES (snp_id, @dummy1, @dummy2, genotype) SET subject_id=%i;" % subject.pk
        ins_start = datetime.datetime.now()
        print "insert start: ", ins_start
        cursor.execute(statement)
        print "insert end", (datetime.datetime.now() - ins_start)
        transaction.commit_unless_managed()
    my_custom_sql()
The uploaded file has 500k rows and is ~ 15M in size.
The load times seem to get progressively longer as files are added.
Insert times:
1st: 30m
2nd: 50m
3rd: 1h20m
4th: 1h30m
5th: 1h35m
I was wondering if it is normal for load times to get longer as files of constant size (# of rows) are added, and whether there is any way to improve the performance of these bulk inserts.
I found that the main issue with bulk inserting into my InnoDB table was a MySQL InnoDB setting I had overlooked.
The default innodb_buffer_pool_size is 8M for my version of MySQL, and it was causing a huge slowdown as my table size grew.
innodb-performance-optimization-basics
choosing-innodb_buffer_pool_size
The recommended size according to those articles is 70 to 80 percent of memory if you are using a dedicated MySQL server. After increasing the buffer pool size, my inserts went from over an hour to under 10 minutes, with no other changes.
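For reference, the change is a single line in my.cnf (the value below is only an example; size it to roughly 70-80% of RAM on a dedicated server, then restart mysqld):
innodb_buffer_pool_size = 12G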
Another change I was able to make was getting rid of the LOCAL argument in the LOAD DATA statement (thanks @f00). My problem before was that I kept getting "file not found" or "cannot get stat" errors when trying to have MySQL access the file Django uploaded.
It turns out this is related to using Ubuntu and this bug.
Pick a directory from which mysqld should be allowed to load files. Perhaps somewhere writable only by your DBA account and readable only by members of group mysql?
sudo aa-complain /usr/sbin/mysqld
Try to load a file from your designated loading directory: 'load data infile '/var/opt/mysql-load/import.csv' into table ...'
sudo aa-logprof
aa-logprof will identify the access violation triggered by the 'load data infile ...' query, and interactively walk you through allowing access in the future. You probably want to choose Glob from the menu, so that you end up with read access to '/var/opt/mysql-load/*'. Once you have selected the right (glob) pattern, choose Allow from the menu to finish up. (N.B. Do not enable the repository when prompted to do so the first time you run aa-logprof, unless you really understand the whole apparmor process.)
sudo aa-enforce /usr/sbin/mysqld
Try to load your file again. It should work this time.

Will updating the db 6000 times take a few minutes?

I am writing a test program with Ruby and ActiveRecord, and it reads a document
which is about 6000 words long. Then I just tally up the words with:
recordWord = Word.find_by_s(word)
if (recordWord.nil?)
  recordWord = Word.new
  recordWord.s = word
end
if recordWord.count.nil?
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
and so this part loops 6000 times... and it takes at least a few minutes to
run using sqlite3. Is that normal? I was expecting it to run
within a couple of seconds... can MySQL speed it up a lot?
With 6000 calls to write to the database, you're going to see speed issues. I would save the various tallies in memory and save to the database once at the end, not 6000 times along the way.
Take a look at AR:Extensions as well to handle the bulk insertions.
http://rubypond.com/articles/2008/06/18/bulk-insertion-of-data-with-activerecord/
I wrote up some quick code in Perl that simply does:
Create the database
Insert a record that only contains a single integer
Retrieve the most recent record and verify that it returns what it inserted
And it does steps #2 and #3 6000 times. This is obviously a considerably lighter workload than going through an entire object/relational bridge. For this trivial case with SQLite it still took 17 seconds to execute, so your desire to have it take "a couple of seconds" is not realistic on "traditional hardware."
Using the monitor I verified that it was primarily disk activity that was slowing it down. Based on that, if for some reason you really do need the database to behave that quickly, I suggest one of two options:
Do what people have suggested and find a way around the requirement
Try buying some solid state disks.
I think #1 is a good way to start :)
Code:
#!/usr/bin/perl
use warnings;
use strict;
use DBI;

my $dbh = DBI->connect('dbi:SQLite:dbname=/tmp/dbfile', '', '');
create_database($dbh);
insert_data($dbh);

sub insert_data {
    my ($dbh) = @_;
    my $insert_sql   = "INSERT INTO test_table (test_data) values (?)";
    my $retrieve_sql = "SELECT test_data FROM test_table WHERE test_data = ?";
    my $insert_sth   = $dbh->prepare($insert_sql);
    my $retrieve_sth = $dbh->prepare($retrieve_sql);
    my $i = 0;
    while (++$i <= 6000) {
        $insert_sth->execute(($i));
        $retrieve_sth->execute(($i));
        my $hash_ref = $retrieve_sth->fetchrow_hashref;
        die "bad data!" unless $hash_ref->{'test_data'} == $i;
    }
}

sub create_database {
    my ($dbh) = @_;
    my $status = $dbh->do("DROP TABLE test_table");
    # warn if the DROP failed (e.g. the table did not exist yet)
    if (!defined $status) {
        print "DROP TABLE failed";
    }
    my $create_statement = "CREATE TABLE test_table (id INTEGER PRIMARY KEY AUTOINCREMENT, \n";
    $create_statement .= "test_data varchar(255)\n";
    $create_statement .= ");";
    $status = $dbh->do($create_statement);
    # die if the CREATE resulted in an error
    if (!defined $status) {
        die "CREATE failed";
    }
}
What kind of database connection are you using? Some databases allow you to connect 'directly' rather than using a TCP network connection that goes through the network stack. In other words, if you're making a network connection and sending data that way, it can slow things down.
Another way to boost performance of a database connection is to group SQL statements together in a single command.
For example, making a single 6,000 line SQL statement that looks like this
"update words set count = count + 1 where word = 'the'
update words set count = count + 1 where word = 'in'
...
update words set count = count + 1 where word = 'copacetic'"
and run that as a single command, performance will be a lot better. By default, MySQL has a 'packet size' limit of 1 megabyte, but you can change that in the my.ini file to be larger if you want.
Since you're abstracting away your database calls through ActiveRecord, you don't have much control over how the commands are issued, so it can be difficult to optimize your code.
Another thing you could do is keep a count of words in memory, and then only insert the final total into the database, rather than doing an update every time you come across a word. That will probably cut down a lot on the number of inserts, because if you do an update every time you come across the word 'the', that's a huge, huge waste. Words have a 'long tail' distribution and the most common words are hugely more common than the more obscure words. Then the underlying SQL would look more like this:
"update words set count = 300 where word = 'the'
update words set count = 250 where word = 'in'
...
update words set count = 1 where word = 'copacetic'"
If you're worried about taking up too much memory, you could count words and periodically 'flush' them: read a couple of megabytes of text, then spend a few seconds updating the totals, rather than updating each word every time you encounter it. If you want to improve performance even more, you should consider issuing SQL commands in batches directly.
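To make that concrete, here is a language-agnostic sketch of the "tally in memory, flush in one transaction" idea, written in Python with MySQLdb rather than Ruby/ActiveRecord (the table and column names echo the snippets above; everything else, including a UNIQUE index on words.s so the upsert is well defined, is an assumption):

import collections
import MySQLdb

# 'words' is assumed to be the tokenised document, e.g. a list of ~6000 strings.
counts = collections.Counter(words)   # tally entirely in memory first

conn = MySQLdb.connect(host='localhost', user='user', passwd='secret', db='app')
cur = conn.cursor()

# One statement per *distinct* word, all inside a single transaction,
# instead of a SELECT plus a save per word occurrence.
for word, n in counts.items():
    cur.execute(
        "INSERT INTO words (s, `count`) VALUES (%s, %s) "
        "ON DUPLICATE KEY UPDATE `count` = `count` + %s",
        (word, n, n))
conn.commit()
conn.close()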
Without knowing much about Ruby and SQLite, some general hints:
create a unique index on Word.s (you did not state whether you have one)
define a default for Word.count in the database (DEFAULT 1)
optimize the assignment of count:
recordWord = Word.find_by_s(word)
if (recordWord.nil?)
  recordWord = Word.new
  recordWord.s = word
  recordWord.count = 1
else
  recordWord.count += 1
end
recordWord.save
Use BEGIN TRANSACTION before your updates then COMMIT at the end.
OK, I found some general rules:
1) use a hash to keep the counts first, not the db
2) at the end, wrap all inserts or updates in one transaction, so that it doesn't hit the db 6000 times.