Optimisation of volatile data querying - mysql

I'm trying to solve a latency problem with queries against a MySQL 5.0 database.
The query itself is extremely simple: SELECT SUM(items) FROM tbl WHERE col = 'val'
There's an index on col and there are not more than 10000 values to sum in the worst case (mean of count(items) for all values of col would be around 10).
The table has up to 2M rows.
The query is run frequently enough that the execution time sometimes goes up to 10s, although 99% of runs take well under 1s.
The query is not really cacheable: in almost every case, a query like this one will be followed by an insert to the same table within the next minute, and showing stale values is out of the question (this is billing information).
The key cache hit rate is good (~100% hits).
The result I'm looking for is every single query under 1s. Are there any ways to improve the SELECT time without changing the table? Alternatively, are there any interesting changes that would help resolve the problem? I thought about simply keeping a table where the current sum is updated for every col right after every insert, but maybe there are better ways to do it?

Another approach is to add a summary table:
create table summary ( col varchar(10) primary key, items int not null );
and add some triggers to tbl to keep it in sync (the trigger names here are arbitrary):
delimiter //
create trigger tbl_ai after insert on tbl for each row
  insert into summary values (new.col, new.items)
    on duplicate key update items = items + new.items
//
create trigger tbl_ad after delete on tbl for each row
  update summary set items = items - old.items where col = old.col
//
create trigger tbl_au after update on tbl for each row
begin
  update summary set items = items - old.items where col = old.col;
  update summary set items = items + new.items where col = new.col;
end
//
delimiter ;
This will slow down your inserts, but allow you to hit a single row in the summary table for
select items from summary where col = 'val';
The biggest problem with this is bootstrapping the values of the summary table. If you can take the application offline, you can easily initialise summary with values from tbl.
insert into summary select col, sum(items) from tbl group by col;
However, if you need to keep the service running, it is a lot more difficult. If you have a replica, you can stop replication, build the summary table, install the triggers, restart replication, then failover the service to using the replica, and then repeat the process on the retired primary.
If you cannot do that, then you could update the summary table one value of col at a time to reduce the impact:
lock tables tbl write, summary write;
delete from summary where col = 'val';
insert into summary select col, sum(items) from tbl where col = 'val' group by col;
unlock tables;
Or if you can tolerate a prolonged outage:
lock tables tbl write, summary write;
delete from summary;
insert into summary select col, sum(items) from tbl group by col;
unlock tables;

A covering index should help:
create index cix on tbl (col, items);
This will enable the sum to be performed without reading from the data file - which should be faster.
You should also track how effective your key-buffer is, and whether you need to allocate more memory for it. This can be done by polling the server status and watching the 'key%' values:
SHOW STATUS LIKE 'Key%';
MySQL Manual - show status
The ratio between Key_read_requests (the number of index lookups) and Key_reads (the number of requests that required index blocks to be read from disk) is important. The higher the number of disk reads, the slower the query will run. You can improve this by increasing key_buffer_size in the config file.
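A sketch of that check (the 256M figure is purely illustrative; size the buffer to your server's RAM, and note the key buffer only applies to MyISAM indexes):
SHOW GLOBAL STATUS LIKE 'Key_read%';
-- miss ratio = Key_reads / Key_read_requests; a ratio much above ~1% suggests the buffer is too small
SHOW VARIABLES LIKE 'key_buffer_size';
SET GLOBAL key_buffer_size = 256 * 1024 * 1024;  -- illustrative value; persist it in my.cnf too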

Fastest way to remove a HUGE set of row keys from a table via primary key? [duplicate]

I have two tables. Let's call them KEY and VALUE.
KEY is small, somewhere around 1,000,000 records.
VALUE is huge, say 1,000,000,000 records.
There is a relationship between them such that each KEY row might have many VALUE rows. It is not enforced as a foreign key, but the meaning is essentially the same.
The DDL looks like this
create table KEY (
key_id int,
primary key (key_id)
);
create table VALUE (
key_id int,
value_id int,
primary key (key_id, value_id)
);
Now, my problem: about half of all key_ids in VALUE have been deleted from KEY, and I need to delete those rows from VALUE in an orderly fashion while both tables are still under high load.
It would be easy to do
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null;
However, since a LIMIT is not allowed on a multi-table DELETE, I don't like this approach. Such a delete would take hours to run, which makes it impossible to throttle the deletes.
Another approach is to create a cursor that finds all missing key_ids and delete them one by one with a limit. That seems very slow and rather backwards.
Are there any other options? Some nice tricks that could help?
Any solution that tries to delete so much data in one transaction is going to overwhelm the rollback segment and cause a lot of performance problems.
A good tool to help is pt-archiver. It performs incremental operations on moderate-sized batches of rows, as efficiently as possible. pt-archiver can copy, move, or delete rows depending on options.
The documentation includes an example of deleting orphaned rows, which is exactly your scenario:
pt-archiver --source h=host,D=db,t=VALUE --purge \
--where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
--limit 1000 --commit-each
Executing this will take significantly longer to delete the data, but it won't consume too many resources and it won't interrupt service on your existing database. I have used it successfully to purge hundreds of millions of rows of outdated data.
pt-archiver is part of the Percona Toolkit for MySQL, a free (GPL) set of scripts that help common tasks with MySQL and compatible databases.
Directly from MySQL documentation
If you are deleting many rows from a large table, you may exceed the
lock table size for an InnoDB table. To avoid this problem, or simply
to minimize the time that the table remains locked, the following
strategy (which does not use DELETE at all) might be helpful:
Select the rows not to be deleted into an empty table that has the same structure as the original table:
INSERT INTO t_copy SELECT * FROM t WHERE ... ;
Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:
RENAME TABLE t TO t_old, t_copy TO t;
Drop the original table:
DROP TABLE t_old;
No other sessions can access the tables involved while RENAME TABLE
executes, so the rename operation is not subject to concurrency
problems. See Section 12.1.9, “RENAME TABLE Syntax”.
So in your case you may do:
INSERT INTO value_copy SELECT * FROM VALUE WHERE key_id IN
(SELECT key_id FROM `KEY`);
RENAME TABLE value TO value_old, value_copy TO value;
DROP TABLE value_old;
And according to the documentation, the RENAME operation is quick and is not affected by the number of records.
What about this for having a limit?
delete x
from `VALUE` x
join (select key_id, value_id
from `VALUE` v
left join `KEY` k using (key_id)
where k.key_id is null
limit 1000) y
on x.key_id = y.key_id AND x.value_id = y.value_id;
First, examine your data. Find the keys which have too many values to be deleted "fast". Then find out which times during the day you have the smallest load on the system. Perform the deletion of the "bad" keys during that time. For the rest, start deleting them one by one with some downtime between deletes so that you don't put too much pressure on the database while you do it.
Maybe, instead of a limit, divide the whole set of rows into small parts by key_id:
delete v
from VALUE v
left join KEY k using (key_id)
where k.key_id is null and v.key_id > 0 and v.key_id < 100000;
then delete rows with key_id in 100000..200000 and so on.
You can try to delete in separate transaction batches.
This is MSSQL syntax, but the same idea should carry over.
declare @i INT
declare @step INT
set @i = 0
set @step = 100000
while (@i < (select max(VALUE.key_id) from VALUE))
BEGIN
    BEGIN TRANSACTION
    delete from VALUE where
        VALUE.key_id between @i and @i+@step and
        not exists(select 1 from KEY where KEY.key_id = VALUE.key_id and KEY.key_id between @i and @i+@step)
    set @i = (@i+@step)
    COMMIT TRANSACTION
END
Create a temporary table!
drop table if exists batch_to_delete;
create temporary table batch_to_delete as
select v.* from `VALUE` v
left join `KEY` k on k.key_id = v.key_id
where k.key_id is null
limit 10000; -- tailor batch size to your taste
-- optional but may help for large batch size
create index batch_to_delete_ix_key on batch_to_delete(key_id);
create index batch_to_delete_ix_value on batch_to_delete(value_id);
-- do the actual delete
delete v from `VALUE` v
join batch_to_delete d on d.key_id = v.key_id and d.value_id = v.value_id;
To me this is the kind of task whose progress I would want to see in a log file, and I would avoid solving it in pure SQL; I would use scripting in Python or a similar language. Another thing that would bother me is that lots of LEFT JOINs with a WHERE ... IS NULL filter between the tables might cause unwanted locks, so I would avoid the JOINs as well.
Here is some pseudo code:
max_key = select_db('SELECT MAX(key_id) FROM VALUE')
while max_key > 0:
    cur_range = range(max_key, max_key - 100, -1)
    good_keys = select_db('SELECT key_id FROM KEY WHERE key_id IN (%s)' % cur_range)
    keys_to_del = set(cur_range) - set(good_keys)
    while 1:
        deleted_count = update_db('DELETE FROM VALUE WHERE key_id IN (%s) LIMIT 1000' % keys_to_del)
        db_commit()
        log_something()
        if not deleted_count:
            break
    max_key -= 100
This should not bother the rest of the system very much, but may take long. Another issue is to optimize the table after you deleted all those rows, but this is another story.
If the target columns are properly indexed, this should go fast:
DELETE FROM `VALUE`
WHERE NOT EXISTS(SELECT 1 FROM `key` k WHERE k.key_id = `VALUE`.key_id)
-- ORDER BY key_id, value_id -- order by PK is good idea, but check the performance first.
LIMIT 1000
Adjust the LIMIT anywhere between 10 and 10000 to get acceptable performance, and rerun it repeatedly until no more rows are affected.
Also bear in mind that mass deletes like this take locks and write undo information for every row, which multiplies the per-row execution time several times over. There are advanced methods to mitigate this, but the easiest workaround is simply to wrap the query in a transaction.
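A sketch of how the "rerun until done" loop could be automated server-side (the procedure name and batch size are illustrative; you could just as well drive the loop from application code):
DELIMITER //
CREATE PROCEDURE purge_orphans()
BEGIN
  DECLARE affected INT DEFAULT 1;
  WHILE affected > 0 DO
    DELETE FROM `VALUE`
    WHERE NOT EXISTS (SELECT 1 FROM `KEY` k WHERE k.key_id = `VALUE`.key_id)
    LIMIT 1000;                  -- batch size: tune to taste
    SET affected = ROW_COUNT();  -- rows deleted by the last batch
    DO SLEEP(1);                 -- optional pause to reduce pressure
  END WHILE;
END //
DELIMITER ;
CALL purge_orphans();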
Do you have a replica (slave) or a dev/test environment with the same data?
The first step is to find out your data distribution, in case you are worried about a particular key having 1 million value_ids:
SELECT v.key_id, COUNT(IFNULL(k.key_id,1)) AS cnt
FROM `value` v LEFT JOIN `key` k USING (key_id)
WHERE k.key_id IS NULL
GROUP BY v.key_id ;
The EXPLAIN plan for the above query is much better than for the same query with ORDER BY COUNT(IFNULL(k.key_id,1)) DESC added.
Since you don't have partitioning on key_id (too many partitions in your case) and want to keep the database running during your delete process, the option is to delete in chunks, with SLEEP() between deletes for different key_ids to avoid overwhelming the server. Don't forget to keep an eye on your binary logs so the disk doesn't fill up.
The quickest way is:
1. Stop the application so data is not changed.
2. Dump key_id and value_id from the VALUE table, keeping only rows with a matching key_id in the KEY table:
mysqldump YOUR_DATABASE_NAME value --where="key_id in (select key_id from YOUR_DATABASE_NAME.key)" --lock-all-tables --opt --quick --quote-names --skip-extended-insert > VALUE_DATA.txt
3. Truncate the VALUE table.
4. Load the data exported in step 2.
5. Start the application.
As always, try this in a Dev/Test environment with Prod data and the same infrastructure so you can calculate the downtime.
Hope this helps.
I am just curious what the effect would be of adding a non-unique index on key_id in table VALUE. Selectivity is not high at all (~0.001) but I am curious how that would affect the join performance.
Why don't you split your VALUE table into several ones according to some rule like key_id modulo some power of 2 (256, for example)?
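If maintaining many physical tables by hand sounds unattractive, native MySQL partitioning expresses a similar idea; a sketch (this rebuilds the table, so test it on a copy first):
ALTER TABLE `VALUE`
  PARTITION BY HASH(key_id)
  PARTITIONS 256;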

Performance of mysql counting rows in a big table

This fairly obvious question has very few (I couldn't find any) solid answers.
I do a simple SELECT from a table of 2 million rows:
select count(id) as total from big_table
On any machine I try this query on, it usually takes at least 5 seconds to complete. That is unacceptable for real-time queries.
The reason I need an exact row count is for precise statistical calculations later on.
Using the last auto-increment value is unfortunately not an option, because rows also get deleted periodically.
It can indeed be slow when running on an InnoDB engine. As stated in section 14.24 of the MySQL 5.7 Reference Manual, “InnoDB Restrictions and Limitations”, 3rd bullet point:
InnoDB does not keep an internal count of rows in a table because concurrent transactions might "see" different numbers of rows at the same time. Consequently, SELECT COUNT(*) statements only count rows visible to the current transaction.
For information about how InnoDB processes SELECT COUNT(*) statements, refer to the COUNT() description in Section 12.20.1, “Aggregate Function Descriptions”.
The suggested solution is a counter table: a separate table with one row and one column holding the current record count. It can be kept up to date via triggers. Something like this:
create table big_table_count (rec_count int default 0);
-- one-shot initialisation:
insert into big_table_count select count(*) from big_table;
create trigger big_insert after insert on big_table
for each row
update big_table_count set rec_count = rec_count + 1;
create trigger big_delete after delete on big_table
for each row
update big_table_count set rec_count = rec_count - 1;
You can see here a fiddle, where you should alter the insert/delete statements in the build section to see the effect on:
select rec_count from big_table_count;
You could extend this to several tables, either by creating such a table for each one, or by reserving a row per table in the counter table above, keyed by a column "table_name".
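A sketch of the keyed variant (the table and column names are illustrative):
create table table_counts (
  table_name varchar(64) primary key,
  rec_count  int not null default 0
);
insert into table_counts values ('big_table', (select count(*) from big_table));
-- the triggers would then update the row where table_name = 'big_table'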
Improving concurrency
The above method does have an impact if you have many concurrent sessions inserting or deleting records, because they need to wait for each other to complete the update of the counter.
A solution is not to have the triggers update the same single record, but to have them insert a new record instead, like this:
create trigger big_insert after insert on big_table
for each row
insert into big_table_count (rec_count) values (1);
create trigger big_delete after delete on big_table
for each row
insert into big_table_count (rec_count) values (-1);
The way to get the count then becomes:
select sum(rec_count) from big_table_count;
Then, once in a while (e.g. daily) you should re-initialise the counter table to keep it small:
truncate table big_table_count;
insert into big_table_count select count(*) from big_table;

Doing a more efficient COUNT

I have a page that loads some high-level statistics. Nothing fancy, just about 5 metrics. There are two particular queries that take about 5s each to load:
+ SELECT COUNT(*) FROM mybooks WHERE book_id IS NOT NULL
+ SELECT COUNT(*) FROM mybooks WHERE is_media = 1
The table has about 500,000 rows. Both columns are indexed.
This information changes all the time, so I don't think that caching here would work. What are some techniques to use that could speed this up? I was thinking:
Create a denormalized stats table that is updated whenever the columns are updated.
Load the slow queries via ajax (this doesn't speed it up, but it allows the page to load immediately).
What would be suggested here? The requirement is that the page loads within 1s.
Table structure:
id (pk, autoincrementing)
book_id (bigint)
is_media (boolean)
The stats table is probably the biggest and quickest bang for the buck. Assuming you have full control of your MySQL server and don't already have job scheduling in place to take care of this, you could remedy this by using the MySQL event scheduler. As Vlad mentioned above, your data will be a bit out of date. Here is a quick example:
Example stats table
CREATE TABLE stats(stat VARCHAR(20) PRIMARY KEY, count BIGINT);
Initialize your values
INSERT INTO stats(stat, count)
VALUES('all_books', 0), ('media_books', 0);
Create your event that updates every 10 minutes
DELIMITER |
CREATE EVENT IF NOT EXISTS updateBookCountsEvent
ON SCHEDULE EVERY 10 MINUTE STARTS NOW()
COMMENT 'Update book counts every 10 minutes'
DO
BEGIN
UPDATE stats
SET count = (SELECT COUNT(*) FROM mybooks)
WHERE stat = 'all_books';
UPDATE stats
SET count = (SELECT COUNT(*) FROM mybooks WHERE is_media = 1)
WHERE stat = 'media_books';
END |
DELIMITER ;
Check to see if it executed
SELECT * FROM mysql.event;
No? Check to see if the event scheduler is enabled
SELECT @@GLOBAL.event_scheduler;
If it is off, you'll want to enable it on startup using the parameter --event-scheduler=ON or by setting it in your my.cnf. See this answer or the docs.
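For example (SET GLOBAL takes effect immediately but does not survive a restart, hence the config entry; this assumes the server was not started with --event-scheduler=DISABLED):
SET GLOBAL event_scheduler = ON;
-- and, to make it permanent, in my.cnf under [mysqld]:
--   event_scheduler = ON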
There are a couple of things you can do to speed up the query.
Run optimize table on your mybooks table
Change your book_id column to int unsigned, which allows for 4.2 billion values and takes 4 bytes instead of 8 (bigint), making the table and index more efficient.
Also, I'm not sure whether this will help, but rather than doing COUNT(*) I would count the column used in the WHERE clause. For example, your first query would be SELECT COUNT(book_id) FROM mybooks WHERE book_id IS NOT NULL.
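A sketch of the column change from the second suggestion (assuming all existing book_id values fit in an unsigned 32-bit range; the ALTER rebuilds the table, so run it in a maintenance window):
ALTER TABLE mybooks MODIFY book_id INT UNSIGNED;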

Can a row with a bigger auto-increment value appear sooner than a row with a smaller one?

Is the following scenario possible?
MySQL version 5.6.15
CREATE TABLE my_table (
id int NOT NULL AUTO_INCREMENT,
data1 int,
data2 timestamp,
PRIMARY KEY (id)
) ENGINE=InnoDB;
innodb_autoinc_lock_mode = 1
AUTO_INCREMENT=101
0 ms: Query A is run: INSERT INTO my_table (data1, data2) VALUES (101, FROM_UNIXTIME(1418501101)), (102, FROM_UNIXTIME(1418501102)), .. [200 values in total] .., (300, FROM_UNIXTIME(1418501300));
500 ms: Query B is run: INSERT INTO my_table (data1, data2) VALUES (301, FROM_UNIXTIME(1418501301));
505 ms: Query B is completed. The row gets id=301.
1000 ms: SELECT id FROM my_table WHERE id >= 300; — would return one row (id=301).
1200 ms: Query A is completed. The rows get id=101 to id=300.
1500 ms: SELECT id FROM my_table WHERE id >= 300; — would return two rows (id=300, id=301).
In other words, is it possible that the row with id=301 can be selected earlier than it's possible to select the row with id=300?
And how to avoid it, if it is possible?
Why is query A taking more than a second to run? Yuck! And yes what you're seeing is exactly how I'd expect it to behave.
Primary keys 101 through 300 are reserved immediately while inserting the new rows. This takes a couple of milliseconds. It then spends more than 1 second rebuilding the indexes and half way through doing all of that you ran another query which inserted a new row, using the next available auto_increment: 301.
As Alin Purcaru said, you can avoid this specific issue by changing the lock mode, but it will cause performance issues (instead of executing in 5 milliseconds, query B will take 700 milliseconds to execute).
Under high-load situations the problem gets exponentially worse, and you'll eventually see "Too many connections" errors, effectively bringing your whole database offline.
Also there are other rare situations where auto_increment can give "out of order" increments, which will not be solved by locking.
Personally I think the best option is to use UUIDs instead of auto_increment for the primary key, and then have a timestamp column (as a double with microseconds) to determine what order rows were inserted into the database.
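A sketch of such a table (column names mirror the question; the answer suggests a DOUBLE with microseconds, but on 5.6.4+ a TIMESTAMP(6) with microsecond precision works as well, which is what is shown here; the UUID is generated at insert time since MySQL has no DEFAULT UUID()):
CREATE TABLE my_table (
  id CHAR(36) NOT NULL PRIMARY KEY,
  data1 INT,
  data2 TIMESTAMP,
  inserted_at TIMESTAMP(6) DEFAULT CURRENT_TIMESTAMP(6),  -- microsecond insertion order
  KEY (inserted_at)
) ENGINE=InnoDB;

INSERT INTO my_table (id, data1, data2) VALUES (UUID(), 101, FROM_UNIXTIME(1418501101));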
So... I read the docs for you.
According to them, what you have in your example is two "simple inserts" (a multi-row insert still counts as simple, because the number of rows is known up front, so the final auto-increment value can be determined in advance). With innodb_autoinc_lock_mode = 1, no table-level AUTO-INC lock is taken for simple inserts, so your second query can finish before the first and that row can become visible earlier.
If you want to avoid this, you need to set innodb_autoinc_lock_mode = 0, but you may take a toll in terms of scalability.
Disclaimer: I haven't tried it and the answer is based on my understanding of the MySQL docs.

How to Generate unique random number?

Can someone tell me a good method for automatically placing a unique random number in a MySQL database table when a new record is created?
I would create a table with a pool of numbers:
Create Table pool (number int PRIMARY KEY AUTO_INCREMENT);
Insert Into pool values (),(),(),(),(),(),(),(),…;
And then define a trigger which picks one random number from that pool:
DELIMITER //
CREATE TRIGGER pickrand BEFORE INSERT ON mytable
FOR EACH ROW BEGIN
  DECLARE nr int;
  SET nr = (SELECT number FROM pool ORDER BY rand() LIMIT 1);
  DELETE FROM pool WHERE number = nr;
  SET NEW.nr = nr;
END //
DELIMITER ;
In order to avoid concurrency issues you have to run queries in transactions. If performance becomes an issue (because of the slow order by rand()) you can change the way to select a random record.
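One possible replacement for the ORDER BY RAND() lookup, as a sketch (it assumes the pool numbers start at 1 and are fairly dense; the choice is only approximately uniform where there are gaps):
-- pick a random threshold once, then take the first remaining pool number at or above it
SET @threshold = (SELECT FLOOR(1 + RAND() * MAX(number)) FROM pool);
SET @nr = (SELECT number FROM pool WHERE number >= @threshold ORDER BY number LIMIT 1);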
Your criteria of unique and random are generally conflicting. You can easily accomplish one or the other, but both is difficult, and would require evaluating every row when testing a new potential number to insert.
The best method that meets your criteria is to generate a UUID with the UUID function.
The better choice would be to re-evaluate your design and remove one of the (unique, random) criteria.
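For example (UUID() is a built-in function; MySQL 5.x has no DEFAULT UUID(), so the value is supplied at insert time; mytable, uid and data are illustrative names):
SELECT UUID();                     -- returns a 36-character string
INSERT INTO mytable (uid, data)    -- uid declared as CHAR(36)
VALUES (UUID(), 'example row');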
Your best option is an AUTO_INCREMENT column (see the MySQL manual for the syntax).
If you just need a random number in a range, try this:
select FLOOR(RAND() * 401) + 100
Edit: to pick a random number that is not already present in the table:
SELECT r.sup_rand
FROM (SELECT FLOOR(RAND() * 99999) AS sup_rand) AS r
WHERE r.sup_rand NOT IN (SELECT sup_rand FROM `table`)
LIMIT 1;
Steps:
1. Generate a random number.
2. Check whether it is already present in the table.
3. If it is not, use the number.
4. Otherwise, repeat.
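A sketch of those steps as a stored procedure (the procedure, table and column names are illustrative; note there is still a small race window between the check and the insert unless sup_rand also has a UNIQUE index and you retry on duplicate-key errors):
DELIMITER //
CREATE PROCEDURE insert_with_random_number()
BEGIN
  DECLARE candidate INT;
  DECLARE clashes INT DEFAULT 1;
  WHILE clashes > 0 DO
    SET candidate = FLOOR(RAND() * 99999);                                 -- step 1: generate
    SELECT COUNT(*) INTO clashes FROM `table` WHERE sup_rand = candidate;  -- step 2: check
  END WHILE;
  INSERT INTO `table` (sup_rand) VALUES (candidate);                       -- step 3: use it
END //
DELIMITER ;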