Optimizing a simple MySQL SELECT on a large table (75M+ rows)

I have a statistics table that grows at a high rate (around 25M rows/day) which I'd like to optimize for SELECTs. The table fits in memory, and the server has plenty of spare RAM (32G; the table is 4G).
My simple roll-up query is:
EXPLAIN SELECT FROM_UNIXTIME(FLOOR(endtime/3600)*3600) AS ts,
       SUM(numevent1) AS success,
       SUM(numevent2) AS failure
FROM stats
WHERE endtime > UNIX_TIMESTAMP() - 3600*96
GROUP BY ts
ORDER BY ts;
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------------------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows     | Extra                                         |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------------------------------------+
|  1 | SIMPLE      | stats | ALL  | ts            | NULL | NULL    | NULL | 78238584 | Using where; Using temporary; Using filesort  |
+----+-------------+-------+------+---------------+------+---------+------+----------+-----------------------------------------------+
Stats is an InnoDB table, and there is a normal index on endtime. How should I optimize this?
Note: I do plan on adding roll-up tables, but currently this is what I'm stuck with, and I'm wondering if it's possible to fix this without additional application code.

I've been doing local tests. Try the following:
alter table stats add index (endtime, numevent1, numevent2);
Also remove the ORDER BY, as it should be implicit in the GROUP BY (I guess the parser just ignores the ORDER BY in this case, but just in case :)
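Putting those two together, a sketch of the revised roll-up query (the same query as above, minus the ORDER BY):
SELECT FROM_UNIXTIME(FLOOR(endtime/3600)*3600) AS ts,
       SUM(numevent1) AS success,
       SUM(numevent2) AS failure
FROM stats
WHERE endtime > UNIX_TIMESTAMP() - 3600*96
GROUP BY ts;
With the (endtime, numevent1, numevent2) index this becomes a covering query: MySQL should be able to satisfy it from the index alone, without touching the row data.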

Since you are using InnoDB you can also try the following:
a) Change innodb_buffer_pool_size to 24GB (requires a server restart) - this will ensure that your whole table can be loaded into memory, and will speed up sorting even as the table grows larger
b) Add innodb_file_per_table, which causes InnoDB to place each new table in its own tablespace file. You'll need to drop the existing table and recreate it for this to take effect (see the my.cnf sketch after this list)
c) Use the smallest column types that can fit the data. Without seeing the actual column definitions and some sample rows, I cannot offer anything more specific. Can you provide a sample schema and maybe 5 rows of data?
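For reference, a minimal my.cnf sketch covering (a) and (b) - the 24G figure is taken from the 32G server mentioned in the question, and on older MySQL versions both settings require a restart to take effect:
[mysqld]
innodb_buffer_pool_size = 24G
innodb_file_per_table = 1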

Related

Optimize query?

My query took 28.39 seconds to run. How can I optimize it?
EXPLAIN SELECT DISTINCT UNIX_TIMESTAMP(timestamp)*1000 AS timestamp,
       COUNT(a.sig_name) AS counter
FROM event a, network n
WHERE n.fsi = 'pays'
AND n.net = INET_NTOA(a.ip_src)
GROUP BY DATE(timestamp)
ORDER BY timestamp ASC;
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
| id | select_type | table | type   | possible_keys | key     | key_len | ref  | rows    | Extra                           |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
|  1 | SIMPLE      | a     | ALL    | NULL          | NULL    | NULL    | NULL | 8177074 | Using temporary; Using filesort |
|  1 | SIMPLE      | n     | eq_ref | PRIMARY,fsi   | PRIMARY | 77      | func | 1       | Using where                     |
+----+-------------+-------+--------+---------------+---------+---------+------+---------+---------------------------------+
Looking at your query generally, we see that table event a is examining 8,177,074 rows. That is likely the "root" of the slowness, so we want to look at how to reduce the search space using indexes.
The main condition on event a is
n.net=inet_ntoa(a.ip_src)
The problem here is that we need to perform a calculation (INET_NTOA) on every row of a.ip_src, so there is no alternative but to scan the entire table. A potentially better approach is to invert the comparison and ensure that a.ip_src is indexed:
a.ip_src=inet_aton(n.net)
This will only be better if we are matching fewer rows in n than in a. If that is not the case, you should seriously consider caching the result of this function in a separate column and creating an index on that.
Lastly, I am guessing the timestamp column is in event a, in which case an index will potentially help with the ordering and grouping, though it may not. You could try a multi-column index on (ip_src, timestamp).
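Putting the inverted comparison into the original query, a hedged sketch (this assumes n.net holds dotted-quad strings that INET_ATON can convert, and that a.ip_src is stored as an unsigned integer):
EXPLAIN SELECT DISTINCT UNIX_TIMESTAMP(timestamp)*1000 AS timestamp,
       COUNT(a.sig_name) AS counter
FROM event a
INNER JOIN network n ON a.ip_src = INET_ATON(n.net)
WHERE n.fsi = 'pays'
GROUP BY DATE(timestamp)
ORDER BY timestamp ASC;
Combined with an index on a.ip_src, MySQL can probe the index instead of computing INET_NTOA for every row.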
Make it a practice to introduce at least an index on columns that are used in WHERE/JOIN clauses. I say "at least" because in many cases one should try to use PRIMARY/FOREIGN KEY relations instead: if a column is already a primary/foreign key, there is no need to index it further.
The above query can be improved simply by introducing an index via the following statement:
ALTER TABLE events ADD INDEX idx_ev_ipsrc (ip_src);
Here idx_ev_ipsrc is the name of the index, and ip_src is the column to be indexed.
Even further enhancement:
Introduce a multi-column index on the network table using the following statement:
ALTER TABLE network ADD INDEX idx_net_fsi_net (fsi,net);
The above will result in an even lower number of examined rows.
Note: The above statements are for MySQL and can easily be adapted to other databases.

Can't optimise MySQL query

I am running a query to retrieve some game levels from a MySQL database. The query itself takes around 0.00025 seconds to execute against a database that contains 40 level strings. I thought that was satisfactory, until I got a message from the website host telling me to optimise the query below, or the script would be removed since it was putting a lot of strain on their servers.
I tried optimising by using EXPLAIN and EXPLAIN EXTENDED and adjusting the columns accordingly (adding indexes), but I always get the same performance. I also noticed that MySQL didn't use the indexes where they were available, but instead did a full-table scan.
Results from EXPLAIN EXTENDED:
+----+-------------+---------+------+----------------+---------+---------+---------------+------+---------------------------------+
| id | select_type | table   | type | possible_keys  | key     | key_len | ref           | rows | Extra                           |
+----+-------------+---------+------+----------------+---------+---------+---------------+------+---------------------------------+
|  1 | SIMPLE      | users   | ALL  | PRIMARY,id     | NULL    | NULL    | NULL          |    7 | Using temporary; Using filesort |
|  1 | SIMPLE      | AllTime | ref  | PRIMARY,userid | PRIMARY | 4       | Test.users.id |    1 |                                 |
+----+-------------+---------+------+----------------+---------+---------+---------------+------+---------------------------------+
query:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname, AllTime.levelstr
FROM AllTime
INNER JOIN users
ON AllTime.userid=users.id
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
The tables:
users
| id(int) | nickname(varchar) |
| (Primary, Auto_increment) | |
|---------------------------|-------------------|
| 1 | username1 |
| 2 | username2 |
| 3 | username3 |
| ... | ... |
and AllTime
| id(int) | userid(int) | levelname(varchar) | levelstr(text) |
| (Primary, Auto_increment) | (index) | | |
|---------------------------|-------------|--------------------|----------------|
| 1 | 2 | levelname1 | levelstr1 |
| 2 | 2 | levelname2 | levelstr2 |
| 3 | 3 | levelname3 | levelstr3 |
| 4 | 1 | levelname4 | levelstr4 |
| 5 | 1 | levelname5 | levelstr5 |
| 6 | 1 | levelname6 | levelstr6 |
| 7 | 2 | levelname7 | levelstr7 |
Is there a way to optimize this query or would I be better off by calling two consecutive queries from php just to avoid the warning?
I am just learning MySQL, so please take that information into account when replying, thank you :)
I'm assuming you're using InnoDB.
For an INNER JOIN, MySQL typically starts with the table with the fewest rows, in this case users. However, since you just want the latest 20 AllTime records joined with the corresponding user records, you actually should start with AllTime since with the LIMIT, it will be the smaller data set.
Use STRAIGHT_JOIN to force the join order:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname,
AllTime.levelstr
FROM AllTime
STRAIGHT_JOIN users
ON users.id = AllTime.userid
ORDER BY AllTime.id DESC
LIMIT ($value_from_php),20;
It should be able to use the primary key on the AllTime table and follow it in descending order. It'll grab all the data on the same pages as it goes.
It should also use the primary key on the users table to grab the id and nickname. If there are more than just two columns, you might add a multi-column covering index on (id, nickname) to improve the speed.
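For example (the index name here is just illustrative):
ALTER TABLE users ADD INDEX idx_users_id_nickname (id, nickname);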
If you can, convert the levelstr column to VARCHAR so that the data is stored on the same page as the rest of the row; otherwise MySQL has to fetch the TEXT columns separately. This assumes that your rows are under the roughly 8000-byte row limit for InnoDB. There is no way to avoid the Using temporary unless you get rid of the TEXT column.
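If you do convert it, something along these lines - the length here is an assumption and must keep the total row size within InnoDB's limit:
ALTER TABLE AllTime MODIFY levelstr VARCHAR(2000);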
Most likely, your host has identified this query by using the slow query log, which can identify all queries that don't use an index, or they may have red flagged it because of the Using temporary.
It doesn't look like the query itself has a problem. Check the following instead:
- Review the application code; most likely the issue is in the code.
- Check the MySQL query execution plan; possibly you are missing an index.
- Make sure you cache data in the application and the database (FYI, sometimes you can load the whole database into application memory).
- Make sure you use a connection pool.
- Create a view (a very small chance of improvement).
- Try removing the ORDER BY clause (again, a very small chance it will improve performance).
"The query itself takes around 0.00025 seconds ... I got a message from the website host telling me to optimise the below-mentioned query, or the script will be removed since it is pushing a lot of strain onto their servers."
Ask the website host for more details about why this query has been flagged for attention. A query that trivial is not going to cause strain on anything unless it is being called very frequently.
Find out how many times that query is being run. I will bet you a nickel that your site is getting hammered by a bot and the query is being executed hundreds or thousands of times per minute. If so, then that's your real problem.
LIMIT ($value_from_php),20; -- if $value_from_php is huge, then the query is slow. This is because all the 'old' rows need to be scanned and discarded before getting to the 20 you need.
By "remembering where you left off" you can make every page equally fast. See this for further details: http://mysql.rjweb.org/doc.php/pagination
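A sketch of that approach, assuming the application remembers the last AllTime.id from the previous page in a hypothetical $last_seen_id variable:
SELECT users.nickname, AllTime.userid, AllTime.id, AllTime.levelname, AllTime.levelstr
FROM AllTime
INNER JOIN users ON AllTime.userid = users.id
WHERE AllTime.id < ($last_seen_id)
ORDER BY AllTime.id DESC
LIMIT 20;
Each page then starts from an indexed position instead of scanning and discarding all the earlier rows.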

MySQL query with 2 joins, large keylen leads to 'Copying to tmp table on disk' process hanging forever

I'm sure I must be doing something stupid, but as is often the case I can't figure out what it is.
I'm trying to run this query:
SELECT `f`.`FrenchWord`, `f`.`Pronunciation`, `e`.`EnglishWord`
FROM (`FrenchWords` f)
INNER JOIN `FrenchEnglishMappings` m ON `m`.`FrenchForeignKey`=`f`.`id`
INNER JOIN `EnglishWords` e ON `e`.`id`=`m`.`EnglishForeignKey`
WHERE `f`.`Pronunciation` = '[whatever]';
When I run it, what happens seems quite weird to me. I get the results of the query fine, 2 rows in about 0.002 seconds.
However, I also get a huge spike in CPU and SHOW PROCESSLIST shows two identical processes for that query with state 'Copying to tmp table on disk'. These seem to keep running endlessly until I kill them or the system freezes.
None of the tables involved is big - between 100k and 600k rows each. tmp_table_size and max_heap_table_size are both 16777216.
Edit: EXPLAIN on the statement gives:
(Edit: I have since reduced the keylen of Pronunciation to 112.)
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
| 1 | SIMPLE | f | ref | PRIMARY,Pronunciation | Pronunciation | 112 | const | 2 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | m | ref | tmpindex,CombinedIndex,FrenchForeignKey,EnglishForeignKey | tmpindex | 4 | dict.f.id | 1 | Using index |
| 1 | SIMPLE | e | eq_ref | PRIMARY,id | PRIMARY | 4 | dict.m.EnglishForeignKey | 1 | |
+----+-------------+-------+--------+-------------------------------------------------------------+-----------------+---------+----------------------------+------+----------------------------------------------+
I'd be grateful if someone could point out what might be causing this. What I really don't understand is what MySQL is doing - surely if the query is complete then it doesn't need to do anything else?
UPDATE
Thanks for all the responses. I learnt something from all of them. This query was made massively faster after following the advice of nrathaus: I added a PronunciationHash BINARY(16) column to FrenchWords that contains UNHEX(MD5(Pronunciation)). It is indexed with a keylen of 16 (vs 600+ for the varchar index on Pronunciation), and queries are much faster now.
As shown by the EXPLAIN, your key size is HUGE: 602. This forces MySQL to write the data out to disk.
You need to reduce the keylen greatly; I believe the recommendation is to keep it below 128.
I suggest you create a column called MD5_FrenchWord which contains the MD5 value of FrenchWord, and then use this column for the GROUP BY. This assumes that you are looking for similarities when you group, rather than the actual value.
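The UPDATE above shows how the asker ultimately applied this idea to Pronunciation rather than FrenchWord; a sketch of that variant (column and index names assumed):
ALTER TABLE FrenchWords ADD COLUMN PronunciationHash BINARY(16);
UPDATE FrenchWords SET PronunciationHash = UNHEX(MD5(Pronunciation));
ALTER TABLE FrenchWords ADD INDEX idx_pronunciation_hash (PronunciationHash);

SELECT f.FrenchWord, f.Pronunciation, e.EnglishWord
FROM FrenchWords f
INNER JOIN FrenchEnglishMappings m ON m.FrenchForeignKey = f.id
INNER JOIN EnglishWords e ON e.id = m.EnglishForeignKey
WHERE f.PronunciationHash = UNHEX(MD5('[whatever]'));
The index entries are a fixed 16 bytes instead of 600+, which keeps the index small and the lookups cheap.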
You are misusing GROUP BY. This clause is entirely pointless unless you also have a summary function such as MAX(something) or COUNT(*) in your SELECT clause.
Try removing GROUP BY and see if it helps.
It's not clear what you're trying to do with GROUP BY. But you might try SELECT DISTINCT if you're trying to dedup your result set.
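For instance, applied to the query above:
SELECT DISTINCT f.FrenchWord, f.Pronunciation, e.EnglishWord
FROM FrenchWords f
INNER JOIN FrenchEnglishMappings m ON m.FrenchForeignKey = f.id
INNER JOIN EnglishWords e ON e.id = m.EnglishForeignKey
WHERE f.Pronunciation = '[whatever]';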
Looking further at this question, it seems like you might benefit from a couple of compound indexes.
First, can you make sure your table declarations have NOT NULL in as many columns as possible?
Second, you're retrieving Pronunciation, FrenchWord, and id from your FrenchWords table, so try a compound index on those columns. Your query will then be able to get everything it needs directly from the index, saving a bunch of disk I/O. Notice that Pronunciation comes first in the compound index declaration because it's the value you're searching on; that lets MySQL do a lookup on the index and read the other columns it needs straight from the index, without going back to the table itself.
(Pronunciation, FrenchWord, id)
You're retrieving EnglishWord from EnglishWords, looking it up by id, so the same reasoning applies to this compound index:
(id, Englishword)
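In SQL, something like this (the index names are illustrative):
ALTER TABLE FrenchWords ADD INDEX idx_pron_word_id (Pronunciation, FrenchWord, id);
ALTER TABLE EnglishWords ADD INDEX idx_id_word (id, EnglishWord);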
Finally, I can't tell what your ORDER BY is for once you use SELECT DISTINCT. You might try getting rid of it, but it probably makes no difference.
Give this a try. If your MySQL server is still thrashing after you make these changes, you have some kind of configuration problem.

Improving MySQL Performance on a Run-Once Query with a Large Dataset

I previously asked a question on how to analyse large datasets (how can I analyse 13GB of data). One promising response was to add the data into a MySQL database using natural keys and thereby make use of INNODB's clustered indexing.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip | int(10) unsigned | NO | PRI | NULL | |
| infohash | varchar(40) | NO | PRI | NULL | |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table represents known instances of peers downloading torrents. I'd like to be able to provide information on how many torrents can be found at each peer. I'm going to draw a histogram of the frequencies with which I see given numbers of torrents (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
Here's the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_length | ref | rows | Extra |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1 | SIMPLE | TorrentPerPeer | index | [Null] | PRIMARY | 126 | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to run an EXPLAIN for the full query because it takes way too long. This bug suggests it's because it runs the subquery first.
This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.
My question, then, is: how can I improve the performance of this query? Should I change the query somehow? I've been altering settings in my.cnf to increase the InnoDB buffer pool size; should I change any other values?
If it matters the table is 79'262'772 rows deep and takes up ~8GB of disk space. I'm not expecting this to be an easy query, maybe 'patience' is the only reasonable answer.
EDIT Just to add that the query has finished and it took 105mins. That's not unbearable, I'm just hoping for some improvements.
My hunch is that with an unsigned int and a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index too big to fit in whatever RAM you have for innodb_buffer_pool_size. This would make InnoDB rely on disk to swap index pages as it searches, and that means a LOT of disk seeks and not a lot of CPU work.
One thing I did for a similar issue is to use something in between a truly natural key and a surrogate key. We took the two fields that are actually unique (one of which was also a varchar), computed a fixed-width MD5 hash of them in the application layer, and used THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index, since you are no longer keying on an arbitrary-length field.
OR, you could just use a server with tons of RAM and see if that makes the index fit in memory but I always like to make 'throw hardware at it' a last resort :)
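A sketch of that middle ground for this table, assuming the MD5 is computed in the application layer and stored as a fixed-width BINARY(16):
CREATE TABLE TorrentsPerPeer (
  peer_key BINARY(16) NOT NULL, -- e.g. UNHEX(MD5(CONCAT(ip, infohash))), computed by the app
  ip INT UNSIGNED NOT NULL,
  infohash VARCHAR(40) NOT NULL,
  PRIMARY KEY (peer_key)
) ENGINE=InnoDB;
Note that this trades the natural key's clustering (and the locality it gives the GROUP BY ip scan) for a much smaller key, so it is worth benchmarking both layouts.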

Ideas to improve this query indexes

I feel like the following query is too slow:
(1679.1ms)
SELECT `media_files`.*
FROM `media_files`
INNER JOIN `playlist_media_files` ON `media_files`.`id` = `playlist_media_files`.`media_file_id`
WHERE `media_files`.`type` IN ('AudioFile')
AND `playlist_media_files`.`playlist_id` = 7
ORDER BY media_files.artist ASC, media_files.release_year ASC, media_files.album ASC,
         media_files.disc_number ASC, media_files.position ASC
EXPLAIN:
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
| 1 | SIMPLE | playlist_media_files | ref | index_playlist_media_files_on_playlist_id,index_playlist_media_files_on_media_file_id | index_playlist_media_files_on_playlist_id | 4 | const | 3782 | Using temporary; Using filesort |
| 1 | SIMPLE | media_files | eq_ref | PRIMARY,index_media_files_on_type | PRIMARY | 4 | mydb.playlist_media_files.media_file_id | 1 | Using where |
+----+-------------+----------------------+--------+---------------------------------------------------------------------------------------+-------------------------------------------+---------+---------------------------------------------------------+------+---------------------------------+
Every column is indexed.
Can any MySQL expert tell how this can be improved, just by looking at the EXPLAIN?
The multiple ORDER BY is killing the performance.
Update: it seems I can do something like CONCAT(..fields..) AS sort and then ORDER BY on that single computed column.
For this query you would probably benefit from a composite index on (playlist_id, media_file_id) in the playlist_media_files table. That lets MySQL use only the index to know which rows to read from media_files, without having to read actual row data from playlist_media_files to find the value of media_file_id for every row that satisfies the playlist_id = 7 condition (and a lot of them do).
You should then see an additional "Using index" in the first row of the EXPLAIN output.
MySQL would still have to create a temporary table to sort by so many columns, but sorting 4k rows in memory is not that expensive.
So basically try:
ALTER TABLE `playlist_media_files`
ADD INDEX `playlist_media_composite` ( `playlist_id` , `media_file_id` ) ;
and see the results.
Edit: I tried to simulate the same problem on my test DB, creating the same tables and filling them with 400k random rows generated with PHP, trying to get similar index cardinality.
Without composite index the same query has following execution plan:
1 SIMPLE playlist_media_files ref playlist_id,media_file_id playlist_id 4 const 3925 Using temporary; Using filesort
1 SIMPLE media_files eq_ref PRIMARY PRIMARY 4 test.playlist_media_files.media_file_id 1 Using where
And the average result is about:
Showing rows 0 - 29 ( 2,702 total, Query took 0.0359 sec)
After adding the composite index and doing ANALYZE TABLE playlist_media_files explain shows:
1 SIMPLE playlist_media_files ref playlist_id,media_file_id,playlist_media_composite playlist_media_composite 4 const 3925 Using index; Using temporary; Using filesort
1 SIMPLE media_files eq_ref PRIMARY PRIMARY 4 test.playlist_media_files.media_file_id 1 Using where
And the average result:
Showing rows 0 - 29 ( 2,702 total, Query took 0.0176 sec)
However, in both cases sorting was done in memory (and creating the tmp table and sorting still takes 80% of the time here), and looking at your profiling screenshot, most of the time is lost copying the temporary table to disk. That's where the difference comes from. My tables have only the columns required for this query, and my random strings probably weren't as long as yours, while you have many more columns and are selecting all of them while sorting on only a few. So your temporary table doesn't fit in memory, and doing things on disk is obviously a lot slower.
So your main focus here should be either on increasing buffer sizes to accommodate your big select, or on limiting the selected columns to those you actually need.
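For example, instead of media_files.*, select only what the page needs (the exact column list is an assumption here):
SELECT media_files.id, media_files.artist, media_files.release_year,
       media_files.album, media_files.disc_number, media_files.position
FROM media_files
INNER JOIN playlist_media_files ON media_files.id = playlist_media_files.media_file_id
WHERE media_files.type IN ('AudioFile')
AND playlist_media_files.playlist_id = 7
ORDER BY media_files.artist ASC, media_files.release_year ASC, media_files.album ASC,
         media_files.disc_number ASC, media_files.position ASC;
A narrower select keeps the temporary table small enough to stay in memory.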
What's the purpose of all the ORDER BY columns? Ordering by that many columns doesn't make much sense in this case.
Why not just order by one thing?
You might also have a database design normalization problem; are you familiar with normalization?