Scaling a High Score Database - mysql

I have a simple high score service for an online game, and it has become more popular than expected. The service is a web service backed by MySQL, with a simple table as shown below; each high score record is stored as a row in that table. The problem is that with >140k rows, I see certain key queries slowing down so much that it will soon be too slow to service requests.
The main table looks like this:
id is a unique key for each score record
game is the ID number of the game which submitted the score (currently, always equal to "1", soon will have to support more games though)
name is the display name for that player's submission
playerId is a unique ID for a given user
score is a numeric score value, e.g. 42,035
time is the submission time
rank is a large integer which uniquely sorts the score submissions for a given game. It is common for people to tie at a certain score, in which case the tie is broken by who submitted first. Therefore this field's value is roughly equal to "score * 100000000 + (MAX_TIME - time)" (see the sketch after the table below)
+----------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| game | int(11) | YES | MUL | NULL | |
| name | varchar(100) | YES | | NULL | |
| playerId | varchar(50) | YES | | NULL | |
| score | int(11) | YES | | NULL | |
| time | datetime | YES | | NULL | |
| rank | decimal(50,0) | YES | MUL | NULL | |
+----------+---------------+------+-----+---------+----------------+
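For illustration only, a newly submitted score's rank might be computed roughly like this (a sketch, not the original code; '2038-01-19' stands in for MAX_TIME, and the table name is taken from the index listing below):
-- Hypothetical sketch of the rank formula described above.
INSERT INTO pozscores (game, name, playerId, score, time, `rank`)
VALUES (
  1, 'some player', 'player-123', 42035, NOW(),
  42035 * 100000000 + (UNIX_TIMESTAMP('2038-01-19') - UNIX_TIMESTAMP(NOW()))
);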
The indexes look like this:
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pozscores | 0 | PRIMARY | 1 | id | A | 138296 | NULL | NULL | | BTREE | |
| pozscores | 0 | game | 1 | game | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 0 | game | 2 | rank | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 1 | rank | 1 | rank | A | 138296 | NULL | NULL | YES | BTREE | |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
When a user requests high scores, they typically request around 75 high scores from an arbitrary point in the "sorted by rank descending list". These requests are typically for "alltime" or just for scores in the past 7 days.
A typical query looks like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 0, 75;" and runs in 0.00 sec.
However, if you request towards the end of the list
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 10000, 75;" and runs in 0.06 sec.
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 100000, 75;" and runs in 0.58 sec.
It seems like this will quickly start taking way too long as several thousand new scores are submitted each day!
Additionally, there are two other types of queries, used to find a particular player by id in the rank ordered list.
They look like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? AND playerId=? ORDER BY rank DESC LIMIT 1"
followed by a
"SELECT count(id) as count FROM scoretable WHERE game=1 AND time>? AND rank>[rank returned from above]"
My question is: What can be done to make this a scalable system? I can see the number of rows growing to be several million very soon. I was hoping that choosing some smart indexes would help, but the improvement has only been marginal.
Update:
Here is an explain line:
mysql> explain SELECT * FROM scoretable WHERE game=1 AND time>0 ORDER BY rank DESC LIMIT 100000, 75;
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | scoretable| range | game | game | 5 | NULL | 138478 | Using where |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
Solution Found!
I have solved the problem, thanks to some of the pointers from this thread. A clustered index was exactly what I needed, so I converted the table to InnoDB, which clusters rows by the primary key. Next, I removed the id field and set the primary key to (game ASC, rank DESC). Now all queries run super fast no matter what offset I use, the EXPLAIN shows that no additional sorting is being done, and it looks like it is easily handling all the traffic.
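For reference, the conversion described above might look roughly like this (a rough, untested sketch rather than the exact statements used; try it on a copy of the table first):
-- Sketch of the InnoDB conversion and primary-key change described above.
ALTER TABLE pozscores ENGINE = InnoDB;               -- InnoDB clusters rows on the primary key

ALTER TABLE pozscores
  MODIFY game INT NOT NULL,                          -- primary key columns must be NOT NULL
  MODIFY `rank` DECIMAL(50,0) NOT NULL,
  MODIFY id INT NOT NULL;                            -- drop auto_increment so id no longer needs a key

ALTER TABLE pozscores
  DROP PRIMARY KEY,
  DROP COLUMN id,
  ADD PRIMARY KEY (game, `rank` DESC);               -- before MySQL 8.0 the DESC here is parsed but ignored;
                                                     -- a backward scan of the key is still sort-free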

Seeing as how there are no takers, I'll give it a shot. I am from an SQL Server background, but the same ideas apply.
Some general observations:
The ID column is pretty much pointless, and should not participate in any indexes unless there are other tables/queries you're not telling us about. In fact, it doesn't even need to be in your last query. You can do COUNT(*).
Your clustered index should target your most common queries. Therefore, a clustered index on game ASC, time DESC, and rank DESC works well. Sorting by time DESC is usually a good idea for historical tables like this where you are usually interested in the most recent stuff. You may also try a separate index with the rank sorted the other direction, though I'm not sure how much of a benefit this will be.
Are you sure you need SELECT *? If you can select fewer columns, you may be able to create an index which contains all columns needed for your SELECT and WHERE.
1 million rows is really not that much. I created a table like yours with 1,000,000 rows of sample data, and even with the one index (game ASC, time DESC, and rank DESC), all queries ran in less than 1 second.
(The only part I'm not sure of is playerId. The queries performed so well that playerId didn't seem to be necessary. Perhaps you can add it at the end of your clustered index.)
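To make the covering-index suggestion concrete in MySQL terms (my sketch, not part of the original answer): if the leaderboard page only needs, say, name and score, a secondary index like this would cover the all-time query:
-- Sketch of a covering index; the index name and column choice are assumptions.
ALTER TABLE pozscores ADD INDEX game_rank_cover (game, `rank`, name, score);

SELECT name, score
FROM pozscores
WHERE game = 1
ORDER BY `rank` DESC
LIMIT 0, 75;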

Related

mysql not picking up the optimal index

Here's my table:
CREATE TABLE `idx_weight` (
`ID` bigint(20) NOT NULL AUTO_INCREMENT,
`SECURITY_ID` bigint(20) NOT NULL,
`CONS_ID` bigint(20) NOT NULL,
`EFF_DATE` date NOT NULL,
`WEIGHT` decimal(9,6) DEFAULT NULL,
PRIMARY KEY (`ID`),
UNIQUE KEY `BPK_AK` (`SECURITY_ID`,`CONS_ID`,`EFF_DATE`),
KEY `idx_weight_ix` (`SECURITY_ID`,`EFF_DATE`)
) ENGINE=InnoDB AUTO_INCREMENT=75334536 DEFAULT CHARSET=utf8
For query 1:
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 1782:
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
| 1 | SIMPLE | idx_weight | ref | BPK_AK,idx_weight_ix | idx_weight_ix | 8 | const | 887856 | Using index |
+----+-------------+------------+------+----------------------+---------------+---------+-------+--------+-------------+
This query runs fine.
Now Query 2 (the only thing changed is the security_id param):
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 26622:
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
| 1 | SIMPLE | idx_weight | ref | BPK_AK,idx_weight_ix | BPK_AK | 8 | const | 10700002 | Using index |
+----+-------------+------------+------+----------------------+--------+---------+-------+----------+-------------+
Notice that it picks up the index BPK_AK, and the actual query runs for over 1 minute.
(Correction: on a second run it took over 10 seconds; I'm guessing the first time the index was not in the buffer pool.)
I can get a workaround by appending group by security_id:
explain select SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight where security_id = 26622 group by security_id:
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
| 1 | SIMPLE | idx_weight | range | BPK_AK,idx_weight_ix | idx_weight_ix | 8 | NULL | 10314 | Using where; Using index for group-by |
+----+-------------+------------+-------+----------------------+---------------+---------+------+-------+---------------------------------------+
But I still don't understand why MySQL would not pick idx_weight_ix for some security_id values, since it is a covering index for this query (and a lot cheaper). Any ideas?
=========================================================================
Update:
#oysteing
Learned a new trick, cool! :)
Here's the optimizer trace:
Query 1: https://gist.github.com/aping/c4388d49d666c43172a856d77001f4ce
Query 2: https://gist.github.com/aping/1af5504b428ca136a8b1c41c40d763e4
And some extra information that might be useful:
From INFORMATION_SCHEMA.STATISTICS:
+------------+---------------+--------------+-------------+-------------+
| NON_UNIQUE | INDEX_NAME | SEQ_IN_INDEX | COLUMN_NAME | CARDINALITY |
+------------+---------------+--------------+-------------+-------------+
| 0 | BPK_AK | 1 | SECURITY_ID | 74134 |
| 0 | BPK_AK | 2 | CONS_ID | 638381 |
| 0 | BPK_AK | 3 | EFF_DATE | 68945218 |
| 1 | idx_weight_ix | 1 | SECURITY_ID | 61393 |
| 1 | idx_weight_ix | 2 | EFF_DATE | 238564 |
+------------+---------------+--------------+-------------+-------------+
The CARDINALITY values for SECURITY_ID are different between the two indexes, but technically they should be exactly the same, am I right?
From this: https://dba.stackexchange.com/questions/49656/find-the-size-of-each-index-in-a-mysql-table
+---------------+-------------------+
| index_name | indexentry_length |
+---------------+-------------------+
| BPK_AK | 1376940279 |
| idx_weight_ix | 797175951 |
+---------------+-------------------+
The index size is about 800MB vs 1.3GB.
Running select count(*) from idx_weight where security_id = 1782 returns 509994
and select count(*) from idx_weight where security_id = 26622 returns 5828054
Then, forcing BPK_AK for query 1:
select SQL_NO_CACHE SECURITY_ID, min(EFF_DATE) as startDate, max(EFF_DATE) as endDate from idx_weight use index (BPK_AK) where security_id = 1782 took 0.2 sec.
So basically, 26622 has 10 times more rows than 1782, but using the same index it took 50 times as long.
PS: buffer pool size is 25GB.
The optimizer traces show that the difference in index selection comes down to the estimates received from InnoDB. For each potential index, the optimizer asks the storage engine for an estimate of how many records are in the range. For the first query it gets the following estimates:
BPK_AK: 1031808
idx_weight_ix: 887856
So the estimated read cost is lowest for idx_weight_ix, and this index is chosen. For the second query the estimates are:
BPK_AK: 11092112
idx_weight_ix: 12003098
And the estimated read cost of BPK_AK is lowest due to the lower number of rows. You could say that MySQL should know that the real number of rows in the range is the same in both cases, but that logic has not been implemented.
I do not know the details of how InnoDB computes these estimates, but it basically does two "index dives" to find the first and last row in the range, and then somehow computes the "distance" between the two. It could be that the estimates are affected by unused space in index pages, and that OPTIMIZE TABLE could fix this, but running OPTIMIZE TABLE will probably take a very long time on such a large table.
The quickest way to solve this is to add a GROUP BY clause, as mentioned by a few other people here. Then MySQL will only need to read 2 rows per group (the first and the last), since the index is ordered by EFF_DATE for each value of SECURITY_ID. Alternatively, you could use FORCE INDEX to force a particular index, as sketched below.
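An untested sketch of that FORCE INDEX variant:
-- Force the covering index instead of relying on the optimizer's estimates.
SELECT SECURITY_ID, MIN(EFF_DATE) AS startDate, MAX(EFF_DATE) AS endDate
FROM idx_weight FORCE INDEX (idx_weight_ix)
WHERE SECURITY_ID = 26622;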
It may also be that MySQL 8.0 will handle this query better. The cost model has changed somewhat, and it puts a higher cost on "cold" indexes that are not cached in the buffer pool.
When you mix normal columns (SECURITY_ID) and aggregate functions (MIN and MAX in your case), you should use GROUP BY. If you do not, MySQL is free to give any result it pleases; with GROUP BY, you will get the correct result. Newer MySQL versions enforce this by default.
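(That default behavior is presumably the ONLY_FULL_GROUP_BY SQL mode, which is enabled by default since MySQL 5.7.5; you can check whether it is active like this:)
-- ONLY_FULL_GROUP_BY will appear in the list if it is enabled.
SELECT @@sql_mode;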
The reason the second index is not selected when you leave out the GROUP BY is most likely that the aggregate functions are not restricted to the same group (= security_id) and therefore cannot be used as a limiter.
I can get a workaround by appending group by security_id
Well, yes. I wouldn't do it any other way, since when you use aggregate functions you NEED to group by something. I didn't even know that MySQL allowed you to work around it.
I think #slaakso is right. Upvote him.

Very simple AVG() aggregation query on MySQL server takes ridiculously long time

I am using MySQL server via Amazon's cloud service, with default settings. The table involved, mytable, is an InnoDB table with about 1 billion rows.
The query is:
select count(*), avg(`01`) from mytable where `date` = "2017-11-01";
It takes almost 10 min to execute. I have an index on date. The EXPLAIN of this query is:
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
| 1 | SIMPLE | mytable | ref | date | date | 3 | const | 1411576 | NULL |
+----+-------------+---------------+------+---------------+------+---------+-------+---------+-------+
The indexes from this table are:
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| mytable | 0 | PRIMARY | 1 | ESI | A | 60398679 | NULL | NULL | | BTREE | | |
| mytable | 0 | PRIMARY | 2 | date | A | 1026777555 | NULL | NULL | | BTREE | | |
| mytable | 1 | lse_cd | 1 | lse_cd | A | 1919210 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | zone | 1 | zone | A | 732366 | NULL | NULL | YES | BTREE | | |
| mytable | 1 | date | 1 | date | A | 85564796 | NULL | NULL | | BTREE | | |
| mytable | 1 | ESI_index | 1 | ESI | A | 6937686 | NULL | NULL | | BTREE | | |
+---------------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
If I remove AVG():
select count(*) from mytable where `date` = "2017-11-01";
It only takes 0.15 sec to return the count. The count for this specific query is 692792; the counts are similar for other dates.
I don't have an index on 01. Is that the issue? Why does AVG() take so long to compute? There must be something I'm not doing properly.
Any suggestion is appreciated!
To count the number of rows with a specific date, MySQL has to locate that value in the index (which is pretty fast, after all that is what indexes are made for) and then read the subsequent entries of the index until it finds the next date. Depending on the datatype of esi, this will sum up to reading some MB of data to count your 700k rows. Reading some MB does not take much time (and that data might even already be cached in the buffer pool, depending on how often you use the index).
To calculate the average for a column that is not included in the index, MySQL will, again, use the index to find all rows for that date (the same as before). But additionally, for every row it finds, it has to read the actual table data for that row, which means using the primary key to locate the row, reading some bytes, and repeating this 700k times. This "random access" is a lot slower than the sequential read in the first case. (It gets worse because "some bytes" really means a whole page of innodb_page_size (16KB by default), so you may have to read up to 700k * 16KB = 11GB, compared to "some MB" for count(*); and depending on your memory configuration, some of this data might not be cached and has to be read from disk.)
A solution to this is to include all used columns in the index (a "covering index"), e.g. create an index on date, 01. Then MySQL does not need to access the table itself, and can proceed, similar to the first method, by just reading the index. The size of the index will increase a bit, so MySQL will need to read "some more MB" (and perform the avg-operation), but it should still be a matter of seconds.
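For example, the covering index suggested above might be created like this (a sketch; the index name date_01 is made up, and building it on a billion-row table will take a while):
-- Covering index on (date, 01); 01 is quoted because it is a numeric column name.
ALTER TABLE mytable ADD INDEX date_01 (`date`, `01`);

-- With that index in place, this query can be answered from the index alone:
SELECT COUNT(*), AVG(`01`) FROM mytable WHERE `date` = '2017-11-01';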
In the comments, you mentioned that you need to calculate the average over 24 columns. If you want to calculate the avg for several columns at the same time, you would need a covering index on all of them, e.g. date, 01, 02, ..., 24, to prevent table access. Be aware that an index containing all columns requires about as much storage space as the table itself (and it will take a long time to create such an index), so whether it is worth those resources depends on how important this query is.
To avoid the MySQL-limit of 16 columns per index, you could split it into two indexes (and two queries). Create e.g. the indexes date, 01, .., 12 and date, 13, .., 24, then use
select * from (select `date`, avg(`01`), ..., avg(`12`)
from mytable where `date` = ...) as part1
cross join (select avg(`13`), ..., avg(`24`)
from mytable where `date` = ...) as part2;
Make sure to document this well, as there is no obvious reason to write the query this way, but it might be worth it.
If you only ever average over a single column, you could add 24 separate indexes (on date, 01, date, 02, ...). In total they will require even more space, but each one is smaller individually, so queries might be a little faster. The buffer pool might still favour the full index, though, depending on factors like usage patterns and memory configuration, so you may have to test it.
Since date is part of your primary key, you could also consider changing the primary key to date, esi. If you find the dates by the primary key, you would not need an additional step to access the table data (as you already access the table), so the behaviour would be similar to the covering index. But this is a significant change to your table and can affect all other queries (that e.g. use esi to locate rows), so it has to be considered carefully.
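A sketch of that primary-key change (again: rebuilding the clustered index of a billion-row table is expensive, and it affects every query that currently locates rows by ESI):
-- Rough sketch only; test on a copy of the table first.
ALTER TABLE mytable
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (`date`, ESI);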
As you mentioned, another option would be to build a summary table where you store precalculated values, especially if you do not add or modify rows for past dates (or can keep them up-to-date with a trigger).
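One possible shape for such a summary table (table name, column names, and types here are all hypothetical):
-- Hypothetical summary table holding one precalculated row per date.
CREATE TABLE mytable_daily_avg (
  `date` date NOT NULL PRIMARY KEY,
  row_count bigint NOT NULL,
  avg_01 decimal(32,16) NULL          -- type is a guess; match it to the source column
);

INSERT INTO mytable_daily_avg (`date`, row_count, avg_01)
SELECT `date`, COUNT(*), AVG(`01`)
FROM mytable
GROUP BY `date`;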
For MyISAM tables, COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause.
For example:
SELECT COUNT(*) FROM student;
https://dev.mysql.com/doc/refman/5.6/en/group-by-functions.html#function_count
If you add AVG() or anything else, you lose this optimization.

Three Queries Faster than One -- What's Wrong with my Joins?

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join performs so suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
ADD INDEX status_idx (Status(16));
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
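For example (the index name here is made up):
-- Sketch of the additional index on the join table suggested above.
ALTER TABLE JoinTable ADD INDEX ticket_inventory_idx (TicketID, InventoryID);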
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
Try giving the first SUBSTRING clause an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you could avoid the join altogether, since you don't need it:
SELECT distinct(SUBSTRING(i.TStampString, 1,16)) from Inventory i where i.ID in
( select InventoryID from JoinTable j where j.TicketID in
(select ID from Ticket t where t.Status = 'Regroup'));
Would that work?
By the way, you do have an index on the Status field, right?

Optimizing / improving a slow mysql query - indexing? reorganizing?

First off, I've looked at several other questions about optimizing SQL queries, but I'm still unclear about what is causing the problem in my situation. I read a few articles on the topic as well and have tried implementing a couple of possible solutions, as I'll describe below, but nothing has yet worked or even made an appreciable dent in the problem.
The application is a nutrition tracking system - users enter the foods they eat and based on an imported USDA database the application breaks down the foods to the individual nutrients and gives the user a breakdown of the nutrient quantities on a (for now) daily basis.
Here's a PDF of the abbreviated database schema, and here it is as a (perhaps poor quality) JPG. I made this in OpenOffice; if there are suggestions for better ways to visualize a database, I'm open to those as well! The blue tables are directly from the USDA, and the green and black tables are ones I've made. I've omitted a lot of data in order to not clutter things up unnecessarily.
Here's the query I'm trying to run that takes a very long time:
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM
(SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (nutr_no <100000
OR nutr_no IN
(SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'))
) as listing
LEFT JOIN
(SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no ) as data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units
So I know that's rather complex - The first select gets a listing of all the nutrients that the user consumed within the given date range, and the second fills in all the quantities.
When I implement them separately, the first query is really fast, but the second is slow and gets very slow when the date ranges get large. The join makes the whole thing ridiculously slow. I know that the 'main' problem is the join between these two derived tables, and I can get rid of that and do the join by hand basically in php much faster, but I'm not convinced that's the whole story.
For example: for 1 month of data, the query takes about 8 seconds, which is slow, but not completely terrible. Separately, each query takes ~.01 and ~2 seconds respectively. 2 seconds still seems high to me.
If I try to retrieve a year's worth of data, it takes several (>10) minutes to run the whole query, which is problematic: the client-server connection sometimes times out, and in any case I don't want to sit there with a spinning 'please wait' icon. Mainly, I feel like there's a problem because it takes more than 12x as long to retrieve 12x more information, when it should take less than 12x as long if I were doing things right.
Here's the 'explain' for each of the slow queries: (the whole thing, and just the second half).
Whole thing:
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5053 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 4341 | |
| 4 | DERIVED | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 4 | DERIVED | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 4 | DERIVED | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 4 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 4 | DERIVED | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
| 2 | DERIVED | meals | range | day_ind | day_ind | 9 | NULL | 30 | Using where |
| 2 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | Using where |
| 3 | DEPENDENT SUBQUERY | nutr_rights | index_subquery | users_userid,nutr_def_nutr_no | nutr_def_nutr_no | 19 | func | 1 | Using index; Using where |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
10 rows in set (2.82 sec)
Second chunk (data):
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 1 | SIMPLE | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 1 | SIMPLE | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 1 | SIMPLE | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
5 rows in set (0.00 sec)
I've 'analyzed' all the tables involved in the query, and added an index on the datetime field that joins meals and food entries. I called it 'day_ind'. I hoped that would accelerate things, but it didn't seem to make a difference. I also tried removing the 'sum' function, as I understand that having a function in the query will frequently mean a full table scan, which is obviously much slower. Unfortunately removing the 'sum' didn't seem to make a difference either (well, about 3-5% or so, but not the order of magnitude that I'm looking for).
I would love any suggestions and will be happy to provide any more information you need to help diagnose and improve this problem. Thanks in advance!
There are a few type ALL entries in your EXPLAIN, which suggest full table scans (and hence temporary tables). You could add or rebuild those indexes if they are not there already.
Sorts and GROUP BYs are usually the performance killers; you can adjust MySQL's memory settings to avoid physical I/O to the tmp table if you have extra memory available.
Lastly, try to make sure the data types of the join attributes match, i.e. that data.date_time and listing.date_time have the same type.
Hope that helps.
Okay, so I eventually figured out what I'm gonna end up doing. I couldn't make the 'data' query any faster - that's still the bottleneck. But now I've made it so the total query process is pretty close to linear, not exponential.
I split the query into two parts and made each one into a temporary table. Then I added an index for each of those temp tables and did the join separately afterwards. This made the total execution time for 1 month of data drop from 8 to 2 seconds, and for 1 year of data from ~10 minutes to ~30 seconds. Good enough for now, I think. I can work with that.
Thanks for the suggestions. Here's what I ended up doing:
create table listing (
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (
nutr_no <100000 OR nutr_no IN (
SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'
)
)
);
create table data (
SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no
);
create index joiner on data(nutr_no, date_time);
create index joiner on listing(nutr_no, date_time);
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM listing
LEFT JOIN data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units;

Big SQL SELECT performance difference when using <= against using < on a DATETIME column

Given the following table:
desc exchange_rates;
+------------------+----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| time | datetime | NO | MUL | NULL | |
| base_currency | varchar(3) | NO | MUL | NULL | |
| counter_currency | varchar(3) | NO | MUL | NULL | |
| rate | decimal(32,16) | NO | | NULL | |
+------------------+----------------+------+-----+---------+----------------+
I have added indexes on time, base_currency and counter_currency, as well as a composite index on (time, base_currency, counter_currency), but I'm seeing a big performance difference when I perform a SELECT using <= against using <.
The first SELECT is:
ExchangeRate Load (95.5ms)
SELECT * FROM `exchange_rates` WHERE (time <= '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
As you can see this is taking 95ms.
If I change the query such that I compare time using < rather than <= I see this:
ExchangeRate Load (0.8ms)
SELECT * FROM `exchange_rates` WHERE (time < '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
Now it takes less than 1 millisecond, which sounds right to me. Is there a rational explanation for this behaviour?
The output from EXPLAIN provides further details, but I'm not 100% sure how to interpret this:
-- Output from the first, slow, select
id: 1
select_type: SIMPLE
table: exchange_rates
type: index_merge
possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency
key: index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency
key_len: 5,5
ref: NULL
rows: 813
Extra: Using intersect(index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency); Using where
-- Output from the second, fast, select
id: 1
select_type: SIMPLE
table: exchange_rates
type: ref
possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency
key: index_exchange_rates_on_counter_currency
key_len: 5
ref: const
rows: 4988
Extra: Using where
(Note: I'm producing these queries through ActiveRecord (in a Rails app) but these are ultimately the queries which are being executed)
In the first case, MySQL tries to combine results from all indexes. It fetches all records from both indexes and joins them on the value of the row pointer (table offset in MyISAM, PRIMARY KEY in InnoDB).
In the second case, it just uses a single index, which, considering LIMIT 1, is the best decision.
You need to create a composite index on (base_currency, counter_currency, time) (in this order) for this query to work as fast as possible.
The engine will use the index for filtering on the leading columns (base_currency, counter_currency) and for ordering on the trailing column (time).
It also seems you want to add something like ORDER BY time DESC to your query to get the last exchange rate.
In general, any LIMIT without an ORDER BY should ring alarm bells.
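Putting those two suggestions together, the index and the reshaped query might look like this (a sketch; the index name is an assumption):
-- Composite index leading with the equality columns, ending with the range/order column.
ALTER TABLE exchange_rates
  ADD INDEX currency_pair_time (base_currency, counter_currency, time);

SELECT *
FROM exchange_rates
WHERE base_currency = 'GBP'
  AND counter_currency = 'USD'
  AND time <= '2009-12-30 14:42:02'
ORDER BY time DESC
LIMIT 1;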