I run the following query on a weekly basis, but it has gotten to the point where it now takes 22 hours to run! The purpose of the report is to aggregate impression and conversion data by ad placement and date; the main table I am querying has no primary key, since there can be multiple events with the same date/placement.
The main data set has about 400K records, so it shouldn't take more than a few minutes to run this report.
The table descriptions are:
tbl_ads (400,000 records)
day_est DATE (index)
conv_day_est DATE (index)
placement_id INT (index)
adunit_id INT (index)
cost_type VARCHAR(20)
cost_value DECIMAL(10,2)
adserving_cost DECIMAL(10,2)
conversion1 INT
estimated_spend DECIMAL(10,2)
clicks INT
impressions INT
publisher_clicks INT
publisher_impressions INT
publisher_spend DECIMAL (10,2)
source VARCHAR(30)
map_external_id (75,000 records)
placement_id INT
adunit_id INT
external_id VARCHAR (50)
primary key(placement_id,adunit_id,external_id)
SQL Query
SELECT A.day_est, A.placement_id, A.placement_name, A.adunit_id, A.adunit_name, A.imp, A.clk,
       C.ads_cost, C.ads_spend, B.conversion1, B.conversion2, B.ID_Matched,
       C.pub_imps, C.pub_clicks, C.pub_spend,
       COALESCE(A.cost_type,B.cost_type) as cost_type,
       COALESCE(A.cost_value,B.cost_value) as cost_value,
       D.external_id
FROM (SELECT day_est, placement_id,adunit_id,placement_name,adunit_name,cost_type,cost_value,
SUM(impressions) as imp, SUM(clicks) as clk
FROM tbl_ads
WHERE source='delivery'
GROUP BY 1,2,3 ) as A LEFT JOIN
(
SELECT conv_day_est, placement_id,adunit_id, cost_type,cost_value, SUM(conversion1) as conversion1,
SUM(conversion2) as conversion2,SUM(id_match) as ID_Matched
FROM tbl_ads
WHERE source='attribution'
GROUP BY 1,2,3
) as B on A.day_est=B.conv_day_est AND A.placement_id=B.placement_id AND A.adunit_id=B.adunit_id
LEFT JOIN
(
SELECT day_est, placement_id, adunit_id,
       SUM(adserving_cost) as ads_cost, SUM(estimated_spend) as ads_spend,
       SUM(publisher_clicks) as pub_clicks, SUM(publisher_impressions) as pub_imps,
       SUM(publisher_spend) as pub_spend
FROM tbl_ads
GROUP BY 1,2,3 ) as C on A.day_est=C.day_est AND A.placement_id=C.placement_id AND A.adunit_id=C.adunit_id
LEFT JOIN
(
SELECT placement_id,adunit_id,external_id
FROM map_external_id
) as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
INTO OUTFILE '/tmp/weekly_report.csv';
Results of EXPLAIN:
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 136518 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 5180 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 198190 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 23766 | |
| 5 | DERIVED | map_external_id | index | NULL | PRIMARY | 55 | NULL | 20797 | Using index |
| 4 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | |
| 3 | DERIVED | tbl_ads | ALL | NULL | NULL | NULL | NULL | 318400 | Using filesort |
| 2 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | Using where |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
More of a speculative answer, but I don't think 22 hours is too unrealistic.
First things first: you don't need the last subquery; just write
LEFT JOIN map_external_id as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
Second, in the first and second subqueries you filter on the field source in your WHERE clause, and no index on that field is listed in your schema. Whether it's an ENUM or a string type, does it have an index? I once had a table with 1,000,000 or so entries where a missing index caused a processing time of 30 seconds for a simple query (can't believe the guy who put that query in the login process).
A side question in between: what's the final result set size?
Thirdly, my assumption is that by running the aggregating subqueries, MySQL creates temporary tables that do not have any indexes, which is bad.
Have you looked at the result sets of the individual subqueries yet? What is the typical set size? From your statements and my guesses about your typical data, I would assume that the aggregation only marginally reduces the set size (apart from the WHERE clause). So let me guess, in order of the subqueries: 200,000, 100,000, 200,000.
Each of the subqueries then joins with the next on three presumably unindexed fields. So, worst case for the first join: 200,000 * 100,000 = 20,000,000,000 comparisons. Going from my experience of 30 seconds for a query on 1,000,000 records, that makes 20,000 * 30 = 600,000 seconds, roughly 166 hours. Obviously that's way too much; maybe there's a digit missing, maybe it was 20 seconds rather than 30, the result sets might be different, and worst case is not average case — but you get the picture.
My approach, then, would be to create additional tables that replace your aggregation subqueries. Judging from your queries you could update them daily; since I guess you just insert rows for impressions etc., you can add the aggregation data incrementally. Then you transform your mega-query into two steps:
updating your aggregation tables
doing the final dump.
The aggregation tables obviously should be indexed meaningfully. I think that should bring the final queries down to a few seconds.
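To make that concrete, here is a minimal sketch of such a pre-aggregated table and its incremental daily load. The summary table name, its key, and the trimmed column list are assumptions for illustration, not a drop-in replacement for subquery A:

-- Hypothetical daily summary of the delivery rows (only the tbl_ads column names are taken from the question).
CREATE TABLE agg_delivery (
    day_est      DATE NOT NULL,
    placement_id INT  NOT NULL,
    adunit_id    INT  NOT NULL,
    imp          BIGINT,
    clk          BIGINT,
    PRIMARY KEY (day_est, placement_id, adunit_id)
);

-- Incremental load: run once per day after the previous day's raw rows have arrived.
INSERT INTO agg_delivery (day_est, placement_id, adunit_id, imp, clk)
SELECT day_est, placement_id, adunit_id, SUM(impressions), SUM(clicks)
FROM tbl_ads
WHERE source = 'delivery'
  AND day_est = CURRENT_DATE - INTERVAL 1 DAY
GROUP BY day_est, placement_id, adunit_id;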
Thanks for all your advice. I ended up splitting out the subqueries and creating temporary tables (with primary keys) for each, then joining the temp tables together at the end; it now takes about 10 minutes to run.
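The thread doesn't include those final statements, but the shape of that fix is roughly the following sketch; the temp-table name is invented and only the delivery subquery is shown:

-- Hypothetical temp-table version of subquery A; the attribution and cost subqueries follow the same pattern.
CREATE TEMPORARY TABLE tmp_delivery (
    PRIMARY KEY (day_est, placement_id, adunit_id)
)
SELECT day_est, placement_id, adunit_id,
       SUM(impressions) AS imp, SUM(clicks) AS clk
FROM tbl_ads
WHERE source = 'delivery'
GROUP BY day_est, placement_id, adunit_id;

-- The final report then joins the small, indexed temp tables instead of the unindexed derived tables.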
I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN rows to 153, 378, and 1 respectively (since the first row now has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join performs so suboptimally, but it may be that one simply can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.
What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
  ADD INDEX status_idx (Status(16));
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
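For example, something like this (the index name is arbitrary):

ALTER TABLE JoinTable
  ADD INDEX ticket_inventory_idx (TicketID, InventoryID);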
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
Try giving the first SUBSTRING clause an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you can avoid the join altogether since you don't need it:
SELECT DISTINCT(SUBSTRING(i.TStampString, 1, 16)) FROM Inventory i WHERE i.ID IN
( SELECT j.InventoryID FROM JoinTable j WHERE j.TicketID IN
    (SELECT t.ID FROM Ticket t WHERE t.Status = 'Regroup'));
Would that work?
By the way, do you have an index on the Status field?
First off, I've looked at several other questions about optimizing SQL queries, but I'm still unclear what is causing the problem in my situation. I read a few articles on the topic as well and have tried implementing a couple of possible solutions, as I'll describe below, but nothing has worked or even made an appreciable dent in the problem.
The application is a nutrition tracking system - users enter the foods they eat and based on an imported USDA database the application breaks down the foods to the individual nutrients and gives the user a breakdown of the nutrient quantities on a (for now) daily basis.
Here's a PDF of the abbreviated database schema, and here it is as a (perhaps poor-quality) JPG. I made this in OpenOffice; if there are better ways to visualize a database, I'm open to suggestions on that front as well! The blue tables are directly from the USDA, and the green and black tables are ones I've made. I've omitted a lot of data in order to not clutter things up unnecessarily.
Here's the query I'm trying to run that takes a very long time:
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM
(SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (nutr_no <100000
OR nutr_no IN
(SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'))
) as listing
LEFT JOIN
(SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no ) as data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units
So I know that's rather complex. The first SELECT gets a listing of all the nutrients that the user consumed within the given date range, and the second fills in all the quantities.
When I run them separately, the first query is really fast, but the second is slow and gets very slow when the date ranges get large. The join makes the whole thing ridiculously slow. I know that the 'main' problem is the join between these two derived tables, and I could get rid of it and basically do the join by hand in PHP much faster, but I'm not convinced that's the whole story.
For example: for 1 month of data, the query takes about 8 seconds, which is slow, but not completely terrible. Separately, each query takes ~.01 and ~2 seconds respectively. 2 seconds still seems high to me.
If I try to retrieve a year's worth of data, it takes several (>10) minutes to run the whole query, which is problematic: the client-server connection sometimes times out, and in any case I don't want to sit there with a spinning 'please wait' icon. Mainly, I feel like there's a problem because it takes more than 12x as long to retrieve 12x more information, when it should take less than 12x as long if I were doing things right.
Here's the 'explain' for each of the slow queries: (the whole thing, and just the second half).
Whole thing:
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5053 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 4341 | |
| 4 | DERIVED | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 4 | DERIVED | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 4 | DERIVED | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 4 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 4 | DERIVED | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
| 2 | DERIVED | meals | range | day_ind | day_ind | 9 | NULL | 30 | Using where |
| 2 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | Using where |
| 3 | DEPENDENT SUBQUERY | nutr_rights | index_subquery | users_userid,nutr_def_nutr_no | nutr_def_nutr_no | 19 | func | 1 | Using index; Using where |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
10 rows in set (2.82 sec)
Second chunk (data):
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 1 | SIMPLE | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 1 | SIMPLE | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 1 | SIMPLE | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
5 rows in set (0.00 sec)
I've 'analyzed' all the tables involved in the query, and added an index on the datetime field that joins meals and food_entries. I called it 'day_ind'. I hoped that would accelerate things, but it didn't seem to make a difference. I also tried removing the 'sum' function, as I understand that having a function in the query will frequently mean a full table scan, which is obviously much slower. Unfortunately removing the 'sum' didn't seem to make a difference either (well, about 3-5% or so, but not the order of magnitude I'm looking for).
I would love any suggestions and will be happy to provide any more information you need to help diagnose and improve this problem. Thanks in advance!
There are a few type ALL entries in your EXPLAIN, which suggest full table scans and hence temporary table creation. You could add indexes where they are missing.
Sorting and GROUP BY are usually the performance killers; you can adjust MySQL memory settings to avoid physical I/O to the temporary table if you have extra memory available.
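For example, a sketch of the kind of settings meant here; the values are arbitrary examples, not recommendations for this workload:

-- Check how large an in-memory temporary table may grow (the effective limit is the smaller of the two).
SHOW VARIABLES LIKE 'tmp_table_size';
SHOW VARIABLES LIKE 'max_heap_table_size';

-- Raise both for the current session before running the report.
SET SESSION tmp_table_size      = 256 * 1024 * 1024;
SET SESSION max_heap_table_size = 256 * 1024 * 1024;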
Lastly, try to make sure the data types of the join attributes match, i.e. that data.date_time and listing.date_time have the same data type.
Hope that helps.
Okay, so I eventually figured out what I'm gonna end up doing. I couldn't make the 'data' query any faster - that's still the bottleneck. But now I've made it so the total query process is pretty close to linear, not exponential.
I split the query into two parts and made each one into a temporary table. Then I added an index for each of those temp tables and did the join separately afterwards. This made the total execution time for 1 month of data drop from 8 to 2 seconds, and for 1 year of data from ~10 minutes to ~30 seconds. Good enough for now, I think. I can work with that.
Thanks for the suggestions. Here's what I ended up doing:
create table listing (
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (
nutr_no <100000 OR nutr_no IN (
SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'
)
)
);
create table data (
SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no
);
create index joiner on data(nutr_no, date_time);
create index joiner on listing(nutr_no, date_time);
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM listing
LEFT JOIN data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units;
I have a simple high score service for an online game, and it has become more popular than expected. The high score is a webservice which uses a MYSQL backend with a simple table as shown below. Each high score record is stored as a row in this table. The problem is, with >140k rows, I see certain key queries slowing down so much that it will soon be too slow to service requests.
The main table looks like this:
id is a unique key for each score record
game is the ID number of the game which submitted the score (currently, always equal to "1", soon will have to support more games though)
name is the display name for that player's submission
playerId is a unique ID for a given user
score is a numeric score representation ex 42,035
time is the submission time
rank is a large integer which uniquely sorts the score submissions for a given game. It is common for people to tie at a certain score, in which case the tie is broken by who submitted first. Therefore this field's value is roughly equal to "score * 100000000 + (MAX_TIME - time)"
+----------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| game | int(11) | YES | MUL | NULL | |
| name | varchar(100) | YES | | NULL | |
| playerId | varchar(50) | YES | | NULL | |
| score | int(11) | YES | | NULL | |
| time | datetime | YES | | NULL | |
| rank | decimal(50,0) | YES | MUL | NULL | |
+----------+---------------+------+-----+---------+----------------+
The indexes look like this:
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| pozscores | 0 | PRIMARY | 1 | id | A | 138296 | NULL | NULL | | BTREE | |
| pozscores | 0 | game | 1 | game | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 0 | game | 2 | rank | A | NULL | NULL | NULL | YES | BTREE | |
| pozscores | 1 | rank | 1 | rank | A | 138296 | NULL | NULL | YES | BTREE | |
+-----------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
When a user requests high scores, they typically request around 75 high scores from an arbitrary point in the "sorted by rank descending list". These requests are typically for "alltime" or just for scores in the past 7 days.
A typical query looks like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 0, 75;" and runs in 0.00 sec.
However, if you request towards the end of the list:
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 10000, 75;" runs in 0.06 sec.
"SELECT * FROM scoretable WHERE game=1 AND time>? ORDER BY rank DESC LIMIT 100000, 75;" runs in 0.58 sec.
It seems like this will quickly start taking way too long as several thousand new scores are submitted each day!
Additionally, there are two other types of queries, used to find a particular player by id in the rank ordered list.
They look like this:
"SELECT * FROM scoretable WHERE game=1 AND time>? AND playerId=? ORDER BY rank DESC LIMIT 1"
followed by a
"SELECT count(id) as count FROM scoretable WHERE game=1 AND time>? AND rank>[rank returned from above]"
My question is: What can be done to make this a scalable system? I can see the number of rows growing to be several million very soon. I was hoping that choosing some smart indexes would help, but the improvement has only been marginal.
Update:
Here is an explain line:
mysql> explain SELECT * FROM scoretable WHERE game=1 AND time>0 ORDER BY rank DESC LIMIT 100000, 75;
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
| 1 | SIMPLE | scoretable| range | game | game | 5 | NULL | 138478 | Using where |
+----+-------------+-----------+-------+---------------+------+---------+------+--------+-------------+
Solution Found!
I have solved the problem thanks to some of the pointers from this thread. A clustered index was exactly what I needed, so I converted the table to InnoDB in MySQL, which supports clustered indexes. Next, I removed the id field and set the primary key to (game ASC, rank DESC). Now all queries run fast no matter what offset I use; EXPLAIN shows that no additional sorting is being done, and it looks like it is easily handling all the traffic.
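DDL along these lines would perform that conversion; this is a sketch inferred from the description above, not the poster's exact statements:

-- Convert to InnoDB so rows are clustered on the primary key.
ALTER TABLE pozscores ENGINE = InnoDB;

-- Drop the surrogate id and cluster on (game, rank). Primary key columns must be NOT NULL,
-- and DESC ordering in an index is only honoured on MySQL 8.0+ (older versions parse but ignore it).
ALTER TABLE pozscores
    MODIFY game int(11) NOT NULL,
    MODIFY `rank` decimal(50,0) NOT NULL,
    DROP COLUMN id,
    ADD PRIMARY KEY (game ASC, `rank` DESC);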
Seeing as how there are no takers, I'll give it a shot. I am from an SQL Server background, but the same ideas apply.
Some general observations:
The ID column is pretty much pointless, and should not participate in any indexes unless there are other tables/queries you're not telling us about. In fact, it doesn't even need to be in your last query. You can do COUNT(*).
Your clustered index should target your most common queries. Therefore, a clustered index on game ASC, time DESC, and rank DESC works well. Sorting by time DESC is usually a good idea for historical tables like this where you are usually interested in the most recent stuff. You may also try a separate index with the rank sorted the other direction, though I'm not sure how much of a benefit this will be.
Are you sure you need SELECT *? If you can select fewer columns, you may be able to create an index which contains all columns needed for your SELECT and WHERE.
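As an illustration of that last point, suppose the application only really needs name, score, and time per row; then a secondary index like the sketch below covers the whole query, so MySQL can answer it from the index without touching the full rows (the column choice, index name, and date literal are assumptions):

-- Leading columns match the WHERE/ORDER BY; trailing columns make the index covering.
ALTER TABLE pozscores
    ADD INDEX game_rank_covering (game, `rank`, time, name, score);

-- Select only the covered columns instead of SELECT *:
SELECT name, score, time
FROM pozscores
WHERE game = 1 AND time > '2010-01-01'
ORDER BY `rank` DESC
LIMIT 10000, 75;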
1 million rows is really not that much. I created a table like yours with 1,000,000 rows of sample data, and even with the one index (game ASC, time DESC, and rank DESC), all queries ran in less than 1 second.
(The only part I'm not sure of is playerId. The queries performed so well that playerId didn't seem to be necessary. Perhaps you can add it at the end of your clustered index.)
This isn't solved, but I found out why: a MySQL view containing a UNION does not optimize well... in other words, SLOW!
Original post:
I'm working with a database for a game. There are two identical tables equipment and safety_dep_box. To check if a player has a piece of equipment I'd like to check both tables.
Instead of doing two queries, I want to take advantage of the UNION functionality in MySQL. I've recently learned that I can create a VIEW. Here's my view:
CREATE VIEW vAllEquip AS SELECT * FROM equipment UNION SELECT * FROM safety_dep_box;
The view created just fine. However when I run
SELECT * FROM vAllEquip WHERE owner=<id>
The query takes forever, while independent select queries are quick. I think I know why, but I don't know how to fix it.
Thanks!
P.S. with Additional Information:
The two tables are identical in structure, but split because they are multi-million-row tables.
The structure includes primary key on int id, multiple index on int owner.
What I don't understand is the speed difference between the following:
SELECT COUNT(*) FROM (SELECT * FROM equipment WHERE owner=1 UNION ALL SELECT * FROM safety_dep_box WHERE owner=1) AS uES;
0.42 sec
SELECT COUNT(*) FROM (SELECT * FROM equipment WHERE owner=1 UNION SELECT * FROM safety_dep_box WHERE owner=1) AS uES;
0.37 sec
SELECT COUNT(*) FROM vAllEquip WHERE owner=1;
aborted after 60 seconds
Version: 5.1.51
mysql> explain SELECT * FROM equipment UNION SELECT * FROM safety_dep_box;
+----+--------------+----------------+------+---------------+------+---------+------+---------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+----------------+------+---------------+------+---------+------+---------+-------+
| 1 | PRIMARY | equipment | ALL | NULL | NULL | NULL | NULL | 1499148 | |
| 2 | UNION | safety_dep_box | ALL | NULL | NULL | NULL | NULL | 867321 | |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+----------------+------+---------------+------+---------+------+---------+-------+
with a WHERE clause
mysql> explain SELECT * FROM equipment WHERE owner=1 UNION ALL SELECT * FROM safety_dep_box WHERE owner=1
-> ;
+----+--------------+----------------+------+-----------------------+-------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+----------------+------+-----------------------+-------+---------+-------+------+-------+
| 1 | PRIMARY | equipment | ref | owner,owner_2,owner_3 | owner | 4 | const | 1 | |
| 2 | UNION | safety_dep_box | ref | owner,owner_3 | owner | 4 | const | 1 | |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+----------------+------+-----------------------+-------+---------+-------+------+-------+
First off, you should probably be using UNION ALL instead of plain UNION. With plain UNION, the engine will try to de-duplicate your result set. That is likely the source of your problem.
Secondly, you'll need indexes on owner in both tables, not just one. And, ideally, they'll be integer columns.
Thirdly, Randolph is right that you should not be using "*" in your SELECT statement. List out all the columns you want included. That is especially important in a UNION because the columns must match up exactly and, if there's a disagreement in the column order in your two tables you may be forcing some type conversion to go on that is costing you some time.
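Putting the first and third points together, a rewritten view might look like the sketch below; item_type stands in for the real column list, which isn't shown in the question:

-- Hypothetical rewrite: UNION ALL (no de-duplication pass) and an explicit column list.
CREATE OR REPLACE VIEW vAllEquip AS
    SELECT id, owner, item_type FROM equipment
    UNION ALL
    SELECT id, owner, item_type FROM safety_dep_box;

Queries against the view stay the same (SELECT ... FROM vAllEquip WHERE owner = 1), though as the asker's update at the top notes, MySQL 5.1 still materializes a view containing UNION before applying the WHERE, so this rewrite alone may not fix the timing.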
Finally, the phrase "There are two identical tables" is almost always a tip-off that your database is not optimally designed. These should probably be a single table. To indicate ownership of an item, your safety_dep_box table should contain only the ownerID and itemID of the item (to relate equipment and players), and possibly an additional autonumbered integer key column.
First off, don't use SELECT * in views ever. It's lazy code. Secondly, without knowing what the base tables look like, we're even less likely to be able to help you.
The reason it takes forever is because it has to build the full result and then filter it. You'll want indexes on your owner fields, whatever they may be.
Given the following table:
desc exchange_rates;
+------------------+----------------+------+-----+---------+----------------+
| Field            | Type           | Null | Key | Default | Extra          |
+------------------+----------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| time | datetime | NO | MUL | NULL | |
| base_currency | varchar(3) | NO | MUL | NULL | |
| counter_currency | varchar(3) | NO | MUL | NULL | |
| rate | decimal(32,16) | NO | | NULL | |
+------------------+----------------+------+-----+---------+----------------+
I have added indexes on time, base_currency and counter_currency, as well as a composite index on (time, base_currency, counter_currency), but I'm seeing a big performance difference when I perform a SELECT using <= against using <.
The first SELECT is:
ExchangeRate Load (95.5ms)
SELECT * FROM `exchange_rates` WHERE (time <= '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
As you can see this is taking 95ms.
If I change the query such that I compare time using < rather than <= I see this:
ExchangeRate Load (0.8ms)
SELECT * FROM `exchange_rates` WHERE (time < '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
Now it takes less than 1 millisecond, which sounds right to me. Is there a rational explanation for this behaviour?
The output from EXPLAIN provides further details, but I'm not 100% sure how to interpret this:
-- Output from the first, slow, select
           id: 1
  select_type: SIMPLE
        table: exchange_rates
         type: index_merge
possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency
          key: index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency
      key_len: 5,5
          ref: NULL
         rows: 813
        Extra: Using intersect(index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency); Using where

-- Output from the second, fast, select
           id: 1
  select_type: SIMPLE
        table: exchange_rates
         type: ref
possible_keys: index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency
          key: index_exchange_rates_on_counter_currency
      key_len: 5
          ref: const
         rows: 4988
        Extra: Using where
(Note: I'm producing these queries through ActiveRecord (in a Rails app) but these are ultimately the queries which are being executed)
In the first case, MySQL tries to combine results from all indexes. It fetches all records from both indexes and joins them on the value of the row pointer (table offset in MyISAM, PRIMARY KEY in InnoDB).
In the second case, it just uses a single index, which, considering LIMIT 1, is the best decision.
You need to create a composite index on (base_currency, counter_currency, time) (in this order) for this query to work as fast as possible.
The engine will use the index for filtering on the leading columns (base_currency, counter_currency) and for ordering on the trailing column (time).
It also seems you want to add something like ORDER BY time DESC to your query to get the last exchange rate.
In general, any LIMIT without an ORDER BY should ring alarm bells.
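Putting those suggestions together, the index and query might look like this sketch (the index name is arbitrary):

-- Composite index: equality columns first, the range/order column last.
ALTER TABLE exchange_rates
    ADD INDEX currency_pair_time (base_currency, counter_currency, time);

-- Fetch the most recent rate at or before the given time.
SELECT *
FROM exchange_rates
WHERE base_currency = 'GBP'
  AND counter_currency = 'USD'
  AND time <= '2009-12-30 14:42:02'
ORDER BY time DESC
LIMIT 1;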