Three Queries Faster than One -- What's Wrong with my Joins? - mysql

I've got a JPA ManyToMany relationship set up, which gives me three important tables: my Ticket table, my Join table, and my Inventory table. They're InnoDB tables on MySQL 5.1. The relevant bits are:
Ticket:
+--------+----------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+----------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| Status | longtext | YES | | NULL | |
+--------+----------+------+-----+---------+----------------+
JoinTable:
+-------------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| InventoryID | int(11) | NO | PRI | NULL | | Foreign Key - Inventory
| TicketID | int(11) | NO | PRI | NULL | | Foreign Key - Ticket
+-------------+---------+------+-----+---------+-------+
Inventory:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| ID | int(11) | NO | PRI | NULL | auto_increment |
| TStampString | varchar(32) | NO | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
TStampStrings are of the form "yyyy.mm.dd HH:MM:SS Z" (for example, '2010.03.19 22:27:57 GMT'). Right now all of the Tickets created directly correspond to some specific hour TStampString, so that SELECT COUNT(*) FROM Ticket; is the same as SELECT COUNT(DISTINCT(SUBSTRING(TStampString, 1, 13))) FROM Inventory;
What I'd like to do is regroup certain Tickets based on the minute granularity of a TStampString: (SUBSTRING(TStampString, 1, 16)). So I'm profiling and testing the SELECT of an INSERT INTO ... SELECT statement:
EXPLAIN SELECT SUBSTRING(i.TStampString, 1, 16) FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY SUBSTRING(i.TStampString, 1, 16);
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|id| type |tbl| type | psbl_keys | key | len | ref | rows | Extra |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
|1 | SMPL | t | ALL | PRI | NULL| NULL| NULL | 35569 | where |
| | | | | | | | | | +temporary|
| | | | | | | | | | +filesort |
|1 | SMPL | j | ref | PRI,FK1,FK2 | FK2 | 4 | t.ID | 378 | index |
|1 | SMPL | i | eq_ref | PRI | PRI | 4 | j.Invent | 1 | |
| | | | | | | | oryID | | |
+--+------+---+--------+-------------+-----+-----+----------+-------+-----------+
What this implies to me is that for each row in Ticket, MySQL first does the joins then later decides that the row is invalid due to the WHERE clause. Certainly the runtime is abominable (I gave up after 30 minutes). Note that it goes no faster with t.Status = 'Regroup' moved to the first JOIN clause and no WHERE clause.
But what's interesting is that if I run this query manually in three steps, doing what I thought the optimizer would do, each step returns almost immediately:
--Step 1: Select relevant Tickets (results dumped to file)
SELECT ID FROM Ticket WHERE Status = 'Regroup';
--Step 2: Get relevant Inventory entries
SELECT InventoryID FROM JoinTable WHERE TicketID IN (step 1s file);
--Step 3: Select what I wanted all along
SELECT SUBSTRING(TStampString, 1, 16) FROM Inventory WHERE ID IN (step 2s file)
GROUP BY SUBSTRING(TStampString, 1, 16);
On my particular tables, the first query gives 154 results, the second creates 206,598 lines, and the third query returns 9198 rows. All of them combined take ~2 minutes to run, with the last query having the only significant runtime.
Dumping the intermediate results to a file is cumbersome, and more importantly I'd like to know how to write my original query such that it runs reasonably. So how do I structure this three-table-join such that it runs as fast as I know is possible?
UPDATE: I've added a prefix index on Status(16), which changes my EXPLAIN profile rows to 153, 378, and 1 respectively (since the first row has a key to use). The JOIN version of my query now takes ~6 minutes, which is tolerable but still considerably slower than the manual version. I'd still like to know why the join is performing sorely suboptimally, but it may be that one can't create independent subqueries in buggy MySQL 5.1. If enough time passes I'll accept Add Index as the solution to my problem, although it's not exactly the answer to my question.
In the end I did end up manually recreating every step of the join on disk. Tens of thousands of files each with a thousand queries was still significantly faster than anything I could get my version of MySQL to do. But since that process would be horribly specific and unhelpful for the layman, I'm accepting ypercube's answer of Add (Partial) Indexes.

What you can do to speed up the query:
Add an index on Status. Even if you don't change the type to VARCHAR, you can still add a partial index:
ALTER TABLE Ticket
ADD INDEX status_idx
(Status(16)) ;
I assume that the Primary key of the Join table is (InventoryID, TicketID). You can add another index on (TicketID, InventoryID) as well. This may not benefit this particular query but it will be helpful in other queries you'll have.
The answer on why this happens is that the optimizer does not always choose the best plan. You can try this variation of your query and see how the EXPLAIN plan differs and if there is any efficiency gain:
SELECT SUBSTRING(i.TStampString, 1, 16)
FROM
( SELECT DISTINCT j.InventoryID -- DISTINCT is optional here
FROM Ticket t
JOIN JoinTable j
ON t.ID = j.TicketID
WHERE t.Status = 'Regroup'
) AS tmp
JOIN Inventory i
ON tmp.InventoryID = i.ID
GROUP BY SUBSTRING(i.TStampString, 1, 16) ;
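Before timing the rewrite on the real tables, it may help to confirm on a toy data set that the derived-table variation returns the same groups as the original three-table join. The sketch below is only an illustration: it uses Python's sqlite3 module as a stand-in for MySQL (so SUBSTRING becomes substr), the sample rows are invented, and the index name ticket_inv_idx is hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL schema (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ticket (ID INTEGER PRIMARY KEY, Status TEXT);
CREATE TABLE Inventory (ID INTEGER PRIMARY KEY, TStampString TEXT NOT NULL);
CREATE TABLE JoinTable (InventoryID INTEGER, TicketID INTEGER,
                        PRIMARY KEY (InventoryID, TicketID));
-- The two indexes suggested in the answer (names are hypothetical).
CREATE INDEX status_idx ON Ticket (Status);
CREATE INDEX ticket_inv_idx ON JoinTable (TicketID, InventoryID);
-- Invented sample rows.
INSERT INTO Ticket VALUES (1, 'Regroup'), (2, 'Done'), (3, 'Regroup');
INSERT INTO Inventory VALUES
  (1, '2010.03.19 22:27:57 GMT'),
  (2, '2010.03.19 22:27:59 GMT'),
  (3, '2010.03.19 22:28:01 GMT'),
  (4, '2010.03.19 23:00:00 GMT');
INSERT INTO JoinTable VALUES (1, 1), (2, 1), (3, 3), (4, 2);
""")

# Original shape: one three-table join.
direct = conn.execute("""
SELECT substr(i.TStampString, 1, 16) AS minute
FROM Ticket t
JOIN JoinTable j ON t.ID = j.TicketID
JOIN Inventory i ON j.InventoryID = i.ID
WHERE t.Status = 'Regroup'
GROUP BY minute
ORDER BY minute
""").fetchall()

# Suggested variation: narrow down to inventory IDs first, then join.
derived = conn.execute("""
SELECT substr(i.TStampString, 1, 16) AS minute
FROM (SELECT DISTINCT j.InventoryID
      FROM Ticket t
      JOIN JoinTable j ON t.ID = j.TicketID
      WHERE t.Status = 'Regroup') AS tmp
JOIN Inventory i ON tmp.InventoryID = i.ID
GROUP BY minute
ORDER BY minute
""").fetchall()

print(direct)
print(derived)
```

Both queries should list the same minute buckets; whether the second form is actually faster on MySQL 5.1 still has to be checked with EXPLAIN on the real data.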

Try giving the first SUBSTRING expression an alias and using it in the GROUP BY:
SELECT SUBSTRING(i.TStampString, 1, 16) as blaa FROM Ticket t JOIN JoinTable j
ON t.ID = j.TicketID JOIN Inventory i ON j.InventoryID = i.ID WHERE t.Status
= 'Regroup' GROUP BY blaa;
Also, you can avoid the join altogether, since you don't need it:
SELECT DISTINCT SUBSTRING(i.TStampString, 1, 16) FROM Inventory i WHERE i.ID IN
( SELECT j.InventoryID FROM JoinTable j WHERE j.TicketID IN
(SELECT t.ID FROM Ticket t WHERE t.Status = 'Regroup'));
Would that work?
By the way, do you have an index on the Status field?

Related

Why is my SQL query so slow?

I run the following query on a weekly basis, but it is getting to the point where it now takes 22 hours to run! The purpose of the report is to aggregate impression and conversion data at the ad placement and date, so the main table I am querying does not have a primary key as there can be multiple events with the same date/placement.
The main data set has about 400K records, so it shouldn't take more than a few minutes to run this report.
The table descriptions are:
tbl_ads (400,000 records)
day_est DATE (index)
conv_day_est DATE (index)
placement_id INT (index)
adunit_id INT (index)
cost_type VARCHAR(20)
cost_value DECIMAL(10,2)
adserving_cost DECIMAL(10,2)
conversion1 INT
estimated_spend DECIMAL(10,2)
clicks INT
impressions INT
publisher_clicks INT
publisher_impressions INT
publisher_spend DECIMAL (10,2)
source VARCHAR(30)
map_external_id (75,000 records)
placement_id INT
adunit_id INT
external_id VARCHAR (50)
primary key(placement_id,adunit_id,external_id)
SQL Query
SELECT A.day_est,A.placement_id,A.placement_name,A.adunit_id,A.adunit_name,A.imp,A.clk, C.ads_cost, C.ads_spend, B.conversion1, B.conversion2,B.ID_Matched, C.pub_imps, C.pub_clicks, C.pub_spend, COALESCE(A.cost_type,B.cost_type) as cost_type, COALESCE(A.cost_value,B.cost_value) as cost_value, D.external_id
FROM (SELECT day_est, placement_id,adunit_id,placement_name,adunit_name,cost_type,cost_value,
SUM(impressions) as imp, SUM(clicks) as clk
FROM tbl_ads
WHERE source='delivery'
GROUP BY 1,2,3 ) as A LEFT JOIN
(
SELECT conv_day_est, placement_id,adunit_id, cost_type,cost_value, SUM(conversion1) as conversion1,
SUM(conversion2) as conversion2,SUM(id_match) as ID_Matched
FROM tbl_ads
WHERE source='attribution'
GROUP BY 1,2,3
) as B on A.day_est=B.conv_day_est AND A.placement_id=B.placement_id AND A.adunit_id=B.adunit_id
LEFT JOIN
(
SELECT day_est,placement_id,adunit_id,SUM(adserving_cost) as ads_cost, SUM(estimated_spend) as ads_spend,sum(publisher_clicks) as pub_clicks,sum(publisher_impressions) as pub_imps,sum(publisher_spend) as pub_spend
FROM tbl_ads
GROUP BY 1,2,3 ) as C on A.day_est=C.day_est AND A.placement_id=C.placement_id AND A.adunit_id=C.adunit_id
LEFT JOIN
(
SELECT placement_id,adunit_id,external_id
FROM map_external_id
) as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
INTO OUTFILE '/tmp/weekly_report.csv';
Results of EXPLAIN:
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 136518 | |
| 1 | PRIMARY | <derived3> | ALL | NULL | NULL | NULL | NULL | 5180 | |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 198190 | |
| 1 | PRIMARY | <derived5> | ALL | NULL | NULL | NULL | NULL | 23766 | |
| 5 | DERIVED | map_external_id | index | NULL | PRIMARY | 55 | NULL | 20797 | Using index |
| 4 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | |
| 3 | DERIVED | tbl_ads | ALL | NULL | NULL | NULL | NULL | 318400 | Using filesort |
| 2 | DERIVED | tbl_ads | index | NULL | PIndex | 13 | NULL | 318400 | Using where |
+----+-------------+--------------------+-------+---------------+---------+---------+------+--------+----------------+
More of a speculative answer, but I don't think 22 hours is too unrealistic..
First things first... you don't need the last subquery, just state
LEFT JOIN map_external_id as D on A.placement_id=D.placement_id AND A.adunit_id=D.adunit_id
Second, in the first and second subqueries you filter on the field source in your WHERE clause, and this field is not listed in your table schema. Presumably it is an enum or string type; does it have an index? I've had a table with 1'000'000 or so entries where a missing index caused a processing time of 30 seconds for a simple query (can't believe the guy who put that query in the login process).
An unrelated question in between: what's the final result set size?
Thirdly, my assumption is that by running the aggregating subqueries MySQL actually creates temporary tables that do not have any indexes, which is bad.
Have you had a look at the result sets of the individual subqueries yet? What is the typical set size? From your statements and my guesses about your typical data, I would assume that the aggregation only marginally reduces the set size (apart from the WHERE clause). So let me guess, in order of the subqueries: 200'000, 100'000, 200'000.
Each of the subqueries then joins with the next on three presumably unindexed fields. So the worst case for the first join is 200'000 * 100'000 = 20'000'000'000 comparisons. Going from my experience of 30 sec for a query on 1'000'000 records, that makes 20'000 * 30 = 600'000 sec, or roughly 166 hours. Obviously that's way too much: maybe there's a digit missing, maybe it was 20 sec not 30, the result sets might be different, and the worst case is not the average case, but you get the picture.
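The back-of-the-envelope estimate above can be reproduced in a few lines. This is just the arithmetic from the paragraph, using the guessed set sizes and the 30-seconds-per-million-rows anecdote as inputs; none of the numbers are measurements of the asker's system.

```python
# Guessed result-set sizes of the first two derived tables (assumptions).
rows_a = 200_000
rows_b = 100_000

# Worst case for a join with no usable index: every row against every row.
comparisons = rows_a * rows_b

# Anecdotal cost: about 30 seconds per 1,000,000 rows examined.
seconds_per_million = 30
est_seconds = comparisons / 1_000_000 * seconds_per_million
est_hours = est_seconds / 3600

print(f"{comparisons:,} comparisons")
print(f"{est_seconds:,.0f} seconds, about {est_hours:.0f} hours")
```

Changing either guessed size by a factor of ten moves the estimate by the same factor, which is why the figure is only good for judging the order of magnitude.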
My solution approach then would be to try to create additional tables which replace your aggregation subqueries. Judging from your queries you could update it daily, as I guess you just insert rows on impressions etc, so you can just add the aggregation data incrementally. Then you transform your mega-query into the two steps of
updating your aggregation tables
doing the final dump.
The aggregation tables obviously should be indexed meaningfully. I think that should bring the final queries down to a few seconds.
Thanks for all your advice. I ended up splitting the sub queries and creating temporary tables (with PKs) for each, then joined the temp tables together at the end and it now takes about 10 mins to run.
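A minimal sketch of that approach, assuming a cut-down tbl_ads with invented rows and using Python's sqlite3 module in place of MySQL: each aggregate is materialized into its own temporary table with a primary key on the join columns, and the final report joins the keyed tables. The names agg_delivery and agg_attribution are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced version of tbl_ads with invented sample rows.
CREATE TABLE tbl_ads (
  day_est TEXT, conv_day_est TEXT, placement_id INTEGER, adunit_id INTEGER,
  impressions INTEGER, clicks INTEGER, conversion1 INTEGER, source TEXT);
INSERT INTO tbl_ads VALUES
  ('2013-01-01', NULL, 1, 10, 100, 5, 0, 'delivery'),
  ('2013-01-01', NULL, 1, 10,  50, 2, 0, 'delivery'),
  (NULL, '2013-01-01', 1, 10,   0, 0, 3, 'attribution'),
  ('2013-01-02', NULL, 2, 20,  70, 1, 0, 'delivery');

-- One keyed temporary table per aggregate (table names are hypothetical).
CREATE TEMP TABLE agg_delivery (
  day_est TEXT, placement_id INTEGER, adunit_id INTEGER,
  imp INTEGER, clk INTEGER,
  PRIMARY KEY (day_est, placement_id, adunit_id));
INSERT INTO agg_delivery
  SELECT day_est, placement_id, adunit_id, SUM(impressions), SUM(clicks)
  FROM tbl_ads
  WHERE source = 'delivery'
  GROUP BY day_est, placement_id, adunit_id;

CREATE TEMP TABLE agg_attribution (
  conv_day_est TEXT, placement_id INTEGER, adunit_id INTEGER,
  conversion1 INTEGER,
  PRIMARY KEY (conv_day_est, placement_id, adunit_id));
INSERT INTO agg_attribution
  SELECT conv_day_est, placement_id, adunit_id, SUM(conversion1)
  FROM tbl_ads
  WHERE source = 'attribution'
  GROUP BY conv_day_est, placement_id, adunit_id;
""")

# The final join now runs against keyed tables instead of unindexed
# derived tables.
report = conn.execute("""
SELECT a.day_est, a.placement_id, a.adunit_id, a.imp, a.clk, b.conversion1
FROM agg_delivery a
LEFT JOIN agg_attribution b
  ON a.day_est = b.conv_day_est
 AND a.placement_id = b.placement_id
 AND a.adunit_id = b.adunit_id
ORDER BY a.day_est, a.placement_id, a.adunit_id
""").fetchall()

for row in report:
    print(row)
```

The same pattern extends to the third aggregate and to map_external_id; the point is only that every join input has an index on the columns being matched.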

SQL algorithm to as near to linear time as possible and tweaking of select statement

I am using MySQL version 5.5 on Ubuntu.
My database tables are setup as follows:
DDLs:
CREATE TABLE `asx` (
`code` char(3) NOT NULL,
`high` decimal(9,3),
`low` decimal(9,3),
`close` decimal(9,3),
`histID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`histID`),
UNIQUE KEY `code` (`code`)
);
CREATE TABLE `asxhist` (
`date` date NOT NULL,
`average` decimal(9,3),
`histID` int(11) NOT NULL,
PRIMARY KEY (`date`,`histID`),
KEY `histID` (`histID`),
CONSTRAINT `asxhist_ibfk_1` FOREIGN KEY (`histID`) REFERENCES `asx` (`histID`)
ON UPDATE CASCADE
);
t1:
| code | high | low | close | histID (primary key)|
| asx | 10.000 | 9.500 | 9.800 | 1
| nab | 42.000 | 41.250 | 41.350 | 2
t2:
| date | average | histID (foreign key) |
| 2013-01-01| 10.000 | 1 |
| 2013-01-01| 39.000 | 2 |
| 2013-01-02| 9.000 | 1 |
| 2013-01-02| 38.000 | 2 |
| 2013-01-03| 9.500 | 1 |
| 2013-01-03| 39.500 | 2 |
| 2013-01-04| 11.000 | 1 |
| 2013-01-04| 38.500 | 2 |
I am attempting to complete a select query that produces this as a result:
| code | high | low | close | asxhist.average |
| asx | 10.000 | 9.500 | 9.800 | 11.000,9.500 |
| nab | 42.000 | 41.250 | 41.350 | 38.500,39.500 |
Where the most recent information in table 2 is returned with table 1 in a csv format.
I have managed to get this far:
SELECT code, high, low, close,
(SELECT GROUP_CONCAT(DISTINCT t2.average ORDER BY date DESC SEPARATOR ',') FROM t2
WHERE t2.histID = t1.histID)
FROM t1;
Unfortunately this returns all values associated with histID. I'm taking a look at xaprb.com's first/least/max-row-per-group-in-SQL solution, but I have been banging my head all day and the slight wooziness seems to be dimming my ability to comprehend how I should use it to my benefit. How can I limit the results to the 5 most recent values and, considering the tables will eventually be megabytes in size, try to remain in O(n^2) or less? (Or can I?)
Temporary workaround using SUBSTRING_INDEX (not a feasible solution for huge data, since every average is still concatenated before the string is cut):
SELECT code, high, low, close,
(SELECT SUBSTRING_INDEX(GROUP_CONCAT(asxhist.average ORDER BY asxhist.date DESC), ',', 3)
FROM asxhist
WHERE asxhist.histID = asx.histID)
FROM asx;
From what I gather, a LIMIT option in GROUP_CONCAT is still a feature request.
Also on stackoverflow hack MySQL GROUP_CONCAT
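One way to sidestep the missing LIMIT inside GROUP_CONCAT is to do the truncation in application code: read the joined rows once, ordered by code and date descending, and keep only the first n averages of each group. The sketch below is not the xaprb technique; it is a plain single-pass alternative, shown with Python's sqlite3 module standing in for MySQL, the sample rows from the question, and a hypothetical helper named latest_averages.

```python
import sqlite3
from itertools import groupby, islice

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE asx (code TEXT, high REAL, low REAL, close REAL,
                  histID INTEGER PRIMARY KEY);
CREATE TABLE asxhist (date TEXT, average REAL, histID INTEGER,
                      PRIMARY KEY (histID, date));
INSERT INTO asx VALUES ('asx', 10.0, 9.5, 9.8, 1), ('nab', 42.0, 41.25, 41.35, 2);
INSERT INTO asxhist VALUES
  ('2013-01-01', 10.0, 1), ('2013-01-01', 39.0, 2),
  ('2013-01-02',  9.0, 1), ('2013-01-02', 38.0, 2),
  ('2013-01-03',  9.5, 1), ('2013-01-03', 39.5, 2),
  ('2013-01-04', 11.0, 1), ('2013-01-04', 38.5, 2);
""")


def latest_averages(conn, n):
    """Return {code: [n most recent averages, newest first]}."""
    rows = conn.execute("""
        SELECT a.code, h.average
        FROM asx a
        JOIN asxhist h ON h.histID = a.histID
        ORDER BY a.code, h.date DESC
    """)
    # Rows arrive grouped by code and newest first, so taking the first n
    # of each group yields the n most recent values in one pass.
    return {
        code: [avg for _, avg in islice(group, n)]
        for code, group in groupby(rows, key=lambda r: r[0])
    }


result = latest_averages(conn, 2)
print(result)
```

The scan itself is linear in the number of history rows once the database has them sorted; it still transfers every row to the client, so for very large tables a per-group limit inside the database would be preferable.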

Optimizing / improving a slow mysql query - indexing? reorganizing?

First off, I've looked at several other questions about optimizing sql queries, but I'm still unclear for my situation what is causing my problem. I read a few articles on the topic as well and have tried implementing a couple possible solutions, as I'll describe below, but nothing has yet worked or even made an appreciable dent in the problem.
The application is a nutrition tracking system - users enter the foods they eat and based on an imported USDA database the application breaks down the foods to the individual nutrients and gives the user a breakdown of the nutrient quantities on a (for now) daily basis.
Here's a PDF of the abbreviated database schema, and here it is as a (perhaps poor quality) JPG. I made this in open office - if there are suggestions for better ways to visualize a database, I'm open to suggestions on that front as well! The blue tables are directly from the USDA, and the green and black tables are ones I've made. I've omitted a lot of data in order to not clutter things up unnecessarily.
Here's the query I'm trying to run that takes a very long time:
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM
(SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (nutr_no <100000
OR nutr_no IN
(SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'))
) as listing
LEFT JOIN
(SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no ) as data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units
So I know that's rather complex - The first select gets a listing of all the nutrients that the user consumed within the given date range, and the second fills in all the quantities.
When I implement them separately, the first query is really fast, but the second is slow and gets very slow when the date ranges get large. The join makes the whole thing ridiculously slow. I know that the 'main' problem is the join between these two derived tables, and I can get rid of that and do the join by hand basically in php much faster, but I'm not convinced that's the whole story.
For example: for 1 month of data, the query takes about 8 seconds, which is slow, but not completely terrible. Separately, each query takes ~.01 and ~2 seconds respectively. 2 seconds still seems high to me.
If I try to retrieve a year's worth of data, it takes several (>10) minutes to run the whole query, which is problematic - the client-server connection sometimes times out, and in any case I don't want to sit there with a spinning 'please wait' icon. Mainly, I feel like there's a problem because it takes more than 12x as long to retrieve 12x more information, when it should take no more than about 12x as long if I were doing things right.
Here's the 'explain' for each of the slow queries: (the whole thing, and just the second half).
Whole thing:
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 5053 | Using temporary; Using filesort |
| 1 | PRIMARY | <derived4> | ALL | NULL | NULL | NULL | NULL | 4341 | |
| 4 | DERIVED | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 4 | DERIVED | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 4 | DERIVED | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 4 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 4 | DERIVED | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
| 2 | DERIVED | meals | range | day_ind | day_ind | 9 | NULL | 30 | Using where |
| 2 | DERIVED | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | Using where |
| 3 | DEPENDENT SUBQUERY | nutr_rights | index_subquery | users_userid,nutr_def_nutr_no | nutr_def_nutr_no | 19 | func | 1 | Using index; Using where |
+----+--------------------+--------------------+----------------+-------------------------------+------------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
10 rows in set (2.82 sec)
Second chunk (data):
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | meals | range | PRIMARY,day_ind | day_ind | 9 | NULL | 30 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | food_entries | ref | meals_meal_id | meals_meal_id | 5 | nutrition.meals.meal_id | 15 | Using where |
| 1 | SIMPLE | recipe_ingredients | ref | foods_food_id,ingred_ndb_no | foods_food_id | 4 | nutrition.food_entries.entry_ident | 2 | |
| 1 | SIMPLE | nutr_def | ALL | PRIMARY | NULL | NULL | NULL | 174 | |
| 1 | SIMPLE | nut_data | ref | PRIMARY | PRIMARY | 36 | nutrition.nutr_def.nutr_no,nutrition.recipe_ingredients.ingred_ndb_no | 1 | |
+----+-------------+--------------------+-------+-----------------------------+---------------+---------+-----------------------------------------------------------------------+------+----------------------------------------------+
5 rows in set (0.00 sec)
I've 'analyzed' all the tables involved in the query, and added an index on the datetime field that is joining meals and food entries. I called it 'day_ind'. I hoped that would accelerate things, but it didn't seem to make a difference. I also tried removing the 'sum' function, as I understand that having a function in the query will frequently mean a full table scan, which is obviously much slower. Unfortunately removing the 'sum' didn't seem to make a difference either (well, about 3-5% or so, but not the order of magnitude that I'm looking for).
I would love any suggestions and will be happy to provide any more information you need to help diagnose and improve this problem. Thanks in advance!
There are a few rows with type ALL in your EXPLAIN output, which suggests full table scans and hence temporary tables. You could add indexes on those columns if they are not there already.
Sorting and GROUP BY are usually the performance killers; if you have extra memory available, you can adjust the MySQL memory settings to avoid physical I/O for temporary tables.
Lastly, try to make sure the data types of the join columns match, i.e. that data.date_time and listing.date_time have the same format.
Hope that helps.
Okay, so I eventually figured out what I'm gonna end up doing. I couldn't make the 'data' query any faster - that's still the bottleneck. But now I've made it so the total query process is pretty close to linear, not exponential.
I split the query into two parts and made each one into a temporary table. Then I added an index for each of those temp tables and did the join separately afterwards. This made the total execution time for 1 month of data drop from 8 to 2 seconds, and for 1 year of data from ~10 minutes to ~30 seconds. Good enough for now, I think. I can work with that.
Thanks for the suggestions. Here's what I ended up doing:
create table listing (
SELECT nutrdesc, nutr_no, date_time, units
FROM meals, nutr_def
WHERE meals.users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
AND (
nutr_no <100000 OR nutr_no IN (
SELECT nutr_def_nutr_no
FROM nutr_rights
WHERE nutr_rights.users_userid = '2'
)
)
);
create table data (
SELECT nutrdesc, date_time, nut_data.nutr_no, sum(ingred_gram_mass*entry_qty_num*nutr_val/100) AS total_nutr_mass
FROM nut_data, recipe_ingredients, food_entries, meals, nutr_def
WHERE nut_data.nutr_no = nutr_def.nutr_no
AND ndb_no = ingred_ndb_no
AND foods_food_id = entry_ident
AND meals_meal_id = meal_id
AND users_userid = '2'
AND date_time BETWEEN '2009-8-12' AND '2009-9-12'
GROUP BY date_time,nut_data.nutr_no
);
create index joiner on data(nutr_no, date_time);
create index joiner on listing(nutr_no, date_time);
SELECT listing.date_time,listing.nutrdesc,data.total_nutr_mass,listing.units
FROM listing
LEFT JOIN data
ON data.date_time = listing.date_time
AND listing.nutr_no = data.nutr_no
ORDER BY listing.date_time,listing.nutrdesc,listing.units;

Fast complex query to select bookings

I'm trying to write a query to get a courses information and the number of bookings and attendees. Each course can have many bookings and each booking can have many attendees.
We already have a working report, but it uses multiple queries to get the required information. One to get the courses, one to get the bookings, and one to get the number of attendees. This is very slow because of the size that the database has grown to.
There are a number of extra conditions for the reports:
Bookings must be made more than 5 minutes ago, or have been confirmed
The booking must not be canceled
The course must not be marked as deleted
The courses venue and location must be LIKE a search string
Courses with no bookings must appear in the results
This is the table structure: (I've omitted the unneeded information. All fields are not null and have no default)
mysql> DESCRIBE first_aid_courses;
+------------------+--------------+-----+----------------+
| Field | Type | Key | Extra |
+------------------+--------------+-----+----------------+
| id | int(11) | PRI | auto_increment |
| course_date | date | | |
| region_id | int(11) | | |
| location | varchar(255) | | |
| venue | varchar(255) | | |
| number_of_spaces | int(11) | | |
| deleted | tinyint(1) | | |
+------------------+--------------+-----+----------------+
mysql> DESCRIBE first_aid_bookings;
+-----------------------+--------------+-----+----------------+
| Field | Type | Key | Extra |
+-----------------------+--------------+-----+----------------+
| id | int(11) | PRI | auto_increment |
| first_aid_course_id | int(11) | | |
| placed | datetime | | |
| confirmed | tinyint(1) | | |
| cancelled | tinyint(1) | | |
+-----------------------+--------------+-----+----------------+
mysql> DESCRIBE first_aid_attendees;
+----------------------+--------------+-----+----------------+
| Field | Type | Key | Extra |
+----------------------+--------------+-----+----------------+
| id | int(11) | PRI | auto_increment |
| first_aid_booking_id | int(11) | | |
+----------------------+--------------+-----+----------------+
mysql> DESCRIBE regions;
+----------+--------------+-----+----------------+
| Field | Type | Key | Extra |
+----------+--------------+-----+----------------+
| id | int(11) | PRI | auto_increment |
| name | varchar(255) | | |
+----------+--------------+-----+----------------+
I need to select the following:
Course ID: first_aid_courses.id
Date: first_aid_courses.course_date
Region: regions.name
Location: first_aid_courses.location
Bookings: COUNT(first_aid_bookings)
Attendees: COUNT(first_aid_attendees)
Spaces Remaining: COUNT(first_aid_bookings) - COUNT(first_aid_attendees)
This is what I have so far:
SELECT `first_aid_courses`.*,
COUNT(`first_aid_bookings`.`id`) AS `bookings`,
COUNT(`first_aid_attendees`.`id`) AS `attendees`
FROM `first_aid_courses`
LEFT JOIN `first_aid_bookings`
ON `first_aid_courses`.`id` =
`first_aid_bookings`.`first_aid_course_id`
LEFT JOIN `first_aid_attendees`
ON `first_aid_bookings`.`id` =
`first_aid_attendees`.`first_aid_booking_id`
WHERE ( `first_aid_courses`.`location` LIKE '%$search_string%'
OR `first_aid_courses`.`venue` LIKE '%$search_string%' )
AND `first_aid_courses`.`deleted` = 0
AND ( `first_aid_bookings`.`placed` > '$five_minutes_ago'
AND `first_aid_bookings`.`cancelled` = 0
OR `first_aid_bookings`.`confirmed` = 1 )
GROUP BY `first_aid_courses`.`id`
ORDER BY `course_date` DESC
It's not quite working; can anyone help me with writing the correct query? Also, there are 1000s of rows in this database, so any help on making it fast is appreciated (like which fields to index).
OK, I've answered my own question. Sometimes it helps to ask a question for you to figure out the answer.
SELECT `first_aid_courses`.*,
`regions`.`name` AS `region_name`,
COUNT(DISTINCT `first_aid_bookings`.`id`) AS `bookings`,
COUNT(`first_aid_attendees`.`id`) AS `attendees`
FROM `first_aid_courses`
JOIN `regions`
ON `first_aid_courses`.`region_id` = `regions`.`id`
LEFT JOIN `first_aid_bookings`
ON `first_aid_courses`.`id` =
`first_aid_bookings`.`first_aid_course_id`
LEFT JOIN `first_aid_attendees`
ON `first_aid_bookings`.`id` =
`first_aid_attendees`.`first_aid_booking_id`
WHERE ( `first_aid_courses`.`location` LIKE '%$search_string%'
OR `first_aid_courses`.`venue` LIKE '%$search_string%' )
AND `first_aid_courses`.`deleted` = 0
AND ( `first_aid_bookings`.`cancelled` = 0
AND `first_aid_bookings`.`confirmed` = 1 )
GROUP BY `first_aid_courses`.`id`
ORDER BY `course_date` ASC
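One thing to watch with this shape of query: conditions on first_aid_bookings placed in the WHERE clause discard the NULL rows that the LEFT JOIN produces, so courses with no bookings drop out of the result. A hedged sketch of the alternative is to move the booking conditions into the ON clause. It uses Python's sqlite3 module as a stand-in for MySQL, invented rows, reduced tables, and one reading of the requirements (not cancelled, and either confirmed or placed before the five-minute cutoff), which is an assumption rather than the asker's confirmed intent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced tables with invented rows.
CREATE TABLE first_aid_courses (id INTEGER PRIMARY KEY, location TEXT,
                                deleted INTEGER);
CREATE TABLE first_aid_bookings (id INTEGER PRIMARY KEY,
                                 first_aid_course_id INTEGER, placed TEXT,
                                 confirmed INTEGER, cancelled INTEGER);
CREATE TABLE first_aid_attendees (id INTEGER PRIMARY KEY,
                                  first_aid_booking_id INTEGER);
INSERT INTO first_aid_courses VALUES (1, 'Leeds', 0), (2, 'Leeds', 0);
INSERT INTO first_aid_bookings VALUES
  (1, 1, '2010-01-01 09:00:00', 1, 0),  -- confirmed: counted
  (2, 1, '2010-01-01 09:30:00', 0, 1);  -- cancelled: ignored
INSERT INTO first_aid_attendees VALUES (1, 1), (2, 1), (3, 2);
""")

query = """
SELECT c.id,
       COUNT(DISTINCT b.id) AS bookings,
       COUNT(a.id)          AS attendees
FROM first_aid_courses c
LEFT JOIN first_aid_bookings b
       ON b.first_aid_course_id = c.id
      AND b.cancelled = 0
      AND (b.confirmed = 1 OR b.placed < ?)  -- booking filters live in ON
LEFT JOIN first_aid_attendees a
       ON a.first_aid_booking_id = b.id
WHERE c.deleted = 0
  AND c.location LIKE ?
GROUP BY c.id
ORDER BY c.id
"""

# The cutoff stands in for $five_minutes_ago; the value is invented.
rows = conn.execute(query, ("2010-01-01 08:00:00", "%Leeds%")).fetchall()
print(rows)
```

Course 2 has no bookings and still appears, with zero counts, because only course-level conditions remain in the WHERE clause.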
This is completely untested, but maybe try selecting a count of non-null rows for bookings and attendees, like this:
SUM(IF(`first_aid_bookings`.`id` IS NOT NULL, 1, 0)) AS `bookings`,
SUM(IF(`first_aid_attendees`.`id` IS NOT NULL, 1, 0)) AS `attendees`
Unless you have them but just don't show them, have a good look at your indexes; without them you lose an order of magnitude in performance on any query that references anything but the primary key.
Another major performance hit is the LIKE '%nnn%' conditions.
Would it be possible to do something about those?
With some good indexes, though, this query should be fine if you have the hardware to back it up.
I have queries doing LIKE on tables with millions of rows. It's not a problem if the rest of the query can eliminate the unnecessary matches.
You could use subqueries to narrow the set of rows the LIKE conditions have to scan.

Big SQL SELECT performance difference when using <= against using < on a DATETIME column

Given the following table:
desc exchange_rates;
+------------------+----------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------------+----------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| time | datetime | NO | MUL | NULL | |
| base_currency | varchar(3) | NO | MUL | NULL | |
| counter_currency | varchar(3) | NO | MUL | NULL | |
| rate | decimal(32,16) | NO | | NULL | |
+------------------+----------------+------+-----+---------+----------------+
I have added indexes on time, base_currency and counter_currency, as well as a composite index on (time, base_currency, counter_currency), but I'm seeing a big performance difference when I perform a SELECT using <= against using <.
The first SELECT is:
ExchangeRate Load (95.5ms)
SELECT * FROM `exchange_rates` WHERE (time <= '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
As you can see this is taking 95ms.
If I change the query such that I compare time using < rather than <= I see this:
ExchangeRate Load (0.8ms)
SELECT * FROM `exchange_rates` WHERE (time < '2009-12-30 14:42:02' and base_currency = 'GBP' and counter_currency = 'USD') LIMIT 1
Now it takes less than 1 millisecond, which sounds right to me. Is there a rational explanation for this behaviour?
The output from EXPLAIN provides further details, but I'm not 100% sure how to interpret this:
-- Output from the first, slow, select
SIMPLE | 5,5 | exchange_rates | 1 | index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency | index_merge | Using intersect(index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency); Using where | 813 | | index_exchange_rates_on_counter_currency,index_exchange_rates_on_base_currency
-- Output from the second, fast, select
SIMPLE | 5 | exchange_rates | 1 | index_exchange_rates_on_time,index_exchange_rates_on_base_currency,index_exchange_rates_on_counter_currency,time_and_currency | ref | Using where | 4988 | const | index_exchange_rates_on_counter_currency
(Note: I'm producing these queries through ActiveRecord (in a Rails app) but these are ultimately the queries which are being executed)
In the first case, MySQL tries to combine results from all indexes. It fetches all records from both indexes and joins them on the value of the row pointer (table offset in MyISAM, PRIMARY KEY in InnoDB).
In the second case, it just uses a single index, which, considering LIMIT 1, is the best decision.
You need to create a composite index on (base_currency, counter_currency, time) (in this order) for this query to work as fast as possible.
The engine will use the index for filtering on the leading columns (base_currency, counter_currency) and for ordering on the trailing column (time).
It also seems you want to add something like ORDER BY time DESC to your query to get the last exchange rate.
In general, any LIMIT without ORDER BY should set off alarm bells, since which row comes back is then up to the engine.
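A small sketch of the suggested index and query shape, using Python's sqlite3 module as a stand-in for MySQL. The index name idx_pair_time and the sample rates are invented, and SQLite's planner is not MySQL's, so this only illustrates the idea of equality columns first and the range/sort column last.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE exchange_rates (
  id INTEGER PRIMARY KEY,
  time TEXT NOT NULL,
  base_currency TEXT NOT NULL,
  counter_currency TEXT NOT NULL,
  rate REAL NOT NULL);
-- Equality columns lead, the range/sort column comes last (hypothetical name).
CREATE INDEX idx_pair_time
  ON exchange_rates (base_currency, counter_currency, time);
-- Invented sample rates.
INSERT INTO exchange_rates (time, base_currency, counter_currency, rate) VALUES
  ('2009-12-30 14:00:00', 'GBP', 'USD', 1.59),
  ('2009-12-30 14:30:00', 'GBP', 'USD', 1.60),
  ('2009-12-30 15:00:00', 'GBP', 'USD', 1.61),
  ('2009-12-30 14:40:00', 'GBP', 'EUR', 1.11);
""")

sql = """
SELECT time, rate
FROM exchange_rates
WHERE base_currency = ?
  AND counter_currency = ?
  AND time <= ?
ORDER BY time DESC
LIMIT 1
"""
params = ("GBP", "USD", "2009-12-30 14:42:02")

# Most recent rate at or before the requested moment.
latest = conn.execute(sql, params).fetchone()

# The fourth column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params))

print(latest)
print(plan)
```

With the ORDER BY in place the result is deterministic, and the same index serves both the filter and the ordering, so the engine can stop after the first matching entry.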