How to optimize select query with case statements? - mysql

I have 3 tables with over 1,000,000 records each. My select query has been running for hours.
How can I optimize it? I'm a newbie.
I tried adding an index on name, but it still takes hours to load.
Like this,
ALTER TABLE table2 ADD INDEX(name);
and like this also,
CREATE INDEX INDEX1 ON table2(name);
SELECT MS.*, P.Counts FROM
(SELECT M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name
WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name
END col1
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id) AS MS
LEFT JOIN
(select E.id, count(E.id) Counts
from table3 E
where E.field2 = 'value1'
group by E.id) AS P
ON MS.id=P.id;
Explain <above query>;
output:
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
| 1 | PRIMARY | M | NULL | ALL | NULL | NULL | NULL | NULL | 344763 | 100.00 | NULL |
| 1 | PRIMARY | <derived3> | NULL | ref | <auto_key0> | <auto_key0> | 8 | CP.M.id | 10 | 100.00 | NULL |
| 1 | PRIMARY | V | NULL | index | NULL | INDEX1 | 411 | NULL | 1411083 | 100.00 | Using where; Using index; Using join buffer (Block Nested Loop) |
| 3 | DERIVED | E | NULL | ref | PRIMARY,f2,f3 | f2| 43 | const | 966442 | 100.00 | Using index |
+----+-------------+------------+------------+-------+---------------------------------------------+------------------+---------+------------------------+---------+----------+-----------------------------------------------------------------+
I expect to get the result in less than 1 minute.
The query, indented for clarity:
SELECT MS.*, P.Counts
FROM (
SELECT M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name
WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name
END col1
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id
) AS MS
LEFT JOIN (
select E.id, count(E.id) Counts
from table3 E
where E.field2 = 'value1'
group by E.id
) AS P ON MS.id=P.id;

Your query has no filtering predicate, so it's essentially retrieving all the rows. That is 1,000,000+ rows from table1. It then joins them with table2, and then with another table expression/derived table.
Why do you expect this query to be fast? A massive query like this one will normally run as a batch process at night. I assume this query is not for an online process, right?
Maybe you need to rethink the process. Do you really need to process millions of rows at once interactively? Will the user read a million rows on the web page?
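If it does have to be interactive, the usual remedy is to add a filtering predicate and page the output. A minimal sketch, assuming (purely as an illustration) that a one-year cutoff on M.date and a page size of 100 fit your use case:
SELECT M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age
FROM table1 M
WHERE M.date >= CURDATE() - INTERVAL 1 YEAR -- hypothetical filter
ORDER BY M.id
LIMIT 100; -- page the output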

Subqueries are not always well-optimized.
I think you can flatten it out to something like this:
SELECT M.*, V.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
CASE V.name WHEN 'text' THEN M.name
WHEN V.name IS NULL THEN M.name
ELSE V.name END col1,
( SELECT COUNT(*) FROM table3 WHERE field2 = 'value1' AND id = M.id
) AS Counts
FROM table1 AS M
LEFT JOIN table2 AS V ON M.id = V.id
I may have some parts not quite right; see if you can make this formulation work.

For starters, you are returning the same result for col1 when v.name is null and when v.name = 'text'. That said, you can fold that extra condition into your join with table2 and use the IFNULL function.
As you are filtering table3 by field2, you could create an index on table3 that includes field2.
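For example, a composite index that also covers the grouping column (a sketch; the index name is arbitrary):
ALTER TABLE table3 ADD INDEX idx_field2_id (field2, id);
This lets the derived table (WHERE field2 = 'value1' ... GROUP BY id) be resolved entirely from the index.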
You should also check whether you can apply any additional filters to any of those tables; if you can, consider using a stored procedure to get the results.
Also, I don't see why you need to wrap the first join in the derived table 'MS'; you can easily do all the joins in one go, like this:
SELECT
M.*,
TIMESTAMPDIFF(YEAR, M.date, CURDATE()) AS age,
IFNULL(V.name, M.name) as col1,
P.Counts
FROM table1 M
LEFT JOIN table2 V ON M.id=V.id AND V.name <> 'text'
LEFT JOIN
(SELECT
E.id,
COUNT(E.id) Counts
FROM table3 E
WHERE E.field2 = 'value1'
GROUP BY E.id) AS P ON M.id=P.id;
I'm also assuming that you have clustered indexes on the id fields of all three tables, but with no filter, if you are dealing with millions of records, this will always be a big, heavy query. To say the least, you are doing a full table scan on table1.
I've included this additional information after your comment.
I've mentioned clustered indexes, but according to the official documentation about indexes here:
When you define a PRIMARY KEY on your table, InnoDB uses it as the clustered index. So if you already have a primary key defined, you don't need to do anything else.
As the documentation also points out, you should define a primary key for each table that you create.
If you don't have a primary key, here is the code snippet you requested (note that CLUSTERED is SQL Server syntax; in MySQL/InnoDB the primary key is the clustered index automatically):
ALTER TABLE table1 ADD CONSTRAINT pk_table1
PRIMARY KEY (id);
ATTENTION: Keep in mind that creating a clustered index is a big operation for tables like yours with tons of data.
This isn't something you want to do without planning on a production server. The operation will also take a long time, and the table will be locked during the process.

Related

Improving query performance by example

I'm trying to work out a way to improve a query. The schema involved looks like this:
CREATE TABLE `orders` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`store_id` INTEGER NOT NULL,
`billing_profile_id` INTEGER NOT NULL,
`billing_address_id` INTEGER NULL,
`total` DECIMAL(8, 2) NOT NULL
);
CREATE TABLE `billing_profiles` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`name` TEXT NOT NULL
);
CREATE TABLE `billing_addresses` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`address` TEXT NOT NULL
);
CREATE TABLE `stores` (
`id` int PRIMARY KEY NOT NULL AUTO_INCREMENT,
`name` TEXT NOT NULL
);
The query I'm executing:
SELECT bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM billing_profiles bp,
stores s,
orders o
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
WHERE o.billing_profile_id = bp.id
AND s.id = o.store_id
GROUP BY bp.name,
ba.address,
s.name;
And here is the EXPLAIN:
+----+-------------+-------+------------+--------+---------------+---------+---------+------------------------------+-------+----------+--------------------------------------------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------------------------+-------+----------+--------------------------------------------+
| 1 | SIMPLE | bp | NULL | ALL | PRIMARY | NULL | NULL | NULL |155000 | 100.00 | Using temporary |
| 1 | SIMPLE | o | NULL | ALL | NULL | NULL | NULL | NULL |220000 | 33.33 | Using where; Using join buffer (hash join) |
| 1 | SIMPLE | ba | NULL | eq_ref | PRIMARY | PRIMARY | 4 | factory.o.billing_address_id | 1 | 100.00 | NULL |
| 1 | SIMPLE | s | NULL | eq_ref | PRIMARY | PRIMARY | 4 | factory.o.store_id | 1 | 100.00 | NULL |
+----+-------------+-------+------------+--------+---------------+---------+---------+------------------------------+------+----------+--------------------------------------------+
The problem I'm facing is that this query takes 30+ secs to execute; we have over 200,000 orders and 150,000+ billing_profiles/billing_addresses.
What should I do regarding index/constraints so that this query becomes faster to execute?
Edit: after some suggestions in the comments I edited the query to:
SELECT bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM orders o
INNER JOIN billing_profiles bp
ON o.billing_profile_id = bp.id
INNER JOIN stores s
ON s.id = o.store_id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
GROUP BY bp.name,
ba.address,
s.name;
But it still takes too much time.
One thing I have used in the past that has helped in many instances with MySQL is the STRAIGHT_JOIN clause, which tells the engine to process the query in the order listed.
I have cleaned up your query into proper JOIN context. Since the orders table is the primary basis of the data, and the other 3 are lookup references for their respective IDs, I put the orders table first.
SELECT STRAIGHT_JOIN
bp.name,
ba.address,
s.name,
Sum(o.total) AS total
FROM
orders o
JOIN stores s
ON o.store_id = s.id
JOIN billing_profiles bp
on o.billing_profile_id = bp.id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
GROUP BY
bp.name,
ba.address,
s.name
Now, your data tables don't appear that large, but if you are going to group on values tied to 3 of the columns in the orders table, I would have an index on the underlying basis of them, which are the "ID" keys linking to the other tables. Adding total to allow a covering index for this aggregate query, I would index on
( store_id, billing_profile_id, billing_address_id, total )
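In executable form, that would be something like (a sketch; the index name is arbitrary):
ALTER TABLE orders
ADD INDEX idx_store_bp_addr_total (store_id, billing_profile_id, billing_address_id, total);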
I'm sure that in reality you have many other columns associated with an order and are just showing the context for this query. Then, I would change to a pre-query so the aggregation is done once over the orders table by its ID keys; THEN the result is joined to the lookup tables, and you just need to apply an ORDER BY clause for your final output. Something like..
SELECT
bp.name,
ba.address,
s.name,
o.total
FROM
( select
store_id,
billing_profile_id,
billing_address_id,
sum( total ) total
from
orders
group by
store_id,
billing_profile_id,
billing_address_id ) o
JOIN stores s
ON o.store_id = s.id
JOIN billing_profiles bp
on o.billing_profile_id = bp.id
LEFT JOIN billing_addresses ba
ON o.billing_address_id = ba.id
ORDER BY
bp.name,
ba.address,
s.name
Add this index to o, being sure to start with billing_profile_id:
INDEX(billing_profile_id, store_id, billing_address_id, total)
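As DDL, that is (a sketch; name the index as you like):
ALTER TABLE orders
ADD INDEX idx_bp_store_addr_total (billing_profile_id, store_id, billing_address_id, total);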
Discussion of the Explain:
The Optimizer saw that it needed to do a full scan of some table.
bp was smaller than o, so it picked bp as the "first" table.
Then it reached into the next table repeatedly.
It did not see a suitable index (one starting with billing_profile_id) and decided to do "Using join buffer (hash join)", which involves loading the entire table into a hash in RAM.
"Using temporary", though mentioned on the "first" table, really does not show up until just before the GROUP BY. (The GROUP BY references multiple tables, so there is no way to optimize it.)
Potential bug: please check the results of Sum(o.total) AS total. It is computed over the rows produced by all the JOINing, so if the joins multiply rows it may be inflated. Notice how DRapp's formulation does the SUM before the JOINs.

How to optimize a join which causes very slow performance

This query runs for more than 12 seconds, even though all the tables are relatively small - about 2 thousand rows each.
SELECT attr_73206_ AS attr_73270_
FROM object_73130_ f1
LEFT OUTER JOIN (
SELECT id_field, attr_73206_ FROM (
SELECT m.id_field, t0.attr_73102_ AS attr_73206_ FROM object_73200_ o
INNER JOIN master_slave m ON (m.id_object = 73130 OR m.id_object = 73290) AND (m.id_master = 73200 OR m.id_master = 73354) AND m.id_slave_field = o.id
INNER JOIN object_73101_ t0 ON t0.id = o.attr_73206_
ORDER BY o.id_order
) AS o GROUP BY o.id_field
) AS o ON f1.id = o.id_field
Both tables have id fields as primary keys. Besides that, id_field, id_order, attr_73206_ and all fields in master_slave are indexed. As for the logic of this query, on the whole it is of the master-detail kind. Table object_73130_ is the master table, table object_73200_ is the detail table, and they are linked by the master_slave table. object_73101_ is an ad-hoc table used to get the real value for the field attr_73206_ by its id. For each row in the master table, the query returns a field from the very first row of its detail table.
The query originally looked different, but here at Stack Overflow I was advised to use this more optimized structure instead of the subquery used previously (and, by the way, the query did start to run much faster). I observe that the subquery in the first JOIN block runs very fast but returns a number of rows comparable to the number of rows in the main master table. In any case, I do not know how to optimize it. I just wonder why a simple fast-running join causes so much trouble.
Oh, the main observation is that if I remove the ad-hoc object_73101_ join from the query, so that it returns just an id and not the real value, then the query runs as quick as a flash. So all attention should be focused on this part of the query:
INNER JOIN object_73101_ t0 ON t0.id = o.attr_73206_
Why does it slow down the whole query so terribly?
EDIT
In this way it runs super-fast
SELECT t0.attr_73102_ AS attr_73270_
FROM object_73130_ f1
LEFT OUTER JOIN (
SELECT id_field, attr_73206_ FROM (
SELECT m.id_field, attr_73206_ FROM object_73200_ o
INNER JOIN master_slave m ON (m.id_object = 73130 OR m.id_object = 73290) AND (m.id_master = 73200 OR m.id_master = 73354) AND m.id_slave_field = o.id
ORDER BY o.id_order
) AS o GROUP BY o.id_field
) AS o ON f1.id = o.id_field
LEFT JOIN object_73101_ t0 ON t0.id = o.attr_73206_
So, you can see that I just moved the ad-hoc join outside of the subquery. The problem is that the subquery is created automatically: I have access to the part of the algorithm which creates it and can modify that part, but I do not have access to the part which builds the whole query, so the only thing I can do is fix the subquery somehow. Anyway, I still can't understand why an INNER JOIN inside a subquery can slow the whole query down hundreds of times.
EDIT
A new version of query with different aliases for each table. This has no effect on the performance:
SELECT attr_73206_ AS attr_73270_
FROM object_73130_ f1
LEFT OUTER JOIN (
SELECT id_field, attr_73206_ FROM (
SELECT m.id_field, t0.attr_73102_ AS attr_73206_ FROM object_73200_ a
INNER JOIN master_slave m ON (m.id_object = 73130 OR m.id_object = 73290) AND (m.id_master = 73200 OR m.id_master = 73354) AND m.id_slave_field = a.id
INNER JOIN object_73101_ t0 ON t0.id = a.attr_73206_
ORDER BY a.id_order
) AS b GROUP BY b.id_field
) AS c ON f1.id = c.id_field
EDIT
This is the result of EXPLAIN command:
| id | select_type | TABLE | TYPE | possible_keys | KEY | key_len | ROWS | Extra |
| 1 | PRIMARY | f1 | INDEX | NULL | PRIMARY | 4 | 1570 | USING INDEX
| 1 | PRIMARY | derived2| ALL | NULL | NULL | NULL | 1564 |
| 2 | DERIVED | derived3| ALL | NULL | NULL | NULL | 1575 | USING TEMPORARY; USING filesort
| 3 | DERIVED | m | RANGE | id_object,id_master,..| id_object | 4 | 1356 | USING WHERE; USING TEMPORARY; USING filesort
| 3 | DERIVED | a | eq_ref | PRIMARY,attr_73206_ | PRIMARY | 4 | 1 |
| 3 | DERIVED | t0 | eq_ref | PRIMARY | PRIMARY | 4 | 1 |
What is wrong with that?
EDIT
Here is the result of EXPLAIN command for the "super-fast" query
| id | select_type | TABLE | TYPE | possible_keys | KEY | key_len | ROWS | Extra
| 1 | PRIMARY | f1 | INDEX | NULL | PRIMARY | 4 | 1570 | USING INDEX
| 1 | PRIMARY | derived2| ALL | NULL | NULL | NULL | 1570 |
| 1 | PRIMARY | t0 | eq_ref| PRIMARY | PRIMARY | 4 | 1 |
| 2 | DERIVED | derived3| ALL | NULL | NULL | NULL | 1581 | USING TEMPORARY; USING filesort
| 3 | DERIVED | m | RANGE | id_object,id_master,| id_object | 4 | 1356 | USING WHERE; USING TEMPORARY; USING filesort
| 3 | DERIVED | a | eq_ref | PRIMARY | PRIMARY | 4 | 1 |
CLOSED
I will use my own "super-fast" query, which I presented above. I think it is impossible to optimize it anymore.
Without knowing the exact nature of the data/query, there are a couple things that I'm seeing:
MySQL is notoriously bad at handling sub-selects, as it requires the creation of derived tables. In fact, some versions of MySQL also ignore indexes when using sub-selects. Typically, it's better to use JOINs instead of sub-selects, but if you need to use sub-selects, it's best to make that sub-select as lean as possible.
Unless you have a very specific reason for putting the ORDER BY in the sub-select, it may be a good idea to move it to the "main" query portion because the result set may be smaller (allowing for quicker sorting).
So, all that being said, I tried to re-write your query using JOIN logic, but I was wondering: what table does the final value (attr_73102_) come from? Is it the result of the sub-select, or does it come from table object_73130_? If it's coming from the sub-select, then I don't see why you're bothering with the original LEFT JOIN, as you will only be returning the list of values from the sub-select, and NULL for any non-matching rows from object_73130_.
Regardless, not knowing the answer to that, I think the query below MAY be syntactically equivalent:
SELECT t0.attr_73102_ AS attr_73270_
FROM object_73130_ f1
LEFT JOIN (object_73200_ o
INNER JOIN master_slave m ON m.id_slave_field = o.id
INNER JOIN object_73101_ t0 ON t0.id = o.attr_73206_)
ON f1.id = o.id_field
WHERE m.id_object IN (73130,73290)
AND m.id_master IN (73200,73354)
GROUP BY o.id_field
ORDER BY o.id_order;

MySQL grouping query optimization

I have three tables: categories, articles, and article_events, with the following structure
categories: id, name (100,000 rows)
articles: id, category_id (6000 rows)
article_events: id, article_id, status_id (20,000 rows)
The highest article_events.id for each article row describes the current status of each article.
I'm returning a table of categories and how many articles are in them with a most-recent-event status_id of '1'.
What I have so far works, but is fairly slow (10 seconds) with the size of my tables. Wondering if there's a way to make this faster. All the tables have proper indexes as far as I know.
SELECT c.id,
c.name,
SUM(CASE WHEN e.status_id = 1 THEN 1 ELSE 0 END) article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN (
SELECT article_id, MAX(id) event_id
FROM article_events
GROUP BY article_id
) most_recent ON most_recent.article_id = a.id
LEFT JOIN article_events e ON most_recent.event_id = e.id
GROUP BY c.id
Basically I have to join to the events table twice, since asking for the status_id along with the MAX(id) just returns the first status_id it finds, and not the one associated with the MAX(id) row.
Any way to make this better? or do I just have to live with 10 seconds? Thanks!
Edit:
Here's my EXPLAIN for the query:
ID | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra
---------------------------------------------------------------------------------------------------------------------------
1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 124044 | Using index; Using temporary; Using filesort
1 | PRIMARY | a | ref | category_id | category_id | 4 | c.id | 3 |
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 6351 |
1 | PRIMARY | e | eq_ref | PRIMARY | PRIMARY | 4 | most_recent.event_id | 1 |
2 | DERIVED | article_events | ALL | NULL | NULL | NULL | NULL | 19743 | Using temporary; Using filesort
If you can eliminate subqueries with JOINs, it often performs better because derived tables can't use indexes. Here's your query without subqueries:
SELECT c.id,
c.name,
SUM(CASE WHEN ae1.status_id = 1 THEN 1 ELSE 0 END) AS article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events ae1
ON ae1.article_id = a.id
LEFT JOIN article_events ae2
ON ae2.article_id = a.id
AND ae2.id > ae1.id
WHERE ae2.id IS NULL
GROUP BY c.id
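For comparison, the same "latest event" lookup can also be written with a correlated subquery in the join condition (a sketch, untested):
SELECT c.id,
c.name,
SUM(CASE WHEN e.status_id = 1 THEN 1 ELSE 0 END) article_count
FROM categories c
LEFT JOIN articles a ON a.category_id = c.id
LEFT JOIN article_events e
ON e.article_id = a.id
AND e.id = (SELECT MAX(ae.id) FROM article_events ae WHERE ae.article_id = a.id)
GROUP BY c.id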
You'll want to experiment with the indexes and use EXPLAIN to test, but here's my guess (I'm assuming id fields are primary keys and you are using InnoDB):
categories: `name`
articles: `category_id`
article_events: (`article_id`, `id`)
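In CREATE INDEX form (a sketch using the column names above):
CREATE INDEX idx_categories_name ON categories (name);
CREATE INDEX idx_articles_category_id ON articles (category_id);
CREATE INDEX idx_article_events_article_id ON article_events (article_id, id);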
Didn't try it, but I'm thinking this will save a bit of work for the database (note that the correlated subqueries cannot reference select-list aliases such as ref_article_id, so they use ae.article_id directly):
SELECT ae.article_id AS ref_article_id,
MAX(ae.id) event_id,
ae.status_id,
(select a.category_id from articles a where a.id = ae.article_id) AS cat_id,
(select c.name from categories c where c.id = (select a2.category_id from articles a2 where a2.id = ae.article_id)) AS cat_name
FROM article_events ae
GROUP BY ae.article_id
Hope that helps
EDIT:
By the way... Keep in mind that joins have to go through each row, so you should start your selection from the small end and work your way up if you can help it. In this case, the query has to run through 100,000 records and join each one, then join those 100,000 again, and again, and again; even if values are null, it still has to go through them.
Hope this all helps...
I don't like that the index on categories.id is used, as you're selecting the whole table.
Try running:
ANALYZE TABLE categories;
ANALYZE TABLE article_events;
and re-run the query.

How to `SELECT` and manufacture missing rows from previous values?

I have the following (simplified) result from SELECT * FROM table ORDER BY tick,refid:
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 3 3333
3 3 333333
Note the "missing" rows for refid 1 (tick 3) and refid 2 (ticks 2 and 3)
If possible, how can I make a query to add these missing rows using the most recent prior value for that refid? "Most recent" means the value for the row with the same refid as the missing row and largest tick such that the tick is less than the tick for the missing row. e.g.
tick refid value
----------------
1 1 11
1 2 22
1 3 33
2 1 1111
2 2 22
2 3 3333
3 1 1111
3 2 22
3 3 333333
Additional conditions:
All refids will have values at tick=1.
There may be many 'missing' ticks for a refid in sequence (as above for refid 2).
There are many refids and it's not known which will have sparse data where.
There will be many ticks beyond 3, but all sequential. In the correct result, each refid will have a result for each tick.
Missing rows are not known in advance - this will be run on multiple databases, all with the same structure, and different "missing" rows.
I'm using MySQL and cannot change db just now. Feel free to post answer in another dialect, to help discussion, but I'll select an answer in MySQL dialect over others.
Yes, I know this can be done in the code, which I've implemented. I'm just curious if it can be done with SQL.
What value should be returned when a given tick-refid combination does not exist? In this solution, I simply returned the lowest value for that given refid.
Revision
I've updated the logic to determine what value to use in the case of a null. It should be noted that I'm assuming that (tick, refid) is unique in the table.
Select Ticks.tick
, Refs.refid
, Case
When Table.value Is Null
Then (
Select T2.value
From Table As T2
Where T2.refid = Refs.refId
And T2.tick = (
Select Max(T1.tick)
From Table As T1
Where T1.tick < Ticks.tick
And T1.refid = T2.refid
)
)
Else Table.value
End As value
From (
Select Distinct refid
From Table
) As Refs
Cross Join (
Select Distinct tick
From Table
) As Ticks
Left Join Table
On Table.tick = Ticks.tick
And Table.refid = Refs.refid
If you know in advance what your 'tick' and 'refid' values are,
Make a helper table that contains all possible tick and refid values.
Then left join from the helper table on tick and refid to your data table.
If you don't know exactly what your 'tick' and 'refid' values are, you could maybe still use this method, but instead of a static helper table, it would have to be dynamically generated.
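A sketch of the static variant, assuming integer columns and borrowing the chadwick table name from the next answer:
CREATE TABLE helper (
tick INT NOT NULL,
refid INT NOT NULL,
PRIMARY KEY (tick, refid)
);
-- populate with every (tick, refid) combination seen in the data
INSERT INTO helper (tick, refid)
SELECT t.tick, r.refid
FROM (SELECT DISTINCT tick FROM chadwick) AS t
CROSS JOIN (SELECT DISTINCT refid FROM chadwick) AS r;
-- then left join from the helper to the data table
SELECT h.tick, h.refid, c.value
FROM helper h
LEFT JOIN chadwick c ON c.tick = h.tick AND c.refid = h.refid;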
The following has too many sub-selects for my taste, but it generates the desired result in MySQL, as long as every tick and every refid occurs separately at least once in the table.
Start with a query that generates every pair of tick and refid. The following uses the table to generate the pairs, so if any tick never appears in the underlying table, it will also be missing from the generated pairs. The same holds true for refids, though the restriction that "All refids will have values at tick=1" should ensure the latter never happens.
SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
Using this, generate every missing tick, refid pair, along with the largest tick that exists in the table by equijoining on refid and θ≥-joining on tick. Group by the generated tick, refid since only one row for each pair is desired. The key to filtering out existing tick, refid pairs is the HAVING clause. Strictly speaking, you can leave out the HAVING; the resulting query will return existing rows with their existing values.
SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
One final select from the above as a sub-select, joined to the original table to get the value for the given ctick, returns the new rows for the table.
INSERT INTO chadwick
SELECT missing.tick, missing.refid, c.value
FROM (SELECT tr.tick, tr.refid, MAX(c.tick) AS ctick
FROM
(SELECT tick, refid FROM
(SELECT refid FROM chadwick WHERE tick=1) AS r
JOIN
(SELECT DISTINCT tick FROM chadwick) AS t
) AS tr
JOIN chadwick AS c ON tr.tick >= c.tick AND tr.refid=c.refid
GROUP BY tr.tick, tr.refid
HAVING tr.tick > MAX(c.tick)
) AS missing
JOIN chadwick AS c ON missing.ctick = c.tick AND missing.refid=c.refid
;
Performance on the sample table, along with (tick, refid) and (refid, tick) indices:
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 1 | PRIMARY | c | ALL | tick_ref,ref_tick | NULL | NULL | NULL | 6 | Using where; Using join buffer |
| 2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 9 | Using temporary; Using filesort |
| 2 | DERIVED | c | ref | tick_ref,ref_tick | ref_tick | 5 | tr.refid | 1 | Using where; Using index |
| 3 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 3 | DERIVED | <derived5> | ALL | NULL | NULL | NULL | NULL | 3 | Using join buffer |
| 5 | DERIVED | chadwick | index | NULL | tick_ref | 10 | NULL | 6 | Using index |
| 4 | DERIVED | chadwick | ref | tick_ref | tick_ref | 5 | | 2 | Using where; Using index |
+----+-------------+------------+-------+-------------------+----------+---------+----------+------+---------------------------------+
As I said, too many sub-selects. A temporary table may help matters.
To check for missing ticks:
SELECT clo.tick+1 AS missing_tick
FROM chadwick AS chi
RIGHT JOIN chadwick AS clo ON chi.tick = clo.tick+1
WHERE chi.tick IS NULL;
This will return at least one row with tick equal to 1 + the largest tick in the table. Thus, the largest value in this result can be ignored.
In order to get the list of (tick, refid) pairs to insert, first build the whole list:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
Now subtract the existing pairs from that query:
SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
Now you can join with t to obtain the final query (note that I use an inner join plus a left join to pick the most recent prior value, but you could adapt):
INSERT INTO t(tick, refid, value)
SELECT c.tick, c.refid, t1.value
FROM ( SELECT a.tick tick, b.refid refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
MINUS
SELECT DISTINCT tick, refid FROM t
) c
INNER JOIN t t1 ON t1.refid = c.refid and t1.tick < c.tick
LEFT JOIN t t2 ON t2.refid = c.refid AND t1.tick < t2.tick AND t2.tick < c.tick
WHERE t2.tick IS NULL
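Note that MINUS is not MySQL syntax (MySQL only gained EXCEPT in 8.0.31), so in the MySQL dialect the set difference above would be rewritten as an anti-join; a sketch:
SELECT a.tick, b.refid
FROM ( SELECT DISTINCT tick FROM t) a
CROSS JOIN ( SELECT DISTINCT refid FROM t) b
LEFT JOIN t x ON x.tick = a.tick AND x.refid = b.refid
WHERE x.tick IS NULL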

Optimise MySQL ORDER BY RAND() on a filtered GROUP BY query to avoid temp/indexless join

MySQL "join without index" counter is incrementing as shown in various analysis tools like mysql-tuner.pl etc, having tracked down to a query which selects a random product using RAND(), I would like to optimise to help avoid this increment.
The query looks like this:
select p.*, count(u.prodid) as count from prods p
left outer join usage u on p.prodid=u.prodid
where p.ownerid>0 and p.active=1
group by p.prodid
order by rand() limit 1;
I've tried using this style also...
select p.*, count(u.prodid) as count from prods p
left outer join usage u on p.prodid=u.prodid
where prodid in
(select prodid from prods
where ownerid>0 and active=1
group by prodid order by rand() limit 1);
but MySQL doesn't support a LIMIT in an 'in' subquery...
The explain/describe looks like this...
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------+
| 1 | SIMPLE | p | range | ownerid | ownerid | 4 | NULL | 11 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | u | index | NULL | userid | 8 | NULL | 52 | Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------+
2 rows in set (0.00 sec)
Whilst some of you may think "so what if it performs an index-less join" - perhaps it's more of an annoyance than something that could be a real problem - I appreciate there may be a better way to achieve what is needed anyway, particularly as the table row counts grow...
So any ideas welcome!
Usually it's faster to run several queries than to sort the table by rand(). First, get a random row number:
select floor( count(*) * rand() ) random_number
from prods
where ownerid > 0 and active = 1
And then get the particular row:
select p.*, count(u.prodid) as count
from prods p
left outer join usage u on p.prodid = u.prodid
where prodid = (
select prodid from prods
where ownerid > 0 and active = 1
limit {$random_number}, 1
)
By the way, your subquery returns only one row, so you can use the = operator instead of IN.
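If you want to keep both steps in SQL rather than interpolating $random_number from application code, note that LIMIT only accepts a placeholder inside a prepared statement; a minimal sketch for the inner row pick:
SET @n = (SELECT FLOOR(COUNT(*) * RAND())
FROM prods
WHERE ownerid > 0 AND active = 1);
PREPARE stmt FROM
'SELECT prodid FROM prods
WHERE ownerid > 0 AND active = 1
LIMIT ?, 1';
EXECUTE stmt USING @n;
DEALLOCATE PREPARE stmt;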