Currently I have a table with close to 1 million rows that I need to query. What I need to do is stack rank packages by the number of products they include from a given list of product IDs.
SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
ORDER BY commonProducts DESC
LIMIT 10
The query works fine, but I would like to improve upon it. I tried a multi-column index on productID and packageID, but it seemed to seek more rows than just having a separate index for each of the columns.
MySQL Explain
select_type: SIMPLE
table: supply
type: range
possible_keys: supplyID
key: supplyID
key_len: 3
ref: null
rows: 996
extra: Using where; Using temporary; Using filesort
My main concern is that the query is using a temporary table and filesort. How could I go about optimizing this query? I presume that the biggest issue is count() and the ORDER BY on the results of count().
You can remove the temp table using a Dependent Subquery:
select * from
(
SELECT count(productID) AS commonProducts, s.productId, s.packageID
FROM supply as s
WHERE EXISTS
(
select 1 from supply as innerS
where innerS.productID in (2,3,4,5,6,7,8,9,10)
and s.productId = innerS.productId
)
GROUP BY s.packageID
) AS t
ORDER BY t.commonProducts DESC
LIMIT 10
The inner query links to the outer query and preserves the index. You'll find that any query that sorts on commonProducts, including the above query, will use a filesort, because the result of count() is computed at query time and cannot be indexed. But fear not: filesort is just a fancy word for sort. MySQL can choose to use an efficient in-memory sort, and whether you sort now or as a mergesort on the way to an indexed temporary table, you'll have to pay for that sorting somewhere. However, this case is pretty good because filesort can stop once it has the rows needed to satisfy the LIMIT you've put in place. It does not have to fully sort the entire list of commonProducts.
Update
If this query is going to be run all the time, I would recommend (without getting too fancy) setting triggers on the supply table to update a legitimate table that tracks counters like this one.
Creating a temporary result set:
SELECT TMP.*
FROM ( SELECT count(productID) AS commonProducts, packageID
FROM supply
WHERE productID IN (2,3,4,5,6,7,8,9,10)
GROUP BY packageID
) AS TMP
ORDER BY commonProducts DESC
LIMIT 10
Perhaps it's not the most elegant way and I cannot guarantee it will be faster because everything depends on your particular data. But in some cases this gives much better results:
SELECT count(*) AS commonProducts, packageID
FROM (
SELECT packageID FROM supply WHERE productID = 2
UNION ALL
SELECT packageID FROM supply WHERE productID = 3
UNION ALL
.
.
.
SELECT packageID FROM supply WHERE productID = 10
) AS t
GROUP BY packageID
ORDER BY commonProducts DESC
LIMIT 10
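To check that the UNION ALL rewrite really ranks packages the same way as the IN version, here is a minimal sketch. It uses Python's sqlite3 module as a stand-in for MySQL, and the sample rows, the product list and the function names are all made up for illustration. It says nothing about which form is faster on your data; for that you still need EXPLAIN and timing on the real table.

```python
import sqlite3

# Toy stand-in for the supply table (SQLite in memory, not MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supply (packageID INTEGER, productID INTEGER)")
conn.executemany(
    "INSERT INTO supply VALUES (?, ?)",
    [(1, 2), (1, 3), (1, 4), (2, 2), (2, 99), (3, 3), (3, 4), (3, 5), (3, 6)],
)

wanted = [2, 3, 4, 5]  # hypothetical product list, no duplicates


def rank_with_in(conn, product_ids, limit=10):
    """Original form: one range scan with IN, then GROUP BY."""
    marks = ",".join("?" * len(product_ids))
    sql = (
        "SELECT COUNT(productID) AS commonProducts, packageID FROM supply "
        f"WHERE productID IN ({marks}) "
        "GROUP BY packageID ORDER BY commonProducts DESC, packageID LIMIT ?"
    )
    return conn.execute(sql, [*product_ids, limit]).fetchall()


def rank_with_union_all(conn, product_ids, limit=10):
    """Rewrite: one equality lookup per product, glued with UNION ALL."""
    branches = " UNION ALL ".join(
        "SELECT packageID FROM supply WHERE productID = ?" for _ in product_ids
    )
    sql = (
        f"SELECT COUNT(*) AS commonProducts, packageID FROM ({branches}) AS t "
        "GROUP BY packageID ORDER BY commonProducts DESC, packageID LIMIT ?"
    )
    return conn.execute(sql, [*product_ids, limit]).fetchall()


print(rank_with_in(conn, wanted))
print(rank_with_union_all(conn, wanted))
```

The extra packageID in the ORDER BY is only there to make ties deterministic in the comparison. The two forms agree as long as the product list itself contains no duplicates.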
Related
In order to get the most recent record of a certain combination of identifiers, I use the following query:
SELECT t1.*
FROM (
SELECT id, b_id, c_id
FROM a
ORDER BY epoch DESC
LIMIT 18446744073709551615
) AS t1
GROUP BY t1.b_id, t1.c_id
If there are multiple records of a combination of b_id + c_id , then it will always select the one with the highest value of epoch (and as such, the latest in time).
The LIMIT is added as a workaround to force MariaDB to actually order the results. I successfully use this construction a lot in my application, and so have others.
However, now I came across an exact same query in my application, where I "accidentally" used more columns than strictly necessary in the sub-query:
SELECT t1.*
FROM (
SELECT id, b_id, c_id, and, some, other, columns, ...
FROM a
ORDER BY epoch DESC
LIMIT 18446744073709551615
) AS t1
GROUP BY t1.b_id, t1.c_id
I've tested both queries. The exact same query, with those additional columns as the only change, gives an incorrect result. In fact, the number of columns determines the result. If I have <= 28 columns, the result is okay. If I have 29 columns, it gives the third-latest record (which is wrong too), and if I have 30-36 columns it always gives the second-latest record (36 is the total number of columns in table a). In my testing, it didn't seem to matter which particular column was removed or added.
I'm having a hard time finding out why exactly the behavior changes after I add more columns. Also, perhaps by chance, it still gave the correct result yesterday. But today suddenly the result changed, probably after new records (with unrelated identifiers) were added to table a. I've tried using EXPLAIN:
# The first query, with columns: id, b_id, c_id
id  select_type  table       type  possible_keys  key   key_len  ref   rows  Extra
1   PRIMARY      <derived2>  ALL   NULL           NULL  NULL     NULL  280   Using where; Using temporary; Using filesort
2   DERIVED      a           ALL   NULL           NULL  NULL     NULL  280   Using filesort
# The second query, with columns: id, b_id, c_id, and, some, other, columns, ...
id  select_type  table       type  possible_keys  key   key_len  ref   rows  Extra
1   PRIMARY      <derived2>  ALL   NULL           NULL  NULL     NULL  276   Using where; Using temporary; Using filesort
2   DERIVED      a           ALL   NULL           NULL  NULL     NULL  276   Using filesort
But that doesn't really help me much, other than that I can see that the rows estimate is different (280 versus 276). The second-latest record that is incorrectly returned by the second query is the one where id = 276; the actual latest record, which the first query correctly retrieves, is the one where id = 278. In total there are 307 rows now, and yesterday perhaps just ~300. I'm not sure how to interpret these results to understand what is going wrong. Does anyone know? And if not, what else can I do to find out what is causing these strange results?
This is a malformed query and should be generating an error:
SELECT t1.*
FROM (SELECT id, b_id, c_id
FROM a
ORDER BY epoch DESC
LIMIT 18446744073709551615
) t1
GROUP BY t1.b_id, t1.c_id;
Why? You are selecting three columns with no aggregation functions, but the GROUP BY has only two columns. Happily, this is now an error in MySQL under the default settings. Finally! (MySQL accepted this non-standard syntax by default until version 5.7.5, when ONLY_FULL_GROUP_BY became part of the default SQL mode.)
You can do what you want using a correlated subquery:
select a.*
from a
where a.epoch = (select max(a2.epoch)
from a a2
where a2.b_id = a.b_id and a2.c_id = a.c_id
);
With an index on a(b_id, c_id, epoch), this is probably also faster than aggregation -- even if that happened to work under some circumstances.
Why not use window functions rather than this dirty workaround, that relies on MySQL/MariaDB non-standard behavior regarding group by?
select *
from (
select a.*, row_number() over(partition by b_id, c_id order by epoch desc) rn
from a
) a
where rn = 1
This works in MySQL 8.0 and MariaDB 10.2 or higher. In earlier versions, one alternative is a correlated subquery:
select *
from a
where epoch = (select max(a1.epoch) from a a1 where a1.b_id = a.b_id and a1.c_id = a.c_id)
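As a quick sanity check of the two approaches above, here is a small sketch that runs both against the same toy data. It uses Python's sqlite3 module rather than MySQL or MariaDB (window functions need SQLite 3.25 or newer), and the rows in table a are invented for the example.

```python
import sqlite3

# Toy version of table a (SQLite in memory; sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER, c_id INTEGER, epoch INTEGER)"
)
conn.executemany(
    "INSERT INTO a VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, 100),
        (2, 1, 1, 300),  # latest row for (b_id=1, c_id=1)
        (3, 1, 1, 200),
        (4, 2, 1, 50),
        (5, 2, 1, 60),   # latest row for (b_id=2, c_id=1)
        (6, 2, 2, 10),   # only row for (b_id=2, c_id=2)
    ],
)

# Window-function form: number the rows per group, newest first, keep rank 1.
WINDOW_SQL = """
SELECT id, b_id, c_id, epoch
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (PARTITION BY b_id, c_id ORDER BY epoch DESC) AS rn
    FROM a
) AS ranked
WHERE rn = 1
ORDER BY id
"""

# Correlated-subquery form: keep rows whose epoch equals the group maximum.
CORRELATED_SQL = """
SELECT id, b_id, c_id, epoch
FROM a
WHERE epoch = (SELECT MAX(a2.epoch)
               FROM a AS a2
               WHERE a2.b_id = a.b_id AND a2.c_id = a.c_id)
ORDER BY id
"""

latest_window = conn.execute(WINDOW_SQL).fetchall()
latest_correlated = conn.execute(CORRELATED_SQL).fetchall()
print(latest_window)
print(latest_correlated)
```

One difference to keep in mind: if two rows in a group share the maximum epoch, the correlated subquery returns both of them, while ROW_NUMBER() keeps exactly one (an arbitrary one unless you add a tie-breaker to its ORDER BY).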
The scenario: (I'm using MySQL)
Here is my schema:
CREATE TABLE so_time_diff(
OwnerUserId int(11),
time_diff int(10)
);
There are many OwnerUserIds, each having many time_diff values.
I would like to pick 1000 random distinct OwnerUserIds and, for each OwnerUserId, pick just one random time_diff value.
I already got 1000 distinct OwnerUserIds from elsewhere and stored them in a different table:
mysql> create table so_OwnerUserId select distinct(Id) as OwnerUserId
from so_users order by RAND() limit 1000;
I have written the following query:
select @td := time_diff from so_time_diff sotd, so_OwnerUserId soui
where sotd.OwnerUserId = soui.OwnerUserId group by sotd.OwnerUserId
order by rand() limit 1;
This doesn't seem to accomplish what I want. It obviously returns just one row. But I want one random row from each OwnerUserId's time_diff collection. Could someone guide me on how to accomplish this?
FYI - the size of dataset is huge - ~56m records. So I'm looking for an optimal query.
Any help appreciated.
Thanks!
One approach is to use a correlated subquery. It's not a very efficient approach, since the subquery will be executed for each row in the outer table, which would be 1000 times if there are 1000 rows in so_OwnerUserId.
SELECT r.OwnerUserId
, ( SELECT d.time_diff
FROM so_time_diff d
WHERE d.OwnerUserId = r.OwnerUserId
ORDER BY RAND()
LIMIT 1
) AS random_time_diff
FROM so_OwnerUserId r
For any kind of performance, you're going to need an index with a leading column of OwnerUserId on the so_time_diff table. Better yet, a covering index:
... ON so_time_diff (OwnerUserId, time_diff)
(For InnoDB, if those are the only two columns in the table, you'd want that to be the cluster key.)
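Here is a small sketch of the correlated-subquery approach on toy data, assuming SQLite (through Python's sqlite3) in place of MySQL, so RAND() becomes RANDOM(). The index name idx_owner_diff and the sample values are made up; the point is only that each owner comes back once, with a value drawn from that owner's own rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE so_time_diff (OwnerUserId INTEGER, time_diff INTEGER);
    -- Covering index with OwnerUserId as the leading column (name is hypothetical).
    CREATE INDEX idx_owner_diff ON so_time_diff (OwnerUserId, time_diff);
    CREATE TABLE so_OwnerUserId (OwnerUserId INTEGER);
    """
)

# Made-up sample data: owner -> list of time_diff values.
diffs = {10: [1, 2, 3], 20: [40], 30: [500, 600]}
conn.executemany(
    "INSERT INTO so_time_diff VALUES (?, ?)",
    [(owner, d) for owner, values in diffs.items() for d in values],
)
# Pretend owners 10 and 30 are the randomly chosen ones.
conn.executemany("INSERT INTO so_OwnerUserId VALUES (?)", [(10,), (30,)])

# One random time_diff per chosen owner; the subquery runs once per outer row.
SQL = """
SELECT r.OwnerUserId,
       (SELECT d.time_diff
        FROM so_time_diff AS d
        WHERE d.OwnerUserId = r.OwnerUserId
        ORDER BY RANDOM()
        LIMIT 1) AS random_time_diff
FROM so_OwnerUserId AS r
ORDER BY r.OwnerUserId
"""
picked = conn.execute(SQL).fetchall()
print(picked)
```

The output changes from run to run, which is the intent. With the index in place, each subquery only has to sort the rows of one owner rather than the whole table.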
I'm trying to optimize a query.
My question seems to be similar to MySQL, Union ALL and LIMIT and the answer might be the same (I'm afraid). However in my case there's a stricter limit (1) as well as an index on a datetime column.
So here we go:
For simplicity, let's have just one table with three columns:
md5 (varchar)
value (varchar)
lastupdated (datetime)
There's an index on (md5, lastupdated), so selecting on an md5 key, ordering by lastupdated and limiting to 1 will be optimized.
The search shall return a maximum of one record matching one of 10 md5 keys. The keys have a priority. So if there's a record with prio 1 it will be preferred over any record with prio 2, 3 etc.
Currently UNION ALL is used:
select * from
(
(
select 0 prio, value
from mytable
where md5 = '7b76e7c87e1e697d08300fd9058ed1db'
order by lastupdated desc
limit 1
)
union all
(
select 1 prio, value
from mytable
where md5 = 'eb36cd1c563ffedc6adaf8b74c259723'
order by lastupdated desc
limit 1
)
) x
order by prio
limit 1;
It works, but the UNION seems to execute all 10 queries if 10 keys are provided.
However, from a business perspective, it would be ok to run the selects sequentially and stop after the first match.
Is that possible through plain SQL?
Or would the only option be a stored procedure?
There's a much better way to do this that doesn't need UNION. You really want the groupwise max for each key, with a custom ordering.
Groupwise Max
Order by FIELD()
There's no way the optimizer for UNION ALL can figure out what you're up to.
I don't know if you can do this, but suppose you had an md5prio table containing the list of hash codes you're looking for. For example:
prio  md5
0     '7b76e7c87e1e697d08300fd9058ed1db'
1     'eb36cd1c563ffedc6adaf8b74c259723'
etc.
Then your query could be:
select mytable.*
from mytable
join md5prio on mytable.md5 = md5prio.md5
order by md5prio.prio, mytable.lastupdated desc
limit 1
This might save the repeated queries. You'll definitely need your index on mytable.md5. I am not sure whether your compound index on lastupdated will help; you'll need to try it.
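To make the comparison concrete, here is a rough sketch that runs the md5prio join next to the "run the selects one by one and stop at the first hit" idea from the question, done in application code. It assumes SQLite via Python's sqlite3 instead of MySQL, and the short keys ("aaa", "bbb", "ccc"), the sample rows and the function names are placeholders, not real hashes or an existing API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mytable (md5 TEXT, value TEXT, lastupdated TEXT);
    CREATE INDEX idx_md5_updated ON mytable (md5, lastupdated);
    CREATE TABLE md5prio (prio INTEGER, md5 TEXT);
    """
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?)",
    [
        ("bbb", "old-b", "2011-01-01 00:00:00"),
        ("bbb", "new-b", "2011-06-01 00:00:00"),
        ("ccc", "only-c", "2011-09-01 00:00:00"),
    ],
)

# Keys in priority order; "aaa" (prio 0) has no rows in mytable.
keys = ["aaa", "bbb", "ccc"]
conn.executemany("INSERT INTO md5prio VALUES (?, ?)", list(enumerate(keys)))


def lookup_join(conn):
    """Single query: join to the priority table, best prio and newest row first."""
    return conn.execute(
        """
        SELECT mytable.value
        FROM mytable
        JOIN md5prio ON mytable.md5 = md5prio.md5
        ORDER BY md5prio.prio, mytable.lastupdated DESC
        LIMIT 1
        """
    ).fetchone()


def lookup_sequential(conn, keys):
    """Application-side loop: one indexed lookup per key, stop at the first match."""
    for key in keys:
        row = conn.execute(
            "SELECT value FROM mytable WHERE md5 = ? ORDER BY lastupdated DESC LIMIT 1",
            (key,),
        ).fetchone()
        if row is not None:
            return row
    return None


print(lookup_join(conn))
print(lookup_sequential(conn, keys))
```

Both return the newest row of the highest-priority key that exists. The loop costs one round trip per miss but never touches lower-priority keys once it has a hit; the join is a single round trip but has to collect and sort the matches for all keys.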
In your case, the most efficient solution may be to build an index on (md5, lastupdated). This index should be used to resolve each subquery very efficiently (looking up the values in the index and then looking up one data page).
Unfortunately, the groupwise max referenced by Gavin will produce multiple rows when there are duplicate lastupdated values (admittedly, perhaps not a concern in your case).
There is, actually, a MySQL way to get this answer, using group_concat and substring_index:
select p.prio,
substring_index(group_concat(mt.value order by mt.lastupdated desc), ',', 1) as value
from mytable mt join
(select 0 as prio, '7b76e7c87e1e697d08300fd9058ed1db' as md5 union all
select 1 as prio, 'eb36cd1c563ffedc6adaf8b74c259723' as md5 union all
. . .
) p
on mt.md5 = p.md5
group by p.prio
order by p.prio
limit 1
I have a problem with this slow query that runs for 10+ seconds:
SELECT DISTINCT siteid,
storyid,
added,
title,
subscore1,
subscore2,
subscore3,
( 1 * subscore1 + 0.8 * subscore2 + 0.1 * subscore3 ) AS score
FROM articles
WHERE added > '2011-10-23 09:10:19'
AND ( articles.feedid IN (SELECT userfeeds.siteid
FROM userfeeds
WHERE userfeeds.userid = '1234')
OR ( articles.title REGEXP '[[:<:]]keyword1[[:>:]]' = 1
OR articles.title REGEXP '[[:<:]]keyword2[[:>:]]' = 1 ) )
ORDER BY score DESC
LIMIT 0, 25
This outputs a list of stories based on the sites that a user added to his account. The ranking is determined by score, which is made up out of the subscore columns.
The query uses filesort and uses indices on PRIMARY and feedid.
Results of an EXPLAIN:
id  select_type         table      type            possible_keys                  key      ref   rows    Extra
1   PRIMARY             articles   range           PRIMARY,added,storyid          PRIMARY        729263  Using where; Using filesort
2   DEPENDENT SUBQUERY  userfeeds  index_subquery  storyid,userid,siteid_storyid  siteid   func  1       Using where
Any suggestions to improve this query? Thank you.
I would move the calculation logic to the client and only load fields from the database. This makes your query and the calculation itself faster. It's not a good style to do such things in SQL code.
Also, the regex is very slow; another search mode such as LIKE may be faster.
Looking at your EXPLAIN, it doesn't appear your query can use an index for the sort (thus the filesort). This is caused by the sort on the calculated column (score).
Another barrier is the size of the table (729263 rows). You don't want to create an index that is too wide, as it will take much more space and slow down your create, update, and delete (CUD) operations. What we want to do is target the columns that are being selected; however, in this situation we can't, since score is a calculated column. You can try creating a VIEW, removing the sort, or sorting at the application layer.
Apologies if this has been asked before, but is there any way at all I can optimize this query to run faster? At the minute it takes about 2 seconds, which isn't a huge amount, but it is the slowest query on my site; all other queries take less than 0.5 seconds.
Here is my query:
SELECT SQL_CALC_FOUND_ROWS MAX(images.id) AS maxID, celebrity.* FROM images
JOIN celebrity ON images.celeb_id = celebrity.id
GROUP BY images.celeb_id
ORDER BY maxID DESC
LIMIT 0,20
Here is an explain:
id  select_type  table      type  possible_keys  key       key_len  ref                             rows  Extra
1   SIMPLE       celebrity  ALL   PRIMARY        NULL      NULL     NULL                            536   Using temporary; Using filesort
1   SIMPLE       images     ref   celeb_id       celeb_id  4        celeborama_ignite.celebrity.id  191
I'm at a loss at how to improve the performance in this query further. I'm not super familiar with MySQL, but I do know that it is slow because I am sorting on the data created by MAX() and that has no index. I can't not sort on that as it gives me the results needed, but is there something else I can do to prevent it from slowing down the query?
Thanks.
If you really need a fast solution, don't run such queries at request time.
Just add a last_image_id column to the celebrity table and update it whenever a new image is uploaded (by trigger or in your application logic, it doesn't matter).
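A minimal sketch of the trigger variant, assuming SQLite (via Python's sqlite3) rather than MySQL, so the trigger syntax differs slightly from what you would write there. The trigger name images_after_insert, the index name and the sample rows are made up, and the sketch only handles inserts; deleting an image would need its own handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE celebrity (id INTEGER PRIMARY KEY, name TEXT, last_image_id INTEGER);
    CREATE TABLE images (id INTEGER PRIMARY KEY, celeb_id INTEGER);
    -- Index so the listing below can read celebrities in last_image_id order.
    CREATE INDEX idx_last_image ON celebrity (last_image_id);

    -- Keep celebrity.last_image_id in step with newly inserted images.
    CREATE TRIGGER images_after_insert AFTER INSERT ON images
    BEGIN
        UPDATE celebrity
        SET last_image_id = NEW.id
        WHERE id = NEW.celeb_id
          AND (last_image_id IS NULL OR last_image_id < NEW.id);
    END;

    INSERT INTO celebrity (id, name) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    """
)

# Uploading images fires the trigger; celebrity 3 never gets an image.
conn.executemany(
    "INSERT INTO images (id, celeb_id) VALUES (?, ?)", [(1, 1), (2, 2), (3, 1)]
)

# The listing no longer needs GROUP BY or MAX(), only the denormalized column.
latest = conn.execute(
    """
    SELECT id, last_image_id
    FROM celebrity
    WHERE last_image_id IS NOT NULL
    ORDER BY last_image_id DESC
    LIMIT 20
    """
).fetchall()
print(latest)
```

The trade-off is a little extra work on every upload in exchange for a listing query that sorts on a plain, indexable column.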
I would get the latest image this way:
SELECT c.*, i.id AS image_id
FROM celebrity c
JOIN images i ON i.celeb_id = c.id
LEFT OUTER JOIN images i2 ON i2.celeb_id = c.id AND i2.id > i.id
WHERE i2.id IS NULL
ORDER BY image_id DESC
LIMIT 0,20;
In other words, try to find a row i2 for the same celebrity with a higher id than i.id. If the outer join fails to find that match, then i.id must be the max image id for the given celebrity.
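To see that the outer-join trick picks the same rows as MAX() with GROUP BY, here is a small sketch on toy data. It uses Python's sqlite3 as a stand-in for MySQL, and the table contents and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE celebrity (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE images (id INTEGER PRIMARY KEY, celeb_id INTEGER);
    CREATE INDEX idx_images_celeb ON images (celeb_id, id);
    INSERT INTO celebrity VALUES (1, 'A'), (2, 'B'), (3, 'C');
    -- Celebrity 3 has no images at all.
    INSERT INTO images VALUES (1, 1), (2, 2), (3, 1), (4, 2), (5, 1);
    """
)

# Outer-join form: keep image i only if no image i2 of the same celebrity is newer.
ANTI_JOIN_SQL = """
SELECT c.id, c.name, i.id AS image_id
FROM celebrity c
JOIN images i ON i.celeb_id = c.id
LEFT OUTER JOIN images i2 ON i2.celeb_id = c.id AND i2.id > i.id
WHERE i2.id IS NULL
ORDER BY image_id DESC
LIMIT 20
"""

# Aggregate form, for comparison.
GROUP_BY_SQL = """
SELECT c.id, c.name, MAX(i.id) AS image_id
FROM images i
JOIN celebrity c ON i.celeb_id = c.id
GROUP BY c.id, c.name
ORDER BY image_id DESC
LIMIT 20
"""

via_anti_join = conn.execute(ANTI_JOIN_SQL).fetchall()
via_group_by = conn.execute(GROUP_BY_SQL).fetchall()
print(via_anti_join)
print(via_group_by)
```

Both forms skip celebrities without images, because the first join is an inner join. Whether the outer-join form is actually faster on the real tables depends on the indexes, so compare the EXPLAIN output of both.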
SQL_CALC_FOUND_ROWS can cause queries to run extremely slowly. I've found some cases where just removing the SQL_CALC_FOUND_ROWS made the query run 200x faster (but it could also make only a small difference in other cases, it depends on the table, so you should test both ways).
If you need the equivalent of SQL_CALC_FOUND_ROWS, just run a separate query:
SELECT COUNT(*) FROM celebrity;
I think you need a compound index on (celeb_id, id) in table images (supposing it's a MyISAM table), so the GROUP BY celeb_id and MAX(id) can use this index.
But with big tables, you'll probably have to follow @zerkms's advice and add a new column to the celebrity table.
MySQL doesn't perform so well with joins. I would recommend dividing your query in two: in the first query select the celebrities, and then select the images. Simply avoid joins.
Check out this link - http://phpadvent.org/2011/a-stitch-in-time-saves-nine-by-paul-jones
SELECT STRAIGHT_JOIN *
FROM (
SELECT MAX(id) as maxID, celeb_id as id
FROM images
GROUP BY celeb_id
ORDER by maxID DESC
LIMIT 0, 20) as ids
JOIN celebrity USING (id);
This query does not allow the total row count to be precalculated, but an additional:
SELECT COUNT(DISTINCT celeb_id)
FROM images;
or even (if each celebrity has an image):
SELECT COUNT(*) FROM celebrity;
will not cost much, because the result can easily be cached by the query cache (if it is not switched off).