I am building a SQL query over a large set of data, but the query is too slow.
I've got 3 tables: movies, movie_categories, skipped_movies.
The movies table is normalized, and I am trying to query movies in a given category while excluding the ids in the skipped_movies table.
To do this I am using WHERE IN and WHERE NOT IN in my query.
movies table has approx. 2 million rows (id, name, score)
movie_categories approx. 5 million (id, movie_id, category_id)
skipped_movies has approx. 1k rows (id, movie_id, user_id)
When the skipped_movies table is very small (10 - 20 rows) the query is quite fast (about 40 - 50 ms), but when the table grows to around 1k rows the query takes somewhere around 7 to 8 seconds.
This is the query I'm using.
SELECT SQL_NO_CACHE *
FROM `movies`
WHERE `id` IN (SELECT `movie_id` FROM `movie_categories` WHERE `category_id` = 1)
  AND `id` NOT IN (SELECT `movie_id` FROM `skipped_movies` WHERE `user_id` = 1)
  AND `score` <= 9
ORDER BY `score` DESC
LIMIT 1;
I've tried many approaches that came to mind, but this was the fastest one. I even tried the EXISTS method, to no avail.
I'm using the SQL_NO_CACHE just for testing.
And I suspect that the ORDER BY is the part that is running slowly.
Assuming that (movie_id, category_id) is unique in the movie_categories table, I'd get the specified result using join operations rather than subqueries.
To exclude "skipped" movies, an anti-join pattern would suffice... that's a left outer join to find matching rows in skipped_movies, and then a predicate in the WHERE clause to exclude any matches found, leaving only rows that didn't have a match.
SELECT SQL_NO_CACHE m.*
FROM movies m
JOIN movie_categories c
  ON c.movie_id = m.id
 AND c.category_id = 1
LEFT JOIN skipped_movies s
  ON s.movie_id = m.id
 AND s.user_id = 1
WHERE s.movie_id IS NULL
  AND m.score <= 9
ORDER BY m.score DESC
LIMIT 1
And appropriate indexes will likely improve performance...
... ON movie_categories (category_id, movie_id)
... ON skipped_movies (user_id, movie_id)
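For example (the index names here are just placeholders; use whatever naming convention you already follow):
CREATE INDEX idx_mc_category_movie ON movie_categories (category_id, movie_id);
CREATE INDEX idx_sm_user_movie ON skipped_movies (user_id, movie_id);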
Most IN/NOT IN queries can be expressed using JOIN/LEFT JOIN, which usually gives the best performance.
Convert your query to use joins:
SELECT m.*
FROM movies m
JOIN movie_categories mc ON m.id = mc.movie_id AND mc.category_id = 1
LEFT JOIN skipped_movies sm ON m.id = sm.movie_id AND sm.user_id = 1
WHERE sm.movie_id IS NULL
AND m.score <= 9
ORDER BY m.score DESC
LIMIT 1
Your query seems to be all right; it just needs a small tweak. You can replace * with the column names you actually need. That can make the query faster, since SELECT * forces the server to fetch every column.
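For example, with the movies columns listed in the question (all three in this case, so the gain would only show once the table has more columns than you actually need):
SELECT SQL_NO_CACHE `id`, `name`, `score`
FROM `movies`
WHERE `id` IN (SELECT `movie_id` FROM `movie_categories` WHERE `category_id` = 1)
  AND `id` NOT IN (SELECT `movie_id` FROM `skipped_movies` WHERE `user_id` = 1)
  AND `score` <= 9
ORDER BY `score` DESC
LIMIT 1;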
Related
I am trying to limit a result set to 5 rows for each merchant_id by using HAVING in MySQL 5.7. Unfortunately this does not seem to work, and I cannot figure out why.
My SQL query joins three tables together and identifies the categories in which the manufacturer has a listing. I want to limit this list to 5 per merchant_id:
SELECT
    mcs.CAT_ID
FROM tbl1 mc
INNER JOIN tbl2 mcs ON mc.ID = mcs.CAT_ID
INNER JOIN tbl3 p ON mcs.ARTICLE_ID = p.SKU
WHERE
    p.MANUFACTURER_ID = 18670
GROUP BY
    mc.merchant_ID, mcs.CAT_ID
HAVING
    COUNT(mc.merchant_id) < 5
I was reading on SO that HAVING is executed without looking at the WHERE statement, but what would be the right way to limit this list?
You didn't provide the table schemas or dummy data, so I can't be sure about the exact query, but I'd use the following approach. Note that correlating a derived table with the outer query like this requires a LATERAL derived table (available from MySQL 8.0.14):
SELECT
    mc.merchant_id, t.CAT_ID
FROM tbl1 mc
INNER JOIN LATERAL (
    SELECT mcs.CAT_ID
    FROM tbl2 AS mcs
    WHERE mcs.CAT_ID = mc.ID
      AND EXISTS (
          SELECT 'x'
          FROM tbl3 AS p
          WHERE p.SKU = mcs.ARTICLE_ID
            AND p.MANUFACTURER_ID = 18670
      )
    LIMIT 5
) AS t ON TRUE;
With the subquery in the join I select all the CAT_IDs related to that mc.ID which have a listing for the selected manufacturer (18670), limited to 5 rows. In this way the limit of 5 is applied to each merchant_id.
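If you're stuck on MySQL 5.7 (as in the question), which has neither LATERAL derived tables nor window functions, a common workaround for a per-group limit numbers the rows with user variables. A rough sketch using the names from the question (user-variable evaluation order inside a SELECT is not guaranteed, and duplicate CAT_IDs per merchant would need deduplicating first, so treat this only as a starting point):
SELECT merchant_id, CAT_ID
FROM (
    SELECT
        mc.merchant_ID AS merchant_id,
        mcs.CAT_ID,
        @rn := IF(@prev = mc.merchant_ID, @rn + 1, 1) AS rn, -- row number within each merchant
        @prev := mc.merchant_ID AS prev
    FROM tbl1 mc
    INNER JOIN tbl2 mcs ON mc.ID = mcs.CAT_ID
    INNER JOIN tbl3 p ON mcs.ARTICLE_ID = p.SKU
    CROSS JOIN (SELECT @rn := 0, @prev := NULL) AS init -- initialize the variables
    WHERE p.MANUFACTURER_ID = 18670
    ORDER BY mc.merchant_ID
) AS numbered
WHERE rn <= 5;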
I need to join two tables (1M rows and 10M rows respectively)
Each table is filtered with a fulltext match condition :
SELECT SQL_NO_CACHE c.company_index
FROM dw.companies c INNER JOIN dw.people p
ON c.company_index = p.company_index
WHERE MATCH ( c.tag ) AGAINST ( 'ecommerce' IN BOOLEAN MODE )
AND MATCH ( p.title ) AGAINST ( 'director' IN BOOLEAN MODE )
ORDER BY c.company_index DESC ;
Both tables have fulltext indexes (on "tag" and "title")
The query time is more than 1 minute with both conditions.
With only one of the two conditions, the query time is below 1 sec.
How could I optimize this query ?
I think the problem is that FULLTEXT is very fast if it can be performed first, but very slow if not. With both MATCH tests in the original query, one MATCH can be performed first, but the other cannot.
Here's a messy idea on how to work around the problem.
SELECT c.company_index
FROM ( SELECT company_index FROM companies WHERE MATCH... ) AS c
JOIN ( SELECT company_index FROM people WHERE MATCH... ) AS p
    ON c.company_index = p.company_index
ORDER BY ...
What version of MySQL are you using? Newer versions will automatically create an index on one of the 'derived' tables, thereby making the JOIN quite efficient.
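Filled in with the MATCH conditions from the question, the first rewrite would look something like this:
SELECT c.company_index
FROM ( SELECT company_index
       FROM dw.companies
       WHERE MATCH ( tag ) AGAINST ( 'ecommerce' IN BOOLEAN MODE ) ) AS c
JOIN ( SELECT company_index
       FROM dw.people
       WHERE MATCH ( title ) AGAINST ( 'director' IN BOOLEAN MODE ) ) AS p
    ON c.company_index = p.company_index
ORDER BY c.company_index DESC;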
Here's another approach:
SELECT c.company_index
FROM ( SELECT company_index FROM companies WHERE MATCH... ) AS c
WHERE EXISTS ( SELECT 1 FROM people WHERE MATCH...
AND company_index = c.company_index )
ORDER BY ...
In both cases (I hope) one SELECT will use one FT index; the other will use the other, thereby getting the FT performance benefit.
An example table with data, along with the query, can be found at http://sqlfiddle.com/#!9/2e65dd/3
I'm interested in finding all distinct user_ids that don't have a certain record_type.
In my actual case this table is huge, with several million records in it, and it has an index on the user_id column. I'm planning to retrieve the results in batches by limiting the output to 1000 at a time.
select distinct user_id
from records o
where not exists (
    select *
    from records i
    where i.user_id = o.user_id and i.record_type = 3)
limit 0, 1000
Is there a better approach to achieve this?
I would do it this way:
SELECT u.user_id
FROM (SELECT DISTINCT user_id FROM records) AS u
LEFT OUTER JOIN records as r
ON u.user_id = r.user_id AND r.record_type = 3
WHERE r.user_id IS NULL
That avoids the correlated subquery in your NOT EXISTS solution.
Alternatively, you should have another table that just lists users, so you don't have to do the subquery:
SELECT u.user_id
FROM users AS u
LEFT OUTER JOIN records as r
ON u.user_id = r.user_id AND r.record_type = 3
WHERE r.user_id IS NULL
In either case, it would help optimize the JOIN to add a compound index on the pair of columns:
ALTER TABLE records ADD KEY (user_id, record_type)
I'd suggest a join as well, but mine would differ from Bill K's like so:
SELECT DISTINCT r.user_id
FROM records AS r
LEFT JOIN (SELECT DISTINCT user_id FROM records WHERE record_type = 3) AS rt3users
ON r.user_id = rt3users.user_id
WHERE rt3users.user_id IS NULL
;
However, here is an alternative I would not expect better performance from, but it is worth checking, since performance can vary based on the size and content of the data...
SELECT DISTINCT r.user_id
FROM records AS r
WHERE r.user_id NOT IN (
SELECT DISTINCT user_id
FROM records
WHERE record_type = 3
)
;
Note, this one is more similar to your original but does away with the correlated nature of the original subquery.
You could create a temporary table with the user_ids that have record_type equal to 3, like:
CREATE TEMPORARY TABLE tmp_users AS
SELECT DISTINCT user_id
FROM records
WHERE record_type = 3;
Then create a unique index (or primary key) on this table. Then your query would search indexes in both tables.
I can't say whether the performance would be better; you'd have to test it on your data.
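A minimal sketch of the rest of that approach, assuming the tmp_users name from above:
ALTER TABLE tmp_users ADD PRIMARY KEY (user_id);

-- anti-join against the temp table, as in the answers above
SELECT DISTINCT r.user_id
FROM records r
LEFT JOIN tmp_users t ON t.user_id = r.user_id
WHERE t.user_id IS NULL
LIMIT 0, 1000;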
If I have the following two tables:
Table "a" with 2 columns: id (int) [Primary Index], column1 [Indexed]
Table "b" with 3 columns: id_table_a (int),condition1 (int),condition2 (int) [all columns as Primary Index]
I can run the following query to select rows from Table a where Table b condition1 is 1
SELECT a.id FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.id_table_a = a.id && condition1 = 1 LIMIT 1)
ORDER BY a.column1 LIMIT 50
With a couple hundred million rows in both tables this query is very slow. If I do:
SELECT a.id FROM a
INNER JOIN b ON a.id = b.id_table_a && b.condition1 = 1
ORDER BY a.column1 LIMIT 50
It is pretty much instant, but if there are multiple rows in table b that match id_table_a then duplicates are returned. If I do a SELECT DISTINCT or GROUP BY a.id to remove the duplicates, the query becomes extremely slow.
Here is an SQLFiddle showing the example queries: http://sqlfiddle.com/#!9/35eb9e/10
Is there a way to make a join without duplicates fast in this case?
*Edited to show that INNER instead of LEFT join didn't make much of a difference
*Edited to show moving condition to join did not make much of a difference
*Edited to add LIMIT
*Edited to add ORDER BY
You can try an inner join and distinct:
SELECT distinct a.id
FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
but when using DISTINCT with SELECT *, be sure you are not applying DISTINCT to a unique id, which would return the wrong result; in that case use
SELECT distinct col1, col2, col3 ....
FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
You could also add a composite index that also covers condition1, e.g. KEY (id_table_a, condition1).
If you can, you could also run
ANALYZE TABLE table_name;
on both tables...
Another technique is to invert the lead table:
SELECT distinct a.id
FROM b INNER JOIN a ON a.id=b.id_table_a AND b.condition1=1
using the most selective table to lead the query.
Using this, the index usage seems to differ: http://sqlfiddle.com/#!9/35eb9e/15 (the last plan adds a "Using where").
# Using DISTINCT to remove duplicates, without extra columns or ORDER BY
EXPLAIN
SELECT DISTINCT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
;
It looks like I found the answer.
SELECT a.id FROM a
INNER JOIN b ON
b.id_table_a=a.id &&
b.condition1=1 &&
b.condition2=(select b.condition2 from b WHERE b.id_table_a=a.id && b.condition1=1 LIMIT 1)
ORDER BY a.column1
LIMIT 5;
I don't know if there is a flaw in this or not; please let me know if so. If anyone has a way to compress this somehow, I will gladly accept your answer.
SELECT id FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
Move the condition into the ON clause of the join; that way the index on table b can be used for filtering. Also use INNER JOIN instead of LEFT JOIN.
Then you should have fewer results that have to be grouped.
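A rough sketch of that combination, borrowing the ORDER BY and LIMIT from the question (a.id is the primary key, so a.column1 is functionally dependent on it and may appear in the ORDER BY):
SELECT a.id
FROM a
INNER JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
GROUP BY a.id
ORDER BY a.column1
LIMIT 50;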
Wrap the fast version in a query that handles de-duping and limit:
SELECT DISTINCT * FROM (
    SELECT a.id, a.column1
    FROM a
    JOIN b ON a.id = b.id_table_a && b.condition1 = 1
) x
ORDER BY column1
LIMIT 50
We know the inner query is fast. The de-duping and ordering have to happen somewhere. This way they happen on the smallest rowset possible.
See SQLFiddle.
Option 2:
Try the following:
Create indexes as follows:
create index a_id_column1 on a(id, column1)
create index b_id_table_a_condition1 on b(id_table_a, condition1)
These are covering indexes - ones that contain all the columns you need for the query, which in turn means that index-only access to data can achieve the result.
Then try this:
SELECT * FROM (
SELECT a.id, MIN(a.column1) column1
FROM a
JOIN b ON a.id = b.id_table_a
AND b.condition1 = 1
GROUP BY a.id) x
ORDER BY column1
LIMIT 50
Use your fast query in a subselect and remove the duplicates in the outer select:
SELECT DISTINCT sub.id
FROM (
SELECT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
WHERE b.id_table_a > :offset
ORDER BY a.column1
LIMIT 50
) sub
Because of the removal of duplicates you might get fewer than 50 rows. Just repeat the query until you have enough rows; start with :offset = 0 and use the last ID from the previous result as :offset in the following queries.
If you know your statistics, you can also use two limits. The limit in the inner query should be high enough to return 50 distinct rows with a probability that is high enough for you.
SELECT DISTINCT sub.id
FROM (
SELECT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
ORDER BY a.column1
LIMIT 1000
) sub
LIMIT 50
For example: if you have an average of 10 duplicates per ID, LIMIT 1000 in the inner query will return an average of 100 distinct rows. It's very unlikely that you would get fewer than 50 rows.
If the condition2 column is a boolean, you know that you can have a maximum of two duplicates. In this case LIMIT 100 in the inner query would be enough.
I have a table with 38k rows, and I use this query to compare item ids from the items table with item ids from the posted_domains table.
select * from `items`
where `items`.`source_id` = 2 and `items`.`source_id` is not null
and not exists (select *
from `posted_domains`
where `posted_domains`.`item_id` = `items`.`id` and `domain_id` = 1)
order by `item_created_at` asc limit 1
This query took 8 seconds. I don't know if the problem is my query or if my MySQL is badly configured. The query is generated by Laravel relations like:
$items->doesntHave('posted', 'and', function ($q) use ($domain) {
$q->where('domain_id', $domain->id);
});
Correlated subqueries can be rather slow (they are often executed repeatedly, once for each row in the outer query), so this might be faster:
select *
from `items`
where `items`.`source_id` = 2
and `items`.`source_id` is not null
and `items`.`id` not in (
select DISTINCT item_id
from `posted_domains`
where `domain_id` = 1)
order by `item_created_at` asc
limit 1
I say might because subqueries in the WHERE clause are also rather slow in MySQL.
This LEFT JOIN will probably be the fastest.
select *
from `items`
LEFT JOIN (
select DISTINCT item_id
from `posted_domains`
where `domain_id` = 1) AS subQ
ON items.id = subQ.item_id
where `items`.`source_id` = 2
and `items`.`source_id` is not null
and subQ.item_id is null
order by `item_created_at` asc
limit 1;
Since it is a no-matches scenario, it technically doesn't even need to be a subquery, and it might be faster as a direct left join; but that will depend on indexes, and possibly on the actual data values.
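A sketch of that direct left join variant, keeping the table and column names from the question:
select `items`.*
from `items`
left join `posted_domains` pd
  on pd.item_id = `items`.`id`
  and pd.domain_id = 1 -- push the domain filter into the join
where `items`.`source_id` = 2
  and pd.item_id is null -- keep only items with no posted_domains match
order by `item_created_at` asc
limit 1;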