If I have the following two tables:
Table "a" with 2 columns: id (int) [Primary Index], column1 [Indexed]
Table "b" with 3 columns: id_table_a (int),condition1 (int),condition2 (int) [all columns as Primary Index]
I can run the following query to select rows from Table a where Table b condition1 is 1
SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id_table_a=a.id && condition1=1 LIMIT 1) ORDER BY a.column1 LIMIT 50
With a couple hundred million rows in both tables this query is very slow. If I do:
SELECT a.id FROM a INNER JOIN b ON a.id=b.id_table_a && b.condition1=1 ORDER BY a.column1 LIMIT 50
It is pretty much instant but if there are multiple matching rows in table b that match id_table_a then duplicates are returned. If I do a SELECT DISTINCT or GROUP BY a.id to remove duplicates the query becomes extremely slow.
Here is an SQLFiddle showing the example queries: http://sqlfiddle.com/#!9/35eb9e/10
Is there a way to make a join without duplicates fast in this case?
*Edited to show that INNER instead of LEFT join didn't make much of a difference
*Edited to show moving condition to join did not make much of a difference
*Edited to add LIMIT
*Edited to add ORDER BY
You can try an inner join with DISTINCT:
SELECT distinct a.id
FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
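If you need more columns, be careful with DISTINCT on SELECT *: it applies to every selected column, ids included, and can return the wrong result. In that case list only the columns that should define a distinct row: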
SELECT distinct col1, col2, col3 ....
FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
You could also add a composite index on b that includes condition1, e.g. key(id_table_a, condition1).
If you can, you could also run
ANALYZE TABLE table_name;
on both tables.
Another technique is to reverse the leading table:
SELECT distinct a.id
FROM b INNER JOIN a ON a.id=b.id_table_a AND b.condition1=1
so that the most selective table leads the query.
With this form the index usage looks different: http://sqlfiddle.com/#!9/35eb9e/15 (the last query adds a "Using where").
# USING DISTINCT TO REMOVE DUPLICATES without col and order
EXPLAIN
SELECT DISTINCT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
;
It looks like I found the answer.
SELECT a.id FROM a
INNER JOIN b ON
b.id_table_a=a.id &&
b.condition1=1 &&
b.condition2=(select b.condition2 from b WHERE b.id_table_a=a.id && b.condition1=1 LIMIT 1)
ORDER BY a.column1
LIMIT 5;
I don't know if there is a flaw in this or not, please let me know if so. If anyone has a way to compress this somehow I will gladly accept your answer.
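One way to look for a flaw is to compare the query against the EXISTS version on a small data set. The sketch below does that with Python's built-in sqlite3 module standing in for MySQL (an assumption made only to keep the example self-contained; SQLite needs AND instead of &&). The sample rows and the alias b2 are made up for illustration. Because (id_table_a, condition1, condition2) is the primary key of b, pinning condition2 to one value leaves at most one matching b row per a.id, so duplicates should not appear.

```python
import sqlite3

# SQLite is a stand-in for MySQL here; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, column1 INTEGER);
    CREATE INDEX a_column1 ON a (column1);
    CREATE TABLE b (
        id_table_a INTEGER, condition1 INTEGER, condition2 INTEGER,
        PRIMARY KEY (id_table_a, condition1, condition2)
    );
    INSERT INTO a VALUES (1, 30), (2, 10), (3, 20), (4, 40);
    -- id 1 matches twice, id 2 once, id 3 only with condition1 = 2, id 4 never
    INSERT INTO b VALUES (1, 1, 1), (1, 1, 2), (2, 1, 1), (3, 2, 1);
""")

# Reference result: the slow but obviously correct EXISTS form.
exists_ids = [row[0] for row in conn.execute("""
    SELECT a.id FROM a
    WHERE EXISTS (SELECT 1 FROM b
                  WHERE b.id_table_a = a.id AND b.condition1 = 1)
    ORDER BY a.column1
    LIMIT 50
""")]

# The join that pins condition2 to a single value per a.id.
pinned_ids = [row[0] for row in conn.execute("""
    SELECT a.id FROM a
    INNER JOIN b ON b.id_table_a = a.id
        AND b.condition1 = 1
        AND b.condition2 = (SELECT b2.condition2 FROM b AS b2
                            WHERE b2.id_table_a = a.id AND b2.condition1 = 1
                            LIMIT 1)
    ORDER BY a.column1
    LIMIT 50
""")]

print(exists_ids)
print(pinned_ids)
```

On this data both queries return the same ids in the same order. That only checks correctness; whether it stays fast on hundreds of millions of rows still has to be confirmed with EXPLAIN on the real server.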
SELECT id FROM a INNER JOIN b ON a.id=b.id_table_a AND b.condition1=1
Move the condition into the ON clause of the join so that the index on table b can be used to filter. Also use INNER JOIN rather than LEFT JOIN.
Then there should be fewer rows left to group.
Wrap the fast version in a query that handles de-duping and limit:
SELECT DISTINCT * FROM (
SELECT a.id, a.column1
FROM a
JOIN b ON a.id = b.id_table_a && b.condition1 = 1
) x
ORDER BY column1
LIMIT 50
We know the inner query is fast. The de-duping and ordering has to happen somewhere. This way it happens on the smallest rowset possible.
See SQLFiddle.
Option 2:
Try the following:
Create indexes as follows:
create index a_id_column1 on a(id, column1)
create index b_id_table_a_condition1 on b(id_table_a, condition1)
These are covering indexes - ones that contain all the columns you need for the query, which in turn means that index-only access to data can achieve the result.
Then try this:
SELECT * FROM (
SELECT a.id, MIN(a.column1) column1
FROM a
JOIN b ON a.id = b.id_table_a
AND b.condition1 = 1
GROUP BY a.id) x
ORDER BY column1
LIMIT 50
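To see what the GROUP BY wrapper does to the duplicates, here is a small sketch that runs the plain join and the grouped version side by side. It uses Python's sqlite3 module as a stand-in for MySQL (an assumption for the sake of a runnable example), and the sample rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, column1 INTEGER);
    CREATE TABLE b (
        id_table_a INTEGER, condition1 INTEGER, condition2 INTEGER,
        PRIMARY KEY (id_table_a, condition1, condition2)
    );
    CREATE INDEX a_id_column1 ON a (id, column1);
    CREATE INDEX b_id_table_a_condition1 ON b (id_table_a, condition1);
    INSERT INTO a VALUES (1, 30), (2, 10), (3, 20);
    -- id 1 has two matching rows in b, so the plain join repeats it
    INSERT INTO b VALUES (1, 1, 1), (1, 1, 2), (2, 1, 1), (3, 2, 1);
""")

plain_join = [row[0] for row in conn.execute("""
    SELECT a.id
    FROM a
    JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
    ORDER BY a.column1
    LIMIT 50
""")]

grouped = [row[0] for row in conn.execute("""
    SELECT id FROM (
        SELECT a.id, MIN(a.column1) AS column1
        FROM a
        JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
        GROUP BY a.id
    ) x
    ORDER BY column1
    LIMIT 50
""")]

print("plain join:", plain_join)
print("grouped:   ", grouped)
```

The grouped form keeps one row per a.id while preserving the column1 ordering. MIN(a.column1) is only there to carry column1 through the GROUP BY; since id is the primary key of a, every row in a group has the same column1 anyway.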
Use your fast query in a subselect and remove the duplicates in the outer select:
SELECT DISTINCT sub.id
FROM (
SELECT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
WHERE b.id_table_a > :offset
ORDER BY a.column1
LIMIT 50
) sub
Because duplicates are removed, you might get fewer than 50 rows. Just repeat the query until you have enough rows. Start with :offset = 0, and use the last ID from the previous result as :offset in the following queries.
If you know your statistics, you can also use two limits. The limit in the inner query should be high enough to return 50 distinct rows with a probability which is high enough for you.
SELECT DISTINCT sub.id
FROM (
SELECT a.id
FROM a
INNER JOIN b ON a.id=b.id_table_a && b.condition1=1
ORDER BY a.column1
LIMIT 1000
) sub
LIMIT 50
For example: if you have an average of 10 duplicates per ID, LIMIT 1000 in the inner query will return an average of 100 distinct rows. It's very unlikely that you get fewer than 50 rows.
If the condition2 column is a boolean, you know that you can have a maximum of two duplicates. In this case LIMIT 100 in the inner query would be enough.
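The effect of the two limits can be tried out on toy data. The sketch below is an illustration only: it uses Python's sqlite3 module in place of MySQL, the helper name distinct_ids is hypothetical, and the data is generated so that every id has exactly 3 matching rows in b.

```python
import sqlite3

# SQLite stands in for MySQL; 20 ids, each with exactly 3 matching b rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, column1 INTEGER);
    CREATE TABLE b (
        id_table_a INTEGER, condition1 INTEGER, condition2 INTEGER,
        PRIMARY KEY (id_table_a, condition1, condition2)
    );
""")
conn.executemany("INSERT INTO a VALUES (?, ?)",
                 [(i, 100 - i) for i in range(1, 21)])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)",
                 [(i, 1, c2) for i in range(1, 21) for c2 in (1, 2, 3)])


def distinct_ids(inner_limit, outer_limit):
    """Run the two-limit query (hypothetical helper) and return the ids."""
    rows = conn.execute("""
        SELECT DISTINCT sub.id
        FROM (
            SELECT a.id
            FROM a
            INNER JOIN b ON a.id = b.id_table_a AND b.condition1 = 1
            ORDER BY a.column1
            LIMIT ?
        ) sub
        LIMIT ?
    """, (inner_limit, outer_limit))
    return [row[0] for row in rows]


# 12 joined rows cover 4 distinct ids, enough to fill an outer limit of 3.
enough = distinct_ids(12, 3)
# 6 joined rows cover only 2 distinct ids, so the result comes up short.
too_few = distinct_ids(6, 3)

print(enough)
print(too_few)
```

With 3 duplicates per id, an inner limit of 3 times the outer limit is always enough; with a smaller inner limit the query silently returns fewer rows, which is the case the retry loop or a generous inner limit has to cover.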
Table A:
ID, Name, etc.
Table B:
ID, TableA-ID.
SELECT * FROM A;
I want to return a boolean value in the same result that says whether A.ID exists in Table B.
There are several ways of achieving what you need. Below are three possibilities. They differ in their execution plans and in how the database chooses to execute them, so depending on your record count one may be more efficient than another. It's best to test them yourself.
1) Use LEFT JOIN and check whether a NOT NULL column from B is non-null to ensure the record exists. Then apply a DISTINCT clause if the relationship is 1:N, so that rows from A are shown without duplicates.
select distinct a.*, b.id is not null as exists_b
from a
left join b on
a.id = b.tablea-id
2) Use an EXISTS subquery expression, which will be evaluated for each row being returned from table A.
select a.*, exists(select 1 from b where a.id = b.tablea-id) as exists_b
from a
3) Use the subquery expression EXISTS and its negation in two queries to check whether a record does or does not have a match in table B. Then use UNION ALL to combine both results into one.
select *, true as exists_b
from a
where exists (
select 1
from b
where a.id = b.tablea-id
)
union all
select *, false as exists_b
from a
where not exists (
select 1
from b
where a.id = b.tablea-id
)
select A.*, IFNULL((select 1 from B where B.TableA-ID = A.ID limit 1),0) as `exists` from A;
The above statement returns 1 if the key exists and 0 if it does not. LIMIT 1 is important if there are multiple matching records in B.
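A quick way to convince yourself that the EXISTS form and the LEFT JOIN form agree is to run both on a few rows. This sketch uses Python's sqlite3 module rather than the original database, and it renames the TableA-ID column to tablea_id (an assumption, since a hyphen in an unquoted column name would be read as subtraction).

```python
import sqlite3

# SQLite stand-in; tablea_id is a hypothetical rename of the TableA-ID column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (id INTEGER PRIMARY KEY, tablea_id INTEGER);
    INSERT INTO a VALUES (1, 'one'), (2, 'two'), (3, 'three');
    -- a.id 1 is referenced twice, a.id 3 once, a.id 2 never
    INSERT INTO b VALUES (10, 1), (11, 1), (12, 3);
""")

with_exists = conn.execute("""
    SELECT a.id,
           EXISTS (SELECT 1 FROM b WHERE b.tablea_id = a.id) AS exists_b
    FROM a
    ORDER BY a.id
""").fetchall()

with_left_join = conn.execute("""
    SELECT DISTINCT a.id, b.id IS NOT NULL AS exists_b
    FROM a
    LEFT JOIN b ON a.id = b.tablea_id
    ORDER BY a.id
""").fetchall()

print(with_exists)
print(with_left_join)
```

Both return one row per row of a, with the flag as 1 or 0. The DISTINCT in the LEFT JOIN form is what collapses the two b rows that point at a.id 1.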
I want to use the same subquery multiple times in a UNION. The subquery is time-consuming, and I think that running it several times will increase the total execution time.
For example
(SELECT * FROM (SELECT * FROM A INNER JOIN B ... AND SOME COMPLEX WHERE CONDITIONS) as T ORDER BY column1 DESC LIMIT 10)
UNION
(SELECT * FROM (SELECT * FROM A INNER JOIN B ... AND SOME COMPLEX WHERE CONDITIONS) as T ORDER BY column2 DESC LIMIT 10)
UNION
(SELECT * FROM (SELECT * FROM A INNER JOIN B ... AND SOME COMPLEX WHERE CONDITIONS) as T ORDER BY column3 DESC LIMIT 10)
Is the (SELECT * FROM A INNER JOIN B ... AND SOME COMPLEX WHERE CONDITIONS) subquery executed 3 times?
If MySQL is smart enough, the inner subquery will be executed only once and I don't need any optimization; if not, I have to optimize it some other way (for example with a temporary table, but I want to avoid that).
Do I have to rewrite this query with a different syntax? Any suggestions?
In practice I want to filter some data out of a huge number of records and get part of it in 3 sections, each section in a different order.
Plan A:
A TEMPORARY TABLE cannot be referenced more than once in the same query. So, build a permanent table and DROP it when finished. (If you might have multiple connections doing the same thing, it will be a hassle to make sure you are not using the same table name.)
Plan B:
With MySQL 8.0, you can do
WITH T AS ( SELECT ... )
SELECT ... FROM T ORDER BY col1
UNION ...
Plan C:
If it is possible to do this:
SELECT id FROM A
ORDER BY col1 LIMIT 10
You could use that as a 'derived' table inside
(SELECT * FROM A INNER JOIN B ... AND SOME COMPLEX WHERE CONDITIONS)
Something like
SELECT A.*, B.*
FROM ( SELECT id FROM A
ORDER BY col1 LIMIT 10 ) AS x1
JOIN A USING(id)
JOIN B ... AND SOME COMPLEX WHERE CONDITIONS
Similarly for the other two SELECTs, then UNION them together.
Better yet, UNION together the 3 sets of ids, then JOIN to A and B once.
This may have the advantage of dealing with fewer rows.
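As a rough illustration of Plan B, the sketch below writes the shared subquery once as a CTE and reuses it in three top-N branches. It runs on Python's sqlite3 module rather than MySQL 8.0, and the join column b.a_id, the filter b.flag = 1 and the sample rows are all placeholders for the elided join and the "complex WHERE conditions". SQLite does not accept parenthesized UNION branches, so each branch is wrapped in a derived table instead.

```python
import sqlite3

# SQLite stand-in for MySQL 8.0; a_id and flag are hypothetical placeholders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY,
                    column1 INTEGER, column2 INTEGER, column3 INTEGER);
    CREATE TABLE b (a_id INTEGER, flag INTEGER);
    INSERT INTO a VALUES
        (1, 9, 1, 10), (2, 8, 2, 2), (3, 1, 9, 3), (4, 2, 8, 4),
        (5, 3, 3, 9), (6, 4, 4, 8), (7, 100, 100, 100);
    -- row 7 would top every ordering but is filtered out by flag = 0
    INSERT INTO b VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 0);
""")

rows = conn.execute("""
    WITH t AS (
        SELECT a.id, a.column1, a.column2, a.column3
        FROM a
        INNER JOIN b ON b.a_id = a.id
        WHERE b.flag = 1
    )
    SELECT * FROM (SELECT * FROM t ORDER BY column1 DESC LIMIT 2)
    UNION
    SELECT * FROM (SELECT * FROM t ORDER BY column2 DESC LIMIT 2)
    UNION
    SELECT * FROM (SELECT * FROM t ORDER BY column3 DESC LIMIT 2)
""").fetchall()

# UNION gives no ordering guarantee, so sort on the client for display.
ids = sorted(row[0] for row in rows)
print(ids)
```

Row 1 is in the top 2 for both column1 and column3, and UNION returns it only once. Writing the subquery once does not by itself prove it is evaluated once; that is up to the optimizer, so it is worth checking EXPLAIN on the real server.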
I have two tables with some data (> 300_000 rows), and this simple query takes ~1 second.
Any idea to make it faster?
SELECT a.*
FROM a
INNER JOIN b on (a.b_id = b.id)
WHERE b.some_int_column = 2
ORDER BY a.id DESC
LIMIT 0,10
Both a.b_id and b.some_int_column have indexes. Also, a.id and b.id are integer primary keys.
When I run EXPLAIN, it says it first uses the some_int_column index, with "Using temporary" and "Using filesort".
If I run the same query but order by b.id ASC, it takes ~0.2 ms instead (I know this is because in that case I'm ordering by the table in the first EXPLAIN row), but I really need to order by a column of table a.
Is there something I am missing?
For this query:
SELECT a.*
FROM a INNER JOIN
b
ON a.b_id = b.id
WHERE b.some_int_column = 2
ORDER BY a.id DESC
LIMIT 0, 10;
The optimal indexes are likely to be b(some_int_column, id) and a(b_id, id).
You might find that this version has better performance with these indexes:
SELECT a.*
FROM a
WHERE EXISTS (SELECT 1
FROM b
WHERE a.b_id = b.id AND b.some_int_column = 2
)
ORDER BY a.id DESC
LIMIT 0, 10;
For this query, the indexes should be a(id, b_id) and b(id, some_int_column).
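Before comparing timings, it helps to confirm that the two forms return the same rows. A minimal sketch, using Python's sqlite3 module as a stand-in for MySQL with made-up rows and index names; it creates the indexes suggested for the join version and runs both queries. Since b.id is a primary key, each row of a matches at most one row of b, so the join cannot produce duplicates.

```python
import sqlite3

# SQLite stand-in for MySQL; index names and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE b (id INTEGER PRIMARY KEY, some_int_column INTEGER);
    CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER);
    CREATE INDEX b_some_int_id ON b (some_int_column, id);
    CREATE INDEX a_b_id_id ON a (b_id, id);
    INSERT INTO b VALUES (1, 2), (2, 5), (3, 2);
    INSERT INTO a VALUES (1, 1), (2, 2), (3, 3), (4, 1), (5, 2), (6, 3);
""")

join_sql = """
    SELECT a.id
    FROM a
    INNER JOIN b ON a.b_id = b.id
    WHERE b.some_int_column = 2
    ORDER BY a.id DESC
    LIMIT 10
"""
exists_sql = """
    SELECT a.id
    FROM a
    WHERE EXISTS (SELECT 1 FROM b
                  WHERE a.b_id = b.id AND b.some_int_column = 2)
    ORDER BY a.id DESC
    LIMIT 10
"""

join_ids = [row[0] for row in conn.execute(join_sql)]
exists_ids = [row[0] for row in conn.execute(exists_sql)]
print(join_ids, exists_ids)

# SQLite's plan output differs from MySQL's EXPLAIN; shown only as a hint
# of how to inspect which table leads and which index is used.
for row in conn.execute("EXPLAIN QUERY PLAN " + join_sql):
    print(row)
```

The plan printed here is SQLite's and says nothing about what MySQL will choose; on the real server, compare EXPLAIN for both forms with each pair of indexes in place.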
SELECT a.*
FROM b
INNER JOIN a on (b.id = a.b_id)
WHERE b.some_int_column = 2
ORDER BY a.id DESC
LIMIT 0,10
Try this, because you are filtering on a column in table b, not a column in table a. This may reduce the volume of data read. Depending on the SQL optimizer, the original form may match up all records in the join and only then filter for some_int_column = 2. With the tables reversed, the optimizer may match to table a only those records in table b that satisfy the = 2 condition in your WHERE clause.
I am wondering how to group by a field that has both a select count() and a count() statement. I know that we have to put all selected fields in the GROUP BY, but it won't let me do so because of the second count() in the field.
create table C as(
select a.id, a.date_id,
(select count(b.hits)*1.00 where b.hits >= '9')/count(b.hits) AS percent **<--error here
from A a join B b
on a.id = b.id
group by 1,2,3) with no data primary index(id);
This is my error:
[SQLState HY000] GROUP BY and WITH...BY clauses may not contain
aggregate functions. Error Code: 3625
When I add a select to the second count in the third line, I only get 1 or 0, which is not right.
((select count(b.hits)*1.00 where b.hits >= '9')/(select count(b.hits))) AS percent
Do I need to do a self join instead, or is there any way I can just use nested queries?
You need to fix the group by. But, you can probably simplify the query as:
create table C as
select a.id, a.date_id,
avg(case when b.hits >= 9 then 1.00 else 0 end) as percent
from A a join
B b
on a.id = b.id
group by a.id, a.date_id
with no data primary index(id);
It looks like you only need to group on 2 columns, not 3, plus you shouldn't need a sub-select:
create table C as(
select a.id, a.date_id,
SUM(CASE WHEN b.hits >= '9' THEN 1.00 ELSE 0 END)/COUNT(b.hits) AS percent
from A a join B b
on a.id = b.id
group by 1,2) with no data primary index(id);
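The conditional-aggregation idea can be tried on a few rows. The sketch below uses Python's sqlite3 module instead of Teradata (an assumption, so the CREATE TABLE ... WITH NO DATA wrapper is left out), with invented sample data. It also runs an all-integer variant, because in SQLite, as in many databases, dividing one integer by another truncates the result.

```python
import sqlite3

# SQLite stand-in for Teradata; rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, date_id INTEGER);
    CREATE TABLE b (id INTEGER, hits INTEGER);
    INSERT INTO a VALUES (1, 1), (2, 2);
    -- id 1: 2 of 4 rows have hits >= 9; id 2: 1 of 1
    INSERT INTO b VALUES (1, 9), (1, 10), (1, 3), (1, 5), (2, 9);
""")

# Decimal numerator: the ratio keeps its fractional part.
decimal_ratio = conn.execute("""
    SELECT a.id, a.date_id,
           SUM(CASE WHEN b.hits >= 9 THEN 1.0 ELSE 0 END) / COUNT(b.hits)
               AS percent
    FROM a JOIN b ON a.id = b.id
    GROUP BY a.id, a.date_id
    ORDER BY a.id
""").fetchall()

# Integer numerator: integer division truncates 2 / 4 to 0.
integer_ratio = conn.execute("""
    SELECT a.id, a.date_id,
           SUM(CASE WHEN b.hits >= 9 THEN 1 ELSE 0 END) / COUNT(b.hits)
               AS percent
    FROM a JOIN b ON a.id = b.id
    GROUP BY a.id, a.date_id
    ORDER BY a.id
""").fetchall()

print(decimal_ratio)
print(integer_ratio)
```

The integer variant only ever yields 0 or 1 here, which matches the symptom described in the question; keeping one operand decimal (as the original *1.00 did) avoids it.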
I have 3 tables, say A, B, C.
Table A has a userid column.
Table B has a caid column.
Table C has lisid and image columns.
One userid can have one or several caids.
One caid can have one or several lisids.
How do I select the userid that has the maximum number of rows where the image column is not null (for some lisids the image column is blank, and for others it has a value)?
Can someone please help?
Presumably, the ids are spread among the tables in a reasonable fashion. If so, the following should do this:
select b.userid, count(*)
from TableB b join
TableC c
on b.caid = c.caid
where c.image is not null
group by b.userid
order by count(*) desc
limit 1
The question in the comments is how you connect TableA to TableB and TableB to TableC. The reasonable approach is to have the userid in TableB and the caid in TableC.
Getting all the rows with the max requires a bit more work. Essentially, you have to join to the above query to get the list:
select s.*
from (select b.userid, count(*) as cnt
from TableB b join
TableC c
on b.caid = c.caid
where c.image is not null
group by b.userid
) s join
(select count(*) as maxcnt
from TableB b join
TableC c
on b.caid = c.caid
where c.image is not null
group by b.userid
order by count(*) desc
limit 1
) smax
on s.cnt = smax.maxcnt
Other databases have a set of functions called window functions/ranking functions that make this sort of query much simpler. Alas, MySQL before 8.0 does not offer these.
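For the "all rows with the max" case, here is a small sketch of joining the per-user counts to the single maximum count. It assumes, as above, that userid lives in TableB and caid in TableC, treats a blank image as NULL, and uses Python's sqlite3 module in place of MySQL; the sample rows are invented so that two users tie.

```python
import sqlite3

# SQLite stand-in for MySQL; schema placement of userid/caid is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableB (userid INTEGER, caid INTEGER);
    CREATE TABLE TableC (caid INTEGER, lisid INTEGER, image TEXT);
    INSERT INTO TableB VALUES (1, 10), (1, 11), (2, 20), (3, 30);
    -- user 1 and user 2 each have 2 non-null images, user 3 has none
    INSERT INTO TableC VALUES
        (10, 100, 'x'), (10, 101, NULL), (11, 102, 'y'),
        (20, 200, 'z'), (20, 201, 'w'),
        (30, 300, NULL);
""")

top_users = conn.execute("""
    SELECT s.userid, s.cnt
    FROM (SELECT b.userid, COUNT(*) AS cnt
          FROM TableB b
          JOIN TableC c ON b.caid = c.caid
          WHERE c.image IS NOT NULL
          GROUP BY b.userid) s
    JOIN (SELECT COUNT(*) AS maxcnt
          FROM TableB b
          JOIN TableC c ON b.caid = c.caid
          WHERE c.image IS NOT NULL
          GROUP BY b.userid
          ORDER BY COUNT(*) DESC
          LIMIT 1) smax
      ON s.cnt = smax.maxcnt
    ORDER BY s.userid
""").fetchall()

print(top_users)
```

Both tied users come back, whereas the LIMIT 1 version would return only one of them. If blank images are stored as empty strings rather than NULL, the WHERE clause would also need image <> ''.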