I have two MySQL queries that run very fast, but when I combine them the new query is very slow.
Fast (<1 second, 15 results):
SELECT DISTINCT
Id, Name, Company_Id
FROM people
where Company_Id in (5295, 1834)
and match(Locations) against('austin')
Fast (<1 second, 2970 results):
select distinct Company_Id from technologies
where match(Name) against('elastic')
and Company_Id is not null
When I combine the two like this:
SELECT DISTINCT Id, Name, Company_Id
FROM people
where Company_Id in
( select Company_Id from technologies
where match(Name) against('elastic')
and Company_Id is not null
)
and match(Locations) against('austin')
The resulting query takes over 2 minutes to complete and returns 278 rows.
I've tried rewriting the slow query a few ways. One other example is like this:
SELECT DISTINCT
`Extent1`.`Id`, `Extent1`.`Name`, `Extent1`.`Company_Id`
FROM `people` AS `Extent1`
INNER JOIN `technologies` AS `Extent2`
ON (`Extent1`.`Company_Id` = `Extent2`.`Company_Id`)
WHERE (`Extent1`.`Company_Id` IS NOT NULL)
AND ((match(`Extent1`.`Locations`) against('austin'))
AND (match(`Extent2`.`Name`) against('elastic')))
I'm using MySQL 5.7 on Windows. I have full-text indexes on the Name and Locations columns. My InnoDB buffer usage never goes above 40%. I tried to use MySQL Workbench to look at the execution plan, but it shows "Explain data not available for statement".
Please let me know if you see anything I could improve or try. Thank you.
IN ( SELECT ... ) is poorly optimized, at least in older versions of MySQL. What version are you using?
When using a FULLTEXT index (MATCH...), that part is performed first, if possible. This is because the FT lookup is nearly always faster than whatever else is going on.
But when using two fulltext queries, it picks one, then can't use fulltext on the other.
Here's one possible workaround:
Have an extra table for searches. It includes both Name and Locations.
Have FULLTEXT(Name, Locations)
MATCH (Name, Locations) AGAINST ('+austin +elastic' IN BOOLEAN MODE)
If necessary, AND that with something to verify that it is not, for example, finding a person named 'Austin'.
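A minimal sketch of that workaround, assuming a hypothetical people_search table kept in sync with people and technologies (all names and types here are assumptions):

CREATE TABLE people_search (
    person_id INT NOT NULL PRIMARY KEY,
    Name TEXT,          -- technology names for the person's company
    Locations TEXT,     -- copied from people.Locations
    FULLTEXT KEY ft_name_loc (Name, Locations)
) ENGINE=InnoDB;

-- A single FULLTEXT lookup replaces the two separate MATCHes:
SELECT person_id
FROM people_search
WHERE MATCH(Name, Locations) AGAINST ('+austin +elastic' IN BOOLEAN MODE);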
Another possibility:
5.7 (or 5.6?) might be able to optimize this by creating indexes on the subqueries:
SELECT ...
FROM ( SELECT Company_Id FROM ... MATCH(Name) ... ) AS x
JOIN ( SELECT Company_Id FROM ... MATCH(Locations) ... ) AS y
USING(Company_id);
Provide the EXPLAIN; I am hoping to see <auto-key>.
Test that. If it is 'fast', then you may need to add on another JOIN and/or WHERE. (I am unclear what your ultimate query needs to be.)
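Filled in with the tables from the question, that sketch might look like this (untested; it assumes the optimizer materializes both subqueries with auto-generated keys):

SELECT DISTINCT p.Id, p.Name, p.Company_Id
FROM ( SELECT Id, Name, Company_Id
       FROM people
       WHERE MATCH(Locations) AGAINST ('austin') ) AS p
JOIN ( SELECT Company_Id
       FROM technologies
       WHERE MATCH(Name) AGAINST ('elastic')
         AND Company_Id IS NOT NULL ) AS t
  USING (Company_Id);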
Write the query with the subquery in the from clause:
select distinct p.Id, p.Name, p.Company_Id
from people p join
(select Company_Id
from technologies
where match(Name) against('elastic') and Company_Id is not null
) t
on p.Company_Id = t.Company_Id
where match(p.Locations) against ('austin');
I suspect that you have a problem with your data structure, though. You should have a CompanyLocations table, rather than storing locations in a list in the table.
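A sketch of that normalization, with assumed names and types:

CREATE TABLE company_locations (
    Company_Id INT NOT NULL,
    City VARCHAR(100) NOT NULL,   -- one row per company/location pair
    PRIMARY KEY (Company_Id, City)
);

The location filter then becomes an ordinary indexed lookup instead of a fulltext search over a list column.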
I have a common aggregation query:
SELECT
products.type,
count(products.id)
FROM
products
INNER JOIN product_colors
ON products.id = product_colors.product_id
AND product_colors.is_active = 1
AND product_colors.is_archive = 0
WHERE
(products.is_active = 1
AND product_colors.is_individual = 0
AND product_colors.is_visible = 1)
GROUP BY
type
It runs in the order of 0.1 seconds. The indexes look fine, tmp_table_size = 128M and max_heap_table_size = 128M. Why is it so slow? Plain SELECTs are fast, but with the GROUP BY and COUNT it is not.
(Screenshots of the indexes on the products and product_colors tables and of the EXPLAIN output were attached here.)
Your indexing is not optimal for what you are asking. Instead of just having an index on each column individually (which can be a big waste), you should have composite indexes that better match what you are trying to query and are covering enough to handle any GROUP BY or ordering.
In this case, your primary query is for ACTIVE products grouped by type. So I would have a SINGLE index on your primary table on (is_active, type, id). This way, your WHERE criterion comes first via is_active, then your grouping via type, and finally the id that qualifies the record. The query can then get all it needs from the INDEX and does not have to go to the raw data pages.
Now, your secondary table should similarly have a composite index: first the criteria of the join between the tables, then the restrictions you are looking for, thus (product_id, is_active, is_archive). Why you have two columns, one for is_active and another for is_archive, I don't know. I would think that if something were in the archives, it would not be active to begin with, but that is just a guess.
Anyhow, the optimized indexes should help.
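In MySQL syntax, those two indexes might look like this (the index names are made up):

ALTER TABLE products ADD INDEX idx_active_type_id (is_active, type, id);
ALTER TABLE product_colors ADD INDEX idx_pid_active_archive (product_id, is_active, is_archive);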
One last consideration, on your count(products.id): do you intend DISTINCT products, or all records found? If one product has 8 colors, do you want that id counted as 1 or 8?
count(*) would give 8
count(distinct products.id) would give 1
Give these a try:
products: INDEX(is_active, type)
product_colors: INDEX(product_id, is_individual, is_visible, is_active, is_archive)
Since products.id cannot be NULL, you may as well say COUNT(*) instead of COUNT(products.id). (Or, as DRapp points out, maybe you need COUNT(DISTINCT products.id).)
I have an issue creating tables using the SELECT keyword (it runs very slowly). The query is to take only the details of the animal with the latest entry date; that query will be used in an inner join with another query.
SELECT *
FROM amusementPart a
INNER JOIN (
SELECT DISTINCT name, type, cageID, dateOfEntry
FROM bigRegistrations
GROUP BY cageID
) r ON a.type = r.cageID
But because of the slow performance, someone suggested steps to improve it: 1) use a temporary table, 2) store the result and join it to the other statement.
use myzoo
CREATE TABLE animalRegistrations AS
SELECT DISTINCT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Unfortunately, it is still slow. If I only run the SELECT statement, the result is shown in 1-2 seconds, but if I add the CREATE TABLE, the query takes ages (approx. 25 minutes).
Any good approach to improve the query time?
Edit: the bigRegistrations table has around 3.5 million rows.
Can you please try the query below? It achieves "take only the details of the animal with the latest entry date; that query will be used to inner join another query". The query you are using is not fetching records per your requirement, and this one will also be faster:
SELECT a.*, b.name, b.type, b.cageID, b.dateOfEntry
FROM amusementPart a
INNER JOIN bigRegistrations b ON a.type = b.cageID
INNER JOIN (SELECT c.cageID, max(c.dateOfEntry) dateofEntry
FROM bigRegistrations c
GROUP BY c.cageID) t ON t.cageID = b.cageID AND t.dateofEntry = b.dateofEntry
Suggested indexing on cageID and dateofEntry
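For example (the index name is an assumption):

ALTER TABLE bigRegistrations ADD INDEX idx_cage_entry (cageID, dateOfEntry);

With this composite index, the max(dateOfEntry)-per-cageID subquery can be resolved from the index alone.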
This is a multipart question.
Use Temporary Table
Don't use DISTINCT - group by all columns to make rows distinct (don't forget to check for an index)
Check the SQL Execution plans
Here you are not creating a temporary table. Try the following...
CREATE TEMPORARY TABLE IF NOT EXISTS animalRegistrations AS
SELECT name, type, cageID, MAX(dateOfEntry) as entryDate
FROM bigRegistrations
GROUP BY cageID
Have you tried doing an explain to see how the plan is different from one execution to the next?
Also, I have found that there can be locking issues in some databases when doing INSERT ... SELECT and table creation using SELECT. I ran this in MySQL, and it solved some deadlock issues I was having.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
The reason the query runs so slowly is probably that it is creating the temp table based on all 3.5 million rows, when really you only need a subset of those, i.e. the bigRegistrations that match your join to amusementPart. The first single SELECT statement is faster because SQL is smart enough to know it only needs to calculate the bigRegistrations where a.type = r.cageID.
I'd suggest that you don't need a temp table; your first query is quite simple. Rather, you may just need an index. You can determine this manually by studying the estimated execution plan, or by running your query through the database tuning advisor. My guess is you need to create an index similar to the one below. Notice I index by cageID first since that is what you join to amusementPart, so that would help SQL narrow the results down the quickest. But I'm guessing a bit - view the query plan or tuning advisor to be sure.
CREATE NONCLUSTERED INDEX IX_bigRegistrations ON bigRegistrations
(cageId, name, type, dateOfEntry)
Also, if you want the animal with the latest entry date, I think you want this query instead of the one you're using. I'm assuming the PK is all 4 columns.
SELECT name, type, cageID, dateOfEntry
FROM bigRegistrations BR
WHERE BR.dateOfEntry =
(SELECT MAX(BR1.dateOfEntry)
FROM bigRegistrations BR1
WHERE BR1.name = BR.name
AND BR1.type = BR.type
AND BR1.cageID = BR.cageID)
[Summary of the question: two SQL statements produce the same results but at different speeds. One statement uses JOIN, the other uses IN. JOIN is faster than IN.]
I tried two kinds of SELECT statements on two tables, named booking_record and inclusions. The inclusions table has a many-to-one relation with the booking_record table.
(Table definitions not included for simplicity.)
First statement: (using IN clause)
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
Second statement: (using JOIN)
SELECT
id,
agent,
source
FROM
booking_record
JOIN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
) inclusions
ON
id = foreign_key_booking_record
With 300,000+ rows in the booking_record table and 6,100,000+ rows in the inclusions table, the second statement delivered 127 rows in just 0.08 seconds, but the first statement took nearly 21 minutes for the same records.
Why is JOIN so much faster than the IN clause?
This behavior is well-documented. See here.
The short answer is that until MySQL version 5.6.6, MySQL did a poor job of optimizing these types of queries. The subquery would be run once for every row in the outer query - lots and lots of overhead, running the same query over and over. You could improve this by using good indexing and removing the DISTINCT from the IN subquery.
This is one of the reasons that I prefer exists instead of in, if you care about performance.
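For example, the first query rewritten with EXISTS (a sketch using the tables from the question):

SELECT id, agent, source
FROM booking_record br
WHERE EXISTS
    ( SELECT 1
      FROM inclusions i
      WHERE i.foreign_key_booking_record = br.id
        AND i.foreign_key_bill IS NULL
        AND i.invoice_closure <> FALSE );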
EXPLAIN should give you some clues (see the MySQL EXPLAIN syntax documentation).
I suspect that the IN version is constructing a list which is then scanned for each item (IN is generally considered a very inefficient construct; I only use it if I have a short list of items to enter manually).
The JOIN is more likely constructing a temp table for the results, making it more like normal JOINs between tables.
You should explore this by using EXPLAIN, as said by Ollie.
But in advance, note that the second command has one more filter: id = foreign_key_booking_record.
Check if this has the same performance:
SELECT
id,
agent,
source
FROM
booking_record
WHERE
id IN
( SELECT DISTINCT
foreign_key_booking_record
FROM
inclusions
WHERE
id = foreign_key_booking_record -- new filter
AND
foreign_key_bill IS NULL
AND
invoice_closure <> FALSE
)
I have a problem with this query:
SELECT DISTINCT s.city, pc.start, pc.end
FROM postal_codes pc LEFT JOIN suspects s ON (s.postalcode BETWEEN pc.start AND pc.end)
WHERE pc.user_id = "username"
ORDER BY pc.start
The suspects table has about 340,000 entries, and there is an index on postalcode. I have several users, but this individual query takes about 0.5 s. When I run this SQL with EXPLAIN, I get something like this: http://my.jetscreenshot.com/7536/20111225-myhj-41kb.jpg - do these NULLs mean that the query isn't using an index? The index is a BTREE, so I think this should run a little faster.
Can you please help me with this? If any other information is needed, just let me know.
Edit: I have indexes on suspects.postalcode, postal_codes.start, postal_codes.end, postal_codes.user_id.
Basically what I'm trying to achieve: I have a table where each user ID has multiple postalcode ranges assigned, so it looks like:
user_id | start | end
Then I have a table of suspects where each suspect has an address (which contains a postalcode), so in this query I'm trying to get the postalcode range (start and end) and also the name of the city in that range.
Hope this helps.
Whenever a LEFT JOIN is used, all the records of the first table are picked up rather than being selected on the basis of an index. I would suggest using an INNER JOIN, something like the query below.
select distinct
    s.city,
    pc.start,
    pc.end
from postal_codes pc
inner join suspects s
    on s.postalcode between pc.start and pc.end
where pc.user_id = "username"
order by pc.start
It's using only one index, and not for the fields involved in the join. Try creating an index on the start and end fields, or using >= and <= instead of BETWEEN.
Not 100% sure, but this might be relevant:
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.) However, if such a query uses LIMIT to retrieve only some of the rows, MySQL uses an index anyway, because it can much more quickly find the few rows to return in the result.
So try testing with LIMIT, and if it uses the index then, you found your cause.
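For example, appending a LIMIT to the original query and re-checking the plan:

SELECT DISTINCT s.city, pc.start, pc.end
FROM postal_codes pc LEFT JOIN suspects s ON (s.postalcode BETWEEN pc.start AND pc.end)
WHERE pc.user_id = "username"
ORDER BY pc.start
LIMIT 10;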
I have to say I'm a little confused by your table naming convention; I would expect the suspects table to have a user_id, not the postal_codes table, but you must have your reasons. If you were to leave this query as it is, you can add an index on postal_codes (start, end) to avoid the complete table scan.
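In MySQL syntax, that index would be something like this (the index name is made up):

ALTER TABLE postal_codes ADD INDEX idx_start_end (`start`, `end`);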
I think you can restructure your query like the following:
SELECT DISTINCT s.city, pc1.start, pc1.end FROM
(SELECT pc.start, pc.end FROM postal_codes pc WHERE pc.user_id = "username") AS pc1, suspects s
WHERE s.postalcode BETWEEN pc1.start AND pc1.end ORDER BY pc1.start
Your query is not picking up the index on the suspects table because of the LEFT JOIN and your BETWEEN condition. Having an index on a table doesn't necessarily mean it will be used in every query.
Try FORCE INDEX.
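A sketch of that, assuming the postalcode index on suspects is named idx_postalcode (the name is an assumption):

SELECT DISTINCT s.city, pc.start, pc.end
FROM postal_codes pc
LEFT JOIN suspects s FORCE INDEX (idx_postalcode)
    ON s.postalcode BETWEEN pc.start AND pc.end
WHERE pc.user_id = "username"
ORDER BY pc.start;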
If I have a table
CREATE TABLE users (
id int(10) unsigned NOT NULL auto_increment,
name varchar(255) NOT NULL,
profession varchar(255) NOT NULL,
employer varchar(255) NOT NULL,
PRIMARY KEY (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?
They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!
If you have an index on profession, these two are synonyms.
If you don't, then use DISTINCT.
GROUP BY in MySQL sorts results. You can even do:
SELECT u.profession FROM users u GROUP BY u.profession DESC
and get your professions sorted in DESC order.
DISTINCT creates a temporary table and uses it for storing duplicates. GROUP BY does the same, but sorts the distinct results afterwards.
So
SELECT DISTINCT u.profession FROM users u
is faster, if you don't have an index on profession.
All of the answers above are correct, for the case of DISTINCT on a single column vs GROUP BY on a single column.
Every db engine has its own implementation and optimizations, and if you care about the very little difference (in most cases) then you have to test against specific server AND specific version! As implementations may change...
BUT, if you select more than one column in the query, then the DISTINCT is essentially different! Because in this case it will compare ALL columns of all rows, instead of just one column.
So if you have something like:
-- This will NOT return rows unique by [id], but rows unique by (id, name)
SELECT DISTINCT id, name FROM some_query_with_joins
-- This will select rows unique by [id]
SELECT id, name FROM some_query_with_joins GROUP BY id
It is a common mistake to think that the DISTINCT keyword distinguishes rows by the first column you specified, but DISTINCT is a general keyword in this manner.
So be careful not to take the answers above as correct for all cases... you might get confused and get the wrong results when all you wanted was to optimize!
Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!
Well, DISTINCT can be slower than GROUP BY on some occasions in Postgres (I don't know about other DBs).
tested example:
postgres=# select count(*) from (select distinct i from g) a;
count
10001
(1 row)
Time: 1563,109 ms
postgres=# select count(*) from (select i from g group by i) a;
count
10001
(1 row)
Time: 594,481 ms
http://www.pgsql.cz/index.php/PostgreSQL_SQL_Tricks_I
so be careful ... :)
GROUP BY is more expensive than DISTINCT, since GROUP BY sorts the result while DISTINCT avoids it. But if you want GROUP BY to yield the same result as DISTINCT, add ORDER BY NULL:
SELECT DISTINCT u.profession FROM users u
is equal to
SELECT u.profession FROM users u GROUP BY u.profession order by null
It seems that the queries are not exactly the same. At least for MySQL.
Compare:
describe select distinct productname from northwind.products
describe select productname from northwind.products group by productname
The second query gives additionally "Using filesort" in Extra.
In MySQL, "Group By" uses an extra step: filesort. I realized DISTINCT is faster than GROUP BY, and that was a surprise.
After heavy testing we came to the conclusion that GROUP BY is faster:
SELECT sql_no_cache
opnamegroep_intern
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 total, 0.0944 seconds
Showing records 0 - 29 (635 total, query took 0.0484 sec)
SELECT sql_no_cache
distinct (opnamegroep_intern)
FROM telwerken
WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 total, 0.2117 seconds (almost 100% slower)
Showing records 0 - 29 (635 total, query took 0.3468 sec)
(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)
Here is a simple approach which will print the elapsed time of each query.
DECLARE @t1 DATETIME;
DECLARE @t2 DATETIME;
SET @t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
SET @t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
OR try SET STATISTICS TIME (Transact-SQL)
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.
SELECT DISTINCT will always be the same speed as, or faster than, the equivalent GROUP BY. On some systems (e.g. Oracle), GROUP BY might be optimized to be the same as DISTINCT for most queries. On others (such as SQL Server), DISTINCT can be considerably faster.
This is not a rule.
For each query, try DISTINCT and then GROUP BY separately, compare the time each takes to complete, and use the faster one.
In my project I sometimes use GROUP BY and sometimes DISTINCT.
If you don't have to do any group functions (SUM, AVG, etc., in case you want to add numeric data to the result), use SELECT DISTINCT. I suspect it's faster, but I have nothing to show for it.
In any case, if you're worried about speed, create an index on the column.
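For example, using the users table from the question:

ALTER TABLE users ADD INDEX idx_profession (profession);

With that index in place, both the DISTINCT and the GROUP BY versions can be satisfied by scanning the index.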
If the problem allows it, try EXISTS, since it's optimized to stop as soon as a result is found (and doesn't buffer any response). So, if you are just trying to normalize data for a WHERE clause like this:
SELECT * FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep same cardinality
a faster response would be:
SELECT * FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.
In MySQL I have found that GROUP BY will treat NULL as distinct, while DISTINCT does not.
I took the exact same DISTINCT query, removed the DISTINCT, added the selected fields as the GROUP BY, and got many more rows due to one of the fields being NULL.
So I tend to believe that there is more to DISTINCT in MySQL.