In my basic understanding of an index, the index is used on a column in a WHERE clause. Since the HAVING clause is similar to a WHERE clause applied after a GROUP BY statement, does an index have the same effect on that? For example:
SELECT * FROM table WHERE full_name = 'Bob Jones'
--> index on full_name would be beneficial here
and
SELECT * FROM table WHERE first_name = 'Bob'
GROUP BY height HAVING height > 72
In this second query, would an index on both first_name and height improve the performance? Which index would be more important, or are they roughly equivalent? Also, do indexes improve GROUP BY performance as well (regardless of a HAVING)?
A HAVING clause is essentially the last thing done to filter a query's results before they're sent off to the client. It's only useful if you need to filter on the result of an aggregate function, whose value is not available during the row-level filtering that WHERE clauses perform.
Essentially, a HAVING clause can be seen as applying another query, turning your main query into a subquery.
e.g.
SELECT ...
FROM sometable
HAVING somefield = X
is really no different than
SELECT *
FROM (
SELECT ...
FROM sometable
) AS t
WHERE somefield = X
If the field you're filtering on in the HAVING is NOT a derived field (an aggregate value, calculated field, etc.), then you're almost certainly better off doing the filtering at the WHERE level, which keeps unnecessary rows from being loaded off disk in the first place.
Since HAVING is applied last, rows will be loaded from disk and then possibly discarded if they don't match the HAVING criteria.
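As a sketch (the table and columns here are hypothetical), the two queries below return the same rows, but only the second lets an index on city keep non-matching rows from being read at all:
-- HAVING on a plain (non-aggregate) column: every row is read first,
-- then discarded if it fails the check.
SELECT name, city
FROM customers
HAVING city = 'Boston';
-- The same filter in WHERE: an index on city can be used to skip
-- non-matching rows entirely.
SELECT name, city
FROM customers
WHERE city = 'Boston';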
Related
I have a 40M record table having id as the primary key. I execute a select statement as follows:
select * from messages where (some condition) order by id desc limit 20;
It is OK and the query executes in a reasonable time. But when I add an always-true condition as follows, it takes a huge amount of time.
select * from messages where id > 0 and (some condition) order by id desc limit 20;
I guess it is a bug that makes MySQL search from the top of the table instead of the bottom. If there is any other explanation or optimization, it would be a great help.
p.s. with high probability, the results are found in the last 10% of the records in my table.
p.p.s. the "some condition" is like col1 = x1 AND col2 = x2, where col1 and col2 are indexed.
MySQL has to choose whether to use an index to process the WHERE clause, or use an index to control ORDER BY ... LIMIT ....
In the first query, the WHERE clause can't make effective use of an index, so the optimizer prefers to use the primary key index to scan the rows in ORDER BY order. In this case it stops as soon as it finds 20 rows that satisfy the WHERE condition.
In the second query, the id > 0 condition in the WHERE clause can make use of the index, so it prefers to use that instead of using the index for ORDER BY. In this case, it has to find all the results that match the WHERE condition, and then sort them by id.
I wouldn't really call this a bug, as there's no specification of precisely how a query should be optimized. It's not always easy for the query planner to determine the best way to make use of indexes. Using the index to filter the rows using WHERE id > x could be better if there aren't many rows that match that condition.
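To see which plan MySQL actually picks in each case, you can prefix both statements with EXPLAIN (just a sketch; the chosen key and row estimates depend entirely on your data and indexes):
EXPLAIN select * from messages where (some condition) order by id desc limit 20;
EXPLAIN select * from messages where id > 0 and (some condition) order by id desc limit 20;
The first should show the PRIMARY key being used to walk the rows in id order; the second will typically show a different key followed by a sort.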
A query like this
select *
from messages
where col1 = x1
and col2 = x2
order by id desc
limit 20;
is best handled by a 'composite' index with the tests for '=' first:
INDEX(col1, col2, id)
Try it, I think it will be faster than either of the queries you are working with.
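For example, the composite index could be added like this (the index name is just illustrative):
ALTER TABLE messages ADD INDEX idx_col1_col2_id (col1, col2, id);
With that in place, MySQL can satisfy the two equality tests and the ORDER BY id DESC ... LIMIT 20 from the index alone, stopping after 20 rows.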
It's not a bug. You are searching through a 40-million-row table where your WHERE clause isn't backed by an index. Add an index on the column(s) in your WHERE clause and you will see a substantial improvement.
I have this query
SELECT `PR_CODIGO`, `PR_EXIBIR`, `PR_NOME`, `PRC_DETALHES` FROM `PROPRIETARIOS` LEFT JOIN `PROPRIETARIOSCONTATOS` ON `PROPRIETARIOSCONTATOS`.`PRC_COD_CAD` = `PROPRIETARIOS`.`PR_CODIGO` WHERE `PR_EXIBIR` = 'T' LIMIT 20
It runs very fast, less than 1 second.
If I add a GROUP BY, it takes several seconds (5+) to run, even though the GROUP BY field is indexed.
I'm using GROUP BY because the query above returns repeated rows (I search for a name and its contacts in another table, and the same name shows up 4 times).
How do i fix this?
With the GROUP BY clause, the LIMIT clause isn't applied until after the rows are collapsed by the group by operation.
To get an understanding of the operations that MySQL is performing and which indexes are being considered and chosen by the optimizer, we use EXPLAIN.
Unstated in the question is which "field" (columns or expressions) appears in the GROUP BY clause, so we are only guessing.
Based on the query shown in the question...
SELECT pr.pr_codigo
, pr.pr_exibir
, pr.pr_nome
, prc.prc_detalhes
FROM `PROPRIETARIOS` pr
LEFT
JOIN `PROPRIETARIOSCONTATOS` prc
ON prc.prc_cod_cad = pr.pr_codigo
WHERE pr.pr_exibir = 'T'
LIMIT 20
Our guess at the most appropriate indexes...
... ON PROPRIETARIOSCONTATOS (prc_cod_cad, prc_detalhes)
... ON PROPRIETARIOS (pr_exibir, pr_codigo, pr_nome)
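Written out in full, the first of those might look like this (the index name is just an example; if prc_detalhes is a TEXT column, it would need a prefix length such as prc_detalhes(100)):
CREATE INDEX prc_cod_cad_detalhes_ix
    ON PROPRIETARIOSCONTATOS (prc_cod_cad, prc_detalhes);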
Our guess is going to change depending on what column(s) are listed in the GROUP BY clause. And we might also suggest an alternative query to return an equivalent result.
But without knowing the GROUP BY clause, without knowing if our guesses about which table each column is from are correct, without knowing the column datatypes, without any estimates of cardinality, and without example data and expected output, ... we're flying blind and just making guesses.
I'm trying to get the column order in our indexes set correctly and haven't seen a direct answer on this. If we have a query like the following
SELECT ... all the things ...
FROM tb_contact
inner join tb_contact_association on tb_contact.id = tb_contact_association.attached_id
where tb_contact_association.contact_id = '498'
order by ...
We're looking at a pivot table, tb_contact_association, in this join. And this table is never really queried without looking at both attached_id (on the join) and contact_id (the WHERE).
When creating an index for tb_contact_association, should the index cover both "attached_id,contact_id" in that order? With the joined on first, then the where? Or the other way around? Or each of them individually?
Thanks.
Generally, for equality conditions like the ones below, the ordering of fields in an index doesn't matter, IF you use the appropriate fields.
e.g. for a query like:
SELECT .. WHERE f1 = 'a' AND f2 = 'b' AND f3 = 'c'
INDEX(f3, f2, f1) - index can be used
INDEX(f1, f3, f2) - can be used
INDEX(f1, f2, f3) - can be used
INDEX(f1, f3) - completely usable
INDEX(f3, f1) - completely usable
INDEX(f4, f1) - cannot be used - no 'f4' field in the where clause
INDEX(f1, f4) - can be used, because 'f1' is in the where clause, but the f4 component will be ignored
The actual ordering of the WHERE clause doesn't matter. WHERE f1 = 'a' AND f2 = 'b' vs. WHERE f2 = 'b' AND f1 = 'a' are identical as far as the query compiler/optimizer is concerned.
The indexes needed depend on which direction the join will run. You can determine this by running an EXPLAIN on your select statement. In this case though, since your WHERE clause is filtering on the tb_contact_association table, the optimizer will most likely start with this table and join into the tb_contact table.
The exception would be if tb_contact is small (few rows) compared to tb_contact_association. To see why this is the case, consider an extreme example. If tb_contact is only one row long, it's obviously going to be faster to start from that row, join into the corresponding row in the tb_contact_association table, and test its value for contact_id, rather than go through the whole larger tb_contact_association table looking for contact_id=498 (even with an index), and then joining back to the tb_contact table.
But, for any normal tables, the query above would start with tb_contact_association. For a join, you need an index on the column you're joining to. In this case, that's tb_contact.id. You'll also want an index to help your WHERE clause, ie on tb_contact_association.contact_id.
You don't actually need an index on tb_contact_association.attached_id for this particular query, as long as the join always goes in the direction we expect. A composite index on (contact_id, attached_id) (in that order) in tb_contact_association should be a slight help, because it will allow all necessary info for that table to be pulled directly from the index, saving a read from the data table for each row. (With this index added, you should see "using index" in the extra section of the query EXPLAIN.) The contact_id column is used for the WHERE clause, just as with a single index on that column, but with the composite index, it can then just read attached_id straight from the index, rather than from the table.
Most likely, both fields should have an index. However, in this query, only contact_id needs an index; Nathan's answer explains why in more detail.
The optimal index for your specific query would be (contact_id, attached_id).
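A minimal statement for that (the index name is just an example):
CREATE INDEX contact_attached_ix
    ON tb_contact_association (contact_id, attached_id);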
I have two questions here, but I am asking them at once as I think they are interrelated.
I am working with a complex query (multiple joins + subqueries) and the table is pretty huge as well (around 200,000 records in this table).
A part of this query (a LEFT JOIN) is required to find the record that has the second-lowest value in a certain column among all the records associated with the primary key of the first table. For now I have isolated this part and am thinking along the lines of:
SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1;
But there is a case where, if there is only 1 record in the table, it must return that record instead of NULL. So my first question is: how do I write a query for this?
Secondly, considering the size of the table and the time it's already taking to run even after creating indexes, I understand that adding any more complexity to it in order to achieve the above part might affect the querying time dramatically.
I cannot decompose joins because I need to get some of the columns for the ORDER BY clause (the application has an option to sort the result by these columns, the above column "myvalue" being one of them)
What would be the way(s) to approach this problem ?
Thanks
Something like this might work
SELECT COALESCE(
  (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1),
  (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 0,1)
) AS second_lowest_id;
It selects the first non-null value from the list provided.
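If the goal is the second-lowest (or only) row per record of an outer table, the same expression can be used as a correlated subquery. A sketch, where parent_table, parent_id, and the alias p are assumptions about your schema:
SELECT p.id,
       COALESCE(
         (SELECT t.id FROM tbl t WHERE t.parent_id = p.id
           ORDER BY t.`myvalue` ASC LIMIT 1,1),
         (SELECT t.id FROM tbl t WHERE t.parent_id = p.id
           ORDER BY t.`myvalue` ASC LIMIT 0,1)
       ) AS second_lowest_id
FROM parent_table p;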
As for the complexity of the query, post the whole thing so we can take a look at it.
This is going to be one of those questions but I need to ask it.
I have a large table which may or may not have one unique row. I therefore need a MySQL query that will just tell me TRUE or FALSE.
With my current knowledge, I see two options (pseudo code):
[id = primary key]
OPTION 1:
SELECT id FROM table WHERE x=1 LIMIT 1
... and then determine in PHP whether a result was returned.
OPTION 2:
SELECT COUNT(id) FROM table WHERE x=1
... and then just use the count.
Is either of these preferable for any reason, or is there perhaps an even better solution?
Thanks.
If the selection criterion is truly unique (i.e. yields at most one result), you are going to see massive performance improvement by having an index on the column (or columns) involved in that criterion.
create index my_unique_index on table(x)
If you want to enforce the uniqueness, that index isn't even optional; you must have
create unique index my_unique_index on table(x)
With this index in place, querying on the unique criterion will perform very well, regardless of minor SQL tweaks like count(*), count(id), count(x), LIMIT 1, and so on.
For clarity, I would write
select count(*) from table where x = ?
I would avoid LIMIT 1 for two other reasons:
It is non-standard SQL. I am not religious about that; use the MySQL-specific stuff where necessary (e.g. for paging data), but it is not necessary here.
If for some reason, you have more than one row of data, that is probably a serious bug in your application. With LIMIT 1, you are never going to see the problem. This is like counting dinosaurs in Jurassic Park with the assumption that the number can only possibly go down.
AFAIK, if you have an index on your ID column, both queries will have more or less equal performance. The second query will need one less line of code in your program, but that's not going to make any performance impact either.
Personally, I typically do the first one: selecting the id from the row and limiting to 1 row. I like this better from a coding perspective: instead of having to actually retrieve the data, I just check the number of rows returned.
If I were to compare speeds, I would say not doing a count in MySQL would be faster. I don't have any proof, but my guess would be that MySQL has to get all of the rows and then count how many there are. Although... on second thought, it would have to do that in the first option as well, so the code knows how many rows there are. But since you have COUNT(id) vs COUNT(*), I would say it might be slightly slower.
Intuitively, the first one could be faster since it can abort the table (or index) scan when it finds the first matching value. But you should retrieve x, not id, since if the engine is using an index on x, it doesn't need to go to the block where the row actually is.
Another option could be:
select exists(select 1 from mytable where x = ?) from dual
Which already returns a boolean.
Typically, you use a GROUP BY ... HAVING clause to determine whether there are duplicate rows in a table. Suppose you have a table with an id and a name (assuming id is the primary key, and you want to know whether name is unique or repeated). You would use
select name, count(*) as total from mytable group by name having total > 1;
The above will return the names that are repeated, along with how many times each one occurs.
If you just want one query to get your answer as true or false, you can use a nested query, e.g.
select if(count(*) >= 1, True, False) from (select name, count(*) as total from mytable group by name having total > 1) a;
The above should return true if your table has duplicate rows, and false otherwise.