I have a question about using GROUP BY in MySQL: does the order of the columns in the GROUP BY clause affect query efficiency?
1.SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `appve`, `er_ct` ORDER BY `c79fd348-cc8e-41f2-ae93-0b2b2cde8a31` DESC limit 5;
2.SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `er_ct`,`appve` ORDER BY `c79fd348-cc8e-41f2-ae93-0b2b2cde8a31` DESC limit 5;
The only difference between the two statements is the grouping order: "GROUP BY appve, er_ct" versus "GROUP BY er_ct, appve". There is no index (combined or otherwise) on appve and er_ct. The value of "SELECT COUNT(DISTINCT er_ct) FROM TBL_547" is 7000. The value of "SELECT COUNT(DISTINCT appve) FROM TBL_547" is 3.
Here is a screenshot of the timings: http://i.stack.imgur.com/AeQy2.png
The table structure: http://i.stack.imgur.com/ewgAy.png
Thanks.
Creating an index on a column used in GROUP BY will not, by itself, boost your results. When you run a query, the SQL statement first gets compiled into a tree of relational algebra operations. These operations each take one or more tables as input and produce another table as output. Then, using the output table, the SQL engine applies any other operations:
- aggregation (GROUP BY)
- sorting
So you can speed up your query mostly by:
- writing smart queries, e.g. filtering only on indexed columns.
- ensuring your result set is not huge and not accessing all joined table columns; SELECT * is total overkill in production.
I would also recommend SQL Tuning as further reading. I hope my answer helps.
The first thing that pops into my mind is the number of distinct values in each column; you mentioned 3 and 7000, and I assume that is the main factor.
Query optimizers change all the time, but roughly: when the optimizer sees that the first grouping column is small, it will just go with the flow; if it sees that the first column is large (7000 distinct values), it may decide to build a temporary index on it. That operation on a large column can be slow, which is why you see two different times for the two queries.
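One hedged way to check this yourself (a sketch only; the table and column names are taken from the question, and whether the optimizer can use the index at all depends on the UAEWA_ts range predicate) is to add a combined index matching one grouping order and compare the two plans with EXPLAIN:
ALTER TABLE TBL_547 ADD INDEX idx_appve_erct (`appve`, `er_ct`);
EXPLAIN SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `appve`, `er_ct`;
EXPLAIN SELECT SQL_NO_CACHE `er_ct`, `appve` FROM TBL_547 WHERE UAEWA_ts >= 1417276800 AND UAEWA_ts <= 1417449540 GROUP BY `er_ct`, `appve`;
If one plan shows "Using temporary; Using filesort" in the Extra column and the other does not, the grouping order is what changes the plan.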
Related
There are two examples.
In the first example, the query returns results faster when using ORDER BY (according to phpMyAdmin's speed report).
In the other example, where I don't use ORDER BY, it returns results more slowly (according to phpMyAdmin's speed report).
Isn't it unreasonable that it gives quicker results when using ORDER BY?
The ordering doesn't matter to me; it's the speed that matters.
select bayi,tutar
from siparisler
where durum='1' and MONTH(tarih) = MONTH(CURDATE()) and YEAR(tarih) = YEAR(CURRENT_DATE())
order by id desc
Speed: 0.0006
select bayi,tutar
from siparisler
where durum='1' and MONTH(tarih) = MONTH(CURDATE()) and YEAR(tarih) = YEAR(CURRENT_DATE())
Speed: 0.7785
An ORDER BY query will never execute faster than the same query without the ORDER BY clause. Sorting rows incurs more work for the database. In the best-case scenario, the sort becomes a no-op because MySQL fetched the rows in the correct order in the first place: but that just makes the two queries equivalent in terms of performance (it does not make the query that sorts faster).
Possibly, the results of the ORDER BY query were cached already, so MySQL gives you the result directly from the cache rather than actually executing the query.
If performance is what matters most to you, let me suggest changing the WHERE predicate so that it does not apply date functions to the tarih column: such a construct prevents the database from taking advantage of an index (we say the predicate is non-SARGable). Consider:
select bayi, tutar
from siparisler
where
durum = 1
and tarih >= date_format(current_date, '%Y-%m-01')
and tarih < date_format(current_date, '%Y-%m-01') + interval 1 month
order by id desc
For performance with this query, consider an index on (durum, tarih, id desc, bayi, tutar): it should behave as a covering index, which MySQL can use to execute the entire query without even looking at the actual table data.
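A sketch of that index (column names from the query above; note that descending index keys are only honored as of MySQL 8.0, and earlier versions parse DESC but ignore it):
CREATE INDEX ix_siparisler_covering ON siparisler (durum, tarih, id DESC, bayi, tutar);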
At 0.0006s, you are almost certainly measuring the performance of the query_cache rather than the execution time. Try both queries again with SELECT SQL_NO_CACHE and see what the performance difference is.
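For example (this only matters on pre-8.0 servers; MySQL 8.0 removed the query cache entirely):
SELECT SQL_NO_CACHE bayi, tutar
FROM siparisler
WHERE durum='1' AND MONTH(tarih) = MONTH(CURDATE()) AND YEAR(tarih) = YEAR(CURRENT_DATE())
ORDER BY id DESC;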
First, I recommend writing the query as:
select bayi, tutar
from siparisler p
where durum = 1 and -- no quotes assuming this is an integer
tarih >= curdate() - interval (day(curdate()) - 1) day;
This can take advantage of an index on (durum, tarih).
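A minimal sketch of that index, assuming the table implied by the question:
CREATE INDEX ix_siparisler_durum_tarih ON siparisler (durum, tarih);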
But that isn't your question. It is possible that the order by could result in a radically different execution plan. This is hypothetical, but the intention is to explain how this might occur.
Let me assume the following:
The table only has an index on (id desc, durum, tarih).
The where clause matches few rows.
The rows are quite wide.
The query without the order by would probably generate an execution plan that is a full table scan. Because the rows are wide, lots of unnecessary data would be read.
The query with the order by could read the data in order and then apply the where conditions. This would be faster than the other version, because only the rows that match the where conditions would be read in.
I cannot guarantee that this is happening. But there are some counterintuitive situations that arise with queries.
You can analyze it through the EXPLAIN command: check the value of the type column, e.g. index versus ALL.
Example:
EXPLAIN SELECT bayi,tutar
FROM siparisler
WHERE durum='1' AND MONTH(tarih) = MONTH(CURDATE()) AND YEAR(tarih) = YEAR(CURRENT_DATE())
ORDER BY id DESC;
I have this query
SELECT `PR_CODIGO`, `PR_EXIBIR`, `PR_NOME`, `PRC_DETALHES` FROM `PROPRIETARIOS` LEFT JOIN `PROPRIETARIOSCONTATOS` ON `PROPRIETARIOSCONTATOS`.`PRC_COD_CAD` = `PROPRIETARIOS`.`PR_CODIGO` WHERE `PR_EXIBIR` = 'T' LIMIT 20
It runs very fast, in less than 1 second.
If I add GROUP BY, it takes several seconds (5+) to run, even though the GROUP BY field is indexed.
I'm using GROUP BY because the query above returns repeated rows (I search for a name and its contacts in another table, and the same name shows up 4 times).
How do I fix this?
With the GROUP BY clause, the LIMIT clause isn't applied until after the rows are collapsed by the group by operation.
To get an understanding of the operations that MySQL is performing and which indexes are being considered and chosen by the optimizer, we use EXPLAIN.
Unstated in the question is what "field" (columns or expressions) are in the GROUP BY clause. So we are only guessing.
Based on the query shown in the question...
SELECT pr.pr_codigo
, pr.pr_exibir
, pr.pr_nome
, prc.prc_detalhes
FROM `PROPRIETARIOS` pr
LEFT
JOIN `PROPRIETARIOSCONTATOS` prc
ON prc.prc_cod_cad = pr.pr_codigo
WHERE pr.pr_exibir = 'T'
LIMIT 20
Our guess at the most appropriate indexes...
... ON PROPRIETARIOSCONTATOS (prc_cod_cad, prc_detalhes)
... ON PROPRIETARIOS (pr_exibir, pr_codigo, pr_nome)
Our guess is going to change depending on what column(s) are listed in the GROUP BY clause. And we might also suggest an alternative query to return an equivalent result.
But without knowing the GROUP BY clause, without knowing if our guesses about which table each column is from are correct, without knowing the column datatypes, without any estimates of cardinality, and without example data and expected output, ... we're flying blind and just making guesses.
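As one hedged illustration only: if the intent of the GROUP BY was simply one result group per proprietor, applying the LIMIT before the join avoids collapsing rows after the fact. This assumes the GROUP BY was on pr_codigo, which the question does not state:
SELECT pr.pr_codigo
, pr.pr_exibir
, pr.pr_nome
, prc.prc_detalhes
FROM ( SELECT pr_codigo, pr_exibir, pr_nome
FROM `PROPRIETARIOS`
WHERE pr_exibir = 'T'
LIMIT 20
) pr
LEFT
JOIN `PROPRIETARIOSCONTATOS` prc
ON prc.prc_cod_cad = pr.pr_codigo
Note that this returns up to 20 proprietors along with all of their contacts (possibly more than 20 rows), and without an ORDER BY the 20 rows picked are not deterministic.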
The below statement does not work, but I can't seem to figure out why:
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the averages of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR.
If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL doesn't provide support for analytic functions. It is possible to emulate one in MySQL, but the syntax is a bit involved, and it is dependent on behavior that is not guaranteed.
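(As an aside: on MySQL 8.0 and later, window functions make this direct. A minimal sketch, assuming the same A_table:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM ( SELECT row_type
, delay_in_seconds
, ROW_NUMBER() OVER (PARTITION BY row_type ORDER BY created_at DESC) AS rn
FROM A_table
) s
WHERE s.rn <= 1000
GROUP BY s.row_type
The rest of this answer assumes the older versions that lack ROW_NUMBER.)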
As an example of the user-variable emulation:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT @row_ := IF(@prev_row_type = t.row_type, @row_ + 1, 1) AS row_
, @prev_row_type := t.row_type AS row_type
, t.delay_in_seconds
FROM A_table t
CROSS
JOIN (SELECT @prev_row_type := NULL, @row_ := NULL) i
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP
BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What it's effectively doing is assigning a row number to each row. The ORDER BY sorts the rows in descending sequence by created_at; what we want is for the most recent row to be assigned a value of 1, the next most recent 2, etc. This numbering of rows is repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type, created_at, delay_in_seconds) to avoid an expensive "Using filesort" operation. We need at least the first two columns for that; including delay_in_seconds makes it a covering index (the query can be satisfied entirely from the index).
The outer query then runs against the resultset returned from the inline view (a "derived table"). The predicate in the WHERE clause filters out all rows that were assigned a row number greater than 1000; the rest is a straightforward GROUP BY and an AVG aggregate.
A LIMIT clause is entirely unnecessary here. It may be possible to incorporate some additional predicates for some additional performance enhancement: for example, the most recent 1000 rows, but only those with created_at within the past 30 or 90 days.
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another approach. It's possible to use a "join" operation (of A_table with A_table) to get a row number assigned (getting a COUNT of the number of rows that are "more recent" than each row). With large sets, however, that can produce a humongous intermediate result if we aren't careful to limit it.
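A minimal sketch of that join approach (same A_table; it assumes created_at is unique within each row_type, since ties would inflate the counts):
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM ( SELECT t.row_type
, t.created_at
, t.delay_in_seconds
FROM A_table t
JOIN A_table n
ON n.row_type = t.row_type
AND n.created_at >= t.created_at
GROUP BY t.row_type, t.created_at, t.delay_in_seconds
HAVING COUNT(*) <= 1000
) s
GROUP BY s.row_type
COUNT(*) here is each row's row number (1 for the most recent), so the HAVING keeps the 1000 most recent rows per row_type; the same (row_type, created_at) index suggested above helps keep the intermediate result in check.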
Write the ORDER BY at the end of the statement:
SELECT AVG(delay_in_seconds) FROM A_TABLE GROUP BY row_type ORDER BY created_at DESC LIMIT 1000;
See the MySQL documentation on SELECT syntax for details.
Say I have an Order table that has 100+ columns and 1 million rows. It has a PK on OrderID and FK constraint StoreID --> Store.StoreID.
1) select * from `Order` order by OrderID desc limit 10;
the above takes a few milliseconds.
2) select * from `Order` o join `Store` s on s.StoreID = o.StoreID order by OrderID desc limit 10;
this somehow can take up to many seconds. The more inner joins I add, the slower it gets.
3) select OrderID, column1 from `Order` o join `Store` s on s.StoreID = o.StoreID order by OrderID desc limit 10;
this seems to speed the execution up, by limiting the columns we select.
There are a few points that I don't understand here, and I would really appreciate it if anyone more knowledgeable about MySQL (or RDBMS query execution in general) could enlighten me.
Query 1 is fast since it's just a reverse lookup by PK, and the DB only needs to return the first 10 rows it encounters.
I don't see why Query 2 should take forever. Shouldn't the operation be the same? i.e., get the first 10 rows by PK and then join them with the other tables. Since there's an FK constraint, it is guaranteed that the relationship will be satisfied, so the DB doesn't need to join more rows than necessary and then trim the result, right? Unless the FK constraint allows a null FK? In which case I guess a left join would make this much faster than an inner join?
Lastly, I'm guessing Query 3 is simply faster because fewer columns are used in those unnecessary joins. But why would the query execution need the other columns while joining? Shouldn't it just join using PKs first, and then fetch the columns for just the 10 rows?
Thanks!
My understanding is that the MySQL engine applies LIMIT after any joins happen.
From http://dev.mysql.com/doc/refman/5.0/en/select.html: "The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)"
EDIT: You could try using this query to take advantage of the PK speed.
select * from (select * from `Order` order by OrderID desc limit 10) o
join `Store` s on s.StoreID = o.StoreID;
All of your examples are asking for table scans of the existing tables, so their relative performance comes down to how well MySQL can cache the data or results. Some of your queries have ORDER BY or join criteria, which can take advantage of indexes purely to make the joining process more efficient; however, that still is not the same as having a set of criteria that will trigger the use of indexes.
LIMIT is not a criterion; it can be thought of as filtration once a result set is determined. You save time on the client, once the result set is prepared, but not on the server.
Really, the only way to get the answers you are seeking is to become familiar with:
EXPLAIN EXTENDED your_sql_statement
The output of EXPLAIN will show you how many rows are being looked at by mysql, as well as whether or not any indexes are being used.
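For example, using the tables from the question (EXPLAIN EXTENDED is deprecated as of MySQL 5.7, where plain EXPLAIN shows the same information):
EXPLAIN EXTENDED
SELECT * FROM `Order` o
JOIN `Store` s ON s.StoreID = o.StoreID
ORDER BY o.OrderID DESC LIMIT 10;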
I have two questions here, but I am asking them at once as I think they are interrelated.
I am working with a complex query (multiple joins + subqueries), and the table is pretty huge as well (around 200,000 records).
A part of this query (a LEFT JOIN) is required to find the record which has the second-lowest value in a certain column among all the records associated with the primary key of the first table. For now I have isolated this part and am thinking along the lines of:
SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1;
But there is a case where, if there is only 1 record in the table, it must return that record instead of NULL. So my first question is: how do I write a query for this?
Secondly, considering the size of the table and the time its already taking to run even after creating indexes, I understand that adding any more complexity to it in order to achieve the above part might affect the querying time dramatically.
I cannot decompose joins because I need to get some of the columns for the ORDER BY clause (the application has an option to sort the result by these columns, the above column "myvalue" being one of them)
What would be the way(s) to approach this problem?
Thanks
Something like this might work:
SELECT COALESCE(
  (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 1,1),
  (SELECT id FROM tbl ORDER BY `myvalue` ASC LIMIT 0,1)
);
COALESCE returns the first non-null value from the list provided, so when there is no second row, the first row's id is returned instead of NULL.
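Since both branches sort on the same column, an index on myvalue (a sketch; table and column names come from the question) should let each LIMIT probe read only the first entries of the index instead of sorting the whole table:
CREATE INDEX ix_tbl_myvalue ON tbl (`myvalue`);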
As for the complexity of the query, post the whole thing so we can take a look at it.