Extremely slow GROUP BY on MySQL. Not index-related

There are quite a few "why is my GROUP BY so slow" questions on SO; most of them seem to be resolved with indexes.
My situation is different. Indeed I GROUP BY on non-indexed data but this is on purpose and it's not something I can change.
However, when I compare the performance of GROUP BY with the performance of a similar query without a GROUP BY (that also doesn't use indexes) - the GROUP BY query is slower by an order of magnitude.
Slow query:
SELECT someFunc(col), COUNT(*) FROM tbl WHERE col2 = 42 GROUP BY someFunc(col)
The results are something like this:
someFunc(col) COUNT(*)
=========================
a 100000
b 80000
c 20
d 10
Fast(er) query:
SELECT 'a', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'a'
UNION
SELECT 'b', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'b'
UNION
SELECT 'c', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'c'
UNION
SELECT 'd', COUNT(*) FROM tbl WHERE col2 = 42 AND someFunc(col) = 'd'
This query yields the same results and is about ten times faster despite actually running multiple separate queries.
I realize that they are not the same from MySQL's point of view, because MySQL doesn't know in advance that someFunc(col) can only yield four different values, but it still seems like too big of a difference.
I'm thinking that this has to do with some work GROUP BY is doing behind the scenes (creating temporary tables and stuff like that).
Are there configuration parameters that I could tweak to make the GROUP BY faster?
Is there a way to hint MySQL to do things differently within the query itself? (e.g. refrain from creating a temporary table).
EDIT:
In fact what I referred to as someFunc(col) above is actually a JSON_EXTRACT(). I just tried to copy the specific data being extracted into its own (unindexed) column and it makes GROUP BY extremely fast, and indeed faster than the alternative UNIONed queries.
The question remains: why? JSON_EXTRACT() might be slow but it should be just as slow with the four queries (in fact slower because more rows are scanned). Also, I've read that MySQL JSON is designed for fast reads.
The difference I'm seeing is between more than 200 seconds (GROUP BY with JSON_EXTRACT()) and 1-2 seconds (GROUP BY on a CONCAT() on an actual unindexed column).

First, for this query:
SELECT someFunc(col), COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc(col);
You should have an index on tbl(col2, col). This is a covering index for the query so it should improve performance.
Small note: The second version should use UNION ALL rather than UNION. The performance difference for eliminating duplicates is small on 4 rows, but UNION is a bad habit in these cases.
I'm not sure what would cause a 10x slowdown. I can readily think of two things that would make the second version faster.
First, the GROUP BY query is calling someFunc() twice for each row being processed (once in the SELECT and once in the GROUP BY). If this is an expensive operation, then that would account for half the increase in query load. This could be much larger if the first version is calling someFunc() on all rows, rather than just on matching rows.
To see if this is an issue, you can try:
SELECT someFunc(col) as someFunc_col, COUNT(*)
FROM tbl
WHERE col2 = 42
GROUP BY someFunc_col;
Second, doing 4 smaller GROUP BYs is going to be a bit faster than doing 1 bigger one. This is because GROUP BY uses a sort, and sorting is O(n log(n)). So, sorting 100,000 rows and 80,000 rows should be faster than sorting 180,000. Your case has about half the data in two groups. This might account for up to 50% difference (although I would be surprised if it were this large).
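To check that the two query shapes discussed above really return the same counts, here is a minimal sketch. It makes several assumptions: an in-memory SQLite database stands in for MySQL, substr() is a hypothetical stand-in for someFunc(), and the rows are made up. It says nothing about MySQL timings; it only compares results.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; substr() is a hypothetical
# stand-in for someFunc(), and the table contents are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (col TEXT NOT NULL, col2 INTEGER NOT NULL)")
rows = [("a1", 42)] * 5 + [("b1", 42)] * 3 + [("c1", 42)] * 2 + [("a2", 7)] * 4
conn.executemany("INSERT INTO tbl (col, col2) VALUES (?, ?)", rows)

# Single pass: group on an alias so the expression is written only once.
grouped = conn.execute(
    """
    SELECT substr(col, 1, 1) AS key_col, COUNT(*)
    FROM tbl
    WHERE col2 = 42
    GROUP BY key_col
    ORDER BY key_col
    """
).fetchall()

# One scan per known value, glued together with UNION ALL (no dedup pass).
keys = ["a", "b", "c"]
branch = (
    "SELECT ? AS key_col, COUNT(*) FROM tbl "
    "WHERE col2 = 42 AND substr(col, 1, 1) = ?"
)
union_sql = " UNION ALL ".join([branch] * len(keys)) + " ORDER BY key_col"
params = [value for key in keys for value in (key, key)]
unioned = conn.execute(union_sql, params).fetchall()

print(grouped)
print(unioned)
```

One difference to keep in mind: the UNION ALL form returns a row with a count of 0 for a key that matches nothing, while GROUP BY simply omits that key.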

Related

Query takes too long to query with OR clause but their parts are very quick

I have two tables with ~1M rows indexed by their Id's.
The following query...
SELECT t.* FROM transactions t
INNER JOIN integration it ON it.id_trans = t.id_trans
WHERE t.id_trans = '5440073'
OR it.id_integration = '439580587'
This query takes about 30s. But ...
SELECT ... WHERE t.id_trans = '5440073'
takes less than 100ms and
SELECT ... WHERE it.id_integration = '439580587'
also takes less than 100ms. Even
SELECT ... WHERE t.id_trans = '5440073'
UNION
SELECT ... WHERE it.id_integration = '439580587'
takes less than 100ms.
Why does the OR clause take so much time even though its parts are so fast?
Why is OR so slow, but UNION is so fast?
Do you understand why UNION is fast? Because it can use two separate indexes to good advantage, and gather some result rows from each part of the UNION, then combine the results together.
But why can't OR do that? Simply put, the Optimizer is not smart enough to try that angle.
In your case, the tests are on different tables; this leads to radically different query plans (see EXPLAIN SELECT ...) for the two parts of the UNION. Each can be well optimized, so each is fast.
Assuming each part delivers only a few rows, the subsequent overhead of UNION is minor -- namely to gather the two small sets of rows, dedup them (if you use UNION DISTINCT instead of UNION ALL), and deliver the results.
Meanwhile, the OR query effectively gathers all combinations of the two tables, then filters them based on the two parts of the OR. The intermediate stage may involve a huge temp table, only to have most of the rows tossed.
(Another example of inflate-deflate is JOINs + GROUP BY. The workarounds are different.)
I would suggest writing the query using UNION ALL:
SELECT t.*
FROM transactions t
WHERE t.id_trans = '5440073'
UNION ALL
SELECT t.*
FROM transactions t JOIN
integration it
ON it.id_trans = t.id_trans
WHERE t.id_trans <> '5440073' AND
it.id_integration = '439580587';
Note: If the ids are really numbers (and not strings), then drop the single quotes. Mixing types can sometimes confuse the optimizer and prevent the use of indexes.
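As a quick sanity check of the rewrite above, the sketch below compares the OR query with the UNION ALL version. The assumptions are mine: SQLite in memory stands in for MySQL, the amount column and the sample ids are invented, and every transaction has exactly one integration row. If a transaction could have zero or several integration rows, the two forms would not return the same number of rows.

```python
import sqlite3

# SQLite stands in for MySQL; schema details beyond the two id columns
# (the amount column, the sample ids) are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (id_trans INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE integration (
        id_integration INTEGER PRIMARY KEY,
        id_trans INTEGER NOT NULL UNIQUE REFERENCES transactions (id_trans)
    );
    INSERT INTO transactions VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO integration VALUES (101, 1), (102, 2), (103, 3);
    """
)

OR_SQL = """
    SELECT t.* FROM transactions t
    JOIN integration it ON it.id_trans = t.id_trans
    WHERE t.id_trans = :trans OR it.id_integration = :integ
    ORDER BY 1
"""

# The <> guard in the second branch keeps a row that satisfies both
# conditions from being returned twice by UNION ALL.
UNION_SQL = """
    SELECT t.* FROM transactions t
    WHERE t.id_trans = :trans
    UNION ALL
    SELECT t.* FROM transactions t
    JOIN integration it ON it.id_trans = t.id_trans
    WHERE t.id_trans <> :trans AND it.id_integration = :integ
    ORDER BY 1
"""


def run(sql, trans, integ):
    """Execute one of the two queries with the given ids."""
    return conn.execute(sql, {"trans": trans, "integ": integ}).fetchall()


print(run(OR_SQL, 1, 102), run(UNION_SQL, 1, 102))
print(run(OR_SQL, 1, 101), run(UNION_SQL, 1, 101))
```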

When does UNION stop being superior to OR conditions in SQL queries?

In most cases, when I tried to remove OR condition and replace them with a UNION (which holds each of the conditions separately), it performed significantly better, as those parts of the query were index-able again.
Is there a rule of thumb (and maybe some documentation to support it) on when this 'trick' stops being useful? Will it be useful for 2 OR conditions? For 10 OR conditions? After all, the number of UNIONs increases, and the UNION DISTINCT part may have its own overhead.
What would be your rule of thumb on this?
Small example of the transformation:
SELECT
a, b
FROM
tbl
WHERE
a = 1 OR b = 2
Transformed to:
(SELECT
tbl.a, tbl.b
FROM
tbl
WHERE
tbl.b = 2)
UNION DISTINCT
(SELECT
tbl.a, tbl.b
FROM
tbl
WHERE
tbl.a = 1)
I suggest there is no useful Rule of Thumb (RoT). Here is why...
As you imply, more UNIONs mean slower work, while more ORs do not (at least not much). The SELECTs of a union are costly because they are separate. I would estimate that a UNION of N SELECTs takes about N+1 or N+2 units of time, where one indexed SELECT takes 1 unit of time. In contrast, multiple ORs do little to slow down the query, since fetching all rows of the table is the costly part.
How fast each SELECT of a UNION runs depends on how good the index is and how few rows are fetched. This can vary significantly. (Hence, it makes it hard to devise a RoT.)
A UNION starts by generating a temp table into which each SELECT adds the rows it finds. This is some overhead. In newer versions (5.7.3 / MariaDB 10.1), there are limited situations where the temp table can be avoided. (This eliminates the hypothetical +1 or +2, thereby adding more complexity into devising a RoT.)
If it is UNION DISTINCT (the default) instead of UNION ALL, there needs to be a dedup-pass, probably involving a sort over the temp table. Note: This means that even the new versions cannot avoid the temp table. UNION DISTINCT precisely mimics the OR, yet you may know that ALL would give the same answer.

Codeigniter 3 really slow query when Group By is called

I have this query
SELECT `PR_CODIGO`, `PR_EXIBIR`, `PR_NOME`, `PRC_DETALHES` FROM `PROPRIETARIOS` LEFT JOIN `PROPRIETARIOSCONTATOS` ON `PROPRIETARIOSCONTATOS`.`PRC_COD_CAD` = `PROPRIETARIOS`.`PR_CODIGO` WHERE `PR_EXIBIR` = 'T' LIMIT 20
It runs very fast, less than 1 second.
If I add GROUP BY, it takes several seconds (5+) to run, even though the GROUP BY field is indexed.
I'm using GROUP BY because the query above returns repeated rows (I search for a name and its contacts in another table, and it shows the same name 4 times).
How do I fix this?
With the GROUP BY clause, the LIMIT clause isn't applied until after the rows are collapsed by the group by operation.
To get an understanding of the operations that MySQL is performing and which indexes are being considered and chosen by the optimizer, we use EXPLAIN.
Unstated in the question is what "field" (columns or expressions) are in the GROUP BY clause. So we are only guessing.
Based on the query shown in the question...
SELECT pr.pr_codigo
, pr.pr_exibir
, pr.pr_nome
, prc.prc_detalhes
FROM `PROPRIETARIOS` pr
LEFT
JOIN `PROPRIETARIOSCONTATOS` prc
ON prc.prc_cod_cad = pr.pr_codigo
WHERE pr.pr_exibir = 'T'
LIMIT 20
Our guess at the most appropriate indexes...
... ON PROPRIETARIOSCONTATOS (prc_cod_cad, prc_detalhes)
... ON PROPRIETARIOS (pr_exibir, pr_codigo, pr_nome)
Our guess is going to change depending on what column(s) are listed in the GROUP BY clause. And we might also suggest an alternative query to return an equivalent result.
But without knowing the GROUP BY clause, without knowing if our guesses about which table each column is from are correct, without knowing the column datatypes, without any estimates of cardinality, and without example data and expected output, ... we're flying blind and just making guesses.
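The point about LIMIT being applied after the rows are collapsed can be seen on a toy dataset. This is only a sketch under assumptions: SQLite in memory stands in for MySQL, the grouping column (PR_CODIGO) is a guess because the question never states it, and the data (3 owners with 4 contacts each) is made up. The last query is one hypothetical alternative that limits the owners first and joins afterwards.

```python
import sqlite3

# SQLite stands in for MySQL; grouping on PR_CODIGO is an assumption,
# since the question does not show its GROUP BY clause.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE PROPRIETARIOS (
        PR_CODIGO INTEGER PRIMARY KEY, PR_EXIBIR TEXT, PR_NOME TEXT
    );
    CREATE TABLE PROPRIETARIOSCONTATOS (PRC_COD_CAD INTEGER, PRC_DETALHES TEXT);
    """
)
for codigo in (1, 2, 3):
    conn.execute(
        "INSERT INTO PROPRIETARIOS VALUES (?, 'T', ?)", (codigo, f"owner{codigo}")
    )
    for n in range(4):
        conn.execute(
            "INSERT INTO PROPRIETARIOSCONTATOS VALUES (?, ?)",
            (codigo, f"contact{n}"),
        )

JOIN = (
    " FROM PROPRIETARIOS pr"
    " LEFT JOIN PROPRIETARIOSCONTATOS prc ON prc.PRC_COD_CAD = pr.PR_CODIGO"
    " WHERE pr.PR_EXIBIR = 'T' "
)

# Without GROUP BY, LIMIT cuts off the raw joined rows (same owner twice).
ungrouped = conn.execute(
    "SELECT pr.PR_CODIGO" + JOIN + "ORDER BY pr.PR_CODIGO LIMIT 2"
).fetchall()

# With GROUP BY, the joined rows are collapsed first, then LIMIT applies.
grouped = conn.execute(
    "SELECT pr.PR_CODIGO" + JOIN + "GROUP BY pr.PR_CODIGO ORDER BY pr.PR_CODIGO LIMIT 2"
).fetchall()

# Hypothetical alternative: pick the owners first, then fetch their contacts.
limited_first = conn.execute(
    """
    SELECT pr.PR_CODIGO, prc.PRC_DETALHES
    FROM (SELECT * FROM PROPRIETARIOS
          WHERE PR_EXIBIR = 'T' ORDER BY PR_CODIGO LIMIT 2) pr
    LEFT JOIN PROPRIETARIOSCONTATOS prc ON prc.PRC_COD_CAD = pr.PR_CODIGO
    """
).fetchall()

print(ungrouped, grouped, len(limited_first))
```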

Can MySQL use Indexes when there is OR between conditions?

I have two queries, each with its own EXPLAIN results:
One:
SELECT *
FROM notifications
WHERE id = 5204 OR seen = 3
Benchmark (for 10,000 rows): 0.861
Two:
SELECT h.* FROM ((SELECT n.* from notifications n WHERE id = 5204)
UNION ALL
(SELECT n.* from notifications n WHERE seen = 3)) h
Benchmark (for 10,000 rows): 2.064
The result of two queries above is identical. Also I have these two indexes on notifications table:
notifications(id) -- this is PK
notifications(seen)
As you know, OR usually prevents effective use of indexes. That's why I wrote the second query (with UNION). But after some tests I found that using OR is still much faster than using UNION. So I'm confused and I really cannot choose the best option in my case.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better. Could you please help me decide which approach I should use?
The query plan for the OR case appears to indicate that MySQL is indeed using indexes, so evidently yes, it can do, at least in this case. That seems entirely reasonable, because there is an index on seen, and id is the PK.
Based on some logical and reasonable explanations, using union is better, but the result of benchmark says using OR is better.
If "logical and reasonable explanations" are contradicted by reality, then it is safe to assume that the logic is flawed or the explanations are wrong or inapplicable. Performance is notoriously difficult to predict; performance testing is essential where speed is important.
Could you please help me decide which approach I should use?
You should use the one that tests faster on input that adequately models that which the program will see in real use.
Note also, however, that your two queries are not semantically equivalent: if the row with id = 5204 also has seen = 3 then the OR query will return it once, but the UNION ALL query will return it twice. It is pointless to choose between correct code and incorrect code on any basis other than which one is correct.
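The semantic difference is easy to reproduce. The sketch below assumes an in-memory SQLite database in place of MySQL and three invented rows, one of which (id 5204) satisfies both conditions.

```python
import sqlite3

# SQLite stands in for MySQL; the three rows are made up so that
# id 5204 matches both "id = 5204" and "seen = 3".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE notifications (id INTEGER PRIMARY KEY, seen INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX notifications_seen ON notifications (seen)")
conn.executemany(
    "INSERT INTO notifications VALUES (?, ?)", [(5203, 3), (5204, 3), (5205, 0)]
)


def ids(sql):
    """Return the id column of the result as a plain list."""
    return [row[0] for row in conn.execute(sql)]


or_ids = ids("SELECT id FROM notifications WHERE id = 5204 OR seen = 3 ORDER BY id")

union_all_ids = ids(
    "SELECT id FROM notifications WHERE id = 5204 "
    "UNION ALL "
    "SELECT id FROM notifications WHERE seen = 3 ORDER BY id"
)

# Plain UNION (i.e. UNION DISTINCT) removes the duplicate again.
union_ids = ids(
    "SELECT id FROM notifications WHERE id = 5204 "
    "UNION "
    "SELECT id FROM notifications WHERE seen = 3 ORDER BY id"
)

print(or_ids, union_all_ids, union_ids)
```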
index_merge, as the name suggests, combines the primary keys of two indexes using the Sort Merge Join or Sort Merge Union for AND and OR conditions, respectively, and then looks up the rest of the values in the table by PK.
For this to work, the conditions on both indexes must be such that each index yields primary keys in order (your conditions do).
You can find the strict definition of the conditions in the docs, but in a nutshell, you should filter by all parts of the index with an equality condition, plus possibly <, =, or > on the PK.
If you have an index on (col1, col2, col3), this should be col1 = :val1 AND col2 = :val2 AND col3 = :val3 [ AND id > :id ] (the part in the square brackets is not necessary).
The following conditions won't work:
col1 = :val1 -- you omit col2 and col3
col1 = :val1 AND col2 = :val2 AND col3 > :val3 -- you can only use equality on key parts
As a free side effect, your output is sorted by id.
You could achieve similar results using this:
SELECT *
FROM (
SELECT 5204 id
UNION ALL
SELECT id
FROM mytable
WHERE seen = 3
AND id <> 5204
) q
JOIN mytable m
ON m.id = q.id
, except that in earlier versions of MySQL the derived table would have to be materialized which would definitely make the query performance worse, and your results would not have been ordered by id anymore.
In short, if your query allows index_merge(union), go for it.
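For comparison, here is a small sketch of the derived-table query shown above, checked against the plain OR query. Assumptions: SQLite in memory stands in for MySQL (so nothing here exercises index_merge itself), the table is named mytable as in the query above, and the rows are invented.

```python
import sqlite3

# SQLite stands in for MySQL; this only compares result sets and does
# not demonstrate MySQL's index_merge access method.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, seen INTEGER NOT NULL)")
conn.execute("CREATE INDEX mytable_seen ON mytable (seen)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(5203, 3), (5204, 3), (5205, 0), (5206, 3)],
)

or_rows = conn.execute(
    "SELECT * FROM mytable WHERE id = 5204 OR seen = 3 ORDER BY id"
).fetchall()

# Collect the candidate ids first, then look the rows up by primary key.
# The "id <> 5204" guard keeps id 5204 from appearing twice in q.
derived_rows = conn.execute(
    """
    SELECT m.*
    FROM (
        SELECT 5204 AS id
        UNION ALL
        SELECT id FROM mytable WHERE seen = 3 AND id <> 5204
    ) q
    JOIN mytable m ON m.id = q.id
    ORDER BY m.id
    """
).fetchall()

print(or_rows)
print(derived_rows)
```

The explicit ORDER BY is there because, as noted above, the derived-table form does not guarantee output sorted by id on its own.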
The answer is contained in your question. The EXPLAIN output for OR says Using union(PRIMARY, seen) - that means that the index_merge optimization is being used and the query is actually executed by unioning results from the two indexes.
So MySQL can use an index in some cases, and it does in this one. But index_merge is not always available, or is not used because the statistics of the indexes say it won't be worth it. In those cases OR may be a lot slower than UNION (or not; you always need to check both versions if you are not sure).
In your test you "got lucky" and MySQL did the right optimization for you automatically. It is not always so.

MySQL RAND() optimization with LIMIT option

I have 50,000 rows in a table and I am running the following query. I heard it is a bad idea, so how do I make it work in a better way?
mysql> SELECT t_dnis,account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 ORDER BY RAND() LIMIT 1;
+------------+------------+
| t_dnis | account_id |
+------------+------------+
| 5623157085 | 1127 |
+------------+------------+
Is there any other way I can make this query faster, or other options I can use?
I am not a DBA, so sorry if this question has been asked before :(
Note: currently we are not seeing a performance issue, but we are growing, so it could have an impact in the future. I just want to know the pros and cons before we run into trouble.
This query:
SELECT t_dnis, account_id
FROM mytable
WHERE o_dnis = '15623157085' AND enabled = 1
ORDER BY RAND()
LIMIT 1;
is not sorting 50,000 rows. It is sorting the number of rows that match the WHERE clause. As you state in the comments, this is in the low double digits. On a handful of rows, the use of ORDER BY rand() should not have much impact on performance.
You do want an index. The best index would be mytable(o_dnis, enabled, t_dnis, account_id). This is a covering index for the query, so the original data pages do not need to be accessed.
Under most circumstances, I would expect the ORDER BY to be fine up to at least a few hundred rows, if not several thousand. Of course, this depends on lots of factors, such as your response-time requirements, the hardware you are running on, and how many concurrent queries are running. My guess is that your current data/configuration does not pose a performance problem, and there is ample room for growth in the data without an issue arising.
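To illustrate what "covering" means for the index suggested above, the sketch below asks the planner how it would run the query. Heavy assumptions apply: this uses SQLite's EXPLAIN QUERY PLAN rather than MySQL's EXPLAIN, RANDOM() replaces RAND(), and the id and notes columns are invented. In MySQL, a covering index shows up as "Using index" in the Extra column of EXPLAIN instead.

```python
import sqlite3

# SQLite stands in for MySQL; the id and notes columns are hypothetical
# extras that the query never touches.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY,
        o_dnis TEXT,
        enabled INTEGER,
        t_dnis TEXT,
        account_id INTEGER,
        notes TEXT
    )
    """
)
# Filter columns first, then the selected columns, so the query can be
# answered from the index alone.
conn.execute(
    "CREATE INDEX mytable_covering "
    "ON mytable (o_dnis, enabled, t_dnis, account_id)"
)

plan = conn.execute(
    """
    EXPLAIN QUERY PLAN
    SELECT t_dnis, account_id
    FROM mytable
    WHERE o_dnis = '15623157085' AND enabled = 1
    ORDER BY RANDOM()
    LIMIT 1
    """
).fetchall()

# The last column of each plan row is the human-readable description.
plan_text = " | ".join(row[-1] for row in plan)
print(plan_text)
```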
Unless you are running on very slow hardware, you should not experience problems in sorting 50,000 rows (or, here, far fewer). So if you still ask the question, this makes me suspect that your problem does not lie in the RAND().
For example one possible cause of slowness could be not having a proper index - in this case you can go for a covering index:
CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis, t_dnis, account_id);
or the basic
CREATE INDEX mytable_ndx ON mytable (enabled, o_dnis);
At this point you should already have good performance.
Otherwise you can run the query twice, either by counting the rows or just priming a cache. Which to choose depends on the data structure and how many rows are returned; usually, the COUNT option is the safest bet.
SELECT COUNT(1) AS n FROM mytable WHERE ...
which gives you n, which allows you to generate a random number k between 0 and n - 1, followed by
SELECT ... FROM mytable WHERE ... LIMIT k, 1
which ought to be really fast. Again, the index will help you speeding up the counting operation.
In some cases (MySQL only) you could perhaps do better with
SELECT SQL_CACHE SQL_CALC_FOUND_ROWS ... FROM mytable WHERE ...
using the FOUND_ROWS() function to recover n, then run the second query which should take advantage of the cache. It's best if you experiment first, though, and changes in the table demographics might cause performance to fall. (Note that the query cache was removed in MySQL 8.0, and SQL_CALC_FOUND_ROWS is deprecated as of MySQL 8.0.17.)
The problem with ORDER BY RAND() LIMIT 1 is that MySQL will give each row a random value and then sort on it, performing a full table scan, and then drop all the results but one.
This is especially bad on a table with a lot of rows, doing a query like
SELECT * FROM foo ORDER BY RAND() LIMIT 1
However in your case the query is already filtering on o_dnis and enabled. If there are only a limited number of rows that match (like a few hundred), doing an ORDER BY RAND() shouldn't cause a performance issue.
The alternative requires two queries: one to count and the other to fetch.
In pseudocode:
count = query("SELECT COUNT(*) FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1").value
offset = random(0, count - 1)
result = query("SELECT t_dnis, account_id FROM mytable WHERE o_dnis = '15623157085' AND enabled = 1 LIMIT 1 OFFSET " + offset).row
Note: For the pseudo code to perform well, there needs to be a (multi-column) index on o_dnis, enabled.
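A runnable version of the pseudocode above might look like the following sketch. Assumptions: SQLite in memory stands in for MySQL, the helper name pick_random_row is made up, the sample rows are invented, and a hypothetical id primary key is used in ORDER BY so that a given offset always points at the same row. It also handles the empty case, which the pseudocode does not.

```python
import random
import sqlite3

# SQLite stands in for MySQL; the id column and sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE mytable (
        id INTEGER PRIMARY KEY,
        o_dnis TEXT, enabled INTEGER, t_dnis TEXT, account_id INTEGER
    )
    """
)
conn.execute("CREATE INDEX mytable_ndx ON mytable (o_dnis, enabled)")
conn.executemany(
    "INSERT INTO mytable (o_dnis, enabled, t_dnis, account_id) VALUES (?, ?, ?, ?)",
    [
        ("15623157085", 1, "5623157085", 1127),
        ("15623157085", 1, "5623157086", 1128),
        ("15623157085", 1, "5623157087", 1129),
        ("15623157085", 0, "5623157088", 1130),  # disabled, never returned
        ("19995550000", 1, "9995550000", 2000),  # different o_dnis
    ],
)


def pick_random_row(conn, o_dnis, rng=random):
    """Count the matching rows, then fetch one of them at a random offset."""
    where = "WHERE o_dnis = ? AND enabled = 1"
    (count,) = conn.execute(
        f"SELECT COUNT(*) FROM mytable {where}", (o_dnis,)
    ).fetchone()
    if count == 0:
        return None  # nothing matches, so there is no valid offset
    offset = rng.randrange(count)  # uniform in 0 .. count - 1
    return conn.execute(
        f"SELECT t_dnis, account_id FROM mytable {where} "
        "ORDER BY id LIMIT 1 OFFSET ?",
        (o_dnis, offset),
    ).fetchone()


print(pick_random_row(conn, "15623157085"))
```

The values are passed as bound parameters rather than concatenated into the SQL string. Note that the count can go stale if rows are inserted or deleted between the two queries.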