In our project we use a MySQL table for analytics. It is a big table: 40+ columns and more than 10 million rows.
A big part of the query time is spent calculating the result columns (50+ columns in the result). The idea is to reuse calculated values as variables and make it faster.
Query example:
SELECT col1, SUM(col2) as s_col2, SUM(col3) as s_col3, AVG(col2) as a_col2, ...,
SUM(col2)/SUM(col3) as aaa,
ROUND(AVG(col4), 2) as a_col4,
ROUND(SUM(col5), 2) as s_col5,
ROUND(SUM(col5)/AVG(col4), 2) as zzz,
...
JOIN ...
GROUP BY ...
ORDER BY ...
The idea is to use a @variable, for example:
SELECT col1, @s_col2 := SUM(col2) as s_col2, @s_col3 := SUM(col3) as s_col3, ...,
@s_col2/@s_col3 as aaa,
It works, but only for variables assigned outside of a function, and I don't want an additional output column for every variable:
@a_col4 := AVG(col4), # I don't need this column
@s_col5 := SUM(col5), # I don't need this column
ROUND(@a_col4, 2) as a_col4,
ROUND(@s_col5, 2) as s_col5,
ROUND(@s_col5/@a_col4, 2) as zzz,
How can I assign variables inside functions?
ROUND(@a_col4 := AVG(col4), 2) as a_col4, # doesn't work
ROUND((@s_col5 := SUM(col5)), 2) as s_col5, # doesn't work
ROUND(@s_col5/@a_col4, 2) as zzz,
UPDATED:
Thanks guys for your help.
The MySQL engine is probably smart enough to compute the value of
SUM(col5) only once
I am not sure, because with a large number of columns,
SUM(col1) as a1,
SUM(col1) as a2,
SUM(col1) as a3,
SUM(col1) as a4,
SUM(col1) as a5,
is slower than
@a1 := SUM(col1) as a1,
@a1 as a2,
@a1 as a3,
@a1 as a4,
@a1 as a5,
you can also use CTE for reusing table results
I tried, but some values use too many nested functions, sometimes 7 levels deep, and not all variables can be reused this way (COALESCE(ROUND(COALESCE(ROUND(SUM(AVG(IF..AND...OR...AND)...).
All my changes (15 variables) had a very small effect: for a small period the query takes 139 sec (was 151 sec), but some of our reports take a few hours and we need stronger optimisation.
We will try to analyse server bottlenecks, maybe use partitioning, sharding...
As a general rule, the number of rows touched is much more important
to how long a query will take than the functions being evaluated.
The number of rows is always big; there are a lot of indexes and it works really fast. If I comment out the columns where we need calculations and only select existing ones, it takes 40 sec (instead of 150).
The way to do it in SQL is to use the SELECT with the SUM and AVG as the basis for an outer SELECT:
SELECT *,
s_col2/s_col3 as aaa,
ROUND(a_col4, 2) as a_col4,
ROUND(s_col5, 2) as s_col5,
ROUND(s_col5/a_col4, 2) as zzz,
...
FROM
(SELECT col1, SUM(col2) as s_col2, SUM(col3) as s_col3, AVG(col2) as a_col2, ...,
...
JOIN ...
GROUP BY ...) t1
ORDER BY ...
In MySQL 8 you can also use a CTE for reusing table results; see the manual.
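For illustration, a minimal MySQL 8 sketch of the same idea with a CTE (the table name some_table is a placeholder, and the omitted JOIN/WHERE parts of the real query would go where the comment indicates):
WITH t1 AS (
    SELECT col1,
           SUM(col2) AS s_col2,
           SUM(col3) AS s_col3,
           AVG(col2) AS a_col2,
           AVG(col4) AS a_col4,
           SUM(col5) AS s_col5
    FROM some_table   -- placeholder; add the JOINs and WHERE conditions of the real query here
    GROUP BY col1
)
SELECT t1.*,
       s_col2 / s_col3           AS aaa,
       ROUND(a_col4, 2)          AS a_col4_rounded,
       ROUND(s_col5, 2)          AS s_col5_rounded,
       ROUND(s_col5 / a_col4, 2) AS zzz
FROM t1
ORDER BY col1;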
Don't worry.
As a general rule, the number of rows touched is much more important to how long a query will take than the functions being evaluated.
A similar question comes from the choice between these:
SELECT foo, COUNT(*) FROM x GROUP BY foo ORDER BY COUNT(*) DESC LIMIT 5;
SELECT foo, COUNT(*) FROM x GROUP BY foo ORDER BY 2 DESC LIMIT 5;
I often do the latter because it is fewer keystrokes. I have not been able to determine whether it is faster.
I suggest you write your question in the simplest or clearest way. A subquery (or CTE) may actually be clearer in spite of taking more keystrokes.
Clarity and correctness are more important than speed.
And beware -- With both JOIN and GROUP BY in the query, you may have incorrect results. The JOIN is done before the aggregation; the GROUP BY comes after. Check to see if COUNT or SUM is bigger than it should be. If so, you will need a subquery or CTE.
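A minimal sketch of that fix (the table and column names here are made up, not from the question): aggregate in a derived table first, then JOIN, so the one-to-many join can no longer inflate COUNT or SUM.
SELECT c.id, c.name, ot.order_cnt, ot.total_amount
FROM customers AS c
JOIN ( SELECT customer_id,
              COUNT(*)    AS order_cnt,
              SUM(amount) AS total_amount
       FROM orders
       GROUP BY customer_id
     ) AS ot  ON ot.customer_id = c.id;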
Related
I have code like this:
SELECT column1 = (SELECT MAX(column-name21) FROM table-name2 WHERE condition2 GROUP BY id2) as m,
column2 = (SELECT count(*) FROM table-name2 WHERE condition2 GROUP BY id2) as c,
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
those internal selects are quite long and complicated.
My question is: are there constructs in the MySQL language which allow one to avoid duplicating code and computations in this case?
For example, something like this
SELECT (column1, column2) = (SELECT MAX(column-name1) as m, count(*) as c FROM table-name WHERE condition GROUP BY id),
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
which of course won't be interpreted by MySQL.
I tried this:
SELECT (SELECT MAX(column-name1) as column1, count(*) as column2 FROM table-name WHERE condition GROUP BY id),
column-names
FROM table-name
WHERE condition
ORDER BY ordercondition
LIMIT 25,50
and it also doesn't work.
Such subqueries get cumbersome when you need more than one from the same source. Usually, the "fix" is to use a "derived table" and JOIN:
SELECT x2.col1, x2.col2, names
FROM ( SELECT MAX(c21) AS col1,
COUNT(*) AS col2,
?? -- may be needed for "cond2"
FROM t2
WHERE cond2a ) AS x2
JOIN t1
ON cond2b
WHERE cond1
ORDER BY ??? -- Limit is non-deterministic without ORDER BY
LIMIT 25, 50
If the "condition" in the subquery is "correlated", please specify it; it makes a big difference in how to transform the query.
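To illustrate the distinction (placeholder names, not the asker's real schema): an uncorrelated subquery depends only on constants and can be computed once, while a correlated one references the outer row and logically has to be re-evaluated per row.
-- Uncorrelated: no reference to the outer query
SELECT t1.name,
       (SELECT MAX(t2.col21) FROM t2 WHERE t2.status = 'active') AS m
FROM t1;

-- Correlated: references t1.id from the outer query
SELECT t1.name,
       (SELECT MAX(t2.col21) FROM t2 WHERE t2.id2 = t1.id) AS m
FROM t1;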
The construct COUNT(col) is usually a mistake:
COUNT(*) -- the number of rows.
COUNT(DISTINCT col) -- the number of different values in column `col`.
COUNT(col) -- count the number of rows with non-NULL `col`.
Please provide your actual query and provide SHOW CREATE TABLE. I glossed over several issues; "the devil is in the details".
For Edit 1:
INDEX(tool, uuuuId) -- would help performance
Is uuuuId some form of "hash" or "UUID"? If so, that is relevant to seeing how the performance works. Also, how big (approximately) are the tables? What is the value of innodb_buffer_pool_size? (I am fishing for whether you are I/O-bound or CPU-bound.)
WZ needs INDEX(uuuuId, ppppppId, check1). But actually, that Select...=Yes can be turned into an EXISTS for some speedup (see the sketch below).
Z might benefit from INDEX(check1, uuuuId, ppppppId, check2)
Since Z and WZ are the same table, this might take care of both:
INDEX(ppppppId, uuuuId, check1, check2)
(The order is important.)
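A hypothetical sketch of the Select...=Yes to EXISTS rewrite mentioned above (the real tables and columns from Edit 1 are not shown here, so all names below are made up):
-- Scalar-subquery form: fetch a value, then compare it to 'Yes'
SELECT t1.id
FROM t1
WHERE ( SELECT wz.check1
        FROM WZ AS wz
        WHERE wz.uuuuId = t1.uuuuId
          AND wz.ppppppId = t1.ppppppId
        LIMIT 1 ) = 'Yes';

-- EXISTS form: can stop at the first matching row
SELECT t1.id
FROM t1
WHERE EXISTS ( SELECT 1
               FROM WZ AS wz
               WHERE wz.uuuuId = t1.uuuuId
                 AND wz.ppppppId = t1.ppppppId
                 AND wz.check1 = 'Yes' );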
So I found this code snippet here on SO. It essentially fakes a "row_number()" function for MySQL. It executes quite fast, which I like and need, but I am unable to tack on a where clause at the end.
select
@i:=@i+1 as iterator, t.*
from
big_table as t, (select @i:=0) as foo
Adding in where iterator = 875 yields an error.
The snippet above executes in about .0004 seconds. I know I can wrap it within another query as a subquery, but then it becomes painfully slow.
select * from (
select
@i:=@i+1 as iterator, t.*
from
big_table as t, (select @i:=0) as foo) t
where iterator = 875
The snippet above takes over 10 seconds to execute.
Any way to speed this up?
In this case you could use the LIMIT as a WHERE:
select
@i:=@i+1 as iterator, t.*
from
big_table as t, (select @i:=874) as foo
LIMIT 875,1
Since you only want record 875, this would be fast.
Could you please try this?
Increasing the value of the variable in the where clause and checking it against 875 would do the trick.
SELECT
t.*
FROM
big_table AS t,
(SELECT @i := 0) AS foo
WHERE
(@i := @i + 1) = 875
LIMIT 1
Caution:
Unless you specify an ORDER BY clause, it's not guaranteed that you will get the same row every time for a given row number; MySQL doesn't ensure this, since the data in a table is an unordered set.
So, if you specify an order on some field, you don't need a user-defined variable to get that particular record.
SELECT
big_table.*
FROM big_table
ORDER BY big_table.<some_field>
LIMIT 875,1;
You can significantly improve performance if some_field is indexed.
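For example, reusing the placeholder names from above, the index could be added like this:
ALTER TABLE big_table ADD INDEX idx_some_field (some_field);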
I know normally "the order of evaluation for expressions involving user variables is undefined" so we can't safely define and use a variable in the same select statement. But what if there's a subquery? As an example, I have something like this:
select col1,
(select min(date_)from t where i.col1=col1) as first_date,
datediff(date_, (select min(date_)from t where i.col1=col1)
) as days_since_first_date,
count(*) cnt
from t i
where anothercol in ('long','list','of','values')
group by col1,days_since_first_date;
Is there a way to use (select @foo:=min(date_) from t where i.col1=col1) safely instead of repeating the subquery? If so, could I do it in the datediff function or the first time the subquery appears (or either one)?
Of course, I could do
select col1,
(select min(date_)from t where i.col1=col1) as first_date,
date_,
count(*) cnt
from t i
where anothercol in ('long','list','of','values')
group by col1,date_;
and then do some simple postprocessing to get the datediff. Or I can write two separate queries. But those don't answer my question, which is whether one can safely define and use the same variable in a query and a subquery.
First, your query doesn't really make sense, because date_ is not inside an aggregation function. You are going to get an arbitrary value.
That said, you could repeat the subquery, but I don't see why that would be necessary. Just use a subquery:
select t.col1, t.first_date,
       datediff(t.date_, t.first_date) as days_since_first_date,
       count(*) as cnt
from (select i.*,
             (select min(date_) from t where t.col1 = i.col1) as first_date
      from t i
      where i.anothercol in ('long', 'list', 'of', 'values')
     ) t
group by col1, days_since_first_date;
As I mentioned, though, the value of the third column is problematic.
Note: this does incur additional overhead for materializing the subquery. However, there is a GROUP BY anyway, so the data is being read and written multiple times.
So I have a query that, when run, has 1, 4 and 8 as the values in the last column. But when I change the HAVING condition, those values become 1, 3 and 5. This doesn't make any sense to me.
Here's my SQL:
SELECT memberId, @temp:=total AS total, @runningTotal as runningTotal, @runningTotal:=@temp+@runningTotal AS newRunningTotal
FROM (
SELECT 1 AS memberId, 1 AS total UNION
SELECT 2, 2 UNION
SELECT 3, 2
) AS temp
JOIN (SELECT @temp:=0) AS temp2
JOIN (SELECT @runningTotal:=0) AS temp3
HAVING newRunningTotal <= 40;
Here's the SQL fiddle:
http://sqlfiddle.com/#!2/d41d8/27761/0
If I change newRunningTotal to runningTotal I get different numbers in the runningTotal and newRunningTotal. This doesn't make any sense to me.
Here's the SQL fiddle for the changed query:
http://sqlfiddle.com/#!2/d41d8/27762/0
Any ideas?
Thanks!
The MySQL documentation is quite explicit against doing what you are doing in the SELECT:
As a general rule, other than in SET statements, you should never
assign a value to a user variable and read the value within the same
statement. For example, to increment a variable, this is okay:
SET @a = @a + 1;
For other statements, such as SELECT, you might get the results you
expect, but this is not guaranteed. In the following statement, you
might think that MySQL will evaluate @a first and then do an
assignment second:
SELECT @a, @a:=@a+1, ...;
However, the order of evaluation for
expressions involving user variables is undefined.
I think you have found a situation where it makes a difference.
use this sqlFiddle to specify less than 40,
or this sqlFiddle also to specify less than 40.
I think what's happening is that HAVING is applied after everything is done, but because you have variables in your SELECT, they get calculated again, giving you no control.
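One sketch of a way to regain that control, using the same data as the fiddle: do the variable arithmetic inside a derived table and filter in the outer query instead of in HAVING, so the filter no longer re-triggers the assignments. (This still relies on the derived table being materialized and on MySQL's undocumented left-to-right evaluation inside it.)
SELECT memberId, total, runningTotal
FROM (
    SELECT memberId,
           total,
           @runningTotal := @runningTotal + total AS runningTotal
    FROM (
        SELECT 1 AS memberId, 1 AS total UNION
        SELECT 2, 2 UNION
        SELECT 3, 2
    ) AS temp
    JOIN (SELECT @runningTotal := 0) AS vars
) AS withTotals
WHERE runningTotal <= 40;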
Basically, I store data in MySQL 5.5 and use Qt to connect to it. I want to compare two columns: while col1 is greater than col2 the count continues, but as soon as col1 is less than col2 the count finishes and exits. So this is to count how many rows at the beginning of the table meet the condition. Is this possible in MySQL?
An example:
Col1 Col2
2 1
2 3
2 1
The count I need should return 1, because the first row meets the condition Col1 > Col2, but the second row doesn't. Whenever the condition is not met, counting exits, no matter whether the following rows meet the condition or not.
SELECT COUNT(*)
FROM table
WHERE col1 > col2
It's a little difficult to understand what you're after, but COUNT(*) will return the number of rows matched by your condition, if that's your desire. If it's not, can you maybe be more specific or show example(s) of what you're going for? I will do my best to correct my answer depending on additional details.
You should not be using SQL for this; any answer you get will be chock full of compromise, and if (for example) the result set from your initial query comes back in a different order (due to an index being created or changed), then it will fail.
SQL is designed for "set based" logic - and you really are after procedural logic. If you have to do this, then
1) Use a cursor
2) Use an order by statement
3) Cross fingers
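A minimal sketch of the cursor route, wrapped in a stored procedure (the table someTable, its ordering column id, and the procedure name are assumptions for illustration):
DELIMITER //
CREATE PROCEDURE count_leading_rows(OUT result INT)
BEGIN
  DECLARE done INT DEFAULT 0;
  DECLARE v_col1 INT;
  DECLARE v_col2 INT;
  -- the ORDER BY column is assumed; without it the scan order is undefined
  DECLARE cur CURSOR FOR SELECT col1, col2 FROM someTable ORDER BY id;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;

  SET result = 0;
  OPEN cur;
  read_loop: LOOP
    FETCH cur INTO v_col1, v_col2;
    IF done = 1 OR v_col1 < v_col2 THEN
      LEAVE read_loop;  -- stop counting at the first row where col1 < col2
    END IF;
    SET result = result + 1;
  END LOOP;
  CLOSE cur;
END //
DELIMITER ;

-- usage:
CALL count_leading_rows(@n);
SELECT @n;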
This is a bit ugly, but will do the job. It'll need adjusting depending on any ORDER etc you would like to apply to someTable but the principle is sound.
SELECT COUNT(*)
FROM (
SELECT
@multiplier:=@multiplier*IF(t.`col1`<t.`col2`,0,1) AS counter
FROM `someTable` t, (SELECT @multiplier := 1) v
HAVING counter = 1
) scanQuery
The @multiplier variable will keep multiplying itself by 1. When it encounters a row where col1 < col2 it multiplies by 0, and from then on it keeps multiplying 0 x 1. The outer query then just counts the rows where the flag is still 1.
It's not ideal, but would suffice. This could be expanded to allow you to get those rows before the break by doing the following
SELECT
`someTable`.*
FROM `someTable`
INNER JOIN (
SELECT
t.`PrimaryKeyField`,
@multiplier:=@multiplier*IF(t.`col1`<t.`col2`,0,1) AS counter
FROM `someTable` t, (SELECT @multiplier := 1) v
HAVING counter = 1
) scanQuery
ON scanQuery.`PrimaryKeyField` = `someTable`.`PrimaryKeyField`
Or possibly simply
SELECT
t.*,
@multiplier:=@multiplier*IF(t.`col1`<t.`col2`,0,1) AS counter
FROM `someTable` t, (SELECT @multiplier := 1) v
HAVING counter = 1