I was looking for a query that sums the top n values for each user. Fortunately, I found a solution in the post "Sum top 5 values in MySQL".
However, I'm having a hard time understanding the given solution, which is:
SELECT driver, SUM(`position`)
FROM (SELECT driver, race, season, `position`,
IF(@lastDriver=(@lastDriver:=driver), @auto:=@auto+1, @auto:=1) indx
FROM results, (SELECT @lastDriver:=0, @auto:=1) A
ORDER BY driver, `position`) AS A
WHERE indx <= 5
GROUP BY driver ;
Can someone explain how it works, especially the subquery after the FROM clause?
Thank you in advance.
The MySQL @variables work like an inline program: set a variable, compare it, and then use the result as the basis for the next row being compared. Let me reformat your query so it is a little easier to follow.
SELECT
      driver,
      SUM(`position`)
   FROM
      ( SELECT
              driver,
              race,
              season,
              `position`,
              IF( @lastDriver = ( @lastDriver := driver),
                  @auto := @auto + 1,
                  @auto := 1) indx
           FROM
              results,
              (SELECT @lastDriver := 0,
                      @auto := 1) A
           ORDER BY
              driver,
              `position`) AS A
   WHERE
      indx <= 5
   GROUP BY
      driver ;
The inner "FROM" clause is where the query starts. The (SELECT @lastDriver := 0, @auto := 1) part is like having a program that does
set @auto = 1
set @lastDriver = 0
to prepare the variables.
Because you have an ORDER BY clause, the query grabs the records from the "results" table and puts them in order. This is like a queue of records ABOUT to be processed. Now think of the query running all the records through a do/while loop. Remember, @auto = 1 and @lastDriver = 0 to start.
Now the query processes the records in order, and as it goes, ONE FIELD AT A TIME, it is like the query saying:
Add the driver to the result
add the race to the result
add the season to the result
add the `position` to the result
Now for what you are probably waiting for: what is going on with the IF() and the @ variables. The IF() is like the following
IF( some condition )
add this value to the result
else
add this value to the result
In this case, the @lastDriver = ( @lastDriver := driver ) is pseudo-code for something like
set HoldLastDriver = @lastDriver
set @lastDriver = the driver of the current record being processed
if( HoldLastDriver = new value of @lastDriver -- current record driver )
set the auto value = auto value +1
else
set the auto value back to 1 because the driver just changed
The result of this is put into the "indx" column.
So when the query gets to the next row, @lastDriver and @auto have already been updated and processing continues from there. With the following data, the list shows what the values would be before and after each row. Only the driver and @auto columns really matter, since they are the critical element of the query.
                         BEFORE ROW PROCESSED    AFTER ROW PROCESSED
driver  pos/race/season  @lastDriver  @auto      @lastDriver  @auto
1       3 / A / A        0            1          1            1
1       3 / B / A        1            1          1            2
1       4 / D / B        1            2          1            3
1       5 / E / D        1            3          1            4
1       7 / F / D        1            4          1            5
1       7 / G / D        1            5          1            6
2       2 / A / A        1            6          2            1
2       2 / B / A        2            1          2            2
3       1 / A / A        2            2          3            1
3       2 / B / A        3            1          3            2
3       2 / D / B        3            2          3            3
3       2 / E / D        3            3          3            4
3       3 / F / D        3            4          3            5
3       3 / G / D        3            5          3            6
To end the query, now that all these records are processed, your outer query groups by driver and applies SUM(), but only to the records where "indx" (the AFTER version of @auto) is less than or equal to 5. So the two records where @auto was 6 (drivers 1 and 3) are NOT included in the summation.
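If it helps, the same row-by-row bookkeeping can be mimicked outside MySQL. This is only a rough Python sketch of the logic, not how the server executes the query; the function name top_n_sum and the (driver, position) tuple layout are made up for illustration, and the data is taken from the table above.

```python
from collections import defaultdict

def top_n_sum(rows, n=5):
    """Sum the n lowest positions per driver, mimicking @lastDriver / @auto."""
    totals = defaultdict(int)
    last_driver = None   # plays the role of @lastDriver
    auto = 0             # plays the role of @auto
    # sorted() stands in for ORDER BY driver, position
    for driver, position in sorted(rows):
        if driver == last_driver:
            auto += 1            # same driver as the previous row
        else:
            auto = 1             # driver just changed, restart the counter
        last_driver = driver
        if auto <= n:            # WHERE indx <= 5
            totals[driver] += position   # SUM(position) ... GROUP BY driver
    return dict(totals)

# (driver, position) pairs from the table above
rows = [(1, 3), (1, 3), (1, 4), (1, 5), (1, 7), (1, 7),
        (2, 2), (2, 2),
        (3, 1), (3, 2), (3, 2), (3, 2), (3, 3), (3, 3)]
print(top_n_sum(rows))  # {1: 22, 2: 4, 3: 10}
```

The sixth row of drivers 1 and 3 gets counter value 6 and is skipped, just like the rows where indx is 6 in the query.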
I have a table,
Name Seconds Status_measure
a 0 10
a 10 13
a 20 -1
a 30 15
a 40 20
a 50 12
a 60 -1
For a particular name I want a new column that counts how many times the value is greater than -1 after the first -1 has been met. In this data, the new column for name "a" should have the value 3, because after the first -1 in Status_measure there are 3 values greater than -1 (15, 20 and 12).
Required data frame:
Id Name Seconds Status_measure Value
1 a 0 10 3
2 a 10 13 3
3 a 20 -1 3
4 a 30 15 3
5 a 40 20 3
6 a 50 12 3
7 a 60 -1 3
I tried doing
count(status_measure>-1) over (partition by name order by seconds)
But this does not give the desired result.
You can do it in 2 steps: assign a group number to each row, then count the entries with grp = 1.
select *, sum(Status_measure > -1 and grp = 1) over(partition by name) n
from (
select *
, row_number() over(partition by name order by Seconds) - sum(Status_measure > -1 ) over(partition by name order by Seconds) grp
from tbl
) t
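To see why grp = 1 picks the right rows, here is a small Python sketch of the same arithmetic (the function name count_after_first_marker is hypothetical, and the list is the Status_measure column from the question ordered by Seconds). grp is the row number minus the running count of values greater than -1, so it goes up by one each time a -1 is passed.

```python
def count_after_first_marker(values, marker=-1):
    """Count values > marker that belong to group 1 (after the first marker)."""
    running = 0   # running SUM(Status_measure > -1) over rows so far
    count = 0
    for row_number, value in enumerate(values, start=1):
        if value > marker:
            running += 1
        # row_number() - running sum = number of rows so far that are not > marker
        grp = row_number - running
        if value > marker and grp == 1:
            count += 1
    return count

status_measure = [10, 13, -1, 15, 20, 12, -1]
print(count_after_first_marker(status_measure))  # 3
```

For this data grp comes out as 0, 0, 1, 1, 1, 1, 2, so only 15, 20 and 12 are counted.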
Another option is to update a variable, which:
starts from 0
increases when a -1 is reached
decreases when a second -1 is reached
Once you have this column, you can run a sum over its values.
SET @change = 0;
SELECT *, SUM(CASE WHEN Status_measure = -1
THEN IF(@change=0, @change := @change + 1, @change := @change - 1)
ELSE @change END) OVER() -1 AS Value_
FROM tab
Limitations: this solution assumes you have only one range of interesting values between -1s.
Note: 1 is subtracted from the sum because the first update of the variable leaves a 1 in the same row as the -1, which you don't want counted. For better understanding, comment out the SUM() OVER() and look at the intermediate output.
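The intermediate values are easy to trace by hand. The Python sketch below imitates the variable update for rows processed in Seconds order (that ordering is an assumption, since the query above has no ORDER BY; the function name count_between_markers is hypothetical).

```python
def count_between_markers(values, marker=-1):
    """Imitate the @change toggle and the SUM(...) OVER() - 1 from the query."""
    change = 0   # plays the role of @change
    total = 0
    for value in values:
        if value == marker:
            # IF(@change = 0, @change := @change + 1, @change := @change - 1)
            change = change + 1 if change == 0 else change - 1
        total += change   # each row contributes the current value of @change
    return total - 1      # drop the 1 contributed by the first -1 row itself

status_measure = [10, 13, -1, 15, 20, 12, -1]
print(count_between_markers(status_measure))  # 3
```

Per row the contribution is 0, 0, 1, 1, 1, 1, 0, which sums to 4, and subtracting 1 gives 3.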
First, more of a request for clarification of your question. I want to expand your original data to include another row, for the sake of 2 vs 3 entries. Also, is there some auto-increment ID in your data that establishes the sequence, such as
Id Name Seconds Status_measure Value
1 a 0 10 3
2 a 10 13 3
3 a 20 -1 3
4 a 30 15 3
5 a 40 20 3
6 a 50 12 3
7 a 60 -1 3
If the data is sequential, IDs 1 and 2 are above -1 before the -1 at ID #3, which would indicate two entries. IDs 4-6 are likewise above -1, giving a count of three entries before ID #7.
So what "VALUE" do you want in your result? Would it be a value of 2 for IDs 1, 2 and 3, and a value of 3 for IDs 4-7? Or do you want ALL entries to show the greatest count found before a -1 measure, i.e. 3 for every row?
Please EDIT your question (you can copy/paste this into it if need be) and provide the additional clarification requested, including whether there is an auto-increment column, since that affects the final output and how breaks are determined.
See the last two rows, where option_order is 0 but type is different. I want to keep them in first position for each type. How can I reorder the values of the option_order column?
The condition is that the '000' choice must come first for each type, which is done by setting its option_order.
current table status:
MULTIPLE_CHOICE Table
id choice type option_order
1 AA 1 1
2 BB 1 2
3 CC 1 3
4 AAA 2 4
5 BBB 2 5
6 CCC 2 6
7 DDD 2 7
8 000 1 0
9 000 2 0
Required updated table:
This is what I need:
updated MULTIPLE_CHOICE Table
id choice type option_order
8 000 1 1
1 AA 1 2
2 BB 1 3
3 CC 1 4
9 000 2 5
4 AAA 2 6
5 BBB 2 7
6 CCC 2 8
7 DDD 2 9
The actual table is too big, so I cannot do this by editing rows by hand. Please help with this complex query; I have no clue how to solve it.
[Note: I need this to solve for mysql version 5.7]
Recalculate the whole column, for example with a user-defined variable:
UPDATE MULTIPLE_CHOICE
SET option_order = (@counter := @counter + 1)
WHERE (SELECT @counter := 0) = 0
ORDER BY type, choice;
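What the UPDATE does to the sample rows can be sketched in Python as follows. This only imitates the counter logic; the function name renumber and the (id, choice, type) tuple layout are made up for illustration, and plain string sorting is assumed to order '000' before the letter choices, as it does here.

```python
def renumber(rows):
    """Assign option_order 1..N following ORDER BY type, choice."""
    counter = 0   # plays the role of @counter, initialized to 0
    renumbered = []
    for row_id, choice, type_ in sorted(rows, key=lambda r: (r[2], r[1])):
        counter += 1   # @counter := @counter + 1
        renumbered.append((row_id, choice, type_, counter))
    return renumbered

# (id, choice, type) taken from the question's current table
rows = [(1, "AA", 1), (2, "BB", 1), (3, "CC", 1),
        (4, "AAA", 2), (5, "BBB", 2), (6, "CCC", 2), (7, "DDD", 2),
        (8, "000", 1), (9, "000", 2)]

for row in renumber(rows):
    print(row)   # (id, choice, type, new option_order)
```

Ids 8 and 9 (the '000' choices) end up first within type 1 and type 2, with option_order 1 and 5.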
can you explain this condition: WHERE (SELECT @counter := 0) = 0. – HiddenHopes
It is a condition only formally: as you can see, it is always TRUE. The purpose of this construction is to initialize the variable.
In SELECT queries we can initialize user-defined variables in a separate subquery cross-joined to the other tables, like:
SELECT {columnset}
FROM {tableset}
CROSS JOIN ( SELECT @variable := starting_value ) AS initialize_variables
{the rest clauses}
But we cannot do the same in UPDATE. The calculation critically depends on the row processing order, i.e. an ORDER BY clause whose ordering expression identifies rows uniquely is compulsory. But a subquery that initializes the variables would convert the single-table UPDATE into a multiple-table UPDATE, which does not support the ORDER BY clause at all!
The way out is to initialize the variable in the WHERE clause. When the server builds the query execution plan, it evaluates all constant expressions, including the ones in the WHERE clause. Moreover, in an UPDATE the server MUST evaluate the WHERE expression before updating, because it must first determine which rows will be updated and only then update them. So the expression in WHERE is evaluated before any row is updated, and hence the variable is guaranteed to be initialized before the rows are iterated.
select count(distinct a,b,c,d) from mytable;
select count(distinct concat(a,'-',b),concat(c,'-',d)) from mytable;
Since '-' never appears in the a, b, c, d fields, the 2 queries above should return the same result. Am I right?
Actually that is not the case: the difference is 4 rows out of ~60M, and I can't figure out how this is possible.
Any ideas or an example?
Thanks
First, I am assuming that you are using MySQL, because that is the only database of your original tags where your syntax would be accepted.
Second, this does not directly answer your question. Given your types and expressions, I do not see how you can get different results. However, very similar constructs can produce different results.
It is very important to note that NULL is not the culprit. If any argument is NULL, COUNT(DISTINCT) does not count that row, and CONCAT() returns NULL, which is not counted either.
However, spaces at the end of strings can be an issue. Consider the results from this query:
select count(distinct x, y),
count(distinct concat(x, '-', y)),
count(distinct concat(y, '-', x))
from (select 1 as x, 'a' as y union all
select 1, 'a ' union all
select 1, NULL
) a
I would expect the second and third columns to return the same thing. But spaces at the end of the string cause differences. COUNT(DISTINCT) ignores them. However, CONCAT() will embed them in the string. Hence, the above returns
1 1 2
And the last two values are different.
In other words, two values may not be exactly the same, but COUNT(DISTINCT) might regard them as the same. Spaces are one example. Collations are another potential culprit.
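The effect can be reproduced outside the database with a rough Python sketch. It assumes a PAD SPACE collation, where trailing spaces are ignored when strings are compared; the helper names pad_space, concat and count_distinct are hypothetical stand-ins, not MySQL functions.

```python
def pad_space(value):
    # Assumption: PAD SPACE collation, so trailing spaces do not matter
    # when two strings are compared for equality.
    return value.rstrip(" ")

def concat(*parts):
    # Like MySQL CONCAT(): the result is NULL (None) if any argument is NULL.
    if any(p is None for p in parts):
        return None
    return "".join(str(p) for p in parts)

def count_distinct(values):
    # NULLs are skipped; strings are compared ignoring trailing spaces.
    return len({pad_space(v) for v in values if v is not None})

# Same three rows as the query above: (x, y)
rows = [(1, "a"), (1, "a "), (1, None)]

pairs = len({(x, pad_space(y)) for x, y in rows
             if x is not None and y is not None})
xy = count_distinct(concat(x, "-", y) for x, y in rows)
yx = count_distinct(concat(y, "-", x) for x, y in rows)
print(pairs, xy, yx)  # 1 1 2
```

In 'a -1' the space is no longer at the end of the string, so it is no longer ignored and the value counts as different from 'a-1'.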
Take this sample data as an example:
A B C D
1 2 3 4
5 6 7 8
1 2 5 7
1 2 5 7
1 3 3 4
1 3 3 4
Then count(distinct a, b, c, d) = 4, because these are the distinct rows:
A B C D
1 2 3 4
5 6 7 8
1 2 5 7
1 3 3 4
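whereas the distinct values of concat(a,'-',b) and of concat(c,'-',d), taken separately, number only 3 each:
dist (a,-,b) dist (c,-,d)
1-2 3-4
5-6 7-8
1-3 5-7
Note that a single COUNT(DISTINCT expr1, expr2), as in the question, counts distinct pairs of values, which is still 4 for this data.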
What I am trying to do is make the first 200 records of a column run from 1 to 200. After the first 200 records, the values should not change.
The current records look like this:
1
2
3
4
4
6
6
...
What I need is to update them to be:
1
2
3
4
5
...
200
What SQL statement do I need to fix them?
Initialize a user-defined variable and do it like this:
SET @rownumber = 0;
UPDATE your_table
SET your_column = (@rownumber := @rownumber + 1)
ORDER BY the_column_that_defines_the_order_of_the_first_200_records
LIMIT 200;
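As a rough illustration of the effect, here is the same idea in Python. The list is assumed to be already in the order given by ORDER BY, the function name renumber_first is made up, and a limit of 5 is used instead of 200 to keep the example short.

```python
def renumber_first(values, limit=200):
    """Overwrite the first `limit` values with 1, 2, 3, ... and keep the rest."""
    rownumber = 0            # plays the role of @rownumber
    result = list(values)    # work on a copy, the input stays untouched
    for i in range(min(limit, len(result))):   # LIMIT stops after `limit` rows
        rownumber += 1       # @rownumber := @rownumber + 1
        result[i] = rownumber
    return result

print(renumber_first([1, 2, 3, 4, 4, 6, 6, 9, 9], limit=5))
# [1, 2, 3, 4, 5, 6, 6, 9, 9]
```

Rows after the limit keep whatever value they had, which matches the "no change after 200 records" requirement.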
I have an existing table of results like this:
race_id race_num racer_id place
1 0 32 2
1 1 32 3
1 2 32 1
1 3 32 6
1 0 44 2
1 1 44 2
1 2 44 2
1 3 44 2
etc...
I have lots of PHP scripts that access this table and output the results in a nice format.
Now I have a case where I need to output the results for only certain race_nums.
So I have created this table, races_included.
race_view race_id race_num
Day 1 1 0
Day 1 1 1
Day 2 1 2
Day 2 1 3
And I can use this query to get the right results.
SELECT racer_id, place from results WHERE race_id=1
AND race_num IN
(SELECT race_num FROM races_included WHERE race_id='1' AND race_view='Day 1')
This is great, but I only need this feature for a few races. To have it work in a compatible way for the simple case of showing all races, I would need to add a lot of rows to the races_included table, like
race_view race_id race_num
All 1 0
All 1 1
All 1 2
All 1 3
95% of my races don't use the daily feature.
So I am looking for a way to change the query so that if there are no records for race 1 in the races_included table, it defaults to all races. In addition, I need it to run at close to the same speed as the query without the IN clause, because this query, or variations of it, is used a lot.
One way that does work is to redefine the table as races_excluded and use NOT IN. This works great, but the table is a pain to manage when races are added or deleted.
Is there a simple way to use EXISTS and IN in tandem as a subquery to get the desired results? Or is there some other neat trick I am missing?
To clarify, I have found a working but very slow solution.
SELECT * FROM race_results WHERE race_id=1
AND FIND_IN_SET(race_num, (SELECT IF((SELECT Count(*) FROM races_excluded
WHERE rid=1>0),(SELECT GROUP_CONCAT(rnum) FROM races_excluded
WHERE rid=1 AND race_view='Day 1' GROUP BY rid),race_num)))
It basically checks whether any records exist for that race_id. If not, it returns a set equal to the current race_num; if so, it returns a list of the included race nums.
You can do this by using OR in the subquery:
SELECT racer_id, place
FROM results
WHERE race_id = 1 AND
      race_num IN (SELECT race_num
                   FROM races_included
                   WHERE race_id = '1' AND (race_view = 'Day 1' OR race_view = 'ANY')
                  );