MySQL limit rows per group weird results

I wanted to get the latest 4 dates for each symbolid. I adapted the code here as follows:
set @num := 0, @symbolid := '';
select symbolid, date,
@num := if(@symbolid = symbolid, @num + 1, 1) as row_number,
@symbolid := symbolid as dummy
from projections
group by symbolid, date desc
having row_number < 5
and get the following results:
symbolid date row_number dummy
1 '2011-09-01 00:00:00' 1 1
1 '2011-08-31 00:00:00' 3 1
1 '2011-08-30 00:00:00' 5 1
2 '2011-09-01 00:00:00' 1 2
2 '2011-08-31 00:00:00' 3 2
2 '2011-08-30 00:00:00' 5 2
3 '2011-09-01 00:00:00' 1 3
3 '2011-08-31 00:00:00' 3 3
3 '2011-08-30 00:00:00' 5 3
4 '2011-09-01 00:00:00' 1 4
...
The obvious question is, why did I only get 3 rows per symbolid, and why are they numbered 1,3,5? A few details:
I tried both forcing an index and not (as seen here), and got the same results both ways.
The dates are correct, i.e., the listing correctly shows the top 3 dates per symbolid, but the row_number value is off
When I don't use the "having" statement, the row numbers are correct, i.e., the most recent date is 1, the next most recent is 2, etc
Obviously the row_number computed field is being affected by the "having" clause, but I don't know how to fix it.
I realize that I could just change the "having" to "having row_number < 7" (6 gives the same as 5), but it's very ugly and would like to know what to do to make it "behave".

I'm not 100% sure why it behaves this way (maybe it's because SELECT is logically processed before ORDER BY and HAVING), but the following should work as expected:
SELECT *
FROM
(
select symbolid, date,
@num := if(@symbolid = symbolid, @num + 1, 1) as row_number,
@symbolid := symbolid as dummy
from projections
INNER JOIN (SELECT @symbolid:=0)c
INNER JOIN (SELECT @num:=0)d
group by symbolid, date desc
) a
WHERE row_number < 5

User-defined variables do not behave reliably in a query like this (refer here):
As a general rule, you should never assign a value to a user variable and read the value within the same statement. You might get the results you expect, but this is not guaranteed. The order of evaluation for expressions involving user variables is undefined and may change based on the elements contained within a given statement; in addition, this order is not guaranteed to be the same between releases of the MySQL Server. In SELECT @a, @a:=@a+1, ..., you might think that MySQL will evaluate @a first and then do an assignment second. However, changing the statement (for example, by adding a GROUP BY, HAVING, or ORDER BY clause) may cause MySQL to select an execution plan with a different order of evaluation.
Here is my proposal
select symbolid,
substring_index(group_concat(date order by date desc), ',', 4) as last_4_dates
from projections
group by symbolid
The drawback of this approach is that it collapses the dates for each symbolid into a single comma-separated string,
so you need to split that string before you can use the individual dates.
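The "explode" step is easiest on the client side. The following is a minimal sketch in Python, assuming the rows come back as (symbolid, last_4_dates) tuples; the helper name explode_dates and the sample rows are made up for illustration.

```python
def explode_dates(rows):
    """Turn (symbolid, 'd1,d2,...') rows into one (symbolid, date) pair per date."""
    return [
        (symbolid, date)
        for symbolid, joined in rows
        for date in joined.split(",")
    ]

# Hypothetical result of the group_concat query above.
collapsed = [
    (1, "2011-09-01 00:00:00,2011-08-31 00:00:00"),
    (2, "2011-09-01 00:00:00"),
]
pairs = explode_dates(collapsed)
```

Keep in mind that GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so for groups with many rows the 4 leading dates are still intact but the approach does not scale to large N.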

Final code:
set @num := 0, @symbolid := '';
select d.* from
(
select symbolid, date,
@num := if(@symbolid = symbolid, @num + 1, 1) as row_number,
@symbolid := symbolid as dummy
from projections
order by symbolid, date desc
) d
where d.row_number < 5
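On MySQL 8.0 and later, window functions sidestep the evaluation-order problem with user variables altogether. The following is a minimal sketch, not the method used in the answers above: it runs the query through Python's sqlite3 module only so that it is self-contained (this assumes the bundled SQLite is 3.25 or newer, which added window functions). The helper name latest_per_group and the sample data are made up for illustration. Apart from the ? placeholder, the SELECT should carry over to MySQL 8.0 unchanged.

```python
import sqlite3

def latest_per_group(rows, n):
    """Return the n most recent (symbolid, date, rn) rows for each symbolid."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE projections (symbolid INTEGER, date TEXT)")
    conn.executemany("INSERT INTO projections VALUES (?, ?)", rows)
    query = """
        SELECT symbolid, date, rn
        FROM (
            SELECT symbolid, date,
                   -- number rows within each symbolid, newest date first
                   ROW_NUMBER() OVER (PARTITION BY symbolid
                                      ORDER BY date DESC) AS rn
            FROM projections
        ) AS ranked
        WHERE rn <= ?
        ORDER BY symbolid, rn
    """
    result = conn.execute(query, (n,)).fetchall()
    conn.close()
    return result

# Hypothetical sample data: 6 consecutive dates for each of 2 symbols.
sample = [(s, f"2011-09-{d:02d}") for s in (1, 2) for d in range(1, 7)]
top4 = latest_per_group(sample, 4)
```

The alias is rn rather than row_number because ROW_NUMBER is a reserved word in MySQL 8.0.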

Related

SQL query to select last X entries for a certain non-primary field

I'm having difficulties setting up a slightly more advanced SQL query.
What I'm trying to do is to select the last 24 entries for every zr_miner_id, but I keep getting SQL timeouts (the table has around 40000 entries so far).
So let's say there's 200 entries for zr_miner_id 1 and 200 for zr_miner_id 2, I'd end up with 48 results.
So far, I've come up with the query below.
What this is supposed to do is to select each result in zec_results that has less than 24 newer entries with the same zr_miner_id.
I couldn't think of any better way to perform this task, but then again, I'm not that far advanced at SQL yet.
SELECT results_a.*
FROM zec_results results_a
WHERE (
SELECT COUNT(results_b.zr_id)
FROM zec_results AS results_b
WHERE results_b.zr_miner_id = results_a.zr_miner_id
AND results_b.zr_id >= results_a.zr_id
) <= 24
Use variables!
SELECT r.*
FROM (SELECT r.*,
(@rn := if(@m = r.zr_miner_id, @rn + 1,
if(@m := r.zr_miner_id, 1, 1)
)
) as rn
FROM zec_results r CROSS JOIN
(SELECT @m := -1, @rn := 0) params
ORDER BY r.zr_miner_id, r.zr_id DESC
) r
WHERE rn <= 24 ;
If you want to put the query into a view, then the above will not work. Performance on your approach might improve with an index on (zr_miner_id, zr_id).

How to eliminate only continuous duplicates but not all duplicates in a select query (MySQL)?

I have a table like this:
01-Jul-17 100
02-Jul-17 100
03-Jul-17 300
04-Jul-17 300
05-Jul-17 500
06-Jul-17 500
07-Jul-17 300
08-Jul-17 400
09-Jul-17 100
10-Jul-17 100
What I want to output is (in this order) by eliminating the continuous duplicates but not all duplicates:
100
300
500
300
400
100
I cannot select Distinct, as it will eliminate the second instances of 300, 100. Is there a way to achieve this result in MySQL?
Thanks!
You want to get the previous value. If the dates really have no gaps or duplicates, just do:
select t.*
from t left join
t tprev
on t.col1 = date_add(tprev.col1, interval 1 day)
where tprev.col2 is null or tprev.col2 <> t.col2;
EDIT:
If the dates don't meet these conditions, then you can use variables:
select t.*
from (select t.*,
(@rn := if(@v = col2, @rn + 1,
if(@v := col2, 1, 1)
)
) as rn
from t cross join
(select @v := 0, @rn := 0) params
order by t.col1
) t
where rn = 1;
Note that MySQL does not guarantee the order of evaluation of expressions in the SELECT. So variables should not be assigned in one expression and then used in another -- they should be assigned in a single expression.
One way to handle this problem is by using session variables to track the changes of the values as ordered by your date column. In the query below, we keep track of the value, ordered by date, and assign a row number to each group of identical value. Then, only the first value in each group is retained. Note that this approach is robust to any number of duplicates. It is also robust with respect to there being gaps in your dates, so long as each record can be ordered by date.
SET @rn = 1;
SET @val = NULL;
SELECT t.val
FROM
(
SELECT
@rn:=CASE WHEN @val = val THEN @rn+1 ELSE 1 END rn,
@val:=val AS val,
dt
FROM yourTable
ORDER BY dt
) t
WHERE t.rn = 1
ORDER BY t.dt;
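You can make use of the LAG window function (available in MySQL 8.0 and later). Note that the derived table needs an alias, and that the default of 0 assumes y is never 0:
select y from (select y, lag(y,1,0) over (order by x) as prev_y from t1) s where y <> prev_y;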
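A minimal sketch of the LAG approach, run through Python's sqlite3 module so that it is self-contained (assuming SQLite 3.25 or newer for window functions). The table and column names t, dt and val, the ISO date strings, and the helper collapse_runs are assumptions made for illustration; the values are the ones from the question. Instead of a default of 0, the first row is detected with prev_val IS NULL, which assumes val itself is never NULL.

```python
import sqlite3

def collapse_runs(rows):
    """Return val for each row whose val differs from the previous row's val."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (dt TEXT, val INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    query = """
        SELECT val
        FROM (
            SELECT dt, val,
                   LAG(val) OVER (ORDER BY dt) AS prev_val
            FROM t
        ) AS s
        WHERE prev_val IS NULL OR val <> prev_val
        ORDER BY dt
    """
    out = [r[0] for r in conn.execute(query)]
    conn.close()
    return out

# Values from the question, one per day starting 2017-07-01.
values = [100, 100, 300, 300, 500, 500, 300, 400, 100, 100]
rows = [(f"2017-07-{day:02d}", v) for day, v in enumerate(values, start=1)]
result = collapse_runs(rows)
```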

I need to find any 5 rows that match where clause and they occur "in a row" (they are neighbors)

I have a MySQL table for a fictional fitness app.
Let's say the app monitors each user's progress on pushups day by day.
TrainingDays
id | id_user | date | number_of_pushups
Now I need to find out whether a user has ever managed to do more than 100 pushups 5 days in a row.
I know this is probably doable by fetching all days and then looping in PHP, but I wonder whether it is possible to do this in plain MySQL.
In MySQL, the easiest way is to use variables. The following gets all sequences of days with 100 or more pushups:
select grp, count(*) as numdaysinarow
from (select (date - interval rn day) as grp, td.*
from (select td.*,
(@rn := if(@i = id_user, @rn + 1,
if(@i := id_user, 1, 1)
)
) as rn
from trainingdays td cross join
(select @rn := 0, @i := NULL) vars
where number_of_pushups >= 100
order by id_user, date
) td
) td
group by grp;
This uses the observation that when you subtract an increasing sequence of numbers (1, 2, 3, ...) from a run of consecutive dates, the resulting value is constant for the whole run.
To determine if there are 5 or more days in a row, use max():
select max(numdaysinarow)
from (select grp, count(*) as numdaysinarow
from (select (date - interval rn day) as grp, td.*
from (select td.*,
(@rn := if(@i = id_user, @rn + 1,
if(@i := id_user, 1, 1)
)
) as rn
from trainingdays td cross join
(select @rn := 0, @i := NULL) vars
where number_of_pushups >= 100
order by id_user, date
) td
) td
group by grp
) td;
Your app can then check the value against whatever minimum you like.
Note: this assumes that there is only one record per day. The above can easily be modified if you are looking for the sum of the number of pushups on each day.
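The date-minus-row-number observation can also be sketched with ROW_NUMBER() instead of user variables. This is an illustration under stated assumptions, not the answer's code: it runs on SQLite (3.25 or newer) through Python's sqlite3 module, the sample rows and the helper longest_streaks are invented, and it groups by id_user as well as grp so that streaks of different users are never merged. SQLite's date(date, '-N days') stands in for MySQL's date - interval rn day.

```python
import sqlite3

def longest_streaks(rows, minimum=100):
    """Map each id_user to their longest run of consecutive qualifying days."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE trainingdays "
        "(id_user INTEGER, date TEXT, number_of_pushups INTEGER)"
    )
    conn.executemany("INSERT INTO trainingdays VALUES (?, ?, ?)", rows)
    query = """
        WITH qualifying AS (
            SELECT id_user, date,
                   ROW_NUMBER() OVER (PARTITION BY id_user
                                      ORDER BY date) AS rn
            FROM trainingdays
            WHERE number_of_pushups >= ?
        ),
        islands AS (
            -- consecutive dates minus 1, 2, 3, ... days give the same grp
            SELECT id_user, date(date, '-' || rn || ' days') AS grp
            FROM qualifying
        )
        SELECT id_user, MAX(days)
        FROM (
            SELECT id_user, grp, COUNT(*) AS days
            FROM islands
            GROUP BY id_user, grp
        ) AS runs
        GROUP BY id_user
    """
    result = dict(conn.execute(query, (minimum,)).fetchall())
    conn.close()
    return result

# Hypothetical data: user 1 has a 5-day streak, user 2 at most 2 days.
sample = (
    [(1, f"2020-01-{d:02d}", 120) for d in range(1, 6)]
    + [(1, "2020-01-06", 50), (1, "2020-01-07", 110), (1, "2020-01-08", 110)]
    + [(2, "2020-01-01", 100), (2, "2020-01-02", 100),
       (2, "2020-01-03", 90), (2, "2020-01-04", 150)]
)
streaks = longest_streaks(sample)
```

The application can then test streaks[user] >= 5. Like the original, this assumes one record per user per day.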
Order of records shouldn't be relied on, e.g. with ORDER BY you can change the sequence.
However, a database gives you many functions to work with, which also lets you use less PHP. What you want is the SUM function. Combined with a WHERE clause, this should get you started:
SELECT SUM(number_of_pushups) AS sum_pushups
FROM TrainingDays
WHERE date >= :start_day
AND id_user = :user_id

What are the subquery equivalents of SQL aggregate functions MAX/MIN/AVG/COUNT

Can someone show me how to represent the following SQL statements without the use of aggregate functions?
SELECT COUNT(column) FROM table;
SELECT AVG(column) FROM table;
SELECT MAX(column) FROM table;
SELECT MIN(column) FROM table;
MIN() and MAX() can be done with simple subqueries:
select (select column from table order by column is not null desc, column asc limit 1) as "MIN",
(select column from table order by column is not null desc, column desc limit 1) as "MAX"
COUNT() and AVG() require the use of variables, if you don't allow any aggregations:
select rn as "COUNT", sumcol / rna as "AVG"
from (select t.*
from (select t.*,
(@rn := @rn + 1) as rn,
(@rna := @rna + if(column is not null, 1, 0)) as rna,
(@sum := @sum + coalesce(column, 0)) as sumcol
from table t cross join
(select @rn := 0, @rna := 0, @sum := 0) const
order by column
) t
order by rn desc
limit 1
) t
This latter formulation only works in MySQL.
EDIT:
The empty table is a challenge. Let's do this with a left outer join:
select cast(coalesce(rn, 0) as signed) as "COUNT",
(case when rna > 0 then sumcol / rna end) as "AVG"
from (select 1 as n
) n left outer join
(select t.*
from (select t.*,
(@rn := @rn + 1) as rn,
(@rna := @rna + if(column is not null, 1, 0)) as rna,
(@sum := @sum + coalesce(column, 0)) as sumcol
from table t cross join
(select @rn := 0, @rna := 0, @sum := 0) const
order by column
) t
order by rn desc
limit 1
) t
on n.n = 1;
Notes. This will return 0 for the count if the table is empty. That is correct. If the table is empty, it will return NULL for the average, and that is also correct.
If the table is not empty, but the values are all NULL, then it will also return NULL. The types for the count are always integers, so that should be ok. The type of the average is more problematic, but the variables will return some sort of generic numeric type, which seems compatible in spirit.
min/max can be replaced with something like this:
select t1.pk_column,
t1.some_column
from the_table t1
where t1.some_column < ALL (select t2.some_column
from the_table t2
where t2.pk_column <> t1.pk_column);
For getting the max you need to replace < with >. pk_column is the primary key column of the table and is needed to avoid comparing each row to itself (it doesn't have to be a PK; it only needs to be unique).
I don't think there is an alternative for count() or avg() (at least I can't think of one)
I used the_column and the_table because column and table are reserved words
SET @t1=0, @t2=0, @t3=0, @t4=0;
COUNT:
Select @t1:=@t1+1 as CNT from table
order by @t1:=@t1+1 DESC
LIMIT 1
Similar methods could be put together for Avg and max/min using limits...
Still thinking about Min/Max...
Not to supersede the excellent answer from Gordon Linoff, but there's a little more work involved to accurately emulate the AVG(), COUNT(), and SUM() functions. (The answer for the MIN and MAX functions in Gordon's answer are spot on.)
There's a corner case when the table is empty. In order to emulate the SQL aggregate functions, we need our query to return a single row. But at the same time, we need a test of whether or not the table contains at least one row.
Here's a query that is a more precise emulation:
-- create an empty table
CREATE TABLE `foo` (col INT);
-- TRUNCATE TABLE `foo`;
SELECT IF(s.ne IS NULL,0,s.rn) AS `COUNT(*)`
, IF(s.cc>0,s.tc,NULL) AS `SUM(col)`
, IF(s.cc>0,s.tc/s.cc,NULL) AS `AVG(col)`
FROM ( SELECT v.rn
, v.cc
, v.tc
, e.ne
FROM ( SELECT @rn := @rn + 1 AS rn
, @cc := @cc + (t.col IS NOT NULL) AS cc
, @tc := @tc + IFNULL(t.col,0) AS tc
FROM (SELECT @rn := 0, @cc := 0, @tc := 0) c
LEFT
JOIN `foo` t
ON 1=1
) v
LEFT
JOIN (SELECT 1 AS ne FROM `foo` z LIMIT 1) e
ON 1=1
ORDER BY v.rn DESC
LIMIT 1
) s
NOTES:
The purpose of the inline view aliased as e is to give us a way to determine whether or not the table contains any rows. If the table contains at least one row, we'll get a value of 1 returned as column ne (not empty). If the table is empty, that query won't return a row, and e.ne will be NULL, which is something we can test in the outer query.
In order to return a row, so we can return a value, like a 0 for a COUNT, we need to ensure that we return at least one row from the inline view v. Since we are guaranteed exactly one row from the inline view aliased as c (which initializes our user-defined variables), we'll use that as the "driving" table for a LEFT [OUTER] JOIN operation.
But if the table is empty, our row counter (@rn) coming out of v is going to have a value of 1. We'll deal with that: we have e.ne, which we can check to know whether the count should really be returned as 0.
In order to calculate the average, we can't divide by the row counter; we have to divide by the number of rows where col was not null. We use the @cc user-defined variable to keep track of the count of those rows.
Similarly, for the SUM (and the average) we need to accumulate only the non-NULL values. (If we were to add a NULL, it would turn the whole total to NULL, basically wiping out our accumulation.) So we do a conditional test to check whether t.col IS NULL, to avoid accidentally wiping out the accumulation. Our accumulator is going to be 0 if there aren't any rows that are not null. But that's not a problem, because we check @cc to see whether any rows were included. We need to check it anyway, to avoid a "divide by zero" issue.
To test, run against the empty table foo. It will return a count of 0, and NULL for SUM and AVG, equivalent to the result we get from:
SELECT COUNT(*), SUM(col), AVG(col) FROM foo;
We can also test the query against a table containing only NULL values for col:
INSERT INTO `foo` (col) VALUES (NULL);
As well as some non-NULL values:
INSERT INTO `foo` (col) VALUES (2),(3),(5),(7),(11),(13),(17),(19);
And compare the results of the two queries.
This is essentially the same as the answer from Gordon Linoff, with just a little more precision to work around the corner cases of NULL values and the empty table.
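For comparison, here is a small sketch of the reference behaviour that any emulation has to reproduce in the three test cases described above (empty table, only NULLs, mixed values). It uses SQLite's built-in aggregates through Python's sqlite3 module as a stand-in for MySQL, on the assumption that both follow the standard SQL rules for NULLs and empty inputs; the helper name builtin_aggregates is made up.

```python
import sqlite3

def builtin_aggregates(values):
    """Return (COUNT(*), SUM(col), AVG(col)) for a table holding the given values."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE foo (col INTEGER)")
    conn.executemany("INSERT INTO foo (col) VALUES (?)", [(v,) for v in values])
    row = conn.execute("SELECT COUNT(*), SUM(col), AVG(col) FROM foo").fetchone()
    conn.close()
    return row

empty = builtin_aggregates([])                 # no rows at all
only_null = builtin_aggregates([None])         # one row, col IS NULL
mixed = builtin_aggregates([None, 2, 3, 5, 7, 11, 13, 17, 19])
```

The empty table gives a count of 0 with NULL sum and average; the all-NULL table gives a count of 1 with NULL sum and average; in the mixed case the NULL row is counted by COUNT(*) but ignored by SUM and AVG.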

Limit rows of every month?

I need to get data from December 2012 to November 2014.
Each month I only need 1500 rows.
For example:
SELECT * FROM data WHERE YEAR(submit_date) = 2012 AND MONTH(submit_date) = 12 limit 1500;
SELECT * FROM data WHERE YEAR(submit_date) = 2013 AND MONTH(submit_date) = 1 limit 1500;
SELECT * FROM data WHERE YEAR(submit_date) = 2013 AND MONTH(submit_date) = 2 limit 1500;
SELECT * FROM data WHERE YEAR(submit_date) = 2013 AND MONTH(submit_date) = 3 limit 1500;
and so on until Nov 2014.
Is there a way to write this as a shorter SQL query?
There are some options listed here: http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/
IMHO one of the best is using a row-counter:
set @num := 0, @type := '';
select id, name, submit_date,
@num := if(@type = CONCAT(YEAR(submit_date), MONTH(submit_date)), @num + 1, 1) as row_number,
@type := CONCAT(YEAR(submit_date), MONTH(submit_date)) as dummy
from data force index(IX_submit_date)
group by id, name, submit_date
having row_number <= 2;
You can test it here: http://sqlfiddle.com/#!2/e829c/13 (I do a cut for 2 elements, not for 1500)
I think you're looking for a GROUP BY clause. I would need to know a bit more to give you a definitive answer, but the following pseudo-query might guide you in the right direction.
SELECT *, SUM(some_field)
FROM data
GROUP BY MONTH(submit_date)
Or if you only need 1500 rows, select the top 1500 ordered by the date
SELECT TOP(1500) *
FROM data
WHERE submit_date > '12-01-2012' AND submit_date < '11-01-2014'
ORDER BY MONTH(submit_date)
With MySQL you can use LIMIT
SELECT *
FROM data
WHERE submit_date >= '2012-12-01' AND submit_date < '2014-12-01'
ORDER BY MONTH(submit_date)
LIMIT 0,1500;
You can do it almost like you have it, just add a UNION between your queries. But you still have to create 1 query per month.
Otherwise you need to enumerate the rows that are returned. You need to first order and enumerate your records, then you can do a select on that select to get only the top X. Not sure if you want to include the last month or not.
SET @prev_month = '', @incr = 0;
SELECT * FROM (
SELECT IF(@prev_month = DATE_FORMAT(submit_date, '%Y-%m'), @incr := @incr+1, @incr := 1) AS row_num,
data.*,
(@prev_month := DATE_FORMAT(submit_date, '%Y-%m')) AS set_prev_month
FROM data WHERE submit_date BETWEEN "2012-12-01" AND "2014-11-30"
ORDER BY submit_date
) tmp WHERE row_num <= 1500;
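On MySQL 8.0 and later, the per-month enumeration can be written with ROW_NUMBER() partitioned by year and month, which does not depend on the evaluation order of user variables. Below is a minimal sketch under stated assumptions: it runs on SQLite (3.25 or newer) through Python's sqlite3 module, the helper first_n_per_month and the sample rows are invented, and the id column is assumed to exist as a tie-breaker. SQLite's strftime('%Y-%m', submit_date) plays the role of MySQL's DATE_FORMAT(submit_date, '%Y-%m').

```python
import sqlite3

def first_n_per_month(rows, n, start, end):
    """Return up to n rows per calendar month with start <= submit_date < end."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, submit_date TEXT)")
    conn.executemany("INSERT INTO data (id, submit_date) VALUES (?, ?)", rows)
    query = """
        SELECT id, submit_date
        FROM (
            SELECT id, submit_date,
                   ROW_NUMBER() OVER (
                       PARTITION BY strftime('%Y-%m', submit_date)
                       ORDER BY submit_date, id
                   ) AS rn
            FROM data
            WHERE submit_date >= ? AND submit_date < ?
        ) AS numbered
        WHERE rn <= ?
        ORDER BY submit_date, id
    """
    out = conn.execute(query, (start, end, n)).fetchall()
    conn.close()
    return out

# Hypothetical rows: 3 in Dec 2012, 1 in Jan 2013, 4 in Feb 2013, 1 out of range.
sample = [
    (1, "2012-12-05"), (2, "2012-12-10"), (3, "2012-12-20"),
    (4, "2013-01-15"),
    (5, "2013-02-01"), (6, "2013-02-02"), (7, "2013-02-03"), (8, "2013-02-04"),
    (9, "2014-12-01"),
]
picked = first_n_per_month(sample, 2, "2012-12-01", "2014-12-01")
```

With n set to 1500 and the half-open range from 2012-12-01 up to (but not including) 2014-12-01, this covers December 2012 through November 2014 in a single query.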