I have a column of timestamps and I would like to get a result where I can see
the number of entries added on each date (added_on_this_date)
and the running total since the beginning (total_since_beginning).
My table:
added
==========
1392040040
1392050040
1392060040
1392070040
1392080040
1392090040
1392100040
1392110040
1392120040
1392130040
1392140040
1392150040
1392160040
1392170040
1392180040
1392190040
1392200040
The result should look like:
date | added_on_this_date | total_since_beginning
=========================================================
2014-02-10 | 4 | 4
2014-02-11 | 9 | 13
2014-02-12 | 4 | 17
I'm using this query which gives me the wrong result
SELECT FROM_UNIXTIME(added, '%Y-%m-%d') AS date,
count(*) AS added_on_this_date,
(SELECT COUNT(*) FROM mytable t2 WHERE t2.added <= t.added) AS total_since_beginning
FROM mytable t WHERE 1=1 GROUP BY date
I've created a fiddle for better understanding: http://sqlfiddle.com/#!2/a72a9/1
You're mixing unix timestamps and yyyy-mm-dd dates.
Since you group by a yyyy-mm-dd date, you can't be sure which timestamp will be used in the correlated comparison.
You could do
SELECT FROM_UNIXTIME(added, '%Y-%m-%d') AS date,
count(*) AS added_on_this_date,
(SELECT COUNT(*) FROM mytable t2 WHERE FROM_UNIXTIME(t2.added, '%Y-%m-%d') <= FROM_UNIXTIME(t.added, '%Y-%m-%d')) AS total_since_beginning
FROM mytable t GROUP BY date
This is probably more efficient to do with variables than with a subquery:
select date, added_on_this_date,
@cumsum := @cumsum + added_on_this_date as total_since_beginning
from (SELECT FROM_UNIXTIME(added, '%Y-%m-%d') AS date,
count(*) AS added_on_this_date
FROM mytable t
WHERE 1=1
GROUP BY date
) d cross join
(select @cumsum := 0) const
order by date;
EDIT (in response to comment):
The above query has a significant performance advantage because it aggregates the data once, and that is basically all the work the query needs to do. Your original formulation with a correlated subquery could be optimized using an appropriate index. Unfortunately, once the condition in the correlated subquery applies a function to columns from both tables, MySQL will (in general) not be able to take advantage of an index.
Because the query is aggregating by date anyway, this should perform much better.
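If you do want to keep a correlated subquery, here is a minimal sketch (my own, not from the answers above) that leaves t2.added untouched so an index on added stays usable, by comparing against the unix timestamp of the next day's midnight:
SELECT FROM_UNIXTIME(added, '%Y-%m-%d') AS date,
       COUNT(*) AS added_on_this_date,
       (SELECT COUNT(*)
          FROM mytable t2
         -- t2.added is left bare, so an index on (added) can still be used;
         -- the boundary is the start of the day *after* the outer row's date
         WHERE t2.added < UNIX_TIMESTAMP(DATE(FROM_UNIXTIME(t.added)) + INTERVAL 1 DAY)
       ) AS total_since_beginning
FROM mytable t
GROUP BY date;
-- note: like the original query, this relies on a sql_mode without ONLY_FULL_GROUP_BY,
-- since t.added is referenced while grouping by the derived date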
Related
I have the following query:
set @cumulativeSum := 0;
select
(@cumulativeSum := @cumulativeSum + (count(distinct `ce`.URL, `ce`.`IP`))) as `uniqueClicks`,
cast(`ce`.`dt` as date) as `createdAt`
from (SELECT DISTINCT min((date(CODE_EVENTS.CREATED_AT))) dt, CODE_EVENTS.IP, CODE_EVENTS.URL
FROM CODE_EVENTS
GROUP BY CODE_EVENTS.IP, CODE_EVENTS.URL) as ce
join ATTACHMENT on `ce`.URL = ATTACHMENT.`ID`
where ATTACHMENT.`USER_ID` = 6
group by cast(`ce`.`dt` as date)
ORDER BY ce.URL;
It works almost OK. I would like the result set to contain a date and the cumulative sum as uniqueClicks, but the problem is that in my result set the values are not added up together.
uniqueClicks createdAt
1 2018-02-01
3 2018-02-03
1 2018-02-04
and I'd like to have
uniqueClicks createdAt
1 2018-02-01
4 2018-02-03
5 2018-02-04
I believe you can obtain a rolling sum of the unique clicks without needing to resort to session variables:
SELECT
t1.CREATED_AT,
(SELECT SUM(t2.uniqueClicks) FROM
(
SELECT CREATED_AT, COUNT(DISTINCT IP, URL) uniqueClicks
FROM CODE_EVENTS
GROUP BY CREATED_AT
) t2
WHERE t2.CREATED_AT <= t1.CREATED_AT) uniqueClicksRolling
FROM
(
SELECT DISTINCT CREATED_AT
FROM CODE_EVENTS
) t1
ORDER BY t1.CREATED_AT;
The subquery aliased as t2 computes the number of unique clicks on each day which appears in your table; the distinct count of (IP, URL) pairs is what determines the number of clicks. We can then query this intermediate table and sum the clicks for all days up to and including the current date. This effectively gives a running total and replaces your use of session variables.
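For reference, if you are on MySQL 8.0 or later, the same rolling sum can be sketched with a window function instead of session variables (reusing the table and column names from the question):
SELECT CREATED_AT,
       SUM(uniqueClicks) OVER (ORDER BY CREATED_AT) AS uniqueClicksRolling
FROM (
    -- unique clicks per day, same as the t2 subquery above
    SELECT CREATED_AT, COUNT(DISTINCT IP, URL) AS uniqueClicks
    FROM CODE_EVENTS
    GROUP BY CREATED_AT
) t
ORDER BY CREATED_AT;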
I've tried a few things but I've ended up confusing myself.
What I am trying to do is find the most recent records from a table and left join the first after a certain date.
An example might be
id | acct_no | created_at | some_other_column
1 | A0001 | 2017-05-21 00:00:00 | x
2 | A0001 | 2017-05-22 00:00:00 | y
3 | A0001 | 2017-05-22 00:00:00 | z
So ideally what I'd like is to find the latest record for each acct_no, sorted by created_at DESC, so that the results contain one row per unique account number. From the rows above that would be id 3, but obviously there would be multiple account numbers with records on different days.
Then, what I am trying to achieve is to join on the same table and find the first record with the same account number after a certain date.
For example, record 1 would be returned for a query joining on acct_no A0001 on or after 2017-05-21 00:00:00, because it is the first result on/after that date; so these are sorted by created_at ASC with created_at >= "2017-05-21 00:00:00" (and possibly AND id != latest.id).
It seems quite straightforward but I just can't get it to work.
I only have my most recent attempt after discarding multiple different queries.
Here I am trying to solve the first part which is to select the most recent of each account number:
SELECT latest.* FROM my_table latest
JOIN (SELECT acct_no, MAX(created_at) FROM my_table GROUP
BY acct_no) latest2
ON latest.acct_no = latest2.acct_no
but that still returns all rows rather than the most recent of each.
I did have something using a join on a subquery, but it took so long to run that I quit it before it finished, even though I have indexes on acct_no and created_at. I've also run into other problems where columns in the SELECT are not in the GROUP BY. I know that check can be turned off, but I'm trying to find a way to write the query that doesn't require it.
Just try a little edit to your initial query:
SELECT latest.* FROM my_table latest
join (SELECT acct_no, MAX(created_at) as max_time FROM my_table GROUP
BY acct_no) latest2
ON latest.acct_no = latest2.acct_no AND latest.created_at = latest2.max_time
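The second part of the question (the first row per account on or after a given date) isn't covered above; here is a hedged sketch of one way to bolt it on, using the question's example cutoff of 2017-05-21 00:00:00:
SELECT latest.*, first_after.*
FROM my_table latest
JOIN (SELECT acct_no, MAX(created_at) AS max_time
      FROM my_table
      GROUP BY acct_no) latest2
  ON latest.acct_no = latest2.acct_no
 AND latest.created_at = latest2.max_time
-- first row per account on or after the cutoff (the date is just the question's example)
LEFT JOIN (SELECT acct_no, MIN(created_at) AS min_time
           FROM my_table
           WHERE created_at >= '2017-05-21 00:00:00'
           GROUP BY acct_no) fa
  ON fa.acct_no = latest.acct_no
LEFT JOIN my_table first_after
  ON first_after.acct_no = fa.acct_no
 AND first_after.created_at = fa.min_time;
-- note: ties on created_at (like ids 2 and 3 in the sample) may need an extra tie-breaker on id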
Here is a different approach. I'm not sure about the performance impact, but I'm hoping that avoiding the self join and GROUP BY will be better in terms of performance.
SELECT * FROM (
SELECT mytable1.*, IF(@temp <> acct_no, 1, 0) selector, @temp := acct_no FROM `mytable1`
JOIN (SELECT @temp := '') a
ORDER BY acct_no, created_at DESC , id DESC
) b WHERE selector = 1
Sql Fiddle
You need to get the id of the row with the max created_at per account (assuming id increases with created_at):
SELECT latest.* FROM my_table latest
join (SELECT MAX(id) as id FROM my_table
GROUP BY acct_no) latest2
ON latest.id = latest2.id
I have a simple MySQL table like below, used to compute MPG for a car.
+-------------+-------+---------+
| DATE | MILES | GALLONS |
+-------------+-------+---------+
| JAN 25 1993 | 20.0 | 3.00 |
| FEB 07 1993 | 55.2 | 7.22 |
| MAR 11 1993 | 44.1 | 6.28 |
+-------------+-------+---------+
I can easily compute the Miles Per Gallon (MPG) for the car using a select statement, but because the MPG varies widely from fillup to fillup (i.e. you don't fill the exact same amount of gas each time), I would like to compute a 'MOVING AVERAGE' as well. So for any row the MPG is MILES/GALLONS for that row, and the MOVINGMPG is the SUM(MILES)/SUM(GALLONS) for the last N rows. If fewer than N rows exist by that point, just SUM(MILES)/SUM(GALLONS) up to that point.
Is there a single SELECT statement that will fetch the rows with MPG and MOVINGMPG by substituting N into the select statement?
Yes, it's possible to return the specified resultset with a single SQL statement.
Unfortunately, MySQL does not support analytic functions, which would make for a fairly simple statement. Even though MySQL does not have syntax to support them, it is possible to emulate some analytic functions using MySQL user variables.
One of the ways to achieve the specified result set (with a single SQL statement) is to use a JOIN operation, assigning a unique ascending integer value (rownum, derived within the query) to each row.
For example:
SELECT q.rownum AS rownum
, q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM ( SELECT @s_rownum := @s_rownum + 1 AS rownum
, s.date
, s.miles
, s.gallons
FROM mytable s
JOIN (SELECT @s_rownum := 0) c
ORDER BY s.date
) q
JOIN ( SELECT @t_rownum := @t_rownum + 1 AS rownum
, t.date
, t.miles
, t.gallons
FROM mytable t
JOIN (SELECT @t_rownum := 0) d
ORDER BY t.date
) r
ON r.rownum <= q.rownum
AND r.rownum > q.rownum - 2
GROUP BY q.rownum
Your desired value of "n" to specify how many rows to include in each rollup row is specified in the predicate just before the GROUP BY clause. In this example, up to "2" rows in each running total row.
If you specify a value of 1, you will get (basically) the original table returned.
To eliminate any "incomplete" running total rows (consisting of fewer than "n" rows), that value of "n" would need to be specified again, adding:
HAVING COUNT(1) >= 2
sqlfiddle demo: http://sqlfiddle.com/#!2/52420/2
Followup:
Q: I'm trying to understand your SQL statement. Does your solution do a select of twenty rows for each row in the db? In other words, if I have 1000 rows will your statement perform 20000 selects? (I'm worried about performance)...
A: You are right to be concerned with performance.
To answer your question, no, this does not perform 20,000 selects for 1,000 rows.
The performance hit comes from the two (essentially identical) inline views (aliased as q and r). What MySQL does with these (basically) is create temporary MyISAM tables (MySQL calls them "derived tables"), which are basically copies of mytable, with an extra column, each row assigned a unique integer value from 1 to the number of rows.
Once the two "derived" tables are created and populated, MySQL runs the outer query, using those two "derived" tables as a row source. Each row from q, is matched with up to n rows from r, to calculate the "running total" miles and gallons.
For better performance, you could use a column already in the table, rather than having the query assign unique integer values. For example, if the date column is unique, then you could calculate "running total" over a certain period of days.
SELECT q.date AS latest_date
, SUM(q.miles)/SUM(q.gallons) AS latest_mpg
, COUNT(1) AS cnt_rows
, MIN(r.date) AS earliest_date
, SUM(r.miles) AS rtot_miles
, SUM(r.gallons) AS rtot_gallons
, SUM(r.miles)/SUM(r.gallons) AS rtot_mpg
FROM mytable q
JOIN mytable r
ON r.date <= q.date
AND r.date > q.date + INTERVAL -30 DAY
GROUP BY q.date
(For performance, you would want an appropriate index defined with date as a leading column in the index.)
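For example, a possible index definition might look like the following (the index names are purely illustrative):
-- index with date as the leading column
CREATE INDEX mytable_date_ix ON mytable (date);
-- or, to make the index covering for the rolling-total query
CREATE INDEX mytable_date_miles_gallons_ix ON mytable (date, miles, gallons);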
For the first query, any predicates included (in the inline view definition queries) to reduce the number of rows returned (for example, return only date values in the past year) would reduce the number of rows to be processed, and would also likely improve performance.
Again, to your question about running 20,000 selects for 1,000 rows... a nested loops operation is another way to get the same result set. For a large number of rows, this can exhibit slower performance. On the other hand, this approach can be fairly efficient when only a few rows are being returned:
SELECT q.date AS latest_date
, q.miles/q.gallons AS latest_mpg
, ( SELECT SUM(r.miles)/SUM(r.gallons)
FROM mytable r
WHERE r.date <= q.date
AND r.date >= q.date + INTERVAL -90 DAY
) AS rtot_mpg
FROM mytable q
ORDER BY q.date
Something like this should work:
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal
FROM YourTable
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
SQL Fiddle Demo
Which produces the following:
DATE MILES GALLONS MILESPERGALLON RUNNINGTOTAL
January, 25 1993 20 3 6.666667 6.666666666667
February, 07 1993 55.2 7.22 7.645429 7.358121330724
March, 11 1993 44.1 6.28 7.022293 7.230303030303
--EDIT--
In response to the comment, you can add another Row Number to limit your results to the last N rows:
SELECT *
FROM (
SELECT Date, Miles, Gallons, Miles/Gallons as MilesPerGallon,
@Miles:=@Miles+Miles overallMiles,
@Gallons:=@Gallons+Gallons overallGallons,
@RunningTotal:=@Miles/@Gallons runningTotal,
@RowNumber:=@RowNumber+1 rowNumber
FROM (SELECT * FROM YourTable ORDER BY Date DESC) u
JOIN (SELECT @Miles:= 0) t
JOIN (SELECT @Gallons:= 0) s
JOIN (SELECT @RowNumber:= 0) r
) t
WHERE rowNumber <= 3
Just change your ORDER BY clause accordingly. And here is the updated fiddle.
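As a side note, on MySQL 8.0+ the same N-row moving average can be written with a window frame, avoiding user variables entirely. A sketch with N = 3, using the same placeholder table name:
SELECT Date, Miles, Gallons,
       Miles / Gallons AS MilesPerGallon,
       -- sums over the current row and the 2 preceding rows (N = 3)
       SUM(Miles) OVER w / SUM(Gallons) OVER w AS MovingMPG
FROM YourTable
WINDOW w AS (ORDER BY Date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW)
ORDER BY Date;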
I want to write a MySQL query to get daily differential values from a table that looks like this:
Date | VALUE
--------------------------------
"2011-01-14 19:30" | 5
"2011-01-15 13:30" | 6
"2011-01-15 23:50" | 9
"2011-01-16 9:30" | 10
"2011-01-16 18:30" | 15
I have made two subqueries. The first one gets the last value of each day, because I want to compute the differences from that data:
SELECT r.Date, r.VALUE
FROM table AS r
JOIN (
SELECT DISTINCT max(t.Date) AS Date
FROM table AS t
WHERE t.Date < CURDATE()
GROUP BY DATE(t.Date)
) AS x USING (Date)
The second one computes the differential values from the result of the first one (shown here using the name "table"):
SELECT Date, VALUE - IFNULL(
(SELECT MAX( VALUE )
FROM table
WHERE Date < t1.Date) , 0) AS diff
FROM table AS t1
ORDER BY Date
At first, I tried to save the result of the first query in a temporary table, but it's not possible to use a temporary table with the second query. If I put the first query inside the FROM of the second one, in parentheses with an alias, the server complains that the table alias doesn't exist. How can I get something like this:
Date | VALUE
---------------------------
"2011-01-15 00:00" | 4
"2011-01-16 00:00" | 6
Try this query -
SELECT
t1.dt AS date,
t1.value - t2.value AS value
FROM
(SELECT DATE(date) dt, MAX(value) value FROM table GROUP BY dt) t1
JOIN
(SELECT DATE(date) dt, MAX(value) value FROM table GROUP BY dt) t2
ON t1.dt = t2.dt + INTERVAL 1 DAY
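Note that the join above pairs each day with the immediately preceding calendar day, so it assumes there are no gaps between dates. If gaps are possible, here is a hedged variant that compares against the latest earlier day instead (the table name is backticked because table is a reserved word; it also assumes VALUE never decreases over time, as the question's own second query does):
SELECT t1.dt AS date,
       -- subtract the latest earlier day's last value; NULL for the very first day
       t1.value - (SELECT MAX(t2.value)
                   FROM `table` t2
                   WHERE DATE(t2.date) < t1.dt) AS value
FROM (SELECT DATE(date) dt, MAX(value) value FROM `table` GROUP BY dt) t1
ORDER BY t1.dt;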
I have a table that looks something like this:
DataTable
+------------+------------+------------+
| Date | DailyData1 | DailyData2 |
+------------+------------+------------+
| 2012-01-23 | 146.30 | 212.45 |
| 2012-01-20 | 554.62 | 539.11 |
| 2012-01-19 | 710.69 | 536.35 |
+------------+------------+------------+
I'm trying to create a view (call it AggregateView) that will, for each date and for each data column, show a few different aggregates. For example, select * from AggregateView where Date = '2012-01-23' might give:
+------------+--------------+----------------+--------------+----------------+
| Date | Data1_MTDAvg | Data1_20DayAvg | Data2_MTDAvg | Data2_20DayAvg |
+------------+--------------+----------------+--------------+----------------+
| 2012-01-23 | 697.71 | 566.34 | 601.37 | 192.13 |
+------------+--------------+----------------+--------------+----------------+
where Data1_MTDAvg shows avg(DailyData1) for each date in January prior to Jan 23, and Data1_20DayAvg shows the same but for the prior 20 dates in the table. I'm no SQL ninja, but I was thinking that the best way to do this would be via subqueries. The MTD average is easy:
select t1.Date, (select avg(t2.DailyData1)
from DataTable t2
where t2.Date <= t1.Date
and month(t2.Date) = month(t1.Date)
and year(t2.Date) = year(t1.Date)) Data1_MTDAvg
from DataTable t1;
But I'm getting hung up on the 20-day average due to the need to limit the number of results returned. Note that the dates in the table are irregular, so I can't use a date interval; I need the last twenty records in the table, rather than just all records over the last twenty days. The only solution I've found is to use a nested subquery to first limit the records selected, and then take the average.
Alone, the subquery works for individual hardcoded dates:
select avg(t2.DailyData1) Data1_20DayAvg
from (select DailyData1
from DataTable
where Date <= '2012-01-23'
order by Date desc
limit 0,20) t2;
But trying to embed this as part of the greater query blows up:
select t1.Date, (select avg(t2.DailyData1) Data1_20DayAvg
from (select DailyData1
from DataTable
where Date <= t1.Date
order by Date desc
limit 0,20) t2)
from DataTable t1;
ERROR 1054 (42S22): Unknown column 't1.Date' in 'where clause'
From searching around I get the impression that you can't use correlated subqueries as part of a FROM clause, which I think is where the problem is here. The other issue is that I'm not sure if MySQL will accept a view definition containing a subquery in the FROM clause. Is there a way to limit the data in my aggregate selection without resorting to subqueries, in order to work around these two issues?
No, you can't use correlated subqueries in the FROM clause. But you can use them in the ON conditions:
SELECT AVG(d.DailyData1) Data1_20DayAvg
--- other aggregate stuff on d (Datatable)
FROM
( SELECT '2012-01-23' AS DateChecked
) AS dd
JOIN
DataTable AS d
ON
d.Date <= dd.DateChecked
AND
d.Date >= COALESCE(
( SELECT Date
FROM DataTable AS last20
WHERE Date <= dd.DateChecked
AND (other conditions for last20)
ORDER BY Date DESC
LIMIT 1 OFFSET 19
), '1001-01-01' )
WHERE (other conditions for d Datatable)
Similar, for many dates:
SELECT dd.DateChecked
, AVG(d.DailyData1) Data1_20DayAvg
--- other aggregate stuff on d (Datatable)
FROM
( SELECT DISTINCT Date AS DateChecked
FROM DataTable
) AS dd
JOIN
DataTable AS d
ON
d.Date <= dd.DateChecked
AND
d.Date >= COALESCE(
( SELECT Date
FROM DataTable AS last20
WHERE Date <= dd.DateChecked
AND (other conditions for last20)
ORDER BY Date DESC
LIMIT 1 OFFSET 19
), '1001-01-01' )
WHERE (other conditions for d Datatable)
GROUP BY
dd.DateChecked
Both queries assume that Datatable.Date has a UNIQUE constraint.
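To see what the boundary subquery is doing, it can be run on its own for a single date (a sketch using the sample date from the question):
-- returns the Date of the 20th most recent row on or before the checked date,
-- i.e. the lower bound that the ON clause above compares d.Date against
SELECT Date
FROM DataTable
WHERE Date <= '2012-01-23'
ORDER BY Date DESC
LIMIT 1 OFFSET 19;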