MySQL - Group By Latest and Join First Instance

I've tried a few things but I've ended up confusing myself.
What I am trying to do is find the most recent records from a table and left join the first after a certain date.
An example might be
id | acct_no | created_at | some_other_column
1 | A0001 | 2017-05-21 00:00:00 | x
2 | A0001 | 2017-05-22 00:00:00 | y
3 | A0001 | 2017-05-22 00:00:00 | z
Ideally, I'd like to find the latest record for each acct_no (sorted by created_at DESC), so that the results contain one row per unique account number. For the records above that would be record 3, but there will of course be many different account numbers with records on different days.
Then, what I am trying to achieve is to join on the same table and find the first record with the same account number after a certain date.
For example, record 1 would be returned for a query joining on acct_no A0001 on or after 2017-05-21 00:00:00, because it is the first result on or after that date. So these are sorted by created_at ASC with created_at >= "2017-05-21 00:00:00" (and possibly AND id != latest.id).
It seems quite straightforward but I just can't get it to work.
I only have my most recent attempt after discarding multiple different queries.
Here I am trying to solve the first part which is to select the most recent of each account number:
SELECT latest.* FROM my_table latest
JOIN (SELECT acct_no, MAX(created_at) FROM my_table GROUP
BY acct_no) latest2
ON latest.acct_no = latest2.acct_no
but that still returns all rows rather than the most recent of each.
I did have something using a join on a subquery, but it took so long to run that I quit it before it finished, even though I have indexes on acct_no and created_at. I've also run into other problems where columns in the select are not in the group by. I know this check can be turned off, but I'm trying to find a way to perform the query that doesn't require that.

Just try a little edit to your initial query:
SELECT latest.* FROM my_table latest
join (SELECT acct_no, MAX(created_at) as max_time FROM my_table GROUP
BY acct_no) latest2
ON latest.acct_no = latest2.acct_no AND latest.created_at = latest2.max_time
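Both halves of the question (latest row per account, then the first other row on or after a cutoff) can be sketched together. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server, and it assumes the same SQL carries over. The rows for A0001 come from the question; account B0002 is made up for illustration. Breaking ties on created_at by the highest id is also an assumption about what "latest" should mean.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (
  id INTEGER PRIMARY KEY, acct_no TEXT, created_at TEXT, some_other_column TEXT);
INSERT INTO my_table VALUES
  (1, 'A0001', '2017-05-21 00:00:00', 'x'),
  (2, 'A0001', '2017-05-22 00:00:00', 'y'),
  (3, 'A0001', '2017-05-22 00:00:00', 'z'),
  (4, 'B0002', '2017-05-20 00:00:00', 'p'),  -- hypothetical account
  (5, 'B0002', '2017-05-23 00:00:00', 'q');  -- hypothetical account
""")

query = """
SELECT latest.id AS latest_id, first_after.id AS first_id
FROM my_table latest
-- first row of the same account on/after the cutoff, excluding the latest row
LEFT JOIN my_table first_after
  ON first_after.id = (
       SELECT f.id FROM my_table f
       WHERE f.acct_no = latest.acct_no
         AND f.created_at >= :cutoff
         AND f.id <> latest.id
       ORDER BY f.created_at ASC, f.id ASC
       LIMIT 1)
-- keep only the latest row of each account (ties broken by highest id)
WHERE latest.id = (
       SELECT l.id FROM my_table l
       WHERE l.acct_no = latest.acct_no
       ORDER BY l.created_at DESC, l.id DESC
       LIMIT 1)
ORDER BY latest.acct_no
"""
rows = conn.execute(query, {"cutoff": "2017-05-21 00:00:00"}).fetchall()
print(rows)
```

With these rows, A0001 pairs record 3 with record 1, and the made-up B0002 pairs record 5 with NULL because it has no other record on or after the cutoff. An index on (acct_no, created_at, id) would probably help both subqueries on a real table, but that is untested here.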

Here is a different approach. I'm not sure about the performance impact, but I'm hoping that avoiding the self join and GROUP BY will perform better.
SELECT * FROM (
SELECT mytable1.*, IF(@temp <> acct_no, 1, 0) selector, @temp := acct_no FROM `mytable1`
JOIN (SELECT @temp := '') a
ORDER BY acct_no, created_at DESC , id DESC
) b WHERE selector = 1

You need to get the id of the row where created_at is at its maximum for each acct_no:
SELECT latest.* FROM my_table latest
JOIN (SELECT MAX(id) AS id FROM my_table t
WHERE created_at = (SELECT MAX(created_at) FROM my_table WHERE acct_no = t.acct_no)
GROUP BY acct_no) latest2
ON latest.id = latest2.id

Related

mysql distinct does not filter unique ids

Using MySQL, I'm attempting to display a list of devices (serialno) that have not appeared for a specific time (last_seen), and then display only unique devices with the max(last_seen). The last_seen column is an int value that increments (think minutes) while the device has not been seen. Imagine a table that has a row with serialno "L123" and last_seen "1", then after another minute a row with serialno "L123" and last_seen "2", and so forth. Using max(last_seen), the results should display the highest number, i.e. the last time the device was seen.
This works so far, but I'm noticing that a device such as serialno L123 is displayed twice. How can I filter the results to display only the highest last_seen? I've tried two scenarios using distinct, but neither of them seems to work.
As an example of what I get (not working):
email | serialno | Last seen (min)
abc@example.com | L123 | 30
abc1@example.com | K900 | 20
abc2@example.com | L123 | 1 <--yes the email is different but same serialno
As an example of what I want to see:
email | serialno | Last seen (min)
abc@example.com | L123 | 30
abc1@example.com | K900 | 20
Scenario 1: select distinct in a where sub-query
SELECT
email,
serialno,
max(last_seen)
FROM
my_table
WHERE
last_seen IN (SELECT last_seen FROM my_table WHERE last_seen > 0)
AND
serialno IN (SELECT distinct serialno FROM my_table)
GROUP BY
2,1
ORDER BY
3 DESC
Scenario 2: using having, after group by
SELECT
email,
serialno,
max(last_seen)
FROM
my_table
WHERE
last_seen IN (SELECT last_seen FROM my_table WHERE last_seen > 0)
GROUP BY
2,1
HAVING
serialno in (SELECT distinct serialno FROM my_table)
ORDER BY
3 DESC
JOIN with a sub-query which is used to find each serialno's max last_seen value:
select t1.*
from my_table t1
join (select serialno, max(Last_seen) Last_seen
from my_table
group by serialno) t2
on t1.serialno = t2.serialno and t1.Last_seen = t2.Last_seen
order by t1.Last_seen desc
DISTINCT works on the values you actually want to output. If you use it in a subselect, you must also filter for the unique values in that subselect.
In the first query, though, group only by serialno. You just want grouped results for your serialno; you don't need an IN clause or HAVING.

MySQL timestamp differences between two rows large table

I have a Transactions table with over 2,500,000 rows and three columns (that are relevant): id, company_id, and created_at. id identifies the transaction, company_id identifies which company received it, created_at is a timestamp with the time that transaction was performed.
What I want is to get a list of the differences between every consecutive pair of transactions of a given company. In other words, if my table goes:
id | company_id | created_at
------------------------------
01 | ab | 2016/01/02
02 | ab | 2016/01/03
03 | cd | 2016/01/03
04 | ab | 2016/01/03
05 | cd | 2016/01/04
06 | ab | 2016/01/05
(Note that there may be an arbitrary number of transactions of other companies between two consecutive transactions of a given company.)
Then I want the output to be:
diff | company_id
-------------------
01 | ab
00 | ab
01 | cd
02 | ab
(I wrote the created_at and diff values in days, but that's just for ease of visualisation.)
I tried using this but it was too slow.
--EDIT:
"This" is:
SELECT (B.created_at - A.created_at) AS diff, A.company_id
FROM Transactions A CROSS JOIN Transactions B
WHERE B.id IN (SELECT MIN (C.id) FROM Transactions C WHERE C.id > A.id AND C.company_id = A.company_id)
ORDER BY A.id ASC
To get a result like the one it looks like you're expecting, I will sometimes make use of MySQL user-defined variables, and have MySQL perform the processing of the rows "in order", so I can compare the current row to values from the previous row.
For this to run efficiently, we'll want an appropriate index, to avoid an expensive "Using filesort" operation. (We're going to need the rows in company_id order, then by id order, so those will be the first two columns in the index. While we're at it, we might just as well include the created_at column and make it a covering index.)
... ON Transactions (company_id, id, created_at)
Then we can try a query like this:
SELECT t.diff
, t.company_id
FROM (
SELECT IF(r.company_id = @pv_company_id, r.created_at - @pv_created_at, NULL) AS diff
, IF(r.company_id = @pv_company_id, 1, 0) AS include_
, @pv_company_id := r.company_id AS company_id
, @pv_created_at := r.created_at AS created_at
FROM (SELECT @pv_company_id := NULL, @pv_created_at := NULL) i
CROSS
JOIN Transactions r
ORDER
BY r.company_id
, r.id
) t
WHERE t.include_
The MySQL Reference Manual explicitly warns against using user-defined variables like this within a statement. But the behavior we observe in MySQL 5.1 and 5.5 is consistent. (The big problem is that some future version of MySQL could use a different execution plan.)
The inline view aliased as i is just to initialize a couple of user-defined variables. We could just as easily do that as a separate step, before we run our query. But I like to include the initialization right in the statement itself, so I don't need a separate SELECT/SET statement.
MySQL accesses the Transactions table, and processes the ORDER BY first, ordering the rows from Transactions in (company_id,id) order. (We prefer to have this done via an index, rather than via an expensive "Using filesort" operation, which is why we want that index defined, with company_id and id as the leading columns.)
The "trick" is saving the values from the current row into user-defined variables. When processing the next row, the values from the previous row are available in the user-defined variables, for performing comparisons (is the current row for the same company_id as the previous row?) and for performing a calculation (the difference between the created_at values of the two rows).
Based on the usage of the subtraction operation, I'm assuming that the created_at column is integer/numeric. That is, I'm assuming that created_at is not a DATE, DATETIME, or TIMESTAMP datatype, because we don't use the subtraction operation to find a difference between values of those types.
SELECT a
, b
, a - b AS `subtraction`
, DATEDIFF(a,b) AS `datediff`
, TIMESTAMPDIFF(DAY,b,a) AS `tsdiff`
FROM ( SELECT DATE('2015-02-17') AS a
, DATE('2015-01-16') AS b
) t
returns:
a b subtraction datediff tsdiff
---------- ---------- ----------- -------- ------
2015-02-17 2015-01-16 101 32 32
(The subtraction operation doesn't throw an error. But what it returns may be unexpected. In this example, it returns the difference between two integer values 20150217 and 20150116, which is not the number of days between the two DATE expressions.)
EDIT
I notice that the original query includes an ORDER BY. If you need the rows returned in a specific order, you can include that column in the inline view query, and use an ORDER BY on the outer query.
SELECT t.diff
, t.company_id
FROM (
SELECT IF(r.company_id = @pv_company_id, r.created_at - @pv_created_at, NULL) AS diff
, IF(r.company_id = @pv_company_id, 1, 0) AS include_
, @pv_company_id := r.company_id AS company_id
, @pv_created_at := r.created_at AS created_at
, r.id AS id
FROM (SELECT @pv_company_id := NULL, @pv_created_at := NULL) i
CROSS
JOIN Transactions r
ORDER
BY r.company_id
, r.id
) t
WHERE t.include_
ORDER BY t.id
Sorry, there's no getting around a "Using filesort" for the ORDER BY on the outer query.
You could use a cursor. With an open cursor you step through the rows one at a time and, for each pair of consecutive rows fetched, compute the difference. I think this method is more efficient because it scans the rows of the table once instead of joining two and a half million rows against themselves.
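Try this one too. The ORDER BY id makes sure the subquery picks the next transaction rather than an arbitrary later one.
SELECT company_id,
(SELECT DATEDIFF(created_at,TR.created_at)
FROM transactions
WHERE id > TR.id AND company_id = TR.company_id
ORDER BY id LIMIT 0,1) AS diff
FROM transactions AS TR
HAVING diff is not null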
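The "next row for the same company" idea can be checked against the sample data from the question. The following is a minimal sketch, not a tuned solution: it runs on Python's built-in sqlite3 as a stand-in for MySQL, and it uses SQLite's julianday() where MySQL would use DATEDIFF or TIMESTAMPDIFF. The aliases cur and nxt are names chosen for this illustration.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Transactions (id INTEGER PRIMARY KEY, company_id TEXT, created_at TEXT);
INSERT INTO Transactions VALUES
  (1, 'ab', '2016-01-02'),
  (2, 'ab', '2016-01-03'),
  (3, 'cd', '2016-01-03'),
  (4, 'ab', '2016-01-03'),
  (5, 'cd', '2016-01-04'),
  (6, 'ab', '2016-01-05');
""")

query = """
SELECT CAST(julianday(nxt.created_at) - julianday(cur.created_at) AS INTEGER) AS diff,
       cur.company_id
FROM Transactions cur
-- join each row to the row with the smallest higher id of the same company
JOIN Transactions nxt
  ON nxt.id = (SELECT MIN(n.id) FROM Transactions n
               WHERE n.company_id = cur.company_id AND n.id > cur.id)
ORDER BY cur.id
"""
diffs = conn.execute(query).fetchall()
print(diffs)
```

This yields the differences 1, 0, 1, 2 for companies ab, ab, cd, ab, matching the output requested in the question. The last transaction of each company has no successor, so the inner join drops it.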
Try this
SELECT
t1.company_id,
t2.created_at - t1.created_at as diff
FROM Transactions t1
LEFT JOIN Transactions t2
on t2.created_at > t1.created_at
and t2.company_id = t1.company_id

SQL - Average number of related records with group_by

I have a table of records (let's call them TV shows) with an air_date field.
I have another table of advertisements that are related by a show_id field.
I am trying to get the average number of advertisements per show for each date (with a where clause specifying the shows).
I currently have this:
SELECT
`air_date`,
(SELECT COUNT(*) FROM `commercial` WHERE `show_id` = `show`.`id`) AS `num_commercials`
FROM `show`
WHERE ...
This gives me a result like so:
air_date | num_commercials
2015-6-30 | 6
2015-6-30 | 3
2015-6-30 | 8
2015-6-30 | 2
2015-6-31 | 9
2015-6-31 | 4
When I do a GROUP BY, it only gives me one of the records, but I want the average for each air_date.
Not too sure I am clear on what you want, but does this do it?
SELECT `air_date`,
AVG((SELECT COUNT(*) FROM `commercial` WHERE `show_id` = `show`.`id`)) AS `num_commercials`
FROM `show`
WHERE .....
GROUP BY `air_date`
(Note that the double parentheses for the AVG function are required.)
You can use a sub-query to select count of commercials by air_date/show, then use an outer query to select the average commercials count per air_date.
Something like this should work:
select air_date, avg(num_commercials)
from
(
select `show`.air_date as air_date,
`show`.id as show_id,
count(*) as num_commercials
from `show`
inner join commercial on commercial.show_id = `show`.id
where ...
group by `show`.air_date, `show`.id
) sub
group by air_date
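The two-level grouping (count per show, then average per air_date) can be tried on a tiny dataset. This is a minimal sketch under a few assumptions: it runs on Python's built-in sqlite3 rather than MySQL, the three shows and five commercials are made up, and it swaps the inner join for a LEFT JOIN with COUNT(c.id) so that a show with no commercials counts as 0 instead of disappearing from the average. Whether such shows should count is a choice the question does not settle.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE `show` (id INTEGER PRIMARY KEY, air_date TEXT);
CREATE TABLE commercial (id INTEGER PRIMARY KEY, show_id INTEGER);
-- hypothetical data: show 2 has no commercials at all
INSERT INTO `show` VALUES (1, '2015-06-29'), (2, '2015-06-29'), (3, '2015-06-30');
INSERT INTO commercial (show_id) VALUES (1), (1), (1), (3), (3);
""")

query = """
SELECT air_date, AVG(num_commercials) AS avg_commercials
FROM (
  -- inner level: one row per show with its commercial count (0 if none)
  SELECT s.air_date, s.id, COUNT(c.id) AS num_commercials
  FROM `show` s
  LEFT JOIN commercial c ON c.show_id = s.id
  GROUP BY s.air_date, s.id
) per_show
GROUP BY air_date
ORDER BY air_date
"""
averages = conn.execute(query).fetchall()
print(averages)
```

Here 2015-06-29 averages 1.5 (shows with 3 and 0 commercials) and 2015-06-30 averages 2.0. With an inner join, show 2 would be dropped and the first date would average 3.0 instead.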

SELECT newest rows, ignore old duplicates

In my table, I have the following columns:
CRMID | user | ticket_id | description | date | hour
What I am trying to do is select all the rows from the table, but when two (or more) rows have the same ticket_id, I want only the newest one to appear in the results, i.e. the row with the newest date and hour.
The problem is that I need to handle two cases: if the values in the date column are the same, I have to compare the hour column; otherwise it's simple, because I only compare the date column.
SELECT
n.*
FROM
table n RIGHT JOIN (
SELECT
MAX(date) AS max_date,
(SELECT MAX(hour) AS hour WHERE date = max_date) AS hour,
user,
ticket_id
FROM
table
GROUP BY
user,
ticket_id
) m ON n.user = m.user AND n.ticket_id = m.ticket_id
You may want to combine your date and hour columns, then perform the comparison:
SELECT foo.*
FROM foo
JOIN (SELECT ticket_id, MAX(ADDTIME(`date`,`hour`)) as mostrecent
FROM foo
GROUP BY ticket_id) AS bar
ON bar.ticket_id = foo.ticket_id
and bar.mostrecent = ADDTIME(foo.`date`,foo.`hour`);

How can I write a query that aggregate a single row with latest date among multiple set of rows?

I have a MySQL table where there are many rows for each person, and I want to write a query which aggregates rows with special constraint. (one per person)
For example, let's say the table consists of the following data.
name date reason
---------------------------------------
John 2013-04-01 14:00:00 Vacation
John 2013-03-31 18:00:00 Sick
Ted 2012-05-06 20:00:00 Sick
Ted 2012-02-20 01:00:00 Vacation
John 2011-12-21 00:00:00 Sick
Bob 2011-04-02 20:00:00 Sick
I want to see the distribution of 'reason' column. If I just write a query like below
select reason, count(*) as count from table group by reason
then I will be able to see number of reasons for this table overall.
reason count
------------------
Sick 4
Vacation 2
However, I am only interested in single reason from each person. The reason that should be counted should be from a row with latest date from the person's records. For example, John's latest reason would be Vacation while Ted's latest reason would be Sick. And Bob's latest reason (and the only reason) is Sick.
The expected result for that query should be like below. (Sum of count will be 3 because there are only 3 people)
reason count
-----------------
Sick 2
Vacation 1
Is it possible to write a query such that single latest reason will be counted when I want to see distribution(count) of reasons?
Here are some facts about the table.
The table has tens of millions of rows
Most of the time, each person has one reason.
Some people have multiple reasons, but 99.99% of people have fewer than 5 reasons.
There are about 30 different reasons while there are millions of distinct names.
The table is partitioned based on date range.
SELECT T.REASON, COUNT(*)
FROM
(
SELECT PERSON, MAX(DATE) AS MAX_DATE
FROM TABLE-NAME
GROUP BY PERSON
) A, TABLE-NAME T
WHERE T.PERSON = A.PERSON AND T.DATE = A.MAX_DATE
GROUP BY T.REASON
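Try this (matching on name as well as date, so that one person's row cannot match another person's latest date):
select reason, count(*) from
(select reason from table where (name, date) in
(select name, max(date) from table group by name)) t
group by reason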
In MySQL before 8.0, this kind of query is not very efficient, since you don't have access to tools like the partitioning (window) functions in SQL Server or Oracle.
You can still emulate it with a subquery that retrieves the rows matching the condition you need, here the maximum date:
SELECT t.reason, COUNT(1)
FROM
(
SELECT name, MAX(adate) AS maxDate
FROM #aTable
GROUP BY name
) maxDateRows
INNER JOIN #aTable t ON maxDateRows.name = t.name
AND maxDateRows.maxDate = t.adate
GROUP BY t.reason
Test this query on your samples, but I'm afraid that it will be slow as hell.
For your information, you can do the same thing in a more elegant and much much faster way in SQL Server :
SELECT reason, COUNT(1)
FROM
(
SELECT name
, reason
, RANK() OVER(PARTITION BY name ORDER BY adate DESC) as Rank
FROM #aTable
) AS rankTable
WHERE Rank = 1
GROUP BY reason
If you are really stuck with MySQL, and the first query is too slow, then you can split the problem.
Do a first query creating a table:
CREATE TABLE maxDateRows AS
SELECT name, MAX(adate) AS maxDate
FROM #aTable
GROUP BY name
Then create index on both name and maxDate.
Finally, get the results :
SELECT t.reason, COUNT(1)
FROM maxDateRows m
INNER JOIN #aTable t ON m.name = t.name
AND m.maxDate = t.adate
GROUP BY t.reason
The solution you are looking for seems to be solved by this query :
select
reason,
count(*)
from (select * from tablename group by name) abc
group by
reason
It is quite fast and simple.
Apologies if this answer duplicates an existing one. Maybe I'm suffering from some form of aphasia, but I cannot see it...
SELECT x.reason
, COUNT(*)
FROM absentism x
JOIN
( SELECT name,MAX(date) max_date FROM absentism GROUP BY name) y
ON y.name = x.name
AND y.max_date = x.date
GROUP
BY reason;
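The join-on-MAX(date) approach can be checked against the sample rows from the question. This is a minimal sketch that runs on Python's built-in sqlite3 as a stand-in for MySQL; the table name absences is made up for the illustration, and nothing here says anything about performance on tens of millions of rows.

```python
import sqlite3

# In-memory SQLite database standing in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE absences (name TEXT, date TEXT, reason TEXT);  -- hypothetical table name
INSERT INTO absences VALUES
  ('John', '2013-04-01 14:00:00', 'Vacation'),
  ('John', '2013-03-31 18:00:00', 'Sick'),
  ('Ted',  '2012-05-06 20:00:00', 'Sick'),
  ('Ted',  '2012-02-20 01:00:00', 'Vacation'),
  ('John', '2011-12-21 00:00:00', 'Sick'),
  ('Bob',  '2011-04-02 20:00:00', 'Sick');
""")

query = """
SELECT x.reason, COUNT(*) AS cnt
FROM absences x
-- y holds one row per person with that person's latest date
JOIN (SELECT name, MAX(date) AS max_date FROM absences GROUP BY name) y
  ON y.name = x.name AND y.max_date = x.date
GROUP BY x.reason
ORDER BY x.reason
"""
counts = conn.execute(query).fetchall()
print(counts)
```

The result is Sick 2 and Vacation 1, as expected in the question. If a person had two rows sharing the same latest date, both would be counted, so a tie-breaker would be needed in that case.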