MySQL GROUP BY aggregation, sort and limit [duplicate]

I am trying to figure out a seemingly trivial SQL query.
For all users in the table I want to find the time and data for the row with the highest time (latest event).
The following almost solves it:
SELECT user, MAX(time) as time FROM tasks GROUP BY user;
The problem is of course that the data column cannot be reduced. I think therefore I should use a WHERE or ORDER BY + LIMIT construction. But I am too far out of my domain here to know how this should be done properly. Any hints?
Note. It is not possible to use GROUP BY in this instance because I want to select on the table row ID, which cannot be aggregated, obviously.
-- MYSQL
DROP DATABASE IF EXISTS test;
CREATE DATABASE test;
USE test;
CREATE TABLE tasks (
id int AUTO_INCREMENT,
user varchar(100) NOT NULL,
time date NOT NULL,
data varchar(100) NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO tasks (user, time, data) VALUES
("Kalle", "1970-01-01", "old news"),
("Kalle", "2020-01-01", "latest shit"),
("Pelle", "1970-01-01", "regular data");
-- Expected output
-- +----+-------+------------+--------------+
-- | id | user  | time       | data         |
-- +----+-------+------------+--------------+
-- |  2 | Kalle | 2020-01-01 | latest shit  |
-- |  3 | Pelle | 1970-01-01 | regular data |
-- +----+-------+------------+--------------+
-- 2 rows in set (0.00 sec)

You can filter with a subquery:
select t.*
from tasks t
where t.time = (select max(t1.time) from tasks t1 where t1.user = t.user)
This query would take advantage of a multi-column index on (user, time).
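To make the index remark concrete, here is a minimal sketch. The index name idx_tasks_user_time is made up for illustration; only the column pair (user, time) comes from the answer. EXPLAIN is included so you can check on your own data whether the optimizer actually picks the index up.

```sql
-- Hypothetical index name; (user, time) is the column pair named above.
CREATE INDEX idx_tasks_user_time ON tasks (user, time);

-- Inspect the plan of the correlated-subquery version.
-- The exact plan depends on the MySQL version and the data, so no output is shown.
EXPLAIN
SELECT t.*
FROM tasks t
WHERE t.time = (SELECT MAX(t1.time) FROM tasks t1 WHERE t1.user = t.user);
```

Note that if two rows of the same user share the maximum time, this query returns both of them.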
In MySQL 8.0, you can also solve this top-1-per-group with window functions:
select *
from (select t.*, row_number() over(partition by user order by time desc) rn from tasks t) t
where rn = 1
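One difference between the two queries is worth spelling out: row_number() always keeps exactly one row per user, but when two rows tie on time it is not defined which one wins. A possible sketch that makes the choice deterministic, assuming (this is my assumption, not part of the answer) that the higher id should win a tie:

```sql
-- Sketch only: using id as a tie-breaker is an assumption.
select id, user, time, data
from (
    select t.*,
           row_number() over (partition by user
                              order by time desc, id desc) as rn
    from tasks t
) ranked
where rn = 1;
```

With the sample data from the question there are no ties, so this should return the same two rows (id 2 and id 3) as the queries above.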

Related

Best practice in MySQL for selecting two interchangeable columns and counting them, returning the most recent result

I have a MySQL table that looks like:
CREATE TABLE `messages` (
`id` int NOT NULL AUTO_INCREMENT,
`from` varchar(12) NOT NULL,
`to` varchar(12) NOT NULL,
`message` text,
`timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=66 DEFAULT CHARSET=latin1;
So each time a message is sent or received, it is stored as:
# id, from, to, message, timestamp
'65', '+1231303****', '+1833935****', 'Showtimes', '2022-01-26 09:26:10'
'64', '+1833935****', '+1231303****', 'Showtimes are: 12:30 someresponse', '2022-01-26 09:26:10'
I want to create an index of these conversation threads, and need a query that selects a conversation based on it being addressed either from or to a specific number, returns the number of rows that match either, and at the same time returns the last message that was sent. So basically I want it to return:
recipient (the other phone number, not the one I'm using to look up), count(messages), lastmessage
Individually, I can query all of this separately, since most of my experience here revolves around using PHP to untangle the data I'm going after. What I'm curious about is a single query that lets MySQL handle this, rather than submitting multiple queries to the database server. I figure this may be a good time to learn it, since several projects I've coded have run out of memory before with so many queries inside so many loops.
Apologies in advance if this has been answered somewhere else already. I searched extensively for an answer, but the few results I found used a completely different table structure than I am using, and the MySQL query I was able to fumble together didn't work. I stand by my work as a PHP programmer, but my MySQL needs some work. Hence I'm here!
If a conversation thread can be defined by a unique combination of from and to, you can build a compound key whose first part is the lower of the two numbers; all messages in a thread then share the same key. Note, however, that selecting on from OR to means many conversation threads may be selected. For example:
DROP TABLE IF EXISTS T;
CREATE TABLE T(ID INT AUTO_INCREMENT PRIMARY KEY, FROMNO INT, TONO INT);
INSERT INTO T(FROMNO,TONO) VALUES
(1,2),(2,1),
(1,3),(4,1),(1,2);
WITH CTE AS
(SELECT * ,
CASE WHEN FROMNO < TONO THEN CONCAT(FROMNO,TONO)
ELSE CONCAT(TONO,FROMNO)
END AS CVAL
FROM T
WHERE FROMNO = 1 OR TONO = 1
),
CTE1 AS
(SELECT *,
DENSE_RANK() OVER (ORDER BY CVAL) DR
FROM CTE
),
CTE2 AS
(SELECT CVAL,COUNT(*) conversations,MAX(ID) MAXID
FROM CTE1
GROUP BY CVAL
)
SELECT CTE2.CVAL,CTE2.conversations,CTE2.MAXID,T.ID
FROM CTE2
JOIN T ON T.ID = CTE2.MAXID;
Yields
+------+---------------+-------+----+
| CVAL | conversations | MAXID | ID |
+------+---------------+-------+----+
| 13   |             1 |     3 |  3 |
| 14   |             1 |     4 |  4 |
| 12   |             3 |     5 |  5 |
+------+---------------+-------+----+
3 rows in set (0.002 sec)
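Applied to the messages table from the question, the same idea might look like the sketch below. Two assumptions on my side: the literal '+1231303****' stands in for the number being looked up (masked exactly as in the question), and the highest id within a thread is taken to be the latest message, just as MAX(ID) is used above. The aliases recipient, message_count and last_id are made up for illustration.

```sql
-- Sketch only: one row per "other" phone number for a given lookup number.
SELECT s.recipient,
       s.message_count,
       m.message AS lastmessage
FROM (
    SELECT CASE WHEN `from` = '+1231303****' THEN `to` ELSE `from` END AS recipient,
           COUNT(*) AS message_count,
           MAX(id)  AS last_id          -- assumes higher id = later message
    FROM messages
    WHERE `from` = '+1231303****' OR `to` = '+1231303****'
    GROUP BY recipient
) s
JOIN messages m ON m.id = s.last_id;
```

This form avoids CTEs and window functions, so it should not need MySQL 8. If ids are not guaranteed to follow sending order, the join would have to pick the row by timestamp instead.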

Compare time between consecutive rows using MySQL 5.5

This is my table tusers on MySQL 5.5.1 database community version
mysql> SELECT * FROM `tusers`;
+------------+------------+----------+-----+
| tIDUSER    | tDate      | tHour    | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:54:42 |   1 |
| Controneri | 2022-01-06 | 07:43:38 |   2 |
| Controneri | 2022-01-06 | 07:13:09 |   3 |
| Controneri | 2022-01-06 | 06:31:52 |   4 |
| Controneri | 2022-01-06 | 06:13:12 |   5 |
+------------+------------+----------+-----+
5 rows in set (0.13 sec)
I need to select only these rows from the table tusers:
+------------+------------+----------+-----+
| tIDUSER    | tDate      | tHour    | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:43:38 |   2 |
| Controneri | 2022-01-06 | 06:13:12 |   5 |
+------------+------------+----------+-----+
The other rows are repeated accesses by the same user Controneri within one hour of the previous row.
Each user access to the web page is stored in the table tusers with its date and time.
But I have to extract only the first access and exclude the repeated accesses within a time span of one hour.
In this example the user Controneri logged in 5 times on January 6. But the valid accesses are those at 06:13:12 and 07:43:38, because after the access at 06:13:12 there were other accesses before 07:13:12, i.e. within one hour of 06:13:12 (06:31:52 and 07:13:09, rows 4 and 3).
I have tried without success.
My table structure and the SELECT query are below, and also on db-fiddle.com, which offers MySQL 5.
Any suggestions?
-- ----------------------------
-- Table structure for tusers
-- ----------------------------
DROP TABLE IF EXISTS `tusers`;
CREATE TABLE `tusers` (
`tIDUSER` varchar(255) NULL DEFAULT NULL,
`tDate` date NULL DEFAULT NULL,
`tHour` time NULL DEFAULT NULL,
`tID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`tID`) USING BTREE
) ENGINE = InnoDB;
-- ----------------------------
-- Records of tusers
-- ----------------------------
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:54:42', 1);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:43:38', 2);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:13:09', 3);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:31:52', 4);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:13:12', 5);
SELECT
a.tID,
a.tDate,
a.tHour,
a.tIDUSER,
TIMEDIFF( a.tHour, b.tHour ) AS tDif
FROM
`tusers` a
JOIN `tusers` b ON
a.tDate = b.tDate
AND a.tIDUSER = b.tIDUSER
AND a.tID > b.tID
WHERE
( TIMEDIFF( a.tHour, b.tHour ) BETWEEN '00:00:00' AND '01:00:00' )
ORDER BY
a.tIDUSER,
a.tDate,
a.tHour ASC;
For MySQL 5.5 you can achieve this by tracking the previous values in user variables -
SELECT tIDUSER, tDate, tHour, tID
FROM (
SELECT
tusers.*,
IF(@prev_date_time IS NULL OR @prev_user <> tIDUSER OR @prev_date_time + INTERVAL 1 HOUR < TIMESTAMP(tDate, tHour), @prev_date_time := TIMESTAMP(tDate, tHour), NULL) AS `show`,
@prev_user := tIDUSER
FROM tusers, (SELECT @prev_date_time := NULL, @prev_user := NULL) n
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC
) t
WHERE `show` IS NOT NULL
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC;
And here's a db<>fiddle. Thanks to sticky bit as I took the liberty of "borrowing" from their db<>fiddle.
The MySQL 5.6 manual states -
However, the order of evaluation for expressions involving user
variables is undefined.
And in later versions is extended to say -
The order of evaluation for expressions involving user variables is
undefined. For example, there is no guarantee that SELECT @a, @a:=@a+1
evaluates @a first and then performs the assignment.
The MySQL 5.7 manual also states -
It is also possible to assign a value to a user variable in statements
other than SET. (This functionality is deprecated in MySQL 8.0 and
subject to removal in a subsequent release.) When making an assignment
in this way, the assignment operator must be := and not = because the
latter is treated as the comparison operator = in statements other
than SET:
Despite the above warnings, this method has been widely used for many years. Your mileage may vary.
I suspect this will perform badly with larger result sets but give it a try.
As requested by the OP in the comments, here is a query using recursive CTEs which will be available with MySQL version 8 and higher.
WITH RECURSIVE
cte1
AS
(
SELECT tusers.tiduser,
tusers.tdate,
tusers.thour,
tusers.tid,
addtime(tusers.tdate, tusers.thour) AS sane_timestamp_representation,
row_number() OVER (PARTITION BY tusers.tiduser
ORDER BY addtime(tusers.tdate, tusers.thour) ASC) AS rn
FROM tusers
),
cte2
AS
(
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
0 AS n
FROM cte1
WHERE cte1.rn = 1
UNION ALL
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
cte2.n + 1 AS n
FROM cte2
INNER JOIN cte1
ON cte2.tiduser = cte1.tiduser
AND cte1.sane_timestamp_representation > adddate(cte2.sane_timestamp_representation, INTERVAL 1 HOUR)
),
cte3
AS
(
SELECT cte2.tiduser,
cte2.tdate,
cte2.thour,
cte2.tid,
cte2.sane_timestamp_representation,
row_number() OVER (PARTITION BY cte2.tiduser,
cte2.n
ORDER BY cte2.sane_timestamp_representation ASC) rn
FROM cte2
)
SELECT cte3.tiduser,
cte3.tdate,
cte3.thour,
cte3.tid
FROM cte3
WHERE cte3.rn = 1
ORDER BY cte3.tiduser ASC,
cte3.sane_timestamp_representation ASC;
1. In cte1 we first and foremost unite the day and hour parts of the timestamp (not the brightest idea to save them as two different columns; it'll become a mess when day boundaries have to be crossed). We also assign a row_number() rn according to the timestamp in ascending order per user. cte1 acts as our "base table" from now on.
2. In cte2 the recursion happens. As anchor we take all the rows from cte1 where cte1.rn = 1. These are the records with the minimum timestamp for each user. We also add a number n. For those initial anchor rows we set n to 0. n will act as an indicator of which rows cannot cover each other. All rows with an n + x for x > 1 cannot be covered by any row with n (per user).
In the recursive step we join all records from cte1 that are more than an hour later, per user. Since these cannot be covered by the records already in the result set (per user), because they're more than an hour later, we assign n + 1 as n to them.
3. cte3 adds another row_number() rn ordering the records by the timestamp ascending per user and n. Those with an rn of 1 aren't covered by any previous record for the user, because all others with equal or greater n have greater timestamps, and those with lesser n don't cover them by the way we constructed n. So we can select these records from cte3 where rn = 1 and get our final result.
One big fat warning though:
The intermediate result sets will grow rapidly! You can try to select from cte3 without a WHERE clause and see for yourself. So while this shows it can be done theoretically, it might not be practical, even for medium sets of data. The needed resources can quickly exceed maximums.
(And well, since AFAIK SQL with recursive CTEs is Turing complete and the problem seems well computable, it was clear that it can be done anyway. But it still was interesting to see an example how it can be done, I think.)
Maybe it can be optimized. The key, I believe, is to limit the joined rows in the recursive step. We actually only need to join the oldest record past an hour, that would be the next record of interest. That would also make cte3 and the WHERE in the final SELECT unnecessary (unless for projection to get rid of the helper columns). But I didn't find a way to do so. LIMIT as well as window functions aren't allowed or implemented for recursive CTEs, at least in the recursive step. But if somebody comes up with such an optimization, I'd love to see it!
Oh, and the stupid timestamp representation in two columns, which needs to be put together at first, will also render the use of indexes on the timestamps impossible. So that's another factor limiting performance here.

SQL query to retrieve data from table

How to retrieve odd rows from the table?
In the Base table always Cr_id is duplicated 2 times.
Base table
I want a SELECT statement that retrieves only those c_id =1 where Cr_id is always first as shown in the output table.
Output table
Just see the base table and output table you should automatically know what I want, Thanx.
Just testing for the min date should be enough:
drop table if exists t;
create table t(c_id int,cr_id int,dt date);
insert into t values
(1,56,'2020-12-17'),(56,56,'2020-12-17'),
(1,8,'2020-12-17'),(56,8,'2020-12-17'),
(123,78,'2020-12-17'),(1,78,'2020-12-18');
select c_id,cr_id,dt
from t
where c_id = 1 and
dt = (select min(dt) from t t1 where t1.cr_id = t.cr_id);
+------+-------+------------+
| c_id | cr_id | dt         |
+------+-------+------------+
|    1 |    56 | 2020-12-17 |
|    1 |     8 | 2020-12-17 |
+------+-------+------------+
2 rows in set (0.002 sec)
What you're looking for could be "partition by", at least if you're working on MSSQL.
(In the future, please include more background; SQL is not just SQL.)
https://codingsight.com/grouping-data-using-the-over-and-partition-by-functions/
I have an old query lying around that can put a sorting index on data that lacks one, although the underlying reason is 99.9% sure to be bad data design.
Typically I use this query to remove bad data, but you may rewrite it as a join instead, so that you can identify the data you need.
The reason I'm not putting a full answer here is to point out that bad data design results in more work when reading the data afterwards, which seems to be the real root cause here.
DELETE t
FROM
(
SELECT ROW_NUMBER () OVER (PARTITION BY column_1 ,column_2, column_3 ORDER BY column_1,column_2 ,column_3 ) AS Seq
FROM Table
)t
WHERE Seq > 1

Remove Purge duplicate/multiplicate records from mariadb

Briefly: database imported from foreign source, so I cannot prevent duplicates, I can only prune and clean the database.
Foreign db changes daily, so, I want to automate the pruning process.
It resides on:
MariaDB v10.4.6 managed predominantly by phpMyadmin GUI v4.9.0.1 (both pretty much up to date as of this writing).
This is a radio browsing database.
It has multiple columns, but for me there are only few important:
StationID (it is unique entry number, thus db does not consider new entries as duplicates, all of them are unique because of this primary key)
There are no row numbers.
Name, url, home-page, country, etc
I want to remove entries with duplicated urls, based on the following:
a duplicated url may have a country attached to it, but some country values are NULL (= empty),
so I want to remove all duplicates except one that contains a country name, if there is such an entry; if there is none, keep just one entry for the url, regardless of name (names are multilingual, so some duplicated urls also have various names, which I do not care about).
StationID (unique number, but not consecutive, also this is primary db key)
Name (variable, least important)
url (variable, but I do want to remove the duplicates)
country (variable, frequently NULL/empty, I want to eliminate those with empty entries as much as possible, if possible)
One url has to stay by any means (not to be deleted)
I have tried multitude of queries, some work for SELECT, but do NOT for DELETE, some hang my machine when executed. Here are some queries I tried (remember I use MariaDB, not oracle, or ms-sql)
SELECT * from `radio`.`Station`
WHERE (`radio`.`Station`.`Url`, `radio`.`Station`.`Name`) IN (
SELECT `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
HAVING COUNT(*) > 1)
This one should show all entries (not only one grouped), but this query hangs my machine
This query gets me as close as possible:
SELECT *
FROM `radio`.`Station`
WHERE `radio`.`Station`.`StationID` NOT IN (
SELECT MAX(`radio`.`Station`.`StationID`)
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`,`radio`.`Station`.`Name`,`radio`.`Station`.`Country`)
However this query lists more entries:
SELECT *, COUNT(`radio`.`Station`.`Url`) FROM `radio`.`Station` GROUP BY `radio`.`Station`.`Name`,`radio`.`Station`.`Url` HAVING (COUNT(`radio`.`Station`.`Url`) > 1);
But all of these queries group them and display only one row.
I also tried UNION, INNER JOIN, but failed.
WITH cte AS..., but phpMyadmin does NOT like this query, and mariadb cli also did not like it.
I also tried something of this kind, published at oracle blog, which did not work, and I really had no clue what was what in this function:
select *
from (
select f.*,
count(*) over (
partition by `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
) ct
from `radio`.`Station` f
)
where ct > 1
I did not know what f.* was, query did not like ct.
Given
drop table if exists radio;
create table radio
(stationid int,name varchar(3),country varchar(3),url varchar(3));
insert into radio values
(1,'aaa','uk','a/b'),
(2,'bbb','can','a/b'),
(3,'bbb',null,'a/b'),
(4,'bbb',null,'b/b'),
(5,'bbb',null,'b/b');
You could give the null countries a unique value (using coalesce), fortunately stationid is unique so:
select t.stationid,t.name,t.country,t.url
from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry= coalesce(t.country,t.stationid);
Yields
+-----------+------+---------+------+
| stationid | name | country | url  |
+-----------+------+---------+------+
|         1 | aaa  | uk      | a/b  |
|         5 | bbb  | NULL    | b/b  |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
Translated to a delete:
delete t from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry <> coalesce(t.country,t.stationid);
MariaDB [sandbox]> select * from radio;
+-----------+------+---------+------+
| stationid | name | country | url  |
+-----------+------+---------+------+
|         1 | aaa  | uk      | a/b  |
|         5 | bbb  | NULL    | b/b  |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
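Before running the delete on real data, it may be worth previewing which rows it would remove. A small sketch: this is the same join as the delete above, just turned into a select, so nothing is modified.

```sql
-- Preview only: lists the rows the delete above would remove.
select t.stationid, t.name, t.country, t.url
from radio t
join
(select url, max(coalesce(country, stationid)) cntry from radio group by url) s
on s.url = t.url and s.cntry <> coalesce(t.country, t.stationid);
```

Run against the five sample rows (before the delete), this should list stationid 2, 3 and 4, i.e. exactly the rows missing from the result above.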
This fixes two problems at once:
Duplicate rows already in the table
Duplicate rows that can still be put into the table
Do this for each table:
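-- `real` and `new` are placeholder table names (backticks needed: REAL is a reserved word);
-- x, y stand for the columns that define a duplicate
CREATE TABLE `new` LIKE `real`;
ALTER TABLE `new` ADD UNIQUE(x,y); -- will prevent future dups
INSERT IGNORE INTO `new` -- IGNORE dups
SELECT * FROM `real`;
RENAME TABLE `real` TO `old`, `new` TO `real`;
DROP TABLE `old`;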

SQL checking duplicates in one column and deleting another

I need to delete around 300,000 duplicates in my database. I want to check the Card_id column for duplicates, then check for duplicate timestamps. Then delete one copy and keep one. Example:
| Card_id | Time |
| 1234 | 5:30 |
| 1234 | 5:45 |
| 1234 | 5:30 |
| 1234 | 5:45 |
So remaining data would be:
| Card_id | Time |
| 1234 | 5:30 |
| 1234 | 5:45 |
I have tried several different delete statements, and merging into a new table but with no luck.
UPDATE: Got it working!
Alright after many failures I got this to work for DB2.
delete from(
select card_id, time, row_number() over (partition by card_id, time) rn
from card_table) as A
where rn > 1
rn increments when there are duplicates for card_id and time. The duplicated, or second rn, will be deleted.
I strongly suggest you take this approach:
create temporary table tokeep as
select distinct card_id, time
from t;
truncate table t;
insert into t(card_id, time)
select *
from tokeep;
That is, store the data you want. Truncate the table, and then regenerate it. By truncating the table, you get to keep triggers and permissions and other things linked to the table.
This approach should also be faster than deleting many, many duplicates.
If you are going to do that, you ought to insert a proper id as well:
create temporary table tokeep as
select distinct card_id, time
from t;
truncate table t;
alter table t add column id int auto_increment primary key; -- MySQL syntax; an auto_increment column must be a key
insert into t(card_id, time)
select *
from tokeep;
If you don't have a primary key or candidate key, there is probably no way to do this with a single command. Try the solution below.
Create a table with the duplicates
select Card_id,Time
into COPY_YourTable
from YourTable
group by Card_id,Time
having count(1)>1
Remove duplicates using COPY_YourTable
delete from YourTable
where exists
(
select 1
from COPY_YourTable c
where c.Card_id = YourTable.Card_id
and c.Time = YourTable.Time
)
Copy data without duplicates
insert into YourTable
select Card_id,Time
from COPY_YourTabl