This is my table tusers on a MySQL 5.5.1 Community database:
mysql> SELECT * FROM `tusers`;
+------------+------------+----------+-----+
| tIDUSER | tDate | tHour | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:54:42 | 1 |
| Controneri | 2022-01-06 | 07:43:38 | 2 |
| Controneri | 2022-01-06 | 07:13:09 | 3 |
| Controneri | 2022-01-06 | 06:31:52 | 4 |
| Controneri | 2022-01-06 | 06:13:12 | 5 |
+------------+------------+----------+-----+
5 rows in set (0.13 sec)
I need to select only these rows from the table tusers:
+------------+------------+----------+-----+
| tIDUSER | tDate | tHour | tID |
+------------+------------+----------+-----+
| Controneri | 2022-01-06 | 07:43:38 | 2 |
| Controneri | 2022-01-06 | 06:13:12 | 5 |
+------------+------------+----------+-----+
The other rows are excluded because they are repeat accesses by the same user, Controneri, within one hour of an earlier valid access.
Each user access to the web page is stored in the table tusers with its date and time.
But I have to extract only the first access and exclude the repeated accesses within the following hour.
In this example, the user Controneri logged in 5 times on January 6. The valid accesses are those at 06:13:12 and 07:43:38, because the accesses that followed 06:13:12 came before 07:13:12, i.e. within one hour of 06:13:12 (06:31:52 and 07:13:09, rows 4 and 3).
I have tried without success.
My table structure and the SELECT query are below and on db-fiddle.com, which offers MySQL 5.
Any suggestion?
-- ----------------------------
-- Table structure for tusers
-- ----------------------------
DROP TABLE IF EXISTS `tusers`;
CREATE TABLE `tusers` (
`tIDUSER` varchar(255) NULL DEFAULT NULL,
`tDate` date NULL DEFAULT NULL,
`tHour` time NULL DEFAULT NULL,
`tID` int(11) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`tID`) USING BTREE
) ENGINE = InnoDB;
-- ----------------------------
-- Records of tusers
-- ----------------------------
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:54:42', 1);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:43:38', 2);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '07:13:09', 3);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:31:52', 4);
INSERT INTO `tusers` VALUES ('Controneri', '2022-01-06', '06:13:12', 5);
SELECT
a.tID,
a.tDate,
a.tHour,
a.tIDUSER,
TIMEDIFF( a.tHour, b.tHour ) AS tDif
FROM
`tusers` a
JOIN `tusers` b ON
a.tDate = b.tDate
AND a.tIDUSER = b.tIDUSER
AND a.tID > b.tID
WHERE
( TIMEDIFF( a.tHour, b.tHour ) BETWEEN '00:00:00' AND '01:00:00' )
ORDER BY
a.tIDUSER,
a.tDate,
a.tHour ASC;
For MySQL 5.5 you can achieve this by tracking the previous values in user variables -
SELECT tIDUSER, tDate, tHour, tID
FROM (
SELECT
tusers.*,
IF(@prev_date_time IS NULL OR @prev_user <> tIDUSER OR @prev_date_time + INTERVAL 1 HOUR < TIMESTAMP(tDate, tHour), @prev_date_time := TIMESTAMP(tDate, tHour), NULL) AS `show`,
@prev_user := tIDUSER
FROM tusers, (SELECT @prev_date_time := NULL, @prev_user := NULL) n
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC
) t
WHERE `show` IS NOT NULL
ORDER BY tIDUSER ASC, tDate ASC, tHour ASC;
And here's a db<>fiddle. Thanks to sticky bit as I took the liberty of "borrowing" from their db<>fiddle.
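To make the logic of the user-variable query easier to follow, here is a minimal Python sketch of the same idea: walk the rows in user and time order, remember the last kept access per user, and keep a row only if it is more than one hour after that kept access. The function name first_accesses and the tuple layout are assumptions for illustration, not part of the original answer.

```python
from datetime import datetime, timedelta

# Sample rows from the question: (tIDUSER, tDate, tHour, tID)
rows = [
    ("Controneri", "2022-01-06", "07:54:42", 1),
    ("Controneri", "2022-01-06", "07:43:38", 2),
    ("Controneri", "2022-01-06", "07:13:09", 3),
    ("Controneri", "2022-01-06", "06:31:52", 4),
    ("Controneri", "2022-01-06", "06:13:12", 5),
]


def first_accesses(rows, window=timedelta(hours=1)):
    """Keep a row only if it is more than `window` after the last kept row of that user."""
    last_kept = {}  # user -> timestamp of the last kept access
    kept = []
    # Same ordering as the SQL: user, then date, then time.
    for user, day, hour, row_id in sorted(rows, key=lambda r: (r[0], r[1], r[2])):
        ts = datetime.strptime(f"{day} {hour}", "%Y-%m-%d %H:%M:%S")
        prev = last_kept.get(user)
        if prev is None or prev + window < ts:
            last_kept[user] = ts  # only updated when the row is kept
            kept.append((user, day, hour, row_id))
    return kept


kept = first_accesses(rows)
for row in kept:
    print(row)
```

Note that the reference timestamp only moves when a row is kept, which is why 07:13:09 is dropped (it is compared against 06:13:12, not against 06:31:52).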
The MySQL 5.6 manual states -
However, the order of evaluation for expressions involving user
variables is undefined.
And in later versions is extended to say -
The order of evaluation for expressions involving user variables is
undefined. For example, there is no guarantee that SELECT @a, @a:=@a+1
evaluates @a first and then performs the assignment.
The MySQL 5.7 manual also states -
It is also possible to assign a value to a user variable in statements
other than SET. (This functionality is deprecated in MySQL 8.0 and
subject to removal in a subsequent release.) When making an assignment
in this way, the assignment operator must be := and not = because the
latter is treated as the comparison operator = in statements other
than SET:
Despite the above warnings, this method has been widely used for many years. Your mileage may vary.
I suspect this will perform badly with larger result sets but give it a try.
As requested by the OP in the comments, here is a query using recursive CTEs which will be available with MySQL version 8 and higher.
WITH RECURSIVE
cte1
AS
(
SELECT tusers.tiduser,
tusers.tdate,
tusers.thour,
tusers.tid,
addtime(tusers.tdate, tusers.thour) AS sane_timestamp_representation,
row_number() OVER (PARTITION BY tusers.tiduser
ORDER BY addtime(tusers.tdate, tusers.thour) ASC) AS rn
FROM tusers
),
cte2
AS
(
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
0 AS n
FROM cte1
WHERE cte1.rn = 1
UNION ALL
SELECT cte1.tiduser,
cte1.tdate,
cte1.thour,
cte1.tid,
cte1.sane_timestamp_representation,
cte2.n + 1 AS n
FROM cte2
INNER JOIN cte1
ON cte2.tiduser = cte1.tiduser
AND cte1.sane_timestamp_representation > adddate(cte2.sane_timestamp_representation, INTERVAL 1 HOUR)
),
cte3
AS
(
SELECT cte2.tiduser,
cte2.tdate,
cte2.thour,
cte2.tid,
cte2.sane_timestamp_representation,
row_number() OVER (PARTITION BY cte2.tiduser,
cte2.n
ORDER BY cte2.sane_timestamp_representation ASC) rn
FROM cte2
)
SELECT cte3.tiduser,
cte3.tdate,
cte3.thour,
cte3.tid
FROM cte3
WHERE cte3.rn = 1
ORDER BY cte3.tiduser ASC,
cte3.sane_timestamp_representation ASC;
db<>fiddle
1. In cte1 we first and foremost combine the day and hour parts of the timestamp (storing them as two separate columns is not the brightest idea; it becomes a mess when day boundaries have to be crossed). We also assign a row_number() rn per user, ordered by the timestamp ascending. cte1 acts as our "base table" from now on.
2. In cte2 the recursion happens. As the anchor we take all the rows from cte1 where cte1.rn = 1. These are the records with the minimum timestamp for each user. We also add a number n, which is 0 for these anchor rows. n acts as an indicator of which rows cannot cover each other. All rows with an n + x for x > 1 cannot be covered by any row with n (per user).
In the recursive step we join all records from cte1 that are more than an hour later, per user. Since these cannot be covered by the records already in the result set (per user), because they're past an hour, we assign n + 1 as n to them.
3. cte3 adds another row_number() rn, ordering the records by the timestamp ascending per user and n. Those with an rn of 1 aren't covered by any previous record for the user, because all others with equal or greater n have greater timestamps and those with lesser n don't cover them, by the way we constructed n. So we can select the records from cte3 where rn = 1 and get our final result.
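The three steps above can be mimicked outside the database. The following is a rough Python sketch for a single user, assuming the anchor is the earliest row (rn = 1); the function name levels is made up for illustration. Each list in the result corresponds to one value of n in cte2, and taking the minimum of each list corresponds to rn = 1 in cte3.

```python
from datetime import datetime, timedelta

# Timestamps of the sample user (tDate and tHour already combined, as in cte1).
sample = [
    datetime(2022, 1, 6, 7, 54, 42),
    datetime(2022, 1, 6, 7, 43, 38),
    datetime(2022, 1, 6, 7, 13, 9),
    datetime(2022, 1, 6, 6, 31, 52),
    datetime(2022, 1, 6, 6, 13, 12),
]


def levels(timestamps, window=timedelta(hours=1)):
    """Return one list of timestamps per recursion level n (UNION ALL semantics)."""
    base = sorted(timestamps)
    level = [base[0]]  # anchor: the row with rn = 1
    result = []
    while level:
        result.append(level)
        # Recursive step: join every row of the previous level to every
        # base row that is more than `window` later. Duplicates are kept.
        level = [t for prev in level for t in base if t > prev + window]
    return result


all_levels = levels(sample)
firsts = [min(level) for level in all_levels]  # rn = 1 per level
sizes = [len(level) for level in all_levels]   # intermediate rows per level
print(firsts, sizes)
```

The sizes list gives a feel for how many intermediate rows each level produces, which is the growth the warning below refers to.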
One big fat warning though:
The intermediate result sets will grow rapidly! You can try to select from cte3 without a WHERE clause and see for yourself. So while this shows it can be done theoretically, it might not be practical, even for medium sets of data. The needed resources can quickly exceed maximums.
(And well, since AFAIK SQL with recursive CTEs is Turing complete and the problem seems well computable, it was clear that it can be done anyway. But it still was interesting to see an example how it can be done, I think.)
Maybe it can be optimized. The key, I believe, is to limit the joined rows in the recursive step. We actually only need to join the oldest record past an hour, that would be the next record of interest. That would also make cte3 and the WHERE in the final SELECT unnecessary (unless for projection to get rid of the helper columns). But I didn't find a way to do so. LIMIT as well as window functions aren't allowed or implemented for recursive CTEs, at least in the recursive step. But if somebody comes up with such an optimization, I'd love to see it!
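Outside of SQL, the idea of joining only the oldest record past the hour is straightforward. Here is a small Python sketch, assuming the timestamps of one user are available as a list; the function name first_past_window is hypothetical. It uses bisect_right to jump directly to the first timestamp that is strictly later than the kept one plus the window.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def first_past_window(timestamps, window=timedelta(hours=1)):
    """Keep the first timestamp, then repeatedly jump to the first one past the window."""
    ordered = sorted(timestamps)
    kept = []
    i = 0
    while i < len(ordered):
        kept.append(ordered[i])
        # Index of the first element strictly greater than ordered[i] + window.
        i = bisect_right(ordered, ordered[i] + window)
    return kept


accesses = [
    datetime(2022, 1, 6, 6, 13, 12),
    datetime(2022, 1, 6, 6, 31, 52),
    datetime(2022, 1, 6, 7, 13, 9),
    datetime(2022, 1, 6, 7, 43, 38),
    datetime(2022, 1, 6, 7, 54, 42),
]
valid = first_past_window(accesses)
print(valid)
```

Each kept record costs one binary search, so no intermediate rows pile up the way they do in the recursive CTE.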
Oh, and the stupid timestamp representation in two columns, which needs to be put together at first, will also render the use of indexes on the timestamps impossible. So that's another factor limiting performance here.
Related
In a SELECT, I get rows from the table with a time in TIMESTAMP format. I want to count unique rows, but with a tolerance of 1 second. In the example below there are 3 unique records (rows 1 and 2 differ by 1 second and are therefore counted as one).
I was thinking to make a function like ABS(time_1 - time_2) > 1 to search for unique values.
Is it possible to implement this somehow on the SQL side, or would it be better to implement it on the server-side, which would be pulling this data?
Is it possible to do it without functions?
How much of a burden will this put on the database?
Any tips for solving the problem are welcome!
P.S. I have an old version of MySQL, 5.7.
Example output:
+------------+
| time |
+------------+
| 1583060400 |
+------------+
| 1583060401 |
+------------+
| 1583060460 |
+------------+
| 1583074860 |
+------------+
Assuming that "if a row TIMESTAMP differs from previous row TIMESTAMP by not more than 1 second then ignore this row presence" you may use
SELECT MAX(counter) groups_amount
FROM ( SELECT CASE WHEN TIMESTAMPDIFF(SECOND, @previous, `time`) > 1
THEN @counter := @counter + 1
END counter,
@previous := `time`
FROM test
CROSS JOIN ( SELECT @previous := '1970-01-01 00:00:01',
@counter := 0 ) init_vars
ORDER BY `time` ASC ) subquery;
https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=2aba3b8f473e65f4f40e449c8d97a79d
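If the counting is done on the application side instead, the same rule fits in a few lines. This is a minimal Python sketch under the same assumption as the query (a row is ignored when it differs from the previous row by no more than 1 second); count_groups is a made-up name and the values are the Unix timestamps from the question's example.

```python
def count_groups(times, tolerance=1):
    """Count rows that are more than `tolerance` seconds after the previous row."""
    count = 0
    previous = None
    for t in sorted(times):
        if previous is None or t - previous > tolerance:
            count += 1
        previous = t  # always compare against the immediately preceding row
    return count


example = [1583060400, 1583060401, 1583060460, 1583074860]
groups = count_groups(example)
print(groups)
```

Because each row is compared with its direct predecessor, a chain such as 0, 1, 2, 3 counts as a single group even though its ends are 3 seconds apart.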
I am trying to figure out a seemingly trivial SQL query.
For all users in the table I want to find the time and data for the row with the highest time (latest event).
The following almost solves it
SELECT user, MAX(time) as time FROM tasks GROUP BY user;
The problem is of course that the data column cannot be reduced. I think therefore I should use a WHERE or ORDER BY + LIMIT construction. But I am too far out of my domain here to know how this should be done properly. Any hints?
Note. It is not possible to use GROUP BY in this instance because I want to select on the table row ID, which cannot be aggregated, obviously.
-- MYSQL
DROP DATABASE IF EXISTS test;
CREATE DATABASE test;
USE test;
CREATE TABLE tasks (
id int AUTO_INCREMENT,
user varchar(100) NOT NULL,
time date NOT NULL,
data varchar(100) NOT NULL,
PRIMARY KEY (id)
);
INSERT INTO tasks (user, time, data) VALUES
("Kalle", "1970-01-01", "old news"),
("Kalle", "2020-01-01", "latest shit"),
("Pelle", "1970-01-01", "regular data");
-- Expected output
-- +----+-------+------------+--------------+
-- | id | user | time | data |
-- +----+-------+------------+--------------+
-- | 2 | Kalle | 2020-01-01 | latest shit |
-- | 3 | Pelle | 1970-01-01 | regular data |
-- +----+-------+------------+--------------+
-- 2 rows in set (0.00 sec)
You can filter with a subquery:
select t.*
from tasks t
where time = (select max(t1.time) from tasks t1 where t1.user = t.user)
This query would take advantage of a multi-column index on (user, time).
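The correlated-subquery filter can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption for illustration only: SQLite is not MySQL, the column types are simplified, and the index name idx_user_time is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " user TEXT NOT NULL,"
    " time TEXT NOT NULL,"
    " data TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO tasks (user, time, data) VALUES (?, ?, ?)",
    [
        ("Kalle", "1970-01-01", "old news"),
        ("Kalle", "2020-01-01", "latest shit"),
        ("Pelle", "1970-01-01", "regular data"),
    ],
)
# Multi-column index on (user, time), as suggested in the answer.
conn.execute("CREATE INDEX idx_user_time ON tasks (user, time)")

# Same shape as the answer's query: keep rows whose time equals the
# maximum time of their user.
latest = conn.execute(
    "SELECT t.id, t.user, t.time, t.data"
    " FROM tasks t"
    " WHERE t.time = (SELECT MAX(t1.time) FROM tasks t1 WHERE t1.user = t.user)"
    " ORDER BY t.id"
).fetchall()
for row in latest:
    print(row)
conn.close()
```

If two rows of the same user share the maximum time, both are returned, so callers may need to handle ties.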
In MySQL 8.0, you can also solve this top-1-per-group with window functions:
select *
from (select t.*, row_number() over(partition by user order by time desc) rn from tasks t) t
where rn = 1
I have a table that has the following columns:
id | fk_id | rcv_date
There may be multiple records with a common fk_id, which represents a foreign key id in a related table.
I need to create a query that will assign a row number to each record, grouped by fk_id, sorted by rcv_date.
I originally began with the following query, which works quite well for sorting and assigning row numbers:
SELECT @row:=@row +1 AS ordinality, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, mytable c
ORDER BY rcv_date
However -- the row count and sorting is done across the entire dataset. I need the counting to be within a common fk_id. For example, the following sample data would return (the first column represents the row count/ordinality):
1 | 5 | 2011-10-01
2 | 5 | 2011-10-14
3 | 5 | 2011-11-02
4 | 5 | 2011-12-17
1 | 8 | 2011-09-03
2 | 8 | 2011-11-12
1 | 9 | 2011-10-08
2 | 9 | 2011-10-10
3 | 9 | 2011-11-19
The middle column represents the fk_id. As you can see, the sorting and row count is within the fk_id "grouping."
UPDATE
I have a query that seems to be working, but would like some input as to whether it can be improved:
SELECT IF(@last = c.fk_id, @row:=@row +1, @row:=1) AS ordinality, @last:=c.fk_id, c.fk_id, rcv_date
FROM (SELECT @row:=0) r, (SELECT @last:=0) l, mytable c
ORDER BY c.fk_id, rcv_date
So what this does is order by fk_id and then rcv_date -- which essentially handles my grouping. Then I use a second variable to compare the fk_id in the previous record with the current record: if it's the same, we increment the row; if different, we reset to 1.
My tests with real data appear to be working. I suspect it's a pretty inefficient query though -- so if anyone has ideas for improving it, or see possible flaws, I would love to hear.
This should be pretty straightforward.
SELECT (CASE WHEN @fk <> fk_id THEN @row:=1 ELSE @row:=@row + 1 END) AS ordinality,
@fk:=fk_id, rcv_date
FROM (SELECT @row:=0) AS r,
(SELECT @fk:=0) AS f,
(SELECT fk_id, rcv_date FROM files ORDER BY fk_id, rcv_date) AS t
I ordered by fk_id first to ensure all your foreign keys come together (what if they are not really grouped in the table?), then I applied your preferred ordering, i.e. by rcv_date. The query checks for a change in fk_id: if there is one, the row number variable is set to 1, otherwise the variable is incremented. This is handled in the CASE statement. Notice that @fk:=fk_id is done after the CASE check, otherwise it would affect the row number.
Edit: Just noticed your own solution which happened to be the same as I ended up with. Kudos! :)
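For readers who want to check the numbering logic independently of MySQL's user-variable evaluation order, here is a small Python sketch of the same technique: sort by fk_id and rcv_date, then reset the counter whenever fk_id changes. The function name number_within_groups is an assumption; the data is the sample from the question.

```python
def number_within_groups(rows):
    """Assign 1, 2, 3, ... within each fk_id, ordered by rcv_date."""
    result = []
    last_fk = None
    row = 0
    # Sorting tuples orders by fk_id first, then by rcv_date (ISO dates sort as text).
    for fk_id, rcv_date in sorted(rows):
        row = row + 1 if fk_id == last_fk else 1
        last_fk = fk_id  # updated after the comparison, as in the SQL
        result.append((row, fk_id, rcv_date))
    return result


records = [
    (9, "2011-11-19"), (5, "2011-10-14"), (8, "2011-11-12"),
    (5, "2011-10-01"), (9, "2011-10-08"), (5, "2011-12-17"),
    (8, "2011-09-03"), (5, "2011-11-02"), (9, "2011-10-10"),
]
numbered = number_within_groups(records)
for line in numbered:
    print(line)
```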
My objective
I am trying to retrieve multiple random rows that contain only unique userid but for the type column to be random - type can only be 0 or 1. The table in question will contain less than 1,000 rows at any given time.
My table
CREATE TABLE tbl_message_queue (
userid bigint(20) NOT NULL,
messageid varchar(20) NOT NULL,
`type` int(1) NOT NULL,
PRIMARY KEY (userid,messageid,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Sample data
userid | messageid | type
---------------------------------------------------
4353453 | 518423942 | 0
4353453 | 518423942 | 1
2342934 | 748475435 | 0
2342934 | 748475435 | 1
7657529 | 821516543 | 0
7657529 | 821516543 | 1
0823546 | 932843285 | 0
0823546 | 932843285 | 1
What to rule out
Using ORDER BY RAND() isn't feasible, as at least 18,000 of these queries are executed by applications at any given moment and are causing high load. Using SELECT DISTINCT or GROUP BY is (obviously) more efficient, with an acceptable load, and will always pick a unique userid, but type will always be 0.
The common method is to create an id column but I'm looking for an alternative way only. The group primary key cannot change as it is required and deeply integrated into our application, however the structure of each column can be altered.
Thanks.
My understanding of your question is that for each userid you have two entries, but want to extract only one, at random.
To achieve this, you ought to generate a random value between 0 and 1 for each unique userid, and then JOIN this list with the starting list:
SELECT a.* FROM tbl_message_queue AS a
JOIN ( SELECT userid, FLOOR(2*RAND()) AS type
FROM tbl_message_queue GROUP BY userid ) AS b
ON ( a.userid = b.userid AND a.type = b.type );
But if an ORDER BY RAND() does not work for you, maybe we should compromise.
In the above sequence, any two userids will be uncorrelated -- i.e., the fact that user A gets type 0 tells you nothing about what user B will turn up with.
Depending on the use case, a less random (but "apparently random") sequence could be obtained with two queries:
SELECT @X := FLOOR(2*RAND()), @Y := POW(2,FLOOR(2+14*RAND()))-1;
SELECT * FROM tbl_message_queue WHERE (((userid % @Y) & 1) XOR type XOR @X);
This way, you can get what seems a random extraction. What really happens is that the userids are correlated, and only a couple dozen different extractions are possible. But using only simple operators, and no JOINs, this query is very fast.
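The filter in the second query can be checked with a short Python sketch. It assumes each userid has exactly one row of type 0 and one of type 1, as in the sample data; the helper name pick is hypothetical. For any fixed X and Y, the expression selects exactly one of the two rows per user.

```python
import random

# Sample data from the question: (userid, messageid, type)
queue = [
    (4353453, "518423942", 0), (4353453, "518423942", 1),
    (2342934, "748475435", 0), (2342934, "748475435", 1),
    (7657529, "821516543", 0), (7657529, "821516543", 1),
    (823546, "932843285", 0), (823546, "932843285", 1),
]


def pick(rows, x, y):
    """Python equivalent of: WHERE (((userid % Y) & 1) XOR type XOR X)."""
    return [(u, m, t) for (u, m, t) in rows if ((u % y) & 1) ^ t ^ x]


# Same ranges as the SQL: X is 0 or 1, Y is 2^k - 1 for k in 2..15.
x = random.randrange(2)
y = 2 ** random.randrange(2, 16) - 1
chosen = pick(queue, x, y)
print(x, y, chosen)
```

With 2 values for X and 14 for Y there are at most 28 distinct parameter pairs, which matches the remark about only a couple dozen possible extractions.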
I have a database with a created_at column containing the datetime in Y-m-d H:i:s format.
The latest datetime entry is 2011-09-28 00:10:02.
I need the query to be relative to the latest datetime entry.
1. The first value in the query should be the latest datetime entry.
2. The second value in the query should be the entry closest to 7 days from the first value.
3. The third value should be the entry closest to 7 days from the second value.
4. Repeat step 3.
What I mean by "closest to 7 days from":
The following are dates, the interval I desire is a week, in seconds a week is 604800 seconds.
7 days from the first value is equal to 1316578202 (1317183002-604800)
the value closest to 1316578202 (7 days) is... 1316571974
unix timestamp | Y-m-d H:i:s
1317183002 | 2011-09-28 00:10:02 -> appear in query (first value)
1317101233 | 2011-09-27 01:27:13
1317009182 | 2011-09-25 23:53:02
1316916554 | 2011-09-24 22:09:14
1316836656 | 2011-09-23 23:57:36
1316745220 | 2011-09-22 22:33:40
1316659915 | 2011-09-21 22:51:55
1316571974 | 2011-09-20 22:26:14 -> closest to 7 days from 1317183002 (first value)
1316499187 | 2011-09-20 02:13:07
1316064243 | 2011-09-15 01:24:03
1315967707 | 2011-09-13 22:35:07 -> closest to 7 days from 1316571974 (second value)
1315881414 | 2011-09-12 22:36:54
1315794048 | 2011-09-11 22:20:48
1315715786 | 2011-09-11 00:36:26
1315622142 | 2011-09-09 22:35:42
I would really appreciate any help, I have not been able to do this via mysql and no online resources seem to deal with relative date manipulation such as this. I would like the query to be modular enough to be able to change the interval weekly, monthly, or yearly. Thanks in advance!
Answer #1 Reply:
SELECT
UNIX_TIMESTAMP(created_at)
AS unix_timestamp,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT max(created_at) - 7
FROM my_table
)
)
AS `random_1`,
(
SELECT MIN(UNIX_TIMESTAMP(created_at))
FROM my_table
WHERE created_at >=
(
SELECT MAX(created_at) - 14
FROM my_table
)
)
AS `random_2`
FROM my_table
WHERE created_at =
(
SELECT MAX(created_at)
FROM my_table
)
Returns:
unix_timestamp | random_1 | random_2
1317183002 | 1317183002 | 1317183002
Answer #2 Reply:
RESULT SET:
This is the result set for a yearly interval:
id | created_at | period_index | period_timestamp
267 | 2010-09-27 22:57:05 | 0 | 1317183002
1 | 2009-12-10 15:08:00 | 1 | 1285554786
I desire this result:
id | created_at | period_index | period_timestamp
626 | 2011-09-28 00:10:02 | 0 | 0
267 | 2010-09-27 22:57:05 | 1 | 1317183002
I hope this makes more sense.
It's not exactly what you asked for, but the following example is pretty close....
Example 1:
select
floor(timestampdiff(SECOND, tbl.time, most_recent.time)/604800) as period_index,
unix_timestamp(max(tbl.time)) as period_timestamp
from
tbl
, (select max(time) as time from tbl) most_recent
group by period_index
gives results:
+--------------+------------------+
| period_index | period_timestamp |
+--------------+------------------+
| 0 | 1317183002 |
| 1 | 1316571974 |
| 2 | 1315967707 |
+--------------+------------------+
This breaks the dataset into groups based on "periods", where (in this example) each period is 7-days (604800 seconds) long. The period_timestamp that is returned for each period is the 'latest' (most recent) timestamp that falls within that period.
The period boundaries are all computed based on the most recent timestamp in the database, rather than computing each period's start and end time individually based on the timestamp of the period before it. The difference is subtle - your question requests the latter (iterative approach), but I'm hoping that the former (approach I've described here) will suffice for your needs, since SQL doesn't lend itself well to implementing iterative algorithms.
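To see how the period bucketing in Example 1 behaves, here is a minimal Python sketch of the same computation on the Unix timestamps from the question. The function name latest_per_period is made up; the integer division mirrors floor(timestampdiff(...)/604800) and the dictionary keeps the maximum timestamp per period index.

```python
def latest_per_period(timestamps, period=604800):
    """Bucket timestamps by age relative to the most recent one; keep the latest per bucket."""
    most_recent = max(timestamps)
    latest = {}
    for t in timestamps:
        idx = (most_recent - t) // period  # period_index
        if idx not in latest or t > latest[idx]:
            latest[idx] = t                # max(time) within the group
    return dict(sorted(latest.items()))


stamps = [
    1317183002, 1317101233, 1317009182, 1316916554, 1316836656,
    1316745220, 1316659915, 1316571974, 1316499187, 1316064243,
    1315967707, 1315881414, 1315794048, 1315715786, 1315622142,
]
periods = latest_per_period(stamps)
print(periods)
```

All boundaries are fixed multiples of the period counted back from the most recent timestamp, which is the difference from the iterative definition in the question.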
If you really do need to determine each period based on the timestamp in the previous period, then your best bet is going to be an iterative approach -- either using a programming language of your choice (like php), or by building a stored procedure that uses a cursor.
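As a rough illustration of the iterative approach, here is a Python sketch that starts from the latest entry and repeatedly picks the entry closest to one interval earlier. The function name sample_by_interval is hypothetical, and the stopping rule (stop once the target falls before the oldest entry) is an assumption, since the question does not say when to stop.

```python
def sample_by_interval(timestamps, interval=604800):
    """Pick the latest entry, then repeatedly the entry closest to `interval` earlier."""
    entries = sorted(timestamps)
    current = entries[-1]
    picked = [current]
    while True:
        target = current - interval
        if target < entries[0]:
            break  # assumed stopping rule: nothing exists that far back
        # Only strictly older entries are candidates, so the loop always progresses.
        older = [t for t in entries if t < current]
        current = min(older, key=lambda t: abs(t - target))
        picked.append(current)
    return picked


created = [
    1317183002, 1317101233, 1317009182, 1316916554, 1316836656,
    1316745220, 1316659915, 1316571974, 1316499187, 1316064243,
    1315967707, 1315881414, 1315794048, 1315715786, 1315622142,
]
picked = sample_by_interval(created)
print(picked)
```

Changing the interval argument (in seconds) covers other period lengths; calendar months or years would need date arithmetic instead of a fixed number of seconds.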
Edit #1
Here's the table structure for the above example.
CREATE TABLE `tbl` (
`id` int(10) unsigned NOT NULL auto_increment PRIMARY KEY,
`time` datetime NOT NULL
)
Edit #2
Ok, first: I've improved the original example query (see revised "Example 1" above). It still works the same way, and gives the same results, but it's cleaner, more efficient, and easier to understand.
Now... the query above is a group-by query, meaning it shows aggregate results for the "period" groups as I described above, not row-by-row results like a "normal" query. With a group-by query, you're limited to using aggregate columns only. Aggregate columns are those columns that are named in the group by clause, or that are computed by an aggregate function like MAX(time). It is not possible to extract meaningful values for non-aggregate columns (like id) from within the projection of a group-by query.
Unfortunately, MySQL doesn't generate an error when you try to do this (unless the ONLY_FULL_GROUP_BY SQL mode is enabled). Instead, it just picks an arbitrary value from within the grouped rows, and shows that value for the non-aggregate column in the grouped result. This is what's causing the odd behavior the OP reported when trying to use the code from Example #1.
Fortunately, this problem is fairly easy to solve. Just wrap another query around the group query, to select the row-by-row information you're interested in...
Example 2:
SELECT
entries.id,
entries.time,
periods.idx as period_index,
unix_timestamp(periods.time) as period_timestamp
FROM
tbl entries
JOIN
(select
floor(timestampdiff( SECOND, tbl.time, most_recent.time)/31536000) as idx,
max(tbl.time) as time
from
tbl
, (select max(time) as time from tbl) most_recent
group by idx
) periods
ON entries.time = periods.time
Result:
+-----+---------------------+--------------+------------------+
| id | time | period_index | period_timestamp |
+-----+---------------------+--------------+------------------+
| 598 | 2011-09-28 04:10:02 | 0 | 1317183002 |
| 996 | 2010-09-27 22:57:05 | 1 | 1285628225 |
+-----+---------------------+--------------+------------------+
Notes:
Example 2 uses a period length of 31536000 seconds (365 days), while Example 1 (above) uses a period of 604800 seconds (7 days). Other than that, the inner query in Example 2 is the same as the primary query shown in Example 1.
If a matching period_time belongs to more than one entry (i.e. two or more entries have the exact same time, and that time matches one of the selected period_time values), then the above query (Example 2) will include multiple rows for the given period timestamp (one for each match). Whatever code consumes this result set should be prepared to handle such an edge case.
It's also worth noting that these queries will perform much, much better if you define an index on your datetime column. For my example schema, that would look like this:
ALTER TABLE tbl ADD INDEX idx_time ( time )
If you're willing to go for the closest that is after the week is out then this'll work. You can extend it to work out the closest but it'll look so disgusting it's probably not worth it.
select unix_timestamp
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - interval 7 day
from my_table )
)
, ( select min(unix_tstamp)
from my_table
where sql_tstamp >= ( select max(sql_tstamp) - interval 14 day
from my_table )
)
from my_table
where sql_tstamp = ( select max(sql_tstamp)
from my_table )