MySQL GROUP BY DateTime +/- 3 seconds - mysql

Suppose I have a table with 3 columns:
id (PK, int)
timestamp (datetime)
title (text)
I have the following records:
1, 2010-01-01 15:00:00, Some Title
2, 2010-01-01 15:00:02, Some Title
3, 2010-01-02 15:00:00, Some Title
I need to do a GROUP BY records that are within 3 seconds of each other. For this table, rows 1 and 2 would be grouped together.
There is a similar question here: Mysql DateTime group by 15 mins
I also found this: http://www.artfulsoftware.com/infotree/queries.php#106
I don't know how to convert these methods into something that will work for seconds. The trouble with the method on the SO question is that it seems to me that it would only work for records falling within a bin of time that starts at a known point. For instance, if I were to get FLOOR() to work with seconds, at an interval of 5 seconds, a time of 15:00:04 would be grouped with 15:00:01, but not grouped with 15:00:06.
Does this make sense? Please let me know if further clarification is needed.
EDIT: For the set of numbers, {1, 2, 3, 4, 5, 6, 7, 50, 51, 60}, it seems it might be best to group them {1, 2, 3, 4, 5, 6, 7}, {50, 51}, {60}, so that each grouping row depends on if the row is within 3 seconds of the previous. I know this changes things a bit, I'm sorry for being wishywashy on this.
I am trying to fuzzy-match logs from different servers. Server #1 may log an item, "Item #1", and Server #2 will log that same item, "Item #1", within a few seconds of server #1. I need to do some aggregate functions on both log lines. Unfortunately, I only have title to go on, due to the nature of the server software.

I'm using Tom H.'s excellent idea but doing it a little differently here:
Instead of finding all the rows that are the beginnings of chains, we can find all times that are the beginnings of chains, then go back and ifnd the rows that match the times.
Query #1 here should tell you which times are the beginnings of chains by finding which times do not have any times below them but within 3 seconds:
SELECT DISTINCT Timestamp
FROM Table a
LEFT JOIN Table b
ON (b.Timestamp >= a.TimeStamp - INTERVAL 3 SECONDS
AND b.Timestamp < a.Timestamp)
WHERE b.Timestamp IS NULL
And then for each row, we can find the largest chain-starting timestamp that is less than our timestamp with Query #2:
SELECT Table.id, MAX(StartOfChains.TimeStamp) AS ChainStartTime
FROM Table
JOIN ([query #1]) StartofChains
ON Table.Timestamp >= StartOfChains.TimeStamp
GROUP BY Table.id
Once we have that, we can GROUP BY it as you wanted.
SELECT COUNT(*) --or whatever
FROM Table
JOIN ([query #2]) GroupingQuery
ON Table.id = GroupingQuery.id
GROUP BY GroupingQuery.ChainStartTime
I'm not entirely sure this is distinct enough from Tom H's answer to be posted separately, but it sounded like you were having trouble with implementation, and I was thinking about it, so I thought I'd post again. Good luck!

Now that I think that I understand your problem, based on your comment response to OMG Ponies, I think that I have a set-based solution. The idea is to first find the start of any chains based on the title. The start of a chain is going to be defined as any row where there is no match within three seconds prior to that row:
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECONDS
WHERE
MT2.my_id IS NULL
Now we can assume that any non-chain starters belong to the chain starter that appeared before them. Since MySQL doesn't support CTEs, you might want to throw the above results into a temporary table, as that would save you the multiple joins to the same subquery below.
SELECT
SQ1.my_id,
COUNT(*) -- You didn't say what you were trying to calculate, just that you needed to group them
FROM
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECONDS
WHERE
MT2.my_id IS NULL
) SQ1
INNER JOIN My_Table MT3 ON
MT3.title = SQ1.title AND
MT3.my_time >= SQ1.my_time
LEFT OUTER JOIN
(
SELECT
MT1.my_id,
MT1.title,
MT1.my_time
FROM
My_Table MT1
LEFT OUTER JOIN My_Table MT2 ON
MT2.title = MT1.title AND
(
MT2.my_time < MT1.my_time OR
(MT2.my_time = MT1.my_time AND MT2.my_id < MT1.my_id)
) AND
MT2.my_time >= MT1.my_time - INTERVAL 3 SECONDS
WHERE
MT2.my_id IS NULL
) SQ2 ON
SQ2.title = SQ1.title AND
SQ2.my_time > SQ1.my_time AND
SQ2.my_time <= MT3.my_time
WHERE
SQ2.my_id IS NULL
This would look much simpler if you could use CTEs or if you used a temporary table. Using the temporary table might also help performance.
Also, there will be issues with this if you can have timestamps that match exactly. If that's the case then you will need to tweak the query slightly to use a combination of the id and the timestamp to distinguish rows with matching timestamp values.
EDIT: Changed the queries to handle exact matches by timestamp.

Warning: Long answer. This should work, and is fairly neat, except for one step in the middle where you have to be willing to run an INSERT statement over and over until it doesn't do anything since we can't do recursive CTE things in MySQL.
I'm going to use this data as the example instead of yours:
id Timestamp
1 1:00:00
2 1:00:03
3 1:00:06
4 1:00:10
Here is the first query to write:
SELECT a.id as aid, b.id as bid
FROM Table a
JOIN Table b
ON (a.Timestamp is within 3 seconds of b.Timestamp)
It returns:
aid bid
1 1
1 2
2 1
2 2
2 3
3 2
3 3
4 4
Let's create a nice table to hold those things that won't allow duplicates:
CREATE TABLE
Adjacency
( aid INT(11)
, bid INT(11)
, PRIMARY KEY (aid, bid) --important for later
)
Now the challenge is to find something like the transitive closure of that relation.
To do so, let's find the next level of links. by that I mean, since we have 1 2 and 2 3 in the Adjacency table, we should add 1 3:
INSERT IGNORE INTO Adjacency(aid,bid)
SELECT adj1.aid, adj2.bid
FROM Adjacency adj1
JOIN Adjacency adj2
ON (adj1.bid = adj2.aid)
This is the non-elegant part: You'll need to run the above INSERT statement over and over until it doesn't add any rows to the table. I don't know if there is a neat way to do that.
Once this is over, you will have a transitively-closed relation like this:
aid bid
1 1
1 2
1 3 --added
2 1
2 2
2 3
3 1 --added
3 2
3 3
4 4
And now for the punchline:
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
returns:
aid Neighbors
1 1,2,3
2 1,2,3
3 1,2,3
4 4
So
SELECT DISTINCT Neighbors
FROM (
SELECT aid, GROUP_CONCAT( bid ) AS Neighbors
FROM Adjacency
GROUP BY aid
) Groupings
returns
Neighbors
1,2,3
4
Whew!

I like #Chris Cunningham's answer, but here's another take on it.
First, my understanding of your problem statement (correct me if I'm wrong):
You want to look at your event log as a sequence, ordered by the time of the event,
and partitition it into groups, defining the boundary as being an interval of
more than 3 seconds between two adjacent rows in the sequence.
I work mostly in SQL Server, so I'm using SQL Server syntax. It shouldn't be too difficult to translate into MySQL SQL.
So, first our event log table:
--
-- our event log table
--
create table dbo.eventLog
(
id int not null ,
dtLogged datetime not null ,
title varchar(200) not null ,
primary key nonclustered ( id ) ,
unique clustered ( dtLogged , id ) ,
)
Given the above understanding of the problem statement, the following query should give you the upper and lower bounds your groups. It's a simple, nested select statement with 2 group by to collapse things:
The innermost select defines the upper bound of each group. That upper boundary defines a group.
The outer select defines the lower bound of each group.
Every row in the table should fall into one of the groups so defined, and any given group may well consist of a single date/time value.
[edited: the upper bound is the lowest date/time value where the interval is more than 3 seconds]
select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
You could then pull rows from the event log and tag them with the group to which they belong thus:
select *
from ( select dtFrom = min( t.dtFrom ) ,
dtThru = t.dtThru
from ( select dtFrom = t1.dtLogged ,
dtThru = min( t2.dtLogged )
from dbo.EventLog t1
left join dbo.EventLog t2 on t2.dtLogged >= t1.dtLogged
and datediff(second,t1.dtLogged,t2.dtLogged) > 3
group by t1.dtLogged
) t
group by t.dtThru
) period
join dbo.EventLog t on t.dtLogged >= period.dtFrom
and t.dtLogged <= coalesce( period.dtThru , t.dtLogged )
order by period.dtFrom , period.dtThru , t.dtLogged
Each row is tagged with its group via the dtFrom and dtThru columns returned. You could get fancy and assign an integral row number to each group if you want.

Simple query:
SELECT * FROM time_history GROUP BY ROUND(UNIX_TIMESTAMP(time_stamp)/3);

Related

MySQL : Optimizing simple query

Needless to say i am not proficient at SQL. Now i have to run a query on a table that looks like this :
id, tp_id, value_1, value_2, value_3, date
This table has 2 entries for each distinct tp_id, with different values. tp_id is a foreign key, which is indexed, in the following table :
id, external_id
I'm trying to retrieve data as follows :
Get distinct tp_id where value_2 = 2, value_1 = 1 | 2, value_3 = 1, and date < now - 1 year. These conditions must hold true for BOTH entries with matching tp_id
I have tried the following query, but as i understand it the SUM function paired with the JOIN statement makes the query too slow :
SELECT t1.tp_id, t2.external_id
FROM table_1 t1
JOIN table_2 t2 ON t1.tp_id = t2.id
GROUP BY t1.tp_id
HAVING
SUM(
t1.value_2 = 2
AND t1.value_1 IN (1, 2)
AND t1.value_3 = 1
AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
) = 2;
Both tables have roughly 2.5M rows.
I'd like to optimize this query or learn a better way to do this, so any help would be welcome.
Thanks in advance
EDIT: It appears running this query will be altogether unnecessary. I will therefore close the question, thanks for the answers
If I got your requirement correctly, something like this might help.
SELECT tp_id
FROM (
SELECT t1.tp_id,count(*) as count
FROM table_1 t1
WHERE
t1.value_2 = 2
AND (t1.value_1 = 1 OR t1.value_1 = 2)
AND t1.value_3 = 1
AND t1.date <= DATE_SUB(NOW(), INTERVAL 1 YEAR)
GROUP BY tp_id
) as res
WHERE res.count = 2
Essentially, I did 3 performance update:
the WHERE condition is applied before the GROUP BY, way more performant than the HAVING
I've used a nested query, but you can also use HAVING COUNT(tp_id) = 2 depending on your MySQL version
2 boolean checks should be more performant than an IN clause

Possible to count number of occurrences in a "group" in MySQL?

Sorry if the title is misleading, I don't really know the terminology for what I want to accomplish. But let's consider this table:
CREATE TABLE entries (
id INT NOT NULL,
number INT NOT NULL
);
Let's say it contains four numbers associated with each id, like this:
id number
1 0
1 9
1 17
1 11
2 5
2 8
2 9
2 0
.
.
.
Is it possible, with a SQL-query only, to count the numbers of matches for any two given numbers (tuples) associated with a id?
Let's say I want to count the number of occurrences of number 0 and 9 that is associated with a unique id. In the sample data above 0 and 9 does occur two times (one time where id=1 and one time where id=2). I can't think of how to write a SQL-query that solves this. Is it possible? Maybe my table structure is wrong, but that's how my data is organized right now.
I have tried sub-queries, unions, joins and everything else, but haven't found a way yet.
You can use GROUP BY and HAVING clauses:
SELECT COUNT(s.id)
FROM(
SELECT t.id
FROM YourTable t
WHERE t.number in(0,9)
GROUP BY t.id
HAVING COUNT(distinct t.number) = 2) s
Or with EXISTS():
SELECT COUNT(distinct t.id)
FROM YourTable t
WHERE EXISTS(SELECT 1 FROM YourTable s
WHERE t.id = s.id and s.id IN(0,9)
HAVING COUNT(distinct s.number) = 2)

improve sql query with 2 EXISTS sub queries

I have this query (mysql):
SELECT `budget_items`.*
FROM `budget_items`
WHERE (budget_category_id = 4
AND ((is_custom_for_family = 0)
OR (is_custom_for_family = 1
AND custom_item_family_id = 999))
AND ((EXISTS
(SELECT 1
FROM balance_histories
WHERE balance_histories.budget_item_id = budget_items.id
AND balance_histories.family_id = 999
AND payment_date >= '2021-02-01'
AND payment_date <= '2021-02-28' ))
OR (EXISTS
(SELECT 1
FROM budget_lines
WHERE family_id = 999
AND budget_id = 188311
AND budget_item_id = budget_items.id
AND amount > 0))))
It runs multiple times on app start. It takes more than 10 seconds (all of them).
I have indexes on:
balance_histories table: budget_item_id, family_id (tried also payment_date)
budget_lines table: family_id, budget_id, budget_item_id
How can I improve the speed? Query or maybe mysql (8) configuration.
balance_histories table:
budget_lines table:
I would start this query in reverse of what you have. Assuming you COULD have years of data, but your EXISTS query is looking more specifically at a date-range, or specific budget lines, start there, it will probably be much smaller. Once you have DISTINCT IDs, then go back to the budget items by qualified ID PLUS the additional criteria.
To help optimize the queries, I would have indexes on
table index
balance_histories ( family_id, payment_date, budget_item_id )
budget_lines ( family_id, budget_id, amount )
budget_items ( id, budget_category_id, is_custom_for_family, custom_item_family_id )
select
bi.*
from
-- pre-query a list of DISTINCT IDs from the balance history
-- and budget lines that qualify. THEN join to the rest.
( select distinct
bh.budget_item_id id
from
balance_histories bh
where
bh.family_id = 999
AND bh.payment_date >= '2021-02-01'
AND bh.payment_date <= '2021-02-28'
UNION
select
bl.budget_item_id
FROM
budget_lines bl
WHERE
bl.family_id = 999
AND bl.budget_id = 188311
AND bl.amount > 0 ) PQ
JOIN budget_items bi
on PQ.id = bi.id
AND bi.budget_category_id = 4
AND ( bi.is_custom_for_family = 0
OR
( bi.is_custom_for_family = 1
AND bi.custom_item_family_id = 999 )
)
Feedback
As for many SQL queries, there are typically multiple ways to get a solution. Sometimes using EXISTS works well, sometimes not as much. You need to consider cardinality of your data, and that is what I was shooting for. Look at what you were asking for first: Get budget items that are all category for and custom for family is 1 or 0 (which is all), but if family, only those for 999. You were correct on your balance of AND/OR. However, this is going through EVERY RECORD, and if you have millions of rows, that is what you are scanning through. Only after scanning through every row are you now doing a secondary query (for each record that qualified) against the histories for the specific date range OR family/budget.
My guess is that the number of possible records returned from your two EXISTS queries was going to be very small. So, by starting by getting a DISTINCT list of just those IDs that are part of that union would be the very small subset. Once that single "ID" if found, it now becomes a direct match to the budget items table and have the final filtering limits of categoryID / Family / Custom Item considerations.
By having indexes better match the context of your query WHERE clause will optimize pulling data. I have had answers to several other questions with similar resolutions and clarify indexes and why in those... take a look for example, and another here.

Number duplicate records on the MySQL table

Have a table with similar schema
id control code amount
1 200 12 300
2 400 12 300
3 200 12 300
4 100 10 400
5 100 10 400
6 500 13 500
Trying to list the duplicates of records on a UI.
Using following query I can retrieve the duplicate records and show it on UI.
select * from mwt group by control,code,amount having count(id) > 1;
id control code amount
1 200 12 300
4 100 10 400
Here the records with id 1 and 4 are duplicates of 3 and 5 respectively.
On the UI, the user will click a check-box adjacent to the record and corresponding duplicate records should be populate to the UI. To make things easier trying to populate another column named dup_id. Using this dup_id it is possible to filter the results from UI , which is in the JSON format.
How to create a result set similar to the one shown below?
id control code amount dup_id
1 200 12 300 1
2 400 12 300
3 200 12 300 1
4 100 10 400 4
5 100 10 400 4
6 500 13 500
This seems like a simpler solution than that suggested by #kickstarter - but maybe I've misunderstood the requirement...
SELECT x.*
, y.dup_id
FROM my_table x
LEFT
JOIN
( SELECT MIN(id) dup_id
, control
, code
, amount
FROM my_table
GROUP
BY control
, code
, amount
HAVING COUNT(*) > 1
) y
ON y.control = x.control
AND y.code = x.code
AND y.amount = x.amount;
Depending on how accurate the order has to be, you could do something like this.
This is getting all the unique control / code / amount with a count, to get a flag to know if that is a duplicate row, and ordered by control / code / amount so that they are in order. It does a cross join to initialise a few user variables.
Then it calculates a counter, only incrementing it if any of control / code / amount have changed AND it is a duplicate row. Then sets user variables to store the previous values of control / code / amount.
The outer query then orders the results back in to id order.
SELECT sub3.id,
sub3.control,
sub3.code,
sub3.amount,
sub3.dup_id
FROM
(
SELECT sub2.id,
sub2.control,
sub2.code,
sub2.amount,
#cnt:=IF(#control=control AND #code=code AND #amount=amount AND sub2.id_count IS NOT NULL, #cnt, IF(sub2.id_count IS NULL, #cnt, #cnt + 1)),
#control:=control,
#code:=code,
#amount:=amount,
IF(sub2.id_count IS NULL, NULL, #cnt) AS dup_id
FROM
(
SELECT mwt.id, mwt.control, mwt.code, mwt.amount, sub1.id_count
FROM mwt
LEFT OUTER JOIN
(
SELECT control, code, amount, COUNT(id) AS id_count
FROM mwt
GROUP BY control,code,amount
HAVING id_count > 1
) sub1
ON mwt.control = sub1.control
AND mwt.code = sub1.code
AND mwt.amount = sub1.amount
ORDER BY mwt.control, mwt.code, mwt.amount
) sub2
CROSS JOIN
(
SELECT #cnt:=0, #control:=0, #code:=0, #amount:=0
) sub0
) sub3
ORDER BY id
Note that this is ordering by control, code and amount, so not an exact match for your required output (which would require getting the first duplicates ordered by id first).
EDIT - Simpler and better way to do it. This gets all the duplicate rows with the min id for those duplicates (ordered by the min id), and uses a user variable to add a sequence number for those. Then LEFT OUTER JOINs that back against the main table to put that sequence number in all the matching rows.
SELECT mwt.id, mwt.control, mwt.code, mwt.amount, sub2.dup_id
FROM mwt
LEFT OUTER JOIN
(
SELECT sub1.id, sub1.control, sub1.code, sub1.amount, #cnt:=#cnt+1 AS dup_id
FROM
(
SELECT MIN(id) AS id, control, code, amount
FROM mwt
GROUP BY control,code,amount
HAVING COUNT(id) > 1
ORDER BY id
) sub1
CROSS JOIN
(
SELECT #cnt:=0
) sub0
) sub2
ON mwt.control = sub2.control
AND mwt.code = sub2.code
AND mwt.amount = sub2.amount
ORDER BY mwt.id
Would you need a dup_id column ?. I hope this can be achieved with a simple query like below
select id
, control
, code
, amount
from table
where control = from selected Record
and code = from selected Record
and amount = from selected Record
and id not equals from selected Record
You can very well omit the last not equals if the requirement is to list down duplicates including the selected record.

Elegant mysql to select, group, combine multiple rows from one table

Here is a simplified version of my table:
group price spec
a 1 .
a 2 ..
b 1 ...
b 2
c .
. .
. .
I'd like to produce a result like this: (I'll refer to this as result_table)
price_a |spec_a |price_b |spec_b |price_c ...|total_cost
1 |. |1 |.. |... |
(min) (min) =1+1+...
Basically I want to:
select the rows containing the min price within each group
combine columns into a single row
I know this can be done using several queries and/or combined with some non-sql processing on the results, but I suspect that there maybe better solutions.
The reason that I want to do task 2 (combine columns into a single row)
is because I want to do something like the following with the result_table:
select *,
(result_table.total_cost + table1.price + table.2.price) as total_combined_cost
from result_table
right join table1
right join table2
This may be too much to ask for, so here is some other thoughts on the problem:
Instead of trying to combine multiple rows(task 2), store them in a temporary table
(which would be easier to calculate the total_cost using sum)
Feel free to drop any thoughts, don't have to be complete answer, I feel it's brilliant enough if you have an elegant way to do task 1 !
==Edited/Added 6 Feb 2012==
The goal of my program is to identify best combinations of items with minimal cost (and preferably possess higher utilitarian value at the same time).
Consider #ypercube's comment about large number of groups, temporary table seems to be the only feasible solution. And it is also pointed out there is no pivoting function in MySQL (although it can be implemented, it's not necessary to perform such operation).
Okay, after study #Johan's answer, I'm thinking about something like this for task 1:
select * from
(
select * from
result_table
order by price asc
) as ordered_table
group by group
;
Although looks dodgy, it seems to work.
==Edited/Added 7 Feb 2012==
Since there could be more than one combination may produce the same min value, I have modified my answer :
select result_table.* from
(
select * from
(
select * from
result_table
order by price asc
) as ordered_table
group by group
) as single_min_table
inner join result_table
on result_table.group = single_min_table.group
and result_table.price = single_min_table.price
;
However, I have just realised that there is another problem I need to deal with:
I can not ignore all the spec, since there is a provider property, items from different providers may or may not be able to be assembled together, so to be safe (and to simplify my problem) I decide to combine items from the same provider only, so the problem becomes:
For example if I have an initial table like this(with only 2 groups and 2 providers):
id group price spec provider
1 a 1 . x
2 a 2 .. y
3 a 3 ... y
4 b 1 ... y
5 b 2 x
6 b 3 z
I need to combine
id group price spec provider
1 a 1 . x
5 b 2 x
and
2 a 2 .. y
4 b 1 ... y
record (id 6) can be eliminated from the choices since it dose not have all the groups available.
So it's not necessarily to select only the min of each group, rather it's to select one from each group so that for each provider I have a minimal combined cost.
You cannot pivot in MySQL, but you can group results together.
The GROUP_CONCAT function will give you a result like this:
column A column B column c column d
groups specs prices sum(price)
a,b,c some,list,xyz 1,5,7 13
Here's a sample query:
(The query assumes you have a primary (or unique) key called id defined on the target table).
SELECT
GROUP_CONCAT(a.`group`) as groups
,GROUP_CONCAT(a.spec) as specs
,GROUP_CONCAT(a.min_price) as prices
,SUM(a.min_prices) as total_of_min_prices
FROM
( SELECT price, spec, `group` FROM table1
WHERE id IN
(SELECT MIN(id) as id FROM table1 GROUP BY `group` HAVING price = MIN(price))
) AS a
See: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
Producing the total_cost only:
SELECT SUM(min_price) AS total_cost
FROM
( SELECT MIN(price) AS min_price
FROM TableX
GROUP BY `group`
) AS grp
If a result set with the minimum prices returned in row (not in column) per group is fine, then your problem is of the gretaest-n-per-group type. There are various methods to solve it. Here's one:
SELECT tg.grp
tm.price AS min_price
tm.spec
FROM
( SELECT DISTINCT `group` AS grp
FROM TableX
) AS tg
JOIN
TableX AS tm
ON
tm.PK = --- the Primary Key of the table
( SELECT tmin.PK
FROM TableX AS tmin
WHERE tmin.`group` = tg.grp
ORDER BY tmin.price ASC
LIMIT 1
)