Detecting near duplicates above a threshold - mysql

I want to be able to query a table for records that I suspect are near duplicates.
I've racked my brains but can't think where to begin with this one, so I've simplified the problem as much as possible and come to ask here.
Here's my simplified table:
CREATE TABLE sales
(
`id1` int auto_increment primary key,
`amount` decimal(6,2),
`date` datetime
);
Here are some test values:
INSERT INTO sales
(`amount`, `date`)
VALUES
(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(3, '2013-05-15T12:12:00'),
(4, '2013-05-15T12:12:12'),
(45, '2013-05-15T12:22:00'),
(4, '2013-05-15T12:24:00'),
(8, '2013-05-15T13:00:00'),
(9, '2013-05-15T13:01:00'),
(10, '2013-05-15T14:00:00');
The problem
I want to return sales with an amount above Y that have neighbouring sales, also above Y, recorded within X minutes of them.
i.e., from this data:
amt, date
(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(3, '2013-05-15T12:12:00'),
(4, '2013-05-15T12:12:12'),
(45, '2013-05-15T12:22:00'),
(4, '2013-05-15T12:24:00'),
(8, '2013-05-15T13:00:00'),
(9, '2013-05-15T13:01:00'),
(10, '2013-05-15T14:00:00');
where #yVal = 5 and #xMins = 10
expected result would be:
(10, '2013-05-15T11:11:00'),
(11, '2013-05-15T11:11:11'),
(20, '2013-05-15T11:22:00'),
(8, '2013-05-15T13:00:00'),
(9, '2013-05-15T13:01:00'),
I've put the above into a fiddle: http://sqlfiddle.com/#!2/cf8fe
Any help will be greatly appreciated!

Try something like this:
SELECT DISTINCT s1.* FROM sales s1
LEFT JOIN sales s2
ON (s1.id1 != s2.id1
AND s1.amount >= s2.amount - #xVal AND s1.amount <= s2.amount + #xVal
AND s1.date >= DATE_SUB(s2.date, INTERVAL #xMins minute) AND s1.date <= DATE_ADD(s2.date, INTERVAL #xMins minute)
)
WHERE
s2.id1 is not null
Edit: fixed some errors.
The result for your data looks like:
+-----+--------+---------------------+
| id1 | amount | date |
+-----+--------+---------------------+
| 1 | 10.00 | 2013-05-15 11:11:00 |
| 2 | 11.00 | 2013-05-15 11:11:11 |
| 4 | 3.00 | 2013-05-15 12:12:00 |
| 5 | 4.00 | 2013-05-15 12:12:12 |
| 8 | 8.00 | 2013-05-15 13:00:00 |
| 9 | 9.00 | 2013-05-15 13:01:00 |
+-----+--------+---------------------+
Edit 2: if both sales have to be at or above the threshold (#yVal in the question), use:
SELECT DISTINCT s1.* FROM sales s1
LEFT JOIN sales s2
ON (s1.id1 != s2.id1
AND s2.amount >= #yVal
AND s1.date >= DATE_SUB(s2.date, INTERVAL #xMins minute) AND s1.date <= DATE_ADD(s2.date, INTERVAL #xMins minute)
)
WHERE
s2.id1 is not null
AND s1.amount >= #yVal
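One way to sanity-check the threshold query is to replay it against the question's sample rows in an in-memory database. The sketch below is a stand-in under stated assumptions: it uses Python's sqlite3 module instead of MySQL, so DATE_SUB/DATE_ADD are replaced by a comparison of Unix timestamps, and the function name near_duplicates is made up for illustration. It also shows that, with a 10-minute window, the 11:22:00 sale is not returned (its nearest neighbour is 10 minutes 49 seconds away); the question's expected output corresponds to an 11-minute window.

```python
import sqlite3

# Sample rows from the question: (amount, date)
ROWS = [
    (10, "2013-05-15T11:11:00"),
    (11, "2013-05-15T11:11:11"),
    (20, "2013-05-15T11:22:00"),
    (3, "2013-05-15T12:12:00"),
    (4, "2013-05-15T12:12:12"),
    (45, "2013-05-15T12:22:00"),
    (4, "2013-05-15T12:24:00"),
    (8, "2013-05-15T13:00:00"),
    (9, "2013-05-15T13:01:00"),
    (10, "2013-05-15T14:00:00"),
]


def near_duplicates(y_val, x_mins):
    """Ids of sales >= y_val that have another sale >= y_val within x_mins minutes."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE sales (id1 INTEGER PRIMARY KEY AUTOINCREMENT, amount REAL, date TEXT)"
    )
    con.executemany("INSERT INTO sales (amount, date) VALUES (?, ?)", ROWS)
    query = """
        SELECT DISTINCT s1.id1
        FROM sales s1
        JOIN sales s2
          ON s1.id1 != s2.id1
         AND s2.amount >= :y
         -- SQLite substitute for DATE_SUB/DATE_ADD: compare Unix timestamps
         AND ABS(CAST(strftime('%s', s1.date) AS INTEGER)
               - CAST(strftime('%s', s2.date) AS INTEGER)) <= :x * 60
        WHERE s1.amount >= :y
        ORDER BY s1.id1
    """
    ids = [row[0] for row in con.execute(query, {"y": y_val, "x": x_mins})]
    con.close()
    return ids


print(near_duplicates(5, 10))  # [1, 2, 8, 9]
print(near_duplicates(5, 11))  # [1, 2, 3, 8, 9]
```

An inner join is used here because the LEFT JOIN plus "s2.id1 is not null" filter in the answer keeps exactly the rows that have a match.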

Related

Using time interval in table for select in another

I am using a MySQL data base with 2 tables:
In one table I have BatchNum and Time_Stamp. In another I have ErrorCode and Time_Stamp.
My goal is to use timestamps in one table as the beginning and end of an interval within which I'd like to select in another table. I would like to select the beginning and end of intervals within which the BatchNum is constant.
CREATE TABLE Batch (BatchNum INT, Time_Stamp DATETIME);
INSERT INTO Batch VALUES (1,'2020-12-17 07:29:36'), (1, '2020-12-17 08:31:56'), (1, '2020-12-17 08:41:56'), (2, '2020-12-17 09:31:13'), (2, '2020-12-17 10:00:00'), (2, '2020-12-17 10:00:57'), (2, '2020-12-17 10:01:57'), (3, '2020-12-17 10:47:59'), (3, '2020-12-17 10:48:59'), (3, '2020-12-17 10:50:59');
CREATE TABLE Errors (ErrorCode INT, Time_Stamp DATETIME);
INSERT INTO Errors VALUES (10, '2020-12-17 07:29:35'), (11, '2020-12-17 07:30:00'), (12, '2020-12-17 07:30:35'), (10, '2020-12-17 07:30:40'), (22, '2020-12-17 10:01:45'), (23, '2020-12-17 10:48:00');
In my example below, I would like something like SELECT BatchNum, ErrorCode, Errors.Time_Stamp WHERE Errors.Time_Stamp BETWEEN beginning_of_batch AND end_of_batch:
+----------+-----------+---------------------+
| BatchNum | ErrorCode | Errors.Time_Stamp |
+----------+-----------+---------------------+
| 1 | 11 | 2020-12-17 07:30:00 |
| 1 | 12 | 2020-12-17 07:30:35 |
| 1 | 10 | 2020-12-17 07:30:40 |
| 2 | 22 | 2020-12-17 10:01:45 |
| 3 | 23 | 2020-12-17 10:48:00 |
+----------+-----------+---------------------+
I am using an answer from a previous question, "Select on value change", to find BatchNum changes, but I don't know how to include this in another select to get the ErrorCodes happening within the interval defined by BatchNum changes.

I think you want the following (LEAD() is a window function, so this needs MySQL 8.0 or later):
select b.*, e.ErrorCode, e.time_stamp as error_timestamp
from (
select b.*,
lead(time_stamp) over(order by time_stamp) lead_time_stamp
from batch b
) b
inner join errors e
on e.time_stamp >= b.time_stamp
and (e.time_stamp < b.lead_time_stamp or b.lead_time_stamp is null)
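If window functions are not available, another reading of the requirement is to collapse each batch to its first and last timestamp and join the errors against that range. The following is only a sketch under assumptions: it runs on SQLite through Python's sqlite3 module rather than MySQL, it assumes a BatchNum never reappears once a different batch has started, and errors_per_batch is a hypothetical helper name.

```python
import sqlite3


def errors_per_batch():
    """Return (BatchNum, ErrorCode, Time_Stamp) for errors inside each batch's time span."""
    con = sqlite3.connect(":memory:")
    # Sample data copied from the question
    con.executescript("""
        CREATE TABLE Batch (BatchNum INT, Time_Stamp TEXT);
        INSERT INTO Batch VALUES
            (1, '2020-12-17 07:29:36'), (1, '2020-12-17 08:31:56'),
            (1, '2020-12-17 08:41:56'), (2, '2020-12-17 09:31:13'),
            (2, '2020-12-17 10:00:00'), (2, '2020-12-17 10:00:57'),
            (2, '2020-12-17 10:01:57'), (3, '2020-12-17 10:47:59'),
            (3, '2020-12-17 10:48:59'), (3, '2020-12-17 10:50:59');
        CREATE TABLE Errors (ErrorCode INT, Time_Stamp TEXT);
        INSERT INTO Errors VALUES
            (10, '2020-12-17 07:29:35'), (11, '2020-12-17 07:30:00'),
            (12, '2020-12-17 07:30:35'), (10, '2020-12-17 07:30:40'),
            (22, '2020-12-17 10:01:45'), (23, '2020-12-17 10:48:00');
    """)
    rows = con.execute("""
        SELECT b.BatchNum, e.ErrorCode, e.Time_Stamp
        FROM (
            -- one row per batch: its first and last timestamp
            SELECT BatchNum,
                   MIN(Time_Stamp) AS batch_start,
                   MAX(Time_Stamp) AS batch_end
            FROM Batch
            GROUP BY BatchNum
        ) b
        JOIN Errors e
          ON e.Time_Stamp BETWEEN b.batch_start AND b.batch_end
        ORDER BY e.Time_Stamp
    """).fetchall()
    con.close()
    return rows


for row in errors_per_batch():
    print(row)
```

On the sample data this yields the five rows listed in the question's expected output; the error at 07:29:35 is dropped because it precedes the first batch timestamp.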

using "OR" and "HAVING SUM" Mysql 5.7

This is my fiddle: https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=71b1fd5d8e222ab1c51ace8d1af4c94f
CREATE TABLE order_match(ID int(10) NOT NULL PRIMARY KEY AUTO_INCREMENT,
quantity decimal(10,2), createdAt date NOT NULL, order_status_id int(10) NOT NULL,
createdby int(11), code_order varchar(20) NOT NULL);
insert into order_match values
(1, 0.2, '2020-02-02', 6, 01, 0001),
(2, 1, '2020-02-03', 7, 02, 0002),
(3, 1.3, '2020-02-04', 7, 03, 0003),
(4, 1.4, '2020-02-08', 5, 08, 0004),
(5, 1.2, '2020-02-05', 8, 04, 0005),
(6, 1.4, '2020-03-01', 8, 05, 0006),
(7, 0.23, '2020-01-01', 8, 03, 0007),
(8, 2.3, '2020-02-07', 8, 04, 0009);
and then this is my table
select order_status_id, createdby, createdAt from order_match;
+-----------------+-----------+------------+
| order_status_id | createdby | createdAt |
+-----------------+-----------+------------+
| 6 | 1 | 2020-02-02 |
| 7 | 2 | 2020-02-03 |
| 7 | 3 | 2020-02-04 |
| 5 | 8 | 2020-02-08 |
| 8 | 4 | 2020-02-05 |
| 8 | 5 | 2020-03-01 |
| 8 | 3 | 2020-01-01 |
+-----------------+-----------+------------+
order_status_id is the status of the transaction: "7" means the transaction was not approved, and any other value means it was approved. createdby is the id of the user who made the transaction, and createdAt is the date the transaction happened.
I want to find the repeat users who made a transaction between '2020-02-01' and '2020-02-28'. Repeat users are users who made an approved transaction before '2020-02-28' and made at least 1 more approved transaction within the date range ('2020-02-01' to '2020-02-28').
Based on that explanation, I used this query:
SELECT s1.createdby
FROM order_match s1
WHERE s1.order_status_Id in (4, 5, 6, 8)
GROUP BY s1.createdby
HAVING SUM(s1.createdAt BETWEEN '2020-02-01' AND '2020-02-28')
AND SUM(s1.createdAt <= '2020-02-28') > 1
OR exists (select 1 from order_match s1 where
s1.createdAt < '2020-02-01'
and s1.order_status_id in (4, 5, 6, 8));
from that query, the result was this :
+-----------+
| createdby |
+-----------+
| 1 |
| 3 |
| 4 |
| 5 |
| 8 |
+-----------+
and the expected result, based on the data and the explanation, was this:
+-----------+
| createdby |
+-----------+
| null |
+-----------+
because no users fit the "repeat users" condition. Where am I going wrong?

It looks like you need:
SELECT createdby
FROM order_match
-- select rows in the specified date range
WHERE createdAt BETWEEN '2020-02-01' AND '2020-02-28'
GROUP BY createdby
-- check that the user has more than one transaction whose status is not "non-approved"
HAVING SUM(order_status_id != 7) > 1 -- or SUM(order_status_id in (4, 5, 6, 8)) > 1
Comment from the asker: that's why I used OR EXISTS, to check the users before '2020-02-01'.
Sorry, I misunderstood the task. Try this instead:
SELECT createdby
FROM order_match
GROUP BY createdby
-- check that the user has more than one transaction whose status is not "non-approved"
HAVING SUM(order_status_id != 7) > 1
-- and at least one of them is in the specified date range
AND SUM(order_status_id != 7 AND createdAt BETWEEN '2020-02-01' AND '2020-02-28')
As for "where my wrong at?":
In WHERE ... IN: this condition is TRUE for every createdby who has at least one approved transaction, because that transaction satisfies the condition by itself.
Additionally, s1.createdAt BETWEEN '2020-02-01' AND '2020-02-28' overlaps s1.createdAt <= '2020-02-28', so the second condition is redundant (if the first is true, the second is true too).
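The revised query can be replayed against the sample rows to see what it returns. This is a sketch, not a MySQL run: it uses SQLite via Python's sqlite3 module (where comparisons also evaluate to 0 or 1, so SUM(condition) counts matching rows), only the columns the query needs are loaded, and repeat_users is a name invented here. One thing it surfaces: the INSERT statement in the question has eight rows while the table printed below it shows seven, and the eighth row gives user 4 a second approved transaction in February.

```python
import sqlite3

# (ID, createdAt, order_status_id, createdby), taken from the question's INSERT
ORDERS = [
    (1, "2020-02-02", 6, 1),
    (2, "2020-02-03", 7, 2),
    (3, "2020-02-04", 7, 3),
    (4, "2020-02-08", 5, 8),
    (5, "2020-02-05", 8, 4),
    (6, "2020-03-01", 8, 5),
    (7, "2020-01-01", 8, 3),
    (8, "2020-02-07", 8, 4),
]


def repeat_users(rows):
    """Users with more than one approved transaction, at least one of them in February 2020."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE order_match (ID INT, createdAt TEXT, order_status_id INT, createdby INT)"
    )
    con.executemany("INSERT INTO order_match VALUES (?, ?, ?, ?)", rows)
    result = con.execute("""
        SELECT createdby
        FROM order_match
        GROUP BY createdby
        HAVING SUM(order_status_id != 7) > 1
           AND SUM((order_status_id != 7)
                   AND (createdAt BETWEEN '2020-02-01' AND '2020-02-28')) > 0
        ORDER BY createdby
    """).fetchall()
    con.close()
    return [r[0] for r in result]


print(repeat_users(ORDERS[:7]))  # the seven rows printed in the question: []
print(repeat_users(ORDERS))      # all eight inserted rows: [4]
```

With the seven printed rows the result is empty, which agrees with the asker's expectation; with all eight inserted rows user 4 qualifies.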

find out time difference for every user in condition mysql 5.7

This is my fiddle: https://dbfiddle.uk/?rdbms=mysql_5.7&fiddle=7c549a3de0c8002ec43381462ba6a801
Let's assume I have data like this:
CREATE TABLE test (
ID INT,
user_id INT,
createdAt DATE,
status_id INT
);
INSERT INTO test VALUES
(1, 12, '2020-01-01', 4),
(2, 12, '2020-01-03', 7),
(3, 12, '2020-01-06', 7),
(4, 13, '2020-01-02', 5),
(5, 13, '2020-01-03', 6),
(6, 14, '2020-03-03', 8),
(7, 13, '2020-03-04', 4),
(8, 15, '2020-04-04', 7),
(9, 14, '2020-03-02', 6),
(10, 14, '2020-03-10', 5),
(11, 13, '2020-04-10', 8);
select * from test
order by createdAt;
and this is the table after doing select (*)
+----+---------+------------+-----------+
| ID | user_id | createdAt | status_id |
+----+---------+------------+-----------+
| 1 | 12 | 2020-01-01 | 4 |
| 4 | 13 | 2020-01-02 | 5 |
| 2 | 12 | 2020-01-03 | 7 |
| 5 | 13 | 2020-01-03 | 6 |
| 3 | 12 | 2020-01-06 | 7 |
| 9 | 14 | 2020-03-02 | 6 |
| 6 | 14 | 2020-03-03 | 8 |
| 7 | 13 | 2020-03-04 | 4 |
| 10 | 14 | 2020-03-10 | 5 |
| 8 | 15 | 2020-04-04 | 7 |
| 11 | 13 | 2020-04-10 | 8 |
+----+---------+------------+-----------+
ID is the id of the transaction, user_id is the id of the user who made the transaction, createdAt is the date the transaction happened, and status_id is the status of the transaction (if status_id is 7, the transaction was denied, i.e. not approved).
In this case, I want to find the time difference between approved transactions for every repeat user in the time range '2020-02-01' to '2020-04-01'. Repeat users are users who made a transaction before the end of the time range and made at least 1 more transaction within the time range; here, that means users who made an approved transaction before '2020-04-01' and at least 1 more approved transaction between '2020-02-01' and '2020-04-01'.
Based on that explanation, I used this query:
SELECT SUM(transactions) AS transactions,
MIN(`MIN`) AS `MIN`,
MAX(`MAX`) AS `MAX`,
SUM(total) / SUM(transactions) AS `AVG`
FROM (
SELECT user_id,
COUNT(*) AS transactions,
MIN(diff) AS `MIN`,
MAX(diff) AS `MAX`,
SUM(diff) AS total
FROM (
SELECT user_id, DATEDIFF((SELECT MIN(t2.createdAt)
FROM test t2
WHERE t2.user_id = t1.user_id
AND t1.createdAt < t2.createdAt
AND t2.status_id in (4, 5, 6, 8)
), t1.createdAt) AS diff
FROM test t1
WHERE status_id in (4, 5, 6, 8)
HAVING SUM(status_id != 7 and createdAt < '2020-04-01') > 1
AND SUM(status_id != 7 AND createdAt BETWEEN '2020-02-01'
AND '2020-04-01')
) DiffTable
WHERE diff IS NOT NULL
GROUP BY user_id
) totals
and it says
In aggregated query without GROUP BY, expression #1 of SELECT list contains nonaggregated column 'db_314931870.t1.user_id'; this is incompatible with sql_mode=only_full_group_by
expected results
+-----+-----+---------+
| MIN | MAX | AVG |
+-----+-----+---------+
| 1 | 61 | 21,6667 |
+-----+-----+---------+
Explanation: the minimum is a 1-day difference, for user_id 14, who made an approved transaction on '2020-03-02' and another on '2020-03-03'. The maximum is a 61-day difference, for user_id 13, who made an approved transaction on '2020-01-03' and another on '2020-03-04'. The average is the sum of all time differences in the time range divided by the number of transactions in the time range.

SELECT MIN(DATEDIFF(t2.createdAt, t1.createdAt)) min_diff,
MAX(DATEDIFF(t2.createdAt, t1.createdAt)) max_diff,
AVG(DATEDIFF(t2.createdAt, t1.createdAt)) avg_diff
FROM test t1
JOIN test t2 ON t1.user_id = t2.user_id
AND t1.createdAt < t2.createdAt
AND 7 NOT IN (t1.status_id, t2.status_id)
JOIN (SELECT t3.user_id
FROM test t3
WHERE t3.status_id != 7
GROUP BY t3.user_id
HAVING SUM(t3.createdAt < '2020-04-01')
AND SUM(t3.createdAt BETWEEN '2020-02-01' AND '2020-04-01')) t4 ON t1.user_id = t4.user_id
WHERE NOT EXISTS (SELECT NULL
FROM test t5
WHERE t1.user_id = t5.user_id
AND t5.status_id != 7
AND t1.createdAt < t5.createdAt
AND t5.createdAt < t2.createdAt)
fiddle with short explanations.

Filter out rows with date range gaps in mysql

In mysql I have a table similar to the following one:
--------------------------------------------
| id | parent_id | date_start | date_end |
--------------------------------------------
| 1 | | 2017-05-01 | 2017-05-10 |
| 2 | 1 | 2017-05-01 | 2017-05-10 |
| 3 | | 2017-06-01 | 2017-06-10 |
| 4 | 3 | 2017-06-01 | 2017-06-03 |
| 5 | 3 | 2017-06-04 | 2017-06-06 |
| 6 | 3 | 2017-06-07 | 2017-06-10 |
| 7 | | 2017-07-01 | 2017-07-10 |
| 8 | 7 | 2017-07-01 | 2017-07-03 |
| 9 | 7 | 2017-07-04 | 2017-07-06 |
| 10 | 7 | 2017-07-08 | 2017-07-10 |
Rows without a parent id are "pricelists", while rows with a parent id are pricelist periods.
I want to filter out pricelist ids with periods that have time gaps, so ideally my query should return 1 and 3.
So far I've written a simple query which correctly returns 3:
SELECT distinct period1.parent_id
FROM pricelist period1
INNER JOIN pricelist period2
ON period1.parent_id = period2.parent_id
AND period2.date_start = DATE_ADD(period1.date_end,INTERVAL 1 DAY);
but unfortunately it doesn't take into account pricelists with a single period, which have no gaps by definition!
So I was wondering if it could be possible to modify such a query to return pricelists with either single periods or multiple periods without time gaps, possibly without a UNION.

I had difficulty finding a MySQL workspace, so initially I presented a T-SQL solution, but I have since learned how to use MySQL at rextester, which has helped a lot. MySQL's DATEDIFF() takes its arguments in the opposite order to T-SQL's, which was hard to get right without being able to test it. Hopefully that is now resolved.
The basic logic of this approach is to calculate the overall duration of each parent, then calculate the sum of the durations of all its children, then compare these durations (in a join). If they are the same, there are no gaps.
Note this logic isn't tested for overlaps in children, but I would expect these to also fail at the join where durations are compared.
Data:
#drop table if exists `PRICELIST`;
create table `PRICELIST`
(`id` int, `parent_id` int, `date_start` datetime, `date_end` datetime)
;
INSERT INTO PRICELIST
(`id`, `parent_id`, `date_start`, `date_end`)
VALUES
(1, NULL, '2017-05-01 00:00:00', '2017-05-10 00:00:00'),
(2, 1, '2017-05-01 00:00:00', '2017-05-10 00:00:00'),
(3, NULL, '2017-06-01 00:00:00', '2017-06-10 00:00:00'),
(4, 3, '2017-06-01 00:00:00', '2017-06-03 00:00:00'),
(5, 3, '2017-06-04 00:00:00', '2017-06-06 00:00:00'),
(6, 3, '2017-06-07 00:00:00', '2017-06-10 00:00:00'),
(7, NULL, '2017-07-01 00:00:00', '2017-07-10 00:00:00'),
(8, 7, '2017-07-01 00:00:00', '2017-07-03 00:00:00'),
(9, 7, '2017-07-04 00:00:00', '2017-07-06 00:00:00'),
(10, 7, '2017-07-08 00:00:00', '2017-07-10 00:00:00')
;
MySQL:
select p.id, datediff(p.date_end,p.date_start) pdu, x.du
from pricelist p
inner join (
select
p1.parent_id, sum(datediff(p1.date_end,p1.date_start) + coalesce(datediff(p2.date_start,p1.date_end),0)) du
from (
select parent_id,date_start,date_end
from pricelist
where parent_id IS NOT NULL
) p1
left join (
select parent_id,date_start,date_end
from pricelist
where parent_id IS NOT NULL
) p2 on p1.parent_id = p2.parent_id and p1.date_end < p2.date_start
where datediff(p2.date_start,p1.date_end) = 1 or p2.parent_id is null
group by p1.parent_id
) x on p.id = x.parent_id and datediff(p.date_end,p.date_start) = x.du
where p.parent_id IS NULL
;
see it working at: http://rextester.com/JRD47056
T-SQL:
Whilst I could not find a MySQL environment that worked, I used MSSQL but avoided any analytic functions etc that can't be used in MySQL. However I have relied on datediff() which is slightly different to the same function in MySQL.
TSQL query
select p.id, datediff(day,p.date_start,p.date_end) du
from #pricelist p
inner join (
select
p1.parent_id, sum(datediff(day,p1.date_start,p1.date_end) + coalesce(datediff(day,p1.date_end,p2.date_start),0)) du
from (
select parent_id,date_start,date_end
from #pricelist
where parent_id IS NOT NULL
) p1
left join (
select parent_id,date_start,date_end
from #pricelist
where parent_id IS NOT NULL
) p2 on p1.parent_id = p2.parent_id and p1.date_end < p2.date_start
where datediff(day,p1.date_end,p2.date_start) = 1 or p2.parent_id is null
group by p1.parent_id
) x on p.id = x.parent_id and datediff(day,p.date_start,p.date_end) = x.du
where p.parent_id IS NULL
see it working at: http://rextester.com/KUK2410
In MySQL, datediff() expects just 2 parameters and you need to swap the field references (i.e. latest date first).
Perhaps there are easier ways. Best I could come up with for now.
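The duration comparison can also be written more compactly if each period is counted in inclusive days: a pricelist without gaps then has child periods whose day counts add up exactly to the parent's day count. The sketch below assumes child periods never overlap and never extend outside the parent, which the question does not guarantee. It runs on SQLite through Python's sqlite3 module (julianday() stands in for MySQL's DATEDIFF()), and gapless_pricelists is a hypothetical name.

```python
import sqlite3

# (id, parent_id, date_start, date_end), taken from the question
PRICELIST = [
    (1, None, "2017-05-01", "2017-05-10"),
    (2, 1, "2017-05-01", "2017-05-10"),
    (3, None, "2017-06-01", "2017-06-10"),
    (4, 3, "2017-06-01", "2017-06-03"),
    (5, 3, "2017-06-04", "2017-06-06"),
    (6, 3, "2017-06-07", "2017-06-10"),
    (7, None, "2017-07-01", "2017-07-10"),
    (8, 7, "2017-07-01", "2017-07-03"),
    (9, 7, "2017-07-04", "2017-07-06"),
    (10, 7, "2017-07-08", "2017-07-10"),
]


def gapless_pricelists():
    """Ids of parent pricelists whose child periods cover every day of the parent."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE pricelist (id INT, parent_id INT, date_start TEXT, date_end TEXT)"
    )
    con.executemany("INSERT INTO pricelist VALUES (?, ?, ?, ?)", PRICELIST)
    rows = con.execute("""
        SELECT p.id
        FROM pricelist p
        JOIN pricelist c ON c.parent_id = p.id
        WHERE p.parent_id IS NULL
        GROUP BY p.id, p.date_start, p.date_end
        -- inclusive day count of all children = inclusive day count of the parent
        HAVING SUM(CAST(ROUND(julianday(c.date_end) - julianday(c.date_start)) AS INTEGER) + 1)
             = CAST(ROUND(julianday(p.date_end) - julianday(p.date_start)) AS INTEGER) + 1
        ORDER BY p.id
    """).fetchall()
    con.close()
    return [r[0] for r in rows]


print(gapless_pricelists())  # [1, 3]
```

A single-period pricelist passes automatically when its one period spans the whole parent, so no UNION is needed for that case.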

mysql date list with count even if no data on specific date [duplicate]

Possible duplicate: MySQL how to fill missing dates in range?
I'm trying to make a graph from MySQL data:
Postid | date | text
1 | 2012-01-01 | bla
2 | 2012-01-01 | bla
3 | 2012-01-02 | bla
4 | 2012-01-02 | bla
5 | 2012-01-04 | bla
6 | 2012-01-04 | bla
7 | 2012-01-05 | bla
Now, I'd like to get the number of posts on every day, INCLUDING dates with zero. For example, I'd like to be able to get the first week like this:
date | count(Postid)
2012-01-01 | 2
2012-01-02 | 2
2012-01-03 | 0
2012-01-04 | 2
2012-01-05 | 1
2012-01-06 | 0
2012-01-07 | 0
I'm looking for a generic solution, where I don't have to specify every date. Any suggestions?

Assuming your postids are contiguous in your table, then this query:
SELECT
DATE_FORMAT(b.date, '%Y-%m-%d') date,
COUNT(c.postid)
FROM
(
SELECT
(SELECT MAX(date) FROM ex) + INTERVAL 3 DAY - INTERVAL a.postid DAY AS date
FROM
ex a
) b
LEFT JOIN
ex c ON c.date = b.date
GROUP BY
b.date
ORDER BY
b.date
produces:
date COUNT(c.postid)
2012-01-01 2
2012-01-02 2
2012-01-03 0
2012-01-04 2
2012-01-05 1
2012-01-06 0
2012-01-07 0
See http://sqlfiddle.com/#!2/2cf8d/2
If your postids are not contiguous, then you can use an ids table of contiguous ids:
SELECT
DATE_FORMAT(b.date, '%Y-%m-%d') date,
COUNT(c.postid)
FROM
(
SELECT
(SELECT MAX(date) FROM ex) + INTERVAL 3 DAY - INTERVAL a.id DAY AS date
FROM
ids a
) b
LEFT JOIN
ex c ON c.date = b.date
GROUP BY
b.date
ORDER BY
b.date DESC
LIMIT 7
See http://sqlfiddle.com/#!2/13035/1

In MySQL, I would suggest creating a Calendar table with the dates listed. Then you join on that table, similar to this:
CREATE TABLE Table1(`Postid` int, `date` datetime, `text` varchar(3));
INSERT INTO Table1(`Postid`, `date`, `text`)
VALUES
(1, '2011-12-31 17:00:00', 'bla'),
(2, '2011-12-31 17:00:00', 'bla'),
(3, '2012-01-01 17:00:00', 'bla'),
(4, '2012-01-01 17:00:00', 'bla'),
(5, '2012-01-03 17:00:00', 'bla'),
(6, '2012-01-03 17:00:00', 'bla'),
(7, '2012-01-04 17:00:00', 'bla');
CREATE TABLE Table2(`date` datetime);
INSERT INTO Table2(`date`)
VALUES('2011-12-31 17:00:00'),
('2012-01-01 17:00:00'),
('2012-01-02 17:00:00'),
('2012-01-03 17:00:00'),
('2012-01-04 17:00:00'),
('2012-01-05 17:00:00'),
('2012-01-06 17:00:00'),
('2012-01-07 17:00:00'),
('2012-01-08 17:00:00');
select t2.date, count(postid)
from table2 t2
left join table1 t1
on t2.date = t1.date
group by t2.date
See SQL Fiddle with Demo
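Where recursive common table expressions are available (MySQL 8.0 added WITH RECURSIVE), the calendar can be generated on the fly instead of being stored. A minimal sketch of that idea follows, run here on SQLite through Python's sqlite3 module rather than MySQL. The table name ex follows the first answer and the rows are the ones from the question; posts_per_day and its parameters are illustrative names.

```python
import sqlite3

# (Postid, date, text), taken from the question
POSTS = [
    (1, "2012-01-01", "bla"),
    (2, "2012-01-01", "bla"),
    (3, "2012-01-02", "bla"),
    (4, "2012-01-02", "bla"),
    (5, "2012-01-04", "bla"),
    (6, "2012-01-04", "bla"),
    (7, "2012-01-05", "bla"),
]


def posts_per_day(first_day, last_day):
    """Count posts for every day from first_day to last_day, including days with none."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE ex (Postid INT, date TEXT, text TEXT)")
    con.executemany("INSERT INTO ex VALUES (?, ?, ?)", POSTS)
    rows = con.execute("""
        WITH RECURSIVE days(d) AS (
            SELECT :first
            UNION ALL
            -- SQLite date arithmetic; MySQL would use d + INTERVAL 1 DAY
            SELECT date(d, '+1 day') FROM days WHERE d < :last
        )
        SELECT days.d, COUNT(ex.Postid)
        FROM days
        LEFT JOIN ex ON ex.date = days.d
        GROUP BY days.d
        ORDER BY days.d
    """, {"first": first_day, "last": last_day}).fetchall()
    con.close()
    return rows


for day, count in posts_per_day("2012-01-01", "2012-01-07"):
    print(day, count)
```

COUNT(ex.Postid) rather than COUNT(*) is what makes the empty days come out as 0, because the LEFT JOIN leaves Postid NULL on days without posts.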
