Identifying Duplicate Transactions in SQL - mysql

Recently, because of an issue, multiple duplicate transactions were inserted into the database at different times. I need to find those duplicate transactions and remove them.
I tried grouping by member and day:
select count(*),
member_id,
TRUNC(created, 'DDD')
from TXN
where created > TO_DATE('06/01/2019 00:00:00', 'MM/DD/YYYY HH24:MI:SS')
group by member_id,
TRUNC(created, 'DDD')
having count(*) > 2;
I need all the transactions that were created within 10 minutes of each other for the same member.
Examples:
MEMBER_ID ROW_ID ORG DEST Created
1-FRGD 1-FGTH YFG DFG 10-01-2019 00:00:00:00
1-FRGD 1-TYHG THU SEF 10-01-2019 00:00:09:12
1-FGHR 1-FTGH TGH DRF 10-01-2019 00:01:03:25
In this example, I need the first two transactions as output, because they have the same member number and were created no more than 10 minutes apart.

You may want a self-join:
select a.Member_Id as Member_Id,
a.Row_Id as Row_Id,
a.Org as Org,
a.Dest as Dest ,
a.Created as Created,
b.Row_Id as Duplicate_Row_Id,
b.Org as Duplicate_Org,
b.Dest as Duplicate_Dest,
b.Created as Duplicate_Created
from TXN a inner join
TXN b on a.Member_Id = b.Member_Id and
a.Created < b.Created and
TIMESTAMPDIFF(SECOND, a.Created, b.Created) <= 600
order by a.Member_Id
For each record in TXN, this returns its duplicate(s).
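To try the self-join idea without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. The ISO-formatted timestamps are my assumption about the sample data (I read the second row as 9 minutes 12 seconds after the first), and SQLite's strftime('%s', ...) takes the place of MySQL's TIMESTAMPDIFF.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE txn (member_id TEXT, row_id TEXT, org TEXT, dest TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?, ?, ?)",
    [
        ("1-FRGD", "1-FGTH", "YFG", "DFG", "2019-10-01 00:00:00"),
        ("1-FRGD", "1-TYHG", "THU", "SEF", "2019-10-01 00:09:12"),
        ("1-FGHR", "1-FTGH", "TGH", "DRF", "2019-10-01 01:03:25"),
    ],
)

# Pair each transaction with any later one for the same member
# that was created at most 600 seconds (10 minutes) afterwards.
pairs = conn.execute(
    """
    SELECT a.member_id, a.row_id, b.row_id AS duplicate_row_id
    FROM txn a
    JOIN txn b
      ON a.member_id = b.member_id
     AND a.created < b.created
     AND CAST(strftime('%s', b.created) AS INTEGER)
         - CAST(strftime('%s', a.created) AS INTEGER) <= 600
    ORDER BY a.member_id, a.created
    """
).fetchall()

print(pairs)
```

With this sample data only the two 1-FRGD rows are paired; the third row belongs to a different member.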

If you want to delete these transactions:
delete tnext
from txn tnext join
txn t
on tnext.member_id = t.member_id and
tnext.created > t.created and
tnext.created < t.created + interval 10 minute
where t.created > '2019-06-01';
Be sure you back up the table and test the logic using select before running this on your actual table.
If you simply want to select transactions without the duplicates, I would recommend not exists:
select t.*
from txn t
where not exists (select 1
from txn tprev
where tprev.member_id = t.member_id and
tprev.created < t.created and
tprev.created > t.created - interval 10 minute
) and
t.created >= '2019-06-01';
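One thing worth checking before deleting is how the 10-minute rule behaves on a chain of transactions. The sketch below runs the not exists logic with Python's sqlite3 module as a stand-in for MySQL; the member "A" and its timestamps are made up for illustration, and datetime(t.created, '-10 minutes') replaces MySQL's interval arithmetic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (member_id TEXT, created TEXT)")
# Hypothetical data: three transactions 8 minutes apart, then one much later.
conn.executemany(
    "INSERT INTO txn VALUES (?, ?)",
    [
        ("A", "2019-10-01 00:00:00"),
        ("A", "2019-10-01 00:08:00"),
        ("A", "2019-10-01 00:16:00"),
        ("A", "2019-10-01 01:00:00"),
    ],
)

# Keep a transaction only if the same member has no earlier transaction
# within the preceding 10 minutes.
kept = conn.execute(
    """
    SELECT t.member_id, t.created
    FROM txn t
    WHERE NOT EXISTS (
        SELECT 1
        FROM txn tprev
        WHERE tprev.member_id = t.member_id
          AND tprev.created < t.created
          AND tprev.created > datetime(t.created, '-10 minutes')
    )
    ORDER BY t.created
    """
).fetchall()

print(kept)
```

Note that the 00:16 row is dropped even though it is 16 minutes after the first kept row, because it is within 10 minutes of the 00:08 row. If that chaining is not what you want, the condition needs to be adjusted.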

Related

how to find duplicate records in a table within a predefined time period in sql

For example, i have following table
Mobile number Timestamp
123456 17-09-2015 11:30
455677 17-09-2015 12:15
123456 17-09-2015 12:25
453377 17-09-2015 13:15
If now is 11:30, I want to scan my table and find rows with the same numbers within the past 1 hour.
That's my SQL statement:
select a.number, a.time
from mytable a inner join
(select number, time
from mytable b
where time >= now() - Interval 1 hour and time <= now()
group by number
Having count(*) > 1
) b
on a.number = b.number and a.time = b.time
I want to find duplicate rows with the same number occurring within 1 hour. The output should contain the number and timestamp.
How about just using exists?
select t.*
from mytable t
where t.time >= now() - Interval 1 hour and
t.time <= now() and
exists (select 1
from mytable t2
where t2.number = t.number and
t2.time >= now() - Interval 1 hour and
t2.time <= now() and
t2.time <> t.time
);
However, I suspect that the problem with your query is the join to time. Just remove the time from the subquery and the on clause and you will get all numbers. Alternatively, use group by:
select t.number, group_concat(time)
from mytable t
where t.time >= now() - Interval 1 hour and
t.time <= now()
group by t.number
having count(*) > 1;
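The group by version can be tried on the question's sample rows with Python's sqlite3 module as a stand-in for MySQL. To keep the result reproducible, the sketch passes a fixed "now" of 12:30 as a parameter (my assumption, not from the question) instead of calling now().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (number TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [
        ("123456", "2015-09-17 11:30:00"),
        ("455677", "2015-09-17 12:15:00"),
        ("123456", "2015-09-17 12:25:00"),
        ("453377", "2015-09-17 13:15:00"),
    ],
)

# Assumed clock reading, passed in so the example is deterministic.
now = "2015-09-17 12:30:00"

# Numbers seen more than once in the hour ending at :now.
repeated = conn.execute(
    """
    SELECT number, COUNT(*) AS hits, MIN(time) AS first_seen, MAX(time) AS last_seen
    FROM mytable
    WHERE time >= datetime(:now, '-1 hour') AND time <= :now
    GROUP BY number
    HAVING COUNT(*) > 1
    """,
    {"now": now},
).fetchall()

print(repeated)
```

Only 123456 appears twice inside the window; 453377 falls after the assumed "now" and is ignored.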

SQL aggregation select using SUM and COUNT on different tables

I have a table emails
id date sent_to
1 2013-01-01 345
2 2013-01-05 990
3 2013-02-05 1000
table2 is responses
email_id email response
1 xyz@email.com xxxx
1 xyzw@email.com yyyy
.
.
.
I want a result with the following format:
Month total_number_of_subscribers_sent total_responded
2013-01 1335 2
.
.
this is my query:
SELECT
DATE_FORMAT(e.date, '%Y-%m')AS `Month`,
count(*) AS total_responded,
SUM(e.sent_to) AS total_sent
FROM
responses r
LEFT JOIN emails e ON e.id = r.email_id
WHERE
e.date > '2012-12-31' AND e.date < '2013-10-01'
GROUP BY
DATE_FORMAT(e.date, '%Y %m')
It works OK for total_responded, but total_sent goes crazy, into the millions, obviously because the joined table repeats each sent_to value once per response.
So basically, can I do a SUM and a COUNT in the same query on separate tables?
If you want to count duplicates in each table, then the query is a little complicated.
You need to aggregate the sends and responses separately, before joining them together. The join is on the date, which necessarily comes from the "sent" information:
select e.`Month`, coalesce(total_sent, 0) as total_sent, coalesce(total_emails, 0) as total_emails,
coalesce(total_responses, 0) as total_responses,
coalesce(total_email_responses, 0) as total_email_responses
from (select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_sent, count(distinct email) as total_emails
from emails e
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) e left outer join
(select DATE_FORMAT(e.date, '%Y-%m') as `Month`,
count(*) as total_responses, count(distinct r.email) as total_email_responses
from emails e join
responses r
on e.email = r.email
where e.date > '2012-12-31' AND e.date < '2013-10-01'
group by DATE_FORMAT(e.date, '%Y-%m')
) r
on e.`Month` = r.`Month`;
The apparent fact that your responses have no link to the "sent" information -- not even the date -- suggests a real problem with your operations and data.
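The "aggregate each table first, then join" idea can be sketched against the tables exactly as posted in the question. This assumes that responses.email_id refers to emails.id, which the sample rows suggest but the question does not state. It uses Python's sqlite3 module as a stand-in, so strftime replaces MySQL's DATE_FORMAT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE emails (id INTEGER PRIMARY KEY, date TEXT, sent_to INTEGER);
    CREATE TABLE responses (email_id INTEGER, email TEXT, response TEXT);
    INSERT INTO emails VALUES (1, '2013-01-01', 345);
    INSERT INTO emails VALUES (2, '2013-01-05', 990);
    INSERT INTO emails VALUES (3, '2013-02-05', 1000);
    INSERT INTO responses VALUES (1, 'xyz@email.com', 'xxxx');
    INSERT INTO responses VALUES (1, 'xyzw@email.com', 'yyyy');
    """
)

# Sum sent_to per month on emails alone, count responses per month separately,
# and only then join the two one-row-per-month results.
monthly = conn.execute(
    """
    SELECT s.month, s.total_sent, COALESCE(r.total_responded, 0) AS total_responded
    FROM (
        SELECT strftime('%Y-%m', date) AS month, SUM(sent_to) AS total_sent
        FROM emails
        GROUP BY strftime('%Y-%m', date)
    ) s
    LEFT JOIN (
        SELECT strftime('%Y-%m', e.date) AS month, COUNT(*) AS total_responded
        FROM responses resp
        JOIN emails e ON e.id = resp.email_id
        GROUP BY strftime('%Y-%m', e.date)
    ) r ON r.month = s.month
    ORDER BY s.month
    """
).fetchall()

print(monthly)
```

Because each subquery already has one row per month, the join cannot multiply sent_to by the number of responses.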

Mysql subtracting values using row selected from min timestamp, grouping by id

I've been at this for a few hours now to no avail, pulling my hair out.
Edit: I want to calculate the difference in the overall_exp column against the same data from 1 day ago, to find the greatest 'gain' for each user.
Currently I take a row, select the row from 1 day earlier based on the first row's timestamp, subtract the overall_exp values of the two rows, and order by that result while grouping by user_id.
SQL Fiddle: http://sqlfiddle.com/#!2/501c8
Here is what I currently have; however, the logic is completely wrong, so I'm getting 0 results:
SELECT rsn, ts.timestamp, @original_ts := SUBDATE( ts.timestamp, INTERVAL 1 DAY), ts.overall_exp, ts.overall_exp - previous.overall_exp AS gained_exp
FROM tracker AS ts
INNER JOIN (
SELECT user_id, MIN( TIMESTAMP ) , overall_exp
FROM tracker
WHERE TIMESTAMP >= @original_ts
GROUP BY user_id
) previous
ON ts.user_id = previous.user_id
JOIN users
ON ts.user_id = users.id
GROUP BY ts.user_id
ORDER BY gained_exp DESC
You can do this with a self-join:
select t.user_id, max(t.overall_exp - tprev.overall_exp)
from tracker t join
tracker tprev
on tprev.user_id = t.user_id and
date(tprev.timestamp) = date(SUBDATE(t.timestamp, INTERVAL 1 DAY))
group by t.user_id
A key here is converting the timestamps to dates, so the comparison is exact.
Try:
select u.*, max(t.overall_exp)-min(t.overall_exp) gain
from users u
left join tracker t
on u.id = t.user_id and
t.`timestamp` >= date_sub(date(now()), interval 1 day) and
t.`timestamp` < date_add(date(now()), interval 1 day)
group by u.id
order by gain desc

SQL selecting average score over range of dates

I have 3 tables:
doctors (id, name) -> has_many:
patients (id, doctor_id, name) -> has_many:
health_conditions (id, patient_id, note, created_at)
Every day each patient gets a new health condition with a note from 1 to 10, where 10 is good health (full recovery, if you will).
What I want to extract is the following 3 statistics for the last 30 days (month):
- how many patients got better
- how many patients got worse
- how many patients remained the same
These statistics are global so I don't care right now of statistics per doctor which I could extract given the right query.
The trick is that the query needs to extract the current health_condition note and compare with the average of past days (this month without today) so one needs to extract today's note and an average of the other days excluding this one.
I don't think the query needs to define who went up/down/same since I can loop and decide that. Just today vs. rest of the month will be sufficient I guess.
Here's what I have so far, which obviously doesn't work because it only returns one result due to the limit applied:
SELECT
p.id,
p.name,
hc.latest,
hcc.average
FROM
pacients p
INNER JOIN (
SELECT
id,
pacient_id,
note as LATEST
FROM
health_conditions
GROUP BY pacient_id, id
ORDER BY created_at DESC
LIMIT 1
) hc ON(hc.pacient_id=p.id)
INNER JOIN (
SELECT
id,
pacient_id,
avg(note) AS average
FROM
health_conditions
GROUP BY pacient_id, id
) hcc ON(hcc.pacient_id=p.id AND hcc.id!=hc.id)
WHERE
date_part('epoch',date_trunc('day', hcc.created_at))
BETWEEN
(date_part('epoch',date_trunc('day', hc.created_at)) - (30 * 86400))
AND
date_part('epoch',date_trunc('day', hc.created_at))
The query has all the logic it needs to distinguish between what is latest and average but that limit kills everything. I need that limit to extract the latest result which is used to compare with past results.
Something like this, assuming created_at is of type date:
select p.name,
hc.note as current_note,
t.avg_note
from patients p
join health_conditions hc on hc.patient_id = p.id
join (
select patient_id,
avg(note) as avg_note
from health_conditions hc2
where created_at between current_date - 30 and current_date - 1
group by patient_id
) t on t.patient_id = hc.patient_id
where hc.created_at = current_date;
This is PostgreSQL syntax. MySQL does not do date arithmetic the same way: there you would write current_date - interval 30 day rather than current_date - 30.
Edit:
This should get you the most recent note for each patient, plus the average for the last 30 days:
select p.name,
hc.created_at as last_note_date,
hc.note as current_note,
t.avg_note
from patients p
join health_conditions hc
on hc.patient_id = p.id
and hc.created_at = (select max(created_at)
from health_conditions hc2
where hc2.patient_id = hc.patient_id)
join (
select patient_id,
avg(note) as avg_note
from health_conditions hc3
where created_at between current_date - 30 and current_date - 1
group by patient_id
) t on t.patient_id = hc.patient_id;
SELECT SUM(delta < 0) AS worsened,
SUM(delta = 0) AS no_change,
SUM(delta > 0) AS improved
FROM (
SELECT patient_id,
SUM(IF(DATE(created_at) = CURDATE(),note,NULL))
- AVG(IF(DATE(created_at) < CURDATE(),note,NULL)) AS delta
FROM health_conditions
WHERE DATE(created_at) BETWEEN CURDATE() - INTERVAL 1 MONTH AND CURDATE()
GROUP BY patient_id
) t

Trying to correct a mysql query

I currently have the following query;
SELECT a.schedID,
a.start AS eventDate, b.div_id AS divisionID, b.div_name AS divisionName
FROM schedules a
INNER JOIN divisions b ON b.div_id = a.div_id
WHERE date_format(a.start, '%Y-%m-%d') >= '2010-01-01'
AND DATE_ADD(a.start, INTERVAL 5 DAY) <= CURDATE()
AND NOT EXISTS (SELECT results_id FROM results e WHERE e.schedID = a.schedID)
ORDER BY eventDate ASC;
I'm basically trying to find any schedules that still have no results 5 days after the schedule date. My current query has major performance issues. It also times out inconsistently. Is there a different way to write the query? I'm at a mental roadblock. Any help is appreciated.
Without anticipating the outcome too much, I would suggest the following leads:
* Try to remove the date_format, as it generates one function call per record. I don't know the format of your column a.start, but this should be possible.
* The same goes for DATE_ADD; you could probably move it to the other side of the comparison, like:
a.start <= DATE_SUB(CURDATE(), INTERVAL 5 DAY)
That way the result has a chance of being computed once rather than for each row; you could even define it as a parameter upfront.
* The NOT EXISTS is very expensive; it seems to me you could replace it with a left join like:
schedules a LEFT JOIN results e ON a.schedId = e.schedId WHERE e.schedId is NULL
Double-check that all join fields are well indexed.
Good luck
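To check that the left join form returns the intended rows, here is a small sketch of the leads above using Python's sqlite3 module as a stand-in for MySQL. The table contents, the index name, and the fixed "today" are hypothetical; date(:today, '-5 days') plays the role of DATE_SUB(CURDATE(), INTERVAL 5 DAY).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE schedules (schedID INTEGER PRIMARY KEY, start TEXT);
    CREATE TABLE results (results_id INTEGER PRIMARY KEY, schedID INTEGER);
    CREATE INDEX idx_results_sched ON results (schedID);
    INSERT INTO schedules VALUES (1, '2010-03-01');
    INSERT INTO schedules VALUES (2, '2010-03-02');
    INSERT INTO schedules VALUES (3, '2010-03-09');
    INSERT INTO results VALUES (10, 1);
    """
)

# Assumed current date, passed in so the example is deterministic.
today = "2010-03-10"

# The date column is compared directly (no function applied to it), and
# schedules without a matching results row are kept via the IS NULL test.
overdue = conn.execute(
    """
    SELECT a.schedID, a.start
    FROM schedules a
    LEFT JOIN results e ON e.schedID = a.schedID
    WHERE a.start >= '2010-01-01'
      AND a.start <= date(:today, '-5 days')
      AND e.schedID IS NULL
    ORDER BY a.start
    """,
    {"today": today},
).fetchall()

print(overdue)
```

Schedule 1 has a result, and schedule 3 is less than 5 days old, so only schedule 2 comes back.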
Maybe something like:
SELECT
a.schedID, a.start AS eventDate, b.div_id AS divisionID, b.div_name AS divisionName
FROM
schedules a
INNER JOIN divisions b ON b.div_id = a.div_id
WHERE
date_format(a.start, '%Y-%m-%d') >= '2010-01-01'
AND NOT EXISTS (
SELECT
*
FROM
results e
INNER JOIN schedules a2 ON e.schedID = a2.schedID
WHERE
DATE_ADD(a2.start, INTERVAL 5 DAY) <= CURDATE()
AND a2.id = a.id
)
ORDER BY eventDate ASC;
I don't know if MySQL is the same as Oracle, but are you converting a date to a string here and then comparing it with the string '2010-01-01'? Can you convert '2010-01-01' to a date instead, so that an index on a.start, if there is one, can be used?
Also, does this query definitely return the right answer?
You mention you want schedules without results 5 days after the schedule date, but it looks like you are asking for anything in the last 5 days?
a.start >= 1-Jan-10 and start date + 5 days is before today
try this query
SELECT a.schedID,
a.start AS eventDate,
b.div_id AS divisionID,
b.div_name AS divisionName
FROM (SELECT * FROM schedules s WHERE DATE(s.start) >= '2010-01-01' AND DATE_ADD(s.start, INTERVAL 5 DAY) <= CURDATE()) a
INNER JOIN divisions b
ON b.div_id = a.div_id
LEFT JOIN results e
ON e.schedID = a.schedID
WHERE e.results_id IS NULL
ORDER BY eventDate ASC;