Let's say we have a table named record with 4 fields:
id (INT 11 AUTO_INC)
email (VAR 50)
timestamp (INT 11)
status (INT 1)
And the table contains the following data:
Now we can see that the email address test#example.com was duplicated 4 times (the record with the lowest timestamp is the original one and all copies after that are duplicates). I can easily count the number of unique records using:
SELECT COUNT(DISTINCT email) FROM record
I can also easily find out which email address was duplicated how many times using
SELECT email, count(id) FROM record GROUP BY email HAVING COUNT(id)>1
But now the business question is
How many times STATUS was 1 on all the Duplicate Records?
For example:
For test#example.com there was no duplicate record having status 1
For second#example.com there was 1 duplicate record having status 1
For third#example.com there was 1 duplicate record having status 1
For four#example.com there was no duplicate record having status 1
For five#example.com there were 2 duplicate records having status 1
So the sum of all the numbers is 0 + 1 + 1 + 0 + 2 = 4
Which means there were 4 duplicate records in the table that had status = 1.
Question
How many Duplicate records have status = 1 ?
This is a new solution that works better. It removes the first entry for each email and then counts the rest. It's not easy to read, and if possible I would write this in a stored procedure, but it works.
select sum(status)
from dude d1
join (select email,
min(ts) as ts
from dude
group by email) mins
using (email)
where d1.ts != mins.ts;
original answer below
Your own query to find "which email address was duplicated how many times using"
SELECT email,
count(id) as duplicates
FROM record
GROUP BY email
HAVING COUNT(id)>1
can easily be modified to answer "How many Duplicate records have status = 1"
SELECT email,
count(id) as duplicates_status_sum
FROM record
WHERE status = 1
GROUP BY email
HAVING COUNT(id)>1
Both of these queries count the original row as well, so the result is really "duplicates including the original one". You can subtract 1 from the counts if the original one always has status 1.
SELECT email,
count(id) -1 as true_duplicates
FROM record
GROUP BY email
HAVING COUNT(id)>1
SELECT email,
count(id) -1 as true_duplicates_status_sum
FROM record
WHERE status = 1
GROUP BY email
HAVING COUNT(id)>1
If I am not wrong in understanding then your query should be
SELECT `email` , COUNT( `id` ) AS `tot`
FROM `record` , (
SELECT `email` AS `emt` , MIN( `timestamp` ) AS `mtm`
FROM `record`
GROUP BY `email`
) AS `temp`
WHERE `email` = `emt`
AND `timestamp` > `mtm`
AND `status` =1
GROUP BY `email`
HAVING COUNT( `id` ) >=1
First we need to get the minimum timestamp and then find duplicate records that are inserted after this timestamp and having status 1.
If you want the total sum then the query is
SELECT SUM( `tot` ) AS `duplicatesWithStatus1`
FROM (
SELECT `email` , COUNT( `id` ) AS `tot`
FROM `record` , (
SELECT `email` AS `emt` , MIN( `timestamp` ) AS `mtm`
FROM `record`
GROUP BY `email`
) AS `temp`
WHERE `email` = `emt`
AND `timestamp` > `mtm`
AND `status` =1
GROUP BY `email`
HAVING COUNT( `id` ) >=1
) AS t
Hope this is what you want
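To make the min-timestamp idea above easy to check, here is a minimal sketch that runs the same kind of join against an in-memory SQLite database from Python instead of MySQL. The sample rows and the function name are made up for illustration (the data from the question is not reproduced here), and the subquery alias firsts is just a label I picked.

```python
import sqlite3


def count_duplicates_with_status_1(rows):
    """Count rows with status 1 that are not the earliest row for their email.

    rows: iterable of (email, timestamp, status) tuples (hypothetical data).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE record ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "email TEXT, timestamp INTEGER, status INTEGER)"
    )
    conn.executemany(
        "INSERT INTO record (email, timestamp, status) VALUES (?, ?, ?)", rows
    )
    # Pair every row with the earliest timestamp of its email, then keep
    # only the later rows (the duplicates) whose status is 1.
    (total,) = conn.execute(
        """
        SELECT COUNT(*)
        FROM record AS r
        JOIN (SELECT email, MIN(timestamp) AS first_ts
              FROM record
              GROUP BY email) AS firsts
          ON firsts.email = r.email
        WHERE r.timestamp > firsts.first_ts
          AND r.status = 1
        """
    ).fetchone()
    conn.close()
    return total


# Hypothetical sample data, not the table from the question.
sample_rows = [
    ("a#example.com", 100, 1),  # original row, never counted
    ("a#example.com", 200, 0),  # duplicate, status 0
    ("a#example.com", 300, 1),  # duplicate, status 1 -> counted
    ("b#example.com", 100, 0),  # original row
    ("b#example.com", 150, 1),  # duplicate, status 1 -> counted
    ("c#example.com", 100, 1),  # no duplicates at all
]
print(count_duplicates_with_status_1(sample_rows))
```

Note that this treats the row with the lowest timestamp as the original, so two rows sharing the same lowest timestamp would both be treated as originals.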
You can get the count of duplicate records that have status = 1 with:
select count(*) as Duplicate_Record_Count
from (select *
from record r
where r.status=1
group by r.email,r.status
having count(r.email)>1 ) t1
The following query will return the duplicate email with status 1 count and timestamp
select r.email,count(*)-1 as Duplicate_Count,min(r.timestamp) as timestamp
from record r
where r.status=1
group by r.email
having count(r.email)>1
Related
I have two tables, contacts and calllist. contacts has multiple columns containing phone numbers. calllist has only one column, from_number, containing phone numbers. I'm trying to get all phone numbers from the column from_number which do not match any of the phone numbers in the table contacts.
Here is my working but probably very inefficient and slow SQL query:
SELECT from_number AS phone_number, COUNT(from_number) AS number_of_calls
FROM calllist
WHERE from_number NOT IN (
SELECT businessPhone1
FROM contacts
WHERE businessPhone1 IS NOT NULL
)
AND from_number NOT IN (
SELECT businessPhone2
FROM contacts
WHERE businessPhone2 IS NOT NULL
)
AND from_number NOT IN (
SELECT homePhone1
FROM contacts
WHERE homePhone1 IS NOT NULL
)
AND from_number NOT IN (
SELECT homePhone2
FROM contacts
WHERE homePhone2 IS NOT NULL
)
AND from_number NOT IN (
SELECT mobilePhone
FROM contacts
WHERE mobilePhone IS NOT NULL
)
AND (received_at BETWEEN '$startDate' AND DATE_ADD('$endDate', INTERVAL 1 DAY))
GROUP BY phone_number
ORDER BY number_of_calls DESC
LIMIT 10
How do I rewrite this SQL query to be faster? Any help would be much appreciated.
Try this:
SELECT from_number AS phone_number, COUNT(from_number) AS number_of_calls
FROM calllist
WHERE from_number NOT IN (
SELECT businessPhone1
FROM contacts
WHERE businessPhone1 IS NOT NULL
UNION
SELECT businessPhone2
FROM contacts
WHERE businessPhone2 IS NOT NULL
UNION
SELECT homePhone1
FROM contacts
WHERE homePhone1 IS NOT NULL
UNION
SELECT homePhone2
FROM contacts
WHERE homePhone2 IS NOT NULL
UNION
SELECT mobilePhone
FROM contacts
WHERE mobilePhone IS NOT NULL
)
AND (received_at BETWEEN '$startDate' AND DATE_ADD('$endDate', INTERVAL 1 DAY))
GROUP BY phone_number
ORDER BY number_of_calls DESC
LIMIT 10
I don't like the schema design. You have multiple columns holding 'identical' data -- namely phone numbers. What if technology advances and you need a 6th phone number??
Instead, have a separate table of phone numbers, with linkage (id) to JOIN back to calllist. That gets rid of all the slow NOT IN ( SELECT... ), avoids a messy UNION ALL, etc.
If you desire, the new table could have a 3rd column that says which type of phone it is.
ENUM('unknown', 'company', 'home', 'mobile')
The simplified query goes something like
SELECT cl.from_number AS phone_number,
COUNT(*) AS number_of_calls
FROM calllist AS cl
LEFT JOIN phonenums AS pn ON pn.number = cl.from_number
WHERE cl.received_at >= '$startDate'
AND cl.received_at < '$endDate' + INTERVAL 1 DAY
AND pn.number IS NULL -- not found in phonenums
GROUP BY phone_number
ORDER BY number_of_calls DESC
LIMIT 10
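As a rough illustration of the separate-phone-table idea, the sketch below builds a tiny phonenums table and runs the LEFT JOIN ... IS NULL anti-join against it. It uses SQLite through Python rather than MySQL, so the date arithmetic is written with SQLite's date() function. The column names contact_id, number and kind, the index name, and all sample rows are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE calllist (from_number TEXT, received_at TEXT);
    -- Hypothetical normalized table: one row per phone number per contact.
    CREATE TABLE phonenums (contact_id INTEGER, number TEXT, kind TEXT);
    CREATE INDEX idx_phonenums_number ON phonenums (number);

    INSERT INTO phonenums VALUES (1, '111', 'home'), (1, '222', 'mobile');
    INSERT INTO calllist VALUES
        ('111', '2024-01-02 10:00:00'),  -- known number
        ('999', '2024-01-02 11:00:00'),  -- unknown
        ('999', '2024-01-03 09:00:00'),  -- unknown
        ('888', '2024-01-03 12:00:00'),  -- unknown
        ('777', '2024-02-01 08:00:00');  -- unknown, but outside the range
    """
)


def top_unknown_callers(conn, start_date, end_date, limit=10):
    """Return (phone_number, number_of_calls) for callers not in phonenums."""
    return conn.execute(
        """
        SELECT cl.from_number AS phone_number, COUNT(*) AS number_of_calls
        FROM calllist AS cl
        LEFT JOIN phonenums AS pn ON pn.number = cl.from_number
        WHERE cl.received_at >= ?
          AND cl.received_at < date(?, '+1 day')
          AND pn.number IS NULL          -- no match: number is unknown
        GROUP BY cl.from_number
        ORDER BY number_of_calls DESC, phone_number
        LIMIT ?
        """,
        (start_date, end_date, limit),
    ).fetchall()


result = top_unknown_callers(conn, "2024-01-01", "2024-01-31")
print(result)
```

The join is on the phone number itself, so a single index on phonenums.number is enough for the lookup, whatever the number of phone types.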
I am working with a table of items with expiration dates; these items are assigned to users.
I want to get the highest expiration date for each user. The issue here is that default items are initialized with a '3000/01/01' expiration date that should be ignored if another item exists for that user.
I've got a query doing that:
SELECT
user_id as UserId,
CASE WHEN (YEAR(MAX(date_expiration)) = 3000)
THEN (
SELECT MAX(temp.date_expiration)
FROM user_items temp
WHERE YEAR(temp.date_expiration) <> 3000 and temp.user_id = UserId
)
ELSE MAX(date_expiration)
END as date_expiration
FROM user_items GROUP BY user_id
This works, but the query inside the THEN block hurts performance, and it is a huge table.
So, is there a better way to ignore the default date from the MAX operation when entering the CASE condition?
SELECT user_id,
COALESCE(
MAX(CASE WHEN YEAR(date_expiration) = 3000 THEN NULL ELSE date_expiration END),
MAX(date_expiration)
)
FROM user_items
GROUP BY
user_id
If there are few users but lots of entries per user in your table, you can try improving your query yet a little more:
SELECT user_id,
COALESCE(
(
SELECT date_expiration
FROM user_items uii
WHERE uii.user_id = uid.user_id
AND date_expiration < '3000-01-01'
ORDER BY
user_id DESC, date_expiration DESC
LIMIT 1
),
(
SELECT date_expiration
FROM user_items uii
WHERE uii.user_id = uid.user_id
ORDER BY
user_id DESC, date_expiration DESC
LIMIT 1
)
)
FROM (
SELECT DISTINCT
user_id
FROM user_items
) uid
You need an index on (user_id, date_expiration) for this to work fast.
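For anyone who wants to see the conditional MAX with the COALESCE fallback in action, here is a small sketch using SQLite from Python. SQLite has no YEAR() function, so the default date is filtered with a plain comparison against '3000-01-01' instead; the sample rows and the index name are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_items (user_id INTEGER, date_expiration TEXT);
    CREATE INDEX idx_user_exp ON user_items (user_id, date_expiration);

    -- Hypothetical sample rows.
    INSERT INTO user_items VALUES
        (1, '3000-01-01'), (1, '2015-06-30'), (1, '2014-01-15'),
        (2, '3000-01-01'),               -- only the default item
        (3, '2016-03-01');               -- no default item at all
    """
)

rows = conn.execute(
    """
    SELECT user_id,
           COALESCE(
               -- MAX over real dates only; NULL when the user has none.
               MAX(CASE WHEN date_expiration < '3000-01-01'
                        THEN date_expiration END),
               -- Fall back to the overall MAX (the default date).
               MAX(date_expiration)
           ) AS date_expiration
    FROM user_items
    GROUP BY user_id
    ORDER BY user_id
    """
).fetchall()
print(rows)
```

The CASE without an ELSE yields NULL for default rows, and MAX ignores NULLs, which is why a single pass over each group is enough.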
I have a table CONTACT with a field opt_out.
The field opt_out may have values 'Y', 'N' and NULL.
I have a table CONTACT_AUDIT with fields
date
contact_id
field_name
value_before
value_after
When I add a new contact, a new line is added to the CONTACT table and nothing is added to the CONTACT_AUDIT table.
When I edit a contact, for example if I change the opt_out field value from NULL to 'Y', the opt_out field value in CONTACT table is changed and a new line is added to CONTACT_AUDIT table with values
date=NOW()
contact_id=<my contact's id>
field_name='opt_out'
value_before=NULL
value_after='Y'
I need to know the contacts who had opt_out='Y' at a given date.
I tried this :
SELECT count(*) AS nb
FROM contacts c
WHERE
( -- contact is optout now and has never been modified before
c.optout = 'Y'
AND c.id NOT IN (SELECT DISTINCT contact_id FROM contacts_audit WHERE field_name = 'optout')
)
OR ( -- we consider contacts where the last row before date in contacts_audit is optout = 'Y'
c.id IN (
SELECT ca.contact_id
FROM contacts_audit ca
WHERE date_created BETWEEN '2014-07-24' AND DATE_ADD( '2014-07-24', INTERVAL 1 DAY )
AND field_name = 'optout'
ORDER BY date_created
LIMIT 1
)
)
But MySQL does not support LIMIT inside an IN subquery.
So I tried with HAVING :
SELECT count(*) AS nb
FROM contacts c
WHERE
( -- contact is optout now and has never been modified before
c.optout = 'Y'
AND c.id NOT IN (SELECT DISTINCT contact_id FROM contacts_audit WHERE field_name = 'optout')
)
OR ( -- we consider contacts where the last row before date in contacts_audit is optout = 'Y'
c.id IN (
SELECT ca.contact_id
FROM contacts_audit ca
WHERE date_created BETWEEN '2014-07-24' AND DATE_ADD( '2014-07-24', INTERVAL 1 DAY )
AND field_name = 'optout'
HAVING MAX(date_created)
)
)
The query runs, but now I don't know how to tell whether the value corresponding to the subquery row is 'Y' or 'N'. If I add a WHERE clause to check only for 'Y' values, 'N' values will be filtered out and I will not be able to know whether the last value at that date was 'Y' or 'N'...
Thank you for your help
If I understand your problem correctly, you may want to use a union. I don't have MySQL to test it right now, but the code could be something like this. Tell me if this helped.
select c.id, c.optout
from contacts c
where c.optout = 'Y'
AND c.id NOT IN (SELECT DISTINCT contact_id FROM contacts_audit WHERE field_name = 'optout')
UNION
select c.id, c.optout
from contacts c
where c.id IN (
SELECT ca.contact_id
FROM contacts_audit ca
WHERE date_created BETWEEN '2014-07-24' AND DATE_ADD( '2014-07-24', INTERVAL 1 DAY )
AND field_name = 'optout'
HAVING MAX(date_created)
)
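One possible way to reason about "the value at a given date" is sketched below; it is an assumption on my part, not something tested against the real tables. If a contact has an audit row before the cutoff, the latest such row's value_after is the state at that date. If all of its audit rows are later, the earliest row's value_before is the state. With no audit rows at all, the current value applies. The sketch uses SQLite from Python, the table and column names follow the query in the question, the sample rows are invented, and contacts created after the date are not handled because no creation date is mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE contacts (id INTEGER PRIMARY KEY, optout TEXT);
    CREATE TABLE contacts_audit (
        date_created TEXT, contact_id INTEGER, field_name TEXT,
        value_before TEXT, value_after TEXT);

    -- Hypothetical sample rows.
    INSERT INTO contacts VALUES (1, 'Y'), (2, 'N'), (3, 'Y'), (4, 'N'), (5, 'Y');
    INSERT INTO contacts_audit VALUES
        ('2014-07-20 09:00:00', 2, 'optout', NULL, 'Y'),
        ('2014-07-30 09:00:00', 2, 'optout', 'Y', 'N'),
        ('2014-08-01 09:00:00', 3, 'optout', 'N', 'Y'),
        ('2014-07-10 09:00:00', 5, 'optout', 'Y', 'N'),
        ('2014-08-05 09:00:00', 5, 'optout', 'N', 'Y');
    """
)

QUERY = """
SELECT s.id
FROM (
    SELECT c.id AS id,
           CASE
               -- Edited before the cutoff: take the latest value_after.
               WHEN EXISTS (SELECT 1 FROM contacts_audit a
                            WHERE a.contact_id = c.id
                              AND a.field_name = 'optout'
                              AND a.date_created < :cutoff)
               THEN (SELECT a.value_after FROM contacts_audit a
                     WHERE a.contact_id = c.id
                       AND a.field_name = 'optout'
                       AND a.date_created < :cutoff
                     ORDER BY a.date_created DESC LIMIT 1)
               -- Only edited later: take the earliest value_before.
               WHEN EXISTS (SELECT 1 FROM contacts_audit a
                            WHERE a.contact_id = c.id
                              AND a.field_name = 'optout')
               THEN (SELECT a.value_before FROM contacts_audit a
                     WHERE a.contact_id = c.id
                       AND a.field_name = 'optout'
                     ORDER BY a.date_created ASC LIMIT 1)
               -- Never edited: the current value has always applied.
               ELSE c.optout
           END AS optout_then
    FROM contacts c
) AS s
WHERE s.optout_then = 'Y'
ORDER BY s.id
"""


def opted_out_ids(conn, cutoff):
    """Ids of contacts whose reconstructed optout was 'Y' just before cutoff."""
    return [row[0] for row in conn.execute(QUERY, {"cutoff": cutoff})]


# State at the end of 2014-07-24, i.e. strictly before 2014-07-25.
print(opted_out_ids(conn, "2014-07-25"))
```

A CASE with EXISTS is used rather than COALESCE because NULL is itself a legitimate opt_out value and must not fall through to the next branch.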
I have a table that contains all purchased items.
I need to check which users purchased items in a specific period of time (say between 2013-03-21 to 2013-04-21) and never purchased anything after that.
I can select users that purchased items in that period of time, but I don't know how to filter those users that never purchased anything after that...
SELECT `userId`, `email` FROM my_table
WHERE `date` BETWEEN '2013-03-21' AND '2013-04-21' GROUP BY `userId`
Give this a try
SELECT
user_id
FROM
my_table
WHERE
purchase_date >= '2012-05-01' -- your_start_date
GROUP BY
user_id
HAVING
max(purchase_date) <= '2012-06-01'; -- your_end_date
It works by getting all the records >= start date, grouping the result set by user_id, and then finding the max purchase date for every user. The max purchase date should be <= end date. Since this query does not use a join or inner query, it could be faster.
Test data
CREATE table user_purchases(user_id int, purchase_date date);
insert into user_purchases values (1, '2012-05-01');
insert into user_purchases values (2, '2012-05-06');
insert into user_purchases values (3, '2012-05-20');
insert into user_purchases values (4, '2012-06-01');
insert into user_purchases values (4, '2012-09-06');
insert into user_purchases values (1, '2012-09-06');
Output
| USER_ID |
-----------
| 2 |
| 3 |
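The test data above can also be replayed locally. This is a small sketch that loads the same six rows into an in-memory SQLite database from Python (not MySQL) and runs the GROUP BY / HAVING MAX query against the user_purchases table; the function name is my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE user_purchases (user_id INTEGER, purchase_date TEXT);
    INSERT INTO user_purchases VALUES (1, '2012-05-01');
    INSERT INTO user_purchases VALUES (2, '2012-05-06');
    INSERT INTO user_purchases VALUES (3, '2012-05-20');
    INSERT INTO user_purchases VALUES (4, '2012-06-01');
    INSERT INTO user_purchases VALUES (4, '2012-09-06');
    INSERT INTO user_purchases VALUES (1, '2012-09-06');
    """
)


def buyers_without_later_purchase(conn, start_date, end_date):
    """Users who bought on or after start_date and never after end_date."""
    rows = conn.execute(
        """
        SELECT user_id
        FROM user_purchases
        WHERE purchase_date >= ?          -- drop purchases before the period
        GROUP BY user_id
        HAVING MAX(purchase_date) <= ?    -- latest purchase is inside it
        ORDER BY user_id
        """,
        (start_date, end_date),
    ).fetchall()
    return [row[0] for row in rows]


result = buyers_without_later_purchase(conn, "2012-05-01", "2012-06-01")
print(result)
```

Users 1 and 4 drop out because their latest purchase (2012-09-06) falls after the end date.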
This is probably a standard way to accomplish that:
SELECT `userId`, `email` FROM my_table mt
WHERE `date` BETWEEN '2013-03-21' AND '2013-04-21'
AND NOT EXISTS (
SELECT * FROM my_table mt2 WHERE
mt2.`userId` = mt.`userId`
and mt2.`date` > '2013-04-21'
)
GROUP BY `userId`
SELECT `userId`, `email` FROM my_table WHERE (`date` BETWEEN '2013-03-21' AND '2013-04-21') and `date` >= '2013-04-21' GROUP BY `userId`
Note that both conditions apply to the same row, so this only returns users with a purchase dated 2013-04-21; it does not check whether the same user purchased anything after that timeframe.
Hope this helps.
Try the following
SELECT `userId`, `email`
FROM my_table WHERE `date` BETWEEN '2013-03-21' AND '2013-04-21'
and `userId` not in
(select `userId` from my_table
where `date` < '2013-03-21' or `date` > '2013-04-21' )
GROUP BY `userId`
You'll have to do it in two stages - one query to get the list of users who did buy within the time period, then another query to take that list of users and see if they bought anything afterwards, e.g.
SELECT during.userID, during.email, COUNT(after.userID) AS purchases
FROM (
SELECT DISTINCT userID, email
FROM my_table
WHERE `date` BETWEEN '2013-03-21' AND '2013-04-21'
) AS during
LEFT JOIN my_table AS after
ON after.userID = during.userID
AND after.`date` > '2013-04-21'
GROUP BY during.userID, during.email
HAVING purchases = 0;
Inner query gets the list of userIDs who purchased at least one thing during that period. That list is then joined back against the same table, but filtered for purchases AFTER the period , and counts how many purchases they made and filters down to only those users with 0 "after" purchases.
probably won't work as written - haven't had my morning tea yet.
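Here is a hedged sketch of the "buyers during the period, LEFT JOINed to their later purchases, keep those with zero" idea, run on SQLite from Python rather than MySQL. The alias later, the function name, and the sample rows are all made up. The key detail is that the "after the period" filter sits in the ON clause, so buyers with no later purchase survive the join with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE my_table (userId INTEGER, email TEXT, date TEXT);

    -- Hypothetical sample rows.
    INSERT INTO my_table VALUES
        (1, 'u1#example.com', '2013-03-25'),
        (1, 'u1#example.com', '2013-05-02'),  -- bought again afterwards
        (2, 'u2#example.com', '2013-04-10'),
        (3, 'u3#example.com', '2013-01-05'),  -- only before the period
        (4, 'u4#example.com', '2013-04-01'),
        (4, 'u4#example.com', '2013-04-21');
    """
)


def lapsed_buyers(conn, start_date, end_date):
    """Users who bought within [start_date, end_date] and never afterwards."""
    rows = conn.execute(
        """
        SELECT during.userId
        FROM (SELECT DISTINCT p.userId
              FROM my_table AS p
              WHERE p.date BETWEEN ? AND ?) AS during
        LEFT JOIN my_table AS later
               ON later.userId = during.userId
              AND later.date > ?              -- filter in ON, not WHERE
        GROUP BY during.userId
        HAVING COUNT(later.userId) = 0        -- no purchase after the period
        ORDER BY during.userId
        """,
        (start_date, end_date, end_date),
    ).fetchall()
    return [row[0] for row in rows]


result = lapsed_buyers(conn, "2013-03-21", "2013-04-21")
print(result)
```

Putting the later-date condition in WHERE instead would discard the unmatched rows and leave no group with a zero count.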
SELECT
a.userId,
a.email
FROM
my_table AS a
WHERE a.date BETWEEN '2013-03-21'
AND '2013-04-21'
AND a.userId NOT IN
(SELECT
b.userId
FROM
my_table AS b
WHERE b.date BETWEEN '2013-04-22'
AND CURDATE()
GROUP BY b.userId)
GROUP BY a.userId
This keeps only users who have not purchased anything from the end date to the present.
I have the following query building a recordset which is used in a pie-chart as a report.
It's not run particularly often, but when it does it takes several seconds, and I'm wondering if there's any way to make it more efficient.
SELECT
CASE
WHEN (lastStatus IS NULL) THEN 'Unused'
WHEN (attempts > 3 AND callbackAfter IS NULL) THEN 'Max Attempts Reached'
WHEN (callbackAfter IS NOT NULL AND callbackAfter > DATE_ADD(NOW(), INTERVAL 7 DAY)) THEN 'Call Back After 7 Days'
WHEN (callbackAfter IS NOT NULL AND callbackAfter <= DATE_ADD(NOW(), INTERVAL 7 DAY)) THEN 'Call Back Within 7 Days'
WHEN (archived = 0) THEN 'Call Back Within 7 Days'
ELSE 'Spoke To'
END AS statusSummary,
COUNT(leadId) AS total
FROM
CO_Lead
WHERE
groupId = 123
AND
deleted = 0
GROUP BY
statusSummary
ORDER BY
total DESC;
I have an index for (groupId, deleted), but I'm not sure it would help to add any of the other fields into the index (if it would, how do I decide which should go first? callbackAfter because it's used the most?)
The table has about 500,000 rows (but will have 10 times that a year from now.)
The only other thing I could think of was to split it out into 6 queries (with the WHEN clause moved into the WHERE), but that makes it take 3 times as long.
EDIT:
Here's the table definition
CREATE TABLE CO_Lead (
objectId int UNSIGNED NOT NULL AUTO_INCREMENT,
groupId int UNSIGNED NOT NULL,
numberToCall varchar(20) NOT NULL,
firstName varchar(100) NOT NULL,
lastName varchar(100) NOT NULL,
attempts tinyint NOT NULL default 0,
callbackAfter datetime NULL,
lastStatus varchar(30) NULL,
createdDate datetime NOT NULL,
archived bool NOT NULL default 0,
deleted bool NOT NULL default 0,
PRIMARY KEY (
objectId
)
) ENGINE = InnoDB;
ALTER TABLE CO_Lead ADD CONSTRAINT UQIX_CO_Lead UNIQUE INDEX (
objectId
);
ALTER TABLE CO_Lead ADD INDEX (
groupId,
archived,
deleted,
callbackAfter,
attempts
);
ALTER TABLE CO_Lead ADD INDEX (
groupId,
deleted,
createdDate,
lastStatus
);
ALTER TABLE CO_Lead ADD INDEX (
firstName
);
ALTER TABLE CO_Lead ADD INDEX (
lastName
);
ALTER TABLE CO_Lead ADD INDEX (
lastStatus
);
ALTER TABLE CO_Lead ADD INDEX (
createdDate
);
Notes:
1. If leadId cannot be NULL, then change COUNT(leadId) to COUNT(*). They are logically equivalent, but the optimizer in most MySQL versions is not clever enough to recognize that.
2. Remove the two redundant callbackAfter IS NOT NULL conditions. If callbackAfter satisfies the second part, it cannot be null anyway.
3. You could benefit from splitting the query into 6 parts and adding appropriate indexes for each one, but depending on whether the conditions in the CASE overlap, the results may or may not be correct.
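To illustrate why overlapping conditions matter when splitting a CASE into separate counts, here is a small sketch on a cut-down, hypothetical version of the table (only three columns, four invented rows), run on SQLite from Python. A CASE assigns each row to the first matching branch only, while independent COUNT queries count a row once for every condition it satisfies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Reduced, hypothetical version of CO_Lead for the demonstration.
    CREATE TABLE CO_Lead (lastStatus TEXT, attempts INTEGER, callbackAfter TEXT);
    INSERT INTO CO_Lead VALUES
        (NULL,   5, NULL),   -- satisfies both 'Unused' and 'Max Attempts'
        (NULL,   0, NULL),   -- 'Unused' only
        ('busy', 4, NULL),   -- 'Max Attempts Reached' only
        ('done', 1, NULL);   -- neither
    """
)

# Single pass: each row lands in the first WHEN branch that matches.
case_counts = dict(
    conn.execute(
        """
        SELECT CASE
                   WHEN lastStatus IS NULL THEN 'Unused'
                   WHEN attempts > 3 AND callbackAfter IS NULL
                       THEN 'Max Attempts Reached'
                   ELSE 'Other'
               END AS statusSummary,
               COUNT(*) AS total
        FROM CO_Lead
        GROUP BY statusSummary
        """
    )
)

# Split queries: the first row is counted by both of them.
(split_unused,) = conn.execute(
    "SELECT COUNT(*) FROM CO_Lead WHERE lastStatus IS NULL"
).fetchone()
(split_max_attempts,) = conn.execute(
    "SELECT COUNT(*) FROM CO_Lead WHERE attempts > 3 AND callbackAfter IS NULL"
).fetchone()

print(case_counts, split_unused, split_max_attempts)
```

To make split queries agree with the CASE, each one would also have to exclude the rows matched by the earlier branches.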
A possible rewrite (mind the different format and check if this returns the same results, it may not!)
SELECT
cnt1 AS "Unused"
, cnt2 AS "Max Attempts Reached"
, cnt3 AS "Call Back After 7 Days"
, cnt4 AS "Call Back Within 7 Days"
, cnt5 AS "Call Back Within 7 Days"
, cnt6 - (cnt1+cnt2+cnt3+cnt4+cnt5) AS "Spoke To"
FROM
( SELECT
( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND lastStatus IS NULL
) AS cnt1
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND attempts > 3 AND callbackAfter IS NULL
) AS cnt2
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND callbackAfter > DATE_ADD(NOW(), INTERVAL 7 DAY)
) AS cnt3
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND callbackAfter <= DATE_ADD(NOW(), INTERVAL 7 DAY)
) AS cnt4
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
AND archived = 0
) AS cnt5
, ( SELECT COUNT(*) FROM CO_Lead
WHERE groupId = 123 AND deleted = 0
) AS cnt6
) AS tmp ;
If it does return correct results, you could add indexes to be used for each one of the subqueries:
For subquery 1: (groupId, deleted, lastStatus)
For subquery 2, 3, 4: (groupId, deleted, callbackAfter, attempts)
For subquery 5: (groupId, deleted, archived)
Another approach would be to keep the query you have (minding only notes 1 and 2 above) and add a wide covering index:
(groupId, deleted, lastStatus, callbackAfter, attempts, archived)
Try removing the index to see if this improves performance.
Indexes do not necessarily improve performance. If MySQL decides to use the index, it reads the index and then has to read the data from each page. Those page reads are random rather than sequential, and random reads can reduce performance on a query that has to read all the pages anyway.