MySQL COUNT(DISTINCT) giving wrong values with GROUP BY - mysql

I have a table that contains custom user analytics data. I was able to pull the number of unique users with a query:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users'
FROM `events`
WHERE client_id = 123
And this will return 16728
This table also has a column of type DATETIME that I would like to group the counts by. However, if I add a GROUP BY to the end of it, everything groups properly it seems except the totals don't match. My new query is this:
SELECT COUNT(DISTINCT(user_id)) AS 'unique_users', DATE(server_stamp) AS 'date'
FROM `events`
WHERE client_id = 123
GROUP BY DATE(server_stamp)
Now I get the following values:
|-----------------------------|
| unique_users | date |
|---------------|-------------|
| 2650 | 2019-08-26 |
| 3486 | 2019-08-27 |
| 3475 | 2019-08-28 |
| 3631 | 2019-08-29 |
| 3492 | 2019-08-30 |
|-----------------------------|
Totaling to 16734. I tried using a sub query to get the distinct users then count and group in the main query but no luck there. Any help in this would be greatly appreciated. Let me know if there is further information to help diagnosis.

A user, who is connected with events on multiple days (e.g. session starts before midnight and ends afterwards), will occur the number of these days times in the new query. This is due to the fact, that the first query performs the DISTINCT over all rows at once while the second just removes duplicates inside each groups. Identical values in different groups will stay untouched.
So if you have a combination of DISTINCT in the select clause and a GROUP BY, the GROUP BY will be executed before the DISTINCT. Thus without any restrictions you cannot assume, that the COUNT(DISTINCT user_id) of the first query and the sum over the COUNT(DISTINCT user_id) of all groups is the same.

Xandor is absolutely correct. If a user logged on 2 different days, There is no way your 2nd query can remove them. If you need data grouped by date, You can try below query -
SELECT COUNT(user_id) AS 'unique_users', DATE(MIN_DATE) AS 'date'
FROM (SELECT user_id, MIN(DATE(server_stamp)) MIN_DATE -- Might be MAX
FROM `events`'
WHERE client_id = 123
GROUP BY user_id) X
GROUP BY DATE(server_stamp);

Related

Need some help optimising an SQL query

my client was given the following code and he uses it daily to count the messages sent to businesses on his website. I have looked at the MYSQL.SLOW.LOG and it has the following stats for this query, which indicates to me it needs optimising.
Count: 183 Time=44.12s (8073s) Lock=0.00s (0s)
Rows_sent=17337923391683297280.0 (-1), Rows_examined=382885.7
(70068089), Rows_affected=0.0 (0), thewedd1[thewedd1]#localhost
The query is:
SELECT
businesses.name AS BusinessName,
messages.created AS DateSent,
messages.guest_sender AS EnquirersEmail,
strip_tags(messages.message) AS Message,
users.name AS BusinessName
FROM
messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
My SQL is not very good but would a LEFT JOIN rather than a JOIN help to reduce the number or rows returned? Ive have run an EXPLAIN query and it seems to make no difference between the LEFT JOIN and the JOIN..
Basically I think it would be good to reduce the number of rows returned, as it is absurdly big..
Short answer: There is nothing "wrong" with your query, other than the duplicate BusinessName alias.
Long answer: You can add indexes to the foreign / primary keys to speed up searching which will do more than changing the query.
If you're using SSMS (SQL management studio) you can right click on indexes for a table and use the wizard.
Just don't be tempted to index all the columns as that may slow down any inserts you do in future, stick to the ids and _ids unless you know what you're doing.
he uses it daily to count the messages sent to businesses
If this is done per day, why not limit this to messages sent in specific recent days?
As an example: To count messages sent per business per day, for just a few recent days (example: 3 or 4 days), try this:
SELECT businesses.name AS BusinessName
, messages.created AS DateSent
, COUNT(*) AS n
FROM messages
JOIN users ON messages.from_to = users.id
JOIN businesses ON users.business_id = businesses.id
WHERE messages.created BETWEEN current_date - INTERVAL '3' DAY AND current_date
GROUP BY businesses.id
, DateSent
ORDER BY DateSent DESC
, n DESC
, businesses.id
;
Note: businesses.name is functionally dependent on businesses.id (in the GROUP BY terms), which is the primary key of businesses.
Example result:
+--------------+------------+---+
| BusinessName | DateSent | n |
+--------------+------------+---+
| business1 | 2021-09-05 | 3 |
| business2 | 2021-09-05 | 1 |
| business2 | 2021-09-04 | 1 |
| business2 | 2021-09-03 | 1 |
| business3 | 2021-09-02 | 5 |
| business1 | 2021-09-02 | 1 |
| business2 | 2021-09-02 | 1 |
+--------------+------------+---+
7 rows in set
This assumes your basic join logic is correct, which might not be true.
Other data could be returned as aggregated results, if necessary, and the fact that this is now limited to just recent data, the amount of rows examined should be much more reasonable.

I don't get why these two SQL queries with WHERE IN give different results

What I wanted to do
: DELETE row only if there is a data exists that meets the WHERE condition.
table looks like below and I'm using MySQL.
table A
+-----------+-------------+---------------------+
| id | user_id | created_date |
+-----------+-------------+---------------------+
| 17 | Amy | 2021-04-19 17:00:00 |
| 19 | Amy | 2021-04-20 17:00:00 |
| 20 | Amy | 2021-04-22 17:00:00 |
| 21 | Bob | 2021-04-22 17:00:00 |
+-----------+-------------+---------------------+
1st try
I tried below query, but it failed.
I wanted to delete only Amy's 2021-04-20 data, but it deleted all three Amy's rows.
DELETE FROM A
WHERE user_id IN (
SELECT tmp.user_id from (SELECT user_id FROM A WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp )
AND user_id="Amy";
2nd try
succeeded.
Below query only deletes one row that meets the condition.
DELETE FROM A
WHERE created_date IN (
SELECT tmp.created_date from (SELECT created_date FROM A WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp )
AND user_id="Amy";
question
I don't get why these two SQL queries give different results.
All I changed was just using another column.
Maybe I'm not fully understanding IN or subquery :(
Please give some advice.
Perhaps what can help you understand this situation is to know the order of execution of an MySQL statement.
FROM clause
WHERE clause
SELECT clause
GROUP BY clause
HAVING clause
ORDER BY clause
So in this case the interpretation of your query starts FROM your original table. however the main query in its where condition has another query that also begins to be evaluated in that order. The conditional in the WHERE statement of the subquery where in the first case is:
SELECT tmp.user_id from (SELECT user_id FROM example WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp;
The result of that query is 'Amy'. Because even though you evaluate with the createddate you are requesting the user_id in the SELECT statement.
In the second case:
SELECT tmp.created_date from (SELECT created_date FROM example WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp
The result is '2021-04-20 17:00:00' which is the result that effectively in your search you will eliminate a single record.
Continuing with the order of execution, we can already notice that in both cases the data universe changes because you are changing the search condition in your main query with the WHERE user_id IN or with the WHERE created_date IN.
In the first it is looking for everything it finds with the user_id 'Amy' and in the second it is looking for everything with the date '2021-04-20 17:00:00'.
Only theoretically, if we wanted to solve the first case, you should also include in the main query the conditional of the date field, which is the field that can differentiate the 3 cases of 'Amy'. It would be as follows:
SELECT * FROM example
WHERE user_id IN (
SELECT tmp.user_id from (SELECT user_id FROM example WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp )
and created_date IN (SELECT tmp.created_date from (SELECT created_date FROM example WHERE date(created_date)=date("2021-04-20") AND user_id="Amy") tmp)
Regards.

Get the MAX date between a range

I was searching for querys but i cant find an answer that helps me or if exit a similar question.
i need to get the info of the customers that made their last purchase between two dates
+--------+------------+------------+
| client | amt | date |
+--------+------------+------------+
| 1 | 2440.9100 | 2014-02-05 |
| 1 | 21640.4600 | 2014-03-11 |
| 2 | 6782.5000 | 2014-03-12 |
| 2 | 1324.6600 | 2014-05-28 |
+--------+------------+------------+
for example if i want to know all the cust who make the last purchase between
2014-02-11 and 2014-03-16, in that case the result must be
+--------+------------+------------+
| client | amt | date |
+--------+------------+------------+
| 1 | 21640.4600 | 2014-03-11 |
+--------+------------+------------+
cant be the client number 2 cause have a purchease on 2014-05-28,
i try to make a
SELECT MAX(date)
FROM table
GROUP BY client
but that only get the max of all dates,
i dont know if exist a function or something that can help, thanks.
well i dont know how to mark this question as resolved but this work for me
to complete the original query
SELECT client, MAX(date)
FROM table
GROUP BY client
HAVING MAX(date) BETWEEN date1 AND date2
thanks to all that took a minute to help me with my problem,
special thanks to Ollie Jones and Peter Pei Guo
Something in this format, replace date1 and date 2 with the real values.
SELECT client, max(date)
from table
group by client
having max(date) between date1 AND date2
There is more than one way to do this. Here is one of them.
select * from
(
select client, max(date) maxdate
from table
group by client ) temp
where maxdate between '2014-02-11' and '2014-03-06'
This will allow you to grab the amount column of the applicable rows as well:
select t.*
from tbl t
join (select client, max(date) as last_date
from tbl
group by client
having max(date) between date1 and date2) v
on t.client = v.client
and t.date = v.last_date
I had to change the field "Date" to "TheDate" since date is a reserved word. I assume you are using SQL? My table name is Table1. You need to group records:
SELECT Table1.Client, Sum(Table1.Amt) AS SumOfAmt, Table1.TheDate
FROM Table1
GROUP BY Table1.Client, Table1.TheDate
HAVING (((Table1.TheDate) Between #2/11/2014# And #3/16/2014#));
Query Results:
Client SumOfAmt TheDate
1 21640 03/11/14
2 6792 03/12/14
You may want to get yourself a copy of MS Access. You can generate SQL statements using their query builder which I used to generate this SQL. When I make a post here I will always test it first to make sure it works! I have never written even 1 line of SQL code, but have executed thousands of them from within MS Access.
Good luck,
Dan

MySQL: Transfer Data Based on a Column Without Also Transferring That Column

My table stores revision data for my CMS entries. Each entry has an ID and a revision date, and there are multiple revisions:
Table: old_revisions
+----------+---------------+-----------------------------------------+
| entry_id | revision_date | entry_data |
+----------+---------------+-----------------------------------------+
| 1 | 1302150011 | I like pie. |
| 1 | 1302148411 | I like pie and cookies. |
| 1 | 1302149885 | I like pie and cookies and cake. |
| 2 | 1288917372 | Kittens are cute. |
| 2 | 1288918782 | Kittens are cute but puppies are cuter. |
| 3 | 1288056095 | Han shot first. |
+----------+---------------+-----------------------------------------+
I want to transfer some of this data to another table:
Table: new_revisions
+--------------+----------------+
| new_entry_id | new_entry_data |
+--------------+----------------+
| | |
+--------------+----------------+
I want to transfer entry_id and entry_data to new_entry_id and new_entry_data. But I only want to transfer the most recent version of each entry.
I got as far as this query:
INSERT INTO new_revisions (
new_entry_id,
new_entry_data
)
SELECT
entry_id,
entry_data,
MAX(revision_date)
FROM old_revisions
GROUP BY entry_id
But I think the problem is that I'm trying to insert 3 columns of data into 2 columns.
How do I transfer the data based on the revision date without transferring the revision date as well?
You can use the following query:
insert into new_revisions (new_entry_id, new_entry_data)
select o1.entry_id, o1.entry_data
from old_revisions o1
inner join
(
select max(revision_date) maxDate, entry_id
from old_revisions
group by entry_id
) o2
on o1.entry_id = o2.entry_id
and o1.revision_date = o2.maxDate
See SQL Fiddle with Demo. This query gets the max(revision_date) for each entry_id and then joins back to your table on both the entry_id and the max date to get the rows to be inserted.
Please note that the subquery is only returning the entry_id and date, this is because we want to apply the GROUP BY to the items in the select list that are not in an aggregate function. MySQL uses an extension to the GROUP BY clause that allows columns in the select list to be excluded in a group by and aggregate but this could causes unexpected results. By only including the columns needed by the aggregate and the group by will ensure that the result is the value you want. (see MySQL Extensions to GROUP BY)
From the MySQL Docs:
MySQL extends the use of GROUP BY so that the select list can refer to nonaggregated columns not named in the GROUP BY clause. ... You can use this feature to get better performance by avoiding unnecessary column sorting and grouping. However, this is useful primarily when all values in each nonaggregated column not named in the GROUP BY are the same for each group. The server is free to choose any value from each group, so unless they are the same, the values chosen are indeterminate. Furthermore, the selection of values from each group cannot be influenced by adding an ORDER BY clause. Sorting of the result set occurs after values have been chosen, and ORDER BY does not affect which values the server chooses.
If you want to enter the last entry you need to filter it before:
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id;
Then use this as a subquery to filter the data you need:
insert into new_revisions (new_entry_id, new_entry_data)
select entry_id, entry_data
from old_revisions as o
inner join (
select entry_id, max(revision_date) as maxDate
from old_revisions
group by entry_id
) as a on o.entry_id = a.entry_id and o.revision_date = a.maxDate

how to find duplicate count without counting original

I need to count the number of duplicate emails in a mysql database, but without counting the first one (considered the original). In this table, the query result should be the single value "3" (2 duplicate x#q.com plus 1 duplicate f#q.com).
TABLE
ID | Name | Email
1 | Mike | x#q.com
2 | Peter | p#q.com
3 | Mike | x#q.com
4 | Mike | x#q.com
5 | Frank | f#q.com
6 | Jim | f#q.com
My current query produces not one number, but multiple rows, one per email address regardless of how many duplicates of this email are in the table:
SELECT value, count(lds1.leadid) FROM leads_form_element lds1 LEFT JOIN leads lds2 ON lds1.leadID = lds2.leadID
WHERE lds2.typesID = "31" AND lds1.formElementID = '97'
GROUP BY lds1.value HAVING ( COUNT(lds1.value) > 1 )
It's not one query so I'm not sure if it would work in your case, but you could do one query to select the total number of rows, a second query to select distinct email addresses, and subtract the two. This would give you the total number of duplicates...
select count(*) from someTable;
select count(distinct Email) from someTable;
In fact, I don't know if this will work, but you could try doing it all in one query:
select (count(*)-(count(distinct Email))) from someTable
Like I said, untested, but let me know if it works for you.
Try doing a group by in a sub query and then summing up. Something like:
select sum(tot)
from
(
select email, count(1)-1 as tot
from table
group by email
having count(1) > 1
)