MySQL GROUP_CONCAT DISTINCT ordering anomaly

I've been reading about this for the past day (even here), and have not found a suitable resource, so I'm popping it out there again :)
Check out these two queries:
SELECT DISTINCT transactions.StoreNumber FROM transactions WHERE PersonID=2 ORDER BY transactions.transactionID DESC;
and
SELECT GROUP_CONCAT(DISTINCT transactions.StoreNumber ORDER BY transactions.transactionID DESC SEPARATOR ',') FROM transactions WHERE PersonID=2 ORDER BY transactions.transactionID DESC;
From everything I've read I would expect the two queries to return the same results, with the second set grouped into CSV. They're not though.
Result set for query 1 (picture each value in its own row, formatting results here is cumbersome):
'611'
'345'
'340'
'310'
'327'
'323'
'362'
'360'
'330'
'379'
'356'
'367'
'375'
'306'
'354'
'389'
'343'
'346'
'357'
'733'
'370'
'347'
'703'
'355'
'341'
'342'
'358'
'351'
'319'
'365'
'372'
'368'
'353'
'363'
'349'
'369'
'336'
'364'
'202'
'366'
'416'
'731'
Result Set for query 2:
611,379,375,389,703,355,351,372,368,362,342,365,353,341,733,347,336,319,354,306,345,364,202,358,370,343,366,349,356,367,369,416,323,346,731,360,363,330,310,357,340,327
If I remove the DISTINCT clause, the results line up.
Can anyone point out what I'm doing wrong with the difference between the queries above?
The fact that removing DISTINCT from each query returns the same result indicates that DISTINCT is problematic within GROUP_CONCAT. Performing a GROUP BY outside the GROUP_CONCAT causes multiple rows to be returned, which isn't what I'm after.
Any ideas on how I can get a GROUP_CONCAT DISTINCT list of StoreNumber, in order of TransactionID DESC?
Thanks all

Consider your first query:
SELECT DISTINCT transactions.StoreNumber
FROM transactions
WHERE PersonID=2
ORDER BY transactions.transactionID DESC;
This is equivalent to:
SELECT transactions.StoreNumber
FROM transactions
WHERE PersonID=2
group by transactions.StoreNumber
ORDER BY transactions.transactionID DESC;
You are ordering by a column that is not in the select list, so MySQL picks an arbitrary transactionID for each StoreNumber. That choice may differ from one execution to another.
I believe the same thing is happening inside GROUP_CONCAT(). The issue is that the arbitrary value chosen is different in each query.
If you want consistency, consider these two queries:
SELECT transactions.StoreNumber
FROM transactions
WHERE PersonID=2
group by transactions.StoreNumber
ORDER BY min(transactions.transactionID) DESC;
and:
SELECT GROUP_CONCAT(DISTINCT t.StoreNumber ORDER BY t.minTransactionId DESC SEPARATOR ',')
FROM (select t.StoreNumber, min(t.transactionID) as minTransactionId
from transactions t
WHERE t.PersonID=2
group by t.StoreNumber
) t
These should produce the same results.
Before you complain too loudly about MySQL, most other databases would return an error on the first query, because, when using select distinct, you can only order by columns in the select list (or expressions composed of them).
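To make the point about ordering concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL. The sample rows and the store_list helper are made up for illustration; only the table and column names come from the question. It ties each StoreNumber to one well-defined transactionID (MIN or MAX) before sorting, and builds the comma-separated list in Python so that no GROUP_CONCAT ordering behaviour is assumed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "transactionID INTEGER PRIMARY KEY, PersonID INTEGER, StoreNumber TEXT)"
)
# Hypothetical sample data: person 2 visits stores 340 and 611 twice each.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 2, "340"), (2, 2, "611"), (3, 2, "345"),
     (4, 2, "340"), (5, 2, "611"), (6, 1, "999")],
)


def store_list(person_id, agg):
    """CSV of distinct stores, ordered by MIN or MAX transactionID, descending."""
    if agg not in ("MIN", "MAX"):
        raise ValueError("agg must be MIN or MAX")
    rows = conn.execute(
        f"SELECT StoreNumber FROM transactions WHERE PersonID = ? "
        f"GROUP BY StoreNumber ORDER BY {agg}(transactionID) DESC",
        (person_id,),
    ).fetchall()
    return ",".join(r[0] for r in rows)


by_first_visit = store_list(2, "MIN")  # ordered by earliest transaction per store
by_last_visit = store_list(2, "MAX")   # ordered by latest transaction per store
print(by_first_visit)
print(by_last_visit)
```

The two lists differ, which is the whole point: "distinct stores in transaction order" only has a defined answer once you say which transaction of each store counts.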

Related

SQL Query, What have I done wrong? I am fairly new to mySQL

Solution
problem
SELECT agency_name, Count(*) AS complaint_type_count
FROM service_request_xs
GROUP BY agency_name
ORDER BY Count(*) DESC;
solution uploaded

You have to tell the count() function what to count. You can insert an individual column, or * for all of it etc. But you have to count something.
This is a fiddle, showing how it works:
https://www.db-fiddle.com/f/dbPnE4BXv8oRRkQY4WQs8v/1
SELECT agency_name,
COUNT(DISTINCT compliant_type) AS complaint_type_count
FROM service_request_xs
GROUP BY agency_name
ORDER BY COUNT(DISTINCT compliant_type) DESC;

When you use the COUNT() function, you have to put something between the parentheses, such as COUNT(complaint_type); otherwise SQL will not know what to count.
You have to do this in both the SELECT and the ORDER BY.
You can also use COUNT(*) to count all rows (all rows in each group, when the query has a GROUP BY).
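As a rough illustration of the difference between COUNT(*) and COUNT(DISTINCT column), here is a small sketch using Python's sqlite3 module in place of MySQL. The agency names and rows are invented, and the column is assumed to be spelled complaint_type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE service_request_xs (agency_name TEXT, complaint_type TEXT)"
)
# Hypothetical rows: Agency A has three requests but only two distinct types.
conn.executemany(
    "INSERT INTO service_request_xs VALUES (?, ?)",
    [("Agency A", "Noise"), ("Agency A", "Noise"), ("Agency A", "Parking"),
     ("Agency B", "Pothole")],
)

counts = conn.execute("""
    SELECT agency_name,
           COUNT(*) AS request_count,
           COUNT(DISTINCT complaint_type) AS complaint_type_count
    FROM service_request_xs
    GROUP BY agency_name
    ORDER BY COUNT(DISTINCT complaint_type) DESC
""").fetchall()

for agency, requests, kinds in counts:
    print(agency, requests, kinds)
```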

How to Insert Data into Tempdb Until Amount of Distinct Values Met

I have a main table named tblorder.
It contains CUID(Customer ID), CuName(Customer Name) and OrDate(Order Date) that I care about.
It is currently ordered by date in ascending order(ex. 2001 before 2002).
Objective:
Trying to retrieve most recent 1 Million DISTINCT Customer's CUID and CuNameS, and Insert them Into a Tempdb(#Recent1M) for Later Joining Uses.
So I:
Would Need Order By desc to flip the date to retrieve most recent 1 Million Customers
Only want first 1 Million DISTINCT Customer Information(CUID, CuName)
I know following code is not correct, but it is the main idea. I just can't figure out the correct syntax. So far I have the While Loop with Select Into as the most plausible solution.
SQL Platform: SSMS
Declare #DC integer
Set #DC = Count(distinct(CUID)) from #Recent1M))
While (#DC <1000000)
Begin
Select CuID,CuName into #Recent1MCus from tblorder
End
Thank you very much, I appreciate any help!

TOP 1000000 is the way to go, but you're going to need an ORDER BY clause or you will get arbitrary results. In your case, you mentioned that you wanted the most recent ones, so:
ORDER BY OrDate DESC
Also, you might consider using GROUP BY rather than DISTINCT. I think it looks cleaner and keeps the select list a select list so you have the option to include whatever else you might want (as I took the liberty of doing). Notice that, because of the grouping, the ORDER BY now uses MAX(ordate), since customers can presumably have multiple ordate values and we are interested in the most recent. So:
select top 1000000 cuid, cuname, sum(order_value) as ca_ching, count(distinct(order_id)) as order_count
into #Recent1MCus
from tblorder
group by cuid, cuname
order by max(ordate) desc
I hope this helps.
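For anyone who wants to try the GROUP BY / MAX(ordate) idea without a SQL Server instance, here is a minimal sketch using Python's sqlite3 module. SQLite has no TOP, so LIMIT stands in for it, and a plain list replaces the temp table. The sample customers and dates are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblorder (CUID INTEGER, CuName TEXT, OrDate TEXT)")
# Hypothetical orders; ISO dates sort correctly as text.
conn.executemany(
    "INSERT INTO tblorder VALUES (?, ?, ?)",
    [(1, "Ann", "2001-01-01"), (2, "Bob", "2002-05-01"),
     (1, "Ann", "2003-03-01"), (3, "Cy", "2002-12-31"),
     (2, "Bob", "2001-07-01")],
)

how_many = 2  # stands in for the 1,000,000 in the question
recent_customers = conn.execute(
    "SELECT CUID, CuName FROM tblorder "
    "GROUP BY CUID, CuName "
    "ORDER BY MAX(OrDate) DESC "  # each customer ranked by their latest order
    "LIMIT ?",
    (how_many,),
).fetchall()

print(recent_customers)
```

Each customer appears once because of the grouping, and the ranking uses that customer's most recent order, so no loop is needed.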

Wouldn't you just do this?
select distinct top 1000000 cuid, cuname
into #Recent1MCus
from tblorder;
If the names might not be distinct, you can do:
select top 1000000 cuid, cuname
into #Recent1MCus
from (select o.*, row_number() over (partition by cuid order by ordate desc) as seqnum
from tblorder o
) o
where seqnum = 1
order by ordate desc;

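In SQL Server, SELECT DISTINCT cannot be ordered by a column that is not in the select list, so group by the customer and order by each customer's latest order date to get the latest unique records.
Try this SQL query:
SELECT TOP 1000000
cuid,
cuname
INTO #Recent1MCus
FROM tblorder
GROUP BY cuid, cuname
ORDER BY MAX(OrDate) DESC;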

Mysql derived table order by

If I run the following query:
SELECT *
FROM `smp_data_log`
WHERE Post_id = 1234 AND Account_id = 1306
ORDER BY Created_time DESC
I get 7 rows back including entries with the following Created_times:
1) 1424134801
2) 1424134801
3) 1421802001
4) 3601
If I run the following query:
SELECT mytable.*
FROM (SELECT * FROM `smp_data_log` ORDER BY Created_time DESC) AS mytable
WHERE Post_id = 1234 AND Account_id = 1306
GROUP BY Post_id
I would expect to see 1424134801 come back as a single row, but instead I am seeing 3601. I would have thought this would return the latest time (as it's sorted descending). What am I doing wrong?

Your expectation is wrong. And this is well documented in MySQL. You are using an extension, where you have columns in the select that are not in the group by -- a very bad habit and one that doesn't work in other databases (except in some very special circumstances allowed by the ANSI standard).
Just use join to get what you really want:
SELECT l.*
FROM smp_data_log l JOIN
(select post_id, max(created_time) as maxct
from smp_data_log
group by post_id
) lmax
on lmax.post_id = l.post_id and lmax.maxct = l.created_time;
Here is the quote from the documentation:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
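Here is a small sketch of the join-on-MAX approach, run against SQLite through Python's sqlite3 module rather than MySQL. The Id column and the extra post are assumptions added so the rows can be told apart; the Created_time values are the ones from the question. Note that when two rows share the latest Created_time, the join returns both of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE smp_data_log ("
    "Id INTEGER PRIMARY KEY, Post_id INTEGER, Account_id INTEGER, "
    "Created_time INTEGER)"
)
conn.executemany(
    "INSERT INTO smp_data_log VALUES (?, ?, ?, ?)",
    [(1, 1234, 1306, 3601),
     (2, 1234, 1306, 1421802001),
     (3, 1234, 1306, 1424134801),
     (4, 1234, 1306, 1424134801),
     (5, 5678, 1306, 200)],  # hypothetical second post
)

latest = conn.execute("""
    SELECT l.Id, l.Post_id, l.Created_time
    FROM smp_data_log l
    JOIN (SELECT Post_id, MAX(Created_time) AS maxct
          FROM smp_data_log
          GROUP BY Post_id) lmax
      ON lmax.Post_id = l.Post_id AND lmax.maxct = l.Created_time
    WHERE l.Post_id = 1234
    ORDER BY l.Id
""").fetchall()

print(latest)
```

If exactly one row per post is required, an extra tie-breaker (for example the highest Id) would have to be added; that is a design choice the original answer leaves open.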

MySQL Group By and HAVING

I'm a MySQL query noobie so I'm sure this is a question with an obvious answer.
But, I was looking at these two queries. Will they return different result sets? I understand that the sorting process would commence differently, but I believe they will return the same results with the first query being slightly more efficient?
Query 1: HAVING, then AND
SELECT user_id
FROM forum_posts
GROUP BY user_id
HAVING COUNT(id) >= 100
AND user_id NOT IN (SELECT user_id FROM banned_users)
Query 2: WHERE, then HAVING
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN(SELECT user_id FROM banned_users)
GROUP BY user_id
HAVING COUNT(id) >= 100

Actually the first query will be less efficient (HAVING applied after WHERE).
UPDATE
Some pseudo code to illustrate how your queries are executed ([very] simplified version).
First query:
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Group, count, etc.
4. Exclude records from the first result set if they are present in the second
Second query
1. SELECT user_id FROM forum_posts
2. SELECT user_id FROM banned_users
3. Exclude records from the first result set if they are present in the second
4. Group, count, etc.
The order of steps 1 and 2 is not important; MySQL can choose whatever it thinks is better. The important difference is in steps 3 and 4. HAVING is applied after GROUP BY. Grouping is usually more expensive than joining (excluding records can be considered a join operation in this case), so the fewer records it has to group, the better the performance.
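To check that the two placements of the banned-user filter select the same users, here is a quick sketch using Python's sqlite3 module as a stand-in for MySQL. The rows are made up, and the threshold is lowered from 100 to 2 so a handful of posts is enough. It says nothing about speed, only about results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forum_posts (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE TABLE banned_users (user_id INTEGER)")
# Hypothetical data: user 1 has 3 posts, user 2 has 2 (banned), user 3 has 1.
conn.executemany(
    "INSERT INTO forum_posts VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)
conn.execute("INSERT INTO banned_users VALUES (2)")

min_posts = 2  # the question uses 100

having_then_and = conn.execute(
    "SELECT user_id FROM forum_posts "
    "GROUP BY user_id "
    "HAVING COUNT(id) >= ? "
    "AND user_id NOT IN (SELECT user_id FROM banned_users) "
    "ORDER BY user_id",
    (min_posts,),
).fetchall()

where_then_having = conn.execute(
    "SELECT user_id FROM forum_posts "
    "WHERE user_id NOT IN (SELECT user_id FROM banned_users) "
    "GROUP BY user_id "
    "HAVING COUNT(id) >= ? "
    "ORDER BY user_id",
    (min_posts,),
).fetchall()

print(having_then_and, where_then_having)
```

The results match because the filter is on user_id, which is the grouping column: removing a banned user before or after grouping drops exactly the same group.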

You already have answers saying that the two queries will show the same results, and various opinions on which one is more efficient.
My opinion is that there will be a difference in efficiency (speed) only if the optimizer comes up with different plans for the 2 queries. I think that in the latest MySQL versions the optimizer is smart enough to find the same plan for either query, so there will be no difference at all, but of course one can test this, either by checking the execution plans with EXPLAIN or by running the 2 queries against some test tables.
I would use the second version in any case, just to play safe.
Let me add that:
COUNT(*) is usually more efficient than COUNT(notNullableField) in MySQL. Until that is fixed in future MySQL versions, use COUNT(*) where applicable.
Therefore, you can also use:
SELECT user_id
FROM forum_posts
WHERE user_id NOT IN
( SELECT user_id FROM banned_users )
GROUP BY user_id
HAVING COUNT(*) >= 100
There are also other ways to get the same result as NOT IN, filtering out banned users before GROUP BY is applied.
Using LEFT JOIN / NULL :
SELECT fp.user_id
FROM forum_posts AS fp
LEFT JOIN banned_users AS bu
ON bu.user_id = fp.user_id
WHERE bu.user_id IS NULL
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Using NOT EXISTS :
SELECT fp.user_id
FROM forum_posts AS fp
WHERE NOT EXISTS
( SELECT *
FROM banned_users AS bu
WHERE bu.user_id = fp.user_id
)
GROUP BY fp.user_id
HAVING COUNT(*) >= 100
Which of the 3 methods is faster depends on your table sizes and a lot of other factors, so best is to test with your data.
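The three variants can be compared side by side with a short sketch, again using Python's sqlite3 module instead of MySQL, with invented rows and the threshold lowered to 2. One assumption worth stating: they only agree as long as banned_users.user_id contains no NULL. The last step inserts a NULL to show that NOT IN then matches nothing, while the other two forms are unaffected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forum_posts (id INTEGER PRIMARY KEY, user_id INTEGER)")
conn.execute("CREATE TABLE banned_users (user_id INTEGER)")
# Hypothetical data: user 1 has 3 posts, user 2 has 2 (banned), user 3 has 1.
conn.executemany(
    "INSERT INTO forum_posts VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)
conn.execute("INSERT INTO banned_users VALUES (2)")

QUERIES = {
    "not_in": """
        SELECT user_id FROM forum_posts
        WHERE user_id NOT IN (SELECT user_id FROM banned_users)
        GROUP BY user_id HAVING COUNT(*) >= 2 ORDER BY user_id""",
    "left_join": """
        SELECT fp.user_id FROM forum_posts AS fp
        LEFT JOIN banned_users AS bu ON bu.user_id = fp.user_id
        WHERE bu.user_id IS NULL
        GROUP BY fp.user_id HAVING COUNT(*) >= 2 ORDER BY fp.user_id""",
    "not_exists": """
        SELECT fp.user_id FROM forum_posts AS fp
        WHERE NOT EXISTS (SELECT * FROM banned_users AS bu
                          WHERE bu.user_id = fp.user_id)
        GROUP BY fp.user_id HAVING COUNT(*) >= 2 ORDER BY fp.user_id""",
}


def run_all():
    """Run each variant and return {name: list of user_ids}."""
    return {name: [r[0] for r in conn.execute(sql)] for name, sql in QUERIES.items()}


without_null = run_all()

# A NULL in the subquery makes every NOT IN comparison unknown, so no row passes.
conn.execute("INSERT INTO banned_users VALUES (NULL)")
with_null = run_all()

print(without_null)
print(with_null)
```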

HAVING conditions are applied to the grouped by results, and since you group by user_id, all of their possible values will be present in the grouped result, so the placing of the user_id condition is not important.

To me, the second query is more efficient because it lowers the number of records that GROUP BY and HAVING have to process.
Alternatively, you may try the following query to avoid using IN:
SELECT `fp`.`user_id`
FROM `forum_posts` `fp`
LEFT JOIN `banned_users` `bu` ON `fp`.`user_id` = `bu`.`user_id`
WHERE `bu`.`user_id` IS NULL
GROUP BY `fp`.`user_id`
HAVING COUNT(`fp`.`id`) >= 100
Hope this helps.

No, they do not give the same results.
The first query filters the grouped records with the COUNT(id) condition first.
The other query filters the records and then applies the HAVING clause.
The second query is the one that is written correctly.

Mysql returns only one row when using Count

Well I've just hit a weird behaviour that I've never seen before, or haven't noticed.
I'm using this query:
SELECT *,
COUNT(*) AS pages
FROM notis
WHERE cid = 20
ORDER BY nid DESC
LIMIT 0, 3
...to read 3 items but while doing that I want to get the total rows.
Problem is...
...when I use COUNT the query only returns one row, but if I remove
COUNT(*) AS pages -- I get the 3 rows as I'm supposed to. Obviously, I'm missing something here.

Yeah, COUNT is an aggregate function, so without a GROUP BY clause the query returns only one row.
Maybe make two separate queries? It doesn't make sense to have the row return the data and the total number of rows, because that data doesn't belong together.
If you really really want it, you can do something like this:
SELECT *, (select count(*) FROM notis WHERE cid=20) AS count FROM notis WHERE cid=20 ORDER BY nid DESC LIMIT 0,3
or this:
SELECT N.*, C.total FROM notis N CROSS JOIN (SELECT COUNT(*) AS total FROM notis WHERE cid=20) C WHERE N.cid=20 ORDER BY N.nid DESC LIMIT 0,3
With variances on the nested expression depending on your SQL dialect.

Using an aggregate function without a GROUP BY will always return one row, no matter what. You must use a GROUP BY if you want to return more than one row.
Note that on most RDBMS, such a query would have failed because it makes no sense.

This is inefficient, but will work:
SELECT
*,
(SELECT COUNT(*) FROM notis WHERE cid=20) AS pages
FROM notis
WHERE cid=20
ORDER BY nid DESC
LIMIT 0,3
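Here is a small sketch of both behaviours, using Python's sqlite3 module as a stand-in for MySQL (SQLite happens to accept the same bare-column aggregate and the LIMIT offset, count form). The body column and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notis (nid INTEGER PRIMARY KEY, cid INTEGER, body TEXT)")
# Hypothetical rows: five notifications for cid 20, two for cid 21.
conn.executemany(
    "INSERT INTO notis VALUES (?, ?, ?)",
    [(1, 20, "a"), (2, 20, "b"), (3, 20, "c"), (4, 20, "d"), (5, 20, "e"),
     (6, 21, "f"), (7, 21, "g")],
)

# The query from the question: the aggregate collapses everything to one row.
collapsed = conn.execute(
    "SELECT *, COUNT(*) AS pages FROM notis "
    "WHERE cid = ? ORDER BY nid DESC LIMIT 0, 3",
    (20,),
).fetchall()

# Scalar subquery: the page of rows stays intact and each row carries the total.
page = conn.execute(
    "SELECT nid, (SELECT COUNT(*) FROM notis WHERE cid = ?) AS pages "
    "FROM notis WHERE cid = ? ORDER BY nid DESC LIMIT 0, 3",
    (20, 20),
).fetchall()

print(len(collapsed), collapsed[0][-1])
print(page)
```

Running the count as a second, separate query is the other obvious option; the subquery form just keeps everything in one round trip.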
