How does GROUP BY DESC select its order? - mysql

So I am creating sections for a store. The store can have multiple scopes; if there isn't a section_identifier set for a given store_id, it should fall back to the global store, which is 0.
The SQL command that I want should return a list of section_options for any related given store.
Example of my table:
SELECT * FROM my_table:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 17 | header | header_option_one | 1 |
| 18 | footer | footer_option_one | 0 |
| 19 | homepage_feature | homepage_feature_one | 0 |
| 23 | header | header_option_three | 0 |
| 25 | homepage_feature | homepage_feature_one | 1 |
+----+--------------------+----------------------+----------+
So section_identifier should be unique in the result; the IDs I need back for store 1 would be 17, 18 and 25.
When I run:
SELECT * FROM my_table GROUP BY section_identifier it returns:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
| 23 | header | header_option_three | 0 |
| 19 | homepage_feature | homepage_feature_one | 0 |
+----+--------------------+----------------------+----------+
This means if I run SELECT * FROM my_table GROUP BY section_identifier DESC:
I get the response (this is my desired output):
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 25 | homepage_feature | homepage_feature_one | 1 |
| 17 | header | header_option_one | 1 |
| 18 | footer | footer_option_one | 0 |
+----+--------------------+----------------------+----------+
Although this works, I have no understanding as to why.
It's my understanding that the initial GROUP BY should get the first instance in the database, i.e. the response I expect should be:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
| 17 | header | header_option_one | 1 |
| 19 | homepage_feature | homepage_feature_one | 0 |
+----+--------------------+----------------------+----------+
However, it seems to be referencing my store_id somehow? I have tried a few different combinations and I'm weirdly getting my expected result each time, but I have no understanding as to why.
Can anybody explain this to me please?
PS
I have tried updating the option_identifier of id = 7 to see if MySQL references the latest saved on disk and it didn't change the result.
Also: I'm not planning on using this feature or asking for an alternative, I'm asking what's going on with it?

SELECT * FROM my_table GROUP BY section_identifier
is an invalid SQL query.
How does GROUP BY work?
Let's take the query above and see how GROUP BY works. First, the database engine selects all the rows that match the WHERE clause. There is no WHERE clause in this query; this means all the rows of the table are used to generate the result set.
It then groups the rows using the expressions specified in the GROUP BY clause:
+----+--------------------+----------------------+----------+
| id | section_identifier | option_identifier | store_id |
+----+--------------------+----------------------+----------+
| 17 | header | header_option_one | 1 |
| 23 | header | header_option_three | 0 |
+----+--------------------+----------------------+----------+
| 18 | footer | footer_option_one | 0 |
+----+--------------------+----------------------+----------+
| 19 | homepage_feature | homepage_feature_one | 0 |
| 25 | homepage_feature | homepage_feature_one | 1 |
+----+--------------------+----------------------+----------+
I marked the groups in the listing above to make everything clear.
On the next step, from each group the database engine produces a single row. But how?
The SELECT clause of your query is SELECT *. * stands for the full list of table columns; in this case, SELECT * is a short way to write:
SELECT id, section_identifier, option_identifier, store_id
Let's analyze the values of column id for the first group. What value should the database engine choose for id? 17 or 23? Why 17 and why 23?
It does not have any criteria to favor 17 over 23. It just picks one of them (probably 17, but this depends on a lot of internal factors) and goes on.
There is no problem to determine the value for section_identifier. It is the column used to GROUP BY, all its values in a group are equal.
The choosing dilemma occurs again on columns option_identifier and store_id.
According to standard SQL, your query is not valid and cannot be executed. However, some database engines run it as described above. The values of expressions that do not meet at least one of the conditions below are indeterminate:
used in the GROUP BY clause;
used with GROUP BY aggregate functions in the SELECT clause;
functionally dependent on columns used in the GROUP BY clause.
Since version 5.7.5, MySQL implements functional dependency detection and, by default, it rejects an invalid GROUP BY query like yours.
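The grouping step described above can be sketched with a query in which every selected expression is either the grouping column or an aggregate, so nothing is left indeterminate. This is only an illustration: it uses SQLite through Python's sqlite3 module as a stand-in for MySQL, and the function name group_summary is made up for the example.

```python
import sqlite3

# Sample rows copied from the question's table.
ROWS = [
    (17, "header", "header_option_one", 1),
    (18, "footer", "footer_option_one", 0),
    (19, "homepage_feature", "homepage_feature_one", 0),
    (23, "header", "header_option_three", 0),
    (25, "homepage_feature", "homepage_feature_one", 1),
]


def group_summary():
    """Return (section_identifier, row count, smallest id, largest id) per group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table ("
        "id INTEGER PRIMARY KEY, section_identifier TEXT, "
        "option_identifier TEXT, store_id INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)
    # Only the grouping column and aggregates are selected, and the order
    # is requested explicitly with ORDER BY instead of relying on GROUP BY.
    result = conn.execute(
        "SELECT section_identifier, COUNT(*), MIN(id), MAX(id) "
        "FROM my_table "
        "GROUP BY section_identifier "
        "ORDER BY section_identifier"
    ).fetchall()
    conn.close()
    return result


print(group_summary())
```

The header group reports both 17 and 23 through MIN and MAX, which makes visible the choice the engine would otherwise have to make silently for a bare id column.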
How to make it work
It's not clear to me how you want to get the result set. Anyway, if you want to get some rows from the table, then GROUP BY is not the correct way to do it. GROUP BY does not select rows from a table; it generates new values using the values from the table. A row generated by GROUP BY, most of the time, does not match any row from the source table.
You can find a possible solution to your problem in this answer. You'll have to write the query yourself after you read and understand the idea (and it is very clear to you how the "winner" rows should be selected).
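One way the "winner" rows could be selected without GROUP BY is sketched below, assuming the rule from the question: take the row for the requested store, and fall back to store 0 only when that store has no row for the section. This is not necessarily the approach of the linked answer. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, and the names FALLBACK_SQL and section_ids_for_store are hypothetical.

```python
import sqlite3

# Sample rows copied from the question's table.
ROWS = [
    (17, "header", "header_option_one", 1),
    (18, "footer", "footer_option_one", 0),
    (19, "homepage_feature", "homepage_feature_one", 0),
    (23, "header", "header_option_three", 0),
    (25, "homepage_feature", "homepage_feature_one", 1),
]

# Keep a row if it belongs to the requested store, or if it is a global
# (store 0) row whose section has no row for the requested store.
FALLBACK_SQL = """
SELECT t.id
FROM my_table AS t
WHERE t.store_id = :store
   OR (t.store_id = 0
       AND NOT EXISTS (SELECT 1
                       FROM my_table AS s
                       WHERE s.section_identifier = t.section_identifier
                         AND s.store_id = :store))
ORDER BY t.id
"""


def section_ids_for_store(store_id):
    """Return the ids of the rows that apply to store_id, with fallback to store 0."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table ("
        "id INTEGER PRIMARY KEY, section_identifier TEXT, "
        "option_identifier TEXT, store_id INTEGER)"
    )
    conn.executemany("INSERT INTO my_table VALUES (?, ?, ?, ?)", ROWS)
    ids = [row[0] for row in conn.execute(FALLBACK_SQL, {"store": store_id})]
    conn.close()
    return ids


print(section_ids_for_store(1))
```

For store 1 this yields the ids 17, 18 and 25 named in the question, and the result does not depend on any implicit ordering or on which row the engine happens to pick for a group.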

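In MySQL versions before 8.0, GROUP BY implicitly sorts the groups in ascending order of the grouping columns, and GROUP BY ... DESC reverses that order. Your store_id is not being referenced at all; the records returned are simply sorted by section_identifier. Which row represents each group is still indeterminate.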

Related

SQL: How to get last date in groupby for getting unique records?

I am working on data where I have to use multiple joins, and I figured out that one of the tables is producing duplicates: I applied GROUP BY on the dates as well, and because of the different dates my query returns duplicate values.
I wrote following query
SELECT
ll.ID,
ll.EST_DT,
gg.col1 ,
ll.EST_CLAIM_DT,
gg.col2
FROM table gg
inner join
(select substr(ID,1,instr(ID,'-',7)-1) EST_ID,
max(est_dt) as EST_DT,
max(EST_CLAIM_DT) as EST_CLAIM_DT
from table group by substr(gg.ID,1,instr(ID,'-',7)-1)) ll
on substr(ID,1,instr(gg.ID,'-',7)-1)=substr(ll.ID,1,instr(ll.ID,'-',7)-1)
GROUP BY
ll.ID,
ll.EST_DT,
gg.col1 ,
ll.EST_CLAIM_DT,
gg.col2
Table looks like this:
+-----------------+------------+----------------+------+------+
| ID | est_date | est_claimed_dt | col1 | col2 |
+-----------------+------------+----------------+------+------+
| EST-U-1040452-1 | 28/02/2019 | 28/02/2019 | 50 | 50 |
| EST-U-1040452-2 | 5/10/2020 | 5/10/2020 | 50 | 50 |
+-----------------+------------+----------------+------+------+
Desired output
+---------+-----------+----------------+------+------+
| ID | est_date | est_claimed_dt | col1 | col2 |
+---------+-----------+----------------+------+------+
| 1040452 | 5/10/2020 | 5/10/2020 | 50 | 50 |
+---------+-----------+----------------+------+------+
I get this error as well
Negative sub string length not allowed
P.S. I have searched SO for this issue and it helped, but I couldn't get it to work.

Why should I write the rest of columns into GROUP BY when there is an aggregate function?

I have this table structure:
// mytable
+----+------+-------+-------------+
| id | type | score | unix_time |
+----+------+-------+-------------+
| 1 | 1 | 5 | 1463508841 |
| 2 | 1 | 10 | 1463508842 |
| 3 | 2 | 5 | 1463508843 |
| 4 | 1 | 5 | 1463508844 |
| 5 | 2 | 15 | 1463508845 |
| 6 | 1 | 10 | 1463508846 |
+----+------+-------+-------------+
And here is my query:
SELECT SUM(score), unix_time
FROM mytable
WHERE 1
GROUP BY type
And here is the output:
+-------+-------------+
| score | unix_time |
+-------+-------------+
| 30 | 1463508841 |
| 20 | 1463508843 |
+-------+-------------+
OK, all fine. There is just one thing: professionals suggest that I add unix_time to the GROUP BY. They believe doing that is the basis of grouping and aggregate functions.
But why should I really add an (almost) unique column to the GROUP BY? If I do that, each row will be a separate group and there will be a lot of useless extra rows:
+-------+-------------+
| score | unix_time |
+-------+-------------+
| 30 | 1463508841 |
| 30 | 1463508842 |
| 20 | 1463508843 |
| 30 | 1463508844 |
| 20 | 1463508845 |
| 30 | 1463508846 |
+-------+-------------+
See? There are a lot of extra rows. So why is doing that the standard? Why does everybody tell me that MySQL works without it but no other database does? I really don't understand why I should do that.
Could someone please make this clear and explain to me how exactly GROUP BY works? Is it different from my understanding?
Not having unix_time in the GROUP BY clause is a non-standard MySQL hack that I would totally stay away from. The values of unix_time across all the rows with the same type are completely different. How do you know which unix_time should appear?
In your example, you seem perfectly content to use a completely arbitrary value of unix_time per group.
However this is a recipe for disaster. What does it even mean to pick some totally arbitrary value from a group? What if the unix_times were spread out by days or weeks or even years? Which one would you take then?
The reason the pros are telling you to put it in the group by clause is so that the result makes sense! Another approach is to leave unix_time out of the select completely, as the result you are getting shouldn't be relied upon.
Maybe you need something like this:
SELECT type,
SUM(score) as sum_of_score,
MIN(unix_time) as start_unix_time,
MAX(unix_time) as end_unix_time
FROM mytable
WHERE 1
GROUP BY type
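To see both variants side by side, here is a small sketch that loads the question's rows and runs the query above as well as the GROUP BY type, unix_time variant the asker was advised to use. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL; the helper names are hypothetical.

```python
import sqlite3

# Rows copied from the question's table: (id, type, score, unix_time).
ROWS = [
    (1, 1, 5, 1463508841),
    (2, 1, 10, 1463508842),
    (3, 2, 5, 1463508843),
    (4, 1, 5, 1463508844),
    (5, 2, 15, 1463508845),
    (6, 1, 10, 1463508846),
]


def _connect():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mytable ("
        "id INTEGER PRIMARY KEY, type INTEGER, score INTEGER, unix_time INTEGER)"
    )
    conn.executemany("INSERT INTO mytable VALUES (?, ?, ?, ?)", ROWS)
    return conn


def totals_per_type():
    """One row per type; unix_time is summarised with MIN and MAX."""
    conn = _connect()
    rows = conn.execute(
        "SELECT type, SUM(score), MIN(unix_time), MAX(unix_time) "
        "FROM mytable GROUP BY type ORDER BY type"
    ).fetchall()
    conn.close()
    return rows


def totals_per_type_and_time():
    """Grouping by unix_time too: every row becomes its own group."""
    conn = _connect()
    rows = conn.execute(
        "SELECT SUM(score), unix_time "
        "FROM mytable GROUP BY type, unix_time ORDER BY unix_time"
    ).fetchall()
    conn.close()
    return rows


print(totals_per_type())
print(totals_per_type_and_time())
```

With unix_time in the GROUP BY, each sum covers a single row (5, 10, 5, 5, 15, 10), so the per-type totals of 30 and 20 no longer appear. That is why summarising unix_time with an aggregate is the better fit when one row per type is wanted.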

Only return an ordered subset of the rows from a joined table

Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`user_id` = dat.`user_id`
WHERE rel.`user_coach_id` = 1
ORDER BY dat.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1 | 9 | 4 | 15 | foo | bar | ...
| 1 | 7 | 3 | 12 | oof | rab | ...
| 1 | 6 | 4 | 11 | ofo | abr | ...
| 1 | 4 | 4 | 5 | foo | bra | ...
(And so on. Of course time are not integers in reality but to keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id but still have those ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to only return 1 row per user, but then the ordering doesn't work and it seems kind of unpredictable which row will be in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
Limiting the number of rows in each group is complicated. It is probably best done with an @variable to count, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints of how to do such.
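As an alternative to counting with a user variable, a per-user limit can be sketched with a correlated subquery that counts how many newer rows the same user has. This is a different technique from the one suggested above. It assumes times are distinct per user (ties could let extra rows through), runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, and uses made-up sample rows modelled on the output shown in the question.

```python
import sqlite3

# Hypothetical sample data modelled on the question's example output.
RELATIONS = [(1, 3, 1), (2, 4, 1)]  # (id, user_id, user_coach_id)
DATA = [(9, 4, 15), (7, 3, 12), (6, 4, 11), (4, 4, 5)]  # (id, user_id, time)

# A row is kept when fewer than :limit rows of the same user are newer.
TOP_N_SQL = """
SELECT rel.user_coach_id, dat.id, dat.user_id, dat.time
FROM relations_table AS rel
LEFT JOIN data_table AS dat ON rel.user_id = dat.user_id
WHERE rel.user_coach_id = :coach
  AND (SELECT COUNT(*)
       FROM data_table AS newer
       WHERE newer.user_id = dat.user_id
         AND newer.time > dat.time) < :limit
ORDER BY dat.time DESC
"""


def newest_rows_per_user(coach_id, limit):
    """Return up to `limit` newest data rows per user of the coach, newest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relations_table ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, user_coach_id INTEGER)"
    )
    conn.execute(
        "CREATE TABLE data_table ("
        "id INTEGER PRIMARY KEY, user_id INTEGER, time INTEGER)"
    )
    conn.executemany("INSERT INTO relations_table VALUES (?, ?, ?)", RELATIONS)
    conn.executemany("INSERT INTO data_table VALUES (?, ?, ?)", DATA)
    rows = conn.execute(TOP_N_SQL, {"coach": coach_id, "limit": limit}).fetchall()
    conn.close()
    return rows


print(newest_rows_per_user(1, 2))
```

With a limit of 2, user 4 keeps the rows with times 15 and 11 and loses the row with time 5, while user 3 keeps its only row. The correlated subquery runs once per candidate row, so on large tables an index on (user_id, time) would likely matter.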

MySQL: optimize query for scoring calculation

I have a data table that I use to do some calculations. The resulting data set after calculations looks like:
+------------+-----------+------+----------+
| id_process | id_region | type | result |
+------------+-----------+------+----------+
| 1 | 4 | 1 | 65.2174 |
| 1 | 5 | 1 | 78.7419 |
| 1 | 6 | 1 | 95.2308 |
| 1 | 4 | 1 | 25.0000 |
| 1 | 7 | 1 | 100.0000 |
+------------+-----------+------+----------+
On the other hand, I have another table that contains a set of ranges used to classify the calculation results. The ranges table looks like:
+----------+-------+-----+--------+
| id_level | start | end | status |
+----------+-------+-----+--------+
| 1        | 0     | 75  | Danger |
| 2        | 76    | 90  | Alert  |
| 3        | 91    | 100 | Good   |
+----------+-------+-----+--------+
I need a query that adds the corresponding 'status' column to each value when doing the calculations. Currently, I can do that by adding the following field to the calculation query:
select
...,
...,
[math formula] as result,
(select status
from ranges r
where result between r.start and r.end) status
from ...
where ...
It works OK. But when I have a lot of rows (more than 200K), the calculation query becomes slow.
My question is: is there some way to find that 'status' value without that subquery?
Has anyone worked on something similar before?
Thanks
Yes, you are looking for a subquery and a join:
select s.*, r.status
from (<your query here>
) s left outer join
ranges r
on s.result between r.start and r.end
Explicit joins often optimize better than nested select. In this case, though, the ranges table seems pretty small, so this may not be the performance issue.
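A minimal sketch of the range join, using the result values and ranges from the question. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL; the table name calc_results stands in for "your query here" and is an assumption, as is the extra value 75.5 used in the usage note. The column names start and end are quoted because END is a keyword in SQLite.

```python
import sqlite3

# Ranges copied from the question: (id_level, start, end, status).
RANGES = [(1, 0, 75, "Danger"), (2, 76, 90, "Alert"), (3, 91, 100, "Good")]

# Result values copied from the question's calculated data set.
RESULTS = [65.2174, 78.7419, 95.2308, 25.0, 100.0]


def classify(results):
    """Return (result, status) pairs; status is None when no range matches."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE ranges (id_level INTEGER, "start" REAL, "end" REAL, status TEXT)'
    )
    conn.execute("CREATE TABLE calc_results (result REAL)")
    conn.executemany("INSERT INTO ranges VALUES (?, ?, ?, ?)", RANGES)
    conn.executemany("INSERT INTO calc_results VALUES (?)", [(r,) for r in results])
    # LEFT JOIN keeps every calculated row, even if no range contains it.
    rows = conn.execute(
        "SELECT s.result, r.status "
        "FROM calc_results AS s "
        'LEFT JOIN ranges AS r ON s.result BETWEEN r."start" AND r."end" '
        "ORDER BY s.rowid"
    ).fetchall()
    conn.close()
    return rows


print(classify(RESULTS))
```

Because the ranges are given as 0-75, 76-90 and 91-100, a fractional result such as 75.5 falls between two ranges and comes back with a NULL status. If results are not whole numbers, the boundaries may need to be defined so that they leave no gaps.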

SQL 'COUNT' not returning what I expect, and somehow limiting results to one row

Some background: an 'image' is part of one 'photoshoot', and may be a part of zero or many 'galleries'. My tables:
'shoots' table:
+----+--------------+
| id | name |
+----+--------------+
| 1 | Test shoot |
| 2 | Another test |
| 3 | Final test |
+----+--------------+
'images' table:
+----+-------------------+------------------+
| id | original_filename | storage_location |
+----+-------------------+------------------+
| 1 | test.jpg | store/test.jpg |
| 2 | test.jpg | store/test.jpg |
| 3 | test.jpg | store/test.jpg |
+----+-------------------+------------------+
'shoot_images' table:
+----------+----------+
| shoot_id | image_id |
+----------+----------+
| 1 | 1 |
| 1 | 2 |
| 3 | 3 |
+----------+----------+
'gallery_images' table:
+------------+----------+
| gallery_id | image_id |
+------------+----------+
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 1 |
| 4 | 1 |
+------------+----------+
What I'd like to get back, so I can say 'For this photoshoot, there are X images in total, and these images are featured in Y galleries':
+----+--------------+-------------+---------------+
| id | name | image_count | gallery_count |
+----+--------------+-------------+---------------+
| 3 | Final test | 1 | 1 |
| 2 | Another test | 0 | 0 |
| 1 | Test shoot | 2 | 4 |
+----+--------------+-------------+---------------+
I'm currently trying the SQL below, which appears to work correctly but only ever returns one row. I can't work out why this is happening. Curiously, the below also returns a row even when 'shoots' is empty.
SELECT shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
ORDER BY shoots.id DESC
Thanks for taking the time to look at this :)
You are missing the GROUP BY clause:
SELECT
shoots.id,
shoots.name,
COUNT(DISTINCT shoot_images.image_id) AS image_count,
COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count
FROM shoots
LEFT JOIN shoot_images ON shoots.id=shoot_images.shoot_id
LEFT JOIN gallery_images ON shoot_images.image_id=gallery_images.image_id
GROUP BY 1, 2 -- Added this line
ORDER BY shoots.id DESC
Note: MySQL allows GROUP BY to be given either column names or select-list positions (this is an extension rather than standard SQL), so GROUP BY 1, 2 is equivalent to GROUP BY shoots.id, shoots.name in this case. There are many who consider this "bad coding practice" and advocate always using the column names, but I find it makes the code a lot more readable and maintainable, I've been writing SQL since before many users on this site were born, and this syntax has never caused me a problem.
FYI, the reason you were getting one row before, and not getting an error, is that MySQL (when ONLY_FULL_GROUP_BY is not enabled) allows you to omit the GROUP BY clause when using aggregate functions alongside plain columns. In such cases, instead of throwing an error, MySQL treats all the rows as a single group and returns one row, with the non-aggregated columns taken from an arbitrary row of that group. This is also why you get a row back even when 'shoots' is empty.
Although at first this may seem abhorrent to SQL purists, it can be incredibly handy!
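The corrected query can be checked against the sample tables from the question with the sketch below. It runs on SQLite through Python's sqlite3 module as a stand-in for MySQL, spells out the grouping columns by name, and omits the 'images' table because the query does not read it; the function name shoot_counts is hypothetical.

```python
import sqlite3

# Sample data copied from the question.
SHOOTS = [(1, "Test shoot"), (2, "Another test"), (3, "Final test")]
SHOOT_IMAGES = [(1, 1), (1, 2), (3, 3)]  # (shoot_id, image_id)
GALLERY_IMAGES = [(1, 1), (1, 2), (2, 3), (3, 1), (4, 1)]  # (gallery_id, image_id)


def shoot_counts():
    """Return (id, name, image_count, gallery_count) per shoot, newest id first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shoots (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE shoot_images (shoot_id INTEGER, image_id INTEGER)")
    conn.execute("CREATE TABLE gallery_images (gallery_id INTEGER, image_id INTEGER)")
    conn.executemany("INSERT INTO shoots VALUES (?, ?)", SHOOTS)
    conn.executemany("INSERT INTO shoot_images VALUES (?, ?)", SHOOT_IMAGES)
    conn.executemany("INSERT INTO gallery_images VALUES (?, ?)", GALLERY_IMAGES)
    # GROUP BY produces one output row per shoot; the LEFT JOINs keep
    # shoots without images, for which both counts are 0.
    rows = conn.execute(
        "SELECT shoots.id, shoots.name, "
        "COUNT(DISTINCT shoot_images.image_id) AS image_count, "
        "COUNT(DISTINCT gallery_images.gallery_id) AS gallery_count "
        "FROM shoots "
        "LEFT JOIN shoot_images ON shoots.id = shoot_images.shoot_id "
        "LEFT JOIN gallery_images ON shoot_images.image_id = gallery_images.image_id "
        "GROUP BY shoots.id, shoots.name "
        "ORDER BY shoots.id DESC"
    ).fetchall()
    conn.close()
    return rows


print(shoot_counts())
```

With this data, 'Test shoot' gets a gallery_count of 3 rather than the 4 in the desired output, because gallery 1 contains both of its images and COUNT(DISTINCT gallery_id) counts that gallery once. If each image-gallery pairing should count separately, a different count expression would be needed.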
You should look into the MySQL GROUP BY clause.