Which row is selected in GROUP BY? - mysql

Let's say I have a table
+------+---------+--------+
| lang | title   | url    |
+------+---------+--------+
| pt   | Livro 1 | o294jl |
| en   | Book 1  | o294jl |
| en   | Book 2  | o294jl |
+------+---------+--------+
And I run a query
SELECT lang, title
FROM table
GROUP BY url
The result of the query is not obvious because the values of lang and title are different among the group.
How does an SQL engine choose which row to return from a group? Which row must be selected in my example? Is it specified in the SQL standard?

Values are chosen from arbitrary matching rows for each group. The values could come from different rows for different runs. In theory, different columns in the same SELECT could come from different rows.
The documentation explains this:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard
SQL use of GROUP BY permits the select list, HAVING condition, or
ORDER BY list to refer to nonaggregated columns even if the columns
are not functionally dependent on GROUP BY columns. . . . In this case, the server is free to
choose any value from each group, so unless they are the same, the
values chosen are nondeterministic, which is probably not what you
want.
You should read the complete documentation on the subject.
Note that the default behavior of MySQL is now to reject such queries. Yay!
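For completeness: with ONLY_FULL_GROUP_BY enabled, such a query is rejected outright. If "any value from the group" really is acceptable, MySQL 5.7+ lets you say so explicitly with ANY_VALUE(). A minimal sketch, using the test table defined in the next answer:
-- Wrapping the non-aggregated columns in ANY_VALUE() makes the
-- "pick any row from the group" intent explicit, so the query is
-- accepted even under ONLY_FULL_GROUP_BY.
SELECT ANY_VALUE(lang) AS lang, ANY_VALUE(title) AS title
FROM test
GROUP BY url;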

In addition to Gordon's answer: in practice the engine will just do the least work, which is to take the values from the first row it finds in the group. However, which row is "first" depends on the execution plan, in particular on the chosen index.
Assuming the following schema and data:
CREATE TABLE test (
`lang` VARCHAR(2),
`title` VARCHAR(50),
`url` VARCHAR(50)
) engine=InnoDB;
INSERT INTO test (`lang`, `title`, `url`) VALUES
('pt', 'Livro 1', 'o294jl'),
('en', 'Book 1', 'o294jl'),
('en', 'Book 2', 'o294jl');
Executing the query
SELECT lang, title FROM test GROUP BY url;
returns
| lang | title |
| ---- | ------- |
| pt | Livro 1 |
which is the first row in insertion order (using the clustered index).
If we now define an index on (url, lang, title)
ALTER TABLE test ADD INDEX url_lang_title (url, lang, title);
The same SELECT query returns
| lang | title |
| ---- | ------ |
| en | Book 1 |
which is the first row in the new url_lang_title index.
View on DB Fiddle
As you can see, with exactly the same data and exactly the same query, MySQL can return different results. And even if you don't change the indexes, you can't rely on a particular index being chosen; the engine can pick another index for many other reasons.
The moral of the story: Don't ask what the engine will return. Instead tell it exactly what you want it to return by writing deterministic queries.
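For example, one deterministic way to get "one row per url" is to say explicitly which row you want. A sketch, assuming MySQL 8.0+ (for window functions) and the test table above, that picks the alphabetically first title per url:
-- Explicitly pick one row per url: the one with the lowest title.
SELECT lang, title
FROM (
    SELECT lang, title,
           ROW_NUMBER() OVER (PARTITION BY url ORDER BY title) AS rn
    FROM test
) ranked
WHERE rn = 1;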

Inconsistency with MySQL - USING vs ON [duplicate]

In a MySQL JOIN, what is the difference between ON and USING()? As far as I can tell, USING() is just more convenient syntax, whereas ON allows a little more flexibility when the column names are not identical. However, that difference is so minor, you'd think they'd just do away with USING().
Is there more to this than meets the eye? If yes, which should I use in a given situation?
It is mostly syntactic sugar, but a couple of differences are noteworthy:
ON is the more general of the two. One can join tables ON a column, a set of columns and even a condition. For example:
SELECT * FROM world.City JOIN world.Country ON (City.CountryCode = Country.Code) WHERE ...
USING is useful when both tables share a column of the exact same name on which they join. In this case, one may say:
SELECT ... FROM film JOIN film_actor USING (film_id) WHERE ...
An additional nice treat is that one does not need to fully qualify the joining columns:
SELECT film.title, film_id -- film_id is not prefixed
FROM film
JOIN film_actor USING (film_id)
WHERE ...
To illustrate, to do the above with ON, we would have to write:
SELECT film.title, film.film_id -- film.film_id is required here
FROM film
JOIN film_actor ON (film.film_id = film_actor.film_id)
WHERE ...
Notice the film.film_id qualification in the SELECT clause. It would be invalid to just say film_id since that would make for an ambiguity:
ERROR 1052 (23000): Column 'film_id' in field list is ambiguous
As for select *, the joining column appears in the result set twice with ON while it appears only once with USING:
mysql> create table t(i int);insert t select 1;create table t2 select*from t;
Query OK, 0 rows affected (0.11 sec)
Query OK, 1 row affected (0.00 sec)
Records: 1 Duplicates: 0 Warnings: 0
Query OK, 1 row affected (0.19 sec)
Records: 1 Duplicates: 0 Warnings: 0
mysql> select*from t join t2 on t.i=t2.i;
+------+------+
| i | i |
+------+------+
| 1 | 1 |
+------+------+
1 row in set (0.00 sec)
mysql> select*from t join t2 using(i);
+------+
| i |
+------+
| 1 |
+------+
1 row in set (0.00 sec)
mysql>
Thought I would chip in with the case where I have found ON to be more useful than USING: when OUTER joins are introduced into queries.
ON benefits from allowing the result set of the table a query is OUTER joining onto to be restricted while maintaining the OUTER join. Attempting to restrict that result set through a WHERE clause will, effectively, change the OUTER join into an INNER join.
Granted, this may be a relative corner case, but it is worth putting out there.
For example:
CREATE TABLE country (
countryId int(10) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
country varchar(50) not null,
UNIQUE KEY countryUIdx1 (country)
) ENGINE=InnoDB;
insert into country(country) values ("France");
insert into country(country) values ("China");
insert into country(country) values ("USA");
insert into country(country) values ("Italy");
insert into country(country) values ("UK");
insert into country(country) values ("Monaco");
CREATE TABLE city (
cityId int(10) unsigned NOT NULL PRIMARY KEY AUTO_INCREMENT,
countryId int(10) unsigned not null,
city varchar(50) not null,
hasAirport boolean not null default true,
UNIQUE KEY cityUIdx1 (countryId,city),
CONSTRAINT city_country_fk1 FOREIGN KEY (countryId) REFERENCES country (countryId)
) ENGINE=InnoDB;
insert into city (countryId,city,hasAirport) values (1,"Paris",true);
insert into city (countryId,city,hasAirport) values (2,"Bejing",true);
insert into city (countryId,city,hasAirport) values (3,"New York",true);
insert into city (countryId,city,hasAirport) values (4,"Napoli",true);
insert into city (countryId,city,hasAirport) values (5,"Manchester",true);
insert into city (countryId,city,hasAirport) values (5,"Birmingham",false);
insert into city (countryId,city,hasAirport) values (3,"Cincinatti",false);
insert into city (countryId,city,hasAirport) values (6,"Monaco",false);
-- Gah. Left outer join is now effectively an inner join
-- because of the where predicate
select *
from country left join city using (countryId)
where hasAirport
;
-- Hooray! I can see Monaco again thanks to
-- moving my predicate into the ON
select *
from country co left join city ci on (co.countryId=ci.countryId and ci.hasAirport)
;
Wikipedia has the following information about USING:
The USING construct is more than mere syntactic sugar, however, since
the result set differs from the result set of the version with the
explicit predicate. Specifically, any columns mentioned in the USING
list will appear only once, with an unqualified name, rather than once
for each table in the join. In the case above, there will be a single
DepartmentID column and no employee.DepartmentID or
department.DepartmentID.
The Postgres documentation also defines them pretty well:
The ON clause is the most general kind of join condition: it takes a
Boolean value expression of the same kind as is used in a WHERE
clause. A pair of rows from T1 and T2 match if the ON expression
evaluates to true.
The USING clause is a shorthand that allows you to take advantage of
the specific situation where both sides of the join use the same name
for the joining column(s). It takes a comma-separated list of the
shared column names and forms a join condition that includes an
equality comparison for each one. For example, joining T1 and T2 with
USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b =
T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns:
there is no need to print both of the matched columns, since they must
have equal values. While JOIN ON produces all columns from T1 followed
by all columns from T2, JOIN USING produces one output column for each
of the listed column pairs (in the listed order), followed by any
remaining columns from T1, followed by any remaining columns from T2.
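As a small illustration of that last point, a sketch with two hypothetical tables t1 and t2 that both have columns a and b:
-- These two joins match the same rows; the USING form additionally
-- collapses a and b into single, unqualified output columns in SELECT *.
SELECT * FROM t1 JOIN t2 USING (a, b);
SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b;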
Database tables
To demonstrate how the USING and ON clauses work, let's assume we have the following post and post_comment database tables, which form a one-to-many table relationship via the post_id Foreign Key column in the post_comment table referencing the post_id Primary Key column in the post table:
The parent post table has 3 rows:
| post_id | title |
|---------|-----------|
| 1 | Java |
| 2 | Hibernate |
| 3 | JPA |
and the post_comment child table has the 3 records:
| post_comment_id | review | post_id |
|-----------------|-----------|---------|
| 1 | Good | 1 |
| 2 | Excellent | 1 |
| 3 | Awesome | 2 |
The JOIN ON clause using a custom projection
Traditionally, when writing an INNER JOIN or LEFT JOIN query, we happen to use the ON clause to define the join condition.
For example, to get the comments along with their associated post title and identifier, we can use the following SQL projection query:
SELECT
post.post_id,
title,
review
FROM post
INNER JOIN post_comment ON post.post_id = post_comment.post_id
ORDER BY post.post_id, post_comment_id
And, we get back the following result set:
| post_id | title | review |
|---------|-----------|-----------|
| 1 | Java | Good |
| 1 | Java | Excellent |
| 2 | Hibernate | Awesome |
The JOIN USING clause using a custom projection
When the Foreign Key column and the column it references have the same name, we can use the USING clause, like in the following example:
SELECT
post_id,
title,
review
FROM post
INNER JOIN post_comment USING(post_id)
ORDER BY post_id, post_comment_id
And, the result set for this particular query is identical to the previous SQL query that used the ON clause:
| post_id | title | review |
|---------|-----------|-----------|
| 1 | Java | Good |
| 1 | Java | Excellent |
| 2 | Hibernate | Awesome |
The USING clause works for Oracle, PostgreSQL, MySQL, and MariaDB. SQL Server doesn't support the USING clause, so you need to use the ON clause instead.
The USING clause can be used with INNER, LEFT, RIGHT, and FULL JOIN statements.
SQL JOIN ON clause with SELECT *
Now, if we change the previous ON clause query to select all columns using SELECT *:
SELECT *
FROM post
INNER JOIN post_comment ON post.post_id = post_comment.post_id
ORDER BY post.post_id, post_comment_id
We are going to get the following result set:
| post_id | title | post_comment_id | review | post_id |
|---------|-----------|-----------------|-----------|---------|
| 1 | Java | 1 | Good | 1 |
| 1 | Java | 2 | Excellent | 1 |
| 2 | Hibernate | 3 | Awesome | 2 |
As you can see, the post_id is duplicated because both the post and post_comment tables contain a post_id column.
SQL JOIN USING clause with SELECT *
On the other hand, if we run a SELECT * query that features the USING clause for the JOIN condition:
SELECT *
FROM post
INNER JOIN post_comment USING(post_id)
ORDER BY post_id, post_comment_id
We will get the following result set:
| post_id | title | post_comment_id | review |
|---------|-----------|-----------------|-----------|
| 1 | Java | 1 | Good |
| 1 | Java | 2 | Excellent |
| 2 | Hibernate | 3 | Awesome |
You can see that this time, the post_id column is deduplicated, so there is a single post_id column being included in the result set.
Conclusion
If the database schema is designed so that Foreign Key column names match the columns they reference, and the JOIN conditions only check if the Foreign Key column value is equal to the value of its mirroring column in the other table, then you can employ the USING clause.
Otherwise, if the Foreign Key column name differs from the referencing column or you want to include a more complex join condition, then you should use the ON clause instead.
For those experimenting with this in phpMyAdmin, just a word:
phpMyAdmin appears to have a few problems with USING. For the record this is phpMyAdmin run on Linux Mint, version: "4.5.4.1deb2ubuntu2", Database server: "10.2.14-MariaDB-10.2.14+maria~xenial - mariadb.org binary distribution".
I have run SELECT commands using JOIN and USING in both phpMyAdmin and in Terminal (command line), and the ones in phpMyAdmin produce some baffling responses:
1) a LIMIT clause at the end appears to be ignored.
2) the supposed number of rows as reported at the top of the page with the results is sometimes wrong: for example 4 are returned, but at the top it says "Showing rows 0 - 24 (2503 total, Query took 0.0018 seconds.)"
Logging on to mysql normally and running the same queries does not produce these errors. Nor do these errors occur when running the same query in phpMyAdmin using JOIN ... ON .... Presumably a phpMyAdmin bug.
Short answer:
USING: when both tables share the join column name, so the condition would otherwise just repeat (and qualify) the same column
ON: when the join compares differently named columns or needs a more complex condition

How to create the index on MUL key in Mysql?

We need to create an index on the "source Path" column, which already has a MUL key. For example, it holds values like /src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph and we need to search with LIKE '%Sal/2016/Jan%'. The table has almost 10 million records.
Please suggest any idea for performance improvement.
+-------------+----------+------+-----+---------+----------------+
| Field       | Type     | Null | Key | Default | Extra          |
+-------------+----------+------+-----+---------+----------------+
| Id          | int(11)  | NO   | PRI | NULL    | auto_increment |
| Name        | char(35) | NO   |     |         |                |
| Country     | char(3)  | NO   | UNI |         |                |
| source Path | char(20) | YES  | MUL |         |                |
| Population  | int(11)  | NO   |     | 0       |                |
+-------------+----------+------+-----+---------+----------------+
Unfortunately, a search pattern that starts with % cannot use an index (this has little to do with the column being part of a composite index).
You have some options though:
The values in your path seem to have actual meaning. The ideal solution would be to use the meta-data, e.g. the month, name, whatever "SAL" stands for, and store it in their own columns or an attribute table, and then query for that meta-data instead. This is obviously only possible in very specific cases where you have the required meta-data for every path, so it is probably not an option here.
You can add a "search table" (e.g. (id, subpath)) that contains all subpaths of your source path, e.g.
'/src/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/com/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
'/Vendor/DTP/Emp/Grd1/Sal/2016/Jan/31-01/Joseph'
...
'/Sal/2016/Jan/31-01/Joseph'
...
'/31-01/Joseph'
'/Joseph'
so 11 rows in your example. It's now possible to use an index on that, e.g. in
...
where exists
(select * from subpaths s
where s.subpath like '/Sal/2016/Jan%' and s.id = outerquery.id)
This relies on knowing the start of your search term. If Sal in your example %Sal/2016/Jan should actually include word endings, e.g. /NoSal/2016/Jan, you would have to modify your input term to remove the first word, so %Sal/2016/Jan% would require you to search for /2016/Jan% (with an index) and then recheck the resultset afterwards if it also fits %Sal/2016/Jan% (see the fulltext option for an example, it has the same "problem" to only look for the beginning of words).
You will have to maintain the search table, which is usually done in a trigger (update the subpath table when you insert, update or delete values in your original table).
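A rough sketch of what such an insert trigger could look like, assuming the original table is called files (a hypothetical name, with the Id and `source Path` columns from the question) and the search table is subpaths(id, subpath); similar triggers would be needed for UPDATE and DELETE:
-- Sketch only: on insert, store every '/'-suffix of the path in subpaths.
DELIMITER //
CREATE TRIGGER files_ai AFTER INSERT ON files
FOR EACH ROW
BEGIN
    DECLARE remaining VARCHAR(255);
    SET remaining = NEW.`source Path`;
    WHILE remaining IS NOT NULL AND remaining <> '' DO
        INSERT INTO subpaths (id, subpath) VALUES (NEW.Id, remaining);
        -- drop the first path component: '/a/b/c' -> '/b/c'
        IF LOCATE('/', remaining, 2) > 0 THEN
            SET remaining = SUBSTRING(remaining, LOCATE('/', remaining, 2));
        ELSE
            SET remaining = NULL;
        END IF;
    END WHILE;
END //
DELIMITER ;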
Since this is a new table, you cannot combine it (directly) with another index, to e.g. optimize where country = 'A' and subpath like 'Sal/2016/Jan%' if country = 'A' would already get rid of 99.99% of the rows. You may have to check explain for your query if MySQL actually uses the index (because the optimizer can try something different) and then maybe reorganize your query (e.g. use a join or force index).
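For instance, the join form of that combined filter might look like this (same assumed files/subpaths names as above, with an index on subpaths(subpath)):
-- Combine the Country filter from the question with the subpath lookup.
SELECT f.*
FROM files f
JOIN subpaths s ON s.id = f.Id
WHERE f.Country = 'A'
  AND s.subpath LIKE '/Sal/2016/Jan%';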
You can use a fulltext search. From the userinput, you would have to generate a query like
select * from
(select * from table
where match(`source Path`) against ('+SAL +2016 +Jan' in boolean mode)) subquery
where `source path` like '%Sal/2016/Jan%'
The fulltext search will not care about the order of the words, so you have to recheck the resultset if it actually is the correct path, but the fulltext search will use the (fulltext) index to speed it up. It will only look for the beginning of words, so similar to the "search table" option, if Sal can be the end of the word, you have to remove it from the fulltext search. By default, only words with at least 3 or 4 letters (depending on your engine) will be added to the index, so you have to set the value of either ft_min_word_len or innodb_ft_min_token_size to whatever fits your requirements.
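A sketch of the setup for this fulltext option, again with the assumed table name files:
-- innodb_ft_min_token_size (InnoDB) and ft_min_word_len (MyISAM) are read-only
-- at runtime: set them in my.cnf (e.g. innodb_ft_min_token_size = 3), restart
-- the server, then build or rebuild the index.
ALTER TABLE files ADD FULLTEXT INDEX ft_source_path (`source Path`);

-- The query from above, against the assumed table:
SELECT * FROM (
    SELECT * FROM files
    WHERE MATCH(`source Path`) AGAINST ('+Sal +2016 +Jan' IN BOOLEAN MODE)
) subquery
WHERE `source Path` LIKE '%Sal/2016/Jan%';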
The search table approach is probably the most convenient solution, as it can be used very similar to your current search: you can add the userinput directly in one place (without having to interpret it to create the against (...) expression) and you can also use it easily in other situations (e.g. in something like join table2 on concat(table2.Year,'/',table2.Month,'%') like ...); but you will have to set up the triggers (or however else you maintain the table), which is a little more complicated than just adding a fulltext index.

Can MySQL FIND_IN_SET or equivalent be made to use indices?

If I compare
explain select * from Foo where find_in_set(id,'2,3');
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | User  | ALL  | NULL          | NULL | NULL    | NULL |    4 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
with this one
explain select * from Foo where id in (2,3);
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | User  | range | PRIMARY       | PRIMARY | 8       | NULL |    2 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
It is apparent that FIND_IN_SET does not exploit the primary key.
I want to put a query such as the above into a stored procedure, with the comma-separated string as an argument.
Is there any way to make the query behave like the second version, in which the index is used, but without knowing the content of the id set at the time the query is written?
In reference to your comment:
@MarcB the database is normalized, the CSV string comes from the UI.
"Get me data for the following people: 101,202,303"
This answer has a narrow focus on just those numbers separated by a comma, because, as it turns out, you were not even talking about FIND_IN_SET after all.
Yes, you can achieve what you want. You create a prepared statement that accepts a string as a parameter like in this Recent Answer of mine. In that answer, look at the second block that shows the CREATE PROCEDURE and its 2nd parameter which accepts a string like (1,2,3). I will get back to this point in a moment.
Not that you need to see it, @spraff, but others might. The mission is to get type != ALL, and the possible_keys and key columns of EXPLAIN to not show null, as you showed in your second block. For a general reading on the topic, see the article Understanding EXPLAIN’s Output and the MySQL Manual page entitled EXPLAIN Extra Information.
Now, back to the (1,2,3) reference above. We know from your comment, and your second Explain output in your question that it hits the following desired conditions:
type = range (and in particular not ALL). See the docs above on this.
key is not null
These are precisely the conditions you have in your second Explain output, and the output that can be seen with the following query:
explain
select * from ratings where id in (2331425, 430364, 4557546, 2696638, 4510549, 362832, 2382514, 1424071, 4672814, 291859, 1540849, 2128670, 1320803, 218006, 1827619, 3784075, 4037520, 4135373, ... use your imagination ..., ..., 4369522, 3312835);
where I have 999 values in that IN clause list. That is a sample from this answer of mine, in Appendix D, which generates such a random CSV string, surrounded by open and close parentheses.
And note the following EXPLAIN output for that 999-element IN clause below:
Objective achieved. You achieve this with a stored proc similar to the one I mentioned before in this link using a PREPARED STATEMENT (and those things use concat() followed by an EXECUTE).
The index is used and a table scan (meaning bad) is avoided. Further reading: The range Join Type, any reference you can find on MySQL's cost-based optimizer (CBO), and this answer from vladr (though dated), with an eye on the ANALYZE TABLE part, in particular after significant data changes. Note that ANALYZE can take a significant amount of time to run on ultra-huge datasets, sometimes many hours.
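To make the earlier prepared-statement reference concrete, here is a minimal sketch against the ratings table; the procedure name and the convention of passing the id list already wrapped in parentheses are assumptions:
-- Sketch: build the IN (...) query with CONCAT and run it as a prepared statement.
DELIMITER //
CREATE PROCEDURE fetch_ratings_by_ids(IN idList VARCHAR(4000))
BEGIN
    SET @sql = CONCAT('SELECT * FROM ratings WHERE id IN ', idList);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END //
DELIMITER ;

-- CALL fetch_ratings_by_ids('(2331425,430364,4557546)');
Since the string is concatenated straight into the SQL, the injection warning in the next section applies directly to this procedure.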
Sql Injection Attacks:
Strings passed to stored procedures are an attack vector for SQL injection, so precautions must be in place when using user-supplied data. If your routine is applied against your own ids generated by your system, then you are safe. Note, however, that second-order SQL injection attacks occur when data was put in place by routines that did not sanitize it in a prior insert or update: attacks planted earlier via data and used later (a sort of time bomb).
So this answer is finished, for the most part.
Below is a view of the same table, with a minor modification, to show what a dreaded table scan would look like for the prior query (but against a non-indexed column called thing).
Take a look at our current table definition:
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
select min(id), max(id),count(*) as theCount from ratings;
+---------+---------+----------+
| min(id) | max(id) | theCount |
+---------+---------+----------+
|       1 | 5046213 |  4718592 |
+---------+---------+----------+
Note that the column thing was a nullable int column before.
update ratings set thing=id where id<1000000;
update ratings set thing=id where id>=1000000 and id<2000000;
update ratings set thing=id where id>=2000000 and id<3000000;
update ratings set thing=id where id>=3000000 and id<4000000;
update ratings set thing=id where id>=4000000 and id<5100000;
select count(*) from ratings where thing!=id;
-- 0 rows
ALTER TABLE ratings MODIFY COLUMN thing int not null;
-- current table definition (after above ALTER):
CREATE TABLE `ratings` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`thing` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;
And then the EXPLAIN that shows a table scan (against column thing):
You can use the following technique to make use of the primary index.
Prerequisite:
You know the maximum number of items in the comma-separated string, and it is not large.
Description:
we convert the comma-separated string into rows via a derived table of positions
we inner join that derived table back to the base table
select @ids:='1,2,3,5,11,4', @maxCnt:=15;
SELECT *
FROM foo
INNER JOIN (
    SELECT * FROM (SELECT @n:=@n+1 AS n FROM foo INNER JOIN (SELECT @n:=0) AS _a) AS _a WHERE _a.n <= @maxCnt
) AS k ON k.n <= LENGTH(@ids) - LENGTH(REPLACE(@ids, ',', '')) + 1
     AND id = SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
This is a trick to extract the nth value in a comma-separated list:
SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)
Note: @ids can be anything, including a column from another table or from the same table.

Reference price and subtract from another during group_concat in MySQL

I'm working on manipulating a product data feed, and am currently working on grouping the related products. I've almost got things where I want them, but, like a mediocre racing driver, I've run out of skill right when I need it the most.
To illustrate my problem I've created a simplified version. Here is the data structure:
CREATE TABLE `feed` (
`sku` VARCHAR(10),
`price` DECIMAL(6,2),
`groupkey` VARCHAR(10)
);
INSERT INTO `feed` (`sku`, `price`, `groupkey`) VALUES
('AAA', 10.00, NULL),
('BBB', 10.00, 'group1'),
('CCC', 12.00, 'group1'),
('DDD', 10.00, 'group2'),
('EEE', 12.00, 'group2'),
('FFF', 14.00, 'group2'),
('GGG', 20.00, NULL);
My current query is:
SELECT feed.groupkey
, group_concat(feed.sku) AS skus
, group_concat(feed.price) AS prices
, feed.price AS pprice
FROM
feed
WHERE
feed.groupkey IS NOT NULL
GROUP BY
feed.groupkey;
The query returns the following rows:
+----------+-------------+-------------------+--------+
| groupkey | skus        | prices            | pprice |
+----------+-------------+-------------------+--------+
| group1   | BBB,CCC     | 10.00,12.00       |  10.00 |
| group2   | DDD,EEE,FFF | 10.00,12.00,14.00 |  10.00 |
+----------+-------------+-------------------+--------+
What I actually need to do is subtract pprice from each concatenated price, giving me the price difference between each sku, rather than their absolute prices. This would return the dream result:
+----------+-------------+-------------------+--------+
| groupkey | skus        | prices            | pprice |
+----------+-------------+-------------------+--------+
| group1   | BBB,CCC     | 0.00,2.00         |  10.00 |
| group2   | DDD,EEE,FFF | 0.00,2.00,4.00    |  10.00 |
+----------+-------------+-------------------+--------+
I've spent a lot of time on this feed in general, and am really stuck on what is probably the last hurdle in the integration. I'd really appreciate some guidance to help me in the right direction.
edit: I'm using the results from this query as "virtual" product rows, to serve as parents for the products in the group.
You can just do the subtraction in the group_concat(), for something like:
SELECT feed.groupkey, group_concat(feed.sku) AS skus,
       group_concat(feed.price - min(feed.price)) AS prices,
       min(feed.price) AS pprice
FROM feed
WHERE feed.groupkey IS NOT NULL
GROUP BY feed.groupkey
The problem is . . . which feed.price? The value returned in your original query is an arbitrary value from one of the rows in the group. Thinking that you might want the difference over the minimum, I used that value.
I think the best way to write the query is:
SELECT feed.groupkey, group_concat(feed.sku) AS skus,
       group_concat(feed.price - fsum.minprice) AS prices,
       min(feed.price) AS pprice
FROM feed left outer join
     (select groupkey, MIN(feed.price) as minprice
      from feed
      group by groupkey
     ) fsum
     on feed.groupkey = fsum.groupkey
WHERE feed.groupkey IS NOT NULL
GROUP BY feed.groupkey
You CANNOT assume the ordering for hidden columns and group_concat(). The documentation is quite explicit on this point:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values the server chooses.
If you want the concatenated values in a particular order, state it explicitly (GROUP_CONCAT supports its own ORDER BY clause) rather than relying on the order in which rows happen to be read. Relying on the implicit order often works in practice, but there is no guarantee.
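Putting the pieces together, here is a sketch that would produce the "dream result" deterministically, assuming the per-group baseline is the minimum price and ordering the concatenated lists by price:
-- Join each row to its group's minimum price and order both lists explicitly.
SELECT f.groupkey,
       GROUP_CONCAT(f.sku ORDER BY f.price, f.sku) AS skus,
       GROUP_CONCAT(f.price - m.minprice ORDER BY f.price, f.sku) AS prices,
       m.minprice AS pprice
FROM feed f
JOIN (SELECT groupkey, MIN(price) AS minprice
      FROM feed
      WHERE groupkey IS NOT NULL
      GROUP BY groupkey) m ON f.groupkey = m.groupkey
GROUP BY f.groupkey, m.minprice;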

WHERE vs HAVING

Why do you need to place columns you create yourself (for example select 1 as "number") after HAVING and not WHERE in MySQL?
And are there any downsides instead of doing WHERE 1 (writing the whole definition instead of a column name)?
All other answers on this question didn't hit upon the key point.
Assume we have a table:
CREATE TABLE `table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`value` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `value` (`value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
And have 10 rows with both id and value from 1 to 10:
INSERT INTO `table`(`id`, `value`) VALUES (1, 1),(2, 2),(3, 3),(4, 4),(5, 5),(6, 6),(7, 7),(8, 8),(9, 9),(10, 10);
Try the following 2 queries:
SELECT `value` v FROM `table` WHERE `value`>5; -- Get 5 rows
SELECT `value` v FROM `table` HAVING `value`>5; -- Get 5 rows
You will get exactly the same results; you can see that the HAVING clause can work without a GROUP BY clause.
Here's the difference:
SELECT `value` v FROM `table` WHERE `v`>5;
The above query will raise error: Error #1054 - Unknown column 'v' in 'where clause'
SELECT `value` v FROM `table` HAVING `v`>5; -- Get 5 rows
The WHERE clause can use any table column in its condition, but it cannot use aliases or aggregate functions.
The HAVING clause can use a selected (!) column, an alias, or an aggregate function in its condition.
This is because the WHERE clause filters data before the select, while the HAVING clause filters the resulting data after the select.
So putting the conditions in the WHERE clause will be more efficient if the table has many rows.
Try EXPLAIN to see the key difference:
EXPLAIN SELECT `value` v FROM `table` WHERE `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
|  1 | SIMPLE      | table | range | value         | value | 4       | NULL |    5 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+--------------------------+
EXPLAIN SELECT `value` v FROM `table` having `value`>5;
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key   | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
|  1 | SIMPLE      | table | index | NULL          | value | 4       | NULL |   10 | Using index |
+----+-------------+-------+-------+---------------+-------+---------+------+------+-------------+
You can see that both queries use the index, but the rows differ: the WHERE version does a range scan over 5 rows, while the HAVING version scans the whole index (10 rows) and filters afterwards.
Why is it that you need to place columns you create yourself (for example "select 1 as number") after HAVING and not WHERE in MySQL?
WHERE is applied before GROUP BY, HAVING is applied after (and can filter on aggregates).
In general, you can reference aliases in neither of these clauses, but MySQL allows referencing SELECT level aliases in GROUP BY, ORDER BY and HAVING.
And are there any downsides instead of doing "WHERE 1" (writing the whole definition instead of a column name)
If your calculated expression does not contain any aggregates, putting it into the WHERE clause will most probably be more efficient.
The main difference is that WHERE cannot be used on grouped item (such as SUM(number)) whereas HAVING can.
The reason is the WHERE is done before the grouping and HAVING is done after the grouping is done.
HAVING is used to filter on aggregations in your GROUP BY.
For example, to check for duplicate names:
SELECT Name FROM Usernames
GROUP BY Name
HAVING COUNT(*) > 1
These two feel the same at first, since both express a condition to filter data. Although we can often use HAVING in place of WHERE, there are cases where we can't use WHERE instead of HAVING. This is because, in a SELECT query, WHERE filters data before the SELECT list is evaluated, while HAVING filters after it. So when we use alias names that are not actual columns in the database, WHERE can't identify them but HAVING can.
Example: let the table Student contain student_id, name, birthday, address. Assume birthday is of type DATE.
SELECT * FROM Student WHERE YEAR(birthday) > 1993; /* this works because birthday is a real column; it would also work with HAVING in place of WHERE */
SELECT student_id, (YEAR(CURDATE()) - YEAR(birthday)) AS Age FROM Student HAVING Age > 20;
/* this would not work with WHERE: WHERE doesn't know about Age, as Age is defined in the SELECT part */
WHERE filters before data is grouped, and HAVING filters after data is grouped. This is an important distinction; rows that are eliminated by a WHERE clause will not be included in the group. This could change the calculated values which, in turn, could affect which groups are filtered based on the use of those values in the HAVING clause.
And continues,
HAVING is so similar to WHERE that most DBMSs treat them as the same
thing if no GROUP BY is specified. Nevertheless, you should make that
distinction yourself. Use HAVING only in conjunction with GROUP BY
clauses. Use WHERE for standard row-level filtering.
Excerpt from: Forta, Ben. Sams Teach Yourself SQL in 10 Minutes (5th Edition).
HAVING is used with aggregated conditions, while WHERE is used with non-aggregated ones.
If you have a WHERE clause, put it before the aggregation (GROUP BY).