I'm working on manipulating a product data feed, and am currently working on grouping the related products. I've almost got things where I want them, but, like a mediocre racing driver, I've run out of skill right when I need it the most.
To illustrate my problem I've created a simplified version. Here is the data structure:
CREATE TABLE `feed` (
`sku` VARCHAR(10),
`price` DECIMAL(6,2),
`groupkey` VARCHAR(10)
);
INSERT INTO `feed` (`sku`, `price`, `groupkey`) VALUES
('AAA', 10.00, NULL),
('BBB', 10.00, 'group1'),
('CCC', 12.00, 'group1'),
('DDD', 10.00, 'group2'),
('EEE', 12.00, 'group2'),
('FFF', 14.00, 'group2'),
('GGG', 20.00, NULL);
My current query is:
SELECT feed.groupkey
, group_concat(feed.sku) AS skus
, group_concat(feed.price) AS prices
, feed.price AS pprice
FROM
feed
WHERE
feed.groupkey IS NOT NULL
GROUP BY
feed.groupkey;
The query returns the following rows:
+----------+-------------+-------------------+--------+
| groupkey | skus | prices | pprice |
+----------+-------------+-------------------+--------+
| group1 | BBB,CCC | 10.00,12.00 | 10.00 |
| group2 | DDD,EEE,FFF | 10.00,12.00,14.00 | 10.00 |
+----------+-------------+-------------------+--------+
What I actually need to do is subtract pprice from each concatenated price, giving me the price difference between each sku, rather than their absolute prices. This would return the dream result:
+----------+-------------+-------------------+--------+
| groupkey | skus | prices | pprice |
+----------+-------------+-------------------+--------+
| group1 | BBB,CCC | 0.00,2.00 | 10.00 |
| group2 | DDD,EEE,FFF | 0.00,2.00,4.00 | 10.00 |
+----------+-------------+-------------------+--------+
I've spent a lot of time on this feed in general, and am really stuck on what is probably the last hurdle in the integration. I'd really appreciate some guidance to help me in the right direction.
edit: I'm using the results from this query as "virtual" product rows, to serve as parents for the products in the group.
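Conceptually, you can just do the subtraction inside the group_concat(), for something like the query below. (As written, MySQL rejects it with "Invalid use of group function", because an aggregate such as min() cannot be nested inside group_concat().)
SELECT feed.groupkey, group_concat(feed.sku) AS skus,
group_concat(feed.price - min(feed.price)) AS prices,
min(feed.price) AS pprice
FROM feed
WHERE feed.groupkey IS NOT NULL
GROUP BY feed.groupkey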
The problem is . . . which feed.price? The value returned in your original query is an arbitrary value from one of the rows in the group. Thinking that you might want the difference over the minimum, I used that value.
I think the best way to write the query is:
SELECT feed.groupkey, group_concat(feed.sku) AS skus,
group_concat(feed.price - fsum.minprice) AS prices,
min(feed.price) AS pprice
FROM feed left outer join
(select groupkey, MIN(feed.price) as minprice
from feed
group by groupkey
) fsum
on feed.groupkey = fsum.groupkey
WHERE feed.groupkey IS NOT NULL
GROUP BY feed.groupkey
You CANNOT assume the ordering for hidden columns and group_concat(). The documentation is quite explicit on this point:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values the server chooses.
If you want things in a particular order, you need to ask for that order explicitly in the query. Relying on the implicit order often works in practice, but there is no guarantee.
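One way to pin the order down is to give each group_concat() its own ORDER BY, so that skus and prices stay aligned. This is a minimal sketch that assumes the cheapest sku in each group is the baseline and should come first. The aliases f and m and the tie-break on sku are illustrative choices, not part of the original query.

```sql
-- Sketch: price differences per group, with an explicit order inside GROUP_CONCAT.
SELECT f.groupkey,
       GROUP_CONCAT(f.sku ORDER BY f.price, f.sku) AS skus,
       GROUP_CONCAT(f.price - m.minprice ORDER BY f.price, f.sku) AS prices,
       m.minprice AS pprice
FROM feed AS f
JOIN (
    -- One row per group holding its minimum price.
    SELECT groupkey, MIN(price) AS minprice
    FROM feed
    WHERE groupkey IS NOT NULL
    GROUP BY groupkey
) AS m ON m.groupkey = f.groupkey
-- minprice is constant per group; listing it keeps ONLY_FULL_GROUP_BY satisfied.
GROUP BY f.groupkey, m.minprice;
```

Because both GROUP_CONCAT() calls sort by the same keys, the n-th sku always corresponds to the n-th price difference.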
Here is the SQL problem.
Table: Countries
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| country_id | int |
| country_name | varchar |
+---------------+---------+
country_id is the primary key for this table.
Each row of this table contains the ID and the name of one country.
Table: Weather
+---------------+------+
| Column Name | Type |
+---------------+------+
| country_id | int |
| weather_state | int |
| day | date |
+---------------+------+
(country_id, day) is the primary key for this table.
Each row of this table indicates the weather state in a country for one day.
Write an SQL query to find the type of weather in each country for November 2019.
The type of weather is:
Cold if the average weather_state is less than or equal to 15,
Hot if the average weather_state is greater than or equal to 25, and
Warm otherwise.
Return result table in any order.
One of the MySQL solutions is as follows:
SELECT country_name,
CASE WHEN AVG(weather_state) <= 15 THEN 'Cold'
WHEN AVG(weather_state) >= 25 THEN 'Hot'
ELSE 'Warm'
END AS weather_type
FROM Weather w
JOIN Countries c
ON w.country_id = c.country_id
AND LEFT(w.day, 7) = '2019-11'
GROUP BY w.country_id
How does the "case when AVG(weather_state)" get executed, if the group by gets executed after the select statement?
AVG(weather_state) computes the per-group average of column weather_state. It and other aggregate functions can be used in a select clause, from which you can conclude that the grouping defined by a group by clause must be visible in the context where the select clause is evaluated. In this sense, at least, group by gets executed before select. Pretty much everything else does too.
It is possible for an aggregate query to be identifiable only from the select clause. In such cases, the select clause needs to be parsed before it is known that grouping (all rows into a single group) is to be performed. This is the closest I can think of to the execution-order claim you asserted, but it is not at all well characterized as group by being executed after select.
MySQL's implementation details surely present a more complicated picture, but the fact remains that MySQL does provide correct SQL semantics in this regard. Therefore, even if you look at the details, they cannot reasonably be characterized as executing the group by after the select. Whoever told you that was wrong, or at least their lesson was very misleading, or else you misunderstood them.
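To make the logical order visible, the same question can be answered with the grouping pushed into a derived table, so that the CASE expression only ever sees one already-averaged value per country. This is a sketch, not the solution from the question: the alias a and the column name avg_state are made up for illustration, and the date filter is written as a range rather than with LEFT().

```sql
-- Sketch: grouping happens in the derived table; the outer SELECT merely labels each average.
SELECT c.country_name,
       CASE WHEN a.avg_state <= 15 THEN 'Cold'
            WHEN a.avg_state >= 25 THEN 'Hot'
            ELSE 'Warm'
       END AS weather_type
FROM (
    SELECT country_id, AVG(weather_state) AS avg_state
    FROM Weather
    WHERE day >= '2019-11-01' AND day < '2019-12-01'
    GROUP BY country_id
) AS a
JOIN Countries AS c ON c.country_id = a.country_id;
```

The single-query version in the question behaves as if it were evaluated this way: the groups are formed first, and the select list is computed once per group.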
Let's say I have a table
+------+---------+--------+
| lang | title | url |
+------+---------+--------+
| pt | Livro 1 | o294jl |
| en | Book 1 | o294jl |
| en | Book 2 | o294jl |
+------+---------+--------+
And I run a query
SELECT lang, title
FROM table
GROUP BY url
The result of the query is not obvious because the values of lang and title differ within the group.
How does an SQL engine choose which row to return from a group? Which row must be selected in my example? Is it specified in the SQL standard?
Values are chosen from arbitrary matching rows for each group. The values could come from different rows for different runs. In theory, different columns in the same SELECT could come from different rows.
The documentation explains this:
If ONLY_FULL_GROUP_BY is disabled, a MySQL extension to the standard
SQL use of GROUP BY permits the select list, HAVING condition, or
ORDER BY list to refer to nonaggregated columns even if the columns
are not functionally dependent on GROUP BY columns. . . . In this case, the server is free to
choose any value from each group, so unless they are the same, the
values chosen are nondeterministic, which is probably not what you
want.
You should read the complete documentation on the subject.
Note that since MySQL 5.7.5, ONLY_FULL_GROUP_BY is enabled by default, so MySQL now rejects such queries. Yay!
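If an arbitrary value per group really is acceptable, MySQL 5.7 and later let you say so explicitly with ANY_VALUE(). A small sketch, assuming the rows from the question live in a table called pages (a hypothetical name, since the question does not give a usable one):

```sql
-- Check whether ONLY_FULL_GROUP_BY is part of the current session's SQL mode.
SELECT @@SESSION.sql_mode;

-- Explicitly accept an arbitrary lang and title for each url.
-- This is accepted even when ONLY_FULL_GROUP_BY is enabled.
SELECT ANY_VALUE(lang) AS lang, ANY_VALUE(title) AS title
FROM pages          -- hypothetical table name
GROUP BY url;
```

ANY_VALUE() does not make the result deterministic; it only documents that you do not care which row the values come from.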
In addition to Gorden's answer: in practice the engine will just do the least work, which is to take the values from the first row it finds in the group. Which row comes first, however, depends on the execution plan, in particular on the chosen index.
Assuming the following schema and data:
CREATE TABLE test (
`lang` VARCHAR(2),
`title` VARCHAR(50),
`url` VARCHAR(50)
) engine=InnoDB;
INSERT INTO test (`lang`, `title`, `url`) VALUES
('pt', 'Livro 1', 'o294jl'),
('en', 'Book 1', 'o294jl'),
('en', 'Book 2', 'o294jl');
Executing the query
SELECT lang, title FROM test GROUP BY url;
returns
| lang | title |
| ---- | ------- |
| pt | Livro 1 |
Which is the first row in insertion order (using the clustered index).
If we now define an index on (url, lang, title)
ALTER TABLE test ADD INDEX url_lang_title (url, lang, title);
The same SELECT query returns
| lang | title |
| ---- | ------ |
| en | Book 1 |
which is the first row in the new url_lang_title index.
As you can see, with exactly the same data and exactly the same query, MySQL can return different results. And even if you don't change the indices, you can't rely on a particular index being chosen. The engine can choose another index for many other reasons.
The moral of the story: Don't ask what the engine will return. Instead tell it exactly what you want it to return by writing deterministic queries.
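As a sketch of what a deterministic version could look like, the query below states the selection rule outright. It assumes MySQL 8.0 or later (for window functions) and an arbitrary rule chosen purely for illustration: within each url, take the row with the alphabetically first lang, then the alphabetically first title. The alias ranked and the column rn are made-up names.

```sql
-- Sketch: exactly one row per url, chosen by an explicit rule instead of by the execution plan.
SELECT lang, title
FROM (
    SELECT lang, title,
           ROW_NUMBER() OVER (PARTITION BY url ORDER BY lang, title) AS rn
    FROM test
) AS ranked
WHERE rn = 1;
```

With the three sample rows this returns en / Book 1 regardless of which indexes exist, because the ordering is part of the query rather than a side effect of the access path.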
I have two tables, books and images. books has columns like id, name, releasedate, purchasecount. images has bookid (the same as id in books; one book can have multiple images, although I haven't set any foreign key constraint), bucketid, and poster (each record points to an image file in a certain bucket, for a certain bookid).
Table schema:
poster is unique in images, hence it is a primary key.
Covering index on books: (name, id, releasedate)
Covering index on images: (bookid,poster,bucketid)
My query is, given a name, find the top ten books (sorted by number of purchasecount) from the books table whose name matches that name, and for that book, return any (preferably the first) record (bucketid and poster) from the images table.
Obviously this can be solved by two queries by running the first, and using its results to query the images table, but that will be slow, so I want to use 'join' and subquery to do it in one go. However, what I am trying is not giving me correct results:
select books.id,books.name,year(releasedate),purchasecount,bucketid,poster from books
inner join (select bucketid,bookid, poster from images) t on
t.bookid = books.id where name like "%foo%" order by purchasecount desc limit 2;
Can anybody suggest an optimal query to fetch the result set as desired here (including any suggestion to change the table schema to improve search time) ?
Updated fiddle: http://sqlfiddle.com/#!9/17c5a8/1.
The example query should return two results - fooe and fool, and one (any of the multiple posters corresponding to each book) poster for each result. However I am not getting correct results. Expected:
fooe - 1973 - 459 - 11 - swt (or fooe - 1973 - 459 - 11 - pqr)
fool - 1963 - 456 - 12 - xxx (or fool - 1963 - 456 - 111 - qwe)
I agree with Strawberry about the schema. We can discuss ideas for better performance and all that. But here is my take on how to solve this after a few chats and changes to the question.
Note below the data changes to deal with various boundary conditions which include books with no images in that table, and tie-breaks. Tie-breaks meaning using the max(upvotes). The OP changed the question a few times and added a new column in the images table.
The modified question became: return at most 1 row per book. Scratch that: always 1 row per book, even if there are no images. The image info to return would be the one with max upvotes.
Books table
create table books
( id int primary key,
name varchar(1000),
releasedate date,
purchasecount int
) ENGINE=InnoDB;
insert into books values(1,"fool","1963-12-18",456);
insert into books values(2,"foo","1933-12-18",11);
insert into books values(3,"fooherty","1943-12-18",77);
insert into books values(4,"eoo","1953-12-18",678);
insert into books values(5,"fooe","1973-12-18",459);
insert into books values(6,"qoo","1983-12-18",500);
Data Changes from original question.
Mainly the new upvotes column.
The below includes a tie-break row added.
create table images
( bookid int,
poster varchar(150) primary key,
bucketid int,
upvotes int -- a new column introduced by OP
) ENGINE=InnoDB;
insert into images values (1,"xxx",12,27);
insert into images values (5,"pqr",11,0);
insert into images values (5,"swt",11,100);
insert into images values (2,"yyy",77,65);
insert into images values (1,"qwe",111,69);
insert into images values (1,"blah_blah_tie_break",111,69);
insert into images values (3,"qwqqe",14,81);
insert into images values (1,"qqawe",8,45);
insert into images values (2,"z",81,79);
Visualization of a Derived Table
This is just to assist in visualizing an inner piece of the final query. It demonstrates the gotcha for tie-break situations, thus the rownum variable. That variable is reset to 1 each time the bookid changes otherwise it increments. In the end (our final query) we only want rownum=1 rows so that max 1 row is returned per book (if any).
Final Query
select b.id,b.purchasecount,xDerivedImages2.poster,xDerivedImages2.bucketid
from books b
left join
( select i.bookid,i.poster,i.bucketid,i.upvotes,
@rn := if(@lastbookid = i.bookid, @rn + 1, 1) as rownum,
@lastbookid := i.bookid as dummy
from
( select bookid,max(upvotes) as maxup
from images
group by bookid
) xDerivedImages
join images i
on i.bookid=xDerivedImages.bookid and i.upvotes=xDerivedImages.maxup
cross join (select @rn:=0,@lastbookid:=-1) params
order by i.bookid
) xDerivedImages2
on xDerivedImages2.bookid=b.id and xDerivedImages2.rownum=1
order by b.purchasecount desc
limit 10
Results
+----+---------------+---------------------+----------+
| id | purchasecount | poster | bucketid |
+----+---------------+---------------------+----------+
| 4 | 678 | NULL | NULL |
| 6 | 500 | NULL | NULL |
| 5 | 459 | swt | 11 |
| 1 | 456 | blah_blah_tie_break | 111 |
| 3 | 77 | qwqqe | 14 |
| 2 | 11 | z | 81 |
+----+---------------+---------------------+----------+
The significance of the cross join is merely to introduce and set starting values for 2 variables. That is all.
The results are the top ten books in descending order of purchasecount with the info from images if it exists (otherwise NULL) for the most upvoted image. The image selected honors tie-break rules picking the first one as mentioned above in the Visualization section with rownum.
Final Thoughts
I leave it to the OP to wedge in the appropriate where clause at the end as the sample data given had no useful book name to search on. That part is trivial. Oh, and do something about the schema for the large width of your primary keys. But that is off-topic at the moment.
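For readers on MySQL 8.0 or later, the same "one most-upvoted image per book" idea can be sketched with ROW_NUMBER() instead of user variables, which avoids depending on the order in which the variables are evaluated. This is an alternative technique, not the query above. The alias ranked, the tie-break on poster, and the '%foo%' filter (borrowed from the question) are assumptions for illustration.

```sql
-- Sketch: top ten matching books, each with its most-upvoted image (or NULLs if it has none).
SELECT b.id, b.name, b.purchasecount, ranked.poster, ranked.bucketid
FROM books AS b
LEFT JOIN (
    SELECT bookid, poster, bucketid,
           -- Rank images within each book; ties on upvotes are broken by poster name.
           ROW_NUMBER() OVER (PARTITION BY bookid
                              ORDER BY upvotes DESC, poster) AS rownum
    FROM images
) AS ranked
  ON ranked.bookid = b.id AND ranked.rownum = 1
WHERE b.name LIKE '%foo%'
ORDER BY b.purchasecount DESC
LIMIT 10;
```

Keeping the rownum = 1 condition in the ON clause rather than in WHERE is what preserves books that have no images at all.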
Let's say I have a plant table:
id fruit
1 banana
2 apple
3 orange
I can do these
SELECT * FROM plant ORDER BY id;
SELECT * FROM plant ORDER BY fruit DESC;
which does the obvious thing.
But I was bitten by this, what does this do?
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
All these return just the first row (which is with id = 1).
What's happening under the hood?
What are the scenarios where aggregate function will come in handy in ORDER BY?
Your results are more clear if you actually select the aggregate values instead of columns from the table:
SELECT SUM(id) FROM plant ORDER BY SUM(id)
This will return the sum of all ids. This is of course a useless example because the aggregation will always create only one row, hence no need for ordering. The reason you get a row with columns in your query is that MySQL picks one row, not at random but not deterministically either. It just so happens that it is the first row in the table in your case, but others may get another row depending on storage engine, primary keys and so on. Aggregation only in the ORDER BY clause is thus not very useful.
What you usually want to do is grouping by a certain field and then order the result set in some way:
SELECT fruit, COUNT(*)
FROM plant
GROUP BY fruit
ORDER BY COUNT(*)
Now that's a more interesting query! This will give you one row for each fruit together with the total count for that fruit. Try adding some more apples and the ordering will actually start making sense:
Complete table:
+----+--------+
| id | fruit |
+----+--------+
| 1 | banana |
| 2 | apple |
| 3 | orange |
| 4 | apple |
| 5 | apple |
| 6 | banana |
+----+--------+
The query above:
+--------+----------+
| fruit | COUNT(*) |
+--------+----------+
| orange | 1 |
| banana | 2 |
| apple | 3 |
+--------+----------+
All these queries will give you an error on any SQL platform that complies with the SQL standard.
SELECT * FROM plant ORDER BY SUM(id);
SELECT * FROM plant ORDER BY COUNT(fruit);
SELECT * FROM plant ORDER BY COUNT(*);
SELECT * FROM plant ORDER BY SUM(1) DESC;
On PostgreSQL, for example, all those queries will raise the same error.
ERROR: column "plant.id" must appear in the GROUP BY clause or be
used in an aggregate function
That means you're using a domain aggregate function without using GROUP BY. SQL Server and Oracle return similar error messages.
MySQL's GROUP BY is known to be broken in several respects, at least as far as standard behavior is concerned. But the queries you posted were a new broken behavior to me, so +1 for that.
Instead of trying to understand what it's doing under the hood, you're probably better off learning to write standard GROUP BY queries. MySQL will process standard GROUP BY statements correctly, as far as I know.
Earlier versions of MySQL docs warned you about GROUP BY and hidden columns. (I don't have a reference, but this text is cited all over the place.)
Do not use this feature if the columns you omit from the GROUP BY part
are not constant in the group. The server is free to return any value
from the group, so the results are indeterminate unless all values are
the same.
More recent versions are a little different.
You can use this feature to get better performance by avoiding
unnecessary column sorting and grouping. However, this is useful
primarily when all values in each nonaggregated column not named in
the GROUP BY are the same for each group. The server is free to choose
any value from each group, so unless they are the same, the values
chosen are indeterminate.
Personally, I don't consider indeterminate a feature in SQL.
When you use an aggregate like that, the query gets an implicit group by where the entire result is a single group.
Using an aggregate in order by is only useful if you also have a group by, so that you can have more than one row in the result.
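A short sketch of the useful case, assuming the six-row plant table shown earlier (with the extra apples and banana): the aggregate appears only in ORDER BY, and the GROUP BY gives it several groups to rank. The secondary sort on fruit is an added tie-break, not something the question asked for.

```sql
-- Sketch: list fruits from most to least common without selecting the count itself.
SELECT fruit
FROM plant
GROUP BY fruit
ORDER BY COUNT(*) DESC, fruit;
```

With that data the rows come back as apple, banana, orange. Without the GROUP BY, the same ORDER BY COUNT(*) would collapse the whole table into a single group and return a single row.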
I have two simple Mysql tables:
SYMBOL
| id | symbol |
(INT(primary) - varchar)
PRICE
| id | id_symbol | date | price |
(INT(primary), INT(index), date, double)
I have to pass two symbols to get something like:
DATE A B
2001-01-01 | 100.25 | 25.26
2001-01-02 | 100.23 | 25.25
2001-01-03 | 100.24 | 25.24
2001-01-04 | 100.25 | 25.26
2001-01-05 | 100.26 | 25.28
2001-01-06 | 100.27 | 30.29
Where A and B are the symbols I need to search for and the date is the date of the prices (I need matching dates in order to compare the symbols).
If one symbol has no price on a date that the other has, I have to skip that date. I only need to retrieve the last N prices of those symbols.
ORDER: from the earliest date to the latest (for example, the last 100 prices of both)
How could I implement this query?
Thank you
Implementing these steps should bring you the desired result:
Get dates and prices for symbol A. (Inner join PRICE with SYMBOL to obtain the necessary rows.)
Similarly get dates and prices for symbol B.
Inner join the two result sets on the date column and pull the price from the first result set as the A column and the other one as B.
This should be simple if you know how to join tables.
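The steps above can be sketched as a single query. Several things here are assumptions rather than facts from the question: the symbol names 'AAA' and 'BBB' are placeholders, N is taken to be 100, each symbol is assumed to have at most one price per date, and the aliases pa, pb, sa, sb and t are made up.

```sql
-- Sketch: the last 100 dates on which both symbols have a price, oldest first.
SELECT t.date, t.A, t.B
FROM (
    SELECT pa.date, pa.price AS A, pb.price AS B
    FROM PRICE AS pa
    JOIN SYMBOL AS sa ON sa.id = pa.id_symbol AND sa.symbol = 'AAA'
    -- The inner join on date drops any date that only one of the symbols has.
    JOIN PRICE AS pb ON pb.date = pa.date
    JOIN SYMBOL AS sb ON sb.id = pb.id_symbol AND sb.symbol = 'BBB'
    ORDER BY pa.date DESC
    LIMIT 100
) AS t
ORDER BY t.date ASC;
```

The inner query takes the newest 100 shared dates, and the outer query flips them back into ascending order. An index covering (id_symbol, date) on PRICE would likely help, though that depends on the data.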
I think you should update your question to fix the mistakes in how you represent your data; I'm having a hard time following the details. However, based on what I am seeing, there are four MySQL concepts you need to solve your problem.
The first is JOIN: you would use a join to put two tables together so you can select related data using the key that you describe as "id_symbol".
The second is LIMIT, which lets you specify the number of records to return: if you wanted one record you would use LIMIT 1, and if you wanted a hundred records, LIMIT 100.
The third is a WHERE clause, which lets you search for a specific value in one of the fields of the table you are querying.
The last is ORDER BY, which lets you specify a field to sort the returned records by and the direction you want them sorted, ASC or DESC.
An Example:
SELECT *
FROM table1
JOIN table2 ON table1.id = table2.table1_id
WHERE table1.searchfield = 'search string'
ORDER BY table1.orderfield DESC
LIMIT 100
(The table and column names are placeholders, so adapt them to your schema; the query is only meant to give you the correct idea.)
I suggest consulting the MySQL documentation; it should provide everything you need to keep going.