"GROUP BY" on MariaDB behaves differently from MySQL - mysql

I have been told many times that same queries MariaDB will work just the same like how it is on MySQL... until I meet this problem.
Recently, I am trying to clone an application from MySQL(InnoDB) to MariaDB(XtraDB).
Although MariaDB runs MySQL queries without the need of changing anything, I was surprised to discover that the same queries actually behave quite differently on both platforms particularly in ORDER BY and GROUP BY.
For an example:
MyTable
=======
+----+----------+---------------------+-----------+
| id | parentId | creationDate | name |
+----+----------+---------------------+-----------+
| 1 | 2357 | 2017-01-01 06:03:40 | Anna |
+----+----------+---------------------+-----------+
| 2 | 5480 | 2017-01-02 07:13:20 | Becky |
+----+----------+---------------------+-----------+
| 3 | 2357 | 2017-01-03 08:20:12 | Christina |
+----+----------+---------------------+-----------+
| 4 | 2357 | 2017-01-03 08:20:15 | Dorothy |
+----+----------+---------------------+-----------+
| 5 | 5480 | 2017-01-04 09:25:45 | Emma |
+----+----------+---------------------+-----------+
| 6 | 1168 | 2017-01-05 10:30:10 | Fiona |
+----+----------+---------------------+-----------+
| 7 | 5480 | 2017-01-05 10:33:23 | Gigi |
+----+----------+---------------------+-----------+
| 8 | 1168 | 2017-01-06 12:46:34 | Heidi |
+----+----------+---------------------+-----------+
| 9 | 1168 | 2017-01-06 12:46:34 | Irene |
+----+----------+---------------------+-----------+
| 10 | 2357 | 2017-01-07 14:58:37 | Jane |
+----+----------+---------------------+-----------+
| 11 | 2357 | 2017-01-07 14:58:37 | Katy |
+----+----------+---------------------+-----------+
Basically what I want to get from a query is the latest records from each GROUPing (i.e. parentId). By latest, I mean MAX(creationDate) and MAX(id)
So, for the above example, since there are only three different parentId values, I am hoping to get:
+----+----------+---------------------+-----------+
| id | parentId | creationDate | name |
+----+----------+---------------------+-----------+
| 11 | 2357 | 2017-01-07 14:58:37 | Katy |
+----+----------+---------------------+-----------+
| 9 | 1168 | 2017-01-06 12:46:34 | Irene |
+----+----------+---------------------+-----------+
| 7 | 5480 | 2017-01-05 10:33:23 | Gigi |
+----+----------+---------------------+-----------+
Originally the application has queries similar to this fashion:
SELECT * FROM
( SELECT * FROM `MyTable` WHERE `parentId` IN (...)
ORDER BY `creationDate` DESC, `id` DESC ) AS `t`
GROUP BY `parentId`;
On MySQL, this works, since the inner query will order and then the outer query gets the first of each GROUP from the result of the inner query. The outer query basically obeys ordering of the inner query.
But on MariaDB, the outer query will ignore the ordering of the inner query result. I get this on MariaDB instead:
+----+----------+---------------------+-----------+
| id | parentId | creationDate | name |
+----+----------+---------------------+-----------+
| 1 | 2357 | 2017-01-01 06:03:40 | Anna |
+----+----------+---------------------+-----------+
| 2 | 5480 | 2017-01-02 07:13:20 | Becky |
+----+----------+---------------------+-----------+
| 6 | 1168 | 2017-01-05 10:30:10 | Fiona |
+----+----------+---------------------+-----------+
To achieve the same behaviour on MariaDB, I have come up with something like this. (Not sure if this is accurate though.)
SELECT `t1`.* FROM `MyTable` `t1` LEFT JOIN `MyTable` `t2` ON (
`t1`.`parentId` = `t2`.`parentId`
AND `t2`.`parentId` IN (...)
AND `t1`.`creationDate` <= `t2`.`creationDate`
AND `t1`.`id` < `t2`.`id`)
) WHERE `t2`.`id` IS NULL;
Now the problem is... If I am going to rewrite the queries, I have to rewrite hundreds of them... and they are some how a little bit different from each other.
I wonder if anyone here have any ideas that would allow me to make the least changes possible.
Thank you all in advance.

Yeah, this is a link-only answer. But the links are to the MariaDB site.
Here is another discussion of the 'incompatibility': https://mariadb.com/kb/en/mariadb/group-by-trick-has-been-optimized-away/
Technically, speaking, MySQL implemented an extension to the the Ansi standard. Much later, it decided to remove it, so I think you will find that MySQL has migrated toward MariaDB.
Here is list of "fast" ways to do group-wise max, which is probably what you are trying to do: https://mariadb.com/kb/en/mariadb/groupwise-max-in-mariadb/

Your first query would probably work in MySQL but its behavior is not documented: you are grouping by groupid but you are selecting non-aggregated columns with * and the value of any of those non-aggregated columns is undefined - if the value you get is the first value encountered it's just a "matter of luck".
It is true that, even if it cannot be considered correct, on MySQL I have never seen this "trick" fail (and here on stackoverflow there are plenty of upvoted answers suggesting you to use this trick), but MariaDB uses a different optimization engine and you cannot rely on MySQL undocumented behavior.
Your second query needs a little adjustment:
and (
`t1`.`creationDate` < `t2`.`creationDate`
or (
`t1`.`creationDate` = `t2`.`creationDate`
and `t1`.`id` < `t2`.`id`
)
)
because first you are ordering by creation date, then if more than one record share the same creation date you are getting the one with the highest id.
There are other ways to write the same query, e.g.
select * from mytable
where id in (
select max(m.id)
from mytable m inner join (
select parentID, max(creationDate) as max_cd
from mytable
group by ParentID
) t on m.parentID = t.parentID and m.creationDate = t.max_cd
group by m.parentID, m.creationDate
)
but every query needs to be rewritten separately.
Edit
Your example is a little more complicated because you are ordering by both creationDate and id. Let me explain better. First thing to do, for every parentID you have to get the last creationDate:
select parentID, max(creationDate) as max_cd
from MyTable
group by parentID
then for every max creationDate you have to get the highest id:
select t.parentID, t.max_cd, max(t.id) as max_id
from
MyTable t inner join (
select parentID, max(creationDate) as max_cd
from MyTable
group by parentID
) t1 on t.parentID = t1.parentID and t.creationDate = t1.max_cd
group t.parentID, t.max_cd
then you have to get all records where the id are returned by this query. In this particular context a LEFT JOIN with the table itself should be easier to write and more performant.

Related

Issue with grouping?

I asked earlier about a solution to my problem which worked however now when I'm trying to get some information from a second table (that stores more information) I'm running into a few issues.
My tables are as follows
Users
+----+----------------------+---------------+------------------+
| id | username | primary_group | secondary_groups |
+----+----------------------+---------------+------------------+
| 1 | Username1 | 3 | 7,10 |
| 2 | Username2 | 7 | 3,5,10 |
| 3 | LongUsername | 1 | 3,7 |
| 4 | Username3 | 1 | 3,10 |
| 5 | Username4 | 7 | |
| 6 | Username5 | 5 | 3,7,10 |
| 7 | Username6 | 2 | 7 |
| 8 | Username7 | 4 | |
+----+----------------------+---------------+------------------+
Profile
+----+---------------+------------------+
| id | facebook | steam |
+----+---------------+------------------+
| 1 | 10049424151 | 11 |
| 2 | 10051277183 | 55 |
| 3 | 10051281183 | 751 |
| 4 | | 735 |
| 5 | 10051215770 | 4444 |
| 6 | 10020210531 | 50415 |
| 7 | 10021056938 | 421501 |
| 8 | 10011547143 | 761 |
+----+---------------+------------------+
My SQL is as follows (based off the previous thread)
SELECT u.id, u.username, p.id, p.facebook, p.steam
FROM users u, profile p
WHERE p.id=u.id AND FIND_IN_SET( '7', secondary_groups )
OR primary_group = 7
GROUP BY u.id
The problem is my output is displayed as below
+----+----------------------+-------------+-------+
| id | username | facebook | steam |
+----+----------------------+-------------+-------+
| 1 | Username1 | 10049424151 | 11 |
| 2 | Username2 | 10051277183 | 55 |
| 3 | LongUsername | 10051281183 | 751 |
| 4 | Username4 | 10051215770 | 4444 |
| 5 | Username5 | 10049424151 | 11 |
| 6 | Username6 | 10049424151 | 55 |
+----+----------------------+-------------+-------+
I'm guessing that the problem is that profile rows with a primary_group of 7 are getting matched to all user rows. Remove the GROUP BY, and you'll be able to better see what is happening.
But that's just a guess. It's not clear what you are attempting to achieve.
I suspect you are getting tripped up with the order of precedence of the AND and OR. (The AND operator has a higher order of precedence than OR operator. That means the AND will be evaluated before the OR.)
The quick fix is to just add some parens, to override the default order of operations. Something like this:
WHERE p.id=u.id AND ( FIND_IN_SET('7',secondary_groups) OR primary_group = 7 )
-- ^ ^
The parens will cause the OR operation to be evaluated (as either TRUE, FALSE or NULL) and then the result from that will be evaluated in the AND.
Without the parens, it's the same as if the parens were here:
WHERE ( p.id=u.id AND FIND_IN_SET('7',secondary_groups) ) OR primary_group = 7
-- ^ ^
With the AND condition evaluated first, and the result from that is operated on by OR. This is what is causing profile rows with a 7 to be matched to rows in user with different id values.
A few pointers on style:
avoid the old-school comma operator for join operations, and use the newer JOIN syntax
place the join predicates (conditions) in the ON clause, other filtering criteria in the WHERE clause
qualify all column references
As an example:
SELECT u.id
, u.username
, p.id
, p.facebook
, p.steam
FROM users u
JOIN profile p
ON p.id = u.id
WHERE u.primary_group = 7
OR FIND_IN_SET('7',u.secondary_groups)
ORDER BY u.id
We only need a GROUP BY clause if we want to "collapse" rows. If the id column is unique in both the users and profile tables, then there's no need for a GROUP BY u.id. We can add an ORDER BY clause if we want rows returned in a particular sequence.
I don't know, what exactly do you want to do with output, but you can't group informations like this. MySQL isn't really a classic programming language, it's more like powerful tool for set mathematics. So if you want to get informations based on corelations between two or more tables, first you write a select statement which contains raw data which you want to work with, like this:
SELECT * FROM users u INNER JOIN profile p ON p.id=u.id
GROUP BY u.id;
Now you select relevant data with WHERE statement:
SELECT * FROM users u INNER JOIN profile p ON p.id=u.id WHERE
FIND_IN_SET( '7', secondary_groups ) OR primary_group = 7
GROUP BY u.id;
Now you should see grouped joined tables profile and users, and can start mining data. For example, if you want to count items in these groups, just add count function in SELECT and so on.
When debugging SQL, I highly recommend these steps:
1.) First, you should write down all corelations between data, all foreign keys between tables, so you will know if your selection is fully deterministic. You can now start JOINing tables from left to right
2.) Try small bits of querys on model database. Then you will see which selection works right and which doesn't do what you expected.
I think #SIDU has it in the comments: You are experiencing a Boolean order of operations problem. See also SQL Logic Operator Precedence: And and Or
For example:
SELECT 0 AND 0 OR 1 AS test;
+------+
| test |
+------+
| 1 |
+------+
When doing complex statements with both AND and OR, use parenthesis. The operator order problem is leading to you doing an unintended outer join that's being masked by your GROUP BY. You shouldn't need a GROUP BY for that statement.
Although I don't personally care for the style #spencer7593 suggests in his answer(using INNER JOIN, etc.), it does have the advantage of preventing or identifying errors early for people new to SQL, so it's something to consider.

List Last record of each item in mysql

Each item(item is produced by Serial) in my table has many record and I need to get last record of each item so I run below code:
SELECT ID,Calendar,Serial,MAX(ID)
FROM store
GROUP BY Serial DESC
it means it must show a record for each item which in that record all data of columns be for last record related to each item but the result is like this:
-------------------------------------------------------------+
ID | Calendar | Serial | MAX(ID) |
-------------------------------------------------------------|
7031053 | 2016-05-14 14:05:14 79.5 | N10088 | 7031056 |
7053346 | 2016-05-14 15:17:28 79.8 | N10078 | 7053346 |
7051349 | 2016-05-14 15:21:29 86.1 | J20368 | 7051349 |
7059144 | 2016-05-14 15:50:27 89.6 | J20367 | 7059144 |
7045551 | 2016-05-14 15:15:15 89.2 | J20366 | 7045551 |
7056243 | 2016-05-14 15:25:34 85.2 | J20358 | 7056245 |
7042652 | 2016-05-14 15:18:33 83.9 | J20160 | 7042652 |
7039753 | 2016-05-14 11:48:16 87 | J20158 | 7039753 |
7036854 | 2016-05-14 15:18:35 87.5 | J20128 | 7036854 |
7033955 | 2016-05-14 15:20:45 83.4 | 9662 | 7033955 |
-------------------------------------------------------------+
the problem is why for example in record related to Serial N10088 the ID is "7031053", but MAX(ID) is "7031056"? or also for J20358?
each row must show last record of each item but in my output it is not true!
If you want the row with the max value, then you need a join or some other mechanism.
Here is a simple way using a correlated subquery:
select s.*
from store s
where s.id = (
select max(s2.id)
from store s2
where s2.serial = s.serial
);
You query uses a (mis)feature of SQL Server that generates lots of confusion and is not particularly helpful: you have columns in the select that are not in the group by. What value do these get?
Well, in most databases the answer is simple: the query generates an error as ANSI specifies. MySQL pulls the values for the additional columns from indeterminate matching rows. That is rarely what the writer of the query intends.
For performance, add an index on store(serial, id).
try this one.
SELECT MAX(id), tbl.*
FROM store tbl
GROUP BY Serial
You can try with this also...
SELECT ID,Calendar,Serial
FROM store s0
where ID = (
SELECT MAX(id)
FROM store s1
WHERE s1.serial = s0.serial
);

Only return an ordered subset of the rows from a joined table

Given a structure like this in a MySQL database
#data_table
(id) | user_id | time | (...)
#relations_table
(id) | user_id | user_coach_id | (...)
we can select all data_table rows belonging to a certain user_coach_id (let's say 1) with
SELECT rel.`user_coach_id`, dat.*
FROM `relations_table` rel
LEFT JOIN `data_table` dat ON rel.`uid` = dat.`uid`
WHERE rel.`user_coach_id` = 1
ORDER BY val.`time` DESC
returning something like
| user_coach_id | id | user_id | time | data1 | data2 | ...
| 1 | 9 | 4 | 15 | foo | bar | ...
| 1 | 7 | 3 | 12 | oof | rab | ...
| 1 | 6 | 4 | 11 | ofo | abr | ...
| 1 | 4 | 4 | 5 | foo | bra | ...
(And so on. Of course time are not integers in reality but to keep it simple.)
But now I would like to query (ideally) only up to an arbitrary number of rows from data_table per distinct user_id but still have those ordered (i.e. newest first). Is that even possible?
I know I can use GROUP BY user_id to only return 1 row per user, but then the ordering doesn't work and it seems kind of unpredictable which row will be in the result. I guess it's doable with a subquery, but I haven't figured it out yet.
To limit the number of rows in each GROUP is complicated. It is probably best done with an #variable to count, plus an outer query to throw out the rows beyond the limit.
My blog on Groupwise Max gives some hints of how to do such.

Calculating row indices with subquery having joins, results in A*B examined rows

This question is derived from a one I started previously: Incorrect row index when grouping
Due to different natures, I'm asking here and will provide the answer back there once I have resolved this issue.
I thought about subqueries, and came up with this:
SELECT
mq.*,
#indexer := #indexer + 1 AS indexer
FROM
(
SELECT
p.id,
p.tag_id,
p.title,
p.created_at
FROM
`posts` AS p
LEFT JOIN
`votes` AS v
ON p.id = v.votable_id
AND v.votable_type = "Post"
AND v.deleted_at IS NULL
WHERE
p.deleted_at IS NULL
GROUP BY
p.id
) AS mq
JOIN
(SELECT #indexer := 0) AS i
Which actually works, I get the desired result:
+----+--------+------------------------------------+---------------------+---------+
| id | tag_id | title | created_at | indexer |
+----+--------+------------------------------------+---------------------+---------+
| 2 | 2 | PostPostPost | 2014-10-23 23:53:15 | 1 |
| 3 | 3 | Title | 2014-10-23 23:56:13 | 2 |
| 4 | 2 | GIFGIFIGIIF | 2014-10-23 23:59:03 | 3 |
| 5 | 2 | GIFGIFIGIIF | 2014-10-23 23:59:03 | 4 |
| 6 | 4 | My new avatar | 2014-10-26 22:22:30 | 5 |
| 7 | 5 | Hi, haiii, oh Hey ! | 2014-10-26 22:38:10 | 6 |
| 8 | 6 | Mclaren testing stealth technology | 2014-10-26 22:44:15 | 7 |
| 9 | 7 | Just random thoughts while pooping | 2014-10-26 22:50:03 | 8 |
+----+--------+------------------------------------+---------------------+---------+
The problem now is... I ran a EXPLAIN query, to see how fast it works. And, I have a number there that is really bugging me:
Well, the number is obvious: 252 * 1663 = 419076.
This worries me, though - is the row count normal there, or I have to optimize the query? And if so, then how do I optimize this one?
As of MySQL version 5.7 all joins are treated as nested loop joins.
MySQL resolves all joins using a nested-loop join method. This means that MySQL reads a row from the first table, and then finds a matching row in the second table, the third table, and so on.
So to answer your question... no, you won't be able to get that row count down. However, by adding indexes to your join columns you may be able to achieve faster results but your row count will be the same.

how to store a *sorted* SELECT result in another table?

In my projects I often need to store the result of a SELECT in another table (we call this a "resultset"). The reason is to dynamically display a large number of rows in a web application while loading only small chunks as necessary.
Typically, this is done by queries such as this one:
SET #counter := 0;
INSERT INTO resultsetdata
SELECT "12345", #counter:=#counter+1, a.ID
FROM sometable a
JOIN bigtable b
WHERE (a.foo = b.bar)
ORDER BY a.whatever DESC;
The fixed "12345" value is just a value to identify the "resultset" as a whole and changes for each query. The second column is a incrementing index counter that is meant to allow direct access to a specific row in the result and the ID column references the specific row in the source data table.
When the application needs a certain range of the result I just join resultsetdata with the source table to get the detailed data - which is quick as opposed to the resultsetdata query above which may take 2-3 seconds to complete (which explains why I need this intermediary table).
The SELECT query itself is not relevant for this question.
resultsetdata has the following structure:
CREATE TABLE `resultsetdata` (
`ID` int(11) NOT NULL,
`ContIdx` int(11) NOT NULL,
`Value` int(11) NOT NULL,
PRIMARY KEY (`ID`,`ContIdx`)
) ENGINE=InnoDB;
This usually works like a charm but lately we noticed that in some cases the ORDER of the result is not correct. This depends on the query itself (for example, adding DISTINCT is a typical cause), the server version and the data contained in the source tables, so I guess one can say that the row order is unpredictable with this method. Probably it depends on internal optimizations.
However, the problem is now that I can't think of any alternative solution that gives me the expected result.
Since the resultset can get several thousands of rows, loading all data in memory and then manually INSERTing it is not feasible.
Any suggestions?
EDIT: For further clarification, have a look at these queries:
DROP TABLE IF EXISTS test;
CREATE TABLE test (ID INT NOT NULL, PRIMARY KEY(ID)) ENGINE=InnoDB;
INSERT INTO test (ID) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
SET #counter:=0;
SELECT "12345", #counter:=#counter+1, ID
FROM test
ORDER BY ID DESC;
This produces the following result as "expected":
+-------+----------------------+----+
| 12345 | #counter:=#counter+1 | ID |
+-------+----------------------+----+
| 12345 | 1 | 10 |
| 12345 | 2 | 9 |
| 12345 | 3 | 8 |
| 12345 | 4 | 7 |
| 12345 | 5 | 6 |
| 12345 | 6 | 5 |
| 12345 | 7 | 4 |
| 12345 | 8 | 3 |
| 12345 | 9 | 2 |
| 12345 | 10 | 1 |
+-------+----------------------+----+
10 rows in set (0.00 sec)
As said, in some cases (I can't provide a testcase here, sorry), this may lead to a result similar to this:
+-------+----------------------+----+
| 12345 | #counter:=#counter+1 | ID |
+-------+----------------------+----+
| 12345 | 10 | 10 |
| 12345 | 9 | 9 |
| 12345 | 8 | 8 |
| 12345 | 7 | 7 |
| 12345 | 6 | 6 |
| 12345 | 5 | 5 |
| 12345 | 4 | 4 |
| 12345 | 3 | 3 |
| 12345 | 2 | 2 |
| 12345 | 1 | 1 |
+-------+----------------------+----+
I'm not saying this is a MySQL bug and I fully understand that my method currently provides unpredictable results. Still, I don't know how to tweak this to get predictable results.
This is because the order that records are sorted when they are inserted is unrelated to the order when you retrieve them.
When you retrieve them a query plan will be created. If no ORDER BY is specified in your SELECT statement then the order will depend on the query plan produced. This is why it is unpredictable and adding DISTINCT can change the order.
The solution is to store enough data that you can retrieve them in the correct order using an ORDER BY clause. In your case you have ordered your data by a.whatever. Can a.whatever be stored in resultsetdata? If so then you can read the records out in the correct order.
Maybe you could wrap the select into another select:
SET #counter := 0;
INSERT INTO resultsetdata
SELECT *, #counter := #counter + 1
FROM (
SELECT "12345", a.ID
FROM sometable a
JOIN bigtable b
WHERE a.foo = b.bar
ORDER BY a.whatever DESC
) AS tmp
... but you are still at the mercy of the dumbness of MySQL's optimizer.
That's all I found about this topic, but I couln't find a hard guarantee:
Pure-SQL Technique for Auto-Numbering Rows in Result Set
http://www.xaprb.com/blog/2006/12/02/how-to-number-rows-in-mysql/
http://www.xaprb.com/blog/2005/09/27/simulating-the-sql-row_number-function/