MySQL: How reliable are values provided when rows are grouped?

I think this is a relatively advanced question and I may have trouble asking it well, so apologies in advance for any babbling.
I love MySQL's grouping functions. MIN(), MAX(), etc. make it easy to group rows by a certain common factor, then fetch salient features of each pool of grouped rows. But the question I'm asking relates to cases where I do not want this behavior; rather, in a particular situation, I want to ensure that when I group a set of (let's say 10) rows into a single row, for any values that vary from row to row, all values displayed in the resulting grouped row are derived from the same pre-grouped row. My question: is this possible? Are there potholes I should look out for?
Let me share a bit of this query's structure. At core, it has a "parent" table (here t1) joined to a "child" table (here t2). The query results, prior to any grouping or sorting, may list the same t1 record multiple times, associated with different t2 records and values. I want the final output to be grouped such that each t1 record only appears once, and that the t2 values displayed in each row reflect the t2 record that had the highest priority (among all t2 records associated with that t1 record). See my dumbed-down query below for example.
Based on my experimentation, it seems that nested queries should be able to do this, where I ORDER first, then GROUP later. The GROUP operation seems to reliably preserve the values from the first row it came across, meaning that if I ORDER then GROUP, I should have reasonable control over which values are included in the grouped output.
Here's an example of the query structure I'm planning. My question: Am I missing anything? Have you experienced GROUP to behave in ways that might make this a bad plan for me? Can you think of a simpler way to achieve what I'm describing?
Thanks in advance!
SELECT * FROM (
    SELECT
        # Each record from t1 may only appear once in the final output.
        t1.id, t1.field2, t1.field3, t1.field4,
        # There are multiple t2 records (each having different values & priority)
        # associated with each t1 record.
        t2.id AS t2_id, t2.field5, t2.field6, t2.priority
    FROM t1
    JOIN t2 ON t1.id = t2.t1_id
    { several other joins }
    WHERE { lots of conditions }
    ORDER BY t2.priority ) t
GROUP BY t.id

It's not reliable at all. The DBMS does not specify which row will be returned in the case you describe. What's more, this is a MySQL-only feature: in standard SQL it is invalid to select nonaggregated columns that are not named in the GROUP BY. Further explanation of this behavior can be found on this manual page:
However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values within each group the server chooses.

There's another way to get the right result that would work in any DBMS. Taking your original query, it would look something like this.
SELECT
    t1.id, t1.field2, t1.field3, t1.field4,
    t2.id AS t2_id, t2.field5, t2.field6, t2.priority
FROM t1
JOIN t2 ON t1.id = t2.t1_id AND t2.priority =
    (SELECT MAX(t2b.priority) FROM t2 AS t2b WHERE t1.id = t2b.t1_id)
{ several other joins }
WHERE { lots of conditions }
(I assumed there's only one row in t2 per (t1_id, priority) pair.)
Hope it helps!
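For anyone who wants to try the idea above without a MySQL server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite standing in for MySQL, the trimmed-down column list, and the sample rows are all my assumptions; only the join condition is taken from the query above.

```python
import sqlite3

# Assumption: SQLite is used as a stand-in for MySQL; the sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER PRIMARY KEY, field2 TEXT);
    CREATE TABLE t2 (id INTEGER PRIMARY KEY, t1_id INTEGER, field5 TEXT, priority INTEGER);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES
        (10, 1, 'low', 1), (11, 1, 'high', 9),
        (20, 2, 'only', 5);
""")

# Join each t1 row only to the t2 row holding the maximum priority for that t1 row.
rows = conn.execute("""
    SELECT t1.id, t1.field2, t2.id AS t2_id, t2.field5, t2.priority
    FROM t1
    JOIN t2 ON t1.id = t2.t1_id
           AND t2.priority = (SELECT MAX(t2b.priority)
                              FROM t2 AS t2b
                              WHERE t2b.t1_id = t1.id)
    ORDER BY t1.id
""").fetchall()

for row in rows:
    print(row)
```

Every t2 column in a result row comes from the same t2 record, because the row is picked by an explicit condition and not by GROUP BY. If two t2 rows tie on priority for the same t1 row, both come back, so a tie-breaker would be needed in that case.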

Related

What's the execution order of each part of a SQL query

What's the execution order of each part of a SQL query, like SELECT, DISTINCT, FROM, WHERE, GROUP BY, ORDER BY, and so on?
Many of the sites I searched said that ORDER BY executes after SELECT. If this is true, a simple query like 'select column1 from table1 order by column2' should not execute, because after SELECT has run only column1 is in the dataset, so column2 can't be used to sort it. But it actually works!
Consider a query -
select distinct <columns> from
table1 t1 inner join t2
on t1.col=t2.col
where <conditions>
group by <col>
having <conditions>
Order of execution would be -
> From
> ON
> JOIN
> Where
> group by
> Having
> Select
> Distinct
> Order By
Let's decompose two queries against two tables, both containing two columns. First, we'll do a simple one:
SELECT t1.a,t2.d + 6 as e
FROM
table1 t1
inner join
table2 t2
on
t1.a = t2.c
WHERE
t1.b = 2
ORDER BY
t2.c
And let's consider what is "in scope" as we complete each clause:
FROM table1 t1 - at this point, we have a result set containing two columns - {t1.a, t1.b}.
INNER JOIN table2 t2 ON ... - we now have a result set containing four columns - {t1.a, t1.b, t2.c, t2.d}. We may personally also know that a and c are equal, but that's irrelevant for the analysis.
WHERE - although WHERE can filter rows from a query, it doesn't change the set of columns making up the result set - it's still {t1.a, t1.b, t2.c, t2.d}.
SELECT - we don't have a GROUP BY clause, and so the job of the SELECT clause here is a) to mark some columns for output and b) possibly to add some additional columns whose values are computed. That's what we have here. We end up with a set of {O(t1.a), t1.b, t2.c, t2.d, O(e = t2.d + 6)} [1].
ORDER BY - now we order by t2.c, which is still in scope despite the fact that it won't be output.
Finally, the outputs of this query are delivered (technically via a cursor) and contain just {a, e}. The columns no longer have their "originating table" associated with them, and the non-output columns disappear into the ether.
Now the second query, which adds grouping:
SELECT
t1.a,SUM(t2.d) as e
FROM
table1 t1
inner join
table2 t2
on
t1.a = t2.c
GROUP BY t1.a
HAVING e > 5
ORDER BY t1.a
The FROM/JOIN clauses are identical to the previous query, and so the same analysis applies. We have no WHERE clause this time, but that is irrelevant to the set of columns. We have {t1.a, t1.b, t2.c, t2.d}.
SELECT/GROUP BY/DISTINCT - DISTINCT and GROUP BY are really the same thing: both identify a set of columns, either explicitly (GROUP BY) or by their presence in the SELECT clause (DISTINCT). You cannot untie SELECT from GROUP BY because we also have to compute aggregates, and the aggregate definitions are in the SELECT clause. For each distinct set of values in the grouping columns, we produce a single output row containing that set of values together with any computed aggregates. We produce here {O(t1.a), O(e)} [2] and that is the result set that the remaining parts of the query can observe. The original result set is no longer in scope.
HAVING - we can work with just those columns produced by the SELECT clause [3]. But again, we filter rows, not columns.
ORDER BY - this can also only work with the columns produced by the SELECT.
By the time SELECT was done, we only had output columns anyway, but the output processing is the same as before.
Hopefully, from the above you can see that SELECT can work in two quite different ways; but at least now you're aware of the difference and what the knock-on effects of that are.
[1] I'm making up terminology on the fly here, but I'm using the O() wrapper to mean "this column will be in the final result set".
[2] This is the behaviour you appear to have been expecting SELECT to always exhibit, only providing the "outputable" rows to later clauses.
[3] MySQL contains an extension to the SQL standard that allows non-grouped and non-aggregated columns to appear as HAVING clause predicates. They're effectively re-written to be used in the WHERE clause instead.
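The non-grouped case from the question is easy to check for yourself. Below is a small sketch using Python's built-in sqlite3 module; using SQLite in place of MySQL and the three sample rows are my own assumptions. It runs the question's 'select column1 from table1 order by column2' and shows that only column1 is output even though the sort used column2.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (column1 TEXT, column2 INTEGER);
    INSERT INTO table1 VALUES ('x', 3), ('y', 1), ('z', 2);
""")

cur = conn.execute("SELECT column1 FROM table1 ORDER BY column2")

# Names of the columns that are actually delivered to the client.
output_columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(output_columns)
print(rows)
```

The rows come back sorted by column2, which was still in scope for ORDER BY but was never marked for output.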

MySQL update from another table with multiple entries

I have seen a bunch of helpful answers about updating table values from a different table with multiple values based on a timestamp using a MAX() subquery.
e.g. Update another table based on latest record
I was wondering how this compares with doing an ALTER first and relying on the order in the table to simplify the UPDATE. Something like this:
ALTER TABLE `table_with_multiple_data` ORDER BY `timestamp` DESC;
UPDATE `table_with_single_data` as `t1`
LEFT JOIN `table_with_multiple_data` AS `t2`
ON `t1`.`id`=`t2`.`t1id`
SET `t1`.`value` = `t2`.`value`;
(Apologies for the pseudocode but I hope you get what I'm asking)
Both achieve the same result for me, but I don't really have a big enough data set to see any difference in speed.
Thanks!!
You would normally use a correlated subquery:
UPDATE table_with_single_data t1
SET t1.value = (select t2.value
from table_with_multiple_data t2
where t2.t1id = t1.id
order by t2.timestamp desc
limit 1
);
If your method happens to work, that is just happenstance. Even if MySQL respected the ordering of tables, such ordering would not survive the join operation. Not to mention that there is no guarantee of which value is assigned when there are multiple matching rows.
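Here is a small runnable sketch of the correlated-subquery UPDATE, using Python's built-in sqlite3 module. SQLite in place of MySQL and the sample rows are my assumptions; I also dropped the t1 alias on the UPDATE target and refer to the table by its full name inside the subquery.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_with_single_data (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE table_with_multiple_data (t1id INTEGER, timestamp INTEGER, value TEXT);
    INSERT INTO table_with_single_data VALUES (1, NULL), (2, NULL);
    -- The newest row for id 1 is deliberately not inserted last.
    INSERT INTO table_with_multiple_data VALUES
        (1, 200, 'new'), (1, 100, 'old'), (2, 50, 'x');
""")

# For each target row, pick the value with the latest timestamp explicitly.
conn.execute("""
    UPDATE table_with_single_data
    SET value = (SELECT t2.value
                 FROM table_with_multiple_data t2
                 WHERE t2.t1id = table_with_single_data.id
                 ORDER BY t2.timestamp DESC
                 LIMIT 1)
""")

updated = conn.execute(
    "SELECT id, value FROM table_with_single_data ORDER BY id"
).fetchall()
print(updated)
```

Note that a row in table_with_single_data with no match in the other table gets NULL from this form of UPDATE, so add a WHERE clause if you want to leave such rows alone.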

MySQL GROUP BY behavior (when using a derived table with order by)

Since MySQL does not enforce the Single-Value Rule (see: https://stackoverflow.com/a/1646121/1688441), does a derived table with an ORDER BY guarantee which row values will be displayed? This is for columns not in an aggregate function and not in the GROUP BY.
I was looking at the question (MySQL GROUP BY behavior) after having commented on and answered it (https://stackoverflow.com/a/24653572/1688441).
I don't agree with the accepted answer, but realized that a possible improved answer would be:
SELECT * FROM
(SELECT * FROM tbl order by timestamp) as tb2
GROUP BY userID;
http://sqlfiddle.com/#!2/4b475/18
Is this correct, though, or will MySQL still decide arbitrarily which row values will be displayed?
This query:
SELECT *
FROM (SELECT * FROM tbl order by timestamp) as tb2
GROUP BY userID;
relies on a MySQL GROUP BY extension, which is documented here. You are specifically relying on the assumption that all the columns come from the same row, and that this row is the first one encountered. MySQL specifically warns against making this assumption:
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
So, you cannot depend on this behavior. It is easy enough to work around. Here is an example query:
select t.*
from tbl t
where not exists (select 1 from tbl t2 where t2.userid = t.userid and t2.timestamp > t.timestamp)
With an index on tbl(userid, timestamp) this may even work faster. MySQL does a notoriously poor job of optimizing aggregations.
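A quick way to try the NOT EXISTS form is the sketch below, which uses Python's built-in sqlite3 module. SQLite in place of MySQL, the extra note column, the index name, and the sample rows are all my assumptions.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (userid INTEGER, timestamp INTEGER, note TEXT);
    CREATE INDEX idx_tbl_user_ts ON tbl (userid, timestamp);
    INSERT INTO tbl VALUES (1, 10, 'a'), (1, 20, 'b'), (2, 5, 'c');
""")

# Keep a row only if no later row exists for the same user.
latest = conn.execute("""
    SELECT t.*
    FROM tbl t
    WHERE NOT EXISTS (SELECT 1
                      FROM tbl t2
                      WHERE t2.userid = t.userid
                        AND t2.timestamp > t.timestamp)
    ORDER BY t.userid
""").fetchall()
print(latest)
```

Each output row is a complete original row, so no values get mixed across rows. If a user has two rows with the same latest timestamp, both are returned.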

only select the row if the field value is unique

I sort the rows on date. If I want to select every row that has a unique value in the last column, can I do this with sql?
So I would like to select the first row, second one, third one not, fourth one I do want to select, and so on.
What you want are not unique rows, but rather one row per group. This can be done by taking the MIN(pk_artikel_id) and GROUP BY fk_artikel_bron. This method uses an IN subquery to get the first pk_artikel_id for each unique fk_artikel_bron, and then uses that to get the remaining columns in the outer query.
SELECT * FROM tbl
WHERE pk_artikel_id IN
(SELECT MIN(pk_artikel_id) AS id FROM tbl GROUP BY fk_artikel_bron)
Although MySQL would permit you to add the rest of the columns to the SELECT list of the grouped query directly, avoiding the IN subquery, that isn't really portable to other RDBMSs. This method is a little more generic.
It can also be done with a JOIN against the subquery, which may or may not be faster. Hard to say without benchmarking it.
SELECT *
FROM tbl
JOIN (
SELECT
fk_artikel_bron,
MIN(pk_artikel_id) AS id
FROM tbl
GROUP BY fk_artikel_bron) mins ON tbl.pk_artikel_id = mins.id
This is similar to Michael's answer, but does it with a self-join instead of a subquery. Try it out to see how it performs:
SELECT * from tbl t1
LEFT JOIN tbl t2
ON t2.fk_artikel_bron = t1.fk_artikel_bron
AND t2.pk_artikel_id < t1.pk_artikel_id
WHERE t2.pk_artikel_id IS NULL
If you have the right indexes, this type of join often outperforms subqueries (since derived tables don't use indexes).
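To check that the self-join and the IN/MIN approach pick the same rows, here is a small sketch using Python's built-in sqlite3 module. SQLite in place of MySQL, the two-column table, and the sample rows are my assumptions; it says nothing about relative speed on MySQL.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; table contents are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl (pk_artikel_id INTEGER PRIMARY KEY, fk_artikel_bron TEXT);
    INSERT INTO tbl VALUES (1, 'A'), (2, 'B'), (3, 'A'), (4, 'C'), (5, 'B');
""")

# Self-join: keep rows for which no row with a smaller id shares the same bron.
via_join = [r[0] for r in conn.execute("""
    SELECT t1.pk_artikel_id
    FROM tbl t1
    LEFT JOIN tbl t2
      ON t2.fk_artikel_bron = t1.fk_artikel_bron
     AND t2.pk_artikel_id < t1.pk_artikel_id
    WHERE t2.pk_artikel_id IS NULL
    ORDER BY t1.pk_artikel_id
""")]

# IN subquery: keep rows whose id is the minimum id of their group.
via_in = [r[0] for r in conn.execute("""
    SELECT pk_artikel_id
    FROM tbl
    WHERE pk_artikel_id IN (SELECT MIN(pk_artikel_id)
                            FROM tbl
                            GROUP BY fk_artikel_bron)
    ORDER BY pk_artikel_id
""")]

print(via_join, via_in)
```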
This non-standard, MySQL-only trick will select the first row encountered for each value of fk_artikel_bron.
select *
...
group by fk_artikel_bron
Like it or not, this query produces the output asked for.
Edited
I seem to be getting hammered here, so here's the disclaimer:
This only works for MySQL 5+
Although the MySQL documentation says the row returned using this technique is not predictable (i.e. you could get any row as the "first" encountered), in all cases I've ever seen you get the first row as per the order selected. So, to get a predictable row in practice (this may not work in future releases, but probably will), select from an ordered result:
select * from (
select *
...
order by pk_artikel_id) x
group by fk_artikel_bron

Single query to get data from 3 separate tables based on provided timestamp

I'm working with three tables which can be summarized as follows:
Table1 (pid, created, data, link1)
Table2 (pid, created, data)
Table3 (link1, created, data)
The created field is a UNIX timestamp, and the data field is the same format in each table.
I want to get the data in all 3 tables such that Table1.pid = Table2.pid AND Table1.link1 = Table3.link1 AND created is less than and closest to a given timestamp value.
So, as an example, say I provide a date of May 7, 2011, 1:00pm. I'd want the data record from each table created most recently before this date-time.
Now I've managed to do this in a rather ugly single query involving sub-queries (using either INNER JOINs or UNIONs), but I'm wondering whether it can be done w/o sub-queries in a single query? Thanks for any suggestions.
Obviously I haven't actually run this query, but it seems like it would do what you need, if I'm understanding your question right.
t1.pid and t2.pid are the same, t1.link1 and t3.link1 are the same, none of the .created values are above your time, and you want the row closest to the date.
SELECT t1.data, t2.data, t3.data
FROM Table1 t1, Table2 t2, Table3 t3
WHERE t1.pid = t2.pid
AND t1.link1 = t3.link1
AND GREATEST(t1.created, t2.created, t3.created) < 'your-formatted-time'
ORDER BY GREATEST(t1.created, t2.created, t3.created) DESC;