MySQL Union sort works unexpected at different server

Good day,
I have a problem with a "simple" query. When I execute it on a different server, I get a different result set from the one I need.
I tried re-importing all the tables via export and import, and it still doesn't work.
Where could the problem be? Could it be MariaDB?
Database versions:
5.6.27 - MySQL Community Server (GPL), client: libmysql 5.0.11-dev
10.0.25-MariaDB-0+deb8u1 - (Debian), client: libmysql 5.5.49
Both running on MyISAM engine.
Query:
SELECT id, datum, ordinary FROM (SELECT *, 0 as `ordinary` FROM
`user_todolist` WHERE `done` = '0'
AND `deleted` = '0' AND `id_uzivatel` = '1' ORDER BY `datum` ASC) AS a1
UNION
SELECT id, datum, ordinary FROM (SELECT *, 1 as `ordinary` FROM
`user_todolist` WHERE `done` = '1' AND `deleted` = '0' AND
`id_uzivatel` = '1' ORDER BY `datum` DESC) AS a2
ORDER BY `ordinary`
Results (Left EXPECTED, Right Invalid):
SQL Explain(Top for Expected, Bot Invalid)

ORDER BY is not necessarily a stable sort. When you apply ORDER BY to the result of the UNION, it can reorder within the groups.
You don't need ORDER BY ordinary in the outer query. When you use UNION, the results are ordinarily in the order of the sub-queries, so the results of the first SELECT will come first, and the second SELECT after that.
You should change UNION to UNION ALL, though. By default, it's UNION DISTINCT, which means it has to combine the results of the queries to remove duplicates. Since there can never be duplicates between the queries (since they have different ordinary columns) this is unnecessary.
Another solution that doesn't rely on this (I'm not actually sure if it's guaranteed) is to take ORDER BY datum out of the subqueries, and make the main query use:
ORDER BY ordinary, IF(ordinary = 0, datum, '') ASC, IF(ordinary = 1, datum, '') DESC
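To make the single outer ORDER BY concrete, here is a minimal sketch that runs against an in-memory SQLite database from Python. The sample rows are invented, CASE stands in for MySQL's IF(), and the UNION ALL is wrapped in a derived table (the alias `u` is hypothetical) because SQLite needs that for expression sort keys. It illustrates the idea only; it is not a test against MySQL or MariaDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_todolist (
    id INTEGER PRIMARY KEY,
    datum TEXT,
    done INTEGER,
    deleted INTEGER,
    id_uzivatel INTEGER
);
-- Invented sample data; row 5 is deleted and must not appear.
INSERT INTO user_todolist VALUES
    (1, '2016-01-03', 0, 0, 1),
    (2, '2016-01-01', 0, 0, 1),
    (3, '2016-02-01', 1, 0, 1),
    (4, '2016-02-05', 1, 0, 1),
    (5, '2016-03-01', 1, 1, 1);
""")

# No ORDER BY inside the branches: the only sort is the outer one.
query = """
SELECT id, datum, ordinary
FROM (
    SELECT id, datum, 0 AS ordinary FROM user_todolist
    WHERE done = 0 AND deleted = 0 AND id_uzivatel = 1
    UNION ALL
    SELECT id, datum, 1 AS ordinary FROM user_todolist
    WHERE done = 1 AND deleted = 0 AND id_uzivatel = 1
) AS u
ORDER BY ordinary,
         CASE WHEN ordinary = 0 THEN datum END ASC,   -- open items: oldest first
         CASE WHEN ordinary = 1 THEN datum END DESC   -- done items: newest first
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each CASE key is NULL for the group it does not apply to, so it only breaks ties inside its own group.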

ANSI standard says that the ORDER BY in a subquery can be ignored. This is equivalent to saying that a table has no intrinsic order to the rows.
Recently both Oracle and MariaDB (apparently independently) started taking advantage of this standard.
UNION ALL is appropriate since there is no overlap of values, and since it is faster than UNION DISTINCT, due to the absence of a de-dup pass.
UNION has traditionally been implemented by creating a temp table, feeding rows from one select into it, then the next select.
Recently the need for the temp table was eliminated in certain situations. This is a nice optimization.
In the future, I expect multiple threads to perform the SELECTs in parallel. This will really invalidate any assumptions you may enjoy today about how things are ordered.
Bottom line: Remove the internal ORDER BYs and add an external ORDER BY, such as @Barmar's suggestion.
That way, your query will work 'correctly' in all past, current, and future versions of MySQL/MariaDB. (I was first burned by the issue several years ago: Here.)
Meanwhile, switch to InnoDB before MyISAM is removed entirely.

Related

Correct format for Select in SQL Server

I have what should be a simple query for any database. It always runs in MySQL but not in SQL Server:
select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
group by assetid
order by ts desc
The error is:
column tagalerts.id is invalid in the select list because it is not contained in either an aggregate function or the group by clause.
It is not a simple case of adding tagalerts.id to the group by clause because the error repeats for ts and for assetid etc, implying that all the selects need to be in a group or in aggregate functions... either of which will result in a meaningless and inaccurate result.
Splitting the select into a subquery to sort and group correctly (which again works fine with MySQL, as you would expect) makes matters worse
SELECT * from
(select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
order by ts desc
)T1
group by assetid
This fails with: the ORDER BY clause is invalid in views, inline functions, derived tables and expressions unless TOP etc. is used
The 'correct output' should be:
id    ts                assetid  zonename   battlevel
1234  a datetime        1569     Reception  0
3182  another datetime  1572     Reception  0
Either I am reading SQL Server's rules entirely wrong or this is a major flaw with that database.
How can I write this to work on both systems?

In most databases you can't include columns that aren't in the GROUP BY without wrapping them in an aggregate function.
MySQL is an exception (it allows this when the ONLY_FULL_GROUP_BY SQL mode is off), but MS SQL Server isn't.
So you could keep the GROUP BY with only "assetid",
but then use appropriate aggregate functions for all the other columns.
Also, use the JOIN syntax, for heaven's pudding sake.
A query like select * from table1, table2 where table1.id2 = table2.id uses a syntax from the previous century.
SELECT
    MAX(ta.id) AS id,
    MAX(ta.ts) AS ts,
    ta.assetid,
    MAX(node.zonename) AS zonename,
    MAX(ta.battlevel) AS battlevel
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
GROUP BY ta.assetid
-- Sort on the aggregate: SQL Server rejects a bare ta.ts here because it is not grouped.
ORDER BY MAX(ta.ts) DESC;
Another trick to use in MS SQL Server is the window function ROW_NUMBER.
But this is probably not what you need.
Example:
SELECT id, ts, assetid, zonename, battlevel
FROM
(
    SELECT
        ta.id,
        ta.ts,
        ta.assetid,
        node.zonename,
        ta.battlevel,
        ROW_NUMBER() OVER (PARTITION BY ta.assetid ORDER BY ta.ts DESC) AS rn
    FROM tagalerts AS ta
    JOIN node ON node.id = ta.nodeid
    WHERE ta.ack = 0
) q
WHERE rn = 1
ORDER BY ts DESC;

I strongly suspect this query is WRONG even in MySQL.
We're missing a lot of details (sample data, and we don't know which table all of the columns belong to), but what I do know is you're grouping by assetid, where it looks like one assetid value could have more than one ts (timestamp) value in the group. It also looks like you're counting on the order by ts desc to ensure both that you see recent timestamps in the results first and that each assetid group uses the most recent possible ts timestamp for that group.
MySQL only guarantees the former, not the latter. Nothing in this query guarantees that each assetid is using the most recent timestamp available. You could be seeing the wrong timestamps, and then also using those wrong timestamps for the order by. This is the problem the SQL Server rule is there to stop. MySQL violates the SQL standard to allow you to write that wrong query.
Instead, you need to look at each column and either add it to the group by (best when all of the values are known to be the same, anyway) or wrap it in an aggregate function like MAX(), MIN(), AVG(), etc., so there is a deterministic result for which value from the group is used.
If all of the values for a column in a group are the same, then there's no problem adding it to the group by. If the values are different, you want to be precise about which one is chosen for the result set.
While I'm here, the tagalerts, node join syntax has been obsolete for more than 20 years now. It's also good practice to use an alias with every table and prefix every column with the alias. I mention these to explain why I changed it for my code sample below, though I only prefix columns where I am confident in which table the column belongs to.
This query should run on both databases:
SELECT ta.assetid, MAX(ta.id) "id", MAX(ta.ts) "ts",
MAX(n.zonename) "zonename", MAX(battlevel) "battlevel"
FROM tagalerts ta
INNER JOIN node n ON ta.nodeid = n.id
WHERE ack = 0
GROUP BY ta.assetid
ORDER BY ts DESC
There is also a concern that the results may combine values from different records in the joined node table. So if battlevel is part of the node table, you might see a result that matches a zonename with a battlevel that never occurs together in any record in the data. In SQL Server, this is easily fixed by using APPLY to match only one node record to each tagalert. MySQL did not support this when this was written (APPLY or an equivalent has been in every other major database since at least 2012; MySQL gained LATERAL derived tables in 8.0.14), but you can simulate it in this case with two JOINs, where the first join is a subquery that uses GROUP BY to determine the values that uniquely identify the needed node record, and the second join is to the node table to actually produce that record. Unfortunately, we need to know more about the tables in question to actually write this code for you.
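As a rough illustration of both the mixing problem and the two-join workaround, here is a sketch run on SQLite from Python. The schema is an assumption, not the asker's real one: it puts ts, battlevel and ack in tagalerts and zonename in node, and all rows are invented. The derived-table alias `newest` and the column `max_ts` are hypothetical names. Note that this variant joins back to tagalerts on the newest ts per asset, then to node, rather than identifying the node record directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Assumed schema and invented rows, for illustration only.
CREATE TABLE node (id INTEGER PRIMARY KEY, zonename TEXT);
CREATE TABLE tagalerts (
    id INTEGER PRIMARY KEY, ts TEXT, assetid INTEGER,
    nodeid INTEGER, battlevel INTEGER, ack INTEGER
);
INSERT INTO node VALUES (1, 'Reception'), (2, 'Warehouse');
INSERT INTO tagalerts VALUES
    (10, '2020-01-01 08:00', 1569, 2, 90, 0),
    (11, '2020-01-02 09:00', 1569, 1, 10, 0),
    (12, '2020-01-01 07:00', 1572, 1, 50, 0);
""")

# Per-column MAX(): each column is maximised independently, so one output
# row can combine values taken from different source rows.
mixed = conn.execute("""
SELECT ta.assetid, MAX(ta.ts), MAX(n.zonename), MAX(ta.battlevel)
FROM tagalerts ta
JOIN node n ON n.id = ta.nodeid
WHERE ta.ack = 0
GROUP BY ta.assetid
ORDER BY ta.assetid
""").fetchall()

# Two joins: first find the newest ts per asset, then join back to fetch
# the complete alert row and its node.
latest = conn.execute("""
SELECT ta.id, ta.ts, ta.assetid, n.zonename, ta.battlevel
FROM (
    SELECT assetid, MAX(ts) AS max_ts
    FROM tagalerts
    WHERE ack = 0
    GROUP BY assetid
) newest
JOIN tagalerts ta ON ta.assetid = newest.assetid AND ta.ts = newest.max_ts
JOIN node n ON n.id = ta.nodeid
WHERE ta.ack = 0
ORDER BY ta.ts DESC
""").fetchall()

print(mixed)
print(latest)
```

For asset 1569 the first query pairs the newest timestamp with 'Warehouse' and 90, a combination that exists in no row; the second returns the actual newest alert. If two alerts for one asset share the same newest ts, the join-back returns both.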

Knowage autogenerated query understanding

I'm using Knowage for data analysis and I'm facing performance issues, so I'm watching the 'dataset audit' log to see what queries the system performs. I found this one, which looks like nonsense to me:
SELECT COUNT(*)
FROM
(select TOP(100) PERCENT "ATC_1" AS "ATC_1"
from
(SELECT [ID_AFo]
,[ATC]
,[ATC_1]
,[ATC_3]
,[ATC_4]
,[ATC_5]
FROM [AFO]
) T order by "ATC_1" ASC
) u
The inner T query is the dataset definition query I entered, which is basically a select * from [AFO] on my table; the outer wrappers are generated by Knowage (I never wrote them).
Wouldn't a select count(*) from T have performed the same calculation while avoiding an expensive order by?
EDIT:
The backend (data source) is MSSQL and the cache server is MySQL, so frequent queries run on MySQL.

This query is equivalent to:
SELECT COUNT(*)
FROM [AFO];
The only reason that I can think of for constructing such a query is if the "100" could be set to another value. I'm not sure if SQL Server's optimizer is good enough to eliminate the ORDER BY in the subquery.

SQL `group by` vs. `order by` Performance

tl;dr - lots of accepted Stack Overflow answers suggest using a subquery to affect the row returned by a GROUP BY clause. While this works, is it the best advice?
I understand there are many questions already about how to retrieve a specific row in a GROUP BY statement. Most of them revolve around using a subquery in the FROM clause. The subquery will order the table appropriately and the group by will be run against the now-ordered temporary table. Some examples,
MySQL order by before group by
MySQL "Group By" and "Order By"
PostgreSQL removes the need for the subquery with the distinct on() clause.
Postgresql DISTINCT ON with different ORDER BY
However, what I'm not understanding in any of these cases is how badly I'm shooting myself in the foot trying to do something the system may not have originally been designed for. Take the following two examples in PostgreSQL and MySQL,
http://sqlfiddle.com/#!15/3b0f2/1
http://sqlfiddle.com/#!2/6d337/1
In both cases I have a table of posts that contains multiple versions of the same post (identified by its UUID). I want to select the most recently published version of each post, ordered by its created_at field.
My biggest concern is that given the MySQL approach a temporary table is necessary. Ratchet this up to "web scale" (lolz) and I'm wondering if I'm in for a world of hurt. Should I rethink my schema or are there ways to optimize the subquery-parentquery relationship enough that it'll be alright?

It is definitely not the best advice. SQL itself (and the MySQL documentation as far as I can tell) has little to say about the results from a subquery with an order by. Although they may be ordered in practice, they are not guaranteed to be.
The more important issue is the use of "hidden columns" in the aggregation. Consider this basic query:
select t.*
from (select t.* from table t order by datecol) t
group by t.col;
Everything except t.col in the select comes from an indeterminate row. The specific documentation is (emphasis is mine):
MySQL extends the use of GROUP BY so that the select list can refer to
nonaggregated columns not named in the GROUP BY clause. This means
that the preceding query is legal in MySQL. You can use this feature
to get better performance by avoiding unnecessary column sorting and
grouping. However, this is useful primarily when all values in each
nonaggregated column not named in the GROUP BY are the same for each
group. The server is free to choose any value from each group, so
unless they are the same, the values chosen are indeterminate.
Furthermore, the selection of values from each group cannot be
influenced by adding an ORDER BY clause. Sorting of the result set
occurs after values have been chosen, and ORDER BY does not affect
which values within each group the server chooses.
A safe way to write such a query is:
select t.*
from table t
where not exists (select 1
from table t2
where t2.col = t.col and t2.datecol < t.datecol
);
This is not exactly the same, because it will return multiple rows if the minimum is not unique. The logic is "get me all rows in the table where there are no rows with the same col value and a smaller datecol value".
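A small runnable sketch of the NOT EXISTS pattern, using SQLite from Python. The `posts` table, its columns and its rows are invented for illustration (modelled loosely on the question's description). The first query keeps the smallest date per group as in the query above; flipping the comparison keeps the most recent instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table and data, for illustration only.
CREATE TABLE posts (id INTEGER PRIMARY KEY, uuid TEXT, created_at TEXT);
INSERT INTO posts VALUES
    (1, 'a', '2014-01-01'),
    (2, 'a', '2014-01-05'),
    (3, 'b', '2014-01-02'),
    (4, 'b', '2014-01-02'),
    (5, 'b', '2014-01-09');
""")

# Keep rows for which no row of the same uuid has an earlier created_at.
earliest = conn.execute("""
SELECT p.id, p.uuid, p.created_at
FROM posts p
WHERE NOT EXISTS (
    SELECT 1 FROM posts p2
    WHERE p2.uuid = p.uuid AND p2.created_at < p.created_at
)
ORDER BY p.id
""").fetchall()

# Same shape with the comparison flipped: the most recent row per uuid.
latest = conn.execute("""
SELECT p.id, p.uuid, p.created_at
FROM posts p
WHERE NOT EXISTS (
    SELECT 1 FROM posts p2
    WHERE p2.uuid = p.uuid AND p2.created_at > p.created_at
)
ORDER BY p.id
""").fetchall()

print(earliest)
print(latest)
```

Rows 3 and 4 tie on the earliest date for uuid 'b', so both come back, which is the non-uniqueness caveat in practice.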
EDIT:
The question in your comment doesn't make sense, because nothing is discussing two queries. In MySQL you can use order by with variables to solve this:
select t.*
from (select t.*,
             @rn := if(@col = col, @rn := @rn + 1, 1) as rn,
             @col := col
      from table t cross join
           (select @col := '', @rn := 0) vars
      order by col, datecol) t
where rn = 1;
This method should be faster than the order by with group by.

Getting different results from group by and distinct

this is my first post here since most of the time I already found a suitable solution :)
However this time nothing seems to help properly.
I'm trying to migrate information from a MySQL database I have only read-only access to.
My problem is similar to this one: Group by doesn't give me the newest group
I also need to get the latest information out of some tables but my tables have >300k entries therefore checking whether the "time-attribute-value" is the same as in the subquery (like suggested in the first answer) would be too slow (once I did "... WHERE EXISTS ..." and the server hung up).
In addition to that, I can hardly ever find the important information (e.g. time) in a single attribute, and there is never a single primary key. Until now I did it as suggested in the second answer, by joining with a subquery that contains the latest "time-attribute-entry" and some primary keys, but that gets me into a huge mess after using multiple joins and unions with the results.
Therefore I would prefer using the having statement like here: Select entry with maximum value of column after grouping
But when I tried it out and looked for a good candidate for the "time-attribute", I noticed that these queries give me two different results (more = 39721, less = 37870):
SELECT COUNT(MATNR) AS MORE
FROM (
    SELECT DISTINCT
        LAB_MTKNR AS MATNR,
        LAB_STG AS FACH,
        LAB_STGNR AS STUDIENGANG
    FROM
        FKT_LAB
) AS TEMP1;
SELECT COUNT(MATNR) AS LESS
FROM (
    SELECT
        LAB_MTKNR AS MATNR,
        LAB_STG AS FACH,
        LAB_STGNR AS STUDIENGANG,
        LAB_PDATUM
    FROM
        FKT_LAB
    GROUP BY
        LAB_MTKNR,
        LAB_STG,
        LAB_STGNR
    HAVING LAB_PDATUM = MAX(LAB_PDATUM)
) AS TEMP2;
Although both are applied to the same table and use "GROUP BY" / "SELECT DISTINCT" on the same entries.
Any ideas?
If nothing helps and I have to go back to my mess, I will use string variables as placeholders to tidy it up, but then I lose the overview of how many subqueries, joins and unions I have in one query... how many temporary tables will the server be able to cope with?

Your second query is not doing what you expect it to be doing. This is the query:
SELECT COUNT(MATNR) AS LESS
FROM (SELECT LAB_MTKNR AS MATNR, LAB_STG AS FACH, LAB_STGNR AS STUDIENGANG, LAB_PDATUM
FROM FKT_LAB
GROUP BY LAB_MTKNR, LAB_STG, LAB_STGNR
HAVING LAB_PDATUM = MAX(LAB_PDATUM)
) TEMP2;
The problem is the having clause. You are mixing an unaggregated column (LAB_PDATUM) with an aggregated value (MAX(LAB_PDATUM)). What MySQL does is choose an arbitrary value for the column and compare it to the max.
Often, the arbitrary value will not be the maximum value, so the rows get filtered. The reference you give (although an accepted answer) is incorrect. I have put a comment there.
If you want the most recent value, here is a relatively easy way:
SELECT COUNT(MATNR) AS LESS
FROM (SELECT LAB_MTKNR AS MATNR, LAB_STG AS FACH, LAB_STGNR AS STUDIENGANG,
max(LAB_PDATUM) as maxLAB_PDATUM
FROM FKT_LAB
GROUP BY LAB_MTKNR, LAB_STG, LAB_STGNR
) TEMP2;
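The aggregate does not, however, affect the outer count: with the HAVING filter gone, every group is counted, so the result matches the SELECT DISTINCT count.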
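A quick way to convince yourself of this is to compare the two counts on a toy table. The sketch below uses SQLite from Python with a handful of invented FKT_LAB rows. It only exercises the corrected MAX() form, since the behaviour of the original HAVING query depends on how a given engine picks the unaggregated value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented rows; the real table has > 300k entries.
CREATE TABLE FKT_LAB (
    LAB_MTKNR INTEGER, LAB_STG TEXT, LAB_STGNR INTEGER, LAB_PDATUM TEXT
);
INSERT INTO FKT_LAB VALUES
    (1, 'INF', 1, '2010-01-01'),
    (1, 'INF', 1, '2011-06-01'),
    (2, 'MAT', 1, '2010-03-01'),
    (2, 'MAT', 2, '2012-03-01');
""")

# Number of distinct (MATNR, FACH, STUDIENGANG) combinations.
distinct_count = conn.execute("""
SELECT COUNT(MATNR)
FROM (SELECT DISTINCT LAB_MTKNR AS MATNR, LAB_STG AS FACH,
             LAB_STGNR AS STUDIENGANG
      FROM FKT_LAB) AS TEMP1
""").fetchone()[0]

# One row per group, carrying the most recent LAB_PDATUM of that group.
grouped = conn.execute("""
SELECT LAB_MTKNR AS MATNR, LAB_STG AS FACH, LAB_STGNR AS STUDIENGANG,
       MAX(LAB_PDATUM) AS maxLAB_PDATUM
FROM FKT_LAB
GROUP BY LAB_MTKNR, LAB_STG, LAB_STGNR
ORDER BY LAB_MTKNR, LAB_STG, LAB_STGNR
""").fetchall()

print(distinct_count, len(grouped))
print(grouped)
```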

Why does a MySQL query take anywhere from 1 millisecond to 7 seconds?

I have an SQL query (see below) that returns exactly what I need, but when run through phpMyAdmin it takes anywhere from 0.0009 seconds to 0.1149 seconds and occasionally all the way up to 7.4983 seconds.
Query:
SELECT
e.id,
e.title,
e.special_flag,
CASE WHEN a.date >= '2013-03-29' THEN a.date ELSE '9999-99-99' END as date,
CASE WHEN a.date >= '2013-03-29' THEN a.time ELSE '99-99-99' END as time,
cat.lastname
FROM e_table as e
LEFT JOIN a_table as a ON (a.e_id=e.id)
LEFT JOIN c_table as c ON (e.c_id=c.id)
LEFT JOIN cat_table as cat ON (cat.id=e.cat_id)
LEFT JOIN m_table as m ON (cat.name=m.name AND cat.lastname=m.lastname)
JOIN (
SELECT DISTINCT innere.id
FROM e_table as innere
LEFT JOIN a_table as innera ON (innera.e_id=innere.id AND
innera.date >= '2013-03-29')
LEFT JOIN c_table as innerc ON (innere.c_id=innerc.id)
WHERE (
(
innera.date >= '2013-03-29' AND
innera.flag_two=1
) OR
innere.special_flag=1
) AND
innere.flag_three=1 AND
innere.flag_four=1
ORDER BY COALESCE(innera.date, '9999-99-99') ASC,
innera.time ASC,
innere.id DESC LIMIT 0, 10
) AS elist ON (e.id=elist.id)
WHERE (a.flag_two=1 OR e.special_flag) AND e.flag_three=1 AND e.flag_four=1
ORDER BY a.date ASC, a.time ASC, e.id DESC
Explain Plan:
The question is:
Which part of this query could be causing the wide range of difference in performance?

To specifically answer your question: it's not a specific part of the query that's causing the wide range of performance. That's MySQL doing what it's supposed to do - being a Relational Database Management System (RDBMS), not just a dumb SQL wrapper around comma separated files.
When you execute a query, the following things happen:
1. The query is compiled to a 'parametrized' query, eliminating all variables down to the pure structural SQL.
2. The compilation cache is checked to find whether a recent usable execution plan is found for the query.
3. The query is compiled into an execution plan if needed (this is what the 'EXPLAIN' shows).
4. For each execution plan element, the memory caches are checked whether they contain fresh and usable data, otherwise the intermediate data is assembled from master table data.
5. The final result is assembled by putting all the intermediate data together.
What you are seeing is that when the query costs 0.0009 seconds, the cache was fresh enough to supply all data together, and when it peaks at 7.5 seconds either something was changed in the queried tables, or other queries 'pushed' the in-memory cache data out, or the DBMS has other reasons to suspect it needs to recompile the query or fetch all data again. Probably some of the other variations have to do with used indexes still being cached freshly enough in memory or not.
In conclusion, the query is ridiculously slow; you're just sometimes lucky that caching makes it appear fast.
To solve this, I'd recommend looking into 2 things:
First and foremost - a query this size should not have a single line in its execution plan reading "No possible keys". Research how indexes work, make sure you realize the impact of MySQL's limitation of using a single index per joined table, and tweak your database so that each line of the plan has an entry under 'key'.
Secondly, review the query itself. DBMSs are at their fastest when all they have to do is combine raw data. Programmatic elements like CASE and COALESCE are often useful, but they do force the database to evaluate more things at runtime than just taking raw table data. Try to eliminate such expressions, or move them to the business logic as post-processing of the retrieved data.
Finally, never forget that MySQL is actually a rather stupid DBMS. It is optimized for performance in simple data fetching queries such as most websites require. As such it is much faster than SQL Server and Oracle for most generic problems. Once you start complicating things with functions, cases, huge join or matching conditions etc., the competitors are frequently much better optimized, and have better optimization in their query compilers. As such, when MySQL starts becoming slow in a specific query, consider splitting it up in 2 or more smaller queries just so it doesn't become confused, and do some postprocessing in PHP or whatever language you are calling with. I've seen many cases where this increased performance a LOT, just by not confusing MySQL, especially in cases where subqueries were involved (as in your case). Especially the fact that your subquery is a derived table, and not just a subquery, is known to complicate stuff for MySQL beyond what it can cope with.

Let's start with the fact that both your outer and inner queries work with the "e" table WITH a minimum requirement of flag_three = 1 AND flag_four = 1 (regardless of your inner query's (( x and y ) or z) condition). Also, your outer WHERE clause refers explicitly to a.flag_two with no NULL check, which forces your LEFT JOIN to behave as an (INNER) JOIN. It also appears every "e" record MUST have a category, since you are selecting "cat.lastname" with no coalesce() for the case where none is found. This makes sense, as it appears to be a "lookup" table reference. As for "m_table" and "c_table", you are not selecting or filtering on anything from them, so they can be removed completely.
Would the following query get you the same results?
select
e1.id,
e1.Title,
e1.Special_Flag,
e1.cat_id,
coalesce( a1.date, '9999-99-99' ) ADate,
coalesce( a1.time, '99-99-99' ) ATime,
cat.LastName
from
e_table e1
LEFT JOIN a_table as a1
ON e1.id = a1.e_id
AND a1.flag_two = 1
AND a1.date >= '2013-03-29'
JOIN cat_table as cat
ON e1.cat_id = cat.id
where
e1.flag_three = 1
and e1.flag_four = 1
and ( e1.special_flag = 1
OR a1.id IS NOT NULL )
order by
IF( a1.id is null, 2, 1 ),
ADate,
ATime,
e1.ID Desc
limit
0, 10
The Main WHERE clause qualifies for ONLY those that have the "three and four" flags set to 1 PLUS EITHER the ( special flag exists OR there is a valid "a" record that is on/after the given date in question).
From that, simple order by and limit.
As for getting the date and time, it appears that you only want records on/after the date to be included, otherwise ignore them (such as they are old and not applicable, you don't want to see them).
The order by, I am testing FIRST for a NULL value for the "a" ID. If so, we know they will all be forced to a date of '9999-99-99' and time of '99-99-99' and want them pushed to the bottom (hence 2), otherwise, there IS an "a" record and you want those first (hence 1). Then, sort by the date/time respectively and then the ID descending in case many within the same date/time.
Finally, to help on the indexes, I would ensure your "e" table has an index on
( id, flag_three, flag_four, special_flag ).
For the "a" table, index on
(e_id, flag_two, date)
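Written out as statements, the two suggested indexes might look like the sketch below, run here against SQLite from Python. The index names `idx_e_flags` and `idx_a_lookup` are hypothetical, and the table definitions are cut down to the columns that appear in the query. CREATE INDEX has the same basic form in MySQL, but whether the optimizer actually uses these indexes has to be checked with EXPLAIN on the real server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Reduced, assumed table definitions (only columns used by the query).
CREATE TABLE e_table (
    id INTEGER PRIMARY KEY, title TEXT, special_flag INTEGER,
    flag_three INTEGER, flag_four INTEGER, cat_id INTEGER
);
CREATE TABLE a_table (
    id INTEGER PRIMARY KEY, e_id INTEGER, flag_two INTEGER,
    date TEXT, time TEXT
);

-- Composite indexes in the column order suggested above.
CREATE INDEX idx_e_flags  ON e_table (id, flag_three, flag_four, special_flag);
CREATE INDEX idx_a_lookup ON a_table (e_id, flag_two, date);
""")

# List the indexes that now exist in the schema.
index_names = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
print(index_names)
```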
