MySQL ORDER BY from subquery lost by GROUP BY

MySQL ORDER BY from subquery lost by GROUP BY - mysql

I have a table x :
id lang externalid
1 nl 10
2 nl 11
3 fr 10
From this table I want al the rows for a certain lang and externalid, if the externalid doesn't exist for this lang, I want the row with any other lang.
The subquery sorts the table correct, but when I add the group by, the sort of the subquery is lost. This works in older mysql versions but not in 5.7.
(
SELECT
*
FROM
x
ORDER BY FIELD(lang, "fr") DESC, id
)
as y
group by externalid
I want the query to return the records with id 2 & 3. So for each distinct external id, if possible the lang = 'fr', else any other lang.
How can i solve this problem?

You are talking of given externalid and land. No need to group by externalid hence; use a mere where clause instead.
Combined with ORDER BY and LIMIT you get the record you want (i.e. the desired language if such a record exists, else another one).
select *
from mytable
where externalid = 10
order by lang = 'fr' desc
limit 1;
UPDATE: Okay, according to your comment you want to get the "best" record per externalid. In standard SQL you'd use ROW_NUMBER for this. Other DBMS have further solutions, e.g. Oracle's KEEP FIRST or Postgre's DISTINCT ON. MySQL doesn't support any of these. One way would be to emulate ROW_NUMBER with variables. Another would be to use above query as a subquery per externalid to find the best records:
select *
from mytable
where id in
(
select
(
select m.id
from mytable m
where m.externalid = e.externalid
order by m.lang = 'fr' desc
limit 1
) as best_id
from (select distinct externalid from mytable) e
);

Your subquery generates a result set (a virtual table) that's passed to your outer query.
All SQL queries, without exception, generate their results in unpredictable order unless you specify the order completely in an ORDER BY clause.
Unpredictable is like random, except worse. Random implies you'll get a different order every time you run the query. Unpredictable means you'll get the same order every time, until you don't.
MySQL ordinarily ignores ORDER BY clauses in subqueries (there are a few exceptions, mostly related to subquery LIMIT clauses). Move your ORDER BY to the top level query.
Edit. You are also misusing MySQL's notorious nonstandard extension to GROUP BY.

Related

How to improve performance getting recent records to display in list, recent top 5 most

I'm making a sample recent screen that will display a list, it displays the list, with id set as primary key.
I have done the correct query as expected but the table with big amount of data can cause slow performance issues.
This is the sample query below:
SELECT distinct H.id -- (Primary Key),
H.partnerid as PartnerId,
H.partnername AS partner, H.accountname AS accountName,
H.accountid as AccountNo,
FROM myschema.mytransactionstable H
INNER JOIN (
SELECT S.accountid, S.partnerid, S.accountname,
max(S.transdate) AS maxDate
from myschema.mytransactionstable S
group by S.accountid, S.partnerid, S.accountname
) ms ON H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname =ms.accountname
AND H.transdate = maxDate
WHERE H.accountid = ms.accountid
AND H.partnerid = ms.partnerid
AND H.accountname = ms.accountname
AND H.transdate = maxDate
GROUP BY H.partnerid,H.accountid, H.accountname
ORDER BY H.id DESC
LIMIT 5
In my case, there are values which are similar in the selected columns but differ only in their id's
Below is a link to an image without executing the query above. They are all the records that have not yet been filtered.
Sample result query click here
Since I only want to get the 5 most recent by their id but the other columns can contain similar values
accountname,accountid,partnerid.
I already got the correct query but,
I want to improve the performance of the query. Any suggestions for the improvement of query?

You can try using row_number()
select * from
(
select *,row_number() over(order by transdate desc) as rn
from myschema.mytransactionstable
)A where rn<=5

Don't repeat ON and WHERE clauses. Use ON to say how the tables (or subqueries) are "related"; use WHERE for filtering (that is, which rows to keep). Probably in your case, all the WHERE should be removed.
Please provide SHOW CREATE TABLE
This 'composite' index would probably help because of dealing with the subquery and the JOIN:
INDEX(partnerid, accountid, accountname, transdate)
That would also avoid a separate sort for the GROUP BY.
But then the ORDER BY is different, so it cannot avoid a sort.
This might avoid the sort without changing the result set ordering: ORDER BY partnerid, accountid, accountname, transdate DESC
Please provide EXPLAIN SELECT ... and EXPLAIN FORMAT=JSON SELECT ... if you have further questions.
If we cannot get an index to handle the WHERE, GROUP BY, and ORDER BY, the query will generate all the rows before seeing the LIMIT 5. If the index does work, then the outer query will stop after 5 -- potentially a big savings.

Correct format for Select in SQL Server

I have what should be a simple query for any database and which always runs in MySQL but not in SQL Server
select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
group by assetid
order by ts desc
The error is:
column tagalerts.id is invalid in the select list because it is not contained in either an aggregate function or the group by clause.
It is not a simple case of adding tagalerts.id to the group by clause because the error repeats for ts and for assetid etc, implying that all the selects need to be in a group or in aggregate functions... either of which will result in a meaningless and inaccurate result.
Splitting the select into a subquery to sort and group correctly (which again works fine with MySQL, as you would expect) makes matters worse
SELECT * from
(select
tagalerts.id,
ts,
assetid,
node.zonename,
battlevel
from tagalerts, node
where
ack=0 and
tagalerts.nodeid=node.id
order by ts desc
)T1
group by assetid
the order by clause is invalid in views, inline functions, derived tables and expressions unless TOP etc is used
the 'correct output' should be
id ts assetid zonename battlevel
1234 a datetime 1569 Reception 0
3182 another datetime 1572 Reception 0
Either I am reading SQL Server's rules entirely wrong or this is a major flaw with that database.
How can I write this to work on both systems?

In most databases you can't just include columns that aren't in the GROUP BY without using an aggregate function.
MySql is an exception to that. But MS SQL Server isn't.
So you could keep that GROUP BY with only the "assetid".
But then use the appropriate aggregate functions for all the other columns.
Also, use the JOIN syntax for heaven's pudding sake.
A SQL like select * from table1, table2 where table1.id2 = table2.id is using a syntax from the previous century.
SELECT
MAX(node.id) AS id,
MAX(ta.ts) AS ts,
ta.assetid,
MAX(node.zonename) AS zonename,
MAX(ta.battlevel) AS battlevel
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
GROUP BY ta.assetid
ORDER BY ta.ts DESC;
Another trick to use in MS SQL Server is the window function ROW_NUMBER.
But this is probably not what you need.
Example:
SELECT id, ts, assetid, zonename, battlevel
FROM
(
SELECT
node.id,
ta.ts,
ta.assetid,
node.zonename,
ta.battlevel,
ROW_NUMBER() OVER (PARTITION BY ta.assetid ORDER BY ta.ts DESC) AS rn
FROM tagalerts AS ta
JOIN node ON node.id = ta.nodeid
WHERE ta.ack = 0
) q
WHERE rn = 1
ORDER BY ts DESC;

I strongly suspect this query is WRONG even in MySql.
We're missing a lot of details (sample data, and we don't know which table all of the columns belong to), but what I do know is you're grouping by assetid, where it looks like one assetid value could have more than one ts (timestamp) value in the group. It also looks like you're counting on the order by ts desc to ensure both that you see recent timestamps in the results first and that each assetid group uses the most recent possible ts timestamp for that group.
MySql only guarantees the former, not the latter. Nothing in this query guarantees that each assetid is using the most recent timestamp available. You could be seeing the wrong timestamps, and then also using those wrong timestamps for the order by. This is the problem the Sql Server rule is there to stop. MySql violates the SQL standard to allow you to write that wrong query.
Instead, you need to look at each column and either add it to the group by (best when all of the values are known to be the same, anyway) or wrap it in an aggregrate function like MAX(), MIN(), AVG(), etc, so there is a deterministic result for which value from the group is used.
If all of the values for a column in a group are the same, then there's no problem adding it to the group by. If the values are different, you want to be precise about which one is chosen for the result set.
While I'm here, the tagalerts, node join syntax has been obsolete for more than 20 years now. It's also good practice to use an alias with every table and prefix every column with the alias. I mention these to explain why I changed it for my code sample below, though I only prefix columns where I am confident in which table the column belongs to.
This query should run on both databases:
SELECT ta.assetid, MAX(ta.id) "id", MAX(ta.ts) "ts",
MAX(n.zonename) "zonename", MAX(battlevel) "battlevel"
FROM tagalerts ta
INNER JOIN node n ON ta.nodeid = n.id
WHERE ack = 0
GROUP BY ta.assetid
ORDER BY ts DESC
There is also a concern here the results may be choosing values from different records in the joined node table. So if battlevel is part of the node table, you might see a result that matches a zonename with a battlevel that never occurs in any record in the data. In Sql Server, this is easily fixed by using APPLY to match only one node record to each tagalert. MySql doesn't support this (APPLY or an equivalent has been in every other major database since at least 2012), but you can simulate with it in this case with two JOINs, where the first join is a subquery that uses GROUP BY to determine values will uniquely identify the needed node record, and second join is to the node table to actually produce that record. Unfortunately, we need to know more about the tables in question to actually write this code for you.

MySQL select AVG, ORDER BY, GROUP BY & LIMIT

The bellow statement does not work but i cant seem to figure out why
select AVG(delay_in_seconds) from A_TABLE ORDER by created_at DESC GROUP BY row_type limit 1000;
I want to get the avg's of the most recent 1000 rows for each row_type. created_at is of type DATETIME and row_type is of type VARCHAR

If you only want the 1000 most recent rows, regardless of row_type, and then get the average of delay_in_seconds for each row_type, that's a fairly straightforward query. For example:
SELECT t.row_type
, AVG(t.delay_in_seconds)
FROM (
SELECT r.row_type
, r.delay_in_seconds
FROM A_table r
ORDER BY r.created_at DESC
LIMIT 1000
) t
GROUP BY t.row_type
I suspect, however, that this query does not satisfy the requirements that were specified. (I know it doesn't satisfy what I understood as the specification.)
If what we want is the average of the most recent 1000 rows for each row_type, that would also be fairly straightforward... if we were using a database that supported analytic functions.
Unfortunately, MySQL doesn't provide support for analytic functions. But it is possible to emulate one in MySQL, but the syntax is a bit involved, and it is dependent on behavior that is not guaranteed.
As an example:
SELECT s.row_type
, AVG(s.delay_in_seconds)
FROM (
SELECT #row_ := IF(#prev_row_type = t.row_type, #row_ + 1, 1) AS row_
, #prev_row_type := t.row_type AS row_type
, t.delay_in_seconds
FROM A_table t
CROSS
JOIN (SELECT #prev_row_type := NULL, #row_ := NULL) i
ORDER BY t.row_type DESC, t.created_at DESC
) s
WHERE s.row_ <= 1000
GROUP
BY s.row_type
NOTES:
The inline view query is going to be expensive for large sets. What that's effectively doing is assigning a row number to each row. The "order by" is sorting the rows in descending sequence by created_at, what we want is for the most recent row to be assigned a value of 1, the next most recent 2, etc. This numbering of rows will be repeated for each distinct value of row_type.
For performance, we'd want a suitable index with leading columns (row_type,created_at,delay_seconds) to avoid an expensive "Using filesort" operation. We need at least those first two columns for that, including the delay_seconds makes it a covering index (the query can be satisfied entirely from the index.)
The outer query then runs against the resultset returned from the view query (a "derived table"). The predicate in the WHERE filters out all rows that were assigned a row number greater than 1000, the rest is a straighforward GROUP BY and and AVG aggregate.
A LIMIT clause is entirely unnecessary. It may be possible to incorporate some additional predicates for some additional performance enhancement... like, what if we specified the most recent 1000 rows, but only that were create_at within the past 30 or 90 days?
(I'm not entirely sure this answers the question that OP was asking. What this answers is: Is there a query that can return the specified resultset, making use of AVG aggregate and GROUP BY, ORDER BY and LIMIT clauses.)
N.B. This query is dependent on a behavior of MySQL user-defined variables which is not guaranteed.
The query above shows one approach, but there is also another approach. It's possible to use a "join" operation (of A_table with A_table) to get a row number assigned (getting a COUNT of the number of rows that are "more recent" than each row. With large sets, however, that can produce a humongous intermediate result, if we aren't careful to limit it.

Write the ORDER BY at the last of the statement.
SELECT AVG(delay_in_seconds) from A_TABLE GROUP BY row_type ORDER by created_at DESC limit 1000;
read mysql dev site for details.

Should ORDER BY change the result set?

I was under the impression that using an ORDER BY in an SQL query would not affect which records were selected for the result set. I thought that ORDER BY would only affect the presentation of the result set.
Recently, however, I was getting unexpected results from a query until I used an ORDER BY clause. This suggests that either a) ORDER BY can affect which records are included in the result set, or b) I have some other bug which I need to work on.
Which is it?
Here's the query: SELECT node_id FROM users ORDER BY node_id LIMIT 100
(node_id is both a primary key and foreign key).
As you can see, the query includes a LIMIT clause. It seems that if I use the ORDER BY, the records are ordered before the top 100 are selected. I had expected it to select 100 records based on natural order, then order them according to node_id.
I've looked for info on ORDER BY but as yet, the only info I can find suggests that it affects presentation only... I am using MySQL.

ORDER BY reflects the order of all of the records before the LIMIT Clause. To get the result you want you will need this:
select u.node_id
from users u
join
(
SELECT node_id
FROM users
LIMIT 100
) us ON u.node_id = us.node_id
ORDER BY u.node_id
This way you will use the limit clause first and get the top 100 records and then you will sort the result of that. The join clause is faster than a double Select statement especially if you are working with many records.

You can use a nested query:
SELECT node_id FROM
(
SELECT node_id FROM users LIMIT 100
) u
ORDER BY node_id

Order a query with two keys SQL Server 2008

I am trying to order a query by two keys. The query is built with several subqueries. The table contains, beside columns with other data, two columns, Key and Key_Father. So I need to order the results since SQL to print the results in a report. This is an example:
Key Key_Father
4 NULL
1 4
2 4
7 NULL
1 7
2 7
As you can see is a structure father-son, where a row is a father if the Key_Father is NULL and the Key column start from one for each son with a different father.
The first subquery gives the data in order, because is stored on that order in the table, but the second subquery that uses a group by, no. So I tried adding a extra column with Row_Number on the first subquery to keep that order, but the second subquery does the same thing.
This is the query:
SELECT Orden,INV_Key,Key_Padre,INV.INV_ID,INV.BOD_Bodega_ID,
CASE WHEN MAX(HIS_Ventas) > 0 OR max(HIS_Disponible) > 0 THEN 1 ELSE 0 END AS Participacion,MAX(ISNULL(HIS_Ventas,0)) AS Ventas
FROM(SELECT ROW_NUMBER() OVER (ORDER BY C.INV_Compra_ID) Orden,C.BOD_Bodega_ID,INV_Key,Key_Padre,CD.INV_ID
FROM dbo.INV_COMPRAS_USADOS C
INNER JOIN dbo.INV_COMPRAS_USADOS_DET CD ON C.INV_Compra_ID = CD.INV_Compra_ID
WHERE C.INV_Compra_ID = #Compra_ID
AND ((Key_Padre IS NULL AND CD.INV_Catalogo_Codigo = ISNULL(#Cod_Catalogo,CD.INV_Catalogo_Codigo)
AND INV_Key IN (SELECT DISTINCT Key_Padre
FROM dbo.INV_COMPRAS_USADOS_DET
WHERE INV_Compra_ID = #Compra_ID AND Key_Padre IS NOT NULL))
OR Key_Padre IN (SELECT DISTINCT INV_Key
FROM dbo.INV_COMPRAS_USADOS_DET
WHERE INV_Compra_ID = #Compra_ID AND (Key_Padre IS NULL AND CD.INV_Catalogo_Codigo = ISNULL(#Cod_Catalogo,CD.INV_Catalogo_Codigo))))) INV
LEFT JOIN DBO.HIS_HISTORICO_DETALLE HD ON INV.INV_ID = HD.INV_ID AND HD.BOD_Bodega_ID = INV.BOD_Bodega_ID
LEFT JOIN DBO.HIS_HISTORICO_INVENTARIO H on H.HIS_Historico_ID= HD.HIS_Historico_ID AND (CONVERT(datetime,(convert(varchar(20),HIS_Historico_Ano) + '/' + convert(varchar(20),HIS_Historico_Mes) + '/01')) BETWEEN #FechaDesde AND #FechaHasta)
WHERE H.HIS_Historico_Mes IS NOT NULL OR INV.INV_ID IS NULL
GROUP BY Orden,INV_Key,Key_Padre,INV.INV_ID,INV.BOD_Bodega_ID,HIS_Historico_Ano,HIS_Historico_Mes
Another interesting thing (well for me) is that when I change the #Variables for Constant values, the second query keeps the correct order, even when the constant values are the same that the #variables. This is just a portion of the total query, is a subquery that needs of another two selects, and I need to keep the order from those selects too.
So I hope that someone could help me with this. Thanks!

To order the results you need to place an ORDER BY clause on the outermost SELECT statement. Using ORDER BY in a nested SELECT is generally not permitted but even if you work around it (e.g. by using TOP), you can't rely on the results being ordered in any particular way.
Without an ORDER BY the results may appear to be coming out in the order you want but this cannot be relied upon. Running the same query on a different server or at some point in the future may produce a different order where differences in statistics, server load, etc can affect how the query optimizer actually executes the statement.
The portion of the query you've provided is outputting the following columns. Which are the ones you want to order by?
Orden (although this is just an alias for INV_Compra_ID as far as orderin is concerned)
INV_Key
Key_Padre
INV_ID
BOD_Bodega_ID
Participacion
Ventas
Let's say you want to order by just thre of them, then you need to append the following clause to the outermost SELECT:
ORDER BY
Orden,
INV_Key,
Key_Padre,

This should do it. I'm not sure if I'm missing an obvious simplification though.
ORDER BY ISNULL(Key_Father,[Key]), ISNULL(Key_Father,-1),[Key]

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008