MySQL seems unable to optimise a SELECT that joins against a GROUP BY subquery, and the result is a long execution time. There must be a known optimisation for such a common scenario.
Let's assume that we're trying to return all orders from the database, with a flag indicating if it is the first order for the customer.
CREATE TABLE orders (`order` int, customer int, date date);
Retrieving the first orders by customer is superfast.
SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer;
However, it becomes very slow once we join this with the full order set using a subquery:
SELECT `order`, first_order FROM orders LEFT JOIN (
  SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer
) AS first_orders ON orders.`order` = first_orders.first_order;
I hope there is a simple trick that we're missing, because otherwise it would be about 1000x faster to do
CREATE TEMPORARY TABLE tmp_first_order AS
  SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer;
CREATE INDEX tmp_boost ON tmp_first_order (first_order);
SELECT `order`, first_order FROM orders LEFT JOIN tmp_first_order
  ON orders.`order` = tmp_first_order.first_order;
EDIT:
Inspired by option 3 proposed by @ruakh, there is indeed a less ugly workaround using INNER JOIN and UNION. It performs acceptably and does not require temporary tables. However, it is somewhat specific to our case, and I am wondering whether a more generic optimisation exists.
SELECT `order`, 'YES' AS first FROM orders INNER JOIN (
  SELECT MIN(`order`) AS first_order FROM orders GROUP BY customer
) AS first_orders_1 ON orders.`order` = first_orders_1.first_order
UNION
SELECT `order`, 'NO' AS first FROM orders INNER JOIN (
  SELECT customer, MIN(`order`) AS first_order FROM orders GROUP BY customer
) AS first_orders_2 ON first_orders_2.customer = orders.customer
AND orders.`order` > first_orders_2.first_order;
Here are a few things you might try:
Removing customer from the subquery's field-list, since it's not doing anything anyway:
SELECT `order`,
       first_order
  FROM orders
  LEFT
  JOIN ( SELECT MIN(`order`) AS first_order
           FROM orders
          GROUP
             BY customer
       ) AS first_orders
    ON orders.`order` = first_orders.first_order
;
Conversely, adding customer to the ON clause, so it actually does something for you:
SELECT `order`,
       first_order
  FROM orders
  LEFT
  JOIN ( SELECT customer,
                MIN(`order`) AS first_order
           FROM orders
          GROUP
             BY customer
       ) AS first_orders
    ON orders.customer = first_orders.customer
   AND orders.`order` = first_orders.first_order
;
Same as previous, but using an INNER JOIN instead of a LEFT JOIN, and converting your original ON clause into a CASE expression:
SELECT `order`,
       CASE WHEN first_order = `order` THEN first_order END AS first_order
  FROM orders
 INNER
  JOIN ( SELECT customer,
                MIN(`order`) AS first_order
           FROM orders
          GROUP
             BY customer
       ) AS first_orders
    ON orders.customer = first_orders.customer
;
Replacing the whole JOIN approach with an uncorrelated IN-subquery in a CASE expression:
SELECT `order`,
       CASE WHEN `order` IN
                  ( SELECT MIN(`order`)
                      FROM orders
                     GROUP
                        BY customer
                  )
            THEN `order`
        END AS first_order
  FROM orders
;
Replacing the whole JOIN approach with a correlated EXISTS-subquery in a CASE expression:
SELECT `order`,
       CASE WHEN NOT EXISTS
                  ( SELECT 1
                      FROM orders AS o2
                     WHERE o2.customer = o1.customer
                       AND o2.`order` < o1.`order`
                  )
            THEN `order`
        END AS first_order
  FROM orders AS o1
;
(It's very likely that some of the above will actually perform worse, but I think they're all worth trying.)
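As a quick sanity check that these variants agree on the result, here is a minimal sketch that runs two of them on a tiny data set. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the column name order_id is an assumption chosen to avoid quoting the reserved word order. It says nothing about MySQL performance, which still needs to be measured with EXPLAIN on the real table.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; order_id replaces the reserved word `order`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3), (7, 4)],
)

# LEFT JOIN variant with customer added to the ON clause.
left_join = conn.execute("""
    SELECT o.order_id, f.first_order
    FROM orders AS o
    LEFT JOIN (SELECT customer, MIN(order_id) AS first_order
               FROM orders GROUP BY customer) AS f
      ON o.customer = f.customer AND o.order_id = f.first_order
    ORDER BY o.order_id
""").fetchall()

# Correlated NOT EXISTS variant inside a CASE expression.
not_exists = conn.execute("""
    SELECT o1.order_id,
           CASE WHEN NOT EXISTS (SELECT 1 FROM orders AS o2
                                 WHERE o2.customer = o1.customer
                                   AND o2.order_id < o1.order_id)
                THEN o1.order_id END AS first_order
    FROM orders AS o1
    ORDER BY o1.order_id
""").fetchall()

print(left_join)
print(left_join == not_exists)
```

Both queries return the order id for each customer's first order and NULL (None in Python) for every other row.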
I would expect this to be faster when using a variable instead of the LEFT JOIN:
SELECT
  `order`,
  IF(@previous_customer <> (@previous_customer := `customer`),
     `order`,
     NULL
  ) AS first_order
FROM orders
JOIN ( SELECT @previous_customer := -1 ) x
ORDER BY customer, `order`;
That's what my example on SQL Fiddle returns:
CUSTOMER ORDER FIRST_ORDER
1 1 1
1 2 (null)
1 3 (null)
2 4 4
2 5 (null)
3 6 6
4 7 7
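To make the logic of the variable trick explicit, here is a small sketch of the same idea in plain Python: walk the rows sorted by customer and order, remember the previous customer, and emit the order number only when the customer changes. The function name flag_first_orders is made up for illustration. Note that the SQL version depends on MySQL evaluating the comparison before the assignment within one expression, which as far as I know the MySQL manual does not guarantee, so treat the SQL as something to test on your server version.

```python
def flag_first_orders(rows):
    """rows: iterable of (customer, order) tuples; returns (customer, order, first_order)."""
    previous_customer = -1  # same sentinel as the SQL version
    result = []
    # Sorting tuples orders by customer first, then by order number.
    for customer, order in sorted(rows):
        first_order = order if customer != previous_customer else None
        previous_customer = customer
        result.append((customer, order, first_order))
    return result


sample = [(1, 1), (1, 2), (1, 3), (2, 4), (2, 5), (3, 6), (4, 7)]
for row in flag_first_orders(sample):
    print(row)
```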
Related
I have two tables, a customers and orders table.
The customers table contains a unique ID for each customer. It contains 1141 entries.
The orders table contains many entries with a customerID and a date.
I am trying to query my database and return a list of customers and the max(date) from the orders list.
SELECT *
FROM customers
INNER JOIN
(
SELECT CustomerID, max(date) as date
FROM orders
GROUP BY CustomerID
) Sub1
ON customers.id = Sub1.CustomerID
INNER JOIN orders
ON orders.CustomerID = Sub1.CustomerID
AND orders.date = Sub1.Date
However this query is returning 1726 rows instead of 1141 rows. Where is this getting extra from?
I think it's because the ORDERS table contains the same customerID multiple times, so when you join it with CUSTOMERS, each CUSTOMERS.id matches multiple rows of ORDERS.
The problem is that there are ties.
Some customers place more than one order per day, so occasionally a customer will have placed more than one order on the date that is their max date.
To fix this, you need to use MAX() on some column that is always unique in the Orders table (or at least unique within a given date). This is easy if you can depend on an auto-increment primary key in the Orders table:
SELECT *
FROM customers
INNER JOIN
(
SELECT CustomerID, max(orderid) as orderid
FROM orders
GROUP BY CustomerID
) Sub1
ON customers.id = Sub1.CustomerID
INNER JOIN orders
ON orders.CustomerID = Sub1.CustomerID
AND orders.orderid = Sub1.orderid
This assumes that orderid increases in lock-step with increasing dates. That is, you'll never have an order with a greater auto-inc id but an earlier date. That might happen if you allow data to be entered out of chronological order, e.g. back-dating orders.
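The effect of ties can be reproduced with a small sketch. It uses Python's sqlite3 module purely as a convenient stand-in, and the column name order_date and the sample rows are invented for illustration. Customer 1 has two orders on their latest date, so joining on the max date returns an extra row, while joining on the max orderid does not.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; sample data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, CustomerID INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1), (2);
    INSERT INTO orders VALUES
        (1, 1, '2020-01-01'),
        (2, 1, '2020-01-05'),
        (3, 1, '2020-01-05'),  -- tie: second order on customer 1's latest date
        (4, 2, '2020-01-03');
""")

# Join on the latest date: ties produce one row per tied order.
by_date = conn.execute("""
    SELECT c.id, o.orderid
    FROM customers AS c
    JOIN (SELECT CustomerID, MAX(order_date) AS order_date
          FROM orders GROUP BY CustomerID) AS s ON c.id = s.CustomerID
    JOIN orders AS o ON o.CustomerID = s.CustomerID AND o.order_date = s.order_date
    ORDER BY o.orderid
""").fetchall()

# Join on the highest orderid: exactly one row per customer.
by_id = conn.execute("""
    SELECT c.id, o.orderid
    FROM customers AS c
    JOIN (SELECT CustomerID, MAX(orderid) AS orderid
          FROM orders GROUP BY CustomerID) AS s ON c.id = s.CustomerID
    JOIN orders AS o ON o.CustomerID = s.CustomerID AND o.orderid = s.orderid
    ORDER BY o.orderid
""").fetchall()

print(len(by_date), len(by_id))  # 3 rows versus 2 rows
```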
;with cte as
(
select CustomerID, orderdate
, rn = row_number() over (partition by customerID order by orderdate desc)
from orders
)
select c.*, cte.orderdate
from customer c
join cte on cte.customerID = c.customerid
where rn =1 -- This will limit to latest orderdate
I'm trying to write a query that finds each time the same person occurs in my table between a specific date range. It then groups this person and totals their spending for a specific range. If their spending habits are greater than X amount, then return each and every row for this person between date range specified. Not just the grouped total amount. This is what I have so far:
SELECT member_id,
SUM(amount) AS total
FROM `sold_items`
GROUP BY member_id
HAVING total > 50
This is retrieving the correct total and returning members spending over $50, but not each and every row. Just the total for each member and their grand total. I'm currently querying the whole table, I didn't add in the date ranges yet.
JOIN this subquery with the original table:
SELECT si1.*
FROM sold_items AS si1
JOIN (SELECT member_id
FROM sold_items
GROUP BY member_id
HAVING SUM(amount) > 50) AS si2
ON si1.member_id = si2.member_id
The general rule is that the subquery groups by the same column(s) that it's selecting, and then you join that with the original query using the same columns.
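Here is a minimal sketch of that rule on sample data, run through Python's sqlite3 module as a stand-in for MySQL. The rows are invented: members 1 and 3 spend more than 50 in total, member 2 does not.

```python
import sqlite3

# Assumption: SQLite stands in for MySQL; the sample amounts are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sold_items (member_id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO sold_items VALUES (?, ?)",
    [(1, 30), (1, 25), (2, 10), (2, 15), (2, 10), (3, 60)],
)

# The subquery groups by member_id; the outer query joins back on member_id
# to recover every individual row of the qualifying members.
rows = conn.execute("""
    SELECT si1.member_id, si1.amount
    FROM sold_items AS si1
    JOIN (SELECT member_id
          FROM sold_items
          GROUP BY member_id
          HAVING SUM(amount) > 50) AS si2
      ON si1.member_id = si2.member_id
    ORDER BY si1.member_id, si1.amount
""").fetchall()

print(rows)
```

A date-range filter would presumably go into a WHERE clause in both the subquery and the outer query, so that the total and the returned rows cover the same period.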
SELECT member_id, amount
FROM sold_items si
INNER JOIN (SELECT member_id,
SUM(amount) AS total
FROM `sold_items`
GROUP BY member_id
HAVING total > 50) spenders USING (member_id)
The query you have already built can be used as a temporary (derived) table to join with. If member_id is not indexed on the table, this will become slow at scale.
The word spenders is a table alias, you can use any valid alias in its stead.
There are a few syntaxes that will get the result you are looking for. Here is one using an inner join (written as a comma join with a WHERE condition), which ensures that every row returned has a member_id in the list produced by the GROUP BY, and repeats the total on each row that member has:
SELECT si.*, gb.total from sold_items as si, (SELECT member_id as mid,
SUM(amount) AS total
FROM `sold_items`
GROUP BY member_id
HAVING total > 50) as gb where gb.mid=si.member_id;
I think that this might help:
SELECT
member_id,
SUM(amount) AS amount_value,
'TOTAL' as amount_type
FROM
`sold_items`
GROUP BY
member_id
HAVING
SUM(amount) > 50
UNION ALL
SELECT
`sold_items`.member_id,
amount AS amount_value,
'DETAILED' as amount_type
FROM
`sold_items`
INNER JOIN
(
SELECT
A.member_id,
SUM(amount) AS total
FROM
`sold_items` A
GROUP BY
member_id
HAVING
total <= 50
) AS A
ON `sold_items`.member_id = A.member_id
Results of the above query should be like the following:
member_id amount_value amount_type
==========================================
1 55 TOTAL
2 10 DETAILED
2 15 DETAILED
2 10 DETAILED
so the column amount_type would distinguish the two specific member groups
You could do subquery with EXISTS as an alternative:
select *
from sold_items t1
where exists (
select * from sold_items t2
where t1.member_id=t2.member_id
group by member_id
having sum(amount)>50
)
ref: http://dev.mysql.com/doc/refman/5.7/en/exists-and-not-exists-subqueries.html
In case you need to group by multiple columns, you can build a composite identifier with CONCAT and combine it with a GROUP BY subquery:
select id, key, language, group
from translation
-- query all key-language entries by composite identifier...
where concat(key, '_', language) in (
    -- by lookup of all key-language combinations...
    select concat(key, '_', language)
    from translation
    group by key, language
    -- that occur more than once
    having count(*) > 1
)
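One caveat with a concatenated identifier is that different column pairs can produce the same string when the separator also occurs in the data. The sketch below illustrates this with made-up rows and compares it with joining on both columns separately. It uses Python's sqlite3 module as a stand-in, with the hypothetical column names msg_key and lang (to avoid reserved words) and SQLite's || operator in place of CONCAT.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; names and rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE translation (id INTEGER PRIMARY KEY, msg_key TEXT, lang TEXT);
    INSERT INTO translation VALUES
        (1, 'a_b', 'c'),   -- duplicate pair
        (2, 'a_b', 'c'),   -- duplicate pair
        (3, 'a', 'b_c'),   -- unique pair, but concatenates to the same 'a_b_c'
        (4, 'x', 'en');
""")

# Composite-identifier approach: row 3 is matched by accident.
concat_ids = [r[0] for r in conn.execute("""
    SELECT id FROM translation
    WHERE msg_key || '_' || lang IN (
        SELECT msg_key || '_' || lang
        FROM translation
        GROUP BY msg_key, lang
        HAVING COUNT(*) > 1)
    ORDER BY id
""")]

# Joining the grouped subquery on both columns compares them separately.
join_ids = [r[0] for r in conn.execute("""
    SELECT t.id
    FROM translation AS t
    JOIN (SELECT msg_key, lang
          FROM translation
          GROUP BY msg_key, lang
          HAVING COUNT(*) > 1) AS d
      ON t.msg_key = d.msg_key AND t.lang = d.lang
    ORDER BY t.id
""")]

print(concat_ids, join_ids)
```

If the separator can never appear in either column, the two approaches should return the same rows.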
I have 2 DB tables that both share an Order Number column.
One table is "orders" and the Order Number is the unique key.
The second table is my "transactions" table, which has one row per transaction made for each order number. Because we take monthly payments, the "transactions" table has multiple rows with unique dates but many repeats of each Order Number.
How can I run a query which has a list of unique OrderNumbers in one column, and the latest "TransDate" (Transaction Date) in the second column?
I tried the below but it's pulling back the first TransDate that exists for each ordernumber, not the latest one. I think I need a subquery of some sort:
select orders.ordernumber, transdate from orders
join transactions on transactions.ordernumber = orders.ordernumber
where status = 'booking'
group by ordernumber
order by orders.ordernumber, TransDate DESC
You should just use the MAX() function along with grouping on order number. There also doesn't seem to be any reason to do a join here.
SELECT
ordernumber,
MAX(transdate) AS maxtransdate
FROM transactions
WHERE status = 'booking'
GROUP BY ordernumber
ORDER BY ordernumber ASC
Use aggregate functions, specifically max():
select o.ordernumber, max(transdate) as last_transdate
from orders as o
inner join transactions as t on o.ordernumber = t.ordernumber
-- where conditions go here
group by o.ordernumber
If you need to pull the details of the last transaction for each order, you can use the above query as a data source of another query and join it with the transactions table:
select a.*, t.*
from (
select o.ordernumber, max(transdate) as last_transdate
from orders as o
inner join transactions as t on o.ordernumber = t.ordernumber
-- where conditions go here
group by o.ordernumber
) as a
inner join transactions as t on a.ordernumber = t.ordernumber and a.last_transdate = t.transdate
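A small runnable sketch of this two-step pattern follows, using Python's sqlite3 module as a stand-in for the real database. The amount column and the sample rows are assumptions added only so there is a detail to pull back.

```python
import sqlite3

# Assumption: SQLite stands in for the real database; `amount` and the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (ordernumber INTEGER PRIMARY KEY);
    CREATE TABLE transactions (ordernumber INTEGER, transdate TEXT, amount INTEGER);
    INSERT INTO orders VALUES (100), (200);
    INSERT INTO transactions VALUES
        (100, '2020-01-01', 10),
        (100, '2020-02-01', 20),
        (200, '2020-01-15', 30);
""")

# Inner query: latest transdate per order. Outer join: fetch that transaction's details.
latest = conn.execute("""
    SELECT a.ordernumber, t.transdate, t.amount
    FROM (SELECT o.ordernumber, MAX(t.transdate) AS last_transdate
          FROM orders AS o
          JOIN transactions AS t ON o.ordernumber = t.ordernumber
          GROUP BY o.ordernumber) AS a
    JOIN transactions AS t
      ON a.ordernumber = t.ordernumber AND a.last_transdate = t.transdate
    ORDER BY a.ordernumber
""").fetchall()

print(latest)
```

If two transactions for the same order can share a transdate, this would return both of them, so a unique tie-breaker column may be needed.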
Change the order by line to
order by Transdate DESC, orders.ordernumber
Here's the full query with the change
select orders.ordernumber, transdate from orders
join transactions on transactions.ordernumber = orders.ordernumber
where status = 'booking'
group by ordernumber
order by Transdate DESC, orders.ordernumber
Is there a difference between these two queries? Like performance issues, etc?
Query 1:
select i.invoice_id,
i.total_price
from ( select invoice_id,
sum(price) as total_price
from orders
group by
invoice_id
) as i
inner join invoice
ON i.invoice_id = invoice.invoice_id
Query 2:
select invoice.invoice_id,
orders.total_price
from invoice
inner join ( select invoice_id,
sum(price) as total_price
from orders
group by
invoice_id
) orders
ON orders.invoice_id = invoice.invoice_id
Thanks!
Let me rewrite your queries without any significant changes:
Query 1
SELECT i.invoice_id,
i.total_price
FROM invoice INNER JOIN (
SELECT invoice_id,
sum(price) AS total_price
FROM orders
GROUP BY
invoice_id
) AS i
ON i.invoice_id = invoice.invoice_id;
Query 2:
SELECT invoice.invoice_id,
i.total_price
FROM invoice INNER JOIN (
SELECT invoice_id,
sum(price) AS total_price
FROM orders
GROUP BY
invoice_id
) AS i
ON i.invoice_id = invoice.invoice_id;
Things I changed:
the order of the JOIN (which doesn't matter, since it is INNER)
the table alias (orders to i; I really don't understand why you wanted to name it differently)
Now it is obvious that the only difference between them is the first column in the main SELECT. Your question could have made sense if there were an index on one column and not on the other and, depending on the query, you did not always use both orders.invoice_id and invoice.invoice_id. But since you are already retrieving both columns for the INNER JOIN, it doesn't.
Furthermore, these queries are redundant. As already mentioned by @valex, your query (actually, both of them) could (and must) be simplified to this:
SELECT invoice_id,
sum(price) AS total_price
FROM orders
GROUP BY
invoice_id;
So, no, there is no difference in performance. And, surely, there is no difference in the result set.
Also, I'd like you to know that you can always use EXPLAIN for performance questions.
Your first query
select i.invoice_id,
i.total_price
from ( select invoice_id,
sum(price) as total_price
from orders
group by
invoice_id
) as i
inner join invoice
ON i.invoice_id = invoice.invoice_id
is equivalent by its result to:
select invoice_id,
sum(price) as total_price
from orders
group by invoice_id
To get the same result (if every invoice_id in orders exists in the invoice table), you don't need to JOIN the invoice table; just use this query:
select invoice_id,
sum(price) as total_price
from orders
group by invoice_id
I have this situation. I have a table Orders that is related to OrderStatus.
OrderStatus
id | orderId | created
I need to retrieve each order with its latest status. I tried this query, but I don't know whether it is performant. Are there better solutions?
select Orders.id, OrderStatus.status from Orders
inner join OrderStatus on OrderStatus.id =
(select top 1 id from OrderStatus where orderId = Orders.id order by created desc)
Correlated subquery is usually bad news (sometimes SQL Server can optimize it away, sometimes it acts like a really slow loop). Also not sure why you think you need DISTINCT when you're only taking the latest status, unless you don't have any primary keys...
;WITH x AS
(
SELECT o.id, os.status,
rn = ROW_NUMBER() OVER (PARTITION BY os.orderId ORDER BY created DESC)
FROM dbo.Orders AS o
INNER JOIN dbo.OrderStatus AS os
ON o.id = os.orderId
)
SELECT id, status
FROM x
WHERE rn = 1;
You can use the Row_Number function:
WITH CTE AS
(
SELECT Orders.id, OrderStatus.status,
RN = ROW_NUMBER() OVER (
PARTITION BY OrderStatus.OrderId
ORDER BY created DESC)
FROM Orders
INNER JOIN OrderStatus ON OrderStatus.OrderId = Orders.id
)
SELECT id, status
FROM CTE WHERE RN = 1
I've used a common table expression since it lets you filter on the row number directly and is also very readable.
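For anyone who wants to try the ROW_NUMBER approach without a SQL Server instance, here is a minimal sketch using Python's sqlite3 module. It assumes an SQLite build with window-function support (3.25 or newer), writes the alias as AS rn because the T-SQL form RN = ... is not portable, and uses invented status values.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for SQL Server; the status rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (id INTEGER PRIMARY KEY);
    CREATE TABLE OrderStatus (id INTEGER PRIMARY KEY, orderId INTEGER,
                              status TEXT, created TEXT);
    INSERT INTO Orders VALUES (1), (2);
    INSERT INTO OrderStatus VALUES
        (1, 1, 'new',     '2020-01-01'),
        (2, 1, 'shipped', '2020-01-02'),
        (3, 2, 'new',     '2020-01-03');
""")

# Number each order's statuses from newest to oldest, then keep only number 1.
latest_status = conn.execute("""
    WITH ranked AS (
        SELECT o.id, os.status,
               ROW_NUMBER() OVER (PARTITION BY os.orderId
                                  ORDER BY os.created DESC) AS rn
        FROM Orders AS o
        JOIN OrderStatus AS os ON o.id = os.orderId
    )
    SELECT id, status FROM ranked WHERE rn = 1 ORDER BY id
""").fetchall()

print(latest_status)
```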