MySQL Left Join Taking a While

I am trying to get a list of possible customers along with the sum of their order history (ltv).
Without the ORDER BY, this query loads in under a second. With the ORDER BY, the query takes over 90 seconds.
SELECT a.customerid,a.firstname,a.lastname,Orders.ltv
FROM customers a
LEFT JOIN (
SELECT customerid,
SUM(amount) as ltv
FROM orders
GROUP BY customerid) Orders
ON Orders.customerid=a.customerid
ORDER BY
Orders.ltv DESC
LIMIT 0,10
Any ideas how this could be sped up?
EDIT: I guess I cleaned up the query a little too much. The query is actually a little more complicated than this version. Other data is selected from the customers table, and can be sorted against as well.

Without the actual schema it is a bit hard to know how data is related but I guess this query should be equivalent and more performant:
SELECT a.customerid, coalesce(sum(o.amount), 0) TotalLtv FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
The coalesce will make sure you return 0 for the customers without orders.
As @ypercube pointed out, an index on amount alone won't help either. You could try a composite index instead:
ALTER TABLE orders ADD INDEX (customerid, amount)
After your question update
If you need to add more fields that functionally depend on the a.customerid in the select clause you can use the non-standard MySQL group by clause. This will result in better performance than grouping by a.customerid, a.firstname, a.lastname:
SELECT a.customerid, a.firstname, a.lastname, coalesce(sum(o.amount), 0) TotalLtv
FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
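A minimal sketch of the query above, run through Python's sqlite3 module so it is self-contained. SQLite stands in for MySQL here, and the table contents are invented for illustration; it only shows that customers without orders come back with a total of 0 and that the sort is on the aggregated alias.

```python
import sqlite3

def top_customers_by_ltv(limit=10):
    # Assumption: SQLite stands in for MySQL; schema and rows are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE customers (customerid INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT);
        CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
        INSERT INTO orders (customerid, amount) VALUES (1, 10.0), (1, 15.0), (2, 40.0);
    """)
    rows = con.execute("""
        SELECT a.customerid, a.firstname, a.lastname,
               COALESCE(SUM(o.amount), 0) AS TotalLtv
        FROM customers a
        LEFT JOIN orders o ON a.customerid = o.customerid
        GROUP BY a.customerid          -- names depend on the customerid key
        ORDER BY TotalLtv DESC
        LIMIT ?
    """, (limit,)).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    for row in top_customers_by_ltv():
        print(row)
```

Customer 3 has no orders, so SUM yields NULL and COALESCE turns it into 0.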

A few things here. First, it doesn't appear that you need to join the customers table at all, since you are only using it for the customerid, which already exists in the orders table. If you have more than 10 customer IDs with corresponding amounts, you will never even need to see the customer IDs without amounts that the LEFT JOIN from customers would give you. As such, you should be able to reduce your query to this:
SELECT customerid, SUM(amount) AS ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10
You would need an index on customerid. Unfortunately, the sort is on a calculated field, so there is not a lot you can do to speed this up from that point.
I see the updated question. Since you do need additional fields from customers, I will revise my answer to include the customer table
SELECT c.customerid, c.firstname, c.lastname, coalesce(o.ltv, 0) AS total
FROM customers AS c
INNER JOIN (
SELECT customerid, SUM(amount) as ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10) AS o
ON c.customerid = o.customerid
ORDER BY total DESC
Note that I am joining on a sub-selected table, as you were doing in your original query. However, I have performed the sort and limit on the sub-selected table, so you don't have to sort all the customer records that have no entries in the orders table.

Two things. First, don't use an inner query. MySQL does allow ORDER BY on a projection alias. Second, you should get a considerable improvement by having a B-TREE index on the composite key (customerid, amount). Then the engine will be able to execute this query by a simple traversal of the index, without fetching any row data.
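As a rough illustration of the composite-index idea, the sketch below creates an index on (customerid, amount) and asks the engine for its plan. It uses Python's sqlite3 with SQLite standing in for MySQL, and the index name and rows are invented; in MySQL the comparable check would be running EXPLAIN on the query and looking for "Using index" in the Extra column.

```python
import sqlite3

def ltv_with_composite_index():
    # Assumption: SQLite stands in for MySQL; index name and rows are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER, amount REAL);
        INSERT INTO orders (customerid, amount) VALUES (1, 10.0), (2, 40.0), (1, 15.0);
        CREATE INDEX idx_orders_customer_amount ON orders (customerid, amount);
    """)
    query = """
        SELECT customerid, SUM(amount) AS ltv
        FROM orders
        GROUP BY customerid
        ORDER BY ltv DESC
        LIMIT 10
    """
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail text.
    plan = [row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + query)]
    rows = con.execute(query).fetchall()
    con.close()
    return plan, rows

if __name__ == "__main__":
    plan, rows = ltv_with_composite_index()
    print("\n".join(plan))
    print(rows)
```

Both columns the query needs are in the index, which is what lets an engine answer it from the index alone. The sort on the computed ltv still has to happen after aggregation.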

Related

Optimising MySql Query with LEFT JOINS

I am trying to get a list of customers who haven't ordered for 6 months or more. I have 4 tables which I have used in the query:
accounts (account_id)
stores (store_id, account_id)
customers (store_id, customer_id)
orders (order_id, customer_id, store_id)
The customers and orders tables are very big, 3M and 26M rows respectively, so using left joins in my query makes the query time extremely long. I believe I have indexed my tables correctly.
Here is the query I have used:
SELECT cus.customer_id, MAX(o.order_date), cus.store_id, s.account_id, store_name
FROM customers cus
LEFT JOIN stores s ON s.store_id=cus.store_id
LEFT JOIN orders o ON o.customer_id=cus.customer_id AND o.store_id=cus.store_id
WHERE account_id=26 AND
(SELECT order_id
FROM orders o
WHERE o.customer_id=cus.customer_id
AND o.store_id=cus.store_id
AND o.order_date < CURRENT_DATE() - INTERVAL 6 MONTH
ORDER BY order_id DESC LIMIT 0,1) IS NOT NULL
GROUP BY cus.customer_id, cus.client_id;
I need to get the last order date, which is why I have joined the orders table. However, since customers can have multiple orders, it returns multiple rows per customer, and that is why I have used the GROUP BY clause.
Can anyone assist me with my query?
Start with this:
SELECT customer_id, MAX(order_date) AS last_order_date
FROM orders
GROUP BY customer_id
HAVING last_order_date < NOW() - INTERVAL 6 MONTH;
Assuming that gives you the relevant customer_ids, then move on to
SELECT ...
FROM ( that-select-as-a-subquery ) AS old
JOIN other-tables-as-needed USING(customer_id)
If necessary, JOIN back to orders to get more info. Do not try to get other columns in that subquery. (That's a "groupwise max" problem.)
Your strategy of using an ordered and limited subquery on your orders table is probably responsible for your poor performance.
This subquery will generate a virtual table showing the date of the most recent order for each distinct customer. (I guess a distinct customer is distinguished by the pair customer_id, store_id).
SELECT MAX(order_date) recent_order_date,
customer_id, store_id
FROM orders
GROUP BY customer_id, store_id
Then, you can use that subquery as if it were a table in your query.
SELECT cus.customer_id, summary.recent_order_date,
cus.store_id, s.account_id, store_name
FROM customers cus
JOIN stores s ON s.store_id=cus.store_id
JOIN (
SELECT MAX(order_date) recent_order_date,
customer_id, store_id
FROM orders
GROUP BY customer_id, store_id
) summary ON summary.customer_id = cus.customer_id
AND summary.store_id = s.store_id
WHERE summary.recent_order_date < CURRENT_DATE - INTERVAL 6 MONTH
AND s.account_id = 26
This approach moves the GROUP BY to an inner query, and eliminates the wasteful ORDER BY ... LIMIT query pattern. The inner query doesn't have to be remade for every row in the outer query.
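Here is a small sketch of that derived-table approach, run through Python's sqlite3 so it can be tried without a MySQL server. SQLite stands in for MySQL, the rows are invented, and a fixed cutoff date is passed as a parameter (an assumption made to keep the result reproducible) in place of CURRENT_DATE - INTERVAL 6 MONTH.

```python
import sqlite3

def inactive_customers(cutoff="2024-01-01", account_id=26):
    # Assumption: SQLite stands in for MySQL; rows and the fixed cutoff are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE stores (store_id INTEGER PRIMARY KEY, account_id INTEGER, store_name TEXT);
        CREATE TABLE customers (customer_id INTEGER, store_id INTEGER);
        CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                             store_id INTEGER, order_date TEXT);
        INSERT INTO stores VALUES (1, 26, 'Main St'), (2, 99, 'Other');
        INSERT INTO customers VALUES (10, 1), (11, 1), (12, 2);
        INSERT INTO orders (customer_id, store_id, order_date) VALUES
            (10, 1, '2023-03-01'), (10, 1, '2023-06-15'),
            (11, 1, '2023-05-01'), (11, 1, '2024-02-20'),
            (12, 2, '2022-01-01');
    """)
    rows = con.execute("""
        SELECT cus.customer_id, summary.recent_order_date,
               cus.store_id, s.account_id, s.store_name
        FROM customers cus
        JOIN stores s ON s.store_id = cus.store_id
        JOIN (
            -- one row per (customer, store) with the latest order date
            SELECT MAX(order_date) AS recent_order_date, customer_id, store_id
            FROM orders
            GROUP BY customer_id, store_id
        ) summary ON summary.customer_id = cus.customer_id
                 AND summary.store_id = s.store_id
        WHERE summary.recent_order_date < ?
          AND s.account_id = ?
    """, (cutoff, account_id)).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    print(inactive_customers())
```

Customer 11 ordered after the cutoff and customer 12 belongs to another account, so only customer 10 is returned.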
I don't understand why you used LEFT JOIN operations in your query.
And, by the way, most people, when they're new to SQL, don't have great intuition about which indexes are useful and which aren't. So, when asking for help, it's always good to show your indexes. In the meantime, read this:
http://use-the-index-luke.com/

SQL Join, include rows from table a with no match in table b

SELECT orders.* FROM orders JOIN order_rows
ON orders.id = order_rows.order_id
WHERE order_rows.quant <> order_rows.quant_fulfilled
GROUP BY orders.id
ORDER BY orders.id DESC
I need this to include rows that have no corresponding order_row entries (which would be an order that has no items in it yet). It seems like there must be a way to do this by adding to the ON or WHERE clause?
There will only be a couple empty orders at a given time so I would use a separate query if the best answer to this is going to significantly decrease performance. But I was hoping to include them in this query so they are sorted by orders.id along with the rest. Just don't want to double query time just to include the 1-3 orders that have no items.
I am using MySQL. Thanks in advance for any advice.
Simply use LEFT JOIN instead of JOIN. You'll obtain all rows of orders.
SELECT orders.* FROM orders LEFT JOIN order_rows
ON orders.id = order_rows.order_id
WHERE order_rows.quant IS NULL OR order_rows.quant <> order_rows.quant_fulfilled
GROUP BY orders.id
ORDER BY orders.id DESC
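A quick way to check the LEFT JOIN version is to run it on a tiny data set. The sketch below does that with Python's sqlite3 (SQLite standing in for MySQL, rows invented). It tests order_rows.order_id IS NULL rather than quant IS NULL, a small variation that assumes quant itself might be nullable.

```python
import sqlite3

def open_orders():
    # Assumption: SQLite stands in for MySQL; the note column and rows are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE orders (id INTEGER PRIMARY KEY, note TEXT);
        CREATE TABLE order_rows (order_id INTEGER, quant INTEGER, quant_fulfilled INTEGER);
        INSERT INTO orders VALUES (1, 'done'), (2, 'partial'), (3, 'empty');
        INSERT INTO order_rows VALUES (1, 5, 5), (2, 4, 1), (2, 2, 2);
    """)
    rows = con.execute("""
        SELECT orders.*
        FROM orders
        LEFT JOIN order_rows ON orders.id = order_rows.order_id
        WHERE order_rows.order_id IS NULL                        -- order has no rows yet
           OR order_rows.quant <> order_rows.quant_fulfilled     -- order has unfulfilled rows
        GROUP BY orders.id
        ORDER BY orders.id DESC
    """).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    print(open_orders())
```

Order 3 has no items and order 2 has one unfulfilled row, so both appear; the fully fulfilled order 1 does not.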

Summing order totals Mysql

I'm fairly new to MySQL and need help combining two MySQL queries that give a "total" for each "storeid" from the order totals. I'm currently using two queries to get the result:
SELECT storeid, storenum, name FROM store ORDER BY storeid DESC
SELECT SUM((1+0.07125)*qty*discprice) as total FROM items WHERE orderid IN (SELECT orderid FROM orders WHERE store = '".$row['storeid']."' AND date >= '2012-01-01' AND date < '2013-01-01')
I'm running a while loop and running the second query with the "storeid". However, I know I can do this in one query, grouping by "storeid" and creating a total for all stores combined. But I can't figure it out.
Thanks!
SELECT s.storeid, s.storenum, s.name,
SUM((1+0.07125)*i.qty*i.discprice) AS total
FROM items AS i
LEFT JOIN orders AS o
ON i.orderid=o.orderid
LEFT JOIN store AS s
ON o.store=s.storeid
WHERE o.date >= '2012-01-01'
AND o.date < '2013-01-01'
GROUP BY s.storeid, s.storenum, s.name;
The trick is to join the three tables and then use an aggregate function on the items table.
SELECT store.storeid, store.storenum, store.name, SUM((1+0.07125)*items.qty*items.discprice) as total
FROM store
LEFT JOIN orders ON orders.store=store.storeid AND orders.date>='2012-01-01' AND orders.date<'2013-01-01'
LEFT JOIN items ON items.orderid=orders.orderid
GROUP BY store.storeid, store.storenum, store.name
What this does is this:
It will select every store from the store table, and sum up the orders in that store. I chose LEFT JOINs instead of straight JOINs, so that stores without any order in that time span will still show up with a total of NULL.
P.S. I don't have a copy of your database's schema, above SQL query might not actually work as expected - it is just supposed to point you in the right direction.
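To see the effect of putting the date range in the ON clause, here is a small sketch using Python's sqlite3 (SQLite standing in for MySQL). The table and column names follow the question's two queries (store, orders.store, items.orderid), and the rows are invented.

```python
import sqlite3

def store_totals():
    # Assumption: SQLite stands in for MySQL; names follow the question, rows are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE store (storeid INTEGER PRIMARY KEY, storenum TEXT, name TEXT);
        CREATE TABLE orders (orderid INTEGER PRIMARY KEY, store INTEGER, date TEXT);
        CREATE TABLE items (orderid INTEGER, qty INTEGER, discprice REAL);
        INSERT INTO store VALUES (1, 'S1', 'North'), (2, 'S2', 'South');
        INSERT INTO orders VALUES (100, 1, '2012-03-01'), (101, 1, '2013-02-01');
        INSERT INTO items VALUES (100, 2, 10.0), (100, 1, 5.0), (101, 3, 10.0);
    """)
    rows = con.execute("""
        SELECT store.storeid, store.storenum, store.name,
               SUM((1 + 0.07125) * items.qty * items.discprice) AS total
        FROM store
        LEFT JOIN orders ON orders.store = store.storeid
                        AND orders.date >= '2012-01-01'   -- range filter stays in ON
                        AND orders.date <  '2013-01-01'   -- so unmatched stores survive
        LEFT JOIN items ON items.orderid = orders.orderid
        GROUP BY store.storeid, store.storenum, store.name
        ORDER BY store.storeid
    """).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    for row in store_totals():
        print(row)
```

Store 2 has no orders in 2012 and still appears, with a NULL total. Moving the date conditions into a WHERE clause would drop it, because the NULL dates from the outer join fail the comparison.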

MySQL is not using INDEX in subquery

I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, it uses non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not depend on the grouping items). This can give indeterminate (semi-random) results.
Second, to avoid the indeterminate results, you have added an ORDER BY inside a subquery. Non-standard or not, nothing in the MySQL documentation says this should work as expected. So it may be working now, but it may stop working in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.
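The tie-breaking logic above can be tried on a few rows. The sketch below runs the overall query through Python's sqlite3 (SQLite standing in for MySQL). The rows are invented, and the inner aliases are renamed to y and v2 purely for readability.

```python
import sqlite3

def latest_visit_per_person():
    # Assumption: SQLite stands in for MySQL; rows are made up for illustration.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE visits (id INTEGER PRIMARY KEY, id_people INTEGER, year INTEGER, note TEXT);
        INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO visits VALUES
            (1, 1, 2011, 'a-old'),
            (2, 1, 2012, 'a-first-2012'),
            (3, 1, 2012, 'a-second-2012'),   -- tie on year, higher id wins
            (4, 2, 2013, 'b-new'),
            (5, 2, 2010, 'b-late-entry');    -- highest id, but not the latest year
    """)
    rows = con.execute("""
        SELECT p.id, p.name, v.year, v.note
        FROM (
            SELECT v2.id_people, MAX(v2.id) AS id      -- step 2: break ties by id
            FROM (
                SELECT id_people, MAX(year) AS year    -- step 1: latest year per person
                FROM visits
                GROUP BY id_people
            ) y
            JOIN visits v2 ON (y.id_people = v2.id_people AND y.year = v2.year)
            GROUP BY v2.id_people
        ) m
        JOIN people p ON (m.id_people = p.id)
        JOIN visits v ON (m.id = v.id)
        ORDER BY p.id
    """).fetchall()
    con.close()
    return rows

if __name__ == "__main__":
    for row in latest_visit_per_person():
        print(row)
```

Bob's row shows why MAX(id) alone is not enough when ids are not in time order: visit 5 has the highest id but belongs to 2010, so the year is resolved first.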

Mysql Groupby and Orderby problem

Here is my data structure
(image: http://luvboy.co.cc/images/db.JPG)
When I try this SQL:
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
group by dc_number
order by rec_id desc;
something is wrong somewhere, and I don't know what.
I need:
rec_id  customer_id  dc_number  balance
2       IHS050018    DC3        -1
3       IHS050018    52         600
I want the most recent balance of the customer for each dc_number.
Thanx
There are essentially two ways to get this
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id='IHS050018' and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
Also if you want to get the last balance for each customer you might do
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id=p.customer_id and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
What I consider essentially another way is utilizing the fact that select rec_id with order by desc and limit 1 is equivalent to select max(rec_id) with appropriate group by, in full:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id IN (
select max(s.rec_id)
from payments s
group by s.customer_id, s.dc_number
);
This should be faster (if you want the last balance for every customer), since max is normally less expensive than sort (with indexes it might be the same).
Also when written like this the subquery is not correlated (it need not be run for every row of the outer query) which means it will be run only once and the whole query can be rewritten as a join.
Also notice that it might be beneficial to write it as correlated query (by adding where s.customer_id = p.customer_id and s.dc_number = p.dc_number in inner query) depending on the selectivity of the outer query.
This might improve performance, if you look for the last balance of only one or few rows.
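A small sketch comparing the two formulations, run through Python's sqlite3 with SQLite standing in for MySQL. The payments rows are invented (only the two expected result rows come from the question), and the correlated form compares with = against a LIMIT 1 scalar subquery, since MySQL rejects LIMIT inside an IN subquery.

```python
import sqlite3

def latest_balances(customer_id="IHS050018"):
    # Assumption: SQLite stands in for MySQL; rows beyond the question's two are made up.
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE payments (rec_id INTEGER PRIMARY KEY, customer_id TEXT,
                               dc_number TEXT, balance INTEGER);
        INSERT INTO payments VALUES
            (1, 'IHS050018', 'DC3', 20), (2, 'IHS050018', 'DC3', -1),
            (3, 'IHS050018', '52', 600), (4, 'OTHER', 'DC3', 7);
    """)
    # Uncorrelated: the subquery runs once and yields the latest rec_id per group.
    with_max = con.execute("""
        SELECT p.rec_id, p.customer_id, p.dc_number, p.balance
        FROM payments p
        WHERE p.customer_id = ?
          AND p.rec_id IN (SELECT MAX(s.rec_id)
                           FROM payments s
                           GROUP BY s.customer_id, s.dc_number)
        ORDER BY p.rec_id
    """, (customer_id,)).fetchall()
    # Correlated: the subquery is evaluated per outer row.
    with_limit = con.execute("""
        SELECT p.rec_id, p.customer_id, p.dc_number, p.balance
        FROM payments p
        WHERE p.customer_id = ?
          AND p.rec_id = (SELECT s.rec_id
                          FROM payments s
                          WHERE s.customer_id = p.customer_id
                            AND s.dc_number = p.dc_number
                          ORDER BY s.rec_id DESC
                          LIMIT 1)
        ORDER BY p.rec_id
    """, (customer_id,)).fetchall()
    con.close()
    return with_max, with_limit

if __name__ == "__main__":
    print(latest_balances())
```

Both forms return the same rows on this data; which one is faster on a real table depends on the indexes and on how selective the outer filter is.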
I don't think there is a good way to do this in SQL without having window functions (like those in Postgres 8.4). You probably have to iterate over the dataset in your code and get the recent balances that way.
ORDER comes before GROUP:
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
order by rec_id desc
group by dc_number