Optimising MySQL Query with LEFT JOINs - mysql

I am trying to get a list of customers who haven't ordered for 6 months or more. I have 4 tables which I have used in the query:
accounts (account_id)
stores (store_id, account_id)
customers (store_id, customer_id)
orders (order_id, customer_id, store_id)
The customers and orders tables are very big, 3M and 26M rows respectively, so using left joins makes the query time extremely long. I believe I have indexed my tables correctly.
Here is the query I have used:
SELECT cus.customer_id, MAX(o.order_date), cus.store_id, s.account_id, store_name
FROM customers cus
LEFT JOIN stores s ON s.store_id=cus.store_id
LEFT JOIN orders o ON o.customer_id=cus.customer_id AND o.store_id=cus.store_id
WHERE account_id=26 AND
(SELECT order_id
FROM orders o
WHERE o.customer_id=cus.customer_id
AND o.store_id=cus.store_id
AND o.order_date < CURRENT_DATE() - INTERVAL 6 MONTH
ORDER BY order_id DESC LIMIT 0,1) IS NOT NULL
GROUP BY cus.customer_id, cus.client_id;
I need the last order date, which is why I joined the orders table. However, since a customer can have multiple orders, the join returns multiple rows per customer, which is why I used the GROUP BY clause.
I would appreciate any help with my query.

Start with this:
SELECT customer_id, MAX(order_date) AS last_order_date
FROM orders
GROUP BY customer_id
HAVING last_order_date < NOW() - INTERVAL 6 MONTH;
Assuming that gives you the relevant customer_ids, then move on to
SELECT ...
FROM ( that-select-as-a-subquery ) AS old
JOIN other-tables-as-needed USING(customer_id)
If necessary, JOIN back to orders to get more info. Do not try to get other columns in that subquery. (That's a "groupwise max" problem.)
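The two steps above can be tried on a toy data set. This is only a minimal sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, so the cutoff is written with SQLite's date() function instead of NOW() - INTERVAL 6 MONTH. The fixed reference date, the sample rows and the name column are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY,
                         customer_id INTEGER, order_date TEXT);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES
        (10, 1, '2023-01-15'),
        (11, 1, '2024-05-20'),
        (12, 2, '2023-03-01'),
        (13, 2, '2023-09-30');
""")

# Assumed "today", fixed so the result does not depend on when this runs.
REFERENCE_DATE = "2024-06-30"

# Step 1 is the inner query (one row per customer, filtered on the latest
# order date); step 2 joins that small result to the other tables.
query = """
    SELECT c.customer_id, c.name, old.last_order_date
    FROM (
        SELECT customer_id, MAX(order_date) AS last_order_date
        FROM orders
        GROUP BY customer_id
        HAVING MAX(order_date) < date(?, '-6 months')
    ) AS old
    JOIN customers AS c USING (customer_id)
    ORDER BY c.customer_id
"""
lapsed = conn.execute(query, (REFERENCE_DATE,)).fetchall()
print(lapsed)
```

Note that a customer with no orders at all (Cy in this sample) never appears in the inner query, so this form only lists customers who ordered at some point and then stopped.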

Your strategy of using an ordered and limited subquery on your orders table is probably responsible for your poor performance.
This subquery will generate a virtual table showing the date of the most recent order for each distinct customer. (I guess a distinct customer is distinguished by the pair customer_id, store_id).
SELECT MAX(order_date) recent_order_date,
customer_id, store_id
FROM orders
GROUP BY customer_id, store_id
Then, you can use that subquery as if it were a table in your query.
SELECT cus.customer_id, summary.recent_order_date,
cus.store_id, s.account_id, store_name
FROM customers cus
JOIN stores s ON s.store_id=cus.store_id
JOIN (
SELECT MAX(order_date) recent_order_date,
customer_id, store_id
FROM orders
GROUP BY customer_id, store_id
) summary ON summary.customer_id = cus.customer_id
AND summary.store_id = s.store_id
WHERE summary.recent_order_date < CURRENT_DATE - INTERVAL 6 MONTH
AND s.account_id = 26
This approach moves the GROUP BY to an inner query, and eliminates the wasteful ORDER BY ... LIMIT query pattern. The inner query doesn't have to be remade for every row in the outer query.
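To see the derived-table join in action, here is a small sketch run against an in-memory SQLite database through Python's sqlite3 module, used here only as a stand-in for MySQL. The sample rows, the fixed "today" value and the use of SQLite's date() in place of CURRENT_DATE - INTERVAL 6 MONTH are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stores (store_id INTEGER PRIMARY KEY,
                         account_id INTEGER, store_name TEXT);
    CREATE TABLE customers (customer_id INTEGER, store_id INTEGER);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         store_id INTEGER, order_date TEXT);
    INSERT INTO stores VALUES (1, 26, 'North'), (2, 99, 'South');
    INSERT INTO customers VALUES (100, 1), (101, 1), (200, 2);
    INSERT INTO orders VALUES
        (1, 100, 1, '2023-02-01'),
        (2, 100, 1, '2023-08-10'),
        (3, 101, 1, '2024-06-01'),
        (4, 200, 2, '2022-12-25');
""")

# The GROUP BY runs once inside the derived table "summary"; the outer
# query then filters on the precomputed latest order date.
query = """
    SELECT cus.customer_id, summary.recent_order_date,
           cus.store_id, s.account_id, s.store_name
    FROM customers cus
    JOIN stores s ON s.store_id = cus.store_id
    JOIN (
        SELECT MAX(order_date) AS recent_order_date, customer_id, store_id
        FROM orders
        GROUP BY customer_id, store_id
    ) summary ON summary.customer_id = cus.customer_id
             AND summary.store_id = s.store_id
    WHERE summary.recent_order_date < date(:today, '-6 months')
      AND s.account_id = :account
"""
rows = conn.execute(query, {"today": "2024-06-30", "account": 26}).fetchall()
print(rows)
```

Customer 101 is excluded because of a recent order, and customer 200 is excluded because its store belongs to a different account.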
I don't understand why you used LEFT JOIN operations in your query.
And, by the way, most people, when they're new to SQL, don't have great intuition about which indexes are useful and which aren't. So, when asking for help, it's always good to show your indexes. In the meantime, read this:
http://use-the-index-luke.com/

Related

SQL beginner practice problems

Given two tables, orders (order_id, date, $, customer_id) and customers (ID, name)
Here's my method but I'm not sure if it's working & I'd like to know if there's faster/better way of solving these problems:
1) find out number of customers who made at least one order on date 7/9/2018
Select count (distinct customer_id)
From
(
Select customer_id from orders a
Left join customer b
On a.customer_id = b.ID
Group by customer_id,date
Having date = 7/9/2018
) a
2) find out number of customers who did not make an order on 7/9/2018
Select count (customer_id) from customer where customer_id not in
(
Select customer_id from orders a
Left join customer b
On a.customer_id = b.ID
Group by customer_id,date
Having date = 7/9/2018
)
3) find the date with most sales between 7/1 and 7/30
select date, max($)
from (
Select sum($),date from orders a
Left join customer b
On a.customer_id = b.ID
Group by date
Having date between 7/1 and 7/30
)
Thanks,
For problem 1, a valid solution might look like this:
SELECT COUNT(DISTINCT customer_id) x
FROM orders
WHERE date = '2018-09-07'; -- or is that '2018-07-09' ??
For problem 2, a valid solution might look like this:
SELECT COUNT(*) x
FROM customer c
LEFT
JOIN orders o
ON o.customer_id = c.id
AND o.date = '2018-07-09'
WHERE o.order_id IS NULL;
Assuming there are no ties, a valid solution to problem 3 might look like this:
SELECT date
, COUNT(*) sales
FROM orders
WHERE date BETWEEN '2018-07-01' AND '2018-07-30'
GROUP
BY date
ORDER
BY sales DESC
LIMIT 1;
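For problem 2, the LEFT JOIN ... IS NULL pattern can be cross-checked against a NOT EXISTS form of the same question. The sketch below does that on a tiny made-up data set, using Python's sqlite3 module as a stand-in for MySQL; the sample rows are assumptions, and the column names follow the tables described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, date TEXT,
                         customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES
        (1, '2018-07-09', 1),
        (2, '2018-07-09', 1),
        (3, '2018-07-10', 2);
""")

# Anti-join: keep only customers for whom the join found no matching order.
left_join_count = conn.execute("""
    SELECT COUNT(*)
    FROM customer AS c
    LEFT JOIN orders AS o
           ON o.customer_id = c.id AND o.date = '2018-07-09'
    WHERE o.order_id IS NULL
""").fetchone()[0]

# Same question phrased with NOT EXISTS.
not_exists_count = conn.execute("""
    SELECT COUNT(*)
    FROM customer AS c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders AS o
        WHERE o.customer_id = c.id AND o.date = '2018-07-09'
    )
""").fetchone()[0]

print(left_join_count, not_exists_count)
```

Both forms count Bob and Cy; Ann is excluded even though she has two orders on that date, because the IS NULL test only keeps customers with no match at all.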
The default format for a date in MySQL is YYYY-MM-DD, although this can be customized. You have to put quotes around it, otherwise it's treated as an arithmetic expression.
And none of your queries need to join with the customer table. The customer ID is already in the orders table, and you're not returning any info about the customers (like the name or address), you're just counting them.
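The point about quoting dates is easy to check. A minimal sketch, using Python's sqlite3 module as a stand-in for MySQL (both treat the unquoted form as subtraction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Without quotes, 2018-7-9 is arithmetic: 2018 minus 7 minus 9.
unquoted, quoted = conn.execute("SELECT 2018-7-9, '2018-07-09'").fetchone()
print(unquoted)  # 2002
print(quoted)    # 2018-07-09
```

So a condition such as date = 2018-7-9 compares the column with the number 2002 rather than with a date.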
1) You don't need the subquery or grouping.
SELECT COUNT(DISTINCT customer_id)
FROM orders
WHERE date = '2018-07-09'
2) Again, you don't need GROUP BY in the subquery. There's also a better pattern than NOT IN to get the count of non-matching rows.
SELECT COUNT(*)
FROM customer AS c
LEFT JOIN orders AS o on c.id = o.customer_id AND o.date = '2018-07-09'
WHERE o.order_id IS NULL
See Return row only if value doesn't exist for various patterns to do this.
3) You can't use MAX($) in the outer query because the inner query doesn't return a column with that name. But even if you fix that, it still won't work, because the date column won't necessarily come from the same row that has the maximum. See SQL select only rows with max value on a column for more explanation of this.
You don't need a subquery at all. Use a query that returns the total sales for each day, then use ORDER BY to get the highest one.
SELECT date, SUM($) AS total_sales
FROM orders
WHERE date BETWEEN '2018-07-01' AND '2018-07-30'
GROUP BY date
ORDER BY total_sales DESC
LIMIT 1
If "most sales" is supposed to mean "most number of sales", replace SUM($) with COUNT(*).
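The difference between the two readings of "most sales" shows up on even a small sample. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; the column name amount replaces $ and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, date TEXT,
                         amount REAL, customer_id INTEGER);
    INSERT INTO orders VALUES
        (1, '2018-07-01', 500.0, 1),
        (2, '2018-07-02', 200.0, 1),
        (3, '2018-07-02', 200.0, 2),
        (4, '2018-07-02', 50.0, 3),
        (5, '2018-08-01', 900.0, 2);
""")

# "Most sales" as the highest total amount per day.
by_total = conn.execute("""
    SELECT date, SUM(amount) AS total_sales
    FROM orders
    WHERE date BETWEEN '2018-07-01' AND '2018-07-30'
    GROUP BY date
    ORDER BY total_sales DESC
    LIMIT 1
""").fetchone()

# "Most sales" as the highest number of orders per day.
by_count = conn.execute("""
    SELECT date, COUNT(*) AS sales
    FROM orders
    WHERE date BETWEEN '2018-07-01' AND '2018-07-30'
    GROUP BY date
    ORDER BY sales DESC
    LIMIT 1
""").fetchone()

print(by_total, by_count)
```

Here 1 July wins on total amount while 2 July wins on number of orders, and the August row is ignored by the date filter in both cases.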

SQL Query: How to use sub-query or AVG function to find number of days between a new entry?

I have a two tables, one called entities with these relevant columns:
id, company_id, and integration_id. The other table is transactions, with columns id, entity_id and created_at. The foreign keys linking the two tables are integration_id and entity_id.
The transactions table shows the number of transactions received from each company from the entities table.
Ultimately, I want to find the date range with the highest volume of transactions and then, within that range, find the average number of days between transactions for each company.
To find the date range I used this query.
SELECT DATE_FORMAT(t.created_at, '%Y/%m/%d'), COUNT(t.id)
FROM entities e
JOIN transactions t
ON e.id = t.entity_id
GROUP BY t.created_at;
I get this:
Date_FORMAT(t.created_at, '%Y/%m/%d') | COUNT(t.id)
+-------------------------------------+------------
2015/11/09 4
etc
From that I determine the range I want to use as 2015/11/09 to 2015/12/27
and I made this query
SELECT company_id, COUNT(t.id)
FROM entities e
INNER JOIN transactions t
ON e.integration_id = t.entity_id
WHERE t.created_at BETWEEN '2015/11/09' AND '2015/12/27'
GROUP BY company_id;
I get this:
company_id | COUNT(t.id)
+-----------+------------
1234 17
and so on
Which gives me the total transactions made by each company over this date range. What's the best way now to query for the average number of days between transactions by company? How can I sub-query or is there a way to use the AVG function on dates in a WHERE clause?
EDIT:
playing around with the query, I'm wondering if there is a way I can
SELECT company_id, (49 / COUNT(t.id))...
49, because that is the number of days in that date range, in order to get the average number of days between transactions?
I think this might be it, does that make sense?
I think this may work:
Select z.company_id,
datediff(max(y.created_at),min(created_at))/count(y.id) as avg_days_between_orders,
max(y.created_at) as latest_order,
min(created_at) as earliest_order,
count(y.id) as orders
From
(SELECT entity_id, max(t.created_at) latest, min(t.created_at) earliest
FROM entities e, transactions t
Where e.id = t.entity_id
group by entity_id
order by COUNT(t.id) desc
limit 1) x,
transactions y,
entities z
where z.id = x.entity_id
and z.integration_id = y.entity_id
and y.created_at between x.earliest and x.latest
group by company_id;
It's tough without the data. It's possible that I have referenced integration_id incorrectly in the subquery or in the join on the outer query.
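One detail worth checking on sample data is the divisor. If "average days between transactions" means the mean gap between consecutive transactions, then n transactions have n - 1 gaps, so the span is divided by COUNT - 1 rather than COUNT. The sketch below illustrates that reading using Python's sqlite3 module as a stand-in for MySQL; julianday() replaces DATEDIFF, and the sample rows and the join on integration_id are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, company_id INTEGER,
                           integration_id INTEGER);
    CREATE TABLE transactions (id INTEGER PRIMARY KEY, entity_id INTEGER,
                               created_at TEXT);
    INSERT INTO entities VALUES (1, 1234, 7);
    INSERT INTO transactions VALUES
        (1, 7, '2015-11-09'),
        (2, 7, '2015-11-19'),
        (3, 7, '2015-12-09');
""")

# Gaps here are 10 and 20 days, so the mean gap is 15 days:
# the 30-day span divided by (3 - 1) gaps.
rows = conn.execute("""
    SELECT e.company_id,
           COUNT(*) AS transactions,
           (julianday(MAX(t.created_at)) - julianday(MIN(t.created_at)))
               / (COUNT(*) - 1) AS avg_days_between
    FROM entities e
    JOIN transactions t ON e.integration_id = t.entity_id
    WHERE t.created_at BETWEEN '2015-11-09' AND '2015-12-27'
    GROUP BY e.company_id
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)
```

Dividing the same 30-day span by COUNT(*) would give 10 instead of 15. The HAVING clause skips companies with a single transaction, for which no gap exists.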

Slow aggregate query with join on same table

I have a query to show customers and the total dollar value of all their orders. The query takes about 100 seconds to execute.
I'm querying on an ExpressionEngine CMS database. ExpressionEngine uses one table exp_channel_data, for all content. Therefore, I have to join on that table for both customer and order data. I have about 14,000 customers, 30,000 orders and 160,000 total records in that table.
Can I change this query to speed it up?
SELECT link.author_id AS customer_id,
customers.field_id_122 AS company,
Sum(orders.field_id_22) AS total_orders
FROM exp_channel_data customers
JOIN exp_channel_titles link
ON link.author_id = customers.field_id_117
AND customers.channel_id = 7
JOIN exp_channel_data orders
ON orders.entry_id = link.entry_id
AND orders.channel_id = 3
GROUP BY customer_id
Thanks, and please let me know if I should include other information.
UPDATE SOLUTION
My apologies. I noticed that entry_id for the exp_channel_data table customers corresponds to author_id for the exp_channel_titles table. So I don't have to use field_id_117 in the join. field_id_117 duplicates entry_id, but in a TEXT field. JOINING on that text field slowed things down. The query is now 3 seconds
However, the inner join solution posted by @DRapp is 1.5 seconds. Here is his SQL with a minor edit:
SELECT
PQ.author_id CustomerID,
c.field_id_122 CompanyName,
PQ.totalOrders
FROM
( SELECT
t.author_id,
SUM( o.field_id_22 ) as totalOrders
FROM
exp_channel_data o
JOIN
exp_channel_titles t ON t.author_id = o.entry_id AND o.channel_id = 3
GROUP BY
t.author_id ) PQ
JOIN
exp_channel_data c ON PQ.author_id = c.entry_id AND c.channel_id = 7
ORDER BY CustomerID
If this is the same table, then the same columns are available across the board for all alias instances.
I would ensure an index on (channel_id, entry_id, field_id_117 ) if possible. Another index on (author_id) for the prequery of order totals
Then, start with what will become an inner query doing nothing but a per-customer sum of order amounts. Since the join uses "author_id" as the customer ID, just query and sum that first. I don't completely understand the (what I would consider) poor design of the structure or what "Channel_ID" really indicates, but you don't want to duplicate summation values because of these other things in the mix.
select
o.author_id,
sum( o.field_id_22 ) as totalOrders
FROM
exp_channel_data o
where
o.channel_id = 3
group by
o.author_id
If that is correct on the per customer (via author_id column), then that can be wrapped as follows
select
PQ.author_id CustomerID,
c.field_id_122 CompanyName,
PQ.totalOrders
from
( select
o.author_id,
sum( o.field_id_22 ) as totalOrders
FROM
exp_channel_data o
where
o.channel_id = 3
group by
o.author_id ) PQ
JOIN exp_channel_data c
on PQ.author_id = c.field_id_117
AND c.channel_id = 7
Can you post the results of an EXPLAIN query?
I'm guessing that your tables are not indexed well for this operation. All of the columns that you join on should probably be indexed. As a first guess I'd look at indexing exp_channel_data.field_id_117
Try something like this. You may have an error in your joins, so also check that the join columns are correct for your database. If a join condition is wrong, the result can degenerate into a cross join, which takes a long time on large data.
select
link.author_id as customer_id,
customers.field_id_122 as company,
sum(orders.field_id_22) as total_or_orders
from exp_channel_data customers
join exp_channel_titles link on (link.author_id = customers.field_id_117 and
customers.channel_id = 7)
join exp_channel_data orders on (orders.entry_id = link.entry_id and orders.channel_id = 3)
group by customer_id

MySQL Left Join Taking awhile

I am trying to get a list of possible customers along with the sum of their order history (ltv)
Without the ORDER BY, this query loads in under a second. With the ORDER BY, the query takes over 90 seconds.
SELECT a.customerid,a.firstname,a.lastname,Orders.ltv
FROM customers a
LEFT JOIN (
SELECT customerid,
SUM(amount) as ltv
FROM orders
GROUP BY customerid) Orders
ON Orders.customerid=a.customerid
ORDER BY
Orders.ltv DESC
LIMIT 0,10
Any ideas how this could be sped up?
EDIT: I guess I cleaned up the query a little too much. The query is actually a little more complicated than this version. Other data is selected from the customers table, and can be sorted against as well.
Without the actual schema it is a bit hard to know how data is related but I guess this query should be equivalent and more performant:
SELECT a.customerid, coalesce(sum(o.amount), 0) TotalLtv FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
The coalesce will make sure you return 0 for the customers without orders.
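A quick way to see the effect of the coalesce is to run the query on a few rows. This is a minimal sketch using Python's sqlite3 module as a stand-in for MySQL; the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customerid INTEGER PRIMARY KEY, firstname TEXT);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER,
                         amount INTEGER);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 30), (2, 1, 20), (3, 2, 5);
""")

# Customer 3 has no orders: SUM over no rows is NULL, and COALESCE
# turns that NULL into 0.
rows = conn.execute("""
    SELECT a.customerid, COALESCE(SUM(o.amount), 0) AS TotalLtv
    FROM customers a
    LEFT JOIN orders o ON a.customerid = o.customerid
    GROUP BY a.customerid
    ORDER BY TotalLtv DESC
    LIMIT 10
""").fetchall()
print(rows)
```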
As @ypercube pointed out, an index on amount alone won't help either. You could try this instead:
ALTER TABLE orders ADD INDEX(customerid, amount)
After your question update
If you need to add more fields that functionally depend on the a.customerid in the select clause you can use the non-standard MySQL group by clause. This will result in better performance than grouping by a.customerid, a.firstname, a.lastname:
SELECT a.customerid, a.firstname, a.lastname, coalesce(sum(o.amount), 0) TotalLtv
FROM customers a
LEFT JOIN orders o ON a.customerid = o.customerid
GROUP BY a.customerid
ORDER BY TotalLtv DESC
LIMIT 10
A few things here. First, it doesn't appear that you need to join the customers table at all, since you are only using it for the customerid, which already exists in the orders table. If you have more than 10 customer IDs with corresponding amounts, you will never even see the customer IDs that don't have amounts, which is all the LEFT JOIN from customers would add. As such, you should be able to reduce your query to this:
SELECT customerid, SUM(amount) AS ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10
You would need an index on customerid. Unfortunately, the sort is on a calculated field, so there is not a lot you can do to speed this up from that point.
I see the updated question. Since you do need additional fields from customers, I will revise my answer to include the customer table
SELECT c.customerid, c.firstname, c.lastname, coalesce(o.ltv, 0) AS total
FROM customers AS c
LEFT JOIN (
SELECT customerid, SUM(amount) as ltv
FROM orders
GROUP BY customerid
ORDER BY ltv DESC LIMIT 0,10) AS o
ON c.customerid = o.customerid
Note that I am joining on a sub-selected table as you were doing in your original query, however I have performed the sort and limit on the sub-selected table so you don't have to sort all the records without any entries on orders table.
Two things. First, don't use an inner query. MySQL does allow ORDER BY on a projection alias. Second, you should get a considerable improvement by having a B-TREE index on the composite key (customerid, amount). Then the engine will be able to execute this query by a simple traversal of the index, without fetching any row data.
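The effect of a composite index on (customerid, amount) can be inspected in a query plan. The sketch below does this with SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module, purely as a stand-in: MySQL's EXPLAIN output looks different, the wording of the plan varies between SQLite versions, and the index name and the extra note column are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, customerid INTEGER,
                         amount INTEGER, note TEXT);
    CREATE INDEX idx_orders_customer_amount ON orders (customerid, amount);
""")

# Both columns the query needs are stored in the index, so the engine
# can answer the aggregation from the index alone.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT customerid, SUM(amount) AS ltv
    FROM orders
    GROUP BY customerid
    ORDER BY ltv DESC
    LIMIT 10
""").fetchall()

# The last column of each plan row is the human-readable step description.
details = [row[3] for row in plan]
for line in details:
    print(line)
```

The sort on the computed ltv value still has to happen after the aggregation; the index only makes the grouping and summing cheaper.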

Summing order totals Mysql

I'm fairly new to MySQL and need help trying to combine two MySQL queries that give a "total" for each "storeid" from the orders total. I'm currently using two queries to get the result:
SELECT storeid, storenum, name FROM store ORDER BY storeid DESC
SELECT SUM((1+0.07125)*qty*discprice) as total FROM items WHERE orderid IN (SELECT orderid FROM orders WHERE store = '".$row['storeid']."' AND date >= '2012-01-01' AND date < '2013-01-01')
I'm running a while loop and running the second query with the "storeid". However, I know I can do this in one query, group by "storeid", and create a total for all stores combined. But I can't figure it out.
Thanks!
SELECT s.storeid, s.storenum, s.name,
SUM((1+0.07125)*i.qty*i.discprice) AS total
FROM items AS i
LEFT JOIN orders AS o
ON i.orderid=o.orderid
LEFT JOIN store AS s
ON o.store=s.storeid
WHERE o.date >= '2012-01-01'
AND o.date < '2013-01-01'
GROUP BY s.storeid, s.storenum, s.name;
The trick is to join the three tables and then use an aggregate function on the items table.
SELECT store.storeid, store.storenum, store.name, SUM((1+0.07125)*items.qty*items.discprice) as total
FROM store
LEFT JOIN orders ON orders.store=store.storeid AND orders.date>='2012-01-01' AND orders.date<'2013-01-01'
LEFT JOIN items ON items.orderid=orders.orderid
GROUP BY store.storeid, store.storenum, store.name
What this does is this:
It will select every store from the stores table, and sum up the orders in that store. I chose a LEFT JOIN instead of straight JOINs, so that stores without any order in that time span will still show up with a total of NULL.
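Where the date condition is placed matters for this LEFT JOIN behaviour. The sketch below compares the condition in the ON clause with the same condition in a WHERE clause, using Python's sqlite3 module as a stand-in for MySQL. The table and column names follow the question, the sample rows are made up, and the tax multiplier is left out so the totals stay exact integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE store (storeid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (orderid INTEGER PRIMARY KEY, store INTEGER, date TEXT);
    CREATE TABLE items (orderid INTEGER, qty INTEGER, discprice INTEGER);
    INSERT INTO store VALUES (1, 'A'), (2, 'B');
    INSERT INTO orders VALUES (1, 1, '2012-03-01'), (2, 2, '2011-06-15');
    INSERT INTO items VALUES (1, 2, 10), (2, 1, 99);
""")

# Date filter in ON: store B has no 2012 order but is still listed,
# with a NULL total.
filter_in_on = conn.execute("""
    SELECT store.storeid, store.name, SUM(items.qty * items.discprice) AS total
    FROM store
    LEFT JOIN orders ON orders.store = store.storeid
                    AND orders.date >= '2012-01-01' AND orders.date < '2013-01-01'
    LEFT JOIN items ON items.orderid = orders.orderid
    GROUP BY store.storeid, store.name
    ORDER BY store.storeid
""").fetchall()

# Same filter in WHERE: store B's only joined row is removed after the
# join, so the store disappears from the result.
filter_in_where = conn.execute("""
    SELECT store.storeid, store.name, SUM(items.qty * items.discprice) AS total
    FROM store
    LEFT JOIN orders ON orders.store = store.storeid
    LEFT JOIN items ON items.orderid = orders.orderid
    WHERE orders.date >= '2012-01-01' AND orders.date < '2013-01-01'
    GROUP BY store.storeid, store.name
    ORDER BY store.storeid
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```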
P.S. I don't have a copy of your database's schema, so the above SQL query might not actually work as expected; it is just supposed to point you in the right direction.