Optimize SQL: Customers that haven't ordered for x days - mysql

I have created this SQL in order to find customers that haven't ordered for X days.
It is returning a result set, so this post is mainly just to get a second opinion on it, and possible optimizations.
SELECT o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
(SELECT COUNT(*)
FROM orders o2
WHERE o2.user_id=o.user_id
AND o2.order_status=1) AS order_count,
(SELECT o4.order_created
FROM orders o4
WHERE o4.user_id=o.user_id
AND o4.order_status=1
ORDER BY o4.order_created DESC LIMIT 1) AS last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
WHERE DATE(o.order_created) = "2013-12-14"
AND o.order_status=1
AND o.user_id NOT IN
(SELECT o3.user_id
FROM orders o3
WHERE o3.user_id=o.user_id
AND o3.order_status=1
AND DATE(o3.order_created) > "2013-12-14")
Can you guys find any potential problems with this SQL? Dates are dynamically inserted.
The final SQL that I put in production, will basically only include o.order_id, i.identity_id and o.order_count - this order_count will need to be correct. The other selected fields and 'last_order' subquery will not be included, it's only for testing.
This should give me a list of users that have their last order on that particular day, and is a newsletter subscriber. I am particular in doubt about correctness of the NOT IN part in the WHERE clause, and the order_count subquery.

There are several problems:
A. Using functions on indexable columns
You are searching for orders by comparing DATE(order_created) with some constant. This is a terrible idea, because a) the DATE() function is executed for every row (CPU) and b) the database can't use an index on the column (assuming one existed)
B. Using WHERE ID NOT IN (...)
Using a NOT IN (...) is almost always a bad idea, because optimizers usually have trouble with this construct, and often get the plan wrong. You can almost always express it as an outer join with a WHERE condition that filters for misses using an IS NULL condition for a joined column (and adds the side benefit of not needing DISTINCT, because there's only ever one miss returned)
C. Leaving joins that filtering out of large portions of rows too late
The earlier you can mask off rows by not making joins the better. You can do this by joining less likely to match tables earlier in the joined table list, and by putting non-key conditions into join rather than the where clause to get the rows excluded as early as possible. Some optimizers to this anyway, but I've often found they don't
D. Avoid correlated subqueries like the plague!
You have several correlated subqueries - ones that are executed for every row of the main table. That's really an incredibly bad idea. Again sometimes the optimizer can craft them into a join, but why rely (hope) on that. Most correlated subqueries can be expressed as a join; you examples are no exception.
With the above in mind, there are some specific changes:
o2 and o4 are the same join, so o4 may be dispensed with entirely - just use o2 after conversion to a join
DATE(order_created) = "2013-12-14" should be written as order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
This query should be what you want:
SELECT
o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
count(o2.user_id) AS order_count,
max(o2.order_created) AS last_order
FROM orders o
LEFT JOIN orders o2 ON o2.user_id = o.user_id AND o2.order_status=1
LEFT JOIN orders o3 ON o3.user_id = o.user_id
AND o3.order_status=1
AND o3.order_created >= "2013-12-15 00:00:00"
JOIN user_identities ui ON o.user_id=ui.user_id
JOIN identities i ON ui.identity_id=i.identity_id AND i.identity_email != ''
JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
WHERE o.order_created between "2013-12-14 00:00:00" and "2013-12-14 23:59:59"
AND o.order_status=1
AND o3.order_created IS NULL -- This gets only missed joins on o3
GROUP BY
o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email;
The last line is how you achieve the same as NOT IN (...) using a LEFT JOIN
Disclaimer: Not tested.

Can't really comment on the results as you have not posted any table declares or example data, but your query has 3 correlated sub queries which is likely to make it perform poorly (OK, one of those is for last_order and is only for testing).
Eliminating the correlated sub queries and replacing them with joins would give something like this:-
SELECT o.order_id,
o.order_status,
o.order_created,
o.user_id,
i.identity_firstname,
i.identity_email,
Sub1.order_count,
Sub2.last_order
FROM orders o
INNER JOIN user_identities ui ON o.user_id=ui.user_id
INNER JOIN identities i ON ui.identity_id=i.identity_id
AND i.identity_email!=''
INNER JOIN subscribers s ON i.identity_id=s.identity_id
AND s.subscriber_status=1
AND s.subsriber_type=e
AND s.subscription_id=1
LEFT OUTER JOIN
(
SELECT user_id, COUNT(*) AS order_count
FROM orders
WHERE order_status=1
GROUP BY user_id
) Sub1
ON o.user_id = Sub1.user_id
LEFT OUTER JOIN
(
SELECT user_id, MAX(order_created) as last_order
FROM orders
WHERE order_status=1
GROUP BY user_id
) AS Sub2
ON o.user_id = Sub2.user_id
LEFT OUTER JOIN
(
SELECT DISTINCT user_id
FROM orders
WHERE order_status=1
AND DATE(order_created) > "2013-12-14"
) Sub3
ON o.user_id = Sub3.user_id
WHERE DATE(o.order_created) = "2013-12-14"
AND o.order_status=1
AND Sub3.user_id IS NULL

Related

MySQL: Optimizing Sub-queries

I have this query I need to optimize further since it requires too much cpu time and I can't seem to find any other way to write it more efficiently. Is there another way to write this without altering the tables?
SELECT category, b.fruit_name, u.name
, r.count_vote, r.text_c
FROM Fruits b, Customers u
, Categories c
, (SELECT * FROM
(SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r
WHERE b.fruit_id = r.fruit_id
AND u.customer_id = r.customer_id
AND category = "Fruits";
This is your query re-written with explicit joins:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN
(
SELECT * FROM
(
SELECT *
FROM Reviews
ORDER BY fruit_id, count_vote DESC, r_id
) a
GROUP BY fruit_id
) r on r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
CROSS JOIN Categories c
WHERE c.category = 'Fruits';
(I am guessing here that the category column belongs to the categories table.)
There are some parts that look suspicious:
Why do you cross join the Categories table, when you don't even display a column of the table?
What is ORDER BY fruit_id, count_vote DESC, r_id supposed to do? Sub query results are considered unordered sets, so an ORDER BY is superfluous and can be ignored by the DBMS. What do you want to achieve here?
SELECT * FROM [ revues ] GROUP BY fruit_id is invalid. If you group by fruit_id, what count_vote and what r.text_c do you expect to get for the ID? You don't tell the DBMS (which would be something like MAX(count_vote) and MIN(r.text_c)for instance. MySQL should through an error, but silently replacescount_vote, r.text_cbyANY_VALUE(count_vote), ANY_VALUE(r.text_c)` instead. This means you get arbitrarily picked values for a fruit.
The answer hence to your question is: Don't try to speed it up, but fix it instead. (Maybe you want to place a new request showing the query and explaining what it is supposed to do, so people can help you with that.)
Your Categories table seems not joined/related to the others this produce a catesia product between all the rows
If you want distinct resut don't use group by but distint so you can avoid an unnecessary subquery
and you dont' need an order by on a subquery
SELECT category
, b.fruit_name
, u.name
, r.count_vote
, r.text_c
FROM Fruits b
INNER JOIN Customers u ON u.customer_id = r.customer_id
INNER JOIN Categories c ON ?????? /Your Categories table seems not joined/related to the others /
INNER JOIN (
SELECT distinct fruit_id, count_vote, text_c, customer_id
FROM Reviews
) r ON b.fruit_id = r.fruit_id
WHERE category = "Fruits";
for better reading you should use explicit join syntax and avoid old join syntax based on comma separated tables name and where condition
The next time you want help optimizing a query, please include the table/index structure, an indication of the cardinality of the indexes and the EXPLAIN plan for the query.
There appears to be absolutely no reason for a single sub-query here, let alone 2. Using sub-queries mostly prevents the DBMS optimizer from doing its job. So your biggest win will come from eliminating these sub-queries.
The CROSS JOIN creates a deliberate cartesian join - its also unclear if any attributes from this table are actually required for the result, if it is there to produce multiples of the same row in the output, or just an error.
The attribute category in the last line of your query is not attributed to any of the tables (but I suspect it comes from the categories table).
Further, your code uses a GROUP BY clause with no aggregation function. This will produce non-deterministic results and is a bug. Assuming that you are not exploiting a side-effect of that, the query can be re-written as:
SELECT
category, b.fruit_name, u.name, r.count_vote, r.text_c
FROM Fruits b
JOIN Reviews r
ON r.fruit_id = b.fruit_id
JOIN Customers u ON u.customer_id = r.customer_id
ORDER BY r.fruit_id, count_vote DESC, r_id;
Since there are no predicates other than joins in your query, there is no scope for further optimization beyond ensuring there are indexes on the join predicates.
As all too frequently, the biggest benefit may come from simply asking the question of why you need to retrieve every single row in the tables in a single query.

SQL retrieving filtered value in subquery

in this cust_id is a foreign key and ords returns the number of orders for every customers
SELECT cust_name, (
SELECT COUNT(*)
FROM Orders
WHERE Orders.cust_id = Customers.cust_id
) AS ords
FROM Customers
The output is correct but i want to filter it to retrieve only the customers with less than a given amount of orders, i don't know how to filter the subquery ords, i tried WHERE ords < 2 at the end of the code but it doesn't work and i've tried adding AND COUNT(*)<2 after the cust_id comparison but it doesn't work. I am using MySQL
Use the HAVING clause (and use a join instead of a subquery).....
SELECT Customers.cust_id, Customers.cust_name, COUNT(*) ords
FROM Orders, Customers
WHERE Orders.cust_id = Customers.cust_id
GROUP BY 1,2
HAVING COUNT(*)<2
If you want to include people with zero orders you change the join to an outer join.
There is no need for a correlated subquery here, because it calculates the value for each row which doesn't give a "good" performance. A better approach would be to use a regular query with joins, group by and having clause to apply your condition to groups.
Since your condition is to return only customers that have less than 2 orders, left join instead of inner join would be appropriate. It would return customers that have no orders as well (with 0 count).
select
cust_name, count(*)
from
customers c
left join orders o on c.cust_id = o.cust_id
group by cust_name
having count(*) < 2

Using GROUP BY or DISTINCT with a LEFT JOIN

I have a table with orders and a table with users. It's possible for an order to be placed with an entry in the user table.
With the following MySQL statement I get duplicate values for orders if there is a matching user:
SELECT o.id, u.id as 'user_id', u.name
FROM orders o
LEFT JOIN users u ON o.user_id = u.id
WHERE o.status = 'active'
If I add a GROUP BY o.id it solves the issue.
SELECT o.id, u.id as 'user_id'
FROM orders o
LEFT JOIN users u ON o.user_id = u.id
WHERE o.status = 'active'
GROUP BY o.id
It also works if I use SELECT DISTINCT.
My questions are:
Why does it return duplicate fields?
Is it more correct to use GROUP BY or SELECT DISTINCT?
Your detail query -- the query returning every row, rather than the deduplicated version with DISTINCT or GROUP BY -- is finding more than row in users matching each row in orders. So, it is dutifully returning all those rows.
To solve your problem correctly you need to figure out why there are multiple users rows for each order. That is, for some values of order.user_id there are multiple values of users.id.
That seems a little strange to me, but I do not understand your data model. You probably need to get to investigate this data anomaly. A conventional schema would have each user able to place multiple orders, but each order relating to only one user. In that schema this query would yield one row per order but still include users with no orders:
SELECT u.id AS user_id, o.id AS order_id
FROM users AS u
LEFT JOIN orders AS o ON o.user_id = u.id
Could it be that is what you want?
Contrary to some peoples' belief, GROUP BY orders.id and SELECT DISTINCT orders.id, users.id are not the same thing. In fact, your proposed use of GROUP BY misuses the notorious MySQL extension to GROUP BY. Standard SQL will reject your GROUP BY. It will only accept GROUP BY orders.id, users.id, which is indeed equivalent to DISTINCT.
Why does it return duplicate fields?
It returns duplicates because you have not applied anything to stop it from doing so. When you apply GROUP BY or DISTINCT then you actually stop the duplicates.
Is it more correct to use GROUP BY or SELECT DISTINCT
Both are equivalent and can be used as per your convenience. You may find that DISTINCT is faster over GROUP BY under the fact that indexes are not created on your table. But that does not make the usage of GROUP BY incorrect. If indexes are created then they both are equivalent to each other.
Your query does not need a JOIN at all. You can just use:
SELECT o.id, o.user_id
FROM orders o
WHERE o.status = 'active';
As for SELECT DISTINCT or GROUP BY. The two should be equivalent in performance (or very close). They are doing essentially the same work.
The advantage of GROUP BY is that you can add aggregation functions. The advantage of DISTINCT is that you don't have to list all the columns twice, and it accepts *.

SQL Join, include rows from table a with no match in table b

SELECT orders.* FROM orders JOIN order_rows
ON orders.id = order_rows.order_id
WHERE order_rows.quant <> order_rows.quant_fulfilled
GROUP BY orders.id
ORDER BY orders.id DESC
I need this to include rows that have no corresponding order_row entries (which would be an order that has no items in it yet). It seems like there must be a way to do this by adding to the ON or WHERE clause?
There will only be a couple empty orders at a given time so I would use a separate query if the best answer to this is going to significantly decrease performance. But I was hoping to include them in this query so they are sorted by orders.id along with the rest. Just don't want to double query time just to include the 1-3 orders that have no items.
I am using MySQL. Thanks in advance for any advice.
Simply use LEFT JOIN instead of JOIN. You'll obtain all rows of orders.
SELECT orders.* FROM orders LEFT JOIN order_rows
ON orders.id = order_rows.order_id
WHERE order_rows.quant IS NULL OR order_rows.quant <> order_rows.quant_fulfilled
GROUP BY orders.id
ORDER BY orders.id DESC

SQL Nested query interpereted as correlated incorrectly

I've got a serious problem with a nested query, which I suspect MySQL is interpreting as a correlated subquery when in fact it should be uncorrelated. The query spans two tables, one being a list of products and the other being their price at various points in time. My aim is to return each price record for products that have a price range above a certain value for the whole time. My query looks like this:
SELECT oP.id, oP.title, oCR.price, oC.timestamp
FROM Crawl_Results AS oCR
JOIN Products AS oP
ON oCR.product = oP.id
JOIN Crawls AS oC
ON oCR.crawl = oC.id
WHERE oP.id
IN (
SELECT iP.id
FROM Products AS iP
JOIN Crawl_Results AS iCR
ON iP.id = iCR.product
WHERE iP.category =2
GROUP BY iP.id
HAVING (
MAX( iCR.price ) - MIN( iCR.price )
) >1
)
ORDER BY oP.id ASC
Taken alone, the inner query executes fine and returns a list of the id's of the products with a price range above the criterion. The outer query also works fine if I provide a simple list of ids in the IN clause. When I run them together however, the query takes ~3min to return ~1500 rows, so I think it's executing the inner query for every row of the outer, which is not ideal. I did have the columns aliased the same in the inner and outer queries, so I thought that aliasing them differently in the inner and outer as above would fix it, but it didn't.
Any ideas as to what's going on here?
MySQL might think it could use indexes to execute the query faster by running it once for every OP.id. The first thing to check is if your statistics are up to date.
You could rewrite the where ... in as a filtering inner join. This is less likely to be "optimized" for seeks:
SELECT *
FROM Crawl_Results AS oCR
JOIN Products AS oP
ON oCR.product = oP.id
JOIN Crawls AS oC
ON oCR.crawl = oC.id
JOIN (
SELECT iP.id
FROM Products AS iP
JOIN Crawl_Results AS iCR
ON iP.id = iCR.product
WHERE iP.category =2
GROUP BY
iP.id
HAVING (MAX(iCR.price) - MIN(iCR.price)) > 1
) filter
ON OP.id = filter.id
Another option is to use a temporary table. You store the result of the subquery in a temporary table and join on that. That really forces MySQL not to execute the subquery as a correlated query.