MySQL - Find Orders placed by repeat customers vs. new customers over time

I have an orders table that contains orders_id, customers_email_address and date_purchased. I want to write a SQL query that will, for each row in the table, add a new field called 'repeat_order_count' that shows how many times this customer has ordered, up to and including this order.
For example, if John ordered once before this order, the repeat_order_count would be 2 for this order, or in other words, this is the second time John has ordered. The next order row I encounter for John will have a 3, and so on. This will allow me to create a line graph that shows the number of orders placed by repeat customers over time. I can now go to a specific time in the past and figure out how many orders were placed by repeat customers during that time period:
SELECT
*
FROM orders
WHERE repeat_order_count > 1
AND date_purchased = January 2014 --(simplifying things here)
I'm also able to determine now WHEN a customer became a repeat customer.
I can't figure out the query to solve this. Or perhaps there may be an easier way to do this?

One approach to retrieving the specified result would be to use a correlated subquery in the SELECT list. This assumes that the customer identifier is customers_email_address, and that date_purchased is a DATETIME or TIMESTAMP (or other canonical format), and that there are no duplicated values for the same customer (that is, the customer doesn't have two or more orders with the same date_purchased value.)
SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , ( SELECT COUNT(1)
           FROM orders p
          WHERE p.customers_email_address = s.customers_email_address
            AND p.date_purchased < s.date_purchased
       ) AS previous_order_count
  FROM orders s
 ORDER
    BY s.customers_email_address
     , s.date_purchased
The correlated subquery will return 0 for the "first" order for a customer, and 1 for the "second" order. If you want to include the current order in the count, replace the < comparison operator with <= operator.
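To make the approach above concrete, here is a small self-contained sketch using Python's sqlite3 module (SQLite stands in for MySQL here; the query text is the same, and the table contents are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  orders_id INTEGER PRIMARY KEY,
  customers_email_address TEXT,
  date_purchased TEXT
);
INSERT INTO orders VALUES
  (1, 'john@example.com', '2014-01-05'),
  (2, 'mary@example.com', '2014-01-07'),
  (3, 'john@example.com', '2014-02-01');
""")

# Correlated subquery: count this customer's earlier orders for each row.
rows = conn.execute("""
SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , ( SELECT COUNT(1)
           FROM orders p
          WHERE p.customers_email_address = s.customers_email_address
            AND p.date_purchased < s.date_purchased
       ) AS previous_order_count
  FROM orders s
 ORDER BY s.customers_email_address, s.date_purchased
""").fetchall()

for r in rows:
    print(r)
```

John's second order gets a previous_order_count of 1; first orders get 0.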
FOLLOWUP
For performance of that query, we need to be particularly concerned with the performance of the correlated subquery, since it is going to be executed for every row in the table. (A million rows in the table means a million executions of that subquery.) Having a suitable index available is going to be crucial.
For the query in my answer, I'd recommend trying an index like this:
ON orders (customers_email_address, date_purchased, orders_id)
With that index in place, we'd expect EXPLAIN to show the index being used by both the outer query, to satisfy the ORDER BY (No "Using filesort" in the Extra column), and as a covering index (no lookups to the pages in the underlying table, "Using index" shown in the Extra column.)
The answer I gave demonstrated just one approach. It's also possible to return an equivalent result using a join pattern, for example:
SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , COUNT(p.orders_id)
  FROM orders s
  JOIN orders p
    ON p.customers_email_address = s.customers_email_address
   AND p.date_purchased <= s.date_purchased
 GROUP
    BY s.customers_email_address
     , s.date_purchased
     , s.orders_id
 ORDER
    BY s.customers_email_address
     , s.date_purchased
     , s.orders_id
(This query is based on some additional information provided in a comment, which wasn't available before: orders_id is UNIQUE in the orders table.)
If we are guaranteed that the orders_id of a "previous" order is always less than the orders_id of any later order, then it would be possible to use that column in place of the date_purchased column. We'd want a suitable index available:
... ON orders (customers_email_address, orders_id, date_purchased)
NOTE: The order of the columns in the index is important. With that index, we could do:
SELECT s.orders_id
     , s.customers_email_address
     , s.date_purchased
     , COUNT(p.orders_id)
  FROM orders s
  JOIN orders p
    ON p.customers_email_address = s.customers_email_address
   AND p.orders_id <= s.orders_id
 GROUP
    BY s.customers_email_address
     , s.orders_id
 ORDER
    BY s.customers_email_address
     , s.orders_id
Again, we'd want to review the output from EXPLAIN to verify that the index is being used for both the join operation and the GROUP BY operation.
NOTE: With the inner join, we need to use a <= comparison, so we get at least one matching row back. We could either subtract 1 from that result, if we wanted a count of only "previous" orders (not counting the current order), or we could use an outer join operation with a < comparison, so we could get a row back with a count of 0.
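As a sketch of that last variant (again SQLite via Python's sqlite3 standing in for MySQL, with invented sample data), an outer join with a strict < comparison returns 0 for a customer's first order without any subtraction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
  orders_id INTEGER PRIMARY KEY,
  customers_email_address TEXT,
  date_purchased TEXT
);
INSERT INTO orders VALUES
  (1, 'john@example.com', '2014-01-05'),
  (2, 'mary@example.com', '2014-01-07'),
  (3, 'john@example.com', '2014-02-01');
""")

# LEFT JOIN with < : first orders match no "previous" row, so COUNT is 0.
rows = conn.execute("""
SELECT s.orders_id
     , s.customers_email_address
     , COUNT(p.orders_id) AS previous_order_count
  FROM orders s
  LEFT JOIN orders p
    ON p.customers_email_address = s.customers_email_address
   AND p.orders_id < s.orders_id
 GROUP BY s.customers_email_address, s.orders_id
 ORDER BY s.customers_email_address, s.orders_id
""").fetchall()

for r in rows:
    print(r)
```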

When you are inserting into your orders table, you can populate the column you have for your OrderCount using a correlated subquery.
eg:
select
col1,
col2,
(coalesce((select count(*) from orders where custID = #currentCustomer), 0) + 1),
col4
(COALESCE here replaces ISNULL, which is SQL Server syntax; MySQL also has IFNULL.)
Note that you wouldn't be adding the field when the 2nd order is processed, the field would already exist and you would just be populating it.
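A minimal sketch of that insert-time approach (SQLite via Python's sqlite3; the custID column and schema are assumptions for illustration, and COALESCE replaces SQL Server's ISNULL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (custID TEXT, order_count INTEGER)")

def place_order(cust):
    # Count this customer's existing orders and store that count + 1
    # on the newly inserted row.
    conn.execute("""
        INSERT INTO orders (custID, order_count)
        VALUES (?, COALESCE((SELECT COUNT(*) FROM orders WHERE custID = ?), 0) + 1)
    """, (cust, cust))

place_order('john')
place_order('john')
place_order('mary')

counts = conn.execute(
    "SELECT custID, order_count FROM orders ORDER BY rowid"
).fetchall()
print(counts)
```

John's second order is stored with order_count 2, without any later recomputation.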


Select distinct column value from date range

I created a sqlfiddle that outlines what I'm trying to do:
http://sqlfiddle.com/#!9/41f3c/2
Basically, I need to search for unique posts in a table that contains meta information. The meta in this instance is a series of dates that represent exclusions (think of a booking system for a hotel).
I pass in a start and end date. I want to find post_id that does not contain a date that falls in the range. I'm close, but can't quite figure it out.
SELECT DISTINCT post_id
FROM wp_facetwp_index
WHERE facet_name = 'date_range'
AND facet_value NOT BETWEEN '$start_date' AND '$end_date'
This works if the only excluded dates in the table are in the range, but if some are out of the range, I still get the post_id.
Thanks for looking.
Do not forget: in SQL, the filters (WHERE clause, etc.) are applied on a RECORD basis. Each record is evaluated independently from the others.
So, since
(1, 511, 'date_range', 'cf/excluded_dates', '2015-07-31', '2015-07-31')
validates your condition, 511 is returned.
Since post_id is not unique, you need to proceed with an exclusion on SETS, as opposed to the exclusion on RECORDS which you're doing right now.
Here is the solution (adjusted fiddle here: http://sqlfiddle.com/#!9/41f3c/7)
SELECT DISTINCT i1.`post_id`
FROM `wp_facetwp_index` i1
WHERE i1.`facet_name` = 'date_range'
AND NOT EXISTS (
    SELECT 1
    FROM `wp_facetwp_index` i2
    WHERE i2.`facet_value` BETWEEN '$start_date' AND '$end_date'
      AND i2.`facet_name` = 'date_range'
      AND i2.`post_id` = i1.`post_id`
)
The subquery right after EXISTS ( defines a subset of rows. NOT EXISTS rejects every i1 row for which that subset is non-empty, based on the correlation i2.post_id = i1.post_id.
This is a negative intersection.
Working on excluding RECORDS does not work when the tuple you need to identify is not unique.
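A runnable sketch of the NOT EXISTS pattern (Python's sqlite3 standing in for MySQL; the rows are invented, and bound parameters replace the interpolated '$start_date'/'$end_date'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wp_facetwp_index (
  id INTEGER, post_id INTEGER, facet_name TEXT,
  facet_source TEXT, facet_value TEXT, facet_display_value TEXT
);
INSERT INTO wp_facetwp_index VALUES
  (1, 511, 'date_range', 'cf/excluded_dates', '2015-07-31', '2015-07-31'),
  (2, 511, 'date_range', 'cf/excluded_dates', '2015-08-10', '2015-08-10'),
  (3, 600, 'date_range', 'cf/excluded_dates', '2015-07-31', '2015-07-31');
""")

start_date, end_date = '2015-08-05', '2015-08-15'

# Post 511 has an excluded date inside the range, so it must be rejected;
# post 600's only excluded date is outside the range, so it is returned.
rows = conn.execute("""
SELECT DISTINCT i1.post_id
FROM wp_facetwp_index i1
WHERE i1.facet_name = 'date_range'
AND NOT EXISTS (
    SELECT 1
    FROM wp_facetwp_index i2
    WHERE i2.facet_value BETWEEN ? AND ?
      AND i2.facet_name = 'date_range'
      AND i2.post_id = i1.post_id
)
""", (start_date, end_date)).fetchall()

print(rows)
```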

wrapping inside aggregate function in SQL query

I have 2 tables called Orders and Salesperson shown below:
And I want to retrieve the names of all salespeople that have more than 1 order from the tables above.
Firing the following query shows an error:
SELECT Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The error is:
Column 'Name' is invalid in the select list because it is
not contained in either an aggregate function or
the GROUP BY clause.
From the error and from searching on Google, I could understand that the error occurs because the Name column must be either part of the GROUP BY statement or inside an aggregate function.
I also tried to understand why the selected column has to be in the GROUP BY clause or part of an aggregate function, but didn't clearly understand it.
So, how do I fix this error?
SELECT max(Name) as Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
GROUP BY salesperson_id
HAVING COUNT( salesperson_id ) >1
The basic idea is that columns that are not in the GROUP BY clause need to be in an aggregate function. Here, since the name is probably the same for every salesperson_id, MIN or MAX makes no real difference (the result is the same).
example
Looking at your data, you have 3 entries for Dan (7). When the join is created, the row with the Name Dan gets multiplied by 3 (one for each of his orders), and then the server does not know which "Dan" to pick, because to the server those are 3 distinct rows even though they are semantically the same.
Also, try this so you can see what I am talking about:
SELECT Orders.Number, Salesperson.Name
FROM Orders, Salesperson
WHERE Orders.salesperson_id = Salesperson.ID
As far as the query goes, an INNER JOIN is a better solution, since it's kind of the standard. For this simple query it should not matter, but in some cases an INNER JOIN can produce better results; as far as I know this is more of a legacy thing, since these days the server should produce pretty much the same execution plan either way.
For code clarity I would stick with INNER JOIN
Assuming the name is unique to salesperson.ID, then simply add it to your GROUP BY clause:
GROUP BY salesperson_id, salesperson.Name
Otherwise use any Agg function
Select Min(Name)
The reason for this is that SQL doesn't know whether there are multiple names per salesperson.ID.
For readability and correctness, I usually split aggregate queries into two parts:
The aggregate query
Any additional queries to support fields not contained in aggregate functions
So:
1.Aggregate query - salespeople with more than 1 order
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
2.Use aggregate as subquery (basically a select joining onto another select) to join on any additional fields:
SELECT *
FROM Salesperson SP
INNER JOIN
(
SELECT salesperson_id
FROM ORDERS
GROUP BY salesperson_id
HAVING COUNT(Number) > 1
) AGG_QUERY
ON AGG_QUERY.salesperson_id = SP.ID
There are other approaches, such as selecting the additional fields via aggregation functions (as shown by the other answers). These get the code written quickly so if you are writing the query under time pressure you may prefer that approach. If the query needs to be maintained (and hence readable) I would favour subqueries.
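A minimal sketch of the subquery approach (SQLite via Python's sqlite3; the table contents are invented, matching the Dan-has-three-orders scenario mentioned in the other answer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Salesperson (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Orders (Number INTEGER PRIMARY KEY, salesperson_id INTEGER);
INSERT INTO Salesperson VALUES (1, 'Abe'), (2, 'Bob'), (7, 'Dan');
INSERT INTO Orders VALUES (10, 1), (20, 7), (30, 7), (40, 7);
""")

# Aggregate subquery finds salespeople with more than one order;
# the outer join then attaches the non-aggregated Name column.
rows = conn.execute("""
SELECT SP.ID, SP.Name
FROM Salesperson SP
INNER JOIN
(
    SELECT salesperson_id
    FROM Orders
    GROUP BY salesperson_id
    HAVING COUNT(Number) > 1
) AGG_QUERY
ON AGG_QUERY.salesperson_id = SP.ID
""").fetchall()

print(rows)
```

Only Dan (3 orders) survives the HAVING filter.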

joined table results count mysql performance

I have a restaurants table with the structure id, name, table_count, and an orders table with the structure restaurant_id, start_date, end_date, status. I want to find the restaurants that are available for some date range. A restaurant is considered available if either it has no orders, or the number of confirmed reservations (status = 1) for the given date range is less than that restaurant's table count. So, I use this query:
SELECT r.id, r.name, r.table_count
FROM restaurants r
LEFT JOIN orders o
ON r.id = o.restaurant_id
WHERE o.id IS NULL
OR (r.table_count > (SELECT COUNT(*)
FROM orders o2
WHERE o2.restaurant_id = r.id AND o2.status = 1 AND
NOT(o.start_date >= '2013-09-10') AND NOT (o.end_date <= '2013-09-05')
)
)
So, can I write the same query in another way that will be faster? I am thinking about this because over time there can be thousands or more rows in the orders table, and the query has to compare the date columns against a date range. Would it be faster if I added an is_search column (with a MySQL index) holding 1 or 0, updated periodically by a cron job that marks reservations from the past as 0? Then the search would only need to consider orders in the present or future when comparing the date range (which I think is much more expensive than comparing a tinyint column for 1 or 0). Or would adding that condition just be one more thing to check, and have the opposite effect?
Thanks
It is not a good idea to index on 0 and 1. In your case, with thousands of records, the index will have poor selectivity and may not be used at all.
You can speed up your query if you build a composite index on start_date and end_date.
Also rewrite the query to avoid the NOT clauses on the date columns, because NOT forces the server to scan the tables instead of using the index.
Your query
...NOT(o.start_date >= '2013-09-10') AND NOT (o.end_date <= '2013-09-05')..
could be defined
...(o.start_date <'2013-09-10') AND (o.end_date > '2013-09-05')...
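The two forms are logically equivalent: NOT (a >= b) is the same as a < b, and NOT (a <= b) is the same as a > b. A quick sketch checking that equivalence over a few sample ranges (plain Python; the dates are invented):

```python
# The overlap test written with NOT, as in the original query.
def available_not(start, end):
    return not (start >= '2013-09-10') and not (end <= '2013-09-05')

# The rewritten, index-friendly form.
def available_pos(start, end):
    return start < '2013-09-10' and end > '2013-09-05'

samples = [
    ('2013-09-01', '2013-09-03'),  # entirely before the range
    ('2013-09-06', '2013-09-08'),  # inside the range
    ('2013-09-11', '2013-09-12'),  # entirely after the range
    ('2013-09-01', '2013-09-20'),  # spans the whole range
]

results = [(available_not(s, e), available_pos(s, e)) for s, e in samples]
print(results)
```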

Fast MAX, GROUP BY on the concatenation of multiple columns

I have a table with 4 columns: name, date, version, and value. There's a composite index on all four, in that order. It has 20M rows: 2,000 names, approx. 1,000 dates per name, approx. 10 versions per date.
I'm trying to get a list that give for all names the highest date, the highest version on that date, and the associated value.
When I do
SELECT name,
MAX(date)
FROM table
GROUP BY name
I get good performance and the database uses the composite index
However, when I join the table to this in order to get the MAX(version) per name, the query takes ages. There must be a way to get the result in about the same magnitude of time as the SELECT statement above? It can surely be done by using the index.
Try this (I know it needs a few syntax tweaks for MySQL; ask for them and I will find them):
INSERT INTO #TempTable
SELECT name, MAX(Date) as Date
FROM table
Group By name
select table.name, table.date, max(table.version) as version
from table
inner join #TempTable on table.name = #temptable.name and table.date = #temptable.date
group by table.name, table.date
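A sketch of the same two-step idea, using a CTE in place of the temp table (the #TempTable syntax above is SQL Server flavored). This runs under SQLite via Python's sqlite3; the table name t and its contents are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (name TEXT, date TEXT, version INTEGER, value TEXT);
INSERT INTO t VALUES
  ('a', '2020-01-01', 1, 'x'),
  ('a', '2020-01-02', 1, 'y'),
  ('a', '2020-01-02', 2, 'z'),
  ('b', '2020-03-01', 5, 'q');
""")

# Step 1 (CTE): highest date per name.
# Step 2: join back and take the highest version on that date.
rows = conn.execute("""
WITH latest AS (
    SELECT name, MAX(date) AS date
    FROM t
    GROUP BY name
)
SELECT t.name, t.date, MAX(t.version) AS version
FROM t
INNER JOIN latest ON t.name = latest.name AND t.date = latest.date
GROUP BY t.name, t.date
ORDER BY t.name
""").fetchall()

print(rows)
```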

Will grouping an ordered table always return the first row? MySQL

I'm writing a query where I group a selection of rows to find the MIN value for one of the columns.
I'd also like to return the other column values associated with the MIN row returned.
e.g
ID  QTY  PRODUCT  TYPE
----------------------
1   2    Orange   Fruit
2   4    Banana   Fruit
3   3    Apple    Fruit
If I GROUP this table by the column 'TYPE' and select the MIN qty, it won't return the corresponding product for the MIN row, which in the case above is 'Orange' (qty 2).
Adding an ORDER BY clause before grouping seems to solve the problem. However, before I go ahead and include this query in my application I'd just like to know whether this method will always return the correct value. Is this the correct approach? I've seen some examples where subqueries are used, however I have also read that this inefficient.
Thanks in advance.
Adding an ORDER BY clause before grouping seems to solve the problem. However, before I go ahead and include this query in my application I'd just like to know whether this method will always return the correct value. Is this the correct approach? I've seen some examples where subqueries are used, however I have also read that this inefficient.
No, this is not the correct approach.
I believe you are talking about a query like this:
SELECT product.*, MIN(qty)
FROM product
GROUP BY
type
ORDER BY
qty
What you are doing here is using MySQL's extension that allows you to select unaggregated/ungrouped columns in a GROUP BY query.
This is mostly used in the queries containing both a JOIN and a GROUP BY on a PRIMARY KEY, like this:
SELECT order.id, order.customer, SUM(price)
FROM order
JOIN orderline
ON orderline.order_id = order.id
GROUP BY
order.id
Here, order.customer is neither grouped nor aggregated, but since you are grouping on order.id, it is guaranteed to have the same value within each group.
In your case, all values of qty have different values within the group.
It is not guaranteed from which record within the group the engine will take the value.
You should do this:
SELECT p.*
FROM (
    SELECT DISTINCT type
    FROM product
) pd
JOIN product p
ON p.id =
(
    SELECT pi.id
    FROM product pi
    WHERE pi.type = pd.type
    ORDER BY
        pi.type, pi.qty, pi.id
    LIMIT 1
)
If you create an index on product (type, qty, id), this query will work fast.
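A runnable sketch of that per-group LIMIT 1 pattern (SQLite via Python's sqlite3; the product rows are invented, extending the question's example with a second type so there is more than one group):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, qty INTEGER, product TEXT, type TEXT);
INSERT INTO product VALUES
  (1, 2, 'Orange', 'Fruit'),
  (2, 4, 'Banana', 'Fruit'),
  (3, 3, 'Apple',  'Fruit'),
  (4, 1, 'Carrot', 'Veg');
""")

# For each distinct type, the correlated subquery picks the id of the row
# with the smallest qty (ties broken by id), then the join fetches that row.
rows = conn.execute("""
SELECT p.*
FROM (
    SELECT DISTINCT type
    FROM product
) pd
JOIN product p
  ON p.id =
  (
      SELECT pi.id
      FROM product pi
      WHERE pi.type = pd.type
      ORDER BY pi.type, pi.qty, pi.id
      LIMIT 1
  )
ORDER BY p.type
""").fetchall()

print(rows)
```

Each group yields exactly one whole row: the lowest-qty product per type.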
It's difficult to follow you properly without an example of the query you are trying. From your comments, I guess you are querying something like:
SELECT ID, COUNT(*) AS QTY, PRODUCT_TYPE
FROM PRODUCTS
GROUP BY PRODUCT_TYPE
ORDER BY COUNT(*) DESC;
My advice: you group by a concept (in this case PRODUCT_TYPE) and you order by the number of times it appears, COUNT(*). The query above would do what you want.
Subqueries are mostly for sorting or for dismissing rows that are not of interest.
The MIN you are looking for is not exactly a MIN; it is an occurrence count, and you want to see first the one that gives the fewest occurrences (meaning it appears the fewest times, I guess).
Cheers,