Following my recent question "Select information from last item and join to the total amount", I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale=-1 in every row. A customer can be in none, one or both tables. Both tables have around 300,000 records and many more columns than indicated here (around 50 columns). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
from sales x where dates = (select max(dates)
from sales y
where x.customer = y.customer
and y.sale > 0
)
)b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I need the results for certain calculations, broken down by date: information for year 2012, information for year 2013, but also information for all the years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information from all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I explain the query, almost all of the time is spent creating the tmp table.
I thought about splitting the data gathering into quarters. We would get the results for every three months and then join and sort them. But I guess this final join and sort will be too much for the database again.
So, what would you experts recommend to optimize these queries, given that I cannot change the table structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
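The tie-breaking join above can be tried on a few rows. This is only a minimal sketch: it uses Python's built-in sqlite3 module rather than MySQL, stores dates as ISO text, and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, dates TEXT, customer INTEGER, sale INTEGER)"
)
# Hypothetical rows: customer 1 has two sales on its latest date.
conn.executemany(
    "INSERT INTO sales (dates, customer, sale) VALUES (?, ?, ?)",
    [
        ("2012-01-10", 1, 10),
        ("2013-03-05", 1, 20),
        ("2013-03-05", 1, 30),  # same date as the row above, higher id
        ("2012-06-01", 2, 5),
    ],
)

# Keep row a only if no row b exists for the same customer that is
# later by date, or equal by date with a higher id.
last_sales = conn.execute(
    """
    SELECT a.customer, a.sale AS max_sale
    FROM sales a
    LEFT OUTER JOIN sales b
      ON a.customer = b.customer
     AND (a.dates < b.dates OR (a.dates = b.dates AND a.id < b.id))
    WHERE b.customer IS NULL
    ORDER BY a.customer
    """
).fetchall()

print(last_sales)  # [(1, 30), (2, 5)]
```

Without the id comparison, customer 1 would come back twice, once for each sale on 2013-03-05.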
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
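How the aggregates treat NULL can be checked with a small sketch. It uses Python's sqlite3 module instead of MySQL (both follow the standard behaviour here), and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, customer INTEGER, sale INTEGER)")
# Hypothetical rows: None is stored as NULL and marks a missing sale value.
conn.executemany(
    "INSERT INTO sales (customer, sale) VALUES (?, ?)",
    [(1, 10), (1, None), (1, 20), (2, None)],
)

# COUNT(sale) and SUM(sale) skip NULLs; COUNT(*) counts every row.
totals = conn.execute(
    """
    SELECT customer, COUNT(sale), SUM(sale), COUNT(*)
    FROM sales
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

print(totals)  # [(1, 2, 30, 3), (2, 0, None, 1)]
```

Note that SUM() over a group containing only NULLs returns NULL rather than 0, so a customer with no real sales may need COALESCE(SUM(sale), 0).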
Re: your comment
What I have now is a table with fields year, quarter, total_sale (for the pair (year, quarter)) and sale. What I want to gather is information for a certain period: this quarter, quarters, year 2011... The info has to be split into top customers, ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, dates). The subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, max (or min), substring, cast. You may need to adjust the date convert function (this one works in MS-SQL), but this idea will work anywhere:
SELECT customer, count(sale), max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
group by customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
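Here is a minimal sketch of the same concatenate, max, substring, cast idea, run through Python's sqlite3 module. The SQLite functions `printf`, `substr` and `||` stand in for the MS-SQL `convert`, `str`, `substring` and `+` used above. It assumes dates are stored as fixed-width ISO text (10 characters) and that sale is a non-negative integer; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, dates TEXT, customer INTEGER, sale INTEGER)"
)
# Hypothetical rows: the latest sale of customer 1 is not its largest one.
conn.executemany(
    "INSERT INTO sales (dates, customer, sale) VALUES (?, ?, ?)",
    [("2012-01-10", 1, 500), ("2013-03-05", 1, 20), ("2012-06-01", 2, 5)],
)

# Build 'YYYY-MM-DD' followed by the sale zero-padded to 12 characters,
# take the MAX of that string per customer, then cut the sale back out
# starting at character 11.
rows = conn.execute(
    """
    SELECT customer,
           COUNT(sale) AS purchases,
           CAST(substr(MAX(dates || printf('%012d', sale)), 11) AS INTEGER) AS last_sale
    FROM sales
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

print(rows)  # [(1, 2, 20), (2, 1, 5)]
```

The date prefix decides the maximum, so the extracted value is the sale on the latest date (20), not the largest sale (500). When two sales share the latest date, the larger padded value wins.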
More on this very common challenge here.
Related
Considering the following query:
SELECT COUNT(table1.someField), COUNT(table2.someField)
FROM table1
INNER JOIN table2 ON table2.id = table1.id
GROUP BY table1.id
I am trying to understand what the difference is (if any) between grouping by table1.id and grouping by table2.id. In short, when inner joining two tables on X=Y, what is the difference between grouping by X and grouping by Y? That's it.
The real world example - pretty straightforward: a table transaction holds transactions information (paid amount, dates etc), and a table transaction_product holds information regarding which products were included in which transaction.
So for example, transaction number 1 could have included products number 1, 2 and 3, and so forth (so the table relation is obviously one-to-many).
The problem: I need to know for each transaction, how much was paid for how many products. This is the query, including both GROUP BY alternatives:
SELECT
`transaction`.id,
SUM(`transaction`.transaction_amount) AS total_amount,
COUNT(`transaction_product`.product_id) AS number_of_products
FROM `transaction`
INNER JOIN `transaction_product` ON `transaction_product`.transaction_id = `transaction`.id
GROUP BY [`transaction`.id [OR] `transaction_product`.transaction_id]
I need to know if there is a difference between the two GROUP BY alternatives. I couldn't find relevant information regarding the GROUP BY behavior in this case in the documentation, therefore any help on clarifying the matter would be much appreciated.
The result of the inner join will be a set of rows with matching transaction IDs, so the set of values that column can have will be the same on both transaction and transaction_product tables.
The group by will return a single row for each available value of the grouped column(s), and all the rows that share the same value will be aggregated with the aggregation function you use.
Result: there won't be any difference between the two options you have, because the same rows will be grouped by exactly the same criteria, since the set of values is the same on both sides.
TL;DR
There is no difference at all.
There is no difference whatsoever which id you choose to include in your GROUP BY clause. The total number of rows for each transaction id will be the number of products for that transaction. This query should get what you need:
SELECT
`transaction`.id,
SUM(`transaction`.transaction_amount) AS total_amount,
COUNT(1) AS number_of_products
FROM `transaction`
INNER JOIN `transaction_product` ON `transaction_product`.transaction_id = `transaction`.id
GROUP BY `transaction`.id
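The equivalence of the two GROUP BY choices can be checked with a small sketch. It runs in Python's sqlite3 module rather than MySQL, uses the short aliases t and tp for brevity, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "transaction" is a reserved word, so it is quoted (double quotes in SQLite).
conn.execute('CREATE TABLE "transaction" (id INTEGER PRIMARY KEY, transaction_amount INTEGER)')
conn.execute("CREATE TABLE transaction_product (transaction_id INTEGER, product_id INTEGER)")
# Hypothetical data: transaction 1 has three products, transaction 2 has one.
conn.executemany('INSERT INTO "transaction" VALUES (?, ?)', [(1, 100), (2, 50)])
conn.executemany(
    "INSERT INTO transaction_product VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (2, 1)],
)

query = """
    SELECT t.id,
           SUM(t.transaction_amount) AS total_amount,
           COUNT(tp.product_id) AS number_of_products
    FROM "transaction" t
    INNER JOIN transaction_product tp ON tp.transaction_id = t.id
    GROUP BY {group_col}
    ORDER BY t.id
"""

# Same query, grouped once by each side of the join condition.
by_transaction = conn.execute(query.format(group_col="t.id")).fetchall()
by_product_fk = conn.execute(query.format(group_col="tp.transaction_id")).fetchall()

print(by_transaction == by_product_fk)  # True
print(by_transaction)                   # [(1, 300, 3), (2, 50, 1)]
```

One thing the sample makes visible: because the join repeats each transaction row once per product, SUM(transaction_amount) returns 300 for a transaction whose amount is 100. If the amount is stored once per transaction, MAX(transaction_amount) (or selecting the column directly) may be what is actually wanted.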
I have two tables. One is called entities, with these relevant columns:
id, company_id, and integration_id. The other table is transactions, with columns id, entity_id and created_at. The foreign keys linking the two tables are integration_id and entity_id.
The transactions table shows the number of transactions received from each company from the entities table.
Ultimately, I want to find the date range with the highest volume of transactions, and then, from that range, find the average number of days between transactions for each company.
To find the date range I used this query.
SELECT DATE_FORMAT(t.created_at, '%Y/%m/%d'), COUNT(t.id)
FROM entities e
JOIN transactions t
ON e.id = t.entity_id
GROUP BY t.created_at;
I get this:
Date_FORMAT(t.created_at, '%Y/%m/%d') | COUNT(t.id)
+-------------------------------------+------------
2015/11/09 4
etc
From that I determine the range I want to use as 2015/11/09 to 2015/12/27
and I made this query
SELECT company_id, COUNT(t.id)
FROM entities e
INNER JOIN transactions t
ON e.integration_id = t.entity_id
WHERE t.created_at BETWEEN '2015/11/09' AND '2015/12/27'
GROUP BY company_id;
I get this:
company_id | COUNT(t.id)
+-----------+------------
1234 17
and so on
Which gives me the total transactions made by each company over this date range. What's the best way now to query for the average number of days between transactions by company? How can I sub-query or is there a way to use the AVG function on dates in a WHERE clause?
EDIT:
playing around with the query, I'm wondering if there is a way I can
SELECT company_id, (49 / COUNT(t.id))...
49, because that is the number of days in that date range, in order to get the average number of days between transactions?
I think this might be it, does that make sense?
I think this may work:
Select z.company_id,
datediff(max(y.created_at),min(created_at))/count(y.id) as avg_days_between_orders,
max(y.created_at) as latest_order,
min(created_at) as earliest_order,
count(y.id) as orders
From
(SELECT entity_id, max(t.created_at) latest, min(t.created_at) earliest
FROM entities e, transactions t
Where e.id = t.entity_id
group by entity_id
order by COUNT(t.id) desc
limit 1) x,
transactions y,
entities z
where z.id = x.entity_id
and z.integration_id = y.entity_id
and y.created_at between x.earliest and x.latest
group by company_id;
It's tough without the data. There's a possibility that I have reference to integration_id incorrect in the subquery/join on the outer query.
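As a rough sketch of the "average days between transactions" calculation for a fixed date range: n transactions have n - 1 gaps between them, so one option is to divide the span between the first and last transaction by COUNT - 1 rather than dividing the length of the range by COUNT. The example below runs in Python's sqlite3 module, where `julianday` plays the role of MySQL's DATEDIFF. It assumes the join is on integration_id = entity_id as described in the question, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, company_id INTEGER, integration_id INTEGER)")
conn.execute("CREATE TABLE transactions (id INTEGER PRIMARY KEY, entity_id INTEGER, created_at TEXT)")
# Hypothetical data: two companies, one transaction outside the date range.
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [(1, 1234, 10), (2, 5678, 20)])
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 10, "2015-11-09"),
        (2, 10, "2015-11-19"),
        (3, 10, "2015-12-09"),
        (4, 20, "2015-11-10"),
        (5, 20, "2015-11-14"),
        (6, 20, "2016-01-05"),  # outside the range, ignored
    ],
)

# Span between first and last transaction divided by the number of gaps.
# Companies with a single transaction have no gap and are filtered out.
avg_gaps = conn.execute(
    """
    SELECT e.company_id,
           (julianday(MAX(t.created_at)) - julianday(MIN(t.created_at)))
               / (COUNT(t.id) - 1) AS avg_days_between
    FROM entities e
    JOIN transactions t ON e.integration_id = t.entity_id
    WHERE t.created_at BETWEEN '2015-11-09' AND '2015-12-27'
    GROUP BY e.company_id
    HAVING COUNT(t.id) > 1
    ORDER BY e.company_id
    """
).fetchall()

for company_id, avg_days in avg_gaps:
    print(company_id, avg_days)
```

Company 1234 has three transactions spanning 30 days, which gives two gaps of 15 days on average; dividing the 49-day range by the count would give a different figure that depends on the chosen range, not on the transactions themselves.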
I need to run a slightly more complex MySQL query. I have two tables that I need to join, where one contains the primary key of the other. That's easy enough, but then I need to find the number of occurrences of each ID returned as well, and ultimately sort all the results by this number.
Normally this would just be a group by, but I also need to see ALL of the results (so if it were a group by containing 10 records, I'd need to see all 10, as well as that count returned as well).
So for instance, two tables could be:
Customers table:
CustomerID name address phone etc..
Orders table:
OrderID CustomerID product info etc..
The idea is to output, and sort the orders table to find the customer with the most orders in a given time period. The resultant report would have a few hundred customers, along with their order info below.
I couldn't figure out a way to have it return the rows containing ALL the info from both tables, plus the number of occurrences of each in one row (customer info, individual orders info, and count).
I considered separating it into multiple queries (get the list of top customers), then a bunch of sub-queries for each order programmatically. But that was going to end up with many hundreds of sub-queries every time this is submitted.
So I was hoping someone might know of an easier way to do this. My thought was to have a return result with repeated information, but get it only in one query.
Thanks in advance!
SELECT CUST.CustomerID, CUST.Name, ORDR.OrderID, ORDR.OrderDate, ORDR.ProductInfo, COUNTS.cnt
FROM Customers CUST
INNER JOIN Orders ORDR
ON ORDR.CustomerID = CUST.CustomerID
INNER JOIN
(
SELECT C.CustomerID, COUNT(DISTINCT O.OrderID) AS cnt
FROM Customers C
INNER JOIN Orders O
ON O.CustomerID = C.CustomerID
GROUP BY C.CustomerID
) COUNTS
ON COUNTS.CustomerID = CUST.CustomerID
ORDER BY COUNTS.cnt DESC, CUST.CustomerID
This will return one row per order, displayed by customer, ordered by the number of orders for that customer.
I've got two tables, transactions and listings.
transactions lists all of my transactions that occur each day. I have a column in transactions that lists the price of each transacted item. Several transactions occur each day. I would like to take the median or mean of all transactions in each day, and populate a new column in listings with this information.
So my end result would have a column in listings called daily_price_average, that takes the average price of individual transaction information from transactions.
Any thoughts on how to do this?
Or how could I do this using a view?
You can do this in a view as:
create view v_listings as
select l.*,
(select avg(price)
from transactions t
where date(t.transactiondate) = l.date
) as daily_price_average
from listings l;
To do the update, you would first be sure daily_price_average is a column in listings:
update listings join
(select date(t.transactiondate) as tdate, avg(price) as avgprice
from transactions t
group by date(t.transactiondate)
) td
on listings.date = td.tdate
set daily_price_average = td.avgprice;
Both of these assume that listings has a column called date for the average.
Use INSERT... SELECT
For average
INSERT INTO averages (day, average)
SELECT date, AVG(price)
FROM transactions
GROUP BY date
Getting the median is... complicated.
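One way around the missing median aggregate is to compute it in application code. The following is only a sketch, using Python's sqlite3 and statistics modules; the column names transactiondate and price are borrowed from the other answer and may not match the real schema, and the rows are invented.

```python
import sqlite3
from collections import defaultdict
from statistics import median

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (transactiondate TEXT, price REAL)")
# Hypothetical rows: three transactions on one day, two on the next.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("2020-01-01 09:00:00", 10.0),
        ("2020-01-01 12:30:00", 20.0),
        ("2020-01-01 17:45:00", 90.0),
        ("2020-01-02 10:00:00", 5.0),
        ("2020-01-02 11:00:00", 15.0),
    ],
)

# Collect the prices for each calendar day.
prices_by_day = defaultdict(list)
for day, price in conn.execute("SELECT date(transactiondate), price FROM transactions"):
    prices_by_day[day].append(price)

# statistics.median averages the two middle values for an even count.
daily_median = {day: median(prices) for day, prices in prices_by_day.items()}

print(daily_median)  # {'2020-01-01': 20.0, '2020-01-02': 10.0}
```

The resulting values could then be written back to listings with one UPDATE per day. On the first day the median (20.0) differs clearly from the mean (40.0), which is the usual reason for preferring it when prices have outliers.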
Here is my data structure
alt text http://luvboy.co.cc/images/db.JPG
when i try this sql
select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
group by dc_number
order by rec_id desc;
something is wrong somewhere, idk
I need
rec_id customer_id dc_number balance
2 IHS050018 DC3 -1
3 IHS050018 52 600
I want the most recent balance of the customer for each dc_number.
Thanx
There are essentially two ways to get this
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id='IHS050018' and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
Also if you want to get the last balance for each customer you might do
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id = (
select s.rec_id
from payments s
where s.customer_id=p.customer_id and s.dc_number = p.dc_number
order by s.rec_id desc
limit 1);
What I consider essentially another way is utilizing the fact that select rec_id with order by desc and limit 1 is equivalent to select max(rec_id) with appropriate group by, in full:
select p.rec_id, p.customer_id, p.dc_number, p.balance
from payments p
where p.rec_id IN (
select max(s.rec_id)
from payments s
group by s.customer_id, s.dc_number
);
This should be faster (if you want the last balance for every customer), since max is normally less expensive than sort (with indexes it might be the same).
Also when written like this the subquery is not correlated (it need not be run for every row of the outer query) which means it will be run only once and the whole query can be rewritten as a join.
Also notice that it might be beneficial to write it as correlated query (by adding where s.customer_id = p.customer_id and s.dc_number = p.dc_number in inner query) depending on the selectivity of the outer query.
This might improve performance, if you look for the last balance of only one or few rows.
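The join form mentioned above could look roughly like this. It is a sketch run through Python's sqlite3 module, not the asker's database: the table layout is guessed from the query in the question, the derived-table alias `latest` is made up, and the sample rows are invented apart from the two expected result rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (rec_id INTEGER PRIMARY KEY, customer_id TEXT, dc_number TEXT, balance INTEGER)"
)
# Hypothetical rows; rec_id 2 and 3 are the rows the question expects back.
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?, ?)",
    [
        (1, "IHS050018", "DC3", 100),
        (2, "IHS050018", "DC3", -1),
        (3, "IHS050018", "52", 600),
        (4, "OTHER", "DC3", 7),
    ],
)

# The derived table finds the highest rec_id per (customer, dc_number) once;
# joining back on rec_id fetches the full row for each of those ids.
recent = conn.execute(
    """
    SELECT p.rec_id, p.customer_id, p.dc_number, p.balance
    FROM payments p
    JOIN (SELECT MAX(rec_id) AS rec_id
          FROM payments
          GROUP BY customer_id, dc_number) latest
      ON latest.rec_id = p.rec_id
    WHERE p.customer_id = 'IHS050018'
    ORDER BY p.rec_id
    """
).fetchall()

print(recent)  # [(2, 'IHS050018', 'DC3', -1), (3, 'IHS050018', '52', 600)]
```

Dropping the WHERE clause returns the latest balance for every customer and dc_number pair.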
I don't think there is a good way to do this in SQL without having window functions (like those in Postgres 8.4). You probably have to iterate over the dataset in your code and get the recent balances that way.
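ORDER BY cannot come before GROUP BY in the same SELECT, so to sort the rows first you have to do the ordering in a derived table. This relies on MySQL's non-standard handling of non-aggregated columns and is not guaranteed to pick the latest row:
select rec_id, customer_id, dc_number, balance
from (select rec_id, customer_id, dc_number, balance
from payments
where customer_id='IHS050018'
order by rec_id desc) p
group by dc_number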