Need a way to optimize a slow SQL query - mysql

I'm running an update query with sub-queries on a MySQL server and it takes 12 minutes to finish, so I think it is not optimized enough.
Can anyone think of a way to optimize it so it runs faster?
Thanks in advance.
UPDATE `TABLE_1` C
INNER JOIN(
SELECT Cust_No,
#current year sales
(SELECT SUM(`Sales`)
FROM `TABLE_2`
WHERE Year = 2016
AND Cust_No = p.Cust_No
) as CY_TOTAL_SALES,
# Get previous year sales
(SELECT SUM(`Sales`)
FROM `TABLE_2`
WHERE Year = 2015
AND Cust_No = p.Cust_No
) as PY_TOTAL_SALES
FROM `TABLE_2` p
WHERE Year >= 2015
AND Year <= 2016
) AS A ON C.`customer_number` = A.Cust_No
SET C.CY_TOTAL_SALES = A.CY_TOTAL_SALES,
C.PY_TOTAL_SALES = A.PY_TOTAL_SALES;
TABLE_1 contains 28,000 records (the customer_number field is unique and has an index built on it).
TABLE_2 contains 250,000 records (Cust_No is not unique, but has an index built on it).
What it does is update TABLE_1 by joining TABLE_2, using sub-queries to sum up the total sales value for both years in TABLE_2, and then writing the values back to TABLE_1 where TABLE_1.customer_number matches TABLE_2.Cust_No.

I can think of a couple of possible solutions.
Method one
Do just one subquery, don't do any correlated subqueries, and sum conditionally based on the year.
UPDATE TABLE_1 C
INNER JOIN (
SELECT Cust_No,
SUM(IF(Year=2015, Sales, 0)) AS PY_TOTAL_SALES,
SUM(IF(Year=2016, Sales, 0)) AS CY_TOTAL_SALES
FROM TABLE_2
WHERE Year IN (2015, 2016)
GROUP BY Cust_No
) AS S ON C.customer_number = S.Cust_No
SET C.PY_TOTAL_SALES = S.PY_TOTAL_SALES,
C.CY_TOTAL_SALES = S.CY_TOTAL_SALES;
Method two
Do no subqueries at all.
First, zero out the total sales for all customers:
UPDATE TABLE_1 C
SET C.CY_TOTAL_SALES = 0,
C.PY_TOTAL_SALES = 0;
Then do a join without using any subqueries or SUM() calls, and add each sale figure one at a time to the total sales for the customer.
UPDATE TABLE_1 AS C
INNER JOIN TABLE_2 AS S ON C.customer_number = S.Cust_No
SET C.CY_TOTAL_SALES = C.CY_TOTAL_SALES + IF(S.Year=2016, S.Sales, 0),
C.PY_TOTAL_SALES = C.PY_TOTAL_SALES + IF(S.Year=2015, S.Sales, 0)
WHERE S.Year IN (2015, 2016);
For both of these solutions, you'll want an index in TABLE_2 on the columns (Cust_No, Year, Sales).
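If that index doesn't exist yet, something along these lines would add it (the index name here is just illustrative):
ALTER TABLE TABLE_2 ADD INDEX idx_cust_year_sales (Cust_No, Year, Sales);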
In the meantime, I can explain a bit why your original query is so slow. Your subquery reads TABLE_2, which you say has 250,000 rows (I'll assume all the rows are in 2015-2016), and for each row it calculates the total sales for the corresponding customer. This means it calculates the same sums many times for each customer.
You're running 500,000 correlated subqueries! It's actually a miracle it only takes 12 minutes.
As it's doing this, it saves the entire result in a 250,000-row temporary table, because the derived table in the join has to be materialized.
Then it joins that temporary table to TABLE_1 and, for each customer, sets CY_TOTAL_SALES and PY_TOTAL_SALES. You can't see it, but it's setting the same totals many times for each customer, once for every row that customer has in the temporary table.
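If you want to see this for yourself, run EXPLAIN on the inner SELECT; the correlated subqueries show up as DEPENDENT SUBQUERY rows, meaning they are re-evaluated for every row of the outer query. A minimal sketch:
EXPLAIN
SELECT Cust_No,
(SELECT SUM(`Sales`) FROM `TABLE_2` WHERE Year = 2016 AND Cust_No = p.Cust_No) AS CY_TOTAL_SALES,
(SELECT SUM(`Sales`) FROM `TABLE_2` WHERE Year = 2015 AND Cust_No = p.Cust_No) AS PY_TOTAL_SALES
FROM `TABLE_2` p
WHERE Year >= 2015 AND Year <= 2016;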

I can't add a comment because of the new-user reputation limit.
Without seeing the table structures and the current indexes, it will be hard to tell how to optimize your current query.
Please edit your question to include the table structure (SHOW CREATE TABLE).
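For example, run these and paste the output into the question:
SHOW CREATE TABLE TABLE_1;
SHOW CREATE TABLE TABLE_2;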

Related

MySQL Compare Result in WHERE clause

I imagine I'm missing something pretty obvious here.
I'm trying to display a list of 'bookings' where the total charges are higher than the total payments for the booking. The charges and payments are stored in separate tables linked using foreign keys.
My query so far is:
SELECT `booking`.`id`,
SUM(`booking_charge`.`amount`) AS `charges`,
SUM(`booking_payment`.`amount`) AS `payments`
FROM `booking`
LEFT JOIN `booking_charge` ON `booking`.`id` = `booking_charge`.`booking_id`
LEFT JOIN `booking_payment` ON `booking`.`id` = `booking_payment`.`booking_id`
WHERE `charges` > `payments` -- this is the incorrect part
GROUP BY `booking`.`id`
My tables look something like this:
Booking (ID)
Booking_Charge (Booking_ID, Amount)
Booking_Payment (Booking_ID, Amount)
MySQL doesn't seem to like comparing the results from these two tables. I'm not sure what I'm missing, but I'm sure it's something that should be possible.
Try HAVING instead of WHERE, like this:
SELECT `booking`.`id`,
SUM(`booking_charge`.`amount`) AS `charges`,
SUM(`booking_payment`.`amount`) AS `payments`
FROM `booking`
LEFT JOIN `booking_charge` ON `booking`.`id` = `booking_charge`.`booking_id`
LEFT JOIN `booking_payment` ON `booking`.`id` = `booking_payment`.`booking_id`
GROUP BY `booking`.`id`
HAVING `charges` > `payments`
One of the problems with the query is the cross join between rows from `_charge` and rows from `_payment`. It's a semi-Cartesian join. Each row returned from `_charge` will be matched with each row returned from `_payment`, for a given `booking_id`.
Consider a simple example:
Let's put a single row in `_charge` for $40 for a particular `booking_id`.
And put two rows into `_payment` for $20 each, for the same `booking_id`.
The query would return total charges of $80 (= 2 x $40). If there were instead five rows in `_payment` for $10 each, the query would return total charges of $200 (= 5 x $40).
There are a couple of approaches to addressing that issue. One approach is to do the aggregation in inline views, returning the total charges and total payments as a single row for each booking_id, and then join those to the booking table. With at most one row per booking_id, the cross join no longer "duplicates" rows from _charge and/or _payment.
For example:
SELECT b.id
, IFNULL(c.amt,0) AS charges
, IFNULL(p.amt,0) AS payments
FROM booking b
LEFT
JOIN ( SELECT bc.booking_id
, SUM(bc.amount) AS amt
FROM booking_charge bc
GROUP BY bc.booking_id
) c
ON c.booking_id = b.id
LEFT
JOIN ( SELECT bp.booking_id
, SUM(bp.amount) AS amt
FROM booking_payment bp
GROUP BY bp.booking_id
) p
ON p.booking_id = b.id
WHERE IFNULL(c.amt,0) > IFNULL(p.amt,0)
We could make use of a HAVING clause, in place of the WHERE.
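For instance, here is the same query with only the final clause changed; MySQL lets HAVING refer to the column aliases, so this should return the same result:
SELECT b.id
, IFNULL(c.amt,0) AS charges
, IFNULL(p.amt,0) AS payments
FROM booking b
LEFT
JOIN ( SELECT bc.booking_id
, SUM(bc.amount) AS amt
FROM booking_charge bc
GROUP BY bc.booking_id
) c
ON c.booking_id = b.id
LEFT
JOIN ( SELECT bp.booking_id
, SUM(bp.amount) AS amt
FROM booking_payment bp
GROUP BY bp.booking_id
) p
ON p.booking_id = b.id
HAVING charges > payments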
The query in this answer is not the only way to get the result, nor is it the most efficient. There are other query patterns that will return an equivalent result.

SQL Query: How to use a sub-query or the AVG function to find the number of days between new entries?

I have two tables, one called entities with these relevant columns:
id, company_id, and integration_id. The other table is transactions with columns id, entity_id and created_at. The foreign keys linking the two tables are integration_id and entity_id.
The transactions table holds the transactions received from each company in the entities table.
Ultimately, I want to find the date range with the highest volume of transactions, and then from that range find the average number of days between transactions for each company.
To find the date range I used this query:
SELECT DATE_FORMAT(t.created_at, '%Y/%m/%d'), COUNT(t.id)
FROM entities e
JOIN transactions t
ON e.id = t.entity_id
GROUP BY t.created_at;
I get this:
DATE_FORMAT(t.created_at, '%Y/%m/%d') | COUNT(t.id)
--------------------------------------+------------
2015/11/09                            |           4
...
From that I determined the range I want to use as 2015/11/09 to 2015/12/27, and I wrote this query:
SELECT company_id, COUNT(t.id)
FROM entities e
INNER JOIN transactions t
ON e.integration_id = t.entity_id
WHERE t.created_at BETWEEN '2015/11/09' AND '2015/12/27'
GROUP BY company_id;
I get this:
company_id | COUNT(t.id)
-----------+------------
1234       |          17
...
This gives me the total number of transactions made by each company over the date range. What's the best way now to query for the average number of days between transactions for each company? Can I use a sub-query, or is there a way to use the AVG function on dates in a WHERE clause?
EDIT:
Playing around with the query, I'm wondering if there is a way I can do
SELECT company_id, (49 / COUNT(t.id))...
using 49 because that is the number of days in that date range, in order to get the average number of days between transactions.
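Something like this is what I have in mind (the 49 is hard-coded from the range above):
SELECT company_id, (49 / COUNT(t.id)) AS avg_days_between_transactions
FROM entities e
INNER JOIN transactions t
ON e.integration_id = t.entity_id
WHERE t.created_at BETWEEN '2015/11/09' AND '2015/12/27'
GROUP BY company_id;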
I think this might be it, does that make sense?
I think this may work:
SELECT z.company_id,
DATEDIFF(MAX(y.created_at), MIN(y.created_at)) / COUNT(y.id) AS avg_days_between_orders,
MAX(y.created_at) AS latest_order,
MIN(y.created_at) AS earliest_order,
COUNT(y.id) AS orders
FROM
(SELECT entity_id, MAX(t.created_at) latest, MIN(t.created_at) earliest
FROM entities e, transactions t
WHERE e.id = t.entity_id
GROUP BY entity_id
ORDER BY COUNT(t.id) DESC
LIMIT 1) x,
transactions y,
entities z
WHERE z.id = x.entity_id
AND z.integration_id = y.entity_id
AND y.created_at BETWEEN x.earliest AND x.latest
GROUP BY company_id;
It's tough without the data. There's a possibility that I have the reference to integration_id wrong in the subquery or in the join on the outer query.

WHERE clause behaving in the wrong way

I want to create a weekly report using a PHP script. In the MySQL query for this, I want to pick out the tables that were active during the last week, using their timestamp, while at the same time getting the other values (especially the counts of unique users and overall users) since the creation date of those tables. At the moment it is returning incorrect data in terms of counts: the counts only cover the last week, not the period since the creation of the project.
This happens inside a loop, so for the example query I'm using the table table_1.
QUERY
SELECT DISTINCT b.ID, name, accountname, c.accountID, status,
total_impr, min(a.timestamp), max(a.timestamp),COUNT(DISTINCT userid)
AS unique_users,COUNT(userid) AS overall_users
FROM table_1 a INNER JOIN logs b on a.ID = b.ID INNER JOIN
accounts c on b.accountID = c.accountID
WHERE a.timestamp > DATE_ADD(NOW(), INTERVAL -1 WEEK)

Speed up an UPDATE, INNER JOIN, WHERE date BETWEEN, GROUP BY MySQL query

I'm running the query below and was wondering if there is any way to speed it up.
I'm running the same query multiple times, once for each month, and it's odd because some months run very quickly (10 seconds) and some take a very long time (30 minutes). The difference in totals between months is not that large, so I'm not sure what the problem is.
Here is the query:
UPDATE appmonth a
INNER JOIN (SELECT activity_id, COUNT(*) AS counts FROM appmaster
WHERE upload_date BETWEEN '2014/05/01' AND '2014/05/31'
GROUP BY activity_id) b
on b.activity_id = a.activity_id
SET `2014_05` = b.counts
I don't have any indexes on the appmonth table that is being updated.
I have the following indexes set up on the appmaster table:
activity, upload_date
activity
upload_date, activity
I can suggest this approach:
UPDATE appmonth AS a
INNER JOIN (SELECT activity_id, COUNT(*) AS counts FROM appmaster
WHERE upload_date >='2014/05/01' AND upload_date <= '2014/05/31'
GROUP BY activity_id) AS b
ON b.activity_id = a.activity_id
SET a.`2014_05` = b.counts
BETWEEN is a slow condition.
And add an index on activity_id on appmonth.
P.S. The separate index on activity_id on appmaster is redundant, because you already have 'activity_id, upload_date'.
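For example (the index name is just illustrative):
ALTER TABLE appmonth ADD INDEX idx_activity_id (activity_id);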

MySQL huge tables JOIN makes database collapse

Following my recent question Select information from last item and join to the total amount, I am having some memory problems while generating tables.
I have two tables sales1 and sales2 like this:
id | dates | customer | sale
With this table definition:
CREATE TABLE sales (
id int auto_increment primary key,
dates date,
customer int,
sale int
);
sales1 and sales2 have the same definition, but sales2 has sale = -1 in every row. A customer can be in neither, one, or both tables. Both tables have around 300,000 records and many more fields than indicated here (around 50 fields). They are InnoDB.
I want to select, for each customer:
number of purchases
last purchase value
total amount of purchases, when it has a positive value
The query I am using is:
SELECT a.customer, count(a.sale), max_sale
FROM sales a
INNER JOIN (SELECT customer, sale max_sale
from sales x where dates = (select max(dates)
from sales y
where x.customer = y.customer
and y.sale > 0
)
)b
ON a.customer = b.customer
GROUP BY a.customer, max_sale;
The problem is:
I have to get the results, which I need for certain calculations, separated by date: information on year 2012, information on year 2013, but also information from all the years together.
Whenever I do just one year, it takes about 2-3 minutes to store all the information.
But when I try to gather information from all the years, the database crashes and I get messages like:
InternalError: (InternalError) (1205, u'Lock wait timeout exceeded; try restarting transaction')
It seems that joining such huge tables is too much for the database. When I EXPLAIN the query, almost all of the time comes from creating the temporary table.
I thought about splitting the data gathering into quarters: get the results for every three months, then join and sort them. But I guess this final join and sort would be too much for the database again.
So, what would you experts recommend to optimize these queries as long as I cannot change the tables structure?
300k rows is not a huge table. We frequently see 300 million row tables.
The biggest problem with your query is that you're using a correlated subquery, so it has to re-execute the subquery for each row in the outer query.
It's often the case that you don't need to do all your work in one SQL statement. There are advantages to breaking it up into several simpler SQL statements:
Easier to code.
Easier to optimize.
Easier to debug.
Easier to read.
Easier to maintain if/when you have to implement new requirements.
Number of Purchases
SELECT customer, COUNT(sale) AS number_of_purchases
FROM sales
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
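If it isn't there already, a covering index along these lines would do (the index name is just illustrative):
ALTER TABLE sales ADD INDEX idx_customer_sale (customer, sale);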
Last Purchase Value
This is the greatest-n-per-group problem that comes up frequently.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND a.dates < b.dates
WHERE b.customer IS NULL;
In other words, try to match row a to a hypothetical row b that has the same customer and a greater date. If no such row is found, then a must have the greatest date for that customer.
An index on sales(customer,dates,sale) would be best for this query.
If you might have more than one sale for a customer on that greatest date, this query will return more than one row per customer. You'd need to find another column to break the tie. If you use an auto-increment primary key, it's suitable as a tie breaker because it's guaranteed to be unique and it tends to increase chronologically.
SELECT a.customer, a.sale as max_sale
FROM sales a
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL;
Total Amount of Purchases, When It Has a Positive Value
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE sale > 0
GROUP BY customer;
An index on sales(customer,sale) would be best for this query.
You should consider using NULL to signify a missing sale value instead of -1. Aggregate functions like SUM() and COUNT() ignore NULLs, so you don't have to use a WHERE clause to exclude rows with sale < 0.
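For example, a rough sketch of that change, assuming the -1 placeholders live in the sale column of sales2 as you described:
# hypothetical cleanup: replace the -1 placeholder with NULL
UPDATE sales2 SET sale = NULL WHERE sale = -1;
# SUM() and COUNT(sale) then skip those rows without any WHERE filter
SELECT customer, SUM(sale) AS total_purchases
FROM sales2
GROUP BY customer;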
Re: your comment
What I have now is a table with fields year, quarter, total_sale (relating to the pair (year, quarter)) and sale. What I want to gather is information regarding a certain period: this quarter, particular quarters, year 2011... Info has to be split into top customers, the ones with bigger sales, etc. Would it be possible to get the last purchase value from customers with total_purchases bigger than 5?
Top Five Customers for Q4 2012
SELECT customer, SUM(sale) AS total_purchases
FROM sales
WHERE (year, quarter) = (2012, 4) AND sale > 0
GROUP BY customer
ORDER BY total_purchases DESC
LIMIT 5;
I'd want to test it against real data, but I believe an index on sales(year, quarter, customer, sale) would be best for this query.
Last Purchase for Customers with Total Purchases > 5
SELECT a.customer, a.sale as max_sale
FROM sales a
INNER JOIN sales c ON a.customer=c.customer
LEFT OUTER JOIN sales b
ON a.customer=b.customer AND (a.dates < b.dates OR a.dates = b.dates and a.id < b.id)
WHERE b.customer IS NULL
GROUP BY a.id
HAVING COUNT(*) > 5;
As in the other greatest-n-per-group query above, an index on sales(customer,dates,sale) would be best for this query. It probably can't optimize both the join and the group by, so this will incur a temporary table. But at least it will only do one temporary table instead of many.
These queries are complex enough. You shouldn't try to write a single SQL query that can give all of these results. Remember the classic quote from Brian Kernighan:
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?
I think you should try adding an index on sales(customer, dates). The subquery is probably the performance bottleneck.
You can make this puppy scream. Dump the whole inner join query. Really. This is a trick virtually no one seems to know about.
Assuming dates is a datetime, convert it to a sortable string, concatenate the values you want, max (or min), substring, cast. You may need to adjust the date convert function (this one works in MS-SQL), but this idea will work anywhere:
SELECT customer, count(sale), max_sale = cast(substring(max(convert(char(19), dates, 120) + str(sale, 12, 2)), 20, 12) as numeric(12, 2))
FROM sales a
group by customer
Voilà. If you need more result columns, do:
SELECT yourkey
, maxval = left(val, N1) --you often won't need this
, result1 = substring(val, N1+1, N2)
, result2 = substring(val, N1+N2+1, N3) --etc. for more values
FROM ( SELECT yourkey, val = max(cast(maxval as char(N1))
+ cast(resultCol1 as char(N2))
+ cast(resultCol2 as char(N3)) )
FROM yourtable GROUP BY yourkey ) t
Be sure that you have fixed lengths for all but the last field. This takes a little work to get your head around, but is very learnable and repeatable. It will work on any database engine, and even if you have rank functions, this will often significantly outperform them.
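For MySQL specifically, a rough, untested adaptation of the same idea might look like this (it assumes dates is a DATE or DATETIME and that sale is non-negative, so the zero-padding keeps the string ordering correct):
SELECT customer,
COUNT(sale) AS sales_count,
# the formatted date is 19 characters, so the padded sale value starts at position 20
CAST(SUBSTRING(MAX(CONCAT(DATE_FORMAT(dates, '%Y-%m-%d %H:%i:%s'), LPAD(sale, 12, '0'))), 20, 12) AS DECIMAL(12,2)) AS max_sale
FROM sales
GROUP BY customer;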
More on this very common challenge here.