Optimizing a MySQL NOT IN( query

I am trying to optimize this MySQL query. I want to get a count of the number of customers that do not have an appointment prior to the current appointment being looked at. In other words, if they have an appointment (which is what the NOT IN( subquery is checking for), then exclude them.
However, this query is absolutely killing performance. I know that MySQL is not very good with NOT IN( queries, but I am not sure on the best way to go about optimizing this query. It takes anywhere from 15 to 30 seconds to run. I have created indexes on CustNo, AptStatus, and AptNum.
SELECT
COUNT(*) AS NumOfCustomersWithPriorAppointment
FROM
transaction_log AS tl
LEFT JOIN
appointment AS a
ON
a.AptNum = tl.AptNum
INNER JOIN
customer AS c
ON
c.CustNo = tl.CustNo
WHERE
a.AptStatus IN (2)
AND a.CustNo NOT IN
(
SELECT
a2.CustNo
FROM
appointment a2
WHERE
a2.AptDateTime < a.AptDateTime)
AND a.AptDateTime > BEGIN_QUERY_DATE
AND a.AptDateTime < END_QUERY_DATE
Thank you in advance.

Try the following:
SELECT
COUNT(*) AS NumOfCustomersWithPriorAppointment
FROM
transaction_log AS tl
INNER JOIN
appointment AS a
ON
a.AptNum = tl.AptNum
LEFT OUTER JOIN appointment AS earlier_a
ON earlier_a.CustNo = a.CustNo
AND earlier_a.AptDateTime < a.AptDateTime
INNER JOIN
customer AS c
ON
c.CustNo = tl.CustNo
WHERE
a.AptStatus IN (2)
AND earlier_a.AptNum IS NULL
AND a.AptDateTime > BEGIN_QUERY_DATE
AND a.AptDateTime < END_QUERY_DATE
This will benefit from a composite index on (CustNo,AptDateTime). Make it unique if that fits your business model (logically it seems like it should, but practically it may not, depending on how you handle conflicts in your application.)
Provide SHOW CREATE TABLE statements for all tables if this does not create a sufficient performance improvement.
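A minimal sketch of the rewrite above, assuming a cut-down appointment table with made-up rows and dates. SQLite (through Python's sqlite3 module) stands in for MySQL here, so it only checks that the NOT IN form and the LEFT JOIN ... IS NULL form count the same rows. It says nothing about MySQL timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE appointment (
    AptNum      INTEGER PRIMARY KEY,
    CustNo      INTEGER NOT NULL,
    AptDateTime TEXT    NOT NULL,   -- ISO text stands in for DATETIME
    AptStatus   INTEGER NOT NULL
);
-- The composite index suggested in the answer
CREATE INDEX idx_cust_date ON appointment (CustNo, AptDateTime);
-- Hypothetical sample data
INSERT INTO appointment VALUES
  (1, 100, '2013-01-05 09:00:00', 2),
  (2, 100, '2013-02-10 09:00:00', 2),
  (3, 200, '2013-02-11 14:00:00', 2),
  (4, 300, '2013-01-20 11:00:00', 2),
  (5, 300, '2013-03-01 11:00:00', 2);
""")

# Original shape: correlated NOT IN subquery
not_in_sql = """
SELECT COUNT(*)
FROM appointment AS a
WHERE a.AptStatus = 2
  AND a.AptDateTime > :begin AND a.AptDateTime < :end
  AND a.CustNo NOT IN (SELECT a2.CustNo
                       FROM appointment AS a2
                       WHERE a2.AptDateTime < a.AptDateTime)
"""

# Rewritten shape: anti-join (LEFT JOIN, keep rows with no match)
anti_join_sql = """
SELECT COUNT(*)
FROM appointment AS a
LEFT JOIN appointment AS earlier_a
       ON earlier_a.CustNo = a.CustNo
      AND earlier_a.AptDateTime < a.AptDateTime
WHERE a.AptStatus = 2
  AND earlier_a.AptNum IS NULL
  AND a.AptDateTime > :begin AND a.AptDateTime < :end
"""

params = {"begin": "2013-02-01 00:00:00", "end": "2013-03-31 00:00:00"}
not_in_count = conn.execute(not_in_sql, params).fetchone()[0]
anti_join_count = conn.execute(anti_join_sql, params).fetchone()[0]
print(not_in_count, anti_join_count)
```

In this sample only customer 200 has no earlier appointment inside the date window, so both forms count one row.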

Related

Mysql Join performance MongoDB, Cassandra

I have a join query which takes a lot of time to process.
SELECT
COUNT(c.id)
FROM `customers` AS `c`
LEFT JOIN `setting` AS `ssh` ON `c`.`shop_id` = `ssh`.`id`
LEFT JOIN `customer_extra` AS `cx` ON `c`.`id` = `cx`.`customer_id`
LEFT JOIN `customers_address` AS `ca` ON `ca`.`id` = `cx`.`customer_default_address_id`
LEFT JOIN `lytcustomer_tier` AS `ct` ON `cx`.`lyt_customer_tier_id` = `ct`.`id`
WHERE (c.shop_id = '12121') AND ((DATE(cx.last_email_open_date) > '2019-11-08'));
This is primarily because the table 'customers' has 2 million records.
I could go into indexing etc., but the larger point is that this 2.5 million could become a billion records one day.
I'm looking for solutions which can enhance performance.
I've given thought to
a) horizontal scalability. -: distribute the mysql table into different sections and query the count independently.
b) using composite indexes.
c) My favourite one -: Just create a separate collection in MongoDB or Redis which only houses the count (the output of this query). Since the count is just one number, this will not require much space, which should mean better query performance. (The only question is how many such queries there are, because that will increase the size of the new collection.)

Try this and see if it improves performance:
SELECT
COUNT(c.id)
FROM `customers` AS `c`
INNER JOIN `customer_extra` AS `cx` ON `c`.`id` = `cx`.`customer_id`
LEFT JOIN `setting` AS `ssh` ON `c`.`shop_id` = `ssh`.`id`
LEFT JOIN `customers_address` AS `ca` ON `ca`.`id` = `cx`.`customer_default_address_id`
LEFT JOIN `lytcustomer_tier` AS `ct` ON `cx`.`lyt_customer_tier_id` = `ct`.`id`
WHERE (c.shop_id = '12121') AND ((DATE(cx.last_email_open_date) > '2019-11-08'));
As I mentioned in the comment, the condition DATE(cx.last_email_open_date) > '2019-11-08' in the WHERE clause already discards every row where cx is NULL, so the LEFT JOIN from customers to customer_extra effectively behaves like an INNER JOIN. You might as well write it as INNER JOIN customer_extra AS cx ON c.id = cx.customer_id and follow it with the other LEFT JOINs.
The INNER JOIN will at least restrict the initial result to customers whose last_email_open_date value satisfies the specified condition.

Say COUNT(*), not COUNT(c.id)
Remove these; they slow down the query without adding anything that I can see:
LEFT JOIN `setting` AS `ssh` ON `c`.`shop_id` = `ssh`.`id`
LEFT JOIN `customers_address` AS `ca` ON `ca`.`id` = `cx`.`customer_default_address_id`
LEFT JOIN `lytcustomer_tier` AS `ct` ON `cx`.`lyt_customer_tier_id` = `ct`.`id`
DATE(...) makes that test not "sargable". This works for DATE or DATETIME; and this is much faster:
cx.last_email_open_date > '2019-11-08'
Consider whether that should be >= instead of >.
Need an index on shop_id. (Please provide SHOW CREATE TABLE.)
Don't use LEFT JOIN when JOIN would work equally well.
If customer_extra holds columns that should have been in customers, now is the time to move them in. That would let you use this composite index for even more performance:
INDEX(shop_id, last_email_open_date) -- in this order
With those changes, a billion rows in MySQL will probably not be a problem. If it is, there are still more fixes I can suggest.
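A small sketch of the sargability point, assuming last_email_open_date is a DATETIME and using made-up rows. SQLite (through Python's sqlite3 module) stands in for MySQL, with timestamps stored as ISO text. It shows why the choice between > and >= matters: dropping DATE(...) while keeping > '2019-11-08' also picks up rows from later on the 8th, whereas >= '2019-11-09' returns the same rows as the original predicate and still leaves the column bare so an index on it can be used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_extra (
    customer_id          INTEGER PRIMARY KEY,
    last_email_open_date TEXT   -- ISO 'YYYY-MM-DD HH:MM:SS', stands in for DATETIME
);
CREATE INDEX idx_open_date ON customer_extra (last_email_open_date);
-- Hypothetical sample data
INSERT INTO customer_extra VALUES
  (1, '2019-11-07 23:59:59'),
  (2, '2019-11-08 10:30:00'),
  (3, '2019-11-09 00:00:00'),
  (4, '2019-11-10 08:00:00');
""")

def count_where(predicate):
    """Count rows of customer_extra matching a fixed, trusted predicate string."""
    sql = "SELECT COUNT(*) FROM customer_extra AS cx WHERE " + predicate
    return conn.execute(sql).fetchone()[0]

# Original predicate: the function call hides the column from the index
wrapped = count_where("DATE(cx.last_email_open_date) > '2019-11-08'")
# Bare column with the same literal: sargable, but also matches 10:30 on the 8th
bare_gt = count_where("cx.last_email_open_date > '2019-11-08'")
# Bare column, next day, >= : sargable and same rows as the original
next_day = count_where("cx.last_email_open_date >= '2019-11-09'")

print(wrapped, bare_gt, next_day)
```

If the column is a plain DATE rather than a DATETIME, the first two predicates select the same rows and the distinction disappears.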

Multitable join is taking hours, is there a better way?

I have a few tables recording trades by myself and trades that happen in general, separated out into tradelog and timesales respectively. Both tradelog and timesales each have a second table holding extra data about the trades. I have a column TimeSalesID in my tradelog so I can link up public trades which I participated in. Below is the query I'm running to try and get ALL of my trades and time and sales in one result. The right join is taking FOREVER though, is there a better way?
SELECT SUM(tsJoin.TradeEdge)
FROM
(SELECT * from tradelog tl JOIN slippage_processed sp ON tl.ID = sp.TradeLogID WHERE tl.TradeTIme > '2019-01-21') AS tlJoin
RIGHT JOIN
(SELECT * from timesales ts JOIN slippage_processed_timesales spt ON ts.ID = spt.TimeSalesID WHERE ts.TradeTime > '2019-01-21') AS tsJoin
ON tlJoin.TIMESalesID = tsJoin.ID

Not using subqueries would help. MySQL tends to materialize subqueries, which means that indexes are lost -- greatly impeding query plans.
I would start with:
select sum(?.TradeEdge) -- whatever table column it comes from
from timesales ts join
slippage_processed_timesales spt
on ts.ID = spt.TimeSalesID left join
tradelog tl
on ?.TIMESalesID = ?.ID left join
slippage_processed sp
on tl.ID = sp.TradeLogID and tl.TradeTIme > '2019-01-21'
where ts.TradeTime > '2019-01-21';
The ? are because I don't know the base tables where the columns are coming from. Depending on the tables, the query might need to be adjusted a bit.
Also, I don't think the outer joins are necessary for what you want to do.

You should simplify your sql to simple JOINs. Based on your query above, I think this might get you a result a little faster
SELECT SUM(TradeEdge)
FROM tradelog tl
INNER JOIN slippage_processed sp ON tl.ID = sp.TradeLogID
INNER JOIN timesales ts ON ts.ID = tl.TimeSalesID
WHERE tl.TradeTIme > '2019-01-21'
If this does not work, please post your table structure for all tables involved

How to improve SELECT performance joining multiple tables

I have the following MySQL SELECT statement that was working OK on a small data set but died when the volume was increased:
SELECT DISTINCT Bookings.BookingId, Bookings.ResortId, Bookings.WeekBeginning, Bookings.DepartDate, Bookings.CancelledDate,Clients.FirstName, Clients.LastName, Clients.Email, Clients.Address1, Clients.City, Clients.State, Clients.CountryId, Clients.ClientType, Countries.Country, BookingAccommodation.AccomId, BookingAccommodation.ShareType, BookingProgram.ProgramId, Programs.ProgramDesc
FROM Bookings, Clients, BookingProgram, BookingAccommodation, Countries, ClientType, Programs
WHERE Bookings.BookingId = BookingProgram.BookingId
AND Bookings.BookingId = BookingAccommodation.BookingId
AND Bookings.WeekBeginning >= '2016-10-01'
AND BookingAccommodation.Nights > 0
AND Clients.ClientId = Bookings.ClientId
AND Clients.Email <> ''
AND Clients.CountryId = Countries.CountryId
AND Programs.ProgramId = BookingProgram.ProgramId
With around 10K records in Bookings and 25K records in each of BookingAccommodation and BookingProgram, the volume isn't huge, but the query ran in 950 seconds. I'm running the query in the SQL window of phpMyAdmin on a local MAMP server.
Splitting it into 3 queries the result comes back in a fraction of a second for each:
SELECT DISTINCT Bookings.BookingId, Bookings.ResortId, Bookings.WeekBeginning, Bookings.DepartDate, Bookings.CancelledDate, Clients.FirstName, Clients.LastName, Clients.Email, Clients.Address1, Clients.City, Clients.State, Clients.CountryId, Clients.ClientType, Countries.Country
FROM Bookings, Clients, Countries, ClientType
WHERE Bookings.WeekBeginning >= '2016-10-01'
AND Clients.ClientId = Bookings.ClientId
AND Clients.Email <> ''
AND Clients.CountryId = Countries.CountryId
SELECT DISTINCT Bookings.BookingId, BookingAccommodation.AccomId, BookingAccommodation.ShareType
FROM Bookings, BookingAccommodation
WHERE Bookings.BookingId = BookingAccommodation.BookingId
AND Bookings.WeekBeginning >= '2016-10-01'
AND BookingAccommodation.Nights > 0
SELECT DISTINCT Bookings.BookingId, BookingProgram.ProgramId, Programs.ProgramDesc
FROM Bookings, BookingProgram, Programs
WHERE Bookings.BookingId = BookingProgram.BookingId
AND Bookings.WeekBeginning >= '2016-10-01'
AND Programs.ProgramId = BookingProgram.ProgramId
There are multiple records in BookingAccommodation and BookingProgram for each record in Bookings but I only require one record from each hence the SELECT DISTINCT.
The primary key on Bookings is BookingId.
The primary key on BookingAccommodation is BookingId, AccomDate, AccomId
The primary key on BookingProgram is BookingId, ProgramId, AccomType
I've tried to rewrite the query with joins and sub queries but I'm obviously not doing it right. How can I join these 3 queries back into a single query that will perform well?

These are the basics of using subqueries instead of joins (MySQL assumed FWIW). Apologies for pseudocode, I thought it important to answer ASAP as this is one of the top hits on this issue I faced just now.
A client makes a booking to go on a cruise ship. The client should also specify their diet (eg. vegetarian, vegan, no soy, etc). We thus have three tables:
Bookings
Booking_Id, Booking_Date, Booking_Time, Client_Id
Clients
Client_Id, Client_Name, Client_Phone, Client_DietId
Diets
Diet_Id, Diet_Name
We now want to present to the concierge a full booking view.
Using "JOINS":
SELECT Bookings.Booking_Id, Bookings.Booking_Date, Bookings.Booking_Time, Clients.Client_Name, Diets.Diet_Name
FROM Bookings
INNER JOIN Clients
ON Bookings.Client_Id = Clients.Client_Id
INNER JOIN Diets
ON Clients.Client_DietId = Diets.Diet_Id
Using "SUBQUERIES":
I think of it as creating "temp tables" out of those separate JOINs. Of course, "temp tables" may or may not be an accurate description of the low-level implementation, but anecdotally subqueries may be faster than huge joins (see other threads on this).
I have separate joins I want to do from the above example:
First I need to join the Clients with their Diets, then I join that "table" with Bookings.
Thus I end up with this (note the table (re)naming when referring to the subquery):
SELECT [RELEVANT FIELDS HERE ETC]
FROM
(SELECT Clients.Client_Id, Clients.Client_Name, Diets.Diet_Name
FROM Clients
INNER JOIN Diets
ON Clients.Client_DietId = Diets.Diet_Id)
AS ClientDetailsWithDiets
INNER JOIN Bookings
ON Bookings.Client_Id = ClientDetailsWithDiets.Client_Id
Now if another table is to be joined say Staff assigned to a particular Booking, then the whole thing above would be nested, and so on eg:
SELECT [RELEVANT FIELDS HERE ETC]
FROM
(SELECT [RELEVANT FIELDS HERE ETC]
FROM
(SELECT Clients.Client_Id, Clients.Client_Name, Diets.Diet_Name
FROM Clients
INNER JOIN Diets
ON Clients.Client_DietId = Diets.Diet_Id)
AS ClientDetailsWithDiets
INNER JOIN Bookings
ON Bookings.Client_Id = ClientDetailsWithDiets.Client_Id)
AS BookingDetailsFull
INNER JOIN Staff
ON BookingDetailsFull.Booking_Id = Staff.Booking_Id_Assigned

Try changing it as
SELECT DISTINCT Bookings.BookingId, Bookings.ResortId,
Bookings.WeekBeginning, Bookings.DepartDate, Bookings.CancelledDate,
Clients.FirstName, Clients.LastName, Clients.Email, Clients.Address1,
Clients.City, Clients.State, Clients.CountryId, Clients.ClientType, Countries.Country,
BookingAccommodation.AccomId, BookingAccommodation.ShareType, BookingProgram.ProgramId,
Programs.ProgramDesc
FROM Bookings
JOIN Clients ON Clients.ClientId = Bookings.ClientId AND Bookings.WeekBeginning >= '2016-10-01' AND Clients.Email <> ''
JOIN BookingProgram ON Bookings.BookingId = BookingProgram.BookingId
JOIN BookingAccommodation ON Bookings.BookingId = BookingAccommodation.BookingId AND BookingAccommodation.Nights > 0
JOIN Countries ON Clients.CountryId = Countries.CountryId
JOIN Programs ON Programs.ProgramId = BookingProgram.ProgramId
WHERE Bookings.WeekBeginning >= '2016-10-01';
If this is not getting you the results you wanted, try EXPLAIN and see the query plan.
Please note: I didn't see the ClientType table being used anywhere, so I did not include it in the JOINs.
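A short sketch of why leaving an unused table out of the FROM list matters, using made-up tables and rows (the ClientTypeId and Description columns are hypothetical). SQLite (through Python's sqlite3 module) stands in for MySQL. A table listed in a comma-style FROM with no join condition is cross joined, so every booking row is repeated once per ClientType row before DISTINCT collapses the duplicates again. Whether this accounts for the slowdown in the question is not known. It only illustrates the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical cut-down tables
CREATE TABLE Bookings   (BookingId INTEGER PRIMARY KEY, ClientId INTEGER);
CREATE TABLE ClientType (ClientTypeId INTEGER PRIMARY KEY, Description TEXT);
INSERT INTO Bookings   VALUES (1, 10), (2, 11), (3, 12);
INSERT INTO ClientType VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
""")

# ClientType is listed but never joined: 3 bookings x 4 client types
rows_with_unjoined = conn.execute(
    "SELECT COUNT(*) FROM Bookings, ClientType"
).fetchone()[0]

# Same query without the unused table
rows_without = conn.execute("SELECT COUNT(*) FROM Bookings").fetchone()[0]

# DISTINCT hides the duplication in the output, but the rows were still produced
distinct_ids = conn.execute(
    "SELECT COUNT(*) FROM "
    "(SELECT DISTINCT Bookings.BookingId FROM Bookings, ClientType)"
).fetchone()[0]

print(rows_with_unjoined, rows_without, distinct_ids)
```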

Rather than spend more time trying to improve the select statement as it hits so many tables I opted to split it into the separate queries as I outlined in the original question.
In the end this was the quickest practical solution.

Convert SQL WHERE IN to JOIN

I have a database storing various information about fictional people. There is a table person with general information, such as name, address etc., and some more specific tables holding health history and education for everyone.
What I'm trying to do now, is getting possible connections for one person based on similarities like being at the same school for the same time or having the same doctor or being treated in the same hospital at the same time.
The following query works fine for this (:id being the id of the person in question), but it is horribly slow (it takes about 6 seconds to return a result).
SELECT person.p_id as id, fname, lname, image FROM person WHERE
(person.p_id IN (
SELECT patient from health_case WHERE
doctor IN (SELECT doctor FROM health_case WHERE patient =:id )
OR center IN (SELECT hc2.center FROM health_case as hc1, health_case as hc2 WHERE hc1.patient = :id AND hc2.center = hc1.center AND (hc1.start <= hc2.end AND hc1.end >= hc2.start)))
OR person.p_id IN (
SELECT ed2.pupil FROM education as ed1, education as ed2 WHERE
ed1.school IN (SELECT school FROM education WHERE pupil = :id) AND ed2.school = ed1.school AND (ed2.start <= ed1.end AND ed2.end >= ed1.start)
))
AND person.p_id != :id
What would be the best approach to convert it to use JOIN clauses? I somehow seem unable to wrap my head around these...

I think I understand what you're trying to do. There is more than one way to skin a cat, but may I suggest splitting your query into two separate queries, and then replacing the complicated WHERE clause with a couple inner joins? So, something like this:
/* Find connections based on health care */
SELECT p2.p_id as id, p2.fname, p2.lname, p2.image
FROM person p
JOIN health_case hc on hc.patient = p.p_id
JOIN health_case hc2 on (hc2.doctor = hc.doctor or (hc2.center = hc.center and hc.start <= hc2.end and hc.end >= hc2.start)) and hc2.patient <> hc.patient
JOIN person p2 on p2.p_id = hc2.patient and p2.p_id <> p.p_id
WHERE p.p_id = :id
Then, create a separate query to get connections based on education:
/* Find connections based on education */
SELECT p2.p_id as id, p2.fname, p2.lname, p2.image
FROM person p
JOIN education e on e.pupil = p.p_id
JOIN education e2 on e2.school = e.school and e2.start <= e.end AND e2.end >= e.start and e.pupil <> e2.pupil
JOIN person p2 on p2.p_id = e2.pupil and p2.p_id <> p.p_id
WHERE p.p_id = :id
If you really want the data results to be combined, you can use UNION since both queries return the same columns from the person table.

Depends on your SQL engine. Newer SQL systems that have reasonable query optimizers will most likely rewrite both IN and JOIN queries to the same plan. Typically, a sub-query (IN Clause) is rewritten using a join.
In simple SQL engines that may not have great query optimizers, the join should be faster because they may run sub-queries into a temporary in-memory table before running the outer query.
In some SQL engines that have limited memory footprint, however, the sub-query may be faster because it doesn't require joining -- which produces more data.

Optimizing one SQL query

I need to run query to get some data from the DB. The thing is the query which I use works but takes a very long time.
SELECT SHH1.CUST_NO,
SHH1.CUST_NAME,
ADDR.BVADDREMAIL
FROM SALES_HISTORY_HEADER SHH1
INNER JOIN ADDRESS ADDR ON (SHH1.CUST_NO=ADDR.CEV_NO)
INNER JOIN CUSTOMER CUST ON (SHH1.CUST_NO=CUST.CUS_NO)
WHERE CUST.HOLD = 0
AND SHH1.CUST_NO IN (SELECT SHH2.CUST_NO
FROM SALES_HISTORY_HEADER SHH2
GROUP BY SHH2.CUST_NO
HAVING Max(SHH2.IN_DATE) < '20120101')
GROUP BY SHH1.CUST_NO,
SHH1.CUST_NAME,
ADDR.BVADDREMAIL
I am not very good at this so was wondering if any of you guys could help me? Thanks.

I don't think you need the sub-select. The following should give you the same results:
SELECT SHH1.CUST_NO,SHH1.CUST_NAME,ADDR.BVADDREMAIL
FROM SALES_HISTORY_HEADER SHH1
INNER JOIN ADDRESS ADDR ON (SHH1.CUST_NO=ADDR.CEV_NO)
INNER JOIN CUSTOMER CUST ON (SHH1.CUST_NO=CUST.CUS_NO)
WHERE CUST.HOLD = 0
GROUP BY SHH1.CUST_NO,SHH1.CUST_NAME,ADDR.BVADDREMAIL
HAVING Max(SHH1.IN_DATE) < '20120101'

Assuming CUST_NAME is on CUSTOMER, try:
SELECT CUST.CUS_NO CUST_NO,
CUST.CUST_NAME,
ADDR.BVADDREMAIL
FROM CUSTOMER CUST
JOIN ADDRESS ADDR ON CUST.CUS_NO=ADDR.CEV_NO
LEFT JOIN SALES_HISTORY_HEADER SHH
ON CUST.CUS_NO=SHH.CUST_NO AND SHH.IN_DATE >= '20120101'
WHERE CUST.HOLD = 0 AND SHH.CUST_NO IS NULL
GROUP BY CUST.CUS_NO, ADDR.BVADDREMAIL
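A small sketch comparing this LEFT JOIN ... IS NULL form with the HAVING MAX(...) form, using made-up rows and assuming IN_DATE is stored as 'YYYYMMDD' text. SQLite (through Python's sqlite3 module) stands in for the real engine. The two forms agree on customers whose last sale is before 2012, but the anti-join also returns customers with no sales history at all, which may or may not be wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical cut-down tables and data
CREATE TABLE CUSTOMER (CUS_NO INTEGER PRIMARY KEY, CUST_NAME TEXT, HOLD INTEGER);
CREATE TABLE ADDRESS  (CEV_NO INTEGER, BVADDREMAIL TEXT);
CREATE TABLE SALES_HISTORY_HEADER (CUST_NO INTEGER, CUST_NAME TEXT, IN_DATE TEXT);
INSERT INTO CUSTOMER VALUES
  (1, 'Old buyer', 0), (2, 'Recent buyer', 0),
  (3, 'Never bought', 0), (4, 'On hold', 1);
INSERT INTO ADDRESS VALUES
  (1, 'a@example.com'), (2, 'b@example.com'),
  (3, 'c@example.com'), (4, 'd@example.com');
INSERT INTO SALES_HISTORY_HEADER VALUES
  (1, 'Old buyer',    '20110601'),
  (2, 'Recent buyer', '20110301'),
  (2, 'Recent buyer', '20120501'),
  (4, 'On hold',      '20100101');
""")

# HAVING form: only customers that have sales rows can appear
having_sql = """
SELECT SHH1.CUST_NO
FROM SALES_HISTORY_HEADER SHH1
INNER JOIN ADDRESS  ADDR ON SHH1.CUST_NO = ADDR.CEV_NO
INNER JOIN CUSTOMER CUST ON SHH1.CUST_NO = CUST.CUS_NO
WHERE CUST.HOLD = 0
GROUP BY SHH1.CUST_NO, SHH1.CUST_NAME, ADDR.BVADDREMAIL
HAVING MAX(SHH1.IN_DATE) < '20120101'
"""

# Anti-join form: customers with no sale on or after the cutoff
anti_join_sql = """
SELECT CUST.CUS_NO
FROM CUSTOMER CUST
JOIN ADDRESS ADDR ON CUST.CUS_NO = ADDR.CEV_NO
LEFT JOIN SALES_HISTORY_HEADER SHH
       ON CUST.CUS_NO = SHH.CUST_NO AND SHH.IN_DATE >= '20120101'
WHERE CUST.HOLD = 0 AND SHH.CUST_NO IS NULL
"""

having_ids = {row[0] for row in conn.execute(having_sql)}
anti_join_ids = {row[0] for row in conn.execute(anti_join_sql)}
print(sorted(having_ids), sorted(anti_join_ids))
```

If customers without any sales should be excluded, an extra existence check on SALES_HISTORY_HEADER would be needed in the anti-join form.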

Use the execution plan to see which part of the query accounts for most of the execution cost. That will likely indicate where the problem is. The next step I would take is to review the indexes. Recall that a clustered index is created by default on primary keys, but it is also good practice to create non-clustered indexes on the foreign keys and on any other field of particular interest. Hope it helps.

I agree with Mark Bannister's answer that the OP should select FROM CUSTOMER instead of FROM SALES_HISTORY_HEADER.
So, assuming CUST_NAME is on CUSTOMER, I would do the following:
SELECT CUST.CUS_NO,
CUST.CUST_NAME,
ADDR.BVADDREMAIL
FROM CUSTOMER CUST
JOIN ADDRESS ADDR ON CUST.CUS_NO = ADDR.CEV_NO
WHERE CUST.HOLD = 0
AND CUST.CUS_NO IN (SELECT SHH.CUST_NO
FROM SALES_HISTORY_HEADER SHH
GROUP BY SHH.CUST_NO
HAVING Max(SHH.IN_DATE) < '20120101')
Note that subquery materialization must be active for MySQL to optimize the sub-select.

What about removing some rows before joining?
SELECT S.CUST_NO
,S.CUST_NAME
,A.BVADDREMAIL
FROM CUSTOMER C
INNER JOIN SALES_HISTORY_HEADER S ON (C.HOLD = 0 AND S.CUST_NO=C.CUS_NO)
INNER JOIN ADDRESS A ON (S.CUST_NO=A.CEV_NO)
GROUP BY S.CUST_NO,S.CUST_NAME,A.BVADDREMAIL
HAVING Max(S.IN_DATE) < '20120101'
My update on StevieG's answer
