Alternative to NOT IN in MySQL - mysql

I have a query
SELECT DISTINCT phoneNum
FROM `Transaction_Register`
WHERE phoneNum NOT IN (SELECT phoneNum FROM `Subscription`)
LIMIT 0 , 1000000
It takes too long to execute because the Transaction_Register table has millions of records.
Is there any alternative to the above query? I would be grateful for any suggestions.

An alternative would be to use a LEFT JOIN:
select distinct t.phoneNum
from Transaction_Register t
left join Subscription s
on t.phoneNum = s.phoneNum
where s.phoneNum is null
LIMIT 0 , 1000000;
See SQL Fiddle with Demo
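As a rough check that the rewrite returns the same rows, here is a minimal sketch using Python's bundled sqlite3 module as a stand-in for MySQL. The sample phone numbers, the index name idx_sub_phone and the NOT NULL constraints are assumptions made for illustration. The three forms only agree when Subscription.phoneNum contains no NULLs: a NULL in the subquery makes NOT IN return no rows at all.

```python
import sqlite3


def missing_phone_numbers():
    """Run three anti-join formulations on toy data and return their results."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Transaction_Register (phoneNum TEXT NOT NULL);
        CREATE TABLE Subscription (phoneNum TEXT NOT NULL);
        -- Hypothetical index; the join and the subqueries all probe this column.
        CREATE INDEX idx_sub_phone ON Subscription (phoneNum);
        INSERT INTO Transaction_Register VALUES ('111'), ('222'), ('222'), ('333');
        INSERT INTO Subscription VALUES ('222');
    """)
    queries = {
        "not_in": """
            SELECT DISTINCT phoneNum FROM Transaction_Register
            WHERE phoneNum NOT IN (SELECT phoneNum FROM Subscription)
            ORDER BY phoneNum""",
        "left_join": """
            SELECT DISTINCT t.phoneNum
            FROM Transaction_Register t
            LEFT JOIN Subscription s ON t.phoneNum = s.phoneNum
            WHERE s.phoneNum IS NULL
            ORDER BY t.phoneNum""",
        "not_exists": """
            SELECT DISTINCT t.phoneNum
            FROM Transaction_Register t
            WHERE NOT EXISTS (SELECT 1 FROM Subscription s
                              WHERE s.phoneNum = t.phoneNum)
            ORDER BY t.phoneNum""",
    }
    results = {name: [row[0] for row in conn.execute(sql)]
               for name, sql in queries.items()}
    conn.close()
    return results


results = missing_phone_numbers()
print(results)
```

Which form is fastest depends on the MySQL version, the indexes and the data, so it is worth running EXPLAIN on each against the real tables.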

I doubt that LEFT JOIN truly performs better than NOT IN. I ran a few tests with the following table structure (please correct me if I am wrong):
account (id, ....) [42,884 rows, index by id]
play (account_id, playdate, ...) [61,737 rows, index by account_id]
(1) Query with LEFT JOIN
SELECT * FROM
account LEFT JOIN play ON account.id = play.account_id
WHERE play.account_id IS NULL
(2) Query with NOT IN
SELECT * FROM
account WHERE
account.id NOT IN (SELECT play.account_id FROM play)
Speed test with LIMIT 0,N
LIMIT 0,N ->   100      150      200      250
-------------------------------------------------
LEFT JOIN      3.213s   4.477s   5.881s   7.472s
NOT IN         2.200s   3.261s   4.320s   5.647s
-------------------------------------------------
Difference     1.013s   1.216s   1.561s   1.825s
As I increase the limit, the difference gets larger and larger.
With EXPLAIN
(1) Query with LEFT JOIN
SELECT_TYPE         TABLE    TYPE   ROWS    EXTRA
--------------------------------------------------------------------
SIMPLE              account  ALL    42,884
SIMPLE              play     ALL    61,737  Using where; not exists
(2) Query with NOT IN
SELECT_TYPE         TABLE    TYPE   ROWS    EXTRA
--------------------------------------------------------------------
SIMPLE              account  ALL    42,884  Using where
DEPENDENT SUBQUERY  play     INDEX  61,737  Using where; Using index
It seems that the LEFT JOIN does not make use of the index.
LOGIC
(1) Query with LEFT JOIN
Without an index, the LEFT JOIN between account and play compares up to 42,884 * 61,737
= 2,647,529,508 row pairs, then checks whether play.account_id is NULL.
(2) Query with NOT IN
A binary search takes about log2(N) steps to check whether an item exists. That means 42,884 * log2(61,737) ≈ 42,884 * 16 = 686,144 steps.
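The arithmetic behind this estimate can be reproduced in a few lines. This is only a sketch of the counting argument above, not a model of what MySQL actually executes; rounding log2(61,737) up to 16 is the assumption that yields the 686,144 figure.

```python
import math

account_rows = 42_884
play_rows = 61_737

# Worst case for a join with no usable index: every account row
# is compared against every play row.
pair_comparisons = account_rows * play_rows

# One index lookup per account row, each costing about log2(N) steps.
# log2(61,737) is about 15.9, rounded up to 16 here.
steps_per_lookup = math.ceil(math.log2(play_rows))
index_steps = account_rows * steps_per_lookup

print(pair_comparisons, steps_per_lookup, index_steps)
```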

Related

Cross-Apply bad for a larger database or alternatives perform better?

I have two (or really three) questions: is my query just badly coded or badly thought out? (Be kind, I only just discovered CROSS APPLY and I'm relatively new.) And is CROSS APPLY even the best sort of join to be using, or why is it slow?
So I have a database table (test_tble) of around 66 million records. I also have a ##Temp_tble with a single column called Ordr_nbr (nchar(13)). These are the orders I wish to find.
The test_tble has 4 columns (Ordr_nbr, destination, shelf_no, dte_bought).
This is my current query, which works exactly the way I want it to, but its performance is quite slow.
select ##MyTempTable.Order_nbr, test_table1.destination, test_table1.shelf_no, test_table1.dte_bought
from ##MyTempTable
cross apply (
    select top 1 Test_Table.destination, Test_Table.shelf_no, Test_Table.dte_bought
    from Test_Table
    where ##MyTempTable.Order_nbr = Test_Table.order_nbr
    order by dte_bought desc
) test_table1
If ##Temp_tble has only 17 orders to search for, it takes around 2 minutes. As you can see, I'm trying to get just the most recent dte_bought, i.e. the max(dte_bought), for each order.
In terms of indexes, I ran the database engine tuner and it says the database is optimized for the query. I have all the relevant indexes created, such as a clustered index on test_tble on dte_bought desc that includes order_nbr, etc.
The execution plan uses an index scan (on the non-clustered index) and a key lookup (on the clustered index).
The end result should return all the order_nbrs in ##MyTempTble along with the destination, shelf_no and dte_bought columns for each order_nbr, but only the most recently bought ones.
Sorry if I explained this awfully; if you need any info that I can provide, just ask. I'm not asking for a downright "give me code", more for guidance, advice and learning. Thank you in advance.
UPDATE
I have now tried a sort of left join. It is reasonably quicker but still not instant or very fast (about 30 seconds), and it also doesn't return just the most recent dte_bought. Any ideas? See below for the left join code.
select a.Order_Nbr,b.Destination,b.LnePos,b.Dte_bought
from ##MyTempTble a
left join Test_Table b
on a.Order_Nbr = b.Order_Nbr
where b.Destination is not null
UPDATE 2
I attempted another left join with a max dte_bought. It runs very quickly but only returns the order_nbr; the other columns are NULL. Any suggestions?
select a.Order_nbr,b.Destination,b.Shelf_no,b.Dte_Bought
from ##MyTempTable a
left join
(select * from Test_Table where Dte_bought = (
select max(dte_bought) from Test_Table)
)b on b.Order_nbr = a.Order_nbr
order by Dte_bought asc
Instead of CROSS APPLY you can use an INNER JOIN with a subquery that numbers each order's rows with ROW_NUMBER(), newest first, and keep only row 1. The subquery has to select Order_nbr as well, because the join condition uses it. Check the following query:
SELECT
TempT.Order_nbr
,TestT.destination
,TestT.shelf_no
,TestT.dte_bought
FROM ##MyTempTable TempT
INNER JOIN (
SELECT T.Order_nbr
,T.destination
,T.shelf_no
,T.dte_bought
,ROW_NUMBER() OVER(PARTITION BY T.Order_nbr ORDER BY T.dte_bought DESC) ID
FROM Test_Table T
) TestT
ON TestT.ID = 1 AND TempT.Order_nbr = TestT.Order_nbr
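To see the ROW_NUMBER() pattern on a small data set, here is a minimal sketch using Python's sqlite3 module in place of SQL Server. It assumes SQLite 3.25 or newer for window-function support; the table MyTempTable (without the ## prefix) and all sample rows are made up for illustration.

```python
import sqlite3


def latest_purchase_per_order():
    """Return the most recent Test_Table row for each order in MyTempTable."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Test_Table (Order_nbr TEXT, destination TEXT,
                                 shelf_no INTEGER, dte_bought TEXT);
        CREATE TABLE MyTempTable (Order_nbr TEXT);
        INSERT INTO Test_Table VALUES
            ('A1', 'Leeds', 1, '2017-01-05'),
            ('A1', 'York',  2, '2017-03-09'),
            ('B2', 'Hull',  7, '2016-12-01'),
            ('C3', 'Bath',  4, '2017-02-02');
        INSERT INTO MyTempTable VALUES ('A1'), ('B2');
    """)
    rows = conn.execute("""
        SELECT TempT.Order_nbr, TestT.destination, TestT.shelf_no, TestT.dte_bought
        FROM MyTempTable AS TempT
        INNER JOIN (
            -- Number each order's rows, newest purchase first.
            SELECT T.Order_nbr, T.destination, T.shelf_no, T.dte_bought,
                   ROW_NUMBER() OVER (PARTITION BY T.Order_nbr
                                      ORDER BY T.dte_bought DESC) AS rn
            FROM Test_Table AS T
        ) AS TestT
            ON TestT.rn = 1 AND TestT.Order_nbr = TempT.Order_nbr
        ORDER BY TempT.Order_nbr
    """).fetchall()
    conn.close()
    return rows


latest = latest_purchase_per_order()
print(latest)
```

On the 66-million-row table the subquery still numbers every row, so an index on (Order_nbr, dte_bought) is probably needed for either this form or the CROSS APPLY form to be fast; that is an untested assumption here.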

Query with multiple table joins taking too much time despite indexing

Query-
SELECT SUM(sale_data.total_sale) as totalsale, `sale_data_temp`.`customer_type_cy` as `customer_type`, `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN `sale_data` ON `sale_data`.`depo_code` = `distributor_list`.`depo_code` and `sale_data`.`customer_code` = `distributor_list`.`customer_code` LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sale_data`.`item_code` = `item_master`.`item_code` WHERE `invoice_date` BETWEEN "2017-04-01" and "2017-11-01" AND `item_master`.`id_category` = 1 GROUP BY `distributor_list`.`address_coordinates`
Query, rewritten with formatting.
SELECT SUM(sale_data.total_sale) as totalsale,
sale_data_temp.customer_type_cy as customer_type,
distributor_list.customer_status
FROM distributor_list
LEFT JOIN sale_data
ON sale_data.depo_code = distributor_list.depo_code
and sale_data.customer_code = distributor_list.customer_code
LEFT JOIN sale_data_temp
ON distributor_list.address_coordinates = sale_data_temp.address_coordinates
LEFT JOIN item_master
ON sale_data.item_code = item_master.item_code
WHERE invoice_date BETWEEN "2017-04-01" and "2017-11-01"
AND item_master.id_category = 1
GROUP BY distributor_list.address_coordinates
DESC-
This query takes 7.5 seconds to run. My application contains 3-4 such queries, so the loading time approaches 1 minute on the server.
My sale data table contains 450K records.
Distributor list contains 970 records
Item master contains 7774 records and sale_data_temp contains 324 records.
I am using indexes, but none of them is being used for the sale data table.
All 400K records are scanned, as is evident from the EXPLAIN output.
If I narrow the date range in the BETWEEN clause, then the sale data table uses the date index; otherwise it scans all 400K rows.
Only 84,000 rows fall between 01-04-2017 and 01-11-2017, but it still scans 400K rows.
MYSQL EXPLAIN-
I have modified queries two times with no success.
Modification 1:
SELECT SUM(sale_data.total_sale) as totalsale, `sale_data_temp`.`customer_type_cy` as `customer_type`, `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN `sale_data` ON `sale_data`.`depo_code` = `distributor_list`.`depo_code` and `sale_data`.`customer_code` = `distributor_list`.`customer_code` AND `invoice_date` BETWEEN "2017-04-01" and "2017-11-01" LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sale_data`.`item_code` = `item_master`.`item_code` WHERE `item_master`.`id_category` = 1 GROUP BY `distributor_list`.`address_coordinates`
Modification 2
SELECT SQL_NO_CACHE SUM( sd.total_sale ) AS totalsale, `sale_data_temp`.`customer_type_cy` AS `customer_type` , `distributor_list`.`customer_status` FROM `distributor_list` LEFT JOIN (SELECT * FROM `sale_data` WHERE `invoice_date` BETWEEN "2017-04-01" AND "2017-11-01")sd ON `sd`.`depo_code` = `distributor_list`.`depo_code` AND `sd`.`customer_code` = `distributor_list`.`customer_code` LEFT JOIN `sale_data_temp` ON `distributor_list`.`address_coordinates` = `sale_data_temp`.`address_coordinates` LEFT JOIN `item_master` ON `sd`.`item_code` = `item_master`.`item_code` WHERE `item_master`.`id_category` =1 GROUP BY `distributor_list`.`address_coordinates`
HERE ARE MY INDEXES ON SALE DATA TABLE
See the key column of the EXPLAIN results: no key is being used at the moment, so MySQL is not using any of your indexes to filter out rows and is scanning the whole table on each query. This is why it is taking so long.
I have taken a look at your first query with relation to your sale_data indices. It looks like you will need to create a new composite index on this table that contains the following columns only:
depo_code, customer_code, item_code, invoice_date, total_sale
I recommend that you name this index test1 and experiment with modifying the ordering of the columns and keep testing again each time using EXPLAIN EXTENDED until you achieve a selected key - you want to see index test1 has been selected in the key column.
See this answer that has helped me before with this, and it will help you understand the importance of correctly ordering your composite indices.
Looking at the cardinality of the single field indices, here is my best attempt at giving you the correct index to apply:
ALTER TABLE `sale_data` ADD INDEX `test1` (`item_code`, `customer_code`, `invoice_date`, `depo_code`, `total_sale`);
Good luck with your mission!
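The experiment described above (change the column order, then check the plan again) can be sketched as a small loop. This uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 module as a stand-in for MySQL's EXPLAIN, and a simplified single-table query with only the date filter; both are assumptions, and MySQL's optimizer may make different choices on the real query.

```python
import sqlite3

# Simplified query: only the date-range filter from the original query.
QUERY = """
    SELECT SUM(total_sale) FROM sale_data
    WHERE invoice_date BETWEEN '2017-04-01' AND '2017-11-01'
"""


def plan_for_column_order(columns):
    """Create index test1 with the given column order and return the plan text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE sale_data (depo_code TEXT, customer_code TEXT,
                                item_code TEXT, invoice_date TEXT, total_sale REAL)
    """)
    conn.execute("CREATE INDEX test1 ON sale_data (" + ", ".join(columns) + ")")
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    details = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]
    conn.close()
    return details


item_first = plan_for_column_order(
    ["item_code", "customer_code", "invoice_date", "depo_code", "total_sale"])
date_first = plan_for_column_order(
    ["invoice_date", "item_code", "depo_code", "customer_code", "total_sale"])

print("item_code first:   ", item_first)
print("invoice_date first:", date_first)
```

In SQLite a line starting with SEARCH means the index is used to seek to a range, while SCAN means every entry is read. For this date-only query, only the order that leads with invoice_date allows a seek.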
A few things to notice about your query.
You are misusing the notorious MySQL extension to GROUP BY. Read this, then mention the same columns in your GROUP BY clause as you mention in your SELECT clause.
Your LEFT JOIN sale_data and LEFT JOIN item_master operations are actually ordinary JOIN operations. Why? You mention columns from those tables in your WHERE clause.
Your best bet for speedup is doing a date-range scan on an index on sale_data.invoice_date. For some reason known only to the MySQL query planner's feverish machinations, you're not getting it.
Try refactoring your query. Here's one suggestion:
SELECT SUM(sale_data.total_sale) as totalsale,
sale_data_temp.customer_type_cy as customer_type,
distributor_list.customer_status
FROM distributor_list
JOIN sale_data
ON sale_data.invoice_date BETWEEN "2017-04-01" and "2017-11-01"
and sale_data.depo_code = distributor_list.depo_code
and sale_data.customer_code = distributor_list.customer_code
LEFT JOIN sale_data_temp
ON distributor_list.address_coordinates = sale_data_temp.address_coordinates
JOIN item_master
ON sale_data.item_code = item_master.item_code
WHERE item_master.id_category = 1
GROUP BY sale_data_temp.customer_type_cy, distributor_list.customer_status
Try creating a covering index on sale_data for this query. You'll have to mess around a bit to get this right, but this is a starting point. (invoice_date, item_code, depo_code, customer_code, total_sale). The point of a covering index is to allow the query to be satisfied entirely from the index without having to refer back to the table's data. That's why I included total_sale in the index.
Please notice that index I suggested makes your index on invoice_date redundant. You can drop that index.
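The earlier point, that a LEFT JOIN behaves like an ordinary JOIN once the WHERE clause filters on a column of the joined table, can be checked on toy data. This is a minimal sketch with Python's sqlite3 module; the two cut-down tables and their rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE distributor_list (customer_code TEXT, customer_status TEXT);
    CREATE TABLE sale_data (customer_code TEXT, invoice_date TEXT, total_sale REAL);
    INSERT INTO distributor_list VALUES
        ('C1', 'active'), ('C2', 'active'), ('C3', 'inactive');
    INSERT INTO sale_data VALUES
        ('C1', '2017-05-01', 100.0), ('C2', '2016-01-15', 50.0);
""")


def totals(join_kind, date_filter_in):
    """Sum sales per distributor with the date test placed in WHERE or in ON."""
    date_test = "s.invoice_date BETWEEN '2017-04-01' AND '2017-11-01'"
    on_extra = f" AND {date_test}" if date_filter_in == "ON" else ""
    where = f"WHERE {date_test}" if date_filter_in == "WHERE" else ""
    sql = f"""
        SELECT d.customer_code, SUM(s.total_sale)
        FROM distributor_list d
        {join_kind} sale_data s ON s.customer_code = d.customer_code{on_extra}
        {where}
        GROUP BY d.customer_code
        ORDER BY d.customer_code
    """
    return conn.execute(sql).fetchall()


left_with_where = totals("LEFT JOIN", "WHERE")   # NULL-extended rows fail the WHERE test
inner_with_where = totals("JOIN", "WHERE")       # same result as the line above
left_with_on = totals("LEFT JOIN", "ON")         # distributors without sales survive

print(left_with_where, inner_with_where, left_with_on)
```

Moving the date test into the ON clause keeps distributors without matching sales (with a NULL total), which is a different result set, so the choice depends on which rows the report should contain.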

Optimizing MySQL query with subselect

I am trying to make the following query run faster than 180 secs:
SELECT
x.di_on_g AS deviceid, SUM(1) AS amount
FROM
(SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN ("many (~1500) comma separated IDs coming from my code"))
GROUP BY g.device_id , g.guide_type_id) x
GROUP BY x.di_on_g
ORDER BY amount;
Screenshot from EXPLAIN:
https://ibb.co/da5oAF
Even if I run the subquery as separate query it is still very slow...:
SELECT
g.device_id AS di_on_g
FROM
guide g
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
WHERE
g.operator_id IN (1 , 1)
AND g.locale_id = 1
AND (g.device_id IN ("many (~1500) comma separated IDs coming from my code"))
Screenshot from EXPLAIN:
ibb.co/gJHRVF
I have indexes on g.device_id and on other appropriate places.
Indexes:
SHOW INDEX FROM guide;
ibb.co/eVgmVF
SHOW INDEX FROM operator_guide_type;
ibb.co/f0TTcv
SHOW INDEX FROM operator_device;
ibb.co/mseqqF
All IDs are integers. I tried creating a new temp table for the IDs that come from my code and JOINing that table in place of the slow IN clause, but that didn't make the query much faster (about 10 secs faster).
None of the tables has more than 300,000 rows and the MySQL configuration is good.
And the visual plan:
Query Plan
Any help will be appreciated !
Let's focus on the subquery. The main problem is "inflate-deflate", but I will get to that in a moment.
Add the composite index:
INDEX(locale_id, operator_id, device_id)
Why the duplicated "1" in
g.operator_id IN (1 , 1)
Why does the GROUP BY have 2 columns, when you select only 1? Is there some reason for using GROUP BY instead of DISTINCT? (The latter seems to be your intent.)
The only reason for these
INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
INNER JOIN operator_device od ON od.device_id = g.device_id
would be to verify that there are guides and devices in those other tables. Is that correct? Are these the PRIMARY KEYs, hence unique: ogt.guide_type_id and od.device_id? If so, why do you need the GROUP BY? Based on the EXPLAIN, it sounds like both of those are related 1:many. So...
SELECT g.device_id AS di_on_g
FROM guide g
WHERE EXISTS( SELECT * FROM operator_guide_type WHERE guide_type_id = g.guide_type_id )
AND EXISTS( SELECT * FROM operator_device WHERE device_id = g.device_id )
AND g.operator_id IN (1)
AND g.locale_id = 1
AND g.device_id IN (...)
Notes:
The GROUP BY is no longer needed.
The "inflate-deflate" of JOIN + GROUP BY is gone. The Explain points this out -- 139K rows inflated to 61M -- very costly.
EXISTS is a "semijoin", meaning that it does not collect all matches, but stops when it finds any match.
"the mysql configuration is good" -- How much RAM do you have? What Engine is the table? What is the value of innodb_buffer_pool_size?
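The "inflate-deflate" effect can be seen on a handful of rows. This is a minimal sketch with Python's sqlite3 module; the sample rows are invented, and the final per-device count assumes each (device_id, guide_type_id) pair appears only once in guide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE guide (device_id INTEGER, guide_type_id INTEGER,
                        operator_id INTEGER, locale_id INTEGER);
    CREATE TABLE operator_guide_type (operator_id INTEGER, guide_type_id INTEGER);
    CREATE TABLE operator_device (operator_id INTEGER, device_id INTEGER);
    INSERT INTO guide VALUES (1, 10, 1, 1), (1, 11, 1, 1), (2, 10, 1, 1), (3, 12, 1, 1);
    -- Both lookup tables are 1:many on the joined column.
    INSERT INTO operator_guide_type VALUES (1, 10), (2, 10), (3, 10), (1, 11);
    INSERT INTO operator_device VALUES (1, 1), (2, 1), (1, 2), (1, 3);
""")

FILTER = "g.operator_id IN (1) AND g.locale_id = 1 AND g.device_id IN (1, 2, 3)"
JOINS = """
    INNER JOIN operator_guide_type ogt ON ogt.guide_type_id = g.guide_type_id
    INNER JOIN operator_device od ON od.device_id = g.device_id
"""
EXISTS = """
    EXISTS (SELECT 1 FROM operator_guide_type WHERE guide_type_id = g.guide_type_id)
    AND EXISTS (SELECT 1 FROM operator_device WHERE device_id = g.device_id)
"""

# Rows the JOIN version produces before GROUP BY collapses them again.
joined_rows = conn.execute(
    f"SELECT COUNT(*) FROM guide g {JOINS} WHERE {FILTER}").fetchone()[0]

# Rows the EXISTS version produces: at most one per guide row.
exists_rows = conn.execute(
    f"SELECT COUNT(*) FROM guide g WHERE {EXISTS} AND {FILTER}").fetchone()[0]

amounts_join = conn.execute(f"""
    SELECT x.device_id, COUNT(*) FROM (
        SELECT g.device_id FROM guide g {JOINS} WHERE {FILTER}
        GROUP BY g.device_id, g.guide_type_id) x
    GROUP BY x.device_id ORDER BY x.device_id""").fetchall()

amounts_exists = conn.execute(f"""
    SELECT g.device_id, COUNT(*) FROM guide g
    WHERE {EXISTS} AND {FILTER}
    GROUP BY g.device_id ORDER BY g.device_id""").fetchall()

print(joined_rows, exists_rows, amounts_join, amounts_exists)
```

Here four guide rows inflate to 11 joined rows, while the EXISTS form never produces more rows than guide contains.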

Unexpected large execution time with two identical sql queries

I would be grateful if someone could explain this case to me: http://pastebin.com/YBQTwxYG. Both queries are almost identical, but the second executes in milliseconds while the first takes almost 3 minutes.
As EXPLAIN shows, the second query uses the correct indexes, but the first doesn't.
I am very confused.
First some formatting:
SELECT c.`id`, c.`content_id`, c.`author_id`, c.`text`, c.`typotext`,
c.`created`,
p.`id`, p.`html_title`, p.`permalink`,
u.`id`, u.`username`, u.`first_name`, u.`last_name`, u.`avatar`
FROM `content_comment` AS c
INNER JOIN `content_post` AS p ON (c.`content_id` = p.`id`)
INNER JOIN `post_article` AS a ON (p.`id` = a.`content_id`)
LEFT OUTER JOIN `auth_user` AS u ON (c.`author_id` = u.`id`)
WHERE c.`deleted` IS NULL
AND a.`id` IS NOT NULL
ORDER BY c.`created` DESC
LIMIT 15
If id is the PRIMARY KEY of post_article, then it cannot be NULL. Hence, the test is unnecessary.
If you meant to test auth_user.id, then LEFT is unnecessary.
Then check that you have these composite indexes:
c: INDEX(deleted, created) -- in that order
a: INDEX(content_id, id) -- in that order
The second query is different in that it accesses forum_topic instead of auth_user. A look at SHOW CREATE TABLE for those two tables may explain why you got different explain plans. If not, check SHOW TABLE STATUS.
If you need further discussion, provide those SHOWs for all 5 tables.
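Why the column order in INDEX(deleted, created) matters can be illustrated with a query plan: an equality test on the first column followed by the ORDER BY column lets the engine read rows already sorted and stop after 15. The sketch below uses SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module as a stand-in for MySQL, on a cut-down content_comment table; the index name idx_deleted_created is hypothetical.

```python
import sqlite3

QUERY = """
    SELECT id FROM content_comment
    WHERE deleted IS NULL
    ORDER BY created DESC
    LIMIT 15
"""


def plan(conn):
    """Return the readable detail column of EXPLAIN QUERY PLAN for QUERY."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE content_comment (id INTEGER PRIMARY KEY, deleted TEXT, created TEXT)
""")

plan_before = plan(conn)   # full scan plus a separate sort step
conn.execute("CREATE INDEX idx_deleted_created ON content_comment (deleted, created)")
plan_after = plan(conn)    # index supplies both the filter and the ordering
conn.close()

print(plan_before)
print(plan_after)
```

In MySQL the comparable sign is "Using filesort" disappearing from the Extra column of EXPLAIN, though the joins in the full query can change what the optimizer picks.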

Long query times for simple MySQL SELECT with JOIN

SELECT COUNT(*)
FROM song AS s
JOIN user AS u
ON(u.user_id = s.user_id)
WHERE s.is_active = 1 AND s.public = 1
s.is_active and s.public are indexed, as are u.user_id and s.user_id.
song table row count 310k
user table row count 22k
Is there a way to optimize this? We're getting 1 second query times on this.
Ensure that you have a compound "covering" index on song: (user_id, is_active, public). Here, we've named the index covering_index:
SELECT COUNT(s.user_id)
FROM song s FORCE INDEX (covering_index)
JOIN user u
ON u.user_id = s.user_id
WHERE s.is_active = 1 AND s.public = 1
Here, we're ensuring that the JOIN is done with the covering index instead of the primary key, so that the covering index can be used for the WHERE clause as well.
I also changed COUNT(*) to COUNT(s.user_id). Though MySQL should be smart enough to pick the column from the index, I explicitly named the column just in case.
Ensure that you have enough memory configured on the server so that all of your indexes can stay in memory.
If you're still having issues, please post the results of EXPLAIN.
Perhaps write it as a stored procedure or view. You could also try selecting all the IDs first and then running the count on the result; doing it all as one query may be faster. Generally, optimisation is done by using nested selects or by making the server do the work, so in this context that is all I can think of.
SELECT COUNT(*) FROM
(SELECT t.user_id FROM
(SELECT * FROM song WHERE song.is_active = 1 AND song.public = 1) AS t
JOIN user AS u
ON (t.user_id = u.user_id)) AS x
Also be sure you are using the correct kind of join.