MySQL Dynamic Optimization with variable AND OR NOT operators - mysql

UPDATE - FINAL SOLUTION TO THIS ISSUE
Our dynamic system allows for a BOOLEAN interpolated match of things like Name, Job Title, Phone Number, etc. So we can say:
Name("ted" OR "mike" OR "david" AND "martin") AND Title("developer" AND "senior" NOT "CTO") AND City("san diego")
The way this is accomplished is to follow the below grouping example, which is dynamically created. It's pretty straightforward, however the use of HAVING COUNT is necessary to properly define the AND indexes.
Also note: in this example, access_indexes is a list of ID indexes an account has access to, so if the "search" returns a person the account can't access, it won't show up.
Thanks to everyone for your help, especially @BillKarwin!
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('(ted)' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT ram.object_ref_id AS ID
FROM ras_assignment_main ram
WHERE ram.object_type_c = 1
AND ram.assignment_type_c = 1
AND ram.assignment_ref_id IN (2)
),
persongroup0_and AS
(
SELECT pg0_a.ID FROM
(
SELECT ID FROM filter0
) pg0_a
GROUP BY pg0_a.ID
HAVING COUNT(pg0_a.ID) = 1
),
persongroup0 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup0_and pg0_and ON pm.ID = pg0_and.ID
),
persongroup1_and AS
(
SELECT pg1_a.ID FROM
(
SELECT ID FROM filter1
) pg1_a
GROUP BY pg1_a.ID
HAVING COUNT(pg1_a.ID) = 1
),
persongroup1 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup1_and pg1_and ON pm.ID = pg1_and.ID
),
person_all_and AS
(
SELECT paa.ID FROM
(
SELECT ID FROM persongroup0
UNION ALL (SELECT ID FROM persongroup1)
) paa
GROUP BY paa.ID
HAVING COUNT(paa.ID) = 2
),
person_all AS
(
SELECT pm.ID
FROM person_main pm
JOIN person_all_and pa_and ON pm.ID = pa_and.ID
),
person_access AS
(
SELECT pa.ID
FROM person_all pa
LEFT JOIN access_indexes ai ON pa.ID = ai.ID
)
SELECT (JSON_ARRAYAGG(pm.ID))
FROM
(
SELECT person_sort.ID
FROM
(
SELECT pa.ID
FROM person_access pa
GROUP BY pa.ID
) person_sort
) pm;
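The HAVING COUNT step used above can be tried in isolation. The following is only a minimal sketch: it uses Python's sqlite3 module with an in-memory SQLite database as a stand-in for MySQL, and the contents of filter0 and filter1 are made-up sample IDs, not real data from this system.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE filter0 (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter1 (ID INTEGER PRIMARY KEY);
    INSERT INTO filter0 VALUES (1), (2), (3);
    INSERT INTO filter1 VALUES (2), (3), (4);
""")

# Each filter contributes an ID at most once, so after UNION ALL an ID
# that is present in both filters shows up in exactly 2 rows.
and_ids = [row[0] for row in conn.execute("""
    SELECT u.ID
    FROM (
        SELECT ID FROM filter0
        UNION ALL
        SELECT ID FROM filter1
    ) AS u
    GROUP BY u.ID
    HAVING COUNT(*) = 2
    ORDER BY u.ID
""")]
print(and_ids)
```

Note that the count only means "matched every filter" if each filter returns a given ID at most once. A filter that can return the same person several times (for example one row per job in a work-history table) would probably need SELECT DISTINCT inside it.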
Our front-end system has the ability to define dynamic SQL queries using AND/OR/NOT from multiple tables, and the core system works fine, but it slows down to the point of being unusable due to the compounded scanning of IN. For the life of me, I can't figure out how to have this level of dynamic functionality without using IN. Below is the code that works perfectly fine (the filter matching is ultra fast), but the compounding of the IN scan takes > 60 seconds because it's 50,000+ records for some of the filter returns.
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('mike meyers' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT phw.person_main_ref_id AS ID
FROM person_history_work phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
),
filter2 AS
(
SELECT pa.person_main_ref_id AS ID
FROM person_address pa
WHERE pa.global_address_details_ref_id IN
(
SELECT gad.ID
FROM global_address_details gad
WHERE gad.address_city LIKE '%seattle%'
)
),
all_indexes AS
(
SELECT ID FROM filter0
UNION (SELECT ID FROM filter1)
UNION (SELECT ID FROM filter2)
),
person_filter AS
(
SELECT ai.ID
FROM all_indexes ai
WHERE
(
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID NOT IN (SELECT ID FROM filter1)
OR ai.ID IN (SELECT ID FROM filter2)
)
)
SELECT (JSON_ARRAYAGG(pf.ID)) FROM person_filter pf;
Filter 0 has 461 records, Filter 1 has 48480 and Filter 2 has 750.
The key issue is with the WHERE statement; because the front-end can say AND/OR and NOT on any "joined" query.
So if I change it to:
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID IN (SELECT ID FROM filter1)
AND ai.ID IN (SELECT ID FROM filter2)
The query takes more than 60 seconds, because it's scanning 461 * 48480 * 750 = 16,761,960,000. UGH.
Of course I could hardcode around this if it was a static stored procedure or call, but it's a dynamic interpolative system that takes the settings defined by the user, so the user can define the above.
As you can see what I do is create a list of all indexes involved, then select them based on the AND/OR/NOT values as defined by the front-end web tool.
Obviously IN won't work for this; the question is what other techniques could I use that don't involve the use of IN that would allow the same level of flexibility with AND/OR/NOT?
Update for @BillKarwin in Comments
So the below code works well for executing an AND, NOT and OR:
SELECT pm.ID
FROM person_main pm
JOIN filter0 f0 ON f0.ID = pm.ID -- AND
LEFT JOIN filter1 f1 ON pm.ID = f1.ID WHERE f1.ID IS NULL -- NOT
UNION (SELECT ID FROM filter2) -- OR
I believe I can make this work with our system; I just need to store the different types (AND/NOT/OR) and execute them in process; let me do some updates and I'll get back to you.

As discussed in the comments above:
Logically, you can replace a lot of your subqueries with JOIN when they are AND terms of your expression, or UNION when they are OR terms of your expression. Also learn about exclusion joins.
But that doesn't necessarily mean that the queries will run faster, unless you have created indexes to support the join conditions and the user-defined conditions.
But which indexes should you create?
Ultimately, it's not possible to optimize all dynamic queries that users come up with. You may be able to run their queries (as you are already doing), but they won't be efficient.
It's kind of a losing game to allow users to specify arbitrary conditions. It's better to give them a fixed set of choices, which are types of queries that you have taken the time to optimize. Then allow them to run a "user-specified" query, but label it clearly that it is not optimized and it will likely take a long time.
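Since the query is being built dynamically anyway, the JOIN / exclusion join / UNION mapping described above could be generated from a list of (operator, filter) pairs. This is a rough sketch under several assumptions: build_query is a hypothetical helper, the filters are plain tables rather than CTEs, the sample IDs are invented, and SQLite (via Python's sqlite3) stands in for MySQL.

```python
import sqlite3

def build_query(terms):
    """Hypothetical helper: terms is a list of (op, filter_table) pairs,
    with op in {'AND', 'NOT', 'OR'}. Table names must come from a trusted
    whitelist, never from raw user input."""
    joins, where, unions = [], [], []
    for i, (op, table) in enumerate(terms):
        alias = f"f{i}"
        if op == "AND":    # inner join keeps only matching IDs
            joins.append(f"JOIN {table} {alias} ON {alias}.ID = pm.ID")
        elif op == "NOT":  # exclusion join: keep rows with no match
            joins.append(f"LEFT JOIN {table} {alias} ON {alias}.ID = pm.ID")
            where.append(f"{alias}.ID IS NULL")
        elif op == "OR":   # union adds the filter's IDs
            unions.append(f"UNION SELECT ID FROM {table}")
        else:
            raise ValueError(f"unknown operator: {op}")
    sql = "SELECT pm.ID FROM person_main pm " + " ".join(joins)
    if where:
        sql += " WHERE " + " AND ".join(where)
    if unions:
        sql += " " + " ".join(unions)
    return sql

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_main (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter0 (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter2 (ID INTEGER PRIMARY KEY);
    INSERT INTO person_main VALUES (1), (2), (3), (4), (5);
    INSERT INTO filter0 VALUES (1), (2), (3);
    INSERT INTO filter1 VALUES (2);
    INSERT INTO filter2 VALUES (5);
""")

sql = build_query([("AND", "filter0"), ("NOT", "filter1"), ("OR", "filter2")])
ids = sorted(row[0] for row in conn.execute(sql))
print(ids)
```

This only covers a flat expression like the one in the update above; nested groups would need the builder to recurse, and whether it runs fast still depends on the indexes behind each join.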

Avoid IN ( SELECT ... ) -- Use JOIN or EXISTS
Avoid SELECT ID FROM ( SELECT ID FROM .... ) -- The outer SELECT is unnecessary.
Move UNION to the outer level (in some situations)
all_indexes seems to simplify to
( SELECT phw.person_main_ref_id AS ID
FROM person_history_work AS phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
) UNION ALL
( SELECT gad.ID
FROM global_address_details AS gad
WHERE gad.address_city LIKE '%seattle%'
)
Can you change the last part to WHERE address_city = 'seattle'? If so, then you can use INDEX(address_city). If not, would a FULLTEXT index together with MATCH work for you?
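The difference between the equality test and the leading-wildcard LIKE can be seen in a query plan. The sketch below is only an illustration: it uses SQLite's EXPLAIN QUERY PLAN through Python's sqlite3 as a stand-in (in MySQL you would look at EXPLAIN instead), and idx_city is an assumed index name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE global_address_details (
        ID INTEGER PRIMARY KEY,
        address_city TEXT
    );
    -- idx_city is an assumed name for the suggested index
    CREATE INDEX idx_city ON global_address_details (address_city);
""")

def plan(sql):
    """Return the 'detail' text of SQLite's query plan for sql."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

# Equality can seek directly into the index.
equality_plan = plan(
    "SELECT ID FROM global_address_details WHERE address_city = 'seattle'"
)
# A leading wildcard forces every row (or index entry) to be examined.
wildcard_plan = plan(
    "SELECT ID FROM global_address_details WHERE address_city LIKE '%seattle%'"
)
print(equality_plan)
print(wildcard_plan)
```

In SQLite the first plan is reported as a SEARCH and the second as a SCAN; the exact wording varies by version, and MySQL's optimizer output looks different, so treat this as a demonstration of the principle only.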
See if you can follow my lead and simplify the rest.
WITH was only recently added to MySQL's syntax. I suspect it will take another release or two before it is well optimized; try to avoid WITH. Since you are "building" the query, you can "build" UNION, LEFT JOIN, etc, as needed.

Related

MySQL chooses to execute queries, or not, at whim

For a reporting output, I used to DROP and recreate a table 'mis.pr_approval_time', but now I just TRUNCATE it.
After populating the above table with data, I run an UPDATE statement, but I have written that as a SELECT below...
SELECT t.account_id FROM mis.hj_approval_survey h INNER JOIN mis.pr_approval_time t ON h.country = t.country AND t.scheduled_at =
(
SELECT MAX(scheduled_at) FROM mis.pr_approval_time
WHERE country = h.country
AND scheduled_at <= h.created_at
AND TIME_TO_SEC(TIMEDIFF(h.created_at, scheduled_at)) < 91
);
When I run the above statement or even just...
SELECT t.account_id FROM mis.hj_approval_survey h INNER JOIN mis.pr_approval_time t ON h.country = t.country AND t.scheduled_at =
(
SELECT MAX(scheduled_at) FROM mis.pr_approval_time
WHERE country = h.country
);
...it runs forever and does not seem to finish. There are only ~3,400 rows in hj_approval_survey table and 29,000 rows in pr_approval_time. I run this on an Amazon AWS instance with 15+ GB RAM.
Now, if I simply right click on pr_approval_time table and choose ALTER TABLE option, and just close without doing anything, then the above queries run within seconds.
I guess when I trigger the ALTER TABLE option and Workbench populates the table fields, it probably improves its execution plan somehow, but I am not sure why. Has anyone faced anything similar to this? How can I trigger a better execution plan check without right clicking the table and choosing 'ALTER TABLE'
EDIT
It may be noteworthy that my organisation also uses DOMO. Originally, I had this set up as a MySQL Dataflow on DOMO, but the query would not complete on most occasions, though I have observed it finish at times.
This was the reason why I moved this query back to our AWS MySQL RDS. So the problem has not only been observed on our own MySQL RDS, but probably also on DOMO.
I suspect this is slow because of the correlated subquery (subquery depends on row values from parent table, meaning it has to execute for each row). I'd try and rework the pr_approval_time table slightly so it's point-in-time and then you can use the JOIN to pick the correct rows without doing a correlated subquery. Something like:
SELECT
hj_approval_survey.country
, hj_approval_survey.created_at
, pr_approval_time.account_id
FROM
#hj_approval_survey AS hj_approval_survey
JOIN (
SELECT
current_row.country
, current_row.scheduled_at AS scheduled_at_start
, COALESCE( MIN( next_row.scheduled_at ), GETDATE() ) AS scheduled_at_end
FROM
#pr_approval_time AS current_row
LEFT OUTER JOIN
#pr_approval_time AS next_row ON (
next_row.country = current_row.country
AND next_row.scheduled_at > current_row.scheduled_at
)
GROUP BY
current_row.country
, current_row.scheduled_at
) AS pr_approval_pit ON (
pr_approval_pit.country = hj_approval_survey.country
AND ( hj_approval_survey.created_at >= pr_approval_pit.scheduled_at_start
AND hj_approval_survey.created_at < pr_approval_pit.scheduled_at_end
)
)
JOIN #pr_approval_time AS pr_approval_time ON (
pr_approval_time.country = pr_approval_pit.country
AND pr_approval_time.scheduled_at = pr_approval_pit.scheduled_at_start
)
WHERE
TIME_TO_SEC( TIMEDIFF( hj_approval_survey.created_at, pr_approval_time.scheduled_at ) ) < 91
Assuming you have proper indexes on the columns involved in the join, you could try refactoring your query using a grouped subquery and a join on country:
SELECT t.account_id
FROM mis.hj_approval_survey h
INNER JOIN mis.pr_approval_time t ON h.country = t.country
INNER JOIN (
SELECT country, MAX(scheduled_at) max_sched
FROM mis.pr_approval_time
group by country
) z on z.country = t.country and t.scheduled_at = z.max_sched
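To check that a grouped derived table gives the same rows as the correlated subquery, the two forms can be compared on a tiny data set. This is a sketch only: it covers the simpler second query from the question (without the created_at conditions), drops the mis schema prefix, uses invented sample rows, and runs on SQLite through Python's sqlite3 rather than MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hj_approval_survey (country TEXT, created_at TEXT);
    CREATE TABLE pr_approval_time (account_id INTEGER, country TEXT,
                                   scheduled_at TEXT);
    INSERT INTO pr_approval_time VALUES
        (10, 'US', '2020-01-01'),
        (11, 'US', '2020-02-01'),
        (20, 'DE', '2020-01-15');
    INSERT INTO hj_approval_survey VALUES
        ('US', '2020-03-01'), ('DE', '2020-03-01'), ('US', '2020-03-02');
""")

# Correlated form: the subquery is evaluated per row of h.
correlated = sorted(row[0] for row in conn.execute("""
    SELECT t.account_id
    FROM hj_approval_survey h
    INNER JOIN pr_approval_time t
        ON h.country = t.country
       AND t.scheduled_at = (SELECT MAX(scheduled_at)
                             FROM pr_approval_time
                             WHERE country = h.country)
"""))

# Derived-table form: MAX per country is computed once, then joined.
grouped = sorted(row[0] for row in conn.execute("""
    SELECT t.account_id
    FROM hj_approval_survey h
    INNER JOIN pr_approval_time t ON h.country = t.country
    INNER JOIN (SELECT country, MAX(scheduled_at) AS max_sched
                FROM pr_approval_time
                GROUP BY country) z
        ON z.country = t.country AND t.scheduled_at = z.max_sched
"""))
print(correlated, grouped)
```

The first query in the question also restricts scheduled_at relative to h.created_at, which a per-country MAX cannot express, so this rewrite is not a drop-in replacement for that one.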

MINUS operator in MySQL query [duplicate]

I am trying to perform a MINUS operation in MySQL. I have three tables:
one with service details
one table with states that a service is offered in
another table (based on zipcode and state) shows where this service is not offered.
I am able to get the output for those two select queries separately. But I need a combined statement that gives the output as
'SELECT query_1 - SELECT query_2'.
Service_Details Table
Service_Code(PK) Service Name
Servicing_States Table
Service_Code(FK) State Country PK(Service_Code,State,Country)
Exception Table
Service_Code(FK) Zipcode State PK(Service_Code,Zipcode,State)
MySQL does not recognise MINUS and INTERSECT; these are Oracle-based operations. In MySQL, you can use NOT IN in place of MINUS (there are other solutions too, but I like this one a lot).
Example:
select a.id
from table1 as a
where <condition>
AND a.id NOT IN (select b.id
from table2 as b
where <condition>);
MySQL does not support MINUS or EXCEPT. You can use NOT EXISTS, LEFT JOIN ... IS NULL, or NOT IN instead.
Here's my two cents: a complex query I just got working, originally expressed with MINUS and translated for MySQL.
With MINUS:
select distinct oi.`productOfferingId`,f.name
from t_m_prod_action_oitem_fld f
join t_m_prod_action_oitem oi
on f.fld2prod_action_oitem = oi.oid
minus
select
distinct r.name,f.name
from t_m_prod_action_oitem_fld f
join t_m_prod_action_oitem oi
on f.fld2prod_action_oitem = oi.oid
join t_m_rfs r
on r.name = oi.productOfferingId
join t_m_attr a
on a.attr2rfs = r.oid and f.name = a.name;
With NOT EXISTS
select distinct oi.`productOfferingId`,f.name
from t_m_prod_action_oitem_fld f
join t_m_prod_action_oitem oi
on f.fld2prod_action_oitem = oi.oid
where not exists (
select
r.name,f.name
from t_m_rfs r
join t_m_attr a
on a.attr2rfs = r.oid
where r.name = oi.productOfferingId and f.name = a.name
);
The tables have to have the same columns, but I think you can achieve what you are looking for with EXCEPT... except that EXCEPT only works in standard SQL! Here's how to do it in MySQL:
SELECT * FROM Servicing_states ss WHERE NOT EXISTS
( SELECT * FROM Exception e WHERE ss.Service_Code = e.Service_Code);
http://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/
Standard SQL
SELECT * FROM Servicing_States
EXCEPT
SELECT * FROM Exception;
An anti-join pattern is the approach I typically use. That's an outer join, to return all rows from query_1, along with matching rows from query_2, and then filtering out all the rows that had a match... leaving only rows from query_1 that didn't have a match. For example:
SELECT q1.*
FROM ( query_1 ) q1
LEFT
JOIN ( query_2 ) q2
ON q2.id = q1.id
WHERE q2.id IS NULL
To emulate the MINUS set operator, we'd need the join predicate to compare all columns returned by q1 and q2, also matching NULL values.
ON q1.col1 <=> q2.col1
AND q1.col2 <=> q2.col2
AND q1.col3 <=> q2.col3
AND ...
Also, to fully emulate the MINUS operation, we'd need to remove duplicate rows returned by q1. Adding the DISTINCT keyword would be sufficient to do that.
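The full emulation (null-safe comparison on every column plus DISTINCT) can be checked against a database that has a native set-difference operator. The sketch below does that in SQLite via Python's sqlite3, which is an assumption made purely for testing: SQLite spells the null-safe comparison IS where MySQL uses <=>, and the tables q1 and q2, their rows, and the matched marker column are all invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE q1 (col1 INTEGER, col2 TEXT);
    CREATE TABLE q2 (col1 INTEGER, col2 TEXT);
    INSERT INTO q1 VALUES (1, 'a'), (1, 'a'), (2, NULL), (3, 'c');
    INSERT INTO q2 VALUES (2, NULL), (4, 'd');
""")

# Anti-join: null-safe match on every column, keep unmatched q1 rows,
# then DISTINCT to drop duplicates. 'matched' is a never-NULL marker so
# that "no match" is not confused with a matched row holding NULLs.
anti_join = conn.execute("""
    SELECT DISTINCT q1.col1, q1.col2
    FROM q1
    LEFT JOIN (SELECT 1 AS matched, col1, col2 FROM q2) AS m
        ON q1.col1 IS m.col1     -- MySQL: q1.col1 <=> m.col1
       AND q1.col2 IS m.col2     -- MySQL: q1.col2 <=> m.col2
    WHERE m.matched IS NULL
    ORDER BY q1.col1
""").fetchall()

# Reference result from SQLite's native set difference.
native_except = conn.execute("""
    SELECT col1, col2 FROM q1
    EXCEPT
    SELECT col1, col2 FROM q2
    ORDER BY col1
""").fetchall()
print(anti_join, native_except)
```

The row (2, NULL) is removed only because the comparison is null-safe; with a plain = it would have survived the anti-join.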
In case the tables are huge and similar, one option is to save the PKs to new tables and then compare based only on the PK. If you know that the first half or so is identical, add a WHERE clause to check only after a specific value or date.
create table _temp_old ( id int NOT NULL PRIMARY KEY );
create table _temp_new ( id int NOT NULL PRIMARY KEY );
### will take some time
insert into _temp_old ( id )
select id from _real_table_old;
### will take some time
insert into _temp_new ( id )
select id from _real_table_new;
### this version should be much faster
select id from _temp_old told where not exists ( select id from _temp_new tn where told.id = tn.id );
### this should be much slower
select id from _real_table_old rto where not exists ( select id from _real_table_new rtn where rto.id = rtn.id );

MySQL Join performance is really slow

I am trying to compare values against same table which has more than 1,000,000 rows. Below is my query and it takes around 25 secs to get results.
EXPLAIN SELECT DISTINCT a.studyid,a.number,a.load_number,b.studyid,b.number,b.load_number FROM
(SELECT t1.*, buildnumber,platformid FROM t t1
INNER JOIN testlog t2 ON t1.`testid` = t2.`testid`
WHERE (buildnumber =1031719 AND platformid IN (SELECT platformid FROM platform WHERE platform.`Description` = "Windows 7 SP1"))
)AS a
JOIN
(SELECT t1.*,buildnumber,platformid FROM t t1
INNER JOIN testlog t2 ON t1.`testid` = t2.`testid`
WHERE (buildnumber =1030716 AND platformid IN (SELECT platformid FROM platform WHERE platform.`Description` = "Windows 7 SP1"))
)AS b
ON a.studyid=b.studyid AND a.load_number = b.load_number AND a.number = b.number
Could anyone help me improve this query so it returns results fast enough?
The problem here is that even though I have indexes on number and load_number, the query doesn't use them. I don't know why they are always ignored.
Thanks.
First, you have a silly query. You are retrieving six columns, but there are only three values. Look at the on clause.
I think your best bet is to rewrite the query using conditional aggregation. I think the following is equivalent:
SELECT t1.studyid, t1.load_number, t1.number
FROM t t1 INNER JOIN
testlog t2
ON t1.testid = t2.testid
WHERE t2.buildnumber IN (1031719, 1030716) AND
platformid IN (SELECT platformid FROM platform p WHERE p.Description = 'Windows 7 SP1')
GROUP BY studyid, load_number, number
HAVING MIN(buildnumber) <> MAX(buildnumber)
For this query, you want indexes on platform(Description, platformid) and testlog(buildnumber, platformid) and t(testid).
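The HAVING MIN(buildnumber) <> MAX(buildnumber) trick can be compared with the original self-join on a small data set. This is a simplified sketch: the join to testlog and the platform filter are folded into a single hypothetical results table, the rows are invented, and SQLite via Python's sqlite3 stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- 'results' is a hypothetical flattened stand-in for t JOIN testlog
    CREATE TABLE results (studyid INTEGER, load_number INTEGER,
                          number INTEGER, buildnumber INTEGER);
    INSERT INTO results VALUES
        (1, 1, 1, 1031719), (1, 1, 1, 1030716),  -- present in both builds
        (2, 1, 1, 1031719),                      -- first build only
        (3, 1, 1, 1030716), (3, 1, 1, 1030716);  -- second build only, twice
""")

# Original shape: join the two builds against each other.
self_join = conn.execute("""
    SELECT DISTINCT a.studyid, a.load_number, a.number
    FROM results a
    JOIN results b
        ON a.studyid = b.studyid
       AND a.load_number = b.load_number
       AND a.number = b.number
    WHERE a.buildnumber = 1031719 AND b.buildnumber = 1030716
    ORDER BY a.studyid
""").fetchall()

# Aggregated shape: a group spans both builds only if MIN differs from MAX.
grouped = conn.execute("""
    SELECT studyid, load_number, number
    FROM results
    WHERE buildnumber IN (1031719, 1030716)
    GROUP BY studyid, load_number, number
    HAVING MIN(buildnumber) <> MAX(buildnumber)
    ORDER BY studyid
""").fetchall()
print(self_join, grouped)
```

The group for studyid 3 has two rows but a single build number, so MIN equals MAX and it is correctly excluded; a HAVING COUNT(*) = 2 test would have let it through.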
Problem #1:
IN ( SELECT ... ) optimizes very poorly. The subquery is rerun again and again. It looks like you are expecting exactly one id from that query; if so, change it to = ( SELECT ... ). That way it will be run exactly once.
Problem #2:
FROM ( SELECT ... )
JOIN ( SELECT ... ) ON ...
optimizes poorly because neither subquery has an index. Can you merge the two subqueries into one, as Gordon was trying? If not, then put one of them into a TEMPORARY TABLE and add an appropriate index to that table so that the ON will be able to use it. Probably PRIMARY KEY(studyid, load_number, number).
Footnote: The latest versions of MySQL have made improvements on these problems by dynamically generating indexes. What version are you using?

Mysql: same query, different results

EDIT:
Sorry about the unreadable query, I was under deadline. I managed to solve the problem by breaking this query into two smaller ones and doing some business logic in Java. I still want to know why this query can, at random times, return two different results.
So it randomly returns all expected results one time and just half of them another time. I noticed that when I write it join by join, executing after each join, in the end it returns all expected results. So I am wondering if there's some kind of MySQL memory or other limitation that means it doesn't take whole tables in joins. I also read about nondeterministic queries but am not sure what to make of that.
Please help, ask if needs clarification, and thank you in advance.
RESET QUERY CACHE;
SET SQL_BIG_SELECTS=1;
set @displayvideoaction_id = 2302;
set @ticSessionId = 3851;
select richtext.id,richtextcross.name,richtextcross.updates_demo_field,richtext.content from
(
select listitemcross.id,name,updates_demo_field,listitem.text_id from
(
select id,name, updates_demo_field, items_id from
(
SELECT id, name, answertype_id, updates_demo_field,
@student:=CASE WHEN @class <> updates_demo_field THEN 0 ELSE @student+1 END AS rn,
@class:=updates_demo_field AS clset FROM
(SELECT @student:= -1) s,
(SELECT @class:= '-1') c,
(
select id, name, answertype_id, updates_demo_field from
(
select manytomany.questions_id from
(
select questiongroup_id from
(
select questiongroup_id from `ticnotes`.`scriptaction` where ticsession_id=@ticSessionId and questiongroup_id is not null
) scriptaction
inner join
(
select * from `ticnotes`.`questiongroup`
) questiongroup on scriptaction.questiongroup_id=questiongroup.id
) scriptgroup
inner join
(
select * from `ticnotes`.`questiongroup_question`
) manytomany on scriptgroup.questiongroup_id=manytomany.questiongroup_id
) questionrelation
inner join
(
select * from `ticnotes`.`question`
) questiontable on questionrelation.questions_id=questiontable.id
where updates_demo_field = 'DEMO1' or updates_demo_field = 'DEMO2'
order by updates_demo_field, id desc
) t
having rn=0
) firstrowofgroup
inner join
(
select * from `ticnotes`.`multipleoptionstype_listitem`
) selectlistanswers on firstrowofgroup.answertype_id=selectlistanswers.multipleoptionstype_id
) listitemcross
inner join
(
select * from `ticnotes`.`listitem`
) listitem on listitemcross.items_id=listitem.id
) richtextcross
inner join
(
select * from `ticnotes`.`richtext`
) richtext on richtextcross.text_id=richtext.id;
My first impression is - don't use short cuts to describe your tables. I am lost at which td3 is where ,then td6, tdx3... I guess you might be lost as well.
If you name your aliases more sensibly there will be less chance to get something wrong and mix 6 with 8 or whatever.
Just a suggestion :)
There is no such limitation in MySQL, so my bet would be on human error: somewhere the join logic fails.

MySQL - Faster method for this complex query?

Is there a less resource intensive / faster way of performing this query (which is partly based upon: This StackOverflow question ). Currently it takes 0.008 seconds searching through only a dozen or so rows per table.
SELECT DISTINCT *
FROM (
(
SELECT DISTINCT ta.auto_id, li.address, li.title, GROUP_CONCAT( ta.tag ) , li.description, li.keyword, li.rating, li.timestamp
FROM tags AS ta
INNER JOIN links AS li ON ta.auto_id = li.auto_id
WHERE ta.user_id =1
AND (
ta.tag LIKE '%query%'
)
OR (
li.keyword LIKE '%query%'
)
GROUP BY li.auto_id
)
UNION DISTINCT (
SELECT DISTINCT auto_id, address, title, '', description, keyword, rating, `timestamp`
FROM links
WHERE user_id =1
AND (
keyword LIKE '%query%'
)
)
) AS total
GROUP BY total.auto_id
Thank you very much,
Ice
I would hope that the query optimizer would do this for you, but just in case, you might want to try selecting from tags by user_id before doing the join in the first subquery. This would presumably reduce the number of rows that you have to join across. You also probably want to have indexes on auto_id and user_id.
SELECT DISTINCT *
FROM (
(SELECT ta.auto_id, li.address, li.title, GROUP_CONCAT( ta.tag ),
li.description, li.keyword, li.rating, li.timestamp
FROM (SELECT auto_id, tag FROM tags WHERE user_id = 1) AS ta
INNER JOIN links AS li ON ta.auto_id = li.auto_id
WHERE (ta.tag LIKE '%query%') OR (li.keyword LIKE '%query%')
GROUP BY li.auto_id
)
UNION (
SELECT auto_id, address, title, '', description, keyword, rating, `timestamp`
FROM links
WHERE user_id = 1 AND (keyword LIKE '%query%')
)
) AS total
GROUP BY total.auto_id
If you can use the MyISAM table format, try to use a full-text index and search on ta.tag and li.keyword.
Testing this on tables with dozens of rows won't necessarily tell you if there is a performance problem. A DBMS may use different strategies depending on the size of tables.
Try this on larger datasets to get a better assessment of whether there's a problem and just how serious it is.
It is difficult to be sure without the table definitions, but you might be able to rephrase the query as a simpler left join from LINKS to TAGS:
select li.auto_id,
address,
title,
group_concat(ta.tag),
description,
keyword,
rating,
timestamp
from links li
left join tags ta ON ta.auto_id = li.auto_id
where li.user_id = 1 and ( keyword like '%query%' or ta.tag like '%query%' )
group by li.auto_id;
The logic might need beefing up to cope with nulls in keyword or ta.tag - depending on the table definition.
The % wildcards are probably going to stop your query from being able to use the indexes, particularly the leading ones: searching for 'cat%' can still use indexes, but '%cat%' can't. Unless your data set is small, that's probably fatal.
I'd also check whether the OR logic is causing you trouble - I'm not sure whether the optimizer will be able to separately optimize the keyword and tag criteria. If it can't, it'll give up and brute-force it.
To re-iterate some of the other comments:
test with a bigger data set
try the components of this query first (there's about three separate queries in there) before trying to bolt them all together.