EDIT:
Sorry about the unreadable query; I was under deadline. I managed to solve the problem by breaking this query into two smaller ones and doing some of the business logic in Java. I would still like to know why this query can return two different results at random.
It randomly returns all expected results one time and only half of them another time. I noticed that when I build it up join by join and execute after each join, the final version returns all expected results. So I am wondering whether there is some kind of MySQL memory or other limitation that keeps it from reading whole tables in joins. I have also read about non-deterministic queries, but I am not sure whether that applies here.
Please help, ask if anything needs clarification, and thank you in advance.
RESET QUERY CACHE;
SET SQL_BIG_SELECTS=1;
set @displayvideoaction_id = 2302;
set @ticSessionId = 3851;
select richtext.id,richtextcross.name,richtextcross.updates_demo_field,richtext.content from
(
select listitemcross.id,name,updates_demo_field,listitem.text_id from
(
select id,name, updates_demo_field, items_id from
(
SELECT id, name, answertype_id, updates_demo_field,
@student:=CASE WHEN @class <> updates_demo_field THEN 0 ELSE @student+1 END AS rn,
@class:=updates_demo_field AS clset FROM
(SELECT @student:= -1) s,
(SELECT @class:= '-1') c,
(
select id, name, answertype_id, updates_demo_field from
(
select manytomany.questions_id from
(
select questiongroup_id from
(
select questiongroup_id from `ticnotes`.`scriptaction` where ticsession_id=@ticSessionId and questiongroup_id is not null
) scriptaction
inner join
(
select * from `ticnotes`.`questiongroup`
) questiongroup on scriptaction.questiongroup_id=questiongroup.id
) scriptgroup
inner join
(
select * from `ticnotes`.`questiongroup_question`
) manytomany on scriptgroup.questiongroup_id=manytomany.questiongroup_id
) questionrelation
inner join
(
select * from `ticnotes`.`question`
) questiontable on questionrelation.questions_id=questiontable.id
where updates_demo_field = 'DEMO1' or updates_demo_field = 'DEMO2'
order by updates_demo_field, id desc
) t
having rn=0
) firstrowofgroup
inner join
(
select * from `ticnotes`.`multipleoptionstype_listitem`
) selectlistanswers on firstrowofgroup.answertype_id=selectlistanswers.multipleoptionstype_id
) listitemcross
inner join
(
select * from `ticnotes`.`listitem`
) listitem on listitemcross.items_id=listitem.id
) richtextcross
inner join
(
select * from `ticnotes`.`richtext`
) richtext on richtextcross.text_id=richtext.id;
My first impression: don't use shortcuts to name your tables. I am lost as to which td3 is where, then td6, tdx3... I guess you might be lost as well.
If you name your aliases more sensibly, there is less chance of getting something wrong and mixing up 6 with 8 or whatever.
Just a suggestion :)
MySQL has no such limitation, so my bet would be on human error: somewhere in there the join logic fails.
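One possible explanation, not confirmed anywhere in this thread: the MySQL manual states that the order of evaluation for expressions involving user variables is undefined, so the @student/@class row-numbering trick on top of an ORDER BY inside a derived table is not guaranteed to number rows the same way on every run. Below is a minimal sketch of a variable-free way to pick the row with the highest id per updates_demo_field. It runs against Python's built-in sqlite3 only so that it can be executed anywhere; the sample rows are made up, and the joins that restrict questions to one session are left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE question (id INTEGER PRIMARY KEY, name TEXT,
                           answertype_id INTEGER, updates_demo_field TEXT);
    -- Hypothetical sample data, not taken from the thread.
    INSERT INTO question VALUES
        (1, 'q1', 10, 'DEMO1'),
        (2, 'q2', 11, 'DEMO1'),
        (3, 'q3', 12, 'DEMO2'),
        (4, 'q4', 13, 'DEMO2'),
        (5, 'q5', 14, 'OTHER');
""")

# Pick the newest (highest id) question per updates_demo_field by joining
# back to a grouped subquery instead of numbering rows with user variables.
FIRST_ROW_PER_GROUP = """
    SELECT q.id, q.name, q.answertype_id, q.updates_demo_field
    FROM question AS q
    JOIN (
        SELECT updates_demo_field, MAX(id) AS max_id
        FROM question
        WHERE updates_demo_field IN ('DEMO1', 'DEMO2')
        GROUP BY updates_demo_field
    ) AS newest
      ON newest.updates_demo_field = q.updates_demo_field
     AND newest.max_id = q.id
    ORDER BY q.updates_demo_field
"""

rows = conn.execute(FIRST_ROW_PER_GROUP).fetchall()
print(rows)
```

The grouped subquery does not depend on evaluation order, so it should return the same rows on every execution.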
UPDATE - FINAL SOLUTION TO THIS ISSUE
Our dynamic system allows for a BOOLEAN interpolated match of things like Name, Job Title, Phone Number, etc. So we can say:
Name("ted" OR "mike" OR "david" AND "martin") AND Title("developer" AND "senior" NOT "CTO") AND City("san diego")
This is accomplished by following the grouping example below, which is created dynamically. It's pretty straightforward; however, the use of HAVING COUNT is necessary to properly define the AND indexes.
Also note that in this example access_indexes is a list of the ID indexes an account has access to, so if the "search" returns a person the account can't access, that person won't show up.
Thanks to everyone for your help, especially @BillKarwin!
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('(ted)' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT ram.object_ref_id AS ID
FROM ras_assignment_main ram
WHERE ram.object_type_c = 1
AND ram.assignment_type_c = 1
AND ram.assignment_ref_id IN (2)
),
persongroup0_and AS
(
SELECT pg0_a.ID FROM
(
SELECT ID FROM filter0
) pg0_a
GROUP BY pg0_a.ID
HAVING COUNT(pg0_a.ID) = 1
),
persongroup0 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup0_and pg0_and ON pm.ID = pg0_and.ID
),
persongroup1_and AS
(
SELECT pg1_a.ID FROM
(
SELECT ID FROM filter1
) pg1_a
GROUP BY pg1_a.ID
HAVING COUNT(pg1_a.ID) = 1
),
persongroup1 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup1_and pg1_and ON pm.ID = pg1_and.ID
),
person_all_and AS
(
SELECT paa.ID FROM
(
SELECT ID FROM persongroup0
UNION ALL (SELECT ID FROM persongroup1)
) paa
GROUP BY paa.ID
HAVING COUNT(paa.ID) = 2
),
person_all AS
(
SELECT pm.ID
FROM person_main pm
JOIN person_all_and pa_and ON pm.ID = pa_and.ID
),
person_access AS
(
SELECT pa.ID
FROM person_all pa
JOIN access_indexes ai ON pa.ID = ai.ID -- inner join, so persons without access are dropped
)
SELECT (JSON_ARRAYAGG(pm.ID))
FROM
(
SELECT person_sort.ID
FROM
(
SELECT pa.ID
FROM person_access pa
GROUP BY pa.ID
) person_sort
) pm;
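The HAVING COUNT step can be looked at in isolation. The sketch below is a simplified illustration, not the production query: filter0 and filter1 are made-up plain tables rather than CTEs, and it runs on Python's sqlite3 purely so it is self-contained. It assumes each filter contributes an ID at most once (hence the DISTINCT); otherwise duplicates would inflate the count and the equality test would misfire.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical filter results; in the thread these are CTEs.
    CREATE TABLE filter0 (ID INTEGER);
    CREATE TABLE filter1 (ID INTEGER);
    INSERT INTO filter0 VALUES (1), (2), (3);
    INSERT INTO filter1 VALUES (2), (3), (4);
""")

# AND of two filters: stack the IDs with UNION ALL, then keep only the IDs
# that were contributed by both filters (count equals the number of filters).
AND_VIA_COUNT = """
    SELECT ID
    FROM (
        SELECT DISTINCT ID FROM filter0
        UNION ALL
        SELECT DISTINCT ID FROM filter1
    ) AS hits
    GROUP BY ID
    HAVING COUNT(*) = 2
"""

and_ids = sorted(row[0] for row in conn.execute(AND_VIA_COUNT))
print(and_ids)
```

Only IDs 2 and 3 appear in both filters, so only those survive the HAVING clause.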
Our front-end system has the ability to define dynamic SQL queries using AND/OR/NOT from multiple tables, and the core system works fine, but it slows down to the point of being unusable due to the compounded scanning of IN. For the life of me, I can't figure out how to have this level of dynamic functionality without using IN. Below is the code that works perfectly fine (the filter matching is ultra fast), but the compounding of the IN scans takes more than 60 seconds because some of the filters return 50,000+ records.
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('mike meyers' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT phw.person_main_ref_id AS ID
FROM person_history_work phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
),
filter2 AS
(
SELECT pa.person_main_ref_id AS ID
FROM person_address pa
WHERE pa.global_address_details_ref_id IN
(
SELECT gad.ID
FROM global_address_details gad
WHERE gad.address_city LIKE '%seattle%'
)
),
all_indexes AS
(
SELECT ID FROM filter0
UNION (SELECT ID FROM filter1)
UNION (SELECT ID FROM filter2)
),
person_filter AS
(
SELECT ai.ID
FROM all_indexes ai
WHERE
(
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID NOT IN (SELECT ID FROM filter1)
OR ai.ID IN (SELECT ID FROM filter2)
)
)
SELECT (JSON_ARRAYAGG(pf.ID)) FROM person_filter pf;
Filter 0 has 461 records, Filter 1 has 48,480 and Filter 2 has 750.
The key issue is with the WHERE clause, because the front-end can apply AND/OR and NOT to any "joined" query.
So if I change it to:
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID IN (SELECT ID FROM filter1)
AND ai.ID IN (SELECT ID FROM filter2)
The query takes more than 60 seconds, because it's scanning 461 * 48480 * 750 = 16,761,960,000 combinations. UGH.
Of course I could hardcode around this if it was a static stored procedure or call, but it's a dynamic interpolative system that takes the settings defined by the user, so the user can define the above.
As you can see what I do is create a list of all indexes involved, then select them based on the AND/OR/NOT values as defined by the front-end web tool.
Obviously IN won't work for this; the question is what other techniques could I use that don't involve the use of IN that would allow the same level of flexibility with AND/OR/NOT?
Update for @BillKarwin in Comments
So the below code works well for executing an AND, NOT and OR:
SELECT pm.ID
FROM person_main pm
JOIN filter0 f0 ON f0.ID = pm.ID -- AND
LEFT JOIN filter1 f1 ON pm.ID = f1.ID WHERE f1.ID IS NULL -- NOT
UNION (SELECT ID FROM filter2) -- OR
I believe I can make this work with our system; I just need to store the different types (AND/NOT/OR) and execute them in process; let me do some updates and I'll get back to you.
As discussed in the comments above:
Logically, you can replace a lot of your subqueries with JOIN when they are AND terms of your expression, or UNION when they are OR terms of your expression. Also learn about exclusion joins.
But that doesn't necessarily mean that the queries will run faster, unless you have created indexes to support the join conditions and the user-defined conditions.
But which indexes should you create?
Ultimately, it's not possible to optimize all dynamic queries that users come up with. You may be able to run their queries (as you are already doing), but they won't be efficient.
It's kind of a losing game to allow users to specify arbitrary conditions. It's better to give them a fixed set of choices, which are types of queries that you have taken the time to optimize. Then allow them to run a "user-specified" query, but label it clearly that it is not optimized and it will likely take a long time.
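To make the JOIN / UNION / exclusion-join rewrite concrete, here is a small sketch for the expression (filter0 AND NOT filter1) OR filter2. It is an illustration under assumptions: the filters are made-up plain tables instead of CTEs, and it runs on Python's sqlite3 rather than MySQL so that it is self-contained. It only checks that the rewrite returns the same IDs as the IN / NOT IN version; it says nothing about speed, which depends on the indexes as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical data; the real filters are built dynamically.
    CREATE TABLE person_main (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter0 (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter1 (ID INTEGER PRIMARY KEY);
    CREATE TABLE filter2 (ID INTEGER PRIMARY KEY);
    INSERT INTO person_main VALUES (1), (2), (3), (4), (5);
    INSERT INTO filter0 VALUES (1), (2), (3);
    INSERT INTO filter1 VALUES (2);
    INSERT INTO filter2 VALUES (5);
""")

# AND -> inner JOIN, NOT -> exclusion join (LEFT JOIN ... IS NULL), OR -> UNION.
JOIN_VERSION = """
    SELECT pm.ID
    FROM person_main AS pm
    JOIN filter0 AS f0 ON f0.ID = pm.ID
    LEFT JOIN filter1 AS f1 ON f1.ID = pm.ID
    WHERE f1.ID IS NULL
    UNION
    SELECT ID FROM filter2
"""

# The same expression written with IN / NOT IN, for comparison.
IN_VERSION = """
    SELECT ID
    FROM person_main
    WHERE (ID IN (SELECT ID FROM filter0)
           AND ID NOT IN (SELECT ID FROM filter1))
       OR ID IN (SELECT ID FROM filter2)
"""

join_ids = sorted(row[0] for row in conn.execute(JOIN_VERSION))
in_ids = sorted(row[0] for row in conn.execute(IN_VERSION))
print(join_ids, in_ids)
```

Both formulations return the same set here, which is the property a query builder would need to preserve when it emits joins instead of IN.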
Avoid IN ( SELECT ... ) -- Use JOIN or EXISTS
Avoid SELECT ID FROM ( SELECT ID FROM .... ) -- The outer SELECT is unnecessary.
Move UNION to the outer level (in some situations)
all_indexes seems to simplify to
( SELECT phw.person_main_ref_id AS ID
FROM person_history_work AS phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
) UNION ALL
( SELECT gad.ID
FROM global_address_details AS gad
WHERE gad.address_city LIKE '%seattle%'
)
Can you change the last part to WHERE address_city = 'seattle'? If so, then you can use INDEX(address_city). If not, would a FULLTEXT index together with MATCH work for you?
See if you can follow my lead and simplify the rest.
WITH was only recently added to MySQL's syntax. I suspect it will take another release or two before it is well optimized; try to avoid WITH. Since you are "building" the query, you can "build" UNION, LEFT JOIN, etc., as needed.
I am facing a weird issue in Couchbase. I was executing the following two queries:
SELECT *
FROM (
SELECT *
FROM (
SELECT *
FROM ssb_lineorder
LIMIT 10000) AS cte0
INNER JOIN ssb_ddate ON cte0.ssb_lineorder.lo_orderdate = ssb_ddate.d_datekey) AS cte1
JOIN ssb_part USE NL ON cte1.cte0.ssb_lineorder.lo_partkey = ssb_part.p_partkey
WHERE ssb_part.p_size > 10
and
SELECT *
FROM (
SELECT *
FROM (
SELECT *
FROM (
SELECT *
FROM ssb_lineorder
LIMIT 10000) AS cte0
INNER JOIN ssb_ddate ON cte0.ssb_lineorder.lo_orderdate = ssb_ddate.d_datekey) AS cte1
JOIN ssb_part USE NL ON cte1.cte0.ssb_lineorder.lo_partkey = ssb_part.p_partkey ) AS cte2
WHERE cte2.ssb_part.p_size > 10
These two are exactly the same except for the final WHERE clause. According to my knowledge of relational DBMSs, the results should be exactly the same, but I am getting different results: 1 row for the first query, 7972 for the second.
I am wondering if I have misunderstood the N1QL mechanism?
There should not be any difference.
A LIMIT inside a subquery without ORDER BY can cause inconsistent results, but 1 vs 7972 is way off.
As this is data dependent, you need to debug it.
Execute the query in the UI, go to the Plan Text tab, and look at the #itemsIn and #itemsOut counts of each operator to see where things go wrong.
Also add predicates to reduce the data and see what is wrong.
Since there is no OUTER JOIN, try the following.
CREATE INDEX ix1 ON ssb_part(p_size, p_partkey);
CREATE INDEX ix2 ON ssb_lineorder(lo_partkey, lo_orderdate);
CREATE INDEX ix3 ON ssb_ddate(d_datekey);
SELECT *
FROM ssb_part AS sp
JOIN ssb_lineorder AS sl ON sp.p_partkey = sl.lo_partkey
JOIN ssb_ddate AS sd ON sl.lo_orderdate = sd.d_datekey
WHERE sp.p_size > 10;
SELECT *
FROM ssb_part AS sp
JOIN ssb_lineorder AS sl USE HASH (PROBE) ON sp.p_partkey = sl.lo_partkey
JOIN ssb_ddate AS sd USE HASH (PROBE) ON sl.lo_orderdate = sd.d_datekey
WHERE sp.p_size > 10 ;
Let's assume I have 2 tables. One contains car manufacturers' names and their IDs; the second contains information about car models. I need to select a few manufacturers from the first table, but order them by the number of linked rows in the second table.
Currently, my query looks like this:
SELECT DISTINCT `manufacturers`.`name`,
`manufacturers`.`cars_link`,
`manufacturers`.`slug`
FROM `manufacturers`
JOIN `cars`
ON manufacturers.cars_link = cars.manufacturer
WHERE ( NOT ( `manufacturers`.`cars_link` IS NULL ) )
AND ( `cars`.`class` = 'sedan' )
ORDER BY (SELECT Count(*)
FROM `cars`
WHERE `manufacturers`.cars_link = `cars`.manufacturer) DESC
It was working OK for my scooters table, which is a few dozen MB in size. But now I need to do the same thing for the cars table, which is a few hundred megabytes. The problem is that the query takes a very long time; sometimes it even causes an nginx timeout. Also, I think I have all the necessary database indexes. Is there any alternative to the query above?
Let's try using a subquery for your count instead.
select t2.name, t2.cars_link, t2.slug from (
select distinct m.name, m.cars_link, m.slug, t1.ct
from manufacturers m
join cars c on m.cars_link=c.manufacturer
left join
-- count the matching cars once per manufacturer instead of once per row
(select count(1) ct, c1.manufacturer from manufacturers m1
inner join cars c1 on m1.cars_link=c1.manufacturer
where coalesce(m1.cars_link, '') != '' and c1.class='sedan'
group by c1.manufacturer) as t1
on t1.manufacturer = c.manufacturer
where coalesce(m.cars_link, '') != '' and c.class='sedan') t2
order by t2.ct desc
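As a runnable illustration of the same idea (count once per manufacturer in a derived table, then join and sort), here is a sketch on Python's sqlite3 with made-up rows. It keeps the ordering from the question, which sorts by the total number of cars per manufacturer while only listing manufacturers that have at least one sedan. It assumes cars_link is unique per manufacturer; if it is not, the DISTINCT from the question would still be needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE manufacturers (name TEXT, cars_link TEXT, slug TEXT);
    CREATE TABLE cars (manufacturer TEXT, model TEXT, class TEXT);
    -- Hypothetical sample data.
    INSERT INTO manufacturers VALUES
        ('Alpha', 'alpha', 'alpha'),
        ('Beta',  'beta',  'beta'),
        ('Gamma', NULL,    'gamma');
    INSERT INTO cars VALUES
        ('alpha', 'A1', 'sedan'),
        ('beta',  'B1', 'sedan'),
        ('beta',  'B2', 'suv'),
        ('beta',  'B3', 'sedan'),
        ('gamma', 'G1', 'sedan');
""")

# Aggregate cars once, then join: one pass over cars instead of a
# correlated COUNT(*) evaluated for every output row.
ORDERED_BY_COUNT = """
    SELECT m.name, m.cars_link, m.slug
    FROM manufacturers AS m
    JOIN (
        SELECT manufacturer,
               COUNT(*) AS total,
               SUM(CASE WHEN class = 'sedan' THEN 1 ELSE 0 END) AS sedans
        FROM cars
        GROUP BY manufacturer
    ) AS counts ON counts.manufacturer = m.cars_link
    WHERE counts.sedans > 0
    ORDER BY counts.total DESC
"""

rows = conn.execute(ORDERED_BY_COUNT).fetchall()
print(rows)
```

Gamma drops out because its cars_link is NULL and therefore never matches the join, which mirrors the IS NOT NULL condition in the question.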
For a reporting output, I used to DROP and recreate a table 'mis.pr_approval_time', but now I just TRUNCATE it.
After populating the above table with data, I run an UPDATE statement, but I have written that as a SELECT below...
SELECT t.account_id FROM mis.hj_approval_survey h INNER JOIN mis.pr_approval_time t ON h.country = t.country AND t.scheduled_at =
(
SELECT MAX(scheduled_at) FROM mis.pr_approval_time
WHERE country = h.country
AND scheduled_at <= h.created_at
AND TIME_TO_SEC(TIMEDIFF(h.created_at, scheduled_at)) < 91
);
When I run the above statement or even just...
SELECT t.account_id FROM mis.hj_approval_survey h INNER JOIN mis.pr_approval_time t ON h.country = t.country AND t.scheduled_at =
(
SELECT MAX(scheduled_at) FROM mis.pr_approval_time
WHERE country = h.country
);
...it runs forever and does not seem to finish. There are only ~3,400 rows in hj_approval_survey table and 29,000 rows in pr_approval_time. I run this on an Amazon AWS instance with 15+ GB RAM.
Now, if I simply right click on pr_approval_time table and choose ALTER TABLE option, and just close without doing anything, then the above queries run within seconds.
I guess that when I trigger the ALTER TABLE option and Workbench populates the table fields, it somehow improves the execution plan, but I am not sure why. Has anyone faced anything similar? How can I trigger a better execution plan without right-clicking the table and choosing 'ALTER TABLE'?
EDIT
It may be worth mentioning that my organisation also uses DOMO. Originally, I had this set up as a MySQL Dataflow on DOMO, but the query would not complete on most occasions, although I have observed it finish at times.
This was the reason why I moved this query back to our AWS MySQL RDS. So the problem has been observed not only on our own MySQL RDS, but probably also on DOMO.
I suspect this is slow because of the correlated subquery (subquery depends on row values from parent table, meaning it has to execute for each row). I'd try and rework the pr_approval_time table slightly so it's point-in-time and then you can use the JOIN to pick the correct rows without doing a correlated subquery. Something like:
SELECT
hj_approval_survey.country
, hj_approval_survey.created_at
, pr_approval_time.account_id
FROM
mis.hj_approval_survey AS hj_approval_survey
JOIN (
SELECT
current_row.country
, current_row.scheduled_at AS scheduled_at_start
, COALESCE( MIN( next_row.scheduled_at ), NOW() ) AS scheduled_at_end
FROM
mis.pr_approval_time AS current_row
LEFT OUTER JOIN
mis.pr_approval_time AS next_row ON (
next_row.country = current_row.country
AND next_row.scheduled_at > current_row.scheduled_at
)
GROUP BY
current_row.country
, current_row.scheduled_at
) AS pr_approval_pit ON (
pr_approval_pit.country = hj_approval_survey.country
AND ( hj_approval_survey.created_at >= pr_approval_pit.scheduled_at_start
AND hj_approval_survey.created_at < pr_approval_pit.scheduled_at_end
)
)
JOIN mis.pr_approval_time AS pr_approval_time ON (
pr_approval_time.country = pr_approval_pit.country
AND pr_approval_time.scheduled_at = pr_approval_pit.scheduled_at_start
)
WHERE
TIME_TO_SEC( TIMEDIFF( hj_approval_survey.created_at, pr_approval_time.scheduled_at ) ) < 91
Assuming you have proper indexes on the columns involved in the join, you could try refactoring your query using a grouped subquery and joining on country:
SELECT t.account_id
FROM mis.hj_approval_survey h
INNER JOIN mis.pr_approval_time t ON h.country = t.country
INNER JOIN (
SELECT country, MAX(scheduled_at) max_sched
FROM mis.pr_approval_time
group by country
) z on z.country = t.country and t.scheduled_at = z.max_sched
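For reference, this is how the grouped-subquery join behaves on a tiny made-up data set. The sketch runs on Python's sqlite3, so the mis. schema prefix is dropped, and all rows are hypothetical. Note that it corresponds to the second, simplified query in the question: it picks the latest scheduled_at per country and does not take h.created_at into account.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hj_approval_survey (country TEXT, created_at TEXT);
    CREATE TABLE pr_approval_time (account_id INTEGER, country TEXT,
                                   scheduled_at TEXT);
    -- Hypothetical sample rows.
    INSERT INTO hj_approval_survey VALUES
        ('US', '2020-01-01 10:00:00'),
        ('FR', '2020-01-01 11:00:00');
    INSERT INTO pr_approval_time VALUES
        (1, 'US', '2020-01-01 09:00:00'),
        (2, 'US', '2020-01-01 09:59:30'),
        (3, 'FR', '2020-01-01 10:59:00');
""")

# The MAX per country is computed once in the derived table z,
# instead of once per outer row as in a correlated subquery.
LATEST_PER_COUNTRY = """
    SELECT t.account_id
    FROM hj_approval_survey AS h
    INNER JOIN pr_approval_time AS t ON h.country = t.country
    INNER JOIN (
        SELECT country, MAX(scheduled_at) AS max_sched
        FROM pr_approval_time
        GROUP BY country
    ) AS z ON z.country = t.country AND t.scheduled_at = z.max_sched
"""

account_ids = sorted(row[0] for row in conn.execute(LATEST_PER_COUNTRY))
print(account_ids)
```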
I would like to reduce the processing time of my SQL request (currently it runs for 10 minutes...).
I think the problem comes from the nested SQL queries.
(Sorry for my English, I'm a French student.)
SELECT DISTINCT `gst.codeAP21`, `gst.email`, `gst.date`, `go.amount`
FROM globe_statistique
JOIN globe_customers ON `gst.codeAP21`=`gc.codeAP21`
JOIN globe_orders ON `gc.ID`=`go.FK_ID_customers`
WHERE `gst.page` = 'send_order'
AND `gst.date` = FROM_UNIXTIME(`go.date`,'%%Y-%%m-%%d')
UNION
SELECT DISTINCT `gst.codeAP21`, `gst.email`, `gst.date`, '-'
FROM globe_statistique
WHERE `gst.page` NOT LIKE 'send_order'
AND (`gst.codeAP21`,`gst.date`) NOT IN
( SELECT `gst.codeAP21`,`gst.date` FROM globe_statistique
WHERE `gst.page`='send_order');
Thanks
Try this:
SELECT DISTINCT `gst.codeAP21`, `gst.email`, `gst.date`, `go.amount`
FROM globe_statistique
JOIN globe_customers ON `gst.codeAP21`=`gc.codeAP21`
JOIN globe_orders ON `gc.ID`=`go.FK_ID_customers`
WHERE `gst.page` = 'send_order'
AND `gst.date` = FROM_UNIXTIME(`go.date`,'%%Y-%%m-%%d')
UNION
SELECT DISTINCT t1.`gst.codeAP21`, t1.`gst.email`, t1.`gst.date`, '-'
FROM globe_statistique t1
left join globe_statistique t2 on t1.`gst.codeAP21` = t2.`gst.codeAP21` and t1.`gst.date` = t2.`gst.date` and t2.`gst.page` = 'send_order'
WHERE t1.`gst.page` <> 'send_order' AND t2.`gst.date` is null
But I recommend renaming your columns and removing the dots.
Also use EXPLAIN to find out why the query is slow, and add the appropriate index.
Try to avoid the use of DISTINCT. To that end, use UNION ALL; a GROUP BY at the end gives the same result:
select codeAP21, email, date, amount
from ( /* your query without DISTINCT, but with UNION ALL */ ) as u
group by codeAP21, email, date, amount
see: Huge performance difference when using group by vs distinct