MySQL IN clause slow with 10 or more items

This query takes 18 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDTO', 'BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
This query takes 4 seconds
SELECT `wd`.`week` AS `start_week`, `wd`.`hold_code`, COUNT(wd.hold_code) AS hold_code_count
FROM `weekly_data` AS `wd`
JOIN aol_reporting_hold_codes hc ON hc.hold_code = wd.hold_code AND chart = 'GR'
WHERE `wd`.`days` <= 6
AND `wd`.`hold_code` IS NOT NULL
AND NOT `wd`.`hold_code` = ''
AND `wd`.`week` >= '201717'
AND `wd`.`itemgroup` IN ('BOTDWG', 'C&FORG', 'C&FOTO', 'MF-SUB', 'MI-SUB', 'PROPRI', 'PROPTO', 'STRSTO', 'STRSUB')
AND `production_type` = 2
AND `contract` = "1234"
AND `project` = 8
GROUP BY `start_week`, `wd`.`hold_code`
All I have done is removed one item from the IN clause. I can remove any one of the items. It runs in 4 seconds as long as there are 9 items or less. It takes 18 seconds to run as soon as I increase to 10 items.
I thought MySQL limited the length of a command only by size, i.e. around 1 MB.

Go beyond a plain EXPLAIN: use EXPLAIN FORMAT=JSON and capture the "Optimizer trace" for the query. I suspect the length of the IN list leads the optimizer to pick a different query plan.
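For reference, capturing the optimizer trace looks like this (a minimal sketch using standard MySQL commands):
SET optimizer_trace = 'enabled=on';
-- run the slow query here, then:
SELECT * FROM information_schema.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';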
There is virtually no limit to the number of items in IN. I have seen as many as 70K.
That aside, you may be able to speed up even the 4-sec version...
I suggest adding this index. Grrr... I can't tell which columns are in which tables. If these are all in one table, then make an index such as:
INDEX(production_type, contract, project) -- in any order
If those are all in wd, then tack on a 4th column: any of week, itemgroup, or days. A sketch follows.
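For example (a sketch assuming all four columns live in weekly_data; the index name is illustrative):
ALTER TABLE weekly_data
    ADD INDEX idx_wd_filter (production_type, contract, project, week);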
Be cautious about COUNT(wd.hold_code).
COUNT(x) checks x for being non-NULL; is that what you want? If not, then simply say COUNT(*).
When you JOIN, then GROUP BY, you get an "explode-implode": the number of intermediate rows gets big, and that is when the COUNT is performed.
It seems wrong to both COUNT(hold_code) and GROUP BY hold_code. What are you trying to do?
For further discussion, please provide SHOW CREATE TABLE and EXPLAIN.

Please note that the MySQL IN clause limit is governed by the max_allowed_packet value. You could check whether NOT IN returns results faster. I would also suggest putting the values to be checked by the IN clause into a single string rather than a comma-separated list, and giving that a try.
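To inspect or raise that limit (a minimal sketch; the 64 MB value is just an example):
SHOW VARIABLES LIKE 'max_allowed_packet';
SET GLOBAL max_allowed_packet = 67108864;  -- 64 MB; applies to new connections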

Related

The SQL query below is taking too much time. How can I make it faster?

SELECT
call_id
,call_date
,call_no
,call_amountdue
,rechargesamount
,call_penalty
,callpayment_received
,calldiscount
FROM `call`
WHERE calltype = 'Regular'
AND callcode = 98
AND call_connect = 1
AND call_date < '2018-01-01'
ORDER BY
`call_date` DESC
,`call_id` DESC
limit 1
Indexes already exist on call_date, callcode, calltype, and call_connect.
The table has 10 million records and the query takes 2 minutes.
How can I get results within 3 seconds?
INDEX (calltype, callcode, call_connect, -- in any order
call_date, -- next
call_id) -- last
This will make it possible to find the one row that is desired without having to step over other rows.
Since you seem to have INDEX(calltype), drop it; it will get in the way and is redundant anyway. The rest of the indexes you mentioned will be ignored for this query.
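A sketch of the change (the single-column index name is an assumption; check SHOW CREATE TABLE for the real one):
ALTER TABLE `call`
    DROP INDEX calltype,
    ADD INDEX idx_call_lookup (calltype, callcode, call_connect, call_date, call_id);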
More discussion in Index Cookbook

How to Find First Valid Row in SQL Based on Difference of Column Values

I am trying to find a reliable query which returns the first instance of an acceptable insert range.
Research:
Some of the links below address similar questions, but I could not get any of them to work for me.
Find first available date, given a date range in SQL
Find closest date in SQL Server
MySQL difference between two rows of a SELECT Statement
How to find a gap in range in SQL
and more...
Objective Query Function:
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
Where InsertRange(1) is the value the query should return. In other words, this would be the first instance where the above condition is satisfied.
Table Structure:
Primary Key: StartRange
StartRange(i-1) < StartRange(i)
StartRange(i-1) + EndRange(i-1) < StartRange(i)
Example Dataset
Below is an example user table (3 columns) with a given range distribution. StartRange values are always strictly ascending, UserIDs are arbitrary strings, and only the sequences of StartRange and EndRange matter:
StartRange EndRange UserID
312 6896 user0
7134 16268 user1
16877 22451 user2
23137 25142 user3
25955 28272 user4
28313 35172 user5
35593 38007 user6
38319 38495 user7
38565 45200 user8
46136 48007 user9
My current Query
I am trying to use this query at the moment:
SELECT t2.StartRange, t2.EndRange
FROM user AS t1, user AS t2
WHERE (t1.StartRange - t2.StartRange+1) > NewValue
ORDER BY t1.EndRange
LIMIT 1
Example Case
Given the table, if NewValue = 800, then the returned answer should be 23137. This means the first available slot would be between user3 and user4 (with an actual slot size of 813):
InsertRange(1) = (StartRange(i) - EndRange(i-1)) > NewValue
InsertRange(1) = 23137 (user3's StartRange), since StartRange(user4) - EndRange(user3) = 25955 - 25142 = 813 > 800
More Comments
My query above seemed to work for the special case where StartRanges were tightly packed (i.e. StartRange(i) = StartRange(i-1) + EndRange(i-1) + 1). It no longer works with a less tightly packed set of StartRanges.
Keep in mind that SQL tables have no implicit row order. It seems fair to order your table by StartRange value, though.
We can start to solve this by writing a query to obtain each row paired with the row preceding it. In MySQL (before 8.0's window functions), it's hard to do this cleanly because there is no built-in row-numbering function.
This works (http://sqlfiddle.com/#!9/4437c0/7/0). It may have nasty performance because it generates O(n^2) intermediate rows. There's no row for user0; it can't be paired with any preceding row because there is none.
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
Then, you can use that as a subquery, and apply your conditions, which are
gap >= 800
first matching row (lowest StartRange value) ORDER BY SB
just one LIMIT 1
Here's the query (http://sqlfiddle.com/#!9/4437c0/11/0)
SELECT SB-EA Gap,
EA+1 Beginning_of_gap, SB-1 Ending_of_gap,
UserId UserID_after_gap
FROM (
select MAX(a.StartRange) SA, MAX(a.EndRange) EA,
b.StartRange SB, b.EndRange EB , b.UserID
from user a
join user b ON a.EndRange <= b.StartRange
group by b.StartRange, b.EndRange, b.UserID
) pairs
WHERE SB-EA >= 800
ORDER BY SB
LIMIT 1
Notice that you may actually want the smallest matching gap instead of the first matching gap. That's called best fit, rather than first fit. To get that you use ORDER BY SB-EA instead.
Edit: There is another way to use MySQL to join adjacent rows that doesn't have the O(n^2) performance issue. It involves employing user variables to simulate a row_number() function. The query involved is a hairball (that's a technical term). It's described in the third alternative of the answer to this question: How do I pair rows together in MYSQL?
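For reference, on MySQL 8.0+ the adjacent-row pairing can be done with a window function, avoiding both the O(n^2) self-join and the user-variable hairball. A minimal sketch against the same user table (not part of the original answer):
SELECT StartRange - prev_end AS Gap,
       prev_end + 1 AS Beginning_of_gap,
       StartRange - 1 AS Ending_of_gap,
       UserID AS UserID_after_gap
FROM (
    SELECT StartRange, UserID,
           LAG(EndRange) OVER (ORDER BY StartRange) AS prev_end
    FROM user
) t
WHERE StartRange - prev_end >= 800
ORDER BY StartRange
LIMIT 1;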

Optimize query mysql search

I have the following SQL, but its execution is very slow, taking about 45 seconds. The table has 15 million records. How can I improve it?
SELECT A.*, B.ESPECIE
FROM
(
SELECT
A.CODIGO_DOCUMENTO,
A.DOC_SERIE,A.DATA_EMISSAO,
A.DOC_NUMERO,
A.CF_NOME,
A.CF_SRF,
A.TOTAL_DOCUMENTO,
A.DOC_MODELO
FROM MOVIMENTO A
WHERE
A.CODIGO_EMPRESA = 1
AND A.CODIGO_FILIAL = 5
AND A.DOC_TIPO_MOVIMENTO = 1
AND A.DOC_MODELO IN ('65','55')
AND (A.CF_NOME LIKE '%TEXT_SEARCH%'
OR A.CF_CODIGO LIKE 'TEXT_SEARCH%'
OR A.CF_SRF LIKE 'TEXT_SEARCH%'
OR A.DOC_SERIE LIKE 'TEXT_SEARCH%'
OR A.DOC_NUMERO LIKE 'TEXT_SEARCH%')
ORDER BY A.DATA_EMISSAO DESC , A.CODIGO_DOCUMENTO DESC
LIMIT 0, 100
) A
LEFT JOIN MODELODOCUMENTOFISCAL B ON A.DOC_MODELO = B.CODMODELO
For this query, I would start with an index on MOVIMENTO(CODIGO_EMPRESA, CODIGO_FILIAL, DOC_MODELO) and MODELODOCUMENTOFISCAL(CODMODELO).
That should speed the query.
If it doesn't, you may need to consider a full-text search to handle the LIKE clauses. I do note that only one of the patterns has a leading wildcard. Is that intentional?
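A sketch of those suggested indexes (the index names are illustrative):
ALTER TABLE MOVIMENTO
    ADD INDEX idx_mov_filter (CODIGO_EMPRESA, CODIGO_FILIAL, DOC_MODELO);
ALTER TABLE MODELODOCUMENTOFISCAL
    ADD INDEX idx_modelo (CODMODELO);
-- If the LIKE predicates remain the bottleneck, a FULLTEXT index is one option:
ALTER TABLE MOVIMENTO ADD FULLTEXT INDEX ft_cf_nome (CF_NOME);
-- searched via MATCH(CF_NOME) AGAINST ('TEXT_SEARCH' IN BOOLEAN MODE)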

query optimization for mysql

I have the following query which takes about 28 seconds on my machine. I would like to optimize it and know if there is any way to make it faster by creating some indexes.
select rr1.person_id as person_id, rr1.t1_value, rr2.t0_value
from (select r1.person_id, avg(r1.avg_normalized_value1) as t1_value
from (select ma1.person_id, mn1.store_name, avg(mn1.normalized_value) as avg_normalized_value1
from matrix_report1 ma1, matrix_normalized_notes mn1
where ma1.final_value = 1
and (mn1.normalized_value != 0.2
and mn1.normalized_value != 0.0 )
and ma1.user_id = mn1.user_id
and ma1.request_id = mn1.request_id
and ma1.request_id = 4 group by ma1.person_id, mn1.store_name) r1
group by r1.person_id) rr1
,(select r2.person_id, avg(r2.avg_normalized_value) as t0_value
from (select ma.person_id, mn.store_name, avg(mn.normalized_value) as avg_normalized_value
from matrix_report1 ma, matrix_normalized_notes mn
where ma.final_value = 0 and (mn.normalized_value != 0.2 and mn.normalized_value != 0.0 )
and ma.user_id = mn.user_id
and ma.request_id = mn.request_id
and ma.request_id = 4
group by ma.person_id, mn.store_name) r2
group by r2.person_id) rr2
where rr1.person_id = rr2.person_id
Basically, it aggregates data depending on the request_id and final_value (0 or 1). Is there a way to simplify it for optimization? And it would be nice to know which columns should be indexed. I created an index on user_id and request_id, but it doesn't help much.
There are about 4907424 rows on matrix_report1 and 335740 rows on matrix_normalized_notes table. These tables will grow as we have more requests.
First, the others are right about formatting your samples better. Explaining in plain language what you are trying to do also helps, and including sample data and expected results is even better.
However, that said, I think it can be significantly simplified. Your queries are almost completely identical except for the final_value test (1 or 0 respectively). Since each query results in one record per person_id, you can compute both averages with CASE/WHEN and remove the rest.
To help optimize the query, your matrix_report1 table should have an index on ( request_id, final_value, user_id ). Your matrix_normalized_notes table should have an index on ( request_id, user_id, store_name, normalized_value ).
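A sketch of those indexes (the names are illustrative):
ALTER TABLE matrix_report1
    ADD INDEX idx_mr1 (request_id, final_value, user_id);
ALTER TABLE matrix_normalized_notes
    ADD INDEX idx_mnn (request_id, user_id, store_name, normalized_value);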
Since your outer query averages the per-store averages, you do need to keep it nested. The following should help.
SELECT
r1.person_id,
avg(r1.ANV1) as t1_value,
avg(r1.ANV0) as t0_value
from
( select
ma1.person_id,
mn1.store_name,
avg( case when ma1.final_value = 1
then mn1.normalized_value end ) as ANV1,
avg( case when ma1.final_value = 0
then mn1.normalized_value end ) as ANV0
from
matrix_report1 ma1
JOIN matrix_normalized_notes mn1
ON ma1.request_id = mn1.request_id
AND ma1.user_id = mn1.user_id
AND NOT mn1.normalized_value in ( 0.0, 0.2 )
where
ma1.request_id = 4
AND ma1.final_Value in ( 0, 1 )
group by
ma1.person_id,
mn1.store_name) r1
group by
r1.person_id
Notice the inner query pulls all transactions whose final_value is either zero or one, but each AVG is based on a CASE/WHEN for the respective value. When the condition does not match (1 or 0 respectively), the result is NULL and is thus not considered when the average is computed.
So at this point, the data is already grouped per person and store, with ANV1 and ANV0 set. Now roll these values up directly per person, regardless of store. Again, NULL values are not considered in the average computation, so if store "A" doesn't have a value in ANV1, it will not skew the results; similarly if store "B" doesn't have a value in ANV0.

How to best get daily SUM for minute-level data?

I have a data set consisting of minute-by-minute data. My goal is to return minute-by-minute records, and add calculations that create sums of a certain field for the past 24 hours, counting back from each minute record.
The query I have is the following:
SELECT main.recorded_at AS x,
       (SELECT SUM(precipitation)
        FROM data AS sub
        WHERE sub.host = main.host
          AND sub.recorded_at BETWEEN SUBTIME(main.recorded_at, '24:00:00') AND main.recorded_at) AS y
FROM data AS main
WHERE host = 'xxxx'
ORDER BY x ASC;
Is there a more efficient way to write this query? I have tried, but failed, so far, using LEFT JOINS and different GROUP BYs.
When I explain this query, I get the following:
id  select_type         table  type  possible_keys     key   key_len  ref    rows  filtered  Extra
1   PRIMARY             main   ref   host              host  767      const  4038  100.00    Using where; Using filesort
2   DEPENDENT SUBQUERY  sub    ref   host,recorded_at  host  767      const  4038  100.00    Using where
In total, the query takes about 200 seconds to run with 8000 records, getting slower all the time. My goal is to get the aggregate 24-hour precipitation for each result, and somehow in under 2 seconds.
Maybe I'm going about this the wrong way? I'm open to suggestions for other avenues to get the same result. :)
Thanks!
~Mike
Assuming I'm understanding your question correctly, it looks like you can use SUM with CASE to achieve the same result without using the correlated subquery.
SELECT recorded_at AS x,
SUM(CASE WHEN recorded_at BETWEEN SUBTIME(recorded_at, '24:00:00') AND recorded_at
THEN precipitation END) As y
FROM data
WHERE host = 'xxxx'
GROUP BY recorded_at
ORDER BY x ASC;
While I'm not sure it would yield better performance, I do think an OUTER JOIN with GROUP BY would solve your issue:
SELECT main.recorded_at AS x,
SUM(sub.precipitation) As y
FROM data main LEFT JOIN data sub ON
main.host = sub.host AND
sub.recorded_at BETWEEN SUBTIME(main.recorded_at, '24:00:00') AND main.recorded_at
WHERE main.host = 'xxxx'
GROUP BY main.recorded_at
ORDER BY x ASC;
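For reference, on MySQL 8.0+ a window function with a RANGE frame expresses the rolling 24-hour sum directly, without a self-join (a sketch, not part of the original answers):
SELECT recorded_at AS x,
       SUM(precipitation) OVER (
           ORDER BY recorded_at
           RANGE BETWEEN INTERVAL 24 HOUR PRECEDING AND CURRENT ROW
       ) AS y
FROM data
WHERE host = 'xxxx'
ORDER BY x;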