I'm trying to execute a SELECT query over a fairly simple (but large) table, and I'm seeing over 10x slower performance when I don't join on a certain secondary table.
TableA is keyed on two columns, 'ID1' & 'ID2', and has a total of 10 numeric (int + dbl) columns.
TableB is keyed on 'ID1' and has a total of 2 numeric (int) columns.
SELECT AVG(NULLIF(dollarValue, 0))
FROM TableA
INNER JOIN TableB ON TableA.ID1 = TableB.ID1
WHERE TableA.ID2 = 5
  AND TableA.ID1 BETWEEN 15000 AND 20000
As soon as the join is removed, performance takes a major hit. The query above takes 0.016 seconds to run while the query below takes 0.216 seconds to run.
The end goal is to replace TableA's 'ID1' with TableB's 2nd (non-key) column and deprecate TableB.
SELECT AVG(NULLIF(dollarValue, 0))
FROM TableA
WHERE ID2 = 5
  AND ID1 BETWEEN 15000 AND 20000
Both tables have indexes on their primary keys. The relationship between the two tables is One-to-Many. DB Engine is MyISAM.
Scenario 1 (fast):
id  select_type  table   type    possible_keys  key      key_len  ref                    rows  Extra
1   SIMPLE       TableB  range   PRIMARY        PRIMARY  4        NULL                   498   Using where; Using index
1   SIMPLE       TableA  eq_ref  PRIMARY        PRIMARY  8        schm.TableA.ID1,const  1
Scenario 2 (slow):
id  select_type  table   type   possible_keys  key      key_len  ref   rows    Extra
1   SIMPLE       TableA  range  PRIMARY        PRIMARY  8        NULL  288282  Using where
The row count and the lack of any mention of an index in scenario 2 definitely stand out, but why would that be the case?
Both queries return the same 700 rows -- same data.
Given your query, I'd say an index like this might be useful:
CREATE INDEX i ON tableA(ID2, ID1)
A possible reason why your first query is much faster is that you probably have only a few records in TableB, which makes the join predicate far more selective than the range predicate alone.
I suggest reading up on indexes. Knowing just a few details about how they work will let you tune your queries simply by choosing better indexes.
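As a sketch (the index name is arbitrary; table and column names follow the question), a composite index with the equality column first and the range column second lets the optimizer jump straight to the matching (ID2, ID1) slice instead of scanning the whole ID1 range:

```sql
-- Composite index: equality column (ID2) first, range column (ID1) second.
-- Appending dollarValue as a third column would additionally make it a
-- covering index for this query.
CREATE INDEX idx_id2_id1 ON TableA (ID2, ID1);

-- Check that the optimizer now prefers the new index over a wide
-- range scan of the PRIMARY KEY:
EXPLAIN
SELECT AVG(NULLIF(dollarValue, 0))
FROM TableA
WHERE ID2 = 5
  AND ID1 BETWEEN 15000 AND 20000;
```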
Related
I have a query joining a table with many millions of rows (tableA) to another with just 7,000 rows (tableB). The query searches tableA for rows whose date falls between a from-date and a to-date taken from tableB.
I have a where on tableB on the id to limit it to a set of ids.
When I use WHERE tableB.id IN ('1234'), it returns in a few seconds.
Same thing for WHERE tableB.id IN ('3456').
But if I use WHERE tableB.id IN ('1234', '3456'), it runs forever.
You can see the 2 explains are very different.
Why does it switch from an index range scan to a non-unique key lookup just because I pick 1 or 2 ids from the other table?
SELECT COUNT(*)
FROM tableA t
JOIN tableC b ON t.tableC_id = b.id
JOIN tableB tr ON t.bassin_id = tr.bassinid
WHERE t.date BETWEEN tr.date_entree AND tr.date_sortie_reelle
  AND tr.id IN ('1234', '4567')
When you have IN ('1234'), it is treated like = '1234'. The access type is 'const':
The table has at most one matching row, which is read at the start of
the query. Because there is only one row, values from the column in
this row can be regarded as constants by the rest of the optimizer.
const tables are very fast because they are read only once.
const is used when you compare all parts of a PRIMARY KEY or UNIQUE
index to constant values
When you have IN ('1234', '4567'), it uses 'range':
Only rows that are in a given range are retrieved, using an index to
select the rows. The key column in the output row indicates which
index is used. The key_len contains the longest key part that was
used. The ref column is NULL for this type.
range can be used when a key column is compared to a constant using
any of the =, <>, >, >=, <, <=, IS NULL, <=>, BETWEEN, LIKE, or IN()
operators:
Read more here: https://dev.mysql.com/doc/refman/8.0/en/explain-output.html#explain-join-types
P.S.: I hope you have indexes on the columns you use in the ON and WHERE clauses. I have tested the performance on 2 tables with row counts similar to those you specified (all indexes set), using LEFT JOINs and SELECT SQL_NO_CACHE, and it is pretty fast in both cases: under 1 second on an i7 8700K with 32 GB DDR3 and an NVMe SSD.
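To see the switch in access type yourself (using the question's table names), compare the plans for the one-value and two-value forms in isolation:

```sql
-- One value: the predicate collapses to an equality on the PRIMARY KEY,
-- so tr can be read as a single constant row (type: const).
EXPLAIN
SELECT COUNT(*) FROM tableB tr WHERE tr.id IN ('1234');

-- Two values: the same predicate is now a range over the index
-- (type: range), which also changes the join plan the optimizer
-- considers for the full query.
EXPLAIN
SELECT COUNT(*) FROM tableB tr WHERE tr.id IN ('1234', '3456');
```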
(I found that the same question already exists, but I was not happy with the level of detail, so I came here for help; forgive me for my ignorance.)
DELETE FROM supportrequestresponse # ~3 million records
WHERE SupportRequestID NOT IN (
SELECT SR.SupportRequestID
FROM supportrequest AS SR # ~1 million records
)
Or
DELETE SRR
FROM supportrequestresponse AS SRR # ~3 million records
LEFT JOIN supportrequest AS SR
ON SR.SupportRequestID = SRR.SupportRequestID # ~1 million records
WHERE SR.SupportRequestID IS NULL
Specifics
Database: MySQL
SR.SupportRequestID is INTEGER PRIMARY KEY
SRR.SupportRequestID is INTEGER INDEX
SR.SupportRequestID & SRR.SupportRequestID are not in FOREIGN KEY relation
Both tables contain TEXT columns for subject and message
Both tables are InnoDB
Motive: I am planning to use this in a periodic clean-up job, likely once an hour or every two hours. It is very important to avoid a lengthy operation in order to avoid table locks, as this is a very busy database and I am already over quota on deadlocks!
EXPLAIN query 1:
id  select_type         table                   type             possible_keys  key      key_len  ref   rows  Extra
1   PRIMARY             supportrequestresponse  ALL              NULL           NULL     NULL     NULL  410   Using where
2   DEPENDENT SUBQUERY  SR                      unique_subquery  PRIMARY        PRIMARY  4        func  1     Using index
EXPLAIN query 2:
id  select_type  table  type    possible_keys  key      key_len  ref                   rows  Extra
1   SIMPLE       SRR    ALL     NULL           NULL     NULL     NULL                  410
1   SIMPLE       SR     eq_ref  PRIMARY        PRIMARY  4        SRR.SupportRequestID  1     Using where; Using index; Not exists
RUN #2
EXPLAIN query 1:
id  select_type         table                   type             possible_keys  key      key_len  ref   rows       Extra
1   PRIMARY             supportrequestresponse  ALL              NULL           NULL     NULL     NULL  157209473  Using where
2   DEPENDENT SUBQUERY  SR                      unique_subquery  PRIMARY        PRIMARY  4        func  1          Using index; Using where; Full scan on NULL key
EXPLAIN query 2:
id  select_type  table  type    possible_keys  key      key_len  ref                   rows       Extra
1   SIMPLE       SRR    ALL     NULL           NULL     NULL     NULL                  157209476
1   SIMPLE       SR     eq_ref  PRIMARY        PRIMARY  4        SRR.SupportRequestID  1          Using where; Using index; Not exists
I suspect it would be quicker to create a new table, retaining just the rows you wish to keep. Then drop the old table. Then rename the new table.
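A sketch of that approach, using the question's table names (RENAME TABLE swaps both names in a single atomic statement):

```sql
-- 1. Build a new table containing only the rows worth keeping.
CREATE TABLE supportrequestresponse_new LIKE supportrequestresponse;

INSERT INTO supportrequestresponse_new
SELECT SRR.*
FROM supportrequestresponse AS SRR
JOIN supportrequest AS SR
  ON SR.SupportRequestID = SRR.SupportRequestID;

-- 2. Swap the tables atomically, then discard the old data.
RENAME TABLE supportrequestresponse     TO supportrequestresponse_old,
             supportrequestresponse_new TO supportrequestresponse;
DROP TABLE supportrequestresponse_old;
```

Note that rows written to the original table between the copy and the rename would be lost, so this is only safe during a maintenance window or with writes paused.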
I don't know how to describe this, but the following worked as the answer in my case; an unbelievable one!
DELETE SRR
FROM supportrequestresponse AS SRR
LEFT JOIN (
SELECT SRR3.SupportRequestResponseID
FROM supportrequestresponse AS SRR3
LEFT JOIN supportrequest AS SR ON SR.SupportRequestID = SRR3.SupportRequestID
WHERE SR.SupportRequestID IS NULL
LIMIT 999
) AS SRR2 ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID
WHERE SRR2.SupportRequestResponseID IS NOT NULL;
... # same piece of SQL
... # same piece of SQL
... # same piece of SQL (repeated 99 times in total)
A fork of the second pattern looks and feels more appropriate than having MySQL match each row against a dynamic list, but that is the minor point. I limited the row selection to only 999 rows at a time, which lets each DELETE finish in the blink of an eye; most importantly, I repeated the same DELETE statement 99 times, one after another!
This basically made it super comfortable for a cron job. The 99 separate statements let the database engine leave the tables unlocked between deletes, so other processes don't get stuck waiting for one long DELETE to finish, while each individual DELETE takes very little time. It is something like vehicles passing through a crossroads in zipper fashion.
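Instead of pasting the statement 99 times, the same chunked delete could be wrapped in a stored procedure that loops until nothing is left to remove (a sketch; the procedure name is arbitrary, and the LIMIT sits inside the derived table because MySQL does not allow LIMIT in a multi-table DELETE):

```sql
DELIMITER //
CREATE PROCEDURE purge_orphan_responses()
BEGIN
  REPEAT
    -- Same statement as above: delete at most 999 orphaned rows per pass.
    DELETE SRR
    FROM supportrequestresponse AS SRR
    LEFT JOIN (
        SELECT SRR3.SupportRequestResponseID
        FROM supportrequestresponse AS SRR3
        LEFT JOIN supportrequest AS SR
          ON SR.SupportRequestID = SRR3.SupportRequestID
        WHERE SR.SupportRequestID IS NULL
        LIMIT 999
    ) AS SRR2 ON SRR2.SupportRequestResponseID = SRR.SupportRequestResponseID
    WHERE SRR2.SupportRequestResponseID IS NOT NULL;
  UNTIL ROW_COUNT() = 0 END REPEAT;  -- stop when a pass deletes nothing
END//
DELIMITER ;
```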
I have a pretty long INSERT query that inserts data from a SELECT query into a table. The problem is that the SELECT takes too long to execute. The table is MyISAM, and the SELECT locks it, which affects other users of the table. I have found that the problem in the query is a join.
When I remove this part of the query, it takes less than a second to execute, but when I leave it in, the query takes more than 15 minutes:
LEFT JOIN enq_217 Pex_217
ON e.survey_panelId = Pex_217.survey_panelId
AND e.survey_respondentId = Pex_217.survey_respondentId
AND Pex_217.survey_respondentId != 0
db.table_1 contains 590,145 rows and e contains 4,703 rows.
Explain Output:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY e ALL survey_endTime,survey_type NULL NULL NULL 4703 Using where
1 PRIMARY Pex_217 ref survey_respondentId,idx_table_1 idx_table_1 8 e.survey_panelId,e.survey_respondentId 2 Using index
2 DEPENDENT SUBQUERY enq_11525_timing eq_ref code code 80 e.code 1
How can I edit this part of the query to be faster?
I suggest creating an index on the table db.table_1 for the fields panelId and respondentId
You want an index on the table. The best index for this logic is:
create index idx_table_1 on table_1(panelId, respondentId)
The order of these two columns in the index should not matter.
You might want to include other columns in the index, depending on what the rest of the query is doing.
Note: a single index with both columns is different from two indexes with each column.
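A sketch of that difference (the index names are hypothetical):

```sql
-- One composite index: a lookup on the pair (panelId, respondentId)
-- is satisfied by a single seek into a single index.
CREATE INDEX idx_panel_respondent ON table_1 (panelId, respondentId);

-- Two single-column indexes: the optimizer generally uses only one of
-- them (or attempts an index merge); neither one covers the pair by itself.
CREATE INDEX idx_panel      ON table_1 (panelId);
CREATE INDEX idx_respondent ON table_1 (respondentId);
```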
Why is it a LEFT join?
How many rows in Pex_217?
Run ANALYZE TABLE on each table used. (This sometimes helps MyISAM; rarely is needed for InnoDB.)
Since the 'real problem' seems to be that the query "holds up other users", switch to InnoDB.
Tips on conversion
The JOIN is not that bad (with the new index -- note "Using index"): 4,703 rows are scanned, with roughly 2 probes into the other table's index for each.
Perhaps the "Dependent subquery" is the costly part. Let's see that.
I wanted to find all hourly records that have a successor in a ~5M-row table.
I tried :
SELECT DISTINCT date_time
FROM my_table
JOIN (SELECT DISTINCT DATE_ADD(date_time, INTERVAL 1 HOUR) AS date_offset
      FROM my_table) offset_dates
  ON date_time = date_offset
and
SELECT DISTINCT date_time
FROM my_table
WHERE date_time IN (SELECT DISTINCT DATE_ADD(date_time, INTERVAL 1 HOUR)
                    FROM my_table)
The first one completes in a few seconds; the second hangs for hours. I can understand that the former might be better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN outputs for both queries (the JOIN version first, then the IN version):
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
I would prefix both queries with EXPLAIN and then compare the differences in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied earlier than the WHERE clause. So in the WHERE version you are taking every record from my_table, applying an arithmetic function to it, and then sorting, because SELECT DISTINCT usually requires a sort and sometimes creates a temporary table in memory or on disk. The number of rows examined is probably the product of the sizes of the two tables.
But in the JOIN version, many of the rows that would be examined and sorted in the WHERE version are probably eliminated beforehand. You probably end up looking at far fewer rows, and the database probably takes cheaper measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
The IN clause is usually slow for huge tables. As far as I remember, for the second statement you posted, it will simply loop through all rows of my_table (unless you have an index there), checking each row for a match against the WHERE clause. In general, IN is treated as a set of OR clauses, one for each element of the set.
That's why, I think, going through the temporary tables that are created behind the scenes by the JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another thing to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it's just going to run that query as-is.
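For completeness, the same question can also be phrased with a correlated EXISTS (a sketch against the question's table; DATE_SUB is used so that the inner lookup compares the bare, indexable column s.date_time):

```sql
-- A row qualifies if some row exists exactly one hour earlier,
-- i.e. the current row is that earlier row's successor.
SELECT DISTINCT t.date_time
FROM my_table AS t
WHERE EXISTS (
    SELECT 1
    FROM my_table AS s
    WHERE s.date_time = DATE_SUB(t.date_time, INTERVAL 1 HOUR)
);
```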
I have the following tables (example)
t1 (20,000 rows, 60 columns, primary key t1_id)
t2 (40,000 rows, 8 columns, primary key t2_id)
t3 (50,000 rows, 3 columns, primary key t3_id)
t4 (30,000 rows, 4 columns, primary key t4_id)
sql query:
SELECT COUNT(*) AS count FROM (t1)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id
I have created indexes on the columns involved in the joins (e.g. on t1.t2_id) and foreign keys where necessary. The query is slow (600 ms), and if I add WHERE clauses (e.g. WHERE t1.column10 = 1, where column10 has no index), it becomes much slower. The queries I run with SELECT * and LIMIT are fast, and I can't understand the COUNT behaviour. Any solution?
EDIT: EXPLAIN SQL ADDED
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t4 index PRIMARY user_id 4 NULL 5259 Using index
1 SIMPLE t2 ref PRIMARY,t4_id t4_id 4 t4.t4_id 1 Using index
1 SIMPLE t1 ref t2_id t2_id 4 t2.t2_id 1 Using index
1 SIMPLE t3 ref PRIMARY PRIMARY 4 t2.t2_id 1 Using index
where user_id is a column of t4 table
EDIT: I changed from InnoDB to MyISAM and got a speed increase, especially when I add WHERE clauses, but I still see times of 100-150 ms. The reason I want the count in my application is to show the user who is submitting a search form the number of results to expect, via AJAX. Maybe there is a better solution for this, for example a temporary table that is updated every hour?
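The hourly-refreshed count mentioned above could be sketched like this (the cache table and event names are hypothetical, and CREATE EVENT requires the event scheduler to be enabled):

```sql
-- One-row cache holding the precomputed count.
CREATE TABLE result_count_cache (
    id           TINYINT  NOT NULL PRIMARY KEY,
    row_count    BIGINT   NOT NULL,
    refreshed_at DATETIME NOT NULL
);

-- Refresh the cached count once an hour instead of on every request.
CREATE EVENT refresh_result_count
ON SCHEDULE EVERY 1 HOUR
DO
  REPLACE INTO result_count_cache (id, row_count, refreshed_at)
  SELECT 1, COUNT(*), NOW()
  FROM t1
  JOIN t2 ON t1.t2_id = t2.t2_id
  JOIN t3 ON t2.t3_id = t3.t3_id
  JOIN t4 ON t3.t4_id = t4.t4_id;

-- The AJAX endpoint then reads a single row:
SELECT row_count FROM result_count_cache WHERE id = 1;
```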
The COUNT query is fast simply because of an index-only scan, as stated in the query plan. The query you mention consists solely of indexed columns, and that's why during execution there is no need to touch the physical data: the whole query is answered from the indexes. When you add a clause involving columns that are not indexed, or that are indexed in a way that prevents index usage, the engine has to fetch rows stored in the heap table by physical address, which is very slow.
EDIT:
Another important thing is that those are PKs, so they are UNIQUE. The optimizer chooses to perform an INDEX RANGE SCAN on the first index, and then only checks whether the keys exist in the subsequent indexes (that's why the plan states only one row will be returned).
EDIT2:
Thanks to J. Bruni: in fact that is a clustered index, so the above isn't the whole truth. There is probably a full scan on the first table, followed by three subsequent index accesses to confirm that the FKs exist.
COUNT iterates over the whole result set and does not depend on indexes. Use EXPLAIN ANALYZE on your query to check how it is executed.
SELECT with LIMIT does not iterate over the whole result set, hence it's faster.
Regarding the COUNT(*) slow performance: are you using InnoDB engine? See:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
"SELECT COUNT(*)" is slow, even with where clause
The main information seems to be: "InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages."
So, one possible solution is to create a separate index and force its usage through the USE INDEX hint in the SQL query. Look at this comment for a sample usage report:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/comment-page-1/#comment-529049
Regarding the WHERE issue, the query will perform better if you put the condition in the JOIN clause, like this:
SELECT COUNT(t1.t1_id) AS count FROM (t1)
JOIN t2 ON (t1.column10 = 1) AND (t1.t2_id = t2.t2_id)
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id