I have a query joining a table with many millions of rows (tableA) to another with just 7,000 rows (tableB). The query searches tableA for the rows whose date falls between a from and to date taken from tableB.
I have a WHERE clause on tableB.id to limit it to a set of ids.
When I use where tableB.id in ('1234')
it returns in a few seconds.
Same thing for where tableB.id in ('3456')
But if I use where tableB.id in ('1234', '3456') then it runs forever.
You can see the two EXPLAIN outputs are very different.
Why is it switching from an index range scan to a non-unique key lookup just because I pass two ids from the other table instead of one?
SELECT count(*)
FROM tableA t
join tableC b on t.tableC_id = b.id
join tableB tr on t.bassin_id = tr.bassinid
where t.date between tr.date_entree and tr.date_sortie_reelle
and tr.id in ( '1234', '4567')
When you have IN ('1234') it is treated like = '1234'. The access type is 'const'.
The table has at most one matching row, which is read at the start of
the query. Because there is only one row, values from the column in
this row can be regarded as constants by the rest of the optimizer.
const tables are very fast because they are read only once.
const is used when you compare all parts of a PRIMARY KEY or UNIQUE
index to constant values
When you have IN ('1234', '4567') it uses 'range'
Only rows that are in a given range are retrieved, using an index to
select the rows. The key column in the output row indicates which
index is used. The key_len contains the longest key part that was
used. The ref column is NULL for this type.
range can be used when a key column is compared to a constant using
any of the =, <>, >, >=, <, <=, IS NULL, <=>, BETWEEN, LIKE, or IN()
operators:
Read more here: https://dev.mysql.com/doc/refman/8.0/en/explain-output.html#explain-join-types
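To see the switch yourself, you can prepend EXPLAIN to the same query with one id and with two ids and compare the two plans side by side (this uses the tables and values from the question):
EXPLAIN SELECT count(*)
FROM tableA t
join tableC b on t.tableC_id = b.id
join tableB tr on t.bassin_id = tr.bassinid
where t.date between tr.date_entree and tr.date_sortie_reelle
and tr.id in ('1234');           -- tableB accessed as 'const'
EXPLAIN SELECT count(*)
FROM tableA t
join tableC b on t.tableC_id = b.id
join tableB tr on t.bassin_id = tr.bassinid
where t.date between tr.date_entree and tr.date_sortie_reelle
and tr.id in ('1234', '4567');   -- tableB accessed as 'range'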
P.S.: I hope you have indexes on the columns you use in the ON and WHERE clauses. I have tested the performance on two tables with a similar number of rows to what you specified (all indexes set), using LEFT JOINs and SELECT SQL_NO_CACHE, and it is pretty fast in both cases: under 1 second on an i7 8700K with 32 GB DDR3 and an NVMe SSD.
Related
I have around 420 million records in my table. There is only one index, on column colC of user_table. The query below returns around 1.5 million records based on colC, but somehow the index is not used and the query takes 20 to 25 minutes to return the records.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
But when I use FORCE INDEX, the index does get used and the records are returned in only 2 minutes. My question is: why is MySQL not using the index by default, when the fetch time is so much lower with it? I have recreated the index and run a repair, but nothing makes it used by default.
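For reference, the FORCE INDEX variant that returns in 2 minutes presumably looks something like this (a sketch; the index name colC is taken from the CREATE TABLE in the update below):
select colA, ColB, count(*) as count
from user_table FORCE INDEX (colC)
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA, ColB;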
Another observation, for information: the same query (without FORCE INDEX) works fine for previous months (which have the same volume of data).
Update: the details asked for by Evert
CREATE TABLE USER_TABLE (
id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
COLA varchar(10) DEFAULT NULL,
COLB int(11) DEFAULT NULL,
COLC datetime DEFAULT NULL,
....
PRIMARY KEY (id),
KEY colA (COLA),
KEY colB (COLB),
KEY colC (COLC)
) ENGINE=MyISAM AUTO_INCREMENT=2328036072 DEFAULT CHARSET=latin1
For better performance you could try using a composite index based on the columns involved in your WHERE clause,
and try to change the IN clause into an inner join.
Assuming your IN clause content is a set of fixed values, you could use a UNION (or a new table with the values you need),
e.g. using the UNION (you can do something similar if the IN clause is a subquery):
select user_table.colA, ColB, count(*) as count
from user_table
INNER JOIN (
select 'FIXED1' colA
union
select 'FIXED2'
....
union
select 'FIXEDX'
) t on t.colA = user_table.colA
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and ColB = 17
group by user_table.colA, ColB;
You could also add a composite index on user_table over the columns (colA, colB, colC).
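A minimal sketch of adding that composite index (the index name is illustrative):
ALTER TABLE user_table ADD INDEX idx_colA_colB_colC (colA, colB, colC);
With colA and colB (the equality predicates) leading and colC (the range predicate) last, the index matches the shape of the query's WHERE clause.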
As for the elements the MySQL query optimizer uses to decide which index to use: there are several aspects, and for each of them the optimizer assigns a cost.
Among the things it takes into consideration are:
the columns involved in the WHERE clause
the size of the tables (and, in your case, the size of the tables in the join)
an estimate of how many rows will be fetched (to decide whether to use an index or simply scan the table)
whether the data types match between the columns in the join and WHERE clauses
the use of functions or data type conversions, including collation mismatches
the size of the index
the cardinality of the index
For each of these a cost is evaluated, and that leads to the choice of index.
In your case, colC being a datetime compared against string literals could imply a data conversion, and that may be why the index is not chosen.
That is also why I suggested a composite index whose leftmost columns are the ones that need no conversion.
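If you want to see the cardinality figures the optimizer is working from, you can refresh and inspect them (standard MySQL statements; the numbers shown are estimates):
ANALYZE TABLE user_table;
SHOW INDEX FROM user_table;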
Indexes try to get used as best as possible. I can't guarantee it, but it SOUNDS like the engine is building a temporary index based on A & B to qualify the static values in your query. For 420+ million rows, that is just the time it takes to build such a temporary index. By forcing an index you are skipping that work, which is what optimizes the time.
Now, if you (and others) don't quite understand indexes, an index is a way of pre-grouping data to help the optimizer. When you have GROUP BY conditions, those components, where practical, should be part of the index, and TYPICALLY would be part of the criteria, as you have in your query.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
Now, let's look at your index, which is only on ColC. For the purposes of the scenario, assume the records are grouped by day. Pretend that each INDEX (single or compound) is stored in its own room. You have an index on just the date column C. In that room you have 30 boxes (representing Sept 1 to Sept 30), not counting all the other boxes for other days. Now you have to go through each day's box and look for all the entries that have the ColA and ColB values you want. The stuff in the box is not sorted, so you have to look at every record. Now do this for all 30 days of September.
Now, simulate the NEXT index, with its boxes stored in another room. This room holds a compound index based on (in this order, to help optimize your query) columns A, B and C.
So now, you could have 100 entries for "A". You only care about ColA = "some static value", so you grab that one box.
Now you open that box and see a bunch of smaller boxes... Oh... these are all the individual "Column B" records. The top of each box shows its individual "B" value, so you find the one box with the value 17.
Finally, you open box B and look inside. Wow... the entries are all nicely sorted for you by date. So now you scroll quickly to find Sept 1 and pull all the entries up to Sept 30 that you are looking for.
Quickly getting to the source through an optimized index will help you in the long run. Having an index on
(colA, colB, colC)
will significantly help your query performance.
One final note: since you are only querying for a single "A" and a single "B" value, you would get only a single row back and would not need a GROUP BY clause (in this case).
Hope this helps you and others better understand how indexes work, individual vs compound (multi-column).
One additional advantage of a multi-column index: in a case like this, where all the queried columns are part of the index, the database does not have to go to the raw data pages to confirm the other columns. You are only looking at the values A, B and C, and all of these fields are part of the index, so it never has to go back to the raw data pages where the actual row is stored to confirm that the row qualifies to be returned.
With a single-column index such as yours, it uses the index to find which records qualify (by date in this case). Then, for each record, it has to go to the raw data page holding the entire row (which could have 50 columns) just to confirm whether the A and B columns qualify, and discard the row if not. Then back to the index by date, then back to the raw data page to check its A and B... You can probably see how much more time is spent going back and forth.
The second index already has "A", "B" and the pre-sorted date range of "C". Done without having to go to the raw data pages.
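One way to check this, assuming the compound (colA, colB, colC) index has been created, is to EXPLAIN the original query and look for "Using index" in the Extra column, which means the raw data pages are never visited:
EXPLAIN
select colA, ColB, count(*) as count
from user_table
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA, ColB;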
I have a pretty long INSERT query that inserts data from a SELECT query into a table. The problem is that the SELECT query takes too long to execute. The table is MyISAM, and the SELECT locks the table, which affects other users who also use it. I have found that the problem with the query is a join.
When I remove this part of the query it takes less than a second to execute, but when I leave it in, the query takes more than 15 minutes:
LEFT JOIN enq_217 Pex_217
ON e.survey_panelId = Pex_217.survey_panelId
AND e.survey_respondentId = Pex_217.survey_respondentId
AND Pex_217.survey_respondentId != 0
db.table_1 contains 590,145 rows and e contains 4,703 rows.
Explain Output:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY e ALL survey_endTime,survey_type NULL NULL NULL 4703 Using where
1 PRIMARY Pex_217 ref survey_respondentId,idx_table_1 idx_table_1 8 e.survey_panelId,e.survey_respondentId 2 Using index
2 DEPENDENT SUBQUERY enq_11525_timing eq_ref code code 80 e.code 1
How can I edit this part of the query to be faster?
I suggest creating an index on the table db.table_1 for the fields panelId and respondentId
You want an index on the table. The best index for this logic is:
create index idx_table_1 on table_1(panelId, respondentId)
The order of these two columns in the index should not matter.
You might want to include other columns in the index, depending on what the rest of the query is doing.
Note: a single index with both columns is different from two separate indexes, one on each column.
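To make that distinction concrete, here is a sketch (index names are illustrative):
-- one composite index covering both join columns:
create index idx_panel_resp on table_1 (panelId, respondentId);
-- not equivalent: two separate single-column indexes
create index idx_panel on table_1 (panelId);
create index idx_resp on table_1 (respondentId);
Only the composite index lets the join look up both columns in a single index probe.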
Why is it a LEFT join?
How many rows in Pex_217?
Run ANALYZE TABLE on each table used. (This sometimes helps MyISAM; rarely is needed for InnoDB.)
Since the 'real problem' seems to be that the query "holds up other users", switch to InnoDB.
Tips on conversion
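If you do convert, a minimal sketch looks like this (back the table up first; enq_217 is the joined table from the question, and each table used by the INSERT would need the same treatment):
ALTER TABLE enq_217 ENGINE=InnoDB;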
The JOIN is not that bad (with the new index -- note Using index): 4,703 rows are scanned, then it reaches into the other table's index about 2 times for each of them.
Perhaps the "Dependent subquery" is the costly part. Let's see that.
I'm trying to execute a select query over a fairly simple (but large) table and am getting over 10x slower performance when I don't join on a certain secondary table.
TableA is keyed on two columns, 'ID1' & 'ID2', and has a total of 10 numeric (int + dbl) columns.
TableB is keyed on 'ID1' and has a total of 2 numeric (int) columns.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
TableA
INNER JOIN
TableB
ON
TableA.ID1 = TableB.ID1
WHERE
TableA.ID2 = 5
AND
TableA.ID1 BETWEEN 15000 AND 20000
As soon as the join is removed, performance takes a major hit. The query above takes 0.016 seconds to run while the query below takes 0.216 seconds to run.
The end goal is to replace TableA's 'ID1' with TableB's 2nd (non-key) column and deprecate TableB.
SELECT
AVG(NULLIF(dollarValue, 0))
FROM
tableA
WHERE
ID2 = 5
AND
ID1 BETWEEN 15000 AND 20000
Both tables have indexes on their primary keys. The relationship between the two tables is One-to-Many. DB Engine is MyISAM.
Scenario 1 (fast):
id stype table type possKey key kln ref rws extra
1 SIMPLE TableB range PRIMARY PRIMARY 4 498 Using where; Using index
1 SIMPLE TableA eq_ref PRIMARY PRIMARY 8 schm.TableA.ID1,const 1
Scenario 2 (slow):
id stype table type possKey key key_len ref rows extra
1 SIMPLE TableA range PRIMARY PRIMARY 8 288282 Using where
Row count and lack of any mention of an index in scenario 2 definitely stand out, but why would that be the case?
700 results from both queries -- same data.
Given your query, I'd say an index like this might be useful:
CREATE INDEX i ON tableA(ID2, ID1)
A possible reason why your first query is much faster is that you probably have only a few records in tableB, which makes the join predicate very selective compared to the range predicate.
I suggest reading up on indexes. Knowing two or three details about them will help you easily tune your queries just by choosing better indexes.
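As a possible extension (my own assumption, not something the answer requires): including dollarValue as the last column would make the index covering for this query, so the AVG can be computed from the index alone.
CREATE INDEX i2 ON tableA(ID2, ID1, dollarValue)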
I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds; the second hangs for hours.
I can understand that the former is better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
I would prefix both queries with EXPLAIN and then compare the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied earlier than the WHERE clause. So with the WHERE-clause version you are getting every record from my_table, applying an arithmetic function, and then sorting them, because SELECT DISTINCT usually requires a sort and sometimes creates a temporary table in memory or on disk. The number of rows examined is probably the product of the sizes of the two tables.
But with the JOIN, a lot of the rows that would be examined and sorted in the WHERE-clause version are probably eliminated beforehand. You probably end up looking at far fewer rows, and the database probably takes easier measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
The IN clause is usually slow for huge tables. As far as I remember, the second statement you posted will simply loop through all rows of my_table (unless you have an index there), checking each row against the WHERE clause. In general, IN is treated as a set of OR clauses, one for each element of the set.
That's why, I think, the temporary table created in the background for the JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
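In the spirit of the last link, a sketch of the temporary-table variant (table and column names come from the question; the index name is illustrative):
-- materialise the shifted timestamps once, index them, then join against them
CREATE TEMPORARY TABLE offset_dates
SELECT DISTINCT DATE_ADD(date_time, INTERVAL 1 HOUR) AS date_offset
FROM my_table;
ALTER TABLE offset_dates ADD INDEX idx_offset (date_offset);
SELECT DISTINCT m.date_time
FROM my_table m
JOIN offset_dates o ON o.date_offset = m.date_time;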
Another thing to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it is just going to run that query as is.
I have the following tables (example)
t1 (20,000 rows, 60 columns, primary key t1_id)
t2 (40,000 rows, 8 columns, primary key t2_id)
t3 (50,000 rows, 3 columns, primary key t3_id)
t4 (30,000 rows, 4 columns, primary key t4_id)
SQL query:
SELECT COUNT(*) AS count FROM (t1)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id
I have created indexes on the columns involved in the joins (e.g. on t1.t2_id) and foreign keys where necessary. The query is slow (600 ms), and if I add WHERE clauses (e.g. WHERE t1.column10 = 1, where column10 doesn't have an index), the query becomes much slower. The queries I do with SELECT * and LIMIT are fast, and I can't understand the COUNT behaviour. Any solution?
EDIT: EXPLAIN SQL ADDED
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t4 index PRIMARY user_id 4 NULL 5259 Using index
1 SIMPLE t2 ref PRIMARY,t4_id t4_id 4 t4.t4_id 1 Using index
1 SIMPLE t1 ref t2_id t2_id 4 t2.t2_id 1 Using index
1 SIMPLE t3 ref PRIMARY PRIMARY 4 t2.t2_id 1 Using index
where user_id is a column of t4 table
EDIT: I changed from InnoDB to MyISAM and got a speed increase, especially when I add WHERE clauses, but I still see times of 100-150 ms. The reason I want the count in my application is to show the user who is submitting a search form how many results to expect, via AJAX. Maybe there is a better solution for this, for example a summary table that is updated every hour?
The count query is simply faster because of an INDEX ONLY SCAN, as stated in the query plan. The query you mention touches only indexed columns, which is why during execution there is no need to touch the physical data: the whole query is answered from the indexes. When you add a clause on columns that are not indexed, or that are indexed in a way that prevents index usage, the rows stored in the heap table have to be accessed by physical address, which is very slow.
EDIT:
Another important thing is that those are PKs, so they are UNIQUE. The optimizer chooses to perform an INDEX RANGE SCAN on the first index and only checks whether the keys exist in the subsequent indexes (that's why the plan states only one row will be returned for each).
EDIT2:
Thanks to J. Bruni: in fact that is a clustered index, so the above isn't the "whole truth". There is probably a full scan on the first table, and three subsequent INDEX ACCESSes to confirm the FK existence.
COUNT iterates over the whole result set and does not depend on indexes. Use EXPLAIN ANALYZE on your query to check how it is executed.
SELECT with LIMIT does not iterate over the whole result set, hence it is faster.
Regarding the COUNT(*) slow performance: are you using InnoDB engine? See:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
"SELECT COUNT(*)" is slow, even with where clause
The main information seems to be: "InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages."
So, one possible solution is to create a separate secondary index and force its usage through the USE INDEX hint in the SQL query. Look at this comment for a sample usage report:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/comment-page-1/#comment-529049
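A sketch of what that hint could look like on the query from the question (the index name t2_id is the one the EXPLAIN output above shows being used on t1; adjust to whichever secondary index you create):
SELECT COUNT(*) AS count FROM t1 USE INDEX (t2_id)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id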
Regarding the WHERE issue, the query will perform better if you put the condition in the JOIN clause, like this:
SELECT COUNT(t1.t1_id) AS count FROM (t1)
JOIN t2 ON (t1.column10 = 1) AND (t1.t2_id = t2.t2_id)
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id