MySQL (MyISAM) SELECT query takes too long with join - mysql

I have a fairly long INSERT query that inserts data from a SELECT query into a table. The problem is that the SELECT takes too long to execute. The table is MyISAM, and the SELECT locks the table, which affects other users who also use it. I have found that the cause of the problem is a join.
When I remove this part of the query, it takes less than a second to execute, but when I leave it in, the query takes more than 15 minutes:
LEFT JOIN enq_217 Pex_217
ON e.survey_panelId = Pex_217.survey_panelId
AND e.survey_respondentId = Pex_217.survey_respondentId
AND Pex_217.survey_respondentId != 0
db.table_1 contains 5,90,145 rows and e contains 4,703 rows.
Explain Output:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY e ALL survey_endTime,survey_type NULL NULL NULL 4703 Using where
1 PRIMARY Pex_217 ref survey_respondentId,idx_table_1 idx_table_1 8 e.survey_panelId,e.survey_respondentId 2 Using index
2 DEPENDENT SUBQUERY enq_11525_timing eq_ref code code 80 e.code 1
How can I edit this part of the query to be faster?

I suggest creating an index on the table db.table_1 covering the fields panelId and respondentId.

You want an index on the table. The best index for this logic is:
create index idx_table_1 on table_1(panelId, respondentId)
The order of these two columns in the index should not matter.
You might want to include other columns in the index, depending on what the rest of the query is doing.
Note: a single index with both columns is different from two indexes with each column.
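To check whether the composite index is actually used for the join, something like the following could be run. This is only a sketch: table_1 is the anonymized name from the answer, and survey_entries is a hypothetical name for the table aliased as e, which the question does not give. In the EXPLAIN output from the question, key_len 8 together with two ref columns suggests that both index columns are used, assuming each is a 4-byte NOT NULL INT.

```sql
-- List the indexes on the table and the column order within each one.
-- table_1 is the anonymized table name used in the answer.
SHOW INDEX FROM table_1;

-- Re-check the plan for the join part of the query.
EXPLAIN
SELECT e.survey_panelId, e.survey_respondentId
FROM survey_entries e            -- hypothetical name for the table aliased as e
LEFT JOIN enq_217 Pex_217
       ON e.survey_panelId = Pex_217.survey_panelId
      AND e.survey_respondentId = Pex_217.survey_respondentId
      AND Pex_217.survey_respondentId != 0;
```

If the key column for Pex_217 shows the composite index and ref lists both columns of e, the join is doing an index lookup on both columns.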

Why is it a LEFT join?
How many rows in Pex_217?
Run ANALYZE TABLE on each table used. (This sometimes helps MyISAM; rarely is needed for InnoDB.)
Since the 'real problem' seems to be that the query "holds up other users", switch to InnoDB.
Tips on conversion
The JOIN is not that bad (with the new index -- note Using index): 4703 rows scanned, then reach into the other table's index about 2 times each.
Perhaps the "Dependent subquery" is the costly part. Let's see that.
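A minimal sketch of the two maintenance steps suggested above, using enq_217 as the example table (repeat for each table in the query). Note that changing the engine rebuilds the table, so on a large table it can take a while and should be tried on a copy first.

```sql
-- Refresh the key distribution statistics the optimizer relies on.
ANALYZE TABLE enq_217;

-- Check the current engine and approximate row count.
SHOW TABLE STATUS LIKE 'enq_217';

-- Convert to InnoDB so readers do not block on table-level locks.
-- This rebuilds the whole table.
ALTER TABLE enq_217 ENGINE = InnoDB;
```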

Related

SQL query with subquery takes longer than both queries separately

Problem
I have two queries where one needs the result of the other one. My first guess was to use an independent subquery:
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= (
SELECT MIN(P1.delivery_start)
FROM ExampleTable P1
WHERE 1641288602 < P1.delivery_end
);
The entire query takes 5-6 seconds, which is way too long for my application. Running these queries one after another takes only around 800ms for both:
SELECT MIN(P1.delivery_start)
FROM ExampleTable P1
WHERE 1641288602 < P1.delivery_end;
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= 1641286800;
I am using Mariadb 10.2 and have indices on both delivery_start and delivery_end.
What I have tried
I have used a CTE instead of a subquery, which resulted in the same performance. Using a variable with SET yields results similar to running both queries separately, so that's what I will use for the time being.
I ran EXPLAIN on all 3 Queries:
1. Query with subquery
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY P2 ALL delivery_start NULL NULL NULL 6388282 Using where
2 SUBQUERY P1 range delivery_end delivery_end 4 NULL 36378 Using index condition
2. Separate Queries
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE P1 range delivery_end delivery_end 4 NULL 36432 Using index condition
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE P2 range delivery_start delivery_start 4 NULL 35944 Using index condition
Question
I think the issue is shown in the first EXPLAIN table as it has type ALL which means that the database performs a full table scan. My question is simply: why? Is the optimizer not able to figure out that the subquery produces a number with which we only need a range type query? And why does it not use any index?
The problem is described in the MariaDB docs:
In all remaining cases when NULL cannot be substituted with FALSE, it
is not possible to use index lookups. This is not a limitation in the
server, but a consequence of the NULL semantics in the ANSI SQL
standard.
There is a full examination here:
https://mariadb.com/kb/en/non-semi-join-subquery-optimizations/
The result of your subquery can potentially return a NULL in the case no rows were found. Hence, MariaDB cannot use the index for the parent query.
You must either rewrite your subquery so that it always returns a row with a non-NULL scalar, or stick with two separate queries. However, what should happen if your first query returns NULL? With a compound statement you could put an IF around the second query and skip it entirely if the first returns NULL.
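The two-step approach with a user variable, which the question reports as being about as fast as running the queries separately, could look like this. It is a sketch only; @min_start is an illustrative variable name.

```sql
-- Step 1: compute the lower bound once and keep it in a session variable.
SET @min_start = (
    SELECT MIN(P1.delivery_start)
    FROM ExampleTable P1
    WHERE 1641288602 < P1.delivery_end
);

-- Step 2: the outer query now compares against a known value.
-- If @min_start is NULL (no matching rows in step 1), the comparison
-- is never true and the query returns no rows.
SELECT P2.*
FROM ExampleTable P2
WHERE P2.delivery_start >= @min_start;
```

Running EXPLAIN on the second statement should show whether the delivery_start index is picked up as a range scan, as it was for the separate queries.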
Replace these
INDEX(delivery_start)
INDEX(delivery_end)
with these:
INDEX(delivery_start, delivery_end)
INDEX(delivery_end, delivery_start)
The second one will help significantly with the subquery. Then the first may help with the outer query.
(If those don't help, please add SHOW CREATE TABLE, EXPLAIN SELECT ... and table sizes.)
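The index swap suggested above might be written as a single ALTER TABLE. This assumes the existing indexes are named delivery_start and delivery_end, as the key column of the EXPLAIN output suggests; idx_start_end and idx_end_start are made-up names. On a table of this size (about 6.4 million rows per the EXPLAIN) the rebuild will take some time.

```sql
ALTER TABLE ExampleTable
    DROP INDEX delivery_start,                               -- assumed existing index name
    DROP INDEX delivery_end,                                 -- assumed existing index name
    ADD INDEX idx_start_end (delivery_start, delivery_end),  -- for the outer query
    ADD INDEX idx_end_start (delivery_end, delivery_start);  -- for the subquery
```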

Optimizing query with fts + composite index

I have the following query:
SELECT *
FROM table
WHERE
structural_type=1
AND parent_id='167F2-F'
AND points_to_id=''
# AND match(search) against ('donotmatch124213123123')
The search takes about 10ms to run, running on the composite index (structural_type, parent_id, points_to_id). However, when I add in the fts index, the query balloons to taking ~1s, regardless of what is contained in the match criteria. Basically it seems like it 'skips the index' whenever I have a fts search applied.
What would be the best way to optimize this query?
Update: a few explains:
EXPLAIN SELECT... # without fts
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL ref structural_type structural_type 209 const,const,const 2 100.00 NULL
With fts (also adding 'force index'):
explain SELECT ... force INDEX (structural_type) AND match...
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL fulltext structural_type,search search 0 const 1 5.00 Using where; Ft_hints: sorted
The only thing I can think of which would be incredibly hack-ish, would be to add an additional term to the fts so it does the filter 'within' that. For example:
fts_term += " StructuralType1ParentID167F2FPointsToID"
The MySQL optimizer can only use one index for your WHERE clause, so it has to choose between the composite one and the FULLTEXT one.
Since it can't run both queries to benchmark which one is faster, it estimates how fast the different execution plans will be.
To do so, MySQL uses internal statistics it keeps about each table. But those statistics can be very different from reality if they aren't updated while the data in the table changes.
Running an OPTIMIZE TABLE table query lets MySQL refresh its table statistics, so it can make better estimates and choose the better index.
Try expressing this without the full text logic, using like:
SELECT *
FROM table
WHERE structural_type = 1 AND
parent_id ='167F2-F' AND
points_to_id = '' AND
search not like '%donotmatch124213123123%';
The index should still be used for the first three columns. LIKE might be slow, but if not many rows match the first three, this might not be as bad as using the full text index.

How to prevent a SQL INNER JOIN with ON cond1 OR cond2 from ignoring keys and doing a full table scan

I have a simple query which does not work as expected. In spite of an index, the join part of the query ignores it and does a full table scan. Here is the query:
SELECT m0.id_field,
attr_73217_
FROM object_73195_ o
INNER JOIN master_slave m0
ON ( m0.id_object = 73130
OR m0.id_object = 82344)
AND ( m0.id_master = 73195
OR m0.id_master = 82413)
AND m0.id_slave_field = o.id
ORDER BY
o.id_order
The EXPLAIN command returns the following lines:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE m0 ALL id_object,id_master,id_slave_field,id_slave_field_2,id_object_2,id_object_3 \N \N \N 2782 Using where; Using temporary; Using filesort
1 SIMPLE o eq_ref PRIMARY PRIMARY 8 project.m0.id_slave_field 1 Using where
As you can see, it does not use the key, even though it was created like this:
ALTER TABLE master_slave ADD INDEX (id_object,id_master,id_slave_field);
The interesting thing is that if I comment out m0.id_field from the SELECT part, the first row's access type (given by the EXPLAIN command) turns into range, the query starts to use the key id_object_3 and, also very important, it now scans fewer rows in the master_slave table. But the catch is, I need m0.id_field in my SELECT part. I guess I need to do something with my indices, but I do not know what exactly.
EDIT
I tried to add other keys like this:
ALTER TABLE master_slave ADD INDEX (id_field);
ALTER TABLE master_slave ADD INDEX (id_object);
But the EXPLAIN command returns the very same set of rows - no keys and full table scan. The whole trouble is caused by m0.id_field in the select part.
EDIT
I just added a bunch of indices to master_slave table:
ALTER TABLE master_slave ADD INDEX (id_field,id_object,id_master,id_slave_field);
ALTER TABLE master_slave ADD INDEX (id_object,id_field,id_master,id_slave_field);
ALTER TABLE master_slave ADD INDEX (id_object,id_master,id_field,id_slave_field);
ALTER TABLE master_slave ADD INDEX (id_object,id_master,id_slave_field,id_field);
Each index lowered the number of scanned rows. My special thanks to kordirko.
@Jacobian - this is not an answer to your question, or maybe only a partial answer.
I am writing here because my explanation is too long to fit in a comment.
If the select statement does not contain m0.id_field, then the query refers to only 3 fields from the m0 table: id_object, id_master, id_slave_field.
Since a covering index exists on that table for these 3 columns, the obvious choice is to scan this index instead of the table. The index (the index file on the disk) is much smaller than the table, and reading the index costs less than reading the table.
We say covering index when the index contains all required columns retrieved by the query, and the query can retrieve all information directly from the index --> see : http://en.wikipedia.org/wiki/Database_index#Covering_index
If you add m0.id_field to the select clause, then there is no index that contains all of these 4 columns, and in this case the query must read values of this column from the table.
It can do it in two ways:
1. using the index to filter rows, then access rows in the table using primary keys obtained from the index (row by row - random access).
2. scanning the whole table, without touching any index
The first method is good in cases where the expected number of rows is small (<5%, maybe <10% of the table). Remember that a DBMS cannot read a single row from the disk; it must always read a whole page. To obtain one row of, for example, 50 bytes, it must read the whole page, whose size is 5k or 10k or more (the page length depends on settings). Some optimizations are possible: for example MySQL, while scanning the index, first collects PK values in memory, then sorts them, and finally scans the table using these PKs in ascending order, to minimise the number of pages retrieved from the disk. But it is still random access, which is slower than sequential reading (the disk must seek its heads to a random track instead of reading track by track).
If the expected number of rows is huge (34% of the table in our case), using the second method (scanning the whole table) is much cheaper than first scanning and filtering the index, then sorting the result of the scan, then accessing the table using the PKs retrieved from the index. The final number of pages that must be read from the disk is lower (scanning the index also has to read some pages from the disk).
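Following this reasoning, one way to let the query stay on the index is to include id_field in it, as the asker's last edit does. The sketch below also writes the OR pairs as IN lists, which is equivalent and a little easier to read. idx_cover is a hypothetical index name; the column order matches the last index the asker added.

```sql
-- Covering index: the three join/filter columns plus the selected column,
-- so m0 can be answered from the index alone.
ALTER TABLE master_slave
    ADD INDEX idx_cover (id_object, id_master, id_slave_field, id_field);

-- Re-check the plan; "Using index" in Extra for m0 would indicate
-- that the table rows themselves are no longer read.
EXPLAIN
SELECT m0.id_field,
       attr_73217_
FROM object_73195_ o
INNER JOIN master_slave m0
        ON m0.id_object IN (73130, 82344)
       AND m0.id_master IN (73195, 82413)
       AND m0.id_slave_field = o.id
ORDER BY o.id_order;
```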

MySQL performance difference between JOIN and IN

I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds, the second hangs for hours.
I can understand that the former is better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, on MySQL a query using a join will perform better than an equivalent query using IN (subquery): the join can take advantage of indexes, whereas the IN subquery may be executed as a dependent subquery (as the second EXPLAIN above shows), so the entire IN list must be scanned for each row that might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
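Applied to the two EXPLAIN outputs in the question, that multiplication can be worked out directly. This is a rough estimate only: the row counts are the optimizer's guesses, and the join plan additionally scans about 5.6 million index entries once to build the derived table.

```sql
SELECT
    1710 * 555     AS join_rows_estimate,  -- derived table rows x rows per lookup = 949050
    9244 * 5129983 AS in_rows_estimate;    -- outer rows x dependent subquery rows = 47421562852
```

Roughly one million row visits against roughly 47 billion is consistent with "a few seconds" versus "hangs for hours".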
I would prefix both queries by explain, and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied more immediately than the WHERE clause. So, in the WHERE clause you are getting every record from my_table, applying an arithmetic function, and then sorting them because select distinct usually requires a sort and sometimes it creates a temporary table in memory or on disk. The # of rows examined is probably the product of the size of each table.
But in the JOIN clause, a lot of the rows that are being examined and sorted in the WHERE clause are probably eliminated beforehand. You probably end up looking at far fewer rows... and the database probably takes easier measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
'IN' clause is usually slow for huge tables. As far as I remember, for the second statement you printed out - it will simply loop through all rows of my_table (unless you have index there) checking each row for match of WHERE clause. In general IN is treated as a set of OR clauses with all set elements in it.
That's why, I think, using temporary tables that are created in background of JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another thing to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it is always going to run that subquery.

sql count results query with joins performance

I have the following tables (example)
t1 (20.000 rows, 60 columns, primary key t1_id)
t2 (40.000 rows, 8 columns, primary key t2_id)
t3 (50.000 rows, 3 columns, primary key t3_id)
t4 (30.000 rows, 4 columns, primary key t4_id)
sql query:
SELECT COUNT(*) AS count FROM (t1)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id
I have created indexes on columns that affect the join (e.g on t1.t2_id) and foreign keys where necessary. The query is slow (600 ms) and if I put where clauses (e.g. WHERE t1.column10 = 1, where column10 doesn't have index), the query becomes much slower. The queries I do with select (*) and LIMIT are fast, and I can't understand count behaviour. Any solution?
EDIT: EXPLAIN SQL ADDED
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE t4 index PRIMARY user_id 4 NULL 5259 Using index
1 SIMPLE t2 ref PRIMARY,t4_id t4_id 4 t4.t4_id 1 Using index
1 SIMPLE t1 ref t2_id t2_id 4 t2.t2_id 1 Using index
1 SIMPLE t3 ref PRIMARY PRIMARY 4 t2.t2_id 1 Using index
where user_id is a column of t4 table
EDIT: I changed from InnoDB to MyISAM and got a speed increase, especially when I add WHERE clauses. But I still see times of 100-150 ms. The reason I want the count in my application is to show the user who is filling in a search form, via ajax, the number of results to expect. Maybe there is a better solution for this, for example creating a temporary table that is updated every hour?
The count query is faster simply because of an index-only scan, as stated in the query plan (Using index). The query you mention involves only indexed columns, which is why there is no need to touch the physical data during execution: the whole query is performed on indexes. When you add a clause involving columns that are not indexed, or indexed in a way that prevents index usage, the data stored in the heap table must be accessed by physical address, which is very slow.
EDIT:
Another important thing is that those are PKs, so they are UNIQUE. The optimizer chooses to perform an INDEX RANGE SCAN on the first index, and only checks whether keys exist in subsequent indexes (that's why the plan states there will be only one row returned).
EDIT2:
Thanks to J. Bruni: in fact that is a clustered index, so the above isn't the "whole truth". There is probably a full scan on the first table, and three subsequent index accesses to confirm the FK existence.
COUNT iterates over the whole result set and does not depend on indexes. Use EXPLAIN ANALYZE (available in MySQL 8.0.18 and later) on your query to check how it is executed.
SELECT with LIMIT does not iterate over the whole result set, hence it's faster.
Regarding the COUNT(*) slow performance: are you using InnoDB engine? See:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/
"SELECT COUNT(*)" is slow, even with where clause
The main information seems to be: "InnoDB uses clustered primary keys, so the primary key is stored along with the row in the data pages, not in separate index pages."
So, one possible solution is to create a separate index and force its usage through a USE INDEX hint in the SQL query. Look at this comment for a sample usage report:
http://www.mysqlperformanceblog.com/2006/12/01/count-for-innodb-tables/comment-page-1/#comment-529049
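A sketch of that idea on the tables from the question. idx_t1_t2_id is a hypothetical index name; if an index on t1.t2_id already exists (the EXPLAIN output suggests one named t2_id), the ALTER TABLE can be skipped and that name used in the hint instead.

```sql
-- Secondary index that is much narrower than the 60-column clustered rows of t1.
ALTER TABLE t1 ADD INDEX idx_t1_t2_id (t2_id);

-- Ask the optimizer to consider only this index for t1.
SELECT COUNT(*) AS count
FROM t1 USE INDEX (idx_t1_t2_id)
JOIN t2 ON t1.t2_id = t2.t2_id
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id;
```

USE INDEX is only a hint; FORCE INDEX is the stronger form if the optimizer still prefers a table scan.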
Regarding the WHERE issue, the query may perform better if you put the condition in the JOIN clause, like this:
SELECT COUNT(t1.t1_id) AS count FROM (t1)
JOIN t2 ON (t1.column10 = 1) AND (t1.t2_id = t2.t2_id)
JOIN t3 ON t2.t3_id = t3.t3_id
JOIN t4 ON t3.t4_id = t4.t4_id