I have a very simple MYSQL table with 2 columns and I run this query:
SELECT * FROM table WHERE (col1 = '123' AND col2 = '456')
OR (col1 = '456' AND col2 = '123')
Col1 and col2 are a composite primary key: PRIMARY KEY('col1','col2'). Each of both is also a foreign key for primary key in another table
When I ran the EXPLAIN command for the above query i got the following:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table index PRIMARY,col2 col2 8 NULL 1 Using where; Using index
The type in the above result is index which is very similar to All and so very likely to be slow on a large database. Is there a way to improve the above select command
In reality statements such as likely to be slow on a large database should be a red flag.
If you're going to have a large dataset, profiling and testing is vital to determine firstly if it will be a problem and then if it will be enough of a problem to warrant development time and cost to address. Usually this means micro optimisations that are unlikely to have any impact on most code bases.
Anyway, lets answer the question.
Yes, hypothetically as it's us using an index file, and if you have huge amounts of data and query this table a lot potentially it can be optimised by splitting the query into multiple execution sets rather than using expressions operators within the query, if you are only going to query twice as in your example you could achieve more performance with a union such as:
(
SELECT * FROM test
WHERE
(
col1 = 123 AND col2 = 456
)
)
UNION
(
SELECT * FROM test
WHERE
(
col1 = 456 AND col2 = 123
)
)
An EXPLAIN for this query is as follows:
ID SELECT_TYPE TABLE TYPE POSSIBLE_KEYS KEY KEY_LEN REF ROWS EXTRA
1 PRIMARY test ref PRIMARY PRIMARY 4 const 1 Using where; Using index
2 UNION test ref PRIMARY PRIMARY 4 const 2 Using where; Using index
(null) UNION RESULT <union1,2> ALL (null) (null) (null) (null) (null)
Take a look at this SQL fiddle http://sqlfiddle.com/#!2/9dc07a/1/0 for a simple test case.
The language I've used in this post such as "might", "could" etc is because I've not front loaded this example with hundreds of millions of records - I would strongly suggest you do this and evaluate and profile your query in more detail.
Unfortunately with optimisation, there isn't always a clear and simple answer of doing x to get greater performance - the query optimiser is a complex beast and sometimes trying to get every drop of performance can actually cripple your application (I'm speaking from experience here) so please, unless you have to worry about these micro optimisations - don't, if you do then evaluate, profile and test it fully before deciding on an approach.
The cost based optimiser chooses an execution plan based on the statistics it has about the table. It knows there are not many rows in there, and so it's a waste of time doing something clever. Increase the number of rows in the table dramatically and ensure you have a high cardinality (lots of different values) and run the explain plan again and you'll see the execution change.
Related
We recently moved our database from MariaDB to AWS Amazon Aurora RDS (MySQL). We observed something strange in a set of queries. We have two queries that are very quick, but when together as nested subquery it takes ages to finish.
Here id is the primary key of the table
SELECT * FROM users where id in(SELECT max(id) FROM users where id = 1);
execution time is ~350ms
SELECT * FROM users where id in(SELECT id FROM users where id = 1);
execution time is ~130ms
SELECT max(id) FROM users where id = 1;
execution time is ~130ms
SELECT id FROM users where id = 1;
execution time is ~130ms
We believe it has to do something with the type of value returned by max that is causing the indexing to be ignored when running the outer query from results of the sub query.
All the above queries are simplified for illustration of the problem. The original queries have more clauses as well as 100s of millions of rows. The issue did not exist prior to the migration and worked fine in MariaDB.
--- RESULTS FROM MariaDB ---
MySQL seems to optimize less efficient compared to MariaDB (int this case).
When doing this in MySQL (see: DBFIDDLE1), the execution plans look like:
For the query without MAX:
id select_type table partitions type
possible_keys
key key_len ref
rows
filtered Extra
1 SIMPLE integers null const
PRIMARY
PRIMARY 4 const
1
100.00 Using index
1 SIMPLE integers null const
PRIMARY
PRIMARY 4 const
1
100.00 Using index
For the query with MAX:
id select_type table partitions type
possible_keys
key key_len ref
rows
filtered Extra
1 PRIMARY integers null index null
PRIMARY
4 null
1000
100.00 Using where; Using index
2 DEPENDENT SUBQUERY null null null null
null
null null
null
null Select tables optimized away
While MariaDB (see: DBFIDDLE2 does have a better looking plan when using MAX:
id select_type table type
possible_keys
key key_len ref
rows
filtered Extra
1 PRIMARY system null
null
null null
1
100.00
1 PRIMARY integers const PRIMARY
PRIMARY
4 const
1
100.00 Using index
2 MATERIALIZED null null null
null
null null
null
null Select tables optimized away
EDIT: Because of time (some lack of it 😉) I now add some info
A suggestion to fix this:
SELECT *
FROM integers
WHERE i IN (select * from (SELECT MAX(i) FROM integers WHERE i=1)x);
When looking at the EXECUTION PLAN from MariaDB, which has 1 extra step, I tried to do the same in MySQL. Above query has an even bigger execution plan, but tests show that it performs better. (for explain plans, see: DBFIDDLE1a)
"the question is Mariadb that much faster? it uses a step more that mysql"
One step more does not mean that things get slower.
MySQL takes about 2-3 seconds on the query using the MAX, and MariaDB does execute the same in under 10 msecs. But this is performance, and time may vary on different systems.
SELECT max(id) FROM users where id = 1
Is strange. Since it is looking only at rows where id = 1, then "max" is obviously "1". So is the min. And the average.\
Perhaps you wanted:
SELECT max(id) FROM users
Is there an index on id? Perhaps the PRIMARY KEY? If not, then that might explain the sluggishness.
This can be done much faster (against assuming an index):
SELECT * FROM users
ORDER BY id DESC
LIMIT 1
Does that give you what you want?
To discuss this further, please provide SHOW CREATE TABLE users
I have the following query:
SELECT *
FROM table
WHERE
structural_type=1
AND parent_id='167F2-F'
AND points_to_id=''
# AND match(search) against ('donotmatch124213123123')
The search takes about 10ms to run, running on the composite index (structural_type, parent_id, points_to_id). However, when I add in the fts index, the query balloons to taking ~1s, regardless of what is contained in the match criteria. Basically it seems like it 'skips the index' whenever I have a fts search applied.
What would be the best way to optimize this query?
Update: a few explains:
EXPLAIN SELECT... # without fts
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL ref structural_type structural_type 209 const,const,const 2 100.00 NULL
With fts (also adding 'force index'):
explain SELECT ... force INDEX (structural_type) AND match...
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL fulltext structural_type,search search 0 const 1 5.00 Using where; Ft_hints: sorted
The only thing I can think of which would be incredibly hack-ish, would be to add an additional term to the fts so it does the filter 'within' that. For example:
fts_term = fts_term += " StructuralType1ParentID167F2FPointsToID"
The MySQL optimizer can only use one index for your WHERE clause, so it has to choose between the composite one and the FULLTEXT one.
Since it can't run both queries to bench which one is faster, it will estimate how fast will different execution plans be.
To do so, MySQL uses some internal stats it keeps about each table. But those stats can be very different from the reality if they aren't updated and the data changes in the table.
Running a OPTIMIZE TABLE table query allows MySQL to refresh its table stats, so it will be able to perform better estimates and choose the better index.
Try expressing this without the full text logic, using like:
SELECT *
FROM table
WHERE structural_type = 1 AND
parent_id ='167F2-F' AND
points_to_id = '' AND
search not like '%donotmatch124213123123%';
The index should still be used for the first three columns. LIKE might be slow, but if not many rows match the first three, this might not be as bad as using the full text index.
I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds, the seconds hangs for hours.
I can understand that the sooner is better but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAIN for both queries
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row which might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
I would prefix both queries by explain, and then compare the difference in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied more immediately than the WHERE clause. So, in the WHERE clause you are getting every record from my_table, applying an arithmetic function, and then sorting them because select distinct usually requires a sort and sometimes it creates a temporary table in memory or on disk. The # of rows examined is probably the product of the size of each table.
But in the JOIN clause, a lot of the rows that are being examined and sorted in the WHERE clause are probably eliminated beforehand. You probably end up looking at far fewer rows... and the database probably takes easier measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
'IN' clause is usually slow for huge tables. As far as I remember, for the second statement you printed out - it will simply loop through all rows of my_table (unless you have index there) checking each row for match of WHERE clause. In general IN is treated as a set of OR clauses with all set elements in it.
That's why, I think, using temporary tables that are created in background of JOIN query is faster.
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another things to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index, which, who knows, it depends on the data set, it might speed things up by a 2, 5, 10 times. With the IN, it's going to run that query.
I'm working with large number of rows in db(MySQL, innoDb engine, approx 20 milion rows) and I need to perform fuzzy like searching quite a lot. For some reasons I decided to use jaro_winkler algorithm and for performance issues I implemented it as a function in SQL. Application is written in Python and there is a weird situation I came across today:
Comparing these two queries (called from mysql shell not via Orm or etc):
SELECT * FROM products WHERE jaro_winkler(code, '78-1747') > 0.7 AND code LIKE '%78%';
and
SELECT * FROM products WHERE code LIKE '%78%' AND jaro_winkler(code, '78-1747') > 0.7;
I noticed that first one is at least 10 times slower than the second one. It seems logical at first, but as I checked the order of conditions in WHERE should not matter.
So my question - is it the normal behaviour ?
And has someone (from practical experience) can recommend a best algorithm or function to perform fuzzy searching ? I know about damerau-levenshtein metric, but it turns to be slower than my current solution.
EDIT:
After using explain:
I created sample database really quickly and used both queries:
for the first query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE products ALL NULL NULL NULL NULL 4166 Using where
query time: ~ 2 seconds
explain for the second query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE products ALL NULL NULL NULL NULL 4332 Using where
query time: ~ 0.1 second
I have a large MySQL, MyISAM table of around 4 million rows running in a core 2 duo, 8G RAM laptop.
This table has 30 columns including varchar, decimal and int types.
I have an index on a varchar(16). Let's call this column: "indexed_varchar_column".
My query is
SELECT 9 columns FROM the_table WHERE indexed_varchar_column = 'something';
It always returns around 5000 rows for every 'something' I query against.
An EXPLAIN to the query returns this:
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
| 1 | SIMPLE | the_table | ref | many indexes including indexed_varchar_column | another_index NOT: indexed_varchar_column! | 19 | const | 5247 | Using where |
+----+-------------+-------------+------+----------------------------------------------------+--------------------------------------------+---------+-------+------+-------------+
First thing is I'm not sure is why another_index is chosen. In fact it chooses an index which is a composite index of indexed_varchar_column and another 2 columns (which form part of the selected ones). Perhaps this makes sense since it may make things a bit faster for not having to read 2 of the columns in the query. The real QUESTION is the following one:
The query takes 5 seconds for every 'something' I match. On the 2nd time I query against 'something' it takes 0.15 secs (I guess because the query is being cached). When I run another query against 'something_new' it takes again 5 seconds. So, it is consistent.
THE PROBLEM IS: I discovered that creating an index (another composite index including my indexed_varchar_column) and dropping it again produces that all further queries against new 'something_other' take only 0.15 secs. Please note that 1) I create an index 2) drop it again. So everything is in the same state.
I guess all the operations needed for building and dropping indices make the SQL engine to cache something that is then reused. When I run EXPLAIN on a query after all this I get exactly the same as before.
How can I proceed to understand what is cached in the create-drop index procedure so that I can cache it without manipulating indices?
UPDATE:
Following a comment from Marc B that suggested that when mySQL creates an index it internally does a SELECT... I tried the following:
SELECT * FROM my_table;
It took 30 secs and returned 4 million rows. The good thing is that all further queries are very fast again (until I reboot the system). Please note that after rebooting the queries are slow again. I guess this is because mySQL is using some sort of OS caching.
Any idea? How can I explicitly cache the table I guess?
UPDATE 2:
Perhaps I should have mentioned that this table may be severely fragmented. It's 4 million rows but I remove lots of old fields regularly. I also add new ones. Since I had large gaps in IDs (for the rows deleted) every day I drop the primary index (ID) and create it again with consecutive numbers. The table may be then very fragmented and therefore IO must be an issue... Not sure what to do.
Thanks everybody for your help.
Finally I discovered (thanks to the hint of Marc B) that my table was severely fragmented after many INSERTs and DELETEs. I updated the question with this info some hours ago. There are two things that help:
1)
ALTER TABLE my_table ORDER BY indexed_varchar_column;
2) Running:
myisamchk --sort-records=4 my_table.MYI (where 4 corresponds to my index)
I believe both commands are equivalent. Queries are fast even after a system reboot.
I've put this ALTER TABLE ORDER BY command on a cron that is run everyday. It takes 2 minutes but it's worth it.
How many indexes do you have that contain the indexed_varchar_column? Do you have a single index for just the indexed_varchar_column?
Have you tried:
SELECT 9 columns FROM USE INDEX (name_of_index) the_table WHERE indexed_varchar_column = 'something';?
What is the order of the columns in your composite index.
You must use (at least) a left-associative sub-set of the columns in your query
If you have an index on foo,bar, and baz, that will not be usable as an index against bar or baz by themeselves. Only (foo), (foo,bar), and (foo,bar,baz).
EXPLAIN is your friend here. It will tell you which index, if any, is being used by a query.
EDIT Here's a postgres explain of a simple left join query for comparison.
Nested Loop Left Join (cost=0.00..16.97 rows=13 width=103)
Join Filter: (pagesets.id = pages.pageset_id)
-> Index Scan using ix_pages_pageset_id on pages (cost=0.00..8.51 rows=13 width=80)
Index Cond: (pageset_id = 515)
-> Materialize (cost=0.00..8.27 rows=1 width=23)
-> Index Scan using pagesets_pkey on pagesets (cost=0.00..8.27 rows=1 width=23)
Index Cond: (id = 515)