MySQL EXPLAIN vs Slow Log

Using MySQL (5.1.66), EXPLAIN says it will scan just 72 rows, while the slow log reports that the whole table was scanned (Rows_examined: 5476845).
How is this possible? I can't figure out what's wrong with the query.
*name* is a unique string index and
*date* is just a regular int index.
This is the EXPLAIN:
EXPLAIN SELECT *
FROM table
WHERE name LIKE 'The%Query%'
ORDER BY date DESC
LIMIT 3;
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table index name date 4 NULL 72 Using where
Output from Slow Log
# Query_time: 5.545731 Lock_time: 0.000083 Rows_sent: 1 Rows_examined: 5476845
SET timestamp=1360007079;
SELECT * FROM table WHERE name LIKE 'The%Query%' ORDER BY date DESC LIMIT 3;

The rows value that is returned from an EXPLAIN is an estimate of the number of rows that have to be examined to find results that match your query.
If you look, you will see that the key chosen for the query execution is date, which is probably being picked because of your ORDER BY clause. Because the key being used is unrelated to your WHERE clause, the estimate gets skewed. Even though your WHERE clause does a LIKE on the name column, the optimizer may decide not to use an index at all:
Sometimes MySQL does not use an index, even if one is available. One
circumstance under which this occurs is when the optimizer estimates
that using the index would require MySQL to access a very large
percentage of the rows in the table. (In this case, a table scan is
likely to be much faster because it requires fewer seeks.) source
In short, the optimizer is choosing not to use the name key, even though it would be the one that is the limiting factor of rows to be returned. You can try forcing the index to see if that improves the performance.
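For instance, a minimal sketch using FORCE INDEX (this assumes the unique index is literally named name; substitute the real index name from SHOW INDEX):
SELECT *
FROM table FORCE INDEX (name)
WHERE name LIKE 'The%Query%'
ORDER BY date DESC
LIMIT 3;
Since the pattern has the literal prefix 'The', a range scan on the name index is possible; the trade-off is that MySQL must then sort the matching rows for the ORDER BY instead of reading them pre-sorted from the date index.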

Related

Optimizing query with FTS + composite index

I have the following query:
SELECT *
FROM table
WHERE
structural_type=1
AND parent_id='167F2-F'
AND points_to_id=''
# AND match(search) against ('donotmatch124213123123')
The search takes about 10ms to run using the composite index (structural_type, parent_id, points_to_id). However, when I add in the FTS index, the query balloons to ~1s, regardless of what the match criteria contain. Basically, it seems to 'skip the index' whenever an FTS search is applied.
What would be the best way to optimize this query?
Update: a few explains:
EXPLAIN SELECT... # without fts
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL ref structural_type structural_type 209 const,const,const 2 100.00 NULL
With fts (also adding 'force index'):
EXPLAIN SELECT ... FORCE INDEX (structural_type) ... AND MATCH ...
id select_type table partitions type possible_keys key key_len ref rows filtered Extra
1 SIMPLE table NULL fulltext structural_type,search search 0 const 1 5.00 Using where; Ft_hints: sorted
The only thing I can think of, which would be incredibly hack-ish, would be to add an additional term to the FTS field so the filter happens 'within' it. For example:
fts_term += " StructuralType1ParentID167F2FPointsToID"
The MySQL optimizer can only use one index for your WHERE clause, so it has to choose between the composite one and the FULLTEXT one.
Since it can't run both queries to benchmark which one is faster, it estimates how fast the different execution plans will be.
To do so, MySQL uses internal statistics it keeps about each table. But those statistics can be very different from reality if they aren't updated while the data in the table changes.
Running an OPTIMIZE TABLE query lets MySQL refresh its table statistics, so it can make better estimates and choose the better index.
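For example (table name as in the question; ANALYZE TABLE is the lighter-weight option if you only want to refresh the index statistics without rebuilding the table):
OPTIMIZE TABLE `table`;
ANALYZE TABLE `table`;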
Try expressing this without the full text logic, using like:
SELECT *
FROM table
WHERE structural_type = 1 AND
parent_id ='167F2-F' AND
points_to_id = '' AND
search not like '%donotmatch124213123123%';
The index should still be used for the first three columns. LIKE might be slow, but if not many rows match the first three, this might not be as bad as using the full text index.

Distinct (or group by) using filesort and temp table

I know there are similar questions on this but I've got a specific query / question around why this query
EXPLAIN SELECT DISTINCT RSubdomain FROM R_Subdomains WHERE EmploymentState IN (0,1) AND RPhone='7853932120'
gives me this EXPLAIN output
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains index NULL RSubdomain 767 NULL 3278 Using where
with an index on RSubdomain,
but if I add a composite index on EmploymentState/RPhone,
I get this EXPLAIN output:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE RSubdomains range EmploymentState EmploymentState 67 NULL 2 Using where; Using temporary
If I take away the DISTINCT on RSubdomain, the Using temporary drops from the EXPLAIN output. But what I don't get is why the DISTINCT ends up using a temp table when I add the composite key (while keeping the key on RSubdomain), and which index schema is better here? I see that the number of rows scanned with the combined key is far less, but the query is of type range and it's also slower.
Q: why ... does the distinct end up using a temp table?
MySQL is doing a range scan on the index (i.e. reading index blocks) to locate the rows that satisfy the predicates (the WHERE clause). Then MySQL has to look up the value of the RSubdomain column from the underlying table (it's not available in the index). To eliminate duplicates, MySQL needs to scan the values of RSubdomain that were retrieved. The "Using temporary" indicates that MySQL is materializing a resultset, which is processed in a subsequent step. (Likely that's the set of RSubdomain values that was retrieved; given the DISTINCT, it's likely that MySQL is actually creating a temporary table with RSubdomain as a primary or unique key, and only inserting non-duplicate values.)
In the first case, it looks like the rows are being retrieved in order by RSubdomain (likely that's the first column in the cluster key). That means MySQL needn't compare all of the RSubdomain values against each other; it only needs to check whether the last retrieved value matches the currently retrieved value to decide whether the value can be "skipped."
Q: which index schema is better here?
The optimum index for your query is likely a covering index:
... ON R_Subdomains (RPhone, EmploymentState, RSubdomain)
But with only 3278 rows, you aren't likely to see any performance difference.
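A concrete sketch of creating that covering index (the index name here is just illustrative):
CREATE INDEX ix_rphone_empstate_rsubdomain
ON R_Subdomains (RPhone, EmploymentState, RSubdomain);
With all three referenced columns in the index, MySQL can satisfy the query from the index alone (you'd expect Using index in the EXPLAIN) without touching the underlying table. (If RSubdomain is a wide VARCHAR, you may need a prefix length to stay under MySQL's index key-size limit.)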
FOLLOWUP
Unfortunately, MySQL does not provide the type of instrumentation provided in other RDBMS (like the Oracle event 10046 sql trace, which gives actual timings for resources and waits.)
Since MySQL is choosing to use the index when it is available, that is probably the most efficient plan. For the best efficiency, I'd perform an OPTIMIZE TABLE operation (for InnoDB tables and MyISAM tables with dynamic format, if there have been a significant number of DML changes, especially DELETEs and UPDATEs that modify the length of the row...) At the very least, it would ensure that the index statistics are up to date.
You might want to compare the plan of an equivalent statement that does a GROUP BY instead of a DISTINCT, i.e.
SELECT r.RSubdomain
FROM R_Subdomains r
WHERE r.EmploymentState IN (0,1)
AND r.RPhone='7853932120'
GROUP
BY r.RSubdomain
For optimum performance, I'd go with a covering index with RPhone as the leading column; that's based on an assumption about the cardinality of the RPhone column (close to unique values), as opposed to only a few distinct values in the EmploymentState column. That covering index will give the best performance, i.e. the quickest elimination of rows that need to be examined.
But again, with only a couple thousand rows, it's going to be hard to see any performance difference. If the query was examining millions of rows, that's when you'd likely see a difference, and the key to good performance will be limiting the number of rows that need to be inspected.

MySQL performance difference between JOIN and IN

I wanted to find all hourly records that have a successor in a ~5m row table.
I tried :
SELECT DISTINCT (date_time)
FROM my_table
JOIN (SELECT DISTINCT (DATE_ADD( date_time, INTERVAL 1 HOUR)) date_offset
FROM my_table) offset_dates
ON date_time = date_offset
and
SELECT DISTINCT(date_time)
FROM my_table
WHERE date_time IN (SELECT DISTINCT(DATE_ADD(date_time, INTERVAL 1 HOUR))
FROM my_table)
The first one completes in a few seconds; the second hangs for hours.
I can understand that the former is better, but why such a huge performance gap?
-------- EDIT ---------------
Here are the EXPLAINs for both queries:
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY <derived2> ALL NULL NULL NULL NULL 1710 Using temporary
1 PRIMARY my_table ref PRIMARY PRIMARY 8 offset_dates.date_offset 555 Using index
2 DERIVED my_table index NULL PRIMARY 13 NULL 5644204 Using index; Using temporary
id select_type table type possible_keys key key_len ref rows Extra
1 PRIMARY my_table range NULL PRIMARY 8 NULL 9244 Using where; Using index for group-by
2 DEPENDENT SUBQUERY my_table index NULL PRIMARY 13 NULL 5129983 Using where; Using index; Using temporary
In general, a query using a join will perform better than an equivalent query using IN (...), because the former can take advantage of indexes while the latter can't; the entire IN list must be scanned for each row that might be returned.
(Do note that some database engines perform better than others in this case; for example, SQL Server can produce equivalent performance for both types of queries.)
You can see what the MySQL query optimizer intends to do with a given SELECT query by prepending EXPLAIN to the query and running it. This will give you, among other things, a count of rows the engine will have to examine for each step in a query; multiply these counts to get the overall number of rows the engine will have to visit, which can serve as a rough estimate of likely performance.
I would prefix both queries with EXPLAIN and then compare the differences in the access plans. You will probably find that the first query looks at far fewer rows than the second.
But my hunch is that the JOIN is applied more eagerly than the WHERE clause. So in the WHERE version you are getting every record from my_table, applying an arithmetic function, and then sorting them, because SELECT DISTINCT usually requires a sort and sometimes creates a temporary table in memory or on disk. The number of rows examined is probably the product of the size of each table.
But in the JOIN version, a lot of the rows that are examined and sorted in the WHERE version are probably eliminated beforehand. You probably end up looking at far fewer rows, and the database probably takes easier measures to accomplish it.
But I think this post answers your question best: SQL fixed-value IN() vs. INNER JOIN performance
An IN clause is usually slow for huge tables. As far as I remember, the second statement you posted will simply loop through all rows of my_table (unless you have an index there), checking each row for a match against the WHERE clause. In general, IN is treated as a set of OR clauses with all the set elements in it.
That's why, I think, using the temporary tables that are created in the background of the JOIN query is faster.
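As an illustration of that idea, here is a minimal sketch that materializes the subquery by hand into an indexed temporary table (column names as in the question; the temp-table and index names are hypothetical):
CREATE TEMPORARY TABLE offset_dates AS
SELECT DISTINCT DATE_ADD(date_time, INTERVAL 1 HOUR) AS date_offset
FROM my_table;
ALTER TABLE offset_dates ADD INDEX idx_date_offset (date_offset);
SELECT DISTINCT m.date_time
FROM my_table m
JOIN offset_dates o ON m.date_time = o.date_offset;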
Here are some helpful links about that:
MySQL Query IN() Clause Slow on Indexed Column
inner join and where in() clause performance?
http://explainextended.com/2009/08/18/passing-parameters-in-mysql-in-list-vs-temporary-table/
Another thing to consider is that with your IN style, very little future optimization is possible compared to the JOIN. With the join you can possibly add an index which, depending on the data set, might speed things up by 2, 5, or 10 times. With the IN, it's going to run that query as written.

speed up mysql query with group by

I have a MySQL query :
SELECT date(FROM_UNIXTIME(time)) as date,
count(view) as views
FROM `table_1`
WHERE `address` = 1
GROUP BY date(FROM_UNIXTIME(time))
where
view: auto increment and primary key (int(11))
address: index (int(11))
time: index (int(11))
The total number of rows in the table is 270k.
This query executes slowly; in mysql-slow.log I got:
Query_time: 1.839096
Lock_time: 0.000042
Rows_sent: 155
Rows_examined: 286435
With EXPLAIN it looks like this:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE table_1 ref address address 5 const 139138 Using where; Using temporary; Using filesort
How can I speed up this query? Maybe it would be better to handle the date in PHP? But I think fetching the timestamps into PHP, converting them to human-readable dates, and grouping there would take more time than a single query in MySQL. Does anybody know how to make this query faster?
When you apply the functions date() and FROM_UNIXTIME() to the time column in the GROUP BY, you kill any indexing benefit you may have on that field.
Adding a date column is the only way I can see to speed this up if you need it grouped by day. Without it, you'll need to decrease the overall set you are trying to group, for example by adding start/end dates to limit the date range. That would decrease the number of rows being transformed and grouped.
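For example, a sketch that bounds the scan on the indexed time column (the date range here is hypothetical):
SELECT date(FROM_UNIXTIME(time)) AS date,
count(view) AS views
FROM `table_1`
WHERE `address` = 1
AND time >= UNIX_TIMESTAMP('2013-01-01')
AND time < UNIX_TIMESTAMP('2013-02-01')
GROUP BY date(FROM_UNIXTIME(time));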
You should consider adding an additional DATE column to your table and indexing it.
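A minimal sketch of that approach (the column and index names are hypothetical):
ALTER TABLE `table_1`
ADD COLUMN view_date DATE,
ADD INDEX idx_address_view_date (`address`, view_date);
-- backfill the new column from the existing unix timestamp
UPDATE `table_1` SET view_date = DATE(FROM_UNIXTIME(time));
SELECT view_date, count(view) AS views
FROM `table_1`
WHERE `address` = 1
GROUP BY view_date;
With the composite index, the WHERE filter and the GROUP BY can both use the index, avoiding the Using temporary; Using filesort steps from the original plan.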

mysql query optimization

I need some help optimizing this query:
select * from transaction where id < 7500001 order by id desc limit 16
When I do an EXPLAIN on this, the type is "range" and rows is "7500000".
According to some online references, this means the query took 7,500,000 rows to scan to get the data.
Is there any way I can optimize it so it scans fewer rows to get the data? Also, id is the primary key column.
online references ... this means the query took 7,500,000 rows to scan to get the data
Not actually: it's the approximate number of rows that will potentially be scanned (the optimizer cannot determine the exact number in many cases). But you specified LIMIT, so only the first 16 rows are actually touched while the query executes.
P.S. I hope the key used in the EXPLAIN is id?
I performed an EXPLAIN with your query on an 8-million-row table:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE transaction range PRIMARY PRIMARY 8 NULL 4079100 Using where
The actual execution was fast: Execution Time: 00:00:00:044.