MySQL SELECT * optimization - mysql

Is there a reason why there is enormous difference between
1. SELECT * FROM data; -- 45000 rows
2. SELECT data.* FROM data; -- 45000 rows
SHOW PROFILES;
+----------+------------+-------------------------+
| Query_ID | Duration | Query |
+----------+------------+-------------------------+
| 1 | 0.10902800 | SELECT * FROM data |
| 2 | 0.11139200 | SELECT data.* FROM data |
+----------+------------+-------------------------+
2 rows in set, 1 warning (0.00 sec)
As far as I know it, they both return the same number of rows and columns. Why the disparity in duration?
MySQL version 5.6.29

That's not much difference. Neither are optimized. Both do full table scans. Both will parse to the optimizer the same. You are talking about fractions of milliseconds difference.
You can't optimize full table scans. The problem is not "select " or "select data.". The problem is that there is no "where" clause, because that's where optimization starts.

The particular examples specified would return the same result and have the same performance.
[TableName].[column] is usually used to pinpoint the table you wish to use when two tables a present in a join or a complex statement and you want to define which column to use out of the two with the same name.
It's most common use is in a join though, for a basic statement such as the one above there is no difference and the output will be the same.

Related

Why does the query execute so much slower when all the columns involved are the same and only the where condition changes?

I have this query:
SELECT 1 AS InputIndex,
IF(TRIM(DeviceInput1Name = '', 0, IF(INSTR(DeviceInput1Name, '|') > 0, 2, 1)) AS InputType,
(SELECT Value1_1 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueLeft,
(SELECT Value1_2 FROM devicevalues WHERE DeviceID = devices.DeviceID ORDER BY ValueTime DESC LIMIT 1) AS InputValueRight
FROM devices
WHERE DeviceIMEI = 'Some_Search_Value';
This completes fairly quickly (in up to 0.01 seconds). However, running the same query with WHERE clause as such
WHERE DeviceIMEI = 'Some_Other_Search_Value';
makes it run for upwards of 14 seconds! Some search values finish very quickly, while others run way too long.
If I run EXPLAIN on either query, I get the following:
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
| 1 | PRIMARY | devices | ref | DeviceIMEI | DeviceIMEI | 28 | const | 1 | Using where |
| 3 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
| 2 | DEPENDENT SUBQUERY | devicevalues | index | DeviceID,More | ValueTime | 9 | NULL | 1 | Using where |
+----+--------------------+--------------+-------+---------------+------------+---------+-------+------+-------------+
Also, here's the actual number of records, just so it's clear:
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Search_Value';
+----------+
| count(*) |
+----------+
| 1017946 |
+----------+
1 row in set (0.17 sec)
mysql> select count(*) from devicevalues inner join devices using(DeviceID) where devices.DeviceIMEI = 'Some_Other_Search_Value';
+----------+
| count(*) |
+----------+
| 306100 |
+----------+
1 row in set (0.04 sec)
Any ideas why changing a search value in the WHERE clause would cause the query to execute so slowly, even when the number of physical records to search through is lower?
Note there is no need for you to rewrite the query, just explain why the above happens.
UPDATE: I have tried running two separate queries instead of one with dependent subqueries to get the information I need (first I select DeviceID from devices by DeviceIMEI, then select from devicevalues by DeviceID I got from the previous query) and all queries return instantly. I suppose the only solution is to run these queries in a transaction, so I'll be making a stored procedure to do this. This, however, still doesn't answer my question which puzzles me.
I dont think that 1017946 is equivalent to the number of rows returned by your very first query.Your first query returns all rows from devices with some correlated queries,your count query returns all common rows between the 2 tables.If this is so the problem might be cardinality issues namely some_other_values constitute a much larger proportion of the rows in your first query than some_value so Mysql chooses a table scan.
If I understand correctly, the query is the same, and only the searched value changes.
There are three real possibilities as I can see, the first much likelier than the others:
The fast query only appears to be fast. And that's why it is in the MySQL query cache already. Try disabling the cache, running with NO_SQL_CACHE, or run the slow query twice. If the second way round runs in 0.01s instead of 14s, you'll know this is the case.
One query has to look way more records than the other. An IMEI may have lots of rows in devicevalues, another might have next no none. Apparently you are in such a condition, and what makes this unlikely is (apart from the times involved) the fact that it is the slower IMEI which actually has less matches.
The slow query is indeed slow. This means that a particular subset of data is hard to locate or hard to retrieve. The first may be due to an overdue reindexing or to filesystem fragmentation of very large indexes. The second can also be due to fragmentation of the tablespace, or to other condition which splits up records (for example the database is partitioned). A search in a small partition is wont to be faster than a search in a large partition.
But the time differences involved aren't equal in the three cases, and a 1400x difference seems to me an unlikely consequence of (2) or (3). The first possibility seems way more appealing.
Update you seem not to be using indexes on your subqueries. Have you an index such as
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime);
If you can, you can try a covering index:
CREATE INDEX dv_ndx ON devicevalues(DeviceID, ValueTime, Value1_1, Value1_2);

Using MySQL, always have the expected number of elements in a dataset by filling the remaining with previous written data

is it possible to always respect an expected number of element constraint by filling the remaining of a SQL dataset with previous written data, keeping the data insertion in order? Using MySQL?
Edit
In a web store, I always want to show n elements. I update the show elements every w seconds and I want to loop indefinitely.
By example, using table myTable:
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
+----+
Something like
SELECT id FROM myTable WHERE id > 3 ORDER BY id ALWAYS_RETURN_THIS_NUMBER_OF_ELEMENTS 5
would actually return (where ALWAYS_RETURN_THIS_NUMBER_OF_ELEMENTS doesn't exist)
+----+
| id |
+----+
| 4 |
| 5 |
| 4 |
| 5 |
| 4 |
+----+
This is a very strange need. Here is a method:
select id
from (SELECT id
FROM myTable
WHERE id > 3
ORDER BY id
LIMIT 5
) t cross join
(select 1 as n union all select 2 union all select 3 union all select 4 union all select 5
) n
order by n.n, id
limit 5;
You may need to extend the list of numbers in n to be sure you have enough rows for the final limit.
No, that's not what LIMIT does. The LIMIT clause is applied as the last step in the statement execution, after aggregation, after the HAVING clause, and after ordering.
I can't fathom a use case that would require the type of functionality you describe.
FOLLOWUP
The query that Gordon Linoff provided will return the specified result, as long as there is at least one row in myTable that satisfies the predicate. Otherwise, it will return zero rows.
Here's the EXPLAIN output for Gordon's query:
id select_type table type key rows Extra
-- ------------ ---------------- ----- ------- ---- -------------------------------
1 PRIMARY <derived2> ALL 5 Using temporary; Using filesort
1 PRIMARY <derived3> ALL 5 Using join buffer
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
UNION RESULT <union3,4,5,6,7> ALL
2 DERIVED myTable range PRIMARY 10 Using where; Using index
Here's the EXPLAIN output for the original query:
id select_type table type key rows Extra
-- ----------- ----------------- ----- ------- ---- -------------------------------
1 SIMPLE myTable range PRIMARY 10 Using where; Using index
It just seems like it would be a whole lot more efficient to reprocess the resultset from the original query, if that resultset contains fewer than five (and more than zero) rows. (When that number of rows goes from 5 to 1,000 or 150,000, it would be even stranger.)
The code to get multiple copies of rows from a resultset is quite simple: fetch the rows, and if the end of the result set is reached before you've fetched five (or N) rows, then just reset the row pointer back to the first row, so the next fetch will return the first row again. In PHP using mysqli, for example, you could use:
$result->data_seek(0);
Or, for those still using the deprecated mysql_ interface:
mysql_data_seek($result,0);
But if you're returning only five rows, it's likely you aren't even looping through the result at all, and you already stuffed all the rows into an array. Just loop back through the beginning of the array.
For MySQL interfaces that don't support a scrollable cursor, we'd just store the whole resultset and process it multiple times. With Perl DBI, using the fetchall_arrayref, with JDBC (which is going to store the whole result set in memory anyway without special settings on the connection), we'd store the resultset as an object.
Bottom line, squeezing this requirement (to produce a resultset of exactly five rows) back to the database server, and pulling back duplicate copies of a row and/or storing duplicate copies of a row in memory just seems like the wrong way to satisfy the use case. (If there's rationale for storing duplicate copies of a row in memory, then that can be achieved without pulling duplicate copies of rows back from the database.)
It's just very odd that you say you're using/implementing a "circular buffer", but that you choose not to "circle" back around to the beginning of a resultset which contains fewer than five rows, and instead need to have MySQL return you duplicate rows. Just very, very strange.

mysql views performance when subset of the view's fields are required

I want to create a database view which has some costly fields to compute. I'm trying to see what is the cost if the costly fields are not required. However, I see that this is not the case, and that the cost is the same.
As toy example, I have a table with users information. I have a view on this table that returns the id,name and the number of users whose name are like mine.
create view same_name as
select
id,
name,
(select count(*) from users as u2 where u2.name = u1.name) as same_name_count,
from users as u1;
I'm now doing two queries over the table. In the first one I'm select all fields, in the second one I'm just selecting single field (the name). In both I constrain over the id.
mysql> select * from same_name where id = 2;
+----+--------+-----------------+
| id | name | same_name_count |
+----+--------+-----------------+
| 2 | meidan | 125 |
+----+--------+-----------------+
1 row in set (12.15 sec)
mysql> select name from same_name where id = 2;
+--------+
| name |
+--------+
| meidan |
+--------+
1 row in set (12.15 sec)
So it can be seen that the performance is the same and that no optimization is done here on the missing field. Is this the expected behavior? Any hints on this?
Thanks.
Most of CPU cycles are wasted on selections of rows and calculations (which are the same in both cases), not data transfer itself. If each of your table's rows does not occupy several kilobytes of data as opposed to several bytes in name field, you won't see much of a difference on fast network connection.

Why does mysql decide that this subquery is dependent?

On a MySQL 5.1.34 server, I have the following perplexing situation:
mysql> explain select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
| 1 | PRIMARY | ObjectValue | range | IX_ObjectValue_Timestamp,IX_ObjectValue_Timestamp_EventName | IX_ObjectValue_Timestamp_EventName | 9 | NULL | 541944 | Using where |
| 2 | DEPENDENT SUBQUERY | ObjectValue | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using index |
+----+--------------------+-------------+-----------------+-------------------------------------------------------------+------------------------------------+---------+------+--------+-------------+
2 rows in set (0.00 sec)
mysql> select * FROM master.ObjectValue WHERE id IN ( SELECT id FROM backup.ObjectValue ) AND timestamp < '2008-04-26 11:21:59';
Empty set (2 min 48.79 sec)
mysql> select count(*) FROM master.ObjectValue;
+----------+
| count(*) |
+----------+
| 35928440 |
+----------+
1 row in set (2 min 18.96 sec)
How can it take 3 minutes to examine 500000 records when it only
takes 2 minutes to visit all records?
How can a subquery on a
separate database be classified dependent?
What can I do to speed up
this query?
UPDATE:
The actual query that took a long time was a DELETE, but you can't do explain on those; DELETE is why I used subselect. I have now read the documentation and found out about the syntax "DELETE FROM t USING ..." Rewriting the query from:
DELETE FROM master.ObjectValue
WHERE timestamp < '2008-06-26 11:21:59'
AND id IN ( SELECT id FROM backup.ObjectValue ) ;
into:
DELETE FROM m
USING master.ObjectValue m INNER JOIN backup.ObjectValue b ON m.id = b.id
WHERE m.timestamp < '2008-04-26 11:21:59';
Reduced the time from minutes to .01 seconds for an empty backup.ObjectValue.
Thank you all for good advise.
The dependent subquery slows you outer query down to a crawl (I suppose you know it means it's run once per row of found in the dataset being looked at).
You don't need the subquery there, and not using one will speedup your query quite significantly:
SELECT m.*
FROM master.ObjectValue m
JOIN backup.ObjectValue USING (id)
WHERE m.timestamp < '2008-06-26 11:21:59'
MySQL frequently treats subqueries as dependent even though they are not. I've never really understood the exact reasons for that - maybe it's simply because the query optimizer fails to recognize it as independent. I never bothered looking more in details because in these cases you can virtually always move it to the FROM clause, which fixes it.
For example:
DELETE FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10)
// vs
DELETE m FROM m WHERE m.rid IN (SELECT id FROM r WHERE r.xid = 10)
The former will produce a dependent subquery and can be very slow. The latter will tell the optimizer to isolate the subquery, which avoids a table scan and makes the query run much faster.
Notice how it says there is only 1 row for the subquery? There is obviously more than 1 row. That is an indication that mysql is loading only 1 row at a time. What mysql is probably trying to do is "optimize" the subquery so that it only loads records in the subquery that also exist in the master query, a dependent subquery. This is how a join works, but the way you phrased your query you have forced a reversal of the optimized logic of a join.
You've told mysql to load the backup table (subquery) then match it against the filtered result of the master table "timestamp < '2008-04-26 11:21:59'". Mysql determined that loading the entire backup table is probably not a good idea. So mysql decided to use the filtered result of the master to filter the backup query, but the master query hasn't completed yet when trying to filter the subquery. So it needs to check as it loads each record from the master query. Thus your dependent subquery.
As others mentioned, use a join, it's the right way to go. Join the crowd.
How can it take 3 minutes to examine 500000 records when it only takes 2 minutes to visit all records?
COUNT(*) is always transformed to COUNT(1) in MySQL. So it doesn't even have to enter each record, and also, I would imagine that it uses in-memory indexes which speeds things up. And in the long-running query, you use range (<) and IN operators, so for each record it visits, it has to do extra work, especially since it recognizes the subquery as dependent.
How can a subquery on a separate database be classified dependent?
Well, it doesn't matter if it's in a separate database. A subquery is dependent if it depends on values from the outer query, which you could still do in your case... but you don't, so it is, indeed, strange that it's classified as a dependent subquery. Maybe it is just a bug in MySQL, and that's why it's taking so long - it executes the inner query for every record selected by the outer query.
What can I do to speed up this query?
To start with, try using JOIN instead:
SELECT master.*
FROM master.ObjectValue master
JOIN backup.ObjectValue backup
ON master.id = backup.id
AND master.timestamp < '2008-04-26 11:21:59';
The real answer is, don't use MySQL, its optimizer is rubbish. Switch to Postgres, it will save you time in the long run.
To everyone saying "use JOIN", that's just a nonsense perpetuated by the MySQL crowd who have refused for 10 years to fix this glaringly horrible bug.

Compound index required to speed up join-ed query?

A colleague asked me to explain how indexes (indices?) boost up performance; I tried to do so, but got confused myself.
I used the model below for explanation (an error/diagnostics logging database). It consists of three tables:
List of business systems, table "System" containing their names
List of different types of traces, table "TraceTypes", defining what kinds of error messages can be logged
Actual trace messages, having foreign keys from System and TraceTypes tables
I used MySQL for the demo, however I don't recall the table types I used. I think it was InnoDB.
System TraceTypes
----------------------------- ------------------------------------------
| ID | Name | | ID | Code | Description |
----------------------------- ------------------------------------------
| 1 | billing | | 1 | Info | Informational mesage |
| 2 | hr | | 2 | Warning| Warning only |
----------------------------- | 3 | Error | Failure |
| ------------------------------------------
| ------------|
Traces | |
--------------------------------------------------
| ID | System_ID | TraceTypes_ID | Message |
--------------------------------------------------
| 1 | 1 | 1 | Job starting |
| 2 | 1 | 3 | System.nullr..|
--------------------------------------------------
First, i added some records to all of the tables and demonstrated that the query below executes in 0.005 seconds:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
inner join TraceTypes on Traces.TraceTypes_ID = TraceTypes.ID
where
System.Name='billing' and TraceTypes.Code = 'Info'
Then I generated more data (no indexes yet)
"System" contained about 100 entries
"TraceTypes" contained about 50 entries
"Traces" contained ~10 million records.
Now the previous query took 8-10 seconds.
I created indexes on Traces.System_ID column and Traces.TraceTypes_ID column. Now this query executed in milliseconds:
select count(*) from Traces where System_id=1 and TraceTypes_ID=1;
This was also fast:
select count(*) from Traces
inner join System on Traces.System_ID = System.ID
where System.Name='billing' and TraceTypes_ID=1;
but the previous query which joined all the three tables still took 8-10 seconds to complete.
Only when I created a compound index (both System_ID and TraceTypes_ID columns included in index), the speed went down to milliseconds.
The basic statement I was taught earlier is "all the columns you use for join-ing, must be indexed".
However, in my scenario I had indexes on both System_ID and TraceTypes_ID, however MySQL didn't use them. The question is - why? My bets is - the item count ratio 100:10,000,000:50 makes the single-column indexes too large to be used. But is it true?
First, the correct, and the easiest, way to analyze a slow SQL statement is to do EXPLAIN. Find out how the optimizer chose its plan and ponder on why and how to improve that. I'd suggest to study the EXPLAIN results with only 2 separate indexes to see how mysql execute your statement.
I'm not very familiar with MySQL, but it seems that there's restriction of MySQL 4 of using only one index per table involved in a query. There seems to be improvements on this since MySQL 5 (index merge), but I'm not sure whether it applies to your case. Again, EXPLAIN should tell you the truth.
Even with using 2 indexes per table allowed (MySQL 5), using 2 separate indexes is generally slower than compound index. Using 2 separate indexes requires index merge step, compared to the single pass of using a compound index.
Multi Column indexes vs Index Merge might be helpful, which uses MySQL 5.4.2.
It's not the size of the indexes so much as the selectivity that determines whether the optimizer will use them.
My guess would be that it would be using the index and then it might be using traditional look up to move to another index and then filter out. Please check the execution plan. So in short you might be looping through two indexes in nested loop. As per my understanding. We should try to make a composite index on column which are in filtering or in join and then we should use Include clause for the columns which are in select. I have never worked in MySql so my this understanding is based on SQL Server 2005.