I have a very simple table to log readings from sensors. There's a column for the sensor id number, one for the sensor reading, and one for the timestamp. The timestamp column is of SQL type TIMESTAMP. There's a large amount of data in the table, a few million rows.
When I query for all rows before a certain timestamp with a certain sensor id number, sometimes it can take a very long time. If the timestamp is far in the past, the query is pretty fast but, if it's a recent timestamp, it can take up to 2 or 3 seconds.
It appears as if the SQL engine is iterating over the table until it finds the first timestamp that's larger than the queried timestamp. Or maybe the larger amount of queried data slows it down, I don't know.
In any case, I'm looking for design suggestions here, specifically to address two points: why is it so slow, and how can I make it faster?
Is there any design technique that could be applied here? I don't know much about SQL, maybe there's a way to let the SQL engine know the data is ordered (right now it's not but I could order it upon insertion I guess) and speed up the query. Maybe I should change the way the query is done or change the data type of the timestamp column.
Use EXPLAIN to see the execution plan, and verify that the query is using a suitable index. If not, verify that appropriate indexes are available.
An INDEX is stored "in order", and MySQL can make effective use of it with some query patterns. (An InnoDB table is also stored in order, by the cluster key, which is the PRIMARY KEY of the table (if it exists) or the first UNIQUE KEY on non-NULL columns.)
With some query patterns, by using an index, MySQL can eliminate vast swaths of rows from being examined. When MySQL can't make use of an index (either because a suitable index doesn't exist, or because the query has constructs that prevent it), the execution plan is going to do a full scan, that is, examine every row in the table. And when that happens with very large tables, there's a tendency for things to get slow.
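To make that concrete for the sensor table described in the question (the table and column names below are made up, since the original post doesn't give them), a composite index that leads with the sensor id and then the timestamp lets MySQL jump straight to the matching sensor and range-scan only the relevant timestamps. A minimal sketch:

CREATE INDEX idx_sensor_ts ON sensor_readings (sensor_id, reading_ts);  -- hypothetical table/column names

EXPLAIN
SELECT reading_ts, reading
FROM sensor_readings
WHERE sensor_id = 42
  AND reading_ts < '2015-06-18 00:00:00';

In the EXPLAIN output you would hope to see type=range with idx_sensor_ts in the key column; type=ALL would indicate a full table scan.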
EDIT
Q: Why is it so slow?
A: There are several factors that affect the elapsed time. It could be contention, for example an exclusive table lock taken by another session, or it could be time for I/O (disk reads), a large "Using filesort" operation, or the time spent returning the resultset over a slow network connection.
It's not possible to diagnose the issue with the limited information provided. We can only offer suggestions about some common issues.
Q: How can I make it faster?
A: It's not possible to make a specific recommendation. We need to figure out where and what the bottleneck is, and then address that.
Take a look at the output from EXPLAIN to examine the execution plan. Is an appropriate index being used, or is it doing a full scan? How many rows are being examined? Is there a "Using filesort" operation? And so on.
Q: Is there any design technique that could be applied here?
A: In general: have an appropriate index available, and carefully craft the SQL statement so that the most efficient access plan can be used.
Q: Maybe I should change the way the query is done
A: Changing the SQL statement may improve performance; that's a good place to start after looking at the execution plan. Can the query be modified to get a more efficient plan?
Q: or change the data type of the timestamp column.
A: I think it's very unlikely that changing the datatype of the TIMESTAMP column will improve performance. That's only 4 bytes. What would you change it to? Using DATETIME would take 5 bytes (8 bytes before MySQL 5.6.4).
In general, we want the rows to be as short as possible, and to pack as many rows as possible into a block. It's also desirable to have the table physically organized in a way that queries can be satisfied from fewer blocks... the rows the query needs are found in fewer pages, rather than being scattered onesy-twosy over a large number of pages.
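One way to get that kind of physical organization with InnoDB (sticking with the hypothetical names from the sketch above) is to make the cluster key lead with the sensor id, so that all readings for one sensor are stored together in timestamp order. This is a sketch of the idea, not a drop-in recommendation:

-- Hypothetical definition; the real table and column names were not given in the question.
CREATE TABLE sensor_readings (
  sensor_id  INT       NOT NULL,
  reading_ts TIMESTAMP NOT NULL,
  reading    DOUBLE    NOT NULL,
  PRIMARY KEY (sensor_id, reading_ts)  -- cluster key: one sensor's readings are contiguous, in time order
) ENGINE=InnoDB;

If two readings from the same sensor could share a timestamp, the primary key would need an extra tie-breaker column.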
With InnoDB, increasing the size of the buffer pool may reduce I/O.
And I/O from solid state drives (SSD) will be faster than I/O from spinning hard disks (HDD), and this is especially true if there is I/O contention on the HDD from other processes.
Related
For example, there is a table named paper, and I execute SQL like this:
select paper.user_id, paper.name, paper.score from paper where user_id in (201, 205, 209, …)
I observed that when this statement is executed, the index will only be used when the number of values in the IN list is less than a certain number, and that number is dynamic.
For example, when the total number of rows in the table is 4000 and the cardinality is 3939, the number of IN values must be less than 790 for MySQL to use the index.
(Viewing the MySQL EXPLAIN output: if < 790, type=range; if > 790, type=ALL.)
When the total number of rows in the table is 1300000 and the cardinality is 1199166, the number of IN values must be less than 8500 for MySQL to use the index.
The result of this experiment is very strange to me.
I imagined that if I were implementing this IN query, I would first find the pages where the minimum and maximum values of the IN list are located, and then exclude the pages before the minimum and the pages after the maximum. That is definitely faster than performing a full table scan.
Then, my test data can be summarized as follows:
Data in the table 1 to 1300000
Data of "in" 900000 to 920000
My question is: in a table with 1300000 rows of data, why does MySQL decide that when the number of IN values is more than 8500, it should not use the index?
MySQL version 5.7.20
In fact, this magic number is 8452. When the total number of rows in my table is 600000, it is 8452. When the total number of rows is 1300000, it is still 8452. Here are my test results.
When the number of IN values is 8452, the query only takes 0.099s.
The execution plan then shows a range query.
If I increase the number of IN values from 8452 to 8453, the query takes 5.066s, even if I only add a duplicate element.
The execution plan then shows type=ALL.
This is really strange. It means that if I execute the query with 8452 IN values first, and then execute a query for the remaining values, the total time is much less than directly executing the query with 8453 IN values.
Can anyone debug the MySQL source code to see what happens in this process?
Thanks very much.
Great question and nice find!
The query planner/optimizer has to decide if it's going to seek the pages it needs to read, or if it's going to start reading many more and scan for the ones it needs. The seek strategy is more memory and especially CPU intensive, while the scan is probably significantly more expensive in terms of I/O.
The bigger a table, the less attractive the seek strategy becomes. For a large table, a bigger part of the nonclustered index used for the seek needs to come from disk, memory pressure rises, and the potential for sequential reads shrinks the longer the seek takes. Therefore the threshold for the rows/results ratio at which a seek is still considered lowers as the table size rises.
If this is a problem, there are a few things you could try to tune. But when this is a problem for you in production, it might be the right time to consider a server upgrade, optimizing the queries and software involved, or simply adjusting expectations.
'Harden' or (re)enforce the query plans you prefer
Tweak the engine (when this is a problem that affects most tables, server/database settings can perhaps be optimized)
Optimize nonclustered indexes
Provide query hints (see the sketch after this list)
Alter tables and datatypes
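As one illustration of the query-hint option, FORCE INDEX tells MySQL to prefer the seek strategy even when its cost estimate says otherwise (the index name below is hypothetical; use whatever index actually exists on user_id):

SELECT paper.user_id, paper.name, paper.score
FROM paper FORCE INDEX (idx_paper_user_id)  -- idx_paper_user_id is an assumed index name on user_id
WHERE user_id IN (201, 205, 209);

Use hints sparingly; a forced plan that is right today can become wrong as the data distribution changes.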
It is usually folly to do a query in 2 steps. That framework seems to be fetching ids in one step, then fetching the real stuff in a second step.
If the two queries are combined into a single one (with a JOIN), the Optimizer is mostly forced to do the random lookups.
"Range" is perhaps always the "type" for IN lookups. Don't read anything into it. Whether IN looks at min and max to try to minimize disk hits -- I would expect this to be a 'recent' optimization. (I have not it in the Changelogs.)
Are those UUIDs with the dashes removed? They do not scale well to huge tables.
"Cardinality" is just an estimate. ANALYZE TABLE forces the recomputation of such stats. See if that changes the boundary, etc.
I am in the process of optimizing the queries in my MySQL database. While using Visual Explain and looking at various query costs, I'm repeatedly finding counter-intuitive values. Operations which use more efficient lookups (e.g. key lookup) seem to have a higher query cost than ostensibly less efficient operations (e.g. full table scan or full index scan).
Examples of this can even be seen in the MySQL manual, in the section regarding Visual Explain on this page:
The query cost for the full table scan is a fraction of the key-lookup-based query costs. I see exactly the same scenario in my own database.
All this seems perfectly backwards to me, and raises this question: should I use query cost as the standard when optimizing a query? Or have I fundamentally misunderstood query cost?
MySQL does not have very good metrics relating to Optimization. One of the better ones is EXPLAIN FORMAT=JSON SELECT ..., but it is somewhat cryptic.
Some 'serious' flaws:
Rarely does anything account for a LIMIT.
Statistics on indexes are crude and do not allow for uneven distribution. (Histograms are coming 'soon'; see the sketch after this list.)
Very little is done about whether data/indexes are currently cached, and nothing about whether you have a spinning drive or SSD.
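For what it's worth, if you are on MySQL 8.0 or later, the histograms mentioned above did eventually arrive; building one on a skewed column looks roughly like this (table and column names are placeholders):

ANALYZE TABLE t UPDATE HISTOGRAM ON status_col WITH 32 BUCKETS;  -- placeholder table/column names

The optimizer can then account for uneven value distribution on that column.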
I like the following recipe because it lets me compare two formulations/indexes/etc even for small tables, where timing is next to useless:
FLUSH STATUS;
perform the query
SHOW SESSION STATUS LIKE "Handler%";
It provides exact counts (unlike EXPLAIN) of reads, writes (to temp table), etc. Its main flaw is in not differentiating how long a read/write took (due to caching, index lookup, etc). However, it is often very good at pointing out whether a query did a table/index scan versus lookup versus multiple scans.
The regular EXPLAIN fails to point out multiple sorts, such as might happen with GROUP BY and ORDER BY. And "Using filesort" does not necessarily mean anything is written to disk.
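Putting the recipe together in one go (t and col are placeholders; substitute the query you are investigating):

FLUSH STATUS;
SELECT * FROM t WHERE col = 123;  -- placeholder for the query under test
SHOW SESSION STATUS LIKE 'Handler%';

A large Handler_read_rnd_next relative to the rows returned suggests a table scan; counts concentrated in Handler_read_key and Handler_read_next suggest index lookups and range scans.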
I have a Postgres table with several columns; one column is the datetime that the row was last updated. My query is to get all the updated rows between a start and end time. My understanding is that for this query I should use WHERE conditions instead of BETWEEN. The basic query is as follows:
SELECT * FROM contact_tbl contact
WHERE contact."UpdateTime" >= '20150610' and contact."UpdateTime" < '20150618'
I am new at creating SQL queries, and I believe this query is doing a full table scan, so I would like to optimize it if possible. I have placed a normal index on the UpdateTime column, which takes a long time to create, but with this index the query is faster. One thing I am not sure about is whether I have to keep recalculating this index if the table gets bigger or columns get changed. Also, I am considering a CLUSTERED index on the UpdateTime column, but I wanted to ask first if there is a canonical way of optimizing this, and whether I am on the right track.
Placing an index on UpdateTime is correct. It will allow the index to be used instead of full table scans.
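For the query shown above, that index would be something like the following (the index name is arbitrary):

CREATE INDEX idx_contact_updatetime ON contact_tbl ("UpdateTime");

With it in place, EXPLAIN should show an index scan (or bitmap index scan) rather than a sequential scan for a narrow date range.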
Two WHERE conditions like the above vs. using the BETWEEN keyword are exactly the same:
http://dev.mysql.com/doc/refman/5.7/en/comparison-operators.html#operator_between
BETWEEN is just "syntactical sugar" for those that like that syntax better.
Indexes allow for faster reads, but slow down writes (because like you mention, the new data has to be inserted into the index as well). The entire index does not need to be recalculated. Indexes are smart data structures, so the extra data can be added without a lot of extra work, but it does take some.
You're probably doing many more reads than writes, so using an index is a good idea.
If you're doing lots of writes and few reads, then you'd want to think a bit more about it. It would then come down to business requirements. Although overall the throughput may be slowed, read latency may not be a requirement but write latency may be, in which case you wouldn't want the index.
For instance, think of this lottery example: every time someone buys a ticket, you have to record their name and the ticket number. However, the only time you ever have to read that data is after the one and only drawing, to see who had that ticket number. In this database, you wouldn't want to index the ticket number, since there will be so many writes and very few reads.
Here is my situation. I have a MySQL MyISAM table containing about 4 million records with a total of 13.3 GB of data. The table contains messages received from an external system. Two of the columns in the table keep track of a timestamp and a boolean indicating whether the message has been handled or not.
When using this query:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message
The result shows up almost instantly.
However, I need to find the earliest timestamp of unhandled messages, like this:
SELECT MIN(timestampCB) FROM webshop_cb_onx_message WHERE handled = 0
The results of this query show up after about 3 minutes, which is way too slow for the script I'm writing.
Both columns are individually indexed, not together. However, adding an index to the table would take incredibly long considering the amount of data that is in there already.
Does my problem originate from the fact that both columns are separately indexed, and if so, does anyone have a solution to my issue other than adding another index?
It is commonly recommended that if the selectivity of an index is over 20%, then a full table scan is preferable to an index access. This means it is likely that your index on handled won't actually be used, and you will get a full table scan instead, given the selectivity.
A composite index of (handled, timestampCB) may actually improve the performance. Given it's a composite index, even if the selectivity isn't great MySQL would most likely still use it; even if it didn't, you could force its use.
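A sketch of that composite index and the query it helps (the index name is arbitrary):

CREATE INDEX idx_handled_ts ON webshop_cb_onx_message (handled, timestampCB);

SELECT MIN(timestampCB)
FROM webshop_cb_onx_message
WHERE handled = 0;

With the index leading on handled, the MIN() can be answered from the first matching index entry instead of scanning the table. Building the index on a 13 GB table will still take a while, but it is a one-off cost.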
I'm trying to fine-tune my MySQL server so I check my settings, analyzing slow-query log, and simplify my queries if possible.
Sometimes it is enough if I am indexing correctly, sometimes not. I've read somewhere (please correct me if this is nonsense) that having more indexes than I need has the same effect as having no indexes at all.
How many indexes are enough? You can say it depends on hundreds of factors, but I'm curious how I can clean up my mysql-slow.log enough to reduce server load.
Furthermore, I saw some "interesting" log entries like this:
# Query_time: 0 Lock_time: 0 Rows_sent: 22 Rows_examined: 44
SELECT * FROM `categories` ORDER BY `orderid` ASC;
The table in question contains exactly 22 rows, with an index on orderid. Why is this query showing up in the log at all? And why examine 44 rows if the table only contains 22?
The amount of indexing, and the point at which it becomes too much, will depend on a lot of factors. On small tables like your categories table you usually don't want or need an index, and it can actually hurt performance. The reason is that it takes I/O (i.e. time) to read an index, and then more I/O and time to retrieve the records associated with the matched rows. An exception is when you only query the columns contained within the index.
In your example you are retrieving all the columns, and with only 22 rows it may be faster to just do a table scan and sort those instead of using the index. The optimizer may (and should) be doing this and ignoring the index. If that is the case, then the index is just taking up space with no benefit. If your categories table is accessed often, you may want to consider pinning it into memory so the db server keeps it accessible without having to go to the disk all the time.
When adding indexes you need to balance out disk space, query performance, and the performance of updating and inserting into the tables. You can get away with more indexes on tables that are static and don't change much, as opposed to tables with millions of updates a day. You'll start feeling the effects of index maintenance at that point. What is acceptable in your environment, though, can only be determined by you and your organization.
When doing your analysis, be sure to generate/update your table and index statistics so that you can be assured of accurate calculations.
As a general rule, you should have indexes on all primary keys (you don't have a choice in that), all foreign keys, and any other fields you commonly use to fetch rows.
For example, if I commonly look up users by username, I would have that indexed, even if user ID was the primary key.
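Continuing that example (the table and column names are assumed, not taken from any schema above):

CREATE INDEX idx_users_username ON users (username);  -- or a UNIQUE index, if usernames must be unique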
How many indexes you need depends entirely on the queries you're running, what kinds of joins are being done (if any), the kind of data stored in the table, and how big the tables are (as well as many other factors). There's really no exact science to it. The greatest tool in your arsenal for figuring out how to optimize a query is EXPLAIN. Using EXPLAIN you can find out what kind of joins are being done, what possible keys could be used, and which key (if any) was used, as well as how many rows were examined for each table in the join.
Using this information you can decide how to key your tables and/or modify your queries to make them more efficient. The syntax for explain is very simple.
EXPLAIN SELECT * FROM `categories` ORDER BY `orderid` ASC;
Note, explain does not actually run the query. So if you're using this to debug a query that takes 5 minutes to run, explain will still be very fast.
You do need to be careful when adding indexes, though, as they do cause inserts and updates to go slower, and on very large tables this performance hit can become noticeable, especially if that same table is used for a lot of reads. While adding a lot of indexes generally won't kill the performance of a query, you should still only add them as you need them.
Also keep in mind that MySQL will generally use a maximum of one index per table in a SELECT statement (although if you are using a join, it can use one for each joined table). So indexing just because is a waste of disk space and will slow the database down on writes. If you commonly use a WHERE clause on two columns, create one index containing both of those columns; it will be significantly faster than indexing just one alone.
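For example (hypothetical table and columns), if a query regularly filters on both customer_id and status, one composite index serves it better than two separate single-column indexes:

CREATE INDEX idx_orders_customer_status ON orders (customer_id, status);  -- hypothetical names

SELECT *
FROM orders
WHERE customer_id = 17
  AND status = 'shipped';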
An index can speed up a SELECT query, but it will slow down INSERT/UPDATE/DELETE queries because they need to update the index as well, not just the row.
This is just personal opinion (I've got no facts to back it up), but I think that if there is a query that is taking a long time and an index would speed it up - go for it! "Too many" indexes would be if you added indexes that didn't do any good (e.g. there were no queries it would speed up). For example, a silly thing to do would be to place an index on every column "just because".
There's no magic number for the "best" number of indexes. The basic rule is this: add indexes for queries that are used often and/or need to run quickly.
Having "too many" indexes shouldn't slow down queries, but it each index added adds a small amount of time to add/update items in the db (since it modifies the indices as well), and a small amount of space. However, if you're just adding indexes as required, this is probably not a big concern.