Creating an index on a timestamp to optimize query - mysql

I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?

No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.

You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep an list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.

I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
...and If I'm not mistaken, the advantage of this would be to avoid the overhead of creating the timestamp index on each insertion -- in a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Despite our knowledge that our insertions will respect the fact of have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with indexes on userId will reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.

If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
, ts INT NOT NULL
, oldPK
, ... other columns
, PRIMARY KEY(ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your Inserts will be a bit slower. Also, If you have other indices on the table, they will be using a bit more space (as they will include the 4-bytes wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.

Related

Clustered index on integer column surprisingly slow

I have an InnoDB table with 750,000 records. Its primary key is a BIGINT.
When I do:
SELECT COUNT(*) FROM table;
it takes 900ms. explain shows that the index is not used.
When I do:
SELECT COUNT(*) FROM table WHERE pk >= 3000000;
it takes 400ms. explain shows that the index, in this case, is used.
I am looking to do fast counts where x >= pk >= y.
It is my understanding that since I use the primary key of the table, I am using a clustered index, and that therefore the rows are (physically?) ordered by this index. Should it then not be very, very fast to do this count? I was expecting the result to be available in a dozen milliseconds or so.
I have read that faster results can be expected if I select only a small part of the table. I am however interested in doing these counts of ranges. Perhaps I should organize my data in a different way?
In a different case, I have a table with spatial data and use an RTREE index, and then I use MBRContains to count matching rows (and on a secondary index). Surprisingly, this is faster than the simple case above.
In InnoDB, the PRIMARY KEY is "clustered" with the data. This means that the data is sorted by the PK and where pk BETWEEN x AND y must read all the rows from x through y.
So, how does it do a scan by PK? It must read the data blocks. They are bulky in that they have other columns.
But what about COUNT(*) without a WHERE? In this case, the Optimizer looks for the least-bulky index and counts the rows in it. So...
If you have a secondary index, it will use that.
If you only have the PK, then it will read the entire table to do the count.
That is, the artificial addition of a secondary index on the narrowest column is likely to speedup SELECT COUNT(*) FROM tbl.
But wait... Be sure to run each timing test twice. The first time (after a restart) must read the needed blocks from disk. Slow.
The second time all the blocks are likely to be sitting in RAM. Much faster.
SPATIAL and FULLTEXT indexing complicated this discussion. Especially if you have 2 parts to the WHERE, one with Spatial or Fulltext, one with a regular test.
COUNT(1) and COUNT(*) are identical. COUNT(x) checks x for being NOT NULL before including the row in the tally.

Mariadb Explain statement estimating high number of rows that would be found during lookup

I have a table with 32 columns of which 6 rows are primary keys and 2 more column are indexed.
Explain statement provides the below output
I have observed that, everytime the number of rows in the explain statement increases, the select query takes seconds to retrieve data from DB. The above select query returned only 310 rows but it had to scan 382546 rows.
Time taken was calculated by enabling mariadb's slow query log.
Create table query
I would like to understand the incorrectness in the table or query which is considerably slowing down the select query execution.
Your row is relatively large (around 300bytes, depending on the content of your varchar columns). Using the primary key means (for InnoDB) that MySQL will read the whole row. Assuming the estimate of 400k rows is right (which it probably isn't, but you can check by removing the and country_code = 1506 from your query to get a better count), MySQL may end up reading more than 100mb from disk, which reasonably can take several seconds.
Adding a proper index should fix this, in your case I would suggest (country_code, lcr_run_id, tier_type) (which would, with your primary key, actually be the same as just (country_code)).
If most of your queries have that form (e.g. use at least these three columns for lookup), you could think about changing the order of your primary key to start with those three columns, it should give you another speedboost. That operation will take some time though.
Hash partitioning is useless for performance, get rid of it. Ditto for subpartitioning.
Specifying which partition to use defeats the purpose of letting the Optimizer do it for you.
You simply need INDEX(tier_type, lcr_run_id, country_code) with the columns in any desired order.
Plan A: Have the PRIMARY KEY start with those 3 columns (again, the order is not important)
Plan B: Have a "secondary" index with those 3 columns, but not being the same as the start of the PK. (This index could have more columns on the end; let's see some more queries to advise further.)
Either way, it will scan only 310 rows if you also get rid of all partitioning. (Hence, resolving your "returned only 310 rows but it had to scan 382546 rows". Anyway, the '382546' may have been a poor estimate by Explain.)
The important issue here is that indexing works with the leftmost columns in the INDEX. (The PK is an index.) Your SELECT had a match on the first 2 columns, but country_code came later in the list, and the intervening columns were not tested with =.
The three 35M values makes me wonder if the PK is over-specified. For example, if a "zone" is comprised of several "countries", then "zone" is irrelevant in specifying the PK.
The table has only 382K rows, but it is much fatter than it needs to be. Partitioning has a lot of overhead. Also, most columns have (I think) much bigger datatypes than needed. BIGINT takes 8 bytes; INT takes 4 bytes. For example, if there are only a small number of "zones", use TINYINT UNSIGNED, which takes only 1 byte (and allows values 0..255). (See also other 'int' variants.)
Oops, I missed something else. Since zone is not in the WHERE, it can't even get past the primary partitioning.

Mysql select by auto increment primary key while partitioned by date

I was wondering how would mysql act if i partition a table by date and then have some select or update queries by primary key ?
is it going to search all partitions or query optimizer knows in which partition the row is saved ?
What about other unique and not-unique indexed columns ?
Background
Think of a PARTITIONed table as a collection of virtually independent tables, each with its own data BTree and index BTree(s).
All UNIQUE keys, including the PRIMARY KEY must include the "partition key".
If the partition key is available in the query, the query will first try to do "partition pruning" to limit the number of partitions to actually look at. Without that info, it must look at all partitions.
After the "pruning", the processing goes to each of the possible partitions, and performs the query.
Select, Update
A SELECT logically does a UNION ALL of whatever was found in the non-pruned partitions.
An UPDATE applies its action to each non-pruned partitions. No harm is done (except performance) by the updates that did nothing.
Opinion
In my experience, PARTITIONing often slows thing down due to things such as the above. There are a small number of use cases for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint
Your specific questions
partition a table by date and then have some select or update queries by primary key ?
All partitions will be touched. The SELECT combines the one result with N-1 empty results. The UPDATE will do one update, plus N-1 useless attempts to update.
An AUTO_INCREMENT column must be the first column in some index (not necessarily the PK, not necessarily alone). So, using the id is quite efficient in each partition. But that means that it is N times as much effort as in a non-partitioned table. (This is a performance drag for partitioning.)

Optimizing an index in a large MySQL table

I have a large table (about 3 million records) that includes primarily these fields: rowID (int), a deviceID (varchar(20)), a UnixTimestamp in a format like 1536169459 (int(10)), powerLevel which has integers that range between 30 and 90 (smallint(6)).
I'm looking to pull out records within a certain time range (using UnixTimestamp) for a particular deviceID and with a powerLevel above a certain number. With over 3 million records, it takes a while. Is there a way to create an index that will optimize for this?
Create an index over:
DeviceId,
PowerLevel,
UnixTimestamp
When selecting, you will first narrow in to the set of records for your given Device, then it will narrow in to only those records that are in the correct PowerLevel range. And lastly, it will narrow in, for each PowerLevel, to the correct records by UnixTimestamp.
If I understand you correctly, you hope to speed up this sort of query.
SELECT something
FROM tbl
WHERE deviceID = constant
AND start <= UnixTimestamp
AND UnixTimestamp < end
AND Power >= constant
You have one constant criterion (deviceID) and two range critera (UnixTimestamp and Power). MySQL's indexes are BTREE (think sorted in order), and MySQL can only do one index range scan per SELECT.
So, you should probably choose an index on (deviceID, UnixTimestamp, Power). To satisfy the query, MySQL will random-access the index to the entries for deviceID, then further random access to the first row meeting the UnixTimestamp start criterion.
It will then scan the index sequentially, and use the Power information from each index entry to decide whether it should choose each row.
You could also use (deviceID, Power, UnixTimestamp) . But in this case MySQL will find the first entry matching the device and power criteria, then scan the index to look at entries will all timestamps to see which rows it should choose.
Your performance objective is to get MySQL to scan the fewest possible index entries, so it seems very likely the (deviceID, UnixTimestamp, Power) choice is superior. The index column on UnixTimestamp is probably more selective than the one on Power. (That's my guess.)
ALTER TABLE tbl CREATE INDEX tbl_dev_ts_pwr (deviceID, UnixTimestamp, Power);
Look at Bill Karwin's tutorials. Also look at Markus Winand's https://use-the-index-luke.com
The suggested 3-column indexes are only partially useful. The Optimizer will use the first 2 columns, but ignore the third.
Better:
INDEX(DeviceId, PowerLevel),
INDEX(DeviceId, UnixTimestamp)
Why?
The optimizer will pick between those two based on which seems to be more selective. If the time range is 'narrow', then the second index will be used; if there are not many rows with the desired PowerLevel, then the first index will be used.
Even better...
The PRIMARY KEY... You probably have Id as the PK? Perhaps (DeviceId, UnixTimestamp) is unique? (Or can you have two readings for a single device in a single second??) If the pair is unique, get rid of Id completely and have
PRIMARY KEY(DeviceId, UnixTimestamp),
INDEX(DeviceId, PowerLevel)
Notes:
Getting rid of Id saves space, thereby providing a little bit of speed.
When using a secondary index, the executing spends time bouncing between the index's BTree and the data BTree (ordered by the PK). By having PRIMARY KEY(Id), you are guaranteed to do the bouncing. By changing the PK to this, the bouncing is avoided. This may double the speed of the query.
(I am not sure the secondary index will every be used.)
Another (minor) suggestion: Normalize the DeviceId so that it is (perhaps) a 2-byte SMALLINT UNSIGNED (range 0..64K) instead of VARCHAR(20). Even if this entails a JOIN, the query will run a little faster. And a bunch of space is saved.

Should I avoid COUNT all together in InnoDB?

Right now, I'm debating whether or not to use COUNT(id) or "count" columns. I heard that InnoDB COUNT is very slow without a WHERE clause because it needs to lock the table and do a full index scan. Is that the same behavior when using a WHERE clause?
For example, if I have a table with 1 million records. Doing a COUNT without a WHERE clause will require looking up 1 million records using an index. Will the query become significantly faster if adding a WHERE clause decreases the number of rows that match the criteria from 1 million to 500,000?
Consider the "Badges" page on SO, would adding a column in the badges table called count and incrementing it whenever a user earned that particular badge be faster than doing a SELECT COUNT(id) FROM user_badges WHERE user_id = 111?
Using MyIASM is not an option because I need the features of InnoDB to maintain data integrity.
SELECT COUNT(*) FROM tablename seems to do a full table scan.
SELECT COUNT(*) FROM tablename USE INDEX (colname) seems to be quite fast if
the index available is NOT NULL, UNIQUE, and fixed-length. A non-UNIQUE index doesn't help much, if at all. Variable length indices (VARCHAR) seem to be slower, but that may just be because the index is physically larger. Integer UNIQUE NOT NULL indices can be counted quickly. Which makes sense.
MySQL really should perform this optimization automatically.
Performance of COUNT() is fine as long as you have an index that's used.
If you have a million records and the column in question is NON NULL then a COUNT() will be a million quite easily. If NULL values are allowed, those aren't indexed so the number of records is easily obtained by looking at the index size.
If you're not specifying a WHERE clause, then the worst case is the primary key index will be used.
If you specify a WHERE clause, just make sure the column(s) are indexed.
I wouldn't say avoid, but it depends on what you are trying to do:
If you only need to provide an estimate, you could do SELECT MAX(id) FROM table. This is much cheaper, since it just needs to read the max value in the index.
If we consider the badges example you gave, InnoDB only needs to count up the number of badges that user has (assuming an index on user_id). I'd say in most case that's not going to be more than 10-20, and it's not much harm at all.
It really depends on the situation. I probably would keep the count of the number of badges someone has on the main user table as a column (count_badges_awarded) simply because every time an avatar is shown, so is that number. It saves me having to do 2 queries.