Let's say I have a table with 10 billion rows. Each row has a datetime column to signify when the row was added to the table. When querying against this table, 99% of the time the query will be based on a date range. Now my question:
Would it be faster to query based on the date time (e.g. where date between), or to query based on a range of the primary key (rowid)? In the second case I could have a table containing ID range values for each date so it would be simple to know what ID range to query for. Regardless of the implementation, would a primary key query like this be faster than the date range query? And if so, would it be significantly faster/take significantly less resources?
Thanks for any help. I'll be running some tests myself as well, just want to get some outside feedback on the matter to avoid confirmation bias.
Related
I have a large table (about 3 million records) that includes primarily these fields: rowID (int), a deviceID (varchar(20)), a UnixTimestamp in a format like 1536169459 (int(10)), powerLevel which has integers that range between 30 and 90 (smallint(6)).
I'm looking to pull out records within a certain time range (using UnixTimestamp) for a particular deviceID and with a powerLevel above a certain number. With over 3 million records, it takes a while. Is there a way to create an index that will optimize for this?
Create an index over:
DeviceId,
PowerLevel,
UnixTimestamp
When selecting, you will first narrow in to the set of records for your given Device, then it will narrow in to only those records that are in the correct PowerLevel range. And lastly, it will narrow in, for each PowerLevel, to the correct records by UnixTimestamp.
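As a concrete sketch (the table name `tbl` and the index name are assumed; adjust to your schema):

```sql
-- Hypothetical table and index names; adjust to match your schema.
CREATE INDEX idx_dev_pwr_ts
    ON tbl (DeviceId, PowerLevel, UnixTimestamp);
```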
If I understand you correctly, you hope to speed up this sort of query.
SELECT something
FROM tbl
WHERE deviceID = constant
AND start <= UnixTimestamp
AND UnixTimestamp < end
AND Power >= constant
You have one constant criterion (deviceID) and two range criteria (UnixTimestamp and Power). MySQL's indexes are BTREE (think: sorted in order), and MySQL can only do one index range scan per SELECT.
So, you should probably choose an index on (deviceID, UnixTimestamp, Power). To satisfy the query, MySQL will random-access the index to the entries for deviceID, then further random access to the first row meeting the UnixTimestamp start criterion.
It will then scan the index sequentially, and use the Power information from each index entry to decide whether it should choose each row.
You could also use (deviceID, Power, UnixTimestamp). But in this case MySQL will find the first entry matching the device and power criteria, then scan the index across entries with all timestamps to see which rows it should choose.
Your performance objective is to get MySQL to scan the fewest possible index entries, so it seems very likely the (deviceID, UnixTimestamp, Power) choice is superior. The index column on UnixTimestamp is probably more selective than the one on Power. (That's my guess.)
ALTER TABLE tbl ADD INDEX tbl_dev_ts_pwr (deviceID, UnixTimestamp, Power);
Look at Bill Karwin's tutorials. Also look at Markus Winand's https://use-the-index-luke.com
The suggested 3-column indexes are only partially useful. The Optimizer will use the first 2 columns, but ignore the third.
Better:
INDEX(DeviceId, PowerLevel),
INDEX(DeviceId, UnixTimestamp)
Why?
The optimizer will pick between those two based on which seems to be more selective. If the time range is 'narrow', then the second index will be used; if there are not many rows with the desired PowerLevel, then the first index will be used.
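A sketch of those two indexes (the table name `tbl` is assumed); EXPLAIN will reveal which one the optimizer actually picks for a given query:

```sql
-- Hypothetical table and index names; adjust to match your schema.
ALTER TABLE tbl
    ADD INDEX idx_dev_pwr (DeviceId, PowerLevel),
    ADD INDEX idx_dev_ts  (DeviceId, UnixTimestamp);
```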
Even better...
The PRIMARY KEY... You probably have Id as the PK? Perhaps (DeviceId, UnixTimestamp) is unique? (Or can you have two readings for a single device in a single second??) If the pair is unique, get rid of Id completely and have
PRIMARY KEY(DeviceId, UnixTimestamp),
INDEX(DeviceId, PowerLevel)
Notes:
Getting rid of Id saves space, thereby providing a little bit of speed.
When using a secondary index, the executor spends time bouncing between the index's BTree and the data BTree (ordered by the PK). With PRIMARY KEY(Id), you are guaranteed to do that bouncing. Changing the PK as above avoids the bouncing entirely. This may double the speed of the query.
(I am not sure the secondary index will ever be used.)
Another (minor) suggestion: Normalize the DeviceId so that it is (perhaps) a 2-byte SMALLINT UNSIGNED (range 0..64K) instead of VARCHAR(20). Even if this entails a JOIN, the query will run a little faster. And a bunch of space is saved.
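A minimal sketch of that normalization, with hypothetical table and column names:

```sql
-- Lookup table: one row per device, keyed by a compact SMALLINT id.
CREATE TABLE devices (
    device_id   SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    device_code VARCHAR(20) NOT NULL,
    PRIMARY KEY (device_id),
    UNIQUE KEY (device_code)
) ENGINE=InnoDB;

-- The readings table then stores device_id instead of the VARCHAR(20),
-- and queries join through the lookup table:
SELECT r.*
FROM readings AS r
JOIN devices  AS d ON d.device_id = r.device_id
WHERE d.device_code = 'SENSOR-01'
  AND r.UnixTimestamp BETWEEN 1536100000 AND 1536200000;
```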
I use following query frequently:
SELECT * FROM table WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime] and publish = 1 and type = 2 order by Timestamp
I would like to optimize this query, and I am thinking about putting the timestamp in the primary key for the clustered index. I think that if the timestamp is part of the primary key, data inserted into the table will be written to disk sequentially by the timestamp field. I also think this would improve my query a lot, but I am not sure if it would help.
table has 3-4 million+ rows.
timestamp field never changed.
I use mysql 5.6.11
Another point: if this does improve my query, is it better to use TIMESTAMP (4 bytes in MySQL 5.6) or DATETIME (5 bytes in MySQL 5.6)?
Four million rows isn't huge.
A one-byte difference between the data types datetime and timestamp is the last thing you should consider in choosing between those two data types. Review their specs.
Making a timestamp part of your primary key is a bad, bad idea. Think about reviewing what primary key means in a SQL database.
Put an index on your timestamp column. Get an execution plan, and paste that into your question. Determine your median query performance, and paste that into your question, too.
Returning a single day's rows from an indexed, 4 million row table on my desktop computer takes 2ms. (It returns around 8000 rows.)
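The suggestion above amounts to something like this (names taken from the question; the index name is illustrative):

```sql
CREATE INDEX idx_ts ON tbl (Timestamp);

-- Get the execution plan to paste into the question
-- (the literal timestamps are placeholder values):
EXPLAIN
SELECT * FROM tbl
WHERE Timestamp > 1370000000 AND Timestamp < 1370086400
  AND publish = 1 AND type = 2
ORDER BY Timestamp;
```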
1) If the values of timestamp are unique, you can make it the primary key. If not, create an index on the timestamp column anyway, since you frequently use it in WHERE.
2) Using a BETWEEN clause looks more natural here. I suggest you use a BTREE index (the default index type), not HASH.
3) When the timestamp column is indexed, you don't need ORDER BY; the rows already come back sorted (provided, of course, your index is BTREE, not HASH).
4) An integer unix_timestamp is better than DATETIME for both memory usage and performance: comparing dates is a more complex operation than comparing integers.
Searching data on an indexed field takes O(log(rows)) tree lookups. Comparing integers is O(1) and comparing dates is O(date_string_length). So the difference is (number of tree lookups) * (comparison cost difference) = (O(date_string_length) / O(1)) * O(log(rows)) = O(date_string_length) * O(log(rows)).
I have a table with 5 million rows, and I want to get only rows that have the field date between two dates (date1 and date2). I tried to do
select column from table where date > date1 and date < date2
but the processing time is really big. Is there a smarter way to do this? Maybe access directly a row and make the query only after that row? My point is, is there a way to discard a large part of my table that does not match to the date period? Or I have to read row by row and compare the dates?
Usually you apply some kind of condition before retrieving the results. If you don't have anything to filter on you might want to use LIMIT and OFFSET:
SELECT * FROM table_name WHERE date BETWEEN ? AND ? LIMIT 1000 OFFSET 1000
Generally you will LIMIT to whatever amount of records you'd like to show on a particular page.
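For example, page 3 at 1000 rows per page (OFFSET = (page - 1) * page size):

```sql
SELECT * FROM table_name
WHERE date BETWEEN ? AND ?
ORDER BY date
LIMIT 1000 OFFSET 2000;
```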
You can try/do a couple of things:
1.) If you don't already have one, index your date column
2.) Range partition your table on the date field
When you partition a table, the query optimizer can eliminate partitions that are not able to satisfy the query without actually processing any data.
For example, let's say you partitioned your table monthly on the date field and had 6 months of data in the table. If you query for a one-week date range in October 2012, the query optimizer can throw out 5 of the 6 partitions and only scan the partition that holds records from October 2012.
For more details, check the MySQL Partitioning page. It gives you all the necessary information and gives a more thorough example of what I described above in the "Partition Pruning" section.
Note, I would recommend creating/cloning your table in a new partitioned table and do the query in order to test the results and whether it satisfies your requirements. If you haven't already indexed the date column, that should be your first step, test, and if need be check out partitioning.
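A sketch of monthly range partitioning, with hypothetical table and column names (note that in MySQL the partitioning column must appear in every unique key, including the primary key):

```sql
CREATE TABLE events (
    id         INT NOT NULL,
    event_date DATE NOT NULL,
    amount     DECIMAL(10,2),
    PRIMARY KEY (id, event_date)   -- partition column must be part of the PK
)
PARTITION BY RANGE (TO_DAYS(event_date)) (
    PARTITION p2012_09 VALUES LESS THAN (TO_DAYS('2012-10-01')),
    PARTITION p2012_10 VALUES LESS THAN (TO_DAYS('2012-11-01')),
    PARTITION p2012_11 VALUES LESS THAN (TO_DAYS('2012-12-01')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
```

A query for a week in October 2012 then only touches partition p2012_10.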
I am trying to find out how to design the indexes for my data when my query is using ranges for 2 fields.
expenses_tbl:
  idx       INT, auto-increment, PRIMARY KEY
  date      INT
  category  TINYINT
  amount    DECIMAL(7,2)
The column category defines the type of expense. Like, entertainment, clothes, education, etc. The other columns are obvious.
One of my query on this table is to find all those instances where for a given date range, the expense has been more than $50. This query will look like:
SELECT date, category, amount
FROM expenses_tbl
WHERE date > 120101 AND date < 120811
AND amount > 50.00;
How do I design the index/secondary index on this table for this particular query.
Assumption: The table is very large (It's not currently, but that gives me a scope to learn).
MySQL generally doesn't support ranges on multiple parts of a compound index. Either it will use the index for the date, or an index for the amount, but not both. It might do an index merge if you had two indexes, one on each, but I'm not sure.
I'd check the EXPLAIN before and after adding these indexes:
CREATE INDEX date_idx ON expenses_tbl (date);
CREATE INDEX amount_idx ON expenses_tbl (amount);
Compound index ranges - http://dev.mysql.com/doc/refman/5.5/en/range-access-multi-part.html
Index Merge - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html
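With the two single-column indexes in place, EXPLAIN will show whether MySQL picked one of them or an index merge (query taken from the question):

```sql
EXPLAIN
SELECT date, category, amount
FROM expenses_tbl
WHERE date > 120101 AND date < 120811
  AND amount > 50.00;
-- Check the "type" column for "index_merge" and the "Extra" column
-- for which plan the optimizer chose.
```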
A couple more points that have not been mentioned yet:
The order of the columns in the index can make a difference. You may want to try both of these indexes:
(date, amount)
(amount, date)
Which to pick? Generally you want the most selective condition be the first column in the index.
If your date ranges are large but few expenses are over $50 then you want amount first in the index.
If you have narrow date ranges and most of the expenses are over $50 then you should put date first.
If both indexes are present then MySQL will choose the index with the lowest estimated cost.
You can try adding both indexes and then look at the output of EXPLAIN SELECT ... to see which index MySQL chooses for your query.
You may also want to consider a covering index. By including the column category in the index (as the last column) it means that all the data required for your query is available in the index, so MySQL does not need to look at the base table at all to get the results for your query.
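A sketch of such a covering index for the query in the question (the index name is illustrative):

```sql
CREATE INDEX idx_date_amount_cat
    ON expenses_tbl (date, amount, category);
-- EXPLAIN should then show "Using index": no base-table lookups needed.
```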
The general answer to your question is that you want a composite index, with two keys. The first being date and the second being the amount.
Note that this index will work for queries with restrictions on the date or on the date and on the expense. It will not work for queries with restrictions on the expense only. If you have both types, you might want a second index on expense.
If the table is really, really large, then you might want to partition it by date and build indexes on expense within each partition.
I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep a list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The only downside to defining indexes is that they use extra storage space, so unless you're real tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
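For the query in the question, that boils down to a single statement (the index name is illustrative):

```sql
CREATE INDEX idx_timestamp ON MyTable (Timestamp);
```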
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), the need to index on timestamp may not be needed.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
And if I'm not mistaken, the advantage of this would be avoiding the overhead of maintaining the timestamp index on each insertion; in a table with high insertion rates and highly unique timestamps, this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with indexes on timestamp will have high overhead **
Even though we know our insertions will arrive with incrementing timestamps, I don't see any discussion of MySQL optimisations based on monotonically increasing keys.
Insertion with indexes on userId will have reasonably high overhead.
Insertion with indexes on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead or insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
    ts INT NOT NULL
  , oldPK INT NOT NULL      -- use the type of your existing PK
  -- , ... other columns
  , PRIMARY KEY (ts, oldPK)
  , UNIQUE KEY (oldPK)
) ENGINE=InnoDB;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
Disadvantage is that your inserts will be a bit slower. Also, if you have other indices on the table, they will use a bit more space (as they will include the 4-byte-wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
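With that design, a range query scans the clustered key sequentially, while the old key still identifies individual rows (the literal values below are illustrative):

```sql
-- Sequential scan over the clustered primary key:
SELECT * FROM perf
WHERE ts > 1370000000 AND ts < 1370086400;

-- Point lookup via the retained UNIQUE constraint:
SELECT * FROM perf WHERE oldPK = 12345;
```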
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.