Index design for queries using 2 ranges - mysql

I am trying to find out how to design the indexes for my data when my query is using ranges for 2 fields.
expenses_tbl:
Column     Type            Notes
idx        auto-inc        PK
date       INT
category   TINYINT
amount     DECIMAL(7,2)
The column category defines the type of expense. Like, entertainment, clothes, education, etc. The other columns are obvious.
One of my queries on this table finds all the expenses over $50 within a given date range. The query looks like this:
SELECT date, category, amount
FROM expenses_tbl
WHERE date > 120101 AND date < 120811
AND amount > 50.00;
How do I design the index or secondary index on this table for this particular query?
Assumption: The table is very large (It's not currently, but that gives me a scope to learn).

MySQL generally doesn't support ranges on multiple parts of a compound index. Either it will use the index for the date, or an index for the amount, but not both. It might do an index merge if you had two indexes, one on each, but I'm not sure.
I'd check the EXPLAIN before and after adding these indexes:
CREATE INDEX date_idx ON expenses_tbl (date);
CREATE INDEX amount_idx ON expenses_tbl (amount);
Compound index ranges - http://dev.mysql.com/doc/refman/5.5/en/range-access-multi-part.html
Index Merge - http://dev.mysql.com/doc/refman/5.0/en/index-merge-optimization.html

A couple more points that have not been mentioned yet:
The order of the columns in the index can make a difference. You may want to try both of these indexes:
(date, amount)
(amount, date)
Which to pick? Generally you want the most selective condition to be the first column in the index.
If your date ranges are large but few expenses are over $50 then you want amount first in the index.
If you have narrow date ranges and most of the expenses are over $50 then you should put date first.
If both indexes are present then MySQL will choose the index with the lowest estimated cost.
You can try adding both indexes and then look at the output of EXPLAIN SELECT ... to see which index MySQL chooses for your query.
You may also want to consider a covering index. By including the column category in the index (as the last column) it means that all the data required for your query is available in the index, so MySQL does not need to look at the base table at all to get the results for your query.
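A rough sketch of the two column orders with category appended as the covering column. The table and columns come from the question; the index names are made up for illustration.

```sql
-- Candidate covering indexes: the two range columns first (in either order),
-- category last so the query can be answered from the index alone.
CREATE INDEX date_amount_cat ON expenses_tbl (`date`, amount, category);
CREATE INDEX amount_date_cat ON expenses_tbl (amount, `date`, category);

-- Check which index the optimizer picks. "Using index" in the Extra column
-- means the index covers the query and the base table is not read.
EXPLAIN
SELECT `date`, category, amount
FROM expenses_tbl
WHERE `date` > 120101 AND `date` < 120811
  AND amount > 50.00;
```

Once EXPLAIN shows which index is chosen for realistic data, the other one can be dropped so that inserts do not have to maintain both.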

The general answer to your question is that you want a composite index, with two keys. The first being date and the second being the amount.
Note that this index will work for queries with restrictions on the date, or on both the date and the amount. It will not work for queries with restrictions on the amount only. If you have both types, you might want a second index on amount.
If the table is really, really large, then you might want to partition it by date and build indexes on amount within each partition.
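A minimal sketch of the partitioning idea, assuming date is declared NOT NULL and holds YYMMDD integers as in the question. The partition names, boundaries, and index name are hypothetical.

```sql
-- MySQL requires every unique key, including the primary key, to contain
-- the partitioning column, so the primary key is widened first.
ALTER TABLE expenses_tbl
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (idx, `date`);

-- Range partitions on the integer date column (hypothetical boundaries).
ALTER TABLE expenses_tbl
  PARTITION BY RANGE (`date`) (
    PARTITION p_before_2012 VALUES LESS THAN (120101),
    PARTITION p_2012        VALUES LESS THAN (130101),
    PARTITION p_future      VALUES LESS THAN MAXVALUE
  );

-- Indexes on a partitioned table are built per partition.
CREATE INDEX amount_part_idx ON expenses_tbl (amount);
```

With this layout, a query restricted to dates in 2012 should only need to touch the p_2012 partition, and the amount index is then used within it.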

Related

MySQL how to index a query that searches for a substring in column while filtering integer columns

I have a table with a billion+ rows. I have the query below, which I execute frequently:
SELECT SUM(price) FROM mytable WHERE domain IN ('com') AND url LIKE '%/shop%' AND date BETWEEN '2001-01-01' AND '2007-01-01';
Where domain is varchar(10) and url is varchar(255) and price is float. I understand that any query with %..% will not use any index. So logically, I created an index on price domain and date:
create index price_date on mytable(price, domain, date)
The problem here persists, this index is also not used because query contains: url LIKE '%.com/shop%'
On the other hand a FULLTEXT index still will not work since I have other non text filters in the query.
How can I optimise the above query? I have too many rows not to use an index.
UPDATE
Is this an SQL limit? Could such a query perform better on a NoSQL database?
You have two range conditions, one uses IN() and the other uses BETWEEN. The best you can hope is that the condition on the first column of the index uses the index to examine rows, and the condition on the second column of the index uses index condition pushdown to make the storage engine do some pre-filtering.
Then it's up to you to choose which column should be the first column in the index, based on how well each condition would narrow down the search. If your condition on date is more likely to reduce the set of examined rows, then put that first in the index definition.
The order of terms in the WHERE clause does not have to match the order of columns in the index.
MySQL does not support optimizing with both a fulltext index and a B-tree index on the same table reference in the same query.
You can't use a fulltext index anyway for the pattern you are searching for. Fulltext indexes don't allow searches for punctuation characters, only words.
I vote for this order:
INDEX(domain, -- first because of "="
date, -- then range
url, price) -- "covering"
but, since the constants look like most of the billion rows would be hit, I don't expect good performance.
If this is a common query and/or "shop" is one of only a few possible filters, we can discuss whether a summary table would be useful.
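For reference, a summary table for this query might look roughly like the sketch below. It assumes date is a DATE column, that domain and date are never NULL, and that '/shop' is the only URL filter of interest; the table name shop_price_summary and the column total_price are hypothetical.

```sql
-- Hypothetical summary table: one row per (domain, date) holding the
-- pre-computed price total for URLs containing '/shop'.
CREATE TABLE shop_price_summary (
  domain      VARCHAR(10) NOT NULL,
  `date`      DATE        NOT NULL,
  total_price DOUBLE      NOT NULL,
  PRIMARY KEY (domain, `date`)
);

-- Populate it once with a full scan; it must be refreshed as new rows arrive.
INSERT INTO shop_price_summary (domain, `date`, total_price)
SELECT domain, `date`, SUM(price)
FROM mytable
WHERE url LIKE '%/shop%'
GROUP BY domain, `date`;

-- The frequent query then reads one summary row per day in the range.
SELECT SUM(total_price)
FROM shop_price_summary
WHERE domain = 'com'
  AND `date` BETWEEN '2001-01-01' AND '2007-01-01';
```

The expensive LIKE '%/shop%' scan is paid once when the summary is built rather than on every execution of the query.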

Optimizing an index in a large MySQL table

I have a large table (about 3 million records) that includes primarily these fields: rowID (int), a deviceID (varchar(20)), a UnixTimestamp in a format like 1536169459 (int(10)), powerLevel which has integers that range between 30 and 90 (smallint(6)).
I'm looking to pull out records within a certain time range (using UnixTimestamp) for a particular deviceID and with a powerLevel above a certain number. With over 3 million records, it takes a while. Is there a way to create an index that will optimize for this?
Create an index over:
DeviceId,
PowerLevel,
UnixTimestamp
When selecting, you will first narrow in to the set of records for your given Device, then it will narrow in to only those records that are in the correct PowerLevel range. And lastly, it will narrow in, for each PowerLevel, to the correct records by UnixTimestamp.
If I understand you correctly, you hope to speed up this sort of query.
SELECT something
FROM tbl
WHERE deviceID = constant
AND start <= UnixTimestamp
AND UnixTimestamp < end
AND Power >= constant
You have one constant criterion (deviceID) and two range criteria (UnixTimestamp and Power). MySQL's indexes are BTREE (think sorted in order), and MySQL can only do one index range scan per SELECT.
So, you should probably choose an index on (deviceID, UnixTimestamp, Power). To satisfy the query, MySQL will random-access the index to the entries for deviceID, then further random access to the first row meeting the UnixTimestamp start criterion.
It will then scan the index sequentially, and use the Power information from each index entry to decide whether it should choose each row.
You could also use (deviceID, Power, UnixTimestamp). But in this case MySQL will find the first entry matching the device and power criteria, then scan the index to look at entries with all timestamps to see which rows it should choose.
Your performance objective is to get MySQL to scan the fewest possible index entries, so it seems very likely the (deviceID, UnixTimestamp, Power) choice is superior. The index column on UnixTimestamp is probably more selective than the one on Power. (That's my guess.)
ALTER TABLE tbl ADD INDEX tbl_dev_ts_pwr (deviceID, UnixTimestamp, Power);
Look at Bill Karwin's tutorials. Also look at Markus Winand's https://use-the-index-luke.com
The suggested 3-column indexes are only partially useful. The Optimizer will use the first 2 columns, but ignore the third.
Better:
INDEX(DeviceId, PowerLevel),
INDEX(DeviceId, UnixTimestamp)
Why?
The optimizer will pick between those two based on which seems to be more selective. If the time range is 'narrow', then the second index will be used; if there are not many rows with the desired PowerLevel, then the first index will be used.
Even better...
The PRIMARY KEY... You probably have Id as the PK? Perhaps (DeviceId, UnixTimestamp) is unique? (Or can you have two readings for a single device in a single second??) If the pair is unique, get rid of Id completely and have
PRIMARY KEY(DeviceId, UnixTimestamp),
INDEX(DeviceId, PowerLevel)
Notes:
Getting rid of Id saves space, thereby providing a little bit of speed.
When using a secondary index, the execution spends time bouncing between the index's BTree and the data BTree (ordered by the PK). By having PRIMARY KEY(Id), you are guaranteed to do the bouncing. By changing the PK to (DeviceId, UnixTimestamp), the bouncing is avoided. This may double the speed of the query.
(I am not sure the secondary index will ever be used.)
Another (minor) suggestion: Normalize the DeviceId so that it is (perhaps) a 2-byte SMALLINT UNSIGNED (range 0..64K) instead of VARCHAR(20). Even if this entails a JOIN, the query will run a little faster. And a bunch of space is saved.
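The primary key change suggested above could be sketched as follows. This assumes rowID is the current primary key, nothing else references it, and deviceID and UnixTimestamp are NOT NULL; the index name dev_power is made up.

```sql
-- Step 1: check the assumption that (deviceID, UnixTimestamp) is unique.
-- Any rows returned are duplicates that would block the new primary key.
SELECT deviceID, UnixTimestamp, COUNT(*) AS readings
FROM tbl
GROUP BY deviceID, UnixTimestamp
HAVING COUNT(*) > 1
LIMIT 10;

-- Step 2: only if step 1 returns nothing, swap the primary key and add
-- the secondary index in a single table rebuild.
ALTER TABLE tbl
  DROP PRIMARY KEY,
  DROP COLUMN rowID,
  ADD PRIMARY KEY (deviceID, UnixTimestamp),
  ADD INDEX dev_power (deviceID, powerLevel);
```

On a table with 3 million rows the ALTER rebuilds the whole table, so it is worth trying on a copy first.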

Combined Index performance with optional where clause

I have a table with the following columns:
id-> PK
customer_id-> index
store_id-> index
order_date-> index
last_modified-> index
other_columns...
other_columns...
I have three single-column indexes. I also have a customer_id_store_id index, which is a foreign key constraint referencing other tables.
id, customer_id and store_id are char(36) UUIDs. order_date is a datetime and last_modified is a UNIX timestamp.
I want to gain some performance by removing all the indexes and adding one on (customer_id, store_id, order_date). Most queries will have these fields in the WHERE clause, but sometimes store_id will not be needed.
What is the best approach: adding "store_id IS NOT NULL" to the WHERE clause, or creating the index as (customer_id, order_date, store_id)?
I also frequently need to query the table by last_modified field (where clause includes customer_id=, store_id=, last_modified>).
As I only have a single-column index on it and there are hundreds of customers inserting into and updating the table, the index often scans more rows than necessary. Is it better to create another index (customer_id, store_id, last_modified) or leave it as it is? Or should I add this column to the previous index, making it a four-column composite index? But then again, order_date is irrelevant here, and omitting it might result in the index not being used as intended.
The query works fast on customers that don't have many rows possibly using the customer_id index there. But for customers with large amount of data, this isn't optimal. More often I need only few days of data.
Can anyone please advise what's the best index in this scenario.
It is true that lots of single column indexes on a MySQL table are generally considered harmful.
A query with
WHERE customer_id=constant AND store_id=constant AND last_modified>=constant
will be accelerated by an index on (customer_id, store_id, last_modified). Why? The MySQL query planner can random-access the index to the first item it needs to retrieve, then scan the index sequentially. That same index works for
WHERE customer_id=constant AND store_id=constant
AND last_modified>=constant
AND last_modified< constant + INTERVAL 1 DAY
BUT, that index will not be useful for a query with just
WHERE store_id=constant AND last_modified>constant
or
WHERE customer_id=constant AND store_id IS NOT NULL AND last_modified>=constant
For the first of those query patterns you need (store_id, last_modified) to achieve the ability to sequentially scan the index.
The second of those query patterns requires two different range searches. One is something IS NOT NULL. That's a range search because it has to romp through all the non-null values in the column. The second range search is last_modified>=constant. That's a range search, because it starts with the first value of last_modified that meets the given criterion, and scans to the end of the index.
MySQL indexes are B-trees. That means, essentially, that they're sorted into a particular single order. So, an index is best for accelerating queries that require just one range search. So, the second query pattern is inherently hard to satisfy with an index.
A table can have multiple compound indexes designed to satisfy multiple different query patterns. That's usually the strategy for making large tables work well in practical applications. Each index imposes a little bit of performance penalty on updates and inserts. Indexes also take storage space. But storage is very cheap these days.
If you want to use a compound index to search on multiple criteria, these things must be true:
all but one of the criteria must be equality criteria like store_id = constant.
one criterion can be a range-scan criterion like last_modified >= constant or something IS NOT NULL.
the columns in the index must be ordered so that the columns involved in equality criteria all appear first, then the column involved in the range-scan criterion.
you may mention other columns after the range scan criterion. But they make up part of a covering index strategy (beyond the scope of this post).
http://use-the-index-luke.com/ is a good basic intro to the black art of indexing.
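Applied to the question's columns, the rules above might translate into something like the following sketch. The table name orders, the index names, and the literal values are hypothetical.

```sql
-- Equality columns first, then the single range column.
CREATE INDEX cust_store_modified
  ON orders (customer_id, store_id, last_modified);

-- Separate index for queries that filter on store_id without customer_id.
CREATE INDEX store_modified
  ON orders (store_id, last_modified);

-- Two equality criteria plus one range criterion: the key column of the
-- EXPLAIN output is expected to show cust_store_modified.
EXPLAIN
SELECT id
FROM orders
WHERE customer_id = '00000000-0000-0000-0000-000000000001'
  AND store_id    = '00000000-0000-0000-0000-000000000002'
  AND last_modified >= 1700000000;
```

The same pattern with order_date in place of last_modified would cover the order-date queries, at the cost of one more index to maintain on writes.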

What is the most performant table index for a set of history records?

I have a simple history table that I am developing a new lookup for. I am wondering what is the best index (if any) to add to this table so that the lookups are as fast as possible.
The history table is a simple set of records of actions taken. Each action has a type and an action date (and some other attributes). Every day a new set of action records is generated by the system.
The relevant pseudo-schema is:
TABLE history
id int,
type int,
action_date date
...
INDEX
id
...
Note: the table is not indexed on type or action_date.
The new lookup function is intended to retrieve all the records of a specific type that occurred on a specific action date.
My initial inclination is to define a compound key consisting of both the type and the action_date.
However in my case there will be many actions with the same type and date. Further, the actions will be roughly evenly distributed in number each day.
Given all of the above: (a) is an index worthwhile; and (b) if so, what is the preferred index(es)?
I am using MySQL, but I think my question is not specific to this RDBMS.
The first field on index should be the one giving you the smallest dataset for the majority of queries after the condition is applied.
Depending on your business requirements, you may request a specific date or a specific date range (most likely the date range). So the date should be the last field in the index. Most likely you will always have the date condition.
A common answer is to have the (type,date) index, but you should consider just the date index if you ever query more than one type value in the query or if you have just a few types (like less than 5) and they are not evenly distributed.
For example, if type 1 makes up 70% of the table, types 2, 3, 4, ... each make up only a few percent, and you often query type 1, you are better off with a separate date index and a separate type index (for the cases when you query types 2, 3, 4, ...) rather than a compound (type, date) index.
INDEX(type, action_date), regardless of cardinality or distribution of either column. Doing so will minimize the number of 'rows' of the index's BTree that need to be looked at. (Yes, I am disagreeing with Sergiy's Answer.)
Even WHERE type IN (2,3) AND action_date ... can use that index.
For checking against a date range of, say 2 weeks, I recommend this pattern:
AND action_date >= '2016-10-16'
AND action_date < '2016-10-16' + INTERVAL 2 WEEK
A way to see how much "work" is needed for a query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers presented will give you a feel for how many index (or data) rows need to be touched. This makes it easy to see which of two possible queries/indexes works better, even when the table is too small to get reliable timings.
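As an illustration of that comparison, the sketch below runs the same query against two candidate indexes and reads the Handler counters after each run. The index names are hypothetical, and FORCE INDEX is used only to pin each run to one index.

```sql
-- Two candidate indexes on the history table.
CREATE INDEX type_date ON history (type, action_date);
CREATE INDEX date_type ON history (action_date, type);

-- Run 1: force the (type, action_date) index, then read the counters.
FLUSH STATUS;
SELECT COUNT(*) FROM history FORCE INDEX (type_date)
WHERE type IN (2, 3)
  AND action_date >= '2016-10-16'
  AND action_date <  '2016-10-16' + INTERVAL 2 WEEK;
SHOW SESSION STATUS LIKE 'Handler_read%';

-- Run 2: same query with the (action_date, type) index.
FLUSH STATUS;
SELECT COUNT(*) FROM history FORCE INDEX (date_type)
WHERE type IN (2, 3)
  AND action_date >= '2016-10-16'
  AND action_date <  '2016-10-16' + INTERVAL 2 WEEK;
SHOW SESSION STATUS LIKE 'Handler_read%';
```

The run with the smaller Handler_read totals touched fewer index entries, and the other index can then be dropped.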
Yes, an index is worthwhile. Especially if you search for a small subset of the table.
If your search would match 20% or more of the table (approximately), the MySQL optimizer decides that the index is more trouble than it's worth, and it'll do a table-scan even if the index is available.
If you search for one specific type value and one specific date value, an index on (type, date) or an index on (date, type) is a good choice. It doesn't matter much which column you list first.
If you search for multiple values of type or multiple dates, then the order of columns matters. Follow this guide:
The leftmost columns of the index should be the ones on which you do equality comparisons. An equality comparison is one that matches exactly one value (even if that value is found on many rows).
WHERE type = 2 AND date = '2016-10-19' -- both equality
The next column of the index can be part of a range comparison. A range comparison matches multiple values. For example, > or IN( ) or BETWEEN or !=.
WHERE type = 2 AND date > '2016-10-19' -- one equality, one range
Only one such column benefits from an index. If you have range comparisons on multiple columns, only the first column of the index will use the index to support lookups. The subsequent column(s) will have to search through those matching rows "the hard way".
WHERE type IN (2, 3, 4) AND date > '2016-10-19' -- multiple range
If you sometimes search using a range condition on type and equality on date, you'll need to create a second index.
WHERE type IN (2, 3, 4) AND date = '2016-10-19' -- make index on (date, type)
The order of terms in your WHERE clause doesn't matter. The SQL query optimizer will figure that out and reorder them to match the right columns defined in an index.

Creating an index on a timestamp to optimize query

I have a query of the following form:
SELECT * FROM MyTable WHERE Timestamp > [SomeTime] AND Timestamp < [SomeOtherTime]
I would like to optimize this query, and I am thinking about putting an index on timestamp, but am not sure if this would help. Ideally I would like to make timestamp a clustered index, but MySQL does not support clustered indexes, except for primary keys.
MyTable has 4 million+ rows.
Timestamp is actually of type INT.
Once a row has been inserted, it is never changed.
The number of rows with any given Timestamp is on average about 20, but could be as high as 200.
Newly inserted rows have a Timestamp that is greater than most of the existing rows, but could be less than some of the more recent rows.
Would an index on Timestamp help me to optimize this query?
No question about it. Without the index, your query has to look at every row in the table. With the index, the query will be pretty much instantaneous as far as locating the right rows goes. The price you'll pay is a slight performance decrease in inserts; but that really will be slight.
You should definitely use an index. MySQL has no clue what order those timestamps are in, and in order to find a record for a given timestamp (or timestamp range) it needs to look through every single record. And with 4 million of them, that's quite a bit of time! Indexes are your way of telling MySQL about your data -- "I'm going to look at this field quite often, so keep a list of where I can find the records for each value."
Indexes in general are a good idea for regularly queried fields. The main downsides to defining indexes are that they use extra storage space and add a little work to each write, so unless you're really tight on space, you should try to use them. If they don't apply, MySQL will just ignore them anyway.
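A minimal sketch of adding the index and checking that it is used. The index name ts_idx and the two timestamp values are placeholders.

```sql
-- Timestamp is quoted because it is also the name of a MySQL data type.
CREATE INDEX ts_idx ON MyTable (`Timestamp`);

-- With the index in place, the EXPLAIN output should show type "range"
-- and key "ts_idx" for a reasonably narrow time window.
EXPLAIN
SELECT * FROM MyTable
WHERE `Timestamp` > 1536000000 AND `Timestamp` < 1536086400;
```

If the window covers a large share of the table, the optimizer may still prefer a full table scan, which is expected behaviour rather than a sign the index is broken.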
I don't disagree with the importance of indexing to improve select query times, but if you can index on other keys (and form your queries with these indexes), you may not need to index on timestamp.
For example, if you have a table with timestamp, category, and userId, it may be better to create an index on userId instead. In a table with many different users this will reduce considerably the remaining set on which to search the timestamp.
And if I'm not mistaken, the advantage of this would be to avoid the overhead of updating the timestamp index on each insertion. In a table with high insertion rates and highly unique timestamps this could be an important consideration.
I'm struggling with the same problems of indexing based on timestamps and other keys. I still have testing to do so I can put proof behind what I say here. I'll try to postback based on my results.
A scenario for better explanation:
timestamp 99% unique
userId 80% unique
category 25% unique
Indexing on timestamp will quickly reduce query results to 1% the table size
Indexing on userId will quickly reduce query results to 20% the table size
Indexing on category will quickly reduce query results to 75% the table size
Insertion with an index on timestamp will have high overhead **
Although we know that our insertions will have incrementing timestamps, I don't see any discussion of MySQL optimisation based on incremental keys.
Insertion with an index on userId will have reasonably high overhead.
Insertion with an index on category will have reasonably low overhead.
** I'm sorry, I don't know the calculated overhead of insertion with indexing.
If your queries are mainly using this timestamp, you could test this design (enlarging the Primary Key with the timestamp as first part):
CREATE TABLE perf (
  ts INT NOT NULL
, oldPK
, ... other columns
, PRIMARY KEY(ts, oldPK)
, UNIQUE (oldPK)
) ENGINE=InnoDB ;
This will ensure that the queries like the one you posted will be using the clustered (primary) key.
The disadvantage is that your inserts will be a bit slower. Also, if you have other indices on the table, they will use a bit more space (as they will include the 4-bytes wider primary key).
The biggest advantage of such a clustered index is that queries with big range scans, e.g. queries that have to read large parts of the table or the whole table will find the related rows sequentially and in the wanted order (BY timestamp), which will also be useful if you want to group by day or week or month or year.
The old PK can still be used to identify rows by keeping a UNIQUE constraint on it.
You may also want to have a look at TokuDB, a MySQL (and open source) variant that allows multiple clustered indices.