Need advice on how to index and optimize a specific MySQL database

I'm trying to optimize my MySQL DB so I can query it as quickly as possible.
It goes like this:
My DB consists of 1 table that has (for now) about 18 million rows - and growing rapidly.
This table has the following columns - idx, time, tag_id, x, y, z.
No column has any null values.
'idx' is an INT(11) column, auto-increment (AI) and the primary key (PK). Right now it's in ascending order.
'time' is a DATETIME column. It's also ascending. About 50% of the 'time' values in the table are distinct (the rest appear twice, or three times at most).
'tag_id' is an INT(11) column. It's not ordered in any way, and there are between 30 and 100 different possible tag_id values spread across the whole DB. It's also a foreign key to another table.
INSERT -
A new row is inserted into the table every 2-3 seconds. 'idx' is calculated by the server (AI). Since the 'time' column represents the time the row was inserted, every new 'time' value will be higher than or equal to the previous row's. All the other column values don't have any order.
SELECT -
here is an example of a typical query:
"select x, y, z, time from table where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456"
So 'time' and 'tag_id' are the only columns that appear in the WHERE clause, and both of them will ALWAYS appear in the WHERE clause of every query. 'x', 'y', 'z' and 'time' will always appear in the SELECT part; 'tag_id' might also appear there sometimes.
The queries will usually seek higher (more recent) times rather than older times, meaning later rows in the table will be searched more.
INDEXES-
Right now 'idx', being the PK, is the clustered ASC index, and 'time' also has a non-clustered ASC index.
That's it. Considering all this data, a typical query returns results in around 30 seconds. I'm trying to lower this time. Any advice??
I'm thinking about changing one or both of the indexes from ASC to DESC (since the higher values are more popular in the search). If I change 'idx' to DESC it will physically reverse the whole table; if I change 'time' to DESC it will reverse the 'time' index tree. But since this is an 18-million-row table, changes like this might take the server a long time, so I want to be sure it's a good idea. The question is: if I reverse the order and a new row is inserted, will the server quickly know to put it at the beginning of the table, or will it search the table for the place every time? And will putting a new row at the beginning of the table mean some kind of data shifting has to be done to the whole table every time?
Or maybe I just need a different indexing technique??
Any ideas you have are very welcome.. thanks!!

select x, y, z, time from table
where date(time) between '2014-08-01' and '2014-10-01' and tag_id = 123456
Putting a column inside a function call like date(time) spoils any chance of using an index for that column. You must use only a bare column for comparison, if you want to use an index.
So if you want to compare it to dates, you should store a DATE column. If you have a DATETIME column, you may have to use a search term like this:
WHERE `time` >= '2014-08-01 00:00:00' AND `time` < '2014-10-02 00:00:00' ...
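So the whole example query could be rewritten like this (a sketch, reusing the table and column names from the question):
select x, y, z, `time` from `table`
where tag_id = 123456
and `time` >= '2014-08-01 00:00:00'
and `time` < '2014-10-02 00:00:00';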
Also, you should use multi-column indexes where you can. Put columns used in equality conditions first, then one column used in range conditions. For more on this rule, see my presentation How to Design Indexes, Really.
You may also benefit from adding columns that are not used for searching, so that the query can retrieve the columns from the index entry alone. Put these columns following the columns used for searching or sorting. This is called an index-only query.
So for this query, your index should be:
ALTER TABLE `this_table` ADD INDEX (tag_id, `time`, x, y, z);
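To verify the index is picked up, you could run EXPLAIN on the rewritten query (a sketch; the key column of the output shows which index the optimizer chose):
EXPLAIN select x, y, z, `time` from `this_table`
where tag_id = 123456
and `time` >= '2014-08-01 00:00:00'
and `time` < '2014-10-02 00:00:00';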
Regarding ASC versus DESC, the syntax supports the option for different direction indexes, but in the two most popular storage engines used in MySQL, InnoDB and MyISAM, there is no difference. Either direction of sorting can use either type of index more or less equally well.

Related

Why is the index used only when forced, but not by default?

I have around 420 million records in my table. There is only one index, on column colC of user_table. The query below returns around 1.5 million records based on colC, but somehow the index is not used and the query takes 20 to 25 minutes to return the records.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
But when I force the index, it gets used and returns the records in only 2 minutes. My question is: why is MySQL not using the index by default, when fetch time is much lower with the index? I have recreated the index along with a repair, but nothing makes it get used by default.
Another observation, for information: the same query (without force index) works for previous months (having the same volume of data).
Update: the details asked for by Evert:
CREATE TABLE USER_TABLE (
  id bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  COLA varchar(10) DEFAULT NULL,
  COLB int(11) DEFAULT NULL,
  COLC datetime DEFAULT NULL,
  ....
  PRIMARY KEY (id),
  KEY colA (COLA),
  KEY colB (COLB),
  KEY colC (COLC)
) ENGINE=MyISAM AUTO_INCREMENT=2328036072 DEFAULT CHARSET=latin1
For better performance you could try a composite index based on the columns involved in your WHERE clause,
and try to change the IN clause into an inner join.
Assuming your IN clause content is a set of fixed values, you could use a UNION (or a new table with the values you need).
E.g., using the UNION (you can do something similar if the IN clause is a subquery):
select user_table.colA, ColB, count(*) as count
from user_table
INNER JOIN (
  select 'FIXED1' colA
  union
  select 'FIXED2'
  ....
  union
  select 'FIXEDX'
) t on t.colA = user_table.colA
where colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
and ColB = 17
group by colA, ColB;
You could also add a composite index on table user_table on the columns colA, colB, colC.
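A minimal sketch of that index (the index name is hypothetical; table and column names are from the CREATE TABLE above):
ALTER TABLE USER_TABLE ADD INDEX idx_a_b_c (COLA, COLB, COLC);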
Regarding the elements the MySQL query optimizer uses to decide which index to use, there are several aspects, and the optimizer assigns a cost to each of them.
Among what you should take into consideration:
the columns involved in the WHERE clause
the size of the tables (and, though not in your case, the size of the tables in a join)
an estimation of how many rows will be fetched (to decide whether to use an index, or simply scan the table)
whether the datatypes match between the columns in the join and in the WHERE clause
the use of functions or datatype conversions, including collation mismatches
the size of the index
the cardinality of the index
A cost is evaluated for each of these options, and that leads to the choice of index.
In your case, colC being a datetime could imply a datatype conversion (with respect to the literal values, which are strings), and because of this the index is not chosen.
It is also for this reason that I suggested a composite index whose leftmost columns relate to non-converted values.
Indexes get used as best as possible. I can't guarantee it, but it SOUNDS like the engine is building a temporary index based on A and B to qualify the static values in your query. For 420+ million records, that alone is the time it takes to build such a temporary index. Forcing the index yourself is what saves that time.
Now, if you (and others) don't quite understand indexes: an index is a way of pre-grouping data to help the optimizer. When you have GROUP BY conditions, those components, where practical, should be part of the index, and TYPICALLY would be part of the criteria, as you have in your query.
select colA ,ColB , count(*) as count
from user_table
where colC >='2019-09-01 00:00:00'
and colC<'2019-09-30 23:59:59'
and colA in ("some static value")
and ColB in (17)
group by colA ,ColB;
Now, let's look at your index, which is only available on ColC. Assume for this scenario that the records are grouped by day. Pretend each INDEX (single or compound) is stored in its own room. You have an index on just the date column C. In the room, you have 30 boxes (representing Sept 1 to Sept 30), not counting all the boxes for other days. Now you have to go through each day's box and look for all entries that have the ColA and ColB values you want. The stuff in the box is not sorted, so you have to look at every record. Now do this for all 30 days of September.
Now, simulate the NEXT index, with its boxes stored in another room. This room is a compound index based on (in this order, to help optimize your query) columns A, B and C.
So now, you could have 100 entries for "A". You only care about ColA = "some static value", so you grab that one box.
Now, you open that box and see a bunch of smaller boxes... Oh... these are all the individual "Column B" records. The top of each box shows its individual "B" value, so you find the one box with the value 17.
Finally, you open box B and look inside. Wow... the entries are all nicely sorted for you by date. So now you scroll quickly to find Sept 1 and pull all the entries up to Sept 30 that you are looking for.
Quickly getting to the source via an optimized index will help you in the long run. Having an index on
(colA, colB, colC)
will significantly help your query performance.
One final note. Since you are only querying for a single "A" and a single "B" value, you would get only a single row back and would not need a GROUP BY clause (in this case).
Hope this helps you and others better understand how indexes work, individual vs compound (multi-column).
One additional advantage of a multi-column index: in a case such as this, where all the queried columns are part of the index, the database does not have to go to the raw data pages to confirm the other columns. You are looking only at the values of A, B and C, and all these fields are part of the index, so it does not have to go back to the raw data pages where the actual data is stored to confirm a row's qualification to be returned.
With a single-column index such as yours, it uses the index to find which records qualify (by date in this case). Then, for each record, it has to go to the raw data page holding the entire record (which could have 50 columns) just to confirm whether the A and B columns qualify, and discard the row if not. Then back to the index by date, then back to the raw data page to confirm its A and B... You can probably see how much more time all this going back and forth takes.
The second index already has "A", "B" and the pre-sorted date range of "C". Done, without having to go to the raw data pages.
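A quick way to confirm the covering behavior (a sketch, assuming the compound index above has been created) is EXPLAIN, which reports "Using index" in its Extra column when a query is satisfied from the index alone:
EXPLAIN select colA, ColB, count(*) as count
from user_table
where colA in ("some static value")
and ColB in (17)
and colC >= '2019-09-01 00:00:00'
and colC < '2019-09-30 23:59:59'
group by colA, ColB;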

Optimizing an index in a large MySQL table

I have a large table (about 3 million records) that includes primarily these fields: rowID (int), a deviceID (varchar(20)), a UnixTimestamp in a format like 1536169459 (int(10)), powerLevel which has integers that range between 30 and 90 (smallint(6)).
I'm looking to pull out records within a certain time range (using UnixTimestamp) for a particular deviceID and with a powerLevel above a certain number. With over 3 million records, it takes a while. Is there a way to create an index that will optimize for this?
Create an index over:
DeviceId,
PowerLevel,
UnixTimestamp
When selecting, it will first narrow in on the set of records for your given device, then narrow that down to only the records in the correct PowerLevel range, and lastly, for each PowerLevel, narrow in on the correct records by UnixTimestamp.
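In DDL form that would be something like this (a sketch; the index name is hypothetical, and the table name tbl follows the next answer, since the question doesn't give one):
ALTER TABLE tbl ADD INDEX idx_dev_pwr_ts (deviceID, powerLevel, UnixTimestamp);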
If I understand you correctly, you hope to speed up this sort of query.
SELECT something
FROM tbl
WHERE deviceID = constant
AND start <= UnixTimestamp
AND UnixTimestamp < end
AND Power >= constant
You have one constant criterion (deviceID) and two range critera (UnixTimestamp and Power). MySQL's indexes are BTREE (think sorted in order), and MySQL can only do one index range scan per SELECT.
So, you should probably choose an index on (deviceID, UnixTimestamp, Power). To satisfy the query, MySQL will random-access the index to the entries for deviceID, then further random access to the first row meeting the UnixTimestamp start criterion.
It will then scan the index sequentially, and use the Power information from each index entry to decide whether it should choose each row.
You could also use (deviceID, Power, UnixTimestamp). But in this case MySQL will find the first entry matching the device and power criteria, then scan the index through entries with all timestamps to see which rows it should choose.
Your performance objective is to get MySQL to scan the fewest possible index entries, so it seems very likely the (deviceID, UnixTimestamp, Power) choice is superior. The index column on UnixTimestamp is probably more selective than the one on Power. (That's my guess.)
ALTER TABLE tbl ADD INDEX tbl_dev_ts_pwr (deviceID, UnixTimestamp, Power);
Look at Bill Karwin's tutorials. Also look at Markus Winand's https://use-the-index-luke.com
The suggested 3-column indexes are only partially useful. The Optimizer will use the first 2 columns, but ignore the third.
Better:
INDEX(DeviceId, PowerLevel),
INDEX(DeviceId, UnixTimestamp)
Why?
The optimizer will pick between those two based on which seems to be more selective. If the time range is 'narrow', then the second index will be used; if there are not many rows with the desired PowerLevel, then the first index will be used.
Even better...
The PRIMARY KEY... You probably have Id as the PK? Perhaps (DeviceId, UnixTimestamp) is unique? (Or can you have two readings for a single device in a single second??) If the pair is unique, get rid of Id completely and have
PRIMARY KEY(DeviceId, UnixTimestamp),
INDEX(DeviceId, PowerLevel)
Notes:
Getting rid of Id saves space, thereby providing a little bit of speed.
When using a secondary index, the execution spends time bouncing between the index's BTree and the data BTree (ordered by the PK). With PRIMARY KEY(Id), you are guaranteed to do that bouncing. By changing the PK as above, the bouncing is avoided. This may double the speed of the query.
(I am not sure the secondary index will ever be used.)
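A sketch of the rebuilt table under the uniqueness assumption (column types are from the question; the table name tbl follows the earlier answer, and the index name is hypothetical):
CREATE TABLE tbl (
  deviceID VARCHAR(20) NOT NULL,
  UnixTimestamp INT UNSIGNED NOT NULL,
  powerLevel SMALLINT NOT NULL,
  PRIMARY KEY (deviceID, UnixTimestamp),
  INDEX idx_dev_pwr (deviceID, powerLevel)
) ENGINE=InnoDB;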
Another (minor) suggestion: normalize the DeviceId so that it is (perhaps) a 2-byte SMALLINT UNSIGNED (range 0..64K) instead of VARCHAR(20). Even if this entails a JOIN, the query will run a little faster, and a bunch of space is saved.
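A minimal sketch of that normalization (all names here are hypothetical):
CREATE TABLE devices (
  device_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  device_code VARCHAR(20) NOT NULL,
  PRIMARY KEY (device_id),
  UNIQUE KEY (device_code)
) ENGINE=InnoDB;
-- tbl.deviceID then becomes a SMALLINT UNSIGNED that joins to devices.device_id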

MySQL: composite index fulltext+btree?

I want a query that does a fulltext search on one field and then a sort on a different field (imagine searching some text document and order by publication date). The table has about 17M rows and they are more or less uniformly distributed in dates. This is to be used in a webapp request/response cycle, so the query has to finish in at most 200ms.
Schematically:
SELECT * FROM table WHERE MATCH(text) AGAINST('query') ORDER BY date DESC LIMIT 10;
One possibility is having a fulltext index on the text field and a btree on the publication date:
ALTER TABLE table ADD FULLTEXT ft_index(text);
CREATE INDEX date_index ON table (date);
This doesn't work very well in my case. What happens is that MySQL evaluates two execution paths. One is using the FULLTEXT index to find the relevant rows, and once they are selected, using a FILESORT to sort those rows. The second is using the BTREE index to sort the entire table and then looking for matches using a FULL TABLE SCAN. Both are bad. In my case MySQL chooses the former. The problem is that the first step can select some 30k results which it then has to sort, which means the entire query might take on the order of 10 seconds.
So I was thinking: do composite indexes of FULLTEXT+BTREE exist? If you know how a FULLTEXT index works, it first tokenizes the column you're indexing and then builds an index for the tokens. It seems reasonable to me to imagine a composite index such that the second index is a BTREE in dates for each token. Does this exist in MySQL and if so what's the syntax?
BONUS QUESTION: If it doesn't exist in MySQL, would PostgreSQL perform better in this situation?
Use IN BOOLEAN MODE.
The date index is not useful. There is no way to combine the two indexes.
Beware: if a user searches for something that shows up in 30K rows, the query will be slow. There is no straightforward way around it.
I suspect you have a TEXT column in the table? If so, there is hope. Instead of blindly doing SELECT *, let's first find the ids and get the LIMIT applied, then do the *.
SELECT a.*
FROM tbl AS a
JOIN ( SELECT date, id
FROM tbl
WHERE MATCH(...) AGAINST (...)
ORDER BY date DESC
LIMIT 10 ) AS x
USING(date, id)
ORDER BY date DESC;
Together with
PRIMARY KEY(date, id),
INDEX(id),
FULLTEXT(...)
This formulation and indexing should work like this:
Use FULLTEXT to find 30K rows, deliver the PK.
With the PK, sort 30K rows by date.
Pick the last 10, delivering date, id
Reach back into the table 10 times using the PK.
Sort again. (Yeah, this is necessary.)
More (Responding to a plethora of Comments):
The goal behind my reformulation is to avoid fetching all columns of 30K rows. Instead, it fetches only the PRIMARY KEY, then whittles that down to 10, then fetches * only 10 rows. Much less stuff shoveled around.
Concerning COUNT on an InnoDB table:
INDEX(col) makes it so that an index scan works for SELECT COUNT(*) or SELECT COUNT(col) without a WHERE.
Without INDEX(col), SELECT COUNT(*) will use the "smallest" index, but SELECT COUNT(col) will need a table scan.
A table scan is usually slower than an index scan.
Be careful of timing -- It is significantly affected by whether the index and/or table is already cached in RAM.
Another thing about FULLTEXT is the + in front of words -- it says each word must exist, else there is no match. This may cut down on the 30K.
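For example (a sketch; the search terms are placeholders):
SELECT date, id
FROM tbl
WHERE MATCH(text) AGAINST('+word1 +word2' IN BOOLEAN MODE)
ORDER BY date DESC
LIMIT 10;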
The FULLTEXT index will deliver the date, id pairs in random order, not PK order. Anyway, it is 'wrong' to assume any ordering, hence it is 'right' to add the ORDER BY and then let the Optimizer toss it if it knows it is redundant. And sometimes the Optimizer can take advantage of the ORDER BY (not in your case).
Removing just the ORDER BY, in many cases, makes a query run much faster. This is because it avoids fetching, say, 30K rows and sorting them. Instead it simply delivers "any" 10 rows.
(I have no experience with Postgres, so I cannot address that question.)

Fastest result when checking date range

User will select a date e.g. 06-MAR-2017 and I need to retrieve hundred thousand of records for date earlier than 06-MAR-2017 (but it could vary depends on user selection).
From the above case, I am using this query:
SELECT col from table_a where DATE_FORMAT(mydate,'%Y%m%d') < '20170306'
I feel that it is kind of slow. Is there any faster way to get date results like this?
With 100,000 records to read, the DBMS may decide to read the table record for record (full table scan) and there wouldn't be much you could do.
If on the other hand the table contains billions of records, so 100,000 would just be a small part, then the DBMS may decide to use an index instead.
In any case, you should at least give the DBMS the opportunity to select via an index. This means: create an index first (if one doesn't exist yet).
You can create an index on the date column alone:
create index idx on table_a (mydate);
or even provide a covering index that contains the other columns used in the query, too:
create index idx on table_a (mydate, col);
Then write your query such that the date column is accessed directly. You have no index on DATE_FORMAT(mydate,'%Y%m%d'), so above indexes don't help with your original query. You'd need a query that looks up the date itself:
select col from table_a where mydate < date '2017-03-06';
Whether the DBMS then uses the index or not is still up to the DBMS. It will try to use the fastest approach, which very well can still be the full table scan.
If you make a function call on any column on the left side of a comparison, MySQL will do a full table scan.
The fastest method is to have an index created on mydate, and to make the right side ('20170306') the same datatype as the column (and the index).

What is the most performant table index for a set of history records?

I have a simple history table that I am developing a new lookup for. I am wondering what is the best index (if any) to add to this table so that the lookups are as fast as possible.
The history table is a simple set of records of actions taken. Each action has a type and an action date (and some other attributes). Every day a new set of action records is generated by the system.
The relevant pseudo-schema is:
TABLE history
id int,
type int,
action_date date
...
INDEX
id
...
Note: the table is not indexed on type or action_date.
The new lookup function is intended to retrieve all the records of a specific type that occurred on a specific action date.
My initial inclination is to define a compound key consisting of both the type and the action_date.
However in my case there will be many actions with the same type and date. Further, the actions will be roughly evenly distributed in number each day.
Given all of the above: (a) is an index worthwhile; and (b) if so, what is the preferred index(es)?
I am using MySQL, but I think my question is not specific to this RDBMS.
The first field of the index should be the one giving you the smallest dataset for the majority of queries, after the condition is applied.
Depending on your business requirements, you may request a specific date or a date range (most likely the date range), so the date should be the last field in the index. Most likely you will always have the date condition.
A common answer is the (type, date) index, but you should consider just the date index if you ever query more than one type value in a query, or if you have just a few types (say, fewer than 5) and they are not evenly distributed.
For example, if type 1 makes up 70% of the table, types 2, 3, 4, ... each make up less than a few percent, and you often query type 1, you are better off with a separate date index and a type index (for the cases when you query types 2, 3, 4), not a compound (type, date) index.
INDEX(type, action_date), regardless of cardinality or distribution of either column. Doing so will minimize the number of 'rows' of the index's BTree that need to be looked at. (Yes, I am disagreeing with Sergiy's answer.)
Even WHERE type IN (2,3) AND action_date ... can use that index.
For checking against a date range of, say 2 weeks, I recommend this pattern:
AND action_date >= '2016-10-16'
AND action_date < '2016-10-16' + INTERVAL 2 WEEK
A way to see how much "work" is needed for a query:
FLUSH STATUS;
SELECT ...;
SHOW SESSION STATUS LIKE 'Handler%';
The numbers presented will give you a feel for how many index (or data) rows need to be touched. This makes it easy to see which of two possible queries/indexes works better, even when the table is too small to get reliable timings.
Yes, an index is worthwhile. Especially if you search for a small subset of the table.
If your search would match 20% or more of the table (approximately), the MySQL optimizer decides that the index is more trouble than it's worth, and it'll do a table-scan even if the index is available.
If you search for one specific type value and one specific date value, an index on (type, date) or an index on (date, type) is a good choice. It doesn't matter much which column you list first.
If you search for multiple values of type or multiple dates, then the order of columns matters. Follow this guide:
The leftmost columns of the index should be the ones on which you do equality comparisons. An equality comparison is one that matches exactly one value (even if that value is found on many rows).
WHERE type = 2 AND date = '2016-10-19' -- both equality
The next column of the index can be part of a range comparison. A range comparison matches multiple values. For example, > or IN( ) or BETWEEN or !=.
WHERE type = 2 AND date > '2016-10-19' -- one equality, one range
Only one such column benefits from an index. If you have range comparisons on multiple columns, only the first column of the index will use the index to support lookups. The subsequent column(s) will have to search through those matching rows "the hard way".
WHERE type IN (2, 3, 4) AND date > '2016-10-19' -- multiple range
If you sometimes search using a range condition on type and equality on date, you'll need to create a second index.
WHERE type IN (2, 3, 4) AND date = '2016-10-19' -- make index on (date, type)
The order of terms in your WHERE clause doesn't matter. The SQL query optimizer will figure that out and reorder them to match the right columns defined in an index.
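Expressed as DDL for the pseudo-schema above (a sketch; the index names are hypothetical):
ALTER TABLE history ADD INDEX idx_type_date (type, action_date);
ALTER TABLE history ADD INDEX idx_date_type (action_date, type);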