Behavior of InnoDB clustered compound index - mysql

We are running a MySQL/MyISAM database with the following table:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`tm_stamp`,`fk_channel`)
);
The tm_stamp-fk_channel combination must be unique, hence the compound primary key. Now, for reasons that are irrelevant here, the database will be migrated to the InnoDB engine. After some googling, I found out that the primary key will dictate the physical ordering of the data on disk. 90% of the queries currently go as follows:
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Inserts are 99% in order of tm_stamp; the table is storage for a network of dataloggers. It has low millions of rows but is growing steadily. The questions are:
Should the change of storage engine alone result in any significant performance change, better or worse?
Does the order of columns in the index matter with regard to the most popular SELECT? This blog suggests something along those lines.
Thanks to the nature of the clustered index, may we perhaps leave out the ORDER BY clause and gain some performance?

Edit 1:
It appears that changing the primary key from
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
always makes sense, for both MyISAM and InnoDB. See http://sqlfiddle.com/#!2/0aa08/1 for proof this is so.
Original answer:
To determine if changing
PRIMARY KEY (`tm_stamp`,`fk_channel`)
to
PRIMARY KEY (`fk_channel`,`tm_stamp`)
would improve your query's performance, you need to determine which column has the higher cardinality (that is, which column's values are more varied). Running
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
will give you the cardinality of the columns.
So, to answer your question properly we first need to know: What is the typical size of the range between B and C? 60? 3,600? 86,400? More?
For example, let's say that
SELECT COUNT(DISTINCT tm_stamp), COUNT(DISTINCT fk_channel) FROM measurements;
returns 32,768 and 256. 32,768 divided by 256 is 128. This tells us that tm_stamp has 128 unique values for every value of fk_channel.
So if the difference between B and C is usually less than 128, then leave tm_stamp as the first field in the primary key. If 128 or greater, then make fk_channel the first field.
Another question: Does fk_channel need to be an INT (4 billion unique values, half of which are negative)? If not, then changing fk_channel to TINYINT UNSIGNED (if you have 256 unique values), or SMALLINT UNSIGNED (65536 unique values) would save a lot of time and space.
For example, let's say you have at most 256 possible fk_channel values and 65,536 possible value values; then you could change your schema via:
create table measurements_new (
tm_stamp INT UNSIGNED NOT NULL DEFAULT '0',
fk_channel TINYINT UNSIGNED NOT NULL DEFAULT '0', -- remove UNSIGNED if values can be negative
value SMALLINT UNSIGNED DEFAULT NULL, -- remove UNSIGNED if values can be negative
PRIMARY KEY (tm_stamp,fk_channel)
) ENGINE=InnoDB
SELECT
tm_stamp,
fk_channel,
value
FROM
measurements
ORDER BY
tm_stamp,
fk_channel;
RENAME TABLE measurements TO measurements_old, measurements_new TO measurements;
This will store the existing data in the new table in PRIMARY KEY order, which will improve performance somewhat.

Staring at the Query
SELECT value FROM measurements
WHERE fk_channel=A AND tm_stamp>=B and tm_stamp<=C
ORDER BY tm_stamp ASC
Your static value is fk_channel and the moving, ordered value is tm_stamp. This addresses your second question, which seems to be at the heart of the Query's needs.
You would be way better off with the PRIMARY KEY columns reversed:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`)
);
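To make the effect of column order concrete, here is a minimal Python sketch that models a composite index as a sorted list of tuples. It is only a model, not InnoDB itself: the sample data (4 channels, one reading per channel per second) and the helper scan_cost are made up for illustration. It counts how many index entries a range scan has to walk for each key order.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sample: 4 channels, one reading per channel per second for 1000 seconds.
rows = [(ts, ch) for ts in range(1000) for ch in range(4)]


def scan_cost(index, lo, hi, matches):
    """Return (entries walked between the two seek positions, entries that match)."""
    start = bisect_left(index, lo)
    stop = bisect_right(index, hi)
    touched = index[start:stop]
    return len(touched), sum(1 for entry in touched if matches(entry))


channel, b, c = 2, 100, 199  # fk_channel = A, tm_stamp between B and C

# Key order (tm_stamp, fk_channel): the range column leads, so the scan walks
# the entries of every channel inside [b, c] and filters them afterwards.
by_time = sorted(rows)
touched_time, hits_time = scan_cost(
    by_time, (b, -1), (c, float("inf")), lambda e: e[1] == channel
)

# Key order (fk_channel, tm_stamp): the equality column leads, so the wanted
# entries are contiguous and already sorted by tm_stamp.
by_channel = sorted((ch, ts) for ts, ch in rows)
touched_ch, hits_ch = scan_cost(
    by_channel, (channel, b), (channel, c), lambda e: e[0] == channel
)

print("(tm_stamp, fk_channel):", touched_time, "walked,", hits_time, "matched")
print("(fk_channel, tm_stamp):", touched_ch, "walked,", hits_ch, "matched")
```

In this toy data set the time-first order walks four times as many entries as it returns, and that factor grows with the number of channels. The real cost in MySQL also depends on page layout and caching, which this sketch ignores.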
As for the first question, the storage engine dictates what gets cached.
MyISAM caches index pages only in the Key Cache (sized by key_buffer_size)
InnoDB caches data and indexes in the Buffer Pool (sized by innodb_buffer_pool_size)
I wrote about this in the DBA StackExchange
If you remain with MyISAM, you could change the primary key to include the value column:
create table measurements (
`tm_stamp` int(11) NOT NULL DEFAULT '0',
`fk_channel` int(11) NOT NULL DEFAULT '0',
`value` int(11) DEFAULT NULL,
PRIMARY KEY (`fk_channel`,`tm_stamp`,`value`)
) ENGINE=MyISAM;
That way, your Query's data retrieval is strictly from one file at most, the .MYI of the MyISAM table. The table need not be read at all.
If you switch to InnoDB, fk_channel,tm_stamp gets loaded twice into RAM:
Once from InnoDB data page
Once from InnoDB index page

The order of your arguments in the WHERE clause is irrelevant here; the optimizer will pick the best key option (usually a direct equality comparison on an indexed field over a > or < comparison). With your initial example, the best option was the range comparison on tm_stamp, which is not a direct equality check and is therefore sub-par.
However, the order of the columns in the clustered key does matter. If the exact comparison is always on the fk_channel column, I'd change the PK to be:
PRIMARY KEY (`fk_channel`,`tm_stamp`)
Now you've got an index that will benefit from the fk_channel=A in your where clause.
Also, while the storage engine plays some role, I don't think the issue here is between InnoDB and MyISAM.
Finally, I don't think the ORDER BY clause has much bearing on your issue, that's done post query. A group by could affect your performance....

Related

Which index will speed up the query?

In the table of 350 million records, the structure is:
CREATE TABLE `table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`job_id` int(10) unsigned NOT NULL,
`lock` mediumint(6) unsigned DEFAULT '0',
`time` timestamp NULL DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `job_id` (`job_id`),
KEY `lock` (`lock`),
KEY `time` (`time`)
) ENGINE=MyISAM;
What index should I create to speed up the query:
UPDATE `table` SET `lock` = 1 WHERE `lock` = 0 ORDER BY `time` ASC LIMIT 500;
lock is declared to be NULLable. Does this mean that the value is often NULL? If so, then there is a nasty problem in MyISAM (not InnoDB) that may lead to 500 additional fragmentation hits.
When a MyISAM row is updated and it becomes longer, the row will no longer fit where it is. (Now my detailed knowledge gets fuzzy.) The new row will be put somewhere else and/or it will be broken into two parts, with a link between the parts. That implies writes in two places.
As Gordon pointed out, any change to any indexed column, lock in your case, involves a costly index update -- remove a 'row' from one place in the index's BTree and add a row in another place.
Does lock have only values 0 or 1? Then use TINYINT (1 byte), not MEDIUMINT (3 bytes).
You should check MAX(id). If it is clean, id's max will be about 350M (not too close to the limit of 4B). But if there has been any churn, it may be much closer to the limit.
I, too, advocate switching to InnoDB. However your 10GB (data+indexes) will grow to 20-30GB in the conversion.
Are you "locking the oldest unlocked" thingies? Will you then do a select to see what got locked?
If this is too slow, don't do 500 at once, pick a lower number.
With InnoDB, can you avoid locking? Perhaps transactional locking would suffice?
I think we need to see the rest of the environment -- other tables, job "flow", etc. There may be other things we can suggest.
And I second the motion for INDEX(lock, time). But when doing so, DROP the index on just lock as being redundant.
And when converting to InnoDB, do all the index changes in the same ALTER. This will run faster than separate passes.
For this query:
UPDATE `table`
SET `lock` = 1
WHERE `lock` = 0
ORDER BY `time` ASC
LIMIT 500;
The best index is table(lock, time). Do note, however, that the update also needs to update the index, so you should test how well this works in practice. Do not make this a clustered index. That will just slow down the process.
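As a rough illustration of why (lock, time) suits this access pattern, the sketch below uses Python's built-in sqlite3 module as a stand-in. SQLite is not MySQL and its planner differs, so treat this as a demonstration of the principle only. The table name job_queue and the sample data are hypothetical, and a SELECT is used to pick the candidate rows because UPDATE ... ORDER BY ... LIMIT is MySQL syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE job_queue ('
    ' id INTEGER PRIMARY KEY,'
    ' "lock" INTEGER NOT NULL DEFAULT 0,'
    ' "time" INTEGER)'
)
# Equality column first, then the column used for ordering.
conn.execute('CREATE INDEX idx_lock_time ON job_queue ("lock", "time")')

# Hypothetical data: every second row is already locked; later ids are older.
conn.executemany(
    'INSERT INTO job_queue ("lock", "time") VALUES (?, ?)',
    [(i % 2, 1000 - i) for i in range(200)],
)

query = 'SELECT id FROM job_queue WHERE "lock" = 0 ORDER BY "time" ASC LIMIT 5'

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query))
oldest_unlocked = [row[0] for row in conn.execute(query)]

print(plan)
print(oldest_unlocked)
```

With the composite index in place, the plan reported by SQLite searches idx_lock_time and needs no separate sort step, because entries with lock = 0 are already stored in time order. Whether MySQL picks the same plan for the UPDATE should be checked with EXPLAIN on the real table.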

High traffic table, optimal indexes?

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and a time (unix_timestamp).
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously finished in under a second can now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was indeed very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation to this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
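The difference can be checked on a tiny data set. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL (the DISTINCT semantics shown are standard SQL); the sample rows are made up and deliberately include an exact duplicate, which the table definition in the question does not prevent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE monitor_data ("
    " monitor_id INTEGER NOT NULL,"
    " monitor_data_time INTEGER NOT NULL,"
    " monitor_data_value INTEGER)"
)
conn.executemany(
    "INSERT INTO monitor_data VALUES (?, ?, ?)",
    [
        (165, 100, 7),
        (165, 100, 7),  # exact duplicate row
        (165, 100, 9),  # same time, different value
        (165, 200, 5),
    ],
)

# The parentheses change nothing: DISTINCT applies to the whole select list.
with_parens = conn.execute(
    "SELECT DISTINCT(monitor_data_time), monitor_data_value "
    "FROM monitor_data ORDER BY 1, 2"
).fetchall()
without_parens = conn.execute(
    "SELECT DISTINCT monitor_data_time, monitor_data_value "
    "FROM monitor_data ORDER BY 1, 2"
).fetchall()

# One row per time requires GROUP BY plus an aggregate for the value.
per_time = conn.execute(
    "SELECT monitor_data_time, AVG(monitor_data_value) "
    "FROM monitor_data GROUP BY monitor_data_time ORDER BY monitor_data_time"
).fetchall()

print(with_parens)
print(without_parens)
print(per_time)
```

Both DISTINCT forms keep two rows for time 100 because the values differ; only the GROUP BY form collapses them into one row per time.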
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index (time, id), it would do the scan over the time range fine, but it would have to read every entry in that range, including those where id != 165, just to discover that they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
or, equivalently, specify PRIMARY KEY in place of UNIQUE KEY and remove the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation to this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of the index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)

Best way to speed up a query in a innodb table with 100.000.000 rows in Mysql 5.6

I have a Mysql 5.6 table with 70 million rows in it, but it will grow to 100+ million rows or more in a few weeks.
I have a dedicated machine with a humble 500GB disk and 4GB RAM and the innodb_buffer_pool_size is set to 2GB.
The workload is 99% selects and 1% inserts (once a month).
The most important column is descripcion_detallada_producto varchar(300); about 90% of the selects are aimed at it.
My table is:
CREATE TABLE `t1` (
`N_orden` bigint(20) NOT NULL DEFAULT '0',
`Fecha` varchar(15) COLLATE latin1_spanish_ci DEFAULT NULL,
`Ncm` int(11) NOT NULL,
`Origen` int(11) NOT NULL,
`Adquisicion` int(11) NOT NULL,
`Medida_Estadistica` int(11) NOT NULL,
`Unidad_Comercializacion` varchar(30) COLLATE latin1_spanish_ci DEFAULT NULL,
`Descripcion_Detallada_Producto` varchar(300) COLLATE latin1_spanish_ci DEFAULT NULL,
`Cantidad_Estadistica` double DEFAULT NULL,
`Peso_Liquido_Kg` double DEFAULT NULL,
`Valor_Fob` double DEFAULT NULL,
`Valor_Frete` double DEFAULT NULL,
`Valor_Seguro` double DEFAULT NULL,
`Valor_Unidad` double DEFAULT NULL,
`Cantidad` double DEFAULT NULL,
`Valor_Total` double DEFAULT NULL,
PRIMARY KEY (`N_orden`),
KEY `Ncm` (`Ncm`),
KEY `Origen` (`Origen`),
KEY `Adquisicion` (`Adquisicion`),
KEY `Medida_Estadistica` (`Medida_Estadistica`),
KEY `Descripcion_Detallada_Producto` (`Descripcion_Detallada_Producto`),
CONSTRAINT `t1_ibfk_1` FOREIGN KEY (`Ncm`) REFERENCES `ncm` (`Ncm`),
CONSTRAINT `t1_ibfk_2` FOREIGN KEY (`Origen`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_3` FOREIGN KEY (`Adquisicion`) REFERENCES `paises` (`Codigo_Pais`),
CONSTRAINT `t1_ibfk_4` FOREIGN KEY (`Medida_Estadistica`) REFERENCES `medida_estadistica` (`Codigo_Medida_Estadistica`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 COLLATE=latin1_spanish_ci;
My question: Today a SELECT query using LIKE '%whatever%' normally takes 5 to 7 minutes, sometimes more. As far as I understand, the varchar index is only used when the pattern is 'whatever%', but I NEED to be able to search for strings using left and right wildcards without waiting ~7 minutes for each search. How can I do it?
The right way to fix the problem is to look at all the queries being run against the table, and their relative frequency. You've only given us part of one. You didn't even say which field it relates to. Since you do say "The most important column is descripcion_detallada_producto varchar(300) and it is where the selects are aimed at in 90% of the times" I'll assume that you only need to optimize
WHERE descripcion_detallada_producto LIKE '%whatever%'
As Vatev has already said, you probably should be using fulltext searches, which are semantically (and syntactically) different from LIKE predicates. Further, you should be splitting the descripcion_detallada_producto attribute into its own relation to reduce the buffer flushing effects of reading huge rows into memory from disk.
If you are searching for entire words that may be anywhere in a text column, you should consider using fulltext indexes, which are obviously used differently than wildcard searches. If you're unsure how to search your fulltext indexes, you can always get help with that.
Doing a search like the following will not use any of your indexes. Instead, it will scan through all rows of your table data, and you're subjected to disk reads (and any correlated disk fragmentation, which isn't usually a problem because we don't usually scan through tables):
SELECT * FROM t1
WHERE Descripcion_Detallada_Producto LIKE '%whatever%'
The following query would just scan through your index on Descripcion_Detallada_Producto which would act as a "covering" index (notice that the columns in the select make the difference):
SELECT N_orden FROM t1
WHERE Descripcion_Detallada_Producto LIKE '%whatever%'
The advantage in scanning an index instead of the actual table data is that the amount of data that is read as it scans is minimized, and ideally with a large innodb_buffer_pool_size, that index would be in memory, which would avoid disk seeks.
Once you get the N_orden values, then you could retrieve the individual records from the table data.
Additional Info
Consider reducing the size of the columns (bigint to unsigned int for N_orden) and reduce size of Descripcion_Detallada_Producto. Even though VARCHAR only uses up actual bytes (plus length) in the table data, each index entry actually uses the max, so reducing even a VARCHAR column size in an index will improve index scan speed.
In addition, if you have categories, restrict searches to selected categories and create a multi-column index on category+description. The following will only have to scan through a portion of a multi-column index on both category and description by restricting the search to a particular category:
SELECT N_orden FROM t1
WHERE Category = 1
AND Descripcion_Detallada_Producto LIKE '%whatever%'
Finally, consider removing wildcard prefixes. Make the user at least type the beginning of the model number.

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow when looping through each template, of which there can be hundreds at times. The time stamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one on template + time_sent:
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also as long as all your columns in the query are integer DON'T enclose their values in quotes (in some cases it could lead to issues, at least with older mysql versions)
First, you have to create an index that has both of your columns together (not separately). Also check your table type; I think it would work great if your table is InnoDB.
And lastly, use your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
What this does is first check if template is 1; if it is, then it would check the second condition, else skip. This will definitely give you a performance edge.
If you have to call the query for each template maybe it would be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster and you also would have to redesign your code a bit.
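To show that the single grouped query returns the same counts as the per-template loop, here is a small check using Python's built-in sqlite3 module as a stand-in for MySQL. The rows are generated sample data, and the table is reduced to the columns the query touches; timing on the real 50-million-row table would still need to be measured.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE s_log ("
    " id INTEGER PRIMARY KEY,"
    " time_sent INTEGER NOT NULL,"
    " template INTEGER NOT NULL)"
)
conn.execute("CREATE INDEX template_time_sent ON s_log (template, time_sent)")

# Hypothetical data: 120 rows spread over 3 templates, 1000 seconds apart.
start, end = 1346904000, 1346993271
conn.executemany(
    "INSERT INTO s_log (time_sent, template) VALUES (?, ?)",
    [(start + i * 1000, 1 + i % 3) for i in range(120)],
)

# One round trip: counts for every template inside the time window.
grouped = dict(
    conn.execute(
        "SELECT template, COUNT(*) FROM s_log "
        "WHERE time_sent BETWEEN ? AND ? GROUP BY template",
        (start, end),
    )
)

# The original approach: one query per template.
looped = {
    template: conn.execute(
        "SELECT COUNT(*) FROM s_log "
        "WHERE template = ? AND time_sent BETWEEN ? AND ?",
        (template, start, end),
    ).fetchone()[0]
    for template in (1, 2, 3)
}

print(grouped)
print(looped)
```

One difference to keep in mind when redesigning the calling code: the grouped query returns no row at all for a template with zero matches, whereas the loop returns a count of 0.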
You could also try to use InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stackoverflow which discuss the performance between InnoDB and MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB

Which indices should be added to speed up queries on massive InnoDB table?

Here is my table:
CREATE TABLE `letters` (
`a` bigint(20) unsigned NOT NULL,
`b` bigint(20) unsigned NOT NULL,
`c` bigint(20) unsigned NOT NULL,
`d` bigint(20) unsigned NOT NULL,
`e` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8$$
The table will have about 1+ billion rows.
Each column can be queried; each column can be referenced. e.g.:
SELECT [any column] FROM letters WHERE [any / any other column] IN ([subquery or list]);
My question: what indices should I add to speed up any query in the format above? (Also, if possible, please try to describe 'why' it/they should be added so that I can learn from your answer).
Thanks!
-- Extra info: inserts will happen on a fairly regular basis (a few/handful every second) but select queries will happen more frequently.
Since any column can appear in the WHERE clause, you have to add an index for each of the columns, except for column a, which is already the PRIMARY KEY and as such is already indexed.
UPDATE: in the subsequent discussion, Poodlehat pointed out that column e has a low index selectivity, i.e. "The ratio of the number of distinct values in the indexed column / columns to the number of records in the table". For this reason, it's not clear whether adding an index on column e will help or slow down queries. So Lucas will test it experimentally and hopefully share the results with us.
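The selectivity ratio quoted above is easy to compute. The following Python sketch applies the definition to two made-up columns; the function name index_selectivity and the sample data are illustrative only, with the low-selectivity column using 15 distinct values.

```python
def index_selectivity(values):
    """Distinct values divided by total rows; 0.0 for an empty column."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical columns with 15,000 rows each.
low = [i % 15 for i in range(15000)]  # only 15 distinct values
high = list(range(15000))             # every value is unique

print(index_selectivity(low))
print(index_selectivity(high))
```

A ratio close to 1 means an equality lookup narrows the search to very few rows; a ratio close to 0 means each value still matches a large share of the table, so an index on that column alone filters little.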
I think you need to have a unique index on a (or it should be a primary key), definitely indexes on b,c,d (on each). No need for index on e (it won't be used anyway since as you say it has just 15 different values)