MySQL Multiple column index - mysql

Ok, I have the following MySQL table structure:
CREATE TABLE `creditlog` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`memberId` int(10) unsigned NOT NULL,
`quantity` decimal(10,2) unsigned DEFAULT NULL,
`timeAdded` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`reference` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `memberId` (`memberId`),
KEY `timeAdded` (`timeAdded`));
And I'm querying it like this:
SELECT SUM(quantity) FROM creditlog where timeAdded>'2016-09-01' AND timeAdded<'2016-10-01' AND memberId IN (3,6,8,9,11)
Now, I also add USE INDEX (timeAdded) because, given the number of entries, it is more convenient. EXPLAIN on the above query shows:
type -> range,
key -> timeAdded,
rows -> 921294
extra -> using where
Meanwhile if I use the memberId INDEX it shows:
type -> range,
key -> memberId,
rows -> 1707849
extra -> using where
Now, my question is: is it possible to combine these 2 indexes somehow so they are used together and the query examines fewer rows, since I'll also need to add more conditions (on other columns)?

MySQL almost never uses two indexes in a single query; it is just not cost effective. However, composite indexes are often very efficient. You need this order: INDEX(memberId, timeAdded).
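For the creditlog table above, that index could be added like so (the name member_time is just illustrative):
ALTER TABLE creditlog ADD INDEX member_time (memberId, timeAdded);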
Build the index this way...
First include column(s) that are in the WHERE clause tested with =. (None, in your case.)
Any column(s) with IN.
One 'range', such as <, BETWEEN, etc.
Move on to all the fields of the GROUP BY or ORDER BY. (Not relevant here.)
There are a lot of exceptions and caveats. Some are given in my cookbook.
(Contrary to popular opinion, cardinality is almost never relevant in designing an index.)
Here is a way to compare two indexes (even with a table that is too small to get reliable timings):
FLUSH STATUS;
SELECT SQL_NO_CACHE ...;
SHOW SESSION STATUS LIKE 'Handler%';
(repeat for other query/index)
Smaller numbers almost always indicate the better index.
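For example, to compare the suggested composite index against the existing timeAdded index for the query in the question (FORCE INDEX is used only to steer the optimizer during the test; member_time is the illustrative name from above):
FLUSH STATUS;
SELECT SQL_NO_CACHE SUM(quantity)
    FROM creditlog FORCE INDEX (member_time)
    WHERE memberId IN (3,6,8,9,11)
      AND timeAdded > '2016-09-01' AND timeAdded < '2016-10-01';
SHOW SESSION STATUS LIKE 'Handler%';

FLUSH STATUS;
SELECT SQL_NO_CACHE SUM(quantity)
    FROM creditlog FORCE INDEX (timeAdded)
    WHERE memberId IN (3,6,8,9,11)
      AND timeAdded > '2016-09-01' AND timeAdded < '2016-10-01';
SHOW SESSION STATUS LIKE 'Handler%';
Whichever run touches fewer Handler_read rows is doing less work.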
"timeAdded>'2016-09-01' AND timeAdded<'2016-10-01'" -- That excludes midnight on the first day. I recommend this pattern:
timeAdded >= '2016-09-01'
AND timeAdded < '2016-09-01' + INTERVAL 1 MONTH
That also avoids computing dates.
That smells like a common query. Have you considered building and maintaining summary tables? The equivalent query would probably run 10 times as fast.
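A minimal sketch of such a summary table, with made-up names and a daily grain (how you maintain it - a nightly batch, or INSERT ... ON DUPLICATE KEY UPDATE from the application - is up to you):
CREATE TABLE creditlog_daily (
    memberId  INT UNSIGNED NOT NULL,
    day       DATE NOT NULL,
    total_qty DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (memberId, day)
);

-- The monthly report then reads the much smaller summary table:
SELECT SUM(total_qty)
    FROM creditlog_daily
    WHERE memberId IN (3,6,8,9,11)
      AND day >= '2016-09-01'
      AND day <  '2016-09-01' + INTERVAL 1 MONTH;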

Related

High traffic table, optimal indexes?

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and a time (unix timestamp).
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously finished in under a second could now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation for this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
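Applied to the query in the question, that might look like this (AVG is just one possible choice of aggregate):
SELECT monitor_data_time, AVG(monitor_data_value)
    FROM monitor_data
    WHERE monitor_id = 165
      AND monitor_data_time >= 1484076760
      AND monitor_data_time <= 1487271199
    GROUP BY monitor_data_time
    ORDER BY monitor_data_time;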
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index on (time, id), it would scan over the dates fine, but it would have to skip over any entries where id != 165. It would still have to read all those rows to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
What do you think about the indexes?
The index on (monitor_id, monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that would otherwise need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
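For example (the index name is illustrative):
CREATE INDEX monitor_id_time_value
    ON monitor_data (monitor_id, monitor_data_time, monitor_data_value);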
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Or, equivalently, specify PRIMARY KEY in place of UNIQUE KEY and drop the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation for this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
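With such a uniqueness guarantee in place, a DISTINCT-free version of the query would simply be:
SELECT monitor_data_time, monitor_data_value
    FROM monitor_data
    WHERE monitor_id = 165
      AND monitor_data_time >= 1484076760
      AND monitor_data_time <= 1487271199
    ORDER BY monitor_data_time ASC;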
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of the index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)

Count the number of rows between unix time stamps for each ID

I'm trying to populate some data for a table. The query is being run on a table that contains ~50 million records. The query I'm currently using is below. It counts the number of rows that match the template id and are BETWEEN two unix timestamps:
SELECT COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN '1346904000' AND '1346993271'
AND `template` = '1'
While the query above does work, performance is rather slow while looping through each template, which at times can number in the hundreds. The timestamps are stored as int and are properly indexed. Just to test things out, I tried running the query below, omitting the time_sent restriction:
SELECT COUNT(*) as count FROM `s_log`
WHERE `template` = '1'
As expected, it runs very fast, but is obviously not restricting count results inside the correct time frame. How can I obtain a count for a specific template AND restrict that count BETWEEN two unix timestamps?
EXPLAIN:
1 | SIMPLE | s_log | ref | time_sent,template | template | 4 | const | 71925 | Using where
SHOW CREATE TABLE s_log:
CREATE TABLE `s_log` (
`id` int(255) NOT NULL AUTO_INCREMENT,
`email` varchar(255) NOT NULL,
`time_sent` int(25) NOT NULL,
`template` int(55) NOT NULL,
`key` varchar(255) NOT NULL,
`node_id` int(55) NOT NULL,
`status` varchar(55) NOT NULL,
PRIMARY KEY (`id`),
KEY `email` (`email`),
KEY `time_sent` (`time_sent`),
KEY `template` (`template`),
KEY `node_id` (`node_id`),
KEY `key` (`key`),
KEY `status` (`status`),
KEY `timestamp` (`timestamp`)
) ENGINE=MyISAM AUTO_INCREMENT=2078966 DEFAULT CHARSET=latin1
The best index you can have in this case is a composite one on template + time_sent:
CREATE INDEX template_time_sent ON s_log (template, time_sent)
PS: Also, as long as all the columns in your query are integers, DON'T enclose their values in quotes (in some cases it can lead to issues, at least with older MySQL versions).
First, you have to create an index that has both of your columns together (not separately). Also check your table type; I think it would work well if your table is InnoDB.
And lastly, use your WHERE clause in this fashion:
WHERE `template` = '1' AND `time_sent` BETWEEN '1346904000' AND '1346993271'
What this does is first check whether template is 1; only if it is will the second condition be checked, otherwise it is skipped. This will definitely give you a performance edge.
If you have to call the query for each template maybe it would be faster to get all the information with one query call by using GROUP BY:
SELECT template, COUNT(*) as count FROM `s_log`
WHERE `time_sent` BETWEEN 1346904000 AND 1346993271
GROUP BY template;
It's just a guess that this would be faster and you also would have to redesign your code a bit.
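If you do go this route, a covering index led by time_sent (my own suggestion, not part of the answer above) would let the grouped counts be read from the index alone:
CREATE INDEX time_sent_template ON s_log (time_sent, template);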
You could also try to use InnoDB instead of MyISAM. InnoDB uses a clustered index, which may perform better on large tables. From the MySQL site:
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. (For example, MyISAM uses one file for data rows and another for index records.)
There are some questions on Stack Overflow which discuss the performance of InnoDB versus MyISAM:
Should I use MyISAM or InnoDB Tables for my MySQL Database?
Migrating from MyISAM to InnoDB
MyISAM versus InnoDB

MySQL, multi column indexes

I have two MySQL queries which I use to help me arrange news on my website. All the fields indicated here are numeric (int or tinyint) and all have an index on them. Can somebody help me build the multi-column indexes that will speed up these two queries, please?
SELECT MAX(content_time) AS content_time
FROM cm_data
WHERE content_time < UNIX_TIMESTAMP()
AND page = '1'
AND (content_type = '1' OR content_type = '5')
SELECT content_id
FROM cm_data
WHERE (content_time < UNIX_TIMESTAMP()
OR (hp_time >= UNIX_TIMESTAMP() AND content_time < UNIX_TIMESTAMP()
)
)
AND page = '1'
AND (content_type = '1' OR content_type = '5')
ORDER BY hp_time DESC
, content_time DESC
LIMIT 20
And here is the DB schema:
CREATE TABLE IF NOT EXISTS `cm_data` (
`content_id` mediumint(8) NOT NULL AUTO_INCREMENT,
`content_time` int(11) DEFAULT NULL,
`hp_time` int(11) NOT NULL DEFAULT '0',
`content_type` tinyint(2) DEFAULT NULL,
`page` tinyint(2) NOT NULL DEFAULT '1',
PRIMARY KEY (`content_id`),
KEY `content_time` (`content_time`),
KEY `content_type` (`content_type`),
KEY `page` (`page`)
);
Indexes aren't always the answer (especially where many INSERT statements are to be expected). That being said, there are a couple of things you can do to optimize these queries:
1) Make sure your constraints are properly formed. In the second query, you essentially have Case1 OR (Case2 AND Case1), which can be simplified to just Case1 (boolean algebra). Making sure these are as reduced as possible before the query optimizer looks at them not only reduces the effort required by the optimizer, but also catches cases it can't resolve on its own. (A sketch follows after this list.)
2) Check the order of your constraints. Since it doesn't state anywhere in the documentation that the query optimizer will check the constraints in any specific order for non-unique keys, a little knowledge of optimization can be used to potentially speed up queries. Generally speaking, evaluation of non-integer constraints will be costly (STRINGs, etc). In your case, UNIX_TIMESTAMP() returns an UNSIGNED INT, whereas the field is specified as an INTEGER. This will result in a cast operation for each row, which will be costly. So, if you can reduce the number of times the comparisons must be done (by filtering out rows with simpler constraints first), fewer operations must be performed, leading to a shorter execution time.
2a) MySQL doesn't support indexes on functions, so you could either change the datatype of the field to UNSIGNED INT (preferred if available), or create an additional column, maintained by a trigger, that contains the UNSIGNED INT equivalent of the field, and index that instead. The latter isn't ideal, but could offer some performance increase at the expense of increased table size.
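As a sketch of point 1), the second query's WHERE clause reduces to the same shape as the first query's. Applying the '=', then IN, then range ordering described earlier on this page, a candidate composite index (my suggestion, not part of this answer) would lead with page and content_type:
-- Simplified second query: Case1 OR (Case2 AND Case1) is just Case1.
SELECT content_id
    FROM cm_data
    WHERE page = 1
      AND content_type IN (1, 5)
      AND content_time < UNIX_TIMESTAMP()
    ORDER BY hp_time DESC, content_time DESC
    LIMIT 20;

-- Candidate composite index (name is illustrative); note it cannot help the ORDER BY here.
ALTER TABLE cm_data ADD INDEX page_type_time (page, content_type, content_time);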

MySQL Query Runs forever

I have a table with over 250 million records. Our reporting server regularly queries that table using this kind of query:
SELECT
COUNT(*),
DATE(updated_at) AS date,
COUNT(DISTINCT INT_FIELD)
FROM
TABLE_WITH_250_Million
WHERE
Field1 = 'value in CHAR'
AND field2 = 'VALUE in CHAR'
AND updated_at > '2012-04-27'
AND updated_at < '2012-04-28 00:00:00'
GROUP BY
Field2,
DATE(updated_at)
ORDER BY
date DESC
I have tried to create a BTREE index on the table including Field1, Field2, Field3 DESC in that order, but it's not giving me the right result.
Can anyone help me optimize it? My problem is that I can't change the query, since I don't have access to the code from which the reporting server executes it.
Any help would be really appreciated.
Thanks
Here's my table:
CREATE TABLE backup_jobs (
id int(11) unsigned NOT NULL AUTO_INCREMENT,
backup_profile_id int(11) DEFAULT NULL,
state varchar(32) DEFAULT NULL,
limit int(11) DEFAULT NULL,
file_count int(11) DEFAULT NULL,
byte_count bigint(20) DEFAULT NULL,
created_at datetime DEFAULT NULL,
updated_at datetime DEFAULT NULL,
status_type varchar(32) DEFAULT NULL,
status_param_1 varchar(255) DEFAULT NULL,
status_param_2 varchar(255) DEFAULT NULL,
status_param_3 varchar(255) DEFAULT NULL,
started_at datetime DEFAULT NULL,
PRIMARY KEY (id),
KEY index_backup_jobs_on_state (state),
KEY index_backup_jobs_on_backup_profile_id (backup_profile_id),
KEY index_backup_jobs_created_at (created_at),
KEY idx_backup_jobs_state_updated_at (state,updated_at) USING BTREE,
KEY idx_backup_jobs_state_status_param_1_updated_at (state,status_param_1,updated_at) USING BTREE
) ENGINE=MyISAM AUTO_INCREMENT=508748682 DEFAULT CHARSET=utf8;
Add the int_field into the index:
CREATE INDEX idx_backup_jobs_state_status_param_1_updated_at_backup_profile_id ON backup_jobs (state, status_param_1, updated_at, backup_profile_id)
to make it cover all fields.
This way, table lookups go away (you will see Using index in the plan), which will make your query some 10x faster (your mileage may vary).
Also note that (at least for the single-date range provided) GROUP BY DATE(updated_at) and ORDER BY date DESC are redundant and will only make the query use a temporary table and filesort without any real purpose. Not that you can do much about it, though, if you cannot change the query.
I'm sure that all 250M rows didn't occur in the date range of interest.
The problem is that the between nature of the date check forces a table scan, because you can't know where the date falls.
I'd recommend that you partition the 250M-row table into weeks, months, quarters, or years, and only scan the partitions needed for a given date range. That'll help matters.
If you go down the partition road, you'll need to talk to a MySQL DBA, preferably someone who's familiar with partitioning. It's not for the faint of heart.
http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
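A rough sketch of what monthly RANGE partitioning could look like here (partition names and boundaries are illustrative). Note that MySQL requires every unique key, including the PRIMARY KEY, to contain the partitioning column, so the primary key would have to be widened first and updated_at made NOT NULL:
-- Widen the PK so it includes the partitioning column.
ALTER TABLE backup_jobs
    MODIFY updated_at DATETIME NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (id, updated_at);

-- Monthly range partitions; add more as time goes on.
ALTER TABLE backup_jobs
    PARTITION BY RANGE (TO_DAYS(updated_at)) (
        PARTITION p201203 VALUES LESS THAN (TO_DAYS('2012-04-01')),
        PARTITION p201204 VALUES LESS THAN (TO_DAYS('2012-05-01')),
        PARTITION p_max   VALUES LESS THAN MAXVALUE
    );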
Per your query, you'll have to take the lead here on the smallest granularity. We have no idea what the frequency of activity is, what the Field1 and Field2 status entries are, how far back your data goes, or how many entries would be normal on a given SINGLE date. All that said, I would build my indexes based on the smallest granularity that closely matches your query criteria.
Ex: if your "Field1" has a dozen possible "CHAR" values, and you are applying an "IN" clause, and Field1 is first in your index, it will hit each char for each date and field2 value. 250 million records could force a lot of index paging activity especially based on history. Likewise with your Field2. However, due to your "Group By" clause on Field2 and date updated, I would have ONE of those respectively in the first/second position of the index. Based on historical data, I would even tend to shoot at the following index to have dates as the primary basis, and within that, the secondary criteria.
index ( Updated_At, Field2, Field1, INT_FIELD )
This way, your entire query can be satisfied from the index alone and does not need to go back to the raw data of the actual records. All the fields are right there in the index to pull from. You have a finite date range, so updated_at is qualified right away and in order for the GROUP BY. From there, your CHAR values in Field2 nicely finish the GROUP BY, Field1 qualifies your third criterion (the IN list of char values), and finally INT_FIELD serves the COUNT(DISTINCT).
I don't know how long the index will take to build on 250 million rows, but that is where I would start.

Index for BETWEEN operation in MySql

I have several tables in MySQL in which chronological data are stored. I added a covering index to these tables with the date field at the end. In my queries I'm selecting data for some period using a BETWEEN operation on the date field, so my WHERE statement consists of all the fields from the covering index.
When I run EXPLAIN, the Extra column shows "Using where" - so, as I understand it, the date field isn't being searched in the index. When I select data for a single day - using the "=" operation instead of BETWEEN - "Using where" doesn't appear and everything is searched in the index.
What can I do so that my whole WHERE statement, including the BETWEEN operation, is searched in the index?
UPDATE:
table structure:
CREATE TABLE phones_stat (
id_site int(10) unsigned NOT NULL,
`group` smallint(5) unsigned NOT NULL,
day date NOT NULL,
id_phone mediumint(8) unsigned NOT NULL,
sessions int(10) unsigned NOT NULL,
PRIMARY KEY (id_site,group,day,id_phone) USING BTREE
) ;
query:
SELECT id_phone,
SUM(sessions) AS cnt
FROM phones_stat
WHERE id_site = 25
AND `group` = 1
AND day BETWEEN '2010-01-01' AND '2010-01-31'
GROUP BY id_phone
ORDER BY cnt DESC
How many rows do you have? Sometimes an index is not used if the optimizer deems it unnecessary (for instance, if the number of rows in your table(s) is very small). Could you give us an idea of what your SQL looks like?
You could try hinting your index usage and seeing what you get in EXPLAIN, just to confirm that your index is being overlooked, e.g.
http://dev.mysql.com/doc/refman/5.1/en/optimizer-issues.html
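For instance, forcing the PRIMARY KEY (the only index on the table shown) just to confirm whether the optimizer is overlooking it:
SELECT id_phone,
       SUM(sessions) AS cnt
    FROM phones_stat FORCE INDEX (PRIMARY)
    WHERE id_site = 25
      AND `group` = 1
      AND day BETWEEN '2010-01-01' AND '2010-01-31'
    GROUP BY id_phone
    ORDER BY cnt DESC;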
If you're GROUPing by id_phone, then a more useful index will be one which starts with that, i.e.
... PRIMARY KEY (id_phone, id_site, `group`, day) USING BTREE
If you change the index to that and rerun the query, does it help?