Optimize indexes for a particular query in MySQL

I have a fairly simple query that is taking about 14 seconds to complete and I would like to speed it up. I think I have the correct indexes in place, but I'm not sure...
Here is the query
SELECT *
FROM opportunities
WHERE cid = 7785
AND STATUS != 4
AND otype != 200
AND links > 0
AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;
Here is the table schema
CREATE TABLE `opportunities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`cid` int(11) NOT NULL,
`url` varchar(900) CHARACTER SET utf8 NOT NULL,
`status` tinyint(4) NOT NULL,
`links` int(11) NOT NULL,
`otype` int(11) NOT NULL,
`reserved` tinyint(4) NOT NULL,
`ontopic` varchar(3) CHARACTER SET utf8 NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `cid` (`cid`,`url`),
KEY `cid1` (`cid`),
KEY `url` (`url`),
KEY `otype` (`otype`),
KEY `reserved` (`reserved`),
KEY `ontopic` (`ontopic`),
KEY `status` (`status`),
KEY `links` (`links`),
KEY `ontopic_links` (`ontopic`,`links`),
KEY `cid_status_otype_links_ontopic` (`cid`,`status`,`otype`,`links`,`ontopic`)
) ENGINE=InnoDB AUTO_INCREMENT=13022832 DEFAULT CHARSET=latin1
Here is the result of the EXPLAIN command
id: 1
select_type: Simple
table: opportunities
partitions: null
type: range
possible_keys: cid,cid1,otype,ontopic,status,links,ontopic_links,cid_status_otype_links_ontopic
key: links
keylen: 4
ref: null
rows: 1531552
filtered: 0.33
Extra: Using index condition; Using where
Thoughts / Questions
Am I reading it correctly that it is using the "links" key to do the query? Why wouldn't it use a more complete index, like the cid_status_otype_links_ontopic which covers all the conditions of my query?
Thanks in advance!
As requested
There are 30,961 results that match the query when you remove the LIMIT 0,100. Interestingly, the "count()" command returns almost instantaneously.

It's a funny thing about inequality comparisons: they count as range conditions.
That is, equality matches one value, but anything other than equality (!=, >, <, IN, BETWEEN) can match multiple values.
Because a range condition matches multiple values, the index can only be used up to and including the first column that appears in a range condition. You'd think that your index cid_status_otype_links_ontopic has all the columns mentioned in conditions of your query, but only the first two will be used. The first because you have an equality comparison for cid. The second because the next column, status, is used in an inequality comparison, and that's where it stops using columns from the index.*
Evidence: if you can force that index to be used, you should see the key_len field of the EXPLAIN result show only 5, which is the size of cid (4 bytes) + status (1 byte).
The MySQL optimizer apparently has predicted that it would be more beneficial to use your links index, because that allows it to access the rows in index order, which is the same as the sort order you requested with your ORDER BY.
Evidence: you don't see "Using filesort" in your EXPLAIN notes.
Is that really better than using one of the other indexes? Maybe, maybe not. The optimizer's predictions aren't always perfect.
You can use an index hint to override the optimizer's choice:
SELECT * FROM opportunities USE INDEX (cid_status_otype_links_ontopic) WHERE ...
Try that out, do the EXPLAIN of that query and compare it to your other EXPLAIN. Then execute both queries and see which is reliably faster.
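For reference, a minimal sketch of the hinted query written out in full, using the query and index name from the question. Whether it is actually faster is something only your own EXPLAIN output and timings can show.

```sql
-- Same query as in the question, with an index hint added.
-- USE INDEX restricts the optimizer to the named index (it may still pick a table scan);
-- FORCE INDEX is the stronger variant if USE INDEX is ignored.
EXPLAIN
SELECT *
FROM opportunities USE INDEX (cid_status_otype_links_ontopic)
WHERE cid = 7785
  AND status != 4
  AND otype != 200
  AND links > 0
  AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;
```

In the EXPLAIN result, check the key and key_len columns, and whether "Using filesort" now appears in Extra.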
(* Actually, I have to add a footnote about the index column usage. MySQL 5.6 and later can do a little bit better than just the two columns, when you see the note "Using Index Condition" in the EXPLAIN. But it's not quite the same. You can read more about that here: https://dev.mysql.com/doc/refman/5.6/en/index-condition-pushdown-optimization.html)

What you have must plow through all of the rows, using your 5-column index, then sort the results and deliver 100 rows.
The only index likely to be useful is INDEX(cid, links). This is because cid is the only column being tested with =, then having links might be useful for the ORDER BY and LIMIT. There is still the risk that the != tests will require filtering a lot of rows.
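A sketch of how that index could be added and checked. The index name cid_links is just an illustrative choice, and building an index on a table of this size (about 13 million rows) may take a while.

```sql
-- Hypothetical index name; columns as suggested above: equality column first, then the ORDER BY column.
ALTER TABLE opportunities ADD INDEX cid_links (cid, links);

-- Re-check the plan: ideally key = cid_links and no "Using filesort" in Extra.
EXPLAIN
SELECT *
FROM opportunities
WHERE cid = 7785
  AND status != 4
  AND otype != 200
  AND links > 0
  AND ontopic != 'F'
ORDER BY links DESC
LIMIT 0, 100;
```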
Are status and otype multi-valued? If either has only 2 values, then turning the != into = and adding it to the index would be beneficial.
Do you really need all the columns (SELECT *)? If not, and if you don't need any big columns (url), then you could go with a 'covering' index.
More on writing indexes.

Related

Order By causing my query to run really slow

I have an sql query as follows
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
order by reported_at desc
limit 1;
This query at the moment takes 313.24 secs to run.
If I remove the order by so the query is
select *
from incidents
where remote_ip = '192.168.1.1' and is_infringement = 1
then it only takes 0.117 secs to run.
The reported_at column is indexed.
So, two questions: first, why is it taking so long with this ORDER BY clause, and second, how can I speed it up?
EDIT: In response to the questions below here is the output when using explain:
'1', 'SIMPLE', 'incidents', 'index', 'uniqueReportIndex,idx_incidents_remote_ip', 'incidentsReportedAt', '4', NULL, '1044', '100.00', 'Using where'
And the table create statement:
CREATE TABLE `incidents` (
`id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
`incident_ip_id` int(10) unsigned DEFAULT NULL,
`remote_id` bigint(20) DEFAULT NULL,
`remote_ip` char(32) NOT NULL,
`is_infringement` tinyint(1) NOT NULL DEFAULT '0',
`messageBody` text,
`reported_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT 'Formerly : created_datetime',
`created_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
`updated_at` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
PRIMARY KEY (`id`),
UNIQUE KEY `uniqueReportIndex` (`remote_ip`,`host_id_1`,`licence_feature`,`app_end`),
UNIQUE KEY `uniqueRemoteIncidentId` (`remote_id`),
KEY `incident_ip_id` (`incident_ip_id`),
KEY `id` (`id`),
KEY `incidentsReportedAt` (`reported_at`),
KEY `idx_incidents_remote_ip` (`remote_ip`)
)
Note: I have omitted some of the non-relevant fields, so there are more indexes than fields, but you can safely assume the fields for all the indexes are in the table.
The output of EXPLAIN reveals that, because of the ORDER BY clause, MySQL decides to use the incidentsReportedAt index. It reads each row from the table data in the order provided by the index and checks the WHERE conditions on it. This requires reading a lot of information from the table data, information that is scattered through the entire table. Not a good workflow.
Update
The OP created an index on columns reported_at and remote_ip (as suggested in the original answer, see below) and the execution time went down from 313 to 133 seconds. An improvement, but not enough. I think the cause of this still large execution time is the access to table data for each row to verify the is_infringement = 1 part of the WHERE clause, but even adding it to the index won't help very much.
The OP says in a comment:
Ok after further research and changing the index to be the other way round (remote_ip, reported_at) the query is now super fast (0.083 sec).
This index is better, indeed, because the remote_ip = '192.168.1.1' condition filters out a lot of rows. The same effect can be achieved using the existing uniqueReportIndex index. It is possible that the original index on reported_at fooled MySQL into thinking it is better to use it to check the rows in the order required by ORDER BY instead of filtering first and sorting at the end.
I think MySQL uses the new index on (remote_ip, reported_at) for filtering (WHERE remote_ip = '192.168.1.1') and for sorting (ORDER BY reported_at DESC). The WHERE condition provides a small list of candidate rows that are easily identified and also sorted using this index.
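As a sketch, the index the OP ended up with could be created like this. The index name is hypothetical; the columns and the query come from the question.

```sql
-- Equality column first (remote_ip), then the ORDER BY column (reported_at).
ALTER TABLE incidents
  ADD INDEX idx_incidents_remote_ip_reported_at (remote_ip, reported_at);

-- Check that the new index is chosen and that no filesort is needed.
EXPLAIN
SELECT *
FROM incidents
WHERE remote_ip = '192.168.1.1' AND is_infringement = 1
ORDER BY reported_at DESC
LIMIT 1;
```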
The original answer follows.
The advice it provides is not correct but it helped the OP find the correct solution.
Create an index on columns reported_at and remote_ip in this order,
then see what EXPLAIN says and how the query performs. It should work faster.
You can even create the new index on columns reported_at, remote_ip and is_infringement (the order of columns in the index is very important).
The index on three columns helps MySQL identify the rows without the need to read the table data (because all the columns from WHERE and ORDER BY clauses are in the index). It needs to read the table data only for the rows it returns because of SELECT *.
After you create the new index (either on two or three columns), remove the old index incidentsReportedAt. It is not needed any more; it uses disk and memory space and takes time to be updated but it is not used. The new index (that has the reported_at column on the first position) will be used instead.
The index on two columns requires more reads of the table data for the is_infringement = 1 condition. The query probably runs a little slower than with the three-column index. On the other hand, there is a little gain on table updates and disk and memory space usage.
The decision to index on two or three columns depends on how often the query posted in the question runs and what it serves (visitors, admins, cron jobs etc).

High traffic table, optimal indexes?

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high traffic table with potentially thousands of rows every minute. Each row belongs to a monitor and contains a value and time (unix_timestamp)
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously completed in under a second could now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was indeed very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN on the query above looks like this:
id;select_type;table;type;possible_keys;key;key_len;ref;rows;Extra
1;SIMPLE;md;range;monitor_id_data_time,monitor_data_time;monitor_id_data_time;8;\N;149799;Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation to this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle is: Start with the '=', then move on to the 'range'. More discussion here.
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast... Given an index (time, id), it would do the scan over the dates fine, but it would have to skip over any entries where id != 165, and it would have to read all those entries to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)
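If the table already exists, the PRIMARY KEY recommended above could be applied roughly as follows. This is a sketch that assumes there are no duplicate (monitor_id, monitor_data_time) pairs; if there are, the statement fails with a duplicate-entry error and the duplicates have to be cleaned up first.

```sql
-- The existing secondary index on the same two columns becomes redundant
-- once they form the clustered PRIMARY KEY, so it is dropped in the same statement.
ALTER TABLE monitor_data
  DROP INDEX monitor_id_data_time,
  ADD PRIMARY KEY (monitor_id, monitor_data_time);
```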
What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows that need to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
or, equivalently, specify PRIMARY in place of UNIQUE and remove the index identifier
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation to this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
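One way to check for duplicates directly is a grouped count. This is a sketch against the table from the question; the alias copies and the LIMIT are arbitrary choices.

```sql
-- Lists tuples that occur more than once in the table.
SELECT monitor_id, monitor_data_time, monitor_data_value, COUNT(*) AS copies
FROM monitor_data
GROUP BY monitor_id, monitor_data_time, monitor_data_value
HAVING COUNT(*) > 1
LIMIT 20;
```

If this returns no rows while the data is static, the duplicates seen in the result are more likely to come from one of the other explanations.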
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list. It's not a function. The DISTINCT keyword applies to all expressions in the SELECT list. The parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of the index, though one suspects that those queries would also make effective use of a composite index that had monitor_data_time as the leading column.)
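Putting those points together, a sketch of the query once uniqueness is guaranteed, assuming the table has PRIMARY KEY (monitor_id, monitor_data_time) as in the definition above:

```sql
-- No DISTINCT needed: the primary key guarantees one row per (monitor_id, monitor_data_time).
-- Rows for a single monitor_id are already stored in monitor_data_time order,
-- so the ORDER BY should not require a separate sort.
SELECT md.monitor_data_time, md.monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
  AND md.monitor_data_time >= 1484076760
  AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
```

Running EXPLAIN on it should confirm whether "Using temporary" and "Using filesort" have disappeared.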

MySQL: Difference between LIKE 123 and = 123 regarding INDEX usage

I am experiencing a very strange behaviour which just turned out to be a matter of using the correct operator in my where condition.
Assume the following table structure with some million entries:
CREATE TABLE `obj` (
`obj__id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`obj__obj_type__id` int(10) unsigned DEFAULT NULL,
`obj__title` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__const` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__description` text COLLATE utf8_unicode_ci,
`obj__created` datetime DEFAULT NULL,
`obj__created_by` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__updated` datetime DEFAULT NULL,
`obj__updated_by` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__property` int(10) unsigned DEFAULT '0',
`obj__status` int(10) unsigned DEFAULT '1',
`obj__sysid` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__scantime` datetime DEFAULT NULL,
`obj__imported` datetime DEFAULT NULL,
`obj__hostname` varchar(255) COLLATE utf8_unicode_ci DEFAULT NULL,
`obj__undeletable` int(1) unsigned NOT NULL DEFAULT '0',
`obj__rt_cf__id` int(11) unsigned DEFAULT NULL,
`obj__cmdb_status__id` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`obj__id`),
KEY `obj_FKIndex1` (`obj__obj_type__id`),
KEY `obj_ibfk_2` (`obj__cmdb_status__id`),
KEY `obj__sysid` (`obj__sysid`),
KEY `obj__title` (`obj__title`),
KEY `obj__const` (`obj__const`),
KEY `obj__hostname` (`obj__hostname`),
KEY `obj__status` (`obj__status`),
KEY `obj__updated_by` (`obj__updated_by`)
) ENGINE=InnoDB AUTO_INCREMENT=7640131 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
A very simple select with two conditions, ordering by obj__title with a limit of 500, performs quite slowly (500ms):
SELECT SQL_NO_CACHE * FROM obj WHERE (obj__status = 2) AND (obj__obj_type__id = 59) ORDER BY obj__title ASC LIMIT 0, 500;
Without the "ORDER BY obj__title" it runs like a charm (<1ms).
EXPLAIN SELECT is telling me that MySQL is performing a filesort and not using the obj__title index. So, OK, it is quite obvious why this query is slow:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE obj index_merge obj_FKIndex1,obj__status obj_FKIndex1,obj__status 5,5 NULL 1336 Using intersect(obj_FKIndex1,obj__status); Using where; Using filesort
When I force the obj__title index with FORCE INDEX or USE INDEX, MySQL does not use the other indexes, resulting in very poor performance again. But never mind, it is quite obvious that the poor performance has something to do with the combination of the two conditions and the ORDER BY.
After spending hours investigating how to optimize this query, I came up with a very simple change: I switched the operator in my conditions from = to LIKE. So my query is now:
EXPLAIN SELECT SQL_NO_CACHE * FROM obj WHERE (obj__status LIKE 2) AND (obj__obj_type__id LIKE 59) ORDER BY obj__title ASC LIMIT 0, 500;
This is what happened..
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE obj index obj_FKIndex1,obj__status obj__title 768 NULL 500 Using where
Query performance is 150ms. I was shocked actually.
I am not really happy with the speed but at least it is performing ok.
But what I would really like to know is why the LIKE version uses the index and the = version does not. I did not find any hints on that in the MySQL documentation, only a few notes about LIKE being case insensitive and LIKE acting a bit differently for VARCHARs > 255, or any other CHAR or TEXT fields. Not a single word about its integer behaviour.
Can someone shed light on this situation? Any Database design or query tips to speed up the query more are very welcome as well!
For this query:
SELECT SQL_NO_CACHE *
FROM obj
WHERE (obj__status = 2) AND (obj__obj_type__id = 59)
ORDER BY obj__title ASC
LIMIT 0, 500;
The best index is obj(obj__status, obj__obj_type__id, obj__title).
Otherwise, I would expect an index on one of the two where fields.
However, when you use like, you are comparing numbers to strings. This generally prevents an index from being used. The only possible index is for the order by, which happens to work in your case.
But, the proper index should have better performance.
The ORDER BY has to be satisfied before the LIMIT. If there are a boatload of rows, and MySQL performs a sort operation (shown as "Using filesort" in the Extra column), that can be expensive.
MySQL can also satisfy an ORDER BY obj__title without performing a sort operation, by making use of an index with a leading column of obj__title. And that's what you see happening when you change the predicates. EXPLAIN shows that the index on obj__title is being used, there's no sort operation. But MySQL has to inspect each row, to see if it satisfies the predicates or not.
The LIKE predicate is causing the column to be evaluated in a string context, rather than numeric. That is, MySQL has to perform an implicit conversion from integer to varchar. And that prevents MySQL from using the index to satisfy the predicates. MySQL is basically being forced to do the conversion for every row in the table, in order to evaluate the predicate.
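To make the string-context point concrete, here is a small sketch. The first statement is the example the MySQL manual itself gives for LIKE on numeric expressions; the second is only an illustrative rewrite of what the LIKE predicates amount to, not something the optimizer literally produces.

```sql
-- MySQL permits LIKE on numbers by comparing them as strings.
SELECT 10 LIKE '1%';   -- 1, per the manual's own example

-- Illustrative equivalent of "obj__status LIKE 2 AND obj__obj_type__id LIKE 59":
-- the integer columns are converted to strings for every row examined,
-- so the integer indexes on them cannot be used to find matching rows.
SELECT COUNT(*)
FROM obj
WHERE CAST(obj__status AS CHAR) LIKE '2'
  AND CAST(obj__obj_type__id AS CHAR) LIKE '59';
```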
For best performance of that first query:
SELECT SQL_NO_CACHE *
FROM obj
WHERE obj__status = 2
AND obj__obj_type__id = 59
ORDER BY obj__title ASC
LIMIT 0, 500
You'd want an index with leading columns:
.... ON obj (obj__status, obj__obj_type__id, obj__title)
Then, MySQL could satisfy both of the equality predicates and the order by making use of the single index.
Note that this makes the index on just the single column obj__status redundant. Any query making use of the index on obj__status could make use of the new index.
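Written out in full, the index could be created like this. The index name obj_status_type_title is a hypothetical choice; the columns are the ones listed above.

```sql
-- Two equality columns first, then the ORDER BY column.
ALTER TABLE obj
  ADD INDEX obj_status_type_title (obj__status, obj__obj_type__id, obj__title);

-- Verify: ideally key = obj_status_type_title and no "Using filesort" in Extra.
EXPLAIN
SELECT SQL_NO_CACHE *
FROM obj
WHERE obj__status = 2
  AND obj__obj_type__id = 59
ORDER BY obj__title ASC
LIMIT 0, 500;
```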
Your first select needs this composite index. (I take the liberty of removing the "obj_" which just clutters the SQL.)
INDEX(type_id, status, title)
MySQL rarely uses more than one index in a query; this 3-column index is suited for WHERE status=(const) AND type_id=(const) ORDER BY title. I see that it used "index intersect" to try to compensate for the lack of a suitable composite index, but only partially.
Perhaps the optimizer looked at LIKE and said "Punt! I give up on using numeric comparisons, so let's not use either index on type_id or status. Instead, let's see if we can avoid the filesort by using INDEX(title)". And it happened to be better.
There is another thing that makes that filesort especially costly. "Using temporary" and "Filesort" prefer to do everything in RAM via a MEMORY table. But several things can prevent that. One is fetching of a TEXT field, which you do (SELECT * which includes description TEXT). I doubt if the optimizer noticed that. But the timings seem to have.
For more tips on indexing, see my index cookbook. Meanwhile, use LIKE only on strings, not numeric values.

Optimize index for multi field ordering with mixed direction

I'm trying to optimize a MySQL table for faster reads. The ratio of reads to writes is about 100:1, so I'm willing to sacrifice write performance by adding multiple indexes.
Relevant fields for my table are the following and it contains about 200000 records
CREATE TABLE `publications` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`title` varchar(255) NOT NULL,
-- omitted fields
`publication_date` date NOT NULL,
`active` tinyint(1) NOT NULL DEFAULT '0',
`position` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
-- these are just attempts, they are not production index
KEY `publication_date` (`publication_date`),
KEY `publication_date_2` (`publication_date`,`position`,`active`)
) ENGINE=MyISAM;
Since I'm using Ruby on Rails to access data in this table I've defined a default scope for this table which is
default_scope where(:active => true).order('publication_date DESC, position ASC')
i.e. every query in this table by default will be completed automatically with the following SQL fragment, so you can assume that almost all queries will have these conditions
WHERE `publications`.`active` = 1 ORDER BY publication_date DESC, position
So I'm mainly interested in optimizing this kind of query, plus queries with publication_date in the WHERE condition.
I tried with the following indexes in various combinations (also with multiple of them at the same time)
`publication_date`
`publication_date`,`position`
`publication_date`,`position`,`active`
However a simple query as this one still doesn't use the index properly and uses filesort
SELECT `publications`.* FROM `publications`
WHERE `publications`.`active` = 1
AND (id NOT IN (35217,35216,35215,35218))
ORDER BY publication_date DESC, position
LIMIT 8 OFFSET 0
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: publications
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 34903
Extra: Using where; Using filesort
1 row in set (0.00 sec)
Some considerations on my issue:
According to the MySQL documentation, a composite index can't be used for ordering when you mix ASC and DESC in the ORDER BY clause.
active is a boolean flag, so putting it in a standalone index makes no sense (it has just 2 possible values), but it's always used in the WHERE clause, so it should appear somewhere in an index to avoid Using where in Extra.
position is an integer with few possible values and it's always used scoped to publication_date, so I think it's useless to have it in a standalone index.
A lot of queries use publication_date in the WHERE part, so it can be useful to have it in a standalone index as well, even if that is redundant because it's the first column of the composite index.
One problem is that you are mixing sort orders in the ORDER BY clause. You could invert your position (inverted_position = max_position - position) so that you can also invert the sort order on that column.
You can then create a compound index on [publication_date, inverted_position] and change your order by clause to publication_date DESC, inverted_position DESC.
The active column should most likely not be part of the index as it has a very low selectivity.
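A rough sketch of that approach follows. The column name inverted_position comes from the suggestion above; the constant 1000000 standing in for max_position and the index name are assumptions made for illustration. The new column would also have to be kept in sync whenever position changes.

```sql
-- Hypothetical helper column holding max_position - position.
ALTER TABLE publications ADD COLUMN inverted_position INT DEFAULT NULL;

-- 1000000 is an assumed upper bound for position; substitute the real maximum.
UPDATE publications SET inverted_position = 1000000 - `position`;

ALTER TABLE publications
  ADD INDEX idx_pubdate_invpos (publication_date, inverted_position);

-- Both sort keys now run in the same direction.
-- Caveat: rows with a NULL position sort first under "position ASC"
-- but last under "inverted_position DESC".
SELECT publications.*
FROM publications
WHERE active = 1
ORDER BY publication_date DESC, inverted_position DESC
LIMIT 8 OFFSET 0;
```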

When should I use a composite index?

When should I use a composite index in a database?
What are the performance ramifications of using a composite index?
Why should I use a composite index?
For example, I have a homes table:
CREATE TABLE IF NOT EXISTS `homes` (
`home_id` int(10) unsigned NOT NULL auto_increment,
`sqft` smallint(5) unsigned NOT NULL,
`year_built` smallint(5) unsigned NOT NULL,
`geolat` decimal(10,6) default NULL,
`geolng` decimal(10,6) default NULL,
PRIMARY KEY (`home_id`),
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`)
) ENGINE=InnoDB ;
Does it make sense for me to use a composite index for both geolat and geolng, such that:
I replace:
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`),
with:
KEY `geolat_geolng` (`geolat`, `geolng`)
If so:
Why?
What are the performance ramifications of using a composite index?
UPDATE:
Since many people have stated it is entirely dependent upon the queries I perform, below is the most common query performed:
SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
UPDATE 2:
With the following database schema:
CREATE TABLE IF NOT EXISTS `homes` (
`home_id` int(10) unsigned NOT NULL auto_increment,
`primary_photo_group_id` int(10) unsigned NOT NULL default '0',
`customer_id` bigint(20) unsigned NOT NULL,
`account_type_id` int(11) NOT NULL,
`address` varchar(128) collate utf8_unicode_ci NOT NULL,
`city` varchar(64) collate utf8_unicode_ci NOT NULL,
`state` varchar(2) collate utf8_unicode_ci NOT NULL,
`zip` mediumint(8) unsigned NOT NULL,
`price` mediumint(8) unsigned NOT NULL,
`sqft` smallint(5) unsigned NOT NULL,
`year_built` smallint(5) unsigned NOT NULL,
`num_of_beds` tinyint(3) unsigned NOT NULL,
`num_of_baths` decimal(3,1) unsigned NOT NULL,
`num_of_floors` tinyint(3) unsigned NOT NULL,
`description` text collate utf8_unicode_ci,
`geolat` decimal(10,6) default NULL,
`geolng` decimal(10,6) default NULL,
`display_status` tinyint(1) NOT NULL,
`date_listed` timestamp NOT NULL default CURRENT_TIMESTAMP,
`contact_email` varchar(100) collate utf8_unicode_ci NOT NULL,
`contact_phone_number` varchar(15) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY (`home_id`),
KEY `customer_id` (`customer_id`),
KEY `city` (`city`),
KEY `num_of_beds` (`num_of_beds`),
KEY `num_of_baths` (`num_of_baths`),
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`),
KEY `account_type_id` (`account_type_id`),
KEY `display_status` (`display_status`),
KEY `sqft` (`sqft`),
KEY `price` (`price`),
KEY `primary_photo_group_id` (`primary_photo_group_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;
Using the following SQL:
EXPLAIN SELECT homes.home_id,
address,
city,
state,
zip,
price,
sqft,
year_built,
account_type_id,
num_of_beds,
num_of_baths,
geolat,
geolng,
photo_id,
photo_url_dir
FROM homes
LEFT OUTER JOIN home_photos ON homes.home_id = home_photos.home_id
AND homes.primary_photo_group_id = home_photos.home_photo_group_id
AND home_photos.home_photo_type_id = 2
WHERE homes.display_status = true
AND homes.geolat BETWEEN -100 AND 100
AND homes.geolng BETWEEN -100 AND 100
EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
----------------------------------------------------------------------------------------------------------
1 SIMPLE homes ref geolat,geolng,display_status display_status 1 const 2 Using where
1 SIMPLE home_photos ref home_id,home_photo_type_id,home_photo_group_id home_photo_group_id 4 homes.primary_photo_group_id 4
I don't quite understand how to read the EXPLAIN output. Does this look good or bad? Right now, I am NOT using a composite index for geolat and geolng. Should I be?
You should use a composite index when you are using queries that benefit from it. A composite index that looks like this:
index( column_A, column_B, column_C )
will benefit a query that uses those fields for joining, filtering, and sometimes selecting. It will also benefit queries that use left-most subsets of columns in that composite. So the above index will also satisfy queries that need
index( column_A, column_B, column_C )
index( column_A, column_B )
index( column_A )
But it will not (at least not directly, maybe it can help partially if there are no better indices) help for queries that need
index( column_A, column_C )
Notice how column_B is missing.
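The left-most prefix rule above can be sketched with a small example. The table t, its payload column, and the index name idx_abc are hypothetical and exist only for illustration.

```sql
-- Hypothetical table with one three-column composite index.
CREATE TABLE t (
  column_A INT NOT NULL,
  column_B INT NOT NULL,
  column_C INT NOT NULL,
  payload  VARCHAR(100),
  INDEX idx_abc (column_A, column_B, column_C)
);

-- These conditions cover a left-most prefix, so idx_abc can be used fully for them.
SELECT * FROM t WHERE column_A = 1;
SELECT * FROM t WHERE column_A = 1 AND column_B = 2;
SELECT * FROM t WHERE column_A = 1 AND column_B = 2 AND column_C = 3;

-- Only the column_A part narrows the index search here, because column_B is skipped.
SELECT * FROM t WHERE column_A = 1 AND column_C = 3;

-- No usable prefix at all: column_A is missing from the condition.
SELECT * FROM t WHERE column_B = 2;
```

Prefixing each SELECT with EXPLAIN and comparing the key and key_len columns shows how much of the index is actually used in each case.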
In your original example, a composite index for two dimensions will mostly benefit queries that query on both dimensions or the leftmost dimension by itself, but not the rightmost dimension by itself. If you're always querying two dimensions, a composite index is the way to go, doesn't really matter which is first (most probably).
Imagine you have the following three queries:
Query I:
SELECT * FROM homes WHERE `geolat`=42.9 AND `geolng`=36.4
Query II:
SELECT * FROM homes WHERE `geolat`=42.9
Query III:
SELECT * FROM homes WHERE `geolng`=36.4
If you have a separate index per column, all three queries use indexes. In MySQL, if you have a composite index (geolat, geolng), only Query I and Query II (which uses the first part of the composite index) use the index. In this case, Query III requires a full table scan.
The Multiple-Column Indexes section of the manual clearly explains how multiple-column indexes work, so I won't retype the manual.
From the MySQL Reference Manual page:
A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.
If you use separate indexes for the geolat and geolng columns, you have two different indexes in your table, which you can search independently.
INDEX geolat
-----------
VALUE RRN
36.4 1
36.4 8
36.6 2
37.8 3
37.8 12
41.4 4
INDEX geolng
-----------
VALUE RRN
26.1 1
26.1 8
29.6 2
29.6 3
30.1 12
34.7 4
If you use composite index you have only one index for both columns:
INDEX (geolat, geolng)
-----------
VALUE RRN
36.4,26.1 1
36.4,26.1 8
36.6,29.6 2
37.8,29.6 3
37.8,30.1 12
41.4,34.7 4
RRN is the relative record number (to simplify, you can think of it as the ID). The first two indexes are generated separately, and the third index is composite. As you can see, you cannot search by geolng alone on the composite index, since it is ordered by geolat first; however, it is possible to search by geolat or by "geolat AND geolng" (since geolng is the second-level key).
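To make the "sorted array" picture concrete, here is a small Python sketch that treats the composite index as a sorted list of tuples. The rows are copied from the composite table above; the function names are invented for the example, and a real B-tree is of course more involved.

```python
from bisect import bisect_left, bisect_right

# (geolat, geolng, RRN) entries from the composite index table above,
# kept in the order the index would store them.
composite = [
    (36.4, 26.1, 1),
    (36.4, 26.1, 8),
    (36.6, 29.6, 2),
    (37.8, 29.6, 3),
    (37.8, 30.1, 12),
    (41.4, 34.7, 4),
]

NEG, POS = float("-inf"), float("inf")


def by_geolat(lat):
    # The leading column is sorted, so a binary search finds the slice.
    lo = bisect_left(composite, (lat, NEG, NEG))
    hi = bisect_right(composite, (lat, POS, POS))
    return [rrn for _, _, rrn in composite[lo:hi]]


def by_geolat_and_geolng(lat, lng):
    # Both columns given: still one binary search on the (lat, lng) prefix.
    lo = bisect_left(composite, (lat, lng, NEG))
    hi = bisect_right(composite, (lat, lng, POS))
    return [rrn for _, _, rrn in composite[lo:hi]]


def by_geolng(lng):
    # geolng alone is not sorted across the whole list,
    # so every entry has to be inspected.
    return [rrn for _, g, rrn in composite if g == lng]


print(by_geolat(37.8))
print(by_geolat_and_geolng(36.4, 26.1))
print(by_geolng(29.6))
```

The first two lookups use binary search; the third walks the whole list, which is the analogue of not being able to use the composite index for geolng alone.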
Also, have a look at the How MySQL Uses Indexes section of the manual.
There could be a misconception about what a composite index does. Many people think that a composite index can be used to optimise a search query as long as the where clause covers the indexed columns, in your case geolat and geolng. Let's delve deeper:
I believe your data on the coordinates of homes would be random decimals such as:
home_id geolat geolng
1 20.1243 50.4521
2 22.6456 51.1564
3 13.5464 45.4562
4 55.5642 166.5756
5 24.2624 27.4564
6 62.1564 24.2542
...
Since geolat and geolng values hardly ever repeat, a composite index on geolat and geolng would look something like this:
index_id geolat geolng
1 20.1243 50.4521
2 20.1244 61.1564
3 20.1251 55.4562
4 20.1293 66.5756
5 20.1302 57.4564
6 20.1311 54.2542
...
Therefore the second column of the composite index is basically useless! The speed of your query with a composite index is probably going to be similar to an index on just the geolat column.
As mentioned by Will, MySQL provides spatial extension support. A spatial point is stored in a single column instead of two separate lat and lng columns, and a spatial index can be applied to such a column. However, in my personal experience the efficiency can be overrated. It could be that a spatial index does not resolve the two-dimensional problem but merely speeds up the search using R-Trees with quadratic splitting.
The trade-off is that a spatial point consumes much more memory, as it uses eight-byte double-precision numbers for storing coordinates. Correct me if I am wrong.
Composite indexes are useful for
0 or more "=" clauses, plus
at most one range clause.
A composite index cannot handle two ranges. I discuss this further in my index cookbook.
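A rough Python sketch of why the second range is the problem. The grid of integer (geolat, geolng) entries is hypothetical test data, and the sorted list only stands in for the index; the point is how many entries have to be touched.

```python
from bisect import bisect_left, bisect_right

# Hypothetical integer "coordinates": 100 geolat values x 20 geolng values,
# sorted the way an index on (geolat, geolng) would store them.
entries = sorted((lat, lng) for lat in range(100) for lng in range(20))


def two_ranges(lat_lo, lat_hi, lng_lo, lng_hi):
    """Range on geolat AND range on geolng."""
    start = bisect_left(entries, (lat_lo, float("-inf")))
    stop = bisect_right(entries, (lat_hi, float("inf")))
    scanned = entries[start:stop]  # everything in the geolat range
    matches = [e for e in scanned if lng_lo <= e[1] <= lng_hi]
    return len(scanned), len(matches)


def equal_then_range(lat, lng_lo, lng_hi):
    """geolat = constant AND range on geolng."""
    start = bisect_left(entries, (lat, lng_lo))
    stop = bisect_right(entries, (lat, lng_hi))
    scanned = entries[start:stop]  # only matching entries are touched
    return len(scanned), len(scanned)


print(two_ranges(40, 49, 5, 6))      # (entries scanned, entries matched)
print(equal_then_range(40, 5, 6))
```

In the two-range case the geolng condition can only be applied as a filter over the whole geolat slice, whereas "= plus one range" touches just the rows it returns.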
Find nearest -- If the question is really about optimizing
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
then no index can really handle both dimensions.
Instead, one must 'think out of the box'. If one dimension is implemented via partitioning and the other is implemented by carefully picking the PRIMARY KEY, one can get significantly better efficiency for very large tables of lat/lng lookup. My latlng blog goes into the details of how to implement "find nearest" on the globe. It includes code.
The PARTITIONs are stripes of latitude ranges. The PRIMARY KEY deliberately starts with longitude so that the useful rows are likely to be in the same block. A Stored Routine orchestrates the messy code for doing order by... limit... and for growing the 'square' around the target until you have enough coffee shops (or whatever). It also takes care of the great-circle calculations and handling the dateline and poles.
More
I have written another blog; it compares 5 ways of doing lat/lng searches: http://mysql.rjweb.org/doc.php/latlng#representation_choices (It references the link given above as one of the 5.) One of the other ways is this, and it points out that they are optimal for the particular case:
INDEX(geolat, geolng),
INDEX(geolng, geolat)
That is, having both columns in two indexes, and not having single-column indexes on geolat and geolng is important.
Composite indexes are very powerful as they:
Enforce structure integrity
Enable sorting on a FILTERED id
ENFORCE STRUCTURE INTEGRITY
Composite indexes are not just another type of index; they can provide NECESSARY structure to a table by enforcing integrity as the Primary Key.
MySQL's InnoDB supports clustering, and the following example illustrates why a composite index may be necessary.
To create a friends table (e.g. for a social network) we need 2 columns: user_id, friend_id.
Table Structure
user_id (medium_int)
friend_id (medium_int)
Primary Key -> (user_id, friend_id)
By definition, a Primary Key (PK) is unique, and by creating a composite PK, InnoDB will automatically check that no duplicate of the (user_id, friend_id) pair exists when a new record is added. This is the expected behavior, as no user should have more than one record (relationship link) with friend_id = 2, for instance.
Without a composite PK, we can create this schema using a surrogate key:
user_friend_id
user_id
friend_id
Primary Key -> (user_friend_id)
Now, whenever a new record is added we will have to check that a prior record with the combination user_id, friend_id does not already exist.
As such, a composite index can enforce structure integrity.
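A quick way to see the duplicate check in action is the sketch below. Note the assumption: it uses Python's built-in sqlite3 module as a stand-in, not MySQL/InnoDB, so the DDL types and the exception class are SQLite's; only the idea of a composite primary key rejecting a repeated (user_id, friend_id) pair carries over. The add_friend helper is made up for the example.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE friends (
        user_id   INTEGER NOT NULL,
        friend_id INTEGER NOT NULL,
        PRIMARY KEY (user_id, friend_id)
    )
    """
)


def add_friend(user_id, friend_id):
    """Insert a relationship; return False if the pair already exists."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO friends (user_id, friend_id) VALUES (?, ?)",
                (user_id, friend_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by the composite primary key on a duplicate pair.
        return False


print(add_friend(1, 2))  # new pair
print(add_friend(1, 3))  # same user, different friend
print(add_friend(1, 2))  # duplicate pair, rejected by the key
```

No application-side "does this pair already exist?" query is needed; the key does the check on insert.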
ENABLE SORTING ON A FILTERED ID
It is very common to sort a set of records by the post's time (timestamp or datetime). Usually, this means posting on a given id. Here is an example
Table User_Wall_Posts (think of Facebook's wall posts)
user_id (medium_int)
timestamp (timestamp)
author_id (medium_int)
comment_post (text)
Primary Key -> (user_id, timestamp, author_id)
We want to query and find all posts for user_id = 10 and sort the comment posts by timestamp (date).
SQL QUERY
SELECT * FROM User_Wall_Posts WHERE user_id = 10 ORDER BY timestamp DESC
The composite PK enables MySQL to filter and sort the results using the index; MySQL will not have to use a temporary file or filesort to fetch the results. Without a composite key, this would not be possible, and the query would be very inefficient.
As such, composite keys are very powerful and suit more than the simple problem of "I want to search for column_a, column_b so I will use composite keys". For my current database schema, I have just as many composite keys as single keys. Don't overlook a composite key's use!
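The "filter and sort from the same index" idea can be sketched with a sorted list in Python. The post keys below are hypothetical sample data, and the list is only a stand-in for the (user_id, timestamp, author_id) key order.

```python
from bisect import bisect_left

# Hypothetical (user_id, timestamp, author_id) keys, stored in key order
# the way the composite primary key would keep them.
posts = sorted([
    (9, 1700000300, 4),
    (10, 1700000100, 7),
    (10, 1700000500, 3),
    (10, 1700000200, 5),
    (11, 1700000400, 2),
])


def wall_posts_newest_first(user_id):
    # All keys for one user_id are contiguous and already ordered by
    # timestamp, so we find the slice and read it backwards. No sort call.
    lo = bisect_left(posts, (user_id,))
    hi = bisect_left(posts, (user_id + 1,))
    return posts[lo:hi][::-1]


for post in wall_posts_newest_first(10):
    print(post)
```

Because the rows for user_id = 10 sit next to each other in timestamp order, reading the slice in reverse gives the ORDER BY timestamp DESC result without a separate sorting step.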
To do spatial searches, you need an R-Tree algorithm, which allows searching geographical areas very quickly. Exactly what you need for this job.
Some databases have spatial indexes built in. A quick Google search shows MySQL 5 has them (and, looking at your SQL, I'm guessing you're using MySQL).
Composite index can be useful when you want to optimise group by clause (check this article http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html).
Please pay attention:
The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index, and that the index stores its keys in order (for example, this is a BTREE index and not a HASH index)
There is no black-and-white, one-size-fits-all answer.
You should use a composite (or multi-column) index when your query workload would benefit from one.
You need to profile your query workload in order to determine this.
A composite index comes into play when queries can be satisfied entirely from that index, meaning all the columns required by the query are in (covered by) the index.
UPDATE (in response to edit to posted question): If you are selecting * from the table, the composite index may or may not be used. You will need to run EXPLAIN on the query to be sure.
I'm with @Mitch, it depends entirely on your queries. Fortunately you can create and drop indexes at any time, and you can prepend the EXPLAIN keyword to your queries to see if the query analyzer uses the indexes.
If you'll be looking up an exact lat/long pair this index would likely make sense. But you're probably going to be looking for homes within a certain distance of a particular place, so your queries will look something like this (see source):
select *, sqrt( pow(h2.geolat - h1.geolat, 2)
+ pow(h2.geolng - h1.geolng, 2) ) as distance
from homes h1, homes h2
where h1.home_id = 12345 and h2.home_id != h1.home_id
order by distance
and the index very likely won't be helpful at all. For geospatial queries, you need something like this.
Update: with this query:
SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
The query analyzer could use an index on geolat alone, or an index on geolng alone, or possibly both indexes. I don't think it would use a composite index. But it's easy to try out each of these permutations on a real data set and then (a) see what EXPLAIN tells you and (b) measure the time the query really takes.