High traffic table, optimal indexes? - mysql

I have a monitoring table with the following structure:
CREATE TABLE `monitor_data` (
`monitor_id` INT(10) UNSIGNED NOT NULL,
`monitor_data_time` INT(10) UNSIGNED NOT NULL,
`monitor_data_value` INT(10) NULL DEFAULT NULL,
INDEX `monitor_id_data_time` (`monitor_id`, `monitor_data_time`),
INDEX `monitor_data_time` (`monitor_data_time`)
)
COLLATE='utf8_general_ci'
ENGINE=InnoDB;
This is a very high-traffic table with potentially thousands of new rows every minute. Each row belongs to a monitor and contains a value and a time (unix timestamp).
I have three issues:
1.
After a number of months in dev, the table suddenly became very slow. Queries that previously finished in under a second could now take up to a minute. I'm using standard settings in my.cnf since this is a dev machine, but the behavior was indeed very strange to me.
2.
I'm not sure that I have optimal indexes. A "normal" query looks like this:
SELECT DISTINCT(md.monitor_data_time), monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;
An EXPLAIN of the query above looks like this:
id select_type table type  possible_keys                          key                  key_len ref  rows   Extra
1  SIMPLE      md    range monitor_id_data_time,monitor_data_time monitor_id_data_time 8       NULL 149799 Using index condition; Using temporary; Using filesort
What do you think about the indexes?
3.
If I leave out the DISTINCT in the query above, I actually get duplicate rows even though there aren't any duplicate rows in the table. Any explanation for this behavior?
Any input is greatly appreciated!
UPDATE 1:
New suggestion on table structure:
CREATE TABLE `monitor_data_test` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_data_time`, `monitor_id`),
INDEX `monitor_data_time` (`monitor_data_time`)
) COLLATE='utf8_general_ci' ENGINE=InnoDB;

SELECT DISTINCT(md.monitor_data_time), monitor_data_value
is the same as
SELECT DISTINCT md.monitor_data_time, monitor_data_value
That is, the pair is distinct. It does not dedup just the time. Is that what you want?
If you are trying to de-dup just the time, then do something like
SELECT time, AVG(value)
...
GROUP BY time;
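For example, a concrete version against the table in the question might look like this (a sketch; AVG is an arbitrary choice of aggregate):
SELECT monitor_data_time, AVG(monitor_data_value)
FROM monitor_data
WHERE monitor_id = 165
AND monitor_data_time BETWEEN 1484076760 AND 1487271199
GROUP BY monitor_data_time;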
For optimal performance of
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760 ...
you need
PRIMARY KEY (monitor_id, monitor_data_time)
and it must be in that order. The opposite order is much less useful. The guiding principle: start with the '=' test, then move on to the 'range'.
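As a sketch, the table rewritten along those lines (assuming the (monitor_id, monitor_data_time) pair is unique, which the original schema does not guarantee):
CREATE TABLE `monitor_data` (
`monitor_id` INT UNSIGNED NOT NULL,
`monitor_data_time` INT UNSIGNED NOT NULL,
`monitor_data_value` INT UNSIGNED NULL DEFAULT NULL,
PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB;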
Do you have 4 billion monitor_id values? INT takes 4 bytes; consider using a smaller datatype.
Do you have other queries that need optimizing? It is better to design the index(es) after gathering all the important queries.
Why PK
In InnoDB, the PRIMARY KEY is "clustered" with the data. That is, the data is an ordered list of triples: (id, time, value) stored in a B+Tree. Locating id = 165 AND time = 1484076760 is a basic operation of a BTree. And it is very fast. Then scanning forward (that's the "+" part of "B+Tree") until time = 1487271199 is a very fast operation of "next row" in this ordered list. Furthermore, since value is right there with the id and time, there is no extra effort to get the values.
You can't scan the requested rows any faster. But it requires PRIMARY KEY. (OK, UNIQUE(id, time) would be 'promoted' to be the PK, but let's not confuse the issue.)
Contrast this with an index on (time, id): it would do the scan over the dates fine, but it would have to skip over any entries where id != 165, reading all those rows just to discover they do not apply. A lot more effort.
Since it is unclear what you intended by DISTINCT, I can't continue this detailed discussion of how that plays out. Suffice it to say: The possible rows have been found; now some kind of secondary pass is needed to do the DISTINCT. (It may not even need to do a sort.)

What do you think about the indexes?
The index on (monitor_id,monitor_data_time) seems appropriate for the query. That's suited to an index range scan operation, very quickly eliminating boatloads of rows from needing to be examined.
Better would be a covering index that also includes the monitor_data_value column. Then the query could be satisfied entirely from the index, without a need to lookup pages from the data table to get monitor_data_value.
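A sketch of such a covering index (the index name is hypothetical):
ALTER TABLE monitor_data
ADD INDEX monitor_id_time_value (monitor_id, monitor_data_time, monitor_data_value);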
And even better would be having the InnoDB cluster key be the PRIMARY KEY or UNIQUE KEY on the columns, rather than incurring the overhead of the synthetic row identifier that InnoDB creates when an appropriate index isn't defined.
If I wasn't allowing duplicate (monitor_id, monitor_data_time) tuples, then I'd define the table with a UNIQUE index on those non-nullable columns.
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, UNIQUE KEY `monitor_id_data_time` (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Or, equivalently, specify PRIMARY KEY in place of UNIQUE KEY and drop the index name:
CREATE TABLE `monitor_data`
( `monitor_id` INT(10) UNSIGNED NOT NULL
, `monitor_data_time` INT(10) UNSIGNED NOT NULL
, `monitor_data_value` INT(10) NULL DEFAULT NULL
, PRIMARY KEY (`monitor_id`, `monitor_data_time`)
) ENGINE=InnoDB
Any explanation for this behavior?
If the query (shown in the question) returns a different number of rows with the DISTINCT keyword, then there must be duplicate (monitor_id,monitor_data_time,monitor_data_value) tuples in the table. There's nothing in the table definition that guarantees us that there aren't duplicates.
There are a couple of other possible explanations, but those explanations are all related to rows being added/changed/removed, and the queries seeing different snapshots, transaction isolation levels, yada, yada. If the data isn't changing, then there are duplicate rows.
A PRIMARY KEY constraint (or a UNIQUE KEY constraint on non-nullable columns) would guarantee us uniqueness.
Note that DISTINCT is a keyword in the SELECT list, not a function. The DISTINCT keyword applies to all expressions in the SELECT list, so the parens around md.monitor_data_time are superfluous.
Leaving the DISTINCT keyword out would eliminate the need for the "Using filesort" operation. And that can be expensive for large sets, particularly when the set is too large to sort in memory, and the sort has to spill to disk.
It would be much more efficient to have guaranteed uniqueness, omit the DISTINCT keyword, and return rows in order by the index, preferably the cluster key.
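With that uniqueness in place, the query from the question could simply drop the DISTINCT, something like:
SELECT md.monitor_data_time, md.monitor_data_value
FROM monitor_data md
WHERE md.monitor_id = 165
AND md.monitor_data_time >= 1484076760
AND md.monitor_data_time <= 1487271199
ORDER BY md.monitor_data_time ASC;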
Also, the secondary index monitor_data_time doesn't benefit this query. (There may be other queries that can make effective use of that index, though one suspects those queries would also make effective use of a composite index that has monitor_data_time as the leading column.)

Related

MySQL Large Table Sharding to Smaller Table based on Unique ID

We have a large MySQL table (device_data) with the following columns:
ID (int)
dt (timestamp)
serial_number (char(20))
data1 (double)
data2 (double)
... // other columns
The table receives around 10M rows every day.
We sharded the table by the date of the timestamp (device_data_YYYYMMDD). However, we feel this is not effective because most of our queries (shown below) always check on the "serial_number" and span many dates.
SELECT * FROM device_data WHERE serial_number = 'XXX' AND dt >= '2018-01-01' AND dt <= '2018-01-07';
Therefore, we think that creating the sharding based on the serial number will be more effective. Basically, we will have:
device_data_<serial_number>
device_data_0012393746
device_data_7891238456
Hence, when we want to find data for a particular device, we can easily reference as:
SELECT * FROM device_data_<serial_number> WHERE dt >= '2018-01-01' AND dt <= '2018-01-07';
This approach seems to be effective because:
The application will at all times access the data based on the device first.
We have checked that there is no query that accesses the data without specifying the device serial number first.
The table for each device will be relatively small (9000 rows per day)
A few challenges that we think we will face is:
We have a lot of devices. This means that there will be a lot of device_data_<serial_number> tables too. I have checked that MySQL does not impose a limit on the number of tables in a database. Will this impact performance vs. keeping them in one table?
How will this impact us later when we want to scale MySQL (e.g. master/slave replication)?
Are there other alternatives or solutions for resolving this?
Update: below is the SHOW CREATE TABLE result from our existing table:
CREATE TABLE `test_udp_new` (
`id` int(20) unsigned NOT NULL AUTO_INCREMENT,
`dt` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`device_sn` varchar(20) NOT NULL,
`gps_date` datetime NOT NULL,
`lat` decimal(10,5) DEFAULT NULL,
`lng` decimal(10,5) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
) ENGINE=InnoDB AUTO_INCREMENT=44449751 DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC
The most frequent queries being run:
SELECT *
FROM test_udp_new
WHERE device_sn = 'xxx'
AND dt >= 'xxx'
AND dt <= 'xxx'
ORDER BY dt DESC;
The optimal way to handle that query is in a non-partitioned table with
INDEX(serial_number, dt)
Even better is to change the PRIMARY KEY. Assuming you currently have id AUTO_INCREMENT because there is not a unique combination of columns suitable for being a "natural PK",
PRIMARY KEY(serial_number, dt, id), -- to optimize that query
INDEX(id) -- to keep AUTO_INCREMENT happy
If there are other queries that run often, please provide them; this change may hurt them. In large tables, it is a juggling act to find the optimal index(es).
Other Comments:
There are very few use cases for which partitioning actually speeds up processing.
Making lots of 'identical' tables is a maintenance nightmare, and, again, not a performance benefit. There are probably a hundred Q&As on Stack Overflow advising against doing so.
By having serial_number first in the PRIMARY KEY, all queries referring to a single serial_number are likely to benefit.
A million serial_numbers? No problem.
One common use case for partitioning involves purging "old" data, because big DELETEs are much more costly than DROP PARTITION. That involves PARTITION BY RANGE(TO_DAYS(dt)); see the sketch after these comments. If you are interested in that, my PK suggestion still stands. (And the query in question will run at about the same speed with or without this partitioning.)
How many months before the table outgrows your disk? (If this will be an issue, let's discuss it.)
Do you need 8-byte DOUBLE? FLOAT has about 7 significant digits of precision and takes only 4 bytes.
You are using InnoDB?
Is serial_number fixed at 20 characters? If not, use VARCHAR. Also, CHARACTER SET ascii may be better than the default utf8.
Each table (or each partition of a table) involves at least one file that the OS must deal with. When you have "too many", the OS groans, often before MySQL groans. (It is hard to make either "die" of overdose.)
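Here is the partitioning sketch mentioned in the comments above (partition names and boundaries are illustrative only; note that MySQL requires the partitioning column, dt here, to appear in every unique key, which the suggested PRIMARY KEY satisfies):
ALTER TABLE test_udp_new
PARTITION BY RANGE (TO_DAYS(dt)) (
PARTITION p201801 VALUES LESS THAN (TO_DAYS('2018-02-01')),
PARTITION p201802 VALUES LESS THAN (TO_DAYS('2018-03-01')),
PARTITION pmax VALUES LESS THAN MAXVALUE
);
-- Purging a month then becomes a cheap metadata operation instead of a big DELETE:
ALTER TABLE test_udp_new DROP PARTITION p201801;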
Addressing the query
PRIMARY KEY (`id`),
KEY `device_sn_2` (`dt`,`device_sn`),
KEY `dt` (`dt`),
KEY `data` (`data`) USING BTREE,
KEY `test_udp_new_device_sn_dt_index` (`device_sn`,`dt`),
KEY `test_udp_new_device_sn_data_dt_index` (`device_sn`,`data`,`dt`)
-->
PRIMARY KEY(`device_sn`,`dt`, id),
INDEX(id),
KEY `dt_sn` (`dt`,`device_sn`),
KEY `data` (`data`) USING BTREE,
Notes:
By starting the PK with device_sn, dt, you get the clustering benefit for the query with WHERE device_sn = .. AND dt BETWEEN ...
INDEX(id) is to keep AUTO_INCREMENT happy.
When you have INDEX(a,b), INDEX(a) is redundant.
The (20) is meaningless; id will max out at about 4 billion.
I tossed the last index because it is probably helped enough by the new PK.
lat/lng: DECIMAL(10,5) doesn't need 5 digits to the left of the decimal point; latitude needs only 2 and longitude only 3. So: lat DECIMAL(7,5), lng DECIMAL(8,5). This will save a total of 3 bytes per row.

Mysql concept of key for the same column with shared primary key

In the code below, I set chat_id and student_id as the primary key so that the same chat can be added for another student_id. In production there will be a lot of records. Should I add one more index on student_id alone so that the search will be faster for every user when they come back to the screen to see recent messages?
CREATE TABLE `tim_chat_recipients` (
`chat_id` INT(11) NOT NULL,
`student_id` INT(11) NOT NULL,
`message_status` TINYINT(1) NOT NULL DEFAULT '0' COMMENT '1:New, 2:Read, 3:Deleted',
PRIMARY KEY (`chat_id`, `student_id`)
) COLLATE='latin1_swedish_ci' ENGINE=InnoDB;
If the code is going to search for student_id without a chat_id, then an index over at least student_id is required, or queries will result in non-scaling table/cluster scans. This is because the implicit cluster index (chat_id, student_id) can't be used for index reduction without a supplied chat_id.
This secondary index - INDEX(student_id, [optionally other columns]) - might include other columns, either as part of the index or includes, although "appropriate selection" depends a good bit on the queries. In this case it may make sense to include the message_status column to be able to use "messages not seen" as a filter with minimal read-IO as it can avoid a probe back to the table/cluster. (Indices are about balancing maintenance/disk costs and benefits to queries.)
In short: for an index to be selected by the Query Planner in a SARGABLE query, the left parts of the index must be known and mentioned in either the WHERE or, sometimes, the JOIN .. ON clauses.
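A minimal sketch of that secondary index (the index name is hypothetical; whether to include message_status depends on your queries, per the trade-off above):
ALTER TABLE tim_chat_recipients
ADD INDEX idx_student_status (student_id, message_status);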

Mysql, In my matrix, an index is not used

given this table:
CREATE TABLE `matrix` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`city1_id` int(10) unsigned NOT NULL DEFAULT '0',
`city2_id` int(10) unsigned NOT NULL DEFAULT '0',
`timeinmin` mediumint(8) unsigned NOT NULL DEFAULT '0',
`distancem` mediumint(8) unsigned NOT NULL DEFAULT '0',
`OWNER` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id`),
UNIQUE KEY `city12_index` (`city1_id`,`city2_id`),
UNIQUE KEY `city21_index` (`city2_id`,`city1_id`),
KEY `city1_index` (`city1_id`),
KEY `city2_index` (`city2_id`),
KEY `ownerIndex` (`OWNER`),
CONSTRAINT `PK_city_city1` FOREIGN KEY (`city1_id`) REFERENCES `city` (`id`) ON DELETE CASCADE ON UPDATE CASCADE,
CONSTRAINT `PK_city_city2` FOREIGN KEY (`city2_id`) REFERENCES `city` (`id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=5118409 DEFAULT CHARSET=utf8;
There is a very large amount of data.
This SQL runs very fast:
select count(*) from city_matrix where owner=1
since there is an index on "owner".
select count(*) from city_matrix where owner=1 order by id
this also runs very fast. But this:
select count(*) from city_matrix where owner=1 order by city1_id
takes several seconds, BUT there is an index on city1_id too!
The explain tells this:
1, 'SIMPLE', 'city_matrix', '', 'ref', 'ownerIndex', 'ownerIndex', '4', 'const', 169724, 100.00, ''
This is a great question. MySQL determines the right index based on many different cases. Its main goal is to find the most suitable index that can retrieve the data fast.
select count(*) from city_matrix where owner=1 order by id
In this query, MySQL determined that where owner=1 reduced the results to a small enough number that ordering by ID was relatively easy. For example, if ID is also a key (primary/unique/index), which I suspect it is, MySQL could take advantage of ID for sorting.
In case of this:
select count(*) from city_matrix where owner=1 order by city1_id
MySQL can still filter out all the records for the owner, but it will take time to shuffle all the city1_id data so that you receive a sorted result. Since that takes time, running SHOW PROCESSLIST meanwhile could have shown you that the query was reordering data.
To help MySQL do the job faster, we can use something called a covering index. A covering index contains all the fields used in the query, so MySQL just has to read through the index to get you the data, without having to touch the underlying table. A composite index on owner and city1_id will let MySQL use one single index to filter the data, use that same index again to sort it, and then do a count on it.
So, let's create the covering index:
create index idx_city_matrix_owner_city1 on city_matrix(owner, city1_id)
As you noticed, MySQL took some time to make the index and once the index was ready, it could zip through data pretty quickly to give you counts.
EDIT: It is important to note that when you do COUNT(*) like the statements above do, you don't need ordering. The result set is scalar, just one value, so ordering by any field does not impact your count. For example, counting all the fruits on the table will give you the same result as counting all the fruits on the table ordered by size.
The process for retrieval and index application is as follows:
The intermediate result which is retrieved by MySQL for the key owner is "stored" in a temporary table (either in memory or on disk depending on the size of the result).
Based on histogram data of the intermediate result, an index can be applied. If the data is not unique enough, the index can be discarded as not useful (for example: there are only 5 distinct cities in these 169k results).
Workarounds:
Apply the index with a hint (see the sketch after this list): this is considered poor practice since it can lead to unwanted index use, speeding up one query and slowing down the next (yes, an index can make a query slower);
Create a multi column index which contains both the owner and city1_id.
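Here is the sketch referenced above, showing both workarounds (the composite index name is illustrative; ownerIndex is the existing index from the schema):
-- Workaround 1: pin the existing index with a hint (use sparingly)
SELECT COUNT(*) FROM city_matrix USE INDEX (ownerIndex) WHERE owner = 1;
-- Workaround 2: a composite index on both columns
CREATE INDEX idx_owner_city1 ON city_matrix (owner, city1_id);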
One last remark
An ORDER BY on a COUNT(*) query only slows things down, since the ordering does not change your result at all.

Mysql slow performance on big table

I have the following table with millions of rows:
CREATE TABLE `points` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`DateNumber` int(10) unsigned DEFAULT NULL,
`Count` int(10) unsigned DEFAULT NULL,
`FPTKeyId` int(10) unsigned DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `id_UNIQUE` (`id`),
KEY `index3` (`FPTKeyId`,`DateNumber`) USING HASH
) ENGINE=InnoDB AUTO_INCREMENT=16755134 DEFAULT CHARSET=utf8$$
As you can see, I have created indexes. I don't know if I did it right; maybe not.
The problem is that queries execute super slowly.
Let's take a simple query
SELECT fptkeyid, count FROM points group by fptkeyid
I can't get a result because the query aborts on timeout (10 min). What am I doing wrong?
Beware MySQL's stupid behaviour: GROUP BYing implicitly executes ORDER BY.
To prevent this, explicitly add ORDER BY NULL, which prevents unnecessary ordering.
http://dev.mysql.com/doc/refman/5.0/en/select.html says:
If you use GROUP BY, output rows are sorted according to the GROUP BY columns as if you had an ORDER BY for the same columns. To avoid the overhead of sorting that GROUP BY produces, add ORDER BY NULL:
SELECT a, COUNT(b) FROM test_table GROUP BY a ORDER BY NULL;
Additionally, http://dev.mysql.com/doc/refman/5.6/en/group-by-optimization.html says:
The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index, and that the index stores its keys in order (for example, this is a BTREE index and not a HASH index).
Your query does not make sense:
SELECT fptkeyid, count FROM points group by fptkeyid
You group by fptkeyid, so count is not useful here; there should be an aggregate function, not a count field. Note also that COUNT is a MySQL function, which makes it inadvisable to use the same name for a field.
Don't you need something like:
SELECT fptkeyid, SUM(`count`) FROM points group by fptkeyid
If not please explain what result you expect from the query.
I created a database with test data, half a million records, to see if I could reproduce something similar to your issue. This is what EXPLAIN tells me:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 433756
And on the SUM query:
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE points index NULL index3 10 NULL 491781
Both queries run on a laptop (MacBook Air) within a second; nothing takes long. Inserting did take some time, a few minutes to get half a million records, but retrieving and calculating does not.
We need more information to answer your question completely. Maybe the configuration of the database is wrong, for example almost no memory allocated?
I would personally start with your AUTO_INCREMENT value. You have set it to increase by 16,755,134 for each new record. Your field value is set to INT UNSIGNED which means that the range of values is 0 to 4,294,967,295 (or almost 4.3 billion). This means that you would have only 256 values before the field goes beyond the data type limits thereby compromising the purpose of the PRIMARY KEY INDEX.
You could change the data type to BIGINT UNSIGNED and you would have a value range of 0 to 18,446,744,073,709,551,615 (or slightly more than 18.4 quintillion), which would allow you to have up to 1,100,960,700,983 (or slightly more than 1.1 trillion) unique values with this AUTO_INCREMENT value.
I would first ask if you really need to have your AUTO_INCREMENT value set to such a large number and if not then I would suggest changing that to 1 (or at least some lower number) as storing the field values as INT vs BIGINT will save considerable disk space within larger tables such as this. Either way, you should get a more stable PRIMARY KEY INDEX which should help improve queries.
I think the problem is your server's bandwidth. Returning millions of rows would probably need at least high-megabyte bandwidth.

When should I use a composite index?

When should I use a composite index in a database?
What are the performance ramifications of using a composite index?
Why should I use a composite index?
For example, I have a homes table:
CREATE TABLE IF NOT EXISTS `homes` (
`home_id` int(10) unsigned NOT NULL auto_increment,
`sqft` smallint(5) unsigned NOT NULL,
`year_built` smallint(5) unsigned NOT NULL,
`geolat` decimal(10,6) default NULL,
`geolng` decimal(10,6) default NULL,
PRIMARY KEY (`home_id`),
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`)
) ENGINE=InnoDB;
Does it make sense for me to use a composite index for both geolat and geolng, such that:
I replace:
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`),
with:
KEY `geolat_geolng` (`geolat`, `geolng`)
If so:
Why?
What are the performance ramifications of using a composite index?
UPDATE:
Since many people have stated it's entirely dependent on the queries I perform, below is the most common query:
SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
UPDATE 2:
With the following database schema:
CREATE TABLE IF NOT EXISTS `homes` (
`home_id` int(10) unsigned NOT NULL auto_increment,
`primary_photo_group_id` int(10) unsigned NOT NULL default '0',
`customer_id` bigint(20) unsigned NOT NULL,
`account_type_id` int(11) NOT NULL,
`address` varchar(128) collate utf8_unicode_ci NOT NULL,
`city` varchar(64) collate utf8_unicode_ci NOT NULL,
`state` varchar(2) collate utf8_unicode_ci NOT NULL,
`zip` mediumint(8) unsigned NOT NULL,
`price` mediumint(8) unsigned NOT NULL,
`sqft` smallint(5) unsigned NOT NULL,
`year_built` smallint(5) unsigned NOT NULL,
`num_of_beds` tinyint(3) unsigned NOT NULL,
`num_of_baths` decimal(3,1) unsigned NOT NULL,
`num_of_floors` tinyint(3) unsigned NOT NULL,
`description` text collate utf8_unicode_ci,
`geolat` decimal(10,6) default NULL,
`geolng` decimal(10,6) default NULL,
`display_status` tinyint(1) NOT NULL,
`date_listed` timestamp NOT NULL default CURRENT_TIMESTAMP,
`contact_email` varchar(100) collate utf8_unicode_ci NOT NULL,
`contact_phone_number` varchar(15) collate utf8_unicode_ci NOT NULL,
PRIMARY KEY (`home_id`),
KEY `customer_id` (`customer_id`),
KEY `city` (`city`),
KEY `num_of_beds` (`num_of_beds`),
KEY `num_of_baths` (`num_of_baths`),
KEY `geolat` (`geolat`),
KEY `geolng` (`geolng`),
KEY `account_type_id` (`account_type_id`),
KEY `display_status` (`display_status`),
KEY `sqft` (`sqft`),
KEY `price` (`price`),
KEY `primary_photo_group_id` (`primary_photo_group_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci AUTO_INCREMENT=8 ;
Using the following SQL:
EXPLAIN SELECT homes.home_id,
address,
city,
state,
zip,
price,
sqft,
year_built,
account_type_id,
num_of_beds,
num_of_baths,
geolat,
geolng,
photo_id,
photo_url_dir
FROM homes
LEFT OUTER JOIN home_photos ON homes.home_id = home_photos.home_id
AND homes.primary_photo_group_id = home_photos.home_photo_group_id
AND home_photos.home_photo_type_id = 2
WHERE homes.display_status = true
AND homes.geolat BETWEEN -100 AND 100
AND homes.geolng BETWEEN -100 AND 100
EXPLAIN returns:
id select_type table type possible_keys key key_len ref rows Extra
----------------------------------------------------------------------------------------------------------
1 SIMPLE homes ref geolat,geolng,display_status display_status 1 const 2 Using where
1 SIMPLE home_photos ref home_id,home_photo_type_id,home_photo_group_id home_photo_group_id 4 homes.primary_photo_group_id 4
I don't quite understand how to read the EXPLAIN output. Does this look good or bad? Right now, I am NOT using a composite index for geolat and geolng. Should I be?
You should use a composite index when you are using queries that benefit from it. A composite index that looks like this:
index( column_A, column_B, column_C )
will benefit a query that uses those fields for joining, filtering, and sometimes selecting. It will also benefit queries that use left-most subsets of columns in that composite. So the above index will also satisfy queries that need
index( column_A, column_B, column_C )
index( column_A, column_B )
index( column_A )
But it will not help (at least not directly; it may help partially if there are no better indices) for queries that need
index( column_A, column_C )
Notice how column_B is missing.
In your original example, a composite index for two dimensions will mostly benefit queries that query on both dimensions or the leftmost dimension by itself, but not the rightmost dimension by itself. If you're always querying two dimensions, a composite index is the way to go, doesn't really matter which is first (most probably).
Imagine you have the following three queries:
Query I:
SELECT * FROM homes WHERE `geolat`=42.9 AND `geolng`=36.4
Query II:
SELECT * FROM homes WHERE `geolat`=42.9
Query III:
SELECT * FROM homes WHERE `geolng`=36.4
If you have a separate index per column, all three queries use indexes. In MySQL, if you have a composite index (geolat, geolng), only query I and query II (which uses the first part of the composite index) use indexes. In this case, query III requires a full table scan.
The Multiple-Column Indexes section of the manual clearly explains how multiple-column indexes work, so I won't retype it here.
From the MySQL Reference Manual:
A multiple-column index can be considered a sorted array containing values that are created by concatenating the values of the indexed columns.
If you use separate indexes for the geolat and geolng columns, you have two different indexes in your table that you can search independently.
INDEX geolat
-----------
VALUE RRN
36.4 1
36.4 8
36.6 2
37.8 3
37.8 12
41.4 4
INDEX geolng
-----------
VALUE RRN
26.1 1
26.1 8
29.6 2
29.6 3
30.1 12
34.7 4
If you use composite index you have only one index for both columns:
INDEX (geolat, geolng)
-----------
VALUE RRN
36.4,26.1 1
36.4,26.1 8
36.6,29.6 2
37.8,29.6 3
37.8,30.1 12
41.4,34.7 4
RRN is the relative record number (to simplify, you can say ID). The first two indexes are generated separately, and the third index is composite. As you can see, you cannot search based on geolng alone in the composite one, since it is indexed by geolat first; however, it is possible to search by geolat or by "geolat AND geolng" (since geolng is the second-level index).
Also, have a look at the How MySQL Uses Indexes section of the manual.
There could be a misconception about what a composite index does. Many people think a composite index can be used to optimize a search query as long as the WHERE clause covers the indexed columns, in your case geolat and geolng. Let's delve deeper:
I believe your data on the coordinates of homes would be random decimals as such:
home_id geolat geolng
1 20.1243 50.4521
2 22.6456 51.1564
3 13.5464 45.4562
4 55.5642 166.5756
5 24.2624 27.4564
6 62.1564 24.2542
...
Since geolat and geolng values hardly ever repeat, a composite index on geolat and geolng would look something like this:
index_id geolat geolng
1 20.1243 50.4521
2 20.1244 61.1564
3 20.1251 55.4562
4 20.1293 66.5756
5 20.1302 57.4564
6 20.1311 54.2542
...
Therefore the second column of the composite index is basically useless! The speed of your query with a composite index is probably going to be similar to an index on just the geolat column.
As mentioned by Will, MySQL provides spatial extension support. A spatial point is stored in a single column instead of two separate lat/lng columns, and a spatial index can be applied to such a column. However, the efficiency could be overrated, based on my personal experience. It could be that a spatial index does not resolve the two-dimensional problem but merely speeds up the search using R-Trees with quadratic splitting.
The trade-off is that a spatial point consumes much more memory, as it uses eight-byte double-precision numbers for storing coordinates. Correct me if I am wrong.
Composite indexes are useful for
0 or more "=" clauses, plus
at most one range clause.
A composite index cannot handle two ranges. I discuss this further in my index cookbook.
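An illustration using the homes table from the question (a sketch; the literal ranges are made up):
-- One '=' plus one range: INDEX(display_status, geolat) can be used fully.
SELECT * FROM homes
WHERE display_status = true
AND geolat BETWEEN 35.0 AND 36.0;
-- Two ranges: INDEX(geolat, geolng) narrows the scan only by geolat;
-- the geolng condition is checked row by row.
SELECT * FROM homes
WHERE geolat BETWEEN 35.0 AND 36.0
AND geolng BETWEEN -120.0 AND -119.0;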
Find nearest -- If the question is really about optimizing
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
then no index can really handle both dimensions.
Instead, one must 'think out of the box'. If one dimension is implemented via partitioning and the other is implemented by carefully picking the PRIMARY KEY, one can get significantly better efficiency for very large tables of lat/lng lookup. My latlng blog goes into the details of how to implement "find nearest" on the globe. It includes code.
The PARTITIONs are stripes of latitude ranges. The PRIMARY KEY deliberately starts with longitude so that the useful rows are likely to be in the same block. A Stored Routine orchestrates the messy code for doing order by... limit... and for growing the 'square' around the target until you have enough coffee shops (or whatever). It also takes care of the great-circle calculations and handling the dateline and poles.
More
I have written another blog; it compares 5 ways of doing lat/lng searches: http://mysql.rjweb.org/doc.php/latlng#representation_choices (It references the link given above as one of the 5.) One of the other ways is this pair of indexes, which the blog points out is optimal for this particular case:
INDEX(geolat, geolng),
INDEX(geolng, geolat)
That is, having both columns in two indexes, and not having single-column indexes on geolat and geolng is important.
Composite indexes are very powerful as they:
Enforce structure integrity
Enable sorting on a FILTERED id
ENFORCE STRUCTURE INTEGRITY
Composite indexes are not just another type of index; they can provide NECESSARY structure to a table by enforcing integrity as the Primary Key.
MySQL's InnoDB supports clustering, and the following example illustrates why a composite index may be necessary.
To create a friends table (e.g. for a social network), we need two columns: user_id, friend_id.
Table Structure
user_id (medium_int)
friend_id (medium_int)
Primary Key -> (user_id, friend_id)
By definition, a Primary Key (PK) is unique, and by creating a composite PK, InnoDB will automatically check that no duplicate (user_id, friend_id) pair exists when a new record is added. This is the expected behavior, as no user should have more than one record (relationship link) with friend_id = 2, for instance.
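As a sketch, that table as DDL (the table name friends is assumed; medium_int taken as MEDIUMINT UNSIGNED):
CREATE TABLE `friends` (
`user_id` MEDIUMINT UNSIGNED NOT NULL,
`friend_id` MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY (`user_id`, `friend_id`)
) ENGINE=InnoDB;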
Without a composite PK, we can create this schema using a surrogate key:
user_friend_id
user_id
friend_id
Primary Key -> (user_friend_id)
Now, whenever a new record is added we will have to check that a prior record with the combination user_id, friend_id does not already exist.
As such, a composite index can enforce structure integrity.
ENABLE SORTING ON A FILTERED ID
It is very common to sort a set of records by the post's time (timestamp or datetime). Usually, these are posts on a given id. Here is an example:
Table User_Wall_Posts (think of Facebook's wall posts)
user_id (medium_int)
timestamp (timestamp)
author_id (medium_int)
comment_post (text)
Primary Key -> (user_id, timestamp, author_id)
We want to query and find all posts for user_id = 10 and sort the comment posts by timestamp (date).
SQL QUERY
SELECT * FROM User_Wall_Posts WHERE user_id = 10 ORDER BY timestamp DESC
The composite PK enables MySQL to filter and sort the results using the index; MySQL will not have to use a temporary file or filesort to fetch the results. Without a composite key, this would not be possible and would cause a very inefficient query.
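As a sketch, that table as DDL (types mapped from the structure list above; medium_int taken as MEDIUMINT UNSIGNED):
CREATE TABLE `User_Wall_Posts` (
`user_id` MEDIUMINT UNSIGNED NOT NULL,
`timestamp` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
`author_id` MEDIUMINT UNSIGNED NOT NULL,
`comment_post` TEXT,
PRIMARY KEY (`user_id`, `timestamp`, `author_id`)
) ENGINE=InnoDB;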
As such, composite keys are very powerful and suit more than the simple problem of "I want to search for column_a, column_b, so I will use composite keys." For my current database schema, I have just as many composite keys as single keys. Don't overlook the composite key's uses!
To do spatial searches, you need an R-Tree algorithm, which allows searching geographical areas very quickly. Exactly what you need for this job.
Some databases have spatial indexes built in. A quick Google search shows MySQL 5 has them (and looking at your SQL, I'm guessing you're using MySQL).
A composite index can be useful when you want to optimize a GROUP BY clause (see http://dev.mysql.com/doc/refman/5.0/en/group-by-optimization.html).
Please pay attention:
The most important preconditions for using indexes for GROUP BY are that all GROUP BY columns reference attributes from the same index, and that the index stores its keys in order (for example, this is a BTREE index and not a HASH index).
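For instance, in the homes schema above, the existing KEY `city` is a single-column BTREE index, so a query like this sketch satisfies those preconditions:
SELECT city, COUNT(*)
FROM homes
GROUP BY city;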
There is no black-and-white, one-size-fits-all answer.
You should use a composite (or multi-column) index, when your query work load would benefit from one.
You need to profile your query work load in order to determine this.
A composite index comes into play when queries can be satisfied entirely from that index: meaning all the columns required by the query are covered by the index.
UPDATE (in response to the edit to the posted question): If you are selecting * from the table, the composite index may or may not be used. You will need to run EXPLAIN to be sure.
I'm with @Mitch; it depends entirely on your queries. Fortunately, you can create and drop indexes at any time, and you can prepend the EXPLAIN keyword to your queries to see whether the query analyzer uses the indexes.
If you'll be looking up an exact lat/long pair this index would likely make sense. But you're probably going to be looking for homes within a certain distance of a particular place, so your queries will look something like this (see source):
select *, sqrt( pow(h2.geolat - h1.geolat, 2)
+ pow(h2.geolng - h1.geolng, 2) ) as distance
from homes h1, homes h2
where h1.home_id = 12345 and h2.home_id != h1.home_id
order by distance
and the index very likely won't be helpful at all. For geospatial queries, you need something like a spatial (R-Tree) index.
Update: with this query:
SELECT * FROM homes
WHERE geolat BETWEEN ??? AND ???
AND geolng BETWEEN ??? AND ???
The query analyzer could use an index on geolat alone, or an index on geolng alone, or possibly both indexes. I don't think it would use a composite index. But it's easy to try out each of these permutations on a real data set and then (a) see what EXPLAIN tells you and (b) measure the time the query really takes.