What are the differences between PRIMARY, UNIQUE, INDEX and FULLTEXT when creating MySQL tables?
How would I use them?
Differences
KEY or INDEX refers to a normal non-unique index. Non-distinct values for the index are allowed, so the index may contain rows with identical values in all columns of the index. These indexes don't enforce any constraints on your data, so they are used only for access - for quickly reaching certain ranges of records without scanning all records.
UNIQUE refers to an index where all rows of the index must be unique. That is, no two rows may have identical non-NULL values for all columns in this index. As well as being used to quickly reach certain record ranges, UNIQUE indexes can be used to enforce constraints on data, because the database system does not allow the distinct-values rule to be broken when inserting or updating data.
Your database system may allow a UNIQUE index to be applied to columns which allow NULL values, in which case two rows are allowed to be identical if they both contain a NULL value (the rationale here is that NULL is considered not equal to itself). Depending on your application, however, you may find this undesirable: if you wish to prevent this, you should disallow NULL values in the relevant columns.
PRIMARY acts exactly like a UNIQUE index, except that it is always named 'PRIMARY', and there may be only one on a table (and there should always be one; though some database systems don't enforce this). A PRIMARY index is intended as a primary means to uniquely identify any row in the table, so unlike UNIQUE it should not be used on any columns which allow NULL values. Your PRIMARY index should be on the smallest number of columns that are sufficient to uniquely identify a row. Often, this is just one column containing a unique auto-incremented number, but if there is anything else that can uniquely identify a row, such as "countrycode" in a list of countries, you can use that instead.
Some database systems (such as MySQL's InnoDB) will internally store a table's actual records within the PRIMARY KEY's B-tree index.
FULLTEXT indexes are different from all of the above, and their behaviour differs significantly between database systems. FULLTEXT indexes are only useful for full-text searches done with the MATCH() / AGAINST() clause, unlike the above three - which are typically implemented internally using B-trees (allowing for selecting, sorting or ranges starting from the leftmost column) or hash tables (allowing for selection starting from the leftmost column).
Where the other index types are general-purpose, a FULLTEXT index is specialised, in that it serves a narrow purpose: it's only used for a "full text search" feature.
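To make the distinctions concrete, here is a minimal sketch declaring all four index types on one table (the table and column names are invented for illustration):

-- Hypothetical table showing all four index types (MySQL; InnoDB
-- supports FULLTEXT in 5.6.4 and up):
CREATE TABLE articles (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT,
    slug      VARCHAR(100) NOT NULL,
    author_id INT UNSIGNED NOT NULL,
    body      TEXT,
    PRIMARY KEY (id),             -- unique, NOT NULL, one per table
    UNIQUE KEY uq_slug (slug),    -- enforces distinct values
    KEY idx_author (author_id),   -- plain index; duplicates allowed
    FULLTEXT KEY ft_body (body)   -- only used by MATCH() ... AGAINST()
) ENGINE=InnoDB;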
Similarities
All of these indexes may have more than one column in them.
With the exception of FULLTEXT, the column order is significant: for the index to be useful in a query, the query must use columns from the index starting from the left - it can't use just the second, third or fourth part of an index, unless it is also using the previous columns in the index to match static values. (For a FULLTEXT index to be useful to a query, the query must use all columns of the index.)
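For example, here is a sketch of the leftmost-prefix rule (the table and columns are invented):

-- Sketch: a three-column index and which lookups it can serve.
CREATE INDEX idx_abc ON t (a, b, c);

SELECT * FROM t WHERE a = 1;             -- can use the index
SELECT * FROM t WHERE a = 1 AND b = 2;   -- can use the index
SELECT * FROM t WHERE b = 2;             -- cannot use the index
SELECT * FROM t WHERE a = 1 AND c = 3;   -- uses only the a = 1 part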
All of these are kinds of indices.
primary: must be unique, is an index, is (likely) the physical index, can be only one per table.
unique: as it says. You can't have more than one row with a tuple of this value. Note that since a unique key can be over more than one column, this doesn't necessarily mean that each individual column in the index is unique, but that each combination of values across these columns is unique.
index: if it's not primary or unique, it doesn't constrain values inserted into the table, but it does allow them to be looked up more efficiently.
fulltext: a more specialized form of indexing that allows full text search. Think of it as (essentially) creating an "index" for each "word" in the specified column.
I feel like this has been well covered, maybe except for the following:
A plain KEY / INDEX (otherwise called a SECONDARY INDEX) does increase performance if selectivity is sufficient. On this matter, the usual rule of thumb is that if the result set an index is applied to exceeds 20% of the parent table's total number of records, the index will be ineffective. In practice each architecture differs, but the idea still holds.
Secondary indexes (and that is very specific to MySQL) should not be seen as completely separate objects from the primary key. In fact, the two should be used jointly and, once this information is known, provide an additional tool to the MySQL DBA: in MySQL, secondary indexes embed the primary key. This leads to significant performance improvements, specifically when cleverly building implicit covering indexes such as described there.
If you feel like your data should be UNIQUE, use a unique index. You may think enforcing it is optional (for instance, checking uniqueness at the application level) and that a normal index will do, but a unique index actually represents a guarantee for MySQL that each row is unique, which incidentally provides a performance benefit.
You can only use FULLTEXT (otherwise called a SEARCH INDEX) with the InnoDB (in MySQL 5.6.4 and up) and MyISAM engines.
You can only use FULLTEXT on CHAR, VARCHAR and TEXT column types.
A FULLTEXT index involves a LOT more than just creating an index: there's a bunch of system tables created, a completely separate caching system and some specific rules and optimizations applied. See http://dev.mysql.com/doc/refman/5.7/en/fulltext-restrictions.html and http://dev.mysql.com/doc/refman/5.7/en/innodb-fulltext-index.html
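A minimal sketch of declaring and querying such an index (table and column names invented):

-- Sketch: FULLTEXT declaration and a MATCH() ... AGAINST() query.
ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body);

SELECT id, MATCH(body) AGAINST('database indexes') AS relevance
FROM articles
WHERE MATCH(body) AGAINST('database indexes')
ORDER BY relevance DESC;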
Related
What could be the possible downside of having a UNIQUE constraint for a large string (varchar) (roughly 100 characters or so) in MySQL during:
insert phase
retrieval phase (on another primary key)
Can the length of the query impact the performance of reads/writes? (Apart from disk/memory usage for book-keeping.)
Thanks
Several issues. There is a limit on the size of a column in an index (191, 255, 767, 3072, etc, depending on various things).
Your column fits within the limit.
Simply make a UNIQUE or PRIMARY key for that column. There are minor performance concerns, but keep this in mind: Fetching a row is more costly than any datatype issues involving the key used to locate it.
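A sketch of that straightforward case (names invented):

-- Sketch: the column fits within the key-length limit,
-- so declare the uniqueness directly.
CREATE TABLE users (
    id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    UNIQUE KEY uq_name (name)
) ENGINE=InnoDB;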
Your column won't fit.
Now the workarounds get ugly.
Index prefixing (INDEX foo(50)) has a number of problems and inefficiencies.
UNIQUE foo(50) is flat out wrong. It is declaring that the first 50 characters are constrained to be unique, not the entire column.
Workarounds with hashing the string (cf md5, sha1, etc) have a number of problems and inefficiencies. Still, this may be the only viable way to enforce uniqueness of a long string.
(I'll elaborate if needed.)
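To elaborate slightly on the hashing route, here is one hedged sketch (it assumes MySQL 5.7+ generated columns; names are invented):

-- Sketch: enforce uniqueness of a long string via a hash of it.
CREATE TABLE long_values (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    long_str TEXT NOT NULL,
    str_hash BINARY(20) AS (UNHEX(SHA1(long_str))) STORED,
    UNIQUE KEY uq_str_hash (str_hash)
) ENGINE=InnoDB;
-- Caveats from above still apply: this adds a column and an index,
-- equality checks must go through the hash, and range scans get no help.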
Fetching a row (Assuming the statement is parsed and the PRIMARY KEY is available.)
Drill down the BTree containing the data (and ordered by the PK). This may involve bringing a block (or more) from disk into the buffer_pool.
Parse the block to find the row. (There are probably dozens of rows in the block.)
At some point in the process lock the row for reading and/or be blocked by some other connection that is, say, updating or deleting.
Pick apart the row -- that is, split into columns.
For any text/blob columns needed, reach into the off-record storage. (Wide columns are not stored with the narrow columns of the row; they are stored in other block(s).) The costly part is locating (and reading from disk if not cached) the extra block(s) containing the big TEXT/BLOB.
Convert from the internal storage (not word-aligned, little-endian, etc) into the desired format. (A small amount of CPU code, but necessary. This means that the data files are compatible across OS and even hardware.)
If the next step is to compare two strings (for JOIN or ORDER BY), then that is a simple subroutine call that scans over however many characters there are. (OK, most utf8 collations are not 'simple'.) And, yes, comparing two INTs would be faster.
Disk space
Should INT be used instead of VARCHAR(100) for the PRIMARY KEY? It depends.
Every secondary key has a copy of the PRIMARY KEY in it. This implies that a PK that is VARCHAR(100) makes secondary indexes bulkier than if the PK were INT.
If there are no secondary keys, then the above comment implies that INT is the bulkier approach!
If there are more than 2 secondary keys, then using varchar is likely to be bulkier.
(For exactly one secondary key, it is a tossup.)
Speed
If all the columns of a SELECT are in a secondary index, the query may be performed entirely in the index's BTree. ("Covering index", as indicated in EXPLAIN by "Using index".) This is sometimes a worthwhile optimization.
If the above does not apply, and it is useful to look up row(s) via a secondary index, then there are two BTree lookups -- once in the index, then via the PK. This is sometimes a noticeable slowdown.
The point here is that artificially adding an INT id may be slower than simply using the bulky VARCHAR as the PK. Each case should be judged on its tradeoffs; I am not making a blanket statement.
This describes different indexes:
KEY or INDEX refers to a normal non-unique index. Non-distinct values
for the index are allowed, so the index may contain rows with
identical values in all columns of the index. These indexes don't
enforce any constraints on your data so they are used only for making
sure certain queries can run quickly.
UNIQUE refers to an index where all rows of the index must be unique.
That is, the same row may not have identical non-NULL values for all
columns in this index as another row. As well as being used to speed
up queries, UNIQUE indexes can be used to enforce constraints on data,
because the database system does not allow this distinct values rule
to be broken when inserting or updating data.
I understand the benefit to application logic (you don't need the uniqueness check) but is there also a performance improvement? Specifically, how much faster are writes using INDEX instead of UNIQUE?
UNIQUE KEY is a constraint, and you use it when you want to enforce that constraint.
KEY is an index, which you pick to make certain queries more efficient.
The performance of inserting into a table with either type of index is virtually the same. That is, the difference, if any, is so minor that it's not worth picking one over the other for the sake of performance.
Choose the type of index to support your constraints. Use UNIQUE KEY if and only if you want to enforce uniqueness. Use KEY otherwise.
Your question is like asking, "which is faster, a motorcycle or a speedboat?" They are used in different situations, so judging them on their speed isn't the point.
INSERT
When a row is inserted, all unique keys (PRIMARY and UNIQUE) are immediately checked for duplicate keys. This is so that you get an error on the INSERT if necessary. The updating of non-unique INDEXes is delayed (for discussion, see "Change buffering"). The work will be done in the background so your INSERT won't be waiting for it.
So, there is a slight overhead in UNIQUE for inserting. But, as already pointed out, if you need the uniqueness constraint, then use it.
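A sketch of that synchronous check (names invented):

-- Sketch: the uniqueness check happens at INSERT time.
CREATE TABLE u (name VARCHAR(50), UNIQUE KEY uq_name (name));
INSERT INTO u VALUES ('alice');   -- OK
INSERT INTO u VALUES ('alice');   -- fails immediately with ER_DUP_ENTRY (1062)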
SELECT
Any kind of index (PRIMARY, UNIQUE, or INDEX) may be used to speed up a SELECT. Mostly, the types of index work identically. However, with PRIMARY and UNIQUE, the optimizer can know that there will be only one (or possibly zero) rows matching a given value, so it can fetch the one row, then quit. For a non-unique index, there could be more than one row, so it keeps scanning the index, checking for more rows; this scan stops after peeking at the first non-matching row. So, there is a small (very small) overhead for non-unique indexes versus unique ones.
Bottom Line
The performance issues are less important than the semantics (uniqueness constraint vs. not).
I am trying to understand exactly what is and is not useful in a multiple-field index. I have read this existing question (and many more) plus other sites/resources (MySQL Performance Blog, Percona slideshares, etc.) but I'm not totally confident that what I've found on the subject is current and accurate. So please bear with me while I repeat some of what I think I know.
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Assuming the above is at least generally accurate, here's my puzzle. I have a slowly changing dimension table defined with the following columns (more or less) and using MyISAM:
dim_owner_ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
person_ID INT UNSIGNED NOT NULL,
raw_name VARCHAR(92) NOT NULL,
first VARCHAR(30),
middle VARCHAR(50),
last VARCHAR(30),
suffix CHAR(3),
flag CHAR(1)
Each "owner" is a unique instance of a particular individual with a particular name, so if Sue Smith changes her name to Sue Brown, that results in two rows that are the same except for the last field and the surrogate key. My understanding is that the only way to enforce this constraint internally is to do:
UNIQUE INDEX uq_owner_complete (person_ID, raw_name, first, middle, last, suffix, flag)
And that's basically going to duplicate the entire table (except for the surrogate key).
I also need to index a few other fields for quick joins and searches. While there will be some writes, and disk space is neither free nor infinite, read performance is absolutely the #1 priority here. These smaller indexes should serve very well to cover the conditions of the queries that will be run against the table, but in almost every case, the entire row needs to be selected.
With that in mind:
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
If this were an InnoDB index, the PK would already be there, but since it's MyISAM it's got pointers to the full rows instead. So if I'm understanding things correctly, there's no point (no pun intended) to adding the PK to any other index, unless doing so would allow the retrieval of the desired result set directly from the index. Which is not likely here.
I understand if it seems like I'm trying too hard to optimize, and maybe I am, but the tasks I need to perform using this database take weeks at a time, so every little bit helps.
You have to understand one concept. An index (either InnoDB or MyISAM, either primary or secondary) is a data structure called a "B+ tree".
Each node in the B+ tree is a couple (k, v), where k is a key and v is a value. If you build an index on last_name, your keys will be "Smith", "Johnson", "Kuzminsky" etc.
The value in the index is some data. If the index is a secondary index, then the data is the primary key value.
So if you build an index on last_name, each node will be a couple (last_name, id), e.g. ("Smith", 5).
The primary index is an index where k is the primary key and the data is all the other fields.
Bearing in mind the above, let me comment on some points:
By indexing wisely, I can not only reduce how long it takes to match my query condition(s), but also reduce how long it takes to fetch the fields I want in my query result.
Not exactly. If your secondary index is good you can quickly find v based on your query condition. E.g. you can quickly find the PK by last name.
The index is just a sorted, duplicated subset of the full data, paired with pointers (MyISAM) or PKs (InnoDB), that I can search more efficiently than the full table.
The index is a B+ tree where each node is a couple of the indexed field(s) value(s) and the PK.
Given the above, using an index to match my condition(s) really happens in the same way as fetching my desired result, except I created this special-purpose table (the index) that gets me an intermediate result set really quickly; and with this intermediate result set I can retrieve my final desired result set much more efficiently than by performing a full table scan.
Not exactly. If there were no index you'd have to scan the whole table and choose only the records where last_name = "Smith". But you have an index (last_name, PK), so having the key "Smith" you can quickly find all PKs where last_name = "Smith". And then you can quickly find your full result (because you need not only the last name, but the first name too). So you're right, queries like SELECT * FROM table WHERE last_name = "Smith" are executed in two steps:
Find all matching PKs.
By PK, find the full records.
Furthermore, if the index covers all the fields in my query (not just the conditions), instead of an intermediate result set, the index will give me everything I need without having to fetch any rows from the complete table.
Exactly. If your index is actually (last_name, first_name, id) and your query is SELECT first_name FROM table WHERE last_name = "Smith", you don't do the second step. You have the first name in the secondary index, so you don't have to go to the primary index.
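A quick way to confirm this behaviour is to look for "Using index" in EXPLAIN (a sketch; table and index names invented):

-- Sketch: "Using index" in the Extra column means the covering
-- secondary index answered the query without a PK lookup.
CREATE INDEX idx_name ON people (last_name, first_name);
EXPLAIN SELECT first_name FROM people WHERE last_name = 'Smith';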
InnoDB tables are clustered on the PK, so rows with consecutive PKs are likely to be stored in the same block (given many rows per block), and I can grab a range of rows with consecutive PKs fairly efficiently.
Right. Two neighboring PK values will most likely be in the same page. Well, except for the case where one PK is the last value in a page and the next PK value is stored in the next page.
Basically, this is why B+ tree structure was invented. It's not only efficient for search but also efficient in sequential access. And until recently we had rotating hard drives.
MyISAM tables are not clustered; there is some hidden internal row ordering that has no fixed relation to the PK (or any index), so any time I want to grab a set of rows, I may have to retrieve a different block for every single row - even if these rows have consecutive PKs.
Right. If you insert new records into a MyISAM table, the records will be added to the end of the MYD file regardless of the PK order.
The primary index of a MyISAM table will be a B+ tree with pointers to records in the MYD file.
Now about your particular problem. I don't see any reason to define UNIQUE INDEX uq_owner_complete.
Is there any reasonable middle ground between sticking with short, single-field indexes (prefix where possible) and expanding every index to cover the entire table?
The best approach is to have a secondary index on all columns that are used in the WHERE clause, except low-selectivity fields (like sex). The most selective fields must go first in the index. For example, (last_name, eye_color) is good; (eye_color, last_name) is bad.
If the covering index allows you to avoid an additional PK lookup, that's excellent. But if not, that's acceptable too.
How would the latter be any different from storing the entire dataset five times on disk, but sorted differently each time?
Not much; a covering index over every column is essentially exactly that.
Is there any benefit to adding the PK/surrogate ID to each of the smaller indexes in the hope that the query optimizer will be able to work some sort of index merge magic?
The PK is already a part of the index. (Remember, it's stored as data.) So it makes no sense to explicitly add PK fields to a secondary index. Note that this applies to InnoDB; MyISAM secondary indexes, like MyISAM primary indexes, store pointers into the data file rather than PK values.
To summarize:
Make your PK as short as possible (a surrogate PK works great).
Add as many indexes as you need until write performance becomes unacceptable for you.
I am trying to understand indexes in MySQL. I know that an index created in a table can speed up executing queries and it can slow down the inserting and updating of rows.
When creating an index, I used this query on a table called authors that contains (AuthorNum, AuthorFName, AuthorLName, ...)
Create index Index_1 on Authors ([What to put here]);
I know I have to put a column name, but which one?
Do I have to put the column name that will be compared in the WHERE clause when a user queries the table, or what?
The Anatomy of an Index
An index is a distinct data structure within a database; it is redundant data. Its primary purpose is to provide an ordered representation of the indexed data through a logical ordering which is independent of the physical ordering. We do this using a doubly linked list and a tree structure known as the balanced search tree (B-tree). B-trees are nice because they keep data sorted and allow searches, access, insertions, and deletions in logarithmic time.

Because of the doubly linked list, we are able to go backwards or forwards as needed on the index for various queries easily. Inserts become simple since we only have to rearrange pointers to the different pieces of data. Databases use these doubly linked lists to connect leaf nodes (usually in a B+ tree or B-tree), each of which is stored in a page, and to establish logical ordering between the leaf nodes. Operations like UPDATE or INSERT become slower because they are actually two writing operations in the filesystem (one for the table data and one for the index data).
Defining an Optimal Index With WHERE
To define an optimal index you must not only understand how indexes work, but you must also understand how the application queries the data. E.g., you must know the column combinations that appear in the WHERE clause.
A common restriction with queries on LAST_NAME and FIRST_NAME columns deals with case sensitivity. For example, instead of doing an exact search like Hotinger we would prefer to match all results such as HoTingEr and so on. This is very easy to do in a WHERE clause: we just say WHERE UPPER(LAST_NAME) = UPPER('Hotinger').
However, if we define an index on LAST_NAME and run that query, it will actually perform a full table scan, because the query is not on LAST_NAME but on UPPER(LAST_NAME). From the database's perspective, this is a completely different thing. So, in this case you should define the index on UPPER(LAST_NAME) instead.
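In MySQL this needs a little care: functional key parts only arrived in 8.0.13, and before that an indexed generated column (5.7+) serves the same purpose. A hedged sketch (table and column names invented):

-- MySQL 8.0.13+: functional key part (note the double parentheses).
CREATE INDEX idx_upper_last ON people ((UPPER(last_name)));

-- MySQL 5.7: index a stored generated column instead.
ALTER TABLE people
    ADD COLUMN last_name_upper VARCHAR(30) AS (UPPER(last_name)) STORED,
    ADD INDEX idx_last_upper (last_name_upper);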
Indexes do not necessarily have to be for one column. For example, if the primary key is a composite key (consisting of multiple columns) it will create a concatenated index also known as a combined index. Note that the ordering of the concatenated index has a significant impact on its usability and scalability so it must be chosen carefully. Basically, the ordering should match the way it is ordered in the WHERE clause.
Defining an Optimal Index With LIKE
The position of the wildcard characters makes a huge difference. LIKE clauses only use the characters before the wildcard during tree traversal; the rest do not narrow the scanned index range. The more selective the prefix of the LIKE clause, the narrower the scanned index range becomes. This makes the index lookup faster. As a tip, avoid LIKE clauses which lead with wildcards, like '%OTINGER%'. For full-text searches, MySQL offers the MATCH and AGAINST keywords. Starting with MySQL 5.6, you can have full-text indexes on InnoDB tables. Look at Full-Text Search Functions from MySQL for a more in-depth discussion on indexing these results.
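A small sketch of the difference (names invented):

-- Sketch: only the prefix before the wildcard narrows the B-tree scan.
SELECT * FROM people WHERE last_name LIKE 'Hot%';    -- index range scan
SELECT * FROM people WHERE last_name LIKE '%inger';  -- no usable prefix; scans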
Yes, generally you need an index on the column or columns that you compare in the WHERE clause of your queries to speed up queries.
If you search by AuthorFName, then you create an index on that column. If they search by AuthorLName, then you create an index on that column.
In this case though, maybe what you should be looking at is a FULLTEXT index. That would allow users to enter fuzzy queries, which would return a number of results ordered by relevance.
From the MySQL Manual:
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data. If
a table has 1,000 rows, this is at least 100 times faster than reading
sequentially. If you need to access most of the rows, it is faster to
read sequentially, because this minimizes disk seeks.
An index usually means a B-Tree. Understand the structure of the B-Tree and you'll understand what index can and cannot do.
In your particular case:
WHERE AuthorLName = 'something' and WHERE AuthorLName LIKE 'something%' can be sped up by an index on {AuthorLName}.
WHERE AuthorLName = 'something' AND AuthorFName = 'something else' can be sped up by a composite index on {AuthorLName, AuthorFName} or {AuthorFName, AuthorLName}.
WHERE AuthorLName = 'something' OR AuthorFName = 'something else' (which doesn't make much sense, but is here as an example) can be sped up by having two indexes: one on {AuthorLName} and one on {AuthorFName}.
WHERE AuthorLName LIKE '%something' cannot be sped up by a B-Tree index (consider full-text indexing).
Etc...
See Use The Index, Luke! for a much more thorough treatment of the subject than possible in a simple SO post.
Limited length index:
When using text columns or very large varchar columns you won't be able to create an index over the entire length of the text/varchar; there are limits on index key length (for example, 1000 bytes for MyISAM, and 767 or 3072 bytes for InnoDB depending on the row format).
In such a case you specify the length in the index declaration.
CREATE INDEX `my_limited_length_index` ON `my_table`(`long_text_content`(512));
-- please notice the use of the numeric length of the index after the column name
Processed value index (available in PostgreSQL; MySQL gained equivalent functional key parts only in 8.0.13):
Indexes are not exclusively built from one column; some may be built from multiple columns, and others may be built from just some of the info a column has. For example, if you have a full datetime column but you know you're only going to filter records by date, you can build an index based on the datetime column but only containing the date info.
-- `my_table` has a `created` column of type timestamp
CREATE INDEX `my_date_created` ON `my_table`(DATE(`created`));
-- please notice the use of the DATE function which extracts only
-- the date from the `created` timestamp
An index should span the columns you are going to use in your WHERE clauses.
To better understand, here is an example:
SELECT * FROM Authors WHERE AuthorNum > 10 AND AuthorLName LIKE 'A%';
SELECT * FROM Authors WHERE AuthorLName LIKE 'Be%';
If you are often using the queries shown above, you are highly advised to have two indexes:
Create index AuthNum_AuthLName_Index on Authors (AuthorNum, AuthorLName);
Create index AuthLName_Index on Authors (AuthorLName);
The key thing to remember: an index should have the same combination of columns that is used in your WHERE clauses.
I have a MySQL table of 3 integer fields. None of the fields have a unique value, but the three of them combined are unique.
When I query this table, I only search by the first field.
Which approach is recommended for indexing such a table?
Having a multiple-field primary key on the 3 fields, or setting an index on the first field, which is not unique?
Thanks,
Doori Bar
Both. You'll need the multi-field primary key to ensure uniqueness, and you'll want the index on the first field for speed during searches.
You can have a UNIQUE Constraint on the three fields combined to meet your data quality standards. If you are primarily searching by Field1 then you should have an index on it.
You should also consider how you JOIN this table.
Your indexes should really support the bigger workload first - you will have to look at the execution plan to determine what suits you best.
The primary key will prevent your application from accidentally inserting duplicate rows. You probably want that.
Order the columns in the PK correctly, though, or make a clustered index on the first column for better performance. Compare how the query runs with the PK present, and with and without the index on the first column.
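A sketch of that ordering (column names invented): if the column you search by comes first in the composite PK, its leftmost prefix already serves those lookups:

-- Sketch: the searched column leads the composite PK (InnoDB).
CREATE TABLE triple (
    field1 INT NOT NULL,
    field2 INT NOT NULL,
    field3 INT NOT NULL,
    PRIMARY KEY (field1, field2, field3)
) ENGINE=InnoDB;

-- With field1 leading the PK, lookups on field1 alone can use
-- the PK's leftmost prefix:
SELECT * FROM triple WHERE field1 = 42;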
If you're using InnoDB, you must have a clustered index. If you don't specify one, MySQL will use one in the background anyway. So, you may as well use a clustered (unique) primary key by combining all three columns.
The primary key will also then prevent duplicates, which is a bonus.
If you're returning all three integer fields, then you'll have a covered index, which means that the database won't even have to touch the actual record. It will get everything it needs right from the index.
The only caveat would be inserts (and appends). Updating a clustered index, especially on multiple columns, does carry some performance penalty. It will be up to you to test and determine the best approach.