I would like to make system whitch allows to search user messages, by specific user.
assume having folowing table
create table messages(
user_id int,
message nvarchar(500));
So what kind of index I should use here, if I want to search for all messages from user 1, containing word 'foo'.
Simple, non unique index user_id
It will filter only specific user messages nd then full scan for specific word.
FULLTEXT index on message
this will find all messages from all users and then filter by ID, seems to be very inefficient in case of big amount of users.
comopound index on both user_id and message
So full text index tree is created for each user separately, so they can be searched individually. During query system filters messages by ID and then performs text search on remaining rows in index.
A.F.A.I.K. last one is impossible. So then I assume I shall use 1-st option, It will perform better in case of few thousands of users?
And if each will have ~100 messages, full iteration won't cost much resources?
Perhaps I can include username into message and use BOOLEAN full text search mode, but I think it would be slower than by using indexed user_id.
#Alden Quimby's answer is correct as far as it goes, but there is more to the story, because MySQL will only try to choose the optimal index, and its ability to make that determination is limited because of the way fulltext indexes interact with the optimizer.
What actually happens is this:
If the specified user_id exists in either 0 or 1 matching rows in the table, the optimizer will realize this and will choose user_id as the index for that query. Fast execution.
Otherwise, the optimizer will choose the fulltext index, filtering every row matched by the fulltext index to eliminate rows not containing a user_id that matches the WHERE clause. Not quite as fast.
So it's not truly the "optimum" path. It's more like fulltext, with a nice optimization to avoid the fulltext search under the one condition that we know we have almost nothing of interest in the table.
The reason this breaks down is that a fulltext index doesn't give any meaningful statistics back to the optimizer. It just says "yeah, I think that query should probably only require me to check 1 row" ... which, of course, pleases the optimizer greatly, so the fulltext index wins the bid for lowest cost, unless the index with the integer value also comes in comparably low or lower.
Still, that doesn't mean I wouldn't try it this way first.
There's another option, which would work best with fulltext queries IN BOOLEAN MODE and that is to create another column which you would populate with something like CONCAT('user_id_',user_id) or something similar, and then declare a 2-column fulltext index.
filter_string VARCHAR(48) # populated with CONCAT('user_id_',user_id);
....
FULLTEXT KEY (message,filter_string)
Then specify everything in the query.
SELECT ...
WHERE user_id = 500 AND
MATCH (message,filter_string) AGAINST ('+kittens +puppies +user_id_500' IN BOOLEAN MODE);
Now, the fulltext index will be responsible for matching only those rows where kittens, puppies, and "user_id_500" appears in the combined fulltext index of the two columns, but you'd still want to have the integer filter there too to make sure the final results are constrained in spite of any random appearance of "user_id_500" in the message.
You should add a fulltext index on message and a regular index on user_id, and use the query:
SELECT *
FROM messages
WHERE MATCH(message) AGAINST(#search_query)
AND user_id = #user_id;
You're right that you can't do option 3. But rather than trying to pick between 1 and 2, let MySQL do the work for you. MySQL will only use one of the two indexes, and will do a linear scan to complete the second filter, but it will estimate the effectiveness of each index and choose the optimal one.
Note: only do this if you can afford the overhead of two indexes (slower insert/update/delete). Also, if you know that each user will only have a few messages, then yes it might make sense to use a simple index and do a regex in the application layer or something like that.
Turn on the "Optimizer trace" and look for "considered_execution_plans". I contend that the Optimizer will always pick the FULLTEXT index, even when some other index might be better. This may be because it is quite costly when the MATCH is not pre-computed as when the FT index is built.
More on Optimizer Trace: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#optimizer_trace (Earlier in that doc are my tips on FULLTEXT.)
Related
I would like to make system whitch allows to search user messages, by specific user.
assume having folowing table
create table messages(
user_id int,
message nvarchar(500));
So what kind of index I should use here, if I want to search for all messages from user 1, containing word 'foo'.
Simple, non unique index user_id
It will filter only specific user messages nd then full scan for specific word.
FULLTEXT index on message
this will find all messages from all users and then filter by ID, seems to be very inefficient in case of big amount of users.
comopound index on both user_id and message
So full text index tree is created for each user separately, so they can be searched individually. During query system filters messages by ID and then performs text search on remaining rows in index.
A.F.A.I.K. last one is impossible. So then I assume I shall use 1-st option, It will perform better in case of few thousands of users?
And if each will have ~100 messages, full iteration won't cost much resources?
Perhaps I can include username into message and use BOOLEAN full text search mode, but I think it would be slower than by using indexed user_id.
#Alden Quimby's answer is correct as far as it goes, but there is more to the story, because MySQL will only try to choose the optimal index, and its ability to make that determination is limited because of the way fulltext indexes interact with the optimizer.
What actually happens is this:
If the specified user_id exists in either 0 or 1 matching rows in the table, the optimizer will realize this and will choose user_id as the index for that query. Fast execution.
Otherwise, the optimizer will choose the fulltext index, filtering every row matched by the fulltext index to eliminate rows not containing a user_id that matches the WHERE clause. Not quite as fast.
So it's not truly the "optimum" path. It's more like fulltext, with a nice optimization to avoid the fulltext search under the one condition that we know we have almost nothing of interest in the table.
The reason this breaks down is that a fulltext index doesn't give any meaningful statistics back to the optimizer. It just says "yeah, I think that query should probably only require me to check 1 row" ... which, of course, pleases the optimizer greatly, so the fulltext index wins the bid for lowest cost, unless the index with the integer value also comes in comparably low or lower.
Still, that doesn't mean I wouldn't try it this way first.
There's another option, which would work best with fulltext queries IN BOOLEAN MODE and that is to create another column which you would populate with something like CONCAT('user_id_',user_id) or something similar, and then declare a 2-column fulltext index.
filter_string VARCHAR(48) # populated with CONCAT('user_id_',user_id);
....
FULLTEXT KEY (message,filter_string)
Then specify everything in the query.
SELECT ...
WHERE user_id = 500 AND
MATCH (message,filter_string) AGAINST ('+kittens +puppies +user_id_500' IN BOOLEAN MODE);
Now, the fulltext index will be responsible for matching only those rows where kittens, puppies, and "user_id_500" appears in the combined fulltext index of the two columns, but you'd still want to have the integer filter there too to make sure the final results are constrained in spite of any random appearance of "user_id_500" in the message.
You should add a fulltext index on message and a regular index on user_id, and use the query:
SELECT *
FROM messages
WHERE MATCH(message) AGAINST(#search_query)
AND user_id = #user_id;
You're right that you can't do option 3. But rather than trying to pick between 1 and 2, let MySQL do the work for you. MySQL will only use one of the two indexes, and will do a linear scan to complete the second filter, but it will estimate the effectiveness of each index and choose the optimal one.
Note: only do this if you can afford the overhead of two indexes (slower insert/update/delete). Also, if you know that each user will only have a few messages, then yes it might make sense to use a simple index and do a regex in the application layer or something like that.
Turn on the "Optimizer trace" and look for "considered_execution_plans". I contend that the Optimizer will always pick the FULLTEXT index, even when some other index might be better. This may be because it is quite costly when the MATCH is not pre-computed as when the FT index is built.
More on Optimizer Trace: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#optimizer_trace (Earlier in that doc are my tips on FULLTEXT.)
I have the following problem when running a mysql query:
Query is very slow and when i use explain the query key is null but possible_keys are avaiable and the order is correct, i also tried adding independent indexes per each row but still key was NULL.
You can see table, index and mysql explain here: https://snag.gy/vcChl6.jpg
The optimizer likely has just decided that there is no reason to use the index.
Since you are using SELECT * that means that means that if it used the index, then it would have to use the primary key from the index to then go back and look up all the necessary data from the clustered index. That is referred to as a double lookup, and is generally bad for performance. As there are so few records in this table, the optimizer likely decided that it can easily do a full table scan instead and get your result faster.
In short, this is expected behavior.
If you want to SELECT just some columns, add them to the t1 index and then just SELECT only the columns you need, with that given WHERE clause. It should use the index then. As your table grows in size, it may start using the index as well, once it estimates that the double lookup is cheaper than the full table scan.
A guess: Most rows are of that 'project' and that 'lang'.
The Optimizer does not understand that fact, so it takes the index that is obviously the best:
(id_project, id_lang)
This one would be equally good: (id_lang, id_project).
No fair... The EXPLAIN mentions indexes named id_project and id_lang (not useful), but the list of indexes shows a composite index t1(id_project, id_lang) (useful).
Then, as Willem suggests, it has to bounce between the index and the table. Normally (that is, when it has adequate statistics), the Optimizer will say "Oh, more than ~20% of the table is being referenced; let's ignore any index."
Things you can do:
Get rid of that index.
Change * to a list of just the columns you need. In particular, if you avoid the 3 TEXT columns, two optimizations kick in. Alternatively, any that will never be longer than 255 characters can be changed to VARCHAR(255).
Use some other filtering, ordering, limiting, etc. If this is a web application, do you really want to get ~534 rows?
I am really interested in how MySQL indexes work, more specifically, how can they return the data requested without scanning the entire table?
It's off-topic, I know, but if there is someone who could explain this to me in detail, I would be very, very thankful.
Basically an index on a table works like an index in a book (that's where the name came from):
Let's say you have a book about databases and you want to find some information about, say, storage. Without an index (assuming no other aid, such as a table of contents) you'd have to go through the pages one by one, until you found the topic (that's a full table scan).
On the other hand, an index has a list of keywords, so you'd consult the index and see that storage is mentioned on pages 113-120,231 and 354. Then you could flip to those pages directly, without searching (that's a search with an index, somewhat faster).
Of course, how useful the index will be, depends on many things - a few examples, using the simile above:
if you had a book on databases and indexed the word "database", you'd see that it's mentioned on pages 1-59,61-290, and 292 to 400. In such case, the index is not much help and it might be faster to go through the pages one by one (in a database, this is "poor selectivity").
For a 10-page book, it makes no sense to make an index, as you may end up with a 10-page book prefixed by a 5-page index, which is just silly - just scan the 10 pages and be done with it.
The index also needs to be useful - there's generally no point to index e.g. the frequency of the letter "L" per page.
The first thing you must know is that indexes are a way to avoid scanning the full table to obtain the result that you're looking for.
There are different kinds of indexes and they're implemented in the storage layer, so there's no standard between them and they also depend on the storage engine that you're using.
InnoDB and the B+Tree index
For InnoDB, the most common index type is the B+Tree based index, that stores the elements in a sorted order. Also, you don't have to access the real table to get the indexed values, which makes your query return way faster.
The "problem" about this index type is that you have to query for the leftmost value to use the index. So, if your index has two columns, say last_name and first_name, the order that you query these fields matters a lot.
So, given the following table:
CREATE TABLE person (
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
INDEX (last_name, first_name)
);
This query would take advantage of the index:
SELECT last_name, first_name FROM person
WHERE last_name = "John" AND first_name LIKE "J%"
But the following one would not
SELECT last_name, first_name FROM person WHERE first_name = "Constantine"
Because you're querying the first_name column first and it's not the leftmost column in the index.
This last example is even worse:
SELECT last_name, first_name FROM person WHERE first_name LIKE "%Constantine"
Because now, you're comparing the rightmost part of the rightmost field in the index.
The hash index
This is a different index type that unfortunately, only the memory backend supports. It's lightning fast but only useful for full lookups, which means that you can't use it for operations like >, < or LIKE.
Since it only works for the memory backend, you probably won't use it very often. The main case I can think of right now is the one that you create a temporary table in the memory with a set of results from another select and perform a lot of other selects in this temporary table using hash indexes.
If you have a big VARCHAR field, you can "emulate" the use of a hash index when using a B-Tree, by creating another column and saving a hash of the big value on it. Let's say you're storing a url in a field and the values are quite big. You could also create an integer field called url_hash and use a hash function like CRC32 or any other hash function to hash the url when inserting it. And then, when you need to query for this value, you can do something like this:
SELECT url FROM url_table WHERE url_hash=CRC32("http://gnu.org");
The problem with the above example is that since the CRC32 function generates a quite small hash, you'll end up with a lot of collisions in the hashed values. If you need exact values, you can fix this problem by doing the following:
SELECT url FROM url_table
WHERE url_hash=CRC32("http://gnu.org") AND url="http://gnu.org";
It's still worth to hash things even if the collision number is high cause you'll only perform the second comparison (the string one) against the repeated hashes.
Unfortunately, using this technique, you still need to hit the table to compare the url field.
Wrap up
Some facts that you may consider every time you want to talk about optimization:
Integer comparison is way faster than string comparison. It can be illustrated with the example about the emulation of the hash index in InnoDB.
Maybe, adding additional steps in a process makes it faster, not slower. It can be illustrated by the fact that you can optimize a SELECT by splitting it into two steps, making the first one store values in a newly created in-memory table, and then execute the heavier queries on this second table.
MySQL has other indexes too, but I think the B+Tree one is the most used ever and the hash one is a good thing to know, but you can find the other ones in the MySQL documentation.
I highly recommend you to read the "High Performance MySQL" book, the answer above was definitely based on its chapter about indexes.
Basically an index is a map of all your keys that is sorted in order. With a list in order, then instead of checking every key, it can do something like this:
1: Go to middle of list - is higher or lower than what I'm looking for?
2: If higher, go to halfway point between middle and bottom, if lower, middle and top
3: Is higher or lower? Jump to middle point again, etc.
Using that logic, you can find an element in a sorted list in about 7 steps, instead of checking every item.
Obviously there are complexities, but that gives you the basic idea.
Take a look at this link: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
How they work is too broad of a subject to cover in one SO post.
Here is one of the best explanations of indexes I have seen. Unfortunately it is for SQL Server and not MySQL. I'm not sure how similar the two are...
In MySQL InnoDB, there are two types of index.
Primary key which is called clustered index. Index key words are stored with
real record data in the B+Tree leaf node.
Secondary key which is non clustered index. These index only store primary key's key words along with their own index key words in the B+Tree leaf node. So when searching from secondary index, it will first find its primary key index key words and scan the primary key B+Tree to find the real data records. This will make secondary index slower compared to primary index search. However, if the select columns are all in the secondary index, then no need to look up primary index B+Tree again. This is called covering index.
Take at this videos for more details about Indexing
Simple Indexing
You can create a unique index on a table. A unique index means that two rows cannot have the same index value. Here is the syntax to create an Index on a table
CREATE UNIQUE INDEX index_name
ON table_name ( column1, column2,...);
You can use one or more columns to create an index. For example, we can create an index on tutorials_tbl using tutorial_author.
CREATE UNIQUE INDEX AUTHOR_INDEX
ON tutorials_tbl (tutorial_author)
You can create a simple index on a table. Just omit UNIQUE keyword from the query to create simple index. Simple index allows duplicate values in a table.
If you want to index the values in a column in descending order, you can add the reserved word DESC after the column name.
mysql> CREATE UNIQUE INDEX AUTHOR_INDEX
ON tutorials_tbl (tutorial_author DESC)
Adding some visual representation to the list of answers.
MySQL uses an extra layer of indirection: secondary index records point to primary index records, and the primary index itself holds the on-disk row locations. If a row offset changes, only the primary index needs to be updated.
Caveat: Disk data structure looks flat in the diagram but actually is a
B+ tree.
Source: link
I want to add my 2 cents. I am far from being a database expert, but I've recently read up a bit on this topic; enough for me to try and give an ELI5. So, here's may layman's explanation.
I understand it as such that an index is like a mini-mirror of your table, pretty much like an associative array. If you feed it with a matching key then you can just jump to that row in one "command".
But if you didn't have that index / array, the query interpreter must use a for-loop to go through all rows and check for a match (the full-table scan).
Having an index has the "downside" of extra storage (for that mini-mirror), in exchange for the "upside" of looking up content faster.
Note that (in dependence of your db engine) creating primary, foreign or unique keys automatically sets up a respective index as well. That same principle is basically why and how those keys work.
Let's suppose you have a book, probably a novel, a thick one with lots of things to read, hence lots of words.
Now, hypothetically, you brought two dictionaries, consisting of only words that are only used, at least one time in the novel. All words in that two dictionaries are stored in typical alphabetical order. In hypothetical dictionary A, words are printed only once while in hypothetical dictionary B words are printed as many numbers of times it is printed in the novel. Remember, words are sorted alphabetically in both the dictionaries.
Now you got stuck at some point while reading a novel and need to find the meaning of that word from anyone of those hypothetical dictionaries. What you will do? Surely you will jump to that word in a few steps to find its meaning, rather look for the meaning of each of the words in the novel, from starting, until you reach that bugging word.
This is how the index works in SQL. Consider Dictionary A as PRIMARY INDEX, Dictionary B as KEY/SECONDARY INDEX, and your desire to get for the meaning of the word as a QUERY/SELECT STATEMENT.
The index will help to fetch the data at a very fast rate. Without an index, you will have to look for the data from the starting, unnecessarily time-consuming costly task.
For more about indexes and types, look this.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
Indexing adds a data structure with columns for the search conditions and a pointer
The pointer is the address on the memory disk of the row with the
rest of the information
The index data structure is sorted to optimize query efficiency
The query looks for the specific row in the index; the index refers to the pointer which will find the rest of the information.
The index reduces the number of rows the query has to search through from 17 to 4.
Just say I had a query as below..
SELECT
name,category,address,city,state
FROM
table
WHERE
MATCH(name,subcategory,category,tag1) AGAINST('education')
AND
city='Oakland'
AND
state='CA'
LIMIT
0, 10;
..and I had a fulltext index as name,subcategory,category,tag1 and a composite index as city,state; is this good enough for this query? Just wondering if something extra is needed when mixing additional AND's when making use of the fulltext index with the MATCH/AGAINST.
Edit: What I am trying to understand is, what happens with the additional columns that are within the query but are not indexed in the chosen index (the fulltext index), the above example being city and state. How does MySQL now find the matching rows for these since it can't use two indexes (or can it?) - so, basically, I'm trying to understand how MySQL goes about finding the data optimally for the columns NOT in the chosen fulltext index and if there is anything I can or should do to optimize the query.
If I understand your question, you know that the MATCH AGAINST uses your FULLTEXT index and your wondering how MySQL goes about applying the rest of the WHERE clause (ie. does it do a tablescan or an indexed lookup).
Here's what I'm assuming about your table: it has a PRIMARY KEY on some id column and the FULLTEXT index.
So first off, MySQL will never use the FULLTEXT index for the city/state WHERE clause. Why? Because FULLTEXT indexes only apply with MATCH AGAINST. See here in the paragraph after the first set of bullets (not the Table of Contents bullets).
EDIT: In your case, assuming your table doesn't only have like 10 rows, MySQL will apply the FULLTEXT index for your MATCH AGAINST, then do a tablescan on those results to apply the city/state WHERE.
So what if you add a BTREE index onto city and state?
CREATE INDEX city__state ON table (city(10),state(2)) USING BTREE;
Well MySQL can only use one index for this query since it's a simple select. It will either use the FULLTEXT or the BTREE. Note that when I say one index, I mean one index definition, not one column in a multi-part index. Anwway, this then begs the question which one does it use?
That depends on the table analysis. MySQL will attempt to estimate (based on table stats from the last OPTIMIZE TABLE) which index will prune the most records. If the city/state WHERE gets you down to 10 records while the MATCH AGAINST only gets you down to 100, then MySQL will use the city__state index first for the city/state WHERE and then do a tablescan for the MATCH AGAINST.
On the other hand, if the MATCH_AGAINST gets you down to 10 records while the city/state WHERE gets you down to only a 1000, then MySQL will apply the FULLTEXT index first and tablescan for city and state.
The bottom line is the cardinality of your index. Essentially, how unique are the values that will go into your index? If every record in your table has city set to Oakland, then it's not a very unique key and so having city = 'Oakland' doesn't really reduce the number of records all that much for you. In that case, we say your city__state index has a low cardinality.
Consequently if 90% of the words in your FULLTEXT index are "John", then that doesn't really help you much either for the exact same reasons.
If you can afford the space and the UPDATE/DELETE/INSERT overhead, I would recommend adding the BTREE index and letting MySQL decide which index he wants to use. In my experience, he usually does a very good job of picking the right one.
I hope that answers your question.
EDIT: On a side note, making sure you pick the right size for your BTREE index (in my example I picked the first 10 char in city). This obviously makes a huge impact to cardinality. If you picked city(1), then obviously you'll get a lower cardinality then if you did city(10).
EDIT2: MySQL's query plan (estimation) for which index prunes the most records is what you see in EXPLAIN.
I think you can easily determine which index gets used by using EXPLAIN on your query. Please check the accepted answer for this question, which provides some good resources on how to interpret the output of EXPLAIN.
How does MySQL now find the matching rows for these since it can't use
two indexes
Yes it can: Can MySQL use multiple indexes for a single query? Also, you should read the documentation: How MySQL Uses Indexes
I had similar task some time ago, and I have noticed that MySQL can use either FULLTEXT index or any other index/indexes in one query, but not both; I wasn't able to mix FULLTEXT with any other index. Any selection with fulltext search will work in such way:
select subset using FULLTEXT search
select records matching other criteria from that subset 'Using where'
So you can use either fulltext index or any other index (I wasn't able to use both indexes by FORCE INDEX or anything else).
I suggest trying with both using fulltext and using other index (i.e. on City and State columns) and compare the results - they may vary depending on actual content in your database.
In my case I have discovered that forcing regular (non-fulltext) index in such query produced better performance (since I had very large number of rows, about 300 000, and non-fulltext criteria matched about 1000 of them).
I was using MySQL 5.5.24
I am really interested in how MySQL indexes work, more specifically, how can they return the data requested without scanning the entire table?
It's off-topic, I know, but if there is someone who could explain this to me in detail, I would be very, very thankful.
Basically an index on a table works like an index in a book (that's where the name came from):
Let's say you have a book about databases and you want to find some information about, say, storage. Without an index (assuming no other aid, such as a table of contents) you'd have to go through the pages one by one, until you found the topic (that's a full table scan).
On the other hand, an index has a list of keywords, so you'd consult the index and see that storage is mentioned on pages 113-120,231 and 354. Then you could flip to those pages directly, without searching (that's a search with an index, somewhat faster).
Of course, how useful the index will be, depends on many things - a few examples, using the simile above:
if you had a book on databases and indexed the word "database", you'd see that it's mentioned on pages 1-59,61-290, and 292 to 400. In such case, the index is not much help and it might be faster to go through the pages one by one (in a database, this is "poor selectivity").
For a 10-page book, it makes no sense to make an index, as you may end up with a 10-page book prefixed by a 5-page index, which is just silly - just scan the 10 pages and be done with it.
The index also needs to be useful - there's generally no point to index e.g. the frequency of the letter "L" per page.
The first thing you must know is that indexes are a way to avoid scanning the full table to obtain the result that you're looking for.
There are different kinds of indexes and they're implemented in the storage layer, so there's no standard between them and they also depend on the storage engine that you're using.
InnoDB and the B+Tree index
For InnoDB, the most common index type is the B+Tree based index, that stores the elements in a sorted order. Also, you don't have to access the real table to get the indexed values, which makes your query return way faster.
The "problem" about this index type is that you have to query for the leftmost value to use the index. So, if your index has two columns, say last_name and first_name, the order that you query these fields matters a lot.
So, given the following table:
CREATE TABLE person (
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
INDEX (last_name, first_name)
);
This query would take advantage of the index:
SELECT last_name, first_name FROM person
WHERE last_name = "John" AND first_name LIKE "J%"
But the following one would not
SELECT last_name, first_name FROM person WHERE first_name = "Constantine"
Because you're querying the first_name column first and it's not the leftmost column in the index.
This last example is even worse:
SELECT last_name, first_name FROM person WHERE first_name LIKE "%Constantine"
Because now, you're comparing the rightmost part of the rightmost field in the index.
The hash index
This is a different index type that unfortunately, only the memory backend supports. It's lightning fast but only useful for full lookups, which means that you can't use it for operations like >, < or LIKE.
Since it only works for the memory backend, you probably won't use it very often. The main case I can think of right now is the one that you create a temporary table in the memory with a set of results from another select and perform a lot of other selects in this temporary table using hash indexes.
If you have a big VARCHAR field, you can "emulate" the use of a hash index when using a B-Tree, by creating another column and saving a hash of the big value on it. Let's say you're storing a url in a field and the values are quite big. You could also create an integer field called url_hash and use a hash function like CRC32 or any other hash function to hash the url when inserting it. And then, when you need to query for this value, you can do something like this:
SELECT url FROM url_table WHERE url_hash=CRC32("http://gnu.org");
The problem with the above example is that since the CRC32 function generates a quite small hash, you'll end up with a lot of collisions in the hashed values. If you need exact values, you can fix this problem by doing the following:
SELECT url FROM url_table
WHERE url_hash=CRC32("http://gnu.org") AND url="http://gnu.org";
It's still worth to hash things even if the collision number is high cause you'll only perform the second comparison (the string one) against the repeated hashes.
Unfortunately, using this technique, you still need to hit the table to compare the url field.
Wrap up
Some facts that you may consider every time you want to talk about optimization:
Integer comparison is way faster than string comparison. It can be illustrated with the example about the emulation of the hash index in InnoDB.
Maybe, adding additional steps in a process makes it faster, not slower. It can be illustrated by the fact that you can optimize a SELECT by splitting it into two steps, making the first one store values in a newly created in-memory table, and then execute the heavier queries on this second table.
MySQL has other indexes too, but I think the B+Tree one is the most used ever and the hash one is a good thing to know, but you can find the other ones in the MySQL documentation.
I highly recommend you to read the "High Performance MySQL" book, the answer above was definitely based on its chapter about indexes.
Basically an index is a map of all your keys that is sorted in order. With a list in order, then instead of checking every key, it can do something like this:
1: Go to middle of list - is higher or lower than what I'm looking for?
2: If higher, go to halfway point between middle and bottom, if lower, middle and top
3: Is higher or lower? Jump to middle point again, etc.
Using that logic, you can find an element in a sorted list in about 7 steps, instead of checking every item.
Obviously there are complexities, but that gives you the basic idea.
Take a look at this link: http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
How they work is too broad of a subject to cover in one SO post.
Here is one of the best explanations of indexes I have seen. Unfortunately it is for SQL Server and not MySQL. I'm not sure how similar the two are...
In MySQL InnoDB, there are two types of index.
Primary key which is called clustered index. Index key words are stored with
real record data in the B+Tree leaf node.
Secondary key which is non clustered index. These index only store primary key's key words along with their own index key words in the B+Tree leaf node. So when searching from secondary index, it will first find its primary key index key words and scan the primary key B+Tree to find the real data records. This will make secondary index slower compared to primary index search. However, if the select columns are all in the secondary index, then no need to look up primary index B+Tree again. This is called covering index.
Take at this videos for more details about Indexing
Simple Indexing
You can create a unique index on a table. A unique index means that two rows cannot have the same index value. Here is the syntax to create an Index on a table
CREATE UNIQUE INDEX index_name
ON table_name ( column1, column2,...);
You can use one or more columns to create an index. For example, we can create an index on tutorials_tbl using tutorial_author.
CREATE UNIQUE INDEX AUTHOR_INDEX
ON tutorials_tbl (tutorial_author)
You can create a simple index on a table. Just omit UNIQUE keyword from the query to create simple index. Simple index allows duplicate values in a table.
If you want to index the values in a column in descending order, you can add the reserved word DESC after the column name.
mysql> CREATE UNIQUE INDEX AUTHOR_INDEX
ON tutorials_tbl (tutorial_author DESC)
Adding some visual representation to the list of answers.
MySQL uses an extra layer of indirection: secondary index records point to primary index records, and the primary index itself holds the on-disk row locations. If a row offset changes, only the primary index needs to be updated.
Caveat: Disk data structure looks flat in the diagram but actually is a
B+ tree.
Source: link
I want to add my 2 cents. I am far from being a database expert, but I've recently read up a bit on this topic; enough for me to try and give an ELI5. So, here's may layman's explanation.
I understand it as such that an index is like a mini-mirror of your table, pretty much like an associative array. If you feed it with a matching key then you can just jump to that row in one "command".
But if you didn't have that index / array, the query interpreter must use a for-loop to go through all rows and check for a match (the full-table scan).
Having an index has the "downside" of extra storage (for that mini-mirror), in exchange for the "upside" of looking up content faster.
Note that (in dependence of your db engine) creating primary, foreign or unique keys automatically sets up a respective index as well. That same principle is basically why and how those keys work.
Let's suppose you have a book, probably a novel, a thick one with lots of things to read, hence lots of words.
Now, hypothetically, you brought two dictionaries, consisting of only words that are only used, at least one time in the novel. All words in that two dictionaries are stored in typical alphabetical order. In hypothetical dictionary A, words are printed only once while in hypothetical dictionary B words are printed as many numbers of times it is printed in the novel. Remember, words are sorted alphabetically in both the dictionaries.
Now you got stuck at some point while reading a novel and need to find the meaning of that word from anyone of those hypothetical dictionaries. What you will do? Surely you will jump to that word in a few steps to find its meaning, rather look for the meaning of each of the words in the novel, from starting, until you reach that bugging word.
This is how the index works in SQL. Consider Dictionary A as PRIMARY INDEX, Dictionary B as KEY/SECONDARY INDEX, and your desire to get for the meaning of the word as a QUERY/SELECT STATEMENT.
The index will help to fetch the data at a very fast rate. Without an index, you will have to look for the data from the starting, unnecessarily time-consuming costly task.
For more about indexes and types, look this.
Indexes are used to find rows with specific column values quickly. Without an index, MySQL must begin with the first row and then read through the entire table to find the relevant rows. The larger the table, the more this costs. If the table has an index for the columns in question, MySQL can quickly determine the position to seek to in the middle of the data file without having to look at all the data. This is much faster than reading every row sequentially.
Indexing adds a data structure with columns for the search conditions and a pointer
The pointer is the address on the memory disk of the row with the
rest of the information
The index data structure is sorted to optimize query efficiency
The query looks for the specific row in the index; the index refers to the pointer which will find the rest of the information.
The index reduces the number of rows the query has to search through from 17 to 4.