In my database, I have a table that contains a companyId, pointing to a company, and some text. I would like to do a FULLTEXT search, but as I always make requests against a specific companyId I'd like to use a composite key that combines my companyId and the fulltext index.
Is there any way to do that? As I guess this is not possible, what is the optimal way to create indexes so that the following query is fastest?
The query will always be:
SELECT * FROM textTable
WHERE companyId = ? (Possibly more conditions) AND
MATCH(value) AGAINST("example")
Should I create my indexes on integer columns normally and add one fulltext index? Or should I include the value column in the index? Maybe both?
Depending on how large the text field is, MySQL may store the data separately from the rest of the row on disk, so including it in a compound index won't do you much good. In this scenario, the index should simply be on companyId; each index entry references the primary key, which is then used to fetch the row so that its text field can be checked.
The general rule of thumb is to filter early, so filtering first by companyId (an INT, 4 bytes) is preferable. If the text field is small enough, you can add it as the second column in the index to avoid the second lookup, but understand that this will significantly hurt INSERT performance.
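A minimal sketch of the "filter early" setup, under some assumptions: the id column, the index names, the InnoDB engine, and the literal 42 are invented for illustration, and only companyId and value come from the question. Note that under InnoDB the MATCH ... AGAINST predicate needs a FULLTEXT index on value in any case, so one is declared next to the companyId index.

```sql
-- Assumed table definition; only companyId and value are from the question.
CREATE TABLE textTable (
  id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  companyId INT NOT NULL,
  value     TEXT NOT NULL,
  KEY idx_company (companyId),       -- ordinary B-tree index for the integer filter
  FULLTEXT KEY ft_value (value)      -- required for MATCH(value) AGAINST(...)
) ENGINE=InnoDB;

-- EXPLAIN shows, in its "key" column, which of the two indexes was actually chosen.
EXPLAIN
SELECT id, value
FROM textTable
WHERE companyId = 42
  AND MATCH(value) AGAINST('example');
```

Running the EXPLAIN against realistic data volumes is the simplest way to check whether the integer index or the fulltext index ends up driving the query.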
The best option for this type of scenario may be to use a NoSQL database to handle the text lookup, if the text fields are large; they're pretty exceptional with that type of job.
Why are you selecting everything? This is not going to help your query performance.
You should select only the columns that you need. This reduces the time spent scanning the table.
For a FULLTEXT search, your query sorts the results by relevance and then uses the index lookup based on your WHERE clause. So you need to turn that into a straightforward lookup by putting the MATCH expression in your SELECT clause.
The query should be something like this:
SELECT
MATCH(value) AGAINST("example")
FROM
textTable
You can add more filtering in the WHERE clause, or better, you can wrap the expression in an IF, a CASE, or any other function if you like.
Something like this:
SELECT
IF(MATCH(value) AGAINST("example"), 'Your Returned value if TRUE', 'Your returned value if FALSE')
FROM textTable
This will do a full table scan, and this is faster.
For the indexing part:
You can use an index on companyId and include value in it. You can do the same with any other columns that you use in your SELECT, WHERE, and ORDER BY.
Sometimes it is okay to create more indexes, each one specific to a particular task.
I would like to build a system which allows searching user messages by a specific user.
Assume the following table:
create table messages(
user_id int,
message nvarchar(500));
So what kind of index should I use here if I want to search for all messages from user 1 that contain the word 'foo'?
1. A simple, non-unique index on user_id
It will narrow the rows down to one user's messages and then do a full scan of those rows for the specific word.
2. A FULLTEXT index on message
This will find matching messages from all users and then filter by ID, which seems very inefficient with a large number of users.
3. A compound index on both user_id and message
A separate full-text index tree would be created for each user, so each could be searched individually. During a query, the system would filter messages by ID and then perform the text search on the remaining rows in the index.
As far as I know, the last one is impossible. So I assume I should use the first option. Will it perform better with a few thousand users?
And if each user has ~100 messages, a full iteration won't cost many resources?
Perhaps I could include the username in the message and use BOOLEAN full-text search mode, but I think that would be slower than using an indexed user_id.
@Alden Quimby's answer is correct as far as it goes, but there is more to the story, because MySQL will only try to choose the optimal index, and its ability to make that determination is limited because of the way fulltext indexes interact with the optimizer.
What actually happens is this:
If the specified user_id exists in either 0 or 1 matching rows in the table, the optimizer will realize this and will choose user_id as the index for that query. Fast execution.
Otherwise, the optimizer will choose the fulltext index, filtering every row matched by the fulltext index to eliminate rows not containing a user_id that matches the WHERE clause. Not quite as fast.
So it's not truly the "optimum" path. It's more like fulltext, with a nice optimization to avoid the fulltext search under the one condition that we know we have almost nothing of interest in the table.
The reason this breaks down is that a fulltext index doesn't give any meaningful statistics back to the optimizer. It just says "yeah, I think that query should probably only require me to check 1 row" ... which, of course, pleases the optimizer greatly, so the fulltext index wins the bid for lowest cost, unless the index with the integer value also comes in comparably low or lower.
Still, that doesn't mean I wouldn't try it this way first.
There's another option, which would work best with fulltext queries IN BOOLEAN MODE and that is to create another column which you would populate with something like CONCAT('user_id_',user_id) or something similar, and then declare a 2-column fulltext index.
filter_string VARCHAR(48) # populated with CONCAT('user_id_',user_id);
....
FULLTEXT KEY (message,filter_string)
Then specify everything in the query.
SELECT ...
WHERE user_id = 500 AND
MATCH (message,filter_string) AGAINST ('+kittens +puppies +user_id_500' IN BOOLEAN MODE);
Now, the fulltext index will be responsible for matching only those rows where kittens, puppies, and "user_id_500" appear in the combined fulltext index of the two columns, but you'd still want to have the integer filter there too, to make sure the final results are constrained in spite of any random appearance of "user_id_500" in the message.
You should add a fulltext index on message and a regular index on user_id, and use the query:
SELECT *
FROM messages
WHERE MATCH(message) AGAINST(@search_query)
AND user_id = @user_id;
You're right that you can't do option 3. But rather than trying to pick between 1 and 2, let MySQL do the work for you. MySQL will only use one of the two indexes, and will do a linear scan to complete the second filter, but it will estimate the effectiveness of each index and choose the optimal one.
Note: only do this if you can afford the overhead of two indexes (slower insert/update/delete). Also, if you know that each user will only have a few messages, then yes it might make sense to use a simple index and do a regex in the application layer or something like that.
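A minimal sketch of that two-index setup, assuming the messages table from the question on InnoDB. The index names and the sample values assigned to the two user variables are made up for illustration. EXPLAIN reports in its key column which single index the optimizer settled on.

```sql
-- Index names are hypothetical; adding the FULLTEXT index may rebuild an InnoDB table.
ALTER TABLE messages
  ADD INDEX idx_user_id (user_id),
  ADD FULLTEXT INDEX ft_message (message);

-- Sample values standing in for application input.
SET @search_query = 'foo';
SET @user_id = 1;

-- Inspect the "key" column to see whether idx_user_id or ft_message was picked.
EXPLAIN
SELECT user_id, message
FROM messages
WHERE MATCH(message) AGAINST(@search_query)
  AND user_id = @user_id;
```

Comparing the EXPLAIN output for a user with very few messages and for a user with many is a quick way to see how the choice of index shifts.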
Turn on the "Optimizer trace" and look for "considered_execution_plans". I contend that the Optimizer will always pick the FULLTEXT index, even when some other index might be better. This may be because it is quite costly when the MATCH is not pre-computed as when the FT index is built.
More on Optimizer Trace: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#optimizer_trace (Earlier in that doc are my tips on FULLTEXT.)
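For reference, a trace can be captured roughly as follows. The optimizer_trace session variable and the information_schema.OPTIMIZER_TRACE table exist in MySQL 5.6 and later; the traced statement is just the messages example from this thread with placeholder values, and it assumes the fulltext index on message is already in place.

```sql
-- Enable tracing for the current session only.
SET optimizer_trace = 'enabled=on';

-- The statement to be traced (placeholder search term and user id).
SELECT user_id, message
FROM messages
WHERE MATCH(message) AGAINST('foo')
  AND user_id = 1;

-- TRACE holds a JSON document; search it for "considered_execution_plans".
SELECT TRACE FROM information_schema.OPTIMIZER_TRACE;

-- Switch tracing off again to avoid the extra overhead.
SET optimizer_trace = 'enabled=off';
```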
I have an application that basically functions like a grid where a user can view data and sort/filter by any of the columns. It's a very small amount of data (~200K rows / 50MB), but too big to comfortably fit in the browser and handle in JavaScript.
The crudest/simplest approach I've thought of is to store it in a MySQL table with an index on every single column (yes, every column). The database/table is about 99% read / 1% write, so I'm not too concerned if the insert times go up by 100x or so.
Is there any downside of doing the above? What kinds of things should I be concerned about? Are there any better (server-side) approaches for doing something like this?
If those are "words", toss them all into a single column, separated by spaces. Then add a single FULLTEXT index just on that column. If you have other non-'word' columns, you may need to index them, too.
Caveat: FULLTEXT has several limitations. And benefits (such as singular/plural).
Presumably, you will refuse to show anything unless they do some filtering? Don't tell me you want to let the user paginate through 200K rows!
You have not said whether they can filter on multiple columns.
You should construct the query in your app code. The syntax for fulltext is different than equality and different than range tests.
If you are tempted by the problematic EAV schema design, see http://mysql.rjweb.org/doc.php/eav
I have rambled on; I could be more focused if you gave us some clues of the data and the queries.
Example:
CREATE TABLE ... (
...
all_words TEXT NOT NULL,
LastUpdated DATETIME NOT NULL,
...
FULLTEXT(all_words)
);
SELECT ...
WHERE MATCH(all_words) AGAINST ('...' IN BOOLEAN MODE)
ORDER BY LastUpdated DESC
LIMIT 50;
(Note: Since only one index is used, and FULLTEXT has priority, an index on LastUpdated would not be useful for this example.)
I have the following problem when running a mysql query:
The query is very slow, and when I use EXPLAIN the key is NULL even though possible_keys are available and the order is correct. I also tried adding independent indexes for each column, but key was still NULL.
You can see table, index and mysql explain here: https://snag.gy/vcChl6.jpg
The optimizer likely has just decided that there is no reason to use the index.
Since you are using SELECT *, using the index would mean taking the primary key from the index and then going back to look up all the necessary data in the clustered index. That is referred to as a double lookup, and is generally bad for performance. As there are so few records in this table, the optimizer likely decided that it can easily do a full table scan instead and get your result faster.
In short, this is expected behavior.
If you want to SELECT just some columns, add them to the t1 index and then just SELECT only the columns you need, with that given WHERE clause. It should use the index then. As your table grows in size, it may start using the index as well, once it estimates that the double lookup is cheaper than the full table scan.
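As a rough illustration of that suggestion: the table definition is only available in the linked screenshot, so the table name translations and the column title below are hypothetical stand-ins (title is assumed to be a short VARCHAR). Only id_project and id_lang come from the thread.

```sql
-- Hypothetical table, column, and index names; adapt them to the real schema.
ALTER TABLE translations
  ADD INDEX idx_project_lang_title (id_project, id_lang, title);

-- Select only columns contained in the index, so the rows never have to be
-- fetched from the clustered index.
EXPLAIN
SELECT id_project, id_lang, title
FROM translations
WHERE id_project = 1
  AND id_lang = 2;
```

When an index covers the query like this, the Extra column of EXPLAIN normally reports "Using index".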
A guess: Most rows are of that 'project' and that 'lang'.
The Optimizer does not understand that fact, so it takes the index that is obviously the best:
(id_project, id_lang)
This one would be equally good: (id_lang, id_project).
No fair... The EXPLAIN mentions indexes named id_project and id_lang (not useful), but the list of indexes shows a composite index t1(id_project, id_lang) (useful).
Then, as Willem suggests, it has to bounce between the index and the table. Normally (that is, when it has adequate statistics), the Optimizer will say "Oh, more than ~20% of the table is being referenced; let's ignore any index."
Things you can do:
Get rid of that index.
Change * to a list of just the columns you need. In particular, if you avoid the 3 TEXT columns, two optimizations kick in. Alternatively, any that will never be longer than 255 characters can be changed to VARCHAR(255).
Use some other filtering, ordering, limiting, etc. If this is a web application, do you really want to get ~534 rows?
I am trying to understand indexes in MySQL. I know that an index created in a table can speed up executing queries and it can slow down the inserting and updating of rows.
When creating an index, I used this query on a table called authors that contains (AuthorNum, AuthorFName, AuthorLName, ...)
Create index Index_1 on Authors ([What to put here]);
I know I have to put a column name, but which one?
Do I have to put the column name that will be compared in the WHERE clause when a user queries the table, or something else?
The Anatomy of an Index
An index is a distinct data structure within a database and is a form of data redundancy. Its primary purpose is to provide an ordered representation of the indexed data through a logical ordering that is independent of the physical ordering. This is done with a doubly linked list and a tree structure known as a balanced search tree (B-tree). B-trees are useful because they keep data sorted and allow searches, sequential access, insertions, and deletions in logarithmic time. Because of the doubly linked list, the database can move backwards or forwards along the index as needed for various queries. Inserts stay cheap, since only the pointers between pieces of data have to be rearranged. Databases use these doubly linked lists to connect the leaf nodes (usually of a B+ tree or B-tree), each of which is stored in a page, and to establish the logical ordering between them. Operations like UPDATE or INSERT become slower because they involve two write operations (one for the table data and one for the index data).
Defining an Optimal Index With WHERE
To define an optimal index you must not only understand how indexes work, but you must also understand how the application queries the data. E.g., you must know the column combinations that appear in the WHERE clause.
A common restriction with queries on LAST_NAME and FIRST_NAME columns deals with case sensitivity. For example, instead of doing an exact search like Hotinger we would prefer to match all results such as HoTingEr and so on. This is very easy to do in a WHERE clause: we just say WHERE UPPER(LAST_NAME) = UPPER('Hotinger')
However, if we define an index of LAST_NAME and query, it will actually run a full table scan because the query is not on LAST_NAME but on UPPER(LAST_NAME). From the database's perspective, this is completely different. So, in this case you should define the index on UPPER(LAST_NAME) instead.
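In MySQL specifically, an index on an expression is available as a "functional key part" from version 8.0.13, and the expression needs its own extra pair of parentheses. The sketch below assumes such a version; the employees table, its columns, and the index name are hypothetical.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS employees (
  employee_id INT NOT NULL PRIMARY KEY,
  first_name  VARCHAR(100) NOT NULL,
  last_name   VARCHAR(100) NOT NULL
);

-- Functional key part (MySQL 8.0.13+): note the doubled parentheses.
CREATE INDEX idx_upper_last_name ON employees ((UPPER(last_name)));

-- The WHERE expression must match the indexed expression for the index to be considered.
EXPLAIN
SELECT employee_id, last_name
FROM employees
WHERE UPPER(last_name) = UPPER('Hotinger');
```

Keep in mind that MySQL's default collations are case-insensitive, so on many installations a plain index on last_name with WHERE last_name = 'Hotinger' already matches HoTingEr without any UPPER call.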
Indexes do not necessarily have to cover a single column. For example, if the primary key is a composite key (consisting of multiple columns), the database creates a concatenated index, also known as a combined index. Note that the column order of a concatenated index has a significant impact on its usability and scalability, so it must be chosen carefully. Basically, the leading columns of the index should be the ones your WHERE clauses always filter on; the order in which the conditions are written in the WHERE clause itself does not matter.
Defining an Optimal Index With LIKE
The position of the wildcard characters makes a huge difference. LIKE clauses only use the characters before the first wildcard during tree traversal; the rest do not narrow the scanned index range. The more selective the prefix of the LIKE pattern, the narrower the scanned index range becomes, which makes the index lookup faster. As a tip, avoid LIKE patterns that lead with a wildcard, like "%OTINGER%".
For full-text searches, MySQL offers the MATCH and AGAINST keywords. Starting with MySQL 5.6, InnoDB tables can have full-text indexes (MyISAM tables supported them earlier). Look at Full-Text Search Functions in the MySQL manual for a more in-depth discussion.
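The effect of the wildcard position can be checked with EXPLAIN. This is only a sketch: the employees table and the index name are hypothetical, and the exact plan depends on the data and the MySQL version.

```sql
-- Hypothetical table used only for illustration.
CREATE TABLE IF NOT EXISTS employees (
  employee_id INT NOT NULL PRIMARY KEY,
  first_name  VARCHAR(100) NOT NULL,
  last_name   VARCHAR(100) NOT NULL
);

CREATE INDEX idx_last_name ON employees (last_name);

-- Fixed prefix: the index range can be narrowed to entries starting with 'HOT'.
EXPLAIN
SELECT employee_id FROM employees WHERE last_name LIKE 'HOT%';

-- Leading wildcard: there is no prefix to seek on, so the range cannot be narrowed.
EXPLAIN
SELECT employee_id FROM employees WHERE last_name LIKE '%OTINGER%';
```

In the first plan the type column would typically show a range access on idx_last_name; in the second, expect a scan of the whole index or table instead.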
Yes, generally you need an index on the column or columns that you compare in the WHERE clause of your queries to speed up queries.
If you search by AuthorFName, then you create an index on that column. If they search by AuthorLName, then you create an index on that column.
In this case though, maybe what you should be looking at is a FULLTEXT index. That would allow users to enter fuzzy queries, which would return a number of results ordered by relevance.
From the MySQL Manual:
Indexes are used to find rows with specific column values quickly.
Without an index, MySQL must begin with the first row and then read
through the entire table to find the relevant rows. The larger the
table, the more this costs. If the table has an index for the columns
in question, MySQL can quickly determine the position to seek to in
the middle of the data file without having to look at all the data. If
a table has 1,000 rows, this is at least 100 times faster than reading
sequentially. If you need to access most of the rows, it is faster to
read sequentially, because this minimizes disk seeks.
An index usually means a B-Tree. Understand the structure of the B-Tree and you'll understand what index can and cannot do.
In your particular case:
WHERE AuthorLName = 'something' and WHERE AuthorLName LIKE 'something%' can be sped-up by an index on {AuthorLName}.
WHERE AuthorLName = 'something' AND AuthorFName = 'something else' can be sped-up by a composite index on {AuthorLName, AuthorFName} or {AuthorFName, AuthorLName}.
WHERE AuthorLName = 'something' OR AuthorFName = 'something else' (which doesn't make much sense, but is here as an example) can be sped-up by having two indexes: on {AuthorLName} and on {AuthorFName}.
WHERE AuthorLName LIKE '%something' cannot be sped-up by a B-Tree index (consider full-text indexing).
Etc...
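The cases above could be covered with index definitions along these lines. This is a sketch only: the column types in the CREATE TABLE are assumed, and the index names are invented.

```sql
-- Assumed column types; only the table and column names come from the question.
CREATE TABLE IF NOT EXISTS Authors (
  AuthorNum   INT NOT NULL PRIMARY KEY,
  AuthorFName VARCHAR(100) NOT NULL,
  AuthorLName VARCHAR(100) NOT NULL
);

-- Serves "AuthorLName = ..." and "AuthorLName LIKE 'something%'", and, as a
-- composite index, "AuthorLName = ... AND AuthorFName = ...".
CREATE INDEX idx_lname_fname ON Authors (AuthorLName, AuthorFName);

-- Needed for the OR case, where each side is looked up through its own index.
CREATE INDEX idx_fname ON Authors (AuthorFName);

EXPLAIN
SELECT AuthorNum
FROM Authors
WHERE AuthorLName = 'something' OR AuthorFName = 'something else';
```

Because a B-tree index can be used through its leftmost columns, the composite index on (AuthorLName, AuthorFName) also serves queries on AuthorLName alone, so a separate single-column index on AuthorLName is not needed here.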
See Use The Index, Luke! for a much more thorough treatment of the subject than possible in a simple SO post.
Limited length index:
When using text columns or very large varchar columns, you won't be able to create an index over the entire length of the text/varchar. There are limits on how much of the value can be indexed, and they depend on the storage engine and row format (for InnoDB, 767 or 3072 bytes).
In such a case you specify the length in the index declaration.
CREATE INDEX `my_limited_length_index` ON `my_table`(`long_text_content`(512));
-- please notice the use of the numeric length of the index after the column name
Processed value index (available in PostgreSQL; MySQL 8.0.13 and later supports it as functional key parts, which are written with an extra pair of parentheses around the expression):
Indexes are not exclusively built from one column; some may be built from multiple columns and others may be built from just some of the information a column holds. For example, if you have a full datetime column but you know you're only going to filter records by date, you can build an index based on the datetime column but containing only the date information.
-- `my_table` has a `created` column of type timestamp
CREATE INDEX `my_date_created` ON `my_table`(DATE(`created`));
-- please notice the use of the DATE function which extracts only
-- the date from the `created` timestamp
An index should span the columns you are going to use in the WHERE clause.
To better understand, here is an example:
SELECT * FROM Authors WHERE AuthorNum > 10 AND AuthorLName LIKE 'A%';
SELECT * FROM Authors WHERE AuthorLName LIKE 'Be%';
If you often run the queries shown above, you are strongly advised to have two indexes:
Create index AuthNum_AuthLName_Index on Authors (AuthorNum, AuthorLName);
Create index AuthLName_Index on Authors (AuthorLName);
The key thing to remember: an index should contain the same combination of columns used in the WHERE clause.