Can indexing a MySQL column improve a LIKE search? - mysql

I'm using a table that I downloaded elsewhere, and they put a column called DATA_SOURCE which concatenates all the different data sources of a record like so:
sourceA; sourceB; sourceC; ...
So, if I am looking for records from sourceB, I would have to do a like search on %sourceB%.
This is obviously a time-consuming query. My question is, if I were to index column DATA_SOURCE, would it improve the performance of these wildcard like searches? Or would it not make a difference.

No. Indexes can be used for LIKE searches, but only if the search pattern does not begin with a wildcard.
So LIKE 'Albert %' will be indexable, while LIKE '%Einstein%' will not.
The reason is that an index is essentially a sorted lookup structure over that column's values, which reduces a search from linear to logarithmic complexity. If the search pattern starts with a wildcard, MySQL still has to loop through all possible values to match them (a full scan), which eliminates the potential performance gain of an index.
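A rough way to see this, as a Python sketch rather than a description of MySQL's internals: treat the index as a sorted list of the column's values. The function names and sample data are made up for illustration, and the comparison here is case-sensitive, which MySQL's default collations are not.

```python
from bisect import bisect_left

# A stand-in for an index: the column's values kept in sorted order.
names = sorted([
    "Albert Einstein", "Albert Camus", "Alberta Hunter",
    "Marie Curie", "Niels Bohr", "Hans Albert Einstein",
])

def prefix_search(index, prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read forward."""
    start = bisect_left(index, prefix)
    matches = []
    for value in index[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees no later value can match
        matches.append(value)
    return matches

def substring_search(index, term):
    """LIKE '%term%': sort order is no help, so every value is inspected."""
    return [value for value in index if term in value]

print(prefix_search(names, "Albert "))
print(substring_search(names, "Einstein"))
```

The prefix search stops as soon as it leaves the contiguous run of matching values, while the substring search has no choice but to visit every entry.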

Related

Doing a mysql like %term% on 1B records (with indexed field)

I have the following query that I'm using and was wondering if it would work performantly, or whether I should use ElasticSearch from the start:
SELECT
*
FROM
entity_access
JOIN entity ON (entity.id=entity_access.entity_id)
WHERE
user_id = 144
AND name LIKE '%format%'
The entity_access table will have about a billion results. But each user should have 5k entries max. My thinking was that a LIKE %term% would be trivial on a table of 5k rows (under 50ms), so hopefully it would be the same if I have a good index on a large table before doing it? Or is there something I'm missing here?
Two things. First, it doesn't matter how many total rows are in the table, because the index on user_id will select only that user's rows for matching. Since you say there are about 5k per user_id, that's easily managed.
Second, LIKE '%foo%' will not use an index: the leading '%' precludes that. If you want to use an index, you'll have to accept a pattern of LIKE 'foo%'. If that fits the use case, then the query as written will perform fine.
If either of the above conditions doesn't hold, then consider using a dedicated search engine (like Sphinx, or roll-your own with radix trees) or materialize your search into a more indexable format (such as using MySQL Full-Text Search).
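The first point can be pictured with a small Python sketch. It is only a model of the idea: the dictionary stands in for an index on user_id, and the rows are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (user_id, name) rows; the real table would hold about a billion.
rows = [
    (144, "export format"), (144, "budget"), (144, "formatting notes"),
    (7, "format guide"), (7, "inbox"),
]

# Stand-in for an index on user_id: maps each user to that user's rows.
by_user = defaultdict(list)
for user_id, name in rows:
    by_user[user_id].append(name)

def search(user_id, term):
    """Model of: WHERE user_id = ? AND name LIKE '%term%'."""
    candidates = by_user.get(user_id, [])  # index lookup: only this user's rows
    # The substring test cannot use an index, but it only runs on the candidates.
    return [name for name in candidates if term in name]

print(search(144, "format"))
```

The cost of the substring test depends on the number of rows per user, not on the size of the whole table.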

Why can "WHERE col LIKE 'a%'" use an index, but not "WHERE col LIKE '%a%'"?

I just found this post:
MySQL like query runs extremly slow for 5000 records table
And I'm interested in understanding asaph's post where he says:
I wouldn't expect select * from customer where code like '%a%' to be
fast since it couldn't possibly use an index. Every record has to be
checked. Consider select * from customer where code like 'a%' if
possible since that could feasibly use an index.
Can someone explain the difference between the two select statements? I know one has only one wildcard and will only find things that start with "a". but why can that one be indexed?
Although the actual details of MySQL's B-tree indexes are more complicated than this, for most purposes it's close enough to say that having an index on a column lets the MySQL engine perform SELECTs on your table as if it was ordered by that column.
If the code column has an index on it, and you're searching for records where code LIKE 'a%', then all MySQL (or whatever other SQL package, as long as it's sufficiently clever) has to do is spit out all the records from the start of 'a' to the start of 'b'. However, if you're searching for records where code LIKE '%a%', then having the table already ordered by code won't help you, because whether a row matches the WHERE clause has no simple relationship to its position in the index. So for the second query, there's nothing the database can reasonably do except check the code entry of every single row in the table (unless it already has the result cached).
This is fairly easy to understand intuitively, because you can imagine doing something reasonably analogous yourself, as a human. If you want to find all the words in the Oxford English Dictionary that begin with 'a', then you just go through all the pages from the start of 'a' to the start of 'b', and everything you see is a word starting with 'a'. If you want to find all the words in the dictionary with an 'a' in them anywhere, then the dictionary being ordered doesn't offer you much help. If you're sophisticated enough, you can plausibly exploit the ordering of the dictionary a little (such as by using your knowledge that all the words before the first 'b...' word in the dictionary contain an 'a'), but ultimately you're gonna have to look at almost every single word.
From the manual:
Most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, and FULLTEXT) are stored in B-trees. A B-tree index can be used for column comparisons in expressions that use the =, >, >=, <, <=, or BETWEEN operators. The following SELECT statements do not use indexes:
SELECT * FROM tbl_name WHERE key_col LIKE '%Patrick%';
The index also can be used for LIKE comparisons if the argument to LIKE is a constant string that does not start with a wildcard character. For example, the following SELECT statements use indexes:
SELECT * FROM tbl_name WHERE key_col LIKE 'Patrick%';
SELECT * FROM tbl_name WHERE key_col LIKE 'Pat%_ck%';
MySQL uses BTREE indexes.
If you have a string comparison using LIKE with a leading wildcard, then it's faster for MySQL to do a table scan because the index cannot be used to narrow down the results.
If you have a string comparison using LIKE with a trailing wildcard, then it's faster to use the index because fewer records need to be scanned.
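For a pattern such as 'Pat%_ck%', only the constant text before the first wildcard can narrow the range; the rest of the pattern is then checked against each candidate. The Python sketch below models that split under simplifying assumptions: a sorted list stands in for the index, matching is case-sensitive, and LIKE escape sequences are ignored. All helper names are hypothetical.

```python
import re
from bisect import bisect_left

def literal_prefix(pattern):
    """Return the constant text before the first LIKE wildcard (% or _)."""
    for i, ch in enumerate(pattern):
        if ch in "%_":
            return pattern[:i]
    return pattern

def like_to_regex(pattern):
    """Translate % and _ to regex equivalents; escape sequences are not handled."""
    parts = [".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
             for ch in pattern]
    return re.compile("".join(parts), re.DOTALL)

def like_search(sorted_values, pattern):
    """Return (matches, number of values examined)."""
    prefix = literal_prefix(pattern)
    regex = like_to_regex(pattern)
    # With an empty prefix this starts at 0 and never stops early: a full scan.
    start = bisect_left(sorted_values, prefix)
    matches, examined = [], 0
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break
        examined += 1
        if regex.fullmatch(value):
            matches.append(value)
    return matches, examined

values = sorted(["Patrick", "Patricia", "Pat Buck", "Paul", "Kirkpatrick", "Peter"])
print(like_search(values, "Pat%_ck%"))
print(like_search(values, "%atrick%"))
```

With the 'Pat' prefix only the three values starting with "Pat" are examined; with a leading wildcard all six are.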

Is it OK to index all the fields in this mysql query?

I have this mysql query and I am not sure what are the implications of indexing all the fields in the query . I mean is it OK to index all the fields in the CASE statement, Join Statement and Where Statement? Are there any performance implications of indexing fields?
SELECT roots.id as root_id, root_words.*,
CASE
WHEN root_words.title LIKE '%text%' THEN 1
WHEN root_words.unsigned_title LIKE '%normalised_text%' THEN 2
WHEN unsigned_source LIKE '%normalised_text%' THEN 3
WHEN roots.root LIKE '%text%' THEN 4
END as priorities
FROM roots INNER JOIN root_words ON roots.id=root_words.root_id
WHERE (root_words.unsigned_title LIKE '%normalised_text%') OR (root_words.title LIKE '%text%')
OR (unsigned_source LIKE '%normalised_text%') OR (roots.root LIKE '%text%') ORDER by priorities
Also, How can I further improve the speed of the query above?
Thanks!
You index columns in tables, not queries.
None of the search criteria you've specified will be able to make use of indexes (since the search terms begin with a wild card).
You should make sure that the id column is indexed, to speed the JOIN. (Presumably, it's already indexed as a PRIMARY KEY in one table and a FOREIGN KEY in the other).
To speed up this query you will need to use full text search. Adding indexes will not speed up this particular query and will cost you time on INSERTs, UPDATEs, and DELETEs.
Caveat: Indexes speed up retrieval time but cause inserts and updates to run slower.
To answer the question about the implications of indexing every field: there is a performance hit whenever indexed data is modified, whether through inserts, updates, or deletes. This is because the database needs to maintain the index. It's a balance between how often the data is read versus how often it is modified.
In this specific query, the only index that could possibly help would be in your JOIN clause, on the fields roots.id and root_words.root_id.
None of the checks in your WHERE clause could be indexed, because of the leading '%'. This causes the database to scan every row in these tables for a matching value.
If you are able to remove the leading '%', you would then benefit from indexes on these fields... if not, you should look into implementing full-text search; but be warned, this isn't trivial.
Indexing won't help when used in conjunction with LIKE '%something%'.
It's like looking for words in a dictionary that have ae in them somewhere. The dictionary (or Index in this case) is organised based on the first letter of the word, then the second letter, etc. It has no mechanism to put all the words with ae in them close together. You still end up reading the whole dictionary from beginning to end.
Indexing the fields used in the CASE clause will likely not help you. Indexing helps by making it easy to find records in a table. The CASE clause is about processing the records you have found, not finding them in the first place.
Optimisers can also struggle with optimising multiple unrelated OR conditions such as yours. The optimiser is trying to narrow down the amount of effort to complete your query, but that's hard to do when unrelated conditions could make a record acceptable.
All in all your query would benefit from indexes on root_words(root_id) and/or roots(id), but not much else.
If you were to index additional fields though, the two main costs are:
- Increased write time (insert, update or delete) due to additional indexes to write to
- Increased space taken up on the disk
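The write cost can be illustrated with a toy Python model in which every index is a sorted list that has to be updated on each insert. This is only a sketch: the `Table` class is hypothetical, and real storage engines maintain B-trees rather than lists.

```python
from bisect import insort

class Table:
    """Toy table that keeps one sorted list per indexed column."""

    def __init__(self, indexed_columns):
        self.rows = []
        self.indexes = {col: [] for col in indexed_columns}
        self.index_writes = 0

    def insert(self, row):
        self.rows.append(row)
        row_id = len(self.rows) - 1
        for col, index in self.indexes.items():
            insort(index, (row[col], row_id))  # keep this index sorted
            self.index_writes += 1

lean = Table(["root_id"])
heavy = Table(["root_id", "title", "unsigned_title", "unsigned_source"])
for i in range(1000):
    row = {"root_id": i % 50, "title": f"t{i}",
           "unsigned_title": f"u{i}", "unsigned_source": f"s{i}"}
    lean.insert(row)
    heavy.insert(row)

print(lean.index_writes, heavy.index_writes)
```

Each additional index adds one more structure to update per inserted row and one more copy of the indexed values to store.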

MySQL Fulltext search but using LIKE

I'm recently doing some string searches from a table with about 50k strings in it, fairly large I'd say but not that big. I was doing some nested queries for a 'search within results' kinda thing. I was using LIKE statement to get a match of a searched keyword.
I came across MySQL's Full-Text search, so I tried it and added a fulltext index to my str column. I'm aware that Full-text searches don't work on virtually created tables or even with Views, so queries with sub-selects will not fit. As I mentioned, I was doing nested queries; an example is:
SELECT s2.id, s2.str
FROM
(
SELECT s1.id, s1.str
FROM
(
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
) AS s1
WHERE s1.str LIKE '%another_term%'
) AS s2
WHERE s2.str LIKE '%a_much_deeper_term%';
This is actually not applied to any code yet, I was just doing some tests. Also, searching strings like this can be easily achieved by using Sphinx (performance wise) but let's consider Sphinx not being available and I want to know how this will work well in pure SQL query. Running this query on a table without Full-text added takes about 2.97 secs. (depends on the search term). However, running this query on a table with Full-text added to the str column finished in like 104ms which is fast (i think?).
My question is simple, is it valid to use LIKE or is it a good practice to use it at all in a table with Full-text added when normally we would use MATCH and AGAINST statements?
Thanks!
In this case you don't necessarily need subselects. You can simply use:
SELECT id, str
FROM strings
WHERE str LIKE '%term%'
AND str LIKE '%another_term%'
AND str LIKE '%a_much_deeper_term%'
... but this also raises a good question: the order in which you are excluding the rows. I guess MySQL is smart enough to assume that the longest term will be the most restrictive, so starting with a_much_deeper_term it will eliminate most of the records and then perform additional comparisons on only a few rows. Conversely, if you start with term you will probably end up with many candidate records that you then have to compare against the rest of the terms.
The interesting part is that you can force the order in which the comparisons are made by using your original subselect example. This gives you the opportunity to decide which term is the most restrictive based on more than just the length, for example:
- the ratio of consonants to vowels
- the longest chain of consonants in the word
- the most used vowel in the word
...etc. You can also apply some heuristics based on the type of textual information you are handling.
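To see why the order of the filters can matter at all, here is a small Python sketch that counts substring checks when the terms are tested in a given order and testing stops at the first failure. It assumes left-to-right, short-circuit evaluation, which is an assumption of the sketch and not a statement about how MySQL evaluates the conditions; the sample strings are invented.

```python
def count_checks(strings, terms):
    """Apply substring filters in the given order; return (matches, checks made)."""
    checks = 0
    matches = []
    for s in strings:
        for term in terms:
            checks += 1
            if term not in s:
                break  # short-circuit: later terms are never tested for this row
        else:
            matches.append(s)  # reached only if every term was found
    return matches, checks

strings = [
    "a term",
    "another_term here",
    "term another_term a_much_deeper_term",
    "nothing",
]

broad_first = ["term", "another_term", "a_much_deeper_term"]
narrow_first = ["a_much_deeper_term", "another_term", "term"]

print(count_checks(strings, broad_first))
print(count_checks(strings, narrow_first))
```

Both orders return the same single match, but putting the most restrictive term first takes 6 checks instead of 9 on this data.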
Edit:
This is just a hunch, but it could be possible to apply the LIKE to the words in the fulltext index itself, and then match the rows against the index as if you had searched for full words.
I'm not sure if this is actually done, but it would be a smart thing for the MySQL people to pull off. Also note that this theory can only work if all possible occurrences are in fact in the fulltext index. For this you need that:
- Your search pattern must be at least as long as the minimum word length. (If you are searching for %id%, for example, it can be part of a 3-letter word too, which is excluded by default from the FULLTEXT index.)
- Your search pattern must not be a substring of any excluded word (stopword), for example: and, of, etc.
- Your pattern must not contain any special characters.

log-queries-not-using-indexes and LIKE in MySQL

I have this "log-queries-not-using-indexes" enabled in MySQL. I have one query which is being logged by MySQL as such i.e. query that is not using indexes. The thing is this query uses LIKE for e.g.
category like '%fashion%'
and if I remove LIKE or change it to
category = 'fashion'
then it says it is using indexes.
So when we are using LIKE in our query, MySQL will log it as not using indexes no matter what?
Thanks
Using a clause like %fashion% will never use a regular index. You need a full-text index if you want to do that kind of search.
Remember that an index on a varchar column is ordered starting from the first characters of the string. So, if you are searching for an occurrence of fashion in any part of the string, the index will offer no help in improving performance, since every single string has to be searched.
However, if you only search for the first part like this:
select * from table where field like 'fashion%'
Then index could be used and could be helpful.