Problem: I have a table containing "keywords" associated with a "label".
I need to find, with a query, the label associated with an input string.
Example:
DB (table):
keywords | label     | weight
---------|-----------|-------
PLOP     | ploplabel | 12
PLOP     | ploplbl   | 8
TOTO     | totolabel | 4
...      | ...       | ...
Input string : "PLOP 123"
Should return: "ploplabel"
The first query that came to mind for a partial keyword search was:
SELECT label FROM table WHERE keywords LIKE "%inputstring%" ORDER BY weight DESC
But as you may have noticed, I need the opposite, something like:
SELECT label FROM table WHERE %keywords% LIKE "inputstring" ORDER BY weight DESC
Is this something we can do in MySQL (InnoDB === no fulltext)?
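For reference, that "reversed" LIKE can be written in MySQL by building the pattern from the column instead of the input. A minimal sketch against the example table above (the table name keyword_table is made up):
SELECT label
FROM keyword_table                                -- hypothetical name for the table above
WHERE 'PLOP 123' LIKE CONCAT('%', keywords, '%')  -- the column supplies the pattern
ORDER BY weight DESC
LIMIT 1;
Note that this cannot use an index on keywords, so it scans the table; that limitation is exactly what the fulltext discussion below is about.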
Build your system using InnoDB and create a MyISAM fulltext table to index back into your InnoDB data. That way you get all the advantages of the InnoDB engine (clustered primary key indexes, row-level locking, and transactions), supplemented by the fulltext capabilities of one or more MyISAM tables.
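A minimal sketch of that layout (all names are made up): the InnoDB table holds the real rows, a MyISAM shadow table holds a copy of only the searchable text plus the FULLTEXT index, and searches join back by id.
-- InnoDB table holds the authoritative data
CREATE TABLE articles (
  id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  body TEXT NOT NULL
) ENGINE=InnoDB;

-- MyISAM shadow table carries only the text to be indexed
CREATE TABLE articles_ft (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,
  body       TEXT NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- search the shadow table, then join back into the InnoDB data
SELECT a.*
FROM articles_ft f
JOIN articles a ON a.id = f.article_id
WHERE MATCH(f.body) AGAINST('+plop*' IN BOOLEAN MODE);
The shadow table has to be kept in sync by the application (or by triggers) whenever the InnoDB rows change.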
Any way to achieve fulltext-like search on InnoDB
Related
I've upgraded a table from MyISAM to InnoDB but am not getting the same results. The InnoDB table returns a score of 0 when there should be a match. The MyISAM table returns a match for the same term (I kept a copy of the old table so I can still run the same query).
SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table_myisam
where id = 1;
Returns:
+-------+
| score |
+-------+
| 1 |
+-------+
but:
SELECT MATCH (COLUMNS) AGAINST ('+"Term Ex"' IN BOOLEAN MODE) as score
FROM table
where id = 1;
returns:
+-------+
| score |
+-------+
| 0 |
+-------+
I thought the ex might not have been indexed because innodb_ft_min_token_size was set to 3. I lowered that to 1 and optimized the table, but that had no effect. The column contents are 99 characters long, so I presumed the whole column wasn't indexed because of innodb_ft_max_token_size. I increased that as well, to 150, and ran the optimize again, but again had the same result.
The only difference between these tables is the engine and the character set. The InnoDB table is using utf8; the MyISAM table is using latin1.
Has anyone seen this behavior, or have advice on how to resolve it?
UPDATE:
I added ft_stopword_file="" to my my.cnf and ran OPTIMIZE TABLE table again. This time I got
optimize | note | Table does not support optimize, doing recreate + analyze instead
The query worked after this change. Ex is not a stop word, though, so I'm not sure why that would make a difference.
A new query that fails though is:
SELECT MATCH (Columns) AGAINST ('+Term +Ex +in' IN BOOLEAN MODE) as score FROM Table where id = 1;
+-------+
| score |
+-------+
| 0 |
+-------+
The in causes this to fail, even though it is the next word in my table.
SELECT MATCH (Columns) AGAINST ('+Term +Ex' IN BOOLEAN MODE) as score FROM Table where id = 1;
+--------------------+
| score |
+--------------------+
| 219.30206298828125 |
+--------------------+
I also tried CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;, then updated my.cnf with innodb_ft_server_stopword_table='db/my_stopwords'. I restarted and ran:
show variables like 'innodb_ft_server_stopword_table';
which brought back:
+---------------------------------+---------------------------+
| Variable_name | Value |
+---------------------------------+---------------------------+
| innodb_ft_server_stopword_table | 'db/my_stopwords'; |
+---------------------------------+---------------------------+
so I thought the in would no longer cause the query to fail, but it still does. I also tried OPTIMIZE TABLE table again, and even ALTER TABLE table DROP INDEX ... and ALTER TABLE table ADD FULLTEXT KEY ..., none of which has had any effect.
Second Update
The issue is with the stop words.
$userinput = preg_replace('/\b(a|about|an|are|as|at|be|by|com|de|en|for|from|how|i|in|is|it|la|of|on|or|that|the|this|to|was|what|when|where|who|will|with|und|the|www)\b/', '', $userinput);
resolves the issue, but that doesn't seem like a good solution to me. I'd like a solution that keeps the stop words from breaking this inside MySQL.
Stopword table data:
CREATE TABLE `my_stopwords` (
`value` varchar(30) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
and
Name: my_stopwords
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 0
Avg_row_length: 0
Data_length: 16384
Max_data_length: 0
Index_length: 0
Data_free: 0
Auto_increment: NULL
Create_time: 2019-04-09 17:39:55
Update_time: NULL
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
There are several differences between MyISAM's FULLTEXT and InnoDB's. I think you were caught by the handling of 'short' words and/or stop words. MyISAM will return rows, but InnoDB will not.
What I have done when using FT (and after switching to InnoDB) is to filter the user's input to avoid short words. It takes extra effort but gets me the rows I want. My case is slightly different, since the resulting query is something like the following. Note that I have added + to require the words, but not on words shorter than 3 characters (my innodb_ft_min_token_size is 3). These searches were for "build a table" and "build the table":
WHERE match(description) AGAINST('+build* a +table*' IN BOOLEAN MODE)
WHERE match(description) AGAINST('+build* +the* +table*' IN BOOLEAN MODE)
(The trailing * may be redundant; I have not investigated that.)
Another approach
Since FT is very efficient at non-short, non-stop words, do the search in two phases, each being optional. To search for "a long word", do
WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
AND d REGEXP '[[:<:]]a[[:>:]]'
The first part whittles down the possible rows rapidly by looking for 'long' and 'word' (as words). The second part makes sure there is a word a in the string, too. The REGEXP is costly but will be applied only to those rows that pass the first test.
To search just for "long word":
WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
To search just for the word "a":
WHERE d REGEXP '[[:<:]]a[[:>:]]'
Caveat: This case will be slow.
Note: My examples allow for the words to be in any order, and in any location in the string. That is, this string will match in all my examples: "She was longing for a word from him."
Here is a step-by-step procedure which should reproduce your problem. (This is actually how you should have written your question.) The environment is a freshly installed VM with Debian 9.8 and Percona Server Ver 5.6.43-84.3.
Create an InnoDB table with a fulltext index and some dummy data:
create table test.ft_innodb (
txt text,
fulltext index (txt)
) engine=innodb charset=utf8 collate=utf8_unicode_ci;
insert into test.ft_innodb (txt) values
('Some dummy text'),
('Text with a long and short stop words in it ex');
Execute a test query to verify that it doesn't yet work the way we need:
select txt
, match(t.txt) against ('+some' in boolean mode) as score0
, match(t.txt) against ('+with' in boolean mode) as score1
, match(t.txt) against ('+in' in boolean mode) as score2
, match(t.txt) against ('+ex' in boolean mode) as score3
from test.ft_innodb t;
Result (rounded):
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0 | 0 | 0
As you see, it's not working with stop words ("+with") or with short words ("+ex").
Create an empty InnoDB table for custom stop words:
create table test.my_stopwords (value varchar(30)) engine=innodb;
Edit /etc/mysql/my.cnf and add the following two lines in the [mysqld] block:
[mysqld]
# other settings
innodb_ft_server_stopword_table = "test/my_stopwords"
innodb_ft_min_token_size = 1
Restart MySQL with service mysql restart
Run the query from (2.) again (The result should be the same)
Rebuild the fulltext index with
optimize table test.ft_innodb;
It will actually rebuild the entire table, including all indexes.
Execute the test query from (2.) again. Now the result is:
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0.0906 | 0.0906 | 0.0906
You see it works just fine for me, and it's quite simple to reproduce. (Again, this is how you should have written your question.)
Since your procedure is more chaotic than detailed, it's difficult to say what went wrong for you. For example:
CREATE TABLE my_stopwords(value VARCHAR(30)) ENGINE = INNODB;
This doesn't say in which database you defined that table. Note that I have prefixed all my tables with the corresponding database. Now consider the following: I change my.cnf and set innodb_ft_server_stopword_table = "db/my_stopwords". Note that there is no such table on my server (not even the schema db exists). I restart the MySQL server and check the new setting with
show variables like 'innodb_ft_server_stopword_table';
This returns:
Variable_name | Value
--------------------------------|----------------
innodb_ft_server_stopword_table | db/my_stopwords
And after optimize table test.ft_innodb; the test query returns this:
txt | score0 | score1 | score2 | score3
-----------------------------------------------|--------|--------|--------|-------
Some dummy text | 0.0906 | 0 | 0 | 0
Text with a long and short stop words in it ex | 0 | 0 | 0 | 0.0906
You see? It's no longer working with stop words, but it does work with short non-stop words like "+ex". So make sure that the table you defined in innodb_ft_server_stopword_table actually exists.
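A quick way to check that (a sketch, assuming the table is still called my_stopwords) is to ask information_schema which schema, if any, actually contains it:
-- lists every schema that has a table named my_stopwords
SELECT table_schema, table_name, engine
FROM information_schema.tables
WHERE table_name = 'my_stopwords';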
A common technique in searching is to make an extra column with the 'sanitized' string to search in. Then add the FULLTEXT index to that column instead of the original column.
In your case, removing the stop words is the main difference. But there may also be punctuation that could (should?) be removed. Sometimes hyphenated words, contractions, part numbers, or model numbers cause trouble. They can be modified to change the punctuation or spacing to make them friendlier to the FT requirements and/or the user's flavor of input. Another thing is to add words to the search-string column that are common misspellings of the words already in the column.
Sure, this is more work than you would like to have to do. But I think it provides a viable solution.
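A rough sketch of that idea, with made-up table and column names; the cleaned copy is what carries the FULLTEXT index, and the application (or an UPDATE like the naive one below) keeps it in sync with the original column:
-- add a second column holding the sanitized text
ALTER TABLE docs ADD COLUMN txt_search TEXT;

-- naive cleanup: pad with spaces, lower-case, and strip a couple of stop words
UPDATE docs
SET txt_search = REPLACE(REPLACE(CONCAT(' ', LOWER(txt), ' '), ' the ', ' '), ' in ', ' ');

-- index the sanitized column instead of the original one
ALTER TABLE docs ADD FULLTEXT INDEX ft_txt_search (txt_search);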
I'm trying to query using MySQL FULLTEXT, but unfortunately it returns empty results even though the table contains the input keyword.
Table: user_skills:
+----+----------------------------------------------+
| id | skills |
+----+----------------------------------------------+
| 1 | Dance Performer,DJ,Entertainer,Event Planner |
| 2 | Animation,Camera Operator,Film Direction |
| 3 | DJ |
| 4 | Draftsman |
| 5 | Makeup Artist |
| 6 | DJ,Music Producer |
+----+----------------------------------------------+
Indexes:
Query:
SELECT id,skills
FROM user_skills
WHERE ( MATCH (skills) AGAINST ('+dj' IN BOOLEAN MODE))
When I run the above query, none of the DJ rows are returned. In the table there are 3 rows that contain the value DJ.
A full text index is the wrong approach for what you are trying to do. But, your specific issue is the minimum word length, which is either 3 or 4 (by default), depending on the ending. This is explained in the documentation, specifically here.
Once you reset the value, you will need to recreate the index.
I suspect you are trying to be clever. You have probably heard the advice "don't store lists of things in delimited strings". But you instead countered "ah, but I can use a full text index". You can, although you will find that more complex queries do not optimize very well.
Just do it right. Create the association table user_skills with one row per user and per skill that the user has. You will find it easier to use in queries, to prevent duplicates, to optimize queries, and so on.
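A sketch of that normalized layout (names and types are assumptions):
-- one row per (user, skill) pair instead of a comma-separated list
CREATE TABLE user_skill (
  user_id INT UNSIGNED NOT NULL,
  skill   VARCHAR(50)  NOT NULL,
  PRIMARY KEY (user_id, skill),
  KEY idx_skill (skill)
);

-- "which users have the DJ skill?" becomes a plain indexed lookup
SELECT user_id
FROM user_skill
WHERE skill = 'DJ';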
Your search term is too short, as the MySQL docs explain:
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for
InnoDB search indexes, or four characters for MyISAM. You can control
the cutoff by setting a configuration option before creating the
index: innodb_ft_min_token_size configuration option for InnoDB search
indexes, or ft_min_word_len for MyISAM.
And, from the boolean full-text search documentation:
Boolean full-text searches have these characteristics:
They do not use the 50% threshold.
They do not automatically sort rows in order of decreasing relevance.
You can see this from the preceding query result: The row with the
highest relevance is the one that contains “MySQL” twice, but it is
listed last, not first.
They can work even without a FULLTEXT index, although a search
executed in this fashion would be quite slow.
The minimum and maximum word length full-text parameters apply.
https://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html
https://dev.mysql.com/doc/refman/5.6/en/fulltext-boolean.html
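If you do lower the limit, remember that innodb_ft_min_token_size is read only at server startup, and the existing FULLTEXT index has to be rebuilt before short words such as dj become searchable. A sketch of the rebuild (the index name ft_skills is assumed):
-- after setting innodb_ft_min_token_size (e.g. to 2) in my.cnf and restarting:
SET GLOBAL innodb_optimize_fulltext_only = ON;
OPTIMIZE TABLE user_skills;        -- rebuilds only the fulltext index data
SET GLOBAL innodb_optimize_fulltext_only = OFF;

-- or drop and re-create the fulltext index instead
ALTER TABLE user_skills DROP INDEX ft_skills;
ALTER TABLE user_skills ADD FULLTEXT INDEX ft_skills (skills);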
A very simple problem, yet hard to find a solution for.
An address table with 2,498,739 rows has min_ip and max_ip fields. These are the core anchors of the table for filtering.
The query is very simple.
SELECT *
FROM address a
WHERE min_ip < value
AND max_ip > value;
So it is logical to create an index for the min_ip and max_ip to make the query faster.
Index created for the following.
CREATE INDEX ip_range ON address (min_ip, max_ip) USING BTREE;
CREATE INDEX min_ip ON address (min_ip ASC) USING BTREE;
CREATE INDEX max_ip ON address (max_ip DESC) USING BTREE;
I did try creating just the first option (the combination of min_ip and max_ip), but it did not help, so I prepared at least 3 indexes to give MySQL more options for index selection. (Note that this table is pretty much static, more of a lookup table.)
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| network | varchar(20) | YES | | NULL | |
| min_ip | int(11) unsigned | NO | MUL | NULL | |
| max_ip | int(11) unsigned | NO | MUL | NULL | |
+------------------------+---------------------+------+-----+---------------------+-----------------------------+
Now, it should be straightforward to query the table with min_ip and max_ip as the filter criteria.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
The query takes somewhere around 0.120 to 0.200 secs. However, under a load test, performance degrades rapidly.
MySQL server CPU usage skyrockets to 100% on just a few simultaneous queries, and performance does not scale up.
The slow query log was enabled with a threshold of 10 secs or higher, and the SELECT query shows up in the log just a few seconds into the load test.
So I checked the query with EXPLAIN and found out that it didn't use an index.
Explain plan result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- ------ ------- ------ ------- -------------
1 SIMPLE a ALL ip_range,min_ip,max_ip (NULL) (NULL) (NULL) 2417789 Using where
Interestingly, it lists ip_range, min_ip and max_ip as possible keys, but it never uses any of them, as shown in the key column.
I know I can use FORCE INDEX and tried to use explain plan on it.
EXPLAIN
SELECT *
FROM address a
FORCE INDEX (ip_range)
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
Explain plan with FORCE INDEX result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ------------- -------- ------- ------ ------- -----------------------
1 SIMPLE a range ip_range ip_range 4 (NULL) 1208894 Using index condition
With FORCE INDEX, it does use the ip_range index as the key, and rows shows a subset (1,208,894 instead of 2,417,789) of what the query without FORCE INDEX examines.
So using the index should definitely perform better (unless I have misunderstood the EXPLAIN result).
What is more interesting is that, after a couple of tests, I found that in some cases MySQL does use the index even without FORCE INDEX. My observation is that when the value is small, it uses the index.
EXPLAIN
SELECT *
FROM address a
WHERE min_ip < 508496
AND max_ip > 508496;
Explain Result
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- ------ ------ ---------------------- -------- ------- ------ ------ -----------------------
1 SIMPLE a range ip_range,min_ip,max_ip ip_range 4 (NULL) 1 Using index condition
So it puzzles me that, based on the value passed to the SELECT query, MySQL decides when to use an index and when not to.
I can't see the basis for deciding whether to use the index for a given value. I do understand that an index may not be used if there is no suitable index for the WHERE condition, but in this case it is very clear that the ip_range index, built on the min_ip and max_ip columns, is suitable for the WHERE condition.
But the bigger problem I have is: what about other queries? Do I have to go and test them all at scale? And even then, as the data grows, can I rely on MySQL to keep using the index?
Yes, I can always use FORCE INDEX to ensure it uses the index, but that is not standard SQL that works on every database. ORM frameworks may not support FORCE INDEX syntax when generating SQL, and it tightly couples your query to your index names.
Not sure if anyone has encountered this issue before, but it seems like a very big problem to me.
I fully agree with Vatev and the others. MySQL is not the only one that does this; scanning the table is sometimes cheaper than looking at the index first and then looking up the corresponding entries on disk.
The only time it will use the index for sure is when it's a covering index, which means that every column in the query (for this particular table, of course) is present in the index. Meaning, if you need for example only the network column
SELECT network
FROM address a
WHERE min_ip < 2410508496
AND max_ip > 2410508496;
then a covering index like
CREATE INDEX ip_range ON address (min_ip, max_ip, network) USING BTREE;
would only look at the index, as there's no need to look up additional data on disk at all, and the whole index could be kept in memory.
Ranges like that are nasty to optimize. But I have a technique. It requires non-overlapping ranges and stores only a start_ip, not the end_ip (which is effectively available from the 'next' record). It provides stored routines to hide the messy code, involving ORDER BY ... LIMIT 1 and other tricks. For most operations it won't hit more than one block of data, unlike the obvious approaches that tend to fetch half or all the table.
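A sketch of the lookup that technique boils down to, assuming non-overlapping ranges and a table that stores only the start of each range:
-- the one candidate range is the one with the largest start <= value
SELECT *
FROM address
WHERE min_ip <= 2410508496
ORDER BY min_ip DESC
LIMIT 1;
With an index on (min_ip) this reads a single row from the end of the matching index range; the application then only has to check that the value actually falls inside the returned range.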
I agree with all the answers above, but you can try creating only one composite index, like this:
create index ip_rang on address (min_ip ASC, max_ip DESC) using BTREE;
As you know, an index also has the disadvantage of using disk space, so consider carefully which indexes are optimal to keep.
I just stumbled upon a few lines of code in a system I just started working with that I don't really get. The system has a large table that stores lots of entities with unique IDs and removes them once they're no longer needed, but it never reuses the IDs. So the table looks like this
------------------------
| id |info1|info2|info3|
------------------------
| 1 | foo1| foo2| foo3|
------------------------
| 17 | bar1| bar2| bar3|
------------------------
| 26 | bam1| bam2| bam3|
------------------------
| 328| baz1| baz2| baz3|
------------------------
etc.
In one place in the codebase there is a while loop whose purpose is to loop through all entities in the DB and do things to them. Right now this is solved like this
int lastId = fetchMaxId()
int id = 0
while (id = fetchNextId(id, lastId)){
    doStuffWith(id)
}
where fetchMaxId is straightforward
int fetchMaxId(){
return sqlQuery("SELECT MAX(id) FROM Table")
}
but fetchNextId confuses me. It is implemented as
int fetchNextId(currentId, maxId){
return sqlQuery("
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
")
}
This system has been in production for several years, so it obviously works, but when I tried searching for an explanation of why it works I only found people saying the same thing I already thought I knew: the order in which a MySQL DB returns results is not easily determined and should not be relied upon, so if you want a particular order, use an ORDER BY clause. But are there times when you can safely ignore the ORDER BY? This code has worked for 12 years and has kept working through several DB updates. Are we just lucky, or am I missing something here? Before I saw this code I would have said that if you called
fetchNextId(1, 328)
you could end up with either 17 or 26 as the answer.
Some clues to why this works may be that the id column is the primary key of the Table in question and is set to auto-increment, but I can't find any documentation that would explain why
fetchNextId(1, 328)
should always return 17 when called on the table snippet given above.
The short answer is yes, the primary key has an order, all indexes have an order, and a primary key is simply a unique index.
As you have rightly said, you should not rely on data being returned in the order it is stored in; the optimiser is free to return it in any order it likes, and this will depend on the query plan. I will, however, attempt to explain why your query has worked for 12 years.
Your clustered index is just your table data, and your clustering key defines the order it is stored in. The data is stored on the leaf, and the clustering key helps the root (and intermediate nodes) act as pointers to quickly get to the right leaf to retrieve the data. A nonclustered index is a very similar structure, but the lowest level simply contains a pointer to the correct position on the leaf of the clustered index.
In MySQL the primary key and the clustered index are synonymous, so the primary key is ordered, however they are fundamentally two different things. In other DBMS you can define both a primary key and a clustered index, when you do this your primary key becomes a unique nonclustered index with a pointer back to the clustered index.
In its simplest terms, you can imagine a table with an ID column that is the primary key, and another column (A); your B-Tree structure for your clustered index would be something like:
Root Node
+---+
| 1 |
+---+
Intermediate Nodes
+---+ +---+ +---+
| 1 | | 4 | | 7 |
+---+ +---+ +---+
Leaf
+-----------+ +-----------+ +-----------+
ID -> | 1 | 2 | 3 | | 4 | 5 | 6 | | 7 | 8 | 9 |
A -> | A | B | C | | D | E | F | | G | H | I |
+-----------+ +-----------+ +-----------+
In reality the leaf pages will be much bigger, but this is just a demo. Each page also has a pointer to the next page and the previous page for ease of traversing the tree. So when you do a query like:
SELECT ID, A
FROM T
WHERE ID > 5
LIMIT 1;
you are scanning a unique index so it is very likely this will be a sequential scan. Very likely is not guaranteed though.
MySQL will scan the Root node, if there is a potential match it will move on to the intermediate nodes, if the clause had been something like WHERE ID < 0 then MySQL would know that there were no results without going any further than the root node.
Once it moves on to the intermediate node it can identify that it needs to start on the second page (between 4 and 7) to start searching for an ID > 5. So it will sequentially scan the leaf starting on the second leaf page, having already identified the LIMIT 1 it will stop once it finds a match (in this case 6) and return this data from the leaf. In such a simple example this behaviour appears to be reliable and logical. I have tried to force exceptions by choosing an ID value I know is at the end of a leaf page to see if the leaf will be scanned in the reverse order, but as yet have been unable to produce this behaviour, this does not however mean it won't happen, or that future releases of MySQL won't do this in the scenarios I have tested.
In short, just add an order by, or use MIN(ID) and be done with it. I wouldn't lose too much sleep trying to delve into the inner workings of the query optimiser to see what kind of fragmentation, or data ranges would be required to observe different ordering of the clustered index within the query plan.
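For completeness, a sketch of what "just add an order by, or use MIN(ID)" looks like for the query in question:
-- explicit ordering: always returns the smallest qualifying id
SELECT id FROM Table WHERE id > :currentId AND id <= :maxId ORDER BY id LIMIT 1;

-- equivalent aggregate form: returns NULL when nothing is left
SELECT MIN(id) FROM Table WHERE id > :currentId AND id <= :maxId;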
The answer to your question is yes. If you look at MySQL documentation you will see that whenever a table has a primary key it has an associated index.
When looking at the documentation for indexes you will see that they will mention primary keys as a type of index.
So in case of your particular scenario:
SELECT id FROM Table where id > :currentId and id <= :maxId LIMIT 1
The query will stop executing as soon as it has found a value because of the LIMIT 1.
Without the LIMIT 1 it would have returned 17, 26 and 328.
However, with all that said, I don't think you will run into any ordering problems when the primary key is auto-incrementing. But whenever the primary key is, say, a unique employee number instead of an auto-incrementing field, I would not trust the order of the result, because the documentation also notes that MySQL reads sequentially, so the possibility is there that a primary key could fall outside the WHERE clause conditions and be skipped.
I have a large database with two tables: stat and total.
The example of the relation is the following:
STAT:
| ID | total event |
+--------+--------------+
| 7 | 2 |
| 8 | 1 |
TOTAL:
|ID | Event |
+---+--------------+
| 7 | "hello" |
| 7 | "everybody" |
| 8 | "hi" |
This is a very simplified version; also consider that the STAT table could have 500K records, and for each STAT row I can have about 200 TOTAL rows.
Currently, if I run a simple SELECT query on the TOTAL table, the system is terribly slow.
Could anyone give me some advice on how the TOTAL table should be created? Is it possible to tell MySQL that the id column is already sorted, so that there is no reason to scan all the rows until reaching the ones where, for example, id=7?
Add INDEX(ID) to both of your tables, if you have not already.
SELECT COUNT(*) FROM TOTAL WHERE ID=7 will be fast if ID is indexed.
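For example, something along these lines (assuming the columns really are named ID in both tables):
-- give both tables an index on ID so lookups stop scanning every row
ALTER TABLE STAT  ADD INDEX idx_id (ID);
ALTER TABLE TOTAL ADD INDEX idx_id (ID);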
You can add an index, and furthermore you can partition your table.
As per #ypercube's comment, tables are not stored in a sorted state, so one cannot "tell" this to the database. However you can add an index on tables to make them faster to search.
One important thing to check - it looks like TOTAL.ID is intended as a foreign key - if so, the table TOTAL should have a primary key called ID. Rename the existing column of that name to STAT_ID instead, so it is obvious what it is a foreign key for. Then add an index on STAT_ID.
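A sketch of that restructuring (column types are assumptions):
-- rename the foreign-key column so its role is obvious
ALTER TABLE TOTAL CHANGE COLUMN ID STAT_ID INT UNSIGNED NOT NULL;

-- give TOTAL its own surrogate primary key
ALTER TABLE TOTAL ADD COLUMN ID INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST;

-- index the foreign key so joins and lookups by STAT_ID are fast
ALTER TABLE TOTAL ADD INDEX idx_stat_id (STAT_ID);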
Lastly, as a point of style, I recommend that you make your table and column names case-insensitive, and write them in lower-case. It makes it easier to read SQL when keywords are in upper case, and database objects are in lower.