MySQL FULLTEXT decimal point treated as word separator

We sell LiPo batteries that are 3.7v, 7.4v, 11.1v, and the voltage appears in a character-based description field. It should be possible to FULLTEXT index that field with an ft_min_word_len of 4 and have the index contain tokens such as "3.7v", so that searches find them. All my experiments show that these tokens are missing from the index, and I suspect the decimal point is acting as a token separator, leaving no token long enough to meet the minimum length.
What am I doing wrong? Why won't MATCH ... AGAINST '3.7v' find my entries? Does MySQL FULLTEXT understand the difference between a full stop and a decimal point?

Even if FULLTEXT were smart enough to recognize those two uses of ".", what about the five other uses? And what about other punctuation marks? When should "_" be part of a "word" and when not? Etc, etc.
Here's a suggestion for your situation (and many others).
Cleanse the data.
Put it in the table.
Similarly, cleanse the query to be fed into the AGAINST clause.
By "cleanse", I mean do any of several things to modify the data to work adequately with FULLTEXT's limitations.
In your one example, I suggest changing 3.7v or 3.7 v to 3_7v.
You may find that some "words" are shorter than min_word_length; for them, you could pad them or do some other kludge.
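Here's a minimal sketch of that cleansing in SQL, assuming a hypothetical products table whose description column carries the FULLTEXT index:

-- One-time cleanup of existing rows (repeat for the other voltages):
UPDATE products
    SET description = REPLACE(description, '3.7v', '3_7v');

-- Cleanse the search term the same way before it reaches AGAINST:
SELECT *
    FROM products
    WHERE MATCH(description) AGAINST('3_7v' IN BOOLEAN MODE);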
I recommend you use InnoDB, not MyISAM, for all MySQL work. (And note that the corresponding setting there is innodb_ft_min_token_size, which defaults to 3.)

I found a solution here...
https://dev.mysql.com/doc/refman/8.0/en/full-text-adding-collation.html
MySQL documentation, section 12.9.7.
Basically, there are XML files that control the behaviour of character sets, and I was able to change the classification of the "." character from punctuation to a regular word character. Given that the column contains part numbers, I changed most of the punctuation characters so they were no longer treated as punctuation, creating a new collation, and used that collation for my part-number column. It now works as required.
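For illustration, once a collation is defined per that manual page, applying it to the column also rebuilds the FULLTEXT index. A sketch, where the collation name, table, and column are all illustrative:

ALTER TABLE parts
    MODIFY part_number VARCHAR(64)
    CHARACTER SET latin1
    COLLATE latin1_fulltext_ci;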

Related

Searching for an exact word in multiple languages efficiently using MySQL

I have a simple database table which stores id, language and text. I want to search for any word/character and find an exact match. The catch is that I have over 10 million rows.
e.g. a search for the word i would return rows whose text contains "i", like "information was bad" and "I like dogs".
This also needs to work for stopwords and other languages which don't use whitespace.
My first thought was just to do LOWERCASE(text) LIKE '%word%' with a lowercase index on text, but after googling it seems that would do a full table scan. I am using PlanetScale, so I have to pay for full table scans, which simply cannot work, as I would run out of usage quota quickly.
My next thought was a BOOLEAN MODE full-text search, but then I run into the issues of stopwords being ignored in English, of needing an ngram parser for languages like Chinese, and of having to work out which language is being submitted and which index should be used.
Does anyone have any better ideas?
Use CHARACTER SET utf8mb4
Use the latest available COLLATION for that charset -- utf8mb4_unicode_520_ci or utf8mb4_0900_ai_ci or something else for the latest MariaDB.
Do not use LOWERCASE or LOWER (etc); instead, let the collation take care of case (note the "ci" in the collation name).
Yes, you may need ngram instead of FULLTEXT for certain Asian languages.
The stoplist can be turned off.
The min word length can be changed -- at a cost. (Both are server settings; see the config sketch after this list.)
Your app code can look at the encoding to decide whether to use ngram or FULLTEXT.
This provides a list of hex values: http://mysql.rjweb.org/doc.php/charcoll#diagnosing_charset_issues Note that E3-EA is mostly "wordless" languages.
I recommend using app code to make these decisions and build the SQL query. It may even degenerate to LIKE '%word%' or REGEXP '\\bword\\b' in some cases. Note that REGEXP is generally slower than LIKE, but it provides "word boundary" testing when the search strings contain multiple words.
When applicable, FULLTEXT is significantly faster than any other technique.
When doing WHERE ... AND MATCH ..., the match will be performed first, even if the rest of the WHERE is more selective.
LIKE '%...' and (virtually all) REGEXP tests will read and test every one of your 10M rows (unless there is a LIMIT).
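As a rough sketch of those knobs in my.cnf (values illustrative; InnoDB FULLTEXT indexes must be rebuilt after changing the token size):

[mysqld]
innodb_ft_enable_stopword = OFF   # turn the stoplist off
innodb_ft_min_token_size = 1      # index shorter words -- bigger index
ngram_token_size = 2              # token size for the ngram parser

And a hypothetical ngram-parsed index for the languages without whitespace (table name invented; the app decides which table/index to query):

CREATE FULLTEXT INDEX ft_text_ngram ON messages_cjk (text) WITH PARSER ngram;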

How MySQL full-text search works with special characters using a Natural Language MATCH predicate

I've been having a hard time trying to fix some bugs reported against an articles search; I'm getting odd behaviours depending on the query, and I've tried fixes like wrapping the query in "", replacing the special characters with spaces, and splitting the text on spaces/closing quotes, brackets and parentheses, but none of it has been very useful.
I've looked through a lot of pages/documentation and haven't completely understood how this search works. Here's a bit more context on the problem:
This articles search uses a natural language MATCH predicate against the title and content. Both content and title can contain special characters, numbers, IPs and even URLs, so the expectation is that this search returns the most accurate/exact results, but this doesn't happen all the time, and it varies with how the user types the text.
An example:
If I search by an entire article title, for example Guess who's back - tl;dr: Emot, at the top of the results I get the article matching that title, but I also get other results that seem to contain any of the words in the text I typed.
But if I search a fragment of the previous example, tl;dr:, I do not get any results. Any idea why this happens? Is there some internal config that MySQL text search applies when performing the search?
Something worth mentioning is that I do not know how the tables/indexes were configured; I do not have access to that kind of information. I'm just trying to understand how MySQL works here, so I can tell my manager whether the behaviour he and the customers expect will or won't be possible depending on what they're searching.
To anyone who can help me with this, thanks in advance.
The starting point is MySQL's documentation on Natural Language Full-Text Searches. The documentation is quite comprehensive.
Matching of title and getting multiple results:
The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".
Searching on tl;dr and not getting results is explained in two different places: the first describes what full-text search considers a word, the second describes a further limitation on indexing words that are too short:
The MySQL FULLTEXT implementation regards any sequence of true word characters (letters, digits, and underscores) as a word. That sequence may also contain apostrophes ('), but not more than one in a row. This means that aaa'bbb is regarded as one word, but aaa''bbb is regarded as two words. Apostrophes at the beginning or the end of a word are stripped by the FULLTEXT parser; 'aaa'bbb' would be parsed as aaa'bbb. The built-in FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, " " (space), "," (comma), and "." (period).
Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM.
Based on what you described, you seem to be looking for exact substring matching (the LIKE operator) rather than what full-text search is designed for.
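To make the tl;dr failure concrete, here's a sketch against a hypothetical articles(title, content) table with a FULLTEXT index:

-- ";" is a delimiter, so 'tl;dr' is parsed as the words tl and dr;
-- both are below the default InnoDB minimum token size of 3, so
-- neither is in the index and nothing matches:
SELECT title FROM articles WHERE MATCH(title, content) AGAINST('tl;dr');

-- Exact substring matching sidesteps tokenization, at the cost of a scan:
SELECT title FROM articles WHERE title LIKE '%tl;dr%' OR content LIKE '%tl;dr%';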

MySQL Match Fulltext

I'm trying to do a full-text search with MySQL to match a string, but it's returning odd results at the top of the list.
For example, with the string 'passat 2.0 tdi':
AND MATCH (records_veiculos.titulo, records_veiculos.descricao)
AGAINST ('passat 2.0 tdi' WITH QUERY EXPANSION)
is returning this as the first result (the others are fine):
Volkswagen Passat Variant 1.9 TDI- ANO 2003
which is incorrect, since there's no "2.0" in this example.
What could it be?
edit: Also, since this will probably be a large database (expecting up to 500,000 records), will this search method hold up on its own, or would it be better to install another search engine such as Sphinx? And if it stays with MySQL, how do I show relevant results?
edit2: For the record, despite the question being marked as answered, the problem with the MySQL delimiters persists, so if anyone has a suggestion on how to escape delimiters, it would be appreciated and worth the 500 points at stake. The solution I found to improve the result set was to replace WITH QUERY EXPANSION with IN BOOLEAN MODE, using operators to force the engine to match the words I needed, like:
AND MATCH (records_veiculos.titulo, records_veiculos.descricao)
AGAINST ('+passat +2.0 +tdi' IN BOOLEAN MODE)
It didn't solve everything, but at least the relevance of the results has changed significantly.
From the MySQL documentation on Fulltext search:
"The FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, “ ” (space), “,” (comma), and “.” (period)."
This means that the period delimits the 2 and the 0, so it's not looking for '2.0'; it's looking for '2' and '0', and not finding them. WITH QUERY EXPANSION is probably causing relevant related words to show up, obviating the need for '2' and '0' to appear as individual words in the ranked results. A minimum word length may also be being enforced.
By default, I believe MySQL only indexes and matches words with 4 or more characters. You could also try escaping the period; it might be ignoring it or otherwise treating it as a stop character.
What is the match rank that it returns for that? Does the match have to contain all the "words"? My understanding was that it works like Google and only needs to match some of the words.
Having said that, be mindful of the effect of adding WITH QUERY EXPANSION: it automatically runs a second search for "related" words, which may not be what you typed, but which the full-text engine deems probably related.
Relevant Documentation: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
The "." is what's matching on 2003 in your query results.
If you're going to do searches on 3 character text strings, you should set ft_min_word_len=3
in your mysql config, restart mysql. Otherwise, a search for "tdi" will return results with "TDI-" but not with just "TDI", because rows with "TDI-" will be indexed but "TDI" alone will not.
After making that config change, you'll have to rebuild your index on that table. (Warning: your index might be significantly larger now.)
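For reference, a sketch of that change (MyISAM-era settings; the server restart must happen before the rebuild):

# my.cnf
[mysqld]
ft_min_word_len = 3

-- then rebuild the MyISAM full-text index:
REPAIR TABLE records_veiculos QUICK;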

Sphinx - delimiters

I would like to know whether the Sphinx engine works with the same delimiters as regular MySQL (such as commas and periods). I don't want to avoid them entirely, but to escape them, or at least ensure they don't conflict when performing MATCH operations in FULLTEXT searches, since I have problems dealing with them in MySQL by default and would prefer not to be forced to replace those delimiters with other characters to get a good set of results.
Sorry if I'm saying something stupid, but I don't have experience with Sphinx or other complementary (?) search engines.
To give you an example, if I perform a search with
"Passat 2.0 TDI"
MySQL by default would identify the period here as a delimiter, and since "2" and "0" are too short to be considered words by default, the results would be a bit messed up.
Is it easy to handle with Sphinx (or other search engine)? I'm open to suggestions.
This is for a large project, with probably more than 500,000 records (not trivial at all).
Cheers!
You can effectively control which characters are delimiters by specifying the charset_table of a specific Sphinx index.
If you exclude a character from your charset_table, it effectively acts as a delimiter. If you include it (even the space, as U+0020), it no longer acts as a delimiter and becomes part of your tokens.
Each index (which uses one or more Sphinx data sources) can have a different charset_table, for flexibility.
NB: If you want single-character words, you can lower the min_word_len of the Sphinx index; a sketch of both settings follows below.
This is probably the best section of the documentation to read. As Sphinx is primarily a full-text engine, it's highly tunable in how it handles phrases and how you pass them in.
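For example, a sketch of a sphinx.conf index definition (names and character ranges are illustrative, not a drop-in config):

index veiculos_idx
{
    # source, path, etc. omitted
    min_word_len  = 1
    # Listing "." (U+002E) and "-" (U+002D) makes them word characters
    # rather than delimiters, so "2.0" survives as one token:
    charset_table = 0..9, A..Z->a..z, _, a..z, U+002E, U+002D
}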

Ignoring HTML entity for ampersand in MySQL Full-Text Search

I have a lot of data being entered into records containing the HTML entity &amp;. A full-text search for the word "amp" causes records containing &amp; to be shown, which is highly undesirable.
Presumably this is because MySQL ignores the '&' and the ';'. So does anyone know of any way within MySQL to force it to treat special characters as part of the word, so that my search for "amp" doesn't include all results with &amp; in them - ideally without some form of subquery or extra WHERE clause?
My solution so far (not yet implemented) is to decode the entities on INSERT and re-encode them when displaying on the web. This would be OK, but it adds overhead to everything, which I'd like to avoid if possible. Also, it works well for new entries, but I would need to backfill nearly 7 million records... which I'd rather not do if I can help it.
--
I updated my my.cnf file with the following:
ft_stopword_file = /etc/mysql/custom-stopwords
Do there need to be any special permissions on this file?
Your "decode HTML entities on INSERT and encode them on output" is your best bet, that'll take care of things like " as well. You'd probably want to strip out HTML tags too along the way to keep MySQL from finding things in attribute values.
If speed and formatting are an issue, you could stuff the text/plain version in a separate column, put your full-text index on that, and let everything else use the text/html version. Of course, you'd have to maintain both columns at the same time, and your storage requirements would go up; OTOH, this approach would let you add tags, author names, and other extra bits of interesting data to the index without messing up your displayed text.
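A sketch of that two-column layout (hypothetical names; the app keeps body_text in sync with body_html):

ALTER TABLE articles
    ADD COLUMN body_text MEDIUMTEXT,
    ADD FULLTEXT (body_text);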
In the meantime, did you rebuild your full-text index after you added the ft_stopword_file to your config file? AFAIK, stopwords are applied on the way into the index rather than when the index is consulted.
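Since stopword files are read at index-build time, after changing ft_stopword_file and restarting, something like this (table name hypothetical) forces the rebuild:

REPAIR TABLE articles QUICK;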
Perhaps you need to specifically ignore these: try adding -&amp; to your full-text query. Another option, though I am unsure whether it requires a MySQL source-code change, is to add amp and &amp; to MySQL's stopword list.
You added it to the stopwords file and it's not working? That sounds like either a bug in MySQL or your stopwords list not being used. Have you reviewed this? Quote:
False hits or misses may occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server.
Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case insensitive if the collation is latin1_swedish_ci, whereas lookups are case sensitive if the collation is latin1_general_cs or latin1_bin.
Could any of those possibilities be preventing your stopword entry for &amp; from being read?