Ignoring HTML entity for ampersand in MySQL Full-Text Search - mysql

I have a lot of data that is being entered into records with the HTML entity &amp;. A full-text search for the word "amp" returns records containing &amp;, which is highly undesirable.
Presumably this is because MySQL ignores the '&' and the ';'. Does anyone know of a way within MySQL to force it to treat special characters as part of the word, so that my search for "amp" doesn't include all results with &amp; in them, ideally without some form of subquery or extra WHERE clause?
My solution so far (not yet implemented) is to decode the entities on INSERT and re-encode them when displaying on the web. This would be OK, but it adds some overhead to everything that I'd like to avoid if possible. Also, it works well for new entries, but I would need to apply it retroactively to nearly 7 million records, which I'd rather not do if I can help it.
--
I updated my my.cnf file with the following:
ft_stopword_file = /etc/mysql/custom-stopwords
Does there need to be any special permissions on this file?

Your "decode HTML entities on INSERT and encode them on output" approach is your best bet; that will take care of things like &quot; as well. You'd probably want to strip out HTML tags along the way too, to keep MySQL from finding things in attribute values.
If speed and formatting are an issue, you could store the text/plain version in a separate column, put your full-text index on that, and let everything else use the text/html version. Of course, you'd have to maintain both columns at the same time and your storage requirement would go up; on the other hand, this approach would let you add tags, author names, and other interesting data to the index without messing up your displayed text.
In the meantime, did you rebuild your full-text index after you added ft_stopword_file to your config file? As far as I know, stopwords are applied on the way into the index rather than when the index is consulted.
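As a rough illustration of the "plain-text column" idea above, here is a minimal Python sketch that decodes entities and drops tags (including attribute values) before the text is written to the indexed column. The function name to_plain_text is hypothetical, and joining text fragments with a space is an assumption that suits indexing but will split a word interrupted by inline markup.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only text nodes, so tag names and attribute values are dropped."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def to_plain_text(markup: str) -> str:
    """Return a tag-free, entity-decoded version of `markup` for indexing."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    # Join fragments with a space, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())
```

The same function could be run once over the existing rows to fill the new column, leaving the original HTML column untouched for display.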

Perhaps you need to specifically ignore these. Try adding -& to your full-text query. Another option (I am unsure whether it requires a MySQL source code change) is to add amp and & to MySQL's stopword list.

You added it to the stopwords file and it's not working? Sounds like either a bug in MySQL or your stopwords list isn't being used. Have you reviewed this? Quote:
False hits or misses may occur for stopword lookups if the stopword file or columns used for full-text indexing or searches have a character set or collation different from character_set_server or collation_server.
Case sensitivity of stopword lookups depends on the server collation. For example, lookups are case insensitive if the collation is latin1_swedish_ci, whereas lookups are case sensitive if the collation is latin1_general_cs or latin1_bin.
Could any of those possibilities be keeping your stopword entry of & from being read?

Related

Searching for an exact word in multiple languages efficiently using MYSQL

I have a simple database table which stores id, language and text. I want to do a search for any word/character and look for an exact match. The catch is I have over 10 million rows.
For example, a search for the word "i" would return rows whose text contains "i", such as "information was bad" and "I like dogs".
This also needs to work for stopwords and for languages that don't use whitespace.
My first thought was to do LOWERCASE(text) LIKE '%word%' with a lowercase index on text, but after googling it seems that this would do a full table scan. I am using PlanetScale, so I have to pay for full table scans, which simply cannot work as I would run out of usage quickly.
My next thought was a BOOLEAN full-text search, but then I run into the issue of stopwords being ignored in English, having to use an ngram parser for languages like Chinese, and then having to work out which language is being submitted and which index should be used.
Does anyone have any better ideas?
Use CHARACTER SET utf8mb4
Use the latest available COLLATION for that charset -- utf8mb4_unicode_520_ci or utf8mb4_0900_ai_ci or something else for the latest MariaDB.
Do not use LOWERCASE or LOWER (etc), instead, let the collation take care of such (note the "ci" in the collation name).
Yes, you may need ngram instead of FULLTEXT for certain Asian languages.
The stoplist can be turned off.
The min word length can be changed -- at a cost.
Your app code can look at the encoding to decide whether to use ngram or FULLTEXT.
This provides a list of hex values: http://mysql.rjweb.org/doc.php/charcoll#diagnosing_charset_issues Note that E3-EA is mostly "wordless" languages.
I recommend using app code for making decisions and build the SQL query. It may even degenerate to LIKE '%word%' or REGEXP '\\bword\\b' in some cases. Note that REGEXP is generally slower than LIKE, but provides "word boundary" testing if the search strings contain multiple words.
When applicable, FULLTEXT is significantly faster than any other technique.
When doing WHERE ... AND MATCH ..., the match will be performed first, even if the rest of the WHERE is more selective.
LIKE '%...' and all(?) REGEXP tests will read and test every one of your 10M rows (unless there is a LIMIT).
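The "let app code decide" advice above could be sketched in Python as follows. This is only an illustration under stated assumptions: it uses the E3-EA lead-byte range mentioned in the answer as a rough heuristic for "wordless" scripts, and the function names and the "ngram"/"fulltext" labels are hypothetical placeholders for whatever indexes the table actually has.

```python
def looks_wordless(text: str) -> bool:
    """True if any character's UTF-8 encoding starts with a byte in E3..EA."""
    for ch in text:
        lead = ch.encode("utf-8")[0]  # first byte of the UTF-8 sequence
        if 0xE3 <= lead <= 0xEA:
            return True
    return False


def pick_index(search_term: str) -> str:
    """Return a label for the index to query; the labels are hypothetical."""
    return "ngram" if looks_wordless(search_term) else "fulltext"
```

The application would then build the MATCH ... AGAINST query against the chosen index, falling back to LIKE or REGEXP only where neither index applies.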

MySQL FULLTEXT decimal point treated as word separator

We sell lipo batteries that are 3.7v, 7.4v, 11.1v and the voltage is in a description field. It should be possible to FULLTEXT index that character based field with an FT_MIN_WORD_LEN of 4 and have it contain the tokens "3.7v" etc. and these to be found when searching. All my experiments show that when searching these tokens are missing from the index and I suspect this is because the decimal point is acting as a token separator and no tokens are long enough to meet min length.
What am I doing wrong? Why won't Match Against 3.7v find my entries? Does MySQL FULLTEXT understand the difference between a full stop and a decimal point?
Even if FULLTEXT were smart enough to recognize those two uses of ".", what about the 5 other uses? And what about other punctuation marks? When should "_" be part of a "word" and when not? Etc, etc.
Here's a suggestion for your situation (and many others).
Cleanse the data.
Put it in the table.
Similarly, cleanse the query to be fed into the AGAINST clause.
By "cleanse", I mean do any of several things to modify the data to work adequately with FULLTEXT's limitations.
In your one example, I suggest changing 3.7v or 3.7 v to 3_7v.
You may find that some "words" are shorter than min_word_length; for them, you could pad them or do some other kludge.
I recommend you use InnoDB, not MyISAM for all MySQL work. (And note that the setting there is innodb_ft_min_token_size, and it defaults to "3".)
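A minimal Python sketch of the "cleanse" step suggested above, applied identically to the stored description and to the search string before it reaches the AGAINST clause. The function name cleanse and the voltage-only pattern are assumptions for illustration; real data would likely need more rules, and whether "_" stays inside a token depends on the full-text parser in use, so that should be verified.

```python
import re

# A number with a decimal point, an optional space, then "v": 3.7v, 11.1 V, ...
_VOLTAGE = re.compile(r"(\d+)\.(\d+)\s?v\b", re.IGNORECASE)


def cleanse(text: str) -> str:
    """Rewrite voltages such as '3.7v' or '3.7 v' as the single token '3_7v'."""
    return _VOLTAGE.sub(r"\1_\2v", text)
```

Because the same function runs on both sides, a user typing "3.7v" ends up searching for the token "3_7v" that was stored at insert time.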
I found a solution here...
https://dev.mysql.com/doc/refman/8.0/en/full-text-adding-collation.html
MySQL documentation, section 12.9.7
Basically, there are XML files that control the behaviour of character sets, and I was able to change the behaviour of the "." character from punctuation to a regular character. Given that the column contains part numbers, I reclassified most of the punctuation characters as regular characters in a new collation and used that collation for my part number column. It now works as required.

Search in a field with html entities

Our customer's data (SQL Server 2005) has HTML entities in it (&eacute; -> é).
We need to search inside those fields, so that a search for "équipe" will find "&eacute;quipe".
We can't change the data, because our customer's customers can edit those fields at will (with an HTML editor), so if we remove the entities, they might reappear on the next edit and the problem will still be there.
We can't use a .NET server-side function, because we need to find the rows before they are returned to the server.
I could use a function that replaces the entities with their UTF-8 counterparts, but it's kind of tiresome, and I think it seriously hurts search performance (something about a full table scan, if I recall correctly).
Any ideas?
Thanks
You would only need to examine and encode the incoming search term.
If you convert "équipe" to "&eacute;quipe" and use that in your WHERE/FTS clause, then any index on that field can still be used, if the optimizer deems it appropriate.
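Encoding only the incoming search term might look like the Python sketch below. The function name encode_entities is hypothetical, and it assumes the HTML editor stores named entities (&eacute;) rather than numeric ones (&#233;); if the editor emits numeric references, the mapping would have to change accordingly.

```python
from html.entities import codepoint2name


def encode_entities(term: str) -> str:
    """Replace non-ASCII characters that have a named HTML entity, e.g. é -> &eacute;."""
    out = []
    for ch in term:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"&{name};")
        else:
            out.append(ch)  # ASCII and unmapped characters pass through unchanged
    return "".join(out)
```

The encoded string is then passed to the query as an ordinary parameter, so the stored data never has to be rewritten.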

Sphinx - delimiters

I would like to know whether the Sphinx engine treats any characters as delimiters (like commas and periods in normal MySQL). I don't want to use them at all; I want to escape them, or at least make sure that they don't conflict with MATCH operations in FULLTEXT searches. I have problems dealing with them in MySQL by default, and I would prefer not to be forced to replace those delimiters with other characters to get a good set of results.
Sorry if I'm saying something stupid, but I don't have experience with Sphinx or other complementary (?) search engines.
To give you an example, if I perform a search with
"Passat 2.0 TDI"
MySQL by default would identify the period in this case as a delimiter and since the "2" and "0" are too short to be considered words by default, the results would be a bit messed up.
Is it easy to handle with Sphinx (or other search engine)? I'm open to suggestions.
This is for a large project, with probably more than 500.000 possible records (not trivial at all).
Cheers!
You can effectively control which characters are delimiters by specifying the charset table of a specific Sphinx index.
If you exclude a character from your charset table, it effectively acts as a delimiter. If you specify it in your charset table (even spaces, as U+0020), it will no longer act as a delimiter and will be part of your token strings.
Each index (which uses one or more Sphinx data sources) can have a different charset table for flexibility.
NB: If you want single-character words, you can specify the min_word_len of each Sphinx index.
This is probably the best section of the documentation to read. As Sphinx is primarily a full-text engine, it's highly tunable as to how it handles phrases and also how you pass them in.

MySQL - Do's and Don'ts

I am currently learning MySQL and am noticing a lot of different do's and don'ts.
Is there anywhere I can find the absolute list of best practices that you go by or learned from?
Thanks for your time.
Do use InnoDB; don't use MyISAM.
(OK, OK, unless you absolutely have to, often due to fulltext matching not being available in InnoDB before MySQL 5.6. Even then you're often better off putting the canonical data in InnoDB and the fulltext index on a separate MyISAM searchbait table, which you can then process for stemming.)
Do use BINARY columns when you want rigorous string matching, otherwise you get a case-insensitive comparison by default. Do set the collation correctly for your character set (best: UTF-8) or case-insensitive comparisons will behave strangely.
Do use ANSI SQL mode if you want your code to be portable. ANSI_QUOTES allows you to use standard double-quoted "identifier" (table, column, etc.) names to avoid reserved words; MySQL's default way of saying this is backquotes but they're non-standard and won't work elsewhere. If you can't control settings like this, omit any identifier quoting and try to avoid reserved words (which is annoying, as across the different databases there are many).
Do use your data access layer's MySQL string literal escaping or query parameterisation functions; don't try to create escaped literals yourself because the rules for them are a lot more complicated than you think and if you get it wrong you've got an SQL injection hole.
Don't rely on MySQL's behaviour of returning a particular row when you select columns that don't have a functional dependency on the GROUP BY column(s). This is an error in other databases and can easily hide bugs that will only pop up when the internal storage in the database changes, causing a different row to be returned.
SELECT productid, MIN(cost)
FROM products
GROUP BY productcategory -- this doesn't do what you think
Well, there won't be an absolute list of do's and don'ts, as the goalposts keep moving. MySQL moved on in leaps and bounds between versions 4 and 5, and some fairly essential bug fixes for MySQL seem to be around the corner (I'm thinking of the issue surrounding the use of count(distinct col1) from ...).
Here are a couple of issues off the top of my head:
Don't rely on views to be able to use indexes on the underlying tables:
http://forums.mysql.com/read.php?100,22967,66618#msg-66618
The order of columns in indexes intended to be used by GROUP BY is important:
http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
COUNT(DISTINCT) is slow:
http://www.delphifaq.com/faq/databases/mysql/f3095.shtml
although there might be a bug fix a-coming....
http://bugs.mysql.com/bug.php?id=17865
Here are some other questions from this site you might find useful:
Database optimization
Database design with MySql
Finetuning tips
DON'T WRITE YOUR SQL IN ALL CAPS, EVEN THOUGH THE OFFICIAL REFERENCE DOES IT. I MEAN, OK, IT MAKES IT PRETTY OBVIOUS TO DIFFERENTIATE BETWEEN IDENTIFIERS AND KEYWORDS. NO, WAIT, THAT'S WHY WE HAVE SYNTAX HIGHLIGHTING.
Do use SQL_MODE "Traditional".
SET SQL_MODE='TRADITIONAL'
Or put it in your my.cnf (even better, because you can't forget it; but ensure it gets deployed on to ALL instances including dev, test etc).
If you don't do this, inserting invalid values into columns will succeed anyway. This is not usually a Good Thing, as it may mean that you lose data.
It's important that it's turned on in dev as well, as you'll spot those problems early.
Oh, I need this list too... joking. No. The problem is that whatever works with a 1 MB database will never be good for a 1 GB database, and the same applies to a 1 GB database vs. a 1 TB database, etc.