FULLTEXT Relevance in MySQL

I'm learning to set up searches in PHP with MySQL, and I like the idea of FULLTEXT BOOLEAN searches. But there's one part I'm not really sure I understand: Relevance.
According to the manual here, when a word has no operator (plus or minus) before it, "the word is optional, but the rows that contain it are rated higher". But according to an earlier statement on that page, "They do not automatically sort rows in order of decreasing relevance".
So my question is, if they don't do it automatically, how do you manually do it? Or at least, how does one reference this "Relevance"? And if you cannot, then what is the point of them assigning values if the results are not sorted by them?
Just trying to wrap my head around this whole system of BOOLEAN MODE.
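For reference, the relevance value is what MATCH ... AGAINST returns when it appears in the SELECT list, so it can be aliased and sorted on explicitly. A minimal sketch, assuming a hypothetical table articles with a FULLTEXT index on (title, body); the table, column names, and search terms are illustrative and not from the question:

```sql
-- Hypothetical table: articles(id, title, body) with FULLTEXT(title, body).
SELECT id,
       title,
       -- MATCH in the SELECT list returns the relevance score as a number
       MATCH (title, body) AGAINST ('+mysql fulltext -oracle' IN BOOLEAN MODE) AS relevance
FROM articles
-- MATCH in WHERE only filters rows; boolean mode does not sort them by itself
WHERE MATCH (title, body) AGAINST ('+mysql fulltext -oracle' IN BOOLEAN MODE)
ORDER BY relevance DESC;
```

Here mysql is required, oracle is excluded, and fulltext is optional, so rows that also contain fulltext should score higher and sort first. The exact score values depend on the storage engine and the data.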

Related

mysql fulltext search not working, even though entry exists and ft_min_word_len is set to 1

I have a mysql table and need to perform a fulltext search on it. However, even though the required data is in the table, fulltext search does not retrieve it.
Here is the query:
SELECT * from declensions_english where match(finite) against("did" IN BOOLEAN MODE)
The data I am searching for looks like this:
I did
Of course this is in the column called finite. There are about a million rows in the table beside this one, so wildcards are very slow.
Why is this not working? It's not because of the length of the word (did), because I've already set ft_min_word_len to 1. There are other cases where three-letter words deliver the expected outcome (i.e. the data is retrieved). But there are also cases where even four-letter words are not found. I have no idea what's going on, but I have only been using fulltext search since yesterday, so consider me a newbie here.
Since you use ft_min_word_len, I must assume that your table uses the MyISAM storage engine.
The word did is on the MyISAM stopword list, which is why the search does not return it.
You can either disable or change the stopword list, or migrate to InnoDB, which does not have this word on its default stopword list.
To be honest, I cannot think of any reason to use MyISAM in 2019. You really should migrate over to InnoDB.
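The two options above could be sketched roughly as follows. The table name comes from the question; everything else is an assumption about a typical setup and is best tried on a copy of the table first:

```sql
-- Check which storage engine the table currently uses.
SHOW TABLE STATUS LIKE 'declensions_english';

-- Option A: stay on MyISAM and disable the built-in stopword list.
-- ft_stopword_file cannot be changed at runtime; in the server config file put
--   [mysqld]
--   ft_stopword_file = ''
-- then restart the server and rebuild the FULLTEXT index:
REPAIR TABLE declensions_english QUICK;

-- Option B: migrate to InnoDB (FULLTEXT on InnoDB needs MySQL 5.6 or later).
ALTER TABLE declensions_english ENGINE = InnoDB;

-- InnoDB's default stopwords can be listed; 'did' should not be among them.
SELECT value FROM INFORMATION_SCHEMA.INNODB_FT_DEFAULT_STOPWORD;
```

Note that after a move to InnoDB the minimum indexed word length is controlled by innodb_ft_min_token_size (default 3) rather than ft_min_word_len.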

Is it possible to get "ideal" full-text relevance for two constant(same) samples?

Full-text MATCH gives a relative relevance for all records in an indexed table. However, I make the decision based on a similarity level (let's say <70% is insufficient to consider it as a match) between tested sample and constant sample (which I compare against).
Previously I used Levenshtein Distance to get percentage coefficient of how much two samples are similar. But this method showed itself as incredibly inefficient for my dataset.
What I'd like to do is get the relevance coefficient of a sample matched against itself and treat that value as 100% relevance.
I tried queries like:
SELECT
samples.`name`,
MATCH(samples.`name`)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE),
MATCH (perfectSample.sample)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE)
FROM
samples,
(SELECT 'Constant sample' as sample) as perfectSample
But a derived table in the FROM clause does not support full-text MATCH. (My idea was that, since a MyISAM table does not strictly need a FULLTEXT index, it might be possible to achieve it this way.)
So the actual question is: Is it possible to obtain FULLTEXT relevance for 2 constant values?
OK, so here is what I managed to do. Maybe someone will find it useful.
First of all, the samples should be inserted into an InnoDB (important) table that has a FULLTEXT index on the field that is to be MATCHed.
After this, it is necessary to fetch all the values (samples) that will be compared against:
SELECT * FROM samples
Next, each fetched value needs to be MATCHed against itself. It is best to add a WHERE clause so that a value is not matched against anything else.
SELECT
samples.value,
MATCH (samples.value) AGAINST (:fetchedVal)
FROM samples
WHERE samples.value = :fetchedVal
This gives the relevance of each sample AGAINST itself.
Note: It is important to use InnoDB, because a MyISAM MATCH against only one row produces a result that is not useful. For example, the same query can produce a relevance value of 40.1511 on InnoDB and 3 on MyISAM.
This is due to the way word uniqueness is calculated. You can read more about this here.
And that's it. The second query gives what I consider 100% relevance, which can be used to determine the similarity level between this sample and others.
It is a bit dirty, but it's the only option that worked for me. Since no one has suggested anything better, I will keep this as the answer until a better solution is found.
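Putting the steps together, a similarity percentage might be derived like this. This is only a sketch: it assumes an InnoDB table samples with a FULLTEXT index on value, uses a MySQL user variable @self_score that is not part of the original answer, and treats the self-match score as 100%, which is a heuristic and not something MySQL guarantees (other rows can score higher than the sample itself):

```sql
-- Step 1: score of the constant sample matched against itself.
-- Assumes the sample has already been inserted into the table;
-- otherwise @self_score stays NULL.
SELECT MATCH (samples.value) AGAINST ('Constant sample')
INTO @self_score
FROM samples
WHERE samples.value = 'Constant sample'
LIMIT 1;

-- Step 2: express each matching row's score as a percentage of that score
-- and keep only rows at or above the 70% cut-off from the question.
SELECT samples.value,
       100 * MATCH (samples.value) AGAINST ('Constant sample') / @self_score AS similarity_pct
FROM samples
WHERE MATCH (samples.value) AGAINST ('Constant sample')
HAVING similarity_pct >= 70
ORDER BY similarity_pct DESC;
```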

MySQL table MATCH AGAINST

Hi all,
I have created a simple table called classics in a DB called publications on XAMPP. I am trying to do a MATCH AGAINST search for an author name, which I thought I understood.
Also, I have made sure the table is FULLTEXT indexed on both the author and title columns, as required. The table also uses the MyISAM engine.
I tried this and it failed.
SELECT author FROM classics WHERE MATCH(author) AGAINST('Charles');
I know Charles must be present in the author column, and it is, as you can see, but I get no rows returned.
Now if I rewrite it to search for any other author, it works:
SELECT author FROM classics WHERE MATCH(author) AGAINST ('jane');
Here is what I get with jane...
I'm not sure, but it seemed that earlier I had to include both of the fields I'd indexed in the query, instead of just being able to search author alone. Is this correct, and does anyone know why I can't get charles returned?
Many thanks!
It's not returning those rows because "charles" appears in at least 50% of the rows. This is a well-documented restriction of MySQL FULLTEXT search in natural language mode on MyISAM tables.
If you want to get around this restriction, you can use BOOLEAN MODE.
Here's the relevant excerpt from the manual:
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular data set in which they occur. A given word may reach the 50% threshold in one data set but not another.
The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the boolean search mode; see Section 12.9.2, “Boolean Full-Text Searches”.
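As a sketch of the workaround mentioned above, the failing query from the question can be run in boolean mode, where the 50% threshold does not apply:

```sql
-- Same search as in the question, but in boolean mode.
SELECT author
FROM classics
WHERE MATCH (author) AGAINST ('Charles' IN BOOLEAN MODE);

-- If the FULLTEXT index was created on (author, title) together,
-- list both columns so the index can be used:
SELECT author
FROM classics
WHERE MATCH (author, title) AGAINST ('Charles' IN BOOLEAN MODE);
```

On the side question about column lists: in natural language mode the columns named in MATCH() have to match the column list of some FULLTEXT index exactly, so a single index on (author, title) would require MATCH (author, title). Whether that applies here depends on how the indexes were created, which the question does not show.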

SQL Server 2008 Containstable generate negative rank with weighted_term

I have a table with full-text search enabled on the Title column. I am trying to do a weighted search with CONTAINSTABLE, but I get an arithmetic overflow for the Rank value. The query is as follows:
SELECT ID, CAST(Res_Tbl.RANK AS Decimal) AS Relevancy , Title
FROM table1 INNER JOIN
CONTAINSTABLE(table1,Title,'ISABOUT("pétoncle" weight (.8), "pétoncle" weight (.8), "PÉTONCLE" weight (.8))',LANGUAGE 1036 ) AS Res_Tbl
ON ID = Res_Tbl.[KEY]
When I execute this query I get: Arithmetic overflow error for type int, value = -83886083125.000076.
If I remove one of the two ';' in the ISABOUT function, the query completes successfully.
Note that the error only occurs when the query has some results; if there is no result, the query completes successfully.
Does anybody know how to solve this?
This question is also on dba.stackexchange.com
Qualifier: Since I can't recreate this, I'm unable to know for sure if this will fix the problem. However, these are some things that I'm seeing.
First off, the ampersand, pound sign, and semicolon are word-break characters. That means that instead of searching for the string "p&#233;toncle", what you're actually searching for is "p", "233", and "toncle". Clearly, that's not your intent.
I have to presume that you have the text "pétoncle" somewhere in your dataset. That means you need that entire string to be complete.
There are a few things you can do.
1) Turn off stopwords altogether. You can do that by altering the full-text index to turn the stoplist off.
Note that you have to have your database set to SQL Server 2008 compatibility for this not to generate a syntax error:
ALTER FULLTEXT INDEX ON Table1 SET STOPLIST OFF;
2) Create a new stoplist
If you create an empty StopList, you might be able to add the stopwords that you want or copy the system stoplist and remove the stopwords that you don't want. (I would advise the second approach).
Having said that, I wasn't able to find the & or # in the system stoplist, so they may be hard coded. You may have to simply turn the stoplist off.
3) Change your search to ignore the "p&#233;toncle" case.
If you drop the "p&#233;toncle" terms from the ISABOUT and change them to "p toncle", it might work:
'ISABOUT("pétoncle" weight (.8), "p toncle" weight (.8))'
Those are just some ideas. Like I said, without being able to access the system or recreate the scenario, we won't be able to help much.
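Option 2 could look roughly like the following T-SQL. MyStoplist and the dropped word are hypothetical choices made for illustration; Table1 and language 1036 (French) come from the question:

```sql
-- Inspect the system stopwords for French (LCID 1036).
SELECT stopword
FROM sys.fulltext_system_stopwords
WHERE language_id = 1036;

-- Copy the system stoplist into a custom one (hypothetical name).
CREATE FULLTEXT STOPLIST MyStoplist FROM SYSTEM STOPLIST;

-- Remove a stopword you do not want. 'le' is only a placeholder and is
-- assumed to be present in the French list; dropping a missing word fails.
ALTER FULLTEXT STOPLIST MyStoplist DROP 'le' LANGUAGE 1036;

-- Attach the custom stoplist to the full-text index.
ALTER FULLTEXT INDEX ON Table1 SET STOPLIST MyStoplist;
```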
Some more information for your researching pleasure:
Stopwords and Stoplists
Alter Fulltext Index syntax
FullText search using Thesaurus file and special characters
For people who got to this page searching for negative rank results returned by SQL Server, as I did, it turns out that can happen if some of your match terms are too long (beyond some character limit). SQL Server will not actually complain or produce an error at query time, instead, the ranking will be mostly garbage, producing negative rank for some choices of weights (in my case, esp. with low weight values on the overlong terms). Limit token/word length and avoid this problem (probably a bug deep inside SQL Server 2008 fulltext search).

MySQL Match Fulltext

I'm trying to do a fulltext search with MySQL to match a string. The problem is that it's returning odd results in the first place.
For example, the string 'passat 2.0 tdi' :
AND MATCH (
records_veiculos.titulo, records_veiculos.descricao
)
AGAINST (
'passat 2.0 tdi' WITH QUERY EXPANSION
)
is returning this as the first result (the others are fine) :
Volkswagen Passat Variant 1.9 TDI- ANO 2003
which is incorrect, since there's no "2.0" in this example.
What could it be?
edit: Also, since this will probably be a large database (expecting up to 500,000 records), is this search method the best fit, or would it be better to install another search engine such as Sphinx? Or, if not, how can I show relevant results?
edit2: For the record, despite the question being marked as answered, the problem with the MySQL delimiters persists, so if anyone has a suggestion on how to escape delimiters, it would be appreciated and worth the 500 points at stake. The solution I found to increase the resultset was to replace WITH QUERY EXPANSION with IN BOOLEAN MODE, using operators to force the engine to get the words I needed, like:
AND MATCH (
records_veiculos.titulo, records_veiculos.descricao
)
AGAINST (
'+passat +2.0 +tdi' IN BOOLEAN MODE
)
It didn't solve the problem entirely, but at least the relevance of the results has changed significantly.
From the MySQL documentation on Fulltext search:
"The FULLTEXT parser determines where words start and end by looking for certain delimiter characters; for example, “ ” (space), “,” (comma), and “.” (period)."
This means that the period is delimiting the 2 and 0. So it's not looking for '2.0'; it's looking for '2' and '0', and not finding it. WITH QUERY EXPANSION is probably causing relevant related words to show up, thus obviating the need for '2' and '0' to be individual words in the result rankings. A character minimum may also be being enforced.
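Since '2.0' cannot be matched as a single indexed word, one possible workaround is to let the full-text index narrow down the candidates and then check the literal string with LIKE. This is a sketch under the assumption that records_veiculos has a FULLTEXT index on (titulo, descricao), as the question's MATCH clause implies; it is not the asker's final solution:

```sql
SELECT titulo
FROM records_veiculos
-- The full-text part handles the ordinary words.
-- '+tdi' only matches if the minimum indexed word length allows 3 characters.
WHERE MATCH (titulo, descricao) AGAINST ('+passat +tdi' IN BOOLEAN MODE)
  -- The literal '2.0' is checked separately, since the parser splits on '.'.
  AND (titulo LIKE '%2.0%' OR descricao LIKE '%2.0%');
```

The LIKE conditions cannot use the full-text index, so they are only applied to rows the MATCH has already selected.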
By default, I believe MySQL only indexes and matches words with 4 or more characters. You could also try escaping the period; it might be getting ignored or otherwise treated as a stop character.
What is the match rank that it returns for that row? Does the match have to contain all the "words"? My understanding was that it works like Google and only needs to match some of the words.
Having said that, bear in mind the effect of adding WITH QUERY EXPANSION: it automatically runs a second search for "related" words, which may not be what you typed but which the fulltext engine deems probably related.
Relevant Documentation: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
The "." is what's matching on 2003 in your query results.
If you're going to do searches on 3-character text strings, you should set ft_min_word_len=3 in your MySQL config and restart MySQL. Otherwise, a search for "tdi" will return results with "TDI-" but not with just "TDI", because rows with "TDI-" will be indexed but "TDI" alone will not.
After making that config change, you'll have to rebuild your index on that table. (Warning: your index might be significantly larger now.)
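The steps above might look like this for a MyISAM table. The table name is taken from the question; the location of the config file varies by installation:

```sql
-- In the server config file (for example my.cnf), under [mysqld]:
--   ft_min_word_len = 3
-- After restarting MySQL, confirm the new value:
SHOW VARIABLES LIKE 'ft_min_word_len';

-- Rebuild the FULLTEXT index so that existing rows are re-indexed
-- with the new minimum word length.
REPAIR TABLE records_veiculos QUICK;
```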