I'm currently using a LIKE query for an autocomplete box. I want to switch to MATCH ... AGAINST, which should be faster, but I'm running into issues with the sorting.
I want to rank the results by the first of these patterns that matches:
[query] %
[query]%
% [query]%
%[query]%
For now I use
SELECT * FROM table
WHERE name LIKE '%query%'
ORDER BY (case
WHEN name LIKE 'query %' THEN 1
WHEN name LIKE 'query%' THEN 2
WHEN name LIKE '% query%' THEN 3
ELSE 4 END) ASC
When I use...
SELECT * FROM table
WHERE MATCH(name) AGAINST('query*' IN BOOLEAN MODE)
...all results get the same 'ranking score'.
For example, searching for Natio returns Pilanesberg National Park and National Park Kruger with the same score, while I want the second result to come first because it starts with the query.
How can I achieve this?
I had the same problem and had to approach it in a different way.
The documentation of MySQL says:
The term frequency (TF) value is the number of times that a word appears in a document. The inverse document frequency (IDF) value of a word is calculated using the following formula, where total_records is the number of records in the collection, and matching_records is the number of records that the search term appears in.
${IDF} = log10( ${total_records} / ${matching_records} )
When a document contains a word multiple times, the IDF value is multiplied by the TF value:
${TF} * ${IDF}
Using the TF and IDF values, the relevancy ranking for a document is calculated using this formula:
${rank} = ${TF} * ${IDF} * ${IDF}
This is followed by an example that illustrates the formula: it searches for the word 'database' in different fields and returns a rank based on the results.
In your example, "Pilanesberg National Park" and "National Park Kruger" get the same rank against ('Natio*' IN BOOLEAN MODE) because the rank is not based on a common-sense notion of similarity (for that you would have to tell the database what "similar to" means for you); it is based on the formula above, which depends only on frequency.
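To make the point concrete, here is a minimal Python sketch of the quoted formula applied to the two names. The collection size (8 rows, 2 of them matching) and the helper names are made up for illustration, and counting a prefix hit such as 'natio*' as one occurrence per matching word is a simplifying assumption, not a statement about MySQL internals.

```python
import math

def rank_from_formula(tf: int, total_records: int, matching_records: int) -> float:
    """Rank as quoted above: TF * IDF * IDF, with IDF = log10(total / matching)."""
    idf = math.log10(total_records / matching_records)
    return tf * idf * idf

def term_frequency(name: str, prefix: str) -> int:
    """Count the words in `name` that start with `prefix` (case-insensitive)."""
    return sum(1 for word in name.lower().split() if word.startswith(prefix.lower()))

# Hypothetical collection statistics, chosen only for the example.
TOTAL_RECORDS = 8
MATCHING_RECORDS = 2

names = ["Pilanesberg National Park", "National Park Kruger"]
ranks = {
    name: rank_from_formula(term_frequency(name, "natio"), TOTAL_RECORDS, MATCHING_RECORDS)
    for name in names
}

for name, rank in ranks.items():
    print(f"{name}: {rank:.4f}")
```

Both names contain exactly one matching word, so TF is 1 for each and the IDF is shared. Nothing in the formula looks at where in the string the word occurs, which is why the two rows tie.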
Note also that the frequency value is affected by the storage engine (InnoDB or MyISAM) and by the MySQL version (versions before 5.6 do not support full-text indexes on InnoDB tables).
Regarding your problem, you can use MySQL user-defined variables, functions, or procedures to compute a rank based on your own idea of relevance. Examples here, here or here. And also here.
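One simple way to impose your own idea of rank is to keep the filter and add a position-based CASE expression to the ORDER BY, as in the question. The sketch below runs that idea against an in-memory SQLite database purely so it can be executed without a server; the `places` table, its rows, and the `ranked_search` helper are hypothetical. In MySQL the WHERE clause would presumably be the MATCH ... AGAINST filter instead of LIKE. Wildcard characters (% and _) inside the search text are not escaped here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT)")
conn.executemany(
    "INSERT INTO places (name) VALUES (?)",
    [
        ("Pilanesberg National Park",),
        ("National Park Kruger",),
        ("Supernational Hotel",),
        ("Kruger Lodge",),
    ],
)

def ranked_search(connection, query):
    """Return (name, position_rank) rows, best position first."""
    sql = """
        SELECT name,
               CASE
                   WHEN name LIKE :whole_first_word THEN 1  -- [query] %
                   WHEN name LIKE :prefix           THEN 2  -- [query]%
                   WHEN name LIKE :later_word       THEN 3  -- % [query]%
                   ELSE 4                                   -- %[query]%
               END AS position_rank
        FROM places
        WHERE name LIKE :anywhere
        ORDER BY position_rank ASC, name ASC
    """
    params = {
        "whole_first_word": f"{query} %",
        "prefix": f"{query}%",
        "later_word": f"% {query}%",
        "anywhere": f"%{query}%",
    }
    return connection.execute(sql, params).fetchall()

results = ranked_search(conn, "Natio")
for name, position_rank in results:
    print(position_rank, name)
```

With this data, "National Park Kruger" sorts ahead of "Pilanesberg National Park" because it matches the prefix pattern, and the row that only contains the text inside a word comes last.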
See also:
MySQL match() against() - order by relevance and column?
MYsql FULLTEXT query yields unexpected ranking; why?
I've been looking for resources to explain how this query exactly sorts retrieved items by relevance, and haven't been able to find any.
Hopefully one of you can explain the logistics of it to me?
SELECT *, MATCH(body, subject) AGAINST ('words' IN BOOLEAN MODE) AS relevance
FROM `messages`
WHERE MATCH(body, subject) AGAINST ('words' IN BOOLEAN MODE)
ORDER BY relevance DESC
In this case, I know that the first half of this query searches through the messages.body and messages.subject columns for the search term "words". It then returns those results (regardless of the Boolean operators) in what is essentially a "random order" (ordered by what is found first, then found second, and so on).
What I don't understand, however, is how MySQL interprets the WHERE clause and the rest of the query. How does repeating the first half of code reorder the results by relevance?
For example, an ORDER BY clause that sorts a users.user_id column by desc. numerical order MAKES SENSE to me because each row/cell has a clear order (e.g. - 3 , 2 , 1, and so on)
But how does (going back to the original query) MySQL interpret these "word" results (words, obviously not having any values/numbers/clear-order) and sort them according to relevance?
Is it because the Boolean Full-text Search gives hidden numerical values to these search terms? Like if the AGAINST clause read:
AGAINST ('+apple -macintosh ~microsoft >windows' IN BOOLEAN MODE)
Like "apple" gets a value of 100, "macintosh" a value of -100, "microsoft" a value of 20, and "windows" a value of 40 (to reflect the Operator Effects)?
I know that this is oversimplifying the process (especially when considering if a column contains more than one of these search terms), but that is the best I got.
What I basically need, is a layman-terms explanation of the WHERE clause's (the 2nd half of query code's) effect on the query results as a whole.
How can I do a MySQL search which will match partial words but also provide accurate relevancy sorting?
SELECT name, MATCH(name) AGAINST ('math*' IN BOOLEAN MODE) AS relevance
FROM subjects
WHERE MATCH(name) AGAINST ('math*' IN BOOLEAN MODE)
The problem with boolean mode is the relevancy always returns 1, so the sorting of results isn't very good. For example, if I put a limit of 5 on the search results the ones returned don't seem to be the most relevant sometimes.
If I search in natural language mode, my understanding is that the relevancy score is useful but I can't match partial words.
Is there a way to perform a query which fulfils all of these criteria:
Can match partial words
Results are returned with accurate relevancy
Is efficient
The best I've got so far is:
SELECT name
FROM subjects
WHERE name LIKE 'mat%'
UNION ALL
SELECT name
FROM subjects
WHERE name LIKE '%mat%' AND name NOT LIKE 'mat%'
But I would prefer not to be using LIKE.
The new InnoDB full-text search feature in MySQL 5.6 helps in this case.
I use the following query:
SELECT MATCH(column) AGAINST('(word1* word2*) ("word1 word2")' IN BOOLEAN MODE) score, id, column
FROM table
HAVING score > 0
ORDER BY score DESC
LIMIT 10;
where ( ) groups words into a subexpression. The first group behaves like a word% prefix match; the second looks for the exact phrase. The score is returned as a float.
I obtained a good solution in this (somewhat) duplicate question a year later:
MySQL - How to get search results with accurate relevance
I'm running two queries that should give me the same result.
First query:
SELECT count( id ) AS cv FROM table_name WHERE field_name LIKE '%êêê01, word02, word03%'
Second query:
SELECT count( id ) AS cv FROM table_name WHERE match(field_name) against('êêê01, word02, word03')
But the first returns more rows than the second. Could someone tell me why?
I'm using a fulltext index on this field.
Thanks.
I did some quick research, and the following quote should answer your question:
One problem with MATCH on MySQL is that it seems to only match against whole words so a search for 'bla' won't match a column with a value of 'blah'.
It's also described in the documentation for MATCH:
By default, the MATCH() function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT index. The search string is given as the argument to AGAINST(). For each row in the table, MATCH() returns a relevance value; that is, a similarity measure between the search string and the text in that row in the columns named in the MATCH() list.
Meanwhile, LIKE is more "powerful" in that it compares individual characters:
Per the SQL standard, LIKE performs matching on a per-character basis, thus it can produce results different from the = comparison operator:
This explains why LIKE returns more results than MATCH.
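The difference can be sketched in a few lines of Python. This is only an emulation of the two behaviours described above, not MySQL's actual tokenizer: `like_contains` mimics LIKE '%needle%', `whole_word_match` mimics a match on whole words, and both function names and the sample rows are invented for the example. Real full-text search also applies minimum word length and stopword rules, which are ignored here.

```python
import re

def like_contains(text: str, needle: str) -> bool:
    """Emulate LIKE '%needle%': a character-level substring test."""
    return needle.lower() in text.lower()

def whole_word_match(text: str, needle: str) -> bool:
    """Emulate whole-word matching: the needle must equal one of the words."""
    words = re.findall(r"\w+", text.lower())
    return needle.lower() in words

rows = ["blah blah", "bla", "blablabla", "nothing here"]

like_hits = [row for row in rows if like_contains(row, "bla")]
word_hits = [row for row in rows if whole_word_match(row, "bla")]

print("LIKE-style hits:", like_hits)
print("whole-word hits:", word_hits)
```

Every whole-word hit is also a substring hit, but not the other way round, so the LIKE-style count can only be equal or larger.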
First, to describe my data set. I am using SNOMED CT codes and trying to make a usable list out of them. The relevant columns are rowId, conceptID, and Description. rowId is unique, the other two are not. I want to select a very specific subset of those codes:
SELECT *
FROM SnomedCode
WHERE LENGTH(Description)=MIN(LENGTH(Description))
GROUP BY conceptID
The result should be a list of 400,000 unique conceptIDs (out of 1.4 million) and the shortest applicable description for each code. The query above is obviously malformed (and would only return rows where LENGTH(description)=1 because the shortest description in the table is 1 character long.) What am I missing?
SELECT conceptID, MAX(Description)
FROM SnomedCode A
WHERE LENGTH(Description)=(SELECT MIN(LENGTH(B.Description))
FROM SnomedCode B
WHERE B.conceptID = A.conceptID)
GROUP BY conceptID
The "GROUP BY" and "MAX(Description)" are not really necessary, but were added as a tiebreaker for different descriptions with same length for a conceptID, as the requirements include unique conceptIDs.
MAX was chosen to penalize possible leading spaces. Otherwise MIN(Description) works as well.
BTW, this query takes quite some time if you have over a million records. Test it first with "AND conceptID IN (list-of-conceptIDs-to-test)" added to the WHERE clause.
The table SnomedCode must have an index on conceptID. If not, the query will take forever.
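As a quick check of the query's behaviour, here is a small sketch that runs it against an in-memory SQLite database (used only so the example is runnable without a MySQL server). The table and column names come from the question; the sample rows and the index name are invented, and an ORDER BY is added so the output is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SnomedCode (rowId INTEGER PRIMARY KEY, conceptID INTEGER, Description TEXT)"
)
# Index on conceptID, as recommended above for the correlated subquery.
conn.execute("CREATE INDEX idx_snomed_concept ON SnomedCode (conceptID)")
conn.executemany(
    "INSERT INTO SnomedCode (conceptID, Description) VALUES (?, ?)",
    [
        (1, "Myocardial infarction"),
        (1, "Heart attack"),
        (1, "MI"),
        (2, "Ache"),          # ties with "Pain" on length
        (2, "Pain"),
        (2, "Painful sensation"),
    ],
)

shortest = conn.execute(
    """
    SELECT conceptID, MAX(Description)
    FROM SnomedCode A
    WHERE LENGTH(Description) = (SELECT MIN(LENGTH(B.Description))
                                 FROM SnomedCode B
                                 WHERE B.conceptID = A.conceptID)
    GROUP BY conceptID
    ORDER BY conceptID
    """
).fetchall()

for concept_id, description in shortest:
    print(concept_id, description)
```

For conceptID 2 both "Ache" and "Pain" have the minimum length, and MAX(Description) picks "Pain", which shows the tiebreaker keeping one row per conceptID.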
I could not find an answer to my question anywhere. I have done some research, but without any luck.
Let's say that we run the following statement:
SELECT `id`, MATCH(`name`,`content`) AGAINST ('some keywords') AS `score` FROM `pages` WHERE MATCH(`name`,`content`) AGAINST ('some keywords')
Is MySQL going to give me a score based on the whole pages table, that is, will it take all the other records into account? Or will the score consider only the row itself and the columns supplied (name and content in our case)?
Yes, it uses the keyword frequency in other rows too. Taken from MySQL Fulltext Search, the formula is:
w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)
Where:
dtf is the number of times the term appears in the document
sumdtf is the sum of (log(dtf)+1)'s for all terms in the same document
U is the number of unique terms in the document
N is the total number of documents
nf is the number of documents that contain the term
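To see how the other rows enter the score, here is a minimal Python sketch of the formula above. It assumes `log` means the natural logarithm, and the term counts and table sizes are made up for illustration; the function names are not part of MySQL.

```python
import math

def sum_dtf(term_counts: dict) -> float:
    """sumdtf: sum of (log(dtf) + 1) over all terms in one document."""
    return sum(math.log(count) + 1 for count in term_counts.values())

def term_weight(dtf: int, sumdtf: float, U: int, N: int, nf: int) -> float:
    """w = (log(dtf)+1)/sumdtf * U/(1+0.0115*U) * log((N-nf)/nf)."""
    local = (math.log(dtf) + 1) / sumdtf       # depends on this document only
    length_norm = U / (1 + 0.0115 * U)         # depends on this document only
    global_part = math.log((N - nf) / nf)      # depends on the whole table
    return local * length_norm * global_part

# Hypothetical document: "some" appears twice, "keywords" once.
doc_counts = {"some": 2, "keywords": 1}
sumdtf = sum_dtf(doc_counts)
U = len(doc_counts)
N = 100  # hypothetical number of rows in the table

rare = term_weight(doc_counts["some"], sumdtf, U, N, nf=10)
common = term_weight(doc_counts["some"], sumdtf, U, N, nf=40)
half = term_weight(doc_counts["some"], sumdtf, U, N, nf=50)

print(f"nf=10: {rare:.4f}  nf=40: {common:.4f}  nf=50: {half:.4f}")
```

Only the last factor changes between the three calls: the same document gets a lower weight for the same term as more rows contain that term, and under this formula the weight reaches zero once the term appears in half of the rows.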