MySQL table MATCH AGAINST

Hi all,
I have created a simple table called classics in a database called publications on XAMPP. I am trying to run a MATCH AGAINST search for an author name, which I thought I understood.
Also, I have made sure the table is FULLTEXT indexed on both the author and title columns, as required. The table also uses the MyISAM engine.
I tried this and it failed.
SELECT author FROM classics WHERE MATCH(author) AGAINST('Charles');
I know Charles must be present in the author column, and it is, as you can see, but I get no rows returned.
Now if I rewrite it for any other author, it works:
SELECT author FROM classics WHERE MATCH(author) AGAINST ('jane');
Here is what I get with jane...
I'm not sure, but it seemed earlier that I had to include both fields I'd indexed in the query, instead of being able to search author alone. Is this correct, and does anyone know why I can't get charles returned?
Many thanks!

It's not returning those rows because "charles" appears in 50% or more of the rows. In natural language mode, MyISAM FULLTEXT search treats such words as too common to match. This is a well-documented restriction of MySQL FULLTEXT search.
If you want to get around this restriction, you can use BOOLEAN MODE.
Here's the relevant excerpt from the manual:
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular data set in which they occur. A given word may reach the 50% threshold in one data set but not another.
The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the boolean search mode; see Section 12.9.2, “Boolean Full-Text Searches”.
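A minimal sketch of the boolean-mode workaround, reusing the classics table and author column from the question. Boolean mode does not apply the 50% threshold, so a word that appears in half of the rows can still match:

```sql
-- Same search as in the question, but in boolean mode,
-- which is not subject to the 50% threshold.
SELECT author
FROM classics
WHERE MATCH(author) AGAINST('Charles' IN BOOLEAN MODE);
```

Keep in mind that boolean-mode results are not automatically sorted by relevance.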

Related

Get MySQL full-text match score for strings not in the table (optimally in a mixed result set with matches from the table)?

This must be a niche scenario, since I have not been able to find a similar question, and in my brief testing in MySQL Workbench, just using the string in place of the column name did not work.
e.g.:
SELECT MATCH ('fork') AGAINST ('user entered text about forks' IN NATURAL LANGUAGE MODE);
Doesn't work...
I have a query that returns matches on a full-text index with the relevance score as one of the columns returned. In this app, I am looking for "search suggestions" in a suggestions table that is built from the website's search index content. The user side also stores everything they search for in their local browser storage.
Currently, I have front end code that uses regex to pull matches from their local storage search history (up to 5) and then sends what they typed (as they type) to the back end to get the best matches from the suggestions table.
The way it works now is that the (up to 5) history matches are shown first, then the rest are filled in, up to 10 total matches, from the back end. What I would prefer is to send the history matches to the back end and include them in the full-text match query in some way, so that the result set contains all matched suggestions from the table plus the history matches sent from the front end, all sorted by the full-text relevance score. The new way may result in no history matches showing, or it might result in more than 5 history matches showing; it would all boil down to the relevance score.
Is something like this possible? The only other way I could imagine doing this is somehow creating a temporary table with a full-text index on the fly, joining that table in my current query, and then removing the temp table when it's done. The problem with that, in my mind, is that this is all happening in real time as the user types, so I don't want to add something like that if it's going to bog down the response time. Is there a fast/optimal way of doing this? Is there a way that would also remove the temporary table when the query ends?
Or is there some other command that can just give me a score based on string value against what the user typed in like what I tried above?
EDIT:
It looks like my temporary table idea could work:
https://dev.mysql.com/doc/refman/8.0/en/create-temporary-table.html
I'll just have to see what kind of performance impact this has. I'm still interested to hear thoughts on whether this is the best / only way or whether there is a better one.
The CREATE TEMPORARY TABLE route was the way to go here. I tested it out and it's working.
Worthy of note to future travelers: I had to switch my main table from InnoDB to MyISAM for this to work. I was able to mix/match the MyISAM temp table with the InnoDB main table, but the scoring algorithms are different, so the InnoDB matches were taking priority due to higher scores. This was not an issue for me, as I really did not need / use transactions for the primary suggestions table, so I just made them both MyISAM.
Another item of note: I had to switch to splitting the user's query into "words", encapsulating them in "*", and running the match as a boolean search instead of natural language. In the case of the temp table, a user would likely have entered similar searches, which would mean most of the words were in more than 50% of the rows, so no matches were returned. Boolean search works around this. Again, not a big deal for my particular use case.
Had I needed to stay in InnoDB for this, it would have been a problem because, from what I can tell, there is no way to set a full-text index on an InnoDB temporary table.
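The temporary-table approach described above might be sketched roughly as follows. The table names history_matches and suggestions, the column phrase, and the sample rows are all hypothetical; the original post does not show its schema. Both tables are assumed to be MyISAM with a FULLTEXT index on phrase, so that the scores are comparable:

```sql
-- Hypothetical names throughout; adjust to the real schema.
-- The temporary table is visible only to this session and is
-- dropped automatically when the session ends.
CREATE TEMPORARY TABLE history_matches (
  phrase VARCHAR(255) NOT NULL,
  FULLTEXT KEY ft_phrase (phrase)
) ENGINE=MyISAM;

-- History entries sent from the front end (made-up sample data).
INSERT INTO history_matches (phrase) VALUES
  ('fork lift rental'),
  ('garden fork');

-- Score both sources with the same boolean-mode expression,
-- then sort the combined result by that score.
SELECT phrase, 'history' AS source,
       MATCH(phrase) AGAINST('fork*' IN BOOLEAN MODE) AS score
FROM history_matches
WHERE MATCH(phrase) AGAINST('fork*' IN BOOLEAN MODE)
UNION ALL
SELECT phrase, 'suggestion' AS source,
       MATCH(phrase) AGAINST('fork*' IN BOOLEAN MODE) AS score
FROM suggestions
WHERE MATCH(phrase) AGAINST('fork*' IN BOOLEAN MODE)
ORDER BY score DESC
LIMIT 10;

-- Optional explicit cleanup if the connection is reused.
DROP TEMPORARY TABLE history_matches;
```

The trailing * is the boolean-mode truncation operator, so 'fork*' also matches "forks". Whether the extra round trips are fast enough for as-you-type suggestions would need measuring.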

FULLTEXT Relevance in MySQL

I'm learning to set up searches in PHP with MySQL, and I like the idea of FULLTEXT BOOLEAN searches. But there's one part I'm not really sure I understand: Relevance.
According to the manual here, when a word has no operator (plus or minus) before it, "the word is optional, but the rows that contain it are rated higher". But according to an earlier statement on that page, "They do not automatically sort rows in order of decreasing relevance".
So my question is, if they don't do it automatically, how do you manually do it? Or at least, how does one reference this "Relevance"? And if you cannot, then what is the point of them assigning values if the results are not sorted by them?
Just trying to wrap my head around this whole system of BOOLEAN MODE.
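One common way to reference the relevance is to repeat the MATCH ... AGAINST expression in the SELECT list, give it an alias, and sort on that alias. A minimal sketch, assuming a hypothetical articles table with a FULLTEXT index on (title, body):

```sql
-- Hypothetical table and columns; the FULLTEXT index is assumed
-- to cover exactly (title, body).
SELECT id, title,
       MATCH(title, body) AGAINST('mysql tutorial' IN BOOLEAN MODE) AS relevance
FROM articles
WHERE MATCH(title, body) AGAINST('mysql tutorial' IN BOOLEAN MODE)
ORDER BY relevance DESC;
```

Used in the SELECT list, MATCH returns the relevance as a number, so it can be sorted or filtered like any other expression.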

How to search inside a SQL table for a phrase

I am currently using MySQL but I am willing to migrate if necessary to any solution suggested.
I am looking for an easy way to implement a search on a table.
The table has multiple entries with data similar to what will be found on user accounts, like names, addresses, phone numbers and a text column that contains comments of arbitrary length.
I want to make a search that goes over all rows and columns and finds the best matching row. Slight misspellings should be corrected (not very important). But most important is the ability to search across everything.
Table can have as many as 20,000 rows.
Search parameter will be for example: "Company First Name"
Expected results:
company|Contact First Name|Address|...|...
Example 2, slightly misspelled search parameters: "Pinaple Street Compani"
Expected results row:
company|pinapple street|..|...
companie|pinapple street|..|...
company|pinaple street|..|...
EDIT:
Forgot to clarify that multiple searches will be done at the same time, so it has to be fast (around 100 searches at the same time). Also, the language of the data is not English, and the database is utf8 with support for non-English characters.
The misspelling problem is hard, if not impossible, to solve well in pure MySQL.
The multiple-column FULLTEXT search isn't so bad.
Your query will look something like this ...
SELECT column, column
FROM table
WHERE MATCH(Company, FirstName, LastName, whatever, whatever)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
It will produce a bunch of results, ordered by what MySQL guesses is the most likely hit first. MySQL's guesses aren't great, but they're usually adequate.
You'll need a FULLTEXT index matching the list of columns in your MATCH() clause. Creating that index looks like this.
ALTER TABLE book
ADD FULLTEXT INDEX Fulltext_search_index_1
(Company, FirstName, LastName, whatever, whatever);
Comments in your question notwithstanding, you just need an index for the group of columns which you will search.
20K rows won't be a big burden on any recent-vintage server hardware.
Misspelling: You could try SOUNDEX(), but it's an early 20th century algorithm (patented by Robert Russell and Margaret Odell) designed to index peoples' names as pronounced in American English. It's designed to get many false positive hits, and it really is dumber than a bucket of rocks.
If you really do need spell correction you may need to investigate Sphinx.
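For reference, the SOUNDEX() idea could be tried with MySQL's SOUNDS LIKE operator, which is shorthand for comparing the SOUNDEX() values of both sides. This is only a sketch: the contacts table and its columns are hypothetical, and since the data in the question is not English, the results may well be poor.

```sql
-- Hypothetical table and column names.
-- "a SOUNDS LIKE b" is equivalent to SOUNDEX(a) = SOUNDEX(b).
-- The comparison runs against every row and cannot use an index.
SELECT company, street
FROM contacts
WHERE street SOUNDS LIKE 'Pinaple Street';
```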

Is it possible to get "ideal" full-text relevance for two constant(same) samples?

Full-text MATCH gives a relative relevance for all records in an indexed table. However, I make the decision based on a similarity level (let's say <70% is insufficient to consider it as a match) between tested sample and constant sample (which I compare against).
Previously I used Levenshtein Distance to get percentage coefficient of how much two samples are similar. But this method showed itself as incredibly inefficient for my dataset.
What I'd like to do is to get a relevance coefficient for sample matched to itself to consider it as 100% relevance
I tried queries like:
SELECT
samples.`name`,
MATCH(samples.`name`)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE),
MATCH (perfectSample.sample)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE)
FROM
samples,
(SELECT 'Constant sample' as sample) as perfectSample
But a derived table (a subquery in the FROM clause) does not support full-text MATCH. (My idea was that, since a MyISAM table does not strictly need a FULLTEXT index, it might be possible to achieve it this way.)
So the actual question is: Is it possible to obtain FULLTEXT relevance for 2 constant values?
OK, so here is what I managed to do. Maybe someone will find it useful.
First of all, the samples should be inserted into an InnoDB (important) table that has a FULLTEXT index on the field that has to be MATCHed.
After this, it is necessary to fetch all values (samples) that will be compared against.
SELECT * FROM samples
Next, each fetched value needs to be MATCHed against itself. It is better to add a WHERE clause so that a value is not matched to anything else.
SELECT
samples.value,
MATCH (samples.value) AGAINST (:fetchedVal)
FROM samples
WHERE samples.value = :fetchedVal
This will give a relevancy for each sample AGAINST itself.
Note: It is important to use InnoDB because a MyISAM MATCH with only one row will produce a result that is not useful. For example, the same query can produce a relevance value of 40.1511 for InnoDB and 3 for MyISAM.
This is due to the way word uniqueness is calculated. You can read more about this here
And that's it. The second query gives what is (in my opinion) the 100% relevance, which can be used to determine the similarity level between this sample and others.
It is a bit dirty, but it's the only option that worked for me. And since no one has suggested anything better, I will keep this as the answer until a better solution is found.
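The self-match score from the second query could then serve as the denominator for a percentage. A rough sketch under that assumption: :selfScore is a hypothetical placeholder bound to the value returned by the self-match query, and the 70% cut-off is the threshold mentioned in the question.

```sql
-- :fetchedVal is the constant sample; :selfScore (hypothetical)
-- is the relevance that the sample scored against itself.
SELECT s.value,
       MATCH(s.value) AGAINST(:fetchedVal) / :selfScore * 100 AS similarity_pct
FROM samples AS s
WHERE MATCH(s.value) AGAINST(:fetchedVal)
HAVING similarity_pct >= 70
ORDER BY similarity_pct DESC;
```

Some client drivers do not allow a named placeholder to be reused within one statement; in that case, bind the same value under two different names.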

Fulltext search on many tables

I have three tables, all of which have a column with a fulltext index. The user will enter search terms into a single text box, and then all three tables will be searched.
This is better explained with an example:
documents
    doc_id
    name FULLTEXT
table2
    id
    doc_id
    a_field FULLTEXT
table3
    id
    doc_id
    another_field FULLTEXT
(I realise this looks stupid but that's because I've removed all the other fields and tables to simplify it).
So basically I want to do a fulltext search on name, a_field and another_field, and then show the results as a list of documents, preferably with what caused that document to be found, e.g. if another_field matched, I would display what another_field is.
I began working on a system whereby three fulltext search queries are performed and the results inserted into a table with a structure like:
search_results
    table_name
    row_id
    score
(This could later be made to cache results for a few days with e.g. a hash of the search terms).
This idea has two problems. The first is that the same document can be in the search results up to three times with different scores. Instead of that, if the search term is matched in two tables, it should have one result, but a higher score.
The second is that parsing the results is difficult. I want to display a list of documents, but I don't immediately know the doc_id without a join of some kind; however, the table to join to is dependent on the table_name column, and I'm not sure how to accomplish that.
Wanting to search multiple related tables like this must be a common thing, so I guess what I'm asking is am I approaching this in the right way? Can someone tell me the best way of doing it please.
I would create a denormalized single index. I.e., put all three document types into a single table with fields for doc_id, doc_type and a single fulltext block. Then you can search all three document types at once.
You might also find that Lucene would make sense in this situation. It gives you faster searching, as well as much more functionality around how the searching and scoring works.
The downside is that you're keeping a separate denormalized copy of the text for each record. The upside is that searching is much faster.
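A minimal sketch of the denormalized single index suggested above. The search_index table and its body column are hypothetical names; documents, table2, table3 and their columns come from the question. Summing the per-row scores gives each document one result row, with a higher score when several sources match:

```sql
-- Hypothetical denormalized table: one row per searchable text.
CREATE TABLE search_index (
  doc_id   INT NOT NULL,
  doc_type VARCHAR(32) NOT NULL,  -- which source column the text came from
  body     TEXT,
  FULLTEXT KEY ft_body (body)
);

-- Copy the three searchable columns into the single index table.
INSERT INTO search_index (doc_id, doc_type, body)
SELECT doc_id, 'name', name FROM documents
UNION ALL
SELECT doc_id, 'a_field', a_field FROM table2
UNION ALL
SELECT doc_id, 'another_field', another_field FROM table3;

-- One row per document: scores from all matching sources are summed,
-- and matched_in lists which sources produced the hit.
SELECT doc_id,
       SUM(hit_score) AS total_score,
       GROUP_CONCAT(doc_type) AS matched_in
FROM (
  SELECT doc_id, doc_type,
         MATCH(body) AGAINST('search terms') AS hit_score
  FROM search_index
  WHERE MATCH(body) AGAINST('search terms')
) AS hits
GROUP BY doc_id
ORDER BY total_score DESC;
```

The copy has to be kept in sync with the source tables, for example by rebuilding it periodically or updating it alongside each write.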