MySQL fulltext search on multiple tables - another one

I've looked around for ways to run a MySQL fulltext search across multiple joined tables. I now know this is not possible directly, because a FULLTEXT index cannot span joined tables. The solution usually given is to run two MATCH clauses combined with AND/OR, but that doesn't solve my problem.
The situation is as follows. I have:
– A "works" table that contains book titles, short descriptions and text extracts.
– An "authors" table, with the names of the authors.
My search must be made IN BOOLEAN MODE for various reasons. Also, the default behavior for the words entered in the search field is AND (I preprocess the request by replacing spaces with +).
A user will typically enter in the search field: "NameOfAuthor TitleOfTheBook", or "NameOfAuthor aRandomWord" (a word they are looking for in the extracts), or "TitleOfTheBook" alone. They expect to get only the results (and all of them) that match every word they entered.
So if I:
– match against the "works" fields OR the "authors" fields, I will get an answer only if the short description in the "works" table mentions the name of the author.
If I don't preprocess the query (if I don't transform "NameOfAuthor TitleOfTheBook" into "+NameOfAuthor +TitleOfTheBook"), I will get all the books by that author plus all the books that contain some word of the query, which is not suitable.
– match against the "works" fields AND the "authors" fields, I will get nothing. If I don't preprocess the query for the "match against author" part, it may work in this case, but not in general, because it will fail for any search that doesn't mention the author's name.
It seems to me that the only solution is an index that mixes the works fields and the author name. But it's not possible to build an index over a join… The situation seems so typical that I can't believe this is a real issue. So I'm probably being stupid, but I just can't figure out a solution. Any ideas? Must I create a specific, virtual table for this search?
Thank you very much !

Well, writing down the question helped me figure something out… The idea is to split the user input into a $wordsArray and run the fulltext search for each word separately.
So the idea would be to:
// Parse the words from the query field into $wordsArray
foreach ($wordsArray as $word) {
// do a fulltext search on "works" fields against $word OR "authors" fields against $word
// save the results in a multi-dimensional $resultArray
}
// Keep only the results that exist in every row of the $resultArray
// Display
I think that is quite heavy, though… The only alternative I can imagine is a pregenerated database table dedicated to this search, with an index on it. It all depends on the scale.
Unless someone else has a better solution!
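To make the per-word idea concrete, here is a minimal Python sketch. It is not MySQL code: the BOOKS list and the ids_matching() helper are hypothetical stand-ins for the joined works/authors rows and for one fulltext query per word. It only illustrates the "search each word in either table, then intersect" logic.

```python
import re

# Hypothetical in-memory stand-in for the joined "works" and "authors" rows.
BOOKS = [
    {"id": 1, "author": "Jane Austen", "text": "Emma, a novel about matchmaking"},
    {"id": 2, "author": "Jane Austen", "text": "Persuasion, a novel about second chances"},
    {"id": 3, "author": "Charles Dickens", "text": "Bleak House, a novel about a lawsuit"},
]


def tokens(text):
    """Lower-cased set of the words in a piece of text."""
    return set(re.findall(r"\w+", text.lower()))


def ids_matching(word):
    """Stand-in for one fulltext query: the word is in the works fields OR the author name."""
    w = word.lower()
    return {b["id"] for b in BOOKS if w in tokens(b["author"]) or w in tokens(b["text"])}


def search(query):
    """Keep only the ids that were found for every word of the query (AND behaviour)."""
    words = query.split()
    if not words:
        return set()
    result_sets = [ids_matching(w) for w in words]
    return set.intersection(*result_sets)
```

With this data, search("Austen Emma") keeps only book 1, while search("Austen novel") keeps books 1 and 2, because each word may match in either table.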

Related

Match a given set of keywords from mysql database

I have a column in my MySQL database holding a set of keywords (specifically, the label data I'm getting from the Google Vision API). Is there an easy way to match and return similar records when another set of labels is given to the database?
In database: "Bike vehicle transport light floor"
What I'm giving as search parameters: "light bike car green"
Approach I've taken so far: using the LIKE keyword with wildcards. Is there a better way to do this?
Thanks
A solution I propose, for which you'd have to use a STORED PROCEDURE, is to create a table of "words".
word_id INT AUTO_INCREMENT
word VARCHAR(255)
Then split the field into words and add each one to the words table. If the word is new, add it; if it already exists, reuse its existing ID. You then create a used_words table that links each record with the words it contains.
record_id (current record ID)
word_id INT
CONSTRAINT record_id REFERENCES current_table(current record ID)
CONSTRAINT word_id REFERENCES words(word_id)
Finally, to compare one list against another, you check whether every word you chose exists in the used_words table:
SELECT word_id FROM used_words
WHERE record_id="$new_entry_id"
AND word_id NOT IN (
SELECT word_id FROM used_words
WHERE record_id="$existing_id"
)
If the result is empty, then all the words exist. Otherwise, you get the list of words that differ.
The algorithm should work, but not as a single SQL query.
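The comparison step is essentially a set difference. As a rough illustration outside the database, here is a small Python sketch using the example keywords from the question; the missing_words() name is made up for this example, and matching is assumed to be case-insensitive.

```python
def missing_words(new_entry, existing):
    """Return the words of new_entry that do not occur in existing.

    Assumption: words are separated by whitespace and compared case-insensitively.
    """
    existing_words = set(existing.lower().split())
    seen = set()
    missing = []
    for word in new_entry.lower().split():
        # Keep the order of the new entry and report each missing word once.
        if word not in existing_words and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing
```

An empty list means every supplied word is already linked to the existing record; a non-empty list is the set of differing words, which could also be used to rank records by how few words are missing.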
This isn't a "complete" answer, and I'm not expecting it to be accepted as such.
The topic of your question is "Information Retrieval", and there are several good books on the subject (although they'll cover a far wider scope than your specific question, so YMMV unless you're particularly interested in the subject).
I'd read up on normalisation. I'd start by decoupling those keywords into a well-indexed joining table.
Also take a look at the subject of stemming. It's not a silver bullet, but it's core to getting the right results. Some database engines can handle this for you - MySQL cannot (to my knowledge). I'd recommend looking at the Porter Stemmer for a good English example. There are libraries for every major language.
Finally consider synonyms. There's no easy way to handle these (in code); you'll need to build a database of them (better still, grab a free one from online). You'll use this to "increase" the supplied keyword list, using related words. ("Aeroplane" becomes "Aeroplane, Vehicle, Aircraft, Flying machine, Transport", etc).
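As a rough sketch of the synonym step: the SYNONYMS dictionary below is a tiny hypothetical stand-in for a real synonym database, and expand() is a name invented for this example. It only shows how a supplied keyword list could be widened before matching.

```python
# Hypothetical, hand-written synonym table; a real one would come from a synonym database.
SYNONYMS = {
    "aeroplane": ["vehicle", "aircraft", "flying machine", "transport"],
    "bike": ["bicycle", "vehicle"],
}


def expand(keywords):
    """Return the keywords plus their known synonyms, lower-cased, without duplicates."""
    expanded = []
    for keyword in keywords:
        key = keyword.lower()
        for term in [key] + SYNONYMS.get(key, []):
            if term not in expanded:
                expanded.append(term)
    return expanded
```

The expanded list would then be matched against the keyword table in place of the original list.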

Searching usernames containing special characters

To start off with, I have looked into this issue and gone through quite a few suggestions here on SO, but many leave me in doubt whether they are good performance-wise.
So to my problem:
I have a table with usernames and want to provide users the possibility to search for others by their name. As these names are taken from Steam though, the names not containing some form of special character are in the minority.
The easiest solution would be to use LIKE 'name%', but with the table size constantly increasing, I don't see this as the best solution, even though it may be the only one.
I tried using a fulltext search, but the many special characters crushed that idea.
Any other solutions or am I stuck with LIKE?
Current table rows: 120k+
Well, I don't believe string functions are faster, but I don't currently have a big database to test performance on. Let's give it a try:
WHERE substr(name, 1, CHAR_LENGTH('ste')) = 'ste'
I would like to suggest a solution which I have applied before.
First, I cleaned all special characters out of the string in the name column.
Then I stored the cleaned string in another column (called cleaned_name) and put the fulltext index on this column instead of the original column.
Finally, I used the same function as in the first step to clean the queried name before executing a fulltext search on cleaned_name.
I hope that this solution is suitable for you.
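A minimal sketch of what such a cleaning function could look like, in Python. The exact rules are an assumption (strip accents, drop everything that is not an ASCII letter, digit or space, lower-case the rest); the answer does not say which characters were removed.

```python
import re
import unicodedata


def clean_name(name):
    """Normalise a username for storage in a cleaned_name column (assumed rules)."""
    # Split accented characters into base letter + combining mark, then drop the marks.
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Remove every character that is not an ASCII letter, digit or whitespace.
    no_specials = re.sub(r"[^0-9A-Za-z\s]", "", no_marks)
    # Collapse runs of whitespace and lower-case the result.
    return re.sub(r"\s+", " ", no_specials).strip().lower()
```

The same function has to be applied both when filling cleaned_name and to the search term, otherwise the two sides will not line up.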

Position independent string matching

I have 2,000,000 strings in my MySQL database. When a new string comes in as input, I try to find out whether the string is already in my database; if not, I insert it.
Definition of String Match
In my case, the position of a word in the text doesn't matter. All the words must be present in both strings, with no extra words in either string.
Example: "Ram is a boy" and "boy is a Ram" are considered a match. "Ram is a good boy" does not match.
PS: please ignore the meaning of the sentences.
Now, my question is: what is the best way to do this matching, given the number of strings (2,000,000) I have to match against?
Solution I could think of:
– Index all the strings in Solr/Sphinx.
– On a new search, I just hit the search server and have to consider at most the top 10 strings.
Advantages:
– Faster than MySQL fulltext search.
Disadvantages:
– Keeping the search server updated with the new strings added to the MySQL database.
Are there any better solutions that I can go for? Any suggestions and approaches to tackle this are most welcome :)
Thanks!
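You could just compute a second column that holds the words in sorted order. Then just put a unique index on that column :)
ALTER TABLE `table` ADD sorted VARCHAR(255) NOT NULL, ADD UNIQUE INDEX (sorted);
then... (PHP for convenience, but other languages will be similar)
// Split on whitespace; PREG_SPLIT_NO_EMPTY drops empty pieces caused by repeated spaces
$words = preg_split('/\s+/', trim($string), -1, PREG_SPLIT_NO_EMPTY);
sort($words);
// Note: the mysql_* functions were removed in PHP 7; with mysqli or PDO, prefer prepared statements
$sorted = mysql_real_escape_string(implode(' ', $words));
$string = mysql_real_escape_string($string);
// INSERT IGNORE skips the row if the unique index on `sorted` already holds this word set
$sql = "INSERT IGNORE INTO `table` SET `string`='$string',`sorted`='$sorted'";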
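The same idea can be sketched without a database. In the Python below, canonical_key() plays the role of the sorted column and a plain dict stands in for the unique index; lower-casing the words is an added assumption, so drop it if matching should be case-sensitive.

```python
def canonical_key(text):
    """Words of the text, lower-cased and sorted, joined by single spaces."""
    return " ".join(sorted(text.lower().split()))


def insert_if_new(store, text):
    """Insert text unless a string with the same words already exists.

    `store` maps canonical key -> original string and stands in for the
    table with its unique index. Returns True if the text was inserted.
    """
    key = canonical_key(text)
    if key in store:
        return False
    store[key] = text
    return True
```

Because the key keeps repeated words, "Ram Ram is a boy" and "Ram is a boy" get different keys, which matches the "no extra words in either string" rule.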
I would suggest creating some more tables that store information about your existing data,
so that regardless of how much data your table holds, you will not run into performance issues in the "match/check and insert" logic of your query.
Please check the schema suggestion I made for a similar requirement in another post on SO:
accommodate fuzzy matching
To achieve your needs with the approach in the post above (where I described matching data with 90% accuracy), you will need just one extra table. Let me know if that answer is not clear or if you have any doubts about it.
EDIT-1
In your case you will have 3 tables. One you already have, where your 2,000,000 string messages are stored. The other two tables I was talking about are as follows.
A second table to store every unique Expression (unique word across all messages).
A third table to store the link between each Expression (word) and the message that word appears in.
See the query results below.
Now let's say your input is the string "Is Boy Ram".
First extract each Expression from the string; there are 3 in this string: "Is", "Ram" and "Boy".
Now it's just a matter of completing the SELECT query to check whether all of these expressions exist in the last table,
"MyData_ExpressionString", for a single StringID. I guess you now have a better picture and know what to do next. And yes, I haven't created indexes, but I guess you already know which indexes you will need.
Calculate a Bloom filter for each string by adding all the words of that string to the filter. For any new string, calculate its Bloom filter in the same way and look up the matching strings in the DB.
You can probably get by with a fairly short Bloom filter; some testing on your strings should tell you how long it needs to be.
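A minimal sketch of such a per-string signature in Python. The filter width (64 bits, so it would fit in a BIGINT UNSIGNED column) and the number of hash positions per word are assumptions picked for illustration, not tuned values. Two strings with the same words always get the same signature, but different word sets can collide, so candidates found by signature still need an exact word comparison.

```python
import hashlib

FILTER_BITS = 64  # assumed width of the signature
NUM_HASHES = 3    # assumed number of bit positions set per word


def word_bits(word):
    """Bit mask with NUM_HASHES positions derived from a SHA-256 digest of the word."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(NUM_HASHES):
        # Use a separate 4-byte slice of the digest for each position.
        chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
        mask |= 1 << (chunk % FILTER_BITS)
    return mask


def bloom_signature(text):
    """OR together the masks of all words; word order does not affect the result."""
    signature = 0
    for word in text.split():
        signature |= word_bits(word)
    return signature


def same_words(a, b):
    """Exact check to run on candidates whose signatures are equal."""
    return set(a.lower().split()) == set(b.lower().split())
```

The signature would be stored in an indexed column; a lookup fetches the rows with an equal signature and then confirms each candidate with same_words().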

MySQL table MATCH AGAINST

Hi all,
I have created this simple table, called classics, in a DB called publications on XAMPP. I am trying to do a MATCH AGAINST search for an author name, which I thought I understood.
I have also made sure the table has a FULLTEXT index on both the author and title columns, as required. The table is also of type MyISAM.
I tried this and it failed.
SELECT author FROM classics WHERE MATCH(author) AGAINST('Charles');
I know Charles is present in the author column, as you can see, but I get no rows returned.
Now if I rewrite it for any other author, it works:
SELECT author FROM classics WHERE MATCH(author) AGAINST ('jane');
Here is what I get with jane...
I'm not sure, but it seemed earlier that I had to include both of the fields I'd indexed in the query, instead of being able to search author alone. Is this correct, and does anyone know why I can't get charles returned?
Many thanks!
It's not returning those rows because "charles" appears in at least 50% of the rows. This is a well-documented restriction of MySQL FULLTEXT search in natural language mode.
If you want to get around this restriction, you can use BOOLEAN MODE, i.e. AGAINST('Charles' IN BOOLEAN MODE).
Here's the relevant excerpt from the manual:
A word that matches half of the rows in a table is less likely to locate relevant documents. In fact, it most likely finds plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that rows containing the word are assigned a low semantic value for the particular data set in which they occur. A given word may reach the 50% threshold in one data set but not another.
The 50% threshold has a significant implication when you first try full-text searching to see how it works: If you create a table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results. Be sure to insert at least three rows, and preferably many more. Users who need to bypass the 50% limitation can use the boolean search mode; see Section 12.9.2, “Boolean Full-Text Searches”.

How can I make MySQL Fulltext indexing ignore url strings, particularly the extension

I'm indexing strings containing URLs with MySQL fulltext, but I don't want the URLs included in the results.
As an example, I search for "PHP" or "HTML" and I get records like "Ibiza Angels Massage Company see funandfrolicks.php"... a hedonistic distraction at best.
I can't find any examples of adding regular expressions to the stop word list.
The other thing I thought of (and failed at) is building the fulltext SQL so that it decreases the word's contribution... however, in the following SQL, the relevance value did not change.
SELECT title, content,match(title,content) against('+PHP >".php"' IN BOOLEAN MODE)
FROM tb_feed
WHERE match(title,content) against('PHP >".php"' IN BOOLEAN MODE)
ORDER BY published DESC LIMIT 10;
An alternative is a messy SQL statement with an additional condition ...
WHERE ... IF(content REGEXP '.php', content REGEXP '(^| )php', 1) ...
Thoughts... what's the best solution?
If the number of results is bearable, you could simply choose not to display the matches for the words you want to ignore, such as .php or .html. This is very quick to kludge together but will use more memory than you need.
Another solution is to create another field holding the keywords that you want to search on. In this field you omit URLs and any other keywords that are not desired. This solution takes little time to write but takes up extra space on the hard drive.
The better solution is to create another table called keyword (or similar). When a user submits a search query, the keyword table is searched for the specified keywords. The keyword table is populated by splitting the input data when the content is uploaded or retrieved.
This last option has the advantage of potentially being fast, and the data is compact because each keyword is stored only once, with an index pointing back to the main content record. It also allows cleverer searches if you so desire.
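Here is a rough Python sketch of how the keyword table could be populated while skipping URLs. The is_url_like() heuristic is an assumption and deliberately crude (it would also drop names such as "Node.js"); the dict returned by build_keyword_index() stands in for the keyword table with its pointers back to the content records.

```python
import re
from collections import defaultdict


def is_url_like(token):
    """Crude, assumed heuristic: starts like a URL or contains a 'name.ext' part."""
    t = token.strip(".,;:!?()\"'")
    return bool(
        re.search(r"^(https?://|www\.)", t, re.IGNORECASE)
        or re.search(r"\w\.[a-z]{2,5}(/|$)", t, re.IGNORECASE)
    )


def keywords(content):
    """Lower-cased words of the content, ignoring URL-like tokens entirely."""
    result = set()
    for token in content.split():
        if is_url_like(token):
            continue
        result.update(re.findall(r"[a-z0-9]+", token.lower()))
    return result


def build_keyword_index(records):
    """Map each keyword to the ids of the records containing it.

    `records` maps record id -> content text.
    """
    index = defaultdict(set)
    for record_id, content in records.items():
        for keyword in keywords(content):
            index[keyword].add(record_id)
    return index
```

With the example from the question, "funandfrolicks.php" is skipped as a whole, so a search for "php" only finds records where PHP appears as an ordinary word.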
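If you want php/html only when it is not part of a URL, one simple way is to try
content LIKE '% php %'
content LIKE '% html %'
That way, php/html must appear as a separate word surrounded by spaces (so it will not match at the very start or end of the text, or next to punctuation).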