Match a given set of keywords from mysql database - mysql

I have a column in my MySQL database that holds a set of keywords (specifically, the label data I'm getting from the Google Vision API). Is there an easy way to match and return similar records when another set of labels is given to the database?
In database: "Bike vehicle transport light floor"
What I'm giving as search parameters: "light bike car green"
Approach I've taken currently: use the LIKE keyword with wildcards. Is there a better way to do this?
Thanks

A solution I propose, for which you'd have to use a STORED PROCEDURE, is to create a table of "words".
word_id INT AUTO_INCREMENT PRIMARY KEY
word VARCHAR(255)
Then split each field into words and add them to the words table: if a word is new, add it; if it already exists, get its existing word_id. You then create a used_words table that links each record with the multiple words it contains.
record_id INT *(current record ID)*
word_id INT
FOREIGN KEY (record_id) REFERENCES *current_table(current record ID)*
FOREIGN KEY (word_id) REFERENCES words(word_id)
Finally, to compare one list against another, check whether every word of the new entry also exists in the used_words rows of the existing record:
SELECT word_id FROM used_words
WHERE record_id = "$new_entry_id"
AND word_id NOT IN (
SELECT word_id FROM used_words
WHERE record_id = "$existing_id"
)
If the result is empty, then all the words exist. Otherwise, you have the list of words that differ.
The algorithm should work, but not as a single SQL query.
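To make the words / used_words idea concrete, here is a minimal sketch. It assumes Python with an in-memory SQLite database as a stand-in for MySQL (the column types differ slightly), and the helper names `build_db`, `index_record` and `missing_words` are made up for illustration. Splitting on whitespace is also an assumption about how the labels are stored.

```python
import sqlite3


def build_db():
    """Create the two tables described above (SQLite syntax, not MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE words (
            word_id INTEGER PRIMARY KEY AUTOINCREMENT,
            word    TEXT NOT NULL UNIQUE
        );
        CREATE TABLE used_words (
            record_id INTEGER NOT NULL,
            word_id   INTEGER NOT NULL REFERENCES words(word_id),
            PRIMARY KEY (record_id, word_id)
        );
    """)
    return conn


def index_record(conn, record_id, text):
    """Split a keyword string and link each word to the record."""
    for word in set(text.lower().split()):
        # New words are added; existing words keep their word_id.
        conn.execute("INSERT OR IGNORE INTO words (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT OR IGNORE INTO used_words (record_id, word_id) "
            "SELECT ?, word_id FROM words WHERE word = ?",
            (record_id, word),
        )


def missing_words(conn, new_id, existing_id):
    """Words used by new_id that existing_id does not contain."""
    rows = conn.execute(
        """
        SELECT w.word
        FROM used_words AS u
        JOIN words AS w ON w.word_id = u.word_id
        WHERE u.record_id = ?
          AND u.word_id NOT IN (
              SELECT word_id FROM used_words WHERE record_id = ?
          )
        ORDER BY w.word
        """,
        (new_id, existing_id),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_db()
index_record(conn, 1, "Bike vehicle transport light floor")
index_record(conn, 2, "light bike car green")
print(missing_words(conn, 2, 1))  # ['car', 'green']
```

An empty list means every word of the new entry is already present in the existing record.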

This isn't a "complete" answer, and I'm not expecting it to be accepted as such.
Your topic of question is "Information Retrieval", and there are several good books on the subject (although they'll cover a far wider scope than your specific question - so YMMV unless you're particularly interested in the subject).
I'd read up on normalisation. I'd start by decoupling those keywords into a joining table, well indexed.
Also take a look at the subject of stemming. It's not a silver bullet, but it's core to getting the right results. Some database engines can handle this for you - MySQL cannot (to my knowledge). I'd recommend looking at the Porter Stemmer for a good English example. There are libraries for every major language.
Finally consider synonyms. There's no easy way to handle these (in code); you'll need to build a database of them (better still, grab a free one from online). You'll use this to "increase" the supplied keyword list, using related words. ("Aeroplane" becomes "Aeroplane, Vehicle, Aircraft, Flying machine, Transport", etc).
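As a rough illustration of the synonym step, the sketch below expands the supplied keywords and ranks records by how many keywords they share. It is only a sketch in Python: the `SYNONYMS` map is a tiny hypothetical stand-in for a real synonym database, and `expand` and `rank` are illustrative names. Stemming is left out.

```python
# Hypothetical synonym data; a real system would load this from a database.
SYNONYMS = {
    "car": {"vehicle", "transport"},
    "bike": {"vehicle", "transport"},
}


def expand(keywords):
    """Return the keywords plus any related words from SYNONYMS."""
    expanded = set(keywords)
    for keyword in keywords:
        expanded |= SYNONYMS.get(keyword, set())
    return expanded


def rank(query, records):
    """Order record ids by the number of keywords shared with the query."""
    wanted = expand(query.lower().split())
    scored = []
    for record_id, text in records.items():
        overlap = wanted & set(text.lower().split())
        if overlap:
            scored.append((len(overlap), record_id))
    # Most shared keywords first; ties broken by record id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


records = {
    1: "Bike vehicle transport light floor",
    2: "tree green grass",
}
print(rank("light bike car green", records))  # [1, 2]
```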

Related

Autocomplete with MySQL Fulltext search that proposes words instead of results

There are lots of questions about fulltext searches with MySQL and I've read many of them without finding what I am looking for (on Google or Stack Overflow).
I am not looking to match rows (or documents) but I am looking to match words contained in the rows.
For ex, imagine you have a companies table, with an id, a name and a small_description column. You could find rows like :
1 | MyBaker | fine bakery since 1920
2 | Bakery factory | all the materials for a bakery
etc...
Now, when the user types "bak", I would like to suggest the word "bakery" (I do not want to suggest MyBaker and Bakery factory directly, since hundreds of companies will match but only a handful of different words).
I think that the underlying MySQL fulltext engine already has some kind of "word lookup", so I'd like to use that instead of parsing the name and small_description myself to build another table with word | nb_occurences (not to mention that it may be hard to keep synchronized if lots of updates are made to the other table, since the counters would have to be decremented).
The reason behind this is to create an autocomplete search where word suggestions are correlated to the database content.
For example, amazon (.fr) is doing a pretty awful job. If you type "tel", it will suggest a dozen "telephone" matches and 0 "television" or "telescope" or "telemetry" ... !
While this is not really a problem on desktop, where typing the full word is fast, it is a real problem on mobile.
This is amplified by the fact that some words suggested by the smartphone keyboard are not in my database AND that some words in my database are never suggested by the smartphone keyboard.
For example, my database has 0 "telephone" and "television" entries but lots of "telemetry" and "teleconference".
Finally, I'd also like to forgive bad spelling if possible (e.g. telme should match telemetry).
I hope someone can help me leverage the existing fulltext index to achieve my goal.

FULLTEXT search finds rows of data matching the word or words you present to it. As you know, it is not simply a word search.
You could, in your back-end program, take the results of your FULLTEXT search, break it up into words, and consider the most frequent of those words for autocompletion. This might work well if you modified your searches using WITH QUERY EXPANSION.
(Keep in mind that natural language FULLTEXT searches work strangely with small sets of data to search, so test with a table with many rows, not just a few.)
But FULLTEXT does not handle stemming correctly (chateau should match chateaux, but not chat), nor does it offer to correct misspellings.
You could use Apache Lucene for your purpose, but it is a large and complex system.
I think you need the word / nb_appearances table, unpleasant as it is to maintain. It will give you the capability of doing
SELECT word
FROM words
WHERE word LIKE CONCAT(:input,'%')
ORDER BY nb_appearances DESC;
to get partial word matches. FULLTEXT cannot do that. You can also add a second lookup table to correct common misspellings in your application domain, for example, telmetry --> telemetry. It is a pain in the neck, of course.
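For what it's worth, here is a small sketch of the word / nb_appearances idea done in application code. It assumes Python, keeps the counts in memory instead of a table, and uses the standard library's `difflib.get_close_matches` as one possible way to forgive misspellings (the answer above suggests a lookup table instead). The function names are made up.

```python
import re
from collections import Counter
from difflib import get_close_matches


def build_word_counts(texts):
    """Count how often each word appears across all the given strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts


def suggest(counts, prefix, limit=5):
    """Prefix matches, most frequent first (like the LIKE 'input%' query)."""
    prefix = prefix.lower()
    matches = [word for word in counts if word.startswith(prefix)]
    matches.sort(key=lambda word: (-counts[word], word))
    return matches[:limit]


def suggest_fuzzy(counts, typed, limit=3):
    """Fallback for misspellings: closest known words by similarity ratio."""
    return get_close_matches(typed.lower(), list(counts), n=limit, cutoff=0.6)


rows = [
    "MyBaker", "fine bakery since 1920",
    "Bakery factory", "all the materials for a bakery",
]
counts = build_word_counts(rows)
print(suggest(counts, "bak"))  # ['bakery']
```

The `cutoff` value of 0.6 is just difflib's default; it would need tuning against real data.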

MySQL: database structure choice - big data - duplicate data or bridging

We have a 90GB MySQL database with some very big tables (more than 100M rows). We know this is not the best DB engine but this is not something we can change at this point.
In planning a serious refactoring (performance and standardization), we are considering several approaches to restructuring our tables.
The data flow / storage is currently done in this way:
We have one table called articles, one connection table called article_authors and one table authors
One single author can have 1..n firstnames, 1..n lastnames, 1..n emails
Every author has a unique parent (unique_author), except if that author is the parent
The possible data query scenarios are as follows:
Get the author firstname, lastname and email for a given article
Get the unique authors.id for an author called John Smith
Get all articles from the author called John Smith
The current DB schema looks like this:
EDIT: The main problem with this structure is that we always duplicate similar given_names and last_names.
We are now hesitating between two different structures:
Large number of tables, data are split and there are connections with IDs. No duplicates in the main tables: articles and authors. Not sure how this will impact the performance as we would need to use several joins in order to retrieve data, example:
Data is split among a reasonable number of tables with duplicate entries in the table article_authors (author firstname, lastname and email alternatives) in order to reduce the number of tables and the application code complexity. One author could have 10 alternatives, so we will have 10 entries for the same author in the article_authors table:

The current schema is probably the best. The middle table is a many-to-many mapping table, correct? That can be made more efficient by following the tips here: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Rewrite #1 smells like "over-normalization". A big waste.
Rewrite #2 has some merit. Let's talk about phone_number instead of last_name because it is rather common for a person to have multiple phone_numbers (home, work, mobile, fax), but unlikely to have multiple names. (Well, OK, there are pseudonyms for some authors).
It is not practical to put a bunch of phone numbers in a cell; it is much better to have a separate table of phone numbers linked back to whoever they belong to. This would be 1:many. (Ignore the case of two people sharing the same phone number -- due to sharing a house, or due to working at the same company. Let the number show up twice.)
I don't see why you want to split firstname and lastname. What is the "firstname" of "J. K. Rowling"? I suggest that it is not useful to split names into first and last.
A single author would have a unique "id". MEDIUMINT UNSIGNED AUTO_INCREMENT is good for such. "J. K. Rowling" and "JK Rowling" can both link to the same id.
More
I think it is very important to have a unique id for each author. The id can be then used for linking to books, etc.
You have pointed out that it is challenging to map different spellings into a single id. I think this should be essentially a separate task with separate table(s). And it is this task that you are asking about.
That is, split the database, and split the tasks in your mind, into:
one set of tables containing stuff to help deduce the correct author_id from the inconsistent information provided from the outside.
one set of tables where author_id is known to be unique.
(It does not matter whether this is one versus two DATABASEs, in the MySQL sense.)
The mental split helps you focus on the two different tasks, plus it prevents some schema constraints and confusion. None of your proposed schemas does the clean split I am proposing.
Your main question seems to be about the first set of tables -- how do you turn strings of text ("JK Rawling") into a specific id? At this point, the question is first about algorithms, and only secondly about the schema.
That is, the tables should be designed to support the algorithm, not to drive it. Furthermore, when a new provider comes along with some strange new text format, you may need to modify the schema - possibly adding a special table for that provider's data. So, don't worry about making the perfect schema this early in the game; plan on running ALTER TABLE and CREATE TABLE next month or even next year.
If a provider is consistent in spelling, then a table with (provider_id, full_author_name, author_id) is probably a good first cut. But that does not handle variations of spelling, new authors, and new providers. We are getting into gray areas where human intervention will quickly be needed. Even worse is the issue of two authors with the same name.
So, design the algorithm with the assumption that simple data is easily and efficiently available from a database. From that, the schema design will somewhat easily flow.
Another tip here... Some degree of "brute force" is OK for the hard-to-match cases. Most of the time, you can easily map name strings to author_id very efficiently.
It may be easier to fetch a hundred rows from a table, then massage them in your algorithm in your app code. (SQL is rather clumsy for algorithms.)
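A sketch of what that app-side massaging might look like, assuming Python. The `known` dict stands in for the rows fetched from a (full_author_name, author_id) table. The `normalise` rule and the 0.85 threshold are arbitrary assumptions, not something the schema dictates.

```python
import re
from difflib import SequenceMatcher


def normalise(name):
    """Casefold and drop punctuation, digits and spaces: 'J. K. Rowling' -> 'jkrowling'."""
    return re.sub(r"[\W\d_]+", "", name.casefold())


def resolve_author(name, known, threshold=0.85):
    """Return (author_id, how), where how is 'exact', 'fuzzy' or 'unknown'."""
    key = normalise(name)
    # Cheap path: most names map directly after normalisation.
    if key in known:
        return known[key], "exact"
    # Brute force for the hard cases: compare against every candidate.
    best_id, best_score = None, 0.0
    for candidate, author_id in known.items():
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best_id, best_score = author_id, score
    if best_score >= threshold:
        return best_id, "fuzzy"
    return None, "unknown"  # left for human review


# Hypothetical rows already fetched from the database.
known = {normalise("J. K. Rowling"): 1, normalise("John Smith"): 2}
print(resolve_author("JK Rowling", known))  # (1, 'exact')
print(resolve_author("JK Rawling", known))  # (1, 'fuzzy')
```

Fuzzy hits cannot tell apart two different authors with near-identical names, so they are probably best treated as suggestions for a human to confirm.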

If you want to reduce size, you could also think about splitting email addresses into two parts: 'jkrowling#' + 'gmail.com'. You could have a table where you store common email domains, but seeing that over-normalization is a concern...

How to search inside a SQL table for a phrase

I am currently using MySQL but I am willing to migrate if necessary to any solution suggested.
I am looking for an easy way to implement a search on a table.
The table has multiple entries with data similar to what will be found on user accounts, like names, addresses, phone numbers and a text column that contains comments of arbitrary length.
I want to make a search so that I can go over all rows and columns and find the best matching row. Slight misspellings should be corrected (not very important). But the most important thing is the ability to search across everything.
Table can have as many as 20,000 rows.
Search parameter will be for example: "Company First Name"
Expected results:
company|Contact First Name|Address|...|...
example 2, slightly misspelled search parameters : "Pinaple Street Compani"
Expected results row:
company|pinapple street|..|...
companie|pinapple street|..|...
company|pinaple street|..|...
EDIT:
I forgot to clarify that multiple searches will run at the same time, so it has to be fast (around 100 searches at once). Also, the language of the data is not English, and the database is utf8 with support for non-English characters.

The misspelling problem is hard, if not impossible, to solve well in pure MySQL.
The multiple-column FULLTEXT search isn't so bad.
Your query will look something like this ...
SELECT column, column
FROM table
WHERE MATCH(Company, FirstName, LastName, whatever, whatever)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
It will produce a bunch of results, ordered by what MySQL guesses is the most likely hit first. MySQL's guesses aren't great, but they're usually adequate.
You'll need a FULLTEXT index matching the list of columns in your MATCH() clause. Creating that index looks like this.
ALTER TABLE book
ADD FULLTEXT INDEX Fulltext_search_index_1
(Company, FirstName, LastName, whatever, whatever);
Comments in your question notwithstanding, you just need an index for the group of columns which you will search.
20K rows won't be a big burden on any recent-vintage server hardware.
Misspelling: You could try SOUNDEX(), but it's an early 20th century algorithm (patented by Robert Russell and Margaret Odell around 1918) designed to index people's names in American English. It's designed to get many false positive hits, and it really is dumber than a bucket of rocks.
If you really do need spell correction you may need to investigate Sphinx.
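If a dedicated search engine is overkill, one option is to re-rank a small candidate set in application code. The sketch below is only that: it assumes Python and plain tuples for rows, and it scores each row by how closely its words match the query words, using `difflib.SequenceMatcher`. Running it over all 20,000 rows for 100 concurrent searches would likely be too slow, so it is meant for candidates already narrowed down by some other query.

```python
from difflib import SequenceMatcher


def tokens(text):
    """Casefolded whitespace-separated words (works for non-English text too)."""
    return text.casefold().split()


def row_score(query, row):
    """Average, over the query words, of the best similarity to any word in the row."""
    row_tokens = [token for cell in row for token in tokens(cell)]
    query_tokens = tokens(query)
    if not row_tokens or not query_tokens:
        return 0.0
    total = 0.0
    for q in query_tokens:
        total += max(SequenceMatcher(None, q, t).ratio() for t in row_tokens)
    return total / len(query_tokens)


def search(query, rows, limit=3):
    """Return the best-scoring rows, highest score first."""
    ranked = sorted(rows, key=lambda row: row_score(query, row), reverse=True)
    return ranked[:limit]


# Hypothetical candidate rows: (company, address).
rows = [
    ("company", "pinapple street"),
    ("bakery", "main road"),
    ("companie", "oak avenue"),
]
print(search("Pinaple Street Compani", rows, limit=1))
# [('company', 'pinapple street')]
```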

Best MySQL search query for multiple keywords in multiple columns

The problem here is that I have multiple columns:
| artist | name | lyrics | content
I want to search these columns by multiple keywords. The problem is that I can't come up with a good algorithm using LIKE with AND/OR.
The best option so far is to search for each keyword in each column, but that way I get results that contain one keyword in the name but not the second keyword in the artist.
I want everything combined with AND, but that only works if I'm searching a single column. Otherwise, to return a result, every column must contain all the keywords...
Does anyone know what algorithm I need so that a search with 3 keywords (e.g. 1 for artist and 2 for name) finds the correct result?

The best solution is not to use MySQL for the search, but use a text-indexing tool like Apache Solr.
You could then run queries against Solr like:
name:"word" AND artist:"otherword"
It's pretty common to use Solr for indexing data even if you keep the original data in MySQL.
Not only would it give you the query flexibility you want, but it would run hundreds of times faster than using LIKE '%word%' queries on MySQL.
Another alternative is to use MySQL's builtin fulltext indexing feature:
CREATE FULLTEXT INDEX myft ON mytable (artist, name, lyrics, content);
SELECT * FROM mytable
WHERE MATCH(artist, name, lyrics, content)
AGAINST ('+word +otherword' IN BOOLEAN MODE)
But it's not as flexible if you want to search for words that occur in specific columns, unless you create a separate index on each column.

AND works for displaying multiple rows too. It just depends upon the rows you have in your table, which you haven't provided. PS: I'm sorry if my answer is not clear; I don't have the reputation to make it a comment.

MySQL fulltext search on multiple tables - another one

So I checked around about MySQL fulltext search on multiple joined tables. I know, now, that this is not possible because an index cannot be built across joined tables. The given solution is always to do two "match" clauses combined with and/or – but it doesn't solve my problem.
The situation is as follows. I have:
– A "works" table that contains book titles, short descriptions and text extracts.
– An "authors" table, with the names of the authors.
My search must be made IN BOOLEAN MODE for various reasons. Also, the default behavior for the words entered in the search field is AND (I preprocess the request by replacing spaces with +).
A user will typically enter in the search field: "NameOfAuthor TitleOfTheBook" or "NameOfAuthor aRandomWord (that they are looking for in the extracts)" or "TitleOfTheBook" alone. They expect to find only the results (and all of them) that match all the words they entered.
So if I:
– match against the "works" fields OR the "authors" fields, I will have an answer only if the short descriptions in the "works" table mention the name of the author.
If I don't preprocess the query (if I don't transform "NameOfAuthor TitleOfTheBook" into "+NameOfAuthor +TitleOfTheBook"), I will have all the books from one author and all the books that contain some words of the query, which is not suitable.
– match against the "works" fields AND the "authors" fields, I will have nothing. If I don't preprocess the query for the "Match against author" part, it may work in this case, but not in general, because it will not work with any search that doesn't mention the author's name.
It seems to me that the only solution is an index that mixes the works fields and the author name. But it's not possible to build an index over a join… The situation seems so typical that I can't believe this is a real issue. So I'm probably stupid, but I just can't figure out a solution. Any idea? Must I create a specific, virtual table for this search?
Thank you very much !

Well, writing down the question helped me to figure something out… The idea would be to break the user input into a $wordsArray and do the fulltext search for each word.
So, the idea would be to:
//Parse the words from the query field
for (there is still a $word to check in the $wordsArray) {
// do a fulltext search on "works" fields against $word OR "authors" fields against $word
// save the results in a multi-dimensional $resultArray
}
// Keep only the results that exists in every row of the $resultArray
// Display
I think that is quite heavy, though… But the only alternative I can imagine is a pregenerated database table for search purposes, with an index on it. It all depends on the scale.
Unless someone else has a better solution!
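The per-word loop above could be sketched roughly like this. It is a Python illustration, not the actual implementation: `ids_matching` is a hypothetical stand-in for one fulltext query ("works" fields OR "authors" fields against a single word), here faked with in-memory dicts, and the sample data is invented.

```python
FIELDS = ("author", "title", "extract")


def ids_matching(word, works):
    """Stand-in for one fulltext query: ids of works where any field contains the word."""
    word = word.casefold()
    return {
        work["id"]
        for work in works
        if any(word in work[field].casefold().split() for field in FIELDS)
    }


def search_all_words(query, works):
    """Keep only the ids returned for every word of the query (AND semantics)."""
    result = None
    for word in query.split():
        ids = ids_matching(word, works)
        result = ids if result is None else result & ids
        if not result:
            break  # one word matched nothing, so no work can match them all
    return result or set()


# Invented sample data, with the author name already joined onto each work.
works = [
    {"id": 1, "author": "Hugo", "title": "Les Miserables",
     "extract": "a bishop and a convict"},
    {"id": 2, "author": "Hugo", "title": "Notre-Dame de Paris",
     "extract": "a bell ringer"},
    {"id": 3, "author": "Dumas", "title": "The Three Musketeers",
     "extract": "all for one"},
]
print(search_all_words("Hugo Miserables", works))  # {1}
```

With a real database this means one query per word, which is where the "quite heavy" concern comes from.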