I am currently using MySQL but I am willing to migrate if necessary to any solution suggested.
I am looking for an easy way to implement a search on a table.
The table has multiple entries with data similar to what will be found on user accounts, like names, addresses, phone numbers and a text column that contains comments of arbitrary length.
I want a search that goes over all rows and columns and finds the best-matching row. Correcting slight misspellings would be nice (not very important), but the most important thing is the ability to search across every column at once.
Table can have as many as 20,000 rows.
Search parameter will be for example: "Company First Name"
Expected results:
company|Contact First Name|Address|...|...
Example 2, slightly misspelled search parameters: "Pinaple Street Compani"
Expected results row:
company|pinapple street|..|...
companie|pinapple street|..|...
company|pinaple street|..|...
EDIT:
I forgot to clarify that many searches will run at the same time (around 100 concurrently), so it has to be fast. Also, the data is not in English, and the database is utf8 with support for non-English characters.
The misspelling problem is hard, if not impossible, to solve well in pure MySQL.
The multiple-column FULLTEXT search isn't so bad.
Your query will look something like this ...
SELECT column, column
FROM table
WHERE MATCH(Company, FirstName, LastName, whatever, whatever)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
It will produce a bunch of results, ordered by what MySQL guesses is the most likely hit first. MySQL's guesses aren't great, but they're usually adequate.
You'll need a FULLTEXT index matching the list of columns in your MATCH() clause. Creating that index looks like this.
ALTER TABLE book
ADD FULLTEXT INDEX Fulltext_search_index_1
(Company, FirstName, LastName, whatever, whatever);
Comments in your question notwithstanding, you just need an index for the group of columns which you will search.
20K rows won't be a big burden on any recent-vintage server hardware.
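Misspelling: You could try SOUNDEX(), but it's an early 20th century algorithm (patented by Robert C. Russell and Margaret King Odell around 1918) designed to index people's surnames as pronounced in American English. It's designed to get many false positive hits, and it really is dumber than a bucket of rocks. Given that your data is not in English, it is even less likely to help.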
If you really do need spell correction you may need to investigate Sphinx.
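If the vocabulary is small (20K rows is), one cheap alternative is to correct each search term in application code before it reaches MATCH() ... AGAINST(). The sketch below is only an illustration under assumptions: it uses Python's standard difflib, the `vocabulary` list is a hypothetical stand-in for the distinct words in your table, and the 0.8 cutoff is a guess you would need to tune.

```python
from difflib import get_close_matches

def correct_term(term, vocabulary, cutoff=0.8):
    """Return the closest known word to `term`, or `term` itself if nothing is close."""
    term = term.lower()
    matches = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else term

def correct_query(query, vocabulary):
    """Correct every whitespace-separated term of a search string."""
    return [correct_term(term, vocabulary) for term in query.split()]

# Hypothetical vocabulary; in practice, load the distinct words from the table.
vocabulary = ["company", "pinapple", "street"]
corrected = correct_query("Pinaple Street Compani", vocabulary)
print(corrected)
```

The corrected terms can then be joined and passed to the FULLTEXT query as a bound parameter. difflib compares Unicode strings, so non-English data is not a problem in itself.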
There are lots of questions about fulltext searches with MySQL, and I've read many of them without finding what I am looking for (on Google or Stack Overflow).
I am not looking to match rows (or documents) but I am looking to match words contained in the rows.
For ex, imagine you have a companies table, with an id, a name and a small_description column. You could find rows like :
1 | MyBaker | fine bakery since 1920
2 | Bakery factory | all the materials for a bakery
etc...
Now, when the user types "bak", I would like to suggest the word "bakery" (I do not want to suggest MyBaker and Bakery factory directly, since hundreds of companies will match but only a handful of different words).
I think the underlying MySQL fulltext engine already has some kind of "word lookup", so I'd like to use that instead of parsing name and small_description myself to build another table with word | nb_occurences
(not to mention that it may be hard to keep synchronized if lots of updates are made to the other table and the counters have to be decremented :( )
the reason behind this is to create an autocomplete search
where word suggestions will be correlated to the database content
For ex, amazon (.fr) is doing a pretty awful job. If you type "tel", it will suggest a dozen "telephone" matches and 0 "television" or "telescope" or "telemetry" ... !
while this is not really a problem in desktop where typing the full word is fast, for mobile it is really a problem
this is amplified by the fact that some words suggested by the smartphone keyboard are not in my database AND that some words of my database are never suggested by the smartphone keyboard.
For example, my database has 0 telephone and television but lots of telemetry and teleconference.
finally, I'd also like to forgive bad spelling if possible (ex : telme should match telemetry)
I hope someone can help me to leverage the existing fulltext index to achieve my goal
FULLTEXT search finds rows of data matching the word or words you present to it. As you know, it is not simply a word search.
You could, in your back-end program, take the results of your FULLTEXT search, break it up into words, and consider the most frequent of those words for autocompletion. This might work well if you modified your searches using WITH QUERY EXPANSION.
(Keep in mind that natural language FULLTEXT searches work strangely with small sets of data to search, so test with a table with many rows, not just a few.)
But, FULLTEXT does not handle stemming (chateau + chateaux - chat) correctly, nor does it offer to correct misspellings.
You could use Apache Lucene for your purpose, but it is a large and complex system.
I think you need the word / nb_appearances table, unpleasant as it is to maintain. It will give you the capability of doing
SELECT word
FROM words
WHERE word LIKE CONCAT(:input,'%')
ORDER BY nb_appearances DESC;
to get partial word matches. FULLTEXT cannot do that. You can also add a second lookup table to correct common misspellings in your application domain, for example, telmetry --> telemetry. It is a pain in the neck, of course.
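As a rough sketch of how the word / nb_appearances data could be built and queried, here is an in-memory version in Python. It is an illustration only: the function names are hypothetical, the `\w+` tokenizer is an assumption (real data may need smarter splitting), and in production the counts would live in the `words` table rather than in a Counter.

```python
import re
from collections import Counter

def build_word_counts(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

def suggest(prefix, counts, limit=5):
    """Return words starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    candidates = [(word, n) for word, n in counts.items() if word.startswith(prefix)]
    # Sort by descending count, then alphabetically for a stable order.
    candidates.sort(key=lambda pair: (-pair[1], pair[0]))
    return [word for word, _ in candidates[:limit]]

# The name and small_description values from the question.
texts = [
    "MyBaker", "fine bakery since 1920",
    "Bakery factory", "all the materials for a bakery",
]
counts = build_word_counts(texts)
print(suggest("bak", counts))
```

This mirrors the LIKE CONCAT(:input,'%') query above: "MyBaker" is tokenized as one word starting with "my", so typing "bak" suggests only "bakery".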
I have a column in my MySQL database that holds a set of keywords (specifically, the label data I'm getting from the Google Vision API). Is there an easy way to match and return similar records when another set of labels is given to the database?
In database: "Bike vehicle transport light floor"
What i'm giving as search parameters : "light bike car green"
Approach i've taken currently: use the "LIKE" keyword with wildcard. Is there a better way to do this?
Thanks
One solution, for which you'd need a stored procedure, is to create a table of "words".
word_id INT AUTO_INCREMENT
word VARCHAR(255)
Then split the field into words and add each one to the words table: if the word is new, insert it; if it already exists, look up its existing word_id. You then create a used_words table that links each record to the words it contains.
record_id *(current record ID)*
word_id INT()
CONSTRAINT record_id *current_table(current record id)*
CONSTRAINT word_id words(word_id)
Finally, to compare one list against another, check whether every word of the new entry also exists in the used_words rows of the existing record:
SELECT word_id FROM used_words
WHERE record_id = "$new_entry_id"
AND word_id NOT IN (
SELECT word_id FROM used_words
WHERE record_id = "$existing_id"
)
If the result is empty, then all the words exist in the existing record. Otherwise, you get the list of words that differ.
The approach should work, but it takes more than a single SQL query.
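The same comparison is easy to prototype in application code before committing to a schema. The following is a minimal sketch, not the stored procedure described above: it treats each label list as a set and ranks by Jaccard similarity (shared words divided by all distinct words), which is my own choice of score and only one of several reasonable ones.

```python
def label_overlap(stored, query):
    """Compare two space-separated label strings as sets of lowercased words."""
    stored_words = set(stored.lower().split())
    query_words = set(query.lower().split())
    shared = stored_words & query_words      # labels present in both
    missing = query_words - stored_words     # query labels the record lacks
    union = stored_words | query_words
    score = len(shared) / len(union) if union else 0.0
    return shared, missing, score

shared, missing, score = label_overlap(
    "Bike vehicle transport light floor",
    "light bike car green",
)
print(sorted(shared), sorted(missing), round(score, 3))
```

Computing this score for every candidate record and sorting by it gives a "most similar first" ordering; the `missing` set corresponds to the non-empty result of the NOT IN query.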
This isn't a "complete" answer, and I'm not expecting it to be accepted as such.
Your topic of question is "Information Retrieval", and there are several good books on the subject (although they'll cover a far wider scope than your specific question - so YMMV unless you're particularly interested in the subject).
I'd read up on normalisation. I'd start by decoupling those keywords into a joining table, well indexed.
Also take a look at the subject of stemming. It's not a silver bullet, but it's core to getting the right results. Some database engines can handle this for you; MySQL cannot (to my knowledge). I'd recommend looking at the Porter Stemmer for a good English example. There are libraries for every major language.
Finally consider synonyms. There's no easy way to handle these (in code); you'll need to build a database of them (better still, grab a free one from online). You'll use this to "increase" the supplied keyword list, using related words. ("Aeroplane" becomes "Aeroplane, Vehicle, Aircraft, Flying machine, Transport", etc).
The problem here is that i have multiple columns:
| artist | name | lyrics | content
I want to search these columns by multiple keywords. The problem is that I can't come up with a good algorithm using LIKE with OR/AND.
The best option so far is to search for each keyword in each column, but that way I get results that contain one keyword in name without containing the second keyword in artist.
I want everything combined with AND, but that only works if I'm searching a single column. Otherwise, to get a result, every column would have to contain all of the keywords...
Does anyone know what algorithm I need so that a search with 3 keywords (e.g. 1 for artist and 2 for name) finds the correct result?
The best solution is not to use MySQL for the search, but use a text-indexing tool like Apache Solr.
You could then run queries against Solr like:
name:"word" AND artist:"otherword"
It's pretty common to use Solr for indexing data even if you keep the original data in MySQL.
Not only would it give you the query flexibility you want, but it would typically run far faster than LIKE '%word%' queries on MySQL, because a pattern with a leading wildcard cannot use an ordinary index and forces a scan of every row.
Another alternative is to use MySQL's builtin fulltext indexing feature:
CREATE FULLTEXT INDEX myft ON mytable (artist, name, lyrics, content);
SELECT * FROM mytable
WHERE MATCH(artist, name, lyrics, content)
AGAINST ('+word +otherword' IN BOOLEAN MODE)
But it's not as flexible if you want to search for words that occur in specific columns, unless you create a separate index on each column.
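If the keywords come from user input, the boolean-mode search string has to be assembled somewhere. Here is a small sketch of one way to do it in Python; the helper name is hypothetical, and stripping everything except word characters is a deliberately blunt assumption to keep stray boolean operators (+, -, *, quotes) out of the expression.

```python
import re

def to_boolean_query(keywords):
    """Build a MySQL BOOLEAN MODE string that requires every keyword."""
    terms = []
    for keyword in keywords:
        # Keep only word characters so user input cannot inject operators.
        for word in re.findall(r"\w+", keyword):
            terms.append("+" + word)
    return " ".join(terms)

query_string = to_boolean_query(["word", "otherword"])
print(query_string)
```

The resulting string should still be passed to AGAINST (... IN BOOLEAN MODE) as a bound parameter rather than concatenated into the SQL text.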
AND works for displaying multiple rows too. It just depends on the rows you have in your table, which you haven't provided. PS: I'm sorry if my answer is not clear; I don't have the reputation to post it as a comment.
I am trying to search for a shop name in one of MySQL table, the table has a field called fullname. As of now I am using the SOUNDS LIKE method of MySQL however here's an example that failed:
Say I have the string Banana's Shop. Then using SOUNDS LIKE with query of 'nana' or 'bananas' won't give me the result. Here's my current query:
SELECT `fullName` FROM `shop` WHERE `fullName` SOUNDS LIKE 'nana';
Is there a better way to do a simple search like this in MySQL, one that is smarter so that typos would still match?
The ancient and slightly honorable SOUNDEX algorithm used by SOUNDS LIKE doesn't handle suffix sounds. That is, nana doesn't, and can't, match banana. banani will match banana, however.
Two utterances don't necessarily sound alike unless they have the same number of syllables. It's good for matching stuff like surnames: Smith, Schmitt, and Schmidt all have the same SOUNDEX value.
Calling SOUNDEX 'smart text search' is an exaggeration. http://en.wikipedia.org/wiki/Soundex
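To see why nana cannot match banana, it helps to compute the codes by hand. The sketch below implements the common four-character American Soundex variant in Python. Treat it as an illustration only: MySQL's built-in SOUNDEX() uses a slightly different variant and can return longer strings, so its codes may not be identical to these.

```python
# Map each consonant to its Soundex digit; vowels, H, W and Y have no code.
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit

def soundex(name):
    """Return the four-character American Soundex code (ASCII letters only)."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    result = letters[0]                       # the first letter is kept as-is
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue                          # H and W do not separate duplicates
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code                       # a vowel resets the duplicate check
    return (result + "000")[:4]               # pad or truncate to four characters

for name in ("Smith", "Schmitt", "Schmidt", "banana", "banani", "nana"):
    print(name, soundex(name))
```

Because the first letter is always kept verbatim, any two words that start with different letters get different codes, no matter how similar the rest sounds.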
You might consider MySQL FULLTEXT search, which you can look up. This does a certain amount of phrase matching. That is, if you had "banana shop" and "banana slug" in your column, the word "banana" would have a shot at matching both those values.
Be careful with FULLTEXT. It works counterintuitively when you have fewer than a couple of hundred rows in the table you're searching.
But that's not a typo-friendly word matcher. What you're asking isn't really easy.
You could consider the Levenshtein algorithm (which you can look up). But it's a hairball to get working properly.
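For reference, the Levenshtein distance itself is short to write in application code; the hard part is using it efficiently against a whole table. A minimal Python sketch using the standard two-row dynamic programming formulation, with a hypothetical `closest_name` helper that simply scans every candidate:

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))   # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match for free
            ))
        previous = current
    return previous[-1]

def closest_name(query, names):
    """Return the candidate with the smallest distance to the query (linear scan)."""
    return min(names, key=lambda name: levenshtein(query.lower(), name.lower()))

print(levenshtein("bananas", "banana"))
print(closest_name("Bananna's Shop", ["Banana's Shop", "Apple Store"]))
```

A linear scan like this is fine for a few thousand names but does not use any index, which is part of why doing the same thing inside MySQL is awkward.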
I have three tables, all of which have a column with a fulltext index. The user will enter search terms into a single text box, and then all three tables will be searched.
This is better explained with an example:
documents
doc_id
name FULLTEXT
table2
id
doc_id
a_field FULLTEXT
table3
id
doc_id
another_field FULLTEXT
(I realise this looks stupid but that's because I've removed all the other fields and tables to simplify it).
So basically I want to do a fulltext search on name, a_field and another_field, and then show the results as a list of documents, preferably with what caused that document to be found, e.g. if another_field matched, I would display what another_field is.
I began working on a system whereby three fulltext search queries are performed and the results inserted into a table with a structure like:
search_results
table_name
row_id
score
(This could later be made to cache results for a few days with e.g. a hash of the search terms).
This idea has two problems. The first is that the same document can be in the search results up to three times with different scores. Instead of that, if the search term is matched in two tables, it should have one result, but a higher score.
The second is that parsing the results is difficult. I want to display a list of documents, but I don't immediately know the doc_id without a join of some kind; however, the table to join to is dependent on the table_name column, and I'm not sure how to accomplish that.
Wanting to search multiple related tables like this must be a common thing, so I guess what I'm asking is am I approaching this in the right way? Can someone tell me the best way of doing it please.
I would create a denormalized single index. That is, put all three document types into a single table with fields for doc_id, doc_type and a single fulltext block. Then you can search all three document types at once.
You might also find that Lucene would make sense in this situation. It gives you faster searching, as well as much more functionality around how the searching and scoring works.
The downside is that you're keeping a separate denormalized copy of the text for each record. The upside is that searching is much faster.
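If you keep the three separate FULLTEXT queries instead, the duplicate-result problem from the question can be handled in application code, provided each query also selects doc_id. The sketch below is an assumption-laden illustration: the `(doc_id, source, score)` tuple shape and the function name are hypothetical, and summing the scores is just one simple way to make a document that matched in two tables rank higher.

```python
from collections import defaultdict

def merge_hits(hits):
    """Combine (doc_id, source, score) hits into one entry per document.

    Scores from different tables are summed, and the matching sources are
    kept so the UI can show what caused each document to be found.
    """
    merged = defaultdict(lambda: {"score": 0.0, "sources": []})
    for doc_id, source, score in hits:
        merged[doc_id]["score"] += score
        merged[doc_id]["sources"].append(source)
    # Highest combined score first.
    return sorted(merged.items(), key=lambda item: -item[1]["score"])

# Hypothetical rows returned by the three per-table searches.
hits = [
    (1, "name", 1.5),
    (2, "a_field", 2.0),
    (1, "another_field", 1.0),
]
results = merge_hits(hits)
for doc_id, info in results:
    print(doc_id, info["score"], info["sources"])
```

Note that relevance scores from different FULLTEXT indexes are not strictly comparable, so the summed score is a heuristic rather than a true ranking.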