Extra fulltext ordering criteria beyond default relevance - mysql

I'm implementing an ingredient text search, for adding ingredients to a recipe. I've currently got a full text index on the ingredient name, which is stored in a single text field, like so:
"Sauce, tomato, lite, Heinz"
I've found that because there are a lot of ingredients with very similar names in the database, simply sorting by relevance doesn't work that well a lot of the time. So, I've found myself sorting by a bunch of my own rules of thumb, which probably duplicates a lot of the full-text search algorithm which spits out a numerical relevance. For instance (abridged):
ORDER BY
[ingredient name is exactly search term],
[ingredient name starts with search term],
[ingredient name starts with any word from the search and contains all search terms in some order],
[ingredient name contains all search terms in some order],
...and so on. Each of these is defined in the SELECT specification as an expression returning either 1 or 0, and so I order by those in sequential order.
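To make the idea concrete, here is a minimal sketch of the same tiered ordering done in application code rather than in the ORDER BY clause. It is written in Python; the names `tokenize`, `rank_key` and `order_ingredients` are hypothetical, and the four flags only approximate the abridged rules above.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_key(name, search):
    """Return a tuple of 0/1 flags; a larger tuple means a better match."""
    name_l = name.lower().strip()
    search_l = search.lower().strip()
    name_words = tokenize(name)
    search_words = tokenize(search)
    contains_all = all(w in name_words for w in search_words)
    starts_with_any = bool(name_words) and name_words[0] in search_words
    return (
        int(name_l == search_l),                # name is exactly the search term
        int(name_l.startswith(search_l)),       # name starts with the search term
        int(starts_with_any and contains_all),  # starts with a search word, contains all
        int(contains_all),                      # contains all search words, any order
    )


def order_ingredients(names, search):
    """Sort names so that the highest-ranked tuples come first."""
    return sorted(names, key=lambda n: rank_key(n, search), reverse=True)
```

Tuples compare element by element, so this mirrors ordering by each 0/1 expression in sequence, and the rules live in one function that only needs the search term.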
I would love to hear suggestions for:
A better way to define complicated order-by criteria in one place, say perhaps in a view or stored procedure that you can pass just the search term to and get back a set of results without having to worry about how they're ordered?
A better tool for this than MySQL's fulltext engine -- perhaps if I was using Sphinx or something [which I've heard of but not used before], would I find some sort of complicated config option designed to solve problems like this?
Some google search terms which might turn up discussion on how to order text items within a specific domain like this? I haven't found much that's of use.
Thanks for reading!

Jeremy,
What you are looking for is Rank Boosting which is supported by Solr. Here is a link where you can read more about this:
http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms

Related

Autocomplete with MySQL Fulltext search that proposes words instead of results

There are lots of questions about fulltext searches with MySQL, and I've read many of them without finding what I am looking for (on Google or Stack Overflow).
I am not looking to match rows (or documents) but I am looking to match words contained in the rows.
For example, imagine you have a companies table with id, name and small_description columns. You could find rows like:
1 | MyBaker | fine bakery since 1920
2 | Bakery factory | all the materials for a bakery
etc...
Now, when the user types "bak", I would like to suggest the word "bakery". I do not want to suggest MyBaker and Bakery factory directly, since hundreds of companies will match but only a handful of different words.
I think the underlying MySQL fulltext engine already has some kind of word lookup, so I'd like to use that instead of parsing name and small_description myself to build another table of word | nb_occurences.
(Not to mention that it may be hard to keep synchronized if lots of updates are made to the other table, since the counters would have to be decremented :( )
The reason behind this is to create an autocomplete search where word suggestions are correlated to the database content.
For example, Amazon (.fr) does a pretty awful job. If you type "tel", it will suggest a dozen "telephone" matches and 0 "television" or "telescope" or "telemetry" ... !
While this is not really a problem on desktop, where typing the full word is fast, it really is a problem on mobile.
It is amplified by the fact that some words suggested by the smartphone keyboard are not in my database AND some words in my database are never suggested by the smartphone keyboard.
For example, my database has 0 telephone and television but lots of telemetry and teleconference.
Finally, I'd also like to forgive bad spelling if possible (for example, telme should match telemetry).
I hope someone can help me leverage the existing fulltext index to achieve my goal.
FULLTEXT search finds rows of data matching the word or words you present to it. As you know, it is not simply a word search.
You could, in your back-end program, take the results of your FULLTEXT search, break it up into words, and consider the most frequent of those words for autocompletion. This might work well if you modified your searches using WITH QUERY EXPANSION.
(Keep in mind that natural language FULLTEXT searches work strangely with small sets of data to search, so test with a table with many rows, not just a few.)
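As a rough sketch of that back-end step, the following Python function takes the text of the rows a FULLTEXT query returned, splits it into words, and proposes the most frequent ones that start with what the user typed. The function name `suggest_words` and the idea of passing the rows in as plain strings are assumptions for illustration only.

```python
import re
from collections import Counter


def suggest_words(rows, prefix, limit=5):
    """Suggest the most frequent words in `rows` that start with `prefix`."""
    prefix = prefix.lower()
    counts = Counter()
    for text in rows:
        # Split each row into lowercase alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word.startswith(prefix):
                counts[word] += 1
    # most_common() returns (word, count) pairs, highest count first.
    return [word for word, _ in counts.most_common(limit)]
```

With the two example rows from the question, typing "bak" would propose "bakery" rather than the company names.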
But, FULLTEXT does not handle stemming (chateau + chateaux - chat) correctly, nor does it offer to correct misspellings.
You could use Apache Lucene for your purpose, but it is a large and complex system.
I think you need the word / nb_appearances table, unpleasant as it is to maintain. It will give you the capability of doing
SELECT word
FROM words
WHERE word LIKE CONCAT(:input,'%')
ORDER BY nb_appearances DESC;
to get partial word matches. FULLTEXT cannot do that. You can also add a second lookup table to correct common misspellings in your application domain, for example, telmetry --> telemetry. It is a pain in the neck, of course.
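A minimal sketch of how the two lookup tables could work together, with in-memory dictionaries standing in for them. The names `complete`, `word_counts` and `corrections` are hypothetical; in practice both would be database tables queried as shown above.

```python
def complete(user_input, word_counts, corrections, limit=5):
    """Return up to `limit` known words that start with the (corrected) input.

    word_counts: dict mapping word -> number of appearances
    corrections: dict mapping a common misspelling -> the intended word
    """
    term = user_input.lower()
    # Replace a known misspelling before doing the prefix lookup.
    term = corrections.get(term, term)
    matches = [w for w in word_counts if w.startswith(term)]
    # Most frequent first; ties broken alphabetically for stable output.
    matches.sort(key=lambda w: (-word_counts[w], w))
    return matches[:limit]
```

The corrections table only catches misspellings someone has listed in advance, which is why it is practical mainly for a narrow application domain.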

How to do keyword query in MySQL with big data amount?

Here is the scenario:
I want to run keyword queries in MySQL against a large amount of data, on the order of 10 million rows.
A row matches if the keyword is a substring of the specified field.
For example, if the field contains "A BC DEF" and the keyword is "BC", it matches. It's that simple, but I want it to be as quick as possible. Because this will be used in a website's search module (with relatively high concurrency), I don't want users to wait a long time.
Could anyone give me an idea? Thanks a lot!
P.S. I've read about fulltext in MySQL, as well as search engines like Lucene and Sphinx. Which one is better and more appropriate here? My web project is based on Java EE. Thanks!
Consider using MySQL Full-Text Search Functions
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
Then you can use a SQL Query like this:
SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('BC');
Note that FULLTEXT matches whole words, not arbitrary substrings, and with the default minimum word length (ft_min_word_len = 4 for MyISAM, innodb_ft_min_token_size = 3 for InnoDB) a two-character keyword such as 'BC' is not indexed at all.

how to perform MySQL smart text search in a column?

I am trying to search for a shop name in one of my MySQL tables, which has a field called fullName. At the moment I am using MySQL's SOUNDS LIKE, but here's an example that failed:
Say I have the string Banana's Shop. Using SOUNDS LIKE with a query of 'nana' or 'bananas' won't give me that result. Here's my current query:
SELECT `fullName` FROM `shop` WHERE `fullName` SOUNDS LIKE 'nana';
Is there a better, smarter way to do a simple search like this in MySQL, so that typos would also still match?
The ancient and slightly honorable SOUNDEX algorithm used by SOUNDS LIKE doesn't handle suffix sounds. That is, nana doesn't, and can't, match banana. banani will match banana, however.
Two utterances don't necessarily sound alike unless they have the same number of syllables. SOUNDEX is good for matching things like surnames: Smith, Schmitt, and Schmidt all have the same SOUNDEX value.
Calling SOUNDEX 'smart text search' is an exaggeration. http://en.wikipedia.org/wiki/Soundex
You might consider MySQL FULLTEXT search, which you can look up. This does a certain amount of phrase matching. That is, if you had "banana shop" and "banana slug" in your column, the word "banana" would have a shot at matching both those values.
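Be careful with FULLTEXT. It works counterintuitively when the table you're searching has fewer than a couple of hundred rows: in natural-language mode on MyISAM, a word that appears in at least half of the rows is treated as a stopword, so small tables often return nothing.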
But that's not a typo-friendly word matcher. What you're asking isn't really easy.
You could consider the Levenshtein algorithm (which you can look up). But it's a hairball to get working properly.
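For reference, a minimal sketch of the Levenshtein distance in Python, applied per word in application code rather than inside MySQL. The helper `closest_names` and the `max_distance` threshold of 2 are assumptions for illustration, and scanning every name this way will not scale to a large table.

```python
import re


def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def closest_names(query, names, max_distance=2):
    """Return names containing a word within max_distance edits of query."""
    query = query.lower()
    hits = []
    for name in names:
        words = re.findall(r"[a-z0-9]+", name.lower())
        if not words:
            continue
        best = min(levenshtein(query, w) for w in words)
        if best <= max_distance:
            hits.append((best, name))
    hits.sort()  # closest match first
    return [name for _, name in hits]
```

Comparing against each word of the name, instead of the whole string, is what lets a query like 'bananas' land on Banana's Shop.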

Concise FULLTEXT Search

I've been trying to find some help on using MySQL's FULLTEXT search. I realise that this has been discussed to death, but I can't quite understand how to get a concise set of results.
I have a MyISAM table of, say, 500,000 products with a FULLTEXT index set up on the "product_name" column.
A basic query would be:
SELECT *, MATCH(product_name) AGAINST ("coffee table") AS relevance
FROM products
WHERE MATCH(product_name) AGAINST ("coffee table");
I got a list of a few hundred products that relate to either coffee or tables. This wasn't specific enough and meant that useful results were cluttered with too many other items.
I altered my query to use MATCH to give a relevance to each result, and then used LIKE to perform the actual query.
SELECT *, MATCH(product_name) AGAINST ("coffee table") AS relevance
FROM products
WHERE ((product_name LIKE "%coffee%" AND product_name LIKE "%table%") OR product_name LIKE "%coffee table%");
I got this idea from seeing how WordPress performs a search. It worked well until someone searched with more specific keywords. A real-world example was a search for "Nike blazer low premium vintage". In this case there were no results (whereas the first method, using MATCH, returned hundreds).
I know I can use IN BOOLEAN MODE, but many users won't know to use the +/- operators to alter their query. I have yet to work out how I should use the HAVING clause to limit results.
Also, because this is shared hosting, I am unable to alter the default minimum word length, which means missing keywords like the colour "red" or the brand name "GAP", for example.
I have read a little into creating a keyword index table, but have not found suitable references for this.
Can someone please offer a solution where I can take a product search term (as entered by Joe Public) and get a concise set of results? Thanks
I have done more research and, as many people have said, MySQL FULLTEXT is not a good solution for "human-like" searching. One example is how it handles word plurals (car / cars). I looked at Apache Lucene, but it's beyond my ability to set up and configure.
For the moment, the "solution" has been to stick with IN BOOLEAN MODE (as Mathieu also suggested).
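Since users won't type the +/- operators themselves, one option is to add them in the back end before the query reaches MySQL. Below is a minimal Python sketch under stated assumptions: the function name `to_boolean_query` is hypothetical, `min_word_len=4` mirrors the MyISAM default for `ft_min_word_len`, and setting aside shorter words (so they could be checked separately, for instance with LIKE) is one possible choice rather than an established recipe.

```python
import re


def to_boolean_query(user_input, min_word_len=4):
    """Turn free text into a boolean-mode string requiring every indexed word.

    Returns (boolean_query, short_words). Words shorter than min_word_len are
    not put in the boolean string, because the index would not contain them.
    """
    # Keep only letters and digits so user-typed operators are stripped.
    words = re.findall(r"[A-Za-z0-9]+", user_input)
    required = [w for w in words if len(w) >= min_word_len]
    short_words = [w for w in words if len(w) < min_word_len]
    boolean_query = " ".join("+" + w for w in required)
    return boolean_query, short_words
```

The returned string would be passed as a bound parameter to AGAINST (... IN BOOLEAN MODE), never concatenated into the SQL text.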

How can I make MySQL Fulltext indexing ignore url strings, particularly the extension

I'm indexing strings containing URLs with MySQL FULLTEXT, but I don't want the URLs to count towards the results.
For example, I search for "PHP" or "HTML" and I get records like "Ibiza Angels Massage Company see funandfrolicks.php"... a hedonistic distraction at best.
I can't find any examples of adding regular expressions to the stopword list.
The other thing I thought of (and failed at) was decreasing the word's contribution in the fulltext SQL. However, in the following SQL, the relevance value did not change.
SELECT title, content,match(title,content) against('+PHP >".php"' IN BOOLEAN MODE)
FROM tb_feed
WHERE match(title,content) against('PHP >".php"' IN BOOLEAN MODE)
ORDER BY published DESC LIMIT 10;
An alternative is a messy SQL statement with the additional condition ...
WHERE ... IF(content REGEXP '\\.php', content REGEXP '(^| )php', 1) ...
Thoughts? What's the best solution?
If the number of results is bearable, you could simply not display the matches that contain the words you want to ignore, such as .php or .html. This is very quick to kludge, but it will use more memory than you need.
Another solution is to create another field holding the keywords you want to search on, omitting URLs and any other undesired keywords. This takes little time to write but takes up extra disk space.
The better solution is to create another table called keyword (or similar). When a user submits a search query, the keyword table is searched for the specified keywords. The keyword table is populated by splitting the input data when the content is uploaded or retrieved.
This last option has the advantage of potentially being fast, and the data is compact, because each keyword is stored only once with an index pointing back to the main content record. It also allows cleverer searches if you so desire.
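A minimal sketch of the splitting step that would populate such a keyword field or table, written in Python. The function name `extract_keywords` and the rule used to spot URLs and file names (a "://" or a dot between two word characters) are assumptions; that rule also drops tokens such as version numbers, which may or may not be acceptable.

```python
import re


def extract_keywords(content):
    """Return the set of lowercase keywords in content, skipping URL-like tokens."""
    keywords = set()
    for token in content.split():
        # Remove punctuation around the token, e.g. a sentence-ending full stop.
        token = token.strip(".,;:!?\"'()[]")
        if "://" in token or re.search(r"\w\.\w", token):
            continue  # looks like a URL or a file name such as page.php
        keywords.update(re.findall(r"[a-z0-9]+", token.lower()))
    return keywords
```

Each returned keyword would then be stored once in the keyword table with a reference back to the content record.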
If you want php/html not part of the URL, one simple way is to try
like "% php %"
like "% html %"
That way, php/html must be a word in the sentence.