Concise FULLTEXT Search - mysql

I've been trying to find some help on using MySQL's FULLTEXT search. I realise that this has been discussed to death, but I can't quite understand how to get a concise set of results.
I have a MyISAM table of, say, 500,000 products with a FULLTEXT index set up on the "product_name" column.
A basic query would be:
SELECT *, MATCH(product_name) AGAINST ("coffee table") AS relevance
FROM products
WHERE MATCH(product_name) AGAINST ("coffee table")
I got a list of a few hundred products that relate to either coffee or tables. This wasn't specific enough and meant that useful results were cluttered with too many other items.
I altered my query to use MATCH to give a relevance to each result, and then used LIKE to perform the actual query.
SELECT *, MATCH(product_name) AGAINST ("coffee table") AS relevance
FROM products
WHERE ((product_name LIKE "%coffee%" AND product_name LIKE "%table%") OR product_name LIKE "%coffee table%")
I got this idea from seeing how WordPress performs a search. It worked well until someone searched with more specific keywords. A real-world example was a search for "Nike blazer low premium vintage". In this case there were no results (whereas the first method using MATCH returned hundreds).
I know I can use IN BOOLEAN MODE, but many users won't know to use the +/- operators to alter their query. I'm yet to work out how I should use the HAVING clause to limit results.
Also, due to this being shared hosting, I am unable to alter the default min word length - which means missing keywords like the colour "red" or the brand-name "GAP" for example.
I have read a little into creating a keyword index table, but have not found suitable references for this.
Can someone please offer a solution where I can use a product search term (as entered by Joe Public) that will give a concise set of results. Thanks
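For readers wondering what a keyword index table might look like, here is a minimal in-memory sketch in Python. It is only an illustration of the idea (one entry per word, pointing at the products that contain it), not a tuned implementation; the function names and sample products are made up. Because the word list is built by the application, short words such as "red" or "GAP" are kept regardless of the server's minimum word length.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter, digit or underscore.
    return re.findall(r"\w+", text.lower())

def build_keyword_index(products):
    """Map each word to the set of product ids whose name contains it."""
    index = defaultdict(set)
    for product_id, name in products.items():
        for word in tokenize(name):
            index[word].add(product_id)
    return index

def search_all_terms(index, query):
    """Return ids of products containing every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample data, for illustration only.
products = {1: "Red coffee table", 2: "GAP red hoodie", 3: "Coffee maker"}
index = build_keyword_index(products)
```

In a database the same structure would be a table of (word, product_id) rows with an index on the word column; requiring all terms keeps the result set concise.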

I have done more research and, as many people have said, FULLTEXT is not a good solution for "human"-like searching. One example is how it handles word plurals (car / cars). I looked at Apache Lucene, but it's beyond my ability to set up and configure.
For the moment, the "solution" has been to stick with IN BOOLEAN MODE (as Mathieu also suggested).
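Since most users won't type the +/- operators themselves, one option is to build the boolean-mode string in application code. A minimal sketch in Python, assuming every word should be required and that words shorter than the server's minimum word length (4 by default on MyISAM) are simply dropped; the function name and the optional trailing * are illustrative choices:

```python
import re

def to_boolean_query(user_input, min_len=4, prefix=False):
    """Turn free text into a string for AGAINST (... IN BOOLEAN MODE)."""
    # \w+ keeps word characters only, which also strips operators
    # such as + - * " ( ) that a user may have typed by accident.
    words = re.findall(r"\w+", user_input.lower())
    terms = []
    for word in words:
        if len(word) < min_len:
            continue  # assumption: too short to be in the FULLTEXT index
        # "+" marks the word as required; "*" is the truncation operator.
        terms.append("+" + word + ("*" if prefix else ""))
    return " ".join(terms)

print(to_boolean_query("Nike blazer low premium vintage"))
```

The resulting string would be passed as a bound parameter to AGAINST (? IN BOOLEAN MODE). With prefix=True a term such as "car" becomes "+car*", which also matches "cars", at the cost of matching other words with the same prefix.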

Related

How to search inside a SQL table for a phrase

I am currently using MySQL but I am willing to migrate if necessary to any solution suggested.
I am looking for an easy way to implement a search on a table.
The table has multiple entries with data similar to what will be found on user accounts, like names, addresses, phone numbers and a text column that contains comments of arbitrary length.
I want to build a search that goes over all rows and columns and finds the best matching row. Slight misspellings should be corrected (not very important). Most important is the ability to search across all columns at once.
Table can have as many as 20,000 rows.
Search parameter will be for example: "Company First Name"
Expected results:
company|Contact First Name|Address|...|...
example 2, slightly misspelled search parameters : "Pinaple Street Compani"
Expected results row:
company|pinapple street|..|...
companie|pinapple street|..|...
company|pinaple street|..|...
EDIT:
Forgot to clarify that multiple searches will be done at the same time, so it has to be fast (around 100 searches at the same time). Also, the language of the data is not English, and the database is utf8 with support for non-English characters.
The misspelling problem is hard, if not impossible, to solve well in pure MySQL.
The multiple-column FULLTEXT search isn't so bad.
Your query will look something like this ...
SELECT column, column
FROM table
WHERE MATCH(Company, FirstName, LastName, whatever, whatever)
AGAINST('search terms' IN NATURAL LANGUAGE MODE)
It will produce a bunch of results, ordered by what MySQL guesses is the most likely hit first. MySQL's guesses aren't great, but they're usually adequate.
You'll need a FULLTEXT index matching the list of columns in your MATCH() clause. Creating that index looks like this.
ALTER TABLE book
ADD FULLTEXT INDEX Fulltext_search_index_1
(Company, FirstName, LastName, whatever, whatever);
Comments in your question notwithstanding, you just need an index for the group of columns which you will search.
20K rows won't be a big burden on any recent-vintage server hardware.
Misspelling: You could try SOUNDEX(), but it's an early 20th century algorithm (patented by Robert Russell and Margaret Odell) designed to index people's names by sound in American English. It's designed to get many false positive hits, and it really is dumber than a bucket of rocks.
If you really do need spell correction you may need to investigate Sphinx.
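If only light spell correction is needed, one alternative is to correct the search terms in application code before they reach MySQL. The sketch below uses Python's standard difflib module. It assumes you can build a vocabulary of the distinct words that occur in the searchable columns; the function name, the sample vocabulary and the 0.8 cutoff are illustrative.

```python
import difflib

def correct_terms(query, vocabulary, cutoff=0.8):
    """Replace each query word with its closest vocabulary word, if any."""
    corrected = []
    for word in query.lower().split():
        # get_close_matches returns the best matches first; n=1 keeps one.
        close = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
        corrected.append(close[0] if close else word)
    return corrected

# Hypothetical vocabulary taken from the indexed columns.
vocabulary = ["company", "pinapple", "street", "contact"]
print(correct_terms("Pinaple Street Compani", vocabulary))
```

The corrected words can then be joined and passed to the MATCH ... AGAINST query. difflib compares Unicode strings, so non-English data is not a problem in itself, but scanning the whole vocabulary for every word may be too slow for a large word list under 100 concurrent searches.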

Is it possible to get "ideal" full-text relevance for two constant(same) samples?

Full-text MATCH gives a relative relevance for all records in an indexed table. However, I make the decision based on a similarity level (let's say <70% is insufficient to consider it as a match) between tested sample and constant sample (which I compare against).
Previously I used Levenshtein Distance to get percentage coefficient of how much two samples are similar. But this method showed itself as incredibly inefficient for my dataset.
What I'd like to do is to get a relevance coefficient for sample matched to itself to consider it as 100% relevance
I tried queries like:
SELECT
samples.`name`,
MATCH(samples.`name`)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE),
MATCH (perfectSample.sample)
AGAINST ('Constant sample' IN NATURAL LANGUAGE MODE)
FROM
samples,
(SELECT 'Constant sample' as sample) as perfectSample
But a derived table (a subquery in FROM) does not support full-text MATCH. (My idea was that, since a MyISAM table does not strictly need a FULLTEXT index, it might be possible to achieve it this way.)
So the actual question is: Is it possible to obtain FULLTEXT relevance for 2 constant values?
OK, so here is what I managed to do. Maybe someone will find it useful.
First of all, the samples should be inserted into an InnoDB (important) table that has a FULLTEXT index on the field that has to be MATCHed.
After this, it is necessary to fetch all the values (samples) that will be compared against.
SELECT * FROM samples
Next, these fetched values need to be MATCHed against themselves. It is better to add a WHERE clause so that a value is not matched against anything else.
SELECT
samples.value,
MATCH (samples.value) AGAINST (:fetchedVal)
FROM samples
WHERE samples.value = :fetchedVal
This will give the relevance of each sample AGAINST itself.
Note: It is important to use InnoDB, because a MyISAM MATCH against only one row will produce a result that is not useful. For example, the same query can produce a relevance value of 40.1511 on InnoDB and 3 on MyISAM.
This is due to the way word uniqueness is calculated. You can read more about this here
And that's it. The second query will give what is (in my opinion) 100% relevance, which can be used to determine the similarity level between this sample and others.
It is a bit dirty, but it's the only option that worked for me. And since no one suggested anything better, I will keep this as the answer until a better solution is found.
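To make the last step concrete: once the self-match score is known, any other score can be expressed as a percentage of it and compared with the 70% threshold from the question. A small Python sketch of that arithmetic; the function names are made up, and clamping at 100 is my own assumption in case another row happens to score higher than the self-match.

```python
def similarity_percent(score, self_score):
    """Express a MATCH score as a percentage of the sample's self-match score."""
    if self_score <= 0:
        return 0.0  # no usable baseline
    # Clamp, since nothing guarantees other scores stay below the baseline.
    return min(100.0, 100.0 * (score / self_score))

def is_match(score, self_score, threshold=70.0):
    return similarity_percent(score, self_score) >= threshold
```

For example, with a self-match score of 40.0, a candidate scoring 30.0 comes out at 75% and passes, while one scoring 20.0 comes out at 50% and is rejected.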

Building search engine for large database

I'm building a fairly large database where I will have a lot of tables with various data.
But each table has similar fields, for example video title or track title.
Now the problem I'm facing is how to build a query which would look for a keyword match across five or more tables. Keep in mind that each table can potentially have from 100k to 1 million rows, or in some cases even a couple of million rows.
I think using joins or separate queries for each table would be very slow, so what I thought of is to make one separate table where I would store search data.
For example I think it could have fields like these,
id ---- username ---- title ---- body ---- date ---- belongs_to ---- post_id
This way I think it would perform a lot faster searches, or am I totally wrong?
The only problem with this approach that I can think of it is that it would be hard to manage this table because if original record from some of the tables is deleted I would also need to delete record from 'search' table as well.
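One possible way to ease that maintenance burden is to let the database keep the search table in sync with triggers. The sketch below demonstrates the idea with Python's built-in sqlite3 module so that it runs self-contained; MySQL trigger syntax differs slightly, and the table, column and trigger names are only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE videos (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE search (
    id INTEGER PRIMARY KEY,
    title TEXT,
    belongs_to TEXT,   -- name of the source table
    post_id INTEGER    -- id of the row in the source table
);
-- Copy every new video into the search table.
CREATE TRIGGER videos_after_insert AFTER INSERT ON videos BEGIN
    INSERT INTO search (title, belongs_to, post_id)
    VALUES (NEW.title, 'videos', NEW.id);
END;
-- Remove the search row when the original is deleted.
CREATE TRIGGER videos_after_delete AFTER DELETE ON videos BEGIN
    DELETE FROM search WHERE belongs_to = 'videos' AND post_id = OLD.id;
END;
""")

conn.execute("INSERT INTO videos (title) VALUES ('Coffee table build')")
conn.execute("INSERT INTO videos (title) VALUES ('Guitar lesson')")
conn.execute("DELETE FROM videos WHERE id = 1")
remaining = conn.execute(
    "SELECT title, belongs_to, post_id FROM search"
).fetchall()
print(remaining)
```

Each source table would need its own pair of triggers (plus one for updates), which is the price of keeping a single searchable table.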
Don't use MySQL to search across lots of joined tables. I would suggest taking a look at Apache Solr, used alongside your RDBMS.
Take a look at some information retrieval systems. They also require their own indices, so you need to index the data after each update (or in regular intervals) to keep the search index up to date. But they offer the following advantages:
much faster, because they use special algorithms and data structures designed for specifically that purpose
ability to search for documents based on a set of terms (and maybe also a set of negative terms that must not appear in the result)
search for phrases (i.e. terms that appear after each other in a specific order)
automatic stemming (i.e. stripping the endings of words like "s", "ed", "ing" ...)
detection of spelling mistakes (i.e. "Did you mean ...?")
stopwords to avoid indexing really common meaningless words ("a", "the", etc.)
wildcard queries
advanced ranking strategies (e.g. rank by relevance, based on the number and the position of the occurrences of the search terms)
I have used Xapian in the past for my projects and I was quite happy with it. Lucene, Solr and Elasticsearch are some other really popular projects that might fit your needs.

Sphinx vs. MySql - Search through list of friends (efficiency/speed)

I'm porting my application searches over to Sphinx from MySQL and am having a hard time figuring this one out, or if it even needs to be ported at all (I really want to know if it's worth using sphinx for this specific case for efficiency/speed):
users
uid uname
1 alex
2 barry
3 david
friends
uid | fid
1 2
2 1
1 3
3 1
Details are:
- InnoDB
- users: index on uid, index on uname
- friends: combined index on uid,fid
Normally, to search all of alex's friends with mysql:
$uid = 1;
$searchstr = "%$friendSearch%";
$query = "SELECT f.fid, u.uname FROM friends f
JOIN users u ON f.fid=u.uid
WHERE f.uid=:uid AND u.uname LIKE :friendSearch";
$friends = $dbh->prepare($query);
$friends->bindParam(':uid', $uid, PDO::PARAM_INT);
$friends->bindParam(':friendSearch', $searchstr, PDO::PARAM_STR);
$friends->execute();
Is it any more efficient to find alex's friends with sphinx vs mysql or would that be an overkill? If sphinx would be faster for this as the list hits thousands of people,
what would the indexing query look like? How would I delete a friendship that no longer exists with sphinx as well, can I have a detailed example in this case? Should I change this query to use Sphinx?
Ok this is how I see this working.
I have the exact same problem with MongoDB. MongoDB "offers" searching capabilities but just like MySQL you should never use them unless you wanna be choked with IO, CPU and memory problems and be forced to use a lot more servers to cope with your index than you normally would.
The whole idea of using Sphinx (or another search tech) is to lower cost per server by having a performant index searcher.
Sphinx, however, is not a storage engine. It is not as simple to query exact relationships across tables. They have remedied this a little with SphinxQL, but due to the nature of the full-text index it still doesn't do an integral join like you would get in MySQL.
Instead I would store the relationships within MySQL but have an index of "users" within Sphinx.
In my website I personally have 2 indexes:
main (houses users,videos,channels and playlists)
help (help system search)
These are delta updated once every minute. Since realtime indexes are still a bit experimental at times, and I personally have seen problems with high insertion/deletion rates, I keep to delta updates. So I would use a delta index to update the main searchable objects of my site, since this is less resource intensive and more performant than realtime indexes (from my own tests).
Do note that in order to process deletions and the like in your Sphinx collection through a delta index, you will need a kill-list and certain filters for your delta index. Here is an example from my index:
source main_delta : main
{
sql_query_pre = SET NAMES utf8
sql_query_pre =
sql_query = \
SELECT id, deleted, _id, uid, listing, title, description, category, tags, author_name, duration, rating, views, type, adult, videos, UNIX_TIMESTAMP(date_uploaded) AS date_uploaded \
FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) OR update_time >( SELECT last_index_time FROM sph_counter WHERE counter_id=1 )
sql_query_killlist = SELECT id FROM documents WHERE update_time>=( SELECT last_index_time FROM sph_counter WHERE counter_id=1 ) OR deleted = 1
}
This processes deletions and additions once every minute which is pretty much realtime for a real web app.
So now we know how to store our indexes. I need to talk about the relationships. Sphinx (even though it has SphinxQL) won't do integral joins across data, so I would personally recommend handling the relationship outside of Sphinx. Not only that, but as I said, this relationship table will get high load, which could impact the Sphinx index.
I would do a query to pick out all ids and using that set of ids use the "filter" method on the sphinx API to filter the main index down to specific document ids. Once this is done you can search in Sphinx as normal. This is the most performant method I have found to date of dealing with this.
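As a rough illustration of that two-step approach, the sketch below builds the second query in Python. Note that it swaps the API "filter" call for the equivalent SphinxQL form (MATCH plus an id IN list), and it assumes a MySQL-protocol client that uses %s placeholders (PyMySQL-style); the index name users and the function name are hypothetical.

```python
def build_friend_search(friend_ids, search_text, index_name="users"):
    """Build a SphinxQL query restricted to the given document ids.

    friend_ids would come from MySQL, e.g.
    SELECT fid FROM friends WHERE uid = :uid
    """
    ids = [int(i) for i in friend_ids]  # force integers before inlining
    if not ids:
        return None  # no friends, nothing to search
    id_list = ", ".join(str(i) for i in ids)
    sql = "SELECT id FROM {} WHERE MATCH(%s) AND id IN ({})".format(
        index_name, id_list
    )
    # The search text stays a bound parameter rather than being inlined.
    return sql, (search_text,)

print(build_friend_search([2, 3], "barry"))
```

The ids returned by Sphinx are then used to fetch the full user rows from MySQL.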
The key thing to remember at all times is that Sphinx is a search tech while MySQL is a storage tech. Keep that in mind and you should be ok.
Edit
As @N.B said (which I overlooked in my answer), Sphinx does have SphinxSE. Although primitive and still in a sort of testing stage of its development (same as realtime indexes), it does provide a MyISAM/InnoDB-style storage engine interface to Sphinx from within MySQL. This is awesome. However, there are caveats (as with anything):
The language is primitive
The joins are primitive
However, it can/could do the job you're looking for, so be sure to look into it.
So I'm going to go ahead and kinda outline what -I- feel the best use cases for Sphinx are, and you can kinda decide if it's more or less in line with what you're looking to do.
If all you're looking to do is a string search on one field, then with MySQL you can do wildcard searches without much trouble, and honestly, with an index on it, unless you're expecting millions of rows you are going to be fine.
Now take Facebook, which is indexing not only names but pages etc., or even any advanced search fields. Sphinx can take in x columns from MySQL, PostgreSQL, MongoDB (insert the db you want here) and create a searchable full-text index across all of those.
Example:
You have 5 fields (house number, street, city, state, zipcode) and you want to do a full text search across all of those. Now with MySQL you could do searches on every single one, however with sphinx you can glob them all together then sphinx does some awesome statistical findings based on the string you've passed in and the matches which are resulting from it.
This Link: PHP Sphinx Searching does a great job at walking you through what it would look like and how things work together.
So you aren't really replacing a database; you're just adding a special daemon to it (sphinx) which allows you to create specialized indexes and run your full text searches against it.
No index can help you with this query, since you're looking for the string as an infix, not a prefix (you're looking for '%friendname%', not 'friendname%').
Moreover, the LIKE solution will get you into corners: suppose you were looking for a friend called Ann. The LIKE expression will also match Marianne, Danny etc. There's no "complete word" notion in a LIKE expression.
A real solution is to use a text index. A FULLTEXT index is only available on MyISAM, and MySQL 5.6 (not GA at this time) will introduce FULLTEXT on InnoDB.
Otherwise you can indeed use Sphinx to search the text.
With just hundreds or thousands, you will probably not see a big difference, unless you're really going to do many searches per second. With larger numbers, you will eventually realize that a full table scan is inferior to Sphinx search.
I'm using Sphinx a lot, on dozens and sometimes hundreds of millions of large texts, and can testify it works like a charm.
The problem with Sphinx is, of course, that it's an external tool. With Sphinx you have to tell it to read data from your database. You can do so (using crontab for example) every 5 minutes, every hour, etc. So if rows are DELETEd, they will only be removed from sphinx the next time it reads the data from table. If you can live with that - that's the simplest solution.
If you can't, there are real time indexes in Sphinx, so you may directly instruct it to remove certain rows. I am unable to explain everything in this post, so here are a couple of links for you:
Index updates
Real time indexes
As a final conclusion, you have three options:
Risk it and use a full table scan, assuming you won't have high load.
Wait for MySQL 5.6 and use FULLTEXT with InnoDB.
Use sphinx
At this point in time, I would certainly use option #3: use sphinx.
Take a look at the solution I propose here:
https://stackoverflow.com/a/22531268/543814
Your friend names are probably short, and your query looks simple enough. You can probably afford to store all suffixes, perhaps in a separate table, pointing back to the original table to get the full name.
This would give you fast infix search at the cost of a little bit more storage space.
Furthermore, to avoid finding 'Marianne' when searching for 'Ann', consider:
Using case-sensitive search. (Fragile; may break if your users enter their names or their search queries with incorrect capitalization.)
After the query, filtering your search results further, requiring word boundaries around the search term (e.g. regex \bAnn\b).
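Both ideas can be sketched in a few lines of Python. This is an illustration under my own naming, not code from the linked answer: suffixes() produces the rows you would store in the separate suffix table, and whole_word_matches() is the post-filter with word boundaries.

```python
import re

def suffixes(name):
    """All suffixes of a name, e.g. for rows in a separate suffix table."""
    name = name.lower()
    return [name[i:] for i in range(len(name))]

def whole_word_matches(term, candidates):
    """Keep only candidates that contain the term as a complete word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [c for c in candidates if pattern.search(c)]

print(suffixes("barry"))
print(whole_word_matches("Ann", ["Ann Lee", "Marianne", "Danny"]))
```

An infix search for "arr" then becomes a prefix search on the suffix table (suffix LIKE 'arr%'), which an ordinary index can serve. The regex filter avoids the capitalization fragility of a case-sensitive search because it ignores case but still requires word boundaries.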

Extra fulltext ordering criteria beyond default relevance

I'm implementing an ingredient text search, for adding ingredients to a recipe. I've currently got a full text index on the ingredient name, which is stored in a single text field, like so:
"Sauce, tomato, lite, Heinz"
I've found that because there are a lot of ingredients with very similar names in the database, simply sorting by relevance doesn't work that well a lot of the time. So, I've found myself sorting by a bunch of my own rules of thumb, which probably duplicates a lot of the full-text search algorithm which spits out a numerical relevance. For instance (abridged):
ORDER BY
[ingredient name is exactly search term],
[ingredient name starts with search term],
[ingredient name starts with any word from the search and contains all search terms in some order],
[ingredient name contains all search terms in some order],
...and so on. Each of these is defined in the SELECT specification as an expression returning either 1 or 0, and so I order by those in sequential order.
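To show what I mean by ordering on a sequence of 1/0 rules, here is the same idea as a small Python sketch, where each rule becomes one element of a sort key. The rule set mirrors the abridged list above; the function names and the sample ingredient names are only for illustration.

```python
import re

def words(text):
    return re.findall(r"\w+", text.lower())

def ranking_key(name, query):
    """Tuple of booleans, most important rule first."""
    n, q = name.lower(), query.lower().strip()
    name_words, query_words = words(name), words(query)
    contains_all = all(w in name_words for w in query_words)
    starts_with_any = bool(name_words) and name_words[0] in query_words
    return (
        n == q,                            # exactly the search term
        n.startswith(q),                   # starts with the search term
        starts_with_any and contains_all,  # starts with a search word, has all
        contains_all,                      # has all search words, any order
    )

def rank(names, query):
    # True sorts after False, so reverse to put the best matches first.
    return sorted(names, key=lambda name: ranking_key(name, query), reverse=True)

names = ["Sauce, tomato, lite, Heinz", "Tomato sauce", "Tomato", "Soup, tomato"]
print(rank(names, "tomato"))
```

Keeping the rules in one function like this is the kind of "one place" I am after on the SQL side as well.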
I would love to hear suggestions for:
A better way to define complicated order-by criteria in one place, say perhaps in a view or stored procedure that you can pass just the search term to and get back a set of results without having to worry about how they're ordered?
A better tool for this than MySQL's fulltext engine -- perhaps if I was using Sphinx or something [which I've heard of but not used before], would I find some sort of complicated config option designed to solve problems like this?
Some google search terms which might turn up discussion on how to order text items within a specific domain like this? I haven't found much that's of use.
Thanks for reading!
Jeremy,
What you are looking for is Rank Boosting which is supported by Solr. Here is a link where you can read more about this:
http://wiki.apache.org/solr/SolrRelevancyCookbook#Ranking_Terms