Text search in MySQL - Performance and Alternatives

I have a set of tables in MySQL like this (foreign keys referenced by [table_name]_id):
Articles(id, author_id, title, date, broad_search, ...)
Keywords(id, article_id, keyword (varchar))
Authors(id, name, ...)
Attachments(id, article_id, url, ...)
The table we are concerned about most is 'Keywords' so I am mentioning the indexes only for it:
id - Primary - BTREE
(article_id,keyword) - Unique - BTREE
keyword - BTREE
article_id - BTREE
Each article has an associated list of keywords. The "broad_search" column in Articles states whether that particular article can be matched broadly against the keywords (broad_search=1) or whether it requires an exact match of the keyword (broad_search=0). I have a SELECT query which pulls a list of articles based on keywords, the broad_search parameter and other filter criteria.
$sql = "SELECT *
FROM Keywords k, Attachments at, Articles ar, Authors a (2 more tables)
WHERE
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
AND at.article_id = ar.id
AND a.id = ar.author_id
... (more conditions)
LIMIT 20";
An article can be set to either broad match or exact match, and I'm trying to get a list of them based on a keyword.
Exact match is straightforward. But broad match has various cases which will not let me use a simple wild card pattern like '%search_term%'. An example:
Keywords for a broad match article = {books, used books, reading books, popular book}
search term = new books
Now, we cannot use MySQL wildcard string matching here, as '%new books%' will not match any of the keywords, yet the article needs to be retrieved because the search term contains one of the keywords as a substring (broad_search=1). So broad match covers two cases: the search term is a substring of the keyword (search term "cars", keyword "used cars"), or the keyword is a substring of the search term (search term "used cars", keyword "cars").
If broad_search=0, do an exact match. If broad_search=1, match both cases:
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
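To make the two broad-match cases concrete, here is a minimal Python sketch of the same rule. It is an illustration only, not part of the original query, and the function name `matches` is made up for this example. Note that it compares strings case-sensitively, whereas MySQL's default collations are case-insensitive.

```python
def matches(keyword: str, search_term: str, broad_search: bool) -> bool:
    """Mirror the WHERE clause: exact match, or substring match in either direction."""
    if not broad_search:
        # ar.broad_search=0 AND k.keyword = '$Keyword'
        return keyword == search_term
    # INSTR('$Keyword', k.keyword) > 0  -> the keyword occurs inside the search term
    # k.keyword LIKE '%$Keyword%'       -> the search term occurs inside the keyword
    return keyword in search_term or search_term in keyword


keywords = ["books", "used books", "reading books", "popular book"]

# Only "books" is a substring of "new books"; no keyword contains "new books".
broad_hits = [k for k in keywords if matches(k, "new books", True)]
exact_hits = [k for k in keywords if matches(k, "new books", False)]
```

With the keyword set from the example, only "books" qualifies under broad match, and nothing qualifies under exact match.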
The query I wrote does the job perfectly, but performance is the issue. The keywords table is very large, with 100,000+ rows, and keeps growing. Also, this is a high-load app, and the query kills my server due to the huge number of requests it receives.
I feel this is not the right way to perform a text search. I tried going through the MySQL docs on full-text search, but I did not quite understand its application or whether it fits my search criteria. I was also wondering whether Apache Lucene would be a better choice, but I haven't used it before, so I'm not really sure (this query runs in a PHP script).
How should I be implementing this? Is it an indexing issue, is the MySQL INSTR function inefficient, or should I be using a whole different approach?

MySQL isn't a search engine; it's a Relational Database Management System (RDBMS). However, you can use native MySQL tools to emulate full-text searching capabilities, such as setting up a search table as MyISAM and adding a FULLTEXT index to the columns you wish to search. You can read the MySQL docs for more info on how MySQL supports full-text searching.
Even if you get full-text search queries to work the way you want, you will still miss out on a whole host of features that a true search engine (Lucene) supports, such as facets, spatial searches, result boosting, weighting, etc. I'd suggest you read up on Apache SOLR, as it supports all these features and many more. There is even a PHP SOLR API which you can use to access a SOLR instance.
I'm not saying to abandon MySQL altogether, but use it for its intended purpose: to persistently store data which can be queried and which can be used to populate your search engine indices. SOLR even has a built-in Data Import Handler, which will allow you to set a database query to be used when you want to mass import data from your MySQL database.
The learning curve is relatively high, as it is with learning most new technologies, but when you are done you will wonder how you ever got by without using a true Full-Text search engine.

Related

MySql use INNODB_FT_INDEX_CACHE in production

In our application we are about to introduce tagging to various entities. I read about possible solutions and found this series of blog posts:
http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
http://howto.philippkeller.com/2005/05/05/Tags-with-MySQL-fulltext/
http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/
Based on this, I'm thinking about introducing fulltext indexing on our "tags" columns. So far so good (knowing the limits of the fulltext index, such as stopwords, tokenization separators, etc.).
However, we need to be able to list the existing tags in some cases, and we require that tags defined by different users are hidden from each other (basically if 'user A' defines tag "private" then 'user B' should not see it).
I found a possible solution that seems to work, however, I'm not convinced. This is what I currently do:
SET GLOBAL innodb_ft_aux_table='' to the name of the table
run SELECT DISTINCT WORD tag, COUNT(1) cnt FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE WHERE DOCID IN (?,?) GROUP BY tag ORDER BY cnt DESC (the DOCID list is prefiltered by user permissions - the result set is expected to be lower than a few thousand records, mostly below 100)
I can also extend the WHERE part of the select to filter the result list (for autocomplete purposes). But since this approach uses a global variable, it is not safe to use in a concurrent environment. I can apply some kind of synchronization, but should I use this table at all? Also, since this table is categorized as "monitoring" by the MySQL documentation, I'm not sure this is a proper use case. If not, what other solutions do I have with a fulltext index?

How do you optimize a query from an "Advanced Search Form"?

Generally speaking, how are MySQL queries generated from "Advanced Search Forms" with 20 - 30 optional fields optimized when there are so many different possibilities, columns, tables, outcomes, etc.?

Generally speaking, they aren't returned from the database. Advanced searches that have many field options, and maybe also some boolean logic thrown in for good measure, are returned from a search server such as Lucene, Sphinx or Xapian.
Returning them from the database is almost impossible to do efficiently. It is usually done by building the query dynamically, such as:
SELECT ... WHERE 1=1
Then loop over each field and add:
AND field LIKE '%abc'
AND field_2 LIKE '%def%'
AND field LIKE 'GHI%'
Ideally, add the % wildcard only at the end (a prefix pattern such as 'GHI%' can still use an index), and use a leading wildcard, or both, only for searches within the field data.
Building and running queries like this makes good indexing virtually impossible, which makes them slow and a prime candidate for performance bottlenecks.
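As a rough illustration of the dynamic approach described above, here is a minimal Python sketch that uses the standard-library sqlite3 module as a stand-in for MySQL. The `articles` table, the `SEARCHABLE` whitelist and the `build_query` helper are all hypothetical names invented for this example. Values are passed as bound parameters, and only whitelisted column names are ever concatenated into the SQL text.

```python
import sqlite3

# Hypothetical whitelist mapping form fields to (column, match style).
SEARCHABLE = {
    "title": ("title", "contains"),
    "author": ("author", "prefix"),
    "year": ("year", "exact"),
}


def build_query(filters):
    """Start from WHERE 1=1 and append one AND clause per filled-in field."""
    sql = "SELECT id, title FROM articles WHERE 1=1"
    params = []
    for field, value in filters.items():
        if value in (None, "") or field not in SEARCHABLE:
            continue  # field left empty on the form, or not searchable
        column, style = SEARCHABLE[field]
        if style == "exact":
            sql += f" AND {column} = ?"
            params.append(value)
        else:
            sql += f" AND {column} LIKE ?"
            # A trailing wildcard only for prefix searches, both for infix searches.
            params.append(f"{value}%" if style == "prefix" else f"%{value}%")
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, author TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO articles (title, author, year) VALUES (?, ?, ?)",
    [
        ("Intro to Sphinx", "alex", 2012),
        ("MySQL full-text search", "barry", 2012),
        ("Advanced MySQL", "alex", 2013),
    ],
)

sql, params = build_query({"title": "mysql", "author": "al", "year": None})
rows = conn.execute(sql, params).fetchall()
```

The sketch does not escape % or _ inside user input, so a real form handler would need to deal with those as well.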

Sphinx vs. MySql - Search through list of friends (efficiency/speed)

I'm porting my application searches over to Sphinx from MySQL and am having a hard time figuring this one out, or if it even needs to be ported at all (I really want to know if it's worth using sphinx for this specific case for efficiency/speed):
users
uid uname
1 alex
2 barry
3 david
friends
uid | fid
1 2
2 1
1 3
3 1
Details are:
- InnoDB
- users: index on uid, index on uname
- friends: combined index on uid,fid
Normally, to search all of alex's friends with mysql:
$uid = 1;
$searchstr = "%$friendSearch%";
$query = "SELECT f.fid, u.uname FROM friends f
JOIN users u ON f.fid=u.uid
WHERE f.uid=:uid AND u.uname LIKE :friendSearch";
$friends = $dbh->prepare($query);
$friends->bindParam(':uid', $uid, PDO::PARAM_INT);
$friends->bindParam(':friendSearch', $searchstr, PDO::PARAM_STR);
$friends->execute();
Is it any more efficient to find alex's friends with Sphinx vs MySQL, or would that be overkill? If Sphinx would be faster for this as the list hits thousands of people, what would the indexing query look like? How would I delete a friendship that no longer exists with Sphinx as well? Can I have a detailed example for this case? Should I change this query to use Sphinx?

Ok this is how I see this working.
I have the exact same problem with MongoDB. MongoDB "offers" searching capabilities but just like MySQL you should never use them unless you wanna be choked with IO, CPU and memory problems and be forced to use a lot more servers to cope with your index than you normally would.
The whole idea of using Sphinx (or another search tech) is to lower cost per server by having a performant index searcher.
Sphinx, however, is not a storage engine. It is not as simple to query exact relationships across tables; they have remedied this a little with SphinxQL, but due to the nature of the full-text index it still doesn't do an integral join like you would get in MySQL.
Instead I would store the relationships within MySQL but have an index of "users" within Sphinx.
In my website I personally have 2 indexes:
main (houses users,videos,channels and playlists)
help (help system search)
These are delta updated once every minute. Since real-time indexes are still a bit experimental at times, and I personally have seen problems with high insertion/deletion rates, I stick to delta updates. So I would use a delta index to update the main searchable objects of my site, since this is less resource intensive and more performant than real-time indexes (from my own tests).
Do note that in order to process deletions and the like in your Sphinx collection through a delta index, you will need a kill-list and certain filters for your delta index. Here is an example from my index:
source main_delta : main
{
sql_query_pre = SET NAMES utf8
sql_query_pre =
sql_query = \
SELECT id, deleted, _id, uid, listing, title, description, category, tags, author_name, duration, rating, views, type, adult, videos, UNIX_TIMESTAMP(date_uploaded) AS date_uploaded \
FROM documents \
WHERE id>( SELECT max_doc_id FROM sph_counter WHERE counter_id=1 ) OR update_time >( SELECT last_index_time FROM sph_counter WHERE counter_id=1 )
sql_query_killlist = SELECT id FROM documents WHERE update_time>=( SELECT last_index_time FROM sph_counter WHERE counter_id=1 ) OR deleted = 1
}
This processes deletions and additions once every minute which is pretty much realtime for a real web app.
So now we know how to store our indexes, and I need to talk about the relationships. Sphinx (even though it has SphinxQL) won't do integral joins across data, so I would personally recommend handling the relationship outside of Sphinx. On top of that, as I said, this relationship table will get high load, which is something that could impact the Sphinx index.
I would do a query to pick out all ids and using that set of ids use the "filter" method on the sphinx API to filter the main index down to specific document ids. Once this is done you can search in Sphinx as normal. This is the most performant method I have found to date of dealing with this.
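The two-step flow described above can be sketched as follows. This is only an illustration in Python: sqlite3 stands in for MySQL, and the `USER_INDEX` dictionary with its `search_index` function is a made-up stand-in for the Sphinx index and its filter call, not the real Sphinx API.

```python
import sqlite3

# Relationship data stays in the relational store (sqlite3 as a stand-in for MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE friends (uid INTEGER, fid INTEGER, PRIMARY KEY (uid, fid))")
conn.executemany("INSERT INTO friends VALUES (?, ?)", [(1, 2), (2, 1), (1, 3), (3, 1)])

# Hypothetical stand-in for the search index: document id -> indexed user name.
USER_INDEX = {1: "alex", 2: "barry", 3: "david"}


def friend_ids(uid):
    """Step 1: pick out the ids of all friends from the relationship table."""
    rows = conn.execute("SELECT fid FROM friends WHERE uid = ? ORDER BY fid", (uid,))
    return [fid for (fid,) in rows]


def search_index(term, allowed_ids):
    """Step 2: search the index, restricted to the given document ids."""
    allowed = set(allowed_ids)
    term = term.lower()
    return [
        (doc_id, name)
        for doc_id, name in sorted(USER_INDEX.items())
        if doc_id in allowed and term in name.lower()
    ]


ids = friend_ids(1)             # alex's friends
hits = search_index("ar", ids)  # search only within those friends
```

The point of the split is that the join-like work happens where joins are cheap, and the search engine only ever sees a flat list of ids to filter by.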
The key thing to remember at all times is that Sphinx is a search tech while MySQL is a storage tech. Keep that in mind and you should be ok.
Edit
As @N.B said (which I overlooked in my answer), Sphinx does have SphinxSE. Although primitive and still in a sort of testing stage of its development (same as real-time indexes), it does provide an actual MyISAM/InnoDB-type storage to Sphinx. This is awesome. However, there are caveats (as with anything):
The language is primitive
The joins are primitive
However, it can/could do the job you're looking for, so be sure to look into it.

So I'm going to go ahead and outline what I feel the best use cases for Sphinx are, and you can decide if it's in line with what you're looking to do.
If all you're looking to do is a string search on one field, then with MySQL you can do wildcard searches without much trouble, and honestly, with an index on it, unless you're expecting millions of rows you are going to be fine.
Now take Facebook, which is indexing not only names but pages etc., or even any advanced search fields. Sphinx can take in x columns from MySQL, PostgreSQL, MongoDB (insert the database you want here) and create a searchable full-text index across all of those.
Example:
You have 5 fields (house number, street, city, state, zipcode) and you want to do a full-text search across all of those. With MySQL you could do searches on every single one; with Sphinx, however, you can glob them all together, and Sphinx then does some awesome statistical findings based on the string you've passed in and the matches which result from it.
This Link: PHP Sphinx Searching does a great job at walking you through what it would look like and how things work together.
So you aren't really replacing a database; you're just adding a special daemon to it (sphinx) which allows you to create specialized indexes and run your full text searches against it.

No index can help you with this query, since you're looking for the string as an infix, not a prefix (you're looking for '%friendname%', not 'friendname%').
Moreover, the LIKE solution will get you into corners: suppose you were looking for a friend called Ann. The LIKE expression will also match Marianne, Danny etc. There's no "complete word" notion in a LIKE expression.
A real solution is to use a text index. A FULLTEXT index is only available on MyISAM, and MySQL 5.6 (not GA at this time) will introduce FULLTEXT on InnoDB.
Otherwise you can indeed use Sphinx to search the text.
With just hundreds or thousands, you will probably not see a big difference, unless you're really going to do many searches per second. With larger numbers, you will eventually realize that a full table scan is inferior to Sphinx search.
I'm using Sphinx a lot, on dozens and sometimes hundreds of millions of large texts, and can testify it works like a charm.
The problem with Sphinx is, of course, that it's an external tool. With Sphinx you have to tell it to read data from your database. You can do so (using crontab for example) every 5 minutes, every hour, etc. So if rows are DELETEd, they will only be removed from sphinx the next time it reads the data from table. If you can live with that - that's the simplest solution.
If you can't, there are real-time indexes in Sphinx, so you may directly instruct it to remove certain rows. I am unable to explain everything in this post, so here are a couple of links for you:
Index updates
Real time indexes
As a final conclusion, you have three options:
Risk it and use a full table scan, assuming you won't have high load.
Wait for MySQL 5.6 and use FULLTEXT with InnoDB.
Use sphinx
At this point in time, I would certainly use option #3: use sphinx.

Take a look at the solution I propose here:
https://stackoverflow.com/a/22531268/543814
Your friend names are probably short, and your query looks simple enough. You can probably afford to store all suffixes, perhaps in a separate table, pointing back to the original table to get the full name.
This would give you fast infix search at the cost of a little bit more storage space.
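A minimal sketch of the suffix idea, assuming a separate suffix table that points back to the users table. It is written in Python with sqlite3 as a stand-in for MySQL, and the table name `uname_suffixes` and both helper functions are invented for this illustration. The key point is that a prefix match on a stored suffix is equivalent to an infix match on the full name, and in MySQL a pattern of the form 'term%' can be served by a B-tree index on the suffix column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, uname TEXT)")
# Hypothetical suffix table: one row per suffix of each user name.
conn.execute("CREATE TABLE uname_suffixes (suffix TEXT, uid INTEGER)")
conn.execute("CREATE INDEX idx_suffix ON uname_suffixes (suffix)")


def add_user(uid, uname):
    """Store the user plus every suffix of the lowercased name."""
    conn.execute("INSERT INTO users VALUES (?, ?)", (uid, uname))
    lowered = uname.lower()
    conn.executemany(
        "INSERT INTO uname_suffixes VALUES (?, ?)",
        [(lowered[i:], uid) for i in range(len(lowered))],
    )


def infix_search(term):
    """A prefix match on a suffix is an infix match on the full name."""
    rows = conn.execute(
        "SELECT DISTINCT u.uid, u.uname FROM uname_suffixes s "
        "JOIN users u ON u.uid = s.uid "
        "WHERE s.suffix LIKE ? ORDER BY u.uid",
        (term.lower() + "%",),
    )
    return rows.fetchall()


for uid, uname in [(1, "alex"), (2, "barry"), (3, "david")]:
    add_user(uid, uname)

rr_hits = infix_search("rr")  # matches the stored suffix "rry" of "barry"
```

A name of length n costs n suffix rows, which is why this only stays cheap for short values such as user names.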
Furthermore, to avoid finding 'Marianne' when searching for 'Ann', consider:
Using case-sensitive search. (Fragile; may break if your users enter their names or their search queries with incorrect capitalization.)
After the query, filtering your search results further, requiring word boundaries around the search term (e.g. regex \bAnn\b).
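The second option can be sketched in a few lines of Python. The helper name `whole_word_matches` and the sample names are made up for this example. The search term is escaped before being placed between the \b anchors, and the match ignores case to sidestep the capitalization problem mentioned in the first option.

```python
import re


def whole_word_matches(names, term):
    """Keep only names in which the term appears as a complete word."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [name for name in names if pattern.search(name)]


# Candidates as an infix search for "Ann" might return them.
candidates = ["Ann", "Marianne", "Danny", "Ann Lee", "Mary-Ann"]
filtered = whole_word_matches(candidates, "Ann")
```

Here "Marianne" and "Danny" are dropped, while "Ann", "Ann Lee" and "Mary-Ann" survive, because a hyphen or a space counts as a word boundary.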

Improving MySQL's relevance score using Sphinx for full text search

I am working on an information retrieval system using MySQL's full-text search in natural language mode.
The data I have is annotated with different categories. E.g. monkey, cat, dog will be annotated as 'animals', whereas duck, sparrow as 'birds'. The problem is that I am retrieving documents based on the occurrences of these tags.
Now MySQL has a limitation: if a particular term occurs in more than 50% of the rows, that term is not considered. For my requirement, I want it to score all the matching terms even if a particular term occurs in more than 50% of the data.
I have read a few things about combining Sphinx with MySQL for search efficiency, but I am not sure whether this could be applied to my situation.
Please suggest a solution for this problem.

Sphinx is very good at very fast fulltext search. It doesn't have the 50% rule that MySQL has, but you will need to use it in place of MySQL's fulltext search. Basically, what you do is install Sphinx and set up an import to copy all your MySQL data into Sphinx. Then you can build SphinxSE or query Sphinx directly through a library to get your results. You can then get the details of your results by querying MySQL.
I use SphinxSE because you can query Sphinx through MySQL and join your MySQL table to the results in a single query. It's quite nice.

How to perform search on MySQL table for a website

How do I perform a search similar to that of Wikipedia on a MySQL table (or several tables at a time) without crawling the database first? (Search on wikipedia used to show you the relevancy in percentage).
What I'm looking for here is how to determine relevancy of the results and sort them accordingly, especially in case where you pull data from several tables at a time.
What do you use for search function on your websites?

You can use MySQL's full-text search functionality. You need to have a FULLTEXT index on the fields to be searched. For natural language searches, it returns a relevance value which is "a similarity measure between the search string and the text in that row."
If you are searching multiple tables, the relevance value should be comparable across sets of results; you could do a UNION of individual fulltext queries on each of the tables, then sort the results of the union based on the relevance value.
