Aggregate most relevant results with MySQL's fulltext search across many tables

I am running fulltext queries on multiple tables on MySQL 5.5.22. The application uses innodb tables, so I have created a few MyISAM tables specifically for fulltext searches.
For example, some of my tables look like
account_search
===========
id
account_id
name
description
hobbies
interests
product_search
===========
id
product_id
name
type
description
reviews
As these tables are solely for fulltext search, they are denormalized. Data can come from multiple tables and is aggregated into the search table. Apart from the ID columns, all of the columns are covered by a single fulltext index.
To work around the "50%" rule with fulltext searches, I am using IN BOOLEAN MODE.
So for the above, I would run:
SELECT *, MATCH(name, type, description, reviews) AGAINST('john') AS relevance
FROM product_search
WHERE MATCH(name, type, description, reviews) AGAINST('john*' IN BOOLEAN MODE)
LIMIT 10;

SELECT *, MATCH(name, description, hobbies, interests) AGAINST('john') AS relevance
FROM account_search
WHERE MATCH(name, description, hobbies, interests) AGAINST('john*' IN BOOLEAN MODE)
LIMIT 10;
Let's just assume that we have products called "john" as well :P
The problems I am facing are:
To get meaningful relevance, I need to run the search without IN BOOLEAN MODE. This means that the search is subject to the 50% rule and the word-length rules. So, quite often, if most of the products in the product_search table are called john, their relevance is returned as 0.
Relevances between multiple queries are not comparable. (I think a relevance of 14 from one query does not equal a relevance of 14 from a different query.)
Searches will not be limited to just these 2 tables; there are other "object types", for example: "orders", "transactions", etc.
I would like to be able to return the top 7 most relevant results of ALL object types given a set of keywords (1 search box returns results for ALL objects).
Given the above, what are some algorithms, or perhaps even better ideas, for getting the top 7?
I know I can use things like Solr and Elasticsearch; I have already tried them and am in the process of integrating them into the application, but I would like to be able to provide search for those who only have access to MySQL.

So after thinking about this for a while, I decided that the relevance ranking has to be done with 1 query within MySQL.
This is because:
Relevance between separate queries can't be compared.
It's hard to combine the contents of multiple searches together in meaningful ways.
I have switched to using 1 index table dedicated to search. Entries are inserted, removed, and updated in response to inserts, removals, and updates to the real underlying data in the InnoDB tables (this is all automatic).
The table looks like this:
search
==============
id //id for the entry
type //the table the data came from
column //column the data came from
type_id //id of the row in the original table
content //text
There's a fulltext index on the content column. It is important to realize that not all columns from all tables are indexed; only things that I deem to be useful in search have been added.
Thus, it's just a simple case of running a query to match on content, retrieving what we have, and doing further processing. To produce the final result, a few more queries are required to ask the parent table for the title of the search result and perhaps some other metadata, but this is a workable solution.
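For concreteness, here is a minimal sketch of what this could look like in SQL. It is illustrative only; note that column is a reserved word in MySQL, so it has to be quoted with backticks (or renamed):

CREATE TABLE search (
  id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  type     VARCHAR(32)  NOT NULL,  -- source table, e.g. 'product'
  `column` VARCHAR(32)  NOT NULL,  -- source column, e.g. 'reviews'
  type_id  INT UNSIGNED NOT NULL,  -- id of the row in the source table
  content  TEXT         NOT NULL,
  FULLTEXT KEY ft_content (content)
) ENGINE=MyISAM;  -- MyISAM, since InnoDB has no fulltext support on MySQL 5.5

-- One query now covers every object type, and the relevance values
-- are comparable because they all come from the same index:
SELECT type, type_id,
       MATCH(content) AGAINST('john*' IN BOOLEAN MODE) AS relevance
FROM search
WHERE MATCH(content) AGAINST('john*' IN BOOLEAN MODE)
ORDER BY relevance DESC
LIMIT 7;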
I don't think this approach will really scale (every update and insert to the underlying data needs to update this table as well), but I think it is a pretty good way to provide decent application-wide search for smaller deployments of the application.
For scalability, use something like Elasticsearch, Solr, or Lucene.

Doing a mysql like %term% on 1B records (with indexed field)

I have the following query that I'm using and was wondering if it would work performantly, or whether I should use ElasticSearch from the start:
SELECT
*
FROM
entity_access
JOIN entity ON (entity.id=entity_access.entity_id)
WHERE
user_id = 144
AND name LIKE '%format%'
The entity_access table will have about a billion rows, but each user should have 5k entries max. My thinking was that a LIKE '%term%' would be trivial on a table of 5k rows (under 50ms), so hopefully it would perform the same if a good index on the large table narrows it down first? Or is there something I'm missing here?
Two things. First, it doesn't matter how many total rows are in the table, because the index on user_id will select only those rows for matching. As you say there are about 5k per user_id, that's easily managed.
Second, LIKE '%foo%' will not use an index: the leading '%' precludes that. If you want to use an index, you'll have to accept a pattern of LIKE 'foo%'. If that fits the use case, then the query as written will perform fine.
If either of the above conditions doesn't hold, then consider using a dedicated search engine (like Sphinx, or roll your own with radix trees) or materialize your search into a more indexable format (such as using MySQL Full-Text Search).
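As a rough sketch, assuming entity_access(user_id, entity_id) and entity(id, name) as the question implies (the index names are made up):

-- Covering index: narrows to one user's ~5k rows before any name matching.
CREATE INDEX idx_access_user ON entity_access (user_id, entity_id);

-- Usable only for prefix patterns (no leading %):
CREATE INDEX idx_entity_name ON entity (name);

SELECT e.*
FROM entity_access ea
JOIN entity e ON e.id = ea.entity_id
WHERE ea.user_id = 144
  AND e.name LIKE 'format%';  -- prefix pattern keeps the index usable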

Best MySQL search query for multiple keywords in multiple columns

The problem here is that I have multiple columns:
| artist | name | lyrics | content
I want to search these columns for multiple keywords. The problem is that I can't come up with a good algorithm using LIKE with AND/OR.
The best option is to search for each keyword in each column, but that way I get results that contain one keyword in name yet lack the second keyword in artist.
I want everything combined with AND, but that only works when all the keywords appear in a single column; otherwise, for a row to be returned, every column would have to contain all the keywords...
Does anyone know what algorithm I need so that a search with 3 keywords (e.g. 1 for artist and 2 for name) finds the correct result?
The best solution is not to use MySQL for the search, but use a text-indexing tool like Apache Solr.
You could then run queries against Solr like:
name:"word" AND artist:"otherword"
It's pretty common to use Solr for indexing data even if you keep the original data in MySQL.
Not only would it give you the query flexibility you want, but it would run hundreds of times faster than using LIKE '%word%' queries on MySQL.
Another alternative is to use MySQL's built-in fulltext indexing feature:
CREATE FULLTEXT INDEX myft ON mytable (artist, name, lyrics, content);
SELECT * FROM mytable
WHERE MATCH(artist, name, lyrics, content)
AGAINST ('+word +otherword' IN BOOLEAN MODE);
But it's not as flexible if you want to search for words that occur in specific columns, unless you create a separate index on each column.
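If per-column keyword targeting matters, a sketch of that variant (the index names are made up):

CREATE FULLTEXT INDEX ft_artist ON mytable (artist);
CREATE FULLTEXT INDEX ft_name ON mytable (name);

-- Each MATCH uses its own single-column index:
SELECT * FROM mytable
WHERE MATCH(artist) AGAINST('+word' IN BOOLEAN MODE)
  AND MATCH(name) AGAINST('+otherword' IN BOOLEAN MODE);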
AND works for displaying multiple rows too; it just depends on the rows you have in your table, which you haven't provided. PS: I'm sorry if my answer is not clear, I don't have the reputation to make it a comment.

How 'and' and 'or' work in SQL

Imagine I have a database for a large website with a table called 'users' that has a large number of records. When I execute a query such as SELECT * FROM users WHERE username='John', my understanding is that (ignoring caching etc.) the database navigates the index and finds the user(s) named John. Imagine this query returns 1 million results, and I am only interested in users called John who are 25 years old, so I perform another query: SELECT * FROM users WHERE username='John' AND age=25
How does this work? Does it loop through all the users named John and find only those whose age matches 25, or is there a better way of doing it? I assume this is database- and storage-engine-specific, so we can assume I am using MySQL with InnoDB.
The answer is: you're not supposed to ask this question. In a declarative language like SQL, you describe the desired result and the processing engine determines the optimal way to produce it. It may take different paths to the result depending on seemingly minor differences in the request, or the method used may change from version to version of the product, or even based on some factor completely unrelated to the product (available memory or disk space, for instance).
That said, the following is true of most SQL databases in most cases:
The database will use only one index in evaluating a WHERE clause.
If more than one index could be used to evaluate the WHERE clause the database will use statistics about the cardinality (distribution of values) in each index to select the "best" one.
If there is an index built from more than one column, and the head column(s) of that index are present in the filter conditions of the WHERE clause, that index can possibly be used to filter by multiple columns in a single index.
So, in your example, most databases would use an index on either age or username to do the first-level filtering, then scan the resulting records to do the second level of filtering. The only exception would be if you had a compound index on (username, age) or (age, username), in which case an index lookup alone would find the records.
Assuming you have indexes on both columns, it generally examines the statistics of the data itself to choose an option that reduces the cardinality of the result set as quickly as possible.
For example, if 20% of people are aged 25 but only 3% are called John, it will get the Johns first then strip out those who are not aged 25.
If you have a composite key made up of both columns, then that should be even faster, since there's no "stripping" involved at all.
Bottom line, it comes down to the DB engine understanding the makeup of the data and choosing the best execution plan based on that. That's why it's often good to re-calculate statistics periodically, as the data may change.
If you have a query like this:
SELECT *
FROM users
WHERE username = 'John' AND age = 25;
Then the optimal index is users(username, age) or users(age, username). With this index, the matching records can be found just by looking them up in the index.
As for what happens if you only have an index on username: the engine would typically look up the rows with 'John' in the username column, then fetch the records from the data pages and continue filtering based on the data on the pages.
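To make that concrete, a minimal sketch (the index name is made up):

-- Composite index covering both equality predicates; the matching
-- rows are then found directly in the index:
CREATE INDEX idx_users_username_age ON users (username, age);

-- EXPLAIN should show this index being used for both columns:
EXPLAIN SELECT * FROM users WHERE username = 'John' AND age = 25;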

Fulltext search on many tables

I have three tables, all of which have a column with a fulltext index. The user will enter search terms into a single text box, and then all three tables will be searched.
This is better explained with an example:
documents
doc_id
name FULLTEXT
table2
id
doc_id
a_field FULLTEXT
table3
id
doc_id
another_field FULLTEXT
(I realise this looks stupid but that's because I've removed all the other fields and tables to simplify it).
So basically I want to do a fulltext search on name, a_field and another_field, and then show the results as a list of documents, preferably with what caused that document to be found, e.g. if another_field matched, I would display what another_field is.
I began working on a system whereby three fulltext search queries are performed and the results inserted into a table with a structure like:
search_results
table_name
row_id
score
(This could later be made to cache results for a few days with e.g. a hash of the search terms).
This idea has two problems. The first is that the same document can be in the search results up to three times with different scores. Instead of that, if the search term is matched in two tables, it should have one result, but a higher score.
The second is that parsing the results is difficult. I want to display a list of documents, but I don't immediately know the doc_id without a join of some kind; however, the table to join to depends on the table_name column, and I'm not sure how to accomplish that.
Wanting to search multiple related tables like this must be a common thing, so I guess what I'm asking is: am I approaching this in the right way? Can someone tell me the best way of doing it, please?
I would create a single denormalized index. That is, put all three document types into one table with fields for doc_id, doc_type, and a single fulltext block; then you can search all three document types at once, as in the sketch below.
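A minimal sketch of that single-table index (all names are illustrative):

CREATE TABLE doc_search (
  doc_id   INT UNSIGNED NOT NULL,
  doc_type VARCHAR(32)  NOT NULL,  -- 'name', 'a_field' or 'another_field'
  body     TEXT         NOT NULL,
  FULLTEXT KEY ft_body (body)
) ENGINE=MyISAM;

-- One row per document, a summed score when several sources match,
-- and a record of which fields caused the hit:
SELECT doc_id,
       SUM(MATCH(body) AGAINST('search terms')) AS score,
       GROUP_CONCAT(doc_type) AS matched_in
FROM doc_search
WHERE MATCH(body) AGAINST('search terms')
GROUP BY doc_id
ORDER BY score DESC;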
You might also find that Lucene would make sense in this situation. It gives you faster searching, as well as much more functionality around how the searching and scoring works.
The downside is that you're keeping a separate denormalized copy of the text for each record. The upside is that searching is much faster.

How to perform search on MySQL table for a website

How do I perform a search similar to Wikipedia's on a MySQL table (or several tables at a time) without crawling the database first? (Search on Wikipedia used to show you the relevancy as a percentage.)
What I'm looking for here is how to determine the relevancy of the results and sort them accordingly, especially in the case where you pull data from several tables at a time.
What do you use for search function on your websites?
You can use MySQL's full-text search functionality. You need to have a FULLTEXT index on the fields to be searched. For natural language searches, it returns a relevance value which is "a similarity measure between the search string and the text in that row."
If you are searching multiple tables, the relevance value should be comparable across sets of results; you could do a UNION of individual fulltext queries on each of the tables, then sort the results of the union based on the relevance value.
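As a sketch of that UNION, reusing the search tables from the first question above (the column lists must exactly match a FULLTEXT index definition):

SELECT 'account' AS source, account_id AS id,
       MATCH(name, description, hobbies, interests) AGAINST('john') AS relevance
FROM account_search
WHERE MATCH(name, description, hobbies, interests) AGAINST('john')
UNION ALL
SELECT 'product', product_id,
       MATCH(name, type, description, reviews) AGAINST('john')
FROM product_search
WHERE MATCH(name, type, description, reviews) AGAINST('john')
ORDER BY relevance DESC
LIMIT 7;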