How to perform a search on a MySQL table for a website

How do I perform a search similar to Wikipedia's on a MySQL table (or on several tables at once) without crawling the database first? (Wikipedia's search used to show the relevance of each result as a percentage.)
What I'm looking for is how to determine the relevance of the results and sort them accordingly, especially when the data is pulled from several tables at once.
What do you use for the search function on your websites?

You can use MySQL's full-text search functionality. You need a FULLTEXT index on the columns to be searched. For natural language searches, MATCH ... AGAINST returns a relevance value, which the documentation describes as "a similarity measure between the search string and the text in that row."
If you are searching multiple tables, you could do a UNION of individual full-text queries on each of the tables, then sort the combined results by the relevance value. Note that each table's relevance values are computed from that table's own word statistics, so they are only roughly comparable across tables.
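The UNION approach described above could be sketched roughly as follows. This is only an illustration, not a tested production query: the table names `articles` and `products`, their columns, and the assumption that every table has an `id` column are hypothetical, and the `%s` placeholders assume a MySQL driver for Python that uses that parameter style.

```python
def build_union_search(tables, term, limit=20):
    """Build one UNION ALL query over several FULLTEXT-indexed tables.

    `tables` maps a table name to the columns covered by ONE FULLTEXT index
    on that table. Table and column names are interpolated directly, so they
    must come from trusted configuration, never from user input.
    """
    selects = []
    params = []
    for table, columns in tables.items():
        match = "MATCH({}) AGAINST(%s IN NATURAL LANGUAGE MODE)".format(
            ", ".join(columns)
        )
        # Each branch is parenthesized so the final ORDER BY / LIMIT
        # applies to the whole union, not just to the last SELECT.
        selects.append(
            "(SELECT '{t}' AS source, id, {m} AS relevance "
            "FROM {t} WHERE {m})".format(t=table, m=match)
        )
        # The search term is bound twice per table: once for the score,
        # once for the WHERE filter.
        params.extend([term, term])
    sql = "\nUNION ALL\n".join(selects)
    sql += "\nORDER BY relevance DESC\nLIMIT {:d}".format(int(limit))
    return sql, params


# Hypothetical tables, for illustration only.
sql, params = build_union_search(
    {"articles": ["title", "body"], "products": ["name", "description"]},
    "mysql",
    limit=7,
)
```

The resulting `sql` and `params` would then be passed to the driver's `cursor.execute(sql, params)`. The `source` column records which table each row came from, so the application can fetch the full record afterwards.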

Related

MySQL: Full-Text inflectional forms of words? [duplicate]

Stemming Words in MySQL
For e.g. the user might search for "testing", "tested" or "tests". All these words are related to each other because the base word "test" is common in all of them.
Is there a way to get such result or function?
MySQL Full-Text Search
Historically, full-text searches were supported only in the MyISAM engine. Since version 5.6, MySQL has also supported full-text searches in the InnoDB storage engine. This has been great news, since it enables developers to benefit from InnoDB’s referential integrity, ability to perform transactions, and row-level locks.
There are basically two approaches to full-text searches in MySQL: natural language and boolean mode. (A third option augments the natural language search with a second expansion query.)
The main difference between the natural and boolean modes is that the boolean allows certain operators as part of the search. For instance, boolean operators can be used if a word has greater relevance than others in the query or if a specific word should be present in the results, etc. It’s worth noticing that in both cases, results can be sorted by the relevance computed by MySQL during the search.
The best fit for our problem was to use InnoDB full-text searches in boolean mode. Why?
We had little time to implement the search function.
At this point, we had no big data to crunch nor a massive load to require something like Elasticsearch or Sphinx.
We used shared hosting that doesn’t support Elasticsearch or Sphinx and the hardware was pretty limited at this stage.
While we wanted word stemming in our search function, it wasn’t a deal breaker: we could implement it (within constraints) by way of some simple PHP coding and data denormalization.
Full-text searches in boolean mode can search words with wildcards (for the word stemming) and sort the results based on relevance.
In the Normalized Vertabelo Model
Let’s see how a simple search would work. We’ll create a sample table first:
CREATE TABLE artists (
    id INT(11) NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    bio TEXT NOT NULL,
    CONSTRAINT artists_pk PRIMARY KEY (id)
) ENGINE = InnoDB;
CREATE FULLTEXT INDEX artists_idx_1 ON artists (name);
In natural language mode
You can insert some sample data and start testing. (It would be good to add it to your sample dataset.) For instance, we’ll try searching for Michael Jackson:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN NATURAL LANGUAGE MODE)
This query will find records that match the search terms and will sort matching records by relevance; the better the match, the more relevant it is and the higher the result will appear in the list.
In boolean mode
We can perform the same search in boolean mode. If we don’t apply any operators to our query, the only difference will be that results are not sorted by relevance:
SELECT
*
FROM
artists
WHERE
MATCH (artists.name) AGAINST ('Michael Jackson' IN BOOLEAN MODE)
The wildcard operator in boolean mode
Since we want to match partial words and approximate stemming, we need the wildcard operator (*), which matches any word that begins with the given prefix. This operator is only available in boolean mode searches, which is why we chose that mode.
So, let’s unleash the power of boolean search and try searching for part of the artist’s name. We’ll use the wildcard operator to match any artist whose name starts with ‘Mich’:
SELECT
*
FROM
artists
WHERE
MATCH (name) AGAINST ('Mich*' IN BOOLEAN MODE)
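Before user input is turned into a wildcard search like the one above, it probably needs some preparation on the application side. The sketch below is one possible way to do that, not the article's own code: the function name `to_prefix_query` and the `require_all` option are made up for illustration. It strips the characters that act as boolean-mode operators and appends `*` to each remaining word.

```python
import re

# Characters that have a special meaning in MySQL boolean-mode full-text syntax.
_BOOLEAN_OPERATORS = re.compile(r'[+\-<>()~*"@]')


def to_prefix_query(user_input, require_all=False):
    """Turn free text into a boolean-mode prefix query string.

    Operator characters typed by the user are replaced with spaces so they
    cannot change the meaning of the query. With require_all=True every word
    gets a leading '+', which makes it mandatory in boolean mode.
    """
    words = _BOOLEAN_OPERATORS.sub(" ", user_input).split()
    prefix = "+" if require_all else ""
    return " ".join("{}{}*".format(prefix, word) for word in words)


print(to_prefix_query("Mich"))                         # Mich*
print(to_prefix_query("mich jack", require_all=True))  # +mich* +jack*
```

The returned string is meant to be bound as a query parameter to `AGAINST (... IN BOOLEAN MODE)` rather than concatenated into the SQL text.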

MySQL query shows no results for some bad words?

I am working on an adult site, and I have created an internal search for it.
For search I use this query:
SELECT SQL_CALC_FOUND_ROWS
id_photo, title, description, model, data_ins,
MATCH(title, description, model) AGAINST('".trim(strtolower(addslashes($_GET['q'])))."') as score
FROM ".$prefix."photo
WHERE MATCH(title, description, model) AGAINST('".trim(strtolower(addslashes($_GET['q'])))."')
ORDER BY score DESC LIMIT ".$start.", ".$step."
Everything works smoothly and without php or mysql errors, but the client pointed out a strange thing to me.
For example:
Searching for the word starting with "c" and ending with "ck", the query returns no results.
Searching for the word starting with "d" and ending with "ck", the query returns the correct results.
I use something similar to this to verify if there are results:
$photo_query_id = $db->prepare("my query");
$photo_query_id->execute();
if($photo_query_id->rowCount() < 1){
//...
}
The two words are both used hundreds of times in both titles and descriptions, so why does mysql sometimes prefer not to show results?
Is there a list of bad words in some MySQL config file that is blocking queries? If so, where do I find it and how do I modify it?
Use a BOOLEAN MODE search, or use the InnoDB storage engine for your table. When you do a natural language search against a MyISAM full-text index, words that appear in 50% or more of the rows are treated like stopwords and do not match.
From the documentation:
The 50% threshold can surprise you when you first try full-text searching to see how it works, and makes InnoDB tables more suited to experimentation with full-text searches. If you create a MyISAM table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until the table contains more rows. Users who need to bypass the 50% limitation can build search indexes on InnoDB tables, or use the boolean search mode explained in Section 12.10.2, “Boolean Full-Text Searches”.
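As a rough illustration of the boolean-mode suggestion, the query from the question could be rebuilt along these lines. This is a sketch under assumptions: it is written in Python rather than the asker's PHP, the helper name `build_photo_search` is hypothetical, the `%s` placeholders assume a MySQL driver that uses that parameter style, and `SQL_CALC_FOUND_ROWS` is left out for brevity. Binding the search term as a parameter also avoids building the SQL from `$_GET` by string concatenation.

```python
def build_photo_search(prefix, term, start, step):
    """Build the photo search in BOOLEAN MODE with bound parameters.

    `prefix` is the table-name prefix from trusted configuration; it is the
    only value interpolated into the SQL text. The search term and the
    paging values are returned separately as parameters.
    """
    match = "MATCH(title, description, model) AGAINST(%s IN BOOLEAN MODE)"
    sql = (
        "SELECT id_photo, title, description, model, data_ins, {m} AS score\n"
        "FROM {p}photo\n"
        "WHERE {m}\n"
        "ORDER BY score DESC\n"
        "LIMIT %s, %s"
    ).format(m=match, p=prefix)
    # The term is bound twice (score and filter), followed by offset and count.
    params = [term.strip(), term.strip(), int(start), int(step)]
    return sql, params


sql, params = build_photo_search("site_", " duck ", 0, 20)
```

Boolean mode does not sort by relevance on its own, which is why the explicit `ORDER BY score DESC` is kept.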

Aggregate most relevant results with MySQL's fulltext search across many tables

I am running fulltext queries on multiple tables on MySQL 5.5.22. The application uses InnoDB tables, so I have created a few MyISAM tables specifically for fulltext searches.
For example, some of my tables look like
account_search
===========
id
account_id
name
description
hobbies
interests
product_search
===========
id
product_id
name
type
description
reviews
As these tables are solely for fulltext search, they are denormalized. Data can come from multiple tables and is aggregated into the search table. Besides the ID columns, the rest of the columns are assigned to 1 fulltext index.
To work around the "50%" rule with fulltext searches, I am using IN BOOLEAN MODE.
So for the above, I would run:
SELECT *, MATCH(name, type, description, reviews) AGAINST('john') as relevance
FROM product_search
WHERE MATCH(name, type, description, reviews) AGAINST('john*' IN BOOLEAN MODE) LIMIT 10
SELECT *, MATCH(name, description, hobbies, interests) AGAINST('john') as relevance
FROM account_search
WHERE MATCH(name, description, hobbies, interests) AGAINST('john*' IN BOOLEAN MODE) LIMIT 10
Let's just assume that we have products called "john" as well :P
The problems I am facing are:
To get meaningful relevance, I need to use a search without IN BOOLEAN MODE. This means that the search is subject to the 50% rule and the word length rules. So, quite often, if most of the products in the product_search table are called john, their relevance is returned as 0.
Relevances between multiple queries are not comparable. (I think a relevance of 14 from one query does not equal a relevance of 14 from another different query).
Searches will not be just limited to these 2 tables, there are other "object types", for example: "orders", "transactions", etc.
I would like to be able to return the top 7 most relevant results of ALL object types given a set of keywords (1 search box returns results for ALL objects).
Given the above, what are some algorithms, or perhaps even better ideas, for getting the top 7?
I know I can use things like Solr and Elasticsearch. I have already tried them and am in the process of integrating them into the application, but I would like to be able to provide search for those who only have access to MySQL.
So after thinking about this for a while, I decided that the relevance ranking has to be done with 1 query within MySQL.
This is because:
Relevance between separate queries can't be compared.
It's hard to combine the contents of multiple searches together in meaningful ways.
I have switched to using 1 index table dedicated to search. Entries are inserted, removed, and updated in step with inserts, removals and updates to the real underlying data in the InnoDB tables (this is all automatic).
The table looks like this:
search
==============
id //id for the entry
type //the table the data came from
column //column the data came from
type_id //id of the row in the original table
content //text
There's a full-text index on the content column. It is important to realize that not all columns from all tables are indexed; only the things that I deem useful for search have been added.
Thus, it's just a simple case of running a query to match on content, retrieve what we have and do further processing. To process the final result, a few more queries would be required to ask the parent table for the title of the search result and perhaps some other meta data, but this is a workable solution.
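One way the "further processing" could look is sketched below. This is a guess at the shape of the approach, not the code actually used: the query mode (natural language), the choice to sum the relevance of all matching rows that belong to the same object, and the name `top_objects` are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed query against the single `search` table described above; the mode
# and the selected columns are illustrative choices.
SEARCH_SQL = (
    "SELECT type, type_id, "
    "MATCH(content) AGAINST(%s IN NATURAL LANGUAGE MODE) AS relevance\n"
    "FROM search\n"
    "WHERE MATCH(content) AGAINST(%s IN NATURAL LANGUAGE MODE)"
)


def top_objects(hits, limit=7):
    """Collapse per-column hits into per-object scores and keep the best ones.

    `hits` is an iterable of (type, type_id, relevance) rows. An object that
    matched in several columns gets the SUM of its row scores, which is an
    assumed ranking rule; taking the maximum would be another option.
    """
    totals = defaultdict(float)
    for obj_type, type_id, relevance in hits:
        totals[(obj_type, type_id)] += relevance
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Example rows as they might come back from SEARCH_SQL.
rows = [("product", 1, 2.0), ("account", 9, 3.0), ("product", 1, 1.5)]
print(top_objects(rows))  # [(('product', 1), 3.5), (('account', 9), 3.0)]
```

Because all rows live in one table and come from one query, their relevance values share the same word statistics, which is what makes ranking across object types possible here.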
I don't think this approach will really scale (updates and inserts will need to update this table as well), but I think it is a pretty good way to provide decent application wide search for smaller deployments of the application.
For scalability, use something like Elasticsearch, Solr or Lucene.

Text search in MySQL - Performance and Alternatives

I have a set of tables in MySQL like this (foreign keys referenced by [table_name]_id):
Articles(id, author_id, title, date, broad_search, ...)
Keywords(id, article_id, keyword (varchar))
Authors(id, name, ...)
Attachments(id, article_id, url, ...)
The table we are concerned about most is 'Keywords' so I am mentioning the indexes only for it:
id - Primary - BTREE
(article_id,keyword) - Unique - BTREE
keyword - BTREE
article_id - BTREE
Each Article has associated list of Keywords. The "broad_search" column in Articles states whether that particular article can be matched broadly with the keywords (broad_search=1) or if it has to be an exact match of the keyword(broad_search=0). I have a SELECT query which pulls a list of articles based on keywords, broad_search parameter and other filter criteria.
$sql = "SELECT *
FROM Keywords k, Attachments at, Articles ar, Authors a (2 more tables)
WHERE
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
AND at.article_id = ar.id
AND a.id = ar.author_id
... (more conditions)
LIMIT 20";
An article can be set to either broad match or exact match, and I'm trying to get a list of them based on a keyword.
Exact match is straightforward. But broad match has various cases which will not let me use a simple wild card pattern like '%search_term%'. An example:
Keywords for a broad match article = {books, used books, reading books, popular book}
search term = new books
Now, we cannot use MySQL wildcard string matching here: '%new books%' will not match any of the keywords, yet the article needs to be retrieved, because the search term contains one of the keywords ("books") as a substring and broad_search=1. So a broad match comes in 2 types: the search term is a substring of the keyword (search term "cars", keyword "used cars"), or the keyword is a substring of the search term (search term "used cars", keyword "cars").
If broad_search=0, do an exact match. If broad_search=1, match both cases:
((ar.broad_search=0 AND k.keyword = '$Keyword')
OR (ar.broad_search=1 AND (INSTR('$Keyword', k.keyword)>0 OR k.keyword like '%$Keyword%')))
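To make the two broad-match cases concrete, here is the same condition written as a small standalone function. It is only a restatement of the SQL logic above for illustration. The name `keyword_matches` is hypothetical, and the lowercasing is an assumption meant to mirror MySQL's default case-insensitive comparison.

```python
def keyword_matches(keyword, search_term, broad_search):
    """Mirror the WHERE condition: exact match, or substring in either direction."""
    keyword = keyword.lower()
    search_term = search_term.lower()
    if not broad_search:
        # broad_search=0: k.keyword = '$Keyword'
        return keyword == search_term
    # broad_search=1: INSTR('$Keyword', k.keyword) > 0
    #                 OR k.keyword LIKE '%$Keyword%'
    return keyword in search_term or search_term in keyword


keywords = ["books", "used books", "reading books", "popular book"]
matched = [k for k in keywords if keyword_matches(k, "new books", True)]
print(matched)  # ['books']
```

Neither direction of this test can use a regular B-tree index on `keyword`, since both amount to substring checks against every row.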
The query I wrote perfectly does the job. But the issue is with performance. The keywords table is very large with 100,000+ rows and keeps growing. Also, this is a high load app and kills my server due to the huge number of requests it receives.
I feel this is not the right way to perform a text search. I tried going through the MySQL docs on full-text search, but I did not quite understand its application or whether it fits my search criteria. I was also wondering whether Apache Lucene would be a better choice, but I haven't used it before, so I'm not really sure (this query runs in a PHP script).
How should I be implementing this? Is it an indexing issue, is the MySQL INSTR function inefficient, or should I be using a whole different approach?
MySQL isn't a search engine, it's a Relational Database Management System (RDBMS). However, you can use native MySQL tools to get full-text searching capabilities, such as setting up a search table as MyISAM and adding a FULLTEXT index to the columns you wish to search. You can read the MySQL docs for more info on how MySQL supports full-text searching.
Even if you get full-text search queries to work the way you want, you will still miss out on a whole host of features that a true search engine (Lucene) supports, such as facets, spatial searches, result boosting, weighting, etc. I'd suggest you read up on Apache Solr, as it supports all these features and many more. There is even a PHP Solr API which you can use to access a Solr instance.
I'm not saying to abandon MySQL altogether, but use it for its intended purpose: to persistently store data which can be queried, and which can be used to populate your search engine indices. Solr even has a built-in Data Import Handler, which lets you define a database query to run when you want to mass-import data from your MySQL database.
The learning curve is relatively high, as it is with learning most new technologies, but when you are done you will wonder how you ever got by without using a true Full-Text search engine.