MySQL InnoDB Text Search Options

Knowing full well that my InnoDB tables don't support FULLTEXT searches, I'm wondering what my alternatives are for searching text in tables? Is the performance really that bad when using LIKE?
I see a lot of suggestions saying to make a copy of the InnoDB table in question in a MyISAM table, run queries against THAT table, and match keys between the two, but I just don't think that's a pretty solution.
I'm not opposed to using some 3rd party solution, though I'm not a huge fan of the idea. I'd like to explore more of what MySQL can do on its own.
Thoughts?

If you want to do it right, you should probably go with Lucene or Sphinx from the very start:
It will allow you to keep your table structure.
You'll have a huge performance boost (think ahead).
You'll get access to a lot of fancy search functions.
Both Lucene and Sphinx scale amazingly well (Lucene powers Wikipedia and Digg; Sphinx powers Slashdot).

Using LIKE can only use an index when there is no leading %. Doing LIKE '%foo%' on a large table will be a huge performance hit. If I were you, I'd look into using Sphinx. It can build its index by slurping data out of MySQL using a query that you provide. It's pretty straightforward and was designed to solve your exact problem.
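To make the index point concrete, here is a rough sketch (the table and column names are made up for illustration):
CREATE TABLE articles (
    id    INT PRIMARY KEY,
    title VARCHAR(255),
    INDEX (title)          -- ordinary B-tree index on the text column
);
-- Can use the index on `title` (no leading wildcard):
SELECT id FROM articles WHERE title LIKE 'foo%';
-- Cannot use the index, so MySQL scans every row:
SELECT id FROM articles WHERE title LIKE '%foo%';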
There's also Solr, which is an HTTP wrapper around Lucene, but I find Sphinx to be a little more straightforward.

Like others, I would urge the use of Lucene, Sphinx, or Solr.
However, if those are out of the question and your requirements are simple, I've used the steps here to build simple search capability on a number of projects in the past.
That link is for Symfony/PHP, but you can apply the concepts to any language and application structure, assuming there is an implementation of a stemming algorithm available. However, if you don't use a data access pattern where you can hook in to update the index when a record is updated, it's not as easily doable.
A couple of downsides: if you want a single index table but need to index multiple tables, you either have to emulate referential integrity in your DAL or add an FK column for each different table you want to index. I'm not sure what you're trying to do, so that may rule it out entirely.

Related

Proper way to implement near-match searching in MySQL

I have a table on a MySQL database that has two (relevant) columns, 'id' and 'username'.
I have read that MySQL and relational databases in general are not optimal for searching for near matches on strings, so I wonder: what is the industry practice for implementing simple, but not exact-match, search functionality, for example when one searches for accounts by name on Facebook and non-exact matches are shown? I found Apache Lucene when researching this, but it seems to be used for indexing pages of a website, not necessarily arbitrary strings in a database table.
Is there an external tool for this use case? It seems like any SQL query for this task would require a full scan, even if it was simply looking for the inclusion of a substring.
In your situation I would recommend using Elasticsearch instead of a relational database. This search engine is a powerful tool for implementing search and analytics functionality.
Elasticsearch is also flexible and versatile, with a rich JSON-based query language and support for many different types of data.
And of course it supports near-match searching. As you said, MySQL and other relational databases aren't recommended for near-match searching; they weren't built for this purpose.
--------------UPDATE------------
If you want to do full-text search using a relational database, it's possible, but you might have problems scaling if your number of users increases a lot. Keep in mind that ElasticSearch is robust and powerful, so you can do many types of searches very easily in this search engine, but it can be more expensive too.
When I proposed ElasticSearch I was thinking about scaling the search. But I've been thinking about your problem since I answered, and I've come to understand that you only need a simple full-text search. To conclude: in the beginning you can use just a relational database for that, and move your search to ElasticSearch later if you need to scale or your search becomes complex.
Follow this guide to do full-text search in PostgreSQL: http://rachbelaid.com/postgres-full-text-search-is-good-enough/
There's another example in MySQL: https://sjhannah.com/blog/2014/11/03/using-soundex-and-mysql-full-text-search-for-fuzzy-matching/
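For a taste of what the PostgreSQL guide covers, a minimal sketch looks something like this (the table and column names are made up for illustration):
-- Match usernames against a search term with Postgres full-text search:
SELECT id, username
FROM users
WHERE to_tsvector('simple', username) @@ plainto_tsquery('simple', 'julien');
-- In practice you would also index the computed vector:
CREATE INDEX users_username_fts
    ON users USING gin (to_tsvector('simple', username));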
Like I said in the comments, it's a trade-off you must make. You can use ElasticSearch from the beginning, or you can choose another database and move to ElasticSearch in the future.
I also recommend this book to you: Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. I'm actually reading this book now, and it will help you understand this topic.
--------------UPDATE------------
To implement near-match searching in ElasticSearch you can use the fuzzy matching query. The fuzzy matching query allows you to control how lenient the matching should be. For example, take the query below:
{
  "query": {
    "fuzzy": {
      "username": {
        "value": "julienambrosio",
        "fuzziness": 2
      }
    }
  }
}
It will return near matches of "julienambrosio", such as "julienambrosio1", "julienambrosio12" or "juliembrosio".
You can adjust the level of fuzziness to control how lenient/strict the matching should be.
Before you build this example you should study ElasticSearch a bit more. There are a lot of courses on Udemy, YouTube, etc.
You can read more about it in the official docs.

Search/Navigation in Solr (or other faceted search engines), what's the proper way?

I'm a bit confused as to where my use of Mysql should end, and where SOLR should begin.
I've got a lot of relational data (just like an ecommerce site, like amazon).
I started by putting this into MySQL and ran into trouble with the size/speed, and indexing didn't help much with this much data. So I put all this data into SOLR, and it's really fast. In a way, SOLR is currently like a cache of my MySQL database (which contains multiple relational tables linked by IDs).
The thing is I'm confused about a bunch of things.
1) Is there any need for the mysql database? Can I just as easily edit and add data into the SOLR table? Is the Mysql database just adding more overhead?
2) How is it best to do the faceting and use that for search navigation? I currently unify the whole mysql database into one flat solr file. How is it best to do this when considering the many to many relationships an entry could have? Should this all be done in SOLR using the PathHierarchyTokenizer? Should I cram multiple facets into one field?
3) Is there any need to store the categories/facets in MYSQL so that SEO titles for these pages can be created? I'm guessing this could not be done in SOLR, since there is no real concept of a page as a facet? There appears to be a need for some kind of static store/cache of facets, where additional information could be added, and it seems to make sense that this would not be done in SOLR?
4) Or should I just use SOLR as a cache for my MYSQL db? So I get all the category menus from mysql, but when a URL query happens, it gets all the products from SOLR?
Would love to hear others thoughts on this, because while solr is nice and fast, there is a big overlap with DBs and I'm struggling to get my head around where each makes sense to use.
Is there any need for the mysql database?
If you need relations between things then an RDBMS is better than Solr, which has a flat, denormalized view of its world.
Solr is not generally considered a Source of Record. If you use it as such, all fields need to be "stored". Consider how you will back up and restore your data.
As a good starting point, use Solr for what it does well, search. Index just what you need to index, store what you need to store. Remember that Solr indexes can get very big with analysis and bigger indexes are less efficient.

Search function

I am building a simple search facility.
The idea is that it will search the fields of code, title, description and category.
It's quite simple to search for the category and code as it's just one word (%code%).
However, I am unsure how I would break down the title and description to search for any keywords the user enters.
Does anyone have any good techniques for this?
Thanks.
Given the small amount of info:
If you're using the MyISAM storage engine in MySQL, you can add FULLTEXT indexes and run a FULLTEXT search over them; see this link for more information on that.
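For example, a sketch using the fields from your question (the table and index names are made up for illustration):
-- FULLTEXT requires MyISAM on the MySQL versions discussed here:
CREATE TABLE items (
    id          INT PRIMARY KEY AUTO_INCREMENT,
    code        VARCHAR(50),
    title       VARCHAR(255),
    description TEXT,
    category    VARCHAR(100)
) ENGINE=MyISAM;
ALTER TABLE items ADD FULLTEXT INDEX ft_title_desc (title, description);
-- Natural-language search over both indexed fields:
SELECT * FROM items
WHERE MATCH(title, description) AGAINST('user keywords here');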
If, however, you're using InnoDB (which I'm also using on my databases), you can't enable that directly in MySQL.
You have a few options. You can split up the keywords yourself, search for entries matching one or more of those keywords, and check afterwards how many keywords matched to order the results. You can also include that in the query, but then you'd need a query for each keyword and would have to combine those results in a parent query.
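A rough single-query sketch of that first option (the keywords and names are illustrative; it relies on MySQL evaluating a boolean comparison to 0 or 1):
-- Rank rows by how many of the user's keywords appear in the title:
SELECT id, title,
       (title LIKE '%red%') + (title LIKE '%shoe%') AS keyword_matches
FROM items
WHERE title LIKE '%red%' OR title LIKE '%shoe%'
ORDER BY keyword_matches DESC;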
Another option, the one I finally chose because of its performance and flexibility, is to use a SOLR server together with the PHP solr client (see the PHP manual on it). The SOLR server will index the database given a few (fairly simple) configuration files and allow full-text searches on any indexed field. More info about setting up a SOLR server can be found in the SOLR manual: tutorial.
There are, of course, many, many other methods and tools. The above are just a few that I've used in the past or am still using (I'm really happy using SOLR, but that's something personal, I guess).
Good luck.
What you want is not something MySQL does very well. Yhn mentioned some options.
MySQL's FULLTEXT indexes are not popular for good reasons.
Breaking your texts down to keywords and forming indexed tables of them that link back to the original items can work. But doing that, in essence, is like starting to build your own search engine.
Much better search engines than you are likely to build are available. Yhn mentioned SOLR, which is very good, but I also want to mention Sphinx Search, which I use. SOLR has some interesting features that Sphinx doesn't have, but I had the impression Sphinx is easier to learn and get started with. It's worth your consideration.

Search Short Fields Using Solr, Etc. or Use Straight-Forward DB Index

My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
1) Create a MySQL table [word, entity_id], have 'word' indexed, and then query using
select entity_id from search_index where word like '[query_word]%'
This obviously requires me to break down each title into its words and add a row for each word; a rough sketch of this option follows below.
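A sketch of what that word-index option could look like (names are illustrative):
CREATE TABLE search_index (
    word      VARCHAR(100) NOT NULL,
    entity_id INT NOT NULL,
    INDEX (word)
);
-- The prefix LIKE (no leading %) can use the index on `word`:
SELECT entity_id FROM search_index WHERE word LIKE 'quer%';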
2) Use Solr or some similar search engine, which from my reading is more oriented towards full-text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future?
Thank you!
Pros of a Database-Only Solution:
Less setup and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (if you periodically sync Solr with your database) or maintenance procedure (if you opt to add/update entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr Solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Set up and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure you would need to create an entry for every word in your database, versus just doing a '%[query_word]%' search, if you are going to create records for each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database performance.
I can tell you we use Solr on our site and we love its performance, and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.

How fast is MySQL compared to a C/C++ program running in the server?

OK, I have a need to perform some intensive text manipulation operations,
like concatenating huge strings (say, 100 pages of standard text) and searching in them, etc. So I am wondering: would MySQL give me better performance for these specific operations, compared to a C program doing the same thing?
Thanks.
Any database is always slower than a flat-file program outside the database.
A database server has overheads that a program reading and writing simple files doesn't have.
In general the database will be slower. But much depends on the type of processing you want to do, the time you can devote to coding, and your coding skills. If the database provides the tools and functionality you need out of the box, then why not give it a try? That should take much less time than coding your own tool. If the performance turns out to be an issue, then write your own solution.
But I think that MySQL will not provide the text manipulation operations you want. In the Oracle world one has Text Mining and Oracle Text.
There are several good responses that I voted up, but here are more considerations from my opinion:
No matter what path you take: indexing the text is critical for speed. There's no way around it. The only choice is how complex you need to make your index for space constraints as well as search query features. For example, a simple b-tree structure is fast and easy to implement but will use more disk space than a trie structure.
Unless you really understand all the issues, or want to do this as a learning exercise, you are going to be much better off using an application that has had years of performance tuning.
That can mean a relational database like MySQL, even though full-text is a kludge in databases designed for tables of rows and columns. For MySQL, use the MyISAM engine to do the indexing and add a full-text index on a "blob" column. (AFAIK, the InnoDB engine still doesn't handle full-text indexing, so you need to use MyISAM.) For PostgreSQL you can use tsearch.
For a bit more implementation difficulty, though, you'll see the best performance by integrating indexing apps like Xapian, Hyper Estraier or (maybe) Lucene into your C program.
Besides better performance, these apps will also give you important features that MySQL full-text searching is missing, such as word stemming, phrase searching, etc.; in other words, real full-text query parsers that aren't limited to an SQL mindset.
Relational databases are normally not good at handling large text data. The performance strength of relational DBs is indexing and automatically generated query plans. Freeform text does not work well with this model.
If you are talking about storing plain text in one DB field and trying to manipulate the data, then C/C++ would be the faster solution. Put simply, MySQL is a much bigger C program than yours, so it must be slower at simple tasks like string manipulation :-)
Of course you must use the correct algorithm to reach a good result. There is a useful e-book about string searching algorithms, with examples included: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
P.S. Benchmark and give us report :-)
Thanks for all the answers.
I kind of thought that a DB would involve some overhead as well. But what I was thinking is that since my application requires the text to be stored somewhere in the first place anyway, wouldn't the entire process of extracting the text from the DB, passing it to the C program, and writing the result back into the DB be less efficient overall than processing it within the DB?
If you're literally talking about concatenating strings and doing a regexp match, it sounds like something that's worth doing in C/C++ (or Java or C# or whatever your favorite fast high-level language is).
Databases are going to give you other features like persistence, transactions, complicated queries, etc.
With MySQL you can take advantage of full-text indices, which will be hundreds of times faster than directly searching through the text.
MySQL is fairly efficient. You need to consider whether writing your own C program would mean more or less records need to be accessed to get the final result, and whether more or less data needs to be transferred over the network to get the final result.
If either solution will result in the same number of records being accessed and the same amount of data transferred over the network, then there probably won't be a big difference either way. If performance is critical, then try both and benchmark them (if you don't have time to benchmark both, then you probably want to go for whichever is easiest to implement anyway).
MySQL is written in C, so it is not correct to compare it to a C program; it is itself a C program.