I have a MySQL database where large amounts of text are constantly added (about 10 pages of text per hour). The text is stored as plaintext in TEXT fields, and every row contains a page or two of text.
I need to do full-text search (search for a keyword in the text and do complex queries) on this database on a regular basis. I just need to search newly added text, and it is very important for added text to become searchable almost immediately (within a minute or two).
From what I've read, full-text search with MySQL is very inefficient. I know Lucene is an option, but I'm not sure yet how quickly it can index new text.
So what are my options? Is there a way to make MySQL more efficient? Is Lucene my best solution? Is there anything more appropriate?
Thanks!
I have benchmarked indexing times for Sphinx and Solr. Sphinx is far ahead of Solr when it comes to indexing algorithms (super fast indexing times and a small index size).
When you say 10 pages of text, it seems you don't even need Sphinx's real-time indexing. You can follow the main + delta indexing scheme in Sphinx (you can find it in the Sphinx documentation). It would be super fast and near real time. If you want more help on this, please feel free to ask; I'd be glad to explain.
Solr is great, but when it comes to optimized indexing algorithms Sphinx rocks! Try Sphinx.
Coming to your questions in the comment: Solr/Lucene supports incremental indexing (known as delta imports in their terminology), and it's quite easy to configure; however, it is pretty slow compared to the method used by Sphinx.
Main+delta is quick enough because you can create a temporary table, store your new text in it, and index that. According to the documentation: Sphinx supports "live" (almost real-time) index updates, which can be implemented using the so-called "main+delta" scheme. The idea is to set up two sources and two indexes, with one "main" index for the data and one "delta" index for the new documents.
Say, for example, you have 10 million records: you keep those as the main index, and all new documents get added to a separate table which acts as the delta. This new table can be indexed from time to time (say, every hour), and since you only have about 10 pages of text, the data becomes searchable within a few seconds. Once your new records are searchable, you can merge the main and delta indexes, which can be done without interfering with your searches. When the documents are merged, empty the new table, and an hour later you carry out the whole process again. I hope that makes sense; please feel free to ask if you have any questions.
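For concreteness, here is a minimal sketch of the SQL side of such a main+delta setup (table and column names are made up; the two sphinx.conf sources would run these queries, one for the main index and one for the delta):

-- tracks the highest document id already covered by the main index
CREATE TABLE sph_counter (
  counter_id INT PRIMARY KEY,
  max_doc_id INT NOT NULL
);

-- run just before (re)building the main index
REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents;

-- main index source: everything up to the recorded id
SELECT id, title, body FROM documents
WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);

-- delta index source, rebuilt every few minutes: only the new rows
SELECT id, title, body FROM documents
WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1);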
You have a couple of options:
Sphinx Search: Can integrate directly with your MySQL DB. Has support for real-time indexes, with limitations
Solr/Lucene: Feed it data via JSON or XML from your DB. Has rich querying capabilities. Current versions are not truly real-time without some edge builds; you have to re-index your data and commit it for changes to appear. Depending on your amount of data, you could do a commit every 10 minutes. This won't be an issue until you have 100K-1M+ documents, as Lucene is very fast at indexing. 10 pages/hour is pretty trivial.
ElasticSearch: Is Java-based like Solr/Lucene, but appears to be truly "near real time". It's engineered out of the box to be distributed and to support linear scale-out. You feed it data via JSON and query it via JSON.
It really depends on your needs and capabilities. Sphinx might be the easiest to get started with, but its real-time index limitations might not work for you.
I'm working on an analysis assignment. We got a partial dataset from the university library containing almost 300,000,000 rows.
Each row contains:
ID
Date
Owner
Deadline
Checkout_date
Checkin_date
I put all of this into a MySQL table and then started querying it for my analysis assignment; however, simple queries (SELECT * FROM table WHERE ID = something) were taking 9-10 minutes to complete. So I created an index on all of the columns, which made it noticeably faster (~30 seconds).
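For reference, the kind of index that made the difference is roughly the following (the table name here is made up; the query above just calls it "table"):

-- lets WHERE ID = ... use an index lookup instead of scanning ~300M rows
CREATE INDEX idx_id ON loans (ID);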
So I started reading similar issues, and people recommended switching to a "Wide column store" or "Search engine" instead of "Relational".
So my question is, what would be the best database engine to use for this data?
Using a search engine to search is IMO the best option.
Elasticsearch of course!
Disclaimer: I work at elastic. :)
The answer is, of course, "it depends". In your example, you're counting the number of records in the database with a given ID. I find it hard to believe that it would take 30 seconds in MySQL, unless you're on some sluggish laptop.
MySQL has powered an incredible number of systems because it is full-featured, stable, and has pretty good performance. It's bad (or has been bad) at some things, like text search, clustering, etc.
Systems like Elasticsearch are good with globs of text, but still may not be a good fit for your system, depending on usage. From your schema, you have one text field ("owner"), and you wouldn't need Elasticsearch's text-searching capabilities on a field like that (who ever needed to stem a user name?). Elasticsearch is also used widely for log files, which also don't need a text engine. It is, however, good with blocks of text and with clustering.
If this is a class assignment, I'd stick with MySQL.
I think I'd better ask this question instead of guessing around without any experiment.
We are planning to add a new column called code.
The code needs to have the following features:
It has to be unique.
It should preferably be a string; that makes it much easier for us to migrate data.
It has to be random, with enough key space to avoid collisions.
I am planning to just use UUID.
CREATE TABLE code (
  id CHAR(36),                -- UUID in its text form, e.g. '550e8400-e29b-41d4-a716-446655440000'
  UNIQUE INDEX index1 (id)
) ENGINE=InnoDB;              -- TYPE=innodb is deprecated syntax; modern MySQL requires ENGINE
Our operation behavior:
insert new code (at most 20K every day)
get row by code (very heavily; we may need to fetch every row in the database within a limited time, like 10 minutes).
Now I am a little worried about performance. We already have 400K rows in our database, and in the future it could grow to 10M or 30M.
Do you have any suggestion or see any problem?
BTW: I am not able to use an auto-incremented int because it is not randomized.
Go ahead. You won't run into any problems with either MySQL or UUIDs.
Randomly generated UUIDs have enough states that there will almost certainly be no collision. (There is still a chance, but it is about 1 in 10^31 in the case of 30M entries.)
On the other hand: why bother with a UUID (which splits your randomness into 5 groups for no good reason) when you can just as easily use SecureRandom to generate 16-byte values and use those?
According to this answer: How to store uuid as number?, it is faster to store them in binary form in MySQL.
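A minimal sketch of what that binary approach could look like, as an alternative to the CHAR(36) table above (names and the example UUID are illustrative):

CREATE TABLE code (
  id BINARY(16) NOT NULL,   -- the UUID stored as 16 raw bytes instead of 36 characters
  UNIQUE INDEX index1 (id)
) ENGINE=InnoDB;

-- strip the dashes and convert the hex text form to binary on the way in
INSERT INTO code (id)
VALUES (UNHEX(REPLACE('550e8400-e29b-41d4-a716-446655440000', '-', '')));

-- look a row up by code the same way
SELECT HEX(id) FROM code
WHERE id = UNHEX(REPLACE('550e8400-e29b-41d4-a716-446655440000', '-', ''));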
You want to read this, too: UUID performance in MySQL?
When using MySQL you should think about a backup strategy. MySQL will handle big databases quite easily, but exporting/importing 30M rows can take some time.
For MySQL row limits, see this question: How many rows in a database are TOO MANY?
I would suggest not using MySQL for these cases (at least not as the main storage that you actively search against).
There are a number of other technologies, such as:
Apache Lucene
Apache Solr
Elasticsearch
and others...
These tools are built for fast searches on big data sets. If the above look like overkill, you might simply use one of the NoSQL DBs; it will give you much better performance in your case.
There are a huge number of articles comparing the performance and limitations of all of these.
I am working on an application that needs to do interesting things with search, including full-text search, hit-highlighting, faceted-search, etc...
The dataset is likely to be between 3000-10000 records with 20-30 fields each, and is all stored in MySQL. The traffic profile of the site is likely to be on the small side of medium.
All of these requirements could be achieved (clunkily) in MySQL, but at what point (in terms of data-size and traffic levels) does it become worth looking at more focused technologies like Solr or Sphinx?
This question calls for a very broad answer if it is to be addressed in all aspects. There are certainly specifics that may make one system superior to another for a particular use case, but I want to cover the basics here.
I will deal entirely with Solr as an example of the several search engines that function roughly the same way.
I want to start with some hard facts:
You cannot rely on Solr/Lucene as a secure database. There is a list of reasons why, but they mostly consist of missing recovery options, lack of ACID transactions, possible complications, etc. If you decide to use Solr, you need to populate your index from another source, such as an SQL table. In fact, Solr is perfect for storing documents that include data from several tables and relations that would otherwise require complex joins to construct.
Solr/Lucene provides mind-blowing text analysis, stemming, full-text search scoring, and fuzziness functions. These are things you just cannot do with MySQL. In fact, full-text search in MySQL is limited to MyISAM, and scoring is very trivial and limited. Weighting fields, boosting documents on certain metrics, scoring results based on phrase proximity, matching accuracy, etc. ranges from very hard work to almost impossible.
In Solr/Lucene you have documents. You cannot really store relations and process them. Well, you can of course index the keys of other documents inside a multivalued field of some document, so this way you can actually store 1:n relations, and do it both ways to get n:n, but it's data overhead. Don't get me wrong, it's perfectly fine and efficient for a lot of purposes (for example, for a product catalog where you want to store the distributors for products and you want to search only parts that are available at certain distributors). But you reach the end of the possibilities with HAS / HAS NOT. You can almost not do something like "get all products that are available at at least 3 distributors".
Solr/Lucene has very nice faceting features and post-search analysis. For example: after a very broad search that got 40,000 hits, you can display that you would only get 3 hits if you refined your search to the combination of this field having this value and that field having that value. Stuff that would need additional queries in MySQL is done efficiently and conveniently.
So let's sum up
The power of Lucene is text searching and analysis. It is also mind-blowingly fast because of the inverted index structure. You can really do a lot of post-processing and satisfy other needs. Although it is document-oriented and has no "graph querying" like triple stores do with SPARQL, basic N:M relations are possible to store and query. If your application is focused on text searching, you should definitely go for Solr/Lucene, unless you have good reasons, like very complex, multi-dimensional range-filter queries, to do otherwise.
If you do not have text-search but rather something where you can point and click something but not enter text, good old relational databases are probably a better way to go.
Use Solr if:
You do not want to stress your database.
You want real full-text search.
You want lightning-fast search results.
I currently maintain a news website with 5 million users per month, with MySQL as the main datastore and Solr as the search engine.
Solr works like magic for full-text indexing, which is difficult to achieve with MySQL. A mix of MySQL and Solr can be used: MySQL for CRUD operations and Solr for searches. I previously worked with one of India's best real-estate online classifieds portals, which used Solr for search (and had previously used MySQL). The migration reduced search times manifold.
Solr can be easily integrated with MySQL:
Solr's full data import can be used to import data from MySQL tables into Solr collections.
Solr's delta import can be scheduled at short intervals to load the latest data from MySQL into Solr collections.
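As a rough illustration of what those imports run against MySQL (the real queries live in Solr's DIH data-config.xml; the table and column names here are made up):

-- full data import: index the whole table
SELECT id, title, body, updated_at FROM articles;

-- delta import: only rows changed since the previous run
-- (in data-config.xml the timestamp is supplied by ${dataimporter.last_index_time})
SELECT id FROM articles WHERE updated_at > '2015-01-01 00:00:00';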
My website stores several million entities. Visitors search for entities by typing words contained only in the titles. The titles are at most 100 characters long.
This is not a case of classic document search, where users search inside large blobs.
The fields are very short. Also, the main issue here is performance (and not relevance) seeing as entities are provided "as you type" (auto-suggested).
What would be the smarter route?
Create a MySQL table [word, entity_id], have 'word' indexed, and then query using:
select entity_id from search_index where word like '[query_word]%'
This obviously requires me to break down each title to its words and add a row for each word.
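For what it's worth, a minimal sketch of that first option (names are illustrative); the prefix LIKE can use the index on word, which is what would keep as-you-type lookups fast:

CREATE TABLE search_index (
  word      VARCHAR(100) NOT NULL,  -- one row per word per title
  entity_id INT NOT NULL,
  INDEX idx_word (word)
) ENGINE=InnoDB;

-- prefix match, e.g. for the partial input 'sear'
SELECT entity_id FROM search_index WHERE word LIKE 'sear%';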
Use Solr or some similar search engine, which from my reading are more oriented towards full text search.
Also, how will this affect me if I'd like to introduce spelling suggestions in the future?
Thank you!
Pros of a database-only solution:
Less setup and maintenance (you already have a database)
If you want to JOIN your search results with other data or otherwise manipulate them you will be able to do so natively in the database
There will be no time lag (as there would be if you periodically synced Solr with your database) and no extra maintenance procedure (as there would be if you added/updated entries in Solr in real time everywhere you insert them into the database)
Pros of a Solr solution:
Performance: Solr handles caching and is fast out of the box
Spell check - If you are planning on doing spell check type stuff Solr handles this natively
Setup and tuning of Solr isn't very painful, although it helps if you are familiar with Java application servers
Although you seem to have simple requirements, I think you are getting at having some kind of logic around search for words; Solr does this very well
You may also want to consider future requirements (what if your documents end up having more than just a title field and you want to assign some kind of relevancy? What if you decide to allow people to search the body text of these entities and/or you want to index other document types like MS Word? What if you want to facet search results? Solr is good at all of these).
I am not sure whether you need to create an entry for every word in your database versus just doing a '%[query_word]%' search, if you are going to create records for each word anyway. It may be simpler to just go with a database for starters, since the requirements seem pretty simple. It should be fairly easy to scale the database's performance.
I can tell you we use Solr on site and we love the performance and we use it for even very simple lookups. However, one thing we are missing is a way to combine Solr data with database data. And there is extra maintenance. At the end of the day there is not an easy answer.
I have a MySQL/Rails app that needs search. Here's some info about the data:
Users search within their own data only, so searches are narrowed down by user_id to begin with.
Each user will have up to about five thousand records (they accumulate over time).
I wrote out a typical user's records to a text file. The file size is 2.9 MB.
Search has to cover two columns: title and body. title is a varchar(255) column. body is column type text.
This will be lightly used. If I average a few searches per second that would be surprising.
It's running on a 500 MB CentOS 5 VPS.
I don't want relevance ranking or any kind of fuzziness. Searches should be for exact strings and reliably return all records containing the string. Simple date order -- newest to oldest.
I'm using the InnoDB table type.
I'm looking at plain SQL search (through the searchlogic gem) or full text search using Sphinx and the Thinking Sphinx gem.
Sphinx is very fast and Thinking Sphinx is cool, but it adds complexity, a daemon to maintain, cron jobs to maintain the index.
Can I get away with plain SQL search for a small scale app?
I think plain SQL search won't be a good choice, because when fetching TEXT-type columns in MySQL the request always falls through to the hard drive, no matter what the cache settings are.
You can use plain SQL search only with very small apps.
I'd prefer Sphinx for that.
I would start out simple -- chances are that plain SQL will work well, and you can always switch to full text search later if the search function proves to be a bottleneck.
I'm developing and maintaining an application with a search function with properties similar to yours, and plain SQL search has worked very well for me so far. I had similar performance concerns when I first implemented the search function a year or two ago, but I haven't seen any performance problems whatsoever yet.
Having used MySQL full-text search for about 4 years, and just now moving to Sphinx, I'd say that a regular MySQL search using the full-text boolean (i.e. exact) syntax will be fine. It's fast and it will do exactly what you want. The amount of data you will be searching at any one time will be small.
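A minimal sketch of what that looks like, using the title/body columns from the question (the table name, date column, and user id are made up; note that in the MySQL versions current at the time, a FULLTEXT index required a MyISAM table):

ALTER TABLE notes ADD FULLTEXT INDEX ft_title_body (title, body);

-- boolean mode with a quoted phrase gives exact phrase matching
SELECT id, title
FROM notes
WHERE user_id = 42
  AND MATCH(title, body) AGAINST ('"exact phrase"' IN BOOLEAN MODE)
ORDER BY created_at DESC;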
The only problem might be ordering the results. MySQL's full-text search can get slow when you start ordering things by (e.g.) date, as that requires searching the entire table rather than just the first N results it finds. That was ultimately the reason I moved to Sphinx.
Sphinx is also awesome, so don't be afraid to try it, but it sounds like the additional functionality may not be required in your case.