Solr only vs. Solr/MySQL solution - mysql

Currently I have a system, which is based solely on Solr. Which means, that I store all data in Solr (using SolrJ) with no other datastore involved. The problem is now, that I experience some performance issues. I thought, that it maybe could make sense to store in MySQL and then synchronize the data with Solr with e.g. the DataImportHandler. So that I have the reading operations on the Solr index and the main writing operations in MySQL and then sometimes only Solr-Writing operations when synchronizing with Solr.
The thing is that I expect hundreds of millions documents which should be stored and I don't really now if that the MySQL/Solr makes sense.
Is there another better solution? Maybe Master-Solr for writing and Solr-slaves for reading?
Update: What I forgot to say is, that also in case of a schema.xml change, the "storing data in MySQL" solution could be useful in my opinion, because then I can re-commit all the data without caring about Solr's self-stored data.

Its not preferable to use the same Solr instance for both reading and writing as the activities (with commit and optimize) on Solr during writing would heavily impact the read operations.
Master - Slave confgurations would be nicer approach, with master primarily for writes and slaves for read only purposes.
Slaves being periodically refreshed with the contents from Master. (So there would be some delay)
You can always scale by adding multiple slaves.
Using MySQL as a persistant store with Master-Slave Solr would be a best approach.
MySQL providing a stable data store, and would guard you against index corruption or some more issues which would result in data lost.
Using dataimport handler you can do it easily with incremental updates, but there would be more time tag for latest data to appear on slaves.
With this you can also use Index swapping for full refreshes.
In case the index grows up hugh to be be maintainable and has performance impact, you may want to check solr shards.

I also thought about the same issue: storing everything in solr or stor in mySql and index in Solr.
I decided to go the 2nd way: store with MySQL and index in solr.
The reason: handling of data (reading and writing data) in MySql is much better than by Solr. Also data import/export from/to MySql is supported/possible by lots of tools, out of the box.
Next Point: Backup. There are much more established ways for backing up an MySql DB than an Solr index.
Of course, for fulltext-search, Solr is much more better than MySql. So i decided, that everyone should have to work where he knows best.
For your Information: i'm talking about an medium Index: 4GB for some million documents.
//Edit: don't forgett, that some features requiere stared data in lucene (not only indexed), like highlighting. If you need this, you have to store the documents in solr (additional). An alternative way could be implementing those features on client-side. (I did it this way)

Related

Database for Full Text Search and 200M+ Records

Iam about to create a huge database with at least 200 Million entries.
The database needs to be searchable using full text and should be fast.
My database gets data from many different datasources and i need to import the new or updated data regularly.
Is it a good idea to store all my data in a relational database like mysql and then create a nosql document database (e.g. mongodb or elasticsearch) just for the purpose of searching or does that not provide any benefit in terms of
reliability and the prevention of redundant information?
I believe that keeping primary records in a SQL database and duplicating them to a noSQL database is a very common approach.
ElasticSearch has an ongoing status page about their resiliency. Even in the newest version, ElasticSearch can loose data in a number of different situations. A major change in the structure of an ElasticSearch index (such as adding analyzers) requires that you re-index all of the documents. This process is safer if you have another source for the documents. At the end of the day, ElasticSearch isn't designed to consistently store documents - I would only ever choose to use ElasticSearch as the primary store in situations where occasional data loss isn't a disaster.
Unlike ElasticSearch, MongoDB is designed to be resilient. You should be able to safely store documents in MongoDB. I've found trying to do full text searches in MongoDB can be a little painful, at least compared to ElasticSearch. In my opinion, for text search, the only advantage MongoDB has over MySQL's FULLTEXT is that it is distributed.
We are running ElasticSearch and MySQL right now - and the benefits greatly outweigh the hassles of extra infrastructure and dealing with replication between the two. We had previously attempted to use a noSQL solution as the primary datastore, with disastrous results. Running a ES in conjunction with a MySQL gets you the best of both worlds - consistency & safety of data in SQL, with the scalable, effective full text search in ES.
I don't know how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret and Solr), running some benchmarks.

MySQL DB replication hook to clean local cache

I have the app a MySQL DB is a slave for other remote Master DB. And i use memcache to do caching of some DB data.
My slave DB can be updated if there are updates in a Master DB. So in my application i want to know when my local (slave) DB is updated to invalidate related cached data and display fresh data i got from master.
Is there any way to run some program when slave mysql DB is updated ? i would then filter q query and understand if i need to clean a cache or not.
Thanks
First of all you are looking for solution similar to what Facebook did in their db architecture (As I remember they patched MySQL for this).
You can build your own solution based on one of these techniques:
Parse replication log on slave side, remove cache entry when you see update of data in the log
Load UDF (user defined function) for memcached, attach trigger on replica side (it will call UDF remove function) to interested tables inside MySQL.
Please note that this configuration is complicated during the support and maintenance. If you can sacrifice stale data in the cache maybe small ttl will help you.
As Kirugan says, it's as simple as writing your own SQL parser, and ensuring that you also provide an indexed lookup keyed to the underlying data for anything you insert into the cache, then cross reference the datasets for any DML you apply to the database. Of course, this will be a lot simpler if you create a simplified, abstract syntax to represent the DML, but thereby losing the flexibilty of SQL and of course, having to re-implement any legacy code using your new syntax. Apart from fixing the existing code, it should only take a year or two to get this working right. Basing your syntax on MySQL's handler API rather than SQL will probably save a lot of pain later in the project.
Of course, if you need full cache consistency then you need to ensure that a logical transaction now spans all the relevant datacentres which will have something of an adverse impact on your performance (certainly much slower than just referencing the master directly).
For a company like facebook, with hundreds of thousands of servers and terrabytes of data (and no requirement for cache consistency) such an approach to solving the problem leads to massive savings. If you only have 2 servers, a better solution would be to switch to multi-master replication, possibly add another database node, optimize the storage (e.g. switching to ssds / adding fast bcache) make sure you have session affinity to the dbms from the aplication (but not stcky sessions) and spend some time tuning your dbms, particularly its cache performance.

Using MongoDB vs MySQL with lots of JSON fields?

There is a microblogging type of application. Two main basic database stores zeroed upon are:
MySQL or MongoDB.
I am planning to denormalize lot of data I.e. A vote done on a post is stored in a voting table, also a count is incremented in the main posts table. There are other actions involved with the post too (e.g. Like, vote down).
If I use MySQL, some of the data better suits as JSON than fixed schema, for faster lookups.
E.g.
POST_ID | activity_data
213423424 | { 'likes': {'count':213,'recent_likers' :
['john','jack',..fixed list of recent N users]} , 'smiles' :
{'count':345,'recent_smilers' :
['mary','jack',..fixed list of recent N users]} }
There are other components of the application as well, where usage of JSON is being proposed.
So, to update a JSON field, the sequence is:
Read the JSON in python script.
Update the JSON
Store the JSON back into MySQL.
It would have been single operation in MongoDB with atomic operations like $push,$inc,$pull etc. Also
document structure of MongoDB suits my data well.
My considerations while choosing the data store.
Regarding MySQL:
Stable and familiar.
Backup and restore is easy.
Some future schema changes can be avoided using some fields as schemaless JSON.
May have to use layer of memcached early.
JSON blobs will be static in some tables like main Posts, however will be updated alot in some other tables like Post votes and likes.
Regarding MongoDB:
Better suited to store schema less data as documents.
Caching might be avoided till a later stage.
Sometimes the app may become write intensive, MongoDB can perform better at those points where unsafe writes are not an issue.
Not sure about stability and reliability.
Not sure about how easy is it to backup and restore.
Questions:
Shall we chose MongoDB if half of data is schemaless, and is being stored as JSON if using MySQL?
Some of the data like main posts is critical, so it will be saved using safe writes, the counters etc
will be saved using unsafe writes. Is this policy based on importance of data, and write intensiveness correct?
How easy is it to monitor, backup and restore MongoDB as compared to MySQL? We need to plan periodic backups ( say daily ), and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application.
Stability, backup, snapshots, restoring, wider adoption I.e.database durability are the reasons pointing me
to use MySQL as RDBMS+NoSql even though a NoSQL document storage could serve my purpose better.
Please focus your views on the choice between MySQL and MongoDB considering the database design I have in mind. I know there could be better ways to plan database design with either RDBMS or MongoDB documents. But that is not the current focus of my question.
UPDATE : From MySQL 5.7 onwards, MySQL supports a rich native JSON datatype which provides data flexibility as well as rich JSON querying.
https://dev.mysql.com/doc/refman/5.7/en/json.html
So, to directly answer the questions...
Shall we chose mongodb if half of data is schemaless, and is being stored as JSON if using MySQL?
Schemaless storage is certainly a compelling reason to go with MongoDB, but as you've pointed out, it's fairly easy to store JSON in a RDBMS as well. The power behind MongoDB is in the rich queries against schemaless storage.
If I might point out a small flaw in the illustration about updating a JSON field, it's not simply a matter of getting the current value, updating the document and then pushing it back to the database. The process must all be wrapped in a transaction. Transactions tend to be fairly straightforward, until you start denormalizing your database. Then something as simple as recording an upvote can lock tables all over your schema.
With MongoDB, there are no transactions. But operations can almost always be structured in a way that allow for atomic updates. This usually involves some dramatic shifts from the SQL paradigms, but in my opinion they're fairly obvious once you stop trying to force objects into tables. At the very least, lots of other folks have run into the same problems you'll be facing, and the Mongo community tends to be fairly open and vocal about the challenges they've overcome.
Some of the data like main posts is critical , so it will be saved using safe writes , the counters etc will be saved using unsafe writes. Is this policy based on importance of data, and write intensiveness correct?
By "safe writes" I assume you mean the option to turn on an automatic "getLastError()" after every write. We have a very thin wrapper over a DBCollection that allows us fine grained control over when getLastError() is called. However, our policy is not based on how "important" data is, but rather whether the code following the query is expecting any modifications to be immediately visible in the following reads.
Generally speaking, this is still a poor indicator, and we have instead migrated to findAndModify() for the same behavior. On the occasion where we still explicitly call getLastError() it is when the database is likely to reject a write, such as when we insert() with an _id that may be a duplicate.
How easy is it to monitor,backup and restore Mongodb as compared to mysql? We need to plan periodic backups (say daily), and restore them with ease in case of disaster. What are the best options I have with mongoDb to make it a safe bet for the application?
I'm afraid I can't speak to whether our backup/restore policy is effective as we have not had to restore yet. We're following the MongoDB recommendations for backing up; #mark-hillick has done a great job of summarizing those. We're using replica sets, and we have migrated MongoDB versions as well as introduced new replica members. So far we've had no downtime, so I'm not sure I can speak well to this point.
Stability,backup,snapshots,restoring,wider adoption i.e.database durability are the reasons pointing me to use MySQL as RDBMS+NoSql even though a NoSQL document storage could serve my purpose better.
So, in my experience, MongoDB offers storage of schemaless data with a set of query primitives rich enough that transactions can often be replaced by atomic operations. It's been tough to unlearn 10+ years worth of SQL experience, but every problem I've encountered has been addressed by the community or 10gen directly. We have not lost data or had any downtime that I can recall.
To put it simply, MongoDB is hands down the best data storage ecosystem I have ever used in terms of querying, maintenance, scalability, and reliability. Unless I had an application that was so clearly relational that I could not in good conscience use anything other than SQL, I would make every effort to use MongoDB.
I don't work for 10gen, but I'm very grateful for the folks who do.
I'm not going to comment on the comparisons (I work for 10gen and don't feel it's appropriate for me to do so), however, I will answer the specific MongoDB questions so that you can better make your decision.
Back-Up
Documentation here is very thorough, covering many aspects:
Block-Level Methods (LVM makes it very easy and quite a lot of folk do this)
With/Without Journaling
EBS Snapshots
General Snapshots
Replication (technically not back-up, however, a lot of folk use replica sets for their redundancy and back-up - not recommending this but it is done)
Until recently, there is no MongoDB equivalent of mylvmbackup but a nice guy wrote one :) In his words
Early days so far: it's just a glorified shell script and needs way more error checking. But already it works for me and I figured I'd share the joy. Bug reports, patches & suggestions welcome.
Get yourself a copy from here.
Restores
Formats etc
mongodump is completely documented here and mongorestore is here.
mongodump will not contain the indexes but does contain the system.indexes collection so mongorestore can rebuild the indexes when you restore the bson file. The bson file is the actual data whereas mongoexport/mongoimport are not type-safe so it could be anything (techically speaking) :)
Monitoring
Documented here.
I like Cacti but afaik, the Cacti templates have not kept up with the changes in MongoDB and so rely on old syntax so post 2.0.4, I believe there are issues.
Nagios works well but it's Nagios so you either love or hate it. A lot of folk use Nagios and it seems to provide them with great visiblity.
I've heard of some folk looking at Zappix but I've never used it so can't comment.
Additionally, you can use MMS, which is free and hosted externally. Your MongoDB instances run an agent and one of those agents communicate (using python code) over https to mms.10gen.com. We use MMS to view all performance statistics on the MongoDB instances and it is very beneficial from a high-level wide view as well as offering the ability to drill down. It's simple to install and you don't have to run any hardware for this. Many customers run it and some compliment it with Cacti/Nagios.
Help information on MMS can be found here (it's a very detailed, inclusive document).
One of the disadvantages of a mysql solution with stored json is that you will not be able to efficiently search on the json data. If you store it all in mongodb, you can create indexes and/or queries on all of your data including the json.
Mongo's writes work very well, and really the only thing you lose vs mysql is transaction support, and thus the ability to rollback multipart saves. However, if you are able to commit your changes in atomic operations, then there isn't a data safety issue. If you are replicated, mongo provides an "eventually consistent" promise such that the slaves will eventually mirror the master.
Mongodb doesn't provide native enforcement or cascading of certain db constructs such as foreign keys, so you have to manage those yourself (such as either through composition, which is one of mongo's strenghts), or through use of dbrefs.
If you really need transaction support and robust 'safe' writes, yet still desire the flexibility provided by nosql, you might consider a hybrid solution. This would allow you to use mysql as your main post store, and then use mongodb as your 'schemaless' store. Here is a link to a doc discussing hybrid mongo/rdbms solutions: http://www.10gen.com/events/hybrid-applications The article is from 10gen's site, but you can find other examples simply by doing a quick google search.
Update 5/28/2019
The here have been a number of changes to both MySQL and Mongodb since this answer was posted, so the pros/cons between them have become even blurrier. This update doesn't really help with the original question, but I am doing it to make sure any new readers have a bit more recent information.
MongoDB now supports transactions: https://docs.mongodb.com/manual/core/transactions/
MySql now supports indexing and searching json fields:
https://dev.mysql.com/doc/refman/5.7/en/json.html

With Solr, Do I Need A SQL db as well?

I'm thinking about using solr to implement spatial and text indexing. At the moment, I have entries going in to a MYSQL database as well as solr. When solr starts, it reads all the data from MYSQL. As new entries come in, my web servers write them to MYSQL and, at the same time, adds documents to solr. More and more, it seems that my MYSQL implementation is just becoming a write-only persisten store (more or less, a backup for the data in solr) - all of the reading of entries are done via solr queries. Really the only data being read from MYSQL is user info, which doesn't need to be indexed/searched.
A few questions:
Do I really need the MYSQL implementation or could I simply store all of my data in solr?
If solr only, what are the risks associated with this solution?
Thanks!
Almost always, the answer is yes. It needn't be a database necessarily, but you should retain the original data somewhere outside of Solr in the event you alter how you index the data in Solr. Unlike most databases, which Solr is not, Solr can't simple re-index itself. You could hypothetically configure your schema so that all your original data is marked as "stored" and then perhaps to a CSV dump and re-index that way, but I wouldn't recommend this approach.
Shameless plug: For any information on using Solr, I recommend my book.
I recommend a separate repository. MySQL is one choice. Some people use the filesystem.
You often want a different schema for searching than for storing. That is easy to do with a separate repository.
When you change the Solr schema, you need to reload the content. Unloading all the content from Solr can be slow. If it is already in a separate repository, then you don't need to dump it from Solr, you can overwrite what is there.
In general, making Solr be both a search engine and a repository really reduces your flexibility and options for making search the best it can be.

Solr vs. MySQL performance for autocomplete

In one of our applications, we need to hold some plain tabular data and we need to be able to perform user-side autocompletion on one of the columns.
The initial solution we came up with, was to couple MySQL with Solr to achieve this (MySQL to hold data and Solr to just hold the tokenized column and return ids as result). But something unpleasant happened recently (developers started storing some of the data in Solr, because the MySQL table and the operations done on it are nothing that Solr can not provide) and we thought maybe we could merge them together and eliminate one of the two.
So we had to either: (1) move all the data to Solr (2) use MySQL for autocompletion
(1) sounded terrible so I gave it a shot with (2), I started with loading that single column's data into MySQL, disabled all caches on both MySQL and Solr, wrote a tiny webapp that is able to perform very similar queries [1] on both databases, and fired up a few JMeter scenarios against both in a local and similar environment. The results show a 2.5-3.5x advantage for Solr, however, I think the results may be totally wrong and fault prone.
So, what would you suggest for:
Correctly benchmarking these two systems, I believe you need to
provide similar[to MySQL] environment to JVM.
Designing this system.
Thanks for any leads.
[1] SELECT column FROM table WHERE column LIKE 'USER-INPUT%' on MySQL and column:"USER-INPUT" on Solr.
I recently moved a website over from getting its data from the database (postgres) to getting all data from Solr. Unbelievable difference in speed. We also have autocomplete for Australian suburbs (about 15K of them) and it finds them in a couple of milliseconds, so the ajax auto-complete (we used jQuery) reacts almost instantly.
All updates are done against the original database, but our site is a mostly-read site. We used triggers to fire events when records were updated and that spawns a reindex into Solr of the record.
The other big speed improvement was pre-caching data required to render the items - ie we denormalize data and pre-calculate lots of stuff at Solr indexing time so the rendering is easy for the web guys and super fast.
Another advantage is that we can put our site into read-only mode if the database needs to be taken offline for some reason - we just fall back to Solr. At least the site doesn't go fully down.
I would recommend using Solr as much as possible, for both speed and scalability.