Schema considerations when moving from an RDBMS (MySQL) to Solr

Whilst testing a Solr install for a future MySQL -> Solr migration, it's immediately apparent that the "rules" for what constitutes a good data structure, and by extension an efficient search, are very different in Solr when compared to an RDBMS like MySQL. The most obvious difference is that data isn't (or doesn't seem to be) normalised to the same degree.
Does anyone have any advice regarding the best way to go about making the transition from MySQL to Solr? Are there any established patterns for structuring data in a non-RDBMS (Solr specifically) that I should be learning about? Any common pitfalls to avoid? Is it simply a case of de-normalising related tables into objects?

First of all, you have to ask yourself if you want to:
1. migrate the whole thing to Solr, or
2. just use Solr as a complement for searching.
For anything other than trivial relational schemas, I'd recommend #2. The more heterogeneous data you have in one index, the less useful it is.

The Solr Enterprise Search Server? If it were me doing it, I would migrate only your documents over, not the entire database. Is that feasible?

Related

Using MongoDB vs MySQL with lots of JSON fields?

There is a microblogging type of application. The two database stores I have zeroed in on are:
MySQL or MongoDB.
I am planning to denormalize a lot of data, i.e. a vote on a post is stored in a voting table, and a count is also incremented in the main posts table. There are other actions involved with the post too (e.g. like, vote down).
If I use MySQL, some of the data is better suited to JSON than to a fixed schema, for faster lookups.
E.g.
POST_ID | activity_data
213423424 | { 'likes': {'count':213,'recent_likers' :
['john','jack',..fixed list of recent N users]} , 'smiles' :
{'count':345,'recent_smilers' :
['mary','jack',..fixed list of recent N users]} }
There are other components of the application as well, where usage of JSON is being proposed.
So, to update a JSON field, the sequence is:
Read the JSON in python script.
Update the JSON
Store the JSON back into MySQL.
It would have been a single operation in MongoDB with atomic operators like $push, $inc, $pull etc. Also, the document structure of MongoDB suits my data well.
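For illustration only, here is a minimal Python sketch of what such a single-operation update could look like. It only builds the update document; the helper name `build_activity_update`, the field names and the cap of 5 recent users are assumptions made for the example. With a driver such as PyMongo the result would be passed to something like `collection.update_one({"_id": post_id}, update)`, and the `$slice` modifier depends on the server version supporting it.

```python
def build_activity_update(kind, recent_field, user, keep_last=5):
    """Build a MongoDB update document for one like/smile style action.

    `kind` and `recent_field` are hypothetical field names that mirror
    the activity_data layout shown above.
    """
    return {
        # Atomically bump the counter, e.g. "likes.count".
        "$inc": {f"{kind}.count": 1},
        # Append the user and keep only the last `keep_last` entries.
        "$push": {
            f"{kind}.{recent_field}": {"$each": [user], "$slice": -keep_last}
        },
    }


update = build_activity_update("likes", "recent_likers", "john")
print(update)
```

Because the counter and the capped list are changed in one update document, no separate read step is needed on the application side.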
My considerations while choosing the data store.
Regarding MySQL:
Stable and familiar.
Backup and restore is easy.
Some future schema changes can be avoided using some fields as schemaless JSON.
May have to use layer of memcached early.
JSON blobs will be static in some tables like main Posts, but will be updated a lot in other tables like Post votes and likes.
Regarding MongoDB:
Better suited to store schema less data as documents.
Caching might be avoided till a later stage.
Sometimes the app may become write intensive; MongoDB can perform better at those points where unsafe writes are not an issue.
Not sure about stability and reliability.
Not sure about how easy it is to back up and restore.
Questions:
Shall we choose MongoDB if half of the data is schemaless, and would be stored as JSON if using MySQL?
Some of the data like main posts is critical, so it will be saved using safe writes, the counters etc
will be saved using unsafe writes. Is this policy based on importance of data, and write intensiveness correct?
How easy is it to monitor, backup and restore MongoDB as compared to MySQL? We need to plan periodic backups ( say daily ), and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application.
Stability, backup, snapshots, restoring and wider adoption, i.e. database durability, are the reasons pointing me to use MySQL as RDBMS+NoSQL even though a NoSQL document storage could serve my purpose better.
Please focus your views on the choice between MySQL and MongoDB considering the database design I have in mind. I know there could be better ways to plan database design with either RDBMS or MongoDB documents. But that is not the current focus of my question.
UPDATE : From MySQL 5.7 onwards, MySQL supports a rich native JSON datatype which provides data flexibility as well as rich JSON querying.
https://dev.mysql.com/doc/refman/5.7/en/json.html
So, to directly answer the questions...
Shall we choose MongoDB if half of the data is schemaless, and would be stored as JSON if using MySQL?
Schemaless storage is certainly a compelling reason to go with MongoDB, but as you've pointed out, it's fairly easy to store JSON in a RDBMS as well. The power behind MongoDB is in the rich queries against schemaless storage.
If I might point out a small flaw in the illustration about updating a JSON field, it's not simply a matter of getting the current value, updating the document and then pushing it back to the database. The process must all be wrapped in a transaction. Transactions tend to be fairly straightforward, until you start denormalizing your database. Then something as simple as recording an upvote can lock tables all over your schema.
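To make the transaction point concrete, here is a minimal sketch of the read-modify-write cycle wrapped in a transaction. It uses Python's built-in `sqlite3` module purely as a stand-in for MySQL so that it runs anywhere; the table name `post_activity` and the helper `record_like` are made up for the example. With MySQL/InnoDB the read would typically be a `SELECT ... FOR UPDATE` inside the transaction rather than SQLite's `BEGIN IMMEDIATE`.

```python
import json
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# the transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE post_activity (post_id INTEGER PRIMARY KEY, activity_data TEXT)"
)
conn.execute(
    "INSERT INTO post_activity VALUES (?, ?)",
    (213423424, json.dumps({"likes": {"count": 213, "recent_likers": ["john", "jack"]}})),
)


def record_like(conn, post_id, user, keep_last=5):
    """Read, update and store the JSON blob inside one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT activity_data FROM post_activity WHERE post_id = ?",
            (post_id,),
        ).fetchone()
        data = json.loads(row[0])
        likes = data["likes"]
        likes["count"] += 1
        # Keep only the most recent `keep_last` users.
        likes["recent_likers"] = (likes["recent_likers"] + [user])[-keep_last:]
        conn.execute(
            "UPDATE post_activity SET activity_data = ? WHERE post_id = ?",
            (json.dumps(data), post_id),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return data


result = record_like(conn, 213423424, "mary")
print(result)
```

Without the lock taken at the start, two concurrent upvotes could both read the same count and one increment would be lost.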
With MongoDB, there are no transactions. But operations can almost always be structured in a way that allows for atomic updates. This usually involves some dramatic shifts from the SQL paradigms, but in my opinion they're fairly obvious once you stop trying to force objects into tables. At the very least, lots of other folks have run into the same problems you'll be facing, and the Mongo community tends to be fairly open and vocal about the challenges they've overcome.
Some of the data like main posts is critical , so it will be saved using safe writes , the counters etc will be saved using unsafe writes. Is this policy based on importance of data, and write intensiveness correct?
By "safe writes" I assume you mean the option to turn on an automatic "getLastError()" after every write. We have a very thin wrapper over a DBCollection that allows us fine grained control over when getLastError() is called. However, our policy is not based on how "important" data is, but rather whether the code following the query is expecting any modifications to be immediately visible in the following reads.
Generally speaking, this is still a poor indicator, and we have instead migrated to findAndModify() for the same behavior. On the occasion where we still explicitly call getLastError() it is when the database is likely to reject a write, such as when we insert() with an _id that may be a duplicate.
How easy is it to monitor, back up and restore MongoDB as compared to MySQL? We need to plan periodic backups (say daily), and restore them with ease in case of disaster. What are the best options I have with MongoDB to make it a safe bet for the application?
I'm afraid I can't speak to whether our backup/restore policy is effective as we have not had to restore yet. We're following the MongoDB recommendations for backing up; @mark-hillick has done a great job of summarizing those. We're using replica sets, and we have migrated MongoDB versions as well as introduced new replica members. So far we've had no downtime, so I'm not sure I can speak well to this point.
Stability, backup, snapshots, restoring and wider adoption, i.e. database durability, are the reasons pointing me to use MySQL as RDBMS+NoSQL even though a NoSQL document storage could serve my purpose better.
So, in my experience, MongoDB offers storage of schemaless data with a set of query primitives rich enough that transactions can often be replaced by atomic operations. It's been tough to unlearn 10+ years worth of SQL experience, but every problem I've encountered has been addressed by the community or 10gen directly. We have not lost data or had any downtime that I can recall.
To put it simply, MongoDB is hands down the best data storage ecosystem I have ever used in terms of querying, maintenance, scalability, and reliability. Unless I had an application that was so clearly relational that I could not in good conscience use anything other than SQL, I would make every effort to use MongoDB.
I don't work for 10gen, but I'm very grateful for the folks who do.
I'm not going to comment on the comparisons (I work for 10gen and don't feel it's appropriate for me to do so), however, I will answer the specific MongoDB questions so that you can better make your decision.
Back-Up
Documentation here is very thorough, covering many aspects:
Block-Level Methods (LVM makes it very easy and quite a lot of folk do this)
With/Without Journaling
EBS Snapshots
General Snapshots
Replication (technically not back-up, however, a lot of folk use replica sets for their redundancy and back-up - not recommending this but it is done)
Until recently, there was no MongoDB equivalent of mylvmbackup, but a nice guy wrote one :) In his words:
Early days so far: it's just a glorified shell script and needs way more error checking. But already it works for me and I figured I'd share the joy. Bug reports, patches & suggestions welcome.
Get yourself a copy from here.
Restores
Formats etc
mongodump is completely documented here and mongorestore is here.
mongodump will not contain the indexes but does contain the system.indexes collection so mongorestore can rebuild the indexes when you restore the bson file. The bson file is the actual data whereas mongoexport/mongoimport are not type-safe so it could be anything (technically speaking) :)
Monitoring
Documented here.
I like Cacti but afaik, the Cacti templates have not kept up with the changes in MongoDB and so rely on old syntax so post 2.0.4, I believe there are issues.
Nagios works well but it's Nagios so you either love or hate it. A lot of folk use Nagios and it seems to provide them with great visibility.
I've heard of some folk looking at Zabbix but I've never used it so can't comment.
Additionally, you can use MMS, which is free and hosted externally. Your MongoDB instances run an agent and one of those agents communicates (using python code) over https to mms.10gen.com. We use MMS to view all performance statistics on the MongoDB instances and it is very beneficial from a high-level wide view as well as offering the ability to drill down. It's simple to install and you don't have to run any hardware for this. Many customers run it and some complement it with Cacti/Nagios.
Help information on MMS can be found here (it's a very detailed, inclusive document).
One of the disadvantages of a mysql solution with stored json is that you will not be able to efficiently search on the json data. If you store it all in mongodb, you can create indexes and/or queries on all of your data including the json.
Mongo's writes work very well, and really the only thing you lose vs mysql is transaction support, and thus the ability to rollback multipart saves. However, if you are able to commit your changes in atomic operations, then there isn't a data safety issue. If you are replicated, mongo provides an "eventually consistent" promise such that the slaves will eventually mirror the master.
MongoDB doesn't provide native enforcement or cascading of certain db constructs such as foreign keys, so you have to manage those yourself, either through composition (which is one of mongo's strengths) or through use of dbrefs.
If you really need transaction support and robust 'safe' writes, yet still desire the flexibility provided by nosql, you might consider a hybrid solution. This would allow you to use mysql as your main post store, and then use mongodb as your 'schemaless' store. Here is a link to a doc discussing hybrid mongo/rdbms solutions: http://www.10gen.com/events/hybrid-applications The article is from 10gen's site, but you can find other examples simply by doing a quick google search.
Update 5/28/2019
There have been a number of changes to both MySQL and MongoDB since this answer was posted, so the pros/cons between them have become even blurrier. This update doesn't really help with the original question, but I am doing it to make sure any new readers have a bit more recent information.
MongoDB now supports transactions: https://docs.mongodb.com/manual/core/transactions/
MySql now supports indexing and searching json fields:
https://dev.mysql.com/doc/refman/5.7/en/json.html

Try MongoDB or stick to MySQL

I am coding a web portal which stores a lot of user data and, later on, maybe documents. At the moment I use MySQL with many relations. I have read much about NoSQL and find that it is an interesting topic.
Is MongoDB or CouchDB ready to fully replace MySQL? Would something change in the usage of Doctrine in my application?
Is MongoDB or CouchDB ready to fully replace MySQL?
Sure, lots of people are storing their entire data set in MongoDB instead of MySQL and they are doing fine.
But I do not think that is the correct question. The key questions are really the following:
Does implementing MongoDB improve your system? Fewer queries, more flexibility, better performance?
Are you capable of implementing MongoDB at the appropriate scale?
MongoDB is a tool like many others and it does not solve all problems. In my experience, most systems are best implemented with some mix of databases. That would mean something like MongoDB for some data and SQL for other data.

With Solr, Do I Need A SQL db as well?

I'm thinking about using solr to implement spatial and text indexing. At the moment, I have entries going in to a MYSQL database as well as solr. When solr starts, it reads all the data from MYSQL. As new entries come in, my web servers write them to MYSQL and, at the same time, add documents to solr. More and more, it seems that my MYSQL implementation is just becoming a write-only persistent store (more or less, a backup for the data in solr): all of the reading of entries is done via solr queries. Really the only data being read from MYSQL is user info, which doesn't need to be indexed/searched.
A few questions:
Do I really need the MYSQL implementation or could I simply store all of my data in solr?
If solr only, what are the risks associated with this solution?
Thanks!
Almost always, the answer is yes. It needn't be a database necessarily, but you should retain the original data somewhere outside of Solr in the event you alter how you index the data in Solr. Unlike most databases, which Solr is not, Solr can't simply re-index itself. You could hypothetically configure your schema so that all your original data is marked as "stored" and then perhaps do a CSV dump and re-index that way, but I wouldn't recommend this approach.
Shameless plug: For any information on using Solr, I recommend my book.
I recommend a separate repository. MySQL is one choice. Some people use the filesystem.
You often want a different schema for searching than for storing. That is easy to do with a separate repository.
When you change the Solr schema, you need to reload the content. Unloading all the content from Solr can be slow. If it is already in a separate repository, then you don't need to dump it from Solr, you can overwrite what is there.
In general, making Solr be both a search engine and a repository really reduces your flexibility and options for making search the best it can be.
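As a rough sketch of the "different schema for searching than for storing" idea, the snippet below flattens normalised rows from a separate repository into flat search documents with a multi-valued field. It uses Python's built-in `sqlite3` as a stand-in for MySQL; the tables, the field names (`author_name`, `tags`) and the helper `build_search_docs` are all assumptions made for the example. A JSON array of this shape is the kind of payload Solr's JSON update handler accepts, but the actual posting step is left out here.

```python
import json
import sqlite3

# Normalised "storage" schema in the separate repository (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
    CREATE TABLE tags (post_id INTEGER, tag TEXT);
    INSERT INTO authors VALUES (1, 'alice');
    INSERT INTO posts VALUES (10, 'Moving to Solr', 1);
    INSERT INTO tags VALUES (10, 'solr');
    INSERT INTO tags VALUES (10, 'mysql');
""")


def build_search_docs(conn):
    """Denormalise posts, authors and tags into one flat document per post."""
    docs = []
    posts = conn.execute(
        "SELECT p.id, p.title, a.name FROM posts p "
        "JOIN authors a ON a.id = p.author_id ORDER BY p.id"
    ).fetchall()
    for post_id, title, author_name in posts:
        tag_rows = conn.execute(
            "SELECT tag FROM tags WHERE post_id = ? ORDER BY tag", (post_id,)
        ).fetchall()
        docs.append({
            "id": str(post_id),
            "title": title,
            "author_name": author_name,          # copied in, not joined at query time
            "tags": [tag for (tag,) in tag_rows],  # multi-valued field
        })
    return docs


docs = build_search_docs(conn)
payload = json.dumps(docs)  # what would be sent to the search index
print(payload)
```

Because the source rows stay in the repository, changing the search schema only means rerunning a function like this and reloading the index.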

Are there any advantages to using mongodb over mysql if said mongo db were used without embedded documents?

I'm using a php framework with a mongodb adapter that doesn't currently comprehend embedded documents as a Model/association relationship. After reading about mongodb for a few days it seems that you should use embedded documents for objects that are most often displayed together. This makes a lot of sense to me. It was said during one mongo schema talk that a collection of many small documents can negate some of the advantages of mongo over an RDBMS.
In searching stackoverflow and beyond, I can't seem to see what advantages exist, if any, when deploying mongodb into an environment where it is implemented with a reasonably normalized schema like you'd find in a traditional RDBMS.
Are there still advantages to using MongoDB when used in this way? Scaling? Performance?
If by "reasonably normalized" you mean that you need information from one table to filter the information from another table (i.e. a join), then mongo is going to work against you. In a SQL database you can easily get the info from multiple tables with a single query. In mongo you'll need multiple queries to get data from multiple collections. Any speed advantage mongo gives you in pulling from a single collection will quickly be negated by making multiple round trips to the database.
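A small sketch of what that looks like from the application side, using plain Python lists as stand-ins for two collections. Everything here (`users`, `posts`, the `find` helper) is hypothetical and only meant to show the two separate lookups that replace a single SQL join.

```python
# Hypothetical in-memory stand-ins for two collections.
users = [
    {"_id": 1, "name": "alice", "country": "IE"},
    {"_id": 2, "name": "bob", "country": "US"},
]
posts = [
    {"_id": 10, "author_id": 1, "title": "A"},
    {"_id": 11, "author_id": 2, "title": "B"},
    {"_id": 12, "author_id": 1, "title": "C"},
]


def find(collection, predicate):
    """Stand-in for one query, i.e. one round trip to the database."""
    return [doc for doc in collection if predicate(doc)]


def posts_by_country(country):
    # Round trip 1: filter on the first collection.
    author_ids = {u["_id"] for u in find(users, lambda u: u["country"] == country)}
    # Round trip 2: use the ids from the first result to filter the second.
    return find(posts, lambda p: p["author_id"] in author_ids)


titles = [p["title"] for p in posts_by_country("IE")]
print(titles)
```

In SQL the same result would come back from one query with a join, which is the cost being described above.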
Here are some advantages that MongoDb might give you (depending on your usecase):
Schemaless: More flexible if document structure is modified later.
Performance: MongoDB utilizes the available RAM very well, making it very performant.
Easy replication: Replication is easy to set up.
Sharding/Clustering: MongoDB is designed with sharding in mind. It is easy to set up and doesn't require experts.
Map/Reduce: If you happen to need this, there is built-in support.
Javascript: Intuitive to use if you already know Javascript (and who doesn't nowadays :) )
The MongoDB website has a good list of case studies of production deployments.
MongoDB has replication and sharding built in.
These are things that can be done with MySQL.
The downside is the learning curve and lack of programmers that know it.
If it's just for you, it would be fun as a learning project.
If this is for a larger project, you'll need to weigh the lack of MongoDB programmers and learning curve against popularity of MySQL.
I have been developing my University dissertation project with MySQL first then thought to give a shot to MongoDB to improve performance. Rewriting code was really easy and straightforward with Jongo. Production has been really smooth.
Unfortunately performance was terrible. I am not particularly skilled with MongoDB queries, but I believe I did quite a lot of research: I have used map reduce, I have used the aggregation framework, $limit and all that stuff... When at some stage I got the message "request heap use exceeded 10% of physical RAM", I really gave up and delivered the MySQL version.
For me it's really a shame because I was working so hard to make it work the best way possible with MongoDB (a University project stands out if you do something different). I think I will continue to study MongoDB in future, but for the moment I stick to performance (or better, what I can make perform).
I hope my comment will not offend MongoDB fans, but this is my experience.

Where to use MongoDB and where MySQL?

I am thinking about using one of two databases: MySQL or MongoDB. I am planning to store text and numeric data and I will be building my app in RoR.
I don't know which database system would be better for this purpose. Can you help me, please, with the criteria I should use to decide?
Let me cast this question within more general setting and into some historical perspective.
In the 60s they were asking whether to use hierarchical or network database
In the 70s the debate was relational against network
In the 80s Relational turned into SQL databases, so question mutated to SQL vs. network
In the 90s it was SQL against object databases
In 00s it was SQL against XML databases
Today we have SQL vs. NoSQL
Do you see a pattern here? Would you still bet some money on the SQL competitor, especially if it's nothing more than a glorified hash table?
I have also used MySQL and MongoDB with Mongoid in my projects, and I can say that if you want to keep binary data like images, mp3s and other stuff in your database, try Mongo; for other purposes you can use SQL databases. MongoDB has no fixed structure: you are working with a hash, so you can dynamically add and remove keys/columns.
In your case I would use MySQL.
In my opinion you should base your decision on the purpose of your application. Do you want to search through your text data? How will you define keys? There is little use in going for MySQL if you have to request each record and scan it. Even if there is functionality to do text scans in MySQL (does it have that?) MongoDB will probably do the job more efficiently. The other way around, if you are not going to use MongoDB's strong points then you might as well go for MySQL.
Another factor might be the deadline for implementing something. If you need it fast, don't waste time on learning something new. If you have time to experiment, figure out the key features you will most likely rely upon in your application.
I think if you need a hard structure you should use MySQL because that's its nature, but if you need something more dynamic, with no structure at all (schema-less), you should use MongoDB. I've never used MongoDB but I know it's more object/document oriented.
It would be helpful if you could provide some more detail. Would your data easily fit into a schema, or do you need the flexibility that a document store offers? What about auto-sharding, etc? Without more information, no one can give you advice that fits your needs. Lacking that, you can't hope for feedback any better than people's personal preferences, which is little more than a flamewar waiting to happen.