How to handle ever changing database structure - mysql

I am working on my masters thesis. For my implementation I have some MySQL tables.
With every iteration my table structure will differ (adding, removing columns etc). I was wondering what the best way is to handle the ever changing structure, without changing old code too much.
I read that Facebook has a version control system where they can specify exactly what kind of code/feature is available and for which user. As far as I know that must mean that they manage many different database structures at once. How does their old code work alongside their new code with respect to their database? Do they do a lot of testing? Did they abandon MySQL altogether?
Personally I like FriendFeed's solution a lot. However, I am wondering if it is too much for me.
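To make the idea concrete, here is a minimal sketch of schemaless storage inside a relational table, in the spirit of that approach as I understand it. It is not FriendFeed's actual code: Python's built-in sqlite3 stands in for MySQL so it runs without a server, and the table name `entities` and the helpers `save_entity` and `load_entity` are hypothetical.

```python
import json
import sqlite3
import uuid

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
# One generic table: its schema never changes, only the JSON body does.
conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")


def save_entity(fields):
    """Store an arbitrary dict as a JSON blob and return its generated id."""
    entity_id = uuid.uuid4().hex
    conn.execute(
        "INSERT INTO entities (id, body) VALUES (?, ?)",
        (entity_id, json.dumps(fields)),
    )
    return entity_id


def load_entity(entity_id):
    """Fetch an entity by id and decode its JSON body, or return None."""
    row = conn.execute(
        "SELECT body FROM entities WHERE id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


# Iteration 1 stores two fields...
old_id = save_entity({"title": "first run", "score": 0.71})
# ...iteration 2 adds a field, with no ALTER TABLE and no change to old rows.
new_id = save_entity({"title": "second run", "score": 0.83, "tags": ["tuned"]})

# Old and new code can coexist if optional fields are read with a default.
old_tags = load_entity(old_id).get("tags", [])
new_tags = load_entity(new_id).get("tags", [])
```

The trade-off is that the database can no longer filter or index on individual fields by itself, so any lookup by field needs an extra table that the application keeps up to date.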

Why would anyone try to use a relational database for non-relational data?
Forget about FriendFeed and take a look at NoSQL solutions. They are schemaless, they support horizontal scalability much better than any RDBMS, and most of them are free/open source.
I can recommend MongoDB. It's very fast and written in C++, but it is not ACID compliant.
You could also try RavenDB. It's not as fast as MongoDB, and inserts are very slow compared to Mongo, but it's ACID compliant. It is written in .NET.

Related

MySQL - Alternatives to "text" columns to store HTML Content

I have a large PHP application and a MySQL database for my company. All communications, such as emails and quotes, which are HTML based, are currently stored in "text" columns in MySQL tables. These tables have grown to be quite large now, 15-16GB in size, and are slow to restore, move to other servers etc.
Is there a more modern approach to storing information of this nature, perhaps a different kind of database altogether which is perfect for storing documents? Obviously this data would need to be retrievable from the core application etc.
I've heard of things such as MongoDB but don't know if these are designed to cater for this kind of storage.
Does anyone have any suggestions on how to move away from "Text" columns? I'm sure this must be an old fashioned technique by now.
Regards
James
It would appear that your question is "What options exist for storing data because I think TEXT columns in relational databases are old-fashioned?".
Regarding the question... there are lots of different tools and substrates for storing data - apart from processing, that's what computers do. As to which is the most appropriate, well, that depends on how you want to access the data, update it and dispose of it, how robust these mechanisms need to be, and how fast they need to operate.
"don't know if these are designed to cater for this kind of storage."
But the only thing you've told us about the storage is that there is a lot of it. You should certainly acquaint yourself with the CAP theorem. It looks deceptively simple. But you're going to get into a lot of problems mixing different databases in the same application if you don't know exactly what you're doing.
"and are slow to restore, move to other servers etc."
I don't know what you mean by "etc". But if your problem is the size of the data, then the solution is simply to partition the dataset - and it's probably more practical to do this in separate tables with a unified view than to use MySQL's table partitioning.
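As a rough illustration of "separate tables with a unified view", here is a small sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL, and the per-year split, the table names `emails_2022` / `emails_2023` and the helper `insert_email` are all assumptions made up for the example.

```python
import sqlite3

# sqlite3 stands in for MySQL; all table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emails_2022 (id INTEGER PRIMARY KEY, sent_on TEXT, html TEXT);
    CREATE TABLE emails_2023 (id INTEGER PRIMARY KEY, sent_on TEXT, html TEXT);
    -- The view lets read-only code keep querying a single "emails" relation.
    CREATE VIEW emails AS
        SELECT id, sent_on, html FROM emails_2022
        UNION ALL
        SELECT id, sent_on, html FROM emails_2023;
""")

KNOWN_YEARS = (2022, 2023)


def insert_email(email_id, sent_on, html):
    """Route a row to the table for its year (sent_on is 'YYYY-MM-DD')."""
    year = int(sent_on[:4])
    if year not in KNOWN_YEARS:
        raise ValueError(f"no table for year {year}")
    # The table name is built only from the validated integer above.
    conn.execute(
        f"INSERT INTO emails_{year} (id, sent_on, html) VALUES (?, ?, ?)",
        (email_id, sent_on, html),
    )


insert_email(1, "2022-11-03", "<p>old quote</p>")
insert_email(2, "2023-02-14", "<p>new quote</p>")
conn.commit()

# Reads go through the view; backups can handle each table separately.
all_ids = [row[0] for row in conn.execute("SELECT id FROM emails ORDER BY id")]
rows_2023 = conn.execute("SELECT COUNT(*) FROM emails_2023").fetchone()[0]
```

The point of the split is that old tables stop changing, so they only need to be dumped and moved once, while the view keeps existing read queries working.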
Depending on the size of the data, you might also consider compressing the HTML data before inserting it - after all, you can't index it.
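A minimal sketch of the compression idea, using Python's standard zlib module purely for illustration (the original application is PHP, and the helper names `compress_html` and `decompress_html` are made up here). The sample HTML is synthetic and very repetitive, so the ratio it produces says nothing about what real emails and quotes would achieve.

```python
import zlib


def compress_html(html: str) -> bytes:
    """Encode the HTML as UTF-8 and deflate it before it goes to the database."""
    return zlib.compress(html.encode("utf-8"), level=6)


def decompress_html(blob: bytes) -> str:
    """Reverse of compress_html, applied after reading the column back."""
    return zlib.decompress(blob).decode("utf-8")


# Synthetic, highly repetitive sample; real data will compress less well.
sample_html = (
    "<html><body>"
    + "<p>Dear customer, please find your quote below.</p>" * 200
    + "</body></html>"
)
blob = compress_html(sample_html)
ratio = len(blob) / len(sample_html.encode("utf-8"))
print(f"compressed to {ratio:.1%} of the original size")
```

Note that the compressed value is binary, so the column would have to be a BLOB type rather than TEXT.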
Yes, you could rewrite your application to run on a NoSQL database - if you exclusively use a trivial ORM, that should be straightforward. But it doesn't reduce the volume of data.

Where to use MongoDB and where MySQL?

I am thinking about using one of two databases - MySQL and MongoDB. I am planning to store text and numeric data and I will be building my app in RoR.
I don't know which database system would be better for this purpose - can you please help me with the criteria I should use to decide?
Let me cast this question within more general setting and into some historical perspective.
In the 60s they were asking whether to use hierarchical or network database
In the 70s the debate was relational against network
In the 80s Relational turned into SQL databases, so question mutated to SQL vs. network
In the 90s it was SQL against object databases
In the 00s it was SQL against XML databases
Today we have SQL vs. NoSQL
Do you see a pattern here? Would you still bet money on an SQL competitor, especially if it's nothing more than a glorified hash table?
I have also used MySQL and MongoDB (with Mongoid) in my projects, and I can say that if you want to keep binary data like images, MP3s and similar files in your database, try Mongo; otherwise you can use an SQL database. MongoDB has no fixed structure: you are working with a hash, so you can dynamically add and remove keys/columns.
In your case I would use MySQL.
In my opinion you should base your decision on the purpose of your application. Do you want to search through your text data? How will you define keys? There is little use in going for MySQL if you have to request each record and scan it. Even if there is functionality to do text scans in MySQL (does it have that?), MongoDB will probably do the job more efficiently. Conversely, if you are not going to use MongoDB's strong points, you might as well go for MySQL.
Another factor might be the deadline for implementing something. If you need it fast, don't waste time on learning something new. If you have time to experiment, figure out the key features you will most likely rely upon in your application.
I think that if you need a rigid structure you should use MySQL, because that is its nature, but if you need something more dynamic, with no structure at all (schema-less), you should use MongoDB. I've never used MongoDB, but I know it's more object/document oriented.
It would be helpful if you could provide some more detail. Would your data easily fit into a schema, or do you need the flexibility that a document store offers? What about auto-sharding, etc? Without more information, no one can give you advice that fits your needs. Lacking that, you can't hope for feedback any better than people's personal preferences, which is little more than a flamewar waiting to happen.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be displayed frequently and is often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once our database grows to a significant size. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time to learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarified database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput that would cripple a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can dramatically reduce db load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although someone with more legal experience could better explain the implications. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non-relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type.
There are three things that really have a deep impact on which one is your best database choice and you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... but will you handle so much data that it does not fit into one server, or have access rights that need special separate/extra servers?
However, most people go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amount of data, with proper indexes and sane database schema.
Why don't you try out how MySQL behaves with a larger amount of data in your setup? Write some scripts that generate realistic data in a MySQL test database, generate some load on the system, and see if it is fast enough.
Only if it is not fast enough should you start considering optimizing the database, and after that changing to a different database engine.
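A rough sketch of such a test script is below. It uses Python's built-in sqlite3 instead of MySQL so that it runs anywhere, and the `submissions` table, the row counts and the random text are placeholders rather than realistic data. The timings it prints therefore say nothing about MySQL; the same structure would have to be pointed at a real test database.

```python
import random
import sqlite3
import string
import time

random.seed(1)  # reproducible fake data

# sqlite3 is only a stand-in; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)


def random_text(length):
    """Return `length` random lowercase letters and spaces."""
    return "".join(random.choices(string.ascii_lowercase + " ", k=length))


def fill(rows, users):
    """Insert `rows` fake submissions spread over `users` user ids."""
    data = (
        (random.randrange(users), random_text(random.randint(50, 500)))
        for _ in range(rows)
    )
    conn.executemany("INSERT INTO submissions (user_id, body) VALUES (?, ?)", data)
    conn.commit()


def time_query(sql, params=()):
    """Run a query and return (number of rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    return len(rows), time.perf_counter() - start


fill(rows=5_000, users=500)
query = "SELECT id FROM submissions WHERE user_id = ?"
count_before, secs_before = time_query(query, (7,))
conn.execute("CREATE INDEX idx_submissions_user ON submissions (user_id)")
count_after, secs_after = time_query(query, (7,))
print(f"{count_before} rows: {secs_before:.6f}s unindexed, {secs_after:.6f}s indexed")
```

Increasing `rows` until the numbers resemble the expected production size is what turns this from a toy into an actual measurement.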
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code with but performs badly with a large amount of data. For example, whether to use lazy or eager fetching for associations should be considered carefully. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
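For anyone unfamiliar with the "n + 1 selects" problem, here is a small sketch of it in plain SQL terms. It is not NHibernate code: Python's built-in sqlite3 is used for illustration, the two tables are hypothetical, and `run` is a made-up helper that simply counts queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
    INSERT INTO submissions VALUES (1, 1, 'a1'), (2, 1, 'a2'), (3, 2, 'b1');
""")

query_count = 0


def run(sql, params=()):
    """Execute a query, count it, and return all rows."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def lazy_style():
    """One query for the users, then one more query per user: n + 1 in total."""
    result = {}
    for user_id, name in run("SELECT id, name FROM users ORDER BY id"):
        rows = run(
            "SELECT body FROM submissions WHERE user_id = ? ORDER BY id",
            (user_id,),
        )
        result[name] = [body for (body,) in rows]
    return result


def eager_style():
    """A single join fetches the users and their submissions together."""
    result = {}
    rows = run(
        "SELECT u.name, s.body FROM users u "
        "LEFT JOIN submissions s ON s.user_id = u.id "
        "ORDER BY u.id, s.id"
    )
    for name, body in rows:
        result.setdefault(name, [])
        if body is not None:  # users without submissions yield a NULL body
            result[name].append(body)
    return result


query_count = 0
lazy = lazy_style()
lazy_queries = query_count    # 1 query for users + 1 per user

query_count = 0
eager = eager_style()
eager_queries = query_count   # a single join
```

Both functions return the same data; the difference is only the number of round trips, which grows with the number of users in the lazy version.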
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables, so you can usually just directly store your objects as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
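To illustrate the "blog post with comments" case, here is a small sketch of what such a document looks like as a plain Python dict. No MongoDB server or driver is involved: the field names and the helper `add_comment` are invented for the example, and the JSON round trip only stands in for "store it and read it back unchanged".

```python
import json

# A blog post and its comments kept together as one nested document.
post = {
    "_id": "post-1",
    "title": "Choosing a database",
    "tags": ["mysql", "mongodb"],
    "comments": [
        {"author": "ann", "text": "Measure first."},
    ],
}


def add_comment(document, author, text):
    """Append a comment to the embedded list; no join table is involved."""
    document["comments"].append({"author": author, "text": text})


add_comment(post, "bob", "What about caching?")

# Round trip through JSON as a stand-in for saving and loading the document.
stored = json.dumps(post)
loaded = json.loads(stored)
comment_authors = [comment["author"] for comment in loaded["comments"]]
```

With a driver such as PyMongo, a dict like this is what would be handed to `insert_one`. In a relational schema the same data would typically be split across a posts table, a comments table and a tags table.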
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Cassandra or MySQL/PostgreSQL?

I have a huge database (kind of like Wordnet) and want to know if it's easier to use Cassandra instead of MySQL/PostgreSQL.
All my life I have used MySQL and PostgreSQL and could easily think in terms of relational algebra, but several weeks ago I learned about Cassandra and that it's used at Facebook and Twitter.
Is it more convenient?
What DBMSs are usually used nowadays to store social network data, relationships between objects, or something like Wordnet?
There is no silver bullet; everything is built to solve a specific problem and has its own pros and cons. It is up to you to decide what your problem statement is and which solution fits it best. Whether you use Cassandra (NoSQL) or MySQL (RDBMS) is driven entirely by your system's requirements. The points below should help you make a better decision when choosing a database.
Why to Use NoSQL
With RDBMS databases the choice is quite easy, because almost all the databases in this category (MySQL, Oracle, MS SQL, PostgreSQL) offer much the same kind of solution, oriented around the ACID properties. With NoSQL the decision becomes difficult, because every NoSQL database offers a different solution and you have to understand which one is best suited to your app/system requirements. For example, MongoDB fits use cases where your system demands a schema-less document store. HBase might fit search engines, analysing log data, or any place where scanning huge, two-dimensional, join-less tables is a requirement. Redis is built to provide in-memory access to a variety of data structures like trees, queues, linked lists etc. and can be a good fit for real-time leaderboards or pub-sub systems. Similarly, there are other databases in this category (including Cassandra) which fit different problems. Now let's move to the original questions and answer them one by one.
When to use Cassandra
As part of the NoSQL family, Cassandra offers a solution for cases where you need a very write-heavy system and want a responsive reporting system on top of the stored data. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in real time. You can refer to this blog post (http://blogs.shephertz.com/2015/04/22/why-cassandra-excellent-choice-for-realtime-analytics-workload/) to understand more about the use cases where Cassandra fits.
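The counting pattern in that web analytics example can be sketched in a few lines of plain Python. Nothing here talks to Cassandra: the in-memory `Counter` only stands in for counters that would live in the database, and the log entries, key layout and `record_hit` helper are invented for the illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# One counter per (dimension, bucket) pair, updated on every request.
hits = Counter()


def record_hit(timestamp, browser, ip):
    """Write path: bump the pre-aggregated counters for a single request."""
    hour = timestamp.strftime("%Y-%m-%d %H:00")
    hits[("hour", hour)] += 1
    hits[("browser", browser)] += 1
    hits[("ip", ip)] += 1


# Made-up request log entries.
log = [
    (datetime(2015, 4, 22, 10, 5, tzinfo=timezone.utc), "Firefox", "10.0.0.1"),
    (datetime(2015, 4, 22, 10, 40, tzinfo=timezone.utc), "Chrome", "10.0.0.2"),
    (datetime(2015, 4, 22, 11, 15, tzinfo=timezone.utc), "Firefox", "10.0.0.1"),
]
for entry in log:
    record_hit(*entry)

# Read path: a report is a direct lookup, with no scan over the raw log.
hits_at_ten = hits[("hour", "2015-04-22 10:00")]
firefox_hits = hits[("browser", "Firefox")]
```

The workload is many small increments on the write side and cheap key lookups on the read side, which is the shape of problem the answer above says Cassandra is suited to.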
When to Use an RDBMS instead of Cassandra/NoSQL
Cassandra is a NoSQL database and does not provide ACID guarantees or relational features. If you have a strong requirement for ACID properties (for example, for financial data), Cassandra would not be a good fit. You can obviously make it work, but you will end up writing lots of application code to handle the ACID properties and will lose badly on time to market. Managing that kind of system with Cassandra would also be complex and tedious for you.
There are many different flavours of "NoSQL" databases. If your application is really like Wordnet perhaps you should look at a graph database such as Neo4j.
I would suggest analysing your requirements.
If you are going to run on several clusters or machines, take NoSQL.
If your data model is complicated and requires efficient structures, take NoSQL (there are no limits on column types).
If your data fits on a few machines without scaling out, you don't need very high performance for many concurrent requests (as in a social network, for example, where lots of users send HTTP requests), and you don't expect scalability to be a concern, take an RDBMS (Postgres has some good functions and structures which you can use, like the array column type).
Cassandra should work better with large volumes of multi-purpose data.
Neo4j would be better for special structures such as graphs.
Cassandra and other NoSQL stores are being used for social based sites because of their need for massive write based operations. Not that MySQL and Postgres can't achieve this but NoSQL requires far less time and money, generally speaking.
Sounds like you may want to look at Neo4J though, just in terms of your object model needs.
They are all different products and they all have their pros and cons. What kind of problem do you have to solve?
Huge, as in TBs?

What database works well with 200+GB of data?

I've been using MySQL (with InnoDB, on Amazon RDS) because it's a sort of universal default, but it's been ridiculously under-performing, and tweaking it only delays the inevitable.
The data is mostly relatively short (<1 kB each) blobs of information about hundreds of millions of URLs. There is (or should be, MySQL cannot seem to handle it) a very high volume of inserts / updates / retrievals but few complex queries - not because complex queries wouldn't be useful, but because MySQL is so slow that it's far faster to get the data out, process it locally, and cache the results somewhere.
I can keep tweaking mysql and throwing more hardware at it, but it seems increasingly futile.
So what are the options? SQL/relational model/etc. optional - anything will do as long as it's fast, networked, and language-independent.
Have you done any sort of end-to-end profiling of your application and MySQL database? To provide better advice it would also be good to understand what improvements you have tried to implement, and your database structure. You haven't given a lot of information on how your MySQL database is configured either. It provides a lot of options for tuning.
You should pick up a copy of High Performance MySQL if you haven't already to learn more about the product.
There is no point in doing anything until you know what your problem is. NoSQL solutions can offer performance benefits but you have provided little evidence that MySQL is incapable of servicing your needs.
Well "Fast, networked and language-independent" + "few complex queries" brings to mind the various NoSQL solutions. To name a few:
MongoDB
CouchDB
Cassandra
And if that's not fast enough, there is always the wicked-fast Redis, which is my personal favorite atm. :) It is not a database per se, but it's good enough for most scenarios.
I am sure other people can list more NoSQL databases...
and there is always http://nosql-database.org/ .
Generally speaking, databases in this category are better and faster in your scenario because they have relaxed constraints, which makes frequent inserts/updates/retrievals easier and faster. But that requires you to think harder about your data model, and it is generally not possible to do SQL-style complex queries directly -- you'll instead write more pre-computed data or use a more denormalized design to account for the lack of complex queries.
But since complex queries are a minor concern in your case, I think NoSQL solutions are ideal for you.
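As a sketch of what "write more pre-computed data" can mean for URL records: instead of asking the store for all URLs on a host at read time, the application maintains that list itself on every write. Plain dicts stand in for the key-value store here, and the names `url_info`, `urls_by_host`, `put_url` and `urls_for_host` are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit

url_info = {}                    # primary data: url -> info blob
urls_by_host = defaultdict(set)  # pre-computed "query result": host -> urls


def put_url(url, info):
    """Write path: store the blob and update the denormalized host index."""
    url_info[url] = info
    urls_by_host[urlsplit(url).hostname].add(url)


def urls_for_host(host):
    """Read path: a single key lookup replaces a query over all records."""
    return sorted(urls_by_host.get(host, ()))


put_url("http://example.com/a", {"status": 200})
put_url("http://example.com/b", {"status": 404})
put_url("http://example.org/", {"status": 200})

example_com_urls = urls_for_host("example.com")
```

The cost is that every write now touches two keys, and the application is responsible for keeping them consistent.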
With the data you've given about your application's data and workload, it is almost impossible to determine whether the problem really is MySQL itself or something else. You seem to assume that you can throw any workload to a relational engine and it should handle it. Therefore the suggestions made by other commenters about analyzing the performance more carefully are valid in my opinion. Without more data (transactions / second etc.) any further analysis regarding other suitable engines is also futile.
I'm not sure I agree with the advice to jump ship on traditional databases. A relational database might not be the most efficient tool, but it is the one that is FAR more widely understood and used, and I strongly doubt you have a problem that can't be handled by an efficiently set up relational database.
Obvious answers are Oracle, SQL Server, etc., but it might just be that your database structure isn't right. I don't know much about MySQL but I do know it's used in some pretty big projects (eBay being noteworthy).