Which is the best database for a Rails application?

I am developing a Rails application that will access a lot of RSS feeds or crawl sites for data (mostly news). It will be something like Google News but with a different approach, so I'll store a lot of news (or news summaries), classify them in different categories and use ranking and recommendation techniques.
Should I go with MySQL?
Is it worthwhile using IBM DB2 pureXML to store the documents?
Also, Ruby search implementations (Ferret, Ultrasphinx and others) are not needed if I choose DB2. Is that correct?
What are the advantages of PostgreSQL in this?
Does it make sense to use CouchDB in this scenario?
I'd like to choose the best option without over-complicating the solution, so I discarded the idea of using two different storage solutions (one for the news documents and another for the rest of the data). I'm also considering only "free" options, so I didn't look at Oracle or MS SQL Server.

pureXML is heavier than plain SQL, so you pay more for each roundtrip between the webserver and the DB. If you plan to have lots of users, I'd avoid it. You're better off letting your webserver cache the requests, thus avoiding creating XML (RSS) every time, if that is what you are thinking about.
I'd go with MySQL because it's really good at serving and it's totally free. PostgreSQL is too, but I haven't used it so I can't say.
CouchDB could make sense, but not if you plan on doing OLAP (online analytical processing) of your data; a normal RDBMS will be better at that.

Admitting firstly that I generally don't like mysql, I will say that there has been writing on this topic regarding postgres:
http://oldmoe.blogspot.com/2008/08/101-reasons-why-postgresql-is-better.html
This is always my choice when I need a pure relational database. I don't know whether a document database would be more appropriate for your application without knowing more about it. It does sound like it's something you should at least investigate.

MySQL is probably one of the best options out there; light, easy to install and maintain, multiplatform and free. On top of that there are some good free client tools.
Something to think about; because of the nature of your system you will probably have some tables that will grow quite a lot very quickly so you might want to think about performance.
Note that MySQL supports table partitioning (horizontal partitioning of rows) only from version 5.1 onwards.

It sounds to me like the application you will build can easily become a large-scale web app. I would suggest PostgreSQL, which is known for its reliability.
You can check out the following link -- Bob Ippolito from MochiMedia tells us why they ditched MySQL for PostgreSQL. Although the posts are more than 3 years old, the issues MySQL 5.1 has recently tend to prove that they are still relevant.
http://bob.pythonmac.org/archives/category/sql/mysql/

MySQL is good in production. I haven't used PostgreSQL for rails, but it's a good solution as well.
In the dev and test environments I'd start out with SQLite (default), and perhaps migrate to your target DB in the test environment as you move closer to completion.

Related

SQLite or MySQL for a read-mostly website?

Is it practical to use SQLite as the database backend for a website with, say, 300,000 unique visitors a month?
Writes to the database will be pretty limited: users signing up or logging in, adding comments, etc. The vast majority of the use will just be queries getting content based on a primary key in the URL. I'd like to know if SQLite can cope as a website backend and won't end up dramatically slower than MySQL.
I've seen this SO question and others, but they're not really the same and seem like they could be out-of-date now too. http://www.sqlite.org/whentouse.html suggests it'd be fine, but they might be a bit biased!
SQLite is a very cool product - and with HTML5 on the horizon, it's a good idea for any web developer to get acquainted with it. However you should bear in mind that sqlite does not scale well. If you ever need to share data across multiple webservers, it's going to be very difficult using sqlite as the data-substrate.
However, to ease development, you could look at PDO or the dbx_ functions in PHP, which provide an abstraction layer (i.e. the same code talks to all sorts of databases). There are some subtle variations between the way different systems implement things like transactions, as well as variations in SQL. If you do go down this route, I'd recommend maintaining your own abstraction layer between the PHP PDO/DBX calls and your application: think stored procedures implemented in PHP.
"300,000 unique visitors a month?"
aaaarrrgghhhh! pet hate. While you need to think about how much money your site will be making in order to plan a budget, this is not a useful metric for capacity planning. Really you want to look at expected hit/page rates.
I think you would be fine. Sqlite is able to support multi-threading just fine, and as you are mostly reading from it, there shouldn't be a problem. Also, if you are writing to it, it fully supports transactions as well. You have to remember though that it's still just one file and no service - so if you are going to cluster it you will be out of luck. Maybe you should check what problems exactly you have with mysql and solve them.
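To illustrate the transaction support mentioned above, here is a minimal sketch using Python's standard sqlite3 module. The `comments` table and the `add_comments` helper are made up for the example; the point is only that a failed write inside a transaction leaves the earlier rows of that transaction out of the database.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
conn.commit()


def add_comments(conn, bodies):
    """Insert all bodies atomically; undo every insert if any one fails.

    Hypothetical helper for illustration only.
    """
    try:
        # The connection as a context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with conn:
            for body in bodies:
                conn.execute("INSERT INTO comments (body) VALUES (?)", (body,))
        return True
    except sqlite3.IntegrityError:
        return False


ok = add_comments(conn, ["first", "second"])
failed = add_comments(conn, ["third", None])  # None violates NOT NULL
count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
print(ok, failed, count)
```

The second call inserts "third" before hitting the NOT NULL violation, yet the rollback removes it again, so only the two rows from the first call remain.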
sqlite is very fast, but it becomes difficult to use once you need to cluster. However pretty much all databases are difficult once you need to cluster. If you are read oriented, it shouldn't matter too much which you use. Just make sure you are using memcached.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be frequently displayed back and is often updated. At the moment I store the content inside a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once our database reaches a significant size. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time to learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput to cripple a properly structured, well resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can dramatically reduce db load and speed up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although someone with more legal experience could explain the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that from closed-source code.
PostgreSQL is usually the best free replacement for Oracle, and the BSD license should be more business friendly.
Since you prefer a non-relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type.
There are three things that really have a deep impact on your best database choice and that you do not mention:
The size of your data, or whether you need to store files within your database.
Whether you have a huge number of reads and very few (even restricted) writes. In that case, more than a database, you need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their design they do not handle data distribution as well. But will you handle so much data that it does not fit on one server, or have access rights that need separate extra servers?
However, most people go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amounts of data, given proper indexes and a sane database schema.
Why don't you test how MySQL behaves with a bigger amount of data in your setup? Write some scripts that load realistic data into a MySQL test database, generate some load on the system, and see if it is fast enough.
Only when it is not fast enough should you start considering optimizing the database, and only after that changing to a different database engine.
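A script of the kind suggested above could look roughly like this. It is only a sketch: it uses Python's built-in sqlite3 as a stand-in so it runs anywhere, whereas a real test would point at a MySQL test database through a MySQL driver. The `submissions` table, the row counts and the query are invented for the example, and the timings it prints depend entirely on your machine.

```python
import random
import sqlite3
import time

random.seed(42)  # reproducible fake data

# Stand-in database (assumption): swap in a connection to your MySQL test DB.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
)

N_ROWS = 20_000
N_USERS = 500

# Generate synthetic rows and load them in one transaction.
rows = [(random.randrange(N_USERS), f"text entry {i}") for i in range(N_ROWS)]
with conn:
    conn.executemany("INSERT INTO submissions (user_id, body) VALUES (?, ?)", rows)


def time_lookups(conn, n_queries=200):
    """Run n_queries per-user counts and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n_queries):
        user_id = random.randrange(N_USERS)
        conn.execute(
            "SELECT COUNT(*) FROM submissions WHERE user_id = ?", (user_id,)
        ).fetchone()
    return time.perf_counter() - start


without_index = time_lookups(conn)
conn.execute("CREATE INDEX idx_submissions_user ON submissions (user_id)")
with_index = time_lookups(conn)

total_rows = conn.execute("SELECT COUNT(*) FROM submissions").fetchone()[0]
print(f"{total_rows} rows: {without_index:.4f}s unindexed, {with_index:.4f}s indexed")
```

Running the same measurement before and after a schema or index change is the useful part; the absolute numbers matter much less than how they move as the data grows.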
Be careful with NHibernate: it is easy to build a solution that is nice and easy to code with but performs badly with large amounts of data. For example, whether to use lazy or eager fetching for associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
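For anyone unfamiliar with the "n + 1 selects" problem, here is a small illustration in plain SQL driven from Python's sqlite3 module. It is not NHibernate code; the tables, the data and the `run` helper that counts queries are all invented for the example. Lazy loading issues one query for the users plus one more per user, while the eager version gets the same result from a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE submissions (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT);
INSERT INTO users (id, name) VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO submissions (user_id, body) VALUES (1, 'a1'), (1, 'a2'), (2, 'b1');
""")

query_count = 0


def run(sql, params=()):
    """Execute a query and count it (hypothetical helper for this sketch)."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


def load_lazy():
    """One query for the users, then one more query per user: n + 1."""
    result = {}
    for user_id, name in run("SELECT id, name FROM users ORDER BY id"):
        bodies = run(
            "SELECT body FROM submissions WHERE user_id = ? ORDER BY id",
            (user_id,),
        )
        result[name] = [body for (body,) in bodies]
    return result


def load_eager():
    """A single join fetches users and their submissions together."""
    result = {}
    rows = run(
        "SELECT u.name, s.body FROM users u "
        "LEFT JOIN submissions s ON s.user_id = u.id "
        "ORDER BY u.id, s.id"
    )
    for name, body in rows:
        result.setdefault(name, [])
        if body is not None:  # users without submissions yield a NULL body
            result[name].append(body)
    return result


query_count = 0
lazy = load_lazy()
lazy_queries = query_count

query_count = 0
eager = load_eager()
eager_queries = query_count

print(lazy_queries, eager_queries)
```

With three users the difference is 4 queries versus 1; with 10,000 users on a listing page it is 10,001 versus 1, which is where the performance trouble starts.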
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables are, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
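To make the "blog post with comments" example concrete, here is a sketch of what such a nested object looks like as a single document. It deliberately uses only Python's json module and no MongoDB driver calls; the field names and the `add_comment` helper are made up, and a document store would persist a structure of this shape as one record instead of spreading it over a posts table and a comments table.

```python
import json

# One document holds the post and its nested comments (illustrative fields).
post = {
    "title": "Choosing a database",
    "tags": ["mysql", "mongodb"],
    "comments": [
        {"author": "ann", "text": "Measure first."},
    ],
}


def add_comment(post, author, text):
    """Append a comment to the nested list inside the document."""
    post["comments"].append({"author": author, "text": text})


add_comment(post, "bob", "Try both and compare.")

# Round-trip through JSON to show the whole object serializes as one unit.
serialized = json.dumps(post)
restored = json.loads(serialized)
print(len(restored["comments"]))
```

The trade-off is on the query side: fetching a post with its comments is a single read, but asking for "all comments by bob across all posts" is less natural than it would be with a separate comments table.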
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Any formal benchmarking of open source database software?

Are there any formal performance and stress test reports for open source databases, especially SQLite, MySQL and PostgreSQL?
I want to use SQLite on a server for its simple structure and because it is easy to embed. But I cannot find any pros and cons (by searching Google and Yahoo!) regarding the performance of these databases.
Please suggest.
I found this article. It has a disclaimer at the top about the age of the information. However, it may be some help to you.
Here is another article that seems a little more recent and up to date.
Seems from reading these that SQLite is quite adequate in terms of performance.
Sysbench is a great utility for benchmarking mysql and I believe has plugins or the capability to test PostgreSQL. Keep in mind that you're not going to get a simple number that says "DBMS A is faster than DBMS B" -- at best you can hope to get an idea of what kind of scaling you'll get for a particular type of workload that is hopefully similar to whatever workload you'll end up throwing at your system.
Regardless of performance, if you really know what you are doing with RDBMS software and need an open source solution, you'll probably want to go with PostgreSQL -- otherwise, stick with MySQL.
Benchmarks are not the most important factor when choosing a database.
I think SQLite and MySQL are quicker than Postgres or Firebird, but if you need specific features like CTEs, only a few databases have them, even though they are part of the SQL standard.
Benchmarking is hard. And expensive. And in the installations where most benchmarks are run, SQLite won't even be tested, because it's designed for completely different workloads and simply doesn't deal with that situation. (For example, any real benchmark will have clients running on different machines from the server, which SQLite AFAIK doesn't really do, whereas it does very well in the case where you have a single local client.)
You can always look at something like spec, for example http://www.spec.org/jAppServer2004/results/jAppServer2004.html that shows both pg and mysql at least. But beware that the hardware platforms are different (and that these tests are also not from today).
But the bottom line is that if you want to compare performance for your application, the only really relevant benchmark you can run is your own application in a testing environment.

Where to find a good reference when choosing a database?

I and two others are working on a project at the university.
In the project we are making a prototype of a MMORPG.
We have decided to use PostgreSQL as our database. The other databases we considered were MS SQL-server and MySQL.
Does somebody have a good reference which will justify our choice? (preferably written during the last year)
Someone recently recommended wikivs.com to me: MySQL vs. PostgreSQL. It is a quite detailed comparison of those two, and might be of help to you.
The most mentioned difference between MySQL and PostgreSQL is about your read/write ratio. If you read a lot more than you write, MySQL is usually faster; but if you do a lot of heavy updates to a table while other threads have to read it just as often, then the default locking in MySQL is not the best, and PostgreSQL can be a better choice, performance-wise.
IOW, PostgreSQL scales better with regard to DB writes.
That's why it's usually said that MySQL is best for webapps, while PostgreSQL is more 'enterprisey'.
Of course, the picture is not so simple:
InnoDB tables on MySQL have a very different performance behaviour.
At the load levels where PostgreSQL's better locks overtake MySQL's, other parts of your platform could be the bottlenecks.
PostgreSQL does comply better with standards, so it can be easier to replace later.
In the end, the choice has so many variables that no matter which way you go, you'll find some important issue that makes it the right choice.
Go with something that someone in your team has actual experience of using in production. All databases have issues which frequent users are aware of.
I cannot stress enough that someone in the team needs PRODUCTION experience of using it. Not using it for their homework, or to keep their list of CDs in.
All of these databases have their advantages and disadvantages. Which is better is dependent on:
Your team's experience
Your exact requirements
Your current environment, e.g. what's your app written in and what is it going to be hosted on?
SQL Server's main problem is the cost, unless you use the Express edition, which has performance limitations. However, it's very easy to use and has a number of good tools.
There is a comparison of the different sql versions at:
http://www.microsoft.com/sql/prodinfo/features/compare-features.mspx
You could then compare these with MySQL and PostgreSQL.
If the purpose of this comparison is a theoretical one for your essay, then you can reference web pages such as the Microsoft link and compare performance, cost, etc.
PostgreSQL has a page of case studies that you can quote and link to.
Really, any of the above would have worked for you. I personally like PostgreSQL. One solid advantage it has over MSSQL (even assuming you can get it for "free") is that PostgreSQL is non-proprietary. If you're going to introduce a dependency into your project (and re-inventing an RDBMS would be crazy), you don't want it to be a black box.

What do you need to take into consideration when deciding between MySQL and Amazon's SimpleDB for a RoR app?

I am just beginning to research the feasibility of using Amazon's SimpleDB service as the datastore for an RoR application I am planning to build. We will be using EC2 for the web server, and had planned to also use EC2 for the MySQL servers. But now the question is, why not use SimpleDB?
The application will (if successful) need to be very scalable in terms of # of users supported, will need to maintain a simple and efficient code base, and will need to be reliable.
I'm curious as to what the SO community's thoughts are on this.
The Ruby SimpleDB library is not as complete as ActiveRecord (the default Rails ORM), so many of the features you're used to will not be there.
On the plus side it's schemaless, scalable and works well with EC2.
If you're going to do things like full text search in your app then SimpleDB might not be the best choice; stick with ActiveRecord + Sphinx.
Well, SimpleDB doesn't use SQL, or even have tables, which means that it's a completely different beast from MySQL and other SQL-based things (http://aws.amazon.com/simpledb/). There are no constraints, triggers, or joins. Good luck.
Here's one way of getting it up and running:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=1242
(via http://rubyforge.org/projects/aws-sdb/ )
I suppose if you're never going to need to query the data outside of Rails, then SimpleDB may prove to be OK. But as it's not a first-class supported DB, you're likely to run into bugs that are difficult to fix. I wouldn't want to run a production Rails app on a semi-beta backend.
To me this just feels like, "Hey there are these neat tools out there, I should go build a project using them," rather than actually needing to use these specific tools. Maybe I'm just being crabby but it feels like a classic case of premature optimization. You're trying to use an external service that costs money for an app that isn't even written yet and you don't say you've got a guaranteed audience or one that will necessarily scale to a level that warrants that.
"The application will (if successful) need to be very scalable in terms of # of users supported", seriously, that describes half the Internet. It's the "if successful" part that's really the question. Just concentrate on building the application quickly and easily. The easiest way to do that is just use ROR as it is out-of-the-box so to speak. Pair it with a database, use ActiveRecord and get something built and attracting users.
In fact, I'll go further and say that EC2 is rather expensive for always-on servers. Deploy it on Slicehost or another hosted solution and then move it to EC2 if you need to in order to support demand.
I myself am very interested in this topic. Right now I'm on a cloud computing high so I'd say go with SimpleDB since it'll probably scale better in the sense that you'll have high availability, but that's just my thoughts as of the moment. Not from experience yet.
Edit: It's true that SimpleDB lacks many features of a "normal" database, but it should do the trick if you only need a simple CRUD layer to work against, which is my case.
There's a library called SimpleRecord that is a drop in replacement for ActiveRecord, but uses SimpleDB as its backend data store.