Concurrency handling using the filesystem VS an RDMBS (MySQL) - mysql

I'm building an English web dictionary where users can type in words and get definitions. I thought about this for a while and since the data is 100% static and I was only to retrieve one word at a time I was better off using the filesystem (ext3) as the database system instead of opting to use MySQL to store definitions. I figured there would be less overhead considering that you have to connect to MySQL and that in itself is a very slow operation.
My fear is that if my system were to get bombarded by let's say 500 word retrievals/sec, would I still be better off using the filesystem as the database? or will the increased filesystem reads hinder performance as opposed to something that MySQL might be doing under the hood?
Currently the hierarchy is segmented by first letter, second letter and third letter of the word. So if you were to search for the definition of "water", the script (PHP) will try to read from "../dict/w/a/t/water.word" (after cleaning up the word of problematic characters and lowercasing it)
Am I heading in the right direction with this or is there a faster solution (not counting storing definitions in memory using something like memcached)? Will the amount of files stored in any directory factor in performance? What's the rough benchmark for the number of files that I should store in a directory?

What are your grounds for your belief that this decision will matter to the overall performance of the solution? WHat does it do other than provide definitions?
Do you have MySQL as part of the solution anyway, or would you need to add it should you select it as the solution here?
Where is the definitive source of definitions? The (maybe replicated) filesystem, or some off line DB?
It seems like something that should be in a DB architecturally - filesystems are a strange place to map a large number of names to values (as is evidenced by your file system structure breaking things down by initial letters)
If it's in the DB, answering questions like "how many definitions are there?" is a lot easier, but if you don't care about such things for your application, this may not matter.
So to some extent this feels like looking to hyper optimise the performance of something whose performance won't actually make much difference to the overall solution.
I'm a fan of "make it correct, then make it fast", and "correct" would be more straightforward to achieve with a DB.
And of course, the ultimate answer would to be try both and see which one works best in your situation.
Paul

The type of lookups that a dictionary requires is exactly what a database is good at. I think the filesystem method you describe will be unworkable. Don't make it hard! Use a Database.

You can keep a connection pool around to speed up connecting to the DB.
Also, if this application needs to scale to multiple servers, the file system may be tricky to share between servers.
So, I third the suggestion. Use a DB.
But unless it's a fabulously large dictionary, caching would mean you're nearly alwys getting stuff from local memory, so I don't think this is going to be the biggest issue for your application :)

A DB sounds perfect for your needs.
I also don't see why memcached is relevant (how big is your data? Can't be more than a few GB... right?)

The data is approximately a couple of GBs. And my goal is speed, speed, speed (definitions will be loaded using XHR). The data as I said is static and is never going to change, and in no where would I using anything other than a single read operation for each request. So I'm having a pretty hard time getting convinced of using MySQL and all its bloat.
Which would be first to fail under high load using this strategy, the filesystem or MySQL? As for scaling replication is the answer since the data will never change and is only a couple of GBs.

Make it work first. Premature optimisation is bad.
Using a database enables easier refactoring of your schema, and you don't have to write an implementation of an index-based lookup, which in actual fact is nontrivial.
Saying that connecting to a database "is a very slow operation" overstates the problem. Actually connecting should not take very long, plus you can reuse connections anyway.
If you are worried about read-scaling, a 1G database is very small, so you can push readonly replicas of it to each web server and they can each read from their local copy. Provided the writes stay at a level which doesn't impact read performance, that gives you almost perfect read-scalability.
Moreover, 1G of data will fit into ram easily, so you can make it fast by loading the entire database into memory at startup time (before that node advertises itself to the load balancer).
500 lookups per second is trivially small. I would start worrying about 5000 per second per server, maybe. If you can't achieve 5000 key lookups per second on modern hardware (from a database which fits in RAM?!!), there is something seriously wrong with your implementation.

Agreeing that this is premature optimization, and that MySQL surely will be performant enough for this use case. I must add you can also use a file based database, like the very fast Tokyo Cabinet as a compromise. Sadly it doesn't have a PHP binding so you could use its grandfather, DBM.
That said, do not use a filesystem, there's no good reason to, as far as I can see.

Use a virtual Drive in your ram (google it for a how to for your distro) or if your data is provided by PHP use APC, memcache might work well with mysql. Personally I don't think the optimization you are doing here is really where you should be spending your time. 500 requests a second is massive, I think using mysql would give you better forward features for later. I think you need to concentrate on features and not speed if you want to differentiate yourself from your competitors. Also there are a few good talks about UI for the web, the server speed is only a small factor in the whole picture.
Good luck

You might also think about a no-sql database (like riak, mongo, or even redis) for something like this. They are all super-fast and help out with your replication. Mysql might be over-kill and hard-to-scale in an instance like this, but the other ones have some robust tools

Related

PostgreSQL and PQC

I'm using MySQL with Memcached, but I'm planning to start using PostgreSQL instead of MySQL.
I know Memcached can work with PostgreSQL, but I found this online: PostgreSQL Query Cache. I've seen a presentation online, and it says memcached is used in this. But I don't understand: memcached, I have to "program" in my PHP-code, and PQC, not?
What's it all about? Is PQC the same as memcached, and could it replace memcached? For example: I have a table with all countries. It never changes, so I want to cache this instead of retrieving it from the database every time. Will PQC do this automatically?
PQC is an implementation of caching that uses Memcached. It sits in front of your database server and caches query results for you. If you are running a lot of identical queries, this will make your database load a whole lot less and your return times a whole lot faster. It is not a substitute for good design of your application, but it can certainly help, and the cost of implementing it is extremely low since it takes advantage of an existing layer of abstraction.
Memcached is a lower level tool. A well designed application will leave you a nice place to put code between the business logic and the database layer to cache results, and this is where you put your memcached calls. In other words, if your code is designed to allow this abstraction, fantastic. Otherwise, you're looking at a lot more work to implement.

Is mongoDB or Cassandra better than MySQL for large datasets?

In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.
Concurrently, customer service operates on a replicated slave of the same database.
We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).
We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.
Given all this, what is our best option for a database?
I think you're going the wrong way about the problem.
Thinking if you drop in NoSQL that you'll get better performance is not really true. At the lowest level, you're writing and retrieving a fair chunk of data. That implies your bottleneck is (most likely) HDD I/O (which is the common bottleneck).
Sticking to the hardware you have momentarily and using a monolithic data storage isn't scalable and as you noticed - has implications when wanting to do something in real-time.
What are your options? You need to scale your server and software setup (which is what you'd have to do with any NoSQL anyway, stick in faster hard drives at some point).
You also might want to look into alternative storage engines (other than MyISAM and InnoDB - for example, one of better engines that seemingly turn random I/O to sequential I/O is TokuDB).
Implementing faster HDD subsystem would also aid to your needs (FusionIO if you have the resources to get it).
Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.
Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...
... so ...
... 120 mio records is not that much, and MySQL should easily be able to handle that. My guess is an IO bottleneck, or you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing into a new solution.
If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.
As much as I'm not a fan of MySQL once your data gets large, I have to say that you're nowhere near needing to move to a NoSQL solution. 120M rows is not a big deal: the database I'm currently working with has ~600M in one table alone and we query it efficiently. Managing that much data from an ops perspective is the problem; querying it isn't.
It's all about proper indexes and the correct use of them when joining, and secondarily memory settings. Find your slow queries (mysql slow query log FTW!), and learn to use the explain keyword to understand whey they are slow. Then tweak your indexes so your queries are efficient. Further, make sure you understand MySQL's memory settings. There are great pages in the docs explaining how they work, and they aren't that hard to understand.
If you've done both of those things and you're still having problems, make sure disk I/O isn't an issue. Then you should look in to another solution for querying your data if it is.
NoSQL solutions like Cassandra have a lot of benefits. Cassandra is fantastic at writing data. Scaling your writes is very easy--just add more nodes! But the tradeoff is that it's harder to get the data back out. From a cost perspective, if you have expertise in MySQl, it's probably better to leverage that and scale your current solution until it hits a limit before completely switching your underlying architecture.

Which is the right database for the job?

I am working on a feature and could use opinions on which database I should use to solve this problem.
We have a Rails application using MySQL. We have no issues with MySQL and it runs great. But for a new feature, we are deciding whether to stay MySQL or not. To simplify the problem, let's assume there is a User and Message model. A user can create messages. The message is delivered to other users based on their association with the poster.
Obviously there is an association based on friendship but there are many many more associations based on the user's profile. I plan to store some metadata about the poster along with the message. This way I don't have to pull the metadata each time when I query the messages.
Therefore, a message might look like this:
{
id: 1,
message: "Hi",
created_at: 1234567890,
metadata: {
user_id: 555,
category_1: null,
category_2: null,
category_3: null,
...
}
}
When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.
Due to the number of metadata attributes and the fact any number can be included in a query, creating SQL indexes here doesn't seem like a good idea.
Personally, I have experience with MySQL and MongoDB. I've started research on Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have done the research as to which database is the right one for my task.
And yes, the messages table can easily grow into millions or rows.
This is a very open ended question, so all we can do is give advice based on experience. The first thing to consider is if it's a good idea to decide on using something you haven't used before, instead of using MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me that it's terrible when you've painted yourself in a corner because you though that the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.
I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit you basically can't use things like secondary indexes, updates, and other things that make Mongo an otherwise awesomely nice tool (most of this has to do with its global write lock and the database format on disk, it basically sucks at concurrency and fragments really easily if you remove data).
I don't agree that HBase is out of the question, it doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically you will have to implement your own indexing which ever solution you choose.
What you should consider is things like if you need consistency over availability, or vice versa (e.g. how bad is it if a message is lost or delayed vs. how bad is it if a user can't post or read a message), or if you will do updates to your data (e.g. data in Riak is an opaque blob, to change it you need to read it and write it back, in Cassandra, HBase and MongoDB you can add and remove properties without first reading the object). Ease of use is also an important factor, and Mongo is certainly easy to use from the programmer's perspective, and HBase is horrible, but just spend some time making your own library that encapsulates the nasty stuff, it will be worth it.
Finally, don't listen to me, try them out and see how they perform and how it feels. Make sure you try to load it as hard as you can, and make sure you test everything you will do. I've made the mistake of not testing what happens when you remove lots of data in MongoDB, and have paid for that dearly.
I would recommend to look at presentation about Why databases suck for messaging which is mainly targeted on the fact why you shouldn't use databases such as MySQL for messaging.
I think in this scenario CouchDB's changes feed may come quite handy although you probably would also have to create some more complex views based on querying message metadata. If speed is critical try to also look at redis which is really fast and comes with pub/sub functionality. MongoDB with it's ad hoc queries support may also be a decent solution for this use case.
I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval time is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that to all the messages. You should consider how often that might happen, whether you'll actually need to update all the message records, and based on that whether it's worth paying the price for the sake of less queries (it probably is worth it, but that depends on the specifics of your system).
I agree with #Andrej_L that Hbase isn't the right solution for this problem. Cassandra falls in with it for the same reason.
CouchDB could solve your problem, but you're going to have to define views (materialized indices) for any metadata you're going to want to query. If the whole point of not using MySQL here is to avoid indexing everything, then Couch is probably not the right solution either.
Riak would be a much better option since it queries your data using map-reduce. That allows you to build any query you like without the need to pre-index all your data as in couch. Millions of rows are not a problem for Riak - no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can balance itself too, so this is really a non-issue).
So based on my own experience, I'd recommend Riak. However, unlike you, I've no direct experience with MongoDB so you'll have to judge it agains Riak yourself (or maybe someone else here can answer on that).
From my experience with Hbase is not good solution for your application.
Because:
Doesn't contain secondary index by default(you should install plugins or something like these). So you can effectively search only by primary key. I have implemented secondary index using hbase and additional tables. So you can't use this one in online application because of for getting result you should run map/reduce job and it will take much time on million data.
It's very difficult to support and adjust this db. For effective work you will use HBAse with Hadoop and it's necessary powerful computers or several ones.
Hbase is very useful when you need make aggregation reports on big amount of data. It seems that you needn't.
Due to the number of metadata attributes and the fact any number can
be included in a query, creating SQL indexes here doesn't seem like a
good idea.
It sounds like you need a join, so you can mostly forget about CouchDB till they sort out the multiview code that was worked on (not actually sure it is still worked on).
Riak can query as fast as you make it, depends on the nodes
Mongo will let you create an index on any field, even if that is an array
CouchDB is very different, it builds indexes using a stored Map-Reduce(but without the reduce) they call a "view"
RethinkDB will let you have SQL but a little faster
TokuDB will too
Redis will kill all in speed, but it's entirely stored in RAM
single level relations can be done in all of them, but differently for each.

Clustering, Sharding or simple Partition / Replication

We have created a Facebook application and it got a lot of virality. The problem is that our database started getting REALLY FULL (some tables have more than 25 million rows now). It got to the point that the app just stopped working because there was a queue of thousands and thousands of writes to be made.
I need to implement a solution for scaling this app QUICKLY but I'm not sure if I should pursue Sharding or Clustering since I'm not sure what are the pro's and con's of each of them and I was thinking of doing a Partition / Replication approach but I think that doesn't help if the load is on the writes?
25 million rows is a completely reasonable size for a well-constructed relational database. Something you should bear in mind, however, is that the more indexes you have (and the more comprehensive they are), the slower your writes will be. Indexes are designed to improve query performance at the expense of write speed. Be sure that you're not over-indexed.
What sort of hardware is powering this database? Do you have enough RAM? It's far easier to change these attributes than it is to try to implement complex RDBMS load balancing techniques, especially if you're under a time crunch.
Clustering/Sharding/Partitioning comes when single node has reached to the point where its hardware cannot bear the load. But your hardware has still room to expand.
This is the first lesson I learnt when I started being hit by such issues
Well, to understand that, you need to understand how MySQL handles clustering. There are 2 main ways to do it. You can either do Master-Master replication, or NDB (Network Database) clustering.
Master-Master replication won't help with write loads, since both masters need to replay every single write issued (so you're not gaining anything).
NDB clustering will work very well for you if and only if you are doing mostly primary key lookups (since only with PK lookups can NDB operate more efficient than a regular master-master setup). All data is automatically partitioned among many servers. Like I said, I would only consider this if the vast majority of your queries are nothing more than PK lookups.
So that leaves two more options. Sharding and moving away from MySQL.
Sharding is a good option for handling a situation like this. However, to take full advantage of sharding, the application needs to be fully aware of it. So you would need to go back and rewrite all the database accessing code to pick the right server to talk to for each query. And depending on how your system is currently setup, it may not be possible to effectively shard...
But another option which I think may suit your needs best is switching away from MySQL. Since you're going to need to rewrite your DB access code anyway, it shouldn't be too hard to switch to a NoSQL database (again, depending on your current setup). There are tons of NoSQL servers out there, but I like MongoDB. It should be able to withstand your write load without worry. Just beware that you really need a 64 bit server to use it properly (with your data volume).
Replication is for data backup not for performance so its out of question.
Well, 8GB RAM is still not that much you can have many hundred GB RAM with quite big hard disk space and MySQL would still work for you.
Clustering/Sharding/Partitioning comes when single node has reached to the point where its hardware cannot bear the load. But your hardware has still room to expand.
If you don't want to upgrade your hardware then you need to give more information about database design and if there are lot of joins or not so that above named options can be considered deeply.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates large number of text entries from a large number of users. This data must be frequently displayed back and often updated. At the moment I store the content inside a MySQL database and use NHibernate ORM layer to interact with the DB. I've got a table defined for users, roles, submissions, tags, notifications and etc. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle performing join operations fast enough.
This has made me think about non-relational database system such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with either. I've read some good reviews on MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate any one offering points or issues to consider when going with none relational dbms?
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availabililty of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarified database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (nevermind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here); the likelihood that your startup will ever hit the sort of data throughput to cripple a properly structured, well resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than choice of db, is how efficiently data can be cached. An effective cache can have dramatic effects on reducing db load and speeding up the general responsivness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
So far no one has mentioned PostgreSQL as alternative to MySQL on the relational side. Be aware that MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could tell you better the implications. On the other side, linking to a MySQL library is not the same that just connecting to the server and issue commands, you can do that with closed source.
PostreSQL is usually the best free replacement of Oracle and the BSD license should be more business friendly.
Since you prefer a non relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which one is your best database choice and you do not mention:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case more than a database you need a directory such as LDAP
The importance of of data distribution and/or replication. Most relational databases can be more or less well replicated, but because of their concept/design do not handle data distribution as well... but will you handle as much data that does not fit into one server or have access rights that needs special separate/extra servers?
However most people will go for a non relational database just because they do not like learning SQL
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amount of data, with proper indexes and sane database schema.
Why don't you try how MySQL behaves with bigger data amount in your setup? Make some scripts that generate realistic data to MySQL test database and and generate some load on the system and see if it is fast enough.
Only when it is not fast enough, first start considering optimizing the database and changing to different database engine.
Be careful with NHibernate, it is easy to make a solution that is nice and easy to code with, but has bad performance with large amount of data. For example whether to use lazy or eager fetching with associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what "n + 1 selects" -problem means.
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables, so you can usually just directly store your objects as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because mongo is fast enough for most sites as-is, you can avoid dealing the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.