RDBMS vs. DynamoDB and serverless vs. non-serverless - mysql

I have a bit of a problem with the design of a planned application, specifically the choice of database engine and whether to go serverless.
The goal is a web application that talks to the database through a REST API. The REST API itself is really just CRUD operations, so a serverless approach (AWS Lambda) seems like a good fit to me. For that, the most efficient database to choose would probably be DynamoDB (NoSQL).
I am familiar with RDBMSs and have only a little knowledge of NoSQL databases.
The schema of the application is not yet finalized and should remain extendable, because new features may need to be implemented later. Because of this, I would rather use an RDBMS than a NoSQL database, since NoSQL databases apparently don't cope well with schema changes later on (at least that's what I've read over the last couple of hours).
Choosing, for example, an Amazon RDS MySQL database would be much more expensive, and I don't know how well it works with a serverless REST API.
So I'm at a point where I really don't know which services to use. Could I still use DynamoDB? The schema would probably be very relational.

DynamoDB doesn't have any concept of a schema, so the whole point about editing a schema is kind of unrelated (to DynamoDB). If by schema you mean objects with certain properties, then it depends on the use case. If you are OK with having objects in a table that don't share the same "schema", then it is extremely simple, as that is allowed by default. On the other hand, if you need all the objects to share the same set of attributes and you are going to change them frequently, then this is indeed not as easy and straightforward as with RDS.
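To make that first point concrete, here is a minimal boto3 sketch (the "users" table and its attributes are invented for illustration) showing that items in the same DynamoDB table only need to agree on the key, not on the rest of their attributes:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("users")  # assumes a table with partition key "user_id" already exists

# Two items, same table, completely different attribute sets -- DynamoDB accepts both.
table.put_item(Item={"user_id": "u-1", "name": "Alice", "email": "alice@example.com"})
table.put_item(Item={"user_id": "u-2", "name": "Bob", "signup_source": "mobile", "tags": ["beta"]})
```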
Next, if you have a relational schema - tables - and are planning to do some JOINs on them then DynamoDB is really not a good solution. DynamoDB is good for a specific type of use cases like storing sessions or something with similar (low) complexity. Writing more complex queries in DynamoDB can get very tedious and painful.
As for the price: well, I wouldn't really say that DynamoDB is that cheap. It seems like it at first glance, but if you dig deeper you will find that it is actually pretty expensive, mainly from the perspective of writes. You need to provision read and write capacity, and the more throughput you require, the more costly it gets (you can go with auto-scaling for burst traffic, but in the case of consistent traffic this will not help you that much). At larger scales, RDS (not Aurora) will cost only a fraction of what DynamoDB will cost you (assuming we are talking about a use case that can be handled by RDS).
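As a back-of-the-envelope illustration of the capacity-unit math (1 WCU covers one write per second of an item up to 1 KB, 1 RCU covers one strongly consistent read per second of an item up to 4 KB), here is a rough sketch; the dollar rates below are placeholders, not real AWS pricing, so check the current price list before drawing conclusions:

```python
WCU_RATE_PER_HOUR = 0.00065   # assumed, purely illustrative rate in USD per WCU-hour
RCU_RATE_PER_HOUR = 0.00013   # assumed, purely illustrative rate in USD per RCU-hour
HOURS_PER_MONTH = 730

writes_per_sec = 1000   # sustained writes of items up to 1 KB -> 1000 WCUs
reads_per_sec = 2000    # sustained strongly consistent reads of items up to 4 KB -> 2000 RCUs

monthly = (writes_per_sec * WCU_RATE_PER_HOUR + reads_per_sec * RCU_RATE_PER_HOUR) * HOURS_PER_MONTH
print(f"~${monthly:,.0f}/month for provisioned throughput alone")  # roughly $660 with these placeholder rates
```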
If you are worried about RDS integration with Lambda, the complexity is not that much bigger compared to DynamoDB. There are some considerations to take into account, such as the hard limit on Lambda execution time (currently 15 minutes) and the fact that RDS may be slower to respond (compared to DynamoDB), but if your query takes that long then you are either doing something wrong or misusing those tools.
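For what it's worth, the Lambda-to-RDS part can be as small as the sketch below. It assumes PyMySQL is bundled with the function, credentials come from environment variables (use Secrets Manager in practice), and the event is an API Gateway proxy event; all of those are assumptions, not requirements:

```python
import json
import os
import pymysql

# Opened outside the handler so warm invocations can reuse the connection.
connection = pymysql.connect(
    host=os.environ["DB_HOST"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
    database=os.environ["DB_NAME"],
    connect_timeout=5,
    cursorclass=pymysql.cursors.DictCursor,
)

def lambda_handler(event, context):
    user_id = event["pathParameters"]["id"]  # assumes an API Gateway proxy integration
    with connection.cursor() as cursor:
        cursor.execute("SELECT id, name, email FROM users WHERE id = %s", (user_id,))
        row = cursor.fetchone()
    return {"statusCode": 200 if row else 404, "body": json.dumps(row)}
```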
All in all, if you are comfortable with using RDS and don't need the millisecond latency provided by DynamoDB (or even microsecond latency if using DAX as well), then I would definitely go with RDS over DynamoDB in your case. Again, DynamoDB is not a general-purpose solution to every data-related problem, and more often than not I see it being heavily misused for stuff that can be easily handled by RDS.

Related

AWS MySQL RDS vs AWS DynamoDB [closed]

I've been using MySQL for a fair while now and I'm comfortable with its structure and SQL queries, etc.
Currently I'm building a new system in AWS and have been looking at DynamoDB. At the moment I only know a little about it.
Is one better than the other?
What are the advantages of DynamoDB?
What is the transition like from MySQL queries, etc., to this flat-style DB?
Really DynamoDB and MySQL are apples and oranges. DynamoDB is a NoSQL storage layer while MySQL is used for relational storage. You should pick what to use based on the actual needs of your application. In fact, some applications might be well served by using both.
If, for example, you are storing data that does not lend itself well to a relational schema (tree structures, schema-less JSON representations, etc.) that can be looked up against a single key or a key/range combination then DynamoDB (or some other NoSQL store) would likely be your best bet.
If you have a well-defined schema for your data that can fit well in a relational structure and you need the flexibility to query the data in a number of different ways (adding indexes as necessary of course), then RDS might be a better solution.
The main benefit for using DynamoDB as a NoSQL store is that you get guaranteed read/write throughput at whatever level you require without having to worry about managing a clustered data store. So if your application requires 1000 reads/writes per second, you can just provision your DynamoDB table for that level of throughput and not really have to worry about the underlying infrastructure.
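As a sketch of what "just provision your table for that level of throughput" looks like in practice with boto3 (table and key names are made up):

```python
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="events",
    KeySchema=[
        {"AttributeName": "device_id", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    AttributeDefinitions=[
        {"AttributeName": "device_id", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "N"},
    ],
    # Ask for the read/write rate you need; AWS manages the underlying cluster for you.
    ProvisionedThroughput={"ReadCapacityUnits": 1000, "WriteCapacityUnits": 1000},
)
```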
RDS has much of the same benefit of not having to worry about the infrastructure itself, however if you end up needing to do a significant number of writes to the point where the largest instance size will no longer keep up, you are kind of left without options (you can scale horizontally for reads using read replicas).
Updated note: DynamoDB does now support global secondary indexing, so you now have the capability to perform optimized lookups on data fields other than the hash key or the combination of hash and range keys.
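A lookup through such a global secondary index might look like the sketch below (the "email-index" GSI, the table and the attribute names are assumed for illustration, not taken from the question):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("users")
response = table.query(
    IndexName="email-index",  # a GSI on the non-key attribute "email", assumed to exist
    KeyConditionExpression=Key("email").eq("alice@example.com"),
)
items = response["Items"]
```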
We have just migrated all of our DynamoDB tables to RDS MySQL.
While using DynamoDB for specific tasks may make sense, building a new system on top of DynamoDB is really a bad idea. Best laid plans etc., you always need that extra flexibility from your DB.
Here are our reasons we moved from DynamoDB:
Indexing - Changing or adding keys on-the-fly is impossible without creating a new table.
Queries - Querying data is extremely limited, especially if you want to query non-indexed data. Joins are of course impossible, so you have to manage complex data relations in your code/cache layer (see the sketch after this list).
Backup - Such a tedious backup procedure is a disappointing surprise compared to the slick backup of RDS
GUI - bad UX, limited search, no fun.
Speed - Response time is problematic compared to RDS. You find yourself building elaborate caching mechanisms to compensate for it in places where you would have settled for RDS's internal caching.
Data Integrity - While the concept of fluid data structure sounds nice to begin with, some of your data is better "set in stone". Strong typing is a blessing when a little bug tries to destroy your database. With DynamoDB anything is possible and indeed anything that can go wrong does.
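Regarding the Queries point above, the sketch below shows what managing relations in the code layer tends to look like: two round trips and a merge in Python instead of a single SQL JOIN (table and attribute names are invented for illustration):

```python
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("orders")
customers = dynamodb.Table("customers")

def orders_with_customer(customer_id):
    order_items = orders.query(
        KeyConditionExpression=Key("customer_id").eq(customer_id)
    )["Items"]
    customer = customers.get_item(Key={"customer_id": customer_id}).get("Item", {})
    # The "join" happens here, in application code, not in the database.
    return [{**o, "customer_name": customer.get("name")} for o in order_items]
```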
We now use DynamoDB as a backup for some systems and I'm sure we'll use it in the future for specific, well defined tasks. It's not a bad DB, it's just not the DB to serve 100% of your core system.
As far as advantages go, I'd say Scalability and Durability. It scales incredibly and transparently and it's (sort of) always up. These are really great features, but they do not compensate in any way for the downside aspects.
You can read AWS's explanation about it here.
In short, if you mainly have lookup queries (and not join queries), DynamoDB (and other NoSQL DBs) is better. If you need to handle a lot of data, you will be limited when using MySQL (and other RDBMSs).
You can't reuse your MySQL queries or your data schema, but if you spend the effort to learn NoSQL, you will add an important tool to your toolbox. There are many cases where DynamoDB gives the simplest solution.
When using DynamoDB you should also know that items/records in DynamoDB are limited to 400 KB (see DynamoDB Limits). For many use cases this will not work, so DynamoDB will be good for some things but not all. The same goes for many of the other NoSQL databases.

Seeking clarification about mysql 5.6 memcache integration

I'm having trouble getting a clear understanding of what MySQL 5.6 is introducing w/r/t memcache.
As I understand it, memcache by itself is essentially a huge, shared, memory-resident hash table that is managed by a server, memcached. In particular, it knows nothing about a persistent data store, and offers no services in that regard. It simply knows about keys and values (like a Perl hash).
What I think MySQL 5.6 introduces is a NoSQL API, whereby MySQL clients can request data from the MySQL server by key, rather than by a SELECT statement. (And similarly, they can perform updates with key=value pairs.) MySQL uses memcached to cache these in memory as a performance boost, but also takes care of things like writing updates back to the database before they age out of the cache, etc.
In other words, the use of memcached is an implementation detail of the MySQL 5.6 NoSQL feature, and is not something the application programmer needs to be aware of.
I'd welcome any corrections or amplification to my understanding.
Thanks,
Chap
I think it's quite simple (from the official documentation):
I disagree with your last sentence: the application programmer really does have to be aware of the memcached plugin, because having it on board the MySQL server means they can decide (and may sometimes be forced) to access data through a memcached language interface or via the SQL interface.
To better understand the impact of this plugin on an app design, you should know that there are 3 configuration tables used by MySQL for proper memcached management; understanding how the "cache_policies" table works will shed some light on some of your doubts:
Table cache_policies specifies whether to use InnoDB as the data store of memcached (innodb_only), or to use the traditional memcached engine as the backstore (cache-only), or both (caching). In the last case, if memcached cannot find a key in memory, it searches for the value in an InnoDB table.
Here is the link: innodb-memcached-internals
The quote above means that, depending on what you decide for a specific key/value, you will have different application scenarios:
innodb_only -> means that you can query the data via the SQL interface or via a memcached interface (a sketch of this is below); here is a link to some memcached language interface examples: memcached-interfaces
cache-only -> means that you should query the data via the memcached interface only
caching -> means that you can use both interfaces (note that the storage mechanism changes slightly)
Of course, this last configuration decision is strictly related to your specific needs.
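To illustrate the innodb_only/caching case, here is a rough sketch of the "same data, two interfaces" idea. It assumes the memcached plugin is enabled, that the innodb_memcache.containers table maps keys onto an InnoDB table (demo_test is the sample mapping created by the plugin's setup script in the manual), and that pymemcache and PyMySQL are available; all of that is an assumption about your setup, not a given:

```python
import pymysql
from pymemcache.client.base import Client

# Key/value access over the memcached protocol (the plugin listens on port 11211 by default).
kv = Client(("db-host.example.com", 11211))
kv.set("some-key", "some serialized value")
print(kv.get("some-key"))

# The same rows remain reachable through the ordinary SQL interface.
conn = pymysql.connect(host="db-host.example.com", user="app", password="secret", database="test")
with conn.cursor() as cur:
    cur.execute("SELECT * FROM demo_test")  # the sample mapped table from the plugin's setup script
    print(cur.fetchall())
```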
I don't really have a complete answer for you I'm afraid, as I too am struggling to find the detail I require before toying around with it.
That said, however, there is one important point which I have managed to uncover that you seem to have missed: by accessing the InnoDB storage engine via the new plugin you are actually completely bypassing SQL and avoiding all the overhead that comes with it.
This of course makes it essentially a key/value store more akin to most NoSQL databases complete with all the drawbacks associated with them. i.e. no joins etc...
However, on the flip side, for many applications these days this is exactly what we want. There have been only a handful of real-world performance mentions that I have come across, but all seem to point to this implementation significantly outperforming MongoDB and other similar NoSQL solutions (how much truth is in it I do not know), with even one (relatively in-depth) comparison claiming as high as 700k qps on a commodity server (compared with around 100k on a well-tuned MySQL setup), which is incredible if true.
Resource here:
http://yoshinorimatsunobu.blogspot.co.uk/search/label/handlersocket
Anyway, sorry I can't be any more help, but it's food for thought at least!

SQL Server vs. NoSQL

So I have a website that could eventually get some pretty high traffic. My DB implementation is in SQL Server 2008 at the moment. I really only have 2 tables and a few stored procs. Most of the DB could be re-designed to work without joining (although it wouldn't make sense when I can join so easily within SQL Server).
I heard that sites like Digg and Facebook use NoSQL databases for a lot of their basic data access. Is this something worth looking into, or will SQL Server not really slow me down that bad?
I use paging on my site (although this might change in the future), and I also use AJAX'd data access for most of the "live" stuff, so it doesn't really seem to be a performance hindrance at the moment, but I'm afraid it will be as the data starts expanding exponentially.
Am I going to gain a lot of performance by moving to NoSQL? Honestly, right now I don't even completely understand NoSQL, so any tips on how this could help me would be welcome.
Thanks guys.
Actually, Facebook uses a relational database at its core, see SOCC Keynote Address: Building Facebook: Performance at Massive Scale. And so do many other web-scale sites, see Why does Quora use MySQL as the data store instead of NoSQLs such as Cassandra, MongoDB, CouchDB etc?. There is also a discussion of how to scale SQL Server to web-scale size, see How do large-scale sites and applications remain SQL-based?, which is based on MySpace's architecture (more details at Scale out SQL Server by using Reliable Messaging). I'm not saying that NoSQL doesn't have its use cases, I just want to point out that there are many shades of gray between white and black.
If you're afraid that your current solution will not scale then perhaps you should look at what are the factors that prevent scalability with your current solution. Test data is cheap to produce, load the 'exponentially increased' data volume and run your test harness, see where it cracks. None of the NoSQL solutions will bring magic off-the-shelf scalability, they all require you to understand how to use them effectively and deploy them correctly. And they also require you to test with large volumes if you want to ensure success at scale. Same for traditional relational solutions.
SQL Server scales pretty well. For example, Stack Overflow used it to serve you this very page. Facebook and Google might use a form of NoSQL, but even if you make it really big you're unlikely to rise to that level.
With a simple table structure and data that fits on one server, it doesn't matter much what platform you use. There are several possible reasons to need to move to NoSQL:
Data scaling - SQL works best when all the data fits on one server (up to a few TB). The reason a lot of NoSQL stores don't have joins is that they were designed not to require all the objects to be on one server.
Performance scaling - NoSQL stores do tend to be faster at handling high traffic, but not necessarily by enough to matter. You can improve SQL performance quite a lot with replication and caching as long as you aren't running into data size issues. Writes generally do have to run on the one server, but in most cases you will need to improve read performance long before write performance becomes an issue.
Complex data access - some types of queries simply don't fit well into a relational model. Graph and set stores work quite differently from relational databases so are a better fit for some applications.
Easier development - If you don't already have a SQL database and all the code to support it, using a schemaless datastore can save quite a bit of development time.
I don't think you have to move your database from SQL to NoSQL unless you are serving thousands of TB of data. If you properly normalize your tables, serve the data sensibly, and set up a proper archiving mechanism, it should work.
If you are still unsure about what to choose and how, then check this. Assuming you have decided to move to a NoSQL database, there are a lot of players on the market; just have a look at the list, which again depends on your needs and the type of data you have.
Am I going to gain a lot of performance my moving to NoSQL?
It depends.
Check out this article for 7 reasons when you DON'T want to use NoSQL. If none of them applies to your case, then read further.
The main advantage of document-based NoSQL for traditional enterprise needs is cheaper hosting at high scale, due to lower CPU usage when querying denormalised data (the most frequent kind of request). Key points:
The CPU goes nuts on JOINs and GROUP BYs in SQL queries, whereas a denormalised data structure implies fewer or no JOINs, hence less stress on the CPU.
CPU is the most expensive resource in the cloud, while storage is the cheapest. Denormalised data trades higher storage for lower CPU.
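In miniature, that trade looks like the sketch below (names are invented): the normalised layout pays with a JOIN and a GROUP BY at read time, while the denormalised document pays with extra, duplicated storage so the common read is a single key lookup:

```python
# Normalised: orders and order_items live in separate tables and are joined on demand, e.g.
#   SELECT o.id, SUM(i.price) FROM orders o JOIN order_items i ON i.order_id = o.id GROUP BY o.id;

# Denormalised: the same order stored as one document, with the total precomputed at write time.
order_document = {
    "_id": "order-1001",
    "customer": {"id": "c-7", "name": "Alice"},
    "items": [
        {"sku": "A-1", "price": 19.99},
        {"sku": "B-2", "price": 5.00},
    ],
    "total": 24.99,  # duplicated/derived data: more storage, but no JOIN or GROUP BY on read
}
```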
How to get there?
Master DDD (Domain-Driven Design).
Gain a good understanding of CQRS (Command Query Responsibility Segregation) and eventual consistency.
Understand your domain and business processes.
Design a model that is tuned to the access patterns.
Review.
Repeat steps 3 - 5.

What database systems should a startup company consider?

Right now I'm developing the prototype of a web application that aggregates a large number of text entries from a large number of users. This data must be displayed frequently and is often updated. At the moment I store the content in a MySQL database and use the NHibernate ORM layer to interact with the DB. I've got tables defined for users, roles, submissions, tags, notifications, and so on. I like this solution because it works well and my code looks nice and sane, but I'm also worried about how MySQL will perform once the size of our database reaches a significant number. I feel that it may struggle to perform join operations fast enough.
This has made me think about non-relational database systems such as MongoDB, CouchDB, Cassandra or Hadoop. Unfortunately I have no experience with any of them. I've read some good reviews of MongoDB and it looks interesting. I'm happy to spend the time and learn if one turns out to be the way to go. I'd much appreciate anyone offering points or issues to consider when going with a non-relational DBMS.
The other answers here have focused mainly on the technical aspects, but I think there are important points to be made that focus on the startup company aspect of things:
Availability of talent. MySQL is very common and you will probably find it easier (and more importantly, cheaper) to find developers for it, compared to the more rarefied database systems. This larger developer base will also mean more tutorials, a more active support community, etc.
Ease of development. Again, because MySQL is so common, you will find it is the db of choice for a great many systems / services. This common ground may make any external integration a little easier.
You are preparing for a situation that may never exist, and is manageable if it does. Very few businesses (never mind startups) come close to MySQL's limits, and with all due respect (and I am just guessing here), the likelihood that your startup will ever hit the sort of data throughput needed to cripple a properly structured, well-resourced MySQL db is almost zero.
Basically, don't spend your time ( == money) worrying about which db to use, as MySQL can handle a lot of data, is well proven and well supported.
Going back to the technical side of things... Something that will have a far greater impact on the speed of your app than the choice of db is how efficiently data can be cached. An effective cache can have a dramatic effect on reducing db load and speeding up the general responsiveness of an app. I would spend your time investigating caching solutions and making sure you are developing your app in such a way that it can make the best use of those solutions.
FYI, my caching solution of choice is memcached.
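For what it's worth, the cache-aside pattern being described is small enough to sketch here; pymemcache and the load_user_from_db helper are assumptions, not part of the original answer:

```python
import json
from pymemcache.client.base import Client

cache = Client(("127.0.0.1", 11211))

def get_user(user_id, load_user_from_db):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit: no database round trip
    user = load_user_from_db(user_id)             # cache miss: hit MySQL once...
    cache.set(key, json.dumps(user), expire=300)  # ...then keep the result warm for 5 minutes
    return user
```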
So far no one has mentioned PostgreSQL as an alternative to MySQL on the relational side. Be aware that the MySQL libs are pure GPL, not LGPL. That might force you to release your code if you link to them, although maybe someone with more legal experience could tell you the implications better. On the other hand, linking to a MySQL library is not the same as just connecting to the server and issuing commands; you can do that with closed source.
PostgreSQL is usually the best free replacement for Oracle, and its BSD license should be more business-friendly.
Since you prefer a non-relational database, consider that the transition will be more dramatic. If you ever need to customize your database, you should also consider the license type factor.
There are three things that really have a deep impact on which database is your best choice, and you do not mention them:
The size of your data or if you need to store files within your database.
A huge number of reads and very few (even restricted) writes. In that case, rather than a database, you may need a directory such as LDAP.
The importance of data distribution and/or replication. Most relational databases can be replicated more or less well, but because of their concept/design they do not handle data distribution as well... but will you be handling so much data that it does not fit on one server, or have access rights that need special separate/extra servers?
However, most people will go for a non-relational database just because they do not like learning SQL.
What do you think is a significant amount of data? MySQL, and basically most relational database engines, can handle rather large amounts of data, with proper indexes and a sane database schema.
Why don't you try how MySQL behaves with a bigger data volume in your setup? Make some scripts that generate realistic data into a MySQL test database, generate some load on the system, and see if it is fast enough.
Only if it is not fast enough should you start considering optimizing the database and changing to a different database engine.
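A rough sketch of that kind of test script is below; PyMySQL, the table layout and the query are all made up for illustration, the point is just to load a realistic volume and time the queries you actually run:

```python
import random
import string
import time
import pymysql

conn = pymysql.connect(host="localhost", user="test", password="test", database="loadtest")

# Load ~1M fake rows in batches.
with conn.cursor() as cur:
    for _ in range(100):
        rows = [
            (random.randint(1, 50_000), "".join(random.choices(string.ascii_lowercase, k=200)))
            for _ in range(10_000)
        ]
        cur.executemany("INSERT INTO submissions (user_id, body) VALUES (%s, %s)", rows)
        conn.commit()

# Time a representative query against the loaded data.
start = time.perf_counter()
with conn.cursor() as cur:
    cur.execute(
        "SELECT u.name, COUNT(*) FROM submissions s JOIN users u ON u.id = s.user_id "
        "GROUP BY u.name ORDER BY COUNT(*) DESC LIMIT 20"
    )
    cur.fetchall()
print(f"query took {time.perf_counter() - start:.2f}s")
```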
Be careful with NHibernate; it is easy to make a solution that is nice and easy to code with, but has bad performance with a large amount of data. For example, whether to use lazy or eager fetching with associations should be carefully considered. I don't mean that you shouldn't use NHibernate, but make sure that you understand how NHibernate works, for example what the "n + 1 selects" problem means.
Measure, don't assume.
Relational databases and NoSQL databases can both scale enormously, if the application is written right in each case, and if the system it runs on is properly tuned.
So, if you have a use case for NoSQL, code to it. Or, if you're more comfortable with relational, code to that. Then, measure how well it performs and how it scales, and if it's OK, go with it, if not, analyse why.
Only once you understand your performance problem should you go searching for exotic technology, unless you're comfortable with that technology or want to try it for some other reason.
I'd suggest you try out each db and pick the one that makes it easiest to develop your application. Go to http://try.mongodb.org to try MongoDB with a simple tutorial. Don't worry as much about speed since at the beginning developer time is more valuable than the CPU time.
I know that many MongoDB users have been able to ditch their ORM and their caching layer. Mongo's data model is much closer to the objects you work with than relational tables, so you can usually just store your objects directly as-is, even if they contain lists of nested objects, such as a blog post with comments. Also, because Mongo is fast enough for most sites as-is, you can avoid dealing with the complexities of caching and generally deliver a more real-time site. For example, Wordnik.com reported 250,000 reads/sec and 100,000 inserts/sec with a 1.2TB / 5 billion object DB.
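The "store the object as-is" idea, sketched with pymongo (the connection string and field names are assumptions):

```python
from pymongo import MongoClient

posts = MongoClient("mongodb://localhost:27017")["blog"]["posts"]

posts.insert_one({
    "title": "Why we skipped the ORM",
    "author": "alice",
    "tags": ["mongodb", "architecture"],
    "comments": [  # nested list of comments, no separate comments table or join
        {"author": "bob", "text": "Nice write-up"},
        {"author": "carol", "text": "What about transactions?"},
    ],
})

post = posts.find_one({"title": "Why we skipped the ORM"})
```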
There are a few ways to connect to MongoDB from .Net, but I don't have enough experience with that platform to know which is best:
Norm: http://wiki.github.com/atheken/NoRM/
MongoDB-CSharp: http://github.com/samus/mongodb-csharp
Simple-MongoDB: http://code.google.com/p/simple-mongodb/
Disclaimer: I work for 10gen on MongoDB so I am a bit biased.

Switching from MySQL to Cassandra - Pros/Cons?

For a bit of background - this question deals with a project running on a single small EC2 instance, which is about to migrate to a medium one. The main components are Django, MySQL and a large number of custom analysis tools written in Python and Java, which do the heavy lifting. The same machine is running Apache as well.
The data model looks like the following - a large amount of real-time data comes in, streamed from various networked sensors, and ideally I'd like to establish a long-poll approach rather than the current poll-every-15-minutes approach (a limitation of computing stats and writing into the database itself). Once the data comes in, I store the raw version in MySQL, let the analysis tools loose on this data, and store statistics in another few tables. All of this is rendered using Django.
Relational features I would need -
Order by [SliceRange in Cassandra's API seems to satisfy this]
Group by
Many-to-many relations between multiple tables [Cassandra SuperColumns seem to do well for one-to-many]
Sphinx on top of this gives me a nice full-text engine, so that's a necessity too. [On Cassandra, the Lucandra project seems to satisfy this need]
My major problem is that data reads are extremely slow (and writes aren't that hot either). I don't want to throw a lot of money and hardware at it right now, and I'd prefer something that can scale easily with time. Vertically scaling MySQL is not trivial in that sense (or cheap).
So essentially, after having read a lot about NoSQL and experimented with things like MongoDB, Cassandra and Voldemort, my questions are:
On a medium EC2 instance, would I gain any benefits in reads/writes by shifting to something like Cassandra? This article (pdf) definitely seems to suggest that. Currently, I'd say a few hundred writes per minute would be the norm. For reads - since the data changes every 5 minutes or so, cache invalidation has to happen pretty quickly. At some point, it should be able to handle a large number of concurrent users as well. The app performance currently gets killed on MySQL doing some joins on large tables even if indexes are created - something on the order of 32k rows takes more than a minute to render. (This may be an artifact of EC2 virtualized I/O as well.) The size of the tables is around 4-5 million rows, and there are about 5 such tables.
Everyone talks about using Cassandra on multiple nodes, given the CAP theorem and eventual consistency. But for a project that is just beginning to grow, does it make sense to deploy a one-node Cassandra server? Are there any caveats? For instance, can it replace MySQL as a backend for Django? [Is this recommended?]
If I do shift, I'm guessing I'll have to rewrite parts of the app to do a lot more "administrivia" since I'd have to do multiple lookups to fetch rows.
Would it make any sense to just use MySQL as a key-value store rather than a relational engine, and go with that? That way I could utilize the large number of stable APIs available, as well as a stable engine (and go relational as needed). (Bret Taylor's post from FriendFeed on this - http://bret.appspot.com/entry/how-friendfeed-uses-mysql; a rough sketch of what I mean follows at the end of this question.)
Any insights from people who've done a shift would be greatly appreciated!
Thanks.
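(For the third question, a rough sketch of what I mean by using MySQL as a key-value store, in the spirit of the FriendFeed post; the table layout and PyMySQL are just placeholders:)

```python
import json
import pymysql

conn = pymysql.connect(host="localhost", user="app", password="secret", database="kv")

# Backing table, roughly:
#   CREATE TABLE entities (
#       id   VARCHAR(64) PRIMARY KEY,
#       body MEDIUMBLOB NOT NULL,
#       updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
#   ) ENGINE=InnoDB;

def put(entity_id, obj):
    with conn.cursor() as cur:
        cur.execute(
            "REPLACE INTO entities (id, body) VALUES (%s, %s)",
            (entity_id, json.dumps(obj).encode("utf-8")),
        )
    conn.commit()

def get(entity_id):
    with conn.cursor() as cur:
        cur.execute("SELECT body FROM entities WHERE id = %s", (entity_id,))
        row = cur.fetchone()
    return json.loads(row[0]) if row else None
```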
Cassandra and the other distributed databases available today do not provide the kind of ad-hoc query support you are used to from SQL. This is because you can't distribute queries with joins performantly, so the emphasis is on denormalization instead.
However, Cassandra 0.6 (beta officially out tomorrow, but you can build from the 0.6 branch yourself if you're impatient) supports Hadoop map/reduce for analytics, which actually sounds like a good fit for you.
Cassandra provides excellent support for adding new nodes painlessly, even to an initial group of one.
That said, at a few hundred writes/minute you're going to be fine on MySQL for a long, long time. Cassandra is much better at being a key/value store (even better, a key/columnfamily store), but MySQL is much better at being a relational database. :)
There is no Django support for Cassandra (or any other NoSQL database) yet. They are talking about doing something for the next version after 1.2, but based on talking to Django devs at PyCon, nobody is really sure what that will look like yet.
If you're a relational database developer (as I am), I'd suggest/point out:
Get some experience working with Cassandra before you commit to its use on a production system... especially if that production system has a hard deadline for completion. Maybe use it as the backend for something unimportant first.
It's proving more challenging than I'd anticipated to do simple things that I take for granted about data manipulation using SQL engines. In particular, indexing data and sorting result sets is non-trivial.
Data modelling has proven challenging as well. As a relational database developer you come to the table with a lot of baggage... you need to be willing to learn how to model data very differently.
These things said, I strongly recommend building something in Cassandra. If you're like me, then doing so will challenge your understanding of data storage and make you rethink a relational-database-fits-all-situations outlook that I didn't even realize I held.
Some good resources I've found include:
Dominic Williams' Cassandra blog posts
Secondary Indexes in Cassandra
More from Ed Anuff on indexing
Cassandra book (not fantastic, but a good start)
"WTF is a SuperColumn" pdf
Django-Cassandra support is in an early beta state. Also, Django wasn't made for NoSQL databases: the Django ORM is built around SQL (Django recommends using PostgreSQL). If you need to use ONLY NoSQL (you can mix SQL and NoSQL in the same app), you have to take the risk of using a NoSQL ORM (which is significantly slower than a traditional SQL ORM or direct use of the NoSQL storage), or you'll need to completely rewrite the Django ORM. But in that case I can't see why you need Django at all. Maybe you could use something else, like Tornado?