It seems like most large companies that have to shard their databases choose MySQL over PostgreSQL. What are the major advantages that MySQL has over PostgreSQL when it comes to distributed databases? I don't see any major downside to Postgres that would prevent a successful implementation of sharding at the application level, but the sheer number of companies that choose MySQL over Postgres is giving me pause and making me wonder if I'm missing something.
PARTITIONing involves a single server; sharding involves many servers. They solve (or fail to solve) different problems. Partitioning provides very few use cases to justify its existence; sharding provides write scaling at the cost of complexity.
MySQL has no built-in sharding capability. There are 3rd party packages that assist with it, but there is still a large burden on the DBA. (See Spider and various proxy servers.)
So, I see no reason why Postgres (or any other RDBMS) could not be sharded. After all, you do most of the work; the RDBMS sits on multiple machines not realizing that there are siblings with other chunks of the data.
(Disclaimer: I am very familiar with MySQL, and not familiar with Postgres.)
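To make "you do most of the work" concrete, here is a minimal sketch of what application-level sharding can look like. The host names, shard count, schema, and credentials are all invented for illustration, and it assumes the mysql-connector-python package:

```python
# Minimal sketch of application-level sharding: the application picks the
# shard, and each MySQL server is unaware that siblings hold the other
# chunks of the data. All names here are hypothetical.
import mysql.connector

SHARD_HOSTS = ["db-shard-0.internal", "db-shard-1.internal",
               "db-shard-2.internal", "db-shard-3.internal"]

def shard_for(user_id):
    # Route by hashing the shard key; modulo is the simplest scheme,
    # though it makes later re-sharding painful (a directory table or
    # consistent hashing helps there).
    return SHARD_HOSTS[user_id % len(SHARD_HOSTS)]

def fetch_orders(user_id):
    conn = mysql.connector.connect(host=shard_for(user_id), user="app",
                                   password="...", database="appdb")
    try:
        cur = conn.cursor()
        # The query itself is ordinary SQL; only the routing is shard-aware.
        cur.execute("SELECT id, total FROM orders WHERE user_id = %s",
                    (user_id,))
        return cur.fetchall()
    finally:
        conn.close()
```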
Related
We have started a new project using MySQL, Spring Boot, and AngularJS. Initially, we did not realize that our DB would have to handle large amounts of data.
The number of tables will not be large (<130); only 10 to 20 tables will contain most of the data, which is constantly inserted, read, and updated.
The data in those 10 tables is estimated to grow by about 1,200,000 records per month, and we must not delete that data, so that we can run various reports on it.
There needs to be a (read-only) replicated database as a backup/failover, and maybe for offloading reports at peak times.
I don't have first-hand experience with databases that large, so I'm asking those who do: which DB is the best choice in this situation? We have completed 100% of the coding and development, and only now realized this. I have doubts about whether MySQL is going to handle this much data. I know that Oracle is the safe bet, but I'm interested in whether MySQL can work with a similar setup. We are not bound to MySQL; based on your feedback I can make the call on any DB.
An open-source DB is preferable, but it's not mandatory; we can go for a paid DB as well.
Handling Large Data
MySQL is more than capable of handling such loads. In fact, it is capable of handling much, much more load than what you are talking about. You just have to create the right kind of tables. You can do that by choosing:
the correct storage engine for your use-case
the correct character set
the optimal data type for your column
the right indexing strategy - creating indexes thoughtfully
the right partitioning strategy (if the data in the table exceeds tens of millions of records); a sketch putting these choices together follows this list
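Here is a minimal sketch of those choices applied to one hypothetical table: an explicit engine and charset, compact column types, a composite index, and monthly RANGE partitions. The table name, columns, and partition boundaries are invented, and it assumes a reachable MySQL server plus the mysql-connector-python package:

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="...", database="appdb")
cur = conn.cursor()
cur.execute("""
CREATE TABLE events (
    id         BIGINT UNSIGNED  NOT NULL AUTO_INCREMENT,
    user_id    INT UNSIGNED     NOT NULL,
    status     TINYINT UNSIGNED NOT NULL,        -- smallest type that fits
    created_at DATETIME         NOT NULL,
    PRIMARY KEY (id, created_at),                -- partition column must be in the PK
    KEY idx_user_created (user_id, created_at)   -- supports the common lookup pattern
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202401 VALUES LESS THAN (TO_DAYS('2024-02-01')),
    PARTITION p202402 VALUES LESS THAN (TO_DAYS('2024-03-01')),
    PARTITION pmax    VALUES LESS THAN MAXVALUE
)
""")
cur.close()
conn.close()
```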
EDIT: You've also got to choose the right data modelling and normalization strategy for your use case. Most OLTP applications require some level of normalization. But if you want to run analytics and aggregates on heavy tables, you should either have a Data Warehouse or highly denormalized tables to avoid joins, and/or a column-oriented database to support such queries.
MySQL is open-source and has a very strong community support so you will find a lot of literature around any issue that you face. You can also find all the filed bugs (resolved and unresolved) here.
As far as the number of tables is concerned, there's really no cap on that. See here: MySQL permits 4 billion tables if you're using InnoDB as the engine.
A lot of very big companies with scale use MySQL in some capacity. Facebook is one of them.
Native JSON Support
With the growing popularity of JSON as the de facto data exchange format across the internet, MySQL has also provided native JSON support in 5.7, so now you can store and query JSON from your APIs, if required.
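As a small, hedged illustration of storing and querying such a payload (the api_events table is hypothetical, and the ->> shorthand needs MySQL 5.7.13 or later):

```python
# Store an API payload in a JSON column and pull a field back out.
# Assumes a reachable MySQL 5.7+ server and mysql-connector-python.
import json
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="...", database="appdb")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS api_events "
            "(id INT AUTO_INCREMENT PRIMARY KEY, payload JSON)")
cur.execute("INSERT INTO api_events (payload) VALUES (%s)",
            (json.dumps({"type": "signup", "plan": "pro"}),))
conn.commit()
# ->> is shorthand for JSON_UNQUOTE(JSON_EXTRACT(...))
cur.execute("SELECT id FROM api_events WHERE payload->>'$.type' = 'signup'")
print(cur.fetchall())
cur.close()
conn.close()
```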
HA and Replication
MySQL Replication works! Earlier, MySQL supported only binlog-coordinate-based replication, but now it supports GTID-based replication, which makes replication easier to maintain and fix. There are also third-party replicators available on the market. For instance, Continuent's Tungsten is a replicator written in Java and is a replacement for native replication. It comes with a lot of configuration options which are not available with native MySQL replication.
I agree with MontyPython: MySQL can do it, and the design is critical. Fortunately, MySQL allows you to be flexible over time as needed.
I've had history tables used in daily reporting that grew to over a billion records in plain MySQL, with no problems.
I've also used MySQL MERGE tables to divide up tables with big-ish rows (100 KB+) to speed things up, basically keeping the individual underlying table files under 30 GB each. However, that solution increases the open file count (in the system) per client, which might be a bigger deal on a clustered system; that one was not.
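For reference, a MERGE layout along those lines looks roughly like this (all names are illustrative; it assumes a reachable MySQL server, and note that MERGE only works over identical MyISAM tables, which is part of the trade-off):

```python
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="...", database="appdb")
cur = conn.cursor()
member_ddl = """
    CREATE TABLE {} (
        id   INT NOT NULL,
        body MEDIUMTEXT,
        KEY idx_id (id)
    ) ENGINE=MyISAM
"""
for name in ("log_2023", "log_2024"):
    cur.execute(member_ddl.format(name))
# The MERGE table repeats the same column and key definitions and presents
# the members as one logical table; INSERT_METHOD=LAST sends new rows to
# the last member listed.
cur.execute("""
    CREATE TABLE log_all (
        id   INT NOT NULL,
        body MEDIUMTEXT,
        KEY idx_id (id)
    ) ENGINE=MERGE UNION=(log_2023, log_2024) INSERT_METHOD=LAST
""")
cur.close()
conn.close()
```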
That said, I like to give Honorable Mention to:
MariaDB - MySQL but with contributions from Facebook, Alibaba, Google, and more.
I've moved most of my MySQL community edition projects over to MariaDB and have been very happy. It's an almost transparent upgrade.
They offer an interesting enterprise Big Data analytics package (MariaDB AX), but with your current requirements it's probably overkill, and the standard community edition will fulfill your needs.
For example, here's an informative tutorial on how to set up a scalable cluster (Galera) and add MaxScale for high availability:
https://mariadb.com/resources/blog/getting-started-mariadb-galera-and-mariadb-maxscale-centos
Another interesting option is Vitess, developed at YouTube, which allows for sharded MySQL through a (mostly) driver-based solution. It solves the problem of needing access to huge amounts of data while always yielding good performance. As such, it goes beyond high availability and focuses on a solution wherein no single query (e.g. a report against millions of rows of historical data) can negatively impact the other queries that need to be performed.
Having studied relational databases, document stores, graph databases, and column-oriented databases, I concluded that something like Cassandra best fits my needs. In particular, the ability to add columns on the fly and the lack of a strict schema requirement seal the deal for me. This seems to nicely bridge the gap between a rather novel graph DB and a time-tested RDBMS.
But I am concerned about running Cassandra on a single node. Like many others, I can start only with a small amount of data, so more than one node to start with is just not practical. Based on another excellent SO question, Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL?, I concluded that Cassandra can indeed be run just fine as a single node, as long as one is willing to give up benefits like availability which are derived from a multi-node setup.
There also seem to be ways of implementing dynamic addition of fields in an RDBMS, for instance as discussed here on SO: How to design a database for User Defined Fields? This would, to some extent, mimic schemalessness.
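One common shape such a design takes is an entity-attribute-value (EAV) side table. Here is a minimal runnable sketch; the schema is hypothetical, and SQLite is used only so the example is standalone:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Each row stores one dynamic field for one user.
    CREATE TABLE user_fields (
        user_id INTEGER NOT NULL REFERENCES users(id),
        field   TEXT    NOT NULL,
        value   TEXT,
        PRIMARY KEY (user_id, field)
    );
""")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'ada')")
conn.execute("INSERT INTO user_fields VALUES (1, 'favorite_color', 'green')")
# Reading a dynamic field back is a join instead of a column access.
row = conn.execute("""
    SELECT u.name, f.value FROM users u
    JOIN user_fields f ON f.user_id = u.id AND f.field = 'favorite_color'
""").fetchone()
print(row)  # ('ada', 'green')
```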
So I would now like to understand how Cassandra and MySQL compare with regard to features and performance in a single-node setup. What would you advise someone in my situation - start with a simple RDBMS with the plan/intent to switch to Cassandra later on? Or start with Cassandra?
In a single node setup of Cassandra, many of the advantages of Cassandra are lost, so the main reason for doing that would be if you intended to expand to multiple nodes in the future. Performance would tend to favor RDBMS in most applications when using a single node since RDBMS is designed for that environment and can assume all data is local.
The strengths of Cassandra are scalability and availability. You can add nodes to increase capacity and having multiple nodes means you can deal with hardware failures and not have downtime. These strengths come at the cost of more difficult schema design since access is based primarily on consistent hashing. It also means you don't have full SQL available and often must rely on denormalization techniques to support fast access to data. Cassandra is also weak for ACID transactions since it is inherently difficult to coordinate atomic actions on multiple nodes.
RDBMS by contrast is a more mature technology. ACID transactions are no problem. Schema design is much simpler since you can add efficient indexes to any column to optimize queries, and you have joins available so that redundant data can be largely eliminated. By eliminating redundant data it is much easier to keep your data consistent, since there are not multiple copies of data that need to be updated when someone changes their address for example. But you run the risk of running out of space on a single machine to store all your data. And if you get a disk crash you will have downtime and need backups to restore the data, while Cassandra can often easily repair the data on a node that is out of sync. There is also no easy way to scale an RDBMS to handle higher transaction rates other than buying a faster machine.
There are a lot of other differences, but those are the major ones. Neither one is better than the other, but each one may be better suited to certain applications. So it really depends on the requirements of your use case which one will be a better fit.
I need to improve a PHP-MySQL web application, which only uses MySQL for REPL operations (and some search functions). 99% of the applications that I worked with never used advanced MySQL features, like replication, cross-table constraints, locking etc.
To my understanding I should instead use SQLite.
Are there any practical benefits if I do this?
Will I see a significant (>100ms) speed boost?
Should I expect problems with tables with more than 1,000,000 rows?
There is no catch-all answer to that, but there is a main point to consider: a very good rule of thumb is that the higher your degree of concurrency, the more you'll profit from MySQL, and vice versa.
This means that in a scenario where database requests are never concurrent, you might see a speedup by using SQLite, though I doubt it would be in the 100 ms order of magnitude.
The reason behind this is (very roughly):
In a database server environment, such as MySQL, PostgreSQL, MS SQL, Oracle and friends, a dedicated process (or a group of processes) exclusively touches the database files - the important part being dedicated. This means that concurrency issues can be resolved in-process.
In a file-based database, such as SQLite, MS Access (Jet engine) and friends, multiple processes will touch the DB files without knowing of each other - this implies that concurrency issues have to be resolved by writing them to the DB or helper file(s). This is typically much slower and less robust. In exchange, the overhead of communication between the database client (the web app) and the database engine (which runs in-process) is nonexistent.
Edit
After a comment I want to make it clearer that I am talking about concurrent writes, not concurrent reads. Concurrent reads of an unchanging dataset are not a hard problem - they don't need any locking at all.
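Here is a small runnable illustration of that write-concurrency point, using Python's built-in sqlite3 module: two connections open the same file, and the second writer, having no server process to coordinate through, simply gets refused:

```python
import sqlite3, tempfile, os

path = os.path.join(tempfile.mkdtemp(), "demo.db")
# isolation_level=None puts the driver in autocommit mode so we can issue
# explicit BEGIN statements ourselves.
a = sqlite3.connect(path, timeout=0.1, isolation_level=None)
b = sqlite3.connect(path, timeout=0.1, isolation_level=None)
a.execute("CREATE TABLE t (x INTEGER)")

a.execute("BEGIN IMMEDIATE")          # writer A takes the write lock
a.execute("INSERT INTO t VALUES (1)")
try:
    b.execute("BEGIN IMMEDIATE")      # writer B waits 0.1 s, then gives up
except sqlite3.OperationalError as e:
    print(e)                          # "database is locked"
a.execute("COMMIT")
```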
The principal advantage of SQLite is that it is a file-based relational database that uses SQL as its query language. Being file-based tremendously simplifies deployment, making it very good for the case where an application needs a little database but must be run in an environment where having a database server would be problematic. (For example, many browsers use SQLite to manage their cookie stores; using a database server for that problem would be verging on the insane in many ways.)
The principal advantage of MySQL (with a sane table type) is that it is a database server that uses SQL as its query language. Being server-based allows for many features that a file-based system can't handle simply (such as replication) but does make things quite a bit more complex to deploy.
Whether the benefits of the additional complexity of a database server (e.g., MySQL) outweigh the costs (relative to a file-based database engine like SQLite) depends on a great many factors, notably including how many installations are expected and who is expected to perform those installations.
I have a web app running on LAMP. We recently had an increase in load and are now looking at solutions to scale. Scaling Apache is pretty easy: we are just going to have multiple machines hosting it and round-robin the incoming traffic.
However, each instance of Apache will talk to MySQL, and eventually MySQL will be overloaded. How do we scale MySQL across multiple machines in this setup? I have already looked at this, but we specifically need the updates from the DB available immediately, so I don't think replication is a good strategy here. Also, hopefully this can be done with minimal code changes.
PS. We have around a 1:1 read-write ratio.
There are only two strategies: replication and sharding. Replication often comes into play when you have little write traffic and a lot of read traffic, so you can redirect the reads to many slaves, with the pitfalls of growing replication traffic over time and a chance of inconsistency.
With sharding you split your database tables across multiple machines (called functional sharding), which makes joins in particular much harder. If this doesn't fit anymore, you also need to shard your rows across multiple machines, but this is no fun and requires a sharding layer implemented between your application and the database.
Document-oriented databases and column stores do this work for you, but they are currently optimized for OLAP, not OLTP.
Depending on how the application backend handles PKs, transactions, and insert IDs, you might consider MASTER-MASTER replication with different auto_increment setups. This can be tricky and needs to be thoroughly tested, but it can work.
Also, the new MySQL 5.6 has GTIDs (Global Transaction Identifiers), which generally help a lot in keeping replication in sync, especially in this scenario.
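As a hedged sketch of what those "different auto_increment setups" mean for two co-masters: interleave the generated IDs so the two servers can never hand out the same value. Host names and credentials are placeholders, and in practice you would persist these settings in each server's my.cnf rather than via SET GLOBAL:

```python
import mysql.connector

for host, offset in (("master-a.internal", 1), ("master-b.internal", 2)):
    conn = mysql.connector.connect(host=host, user="admin", password="...")
    cur = conn.cursor()
    cur.execute("SET GLOBAL auto_increment_increment = 2")   # step by 2
    cur.execute("SET GLOBAL auto_increment_offset = %s", (offset,))
    # Master A now generates 1, 3, 5, ...; master B generates 2, 4, 6, ...
    cur.close()
    conn.close()
```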
You should take a look at MySQL Performance Blog. Maybe you'll find something useful.
Well... good luck scaling all those writes to a really large scale. The database engine becomes the bottleneck: too many locks, buffer management, and so on.
The only way I found that really works is scaling out via sharding; unfortunately, sharding is not provided for MySQL "out of the box" (unlike in some NoSQL databases such as Mongo). ScaleBase (disclaimer: I work there) is a maker of a complete scale-out solution, an "automatic sharding machine" if you like. ScaleBase analyzes your data and SQL stream, splits the data across DB nodes, routes commands, and aggregates results at runtime, so you won't have to!
My Django project is going to be backed by a large database with several hundred thousand entries, and will need to support searching (I'll probably end up using djangosearch or a similar project.)
Which database backend is best suited to my project and why? Can you recommend any good resources for further reading?
For whatever it's worth, the creators of Django recommend PostgreSQL.
If you're not tied to any legacy system and have the freedom to choose a database back-end, we recommend PostgreSQL, which achieves a fine balance between cost, features, speed and stability. (The Definitive Guide to Django, p. 15)
As someone who recently switched a project from MySQL to Postgresql I don't regret the switch.
The main difference, from a Django point of view, is more rigorous constraint checking in Postgresql, which is a good thing, and also it's a bit more tedious to do manual schema changes (aka migrations).
There are probably 6 or so Django database migration applications out there and at least one doesn't support Postgresql. I don't consider this a disadvantage though because you can use one of the others or do them manually (which is what I prefer atm).
Full text search might be better supported for MySQL. MySQL has built-in full text search supported from within Django but it's pretty useless (no word stemming, phrase searching, etc.). I've used django-sphinx as a better option for full text searching in MySQL.
Full text searching is built-in with Postgresql 8.3 (earlier versions need TSearch module). Here's a good instructional blog post: Full-text searching in Django with PostgreSQL and tsearch2
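A minimal sketch of what that built-in full-text search looks like when issued as a raw query from Django (the blog_post table and its columns are hypothetical):

```python
from django.db import connection

def search_posts(terms):
    # to_tsvector/plainto_tsquery are the built-in PostgreSQL full-text
    # primitives; 'english' selects the stemming configuration.
    with connection.cursor() as cur:
        cur.execute("""
            SELECT id, title
            FROM blog_post
            WHERE to_tsvector('english', body)
                  @@ plainto_tsquery('english', %s)
        """, [terms])
        return cur.fetchall()
```

Later Django versions wrap this functionality in django.contrib.postgres.search.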
"large database with several hundred thousand entries"
This is not a large database; it's a very small one.
I'd choose PostgreSQL, because it has a lot more features. Most significant in this case: in PostgreSQL you can use Python as a procedural language.
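A toy example of that: defining a function in Python inside PostgreSQL via PL/Python. It assumes psycopg2 and that the plpython3u extension is available (installing it requires superuser rights); the function itself is invented:

```python
import psycopg2

conn = psycopg2.connect("dbname=appdb user=app")
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS plpython3u")
cur.execute("""
    CREATE OR REPLACE FUNCTION slugify(title text) RETURNS text AS $$
        # The function body is ordinary Python running inside the server.
        return title.lower().replace(" ", "-")
    $$ LANGUAGE plpython3u
""")
cur.execute("SELECT slugify('Hello PostgreSQL World')")
print(cur.fetchone()[0])   # hello-postgresql-world
conn.commit()
```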
Go with whichever you're more familiar with. MySQL vs PostgreSQL is an endless war. Both of them are excellent database engines and both are being used by major sites. It really doesn't matter in practice.
All the answers bring interesting information to the table, but some are a little outdated, so here's my grain of salt.
As of 1.7, migrations are now an integral feature of Django. So they documented the main differences that Django developers might want to know beforehand.
Backend Support
Migrations are supported on all backends that Django ships with, as well as any third-party backends if they have programmed in support for schema alteration (done via the SchemaEditor class).
However, some databases are more capable than others when it comes to schema migrations; some of the caveats are covered below.
PostgreSQL
PostgreSQL is the most capable of all the databases here in terms of schema support.
MySQL
MySQL lacks support for transactions around schema alteration operations, meaning that if a migration fails to apply you will have to manually unpick the changes in order to try again (it’s impossible to roll back to an earlier point).
In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns. On slower hardware this can be worse than a minute per million rows - adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Finally, MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
SQLite
SQLite has very little built-in schema alteration support, and so Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
This process generally works well, but it can be slow and occasionally buggy. It is not recommended that you run and migrate SQLite in a production environment unless you are very aware of the risks and its limitations; the support Django ships with is designed to allow developers to use SQLite on their local machines to develop less complex Django projects without the need for a full database.
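Those four steps are easy to watch in miniature with Python's built-in sqlite3 module; here is a hypothetical "add a column" change performed the way Django emulates it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO user (name) VALUES ('ada')")

conn.executescript("""
    -- 1. Create a new table with the new schema
    CREATE TABLE user_new (
        id    INTEGER PRIMARY KEY,
        name  TEXT,
        email TEXT NOT NULL DEFAULT ''
    );
    -- 2. Copy the data across
    INSERT INTO user_new (id, name) SELECT id, name FROM user;
    -- 3. Drop the old table
    DROP TABLE user;
    -- 4. Rename the new table to match the original name
    ALTER TABLE user_new RENAME TO user;
""")
print(conn.execute("SELECT * FROM user").fetchall())  # [(1, 'ada', '')]
```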
Even if Postgresql looks better, I find it has some performance issues with Django:
Postgresql is made to handle "long connections" (connection pooling, persistent connections, etc.)
MySQL is made to handle "short connections" (connect, do your queries, disconnect; it has some performance issues with a lot of open connections)
The problem is that Django does not support connection pooling or persistent connections; it has to connect/disconnect from the database on each view call.
It will work with Postgresql, but connecting to Postgresql costs a LOT more than connecting to a MySQL database (on Postgresql, each connection has its own process, which is a lot slower than just popping a new thread in MySQL).
Then you get some features like the Query Cache that can be really useful in some cases. (But you lose the superb text search of PostgreSQL.)
When a migration fails in django-south, the developers encourage you not to use MySQL:
! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS (one that supports DDL transactions)
Having gone down the road of MySQL because I was familiar with it (struggling to find a proper installer, plus a quick test of PostgreSQL's slow web "workbench" interface, had put me off), a few months after deployment, at the end of the project, while looking into backup options, I found out you have to pay for MySQL's enterprise backup features. Gotcha right at the very end.
With MySQL I had to write some ugly monster raw SQL queries in Django, because there is no "select distinct per group" for retrieving the latest row per group. I also keep looking at PostgreSQL's full-text search and wishing I had used PostgreSQL.
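For the curious, this is the "latest row per group" pattern being referred to: PostgreSQL's DISTINCT ON expresses it directly, while classic MySQL needs a self-join against a grouped subquery. Table and column names here are invented:

```python
# PostgreSQL: DISTINCT ON keeps the first row per device_id after sorting.
POSTGRES_LATEST_PER_GROUP = """
    SELECT DISTINCT ON (device_id) device_id, reading, recorded_at
    FROM readings
    ORDER BY device_id, recorded_at DESC
"""

# Classic MySQL: find each group's max timestamp, then join back for the row.
MYSQL_LATEST_PER_GROUP = """
    SELECT r.device_id, r.reading, r.recorded_at
    FROM readings r
    JOIN (SELECT device_id, MAX(recorded_at) AS latest
          FROM readings
          GROUP BY device_id) m
      ON m.device_id = r.device_id AND m.latest = r.recorded_at
"""
```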
I recommend PostgreSQL even if you are familiar with MySQL, but your mileage may vary.
UPDATE: DBeaver is a great equivalent of the MySQL Workbench GUI tool, but it works with PostgreSQL very nicely (and many others, as it's a universal DB tool).
To add to previous answers:
"Full text search might be better supported for MySQL"
The FULLTEXT index in MySQL is a joke.
It only works with MyISAM tables, so you lose ACID, Transactions, Constraints, Relations, Durability, Concurrency, etc.
An INSERT/UPDATE/DELETE to a largish TEXT column (like a forum post) will rebuild a large part of the index. If it does not fit in myisam_key_buffer, then large IO will occur. I've seen a single forum post insertion trigger 100 MB or more of IO... meanwhile the posts table is exclusively locked!
I did some benchmarking (3 years ago, may be stale...) which showed that on large datasets, basically postgres fulltext is 10-100x faster than mysql, and Xapian 10-100x faster than postgres (but not integrated).
Other reasons not mentioned are the extremely smart query optimizer, the large choice of join types (merge, hash, etc.), hash aggregation, GiST indexes on arrays, spatial search, etc., which can result in extremely fast plans on very complicated queries.
Will this application be hosted on your own servers or by a hosting company? Make sure that if you are using a hosting company, they support the database of choice.
There is a major licensing difference between the two DBs that will affect you if you ever intend to distribute code using the DB. MySQL's client libraries are GPL, while PostgreSQL's are under a BSD-like license, which might be easier to work with.