PostgreSQL vs MySQL for handling inserting/deleteing large quanties of blobs - mysql

I have an application that I will soon be doing a significant rewrite of that stores a significant quantity of transient blobs in a database. The application will insert and delete a large number of blobs (up to 5mb in size each) during the course of each day. Currently the application uses a version of PostgreSQL that is very old (7.3.x). With this version of PostgreSQL we had to routinely run the external vacuum process to keep the database size under control, additionally this process required the application to be shutdown to function correctly.
We were looking at either upgrading to the newest PostgreSQL or migrating to another database. Specifically we were interested in moving to MySQL. I was wondering if anyone here was familiar with the blob handling support of the newest versions of these servers, and provide any suggestions on which one might perform best for an application that will constantly be inserting and removing blobs. Other feature differences between the two servers aren't an issue for us.
I did some research and found plenty of feature comparisons between MySQL and PostgreSQL, but nothing that really addressed this issue. I'm hoping someone here might have some experience with this aspect of one or both database systems.
Thanks

Postgres 7.x indeed was a major PITA when it comes to vacuuming. 9.0 is a lot better in this area. The autovacuum daemon can be configured on a per-table level since I think 8.3 and for the described scenario you would probably make it very aggressive for that table (or tables if more than one is involved).
I don't think it matters whether you delete rows with BLOBs (i.e bytea) column or not. Especially because the blobs are stored out-of-line anyway (you might need to configure the auto-vaccuum daemon for the so called TOAST table as well, but I'm not sure)
The question is rather how many rows (in percent of the total rows) you delete/update in the table, rather than how big each blob is.
As much as I like PostgreSQL I do have to admit that the whole vacuum topic (even though getting easier and easier which each release) is still one of its weakest points (and the source of a lot of trouble).
I can't say anything about MySQL as I have never used in a production environment. In contrast to you the other features (beside blobs) are important enough for me to stay away from MySQL - and if it's only for the license.

Related

Data Base for handle large data

We have started a new project using MySQL, spring boot, and Angular js. Initially, we did not realize our DB is going to handle large data.
The number of tables will not be large (<130), only 10 to 20 tables will be contained in more data, which is almost inserted/ read/ update.
The estimated amount of data in that 10 table is going to grow at 12,00,000 records in a month, and we should not delete those data be able to do various reports.
There needs to be (read-only) replicated database as a backup/failover, and maybe for offloading reports in peak time.
I don't have first-hand experience with that large databases, so I'm asking the ones that have which DB is the best choice in this situation. as we have completed 100% coding and development but now we realize this. I have doubts may be MYSQL going to handle large data. I know that Oracle is the safe bet, interested if Mysql with a similar setup. But it is bound only in MySQL I am ok with any DB based on you all feedback I can take a call.
Open source DB more preferable but it's not mandatory we can go for paid DB also.
Handling Large Data
MySQL is more than capable of handling such loads. In fact, it is capable of handling much much more load than what you are talking about. You just have to create the right kind of tables. You can do that by choosing
the correct storage engine for your use-case
the correct character set
the optimal data type for your column
the right indexing strategy - creating indexes thoughtfully
the right partitioning strategy (if the data in the table exceeds tens of millions of records)
EDIT: You've also got to choose the right kind of data modelling and normalization strategy for your use-case. Most of OLTP applications require some level of normalization. But if you want to do analytics and aggregates on heavy tables, you should either have a Data Warehouse of have highly denormalized tables to avoid joins and/or have a column-oriented database to support such queries.
MySQL is open-source and has a very strong community support so you will find a lot of literature around any issue that you face. You can also find all the filed bugs (resolved and unresolved) here.
As far as the number of tables are concerned, there's really no cap on that. See here, MySQL permits 4 billion tables if you're using InnoDB as the engine.
A lot of very big companies with scale use MySQL in some capacity. Facebook is one of them.
Native JSON Support
With the growing popularity of JSON as the de facto data exchange format across the internet, MySQL has also provided native JSON support in 5.7, so now you can store and query JSON from your APIs, if required.
HA and Replication
MySQL Replication works! Earlier, MySQL used to support coordinate replication only but now it supports GTID replication which makes it easier to maintain and fix replication issues. There are third-party replicators also available in the market. For instance, Continuent's Tungsten is a replicator written in Java and is a replacement for native replication. It comes with a lot of configuration options which are not available with native MySQL replication.
I agree with MontyPython, MySql can do it and the design is critical. Fortunately MySql allows you to be flexible over time as needed.
I've had history tables needed used in daily reporting that grew to over a billion records in plain MySql and had no problems.
I've also used MySql Merge tables to divide up tables with big-ish rows (100KB+) to speed things up. Basically keeping the individual merge table file sizes under 30GB each. However that solution increases the open file count (in the system) per client - might be a bigger deal on a clustered system. That one was not.
That said, I like to give Honorable Mention to:
MariaDB - MySql but with contributions from Facebook, Alibaba, Google, and more.
I've moved most of my MySql community edition projects over to MariaDB and have been very happy. It's an almost transparent upgrade.
They offer an interesting enterprise Big Data Analytics (MariaDB AX) package, but with your current requirements its probably overkill and the standard community edition will fulfill your needs.
For example, here's an informative tutorial on how to set up a scalable Cluster (Galera) and adding MaxScale for High Availability:
https://mariadb.com/resources/blog/getting-started-mariadb-galera-and-mariadb-maxscale-centos
Another interesting option is Vitesse - developed at Youtube, which allows for sharded mysql through a (mostly) driver based solution. It solves the problem of needing to have available access to huge amounts of data and always yield good performance. As such, it goes beyond high availability and focuses on a solution wherein no single query (ie. a report against millions of rows of historical data) can negatively impact the other queries needing to be performed.

Seeking clarification about mysql 5.6 memcache integration

I'm having trouble getting a clear understanding of what MySQL 5.6 is introducing w/r/t memcache.
As I understand it, memcache by itself is essentially a huge, shared, memory-resident hash table that is managed by a server, memcached. In particular, it knows nothing about a persistent data store, and offers no services in that regard. It simply knows about keys and values (like a Perl hash).
What I think mySQL 5.6 introduces is a NoSQL API, whereby mySQL clients can request data from the mySQL server by key, rather than by a SELECT statement. (And similarly, they can perform updates with key=value pairs). MySQL uses memcached to cache these in memory as a performance boost, but also takes care of things like writing updates back to the database before they age out of the cache, etc.
In other words, the use of memcached is an implementation detail of the mySQL 5.6 NoSQL feature, and is not something the application programmer needs to be aware of.
I'd welcome any corrections or amplification to my understanding.
Thanks,
Chap
I think it's quite simple (from the official documentation):
I disagree with your last sentence, the application programmer has to be really aware of the memcache plugin because having it onboard of the MySQL server means that he can decide (maybe he will be forced to) access data through a memcached language interface or via the SQL interface
To better understand the impact of this plugin onto an app design you should know that there are 3 configuration tables used by MySQL for a proper memcached management; understanding how the "cache_policies" works will shade some light to some of your doubts:
Table cache_policies specifies whether to use InnoDB as the data store of memcached (innodb_only), or to use the traditional memcached engine as the backstore (cache-only), or both (caching). In the last case, if memcached cannot find a key in memory, it searches for the value in an InnoDB table.
here is the link: innodb-memcached-internals
This quote above means that, depending on what you decided for a specific key-value, you will have different application scenarios :
innodb_only -> means that you can query the data via a sql interface or via a memcached interface, here is a link to some memcached language interface examples memcached-interfaces
cache-only -> means that you should query the data via the memchached interface only
caching -> means that you can use both the interfaces (note that the storage mechanism slightly changes)
Of course this latter configuration decision is strictly related to your specific needs
I don't really have a complete answer for you I'm afraid, as I too am struggling to find the detail I require before toying around with it.
That said however there is one important point which I have managed to uncover that you seem to have missed, namely that by accessing the InnoDB storage engine via the new plugin you are actually completely bypassing SQL and avoiding all the overhead that comes with it.
This of course makes it essentially a key/value store more akin to most NoSQL databases complete with all the drawbacks associated with them. i.e. no joins etc...
However on the flip side for many applications these days, this is exactly what we want. There has been only a handful of real world performance mentions that I have come across but all seem to point to this implementation significantly outperforming MongoDB and other similar NoSQL solutions (how much truth is in it I do not know) with even one (relatively in depth) comparison claiming as high as 700k qps on a commodity server (compared with around 100k on a well tuned MySQL setup), which is incredible if true.
Resource here:
http://yoshinorimatsunobu.blogspot.co.uk/search/label/handlersocket
Anyway, sorry I can't be any more help but its food for thought at least!

sqlite or mysql for large datasets

I am working with large datasets (10s of millions of records, at times, 100s of millions), and want to use a database program that links well with R. I am trying to decide between mysql and sqlite. The data is static, but there are lot of queries that I need to do.
In this link to sqlite help, it states that:
"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (241 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests. I'm wondering if mysql is a better choice for me than sqlite due to the size of my dataset. The description above seems to suggest that this might be the case, but my data is no where near 2TB.
I'd appreciate any insights into understanding this constraint of maximum file size from the filesystem and how this could affect speed for indexing tables and running queries. This could really help me in my decision of which database to use for my analysis.
The SQLite database engine stores the entire database into a single file. This may not be very efficient for incredibly large files (SQLite's limit is 2TB, as you've found in the help). In addition, SQLite is limited to one user at a time. If your application is web based or might end up being multi-threaded (like an AsyncTask on Android), mysql is probably the way to go.
Personally, since you've done tests and mysql is faster, I'd just go with mysql. It will be more scalable going into the future and will allow you to do more.
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests.
The short short version is:
If your app needs to fit on a phone or some other embedded system, use SQLite. That's what it was designed for.
If your app might ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.
It seems that (in R, at least), that SQLite is awesome for ad hoc analysis. With the RSQLite or sqldf packages it is really easy to load data and get started. But for data that you'll use over and over again, it seems to me that MySQL (or SQL Server) is the way to go because it offers a lot more features in terms of modifying your database (e.g., adding or changing keys).
SQL if you are mainly using this as a web service.
SQLite, if you want it to able to function offline.
SQLite generally is much much faster, as majority (or ALL) of data/indexes will be cached in memory. However, in the case of SQLite. If the data is split up across multiple tables, or even multiple SQLite database files, from my experience so far. For even millions of records (i yet to have 100's of millions though), it is far more effective then SQL (compensate the latency / etc). However that is when the records are split apart in differant tables, and queries are specific to such tables (dun query all tables).
An example would be a item database used in a simple game. While this may not sound much, a UID would be issued for even variations. So the generator soon quickly work out to more then a million set of 'stats' with variations. However this was mainly due to each 1000 sets of records being split among different tables. (as we mainly pull records via its UID). Though the performance of splitting was not properly measured. We were getting queries that were easily 10 times faster then SQL (Mainly due to network latency).
Amusingly though, we ended up reducing the database to a few 1000 entries, having item [pre-fix] / [suf-fix] determine the variations. (Like diablo, only that it was hidden). Which proved to be much faster at the end of the day.
On a side note though, my case was mainly due to the queries being lined up one after another (waiting for the one before it). If however, you are able to do multiple connections / queries to the server at the same time. The performance drop in SQL, is more then compensated, from your client side. Assuming this queries do not branch / interact with one another (eg. if got result query this, else that)

Maximum capabilities of MySQL

How do I know when a project is just to big for MySQL and I should use something with a better reputation for scalability?
Is there a max database size for MySQL before degradation of performance occurs? What factors contribute to MySQL not being a viable option compared to a commercial DBMS like Oracle or SQL Server?
Google uses MySQL. Is your project bigger than Google?
Smart-alec comments aside, MySQL is a professional level database application. If your application puts a strain on MySQL, I bet it'll do the same to just about any other database.
If you are looking for a couple of examples:
Facebook moved to Cassandra only after it was storing over 7 Terabytes of inbox data. (Source: Lakshman, Malik: Cassandra - A Decentralized Structured Storage System.) (... Even though they were having quite a few issues at that stage.)
Wikipedia also handles hundreds of Gigabytes of text data in MySQL.
I work for a very large Internet company. MySQL can scale very, very large with very good performance, with a couple of caveats.
One problem you might run into is that an index greater than 4 gigabytes can't go into memory. I spent a lot of time once trying to improve the MySQL's full-text performance by fiddling with some index parameters, but you can't get around the fundamental problem that if your query hits disk for an index, it gets slow.
You might find some helper applications that can help solve your problem. For the full-text problem, there is Sphinx: http://www.sphinxsearch.com/
Jeremy Zawodny, who now works at Craig's List, has a blog on which he occasionally discusses the performance of large databases: http://blog.zawodny.com/
In summary, your project probably isn't too big for MySQL. It may be too big for some of the ways that you've used MySQL before, and you may need to adapt them.
Mostly it is table size.
I am assuming here that you will use the Oracle innoDB plugin for mysql as your engine. If you do not, that probably means you're using a commercial engine such as infiniDB, InfoBright for Tokutek, in which case your questions should be sent to them.
InnoDB gets a bit nasty with very large tables. You are advised to partition your tables if at all possible with very large instances. Essentially, if your (frequently used) indexes don't all fit into ram, inserts will be very slow as they need to touch a lot of pages not in ram. This cannot be worked around.
You can use the MySQL 5.1 partitioning feature if it does what you want, or partition your tables at the application level if it does not. If you can get your tables' indexes to fit in ram, and only load one table at a time, then you're on a winner.
You can use the plugin's compression to make your ram go a bit further (as the pages are compressed in ram as well as on disc) but it cannot beat the fundamental limtation.
If your table's indexes don't all (or at least MOSTLY - if you have a few indexes which are NULL in 99.99% of cases you might get away without those ones) fit in ram, insert speed will suck.
Database size is not a major issue, provided your tables individually fit in ram while you're doing bulk loading (and of course, you only load one at once).
These limitations really happen with most row-based databases. If you need more, consider a column database.
Infobright and Infinidb both use a mysql-based core and are column based engines which can handle very large tables.
Tokutek is quite interesting too - you may want to contact them for an evaluation.
When you evaluate the engine's suitability, be sure to load it with very large data on production-grade hardware. There's no point in testing it with a (e.g.) 10G database, that won't prove anything.
MySQL is a commercial DBMS, you just have the option to get the support/monitoring that is offered by Oracle or Microsoft. Or you can use community support or community provided monitoring software.
Things you should look at are not only size at operations. Critical are also:
Scenaros for backup and restore?
Maintenance. Example: SQL Server Enterprise can rebuild an index WHILE THE OLD ONE IS AVAILABLE - transparently. This means no downtime for an index rebuild.
Availability (basically you do not want to have to restoer a 5000gb database if a server dies) - mirroring preferred, replication "sucks" (technically).
Whatever you go for, be carefull with Oracle RAC (their cluster) - it is known to be "problematic" (to say it finely). SQL Server is known to be a lot cheaper, scale a lot worse (no "RAC" option) but basically work without making admins want to commit suicide every hour (the "RAC" option seems to do that). Scalability "a lot worse" still is good enough for the Terra Server (http://msdn.microsoft.com/en-us/library/aa226316(SQL.70).aspx)
THere wer some questions here recently of people having problems rebuilding indices on a 10gb database or something.
So much for my 2 cents. I am sure some MySQL specialists will jump in on issues there.

MySQL vs PostgreSQL? Which should I choose for my Django project?

My Django project is going to be backed by a large database with several hundred thousand entries, and will need to support searching (I'll probably end up using djangosearch or a similar project.)
Which database backend is best suited to my project and why? Can you recommend any good resources for further reading?
For whatever it's worth the the creators of Django recommend PostgreSQL.
If you're not tied to any legacy
system and have the freedom to choose
a database back-end, we recommend
PostgreSQL, which achives a fine
balance between cost, features, speed
and stability. (The Definitive Guide to Django, p. 15)
As someone who recently switched a project from MySQL to Postgresql I don't regret the switch.
The main difference, from a Django point of view, is more rigorous constraint checking in Postgresql, which is a good thing, and also it's a bit more tedious to do manual schema changes (aka migrations).
There are probably 6 or so Django database migration applications out there and at least one doesn't support Postgresql. I don't consider this a disadvantage though because you can use one of the others or do them manually (which is what I prefer atm).
Full text search might be better supported for MySQL. MySQL has built-in full text search supported from within Django but it's pretty useless (no word stemming, phrase searching, etc.). I've used django-sphinx as a better option for full text searching in MySQL.
Full text searching is built-in with Postgresql 8.3 (earlier versions need TSearch module). Here's a good instructional blog post: Full-text searching in Django with PostgreSQL and tsearch2
large database with several hundred
thousand entries,
This is not large database, it's very small one.
I'd choose PostgreSQL, because it has a lot more features. Most significant it this case: in PostgreSQL you can use Python as procedural language.
Go with whichever you're more familiar with. MySQL vs PostgreSQL is an endless war. Both of them are excellent database engines and both are being used by major sites. It really doesn't matter in practice.
All the answers bring interesting information to the table, but some are a little outdated, so here's my grain of salt.
As of 1.7, migrations are now an integral feature of Django. So they documented the main differences that Django developers might want to know beforehand.
Backend Support
Migrations are supported on all backends that Django ships with, as
well as any third-party backends if they have programmed in support
for schema alteration (done via the SchemaEditor class).
However, some databases are more capable than others when it comes to schema migrations; some of the caveats are covered below.
PostgreSQL
PostgreSQL is the most capable of all the databases here in terms of schema support.
MySQL
MySQL lacks support for transactions around schema alteration operations, meaning that if a migration fails to apply you will have to manually unpick the changes in order to try again (it’s impossible to roll back to an earlier point).
In addition, MySQL will fully rewrite tables for almost every schema operation and generally takes a time proportional to the number of rows in the table to add or remove columns. On slower hardware this can be worse than a minute per million rows - adding a few columns to a table with just a few million rows could lock your site up for over ten minutes.
Finally, MySQL has relatively small limits on name lengths for columns, tables and indexes, as well as a limit on the combined size of all columns an index covers. This means that indexes that are possible on other backends will fail to be created under MySQL.
SQLite
SQLite has very little built-in schema alteration support, and so
Django attempts to emulate it by:
Creating a new table with the new schema
Copying the data across
Dropping the old table
Renaming the new table to match the original name
This process generally works well, but it can be slow and occasionally
buggy. It is not recommended that you run and migrate SQLite in a
production environment unless you are very aware of the risks and its
limitations; the support Django ships with is designed to allow
developers to use SQLite on their local machines to develop less
complex Django projects without the need for a full database.
Even if Postgresql looks better, I find it has some performances issues with Django:
Postgresql is made to handle "long connections" (connection pooling, persistant connections, etc.)
MySQL is made to handle "short connections" (connect, do your queries, disconnect, has some performances issues with a lot of open connections)
The problem is that Django does not support connection pooling or persistant connection, it has to connect/disconnect to the database at each view call.
It will works with Postgresql, but connecting to a Postgresql cost a LOT more than connecting to a MySQL database (On Postgresql, each connection has it own process, it's a lot slower than just popping a new thread in MySQL).
Then you get some features like the Query Cache that can be really useful on some cases. (But you lost the superb text search of PostgreSQL)
When a migration fails in django-south, the developers encourage you not to use MySQL:
! The South developers regret this has happened, and would
! like to gently persuade you to consider a slightly
! easier-to-deal-with DBMS (one that supports DDL transactions)
Having gone down the road of MySQL because I was familiar with it (and struggling to find a proper installer and a quick test of the slow web "workbench" interface of postgreSQL put me off), at the end of the project, after a few months after deployment, while looking into back up options, I see you have to pay for MySQL's enterprise back up features. Gotcha right at the very end.
With MySql I had to write some ugly monster raw SQL queries in Django because no select distinct per group for retrieving the latest per group query. Also looking at postgreSQL's full-text search and wishing I had used postgresSQL.
I recommend PostgreSQL even if you are familiar with MySQL, but your mileage may vary.
UPDATE: DBeaver is a great equivalent of MySql Workbench gui tool but works with PostgreSQL very nicely (and many others as its a universal DB tool).
To add to previous answers :
"Full text search might be better supported for MySQL"
The FULLTEXT index in MySQL is a joke.
It only works with MyISAM tables, so you lose ACID, Transactions, Constraints, Relations, Durability, Concurrency, etc.
INSERT/UPDATE/DELETE to a largish TEXT column (like a forum post) will a rebuild a large part of the index. If it does not fit in myisam_key_buffer, then large IO will occur. I've seen a single forum post insertion trigger 100MB or more of IO ... meanwhile the posts table is exclusiely locked !
I did some benchmarking (3 years ago, may be stale...) which showed that on large datasets, basically postgres fulltext is 10-100x faster than mysql, and Xapian 10-100x faster than postgres (but not integrated).
Other reasons not mentioned are the extremely smart query optimizer, large choice of join types (merge, hash, etc), hash aggregation, gist indexes on arrays, spatial search, etc which can result in extremely fast plans on very complicated queries.
Will this application be hosted on your own servers or by a hosting company? Make sure that if you are using a hosting company, they support the database of choice.
There is a major licensing difference between the two db that will affect you if you ever intend to distribute code using the db. MySQL's client libraries are GPL and PostegreSQL's is under a BSD like license which might be easier to work with.