Realistic performance comparison of MySQL vs PostgreSQL

We are in the process of designing a new system that will use either MySQL or PostgreSQL, depending on performance. But there are several problems in doing a realistic comparison. I have summed up some of them; it would be helpful if some experts threw some wisdom in here.
Using a neutral performance testing tool
Postgres has something called EXPLAIN ANALYZE which basically gives all the details necessary to optimize on the database side. But MySQL does not have anything as detailed as this.
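For illustration, here is roughly what the two commands look like against a hypothetical orders table (the table and column names are made up):

-- PostgreSQL: EXPLAIN ANALYZE actually runs the query and reports per-node
-- timings and real row counts alongside the planner's estimates
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- MySQL: plain EXPLAIN shows the chosen access path (type, key, estimated rows)
-- but does not execute the query or report actual timings
EXPLAIN
SELECT * FROM orders WHERE customer_id = 42;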
Of course these commands give information about a single query; real-world performance involves the bigger workloads the application will actually generate.
How much of this is true? If a query is slower in Postgres and faster in MySQL, will Postgres come out ahead under heavier workloads? Of course only real tests can tell, but is it worth going in this direction?
I am familiar with JMeter, but are there any better tools for such tasks?
Optimization of both the databases
Postgres is said to be slower for simple reads, but to scale well as the data grows and for more complex workloads (taken from here and here).
With that said, how much optimisation is necessary so that the tests are fair to both database systems?
Any additional points are also welcome.

The size of the data will have more significance than the workload; resource (memory) tuning can have a big effect too.
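As a rough sketch of the kind of memory knobs involved (the settings are real, the values are placeholders, not recommendations):

-- MySQL/InnoDB: the buffer pool is the single most important memory setting
-- (dynamic from MySQL 5.7; older versions need my.cnf and a restart)
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4 GB

-- PostgreSQL (9.4+): ALTER SYSTEM writes postgresql.auto.conf;
-- shared_buffers still only takes effect after a server restart
ALTER SYSTEM SET shared_buffers = '2GB';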
"With that said,how much optimisation is necessary so that the tests are fair to both database systems."
Is seems to me that the only way to be fair is to do real-world optimisation. Optimise your test systems to as close to production as you can justify. if you're not going to be writing SQL both are going to perform about the same. (+/- $1000 worth of server hardware)
if you're writing SQL you want to keep the programmers happy. ($10000 of programmers won't get you much more performance)

The only realistic performance comparison is with the system that you are designing. Why don't you make your system configurable to use either MySQL or PostgreSQL, then run load tests against it with both databases and compare the results? That is what I did when comparing MySQL vs PostgreSQL vs Docker in this open source news feed micro-service.

Related

Tuning MySQL Database

I have a MySQL database running on a dedicated Ubuntu server with 2GB RAM and a 500GB hard drive. I would appreciate it if anyone could help with fine-tuning the database to increase performance. Enhancements need to impact the CRUD tasks of the database, including the performance of procedure calls and scheduled events.
I have searched the web regarding this and found various mechanisms and tools on various websites for the job. But I need to know the proper way of improving the performance (e.g. the execution time of an SQL query) of the MySQL database itself, without using any 3rd party tools or software. The database configuration I have is listed below.
MySQL version: 5.5
Used storage engine: MyISAM
Operating system: Ubuntu 12
Hard disk capacity: 500GB
RAM: 2GB
Other: The database consists of Tables, Indexes, Stored Procedures, Scheduled Events and Views
You have said nothing about the specifics of your data, its distribution, the type of workload you use, the ratio of reads to writes, the variety of your queries, the complexity of your queries, and so on. This is a vital part of the tuning process for one simple reason:
Tuning is specific to your data and your workload.
The guys who make database platforms such as MySQL pay a lot of attention to making sure the default settings are good enough for the majority of users. If there was some easy route to improving the performance of a database, they'd already have done it at the factory.
The guys who make the third party tools, on the other hand, write code that reads your data and your logs to find out information about your tables, their contents, and your queries, and that code makes best-guess estimates about tuning based on your data and your workload. They're not perfect, but they sure beat having to do that stuff manually if you don't know how to.
Think of tuning a database like tuning a guitar: You start with an idea of what you want (Standard tuning? Drop D? DADGAD?) and then you make small adjustments to one string at a time, measuring it against your desired result. Once you've achieved the best possible result for that string, you move onto the next one and make small changes there etc. When you get to the final string, you might have adjusted the balance of the whole guitar so you might have to revisit the settings from the beginning to make tiny incremental changes until the whole lot is singing perfectly.
Read http://dev.mysql.com/doc/refman/5.5/en/server-parameters.html to get started on the most important "strings" to tune in MySQL 5.5. There are lots, but none of them are particularly difficult on their own.
As a tangent, tuning your server away from the defaults might give you a 5-10% boost in performance. You'd be much better spending your time looking at your database design, data types, and the indexes you're using. You can often make 50%-100% improvements in performance by doing that sort of thing.
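For example (hypothetical table and column names), those gains usually come from adding an index that matches a frequent WHERE clause and verifying the plan change:

-- See how a slow query is currently executed (look for type=ALL, i.e. a full scan)
EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';

-- Add a composite index matching the WHERE clause, then re-check the plan
ALTER TABLE orders ADD INDEX idx_customer_status (customer_id, status);
EXPLAIN SELECT * FROM orders WHERE customer_id = 42 AND status = 'shipped';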
You should find http://www.mysqlcalculator.com/ helpful for starters.
This will show you some critical general defaults and allow you to enter your own values, as displayed by SHOW GLOBAL VARIABLES, to calculate MySQL's maximum memory usage.
This will only scratch the surface - and will be enlightening.
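If you want to pull the same numbers the calculator asks for straight from the server, something like this works (the variable list is illustrative, not exhaustive):

-- Per-server buffers, allocated once
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('key_buffer_size', 'innodb_buffer_pool_size', 'query_cache_size', 'tmp_table_size');

-- Per-connection buffers: multiply by max_connections for the worst case
SHOW GLOBAL VARIABLES WHERE Variable_name IN
  ('max_connections', 'sort_buffer_size', 'read_buffer_size', 'join_buffer_size', 'thread_stack');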
There is NO simple answer.

mysql performance benchmark

I'm thinking about moving our production environment from a self-hosted solution to Amazon AWS. I took a look at the different services and thought about using RDS as a replacement for our MySQL instances. The hardware we're using for our master seems to be better than the best hardware we can get when using RDS (Quadruple Extra Large DB Instance). Since I can't simply move our production environment to AWS and see if the performance is still good enough, I'd love to do some tests in advance.
I thought about creating a full query log from our current master, configuring the RDS instance, and replaying the full query log against it. Actually, I don't even know if this kind of testing is a good idea, but I guess you'll tell me if there are better ways to make sure the performance of MySQL won't drop dramatically when making the move to RDS.
Is there a preferred tool to replay the full query log?
What metrics should I look at while running the test?
cpu usage?
memory usage?
disk usage?
query time?
anything else?
Thanks in advance
I'd recommend against replaying the query log - it's almost certainly not going to give you the information you want, and will take a significant amount of effort.
Firstly, you'd need to prepare your database so that replaying the query log won't break constraints when inserting, updating or deleting data, and that subsequent "select" queries will find the records they should find. This is distinctly non-trivial on anything other than a toy database - just taking a back-up and replaying the log doesn't necessarily guarantee the ordering of DML statements will match what happened on production. This may well give you a false sense of comfort - all your select statements return in a few milliseconds, because the data they're looking for doesn't exist!
Secondly, load and performance testing rarely works by replaying what happened on production - that doesn't (usually) reflect the peak conditions that will bring your system to its knees. For instance, most production systems run happily most of the time at <50% capacity, but go through spikes during the day, when they might reach 80% or more of capacity - that's what you care about: can your new environment handle the peaks?
My recommendation would be to use a tool like JMeter to write performance scripts (either directly against the database using the JDBC driver, or through the front end if you've got a web application). Your performance scripts should reflect the behaviour you see from users, and be parameterized so they're not dependent on the order in which records are created.
Set yourself some performance targets (ideally based on current production levels, with a multiplier to cover you against spikes), e.g. "100 concurrent users, with no query taking more than 1 second", and use JMeter to simulate that load. If you reach it first time, congratulations - go home! If not, look at the performance counters to see where the bottleneck is; see if you can alleviate that bottleneck (or tune your queries - your awesome on-premise hardware may be hiding some performance issues). Typical bottlenecks are CPU, RAM, and disk I/O.
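On the MySQL side, a few global status counters are a cheap way to watch for those bottlenecks while the load test runs (sample them before and after each run and compare the deltas):

-- Threads actually executing; a steadily climbing value under load suggests saturation
SHOW GLOBAL STATUS LIKE 'Threads_running';

-- Buffer pool misses that had to go to disk: a proxy for disk I/O pressure
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads';

-- Statements that exceeded long_query_time: a proxy for query-time regressions
SHOW GLOBAL STATUS LIKE 'Slow_queries';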
Experiment with different test scenarios - "lots of writes", "lots of reads", "lots of reporting queries", and mix them up.
The idea is to understand the bottlenecks in the system, see how far you are from those bottlenecks, and understand what you can do to alleviate them. Once you know that, your decision to migrate will be far more robust.

Is mongoDB or Cassandra better than MySQL for large datasets?

In our (currently MySQL) database there are over 120 million records, and we make frequent use of complex JOIN queries and application-level logic in PHP that touch the database. We're a marketing company that does data mining as our primary focus, so we have many large reports that need to be run on a daily, weekly, or monthly basis.
Concurrently, customer service operates on a replicated slave of the same database.
We would love to be able to make these reports happen in real time on the web instead of having to manually generate spreadsheets for them. However, many of our reports take a significant amount of time to pull data for (in some cases, over an hour).
We do not operate in the cloud, choosing instead to operate using two physical servers in our server room.
Given all this, what is our best option for a database?
I think you're going the wrong way about the problem.
Thinking that you'll get better performance just by dropping in NoSQL is not really true. At the lowest level, you're writing and retrieving a fair chunk of data, which implies your bottleneck is most likely HDD I/O (the most common bottleneck).
Sticking with the hardware you currently have and using a monolithic data store isn't scalable and, as you noticed, has implications when you want to do something in real time.
What are your options? You need to scale your server and software setup (which is what you'd have to do with any NoSQL solution anyway - at some point you have to stick in faster hard drives).
You also might want to look into alternative storage engines (other than MyISAM and InnoDB - for example, one of the better engines that seemingly turns random I/O into sequential I/O is TokuDB).
Implementing a faster HDD subsystem would also help (FusionIO, if you have the resources to get it).
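Before switching engines or hardware, it's worth checking what you're actually running on today; a quick sketch (the schema name 'mydb' and the table name are placeholders):

-- Which engine each table uses, and roughly how big it is
SELECT table_name, engine, table_rows,
       ROUND((data_length + index_length) / 1024 / 1024) AS size_mb
FROM information_schema.tables
WHERE table_schema = 'mydb'
ORDER BY size_mb DESC;

-- Convert a table to another engine (this rewrites the table, so test on a copy first)
ALTER TABLE big_report_table ENGINE = InnoDB;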
Without more information on your end (what the server setup is, what MySQL version you're using and what storage engines + data sizes you're operating with), it's all speculation.
Cassandra still needs Hadoop for MapReduce, and MongoDB has limited concurrency with regard to MapReduce...
... so ...
... 120 million records is not that much, and MySQL should easily be able to handle that. My guess is an I/O bottleneck, or that you're doing lots of random reads instead of sequential reads. I'd rather hire a MySQL techie for a month or so to tune your schema and queries, instead of investing in a new solution.
If you provide more information about your cluster, we might be able to help you better. "NoSQL" by itself is not the solution to your problem.
As much as I'm not a fan of MySQL once your data gets large, I have to say that you're nowhere near needing to move to a NoSQL solution. 120M rows is not a big deal: the database I'm currently working with has ~600M in one table alone and we query it efficiently. Managing that much data from an ops perspective is the problem; querying it isn't.
It's all about proper indexes and the correct use of them when joining, and secondarily memory settings. Find your slow queries (MySQL slow query log FTW!), and learn to use the EXPLAIN keyword to understand why they are slow. Then tweak your indexes so your queries are efficient. Further, make sure you understand MySQL's memory settings. There are great pages in the docs explaining how they work, and they aren't that hard to understand.
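A minimal version of that workflow (the threshold and the tables in the sample query are illustrative):

-- Log every statement slower than one second
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- For each offender found in the log, inspect the plan and look for
-- full table scans, missing keys, and filesorts
EXPLAIN
SELECT o.*, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.created_at >= '2013-01-01';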
If you've done both of those things and you're still having problems, make sure disk I/O isn't an issue. If it is, then you should look into another solution for querying your data.
NoSQL solutions like Cassandra have a lot of benefits. Cassandra is fantastic at writing data. Scaling your writes is very easy - just add more nodes! But the tradeoff is that it's harder to get the data back out. From a cost perspective, if you have expertise in MySQL, it's probably better to leverage that and scale your current solution until it hits a limit before completely switching your underlying architecture.

SQL query optimization and debugging

The question is about best practice.
How to perform a reliable SQL query test?
That is, the question is about optimization of the DB structure and the SQL query itself, not about system and DB performance, buffers, and caches.
When you have a complicated query with a lot of joins etc., one day you need to understand how to optimize it, and you come to the EXPLAIN command (mysql::explain, postgresql::explain) to study the execution plan.
After tuning the DB structure you execute the query to see any performance changes, but here you run into multiple levels of optimization/buffering/caching. How can I avoid this? I need the pure time for the query execution and to be sure it is not affected.
If you know of different practices for different servers, please specify explicitly: MySQL, PostgreSQL, MSSQL, etc.
Thank you.
For Microsoft SQL Server you can use DBCC FREEPROCCACHE (to drop compiled query plans) and DBCC DROPCLEANBUFFERS (to purge the data cache) to ensure that you are starting from a completely uncached state. Then you can profile both uncached and cached performance, and determine your performance accurately in both cases.
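A short sketch of that workflow on SQL Server (for test servers only - clearing caches on production will hurt everyone; the sample query is hypothetical):

-- Flush dirty pages first so DROPCLEANBUFFERS really empties the data cache
CHECKPOINT;
DBCC DROPCLEANBUFFERS;   -- purge the data cache
DBCC FREEPROCCACHE;      -- drop compiled query plans

-- Report compile and execution times plus logical/physical reads for the query under test
SET STATISTICS TIME ON;
SET STATISTICS IO ON;
SELECT COUNT(*) FROM dbo.orders WHERE order_date >= '2013-01-01';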
Even so, a lot of the time you'll get different results at different times depending on how complex your query is and what else is happening on the server. It's usually wise to test performance multiple times in different operating scenarios to be sure you understand what the full performance profile of the query is.
I'm sure many of these general principles apply to other database platforms as well.
In the PostgreSQL world you need to flush the database cache as well as the OS cache as PostgreSQL leverages the OS caching system.
See this link for some discussions.
http://archives.postgresql.org/pgsql-performance/2010-08/msg00295.php
Why do you need pure execution time? It depends on so many factors and is almost meaningless on a live server. I would recommend collecting some statistics from the live server and analyzing query execution times using the pgFouine tool (it's for PostgreSQL), and making decisions based on that. You will see in its report exactly what you need to tune and how effective your changes were.

Are the consistency/data loss/query optimization issues I read about "that bad"?

As I've been looking into the differences between Postgres and MySQL, it has struck me that, if what I read is to be believed, MySQL should be (disclaimer: by reading the rest of this sentence, you agree to read the next paragraph as well) the laughingstock of the RDBMS world: it doesn't enforce ACID by default, the net is rife with stories of MySQL-related data loss, and by all accounts the query optimizer is a joke.
But none of this seems to matter. It's not hard to tell that MySQL has about a million times* as much hype as Postgres (it's LAMP, not LAPP), big installations of MySQL are not unheard of (LJ? Digg?) and I haven't noticed a drop in MySQL's popularity.
This makes me wonder: are these "problems" with MySQL really that bad?
So, if you have used MySQL for a reasonably large project**, what was your experience like? Did you use Postgres as well? How was it worse? How was it better?
*: [citation needed]
**: I'm well aware that, for "small things" (blogs, what have you), MySQL (along with practically every other RDB) is just fine.
Since it's tagged [subjective], I'll be subjective. For me it's about the little things. PostgreSQL is more developer friendly and makes it easy to do the right thing regarding data integrity by default.
If you give MySQL an incorrect type, it will implicitly convert it even if the conversion is incorrect. PostgreSQL will complain.
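A simple illustration of the difference (hypothetical table; the MySQL behaviour shown is the classic non-strict sql_mode default):

CREATE TABLE t (n INT);

-- MySQL without STRICT_TRANS_TABLES: the bad value is silently coerced to 0,
-- and you only get a warning
INSERT INTO t VALUES ('not a number');

-- PostgreSQL: the same statement is rejected outright with an
-- "invalid input syntax" error, so the bad data never gets in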
EXPLAIN in PostgreSQL is way more useful than in MySQL. It gives you the exact structured query plan: what kind of algorithm it will use, what cost each step has, etc. This means that if the query optimizer in MySQL doesn't do what you think it does, you will have a hard time debugging it.
If you ever wrote anything more complex in the MySQL stored procedure language, you will know how painful it is. PL/pgSQL is actually a nice language + you can use many other languages.
MySQL doesn't have sequences, so if you need them you have to roll your own. Most people will do it wrong and have race conditions in their code.
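For comparison (the sequence name and the counters table are hypothetical), PostgreSQL's built-in sequences versus the usual MySQL emulation with a one-row counter table:

-- PostgreSQL: a first-class object, safe under concurrency with no explicit locking
CREATE SEQUENCE invoice_number_seq;
SELECT nextval('invoice_number_seq');

-- MySQL: a common emulation using LAST_INSERT_ID(expr) on a counter table;
-- easy to get wrong if you try a plain SELECT-then-UPDATE instead
UPDATE counters SET value = LAST_INSERT_ID(value + 1) WHERE name = 'invoice_number';
SELECT LAST_INSERT_ID();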
PostgreSQL exposes most of its internal lock types to the developer. If you need to lock your table in a special way, you can do that.
Everything is programmable in PostgreSQL. For example, if you need your own data type for some specific data, you can add it. You can add casts and operators for the data types. Probably not worth the effort for small projects, but it's better than storing things as strings.
PostgreSQL adds every action including DDL changes to a transaction, unlike MySQL. If you have a conversion script that creates/drops tables, BEGIN/END won't help you in MySQL to keep it in consistent state.
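For example (hypothetical tables), a failed migration script behaves very differently on the two systems:

BEGIN;
CREATE TABLE new_feature (id INT PRIMARY KEY, label TEXT);
ALTER TABLE accounts ADD COLUMN feature_id INT;   -- suppose a later step fails here
ROLLBACK;

-- PostgreSQL: new_feature is gone again and accounts is untouched
-- MySQL: each DDL statement commits implicitly, so the database is left half-migrated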
That doesn't mean it's impossible to write good database applications with MySQL, it just requires more effort.
MySQL can be used for reasonably large applications, provided you really know what you do and don't trust the defaults.
MySQL's defaults are optimized to be easy to use, to let you get started quickly, and (usually) to provide the best performance. Other databases choose defaults that are at the very least ACID and are scalable (i.e. defaults that are not necessarily the best/fastest for small data sets).
Another item is that MySQL only learned to be a "real database" relatively recently, while almost all competing products started life with full ACID in mind.
MySQL had problems with almost all aspects of ACID at one time or another. Most of them are gone or can be configured away, but you will have to check each one. The problem with troubles in atomicity for example is that you will not notice them until you place your system under heavy load (which often coincides with it being a production system, unfortunately).
So my summary would be: MySQL is capable of working in this environments, but it takes work. And the path it took to get to that point cost it quite a few points in the confidence area.
Provided you know what its capabilities are, then it may fit your use case.
If used correctly, then it is ACID compliant. If used incorrectly, it is not. The trouble is that people seem to assume that it's a good thing to have ACID compliance.
In reality ACID is often the enemy of performance (Particularly the D for durability). By relaxing durability very slightly, we can typically get a very large performance boost.
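Typical examples of that trade-off (the settings are real, the values illustrative; be sure you understand the crash-recovery consequences before relaxing them):

-- MySQL/InnoDB: flush the redo log roughly once a second instead of at every commit,
-- so a crash can lose up to about a second of committed transactions
SET GLOBAL innodb_flush_log_at_trx_commit = 2;

-- PostgreSQL: report commit success before the WAL record is flushed to disk
-- (session-level here; it can also be set globally in postgresql.conf)
SET synchronous_commit = off;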
Likewise, even using the MyISAM engine (which doesn't have much in the way of durability, nor much of the other ACID properties either) is still appropriate for some problem domains.
We are using MySQL in some applications - and it is doing a pretty good job.
In the newer projects we are using the InnoDB engine - and although it may be slower than the default engine, it is working well.
Right now we are using an ORM mapper - and so most of the complexity is hidden behind the ORM mapper (and working nicely).
I think the infrastructure (Tools and information) is one of MySQL's big plusses: we are using really nice tools: Toad for MySQL and MySQL Administrator.
Although I have to admit that I had a shocking experience last week when helping a friend with a SQL statement: the correlated subquery nearly stopped his MySQL server - but with the trick of enclosing it in another query, it worked really well.
This is nothing which REALLY shocks me - because I've used other DB systems which cost big bucks (I'm looking at you - DB2) - and they had other things to work around. (maybe not as drastic - but still you had to optimize for them).
I haven't used both for a single large project, but having used both I have some idea of how they compare.
In general, almost all of MySQL's problems can be worked around with good discipline. The issue is more that the developer has to know all the gotchas and work around them. After working with PostgreSQL or Oracle this feels a bit like death by a thousand papercuts. You get used to stuff just working.
This is a pretty significant issue in the types of stuff that I have worked on: complex schemas with complex queries and lots of data, and tight schedules with little time for performance engineering, meaning that getting consistently reasonable performance without having to manually optimize queries is important. A good cost-based optimizer is almost a requirement. Combine that with quite a lot of outsourcing, with development teams that don't have the experience to catch all the gotchas in time, and the little issues escalate to large QA problems. Hitting any of MySQL's silent data corruption gotchas in production is something that really scares me. I'll take any declarative constraints at the database level that I can get, to have at least some safety net; MySQL unfortunately falls short on that.
PostgreSQL has the added benefit that it can run significantly more algorithms using more advanced data-structures in the database. Most of our large projects have a few cases where MySQL will hit its limits. Moving the algorithms outside the database requires considerably more effort with pretty tricky code involving correct locking and synchronization. In particular I have at one time or another hit the need for partial indexes, indexes on expressions, custom aggregate functions, set returning stored procedures, array and hash datatypes, inverted indexes on array values, update/delete-returning, deferrable foreign key constraints.
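For reference, the first two of those look like this in PostgreSQL (table and column names are hypothetical); MySQL, at least through the 5.x series, has no direct equivalent:

-- Partial index: only index the rows a hot query actually touches
CREATE INDEX idx_orders_open ON orders (customer_id) WHERE status = 'open';

-- Expression index: index a computed value so queries on lower(email) can use it
CREATE INDEX idx_customers_lower_email ON customers (lower(email));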
On the other hand, MySQL has, at least for now, a better story for scale-out. If I had to support a huge number of users on a reasonably simple application, and had the team to build a heavily partitioned and replicated database with eventual consistency, I'd pick MySQL over PostgreSQL for the low-level data storage building block. Then again, the competitors in that space are the key-value databases.
are these "problems" with MySQL really that bad?
Actually, the pain MySQL will inflict on you can range from moderate to insane, and much of it depends on MyISAM.
I find a good rule of thumb is this:
Are you backing up some MyISAM tables?
MyISAM is great for data you don't really care about, like traffic logs and the like, or for data that you can easily restore in case of a problem since it's read-only and hence never changed since the time you loaded that 10GB dump. In those cases the compact row format of MyISAM brings great space savings (that however do not translate into faster seq scan speed, for some reason).
If the data you put in MyISAM tables is worth backing up, you are going to enter in a world of hurt when you realize some day that it is all inconsistent because of the lack of FK and constraint checks, and incidentally all your backups will contain inconsistent data too.
If you make lots of concurrent updates to MyISAM tables, then you are gonna go way past the world-of-hurt stage: when the load reaches a certain threshold, you are doomed. Of course the readers block writers, which block readers, which block queued writers, etc., so the performance is bad, load average goes to 200, and your box is nuked. I could also consistently crash MyISAM tables in a benchmark I wrote 2 years ago just by hitting them with too much load. Random data ensued, sometimes crashing MySQL on selects or spewing random errors.
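One quick way to check whether MyISAM's table-level locking is already biting on an existing server is to compare these two counters; a high and growing Table_locks_waited relative to Table_locks_immediate points at lock contention:

SHOW GLOBAL STATUS LIKE 'Table_locks_immediate';  -- lock requests granted immediately
SHOW GLOBAL STATUS LIKE 'Table_locks_waited';     -- lock requests that had to wait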
So, if you avoid MyISAM like the plague it is, the problems with MySQL aren't really that bad. InnoDB is robust. However, generally I find it inferior to Postgres, which is faster, has far fewer gotchas, and Gets The Job Done easier and faster.
No, the issues you mention are NOT a big deal. See Google and Facebook as two examples of companies that are using MySQL to accomplish Herculean tasks you'll only ever dream of encountering.
I use the following rules when running a MySQL to prevent headaches down the line:
Take daily, weekly, and monthly snapshots of the database. More often than not the problems you'll run into have nothing to do with MySQL; instead, it's a boneheaded developer running:
DELETE FROM mytable; # Where is the WHERE?
Use InnoDB by default, the only reason to use MyISAM is for full text search.
Get your database schema under source control.