Query to Test MySQL Performance - mysql

I need to upgrade our MySQL 5.0.27 database to 5.5. I understand there are some significant performance increases even between 5.0.27 and 5.1.
Before I complete the upgrade, I would like to establish a performance baseline so I can compare before and after.
I have several tables with 500k+ rows and one with 5.8m+ rows.
Is there a query or method I can use to get a good reliable performance metric from my own data before and after the upgrade?

You should run benchmark tests with the queries your system actually executes. There is no point in profiling queries your database is unlikely to run under real-world load: they won't show you how your actual system will behave, which is ultimately the only thing that matters.
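For example, a minimal sketch of that idea; the table and column names here (orders, order_items, created_at) are purely hypothetical stand-ins for one of your own heavy queries. Run it a few times on 5.0.27, note the timings the mysql client reports, then repeat on the upgraded server. SQL_NO_CACHE keeps the query cache from hiding the real work on repeated runs.

    -- Hypothetical stand-in for one of your real 500k+/5.8m-row queries;
    -- substitute a query your application actually runs.
    -- SQL_NO_CACHE bypasses the query cache so repeated runs measure real work.
    SELECT SQL_NO_CACHE o.customer_id, COUNT(*) AS order_count
    FROM orders o
    JOIN order_items oi ON oi.order_id = o.id
    WHERE o.created_at >= '2012-01-01'
    GROUP BY o.customer_id
    ORDER BY order_count DESC
    LIMIT 100;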

I'm going to be bold and assume that you're only interested in the performance gains for your application. So I would suggest that you first copy the current environment and settings (so that's MySQL 5.0.27 and all) onto a test server and simulate standard load and user interaction. Then upgrade the test server to MySQL 5.1 or 5.5 and perform the same simulation; that way you get the real numbers you are interested in, and you get to practice rolling out all these updates.
You can use something like http://loadimpact.com/ for a reliable benchmark of your website. But always do multiple runs and compare the average, minimum, and maximum.

Related

Amazon RDS MySQL/Aurora queries sometimes hang forever. What metrics and approaches can we use to triage this and prevent it from happening?

Some context: in our old data pipeline we run MySQL 5.6 or Aurora on Amazon RDS. The bad thing about this old pipeline is that it runs a lot of heavy computation on the database servers, because we are handcuffed by the original design: the transactional databases are treated as a data warehouse, and our backend API queries them heavily and directly. We are currently patching this old pipeline while redesigning the new data warehouse in Snowflake.
In the old pipeline, the calculation is a series of sequential MySQL queries. As our data grows, the problem is that the calculation might hang forever at, say, the step-3 query, while all the metrics we monitor in Amazon CloudWatch/Grafana (CPU, database connections, freeable memory, network throughput, swap usage, read latency, available storage, write latency, etc.) look normal. The MySQL slow query log is not much help here because every query in the pipeline is slow anyway (a single query can take hours, since the pipeline runs so much heavy computation on the database servers). The way we usually solve these problems is to "blindly" upgrade the MySQL/Aurora RDS instance and hope it fixes the issue. I am wondering:
(1) What database metrics in MySQL 5.6 or Aurora on Amazon RDS should we monitor in real time to help identify why a query freezes forever? Something like innodb_buffer_pool_size?
(2) Is there any existing tool or in-house approach that can predict how many hardware resources a query needs, so we can execute it with confidence that it will succeed?
One thought: since Amazon RDS is a bit of a black box, one option is to host our own MySQL server on an Amazon EC2 instance in parallel with our MySQL 5.6/Aurora RDS production server, so we can SSH into it and run command-line tools like mytop (https://www.tecmint.com/mysql-performance-monitoring/) to gather more real-time MySQL metrics to help triage the issue. Open to any suggestions. Thank you!
None of the tools mentioned at that link need to run on the database server itself, and their behavior should be no different if they don't. Run them on any Linux server, passing the appropriate --host, --user, and --password arguments (in whatever form each tool expects). Even mysqladmin works remotely, as do most of the MySQL command-line tools (the mysql CLI, mysqldump, mysqlbinlog, even mysqlcheck).
There is no magic coupling that most administrative utilities can gain by running on the same server as MySQL Server itself -- this is a common misconception but, in fact, even when running on the same machine, they still have to make a connection to the server, just like any other client. They may connect to the unix socket locally rather than using TCP, but it's still an ordinary client connection, and provides no extra capabilities.
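To make that concrete, here is a minimal sketch of what you can run from any remote client the next time a pipeline step appears to hang; it uses only standard MySQL 5.6 commands and information_schema tables to show whether the statement is actually executing or waiting on a lock:

    -- What is every connection doing right now, and for how long?
    SHOW FULL PROCESSLIST;

    -- Which transactions are blocked, and which transactions block them?
    SELECT r.trx_id              AS waiting_trx,
           r.trx_mysql_thread_id AS waiting_thread,
           r.trx_query           AS waiting_query,
           b.trx_id              AS blocking_trx,
           b.trx_mysql_thread_id AS blocking_thread,
           b.trx_query           AS blocking_query
    FROM information_schema.innodb_lock_waits w
    JOIN information_schema.innodb_trx r ON r.trx_id = w.requesting_trx_id
    JOIN information_schema.innodb_trx b ON b.trx_id = w.blocking_trx_id;

    -- Long-running transactions that may be holding locks or bloating undo
    SELECT trx_id, trx_started, trx_rows_modified, trx_query
    FROM information_schema.innodb_trx
    ORDER BY trx_started;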
It is also possible to run an external replica of an RDS/MySQL or Aurora/MySQL server on your own EC2 instance (or in your own data center, even). But this isn't likely to tell you a whole lot that you can't learn from the RDS metrics, particularly in light of the above. (Note also, that even replica servers acquire their replication streams using an ordinary client connection back to the master server.)
Avoid the temptation to tweak server parameters. On RDS, most of the defaults are quite sane, and unless you know specifically and precisely why you want to adjust a parameter... don't do it.
The most likely explanation for slow queries... is poorly written queries and/or poorly designed indexes.
If you are not familiar with EXPLAIN SELECT, then you need to learn it, live it, and love it. SQL is declarative, not procedural: SQL tells the server what you want, not specifically how to obtain it internally. For example, SELECT ... FROM x JOIN y tells the server to match up the rows from tables x and y ON certain criteria, but it does not tell the server whether to read from x and then find the matching rows in y, or to read from y and find the matching rows in x. The net result is the same either way -- it doesn't matter which table the server examines first internally -- but if the query or the indexes don't allow the server to correctly deduce the optimum path to the results you've requested, it can spend countless hours churning through unnecessary effort.
Take, for an extreme and overly simplified example, a table with millions of rows and a table with one row. It would make sense to read the small table first, so you know which single value you're trying to join against in the large table. It would make no sense to read through each row in the large table and then check the small table for a match for each of those millions of rows. The order in which you write the joins can be different from the order in which the actual joining is done.
And that's where EXPLAIN comes in. It allows you to inspect the query plan -- the strategy the internal query optimizer has concluded will get it to the answer you need with the least effort. This is the core of the magic of relational database systems: finding the correct solution in optimal time, based on what the server knows about the data. EXPLAIN shows you the order in which the tables are accessed, how they're joined, which indexes are used, and an estimate of how many rows from each table are involved -- and those numbers multiply together to give you an estimate of the number of permutations involved in resolving your query. Two small tables, each with 50,000 rows, joined without a proper index, mean an entirely unreasonable 2,500,000,000 combinations that must be evaluated; every row must be compared to every other row. In short, if this turns out to be the kind of thing you are (unknowingly) asking the server to do, you are definitely doing something wrong. Inspecting your query plan should be second nature any time you write a complex query, to ensure the server is using a sensible strategy to resolve it.
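For instance, a tiny sketch with hypothetical tables (customers, orders), just to show the workflow:

    -- Hypothetical join; the table and column names are only illustrative.
    EXPLAIN
    SELECT c.name, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id
    WHERE c.country = 'US';

    -- Things to look for in the output:
    --   type = ALL on a large table  -> full table scan, usually a missing index
    --   key  = NULL                  -> no usable index was chosen
    --   rows                         -> multiply across tables for a rough cost estimate
    -- A likely fix in this hypothetical case would be an index on the join column:
    ALTER TABLE orders ADD INDEX idx_customer (customer_id);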
The output is cryptic, but secret decoder rings are available.
https://dev.mysql.com/doc/refman/5.7/en/explain.html#explain-execution-plan

mysql db enterprise upgrade from 5.5.42 to 5.7.11

All,
Our product uses MySQL Enterprise, currently at 5.5.42, but we are planning to upgrade to the latest 5.7. What changes should I expect to see?
I'm interested in :
1) Performance impacts
2) Broken behavior
3) Changed behavior
4) Improvements -- functional and performance-wise
Is it advisable to first upgrade our product to 5.6.x and then to 5.7.x, or to go directly to 5.7.x and test/QA from there? Any input will be valuable.
Performance
Performance gains in 5.7 show up only on highly loaded systems. If you have fewer than 40-60 parallel executions, expect a small performance drop of around 10%. This is very system- and load-dependent, so always test it yourself!
Here is a good MySQL 5.7 performance test done by Percona: https://www.percona.com/blog/2016/05/17/mysql-5-7-read-write-benchmarks/
Changed and broken behaviour, Improvements
There were a lot of changes in 5.7. Check this link for the complete list: http://www.thecompletelistoffeatures.com/
I would mention the following changes that you should think about during the upgrade:
The default SQL_MODE changed. That's good for new installations, but on existing systems you will most probably have to set it back to the old value (see the sketch after this list).
The first GA release had automatic password expiry enabled by default, but after complaints the default was disabled in 5.7.11, so there is nothing to worry about now.
The SQL optimizer was changed a lot. Queries should generally run faster now, but I expect some could run slower. Most of your time should be spent testing queries on the new version!
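A minimal sketch of that first point; the exact legacy value is whatever your 5.5 server reports, so treat the value below as a placeholder:

    -- On the old 5.5 server, record the current mode (often the empty string by default)
    SELECT @@GLOBAL.sql_mode;

    -- On 5.7, if legacy queries break under the new strict defaults, set the old
    -- value back (use whatever the 5.5 server reported; '' shown as a placeholder)
    -- and persist the same setting in my.cnf so it survives a restart.
    SET GLOBAL sql_mode = '';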
Here is a good overview of all the changes affecting an upgrade to 5.7:
http://dev.mysql.com/doc/refman/5.7/en/upgrading-from-previous-series.html
Upgrade
There are two ways to upgrade: using mysql_upgrade or mysqldump.
With mysql_upgrade, go one major version at a time. There is no need to test the intermediate version; test only the final one.
With mysqldump, you can go directly to 5.7. It is advisable to skip the mysql system database: user accounts cannot be migrated with mysqldump, so recreate them with CREATE USER and GRANT statements (see the sketch below).
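A minimal sketch of recreating one account by hand on the new server (the account name, host pattern, and privileges are purely illustrative):

    -- Recreate application accounts on the new 5.7 server instead of dumping the mysql schema
    CREATE USER 'app_user'@'10.0.0.%' IDENTIFIED BY 'change_me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON app_db.* TO 'app_user'@'10.0.0.%';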
For further information read this: http://dev.mysql.com/doc/refman/5.7/en/upgrading.html
There is a tremendous amount of change between 5.5 and 5.7. It's simply not practical to list it all here, so I'll try to highlight the most important items. Obviously a lot of new features were also introduced (GTIDs, semi-synchronous replication, JSON support, virtual columns, etc.).
1) Performance
You should expect performance gains, since 5.6 and 5.7 improved mutex handling a lot, so less contention is expected.
Scalability was also improved, so you can scale your servers up further and performance/query capacity keeps up for longer than before.
There were a lot of changes in the optimizer, so make sure to check your queries.
At the end of the day, you should always test your own case with your own application.
2+3) Changed and broken behaviour
Password expiry: in 5.7, passwords can have an expiration time. Just something to be aware of.
Password management commands have changed, so check them before your automation breaks and surprises you.
Since the optimizer and the cost model changed a lot, some queries might be executed differently.
Two parameters I would mention specifically, because they caused some headaches for us: look into optimizer_switch and decide which flags you turn on and off, and note that the eq_range_index_dive_limit default was raised from 10 to 200 in 5.7.4+. Since I rarely see this parameter set in any my.cnf, the new default might affect you in a big way. A short sketch follows below.
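A minimal sketch of checking those two settings and, only if your query testing shows a regression, pinning them back (the values here are just examples and the old 5.6 default, not recommendations):

    -- See which optimizer flags are enabled on the new version
    SELECT @@GLOBAL.optimizer_switch;

    -- Example: disable a single 5.7 flag if testing shows it hurts a critical query
    SET GLOBAL optimizer_switch = 'condition_fanout_filter=off';

    -- Restore the old 5.6 default if the higher index-dive limit changes your plans
    SET GLOBAL eq_range_index_dive_limit = 10;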
4) Improvements
A lot. As I wrote above, there are plenty of new features and scalability/performance improvements that are meant to serve you better.
5) Upgrade path
Yes, it's best to do one major upgrade at a time: upgrade, test, and when everything is working, do the second major upgrade.
Compatibility is the issue that should concern you most when upgrading. Because your code and syntax were written against the lower version, expect some of it to behave differently: some statements will be invalid or will not be recognized by the newer version (the impact is greater the further you jump).
Normally you'll get improved performance, security, etc. with newer versions, but you need to adapt your code and syntax to cope with the new functions and features.

mysql performance benchmark

I'm thinking about moving our production environment from a self-hosted solution to Amazon AWS. I took a look at the different services and thought about using RDS as a replacement for our MySQL instances. The hardware we're using for our master seems to be better than the best hardware we can get with RDS (Quadruple Extra Large DB Instance). Since I can't simply move our production environment to AWS and see whether the performance is still good enough, I'd like to run some tests in advance.
I thought about capturing a full query log from our current master, configuring the RDS instance, and replaying the full query log against it. I don't even know whether this kind of testing is a good idea, but I guess you'll tell me if there are better ways to make sure MySQL performance won't drop dramatically when moving to RDS.
Is there a preferred tool to replay the full query log?
What metrics should I look at while running the test?
cpu usage?
memory usage?
disk usage?
query time?
anything else?
Thanks in advance
I'd recommend against replaying the query log - it's almost certainly not going to give you the information you want, and will take a significant amount of effort.
Firstly, you'd need to prepare your database so that replaying the query log won't break constraints when inserting, updating or deleting data, and that subsequent "select" queries will find the records they should find. This is distinctly non-trivial on anything other than a toy database - just taking a back-up and replaying the log doesn't necessarily guarantee the ordering of DML statements will match what happened on production. This may well give you a false sense of comfort - all your select statements return in a few milliseconds, because the data they're looking for doesn't exist!
Secondly, load and performance testing rarely works by replaying what happened on production - that doesn't (usually) reflect the peak conditions that will bring your system to its knees. For instance, most production systems run happily most of the time at <50% capacity, but go through spikes during the day, when they might reach 80% or more of capacity - that's what you care about, can your new environment handle the peaks.
My recommendation would be to use a tool like JMeter to write performance scripts (either directly against the database using the JDBC driver, or through the front end if you've got a web application). Your performance scripts should reflect the behaviour you see from users, and be parameterized so they're not dependent on the order in which records are created.
Set yourself some performance targets (ideally based on current production levels, with a multiplier to cover you against spikes), e.g. "100 concurrent users, with no query taking more than 1 second", and use JMeter to simulate that load. If you reach it first time, congratulations - go home! If not, look at the performance counters to see where the bottleneck is, and see if you can alleviate it (or tune your queries; your awesome on-premise hardware may be hiding some performance issues). Typical bottlenecks are CPU, RAM, and disk I/O.
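On the MySQL side, the performance counters can be as simple as a few global status variables sampled before, during, and after the JMeter run; a minimal sketch (these are standard status variables, and the thresholds you alert on are up to you):

    SHOW GLOBAL STATUS LIKE 'Threads_running';          -- concurrency actually executing inside MySQL
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_reads'; -- reads that missed the buffer pool and hit disk
    SHOW GLOBAL STATUS LIKE 'Innodb_row_lock_waits';    -- contention between writers
    SHOW GLOBAL STATUS LIKE 'Slow_queries';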
Experiment with different test scenarios - "lots of writes", "lots of reads", "lots of reporting queries", and mix them up.
The idea is to understand the bottlenecks on the system, and see how far you are from those bottleneck, and understand what you can do to alleviate them. Once you know that, your decision to migrate will be far more robust.

SQL query optimization and debugging

The question is about best practice.
How to perform a reliable SQL query test?
That is, the question is about optimizing the DB structure and the SQL query itself, not about system/DB performance, buffers, or caches.
When you have a complicated query with a lot of joins, etc., one day you need to figure out how to optimize it, and you come to the EXPLAIN command (mysql::explain, postgresql::explain) to study the execution plan.
After tuning the DB structure you execute the query again to see whether performance changed, but here you run into multiple levels of optimization/buffering/caching. How can I avoid this? I need the pure execution time of the query and to be sure it is not affected by caching.
If you know different practices for different servers, please specify explicitly: MySQL, PostgreSQL, MSSQL, etc.
Thank you.
For Microsoft SQL Server you can use DBCC FREEPROCCACHE (to drop compiled query plans) and DBCC DROPCLEANBUFFERS (to purge the data cache) to ensure that you are starting from a completely uncached state. Then you can profile both uncached and cached performance, and determine your performance accurately in both cases.
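A minimal sketch of that sequence, plus SQL Server's built-in timing output (run this on a test server only; never flush caches on a production box):

    -- SQL Server: time a query from a genuinely cold cache
    CHECKPOINT;                 -- write dirty pages so they can be dropped
    DBCC DROPCLEANBUFFERS;      -- purge the data cache
    DBCC FREEPROCCACHE;         -- drop compiled query plans
    SET STATISTICS TIME ON;     -- report parse/compile and execution times
    -- ... run the query under test here ...
    SET STATISTICS TIME OFF;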
Even so, a lot of the time you'll get different results at different times depending on how complex your query is and what else is happening on the server. It's usually wise to test performance multiple times in different operating scenarios to be sure you understand what the full performance profile of the query is.
I'm sure many of these general principles apply to other database platforms as well.
In the PostgreSQL world you need to flush the database cache as well as the OS cache as PostgreSQL leverages the OS caching system.
See this link for some discussions.
http://archives.postgresql.org/pgsql-performance/2010-08/msg00295.php
Why do you need pure execution time? It depends on so many factors and is almost meaningless on a live server. I would recommend collecting some statistics from the live server and analyzing query execution times with the pgfouine tool (it's for PostgreSQL), then making decisions based on that. The report will show you exactly what you need to tune and how effective your changes were.

Logging mysql queries

I am about to begin developing a logging system, for future implementation in a current PHP application, to get load and usage statistics from a MySQL database.
The statistics will later be used to get information about database calls per second, query times, etc.
Of course, this will only be used while the app is in the testing stage, since it will certainly add a bit of load itself.
However, my biggest question mark right now is whether I should use MySQL to log the queries or go for a file-based system. I guess it would be a bit of a headache to build something that allows writes from multiple locations if a file-based system handles the logs?
How would you do it?
Use the general log, which will show client activity, including all the queries:
http://dev.mysql.com/doc/refman/5.1/en/query-log.html
If you need very detailed statistics on how long each query is taking, use the slow log with a long_query_time of 0 (or some other sufficiently short time):
http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
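For example, a minimal sketch of turning both on at runtime (MySQL 5.1 or later; assumes the SUPER privilege and that you only do this during a test window, since the general log grows quickly on a busy server):

    -- Send log output to files (can also be TABLE)
    SET GLOBAL log_output = 'FILE';
    -- General query log: every statement from every client
    SET GLOBAL general_log = 'ON';
    -- Slow query log with a zero threshold: every query, with timing and rows examined
    SET GLOBAL slow_query_log = 'ON';
    SET GLOBAL long_query_time = 0;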
Then use http://www.maatkit.org/ to analyze the logs as needed.
MySQL already has logging built in; Chapter 5.2 of the manual describes the options. You'll probably be interested in the General Query Log (all queries), the Binary Log (queries that change data), and the Slow Query Log (queries that take too long or don't use indexes).
If you insist on rolling your own solution, you will want to write a database middle layer that all your DB calls go through and that handles the timing. As for where you write the results, if you're in development it doesn't matter too much, but the idea of using a second database isn't bad. You don't need an entirely separate DBMS, just a different instance of MySQL (on a different machine, or simply another instance on a different port). I'd go for a second MySQL instance over the filesystem: you get all the handy SQL functions like SUM and AVG to analyze your data (see the sketch below).
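For instance, a minimal sketch of what that second instance could hold (the table and column names are purely illustrative; the PHP middle layer would do the INSERTs):

    -- Hypothetical timing log written by the middle layer
    CREATE TABLE query_log (
      id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      logged_at   DATETIME NOT NULL,
      query_text  TEXT NOT NULL,
      duration_ms INT UNSIGNED NOT NULL
    ) ENGINE=InnoDB;

    -- Then the analysis is plain SQL
    SELECT COUNT(*)         AS calls,
           AVG(duration_ms) AS avg_ms,
           MAX(duration_ms) AS max_ms
    FROM query_log
    WHERE logged_at >= NOW() - INTERVAL 1 HOUR;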
If all you are interested in is longer-term, non-real time analysis, turn on MySQL's regular query logging. There are tons of tools for doing analysis on the query-logs (both regular and slow-query), giving you information about the run-times, average rows returned, etc. Seems to be what you are looking for.
If you are running tests against MySQL, you could store the results in a different database such as Postgres; that way your logging won't add load to the server you are measuring.
I agree with macabail but would only add that you could couple this with a cron job and a simple script to extract and generate any statistics you might want.