How do I benchmark MySQL?

I'm currently using MySQL Workbench. I want to see the difference in performance as the number of rows in a table increases. I specifically want to test and compare 1,000 rows, 10,000 rows, 100,000 rows, 1,000,000 rows, and 10,000,000 rows.
So, are there any tools that will allow me to do this and provide statistics on disk I/O, memory usage, CPU usage, and time to complete the query?

Yes. The BENCHMARK() function is probably your best option for some of these.
You can run simple queries like:
jcho360> select benchmark (10000000,1+1);
+--------------------------+
| benchmark (10000000,1+1) |
+--------------------------+
|                        0 |
+--------------------------+
1 row in set (0.18 sec)
jcho360> select benchmark (10000000,1/1);
+--------------------------+
| benchmark (10000000,1/1) |
+--------------------------+
|                        0 |
+--------------------------+
1 row in set (1.30 sec)
An addition is faster than a division (and you can compare anything else you can imagine the same way).
I'd also recommend taking a look at these programs, which will help you with this kind of performance testing:
mysqlslap (like BENCHMARK, but you can customize the workload and the reported results much more).
sysbench (tests CPU performance, I/O performance, mutex contention, memory speed, and database performance).
MySQLTuner (analyzes general statistics, storage engine statistics, and performance metrics).
mk-query-profiler (performs analysis of a SQL statement).
mysqldumpslow (good for finding out which queries are causing problems).
Some of them are third-party tools, but you can find plenty of information by searching for each tool's name.
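If you specifically want to compare the same query at 1,000 up to 10,000,000 rows, as in the question, you can also build throwaway tables of each size and time identical statements against them. Below is a minimal sketch; the table and column names are made up, and the recursive CTE needs MySQL 8.0+ (on older versions you can double the table repeatedly with INSERT ... SELECT instead).

-- Hypothetical benchmark table; repeat with bench_10k, bench_100k, bench_1m, bench_10m.
CREATE TABLE bench_1k (
  id      INT AUTO_INCREMENT PRIMARY KEY,
  payload VARCHAR(100)
);

-- Fill with 1,000 rows. For the larger tables, raise the limit and
-- SET SESSION cte_max_recursion_depth accordingly (default is 1000).
INSERT INTO bench_1k (payload)
WITH RECURSIVE seq (n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM seq WHERE n < 1000
)
SELECT CONCAT('row-', n) FROM seq;

-- Run the same statement against each table; the client prints the elapsed time.
SELECT COUNT(*) FROM bench_1k WHERE payload LIKE 'row-1%';

Keep in mind this only gives you elapsed time; disk I/O, memory, and CPU usage still have to come from the operating system or from tools like sysbench and mysqlslap.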

Related

On the Efficiency of Data Infrastructure, Storage and Retrieval with SQL

I'm curious about which is the most efficient way to store and retrieve data in and from a database.
The table:
+----+--------+--------+   +-------+   +----------+
| id | height | weight | ← | bmi   | ← | category |
+----+--------+--------+   +-------+   +----------+
|  1 |    184 |     64 |   | 18.90 |   |        2 |
|  2 |    147 |     80 |   | 37.02 |   |        4 |
|  … |      … |      … | ← |     … | ← |        … |
+----+--------+--------+   +-------+   +----------+
From a storage perspective
If we want to be more efficient in terms of storing the data, the columns bmi and category would be redundant, since they add data we could otherwise compute from the height and weight columns.
From a retrieval perspective
Leaving out the category column, we could ask
SELECT *
FROM bmi_entry
WHERE bmi >= 18.50 AND bmi < 25.00
and leaving out the bmi column as well, that becomes
SELECT *
FROM bmi_entry
WHERE weight / ((height / 100) * (height / 100)) >= 18.50
AND weight / ((height / 100) * (height / 100)) < 25
However, the calculation could hypothetically take much longer than simply comparing a column to a value, in which case
SELECT *
FROM bmi_entry
WHERE category = 2
would be the far superior query in terms of retrieval time.
Best practice?
At first, I was about to go with method one, thinking why store "useless" data and take up storage space… but then I thought about the implementation and how potentially having to recalculate those "obsolete" fields for every single row every time I want to sort and retrieve specific sets of BMI entries within specific ranges or categories could dramatically slow down the time it takes to collect the data.
Ultimately:
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
Would there ever be a case in which you would prioritise storage space over retrieval time?
If the answer to (1.) is a simple "yup", you can comment that below. :-)
If you have a more in depth elaboration on either (1.) or (2.), however, feel free to post that or those as well, as I, and others, would be very interested in reading more!
Wouldn't the arithmetic functions of division and multiplication take more time and thus slow down the user experience?
You might have assumed "yup" would be the answer, but in fact the complexity of the arithmetic is not the issue. The issue is that you shouldn't need to evaluate the expression at all to decide whether a row belongs in your query result.
When you search on an expression instead of an indexed column, MySQL is forced to visit every single row and evaluate the expression. This is a table scan. The cost of the query, even disregarding the possible slowness of the arithmetic, grows in linear proportion to the number of rows.
In the analysis of algorithms, we say this is "order N" cost. Even if it's really "N times a fixed multiplier due to the cost of the arithmetic," it's still the N we're worried about, especially if N is ever-increasing.
You showed the example where you stored an extra column for the pre-calculated bmi or category, but that alone wouldn't avoid the table-scan. Searching for category=2 is still going to cause a table-scan unless category is an indexed column.
Indexing a column is fine, but it's a little more tricky to index an expression. Recent versions of MySQL have given us that ability for most types of expressions, but if you're using an older version of MySQL you may be out of luck.
With MySQL 8.0, you can index the expression without having to store the calculated columns. The index is built from the result of the expression. The index itself takes storage space, but it would have done so if you had indexed a stored column, too. Read more about this here: https://dev.mysql.com/doc/refman/8.0/en/create-index.html in the section on "Functional Key Parts".
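For example, with MySQL 8.0.13+ the BMI expression from the question could be indexed directly as a functional key part. This is only a sketch against the question's bmi_entry table; the index name is made up, and the WHERE clause has to use the same expression as the index for the optimizer to consider it.

-- Functional key part: the extra parentheses around the expression are required.
CREATE INDEX idx_bmi ON bmi_entry ((weight / POWER(height / 100, 2)));

-- A range search on the same expression can now use the index instead of a table scan.
SELECT *
FROM bmi_entry
WHERE (weight / POWER(height / 100, 2)) >= 18.50
  AND (weight / POWER(height / 100, 2)) < 25.00;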
Would there ever be a case in which you would prioritise storage space over retrieval time?
Sure. Suppose you have a very large amount of data, but you don't need to run queries especially frequently or quickly.
Example: I managed a database of bulk statistics that we added to throughout the month, but we only needed to query it about once at the end of the month to make a report. It didn't matter that this report took a couple of hours to prepare, because the managers who read the report would be viewing it in a document, not by running the query themselves. Meanwhile, the storage space for the indexes would have been too much for the server the data was on, so they were dropped.
Once a month I would kick off the task of running the query for the report, and then switch windows and go do some of my other work for a few hours. As long as I got the result by the time the people who needed to read it were expecting it (e.g. the next day) I didn't care how long it took to do the query.
Ultimately the best practice you're looking for varies, based on your needs and the resources you can utilize for the task.
There is no single best practice. It depends on what you are trying to do. Here are some considerations:
Consistency
Storing the data in separate columns means that the values can get out of sync.
Using a computed column or view means that the values are always consistent.
Updatability (the inverse of consistency)
Storing the data in separate columns means that the values can be updated.
Storing the data as computed columns means that the values cannot be separately updated.
Read Performance
Storing the data in separate columns increases the size of the rows, which tends to increase the size of the table. This can decrease performance because more data must be read -- for any query on the table.
This is not an issue for computed columns, unless they are persisted in some way.
Indexing
Either method supports indexing.
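As a concrete illustration of the "computed column" option above, MySQL 5.7+ generated columns give you the consistency guarantee (the value is always derived, never separately updatable) and can still be indexed. This is a sketch against the question's table; the column and index names are illustrative, and a VIRTUAL column avoids growing the row while a STORED one is persisted.

-- bmi is always derived from height and weight, so it cannot drift out of sync.
ALTER TABLE bmi_entry
  ADD COLUMN bmi DECIMAL(5,2) AS (weight / POWER(height / 100, 2)) VIRTUAL,
  ADD INDEX idx_bmi_generated (bmi);

-- Range queries on bmi can now use the index.
SELECT * FROM bmi_entry WHERE bmi >= 18.50 AND bmi < 25.00;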

MySQL LIMIT X, Y slows down as I increase X

I have a db with around 600,000 listings. While browsing these on a page with pagination, I use this query to limit records:
SELECT file_id, file_category FROM files ORDER BY file_edit_date DESC LIMIT 290580, 30
On the first pages (LIMIT 0, 30) it loads in a few ms, and the same goes for LIMIT 30, 30, LIMIT 60, 30, LIMIT 90, 30, etc. But as I move toward the last pages, the query takes around 1 second to execute.
Indexes are probably not the issue; it also happens if I run this:
SELECT * FROM `files` LIMIT 400000,30
Not sure why.
Is there a way to improve this?
Unless there is a better solution, would it be bad practice to just load all records and loop over them in the PHP page to see whether each record is inside the pagination range, and print it if so?
Server is an i7 with 16GB ram;
MySQL Community Server 5.7.28;
files table is around 200 MB
Here is the my.cnf, if it matters:
query_cache_type = 1
query_cache_size = 1G
sort_buffer_size = 1G
thread_cache_size = 256
table_open_cache = 2500
query_cache_limit = 256M
innodb_buffer_pool_size = 2G
innodb_log_buffer_size = 8M
tmp_table_size=2G
max_heap_table_size=2G
You may find that adding the following index will help performance:
CREATE INDEX idx ON files (file_edit_date DESC, file_id, file_category);
If used, MySQL would only need a single index scan to retrieve the number of records at some offset. Note that we include the columns from the SELECT clause in the index so that the index can cover the entire query.
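A quick way to check that the index actually covers the query is to look at the EXPLAIN output for the original statement (sketch; the exact plan will depend on your data):

-- If the index covers the query, the Extra column should show "Using index"
-- (no lookups back into the clustered index).
EXPLAIN SELECT file_id, file_category
FROM files
ORDER BY file_edit_date DESC
LIMIT 290580, 30;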
LIMIT was invented to reduce the size of the result set; the optimizer can take advantage of it if you order the result set using an index.
When using LIMIT x,n the server needs to process x+n rows to deliver a result. The higher the value for x, the more rows have to be processed.
Here is the EXPLAIN output for a simple table with a unique index on column a:
MariaDB [test]> explain select a,b from t1 order by a limit 0, 2;
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra |
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
|    1 | SIMPLE      | t1    | index | NULL          | PRIMARY | 4       | NULL |    2 |       |
+------+-------------+-------+-------+---------------+---------+---------+------+------+-------+
1 row in set (0.00 sec)
MariaDB [test]> explain select a,b from t1 order by a limit 400000, 2;
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
| id   | select_type | table | type  | possible_keys | key     | key_len | ref  | rows   | Extra |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
|    1 | SIMPLE      | t1    | index | NULL          | PRIMARY | 4       | NULL | 400002 |       |
+------+-------------+-------+-------+---------------+---------+---------+------+--------+-------+
1 row in set (0.00 sec)
When running the statements above (without EXPLAIN), the execution time is 0.01 sec for LIMIT 0 and 0.6 sec for LIMIT 400000.
Since MariaDB doesn't support LIMIT inside an IN (...) subquery, you could split your SQL into two statements:
The first statement retrieves the ids (and only needs to read the index), and the second statement uses the ids returned by the first:
MariaDB [test]> select a from t1 order by a limit 400000, 2;
+--------+
| a      |
+--------+
| 595312 |
| 595313 |
+--------+
2 rows in set (0.08 sec)
MariaDB [test]> select a,b from t1 where a in (595312,595313);
+--------+------+
| a      | b    |
+--------+------+
| 595312 | foo  |
| 595313 | foo  |
+--------+------+
2 rows in set (0.00 sec)
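If you would rather keep this to a single round trip, the same idea can be written as a derived table (a "deferred join"), which MySQL and MariaDB both accept even though LIMIT is not allowed in an IN (...) subquery. This is a sketch against the files table from the question; it assumes file_id is the primary key and that the (file_edit_date, file_id, file_category) index from the first answer exists, so the inner query is an index-only scan.

SELECT f.file_id, f.file_category
FROM files AS f
JOIN (
    -- walks the index past the offset and returns just 30 ids
    SELECT file_id
    FROM files
    ORDER BY file_edit_date DESC
    LIMIT 290580, 30
) AS page USING (file_id)
ORDER BY f.file_edit_date DESC;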
Caution: I am about to use some strong language. Computers are big and fast, and they can handle bigger stuff than they could even a decade ago. But, as you are finding out, there are limits. I'm going to point out multiple limits that you have threatened; I will try to explain why the limits may be a problem.
Settings
query_cache_size = 1G
is terrible. Whenever a table is written to, the QC scans the 1GB looking for any references to that table in order to purge entries in the QC. Decrease that to 50M. This, alone, will speed up the entire system.
sort_buffer_size = 1G
tmp_table_size=2G
max_heap_table_size=2G
are bad for a different reason. If you have multiple connections performing complex queries, lots of RAM could be allocated for each, thereby chewing up RAM, leading to swapping, and possibly crashing. Don't set them higher than about 1% of RAM.
In general, do not blindly change values in my.cnf. The most important setting is innodb_buffer_pool_size, which should be bigger than your dataset, but no bigger than 70% of available RAM.
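As a sketch of what that advice looks like at runtime on the MySQL 5.7 server from the question (a 16GB machine): the values below follow the percentages suggested above but are still example numbers, and SET GLOBAL does not survive a restart, so mirror whatever you settle on in my.cnf.

-- Shrink the query cache and the per-connection buffers (roughly 1% of RAM or less).
SET GLOBAL query_cache_size    = 50 * 1024 * 1024;   -- 50M
SET GLOBAL sort_buffer_size    = 256 * 1024;         -- back near the default
SET GLOBAL tmp_table_size      = 64 * 1024 * 1024;
SET GLOBAL max_heap_table_size = 64 * 1024 * 1024;

-- Buffer pool: larger than the working set, but no more than ~70% of RAM.
-- Online resizing needs MySQL 5.7.5+.
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4G, example value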
load all records
Ouch! The cost of shoveling all that data from MySQL to PHP is non-trivial. Once it gets to PHP, it will be stored in structures that are not designed for huge amounts of data -- 400030 (or 600000) rows might take 1GB inside PHP; this would probably blow out its "memory_limit", leading to PHP crashing. (OK, just dying with an error message.) It is possible to raise that limit, but then PHP might push MySQL out of memory, leading to swapping, or maybe running out of swap space. What a mess!
OFFSET
As for the large OFFSET, why? Do you have a user paging through the data? And he is almost to page 10,000? Are there cobwebs covering him?
OFFSET must read and step over 290580 rows in your example. That is costly.
For a way to paginate without that overhead, see http://mysql.rjweb.org/doc.php/pagination .
If you have a program 'crawling' through all 600K rows, 30 at a time, then the tip about "remember where you left off" in that link will work very nicely for such use. It does not "slow down".
If you are doing something different, what is it?
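For reference, the "remember where you left off" approach from that link looks roughly like this for the files table. It is only a sketch: it assumes an index on (file_edit_date, file_id) and that the application keeps the last (file_edit_date, file_id) pair it displayed; the @last_... variables stand in for those remembered values.

-- First page, newest first.
SELECT file_id, file_category, file_edit_date
FROM files
ORDER BY file_edit_date DESC, file_id DESC
LIMIT 30;

-- Next page: resume just after the last row shown instead of
-- stepping over an ever-growing OFFSET.
SELECT file_id, file_category, file_edit_date
FROM files
WHERE file_edit_date < @last_edit_date
   OR (file_edit_date = @last_edit_date AND file_id < @last_file_id)
ORDER BY file_edit_date DESC, file_id DESC
LIMIT 30;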
Pagination and gaps
Not a problem. See also: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks which is more aimed at walking through an entire table. It focuses on an efficient way to find the 30th row going forward. (This is not necessarily any better than remembering the last id.)
That link is aimed at DELETEing, but it can easily be revised to SELECT.
Some math for scanning a 600K-row table 30 rows at a time:
My links: 600K rows are touched. Or twice that, if you peek forward with LIMIT 30,1 as suggested in the second link.
OFFSET ..., 30 must touch (600K/30)*600K/2 rows -- about 6 billion rows.
(Corollary: changing 30 to 100 would speed up your query, though it would still be painfully slow. It would not speed up my approach, but it is already quite fast.)

Google Cloud SQL is SLOW: MySQL instance with 10GB RAM is 20x slower than a MacBook Pro configured with 125MB RAM

We dumped our table per Google Cloud SQL instructions and imported it into a second generation Google Cloud SQL instance.
We were very excited to see how our numbers would be running on "google hardware".
After stress-testing our Rails app with Apache ab and seeing completion times 150ms higher, we noticed ActiveRecord was taking 30ms to 50ms more than on our production server (bare metal) for the same pages.
While we dug deeper, what really blew our minds were simple count queries like this:
GOOGLE CLOUD SQL - db-n1-standard-4 (4vcpu and 15GB RAM)
1. Cold query
mysql> SELECT COUNT(*) FROM `event_log`;
+----------+
| COUNT(*) |
+----------+
| 3998050 |
+----------+
1 row in set (19.26 sec)
2. Repeat query
mysql> SELECT COUNT(*) FROM `event_log`;
+----------+
| COUNT(*) |
+----------+
| 3998050 |
+----------+
1 row in set (1.16 sec)
SELECT @@innodb_buffer_pool_size/1024/1024/1024;
+------------------------------------------+
| @@innodb_buffer_pool_size/1024/1024/1024 |
+------------------------------------------+
| 10.500000000000 |
+------------------------------------------+
1 row in set (0.00 sec)
I can then repeat the query multiple times and the performance is the same.
Running the same query on my MacBook Pro 2017 with the exact same dump:
MACBOOK PRO 2017
1. Cold query
mysql> SELECT COUNT(*) FROM `event_log`;
+----------+
| COUNT(*) |
+----------+
| 3998050 |
+----------+
1 row in set (1.51 sec)
2. Repeat query
mysql> SELECT COUNT(*) FROM `event_log`;
+----------+
| COUNT(*) |
+----------+
| 3998050 |
+----------+
1 row in set (0.51 sec)
SELECT @@innodb_buffer_pool_size/1024/1024/1024;
+------------------------------------------+
| @@innodb_buffer_pool_size/1024/1024/1024 |
+------------------------------------------+
| 0.125000000000 |
+------------------------------------------+
1 row in set (0.03 sec)
What makes it even more absurd is that, as you can see above, I haven't tuned anything from my default MySQL install, so it's using only 125MB of RAM on my MacBook, while the Google Cloud instance has 10GB of RAM available.
We tried increasing the Google Cloud SQL instance size up to db-n1-highmem-8 (8 vCPUs with 52GB of RAM!) with no increase in performance (though if we go below db-n1-standard-4 we do see a decrease in performance).
Last but not least, using this question we can confirm that our database is only 46GB, but during the import the storage usage in Google Cloud SQL kept growing until it reached an absurd 74GB... we don't know if that's because of binary logging (which is ON in Google Cloud SQL by default and off on my local machine).
So... isn't anyone using Google Cloud SQL in production? :)
UPDATE: we used the exact same .sql dump and loaded it into a db.r4.large AWS RDS instance (so the same CPU/RAM) and got consistent 0.50s performance on the query, and it also didn't consume more than 46GB in the instance.
Compare the execution plans (prepending EXPLAIN) and you'll likely find some notable implementation differences resulting from variations in configuration parameters beyond the buffer pool size.
I encountered similar issues setting up a Postgres Cloud SQL db over the weekend with ~100GB of data, mirroring a local db on my MacBook Pro. Performance was comparable to my local db for very targeted selects using indices, but queries that scanned non-trivial amounts of data were 2-5x slower.
Comparing the config results of SHOW ALL (SHOW VARIABLES in MySQL, I think) between the local and cloud instances, I noticed several differences, such as max_parallel_workers_per_gather = 0 on Cloud SQL vs 2 on my local instance.
In the case of a select count(*)... a max_parallel_workers_per_gather setting > 0 allows the use of a Gather over the results of parallel sequential scans using multiple workers; when set to zero the engine has to perform a single sequential scan. For other queries I noticed similar trends where parallel workers were used in my local db, with lower costs and faster speeds than the cloud instance.
That's just one contributing factor; I'm sure digging into settings would turn up many more such explanations. These are the tradeoffs that come with managed services (though it'd be nice to have more control over such parameters).
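On the MySQL side, a quick way to do that kind of comparison is to dump the server variables and the chosen plan on both instances and diff them; a sketch, using the event_log table from the question:

-- Run on both servers and diff the output; buffer, I/O and
-- parallelism-related settings are the usual suspects.
SHOW VARIABLES LIKE 'innodb%';
SHOW VARIABLES LIKE '%buffer%';

-- Then compare the plans chosen for the slow statement.
EXPLAIN SELECT COUNT(*) FROM event_log;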

MySQL LIMIT x,y performance huge difference on 2 machine

I have a query on an InnoDB item table which contains 400k records (only...). I need to page the results for the presentation layer (60 per page), so I use LIMIT with values depending on the page to display.
The query is (the 110000 offset is just an example):
SELECT i.id, sale_type, property_type, title, property_name, latitude,
longitude,street_number, street_name, post_code,picture, url,
score, dw_id, post_date
FROM item i WHERE picture IS NOT NULL AND picture != ''
AND sale_type = 0
ORDER BY score DESC LIMIT 110000, 60;
Running this query on my machine takes about 1s.
Running this query on our test server is 45-50s.
EXPLAIN are both the same:
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+-------------+
| id | select_type | table | type  | possible_keys | key       | key_len | ref  | rows   | Extra       |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+-------------+
|  1 | SIMPLE      | i     | index | NULL          | IDX_SCORE | 5       | NULL | 110060 | Using where |
+----+-------------+-------+-------+---------------+-----------+---------+------+--------+-------------+
The only configuration differences when querying SHOW VARIABLES are:
innodb_use_native_aio: enabled on the test server, not on my machine. I tried disabling it and didn't see any significant change.
innodb_buffer_pool_size: 1G on the test server, 2G on my machine.
The test server has 2GB of RAM and a 2-core CPU:
mysqld uses > 65% of RAM at all times, but only increases 1-2% while running the above query
mysqld uses 14% of CPU while running the above query, none when idle
My local machine has 8GB of RAM and an 8-core CPU:
mysqld uses 28% of RAM at all times, and doesn't really increase while running the above query (or only for too short a time for me to see it)
mysqld uses 48% of CPU while running the above query, none when idle
What can I do to get the same performance on the test server? Is the RAM and/or CPU too low?
UPDATE
I have set up a new test server with the same specs but 8GB of RAM and a 4-core CPU, and the performance jumped to values similar to my machine's. The original server didn't seem to use all of its RAM/CPU, so why is the performance so much worse?
One of the surest ways to kill performance is to make MySQL scan an index that doesn't fit in memory. So during a query, it has to load part of the index into the buffer pool, then evict that part and load the other part of the index. Causing churn in the buffer pool like this during a query will cause a lot of I/O load, and that makes it very slow. Disk I/O is about 100,000 times slower than RAM.
So there's a big difference between 1GB of buffer pool and 2GB of buffer pool, if your index is, say 1.5GB.
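A sketch for checking whether you are in that situation: compare the on-disk size of the table's data and indexes with the buffer pool (table name taken from the question; adjust the schema filter as needed).

-- Size of data and indexes for the item table, in MB.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'item';

-- Compare against the configured buffer pool.
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';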
Another tip: you really don't want to use LIMIT 110000, 60. That causes MySQL to read 110000 rows from the buffer pool (possibly loading them from disk if necessary) just to discard them. There are other ways to page through result sets much more efficiently.
See articles such as Optimized Pagination using MySQL.

COUNT(id) query is taking too long, what performance enhancements might help?

I have a query timeout problem. When I did a:
SELECT COUNT(id) AS rowCount FROM infoTable;
in my program, my JDBC call timed out after 2.5 minutes.
I don't have much database admin expertise but I am currently tasked with supporting a legacy database. In this mysql database, there is an InnoDB table:
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | bigint(20) | NO   | PRI | NULL    | auto_increment |
| info  | longtext   | NO   |     |         |                |
+-------+------------+------+-----+---------+----------------+
It currently has a high id of 5,192,540, which is the approximate number of rows in the table. Some of the info text is over 1M, some is very small. Around 3000 rows are added on a daily basis. Machine has loads of free disk space, but not a lot of extra memory. Rows are read and are occasionally modified but are rarely deleted, though I'm hoping to clean out some of the older data which is pretty much obsolete.
I tried the same query manually on a smaller test database which had 1,492,669 rows, installed on a similar machine with less disk space, and it took 9.19 seconds.
I tried the same query manually on an even smaller test database which had 98,629 rows and it took 3.85 seconds. I then added an index to id:
create index infoTable_idx on infoTable(id);
and the subsequent COUNT took 4.11 seconds, so it doesn't seem that adding an index would help in this case. (Just for kicks, I did the same on the aforementioned mid-sized db and access time increased from 9.2 to 9.3 seconds.)
Any idea how long a query like this should be taking? What is locked during this query? What happens if someone is adding data while my program is selecting?
Thanks for any advice,
Ilane
You might try executing the following EXPLAIN statement, which might be a bit quicker:
mysql> EXPLAIN SELECT id FROM infoTable;
That may or may not yield quicker results; look at the rows field for an estimated row count.
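In the same spirit, if an approximate count is acceptable, InnoDB's own statistics can be read without scanning the table at all; a sketch using the table name from the question (the estimate can be off by a noticeable margin):

-- Approximate row count from InnoDB statistics (no table scan).
SELECT table_rows
FROM information_schema.tables
WHERE table_schema = DATABASE()
  AND table_name = 'infoTable';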