How to run OPTIMIZE TABLE with the least downtime - mysql

I have a MySQL 5.5 DB of +-40GB on a 64GB RAM machine in a production environment. All tables are InnoDB. There is also a slave running as a backup.
One table - the most important one - grew to 150M rows, and inserting and deleting became slow. To speed up inserting and deleting I deleted half of the table. This did not help as expected; inserting and deleting are still slow.
I've read that running OPTIMIZE TABLE can help in such a scenario. As I understand it, this operation requires a read lock on the entire table, and on a big table it might take quite a while.
What would be a good strategy to optimize this table while minimizing downtime?
EDIT The specific table to be optimized has +- 91M rows and looks like this:
+-------------+--------------+------+-----+---------+----------------+
| Field       | Type         | Null | Key | Default | Extra          |
+-------------+--------------+------+-----+---------+----------------+
| id          | int(11)      | NO   | PRI | NULL    | auto_increment |
| channel_key | varchar(255) | YES  | MUL | NULL    |                |
| track_id    | int(11)      | YES  | MUL | NULL    |                |
| created_at  | datetime     | YES  |     | NULL    |                |
| updated_at  | datetime     | YES  |     | NULL    |                |
| posted_at   | datetime     | YES  |     | NULL    |                |
| position    | varchar(255) | YES  | MUL | NULL    |                |
| dead        | int(11)      | YES  |     | 0       |                |
+-------------+--------------+------+-----+---------+----------------+

Percona Toolkit's pt-online-schema-change does this for you. In this case it worked very well.
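For reference, a rough sketch of the invocation (the database and table names are placeholders; always test with --dry-run first):
pt-online-schema-change --alter "ENGINE=InnoDB" D=mydb,t=mytable --dry-run
pt-online-schema-change --alter "ENGINE=InnoDB" D=mydb,t=mytable --execute
The --alter "ENGINE=InnoDB" no-op rebuild is effectively what OPTIMIZE TABLE does for InnoDB; the tool copies rows in chunks into a new table and swaps it in at the end, so writes are only blocked briefly.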

300 ms to insert seems excessive, even with slow disks. I would look into the root cause. Optimizing this table is going to take a lot of time. MySQL will create a copy of your table on disk.
Depending on the size of your innodb_buffer_pool (if the table is InnoDB) and the free memory on the host, I would try to preload the whole table into the OS page cache, so that at least reading the data is sped up by a couple of orders of magnitude.
If you're using innodb_file_per_table, or if it's a MyISAM table, it's easy enough to make sure the whole file is cached using "time cat /path/to/mysql/data/db/huge_table.ibd > /dev/null". If you rerun the command and it finishes in under a few seconds, you can assume the file content is sitting in the OS page cache.
You can monitor progress while the OPTIMIZE TABLE is running by looking at the size of the temporary file. It's usually in the database data directory, with a temp filename starting with a hash (#) character.
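For what it's worth, a minimal sketch (database and table names are examples):
mysql> OPTIMIZE TABLE db.huge_table;
For InnoDB this reports "Table does not support optimize, doing recreate + analyze instead" and rebuilds the table; while it runs, an intermediate #sql-*.ibd file usually appears in the database directory, and checking its size with ls -lh gives a rough progress indicator.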

This article suggests first dropping all indexes on the table, then optimizing it, and then adding the indexes back. It claims this is 20 times faster than just running OPTIMIZE.
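A hedged sketch of that approach for the table above (the table and index names are placeholders; check SHOW INDEX FROM your table for the real names):
ALTER TABLE tracks DROP INDEX idx_channel_key, DROP INDEX idx_track_id, DROP INDEX idx_position;
OPTIMIZE TABLE tracks;
ALTER TABLE tracks
    ADD INDEX idx_channel_key (channel_key),
    ADD INDEX idx_track_id (track_id),
    ADD INDEX idx_position (position);
Bear in mind that any query relying on those secondary indexes will be slow while they are dropped, so this belongs in a maintenance window.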

Update your version of MySQL; in 8.0.x, OPTIMIZE TABLE is faster than in 5.5.
An OPTIMIZE on a table with 91 million rows could take around 3 hours on your version. You can run it in the early morning, around 3 AM, so as not to disturb the users of your app.

Related

Physical disk rewrite of mysql data

I am using mysql for the first time in years to help a friend out. The issue: a mysql table that gets updated a lot with INT and CHAR values. This web app site is hosted on a large generic provider, so I have no direct control of setup/parameters/etc. The performance has gotten really, really bad for this table, to the point where processing a data page that should take a max of 10 seconds is sometimes taking 15 minutes.
I initially tried running all updates as a single transaction, rather than the 50-ish statements in a PHP loop in the web app (written several years ago). The problem, at least as I see it, is that this app is running on a giant MySQL instance with many other generic websites, and the disk speed just isn't able to handle so many updates.
I am able to use cron/batch jobs on this provider. The web app is mainly used during work hours, so I could limit access to the web app during overnight hours.
I normally work with postgresql or ms sql server, so my knowledge of mysql is fairly limited.
Would performance be increased if I force the table to be dropped and rewritten overnight? Is there some mysql function like postgres's vacuum? I have tried to search for information, but unfortunately using words like rewrite table just brings up references to sql syntax helpers or performance tuning.
Alternately, I guess that I could switch to a different storage engine in MySQL, as long as it could be done via a PHP script. Would there be a better storage engine than the default for something frequently updated?
MySQL performance depends on enough factors that it is hard to give a single clear answer for every case. I think we can check the following points to help figure out how to improve INSERT performance in MySQL.
Database Engine.
There are five engines you can use, depending on your purpose: MyISAM, Memory, InnoDB, Archive, NDB.
Document
An engine with table-level locking granularity will be slower than one with row-level locking, because it locks the whole table against changes when you insert or update a single record, whereas row-level locking only locks the affected row.
When performing an INSERT or UPDATE, an engine with B-tree indexes will be slower because it has to maintain its indexes; the trade-off is faster SELECT queries. Therefore, the number of indexes on a table slows inserting and updating down as well.
Indexes on CHAR columns will be slower than indexes on INT columns, because it takes MySQL more time to figure out where to find the right node to store the data.
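To see which engines your server supports, which one is the default, and which engine a given table uses, you can run (the schema and table names are placeholders):
SHOW ENGINES;
SELECT ENGINE FROM information_schema.TABLES WHERE TABLE_SCHEMA = 'your_db' AND TABLE_NAME = 'your_table';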
MySQL Statement
MySQL has an estimation system that helps you gauge the performance of a query: add EXPLAIN before your statement.
Example
EXPLAIN SELECT SQL_NO_CACHE * FROM Table_A WHERE id = 1;
Document
I worked on a web application where we used MySQL (it's really good!) to scale to really large data.
In addition to what @Lam Nguyen said in his answer, here are a few things to consider.
Check which MySQL engine you are using to see which locks it obtains during SELECT, INSERT and UPDATE. To check which engine you are using, here is a sample query with which you could run your litmus test:
mysql> show table status where name="<your_table_name>";
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Login | InnoDB | 10 | Dynamic | 2 | 8192 | 16384 | 0 | 0 | 0 | NULL | 2019-04-28 12:16:59 | NULL | NULL | utf8mb4_general_ci | NULL | | |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
The default engine that comes with a MySQL installation is InnoDB. InnoDB does not lock the whole table while inserting a row.
SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE.
A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement.
InnoDB lock sets
Check which columns you are indexing. Index only the columns you really query a lot, and avoid indexing CHAR columns.
To check which columns of your table are indexed, run:
mysql> show index from BookStore2;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Bookstore2 | 0 | PRIMARY | 1 | ISBN_NO | A | 0 | NULL | NULL | | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 1 | SHORT_DESC | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 2 | PUBLISHER | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.03 sec)
Do not run inner queries on a large data set. To actually see what your query does, run EXPLAIN on it and look at the number of rows it examines:
mysql> explain select * from login;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| 1 | SIMPLE | login | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.03 sec)
Avoid joining too many tables.
Make sure you are querying with a primary key in the criteria, or at least querying on an indexed column (see the quick check below).
When your table grows too big, make sure you split it across clusters.
With a few tweaks, we would still be able to get query results in minimal time.
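A quick check for the point about querying on the primary key or an indexed column (the column names are only illustrative; type ALL in the EXPLAIN output means a full table scan):
mysql> EXPLAIN SELECT * FROM login WHERE id = 42;             -- type: const when id is the primary key
mysql> EXPLAIN SELECT * FROM login WHERE last_name = 'Smith'; -- type: ALL if last_name is not indexed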

How to properly index, and choose the best primary key for a MySQL InnoDB table

This is my first time with big MySQL tables, and I have a couple of questions about the speed of a search.
I have a MySQL table with 100 million entries. The table now looks like this:
+-----------+--------------+------+-----+---------+-------+
| Field     | Type         | Null | Key | Default | Extra |
+-----------+--------------+------+-----+---------+-------+
| Accession | char(10)     | NO   | PRI | NULL    |       |
| DB        | char(6)      | NO   |     | NULL    |       |
| Organism  | varchar(255) | NO   |     | NULL    |       |
| Gene      | varchar(255) | NO   |     | NULL    |       |
| Name      | varchar(255) | NO   |     | NULL    |       |
| Header    | text         | NO   |     | NULL    |       |
| Sequence  | text         | NO   |     | NULL    |       |
+-----------+--------------+------+-----+---------+-------+
with indexes like this:
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| uniprot | 0 | PRIMARY | 1 | Accession | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 1 | Accession | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 2 | DB | A | 94275840 | NULL | NULL | | BTREE | | |
| uniprot | 1 | main_index | 3 | Organism | A | 94275840 | 191 | NULL | | BTREE | | |
| uniprot | 1 | main_index | 4 | Gene | A | 94275840 | 191 | NULL | | BTREE | | |
| uniprot | 1 | main_index | 5 | Name | A | 94275840 | 191 | NULL | | BTREE | | |
+---------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
My question is about the efficiency of this. The searches I use are very simple, but I need the answer really fast.
For 80% of the searches I use Accession as the query and I want the Sequence back.
select sequence from uniprot where accession="q32p44";
...
1 row in set (0.06 sec)
For 10% of the searches I look for a "Gene", and for the other 10% I search for an Organism.
The table is unique for "Accession".
My questions are:
Can I make this table more efficient (search-time wise) somehow?
Is the indexing good?
Would I speed up the search by making a multi-column primary key like (Accession, Gene, Organism)?
Thanks a lot!
EDIT1:
As requested in the comments:
mysql> show create table uniprot;
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table | Create Table |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| uniprot | CREATE TABLE `uniprot` (
`Accession` char(10) NOT NULL,
`DB` char(6) NOT NULL,
`Organism` varchar(255) NOT NULL,
`Gene` varchar(255) NOT NULL,
`Name` varchar(255) NOT NULL,
`Header` text NOT NULL,
`Sequence` text NOT NULL,
PRIMARY KEY (`Accession`),
KEY `main_index` (`Accession`,`DB`,`Organism`(191),`Gene`(191),`Name`(191))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Don't use "prefix" indexing, it almost never does as well as you might expect.
CHAR(10) with utf8mb4 means that you are always taking 40 bytes. accession="q32p44" implies that VARCHAR with an ascii character set would be better. With those changes, I would not bother switching to a 'surrogate' key. Consider the same issue for DB.
With PRIMARY KEY(Accession) and InnoDB, there is no advantage in having KEY main_index (Accession, ...). Drop that KEY.
What is Sequence? If it is a text string with only 4 different letters, then it should be highly compressible. And, with 100M rows, shrinking the disk footprint could lead to a noticeable speedup. I would COMPRESS it in the client and store it into a BLOB.
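To illustrate the idea (the suggestion is to compress in the client; the sketch below instead uses MySQL's built-in COMPRESS()/UNCOMPRESS() for brevity, and the column has to become a BLOB first):
ALTER TABLE uniprot MODIFY Sequence BLOB NOT NULL;
UPDATE uniprot SET Sequence = COMPRESS(Sequence);   -- one-off conversion, very slow on 100M rows
SELECT UNCOMPRESS(Sequence) FROM uniprot WHERE Accession = 'q32p44';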
Do you really need 255 in varchar(255)? Please shrink to something 'reasonable' for the data. That way, we can reconsider what index(es) to add, without using prefixing.
select sequence from uniprot where accession="q32p44";
works very efficiently with PRIMARY KEY(accession)
select sequence from uniprot where accession="q32p44" AND gene = '...';
also works efficiently with that PK. It will find the one row for q32p44 and then simply check that gene matches; then deliver 0 or 1 row.
select sequence from uniprot where gene = '...';
would benefit from INDEX(gene). Similarly for Organism.
How big is the table (in GB)? What is the value of innodb_buffer_pool_size? How much RAM do you have? If the table is a lot bigger than the buffer pool, a random "point query" (WHERE accession = constant) will typically take one disk hit. To discuss other queries, please show us the SELECT.
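To get those numbers (the schema name is a placeholder):
SELECT @@innodb_buffer_pool_size / 1024 / 1024 / 1024 AS buffer_pool_gb;
SELECT ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS table_gb
FROM information_schema.TABLES
WHERE table_schema = 'your_db' AND table_name = 'uniprot';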
Edit
With 100M rows, shrinking the disk footprint is important for performance. There are multiple ways to do it. I want to focus on (1) Shrink the size of each column; (2) Avoid implicit overhead in indexes.
Each secondary key implicitly includes the PRIMARY KEY. So, if there are 3 indexes, there are 3 copies of the PK. That means that the size of the PK is especially important.
I'm recommending something like
CREATE TABLE `uniprot` (
`Accession` VARCHAR(10) CHARACTER SET ascii NOT NULL,
`DB` VARCHAR(6) NOT NULL,
`Organism` varchar(100) NOT NULL,
`Gene` varchar(100) NOT NULL,
`Name` varchar(100) NOT NULL,
`Header` text NOT NULL,
`Sequence` text NOT NULL,
PRIMARY KEY (`Accession`),
INDEX(Gene), -- implicitly (Gene, Accession)
INDEX(Organism)  -- implicitly (Organism, Accession)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
And your main queries are
SELECT Sequence FROM uniprot WHERE Accession = '...';
SELECT Sequence FROM uniprot WHERE Gene = '...';
SELECT Sequence FROM uniprot WHERE Organism = '...';
If Accession is really variable length, shorter than 10, and ascii, then what I suggest brings the space for the copies of Accession down from 40 bytes * 3 occurrences * 100M rows = 12GB to perhaps 2GB. I think the savings of 10GB is worth it. Going to BIGINT would also be about 2GB (no further savings); going to INT would be about 1GB (more savings, but not much).
Shrinking Gene and Organism to 'reasonable' sizes (if practical) avoids the need for using prefixing, hence allowing the index to work better. But, you can argue that maybe prefixing will work "well enough" in INDEX(Gene(11)). Let's get some numbers to make the argument one way or another. What is the average length of Gene (and Organism)? How many initial characters in Gene are usually sufficient to identify a Gene?
Another space question is whether there are a lot of duplicates in Gene and/or Organism. If so, then "normalizing" those fields would be warranted. Ditto for Name, Header, and Sequence.
The need for a JOIN (or two) if you make surrogates for Accession and/or Gene is only a slight bit of overhead, not enough to worry about.
First off, as mentioned in the comments, I wouldn't use a natural key (Accession); I would opt for a surrogate key (Id). However, with 100M rows that would be a painful ALTER, during which the table will be locked.
With that being said, Accession is already indexed b/c it's a Primary Key so for simple queries, you can't optimize further:
select sequence from uniprot where accession="q32p44";
If doing look-ups against other columns then your best bet is to add separate indices for each column:
ALTER TABLE uniprot ADD INDEX (Gene(10)), ADD KEY (Organism(10));
The goal is to index the uniqueness of the values (cardinality), so if you have a lot of values with somethingsomething1, somethingsomething2, somethingsomething3 then it would be best to go with a prefix of 18+ but no larger than say 30.
Per MySQL docs:
If names in the column usually differ in the first 10 characters, this index should not be much slower than an index created from the entire name column. Also, using column prefixes for indexes can make the index file much smaller, which could save a lot of disk space and might also speed up INSERT operations.
So the goal is to index the uniqueness (cardinality) but without inflating size on disk.
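One way to check how much uniqueness a given prefix length keeps before creating the index (the prefix lengths are arbitrary examples):
SELECT COUNT(DISTINCT Gene)           AS full_cardinality,
       COUNT(DISTINCT LEFT(Gene, 10)) AS prefix_10,
       COUNT(DISTINCT LEFT(Gene, 20)) AS prefix_20
FROM uniprot;
The closer the prefix counts are to the full count, the less selectivity you give up by prefixing.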
I would also remove that main_index index, as I don't see the benefit: you are not searching on all those columns at the same time, and due to its length it will slow down your writes with little gain on reads.
Be sure to test before you run anything on production. Perhaps get a small sampling (1-5% of the dataset) and prefix your queries you plan on running with explain to see how MySQL will execute them.

MySQL Performance issues / slow query with large amounts of data

MySql
I've a query that is taking some time to run on a table named impression that has about 57 million rows. The table definition can be found below:
+-----------------+--------------+------+-----+
| Field           | Type         | Null | Key |
+-----------------+--------------+------+-----+
| id              | int(11)      | NO   | PRI |
| data_type       | varchar(16)  | NO   | MUL |
| object_id       | int(11)      | YES  |     |
| user_id         | int(11)      | YES  |     |
| posted          | timestamp    | NO   | MUL |
| lat             | float        | NO   |     |
| lng             | float        | NO   |     |
| region_id       | int(11)      | NO   |     |
+-----------------+--------------+------+-----+
The indexes on the table are:
+------------+------------+----------+--------------+-------------+
| Table      | Non_unique | Key_name | Seq_in_index | Column_name |
+------------+------------+----------+--------------+-------------+
| impression | 0          | PRIMARY  | 1            | id          |
| impression | 1          | posted   | 1            | posted      |
| impression | 1          | oi_dt    | 1            | data_type   |
| impression | 1          | oi_dt    | 2            | object_id   |
+------------+------------+----------+--------------+-------------+
A typical select statement goes something like:
SELECT COUNT(`id`)
FROM `impression`
WHERE
posted BETWEEN DATE('2014-01-04') AND DATE('2014-06-01')
AND `data_type` = 'event'
AND `object_id` IN ('1', '2', '3', '4', '5', '8', ...)
...and a typical record looks like (in order of the schema above):
'event', 1234, 81, '2014-01-02 00:00:01', 35.3, -75.2, 10
This statement takes approximately 26 seconds to run, which is where the problem
lies. Are there any solutions that can be employed here to reduce this time to well
below what it is now? Ideally it'd be < 1 second.
I'm open to switching storage solutions / etc... anything that'll help at this point.
Your assistance is most appreciated.
Other things possibly worth noting:
The table is using the InnoDB storage engine
using MySQL 5.5
Server: 8Gb RAM running CentOS 6 (Rackspace)
MySQL usually uses only one index per table in a given query. You have an index on posted and a compound index on data_type, object_id.
You should use EXPLAIN to find out which index your query is currently using. EXPLAIN will also tell you how many rows it estimates it will examine to produce the result set (it might examine many more rows than make it into the final result).
The columns in a compound index should be in this order:
Columns in equality conditions, for example in your query data_type = 'event'
Columns in range conditions or sorting, but you only get one such column. Subsequent columns that are in range conditions or sorting do not gain any benefit from being added to the index after the first such column. So pick the column that is the most selective, that is, your condition narrows down the search to a smaller subset of the table.
Other columns in your select-list, if you have just a few such columns and you want to get the covering index effect. It's not necessary to add your primary key column if you use InnoDB, because every secondary index automatically includes the primary key column at the right end, even if you don't declare that.
So in your case, you might be better off with an index on data_type, posted. Try it and use EXPLAIN to confirm. It depends on whether the date range you give is more selective than the list of object_id's.
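A sketch of that suggestion (the index name is arbitrary):
ALTER TABLE impression ADD INDEX dt_posted (data_type, posted);
EXPLAIN SELECT COUNT(`id`)
FROM `impression`
WHERE posted BETWEEN DATE('2014-01-04') AND DATE('2014-06-01')
  AND `data_type` = 'event'
  AND `object_id` IN ('1', '2', '3', '4', '5', '8');
If EXPLAIN then shows dt_posted in the key column and a much smaller rows estimate, the new index is being used.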
See also my presentation How to Design Indexes, Really.
Not sure if this is a viable solution for you, but partitioning may speed it up. I have a similar table for impressions and found the following to help it a lot. I'm querying mostly on the current day though.
ALTER TABLE impression PARTITION BY RANGE(TO_DAYS(posted))(
PARTITION beforeToday VALUES LESS THAN(735725),
PARTITION today VALUES LESS THAN(735726),
PARTITION future VALUES LESS THAN MAXVALUE
);
This does incur some maintenance (has to be updated often to get the benefits). If you are looking to query on a broader range, less maintenance would be required I think.
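For reference, the magic numbers above are just TO_DAYS() values; an equivalent way to write the same scheme that is easier to maintain lets MySQL compute them from date literals (the dates are examples, and note that MySQL requires the partitioning column to be part of every unique key, so the primary key may have to become (id, posted) first):
ALTER TABLE impression PARTITION BY RANGE (TO_DAYS(posted)) (
    PARTITION beforeToday VALUES LESS THAN (TO_DAYS('2014-06-01')),
    PARTITION today       VALUES LESS THAN (TO_DAYS('2014-06-02')),
    PARTITION future      VALUES LESS THAN MAXVALUE
);
Rolling the boundaries forward each day is then done with ALTER TABLE ... REORGANIZE PARTITION rather than repartitioning the whole table.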

Innodb memcached plugin in RDS not deleting expired rows

I recently setup an RDS instance in AWS for MySQL 5.6 with the new Memcached InnoDB plugin. Everything works great and my app can store and retrieve cached items from the mapped table. When I store items I provide a timeout, and memcached correctly does not return the item once its TTL has expired. So far so good....
However when I look at the underlying table, it is full of rows which have already expired.
The MySQL documentation (http://dev.mysql.com/doc/refman/5.6/en/innodb-memcached-intro.html) indicates that item expiration has no effect when using the "innodb_only" caching policy (although it doesn't explicitly indicate which operation it is referring to). In any case my cache_policies table looks like this:
mysql> select * from innodb_memcache.cache_policies;
+--------------+------------+------------+---------------+--------------+
| policy_name  | get_policy | set_policy | delete_policy | flush_policy |
+--------------+------------+------------+---------------+--------------+
| cache_policy | caching    | caching    | innodb_only   | innodb_only  |
+--------------+------------+------------+---------------+--------------+
1 row in set (0.01 sec)
So, per the docs the expiration field should be respected.
For reference my containers table looks like this:
mysql> select * from innodb_memcache.containers;
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
| name | db_schema | db_table | key_columns | value_columns | flags | cas_column | expire_time_column | unique_idx_name_on_key |
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
| default | sessions | userData | sessionID | data | c3 | c4 | c5 | PRIMARY |
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
2 rows in set (0.00 sec)
And the data table is:
mysql> desc sessions.userData;
+-----------+---------------------+------+-----+---------+-------+
| Field     | Type                | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| sessionID | varchar(128)        | NO   | PRI | NULL    |       |
| data      | blob                | YES  |     | NULL    |       |
| c3        | int(11)             | YES  |     | NULL    |       |
| c4        | bigint(20) unsigned | YES  |     | NULL    |       |
| c5        | int(11)             | YES  |     | NULL    |       |
+-----------+---------------------+------+-----+---------+-------+
5 rows in set (0.00 sec)
One more detail, the MySQL docs state that after modifying caching policies you need to re-install the Memcached plugin, but I did not find a way to do this on RDS, so I removed the Memcached option group, rebooted, added the memcached option group again, rebooted again... but there was no apparent change in behavior.
So, to conclude, am I missing some step or configuration here? I would hate to have to create a separate process just to delete the expired rows from the table, since I was expecting the Memcached integration to do this for me.
I'm by no means an expert, as I've just started to play around with memcached myself. However, this is from the MySQL documentation for the python tutorial.
It seems to be saying that if you use the InnoDB memcached plugin, MySQL will handle cache expiration, and it really doesn't matter what you enter for the cache expire time.
And for the flags, expire, and CAS values, we specify corresponding
columns based on the settings from the sample table demo.test. These
values are typically not significant in applications using the InnoDB
memcached plugin, because MySQL keeps the data synchronized and there
is no need to worry about data expiring or being stale.

How to delete Index?

Further to my previous question (qv) ...
I have already created the table(s) and populated them with data. How do I set the prefix length to a very large value, or remove it altogether, so that I don't have this problem? There will never be more than a few thousand rows, and only this application is running on a dedicated PC, so performance is not an issue.
Solution, please, for either phpMyAdmin or just the MySQL command line.
Update: Can I just delete this index (or make it infinitely long)?
Hmmm, I would prefer to keep the unique index if I can. So, how do I make it infinitely long?
Or should I redefine my text fields to be VARCHAR with a limit on the length? (I do know the max possible length of the primary key.)
mysql> describe tagged_chemicals;
+-------------+---------+------+-----+---------+-------+
| Field       | Type    | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+-------+
| bar_code    | text    | NO   |     | NULL    |       |
| rfid_tag    | text    | NO   | UNI | NULL    |       |
| checked_out | char(1) | NO   |     | N       |       |
+-------------+---------+------+-----+---------+-------+
3 rows in set (0.04 sec)
It'll probably be something like
CREATE INDEX part_of_name ON customer (name(10));
from the CREATE INDEX documentation: http://dev.mysql.com/doc/refman/5.0/en/create-index.html
where in your case the rfid_tag is length 20.
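A hedged sketch for this particular table (the unique index name is assumed to be rfid_tag; check SHOW INDEX FROM tagged_chemicals for the real name, and 20 is the known maximum tag length):
SHOW INDEX FROM tagged_chemicals;
ALTER TABLE tagged_chemicals DROP INDEX rfid_tag;
ALTER TABLE tagged_chemicals MODIFY rfid_tag VARCHAR(20) NOT NULL;
ALTER TABLE tagged_chemicals ADD UNIQUE INDEX rfid_tag (rfid_tag);
Converting the TEXT column to VARCHAR(20) lets the unique index cover the whole value, so no prefix length is needed.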