Physical disk rewrite of MySQL data

I am using mysql for the first time in years to help a friend out. The issue: a mysql table that gets updated a lot with INT and CHAR values. This web app site is hosted on a large generic provider, so I have no direct control of setup/parameters/etc. The performance has gotten really, really bad for this table, to the point where processing a data page that should take a max of 10 seconds is sometimes taking 15 minutes.
I initially tried running all the updates as a single transaction, rather than as the 50-ish individual statements issued from a PHP loop in the web app (written several years ago). The problem, at least as far as I can tell, is that this app is running on a giant MySQL instance shared with many other generic websites, and the disk just isn't able to keep up with so many updates.
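For reference, the batched version looked roughly like this (the table and column names here are placeholders, not the real schema):
START TRANSACTION;
UPDATE page_data SET int_val = 42, char_val = 'A' WHERE row_id = 1;
UPDATE page_data SET int_val = 17, char_val = 'B' WHERE row_id = 2;
-- ... roughly 48 more single-row updates ...
COMMIT;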
I am able to use cron/batch jobs on this provider. The web app is mainly used during work hours, so I could limit access to it during overnight hours.
I normally work with postgresql or ms sql server, so my knowledge of mysql is fairly limited.
Would performance be improved if I forced the table to be dropped and rewritten overnight? Is there some MySQL function like Postgres's VACUUM? I have tried to search for information, but unfortunately search terms like "rewrite table" just bring up references to SQL syntax helpers or performance tuning.
Alternatively, I guess I could move the table to a different storage engine in MySQL, as long as it could be done via a PHP script. Is there a better storage engine than the default for something frequently updated?

MySQL performance depends on enough factors that it's hard to give a clear answer for every case. I think we can check the following points to help figure out what to improve when INSERTing data into MySQL.
Database Engine.
There are five engines you can choose from, depending on your purpose: MyISAM, Memory, InnoDB, Archive, and NDB.
See the MySQL documentation on storage engines.
An engine with table-level locking granularity will be slower than one with row-level granularity, because it locks the whole table against changes whenever a single record is inserted or updated, whereas row-level locking locks only the affected row.
When performing an INSERT or UPDATE, an engine that maintains B-tree indexes will be slower because it has to update those indexes; that is the price of faster SELECT queries. It follows that the number of indexes on a table slows inserting and updating as well.
An index on a CHAR column will be slower than an index on an INT column, because it takes more time to work out the right node at which to store the data.
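For example, picking an engine at table creation, or converting an existing table later, is a single statement (the table name here is hypothetical):
-- Create a table with an explicit engine, or convert an existing one
CREATE TABLE demo_tbl (id INT PRIMARY KEY, val CHAR(10)) ENGINE = InnoDB;
ALTER TABLE demo_tbl ENGINE = MyISAM;  -- rebuilds the table under the new engine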
MySQL Statement
MySQL has an estimation system that helps you gauge the performance of a query: add EXPLAIN in front of your statement.
Example
EXPLAIN SELECT SQL_NO_CACHE * FROM Table_A WHERE id = 1;
See the MySQL documentation on EXPLAIN.

I worked on a web application where we used MySQL (it's really good!) to scale to really large data.
In addition to what @Lam Nguyen said in his answer, here are a few things to consider.
Check which MySQL engine you are using, to see which locks it obtains during SELECT, INSERT, and UPDATE. To find out which engine you are using, here is a sample query with which you can run your litmus test:
mysql> show table status where name="<your_table_name>";
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
| Login | InnoDB | 10 | Dynamic | 2 | 8192 | 16384 | 0 | 0 | 0 | NULL | 2019-04-28 12:16:59 | NULL | NULL | utf8mb4_general_ci | NULL | | |
+-------+--------+---------+------------+------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+-------------+------------+--------------------+----------+----------------+---------+
The default engine that ships with a MySQL installation is InnoDB. InnoDB uses row-level locking, so inserting a row does not lock the whole table.
SELECT ... FROM is a consistent read, reading a snapshot of the database and setting no locks unless the transaction isolation level is set to SERIALIZABLE.
A locking read, an UPDATE, or a DELETE generally set record locks on every index record that is scanned in the processing of the SQL statement.
See the InnoDB documentation on the locks set by different SQL statements.
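A minimal illustration of the difference, assuming a hypothetical accounts table:
-- Consistent (non-locking) read: sets no record locks under the default
-- REPEATABLE READ isolation level
SELECT balance FROM accounts WHERE id = 1;
-- Locking read: sets an exclusive lock on the scanned index record
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;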
Check which columns you are indexing. Index only the columns you really query a lot, and avoid indexing CHAR columns.
To check which columns of your table are indexed, run:
mysql> show index from BookStore2;
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment | Visible | Expression |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
| Bookstore2 | 0 | PRIMARY | 1 | ISBN_NO | A | 0 | NULL | NULL | | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 1 | SHORT_DESC | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
| Bookstore2 | 1 | SHORT_DESC_IND | 2 | PUBLISHER | A | 0 | NULL | NULL | YES | BTREE | | | YES | NULL |
+------------+------------+----------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+---------+------------+
3 rows in set (0.03 sec)
Do not run inner queries over a large data set. To see what your query actually does, run EXPLAIN on it and look at the number of rows it iterates over:
mysql> explain select * from login;
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
| 1 | SIMPLE | login | NULL | ALL | NULL | NULL | NULL | NULL | 2 | 100.00 | NULL |
+----+-------------+-------+------------+------+---------------+------+---------+------+------+----------+-------+
1 row in set, 1 warning (0.03 sec)
Avoid joining too many tables.
Make sure your query criteria include the primary key, or at least an indexed column.
When your table grows too big, make sure you split it across clusters.
With a few tweaks, you should still be able to get query results in minimal time.


Would partitioning the table improve the performance of this GROUP BY query?

I have a MySQL table, say data_table:
mysql> desc data_table;
+------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| prod_id | int(10) unsigned | NO | | NULL | |
| date | date | NO | | NULL | |
| cost | double | NO | | NULL | |
+------------+------------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)
This table has around 700 million rows. I have created indexes on prod_id and date. I need to perform a query like this -
SELECT `id`, `prod_id`, WEEKOFYEAR(`date`) AS period, SUM(`cost`) AS cost_sum
FROM `data_table` GROUP BY `prod_id`, `period`;
My question is -
Will partitioning the table on months (~20 partitions) improve the performance of this query?
PARTITIONing will not help at all. Not BY RANGE; not any other flavor.
The query must read every row in the table; partitioning does not change that fact, nor can it speed it up at all.
The query, as it stands, has an unrelated problem. Which id is it supposed to return for each GROUP? Answer: It will return a 'random' id.
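If the id is not actually needed, one fix is simply to drop it from the select-list, so that every output column is either grouped or aggregated:
SELECT `prod_id`, WEEKOFYEAR(`date`) AS period, SUM(`cost`) AS cost_sum
FROM `data_table` GROUP BY `prod_id`, `period`;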
Based on the number of records and the SQL query you have written, I would say yes: if done correctly, partitioning would help a lot. I would go further and suggest range partitioning on the date field. This is a very common partitioning method that works well and is easy to implement.
You don't mention which release of MySQL you're running, so you'll have to do some additional reading in the MySQL partitioning documentation to understand what your release supports.
You can also run this SQL at the command prompt.
mysql> SHOW VARIABLES LIKE '%partition%';
This should report back with "have_partitioning = YES" or "Partition_engine = YES", depending on your release.
If you see that there are a lot of queries based on week number, it makes sense to permanently store the week number as a column; that saves the calculation on every SELECT.
The ideal strategy is to know what queries you will run and then design your tables accordingly.
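As a sketch of storing the week number: on MySQL 5.7+ you could even let the server maintain the column for you with a stored generated column (the column and index names here are made up):
-- Persist the week number so SELECTs no longer compute WEEKOFYEAR per row
ALTER TABLE data_table
    ADD COLUMN week_no TINYINT UNSIGNED
        GENERATED ALWAYS AS (WEEKOFYEAR(`date`)) STORED;
ALTER TABLE data_table ADD INDEX idx_week (week_no);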

MySQL Performance issues / slow query with large amounts of data

I have a query that is taking some time to run against a table named impression that has about 57 million rows. The table definition is below:
+-----------------+--------------+------+-----+
| Field | Type | Null | Key |
+-----------------+--------------+------+-----+
| id | int(11) | NO | PRI |
| data_type | varchar(16) | NO | MUL |
| object_id | int(11) | YES | |
| user_id | int(11) | YES | |
| posted | timestamp | NO | MUL |
| lat | float | NO | |
| lng | float | NO | |
| region_id | int(11) | NO | |
+-----------------+--------------+------+-----+
The indexes on the table are:
+------------+------------+----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+------------+------------+----------+--------------+-------------+
| impression | 0 | PRIMARY | 1 | id |
| impression | 1 | posted | 1 | posted |
| impression | 1 | oi_dt | 1 | data_type |
| impression | 1 | oi_dt | 2 | object_id |
+------------+------------+----------+--------------+-------------+
A typical select statement goes something like:
SELECT COUNT(`id`)
FROM `impression`
WHERE
posted BETWEEN DATE('2014-01-04') AND DATE('2014-06-01')
AND `data_type` = 'event'
AND `object_id` IN ('1', '2', '3', '4', '5', '8', ...)
...and a typical record looks like (in order of the schema above):
'event', 1234, 81, '2014-01-02 00:00:01', 35.3, -75.2, 10
This statement takes approximately 26 seconds to run, which is where the problem
lies. Are there any solutions that can be employed here to reduce this time to well
below what it is now? Ideally it'd be < 1 second.
I'm open to switching storage solutions / etc... anything that'll help at this point.
Your assistance is most appreciated.
Other things possibly worth noting:
The table is using the InnoDB storage engine
using MySQL 5.5
Server: 8Gb RAM running CentOS 6 (Rackspace)
MySQL usually uses only one index per table in a given query. You have an index on posted and a compound index on data_type, object_id.
You should use EXPLAIN to find out which index your query is currently using. EXPLAIN will also tell you how many rows it estimates it will examine to produce the result set (it might examine many more rows than make it into the final result).
When building a compound index, the columns should be in this order:
Columns in equality conditions, for example in your query data_type = 'event'
Columns in range conditions or sorting, but you only get one such column; subsequent range or sort columns gain no benefit from being added to the index after the first one. So pick the most selective such column, that is, the one whose condition narrows the search down to the smallest subset of the table.
Other columns in your select-list, if you have just a few such columns and you want to get the covering index effect. It's not necessary to add your primary key column if you use InnoDB, because every secondary index automatically includes the primary key column at the right end, even if you don't declare that.
So in your case, you might be better off with an index on data_type, posted. Try it and use EXPLAIN to confirm. It depends on whether the date range you give is more selective than the list of object_id's.
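A sketch of that index and the follow-up check (the index name is made up):
-- Equality column first, then the range column; confirm usage with EXPLAIN
ALTER TABLE impression ADD INDEX dt_posted (data_type, posted);
EXPLAIN SELECT COUNT(`id`) FROM `impression`
WHERE posted BETWEEN '2014-01-04' AND '2014-06-01'
  AND `data_type` = 'event'
  AND `object_id` IN (1, 2, 3, 4, 5, 8);  -- plus the rest of the list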
See also my presentation How to Design Indexes, Really.
Not sure if this is a viable solution for you, but partitioning may speed it up. I have a similar table for impressions and found the following to help it a lot. I'm querying mostly on the current day though.
ALTER TABLE impression PARTITION BY RANGE (TO_DAYS(posted)) (
    PARTITION beforeToday VALUES LESS THAN (735725),
    PARTITION today VALUES LESS THAN (735726),
    PARTITION future VALUES LESS THAN MAXVALUE
);
This does incur some maintenance (the partitions have to be updated often to keep the benefit); see the sketch below. If you are looking to query over a broader range, less maintenance would be required, I think.
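For what it's worth, the nightly roll-forward might look something like this (the TO_DAYS day numbers are placeholders):
-- Fold yesterday's "today" partition into "beforeToday"...
ALTER TABLE impression REORGANIZE PARTITION beforeToday, today INTO (
    PARTITION beforeToday VALUES LESS THAN (735726)
);
-- ...then carve a fresh "today" out of "future"
ALTER TABLE impression REORGANIZE PARTITION future INTO (
    PARTITION today VALUES LESS THAN (735727),
    PARTITION future VALUES LESS THAN MAXVALUE
);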

Innodb memcached plugin in RDS not deleting expired rows

I recently setup an RDS instance in AWS for MySQL 5.6 with the new Memcached InnoDB plugin. Everything works great and my app can store and retrieve cached items from the mapped table. When I store items I provide a timeout, and memcached correctly does not return the item once its TTL has expired. So far so good....
However when I look at the underlying table, it is full of rows which have already expired.
The MySQL documentation (http://dev.mysql.com/doc/refman/5.6/en/innodb-memcached-intro.html) indicates that item expiration has no effect when using the "innodb_only" caching policy (although it doesn't explicitly indicate which operation it is referring to). In any case my cache_policies table looks like this:
mysql> select * from innodb_memcache.cache_policies;
+--------------+------------+------------+---------------+--------------+
| policy_name | get_policy | set_policy | delete_policy | flush_policy |
+--------------+------------+------------+---------------+--------------+
| cache_policy | caching | caching | innodb_only | innodb_only |
+--------------+------------+------------+---------------+--------------+
1 row in set (0.01 sec)
So, per the docs the expiration field should be respected.
For reference my containers table looks like this:
mysql> select * from innodb_memcache.containers;
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
| name | db_schema | db_table | key_columns | value_columns | flags | cas_column | expire_time_column | unique_idx_name_on_key |
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
| default | sessions | userData | sessionID | data | c3 | c4 | c5 | PRIMARY |
+---------+-----------+-----------+-------------+---------------+-------+------------+--------------------+------------------------+
2 rows in set (0.00 sec)
And the data table is:
mysql> desc sessions.userData;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| sessionID | varchar(128) | NO | PRI | NULL | |
| data | blob | YES | | NULL | |
| c3 | int(11) | YES | | NULL | |
| c4 | bigint(20) unsigned | YES | | NULL | |
| c5 | int(11) | YES | | NULL | |
+-----------+---------------------+------+-----+---------+-------+
5 rows in set (0.00 sec)
One more detail: the MySQL docs state that after modifying caching policies you need to reinstall the memcached plugin, but I did not find a way to do this on RDS, so I removed the memcached option group, rebooted, added the memcached option group again, and rebooted again... but there was no apparent change in behavior.
So, to conclude, am I missing some step or configuration here? I would hate to have to create a separate process just to delete the expired rows from the table, since I was expecting the Memcached integration to do this for me.
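For reference, the fallback I am trying to avoid would be something like this (it assumes c5 holds the Unix expiration timestamp and that the event scheduler is enabled):
-- Scheduled cleanup of rows whose memcached TTL has passed
CREATE EVENT IF NOT EXISTS purge_expired_sessions
ON SCHEDULE EVERY 1 HOUR
DO
    DELETE FROM sessions.userData
    WHERE c5 IS NOT NULL AND c5 > 0 AND c5 < UNIX_TIMESTAMP();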
I'm by no means an expert, as I've just started to play around with memcached myself. However, this is from the MySQL documentation for the Python tutorial.
It seems to be saying that if you use the InnoDB memcached plugin, MySQL will handle cache expiration, and it really doesn't matter what you enter for the cache expire time.
And for the flags, expire, and CAS values, we specify corresponding
columns based on the settings from the sample table demo.test. These
values are typically not significant in applications using the InnoDB
memcached plugin, because MySQL keeps the data synchronized and there
is no need to worry about data expiring or being stale.

How to run OPTIMIZE TABLE with the least downtime

I have a MySQL 5.5 DB of +-40GB on a 64GB RAM machine in a production environment. All tables are InnoDB. There is also a slave running as a backup.
One table, the most important one, grew to 150M rows, and inserting and deleting became slow. To speed up inserting and deleting I deleted half of the table. This did not speed things up as expected; inserting and deleting are still slow.
I've read that running OPTIMIZE TABLE can help in such a scenario. As I understand it, this operation requires a read lock on the entire table, and optimizing might take quite a while on a big table.
What would be a good strategy to optimize this table while minimizing downtime?
EDIT The specific table to be optimized has +- 91M rows and looks like this:
+-------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| channel_key | varchar(255) | YES | MUL | NULL | |
| track_id | int(11) | YES | MUL | NULL | |
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| posted_at | datetime | YES | | NULL | |
| position | varchar(255) | YES | MUL | NULL | |
| dead | int(11) | YES | | 0 | |
+-------------+--------------+------+-----+---------+----------------+
Percona Toolkit's pt-online-schema-change does this for you. In this case it worked very well.
300 ms to insert seems excessive, even with slow disks. I would look into the root cause. Optimizing this table is going to take a lot of time. MySQL will create a copy of your table on disk.
Depending on the size of your innodb_buffer_pool (if the table is InnoDB) and the free memory on the host, I would try to preload the whole table into the OS page cache, so that at least reading the data is sped up by a couple of orders of magnitude.
If you're using innodb_file_per_table, or if it's a MyISAM table, it's easy enough to make sure the whole file is cached using "time cat /path/to/mysql/data/db/huge_table.ibd > /dev/null". If you rerun the command and it completes within a few seconds, you can assume the file content is sitting in the OS page cache.
You can monitor progress while the OPTIMIZE TABLE is running by watching the size of the temporary file. It's usually in the database data directory, with a temporary filename starting with a hash (#) character.
This article suggests first dropping all the indexes in a table, then optimizing it, and then adding the indexes back; see the sketch below. It claims the speed difference is 20 times compared to just optimizing.
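Against a table like the one above, that approach would look roughly like this (the table and index names are guesses; check SHOW INDEX first, and note the primary key stays in place):
-- Drop the secondary indexes, rebuild the table, then recreate the indexes
ALTER TABLE huge_table
    DROP INDEX idx_channel_key,
    DROP INDEX idx_track_id;
OPTIMIZE TABLE huge_table;
ALTER TABLE huge_table
    ADD INDEX idx_channel_key (channel_key),
    ADD INDEX idx_track_id (track_id);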
Update your version of MySQL; in 8.0.x, OPTIMIZE TABLE is faster than in 5.5.
An OPTIMIZE on a table with 91 million rows could take something like 3 hours on your version. You could run it in the early morning, around 3 AM, so as not to disturb the users of your app.

Massive DB and mysql

A new project we are working on requires a lot of data analysis, but we are finding this to be VERY slow, so we are looking for ways to change our approach in software and/or hardware.
We are currently running on an Amazon EC2 instance (Linux):
High-CPU Extra Large Instance
7 GB of memory
20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each)
1690 GB of instance storage
64-bit platform
I/O Performance: High
API name: c1.xlarge
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz
stepping : 5
cpu MHz : 2133.408
cache size : 4096 KB
MemTotal: 7347752 kB
MemFree: 728860 kB
Buffers: 40196 kB
Cached: 2833572 kB
SwapCached: 0 kB
Active: 5693656 kB
Inactive: 456904 kB
SwapTotal: 0 kB
SwapFree: 0 kB
One part of the DB holds articles and entities, with a link table between them, for example:
mysql> DESCRIBE articles_entities;
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| id | char(36) | NO | PRI | NULL | |
| article_id | char(36) | NO | MUL | NULL | |
| entity_id | char(36) | NO | MUL | NULL | |
| created | datetime | YES | | NULL | |
| modified | datetime | YES | | NULL | |
| relevance | decimal(5,4) | YES | MUL | NULL | |
| analysers | text | YES | | NULL | |
| anchor | varchar(255) | NO | | NULL | |
+------------+--------------+------+-----+---------+-------+
8 rows in set (0.00 sec)
As you can see from the table below, we have a lot of associations, growing at a rate of 100,000+ a day.
mysql> SELECT count(*) FROM articles_entities;
+----------+
| count(*) |
+----------+
| 2829138 |
+----------+
1 row in set (0.00 sec)
A simple query like the one below is taking too much time (12 secs)
mysql> SELECT count(*) FROM articles_entities WHERE relevance <= .4 AND relevance > 0;
+----------+
| count(*) |
+----------+
| 357190 |
+----------+
1 row in set (11.95 sec)
What should we be considering to improve our lookup times? A different DB storage engine? Different hardware?
As mrorigo asked, please provide the SHOW CREATE TABLE articles_entities so we can see the actual indexes of your table.
As a note from the MySQL documentation (http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html):
If the table has a multiple-column index, any leftmost prefix of the index can be used by the optimizer to find rows.
For example, if you have a three-column index on (col1, col2, col3), you have indexed search capabilities on (col1), (col1, col2), and (col1, col2, col3).
MySQL cannot use an index if the columns do not form a leftmost prefix of the index.
So if relevance is part of a multi-column index, but isn't the leftmost column of that index, then the index is not used for your query.
This is a common issue that is often overlooked.
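A sketch of the fix, if that turns out to be the case here:
-- A dedicated index (or one with relevance leftmost) lets the range
-- condition use the index; verify with EXPLAIN afterwards
ALTER TABLE articles_entities ADD INDEX idx_relevance (relevance);
EXPLAIN SELECT count(*) FROM articles_entities
WHERE relevance <= .4 AND relevance > 0;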
Using char(36) for keys is not the fastest you can do with MySQL. Use INT types for keys if possible. If you index CHAR columns, the indexes will be VERY large compared to a (BIG)INT index (if not 'properly' created).
However, if your column values are not numeric, you are stuck with CHAR columns (which ARE still faster than VARCHAR, but can create large indexes).
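If switching is an option, a first migration step could look like this (the column and key names are made up, and every referencing table would need the same treatment):
-- Add an auto-increment BIGINT surrogate alongside the char(36) UUID key
ALTER TABLE articles_entities
    ADD COLUMN nid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    ADD UNIQUE KEY idx_nid (nid);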
Please provide a SHOW CREATE TABLE of tables to see key/index parameters, and also as the previous answer said, an EXPLAIN for the queries in question could help provide a better answer.
PS. Use SHOW TABLE STATUS LIKE '{table_name}' to see index (and data) sizes of the table.
There are three things that matter when it comes to query performance:
Indexes.
Memory.
Everything else.
The first thing to do is check your indexes. Do an EXPLAIN on your queries to find out how MySQL is processing them.
If that looks sensible, the next thing would be to check memory. How big is your total database? Memory is cheap these days, and queries that run from memory will be much, much faster than queries that have to read from disk.
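A quick way to answer the size question from information_schema:
-- Total data + index size per schema, in MB
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) AS size_mb
FROM information_schema.TABLES
GROUP BY table_schema;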
After you've explored those, if performance is still slow, then it might be time to consider other options.