Reduce the size of MySQL NDB binlog - mysql

I am running NDB Cluster and I see that on mysql api nodes, there is a very big binary log table.
+---------------------------------------+--------+-------+-------+------------+---------+
| CONCAT(table_schema, '.', table_name) | rows   | DATA  | idx   | total_size | idxfrac |
+---------------------------------------+--------+-------+-------+------------+---------+
| mysql.ndb_binlog_index                | 83.10M | 3.78G | 2.13G | 5.91G      | 0.56    |
+---------------------------------------+--------+-------+-------+------------+---------+
Is there any recommended way to reduce the size of that table without breaking anything? I understand that this will limit the time frame for point-in-time recovery, but the data is growing out of hand and I need to do a bit of clean-up.

It looks like this is possible. I don't see anything at http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-replication-pitr.html that says you can't, since point-in-time recovery works forward from the last epoch covered by your backup.
Some additional information might be gained by reading this article:
http://www.mysqlab.net/knowledge/kb/detail/topic/backup/id/8309
The mysql.ndb_binlog_index is a MyISAM table. If you are cleaning it up,
make sure you don't delete entries for binary logs that you still need.
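For example, a minimal sketch of such a clean-up (the cut-off epoch below is a placeholder; take it from your most recent backup, and only remove index rows for binary logs you have already purged or no longer need):

-- hypothetical cut-off: keep everything from the epoch of the last usable backup onwards
DELETE FROM mysql.ndb_binlog_index WHERE epoch < 123456789;
-- the table is MyISAM, so reclaim the freed space afterwards
OPTIMIZE TABLE mysql.ndb_binlog_index;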

Related

Mysql reclaim mysql disk space after deleting rows

First of all, sorry for repeating this oft-repeated question, as it has been asked numerous times already. I have gone through those questions and answers.
I have a table in the production environment of approximately 159 GB, and now I want to migrate MySQL to another server. But since the total DB size is more than 300 GB, I am not able to migrate it easily. So I am trying to reclaim space by deleting records. I deleted more than 70% of the records from this table and tried OPTIMIZE TABLE, but it gives an error:
mysql> OPTIMIZE TABLE table_name;
+------------+----------+----------+--------------------------------------------------------------------+
| Table      | Op       | Msg_type | Msg_text                                                           |
+------------+----------+----------+--------------------------------------------------------------------+
| table_name | optimize | note     | Table does not support optimize, doing recreate + analyze instead  |
| table_name | optimize | error    | The table 'table_name' is full                                     |
| table_name | optimize | status   | Operation failed                                                   |
+------------+----------+----------+--------------------------------------------------------------------+
innodb_file_per_table is set to ON
SHOW VARIABLES LIKE '%innodb_file_per_table%';
Variable_name          Value
---------------------- --------
innodb_file_per_table  ON
Mysql Version: 5.7.28-log
I read somewhere that ALTER TABLE would help, but that it slows down all MySQL queries while it runs.
In one answer I read that copying the data into another table, then renaming it and deleting the original table (which I assume is what OPTIMIZE TABLE does internally), would help, but doing so will need downtime.
Is there any other way I can achieve this?
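For reference, a minimal sketch of the copy-and-swap approach mentioned above (table names are placeholders; any writes that arrive between the copy and the rename are lost unless you lock the table or use a tool such as pt-online-schema-change):

CREATE TABLE table_name_new LIKE table_name;
-- copy only the rows you want to keep
INSERT INTO table_name_new SELECT * FROM table_name;
RENAME TABLE table_name TO table_name_old, table_name_new TO table_name;
-- dropping the old copy is what actually returns the disk space
DROP TABLE table_name_old;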

Do row-level binlog entries get recorded in MySQL when a non-null column with a default gets added?

I wanted to verify the following behavior I noticed with MySQL row-based replication in case there was just something peculiar with our setup or configuration. With row-based replication turned on, and given the following table named pets:
| id | name  | species |
|----|-------|---------|
| 1  | max   | canine  |
| 2  | spike | canine  |
| 3  | bell  | feline  |
Any updates, deletes, or inserts are recorded in the binlog. However, if I were to add a non-null column with a default value, e.g.
ALTER TABLE `pets`
ADD COLUMN `sex` varchar(7) NOT NULL DEFAULT "unknown" AFTER `species`;
The records are updated like so:
| id | name  | species | sex     |
|----|-------|---------|---------|
| 1  | max   | canine  | unknown |
| 2  | spike | canine  | unknown |
| 3  | bell  | feline  | unknown |
The behavior I initially expected was that an update would be recorded for each row (since each row undergoes change), and these updates would appear in the binlog. However, it actually appears that no row-level events are being written to the binlog at all when the new column and default values are added.
Anyways, the questions I have are:
Is this behavior expected, or is this indicative of some issue with our setup (or my observational skills)?
Is this behavior configurable in any way?
Any information, links, resources, etc will be greatly appreciated.
Thanks,
As the MySQL documentation on the binlog format setting says (emphasis is mine):
With the binary log format set to ROW, many changes are written to the binary log using the row-based format. Some changes, however, still use the statement-based format. Examples include all DDL (data definition language) statements such as CREATE TABLE, ALTER TABLE, or DROP TABLE.
To be honest, your line of reasoning did not seem logical to me; replicating such an operation through individual row updates would be completely inefficient. I know that some complex DDL/DML statements may be partially replicated through a series of inserts/updates, but that does not apply here.
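If you want to confirm this on your own server, one quick check (the binlog file name is a placeholder; pick one from SHOW BINARY LOGS) is to list the events around the ALTER and verify that it shows up as a single Query event rather than a run of Update_rows events:

SHOW BINLOG EVENTS IN 'mysql-bin.000001' LIMIT 50;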

Improving MySQL Performance on a Run-Once Query with a Large Dataset

I previously asked a question on how to analyse large datasets (how can I analyse 13GB of data). One promising response was to add the data into a MySQL database using natural keys and thereby make use of INNODB's clustered indexing.
I've added the data to the database with a schema that looks like this:
TorrentsPerPeer
+----------+------------------+------+-----+---------+-------+
| Field    | Type             | Null | Key | Default | Extra |
+----------+------------------+------+-----+---------+-------+
| ip       | int(10) unsigned | NO   | PRI | NULL    |       |
| infohash | varchar(40)      | NO   | PRI | NULL    |       |
+----------+------------------+------+-----+---------+-------+
The two fields together form the primary key.
This table represents known instances of peers downloading torrents. I'd like to be able to provide information on how many torrents can be found at each peer. I'm going to draw a histogram of how often each torrent count occurs (e.g. 20 peers have 2 torrents, 40 peers have 3, ...).
I've written the following query:
SELECT `count`, COUNT(`ip`)
FROM (SELECT `ip`, COUNT(`infohash`) AS `count`
FROM TorrentsPerPeer
GROUP BY `ip`) AS `counts`
GROUP BY `count`;
Here's the EXPLAIN for the sub-select:
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| id | select_type | table          | type  | possible_keys | key     | key_length | ref    | rows     | Extra       |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
| 1  | SIMPLE      | TorrentPerPeer | index | [Null]        | PRIMARY | 126        | [Null] | 79262772 | Using index |
+----+-------------+----------------+-------+---------------+---------+------------+--------+----------+-------------+
I can't seem to do an EXPLAIN for the full query because it takes way too long. This bug suggests it's because the subquery is run first.
This query is currently running (and has been for an hour). top is reporting that mysqld is only using ~5% of the available CPU whilst its RSIZE is steadily increasing. My assumption here is that the server is building temporary tables in RAM that it's using to complete the query.
My question is then; how can I improve the performance of this query? Should I change the query somehow? I've been altering the server settings in the my.cnf file to increase the INNODB buffer pool size, should I change any other values?
If it matters the table is 79'262'772 rows deep and takes up ~8GB of disk space. I'm not expecting this to be an easy query, maybe 'patience' is the only reasonable answer.
EDIT Just to add that the query has finished and it took 105mins. That's not unbearable, I'm just hoping for some improvements.
My hunch is that with an unsigned int and a varchar(40) (especially the varchar!) you now have a HUGE primary key, and it is making your index too big to fit in whatever RAM you have set aside for the InnoDB buffer pool. That forces InnoDB to swap index pages in and out from disk as it searches, which means a LOT of disk seeks and not a lot of CPU work.
One thing I did for a similar issue is to use something in between a truly natural key and a surrogate key. We took the two fields that are actually unique (one of which was also a varchar), built a fixed-width MD5 hash of them in the application layer, and used THAT as the key. Yes, it means more work for the app, but it makes for a much smaller index, since you are no longer keying on an arbitrary-length field.
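A rough sketch of that idea against the table from the question (the pk_hash column name is made up for illustration, and rebuilding the primary key on a table this size is itself a long, locking operation):

ALTER TABLE TorrentsPerPeer ADD COLUMN pk_hash BINARY(16);
-- MD5 of the natural key, stored as 16 raw bytes instead of the full int + varchar(40) key
UPDATE TorrentsPerPeer SET pk_hash = UNHEX(MD5(CONCAT(ip, ':', infohash)));
-- then drop the composite primary key and promote the hash in its place
ALTER TABLE TorrentsPerPeer DROP PRIMARY KEY, ADD PRIMARY KEY (pk_hash);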
OR, you could just use a server with tons of RAM and see if that makes the index fit in memory, but I always like to make 'throw hardware at it' a last resort :)

COUNT(id) query is taking too long, what performance enhancements might help?

I have a query timeout problem. When I did a:
SELECT COUNT(id) AS rowCount FROM infoTable;
in my program, my JDBC call timed out after 2.5 minutes.
I don't have much database admin expertise but I am currently tasked with supporting a legacy database. In this mysql database, there is an InnoDB table:
+-------+------------+------+-----+---------+----------------+
| Field | Type       | Null | Key | Default | Extra          |
+-------+------------+------+-----+---------+----------------+
| id    | bigint(20) | NO   | PRI | NULL    | auto_increment |
| info  | longtext   | NO   |     |         |                |
+-------+------------+------+-----+---------+----------------+
It currently has a high id of 5,192,540, which is approximately the number of rows in the table. Some of the info text is over 1 MB, some is very small. Around 3,000 rows are added on a daily basis. The machine has loads of free disk space, but not a lot of extra memory. Rows are read and occasionally modified, but rarely deleted, though I'm hoping to clean out some of the older data, which is pretty much obsolete.
I tried the same query manually on a smaller test database which had 1,492,669 rows, installed on a similar machine with less disk space, and it took 9.19 seconds.
I tried the same query manually on an even smaller test database which had 98,629 rows and it took 3.85 seconds. I then added an index to id:
create index infoTable_idx on infoTable(id);
and the subsequent COUNT took 4.11 seconds, so it doesn't seem that adding an index would help in this case. (Just for kicks, I did the same on the aforementioned mid-sized db and access time increased from 9.2 to 9.3 seconds.)
Any idea how long a query like this should be taking? What is locked during this query? What happens if someone is adding data while my program is selecting?
Thanks for any advice,
Ilane
You might try executing the following EXPLAIN statement instead; it might be a bit quicker:
mysql> EXPLAIN SELECT id FROM infoTable;
That may or may not yield quicker results; look at the rows field for the optimizer's row-count estimate.
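If an approximate figure is good enough, another cheap option is to read the table statistics instead of scanning an index (the TABLE_ROWS estimate InnoDB reports here can be off by a noticeable margin):

SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'infoTable';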

How to optimize mysql indexes so that INSERT operations happen quickly on a large table with frequent writes and reads?

I have a table watchlist which today contains almost 3 million records.
mysql> select count(*) from watchlist;
+----------+
| count(*) |
+----------+
| 2957994 |
+----------+
It is used as a log to record product-page views on a large e-commerce site (50,000+ products). It records the productID of the viewed product, the IP address and USER_AGENT of the viewer, and a timestamp of when the view happened:
mysql> show columns from watchlist;
+-----------+--------------+------+-----+-------------------+-------+
| Field     | Type         | Null | Key | Default           | Extra |
+-----------+--------------+------+-----+-------------------+-------+
| productID | int(11)      | NO   | MUL | 0                 |       |
| ip        | varchar(16)  | YES  |     | NULL              |       |
| added_on  | timestamp    | NO   | MUL | CURRENT_TIMESTAMP |       |
| agent     | varchar(220) | YES  | MUL | NULL              |       |
+-----------+--------------+------+-----+-------------------+-------+
The data is then reported on several pages throughout the site on both the back-end (e.g. checking what GoogleBot is indexing), and front-end (e.g. a side-bar box for "Recently Viewed Products" and a page showing users what "People from your region also liked" etc.).
So that these "report" pages and side-bars load quickly I put indexes on relevant fields:
mysql> show indexes from watchlist;
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table     | Non_unique | Key_name  | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| watchlist | 1          | added_on  | 1            | added_on    | A         | NULL        | NULL     | NULL   |      | BTREE      |         |
| watchlist | 1          | productID | 1            | productID   | A         | NULL        | NULL     | NULL   |      | BTREE      |         |
| watchlist | 1          | agent     | 1            | agent       | A         | NULL        | NULL     | NULL   | YES  | BTREE      |         |
+-----------+------------+-----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Without the INDEXES, pages with the side-bar for example would spend about 30-45sec executing a query to get the 7 most-recent ProductIDs. With the indexes it takes <0.2sec.
The problem is that with the INDEXES the product pages themselves are taking longer and longer to load because as the table grows the write operations are taking upwards of 5sec. In addition there is a spike on the mysqld process amounting to 10-15% of available CPU each time a product page is viewed (roughly once every 2sec). We already had to upgrade the server hardware because on a previous server it was reaching 100% and caused mysqld to crash.
My plan is to attempt a 2-table solution. One table for INSERT operations, and another for SELECT operations. I plan to purge the INSERT table whenever it reaches 1000 records using a TRIGGER, and copy the oldest 900 records into the SELECT table. The report pages are a mixture of real-time (recently viewed) and analytics (which region), but the real-time pages tend to only need a handful of fresh records while the analytical pages don't need to know about the most recent trend (i.e. last 1000 views). So I can use the small table for the former and the large table for the latter reports.
My question: Is this an ideal solution to this problem?
Also: with TRIGGERS in MySQL, is it possible to nice the trigger_statement so that it takes longer but doesn't consume much CPU? Or would running a niced cron job every 30 minutes, which performs the purging if required, be a better solution?
Write operations for a single row into a data table should not take 5 seconds, regardless how big the table gets.
Is your clustered index based on the timestamp field? If not, it should be, so you're not writing into the middle of your table somewhere. Also, make sure you are using InnoDB tables - MyISAM is not optimized for writes.
I would propose writing into two tables: one long-term table, one short-term reporting table with little or no indexing, which is then dumped as needed.
Another solution would be to use memcached or an in-memory database for the live reporting data, so there's no hit on the production database.
One more thought: exactly how "live" must either of these reports be? Perhaps retrieving a new list on a timed basis versus once for every page view would be sufficient.
A quick fix might be to use the INSERT DELAYED syntax, which lets MySQL queue the inserts and execute them when it has time. That's probably not a very scalable solution, though.
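For illustration (the column values are made up; note that DELAYED only applies to MyISAM, MEMORY, ARCHIVE and BLACKHOLE tables, was deprecated in MySQL 5.6, and is treated as a plain INSERT from 5.7 on):

INSERT DELAYED INTO watchlist (productID, ip, added_on, agent)
VALUES (123, '192.0.2.10', NOW(), 'Mozilla/5.0 (example agent)');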
I actually think the principle of what you are attempting is sound, although I wouldn't use a trigger. My suggested solution would be to let the data accumulate for a day and then move it to the secondary log table with a batch script that runs at night. This is mainly because such frequent transfers of a thousand rows would still put a rather heavy load on the server, and because I don't really trust the MySQL trigger implementation (although that isn't based on any real substance).
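A minimal sketch of that nightly move, assuming an archive table named watchlist_archive with the same structure (both the name and the one-day cut-off are placeholders):

-- copy everything older than today into the long-term reporting table ...
INSERT INTO watchlist_archive
SELECT * FROM watchlist WHERE added_on < CURDATE();
-- ... then remove it from the hot table that the product pages write to
DELETE FROM watchlist WHERE added_on < CURDATE();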
Instead of optimizing indexes, you could offload the database writes. You could delegate writing to a background process via an asynchronous queue (ActiveMQ, for example). Inserting a message into an ActiveMQ queue is very fast. We are using ActiveMQ and see about 10-20K insert operations on our test platform (and that is a single-threaded test application, so you could get more).
Look up 'shadow tables': when rebuilding tables this way, you don't need to write to the production table.
I had the same issue; both InnoDB and MyISAM tables (which, as mentioned before, are not optimized for writes) gave me trouble, and I solved it by using a second table for writing temporary data, which periodically updates the huge master table. The master table, with over 18 million records, is used only for reading, and results are written to the second, small table.
The problem is that an insert/update onto the big master table takes a while, and it gets worse if several updates or inserts are already waiting in the queue, even with the INSERT DELAYED or UPDATE LOW_PRIORITY options enabled.
To make it even faster, read the small secondary table first when searching for a record; if the record is there, work on the second table only. Use the big master table only for reference and for picking up new records: if the data is not in the secondary small table, read the record from the master (reads are fast on both InnoDB and MyISAM tables) and then insert that record into the small second table.
It works like a charm: reading from the huge 20-million-record master takes much less than 5 seconds, and writes to the second small table of 100K to 300K records take less than a second.
Regards
Something that often helps when doing bulk loads is to drop any indexes, do the bulk load, then recreate the indexes. This is generally much faster than the database having to constantly update the index for each and every row inserted.
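As a sketch against the watchlist table from the question (index names taken from the SHOW INDEXES output above; the report queries will be slow until the indexes are rebuilt):

ALTER TABLE watchlist DROP INDEX added_on, DROP INDEX productID, DROP INDEX agent;
-- ... run the bulk load here ...
ALTER TABLE watchlist
  ADD INDEX added_on (added_on),
  ADD INDEX productID (productID),
  ADD INDEX agent (agent);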