How to purge old entries from a single column in a big MySQL database? - mysql

I need to remove the payload of old database entries while keeping the other data (id and other properties) of the same entries.
The table in question has a message_id column (a datestamp concatenated with other info), a content column (a BLOB that makes up over 90% of the database's total size), and some other columns that are not relevant here.
I first tried running a simple update with a condition:
UPDATE LOW_PRIORITY repository SET content="" WHERE SUBSTR( message_id, 6, 6 )<201601 AND message_box = "IN";
I extract the YYYYMM from each entry's message_id, and if it is older than a chosen cutoff month, I replace content with an empty string.
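To make the cutoff test concrete, here is a minimal Python sketch of what the condition computes for a single row. MySQL's SUBSTR is 1-based, so SUBSTR(message_id, 6, 6) corresponds to the Python slice [5:11]. The sample message_id values and the helper names are made up for illustration; only the position of the YYYYMM digits is taken from the query above.

```python
def yyyymm_of(message_id: str) -> int:
    """Mirror SUBSTR(message_id, 6, 6): six characters starting at the 6th (1-based)."""
    return int(message_id[5:11])


def is_older_than(message_id: str, cutoff_yyyymm: int) -> bool:
    """True when the embedded YYYYMM is strictly before the cutoff month."""
    return yyyymm_of(message_id) < cutoff_yyyymm


# Hypothetical IDs: a 5-character prefix, then YYYYMM, then other info.
print(is_older_than("AB123201512080915-x", 201601))  # True
print(is_older_than("AB123201601020304-y", 201601))  # False
```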
The database is over 25GB in size and holds almost 2 million entries in my table. It runs on very modest hardware, and my query failed with an error after running for some time:
ERROR 2013 (HY000): Lost connection to MySQL server during query
Usually I try to avoid changing database variables, but I knew this error also pops up when restoring a database from a large dump file, so I raised the setting to allow a 100MB packet size:
set global max_allowed_packet=104857600;
Re-running my UPDATE query resulted in a new error:
ERROR 2013 (HY000): Lost connection to MySQL server during query
As mentioned, my MySQL server runs on very modest hardware, and I'd prefer not to change settings that could make the server exceed its available resources. So instead of increasing every available timeout variable, I decided to run my query in smaller chunks, like this:
UPDATE LOW_PRIORITY repository SET content="" WHERE message_id in (select message_id from(select message_id from repository where SUBSTR( message_id, 6, 6 )<201603 AND message_box = "IN" limit 0, 1000)as temp);
This query fails with an error:
ERROR 1206 (HY000): The total number of locks exceeds the lock table size
It fails with the same error even when limited to a single row with "limit 1"!
Am I using pagination incorrectly, or is there a better way of doing this?
*The DB runs on a virtual Ubuntu server with a dual-core Intel CPU, 1GB of RAM and a 100GB HDD. It's completely adequate for its daily tasks, and I'd really prefer not to increase the specs for just this one query.

You are trying to trick MySQL into doing something it doesn't want to do (using LIMIT inside an IN subquery) in a complicated way (complicated = more resources). That is not wrong, but you can just write
UPDATE LOW_PRIORITY repository SET content=""
WHERE content <> ""
and SUBSTR( message_id, 6, 6 ) < 201603 AND message_box = "IN"
limit 1000;
This will update the first 1000 old rows that still have content in them. Run it repeatedly until it no longer affects any rows.
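Repeating that statement by hand gets tedious, so a small driver loop can do it. The following is only a sketch: it assumes a Python DB-API 2.0 connection to the server (for example from PyMySQL or mysqlclient, which use %s placeholders), and the function name, the batch size default and the optional pause are illustrative choices, not part of the answer above.

```python
import time

# Same statement as above; %s is the placeholder style used by common
# MySQL drivers such as PyMySQL and mysqlclient (an assumption here).
PURGE_SQL = (
    'UPDATE LOW_PRIORITY repository SET content = "" '
    'WHERE content <> "" '
    'AND SUBSTR(message_id, 6, 6) < 201603 AND message_box = "IN" '
    "LIMIT %s"
)


def purge_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Run the limited UPDATE repeatedly until no row is changed.

    conn is any DB-API 2.0 connection to the MySQL server.
    Returns the total number of rows updated.
    """
    total = 0
    cursor = conn.cursor()
    try:
        while True:
            cursor.execute(PURGE_SQL, (batch_size,))
            changed = cursor.rowcount
            conn.commit()  # end the transaction after every batch
            if changed <= 0:
                break
            total += changed
            time.sleep(pause_seconds)  # optional breathing room for the server
    finally:
        cursor.close()
    return total
```

Committing after each batch keeps every transaction small, which is the point of chunking on a server with little memory. The condition content <> "" is what lets the loop terminate, since rows that were already emptied no longer match.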

I would imagine your #1 problem here is that your WHERE condition cannot use an index on the message_id field, because the column is wrapped in SUBSTR().
Why not simply do:
WHERE message_id < 20160100* ...
Assuming this is an integer field, 201512** would be less than 201601** anyway, so there would be no change in your outcome. But removing the substring function would allow you to use an index on that field.

Related

Speed up select distinct process from very large table

I want to use select distinct on a single variable to extract data from a very large MyISAM table with ~300 million rows (~12.3 GiB in size; select distinct should yield ~100k observations, so much smaller than 1 GiB).
The problem is, this query takes 10+ hours to run. I actually don't know how long it takes because I've never finished the process due to impatience.
My query is as follows:
create table codebook(
symbol varchar(16) not null);
create index IDXcodebook on codebook(symbol);
insert into codebook
select distinct(symbol) from bigboytable
I've tried creating an index on bigboytable(symbol) to speed up the process, but that indexing statement has run for 15+ hours with no end in sight.
I've also tried:
SELECT symbol from bigboytable GROUP BY symbol
But I get
Error Code: 2013. Lost connection to MySQL server during query
In fact, if any query, in this project or in other projects, is "too complicated", I get Error Code 2013 after ~1-6+ hours, depending on the query.
Other settings are:
Migration connection timeout (3600); DBMS connection read timeout skipped; DBMS connection keep-alive interval (5 seconds); SSH BufferSize (10240 bytes); SSH connect, read, write, and command timeouts (500 seconds).
Any suggestions? I might work with Python's MySQL packages if that might speed things up; Workbench is very slow. I need this data ASAP for a large project, but don't need the 300+ million observations from bigboytable.
Edit: I attach my bigboytable definition and explain output here.

How to fix update statement blocking multiple tables in MYSQL

I have run into an issue when using a MySQL database where, after creating a new table and adding CRUD database query logic to my web application (with a backend written in C), update queries will sometimes take 10-20 minutes to execute.
The web application has Apache modules that talk to server daemons that have a connection to a MySQL (MariaDB 10.4) database. The server daemons each have about 20 work threads, waiting to handle any requests from the Apache modules. The work threads maintain a constant connection to the MySQL database. I added a new table with the following schema:
CREATE TABLE MyTable
(
table_id INT NOT NULL AUTO_INCREMENT,
index_id INT NOT NULL,
int_column_1 INT DEFAULT 0,
decimal_column_1 DECIMAL(9,3) DEFAULT 0,
decimal_column_2 DECIMAL(9,3) DEFAULT 0,
varchar_column_1 varchar(3000) DEFAULT NULL,
varchar_column_2 varchar(3000) DEFAULT NULL,
deleted tinyint DEFAULT 0,
PRIMARY KEY (table_id) ,
KEY index_on_index_id (index_id)
)
Then I added the following crud operations:
1. RETRIEVE:
SELECT * FROM MyTable table_id, varchar_column_1,... WHERE index_id = ${given index_id}
2. CREATE:
INSERT INTO MyTable (index_id, varchar_column_2, ,,,) VALUES ( ${given}, ${given})
Note: This is done using a prepared statement because ${given varchar_column_2} is a user-entered value.
3. UPDATE:
UPDATE MyTable SET varchar_column_1 = ISNULL(${given varchar_column_2}, `varchar_column_2 `) WHERE table_id = ${given table_id}
Note: This is also done using a prepared statement because ${given varchar_column_2} is a user-entered value. Also, the isnull is a kludge for the possibility that the given varchar_column_2 might be null, so that the column will just be set to the value already in the table.
4. DELETE:
UPDATE MyTable SET deleted = 1 WHERE table_id = ${given table_id}
Finally, there is a delete index_id operation:
UPDATE MyTable SET deleted = 1 WHERE index_id = ${given index_id }
This was deployed to a production server without proper testing. On that production server, a script I wrote was run that filled MyTable with about 30,000 entries. Then, using the CRUD operations, about 600 updates, 50 creates, 20 deletes, and thousands of retrieves were performed on the table. The problem is that after some time (an hour or two) of these operations being performed, the update operation would take 10+ minutes to execute. Eventually, all of the work threads in the server daemon would be stuck waiting on the update operations, and any other requests to the daemon would time out. This behavior happened twice in one day and one more time two days later.
There were three parts of this behavior that really confused me. One is that all update operations on the database were being blocked. So even if the daemon, or any daemon, was updating a different table in the database, that update would take 10+ minutes. The next is that select operations would execute instantly while all the update queries were taking 10+ minutes. Finally, after 10-20 minutes, all of the 20-ish update queries would successfully execute, the database would be correctly updated, and the threads would go back to working properly.
I received a dump of the database and ran EXPLAIN ${mysql query} for each of the new CRUD queries, and none produced strange results. In the "Extra" column, the only entry was "Using where" for the queries that have WHERE clauses. Another potential problem is the use of varchars. Since the UPDATE operations are used the most and are the ones that seem to be causing the problem, I thought that because the varchars change size a lot (they range from 8 chars to 500 chars), MySQL might run into memory issues that cause the long execution time. I also thought maybe there was an issue with table-level locks, but running
SHOW STATUS LIKE 'Table%';
returned table_locks_waited = 0.
Unfortunately, no database monitoring was being done on the production server that was having issues; I only have the order of the transactions as they happened. Each time this issue occurred, the first update query that was blocked was an update to a different table in the database. It was the same query twice (it is also the most common update query in the application), but it has been in the application for months without any issues.
I tried to reproduce this issue on a server with the same table and CRUD operations, but with only 600 entries in MyTable, making about 100 update requests, 20 create requests, 5 delete requests, and hundreds of get requests. I could not reproduce the issue of the update queries taking 10+ minutes. This makes me think that maybe the size of the table has something to do with it.
I am looking for any suggestions on what might be causing this issue, or any ideas on how to better diagnose the problem.
Sorry for the extremely long question. I am a junior software engineer that is in a little over his head. Any help would be really appreciated. I can also provide any additional information about the database or application if needed.

MySQL index misbehaves in Rails

I have a Rails app backed by MySQL. There is a reservations table with an index on the column rescheduled_reservation_id (nullable).
In my Rails app there are two places that query reservations by the rescheduled_reservation_id field, as below:
Transit::Reservation.find_by(rescheduled_reservation_id: 25805)
which produces the following log output:
Transit::Reservation Load (60.3ms) SELECT `transit_reservations`.* FROM `transit_reservations` WHERE `transit_reservations`.`deleted_at` IS NULL AND `transit_reservations`.`rescheduled_reservation_id` = 25805 LIMIT 1
However the other part of the app:
Transit::Reservation.where(rescheduled_reservation_id: 25805).last
with the log output below:
Transit::Reservation Load (2.3ms) SELECT `transit_reservations`.* FROM `transit_reservations` WHERE `transit_reservations`.`deleted_at` IS NULL AND `transit_reservations`.`rescheduled_reservation_id` = 25805 ORDER BY `transit_reservations`.`id` DESC LIMIT 1
As can be clearly seen, the first query
Transit::Reservation Load (60.3ms) SELECT `transit_reservations`.* FROM `transit_reservations` WHERE `transit_reservations`.`deleted_at` IS NULL AND `transit_reservations`.`rescheduled_reservation_id` = 25805 LIMIT 1
took about 60ms, so the index might not have been used properly, compared to 2ms for this one:
Transit::Reservation Load (2.3ms) SELECT `transit_reservations`.* FROM `transit_reservations` WHERE `transit_reservations`.`deleted_at` IS NULL AND `transit_reservations`.`rescheduled_reservation_id` = 25805 ORDER BY `transit_reservations`.`id` DESC LIMIT 1
I also tried to debug further by running EXPLAIN on both queries, and got back the same result for both, i.e. the index on rescheduled_reservation_id is being used.
Has anyone experienced this issue? I am wondering whether the Rails MySQL connection (I am using the mysql2 gem) might cause the MySQL server to not choose the right index.
It's Rare, but Normal.
The likely answer is that the first occurrence did not find the blocks it needed cached in the buffer_pool. So, it had to fetch them from disk. On a plain ole HDD, a Rule of Thumb is 10ms per disk hit. So, maybe there were 6 blocks that it needed to fetch, leading to 60.3ms.
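The arithmetic behind that estimate can be written down as a tiny sketch. The 10ms figure is the Rule of Thumb quoted above for a spinning disk, and the function name is made up for illustration; the result covers disk wait only, not CPU or network time.

```python
def estimated_disk_wait_ms(uncached_blocks: int, ms_per_disk_hit: float = 10.0) -> float:
    """Rough disk wait for a query that must read this many blocks from an HDD."""
    return uncached_blocks * ms_per_disk_hit


print(estimated_disk_wait_ms(6))  # 60.0, in line with the 60.3ms observation
print(estimated_disk_wait_ms(0))  # 0.0, everything was already cached
```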
Another possibility is that other activities were interfering, thereby slowing down this operation.
2.3ms is reasonable for a simple query like that, which can be performed entirely with cached blocks in RAM.
Was the server recently restarted? After a restart, there is nothing in cache. Is the table larger than innodb_buffer_pool_size? If so, that would lead to 60ms happening sporadically -- blocks would get bumped out. (Caveat: The buffer_pool should not be made so big that 'swapping' occurs.)
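One way to check the size question is to compare the table's data plus index size with the buffer pool setting. This is a hedged sketch: it assumes a Python DB-API 2.0 connection whose cursor returns rows as tuples, the helper name and the schema name in the usage note are hypothetical, and the sizes reported by information_schema are only estimates for InnoDB tables.

```python
TABLE_SIZE_SQL = (
    "SELECT data_length + index_length "
    "FROM information_schema.tables "
    "WHERE table_schema = %s AND table_name = %s"
)
BUFFER_POOL_SQL = "SELECT @@innodb_buffer_pool_size"


def table_fits_in_buffer_pool(conn, schema, table):
    """Return (table_bytes, pool_bytes, fits) for one table.

    conn is any DB-API 2.0 connection to the MySQL server.
    """
    cursor = conn.cursor()
    try:
        cursor.execute(TABLE_SIZE_SQL, (schema, table))
        row = cursor.fetchone()
        if row is None:
            raise ValueError(f"table {schema}.{table} not found")
        table_bytes = int(row[0])
        cursor.execute(BUFFER_POOL_SQL)
        pool_bytes = int(cursor.fetchone()[0])
    finally:
        cursor.close()
    return table_bytes, pool_bytes, table_bytes <= pool_bytes
```

For the table in the question this might be called as table_fits_in_buffer_pool(conn, "app_db", "transit_reservations"), where "app_db" stands in for the real schema name. Other tables compete for the same buffer pool, so a table that "fits" can still see some of its blocks evicted.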
A block is 16KB; it contains some rows of data or rows of index or nodes of a BTree. Depending on the size of the table, even that 'point query' might have needed to look at 6 blocks or more.
If you don't get 2.3ms most of the time, we should dig deeper. (I have hinted at sizes to investigate.)

Lost Connection when trying to add new column to MYSQL table

I'm trying to add a column to a mysql table that has over 25 million rows. I am running the sql command
ALTER TABLE `table_name` ADD COLUMN `column_name` varchar(128) NULL DEFAULT NULL;
This is being run using the mysql command line application.
Every time I try to run this it takes hours and then I get the error
ERROR 2013 (HY000): Lost connection to MySQL server during query
The database is running in an RDS instance on AWS, and according to the monitoring statistics neither memory nor disk space is being exhausted.
Is there anything else I can try to add this column to the table?
Check your memory usage or, more probably, your disk usage (is there enough free space during the process?). Altering a table may require either a large amount of memory or a copy on disk of your table. Changing the alter algorithm from INPLACE to COPY can be even faster in your particular case.
You may also be hitting the innodb_online_alter_log_max_size limit, although in that case, only the query should fail, not the entire server. It is possible that the crash may be happening due to the ROLLBACK, and not the operation itself, though.
Finally, some application configurations or hosting servers cancel a query or HTTP request that is taking too long. I recommend executing the same query from the command line client for testing purposes.

Mysql Update performance suddenly abysmal

MySQL 5.1, Ubuntu 10.10 64bit, Linode virtual machine.
All tables are InnoDB.
One of our production machines uses a MySQL database containing 31 related tables. In one table, there is a field containing display values that may change several times per day, depending on conditions.
These changes to the display values are applied lazily throughout the day during usage hours. A script periodically runs and checks a few inexpensive conditions that may cause a change, and updates the display value if a condition is met. However, this lazy method doesn't catch all possible scenarios in which the display value should be updated, in order to keep background process load to a minimum during working hours.
Once per night, a script purges all display values stored in the table and recalculates them all, thereby catching all possible changes. This is a much more expensive operation.
This has all been running consistently for about 6 months. Suddenly, 3 days ago, the run time of the nightly script went from an average of 40 seconds to 11 minutes.
The overall proportions on the stored data have not changed in a significant way.
I have investigated as best I can, and the part of the script that is suddenly running slower is the last update statement that writes the new display values. It is executed once per row, given the (INT(11)) id of the row and the new display value (also an INT).
update `table` set `display_value` = ? where `id` = ?
The funny thing is, that the purge of all the previous values is executed as:
update `table` set `display_value` = null
And this statement still runs at the same speed as always.
The display_value field is not indexed. id is the primary key. There are 4 other foreign keys in table that are not modified at any point during execution.
And the final curve ball: If I dump this schema to a test VM, and execute the same script it runs in 40 seconds not 11 minutes. I have not attempted to rebuild the schema on the production machine, as that's simply not a long term solution and I want to understand what's happening here.
Is something off with my indexes? Do they get cruft in them after thousands of updates on the same rows?
Update
I was able to completely resolve this problem by running optimize on the schema. Since InnoDB doesn't support optimize, this forced a rebuild, and resolved the issue. Perhaps I had a corrupted index?
mysqlcheck -A -o -u <user> -p
There is a chance that the UPDATE statement won't use an index on id; however, it's very improbable (if possible at all) for a query like yours.
Is there a chance your table is locked by a long-running concurrent query / DML? Which engine does the table use?
Also, updating the table record by record is not efficient. You can load your values into a temporary table in bulk and update the main table with a single command:
CREATE TEMPORARY TABLE tmp_display_values (id INT NOT NULL PRIMARY KEY, new_display_value INT);
INSERT
INTO tmp_display_values
VALUES
(?, ?),
(?, ?),
…;
UPDATE `table` dv
JOIN tmp_display_values t
ON dv.id = t.id
SET dv.display_value = t.new_display_value;
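From a script, the same temporary-table approach could be driven roughly as follows. This is a sketch under assumptions: a Python DB-API 2.0 connection with %s placeholders (as in PyMySQL or mysqlclient), a hypothetical function name, and a SET clause that targets display_value, the column named in the question. Some drivers turn executemany on an INSERT into a single multi-row statement, which is what makes the load cheap.

```python
def bulk_update_display_values(conn, pairs):
    """Apply many (id, new_display_value) pairs with one UPDATE ... JOIN.

    conn is any DB-API 2.0 connection to the MySQL server.
    Returns the row count reported for the UPDATE.
    """
    rows = list(pairs)
    if not rows:
        return 0
    cursor = conn.cursor()
    try:
        # Stage the new values in a session-local temporary table.
        cursor.execute(
            "CREATE TEMPORARY TABLE tmp_display_values "
            "(id INT NOT NULL PRIMARY KEY, new_display_value INT)"
        )
        cursor.executemany(
            "INSERT INTO tmp_display_values (id, new_display_value) "
            "VALUES (%s, %s)",
            rows,
        )
        # One statement updates every matching row in the main table.
        cursor.execute(
            "UPDATE `table` dv "
            "JOIN tmp_display_values t ON dv.id = t.id "
            "SET dv.display_value = t.new_display_value"
        )
        updated = cursor.rowcount
        cursor.execute("DROP TEMPORARY TABLE tmp_display_values")
        conn.commit()
    finally:
        cursor.close()
    return updated
```

A call such as bulk_update_display_values(conn, [(1, 10), (2, 20)]) would replace two single-row UPDATE round trips with one staged load and one joined update.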