Deleting old records - MySQL

I am currently looking for a solution to a basic problem: deleting old records.
To explain the situation, I have a table, which I'll call table1, with a small number of records. Usually it stays empty, as it is used to relay messages. These messages are read within two seconds of being added to the database, and deleted so that they aren't read again.
However, if one of the clients supposed to receive the messages from table1 goes offline, several messages can become pending. Sometimes hundreds. Sometimes thousands, or even hundreds of thousands, if not more.
Not only does this hurt the client, which has to process a huge number of messages, it also hurts the database, which is kept in memory and should hold a minimal number of records.
Considering the clients check for new messages every second, what would be the best way to delete old records? I've thought about adding timestamps, but won't calculating a timestamp on every insert hurt performance? I tried it out, and all those queries ended up in the slow query log.
What would the best solution be? I've thought about something like checking whether the table was altered in the past 5 seconds; if it wasn't, we can safely assume that all messages that should have been relayed already have been, and the table can be wiped. But how can this be done?
I've thought about events running every couple of minutes, but I'm not sure how to implement something that would have no (or negligible) impact on the select/insert/delete queries.
PS: This situation arose when I noticed that some clients were offline and there were 8 million messages pending.
EDIT:
I had forgotten to mention that the storage engine is MEMORY, and therefore all records are kept in RAM. That's the main reason I want to get rid of these records: keeping millions of records that shouldn't even be there in RAM has an impact on system resources.
Here is an extract from the slow query log:
# Query_time: 0.000283 Lock_time: 0.000070 Rows_sent: 0 Rows_examined: 96
SET timestamp=1387199997;
DELETE FROM messages WHERE clientid='100';
[...]
# Query_time: 0.000178 Lock_time: 0.000054 Rows_sent: 0 Rows_examined: 96
SET timestamp=1387199998;
DELETE FROM messages WHERE clientid='14';
So I guess they do have quite a small delay, but is it in any way meaningful in MySQL? I mean, in "real life", 0.0003 seconds could be completely ignored as insignificant; can the same be said about MySQL and connections with approximately 10 ms of ping?

Your question is interesting, but doesn't have a lot of detail, so I can only give general points of view.
Firstly, there are already a number of message-queuing solutions which may do what you need out of the box. They hide the underlying implementation of data storage, clean-up, etc., and allow you to focus on the application logic. RabbitMQ is a popular open-source option.
Secondly, unless you are working with constrained hardware, hundreds of thousands of records in a MySQL table is not a performance problem in most cases, nor is generating a timestamp on insert. So I would recommend building a solution that's obvious and straightforward (and therefore less error-prone): add a timestamp column to your message table, and find a way of removing messages older than 5 minutes. You could add this to the logic which cleans up the records after delivery. As long as your queries are hitting indexed columns, I don't think you have to worry about hundreds of thousands of records.
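A minimal sketch of that idea, assuming the messages table and clientid column shown in your slow query log plus a new created_at column (the names are assumptions, not taken from your schema):
-- Add an auto-populated timestamp; MEMORY tables default to HASH indexes,
-- so ask for BTREE to make the range scan below usable.
ALTER TABLE messages
  ADD COLUMN created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  ADD INDEX idx_created_at USING BTREE (created_at);
-- Run alongside the existing post-delivery cleanup:
-- anything older than 5 minutes is assumed stale and swept out.
DELETE FROM messages WHERE created_at < NOW() - INTERVAL 5 MINUTE;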
I would put some energy into creating a performance test suite that allows you to experiment with solutions and see which is really faster. That can be tricky, especially if you want to test scenarios with multiple clients, but you will learn a lot more about the performance characteristics of the app by working through those scenarios.
EDIT:
You can have one column in your table automatically set a timestamp value - I've always found this to be extremely fast. As in - it's never been a problem on very large tables (tens of millions of rows).
I've not got much experience with the MEMORY storage engine, but the MySQL documentation suggests that data modification actions (INSERT, UPDATE, or DELETE) can be slow due to locking time. That's borne out by your statistics, where the locking time is roughly 25-30% of the total.

I've had a similar problem.
A couple of questions: First, how long should undelivered messages dwell in the system? Forever? A day? Ten seconds?
Second, what is the consequence of erroneously deleting an undelivered message? Does it cause the collapse of the global banking system? Does it cause a hospital patient not to receive a needed injection? Or does a subsequent message simply cover for the missing one?
The best situation is short dwell time and low error consequence. If the error consequence is high, none of this is wise.
Setting up the solution took several steps for me.
First, write some code to fetch the max id from the messages table.
SELECT MAX(message_id) AS max_message_id FROM message
Then, an hour later, or ten seconds, or a day, or whatever, delete all the messages with id numbers less than or equal to the one recorded on the previous run.
DELETE FROM message WHERE message_id <= ?max_message_id
If all is functioning correctly, there won't be anything to delete. But if you have a bunch of stale messages for a client that's gone walkabout, pow, they're gone.
Finally, before putting this into production, wait for a quiet moment in your system, and, just once, issue the command
TRUNCATE TABLE message
to clear out any old rubbish in the table.
You can do this with an event (a stored job in the MySQL database) by creating a little one-row, one-column table to store the max_message_id.
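Sketching what that might look like (the message_sweep table, the event name, and the hourly schedule are all my own inventions; the event scheduler also has to be enabled with SET GLOBAL event_scheduler = ON):
-- One-row bookkeeping table holding the high-water mark from the previous run.
CREATE TABLE message_sweep (max_message_id BIGINT NOT NULL DEFAULT 0);
INSERT INTO message_sweep (max_message_id) VALUES (0);
DELIMITER //
CREATE EVENT purge_delivered_messages
ON SCHEDULE EVERY 1 HOUR
DO
BEGIN
  -- Anything at or below the mark recorded last run has had a full interval
  -- to be delivered; treat it as stale and remove it.
  DELETE FROM message
   WHERE message_id <= (SELECT max_message_id FROM message_sweep);
  -- Record the current high-water mark for the next run.
  UPDATE message_sweep
     SET max_message_id = (SELECT COALESCE(MAX(message_id), 0) FROM message);
END//
DELIMITER ;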
EDIT
You can also alter your table to add a message_time column, in such a way that it gets set automatically whenever you insert a row. Issue these three statements at a time when your system is quiet and you can afford to trash all extant messages.
TRUNCATE TABLE message;
ALTER TABLE message ADD COLUMN message_time TIMESTAMP
NOT NULL
DEFAULT CURRENT_TIMESTAMP;
ALTER TABLE message ADD INDEX message_time (message_time);
Then you can just use a single statement to clean out the old records, like so.
DELETE FROM message WHERE message_time <= NOW() - INTERVAL 1 HOUR
(or whatever interval is appropriate). You should definitely alter an empty or almost-empty table because it takes time to alter lots of rows.
This is a good solution because there's a chance that you don't have to alter your message-processing client code at all. (Of course, if you did SELECT * anywhere, you probably will have to alter it. Pro-tip: never use SELECT * in application code.)

Related

How to count page views in MySQL without performance hit

I want to count the number of visitors to a page, similar to what stackoverflow is doing with the "views" of each question.
The current solution just increments a field of an InnoDB table:
UPDATE data SET readers = readers + 1, date_edited = date_edited WHERE ID = '881529' LIMIT 1
This is the most expensive query on the page since it is performing a write operation.
Is there a better solution to the problem? How do high traffic sites like stackoverflow handle this?
I am thinking of instead writing to a table that uses the MEMORY engine and flushing that content to an InnoDB table every minute or so.
e.g.:
INSERT INTO mem_table (id,views_new)
VALUES (881525,1)
ON DUPLICATE KEY UPDATE views_new = views_new+1
Then I would run a cron job every minute to update the InnoDB table:
UPDATE data d, mem_table m
SET d.readers = d.readers + m.readers_new
WHERE d.ID = m.ID;
DELETE FROM mem_table;
Unfortunately this is not so good with replication, and the application is using a MySQL Galera Cluster.
Thank you in advance for any suggestions.
There are ways to reduce the immediate performance hit by starting a separate thread to update your counters. When you have a high number of parallel users (and thus many parallel updates of your hit counters), it is advisable to use a queuing mechanism to prevent locking (like your in-memory table). Your queue will have both writes and reads, so you have to take the table and data design into account.
An alternative is keeping a counter related to each article in a separate file. This prevents contention on a single table of hit counters, or, if you keep the counter in the table serving the articles, long lock-wait timeouts on that table (resulting in all kinds of front-end errors). Keeping the data in separate files does not give you insight into the overall hits on your site, but for that you could just use a log-graphing tool like awstats.
If you can batch 100 INSERTs/UPDATEs together in a single statement, you can run it 10 times as fast. (There is a risk of lock_wait_timeout and/or deadlock.)
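For instance, the statement from your question extends naturally to many rows per round trip (the extra id values here are invented; VALUES(views_new) refers to the value that row would have inserted, so each existing counter is bumped by its own batch total):
INSERT INTO mem_table (id, views_new)
VALUES (881525, 4), (881526, 1), (881527, 9)
ON DUPLICATE KEY UPDATE views_new = views_new + VALUES(views_new);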
What if you build a MEMORY table and lose the queued data in a power failure? I assume that is OK for this application? (If not, you have a much bigger problem.)
What are your client(s)? Can they queue up things before even touching the database?
I like ping-ponging a pair of tables for staging data into the database. Clients write to one table; a continuously running job (not a cron job) is working with the other table. When the latter finishes with inserts/updates, it swaps the tables with a single, atomic, RENAME TABLE so that the clients are oblivious. My Staging Table blog discusses this in further detail. It explains how to avoid the replication problems you encountered.
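The swap itself is one atomic statement, roughly like this (table names are illustrative):
-- Clients keep writing to mem_table; the job drains mem_table_work.
-- All three renames happen together, so writers never see a missing table.
RENAME TABLE mem_table      TO mem_table_tmp,
             mem_table_work TO mem_table,
             mem_table_tmp  TO mem_table_work;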
Another tip. Do not put the count and date in the main table. Put them in a 'parallel table' ('vertical partitioning'). This cuts down on the bulkiness in replication and decreases the interference with other processing.
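A sketch of that layout, reusing the column names from your UPDATE (the data_counts name is mine):
-- Counter columns live beside, not inside, the main data table.
CREATE TABLE data_counts (
  ID          INT UNSIGNED NOT NULL PRIMARY KEY,   -- same key as data.ID
  readers     INT UNSIGNED NOT NULL DEFAULT 0,
  date_edited TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                        ON UPDATE CURRENT_TIMESTAMP
) ENGINE=InnoDB;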
For Galera, use a pair of non-replicated tables (I suggest MyISAM with no indexes). Have the continually running job run in one place, cycling through the 3 nodes. If you had 3 jobs, there would be several ways in which they could stumble over each other.
If this won't keep up, you need to Shard your data. (That's what the big folks do, sooner or later.)

Magento 1.8: Lock wait timeout issues when customer is checking out

My website is experiencing issues at checkout. I'm using Magento Enterprise 1.8 and my checkout module is Idev's Onestepcheckout.
The issue we are seeing is that the eav_entity_store table is taking an exceedingly long time (up to 51 seconds) to return an order number to Mage_Eav_Model_Entity_Type.
What I do know is that the query run to get this is a transaction run as 'FOR UPDATE' so the row being accessed is locked until the transaction completes. I've looked at other parts of the code as well as the PHP code throughout the transaction where the row is locked (we're using InnoDB so the lock should be getting released once the transaction is committed) and I'm just not seeing anything there (or in the slow query logs) that should be causing a lock wait anywhere near 51 seconds.
I have considered that requests may be getting stacked up and slowly creeping up in time as they wait, but I'm seeing the query time go from 6 ms to 20,000 ms to 50,000 ms on consecutive requests. It isn't an issue of 100-200 requests stacked up, as there are only a few dozen of these a day.
I'm aware that MySQL uses parent locking, but there are no FKs related to this table whatsoever. There are two BTREE indexes that at one point were FKs but have since been altered (that happened years ago). For those who are un-Magento-savvy, the eav_entity_store table has fewer than 50 rows and is only 5 columns wide (4 smallint and a varchar). I seriously doubt table size or improper indexing is the culprit. In the spirit of TLDR, however, I will say that the two BTREE indexes are on the two columns by which we select from this table.
One possibility is that I may need to replace the two indexes with a compound index, as the ONLY reads to this table come from a query that filters on [column with index A] AND [column with index B]. I simply don't know whether row-level locking would prevent this query from accessing another row in the table with the indexes currently on it.
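For concreteness, what I have in mind is along these lines (column names assumed from the stock eav_entity_store definition, i.e. entity_type_id and store_id):
-- Compound index covering both columns the lookup query filters on.
ALTER TABLE eav_entity_store
  ADD INDEX idx_entity_type_store (entity_type_id, store_id);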
At this point, I've become convinced that the underlying issue is strictly DB related, but any Magento or MySql advice regarding this would be greatly appreciated. Anybody still actually reading this can hopefully appreciate that I have exhausted a number of options already and am seriously stumped here. Any info that you think may help is welcome. Thanks.
EDIT: The exact error we are seeing is:
Error message: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction
Issue solved. It wasn't a problem with MySQL. For some reason, generation of invoice numbers was taking an obscene amount of time. The company doesn't use invoices from Magento, so we turned them off. Problem solved. No full RCA was done on what specifically was wrong with invoice generation.

How to improve InnoDB's SELECT performance while INSERTing

We recently switched our tables to use InnoDB (from MyISAM) specifically so we could take advantage of the ability to make updates to our database while still allowing SELECT queries to occur (i.e. by not locking the entire table for each INSERT).
We have a cycle that runs weekly and INSERTS approximately 100 million rows using "INSERT INTO ... ON DUPLICATE KEY UPDATE ..."
We are fairly pleased with the current update performance of around 2000 insert/updates per second.
However, while this process is running, we have observed that regular queries take very long.
For example, this took about 5 minutes to execute:
SELECT itemid FROM items WHERE itemid = 950768
(When the INSERTs are not happening, the above query takes several milliseconds.)
Is there any way to force SELECT queries to take a higher priority? Otherwise, are there any parameters that I could change in the MySQL configuration that would improve the performance?
We would ideally perform these updates when traffic is low, but anything more than a couple seconds per SELECT query would seem to defeat the purpose of being able to simultaneously update and read from the database. I am looking for any suggestions.
We are using Amazon's RDS as our MySQL server.
Thanks!
I imagine you have already solved this nearly a year later :) but I thought I would chime in. According to MySQL's documentation on internal locking (as opposed to explicit, user-initiated locking):
Table updates are given higher priority than table retrievals. Therefore, when a lock is released, the lock is made available to the requests in the write lock queue and then to the requests in the read lock queue. This ensures that updates to a table are not “starved” even if there is heavy SELECT activity for the table. However, if you have many updates for a table, SELECT statements wait until there are no more updates.
So it sounds like your SELECT is getting queued up until your inserts/updates finish (or at least there's a pause.) Information on altering that priority can be found on MySQL's Table Locking Issues page.
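If you go down that route, note that these priority knobs only affect storage engines that take table-level locks (MyISAM, MEMORY); they don't change InnoDB's row locking. With that caveat, a sketch:
-- Make writes queue behind reads by default, server-wide.
SET GLOBAL low_priority_updates = 1;
-- Or raise priority for an individual read instead.
SELECT HIGH_PRIORITY itemid FROM items WHERE itemid = 950768;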

Updating large quantities of data in a production database

I have a large quantity of data in a production database that I want to update with batches of data while the data in the table is still available for end-user use. The updates could be insertions of new rows or updates of existing rows. The specific table is approximately 50M rows, and the updates will be between 100k and 1M rows per "batch". What I would like to do is an insert/replace with a low priority. In other words, I want the database to kind of slowly do the batch import without impacting the performance of other queries that are occurring concurrently against the same disk spindles. To complicate this, the update data is heavily indexed: 8 B-tree indexes across multiple columns to facilitate various lookups, which adds quite a bit of overhead to the import.
I've thought about breaking the inserts down into 1-2k record blocks, then having the external script that loads the data just pause for a couple of seconds between each insert, but that's really kind of hokey IMHO. Plus, during a 1M-record batch, I really don't want to add 500-1000 two-second pauses and 20-40 minutes of extra load time if it's not needed. Anyone have ideas on a better way to do this?
I've dealt with a similar scenario using InnoDB and hundreds of millions of rows. Batching with a throttling mechanism is the way to go if you want to minimize risk to end users. I'd experiment with different pause times and see what works for you. With small batches you have the benefit that you can adjust accordingly. You might find that you don't need any pause if you run this all sequentially. If your end users are using more connections then they'll naturally get more resources.
If you're using MyISAM there's a LOW_PRIORITY option for UPDATE. If you're using InnoDB with replication be sure to check that it's not getting too far behind because of the extra load. Apparently it runs in a single thread and that turned out to be the bottleneck for us. Consequently we programmed our throttling mechanism to just check how far behind replication was and pause as needed.
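The lag check itself is plain SQL, so the wrapper script only has to poll it before each batch and sleep while the value is too high (the threshold is yours to pick):
-- Seconds_Behind_Master in this output is the figure the throttle watches;
-- NULL means replication is not running.
SHOW SLAVE STATUS;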
An INSERT DELAYED might be what you need. From the linked documentation:
Each time that delayed_insert_limit rows are written, the handler checks whether any SELECT statements are still pending. If so, it permits these to execute before continuing.
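For what it's worth, DELAYED is just an extra keyword on the insert (the table and columns below are hypothetical), but it only works for table-locking engines such as MyISAM, MEMORY, and ARCHIVE, and it was deprecated in MySQL 5.6 and ignored from 5.7 onward, so check your version and engine first:
-- The server queues the row and returns to the client immediately.
INSERT DELAYED INTO import_queue (id, payload) VALUES (1, 'example row');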
Check this link: http://dev.mysql.com/doc/refman/5.0/en/server-status-variables.html What I would do is write a script that executes your batch updates when MySQL is showing Threads_running or Connections under a certain number. Hopefully you have some sort of test server where you can determine what a good threshold might be for either of those server variables. There are plenty of other server status variables to look at in there as well. Maybe control the executions by the Innodb_data_pending_writes number? Let us know what works for you, it's an interesting question!
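For example, the counters mentioned above can be read straight from SQL before each batch, and the loader can pause whenever they exceed whatever threshold your testing suggests:
SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW GLOBAL STATUS LIKE 'Innodb_data_pending_writes';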

Common-practice in dealing with high-load tables in MySQL

I have a table in MySQL 5 (InnoDB) that is used as a daemon processing queue, so it is accessed very often. It is typical to have around 250,000 records inserted per day. When I select records to be processed, they are read using a FOR UPDATE query to eliminate race conditions (everything is transaction-based).
Now I am developing a "queue archive" and I have stumbled into a serious deadlock problem. I need to delete "executed" records from the table as they are being processed (live), yet the table deadlocks every once in a while if I do so (two or three times per day).
I thought of moving towards delayed deletion (once per day, at low-load times), but this will not eliminate the problem, only make it less obvious.
Is there a common-practice in dealing with high-load tables in MySQL?
InnoDB locks all rows it examines, not only those requested.
See this question for more details.
You need to create an index that would exactly match your search condition to get rid of unnecessary locks, and make sure it is used.
Unfortunately, DML queries in MySQL do not accept hints.
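As a sketch of that advice (the process_queue table, its columns, and the LIMIT are all assumptions, since the schema isn't shown), give the picking query an index that covers its WHERE clause and ordering exactly, so the FOR UPDATE scan examines, and therefore locks, as few rows as possible:
ALTER TABLE process_queue
  ADD INDEX idx_state_created (state, created_at);
-- With that index, the picker walks only the matching index entries
-- instead of scanning (and locking) unrelated rows.
SELECT id
  FROM process_queue
 WHERE state = 'pending'
 ORDER BY created_at
 LIMIT 10
 FOR UPDATE;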