Best Way to Optimize Inserting/Updating a Large Number of Rows Against Large Tables - mysql

I will probably cross-post this but I assume it's a common problem - didn't find anything quite right in my search.
My task is this --- I have a data source streaming "updates" and "new tickets" into a MySQL table of tickets. Each ticket has a unique Ticket_ID.
Essentially, I need to take these updates and do an insert/update against the table of tickets. Simple enough.
The problem is that every day there are 3,000 rows to insert/update against this ever-growing list, which is currently at 300,000 rows and can probably expand to 1 million.
Luckily, after 30 days I know a ticket cannot be updated again, so I can perhaps move it to an archive. Still, that's 3,000 rows a day against a month of data, which is usually 90,000 rows. That simply takes a lot of time. I haven't measured the exact figure, but it's maybe 30 minutes to an hour, possibly longer.
How do I optimize this insert/update process, which by definition has to look up each incoming Ticket_ID against the database of existing Ticket_IDs?
Is the insert/update process simply the best there is?
The main problem is that only one insert/update process can run at a time --- as opposed to having 10+ insert-only processes running simultaneously followed by a single delete-duplicates pass.
Or is there a creative way --- like finding ticket_ids where count(*) > 1 and deleting the one with the older timestamp --- that could save me some time?
I know this is a complicated question but it seems like a common problem. Thanks.
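A common pattern for this (a sketch only; the table and column names below are hypothetical, not from the question) is to put a PRIMARY or UNIQUE key on Ticket_ID and feed the incoming rows in batches through INSERT ... ON DUPLICATE KEY UPDATE, so the lookup of each Ticket_ID becomes a single index probe inside the statement rather than a separate SELECT:

-- tickets(ticket_id PRIMARY KEY, status, updated_at) is an assumed schema
INSERT INTO tickets (ticket_id, status, updated_at)
VALUES
  (101, 'open',   '2024-01-05 10:00:00'),
  (102, 'closed', '2024-01-05 10:01:00'),
  (103, 'open',   '2024-01-05 10:02:00')
ON DUPLICATE KEY UPDATE
  status     = VALUES(status),
  updated_at = VALUES(updated_at);

With 3,000 rows a day split into batches of a few hundred like this, the load should take seconds rather than minutes, provided the key on Ticket_ID exists.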

Related

Optimizing MySQL update query with two tables

I have 14 million rows and 20 columns in a table named weather and 1,900 rows and 15 columns in a table named incident on a MySQL server. I am trying to set the active column in weather to 1 where the weather date column is between the start and end date columns of the incident table and where the weather location column is equal to the incident location column. I have the following query and I am not sure it is the most efficient way to do it. It has currently been running for an hour on an AWS RDS db.m5.4xlarge instance (16 vCPU and 64 GB RAM) and is only using 8% CPU according to the AWS Console.
UPDATE dev.weather, dev.incident
SET weather.active = 1
WHERE weather.location = incident.location AND weather.DATE BETWEEN dev.incident.start_date AND dev.incident.end_date
Is there a better way to accomplish this?
By the time we come up with a satisfactory solution, your query will be finished. But here are some thoughts on it.
UPDATE, especially if lots of rows are modified, is very time-consuming. (This is because of the need to save old rows in case of rollback.)
Without seeing the indexes, I cannot advise completely.
This is a one-time query, correct? Future "incidents" will do the update as the incident is stored, correct? That will probably run reasonably fast.
Given that you have a way to update for a single incident, use that as the basis for doing the initial UPDATE (the one you are asking about now). That is, write a special, one-time program to run through the 1900 incidents, performing the necessary UPDATE for each. (Advantage: only one UPDATE statement ever needs to be written.)
Be sure to COMMIT after each Update. (Or run with autocommit=ON.) Else the 1900 updates will be a big burden on the system, perhaps worse than the single-Update that started this discussion.
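A rough sketch of that one-time pass (the index, the incident primary key id, and the value 42 are assumptions, not from the question): add an index the per-incident UPDATE can use, then run one UPDATE per incident, committing as you go:

ALTER TABLE dev.weather ADD INDEX idx_location_date (location, DATE);

-- repeated for each of the 1900 incidents; 42 stands in for one incident's id
UPDATE dev.weather w
JOIN dev.incident i
  ON w.location = i.location
 AND w.DATE BETWEEN i.start_date AND i.end_date
SET w.active = 1
WHERE i.id = 42;
COMMIT;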

Running a cron to update 1 million records every hour fails

We have an e-commerce system with more than 1 million users and a total of 4 to 5 million records in the order table. We use the CodeIgniter framework as the back end and MySQL as the database.
Because of this large number of users and purchases, we use cron jobs to update the order details and referral bonus points every hour to keep things working.
Now these updates take more than one hour, and the next batch of updates arrives before the previous one finishes, leading to deadlocks and failure of the system.
I'd like to know about possible architectural and database scaling options and suggestions to get out of this situation. We are currently running this application as a monolith.
Don't use cron. Have a single process that starts over when it finishes. If one pass lasts more than an hour, the next one will start late. (Checking PROCESSLIST is clumsy and error-prone. OTOH, this continually-running approach needs a "keep-alive" cronjob.)
Don't UPDATE millions of rows. Instead, find a way to put the desired info in a separate table that the user joins to. Presumably, that extra table would have only 1 row (if everyone is controlled by the same game) or a small number of rows (if there are only a small number of patterns to handle).
Do have the slowlog turned on, with a small value for long_query_time (possibly "1.0", maybe lower). Use pt-query-digest to summarize it to find the "worst" queries. Then we can help you make them take less time, thereby helping to calm your busy system and improve the 'user experience'.
Do use batched INSERTs. (A single INSERT with 100 rows runs about 10 times as fast as 100 single-row INSERTs.) Batching UPDATEs is tricky, but it can be done with IODKU (INSERT ... ON DUPLICATE KEY UPDATE); see the sketch after this list.
Do use batches of 100-1000 rows. (This is somewhat optimal considering the various things that can happen.)
Do use transactions judiciously. Do check for errors (including deadlocks) at every step.
Do tell us what you are doing in the hourly update. We might be able to provide more targeted advice than that 15-year-old book.
Do realize that you have scaled beyond the capabilities of the typical 3rd-party package. That is, you will have to learn the details of SQL.
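For example, a batched IODKU (the referral_bonus table and its columns are hypothetical, just to show the shape of the statement) lets one statement apply hundreds of point adjustments at once:

-- referral_bonus(user_id PRIMARY KEY, points) is an assumed table
INSERT INTO referral_bonus (user_id, points)
VALUES (1, 10), (2, 25), (3, 5)     -- extend to 100-1000 rows per batch
ON DUPLICATE KEY UPDATE points = points + VALUES(points);

Each batch is a single statement and a single round trip, which is largely where the speedup over row-by-row statements comes from.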
I have some ideas here for you - mixed up with some questions.
Assuming you are limited in what you can do (i.e. you can't re-architect your way out of this) and that the database can't be tuned further:
Make the list of records to be processed as small as possible
i.e. Does the job have to run over all records? These 4-5 million records - are they all active orders, or is that how many you have in total for all time? Obviously, just process the bare minimum.
Split and parallel process
You mentioned "batches" but never explained what that meant - can you elaborate?
Can you get multiple instances of the cron job to run at once, each covering a different segment of the records?
Multi-Record Operations
The easy (lazy) way to program updates is to do it in a loop that iterates through each record and processes it individually, but relational databases can update many records at once in a single statement - these are usually called set-based operations (a sketch follows below). Are you processing each row individually or doing multi-record updates?
How does the cron job query the database? Have you hand-crafted the most efficient queries possible, or are you using some ORM / framework to do stuff for you?
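To illustrate the difference (the orders and pending_settlements tables below are made up for the example): a loop issues one statement per record, while a set-based UPDATE covers the whole batch in one statement:

-- per-row loop (slow): executed once per order from application code
-- UPDATE orders SET status = 'settled' WHERE id = ?;

-- set-based form: one statement updates every order that has a pending settlement
UPDATE orders o
JOIN pending_settlements p ON p.order_id = o.id
SET o.status = 'settled';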

How to minimize the performance hit of deleting lots of rows from a highly active table

I have a table which has tens of thousands of new rows added per hour.
Based on certain events, I mark a given row as complete by setting its status field to 1 and updating its status_timestamp; then, when querying the table with SELECTs, I ignore all rows with a status of 1.
But this leaves a huge number of rows that I no longer need, all with a status of 1. I may still need them at a later point for logging purposes, but for the everyday purposes of my application such rows aren't needed.
I could delete the row instead of updating the field to 1 but I figure a delete is more costly than an update and many inserts are happening per second.
Ultimately I would like a way to move all the rows with status 1 into some kind of log table, without impacting the current table, which has many inserts and updates happening per second.
This is a difficult question to answer. Indeed, updates (on a non-indexed field) should be faster than deletes. In a simple environment, you would do the delete, along with a trigger that logged the information that you wanted.
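A minimal sketch of such a trigger, assuming a table named tasks and a matching tasks_log table (both names and their columns are hypothetical):

DELIMITER //
CREATE TRIGGER tasks_log_on_delete
BEFORE DELETE ON tasks
FOR EACH ROW
BEGIN
  -- copy the row being deleted into the log table
  INSERT INTO tasks_log (id, payload, status, status_timestamp, deleted_at)
  VALUES (OLD.id, OLD.payload, OLD.status, OLD.status_timestamp, NOW());
END //
DELIMITER ;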
I find it hard to believe that there is no downtime for the database. Can't you do the deletes at 2:00 a.m. once per week on Sunday, in some time zone?
Normally if this isn't the case, then you have high-availability requirements. And, in such a circumstance, you would have a replicated database. Most times, inserts, updates, and queries would go to both databases. During database maintenance periods, only one might be up while the other "does maintenance". Then that database catches up with the transactions, takes over the user load, and the other database "does maintenance". In your case "does maintenance" means doing the delete and logging.
If you have a high availability requirement and you are not using replication of some sort, your system has bigger vulnerabilities than simply accumulating to-be-deleted data.

What's the best way to create summary records from detail records with MySQL?

I have maybe 10 to 20 million detail records coming in per day (statistical and performance data) that must be read in and summarized into 24 hourly summary records and 1 daily summary record.
The process calculates averages on several fields and gets the max and min values of others; nothing significant CPU-wise.
Is it better to:
A) summarize the detail records into the summary records while the records are coming in, delaying each detail record insert slightly? I assume there would be a lot of locking (SELECT ... FOR UPDATE, etc.) on the summary tables, as there are several different systems importing data.
B) wait until the hour is over, and then select the entire previous hour's data and create the summary records? There would be a delay before users see the statistics, but the detail records would be available in the meantime.
Perhaps there are alternative methods to this?
Just create views for the summary tables. All your inserts will work as usual; just create views according to your summary needs, and they will update automatically with the main tables.
You can also build the 24 hourly summaries and the 1 daily summary this way. Views are stored queries that produce a result set when invoked; a view acts as a virtual table.
For more details about views, see: http://dev.mysql.com/doc/refman/5.0/en/create-view.html
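A minimal sketch of such a view, assuming a detail table metrics(device_id, value, recorded_at) (the names are illustrative, not from the question):

CREATE VIEW hourly_summary AS
SELECT device_id,
       DATE(recorded_at) AS day,
       HOUR(recorded_at) AS hr,
       AVG(value)  AS avg_value,
       MIN(value)  AS min_value,
       MAX(value)  AS max_value,
       COUNT(*)    AS sample_count
FROM metrics
GROUP BY device_id, DATE(recorded_at), HOUR(recorded_at);

Keep in mind that a MySQL view is not materialized: every SELECT against it re-runs the aggregation over the detail rows, which may be slow at 10 to 20 million rows per day.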
Let me know if you want further assistance regarding mysql views.
It'd depend on the load required to run the single summary update, but I'd probably go with a separate summary run. I'd put a small bet on a single periodic update taking less total time than the cumulative on-every-insert approach.
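A sketch of such a separate summary run (reusing the hypothetical metrics table from above plus a materialized hourly_summary_rows table; the names and the example hour are illustrative): after each hour closes, one INSERT ... SELECT writes that hour's aggregates:

-- hourly_summary_rows(device_id, hour_start, avg_value, min_value, max_value, sample_count)
INSERT INTO hourly_summary_rows
  (device_id, hour_start, avg_value, min_value, max_value, sample_count)
SELECT device_id,
       '2024-01-05 14:00:00',          -- the hour that just finished
       AVG(value), MIN(value), MAX(value), COUNT(*)
FROM metrics
WHERE recorded_at >= '2024-01-05 14:00:00'
  AND recorded_at <  '2024-01-05 15:00:00'
GROUP BY device_id;

The daily record can then be built the same way from the 24 hourly rows.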

Strategy to handle large datasets in a heavily inserted into table

I have a web application that has a MySql database with a device_status table that looks something like this...
deviceid | ... various status cols ... | created
This table gets inserted into many times a day: 2,000+ rows per device per day, and we estimate having 100+ devices by the end of the year.
Basically this table gets a record when just about anything happens on the device.
My question is how should I deal with a table that is going to grow very large very quickly?
Should I just relax and hope the database will be fine in a few months when this table has over 10 million rows, and then in a year when it has 100 million rows? This is the simplest option, but it seems like a table that large would have terrible performance.
Should I archive older data after some time period (a month, a week) and then have the web app query the live table for recent reports and query both the live and archive tables for reports covering a larger time span?
Should I have an hourly and/or daily aggregate table that sums up the various statuses for a device? If I do this, what's the best way to trigger the aggregation? Cron? DB Trigger? Also I would probably still need to archive.
There must be a more elegant solution to handling this type of data.
I had a similar issue in tracking the number of views seen for advertisers on my site. Initially I was inserting a new row for each view, and as you predict here, that quickly led to the table growing unreasonably large (to the point that it was indeed causing performance issues which ultimately led to my hosting company shutting down the site for a few hours until I had addressed the issue).
The solution I went with is similar to your #3 solution. Instead of inserting a new record when a new view occurs, I update the existing record for the timeframe in question. In my case, I went with daily records for each ad. What timeframe to use for your app would depend entirely on the specifics of your data and your needs.
Unless you need to specifically track each occurrence over the last hour, you might be overdoing it by even storing them and aggregating later. Instead of bothering with a cron job to perform regular aggregation, you could simply check for an entry with matching specs. If you find one, you update a count field of that row instead of inserting a new row.
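A minimal sketch of that check-and-update, assuming a daily counter table ad_views(ad_id, view_date, views) with a primary key on (ad_id, view_date) (names are illustrative): MySQL can do the "insert or bump the counter" in one statement:

INSERT INTO ad_views (ad_id, view_date, views)
VALUES (123, CURDATE(), 1)               -- 123 is a placeholder ad id
ON DUPLICATE KEY UPDATE views = views + 1;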