I have a process which deletes 30 days worth of data from a table every day & replaces it with the newer data.
Sometimes, someone will be running a query when this deletion begins, which causes the delete to fail because the table is locked for read.
Is there a way around this? I don't care if the read shows some data that was subsequently deleted - the more important process is the delete
Thanks
I have an ETL process which takes data from transaction db and keeps after processing stores the data to another DB. While storing the data we are truncating the old data and storing new data to have better performance, as update takes a lot of time than truncate insert. So in this process we experience counts as 0 or wrong data for some time (like for 2 3 mins). We are running the ETL in every 8 hours.
So how can we avoid this problem? How can we achieve zero downtime?
One way we did use in the past was to prepare the prod data in a table named temp. Then when finished (and checked, that was the lengthy part in our process), drop prod and rename the temp in prod.
Takes almost no time, and the process was successful even in case some other users were locking the table.
During a month a process inserts a large number of rows in some database tables ~1M.
This happens daily and the whole process lasts ~40mins. That is fine.
I created some "summary tables" from these inserts so as to query the data fast. This works fine.
Problem: I keep inserting data in the summary tables and so the time to create the cache table matches the process to insert the actual data and this is good. But if the data inserted in the previous days have changed (due to any updates) then I would need to "recalculate" the previous days and to solve this instead of creating today's summary data daily I would need to change my process to recreate the summary data from the beginning of each month which would mean my running time would increase substantially.
Is there a standard way to deal with this problem?
We had a similar problem in our system, which we solved by generating a summary table holding each day's summary.
Whenever an UPDATE/INSERT changes the base tables the summary table is updated.. this will of course slow down these operations but keeps the summary table completely up to date.
This can be done using TRIGGERs, but as the operations are in one place, we just do it manually in a TRANSACTION.
One advantage of this approach is that there is no need to run a cron job to refresh/create the summary table.
I understand that this may not be applicable/feasible for your situation.
At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1gb of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100mb.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for last modification time.
If the file is not modified, skip it. Otherwise:
Parse the data file, clean up any issues (very simple checks only), and spit out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATE the table and imports the new data into our MySQL server like this:
START TRANSACTION;
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
Couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes, and skip unchanged rows. I.e., in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare that to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and the store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This will leave unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the mySQL table, you can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another mysql server and replicate / transfer it over.
I agree with alex's tips. If you can, update only modified fields and mass update with transactions and multiple inserts grouped. an additional benefit of transactions is faster updat
if you are concerned about down time, instead of truncating the table, insert into a new table. then rename it.
for improved performance, make sure you have proper indexing on the fields.
look at database specific performance tips such as
_ delayed_inserts in mysql improve performance
_ caches can be optimized
_ even if you do not have unique rows, you may (or may not) be able to md5 the rows
Currently i am using cron for this. I thought perhaps it is possible to implement some procedure that will remove all data from database that is older than one month, but i am not sure that this is the best way.
Problem is that we have many servers with many cron processes, that are controlled by very small amount of stuff, and we need to make it clear and easy-to-manage, that's why i don't want to have such cron process.
Data in table i want to delete - statistics, huge amount of this data is inserted every day, and if it will not be deleted - database will be unbeliaveable huge (about ~500M every day, for us it's quite big amount, 500M * 365 days is 182,5G per year)
Is it possible to delete data using some procedure in mysql (perhaps after new row is added) / and is that a good idea?
If you're intending on moving away from cron jobs, you could always create an event that runs at a scheduled frequency.
Whatever you do, it's a very bad idea to delete data every time a new row is added, as it'll slow down your insert and it's more likely to fragment your tables.