How to achieve zero downtime in ETL - MySQL

I have an ETL process which takes data from a transaction DB and, after processing, stores the data in another DB. While storing, we truncate the old data and insert the new data for better performance, since updates take far longer than a truncate + insert. As a result, for a short time (around 2-3 minutes) queries see counts of 0 or wrong data. We run the ETL every 8 hours.
So how can we avoid this problem? How can we achieve zero downtime?

One way we used in the past was to prepare the production data in a table named temp. Then, when finished (and checked, which was the lengthy part of our process), drop prod and rename temp to prod.
This takes almost no time, and the process succeeded even when other users were holding locks on the table.
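In MySQL specifically, the final swap can be done atomically with RENAME TABLE, so readers never see an empty or half-loaded table. A minimal sketch, using placeholder names prod, prod_new and prod_old:
-- Build and verify prod_new first, then swap it in as a single atomic operation.
RENAME TABLE prod TO prod_old,
             prod_new TO prod;
-- Drop (or keep as a backup) the previous copy once the swap has succeeded.
DROP TABLE prod_old;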

Related

Enable dirty reads in MySQL

I have a script that runs daily on a table that deletes the last 30 days' worth of data and inserts new data in its place.
Sometimes someone is querying the table when that script begins, which causes the script to fail because I cannot delete data while someone is querying it, so we get lock timeouts.
I don't really care if we have dirty reads, I just need to make sure that the daily process runs, irrespective of what other users are doing.
Is there a way in MySQL to ensure that my DELETE + INSERT happens, irrespective of whatever queries are running on that table?
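Two things are often combined here; this is a sketch only, with hypothetical names (daily_data, load_date) and an arbitrary batch size. Readers that genuinely tolerate dirty reads can run at READ UNCOMMITTED, and the writer can delete the 30-day window in small chunks so each statement holds locks only briefly:
-- Reader side: allow dirty reads for sessions that can live with them.
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- Writer side: give the session a longer lock wait and delete in chunks,
-- repeating the DELETE until ROW_COUNT() returns 0.
SET SESSION innodb_lock_wait_timeout = 300;
DELETE FROM daily_data
WHERE load_date >= CURDATE() - INTERVAL 30 DAY
LIMIT 5000;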

Table Renaming in an Explicit Transaction

I am extracting a subset of data from a backend system to load into a SQL table for querying by a number of local systems. I do not expect the dataset to ever be very large - no more than a few thousand records. The extract will run every two minutes on a SQL2008 server. The local systems are in use 24 x 7.
In my prototype, I extract the data into a staging table, then drop the live table and rename the staging table to become the live table in an explicit transaction.
-- Build the staging copy outside the transaction
SELECT fieldlist
INTO Temp_MyTable_Staging
FROM FOOBAR;

-- Swap it in: drop the live table and rename the staging table
BEGIN TRANSACTION
IF (OBJECT_ID('dbo.MyTable') IS NOT NULL)
    DROP TABLE MyTable;
EXECUTE sp_rename N'dbo.Temp_MyTable_Staging', N'MyTable';
COMMIT
I have found lots of posts on the theory of transactions and locks, but none that explain what actually happens if a scheduled job tries to query the table in the few milliseconds while the drop/rename executes. Does the scheduled job just wait a few moments, or does it terminate?
Conversely, what happens if the rename starts while a scheduled job is selecting from the live table? Does the transaction fail to get a lock and therefore terminate?
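As I understand it, this comes down to schema locks: the DROP/sp_rename needs a schema modification lock, so it waits for in-flight SELECTs to finish, and a query that arrives mid-swap waits on that lock (indefinitely by default) rather than terminating, then runs against the renamed table. A hedged illustration of bounding that wait on the reader side, with an arbitrary 5-second limit:
-- Reader side (illustrative): cap how long a query waits on the schema lock
-- held by the swap transaction; the default LOCK_TIMEOUT of -1 waits forever.
SET LOCK_TIMEOUT 5000;   -- milliseconds; the query errors out if exceeded
SELECT fieldlist FROM dbo.MyTable;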

Preventing Database Locking during updates

I am pulling a large amount of data (700k lines) from one database (Source Database), performing some data manipulation, then inserting/updating another database (Lookup Database) with the altered data.
The lookup database is being accessed by users regularly throughout the day. They are only performing select statements. The updates to the lookup database occur hourly. As the size of the data in the source database increases, I'm noticing many more "lock wait timeout exceeded" errors when updating the lookup database. Is my assumption correct that this most likely results from the select statements and update statements accessing the same data? That would make sense, as the larger number of updates would more frequently hit data that is being accessed by the users.
In my attempts to fix the situation I've increased the lock wait timeout to 120 (up from 60), but that has had very little effect.
One thought I had was to update into a new table in the lookup database, then swap (in the software the users are using) the database their queries go to. Any other ideas on how I might resolve the issue?
I'm not sure if it's relevant but I'm using MySQL and the two databases are on completely separate servers.
Thanks!
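One lighter-weight option, sketched here with hypothetical table and column names (lookup_item, lookup_item_delta), is to stage the transformed rows and then apply them in primary-key ranges, so each transaction locks only a small slice of the lookup table at a time instead of holding locks across 700k rows:
-- Each batch is its own short transaction, so readers wait briefly at most.
START TRANSACTION;
UPDATE lookup_item l
JOIN   lookup_item_delta d ON d.id = l.id
SET    l.value = d.value
WHERE  l.id BETWEEN 1 AND 50000;   -- advance this range for each batch
COMMIT;
If a full refresh turns out to be simpler than batching, the build-then-RENAME swap described at the top of this page also avoids the application-level switch entirely.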

How to create summary tables efficiently

Over the course of a month, a process inserts a large number of rows (~1M) into some database tables.
This happens daily and the whole process lasts ~40 minutes. That is fine.
I created some "summary tables" from these inserts so as to query the data fast. This works fine.
Problem: I keep inserting data into the summary tables, so the time to build the summary tables tracks the process that inserts the actual data, and this is good. But if data inserted on previous days has changed (due to updates), I would need to "recalculate" those days. To handle that, instead of creating only today's summary data each day, I would have to change my process to recreate the summary data from the beginning of each month, which would increase my running time substantially.
Is there a standard way to deal with this problem?
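One common pattern, sketched here with hypothetical names (events, daily_summary, an updated_at column maintained by the loader, and @last_run), is to rebuild only the days whose underlying rows changed since the last run rather than the whole month:
-- 1. Collect the days touched since the last summary run.
CREATE TEMPORARY TABLE changed_days AS
SELECT DISTINCT DATE(event_date) AS day
FROM events
WHERE updated_at >= @last_run;

-- 2. Recompute only those days; a unique key on summary_date turns the insert into an upsert.
INSERT INTO daily_summary (summary_date, total)
SELECT DATE(e.event_date), SUM(e.amount)
FROM events e
JOIN changed_days c ON c.day = DATE(e.event_date)
GROUP BY DATE(e.event_date)
ON DUPLICATE KEY UPDATE total = VALUES(total);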
We had a similar problem in our system, which we solved by generating a summary table holding each day's summary.
Whenever an UPDATE/INSERT changes the base tables, the summary table is updated. This of course slows down those operations, but it keeps the summary table completely up to date.
This can be done using TRIGGERs, but as the operations are in one place, we just do it manually in a TRANSACTION.
One advantage of this approach is that there is no need to run a cron job to refresh/create the summary table.
I understand that this may not be applicable/feasible for your situation.
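A minimal sketch of the trigger variant, assuming hypothetical events and daily_summary tables with a unique key on summary_date (the same statement can just as well be run by hand inside the loading transaction):
CREATE TRIGGER events_after_insert
AFTER INSERT ON events
FOR EACH ROW
  INSERT INTO daily_summary (summary_date, total)
  VALUES (DATE(NEW.event_date), NEW.amount)
  ON DUPLICATE KEY UPDATE total = total + NEW.amount;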

Convert Legacy Text Databases to SQL

At my office we have a legacy accounting system that stores all of its data in plaintext files (TXT extension) with fixed-width records. Each data file is named e.g., FILESALE.TXT. My goal is to bring this data into our MySQL server for read-only usage by many other programs that can't interface with the legacy software. Each file is essentially one table.
There are about 20 files in total that I need to access, roughly 1 GB of total data. Each line might be 350-400 characters wide and have 30-40 columns. After pulling the data in, no MySQL table is much bigger than 100 MB.
The legacy accounting system can modify any row in the text file, delete old rows (it has a deleted record marker -- 0x7F), and add new rows at any time.
For several years now I have been running a cron job every 5 minutes that:
Checks each data file for its last modification time.
If the file is not modified, skips it. Otherwise:
Parses the data file, cleans up any issues (very simple checks only), and spits out a tab-delimited file of the columns I need (some of the columns I just ignore).
TRUNCATEs the table and imports the new data into our MySQL server like this:
START TRANSACTION;
-- Note: TRUNCATE is DDL in MySQL and causes an implicit commit,
-- so the transaction does not actually make the swap atomic.
TRUNCATE legacy_sales;
LOAD DATA INFILE '/tmp/filesale.data' INTO TABLE legacy_sales;
COMMIT;
The cron script runs each file check and parse in parallel, so the whole updating process doesn't really take very long. The biggest table (changed infrequently) takes ~30 seconds to update, but most of the tables take less than 5 seconds.
This has been working ok, but there are some issues. I guess it messes with database caching, so each time I have to TRUNCATE and LOAD a table, other programs that use the MySQL database are slow at first. Additionally, when I switched to running the updates in parallel, the database can be in a slightly inconsistent state for a few seconds.
This whole process seems horribly inefficient! Is there a better way to approach this problem? Any thoughts on optimizations or procedures that might be worth investigating? Any neat tricks from anyone who faced a similar situation?
Thanks!
A couple of ideas:
If the rows in the text files have a modification timestamp, you could update your script to keep track of when it runs, and then only process the records that have been modified since the last run.
If the rows in the text files have a field that can act as a primary key, you could maintain a fingerprint cache for each row, keyed by that id. Use this to detect when a row changes and skip unchanged rows. That is, in the loop that reads the text file, calculate the SHA1 (or whatever) hash of the whole row, and then compare it to the hash from your cache. If they match, the row hasn't changed, so skip it. Otherwise, update/insert the MySQL record and then store the new hash value in the cache. The cache could be a GDBM file, a memcached server, a fingerprint field in your MySQL tables, whatever. This leaves unchanged rows untouched (and thus still cached) on MySQL.
Perform updates inside a transaction to avoid inconsistencies.
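A rough illustration of the fingerprint idea kept entirely inside MySQL; all table and column names here are hypothetical, and it assumes the parsing script emits an id, the wanted columns and a per-row hash:
-- Load the parsed file into a staging table of the same shape.
LOAD DATA INFILE '/tmp/filesale.data'
INTO TABLE legacy_sales_staging (id, amount, row_hash);

-- Upsert only rows that are new or whose fingerprint changed;
-- untouched rows keep their pages warm in the buffer pool.
INSERT INTO legacy_sales (id, amount, row_hash)
SELECT s.id, s.amount, s.row_hash
FROM legacy_sales_staging s
LEFT JOIN legacy_sales t ON t.id = s.id
WHERE t.id IS NULL OR t.row_hash <> s.row_hash
ON DUPLICATE KEY UPDATE amount = VALUES(amount), row_hash = VALUES(row_hash);

-- Remove rows the legacy system has deleted since the last extract.
DELETE t FROM legacy_sales t
LEFT JOIN legacy_sales_staging s ON s.id = t.id
WHERE s.id IS NULL;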
Two things come to mind and I won't go into too much detail but feel free to ask questions:
A service that offloads the processing of the file to an application server and then just populates the MySQL table. You can even build in intelligence by checking for duplicate records, rather than truncating the entire table.
Offload the processing to another MySQL server and replicate / transfer it over.
I agree with alex's tips. If you can, update only modified fields, and do mass updates with transactions and multiple inserts grouped. An additional benefit of transactions is faster updates.
If you are concerned about downtime, instead of truncating the table, insert into a new table, then rename it.
For improved performance, make sure you have proper indexing on the fields.
Look at database-specific performance tips, such as:
- delayed_inserts in MySQL improve performance
- caches can be optimized
- even if you do not have unique rows, you may (or may not) be able to MD5 the rows