How to do audit when using Spring batch as ETL [closed] - mysql

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 3 years ago.
I have a requirement to use Spring Batch as an ETL tool to migrate data from one set of tables in a source database (MySQL) to another set of tables in a destination database (MySQL). The schema of the destination tables is different from the schema of the source tables, so I'm using a processor to transform the data to match the destination schema.
I need to do this migration block by block, i.e., a set of records at a time, on demand (not all at once).
I have a few concerns to take care of:
1) Audit (Make sure all the data is migrated)
2) Rollback & Retry (In case of error)
3) Error handling
4) How to keep up with new data arriving in the source table while the migration is happening (no downtime)
Below are my thoughts on the same.
I will generate a random ID that is unique for each job (maybe a UUID per job) and write it into the destination table (a column in every row) while migrating.
1) Audit: My thought is to keep a count of the records I'm reading and then compare it with the row count of the destination table once the migration is done.
2) Rollback & Retry: If the record count doesn't match in the audit check, I will delete all the rows with that batch UUID and then initiate the batch job again.
3) Error handling: I'm not sure what other cases I need to be aware of, so I'm thinking of just logging the errors.
4) Delta changes: I'm thinking of running the batch job again and again to look for changes (using the created_at and updated_at column values) until 0 records are found.
I want to understand whether any of the above steps can be done in a better way. Please suggest.

You might need to spend some more time reviewing Spring Batch as it already takes care of most of these things for you.
You can already run a single Spring Batch job and set it to do the processing in chunks, the size of which you configure when you set up the job.
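For reference, a chunk-oriented step in Java config could look roughly like the sketch below. This assumes Spring Batch 4-style builders; SourceRow, DestinationRow, and the reader/processor/writer bean names are made up for your source record, destination record, and components:

@Bean
public Step migrateBlockStep(StepBuilderFactory steps,
                             ItemReader<SourceRow> sourceReader,
                             ItemProcessor<SourceRow, DestinationRow> schemaMapper,
                             ItemWriter<DestinationRow> destinationWriter) {
    // Read/transform/write in chunks of 1000; each chunk is one transaction,
    // and read/write/skip counts are recorded in the BATCH_STEP_EXECUTION table.
    return steps.get("migrateBlockStep")
            .<SourceRow, DestinationRow>chunk(1000)
            .reader(sourceReader)
            .processor(schemaMapper)
            .writer(destinationWriter)
            .build();
}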
Auditing is already covered. The batch metadata tables (and the Spring Batch admin interface) keep counts of all the records read and written for each job instance, as well as the failure count (if you configured it to suppress exceptions).
Spring Batch jobs already have retry and auto-recover logic based on the aforementioned tables, which track how many of the input records have already completed. This is not an option when using a database as an input source, though. You would need a way in your table setup to identify the records already completed, and/or use a uniqueness constraint in the destination database so it cannot re-write duplicate records. Another option is to have your job's first step read the records into a flat file, and then read from that file in the next step. This lets the Spring Batch auto-recovery logic work when the job is restarted.
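As a rough illustration of that two-step layout (the step and bean names are made up), the job definition itself stays small; if a failed run is restarted with the same job parameters, Spring Batch resumes from the failed step rather than re-running the one that already completed:

@Bean
public Job migrationJob(JobBuilderFactory jobs,
                        Step extractToFileStep,
                        Step loadFromFileStep) {
    return jobs.get("migrationJob")
            .start(extractToFileStep)   // step 1: dump the source block to a flat file
            .next(loadFromFileStep)     // step 2: chunked, restartable load into the destination
            .build();
}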
Error handling is already covered as well. Any failure will stop the job so no data is lost, and it will roll back the data in the chunk it is currently processing. You can set it to ignore (suppress) specific exceptions if there are specific failures you want it to keep running through, and you can set limits on how many of each exception type to allow. And of course you can log failure details so you can look them up later.
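For example, the step sketched earlier could be made fault-tolerant along these lines; the exception types and limits are placeholders you would tune for your own failure modes:

return steps.get("migrateBlockStep")
        .<SourceRow, DestinationRow>chunk(1000)
        .reader(sourceReader)
        .processor(schemaMapper)
        .writer(destinationWriter)
        .faultTolerant()
        .skip(DataIntegrityViolationException.class)   // tolerate rows that violate destination constraints...
        .skipLimit(10)                                  // ...but only up to 10 of them per step execution
        .retry(TransientDataAccessException.class)      // retry transient database errors...
        .retryLimit(3)                                   // ...a few times before failing the step
        .build();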
As mentioned before, you would need a value or trigger on your source query to identify which records you have already processed, which will allow you to keep running the chunk queries to pick up new records.
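One hedged sketch of such an incremental read, assuming the source table has an updated_at column and that the previous run's timestamp is passed in as a job parameter (the table, column, and bean names are illustrative):

@Bean
@StepScope
public JdbcCursorItemReader<SourceRow> sourceReader(
        DataSource sourceDataSource,
        @Value("#{jobParameters['lastRunTimestamp']}") Date lastRun) {
    return new JdbcCursorItemReaderBuilder<SourceRow>()
            .name("sourceReader")
            .dataSource(sourceDataSource)
            // Only pick up rows created or updated since the previous run.
            .sql("SELECT * FROM source_table WHERE updated_at > ? ORDER BY updated_at")
            .preparedStatementSetter(new ArgumentPreparedStatementSetter(new Object[]{lastRun}))
            .rowMapper(new BeanPropertyRowMapper<>(SourceRow.class))
            .build();
}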

Related

SQL - Is there a better way to handle many transactions on one table? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
So say you have a single SQL table in some relational DB (SQL Server, MySQL, whatever). Say you have 500 tasks that run every 15 minutes. Each task will delete a portion of data related to that task and insert new data related to that task, and then an external source will run some selects for data related to that task.
From my experience this inevitably leads to deadlocks, timeouts and all around sub-par performance for selects, even when doing dirty reads.
You can try to stagger the start time of the tasks as best you can but it doesn't really solve the issue. There are so many tasks that there will always be overlap.
You can try upgrading the server with a better CPU to handle the connections but this is very costly for just 500 tasks imo.
So what I've done is duplicate the table's schema and give every task its own distinct table with that schema.
Also, when constructing new data for the table, the task just inserts into a new table and then flips the name of the current table with that one, i.e.:
CREATE TABLE Task001_Table_Inactive LIKE Task001_Table_Active;
-- bulk insert the fresh data into Task001_Table_Inactive
DROP TABLE Task001_Table_Active;
RENAME TABLE Task001_Table_Inactive TO Task001_Table_Active;
The advantages:
-Fast processing. SQL has NEVER been good at constant deletes. The ability to just bulk insert and flip the name has drastically reduced the processing time for the task.
-Scalable. Now that all these inserts and deletes aren't constantly fighting over one table, I can run many tasks on one very cheap machine in the cloud.
-Anti-fragmentation. Since the table is recreated every time, the fragmentation issues that plague systems with constant deletes are no longer a problem.
-Selects are as quick as possible WITHOUT the need for dirty reads. Since the insertion is done in a separate table, the select statements being done by the external source will be as quick as they can be, without the need to do a dirty read. This is the biggest advantage imo!
-Fast migrations. Eventually I'll have too many tasks and run out of processing power, even with this setup. So if I need to migrate a task to a different server, it's a simple matter of copying two tables rather than a hacky and extremely slow select statement on a blob table...
-Indexability. When a table gets too big (300 million rows +) you cannot index it. No matter what, it will just chug for a few days and give up because of a transaction buffer limit. This is just how SQL is. By segmenting the huge blob table into smaller tables you can index successfully. Combine this with parallelization, and you can index all the data FASTER than if you were indexing one big table.
Disadvantages:
-Makes navigation of tables in a GUI difficult
-Makes select / schema-alter statements slightly tricky because now they have to cursor over each table matching a pattern like %Table% and apply the SQL command to each.
So why do many SQL enthusiasts loathe schema duplication if it has SO many advantages? Is there a better way to handle this?
There is not enough information to advise, but the following should be considered:
Snapshot isolation
https://msdn.microsoft.com/en-us/library/ms173763.aspx
Except when a database is being recovered, SNAPSHOT transactions do not request locks when reading data. SNAPSHOT transactions reading data do not block other transactions from writing data. Transactions writing data do not block SNAPSHOT transactions from reading data.
Use transactional replication / log shipping / mirroring / Change Data Capture etc. to offload the main table
Instead of deleting rows, soft deletion (updating an IsDeleted flag or maintaining an xxxx_deleted table) is an option.
If the system (DB, hardware, network architecture) is properly designed, there are no issues even when there are thousands of DML requests per second.
In Microsoft SQL Server, if your tasks never touch rows that do not belong to them, you can try to use either optimistic or pessimistic versioning - Snapshot Isolation in SQL Server.
Depending on your situation, either Read Committed Snapshot Isolation (RCSI) or Snapshot isolation level will do the trick. If both seem to suit you, I would recommend the former since it results in much less performance overhead.

SSIS; row redirected to error even after inserting to DB

I have an SSIS package to insert/update rows in a database. First I use a lookup to check whether the row has already been inserted into the DB; if yes, I update that row, else I insert it as a new row.
My problem is that when inserting, a row is inserted successfully but is also redirected to error. How can both happen at the same time? And it only happens sometimes, not always - very inconsistent. How do I track what caused the error? I used "redirect row" here to get the failed rows.
This happens only when it is deployed on the server. Running it on my local machine using BIDS works fine.
Your OLE DB Destination is likely set to the default values
A quick recap of what all these values mean
Data access mode: You generally want Table or view - fast load or Table or view name variable - fast load as this will perform bulk inserts. The non-fast load choices result in singleton inserts which for any volume of data will have you questioning the sanity of those who told you SSIS can load data really fast.
Keep Identity: This is only needed if you want to explicitly provide an identity seed
Keep nulls: This specifies whether you should allow the defaults to fire
Table lock: Preemptively lock the table. Unless you're dealing with Hekaton tables (new SQL Server 2014 candy), touching a table will involve locks. Yes, even if you use the NOLOCK hint. Inserting data obviously results in locking so we can assure our ACID compliance. SQL Server will start with a small lock, either Row or Page level. If you cross a threshold of modifying data, SQL Server will escalate that lock to encapsulate the whole table. This is a performance optimization as it's easier to work if nobody else has their grubby little paws in the data. The penalty is that during this escalation, we might now have to wait for another process to finish so we can get exclusivity to the table. Had we gone big to begin with, we might have locked the table before the other process had begun.
Check constraints: Should we disable the checking of constraint values? Unless you have a post-import step to ensure the constraints are valid, don't uncheck this. Swiftly loaded data that is invalid for the domain is no good.
Rows per batch: this is a pass through value to the INSERT BULK statement as the ROWS_PER_BATCH value.
Maximum insert commit size: The FastLoadMaxInsertCommitSize property specifies how many rows should be held in the transaction before committing. The 2005 default was 0 which meant everything gets committed or none of it does. The 2008+ default of 2 billion may be effectively the same, depending on your data volume.
So what
You have bad data somewhere in your insert. Something is causing the insert to fail. It might be the first row, last row, both or somewhere in between. The net effect is that the insert itself is rolled back. You designed your package to allow for the bad data to get routed to a flat file so a data steward can examine it, clean it up and get it re-inserted into the pipeline.
The key then is that you need to find some value that provides the optimal balance of insert performance (bigger batch size is better) relative to the amount of bad data. For the sake of argument, let's use a commit size of 5003, because everyone likes prime numbers, and assume our data source supplies 10009 rows of data. Three rows in there will violate the integrity of the target table and will need to be examined by a data steward.
This is going to result in 3 total batches being sent to the destination. The result is one of the following scenarios:
The bad rows are the final 3 rows, resulting in only those 3 rows being sent to the text file and 10006 rows committed to the table in 2 batches.
The bad rows exist only in one of the full sets. This would result in 5006 rows being written to the table and 5003 rows sent to our file.
The bad rows are split among the commit sets. This results in 0 rows written to the table and all the data in our file.
I always felt Murphy was an optimist and the disk holding the error file would get corrupt but I digress.
What would be ideal is to whittle down the space bad data can exist in while maximizing the amount of good data inserted at a shot. In fact, there are a number of people who have written about it but I'm partial to the approach outlined in "Error Redirection with the OLE DB Destination".
We would perform an insert at our initial commit size of 5003 and the successful rows will go in as they will. The bad rows would go to a second OLE DB Destination, this time with a smaller commit size. There are differences of opinion on whether you should go straight to singleton inserts here or add an intermediate bulk insert attempt at half your primary commit size. This is where you can evaluate your process and find the optimal solution for your data.
If data still fails the insert at the single-row level, then you write that row to your error file/log. This approach allows you to put as much good data into the target table as you can, using the most efficient mechanism possible, when you know you're going to have bad data.
Bad data addendum
Yet a final approach to inserting bad data is to not even try to insert it. If you know foreign keys must be satisfied, add a Lookup component to your data flow to ensure that only good values are presented to the insert. Same for NULLs. You're already checking your business key so duplicates shouldn't be an issue. If a column has a constraint that the Foo must begin with Pity, then check it in the data flow. Bad rows all get shunted off to a Derived Column Task that adds business friendly error messages and then they land at a Union All and then all the errors make it to the error file.
I've also written this logic where each check gets its own error column and I only split out the bad data prior to insertion. This prevents the business user from fixing one error in the source data only to learn that there's another error. Oh, and there's still another; try again. Whether this level of data cleansing is required is a conversation with your data supplier, your data cleaner and possibly your boss (because they might want to know how much time you're going to have to spend making this solution bulletproof for the horrible data they keep flinging at you).
References
Keep nulls
Error Redirection with the OLE DB Destination
Batch Sizes, Fast Load, Commit Size and the OLE DB Destination
Default value for OLE DB Destination FastLoadMaxInsertCommitSize in SQL Server 2008
I have noticed that if you check Table lock and also have an update path, you will get deadlocks between the two flows in your data flow, so we do not check Table lock. The performance seems to be the same.
My finding might help those who visit here.
#billinkc made a broad comment; I had gone through all of that. Later, after digging into the system, the problem turned out to be something different.
My SSIS package has a script task within it to do some operations. That task uses a folder called TEMP on the disk. The program which triggered this SSIS package was also simultaneously using the same TEMP folder, and the file read/write exceptions there were not handled.
This caused the script task to fail, resulting in a package failure error. Since the INSERT functionality was carried out before the script task, the INSERT was successful. Later, when the script failed, it moved the rows to error too!
I tried catching these file errors/exceptions and it worked!

Getting stale results in multiprocessing environment

I am using 2 separate processes via multiprocessing in my application. Both have access to a MySQL database via SQLAlchemy Core (not the ORM). One process reads data from various sources and writes it to the database. The other process just reads the data from the database.
I have a query which gets the latest record from a table and displays its id. However, it always displays the first id, which was created when I started the program, rather than the latest inserted id (new rows are created every few seconds).
If I use a separate MySQL tool and run the query manually I get correct results, but SQL alchemy is always giving me stale results.
Since you can see the changes your writer process is making with another MySQL tool, that means your writer process is indeed committing the data (at least it does if you are using InnoDB).
InnoDB shows you the state of the database as of when you started your transaction. Whatever other tools you are using probably have an autocommit feature turned on where a new transaction is implicitly started following each query.
To see the changes in SQLAlchemy do as zzzeek suggests and change your monitoring/reader process to begin a new transaction.
One technique I've used to do this myself is to add autocommit=True to the execution_options of my queries, e.g.:
result = conn.execute(select([table]).where(table.c.id == 123).execution_options(autocommit=True))
Assuming you're using InnoDB, the data on your connection will appear "stale" for as long as you keep the current transaction running, or until you commit the other transaction. In order for one process to see the data from the other process, two things need to happen: 1. the transaction that created the new data needs to be committed, and 2. the current transaction, assuming it has already read some of that data, needs to be rolled back or committed and started again. See The InnoDB Transaction Model and Locking.

Set eventual consistency (late commit) in MySQL

Consider the following situation: you want to update the number of page views of each profile in your system. This action is very frequent, as almost all visits to your website result in a page view increment.
The basic way is UPDATE Users SET page_views = page_views + 1. But this is totally not optimal because we don't really need an instant update (being 1 hour late is OK). Is there any other way in MySQL to postpone a sequence of updates and make cumulative updates at a later time?
I myself tried another method: storing a counter (# of increments) for each profile. But this results in handling a few thousand small files, and I think the disk IO cost (even with a deep tree structure for the files) would probably exceed that of the database.
What is your suggestion for this problem (other than MySQL)?
To improve performance you could store your page view data in a MEMORY table - this is super fast, but temporary: the data only persists while the server is running, and on restart the table will be empty...
You could then create an EVENT to update a table that will persist the data on a timed basis. This would help improve performance a little with the risk that, should the server go down, only the number of visits since the last run of the event would be lost.
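If you would rather keep the batching in application code than in a MySQL EVENT, the same "accumulate now, persist later" idea can be sketched in Java with an in-memory counter and a scheduled flush. This is only a sketch: it assumes a single application instance and made-up table/column names (Users, page_views, id), and it accepts that counts gathered since the last flush are lost if the process dies:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import javax.sql.DataSource;

public class PageViewFlusher {
    private final ConcurrentHashMap<Long, LongAdder> pending = new ConcurrentHashMap<>();
    private final DataSource dataSource;

    public PageViewFlusher(DataSource dataSource) {
        this.dataSource = dataSource;
        // Push the accumulated counters to MySQL once an hour.
        Executors.newSingleThreadScheduledExecutor()
                 .scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.HOURS);
    }

    // Called on every page view; only bumps an in-memory counter, no DB hit.
    public void recordView(long profileId) {
        pending.computeIfAbsent(profileId, id -> new LongAdder()).increment();
    }

    private void flush() {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE Users SET page_views = page_views + ? WHERE id = ?")) {
            for (Long profileId : pending.keySet()) {
                // Remove the counter first so views recorded during the flush
                // are picked up by the next flush instead of being lost.
                long delta = pending.remove(profileId).sum();
                if (delta > 0) {
                    ps.setLong(1, delta);
                    ps.setLong(2, profileId);
                    ps.addBatch();
                }
            }
            ps.executeBatch(); // one batched round trip instead of one UPDATE per view
        } catch (SQLException e) {
            e.printStackTrace(); // real code should log and decide whether to re-queue
        }
    }
}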
The link posted by James via the comment to your question, wherein lies an accepted answer with another comment about memcached, was my first thought also. Just store the profile IDs in memcached, then set up a cron job to run every 15 minutes, grab all the entries and issue the updates to MySQL in a batch. But there are a few things to consider:
1. When you run the batch script to grab the ids out of memcached, you will have to ensure you remove all entries which have been parsed; otherwise you run the risk of counting the same profile views multiple times.
2. Since memcached doesn't support wildcard searching by key, and since you will have to purge existing keys for the reason stated in #1, you will probably have to set up a separate memcached server pool dedicated to the sole purpose of tracking profile ids, so you don't end up purging cached values which have no relation to profile view tracking. However, you could avoid this by storing the profileId and a timestamp within the value payload; have your batch script step through each entry and check the timestamp, and if it's within the time range you specified, add it to the queue to be updated; once you hit the upper limit of your time range, the script stops.
Another option may be to parse your access logs. If user profiles are in a known location like /myapp/profile/1234, you could parse for this pattern and add profile views this way. I ended up having to go this route for advertiser tracking, as it ended up being the only repeatable way to generate billing numbers. If they had any billing disputes we would offer to send them the access logs and parse for themselves.

MySQL table locking for a multi user JSP/Servlets site

Hi, I am developing a site with JSP/Servlets running on Tomcat for the front end, and with a MySQL DB for the back end which is accessed through JDBC.
Many users of the site can access and write to the database at the same time, so my question is:
Do I need to explicitly take locks before each write/read access to the DB in my code?
Or does Tomcat handle this for me?
Also, do you have any suggestions on how best to implement this? I have written a significant amount of JDBC code already without taking the locks :/
I think you are thinking about transactions when you say "locks". At the lowest level, your database server already ensures that parallel reads and writes won't corrupt your tables.
But if you want to ensure consistency across tables, you need to employ transactions. Simply put, what transactions provide is an all-or-nothing guarantee. That is, if you want to insert an Order in one table and related OrderItems in another table, what you need is an assurance that if the insertion of OrderItems fails (in step 2), the changes made to the Order table (step 1) will also get rolled back. This way you'll never end up in a situation where a row in the Order table has no associated rows in OrderItems.
This, of course, is a very simplified description of what a transaction is. You should read more about it if you are serious about database programming.
In Java, you usually handle transactions roughly with the following steps (a sketch follows the list):
Set autocommit to false on your JDBC connection
Do several inserts and/or updates using the same connection
Call conn.commit() when all the inserts/updates that go together are done
If there is a problem somewhere during step 2, call conn.rollback()
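A minimal JDBC sketch of those steps, using the Order/OrderItems example from above (the class, table and column names are invented for illustration):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import javax.sql.DataSource;

public class OrderDao {
    private final DataSource dataSource;

    public OrderDao(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void placeOrder(long customerId, long productId) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);                        // step 1: take manual control of the transaction
            try {
                long orderId;
                try (PreparedStatement insertOrder = conn.prepareStatement(
                        "INSERT INTO Orders (customer_id) VALUES (?)",
                        Statement.RETURN_GENERATED_KEYS)) {
                    insertOrder.setLong(1, customerId);       // step 2: several writes on the same connection
                    insertOrder.executeUpdate();
                    try (ResultSet keys = insertOrder.getGeneratedKeys()) {
                        keys.next();
                        orderId = keys.getLong(1);
                    }
                }
                try (PreparedStatement insertItem = conn.prepareStatement(
                        "INSERT INTO OrderItems (order_id, product_id) VALUES (?, ?)")) {
                    insertItem.setLong(1, orderId);
                    insertItem.setLong(2, productId);
                    insertItem.executeUpdate();
                }
                conn.commit();                                // step 3: both rows become visible together
            } catch (SQLException e) {
                conn.rollback();                              // step 4: neither row is kept
                throw e;
            }
        }
    }
}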