I can see I'm not the only person who's experienced an issue with the SSIS Transfer Database Objects Task and timeouts; however, using this task for the extract phase of an ETL must be fairly common, so I'm trying to establish the usual/accepted way to do this.
I have a web application that uses Entity Framework to generate ~250 tables, some of which occasionally have schema updates.
The bulk of the transform and load portion of our ETL is handled by a series of stored procedures; however, these read from a copy of the application's tables that is initially loaded by the Transfer Database Objects task.
Initially, we set up an SSIS package that simply ran the Transfer Database Objects task and then kicked off the stored proc. That meant the job was fairly resilient to change, and the only changes required were to the stored proc, if and when a schema update affected the tables used therein.
Unfortunately, as one of our application instances has grown over time, the Transfer Database Objects task is reaching the point where I'm regularly seeing Timeout errors. Those don't appear to be connection timeouts, or anything I can control on the server side, and from what I can see, I can't amend the CommandTimeout on the underlying SMO stuff within that Task.
I can see that some people manually craft their extract, such that they run a separate Data Flow task to pull the information from each table, which has the obvious bonus that these can run in parallel. However, in my case, that's going to mean an initial chunk of work to craft 250-ish of these, plus a maintenance task whenever the schema changes on the source database, no matter how minor.
I've come across Biml, which looked like a possible way to at least ease that overhead; however, it doesn't appear that it can run on VS2017 yet.
Does anyone have any particular patterns they follow for this? Or, if I do need individual data flow tasks, is there some way to automate the schema updates, perhaps using some kind of SSIS automation and something from Entity Framework?
It turns out the easiest way around this is to write a clone of the Transfer task, but with appropriate additions to allow more control over batching and timeouts etc. Details are available in this article: https://blogs.msdn.microsoft.com/mattm/2007/04/18/roll-your-own-transfer-sql-server-objects-task/
Is it possible to use an SSIS Cache manager with anything other than a Lookup? I would like to use similar data across multiple data flows.
I haven't been able to find a way to cache this data in memory in a cache manager and then reuse it in a later flow.
Nope, the cache connection manager was built specifically to support the Lookup task, which originally only allowed an OLE DB connection to be used.
However, if you have a set of data you want to be static for the life of a package run and usable across data flows, or even other packages, as a table-like entity, perhaps you're looking for a Raw File. It's a tight, binary representation of the data stored to disk. Since it's stored to disk, you will pay a write and subsequent read performance penalty, but it's likely that the files are right-sized such that any penalty is offset by your specific needs.
The first step is to define the data that will go into the Raw File and connect a Raw File Destination, which will involve creating a Raw File Connection Manager where you define where the file lives and the rules about the data in it (recreate, append, etc.). At this point, run the data flow task so the file is created and populated.
The next step is that everywhere you want to use the data, you patch in a Raw File Source. It will behave much like any other data source in your toolkit at that point.
I have a pipeline taking data from a MySQL server and inserting it into Datastore using the Dataflow runner.
It works fine as a batch job executing once. The thing is that I want to get new data from the MySQL server into Datastore in near real-time, but JdbcIO gives bounded data as a source (as it is the result of a query), so my pipeline only executes once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline rerun automatically without having to submit another job?
It is similar to the question Running periodic Dataflow job, but I cannot find the CountingInput class. I thought that maybe it was replaced by the GenerateSequence class, but I don't really understand how to use it.
Any help would be welcome!
This is possible, and there are a couple of ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
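Here is a minimal sketch of that combination, using a 30-second period to match the question. It assumes the MySQL rows carry an insertion timestamp column; the driver, connection string, table, and column names are placeholders (not from the question), and the Datastore write is left as a comment:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.GenerateSequence;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;
import org.joda.time.Instant;

public class PeriodicJdbcSync {
  public static void main(String[] args) {
    // Run this as a streaming job (e.g. --runner=DataflowRunner --streaming).
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // One Long is emitted every 30 seconds for the life of the job.
    PCollection<Long> ticks =
        p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(30)));

    // Turn each tick into a query parameter, here a cutoff timestamp.
    PCollection<Instant> cutoffs =
        ticks.apply(ParDo.of(new DoFn<Long, Instant>() {
          @ProcessElement
          public void process(ProcessContext c) {
            c.output(Instant.now().minus(Duration.standardSeconds(30)));
          }
        }));

    // readAll() runs the query once per incoming element, so every tick
    // triggers a fresh pull of the rows newer than the cutoff.
    PCollection<String> rows = cutoffs.apply(
        JdbcIO.<Instant, String>readAll()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "com.mysql.jdbc.Driver", "jdbc:mysql://host/db"))
            .withQuery("SELECT payload FROM events WHERE created_at > ?")
            .withParameterSetter((Instant cutoff, java.sql.PreparedStatement stmt) ->
                stmt.setTimestamp(1, new java.sql.Timestamp(cutoff.getMillis())))
            .withRowMapper((java.sql.ResultSet rs) -> rs.getString("payload"))
            .withCoder(StringUtf8Coder.of()));

    // ... transform `rows` into Datastore entities and write them with DatastoreIO here.

    p.run();
  }
}
```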
If the amount of data in MySQL is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline rerun automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
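If it helps, here is a minimal sketch of that Pub/Sub approach; the project and topic names are hypothetical, and the message parsing and Datastore write are left as comments:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class PubsubToDatastore {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Pub/Sub is an unbounded source, so the Dataflow job keeps running and
    // receives one message for every row the application writes to MySQL.
    PCollection<String> messages = p.apply(
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/mysql-inserts"));

    // ... parse each message into a Datastore Entity here, then write it with
    // DatastoreIO.v1().write().withProjectId("my-project").

    p.run();
  }
}
```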
Hi, I'm a server developer and we have a big MySQL database (the biggest table has about 0.5 billion rows) running 24/7.
There's a lot of broken data. Most of it is logically wrong and involves multiple sources (multiple tables, S3). Since it's logically complicated, we need our Rails models to clean it (it can't be done with pure SQL queries).
Right now, I am using my own small cleansing framework and an AWS Auto Scaling group to scale up instances and speed things up. But since the database is in service, I have to be careful (table locks and other things) and limit the number of processes.
So I am curious about
How do you (or big companies) clean your data while the database is in-service?
Do you use temporary tables and swap? or just update/insert/delete to an in-service database?
Do you use a framework, library, or solution to clean data in an efficient way (such as distributed processing)?
How do you detect messed-up data in real time?
Do you use a framework or library or solution to detect broken data?
So I have faced a problem similar in nature to what you are dealing with, but different in scale. This is how I would approach the situation.
First, assess the infrastructure concerns, like whether the database can be taken offline or restricted from use for a few hours of maintenance. If so, read on.
Next, you need to define what constitutes "broken data".
Once you arrive at your definition of "broken data", translate it into a way to programmatically identify it.
Write a script that leverages your programmatic identification algorithm and run some tests.
Then back up your data records in preparation.
Then, given the scale of your data set, you will probably need to increase your server resources so as not to bottleneck the script.
Run the script.
Test your data to assess the effectiveness of your script.
If needed, adjust and rerun.
It's possible to do this without taking the database offline for maintenance, but I think you will get better results if you do. Also, since this is a Rails app, I would look at your app's model validations and input field validations to prevent "broken data" in real time.
I have been using SSIS for a while, and I have never come across BizTalk.
One of the data migration projects we are doing also involves BizTalk, apart from SSIS.
I just wondered what the need for BizTalk is, if we already have SSIS as an ETL tool.
SSIS is well suited for bulk ETL batch operations where you're transferring data between SQL Server and:
Another RDBMS
Excel
A simple CSV file
and where:
You do not need row-by-row processing
Your mapping is primarily data type conversion mapping (e.g. changing VARCHAR to NVARCHAR or DATETIME to VARCHAR)
You're OK with error/fault handling for batches rather than rows
You're doing primarily point-to-point integrations that are unlikely to change or will only be needed temporarily.
BizTalk is well suited for real time messaging needs where:
You're transferring messages between any two end points
You need a centralized hub and/or ESB for message processing
You need fine grained transformations of messages
You need to work with more complicated looping file structures (i.e. not straight up CSV)
You need to apply analyst manageable business rules
You need to be able to easily swap out endpoints at run time
You need more enhanced error/fault management for individual messages/rows
You need enhanced B2B capabilities (EDI, HL7, SWIFT, trading partner management, acknowledgements)
Both can do the other's job with a lot of extra work, but it will be painful. To see this, try to get SSIS to do a task that requires calling a stored procedure per row with proper error handling/transformation of each row, or try to have BizTalk do a bulk ETL operation that requires minimal transformation.
The short answer: no.
BizTalk Server and SSIS are different paradigms and are used to complement each other, not in opposition. They are both part of the BizTalk Stack and are frequently used in the same app.
BizTalk is a messaging platform, and a BizTalk app will tend to process one entity at a time. SSIS is set-based and works best for bulk, table-based operations.
I'm sure that this is a pretty vague question that is difficult to answer but I would be grateful for any general thoughts on the subject.
Let me give you a quick background.
A decade ago, we used to write data loads that read input flat files from legacy applications and loaded them into our Datamart. Originally, our load programs were written in VB6; they cursored through the flat file and, for each record, performed this general process:
1) Look up the record. If found, update it
2) Else, insert a new record
Then we changed this process to use SQL Server DTS to load the flat file into a temp table, and then we would perform a massive set-based join between the temp table and the target production table, taking the data from the temp table and using it to update the target table. Records that didn't join were inserted.
This is a simplification of the process, but essentially, the process went from an iterative approach to a "set-based" one, no longer performing updates one record at a time. As a result, we got huge performance gains.
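For illustration, here is a rough sketch of that set-based pattern as two statements run against staging and target tables. The table and column names (TempStage, TargetTable, BusinessKey, etc.) are hypothetical, and the JDBC wrapper is just one way to execute the statements, not how our VB6/DTS loads actually did it:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SetBasedUpsert {
  // Update every matching row in one statement, then insert the rows that
  // did not join, instead of looking up and writing one record at a time.
  private static final String UPDATE_EXISTING =
      "UPDATE t SET t.Amount = s.Amount, t.Status = s.Status "
      + "FROM dbo.TargetTable t JOIN dbo.TempStage s ON t.BusinessKey = s.BusinessKey";

  private static final String INSERT_NEW =
      "INSERT INTO dbo.TargetTable (BusinessKey, Amount, Status) "
      + "SELECT s.BusinessKey, s.Amount, s.Status FROM dbo.TempStage s "
      + "WHERE NOT EXISTS (SELECT 1 FROM dbo.TargetTable t "
      + "WHERE t.BusinessKey = s.BusinessKey)";

  public static void main(String[] args) throws Exception {
    // Assumes the SQL Server JDBC driver is on the classpath and the flat
    // file has already been bulk-loaded into dbo.TempStage.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:sqlserver://localhost;databaseName=Datamart;integratedSecurity=true");
         Statement stmt = conn.createStatement()) {
      int updated = stmt.executeUpdate(UPDATE_EXISTING);
      int inserted = stmt.executeUpdate(INSERT_NEW);
      System.out.printf("Updated %d rows, inserted %d rows%n", updated, inserted);
    }
  }
}
```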
Then we created what was in my opinion a powerful set of shared functions in a DLL to perform common functions/update patterns using this approach. It greatly abstracted the development and really cut down on the development time.
Then Informatica PowerCenter, an ETL tool, came along, and management wants to standardize on the tool and rewrite the old VB loads that used DTS.
I heard that PowerCenter processes records iteratively, but I know that it does do some optimization tricks, so I am curious how Informatica would perform.
Does anyone have any experience with DTS or SSIS that would let them make a gut performance prediction as to which would generally perform better?
I joined an organization that used Informatica PowerCenter 8.1.1. Although I can't speak for general Informatica setups, I can say that at this company Informatica was exceedingly inefficient. The main problem is that Informatica generated some really heinous SQL code on the back end. When I watched what it was doing with Profiler and reviewed the text logs, I saw that it generated separate insert, update, and delete statements for each row that needed to be inserted, updated, or deleted. Instead of trying to fix the Informatica implementation, I simply replaced it with SSIS 2008.
Another problem I had with Informatica was managing parallelization. In both DTS and SSIS, parallelizing tasks was pretty simple -- don't define precedence constraints and your tasks will run in parallel. In Informatica, you define a starting point and then define the branches for running processes in parallel. I couldn't find a way for it to limit the number of parallel processes unless I explicitly defined them by chaining the worklets or tasks.
In my case, SSIS substantially outperformed Informatica. Our load process with Informatica took about 8-12 hours. Our load process with SSIS and SQL Server Agent Jobs was about 1-2 hours. I am certain had we properly tuned Informatica we could have reduced the load to 3-4 hours, but I still don't think it would have done much better.