I am working on the design of an ETL solution that imports a large number of files into database tables every hour. The files are dropped on a network share.
I would like to have multiple SSIS servers run the same package and import files from the network share. This way I can distribute the load across multiple servers and improve the availability of the overall solution, and I can add more SSIS ETL servers whenever the load increases.
As simple as the requirement above looks, several important design details need to be addressed, including:
How to make sure two SSIS servers do not pick up and import the same file concurrently.
If an SSIS server crashes while it is importing a data file, how can the design ensure that another SSIS server picks up the file and finishes it, so that the half-processed file is not abandoned?
I expect these requirements are quite typical of many ETL architectures. Is there an architecture pattern that addresses them?
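One common way to address both points is a shared claim table in the database that every SSIS server consults before touching a file. The table name, columns and stale-claim timeout below are purely illustrative, so treat this as a sketch of the pattern rather than a finished design:

```sql
-- Hypothetical claim table shared by all SSIS servers; a "file watcher" step
-- inserts one Pending row per new file it sees on the network share.
CREATE TABLE dbo.FileQueue
(
    FileName    NVARCHAR(400) NOT NULL PRIMARY KEY,
    Status      VARCHAR(20)   NOT NULL DEFAULT 'Pending',  -- Pending / Processing / Done
    ClaimedBy   NVARCHAR(128) NULL,                         -- server that claimed the file
    ClaimedAt   DATETIME      NULL,
    CompletedAt DATETIME      NULL
);
GO

-- Each server atomically claims one file. READPAST + UPDLOCK keeps two servers
-- from grabbing the same row, and the DATEADD check lets a surviving server
-- re-claim files whose owner crashed more than 30 minutes ago.
DECLARE @Server SYSNAME = @@SERVERNAME, @FileName NVARCHAR(400);

UPDATE TOP (1) q
SET    Status    = 'Processing',
       ClaimedBy = @Server,
       ClaimedAt = GETUTCDATE(),
       @FileName = q.FileName
FROM   dbo.FileQueue AS q WITH (READPAST, UPDLOCK, ROWLOCK)
WHERE  q.Status = 'Pending'
   OR (q.Status = 'Processing' AND q.ClaimedAt < DATEADD(MINUTE, -30, GETUTCDATE()));

SELECT @FileName AS FileToImport;   -- NULL means there is nothing to claim right now
```

For the crash scenario to work, the import itself should be idempotent (for example, load into a per-file staging table, or delete any rows previously loaded for that file before re-importing), so a re-claimed, half-processed file can simply be imported again from the start.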
Related
I have been using SSIS for a while, and I have never come across BizTalk.
One of the data migration projects we are doing also involves BizTalk, in addition to SSIS.
I just wondered what the need for BizTalk is if we already have SSIS as an ETL tool.
SSIS is well suited for bulk ETL batch operations where you're transferring data between SQL Server and:
Another RDBMS
Excel
A simple CSV file
and where:
You do not need row-by-row processing
Your mapping is primarily data type conversion (e.g. changing VARCHAR to NVARCHAR or DATETIME to VARCHAR)
You're OK with error/fault handling for batches rather than rows
You're doing primarily point-to-point integrations that are unlikely to change or will only be needed temporarily.
BizTalk is well suited for real time messaging needs where:
You're transferring messages between any two end points
You need a centralized hub and/or ESB for message processing
You need fine grained transformations of messages
You need to work with more complicated looping file structures (i.e. not straight up CSV)
You need to apply analyst-manageable business rules
You need to be able to easily swap out endpoints at run time
You need more enhanced error/fault management for individual messages/rows
You need enhanced B2B capabilities (EDI, HL7, SWIFT, trading partner management, acknowledgements)
Both can do the other's job with a lot of extra work. To see this, try getting SSIS to do a task that requires calling a stored procedure per row with proper error handling and transformation of each row, and try getting BizTalk to do a bulk ETL operation that requires minimal transformation. Both can do either, but it will be painful.
The short answer: no.
BizTalk Server and SSIS are different paradigms and are used to complement each other, not in opposition. They are both part of the BizTalk Stack and are frequently used in the same app.
BizTalk is a messaging platform, and a BizTalk app will tend to process one entity at a time. SSIS is set-based and works best for bulk, table-based operations.
I have a Transaction database with 10,000+ entries inserted on a daily basis.
My client's requirement is that he can download reports from his own server, so we maintain a copy of the Transaction database on his server.
The problem now is how to move the latest data to his server at a specific time.
There are at least a couple of options in SQL Server.
If you can connect to your customer's database, Change Data Capture (CDC) with SSIS is one option. CDC collects all changes in a queryable store, which SSIS then reads and pushes to your target. You can be as selective as you want about what to move over, since you write the ETL process in SSIS. One downside to CDC is that it's available in Enterprise Edition only. See detailed instructions at https://technet.microsoft.com/en-us/library/bb895315(v=sql.105).aspx
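For reference, enabling CDC on the source is only a couple of system stored procedure calls; a minimal sketch, assuming hypothetical database and table names:

```sql
-- Enable CDC for the database, then for the table you want to track.
-- Database, schema and table names here are examples only.
USE SourceDb;
EXEC sys.sp_cdc_enable_db;

EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Transactions',
     @role_name     = NULL;            -- or a role name to gate access to the change data

-- SSIS (or any query) can then read changes between two LSNs from the
-- generated cdc.fn_cdc_get_all_changes_dbo_Transactions function.
```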
Transactional replication is another option, available in both Enterprise and Standard editions. It has been around a long time and is used by a lot of organizations to do exactly what you described - incrementally move data to another database. It is not as flexible as CDC, but you can still apply filters to which rows/columns get moved, and not needing Enterprise Edition is helpful for many customers. There is lots of detail about the technology at https://msdn.microsoft.com/en-us/library/ms151198(v=sql.105).aspx, but I highly encourage you to check out Kendra Little's excellent article that covers transactional replication and compares it with CDC: http://www.brentozar.com/archive/2013/09/transactional-replication-change-tracking-data-capture/
If you can't connect directly to the customer database, CDC with SSIS still works, but the output target will be a flat file which then gets transferred to the customer and loaded using another SSIS package or some other bulk load job (T-SQL, BCP, etc.). Do be careful with how the flat file gets moved, since anybody can see its contents.
I'd avoid any manual methods like creating triggers or running some (usually expensive) query to find the changed rows. Apart from the maintenance efforts, you're very likely to encounter tough performance issues.
I have a large weekly CSV file (ranging from 500MB to 1GB with over 2.5 million rows) to load into an SQL Server 2008 R2 database.
I was able to use either the BULK INSERT command or the Import and Export Data Wizard to load the data. There was no observable difference in load time between them for my dataset.
What is your recommended approach as far as performance, efficiency and future maintenance are concerned?
Thanks in advance! Cheers, Alex
I ended up using the SQL Server Import and Export Data Wizard and saving the result as an SSIS package. Then I used Business Intelligence Development Studio to edit the saved package and re-imported it back to SQL Server. It works well and takes only 2 minutes to load all 9 CSV files, ranging from 10 MB to 600 MB, into the SQL Server database.
MSDN Forum:
When an SSIS developer opts for the "Fast Load" option along with "Table lock" on the OLE DB destination, or uses the SQL Server Destination, he/she has effectively used BULK INSERT, so it is a moot point to debate which is faster.
Bulk insert on its own has its tricks; in a SQL Server context more can be done to make row processing faster, namely making it minimally logged or not logged at all. Disabling constraints is another thing bcp takes care of and SSIS does not (unless instructed), and that is something MSFT could decide to change in SSIS. But where SSIS shines is in using an algorithm to figure out the best parameters for a given machine/system (e.g. the buffer size, etc.).
So in most applications SSIS is faster right away, and even faster with proper tweaking.
In real life many factors have different impacts on the benchmarking, but at this stage I am inclined to state there is no real measurable difference.
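For comparison, the plain BULK INSERT the quote refers to looks roughly like this; the path, table and option values are only illustrative, and TABLOCK is what allows the minimally logged load mentioned above (given an appropriate recovery model and target table):

```sql
-- Illustrative bulk load of one weekly CSV file into a staging table.
BULK INSERT dbo.WeeklyStaging
FROM '\\fileserver\drop\weekly_extract.csv'      -- example UNC path
WITH
(
    FIRSTROW        = 2,            -- skip the header row
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK,                        -- needed for a minimally logged load
    BATCHSIZE       = 100000        -- commit in chunks rather than one huge transaction
);
```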
Microsoft has published a very informative guide about comparing the different load strategies for achieving high performance and choosing between bulk load methods: The Data Loading Performance Guide.
Also have a look at the following articles:
SSIS: Destination Adapter Comparison
SSIS vs T-SQL – which one is fastest for ETL tasks?
Speeding Up SSIS Bulk Inserts into SQL Server
SSIS – FASTEST DATA FLOW TASK ITEM FOR TRANSFERRING DATA OVER THE NETWORK
I would save the SSIS package from the Import and Export Data Wizard and tweak the OLE DB Destination settings using Visual Studio (aka BIDS, aka SSDT-BI): set an exclusive Table Lock and a large Batch Size and Commit Size, e.g. 100,000 rows. Typically this will boost performance by around 20%.
SSIS is also the best option for future tuning, e.g. filtering or transforming data, or disabling and rebuilding indexes before and after your load.
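If you do go down the disable/rebuild route, the T-SQL pair is simple; the index and table names here are just examples:

```sql
-- Disable nonclustered indexes before the load (do not disable the clustered
-- index, as that makes the table inaccessible). Names are examples only.
ALTER INDEX IX_WeeklyStaging_CustomerId ON dbo.WeeklyStaging DISABLE;

-- ... run the BULK INSERT / SSIS data flow here ...

-- Rebuild afterwards so the index is populated and usable again.
ALTER INDEX IX_WeeklyStaging_CustomerId ON dbo.WeeklyStaging REBUILD;
```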
I’m looking for some feedback on mechanisms to batch data from MySQL Community Server 5.1.32 on an external host down to an internal SQL Server 2005 Enterprise machine over VPN. The external box accumulates data throughout business hours (about 100Mb per day), which then needs to be transferred internationally across a WAN connection (quality not yet determined, but it's not going to be super fast) to an internal corporate environment before some BI work is performed. This should just be change sets making their way down each night.
I’m interested in thoughts on the ETL mechanisms people have successfully used in similar scenarios before. SSIS seems like a potential candidate; can anyone comment on the suitability for this scenario? Alternatively, other thoughts on how to do this in a cost-conscious way would be most appreciated. Thanks!
It depends on the use you have of the data received from the external machine.
If you must have the data for the next morning's calculations, or if you don't have confidence in your network, you would prefer to loosely couple the two systems and put some message queuing between them, so that if something fails during the night (the DBs, the network links, anything that would be a pain to recover from), you still start every morning with some data.
If the data retrieval is not subject to a high degree of criticality, any solution is good :)
Regarding SSIS, it's just a great ETL framework (yes, there's a subtlety :)). But I don't see it as part of the data transfer itself; rather, it belongs in the ETL part, once your data has been received or while it is still waiting in the message-queuing system.
First, if you are going to do this, have a good way to easily see what has changed since the last run. Every table should have a last-updated date or a timestamp column that changes when the record is updated (not sure if MySQL has this). This is far better than comparing every single field.
If you had SQL Server in both locations I would recommend replication. Is it possible to use SQL Server instead of MySQL? If not, then SSIS is your best bet.
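To make the "what has changed since last time" idea concrete, one common approach is a watermark that the nightly SSIS run reads and then advances. A rough sketch on the SQL Server side, with illustrative table and column names (the extract query against MySQL would use the same watermark value as a parameter):

```sql
-- One-row-per-source watermark table on the SQL Server side (illustrative names).
CREATE TABLE dbo.EtlWatermark
(
    SourceTable   SYSNAME  NOT NULL PRIMARY KEY,
    LastLoadedUtc DATETIME NOT NULL
);

-- Before each nightly run, the SSIS package reads the watermark and pulls only
-- rows whose last-updated value is newer; after the load succeeds it advances
-- the watermark to the newest value it actually loaded.
DECLARE @Since DATETIME;
SELECT @Since = LastLoadedUtc FROM dbo.EtlWatermark WHERE SourceTable = N'orders';

SELECT OrderId, CustomerId, Amount, LastUpdated
FROM   dbo.Orders_Staging             -- rows landed by the nightly transfer
WHERE  LastUpdated > @Since;

UPDATE w
SET    LastLoadedUtc = s.MaxUpdated
FROM   dbo.EtlWatermark AS w
CROSS  JOIN (SELECT MAX(LastUpdated) AS MaxUpdated FROM dbo.Orders_Staging) AS s
WHERE  w.SourceTable = N'orders'
  AND  s.MaxUpdated IS NOT NULL;      -- don't move the watermark on an empty batch
```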
In terms of actually getting your data from MySQL into SQL Server, SSIS can import the data in a number of ways. One would be to connect directly to your MySQL source (via an OLE DB connection or similar); another would be to do a daily export from MySQL to a flat file and pick that up using an FTP Task. Once you have the data, SSIS can perform the required transforms before loading the processed data into SQL Server.
We've got an architecture where we intend to use SSIS as a data-loading engine for incoming batches. The intent is to reduce the need for manual intervention & configuration and automate the function as much as possible so we're looking at setting up our "batch monitoring" package to run as scheduled SQL Server Agent jobs.
Is it possible to schedule several SQL Server Agent jobs using the same package, possibly looking at different folders or working on different data chunks (grouped by batch IDs)?
We might also have 3 or 4 “jobs” all running the same package and all monitoring the same folder for incoming files, but at slightly different intervals to avoid file contention issues.
I don't know of any reason you couldn't do this. You could launch the packages each with a different configuration (or configurations) pointing to different working directories, input folders, etc.
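As a sketch of what one of those jobs could look like, each job can run the same package via dtexec and override a package variable with /SET so it watches its own folder. The job name, package path and User::InputFolder variable below are examples only, and the exact /SET property path may need adjusting for your package and SSIS version:

```sql
USE msdb;
GO

-- One Agent job per input folder, all running the same package but overriding
-- a package variable so each job points at a different folder.
EXEC dbo.sp_add_job @job_name = N'Load files - Folder A';

EXEC dbo.sp_add_jobstep
     @job_name  = N'Load files - Folder A',
     @step_name = N'Run import package',
     @subsystem = N'CmdExec',
     @command   = N'dtexec /FILE "D:\Packages\ImportFiles.dtsx" '
                + N'/SET "\Package.Variables[User::InputFolder].Properties[Value]";"\\fileserver\drop\FolderA"';

EXEC dbo.sp_add_jobserver @job_name = N'Load files - Folder A';

-- Repeat for Folder B, C, ... and stagger the schedules (sp_add_jobschedule)
-- so the jobs do not poll the share at exactly the same moment.
```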