I have read plenty of articles stating that SSIS and ETL are much faster and more efficient than using VB6 recordsets and VB.NET DataReaders; however, I do not fully understand why this is the case.
I created an SSIS package that looped through one million records and wrote them into a new table, then did the same in VB, and this confirmed that SSIS is very fast.
I understand that all the processing is done in the data tier so there are no costly trips from the application server to the database server, but is there an MSDN article that explains the algorithm that makes SSIS a lot quicker?
I have a VB6 app that is very slow and I think SSIS is the solution.
The pipeline architecture of the SSIS Data Flow Task is faster due mainly to buffering. By selecting the data in "chunks", the pipeline can perform many operations in RAM, then pass the data buffer downstream for further processing. Depending on the size and shape of the data, and the location and type of the source and destination, you can sometimes achieve better results outside of SSIS.
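To illustrate the last point, here is a minimal, hedged sketch of what "outside of SSIS" can look like when the source and destination live on the same SQL Server instance: a single set-based statement keeps all the work inside the engine. The table names and filter are placeholders, not anything from the original posts.

    -- Hypothetical same-server copy; table and column names are placeholders.
    -- One set-based statement, no round trips through an application or SSIS buffer.
    INSERT INTO dbo.SalesArchive WITH (TABLOCK)
           (SaleID, CustomerID, SaleDate, Amount)
    SELECT s.SaleID, s.CustomerID, s.SaleDate, s.Amount
    FROM   dbo.Sales AS s
    WHERE  s.SaleDate < '20100101';

When the data has to cross servers or come from flat files, the buffered SSIS pipeline tends to win; when it is already sitting in the destination database, plain T-SQL often does.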
In the context of a traditional ETL process, I wrote several views (SQL SELECTs with some JOINs) that I would like to "translate" into Data Flow Tasks in SSIS.
Is there any tool to automate this?
Thanks in advance
Changing SQL views into Data Flows in SSIS is generally a very bad idea. The SQL engine is very good at fetching your data in the optimal way (in most cases) so you get the data as fast as possible. It uses indexes, statistics, cached data and so on.
Obtaining the same result in a Data Flow requires fetching the data from the different tables and then filtering and joining it. Even though SSIS is fast, it will almost always be faster to solve the query directly on the database (as long as it involves just one server), since database engines were built specifically for that purpose. It will also be a lot harder to design and maintain (SSIS graphical interface vs. SQL).
You should use SSIS for its purpose, integrating data, and leave the heavy processing to your databases.
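As a hedged sketch of that advice (view, table and column names are hypothetical), keep the join and filter in a view and point the Data Flow's OLE DB Source at the view instead of rebuilding the logic with Lookup or Merge Join components:

    -- Hypothetical view: the join and filter stay in the database,
    -- where the optimizer can use indexes and statistics.
    CREATE VIEW dbo.vw_CustomerOrders
    AS
    SELECT c.CustomerID,
           c.CustomerName,
           o.OrderID,
           o.OrderDate,
           o.TotalAmount
    FROM   dbo.Customers AS c
    JOIN   dbo.Orders    AS o ON o.CustomerID = c.CustomerID
    WHERE  o.OrderDate >= '20120101';
    GO

    -- The OLE DB Source in the Data Flow then only needs to issue:
    SELECT CustomerID, CustomerName, OrderID, OrderDate, TotalAmount
    FROM   dbo.vw_CustomerOrders;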
As for your question about automation, I don't think there is a tool to translate SQL views into DTS components; you will have to design them manually.
I have been using SSIS for a while, and I have never come across BizTalk.
One of the data migration projects we are doing also involves BizTalk, in addition to SSIS.
I just wondered what the need for BizTalk is if we already have SSIS as an ETL tool.
SSIS is well suited for bulk ETL batch operations where you're transferring data between SQL Server and:
Another RDBMS
Excel
A simple CSV file
and where:
You do not need row-by-row processing
Your mapping is primarily data type conversion mapping (e.g. changing VARCHAR to NVARCHAR or DATETIME to VARCHAR)
You're OK with error/fault handling for batches rather than rows
You're doing primarily point-to-point integrations that are unlikely to change or will only be needed temporarily
BizTalk is well suited for real time messaging needs where:
You're transferring messages between any two end points
You need a centralized hub and/or ESB for message processing
You need fine grained transformations of messages
You need to work with more complicated looping file structures (i.e. not straight up CSV)
You need to apply analyst manageable business rules
You need to be able to easily swap out endpoints at run time
You need more enhanced error/fault management for individual messages/rows
You need enhanced B2B capabilities (EDI, HL7, SWIFT, trading partner management, acknowledgements)
Both can do the job of the other with a lot of extra work, but to see this, try to get SSIS to do a task that would require calling a stored procedure per row and have it do proper error handling/transformation of each row, and try to have BizTalk do a bulk ETL operation that requires minimal transformation. Both can do either, but it will be painful.
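To make the "stored procedure per row with proper error handling" scenario concrete, here is a hedged sketch of the kind of per-row procedure involved; the procedure, table and column names are hypothetical, not from either product:

    -- Hypothetical per-row procedure: each call processes one message/row
    -- and logs its own failure rather than failing the whole batch.
    CREATE PROCEDURE dbo.usp_ProcessOrderRow
        @OrderID INT,
        @Status  VARCHAR(20)
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            UPDATE dbo.Orders
            SET    Status = @Status
            WHERE  OrderID = @OrderID;

            IF @@ROWCOUNT = 0
                RAISERROR('Order not found.', 16, 1);
        END TRY
        BEGIN CATCH
            INSERT INTO dbo.OrderErrors (OrderID, ErrorMessage, ErrorTime)
            VALUES (@OrderID, ERROR_MESSAGE(), SYSDATETIME());
        END CATCH
    END;

Calling something like this once per row fits BizTalk's message-at-a-time model naturally; wiring it into an SSIS data flow (e.g. via an OLE DB Command) works, but is exactly the kind of painful fit described above.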
The short answer: no.
BizTalk Server and SSIS are different paradigms and are used to complement each other, not to compete. They are both part of the BizTalk Stack and are frequently used in the same app.
BizTalk is a messaging platform, and apps tend to process one entity at a time. SSIS is set-based and works best for bulk, table-based operations.
I have a large weekly CSV file (ranging from 500MB to 1GB with over 2.5 million rows) to load into an SQL Server 2008 R2 database.
I was able to use either the BULK INSERT command or the Import and Export Data Wizard to load the data. I observed no difference in load time between them as far as my dataset is concerned.
What is your recommended approach as far as performance, efficiency and future maintenance are concerned?
Thanks in advance!
Cheers, Alex
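For reference, the BULK INSERT route mentioned in the question would look roughly like this. This is a hedged sketch: the file path, table name, delimiters and batch size are placeholders, not details from the original post.

    -- Hypothetical weekly CSV load; path, table and options are placeholders.
    BULK INSERT dbo.WeeklySales
    FROM 'D:\Imports\weekly_sales.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        FIRSTROW        = 2,      -- skip the header row
        TABLOCK,                  -- table lock helps enable minimal logging
        BATCHSIZE       = 100000
    );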
I ended up using the SQL Server Import and Export Data Wizard and saving the result as an SSIS package. Then I used Business Intelligence Development Studio to edit the saved package and re-imported it back to SQL Server. It works well and only takes 2 minutes to load all 9 CSV files, ranging from 10 MB to 600 MB, into the SQL Server database.
MSDN Forum:
When an SSIS developer opts for the "Fast Load" option along with "Table lock" on the OLE DB destination, or uses the SQL Server Destination, then he/she has effectively used BULK INSERT, so it is a moot point to debate which is faster.
Bulk insert on its own has its tricks; in the SQL Server context, more can be done to make the row processing faster, namely making it minimally logged or not logged at all. Disabling constraints is another thing that bcp takes care of and SSIS does not (unless instructed), and that is something MSFT could decide to change in SSIS. Where SSIS shines, though, is in using an algorithm to figure out the best parameters (e.g. the buffer size) for a given machine/system.
So in most applications SSIS is faster right away, and even faster with proper tweaking.
In real life many factors have different impacts on the benchmarking, but at this stage I am inclined to state that there is no real measurable difference.
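The minimal-logging point above can be made concrete with a hedged T-SQL sketch. Database, table and column names are placeholders, and the exact conditions for minimal logging (recovery model, table lock, heap vs. indexed target) are documented in the guide linked below.

    -- Hypothetical minimally logged load; names are placeholders.
    -- Minimal logging generally requires the BULK_LOGGED or SIMPLE recovery
    -- model plus a table lock on the target.
    ALTER DATABASE StagingDB SET RECOVERY BULK_LOGGED;

    INSERT INTO dbo.StageCustomers WITH (TABLOCK)
           (CustomerID, CustomerName, Region)
    SELECT CustomerID, CustomerName, Region
    FROM   dbo.SourceCustomers;

    ALTER DATABASE StagingDB SET RECOVERY FULL;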
Microsoft has published a very informative guide about comparing the different load strategies for achieving high performance and choosing between bulk load methods: The Data Loading Performance Guide.
Also have a look at the following articles:
SSIS: Destination Adapter Comparison
SSIS vs T-SQL – which one is fastest for ETL tasks?
Speeding Up SSIS Bulk Inserts into SQL Server
SSIS – FASTEST DATA FLOW TASK ITEM FOR TRANSFERRING DATA OVER THE NETWORK
I would save the SSIS package from the Import and Export Data Wizard and tweak the OLE DB Destination settings using Visual Studio (aka BIDS aka SSDT BI) - setting an exclusive Table Lock and a large Batch Size and Commit Size, e.g. 100,000 rows. Typically this will boost performance by around 20%.
SSIS is the best option for future tuning, e.g. filtering or transforming data, or disabling and rebuilding indexes before and after your load.
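The disable/rebuild step would look roughly like the sketch below; the index and table names are placeholders, and only nonclustered indexes should be disabled (disabling the clustered index makes the table inaccessible).

    -- Hypothetical pre/post load steps; index and table names are placeholders.
    -- Disable the nonclustered index before the load...
    ALTER INDEX IX_WeeklySales_CustomerID ON dbo.WeeklySales DISABLE;

    -- ...run the SSIS load here...

    -- ...then rebuild it afterwards so report queries can use it again.
    ALTER INDEX IX_WeeklySales_CustomerID ON dbo.WeeklySales REBUILD;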
I'm working on SSRS. Actually I'm new to this. We have an OLTP database in which we have created a stored procedure for each report. These stored procedures are used to create the DataSets in the BI solution that run the reports.
Now we have been asked to go through the SSIS (ETL) process and the Data Warehouse concept, and all reports will now run through these two approaches.
So my doubts are:
1) As per my knowledge of SSIS, we have to create a new database and new tables for each report. Through packages (which include the ETL process) we will insert all the data into these tables and finally fetch the report data from these tables directly.
This approach speeds up the data retrieval process because the data is already calculated for every report, and we do not need to design a Data Warehouse.
Am I right?
2) Do we really need to run all reports through the SSIS and Data Warehouse approach? I.e. how can I judge which reports need to run through the SSIS and Data Warehouse approach and which can continue running against the OLTP system?
3) Any good article links for SSIS and Data Warehouse concepts?
4) Do I have to create SSIS packages before designing the Data Warehouse?
Thanks
1) I'm not sure you want a table per report. I guess you might end up with this if none of your reports used the same fields. When I hear data warehouse, I think dimensional model/star schema. The benefit of a star schema is that it simplifies the data model and reduces the number of joins you might have to go through to get the data you need, optimizing for data retrieval (a minimal sketch follows after point 4).
2) The answer to this question depends on your goals. Many companies with a data warehouse try to do all non-real-time reporting out of their data warehouse or an ODS to reduce the load on the production OLTP system. If optimized reliability and speed of report delivery is the goal, then test query speeds, data integrity, and accuracy and decide if a data warehouse with ETL provides a better experience (and if that justifies the monitoring and maintenance required for a data warehouse).
3) For data warehouse concepts, try the Kimball Group. For SSIS, start with MSDN and make sure to visit the SSIS Package Essentials page.
4) You should design your data warehouse before you build SSIS packages. You might have to make a few tweaks as you get into the ETL process, but you generally know what you want to end up with (your DW design) and use SSIS to get the data to that desired end state.
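Here is the minimal star-schema sketch referenced in point 1. The table and column names are purely illustrative, not a recommendation for your actual model.

    -- Illustrative star schema: one fact table keyed to two dimensions.
    CREATE TABLE dbo.DimDate (
        DateKey      INT      NOT NULL PRIMARY KEY,  -- e.g. 20130415
        CalendarDate DATE     NOT NULL,
        [Month]      TINYINT  NOT NULL,
        [Year]       SMALLINT NOT NULL
    );

    CREATE TABLE dbo.DimCustomer (
        CustomerKey  INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerName NVARCHAR(100)     NOT NULL,
        Region       NVARCHAR(50)      NULL
    );

    CREATE TABLE dbo.FactSales (
        DateKey     INT           NOT NULL REFERENCES dbo.DimDate (DateKey),
        CustomerKey INT           NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        SalesAmount DECIMAL(18,2) NOT NULL,
        Quantity    INT           NOT NULL
    );

    -- A report query then needs only simple joins from the fact to its dimensions.
    SELECT d.[Year], d.[Month], c.Region, SUM(f.SalesAmount) AS TotalSales
    FROM   dbo.FactSales   AS f
    JOIN   dbo.DimDate     AS d ON d.DateKey     = f.DateKey
    JOIN   dbo.DimCustomer AS c ON c.CustomerKey = f.CustomerKey
    GROUP BY d.[Year], d.[Month], c.Region;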
I'm sure that this is a pretty vague question that is difficult to answer but I would be grateful for any general thoughts on the subject.
Let me give you a quick background.
A decade ago, we used to write data loads that read input flat files from legacy applications and loaded them into our data mart. Originally, our load programs were written in VB6; they cursored through the flat file and, for each record, performed this general process:
1) Look up the record. If found, update it
2) Else insert a new record
Then we ended up changing this process to use SQL Server DTS to load the flat file into a temp table, and then we would perform a massive set-based join between the temp table and the target production table, taking the data from the temp table and using it to update the target table. Records that didn't join were inserted.
This is a simplification of the process, but essentially the process went from an iterative approach to a "set-based" one, no longer performing updates one record at a time. As a result, we got huge performance gains.
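(To make that pattern concrete, here is a hedged sketch with placeholder staging and target table names; it is not the actual code from that system. On SQL Server 2008+ the same thing could also be expressed as a single MERGE.)

    -- Hypothetical staging-to-target "upsert"; table names are placeholders.
    -- 1) Update existing rows from the staged flat-file data...
    UPDATE tgt
    SET    tgt.CustomerName = stg.CustomerName,
           tgt.Balance      = stg.Balance
    FROM   dbo.Customers    AS tgt
    JOIN   #StageCustomers  AS stg ON stg.CustomerID = tgt.CustomerID;

    -- 2) ...then insert the rows that did not join to an existing record.
    INSERT INTO dbo.Customers (CustomerID, CustomerName, Balance)
    SELECT stg.CustomerID, stg.CustomerName, stg.Balance
    FROM   #StageCustomers AS stg
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customers AS tgt
                       WHERE tgt.CustomerID = stg.CustomerID);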
Then we created what was in my opinion a powerful set of shared functions in a DLL to perform common functions/update patterns using this approach. It greatly abstracted the development and really cut down on the development time.
Then Informatica PowerCenter, an ETL tool, came around, and management wants to standardize on the tool and rewrite the old VB loads that used DTS.
I heard that PowerCenter processes records iteratively, but I know that it does do some optimization tricks, so I am curious how Informatica would perform.
Does anyone have experience with both DTS/SSIS and Informatica that would let them make a gut performance prediction as to which would generally perform better?
I joined an organization that used Informatica PowerCenter 8.1.1. Although I can't speak for general Informatica setups, I can say that at this company Informatica was exceedingly inefficient. The main problem was that Informatica generated some really heinous SQL code on the back end. When I watched what it was doing with Profiler and reviewed the text logs, I saw that it generated separate insert, update, and delete statements for each row that needed to be inserted/updated/deleted. Instead of trying to fix the Informatica implementation, I simply replaced it with SSIS 2008.
Another problem I had with Informatica was managing parallelization. In both DTS and SSIS, parallelizing tasks was pretty simple -- don't define precedence constraints and your tasks will run in parallel. In Informatica, you define a starting point and then define the branches for running processes in parallel. I couldn't find a way for it to limit the number of parallel processes unless I explicitly defined them by chaining the worklets or tasks.
In my case, SSIS substantially outperformed Informatica. Our load process with Informatica took about 8-12 hours. Our load process with SSIS and SQL Server Agent Jobs was about 1-2 hours. I am certain had we properly tuned Informatica we could have reduced the load to 3-4 hours, but I still don't think it would have done much better.