I have a large weekly CSV file (ranging from 500MB to 1GB with over 2.5 million rows) to load into an SQL Server 2008 R2 database.
I was able to load the data with either the BULK INSERT command or the Import and Export Data Wizard. For my dataset there was no observable difference in load time between the two.
What is your recommended approach as far as performance, efficiency and future maintenance are concerned?
Thanks in advance!
Cheers, Alex
I ended up using the SQL Server Import and Export Data Wizard and saving the result as an SSIS package. Then I used Business Intelligence Development Studio to edit the saved package and re-imported it into SQL Server. It works well and takes only 2 minutes to load all 9 CSV files, ranging from 10 MB to 600 MB, into the SQL Server database.
MSDN Forum:
When an SSIS developer opts for the "Fast Load" option along with "Table lock" on the OLE DB destination, or uses the SQL Server Destination, he or she is effectively using BULK INSERT, so debating which is faster is a moot point.
BULK INSERT has its own tricks: in a SQL Server context more can be done to make row processing faster, namely making the load minimally logged or not logged at all. Disabling constraints is another thing bcp takes care of but SSIS does not (unless instructed), and that is something MSFT could decide to change in SSIS. Where SSIS shines, though, is in using an algorithm to figure out the best parameters for a given machine/system (e.g. the buffer size, etc.).
So in most applications SSIS is faster right away, and even faster with proper tweaking.
In real life many factors affect the benchmark in different ways, but at this stage I am inclined to state there is no real measurable difference.
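For comparison, here is a minimal BULK INSERT sketch for a weekly CSV like the one described in the question; the file path, staging table name and terminators are assumptions and would need to match your actual layout:

    -- Hypothetical staging table and file path.
    BULK INSERT dbo.WeeklyStaging
    FROM 'D:\Import\weekly_feed.csv'
    WITH (
        FIELDTERMINATOR = ',',    -- column delimiter in the CSV
        ROWTERMINATOR   = '\n',   -- row delimiter
        FIRSTROW        = 2,      -- start at row 2 to skip the header row
        TABLOCK,                  -- table-level lock, one prerequisite for minimal logging
        BATCHSIZE       = 100000  -- commit every 100,000 rows as a separate transaction
    );

The TABLOCK and BATCHSIZE options map roughly to the Table lock and Rows per batch / Maximum insert commit size settings on the OLE DB Destination's fast load mode.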
Microsoft has published a very informative guide, The Data Loading Performance Guide, which compares the different load strategies for achieving high performance and covers choosing between bulk load methods.
Also have a look at the following articles:
SSIS: Destination Adapter Comparison
SSIS vs T-SQL – which one is fastest for ETL tasks?
Speeding Up SSIS Bulk Inserts into SQL Server
SSIS – FASTEST DATA FLOW TASK ITEM FOR TRANSFERRING DATA OVER THE NETWORK
I would save the SSIS package from the Import and Export Data Wizard and tweak the OLE DB Destination settings using Visual Studio (aka BIDS aka SSDT BI): set an exclusive Table Lock and a large Batch Size and Commit Size, e.g. 100,000 rows. Typically this will boost performance by around 20%.
SSIS is the best option for future tuning, e.g. filtering or transforming data, or disabling and rebuilding indexes before and after your load.
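If you go down the index route, a rough sketch of that pattern (the index and table names below are placeholders):

    -- Disable nonclustered indexes before the load (never disable the clustered index,
    -- as that makes the table inaccessible).
    ALTER INDEX IX_WeeklyStaging_Code ON dbo.WeeklyStaging DISABLE;

    -- ... run the SSIS load here ...

    -- Rebuild after the load finishes; rebuilding also re-enables the index.
    ALTER INDEX IX_WeeklyStaging_Code ON dbo.WeeklyStaging REBUILD;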
Related
I have a Transaction database with 10,000+ entries inserted on a daily basis.
My client's requirement is that he be able to download reports from his own server, so we keep an identical copy of the Transaction database on his server.
The problem now is how to move the latest data entries to his server at a specific time.
There are at least a couple of options in SQL Server.
If you can connect to your customer's database, change data capture (CDC) with SSIS is one option. CDC collects all changes in a queryable store, which SSIS then reads and pushes to your target. You can be as selective as you want about what to move over, since you write the ETL process in SSIS. One downside to CDC is that it's available in Enterprise Edition only. See detailed instructions at https://technet.microsoft.com/en-us/library/bb895315(v=sql.105).aspx
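For illustration, a rough sketch of what enabling CDC on the source side looks like; the database, schema and table names are assumptions, and SSIS would then read the changes from the generated change tables or the cdc.fn_cdc_get_all_changes_* functions:

    -- Enable CDC at the database level (Enterprise Edition, requires sysadmin).
    USE SourceDb;
    EXEC sys.sp_cdc_enable_db;

    -- Enable CDC for one table; this creates a capture instance / change table.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Transactions',  -- hypothetical source table
        @role_name     = NULL;             -- no gating role in this sketch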
Transactional replication is another option, available in both Enterprise and Standard editions. It has been around a long time and is used by a lot of organizations to do exactly what you described: incrementally move data to another database. It is not as flexible as CDC, but you can still apply filters to which rows/columns get moved, and not needing Enterprise Edition is helpful for many customers. There is lots of detail about the technology at https://msdn.microsoft.com/en-us/library/ms151198(v=sql.105).aspx but I highly encourage you to check out Kendra Little's most excellent article that covers transactional replication and compares it with CDC: http://www.brentozar.com/archive/2013/09/transactional-replication-change-tracking-data-capture/
If you can't connect directly to the customer database, CDC with SSIS still works but the output target will be some flat file which then gets transferred to the customer and loaded using another SSIS package or some other bulk load job (TSQL, BCP, etc...). Do be careful with how the flat file gets moved since anybody can see its contents.
I'd avoid any manual methods like creating triggers or running some (usually expensive) query to find the changed rows. Apart from the maintenance efforts, you're very likely to encounter tough performance issues.
I have read plenty of articles stating that SSIS and ETL are much faster and more efficient than using VB6 recordsets and VB.NET DataReaders; however, I do not fully understand why this is the case.
I created an SSIS package that looped through one million records and created a new table, did the same in VB, and this confirmed that SSIS is very fast.
I understand that all the processing is done in the data tier, so there are no costly trips from the application server to the database server, but is there an MSDN article that explains the algorithm that makes SSIS a lot quicker?
I have a VB6 app that is very slow and I think SSIS is the solution.
The pipeline architecture of the SSIS Data Flow Task is faster due mainly to buffering. By selecting the data in "chunks", the pipeline can perform many operations in RAM, then pass the data buffer downstream for further processing. Depending on the size and shape of the data, and the location and type of the source and destination, you can sometimes achieve better results outside of SSIS.
I haven't got a lot of ETL experience but I haven't found the answer to my question either, although I guess it may be a no-brainer if you've worked with it. We're currently looking into creating a simple data warehouse (simple as in "copy most columns from most tables" and not OLAP-style) and it seems we're leaning towards SQL Server (2008) for a few reasons.
SSIS seems to be the tool for this kind of task when it comes to SQL Server, but I can't find anything about how it affects the source database cache, if at all, when loading data. Some of our installations are very performance-sensitive and depend on the cache reflecting normal usage patterns.
But if SSIS runs a "select *"-ish query and alters the cache, performance for the users may degrade to unacceptable levels until the cache is rebuilt by their usual queries.
So my question is: does SSIS affect the database cache when loading data from a SQL Server database, and is there a way to avoid it?
Part of the problem is also that the source database could be both an Oracle or SQL Server database, so if there is a way to avoid the cache-affecting part for Oracle, that would be good input as well. (I guess the Attunity connector is the way to go?)
(Some additional info: We have considered plain files as well, but an export-import would probably take longer than an SSIS transfer? I also guess change data capture is something we'll look into, so if that is relevant to this question, feel free to include possible issues/benefits.)
Any other relevant suggestions are also welcome!
Thanks!
Tackling the SQL Server side:
First off, SSIS doesn't do anything special to avoid the buffer pool, or the plan cache.
Simple test (on a NON-production instance!):
Create a new SSIS package with a single connection manager and a single data flow containing one OLE DB Source pointing to a table.
Clear the buffer pool, from SSMS: DBCC DROPCLEANBUFFERS
Verify that the cache has been cleared with a query against sys.dm_os_buffer_descriptors (a sketch of such a query follows these steps).
Run the package
Re-run the buffer-descriptor query and note that the data pages for the table (BOM_PIECE in my example) have now been loaded into the cache.
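A minimal sketch of the kind of buffer-descriptor query used in the steps above (run it in the database the OLE DB Source points at):

    -- Count cached data pages per object in the current database.
    SELECT OBJECT_NAME(p.object_id) AS table_name,
           COUNT(*)                 AS cached_pages
    FROM   sys.dm_os_buffer_descriptors AS bd
    JOIN   sys.allocation_units         AS au ON au.allocation_unit_id = bd.allocation_unit_id
    JOIN   sys.partitions               AS p  ON p.hobt_id = au.container_id
    WHERE  bd.database_id = DB_ID()
    GROUP  BY p.object_id
    ORDER  BY cached_pages DESC;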
Note that most SSIS components allow you to provide your own query, so if you have a way to avoid the buffer pool (I don't know that this is possible - I'd defer to someone who knows more about it), you could insert that into a query. So in the above example, instead of selecting Table or view in the OLE DB Source, you would select SQL command, or SQL command from variable if your command requires dynamic text.
Finally, I can imagine why you want to eliminate the cache load - but are you sure you want to do this? SQL Server is fairly good at managing memory, and what you're doing is swapping memory load for disk I/O load, which (depending on your use case) may have a negative impact on other users. This question has a discussion on SQL Server caching.
Read this article about Attunity regarding reading data from Oracle.
What do you mean by "affect the database cache when loading data from a SQL Server database"? Any query that reads data pulls pages through the buffer pool, and a query issued by SSIS is no different in that respect; using SSIS won't affect your server beyond the overhead of reading the data, of course. Just use a proper transaction isolation level.
Also, read about the fast load property on SSIS components.
About change data capture, I don't see how it can replace SSIS. You can use CDC to select the rows that will be loaded, but it won't do the loading for you.
I'm sure that this is a pretty vague question that is difficult to answer but I would be grateful for any general thoughts on the subject.
Let me give you a quick background.
A decade ago, we used to write data loads that read input flat files from legacy applications and loaded them into our data mart. Originally, our load programs were written in VB6; they cursored through the flat file and, for each record, performed this general process:
1) Look up the record; if found, update it
2) Otherwise, insert a new record
We eventually changed this process to use DTS to load the flat file into a temp table, then perform a massive set-based join between the temp table and the target production table, using the data from the temp table to update the target table. Records that didn't join were inserted.
This is a simplification, but essentially the process went from an iterative approach to a set-based one, no longer performing updates one record at a time. As a result, we got huge performance gains.
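For anyone unfamiliar with the pattern, here is a rough sketch of the staging-table upsert described above; the table and column names are made up:

    -- Update rows that already exist in the target...
    UPDATE t
    SET    t.Amount      = s.Amount,
           t.UpdatedDate = GETDATE()
    FROM   dbo.TargetTable AS t
    JOIN   dbo.StagingTemp AS s ON s.BusinessKey = t.BusinessKey;

    -- ...then insert the rows that did not join.
    INSERT INTO dbo.TargetTable (BusinessKey, Amount, UpdatedDate)
    SELECT s.BusinessKey, s.Amount, GETDATE()
    FROM   dbo.StagingTemp AS s
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t
                       WHERE t.BusinessKey = s.BusinessKey);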
Then we created what was in my opinion a powerful set of shared functions in a DLL to perform common functions/update patterns using this approach. It greatly abstracted the development and really cut down on the development time.
Then Informatica PowerCenter, an ETL tool, came along, and management wants to standardize on it and rewrite the old VB loads that used DTS.
I heard that PowerCenter processes records iteratively, but I know that it does do some optimization tricks, so I am curious how Informatica would perform.
Does anyone have enough experience with DTS or SSIS to be able to make a gut performance prediction as to which would generally perform better?
I joined an organization that used Informatica PowerCenter 8.1.1. Although I can't speak for Informatica setups in general, I can say that at this company Informatica was exceedingly inefficient. The main problem was that Informatica generated some really heinous SQL code on the back-end. From watching what it was doing with Profiler and reviewing the text logs, I saw it generate separate insert, update, and delete statements for each row that needed to be inserted/updated/deleted. Instead of trying to fix the Informatica implementation, I simply replaced it with SSIS 2008.
Another problem I had with Informatica was managing parallelization. In both DTS and SSIS, parallelizing tasks was pretty simple -- don't define precedence constraints and your tasks will run in parallel. In Informatica, you define a starting point and then define the branches for running processes in parallel. I couldn't find a way for it to limit the number of parallel processes unless I explicitly defined them by chaining the worklets or tasks.
In my case, SSIS substantially outperformed Informatica. Our load process with Informatica took about 8-12 hours. Our load process with SSIS and SQL Server Agent Jobs was about 1-2 hours. I am certain had we properly tuned Informatica we could have reduced the load to 3-4 hours, but I still don't think it would have done much better.
I'm looking for some feedback on mechanisms for batching data from MySQL Community Server 5.1.32 on an external host down to an internal SQL Server 2005 Enterprise machine over VPN. The external box accumulates data throughout business hours (about 100 MB per day), which then needs to be transferred internationally across a WAN connection (quality not yet determined, but it's not going to be super fast) to an internal corporate environment before some BI work is performed. This should just be change-sets making their way down each night.
I’m interested in thoughts on the ETL mechanisms people have successfully used in similar scenarios before. SSIS seems like a potential candidate; can anyone comment on the suitability for this scenario? Alternatively, other thoughts on how to do this in a cost-conscious way would be most appreciated. Thanks!
It depends on how you use the data received from the external machine.
If you must have the data for the next morning's calculations, or you don't have confidence in your network, you would prefer to loosely couple the two systems and put some message queuing between them, so that if something fails during the night (the databases, the network links, anything that would be a pain for you to recover), you can still start every morning with some data.
If the data retrieval is not highly critical, any solution is good :)
Regarding SSIS, it's just a great ETL framework (yes, there's a subtlety :)). But I don't see it as part of the data transfer itself; rather it belongs to the ETL part, once your data has been received or is still waiting in the message-queuing system.
First, if you are going to do this, have a good way to easily see what has changed since the last run. Every record should have a last-updated date or a timestamp that changes when the record is updated (not sure whether MySQL has this). This is far better than comparing every single field.
If you had SQL Server in both locations I would recommend replication. Is it possible to use SQL Server instead of MySQL? If not, then SSIS is your best bet.
In terms of actually getting your data from MySQL into SQL Server, you can use SSIS to import the data using a number of methods. One would be to connect directly to your MySQL source (via an OLE DB connection or similar); alternatively, you could do a daily export from MySQL to a flat file and pick this up using an FTP Task. Once you have the data, SSIS can perform the required transforms before loading the processed data to SQL Server.
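To tie this back to the change-detection point above, here is a rough sketch of the watermark pattern on the SQL Server side; the table, column and watermark names are all assumptions, and against MySQL the extract query itself would of course be written in MySQL's dialect:

    -- Hypothetical watermark table remembering how far the previous nightly load got.
    DECLARE @LastLoad datetime;

    SELECT @LastLoad = LastLoadedAt
    FROM   dbo.EtlWatermark
    WHERE  SourceName = 'MySQL_Orders';

    -- Shape of the incremental extract: only rows changed since the last run.
    SELECT OrderId, Amount, LastUpdated
    FROM   dbo.StagingOrders
    WHERE  LastUpdated > @LastLoad;

    -- After a successful load, advance the watermark.
    UPDATE dbo.EtlWatermark
    SET    LastLoadedAt = GETDATE()
    WHERE  SourceName = 'MySQL_Orders';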