Data Flow Task Buffer Size - SSIS

I have created a package to pull more than 200,000 records, and the pull performance is a bit slow. To make it faster, what values should I give DefaultBufferMaxRows and DefaultBufferSize? How do we set the buffer? Is there any calculation?

SSIS does a decent job of optimizing performance. For better buffer performance, remove unwanted columns from the source and set the data type of each column appropriately for the values it carries.
However, if you change the buffer size carelessly, it can have an adverse effect on performance (buffers start spooling to disk).
You can also try enabling logging of the BufferSizeTuning event to learn how many rows a buffer contains. Sometimes setting DefaultBufferMaxRows and DefaultBufferSize to higher values can boost performance, but only as long as all buffers still fit in memory.
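As a rough sizing sketch (the 250-byte row width below is an assumption; plug in the real width of the columns you keep), divide DefaultBufferSize by the estimated row width to get an upper bound for DefaultBufferMaxRows:

use strict;
use warnings;

my $estimated_row_bytes = 250;                 # assumption: replace with your actual row width
my $default_buffer_size = 10 * 1024 * 1024;    # 10 MB, the SSIS DefaultBufferSize default

my $rows_per_buffer = int($default_buffer_size / $estimated_row_bytes);
print "A $default_buffer_size-byte buffer holds about $rows_per_buffer rows;\n";
print "keep DefaultBufferMaxRows at or below that, or raise DefaultBufferSize instead.\n";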

From your comment, the simple answer is:
Do not use the OLE DB Command if you want speed. The OLE DB Command executes once per row, so loading 200,000 records also means 200,000 executions of your OLE DB Command.
Perhaps you can explain what your OLE DB Command does, and we can find a faster solution?
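For illustration only (we don't yet know what the command does, so the staging table, key and column names below are hypothetical): if the OLE DB Command is running a per-row UPDATE, the usual faster pattern is to land the rows in a staging table with an OLE DB Destination and then issue one set-based UPDATE afterwards, e.g. from an Execute SQL Task or any client, as in this sketch:

use strict;
use warnings;
use DBI;

# Hypothetical DSN, tables and columns; the same single statement could
# just as well live in an Execute SQL Task after the data flow.
my $dbh = DBI->connect('dbi:ODBC:DSN=MySqlServerDsn', 'user', 'password',
                       { RaiseError => 1 });

# One set-based UPDATE instead of 200,000 per-row OLE DB Command executions.
my $updated = $dbh->do(q{
    UPDATE t
    SET    t.Status = s.Status
    FROM   dbo.TargetTable  AS t
    JOIN   dbo.StagingTable AS s ON s.BusinessKey = t.BusinessKey
});

print "Rows updated: $updated\n";
$dbh->disconnect;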

Related

Does 'Rows per batch' in SSIS OLE DB Destination help reduce locking?

There is an option called 'Rows per batch' in the OLE DB Destination which, when specified, sends a certain number of rows per batch; otherwise, all rows from the source go in one batch.
Question: if my source and/or target servers are busy OLTP databases, will setting a low number for this parameter (e.g., 10k or 50k) help reduce the chance of lock escalation, so that the loading process has minimal impact on either database?
"Rows per batch" is actually more for tuning your data flow. By calculating the maximum width of a row, in bytes, and then dividing the default buffer size (the default is 10MB), you will get the number of rows you can insert in one "batch" without spilling the data out to tempdb or your buffer disk (depending if you set a specific location for you temp buffer). While keeping your data flow completely in memory and not needing to spill to disk, you'll keep your data transfer moving as quickly as possible.
The "Table Lock" option in the OLE DB Destination is what tells the server to lock the table or not.
In general the answer is: yes.
It also depends on the processing speed of the rows and the overhead per batch.
If the transaction with all the rows in a batch takes too long, consider splitting it up. But splitting it into batches that are too small can also cause performance problems.
The best way is to test and find the sweet spot.

Long-running open database connection and SQL join query memory

I'm currently working on a Java application which performs the following in a background thread:
opens a database connection
selects some rows (100,000+ rows)
performs a long-running task for each row, iterating with ResultSet.next() and a fetch size defined by resultSet.setFetchSize()
finally, after everything is done, closes the connection
If the query does some sorting or joining, it will create a temp table and use some additional memory. My question is: if my database connection is kept open for a long time (say, a few hours) and I fetch batch by batch slowly, will it cause performance trouble in the database due to memory usage? (The database is also used concurrently by other threads.) Or are databases designed to handle these things effectively?
(In the context of both MySQL and Oracle)
From an Oracle perspective, opening a cursor and fetching from it periodically doesn't have that much of an impact if it's left open... unless the underlying data that the cursor is querying against changes since the query was first started.
If so, the Oracle database now has to do additional work to find the data as it was at the start of the query (read consistency!), so it needs to read the data blocks (either from disk or from the buffer cache) and, in the event the data has changed, the undo tablespace.
If the undo tablespace is not sized appropriately and enough data has changed, you may find that your cursor fetches fail with an "ORA-01555: snapshot too old" exception.
In terms of memory usage, a cursor doesn't open a result set and store it somewhere for you; it's simply a set of instructions to the database on how to get the next row that gets executed when you do a fetch. What gets stored in memory is that set of instructions, which is relatively small when compared to the amount of data it can return!
This approach is not great. Although both MySQL (InnoDB engine) and Oracle provide consistent reads for SELECT, such a long-running SELECT can lead to performance degradation due to building consistent-read (CR) blocks and other work, and even ORA-01555 in Oracle.
I think you should query or export all the data first, then process the actual business logic row by row. Querying all the data first will not reduce memory usage, but it does shorten the time that memory and temporary sort segments/files stay in use.
Alternatively, consider splitting the whole job into small pieces (see the sketch below); that is better still.
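As a sketch of the "small pieces" idea (table name, key column and connection details are assumptions; the same keyset-pagination pattern applies from JDBC), each chunk gets its own short-lived query, so no cursor and no consistent-read snapshot stays open for hours:

use strict;
use warnings;
use DBI;

# Assumed connection details, table and columns.
my $dbh = DBI->connect('DBI:mysql:database=mydb;host=localhost',
                       'user', 'password', { RaiseError => 1 });

my $chunk_size = 10_000;
my $last_id    = 0;

# Keyset pagination: fetch the next slice by primary key each time around.
my $sth = $dbh->prepare(
    'SELECT id, payload FROM my_table WHERE id > ? ORDER BY id LIMIT ?');

while (1) {
    $sth->execute($last_id, $chunk_size);
    my $rows = $sth->fetchall_arrayref;
    last unless @$rows;

    for my $row (@$rows) {
        my ($id, $payload) = @$row;
        # ... long-running per-row processing goes here ...
        $last_id = $id;
    }
}

$dbh->disconnect;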

When performing a SELECT operation, how is data read from disk each time? How can I verify that data is being read from disk?

Do we need to drop the OS cache? I want to read data from disk, not from cache, so I have disabled:
1. query_cache_type=OFF
2. query_cache_size=0
Even then, when I perform a SELECT for Id = 2, Innodb_buffer_pool_reads changes; if I then select Id = 3, Innodb_buffer_pool_reads does not change.
How do I read the next value from disk? Is there any other way to verify whether data is being read from disk?
[Edit] Thank you all for your responses.
I am trying to do some reverse engineering and want to test the execution speed of a SELECT query without caching, so I want to disable all caches and read the data from disk.
Yes, to completely turn off the Query cache, make both of those settings.
To disable the Query cache for a single SELECT, do SELECT SQL_NO_CACHE ....
But... The QC is not the only caching mechanism. For InnoDB, the buffer_pool caches data and indexes. (MyISAM uses its key_cache and the OS's cache)
Typically, the first time you perform a query (after restarting the server), the disk will need to be hit. Typically, that query (or similar queries) performed after that will not need to hit the disk. Because of "caching", MySQL will hit the disk as little as necessary.
If some other connection is modifying the data you are about to SELECT, do not worry. MySQL will make sure to update the cached copy and/or the disk copy; you will always get the correct value.
InnoDB does things in "blocks" (16KB, typically about 100 rows). That is the unit of disk I/O. Ids 1,2,3 are probably in the same block. Again, MySQL takes care of fetches and changes. It will probably read the block once, cache it for a long time, and eventually write it once, even if there are a lot of changes to the rows in the block.
So how does "Durability" happen? Magic. It involves the InnoDB log file and some extra writes that are done to it. That is another topic; it would take much too long to explain it all.

Extract data with an OLE DB faster

Hi everyone. I'm trying to extract a lot of records from many joined tables and views using SSIS (OLE DB Source), but it takes a huge amount of time! The problem seems to be the query itself, because when I ran it in SQL Server it took more than an hour! Here's my SSIS package design.
I thought of parallelizing the extraction using two OLE DB Sources and a Merge Join, but that isn't recommended, and besides, it takes even more time! Is there any way you can help me, please?
Writing the T-SQL query with all the joins in the OLE DB Source will always be faster than using separate sources and then a Merge Join, IMHO. The reason is that SSIS has a memory-oriented architecture: it has to bring all the data from the N different tables into its buffers and then filter it with the Merge Join. Moreover, Merge Join is an asynchronous (semi-blocking) component, so it cannot use the same input buffer for its output; a new buffer is created, and you may run out of memory if a large number of rows are extracted from the tables.
Having said that, there are a few ways you can improve extraction performance with an OLE DB Source:
1. Tune your SQL query. Avoid using SELECT *.
2. Check network bandwidth. You simply cannot have faster throughput than your bandwidth supports.
3. All source adapters are asynchronous. The speed of an SSIS source is not about how fast your query runs; it's about how fast the data is retrieved.
As others have suggested above, you should show us the query and how long it takes to retrieve the data; otherwise these are just a few optimization techniques that can make the extraction faster.
Thank you for posting a screen shot of your data flow. I doubt whether the slowness you encounter is truly the fault of the OLE DB Source component.
Instead, you have 3 asynchronous components: two that fully block your data flow and one that is partially blocking (AGG, SRT, MRJ). That first Aggregate will have to wait for all 500k rows to arrive before it can finish the aggregation and pass the results along to the Sort.
These transformations also result in fragmented memory. Normally, a memory buffer is filled with data and visits each component in a data flow. Any changes happen directly to that address space and the engine can parallelize operations if it can determine step 2 is modifying field X and step 3 is modifying Y. The async components are going to cause data to be copied from one space to another. This is a double slow down. The first is the physical act of copying data from address space 0x01 to 0xFA or something. The second is that it reduces the available amount of memory for the dtexec process. No longer can SSIS play with all N gigs of memory. Instead, you'll have quartered your memory and after each async is done, that memory partition is just left there until the data flow completes.
If you want this run better, you'll need to fix your query. It may result in your aggregated data being materialized into a staging table or all in one big honkin' query.
Open a new question and provide insight into the data structures, indexes, data volumes, the query itself and preferably the query plan - estimated or actual. If you need help identifying these things, there are plenty of helpful folks here that can help you through the process.

Does executing a statement always take in memory for the result set?

I was told by a colleague that executing an SQL statement always makes the database server put the data into RAM/swap, and thus it is not practical to select large result sets.
I thought that such code
my $sth = $dbh->prepare('SELECT million_rows FROM table');
$sth->execute;
while (my @data = $sth->fetchrow_array) {
    # process the row
}
retrieves the result set row by row, without it being loaded to RAM.
But I can't find any reference to this in DBI or MySQL docs. How is the result set really created and retrieved? Does it work the same for simple selects and joins?
Your colleague is right.
By default, the perl module DBD::mysql uses mysql_store_result which does indeed read in all SELECT data and cache it in RAM. Unless you change that default, when you fetch row-by-row in DBI, it's just reading them out of that memory buffer.
This is usually what you want unless you have very, very large result sets. Otherwise, until you get the last of the data back from mysqld, it has to hold that data ready, and my understanding is that it causes blocks on writes to the same rows (blocks? tables?).
Keep in mind, modern machines have a lot of RAM. A million-row result set is usually not a big deal. Even if each row is quite large at 1 KB, that's only 1 GB RAM plus overhead.
If you're going to process millions of rows of BLOBs, maybe you do want mysql_use_result -- or you want to SELECT those rows in chunks with progressive uses of LIMIT x,y.
See mysql_use_result and mysql_store_result in perldoc DBD::mysql for details.
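A minimal sketch of the mysql_use_result route (connection details and table/column names are placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=mydb;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# mysql_use_result => 1 streams rows from the server as you fetch them,
# instead of the default mysql_store_result behaviour of buffering the
# entire result set in client RAM first.
my $sth = $dbh->prepare('SELECT id, payload FROM big_table',
                        { mysql_use_result => 1 });
$sth->execute;

while (my @row = $sth->fetchrow_array) {
    # process one row; only this row is held on the client side.
    # Caveat: until the last row is fetched (or finish is called), this
    # connection cannot run other statements and the server keeps the
    # result open.
}

$sth->finish;
$dbh->disconnect;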
This is not true (if we are talking about the database server itself, not client layers).
MySQL can buffer the whole resultset, but this is not necessarily done, and if done, not necessarily in RAM.
The resultset is buffered if you are using inline views (SELECT FROM (SELECT …)), the query needs to sort (which is shown as using filesort), or the plan requires creating a temporary table (which is shown as using temporary in the query plan).
Even when using temporary, MySQL keeps the table in memory only while its size does not exceed the limit set by tmp_table_size. When the table grows over this limit, it is converted to an on-disk MyISAM table.
You may, though, explicitly instruct MySQL to buffer the resultset by adding the SQL_BUFFER_RESULT modifier to the outermost SELECT.
See the docs for more detail.
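To check which case you are in, a quick sketch (connection details are placeholders; the query reuses the question's own names) is to look for "Using temporary" and "Using filesort" in the Extra column of EXPLAIN:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=mydb;host=localhost',
                       'user', 'password', { RaiseError => 1 });

my $plan = $dbh->selectall_arrayref(
    'EXPLAIN SELECT million_rows FROM table ORDER BY million_rows',
    { Slice => {} });

for my $step (@$plan) {
    # "Using temporary" / "Using filesort" in Extra means the server
    # materializes or sorts the result before streaming it to the client.
    printf "table=%s  Extra=%s\n", $step->{table} // '', $step->{Extra} // '';
}

$dbh->disconnect;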
No, that is not how it works.
Database will not hold rows in RAM/swap.
However, it will try, and mysql tries hard here, to cache as much as possible (indexes, results, etc...). Your mysql configuration gives values for the available memory buffers for different kinds of caches (for different kinds of storage engines) - you should not allow this cache to swap.
Test it
Bottom line: it should be very easy to test this from the client side alone (I don't know Perl's DBI; it might be doing something that forces MySQL to load everything on prepare, but I doubt it). Anyway, test it:
Actually issue a prepare and execute of SELECT SQL_NO_CACHE million_rows FROM table and then fetch only a few rows out of the millions.
You should then compare performance with SELECT SQL_NO_CACHE only_fetched_rows FROM table and see how that fares.
If the performance is comparable (and fast) then I believe that you can call your colleague's bluff.
Also, if you enable logging of the statements actually issued to MySQL and give us a transcript of that, then we (non-Perl folks) can give a more definitive answer on what MySQL would do.
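A minimal sketch of that test (connection details are placeholders; the queries reuse the question's placeholder names, with LIMIT standing in for "only_fetched_rows"):

use strict;
use warnings;
use DBI;
use Time::HiRes qw(gettimeofday tv_interval);

my $dbh = DBI->connect('DBI:mysql:database=mydb;host=localhost',
                       'user', 'password', { RaiseError => 1 });

sub time_first_rows {
    my ($sql, $rows_to_fetch) = @_;
    my $t0  = [gettimeofday];
    my $sth = $dbh->prepare($sql);
    $sth->execute;
    $sth->fetchrow_array for 1 .. $rows_to_fetch;   # fetch only a handful of rows
    $sth->finish;
    return tv_interval($t0);
}

# If both timings are comparable (and fast), the client is not slurping the
# whole result set before handing back the first rows.
printf "full result set, 10 rows fetched: %.3fs\n",
    time_first_rows('SELECT SQL_NO_CACHE million_rows FROM table', 10);
printf "limited query,   10 rows fetched: %.3fs\n",
    time_first_rows('SELECT SQL_NO_CACHE million_rows FROM table LIMIT 10', 10);

$dbh->disconnect;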
I am not super familiar with this, but it looks to me like DBD::mysql can either fetch everything up front or only as needed, based on the mysql_use_result attribute. Consult the DBD::mysql and MySQL documentation.