Does 'Rows per batch' in the SSIS OLE DB Destination help reduce locking? - ssis

There is an option called 'Rows per batch' in the OLE DB Destination which, when specified, sends rows to the destination in batches of that size; otherwise, all rows from the source are sent in a single batch.
Question: If my source and/or target server is a heavily used OLTP database, will setting a low number for this parameter (e.g., 10k or 50k) help reduce the chance of lock escalation, so that the loading process has minimal impact on either database?

"Rows per batch" is actually more for tuning your data flow. By calculating the maximum width of a row, in bytes, and then dividing the default buffer size (the default is 10MB), you will get the number of rows you can insert in one "batch" without spilling the data out to tempdb or your buffer disk (depending if you set a specific location for you temp buffer). While keeping your data flow completely in memory and not needing to spill to disk, you'll keep your data transfer moving as quickly as possible.
The "Table Lock" option in the OLE DB Destination is what tells the server to lock the table or not.

In general the answer is: yes.
It also depends on the processing speed of the rows and the overhead per batch.
If the transaction containing all the rows in a batch takes too long, consider splitting it up. But splitting it into batches that are too small can also cause performance problems.
The best way is to test and find the sweet spot.
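For instance, if you were doing the load yourself over JDBC instead of through the OLE DB Destination, a crude way to look for that sweet spot is to time the same insert workload at a few candidate batch sizes (a sketch only; the connection string, table, column and row counts are made up):
import java.sql.*;

// Sketch: time the same batched insert at several batch sizes to find the sweet spot.
public class BatchSizeProbe {
    public static void main(String[] args) throws SQLException {
        int[] candidateBatchSizes = {1_000, 10_000, 50_000};   // assumed candidates
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://host;databaseName=db", "user", "pass")) {
            con.setAutoCommit(false);
            for (int batchSize : candidateBatchSizes) {
                long start = System.nanoTime();
                try (PreparedStatement ps = con.prepareStatement(
                        "INSERT INTO target_table (col1) VALUES (?)")) {
                    for (int i = 1; i <= 100_000; i++) {
                        ps.setInt(1, i);
                        ps.addBatch();
                        if (i % batchSize == 0) {      // send and commit one batch at a time
                            ps.executeBatch();
                            con.commit();
                        }
                    }
                    ps.executeBatch();                 // flush the final partial batch
                    con.commit();
                }
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("batch size " + batchSize + ": " + elapsedMs + " ms");
            }
        }
    }
}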

Related

Resources consumed by a simple SELECT query in MySql

There are a few large tables in one of the databases of a customer (each table is ~50M rows and not too wide). The intent is to infrequently read these tables (completely). As there are no reasonable CDC indices present, the plan is to read the tables by querying them:
SELECT * from large_table;
The reads will be performed using a JDBC driver. With the following fetch configuration, the intent is to read the data approximately one record at a time (it may take a significant amount of time) so that the client code is never overwhelmed.
// Forward-only, read-only statement so the driver can stream the result set
PreparedStatement stmt = connection.prepareStatement(queryString, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
// Integer.MIN_VALUE tells MySQL Connector/J to stream rows one at a time instead of buffering them all client-side
stmt.setFetchSize(Integer.MIN_VALUE);
I was going through the execution path of a query in High Performance MySQL, but some questions seemed unanswered:
1. Without temp tables being explicitly created and without the query cache being used, "how" are the stream reads tracked on the server?
2. Is any temporary data created (in main memory or in files on disk) at all? If so, where is it created and how much?
3. If temporary data is not created, how are the rows to be returned tracked? Does the query engine keep track of all the page files to be read for this query on this connection? If there are several such queries running on the server, are the earliest "tracked" files purged in favor of more recently submitted queries?
PS: I want to understand the effect of this approach on the MySQL server (I'm not saying there aren't better ways of reading the tables).
That simple query will not use a temp table. It will simply fetch the rows and transfer them to the client until it finishes. Nor would any possible index be useful. (If the real query is more complex, let's see it.)
The client may wait for all the rows (faster, but memory intensive) before it hands any to the user code, or it may hand them off one at a time (much slower).
I don't know the details in JDBC on specifying it.
You may want to page through the table. If so, don't use OFFSET, but use the PRIMARY KEY and "remember where you left off". More discussion: http://mysql.rjweb.org/doc.php/pagination
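A minimal JDBC sketch of that "remember where you left off" keyset paging (the table and column names are assumptions; the only requirement is a PRIMARY KEY you can seek on):
import java.sql.*;

// Sketch: page through a large table by PRIMARY KEY instead of OFFSET.
public class KeysetPager {
    public static void main(String[] args) throws SQLException {
        final int pageSize = 10_000;          // assumed page size
        long lastSeenId = 0;                  // "where we left off"
        try (Connection con = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT id, payload FROM large_table WHERE id > ? ORDER BY id LIMIT ?")) {
            while (true) {
                ps.setLong(1, lastSeenId);
                ps.setInt(2, pageSize);
                int rowsInPage = 0;
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastSeenId = rs.getLong("id");
                        // process rs.getString("payload") here
                        rowsInPage++;
                    }
                }
                if (rowsInPage < pageSize) break;  // last page reached
            }
        }
    }
}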
Your Question #3 leads to a complex answer...
Every query brings all the relevant data (and index entries) into RAM. The data/index is read in chunks ("blocks") of 16KB from the BTree structure that is persisted on disk. For a simple select like that, it will read the blocks 'sequentially' until finished.
But, be aware of "caching":
If a block is already in RAM, no I/O is needed.
If a block is not in the cache ("buffer_pool"), it will, if necessary, bump some block out and read the desired block in. This is very normal, and very common. Do not fear it.
Because of the simplicity of the query, only a few blocks ever need to be in RAM at any moment. Hence, if your buffer pool were only a few megabytes, it could still handle, say, a 1TB table. There would be a lot of I/O, and that would impact other operations.
As for "tracking", let me use the analogy of reading a long book in a single sitting. There is nothing to track, you are simply turning pages ('blocks'). You don't even need a 'bookmark' for tracking, it is next-next-next...
Another note: InnoDB uses "B+Tree", which includes a link from one block to the "next", thereby making the page turning efficient.
Another interpretation of tracking... "Transactions" and "ACID". When any query (read or write) touches a table, there is some form of lock applied to each row touched. For SELECT the lock is rather light-weight. For writes it can cause delays or even a "deadlock". The locks are unavoidable, but sometimes actions can be taken to minimize their impact.
Logically (but not actually), a "snapshot" of all rows in all tables is taken at the instant you start a transaction. This allows you to see a consistent view of everything, even if other connections are changing rows. The underlying mechanism is very lightweight on reading, but heavier for writes. Writes will make a copy of the row so that each connection sees the snapshot that it 'should' see. Also, the copy allows for ROLLBACK and recovery from a crash (eg power failure).
(Transaction "isolation" mode allows some control over the snapshot.) To get the optimal performance for your case, do nothing special.
Here's a way to conceptualize the handling of transactions: Each row has a timestamp associated with it. Each query saves the start time of the query. The query can "see" only rows that are older than that start time. A subsequent write in another connection will be creating copies of rows with a later timestamp, hence not visible to the SELECT. Hence, the onus is on writes to do extra work; reads are cheap.
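A toy model of that visibility rule (purely illustrative; real InnoDB versioning uses transaction IDs and undo logs, not literal timestamps):
// Toy model: a query "sees" only row versions created before the query started.
public class SnapshotVisibility {
    static boolean visibleTo(long rowVersionCreatedAt, long queryStartedAt) {
        return rowVersionCreatedAt <= queryStartedAt;
    }

    public static void main(String[] args) {
        long queryStart = 100;
        System.out.println(visibleTo(90, queryStart));   // true: version existed before the SELECT began
        System.out.println(visibleTo(150, queryStart));  // false: written by a later transaction, invisible
    }
}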

Long time open database connection and SQL join query memory

I'm currently working on a Java application which performs the following in a background thread:
opens a database connection
selects some rows (100,000+ rows)
performs a long-running task for each row by calling ResultSet.next(), with some buffer size defined by resultSet.setFetchSize()
finally, after everything is done, closes the connection
If the query does some sorting or joining, it will create a temp table and use some additional memory. My question is: if my database connection stays open for a long time (let's say a few hours) and fetches batch by batch slowly, will it cause performance trouble in the database due to memory usage (if the database is concurrently used by other threads as well)? Or are databases designed to handle these things effectively?
(In the context of both MySQL and Oracle)
From an Oracle perspective, opening a cursor and fetching from it periodically doesn't have that much of an impact if it's left open... unless the underlying data that the cursor is querying against changes since the query was first started.
If so, the Oracle database now has to do additional work to reconstruct the data as it was at the start of the query (read consistency!), so it needs to query the data blocks (either on disk or in the buffer cache) and, where the data has changed, the undo tablespace.
If the undo tablespace is not sized appropriately and enough data has changed, you may find that your cursor fetches fail with an "ORA-01555: snapshot too old" exception.
In terms of memory usage, a cursor doesn't open a result set and store it somewhere for you; it's simply a set of instructions to the database on how to get the next row that gets executed when you do a fetch. What gets stored in memory is that set of instructions, which is relatively small when compared to the amount of data it can return!
This mechanism does not seem good.
Although both MySQL (the InnoDB engine) and Oracle provide consistent reads for SELECT, such a long-running SELECT may lead to performance degradation due to the work of building consistent-read (CR) blocks and other overhead, and even ORA-01555 in Oracle.
I think you should query/export all the data first, then process the actual business logic row by row.
Finally, querying all the data first will not reduce memory usage, but it does reduce the continuous time during which memory and temp sort segments/files are in use.
Alternatively, you could consider splitting the whole job into small pieces; that is better.
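A rough sketch of that "export first, then process" approach over JDBC (the file name, table and columns are assumptions; the point is only that the database connection is closed before the slow per-row work begins):
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.*;

// Sketch: dump the rows to a local file quickly, close the connection,
// then do the slow per-row processing offline.
public class ExportThenProcess {
    public static void main(String[] args) throws Exception {
        Path dump = Paths.get("rows.tsv");   // assumed local staging file
        try (Connection con = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass");
             Statement st = con.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
             BufferedWriter out = Files.newBufferedWriter(dump)) {
            st.setFetchSize(Integer.MIN_VALUE);          // MySQL streaming mode: one row at a time
            try (ResultSet rs = st.executeQuery("SELECT id, payload FROM large_table")) {
                while (rs.next()) {
                    out.write(rs.getLong("id") + "\t" + rs.getString("payload"));
                    out.newLine();
                }
            }
        }   // connection closed here, so no long-lived cursor or consistent-read pressure

        try (BufferedReader in = Files.newBufferedReader(dump)) {
            String line;
            while ((line = in.readLine()) != null) {
                // long-running business processing for each exported row goes here
            }
        }
    }
}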

Balanced Data Distributor inserts data one at a time to destination but not parallelly

I have used the SSIS Balanced Data Distributor just to load 50,000 records from an OLE DB Source to OLE DB Destinations.
When I don't use the BDD it takes 2 minutes 40 seconds; when I use the BDD it takes 1 minute 55 seconds, which is not that much of a difference.
I also find that the data does not load into the destinations in parallel: it loads into the first destination and only later fills the next one (one at a time). Can any of you please help me load them in parallel?
Balanced Data Distributor is not a silver bullet for performance and runtime. It is good when:
You have CPU-intensive transformations and can benefit from parallel execution
Your destination supports concurrent inserts
The first case is clear and depends on your data flow. As for concurrent inserts on the OLE DB Destination: the best results are on heap tables, i.e. tables without a primary key/clustered index and without other indexes, or where the clustered key is defined on an auto-incremented surrogate key. On the OLE DB Destination you might need to disable the table lock, since otherwise it could prevent the insert from being parallel. But check for yourself; as written in Mark's answer, sometimes parallel insert does work with a table lock, but on a heap table or a columnstore.
Other table designs (with indexes, clustered or not) might escalate locks to the table level or require index rebuilding, effectively disabling parallel inserts. Delete or disable those indexes.
So, you have to evaluate for yourself whether the parallel execution justifies the additional effort in development and support.
When you are using BDD to insert into the same table, you will only get parallel inserts if the table is a heap (no clustered index) and does not have a unique constraint.
If TABLOCK is turned ON for all the destinations, SQL Server will take a special bulk update (BU) lock, which allows parallel inserts into the same heap.
If there are no other indexes on the table and the database is NOT in the full recovery model, you will gain the additional benefit of minimal logging.
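Outside of SSIS, the same heap + TABLOCK idea can be sketched with the mssql-jdbc driver's bulk copy API; this is only an illustration (the connection string and table names are made up), not something the BDD package itself requires:
import java.sql.*;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;

// Sketch: a TABLOCK'd bulk load into a heap. The BU lock taken by a bulk load with
// TABLOCK is compatible with other BU locks, which is why several such streams
// (or several BDD destinations) can insert into the same heap concurrently.
public class HeapBulkLoad {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:sqlserver://host;databaseName=db;user=user;password=pass"; // assumed
        try (Connection src = DriverManager.getConnection(url);
             Connection dst = DriverManager.getConnection(url);
             Statement st = src.createStatement();
             ResultSet rows = st.executeQuery("SELECT col1, col2 FROM dbo.SourceTable");
             SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(dst)) {
            SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
            options.setTableLock(true);                // same idea as the destination's Table Lock option
            options.setBatchSize(10_000);
            bulkCopy.setBulkCopyOptions(options);
            bulkCopy.setDestinationTableName("dbo.TargetHeap");  // heap: no clustered index, no uniques
            bulkCopy.writeToServer(rows);
        }
    }
}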
As you noted in your question, using BDD saved you about 45 seconds, so it is actually working. You'll probably see different performance when running on a server that has more cores and memory, so be sure to test there as well. The measure of success will be total duration, rather than what Visual Studio displays in its debugger.
When planning for server execution, it would also be helpful to increase these two properties on the data flow:
-DefaultBufferMaxRows (try adding a 0 to bring it to 100,000)
-DefaultBufferSize - add a zero to max out memory
If you're on SQL Server 2016, you can instead set AutoAdjustBufferSize to true, which will ignore the above properties and optimize the buffer size for best performance.
These adjustments will increase the commit size of the inserts which will result in faster writes to some degree.
The bottom line: keep TABLOCK on and test on the server. BDD is working.

Extract data with an OLE DB faster

Hi everyone, I'm trying to extract a lot of records from a lot of joined tables and views using SSIS (OLE DB Source), but it takes a huge amount of time! The problem is due to the query, because when I ran it in SQL Server it took more than an hour! Here's my SSIS package design.
I thought of parallelizing the extraction using two OLE DB Sources and a Merge Join, but using that isn't recommended, and besides, it takes even more time! Is there any way you can help me, please?
Writing the T-SQL query with all the joins in the OLE DB Source will always be faster than using separate sources and then a Merge Join, IMHO. The reason is that SSIS has a memory-oriented architecture: it has to bring all the data from the N different tables into its buffers and then filter it using the Merge Join. Moreover, Merge Join is an asynchronous (semi-blocking) component, so it cannot use the same input buffer for its output; a new buffer is created, and you may run out of memory if a large number of rows are extracted from the tables.
Having said that, there are a few ways you can enhance extraction performance using the OLE DB Source:
1. Tune your SQL query. Avoid using SELECT *.
2. Check your network bandwidth. You just cannot have faster throughput than your bandwidth supports.
3. All source adapters are asynchronous. The speed of an SSIS source is not about how fast your query runs; it's about how fast the data is retrieved.
As others have suggested above, you should show us the query and also the time it takes to retrieve the data; otherwise these are just a few optimization techniques which can make the extraction faster.
Thank you for posting a screen shot of your data flow. I doubt whether the slowness you encounter is truly the fault of the OLE DB Source component.
Instead, you have 3 asynchronous components: two that fully block your data flow and one that is partially blocking (AGG, SRT, MRJ). That first aggregate will have to wait for all 500k rows to arrive before it can finish the aggregation and pass it along to the sort.
These transformations also result in fragmented memory. Normally, a memory buffer is filled with data and visits each component in a data flow. Any changes happen directly in that address space, and the engine can parallelize operations if it can determine that step 2 is modifying field X and step 3 is modifying field Y. The async components cause data to be copied from one space to another. This is a double slowdown. The first is the physical act of copying data from address space 0x01 to 0xFA or something. The second is that it reduces the amount of memory available to the dtexec process. No longer can SSIS play with all N gigs of memory; instead, you'll have quartered your memory, and after each async component is done, that memory partition is just left there until the data flow completes.
If you want this to run better, you'll need to fix your query. That may mean materializing your aggregated data into a staging table, or doing it all in one big honkin' query.
Open a new question and provide insight into the data structures, indexes, data volumes, the query itself and preferably the query plan - estimated or actual. If you need help identifying these things, there are plenty of helpful folks here that can help you through the process.

Data Flow Task Buffer Size

I have created a package to pull more than 200,000 records. The pulling performance is a bit slow. To increase the performance (i.e., make it faster), what should I set DefaultBufferMaxRows and DefaultBufferSize to? How do we set the buffer? Is there any calculation?
SSIS does a decent job of optimizing performance. For better buffer performance, you can remove unwanted columns from the source and set the data type of each column appropriately, depending on the source values.
However, if you change the buffer size, it can have an adverse effect on performance (buffers start spooling to disk).
You can try enabling logging of the BufferSizeTuning event to learn how many rows a buffer contains. Sometimes setting these properties to a higher value can boost performance, but only as long as all buffers fit in memory.
From your comment, the simple answer is:
Do not use the OLE DB Command if you want speed. The OLE DB Command executes once per row: if you want to load 200,000 records, that also means 200,000 executions of your OLE DB Command.
Perhaps you can explain what your OLE DB Command does, and we can find a faster solution?