What does the default buffer size in the SSIS data flow properties relate to? - ssis

I'm looking at the data flow buffer properties.
My question is: suppose I configure DefaultBufferSize (default 10 MB), does this mean that the data flow is restricted to using only that amount of RAM at a time?

Related

How to deal with "could not execute broadcast in 300 secs"?

I am trying to get a build working, and one of the stages intermittently fails with the following error:
Could not execute broadcast in 300 secs. You can increase the timeout for broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting spark.sql.autoBroadcastJoinThreshold to -1
How should I deal with this error?
First, let's talk a bit about what that error means.
From the official Spark documentation (http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables):
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
In my experience the broadcast timeout usually occurs when one of the input datasets is partitioned poorly. Instead of disabling the broadcast, I recommend that you look at the partitioning of your datasets and ensure that they are partitioned correctly.
The rule of thumb I use is to take the size of your dataset in MB, divide by 100, and set the number of partitions to that number. Since the default HDFS block size is 128 MB, we want to split the files into chunks of roughly that size, but because they don't split perfectly we divide by a slightly smaller number to get a few more partitions.
The main thing is to keep very small datasets (roughly under 128 MB) in a single partition, as the network overhead of many tiny partitions is too large. Hope this helps.
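As a rough PySpark sketch of that rule of thumb (the input path, the estimated dataset size, and the timeout value are assumptions for illustration, not part of the original answer):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-tuning").getOrCreate()

    # Option from the error message: raise the broadcast timeout (seconds)
    # instead of disabling broadcast joins.
    spark.conf.set("spark.sql.broadcastTimeout", "600")
    # Last resort: disable broadcast joins entirely.
    # spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    df = spark.read.parquet("/data/large_input")   # hypothetical input path

    # Rule of thumb: size in MB divided by 100, so partitions stay a bit
    # under the HDFS block size.
    dataset_size_mb = 12_000                        # assumed size estimate
    target_partitions = max(1, dataset_size_mb // 100)

    print("current partitions:", df.rdd.getNumPartitions())
    df = df.repartition(target_partitions)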

Does 'Rows per batch' in SSIS OLE DB destination help reduce locking?

There is an option called 'Rows per batch' in the OLE DB Destination which, when specified, pulls a certain number of rows per batch; otherwise, it pulls all rows from the source in one batch.
Question: if my source and/or target server is a highly active OLTP database, will setting a low number on this parameter (e.g. 10k or 50k) help reduce the chance of lock escalation, so that the loading process has minimal impact on either database?
"Rows per batch" is actually more for tuning your data flow. By calculating the maximum width of a row, in bytes, and then dividing the default buffer size (the default is 10MB), you will get the number of rows you can insert in one "batch" without spilling the data out to tempdb or your buffer disk (depending if you set a specific location for you temp buffer). While keeping your data flow completely in memory and not needing to spill to disk, you'll keep your data transfer moving as quickly as possible.
The "Table Lock" option in the OLE DB Destination is what tells the server to lock the table or not.
In general the answer is: yes.
It also depends on the processing speed of the rows and the overhead per batch.
If your transaction with all the rows in one batch takes too long, consider splitting it up. But splitting it into batches that are too small can also cause performance problems.
The best way is to test and find the sweet spot.

Keep part of the data in memory and part in Disk

I have a column table with millions of records. I would like to keep only the last 3 months in memory, the rest need to be on disk but can be consulted. Is it possible to do this in SnappyData?
Column tables automatically overflow to disk using an LRU algorithm when memory fills up. You can also explicitly configure your eviction settings. All the configuration options are listed here.
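As a rough sketch of what explicit eviction/overflow configuration can look like (the table name, columns, session setup, and option values are assumptions; check the SnappyData documentation for the exact options supported by your version):

    # Assumes a Spark session that is connected to a SnappyData cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("snappy-eviction").getOrCreate()

    # Column table that evicts least-recently-used rows to disk under memory pressure.
    spark.sql("""
        CREATE TABLE trades (
            trade_id   BIGINT,
            trade_date DATE,
            amount     DECIMAL(18, 2)
        ) USING column
        OPTIONS (EVICTION_BY 'LRUHEAPPERCENT', OVERFLOW 'true')
    """)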

Data Flow Task Buffer Size

I have created a package to pull more than 200,000 records. The pull performance is a bit slow. To increase performance, what values should I give DefaultBufferMaxRows and DefaultBufferSize? How do we set the buffer? Is there any calculation?
SSIS does a decent job of optimizing performance. For better buffer performance, you can remove unwanted columns from the source and set the data type of each column appropriately, depending on the source values.
However, if you change the buffer size carelessly, it can have an adverse effect on performance (SSIS starts spooling buffers to disk).
You can try enabling logging of the BufferSizeTuning event to learn how many rows a buffer contains. Sometimes setting these properties to a higher value can boost performance, but only as long as all buffers still fit in memory.
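As a hedged illustration of how the two properties interact (the 500-byte average row width is an assumption for the example, not derived from the question):

    # Effective rows per buffer is capped by both properties.
    default_buffer_size_bytes = 10 * 1024 * 1024   # DefaultBufferSize (10 MB default)
    default_buffer_max_rows   = 10_000             # DefaultBufferMaxRows (default)
    est_row_width_bytes       = 500                # assumed average row width

    rows_by_size = default_buffer_size_bytes // est_row_width_bytes
    rows_per_buffer = min(default_buffer_max_rows, rows_by_size)

    total_rows = 200_000
    buffers_needed = -(-total_rows // rows_per_buffer)   # ceiling division
    print(rows_per_buffer, buffers_needed)               # 10000 rows per buffer, 20 buffers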
From your comment, the simple answer is:
Do not use the OLE DB Command if you want speed. The OLE DB Command executes once per row, so loading 200,000 records also means 200,000 executions of your OLE DB Command.
Perhaps you can explain what your OLE DB Command does, and we can find a faster solution?

File Autogrowth settings switched

What could be the impact of changing the default Autogrowth values for the files of a database?
I currently have a database where the Autogrowth values are switched between the data and log files.
I have these values in the database properties:
DB_Data (Rows Data): Filegroup PRIMARY, Initial Size 71027 MB, Autogrowth "By 10 percent, unrestricted growth"
DB_Log (Log): Filegroup Not Applicable, Initial Size 5011 MB, Autogrowth "By 1 MB, restricted growth to 2097152 MB"
For the data file it depends on whether or not you have instant file initialisation enabled for the SQL Server service account. If you don't, you should definitely consider using a fixed growth increment, because the time each growth takes is proportional to the size of the growth, and with percentage growth the increments keep getting larger as the file grows. If you grow the file in too small an increment, you can end up with file system fragmentation.
For the log file you should definitely consider a much larger increment than 1 MB, as you will otherwise end up with VLF fragmentation. Log file growth cannot take advantage of instant file initialisation, so it should always use a fixed increment (say between 1 GB and 4 GB, unless you know for a fact that the log will always remain small).
Of course, in an ideal world it wouldn't actually matter what you set these to, as you should be pre-sizing files in advance at low-traffic times rather than leaving it to chance when growth happens.
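As a hedged sketch of switching both files to fixed growth increments (the database name, logical file names, connection string, and the chosen increments are assumptions for illustration):

    import pyodbc

    # ALTER DATABASE cannot run inside a user transaction, so use autocommit.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
        "DATABASE=master;Trusted_Connection=yes;",
        autocommit=True,
    )
    cur = conn.cursor()

    # Fixed increment for the data file instead of "By 10 percent".
    cur.execute("ALTER DATABASE [MyDb] MODIFY FILE (NAME = DB_Data, FILEGROWTH = 512MB);")

    # Much larger fixed increment for the log file instead of 1 MB, to limit VLF fragmentation.
    cur.execute("ALTER DATABASE [MyDb] MODIFY FILE (NAME = DB_Log, FILEGROWTH = 1GB);")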