As per the attached, we have a Balanced Data Distributor (BDD) set up in a data transformation covering about 2 million rows. The script tasks are identical: each one opens a connection to Oracle and executes first a delete and then an insert. (This isn't relevant, but it's done that way due to parameter issues with the OLE DB Command and the Microsoft OLE DB Provider for Oracle...)
The issue I'm running into is no matter how large I make my buffers or how many concurrent executions I configure, the BDD will not execute more than five concurrent processes at a time.
I've pulled back hundreds of thousands of rows in a larger buffer, and it just gets divided 5 ways. I've tried this on multiple machines - the current shot is from a 16 core server with -1 concurrent executions configured on the package - and no matter what, it's always 5 parallel jobs.
5 is better than 1, but with 2.5 million rows to insert/update, 15 rows per second at 5 concurrent executions isn't much better than 2-3 rows per second with 1 concurrent execution.
Can I force the BDD to use more paths, and if so how?
Short answer:
Yes, the BDD can use more than five paths. You shouldn't have to do anything special to force it; by definition it should do so automatically. So why isn't it using more than five paths? Because your source is producing data faster than your destination can consume it, which causes backpressure. To resolve it, you have to tune your destination components.
Long answer:
In theory, "the BDD takes input data and routes it in equal proportions to it's outputs, however many there are." In your setup there are 10 outputs, so the input data should be distributed equally to all 10 outputs and you should see 10 paths executing at the same time. Again, that is in theory.
But another concept of the BDD is that "instead of routing individual rows, the BDD operates on buffers on data." This means the data flow engine initiates a buffer, fills it with as many rows as possible, and moves that buffer to the next component (the script destination in your case). As you can see, 5 buffers are used, each with the same number of rows. If additional buffers had been started, you would have seen more paths being used. SSIS couldn't use additional buffers, and ultimately additional paths, because of a mechanism called backpressure, which kicks in when the source produces data faster than the destination can consume it. Without it, all memory would be used up by the source data and SSIS would have none left for the transformation and destination components. To avoid that, SSIS limits the number of active buffers. The limit is set to 5 (it can't be changed), which is exactly the number of threads you're seeing.
PS: The text within quotes is from this article
There is a property in SSIS data flow tasks called EngineThreads which determines how many flows can be run concurrently, and its default value is 5 (in SSIS 2012 its default value is 10, so I'm assuming you're using SSIS 2008 or earlier.) The optimal value is dependent on your environment, so some testing will probably be required to figure out what to put there.
Here's a Jamie Thomson article with a bit more detail.
Another interesting thing I discovered via this article on CodeProject:
[T]his component uses an internal buffer of 9,947 rows (as per the
experiment, I found so) and it is pre-set. There is no way to override
this. As a proof, instead of 10 lac rows, we will use only 9,947 (Nine
thousand nine forty seven ) rows in our input file and will observe
the behavior. After running the package, we will find that all the
rows are being transferred to the first output component and the other
components received nothing.
Now let us increase the number of rows in our input file from 9,947 to
9,948 (Nine thousand nine forty eight). After running the package, we
find that the first output component received 9,947 rows while the
second output component received 1 row.
So I notice in your first buffer run that you pulled 50,000 records. Those got divided into 9,984 record buckets and passed to each output. So essentially the BDD takes the records it gets from the buffer and passes them out in ~10,000 record increments to each output. So in this case perhaps your source is the bottleneck.
Perhaps you'll need to split your original Source query in half and create two BDD-driven data flows to in essence double your parallel throughput.
Related
I have a Data Flow Task that moves a bunch of data from multiple sources to multiple destinations. About 50 in all. The data moved is from one database to another with varying rows and columns in each flow.
While I believe I understand the basic idea behind the Data Flow Task's DefaultBufferMaxRows and DefaultBufferSize as it relates to Rows per Batch and Maximum insert commit size of the Destination, it's not clear to me what happens when there are multiple unrelated source and destination flows.
What I'm wondering is which of the following makes the most sense :
Divide out all the source and destination flows into separate Data Flow Tasks
Divide them into groups that have roughly the same size and number of rows
Leave as is and just make sure to set the properties with enough Buffer Rows and Buffer Size while setting the Rows per batch and Maximum insert commit size to the individual destination
I believe I read somewhere that it's better to have each source and destination in its own data flow task, but I am unable to find the link at this time.
Most examples I've been able to locate online seem to always be for one source to one or more destinations, or just one to one.
Let me start from the basics. A Data Flow Task is a task that organizes a pipeline of data from a Data Source to a Data Destination. It is unique among SSIS tasks because it runs the data manipulation inside SSIS itself; all other tasks call external systems to do something with data outside SSIS.
On the relationship between DefaultBufferMaxRows and DefaultBufferSize on one side and Rows per Batch and Maximum insert commit size of the Destination on the other: there is no direct relation. DefaultBufferMaxRows and DefaultBufferSize are properties of the Data Flow pipeline; the pipeline processes rows in batches and these properties control the processing batch size. They govern the RAM consumption and performance of the Data Flow Task.
On the other hand, Rows per Batch and Maximum insert commit size are properties of the Data Destination, namely the OLE DB Destination in Fast Load mode only; they control the performance of the Data Destination itself. You may have a Data Flow with a Flat File Destination, which has no Rows per Batch, but the Data Flow will still have the DefaultBufferMaxRows and DefaultBufferSize properties.
Typical usage from my experience:
DefaultBufferMaxRows and DefaultBufferSize control the batch size of the Data Flow pipeline. Tuning them is a tradeoff: bigger batches mean less overhead on batch handling, i.e. less execution time, but more RAM consumption. More RAM consumption means that you might run out of RAM, in which case the DFT data buffers will be swapped to disk.
In SSIS 2016+ there is a "magical setting" AutoAdjustBufferSize which tells the engine to autogrow the buffer.
Values for these properties are usually determined by performance tests in a QA environment. In development, use the defaults.
Rows per Batch and Maximum insert commit size -- control log growth and the ability to roll back all changes. Do not change these unless you really need to. The defaults are generally OK; I have changed them only rarely, for special reasons. More on its functions.
On package design:
1 pair of Source-Destination per DFT (Data Flow Task). This is optimal: it gives you the most control in terms of tuning, execution order, etc. You can also make use of the SSIS engine's parallel execution of tasks. BTW, it simplifies debugging and support.
Division in groups. You can group DFTs in Sequence containers and define common properties via Expressions and Variables. But use this only if you really need to, because it complicates your design.
All Source-Destination pairs in one DFT. I would recommend against it; it is complex and error prone.
As a bottom line, keep it simple -- 1 pair of Source-Destination per DFT, and play with your parameters only if you have to.
I decided to use a MySQL Cluster for a bigger project of mine. Besides storing documents in a simple table schema with only three indexes, a need has arisen to store pieces of information ranging from 1MB to 50MB in size. These will be serialized custom tables that are aggregates of data feeds.
How will this information be stored, and how many nodes will it hit? I understand that with a replication factor of three the information will be written three times, and I understand that there are coordinator nodes (named differently), so I ask myself what the impact of storing this information will be.
Is my understanding right that for a read the cluster will send those blobs to three servers (the one that requested the information, one coordinator and one data server) and that for a write it is 5 (1+1+3)?
Generally speaking, MySQL Cluster only supports NoOfReplicas=2 right now. Using 3 or 4 is generally not supported and not very well tested, as noted here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-noofreplicas
"The maximum possible value is 4; currently, only the values 1 and 2 are actually supported."
As also described at the above URL, the data is stored with the same number of replicas as this setting, so with NoOfReplicas=2 you get 2 copies. These are stored on the ndbd (or ndbmtd) nodes. The management nodes (ndb_mgmd) act as coordinators and as the source of configuration; they do not store any data, and neither does the mysqld node.
If you had 4 data nodes, you would have your entire dataset split in half and then each half is stored on 2 of the 4 data nodes. If you had 8 data nodes, your entire data set would be split into four parts and then each part stored on 2 of the 8 data nodes.
This process is sometimes known as "Partitioning". When a query runs, it is split up and sent to each partition, which processes it locally as much as possible (for example by removing non-matching rows using indexes; this is called engine condition pushdown, see http://dev.mysql.com/doc/refman/5.6/en/condition-pushdown-optimization.html). The results are then aggregated in mysqld for final processing (which may include calculations, joins, sorting, etc.) and returned to the client. The ndb_mgmd nodes do not get involved in the actual data processing in any way.
Data is by default partitioned by the PRIMARY KEY, but you can change this to partition by other columns. Some people use this to ensure that a given query is only processed on a single data node much of the time, for example by partitioning a table to ensure all rows for the same customer are on a single data node rather than spread across them. This may be better, or worse, depending on what you are trying to do.
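The split described above (4 data nodes give two halves, 8 data nodes give four parts) is just the number of data nodes divided by NoOfReplicas. A minimal sketch of that arithmetic follows; the class and method names are made up for illustration and are not part of any MySQL API.

```java
// Hypothetical helper, not a MySQL API: models the arithmetic only.
final class NdbLayout {
    /** Number of parts the dataset is split into; each part is held by noOfReplicas data nodes. */
    static int nodeGroups(int dataNodes, int noOfReplicas) {
        if (noOfReplicas <= 0 || dataNodes % noOfReplicas != 0) {
            throw new IllegalArgumentException("data nodes must be a multiple of NoOfReplicas");
        }
        return dataNodes / noOfReplicas;
    }

    public static void main(String[] args) {
        System.out.println(nodeGroups(4, 2)); // dataset split in half
        System.out.println(nodeGroups(8, 2)); // dataset split into four parts
    }
}
```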
You can read more about data partitioning and replication here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-nodes-groups.html
Note that MySQL Cluster is really not ideal for storing such large data. In any case, you will likely need to tune some settings and try hard to keep your transactions small. There are some specific extra limitations and implications of using BLOB columns, which you can find discussed here:
http://dev.mysql.com/doc/mysql-cluster-excerpt/5.6/en/mysql-cluster-limitations-transactions.html
If you go ahead, I would run comprehensive tests to ensure it performs well under high load, and make sure you set up good monitoring and test your failure scenarios.
Lastly, I would also strongly recommend getting pre-sales support and a support contract from Oracle, as MySQL Cluster is quite a complicated product and needs to be configured and used correctly to get the best out of it. In the interest of disclosure, I work for Oracle in MySQL Support -- so you can take that recommendation as either biased or very well informed.
"The output column "A" (67) on output "Output0" (5) and component "Data Flow Task" (1) is not subsequently used in the Data Flow task. Removing this unused output column can increase Data Flow task performance."
Please resolve my problem
These warnings indicate that you have columns in your data flow that are not used. A Data Flow works by allocating "buckets" of fixed-size memory, filling them with data from the source and allowing the downstream components to directly access the memory address to perform synchronous transformations.
Memory is a finite resource. If SSIS detects it has 1 GB to work with and one row of data costs 4096 KB, then you could have at most 256 rows of data in the pipeline before running out of memory. Those 256 rows would get split into N buckets of rows because, as much as you can, you want to perform set-based operations when working with databases.
Why does all this matter? SSIS detects whether you've used everything you've brought into the pipeline. If a column is never used, you're wasting memory. By excluding unused columns, you reduce the memory required for each row from 4096 KB down to 1024 KB, and now you can have 1024 rows in the pipeline just by taking only what you needed.
How do you get there? In your data source, write a query instead of selecting a table. Don't use SELECT * FROM myTable; instead, explicitly enumerate all of the columns you need and nothing more. The same goes for Flat File Sources: uncheck the columns that are never used. You'll still pay a disk penalty for having to read the whole row in, but those columns don't have to hit your DF and consume that memory. Same story for any Lookups: only query the data you need.
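The row-count arithmetic can be checked with a few lines of code. This is only an illustration: the helper name is invented, and it assumes the per-row costs are in KB, which is what makes the 256 and 1024 figures work out against a 1 GB budget.

```java
// Hypothetical helper: how many rows fit in a fixed memory budget.
final class PipelineRowBudget {
    static long rowsThatFit(long budgetKb, long rowCostKb) {
        return budgetKb / rowCostKb;
    }

    public static void main(String[] args) {
        long oneGbInKb = 1024L * 1024L;                    // assumption: 1 GB budget, expressed in KB
        System.out.println(rowsThatFit(oneGbInKb, 4096));  // wide row with unused columns
        System.out.println(rowsThatFit(oneGbInKb, 1024));  // trimmed row
    }
}
```

Cutting the row width to a quarter quadruples the number of rows that fit in the same budget.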
Asynchronous components are the last thing to be aware of as this has turned into a diatribe on performance. The above calculations are much like freshman calculus classes: assume a cow is a sphere to make the math easier. Asynchronous components result in your memory being split before and after the component. They radically change the shape of the rows going through a component such that downstream components can't reuse the address space above it. This results in a physical memory copy which is a slow operation.
My final comment though is if your package is performing adequately, finishing in an acceptable time frame, unless you have nothing else to do, leave it be and go on to your next task. These are just warnings and should not "grow up" to full blown errors.
I have a script that generates about 20,000 small objects with about 8 simple properties. My desire was to toss these objects into ScriptDb for later processing of the data.
What I'm experiencing, though, is that even with a saveBatch operation the process takes much longer than desired and then silently stops. It often runs past the 5-minute execution limit, though without throwing any error. The script runs so long that I've not attempted to check a mutation result to see what didn't make it, but from a check after execution it appears that most objects do not.
So though I'm quite certain that my collection of objects is below the storage size limit, is there a lesser known limit or throttle on accesses that is causing me problems? Are the number of objects the culprit here, should I be instead attempting to save one big object that's a collection of the lessers?
I think it's the amount of data you're writing. I know you can store 20,000 small objects, you just can't write that much in 5 minutes. Write 1000 then quit. Write the next thousand, etc. Run your function 20 times and the data is loaded. If you need to do this more/automated, use ScriptApp.
So I have a bit of a performance problem. I have made a java program that constructs a database. The problem is when loading in the data. I am loading in 5,000 files into a sql Database. When the program starts off, it can process about 10% of the files in 10 minutes however it gets much slower as it progresses. Currently at 28% it is going to finish in 16 hours at its current rate. However that rate is slowing down considerably.
My question is why does the program get progressively slower as it runs and how to fix that.
EDIT: I have two versions. One is threaded (capped at 5 threads) and one is not. The difference between the two is negligible. I can post the code again if anyone likes, but I took it out because I am now fairly certain that the bottleneck is MySQL (also appropriately retagged). I went ahead and used batch inserts. This did cause an initial increase in speed, but once again, after processing about 30% of the data, it drops off quickly.
So SQL Points
My Engine for all 64 tables is InnoDB version 10.
The tables have about 300k rows at this point (~30% of the data)
All tables have one "joint" primary key: an id and a date.
Looking at MySQL Workbench I see that there is a query per thread (5 queries)
I am not sure of the unit of time (I am just reading from MySQL Administrator), but the queries to check if a file is already inserted are taking ~300. (This query should be fast as it is a SELECT MyIndex from MyTable Limit 1 to 1 where Date = date.) As I have been starting and stopping the program, I built in this check to see if the file was already inserted. That way I am able to start it after each change and see what, if any, improvement there is without starting the process again.
I am fairly certain that the degradation of performance is related to the tables' sizes. (I can stop and start the program now and the process remains slow. It is only when the tables are small that the process runs at an acceptable speed.)
Please, please ask and I will post what ever information you need.
DONE! Well I just let it run for the 4 Days it needed to. Thank you all for the help.
Cheers,
--Orlan
Q1: Why does the program get progressively slower?
In your problem space, you have 2 systems interacting: a producer that reads from the file system and produces data, and a consumer that transforms that data into records and stores them in the db. Your code currently hard-links these two processes, so your system works at the speed of the slower of the two.
In your program you have a fixed arrival rate (1/sec - the wait when you've more than 10 threads running). If you have indexes on the tables being filled, inserts will take longer as the tables grow bigger. That means that while your arrival rate is fixed at 1/sec, your exit rate is continuously decreasing. Therefore, you will be creating more and more threads that share the same CPU/IO resources and getting less done per unit of time. Creating threads is also a very expensive operation.
Q2: Could it have to do with how I am constructing the queries from Strings?
Only partially. Your string manipulation is a fixed cost in the system. It increases the cost of servicing one request. But string operations are CPU-bound and your problem is I/O-bound, meaning that improving the string handling (which you should do) will only marginally improve the performance of the system. (See Amdahl's Law.)
Q3: how to fix that (performance issue)
Separate the file reader process from the db insert process. See the Producer-Consumer pattern. See also CompletionService for an implementation built into the JDK:
(FileReaderProducer) --> queue --> (DBBulkInsertConsumer)
Don't create new Threads. Use the facilities provided by the java.util.concurrent package, like the executor service or the Completion service mentioned above. For a "bare" threadpool, use the Executors factory.
For this specific problem, having 2 separate thread pools (one for the consumer, one for the producer) will allow you to tune your system for best performance. File reading improves with parallelization (up to your I/O bound), but db inserts do not (I/O + indexes + relational consistency checks), so you might need to limit the number of file reading threads (3-5) to match the insertion rate (2-3). You can monitor the queue size to evaluate your system performance.
Use JDBC bulk inserts: http://viralpatel.net/blogs/batch-insert-in-java-jdbc/
Use StringBuilder instead of String concatenation. Strings in Java are immutable. That means that every time you do: myString += ","; you are creating a new String and making the old String eligible for garbage collection. In turn, this increases the time spent on garbage collection.
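To make the (FileReaderProducer) --> queue --> (DBBulkInsertConsumer) layout concrete, here is a minimal sketch with two fixed thread pools and a bounded queue between them. It is not the asker's code: FileLoadPipeline, BatchSink and the "row from ..." placeholder are invented for illustration, and the sink stands in for whatever bulk insert you actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class FileLoadPipeline {
    // Unique end marker, compared by identity so it can never collide with real data.
    private static final String END = new String("END");

    /** Hypothetical stand-in for the real bulk insert; must be thread-safe. */
    interface BatchSink {
        void write(List<String> batch);
    }

    static void run(List<String> files, int readers, int writers, int batchSize, BatchSink sink) {
        // Bounded queue: readers block when the writers fall behind.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1000);
        ExecutorService readerPool = Executors.newFixedThreadPool(readers);
        ExecutorService writerPool = Executors.newFixedThreadPool(writers);
        try {
            for (int w = 0; w < writers; w++) {
                writerPool.submit(() -> drain(queue, batchSize, sink));
            }
            for (String file : files) {
                readerPool.submit(() -> {
                    try {
                        queue.put("row from " + file); // real code would read and parse the file here
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            readerPool.shutdown();
            readerPool.awaitTermination(1, TimeUnit.HOURS);
            for (int w = 0; w < writers; w++) {
                queue.put(END); // one end marker per writer
            }
            writerPool.shutdown();
            writerPool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while loading", e);
        }
    }

    // Consumer loop: collect rows into batches and hand each full batch to the sink.
    private static void drain(BlockingQueue<String> queue, int batchSize, BatchSink sink) {
        List<String> batch = new ArrayList<>(batchSize);
        try {
            while (true) {
                String row = queue.take();
                if (row == END) break;
                batch.add(row);
                if (batch.size() == batchSize) {
                    sink.write(new ArrayList<>(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) sink.write(batch); // flush the final partial batch
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A call such as FileLoadPipeline.run(files, 4, 2, 100, batch -> insert(batch)) would give four readers and two writers, where insert is your own method. The two pool sizes are the knobs to tune against each other.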
You can use direct insert from file to database (read here). It works faster. When I do same for postgres I get 20 times performance increase.
Also download the YourKit profiler and profile your application for performance. Then you will see what takes up your time.
There's a number of things in your code that could contribute to the speed problems and you are correct in suspecting that the Strings play a role.
Take for example this code:
String rowsString = "";
// - an extra 1 to not have a comma at the end
for (int i = 0; i <= numberOfRows - 3; i++) {
    rowsString += "(DATA), \n";
}
rowsString += "(DATA)";
Depending on how many rows there are, this is a potential bottle-neck and memory hog. I think it's best if you use a StringBuilder here. I see a lot of String manipulation that are better suited to StringBuilders. Might I suggest you read up on String handling a bit and optimise these, especially where you += Strings?
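As a rough illustration of the StringBuilder suggestion, the same kind of comma-separated tuple list can be built like this. The class name is invented, and the method takes the number of tuples directly rather than reproducing the loop bounds of the snippet above.

```java
final class ValuesClauseBuilder {
    /** Builds rowCount "(DATA)" tuples separated by ", \n", with no trailing comma. */
    static String build(int rowCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rowCount; i++) {
            if (i > 0) {
                sb.append(", \n"); // separator goes before every tuple except the first
            }
            sb.append("(DATA)");
        }
        return sb.toString();
    }
}
```

Appending to one StringBuilder avoids creating a new String on every iteration, which is the cost the += loop pays.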
Then the next question is how is your table designed? There could be things making your inserts slow, like incorrect default lengths for varchars, no indexes or too many indexes etc.
Most databases load data more efficiently if
you load in batches of data,
you load in a relatively small number of threads, e.g. one or two.
As you add more threads you add more overhead, so you expect it to be slower.
Try using an ExecutorService with a fixed size pool e.g. 2-4 and try loading the data in batches of say 100 at a time in a transaction.
You have several good tried and tested options for speeding up database access.
Use an ExecutorService for your threads. This may not help speed-wise but it will help you implement the following.
Hold a ThreadLocal Connection instead of making a new connection for every file. Also, obviously, don't close it.
Create a single PreparedStatement instead of making a new one every time around.
Batch up your statement executions.
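The last two points (one PreparedStatement, batched executions) might look roughly like the sketch below. The table name, column list and batch handling are assumptions made for illustration, not the asker's schema; only the standard java.sql calls (prepareStatement, setObject, addBatch, executeBatch, commit, rollback) are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class BatchInserter {
    // Hypothetical table and columns; substitute your own schema.
    static final String INSERT_SQL = "INSERT INTO my_table (id, date, value) VALUES (?, ?, ?)";

    /** Inserts all rows using one PreparedStatement; returns the number of batches executed. */
    static int insertAll(Connection conn, List<Object[]> rows, int batchSize) throws SQLException {
        boolean previousAutoCommit = conn.getAutoCommit();
        conn.setAutoCommit(false); // commit once per batch instead of once per row
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            int pending = 0;
            for (Object[] row : rows) {
                for (int i = 0; i < row.length; i++) {
                    ps.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                }
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                    batches++;
                }
            }
            if (pending > 0) { // flush the final partial batch
                ps.executeBatch();
                conn.commit();
                batches++;
            }
        } catch (SQLException e) {
            conn.rollback(); // discard the uncommitted batch
            throw e;
        } finally {
            conn.setAutoCommit(previousAutoCommit);
        }
        return batches;
    }
}
```

Each worker thread would call this with its own Connection (for example the ThreadLocal one suggested above), since a JDBC Connection should not be shared across threads for this kind of work.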