I am using SSIS to connect to a legecy mainframe database and this allows only 5 concurrent connections at a time.
I have a dataflow task with many tables to transfer and it kicks outs because of this limitation.
I have split up the Data Flow task into seperate data flows and this is working for the moment, but it is not optiomal as they need to be sequenced and 1 large transfer in a flow is holding up subsequent transfers.
Anyone any idea of how to limit the number of connections in a single data flow, I had a look at using the Engine Threads but this did not make any difference.
Any help much appericated.
The connection object you are using for your tasks should have a property named 'RetainSameConnection'. This should cause the same connection to be used across all tasks. At least this is true for OLEDB connection types. I don't know if ADO.NET connections have the same property. They probably do.
Here is an article for more information: http://munishbansal.wordpress.com/2009/04/01/how-to-retain-same-data-connection-across-multiple-tasks-in-ssis/
Related
I can see I'm not the only person who's experienced an issue with the SSIS Transfer Database Object Task and timeouts, however, people using this for the extract phase of an ETL must be something fairly common, so I'm trying to establish what is the usual/accepted way to do this.
I have a web application that uses Entity Framework to generate ~250 tables, some of which occasionally have schema updates.
The bulk of the transform and load portion of our ETL is handled by a series of stored procedures, however, these read from a copy of the application's tables that are initially loaded in the Transfer Database Objects task.
Initially, we set up an SSIS package that simply ran the Transfer Database Objects task, and then kicked off the stored proc. That meant that the job was fairly resilient to change, and the only changes required were changes to the stored proc, if and when a schema update affected the tables that were used therein.
Unfortunately, as one of our application instances has grown over time, the Transfer Database Objects task is reaching the point where I'm regularly seeing Timeout errors. Those don't appear to be connection timeouts, or anything I can control on the server side, and from what I can see, I can't amend the CommandTimeout on the underlying SMO stuff within that Task.
I can see that some people manually craft their extract, such that they run a separate Data Flow task to pull the information from each table, which has the obvious bonus that these can be run in parallel, however, in my case, that's going to mean an initial chunk of work to craft 250ish of these, and a maintenance task whenever the schema changes on the source database, no matter how minor.
I've come across Biml, which looked like a possible way to at least ease that overhead, however, it doesn't appear this can run on VS2017 yet.
Does anyone have any particular patterns they follow for this, or if I do need individual data flow tasks, is there some way to automate the schema update, perhaps using some kind of SSIS automation and something from the entity framework?
It turns out the easiest way around this is to write a clone of the Transfer task, but with appropriate additions to allow more control over batching and timeouts etc. Details are available in this article: https://blogs.msdn.microsoft.com/mattm/2007/04/18/roll-your-own-transfer-sql-server-objects-task/
I have a pipeline taking data from a MySQl server and inserting into a Datastore using DataFlow Runner.
It works fine as a batch job executing once. The thing is that I want to get the new data from the MySQL server in near real-time into the Datastore but the JdbcIO gives bounded data as source (as it is the result of a query) so my pipeline is executing only once.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
It is similar to the topic Running periodic Dataflow job but I can not find the CountingInput class. I thought that maybe it changed for the GenerateSequence class but I don't really understand how to use it.
Any help would be welcome!
This is possible and there's a couple ways you can go about it. It depends on the structure of your database and whether it admits efficiently finding new elements that appeared since the last sync. E.g., do your elements have an insertion timestamp? Can you afford to have another table in MySQL containing the last timestamp that has been saved to Datastore?
You can, indeed, use GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)) that will give you a PCollection<Long> into which 1 element per second is emitted. You can piggyback on that PCollection with a ParDo (or a more complex chain of transforms) that does the necessary periodic synchronization. You may find JdbcIO.readAll() handy because it can take a PCollection of query parameters and so can be triggered every time a new element in a PCollection appears.
If the amount of data in MySql is not that large (at most, something like hundreds of thousands of records), you can use the Watch.growthOf() transform to continually poll the entire database (using regular JDBC APIs) and emit new elements.
That said, what Andrew suggested (emitting records additionally to Pubsub) is also a very valid approach.
Do I have to execute the pipeline and resubmit a Dataflow job every 30 seconds?
Yes. For bounded data sources, it is not possible to have the Dataflow job continually read from MySQL. When using the JdbcIO class, a new job must be deployed each time.
Or is there a way to make the pipeline redoing it automatically without having to submit another job?
A better approach would be to have whatever system is inserting records into MySQL also publish a message to a Pub/Sub topic. Since Pub/Sub is an unbounded data source, Dataflow can continually pull messages from it.
I haven't got a lot of ETL experience but I haven't found the answer to my question either, although I guess it may be a no-brainer if you've worked with it. We're currently looking into creating a simple data warehouse (simple as in "copy most columns from most tables" and not OLAP-style) and it seems we're leaning towards SQL Server (2008) for a few reasons.
SSIS seems to be the tool for this kind of tasks when it comes to SQL Server, but I can't find anything about how it is affecting the source database cache, if at all, when loading data. Some of our installations are very sensitive performance-wise when it comes to having a usage-style-cache.
But if SSIS runs a "select *"-ish query and the cache is altered, then the performance for the users may degrade to unacceptable levels until it is rebuild from those queries again.
So my question is, does SSIS (or is there a way to avoid) affect the database cache when loading data from a SQL Server database?
Part of the problem is also that the source database could be both an Oracle or SQL Server database, so if there is a way to avoid the cache-affecting part for Oracle, that would be good input as well. (I guess the Attunity connector is the way to go?)
(Some additional info: We have considered plain files as well, but then export-import probably takes longer time than SSIS-transfer? I also guess change data capture is something we'll also look into, so if that is relevant to this question, feel free to include possible issues/benefits.)
Any other relevant suggestions are also welcome!
Thanks!
Tackling the SQL Server side:
First off, SSIS doesn't do anything special to avoid the buffer pool, or the plan cache.
Simple test (on a NON-production instance!):
Create a new SSIS package with a single connection manaager, and a single data flow containing one OLE DB Source, pointing to a table, similar to:
Clear the buffer pool, from SSMS: DBCC DROPCLEANBUFFERS
Verify that the cache has been cleared using the glorified dm_os_buffer_descriptors query at the top of this page: I get this:
Run the package
Re-run the query from step (2), and note that the data pages for the table (BOM_PIECE in my example) have been loaded into the cache:
Note that most SSIS components allow you to provide your own query, so if you have a way to avoid the buffer pool (I don't know that this is possible - I'd defer to someone who knows more about it), you could insert that into a query. So in the above example, instead of selecting Table or view in the OLE DB Source, you would select SQL command, or SQL command from variable if your command requires dynamic text.
Finally, I can imagine why you want to eliminate the cache load - but are you sure you want to do this? SQL Server is fairly good at managing memory, and what you're doing is swapping memory load for disk I/O load, which (depending on your use case) may have a negative impact on other users. This question has a discussion on SQL Server caching.
Read this article about Attunity regarding reading data from oracle
What do you mean "affect the database cache when loading data from a SQL Server database". SQL Server does not cache data, it caches execution plans. The fact that you are using SSIS wont affect your Server (other than the overhead of reading the data of course). Just use a propper transaction isolation level.
Also, read about the fast load property on SSIS components
About change data capture, I don't see how it can replace SSIS. You can use CDC to select the rows that will be loaded, but it wont do the loading for you.
Overview of the application:
I have a Delphi application that allows a user to define a number of queries, and run them concurrently over multiple MySQL databases. There is a limit on the number of threads that can be run at once (which the user can set). The user selects the queries to run, and the systems to run the queries on. Each thread runs the specified query on the specified system using a TADOQuery component.
Description of the problem:
When the queries retrieve a low number of records, the application works fine, even when lots of threads (up to about 100) are submitted. The application can also handle larger numbers of records(150,000+) as long as only a few threads (up to about 8) are running at once. However, when the user is running more than around 10 queries at once (i.e. 10+ threads), and each thread is retrieving around 150,000+ records, we start getting errors. Here are the specific error messages that we have encountered so far:
a: Not enough storage is available to complete this operation
b: OLE error 80040E05
c: Unspecified error
d: Thread creation error: Not enough storage is available to process this command
e: Object was open
f: ODBC Driver does not support the requested properties
Evidently, the errors are due to a combination of factors: number of threads, amount of data retrieved per thread, and possibly the MySQL server configuration.
The main question really is why are the errors occurring? I appreciate that it appears to be in some way related to resources, but given the different errors that are being returned, I'd like to get my head around exactly why the errors are cropping up. Is it down to resources on the PC, or something to do with the configuration of the server, for example.
The follow up question is what can we do to avoid getting the problems? We're currently throttling down the application by lowering the number of threads that can be run concurrently. We can't force the user to retrieve less records as the queries are totally user defined and if they want to retrieve 200,000 records, then that's up to them, so there's not much that we can do about that side of things. Realistically, we don't want to throttle down the speed of the application because most users will be retrieving small amounts of data, and we don't want to make the application to slow for them to use, and although the number of threads can be changed by the user, we'd rather get to the root of the problem and try to fix it without having to rely on tweaking the configuration all the time.
It looks you're loading a lot of data client-side. They may require to be cached in the client memory (especially if you use bidirectional cursors), and in a 32 bit application that could not be enough, depending on the average row size and how efficient is the library to store rows.
Usually the best way to accomplish database work is to perform that on the server directly, without retrieving data to the client. Usually databases have an efficient cache system and can write data out to disk when they don't fit in memory.
Why do you retrieve 150000 rows at once? You could use a mechanism to transfer data only when the user actually access them (sort of paging through data), to avoid large chunks of "wasted" memory.
This makes perfect sense (the fact you're having problems, not the specific errors). Think it through - you have the equivalent of 10 database connections (1 per thread) each receiving 150,000 rows of data (1,500,000 rows total) across a single network connection. Even if you're not using client-side cursors and the rows are small (just a few small columns), this is a HUGE flow of data across a single network interface, and a big hit on memory on the client computer.
I'd suspect the error messages are incorrect, in the same way that sometimes you have an access violation caused by a memory overwrite in another code location.
Depending on your DBMS, to help with the problem you could use the LIMIT/TOP sql clauses to limit the amountof data returned.
Things I would do:
write a very simple test application which only uses the necessary parts of the connection / query creation (with threads), this would eliminate all side effects caused by other parts of your software
use a different database access layer instead of ODBC, to find out if the ODBC driver is the root cause of the problem
it looks like the memory usage is no problem when the number of threads is low - to verify this, I would also measure / calculate the memory requirement of the records, compare it with the memory usage of the application in the operating system. For example if tests show that four threads can safely query 1.5 GB of total data without problems, but ten threads fail with under 0.5 GB of total data, I would say it is a threading problem
I have read that SSIS performs variable lock in scripts. So when scriptA and script B use the same variable one of them will wait for the other to finish.
Still I dont know id this is the same for Connection Strings.
For example I have two HTTP Connections being used in two webservices. if they are called constantly I am testing what happens.
Your assumption about SSIS variables is incorrect. SSIS doesn't perform "semaphore" type locking on variables. It won't "serialize" execution in order to allow two scripts that use the same variables to execute without interfering with each other. SSIS will fail the script that attempts to use a variable that is already in use by another script. If you think about it, it makes complete sense - having it operate any other way invites unpredictable race conditions.
There is no such thing as a "connection string" in SSIS. They are Connection Managers. An HTTP connection manager will manage a pool of HTTP connections. But I don't understand how that relates to "locking"?