I am working on a data warehousing project where several systems are loading data into a staging area for subsequent processing. Each table has a "loadId" column which is a foreign key against the "loads" table, which contains information such as the time of the load, the user account, etc.
Currently, the source system calls a stored procedure to get a new loadId, adds the loadId to each row that will be inserted, and then calls a second sproc to indicate that the load is finished.
My question is, is there any way to avoid having to pass the loadId back to the source system? For example, I was imagining that I could get some sort of connection ID from SQL Server that I could use to look up the relevant loadId in the loads table. But I am not sure whether SQL Server has a variable that is unique to a connection.
Does anyone know?
Thanks,
I assume the source systems are writing/committing the inserts into your source tables, and multiple loads are NOT running at the same time...
If so, have the source load call a stored proc, newLoadStarting(), prior to starting the load proc. This stored proc will update the load table (creating a new row and recording the start time).
Put a trigger on each staging table that gets max(loadID) from the load table and inserts it into the loadID column as the current load ID.
For completeness you could add an endLoading() proc which sets an end date and de-activates that particular load.
If you are running multiple loads at the same time in the same tables...stop doing that...it's not very productive.
A local temp table (with one pound sign, #temp) is unique to the session. Dump the ID in there, then select from it.
BTW, this will only work if you use the same connection.
In the end, I went for the following solution "pattern", pretty similar to what Markus was suggesting:
I created a table with a loadId column, default null (plus some other audit info like createdDate and createdByUser);
I created a view on the table that hides the loadId and audit columns, and only shows rows where loadId is null;
The source systems load/view data into the view, not the table;
When they are done, the source system calls a "sp__loadFinished" procedure, which puts the right value in the loadId column and does some other logging (number of rows received, date called, etc). I generate this from a template as it is repetitive.
Because loadId now has a value for all those rows, it is no longer visible to the source system and it can start another load if required.
I also arrange for each source system to have its own schema, which is the only thing it can see and is its default on logon. The view and the sproc are in this schema, but the underlying table is in a "staging" schema containing data across all the sources. I ensure there are no collisions through a naming convention.
Works like a charm, including the one case where a load can only be complete if two tables have been updated.
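The view-plus-"load finished" pattern described above can be sketched roughly as follows. This is only an illustration under loud assumptions: it uses Python with an in-memory SQLite database standing in for SQL Server, and the names staging_orders, orders_in_progress, loads and load_finished are hypothetical, not the ones from my project. SQLite views are read-only, so the sketch needs an INSTEAD OF trigger to forward inserts; a simple single-table SQL Server view would not.

```python
import sqlite3


def build_staging(conn):
    """Create a hypothetical loads table, staging table, and the view sources load into."""
    conn.executescript("""
        CREATE TABLE loads (
            load_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            finished_at TEXT NOT NULL,
            row_count   INTEGER NOT NULL
        );
        CREATE TABLE staging_orders (
            order_ref TEXT NOT NULL,
            qty       INTEGER NOT NULL,
            load_id   INTEGER DEFAULT NULL REFERENCES loads(load_id)
        );
        -- The view hides load_id and shows only rows not yet assigned to a load.
        CREATE VIEW orders_in_progress AS
            SELECT order_ref, qty FROM staging_orders WHERE load_id IS NULL;
        -- SQLite-specific: forward inserts on the view to the underlying table.
        CREATE TRIGGER orders_in_progress_insert
        INSTEAD OF INSERT ON orders_in_progress
        BEGIN
            INSERT INTO staging_orders (order_ref, qty)
            VALUES (NEW.order_ref, NEW.qty);
        END;
    """)


def load_finished(conn):
    """Stamp all unassigned rows with a new load id and log how many there were."""
    with conn:  # one transaction: the loads row and the stamping commit together
        pending = conn.execute(
            "SELECT COUNT(*) FROM staging_orders WHERE load_id IS NULL"
        ).fetchone()[0]
        cur = conn.execute(
            "INSERT INTO loads (finished_at, row_count) VALUES (datetime('now'), ?)",
            (pending,),
        )
        load_id = cur.lastrowid
        conn.execute(
            "UPDATE staging_orders SET load_id = ? WHERE load_id IS NULL",
            (load_id,),
        )
    return load_id
```

The source system only ever touches orders_in_progress and load_finished, so it never needs to know the load id.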
I have a requirement to execute a MySQL procedure (the procedure name is the same across all sites) for different sites. I have a Foreach Loop container that gets the server and database details, and inside it a Data Flow Task. In the Data Flow Task I created a source on a MySQL connection that calls the stored procedure like this:
CALL spITProd( '2022-09-25 20:04:22.847000000', '2022-10-25 20:04:22.847000000' );
This returns different metadata for different sites. To be precise, the first 16 columns are the same for all sites; after that, the number of columns and the column names can vary. My requirement is to store the first 16 columns in a destination table and all the other columns in a different table by unpivoting them. I was able to achieve this for one site, as shown in the SSIS Data Flow Task below.
I want to automate this process for all the sites. I know I can create multiple Data Flow Tasks, but that's not what I want to do. When the source connects to a new site it fails because it encounters new metadata (as I said, the number of columns and the column names change after the first 16 columns). Please suggest any ideas on how to dynamically read the columns from the procedure. I don't have any permissions on the MySQL database; it is managed by the vendor and I only have permission to call the procedure. I tried to store the data in a System.Object variable, but I am not sure how to use that later, as I can't create a temp table on the destination from it. I'd appreciate any ideas on how to handle this requirement.
Mentioned the details above
Imagine that you want to save in a variable the number of rows that were updated or deleted in a table.
These are the steps that I did:
First, in the Control Flow I created a Data Flow Task.
Then, in the Data Flow, I created a source (in my case an Excel file). I then created two variables to count those rows, countDeleted and countUpdated, connected the variables to two Row Count transformations, and then connected my destination (OLE DB).
Now, in the Control Flow, what do I do?
Create an Execute SQL Task? Or a Script Task? What is the best way to do it? What piece of code should I use?
Thanks for your help.
PS: I only have 4 weeks of SSIS, sorry for my noobieness :)
An OLE DB destination only inserts. It can't UPDATE or DELETE.
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS you will eventually find advice to use the OLE DB Command to perform row by row delete and inserts.
In my opinion this is to be avoided. It does not scale (it works fine for small recordsets, then fails for large ones), and it is difficult to maintain parameter mappings in the OLE DB Command. You should still try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data, and use @@ROWCOUNT to capture the number of records updated.
For example;
Your existing described dataflow can be used to load into a table called StagingTable
Before your dataflow you should run an Execute SQL Task (This is in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working - repeatedly running your package clears the staging table then loads Excel into it without creating duplicates
This in itself is a challenge as Excel is a terrible data interchange format.
Once you have that working, you add an execute SQL task to the end that runs some SQL that deletes the records you want and captures the count. For example:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT @@ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable([Table],Operation,Type)
SELECT 'MyFinalTable','Delete', @@ROWCOUNT;
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do in a database. Although it does depend on the person who has to eventually maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward code based approach as opposed to having to dig around in property pages and events and other places inside SSIS packages.
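To make the "delete in one batch, then log the count" idea concrete, here is a minimal sketch. It is not T-SQL and not SSIS: it assumes Python's sqlite3 module as a stand-in database, uses cursor.rowcount where T-SQL would use @@ROWCOUNT, and the table and column names (final_table, staging_table, log_table, row_count) are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE final_table   (primary_key INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE staging_table (primary_key INTEGER PRIMARY KEY, payload TEXT);
        CREATE TABLE log_table     (table_name TEXT, operation TEXT, row_count INTEGER);
    """)
    return conn


def delete_matching_and_log(conn):
    """Delete final_table rows whose key is in staging_table and log how many went."""
    with conn:  # one transaction: the delete and its log row commit together
        cur = conn.execute(
            "DELETE FROM final_table "
            "WHERE primary_key IN (SELECT primary_key FROM staging_table)"
        )
        deleted = cur.rowcount  # rows affected by the DELETE just executed
        conn.execute(
            "INSERT INTO log_table (table_name, operation, row_count) VALUES (?, ?, ?)",
            ("final_table", "Delete", deleted),
        )
    return deleted
```

As with @@ROWCOUNT, the count has to be read straight after the statement it describes, before anything else runs on that cursor.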
I assume that you're using an Execute SQL Task for the updates and deletes? As @Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can find the number of rows updated, inserted, or deleted in a table through an Execute SQL Task by using the ExecValueVariable property of this task. Set this property to the variable that will hold the row count and it will return the number of affected rows. Note that it will only return the number of rows affected by the last statement in the Execute SQL Task, regardless of how many batches (i.e. GO separators) are in the task.
I have created many SSIS packages in the past, though the need for this one is a bit different than the others which I have written.
Here's the quick description of the business need:
We have a small database on our end sourced from a 3rd party vendor, and this needs to be overwritten nightly.
The source of this data is a bunch of flat files (CSV) from the 3rd party vendor.
Current setup: we truncate the tables of this database, and we then insert the new data from the files, all via SSIS.
Problem: There are times when the files fail to come, and what happens is that we truncate the old data, though we don't have the fresh data set. This leaves us without a database where we would prefer to have yesterday's data over no data at all.
Desired Solution: I would like some sort of mechanism to see if the new data truly exists (these files) prior to truncating our current data.
What I have tried: I tried to capture the data from the files and add them to an ADO recordset and only proceeding if this part was successful. This doesn't seem to work for me, as I have all the data capture activities in one data flow and I don't see a way for me to reuse that data. It would seem wasteful of resources for me to do that and let the in-memory tables just sit there.
What have you done in a similar situation?
If the files are not present, set flags such as IsFile1Found to false and pass these flags to a stored procedure that truncates conditionally.
To check whether a file is empty, you can use PowerShell through an Execute Process Task to extract the first two rows. If there are two rows (header + data row), the data file is not empty, and you can then truncate the table and import the data.
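The "header plus at least one data row" check can be sketched like this. It is written in Python rather than PowerShell purely for illustration, and the function name has_data_rows is hypothetical; it assumes the vendor files are ordinary CSV with a header line.

```python
import csv
from pathlib import Path


def has_data_rows(path):
    """Return True if the CSV at `path` exists and holds a header plus one data row."""
    p = Path(path)
    if not p.is_file():
        return False  # a missing file counts as "no fresh data"
    with p.open(newline="") as handle:
        # Skip blank lines so a header followed only by empty lines counts as empty.
        rows = (row for row in csv.reader(handle) if row)
        header = next(rows, None)
        first_data_row = next(rows, None)
    return header is not None and first_data_row is not None
```

Only the first two non-blank lines are read, so the check stays cheap even for large files. A package would run it for every expected file and truncate only if all of them pass.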
Another approach could be:
Load the data into staging tables, insert from the staging tables into the destination tables using a SQL stored procedure, and truncate the staging tables once the data has been moved to all the destination tables. This way, before truncating a destination table you can check whether its staging table is empty.
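A rough sketch of the staging approach, assuming Python's sqlite3 module in place of SQL Server and a stored procedure, with hypothetical table names staging_vendor and vendor_data that share the same column layout:

```python
import sqlite3


def refresh_from_staging(conn):
    """Swap staged rows into the destination, but only if staging is non-empty.

    Returns True when the destination was refreshed and False when
    yesterday's data was left in place.
    """
    staged = conn.execute("SELECT COUNT(*) FROM staging_vendor").fetchone()[0]
    if staged == 0:
        return False  # nothing arrived: keep the old data
    with conn:  # one transaction, so a failure cannot leave the destination empty
        conn.execute("DELETE FROM vendor_data")
        conn.execute("INSERT INTO vendor_data SELECT * FROM staging_vendor")
        conn.execute("DELETE FROM staging_vendor")
    return True
```

Doing the clear and the reload in one transaction is the point of the design: if the insert fails, the delete is rolled back too.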
I looked around and found that some others were struggling with the same issue, though none of them had a very elegant solution, nor do I.
What I ended up doing was to create a flat file connection to each file of interest and have a task count records and save to a variable. If a file isn't there, the package fails and you can stop execution at that point. There are some of these files whose actual count is interesting to me, though for the most part, I don't care. If you don't care what the counts are, you can keep recycling the same variable; this will reduce the creation of variables on your end (I needed 31). In order to preserve resources (read: reduce package execution time), I excluded all but one of the columns in each data source; it made a tremendous difference.
I'm migrating the data from an Access database to SQL Server via the SQL Server Migration Assistant (SSMA). The Access application will continue to be used with the local tables converted to linked tables.
One continuous form hangs for 15 - 30 seconds when it's loading. It displays approximately 2000 records. When I looked in SQL Server Profiler to see what it was doing, it was making a separate call to the backend database for each record in the form. So the delay when the form opens is caused by the 2000-odd separate calls to the database.
This is amazingly inefficient. Is there any way to get Access to make a single call to the backend database and retrieve all the records at once?
I don't know if this is relevant but the Record Source for the form is a view in the SQL Server backend database, which is linked to via an Access linked table (so, hopefully, Access just sees it as a table, not a view). I needed an Instead Of trigger on the view in SQL Server, and a unique index on the linked table in Access, to allow the records to be updated via the form.
If the act of opening that continuous form really does generate ~2000 separate SQL queries (one for every row in the view) then that is unusual behaviour for Access interacting with a SQL Server linked "table". Under normal circumstances what takes place is:
Access submits a single query to return all of the Primary Key values for all rows in the table/view. This query may be filtered and/or sorted by other columns based on the Filter and Order By properties of the form. This gives Access a list of the key values for every row that might be displayed in the form, in the order in which they will appear.
Access then creates a SQL prepared statement using sp_prepexec to retrieve entire rows from the table/view ten (10) rows at a time. The first call looks something like this...
declare @p1 int
set @p1=4
exec sp_prepexec @p1 output,N'@P1 int,@P2 int,@P3 int,@P4 int,@P5 int,@P6 int,@P7 int,@P8 int,@P9 int,@P10 int',N'SELECT "ID","AgentName" FROM "dbo"."myTbl" WHERE "ID" = @P1 OR "ID" = @P2 OR "ID" = @P3 OR "ID" = @P4 OR "ID" = @P5 OR "ID" = @P6 OR "ID" = @P7 OR "ID" = @P8 OR "ID" = @P9 OR "ID" = @P10',358,359,360,361,362,363,364,365,366,367
select @p1
...and each subsequent call uses sp_execute, something like this
exec sp_execute 4,368,369,370,371,372,373,374,375,376,377
Access repeats those calls until it has retrieved enough rows to fill the current page of continuous forms. It then displays those forms immediately.
Once the forms have been displayed, Access will "pre-fetch" a couple of more batches of rows (10 rows each) in anticipation of the user hitting PgDn or starting to scroll down.
If the user clicks the "Last Record" button in the record navigator, Access again uses sp_prepexec and sp_execute to request enough 10-row batches to fill the last page of the form, and possibly pre-fetch another couple of batches in case the user decides to hit PgUp or start scrolling up.
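For readers who want to see the shape of that access pattern outside Access, here is a rough imitation: fetch all the keys first, then fetch full rows ten keys at a time. It is only a sketch, assuming Python's sqlite3 module and a local table named myTbl with the same two columns as above. It does not reproduce Access internals such as prepared statement handles, pre-fetching, or how a short final batch is handled.

```python
import sqlite3

BATCH_SIZE = 10  # rows per batch, as in the trace above


def fetch_in_key_batches(conn, batch_size=BATCH_SIZE):
    """Yield lists of full rows, looked up batch_size primary keys at a time."""
    # Step 1: one query that returns only the key values, in display order.
    keys = [row[0] for row in conn.execute('SELECT "ID" FROM myTbl ORDER BY "ID"')]
    # Step 2: one parameterised query per batch of keys.
    for start in range(0, len(keys), batch_size):
        batch = keys[start:start + batch_size]
        where = " OR ".join('"ID" = ?' for _ in batch)
        query = f'SELECT "ID", "AgentName" FROM myTbl WHERE {where}'
        yield conn.execute(query, batch).fetchall()
```

With 2000 rows this makes about 200 round trips if every batch is consumed, which is why a form that only needs the first screenful opens quickly.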
So in your case if Access really is causing SQL Server to run individual queries for every single row in the view then there may be something particular about your SQL View that is causing it. You could test that by creating an Access linked table to a single SQL Table or a simple one-table SQL View, then use SQL Server Profiler to check if opening that linked table causes the same behaviour.
Turned out the problem was two aggregate fields. One field's Control Source was =Count(ID) and the other field's Control Source was =Sum(Total_Qty).
Clearing the control sources of those two fields allowed the form to open quickly. SQL Server Profiler shows it calling sp_execute, as Gord Thompson described, to retrieve seven batches of 10 rows at a time. Much quicker than making 2000 calls to retrieve one row at a time.
I've come across the same problem again but this time with a different cause. I'm including it here for completeness, to help anyone in a similar situation:
This time the underlying query was hanging and SQL Server Profiler showed the same behaviour as before, with Access making separate calls to the SQL Server database to bring back one record at a time, for every record in the query.
The cause turned out to be the ORDER BY clause in the query. I guess Access had to pull back all records in the linked table from SQL Server before being able to order them. Makes sense when I think of it. Although I don't know why Access doesn't just pull all records through at once, instead of getting the records one at a time.
I would try setting the Recordset Type to Snapshot (on the Data tab of the Form's property sheet and/or the property sheet of the query you are using for the form source)
Is it possible for MySQL database to invoke an external exe file when a new row is added to one of the tables in the database?
I need to monitor the changes in the database, so when a relevant change is made, I need to do some batch jobs outside the database.
Chad Birch has a good idea with using MySQL triggers and a user-defined function. You can find out more in the MySQL CREATE TRIGGER Syntax reference.
But are you sure that you need to call an executable right away when the row is inserted? It seems like that method will be prone to failure, because MySQL might spawn multiple instances of the executable at the same time. If your executable fails, then there will be no record of which rows have been processed yet and which have not. If MySQL is waiting for your executable to finish, then inserting rows might be very slow. Also, if Chad Birch is right, then you will have to recompile MySQL, so it sounds difficult.
Instead of calling the executable directly from MySQL, I would use triggers to simply record the fact that a row got INSERTED or UPDATED: record that information in the database, either with new columns in your existing tables or with a brand new table called say database_changes. Then make an external program that regularly reads the information from the database, processes it, and marks it as done.
Your specific solution will depend on what parameters the external program actually needs.
If your external program needs to know which row was inserted, then your solution could be like this: Make a new table called database_changes with fields date, table_name, and row_id, and for all the other tables, make a trigger like this:
DELIMITER //
CREATE TRIGGER `my_trigger`
AFTER INSERT ON `table_name`
FOR EACH ROW BEGIN
  -- Record which row was inserted so the batch job can pick it up later
  INSERT INTO `database_changes` (`date`, `table_name`, `row_id`)
  VALUES (NOW(), 'table_name', NEW.id);
END//
DELIMITER ;
Then your batch script can do something like this:
1. Select the first row in the database_changes table.
2. Process it.
3. Remove it.
4. Repeat 1-3 until database_changes is empty.
With this approach, you can have more control over when and how the data gets processed, and you can easily check to see whether the data actually got processed (just check to see if the database_changes table is empty).
You could do what replication does: hang on the binary log. Set up your server as a 'master server' and, instead of adding a 'slave server', run mysqlbinlog. You'll get a stream of every command that modifies your database.
Or step in between the client and server: check out MySQL Proxy. You point it at your server and point your client(s) at the proxy. It lets you interpose Lua scripts to monitor, analyze or transform any SQL command.
I think it's going to require adding a User-Defined Function, which I believe requires recompilation:
MySQL FAQ - Triggers: Can triggers call an external application through a UDF?
I think it's really a MUCH better idea to have some external process poll changes to the table and execute the external program - you could also have a column which contains the status of this external program run (e.g. "pending", "failed", "success") - and just select rows where that column is "pending".
It depends how soon the batch job needs to be run. If it's something which needs to be run "sooner or later" and can fail and need to be retried, definitely have an app polling the table and running them as necessary.
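A minimal sketch of the status-column variant, again assuming Python with sqlite3 standing in for MySQL. The jobs table, its id and status columns, and the run_job callback are hypothetical names for illustration.

```python
import sqlite3


def run_pending(conn, run_job):
    """Run run_job for every 'pending' row and record 'success' or 'failed'."""
    pending = conn.execute(
        "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    outcome = {}
    for (job_id,) in pending:
        try:
            run_job(job_id)
            status = "success"
        except Exception:
            # Keep a record of the failure instead of losing track of the row.
            status = "failed"
        with conn:
            conn.execute("UPDATE jobs SET status = ? WHERE id = ?", (status, job_id))
        outcome[job_id] = status
    return outcome
```

Retrying is then a matter of setting failed rows back to 'pending' and letting the next poll pick them up.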