SSIS Redirect on Error - Too many rows

I have an SSIS package that imports a flat CSV file with approximately 200,000 records. I've set up the table that the data imports into with a unique primary key on the account number. There shouldn't be any duplicates in the source data (it's application controlled, outside of my influence).
However, there is 1 duplicate row in the CSV, and when I add the primary key it redirects around 7,000 rows... these aren't duplicate rows; it just appears to redirect a load of them for no reason.
If I manually remove the single duplicate row it works perfectly. There is nothing special about the data or the files; it should just import the data and redirect the error row.

This behavior is due to the OLE DB Destination using fast load mode.
With fast load, the OLE DB Destination issues an INSERT BULK command and inserts rows in batches. If one of the rows within a batch violates a table constraint, the whole batch fails and gets redirected to the Error Output. This explains the behavior that seems strange at first glance: far more than 1 row is rejected.
What you can do about it depends on your goal and limitations.
If you simply want to filter out the duplicate rows, switch the OLE DB Destination to regular insert mode, at the cost of a significant performance decrease. This is the simplest way.
If the performance drop is not an option and you need to keep it simple, use a Sort component in the Data Flow and tick its flag to discard rows with duplicate sort values. Caveat: you have no control over which row gets discarded.
If you need to implement business rules on which data should pass, then you have to implement a scoring column and use it to filter rows, as sketched below. See Todd McDermid's article on this.
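For the scoring approach, a minimal T-SQL sketch of the idea, assuming a staging table and column names that are not in the original question (Todd McDermid's article may use a different mechanism):

-- Keep one row per AccountNumber, scored by a business rule
-- (here: the latest ModifiedDate wins), and load only the winners.
;WITH scored AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY AccountNumber
                              ORDER BY ModifiedDate DESC) AS rn
    FROM dbo.StagingTable
)
INSERT INTO dbo.FinalTable (AccountNumber, Amount, ModifiedDate)
SELECT AccountNumber, Amount, ModifiedDate
FROM scored
WHERE rn = 1;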

Related

SSIS: How to get the number of updated and deleted rows in an audit?

Imagine that you want to save in a variable the number of rows that were updated or deleted in a table.
These are the steps that I did:
First, in the Control Flow I created a Data Flow Task.
Then, in the Data Flow, I created a source (in my case an Excel file), created two variables to count those rows (countDeleted and countUpdated), connected the variables to two Row Count transformations, and then connected my destination (OLE DB).
Now, in the Control Flow, what do I do?
Create an Execute SQL Task? Or a Script Task? What is the best way to do it? What is the piece of code to use?
Thanks for your help.
PS: I only have 4 weeks of SSIS, sorry for my noobieness :)
An OLE DB Destination only inserts. It can't UPDATE or DELETE.
What's your logic for updating or deleting?
If you're just starting out and reading about doing things in SSIS, you will eventually find advice to use the OLE DB Command to perform row-by-row updates and deletes.
In my opinion this is to be avoided. It does not scale (it works fine for small recordsets, then fails for large ones), and it is difficult to maintain parameter mappings in the OLE DB Command. Although you should try it anyway to familiarise yourself with it.
My advice is to load the Excel data into a staging table, perform batch DELETE and UPDATE statements to load the data, and use @@ROWCOUNT to capture the records affected.
For example:
Your existing dataflow, as described, can be used to load into a table called StagingTable.
Before your dataflow, run an Execute SQL Task (this lives in the Control Flow pane, not the Data Flow pane) that clears the staging table:
TRUNCATE TABLE StagingTable;
So first get that working: repeatedly running your package should clear the staging table and then load the Excel data into it without creating duplicates.
This in itself is a challenge, as Excel is a terrible data interchange format.
Once you have that working, add an Execute SQL Task at the end that runs some SQL to delete the records you want and capture the count. For example:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
SELECT @@ROWCOUNT;
Then you follow the instructions here to load that back to your SSIS variable
http://microsoft-ssis.blogspot.com/2011/03/rowcount-for-execute-sql-statement.html
What are you doing with this row count? Are you writing it to a logging table? Save yourself the bother of pulling it back into an SSIS variable and just write it directly:
DELETE FROM MyFinalTable WHERE PrimaryKey IN (SELECT PrimaryKey FROM StagingTable);
INSERT INTO LogTable([Table], Operation, [Rows])
SELECT 'MyFinalTable', 'Delete', @@ROWCOUNT;
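The updated count can be captured the same way; a sketch along the same lines (the joined column names are assumptions):

UPDATE f
SET f.SomeColumn = s.SomeColumn
FROM MyFinalTable AS f
INNER JOIN StagingTable AS s ON s.PrimaryKey = f.PrimaryKey;
INSERT INTO LogTable([Table], Operation, [Rows])
SELECT 'MyFinalTable', 'Update', @@ROWCOUNT;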
In my experience it is not a good idea to build convoluted logic into SSIS packages if you can instead do it in the database, although it does depend on the person who has to eventually maintain it. Hopefully you can appreciate that this T-SQL approach is a more straightforward, code-based approach, as opposed to having to dig around in property pages and events and other places inside SSIS packages.
I assume that you're using an Execute SQL Task for the updates and deletes? As @Nick.McDermaid mentioned, using an OLE DB Command within a Data Flow presents various issues when performing DML. You can get the number of rows updated, inserted, or deleted in a table from an Execute SQL Task by using the ExecValueVariable property of the task. Set the variable that will hold the row count in this property and it will receive the number of affected rows. Note that it will only return the number of rows impacted by the last statement in the Execute SQL Task, regardless of whether batches (i.e. GO separators) are in the component.
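To illustrate that last point, if the task runs something like the following (hypothetical tables), only the DELETE's count ends up in the variable:

UPDATE dbo.MyFinalTable SET Flag = 1 WHERE Flag IS NULL;  -- this count is not surfaced
DELETE FROM dbo.MyFinalTable WHERE IsExpired = 1;         -- this count reaches ExecValueVariable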

Package is stuck at "Execute phase is beginning" at Lookup task

I have used a Lookup in my data flow task. When I use Full Cache mode, the data flow task runs fine. But when I use Partial Cache or no Cache in my lookup, the records do not go past the lookup task and it keeps running for hours. I have checked for errors but there aren't any errors displayed. Could anyone please help me on this?
A Lookup is not appropriate for your task. Instead:
Add an OLE DB Source to pull in the data
Sort the records from the incoming source and the OLE DB Source
Perform a merge join (Full outer).
Add a Derived Column transformation to check for ISNULL on the two joining columns, and create a new output column called Action. Where the target side is NULL, tag the row as an INSERT record.
Add a Conditional Split to send the INSERT records to an OLE DB Destination that inserts the new records.
You can also check whether there are matches between the two populations and perform updates, or look for NULLs in the source and DELETE in the destination. (A set-based sketch of the insert check follows below.)
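For reference, the ISNULL check in the Derived Column amounts to this set-based logic; the table and column names are assumptions, just to show the idea:

-- Source rows with no match on the target side of the join are the INSERTs.
INSERT INTO dbo.TargetTable (AccountNumber, Amount)
SELECT s.AccountNumber, s.Amount
FROM dbo.SourceStaging AS s
LEFT JOIN dbo.TargetTable AS t ON t.AccountNumber = s.AccountNumber
WHERE t.AccountNumber IS NULL;  -- NULL join key on the target = new record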

SSIS OLE DB conditional "insert"

I have no idea whether this can be done or not, but basically, I have the following data flow:
Extracts the data from an XML file (works fine)
Simply splits the records based on an enclosed condition (works fine)
Had to add a Derived Column object due to some character set issues (there might be better methods, but it works)
Now "Step 4" is where I'm running into a scenario where I'd only like to insert the values that have a corresponding match in my database, for instance, the XML has about 6000 records, and from those, I have maybe 10 of them that I need to match back against and insert them instead of inserting all 6000 of them and doing the compare after the fact (which I could also do, but was hoping there'd be another method). I was thinking that I might be able to perform a sql insert command within the OLE DB DESTINATION object where the ID value in the file matches, but that's what I'm not 100% clear on or if it's even possible for that matter. Should I simply go the temp table route and scrub the data after the fact, or can I do this directly in the destination piece? Any suggestions would be greatly appreciated.
EDIT
Thanks to the last comment from billinkc, I managed to get a bit closer, where I can identify the matches and use that result set, but somehow it seems to be running the data flow twice, which is strange... I took the Lookup object out to see whether it was causing it, and it seems that it was. Any reason why it would run this entire flow twice with the addition of the Lookup? I should have a total of 8 matches, which I confirmed with the data viewer output, but then it seems to run a second time for the same file.
Is there a reason you can't use a Lookup transformation to find the existing records? Configure it so that it routes non-matching records to the no-match output, and then only connect the match-found connector to the "Navigator Staging Manager Funds".
I believe that answers what you've asked, but I wonder if you're expressing the right desire. My assumption is the Lookup would go against the existing destination, so the Lookup returns the id 10 for a row. All of the out-of-the-box destinations in SSIS only perform inserts, so a row that found a match would now get doubled. As you are looking for existing rows, that usually implies you'd want to perform an update to an existing row. If that's the case, there is a specially designed transformation, the OLE DB Command; it is the component that allows for updates. There is a performance problem with that component: it issues a single update statement per row flowing through it. For 10 rows, I think it'd be fine. Otherwise, the pattern you'd use is to write all the new rows (inserts) into your destination table and write all of your changed rows (updates) into a second staging-type table. After the data flow is complete, use an Execute SQL Task to perform a set-based update statement.
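That final set-based update would look roughly like this; the destination and staging table names here are made up, not from the question:

-- Apply all changed rows from the staging-type table in one statement.
UPDATE t
SET t.FundName = s.FundName,
    t.Amount = s.Amount
FROM dbo.Funds AS t
INNER JOIN dbo.Funds_Updates AS s ON s.FundId = t.FundId;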
There are third party options that handle combined upserts. I know Pragmatic Works has an option and there are probably others on the tasks and components site.

Add only new records in MySQL via script

I have a large database which I am trying to update via perl. The information to be added comes from a csv file which I do not control (but which is trusted—it comes from a different part of our company). For each record in the file, I need to either add it (if it does not exist) or do nothing (if it exists). Adding a record consists of the usual INSERT INTO, but before that can run for a particular entry a specific UPDATE must be run.
Let's say for the sake of concreteness that the file has 10,000 entries, but 90% of them are already in the database. What is the most efficient way to import the records? I can see a few obvious approaches:
Pull all records of this type from the database, then check each of the entries from the file for membership. Downside: lots of data transfer, possibly enough to time the server out.
Read in the entries from the file and send a query for just those records with an RLIKE 'foo|bar|baz|...' query (or a stuff = 'foo' || stuff = 'bar' || ... query, but that seems even worse). Downside: huge query, probably enough to choke the server.
Read in the file, send a query for each entry, then add it if appropriate. Downside: tens of thousands of queries, very slow.
Apart from the UPDATE requirement, this seems like a fairly standard issue that presumably has a standard solution. If there is, it can probably be adapted to my case with appropriate use of tests on the auto_increment primary key.
The standard solution is to use INSERT IGNORE, which won't raise an error if the insertion would fail because of a constraint. This isn't much use to you as-is because it doesn't give you a chance to run the UPDATE before you know the INSERT is going to work. If you can do the update afterwards, however, this is ideal: just INSERT IGNORE each record and then do the UPDATE if it succeeded.
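A minimal MySQL sketch of that insert-then-conditionally-update flow, with made-up table and column names:

-- INSERT IGNORE silently skips rows that would violate the unique key.
INSERT IGNORE INTO accounts (account_number, balance)
VALUES ('ACC-1001', 0);
-- ROW_COUNT() is 1 if the row was actually inserted, 0 if it already existed;
-- only run the follow-up UPDATE when it is 1. In Perl you would check the
-- return value of DBI's execute() instead of issuing a separate SELECT.
SELECT ROW_COUNT();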
If a record already exists, that means a record with a matching unique key is already in the database, so I don't understand the RLIKE proposal, which is bound to be slow anyway.
I would use Perl to grep the CSV file, running SELECT count(*) FROM table WHERE key = ? for each record and removing anything where the result is non-zero.
Then just do your UPDATE and INSERT for everything left in the filtered CSV data.
There is no need to worry about timing out the server if you keep flushing data while iterating the list.

SQL Server: unique key for batch loads

I am working on a data warehousing project where several systems are loading data into a staging area for subsequent processing. Each table has a "loadId" column which is a foreign key against the "loads" table, which contains information such as the time of the load, the user account, etc.
Currently, the source system calls a stored procedure to get a new loadId, adds the loadId to each row that will be inserted, and then calls a third sproc to indicate that the load is finished.
My question is, is there any way to avoid having to pass back the loadId to the source system? For example, I was imagining that I could get some sort of connection Id from Sql Server, that I could use to look up the relevant loadId in the loads table. But I am not sure if Sql Server has a variable that is unique to a connection?
Does anyone know?
Thanks,
I assume the source systems are writing/committing the inserts into your source tables, and multiple loads are NOT running at the same time...
If so, have the source load call a stored proc, newLoadStarting(), prior to starting the load. This stored proc will update the load table (create a new row, record the start time).
Put a trigger on your loadID column that gets max(loadID) from the loads table and inserts it as the current load id (a rough sketch follows below).
For completeness you could add an endLoading() proc which sets an end date and de-activates that particular load.
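A rough sketch of that trigger, under the same single-load-at-a-time assumption and with made-up table and column names:

-- Stamp incoming staging rows with the latest loadID from the loads table.
-- (An INSERT into the same table inside an INSTEAD OF trigger does not re-fire it.)
CREATE TRIGGER trg_Staging_SetLoadId
ON dbo.StagingTable
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO dbo.StagingTable (AccountNumber, Amount, loadID)
    SELECT i.AccountNumber,
           i.Amount,
           (SELECT MAX(loadID) FROM dbo.Loads)  -- the load opened by newLoadStarting()
    FROM inserted AS i;
END;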
If you are running multiple loads at the same time in the same tables...stop doing that...it's not very productive.
A local temp table (one pound sign, #temp) is unique to the session; dump the ID in there and then select from it.
BTW, this will only work if you use the same connection.
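A small sketch of that idea (the procedure name and its return value are assumptions), which relies on every statement running over the same connection:

DECLARE @loadId INT;
EXEC @loadId = dbo.newLoadStarting;  -- hypothetical proc returning the new loadId
-- #CurrentLoad is visible only to this session/connection.
CREATE TABLE #CurrentLoad (loadId INT NOT NULL);
INSERT INTO #CurrentLoad (loadId) VALUES (@loadId);
-- later statements on the same connection can pick it up again:
SELECT loadId FROM #CurrentLoad;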
In the end, I went for the following solution "pattern", pretty similar to what Markus was suggesting:
I created a table with a loadId column, default null (plus some other audit info like createdDate and createdByUser);
I created a view on the table that hides the loadId and audit columns, and only shows rows where loadId is null;
The source systems load and view data through the view, not the table;
When they are done, the source system calls an "sp__loadFinished" procedure, which puts the right value in the loadId column and does some other logging (number of rows received, date called, etc). I generate this from a template as it is repetitive. (A rough sketch of the whole pattern is at the end of this answer.)
Because loadId now has a value for all those rows, it is no longer visible to the source system and it can start another load if required.
I also arrange for each source system to have its own schema, which is the only thing it can see and is its default on logon. The view and the sproc are in this schema, but the underlying table is in a "staging" schema containing data across all the sources. I ensure there are no collisions through a naming convention.
Works like a charm, including the one case where a load can only be complete if two tables have been updated.
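Putting the pattern above together, a rough T-SQL sketch; every schema, table, and column name here is an assumption for illustration:

-- Staging table: loadId stays NULL until the load is closed off.
CREATE TABLE staging.Accounts
(
    AccountNumber varchar(20) NOT NULL,
    Amount decimal(18,2) NULL,
    loadId int NULL,
    createdDate datetime NOT NULL DEFAULT (GETDATE()),
    createdByUser sysname NOT NULL DEFAULT (SUSER_SNAME())
);
GO
-- The source system's own schema only exposes rows not yet assigned to a load.
CREATE VIEW source1.Accounts
AS
SELECT AccountNumber, Amount
FROM staging.Accounts
WHERE loadId IS NULL;
GO
-- Called by the source system when it has finished loading.
CREATE PROCEDURE source1.sp__loadFinished
AS
BEGIN
    DECLARE @loadId int;
    -- staging.Loads is assumed to exist with an identity loadId column.
    INSERT INTO staging.Loads (loadDate) VALUES (GETDATE());
    SET @loadId = SCOPE_IDENTITY();

    UPDATE staging.Accounts
    SET loadId = @loadId
    WHERE loadId IS NULL;
    -- log @@ROWCOUNT (rows received), the date called, etc. here
END;
GO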