Remove duplicates in SSIS Data Flow

I am working on an SSIS data flow task.
The source table is from an old database which is denormalized.
The destination table is normalized.
SSIS fails because the transfer produces duplicates in the destination's primary key column.
It would be good if SSIS could check the destination for the current record (by key) and, if it already exists, skip pushing it and continue with the next record.
Is there a way to handle this scenario?

Assuming your destination table is a subset of your source table, you should be able to use the Sort Transformation to pull in only the columns you need for your destination table, and then check the "Remove rows with duplicate sort values" option to get a distinct list of records based on the columns you selected.
Then, simply route the results of the sort to your destination, and you should be good to go.
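If your source is a relational table rather than a flat file, a hedged alternative is to deduplicate in the source query itself, which avoids the fully blocking Sort component. A minimal sketch; the table and column names are assumptions, not from the question:
-- Let the database deduplicate instead of the Sort component.
-- dbo.SourceTable and the column list are assumptions for illustration.
SELECT DISTINCT KeyColumn, Column1, Column2
FROM dbo.SourceTable;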

Related

Prioritize bulk insert into a table using Union All in SSIS

I have multiple archive tables storing similar kinds of data, archived month-wise. Now the requirement is to get all the archived data into one table instead of multiple tables.
I am doing this with the help of Union All in SSIS; however, the rows seem to be inserted into the destination table in random order.
Attached is the route taken for the transformation.
I want to prioritize the insert, please suggest!
You can add an extra column "Priority" to each of the OLE DB sources with the corresponding priority for each source, and then after the union you can add a Sort component that sorts the data by Priority. But if you have a lot of data, that would be really inefficient, because the Sort component will wait until all the source data has been read.
I would suggest writing a proper source SQL statement that does the union/prioritization/sorting for you and then inserting into the target, as in the sketch below.
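A minimal sketch of such a source statement, assuming two hypothetical archive tables with identical schemas:
-- Union the monthly archives with an explicit priority and let the
-- database do the sorting, avoiding a blocking Sort component.
-- The table names Archive_Jan and Archive_Feb are assumptions.
SELECT 1 AS Priority, a.* FROM dbo.Archive_Jan AS a
UNION ALL
SELECT 2 AS Priority, a.* FROM dbo.Archive_Feb AS a
ORDER BY Priority;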
Also, if the sources are on different servers, you can create a Foreach Loop container that iterates through the source tables and inserts all of them into the target table. You can use this article for reference.

In SSIS, how can one simply ignore records that Lookup identifies as not a match?

In my current SSIS data flow task, I feed my data flow into a Lookup tool. Matches are inserted into one table and non-matches are inserted into another table.
I did it this way because this is what I was able to learn from the available tutorials at the time.
However, it seems wasteful because I don't really want the non-matched records at all. Is there a way to tell SSIS to discard the non-matched records entirely rather than store them in a table?
The lookup dialog doesn't appear to give me an option for "ignore non-matches."
Is there some way to achieve this desired behavior?
If lookup = match, insert matched records into table (as currently done)
If lookup not a match, ignore (or discard) non-matched records
Leave "Redirect rows to no match output" as you currently have it specified.
Select the "non-matched" branch and delete the destination.
Done.
Really, that's it. The rows will still be in the memory buffers of your data flow but they won't carry to the Match destination as they'll be logically segmented.
Personally, I have a Row Count wired up so I can count the original Rows, the matched rows and the unmatched rows. It helps me audit how the package has performed over time but there's nothing wrong with not using an output stream from a component.

SSIS - Reuse Ole DB source when matching Fact against lookup table twice

I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a csv, so I want to match the ids against the surrogate keys in lookup tables.
The data structure in the csv is like this
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table. So what I am doing is:
Reading Lookup data using OLE DB Source
Reading my csv file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (i.e., the surrogate key can't be found), I can redirect that row to a rejected csv and handle it later.
something like this:
(sorry for the spanish!)
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there's a way to avoid reading two times the data from the table (I mean, not to use two ole db sources to read twice the data from the same table).
I tried adding a second output to the Sort but I am not allowed to. The same goes for adding another output from the OLE DB Source.
I see there's a "cache option"; is that the best way to go? (Although it would imply creating another OLE DB source anyway, right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field), I get an error when I try to map both columns from my csv against the same column in my lookup table:
There are columns missing with the sort order 2 to 2
What is the best way to go about this?
Or am I thinking about this incorrectly?
If something was not clear let me know and I'll update my question
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to split a Data Flow stream into multiple outputs.
Gonzalo
I have just used this article on how to derive columns when building a data warehouse: How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
Here is the Data Flow:-
Note the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the two derived Ids CodeTableFirstId and CodeTableSecondId:-
Hopefully I understand your problem and this is of use to you.
Cheers John

SSIS 2008. Transferring data from one table to another ONLY if the data is not duplicated

I'm going to do my best to try to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works, but the issue is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the dataflow task, where you transfer the data, you can insert a Lookup transformation. In the lookup, you can specify a data source (table or query, whichever serves you best). When you choose the data source, you can go to the Columns view and create a mapping, where you connect the CustomerID, Date and Amount of both tables.
In the General view, you can configure what happens with matched/non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the CustomerId of 13029. However, if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not; I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table:
SELECT CO.CustomerId, CO.OrderId
FROM dbo.CustomerOrder CO
If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
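For instance, a hedged sketch of that filter, assuming a hypothetical OrderDate column on the target table:
-- Restrict the cached keys to the current year to conserve memory.
-- The OrderDate column is an assumption; substitute your real date column.
SELECT CO.CustomerId, CO.OrderId
FROM dbo.CustomerOrder CO
WHERE CO.OrderDate >= DATEADD(year, DATEDIFF(year, 0, GETDATE()), 0);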
Inside your dataflow, add a Lookup Transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database so those are the rows that are new. 2005 assumes every row is going to match or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries" and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields and take an action depending on whether they match.
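A minimal MERGE sketch for the scenario in the question, treating CustomerID and Date as the matching key; every table and column name here is an assumption:
-- Insert only the rows that do not already exist in the target;
-- matched rows are left untouched. All names below are assumptions.
MERGE dbo.TargetPayments AS tgt
USING SourceDb.dbo.SourcePayments AS src
    ON tgt.CustomerID = src.CustomerID
   AND tgt.[Date] = src.[Date]
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, Amount, [Date])
    VALUES (src.CustomerID, src.Amount, src.[Date]);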

How to restart counting from 1 after erasing table in MS Access?

I have a table in MS Access that has an AutoNumber type in field ID.
After inserting some rows, the ID has reached 200.
Then I deleted the records in the table. However, when I try to insert a new row, the ID starts at 201.
How can I force the ID to restart at 1, without having to drop the table and make a new one?
In Access 2010 or newer, go to Database Tools and click Compact and Repair Database, and it will automatically reset the ID.
You can use:
CurrentDb.Execute "ALTER TABLE yourTable ALTER COLUMN myID COUNTER(1,1)"
I hope you have no relationships that use this table, I hope it is empty, and I hope you understand that all you can (mostly) rely on an autonumber to be is unique. You can get gaps, jumps, very large or even negative numbers, depending on the circumstances. If your autonumber means something, you have a major problem waiting to happen.
In addition to all the concerns expressed about why you give a rat's ass what the ID value is (all are correct that you shouldn't), let me add this to the mix:
If you've deleted all the records from the table, compacting the database will reset the seed value back to its original value.
For a table that still has records, where you've inserted a value into the AutoNumber field that is lower than the highest value, you have to use #Remou's method to reset the seed value. This also applies if you want to reset to Max+1 in a table where records have been deleted: e.g., with 300 records and a last ID of 300, if you delete records 201-300, compacting won't reset the counter; you have to use #Remou's method. (This was not the case in earlier versions of Jet, nor, indeed, in early versions of Jet 4, the first Jet version that allowed manipulating the seed value programmatically.)
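A hedged sketch of that reset-to-Max+1 case, reusing #Remou's COUNTER syntax from above; the table name, column name, and seed value are assumptions (here the last remaining ID is 200, so the counter is reseeded at 201):
' Reseed the AutoNumber so the next inserted row gets ID 201.
' myTable, myID, and the seed of 201 are assumptions for illustration.
CurrentDb.Execute "ALTER TABLE myTable ALTER COLUMN myID COUNTER(201,1)"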
I am going to Necro this topic.
Starting around MS Access 2016, you can execute Data Definition Queries (DDQs) through Macros.
Data Definition Query
ALTER TABLE <Table> ALTER COLUMN <ID_Field> COUNTER(1,1);
Save the DDQ with your values.
Create a Macro with the appropriate logic either before this or after.
To execute this DDQ:
Add an Open Query action
Define the name of the DDQ in the Query Name field; the View and Data Mode settings are not relevant and can be left at their default values
WARNINGS!!!!
This will reset the AutoNumber Counter to 1
Any Referential Integrity will be summarily destroyed
Advice
Use this for Staging tables
These are tables that are never intended to persist the data they temporarily contain.
The data contained is only there until additional cleaning actions have been performed and stored in the appropriate table(s).
Once cleaning operations have been performed and the data is no longer needed, these tables are summarily purged of any data contained.
Import Tables
These are very similar to Staging Tables but tend to only have two columns: ID and RowValue
Since these are typically used to import RAW data from a general file format (TXT, RTF, CSV, XML, etc.), the data contained does not persist past the processing lifecycle
I think the only ways to do this are outlined in this article.
The article explains several methods. Here is one example:
To do this in Microsoft Office Access 2007, follow these steps:
Delete the AutoNumber field from the main table.
Make note of the AutoNumber field name.
Click the Create tab, and then click Query Design in the Other group.
In the Show Table dialog box, select the main table. Click Add, and then click Close.
Double-click the required fields in the table view of the main table to select the fields.
Select the required Sort order.
On the Design tab, click Make Table in the Query Type group. Type the new table name in the Table Name box, and then click OK.
On the Design tab, click Run in the Results group.
The following message appears:
You are about to paste # row(s) into a new table.
Click Yes to insert the rows.
Close the query.
Right-click the new table, and then click Design View.
In the Design view for the table, add an AutoNumber field that has the same field name that you deleted in step 1. Add this AutoNumber field to the new table, and then save the table.
Close the Design view window.
Rename the main table. Then rename the new table to the original main table's name.
I always use the approach below. I've created one table in the database named Table1 with only one column, Row_Id (Number, Long Integer), containing a single row whose value is 0.
INSERT INTO <TABLE_NAME_TO_RESET>
SELECT Row_Id AS <COLUMN_NAME_TO_RESET>
FROM Table1;
This will insert one row with the value 0 in the AutoNumber column; later, delete that row.
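For completeness, the cleanup that the last sentence describes, using the same placeholders as the query above:
DELETE FROM <TABLE_NAME_TO_RESET>
WHERE <COLUMN_NAME_TO_RESET> = 0;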