I'm quite new to ADF, so here's my challenge.
I have a pipeline that consists of a Lookup activity and a ForEach activity, with a Copy activity inside the ForEach.
When I run this pipeline, the output of the Lookup activity looks like this:
The output contains 11 different values. From my perspective, there are only 11 records that need to be copied to my sink, which is an Azure SQL DB.
The input of the ForEach activity looks like this:
When the pipeline runs, it copies 11 times, and my SQL database now holds 121 records: 11 rows multiplied by 11 iterations. This is not the output I expected.
I expect only 11 rows in my sink table. How can I change this pipeline to get only 11 rows?
Many thanks!
To copy the data correctly, the Lookup activity and the Copy activity's source should not be given the same configuration. If they are, duplicate rows will be copied: the ForEach runs once per looked-up row, and each iteration's Copy reads the full source again.
I tried to reproduce the same behavior in my environment.
If there are 3 records in the source data, 3 × 3 = 9 records will be copied.
To avoid the duplicates, use only a Copy activity to copy the data from source to sink.
Now only 3 records are in the target table.
Related
I have two sources: a file and a DB.
Product_code is the key, but it can be duplicated, since the files can contain products modified more recently than in the DB. Both sources have a ModDate field.
I have to load only the unique, most recent records.
For example, the DB has 30 unique IDs and the file has 10 with a more recent date; those 10 must replace the older rows in the DB.
What is the most commonly used tool for this type of scenario?
Any ideas on what the structure in Data Flow should look like would be highly appreciated.
I can't use scripts or T-SQL.
I was using this structure:
old ssis structure
After the suggested change (using an aggregate grouped by ID with the MAX date),
the structure now looks like this:
new ssis structure
but I am still not getting the expected result (all columns, with the most recent date, at the destination DB). Now only one column (ID) comes out at the end.
Thanks
Based on the constraints you mention, you can approach it this way:
Union both data sources together (you will likely have data type issues)
Multicast to path A and path B
Path A:
Use an aggregate transform to group by Product ID and MAX modifieddate
The output of this will be two columns: a list of unique products, and their latest modified date
Path B:
Join back to Path A on Product and modified date. The output of this should be the dataset filtered on what you want.
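Scripts are off the table for you, so this stays a Data Flow design, but the aggregate-plus-join-back above boils down to the same "keep the latest row per Product_code" logic as the sketch below. It is purely illustrative; apart from Product_code and ModDate, every name in it is a placeholder.

// Illustrative only: the logic the aggregate + join-back expresses in the Data Flow,
// written as LINQ so the intent is explicit. Field names other than Product_code and
// ModDate are assumptions.
using System;
using System.Collections.Generic;
using System.Linq;

record ProductRow(string Product_code, DateTime ModDate, string OtherColumns);

static class LatestRowSketch
{
    // fileRows and dbRows are the two sources after the union step.
    static IEnumerable<ProductRow> KeepMostRecent(
        IEnumerable<ProductRow> fileRows, IEnumerable<ProductRow> dbRows)
    {
        return fileRows.Concat(dbRows)                      // union both sources
            .GroupBy(r => r.Product_code)                   // aggregate: group by Product_code
            .Select(g => g.OrderByDescending(r => r.ModDate)
                          .First());                        // keep the row with the MAX ModDate
    }
}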
I am building an SSIS package for SQL Server 2014 and am currently trying to get the most recent record from two different sources, comparing the datetime columns shared between them, and I need a method to accomplish that.
So far I am using a Lookup task on thirdpartyid to match the records I eventually need to compare, and a Merge Join to bring them together, with the goal of a staging table that holds the most recent record. A previous data flow task (not shown) already inserts records that are not in AD1 into a staging table, so at this point the records are a one-to-one match.
Both sources look like this, with exactly the same datetime columns, just different dates, and some information is null because there is no history of it.
Sample output
This is my data flow task so far. I am really new to SSIS so any ideas or suggestions would be greatly appreciated.
Given that there is a 1:1 match between your two sources, I would structure this as Source (V1) -> Lookup (AD1).
Define your lookup based on thirdpartyid and retrieve all the AD1 columns. You'll likely end up with data flow columns like name and name_AD1, etc.
I would then add a Derived Column that identifies whether the dates are different (assuming that, in that situation, you need to take action):
IsNull(LastUpdated_AD1) || LastUpdated > LastUpdated_AD1
That expression yields true if the AD1 column is null or if the V1 LastUpdated column is greater than the AD1 version.
You'd then add a Conditional Split to your Data Flow, base it on the value of the new column, and route the changed rows into your mechanism for handling updates (an OLE DB Command or, preferably, an OLE DB Destination plus an Execute SQL Task after the Data Flow to perform a batch update).
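To make the batch-update option concrete: the sketch below is only an assumption of that pattern, where the changed rows land in a staging table via the OLE DB Destination and one set-based UPDATE runs afterwards. Apart from thirdpartyid and LastUpdated, every table and column name is made up, and the ADO.NET wrapper is only there to keep the snippet self-contained; in the package the same statement would live in the Execute SQL Task.

// Hypothetical post-Data-Flow batch update: dbo.AD1_Changes is an assumed staging table
// fed by the OLE DB Destination, and dbo.AD1 is the assumed target.
using System.Data.SqlClient;

class BatchUpdateSketch
{
    static void ApplyChanges(string connectionString)
    {
        const string updateSql = @"
UPDATE ad1
   SET ad1.name        = stg.name,           -- assumed non-key columns
       ad1.LastUpdated = stg.LastUpdated
  FROM dbo.AD1 AS ad1
  JOIN dbo.AD1_Changes AS stg
    ON stg.thirdpartyid = ad1.thirdpartyid;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(updateSql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}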
A comment asks:
should it all be one expression? like IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1 || IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
You can do it like that, but when you get a weird result and someone asks how you got that value, long expressions are hard to debug. I'd likely have two Derived Column components in the data flow. The first would create a "has changed" column for each set of conditions:
HasChangedAssignment
(IsNull(AssignmentLastUpdated_AD1) || AssignmentLastUpdated > AssignmentLastUpdated_AD1)
HasChangedRoom
IsNull(RoomLastUpdated_AD1) || RoomLastUpdated > RoomLastUpdated_AD1
etc
And then in the final derived column, you create the HasChanged column
HasChangedAssignment || HasChangedRoom || HasChangedAdNauseum
Using a pattern-based approach like this makes it much easier to build, troubleshoot, and make small changes that can have a big impact on the correctness, maintainability, and performance of your packages.
I am developing a database for an organization that has three branches. They want to use this database locally.
The problem is that, every three months, they want to gather the data from all three branches for reporting.
For example, if branch A has 40 records, branch B has 50 records, and branch C has 30 records, then after three months each branch should have 40 + 50 + 30 = 120 records.
How can I do that? Any suggestions?
You can export the data from the three databases individually and insert it into a new database; since this only happens every three months, it does not need to be a continuous synchronization.
or
You can create an API for fetching the data and call it in a loop; after fetching from the three databases separately, join the results into one array and present them however you want.
It is like:
1. Select the first database and fetch its data.
2. Change to the second database and fetch its data.
3. And so on for the third.
Then join every array of data into a single variable and display it.
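A rough sketch of that loop-and-combine idea, in case it helps; the connection strings, table name, and query below are placeholders rather than anything from your setup.

// Loop over the three branch databases, fetch each one's rows, and merge them
// into a single combined table for reporting (or for insertion into a central DB).
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

class GatherBranches
{
    static DataTable GatherAll(IEnumerable<string> branchConnectionStrings)
    {
        var combined = new DataTable();
        foreach (var connectionString in branchConnectionStrings)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var adapter = new SqlDataAdapter("SELECT * FROM dbo.Records", connection)) // assumed table
            {
                var branchData = new DataTable();
                adapter.Fill(branchData);      // fetch this branch's data
                combined.Merge(branchData);    // append it to the combined result
            }
        }
        return combined;
    }
}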
A tale of two cities, almost... I have 17,000 rows of data that come in as a pair of strings in 2 columns. There are always 5 item numbers and 5 item unit counts per row (unit counts are always 4 characters). Each unit has to match up with an item or the row is invalid. What I'm trying to do is "unpivot" the strings into individual rows: Item Number and Item Units.
So here's an example of one row of data and the two columns
Record ID Column: 0
Item Number Column: A001E10 A002E9 A003R20 A001B7 XA917D3
Item Units Column: 001800110002000300293
I wrote a C# Windows app test harness to unpivot the data into individual rows, and it works fine and dandy. It unpivots the data into 85,000 (5 × 17,000) rows and displays them in a grid, which is what I expect (ID, Item Number, and Item Units).
0 | A001E10 | 0018
0 | A002E9 | 0011
and so on...
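Roughly, the unpivot logic looks like this; it is a simplified sketch rather than the actual harness code, so the names are just for illustration.

// Split the space-separated item numbers and pair each with its fixed 4-character
// unit count, producing one output row per item.
using System;
using System.Collections.Generic;

class UnpivotSketch
{
    static IEnumerable<(string Id, string Item, string Units)> Unpivot(
        string id, string itemNumbers, string itemUnits)
    {
        string[] items = itemNumbers.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < items.Length; i++)
        {
            // If the unit string is too short to pair with this item, the row is invalid.
            if ((i + 1) * 4 > itemUnits.Length)
                yield break;
            yield return (id, items[i], itemUnits.Substring(i * 4, 4));
        }
    }

    static void Main()
    {
        foreach (var row in Unpivot("0", "A001E10 A002E9 A003R20 A001B7 XA917D3", "001800110002000300293"))
            Console.WriteLine($"{row.Id} | {row.Item} | {row.Units}");
    }
}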
In my SSIS package I added a Script task to process this same data, using basically the same code as my test harness. When I run the task I can see it load the 17,000 rows, but it only generates roughly 15,000 rows of output, so obviously something isn't right.
What I'm thinking is that I don't have the Script task set up correctly, even though it uses the same code as my test harness, and that it's dropping records for some reason.
If I go back into my task and give it a particular record ID that it missed on the first pass, it will process that ID and generate the right output. So the record is OK, but for some reason it gets missed or dropped on the initial run. Maybe something to do with buffers?
Well - I figured it out.
We have a sequence container with tons of data flow tasks inside it running in parallel. We rely on the engine to prioritize and handle the data extract and load correctly. However, this one particular script task was not being handled correctly by the engine within that sequence container.
The clue was that the script task ran fine when executed by itself, outside the whole process. So we pulled it out of the sequence container, placed it by itself after the container, and now it runs correctly.
I have an input CSV file with columns Position_Id, AsofDate, etc., which has to be loaded into a staging table. In my table, the columns Position_Id and AsofDate are the primary key. We receive this input file every 2 hours. For example, we received a file at 10 AM today and it loaded into the table. Two hours later we received another file, containing the same data as the 10 AM file, and it also loaded into the table.
Now my table contains the data from the files received at 10 AM and 12 PM. At 12:10 PM we received a modified input file with different data in it. My actual requirement is that, when the latest file's (12:10 PM) data is loaded into the table, only the new and updated data should be loaded.
Have you ever heard of the term Upsert? Here are examples of how to upsert (insert new records and update existing records).
This blog post walks you through Upserting using a lookup in a dataflow.
This stackoverflow answer provides links to explaining and setting up a merge.
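As a rough illustration of the pattern (not taken from the linked posts), an upsert keyed on Position_Id and AsofDate could look like the sketch below. Every other table and column name is an assumption, and the ADO.NET wrapper is only there to keep the snippet self-contained; the same MERGE could run from an Execute SQL Task or a stored procedure after the staging load.

// Hypothetical upsert from an assumed staging table (dbo.Positions_Staging) into an
// assumed target table (dbo.Positions), matching on the Position_Id + AsofDate key.
using System.Data.SqlClient;

class UpsertSketch
{
    static void Upsert(string connectionString)
    {
        const string mergeSql = @"
MERGE dbo.Positions AS target
USING dbo.Positions_Staging AS source
   ON target.Position_Id = source.Position_Id
  AND target.AsofDate    = source.AsofDate
WHEN MATCHED THEN
    UPDATE SET target.SomeValue = source.SomeValue    -- assumed non-key column
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Position_Id, AsofDate, SomeValue)
    VALUES (source.Position_Id, source.AsofDate, source.SomeValue);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(mergeSql, connection))
        {
            connection.Open();
            command.ExecuteNonQuery();
        }
    }
}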