This is quite a strange problem; I wasn't quite sure how to title it. The issue I have is that some data rows in an SSIS task need to be modified depending on other rows.
Name   Location   IsMultiple
Bob    England
Jim    Wales
John   Scotland
Jane   England
A simplified dataset, with some names, their locations, and a column 'IsMultiple' which needs to be updated to show which rows share locations (Bob's and Jane's rows would be flagged 'True' in the example above).
In my situation there is much more complex logic involved, so solutions using SQL would not be suitable.
My initial thought was to use an asynchronous script component: take in all the data rows, parse them, and then output them all after the very last row has been input. The only way I could think of doing this was to create the rows in the PostExecute phase, which did not work.
Is there a better way to go about this?
A couple of options come to mind for SSIS solutions. With both options you would need the data sorted by location. If you can do this in your SQL source, that would be best. Otherwise, you have the Sort component.
With sorted data as your input you can use a Script component that compares the values of adjacent rows to see if multiple locations exist.
Another option would be to split your data path into two by adding a Multicast component. The first path is the main path you currently have. On the second path, add an Aggregate transformation after the Multicast. Edit the Aggregate and select Location with a Group by operation, and (*) with a Count all operation. The output will be rows with counts by location.
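For the sample data above, the Aggregate output would be:
Location   Count
England    2
Wales      1
Scotland   1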
After the Aggregate, add a Merge Join component and select your first and second data paths as its inputs. Your join key should be the Location column from each path. All the columns from path 1 should be outputs, plus the count from path 2.
In a Derived Column component, set the IsMultiple column with an expression that says "if the count is greater than 1 then true, else false".
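For example, assuming the count column coming out of the Aggregate is named LocationCount (the name is an assumption), the Derived Column expression would look something like:
LocationCount > 1 ? TRUE : FALSE
(If IsMultiple is a string column, swap the literals for "True" and "False".)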
If possible, I might recommend doing it with pure SQL in an Execute SQL task on your control flow prior to your data flow. A simple UPDATE query where you GROUP BY location and use HAVING COUNT(*) > 1 should be able to do this. But if yours is a simplified version, this may not be feasible.
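A minimal sketch, assuming the rows live in a table called dbo.People with the columns from the example (the table name is an assumption):
UPDATE P
SET IsMultiple = 'True'
FROM dbo.People AS P
WHERE P.Location IN
      (SELECT Location
       FROM dbo.People
       GROUP BY Location
       HAVING COUNT(*) > 1);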
If the data isn't available until after the data flow is done you could place the SQL task after your data flow on your control flow.
Can I use another column in an InList clause?
For example, I have created a variable, and below is the formula:
IF [query1.column1] inList ([query2.column2]) then SUM([query1.amountColumn])
Else 0
Or is it possible to put a variable after InList in the formula?
If that is not possible, is there any other alternative?
I see two possible approaches. I will use the eFashion universe for both solutions.
Solution #1
Here are my 2 queries to begin...
Run your queries. Click on the columns you want to compare: [query1].[column1] and [query2].[column2] in your case; [Query 1].[Month] and [Query 2].[Month] for me. Right-click and merge them. They must be dimensions of the same data type.
Now create a variable based on [Query 2].[Month Name] which you can filter on to eliminate the results from Query 1 that do not match up to anything in Query 2.
[UV Month Name]=[Query 2].[Month Name]
The key here is that you need to change the Qualification to "Detail" and set the Associated Dimension to what we just merged, by clicking the three dots to the right. Choose [Month Name] not from either query, but the merged dimension.
Now build out your table with whatever object you want from Query 1 and add in the variable we just created.
Now add a filter on that variable to only show rows where it is not null.
And you are done.
Pros
Works when the limiting query (query2) has a relatively large number of values (compare to the Cons of Solution #2).
Cons
More complicated to set up
May run into universe or performance issues related to the query being filtered (query1).
Solution #2
Building upon Solution #1, I duplicated Query 1 and renamed it Query 3. Now you can choose "Results from another query" to get the [query1].[column1] InList ([query2].[column2]) logic you want.
If you take this approach then you don't need the merge, variable, and filter. The results of the query are filtered before being returned to the report.
Pros
Simple
Cons
The number of values coming from your second query must be relatively small. The limit varies by database, or maybe even by universe. I have found that if it is over 1,000 values, I get a "too complex" error when I run the query.
I have a job in Talend that is designed to bring together some data from different databases: one is a MySQL database and the other an MSSQL database.
What I want to do is match a selection of loan numbers from the MySQL database (about 82,000 loan numbers) to the corresponding information we have housed in the MSSQL database.
However, the tables in MSSQL to which I am joining the data from MySQL are much larger (~2 million rows) and quite wide, and thus take much more time to query. Ideally I would perform an inner join between the two tables on the loan number, but since they are in different databases this is not possible. The inner join performed inside a tMap occurs only after the lookup input has already returned its data set, which is quite large (especially since this particular MSSQL query will execute a user-defined function for each loan number).
Is there any way to create a global variable out of the output from the MySQL query (namely, the loan numbers selected by the MySQL query) and use that global variable as an IN clause in the MSSQL query?
This should be possible. I'm not working in MySQL but I have something roughly equivalent here that I think you should be able to adapt to your needs.
I've never actually answered a Stack Overflow question before, and while I was typing this the page told me I need at least 10 reputation to post more than 2 pictures/links. I think I need 4 pictures, so I'm going to write it out in words here and post the whole thing, complete with illustrations, on my blog in case you need more info (quite likely, I should think!).
As you can see, I've got some data coming out of the table and getting filtered by tFilterRow_1 to only show the rows I'm interested in.
The next step is to limit the flow to just the field I want to use in the variable. I've used tMap_3 rather than a tFilterColumns because the field I'm using is a string and I wanted to be able to concatenate single quotes around it; if you're using an integer you might not need to do that. And if your data has a lot of repetition, you might also want to add a tUniqRow in there to remove unnecessary duplicates.
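For reference, the quoting expression in the tMap output would be something like this (Talend expressions are plain Java; row1 and loan_number are assumed names):
"'" + row1.loan_number + "'"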
The next step is the one that does the magic. I've got a list like this:
'A1'
'A2'
'B1'
'B2'
etc., and I want to turn it into 'A1','A2','B1','B2' so I can slot it into my WHERE clause. For this, I've used tAggregateRow_1, selecting "list" as the aggregate function to use.
Next up, we want to take this list and put it into a context variable (I've already created the context variable in the metadata; you know how to do that, right?). Use another tMap component feeding into a tContextLoad component. tContextLoad always has two columns in its schema, so map the output of the tAggregateRow to the "value" column and enter the name of the variable in the "key" column. In this example, my context variable is called MyList.
Now your list is loaded as a text string and stored in the context variable, ready for retrieval. So open up a new database input and embed the variable in the SQL code like this:
"SELECT DISTINCT MY_COLUMN
 FROM MY_SECOND_TABLE
 WHERE the_selected_row IN (" + context.MyList + ")"
It should be as easy as that, and when I whipped it up it worked first time, but let me know if you have any trouble and I'll see what I can do.
I have an SSIS data flow in an SSIS 2012 project.
I need to calculate, in the best way possible, for every row a field holding a sum over another table based on some criteria.
It would be something like a lookup but returning an aggregate on the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a script task or stored procedure?
Example:
One data flow has a field named LOT.
I need to get the SUM(quantity) from table b where dataflow.LOT = tableb.lot,
and write this back to a field in the flow.
You just need to use the Lookup component. Instead of selecting tableb directly, write a query, thus:
SELECT
    B.Lot                              -- for matching
,   SUM(B.quantity) AS TotalQuantity   -- for data flow injection
FROM
    tableb AS B
GROUP BY
    B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. For larger volumes, if it's a problem, I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of lots could be pushed onto the remote server beforehand, to compute the aggregates only for the lots I need.
If tableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform two lookups against the source, while Partial would cache the values it has seen. Which is appropriate probably depends on memory pressure, etc.
I'm going to do my best to explain this. I currently have a data flow task with an OLE DB Source transferring data from a table in one database to a table in another database. It works fine, but the issue I'm having is that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on date '11/30/2012' is seen in that table multiple times. How do I make it so that only unique data transfers over to the destination table?
In the data flow task where you transfer the data, you can insert a Lookup transformation. In the Lookup you can specify a data source (a table or a query, whichever serves you best). When you have chosen the data source, go to the Columns view and create a mapping connecting the CustomerID, Date, and Amount columns of both tables.
In the General view, you can configure what happens with matched and non-matched rows. Simply take the no-match output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the CustomerID of 13029. However, if it's a customer order table, then maybe it's the combination of CustomerID and OrderDate (and maybe not; I have placed two unique orders on the same date). You will know the answer based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table:
SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO
If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation: you want SSIS to run fast, so don't bring back extraneous columns or rows it will never need.
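For example, a current-year filter might look like the following (the OrderDate column and the cutoff are assumptions about your schema):
SELECT CO.CustomerId, CO.OrderId
FROM dbo.CustomerOrder CO
WHERE CO.OrderDate >= '20120101';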
Inside your data flow, add a Lookup transformation with that query. You don't specify whether 2005, 2008, or 2012 is your SSIS version, and they have different behaviours associated with the Lookup transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database, so those are the rows that are new. 2005 assumes every row is going to match, or it errors; you will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries", and there you'll want "Redirect rows to no match output".
Now take the no-match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling the case where the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at a change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields, and take an action depending on whether they match or not.
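A minimal sketch using the CustomerID/Date/Amount columns from the question (the table names dbo.TargetTable and dbo.SourceTable are placeholders):
MERGE dbo.TargetTable AS T
USING dbo.SourceTable AS S
    ON T.CustomerID = S.CustomerID
   AND T.[Date] = S.[Date]
WHEN MATCHED THEN
    UPDATE SET T.Amount = S.Amount
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerID, [Date], Amount)
    VALUES (S.CustomerID, S.[Date], S.Amount);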
I am writing an SSIS package to import data from *.csv files to a SQL 2008 DB. The problem is that one of the files contains duplicate records within the csv itself, and I want to extract only the distinct values from that source.
Unfortunately, the generated files are not under my control; they are owned by a third party and I cannot change the way they are generated.
I did use the Lookup component, but it only checks the incoming data against the existing data. It does not check for duplicate records within the incoming data itself.
I believe the sort component gives an option to remove duplicate rows.
Depends on how serious you want to get about the duplicates. Do you need a record of what was duplicated, or is it enough to just get rid of them? The Sort component will get rid of dups on the sort fields; however, the dups may have different data in the other fields, and then you want a different strategy.
Usually I load everything into staging tables and clean up from there. I send the removed dupes to an exception table (we have to answer a lot of questions from our customers about why things don't match what they sent), and I often use a set of business rules (enforced with either an Execute SQL task or data flow tasks) to determine which one to pick if there are duplicates in one area but not another (say, two business addresses when we can only store one). I also make sure the client is aware of how we determine which of the two to pick.
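A minimal sketch of the exception-table step, assuming hypothetical names (dbo.Staging, dbo.DupeExceptions) and that CustomerID plus OrderDate define a duplicate:
-- copy every row that is part of a duplicate group to the exception table
INSERT INTO dbo.DupeExceptions (CustomerID, OrderDate, Amount)
SELECT S.CustomerID, S.OrderDate, S.Amount
FROM dbo.Staging AS S
WHERE (SELECT COUNT(*)
       FROM dbo.Staging AS D
       WHERE D.CustomerID = S.CustomerID
         AND D.OrderDate = S.OrderDate) > 1;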
Use the Sort component from the Toolbox for that, then open it. You will see all available input columns.
Check the column, set the Sort Type direction, and then check "Remove rows with duplicate sort values".
Bring in the data from the csv file the way it is, then dedup it after it's loaded.
It'll be easier to debug, too.
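A minimal sketch of the post-load dedup, assuming a hypothetical staging table dbo.StagingImport where Col1 and Col2 identify a duplicate:
-- keep one row per key, delete the rest
WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2  -- assumed key columns
                              ORDER BY Col1) AS rn
    FROM dbo.StagingImport
)
DELETE FROM Ranked
WHERE rn > 1;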
I used the Aggregate component and grouped by both QualificationID and UnitID. If you want, you can use the Sort component too. Perhaps my information might help others.