I am using SSIS to load data from flat files to a SQL table. The flat files contain both new and updated rows. Each time the process is run, the updated rows will affect a small subset of the SQL table, specified by a 'period' column (e.g. one procedure may only affect periods 3, 4, and 5).
I am using a Look-Up transformation to separate new rows (Lookup No Match Output) from existing rows (Lookup Match Output). Since both the reference set and the data set being loaded are extremely large, I would like to use partial caching for the lookup. Is it somehow possible to modify the partial caching query to only include rows from the period numbers included in the flat files?
For example, my reference table may contain data from periods 1-10, but my flat files being loaded may only have data from periods 3-5. Therefore, I only want to cache data from periods 3-5, since I already know periods 1-2 and 6-10 will never produce a match.
Instead of using the table selector in the drop-down, which you should never do unless you need every column from every row, write your query to only pull back the columns you need for either matching or augmenting the existing data. In your case, you're going to need to add a filter, which is a bit persnickety.
The best approach I've found is to build the lookup query in a variable of type String and apply the needed filter in its expression. Below, you can see I defined two variables: an int which serves as my filter, and the query itself, which uses it.
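For reference, the two variables look roughly like this (MaxID's value of 6 is just what I used for the demo run described below):

User::MaxID         Int32     6 (the filter)
User::SourceQuery   String    EvaluateAsExpression = True, expression below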
The expression on my SourceQuery Variable is
"SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
WHERE D.rn <= " + (DT_WSTR, 10) #[User::MaxID]
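With MaxID set to 6, that expression evaluates to the following statement at runtime:

SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
WHERE D.rn <= 6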
My Data Flow looks like
I have my source, it hits a Lookup, and based on matched results rows go to one of the two buckets. My source query just generates the numbers 1 to 10 and the lookup is a query that generates the even numbers from 2 to 20.
During design time, that query looks like
SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
A normal run would result in a 50/50 split between the buckets
The goal, of course, is to make the lookup query take a parameter, like one of the source components does, but you'd quickly discover that
SELECT
D.rn
FROM
(
SELECT TOP 10
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) * 2 AS rn
FROM
sys.all_columns AS SA
) AS D(rn)
WHERE D.rn > ?
doesn't fly. Instead, you have to go back out to the Control Flow, select the Data Flow Task, right-click and select Properties. In the Properties window for your data flow, go to Expressions and click the ellipsis (...)
There will be a property named after your Lookup component. Assign the variable that holds the expression to make it all dynamic and voila, with a MaxID of 6 I only find 3 matches.
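For what it's worth, on my package the assignment looks roughly like this (assuming the component is literally named Lookup and you use the variable from above):

[Lookup].[SqlCommand]  =  @[User::SourceQuery]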
A final note: the Partial Cache may or may not be what you're looking for. That's an actual Lookup setting that controls how the component balances the cost of going to the database for lookup data versus caching it locally. A full cache drops all the specified columns for the applicable range into memory, which is why you only want to specify the columns you need. If you can get it down to a few skinny columns, then even if it's millions of rows, you probably aren't going to feel pain.
Contrived example:
Glean your period minimum and maximum at runtime and store them in two variables, PeriodMinimum and PeriodMaximum (I'm assuming it's a range; I'll discuss alternatives at the end).
Add them as Derived Columns to your Source flow.
In the Lookup Editor, under the Advanced tab, use a custom query (Contrived example): SELECT lookup, value FROM reference where period between ? and ?
Click the Parameters button and use your input columns appropriately.
If instead of a range you want to be able to randomly choose periods (3, 6, and 10) you'll have to do something a bit more contrived ...
Create multiple variables, Period1, 2, 3 ... n and set the default to -1 or some value which is not a valid period.
Populate these variables as needed with the periods you do want to filter on.
In the custom query, use SELECT lookup, value FROM reference where period = ? or period = ? or period = ?, ...
Set each parameter using your input columns.
Anyway, in general, use the Custom query with parameters when you want a dynamic lookup query based on runtime data.
Related
Is there a way to perform a date range lookup using a cache connection manager in SSIS? Or something similar that is very performant.
The scenario I have is as follows.
I have a row in a table that has a date, let's call it BusinessDate. I need to perform a lookup on a table to see if the BusinessDate is between the StartDate and EndDate of the dimension.
The problem is, the table I'm reading from has millions of records and my dimension (lookup table) has a few thousand records, and it takes very long.
Please help...
Nope, the Lookup with a cache connection manager is a strict equals. You might be able to finagle it with a lookup against an OLE DB source with a Partial/None cache model and custom queries.
So, what can you do?
You can modify the way you populate your Lookup Cache. Assuming your data looks something like
MyKey|StartDate|EndDate|MyVal
1 |2021-01-01|2021-02-01|A
1 |2021-02-01|9999-12-31|B
Instead of just loading as is, explode out your dimension.
MyKey|TheDate|MyVal
1 |2021-01-01|A
1 |2021-01-02|A
1 |2021-01-03|A
1 |2021-01-04|A
1 |2021-01-05|A
...
1 |2021-02-01|B
1 |2021-02-02|B
...
You might not want to build your lookup all the way out to year 9999; know your data and, say, go 5 years into the future, as well as picking up the end date.
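A minimal T-SQL sketch of that explode step, assuming the dimension lives in a table called dbo.DimTable with the columns above and that a window of 10 years back through 5 years out is enough (both the table name and the window are assumptions):

SELECT
D.MyKey
, CAL.TheDate
, D.MyVal
FROM
dbo.DimTable AS D
INNER JOIN
(
    -- ad hoc calendar covering 10 years back through 5 years out
    SELECT TOP (DATEDIFF(DAY, DATEADD(YEAR, -10, GETDATE()), DATEADD(YEAR, 5, GETDATE())) + 1)
        DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1,
                CAST(DATEADD(YEAR, -10, GETDATE()) AS date)) AS TheDate
    FROM sys.all_columns AS SA
    CROSS JOIN sys.all_columns AS SB
) AS CAL
    ON CAL.TheDate >= D.StartDate
   AND CAL.TheDate < D.EndDate;  -- EndDate treated as exclusive, matching the sample rows above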
Now your lookup usage is a supported case - strict equals.
Otherwise, the pattern of a Merge Join is how people handle range joins in a data flow. I'm going to reproduce Matt Masson's article from the MSDN blogs because the original is dead.
Lookup Pattern: Range Lookups
Performing range lookups (i.e. to find a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially common for historical loads and late-arriving fact situations, where you're using type 2 dimensions and you need to locate the key which represents the dimension value for a given point in time.
This blog post outlines three separate approaches for doing range lookups in SSIS:
Using the Lookup Transform
Merge Join + Conditional Split
Script Component
All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take its fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to lookup the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) for the given OrderDate.
One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).
Lookup Transform
The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).
We can use the following query to have the lookup transform perform our range lookup:
select [ProductKey], [ProductAlternateKey],
[StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and [StartDate] <= ?
and (
[EndDate] is null or
[EndDate] > ?
)
On the query parameters page, we map 0 -> ProductNumber, 1 and 2 -> OrderDate.
This approach is effective and easy to set up, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.
Merge Join and Conditional Split
This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.
For example, a row coming in from our source would contain an OrderDate and ProductNumber, like this:
From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.
We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):
StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))
This approach is a little more complicated, but performs a lot better than using the Lookup Transform.
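For reference, here is roughly the set-based join that the Merge Join + Conditional Split combination reproduces (a sketch; the DimProduct columns come from the article, the fact side is illustrative):

SELECT
F.OrderDate
, F.ProductNumber
, D.ProductKey
FROM
FactSource AS F
INNER JOIN dbo.DimProduct AS D
    ON D.ProductAlternateKey = F.ProductNumber
WHERE
D.StartDate <= F.OrderDate
AND (F.OrderDate < D.EndDate OR D.EndDate IS NULL);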
Script component
Not reproduced here
Conclusion
Not reproduced here
Hello and thanks in advance for any help.
I wrote a small program in Access that imports text files and places them into tables. The table is laid out with about 20 fields, and each field represents a different category containing a number of hours (for different reasons) for each record. I have been unable to find anything that will get me started in the right direction and need help.
What I want to do is search each record and find the highest five numbers for each record and then return each value with the associated heading for that field.
For example
Client  |PGM  |ASD|HFR  |STE  |NHU
Client A|365.4|255|254.6|180.1|26
Once I figure out how to query this info from the other 20 columns, my goal is to build a form that has this query attached to a button that returns these values. I can either set it up to search each record or search all records and find the top five values for all clients.
Again, thanks for any help. I am not hoping that someone will build me a solution, just get me a reference or some material to get me heading in a direction.
In Access, this will require a custom function that compares values of fields. A common requirement is to find the top 1 value from a record; it has been discussed many times on many sites, so Google it. Finding the top 5 does add complication. If the data structure were normalized, a TOP N nested query could probably provide the desired output.
A workaround for the current structure could be to build a UNION query that rearranges your data into a normalized structure (data is vertical instead of horizontal). Then use that query like a table as the source for a TOP N nested query. Is there only 1 record for each client? UNION example:
SELECT Client, PGM AS Hrs, "PGM" AS Source FROM tablename
UNION SELECT Client, ASD, "ASD" FROM tablename
UNION SELECT Client, HFR, "HFR" FROM tablename
UNION SELECT Client, STE, "STE" FROM tablename
UNION SELECT Client, NHU, "NHU" FROM tablename
continue for 15 other fields;
You must type or copy/paste this in the SQL View of the query builder. There is a limit of 50 SELECT lines.
For an example of TOP N, review http://allenbrowne.com/subquery-01.html#TopN. Unfortunately, this type of nested query can be a slow performer, and basing it on a UNION instead of a natural table can be even slower.
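For a concrete sketch, assuming the UNION above is saved as a query named qryHrsVertical (the name is made up) and that Client identifies each record, the TOP N pattern from that link would look roughly like this:

SELECT V.Client, V.Source, V.Hrs
FROM qryHrsVertical AS V
WHERE V.Hrs IN
    (SELECT TOP 5 V2.Hrs
     FROM qryHrsVertical AS V2
     WHERE V2.Client = V.Client
     ORDER BY V2.Hrs DESC)
ORDER BY V.Client, V.Hrs DESC;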
I have a job in Talend that is designed to bring together some data from different databases: one is a MySQL database and the other a MSSQL database.
What I want to do is match a selection of loan numbers from the MySQL database (about 82,000 loan numbers) to the corresponding information we have housed in the MSSQL database.
However, the tables in MSSQL to which I am joining the data from MySQL are much larger (~ 2 million rows), are quite wide, and thus cost much more time to query. Ideally I could perform an inner join between the two tables based on the loan number, but since they are in different databases this is not possible. The inner join that is performed inside a tMap occurs after the Lookup input has already returned its data set, which is quite large (especially since this particular MSSQL query will execute a user-defined function for each loan number).
Is there any way to create a global variable out of the output from the MySQL query (namely, the loan numbers selected by the MySQL query) and use that global variable as an IN clause in the MSSQL query?
This should be possible. I'm not working in MySQL but I have something roughly equivalent here that I think you should be able to adapt to your needs.
I've never actually answered a Stack Overflow question before, and while I was typing this the page told me I need at least 10 reputation to post more than 2 pictures/links (and I think I need 4 pics), so I'm going to write it out in words here and post the whole thing, complete with illustrations, on my blog in case you need more info (quite likely, I should think!).
As you can see, I've got some data coming out of the table and getting filtered by tFilterRow_1 to only show the rows I'm interested in.
The next step is to limit it to just the field I want to use in the variable. I've used tMap_3 rather than a tFilterColumns because the field I'm using is a string and I wanted to be able to concatenate single quotes around it; if you're using an integer you might not need to do that. And of course, if you have a lot of repetition, you might also want to get a tUniqRow in there as well to save a lot of unnecessary work.
The next step is the one that does the magic. I've got a list like this:
'A1'
'A2'
'B1'
'B2'
etc, and I want to turn it into 'A1','A2','B1','B2' so I can slot it into my where clause. For this, I've used tAggregateRow_1, selecting "list" as the aggregate function to use.
Next up, we want to take this list and put it into a context variable (I've already created the context variable in the metadata - you know how to do that, right?). Use another tMap component, feeding into a tContextLoad component. tContextLoad always has two columns in its schema, so map the output of the tAggregateRow to the "value" column and enter the name of the variable in the "key" column. In this example, my context variable is called MyList.
Now your list is loaded as a text string and stored in the context variable, ready for retrieval. So open up a new input component and embed the variable in the SQL code like this:
"SELECT distinct MY_COLUMN
from MY_SECOND_TABLE where the_selected_row in ("+
context.MyList+")"
It should be as easy as that, and when I whipped it up it worked first time, but let me know if you have any trouble and I'll see what I can do.
I have an SSIS data flow in SSIS 2012 project.
For every row, I need to calculate a field as a sum from another table based on some criteria, in the best way possible.
It would be something like a lookup but returning an aggregate on the lookup result.
Is there an SSIS way to do it with components, or do I need to turn to a Script Task or a stored procedure?
Example:
One data flow has a field named LOT.
I need to get the sum(quantity) from tableb where dataflow.LOT = tableb.lot
and write this back to a flow field
You just need to use the Lookup Component. Instead of selecting tableb, write the query, thus
SELECT
B.Lot -- for matching
, SUM(B.quantity) AS TotalQuantity -- for data flow injection
FROM
tableb AS B
GROUP BY
B.Lot;
Now when the package begins, it will first run this query against that data source and generate the quantities across all lots.
This may or may not be a good thing based on data volumes and whether the values in tableB are changing. In the larger volume case, if it's a problem, then I'd look at whether I can do something about the above query. Maybe I only need the current year's data. Maybe my list of Lots could be pushed onto the remote server beforehand to only compute the aggregates for what I need.
If TableB is very active, then you might need to change your caching from the default of Full to Partial or None. If Lot 10 shows up twice in the data flow, None would perform 2 lookups against the source while Partial would cache the values it has seen. Which is better probably depends on memory pressure, etc.
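If you do go Partial or None, the per-row statement under the Advanced tab would be something like this sketch (assuming Lot is the column you map the parameter to):

SELECT
B.Lot
, SUM(B.quantity) AS TotalQuantity
FROM
tableb AS B
WHERE
B.Lot = ?
GROUP BY
B.Lot;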
I have data from two different source locations that need to be combined into one. I am assuming I would want to do this with a merge or a merge join, but I am unsure of what exactly I need to do.
Table 1 has the same fields as Table 2 but the data is different which is why I would like to combine them into one destination table. I am trying to do this with SSIS, but I have never had to merge data before.
The other issue that I have is that some of the data is duplicated between the two. How would I keep only 1 of the duplicated records?
Instead of making an entirely new table, which will need to be updated again every time Table 1 or 2 changes, you could use a combination of views and UNIONs. In other words, create a view that is the result of a UNION query between your two tables. To get rid of duplicates, you could group by whatever column uniquely identifies each record.
Here is a UNION query using Group By to remove duplicates:
SELECT
MAX(ID) AS ID,
NAME,
MAX(going) AS going
FROM
(
SELECT
ID::VARCHAR,
NAME,
going
FROM
facebook_events
UNION
SELECT
ID::VARCHAR,
NAME,
going
FROM
events
) AS merged_events
GROUP BY
NAME
(Postgres not SSIS, but same concept)
Instead of Merge and Sort, use Union All and Sort, because the Merge transform needs two sorted inputs and performance will suffer.
1) Give Source1 & Source2 as inputs to the Union All transformation.
2) Give the output of the Union All transformation to the Sort transformation and check the option to remove rows with duplicate sort values.
This sounds like a pretty classic merge. Create your source and destination connections. Put in a Data Flow task. Put both sources into the Data Flow. Make sure the sources are both sorted and connect them to a Merge. You can either add in a Sort transformation between the connection and the Merge or sort them using a query when you pull them in. It's easier to do it with a query if that's possible in your situation. Put a Sort transformation after the Merge and check the "Remove rows with duplicate sort values" box. That will take care of any duplicates you have. Connect the Sort transformation to the data destination.
You can do this without SSIS, too.
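For instance, a minimal T-SQL sketch of the same idea (table and column names are placeholders); UNION, as opposed to UNION ALL, removes the exact duplicates for you:

INSERT INTO dbo.DestinationTable (Col1, Col2, Col3)
SELECT Col1, Col2, Col3 FROM dbo.Table1
UNION
SELECT Col1, Col2, Col3 FROM dbo.Table2;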