To make things simple, let's say I have a Client table with these fields:
ClientID
PCode
Region
I have a lookup Region table with these fields:
ID
PostCode
Region
The Client table has one row:
123, 3075, THOMASTOWN
The Region table has two rows:
1,3074,THOMASTOWN
2,3075,LALOR
I am trying to cleanse some data using SSIS fuzzy lookup.
I use the Client table as the source and the Region table as the reference. When the "Maximum number of matches to output per lookup" option is 1, the result is 123, 3075, THOMASTOWN, 1, 3074, THOMASTOWN; in this case SSIS prefers the Region match over the PCode match. But when I increase this option to a higher value (2, 4, or 100), all the result rows are identical to that one.
I expect that when I increase the number, the other row of the lookup table shows up as one of the matches, because that row has the same post code (3075) as the Client row.
To my surprise, the first time I increased the option from 2 to 4 I saw the expected result (the other lookup record as a match) in 2 of the 4 rows of the fuzzy lookup output, but it has never happened since, and I always see exactly the same records regardless of the value of the "Maximum number of matches" option.
Can anyone explain what's happening here and whether I am doing something wrong?
From BOL: http://msdn.microsoft.com/en-us/library/ms137786.aspx
The transformation returns zero or more matches up to the number of matches specified. Specifying a maximum number of matches does not guarantee that the transformation returns the maximum number of matches; it only guarantees that the transformation returns at most that number of matches.
If you are using both columns in your matching, the Region value seems to be different enough that the PCode alone is not a strong enough match to return any further results.
There are a couple of options:
Try tuning the similarity threshold and see if you can get the row returned
Use two fuzzy lookups (one for each column), so that if the first does not match or the similarity % is not strong enough, you can perform a second lookup to make the match.
Is there a way to perform a date range lookup using a cache connection manager in SSIS? Or something similar that is very performant.
The scenario I have is as follows.
I have a row in a table that has a date; let's call it BusinessDate. I need to perform a lookup on a table to see if the BusinessDate is between the StartDate and EndDate of the dimension.
The problem is that the table I'm reading from has millions of records, my dimension (lookup) table has a few thousand records, and the lookup takes very long.
Please help...
Nope, the Lookup with a cache connection manager is a strict equals. You might be able to finagle it with a lookup against an OLE DB source with a Partial/None cache model and custom queries.
So, what can you do?
You can modify the way you populate your Lookup Cache. Assuming your data looks something like
MyKey|StartDate|EndDate|MyVal
1 |2021-01-01|2021-02-01|A
1 |2021-02-01|9999-12-31|B
Instead of just loading as is, explode out your dimension.
MyKey|TheDate|MyVal
1 |2021-01-01|A
1 |2021-01-02|A
1 |2021-01-03|A
1 |2021-01-04|A
1 |2021-01-05|A
...
1 |2021-02-01|B
1 |2021-02-02|B
...
You might not want to build your lookup all the way to the year 9999; know your data and cap the open-ended ranges at, say, 5 years into the future, making sure you still cover the end dates you actually need.
Now your lookup usage is a supported case - strict equals.
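For what it's worth, here is a minimal T-SQL sketch of that explode step, assuming SQL Server and a hypothetical dbo.DimRange(MyKey, StartDate, EndDate, MyVal) table whose EndDate is exclusive, with open-ended rows capped at 5 years out instead of 9999:

DECLARE @Horizon date = DATEADD(year, 5, CAST(GETDATE() AS date));

WITH Exploded AS
(
    -- anchor: one row per range, starting at StartDate, with open-ended ranges capped
    SELECT d.MyKey,
           d.StartDate AS TheDate,
           CASE WHEN d.EndDate > @Horizon THEN @Horizon ELSE d.EndDate END AS EndDate,
           d.MyVal
    FROM dbo.DimRange AS d
    UNION ALL
    -- recursion: add one day at a time until the (exclusive) end date is reached
    SELECT e.MyKey, DATEADD(day, 1, e.TheDate), e.EndDate, e.MyVal
    FROM Exploded AS e
    WHERE DATEADD(day, 1, e.TheDate) < e.EndDate
)
SELECT MyKey, TheDate, MyVal
FROM Exploded
OPTION (MAXRECURSION 0);  -- the default limit of 100 recursions is far too low for day-level ranges

Feed a query like that to the Cache Transform that populates your cache connection manager, and the Lookup can then join on MyKey plus TheDate = BusinessDate as a plain equality match.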
Otherwise, the pattern of a Merge Join is how people handle range joins in a data flow. I'm going to reproduce Matt Masson's article from the MSDN blogs here because the original is dead.
Lookup Pattern: Range Lookups
Performing range lookups (i.e. finding a key for a given range) is a common ETL operation in data warehousing scenarios. It's especially common for historical loads and late-arriving fact situations, where you're using Type 2 dimensions and you need to locate the key that represents the dimension value for a given point in time.
This blog post outlines three separate approaches for doing range lookups in SSIS:
Using the Lookup Transform
Merge Join + Conditional Split
Script Component
All of our scenarios will use the AdventureWorksDW2008 sample database (DimProduct table) as the dimension, and take their fact data from AdventureWorks2008 (SalesOrderHeader and SalesOrderDetail tables). The "ProductNumber" column from the SalesOrderDetail table maps to the natural key of the DimProduct dimension (ProductAlternateKey column). In all cases we want to look up the key (ProductKey) for the product which was valid (identified by StartDate and EndDate) on the given OrderDate.
One last thing to note is that the Merge Join and Script Component solutions assume that a valid range exists for each incoming value. The Lookup Transform approach is the only one that will identify rows that have no matches (although the Script Component solution could be modified to do so as well).
Lookup Transform
The Lookup Transform was designed to handle 1:1 key matching, but it can also be used in the range lookup scenario by using a partial cache mode, and tweaking the query on the Advanced Settings page. However, the Lookup doesn't cache the range itself, and will end up going to the database very often - it will only detect a match in its cache if all of the parameters are the same (i.e. same product purchased on the same date).
We can use the following query to have the lookup transform perform our range lookup:
select [ProductKey], [ProductAlternateKey],
[StartDate], [EndDate]
from [dbo].[DimProduct]
where [ProductAlternateKey] = ?
and [StartDate] <= ?
and (
[EndDate] is null or
[EndDate] > ?
)
On the query parameters page, we map parameter 0 to ProductNumber, and parameters 1 and 2 to OrderDate.
This approach is effective and easy to set up, but it is pretty slow when dealing with a large number of rows, as most lookups will be going to the database.
Merge Join and Conditional Split
This approach doesn't use the Lookup Transform. Instead we use a Merge Join Transform to do an inner join on our dimension table. This will give us more rows coming out than we had coming in (you'll get a row for every repeated ProductAlternateKey). We use the conditional split to do the actual range check, and take only the rows that fall into the right range.
For example, a row coming in from our source would contain an OrderDate and a ProductNumber.
From the DimProduct source, we take three additional columns - ProductKey (what we're after), StartDate and EndDate. The DimProduct dimension contains three entries for the "LJ-0192-L" product (as its information, like unit price, has changed over time). After going through the Merge Join, the single row becomes three rows.
We use the Conditional Split to do the range lookup, and take the single row we want. Here is our expression (remember, in our case an EndDate value of NULL indicates that it's the most current row):
StartDate <= OrderDate && (OrderDate < EndDate || ISNULL(EndDate))
This approach is a little more complicated, but performs a lot better than using the Lookup Transform.
Script component
Not reproduced here
Conclusion
Not reproduced here
Context: I have a dataset with multiple numeric columns, which I am analysing in Contour. During one step of my analysis I want to find the minimum value of three different columns for every row.
Question: Is there an expression function in Contour I can use to get the minimum or maximum value of two or more columns?
Generally speaking you can find all available expression functions in the Foundry Contour documentation under the References section.
For this specific case the following two functions can be used:
GREATEST: Returns the maximum value of the list of values/columns. Null values are ignored.
LEAST: Returns the minimum value of the list of values/columns. Null values are ignored.
The usage would look as follows:
greatest("numeric_column_1", "numeric_column_2", "numeric_column_3")
This is a query on deduplicating an already sorted mainframe dataset without re-sorting it.
The input sequential dataset has the following structure. 'KEYn' in the first 4 bytes represents the key and the remainder of each row represents the rest of the record's data. There are records in which the same key is repeated though the remaining data is different in each record. The records are already sorted on 'KEYn'.
KEY1aaaaaa
KEY1bbbbbb
KEY2cccccc
KEY3xxxxxx
KEY3yyyyyy
KEY3zzzzzz
KEY3wwwwww
KEY4uuuuuu
KEY5hhhhhh
KEY5ffffff
My requirement is to pick up the first record of each key and drop the remaining 'duplicates', so the output file for the above input should look like this:
KEY1aaaaaa
KEY2cccccc
KEY3xxxxxx
KEY4uuuuuu
KEY5hhhhhh
Since the data is already sorted, I don't want to use the SORT utility with SUM FIELDS=NONE, or ICETOOL with the SELECT FIRST operand, since both of these will actually end up re-sorting the data on the deduplication key (KEYn). Also, the actual dataset I am referring to is huge (1.6 billion records, AVGRLEN 900, VB), and a job actually ran out of sort work space trying to sort it in one go.
My query is: is there any option available in JCL-based utilities to do this deduplication without re-sorting and without using sort work space? I am trying to avoid writing a COBOL/Assembler program to do this.
Try this (untested). Briefly: because the dataset is VB, the leading 1,4 in each BUILD copies the record descriptor word (RDW); SEQNUM numbers the records and restarts at 1 whenever the 4-byte key (position 5, immediately after the RDW) changes; OUTFIL then keeps only the records whose sequence number is 1 and rebuilds the original record layout without the sequence number.
OPTION COPY
INREC BUILD=(1,4,SEQNUM,3,ZD,RESTART=(5,4),5)
OUTFIL INCLUDE=(5,3,ZD,EQ,1),BUILD=(1,4,8)
I have a large database containing information about orders. Each order has a unique number, a trip number it is assigned to, and a location number.
It sometimes happens that multiple orders are delivered to the same location, during the same trip. I want to merge these entries into one, as they are skewing my analysis results.
I want to iterate over the entire table, checking in every trip whether there are orders that have the same location number. If so, I want to update the rows in MySQL to either add together the values in the columns of that row or take the maximum of the two.
Is this possible in just MySQL?
I am fairly new to using MySQL (and coding in general) so I'm not sure how to write anything that iterates in it.
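This kind of merge usually does not need row-by-row iteration; a single GROUP BY over trip and location can collapse the duplicates in one statement. A minimal sketch, assuming a hypothetical orders table with columns order_number, trip_number, location_number, quantity, and weight (adjust the names, and which columns are summed versus maxed, to your schema):

CREATE TABLE orders_merged AS
SELECT trip_number,
       location_number,
       MIN(order_number) AS order_number,  -- keep one representative order number per group
       SUM(quantity)     AS quantity,      -- columns whose values should be added together
       MAX(weight)       AS weight         -- columns where the maximum should be kept
FROM orders
GROUP BY trip_number, location_number;

You could then analyse orders_merged directly, or use it to replace the duplicated rows in the original table.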
I am trying to create a dynamic table - I have tried a Pivot Table, but cannot get it to work. So I thought that maybe it could be done with an IF statement, but that did not work for me either.
Basically, I have 2 tables, 1 table containing the information (data source table) and 1 table that should be dynamic according to the data in the first table.
So if I change the data in the E-column, the Fruit table (image below) must be updated accordingly.
So if I write 2 instead of 1 in the count of Apples, then it should create 2 apples under the "Fruit" column. Data in the remaining columns will be calculated with a formula/fixed data - so that is not important.
I am open to any solutions; formulas, pivot tables, VBA, etc.
Have a nice weekend.
I have both Excel 2010 and 2013.
If you want to repeat some text a number of times, you can use a somewhat complicated formula to do it. It relies on there not being duplicate entries in the Fruits table and no entries with a 0 count.
Picture of ranges and results
Formulas involved include a starter cell (E2) and a repeating entry (E3, copied down). These are actually normal formulas, no array required. Note that I have created a Table for the data, which allows me to use named fields to get the whole column.
E2 = INDEX(Table1[Fruits],1)
E3 = IF(
INDEX(Table1[Count],MATCH(E2,Table1[Fruits],0))>COUNTIF($E$2:E2,E2),
E2,
INDEX(Table1[Fruits],MATCH(E2,Table1[Fruits],0)+1))
How it works
This formula relies on checking the number of entries above the current one and comparing to the desired count. Some notes:
The starter cell is needed to get the first result.
After the first cell, it counts how often the value above appears in the total list. This is compared to the desired count. If less than desired, it will repeat the value from above. If greater, it will go to the next item in the list. There is a dual relative/absolute reference in here to count cells above.
Since it goes to the next item in the list, don't put a 0 for a count or it will get included once.
You can copy this down for as many cells as you want. It will return #REF! when it runs out of data. You can wrap it in IFERROR(..., "") to make these display nicely (see the wrapped formula below).
If the non-0 rule is too much, it can probably be removed with a little effort. If there are duplicates, that will be much harder to deal with.
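For example, the repeating-entry formula from above wrapped in IFERROR, so that rows past the end of the data show blanks instead of #REF!:
E3 = IFERROR(IF(
     INDEX(Table1[Count],MATCH(E2,Table1[Fruits],0))>COUNTIF($E$2:E2,E2),
     E2,
     INDEX(Table1[Fruits],MATCH(E2,Table1[Fruits],0)+1)), "")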