What is the use of the Multicast Transformation task? With this task, is it possible to send data from a single source to two destinations, where each destination has different columns?
I assume that you are referring to the Multicast Transformation inside the Data Flow task. If so, yes, it is possible. The purpose of the transformation is to channel data from a single source to any number of transformations or destinations.
If the source has the following columns:

Source: Column 1, Column 2, Column 3

and the destinations have these columns:

Destination 1: Column 1, Column 3
Destination 2: Column 2
Both destinations will be able to see Columns 1 - 3 that are available in the Source; you have to map the columns accordingly in the respective destinations. Refer to the example below:
Example:
Screenshot #1 shows that the Source has two columns, Header and Value.
Screenshot #2 shows that Destination 1 has both columns, Header and Value. They are mapped accordingly.
Screenshot #3 shows that Destination 2 has only the Header column. It is mapped accordingly.
Screenshot #4 shows sample package execution.
Hope that helps.
Screenshot #1:
Screenshot #2:
Screenshot #3:
Screenshot #4:
@Siva did a good job of explaining the how. I'm going to tackle the "What is the use of the Multicast Transformation task?" question.
Let me give you examples of how I have used it or seen it used. First, we like to store the data in a staging table that contains just the raw, unchanged data (this makes it easier for us to research data issues and see whether the problem came from a bug in our process or from bad data sent by the client), while at the same time sending the same data to another staging table that will be used to transform the data.
Sometimes we use Multicast to take denormalized files and send them to normalized data tables: the names go to the person table, the addresses go to the address table, and the phones go to the phone table.
Multicast can also be used to run several different transformations on different data fields from the same source at the same time, rather than one at a time, and then bring all the revised data back together with a Merge Join. For example, one path checks the states to make sure they are valid (or converts the long names to the two-character abbreviations), while another checks the zip codes and adds back the leading zeros that were lost because the data came from an Excel file. The cleaned address data is then put back together with the correct values we want for insertion into our database. This can speed up cleansing because the data is scrubbed simultaneously rather than one step at a time; a small sketch of the kind of per-path cleansing logic is shown below.
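For illustration only, here is a minimal sketch of the zip-code and state cleansing logic described above. In a real package this would typically live in a Derived Column expression or a Script Component on each Multicast path; the class, method, and column names here are made up for the example.

    // Hypothetical cleansing helpers for one Multicast path (all names are illustrative).
    using System;
    using System.Collections.Generic;

    public static class AddressCleansing
    {
        // Zip codes coming from Excel often lose their leading zeros; pad them back to five digits.
        public static string FixZipCode(string zip) =>
            string.IsNullOrWhiteSpace(zip) ? zip : zip.Trim().PadLeft(5, '0');

        // Partial state-name-to-abbreviation map, just to show the shape of the conversion.
        private static readonly Dictionary<string, string> StateAbbreviations =
            new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
            {
                { "Ohio", "OH" }, { "Texas", "TX" }, { "California", "CA" }
            };

        public static string FixState(string state) =>
            StateAbbreviations.TryGetValue((state ?? "").Trim(), out var abbreviation) ? abbreviation : state;
    }

In the package itself, each helper's logic would run on its own Multicast path, and the Merge Join would bring the cleaned columns back together.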
I'm new to SSIS and I'm completely stuck on what is perhaps an easy question.
I have two tables with a one-to-many relationship. I parse HTML data in a Script Component and create two outputs, one for master data and one for detail records.
Then I check the condition for overwriting the existing data and, if it is satisfied, I write the master record to the table. Unfortunately, my data flow looks like the picture above (schematic view): detail records are added in any case. I would like the details to be stored only if the condition is met (the green arrow in the picture), but I can't work out how to do it.
I faced the same problem when we had to load XML data into parent-child tables. For this, I added two Data Flow Tasks to the package. In the first DFT, I parsed the XML and loaded data into the master table only. In the second DFT, I parsed the child XML nodes and passed that output to a Merge Join operator as its first input. The second input to the Merge Join is supplied by extracting the data back from the master table.
Thanks, guys!
Eventually I managed to resolve the problem. I split the whole process into two data flows. In the first one I parse the HTML, save the master data to its table if needed, and save the parsed detail data in a package Object variable. The first data flow also has a Row Count component which saves its value in a MasterRowCount variable. In the second data flow I save the detail data to its table. The first and second data flows are connected by an expression-constrained precedence constraint (@MasterRowCount > 0), so the second data flow executes only if master data was added. A sketch of how the detail rows can be read back out of the Object variable is shown below.
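This is not the poster's actual code, but as a rough sketch: if the Object variable (say, User::DetailData) holds a System.Data.DataTable, a Script Component configured as a source in the second data flow can emit its rows. The variable name and the output column names are assumptions for the example.

    // Sketch of a Script Component (source) that replays rows stored in an SSIS Object variable.
    // Assumes User::DetailData was filled with a DataTable in the first data flow and is listed
    // under ReadOnlyVariables; the output columns Name and Value must be defined on Output 0.
    using System.Data;

    public class ScriptMain : UserComponent
    {
        public override void CreateNewOutputRows()
        {
            var details = (DataTable)Variables.DetailData;   // cast the Object variable back

            foreach (DataRow row in details.Rows)
            {
                Output0Buffer.AddRow();
                Output0Buffer.Name = row["Name"].ToString();
                Output0Buffer.Value = row["Value"].ToString();
            }
        }
    }

The Output0Buffer column properties correspond to whatever output columns are configured on the component.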
I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a csv, and I want to match the IDs against the surrogate keys in the lookup tables.
The data structure in the csv is like this:
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table, so what I am doing is:
Reading Lookup data using OLE DB Source
Reading my csv file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (i.e. I can't find the surrogate key), I can redirect the row to a rejected csv and handle it later.
Something like this:
(Sorry for the Spanish!)
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there is a way to avoid reading the data from that table twice (that is, not using two OLE DB Sources to read the same table).
I tried adding a second output to the Sort, but I am not allowed to. The same goes for adding another output from the OLE DB Source.
I see there is a "cache option"; is that the best way to go? (Although it would still imply creating another OLE DB Source... right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field), I am getting an error when I try to map both columns from my csv against the same column in my Lookup table:
There are columns missing with the sort order 2 to 2
What is the best way to go about this?
Or am I thinking about this incorrectly?
If something is not clear, let me know and I'll update my question.
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to copy a Data Flow stream to multiple outputs.
Gonzalo,
I have just used this article on how to derive columns when building a data warehouse:- How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields, Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
Here is the Data Flow:-
Note the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the two derived Ids, CodeTableFirstId and CodeTableSecondId:-
Hopefully I understand your problem and this is of use to you.
Cheers John
I have a file like the one seen below (just an example):
kwqif h;wehf uhfeqi f ef
fekjfnkenfekfh ijferihfq eiuh qfe iwhuq fbweq
fjqlbflkjqfh iufhquwhfe hued liuwfe
jewbkfb flkeb l jdqj jvfqjwv yjwfvjyvdfe
enjkfne khef kurehf2 kuh fkuwh lwefglu
gjghjgyuhhh jhkvv vytvgyvyv vygvyvv
gldw nbb ouyyu buyuy bjbuy
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
I want to dynamically skip the header information and load the flat file into the DB, like below:
ID Name Address
1 Andrew UK
2 John US
3 Kate AUS
The header information may vary from file to file (it is not a fixed number of rows).
Any help is appreciated. Thanks in advance.
The generic SSIS components cannot meet this requirement; you need to code for it, e.g. in an SSIS Script Task.
I would code that script to read through the file looking for that header row ID Name Address, and then write that line and the rest of the file out to a new file.
Then I would load that new file using the SSIS Flat File Source component. A minimal sketch of such a script is shown below.
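For illustration, here is a minimal sketch of that script logic, assuming the input and output paths come from package variables and that the real header row always starts with "ID Name Address" (the paths, class name, and default header text are assumptions for the example):

    // Sketch of the Script Task body: copy everything from the real header row onward
    // into a new file that a Flat File Source can then load. Paths and the header text
    // are illustrative; in a real package they would be read from package variables.
    using System.IO;

    public static class HeaderSkipper
    {
        public static void WriteTrimmedFile(string inputPath, string outputPath,
                                            string headerStart = "ID Name Address")
        {
            bool headerFound = false;

            using (var reader = new StreamReader(inputPath))
            using (var writer = new StreamWriter(outputPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    // Skip lines until the real header row is found, then keep everything.
                    if (!headerFound && line.TrimStart().StartsWith(headerStart))
                        headerFound = true;

                    if (headerFound)
                        writer.WriteLine(line);
                }
            }
        }
    }

Inside the Script Task's Main method this would be called with the paths taken from variables, and the data flow that follows would point its Flat File Source at the trimmed file.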
You might be able to avoid a script task if you'd prefer not to use one. I'll offer a few ideas here as it's not entirely clear which will be best from your example data. To some extent it's down to personal preference anyway, and also the different ideas might help other people in future:
Convert ID and ignore failures: Set the file source so that it expects however many columns you're forced into having by the header rows, and simply pull everything in as string data. In the data flow - immediately after the source component - add a data conversion component or conditional split component. Try to convert the first column (with the ID) into a number. Add a row count component and set the error output of the data conversion or conditional split to be redirected to that row count rather than causing a failure. Send the rest of the data on its way through the rest of your data flow.
This should mean you only get the rows which have a numeric value in the ID column - but if there's any chance you might get real failures (i.e. the file comes in with invalid ID values on rows you otherwise would want to load), then this might be a bad idea. You could drop your failed rows into a table where you can check for anything unexpected going on.
Check for known header values/header value attributes: If your header rows have other identifying features then you could avoid relying on the error output by simply setting up the conditional split to check for various different things: exact string matches if the header rows always start with certain values, strings over a certain length if you know they're always much longer than the ID column can ever be, etc.
Check for configurable header values: You could also put a list of unacceptable ID values into a table, and then do a lookup onto this table, throwing out the rows which match the lookup - then if you need to update the list of header values, you just have to update the table and not your whole SSIS package.
Check for acceptable ID values: You could set up a table like the above, but populate this with numbers - not great if you have no idea how many rows might be coming in or if the IDs are actually unique each time, but if you're only loading in a few rows each time and they always start at 1, you could chuck the numbers 1 - 100 into a table and throw away any rows you load which don't match when doing a lookup onto this table.
Staging table: This is probably the way I'd deal with it if I didn't want to use a script component, but in part that's because I tend to implement initial staging tables like this anyway, and I'm comfortable working in SQL - so your mileage may vary.
Pick up the file in a data flow and drop it into a staging table as-is. Set your staging table data types to all be large strings which you know will hold the file data - you can always add a derived column which truncates things or set the destination to ignore truncation if you think there's a risk of sometimes getting abnormally large values. In a separate data flow which runs after that, use SQL to pick up the rows where ID is numeric, and carry on with the rest of your processing.
This has the added bonus that you can just pick up the columns which you know will have the data you care about (i.e. columns 1 through 3), you can do any conversions you need to do in SQL rather than in SSIS, and you can make sure your columns have sensible names to be used in SSIS.
I have 6 different input datasets. I want to run ETL over all 6 datasets so they all get transformed to the same output table (same columns and types).
I am using Pentaho (Spoon) to do this.
Is there a way I can define an output table schema to be used by all these transformations in Pentaho? I am using MySQL as my output database.
Thanks in advance.
Sounds like you need the Select Values step. Put one of those on the last hop of each dataset's path and make the metadata for the paths all look EXACTLY the same. Then you can connect the output from each Select Values step into a Table Output. All the rows from each set will be mixed together in no particular order.
This can be more challenging than it looks. Spoon will throw errors if any of the fields aren't just exactly identical to the corresponding field from all other datasets. You'll have to find some way to get all the metadata from the datasets to be the same.
I have a flat file that is structured in a hierarchical format that looks something like this:
Area|AreaCode|AreaDescription
Region|RegionCode|RegionDescription
Zone|ZoneCode|ZoneDescription
District|DistrictCode|DistrictDescription
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
DistrictFooter
District|DistrictCode|DistrictDescription
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
Record|Name|Address|Etc
RouteFooter
Route|RouteCode|RouteDescription
Record|Name|Address|Etc
RouteFooter
DistrictFooter
ZoneFooter
RegionFooter
AreaFooter
I have to bring this into SSIS and consume information about the Record row and also about the headers for the current record row, as well as information from several other sources, and output a simpler flat file as a result.
I would like to read the flat file above into a structure in which each row contains a record with the appropriate header information included.
My question is, what is the best way to do this if it is even possible?
First, how do you tell what type of line you are on if you are on, say, line 3,987,986? How do you tell what is related to what? Is there a possibility you could get this in a better format? Before spending lots of time (and don't kid yourself, this will take lots of time to set up and test properly) I would kick the file back to the provider and request it in a different format. You won't always get it, but you should at least try.
When I have done this in the past in DTS, the first characters of each line told me which structure the line referred to. I imported everything into a staging table with two columns, one for the record type and one for the rest of the line. Then I parsed the rest into staging tables with the correct column structure for each type of record (plus any fields you might need for the relationships), did the clean-up, and then imported to the production tables. As you also have a different number of columns per record type, I would try that approach (though you may have to manually populate some columns instead of deriving them directly from the file). Also give each record an identity field in the staging tables; this will help you figure out the relationships, I think. A rough sketch of the initial record-type split is shown below.
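As a rough illustration of that first split (not the original DTS code), each line could be broken into a record type and the remainder before it lands in the two-column staging table. The class and property names here are invented for the sketch, and the line counter stands in for the identity field:

    // Sketch: classify each line of the hierarchical file by its leading pipe-delimited
    // token (Area, Region, Route, Record, RouteFooter, ...) and keep the rest of the line
    // for later, per-record-type parsing. All names are illustrative.
    using System.Collections.Generic;
    using System.IO;

    public record StagedLine(long LineId, string RecordType, string Rest);

    public static class HierarchicalFileSplitter
    {
        public static IEnumerable<StagedLine> Split(string path)
        {
            long id = 0; // stands in for the identity field of the staging table

            foreach (var line in File.ReadLines(path))
            {
                id++;
                int pipe = line.IndexOf('|');
                string recordType = pipe < 0 ? line : line.Substring(0, pipe);
                string rest = pipe < 0 ? string.Empty : line.Substring(pipe + 1);
                yield return new StagedLine(id, recordType, rest);
            }
        }
    }

From there, the identity values make it possible to relate each Record row back to the most recent Route, District, Zone, Region, and Area rows that precede it.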