Prioritize Bulk Insert in a table using Union all in ssis - ssis

I have multiple archive tables storing similar kind of data in these tables but archived in the month wise format. Now, the requirement is to get all the archived data in to one table instead of multiple tables.
I am doing this activity with the help of Union all in SSIS, however it seems that it is taking random insert in the destination table.
Attach is the route taken for the transformation.
I want to prioritize the insert, please suggest!

You can add an extra column "Priority" to each of OLE DB sources with the corresponding priority for each source and then after union you can add Sort Component that sorts the data by Priority. But if you have a lot of data - that would be really inefficient because sort component will wait until all the source data is read.
I would suggest to write a proper source SQL statement that does the union/prioritization/sort for you and then insert into target.
Also if the sources are on different servers you can create Foreach loop container that will iterate through source tables and inset all of them into the target table. You can use this article for the reference.

Related

SSIS Dynamic table and number of columns in data flow

I have a table (Say Table A) that lists about 10 or 15 tables, each have a different number/names of columns. I need to create a data flow which follows the same pattern for all those tables.
So I have a Foreach loop in SSIS package that loops through all the records in Table A, saves the name of the 10 or 15 tables in a variable and does a data flow operation.
Within the data flow, it registers the design of the first default table which I give it, with names/no of columns and it works perfectly, Problem is when it goes through the second table, it has its own columns and that is where it throws an error saying that meta data for tables it not matched or something like that. Basically table definitions are different.
How can I make SSIS dynamically create the design of these tables, I really don't want to have 15 different data flows for each table.
Suggestions please.
Sad to say, SSIS itself does not allow such scenario. SSIS is bound to metadata, i.e. column names and data types; moreover, it checks for a match before running the dataflow tasks.
You have to create a dataflow for each table layout; you can extend your design with several dataflows and conditional invocation with task precedence constraints. BIML can help you do generate SSIS package based on your metadata or code, but the limit stays the same - data design has to be fixed before running the package and cannot be altered
what if I say there is a way?
It's slow and complicated you have been warned! (at worst 1/3 of normal speed)
1.you can combine all your columns into a single column and create a JSON from them
let's call it json_all.
this can be achieved with information_schema and some dynamic query in SSIS.
2.we have a staging table that only has 1 column and we insert these data to that table.
3.we open the JSON and insert it into the final table.
this way you have fixed metadata for all tables. (all of them only has 1 column named json_all as output and your destination table only has 1 column with the same name as input)
this is very helpful if you have a lot of tables with a limited amount of rows. (1 milion in 20 min was my benchmark)

Pentaho Import uniqe records into database

I am quite new to Pentaho Spoon and I would like to import records of an csv file to an database table. However, only unique records should be imported into the database table. That is why I need to compare EACH record with all records of the database table in order to determine if the record should be imported or not.
So far, I tried out the suggested CRUD-pattern which looks like this:
As you can see in the picture, I merge the excel input and the table input (ignore the cast-steps. I needed to cast a value because ther differed in the float format: database format was #.000000 and the csv format of float was #.0)
After the merge join, I compare the flag (which is given by the merge rows(diff) and if the compared records are new, I import them to the database table, if they are changed, I update the record and if they are deleted or identical, I simply do nothing. So far, so good.
But here is the problem: If I shuffle the records of the csv-input-file and run the transformation anew, all the records are imported anew and consequently, there are duplicated in my database table (which I wanted to avoid). To emphasize again: The right way to solve this is that each row of the csv-input-file is compared with ALL entries in the database table.
How can I realize this? Any suggestions? Thank you so much in advance!!
The Merge Rows (diff) expect the input to be sorted. Normally, you have been warned of this by a pop-up.
Put a Sort rows step on the output flow of the Excel Input, before it reaches the Merge Rows (diff).
You should do the same between the Table Input and the Merge Rows (diff). On course you may think you could do it in the sql statement of the Table Input.
However, there is a beginner trap here. You have 3 other steps Output Rows, Update and Delete which operates on the same table. And these steps may lock the table. As in Kettle all the steps are running concurrently, you do not know which steps will fire first, and the table may be locked and never be able to read even the first record. This is known in jargon as an auto-lock, and the way to solve it is to put a Sort Row step as a buffer.
You can use the 'Dimension lookup/update' control which provides the same functionality which you are trying to achieve.
Thanks,
Nilesh

SSIS - Reuse Ole DB source when matching Fact against lookup table twice

I am pretty new to SSIS and BI in general, so first of all sorry if this is a newbie question.
I have my source data for the fact table in a csv, so I want to match the ids against the surrogate keys in lookup tables.
The data structure in the csv is like this
... userId, OriginStationId, DestinyStationId,..
What I am trying to accomplish is to match the data against my lookup table. So what I am doing is
Reading Lookup data using OLE DB Source
Reading my csv file
Sorting both inputs by the same field
Doing a left join by Id, in order to get the SK
This way, if there is no match (aka can't find the surrogate key) I can redirect that to a rejected csv and handle it later.
something like this:
(sorry for the spanish!)
I am doing this for each dimension, so I can handle each one with different error codes.
Since OriginStationId and DestinyStationId are two values from the same dimension (they both match against the same lookup table), I wanted to know if there's a way to avoid reading two times the data from the table (I mean, not to use two ole db sources to read twice the data from the same table).
I tried adding a second output to the sort but I am not allowed to. The same goes to adding another output from OLE DB Source.
I see there's an "cache option", is the best way to go ? (Although it would impy creating anyway another OLE DB source.. right?)
The third option I thought of was joining by the two fields, but since there is only one field in the lookup table (the same field) I am getting an error when I try to map both colums from my csv against the same column in my Lookup table
There are columns missing with the sort order 2 to 2
What is the best way to go for this ?
Or I am thinking something incorrectly ?
If something was not clear let me know and I'll update my question
Any time you wish you could have multiple outputs from a component that only allows one, all you have to do is follow that component with the Multicast component, whose sole purpose is to split a Data Flow stream into multiple outputs.
Gonzalo
I have just used this article on how to derive columns for a data warehouse building:- How to Populate a Fact Table using SSIS (part 1).
Using this I built a simple package that reads a CSV file with two columns that are used to derive separate values from the same CodeTable. The CodeTable has two fields Id and Description.
The Data Flow has two "Lookup" tasks. The first one joins the attribute Lookup1 against the Description to derive its Id. The second joins the attribute Lookup2 against the Description to derive a different Id.
Here is the Data Flow:-
Note the "Data Conversion" was required to convert the string attributes from the CSV file into "Unicode string [DT_WSTR]" so they could be joined to the nvarchar(50) description attribute in the table.
Here is the Data Conversion:-
Here is the first Lookup (the second one joins "Copy of Lookup2" to the Description):-
Here is the Data Viewer output with the to two derived Ids CodeTableFirstId and CodeTableSecondId:-
Hopefully I understand your problem and this is of use to you.
Cheers John

A way to update data in Oracle

I have a table that I need to update each day. The data comes in a text file every time. I wrote a program that extracts the data from the text file and and writes it in the table, but now I want to modify it to just update the existing data. The data is mostly the same, it might differ only a few things.
I was thinking about MERGE but I don't know very well how I could use this in my program. All the examples that I saw used a second table.
So it would be like creating a second table in which I extract the current data, after which I make the merge into the old table to update the records. I want to avoid creating a second table, so I was wondering if there is any way to do this?
Thanks!

SSIS 2008. Transferring data from one table to another ONLY if the data is not duplicated

I'm going to do my best to try to explain this. I currently have a data flow task that has an OLE DB Source transferring data from a table from a different database to a table to another database. It works fine but the issue I'm having is the fact that I keep adding duplicate data to the destination table.
So a CustomerID of '13029' with an amount of '$56.82' on Date '11/30/2012' is seen in that table multiple times. How do I make it so I can only have unique data transferring over to that destination table?
In the dataflow task, where you transfer the data, you can insert a Lookup transformation. In the lookup, you can specify a data source (table or query, what serves you best). When you chose the data source, you can go to the Columns view and create a mapping, where you connect the CustomerID, Date and Amount of both tables.
In the general view, you can configure, what happens with matched/non matched row. Simply take the not matched output and direct it to the DB destination.
You will need to identify what makes that data unique in the table. If it's a customer table, then it's probably the customerid of 13029. However if it's a customer order table, then maybe it's the combination of CustomerId and OrderDate (and maybe not, I have placed two unique orders on the same date). You will know the answer to that based on your table's design.
Armed with that knowledge, you will want to write a query to pull back the keys from the target table SELECT CO.CustomerId, CO.OrderId FROM dbo.CustomerOrder CO If you know the process only transfers data from the current year, add a filter to the above query to restrict the number of rows returned. The reason for this is memory conservation-you want SSIS to run fast, don't bring back extraneous columns or rows it will never need.
Inside your dataflow, add a Lookup Transformation with that query. You don't specify 2005, 2008 or 2012 as your SSIS version and they have different behaviours associated with the Lookup Transformation. Generally speaking, what you are looking to do is identify the unmatched rows. By definition, unmatched means they don't exist in the target database so those are the rows that are new. 2005 assumes every row is going to match or it errors. You will need to click the Configure Error Output... button and select "Redirect Rows". 2008+ has an option under "Specify how to handle rows with no matching entries" and there you'll want "Redirect rows to no match output."
Now take the No match output branch (2008+) or the error output branch (2005) and plumb that into your destination.
What this approach doesn't cover is detecting and handling when the source system reports $56.82 and the target system has $22.38 (updates). If you need to handle that, then you need to look at some change detection system. Look at Andy Leonard's Stairway to Integration Services series of articles to learn about options for detecting and handling changes.
Have you considered using the T-SQL MERGE statement? http://technet.microsoft.com/en-us/library/bb510625.aspx
It will compare both tables on defined fields, and take an action if matched or not.