I am new to Talend os.
However, I received a task:
Create file delimited .csv metadata (one for Lead & Opportunity).
Move files to your repository on the AWS server (the etl_process1 login).
Create two tables sfdc_leads_reporting_raw and sfdc_opp_reporting_raw.
Load the data from the files into the tables. Ensure the data types are correctly used when creating metadata schemas & tables.
Till step 4 I am done.
Now the problem is:
How to Implement logging at the end of each job to report the number of leads (count of distinct id in leads table) and number of opportunities created (count of opportunity id) by stages (how many converted, qualified, closed won, and dead)?
Help would be appreciated.
You can get this data using global variables, in a subjob at the end of your job. Most components provide a global variable called tComponent_NB_LINE (or _NB_LINE_INSERTED for database components) that gives you the number of lines output by the component.
For instance tFileOutputDelimited_1_NB_LINE or tOracleOutput_1_NB_LINE_INSERTED.
Using these variables you can log into console or file.
Here is a simple example. If you have a tOracleOutput_1 in your job you can do:
tPostJob -- OnComponentOk -- tFixedFlowInput -- Main -- tLogRow
Inside tFixedFlowInput you retrieve the variable
(Integer)globalMap.get("tOracleOutput_1_NB_LINE_INSERTED")`.
If you need to log aggregated info, you can append a tAggregateRow to your output components, and use tSetGlobalVar to get count by certain criteria.
Related
How can several files be produced in a single SSIS package? I have created one that produces a single file, but have no idea how to produce several ones.
The package I produced uses variables to know which data to retrieve, and an expression in the flat file connection manager to assign the correct name to the file (which is based on variables).
The single package I created retrieves the city for which I want the sales data (New York) and the month (September 2020) as variables/parameters, and uses them to extract the appropriate data. Example of SQL statement executed:
select * from table1 where City = ? and Period = ??
It then uses those to build the name for the file to be exported and sends it to a folder. But how do you do that to produce several files within the same package? How can I make the same SSIS package produce another file for Chicago - July 2020, another for Denver - June 2020, and another for San Diego - March 2020?
I plan to have a table that indicates what needs to be produced.
ExampleRow1: Chicago, Sep 2020, Produce=Yes.
ExampleRow2: Miami, Jan 2020, Produce=Yes.
So the SSIS package would need to use that info to produce a file, and then do it again, and again, until there is nothing more to produce. Is this even possible? I know a foreach loop container can help, but not sure if it can handle so many variables changing. This is pretty much the first package I create, that's why I am this ignorant. Thanks in advance!
Right now, you have it working correctly for the value of your two SSIS variables (City and Period) and you have it parameterized so I wouldn't discount that as your first SSIS package. People struggle with far easier tasks
What you need to do is connect the orchestrator/driver table into your package. Here's how we're going to do that.
Create an SSIS variable called rsObject of type Object. This is going to hold a recordset object aka the results of our query.
Execute SQL Task
Add an Execute SQL Task to the Control Flow. Call it "SQL Get Driver Data" You'd use a query like
SELECT T.City, T.Period FROM dbo.ExtractSetup AS T WHERE T.Produce = 1;
Change the default of No Result Set to Full Result Set. That tells SSIS to expect a table shaped return object but something needs to catch that incoming data.
In the Results tab, you now need to map the results into an SSIS variable. Assuming an OLE DB type connection manager, you'll select User::rsObject in the Variable list and then 0 as the recordset name (doing this from memory so specifics might be a slightly off)
Save and run that task. Assuming no errors, when the package runs we have a, potentially empty, set of data in our recordset object. Let's do something with that.
Shredding the data
The name I generally see applied to getting data out of enumerable objects in SSIS is called "shredding the data". The implementation of that is an Foreach Enumerator - one of the most powerful tools in your toolkit.
Drag a ForEach Loop Container onto the canvas. Drag the connector line (precedent constraint) from the "SQL Get Driver Data" to our new ForEach Loop Container. I'd name it "FELC Shred Results" to indicate my intent.
Double click the Task and change the default enumerator type from File System to "Ado.net recordset" This has no bearing on whether you used an OLE, ODBC or ADO.NET connection manager to populate the table-like object. If it's a table, use ADO.NET Recordset.
Specify our variable [User::rsObject] as the source of the Recordset object.
The last thing we need to do is configure what we should do with the current row in the enumerator. That's in the Mapping tab. Here you'll add two parameters and this will be a zero based ordinal system. Choose [User::City] (or whatever you've named your City variable) for your first entry and map that to column name 0. Add a row and use User::Period and map that to column 1
The final step is to take the existing logic (Data Flow Task and whatever else is variable dependent) and move it into the FELC. That's literally drawing a box around it with the mouse to highlight everything and hold the left mouse button and drag it into the FELC.
Hit F5 and you should have 2 files generated.
I am trying to accomplish something that is pretty easy to do in SQL, but seemingly very challenging to do in SSIS without using SQL. Basically, I need to consolidate and concatenate a field of a many-to-one relationship.
Given entities: [Contract Item] (many) to (one) [Account]
There is a field [ari_productsummary] that contains the product listed on the Contract Item entity. We want to write that value to the Account as [ari_activecontractitems]. However, an Account may have more than one Contract Item record associated to it, in which case, we want to concatenate those values. We also only want the distinct values to be concatenated (distinct rows already solved within my data flow).
This can be accomplished by writing to a temporary table, and then using a query or view to obtain the summarized results as followed. I created a SQL table called TESTTABLE that contains the [ari_productsummary] from the Contract Item entity along with the referring [accountid] to map it back to Account. I then wrote the following query as a view:
SELECT distinct accountid,
(SELECT TT2.ari_productsummary + '; '
FROM TESTTABLE TT2
WHERE TT2.accountid = TT.accountid
FOR XML PATH ('')
) AS 'ari_activecontractitems'
FROM TESTTABLE TT
Executing that Query provides me the results that I want, which I can then use for importing into the Account entity as shown below:
But how do I do this in a SSIS dataflow without writing to a SQL table as a temporary placeholder for the data?? I want to do the entire process inside one dataflow container, without using a temporary SQL table/view. The whole summarization process needs to be done on the fly:
Does anyone have a solution that doesn't require a temporary SQL table/view/query, but is contained entirely within the data flow?
I am using VS 2017 and the KingswaySoft Dynamic CRM 365 ETL toolset to develop my solution/package.
Spit balling here as I don't Dynamics nor do I have the custom components.
Data Flow 1 - Contract aggregation
The purpose of this data flow is to replicate your logic in the elegant query you provided and shove that into a Cache Connection Manager (see Notes for 2008+ at the end)
KingswaySoft Dynamics Source -> Script Task -> Cache Transform
If you want to keep the sort in there, do it before the script task. The implementation I'll take with the Script Task is that it's fully blocking - that is all the rows must arrive before it can send any on. Tasks like the Merge Join are only partially blocking because the requirement of sorted data means that once you no longer have a match for the current item, you can send it on down the pipeline.
The Script Task is going to be asynchronous transformation. You'll have two output columns, your key accountid and your new derived column of ari_activecontractitems. That column will might need to be big - you'll know your data best but if it's a blob type in Dynamics (> 4k unicode or > 8k ascii characters) then you'll have to define the data type as DT_TEXT/DT_NTEXT
As inputs, you'll select accountid and ari_productsummary from your source.
The code should be pretty easy. We're going to accumulate the inbound data into a Dictionary.
// member variable
Dictionary<string, List<string>> accumulator;
The PreProcess method, we'll tack this in there to initialize our variable
// initialize in PreProcess method
accumulator = new Dictionary<string, List<string>>();
In the OnBufferRowSent (name approx)
// simulate the inbound queue
// row_id would be something like Rows.row_id
if (!accumulator.ContainsKey(row_id))
{
// Create an empty dictionary for our list
accumulator.Add(row_id, new List<string>());
}
// add it if we don't have it
if (!accumulator[row_id].Contains(invoice))
{
accumulator[row_id].Add(invoice);
}
Once you get the signal sent of no more data available, that's when you start buffering output data. The auto generated code will have placeholders for all this.
// This is how we shove data out the pipe
foreach(var kvp in accumulator)
{
// approximately thus
OutputBuffer1.AddRow();
OutputBuffer1.row_id = kvp.Key;
OutputBuffer1.ari_productsummary = string.Join("; ", kvp.Value);
}
We have an upcoming release that comes with a component that does exactly what you are trying to achieve without the need of writing custom code. The feature is currently under preview, please reach out to us for private access to the feature. You can find our contact information on our website.
UPDATE - June 5, 2020, we have made the components available for public access at https://www.kingswaysoft.com/products/ssis-productivity-pack/ as a result of our 2020 Release Wave 1. We have two components available that serve this kind of purpose. The Composition component will take input values and transform into a composite value in a SSIS column. The Decomposition component does the opposite, it would take an input value and split it into multiple rows using either delimiter-based text splitting or XML/JSON array splitting.
I've got 4 different companies whose spreadsheets I have to process. Currently, if I receive a set of files, the SSIS solution is set to execute all of the packages in the solution whether I received the files or not. I want to change this behavior by looking the the Source directory that stores the .xlsx files.
For instance, I have the files from two companies: A & B. The file names will always be in this format with the companies' name at the beginning of the file name.
From there, I would like to use the file names to possibly set variables (? - not sure if this would be the best way ) to set precedence constraints to only execute the packages that relate to the companies' files we received.
Based on the screen shots, I want the packages of Company A & B to execute while C & D are skipped and then to move on to the next package.
Suggestions on how to achieve this goal?
One way would be to create a boolean variable for each company.
#CompanyA_Recd, #CompanyB_Recd, #CompanyC_Recd, #CompanyD_Recd
Your foreach loop would loop through the files, analyze each file name, and mark the appropriate variable as true if you received that company's files.
Then your precedence constraints could just check the variable for that company.
Note that, depending on the logic you want for executing the Final Report Package, this means you might need two precedence constraints for each company. One that goes to the package if the variable is true, and one that goes straight to the final package if the variable is false.
In my SSIS package I have a dataflow that looks something like this.
My requirement is to log the end time of each flatfile destination (Or the time when each of the flat files is created) , in a SQL server table. To be more clear, there will be one row per flatfile in the log table. Is there any simple way(preferably) to accomplish this? Thanks in advance.
Update: I ended up using a script task after the dataflow and read the creation time of each of the file created in the dataflow. I also used same script task to insert logs into the table, just to keep things in one place. For details refer the post masked as answer.
In order to get the accurate date and timestamp of each flat file created as the destination, you'll need to create three new global variables and set up a for-each loop container in the control flow following your current data flow task and then add to the for-each loop container a script task that will read from one flat file at a time the date/time information. That information will then be saved to one of the new global variables that can then be applied in a second SQL task (also in the for-each loop) to write the information to a database table.
The following link provides a good example of the steps you'll need to apply. There are a few extra steps not applicable that you can easily exclude.
http://microsoft-ssis.blogspot.com/2011/01/use-filedates-in-ssis.html
Hope this helps.
After looking more closely at the toolbox, I think the best way to do this is to move each source/destination pairing into its own dataflow and use the OnPostExecute event of each dataflow to write to the SQL table.
Wanted to provide more detail to #TabAlleman's approach.
For each control flow task with a name like Bene_hic, you will have a source file and a destination file.
On the 'Event Handlers' tab for that executable (use the drop-down list,) you can create the OnPostExecute event.
In that event, I have two SQL tasks. One generates the SQL to execute for this control flow task, the second executes the SQL.
These SQL tasks are dependent on two user variables scoped in the OnPostExecute event. The EvaluateAsExpression property for both is set to True. The first one, Variable1, is used as a template for the SQL to execute and has a value like:
"SELECT execSQL FROM db.Ssis_onPostExecute
where stgTable = '" + #[System::SourceName] + "'"
#[System::SourceName] is an SSIS system variable containing the name of the control flow task.
I have a table in my database named Ssis_onPostExecute with two fields, an execSQL field with values like:
DELETE FROM db.TableStats WHERE TABLENAME = 'Bene_hic';
INSERT INTO db.TableStats
SELECT CreatorName ,t.tname, CURRENT_TIMESTAMP ,rcnt FROM
(SELECT databasename, TABLENAME AS tname, CreatorName FROM dbc.TablesV) t
INNER JOIN
(SELECT 'Bene_hic' AS tname,
COUNT(*) AS rcnt FROM db.Bene_hic) u ON
t.tname = u.tname
WHERE t.databasename = 'db' AND t.tname = 'Bene_hic';
and a stgTable field with the name of the corresponding control flow task in the package (case-sensitive!) like Bene_hic
In the first SQL task (named SQL,) I have the SourceVariable set to a user variable (User::Variable1) and the ResultSet property set to 'single row.' The Result Set detail includes a Result Name = 0 and Variable name as the second user variable (User::Variable2.)
In the second SQL task (exec,) I have the SQLSourceType property set to Variable and the SourceVariable property set to User::Variable2.
Then the package is able to copy the data in the source object to the destination, and whether it fails or not, enter a row in a table with the timestamp and number of rows copied, along with the table name and anything else you want to track.
Also, when debugging, you have to run the whole package, not just one task in the event. The variables won't be set correctly otherwise.
HTH, it took me forever to figure all this stuff out, working from examples on several web sites. I'm using code to generate the SQL in the execSQL field for each of the 42 control flow tasks, meaning I created 84 user variables.
-Beth
The easy solution will be:
1) drag the OLE DB Command from the tool box after the Fatfile destination.
2) Update Script to update table with current date when Flat file destination is successful.
3) You can create a variable (scope is project) with value systemdatetime.
4) You might have to create another variable depending on your package construct if Success or fail
I have tree tables, Customer, Invoice and InvoiceRow with the standard relations.
These I have to export in one fixed field length file with the first two characters of each row identifying the row type. The row types have different specifications.
I could probably do it with a nested loop in a script block, but this is my first ever SSIS package and that solution feels wrong.
edit:
The output has to have:
Customer
Invoice
Rows
Customer
Invoice
Rows
and so on
Your gut feeling on doing this using a Script Destination component is correct. Unfortunately, this scenario doesn't jive with SSIS well. I don't consider this a beginner package. If you must use SSIS then I'd start by inner joining all the data so there is one row for each InvoiceRow, containing the data needed from all three tables.
CustomerCols, InvoiceCols, RowCols
Then, in the script destination component you'll need to keep track of the customer and invoice values, as they change you'll need to write extra rows to the output.
See Creating a Destination with the Script Component for more information on script destination.
My experience shows that script destinations can have good performance.
I would avoid writing Script Destination, and use just Script Transform + Flat File Destination. This way, you concentrate on the logical output (strings of data), while allowing SSIS to do actual writing to the file (it might be a bit more efficient, plus you concentrate on your business, not on writing to files).
First, you'll need to get denormalized data. You can do joins and sorts in the DBMS, but if you don't want to put too much pressure on DBMS - just get sorted data out of it and merge it using two SSIS Merge Join transforms.
Then do the script: keep running values of current Customer and Invoice, output them when they change, output InvoiceRow on every input. Something like this:
if (this.CustomerID != InputBuffer.CustomerID) {
this.CustomerID = InputBuffer.CustomerID;
OutputBuffer.AddRow();
OutputBuffer.OutputColumn = "Customer: " + InputBuffer.CustomerID + " " + InputBuffer.CustomerName;
}
// repeat the same code for Invoice
OutputBuffer.AddRow();
OutputBuffer.OutputColumn = "InvoiceRow: " + InputBuffer.InvoiceRowPrice;
Finally, add a Flat File Destination with a single column (OutputColumn created by the script) to write this to the file.
Process your three tables so that the outputs are all appropriate for your output file (including the row type designator). You'll have to do this in three separate flow paths in your data flow, then bring the rows together in a Union All data flow element. From there, process them as needed to create your output file.