SSIS Compilation problems - DirectRowToOutput

Input buffer does not contain a definition for 'DirectRowToOutput0', or likewise for the other properties below.
Row.DirectRowToOutput0();
Row.ErrorMessage = ex.Message;
Row.DirectRowToFailedValidation();
I had some packages in the SSIS package store and attempted to import them using the Package Import Wizard project. The import had some issues and compilation failed, completely breaking all the previous script components, so I fished the code out of some backups and pasted it back into new script components.
I did add 'ErrorMessage' as a column on a new output, but it looks like things don't work that way anymore.
The new script components appear to be C# 2012.
What have I missed? I am struggling to find which documentation I should be using, and these version conflicts are really hard to deal with.
Using SSDT 2017.

"DirectRowToOutputX()" means "Provide support for filtering outputs to named output groups". In other words, if you ADD SUPPORT for output filtering, then you get the functionality that you list above. Here's how:
When you configure your Script Component, you need to click Inputs and Outputs, then in the inputs and outputs pane, select the output that you'd like to filter. Then in Common Properties, select ExclusionGroup and set it to some value other than zero. Now go back and edit your script and the Row.DirectRowToOutput0() statement will work. Code example below.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    bool keepThisRow = ValidateMyRow(Row);
    if (keepThisRow)
    {
        // Get/set Row values, do something useful
        Row.DirectRowToOutput0(); // for Output0
    }
    /* Else do nothing - the row will be filtered OUT of the output
     * if not explicitly included. */
}
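For the error path in the question, here is a minimal sketch along the same lines, assuming a second output named FailedValidation that carries an ErrorMessage string column and shares the same non-zero ExclusionGroup (ValidateMyRow remains a hypothetical validation helper):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    try
    {
        if (ValidateMyRow(Row))
        {
            // Rows that pass validation go to Output0
            Row.DirectRowToOutput0();
        }
    }
    catch (Exception ex)
    {
        // Assumes an output named "FailedValidation" with an ErrorMessage
        // column, configured with the same non-zero ExclusionGroup
        Row.ErrorMessage = ex.Message;
        Row.DirectRowToFailedValidation();
    }
}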

Changing output dataset path in the transform function

Can we change the output dataset path dynamically in my_compute_function, as shown below?
from transforms.api import transform, Input, Output

@transform(
    my_output=Output("/path/to/my/dataset"),
    my_input=Input("/path/to/input"),
)
def my_compute_function(my_output, my_input):
    my_output.path = "new path"  # <-- is this possible?
    my_output.write_dataframe(
        my_input.dataframe()
    )
No, this is not possible. The reason is that the inputs/outputs/transforms are fixed at "CI-time" or "build-time": when you press "commit" in Authoring, or when you merge a PR, a CI job is kicked off.
In this CI job, all the relations between inputs and outputs are determined. Output datasets that don't exist yet are created, and a "jobspec" is added to them. A "jobspec" is a snippet of JSON that describes to Foundry how a particular dataset is generated.
Any time you press the "build" button on a dataset (or build the dataset through a schedule or similar), the jobspec is consulted. It contains a reference to the repository, revision, source file, and entry point of the function that builds this dataset. From there the build is orchestrated and kicked off, invoking your function to produce the final output.
This mechanism gives you a "static view" of the entire pipeline, which you can then visualize with Monocle, as you might have seen.
Depending on what your needs are, here are some solutions you might be able to use instead:
Tag the rows you're producing in your transform in some way, so that even though you put them into a single dataset, you can later select them by this tag/category.
If your set of categories does not change often, you can instead create the output datasets ahead of time and then filter the rows into the appropriate dataset, as sketched below.
The main drawback of the latter approach is that it's not very dynamic: if a new category shows up, you'll have to manually change the code to "triage" it into a new dataset before that data becomes available.
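A minimal sketch of that second approach, assuming the set of categories is known ahead of time (the output paths and the category column name are illustrative, not from the original question):
from transforms.api import transform, Input, Output

# Each category gets its own pre-declared output, since outputs are
# fixed at CI-time. Paths and the "category" column are hypothetical.
@transform(
    cats_out=Output("/path/to/output/cats"),
    dogs_out=Output("/path/to/output/dogs"),
    my_input=Input("/path/to/input"),
)
def my_compute_function(cats_out, dogs_out, my_input):
    df = my_input.dataframe()
    cats_out.write_dataframe(df.filter(df.category == "cat"))
    dogs_out.write_dataframe(df.filter(df.category == "dog"))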
There are other solutions (ultimately it is possible to make API calls and to manually adjust inputs/outputs as well, for instance), but they are more complex and undesirable from a maintenance perspective.

Passing a variable value from Control Flow to Data Flow in SSIS

I have a fairly straightforward SSIS package where I can't successfully pass the value of a package-scoped variable from the Control Flow to a Data Flow task. Consider the following flow:
The Execute SQL task gets values from a list of "machines". This is used to control a ForEach Loop Container, which works very well. Next, a script task performs some math and assigns a single number to a package-scoped variable (integer type). I have added message boxes that pop up during the loop so I can verify that the value of this variable is being set properly.
The last step is a data flow in which I want to use the variable's value. It contains a simple script component with just a message box showing me the current value of this same variable. Every time, the variable has the value that I initially set in the designer (BIDS). Therefore, the value is not being "passed" to the data flow. I have verified multiple times that the names of the variables are correct (including case).
This should be pretty simple, and I am getting frustrated with this issue. I would greatly appreciate any suggestions or comments. Thank you!
How are you setting the package variable from your script task? It should look like this (C#):
Dts.Variables["testVariable"].Value = "some value";
Then to test it from the script component in your dataflow task:
public override void PostExecute()
{
    base.PostExecute();
    // "testVariable" must be listed in the Script Component's variable
    // properties for this strongly typed accessor to exist
    MessageBox.Show(Variables.testVariable, "test");
}
I did this in a test package and it worked fine.
EDIT
Also make sure that you added the variable to the ReadWriteVariables list in the Script Task's properties (and, for the Script Component that reads it, to its ReadOnlyVariables list).
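For completeness, a minimal sketch of the Script Task side that assigns the value inside the loop; the variable name and value are illustrative, and the variable must appear in the task's ReadWriteVariables list:
public void Main()
{
    // Assign the computed number to the package-scoped variable.
    // Throws at runtime if "testVariable" is not in ReadWriteVariables.
    Dts.Variables["testVariable"].Value = 42;

    Dts.TaskResult = (int)ScriptResults.Success;
}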

How to add custom footer to an SSIS flat file - separate component or script task?

I have a data flow task comprising the following:
1.) Read Excel file
2.) "Script transformation component"
3.) Write to a flat file
Everything works fine; however, I am trying to add a header and footer.
Both the header and footer are custom and have to be derived from the data in the file.
I can open the file in C# and write the low-level code, but this seems to be a pretty common task, and I thought there would be something generic.
Let me know if someone has done this.
The approach listed in the Toolbox.com article will certainly work. (You'll need to add another data flow, but data flows are cheap.)
On the other hand, since you've already got a Script Component in your existing data flow, you can use that to generate the header and trailer rows.
First, change the SynchronousInputID of the Script Component output to None so that you can generate additional rows:
Next, update the ProcessInputRow() method and add a FinishOutputs() method along these lines:
[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    private bool _headerWritten = false;

    public override void IncomingRows_ProcessInputRow(IncomingRowsBuffer Row)
    {
        if (!_headerWritten)
        {
            // Code to write the header row goes here
            _headerWritten = true;
        }
        OutgoingRowsBuffer.AddRow();
        // Do whatever other processing you need for this row of input
    }

    public override void FinishOutputs()
    {
        base.FinishOutputs();
        // Code to write the footer row goes here
    }
}
This approach requires a bit more code but lets you do everything in one pass, which for sufficiently huge files may be important. (On the other hand, an Excel spreadsheet shouldn't be that big ...)
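To make the placeholders concrete, here is a hedged sketch of what the header/footer logic might look like, assuming the asynchronous output has a single string column named Line and the footer carries a record count (all column names are hypothetical, not from the original answer):
private int _rowCount = 0;
private bool _headerWritten = false;

public override void IncomingRows_ProcessInputRow(IncomingRowsBuffer Row)
{
    if (!_headerWritten)
    {
        // Header derived from the data/run, e.g. a date stamp
        OutgoingRowsBuffer.AddRow();
        OutgoingRowsBuffer.Line = "HEADER|" + DateTime.Today.ToString("yyyy-MM-dd");
        _headerWritten = true;
    }

    OutgoingRowsBuffer.AddRow();
    OutgoingRowsBuffer.Line = Row.SomeInputColumn; // hypothetical input column
    _rowCount++;
}

public override void FinishOutputs()
{
    // Footer derived from the data: here, a simple record count
    OutgoingRowsBuffer.AddRow();
    OutgoingRowsBuffer.Line = "FOOTER|" + _rowCount.ToString();
    base.FinishOutputs();
}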

What is the simple way to find the column name from Lineageid in SSIS

What is the simple way to find the column name from the LineageID in SSIS? Is there any system variable available?
I remember saying "this can't be that hard; I can write some script in the error redirect to look up the column name from the input collection."
string badColumn = this.ComponentMetaData.InputCollection[Row.ErrorColumn].Name;
What I learned was that the failing column isn't in that collection. Well, it is, but the ErrorColumn reported is not quite what I needed. I couldn't find that package, but here's an example of why I couldn't get what I needed. Hopefully you will have better luck.
This is a simple data flow that will generate an error once it hits the derived column due to division by zero. The Derived column generates a new output column (LookAtMe) as the result of the division. The data viewer on the Error Output tells me the failing column is 73. Using the above script logic, if I attempted to access column 73 in the input collection, it's going to fail because that is not in the collection. LineageID 73 is LookAtMe and LookAtMe is not in my error branch, it's only in the non-error branch.
This is a copy of my XML, and you can see that, yes, the outputColumn with id 73 is LookAtMe.
<outputColumn id="73" name="LookAtMe" description="" lineageId="73" precision="0" scale="0" length="0" dataType="i4" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Computation" errorRowDisposition="RedirectRow" truncationRowDisposition="RedirectRow" externalMetadataColumnId="0" mappedColumnId="0"><properties>
I really wanted that data though, and I'm clever, so I thought I could union all my results back together and then conditionally split them back out to get it. The problem is, Union All is an asynchronous transformation. Async transformations result in the data being copied from one set of buffers to another, resulting in... new lineage IDs being assigned. So even with a Union All bringing the two streams back together, you wouldn't be able to look up the data flow chain to find that original lineage ID, because it's in a different buffer.
Around this point, I conceded defeat and decided I could live without intelligent/helpful error reporting in my packages.
I know this is a long dead thread but I tripped across a manual solution to this problem and thought I would share for anyone who happens upon this same problem. Granted this doesn't provide a programmatic solution to the problem but for simple debugging it should do the trick. The solution uses a Derived Column as an example but this seems to work for any Data Flow component.
Answer provided by Todd McDermid and taken from AskSQLServerCentral:
"[...] Unfortunately, the lineage ID of your columns is pretty well hidden inside SSIS. It's the "key" that SSIS uses to identify columns. So, in order to figure out which column it was, you need to open the Advanced Editor of the Derived Column component or Data Conversion. Do that by right clicking and selecting "Advanced Editor". Go to the "Input and Output Properties" tab. Open the first node - "Derived Column Input" or "Data Conversion Input". Open the "Input Columns" tab. Click through the columns, noting the "LineageID" property of each. You may have to do the same with the "Derived Column Output" node, and "Output Columns" inside there. The column that matches your recorded lineage ID is the offending column."
For anyone using SQL Server versions before SS2016, here are a couple of reference links for a way to get the Column name:
http://www.andrewleesmith.co.uk/2017/02/24/finding-the-column-name-of-an-ssis-error-output-error-column-id/
which is based on:
http://toddmcdermid.blogspot.com/2016/04/finding-column-name-for-errorcolumn.html
I appreciate we aren't supposed to just post links, but this solution is quite convoluted, and I've tried to summarise by pulling info from both Todd and Andrew's blog posts and recreating them here. (thank you to both if you ever read this!)
From Todd's page:
Go to the "Inputs and Outputs" page, and select the "Output 0" node.
Change the "SynchronousInputID" property to "None". (This changes
the script from synchronous to asynchronous.)
On the same page, open the "Output 0" node and select the "Output
Columns" folder. Press the "Add Column" button. Change the "Name"
property of this new column to "LineageID".
Press the "Add Column" button again, and change the "DataType"
property to "Unicode string [DT_WSTR]", and change the "Name"
property to "ColumnName".
Go to the "Script" page, and press the "Edit Script" button. Copy
and paste this code into the ScriptMain class (you can delete all
other method stubs):
public override void CreateNewOutputRows()
{
    IDTSInput100 input = this.ComponentMetaData.InputCollection[0];
    if (input != null)
    {
        IDTSVirtualInput100 vInput = input.GetVirtualInput();
        if (vInput != null)
        {
            // One output row per column on the virtual input:
            // its LineageID and its name
            foreach (IDTSVirtualInputColumn100 vInputColumn in vInput.VirtualInputColumnCollection)
            {
                Output0Buffer.AddRow();
                Output0Buffer.LineageID = vInputColumn.LineageID;
                Output0Buffer.ColumnName = vInputColumn.Name;
            }
        }
    }
}
Feel free to attach a dummy output to that script, with a data viewer, and see what you get. From here, it's "standard engineering" for you ETL gurus. Simply merge join the error output of the failing component with this metadata, and you'll be able to transform the ErrorColumn number into a meaningful column name.
But for those of you that do want to understand what the above script is doing:
It's getting the "first" (and only) input attached to the script component.
It's getting the virtual input related to the input. The "input" is what the script can actually "see" on the input - and since we didn't mark any columns as being "ReadOnly" or "ReadWrite"... that means the input has NO columns. However, the "virtual input" has the complete list of every column that exists, whether or not we've said we're "using" it.
We then loop over all of the "virtual columns" on this virtual input, and for each one...
Get the LineageID and column name, and push them out as a new row on our asynchronous script.
The image and text from Andrew's page help explain it in a bit more detail:
This map is then merge-joined with the ErrorColumn lineage ID(s) coming down the error path, so that the error information can be appended with the column name(s) from the map. I included a second script component that looks up the error description from the error code, so the error table rows that we see above contain both column names and error descriptions.
The remaining component that needs explaining is the conditional split - this exists just to provide metadata to the script component that creates the map. I created an expression (1 == 0) that always evaluates to false for the "No Rows - Metadata Only" path, so no rows ever travel down it.
Whilst this solution does require the insertion of some additional plumbing within the data flow, we get extremely valuable information logged when errors do occur. So especially when the data flow is running unattended in Production - when we don't have the tools & techniques available at design time to figure out what's going wrong - the logging that results gives us much more precise information about what went wrong and why, compared to simply giving us the failed data and leaving us to figure out why it was rejected.
There is no simple way to find out the column name by lineage ID.
If you want to know this using BIDS, you have to inspect every component inside the data flow using the Advanced Editor's Input and Output Properties tab and check the LineageID of each column on each input/output path.
But you can:
inspect the XML - this is very difficult
write a .NET application and use FindColumnByLineageId
However, the second solution involves a lot of coding and an understanding of the pipeline, because you have to programmatically open the package, iterate over tasks, iterate inside containers, and iterate over transformations inside data flows to find the particular component before you can use the proposed method.
Here is a solution that:
Works at package runtime (not pre-populating)
Is automated through a Script Task and Component
Doesn't involve installing new assemblies or custom components
Is nicely BIML compatible
Check out the full solution here.
EDIT
Here is the short version.
Create 2 Object variables, execsObj and lineageIds
Create a Script Task in the Control Flow and give it ReadWrite access to both variables
Add an assembly reference in your Script Task to Microsoft.SqlServer.DTSPipelineWrap.dll (this may require the SQL Server Client SDK to be installed; it is needed for the MainPipe object below)
Insert the following code into your Script Task
Dictionary<int, string> lineageIds = null;

public void Main()
{
    // Grab the executables so we have something to iterate over, and initialize our lineageIds list
    // Why the executables? Well, SSIS won't let us store a reference to the Package itself...
    Dts.Variables["User::execsObj"].Value = ((Package)Dts.Variables["User::execsObj"].Parent).Executables;
    Dts.Variables["User::lineageIds"].Value = new Dictionary<int, string>();
    lineageIds = (Dictionary<int, string>)Dts.Variables["User::lineageIds"].Value;
    Executables execs = (Executables)Dts.Variables["User::execsObj"].Value;

    ReadExecutables(execs);

    Dts.TaskResult = (int)ScriptResults.Success;
}

private void ReadExecutables(Executables executables)
{
    foreach (Executable pkgExecutable in executables)
    {
        if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.TaskHost)))
        {
            TaskHost pkgExecTaskHost = (TaskHost)pkgExecutable;
            if (pkgExecTaskHost.CreationName.StartsWith("SSIS.Pipeline"))
            {
                ProcessDataFlowTask(pkgExecTaskHost);
            }
        }
        else if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.ForEachLoop)))
        {
            // Recurse into ForEach Loop Containers
            ReadExecutables(((ForEachLoop)pkgExecutable).Executables);
        }
    }
}

private void ProcessDataFlowTask(TaskHost currentDataFlowTask)
{
    MainPipe currentDataFlow = (MainPipe)currentDataFlowTask.InnerObject;
    foreach (IDTSComponentMetaData100 currentComponent in currentDataFlow.ComponentMetaDataCollection)
    {
        // Get the inputs in the component
        foreach (IDTSInput100 currentInput in currentComponent.InputCollection)
            foreach (IDTSInputColumn100 currentInputColumn in currentInput.InputColumnCollection)
                lineageIds.Add(currentInputColumn.ID, currentInputColumn.Name);

        // Get the outputs in the component
        foreach (IDTSOutput100 currentOutput in currentComponent.OutputCollection)
            foreach (IDTSOutputColumn100 currentOutputColumn in currentOutput.OutputColumnCollection)
                lineageIds.Add(currentOutputColumn.ID, currentOutputColumn.Name);
    }
}
Create Script Component in Dataflow with ReadOnly access to lineageIds and the following code.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    Dictionary<int, string> lineageIds = (Dictionary<int, string>)Variables.lineageIds;

    int? colNum = Row.ErrorColumn;
    if (colNum.HasValue && (lineageIds != null))
    {
        if (lineageIds.ContainsKey(colNum.Value))
            Row.ErrorColumnName = lineageIds[colNum.Value];
        else
            Row.ErrorColumnName = "Row error";
    }
    Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);
}
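As a closing note: the solutions above target versions before SQL Server 2016. From SQL Server 2016 onward, the component metadata can resolve the name directly via GetIdentificationStringByID, along these lines (a hedged sketch, not part of the original answers; it assumes the same ErrorColumnName/ErrorDescription output columns as above):
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // IDTSComponentMetaData130 is available from SQL Server 2016 onward
    var meta = this.ComponentMetaData as IDTSComponentMetaData130;
    if (meta != null && Row.ErrorColumn > 0)
    {
        // Returns an identification string such as
        // "Derived Column.Outputs[Derived Column Output].Columns[LookAtMe]"
        Row.ErrorColumnName = meta.GetIdentificationStringByID(Row.ErrorColumn);
    }
    Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);
}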

Organizing Notebooks & Saving Results in Mathematica

As of now I use 3 Notebooks:
Functions
Where I have all the functions I created and call in the other Notebooks.
Transformation
Based on the original data, I compute transformations and add columns/Lists.
Where data is my raw data, I then compute:
t1data : the result of the first transformation
t2data : the result of the second transformation
and so on;
I am currently at t20.
Display & Analysis
Using both of the above, I create Manipulate objects that enable me to analyze the data.
Questions
Is there a way to save the results of the Transformation Notebook such that t13data, for example, can be used in the Display & Analysis Notebook without running all the previous computations (t1, t2, t3, ... t12) it is based on?
Is there a way to use my Functions or transformed data without opening the corresponding Notebook?
Does my separation strategy make sense at all?
As of now I systematically open all 3 and have to run them all before being able to do anything, and it takes a while given my poor computing power and still-inefficient code.
Saving variable states: can be done using DumpSave, Save or Put. Read back using Get or <<
You could make a package from your functions and read those back using Needs or <<
It's not something I usually do. I opt for a monolithic notebook containing everything (nicely layered with sections and subsections so that you can fold open or close) or for a package + slightly leaner analysis notebook depending on the weather and some other hidden variables.
Saving intermediate results
The native file format for Mathematica expressions is the .m file. This is human readable text format, and you can view the file in a text editor if you ever doubt what is, or is not being saved. You can load these files using Get. The shorthand form for Get is:
<< "filename.m"
Using Get will replace or refresh any existing assignments that are explicitly made in the .m file.
Saving intermediate results that are simple assignments (dat = ...) may be done with Put. The shorthand form for Put is:
dat >> "dat.m"
This saves only the assigned expression itself; to restore the definition you must use:
dat = << "dat.m"
See also PutAppend for appending data to a .m file as new results are created.
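For instance, a minimal Put/Get round trip might look like this (the file name and data are illustrative):
dat = {1, 2, 3};
dat >> "dat.m"    (* Put: writes the assigned expression to dat.m *)
Clear[dat]
dat = << "dat.m"  (* Get: reads it back and restores the assignment *)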
Saving results and function definitions that are complex assignments is done with Save. Examples of such assignments include:
f[x_] := subfunc[x, 2]
g[1] = "cat"
g[2] = "dog"
nCr = #!/(#2! (# - #2)!) &;
nPr = nCr[##] #2! &;
For the last example, the complexity is that nPr depends on nCr. Using Save it is sufficient to save only nPr to get a fully working definition of nPr: the definition of nCr will automatically be saved as well. The syntax is:
Save["nPr.m", nPr]
Using Save the assignments themselves are saved; to restore the definitions use:
<< "nPr.m" ;
Moving functions to a Package
In addition to Put and Save, or manual creation in a text editor, .m files may be generated automatically. This is done by creating a Notebook and setting Cell > Cell Properties > Initialization Cell on the cells that contain your function definitions. When you save the Notebook for the first time, Mathematica will ask if you want to create an Auto Save Package. Do so, and Mathematica will generate a .m file in parallel to the .nb file, containing the contents of all Initialization Cells in the Notebook. Further, it will update this .m file every time you save the Notebook, so you never need to manually update it.
Since all Initialization Cells will be saved to the parallel .m file, I recommend using the Notebook only for the generation of this Package, and not also for the rest of your computations.
When managing functions, one must consider context. Not all functions should be global at all times. A series of related functions should often be kept in its own context which can then be easily exposed to or removed from $ContextPath. Further, a series of functions often rely on subfunctions that do not need to be called outside of the primary functions, therefore these subfunctions should not be global. All of this relates to Package creation. Incidentally, it also relates to the formatting of code, because knowing that not all subfunctions must be exposed as global gives one the freedom to move many subfunctions to the "top level" of the code, that is, outside of Module or other scoping constructs, without conflicting with global symbols.
Package creation is a complex topic. You should familiarize yourself with Begin, BeginPackage, End and EndPackage to better understand it, but here is a simple framework to get you started. You can follow it as a template for the time being.
This is an old definition I used before DeleteDuplicates existed:
BeginPackage["UU`"]
UnsortedUnion::usage = "UnsortedUnion works like Union, but doesn't \
return a sorted list. \nThis function is considerably slower than \
Union though."
Begin["`Private`"]
UnsortedUnion =
Module[{f}, f[y_] := (f[y] = Sequence[]; y); f /# Join###] &
End[]
EndPackage[]
Everything above goes in Initialization Cells. You can insert Text cells, Sections, or even other input cells without harming the generated Package: only the contents of the Initialization Cells will be exported.
BeginPackage defines the Context that your functions will belong to, and disables all non-System` definitions, preventing collisions. (There are ways to call other functions from your package, but that is better for another question).
By convention, a ::usage message is defined for each function that is to be accessible outside the package itself. This is not superfluous! While there are other methods, without this you will not expose your function in the visible Context.
Next, you Begin a context that is for the package alone, conventionally "`Private`". After this point any symbols you define (that are not used outside of this Begin/End block) will not be exposed globally after the Package is loaded, and will therefore not collide with Global` symbols.
After your function definition(s), you close the block with End[]. You may use as many Begin/End blocks as you like, and I typically use a separate one for each function, though it is not required.
Finally, close with EndPackage[] to restore the environment to what it was before using BeginPackage.
After you save the Notebook and generate the .m package (let's say "mypackage.m"), you can load it with Get:
<< "mypackage.m"
Now, there will be a function UnsortedUnion in the Context UU` and it will be accessible globally.
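A quick check after loading (the result comment assumes the definition above):
<< "mypackage.m";
UnsortedUnion[{3, 1, 2}, {2, 4}]
(* {3, 1, 2, 4} *)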
You should also look into the functionality of Needs, but that is a little more advanced in my opinion, so I shall stop here.