Splitting file name in SSIS - ssis

I have files in one folder with following naming convention
ClientID_ClientName_Date_Fileextension
12345_Dell_20110103.CSV
I want to extract ClientID from the filename and store it in a variable. I am unsure how to do this. It seems that a Script Task would suffice, but I do not know how to proceed.

Your options are using Expressions on SSIS Variables or using a Script Task. As a general rule I prefer Expressions, but here I can tell that would mean a lot of code or a lot of intertwined variables.
Instead, I'd use the String.Split method in .NET. If you call Split on your sample data with the underscore _ as the delimiter, you receive a 3-element array:
12345
Dell
20110103.CSV
Wrap that in a Try Catch block and always grab the first element, which is the ClientID you asked for. Quick and dirty; it won't validate odd names like 12345_Dell_Quest_20110103.CSV (where the client name itself contains an underscore), but you didn't ask that question.
Code approximate
string phrase = Dts.Variables["User::CurrentFile"].Value.ToString();
// The sample file names are delimited by underscores
string[] stringSeparators = new string[] { "_" };
string[] words;
try
{
    words = phrase.Split(stringSeparators, StringSplitOptions.None);
    // Index 0 is the ClientID; assumes a User::ClientID variable with ReadWrite access
    Dts.Variables["User::ClientID"].Value = words[0];
}
catch
{
    // Do something with this error
}
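The edge case mentioned above (a client name that itself contains underscores) can be handled by reading the fields from both ends. What follows is only a sketch of that logic in Python, not SSIS code; the function name parse_client_file, and the assumption that the ID is always the first field and the date always the last, are mine and not part of the original answer.

```python
import os


def parse_client_file(path):
    """Split 'ClientID_ClientName_Date.ext' into its parts.

    Assumes the ID is the first underscore-separated field and the date is
    the last one; everything in between is treated as the client name.
    """
    stem, ext = os.path.splitext(os.path.basename(path))
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"unexpected file name: {path!r}")
    client_id = parts[0]
    file_date = parts[-1]
    # Re-join the middle fields so 'Dell_Quest' survives intact
    client_name = "_".join(parts[1:-1])
    return client_id, client_name, file_date, ext.lstrip(".")
```

The same idea carries over to a Script Task: take the first and last elements of the split array and join whatever lies between them.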

Connecting output of script to extract performance

When I use the Execute Script operator with a single input arc of type ExampleSet, run (for example) the one-line script return operator.getInput(ExampleSet.class), and connect the output to an Extract Performance operator, which takes an ExampleSet as input, I get the error: Mandatory input missing at port Performance.example set.
My goal is to check a Petri net for soundness via the Analyse soundness operator that comes with the RapidProm extension, and to change the first attribute on the first line to either 0 or 1 depending on whether its string contains "is sound", so that I can then use Extract Performance and combine the result with other performances using Average.
Is Execute Script the right way to do this, and if so, how should I fix this error?
First, don't worry about the error Mandatory input missing at port Performance.example set. It will be resolved when you run the model.
Second, the output of the operator that checks the soundness of the model is indeed a bit ugly, since it is a very long string that looks like
Woflan diagnosis of net "d1cf46bd-15a9-4801-9f02-946a8f125eaf" - The net is sound End of Woflan diagnosis
You can indeed use Execute Script to resolve this :) See the script below.
The output is an example set that contains 1 if the model is sound and 0 otherwise. I also like to use some log operators to translate this into a nice table for documentation purposes.
ExampleSet input = operator.getInput(ExampleSet.class);
for (Example example : input) {
String uglyResult = example["att1"];
String soundResult = "The net is sound";
Boolean soundnessCheck = uglyResult.toLowerCase().contains(soundResult.toLowerCase());
if (soundnessCheck){
example["att1"] = "1"; //the net is sound :)
} else {
example["att1"] = "0"; //the net is not sound!
}
}
return input;
See also the attached example model I created.
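For readers who want to check the string test in isolation, here is a minimal sketch of the same check in Python (not RapidMiner's Groovy scripting). The function name soundness_flag is hypothetical, and the wording of the "not sound" message used in the comment is an assumption; only the "sound" message appears in the thread.

```python
def soundness_flag(diagnosis):
    """Return "1" if the diagnosis text reports a sound net, else "0".

    The comparison is case-insensitive. A message such as
    "The net is not sound" (assumed wording) does not contain the
    phrase "the net is sound", so it maps to "0".
    """
    return "1" if "the net is sound" in diagnosis.lower() else "0"
```

Returning the flag as a string mirrors the script above, which writes "1" or "0" back into the att1 attribute.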

Program Structure to meet Martin's Clean Code Function Criteria

I have been reading Clean Code by Robert C. Martin and have a basic (but fundamental) question about functions and program structure.
The book emphasizes that functions should:
be brief (like 10 lines, or less)
do one, and only one, thing
I am a little unclear on how to apply this in practice. For example, I am developing a program to:
load a baseline text file
parse baseline text file
load a test text file
parse test text file
compare parsed test with parsed baseline
aggregate results
I have tried two approaches, but neither seems to meet Martin's criteria:
APPROACH 1
Set up a main function that centrally commands the other functions in the workflow. But then main() can end up being very long (violates #1), and is obviously doing many things (violates #2). Something like this:
main()
{
// manage steps, one at a time, from start to finish
baseFile = loadFile("baseline.txt");
parsedBaseline = parseFile(baseFile);
testFile = loadFile("test.txt");
parsedTest = parseFile(testFile);
comparisonResults = compareFiles(parsedBaseline, parsedTest);
aggregateResults(comparisonResults);
}
APPROACH 2
Use main to trigger a function "cascade". But each function calls a dependency, so it still seems like each one is doing more than one thing (violates #2?). For example, calling the aggregation function internally calls the results comparison. The flow also seems backwards, as it starts with the end goal and calls dependencies as it goes. Something like this:
main()
{
// trigger end result, and let functions internally manage
aggregateResults("baseline.txt", "comparison.txt");
}
aggregateResults(baseFile, testFile)
{
comparisonResults = compareFiles(baseFile, testFile);
// aggregate results here
return theAggregatedResult;
}
compareFiles(baseFile, testFile)
{
parsedBase = parseFile(baseFile);
parsedTest = parseFile(testFile);
// compare parsed files here
return theFileComparison;
}
parseFile(filename)
{
loadFile(filename);
// parse the file here
return theParsedFile;
}
loadFile(filename)
{
//load the file here
return theLoadedFile;
}
Obviously functions need to call one another. So what is the right way to structure a program to meet Martin's criteria, please?
I think you are interpreting rule 2 wrongly by not taking context into account. The main() function only does one thing, and that is everything, i.e. running the whole program. Let's say you have a convert_abc_file_xyz_file(source_filename, target_filename); then this function should only do the one thing its name (and arguments) implies: converting a file of format abc into one of format xyz. Of course, on a lower level there are many things to be done to achieve this. For instance: reading the source file (read_abc_file(…)), converting the data from format abc into format xyz (convert_abc_to_xyz(…)), and then writing the converted data into a new file (write_xyz_file(…)).
The second approach is wrong because it becomes impossible to write functions that do only one thing: every function also does all the other things in the "cascaded" calls. In the first approach it is possible to test or reuse single functions, i.e. just call read_abc_file() to read a file. If that function calls convert_abc_to_xyz(), which in turn calls write_xyz_file(), that is not possible anymore.
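To make the point about testing and reuse concrete, here is a minimal Python sketch of the first approach applied to the question's workflow. The "key=value" file format, the shape of the results, and the name run are assumptions made purely for illustration; the other function names follow the question's pseudocode.

```python
def load_file(path):
    """Read a whole text file."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def parse_file(text):
    """Parse 'key=value' lines (an assumed format) into a dict."""
    return dict(line.split("=", 1) for line in text.splitlines() if "=" in line)


def compare_files(baseline, test):
    """Map every key seen in either input to whether its values match."""
    keys = baseline.keys() | test.keys()
    return {key: baseline.get(key) == test.get(key) for key in keys}


def aggregate_results(comparison):
    """Count matching and mismatching keys."""
    matched = sum(comparison.values())
    return {"matched": matched, "mismatched": len(comparison) - matched}


def run(baseline_path, test_path):
    """The orchestrator's one job: run the steps in order."""
    baseline = parse_file(load_file(baseline_path))
    test = parse_file(load_file(test_path))
    return aggregate_results(compare_files(baseline, test))
```

Because parse_file and compare_files take plain data rather than file names, each can be called and tested without touching the disk, which is exactly what the cascade in the second approach prevents.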

Some way to perform a TrimAfterLast() in Fiddler Script

There is the very helpful Utilities.TrimBeforeLast() function in Fiddler script. However, I really need to perform a Utilities.TrimAfterLast(StringVar, "}") to remove the extra characters after a JSON object I've captured.
Is there any way I could produce this equivalent result with fiddlerscript?
Thanks
FiddlerScript allows you to use the entire .NET Framework, which offers many different functions for processing strings.
In this case, you'd probably want to use something like:
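var sMyString = "whatever}junk";
var iX = sMyString.LastIndexOf('}');
// Keep everything up to and including the last '}'; leave the string alone if there is none
if (iX >= 0) sMyString = sMyString.Substring(0, iX + 1);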
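Whether the delimiter itself should survive the trim depends on the use case; for a captured JSON object you would normally want to keep the closing brace. Below is a minimal sketch of that logic in Python (not FiddlerScript). The names trim_after_last and keep_delimiter are hypothetical and do not exist in Fiddler's Utilities class.

```python
def trim_after_last(text, delimiter, keep_delimiter=True):
    """Drop everything after the last occurrence of delimiter.

    If the delimiter is absent, the text is returned unchanged.
    """
    index = text.rfind(delimiter)
    if index == -1:
        return text
    end = index + len(delimiter) if keep_delimiter else index
    return text[:end]
```

The explicit check for a missing delimiter matters in .NET as well: Substring throws when given a negative length, which is what LastIndexOf's -1 result would produce.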

List Lua functions in a file

How can I list all the functions included in a Lua source file?
For example, I have fn.lua which contains
function test1()
print("Test 1")
end
function test2()
print("Test 2")
end
And I wish to be able to display those function names (test1, test2) from another Lua script.
The only way I can figure at the moment is to include the file using require, then list the functions in _G - but that will include all the standard Lua functions as well.
Of course I could just parse the file manually using the string search functions, but that doesn't seem very Lua to me!
This will eventually form part of a process that allows the developer to write functions in Lua, and the operator to select which of those functions are called from a list in Excel (yuk!).
If you put them all in a "module" (which you should probably do, anyway):
local mymodule = {}
function mymodule.test1()
  print("Test 1")
end
function mymodule.test2()
  print("Test 2")
end
return mymodule
It becomes trivial:
mymodule = require"mymodule"
for fname,obj in pairs(mymodule) do
if type(obj) == "function" then
print(fname)
end
end
If you absolutely have to keep them in raw form, you'd have to load them in a different way to separate your global environment, and then iterate over it in a similar way (over the inner env's cleaned _G, perhaps).
I see three ways:
Save the names in _G before loading your script and compare to the names left in _G after loading it. I've seen some code for this, either in the Lua mailing list or in the wiki, but I can't find a link right now.
Report the globals used in a function by parsing luac listings, as in http://lua-users.org/lists/lua-l/2012-12/msg00397.html.
Use my bytecode inspector lbci from within Lua, which contains an example that reports globals.
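The first of these, snapshotting the environment before loading and diffing afterwards, is not specific to Lua. As an illustration only, here is the same idea sketched in Python; new_functions is a hypothetical name, and Python's exec stands in for loading a Lua chunk into its own environment table.

```python
def new_functions(source, environment=None):
    """Run source in a private namespace and list the functions it added.

    Names already present in the starting environment are excluded,
    which is the snapshot-and-compare step described above.
    Only use this on trusted source, since the code is executed.
    """
    env = dict(environment or {})
    before = set(env)
    exec(source, env)
    return sorted(name for name in set(env) - before if callable(env[name]))
```

Running the script in a separate namespace rather than the real globals also keeps the loaded functions from leaking into the host program.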
If you want to do this, it's better to define the functions in a package, as described in the Programming in Lua book:
functions = {}
function functions.test1() print('foo') end
function functions.test2() print('bar') end
return functions
Then you can simply iterate your table functions.

What is the simple way to find the column name from Lineageid in SSIS

What is the simple way to find the column name from a LineageID in SSIS? Is there any system variable available?
I remember saying this can't be that hard, I can write some script in the error redirect to lookup the column name from the input collection.
string badColumn = this.ComponentMetaData.InputCollection[Row.ErrorColumn].Name;
What I learned was the failing column isn't in that collection. Well, it is but the ErrorColumn reported is not quite what I needed. I couldn't find that package but here's an example of why I couldn't get what I needed. Hopefully you will have better luck.
This is a simple data flow that will generate an error once it hits the derived column due to division by zero. The Derived column generates a new output column (LookAtMe) as the result of the division. The data viewer on the Error Output tells me the failing column is 73. Using the above script logic, if I attempted to access column 73 in the input collection, it's going to fail because that is not in the collection. LineageID 73 is LookAtMe and LookAtMe is not in my error branch, it's only in the non-error branch.
This is a copy of my XML, and you can see that, yes, the outputColumn with id 73 is LookAtMe.
<outputColumn id="73" name="LookAtMe" description="" lineageId="73" precision="0" scale="0" length="0" dataType="i4" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Computation" errorRowDisposition="RedirectRow" truncationRowDisposition="RedirectRow" externalMetadataColumnId="0" mappedColumnId="0"><properties>
I really wanted that data though, and I'm clever, so I can union all my results back together and then conditional split them back out to get that. The problem is, Union All is an asynchronous transformation. Async transformations result in the data being copied from one set of buffers to another, resulting in... new lineage ids being assigned. So even with a Union All bringing the two streams back together, you wouldn't be able to walk up the data flow chain to find that original lineage id, because it's in a different buffer.
Around this point, I conceded defeat and decided I could live without intelligent/helpful error reporting in my packages.
I know this is a long dead thread but I tripped across a manual solution to this problem and thought I would share for anyone who happens upon this same problem. Granted this doesn't provide a programmatic solution to the problem but for simple debugging it should do the trick. The solution uses a Derived Column as an example but this seems to work for any Data Flow component.
Answer provided by Todd McDermid and taken from AskSQLServerCentral:
"[...] Unfortunately, the lineage ID of your columns is pretty well hidden inside SSIS. It's the "key" that SSIS uses to identify columns. So, in order to figure out which column it was, you need to open the Advanced Editor of the Derived Column component or Data Conversion. Do that by right clicking and selecting "Advanced Editor". Go to the "Input and Output Properties" tab. Open the first node - "Derived Column Input" or "Data Conversion Input". Open the "Input Columns" tab. Click through the columns, noting the "LineageID" property of each. You may have to do the same with the "Derived Column Output" node, and "Output Columns" inside there. The column that matches your recorded lineage ID is the offending column."
For anyone using SQL Server versions before SQL Server 2016, here are a couple of reference links for a way to get the column name:
http://www.andrewleesmith.co.uk/2017/02/24/finding-the-column-name-of-an-ssis-error-output-error-column-id/
which is based on:
http://toddmcdermid.blogspot.com/2016/04/finding-column-name-for-errorcolumn.html
I appreciate we aren't supposed to just post links, but this solution is quite convoluted, and I've tried to summarise by pulling info from both Todd and Andrew's blog posts and recreating them here. (thank you to both if you ever read this!)
From Todd's page:
Go to the "Inputs and Outputs" page, and select the "Output 0" node.
Change the "SynchronousInputID" property to "None". (This changes the script from synchronous to asynchronous.)
On the same page, open the "Output 0" node and select the "Output Columns" folder. Press the "Add Column" button. Change the "Name" property of this new column to "LineageID".
Press the "Add Column" button again, and change the "DataType" property to "Unicode string [DT_WSTR]", and change the "Name" property to "ColumnName".
Go to the "Script" page, and press the "Edit Script" button. Copy and paste this code into the ScriptMain class (you can delete all other method stubs):
public override void CreateNewOutputRows()
{
    IDTSInput100 input = this.ComponentMetaData.InputCollection[0];
    if (input != null)
    {
        IDTSVirtualInput100 vInput = input.GetVirtualInput();
        if (vInput != null)
        {
            foreach (IDTSVirtualInputColumn100 vInputColumn in vInput.VirtualInputColumnCollection)
            {
                Output0Buffer.AddRow();
                Output0Buffer.LineageID = vInputColumn.LineageID;
                Output0Buffer.ColumnName = vInputColumn.Name;
            }
        }
    }
}
Feel free to attach a dummy output to that script, with a data viewer,
and see what you get. From here, it's "standard engineering" for you
ETL gurus. Simply merge join the error output of the failing
component with this metadata, and you'll be able to transform the
ErrorColumn number into a meaningful column name.
But for those of you that do want to understand what the above script
is doing:
It's getting the "first" (and only) input attached to the script component.
It's getting the virtual input related to the input. The "input" is what the script can actually "see" on the input - and since we didn't mark any columns as being "ReadOnly" or "ReadWrite"... that means the input has NO columns. However, the "virtual input" has the complete list of every column that exists, whether or not we've said we're "using" it.
We then loop over all of the "virtual columns" on this virtual input, and for each one...
Get the LineageID and column name, and push them out as a new row on our asynchronous script.
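The join that turns an ErrorColumn number into a name is an ordinary key lookup. Here is a minimal sketch of that step in Python, outside SSIS; the function names, the row layout, the placeholder error code, and every column other than LookAtMe (id 73, taken from the example earlier in the thread) are assumptions made for illustration.

```python
def build_lineage_map(columns):
    """Turn (LineageID, ColumnName) pairs into a lookup table."""
    return {lineage_id: name for lineage_id, name in columns}


def attach_column_names(error_rows, lineage_map, default="unknown column"):
    """Return copies of the error rows with an ErrorColumnName field added."""
    return [
        dict(row, ErrorColumnName=lineage_map.get(row["ErrorColumn"], default))
        for row in error_rows
    ]


# Hypothetical metadata rows, shaped like the output of the script component
lineage_map = build_lineage_map([(52, "Numerator"), (73, "LookAtMe")])
# Hypothetical error row; the ErrorCode value is a placeholder
named = attach_column_names([{"ErrorCode": 1, "ErrorColumn": 73}], lineage_map)
```

In the data flow itself this lookup is what the merge join performs, with the script component's output playing the role of lineage_map.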
The image and text from Andrew's page helps explain it in a bit more detail:
This map is then merge-joined with the ErrorColumn lineage ID(s)
coming down the error path, so that the error information can be
appended with the column name(s) from the map. I included a second
script component that looks up the error description from the error
code, so the error table rows that we see above contain both column
names and error descriptions.
The remaining component that needs explaining is the conditional split
– this exists just to provide metadata to the script component that
creates the map. I created an expression (1 == 0) that always
evaluates to false for the “No Rows – Metadata Only” path, so no rows
ever travel down it.
Whilst this solution does require the insertion of some additional
plumbing within the data flow, we get extremely valuable information
logged when errors do occur. So especially when the data flow is
running unattended in Production – when we don’t have the tools &
techniques available at design time to figure out what’s going wrong –
the logging that results gives us much more precise information about
what went wrong and why, compared to simply giving us the failed data
and leaving us to figure out why it was rejected.
There is no simple way to find a column name from its lineage id.
If you want to find it using BIDS, you have to inspect every component inside the data flow using Advanced properties, on the Input and Output columns tab, and check the LineageID of each column on every input/output path.
But you can:
inspect the XML - this is very difficult
write a .NET application and use FindColumnByLineageId
However, the second solution involves a lot of coding and an understanding of the pipeline, because you have to programmatically open the package, iterate over tasks, iterate inside containers, and iterate over the transformations inside data flows to find the particular component on which to use the proposed method.
Here is a solution that:
Works at package runtime (not pre-populating)
Is automated through a Script Task and Component
Doesn't involve installing new assemblies or custom components
Is nicely BIML compatible
Check out the full solution here.
EDIT
Here is the short version.
Create 2 Object variables, execsObj and lineageIds
Create Script Task in Control flow, give it ReadWrite access to both variables
Add an assembly reference to your Script Task for Microsoft.SqlServer.DTSPipelineWrap.dll (this may require the SQL Client SDK to be installed; it is needed for the MainPipe object below)
Insert the following code into your Script Task
Dictionary<int, string> lineageIds = null;
public void Main()
{
// Grab the executables so we have something to iterate over, and initialize our lineageIDs list
// Why the executables? Well, SSIS won't let us store a reference to the Package itself...
Dts.Variables["User::execsObj"].Value = ((Package)Dts.Variables["User::execsObj"].Parent).Executables;
Dts.Variables["User::lineageIds"].Value = new Dictionary<int, string>();
lineageIds = (Dictionary<int, string>)Dts.Variables["User::lineageIds"].Value;
Executables execs = (Executables)Dts.Variables["User::execsObj"].Value;
ReadExecutables(execs);
Dts.TaskResult = (int)ScriptResults.Success;
}
private void ReadExecutables(Executables executables)
{
foreach (Executable pkgExecutable in executables)
{
if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.TaskHost)))
{
TaskHost pkgExecTaskHost = (TaskHost)pkgExecutable;
if (pkgExecTaskHost.CreationName.StartsWith("SSIS.Pipeline"))
{
ProcessDataFlowTask(pkgExecTaskHost);
}
}
else if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.ForEachLoop)))
{
// Recurse into FELCs
ReadExecutables(((ForEachLoop)pkgExecutable).Executables);
}
}
}
private void ProcessDataFlowTask(TaskHost currentDataFlowTask)
{
MainPipe currentDataFlow = (MainPipe)currentDataFlowTask.InnerObject;
foreach (IDTSComponentMetaData100 currentComponent in currentDataFlow.ComponentMetaDataCollection)
{
// Get the inputs in the component.
foreach (IDTSInput100 currentInput in currentComponent.InputCollection)
foreach (IDTSInputColumn100 currentInputColumn in currentInput.InputColumnCollection)
lineageIds.Add(currentInputColumn.ID, currentInputColumn.Name);
// Get the outputs in the component.
foreach (IDTSOutput100 currentOutput in currentComponent.OutputCollection)
foreach (IDTSOutputColumn100 currentoutputColumn in currentOutput.OutputColumnCollection)
lineageIds.Add(currentoutputColumn.ID, currentoutputColumn.Name);
}
}
Create Script Component in Dataflow with ReadOnly access to lineageIds and the following code.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
Dictionary<int, string> lineageIds = (Dictionary<int, string>)Variables.lineageIds;
int? colNum = Row.ErrorColumn;
if (colNum.HasValue && (lineageIds != null))
{
if (lineageIds.ContainsKey(colNum.Value))
Row.ErrorColumnName = lineageIds[colNum.Value];
else
Row.ErrorColumnName = "Row error";
}
Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);
}
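The shape of that traversal, recursing through containers, harvesting column ids from each data flow, and then looking up the failing id with a fallback, can be sketched independently of the SSIS object model. The Python below is only an illustration: the nested-dict package layout and both function names are assumptions, not SSIS types. It uses setdefault so that a repeated id is ignored, whereas Dictionary.Add in .NET throws on a duplicate key.

```python
def collect_lineage_ids(executables, lineage_ids=None):
    """Walk nested executables and gather {column id: column name}."""
    if lineage_ids is None:
        lineage_ids = {}
    for executable in executables:
        if executable["kind"] == "dataflow":
            for column_id, name in executable["columns"]:
                # Keep the first name seen for an id instead of failing
                lineage_ids.setdefault(column_id, name)
        elif executable["kind"] == "container":
            # Recurse into any container, not only one container type
            collect_lineage_ids(executable["children"], lineage_ids)
    return lineage_ids


def error_column_name(lineage_ids, error_column):
    """Resolve an error column id, falling back to 'Row error'."""
    if error_column is None:
        return None
    return lineage_ids.get(error_column, "Row error")
```

Note that the Script Task above recurses only into ForEachLoop containers; a package that nests data flows inside other container types would need additional branches to reach them.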