Connecting output of script to Extract Performance - RapidMiner

When I use the Execute Script operator with a single input port of type ExampleSet, run, for example, the one-line script return operator.getInput(ExampleSet.class), and then connect the output to an Extract Performance operator (which takes an ExampleSet as input), I get an error: Mandatory input missing at port Performance.example set.
My goal is to check a Petri net for soundness via the Analyse soundness operator that comes with the RapidProm extension, and to change the first attribute of the first row to either 0 or 1 depending on whether that string matches "is sound", so I can then use Extract Performance and combine it with other performances using Average.
Is Execute Script the right way to do this, and if so, how should I fix this error?

Firstly: don't worry about the error Mandatory input missing at port Performance.example set.
It will be resolved when you run the model.
Secondly: the output of the operator that checks the soundness of the model is indeed a bit ugly, since it is one very long string that looks like
Woflan diagnosis of net "d1cf46bd-15a9-4801-9f02-946a8f125eaf" - The net is sound End of Woflan diagnosis
You can indeed use Execute Script to resolve this :)
See the script below!
The output is an example set that contains 1 if the model is sound, and 0 otherwise. Furthermore, I like to use some log operators to translate this into a nice table for documentation purposes.
ExampleSet input = operator.getInput(ExampleSet.class);

for (Example example : input) {
    String uglyResult = example["att1"];
    String soundResult = "The net is sound";
    Boolean soundnessCheck = uglyResult.toLowerCase().contains(soundResult.toLowerCase());
    if (soundnessCheck) {
        example["att1"] = "1"; // the net is sound :)
    } else {
        example["att1"] = "0"; // the net is not sound!
    }
}

return input;
See also the attached example model I created.
[Image: RapidMiner process setup]

Related

Octave: how to retrieve data from a Java ResultSet object?

I need to feed my Octave instance with data retrieved from an Oracle database.
I have implemented an OJDBC connection in my Octave instance and I am now able to put data from an Oracle database into a Java ResultSet object in Octave (taken from: https://lists.gnu.org/archive/html/help-octave/2011-08/msg00250.html):
javaaddpath('access-path-to-ojdbc8.jar');  % put the Oracle JDBC driver on the Java classpath
props = javaObject('java.util.Properties');
props.setProperty("user", 'username');
props.setProperty("password", 'password');
driver = javaObject('oracle.jdbc.OracleDriver');
url = 'jdbc:oracle:thin:@ip:port:schema';  % thin-driver URLs take the form jdbc:oracle:thin:@host:port:SID
con = driver.connect(url, props);
sql = 'select-query';
ps = con.prepareStatement(sql);
rs = ps.executeQuery();
But I haven't succeeded in retrieving data from that ResultSet.
How can I put data from a ResultSet object in Octave into an array or matrix?
Finding out what to do
The docs you want for ResultSet and related classes are in the Java JDBC API documentation. (You don't need the Oracle-specific docs unless you want to do fancy Oracle-specific stuff; all JDBC drivers conform to the generic JDBC API.) Have a look at that and any JDBC tutorial; because it is a Java object, you'll use all the same method calls from Octave that you would from Java code.
For conversion to Octave values, know that Java primitives convert to Octave types automatically, java.lang.String objects require conversion by calling char(...) on them, and java.sql.Date values you will have to convert to datenums manually. (Lazy way is to get their string values and parse them; fast way is to get their Unix time values and convert numerically.)
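For instance, a minimal sketch of those conversions (the column indexes and types here are hypothetical):
s = char(rs.getString(1));        % java.lang.String -> Octave char array
x = rs.getDouble(2);              % Java primitives convert automatically
d = rs.getDate(3);                % java.sql.Date object
millis = double(d.getTime());     % milliseconds since the Unix epoch
dn = datenum(1970, 1, 1) + millis / 86400e3;  % the fast, numeric route to a datenum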
What to do
Because Java JDBC advances the result set cursor one row at a time, and requires a separate method call to get the value for each column, you need to use a pair of nested loops to iterate over the ResultSet. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = NaN(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    data(iRow,iCol) = rs.getDouble(iCol);
  endfor
endwhile
Ah, but what if your columns aren't all numerics? Then you'll need to look at the column type in rsMeta, switch on it, and use a cell array to hold the heterogeneous data set. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = cell(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    colTypeId = rsMeta.getColumnType(iCol);
    switch colTypeId
      case NUMERIC_TYPE
        data{iRow,iCol} = rs.getDouble(iCol);
      case CHAR_TYPE
        data{iRow,iCol} = rs.getString(iCol);
        data{iRow,iCol} = char(data{iRow,iCol});
      # ... and so on ...
      otherwise
        error('Unsupported SQL data type in column %d: %d', ...
              iCol, colTypeId);
    endswitch
  endfor
endwhile
How do you know what the values for NUMERIC_TYPE, CHAR_TYPE, and so on should be? You have to examine the values in the java.sql.Types Java class. Do that at run time to make sure you're consistent with the JDK you're running against.
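For example, a sketch of pulling those constants at run time, assuming your Octave build exposes static Java fields via java_get:
NUMERIC_TYPE = java_get('java.sql.Types', 'NUMERIC');
CHAR_TYPE    = java_get('java.sql.Types', 'CHAR');
VARCHAR_TYPE = java_get('java.sql.Types', 'VARCHAR');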
(Note: this code is the easy, sloppy way of doing it. There are all sorts of improvements and optimizations you could (and should) make.)
How to go fast
Unfortunately, the performance of this is going to suck big time, because Java method calls from Octave are expensive, and cells are an inefficient way of holding data. If your result sets are large, then in order to get good performance you need to write a result-set buffering layer in Java that runs the loops in Java and buffers the results in primitive per-column arrays, and use that. If you want an example of how to do this, I have an example implementation in Matlab in my Janklab library (M-code layer here). Feel free to steal the code. Octave doesn't support dot-referencing of Java constructors or class methods, so to convert it to Octave, you'd need to replace all those with javaObject and javaMethod calls. (That's tedious and results in ugly code, so I'm not going to do it myself. Sorry.)
If you're not willing to do that (and really, who is?), and still need good performance, what you should actually do is forget about connecting Octave directly to Oracle, and write a separate Python/NumPy or R program that takes your query, runs it against your Oracle db, and writes the result to a .mat file that you will then read from Octave.
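As a sketch of that last idea in Python, assuming cx_Oracle, NumPy and SciPy are installed (the driver choice and query string are placeholders, not part of the original setup):
import cx_Oracle
import numpy as np
from scipy.io import savemat

con = cx_Oracle.connect('username', 'password', 'ip:port/schema')
cur = con.cursor()
cur.execute('select-query')
rows = np.array(cur.fetchall(), dtype=float)  # assumes all-numeric columns
savemat('result.mat', {'data': rows})         # then load('result.mat') in Octave
con.close()
Octave's load can then read result.mat directly into a matrix named data.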
I don't have access to the specified .jar or a suitable database to test your specific code, but in any case, this isn't really an Octave problem. Effectively you need the relevant API for the ResultSet class, and a standard approach for processing it. The Oracle documentation suggests that in Java you'd do something like this:
while (rs.next()) { System.out.println (rs.getString(1)); }
So, presumably this is exactly what you'll do in Octave too, except via Octave's Java interface. One possible way this might look is:
while rs.next().booleanValue  % a java.lang.Boolean object by itself isn't
                              % valid logical input for Octave's
                              % 'while' statement
  % do something with rs, e.g. fill in a cell array
endwhile
As for whether you can automatically convert a Java array to an Octave cell object or vice versa, as far as I know this is not possible. You'd have to set / get elements from one to the other via a for loop, just like you'd do in Java (e.g. see the note in the manual regarding the javaArray function).
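For instance, a small sketch modeled on the manual's javaArray example:
jary = javaArray('java.lang.String', 3);  % fixed-size Java String[] array
for i = 1:3
  jary(i) = sprintf('item%d', i);         % set elements one at a time
endfor

c = cell(1, 3);
for i = 1:3
  c{i} = char(jary(i));                   % copy each element back into an Octave cell
endfor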

I just started learning Python. I want to get the file name as user input

def copy_file(from_file, to_file):
    content = open(from_file).read()
    target = open(to_file, 'w').write(content)
    print open(to_file).read()

def user_input(f1):
    f1 = raw_input("Enter the source file : ")

user_input(f1)
user_input(f2)
copy_file(user_input(f1), user_input(f2))
What is the mistake in this? I tried it with argv and it was working.
You're not calling the function user_input (by using ()). (Fixed in the question by the OP.)
Also, you need to return a string from user_input. Currently you're assigning to a variable f1 which is local to the function user_input. While this is possible using global, I do not recommend it (it defeats keeping your code DRY).
It's possible to do something similar with objects by changing their state. A string is an object, but since strings are immutable, a function can't change the state of the string it's given; this approach of expecting a function to change its string argument is also doomed to fail.
def user_input():
    return raw_input("Enter the source file :").strip()

copy_file(user_input(), user_input())
As you can see, user_input does very little; it's actually redundant if you assume the user input is valid.
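Putting it together, a minimal corrected version (Python 2, to match the question's raw_input; the prompt_for helper name and the target-file prompt are made up for illustration):
def copy_file(from_file, to_file):
    content = open(from_file).read()
    open(to_file, 'w').write(content)
    print open(to_file).read()  # show the copied contents

def prompt_for(name):
    return raw_input("Enter the %s file : " % name).strip()

copy_file(prompt_for("source"), prompt_for("target"))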

Splitting file name in SSIS

I have files in one folder with following naming convention
ClientID_ClientName_Date_Fileextension
12345_Dell_20110103.CSV
I want to extract ClientID from the filename and store it in a variable. I am unsure how I would do this. It seems that a Script Task would suffice, but I do not know how to proceed.
Your options are using Expressions on SSIS Variables or using a Script Task. As a general rule I prefer Expressions, but mentally I can tell that here it would be a lot of expression code, or a lot of intertwined variables.
Instead, I'd use the String.Split method in .NET. If you call Split on your sample data with a delimiter of underscore (_), you'd receive a 3-element array:
12345
Dell
20110103.CSV
Wrap that in a try/catch block and always grab the second element. Quick and dirty, and of course it won't address things like 12345_Dell_Quest_20110103.CSV, but you didn't ask that question.
Approximate code:
string phrase = Dts.Variables["User::CurrentFile"].Value.ToString();
string[] stringSeparators = new string[] { "_" };  // underscore, per the naming convention
string[] words;
try
{
    words = phrase.Split(stringSeparators, StringSplitOptions.None);
    Dts.Variables["User::ClientName"].Value = words[1];
    // words[0] holds the ClientID the question asks about
}
catch
{
    ; // Do something with this error
}
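If you're on SSIS 2012 or later, there is also an expression-only route via the TOKEN function (a sketch, assuming the same User::CurrentFile variable as above):
TOKEN(@[User::CurrentFile], "_", 1)
That returns the first underscore-delimited piece, i.e. the ClientID, and can be used directly in an expression on an SSIS variable.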

What is the simple way to find the column name from LineageID in SSIS

What is the simple way to find the column name from LineageID in SSIS? Is there any system variable available?
I remember saying this can't be that hard; I can write some script in the error redirect to look up the column name from the input collection.
string badColumn = this.ComponentMetaData.InputCollection[Row.ErrorColumn].Name;
What I learned was that the failing column isn't in that collection. Well, it is, but the ErrorColumn reported is not quite what I needed. I couldn't find that package, but here's an example of why I couldn't get what I needed. Hopefully you will have better luck.
This is a simple data flow that will generate an error once it hits the derived column due to division by zero. The Derived column generates a new output column (LookAtMe) as the result of the division. The data viewer on the Error Output tells me the failing column is 73. Using the above script logic, if I attempted to access column 73 in the input collection, it's going to fail because that is not in the collection. LineageID 73 is LookAtMe and LookAtMe is not in my error branch, it's only in the non-error branch.
This is a copy of my XML, and you can see that, yes, the output column with id 73 is LookAtMe.
<outputColumn id="73" name="LookAtMe" description="" lineageId="73" precision="0" scale="0" length="0" dataType="i4" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Computation" errorRowDisposition="RedirectRow" truncationRowDisposition="RedirectRow" externalMetadataColumnId="0" mappedColumnId="0"><properties>
I really wanted that data though, and I'm clever, so I can union all my results back together and then conditionally split it back out to get that. The problem is, Union All is an asynchronous transformation. Async transformations result in the data being copied from one set of buffers to another, resulting in new lineage ids being assigned, so even with a Union All bringing the two streams back together, you wouldn't be able to call up the data flow chain to find that original lineage id because it's in a different buffer.
Around this point, I conceded defeat and decided I could live without intelligent/helpful error reporting in my packages.
I know this is a long-dead thread, but I tripped across a manual solution to this problem and thought I would share it for anyone who happens upon this same problem. Granted, this doesn't provide a programmatic solution, but for simple debugging it should do the trick. The solution uses a Derived Column as an example, but this seems to work for any Data Flow component.
Answer provided by Todd McDermid and taken from AskSQLServerCentral:
"[...] Unfortunately, the lineage ID of your columns is pretty well hidden inside SSIS. It's the "key" that SSIS uses to identify columns. So, in order to figure out which column it was, you need to open the Advanced Editor of the Derived Column component or Data Conversion. Do that by right clicking and selecting "Advanced Editor". Go to the "Input and Output Properties" tab. Open the first node - "Derived Column Input" or "Data Conversion Input". Open the "Input Columns" tab. Click through the columns, noting the "LineageID" property of each. You may have to do the same with the "Derived Column Output" node, and "Output Columns" inside there. The column that matches your recorded lineage ID is the offending column."
For anyone using SQL Server versions before SS2016, here are a couple of reference links for a way to get the Column name:
http://www.andrewleesmith.co.uk/2017/02/24/finding-the-column-name-of-an-ssis-error-output-error-column-id/
which is based on:
http://toddmcdermid.blogspot.com/2016/04/finding-column-name-for-errorcolumn.html
I appreciate we aren't supposed to just post links, but this solution is quite convoluted, so I've tried to summarise it by pulling info from both Todd's and Andrew's blog posts and recreating them here. (Thank you to both if you ever read this!)
From Todd's page:
Go to the "Inputs and Outputs" page, and select the "Output 0" node.
Change the "SynchronousInputID" property to "None". (This changes
the script from synchronous to asynchronous.)
On the same page, open the "Output 0" node and select the "Output
Columns" folder. Press the "Add Column" button. Change the "Name"
property of this new column to "LineageID".
Press the "Add Column" button again, and change the "DataType"
property to "Unicode string [DT_WSTR]", and change the "Name"
property to "ColumnName".
Go to the "Script" page, and press the "Edit Script" button. Copy
and paste this code into the ScriptMain class (you can delete all
other method stubs):
public override void CreateNewOutputRows()
{
    IDTSInput100 input = this.ComponentMetaData.InputCollection[0];
    if (input != null)
    {
        IDTSVirtualInput100 vInput = input.GetVirtualInput();
        if (vInput != null)
        {
            foreach (IDTSVirtualInputColumn100 vInputColumn in vInput.VirtualInputColumnCollection)
            {
                Output0Buffer.AddRow();
                Output0Buffer.LineageID = vInputColumn.LineageID;
                Output0Buffer.ColumnName = vInputColumn.Name;
            }
        }
    }
}
Feel free to attach a dummy output to that script, with a data viewer, and see what you get. From here, it's "standard engineering" for you ETL gurus. Simply merge join the error output of the failing component with this metadata, and you'll be able to transform the ErrorColumn number into a meaningful column name.
But for those of you that do want to understand what the above script is doing:

- It's getting the "first" (and only) input attached to the script component.
- It's getting the virtual input related to the input. The "input" is what the script can actually "see" on the input - and since we didn't mark any columns as being "ReadOnly" or "ReadWrite"... that means the input has NO columns. However, the "virtual input" has the complete list of every column that exists, whether or not we've said we're "using" it.
- We then loop over all of the "virtual columns" on this virtual input, and for each one...
- Get the LineageID and column name, and push them out as a new row on our asynchronous script.
The image and text from Andrew's page help explain it in a bit more detail:

This map is then merge-joined with the ErrorColumn lineage ID(s) coming down the error path, so that the error information can be appended with the column name(s) from the map. I included a second script component that looks up the error description from the error code, so the error table rows that we see above contain both column names and error descriptions.
The remaining component that needs explaining is the conditional split – this exists just to provide metadata to the script component that creates the map. I created an expression (1 == 0) that always evaluates to false for the “No Rows – Metadata Only” path, so no rows ever travel down it.
Whilst this solution does require the insertion of some additional plumbing within the data flow, we get extremely valuable information logged when errors do occur. So especially when the data flow is running unattended in Production – when we don’t have the tools & techniques available at design time to figure out what’s going wrong – the logging that results gives us much more precise information about what went wrong and why, compared to simply giving us the failed data and leaving us to figure out why it was rejected.
There is no simple way to find out a column name by lineage id.
If you want to know this using BIDS, you have to inspect all components inside the data flow using the Advanced Editor's Input and Output Properties tab and check the LineageID of each column on each input/output path.
But you can:
inspect the XML - this is very difficult
write a .NET application and use FindColumnByLineageId
However, the second solution involves a lot of coding and an understanding of the pipeline, because you have to programmatically open the package, iterate over tasks, iterate inside containers, and iterate over transformations inside data flows to find the particular component on which to use the proposed method.
Here is a solution that:
Works at package runtime (not pre-populating)
Is automated through a Script Task and Component
Doesn't involve installing new assemblies or custom components
Is nicely BIML compatible
Check out the full solution here.
EDIT
Here is the short version.
Create 2 Object variables, execsObj and lineageIds
Create Script Task in Control flow, give it ReadWrite access to both variables
Add an assembly reference in your script task to Microsoft.SqlServer.DTSPipelineWrap.dll (this may require the SQL Client SDK to be installed; it is needed for the MainPipe object below)
Insert the following code into your Script Task
Dictionary<int, string> lineageIds = null;

public void Main()
{
    // Grab the executables so we have something to iterate over, and initialize our lineageIds list
    // Why the executables? Well, SSIS won't let us store a reference to the Package itself...
    Dts.Variables["User::execsObj"].Value = ((Package)Dts.Variables["User::execsObj"].Parent).Executables;
    Dts.Variables["User::lineageIds"].Value = new Dictionary<int, string>();
    lineageIds = (Dictionary<int, string>)Dts.Variables["User::lineageIds"].Value;
    Executables execs = (Executables)Dts.Variables["User::execsObj"].Value;

    ReadExecutables(execs);

    Dts.TaskResult = (int)ScriptResults.Success;
}

private void ReadExecutables(Executables executables)
{
    foreach (Executable pkgExecutable in executables)
    {
        if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.TaskHost)))
        {
            TaskHost pkgExecTaskHost = (TaskHost)pkgExecutable;
            if (pkgExecTaskHost.CreationName.StartsWith("SSIS.Pipeline"))
            {
                ProcessDataFlowTask(pkgExecTaskHost);
            }
        }
        else if (object.ReferenceEquals(pkgExecutable.GetType(), typeof(Microsoft.SqlServer.Dts.Runtime.ForEachLoop)))
        {
            // Recurse into ForEach Loop Containers
            ReadExecutables(((ForEachLoop)pkgExecutable).Executables);
        }
    }
}

private void ProcessDataFlowTask(TaskHost currentDataFlowTask)
{
    MainPipe currentDataFlow = (MainPipe)currentDataFlowTask.InnerObject;
    foreach (IDTSComponentMetaData100 currentComponent in currentDataFlow.ComponentMetaDataCollection)
    {
        // Get the inputs in the component.
        foreach (IDTSInput100 currentInput in currentComponent.InputCollection)
            foreach (IDTSInputColumn100 currentInputColumn in currentInput.InputColumnCollection)
                lineageIds.Add(currentInputColumn.ID, currentInputColumn.Name);

        // Get the outputs in the component.
        foreach (IDTSOutput100 currentOutput in currentComponent.OutputCollection)
            foreach (IDTSOutputColumn100 currentOutputColumn in currentOutput.OutputColumnCollection)
                lineageIds.Add(currentOutputColumn.ID, currentOutputColumn.Name);
    }
}
Create a Script Component in the Data Flow with ReadOnly access to lineageIds, add two output columns to it named ErrorColumnName and ErrorDescription (the code below assigns to both), and use the following code.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    Dictionary<int, string> lineageIds = (Dictionary<int, string>)Variables.lineageIds;

    int? colNum = Row.ErrorColumn;
    if (colNum.HasValue && (lineageIds != null))
    {
        if (lineageIds.ContainsKey(colNum.Value))
            Row.ErrorColumnName = lineageIds[colNum.Value];
        else
            Row.ErrorColumnName = "Row error";
    }
    Row.ErrorDescription = this.ComponentMetaData.GetErrorDescription(Row.ErrorCode);
}

Accessing the Body of a Function with Lua

I'm going back to the basics here but in Lua, you can define a table like so:
myTable = {}
myTable[1] = 12
Printing the table reference itself brings back a pointer to it. To access its elements you need to specify an index (i.e. exactly as you would with an array):
print(myTable)    -- prints pointer
print(myTable[1]) -- prints 12
Now functions are a different story. You can define and print a function like so:
myFunc = function() local x = 14 end -- defined function
print(myFunc) -- printed pointer to function
Is there a way to access the body of a defined function? I am trying to put together a small code visualizer and would like to 'seed' a given function with special functions/variables to allow a visualizer to 'hook' itself into the code. I would need to be able to redefine the function either from a variable or a string.
There is no way to get access to the body source code of a given function in plain Lua. Source code is thrown away after compilation to byte-code.
Note, BTW, that a function may be defined at run time with a loadstring-like facility.
Partial solutions are possible — depending on what you actually want to achieve.
You may get source code position from the debug library — if debug library is enabled and debug symbols are not stripped from the bytecode. After that you may load actual source file and extract code from there.
You may decorate the functions you're interested in manually with the required metadata. Note that functions in Lua are valid table keys, so you may create a function-to-metadata table. You would want to make this table weak-keyed, so it would not prevent functions from being collected by the GC.
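A tiny sketch of that decoration idea (the table and helper names are made up for illustration):
local fn_meta = setmetatable({}, { __mode = "k" })  -- weak keys: entries vanish when a function is collected

local function annotate(fn, meta)
  fn_meta[fn] = meta
  return fn
end

local myFunc = annotate(function() local x = 14 end,
                        { source = "function() local x = 14 end" })

print(fn_meta[myFunc].source)  --> function() local x = 14 end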
If you would need a solution for analyzing Lua code, take a look at Metalua.
Check out Lua Introspective Facilities in the debugging library.
The main introspective function in the debug library is the debug.getinfo function. Its first parameter may be a function or a stack level. When you call debug.getinfo(foo) for some function foo, you get a table with some data about that function. The table may have the following fields:
The field you would want is func, I think.
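For instance (a quick sketch; the exact fields vary a little between Lua versions):
local function foo() local x = 14 end

local info = debug.getinfo(foo)
print(info.func == foo)                        --> true
print(info.source, info.short_src)             -- where the chunk came from
print(info.linedefined, info.lastlinedefined)  -- line range of the definition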
Using the debug library is your only bet. Using that, you can get either the string (if the function is defined in a chunk that was loaded with 'loadstring') or the name of the file in which the function was defined, together with the line numbers at which the function definition starts and ends. See the documentation.
Here at my current job we have patched Lua so that it even gives you the column numbers for the start and end of the function, so you can get the function source using that. The patch is not very difficult to reproduce, but I don't think I'll be allowed to post it here :-(
You could accomplish this by creating an environment for each function (see setfenv) and using global (versus local) variables. Variables created in the function would then appear in the environment table after the function is executed.
env = {}
myFunc = function() x = 14 end
setfenv(myFunc, env)
myFunc()
print(myFunc) -- prints pointer
print(env.x) -- prints 14
Alternatively, you could make use of the Debug Library:
> myFunc = function() local x = 14 ; debug.debug() end
> myFunc()
lua_debug> _, x = debug.getlocal(3, 1)
lua_debug> print(x) -- prints 14
It would probably be more useful to you to retrieve the local variables with a hook function instead of explicitly entering debug mode (i.e. adding the debug.debug() call)
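One possible shape for that, as a sketch against Lua 5.1's debug API (the captured table is illustrative):
local captured = {}

-- "r" makes the hook fire on every function return
debug.sethook(function()
  local i = 1
  while true do
    local name, value = debug.getlocal(2, i)  -- level 2 = the function that is returning
    if not name then break end
    captured[name] = value
    i = i + 1
  end
end, "r")

myFunc = function() local x = 14 end
myFunc()

debug.sethook()     -- remove the hook again
print(captured.x)   --> 14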
There is also a Debug Interface in the Lua C API.