Octave: how to retrieve data from a Java ResultSet object?

I need to feed my Octave instance with data retrieved from an Oracle database.
I have implemented an OJDBC connection in my Octave instance and I am now able to put data from an Oracle database into a Java ResultSet object in Octave (taken from: https://lists.gnu.org/archive/html/help-octave/2011-08/msg00250.html):
javaaddpath('access-path-to-ojdbc8.jar') ;
props = javaObject('java.util.Properties') ;
props.setProperty("user", 'username') ;
props.setProperty("password", 'password') ;
driver = javaObject('oracle.jdbc.OracleDriver') ;
url = 'jdbc:oracle:thin:@ip:port:schema' ;
con = driver.connect(url, props) ;
sql = 'select-query' ;
ps = con.prepareStatement(sql) ;
rs = ps.executeQuery() ;
But I haven't succeeded in retrieving data from that ResultSet.
How can I put data from a ResultSet object in Octave into an array or matrix?

Finding out what to do
The docs you want for ResultSet and related classes are in the Java JDBC API documentation. (You don't need the Oracle-specific doco unless you want to do fancy Oracle-specific stuff. All JDBC drivers conform to the generic JDBC API.) Have a look at that and any JDBC tutorial; because it is a Java object, you'll use all the same method calls from Octave that you would from Java code.
For conversion to Octave values, know that Java primitives convert to Octave types automatically, java.lang.String objects require conversion by calling char(...) on them, and java.sql.Date values you will have to convert to datenums manually. (Lazy way is to get their string values and parse them; fast way is to get their Unix time values and convert numerically.)
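For example, a minimal sketch of both conversions (assuming rs is the ResultSet from the question, iCol is a DATE column, and jCol is a character column; getTime() returns milliseconds since the Unix epoch, ignoring time zone subtleties):
d = rs.getDate(iCol);                                  % java.sql.Date object
dnum = d.getTime() / 86400000 + datenum(1970, 1, 1);   % ms since epoch -> datenum
s = char(rs.getString(jCol));                          % java.lang.String -> Octave char array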
What to do
Because Java JDBC advances the result set cursor one row at a time, and requires a separate method call to get the value for each column, you need to use a pair of nested loops to iterate over the ResultSet. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = NaN(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    data(iRow,iCol) = rs.getDouble(iCol);
  endfor
endwhile
Ah, but what if your columns aren't all numerics? Then you'll need to look at the column type in rsMeta, switch on it, and use a cell array to hold the heterogeneous data set. Like this:
rsMeta = rs.getMetaData();
nCols = rsMeta.getColumnCount();
data = cell(1, nCols);
iRow = 0;
while rs.next()
  iRow = iRow + 1;
  for iCol = 1:nCols
    colTypeId = rsMeta.getColumnType(iCol);
    switch colTypeId
      case NUMERIC_TYPE
        data{iRow,iCol} = rs.getDouble(iCol);
      case CHAR_TYPE
        data{iRow,iCol} = char(rs.getString(iCol));
      # ... and so on ...
      otherwise
        error('Unsupported SQL data type in column %d: %d', ...
              iCol, colTypeId);
    endswitch
  endfor
endwhile
How do you know what the values for NUMERIC_TYPE, CHAR_TYPE, and so on should be? You have to examine the values in the java.sql.Types Java class. Do that at run time to make sure you're consistent with the JDK you're running against.
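For instance, one way to pull those constants at run time from Octave is plain Java reflection (a sketch; Octave passes [] where Java expects null, which is what Field.getInt() wants for a static field):
typesCls = javaMethod('forName', 'java.lang.Class', 'java.sql.Types');
fld = typesCls.getField('NUMERIC');
NUMERIC_TYPE = fld.getInt([]);    % static field, so the instance argument is null
fld = typesCls.getField('CHAR');
CHAR_TYPE = fld.getInt([]);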
(Note: this code is the easy, sloppy way of doing it. There's all sorts of improvements and optimizations you could (and should) do on it.)
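One easy improvement, as a sketch: the column types never change from row to row, so look them up once before the row loop instead of once per cell:
colTypes = zeros(1, nCols);
for iCol = 1:nCols
  colTypes(iCol) = rsMeta.getColumnType(iCol);
endfor
% ... then switch on colTypes(iCol) inside the row loop ...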
How to go fast
Unfortunately, the performance of this is going to suck big time, because Java method calls from Octave are expensive, and cells are an inefficient way of holding data. If your result sets are large, in order to get good performance, what you need to do is write a result set buffering layer in Java that runs the loops in Java and buffers the results in primitive per-column arrays, and use that. If you want an example of how to do this, I have an example implementation in Matlab in my Janklab library (M-code layer here). Feel free to steal the code. Octave doesn't support dot-referencing of Java constructors or static class methods, so to convert it to Octave, you'd need to replace all those with javaObject and javaMethod calls. (That's tedious and results in ugly code, so I'm not going to do it myself. Sorry.)
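For reference, the mechanical translation looks like this (a sketch; java.lang.Integer stands in for whatever classes the buffering layer actually exposes, and com.example.ResultSetBuffer is a hypothetical name):
% Java:   int n = Integer.parseInt("42");   buf = new ResultSetBuffer(rs);
% Octave equivalents:
n   = javaMethod('parseInt', 'java.lang.Integer', '42');
buf = javaObject('com.example.ResultSetBuffer', rs);   % hypothetical class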
If you're not willing to do that (and really, who is?), and still need good performance, what you should actually do is forget about connecting Octave directly to Oracle, and write a separate Python/NumPy or R program that takes your query, runs it against your Oracle db, and writes the result to a .mat file that you will then read from Octave.

I don't have access to the specified .jar or a suitable database to test your specific code, but in any case, this isn't really an Octave problem. Effectively you need the relevant API for the ResultSet class, and a standard approach for processing it. The Oracle documentation suggests that in Java you'd do something like this:
while (rs.next()) { System.out.println (rs.getString(1)); }
So, presumably this is exactly what you'll do in Octave too, except via Octave's Java interface. One possible way this might look is:
while rs.next().booleanValue   % since a Boolean java object by itself
                               % isn't valid logical input for octave's
                               % 'while' statement
  % do something with rs, e.g. fill in a cell array
endwhile
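For instance, a minimal sketch that fills a cell array with the first column (keeping the .booleanValue workaround from above, and assuming column 1 is textual):
vals = {};
while rs.next().booleanValue
  vals{end+1, 1} = char(rs.getString(1));  % grow-by-append; fine for small result sets
endwhile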
As for whether you can automatically convert a Java array to an Octave cell object or vice versa, as far as I know this is not possible. You'd have to set / get elements from one to the other via a for loop, just like you'd do in Java (e.g. see the note in the manual regarding the javaArray function).

Related

Proper way to dynamically define LUA functions

I have been playing with Lua for the past week and I ended up writing this piece of code. I find it really useful that you can dynamically create new functions that "inherit" other functions, so programmers must have a name for it. My questions are:
What is this called?
Is there a better way to do this? (cycle over a data structure and create add-on functions that "improve" existing functions)
D = {
  name = {
    value = nil,
    offset = 0,
    update = function (self)
      self.value = "Berlin"
    end,
  },
}

-- Dynamic function definition
for i in pairs(D) do
  D[i].upAndWrite = function(self)
    self:update()
    print("upAndWrite was here")
  end
end

print(D.name.value)
D.name:upAndWrite()
print(D.name.value)
Result:
nil
upAndWrite was here
Berlin
I don't think what you're doing has a special name; it's just on-the-fly function creation.
There are a few notes regarding your code:
Proper for loop
for i in pairs(D) do
  …
end
In programming, variable i is generally used for counter loops, like in
for i = 1, 100 do
  …
end
Here, pairs returns an iterator function and the idiomatic way to use it is
for k,v in pairs(D) do
  …
end
Here k is a key (like i in your code) and v is a value (use it instead of indexing the table like D[k] or D[i] when you need to access the corresponding value).
There's no need to create functions on the fly!
Another important thing is that you create a new function on each loop iteration. While this feature is very powerful, you're not using it at all, since you don't store anything using upvalues and only access data through arguments.
A better way would be to create the function once and assign it to every field:
-- Define D here
do
  local function upAndWrite(self)
    self:update()
    print("upAndWrite was here")
  end

  for k,v in pairs(D) do
    v.upAndWrite = upAndWrite -- Use our function
  end
end
-- Perform tests here
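For example, the original test sequence behaves exactly as before:
print(D.name.value)   --> nil
D.name:upAndWrite()   --> upAndWrite was here
print(D.name.value)   --> Berlin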
What does on-the-fly function creation allow?
As mentioned above, you can utilize this very powerful mechanism of closures in certain situations. Here's a simple example:
local t = {}
for i = 1, 100 do
  -- Create function to print our value
  t[i] = function() print(i) end
  -- Create function to increment value by one
  t[-i] = function() i=i+1 end
end

t[1]()   -- Prints 1
t[20]()  -- Prints 20
t[-20]() -- Increment upvalue by one
t[20]()  -- Now it is 21!
This example demonstrates one possible usage of upvalues and the fact that many functions can share them. This can be useful in a variety of situations, together with the fact that upvalues can't be changed by outside code (without use of the debug library) and can generally be trusted.
I hope my answer covers what you wanted to know.
Note: Also, this language is called Lua, not LUA.
As a whole it doesn't have a name, no. There are lots of concepts that play into this:
First Class Functions aka. functions that can be assigned to variables and passed around just like numbers or strings.
Anonymous Functions aka. functions that are created without explicitly giving them a name. All functions in Lua are technically anonymous, but often they are assigned into a variable right after creation.
Metaprogramming aka. writing programs that write programs. A loop that creates functions (or methods) on an arbitrary number of objects is very simple, but I'd count it as metaprogramming.
Lua Tables; this may seem obvious, but consider that not all languages have a feature like this. JavaScript has objects, which are similar, but Ruby, for example, has no comparable feature.
If you're gonna use pairs, you might as well make use of both variables.
for key, object in pairs(D) do
  function object:upAndWrite() -- colon syntax already provides self; no explicit parameter
    self:update()
    print("upAndWrite was here")
  end
end
Though that would create many closures, which means more work for the garbage collector, more memory usage and slower execution speed.
for key, object in pairs(D) do
  print(object.upAndWrite) -- All the functions are different
end
It's a good first stage, but after refactoring it a bit you could get this:
do
  local method = function(self) -- Create only one closure
    self:update()
    print("upAndWrite was here")
  end

  for key, object in pairs(D) do
    object.upAndWrite = method -- Use the single closure many times
  end
end
Now there's only one closure that's shared among all the tables.
for key, object in pairs(D) do
  print(object.upAndWrite) -- All the functions are the same
end

Connecting output of script to extract performance

When I use the Execute Script operator with a single input of type ExampleSet, run the one-line script return operator.getInput(ExampleSet.class), and then connect the output to an Extract Performance operator (which takes an ExampleSet as input), I get an error: Mandatory input missing at port Performance.example set.
My goal is to check a Petri net for soundness via the Analyse soundness operator that comes with the RapidProm extension, and to change the first attribute of the first line to either 0 or 1 depending on whether the string matches "is sound", so I can then use Extract Performance and combine it with other performances using Average.
Is doing this with Execute Script the right way to do it, and if so, how should I fix this error?
Firstly: don't worry about the error Mandatory input missing at port Performance.example set.
It will be resolved when you run the model.
Secondly: the output of the operator that checks the soundness of the model is indeed a bit ugly, since it is a very long string that looks like
Woflan diagnosis of net "d1cf46bd-15a9-4801-9f02-946a8f125eaf" - The net is sound End of Woflan diagnosis
You can indeed use the execute script to resolve this :)
See the script below!
The output is an example set that returns 1 if the model is sound, and 0 otherwise. Furthermore, I like to use some log operators to translate this into a nice table for documentation purposes.
ExampleSet input = operator.getInput(ExampleSet.class);

for (Example example : input) {
    String uglyResult = example["att1"];
    String soundResult = "The net is sound";
    Boolean soundnessCheck = uglyResult.toLowerCase().contains(soundResult.toLowerCase());

    if (soundnessCheck) {
        example["att1"] = "1"; // the net is sound :)
    } else {
        example["att1"] = "0"; // the net is not sound!
    }
}
return input;
See also the attached example model I created (screenshot: RapidMiner Setup).

Some way to perform a TrimAfterLast() in FiddlerScript

There is the very helpful Utilities.TrimBeforeLast() function in FiddlerScript. However, I really need to perform a Utilities.TrimAfterLast(StringVar, "}") to remove the extra characters after a JSON object I've captured.
Is there any way I could produce the equivalent result with FiddlerScript?
Thanks
FiddlerScript allows you to use the entire .NET Framework, which offers many different functions for processing strings.
In this case, you'd probably want to use something like:
var sMyString = "whatever}junk";
var iX = sMyString.LastIndexOf('}');
sMyString = sMyString.Substring(0, iX + 1); // keep everything up to and including the last '}'
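If you want it packaged like the built-in helpers, here is a minimal sketch of a reusable function for your FiddlerScript (TrimAfterLast is a hypothetical name, not part of Fiddler's Utilities class):
// Hypothetical helper: returns the input unchanged if the delimiter is absent.
static function TrimAfterLast(sInput: String, sDelim: String): String {
    var iX = sInput.LastIndexOf(sDelim);
    return (iX < 0) ? sInput : sInput.Substring(0, iX + sDelim.Length);
}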

How do I chain Python generator expressions with filtering?

I am trying to chain generators so I can process a large CSV file as a stream of lines rather than doing each operation in a batch.
This way, I can delay each iteration of each step, avoiding loading the entire data set into memory.
Generator expressions work fine except when I try to filter the output with an if clause.
This works, and only one iteration is consumed at a time:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)
Then I can get a line at a time by calling next() on the resulting iterable. However, if I do this:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable if parse_i(i)[0] in [1,2,3])
Then the iterator keeps running until it is exhausted. Why? What's the workaround?
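One common way to express this kind of pipeline, as a sketch (assuming parse_i is the parsing function from the question), is to chain a second generator expression so that parse_i runs only once per line and the whole chain stays lazy:
file_iterable = open("myfile.csv")
parsed_csv_iterable = (parse_i(i) for i in file_iterable)
filtered_csv_iterable = (row for row in parsed_csv_iterable if row[0] in (1, 2, 3))
Each call to next(filtered_csv_iterable) then pulls lines only until one passes the filter, instead of parsing every line twice as in the single-expression version.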

How can I set an expression to the FileSpec property on Foreach File enumerator?

I'm trying to create an SSIS package to process files from a directory that contains many years' worth of files. The files are all named numerically, so to save processing everything, I want to pass SSIS a minimum number and only enumerate files whose name (converted to a number) is higher than my minimum.
I've tried letting the ForEach File loop enumerate everything and then exclude files in a Script Task, but when dealing with hundreds of thousands of files, this is way too slow to be suitable.
The FileSpec property lets you specify a file mask to dictate which files you want in the collection, but I can't quite see how to specify an expression to make that work, as it's essentially a string match.
If there's an expression within the component somewhere which basically says Should I Enumerate? - Yes / No, that would be perfect. I've been experimenting with the below expression, but can't find a property to which to apply it.
(DT_I4)REPLACE( SUBSTRING(@[User::ActiveFilePath], FINDSTRING(@[User::ActiveFilePath], "\\", 7) + 1, 100), ".txt", "") > @[User::MinIndexId] ? "True" : "False"
Here is one way you can achieve this: use an Expression Task combined with a Foreach Loop container to match the numerical values of the file names. Here is an example that illustrates how to do this. The sample uses SSIS 2012.
This may not be very efficient but it is one way of doing this.
Let's assume there is a folder with a bunch of files named in the format YYYYMMDD. The folder contains files for the first day of every month since 1921, like 19210101, 19210201, 19210301 ... all the way up to the current month, 20121101. That adds up to 1,103 files.
Let's say the requirement is only to loop through the files that were created since June 1948. That would mean the SSIS package has to loop through only the files greater than 19480601.
On the SSIS package, create the following three parameters. It is better to configure parameters for these because the values are configurable across environments.
ExtensionToMatch - This parameter of String data type will contain the extension that the package has to loop through. It supplies the extension part of the FileSpec variable that will be used on the Foreach Loop container.
FolderToEnumerate - This parameter of String data type will store the folder path that contains the files to loop through.
MinIndexId - This parameter of Int32 data type will contain the minimum numerical value above which the files should be processed.
Create the following four variables that will help us loop through the files.
ActiveFilePath - This variable of String data type will hold the file name as the Foreach Loop container loops through each file in the folder. This variable is used in the expression of another variable. To avoid error, set it to a non-empty value, say 1.
FileCount - This is a dummy variable of Int32 data type that will be used in this sample to illustrate the number of files that the Foreach Loop container loops through.
FileSpec - This variable of String data type will hold the file pattern to loop through. Set the expression of this variable to the value shown below. The expression uses the extension specified in the parameters; if no extension is given, it falls back to *.* to loop through all files.
"*" + (#[$Package::ExtensionToMatch] == "" ? ".*" : #[$Package::ExtensionToMatch])
ProcessThisFile - This variable of Boolean data type will evaluate whether a particular file matches the criteria or not.
Configure the package as shown below. Foreach loop container will loop through all the files matching the pattern specified on the FileSpec variable. An expression specified on the Expression Task will evaluate during runtime and will populate the variable ProcessThisFile. The variable will then be used on the Precedence constraint to determine whether to process the file or not.
The script task within the Foreach loop container will increment the counter of variable FileCount by 1 for each file that successfully matches the expression.
The script task outside the Foreach loop will simply display how many files were looped through by the Foreach loop container.
Configure the Foreach loop container to loop through the folder using the parameter and the files using the variable.
Store the file name in variable ActiveFilePath as the loop passes through each file.
On the Expression Task, set the expression to the following value. The expression will convert the file name without the extension to a number and then check whether it is greater than the number given in the parameter MinIndexId.
@[User::ProcessThisFile] = (DT_BOOL)((DT_I4)(REPLACE(@[User::ActiveFilePath], @[User::FileSpec], "")) > @[$Package::MinIndexId] ? 1 : 0)
Right-click on the Precedence constraint and configure it to use the variable ProcessThisFile on the expression. This tells the package to process the file only if it matches the condition set on the expression task.
@[User::ProcessThisFile]
On the first Script Task, I have the variable User::FileCount set in the ReadWriteVariables and the following C# code within the script task. This increments the counter for each file that successfully matches the condition.
public void Main()
{
    Dts.Variables["User::FileCount"].Value = Convert.ToInt32(Dts.Variables["User::FileCount"].Value) + 1;
    Dts.TaskResult = (int)ScriptResults.Success;
}
On the second Script Task, I have the variable User::FileCount set in the ReadOnlyVariables and the following C# code within the script task. This simply outputs the total number of files that were processed.
public void Main()
{
    MessageBox.Show(String.Format("Total files looped through: {0}", Dts.Variables["User::FileCount"].Value));
    Dts.TaskResult = (int)ScriptResults.Success;
}
When the package is executed with MinIndexId set to 19480601 (excluding this), it outputs the value 773.
When the package is executed with MinIndexId set to 20111201 (excluding this), it outputs the value 11.
Hope that helps.
From investigating how the ForEach loop works in SSIS (with a view to creating my own to solve the issue), it seems that it enumerates the file collection first, before any mask is applied. It's hard to tell exactly what's going on without seeing the underlying code for the ForEach loop, but it appears to work this way, resulting in slow performance when dealing with over 100k files.
While @Siva's solution is fantastically detailed and definitely an improvement over my initial approach, it is essentially the same process, except it uses an Expression Task to test the filename rather than a Script Task (which does seem to offer some improvement).
So, I decided to take a totally different approach and rather than use a file-based ForEach loop, enumerate the collection myself in a Script Task, apply my filtering logic, and then iterate over the remaining results. This is what I did:
In my Script Task, I use the DirectoryInfo.EnumerateFiles method, which is the recommended approach for large file collections: it streams results as they are discovered, rather than waiting for the entire collection to be built before applying any logic.
Here's the code:
public void Main()
{
    // Requires: using System.IO; using System.Collections.Generic;
    string sourceDir = Dts.Variables["SourceDirectory"].Value.ToString();
    int minJobId = (int)Dts.Variables["MinIndexId"].Value;

    // Enumerate the file collection (EnumerateFiles lets us start processing immediately)
    List<string> activeFiles = new List<string>();
    System.Threading.Tasks.Task listTask = System.Threading.Tasks.Task.Factory.StartNew(() =>
    {
        DirectoryInfo dir = new DirectoryInfo(sourceDir);
        foreach (FileInfo f in dir.EnumerateFiles("*.txt"))
        {
            FileInfo file = f;
            string filePath = file.FullName;
            string fileName = filePath.Substring(filePath.LastIndexOf("\\") + 1);
            int jobId = Convert.ToInt32(fileName.Substring(0, fileName.IndexOf(".txt")));

            if (jobId > minJobId)
                activeFiles.Add(filePath);
        }
    });

    // Wait here for completion
    System.Threading.Tasks.Task.WaitAll(new System.Threading.Tasks.Task[] { listTask });

    Dts.Variables["ActiveFilenames"].Value = activeFiles;
    Dts.TaskResult = (int)ScriptResults.Success;
}
So, I enumerate the collection, applying my logic as files are discovered and immediately adding the file path to my list for output. Once complete, I then assign this to an SSIS Object variable named ActiveFilenames which I'll use as the collection for my ForEach loop.
I configured the ForEach loop as a Foreach From Variable Enumerator, which now iterates over a much smaller collection (a post-filtered List<string>, compared to what I can only assume was an unfiltered List<FileInfo> or something similar in SSIS' built-in ForEach File Enumerator).
So the tasks inside my loop can be dedicated to processing the data, since it has already been filtered before hitting the loop. Although it doesn't seem to do much differently from either my initial package or Siva's example, in production (for this particular case, anyway) filtering the collection and enumerating asynchronously provides a massive boost over the built-in ForEach File Enumerator.
I'm going to continue investigating the ForEach loop container and see if I can replicate this logic in a custom component. If I get this working I'll post a link in the comments.
The best you can do is use FileSpec to specify a mask, as you said. You could include at least some specs in it, like files starting with "201" for 2010, 2011 and 2012. Then, in some other task, you could filter out those you don't want to process (for instance, 2010).
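For example (a sketch reusing the ExtensionToMatch parameter from the earlier answer; the "201" prefix is just an illustration for this naming scheme), the expression on FileSpec could be narrowed to:
"201*" + @[$Package::ExtensionToMatch]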