mlr: what exactly does the function getBMRTuneResults() return?

I'm doing nested resampling with a 4x3 setup (4-fold cross-validation in the outer loop, and 3-fold cross-validation in the inner loop). For now, I only use Support Vector Machines (ksvm from kernlab). In the inner loop, I'm looking for the optimal tuning parameters C and sigma.
Calling getBMRPerformances() then gives me the performances on the 4 individual outer test data sets. The function getBMRTuneResults() also outputs 4 values for the measure I am using (in my case Cohen's kappa), but they differ from the output of getBMRPerformances(), and I don't understand what this second output actually is.

As the function name indicates, it outputs the results from tuning. So the performance values correspond to the performances calculated during the tuning (inner loop).
The four values here in particular are the performances reached by the best-performing hyperparameter setting in each fold of your outer loop.

Related

What will the inner tuning learner return in nested cv in MLR?

In mlr there is a method to implement nested cross-validation. In nested CV, the inner loop is used to select the best tuning parameters and the outer loop is used to evaluate model performance. When I combine nested CV with a feature selection process, I'm a bit confused about what mlr will return for the inner best tuned model. For example, I want to first apply a filter based on the correlation p-value with the outcome (< 0.05). In nested CV (i.e. training, validation and test sets), it should be:
In the inner loop, for each training set, apply the filter, then tune the parameters we're interested in and test on the validation set. From the inner loop, we can get the best tuning parameters and the feature set associated with them.
What I'm wondering is what the inner best tuned model will hand over to the outer-loop training. I assume there are two possible models:
1) The inner best tuned model returns just the best tuned parameters, not the selected feature subset. So in the outer loop, we first apply the same filter, then train on the training+validation set with the best tuned parameters.
2) The inner best tuned model returns the best tuned parameters and the selected feature subset. So in the outer loop, we just train on the training+validation set with the best tuned parameters and the selected feature subset (from the inner loop).
In my opinion, I think the first one is more logical. Part of my code is as below:
svm_learner <- makeLearner("classif.svm", predict.type = "prob", fix.factors.prediction = TRUE)
svm_filter <- makeFilterWrapper(learner = svm_learner,
                                fw.method = "t.test.filter", fw.threshold = -0.05)
svm_filter_nested <- makeTuneWrapper(svm_filter, par.set = ps,
                                     control = ctrl, resampling = inner)
r <- resample(svm_filter_nested, task, resampling = outer, models = TRUE)
Option 2) is correct.
Hyperparameters are optimized for the chosen feature subset. It would not make sense to do so if you were to rerun the filter process again in the outer loop.
Nothing more than train/predict happens in each outer fold, with the parameters coming from the inner optimization loop. No optimization is going on in the outer loop.
PS: You might want to ask such general questions on https://stats.stackexchange.com/ rather than on Stack Overflow, since they relate to general (statistical) concepts rather than programming.
People will vote to close such questions since they lack a relation to programming. (Note, though, that no one on the mlr team is watching stats.stackexchange questions.)

Making sense of soundMixer.computeSpectrum

All examples that I can find on the Internet just visualize the result array of the function computeSpectrum, but I am tasked with something else.
I generate a musical note, and by analyzing the result array I need to be able to say which note is playing. I figured out that I need to set the second parameter of the function call, FFTMode, to true, and then it returns sound frequencies. I thought it should really return only one non-zero value, which I could use to determine which note I generated using the Math.sin function, but that is not the case.
Can somebody suggest a way to accomplish this task? Using soundMixer.computeSpectrum is a requirement because I am going to analyze more complex sounds later.
An FFT will transform your signal window into a set of sine waves at the FFT's discrete bin frequencies, so unless 440 Hz is exactly one of them you will obtain more than just one non-zero value! For a single sine wave you would obtain 2 frequencies due to aliasing. Here is an example (plot omitted): for a frequency that falls exactly on a bin, the FFT response is a single peak, but for nearby frequencies there are more peaks.
Due to the shape of the signal you can obtain a continuous spectrum with peaks instead of discrete values.
The frequency of the i-th bin is f(i) = i*samplerate/N, where i = 0, 1, 2, ..., (N/2)-1 is the bin index (the first bin is the DC offset, so i = 0 is not a frequency) and N is the number of samples passed to the FFT.
So if you want to detect harmonics (multiples of a single fundamental frequency), set the samplerate and N so that samplerate/N is that fundamental frequency or a divisor of it. That way you obtain just one peak per harmonic sine wave, which eases the computations.
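The question is about ActionScript's computeSpectrum, but the bin arithmetic above is easy to demonstrate in Python/NumPy; the 440 Hz tone, sample rate and FFT length below are my own assumptions for illustration:

import numpy as np

samplerate = 44100                          # samples per second
N = 4410                                    # FFT length, so the bin width is samplerate/N = 10 Hz
t = np.arange(N) / samplerate
signal = np.sin(2 * np.pi * 440.0 * t)      # a pure 440 Hz tone (A4)

spectrum = np.abs(np.fft.rfft(signal))      # magnitudes for bins 0 .. N/2
peak_bin = np.argmax(spectrum[1:]) + 1      # skip bin 0, which is the DC offset
print(peak_bin * samplerate / N)            # f(i) = i*samplerate/N -> 440.0

Because 440 Hz is an exact multiple of samplerate/N here, the energy lands in a single bin; shift the tone to, say, 445 Hz and it smears over neighbouring bins, which is exactly the leakage described above.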

octave: efficiency of using trace()?

Suppose I have two 5000 x 1000 matrices, A and B. Will octave compute trace(A*B') efficiently, i.e. in a way that only requires 5000 inner products as opposed to 5000*5000 inner products most of which will not be used?
And, what if the argument to trace is more complicated, i.e.: trace(A*B' + C*D')? Does that change anything?
trace(A*B') will compute the complete matrix product before using trace().
A more efficient approach would be sum(sum(A.*conj(B),2)). The inner sum computes the diagonal of the resulting matrix.
A probably even more efficient approach would be doing both sums in one step via sum((A.*conj(B))(:)).
trace(A*B' + C*D') would be computed efficiently by sum((A.*conj(B) + C.*conj(D))(:)).
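Not Octave, but the identity this answer relies on is easy to sanity-check in NumPy (real matrices, so the conj() is a no-op): trace(A*B') equals the sum of the elementwise product, so the full 5000x5000 product is never needed.

import numpy as np

A = np.random.rand(5000, 1000)
B = np.random.rand(5000, 1000)

slow = np.trace(A @ B.T)        # slow: forms the full 5000x5000 product first
fast = np.sum(A * B)            # equivalent to summing just the diagonal of A @ B.T
print(np.allclose(slow, fast))  # True, and the second version is far cheaper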
No, the product will be evaluated before the call to trace(). One efficient implementation would be to manually compute only the diagonal terms of that matrix multiply and then sum them: sum(sum(A .* conj(B), 2)); for your second example, sum(sum(A .* conj(B) + C .* conj(D), 2)).
Note that you can shorten both expressions slightly, and possibly gain a bit of speed at the cost of readability and Matlab compatibility, like so: sum((A .* conj(B))(:)).

Efficient set intersection - decide whether the intersection is larger than k

I am faced with a problem where I have to calculate intersections between all pairs in a collection of sets. None of the sets are smaller than a small constant k, and I'm only interested in whether two sets have an intersection larger than k-1 elements or not. I do not need the actual intersections nor the exact size, only whether it's larger than k-1 or not. Is there some clever pre-processing trick or a neat set intersection algorithm that I could use to speed things up?
More info that can be useful to answer the question:
The sets represent maximal cliques in a large, undirected, sparse graph. The number of sets can be in the order of tens of thousands or more, but most of the sets are likely to be small.
The sets are already sorted: the members of each set are in increasing order. Effectively they are sorted lists - I receive them this way from an underlying library for maximal clique search.
Nothing is known about the distribution of elements in the sets (i.e. whether they are in tight clumps or not).
Most of the set intersections are likely to be empty, so the ideal solution would be a clever data structure that helps me cut down the number of set intersections I have to make.
Consider a mapping with all sets of size k as the keys and corresponding values of lists of all sets from your collection that contain the key as a subset. Given this mapping, you don't need to perform any intersection tests: for each key, all pairs of sets from the list will have an intersection of size at least k. This approach can produce the same pair of sets more than once, so that will need to be checked.
The mapping is easy enough to calculate. For each set in the collection, calculate all the size-k subsets and append the original set to the list for that key set. But is this actually faster? In general, no. The performance of this approach will depend on the distribution of the sizes of the sets in the collection and the value of k. With d distinct elements in the sets, you could have as many as d choose k keys, which can be very large.
However, the basic idea is usable to reduce the number of intersections. Instead of using sets of size k, use smaller ones of fixed size q as the keys. The values are again lists of all sets that have the key as a subset. Now, test each pair of sets from the list for intersection. Thus, with q=1 you only test those pairs of sets that have at least one element in common, with q=2 you only test those pairs of sets that have at least two elements in common, and so on. The optimal value for q will depend on the distribution of sizes of the sets, I think.
For the sets in question, a good choice might be q=2. The keys are then just the edges of the graph, giving a predictable size to the mapping. Since most sets are expected to be disjoint, q=2 should eliminate a lot of comparisons without much additional overhead.
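A minimal sketch of the q = 2 candidate generation in Python (the function name and the set-of-pairs representation are my own, not from the answer); only pairs of sets that share at least one key are ever tested further:

from collections import defaultdict
from itertools import combinations

def candidate_pairs(sets, q=2):
    # sets: list of sorted lists; returns index pairs that share a size-q subset
    by_key = defaultdict(list)
    for pos, s in enumerate(sets):
        for key in combinations(s, q):          # all size-q subsets of this set
            by_key[key].append(pos)
    pairs = set()
    for members in by_key.values():
        pairs.update(combinations(members, 2))  # deduplicates repeated pairs
    return pairs

Each candidate pair still needs a real test for >= k common elements, but with q = 2 the keys are just the graph's edges, so disjoint cliques never meet.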
One possible optimization, which is more effective the smaller the range of values contained in each set:
Create a list of all the sets, sorted by their kth-greatest element (this is easy to find, since you already have each set with its elements in order). Call this list L.
For any two sets A and B, their intersection cannot have as many as k elements in it if the kth-greatest element in A is less than the least element in B.
So, for each set in turn, calculate its intersection only with the sets in the relevant part of L.
You can use the same fact to exit early from computing the intersection of any two sets - if there are only n-1 elements left to compare in one of the sets, and the intersection so far contains at most k-n elements, then stop. The above procedure is simply this rule applied to all the sets in L at once, with n=k, at the point where we're looking at the least element of set B and the kth-greatest element of A.
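Here is a sketch of the early-exit test in Python (my own illustration, assuming each set is a sorted list); the pre-sorting by kth-greatest element then only decides which pairs this function is called on:

def shares_k(a, b, k):
    # a, b: sorted lists; return True iff they have at least k common elements
    i = j = common = 0
    while i < len(a) and j < len(b):
        # early exit: even if every remaining element matched, k is out of reach
        if common + min(len(a) - i, len(b) - j) < k:
            return False
        if a[i] == b[j]:
            common += 1
            if common == k:
                return True
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return False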
The following strategy should be quite efficient. I've used variations of this for intersecting ascending sequences on a number of occasions.
First I assume that you have some sort of priority queue available (if not, rolling your own heap is pretty easy). And a fast key/value lookup (btree, hash, whatever).
With that said, here is the algorithm in Python; it assumes sets is a list of sorted lists and k is your threshold, and it should do what you want quite efficiently.
import heapq
from collections import defaultdict
from itertools import combinations

# Initial setup: sets is the list of all sets (each a sorted list)
intersection_count = defaultdict(int)   # keys are (set_pos, set_pos) pairs, values are counts
# priority queue of (current element, index within its set, set_pos), ordered by element
p_queue = [(s[0], 0, set_pos) for set_pos, s in enumerate(sets) if s]
heapq.heapify(p_queue)

# helper function
def process_intersections(current_sets):
    # every pair of sets containing the element just swept shares one more element
    for pair in combinations(sorted(current_sets), 2):
        intersection_count[pair] += 1

# Find all intersections by sweeping the elements in increasing order
current_sets = []
last_element = p_queue[0][0] if p_queue else None
while p_queue:
    element, ind, set_pos = heapq.heappop(p_queue)
    if element != last_element:
        process_intersections(current_sets)
        last_element = element
        current_sets = []
    current_sets.append(set_pos)
    ind += 1
    if ind < len(sets[set_pos]):
        heapq.heappush(p_queue, (sets[set_pos][ind], ind, set_pos))
# Don't forget the last one!
process_intersections(current_sets)

final_answer = []
for pair, count in intersection_count.items():
    if k - 1 < count:
        final_answer.append(pair)
The running time will be O(sum(sizes of sets) * log(number of sets) + count(times a point is in a pair of sets)). In particular, note that if two sets have no intersection, you never try to intersect them.
What if you used a predictive subset as a prequalifier? Pre-sort, but use a subset intersection as a threshold condition: if the subset intersection is > n%, then complete the intersection, otherwise abandon it. n then becomes the inverse of your comfort level with the prospect of a false positive.
You could also sort by the subset intersections (m) calculated earlier and begin running the full intersections ordered by m, descending. Presumably the majority of your highest-m intersections would cross your k threshold on the full set, and the probability of hitting your k threshold would continually decrease.
This really starts to treat the problem as if it were NP-complete.

Determining edge weights given a list of walks in a graph

These questions regard a set of data with lists of tasks performed in succession and the total time required to complete them. I've been wondering whether it would be possible to determine useful things about the tasks' lengths, either as they are or with some initial guesstimation based on appropriate domain knowledge. I've come to think graph theory would be the way to approach this problem in the abstract, and have a decent basic grasp of the stuff, but I'm unable to know for certain whether I'm on the right track. Furthermore, I think it's a pretty interesting question to crack. So here we go:
Is it possible to determine the weights of edges in a directed weighted graph, given a list of walks in that graph with the lengths (summed weights) of said walks? I recognize the amount and quality of permutations on the routes taken by the walks will dictate the quality of any possible answer, but let's assume all possible walks and their lengths are given. If a definite answer isn't possible, what kind of things can be concluded about the graph? How would you arrive at those conclusions?
What if there were several similar walks with possibly differing lengths given? Can you calculate a decent average (or other illustrative measure) for each edge, given enough permutations on different routes to take? How will discounting some permutations from the available data set affect the calculation's accuracy?
Finally, what if you had a set of initial guesses as to the weights and had to refine those using the walks given? Would that improve upon your guesstimation ability, and how could you apply the extra information?
EDIT: Clarification on the difficulties of a plain linear algebraic approach. Consider the following set of walks:
a = 5
b = 4
b + c = 5
a + b + c = 8
A matrix equation with these values is unsolvable, but we'd still like to estimate the terms. There might be some helpful initial data available, such as in scenario 3, and in any case we can apply knowledge of the real world - such as that the length of a task can't be negative. I'd like to know if you have ideas on how to ensure we get reasonable estimates and that we also know what we don't know - e.g. when there's not enough data to tell a from b.
Seems like an application of linear algebra.
You have a set of linear equations which you need to solve. The variables being the lengths of the tasks (or edge weights).
For instance if the tasks lengths were t1, t2, t3 for 3 tasks.
And you are given
t1 + t2 = 2 (task 1 and 2 take 2 hours)
t1 + t2 + t3 = 7 (all 3 tasks take 7 hours)
t2 + t3 = 6 (tasks 2 and 3 take 6 hours)
Solving gives t1 = 1, t2 = 1, t3 = 5.
You can use any linear algebra technique (e.g. Gaussian elimination, http://en.wikipedia.org/wiki/Gaussian_elimination) to solve these, which will tell you whether there is a unique solution, no solution, or an infinite number of solutions (there are no other possibilities).
If you find that the linear equations do not have a solution, you can try adding a very small random number to some of the task weights/coefficients of the matrix and try solving it again. (I believe this falls under perturbation theory.) Matrices are notorious for radically changing behavior with small changes in the values, so this will likely give you an approximate answer reasonably quickly.
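As a sketch of the no-exact-solution case (my own addition, using the question's a, b, c walks), a least-squares solve also gives a reasonable estimate directly, and the reported rank tells you whether the weights are uniquely determined:

import numpy as np

# rows are walks over the edges (a, b, c): a=5, b=4, b+c=5, a+b+c=8
A = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 1, 1],
              [1, 1, 1]], dtype=float)
lengths = np.array([5, 4, 5, 8], dtype=float)

weights, residual, rank, _ = np.linalg.lstsq(A, lengths, rcond=None)
print(weights)   # least-squares estimates for a, b, c
print(rank)      # rank < 3 would mean some edge weights are not identifiable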
Or maybe you can try introducing a 'slack' task into each walk (i.e. add more variables) and pick the solution to the new equations where the slack tasks satisfy some linear constraints (like 0 < s_i < 0.0001, minimizing the sum of s_i), using linear programming techniques.
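Here is a sketch of the slack-variable idea with scipy.optimize.linprog (my formulation, not the answer's exact one): instead of capping each slack at 0.0001, it minimizes the total absolute slack, which stays feasible even for inconsistent walks like the question's example, and it keeps all edge weights non-negative.

import numpy as np
from scipy.optimize import linprog

A = np.array([[1, 0, 0],                 # a         = 5
              [0, 1, 0],                 # b         = 4
              [0, 1, 1],                 # b + c     = 5
              [1, 1, 1]], dtype=float)   # a + b + c = 8
lengths = np.array([5, 4, 5, 8], dtype=float)
n_walks, n_edges = A.shape

# variables: [edge weights, one slack per walk]; objective: minimize sum of slacks
c = np.concatenate([np.zeros(n_edges), np.ones(n_walks)])
# |A*w - lengths| <= s, written as two sets of <= constraints
A_ub = np.block([[ A, -np.eye(n_walks)],
                 [-A, -np.eye(n_walks)]])
b_ub = np.concatenate([lengths, -lengths])
bounds = [(0, None)] * (n_edges + n_walks)   # weights and slacks non-negative

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print(res.x[:n_edges])   # estimated edge weights for a, b, c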
Assume you have an unlimited number of arbitrary characters to represent each edge (a, b, c, d, etc.).
w is the list of all the walks, each in the form 0,a,b,c,d,e followed by its length (the 0 will be explained later).
Starting with i = 1: if walk w[i] still contains more than the leading number, solve it for its first symbol - replace that symbol everywhere with the LENGTH of w[i] minus all the other values in w[i] - then move on to the next walk and repeat.
Example:
0,a,b,c,d,e 50
0,a,c,b,e 20
0,c,e 10
So: a is the first symbol. Replace all instances of "a" with 50, -b, -c, -d, -e.
New data:
50, 50
50, -d, 20  (so d = 30)
0,c,e 10
And repeat until only one value is left in each walk, and you are finished! Alternatively, the leading number can simply be subtracted from the length of each walk.
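The same elimination can be done mechanically with a computer algebra system; here is a sketch using SymPy (my choice of tool, not part of the answer) on the three example walks above:

from sympy import symbols, linsolve

a, b, c, d, e = symbols('a b c d e')
walks = [a + b + c + d + e - 50,
         a + c + b + e - 20,
         c + e - 10]
print(linsolve(walks, [a, b, c, d, e]))
# d is pinned to 30; a+b and c+e are only known as sums, so they stay parametric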
I'd forget about graphs and treat lists of tasks as vectors - every task represented as a component with value equal to its cost (time to complete, in this case).
If tasks come in different orders initially, that's where to use domain knowledge to bring them to a canonical form, and to assign multipliers if domain knowledge tells you that the ratio of costs will be substantially influenced by ordering/timing. Timing is implicit in the initial ordering, but you may have to make it a function of time just for the adjustment factors (say, driving at lunch time vs. driving at midnight). The function might be tabular/discrete. In general it's always much easier to evaluate ratios and relative biases (hardness of doing something). You may need a functional language to do repeated rewrites of your vectors until there's nothing more that domain knowledge and rules can change.
With canonical vectors, consider just the presence or absence of a task (just 0|1 for this iteration) and look for minimal diffs - single-task diffs first - as these provide estimates with a small number of variables. Keep doing this recursively, be ready to backtrack, and have a heuristic rule for the goodness or quality of the estimates so far. Keep track of good "rounds" that you backtracked from.
When you reach a minimal irreducible state - you can't make any more diffs, and all vectors have the same remaining tasks - then you can do some basic statistics like variance, mean and median, look for big outliers, and look for ways to improve the initial domain-knowledge-based estimates that led to the canonical form. If you find a lot of them and can infer new rules, take them in and start the whole process from the start.
Yes, this can cost a lot :-)