Run multiple Meka (Weka) classifiers - load errors to file / table object - output

For those of you unfamiliar with Meka - it is an extension of Weka for multi-label classifiers. Meka and Weka are VERY similar, however, and so Weka users may be able to answer this question, too.
Basically, I want the results from my runs of various classifiers collected in a table, so I can do model selection quickly (dynamically/automatically) for the various evaluation metrics, without having to hardcode the values from each classifier...
Is there a fool-proof, effective way to run multiple classifier experiments - say using cross validation - and get a table like the below:
Model Hamming_Loss Exact_match Jaccard One_Error Rank_Loss
Binary.Relevance 0.94 0.95 0.03 0.04 0.002
Classifier.Chains 0.91 0.94 0.06 0.04 0.03
Random.k-Labelsets 0.95 0.97 0.01 0.01 0.005
... ... ... ... ...
... ... ... ... ...

Using Java you can manually create an array of different classifiers and iterate over it, saving the relevant output values in a matrix for easy access afterwards. You may even create a new dataset from the results obtained, for dynamic model selection, as you stated. But the key point is, as said, that you have to set up your classifier array manually.
Classifier[] cls = new Classifier[clSize];
cls[0] = new J48(); // or whatever you need
...
// one option: train on the full dataset
cls[0].buildClassifier(dataset);
...
// another option: 10-fold cross-validation via Weka's Evaluation class
Evaluation eval = new Evaluation(dataset);
eval.crossValidateModel(cls[0], dataset, 10, new Random(1));
...
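If you are open to prototyping the same loop outside the JVM, here is a hedged sketch of the idea in Python with scikit-learn (not Meka itself): OneVsRestClassifier stands in for binary relevance, ClassifierChain for classifier chains, the data is synthetic, and the metrics mirror the columns in the table above.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import cross_val_predict
from sklearn.multiclass import OneVsRestClassifier      # stand-in for binary relevance
from sklearn.multioutput import ClassifierChain          # stand-in for classifier chains
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import hamming_loss, accuracy_score, jaccard_score

# Synthetic multi-label data as a stand-in for your own dataset.
X, Y = make_multilabel_classification(n_samples=300, n_labels=3, random_state=1)

models = {
    "Binary.Relevance": OneVsRestClassifier(DecisionTreeClassifier(random_state=1)),
    "Classifier.Chains": ClassifierChain(DecisionTreeClassifier(random_state=1)),
}

# One row per classifier: 10-fold cross-validated predictions, metrics collected into a table.
rows = []
for name, model in models.items():
    Y_pred = cross_val_predict(model, X, Y, cv=10)
    rows.append((name,
                 hamming_loss(Y, Y_pred),
                 accuracy_score(Y, Y_pred),                      # exact match
                 jaccard_score(Y, Y_pred, average="samples")))

print("%-20s %-14s %-13s %-9s" % ("Model", "Hamming_Loss", "Exact_match", "Jaccard"))
for name, hl, em, jc in rows:
    print("%-20s %-14.3f %-13.3f %-9.3f" % (name, hl, em, jc))
The same pattern applies in Meka/Weka: one evaluation per classifier in the array, one row of metrics per run.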
Hope this helps. Comment if you need further support.

Related

QAT output nodes for Quantized Model got the same min max range

Recently, I have been working on quantization-aware training (QAT) in TF 1.x to push a model to the Coral Dev Board. However, after training the model, why are the min/max values of the fake-quantization nodes for my two outputs the same?
Shouldn't they be different when one output's maximum target is 95 and the other's is 2pi?
I have figured out the problem: that part of the model was not actually trained with QAT. This happens when an output node somehow misses fake quantization during training. The -6 and 6 values come from the TF 1.x quantization defaults, as mentioned here.
To overcome the problem, we should add an op that triggers QAT for the output nodes. In my regression case, I added a dummy op, tf.maximum(output, 0), to the model so that the node gets fake-quantized. If your output is strictly between 0 and 1, applying a "sigmoid" activation at the output instead of relu can also solve the problem.
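For reference, a minimal sketch of that workaround, assuming the tf.contrib.quantize rewriter is what inserts the fake-quant nodes (layer sizes, names and the placeholder shape are illustrative):
import tensorflow as tf  # TF 1.x

inputs = tf.placeholder(tf.float32, [None, 64], name="inputs")
hidden = tf.layers.dense(inputs, 32, activation=tf.nn.relu)
raw_output = tf.layers.dense(hidden, 2, activation=None)   # regression head, 2 outputs

# Dummy op so the output node is picked up by QAT instead of
# falling back to the default [-6, 6] fake-quant range.
output = tf.maximum(raw_output, 0.0, name="qat_output")

# Rewrite the training graph with fake-quantization ops.
tf.contrib.quantize.create_training_graph(quant_delay=0)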

What is the difference between `kur test` and `kur evaluate`

How exactly do kur test and kur evaluate differ?
These are the differences we see from the console:
(dlnd-tf-lab) ->kur evaluate mnist.yml
Evaluating: 100%|████████████████████████████| 10000/10000 [00:04<00:00, 2417.95samples/s]
LABEL CORRECT TOTAL ACCURACY
0 949 980 96.8%
1 1096 1135 96.6%
2 861 1032 83.4%
3 868 1010 85.9%
4 929 982 94.6%
5 761 892 85.3%
6 849 958 88.6%
7 935 1028 91.0%
8 828 974 85.0%
9 859 1009 85.1%
ALL 8935 10000 89.3%
Focus on one: /Users/Natsume/Downloads/kur/examples
(dlnd-tf-lab) ->kur test mnist.yml
Testing, loss=0.458: 100%|█████████████████████| 3200/3200 [00:01<00:00, 2427.42samples/s]
Without reading the source code behind kur test and kur evaluate, how can we understand how exactly they differ?
@ajsyp, the developer of Kur (a deep learning library), provided the following answer, which I found very helpful.
kur test is used when you know what the "correct answer" is, and you
simply want to see how well your model performs on a held-out sample.
kur evaluate is pure inference: it is for generating results from
your trained model.
Typically in machine learning you split your available data into 3
sets: training, validation, and testing (people sometimes call these
different things, just so you're aware). For a particular model
architecture / selection of model hyperparameters, you train on the
training set, and use the validation set to measure how well the model
performs (is it learning correctly? is it overtraining? etc). But you
usually want to compare many different model hyperparameters: maybe
you tweak the number of layers, or their size, for example.
So how do you select the "best" model? The most naive thing to do is
to pick the model with the lowest validation loss. But then you run
the risk of optimizing/tweaking your model to work well on the
validation set.
So the test set comes into play: you use the test set as a very final,
end of the day, test of how well each of your models is performing.
It's very important to hide that test set for as long as possible,
otherwise you have no impartial way of knowing how good your model is
or how it might compare to other models.
kur test is intended to be used to run a test set through the model
to calculate loss (and run any applicable hooks).
But now let's say you have a trained model, say an image recognition
model, and now you want to actually use it! You get some new data (you
probably don't even have "truth" labels for them, just the raw
images), and you want the model to classify the images. That's what
kur evaluate is for: it takes a trained model and uses it "in
production mode," where you don't have/need truth values.
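As a generic sketch of that workflow (not Kur-specific; scikit-learn and random data are used only for illustration), pick the model on the validation set, check it once on the held-out test set, and then run pure inference on new data:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = np.random.randn(1000, 20), np.random.randint(0, 10, 1000)

# 60% train, 20% validation, 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Compare hyperparameters on the validation set only.
candidates = {c: LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
              for c in (0.01, 0.1, 1.0)}
best = max(candidates.values(), key=lambda m: m.score(X_val, y_val))

# Final, impartial check on the held-out test set (the role of `kur test`).
print("test accuracy:", best.score(X_test, y_test))

# Pure inference on new, unlabeled data (the role of `kur evaluate`).
X_new = np.random.randn(5, 20)
print("predictions:", best.predict(X_new))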

Ignore zero values in dygraphs when standard deviation is above a certain value

I'm using dygraphs to display weather data collected every ten minutes. One of the datapoints is snow depth (in meters). Once in a while the depth is wrong: 0 meters, where the previous and next readings are 0.9 meters. It's winter at the moment, and I've been at the location to verify that 0.9 is correct.
With 47 datapoints at 0.9 m and one at 0 m, the standard deviation is approx. 0.13 (using ministat on FreeBSD).
I've looked through the dygraphs documentation but can't find a way to ignore values like 0 when the standard deviation is above a certain threshold.
This page on dygraphs has three examples of how to deal with standard deviation, but I just want to ignore the 0, not use it with the errorBars or customBars options, and the data is not in fractions.
The option rollPeriod is not applicable since it merely averages x data points.
I fetch the weather data in XML format from a third party every ten minutes in a cron job, parse the values and store them in PostgreSQL. Then another cron job selects the last 48 data points and redirects the data to a csv-file, which dygraphs consumes.
Can I have dygraphs ignore 0 if standard deviation is above a threshold?
I can get the standard deviation with ministat or another utility in the last cron job and remove the 0 from the csv-file using sed/awk, but I would prefer dygraphs to do that.
var g5 = new Dygraph(
  document.getElementById("snow_depth"),
  "snow.csv",
  {
    legend: 'always',
    title: 'Snødjupne',
    labels: ["", "djupne"],
    ylabel: 'Meter'
  }
);
First three lines of the csv-file.
2016-02-23 01:50:00+00,0.91
2016-02-23 02:00:00+00,0
2016-02-23 02:10:00+00,0.9
You should clean your data before you chart it! dygraphs is a charting tool, not a data processing library.
The best approach is to load your data via AJAX, parse it, filter out the zeros and feed it into dygraphs using its native format.
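Alternatively, since the data already passes through a cron job, here is a hedged sketch of that cleanup step in Python (the file names and the 0.1 threshold are assumptions, not part of the existing setup):
import csv
import statistics

# Read the 48 most recent readings written by the cron job.
with open("snow_raw.csv") as f:
    rows = [(ts, float(depth)) for ts, depth in csv.reader(f)]

# Drop the zero readings only when the spread says they are outliers.
depths = [d for _, d in rows]
if statistics.stdev(depths) > 0.1:            # ~0.13 in the example above
    rows = [(ts, d) for ts, d in rows if d != 0.0]

# Write the cleaned file that dygraphs actually consumes.
with open("snow.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)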

Weka Decision Tree

I am trying to use weka to analyze some data. I've got a dataset with 3 variables and 1000+ instances.
The dataset references movie remakes and
how similar they are (0.0-1.0)
the difference in years between the movie and the remake
and lastly if they were made by the same studio (yes or no)
I am trying to make a decision tree to analyze the data. Using J48 (because that's all I have ever used) I only get one leaf. I'm assuming I'm doing something wrong. Any help is appreciated.
Here is a snippet from the data set:
Similarity YearDifference STUDIO TYPE
0.5 36 No
0.5 9 No
0.85 18 No
0.4 10 No
0.5 15 No
0.7 6 No
0.8 11 No
0.8 0 Yes
...
If interested, the data can be downloaded as a CSV here: http://s000.tinyupload.com/?file_id=77863432352576044943
Your data set is not balanced, because there are almost five times more "No" than "Yes" values for the class attribute. That's why the J48 tree is actually just one leaf that classifies everything as "No". You can do one of these things:
resample your data set so you have an equal number of No and Yes instances
try a better classification algorithm, e.g. Random Forest (it's located a few entries below J48 in the Weka Explorer GUI)
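For comparison, here is a hedged sketch of both suggestions outside the Weka GUI, using scikit-learn (the CSV path and column names are assumptions based on the snippet above):
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("remakes.csv")               # Similarity, YearDifference, StudioType

# Downsample "No" so both classes have the same number of instances.
yes = df[df["StudioType"] == "Yes"]
no = df[df["StudioType"] == "No"].sample(n=len(yes), random_state=1)
balanced = pd.concat([yes, no])

X = balanced[["Similarity", "YearDifference"]]
y = balanced["StudioType"]

# Random Forest instead of J48, scored with 10-fold cross-validation.
print(cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=10).mean())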

Multiple regression with lagged time series using libsvm

I'm trying to develop a forecaster for electric consumption, so I want to perform a regression using daily data for an entire year. My dataset has several features. Googling around, I've found that my problem is a multiple regression problem (please correct me if I am mistaken).
What I want to do is train a svm for regression with several independent variables and one dependent variable with n lagged days. Here's a sample of my independent variables, I actually have around 10. (We used PCA to determine which variables had some correlation to our problem)
Day Indep1 Indep2 Indep3
1 1.53 2.33 3.81
2 1.71 2.36 3.76
3 1.83 2.81 3.64
... ... ... ...
363 1.5 2.65 3.25
364 1.46 2.46 3.27
365 1.61 2.72 3.13
And independent variable 1 is actually my dependent variable in the future. So, for example, with p=2 (lagged days) I would expect my SVM to train on the first 2 time steps of all three independent variables.
Indep1 Indep2 Indep3
1.53 2.33 3.81
1.71 2.36 3.76
And the output value of the dependent variable would be "1.83" (Indep variable 1 at time 3).
My main problem is that I don't know how to train properly. What I was doing is just putting all p lagged feature values in an array for my "x" variables, and for my "y" variable I'm using independent variable 1 at time p+1, since I want to predict the next day's power consumption.
Example of training.
x with p = 2 and 3 independent variables y for next day
[1.53, 2.33, 3.81, 1.71, 2.36, 3.76] [1.83]
I tried making x a two-dimensional array, but when you combine several days it becomes a 3D array, which libsvm says it can't accept.
Perhaps I should change from libsvm to another tool or maybe it's just that I'm training incorrectly.
Thanks for your help,
Aldo.
Let me answer using Python / NumPy notation.
Assume the original time series data matrix with columns (Indep1, Indep2, Indep3, ...) is a numpy array data with shape (n_samples, n_variables). Let's generate it randomly for this example:
>>> import numpy as np
>>> n_samples, n_variables = 100, 5
>>> data = np.random.randn(n_samples, n_variables)
>>> data.shape
(100, 5)
If you want to use a window size of 2 time-steps, then the training set can be built as follows:
>>> targets = data[2:, 0] # shape is (n_samples - 2,)
>>> targets.shape
(98,)
>>> features = np.hstack([data[0:-2, :], data[1:-1, :]]) # shape is (n_samples - 2, n_variables * 2)
>>> features.shape
(98, 10)
Now you have your 2D input array plus 1D targets that you can feed to libsvm or scikit-learn.
Edit: it might very well be that extracting more time-series-oriented features, such as a moving average, moving min, moving max, moving differences (time-based derivatives of the signal) or an STFT, would help your SVM model make better predictions.
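As a hedged follow-up, this is how those arrays could be fed to scikit-learn's SVR, which wraps libsvm (kernel and parameters are illustrative):
>>> from sklearn.svm import SVR
>>> model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(features, targets)
>>> # forecast the next (unseen) day from the last two rows of the series
>>> next_x = np.hstack([data[-2, :], data[-1, :]]).reshape(1, -1)
>>> model.predict(next_x).shape
(1,)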