I am trying to use Weka to analyze some data. I've got a dataset with 3 variables and 1000+ instances.
The dataset describes movie remakes and records:
how similar the original and the remake are (0.0-1.0),
the difference in years between the original and the remake,
and whether they were made by the same studio (yes or no).
I am trying to make a decision tree to analyze the data. Using J48 (because that's all I have ever used) I only get one leaf. I'm assuming I'm doing something wrong. Any help is appreciated.
Here is a snippet from the data set:
Similarity YearDifference STUDIO TYPE
0.5 36 No
0.5 9 No
0.85 18 No
0.4 10 No
0.5 15 No
0.7 6 No
0.8 11 No
0.8 0 Yes
...
If interested, the data can be downloaded as a CSV here: http://s000.tinyupload.com/?file_id=77863432352576044943
Your data set is not balanced: there are almost 5 times more "No" than "Yes" instances for the class attribute. That's why the J48 tree is actually just one leaf that classifies everything as "No". You can do one of these things:
Sample your data set so you have an equal number of "No" and "Yes" instances (a sketch of this follows below).
Try a better classification algorithm, e.g. Random Forest (it's located a few entries below J48 in the Weka Explorer GUI).
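For the first option, here is a minimal sketch of downsampling the majority class in Python with pandas before loading the data into Weka; the file name is a placeholder and the column name is assumed from the snippet above. (Inside Weka itself, the supervised SpreadSubsample instance filter achieves something similar.)

import pandas as pd

# Assumed file and column names; adjust to your actual CSV.
df = pd.read_csv("remakes.csv")

yes = df[df["STUDIO TYPE"] == "Yes"]
no = df[df["STUDIO TYPE"] == "No"]

# Downsample "No" so both classes have the same size, then save for Weka.
balanced = pd.concat([yes, no.sample(n=len(yes), random_state=42)])
balanced.to_csv("remakes_balanced.csv", index=False)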
I'm trying to develop a model to recognize new gestures with the Myo armband. (It's an armband that has 8 electrical sensors and can recognize 5 hand gestures.) I'd like to record the sensors' raw data for a new gesture and feed it to a model so it can recognize it.
I'm new to machine/deep learning and I'm using CNTK. I'm wondering what would be the best way to do it.
I'm struggling to understand how to create the trainer. I'm thinking about using 20 sets of these 8 values (each between -127 and 127), so one label is the output of 20 sets of values.
I don't really know how to do that. I've seen tutorials where images are linked with their labels, but it's not the same idea. And even after the training is done, how can I keep the model from recognizing this one gesture no matter what I do, since it's the only one it has been trained on?
An easy way to get you started would be to create 161 columns (8 columns for each of the 20 time steps + the designated label). You would arrange the columns like this:
emg1_t01, emg2_t01, emg3_t01, ..., emg8_t20, gesture_id
This will give you the right 2D format to use different algorithms in sklearn as well as a feed-forward neural network in CNTK. You would use the first 160 columns to predict the 161st one.
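As a minimal sketch (the array names and sample count are assumptions for illustration), the flattening could look like this in Python with NumPy:

import numpy as np

n_samples = 100                                         # assumed
raw = np.random.randint(-127, 128, (n_samples, 20, 8))  # stand-in for real EMG recordings
labels = np.random.randint(0, 5, (n_samples, 1))        # stand-in gesture ids

# Row-major reshape yields emg1_t01 ... emg8_t01, emg1_t02 ... emg8_t20.
features = raw.reshape(n_samples, 160)
table = np.hstack([features, labels])                   # 161 columns, label last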
Once you have that working you can model your data to better represent the natural time series order it contains. You would move away from a 2D shape and instead create a 3D array to represent your data.
The first axis shows the number of samples
The second axis shows the number of time steps (20)
The third axis shows the number of sensors (8)
With this shape you're all set to use a 1D convolutional model (CNN) in CNTK that traverses the time axis to learn local patterns from one step to the next.
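Continuing the sketch above, rebuilding that 3D layout from the flat feature matrix is a one-liner:

# Rebuild the (samples, time steps, sensors) array from the flat features.
sequences = features.reshape(n_samples, 20, 8)
assert sequences.shape == (n_samples, 20, 8)  # axis 0: samples, axis 1: time, axis 2: sensors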
You might also want to look into RNNs which are often used to work with time series data. However, RNNs are sometimes hard to train and a recent paper suggests that CNNs should be the natural starting point to work with sequence data.
How exactly do kur test and kur evaluate differ?
These are the differences we see from the console:
(dlnd-tf-lab) ->kur evaluate mnist.yml
Evaluating: 100%|████████████████████████████| 10000/10000 [00:04<00:00, 2417.95samples/s]
LABEL CORRECT TOTAL ACCURACY
0 949 980 96.8%
1 1096 1135 96.6%
2 861 1032 83.4%
3 868 1010 85.9%
4 929 982 94.6%
5 761 892 85.3%
6 849 958 88.6%
7 935 1028 91.0%
8 828 974 85.0%
9 859 1009 85.1%
ALL 8935 10000 89.3%
Focusing on one example: /Users/Natsume/Downloads/kur/examples
(dlnd-tf-lab) ->kur test mnist.yml
Testing, loss=0.458: 100%|█████████████████████| 3200/3200 [00:01<00:00, 2427.42samples/s]
Without understanding the source code behind kur test and kur evaluate, how can we understand how exactly they differ?
@ajsyp, the developer of Kur (a deep learning library), provided the following answer, which I found very helpful.
kur test is used when you know what the "correct answer" is, and you
simply want to see how well your model performs on a held-out sample.
kur evaluate is pure inference: it is for generating results from
your trained model.
Typically in machine learning you split your available data into 3
sets: training, validation, and testing (people sometimes call these
different things, just so you're aware). For a particular model
architecture / selection of model hyperparameters, you train on the
training set, and use the validation set to measure how well the model
performs (is it learning correctly? is it overtraining? etc). But you
usually want to compare many different model hyperparameters: maybe
you tweak the number of layers, or their size, for example.
So how do you select the "best" model? The most naive thing to do is
to pick the model with the lowest validation loss. But then you run
the risk of optimizing/tweaking your model to work well on the
validation set.
So the test set comes into play: you use the test set as a very final,
end of the day, test of how well each of your models is performing.
It's very important to hide that test set for as long as possible,
otherwise you have no impartial way of knowing how good your model is
or how it might compare to other models.
kur test is intended to be used to run a test set through the model
to calculate loss (and run any applicable hooks).
But now let's say you have a trained model, say an image recognition
model, and now you want to actually use it! You get some new data (you
probably don't even have "truth" labels for them, just the raw
images), and you want the model to classify the images. That's what
kur evaluate is for: it takes a trained model and uses it "in
production mode," where you don't have/need truth values.
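To make the three-way split concrete, here is a minimal sketch using scikit-learn (not part of Kur; the data and split ratios are purely illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)           # stand-in features
y = np.random.randint(0, 2, 1000)      # stand-in labels

# Carve off the held-out test set first, then split the remainder
# into training and validation (roughly 60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)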
This question seems pretty stupid, but I actually fail to find a simple solution to it. I have a CSV file that is structured like this:
0 21 34.00 34.00
1 23 35.00 25.00
2 25 45.00 65.00
The first column is the node's id, the second is an unimportant attribute. The 3rd and 4th attributes are supposed to be the x and y positions of the nodes.
I can import the file into the Data Laboratory without problems, but I can't get Gephi to use the x and y attributes as the corresponding properties. All I want to achieve is that Gephi sets the x property to the value of the x attribute (and y respectively).
Thanks for your help!
In the Layout window, you can select "Geo Layout" and define which columns are used as latitude and longitude.
The projection might come out weird if you do not actually have geographic data, but for me this is fine.
In Gephi 0.8 there was a plugin called Recast Column. This plugin has unfortunately not been ported to Gephi 0.9 yet, but it allowed you to set standard (hidden) columns in the node table from visible values in the nodes table. Thus, if you have two columns of type Float or Decimal that represent your coordinates, you could set the coordinate values of your nodes with it.
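As another workaround, you could bake the positions into a GEXF file before opening it in Gephi. Here is a sketch with the Python networkx library (file name and space-separated layout assumed from the snippet above); networkx serializes the viz attribute as the viz:position element that Gephi reads:

import csv
import networkx as nx

G = nx.Graph()
with open("nodes.csv") as f:
    for node_id, _attr, x, y in csv.reader(f, delimiter=" "):
        # Store the coordinates as Gephi visualization attributes.
        G.add_node(node_id, viz={"position": {"x": float(x), "y": float(y), "z": 0.0}})

nx.write_gexf(G, "nodes_with_positions.gexf")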
I want to import CSV files with about 40 million lines into neo4j. For this I am trying to use the "batch importer" from https://github.com/jexp/batch-import.
Maybe it's a problem that I provide my own IDs. This is the example:
nodes.csv:
i:id l:label
315041100 Person
201215100 Person
315041200 Person
rels.csv:
start end type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
The content of batch.properties:
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=1000M
neostore.relationshipstore.db.mapped_memory=5000M
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2000M
neostore.propertystore.db.arrays.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=1500M
neostore.propertystore.db.index.mapped_memory=1500M
batch_import.node_index.node_auto_index=exact
./import.sh graph.db nodes.csv rels.csv
runs without errors, but it takes about 60 seconds:
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 54 seconds
When I use smaller IDs - for example 3150411 instead of 315041100 - it takes just 1 second:
Importing 3 Nodes took 0 seconds
Importing 2 Relationships took 0 seconds
Total import time: 1 seconds
Actually I would like to use even bigger IDs with 10 digits. I don't know what I'm doing wrong. Can anyone see an error?
JDK 1.7
batchimporter 2.1.3 (with neo4j 2.1.3)
OS: ubuntu 14.04
Hardware: 8-Core-Intel-CPU, 16GB RAM
I think the problem is that the batch importer is interpreting those IDs as actual physical record IDs on disk. So the time is spent in the file system, inflating the store files up to the size where they can fit those high IDs.
The IDs that you're giving are intended to be "internal" to the batch import, right? Although I'm not sure how to tell the batch importer that this is the case.
@michael-hunger, any input there?
The problem is that those IDs are internal to Neo4j, where they represent disk record IDs. If you provide high values there, Neo4j will create a lot of empty records until it reaches your IDs.
So either you create your node IDs starting from 0 and store your real ID as a normal node property, or you don't provide node IDs at all and only look up nodes via their "business-id-value".
For the first option, your files would look like this:
i:id id:long l:label
0 315041100 Person
1 201215100 Person
2 315041200 Person
start:id end:id type relart
0 1 HAS_RELATION 30006
2 0 HAS_RELATION 30006
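A small Python sketch of that remapping (the file names are hypothetical, and the files are assumed to be tab-separated, which is what the batch importer expects):

import csv

mapping = {}  # business ID -> 0-based internal node ID

with open("nodes.csv") as src, open("nodes_remapped.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the original header
    writer.writerow(["i:id", "id:long", "l:label"])
    for internal_id, (business_id, label) in enumerate(reader):
        mapping[business_id] = internal_id
        writer.writerow([internal_id, business_id, label])

with open("rels.csv") as src, open("rels_remapped.csv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    next(reader)  # skip the original header
    writer.writerow(["start:id", "end:id", "type", "relart"])
    for start, end, rel_type, relart in reader:
        writer.writerow([mapping[start], mapping[end], rel_type, relart])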
For the second option, you have to configure and use an index:
id:long:people l:label
315041100 Person
201215100 Person
315041200 Person
id:long:people id:long:people type relart
315041100 201215100 HAS_RELATION 30006
315041200 315041100 HAS_RELATION 30006
HTH Michael
Alternatively, you can also just write a small Java or Groovy program to import your data if handling those IDs with the batch importer is too tricky.
See: http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
Ideally I could specify something like 10 as my input (in ounces) and get back a string like this: "1 & 1/4 cups". Is there a library that can do something like this? (note: I am totally fine with the rounding implicit in something like this).
Note: I would prefer a C library, but I am OK with solutions for nearly any language as I can probably find appropriate bindings.
It is really two things: 1) the data encompassing the conversion, and 2) the presentation of the conversion.
The second is user choice: if you want fractions, you need to write or get a fractions library. There are many.
The first is fairly easy. The vast majority of conversions are just a factor. Usually you will organize known factors as conversions into the appropriate SI unit for that type of quantity (volume, length, area, density, etc.).
Your data then looks something like this:
A acres 4.046870000000000E+03 6
A ares 1.000000000000000E+02 15
A barns 1.000000000000000E-28 15
A centiares 1.000000000000000E+00 15
A darcys 9.869230000000000E-13 6
A doors 9.290340000000000E+24 6
A ferrados 7.168458781362010E-01 6
A hectares 1.000000000000000E+04 15
A labors 7.168625518000000E+05 6
A Rhode Island 3.144260000000000E+09 4
A sections 2.590000000000000E+06 6
A sheds 1.000000000000000E-48 15
A square centimeters 1.000000000000000E-04 15
A square chains (Gunter's or surveyor's) 4.046860000000000E+02 6
A square chains (Ramsden's) 9.290304000000000E+02 5
A square feet 9.290340000000000E-02 6
A square inches 6.451600000000000E-04 15
A square kilometers 1.000000000000000E+06 15
A square links (Gunter's or surveyor's) 4.046900000000000E-02 5
A square meters (SI) 1.000000000000000E+00 15
A square miles (statute) 2.590000000000000E+06 7
A square millimeter 1.000000000000000E-06 15
A square mils 6.451610000000000E-10 5
A square perches 2.529300000000000E+01 5
A square poles 2.529300000000000E+01 5
A square rods 2.529300000000000E+01 5
A square yards 8.361270000000000E-01 6
A townships 9.324009324009320E+07 5
In each case, these are area conversions into the SI unit for area -- square meters. Then make a second conversion from the SI unit into the desired target unit. The third number is the count of significant digits.
Keep a file of these factors and then you can convert from any area to any area that you have data on. Repeat for other categories of conversion (volume, power, length, weight, etc.).
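A minimal sketch of that two-hop lookup in Python (factors taken from the table above, with square meters as the SI base):

# Convert source unit -> SI (square meters) -> target unit.
TO_SQUARE_METERS = {
    "acres": 4.04687e3,
    "hectares": 1.0e4,
    "square feet": 9.29034e-2,
    "square miles (statute)": 2.59e6,
}

def convert_area(value, from_unit, to_unit):
    si_value = value * TO_SQUARE_METERS[from_unit]  # hop 1: into SI
    return si_value / TO_SQUARE_METERS[to_unit]     # hop 2: out of SI

print(convert_area(10, "acres", "hectares"))  # ~4.05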
My thought was to use Google Calculator for this task if you want generic conversions...
Example: http://www.google.com/ig/calculator?q=10%20ounces%20to%20cups -- returns JSON, but I believe you can specify format.
Here's a Java example for currency conversion:
http://blog.caplin.com/2011/01/06/simple-currency-conversion-using-google-calculator-and-java/
Well, for a quick and dirty solution you could always have your program run GNU Units as an external program. If your software is GPL-compatible you can even lift the code from Units and use it in your program.
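A sketch of that external-program approach in Python (assumes the units binary is installed and on the PATH):

import subprocess

# --terse makes GNU Units print only the numeric result.
result = subprocess.run(
    ["units", "--terse", "10 floz", "cups"],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # e.g. "1.25"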
Please check out JSR 363, the Units of Measurement Standard for Java: http://unitsofmeasurement.github.io/
At least in C++ you get basic support via "value types" already, but you still have to implement those conversions yourself or find a suitable library similar to what JSR 363 offers for Java.