Test and Training Set are Not Compatible - csv

I have seen various articles about the same issue and tried a lot of solutions, but nothing is working. Kindly advise.
I am getting an error in WEKA:
"Problem Evaluating Classifier: Test and Training Set are Not
Compatible".
I am using J48 as my algorithm.
These are my datasets:
Trainset:
https://www.dropbox.com/s/fm0n1vkwc4yj8yn/train.csv
Evalset:
https://www.dropbox.com/s/2j9jgxnoxr8xjdx/Eval.csv
(I am unable to copy and paste them here because of their length.)
I have tried "Batch Filtering" in WEKA (on the training set) but it still does not work.
EDIT: I have even converted my .csv files to .arff, but the issue is the same.
EDIT2: I have made sure the headers in both CSVs match. Even then, same issue. Please advise!

A common cause of this error when converting ".csv" files to ".arff" with Weka is that the values of a nominal attribute appear in a different order, or are missing entirely, from one dataset to the other.
Your evaluation ".arff" file probably looks like this (skipping irrelevant data):
@relation Eval
@attribute a321 {TRUE}
Your train ".arff" file probably looks like this (skipping irrelevant data):
@relation train
@attribute a321 {FALSE}
However, both should contain all possible values for that attribute, and in the same order:
@attribute a321 {TRUE, FALSE}
You can remedy this by post-processing your ".arff" files in a text editor and changing the headers so that the nominal values appear in the same order (and in the same quantity) from file to file, or by scripting the fix as sketched below.
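With 100+ attributes, hand-editing gets tedious, so a small script can merge the declarations instead. A minimal Python sketch, assuming single-line @attribute declarations with unquoted nominal values (the file names follow the question above):

import re

def nominal_values(arff_path):
    """Map attribute name -> list of nominal values, in declaration order."""
    values = {}
    with open(arff_path) as f:
        for line in f:
            m = re.match(r'@attribute\s+(\S+)\s+\{(.*)\}', line, re.IGNORECASE)
            if m:
                values[m.group(1)] = [v.strip() for v in m.group(2).split(',')]
    return values

def merged_declaration(name, train_vals, eval_vals):
    """Union of both value lists, keeping the training file's order first."""
    merged = train_vals + [v for v in eval_vals if v not in train_vals]
    return '@attribute %s {%s}' % (name, ','.join(merged))

train = nominal_values('train.arff')
evalset = nominal_values('Eval.arff')
for name, train_vals in train.items():
    if name in evalset and train_vals != evalset[name]:
        print(merged_declaration(name, train_vals, evalset[name]))

Paste the printed declarations into both headers so that the two files agree exactly.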

How do I divide a dataset into a training and a test set?
You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer just do the following:
training set:
Load the full dataset
select the RemovePercentage filter in the preprocess panel
set the correct percentage for the split
apply the filter
save the generated data as a new file
test set:
Load the full dataset (or just use undo to revert the changes to the dataset)
select the RemovePercentage filter if not yet selected
set the invertSelection property to true
apply the filter
save the generated data as a new file
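If you'd rather script the split than click through the Explorer, here is a minimal Python sketch that mimics what RemovePercentage does on an ARFF file (an illustration under assumptions, not Weka's actual implementation: it cuts the data section at a fixed percentage without shuffling):

def split_arff(path, percentage, train_path, test_path):
    """Write the first (100 - percentage)% of data rows to train_path
    and the remaining percentage% to test_path; copy the header to both."""
    with open(path) as f:
        lines = f.readlines()
    # Everything up to and including the @data line is the header.
    data_start = next(i for i, l in enumerate(lines)
                      if l.strip().lower() == '@data') + 1
    header, rows = lines[:data_start], lines[data_start:]
    cut = round(len(rows) * (100 - percentage) / 100)
    with open(train_path, 'w') as f:
        f.writelines(header + rows[:cut])
    with open(test_path, 'w') as f:
        f.writelines(header + rows[cut:])

split_arff('full.arff', 30, 'train.arff', 'test.arff')  # 70/30 split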

Related

PsychoPy: how to avoid storing variables in the CSV file?

When I run my PsychoPy experiment, PsychoPy saves a CSV file that contains my trials and the values of my variables.
Among these are some variables that I would like NOT to be included. There are some variables I deliberately chose to include in the CSV, but many others ended up in it automatically.
Is there a way to manually force (from the code block) the exclusion of some variables from the CSV?
Is there a way to decide the order of the saved columns/variables in the CSV?
It is not really important: I know I could just create an output file myself without using PsychoPy's, or easily clean it afterwards, but I was just curious.
PsychoPy spits out all the variables it thinks you could need. If you want to drop some of them, that is a task for the analysis stage, and is easily done in any processing pipeline. Unless you are analysing data in a spreadsheet (which you really shouldn't), the number of columns in the output file shouldn't really be an issue. The philosophy is that you shouldn't back yourself into a corner by discarding data at the recording stage - what about the reviewer who asks about the influence of a variable that you didn't think was important?
If you are using the Builder interface, the saving of onset & offset times for each component is optional, and is controlled in the "data" tab of each component dialog.
The order of variables is also not under direct control of the user, but again, can be easily manipulated at the analysis stage.
As you note, you can of course write code to save custom output files of your own design.
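For instance, dropping unwanted columns and fixing their order at the analysis stage is a couple of lines of pandas. A sketch with made-up column names (yours will differ):

import pandas as pd

df = pd.read_csv('my_experiment.csv')
df = df.drop(columns=['frameRate', 'expName'])        # hypothetical unwanted columns
df = df[['participant', 'trials.thisN', 'response']]  # hypothetical preferred order
df.to_csv('my_experiment_clean.csv', index=False)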
There is a special block called session_variable_order: [var1, var2, var3] in the experiment_config.yaml file, which you probably should be using; you should also consider these methods:
from psychopy import data

# assuming `exp` is your data.ExperimentHandler and `trials` your data.TrialHandler
exp.saveAsWideText(fileName='exp_handler.csv', delim='\t',
                   sortColumns=False, encoding='utf-8')
trials.saveAsText(fileName='trial_handler.txt', delim=',', encoding='utf-8',
                  dataOut=('n', 'all_mean', 'all_raw'), summarised=False)
Notice the sortColumns and dataOut parameters.

Copying fits-file data and/or header into a new fits-file

A similar question was asked before, but it was phrased ambiguously and used different code.
My problem: I want to make an exact copy of a .fits file's header into a new file. (I need to process a FITS file in a way that changes the data, keeps the header the same, and saves the result in a new file.) Here is a short example, just demonstrating the tools I use and the discrepancy I arrive at:
from astropy.io import fits

data_old, header_old = fits.getdata("input_file.fits", header=True)
fits.writeto('output_file.fits', data_old, header_old, overwrite=True)
I would now expect the two files to be exact copies (headers and data both the same). But if I check for differences, e.g. in this way:
fits.printdiff("input_file.fits", "output_file.fits")
I see that the two files are not exact copies of each other. The report says:
...
Files contain different numbers of HDUs:
a: 3
b: 2
Primary HDU:
Headers contain differences:
Headers have different number of cards:
a: 54
b: 4
...
Extension HDU 1:
Headers contain differences:
Keyword GCOUNT has different comments:
...
Why is there no exact copy? How can I make an exact copy of a header (and/or the data)? Am I forgetting a keyword? Is there a simple alternative way of copying a FITS file header?
The copies differ because fits.getdata with header=True returns the data and header of only a single HDU (by default the first one that contains data), so writing them back out creates a file with just that HDU plus a minimal primary header, not the original three HDUs. If you just want to update the data array in an existing file while preserving the rest of the structure, have you tried the update function?
The only issue with that is it doesn't appear to have an option to write to a new file rather than update the existing file (maybe it should have this option). However, you can still use it by first copying the existing file and then updating the copy, as sketched below.
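A sketch of that copy-then-update approach (assuming the data you want to replace lives in extension 1; adjust ext to your file's layout):

import shutil
from astropy.io import fits

# Copy the original so every HDU and header card is carried over untouched,
# then overwrite only the data of the HDU we care about.
shutil.copyfile('input_file.fits', 'output_file.fits')
fits.update('output_file.fits', new_data, ext=1)  # new_data: your modified ndarray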
Alternatively, you can do things more directly using the object-oriented API. Something like:
with fits.open(filename) as hdu_list:
    hdu = hdu_list[<name or index of the HDU to update>]
    hdu.data = <new ndarray>
    # or hdu.data[<some index>] = <some value>, i.e. just directly modify the existing array
    hdu.writeto('updated.fits')       # to write just that HDU to a new file, or
    # hdu_list.writeto('updated.fits')  # to write all HDUs, including the updated one, to a new file
There's nothing not "pythonic" about this :)

How to load image from csv file in tensorflow

I have images saved in CSV files (e.g. 0.csv).
The format is shown in the picture below.
How can I read them into TensorFlow?
Thanks!
You should use the Dataset input pipeline introduced in TensorFlow 1.4:
https://www.tensorflow.org/programmers_guide/datasets#consuming_text_data
Here's the example from the developers' guide (though you'll want to read through that guide; it's quite well written):
import tensorflow as tf

filenames = ["/var/data/file1.txt", "/var/data/file2.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)

# Use `Dataset.flat_map()` to transform each file as a separate nested dataset,
# and then concatenate their contents sequentially into a single "flat" dataset.
# * Skip the first line (header row).
# * Filter out lines beginning with "#" (comments).
dataset = dataset.flat_map(
    lambda filename: (
        tf.data.TextLineDataset(filename)
        .skip(1)
        .filter(lambda line: tf.not_equal(tf.substr(line, 0, 1), "#"))))
The Dataset preprocessing pipeline has a few nice advantages. Most of the functionality you'll need such as reading text records, shuffling, batching, etc. are reduced to one-liners. More importantly though, it forces you into writing your preprocessing pipeline in a good, modular, testable way. It takes a little bit to get used to the API, but it's time well spent.
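To get from text lines to image tensors, map a parsing function over the dataset. A hedged sketch using the TF 1.x API, assuming each CSV row is a flattened 28x28 grayscale image (adjust record_defaults and the reshape to your actual format):

# Assumption: every row holds 784 comma-separated pixel values.
record_defaults = [[0.0]] * 784

def parse_line(line):
    fields = tf.decode_csv(line, record_defaults=record_defaults)
    return tf.reshape(tf.stack(fields), [28, 28])

dataset = dataset.map(parse_line).batch(32)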

converting csv to arff

I am working on a school project for data mining, where we were given CSV data from Kaggle (this is how the data looks; 2 lines out of 6970):
4,1970,Female,150,DomesticPartnersKids,Bachelor's Degree,Democrat,,Yes,No,No,No,Yes,Public,No,Yes,No,Yes,No,No,Yes,Science,Study first,Yes,Yes,No,No,Receiving,No,No,Pragmatist,No,No,Cool headed,Standard hours,No,Happy,Yes,Yes,Yes,No,A.M.,No,End,Yes,No,Me,Yes,Yes,No,Yes,No,Mysterious,No,No,,,,,,,,,,Mac,Yes,Cautious,No,Umm...,No,Space,Yes,In-person,No,Yes,Yes,No,Yay people!,Yes,Yes,Yes,Yes,Yes,No,Yes,,,,,,,,,,,,,,,,,No,No,No,Only-child,Yes,No,No
5,1997,Male,75,Single,High School Diploma,Republican,,Yes,Yes,No,,Yes,Private,No,No,No,Yes,No,No,Yes,Science,Study first,,Yes,No,Yes,Receiving,No,Yes,Pragmatist,No,Yes,Cool headed,Odd hours,No,Right,Yes,No,No,Yes,A.M.,Yes,Start,Yes,Yes,Circumstances,No,Yes,No,Yes,Yes,Mysterious,No,No,Tunes,Technology,Yes,Yes,Yes,Yes,No,Supportive,No,PC,No,Cautious,No,Umm...,No,Space,No,In-person,No,No,Yes,Yes,Grrr people,Yes,No,No,No,No,No,No,Yes,No,No,Yes,No,Own,Pessimist,Mom,No,No,No,No,Nope,Yes,No,No,No,Yes,No,Yes,No,Yes,No
and we have to convert it to .arff format for use in Weka. I manually typed the header (107 attributes):
@ATTRIBUTE user_id NUMERIC
@ATTRIBUTE yob NUMERIC
@ATTRIBUTE gender {Male,Female}
@ATTRIBUTE income {150,100,75,50,25,10}
@ATTRIBUTE householdstatus {MarriedKids,Married,DomesticPartnersKids,DomesticPartners,Single,SingleKids}
@ATTRIBUTE educationlevel {Bachelor's Degree,High School Diploma,Current K-12,Current Undergraduate,Master's Degree,Associate's Degree,Doctoral Degree}
@ATTRIBUTE party {Democrat,Republican}
@ATTRIBUTE Q124742 {Yes,No}
@ATTRIBUTE Q124122 {Yes,No}
and I get this error:
} expected at end of enumeration, read Token[EOL]
Then I tried to use the Weka converter, but it gave me an error:
Wrong number of values. Read 2, expected 1, read Token[EOL], line 4. Problem encountered at line: 3
Here's what I did:
From Kaggle, I downloaded train.csv (5568 instances, highest ID number 6960).
I didn't use the converter; I just loaded it into the Weka Explorer as a CSV file. Some problems and their solutions:
Line 3: the first instance of "Bachelor's Degree". Weka did NOT like that single quote ("line 3, read 7, expected 108"). I got rid of all single quotes using a global replace in a text editor, then tried to load it into Weka again.
The file doesn't have a CR (the Enter key on the keyboard) at the end of the last line, which caused an error ("null on line 5569"). I added one, again in a text editor, then loaded it into Weka and took a look at the variables.
YOB (Year of Birth) is missing for about 300 instances, with "NA" filled in, so it didn't evaluate as either string or numeric. I edited these to be empty cells instead, then loaded it into Weka.
And, of course, I moved Party to be the class variable (at the end). I did this in Weka.
I saved this as train.arff.
I loaded it back in, and it seems to work OK. I got 51% accuracy with a OneR classifier, but you wouldn't expect OneR to work well here; I'm sure you can do better.
Note I didn't do any manual typing of headers. That must have taken a while!
Good luck!
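As an aside, the header typing can be scripted. A minimal Python sketch that scans the CSV and emits @attribute declarations, quoting values that contain apostrophes or spaces (the very thing that tripped up Weka above); the column_names and numeric_cols lists are yours to supply:

import csv

def arff_header(csv_path, column_names, numeric_cols, relation='train'):
    """Collect the distinct values of every non-numeric column and
    emit an ARFF header with one @attribute line per column."""
    values = {name: [] for name in column_names}
    with open(csv_path, newline='') as f:
        for row in csv.reader(f):
            for name, cell in zip(column_names, row):
                if cell and cell not in values[name]:
                    values[name].append(cell)
    lines = ['@relation ' + relation]
    for name in column_names:
        if name in numeric_cols:
            lines.append('@attribute %s numeric' % name)
        else:
            quoted = ["'%s'" % v.replace("'", "\\'") for v in values[name]]
            lines.append('@attribute %s {%s}' % (name, ','.join(quoted)))
    return '\n'.join(lines + ['@data'])

Write the returned header to a file, append the CSV rows below the @data line, and the result should load as .arff.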

Counting the number of passes through a CSV file in JMeter

Am I missing an easy way to do this?
I have a CSV file with a number of params in it, and in my test I want to be able to make some of the fields unique across CSV repetitions with a suffix determined by the number of times I've looped through the file.
So suppose my CSV (simplified) had:
abc
def
ghi
I want to generate in the test
abc_1
def_1
ghi_1 <hit EOF>
abc_2
def_2
ghi_2 <hit EOF>
abc_3
def_3
ghi_3
I thought I could set up a counter to run parallel to my CSV loop, but that won't work unless I increment it by 1/n each iteration, where n is the number of lines in my CSV file - which you can't do, because counters are integers.
I'm going to go flail around and see if I can come up with a solution, but in case I'm not successful, has anyone got any suggestions?
I've used an EOF marker row (an index column with something like "EOF" or "END") together with an If Controller and either a non-resetting counter or a user variable incremented via JavaScript in a BSF element (a BSF Assertion or whatever - just a mechanism to run the script).
Unfortunately it's the best solution I've come up with without putting too much effort into it.
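If your test plan allows it, another option is to pre-expand the CSV before the run, so each pass reads values that already carry the pass number and no counter is needed. A sketch under assumptions (params.csv and the pass count are stand-ins for your setup):

# Pre-generate an expanded CSV so the suffix logic happens outside JMeter.
passes = 3  # how many times the test will loop through the file

with open('params.csv') as f:
    rows = [line.strip() for line in f if line.strip()]

with open('params_expanded.csv', 'w') as out:
    for n in range(1, passes + 1):
        for row in rows:
            out.write('%s_%d\n' % (row, n))

Then point the CSV Data Set Config at params_expanded.csv.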