Machine learning algorithms in OpenCV appear to use data read in CSV format. See for example this cpp file. The data is read into an OpenCV machine learning class CvMLData using the following code:
CvMLData data;
data.read_csv( filename );
However, there does not appear to be any readily available documentation on the required format for the csv file. Does anyone know how the csv file should be arranged?
Other (non-OpenCV) programs tend to have one line per training example, beginning with an integer or string indicating the class label.
If I read the source for that class, particularly the str_to_flt_elem function, and the class documentation, I conclude that valid formats for individual items in the file are:
1. Anything that can be parsed to a double by strtod
2. A question mark (?) or the empty string, representing a missing value
3. Any string that doesn't parse to a double
Items 1 and 2 are only valid for features; anything matched by item 3 is assumed to be a class label, and as far as I can deduce the order of the items doesn't matter. The read_csv function automatically assigns each column in the csv file the correct type, and (if you want) you can override which column holds the labels with set_response_idx. As for the delimiter, you can use the default (,) or set it to whatever you like with set_delimiter before calling read_csv (as long as you don't use the decimal point).
So, for example, this should work for 6 data points in 3 classes with 3 features per point:
A,1.2,3.2e-2,+4.1
A,3.2,?,3.1
B,4.2,,+0.2
B,4.3,2.0e3,.1
C,2.3,-2.1e+3,-.1
C,9.3,-9e2,10.4
You can move your text label to any column you want, or even have multiple text labels.
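As a concrete illustration, here is a minimal sketch using the legacy OpenCV 2.x C++ ML API; it assumes the six example rows above are saved as train.csv next to the executable:
#include <opencv2/ml/ml.hpp>
#include <cstdio>

int main()
{
    CvMLData data;
    data.set_delimiter(',');              // ',' is the default anyway
    if (data.read_csv("train.csv") != 0)  // non-zero indicates the file could not be read
    {
        std::printf("could not read train.csv\n");
        return 1;
    }
    data.set_response_idx(0);             // column 0 (A/B/C) holds the class labels
    const CvMat* values = data.get_values();
    std::printf("loaded %d rows with %d columns\n", values->rows, values->cols);
    return 0;
}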
So basically I'm at a wall with an assignment and it's beginning to really frustrate me. Essentially I have a CSV file and my goal is to count the number of times each string appears. Column 1 has a string and column 2 has an integer connected to it. I ultimately need this to be formatted into a dictionary. Where I am stuck is how to do this without using imported libraries; I am only allowed to iterate through the file using for loops. Would my best bet be indexing each line, turning it into a string, and counting how many times that string appears? Any insight would be appreciated.
If you don't want to use any library (and assuming you are using Python) you can use a dict comprehension, like this:
with open("data.csv") as file:
    csv_as_dict = {line.split(",")[0]: line.split(",")[1].strip() for line in file.readlines()}
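If the goal is specifically to count how many times each string in the first column appears, a plain for loop and a dict are enough, with no imports (a minimal sketch, assuming a comma-delimited file named data.csv):
counts = {}
with open("data.csv") as file:
    for line in file:
        key = line.split(",")[0].strip()
        counts[key] = counts.get(key, 0) + 1
print(counts)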
Note: The question is possibly a duplicate of Creating a dictionary from a csv file?.
I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity which will have specific properties, e.g.:
label
description
date
In addition to these there are zero or more properties specific to the Entity type, so for example if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import, which works over a CSV file with specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property across the whole set of entities, but I believe I would end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
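For illustration only (the file names and property columns are hypothetical, and the exact flag syntax differs between Neo4j versions), each per-type file can carry its own header, and the --nodes option can be passed multiple times:
books.csv header: entityId:ID,label,description,date,author,firstPublished,:LABEL
cars.csv header:  entityId:ID,label,description,date,make,model,:LABEL

neo4j-admin import --nodes=books.csv --nodes=cars.csv --relationships=relationships.csv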
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: one for the property name and one for the property value.
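For example, a single line describing a hypothetical book entity could look like this (property names alternate with their values):
label,The Hobbit,description,A fantasy novel,date,1937,author,J. R. R. Tolkien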
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
NOTE: the above code stores all values as strings. If you want some property values to be integers, booleans, etc., then you can do something like this instead (this is probably only sensible if the same property occurs on most lines; when the property is missing from a line no property is created on the node, but the conversion call still wastes some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', toInteger(data.foo)) AS data
CREATE (e:Entity)
SET e = data;
If I have two properties:
foo=1
bar=2345
Is there a way to specify that foo is a number and bar is a string?
I assume bar="2345" would do, but I wonder if there's a widely accepted convention.
A properties file is a text file that stores data in a standard format that can be read by the application using it. It is mostly used for application configuration and also for internationalization.
As per the wiki document https://en.wikipedia.org/wiki/.properties
Each parameter is stored as a pair of strings, one storing the name of
the parameter (called the key), and the other storing the value.
There is no way to specify or force a value to be a number or a string only; a value is always a string. It is up to the framework or application reading the properties file to parse the values. If it fails to parse a value as the expected type (such as a number), it may fall back to some default value or simply terminate the program.
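For example, in Java (a minimal sketch; the file name app.properties and the fallback value are illustrative), both values come back as strings and it is the application's job to parse them:
import java.io.FileInputStream;
import java.util.Properties;

public class PropsDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        try (FileInputStream in = new FileInputStream("app.properties")) {
            props.load(in);
        }
        String bar = props.getProperty("bar");   // always a String: "2345"
        int foo;
        try {
            foo = Integer.parseInt(props.getProperty("foo", "0"));
        } catch (NumberFormatException e) {
            foo = 0; // fall back to a default if the value does not parse as a number
        }
        System.out.println("foo=" + foo + ", bar=" + bar);
    }
}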
I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(schema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .write()
    .parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted with German notation, i.e. a comma is used as the decimal separator.
For example, what in US would be 123.01 in this file would be stored as 123,01
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked in the Spark code and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0) - the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM or allowing this to be configured somehow.
I thought of using UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (couldn't really find a good example of using UDT...)
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
    .format("com.databricks.spark.csv")
    .schema(stringOnlySchema)
    .option("delimiter", ";")
    .(other options...)
    .load(...)
    .javaRDD()
    .map(this::conversionFunction);

sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (or, in fact, this could perform any conversion). Rows in Java can be created by RowFactory.create(newFields).
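In case it helps, here is one possible shape for such a conversion function (a sketch only; the column layout, class name, and the choice to return NaN for malformed values are my assumptions, not from the original post). It uses java.text.NumberFormat with the German locale so that "123,01" parses to 123.01:
import java.text.NumberFormat;
import java.text.ParseException;
import java.util.Locale;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;

public class GermanDecimalConversion {
    // Assumes column 0 is an ordinary string and column 1 holds a German-formatted decimal.
    public static Row conversionFunction(Row row) {
        NumberFormat germanFormat = NumberFormat.getInstance(Locale.GERMANY);
        double value;
        try {
            value = germanFormat.parse(row.getString(1)).doubleValue();
        } catch (ParseException e) {
            value = Double.NaN; // or handle the malformed record however you prefer
        }
        return RowFactory.create(row.getString(0), value);
    }
}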
I'd be happy to hear any other suggestions how to approach this but for now this works. :)
I am a new learner in WEKA. I use the Car Evaluation dataset. First, I copied all attributes, instances and values correctly into Excel and saved it as a csv file. I opened that csv file in WEKA, and I can see all the counts of classes, attributes, etc. However, I cannot see anything for the doors and persons attributes; I am getting "Attribute is neither numeric nor nominal."
These attributes take values such as "2", "3" and "more", i.e. both numeric and nominal values, and in WEKA their type is string. How can I change the attribute types, or which method should I apply to see their visualization and counts?
WEKA can read a csv file, but the csv gives no information about the type of the attributes. That is why WEKA encourages you to use the arff file format. The arff format is the same as csv except that it has a header that describes the variables (and allows comments and other documentation). The header will contain things like
@attribute mpg numeric
@attribute cyl numeric
@attribute doors {2,3,more}
to indicate that mpg and cyl will have numeric values while doors will be a factor that can take on any of the three values "2","3", or "more". You will need to be sure that you specify all of the possible values for factors like doors. You can simply add the header in a text editor if you know what the header should look like. You can get more details on the arff format at This WEKA site or This University of Waikato site.
Perhaps you should decide on making the attribute all numeric, all nominal (also known as categorical), or all string.
Benefits of an all numeric attribute: algorithms can determine a mathematical relationship between that attribute and any other attribute, including the target (or desired output), e.g., correlation, dependence/independence, covariance. Furthermore, if you use tree-based algorithms, nodes can define decision rules such as doors>3 or persons<2.
The benefits of an all nominal attribute include: algorithms can finish faster because of the limited number of things that can be done with categorical values. Cons: most algorithms do not directly support nominal attributes, and tree-based algorithms are limited in the type of decision nodes they can produce, e.g., doors is '3' or persons is not 'more'.
Caveat: if the attribute you are dealing with is the target or desired output, having it all numeric will make WEKA interpret the task as a regression problem, while having that attribute nominal will automatically be interpreted as a classification problem.
If you are interested in making your attribute all numeric, you could probably replace all occurrences of more with, say, a -1 using Excel.
If later down the road you need to go from an all numeric to a nominal attribute, you could simply use a filter to do that. Or, if you are using the Java API, you could check Walter's solution:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NumericToNominal;

public class Main {
    public static void main(String[] args) throws Exception {
        // Load the training instances with numeric attributes (the file name here is illustrative)
        Instances originalTrain = DataSource.read("train.arff");

        NumericToNominal convert = new NumericToNominal();
        String[] options = new String[2];
        options[0] = "-R";
        options[1] = "1-2"; // range of attributes to convert to nominal
        convert.setOptions(options);
        convert.setInputFormat(originalTrain);
        Instances newData = Filter.useFilter(originalTrain, convert);

        System.out.println("Before");
        for (int i = 0; i < 2; i = i + 1) {
            System.out.println("Nominal? " + originalTrain.attribute(i).isNominal());
        }

        System.out.println("After");
        for (int i = 0; i < 2; i = i + 1) {
            System.out.println("Nominal? " + newData.attribute(i).isNominal());
        }
    }
}