NLTK: adding genre to files in a corpus

I have a bunch of plain text files that I want to classify as either class A or class B.
For training, I was thinking of labeling each file with its genre (class A or class B) and then trying to identify features that are predictive of a file's genre.
I can create a plain text corpus, but is there any way to add the genre of a file while creating the corpus?

I'd suggest NLTK's CategorizedPlaintextCorpusReader. The text files have to be named according to their category/genre, and you have to pass a regular expression to the constructor that tells NLTK which file belongs to which category.
The documentation states:
A regular expression pattern used to find the category for each file identifier. The pattern will be applied to each file identifier, and the first matching group will be used as the category label for that file.
Instead of a pattern, you can also pass a dictionary or a text file containing a mapping of fileids to category names. Please note that each text file can belong to multiple categories.
See this blog entry for code examples.
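As a minimal sketch (assuming the files live under a corpus/ directory and are named like classA_001.txt / classB_017.txt; both the directory and the naming scheme are hypothetical):
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# The first regex group in cat_pattern becomes the category label.
reader = CategorizedPlaintextCorpusReader(
    'corpus',                      # corpus root directory
    r'.*\.txt',                    # which files belong to the corpus
    cat_pattern=r'(class[AB])_.*'  # maps each fileid to its category
)

print(reader.categories())                    # ['classA', 'classB']
print(reader.fileids(categories=['classA']))  # only the class A files
If you prefer the mapping options mentioned above, pass cat_map (a dictionary mapping fileids to lists of categories) or cat_file (the name of a mapping file) instead of cat_pattern.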

Related

Importing massive dataset in Neo4j where each entity has differing properties

I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity which will have specific properties, e.g.:
label
description
date
In addition to these there are zero or more properties specific to the Entity type, so for example if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import, which works on CSV files with specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities, but I believe I would end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
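For example, with hypothetical books.csv and cars.csv files (the flag syntax shown is the Neo4j 4.x style; 3.x spells these flags slightly differently, so check the import documentation for your version):
neo4j-admin import \
  --nodes=Book=books.csv \
  --nodes=Car=cars.csv \
  --relationships=RELATED_TO=rels.csv
Each node file keeps its own header (plus an :ID column so the relationship file can reference the nodes), e.g. books.csv can have label, description, date, author, and first published columns, while cars.csv has label, description, date, make, and model.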
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: one for the property name and one for the property value.
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
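For illustration, here are two hypothetical lines in that layout (property names borrowed from the question); apoc.map.fromValues turns the first into {label: 'Dune', description: 'Sci-fi novel', author: 'Frank Herbert'}:
label,Dune,description,Sci-fi novel,author,Frank Herbert
label,Model T,description,Vintage car,make,Ford,model,T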
NOTE: the above code stores all values as strings. If you want some property values to be integers, booleans, etc., then you can do something like this instead (this is probably only sensible if the same property occurs on most lines; when the property does not exist on a line, no property will be created on the node, but the conversion attempt still wastes some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', TOINTEGER(data.foo)) AS data
CREATE (e:Entity)
SET e = data;

How to iterate through a CSV file in Neo4j

https://raw.githubusercontent.com/saurabhkumar1903/neo4j/master/alterFile/sampletestoutput1.csv
Here's a link to an image showing my expected output: https://i.imgur.com/x6CYdfU.jpg (I've drawn it on paper just to show the expected output).
I have a CSV file containing a list of nodes, where each line denotes a relationship between the node at line[0] and every other node on that line: line[1], line[2], line[3], ..., line[4500].
E.g.:
1,3,4,5,7,8
2,4,5,11
4,10,11,15
Here the node at line[0], i.e. "1", has a directed relationship with
the node at line[1], i.e. "3", as a friend,
the node at line[2], i.e. "4", as a friend,
the node at line[3], i.e. "5", as a friend,
and so on.
What I am trying to do is show a graph in Neo4j depicting the suggested-friend relationships among the nodes.
What I cannot figure out is how to iterate over the whole CSV file and capture the relationships among the nodes on each line.
If you want to ensure there is a HAS_FRIEND relationship between a person whose id comes first (in line) and the people whose ids come afterwards (in the same line), something like this should work:
LOAD CSV FROM 'file:///friends.csv' AS line
MERGE (p:Person {id: TOINT(line[0])})
FOREACH(id IN line[1..] |
  MERGE (f:Person {id: TOINT(id)})
  MERGE (p)-[:HAS_FRIEND]-(f))
This query presumes that you only want a single HAS_FRIEND relationship between any 2 nodes. Therefore, the MERGE for the relationship does not specify a direction. That way, if there is already such a relationship in either direction, no new relationship is created.
Also, the TOINT function is used to convert the id values to integers (since LOAD CSV automatically treats all values as strings). If you don't need that conversion, then you can remove the function calls.
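One further hedged suggestion: since the question mentions lines with thousands of ids, on Neo4j 3.x you can batch the import into smaller transactions with USING PERIODIC COMMIT, and creating an index on :Person(id) beforehand will speed up the MERGEs considerably:
CREATE INDEX ON :Person(id);

USING PERIODIC COMMIT 10000
LOAD CSV FROM 'file:///friends.csv' AS line
MERGE (p:Person {id: TOINT(line[0])})
FOREACH(id IN line[1..] |
  MERGE (f:Person {id: TOINT(id)})
  MERGE (p)-[:HAS_FRIEND]-(f))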

Building a classifier with J48

Weka is meant to make it very easy to build classifiers. There are many different kinds, and here I want to use a scheme called “J48” that produces decision trees.
Weka can read Comma Separated Values (.csv) format files by selecting the appropriate File Format in the Open file dialog.
I've created a small spreadsheet file (see the next image), saved it in .csv format, and loaded it into Weka.
The first row of the .csv file has the attribute names, separated by commas, which in this case are classe real and resultado modelo.
I've got the dataset opened in the Explorer.
If I go to the Classify panel and choose a classifier (open trees and click J48), I should just be able to run it, since I have the dataset and the classifier. (see the next image)
Well, it doesn't allow me to press Start. (see the next image)
What do I need to do to fix this?
If you look back at the Preprocess tab, you will see that resultado modelo is probably being treated as a numeric attribute. J48 only works with nominal class attributes. (Predictor attributes can be numeric, as commenter @nekomatic noted.)
You can change this by using a filter in the Preprocess tab. Choose the unsupervised attribute filter NumericToNominal and this will convert all your variables (or a subset of them) from numeric to nominal. Then you should be able to run J48 just fine.
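If you'd rather script this than click through the Explorer, here is a rough sketch using the third-party python-weka-wrapper3 package (the package choice, file name, and the assumption that the class attribute is the last column are mine; the filter and classifier class names are the same ones used in the GUI):
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.filters import Filter
from weka.classifiers import Classifier

jvm.start()

# Load the CSV file (assumed name)
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file("dataset.csv")

# Convert the last attribute (here assumed to be 'resultado modelo')
# from numeric to nominal, as described above
to_nominal = Filter(
    classname="weka.filters.unsupervised.attribute.NumericToNominal",
    options=["-R", "last"])
to_nominal.inputformat(data)
data = to_nominal.filter(data)

# Mark the class attribute and build the J48 decision tree
data.class_is_last()
j48 = Classifier(classname="weka.classifiers.trees.J48")
j48.build_classifier(data)
print(j48)

jvm.stop()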

Should the structure of a derived obj file coincide with the naming of the original step file?

When using the Model Derivative API I successfully generate an obj representation from a step file. But within that process are some quirks that I do not fully understand:
The POST job has an output.advanced.exportFileStructure property, which can be set to "multiple", and an output.advanced.objectIds property, which lets you specify which parts of the model you would like to extract. From the little that the documentation states, I would expect to receive one OBJ file per requested objectId, which from my experience is not the case. So does this only work for composite files like .iam and .ipt?
Well, anyway, instead I get one OBJ file for all objectIds, with one polygon group per objectId. The groups are named (duh!), so I would expect them to be named like their objectId, but it seems the numbers are assigned in a random way. So how should I actually map an objectId to its corresponding 3D part? Is there any way to link the information from GET :urn/metadata/:guid/properties back to the objects?
I hope somebody can shed some light on this. If you need more information, I can provide the original STEP file, the OBJ, and my server log.
You misunderstood the objectIds property of the Model Derivative API: specifying that field allows you to export only specific components to a single OBJ. For example, if your car model has 1000 different components but you just want to export the components that represent the engine, you would pass their ids: [34, 56, 76] (I just made those up...). If you want to export each objectId to a separate OBJ file, you need to fire multiple jobs. The exportFileStructure option only applies to composite designs (i.e. assemblies): "single" creates one OBJ file for all the input files (assembly file), while "multiple" creates a separate OBJ file for each object. A STEP file is not a composite design.
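For reference, a job payload along these lines (the urn, modelGuid, and ids below are made up; the field names follow the Model Derivative job documentation) exports just the listed components into a single OBJ:
POST /modelderivative/v2/designdata/job
{
  "input": { "urn": "dXJuOmFkc2sub2JqZWN0cy4uLg" },
  "output": {
    "formats": [{
      "type": "obj",
      "advanced": {
        "modelGuid": "4f981e94-8241-4eaf-b08b-cd337c6b8b1f",
        "objectIds": [34, 56, 76]
      }
    }]
  }
}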
As you noticed, the OBJ groups are named randomly. As far as I know there is no easy, reliable way to map a component in the OBJ file back to the original objectId, because .obj is a very basic format and doesn't support metadata. You could use a geometric approach (finding where the component sits in space, using bounding boxes, ...) to achieve the mapping, as in the sketch below, but it could be challenging with complex models.
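To make the geometric approach concrete, here is a minimal sketch (plain Python, no dependencies; the file name and the assumption that each component appears as its own g group are mine) that computes an axis-aligned bounding box per OBJ group, which you could then match against component bounds derived from the metadata endpoints:
def obj_group_bboxes(path):
    vertices = []       # OBJ vertex indices are file-wide, so collect globally
    groups = {}         # group name -> set of vertex indices used by its faces
    current = "default"
    with open(path) as f:
        for raw in f:
            parts = raw.split()
            if not parts:
                continue
            if parts[0] == "v":
                vertices.append(tuple(float(c) for c in parts[1:4]))
            elif parts[0] == "g":
                current = " ".join(parts[1:]) or "default"
            elif parts[0] == "f":
                # take only the vertex index from v/vt/vn triples;
                # negative (relative) indices are ignored in this sketch
                for vert in parts[1:]:
                    idx = int(vert.split("/")[0])
                    if idx > 0:
                        groups.setdefault(current, set()).add(idx - 1)
    bboxes = {}
    for name, idxs in groups.items():
        pts = [vertices[i] for i in idxs]
        bboxes[name] = (
            tuple(min(p[d] for p in pts) for d in range(3)),  # min corner
            tuple(max(p[d] for p in pts) for d in range(3)),  # max corner
        )
    return bboxes

for name, (lo, hi) in obj_group_bboxes("model.obj").items():
    print(name, lo, hi)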

CSV format for OpenCV machine learning algorithms

Machine learning algorithms in OpenCV appear to use data read in CSV format. See for example this cpp file. The data is read into an OpenCV machine learning class CvMLData using the following code:
CvMLData data;
data.read_csv( filename );
However, there does not appear to be any readily available documentation on the required format for the csv file. Does anyone know how the csv file should be arranged?
Other (non-Opencv) programs tend to have a line per training example, and begin with an integer or string indicating the class label.
If I read the source for that class, particularly the str_to_flt_elem function, and the class documentation I conclude that valid formats for individual items in the file are:
Anything that can be parsed to a double by strtod
A question mark (?) or the empty string to represent missing values
Any string that doesn't parse to a double.
Items 1 and 2 are only valid for features; anything matched by item 3 is assumed to be a class label, and as far as I can deduce, the order of the items doesn't matter. The read_csv function automatically assigns each column in the CSV file the correct type, and (if you want) you can override the labels with set_response_idx. Delimiter-wise, you can use the default (,) or set it to whatever you like by calling set_delimiter before read_csv (as long as you don't use the decimal point).
So this should work, for example, for 6 data points in 3 classes with 3 features per point:
A,1.2,3.2e-2,+4.1
A,3.2,?,3.1
B,4.2,,+0.2
B,4.3,2.0e3,.1
C,2.3,-2.1e+3,-.1
C,9.3,-9e2,10.4
You can move your text label to any column you want, or even have multiple text labels.
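As a side note, CvMLData belongs to the legacy OpenCV 2.x API and was removed in OpenCV 3; if you are on a modern version, the rough equivalent (a hedged sketch, with an assumed file name) is cv2.ml.TrainData, which follows the same CSV conventions described above:
import cv2

data = cv2.ml.TrainData_loadFromCSV(
    "train.csv",  # assumed file name, formatted like the example above
    0,            # headerLineCount: no header row
    0             # responseStartIdx: class label in the first column
)
print(data.getSamples())    # the numeric feature columns
print(data.getResponses())  # class labels encoded as numbers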