Importing a massive dataset into Neo4j where each entity has differing properties - csv

I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity with a set of common properties, e.g.:
label
description
date
In addition to these, there are zero or more properties specific to the Entity type; for example, if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected, but it was worth a try).
I've therefore made use of the bulk import tool neo4j-admin import, which works on a CSV file with headers specifying each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities; however, I believe I would end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.

The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in Neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
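A rough sketch of how that split could be scripted, assuming (hypothetically) that the source data is a newline-delimited JSON dump named entities.jsonl in which each object carries a "type" field; adapt it to however your entities are actually stored:
import csv
import json
from collections import defaultdict

# Pass 1: find which properties actually occur for each entity type,
# so each per-type CSV only contains the columns that type uses.
columns = defaultdict(set)
with open("entities.jsonl") as src:
    for line in src:
        entity = json.loads(line)
        columns[entity["type"]].update(entity.keys())

# Pass 2: write one CSV per type; neo4j-admin import can then be pointed
# at each file separately (one --nodes entry per file/label).
writers, files = {}, []
for entity_type, cols in columns.items():
    f = open(f"{entity_type}.csv", "w", newline="")
    files.append(f)
    writer = csv.DictWriter(f, fieldnames=sorted(cols), restval="")
    writer.writeheader()
    writers[entity_type] = writer

with open("entities.jsonl") as src:
    for line in src:
        entity = json.loads(line)
        writers[entity["type"]].writerow(entity)

for f in files:
    f.close()
Note that the real per-type files would also need the :ID / :LABEL style header columns that neo4j-admin import expects; check the import tool documentation for your Neo4j version for the exact header syntax.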
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: one for the property name and one for the property value.
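For example, a line for a book entity might look like this (the values are made up), with name/value pairs sitting side by side:
label,Dune,description,A science fiction novel,date,1965,author,Frank Herbert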
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
NOTE: the above code stores all values as strings. If you want some property values to be integers, booleans, etc., then you can do something like this instead (this is probably only sensible if the same property occurs on most lines; when the property is missing from a line, no property will be created on the node, but the conversion will still waste a little time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', toInteger(data.foo)) AS data
CREATE (e:Entity)
SET e = data;

Related

Is there a way to get the column names of a dataframe in pyspark without reading the whole dataset?

I have huge datasets in my HDFS environment, say 500+ datasets, and all of them are around 100M+ rows. I want to get only the column names of each dataset without reading the whole dataset, because that would take too long. My data is JSON formatted and I'm reading it using the classic Spark JSON reader: spark.read.json('path'). So what's the best way to get the column names without wasting my time and memory?
Thanks...
From the official doc:
If the schema parameter is not specified, this function goes through the input once to determine the input schema.
Therefore, with spark.read.json you cannot get the column names from just the first line.
Still, you can do an extra step first that extracts one line, creates a dataframe from it, and then extracts the column names.
One approach could be the following:
Read the data using the spark.read.text('path') method
Limit the number of rows to 1 with the method limit(1), since we just want one line to get the column names from
Convert the dataframe to an RDD and collect it as a list with the method collect()
Convert the first collected row from a unicode string to a Python dict (since I'm working with JSON formatted data)
The keys of that dict are exactly what we are looking for (the column names, as a Python list)
This code worked for me:
from ast import literal_eval

literal_eval(
    spark.read.text('path').limit(1)   # read raw lines, keep only the first
    .rdd.flatMap(lambda x: x)          # flatten the Row to its raw string value
    .collect()[0]
).keys()
The reason it works faster is probably that pyspark won't parse the whole dataset with all the field structures if you read it in text format (everything is read as one big string per line); it's lighter and more efficient for that specific case.
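A variant of the same idea using json.loads may be safer if the data contains JSON literals such as true, false or null (which literal_eval cannot parse); a minimal sketch, assuming the file really is newline-delimited JSON:
import json

first_line = spark.read.text('path').limit(1).collect()[0][0]  # raw first line as a string
column_names = list(json.loads(first_line).keys())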

Should the structure of a derived obj file coincide with the naming of the original step file?

When using the Model Derivative API I successfully generate an obj representation from a step file. But within that process are some quirks that I do not fully understand:
The POST job has an output.advanced.exportFileStructure property which can be set to "multiple" and an output.advanced.objectIds property which lets you specify which parts of the model you would like to extract. From the little that the documentation states, I would expect to receive one obj file per requested objectId, which from my experience is not the case. So does this only work for compressed files like .iam and .ipt?
Well, anyway, instead I get one obj file for all objectIds with one polygon group per objectId. The groups are named (duh!), so I would expect them to be named like their objectId, but it seems like the numbers are assigned in a random way. So how should I actually map an objectId to its corresponding 3D part? Is there any way to link the information from GET :urn/metadata/:guid/properties back to their objects?
I hope somebody can shine light on this. If you need more information I can provide you with the original step file, the obj and my server log.
You misunderstood the objectIds property of the Derivative API: specifying that field allows you to export only specific components to a single obj. For example, your car model has 1000 different components, but you just want to export the components that represent the engine: [34, 56, 76] (I just made those up...). If you want to export each objectId to a separate obj file, you need to fire multiple jobs. The exportFileStructure option only applies to composite designs (i.e. assemblies): "single" creates one OBJ file for all the input files (assembly file), "multiple" creates a separate OBJ file for each object. A step file is not a composite design.
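To illustrate the "fire multiple jobs" approach, a rough Python sketch might look like this (the endpoint and payload shape are written from memory and should be checked against the current Model Derivative documentation; the token, urn and model GUID are placeholders):
import requests

TOKEN = "YOUR_ACCESS_TOKEN"   # placeholder access token
URN = "YOUR_BASE64_URN"       # placeholder, the translated design's URN
MODEL_GUID = "YOUR_GUID"      # placeholder, the viewable GUID from the metadata endpoint
OBJECT_IDS = [34, 56, 76]     # the (made up) component ids from above

for object_id in OBJECT_IDS:
    job = {
        "input": {"urn": URN},
        "output": {
            "formats": [{
                "type": "obj",
                "advanced": {"modelGuid": MODEL_GUID, "objectIds": [object_id]},
            }]
        },
    }
    # One job per objectId should yield one OBJ derivative per part.
    response = requests.post(
        "https://developer.api.autodesk.com/modelderivative/v2/designdata/job",
        headers={"Authorization": "Bearer " + TOKEN},
        json=job,
    )
    response.raise_for_status()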
As you noticed, the obj groups are named randomly. As far as I know there is no easy, reliable way to map a component in the obj file back to the original objectId, because .obj is a very basic format that doesn't support metadata. You could use a geometric approach (finding where the component is in space, using bounding boxes, ...) to achieve the mapping, but it could be challenging with complex models.

How to combine multiple MySQL databases using D2RQ?

I have four different MySQL databases that I need to convert into Linked Data and then run queries on the aggregated data. I have generated the D2RQ maps separately and then manually copied them together into a single file. I have read up on some material about customizing the maps but am finding it hard to do so in my case because:
The ontology classes do not correspond to table names. In fact, most classes are column headers.
When I open the combined mapping in Protege, it generates only 3 classes (ClassMap, Database, and PropertyBridge) and lists all the column headers as instances of these.
If I import this file into my ontology, everything becomes annotation.
Please suggest an efficient way to generate a single graph that is formed by mapping these databases to my ontology.
Here is an example. I am using the EEM ontology to refine the mapping file generated by D2RQ. This is a section from the mapping file:
map:scan_event_scanDate a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanDate;
d2rq:propertyDefinitionLabel "scan_event scanDate";
d2rq:column "scan_event.scanDate";
# Manually added
d2rq:datatype xsd:int;
.
map:scan_event_scanTime a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanTime;
d2rq:propertyDefinitionLabel "scan_event scanTime";
d2rq:column "scan_event.scanTime";
# Manually added
d2rq:datatype xsd:time;
.
The ontology I am interested in has the following:
Data property: eventOccurredAt
Domain: EPCISevent
Range: datetime
Now, how should I modify the mapping file so that the date and time are two different relationships?
I think the best way to generate a single graph of your 4 databases is to convert them one by one to a Jena Model using D2RQ, and then use the Union method to create a global model.
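If you end up working in Python rather than Jena, a comparable sketch is to dump each database to an RDF file first (for example with D2RQ's dump-rdf tool) and merge the dumps with rdflib; the file names below are hypothetical:
from rdflib import Graph

# Parsing several RDF dumps into the same Graph is effectively
# a union of the four datasets.
merged = Graph()
for dump in ["db1.ttl", "db2.ttl", "db3.ttl", "db4.ttl"]:
    merged.parse(dump, format="turtle")

merged.serialize(destination="combined.ttl", format="turtle")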
For your D2RQ mapping file, you should read The mapping language carefully; it's not normal to have classes corresponding to columns.
If you give an example of your table structure, I can give you an illustration of a mapping file.
Good luck

Using Boomi, how can I create a flat file profile which has "line" data on the same physical line in the flat file as the "header" data?

I've got parent & child data which I am trying to convert into a flat file using Dell Boomi. The flat file structure is column-based and needs a structure where the line data is on the same line of the file as the header data.
For instance, a header record which has 4 line items needs to generate a file with a structure of:
[header][line][line][line][line]
Currently what I have been able to generate is either
[header][line]
[header][line]
[header][line]
[header][line]
or
[header]
[line]
[line]
[line]
[line]
I think using the results of the second profile and then using a data processing shape to strip [\r][\n] might be my best option, but I wanted to check before implementing it.
I created a user-defined function for each field of data. For this example, let's just talk about "FirstName".
I used a branch immediately. Branch 1 split the documents, flow-controlled them one at a time, entered the map, then stopped. Branch 2 contained a message where I built the new file: I typed a static value with the header name, then used a dynamic process property as the parameter next to the header.
The user-defined function for "FirstName" accepts the first name as input, appends the dynamic process property (you need to define a dynamic property for each field), prepends your delimiter of choice, then sets the same dynamic process property.
This was all done with NO scripting at all. I hope this helps. I can provide screenshots if you need more clarification.

CSV format for OpenCV machine learning algorithms

Machine learning algorithms in OpenCV appear to use data read in CSV format. See for example this cpp file. The data is read into an OpenCV machine learning class CvMLData using the following code:
CvMLData data;
data.read_csv(filename);
However, there does not appear to be any readily available documentation on the required format for the csv file. Does anyone know how the csv file should be arranged?
Other (non-OpenCV) programs tend to have a line per training example, and begin with an integer or string indicating the class label.
If I read the source for that class, particularly the str_to_flt_elem function, and the class documentation, I conclude that valid formats for individual items in the file are:
1. Anything that can be parsed to a double by strtod
2. A question mark (?) or the empty string, to represent missing values
3. Any string that doesn't parse to a double.
Items 1 and 2 are only valid for features. Anything matched by item 3 is assumed to be a class label, and as far as I can deduce the order of the items doesn't matter. The read_csv function automatically assigns each column in the CSV file the correct type, and (if you want) you can override the labels with set_response_index. Delimiter-wise, you can use the default (,) or set it to whatever you like with set_delimiter before calling read_csv (as long as you don't use the decimal point).
So, for example, this should work for 6 data points in 3 classes with 3 features per point:
A,1.2,3.2e-2,+4.1
A,3.2,?,3.1
B,4.2,,+0.2
B,4.3,2.0e3,.1
C,2.3,-2.1e+3,-.1
C,9.3,-9e2,10.4
You can move your text label to any column you want, or even have multiple text labels.