I have four different MySQL databases that I need to convert into Linked Data and then run queries on the aggregated data. I have generated the D2RQ maps separately and then manually copied them together into a single file. I have read up some material on customizing the maps but am finding it hard to do so in my case because:
The ontology classes do not correspond to table names. In fact, most classes are column headers.
When I open the combined mapping in Protege, it generates only 3 classes (ClassMap, Database, and PropertyBridge) and lists all the column headers as instances of these.
If I import this file into my ontology, everything becomes annotation.
Please suggest an efficient way to generate a single graph that is formed by mapping these databases to my ontology.
Here is an example. I am using the EEM ontology to refine the mapping file generated by D2RQ. This is a section from the mapping file:
map:scan_event_scanDate a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanDate;
d2rq:propertyDefinitionLabel "scan_event scanDate";
d2rq:column "scan_event.scanDate";
# Manually added
d2rq:datatype xsd:int;
.
map:scan_event_scanTime a d2rq:PropertyBridge;
d2rq:belongsToClassMap map:scan_event;
d2rq:property vocab:scan_event_scanTime;
d2rq:propertyDefinitionLabel "scan_event scanTime";
d2rq:column "scan_event.scanTime";
# Manually added
d2rq:datatype xsd:time;
The ontology I am interested in has the following:
Data property: eventOccurredAt
Domain: EPCISevent
Range: datetime
Now, how should I modify the mapping file so that the date and time are two different relationships?
I think the best way to generate a single graph of your 4 databases is to convert them one by one to a Jena Model using D2RQ, and then use the Union method to create a global model.
For your D2RQ mapping file, you should read carefully The mapping language, it's not normal to have classes corresponding to columns.
If you give an example of your table structure, I can give you an illustration of a mapping file.
Good luck
Related
I'm trying to bulk load a massive dataset into a single Neo4j instance. Each node will represent a general Entity which will have specific properties, e.g.:
label
description
date
In addition to these there are zero or more properties specific to the Entity type, so for example if the Entity is a Book, the properties will look something like this:
label
description
date
author
first published
...
And if the Entity is a Car the properties will look something like this:
label
description
date
make
model
...
I first attempted to import the dataset by streaming each Entity from the filesystem and using Cypher to insert each node (some 200M entities and 400M relationships). This was far too slow (as I had expected but worth a try).
I've therefore made use of the bulk import tool neo4j-admin import which works over a CSV file which has specified headers for each property. The problem I'm having is that I don't see a way to add the additional properties specific to each Entity. The only solution I can think of is to include a CSV column for every possible property expressed across the set of entities, however I believe I will end up with a bunch of redundant properties on all my entities.
EDIT1
Each Entity is unique, so there will be some 1M+ types (labels in Neo4j)
Any suggestions on how to accomplish this would be appreciated.
The import command of neo4j-admin supports importing from multiple node and relationship files.
Therefore, to support multiple "types" of nodes (called labels in neo4j), you can split your original CSV file into separate files, one for each Entity "type". Each file can then have data columns specific to that type.
[UPDATED]
Here is one way to support the import of nodes having arbitrary schemata from a CSV file.
The CSV file should not have a header.
Every property on a CSV line should be represented by an adjacent pair of values: 1 for the property name, and 1 for the property value.
With such a CSV file, this code (which takes advantage of the APOC function apoc.map.fromValues) should work:
LOAD CSV FROM "file:///mydata.csv" AS line
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
NOTE: the above code would use strings for all values. If you want some property values to be integers, booleans, etc., then you can do something like this instead (but this is probably only sensible if the same property occurs frequently; if the property does not exist on a line no property will be created in the node, but it will waste some time):
LOAD CSV FROM "file:///mydata.csv" AS line
WITH apoc.map.fromValues(line) AS data
WITH apoc.map.setKey(data, 'foo', TOINTEGER(data.foo)) AS data
CREATE (e:Entity)
SET e = apoc.map.fromValues(line);
This is going to be little complex
I am trying to save a json with nested array structure. I have the json like below which I am trying to save
JSON LINK
Is there a possibility to save the above json with cypher query, because previously I tried with py2neo library for pyhton, which is based on model definition, Therefore the above json with nested structure will be going to have a dynamic keys a bit.
What I actually tried is, I'm leaving it below
query = '''
CREATE (part:Part PARTJSON)
MERGE (part) - [:LINKED_TO] - (general:General GENERALJSON)
MERGE (general) - [:LINKED_TO] - (bom:Bom BOMJSON )
MERGE (general) - [:LINKED_TO] - (generaldata:GeneralData GENERALDATAJSON )
.......
'''
Is there a possibility to write a cypher query to save it in one flash.
If so, can help with possible ideas so that it will be useful for neo4j users with roadblocks.
Thanks in advance.
When using the Model Derivative API I successfully generate an obj representation from a step file. But within that process are some quirks that I do not fully understand:
The Post job has a output.advanced.exportFileStructure property which can be set to "multiple" and a output.advanced.objectIds property which lets you specify the which parts of the model you would like to extract. From the little that the documentation states, I would expect to receive one obj file per requested objectid. Which from my experience is not the case. So does this only work for compressed files like .iam and .ipt?
Well, anyway, instead I get one obj file for all objectIds with one polygon group per objectId. The groups are named (duh!), so I would expect them to be named like their objectId but it seams like the numbers are assigned in a random way. So how should I actually map an objectId to its corresponding 3d part? Is there any way to link the information from GET :urn/metadata/:guid/properties back to their objects?
I hope somebody can shine light on this. If you need more information I can provide you with the original step file, the obj and my server log.
You misunderstood the objectIds property of the derivatives API: specifying that field allows you to export only specific components to a single obj, for example your car model has 1000 different components, but you just want to export components that represent the engine: [34, 56, 76] (I just made those up...). If you want to export each objectId to a separate obj file, you need to fire multiple jobs. the "exportFileStructure" option only applies to composite designs (i.e. assemblies) single: creates one OBJ file for all the input files (assembly file), multiple: creates a separate OBJ file for each object. A step file is not a composite design.
As you noticed the obj groups are named randomly. As far as I know there is no easy reliable way to map a component in the obj file to the original objectId because .obj is a very basic format and it doesn't support metadata. You could use a geometric approach (finding where is the component in space, use bounding boxes, ...) to achieve the mapping but it could be challenging with complex models.
I have huge number (35k) of small (16kb) json files stored on S3Bucket. I need to load them into DataFrame for futher processing, here is my code for extract:
val jsonData = sqlContext.read.json("s3n://bucket/dir1/dir2")
.where($"nod1.filter1"==="filterValue")
.where($"nod2.subNode1.subSubNode2.created"(0)==="filterValue2")
I'm storing this data into temp table and use for futher operations (exploading nested structures into separate data frames)
jsonData.registerTempTable("jsonData")
So now I have autogenerated schema for this deeply nested dataframe.
With above code I have terrible performance issues I presume its caused by not using sc.parallelize during bucket load, moreover I'm pretty sure that autogeneration of schema in read.json() method is taking a lot of time.
Questions part:
How should my bucket load look like, to be more efficient and faster?
Is there any way to declare this schema in advance (I need to work around Case Class tuple problem thou) to avoid auto-generation?
Does filtering data during load make sense or I should simply load all and filter data after?
Found so far:
sqlContext.jsonRdd(rdd, schema)
It did the part with auto generated schema, but InteliJ screams about depreciated method, is there any alternative for it?
As an alternative to case class, use a custom class that implements the Product interface, and then DataFrame will use the schema exposed by your class members without the case class constraints. See in-line comment here http://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
If your json is composed of unrooted fragments you could use s3distcp to group the files and concatenate them into fewer files. Also try s3a protocol as it is better performance than s3n.
I have a doubt. I do know that Logstash allows us to input csv/log files and filter it using separators and columns. And it will output into elasticsearch for it to be used by Kibana. However, after writing the conf file, do I need to specify index pattern by using the command:
CURL -XPUT 'http://localhost:5601/test' d
Because I do know that when you have a JSON file, you will have to define the mapping etc. Do I need to do this step for csv files and other non json files? Sorry for asking, I need to clear my doubt.
When you insert documents into a new elasticsearch index, a mapping is created for you. This may not be a good thing, as it's based on the initial value of each field. Imagine a field that normally contains a string, but the initial document contains an integer - now your mapping is wrong. This is a good case for creating a mapping.
If you insert documents through logstash into an index named logstash-YYYY-MM-DD (the default), logstash will apply its own mapping. It will use any pattern hints you gave it in grok{}, e.g.:
%{NUMBER:bytes:int}
and it will also make a "raw" (not analyzed) version of each string, which you can access as myField.raw. This may also not be what you want, but you can make your own mapping and provide it as an argument in the elasticsearch{} output stanza.
You can also make templates, which elasticsearch will apply when an index pattern matches the template definition.
So, you only need to create a mapping if you don't like the default behaviors of elasticsearch or logstash.
Hope that helps.