How to programmatically create an OrientDB database of a given (json) schema? - json

For my dev workflow purposes, I'd like to create a new orientdb database given a JSON schema, on the fly. I dont believe this is natively supported in orientdb, are there any existing solutions that do this - provide a JSON schema and point to a orientdb instance, and it auto-creates the database (edges, vertices, indexes and perhaps some sample data).

I ended up creating a .sh script to create the DB on the fly. The .sh files looks something like:
# (file: createmydb.sh)
# script to create my database declaratively
set echo true
# use this to ignore errors and continue, if needed
# set ignoreErrors true
# create database
create database plocal:../databases/MyDB root root plocal graph
# create User vertex
create class User extends V
create property User.Email STRING
create property User.Firstname STRING
...
And then call it like:
/usr/local/src/orientdb/bin/console.sh createmydb.sh
This works well for my purposes. The DB creation script is very easy to read, can be modified easily. And I am sure very backwards compatible (which may not have been the case with importing an exported JSON version of the db schema).

So far I've found that pre-loading the schema using an external definition stored in either JSON or OSQL has been most successful for me. Currently I am using an OSQL script that contains a whole bunch of CREATE CLASS ... and CREATE PROPERTY ... commands. It works well enough.
Pretty soon I'll have to start supporting dynamic changes to the data model, at which point I will have to write code to read a JSON schema definition and convert that to appropriate calls into OrientDB, either through the Blueprints API or through SQL batches.
I've not found a tool that does what you need "automatically." If you find one, please let me (and everyone else here) know.

Related

ArangoDB: How to export collection to CSV?

I have noticed there is a feature in web interface of ArangoDB which allows users to Download or Upload data as JSON file. However, I find nothing similar for CSV exporting. How can an existing Arango DB collection be exported to a .csv file?
If you want to export data from ArangoDB to CSV, then you should use Arangoexport. It is included in the full packages as well as the client-only packages. You find it next to the arangod server executable.
Basic usage:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-csv
Also see the CSV example with AQL query:
https://docs.arangodb.com/3.4/Manual/Programs/Arangoexport/Examples.html#export-via-aql-query
Using an AQL query for a CSV export allows you to transform the data if desired, e.g. to concatenate an array to a string or unpack nested objects. If you don't do that, then the JSON serialization of arrays/objects will be exported (which may or may not be what you want).
The default Arango install includes the following file:
/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js
It includes this comment:
// This is a generic CSV exporter for collections.
//
// Usage: Run with arangosh like this:
// arangosh --javascript.execute <CollName> [ <Field1> <Field2> ... ]
Unfortunately, at least in my experience, that usage tip is incorrect. Arango team, if you are reading this, please correct the file or correct my understanding.
Here's how I got it to work:
arangosh --javascript.execute "/usr/share/arangodb3/js/contrib/CSV_export/CSVexport.js" "<CollectionName>"
Please specify a password:
Then it sends the CSV data to stdout. (If you with to send it to a file, you have to deal with the password prompt in some way.)

KNIME - Execute a EXE program in a Workflow

I have a workflow Knime, in the middle I must execute an external program to create an Excel file.
Exists some node that allows me to achieve this? I don't need to put any input or output, only execute the program and wait to generate the Excel file (I require to use this Excel for the next nodes).
There are (at least) two “External Tool” nodes which allow running executables on the command line:
External Tool
External Tool (Labs)
In case that should not be enough, you can always go for a Java Snippet node. The java.lang.Runtime class should be your entry point.
It's could be used the External tool node. The node requires inputs and outputs... but, you can use a table creator node for input:
This create an empty table.
In the external tool node, you must include an Input file and Output file, depending on your request, this config could be meaningless but require to the Node works.
In this case, the external app creates a text with the result of the execution, so, in the initial table (Table creator node), will be read the file and get the information into Knime.

AWS Glue Crawler Classifies json file as UNKNOWN

I'm working on an ETL job that will ingest JSON files into a RDS staging table. The crawler I've configured classifies JSON files without issue as long as they are under 1MB in size. If I minify a file (instead of pretty print) it will classify the file without issue if the result is under 1MB.
I'm having trouble coming up with a workaround. I tried converting the JSON to BSON or GZIPing the JSON file but it is still classified as UNKNOWN.
Has anyone else run into this issue? Is there a better way to do this?
I have two json files which are 42mb and 16mb, partitioned on S3 as path:
s3://bucket/stg/year/month/_0.json
s3://bucket/stg/year/month/_1.json
I had the same problem as you, crawler classification as UNKNOWN.
I were able to solved it:
You must create custom classifier with jsonPath as "$[*]" then create new crawler with the classifier.
Run your new crawler with the data on S3 and proper schema will be created.
DO NOT update your current crawler with the classifier as it won't apply the change, I don't know why, maybe because of classifier versioning AWS mentioned in their documents. Create new crawler make them work
As mentioned in
https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#custom-classifier-json
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array.
That is something which Dung also pointed out in his answer.
Please also note that file encoding can lead to JSON being classified as UNKNOWN. Please try and re-encode the file as UTF-8.

Custom dynamic inventory scripts/plugins in Ansible

Ansible allows devs
to write programs (in any language) that will return JSON describing the dynamic “snapshot” of current hosts. I’m using vSphere, which is currently not supported by Ansible OSS, and so I need to write such a "custom inventory plugin".
I can handle the querying of vSphere for a list of hosts, as well as constructing the JSON that is compatible with what Ansible is expecting.
Where the documentation completely (seemingly) falls flat is:
How do I “connect” Ansible with my inventory app? That is, say my inventory app is a simple bash script (inventory.sh)..how do I configure Ansible to call bash inventory.sh and obtain JSON from it? In reality the app will likely be a Java executable (inventory.jar) but I figure that if I can figure out how to get it working with bash, I can extrapolate to Java; and
How does Ansible actually capture/fetch the JSON back from the app? STDOUT? Is this all supposed to happen over an HTTP connection? Examples? How does inventory.sh or inventory.jar communicate that JSON back to Ansible?
The inventory script has to be located on the same machine where Ansible runs. It is not communicating through http, Ansible will simply parse the STDOUT of your program. The location does not matter at all, you have to pass the path to Ansible when you call Ansible:
ansible-playbook ... -i /path/to/your/inventory.sh
To avoid passing the inventory location every time you could add this to you ansible.cfg:
inventory = /path/to/your/inventory.sh
You could also copy the script to /etc/ansible/hosts, which is the default location Ansible will look for inventory files/scripts, but I prefer to keep things together so I suggest to place it close to your playbooks/roles etc.
And (3) Is any of this documented, anywhere? Don't see anything in the Ansible docs...
It is not mentioned on the page Developing Dynamic Inventory Sources but it is to be seen on some examples on the page Dynamic Inventory. The docs are community managed and from times litte unstructured and lacking important information.
BTW, there is a VMware inventory script included. By looking at the source I have seen it imports some vSphere stuff. I have little experience with VMware so I can't judge if this is actually what you need and don't need to write your own.
This is completely user defined. Typically you would write your dynamic inventory in Python and use a json dump of the output to create the inventory.
Here is an example for the use case you mentioned (vSphere): https://github.com/RaymiiOrg/ansible-vmware/blob/master/query.py
In a nutshell you create it like a normal Python file and create the options (as he does in main) and selectively execute functions based on which options are passed. These will make REST calls and return the output in the form of a JSON dump, which Ansible can parse for use in inventory.

Export from Couchbase to CSV file

I have a Couchbase Cluster with only one node (let's call it localhost) and I need to export all the data from a very big bucket (let's call it XXX) into a CSV file.
Now this seems to be a pretty easy task but I can't find the way to make it work.
According to the (really bad) documentation on the cbtransfer toold from Couchbase http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html they say this is possible but they don't explain it clearly. They just add a flag if you want the transfer to occur in csv format (?) but it is not working. Maybe someone who already did this can give me a hand?
Using the documentation I've been able to make an approach to the result I want to obtain (a clean CSV file with all the documents in the XXX bucket) using this command:
/opt/couchbase/bin/cbtransfer http://localhost:8091 /path/to/export/output.csv -b XXX
But what I get is that /path/to/export/output.csv is actually a folder with a lot of folders inside and it is storing some kind of json metadata that can be used to restore the XXX bucket in another instance of Couchbase.
Has anyone been able to export data from a Couchbase bucket (Json documents) into a CSV file?
From looking at the documentation, you have to put a slightly different syntax to export to a CSV. http://docs.couchbase.com/admin/admin/CLI/cbtransfer_tool.html
It needs to look like so:
cbtransfer http://[localhost]:8091 csv:./data.csv -b default -u Administrator -p password
Notice the "csv:" before the name of the csv file.
I tested this and it does export a CSV. Just be forwarned that you need a relatively flat document structure for this to work really well, as JSON can represent far more complex data structures than CSV obviously, e.g. arrays, sub-documents, etc. cbtransfer will not unravel those. For example, if there is a subdocument, cbtransfer will represent it as a JSON doc in the line of each CSV.
So depending on what your document structure is, exporting to CSV is not an ideal format. It is a step backwards.