OrientDB ETL from CSV DateTime - csv

This is currently my config file:
{
  "config": {
    "haltOnError": false
  },
  "source": {
    "file": {
      "path": "/home/user1/temp/real_user/user3.csv"
    }
  },
  "extractor": {
    "csv": {
      "columns": ["id", "name", "token", "username", "password", "created", "updated", "enabled", "is_admin", "is_banned", "userAvatar"],
      "columnsOnFirstLine": true
    },
    "field": {
      "fieldName": "created",
      "expression": "created.asDateTime()"
    }
  },
  "transformers": [{
    "vertex": {
      "class": "user"
    }
  }],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/user1/orientdb/real_user",
      "dbAutoCreateProperties": true,
      "dbType": "graph",
      "classes": [{
        "name": "user",
        "extends": "V"
      }],
      "indexes": [{
        "class": "user",
        "fields": ["id:long"],
        "type": "UNIQUE"
      }]
    }
  }
}
and my CSV currently looks like this:
6,Olivia Ong,2jkjkl54k5jklj5k4j5k4jkkkjjkj,\N,\N,2013-11-15 16:36:33,2013-11-15 16:36:33,1,0,\N,\N
7,Matthew,32kj4h3kjh44hjk3hk43hkhhkjhasd,\N,\N,2013-11-18 17:29:13,2013-11-15 16:36:33,1,0,\N,\N
The problem is that when I execute the ETL, OrientDB won't recognize my datetime values as datetimes.
I tried adding the data type to the column definition ("created:datetime"), but then it ended up not showing any data.
What is the proper solution for this case?

From the next version, 2.2.8, you will be able to define different default patterns for date and datetime; see the CSV extractor documentation.
Note that when you define the columns, you need to specify the column's type:
"columns": ["id:string", "created:date", "updated:datetime"],
You can use the 2.2.8-SNAPSHOT jar of the ETL module together with 2.2.7 without any problem:
https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-etl/2.2.8-SNAPSHOT/
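Putting that together, the extractor section might look roughly like the sketch below. This is a hedged example, not a verified config: the typed columns follow the pattern above, the dateTimeFormat and nullValue options are the ones described in the 2.2.8 CSV extractor documentation (the "yyyy-MM-dd HH:mm:ss" pattern matches your sample rows and \N is their null marker), and note that a "field" block is a transformer, so it would normally live in the "transformers" array rather than inside the extractor:
"extractor": {
  "csv": {
    "columns": ["id:long", "name", "token", "username", "password", "created:datetime", "updated:datetime", "enabled", "is_admin", "is_banned", "userAvatar"],
    "columnsOnFirstLine": true,
    "nullValue": "\\N",
    "dateTimeFormat": "yyyy-MM-dd HH:mm:ss"
  }
},
"transformers": [{
  "vertex": {
    "class": "user"
  }
}],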

Related

How to send JSON file with Filebeat into Elasticsearch

I'm trying to send the content of a JSON file into Elasticsearch.
Each file contains only one simple JSON object (just attributes, no array, no nested objects). Filebeat sees the files but they're not sent to Elasticsearch (it's working with csv files so the connection is correct)...
Here is the JSON file (all in one line in the file but I passed it into a JSON formatter to be displayed here):
{
  "IPID": "3782",
  "Agent": "localhost",
  "User": "vtom",
  "Script": "/opt/vtom/scripts/scriptOK.ksh",
  "Arguments": "",
  "BatchQueue": "queue_ksh-json",
  "VisualTOMServer": "labc",
  "Job": "testJSONlogs",
  "Application": "test_CAD",
  "Environment": "TEST",
  "JobRetry": "0",
  "LabelPoint": "0",
  "ExecutionMode": "NORMAL",
  "DateName": "TEST_CAD",
  "DateValue": "05/11/2022",
  "DateStart": "2022-11-05",
  "TimeStart": "20:58:14",
  "StandardOutputName": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.o",
  "StandardOutputContent": "_______________________________________________________________________\nVisual TOM context of the job\n \nIPID : 3782\nAgent : localhost\nUser : vtom\nScript : ",
  "ErrorOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.e",
  "ErrorOutputContent": "",
  "JsonOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.json",
  "ReturnCode": "0",
  "Status": "Finished"
}
The input definition in Filebeat is (it's a merge of data from different web sources):
- type: filestream
  id: vtomlogs
  enabled: true
  paths:
    - /opt/vtom/logs/*.json
  index: vtomlogs-%{+YYYY.MM.dd}
  parsers:
    - ndjson:
        keys_under_root: true
        overwrite_keys: true
        add_error_key: true
        expand_keys: true
The definition of the index template:
{
  "properties": {
    "IPID": {
      "coerce": true,
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "integer",
      "doc_values": true
    },
    "VisualTOMServer": {
      "type": "keyword"
    },
    "Status": {
      "type": "keyword"
    },
    "Agent": {
      "type": "keyword"
    },
    "Script": {
      "type": "text"
    },
    "User": {
      "type": "keyword"
    },
    "ErrorOutputContent": {
      "type": "text"
    },
    "ReturnCode": {
      "type": "integer"
    },
    "BatchQueue": {
      "type": "keyword"
    },
    "StandardOutputName": {
      "type": "text"
    },
    "DateStart": {
      "format": "yyyy-MM-dd",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "Arguments": {
      "type": "text"
    },
    "ExecutionMode": {
      "type": "keyword"
    },
    "DateName": {
      "type": "keyword"
    },
    "TimeStart": {
      "format": "HH:mm:ss",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "JobRetry": {
      "type": "integer"
    },
    "LabelPoint": {
      "type": "keyword"
    },
    "DateValue": {
      "format": "dd/MM/yyyy",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "JsonOutput": {
      "type": "text"
    },
    "StandardOutputContent": {
      "type": "text"
    },
    "Environment": {
      "type": "keyword"
    },
    "ErrorOutput": {
      "type": "text"
    },
    "Job": {
      "type": "keyword"
    },
    "Application": {
      "type": "keyword"
    }
  }
}
The file is seen by Filebeat but it does nothing with it...
0100","log.logger":"input.filestream","log.origin":{"file.name":"filestream/prospector.go","file.line":177},"message":"A new file /opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json has been found","service.name":"filebeat","id":"vtomlogs","prospector":"file_prospector","operation":"create","source_name":"native::109713280-64768","os_id":"109713280-64768","new_path":"/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json","ecs.version":"1.6.0"}
My version of Elasticsearch is: 8.4.3
My version of Filebeat is: 8.5.0 (with allow_older_versions: true in my configuration file)
Thanks for your help
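One detail worth double-checking in the input definition: keys_under_root comes from the older log input's json settings, and for the filestream input's ndjson parser the equivalent is target: "" (an empty target places the decoded keys at the root of the event). A minimal sketch of the input under that assumption:
- type: filestream
  id: vtomlogs
  enabled: true
  paths:
    - /opt/vtom/logs/*.json
  index: vtomlogs-%{+YYYY.MM.dd}
  parsers:
    - ndjson:
        # assumption: under filestream, target replaces keys_under_root
        target: ""
        overwrite_keys: true
        add_error_key: true
        expand_keys: true
If events still do not arrive, running Filebeat with logging.level: debug should show whether they are dropped at parse time or rejected by Elasticsearch (for example because of the date formats in the index template).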

Integromat - Dynamically render spec of a collection from an rpc

I am trying to dynamically render a spec (specification) of a collection from an RPC, but I can't get it to work. I have attached the code of both the module's 'mappable parameters' and the remote procedure's 'communication' below.
module -> mappable parameters
[
  {
    "name": "birdId",
    "type": "select",
    "label": "Bird Name",
    "required": true,
    "options": {
      "store": "rpc://selectbird",
      "nested": [
        {
          "name": "variables",
          "type": "collection",
          "label": "Bird Variables",
          "spec": [
            "rpc://birdVariables"
          ]
        }
      ]
    }
  }
]
remote procedure -> communication
{
  "url": "/bird/get-variables",
  "method": "POST",
  "body": {
    "birdId": "{{parameters.birdId}}"
  },
  "headers": {
    "Authorization": "Apikey {{connection.apikey}}"
  },
  "response": {
    "iterate": {
      "container": "{{body.data}}"
    },
    "output": {
      "name": "{{item.name}}",
      "label": "{{item.label}}",
      "type": "{{item.type}}"
    }
  }
}
Thanks in advance.
I just tried the following and it worked. According to Integromat's docs, you can use the wrapper directive for the RPC like so:
{
  "url": "/bird/get-variables",
  "method": "POST",
  "body": {
    "birdId": "{{parameters.birdId}}"
  },
  "headers": {
    "Authorization": "Apikey {{connection.apikey}}"
  },
  "response": {
    "iterate": "{{body.data}}",
    "output": {
      "name": "{{item.name}}",
      "label": "{{item.label}}",
      "type": "{{item.type}}"
    },
    "wrapper": [{
      "name": "variables",
      "type": "collection",
      "label": "Bird Variables",
      "spec": "{{output}}"
    }]
  }
}
Your mappable parameters would then look like:
[
  {
    "name": "birdId",
    "type": "select",
    "label": "Bird Name",
    "required": true,
    "options": {
      "store": "rpc://selectbird",
      "nested": "rpc://birdVariables"
    }
  }
]
I'm needing this myself. I'm pulling in custom fields that have different types, but I would like them all to show so the user can update custom fields, or update them when creating a contact. I'm not sure whether it's best to show them all, or to have a select drop-down and let the user use the mapping for more than one.
Here is my response from a GET for custom fields. Could you show how my code should look? I got a little confused, as I usually look to add a value in the output. Also, do you need two separate RPCs in Integromat? I noticed your store and nested values were different.
{
  "customFields": [
    {
      "id": "5sCdYXDx5QBau2m2BxXC",
      "name": "Your Experience",
      "fieldKey": "contact.your_experience",
      "dataType": "LARGE_TEXT",
      "position": 0
    },
    {
      "id": "RdrFtK2hIzJLmuwgBtAr",
      "name": "Assisted by",
      "fieldKey": "contact.assisted_by",
      "dataType": "MULTIPLE_OPTIONS",
      "position": 0,
      "picklistOptions": [
        "Tom",
        "Jill",
        "Rick"
      ]
    },
    {
      "id": "uyjmfZwo0PCDJKg2uqrt",
      "name": "Is contacted",
      "fieldKey": "contact.is_contacted",
      "dataType": "CHECKBOX",
      "position": 0,
      "picklistOptions": [
        "I would like to be contacted"
      ]
    }
  ]
}
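Untested, but adapting the accepted answer to this payload: the nested RPC would iterate over body.customFields instead of body.data. In the sketch below, fieldKey is used as the parameter name and name as the label; the hard-coded "text" type is only a placeholder for a real translation from dataType (LARGE_TEXT, MULTIPLE_OPTIONS, CHECKBOX) to Integromat parameter types, and the wrapper names are placeholders too, so treat all of those as assumptions rather than something taken from your response:
"response": {
  "iterate": "{{body.customFields}}",
  "output": {
    "name": "{{item.fieldKey}}",
    "label": "{{item.name}}",
    "type": "text"
  },
  "wrapper": [{
    "name": "customFields",
    "type": "collection",
    "label": "Custom Fields",
    "spec": "{{output}}"
  }]
}
As in the accepted answer, a single RPC referenced from nested is enough for the spec; the store RPC only fills the options of the select parameter, so the two RPCs serve different purposes.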

Issue in running a Gobblin Job

I am new to Gobblin and I am trying to run a simple job in standalone mode, but I am getting the following error:
Task failed due to "com.google.gson.JsonSyntaxException:
com.google.gson.stream.MalformedJsonException: Expected name at line 1
column 72 path $.fields."
My job file is:
job.name=MRJob1
job.group=MapReduce
job.description=A getting started job for MapReduce
source.class=gobblin.source.extractor.filebased.TextFileBasedSource
source.filebased.downloader.class=gobblin.source.extractor.filebased.CsvFileDownloader
converter.classes=gobblin.converter.csv.CsvToJsonConverterV2,gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.builder.class=gobblin.writer.AvroDataWriterBuilder
source.filebased.fs.uri=file:///
source.filebased.data.directory=/home/sahil97/Downloads/gobblin-dist/CsvSource/
source.schema=[{"ColumnName":"FIRST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}}{"ColumnName":"LAST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"GENDER","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"AGE","comment": "","isNullable": "true","dataType":{"type":"int"}}]
source.skip.first.record=true
source.csv_file.delimiter=,
converter.csv.to.json.delimiter=,
extract.table.type=append_only
extract.table.name=CsvToAvro
extract.namespace=MapReduce
converter.classes=gobblin.converter.csv.CsvToJsonConverterV2,gobblin.converter.avro.JsonIntermediateToAvroConverter
writer.destination.type=HDFS
writer.output.format=AVRO
data.publisher.type=gobblin.publisher.BaseDataPublisher
My CSV file (Repo.txt) is:
FIRST_NAME,LAST_NAME,GENDER,AGE
Sahil,Gaur,Male,22
Sagar,Gaur,Male,21
Dushyant,Saini,Male,23
Devyani,Kaulas,Female,21
Sanchi,Theraja,Female,22
Shreya,Gupta,Female,21
Chirag,Thakur,Male,22
Manish,Sharma,Male,23
Abhishek,Soni,Male,24
Varnita,Sachdeva,Female,22
Deepam,Chaurishi,Male,22
The error says you have bad JSON syntax, and your source.schema is indeed missing a comma between the first two column objects, so that is probably one of the first places to look. Pretty-printed, the schema should be:
{
  "ColumnName": "FIRST_NAME",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
}, // add this comma when defining the JSON
{
  "ColumnName": "LAST_NAME",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
},
{
  "ColumnName": "GENDER",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "string"
  }
},
{
  "ColumnName": "AGE",
  "comment": "",
  "isNullable": "true",
  "dataType": {
    "type": "int"
  }
}
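Applied to the job file, that means the single-line source.schema value just needs the comma between the first two column objects; everything else stays the same:
source.schema=[{"ColumnName":"FIRST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"LAST_NAME","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"GENDER","comment": "","isNullable": "true","dataType":{"type":"string"}},{"ColumnName":"AGE","comment": "","isNullable": "true","dataType":{"type":"int"}}]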

Azure pipeline mapping: I want to add one static field in JSON while importing a CSV

This is my input dataset. I want to assign a static value to UserName so that when this CSV is loaded, that static value will be inserted in the destination.
{
  "name": "Input_dat",
  "properties": {
    "structure": [
      {
        "name": "ServerName",
        "type": "String"
      },
      {
        "name": "DTVer",
        "type": "Double"
      },
      {
        "name": "Ver",
        "type": "Double"
      },
      {
        "name": "UserName",
        "type": "String"
      }
    ],
    "published": false,
    "type": "AzureBlob",
    "linkedServiceName": "Source-AzureBlob",
    "typeProperties": {
      "folderPath": "foldwr/folder1/Import/",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\u0001"
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}
E.g. username = "sqladlin".
You cannot do this if your source is blob storage. Maybe add the field in the CSV before it's ingested by Data Factory, or after.
If you really want to do this with data factory, I think the only option is to go for a custom activity.
Hope this helped!

Apache Drill JSON storage configuration error(invalid json mapping)

I am trying to change the storage configuration in Apache Drill (embedded mode) to recognize headers and to change the delimiter of CSV files. I have also renamed the new format category from csv to sap.
I tried to use the information from the documentation and created the following JSON storage configuration:
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "sap": {
      "type": "text",
      "extensions": [
        "sap"
      ],
      "skipFirstLine": false,
      "extractHeader": true,
      "delimiter": "|"
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}
But whenever I try to save it in the Web UI, I get the message: error (invalid json mapping).
The exec.storage.enable_new_text_reader option is set to true.
Could somebody tell me how I can add the two config items skipFirstLine and extractHeader?
BR
Drill is able to parse the header row in a text file (CSV, TSV, etc.) as of Drill 1.3. Check the documentation for this.
Check the release notes for Drill 1.3 and the CSV header parsing issue for more details.
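In other words, on Drill 1.3 or later the sap entry from the question uses the documented text-format attributes and should be accepted as-is; as a minimal sketch, the format block reduces to:
"sap": {
  "type": "text",
  "extensions": ["sap"],
  "skipFirstLine": false,
  "extractHeader": true,
  "delimiter": "|"
}
If the Web UI still rejects the update with "error (invalid json mapping)", the likely causes are a Drill version older than 1.3, which does not recognize these attributes, or a syntax error elsewhere in the pasted configuration.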