Apache Drill JSON storage configuration error (invalid json mapping) - json

I am trying to change the storage configuration of Apache Drill in embedded mode so that it recognizes headers and uses a different delimiter for CSV files. I also renamed the new format category from csv to sap.
Following the documentation, I created the following JSON storage configuration:
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "sap": {
      "type": "text",
      "extensions": [
        "sap"
      ],
      "skipFirstLine": false,
      "extractHeader": true,
      "delimiter": "|"
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json"
    },
    "avro": {
      "type": "avro"
    }
  }
}
But whenever I try to save it in the web UI, I get the message: error (invalid json mapping).
The exec.storage.enable_new_text_reader option is set to true.
Could somebody help me figure out how to add the two config items skipFirstLine and extractHeader?
BR

Drill is able to parse the header row of a text file (CSV, TSV, etc.) as of Drill 1.3. Check the documentation for this feature.
See the release notes for Drill 1.3 and the CSV header parsing issue for more details.
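If the web UI keeps rejecting the edit, one workaround is to submit the same configuration through Drill's REST API. The sketch below is only a minimal example under a few assumptions (embedded Drillbit web UI on localhost:8047, plugin saved under the name dfs, and only the sap format shown for brevity); check the REST API documentation for your Drill version before relying on the exact endpoint.

# Minimal sketch (assumptions, not the official recipe): push the plugin
# configuration through Drill's REST API instead of the web UI form.
# Assumes an embedded Drillbit with the web UI on localhost:8047 and a
# plugin stored under the name "dfs"; only the "sap" format is shown here.
import requests

plugin_name = "dfs"  # adjust to the name your plugin is saved under

config = {
    "type": "file",
    "enabled": True,
    "connection": "file:///",
    "workspaces": {
        "root": {"location": "/", "writable": False, "defaultInputFormat": None},
        "tmp": {"location": "/tmp", "writable": True, "defaultInputFormat": None},
    },
    "formats": {
        "sap": {
            "type": "text",
            "extensions": ["sap"],
            "skipFirstLine": False,
            "extractHeader": True,
            "delimiter": "|",
        },
    },
}

# POST /storage/{name}.json creates or updates a storage plugin definition.
resp = requests.post(
    f"http://localhost:8047/storage/{plugin_name}.json",
    json={"name": plugin_name, "config": config},
    timeout=10,
)
print(resp.status_code, resp.text)

The response body should tell you whether Drill accepted the mapping, which can be easier to debug than the generic web UI error.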

Related

How to send JSON file with Filebeat into Elasticsearch

I'm trying to send the content of a JSON file into Elasticsearch.
Each file contains only one simple JSON object (just attributes, no array, no nested objects). Filebeat sees the files but they're not sent to Elasticsearch (it works with CSV files, so the connection is correct)...
Here is the JSON file (all on one line in the file, but I passed it through a JSON formatter to display it here):
{
  "IPID": "3782",
  "Agent": "localhost",
  "User": "vtom",
  "Script": "/opt/vtom/scripts/scriptOK.ksh",
  "Arguments": "",
  "BatchQueue": "queue_ksh-json",
  "VisualTOMServer": "labc",
  "Job": "testJSONlogs",
  "Application": "test_CAD",
  "Environment": "TEST",
  "JobRetry": "0",
  "LabelPoint": "0",
  "ExecutionMode": "NORMAL",
  "DateName": "TEST_CAD",
  "DateValue": "05/11/2022",
  "DateStart": "2022-11-05",
  "TimeStart": "20:58:14",
  "StandardOutputName": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.o",
  "StandardOutputContent": "_______________________________________________________________________\nVisual TOM context of the job\n \nIPID : 3782\nAgent : localhost\nUser : vtom\nScript : ",
  "ErrorOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.e",
  "ErrorOutputContent": "",
  "JsonOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.json",
  "ReturnCode": "0",
  "Status": "Finished"
}
The input definition in Filebeat is (it's a merge of data from different web sources):
- type: filestream
  id: vtomlogs
  enabled: true
  paths:
    - /opt/vtom/logs/*.json
  index: vtomlogs-%{+YYYY.MM.dd}
  parsers:
    - ndjson:
        keys_under_root: true
        overwrite_keys: true
        add_error_key: true
        expand_keys: true
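Since the ndjson parser expects one JSON object per line, it may be worth double-checking the input files themselves; a minimal sketch (using the path pattern from the config above):

# Minimal sketch: verify that every input file really is newline-delimited
# JSON (one object per line), which is what the ndjson parser expects.
# Path pattern is the one from the Filebeat config above.
import glob
import json

for path in glob.glob("/opt/vtom/logs/*.json"):
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                json.loads(line)
            except json.JSONDecodeError as err:
                print(f"{path}:{lineno}: invalid JSON ({err})")
                break
        else:
            print(f"{path}: OK")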
The definition of the index template:
{
  "properties": {
    "IPID": {
      "coerce": true,
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "integer",
      "doc_values": true
    },
    "VisualTOMServer": {
      "type": "keyword"
    },
    "Status": {
      "type": "keyword"
    },
    "Agent": {
      "type": "keyword"
    },
    "Script": {
      "type": "text"
    },
    "User": {
      "type": "keyword"
    },
    "ErrorOutputContent": {
      "type": "text"
    },
    "ReturnCode": {
      "type": "integer"
    },
    "BatchQueue": {
      "type": "keyword"
    },
    "StandardOutputName": {
      "type": "text"
    },
    "DateStart": {
      "format": "yyyy-MM-dd",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "Arguments": {
      "type": "text"
    },
    "ExecutionMode": {
      "type": "keyword"
    },
    "DateName": {
      "type": "keyword"
    },
    "TimeStart": {
      "format": "HH:mm:ss",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "JobRetry": {
      "type": "integer"
    },
    "LabelPoint": {
      "type": "keyword"
    },
    "DateValue": {
      "format": "dd/MM/yyyy",
      "index": true,
      "ignore_malformed": false,
      "store": false,
      "type": "date",
      "doc_values": true
    },
    "JsonOutput": {
      "type": "text"
    },
    "StandardOutputContent": {
      "type": "text"
    },
    "Environment": {
      "type": "keyword"
    },
    "ErrorOutput": {
      "type": "text"
    },
    "Job": {
      "type": "keyword"
    },
    "Application": {
      "type": "keyword"
    }
  }
}
Filebeat sees the file but does nothing with it...
0100","log.logger":"input.filestream","log.origin":{"file.name":"filestream/prospector.go","file.line":177},"message":"A new file /opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json has been found","service.name":"filebeat","id":"vtomlogs","prospector":"file_prospector","operation":"create","source_name":"native::109713280-64768","os_id":"109713280-64768","new_path":"/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json","ecs.version":"1.6.0"}
My version of Elasticsearch is: 8.4.3
My version of Filebeat is: 8.5.0 (with allow_older_versions: true in my configuration file)
Thanks for your help

Open a bunch of JSON files and change value of a key

I have over 500 JSON files in which I have to change the value of a lang key from en to de.
This is what the structure of the files looks like:
{
  "id": "a7b9e1b4-3d91-4e1b-946a-8c4b92b78d31",
  "name": "smart.emoji.gb.country",
  "auto": true,
  "contexts": [],
  "responses": [
    {
      "resetContexts": false,
      "affectedContexts": [],
      "parameters": [],
      "messages": [
        {
          "type": 0,
          "lang": "en",
          "speech": "Country Flag"
        }
      ],
      "defaultResponsePlatforms": {},
      "speech": []
    }
  ],
  "priority": 500000,
  "webhookUsed": false,
  "webhookForSlotFilling": false,
  "lastUpdate": 1525736163,
  "fallbackIntent": false,
  "events": []
}
All the files are in the same folder.
Any idea how I can do it automatically?
I found an answer: Notepad++ has this functionality.
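For reference, the same change can also be scripted instead of done in an editor; a minimal sketch, assuming all files follow the structure above and sit in a single (hypothetical) folder:

# Minimal sketch: change "lang": "en" to "de" in every message of every
# intent file in one folder. Assumes the structure shown above; the folder
# path "./intents" is hypothetical.
import glob
import json

for path in glob.glob("./intents/*.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)

    changed = False
    for response in data.get("responses", []):
        for message in response.get("messages", []):
            if message.get("lang") == "en":
                message["lang"] = "de"
                changed = True

    if changed:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2, ensure_ascii=False)
        print(f"updated {path}")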

Azure pipeline mapping: I want to add one static field to the JSON while importing a CSV

This is the input dataset. I want to assign a static value to UserName so that when this CSV is loaded, that static value is inserted in the destination.
{
  "name": "Input_dat",
  "properties": {
    "structure": [
      {
        "name": "ServerName",
        "type": "String"
      },
      {
        "name": "DTVer",
        "type": "Double"
      },
      {
        "name": "Ver",
        "type": "Double"
      },
      {
        "name": "UserName",
        "type": "String"
      }
    ],
    "published": false,
    "type": "AzureBlob",
    "linkedServiceName": "Source-AzureBlob",
    "typeProperties": {
      "folderPath": "foldwr/folder1/Import/",
      "format": {
        "type": "TextFormat",
        "rowDelimiter": "\n",
        "columnDelimiter": "\u0001"
      }
    },
    "availability": {
      "frequency": "Day",
      "interval": 1
    },
    "external": true,
    "policy": {}
  }
}
E.g. UserName = "sqladlin"
You cannot do this if your source is blob storage. Maybe add the field to the CSV before it is ingested by Data Factory, or afterwards.
If you really want to do this with Data Factory, I think the only option is to go for a custom activity.
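If you go the route of adding the field to the CSV before ingestion, a small preprocessing step is enough; a minimal sketch, assuming the \u0001 column delimiter from the dataset above and hypothetical file names:

# Minimal sketch: append a static UserName value as the last column of each
# row before the file is picked up by Data Factory. Uses the \u0001 column
# delimiter from the dataset definition above; file names are hypothetical.
import csv

STATIC_USERNAME = "sqladlin"

with open("input.dat", newline="", encoding="utf-8") as src, \
     open("input_with_user.dat", "w", newline="", encoding="utf-8") as dst:
    reader = csv.reader(src, delimiter="\u0001")
    writer = csv.writer(dst, delimiter="\u0001", lineterminator="\n")
    for row in reader:
        writer.writerow(row + [STATIC_USERNAME])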
Hope this helped!

Apache Drill S3 : No default schema selected

I am trying to work with Apache Drill. I am new to this whole environment, just trying to understand how Apache Drill works.
I am trying to query my JSON data stored on S3 using Apache Drill.
My bucket is created in US East (N. Virginia).
I have created a new Storage Plugin for S3 using this link.
Here is the configuration for my new S3 Storage Plugin :
{
  "type": "file",
  "enabled": true,
  "connection": "s3a://testing-drill/",
  "config": {
    "fs.s3a.access.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "fs.s3a.secret.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "avro": {
      "type": "avro"
    },
    "sequencefile": {
      "type": "sequencefile",
      "extensions": [
        "seq"
      ]
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    }
  }
}
I have also configured my core-site-example.xml as follows:
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>xxxxxxxxxxxxxxxxxxxx</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>xxxxxxxxxxxxxxxxxxxxxxxx</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>s3.us-east-1.amazonaws.com</value>
  </property>
</configuration>
But when I try to use/set the workspace using the following command :
USE shiv.`root`;
It gives me following error :
Error: VALIDATION ERROR: Schema [shiv.root] is not valid with respect to either root schema or current default schema.
Current default schema: No default schema selected
[Error Id: 6d9515c0-b90f-48aa-9dc5-0c660f1c06ca on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)
If I try to execute show schemas;, I get the following error:
show schemas;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: EEB438A6A0A5E667, AWS Error Code: null, AWS Error Message: Bad Request
Fragment 0:0
[Error Id: 85883537-9b4f-4057-9c90-cdaedec116a8 on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)
I am not able to understand the root cause of this issue.
I had a similar issue when using Apache Drill with GCS (Google Cloud Storage).
I was getting the following error when running the USE gcs.data query:
VALIDATION ERROR: Schema [gcs.data] is not valid with respect to either root schema or current default schema.
Current default schema: No default schema selected
I ran SHOW SCHEMAS and there was no gcs.data schema.
I went ahead and created a data folder in my GCS bucket; gcs.data then showed up in SHOW SCHEMAS and the USE gcs.data query worked.
From my limited experience with Apache Drill, what I understood is that for file storage, if a workspace points to a folder that does not exist, Drill will throw this error.
GCS and S3 are both file-type storage, so maybe you are having the same issue.
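Following that reasoning for S3, you could make sure the workspace folders exist up front; a minimal sketch with boto3, using the bucket name from the plugin config above and an example prefix:

# Minimal sketch, following the same reasoning for S3: create a zero-byte
# placeholder key ending in "/" so the workspace prefix actually exists.
# Bucket name is the one from the plugin config above; the prefix is an example.
import boto3

s3 = boto3.client("s3")
s3.put_object(Bucket="testing-drill", Key="tmp/")  # makes s3a://testing-drill/tmp/ visible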
Here is my GCS storage config
{
  "type": "file",
  "connection": "gs://my-gcs-bkt",
  "config": null,
  "workspaces": {
    "data": {
      "location": "/data",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "csvh": {
      "type": "text",
      "extensions": [
        "csvh"
      ],
      "extractHeader": true,
      "delimiter": ","
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    }
  },
  "enabled": true
}

OrientDB ETL from CSV DateTime

This is currently my config file
{
  "config": {
    "haltOnError": false
  },
  "source": {
    "file": {
      "path": "/home/user1/temp/real_user/user3.csv"
    }
  },
  "extractor": {
    "csv": {
      "columns": ["id", "name", "token", "username", "password", "created", "updated", "enabled", "is_admin", "is_banned", "userAvatar"],
      "columnsOnFirstLine": true
    },
    "field": {
      "fieldName": "created",
      "expression": "created.asDateTime()"
    }
  },
  "transformers": [{
    "vertex": {
      "class": "user"
    }
  }],
  "loader": {
    "orientdb": {
      "dbURL": "plocal:/home/user1/orientdb/real_user",
      "dbAutoCreateProperties": true,
      "dbType": "graph",
      "classes": [{
        "name": "user",
        "extends": "V"
      }],
      "indexes": [{
        "class": "user",
        "fields": ["id:long"],
        "type": "UNIQUE"
      }]
    }
  }
}
and my CSV currently looks like this:
6,Olivia Ong,2jkjkl54k5jklj5k4j5k4jkkkjjkj,\N,\N,2013-11-15 16:36:33,2013-11-15 16:36:33,1,0,\N,\N
7,Matthew,32kj4h3kjh44hjk3hk43hkhhkjhasd,\N,\N,2013-11-18 17:29:13,2013-11-15 16:36:33,1,0,\N,\N
Still, when I execute the ETL, OrientDB won't recognize my datetime as a datetime.
I tried putting the data type in the column fields ("created:datetime"), but then no data showed up at all.
I wonder what the proper solution is for this case.
From the next version, 2.2.8, you will be able to define different default patterns for date and datetime: see the CSV extractor documentation.
Note that when you define the columns, you need to specify each column's type:
"columns": ["id:string", "created:date", "updated:datetime"],
You can use the 2.2.8 snapshot jar of the ETL module with 2.2.7 without any problem:
https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-etl/2.2.8-SNAPSHOT/