I am trying to work with Apache Drill. I am new to this whole environment and just trying to understand how Apache Drill works.
I am trying to query my JSON data stored on S3 using Apache Drill.
My bucket is created in US East (N. Virginia).
I have created a new storage plugin for S3 using this link.
Here is the configuration for my new S3 storage plugin:
{
"type": "file",
"enabled": true,
"connection": "s3a://testing-drill/",
"config": {
"fs.s3a.access.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
"fs.s3a.secret.key": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
},
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
}
},
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json",
"extensions": [
"json"
]
},
"avro": {
"type": "avro"
},
"sequencefile": {
"type": "sequencefile",
"extensions": [
"seq"
]
},
"csvh": {
"type": "text",
"extensions": [
"csvh"
],
"extractHeader": true,
"delimiter": ","
}
}
}
I have also configured my core-site-example.xml as follows:
<configuration>
<property>
<name>fs.s3a.access.key</name>
<value>xxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>xxxxxxxxxxxxxxxxxxxxxxxx</value>
</property>
<property>
<name>fs.s3a.endpoint</name>
<value>s3.us-east-1.amazonaws.com</value>
</property>
</configuration>
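For context, this is the kind of query I eventually want to run against the bucket (the file name here is just an example):
SELECT * FROM shiv.root.`mydata.json` LIMIT 10;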
But when I try to use/set the workspace using the following command:
USE shiv.`root`;
It gives me the following error:
Error: VALIDATION ERROR: Schema [shiv.root] is not valid with respect to either root schema or current default schema.
Current default schema: No default schema selected
[Error Id: 6d9515c0-b90f-48aa-9dc5-0c660f1c06ca on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)
If I try to execute show schemas;, I get the following error:
show schemas;
Error: SYSTEM ERROR: AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: EEB438A6A0A5E667, AWS Error Code: null, AWS Error Message: Bad Request
Fragment 0:0
[Error Id: 85883537-9b4f-4057-9c90-cdaedec116a8 on ip-10-0-3-241.ec2.internal:31010] (state=,code=0)
I am not able to understand the root cause of this issue.
I had a similar issue when using Apache Drill with GCS (Google Cloud Storage).
I was getting the following error when running a USE gcs.data query.
VALIDATION ERROR: Schema [gcs.data] is not valid with respect to either root schema or current default schema.
Current default schema: No default schema selected
I ran SHOW SCHEMAS and there was no gcs.data schema.
I went ahead and created a data folder in my GCS bucket, after which gcs.data showed up in SHOW SCHEMAS and the USE gcs.data query worked.
From my limited experience with Apache Drill, what I understood is that for file-type storage plugins, if a workspace points to a folder that does not exist, Drill throws this error.
GCS and S3 are both file-type storage, so you may be hitting the same issue.
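A quick way to check this from the Drill shell, assuming the plugin is called gcs and the bucket contains a data folder with at least one JSON file in it (the file name below is made up):
SHOW SCHEMAS;
USE gcs.data;
SELECT * FROM gcs.data.`sample.json` LIMIT 10;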
Here is my GCS storage config
{
"type": "file",
"connection": "gs://my-gcs-bkt",
"config": null,
"workspaces": {
"data": {
"location": "/data",
"writable": true,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
},
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null,
"allowAccessOutsideWorkspace": false
}
},
"formats": {
"parquet": {
"type": "parquet"
},
"json": {
"type": "json",
"extensions": [
"json"
]
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"csvh": {
"type": "text",
"extensions": [
"csvh"
],
"extractHeader": true,
"delimiter": ","
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
}
},
"enabled": true
}
I'm trying to send the content of a JSON file into Elasticsearch.
Each file contains only one simple JSON object (just attributes, no arrays, no nested objects). Filebeat sees the files, but they're not sent to Elasticsearch (it works with CSV files, so the connection is correct).
Here is the JSON file (all on one line in the file, but I passed it through a JSON formatter to display it here):
{
"IPID": "3782",
"Agent": "localhost",
"User": "vtom",
"Script": "/opt/vtom/scripts/scriptOK.ksh",
"Arguments": "",
"BatchQueue": "queue_ksh-json",
"VisualTOMServer": "labc",
"Job": "testJSONlogs",
"Application": "test_CAD",
"Environment": "TEST",
"JobRetry": "0",
"LabelPoint": "0",
"ExecutionMode": "NORMAL",
"DateName": "TEST_CAD",
"DateValue": "05/11/2022",
"DateStart": "2022-11-05",
"TimeStart": "20:58:14",
"StandardOutputName": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.o",
"StandardOutputContent": "_______________________________________________________________________\nVisual TOM context of the job\n \nIPID : 3782\nAgent : localhost\nUser : vtom\nScript : ",
"ErrorOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.e",
"ErrorOutputContent": "",
"JsonOutput": "/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221105-205814.json",
"ReturnCode": "0",
"Status": "Finished"
}
The input definition in Filebeat is (it's a merge of data from different web sources):
- type: filestream
  id: vtomlogs
  enabled: true
  paths:
    - /opt/vtom/logs/*.json
  index: vtomlogs-%{+YYYY.MM.dd}
  parsers:
    - ndjson:
        keys_under_root: true
        overwrite_keys: true
        add_error_key: true
        expand_keys: true
The definition of the index template:
{
"properties": {
"IPID": {
"coerce": true,
"index": true,
"ignore_malformed": false,
"store": false,
"type": "integer",
"doc_values": true
},
"VisualTOMServer": {
"type": "keyword"
},
"Status": {
"type": "keyword"
},
"Agent": {
"type": "keyword"
},
"Script": {
"type": "text"
},
"User": {
"type": "keyword"
},
"ErrorOutputContent": {
"type": "text"
},
"ReturnCode": {
"type": "integer"
},
"BatchQueue": {
"type": "keyword"
},
"StandardOutputName": {
"type": "text"
},
"DateStart": {
"format": "yyyy-MM-dd",
"index": true,
"ignore_malformed": false,
"store": false,
"type": "date",
"doc_values": true
},
"Arguments": {
"type": "text"
},
"ExecutionMode": {
"type": "keyword"
},
"DateName": {
"type": "keyword"
},
"TimeStart": {
"format": "HH:mm:ss",
"index": true,
"ignore_malformed": false,
"store": false,
"type": "date",
"doc_values": true
},
"JobRetry": {
"type": "integer"
},
"LabelPoint": {
"type": "keyword"
},
"DateValue": {
"format": "dd/MM/yyyy",
"index": true,
"ignore_malformed": false,
"store": false,
"type": "date",
"doc_values": true
},
"JsonOutput": {
"type": "text"
},
"StandardOutputContent": {
"type": "text"
},
"Environment": {
"type": "keyword"
},
"ErrorOutput": {
"type": "text"
},
"Job": {
"type": "keyword"
},
"Application": {
"type": "keyword"
}
}
}
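For reference, a mapping like this is typically installed as a composable index template; a minimal sketch (the template name and index pattern are assumptions, the pattern matching the index setting in the Filebeat input above, and only two of the fields are repeated for brevity):
PUT _index_template/vtomlogs
{
  "index_patterns": ["vtomlogs-*"],
  "template": {
    "mappings": {
      "properties": {
        "IPID": { "type": "integer" },
        "DateStart": { "type": "date", "format": "yyyy-MM-dd" }
      }
    }
  }
}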
The file is seen by Filebeat but it does nothing with it...
0100","log.logger":"input.filestream","log.origin":{"file.name":"filestream/prospector.go","file.line":177},"message":"A new file /opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json has been found","service.name":"filebeat","id":"vtomlogs","prospector":"file_prospector","operation":"create","source_name":"native::109713280-64768","os_id":"109713280-64768","new_path":"/opt/vtom/logs/TEST_test_CAD_testJSONlogs_221106-124138.json","ecs.version":"1.6.0"}
My version of Elasticsearch is: 8.4.3
My version of Filebeat is: 8.5.0 (with allow_older_versions: true in my configuration file)
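For reference, that option sits under the Elasticsearch output in filebeat.yml; a minimal sketch (the host value is an assumption):
output.elasticsearch:
  hosts: ["localhost:9200"]
  allow_older_versions: true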
Thanks for your help
Error: String cannot be coerced to a nodeId
Hi,
I was busy setting up a connection between the Orion Broker and a PLC with an OPC-UA server, using the opcua iotagent.
I managed to set up all the parts and I am able to receive (test) data, but I am unable to follow the tutorial with regard to adding an entity to the Orion Broker using a JSON file:
curl http://localhost:4001/iot/devices -H "fiware-service: plcservice" -H "fiware-servicepath: /demo" -H "Content-Type: application/json" -d @add_device.json
The expected result would be an entity added to the Orion Broker with the supplied data, but this only results in an error message:
{"name":"Error","message":"String cannot be coerced to a nodeId : ns*4:s*MAIN.mainVar"}
Suspected error
Is it possible that the iotagent does not work nicely with nested variables?
Steps taken
Double-checked availability of OPC data:
OPC data changes every second and can be seen in the Broker log
Reduced the complexity of the setup to only include the Broker and the IoT agent
Additional information:
add_device.json file:
{
"devices": [
{
"device_id": "plc1",
"entity_name": "PLC1",
"entity_type": "plc",
"attributes": [
{
"object_id": "ns*4:s*MAIN.mainVar",
"name": "main",
"type": "Number"
}
],
"lazy": [
],
"commands" : []
}
]
}
Config of the IoT agent (from localhost:4081/config):
{
"config": {
"logLevel": "DEBUG",
"contextBroker": {
"host": "orion",
"port": 1026
},
"server": {
"port": 4001,
"baseRoot": "/"
},
"deviceRegistry": {
"type": "memory"
},
"mongodb": {
"host": "iotmongo",
"port": "27017",
"db": "iotagent",
"retries": 5,
"retryTime": 5
},
"types": {
"plc": {
"service": "plcservice",
"subservice": "/demo",
"active": [
{
"name": "main",
"type": "Int16"
},
{
"name": "test1",
"type": "Int16"
},
{
"name": "test2",
"type": "Int16"
}
],
"lazy": [],
"commands": []
}
},
"browseServerOptions": null,
"service": "plc",
"subservice": "/demo",
"providerUrl": "http://iotage:4001",
"pollingExpiration": "200000",
"pollingDaemonFrequency": "20000",
"deviceRegistrationDuration": "P1M",
"defaultType": null,
"contexts": [
{
"id": "plc_1",
"type": "plc",
"service": "plcservice",
"subservice": "/demo",
"polling": false,
"mappings": [
{
"ocb_id": "test1",
"opcua_id": "ns=4;s=test.TestVar.test1",
"object_id": null,
"inputArguments": []
},
{
"ocb_id": "test2",
"opcua_id": "ns=4;s=test.TestVar.test2",
"object_id": null,
"inputArguments": []
},
{
"ocb_id": "main",
"opcua_id": "ns=4;s=MAIN.mainVar",
"object_id": null,
"inputArguments": []
}
]
}
]
}
}
I'm one of the maintainers of the iotagent-opcua repo. We have identified and fixed the bug you are hitting, so please update your agent to the latest version (1.4.0).
In case you haven't heard about it, starting from 1.3.8 we have introduced a new configuration property called "relaxTemplateValidation", which lets you use previously forbidden characters (e.g. = and ;). I suggest you have a look at it in the configuration examples provided.
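For reference, a minimal sketch of how that property might be enabled, assuming it is read from the agent's config.properties (the exact file and syntax may differ, so check the configuration examples in the repo):
# assumption: relaxTemplateValidation is exposed in config.properties
# allows characters such as '=' and ';' in ids like ns=4;s=MAIN.mainVar
relaxTemplateValidation=true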
I am facing an issue while deploying a storage account using ARM templates:
Deployment template validation failed: 'The template resource 'sneha1'
for type
'Microsoft.WindowsAzure.ResourceStack.Frontdoor.Common.Entities.TemplateGenericProperty`1[System.String]'
at line '20' and column '59' has incorrect segment lengths. A nested
resource type must have identical number of segments as its resource
name. A root resource type must have segment length one greater than
its resource name.
This is my template:
{
"$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
"contentVersion": "1.0.0.0",
"parameters": {
"storageAccounts_sneha_name": {
"defaultValue": "sneha,
"type": "String"
}
},
"variables": {},
"resources": [
{
"type": "Microsoft.Storage/storageAccounts/sneha",
"apiVersion": "2019-04-01",
"name": "[concat(parameters('storageAccounts_sneha_name'), copyIndex(1) ) ]",
"location": "centralus",
"copy":{
"Name":"rama",
"count": 5
},
"sku": {
"name": "Standard_LRS",
"tier": "Standard"
},
"kind": "StorageV2",
"properties": {
"networkAcls": {
"bypass": "AzureServices",
"virtualNetworkRules": [],
"ipRules": [],
"defaultAction": "Allow"
},
"supportsHttpsTrafficOnly": true,
"encryption": {
"services": {
"file": {
"enabled": true
},
"blob": {
"enabled": true
}
},
"keySource": "Microsoft.Storage"
},
"accessTier": "Hot"
}
},
{
"type": "Microsoft.Storage/storageAccounts/blobServices",
"apiVersion": "2019-04-01",
"name": "[concat(parameters('storageAccounts_sneha_name'), '/default')]",
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_sneha_name'))]"
],
"properties": {
"cors": {
"corsRules": []
},
"deleteRetentionPolicy": {
"enabled": false
}
}
},
{
"type": "Microsoft.Storage/storageAccounts/blobServices/containers",
"apiVersion": "2019-04-01",
"name": "[concat(parameters('storageAccounts_sneha_name'), '/default/container1')]",
"dependsOn": [
"[resourceId('Microsoft.Storage/storageAccounts/blobServices', parameters('storageAccounts_sneha_name'), 'default')]",
"[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_sneha_name'))]"
],
"properties": {
"publicAccess": "Blob"
}
}
]
}
What it says is that the resource name 'sneha1' is malformed. If you can share the resource name I can help you fix it, but in a nutshell the name should be one segment shorter than the type:
name: "xxx",
type: "microsoft.storage/storageAccounts"
or like so:
name: "xxx/diag",
type: "microsoft.storage/storageAccounts/diagnosticSettings"
This issue occurs when the number of segments in your name property does not line up with the type: the name must have exactly one segment fewer than the type.
"apiVersion": "2016-12-01",
"name": "[concat(parameters('vaultName'), '/', parameters('policyName'))]",
"type": "Microsoft.RecoveryServices/vaults/backupPolicies"
In the above example the name has two segments and the type has three.
For example:
"name":"azVault/policy1"
"type":"Microsoft.RecoveryServices/vaults/backupPolicies"
This might be your problem: remove the /sneha segment from the type and try again. Your resource currently declares:
"type": "Microsoft.Storage/storageAccounts/sneha",
"name": "[concat(parameters('storageAccounts_sneha_name'), copyIndex(1))]",
This is currently my config file
{
"config": {
"haltOnError": false
},
"source": {
"file": {
"path": "/home/user1/temp/real_user/user3.csv"
}
},
"extractor": {
"csv": {
"columns": ["id", "name", "token", "username", "password", "created", "updated", "enabled", "is_admin", "is_banned", "userAvatar"],
"columnsOnFirstLine": true
},
"field": {
"fieldName": "created",
"expression": "created.asDateTime()"
}
},
"transformers": [{
"vertex": {
"class": "user"
}
}],
"loader": {
"orientdb": {
"dbURL": "plocal:/home/user1/orientdb/real_user",
"dbAutoCreateProperties": true,
"dbType": "graph",
"classes": [{
"name": "user",
"extends": "V"
}],
"indexes": [{
"class": "user",
"fields": ["id:long"],
"type": "UNIQUE"
}]
}
}
}
and my CSV currently looks like this:
6,Olivia Ong,2jkjkl54k5jklj5k4j5k4jkkkjjkj,\N,\N,2013-11-15 16:36:33,2013-11-15 16:36:33,1,0,\N,\N
7,Matthew,32kj4h3kjh44hjk3hk43hkhhkjhasd,\N,\N,2013-11-18 17:29:13,2013-11-15 16:36:33,1,0,\N,\N
When I execute the ETL, OrientDB won't recognize my datetime values as datetimes.
I tried putting the data type in the column fields ("created:datetime"), but then no data showed up at all.
I wonder what the proper solution for this case is.
From the next version, 2.2.8, you will be able to define a different default pattern for date and datetime: see the CSV extractor documentation.
Note that when you define the columns, you need to specify each column's type:
"columns": ["id:string", "created:date", "updated:datetime"],
You can use the 2.2.8-SNAPSHOT jar of the ETL module with 2.2.7 without any problem:
https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-etl/2.2.8-SNAPSHOT/
I am trying to change the storage configuration in Apache Drill in embedded mode so that headers are extracted and the delimiter of CSV files is changed. I also renamed the new format category from csv to sap.
I tried to use the information from the documentation and created the following JSON storage configuration:
{
"type": "file",
"enabled": true,
"connection": "file:///",
"workspaces": {
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
"tmp": {
"location": "/tmp",
"writable": true,
"defaultInputFormat": null
}
},
"formats": {
"sap": {
"type": "text",
"extensions": [
"sap"
],
"skipFirstLine": false,
"extractHeader": true,
"delimiter": "|"
},
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
},
"csv": {
"type": "text",
"extensions": [
"csv"
],
"delimiter": ","
},
"tsv": {
"type": "text",
"extensions": [
"tsv"
],
"delimiter": "\t"
},
"parquet": {
"type": "parquet"
},
"json": {
"type": "json"
},
"avro": {
"type": "avro"
}
}
}
But whenever I try to save it in the Web UI, I get the message: error (invalid JSON mapping).
The exec.storage.enable_new_text_reader option is set to true.
Could somebody help me with how to add the two config items skipFirstLine and extractHeader?
BR
Drill is able to parse the header row in a text file (CSV, TSV, etc.) as of Drill 1.3. Check the documentation for this.
Check the release notes for Drill 1.3 and the CSV header parsing issue for more details.
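Once the plugin saves, querying a pipe-delimited file with its header row extracted could look like this (a sketch; the dfs plugin name, the tmp workspace, and the file name are assumptions):
SELECT *
FROM dfs.tmp.`example.sap`
LIMIT 10;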