Apache Drill not able to set default workspace

At the moment I am only able to work in the dfs.tmp workspace, which is quite annoying. So I tried to change the default workspace to a new (existing) folder (owned by the drill user):
"workspaces": {
"default": {
"location": "/var/drill",
"writable": true,
"defaultInputFormat": null
},
"root": {
"location": "/",
"writable": false,
"defaultInputFormat": null
},
...
But it does not work:
CREATE TABLE `test` as SELECT 'Test' FROM (VALUES(1))
Returns the following error, which indicates that the modified settings get ignored.
org.apache.drill.common.exceptions.UserRemoteException: PARSE ERROR: Root schema is immutable. Creating or dropping tables/views is not allowed in root schema.Select a schema using 'USE schema' command.
I also tried it with a prefix (without success):
CREATE TABLE dfs.default.`test` as SELECT 'Test' FROM (VALUES(1))
PARSE ERROR: Encountered ". default" at line 1, column 17.
I also tried restarting Drill and making root writable.

Adding this answer just to combine the two existing answers by ColemanTO and devツ and to show an example.
So, as the other answers have said, the word "default" is reserved in a Drill query. You correctly refer to docs that say to create a new default workspace in order to, say, define a writable root (/) workspace. However, the documentation also gives an example where, in order to actually reference the custom "default" workspace, you need to add backticks.
So if you added a workspace
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs:///",
  "config": null,
  "workspaces": {
    "default": {
      "location": "/some/path",
      "writable": true,
      "defaultInputFormat": null
    }
    ...
  }
  ...
}
you would refer to it in a query like:
SELECT * FROM dfs.`default`.`path/relative/to/custom/default/location` LIMIT 10;
In the case of your original posted problem, you could also create a new workspace called, say, "var_drill", so that you don't have to escape the keyword "default" in your queries.
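For illustration, here is a minimal sketch that submits both variants through Drill's REST API from Python. The host/port (localhost:8047) and the "var_drill" workspace name are assumptions, not something taken from your setup:

import requests

DRILL_URL = "http://localhost:8047/query.json"  # assumed local Drillbit REST/web port

def run_drill(sql: str) -> dict:
    """Submit a SQL statement to Drill's REST API and return the JSON response."""
    resp = requests.post(DRILL_URL, json={"queryType": "SQL", "query": sql})
    resp.raise_for_status()
    return resp.json()

# Option 1: keep the workspace named "default" but escape the reserved word with backticks.
print(run_drill("CREATE TABLE dfs.`default`.`test` AS SELECT 'Test' FROM (VALUES(1))"))

# Option 2: use a workspace name that is not a reserved word (hypothetical "var_drill" workspace).
print(run_drill("CREATE TABLE dfs.var_drill.`test2` AS SELECT 'Test' FROM (VALUES(1))"))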

Use any name other than default for your workspaces.
default is a reserved keyword in Drill.

"Note: Default is a reserved word. You must enclose reserved words when used as identifiers in back ticks."

Yes, default is a reserved keyword and it should be enclosed in backticks.


Can I get the file names for synced text files in my pipeline in Foundry?

I have a bunch of text files that are synced from my raw system, I want an easy way to use the file names (in addition to the contents of the files) downstream in Foundry transforms.
I know this is possible using raw file access, but that seems complicated; I just want the file names next to the data.
The response from ollie299792458 will work only if dataFrameReaderClass is com.palantir.foundry.spark.input.TextDataFrameReader.
Alternatively, you can get the file name when reading the dataset in Code Repositories or Workbooks using Spark's input_file_name function:
Creates a string column for the file name of the current Spark task.
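As a quick sketch of that approach in PySpark (the path below is hypothetical; in Foundry the DataFrame would come from your transform's input instead of spark.read):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical location of the synced text files.
df = spark.read.text("/path/to/synced/text/files")

# input_file_name() returns the full path of the file each row was read from.
df = df.withColumn("file_path", F.input_file_name())

# Keep just the bare file name by stripping the directory part.
df = df.withColumn("file_name", F.regexp_extract("file_path", r"([^/]+)$", 1))

df.select("file_name", "value").show(truncate=False)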
If you're immediately going into code repos or code workbooks, then you can use the input_file_name() function (see proggeo's answer below). This is likely easier and simpler than the method below, but won't work if you're going to do something else with the data.
Schema Method
If you open your dataset, then go to Details -> Schema, you can edit the schema to add a file path column; for each row this will contain the path of the file that the row came from.
The key part is the _filePath member of fieldSchemaList and "addFilePath": true under customMetadata. The first is a special column that TextDataFrameReader populates with the file path, the second tells the reader to populate that column. The other column in the example below (content) just contains everything in each file.
For more details see the Metadata section in the Foundry core backend in-platform documentation. This is also possible for CSVs and more structured data with different reader classes.
Full schema example
{
  "fieldSchemaList": [
    {
      "type": "STRING",
      "name": "content",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    },
    {
      "type": "STRING",
      "name": "_filePath",
      "nullable": null,
      "userDefinedTypeClass": null,
      "customMetadata": {},
      "arraySubtype": null,
      "precision": null,
      "scale": null,
      "mapKeyType": null,
      "mapValueType": null,
      "subSchemas": null
    }
  ],
  "dataFrameReaderClass": "com.palantir.foundry.spark.input.TextDataFrameReader",
  "customMetadata": {
    "textParserParams": {
      "parser": "SINGLE_COLUMN_PARSER",
      "nullValues": null,
      "nullValuesPerColumn": null,
      "charsetName": "UTF-8",
      "addFilePath": true,
      "addByteOffset": false,
      "addImportedAt": false
    }
  }
}
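Once the schema is applied, the _filePath column can be used downstream like any other column. A rough sketch (the sample rows here are made up stand-ins for the dataset produced by the schema above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Stand-in for the dataset produced with the schema above (content + _filePath columns).
df = spark.createDataFrame(
    [("first line", "/data/raw/file_a.txt"), ("second line", "/data/raw/file_b.txt")],
    ["content", "_filePath"],
)

# Example downstream use: derive the bare file name and count rows per source file.
per_file = (
    df.withColumn("file_name", F.element_at(F.split("_filePath", "/"), -1))
      .groupBy("file_name")
      .count()
)
per_file.show(truncate=False)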

Can you SQL populate a BigQuery table and set the table column modes in the same API call?

I'm using Google Apps Script to migrate data through BigQuery and I've run into an issue because the SQL I'm using to perform a WRITE_TRUNCATE load is causing the destination table to be recreated with column modes of NULLABLE rather than their previous mode of REQUIRED.
Attempting to change the modes to REQUIRED after the data is loaded using a metadata patch causes an error even though the columns don't contain any null values.
I considered working around the issue by dropping the table and recreating it again with the same REQUIRED modes, then loading the data using WRITE_APPEND instead of WRITE_TRUNCATE. But this isn't possible because a user wants to have the same source and destination table in their SQL.
Does anyone know if it's possible to define a BigQuery.Jobs.insert request that includes the output schema information/metadata?
If it's not possible, the only alternative I can see is to use my original workaround of a WRITE_APPEND but add a temporary table into the process, to allow for the destination table appearing in the source SQL. But if this can be avoided that would be nice.
Additional Information:
I did experiment with different ways of setting the schema information, but when they didn't return an error message the schema seemed to get ignored.
I.e. this is the JSON I'm passing into BigQuery.Jobs.insert:
jsnConfig =
{
  "configuration":
  {
    "query":
    {
      "destinationTable":
      {
        "projectId": "my-project",
        "datasetId": "sandbox_dataset",
        "tableId": "hello_world"
      },
      "writeDisposition": "WRITE_TRUNCATE",
      "useLegacySql": false,
      "query": "SELECT COL_A, COL_B, '1' AS COL_C, COL_TIMESTAMP, COL_REQUIRED FROM `my-project.sandbox_dataset.hello_world_2` ",
      "allowLargeResults": true,
      "schema":
      {
        "fields":
        [
          {
            "description": "Desc of Column A",
            "type": "STRING",
            "mode": "NULLABLE",
            "name": "COL_A"
          },
          {
            "description": "Desc of Column B",
            "type": "STRING",
            "mode": "REQUIRED",
            "name": "COL_B"
          },
          {
            "description": "Desc of Column C",
            "type": "STRING",
            "mode": "REPEATED",
            "name": "COL_C"
          },
          {
            "description": "Desc of Column Timestamp",
            "type": "INTEGER",
            "mode": "NULLABLE",
            "name": "COL_TIMESTAMP"
          },
          {
            "description": "Desc of Column Required",
            "type": "STRING",
            "mode": "REQUIRED",
            "name": "COL_REQUIRED"
          }
        ]
      }
    }
  }
}
var job = BigQuery.Jobs.insert(jsnConfig, "my-project");
The result is that the new or existing hello_world table is truncated and loaded with the data specified in the query (so part of the json package is being read), but the column descriptions and modes aren't added as defined in the schema section. They're just blank and NULLABLE in the table.
More
When I tested the REST request above using Google's API page for BigQuery.Jobs.insert, it highlighted the "schema" property in the request as invalid. It appears the schema can be defined if you're loading the data from a file, i.e. BigQuery.Jobs.Load, but it doesn't seem to support that functionality if you're putting the data in using a SQL source.
See the documentation here: https://cloud.google.com/bigquery/docs/schemas#specify-schema-manual-python
You can pass a schema object with your load job, meaning you can set fields to mode=REQUIRED.
This is the command you should use:
bq --location=[LOCATION] load --source_format=[FORMAT] [PROJECT_ID]:[DATASET].[TABLE] [PATH_TO_DATA_FILE] [PATH_TO_SCHEMA_FILE]
As @Roy answered, this is done via load only. Can you output the logs of this command?
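For reference, a minimal sketch of the same load-job approach with the google-cloud-bigquery Python client (the project, dataset, table and GCS URI are placeholders); it shows where the schema, including REQUIRED modes and descriptions, is attached to the job:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    schema=[
        bigquery.SchemaField("COL_A", "STRING", mode="NULLABLE", description="Desc of Column A"),
        bigquery.SchemaField("COL_B", "STRING", mode="REQUIRED", description="Desc of Column B"),
        bigquery.SchemaField("COL_REQUIRED", "STRING", mode="REQUIRED", description="Desc of Column Required"),
    ],
)

# Placeholder URI; the point is that a load job accepts an explicit schema,
# whereas a query job with WRITE_TRUNCATE does not.
load_job = client.load_table_from_uri(
    "gs://my-bucket/hello_world.json",
    "my-project.sandbox_dataset.hello_world",
    job_config=job_config,
)
load_job.result()  # wait for the job to finish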

Couchbase FTS Index update through REST API

From the latest Couchbase docs, I can see that an FTS index can be created/updated using:
PUT /api/index/{indexName}
Creates/updates an index definition.
I created an index with the name fts-idx, and it was created successfully.
But it looks like updating the index through the REST API fails.
Response:
responseMessage : ,{"error":"rest_create_index: error creating index: fts-idx, err: manager_api: cannot create index because an index with the same name already exists: fts-idx"
Is there anything I have missed here?
I was able to replicate this issue, and I think I figured it out. It's not a bug, but it should really be documented better.
You need to pass in the index's UUID as part of the PUT (I think this is a concurrency check). You can get the index's current uuid via GET /api/index/fts-index (it's in indexDef->uuid)
And once you have that, make it part of your update PUT body:
{
  "name": "fts-index",
  "type": "fulltext-index",
  "params": {
    // ... etc ...
  },
  "sourceType": "couchbase",
  "sourceName": "travel-sample",
  "sourceUUID": "307a1042c094b7314697980312f4b66b",
  "sourceParams": {},
  "planParams": {
    // ... etc ...
  },
  "uuid": "89a125824b012319" // <--- right here
}
Once I did that, the update PUT went through just fine.
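A rough sketch of that GET-then-PUT flow in Python (the host, port 8094, credentials and index name are assumptions for a default local install):

import requests

BASE = "http://localhost:8094/api/index/fts-idx"  # FTS REST endpoint on a default install
AUTH = ("Administrator", "password")              # placeholder credentials

# 1. Fetch the current definition to obtain the index's uuid.
current = requests.get(BASE, auth=AUTH).json()
index_def = current["indexDef"]

# 2. Modify fields of index_def here as needed; keep the "uuid" field intact.

# 3. PUT the updated definition back; the uuid acts as the concurrency check.
resp = requests.put(BASE, auth=AUTH, json=index_def)
resp.raise_for_status()
print(resp.json())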

Update multiple elements of a list using Couchbase N1QL

context
Somewhere in my Couchbase documents, I have a node looking like this:
"metadata": {
"configurations": {
"AU": {
"enabled": false,
"order": 2147483647
},
"BE": {
"enabled": false,
"order": 2147483647
},
"BG": {
"enabled": false,
"order": 2147483647
} ...
}
}
and it goes along with a list of country codes and their "enabled" state
what I want to achieve
update this document to mark it as disabled ("enabled" = false) for all countries
to do this I hoped this syntax would work (let's say I'm trying to update the document with id 03c53a2d-6208-4a35-b9ec-f61e74d81dab)
UPDATE `data` t
SET country.enabled = false
FOR country IN t.metadata.configurations END
where meta(t).id = "03c53a2d-6208-4a35-b9ec-f61e74d81dab";
but it seems like it doesn't change anything on my document
any hints ? :)
thanks guys,
As the field names are dynamic, you can generate them using OBJECT_NAMES() and use that during the update of the field.
UPDATE data t USE KEYS "03c53a2d-6208-4a35-b9ec-f61e74d81dab"
SET t.metadata.configurations.[v].enabled = false FOR v IN OBJECT_NAMES(t.metadata.configurations) END ;
In the above example OBJECT_NAMES(t.metadata.configurations) generates ["AU", "BE", "BG"].
When a JSON field is referenced as .[v], v is evaluated and its value becomes the field name.
So during the looping construct, t.metadata.configurations.[v].enabled becomes
t.metadata.configurations.`AU`.enabled,
t.metadata.configurations.`BE`.enabled,
t.metadata.configurations.`BG`.enabled
depending on the value of v.
This query should work:
update data
use keys "03c53a2d-6208-4a35-b9ec-f61e74d81dab"
set country.enabled = false for country within metadata.configurations when
country.enabled is defined end
The WITHIN allows "country" to be found at any level of the metadata.configurations structure, and we use the "WHEN country.enabled IS DEFINED" to make sure we are looking at the correct type of "country" structure.
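If you want to run either statement programmatically, here is a minimal sketch against the N1QL REST query service (host, port 8093 and credentials are assumptions for a default local cluster):

import requests

QUERY_URL = "http://localhost:8093/query/service"  # N1QL query service on a default install
AUTH = ("Administrator", "password")                # placeholder credentials

statement = """
UPDATE `data` t USE KEYS "03c53a2d-6208-4a35-b9ec-f61e74d81dab"
SET t.metadata.configurations.[v].enabled = false
FOR v IN OBJECT_NAMES(t.metadata.configurations) END
RETURNING t.metadata.configurations
"""

resp = requests.post(QUERY_URL, auth=AUTH, json={"statement": statement})
resp.raise_for_status()
print(resp.json()["results"])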

Error while adding new documents to Azure Search index

I have an index with a couple of fields of type Edm.String and Collection(Edm.String). I want to have another index with the same fields plus another field of type Edm.Double. When I create such an index and try to add the same values (plus the newly added Edm.Double value) as I did to the first index, I'm getting the following error:
{
  "error": {
    "code": "",
    "message": "The request is invalid. Details: parameters : An unexpected 'StartArray' node was found when reading from the JSON reader. A 'PrimitiveValue' node was expected.\r\n"
  }
}
Does anyone know what this error means? I tried looking for it on the Internet but I couldn't find anything related to my situation. A sample request I'm sending to the new index looks like this:
POST https://myservicename.search.windows.net/indexes/newindexname/docs/index?api-version=2016-09-01
{
  "value": [{
    "@search.action": "upload",
    "keywords": ["red", "lovely", "glowing", "cute"],
    "name": "sample document",
    "weight": 0.5,
    "id": "67"
  }]
}
The old index is the same but it doesn't have the "weight" parameter.
Edit: I created the index using the portal, so I don't have the exact JSON to create the index, but the fields are roughly like this:
Field      Type                     Attributes                            Analyzer
---------------------------------------------------------------------------------------
id         Edm.String               Key, Retrievable
name       Edm.String               Searchable, Filterable, Retrievable   Eng-Microsoft
keywords   Collection(Edm.String)   Searchable, Filterable, Retrievable   Eng-Microsoft
weight     Edm.Double               Filterable, Sortable
The reason I got the error was that I had made a mistake and was trying to send a Collection(Edm.String) value when the actual type of the field in the index was Edm.String.
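In other words, the "StartArray" complaint means the payload contained a JSON array where the index field expected a single value. A small sketch of the upload in Python (service name, index name and api-key are placeholders), with the field types matching the index definition above:

import requests

URL = ("https://myservicename.search.windows.net"
       "/indexes/newindexname/docs/index?api-version=2016-09-01")
HEADERS = {"api-key": "<admin-api-key>", "Content-Type": "application/json"}

doc = {
    "@search.action": "upload",
    "id": "67",
    "name": "sample document",             # Edm.String -> single string
    "keywords": ["red", "lovely", "cute"],  # Collection(Edm.String) -> JSON array
    "weight": 0.5,                          # Edm.Double -> number
}
# Sending an array for a field defined as Edm.String (or vice versa) is what
# produces the "'StartArray' node ... 'PrimitiveValue' expected" error.

resp = requests.post(URL, headers=HEADERS, json={"value": [doc]})
resp.raise_for_status()
print(resp.json())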