Query tables in Apache Drill without adding extensions to the table name

I have an Apache Drill setup with the following storage plugin
{
  "type": "file",
  "connection": "file:///",
  "config": null,
  "workspaces": {
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null,
      "allowAccessOutsideWorkspace": false
    },
    "csv": {
      "location": "/home/user/data/csv",
      "writable": false,
      "defaultInputFormat": "csv",
      "allowAccessOutsideWorkspace": false
    },
    "parquet": {
      "location": "/home/user/data/parquet",
      "writable": false,
      "defaultInputFormat": "parquet",
      "allowAccessOutsideWorkspace": false
    }
  },
  "formats": {
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "skipFirstLine": true,
      "extractHeader": true,
      "delimiter": ","
    },
    "parquet": {
      "type": "parquet",
      "autoCorrectCorruptDates": false
    }
  },
  "enabled": true
}
The storage plugin name has been configured as foo.
The issue is that I want to write a query where the table name does not include the file extension.
I tried the following,
select * from foo.csv.`agency` limit 100
I get the following response
org.apache.drill.common.exceptions.UserRemoteException: VALIDATION ERROR: From line 1, column 15 to line 1, column 21: Object 'agency' not found within 'foo.csv' [Error Id: 80be4497-b71c-47dd-bc2c-6abfa425d55a on nn-hadoop-1:30112]
But this works
select * from foo.csv.`agency.csv` limit 100
Is there a way to avoid suffixing the table name (file name) with the file extension when I write my query? I have already included the defaultInputFormat in the workspace.

I found a way to resolve this issue. Instead of querying files directly, I put the file I wanted to query into a directory and queried the directory instead.
https://drill.apache.org/docs/querying-directories/
So I created a folder called agency and moved the file agency.csv into it. Now I can do
select * from foo.csv.`/agency` limit 100
and I am getting the results I was looking for.
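Once the data lives in a directory, the same query also picks up any additional CSV files added to that folder later. If you need to know which file a row came from, recent Drill versions expose an implicit filename column; a quick sketch (assuming Drill 1.8 or later):
select filename, count(*) as row_count
from foo.csv.`/agency`
group by filename;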

You can query either a table or a file.
For a table, first create it (CTAS requires a writable workspace):
create table agency as select * from foo.csv.`agency.csv`;
and then query it by name:
select * from agency;
For the second case (querying a file directly), Drill needs the full path to the file or directory (including the workspace, if you use one).
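For example, a complete session using the writable tmp workspace from the plugin above might look like this (the table name agency is just an illustration):
use foo.tmp;
create table agency as select * from foo.csv.`agency.csv`;
select * from agency limit 100;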

Another option is to use a view. First switch to a writable workspace (like the tmp workspace above):
use foo.tmp;
Then create the view
create view agency as select * from foo.csv.`agency.csv`;
Unlike creating a table, a view does not take significant disk space and will reflect the latest data in the file (in case the file is updated).
The usage is the same, like
select * from agency limit 100;
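If the file's layout changes later, the view definition can be updated in place, or removed when no longer needed; Drill supports CREATE OR REPLACE VIEW and DROP VIEW, roughly as follows:
create or replace view agency as select * from foo.csv.`agency.csv`;
drop view agency;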

Related

How do I properly use deleteMany() with an $and query in the MongoDB shell?

I am trying to delete all documents in my collection infrastructure that have a type.primary property of "pipelines" and a type.secondary property of "oil."
I'm trying to use the following query:
db.infrastructure.deleteMany({$and: [{"properties.type.primary": "pipelines"}, {"properties.type.secondary": "oil"}] }),
That returns: { acknowledged: true, deletedCount: 0 }
I expect my query to work because in MongoDB Compass, I can retrieve 182 documents that match the query {$and: [{"properties.type.primary": "pipelines"}, {"properties.type.secondary": "oil"}] }
My documents appear with the following structure (relevant section only):
properties": {
"optional": {
"description": ""
},
"original": {
"Opername": "ENBRIDGE",
"Pipename": "Lakehead",
"Shape_Leng": 604328.294581,
"Source": "EIA"
},
"required": {
"unit": null,
"viz_dim": null,
"years": []
},
"type": {
"primary": "pipelines",
"secondary": "oil"
}
...
My understanding is that I just need to pass a filter to deleteMany() and that $and expects an array of objects. For some reason the combination isn't working here.
I realized the simplest answer was the correct one -- I spelled my database name incorrectly.

Azure data factory ingest csv with full stop in header

I have a copy data activity in Azure data factory which reads in a csv file. This csv file is produced by a 3rd party so I cannot change it. One of the headings has a full stop (or period) in it: 'foo.bar'. When I run the activity I get the error message:
Failure happened on 'Source' side. ErrorCode=JsonInvalidDataFormat,
'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,
Message=Error occurred when deserializing source JSON file 'foo;bar'.
Check if the data is in valid JSON object format.,Source=Microsoft.DataTransfer.ClientLibrary,'
The csv looks like this:
state,sys_updated_on,foo.bar,sys_id
New,03/06/2021 12:42:18,S Services,xxx
Resolved,03/06/2021 12:35:06,MS Services,yyy
New,03/06/2021 12:46:18,S Services,zzz
The source dataset looks like this:
{
  "name": "my_dataset",
  "properties": {
    "linkedServiceName": {
      "referenceName": "my_linked_service",
      "type": "LinkedServiceReference"
    },
    "annotations": [],
    "type": "DelimitedText",
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "fileName": "i.csv",
        "folderPath": "Temp/exports/widgets",
        "container": "blah"
      },
      "columnDelimiter": ",",
      "escapeChar": "\\",
      "firstRowAsHeader": true,
      "quoteChar": "\""
    },
    "schema": []
  }
}
You can skip the header: do not set 'First row as header', and set 'Skip line count' to 1 in the source settings.
Alternatively, you could use a Data Flow with a Derived Column transformation to create a new schema that replaces the column foo.bar.

S3 to Redshift: Unknown boolean format

I'm executing a COPY command from S3 to Redshift, loading JSON files. I have some boolean fields in the new table I'm inserting into, and I always get the following error: "Unknown boolean format"
My JSON is well parsed, ran a million tests on that already. I've tried passing in the boolean fields as:
false // "false" // "False" // 0 // "0" // null
But I always get the same error when executing:
select * from stl_load_errors;
err_code err_reason
1210 Unknown boolean format
I've seen some comments about using IGNOREHEADER in my statement, but that isn't an option because the files I'm dealing with are in a single-row JSON format. Ignoring the header would basically mean not reading the file at all. I have other tables that load like this and work fine, but they don't have any boolean columns.
The COPY from JSON Format documentation page provides an example that includes a Boolean:
{
  "id": 0,
  "guid": "84512477-fa49-456b-b407-581d0d851c3c",
  "isActive": true,
  "tags": [
    "nisi",
    "culpa",
    "ad",
    "amet",
    "voluptate",
    "reprehenderit",
    "veniam"
  ],
  "friends": [
    {
      "id": 0,
      "name": "Carmella Gonzales"
    },
    {
      "id": 1,
      "name": "Renaldo"
    }
  ]
}
The Boolean documentation page shows values similar to what you have tried.
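For reference, here is a minimal COPY sketch; the bucket, IAM role, and table names are placeholders, and the boolean field is expected as an unquoted JSON true/false, matching the documentation example above. Querying stl_load_errors afterwards shows the offending column and raw value when a load still fails:
copy my_table
from 's3://my-bucket/data/'
iam_role 'arn:aws:iam::123456789012:role/my-redshift-role'
format as json 'auto';

-- inspect the most recent load errors (column and raw value that failed)
select filename, colname, raw_field_value, err_reason
from stl_load_errors
order by starttime desc
limit 10;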

Ad-hoc queries to a massive JSON dataset

I have a massive dataset stored in Azure BLOB in JSON format. Some apps are constantly adding new data to it. BLOBs are organized in partitions like
/dataset={name}/date={YYYY-MM-DD}/one_or_more_json_files
Data pieces do not follow any particular schema. JSON field names are not in consistent letter case. Some JSON rows can be broken.
Could someone advise a good way to query this data without defining a schema in advance? I would like to do something like
select * from my_huge_json_dataset where dataset='mydataset' and date>'2015-04-01'
without defining an explicit schema for the table.
My first consideration was Hive, but it turns out that the SerDe needs a schema to be defined to create a table. json_tuple could be an answer, but it is case-sensitive and crashes if it meets a malformed JSON row.
I am also considering Apache Drill and Pig but have no experience with them and would like some guidance.
You could use Apache Drill; you only need to configure a new storage plugin pointing to your dataset folder:
{
  "type": "file",
  "enabled": true,
  "connection": "file:///",
  "config": null,
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "json": {
      "type": "json",
      "extensions": [
        "json"
      ]
    }
  }
}
So, if you defined that storage plugin as 'dfs', for example, you could query from the root directory without defining any schema, using ANSI SQL, just like:
SELECT * FROM dfs.dataset.date.`file.json`;
or even filter by your partition folder names in the same query using the implicit dir0 (and dir1, dir2, ...) columns.
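A rough sketch of such a query, assuming the storage plugin is registered as dfs and the dataset root is mounted at a local path like /data (both names are placeholders); dir0 and dir1 refer to the first and second levels of subdirectories:
select *
from dfs.`/data`
where dir0 = 'dataset=mydataset'
and dir1 > 'date=2015-04-01';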
I encourage you to visit the Apache Drill documentation site, in your case especially the page on querying JSON files.

Apache Drill - Query HDFS and SQL

I'm trying to explore Apache Drill. I'm not a data analyst, just an infra support guy. I find the documentation on Apache Drill too limited.
I need some details about the custom data storage that can be used with Apache Drill.
Is it possible to query HDFS without Hive, using Apache Drill just like the dfs plugin does?
Is it possible to query old RDBMSs like MySQL and Microsoft SQL Server?
Thanks in advance.
Update:
My HDFS storage definition gives an error (Invalid JSON mapping):
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "storageformat": "null"
    }
  }
}
If I replace hdfs:/// with file:///, it seems to accept it.
I copied all the library files from the folder
<drill-path>/jars/3rdparty to <drill-path>/jars/
but I cannot make it work. Please help. I'm not a dev at all, just an infra guy.
Thanks in advance
Yes.
Drill directly recognizes the schema of the file based on its metadata. Refer to the link for more info:
https://cwiki.apache.org/confluence/display/DRILL/Connecting+to+Data+Sources
Not Yet.
There is a MapR driver that lets you achieve the same, but it is not inherently supported in Drill right now. There have been several discussions around this, and it might be there soon.
YES, it is possible for Drill to communicate with both the Hadoop system and RDBMS systems together. In fact, you can have queries joining both systems.
The HDFS storage plugin can be configured as:
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://xxx.xxx.xxx.xxx:8020/",
  "workspaces": {
    "root": {
      "location": "/user/cloudera",
      "writable": true,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": [
        "tsv"
      ],
      "delimiter": "\t"
    },
    "json": {
      "type": "json"
    }
  }
}
The connection URL will be your MapR/Cloudera URL, with port number 8020 by default. You should be able to spot it in your Hadoop configuration (core-site.xml) under the key fs.defaultFS.
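Once the plugin is saved, say under the name hdfs (the plugin name, file, and directory below are placeholders), you can query HDFS files directly, without Hive:
select *
from hdfs.root.`some_data.csv`
limit 10;

-- or aggregate over an entire directory of files under /user/cloudera
select count(*)
from hdfs.root.`logs`;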