How to generate a PSV file from Apache Drill

The current way I am going about creating a pipe-separated value (PSV) file is to first create a view with a query like
CREATE VIEW ABC AS
SELECT column1 || '|' || column2 || '|' || ..
and then use the !record command to capture the output of a SELECT * FROM ABC.
This takes a lot of development time and is error prone, as the files I need to generate have hundreds of columns.
Is there a simple way of going about this?

In your storage plugin configuration, create a custom format. Here is the documentation:
https://drill.apache.org/docs/plugin-configuration-basics/
"formats": {
"psv": {
"type": "text",
"extensions": [
"tbl"
],
"delimiter": "|"
}
}
Alter your session to set the default store format:
alter session set `store.format`='psv';
Use CTAS to write the data in the format specified above:
create table `users.vgunnu`.`vt_del_test` as select * from dfs.root.`/tmp/test_parquet` limit 3;
More info on CTAS:
http://drill.apache.org/docs/create-table-as-ctas-command/
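If the CTAS output doesn't come out pipe-delimited, it can help to confirm the session option actually took effect before re-running it; a quick sanity check (assuming a standard Drill setup):
SELECT * FROM sys.options WHERE name = 'store.format';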

Related

How to use the '$' special character in Athena SQL DML queries?

I have data coming into S3 from Mixpanel, and Mixpanel adds a '$' character before some event properties. Sample:
"event": "$ae_session",
"properties": {
"time": 1646816604,
"distinct_id": "622367f395dd06c26f311c46",
"$ae_session_length": 17.2,
"$app_build_number": "172",
"$app_release": "172",...}
As the '$' special character is not supported in Athena, I need some sort of escaping to proceed from here. I would really appreciate any help regarding this.
The error I am getting in subsequent DML queries after my DDL creates the table:
HIVE_METASTORE_ERROR: Error: name expected at the position 262 of
'struct<distinct_id:string,
sheetid:string,
addedUserId:string,
memberId:string,
communityId:string,
businessId:string,
time:timestamp,
communityBusinessType:string,
initialBusinessType:string,
sheetRowIndex:string,
dataType:varchar(50),
screenType:varchar(50),
rowIndex:int,
$ae_session_length:int>' but '$' is found.
(Service: null; Status Code: 0; Error Code: null; Request ID: null; Proxy: null)
Since I cannot change the column names (they are populated directly from Mixpanel at a daily interval), I really think there should be a workaround for this somehow!
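One possible direction, sketched under the assumption that the table is defined with the OpenX JSON SerDe (which Athena supports): that SerDe's "mapping" serde properties can rename a '$'-prefixed JSON key to a legal column name, so '$' never has to appear in the DDL. Table and column names below are illustrative placeholders; verify the mapping behaviour against the SerDe's documentation.
CREATE EXTERNAL TABLE mixpanel_events (
  event string,
  properties struct<
    distinct_id:string,
    ae_session_length:double
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  -- assumed: remap the illegal '$' key to the legal field name above
  "mapping.ae_session_length" = "$ae_session_length"
)
LOCATION 's3://your-bucket/mixpanel/';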

I have a MySQL database where I am trying to retrieve JSON data that stores URLs but the results keep coming back empty

The data in the column looks like this:
{
    "activity": {
        "token": "e7b64be4-74d4-7a6d-a74b-xxxxxxx",
        "route": "http://example.com/enroll/confirmation",
        "url_parameters": {
            "Success": "True",
            "ContractNumber": "003992314W",
            "Barcode": "1908Y10Z",
            "price": "8.99"
        },
        "server_info": {
            "cookie": [
                "_ga=xxxx; _fbp=xxx; _hjid=xxx; XDEBUG_SESSION=XDEBUG_ECLIPSE;"
            ],
            "upgrade-insecure-requests": [
                "1"
            ],
        },
        "campaign": "Unknown/None",
        "ip": "192.168.10.1",
        "entity": "App\\Models\\User",
        "entity_id": "1d9f3066-13ce-4659-b10d-xxxxx",
    },
    "time": "2021-05-21 20:15:02"
}
The code I am using is below:
SELECT *
FROM websote.stored_events
WHERE JSON_EXTRACT(event_properties, '$.route') = 'http://example.com/enroll/confirmation'
ORDER BY created_at DESC LIMIT 500;
The code works on the other JSON values, just not the URL ones. I've tried escaping the values in MySQL like this:
SELECT *
FROM websote.stored_events
WHERE JSON_EXTRACT(event_properties, '$.route') = 'http:///example.com//enroll//confirmation'
ORDER BY created_at DESC LIMIT 500;
But still no luck. Any help on this would be appreciated!
Route is a nested property; I would have expected the path to be
JSON_EXTRACT(event_properties, '$.activity.route')
Your example data isn't valid JSON. You can't have a comma after the last element in an object or array:
"entity_id": "1d9f3066-13ce-4659-b10d-xxxxx",
},
^ here
If I remove that and the other similar cases, your data inserts into a JSON column and I can extract the object element you described:
mysql> select json_extract(event_properties, '$.activity.route') as route from stored_events;
+------------------------------------------+
| route                                    |
+------------------------------------------+
| "http://example.com/enroll/confirmation" |
+------------------------------------------+
Note the value is returned with double-quotes. This is because it's returned as a JSON document, a scalar string. If you want the raw value, you have to unquote it:
mysql> select json_unquote(json_extract(event_properties, '$.activity.route')) as route from stored_events;
+----------------------------------------+
| route                                  |
+----------------------------------------+
| http://example.com/enroll/confirmation |
+----------------------------------------+
If you want to search for that value, you would have to do a similar expression:
select * from stored_events
where json_unquote(json_extract(event_properties, '$.activity.route'))
= 'http://example.com/enroll/confirmation'
Searching based on object properties stored in JSON has disadvantages.
It requires complex expressions that force you (and anyone else who needs to maintain your code) to learn a lot of details about how JSON works.
It cannot be optimized with an index. This query will run a table scan. You can add virtual columns with indexes (see the sketch below), but that adds complexity, and if you need to ALTER TABLE to add virtual columns, it defeats the point of using JSON to store semi-structured data.
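A minimal sketch of that virtual-column workaround, assuming MySQL 5.7+ and the stored_events table above (the column and index names are illustrative):
ALTER TABLE stored_events
  -- expose the nested route value as a generated column...
  ADD COLUMN route VARCHAR(255)
    GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(event_properties, '$.activity.route'))) VIRTUAL,
  -- ...and give it a secondary index so equality searches stop table-scanning
  ADD INDEX idx_route (route);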
The bottom line is that if you find yourself using JSON functions in the WHERE clause of a query, it's a sign that you should be storing the column you want to search as a normal column, not as part of a JSON document.
Then you can write code that is easy to read, easy for your colleagues to maintain, and can be optimized easily with indexes:
SELECT * FROM stored_events
WHERE route = 'http://example.com/enroll/confirmation';
You can still store other properties in the JSON document, but the ones you want to be searchable should be stored in normal columns.
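For concreteness, a minimal sketch of that hybrid layout (the surrounding column names and types are assumptions):
CREATE TABLE stored_events (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  route VARCHAR(255),        -- searchable value promoted to a real, indexable column
  event_properties JSON,     -- everything else stays semi-structured
  created_at DATETIME,
  INDEX idx_route (route)
);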
You might like to view my presentation How to Use JSON in MySQL Wrong.

AWS Athena and handling JSON

I have millions of files with the following (poor) JSON format:
{
    "3000105002": [
        {
            "pool_id": "97808",
            "pool_name": "WILDCAT (DO NOT USE)",
            "status": "Zone Permanently Plugged",
            "bhl": "D-12-10N-05E 902 FWL 902 FWL",
            "acreage": ""
        },
        {
            "pool_id": "96838",
            "pool_name": "DRY & ABANDONED",
            "status": "Zone Permanently Plugged",
            "bhl": "D-12-10N-05E 902 FWL 902 FWL",
            "acreage": ""
        }
    ]
}
I've tried to generate an Athena DDL that would accommodate this type of structure (especially the api field) with this:
CREATE EXTERNAL TABLE wp_info (
api:array < struct < pool_id:string,
pool_name:string,
status:string,
bhl:string,
acreage:string>>)
LOCATION 's3://foo/'
After trying to generate a table with this, the following error is thrown:
Your query has the following error(s):
FAILED: ParseException line 2:12 cannot recognize input near ':' 'array' '<' in column type
What is a workable solution to this issue? Note that the api string is different in every one of the millions of files. The api key itself does not actually appear within any of the files, so I hope there is a way for Athena to accommodate just the string-typed value in these data.
If you don't have control over the JSON format that you are receiving, and you don't have a streaming service in the middle to transform the JSON format to something simpler, you can use regex functions to retrieve the relevant data that you need.
A simple way to do it is to use Create-Table-As-Select (CTAS) query that will convert the data from its complex JSON format to a simpler table format.
CREATE TABLE new_table
WITH (
    external_location = 's3://path/to/ctas_partitioned/',
    format = 'Parquet',
    parquet_compression = 'SNAPPY')
AS SELECT
    regexp_extract(line, '"pool_id": "(\d+)"', 1) as pool_id,
    regexp_extract(line, '"pool_name": "([^"]*)"', 1) as pool_name,
    ...
FROM json_lines_table;
Queries against the new table will also perform better, since the data is stored in Parquet format.
Note that you can also update the table when you get new data, by running the CTAS query again with external_location set to 's3://path/to/ctas_partitioned/part=01' or any other partition scheme.
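The CTAS above reads from a source table that exposes each file as raw text, one line per row; a minimal sketch of such a table (the table name, the single-column trick, and the location are assumptions to verify against your data):
CREATE EXTERNAL TABLE json_lines_table (
  line string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'  -- assuming no tabs in the data, each whole line lands in `line`
LOCATION 's3://foo/';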

Can you SQL populate a BigQuery table and set the table column modes in the same API call?

I'm using Google Apps Script to migrate data through BigQuery and I've run into an issue: the SQL I'm using to perform a WRITE_TRUNCATE load causes the destination table to be recreated with column modes of NULLABLE rather than their previous mode of REQUIRED.
Attempting to change the modes to REQUIRED after the data is loaded using a metadata patch causes an error even though the columns don't contain any null values.
I considered working around the issue by dropping the table and recreating it again with the same REQUIRED modes, then loading the data using WRITE_APPEND instead of WRITE_TRUNCATE. But this isn't possible because a user wants to have the same source and destination table in their SQL.
Does anyone know if it's possible to define a BigQuery.Jobs.insert request that includes the output schema information/metadata?
If it's not possible, the only alternative I can see is to use my original workaround of a WRITE_APPEND but add a temporary table into the process, to allow for the destination table appearing in the source SQL. But if this can be avoided, that would be nice.
Additional Information:
I did experiment with different ways of setting the schema information, but when they didn't return an error message the schema seemed to be ignored.
i.e. this is the JSON I'm passing into BigQuery.Jobs.insert:
var jsnConfig = {
  "configuration": {
    "query": {
      "destinationTable": {
        "projectId": "my-project",
        "datasetId": "sandbox_dataset",
        "tableId": "hello_world"
      },
      "writeDisposition": "WRITE_TRUNCATE",
      "useLegacySql": false,
      "query": "SELECT COL_A, COL_B, '1' AS COL_C, COL_TIMESTAMP, COL_REQUIRED FROM `my-project.sandbox_dataset.hello_world_2` ",
      "allowLargeResults": true,
      "schema": {
        "fields": [
          {
            "description": "Desc of Column A",
            "type": "STRING",
            "mode": "NULLABLE",
            "name": "COL_A"
          },
          {
            "description": "Desc of Column B",
            "type": "STRING",
            "mode": "REQUIRED",
            "name": "COL_B"
          },
          {
            "description": "Desc of Column C",
            "type": "STRING",
            "mode": "REPEATED",
            "name": "COL_C"
          },
          {
            "description": "Desc of Column Timestamp",
            "type": "INTEGER",
            "mode": "NULLABLE",
            "name": "COL_TIMESTAMP"
          },
          {
            "description": "Desc of Column Required",
            "type": "STRING",
            "mode": "REQUIRED",
            "name": "COL_REQUIRED"
          }
        ]
      }
    }
  }
};
var job = BigQuery.Jobs.insert(jsnConfig, "my-project");
The result is that the new or existing hello_world table is truncated and loaded with the data specified in the query (so part of the JSON payload is being read), but the column descriptions and modes aren't applied as defined in the schema section; they're just blank and NULLABLE in the table.
More
When I tested the REST request above using Google's API page for BigQuery.Jobs.insert, it highlighted the "schema" property in the request as invalid. It appears the schema can be defined if you're loading the data from a file, i.e. BigQuery.Jobs.Load, but that functionality doesn't seem to be supported if you're putting the data in using a SQL source.
See the documentation here: https://cloud.google.com/bigquery/docs/schemas#specify-schema-manual-python
You can pass a schema object with your load job, meaning you can set fields to mode=REQUIRED.
This is the command you should use:
bq --location=[LOCATION] load --source_format=[FORMAT] [PROJECT_ID]:[DATASET].[TABLE] [PATH_TO_DATA_FILE] [PATH_TO_SCHEMA_FILE]
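For illustration, [PATH_TO_SCHEMA_FILE] points at a JSON schema file along these lines (the field definitions below simply mirror the question's columns):
[
  {"name": "COL_A", "type": "STRING", "mode": "NULLABLE", "description": "Desc of Column A"},
  {"name": "COL_B", "type": "STRING", "mode": "REQUIRED", "description": "Desc of Column B"},
  {"name": "COL_TIMESTAMP", "type": "INTEGER", "mode": "NULLABLE", "description": "Desc of Column Timestamp"},
  {"name": "COL_REQUIRED", "type": "STRING", "mode": "REQUIRED", "description": "Desc of Column Required"}
]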
As #Roy answered, this is done via load only. Can you output the logs of this command?

MySQL query to retrieve tabular data in JSON format

I have a table like the one below in a MySQL database:
user-name   mail
ganesh      g#g.com
gani        gani#gani.com
gan         gan#gan.com
I need a query that retrieves the above table in JSON format.
Example:
[
  {
    user-name: "ganesh",
    mail: "g#g.com"
  },
  {
    user-name: "gani",
    mail: "gani#gani.com"
  },
  {
    user-name: "gan",
    mail: "gan#gan.com"
  }
]
I need help to do the above.
It's not recommended to do such things in the DBMS; do it in the script that is loading the data instead. If you're wrapping some legacy code you can't edit, then wrap it with more code to format the data.
If all that fails, do something like this: http://www.thomasfrank.se/mysql_to_json.html
SELECT
  CONCAT(
    "[",
    GROUP_CONCAT(
      CONCAT("{user-name:'", `user-name`, "'"),
      CONCAT(",mail:'", mail, "'}")
    ),
    "]"
  ) AS json
FROM users;
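On MySQL 5.7.22 and later, the built-in JSON functions do the same job and emit valid, properly quoted JSON; a sketch against the same (assumed) users table:
SELECT JSON_ARRAYAGG(
         JSON_OBJECT('user-name', `user-name`, 'mail', mail)
       ) AS json
FROM users;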