Apache Drill - Query HDFS and SQL - mysql

I'm trying to explore Apache Drill. I'm not a data analyst, just an infra support guy, and the documentation on Apache Drill seems quite limited.
I need some details about the custom data storage that can be used with Apache Drill:
Is it possible to query HDFS without Hive, using Apache Drill directly, just like the dfs plugin does?
Is it possible to query traditional RDBMSs like MySQL and Microsoft SQL Server?
Thanks in advance
Update:
My HDFS storage definition gives an error (Invalid JSON mapping):
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs:///",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": true,
      "storageformat": "null"
    }
  }
}
If I replace hdfs:/// with file:///, it seems to accept it.
I also copied all the library files from <drill-path>/jars/3rdparty to <drill-path>/jars/.
I still cannot make it work. Please help; I'm not a developer at all, just an infra guy.
Thanks in advance

Yes.
Drill directly recognizes the schema of the file based on its metadata. Refer to this link for more info:
https://cwiki.apache.org/confluence/display/DRILL/Connecting+to+Data+Sources
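For example, with the built-in dfs plugin a raw file can be queried in place, no Hive involved (the file path below is just an illustration):
SELECT * FROM dfs.`/tmp/sample.json` LIMIT 10;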
Not yet.
There is a MapR driver that lets you achieve something similar, but it is not natively supported in Drill at the moment. There have been several discussions around this, and it might be available soon.

Yes, it is possible for Drill to communicate with both the Hadoop system and RDBMS systems together. In fact, you can have queries joining both systems.
The HDFS storage plugin can be configured as follows:
{
  "type": "file",
  "enabled": true,
  "connection": "hdfs://xxx.xxx.xxx.xxx:8020/",
  "workspaces": {
    "root": {
      "location": "/user/cloudera",
      "writable": true,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "parquet": {
      "type": "parquet"
    },
    "psv": {
      "type": "text",
      "extensions": ["tbl"],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": ["csv"],
      "delimiter": ","
    },
    "tsv": {
      "type": "text",
      "extensions": ["tsv"],
      "delimiter": "\t"
    },
    "json": {
      "type": "json"
    }
  }
}
The connection URL will be your MapR/Cloudera namenode URL, with port 8020 by default. You should be able to spot it in your Hadoop configuration (core-site.xml) under the key "fs.defaultFS".
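Once a plugin like the one above is registered (say under the name hdfs), a file on HDFS can then be queried straight from sqlline or the Drill web UI without any Hive table definition, for example (the file name sample.csv is hypothetical):
-- query a CSV file under /user/cloudera (the root workspace defined above)
SELECT * FROM hdfs.root.`sample.csv` LIMIT 10;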

Related

Substituting service url in an ARM template

I have an ARM template that deploys APIs to an API Management instance. Here is an example of one API:
{
  "properties": {
    "authenticationSettings": {
      "subscriptionKeyRequired": false
    },
    "subscriptionKeyParameterNames": {
      "header": "Ocp-Apim-Subscription-Key",
      "query": "subscription-key"
    },
    "apiRevision": "1",
    "isCurrent": true,
    "subscriptionRequired": true,
    "displayName": "DDD.CRM.PostLeadRequest",
    "serviceUrl": "https://test1/api/FuncCreateLead?code=XXXXXXXXXX",
    "path": "CRMAPI/PostLeadRequest",
    "protocols": [
      "https"
    ]
  },
  "name": "[concat(variables('ApimServiceName'), '/mms-crm-postleadrequest')]",
  "type": "Microsoft.ApiManagement/service/apis",
  "apiVersion": "2019-01-01",
  "dependsOn": []
}
When deploying this to different environments I would like to be able to substitute the service url depending on the environment, and I'm wondering about the best approach.
Can I read in a config file or something like that?
At the time of deployment I have a variable that tells me the environment, so I can base decisions on that; I'm just not sure of the best way to do it.
See ARM template parameters: https://learn.microsoft.com/en-us/azure/azure-resource-manager/resource-group-authoring-templates#parameters They can be specified in a separate file, so you have a single template but environment-specific parameter files.
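A minimal sketch of that approach (the parameter name crmServiceUrl and the file name azuredeploy.parameters.test.json are illustrative): declare a parameter in the template and reference it from the API resource,
"parameters": {
  "crmServiceUrl": {
    "type": "string",
    "metadata": {
      "description": "Backend service URL for this environment"
    }
  }
},
...
"serviceUrl": "[parameters('crmServiceUrl')]",
then keep one parameter file per environment, e.g. azuredeploy.parameters.test.json, and pass it at deployment time:
{
  "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentParameters.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "crmServiceUrl": {
      "value": "https://test1/api/FuncCreateLead?code=XXXXXXXXXX"
    }
  }
}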

Defining Azure Stream Analytics iot-hub input source through Powershell

I'm trying to write a PowerShell script that creates a new Stream Analytics job in my Azure account, with an IoT hub as the input source and a blob storage account as the output.
To do so, I'm using the AzureRM New-AzureRmStreamAnalyticsJob cmdlet and JSON files.
My problem is that I have not seen any documentation or example of a JSON file where the input source is an IoT hub, only an event hub.
What parameters do I need to provide in the JSON file? Can anyone show an example JSON file for a Stream Analytics job with an IoT hub as the input source?
I got the answer eventually: the required field I had to add to the inputs Olivier posted here is:
"endpoint": "messages/events"
I added it under the DataSource Properties section, and it works fine!
Thanks Olivier
To come back to the error message you are seeing: to add to Olivier's sample, you need a property named endpoint which corresponds to the endpoint in IoT Hub. If you are looking for telemetry messages this will be:
"endpoint": "messages/events"
This can be found in the schema for Azure ARM: https://github.com/Azure/azure-rest-api-specs/blob/current/specification/streamanalytics/resource-manager/Microsoft.StreamAnalytics/2016-03-01/examples/Input_Create_Stream_IoTHub_Avro.json
So, to complete Olivier's example, when using API version '':
"Inputs": [
{
"Name": "Hub",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('asaConsumerGroup')]",
"iotHubNamespace": "[parameters('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]",
"endpoint": "messages/events"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
],
That'd look like the following for the inputs part of the ASA resource:
"Inputs": [
{
"Name": "IoTHubStream",
"Properties": {
"DataSource": {
"Properties": {
"consumerGroupName": "[variables('CGName')]",
"iotHubNamespace": "[variables('iotHubName')]",
"sharedAccessPolicyKey": "[listkeys(variables('iotHubKeyResource'), variables('iotHubVersion')).primaryKey]",
"sharedAccessPolicyName": "[variables('iotHubKeyName')]"
},
"Type": "Microsoft.Devices/IotHubs"
},
"Serialization": {
"Properties": {
"Encoding": "UTF8"
},
"Type": "Json"
},
"Type": "Stream"
}
}
]

Data Factory: AzureSQL in- and output for pipeline activity type AzureMLBatchExecution

In Azure Data Factory, I'm trying to call an Azure Machine Learning model from a Data Factory pipeline. I want to use an Azure SQL table as input and another Azure SQL table for the output.
First I deployed a Machine Learning (classic) web service. Then I created an Azure Data Factory pipeline, using a LinkedService (type 'AzureML', using the Request URI and API key of the ML web service) and an input and output dataset (of type 'AzureSqlTable').
Deploying and provisioning succeed. The pipeline starts as scheduled, but keeps 'Running' without any result. The pipeline activity is not shown in the Monitor & Manage Activity Windows.
On different sites and in tutorials, I only find JSON scripts using the activity type 'AzureMLBatchExecution' with blob in- and outputs. I want to use Azure SQL in- and output, but I can't get this working.
Can someone provide a sample JSON script or tell me what's possibly wrong with the code below?
Thanks!
{
  "name": "Predictive_ML_Pipeline",
  "properties": {
    "description": "use MyAzureML model",
    "activities": [
      {
        "type": "AzureMLBatchExecution",
        "typeProperties": {},
        "inputs": [
          {
            "name": "AzureSQLDataset_ML_Input"
          }
        ],
        "outputs": [
          {
            "name": "AzureSQLDataset_ML_Output"
          }
        ],
        "policy": {
          "timeout": "02:00:00",
          "concurrency": 3,
          "executionPriorityOrder": "NewestFirst",
          "retry": 1
        },
        "scheduler": {
          "frequency": "Week",
          "interval": 1
        },
        "name": "My_ML_Activity",
        "description": "prediction analysis on ML batch input",
        "linkedServiceName": "AzureMLLinkedService"
      }
    ],
    "start": "2017-04-04T09:00:00Z",
    "end": "2017-04-04T18:00:00Z",
    "isPaused": false,
    "hubName": "myml_hub",
    "pipelineMode": "Scheduled"
  }
}
With a little help from a Microsoft technician, I've got this working. The JSON script mentioned above only changed in the schedule section:
"start": "2017-04-01T08:45:00Z",
"end": "2017-04-09T18:00:00Z",
A pipeline is active only between its start time and end time. Because the scheduler is set to weekly, the pipeline is triggered at the start of the week, and that date should fall between the start and end dates. For more details about scheduling, see: https://learn.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
The Azure SQL input dataset should look like this:
{
  "name": "AzureSQLDataset_ML_Input",
  "properties": {
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "SRC_SQL_Azure",
    "typeProperties": {
      "tableName": "dbo.Azure_ML_Input"
    },
    "availability": {
      "frequency": "Week",
      "interval": 1
    },
    "external": true,
    "policy": {
      "externalData": {
        "retryInterval": "00:01:00",
        "retryTimeout": "00:10:00",
        "maximumRetry": 3
      }
    }
  }
}
I added the external and policy properties to this dataset (see script above) and after that, it worked.
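For reference, the corresponding output dataset can stay simpler, since it does not need the external/policy settings; a minimal sketch, assuming a destination table dbo.Azure_ML_Output and a linked service named DST_SQL_Azure (both names are illustrative):
{
  "name": "AzureSQLDataset_ML_Output",
  "properties": {
    "published": false,
    "type": "AzureSqlTable",
    "linkedServiceName": "DST_SQL_Azure",
    "typeProperties": {
      "tableName": "dbo.Azure_ML_Output"
    },
    "availability": {
      "frequency": "Week",
      "interval": 1
    }
  }
}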

Defining query parameters for basic CRUD operations in Loopback

We are using Loopback successfully so far, but we want to add query params to our API documentation.
In our swagger.json file, we might have something that looks like =>
{
  "swagger": "2.0",
  "info": {
    "version": "1.0.0",
    "title": "poc-discovery"
  },
  "basePath": "/api",
  "paths": {
    "/Users/{id}/accessTokens/{fk}": {
      "get": {
        "tags": [
          "User"
        ],
        "summary": "Find a related item by id for accessTokens.",
        "operationId": "User.prototype.__findById__accessTokens",
        "parameters": [
          {
            "name": "fk",
            "in": "path",
            "description": "Foreign key for accessTokens",
            "required": true,
            "type": "string",
            "format": "JSON"
          },
          {
            "name": "id",
            "in": "path",
            "description": "User id",
            "required": true,
            "type": "string",
            "format": "JSON"
          },
          {
            "name": "searchText",
            "in": "query",
            "description": "The Product that needs to be fetched",
            "required": true,
            "type": "string"
          },
          {
            "name": "ctrCode",
            "in": "query",
            "description": "The Product locale that needs to be fetched. Example: en-GB, fr-FR, etc.",
            "required": true,
            "type": "string"
          }
        ],
I am 99% certain the swagger.json information gets generated dynamically via information from the .json files in the /server/models directory.
I am hoping that I can add the query params that we accept for each model in those .json files. What I want to avoid is having to modify swagger.json directly.
What is the best approach to adding our query params so that they show up in our docs? I'm very confused as to how best to approach this.
After a few hours of tinkering, I'm afraid there is no straightforward way to achieve this, as the swagger spec generated here is a representation of the remoting metadata for model methods, along with model data from the model.json files.
Thus, updating the remoting metadata for built-in model methods would be challenging, and it might not be fully supported by the method implementations.
The right approach, IMO, is to:
- create a remoteMethod wrapper around the built-in method for which you want additional params to be injected, with the required HTTP mapping data (see the sketch below), and
- disable the REST endpoint for the built-in method using
MyModel.disableRemoteMethod(<methodName>, <isStatic>).
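A minimal sketch of that approach (LoopBack 2.x style): the model name Product and the wrapper method search are hypothetical, while the query params searchText and ctrCode come from the swagger snippet in the question.
// common/models/product.js
module.exports = function(Product) {
  // wrapper around the built-in find(), exposing the extra query params
  Product.search = function(searchText, ctrCode, cb) {
    // build a filter from the query params and delegate to the built-in find
    Product.find({where: {name: {like: searchText}, locale: ctrCode}}, cb);
  };

  Product.remoteMethod('search', {
    description: 'Find products by search text and locale',
    accepts: [
      {arg: 'searchText', type: 'string', required: true, http: {source: 'query'}},
      {arg: 'ctrCode', type: 'string', required: true, http: {source: 'query'}}
    ],
    returns: {arg: 'products', type: 'array', root: true},
    http: {verb: 'get', path: '/search'}
  });

  // hide the built-in REST endpoint that the wrapper replaces
  // (the JS method Product.find is still available to the wrapper)
  Product.disableRemoteMethod('find', true);
};
The accepts entries carry the HTTP mapping data, so the query params then appear in the generated swagger.json without editing it directly.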

Creating Ext.data.Model using JSON config

In the app we're developing, we create all the JSON on the server side using dynamically generated configs (JSON objects). We use that for stores (and other things, like GUIs), with a dynamically generated list of data fields.
With a JSON like this:
{
"proxy": {
"type": "rest",
"url": "/feature/163",
"timeout": 600000
},
"baseParams": {
"node": "163"
},
"fields": [{"name": "id", "type": "int" },
{"name": "iconCls", "type": "auto"},
{"name": "text","type": "string"
},{ "name": "name", "type": "auto"}
],
"xtype": "jsonstore",
"autoLoad": true,
"autoDestroy": true
}, ...
Ext will kindly create an "implicit model" that I'll be able to work with: load it into forms, save it, delete it, etc.
What I want is to specify through a JSON config not the fields, but the model itself. Is this possible?
Something like:
{
  model: {
    name: 'MiClass',
    extends: 'Ext.data.Model',
    "proxy": {
      "type": "rest",
      "url": "/feature/163",
      "timeout": 600000
    },
    etc...
  },
  "autoLoad": true,
  "autoDestroy": true
}, ...
That way I would be able to create the whole JSON on the server without having to glue things together using JS statements on the client side.
Best regards,
I don't see why not. The syntax to create a model class is similar to that of stores and components:
Ext.define('MyApp.model.MyClass', {
extend:'Ext.data.Model',
fields:[..]
});
So if you take this apart, you could call Ext.define(className, config);
where className is a string and config is a JSON object, both generated on the server.
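A minimal sketch of that idea, assuming ExtJS 4+; the endpoint /feature/163/model and the response shape {name: ..., config: ...} are assumptions for illustration, not part of Ext's API:
// ask the server for a model description and define the class at runtime
Ext.Ajax.request({
    url: '/feature/163/model',   // hypothetical endpoint returning e.g. {"name": "MiClass", "config": {...}}
    success: function(response) {
        var desc = Ext.decode(response.responseText);
        // Ext.define takes a class name plus a plain config object,
        // so the whole definition can come from the server
        Ext.define(desc.name, Ext.apply({ extend: 'Ext.data.Model' }, desc.config));
        // once defined, a store can reference the model by name
        var store = Ext.create('Ext.data.Store', {
            model: desc.name,
            autoLoad: true
        });
    }
});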
There's no way to achieve what I want.
The only thing you can do is define the fields of the Ext.data.Store and let it generate the implicit model by using the fields configuration.