How to call a stored Painless script function in Elasticsearch

I am trying to use an example from
https://www.elastic.co/guide/en/elasticsearch/reference/6.4/modules-scripting-using.html
I have created a function and saved it.
POST http://localhost:9200/_scripts/calculate-score
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.added + params.my_modifier"
  }
}
Then I try to call the saved function:
POST http://localhost:9200/users/user/_search
{
  "query": {
    "script": {
      "script": {
        "id": "calculate-score",
        "params": {
          "my_modifier": 2
        }
      }
    }
  }
}
And it returns an error: Variable [ctx] is not defined. I tried to use doc['added'] but received the same error. Please help me understand how to call the function.

You should try using doc['added'].value; let me explain why and how. In short, because the Painless scripting language is rather simple but can be obscure.
Why can't Elasticsearch find the ctx variable?
The reason it cannot find the ctx variable is that this Painless script runs in the "filter context", and that variable is simply not available there. (If you are curious, there were 18 types of Painless contexts as of ES 6.4.)
In filter context there are only two variables available:
params (Map, read-only)
User-defined parameters passed in as part of the query.
doc (Map, read-only)
Contains the fields of the current document where each field is a List of values.
It should be enough to use doc['added'].value in your case:
POST /_scripts/calculate-score
{
  "script": {
    "lang": "painless",
    "source": "doc['added'].value + params.my_modifier"
  }
}
I say "should" because there will be another problem if we try to execute it (exactly like you did):
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"doc['added'].value + params.my_modifier",
"^---- HERE"
],
"script": "calculate-score",
"lang": "painless",
"caused_by": {
"type": "class_cast_exception",
"reason": "cannot cast def [long] to boolean"
}
Because of its context, this script is expected to return a boolean:
Return
boolean
Return true if the current document should be returned as a result of the query, and false otherwise.
At this point we can understand why the script you were trying to execute did not make much sense to Elasticsearch: it is supposed to tell whether a document matches the script query or not. If the script returns an integer, Elasticsearch has no way of knowing whether that means true or false.
How to make a stored script work in filter context?
As an example we can use the following script:
POST /_scripts/calculate-score1
{
  "script": {
    "lang": "painless",
    "source": "doc['added'].value > params.my_modifier"
  }
}
Now we can access the script:
POST /users/user/_search
{
  "query": {
    "script": {
      "script": {
        "id": "calculate-score1",
        "params": {
          "my_modifier": 2
        }
      }
    }
  }
}
And it will return all documents where added is greater than 2:
"hits": [
{
"_index": "users",
"_type": "user",
"_id": "1",
"_score": 1,
"_source": {
"name": "John Doe",
"added": 40
}
}
]
This time the script returned a boolean and Elasticsearch managed to use it.
If you are curious, a range query can do the same job without any scripting.
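For completeness, here is a rough sketch of such a range query (my addition, based on the users/user index and the added field used above), which should return the same documents without any scripting:
POST /users/user/_search
{
  "query": {
    "range": {
      "added": {
        "gt": 2
      }
    }
  }
}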
Why do I have to put .value after doc['added']?
If you try to access doc['added'] directly you may notice that the error message is different:
POST /_scripts/calculate-score
{
  "script": {
    "lang": "painless",
    "source": "doc['added'] + params.my_modifier"
  }
}
"type": "script_exception",
"reason": "runtime error",
"script_stack": [
"doc['added'] + params.my_modifier",
" ^---- HERE"
],
"script": "calculate-score",
"lang": "painless",
"caused_by": {
"type": "class_cast_exception",
"reason": "Cannot apply [+] operation to types [org.elasticsearch.index.fielddata.ScriptDocValues.Longs] and [java.lang.Integer]."
}
Once again Painless shows us its obscurity: when accessing the field 'added' of the document, we obtain an instance of org.elasticsearch.index.fielddata.ScriptDocValues.Longs, which the Java Virtual Machine refuses to add to an integer (we can't blame Java here).
So we have to actually call the .getValue() method, which, translated into Painless, is simply .value.
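To illustrate the equivalence, a stored script like the following (a sketch with a hypothetical id calculate-score2, not something from the original example) should behave exactly like calculate-score1, just with the getter spelled out explicitly:
POST /_scripts/calculate-score2
{
  "script": {
    "lang": "painless",
    "source": "doc['added'].getValue() > params.my_modifier"
  }
}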
What if I want to change that field in a document?
What if you want to add 2 to the field added of some document and save the updated document? The Update API can do this.
It operates in the update context, which actually does have the ctx variable defined, and ctx in turn gives access to the original JSON document via ctx['_source'].
We might create a new script:
POST /_scripts/add-some
{
  "script": {
    "lang": "painless",
    "source": "ctx['_source']['added'] += params.my_modifier"
  }
}
Now we can use it:
POST /users/user/1/_update
{
  "script": {
    "id": "add-some",
    "params": {
      "my_modifier": 2
    }
  }
}
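To double-check the result, a plain GET of the document (no scripting involved) should now show added incremented by 2, i.e. 42 instead of 40 for the sample document above:
GET /users/user/1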
Why doesn't the example from the documentation work?
Apparently, because it is wrong. This script (from that documentation page):
POST _scripts/calculate-score
{
  "script": {
    "lang": "painless",
    "source": "Math.log(_score * 2) + params.my_modifier"
  }
}
is later executed in the filter context (in a search request, in a script query), and, as we now know, there is no _score variable available there.
This script would only make sense in the score context, i.e. when running a function_score query, which allows you to tweak the relevance score of the documents.
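For illustration, here is roughly where the documentation's calculate-score script would belong: inside a function_score query as a script_score function (a sketch assuming the users/user index from above, not a query taken from the documentation page):
POST /users/user/_search
{
  "query": {
    "function_score": {
      "query": {
        "match_all": {}
      },
      "script_score": {
        "script": {
          "id": "calculate-score",
          "params": {
            "my_modifier": 2
          }
        }
      }
    }
  }
}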
Final note
I would like to mention that, in general, it is recommended to avoid scripts where possible, because they come with a noticeable performance cost.

Related

Selecting in States.StringToJson function

Is there a way to process the result of the States.StringToJson intrinsic function directly?
Currently, in a step function, I try to handle the error from another synchronous step function call:
"OtherStepFunction": {
"Type": "Task",
"Resource": "arn:aws:states:::states:startExecution.sync:2",
"Parameters": {
"StateMachineArn": "otherstepFunctionCall",
"Input.$": "$"
},
"End": true,
"Catch": [
{
"ErrorEquals": [
"States.ALL"
],
"Comment": "OtherStepFunctionFailed",
"Next": "StatusStepFunctionFailed",
"ResultPath": "$.error"
}
]
},
All errors go to a Pass state named StatusStepFunctionFailed, with the error output in the $.error path.
$.error is composed of the error type and the cause as an escaped JSON string:
"error": {
"Error": "States.TaskFailed",
"Cause": "{\"ExecutionArn\":\"otherfunctionarm:executionid\",\"Input\":\"foooooo\"}"
}
Is there any way to extract only the ExecutionArn from this input? In my Pass step, I convert the Cause path to JSON, but I didn't find a way to select the ExecutionArn part directly. The following:
"reason.$": "States.JsonMerge($.error.Cause).ExecutionArn"
returns The value for the field 'reason.$' must be a valid JSONPath or a valid intrinsic function call (at /States/HandleResource/Iterator/States/StatusStepFunctionFailedHandleJSON/Parameters).
My current workaround is to use two Pass states: first convert the output, then do the formatting.
I had a similar issue.
What I did was create a task to put the Cause into a new path parameter using States.StringToJson. I put that task as the Next state after the error and then called the subsequent task from that one.
Using your variable names and values:
In the Catch, change the Next from StatusStepFunctionFailed to parseErrorCause
Then parseErrorCause is like this:
"parseErrorCause": {
"Type": "Pass",
"Parameters": {
"Result.$": "States.StringToJson($.error.Cause)"
},
"ResultPath": "$.parsedJSON",
"Next": "StatusStepFunctionFailed"
},
And StatusStepFunctionFailed accesses
"Variable": "$.parsedJSON.Result.Input",
to get foooooo
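Since the Cause JSON in your example already contains ExecutionArn at the top level, StatusStepFunctionFailed can then pick it out directly; for example, if it is a Pass state, something like this (a sketch using the field names from your error payload):
"StatusStepFunctionFailed": {
  "Type": "Pass",
  "Parameters": {
    "reason.$": "$.parsedJSON.Result.ExecutionArn"
  },
  "End": true
}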

Azure Data Factory - attempting to add params to dynamic content in the body of a REST API request

In Azure Data Factory, I'm attempting to add parameters to the body of a copy task (connected to a REST API POST request as the source). I want to use dynamic content to do so, but I'm struggling to find the proper syntax. Here's what I have so far.
copy task
dynamic content
{
  "datatable": {
    "start": 0,
    "length": 10000,
    "filters": [
      {
        "name": "Arrival Dates",
        "start": "pipeline().parameters.pDate1",
        "end": "pipeline().parameters.pDate2"
      }
    ],
    "sort": [
      {
        "name": "start_date",
        "order": "ASC"
      }
    ]
  }
}
You'll notice that I've added parameters for the dates. Is this the correct syntax for adding dynamic content? The autocomplete tried to add the @ sign at the beginning of the code block, which causes the entire thing to error out. I've tried adding it before each parameter, but that isn't actually reading the dynamic values either.
This is not correct. You need to use concat to concatenate the different variables. Something like this:
#concat('{ "datatable": { "start":0, "length": 10000, "filters": [ { "name": "Arrival Dates", "start": "',pipeline().parameters.pDate1,'", "end": "',pipeline().parameters.pDate2,'" } ], "sort": [ { "name": "start_date", "order": "ASC" } ] } }')
This is also documented in the SO question.
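An alternative that can be easier to read is ADF string interpolation, where the body stays as literal JSON text and only the parameter references are wrapped in @{...} (a sketch of the same body, assuming the same pDate1/pDate2 pipeline parameters):
{
  "datatable": {
    "start": 0,
    "length": 10000,
    "filters": [
      {
        "name": "Arrival Dates",
        "start": "@{pipeline().parameters.pDate1}",
        "end": "@{pipeline().parameters.pDate2}"
      }
    ],
    "sort": [
      {
        "name": "start_date",
        "order": "ASC"
      }
    ]
  }
}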

Need documentation for *.analysis.windows.net/public/reports/querydata

I am reverse engineering an app that sends queries to
SOMESERVERNAME.analysis.windows.net/public/reports/querydata via an HTTP POST of a JSON-structured query.
Some initial lines of a sample query are at the end of this message.
I can't find any documentation on this anywhere. I don't know if this is some secret API or what. I ultimately would like to just ignore the aggregations altogether and just dump the raw data, which seems to sit in some flat-file type container on the back-end, but without some API documentation I'm stuck with just re-running the super basic handful of queries I've been able to intercept.
Note: this app is an embedded analytics page created with PowerBI, but the only REST API I can find for PowerBI has nothing to do with querying, but just basic object management.
Thanks!
{
  "version": "1.0.0",
  "queries": [
    {
      "Query": {
        "Commands": [
          {
            "SemanticQueryDataShapeCommand": {
              "Query": {
                "Version": 2,
                "From": [
                  {
                    "Name": "s",
                    "Entity": "Sheet1"
                  }
                ],
                "Select": [
                  {
                    "Aggregation": {
                      "Expression": {
                        "Column": {
                          "Expression": {
                            "SourceRef": {
                              "Source": "s"
                            }
                          },
                          "Property": "Total"
                        }
                      },
                      "Function": 0
                    },
                    "Name": "Sum(Sheet1.Total)"
                  }
                ],
                "Where": [
                  {
                    "Condition": {
                      "In": {
                        "Expressions": [
                          {
                            "Column": {
                              "Expression": {
                                "SourceRef": {
                                  "Source": "s"
                                }
                              },
                              "Property": "Year"
                            }
                          }
                        ],
                        "Values": [
                          [
                            {
                              "Literal": {
                                "Value": "'2018'"
                              }
                            }
                          ]
                        ]
                      }
                    }
                  },
............
I have built a client that scrapes data off a specific Power BI report using the same API, but probably you'll be able to adapt it to your use case. Maybe we can even abstract the code into a more generalized Power BI client!
Having tinkered with the API for two days, I realised that there are many ways the data can be formatted:
"nested"/multidimensional data can be unflattened, flattened by 1 degree, etc.
a primary "table" of a result dataset (in data.PH) can reference others (in data.SH)
The basics are as follows:
A dataset is structured like a multidimensional table, with cells containing values.
In a set of cells, the first always has a field S that contains the schema of itself and all subsequent cells.
The schema maps a field of each cell's object with a selection from your query, e.g. the G0 field with the queried column age.
My client seems to work only with a specific type of query (SemanticQueryDataShapeCommand), a specific number of dimensions and a specific column marked as primary (via Binding.Primary). But maybe that helps! https://github.com/derhuerst/fetch-bvg-occupancy/blob/1ebb864b1ff7130f9d2f0ab031c6d78bcabdd633/lib/parse-dataset.js
The only documented way to use this API is through the ADOMD.NET or OleDb provider.
If you want to send a DAX/MDX query and retrieve data programmatically, there's a sample of how to front-end the service with a simple REST API here.
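To give an idea of what the ADOMD.NET route looks like, here is a rough C# sketch (the workspace, dataset and table names are made up, it targets the documented XMLA endpoint of a workspace rather than the undocumented querydata endpoint, and authentication setup is omitted):
// Requires the Microsoft.AnalysisServices.AdomdClient package.
using System;
using Microsoft.AnalysisServices.AdomdClient;

class DumpTable
{
    static void Main()
    {
        // Hypothetical workspace/dataset names; adjust to your environment.
        var connectionString =
            "Data Source=powerbi://api.powerbi.com/v1.0/myorg/MyWorkspace;Initial Catalog=MyDataset;";

        using (var connection = new AdomdConnection(connectionString))
        {
            connection.Open();

            // A DAX query that dumps the raw rows of a table instead of an aggregation.
            using (var command = new AdomdCommand("EVALUATE 'Sheet1'", connection))
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    for (var i = 0; i < reader.FieldCount; i++)
                    {
                        Console.Write(reader.GetName(i) + "=" + reader.GetValue(i) + "  ");
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}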

Logstash json field removal

We have a heavily nested JSON document containing server metrics. The document contains > 1000 fields, some of which are completely irrelevant to us for analytic purposes, so I would like to remove them before indexing the document in Elasticsearch.
However, I am unable to find the correct filter to use, as the fields I want to remove have common names in multiple different objects within the document.
The source document looks like this (reduced in size for brevity):
[
  {
    "server": {
      "is_master": true,
      "name": "MYServer",
      "id": 2111
    },
    "metrics": {
      "Server": {
        "time": {
          "boundary": {},
          "type": "TEXT",
          "display_name": "Time",
          "value": "2018-11-01 14:57:52"
        }
      },
      "Mem_OldGen": {
        "used": {
          "boundary": {},
          "display_name": "Used(mb)",
          "value": 687
        },
        "committed": {
          "boundary": {},
          "display_name": "Committed(mb)",
          "value": 7116
        },
        "cpu_count": {
          "boundary": {},
          "display_name": "Cores",
          "value": 4
        }
      }
    }
  }
]
The data is loaded into logstash using the http_poller input plugin and needs to be processed before sending to Elastic for indexing.
I am trying to remove the fields that are not relevant for us to track for analytical purposes; these include the "display_name" and "boundary" fields from each JSON object in the different metrics.
I have tried using the mutate filter to remove the fields, but because they exist in so many different objects it requires too many hard-coded paths to be added to the Logstash config.
I have also looked at the ruby filter, which seems promising as it can look at the event, but I am unable to get it to crawl the entire JSON document or, more importantly, actually remove the fields.
Here is what I was trying as a test:
filter {
  split {
    field => "message"
  }
  ruby {
    code => '
      event.get("[metrics][Mem_OldGen][used]").to_hash.keys.each { |k|
        logger.info("field is:", k)
        if k.include?("display_name")
          event.remove(k)
        end
        if k.include?("boundary")
          event.remove(k)
        end
      }
    '
  }
}
It first splits the input at the message level to create one event per server, then tries to remove the fields from a specific metric.
Any help would be greatly appreciated.
If I get the point, you want to keep just the value key.
So, considering the response hash:
response = {
  "server": {
    "is_master": true,
    "name": "MYServer",
    "id": 2111
  },
  "metrics": {
    ...
You could do:
response[:metrics].transform_values { |hh| hh.transform_values { |h| h.delete_if { |k,v| k != :value } } }
#=> {:server=>{:is_master=>true, :name=>"MYServer", :id=>2111}, :metrics=>{:Server=>{:time=>{:value=>"2018-11-01 14:57:52"}}, :Mem_OldGen=>{:used=>{:value=>687}, :committed=>{:value=>7116}, :cpu_count=>{:value=>4}}}}
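If you want to apply that idea inside Logstash itself, a ruby filter roughly like the following should walk the whole metrics subtree and drop the unwanted keys at any depth (a sketch only; the exact field path depends on how your http_poller/split output is structured):
filter {
  ruby {
    code => '
      # Recursively remove "display_name" and "boundary" keys at any nesting level.
      prune = lambda do |obj|
        case obj
        when Hash
          obj.delete("display_name")
          obj.delete("boundary")
          obj.each_value { |v| prune.call(v) }
        when Array
          obj.each { |v| prune.call(v) }
        end
      end

      metrics = event.get("[metrics]")
      if metrics
        prune.call(metrics)
        event.set("[metrics]", metrics)
      end
    '
  }
}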

rename invalid keys from JSON

I have the following flow in NiFi; the JSON has 1000+ objects in it.
InvokeHTTP -> SplitJson -> PutMongo
The flow works fine until I receive some keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".
My current solution is not optimal: I have jotted down the bad keys and am using multiple ReplaceText processors to replace "." with "_". I am not using regex, I am using string literal find/replace. So each time I get a failure in the PutMongo processor, I insert a new ReplaceText processor.
This is not maintainable. I am wondering if I can use JOLT for this. A couple of notes regarding the input JSON:
1) There is no set structure; the only thing that is confirmed is that everything will be in the events array, but the event object itself is free-form.
2) Maximum list size = 1000.
3) It is 3rd-party JSON, so I can't ask for a change in format.
Also, a key with "." can appear anywhere, so I am looking for a JOLT spec that can cleanse keys at all levels and then rename them.
{
  "events": [
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1531896847915,
      "type": "EDITED",
      "details": {
        "previous_attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "spark_conf": {
            "spark.databricks.acl.dfAclsEnabled": "true",
            "spark.databricks.repl.allowedLanguages": "python,sql"
          },
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "attributes": {
          "cluster_name": "Kylo",
          "spark_version": "4.1.x-scala2.11",
          "node_type_id": "Standard_DS3_v2",
          "driver_node_type_id": "Standard_DS3_v2",
          "autotermination_minutes": 10,
          "enable_elastic_disk": true,
          "cluster_source": "UI"
        },
        "previous_cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "cluster_size": {
          "autoscale": {
            "min_workers": 1,
            "max_workers": 8
          }
        },
        "user": ""
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535540053785,
      "type": "TERMINATING",
      "details": {
        "reason": {
          "code": "INACTIVITY",
          "parameters": {
            "inactivity_duration_min": "15"
          }
        }
      }
    },
    {
      "cluster_id": "0717-035521-puny598",
      "timestamp": 1535537117300,
      "type": "EXPANDED_DISK",
      "details": {
        "previous_disk_size": 29454626816,
        "disk_size": 136828809216,
        "free_space": 17151311872,
        "instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
      }
    }
  ]
}
I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first . in each JSON key on each pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also with JSON where the keys and values are on different lines (with the : on either line):
...
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled":
"true",
"spark.databricks.repl.allowedLanguages"
: "python,sql"
},
...
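For illustration only (a hypothetical configuration, not necessarily the exact regex from the template), a ReplaceText processor set up roughly like this renames one dot per key on each pass:
Replacement Strategy: Regex Replace
Evaluation Mode: Entire text
Search Value: "([^".]+)\.([^"]*)"(\s*:)
Replacement Value: "$1_$2"$3
A RouteOnContent processor matching something like "[^"]*\.[^"]*"\s*: can then route flowfiles that still contain dotted keys back into the ReplaceText processor until nothing matches.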
You could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ., perform a collect operation to do the replacement, and re-insert the keys in the JSON data if you want a single processor to do this in a single pass.
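As a starting point for that single-pass approach, an ExecuteScript (Groovy) body along these lines should do the job; it is only a sketch, and it renames every dotted key at any depth by replacing "." with "_":
import groovy.json.JsonOutput
import groovy.json.JsonSlurper
import java.nio.charset.StandardCharsets
import org.apache.nifi.processor.io.StreamCallback

def flowFile = session.get()
if (!flowFile) return

// Recursively rebuild maps/lists, replacing "." with "_" in every key.
def renameKeys
renameKeys = { node ->
    if (node instanceof Map) {
        node.collectEntries { k, v -> [(k.toString().replace('.', '_')): renameKeys(v)] }
    } else if (node instanceof List) {
        node.collect { renameKeys(it) }
    } else {
        node
    }
}

flowFile = session.write(flowFile, { inputStream, outputStream ->
    def json = new JsonSlurper().parse(inputStream)
    outputStream.write(JsonOutput.toJson(renameKeys(json)).getBytes(StandardCharsets.UTF_8))
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)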