How to perform partial matching on _id in Elasticsearch - json

I am trying to perform partial word matching on the _id field in my Elasticsearch instance.
After searching the official documentation, I found that the best way to do this is to create an n-gram analyzer, so using Sense I did this:
PUT /index2
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "partial_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 20
        }
      },
      "analyzer": {
        "partial": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "partial_filter"
          ]
        }
      }
    }
  }
}
I have tried to test the analyzer using:
POST /index2/_analyze
{
  "analyzer": "partial",
  "text": "brown fox"
}
And it worked as expected, producing the proper combinations.
The next step should be to apply the analyzer to the relevant fields, so I tried to do this:
PUT /index2/_mapping/type2
{
  "type2": {
    "properties": {
      "_id": {
        "type": "string",
        "analyzer": "partial"
      }
    }
  }
}
But I am getting an error:
"reason": "Field [_id] is defined twice in [type2]"
Probably because the _id field is created automatically when index2 is created, along with the analyzer.
So my question is: how can I use partial search on the _id field?
Is there any other way to do this?
Thanks in advance!
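For what it's worth, the usual workaround, given that the _id metadata field cannot be assigned a custom analyzer, is to copy the id into a regular field at index time and apply the partial analyzer there. A minimal sketch (id_copy is a hypothetical field name that you must populate yourself when indexing):

PUT /index2/_mapping/type2
{
  "type2": {
    "properties": {
      "id_copy": {
        "type": "string",
        "analyzer": "partial"
      }
    }
  }
}

PUT /index2/type2/abc123
{
  "id_copy": "abc123"
}

POST /index2/_search
{
  "query": {
    "match": {
      "id_copy": "c12"
    }
  }
}

In practice you would usually also configure a plain search-time analyzer on id_copy, so that the query text itself is not n-grammed and the query does not match on every sub-gram.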

Related

Extract value of Tags from cloudTrail logs using Athena

I am trying to query cloudtrail logs using Athena. My goal is to find specific instances and extract them with their Tags.
The query I am using is:
SELECT eventTime,
       awsRegion,
       json_extract(responseelements, '$.instancesSet.items[0].instanceId') AS instanceId,
       json_extract(responseelements, '$.instancesSet.items[0].tagSet.items') AS TAGS
FROM cloudtrail_logs_PP
WHERE (eventName = 'RunInstances' OR eventName = 'StartInstances')
  AND requestparameters LIKE '%mytest1%'
  AND "timestamp" BETWEEN '2021/09/01' AND '2021/10/01'
ORDER BY eventTime;
Using this query, I am able to get all Tags under one column.
I want to extract only specific Tags and need help with this: how can I extract only a specific Tag?
I tried enhancing my query with json_extract(responseelements, '$.instancesSet.items[0].tagSet.items[0]'), but the order of Tags differs from log to log, so I can't rely on a fixed index position.
My JSON file in S3 is something like below:
{
  "eventVersion": "1",
  "eventTime": "2022-05-27T18:44:29Z",
  "eventName": "RunInstances",
  "awsRegion": "us-east-1",
  "requestParameters": {
    "instancesSet": {
      "items": [{
        "imageId": "ami-1234545",
        "keyName": "DDKJKD"
      }]
    },
    "instanceType": "m5.2xlarge",
    "monitoring": {
      "enabled": false
    },
    "hibernationOptions": {
      "configured": false
    }
  },
  "responseElements": {
    "instancesSet": {
      "items": [{
        "tagSet": {
          "items": [{
            "key": "11",
            "value": "DS"
          }, {
            "key": "1",
            "value": "A"
          }]
        }
      }]
    }
  }
}
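One way around the ordering problem, as a sketch rather than a definitive answer: unnest the tagSet items and filter on the tag key instead of addressing a fixed index. The tag key 'Name' below is a placeholder for whichever key you need:

SELECT eventTime,
       awsRegion,
       json_extract_scalar(tag, '$.key')   AS tag_key,
       json_extract_scalar(tag, '$.value') AS tag_value
FROM cloudtrail_logs_PP
CROSS JOIN UNNEST(
    CAST(json_extract(responseelements, '$.instancesSet.items[0].tagSet.items') AS ARRAY(JSON))
) AS t(tag)
WHERE (eventName = 'RunInstances' OR eventName = 'StartInstances')
  AND json_extract_scalar(tag, '$.key') = 'Name';

Each tag becomes its own row, so the position of a tag inside tagSet.items no longer matters.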

Elasticsearch exact phrase match on JSON

I am working on exact phrase matching on a JSON field using Elasticsearch. I have tried multiple query types, such as multi_match, query_string & simple_query_string, but they do not return results that exactly match the given phrase.
The query_string syntax I am using:
"query":{
"query_string":{
"fields":[
"json.*"
],
"query":"\"legal advisor\"",
"default_operator":"OR"
}
}
}
I also tried a filter instead of a query, but the filter does not return any results on json. The syntax I used for the filter is:
"query": {
"bool": {
"filter": {
"match": {
"json": "legal advisor"
}
}
}
}
}
Now the question is: is it possible to perform an exact match operation on json using Elasticsearch?
You can try using a multi_match query with type phrase:
{
  "query": {
    "multi_match": {
      "query": "legal advisor",
      "fields": [
        "json.*"
      ],
      "type": "phrase"
    }
  }
}
Since you have not provided your sample docs and expected results, I am assuming you are looking for a phrase match. Adding a working sample.
Index sample docs, which will also generate the index mapping:
{
  "title": "legal advisor"
}
{
  "title": "legal expert advisor"
}
Now, if you are looking for an exact phrase search for legal advisor, use the query below:
{
  "query": {
    "match_phrase": {
      "title": "legal advisor"
    }
  }
}
which returns only the first doc:
"hits": [
{
"_index": "64989158",
"_type": "_doc",
"_id": "1",
"_score": 0.5753642,
"_source": {
"title": "legal advisor"
}
}
]
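If by "exact match" you mean the entire field value must match literally (so legal expert advisor should never match), a term query against the keyword sub-field is the usual approach. A sketch, assuming the default dynamic mapping, which generates a title.keyword sub-field:

POST /64989158/_search
{
  "query": {
    "term": {
      "title.keyword": "legal advisor"
    }
  }
}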

Will there be a performance overhead when using an index having Object_Pairs (in case of a covered query) - Couchbase

Suppose I create an index on OBJECT_PAIRS(values).val.data.
Will my index store the "values" field as an array (with elements name for the ID and val for the data, due to OBJECT_PAIRS)?
If so, and if my N1QL query is a covered query (fetching only OBJECT_PAIRS(values).val.data via the SELECT clause), will there still be a performance overhead? I am under the impression that in this case, since the index would already contain the "values" field as an array, no actual OBJECT_PAIRS transformation would take place, avoiding the overhead; only in the case of a non-covered query would the actual document be accessed and the OBJECT_PAIRS transformation done on the "values" field.
Couchbase document:
"values": {
"item_1": {
"data": [{
"name": "data_1",
"value": "A"
},
{
"name": "data_2",
"value": "XYZ"
}
]
},
"item_2": {
"data": [{
"name": "data_1",
"value": "123"
},
{
"name": "data_2",
"value": "A23"
}
]
}
}
}```
UPDATE:
Suppose we plan to create an index on OBJECT_PAIRS(values)[*].val.data & OBJECT_PAIRS(values)[*].name:
Index: CREATE INDEX idx01 ON ent_comms_tracking(ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END)
Query: SELECT ARRAY { value.name, value.val.data} FOR value IN object_pairs(values) END as values_array FROM bucket
Can you please paste your full create index statement?
Creating an index on OBJECT_PAIRS(values).val.data indexes nothing.
You can check this by creating a primary index and then running the query below:
SELECT OBJECT_PAIRS(`values`).val FROM mybucket
Output is:
[
{}
]
OBJECT_PAIRS(values) returns an array of objects containing the attribute name and value pairs of the object values:
SELECT OBJECT_PAIRS(`values`) FROM mybucket
[
  {
    "$1": [
      {
        "name": "item_1",
        "val": {
          "data": [
            {
              "name": "data_1",
              "value": "A"
            },
            {
              "name": "data_2",
              "value": "XYZ"
            }
          ]
        }
      },
      {
        "name": "item_2",
        "val": {
          "data": [
            {
              "name": "data_1",
              "value": "123"
            },
            {
              "name": "data_2",
              "value": "A23"
            }
          ]
        }
      }
    ]
  }
]
It's an array, so its val cannot be referenced directly; you iterate over the array elements instead.
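To get at the nested data you iterate over that array, for example:

SELECT ARRAY v.val.data FOR v IN OBJECT_PAIRS(`values`) END AS all_data
FROM mybucket;

This is essentially what the ARRAY ... FOR ... END construction in the index definition above does as well.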

In Logic Apps, parsing a JSON array throws an error for a single object but works fine for multiple objects

While parsing JSON in an Azure Logic App, the array in my payload can contain a single object or multiple objects (Box, as shown in the example below).
Both types of input are valid, but when only a single object comes in, it throws the error "Invalid type. Expected Array but got Object".
Input 1 (throws an error):
{
  "MyBoxCollection": {
    "Box": {
      "BoxName": "Box 1"
    }
  }
}
Input 2 (works fine):
{
  "MyBoxCollection": {
    "Box": [
      {
        "BoxName": "Box 1"
      },
      {
        "BoxName": "Box 2"
      }
    ]
  }
}
JSON Schema:
"MyBoxCollection": {
  "type": "object",
  "properties": {
    "Box": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "BoxName": {
            "type": "string"
          },
          ...
        }
      }
    }
  }
}
Error details:
[
  {
    "message": "Invalid type. Expected Array but got Object .",
    "lineNumber": 0,
    "linePosition": 0,
    "path": "Order.MyBoxCollection.Box",
    "schemaId": "#/properties/Root/properties/MyBoxCollection/properties/Box",
    "errorType": "type",
    "childErrors": []
  }
]
I used to use the trick of injecting a couple of dummy rows in the resultset as suggested by the other posts, but I recently found a better way. Kudos to Thomas Prokov for providing the inspiration in his NETWORG blog post.
The JSON parse schema accepts multiple choices as type, so simply replace
"type": "array"
with
"type": ["array","object"]
and your parse step will happily parse either an array or a single value (or no value at all).
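Applied to the schema from the question, the relevant fragment would look something like this (a sketch; only the type line changes):

"Box": {
  "type": ["array", "object"],
  "items": {
    "type": "object",
    "properties": {
      "BoxName": {
        "type": "string"
      }
    }
  }
}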
You may then need to identify which scenario you're in: 0, 1 or multiple records in the resultset. Below is how you can create a variable (ResultsetSize) which takes one of 3 values (rs_0, rs_1 or rs_n) for your switch:
"Initialize_ResultsetSize": {
"inputs": {
"variables": [
{
"name": "ResultsetSize",
"type": "string",
"value": "rs_n"
}
]
},
"runAfter": {
"<replace_with_name_of_previous_action>": [
"Succeeded"
]
},
"type": "InitializeVariable"
},
"Check_if_resultset_is_0_or_1_records": {
"actions": {
"Set_ResultsetSize_to_0": {
"inputs": {
"name": "ResultsetSize",
"value": "rs_0"
},
"runAfter": {},
"type": "SetVariable"
}
},
"else": {
"actions": {
"Set_ResultsetSize_to_1": {
"inputs": {
"name": "ResultsetSize",
"value": "rs_1"
},
"runAfter": {},
"type": "SetVariable"
}
}
},
"expression": {
"and": [
{
"equals": [
"#string(body('<replace_with_name_of_Parse_JSON_action>')?['<replace_with_name_of_root_element>']?['<replace_with_name_of_list_container_element>']?['<replace_with_name_of_item_element>']?['<replace_with_non_null_element_or_attribute>'])",
""
]
}
]
},
"runAfter": {
"Initialize_ResultsetSize": [
"Succeeded"
]
},
"type": "If"
},
"Process_resultset_depending_on_ResultsetSize": {
"cases": {
"Case_no_record": {
"actions": {
},
"case": "rs_0"
},
"Case_one_record_only": {
"actions": {
},
"case": "rs_1"
}
},
"default": {
"actions": {
}
},
"expression": "#variables('ResultsetSize')",
"runAfter": {
"Check_if_resultset_is_0_or_1_records": [
"Succeeded",
"Failed",
"Skipped",
"TimedOut"
]
},
"type": "Switch"
}
For this problem, I came across another Stack Overflow post describing a similar issue. When there is only one "Box", it is rendered as a {key/value pair} rather than an [array] when the XML is converted to JSON. I think this is by design, so maybe we can just add a dummy "Box" record at the source of your XML data, such as:
<Box>specific_test</Box>
Then filter out the "specific_test" record in the next steps.
Another workaround for your reference:
If your JSON data has only one array, we can use it to make a judgment. We check whether the JSON string contains the "[" character: if it does, the return value is the index of the "[" character; if not, the return value is -1.
The expression is as below:
indexOf('{"MyBoxCollection":{"Box":[aaa,bbb]}}', '[')
When the data doesn't contain "[", the expression returns -1.
Then we can add an "If" condition: if the result is greater than 0, do "Parse JSON" with one of the schemas; if it is -1, do "Parse JSON" with the other schema.
Hope this helps with your problem.
We faced a similar issue. The only solution we found was manipulating the XML before conversion: we updated the XML nodes that need to be arrays even when they hold a single element. We used an Azure Function to update the required XML attributes and then returned the XML for conversion in Logic Apps. Hope this helps someone.

How do Elasticsearch's "include_in_parent" / "include_in_root" work? Should it show in '_source'?

In a simple Elasticsearch mapping like this:
{
  "personal_document": {
    "analyzer": "standard",
    "_timestamp": {
      "enabled": true
    },
    "properties": {
      "description": {
        "type": "multi_field",
        "fields": {
          "sort": {
            "type": "string",
            "index": "not_analyzed"
          },
          "description": {
            "type": "string",
            "include_in_root": true
          }
        }
      },
      "my_nested": {
        "type": "nested",
        "include_in_root": true,
        "properties": {
          "description": {
            "type": "string"
          }
        }
      }
    }
  }
}
... isn't "include_in_root": true supposed to add the field my_nested.description to the root document?
And during a query, am I not supposed to see THAT field in the _source field?
And would specifying a highlight directive on the field 'my_nested.description' automatically retrieve the included-in-root value instead of the nested field?
(Something like this:)
"highlight": {
"fields": {
"description": {},
"my_nested.description": {}
}
}
Or do I have some misunderstanding of the official nested type documentation? (It is not really clear.)
If the include_in_parent or include_in_root options are enabled on nested documents, then Elasticsearch internally indexes the data with the nested fields flattened onto the parent document. However, this is purely internal to Elasticsearch, and you'll never see them in the _source field.
"If the user field is of type object, this document would be indexed internally something like this...", as referred to in the documentation.
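A practical consequence is that the flattened copy can be queried like an ordinary root-level field, without a nested query. A sketch (the search text is made up):

{
  "query": {
    "match": {
      "my_nested.description": "some text"
    }
  }
}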
Thus, you continue to perform actions (like the highlights that you mention) by referring to the nested document's fields. The highlight syntax you refer to should look like this:
"highlight": {
"fields": {
"my_nested.description": {}
}
}
and not
"highlight": {
"fields": {
"description": {}
}
}
You can use a wildcard to specify the highlight field:
POST /test-1/page/_search
{
  "query": {
    "query_string": {
      "query": "Lorem ipsum"
    }
  },
  "highlight": {
    "fields": {
      "*": {}
    }
  }
}
Whether it's a good idea, I don't know; I guess it depends on your application.
This also works with nested documents, by the way, but it seems to hiccup when doing attachments on nested documents without include_in_root.