Cloudant/Mango selector for deeply nested JSONs

Let's say some of my documents have the following structure:
{
  "something": {
    "a": "b"
  },
  "some_other_thing": {
    "c": "d"
  },
  "what_i_want": {
    "is_down_here": [
      {
        "some": {
          "not_needed": "object"
        },
        "another": {
          "also_not_needed": "object"
        },
        "i_look_for": "this_tag",
        "tag_properties": {
          "this": "that"
        }
      },
      {
        "but_not": {
          "down": "here"
        }
      }
    ]
  }
}
Is there a Mango JSON selector that can successfully select on "i_look_for" having the value "this_tag"? It's inside an array (I know its position in the array). I'm also interested in filtering the result so that I only get the "tag_properties" in the result.
I have tried a lot of things, including $elemMatch, but everything mostly returns "invalid json".
Is that even a use case for Mango, or should I stick with views?

With Cloudant Query (Mango) selector statements, you still need to define an appropriate index before querying. With that in mind, here's your answer:
json-type CQ index
{
  "index": {
    "fields": [
      "what_i_want.is_down_here.0"
    ]
  },
  "type": "json"
}
Selector against json-type index
{
  "selector": {
    "what_i_want.is_down_here.0": {
      "i_look_for": "this_tag"
    },
    "what_i_want.is_down_here.0.tag_properties": {
      "$exists": true
    }
  },
  "fields": [
    "_id",
    "what_i_want.is_down_here.0.tag_properties"
  ]
}
The solution above assumes that you always know/can guarantee the fields you want are within the 0th element of the is_down_here array.
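If you are running these against Cloudant or CouchDB over HTTP, the index definition is POSTed to the database's _index endpoint and the selector to _find. A minimal sketch with curl, assuming a database named mydb, the base URL and credentials in $COUCH_URL, and the two JSON bodies above saved as index.json and selector.json:
# create the json-type index (database name "mydb" is an assumption)
curl -X POST "$COUCH_URL/mydb/_index" -H "Content-Type: application/json" -d @index.json
# run the query using the selector shown above
curl -X POST "$COUCH_URL/mydb/_find" -H "Content-Type: application/json" -d @selector.json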
There is another way to answer this question with a different CQ index type. This article explains the differences, and has helpful examples that show querying arrays. Now that you know a little more about the different index types, here's how you'd answer your question with a Lucene search/"text"-type CQ index:
text-type CQ index
{
  "index": {
    "fields": [
      {"name": "what_i_want.is_down_here.[]", "type": "string"}
    ]
  },
  "type": "text"
}
Selector against text-type index
{
  "selector": {
    "what_i_want.is_down_here": {
      "$and": [
        {"$elemMatch": {"i_look_for": "this_tag"}},
        {"$elemMatch": {"tag_properties": {"$exists": true}}}
      ]
    }
  },
  "fields": [
    "_id",
    "what_i_want.is_down_here"
  ]
}
Read the article and you'll learn that each approach has its tradeoffs: json-type indexes are smaller but less flexible (they can only index specific elements); text-type indexes are larger but more flexible (they can index all array elements). And from this example, you can also see that the projected values come with tradeoffs (projecting specific values vs. the entire array).
More examples in these threads:
Cloudant Selector Query
How to index multidimensional arrays in couchdb

If I'm understanding your question properly, there are two supported ways of doing this according to the docs:
{
  "what_i_want": {
    "i_look_for": "this_tag"
  }
}
should be equivalent to the abbreviated form:
{
  "what_i_want.i_look_for": "this_tag"
}

Related

Elastic Search - Nested aggregation

I would like to form a nested aggregation query in Elasticsearch. Basically, the nested aggregation is at four levels:
groupId.keyword
  direction
    billingCallType
      durationCallAnswered
example:
"aggregations": {
"avgCallDuration": {
"terms": {
"field": "groupId.keyword",
"size": 10000,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_key": "asc"
}
]
},
"aggregations": {
"call_direction": {
"terms" : {
"field": "direction"
},
"aggregations": {
"call_type" : {
"terms": {
"field": "billingCallType"
},
"aggregations": {
"avg_value": {
"terms": {
"field": "durationCallAnswered"
}
}
}
}
}
}
}
}
}
This is part of a query. While running it, I am getting the following error:
"type": "illegal_argument_exception",
"reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [direction] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
Can anyone throw light on this?
Tl;dr
As the error states, you are performing an aggregation on a text field, the field direction.
Aggregations are not supported by default on text fields, as they are very expensive (CPU- and memory-wise).
There are 3 solutions to your issue (see the sketch after this list):
Change the mapping from text to keyword (requires reindexing; the most efficient way to query the data)
Change the mapping to add fielddata: true to this field (flexible, but not optimised)
Don't do the aggregation on this field :)
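As an illustration of the first two options, a sketch (the index name my-calls-index is an assumption, and option 1 assumes your mapping already defines a direction.keyword sub-field; otherwise it requires reindexing with a keyword mapping):
POST my-calls-index/_search
{
  "size": 0,
  "aggregations": {
    "call_direction": {
      "terms": { "field": "direction.keyword" }
    }
  }
}
PUT my-calls-index/_mapping
{
  "properties": {
    "direction": {
      "type": "text",
      "fielddata": true
    }
  }
}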

Elasticsearch dynamic mapping for object within attribute

Wondering if I can create a "dynamic mapping" within an Elasticsearch index. The problem I am trying to solve is the following: I have a schema with an attribute that contains an object that can differ greatly between records. I would like to mirror this data within Elasticsearch if possible, but believe that automatic mapping may get in the way.
Imagine a scenario where I have a schema like the following:
{
  name: string
  origin: string
  payload: object // can be of any type / schema
}
Is it possible to create a mapping that supports this? I do not need to query the records by this payload attribute, but it would be great if I can.
Note that I have checked the documentation but am confused about whether what Elastic calls dynamic mapping is what I am looking for.
It's certainly possible to specify which queryable fields you expect the payload to contain and what those fields' mappings should be.
Let's say each doc will include the fields payload.livemode and payload.created_at. If these are the only two fields you'll want to perform queries on, and you'd like to disable dynamic, index-time mappings autogenerated by Elasticsearch for the rest of the fields, you can use dynamic templates like so:
PUT my-payload-index
{
  "mappings": {
    "dynamic_templates": [
      {
        "variable_payload": {
          "path_match": "payload",
          "mapping": {
            "type": "object",
            "dynamic": false,
            "properties": {
              "created_at": {
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss"
              },
              "livemode": {
                "type": "boolean"
              }
            }
          }
        }
      }
    ],
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "origin": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
Then, as you ingest your docs:
POST my-payload-index/_doc
{
  "name": "abc",
  "origin": "web.dev",
  "payload": {
    "created_at": "2021-04-05 08:00:00",
    "livemode": false,
    "abc": "def"
  }
}
POST my-payload-index/_doc
{
  "name": "abc",
  "origin": "web.dev",
  "payload": {
    "created_at": "2021-04-05 08:00:00",
    "livemode": true,
    "modified_at": "2021-04-05 09:00:00"
  }
}
and verify with
GET my-payload-index/_mapping
no new mappings will be generated for the fields payload.abc or payload.modified_at.
Not only that — the new fields will also be ignored, as per the documentation:
These fields will not be indexed or searchable, but will still appear in the _source field of returned hits.
Side note: if fields are neither stored nor searchable, they're effectively the opposite of enabled.
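To double-check that the two declared fields are still searchable, a quick query sketch against the same hypothetical index:
GET my-payload-index/_search
{
  "query": {
    "term": {
      "payload.livemode": true
    }
  }
}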
The Big Picture
Working with variable contents of a single, top-level object is quite standard. Take for instance the Stripe event object: each event has an id, an api_version and a few other shared params. Then there's the data object that's analogous to your payload field.
Now, all is fine, until you need to aggregate on the contents of your payload. See, since the content is variable, so are the data paths / accessors. But wildcards in aggregation paths don't work in Elasticsearch. Scripts do but are onerous to maintain.
Back to Stripe. They partially solved it through what they call polymorphic, typed hashes, as discussed in their blog on API design.
A pretty neat approach that's worth emulating.
P.S. I discuss dynamic templates in more detail in the chapter "Mapping Automation" of my ES Handbook.

JSON Schema validation of arrays with mandatory and optional elements

I am developing a JSON Schema for validating documents like this one:
{
  "map": [
    {
      "key": "mandatoryKey1",
      "value": "value1"
    },
    {
      "key": "mandatoryKey2",
      "value": "value2"
    },
    {
      "key": "otherStuff",
      "value": "value3"
    },
    {
      "key": "someMoreStuff",
      "value": "value4"
    }
  ]
}
The document needs to have a "map" array with elements containing keys and values. There MUST be two elements with mandatoryKey1 and mandatoryKey2. Any other key-value pairs are allowed. Order of the elements should not matter. I found this difficult to express in JSON Schema. I can force the schema to check for the mandatory keys like this (I left out the definitions part as it is trivial):
"map": {
"type": "array",
"minItems": 2,
"items": {
"oneOf": [
{
"$ref": "#/definitions/mandatoryElement1"
},
{
"$ref": "#/definitions/mandatoryElement2"
}
]
}
}
The problems are:
It validates that a document includes the mandatory data, but does not permit any other key/value pairs.
It does not check for duplicates, so it can be cheated by including mandatoryElement1 twice. Uniqueness of items can only be checked by tuple validation, which I cannot apply here because the item order should not matter.
The basic problem I see here is that the array elements somehow need to know about each other, i.e. arbitrary key/value pairs are allowed ONLY IF the mandatory keys are present. This "conditional validation" does not seem to be possible with JSON Schema. Any ideas for a better approach?
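One possible direction (a sketch, not a verified full answer): validators supporting draft-06 or later have the contains keyword, which requires at least one array element to match a subschema, so each mandatory element can be demanded independently of its position while items stays permissive. Assuming the same mandatoryElement1/mandatoryElement2 definitions plus a hypothetical generic keyValuePair definition for the other entries:
"map": {
  "type": "array",
  "minItems": 2,
  "items": { "$ref": "#/definitions/keyValuePair" },
  "allOf": [
    { "contains": { "$ref": "#/definitions/mandatoryElement1" } },
    { "contains": { "$ref": "#/definitions/mandatoryElement2" } }
  ]
}
This still does not rule out duplicates of a mandatory element, but it does allow arbitrary extra key/value pairs while requiring both mandatory keys to appear somewhere in the array.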

Logstash json field removal

We have a heavily nested json document containing server metrics. The document contains > 1000 fields, some of which are completely irrelevant to us for analytic purposes, so I would like to remove them before indexing the document in Elastic.
However, I am unable to find the correct filter to use, as the fields I want to remove have common names in multiple different objects within the document.
The source document looks like this (reduced in size for brevity):
[
  {
    "server": {
      "is_master": true,
      "name": "MYServer",
      "id": 2111
    },
    "metrics": {
      "Server": {
        "time": {
          "boundary": {},
          "type": "TEXT",
          "display_name": "Time",
          "value": "2018-11-01 14:57:52"
        }
      },
      "Mem_OldGen": {
        "used": {
          "boundary": {},
          "display_name": "Used(mb)",
          "value": 687
        },
        "committed": {
          "boundary": {},
          "display_name": "Committed(mb)",
          "value": 7116
        },
        "cpu_count": {
          "boundary": {},
          "display_name": "Cores",
          "value": 4
        }
      }
    }
  }
]
The data is loaded into Logstash using the http_poller input plugin and needs to be processed before sending to Elastic for indexing.
I am trying to remove the fields that are not relevant for us to track for analytical purposes; these include the "display_name" and "boundary" fields from each json object in the different metrics.
I have tried using the mutate filter to remove the fields, but because they exist in so many different objects it requires too many hard-coded paths to be added to the Logstash config.
I have also looked at the ruby filter, which seems promising as it can look at the event, but I am unable to get it to crawl the entire json document or, more importantly, actually remove the fields.
Here is what I was trying as a test:
filter {
  split {
    field => "message"
  }
  ruby {
    code => '
      event.get("[metrics][Mem_OldGen][used]").to_hash.keys.each { |k|
        logger.info("field is:", k)
        if k.include?("display_name")
          event.remove(k)
        end
        if k.include?("boundary")
          event.remove(k)
        end
      }
    '
  }
}
It first splits the input at the message level to create one event per server, then tries to remove the fields from a specific metric.
Any help would be greatly appreciated.
If I get the point, you want to keep just the value key.
So, considering the response hash:
response = {
  "server": {
    "is_master": true,
    "name": "MYServer",
    "id": 2111
  },
  "metrics": {
    ...
You could do:
response[:metrics].transform_values { |hh| hh.transform_values { |h| h.delete_if { |k,v| k != :value } } }
#=> {:server=>{:is_master=>true, :name=>"MYServer", :id=>2111}, :metrics=>{:Server=>{:time=>{:value=>"2018-11-01 14:57:52"}}, :Mem_OldGen=>{:used=>{:value=>687}, :committed=>{:value=>7116}, :cpu_count=>{:value=>4}}}}
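To apply the same idea inside the Logstash pipeline itself, here is a rough sketch of a ruby filter that walks the whole metrics subtree and strips the unwanted keys (the "metrics" field path and the exact key names are assumptions based on the sample document):
filter {
  ruby {
    code => '
      unwanted = ["display_name", "boundary"]   # key names to strip (assumed)

      # recursively walk nested hashes/arrays and delete the unwanted keys
      prune = lambda do |node|
        case node
        when Hash
          unwanted.each { |k| node.delete(k) }
          node.each_value { |v| prune.call(v) }
        when Array
          node.each { |v| prune.call(v) }
        end
      end

      metrics = event.get("metrics")
      if metrics.is_a?(Hash)
        prune.call(metrics)
        event.set("metrics", metrics)
      end
    '
  }
}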

Elasticsearch mapping of nested structure

I'm looking for some pointers on mapping a somewhat dynamic structure for consumption by Elasticsearch.
The raw structure itself is json, but the problem is that a portion of the structure has variable key names, rather than the outer elements of the structure being static.
To provide a somewhat redacted example, my json looks like this:
"stat": {
"state": "valid",
"duration": 5,
},
"12345-abc": {
"content_length": 5,
"version": 2
}
"54321-xyz": {
"content_length": 2,
"version", 1
}
The first block is easy; Elasticsearch does a great job of mapping the "stat" portion of the structure, and if I were to dump a lot of that data into an index it would work as expected. The problem is that the next 2 blocks are essentially the same thing, but the raw json is formatted in such a way that a unique element has crept into the structure, and Elasticsearch wants to map that by default, generating a map that looks like this:
"stat": {
"properties": {
"state": {
"type": "string"
},
"duration": {
"type": "double"
}
}
},
"12345-abc": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
},
"54321-xyz": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
}
I'd like the ability to index all of the "content_length" data, but it's getting separated, and with some of the variable names being used, when I drop the data into Kibana I wind up with really long field names that become next to useless.
Is it possible to provide a generic tag to the structure? Or is this more trivially addressed at the json generation phase, with our developers hard-coding a generic structure name and adding an identifier field name?
Any insight / help greatly appreciated.
Thanks!
If those keys like 12345-abc are generated and can take a potentially unbounded set of values, it will get hard (if not impossible) to do useful queries or aggregations. It's not really clear which exact use case you have for analyzing your data, but you should probably have a look at nested objects (https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html) and generate your input json according to what you want to query for. It seems that you will get better aggregation results if you put these additional objects into an array with a special field containing what is currently your key:
{
  "stat": ...,
  "things": [
    {
      "thingkey": "12345-abc",
      "content_length": 5,
      "version": 2
    },
    ...
  ]
}
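For completeness, a minimal sketch of what the corresponding mapping and an aggregation across all content_length values could look like, assuming a recent Elasticsearch version (typeless mappings), an index named my-index, and the things/thingkey names from the example above:
PUT my-index
{
  "mappings": {
    "properties": {
      "things": {
        "type": "nested",
        "properties": {
          "thingkey": { "type": "keyword" },
          "content_length": { "type": "double" },
          "version": { "type": "double" }
        }
      }
    }
  }
}
POST my-index/_search
{
  "size": 0,
  "aggregations": {
    "all_things": {
      "nested": { "path": "things" },
      "aggregations": {
        "avg_content_length": {
          "avg": { "field": "things.content_length" }
        }
      }
    }
  }
}
The nested aggregation scopes the inner avg to the array elements, so content_length values from all things are aggregated regardless of their original key.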