Elasticsearch - Nested aggregation - JSON

I would like to form a nested aggregation query in Elasticsearch. Basically, the aggregation is nested four levels deep:
groupId.keyword
  -- direction
    -- billingCallType
      -- durationCallAnswered
example:
"aggregations": {
"avgCallDuration": {
"terms": {
"field": "groupId.keyword",
"size": 10000,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_key": "asc"
}
]
},
"aggregations": {
"call_direction": {
"terms" : {
"field": "direction"
},
"aggregations": {
"call_type" : {
"terms": {
"field": "billingCallType"
},
"aggregations": {
"avg_value": {
"terms": {
"field": "durationCallAnswered"
}
}
}
}
}
}
}
}
}
This is part of a larger query. While running it, I get the following error:
"type": "illegal_argument_exception",
"reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [direction] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
Can anyone throw light on this?

TL;DR
As the error states, you are performing an aggregation on a text field, the field direction.
Aggregations are not supported by default on text fields, as they are very expensive (CPU- and memory-wise).
There are 3 solutions to your issue:
1. Change the mapping from text to keyword (requires reindexing; the most efficient way to query the data)
2. Add fielddata: true to this field's mapping (flexible, but not optimised)
3. Don't do the aggregation on this field :)
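For illustration, option 2 could look like the sketch below, assuming Elasticsearch 7+ and a hypothetical index name calls (adjust both to your setup):

PUT /calls/_mapping
{
  "properties": {
    "direction": {
      "type": "text",
      "fielddata": true
    }
  }
}

Also note that if your index was created with dynamic mapping, a direction.keyword sub-field may already exist, in which case aggregating on "field": "direction.keyword" (just as you already do for groupId.keyword) works without any mapping change.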

Related

How can I change my Elasticsearch query from multiple queries using _msearch to a single query using a bucket aggregation

I am new to Elasticsearch.
Does anybody have an idea how to change this from the current _msearch query to a single _search query?
It would also be convenient if I didn't have to make a separate query for every "CAR" but could instead solve it with only one query. I would like to use a bucket aggregation instead.
This is the query where I search for every car separately:
GET /index/_msearch
{}
{"query": {"match": {"name": "CAR_RED"}},"size": 1,"sort": {"time":{"order": "desc"}}}
{}
{"query": {"match": {"name": "CAR_BLACK"}},"size": 1,"sort": {"time":{"order": "desc"}}}
{}
{"query": {"match": {"name": "CAR_WHITE"}},"size": 1,"sort": {"time":{"order": "desc"}}}
At the moment I am trying to solve it with the bucket aggregation but I always get an error.
GET /index/_search
{
  "size": 0,
  "aggs": {
    "CARS": {
      "terms": { "field": "name.keyword" }
    },
    "bucket_sort": {
      "sort": [
        { "time": { "order": "desc" } }
      ],
      "size": 1
    }
  }
}
It would be awesome if anyone could help me with this query.
GET /index/_search
{
  "size": 0,
  "query": { "terms": { "name.keyword": ["CAR_RED", "CAR_BLACK", "CAR_WHITE"] } },
  "aggs": {
    "CARS": {
      "terms": { "field": "name.keyword" },
      "aggs": {
        "top_cars_by_time": {
          "top_hits": {
            "sort": [
              { "time": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
What this query does:
Filters, in the query itself, for the RED, BLACK, and WHITE cars.
Aggregates the results by car color.
Then, for each bucket in the aggregation, sorts the results in descending order of the time field.
So for every car color, you will get the hit with the greatest time first, and so on.
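If, as in your original _msearch, you only want the single most recent hit per car, you can additionally set "size": 1 inside top_hits (a small tweak on top of the query above):

"top_hits": {
  "size": 1,
  "sort": [
    { "time": { "order": "desc" } }
  ]
}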
Read up on top_hits; it's the aggregation you need.
Top Hits Aggregation
HTH.

How do I properly use deleteMany() with an $and query in the MongoDB shell?

I am trying to delete all documents in my collection infrastructure that have a type.primary property of "pipelines" and a type.secondary property of "oil".
I'm trying to use the following query:
db.infrastructure.deleteMany({$and: [{"properties.type.primary": "pipelines"}, {"properties.type.secondary": "oil"}]})
That returns: { acknowledged: true, deletedCount: 0 }
I expect my query to work because in MongoDB Compass, I can retrieve 182 documents that match the query {$and: [{"properties.type.primary": "pipelines"}, {"properties.type.secondary": "oil"}] }
My documents appear with the following structure (relevant section only):
properties": {
"optional": {
"description": ""
},
"original": {
"Opername": "ENBRIDGE",
"Pipename": "Lakehead",
"Shape_Leng": 604328.294581,
"Source": "EIA"
},
"required": {
"unit": null,
"viz_dim": null,
"years": []
},
"type": {
"primary": "pipelines",
"secondary": "oil"
}
...
My understanding is that I just need to pass a filter to deleteMany() and that $and expects an array of objects. For some reason the two combined isn't working here.
I realized the simplest answer was the correct one -- I spelled my database name incorrectly.
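For reference, since the two conditions target different fields, the explicit $and is also optional here; relying on MongoDB's implicit AND, an equivalent delete is:

db.infrastructure.deleteMany({
  "properties.type.primary": "pipelines",
  "properties.type.secondary": "oil"
})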

Elasticsearch filter for distinct categories

I made a simple mapping with three fields; I am analyzing one field, which is text type, while the other fields are keyword type.
example
fields: Category_one, Category_two, Category_three.
Now I am searching the documents.
GET /cat/_search
{
  "size": 4,
  "query": {
    "match": {
      "Category_one.ngrams": {
        "query": "Nice food place in XYZ location",
        "analyzer": "standard"
      }
    },
    "aggs": {
      "distincr_values": {
        "terms": {
          "fields": "Category_two"
        }
      }
    }
  }
}
It's showing this error:
{
  "error": {
    "root_cause": [
      {
        "type": "parsing_exception",
        "reason": "[match] malformed query, expected [END_OBJECT] but found [FIELD_NAME]",
        "line": 10,
        "col": 5
      }
    ],
    "type": "parsing_exception",
    "reason": "[match] malformed query, expected [END_OBJECT] but found [FIELD_NAME]",
    "line": 10,
    "col": 5
  },
  "status": 400
}
Kindly help me with this error. My main goal is to find the distinct values of the Category_two field for matching documents.
Any help would be appreciated.
I believe you're getting this error because of your query structure.
Your aggregations keyword must be outside (at the same level as) the query. At the moment your aggs is wrapped up inside the query.
Following this structure:
GET /cat/_search
{
  "size": 4,
  "query": {
    'query goes here'
  },
  "aggs": {
    'aggregation goes here'
  }
}
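Applied to your query, a corrected version might look like this (a sketch; note that the terms aggregation's parameter is field, not fields, and the agg name typo is fixed):

GET /cat/_search
{
  "size": 4,
  "query": {
    "match": {
      "Category_one.ngrams": {
        "query": "Nice food place in XYZ location",
        "analyzer": "standard"
      }
    }
  },
  "aggs": {
    "distinct_values": {
      "terms": {
        "field": "Category_two"
      }
    }
  }
}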

Cloudant/Mango selector for deeply nested JSONs

Let's say some of my documents have the following structure:
{
  "something": {
    "a": "b"
  },
  "some_other_thing": {
    "c": "d"
  },
  "what_i_want": {
    "is_down_here": [
      {
        "some": {
          "not_needed": "object"
        },
        "another": {
          "also_not_needed": "object"
        },
        "i_look_for": "this_tag",
        "tag_properties": {
          "this": "that"
        }
      },
      {
        "but_not": {
          "down": "here"
        }
      }
    ]
  }
}
Is there a Mango JSON selector that can successfully select on "i_look_for" having the value "this_tag"? It's inside an array (I know its position in the array). I'm also interested in filtering the result so I only get the "tag_properties" in the result.
I have tried a lot of things, including $elemMatch, but everything mostly returns "invalid json".
Is that even a use case for Mango, or should I stick with views?
With Cloudant Query (Mango) selector statements, you still need to define an appropriate index before querying. With that in mind, here's your answer:
json-type CQ index
{
  "index": {
    "fields": [
      "what_i_want.is_down_here.0"
    ]
  },
  "type": "json"
}
Selector against json-type index
{
  "selector": {
    "what_i_want.is_down_here.0": {
      "i_look_for": "this_tag"
    },
    "what_i_want.is_down_here.0.tag_properties": {
      "$exists": true
    }
  },
  "fields": [
    "_id",
    "what_i_want.is_down_here.0.tag_properties"
  ]
}
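In case it helps: both snippets above are plain HTTP request bodies. The index is created by POSTing it to the database's _index endpoint, and the selector runs against _find (the database name mydb below is hypothetical):

curl -X POST "$HOST/mydb/_index" -H "Content-Type: application/json" -d @index.json
curl -X POST "$HOST/mydb/_find" -H "Content-Type: application/json" -d @selector.json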
The solution above assumes that you always know/can guarantee the fields you want are within the 0th element of the is_down_here array.
There is another way to answer this question with a different CQ index type. This article explains the differences, and has helpful examples that show querying arrays. Now that you know a little more about the different index types, here's how you'd answer your question with a Lucene search/"text"-type CQ index:
text-type CQ index
{
  "index": {
    "fields": [
      { "name": "what_i_want.is_down_here.[]", "type": "string" }
    ]
  },
  "type": "text"
}
Selector against text-type index
{
  "selector": {
    "what_i_want.is_down_here": {
      "$and": [
        { "$elemMatch": { "i_look_for": "this_tag" } },
        { "$elemMatch": { "tag_properties": { "$exists": true } } }
      ]
    }
  },
  "fields": [
    "_id",
    "what_i_want.is_down_here"
  ]
}
Read the article and you'll learn that each approach has its tradeoffs: json-type indexes are smaller and less flexible (they can only index specific elements); text-type indexes are larger but more flexible (they can index all array elements). And from this example, you can also see that the projected values come with tradeoffs (projecting specific values vs. the entire array).
More examples in these threads:
Cloudant Selector Query
How to index multidimensional arrays in couchdb
If I'm understanding your question properly, there are two supported ways of doing this according to the docs:
{
  "what_i_want": {
    "i_look_for": "this_tag"
  }
}
should be equivalent to the abbreviated form:
{
  "what_i_want.i_look_for": "this_tag"
}

Can anyone suggest a solution for a big relational data analyzer, please?

I'm looking for some suggestions on my requirements, which are described below. Feel free to contact me for any details. Even suggestions on how I can describe my question more clearly are very much appreciated :)
Requirements description
I have some data, the format is like below:
router, interface,timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, it is raw network data. I omitted some columns just to make it look clearer. The volume of data is very big, and it is generated very fast: around 1 billion rows every 5 minutes...
What I want is to do some real-time analysis on this data.
For example:
draw a line using the timestamp:
select sum(bits), timestamp from raw_data where router = 'r1' and interface = 1 group by router, interface, timestamp
find out which 3 src_ip send the most data on one interface:
select src_ip, sum(bits) from raw_data where router = 'r1' and interface = 2 group by src_ip order by sum(bits) desc limit 3
I have already tried some solutions, and none of them is quite suitable. For example:
RDBMS
MySQL seems fine except for a few problems:
the data is too big
I have a lot more columns than I described here. To improve query speed I would have to create indexes on most of the columns (as sketched below), but I think creating indexes on a big table, with indexes spanning that many columns, is not very good, right?
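As an aside, the kind of composite index even the two sample queries above would need (a sketch using the column names from the sample data) looks like this, and a similar index would be needed for every distinct query pattern:

CREATE INDEX idx_router_iface_ts ON raw_data (router, interface, timestamp);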
OpenTSDB
OpenTSDB is a good time-series database, but it is also not suitable for my requirements:
OpenTSDB has trouble with the top-N problem; it cannot answer my requirement "get the top 3 src_ip sending the most data".
Spark
I know that Apache Spark can be used like an RDBMS; it has a feature called Spark SQL. I have not tried it, but I guess the performance would not satisfy the real-time analysis/query requirement, right? After all, Spark is more suitable for offline calculation, right?
Elasticsearch
I really had high hopes for ES when I discovered this project, but it is not suitable either. When you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted. You have to retrieve all the results and sort them yourself. In my case there are too many results; sorting them would be very difficult.
So... I'm stuck here. Can anyone give some suggestions, please?
I don't see why ES would not be able to achieve your requirements. I think you misunderstood this part:
But it is not suitable either. When you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted.
Your first requirement, draw a line using the timestamp, could easily be achieved with a query/aggregation like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 1
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
As for your second requirement, find out which 3 src_ip send the most data on one interface, it can also easily be achieved with a query/aggregation like this one:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "field": "src_ip",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
UPDATE
According to your comment, your second requirement above could change to finding the top 3 combinations of src_ip/dst_ip. This is doable with a terms aggregation that uses a script instead of a field; the script builds the src/dst combination, and a sub-aggregation provides the sum of bits for each pair, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
Note that in order to run this last query, you'll need to enable dynamic scripting. Also since you'll have billions of documents, scripting might not be the best solution, but it's worth giving it a try before diving further. One other possible solution would be to add a combination field (src_ip-dst_ip) at indexing time so that you can use it as a field in your terms aggregation without having to resort to scripting.
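A minimal sketch of that indexing-time alternative (the index name netflow and field name src_dst are hypothetical, and src_dst is assumed to be mapped as a keyword/not-analyzed string): store the combination when indexing, then aggregate on it directly without any script:

PUT /netflow/_doc/1
{
  "router": "r1",
  "interface": 2,
  "src_ip": "10.0.0.1",
  "dst_ip": "10.0.0.2",
  "src_dst": "10.0.0.1-10.0.0.2",
  "bits": 108
}

POST /netflow/_search
{
  "size": 0,
  "aggs": {
    "by_src_dst": {
      "terms": {
        "field": "src_dst",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}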
You can try Axibase Time Series Database, which is non-relational but supports SQL queries in addition to a REST-like API. Here's a top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
ATSD Community Edition is free.
Disclosure: I work for Axibase