Elasticsearch: date range with fields inside fields - JSON

I have an issue: I need to write an Elasticsearch query that gives me what I'm looking for.
First of all, here is one item of my JSON objects in the DB that the query looks into:
{
  "data": {
    "circuit": {
      "version": "2.12.2",
      "createdOn": "2020-02-04T10:38:11.282",
      "expirationDate": "2020-02-06T05:50:00.000",
      "expiredSoonNotification": false
    }
  },
  "createdDate": "2020-02-04T10:38:11.282"
}
What I need is to get all items that satisfy this condition:
now < "data.circuit.expirationDate" < ("data.circuit.expirationDate" - "createdDate")/10 + now
meaning: I need to get all items whose expirationDate is less than 10% of their total lifetime away from now.
I hope that I explained my issue, because I don't know how to use fields inside lt and gt.
Here is something like what I did until now, but it is not working:
{
  "query": {
    "bool": {
      "must_not": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "data.circuit.expirationDate": {
                    "gt": "now",
                    "lt": ("data.circuit.expirationDate" - "createdDate")/10 + now
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "createdDate": {
        "order": "desc"
      }
    }
  ]
}
Thank You

You cannot do math referencing other fields in a range query. You would need to encode your logic in a script query using the Elasticsearch Painless scripting language. Script queries are significantly slower than other queries, as the script needs to be executed for every single document. You can limit the number of documents for which the script gets executed by breaking the logic up into 2 parts:
"data.circuit.expirationDate" > now
"data.circuit.expirationDate" <
(("data.circuit.expirationDate" - "createdDate")/10 + now)
Your query structure would need to look like this (pseudo-code):
"query": {
  "bool": {
    "must": { "script": "data.circuit.expirationDate" < (("data.circuit.expirationDate" - "createdDate")/10 + now) },
    "filter": { "range": "data.circuit.expirationDate" > now }
  }
}
You should also consider whether you really need precision down to the millisecond level. Performance-wise it is much better to round now to a coarser unit (e.g. now/s for second-level granularity), which also makes the query eligible for caching.
Pre-calculating ("data.circuit.expirationDate" - "createdDate")/10 and storing the calculated result directly in your document would furthermore increase query performance significantly.
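To make the pseudo-code concrete, here is a minimal sketch of a runnable version, assuming Elasticsearch 7.x (where date doc values expose toInstant()) and assuming the client supplies the current time in epoch milliseconds as params.now, since a script query cannot access "now" itself:
{
  "query": {
    "bool": {
      "filter": {
        "range": {
          "data.circuit.expirationDate": {
            "gt": "now/s"
          }
        }
      },
      "must": {
        "script": {
          "script": {
            "lang": "painless",
            "source": "long exp = doc['data.circuit.expirationDate'].value.toInstant().toEpochMilli(); long created = doc['createdDate'].value.toInstant().toEpochMilli(); return exp < (exp - created) / 10 + params.now;",
            "params": {
              "now": 1580810400000
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "createdDate": {
        "order": "desc"
      }
    }
  ]
}
The value of params.now here is a placeholder; your application must compute it at query time.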

Related

Elastic Search - Nested aggregation

I would like to form a nested aggregation type query in Elasticsearch. Basically, the nested aggregation is at four levels:
groupId.keyword
-- direction
---- billingCallType
------ durationCallAnswered
example:
"aggregations": {
"avgCallDuration": {
"terms": {
"field": "groupId.keyword",
"size": 10000,
"min_doc_count": 1,
"shard_min_doc_count": 0,
"show_term_doc_count_error": false,
"order": [
{
"_count": "desc"
},
{
"_key": "asc"
}
]
},
"aggregations": {
"call_direction": {
"terms" : {
"field": "direction"
},
"aggregations": {
"call_type" : {
"terms": {
"field": "billingCallType"
},
"aggregations": {
"avg_value": {
"terms": {
"field": "durationCallAnswered"
}
}
}
}
}
}
}
}
}
This is part of a query. While running it, I am getting this error:
"type": "illegal_argument_exception",
"reason": "Text fields are not optimised for operations that require per-document field data like aggregations and sorting, so these operations are disabled by default. Please use a keyword field instead. Alternatively, set fielddata=true on [direction] in order to load field data by uninverting the inverted index. Note that this can use significant memory."
Can anyone throw light on this?
Tldr;
As the error states, you are performing an aggregation on a text field, the field direction.
Aggregations are not supported by default on text fields, as they are very expensive (CPU- and memory-wise).
There are 3 solutions to your issue:
Change the mapping from text to keyword (requires reindexing; the most efficient way to query the data; see the sketch below)
Change the mapping to add fielddata: true to this field (flexible, but not optimised)
Don't do the aggregation on this field :)
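For the first option, a minimal sketch of what the mapping could look like, assuming a recent Elasticsearch (7.x, without mapping types) and a hypothetical index name my-index. Keeping direction as text with a keyword sub-field preserves full-text search while enabling aggregations:
PUT my-index
{
  "mappings": {
    "properties": {
      "direction": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
After reindexing your data into this mapping, change the aggregation to "field": "direction.keyword" and the error goes away.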

Elasticsearch Query DSL

I get log files from my firewall which I want to filter for several strings.
However, the string always contains some other information, so I want to filter the whole string for some specific words which are always in the string: "User", "authentication", "failed".
I tried this but I do not get any data from it:
"query": {
"bool": {
"must": [
{
"range": {
"#timestamp": {
"gt": "now-15m"
}
}
},
{
"query_string": {
"query": "User AND authentication AND failed"
}
}
]
}
}
}
However, I cannot find the syntax for filtering for specific words in strings. Hopefully some of you can help me.
This is the message log (I want to filter on "event.original"): [screenshot]
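A sketch of one way to do this, assuming event.original is an analyzed text field as shown in the screenshot: use a match query with operator set to and, so all three words must be present, and keep the time range as a filter:
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gt": "now-15m"
            }
          }
        }
      ],
      "must": [
        {
          "match": {
            "event.original": {
              "query": "User authentication failed",
              "operator": "and"
            }
          }
        }
      ]
    }
  }
}
The query_string variant in the question also works if you scope it to the field with "fields": ["event.original"]; without that it searches the default field, which may not contain the message text.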

Can you suggest a solution for a big, relational data analyzer please?

I'm looking for some suggestions on my requirements. Below is a description of my requirements. Feel free to contact me for any details, please. Even some suggestions on how I can describe my questions more clearly would be very appreciated :)
Requirements description
I have some data, the format is like below:
router, interface,timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, it is some raw network data. I omitted some columns just to make it look clearer. The volume of data is very big, and it is generated very fast, like 1 billion rows every 5 minutes...
What I want is to do some real-time analysis on this data.
For example:
draw a line over time using the timestamp
SELECT timestamp, SUM(bits) FROM raw_data WHERE router = 'r1' AND interface = 1 GROUP BY timestamp;
find out which 3 src_ip send the most data on one interface
SELECT src_ip, SUM(bits) FROM raw_data WHERE router = 'r1' AND interface = 2 GROUP BY src_ip ORDER BY SUM(bits) DESC LIMIT 3;
I have already tried some solutions, and each of them is not very suitable. For example:
rdbms
MySQL seems fine except for a few problems:
the data is too big
I have a lot more columns than I described here. To improve my query speed, I would have to create indexes on most of the columns. But I think creating indexes on a big table, with indexes containing too many columns, is not very good, right?
openTSDB
OpenTSDB is a good time-series database, but it is also not suitable for my requirements:
OpenTSDB has problems solving the top-N problem. For my requirement "get the top 3 src_ip which send the most data", OpenTSDB cannot resolve this.
Spark
I know that Apache Spark can be used like an RDBMS; it has a feature called Spark SQL. I did not try it, but I guess the performance would not satisfy the real-time analysis/query requirement, right? After all, Spark is more suited to offline calculation, right?
Elastic Search
I really had high hopes for ES when I learned about this project, but it is not suitable either, because when you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted. You would have to retrieve all the results and sort them yourself. In my case there are far too many results; sorting them would be very difficult.
So... I'm stuck here. Can anyone give me some suggestions, please?
I don't see why ES would not be able to achieve your requirements. I think you misunderstood this part:
But it is not suitable either. Because When you aggregating more than one column, you have to use the so called nested bucket aggregation in elasticsearch. And the result of this aggregation can not be sorted.
Your first requirement, draw a line using the timestamp, could easily be achieved with a query/aggregation like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 1
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
As for your second requirement, find out which 3 src_ip send the most data on one interface, it can also easily be achieved with a query/aggregation like this one:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "field": "src_ip",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
UPDATE
According to your comment, your second requirement above could change to find the top 3 combinations of src_ip/dst_ip. This would be doable with a terms aggregation using a script instead of a field, which would build the src/dst combination and provide the sum of bits for each pair, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
Note that in order to run this last query, you'll need to enable dynamic scripting. Also, since you'll have billions of documents, scripting might not be the best solution, but it's worth giving it a try before diving further. One other possible solution would be to add a combined field (src_ip-dst_ip) at indexing time so that you can use it as a field in your terms aggregation without having to resort to scripting, as sketched below.
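A minimal sketch of that pre-computed variant; the field name src_dst is hypothetical, would be filled in by your indexing pipeline, and is assumed to be mapped as a keyword (or non-analyzed string):
PUT raw_data/_doc/1
{
  "router": "r1",
  "interface": 2,
  "timestamp": 1453016443,
  "src_ip": "10.0.0.1",
  "dst_ip": "10.0.0.2",
  "bits": 108,
  "src_dst": "10.0.0.1-10.0.0.2"
}
The terms aggregation then becomes a plain field aggregation, which is fast and cacheable:
"aggs": {
  "by_src_dst": {
    "terms": {
      "field": "src_dst",
      "size": 3,
      "order": {
        "sum_bits": "desc"
      }
    },
    "aggs": {
      "sum_bits": {
        "sum": {
          "field": "bits"
        }
      }
    }
  }
}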
You can try Axibase Time Series Database, which is non-relational but supports SQL queries in addition to a REST-like API. Here's a top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
ATSD Community Edition is free.
Disclosure: I work for Axibase

How to convert this SQL query to an Elasticsearch query?

I'm new to Elasticsearch querying, so I'm a little lost on how to convert this SQL query to an Elasticsearch query:
SELECT time_interval, type, SUM(count)
FROM test
WHERE t_date BETWEEN &start_date AND &end_date
GROUP BY time_interval, type
I know I can use the "range" query to set parameters for gte and lte, but if there's a clearer way to do this, that would be even better. Thanks in advance!
Edit:
My Elasticsearch is set up to have an index "test" with type "summary" and contains JSON documents that have a few fields:
t_datetime
t_date
count
type
t_id
The IDs for these JSON documents are the t_date values concatenated with the t_id values.
Assuming t_datetime is the same as time_interval, you can use the query below:
POST test/summary/_search?search_type=count
{
  "aggs": {
    "filtered_results": {
      "filter": {
        "range": {
          "t_date": {
            "gte": "2015-05-01",
            "lte": "2015-05-30"
          }
        }
      },
      "aggs": {
        "time_interval_type_groups": {
          "terms": {
            "script": "doc['t_datetime'].value + '_' + doc['type'].value",
            "size": 0
          },
          "aggs": {
            "sum_of_count": {
              "sum": {
                "field": "count"
              }
            }
          }
        }
      }
    }
  }
}
This query makes use of scripts. On newer versions of Elasticsearch, dynamic scripting is disabled by default; see the Elasticsearch documentation on enabling dynamic scripting.
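On more recent Elasticsearch versions you could avoid scripting entirely with a composite aggregation (a sketch, assuming Elasticsearch 6.1 or later, where composite is available, and that type is mapped as a keyword):
POST test/_search
{
  "size": 0,
  "query": {
    "range": {
      "t_date": {
        "gte": "2015-05-01",
        "lte": "2015-05-30"
      }
    }
  },
  "aggs": {
    "by_interval_and_type": {
      "composite": {
        "sources": [
          { "time_interval": { "terms": { "field": "t_datetime" } } },
          { "type": { "terms": { "field": "type" } } }
        ]
      },
      "aggs": {
        "sum_of_count": {
          "sum": {
            "field": "count"
          }
        }
      }
    }
  }
}
This groups by both fields, like the SQL GROUP BY time_interval, type, and sums count per bucket.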

Lucene Multiple delete query (JSON)

I have a problem with a script that I wrote for Elasticsearch. On my server I have multiple log files that need to be deleted on a daily basis. To automate this process I wrote a Perl script that deletes my keep-alive log files.
Basically a curl -XDELETE.
But now I want to add a query to delete another log file.
IS IT POSSIBLE TO ADD ANOTHER JSON OBJECT, WITHOUT CREATING ANOTHER DELETE VARIABLE?
So, adding something to my JSON that integrates a separate query that also deletes that log?
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "@timestamp": {
              "to": "2014-08-24T00:00:00.000+01:00"
            }
          }
        },
        {
          "query_string": {
            "fields": [
              "log_message"
            ],
            "query": "keepAlive"
          }
        }
      ]
    }
  }
}
(Something like &&? Adding a second bool query?)
Because everything I add just over-specifies the query I have, leading to results I do not want.
Thank you
Not quite sure I've correctly understood what you're looking for, but it sounds like you want to combine the results of the given query with those of some other separate query. In that case, you can nest boolean queries as should clauses, something like:
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "@timestamp": {
                    "to": "2014-08-24T00:00:00.000+01:00"
                  }
                }
              },
              {
                "query_string": {
                  "fields": [
                    "log_message"
                  ],
                  "query": "keepAlive"
                }
              }
            ]
          }
        },
        {
          **Another query here**
        }
      ]
    }
  }
}
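For instance, with the placeholder filled in by a hypothetical second pattern (healthCheck is made up here), the complete delete body could look like this; on modern Elasticsearch you would send it to POST <index>/_delete_by_query rather than the old delete-by-query endpoint implied by curl -XDELETE:
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "@timestamp": {
                    "to": "2014-08-24T00:00:00.000+01:00"
                  }
                }
              },
              {
                "query_string": {
                  "fields": [ "log_message" ],
                  "query": "keepAlive"
                }
              }
            ]
          }
        },
        {
          "query_string": {
            "fields": [ "log_message" ],
            "query": "healthCheck"
          }
        }
      ]
    }
  }
}
A document matches (and is deleted) if it satisfies either should clause; should behaves like OR, must like AND, so this combines two independent delete queries in a single request.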