Converting an SQL query to an Elasticsearch query - mysql

How can I convert the following SQL query into an Elasticsearch query?
SELECT sum(`price_per_unit`*`quantity`) as orders
FROM `order_demormalize`
WHERE date(`order_date`)='2014-04-15'

You need to use a script to compute the product of the two fields. In newer versions of Elasticsearch, dynamic scripting is disabled by default; enable it by adding the line script.disable_dynamic: false to the elasticsearch.yml file. Note that this may open a security hole in your Elasticsearch cluster, so enable scripting judiciously. Try the query below:
POST <indexname>/<typename>/_search?search_type=count
{
"query": {
"filtered": {
"filter": {
"term": {
"order_date": "2014-04-15"
}
}
}
},
"aggs": {
"orders": {
"sum": {
"script": "doc['price_per_unit'].value * doc['quantity'].value"
}
}
}
}
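One caveat: the date() call in the SQL suggests that order_date carries a time component, in which case a term filter on "2014-04-15" may only match documents stamped exactly at midnight. A minimal sketch, assuming that is the case, of building the same request body with a range filter that spans the whole day. The helper name build_daily_sum_query is hypothetical; the body reuses the filtered-query syntax from the answer above.

```python
import json
from datetime import date, timedelta


def build_daily_sum_query(day):
    """Request body summing price_per_unit * quantity for one calendar day."""
    next_day = day + timedelta(days=1)
    return {
        "query": {
            "filtered": {
                "filter": {
                    # Half-open range [day, next_day) covers every timestamp on that day.
                    "range": {
                        "order_date": {
                            "gte": day.isoformat(),
                            "lt": next_day.isoformat(),
                        }
                    }
                }
            }
        },
        "aggs": {
            "orders": {
                "sum": {
                    "script": "doc['price_per_unit'].value * doc['quantity'].value"
                }
            }
        },
    }


body = build_daily_sum_query(date(2014, 4, 15))
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the same _search request shown above.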

Related

Elasticsearch Query DSL

I get log files from my firewall that I want to filter for several strings.
However, each string always contains some other information, so I want to filter the whole string for specific words that always appear in it: "User", "authentication", and "failed".
I tried this, but I do not get any data from it:
{
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"gt": "now-15m"
}
}
},
{
"query_string": {
"query": "User AND authentication AND failed"
}
}
]
}
}
}
However, I cannot find the syntax for filtering on specific words within a string. Hopefully some of you can help me.
The field of the log message that I want to filter on is "event.original".

How can I use the equivalent of MySQL's sum() and GROUP BY in my Elasticsearch query?

I'm trying to perform an Elasticsearch query as a GET request in order to pull data from an index that I created. The data in the index comes from a MySQL table and was loaded through Logstash.
Here is my request without the IN clause:
http://localhost:9200/response_summary/_search?q=api:"location"+AND+transactionoperationstatus:"charged"+AND+operatorid='DIALOG'+AND+userid:test+AND+time:"2015-05-27"
To the above, I would like to append sum(chargeAmount+0) & group by . I searched the web but couldn't find any solutions.
Any help would be appreciated.
Whatever you put after the q=... in your query uses the same syntax as a query_string query, so you can rewrite your query to leverage query_string and use aggregations to compute the desired sum:
curl -XPOST http://localhost:9200/response_summary/_search -d '{
"query": {
"query_string": {
"query": "api:\"location\" AND transactionoperationstatus:\"charged\" AND operatorid:\"DIALOG\" AND userid:test AND time:\"2015-05-27\" AND responseCode:(401 403)"
}
},
"aggs": {
"total": {
"terms": {
"field": "chargeAmount"
},
"aggs":{
"total": {
"sum": {
"field": "chargeAmount"
}
}
}
}
}
}'
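The responseCode:(401 403) group in the query above is how query_string expresses an SQL IN list, since terms inside the parentheses are ORed by default. A small sketch of assembling such a query string from Python values; the helper names are hypothetical, and values are quoted naively, without escaping query_string special characters.

```python
def in_clause(field, values):
    """Render an SQL-style IN list as a query_string group, e.g. code:(401 403)."""
    return "{}:({})".format(field, " ".join(str(v) for v in values))


def build_query_string(exact, in_lists):
    """AND together exact-match terms and IN-style groups."""
    parts = ['{}:"{}"'.format(field, value) for field, value in exact.items()]
    parts += [in_clause(field, values) for field, values in in_lists.items()]
    return " AND ".join(parts)


query = build_query_string(
    {"api": "location", "transactionoperationstatus": "charged"},
    {"responseCode": [401, 403]},
)
print(query)
```

The resulting string can be placed in the "query" key of the query_string block.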

Can you suggest a solution for analyzing big, relational data?

I'm looking for suggestions on my requirements, which are described below. Feel free to contact me for any details. Suggestions on how I can describe my question more clearly are also very much appreciated :)
Requirements description
I have some data in the following format:
router, interface, timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, this is raw network data. I omitted some columns to keep it clear. The volume of data is very large, and it is generated very fast, about 1 billion rows every 5 minutes.
What I want is to do some real-time analysis on this data.
For example:
draw a line using the timestamp
select timestamp, sum(bits) from raw_data where router = 'r1' and interface = 1 group by timestamp
find out which 3 src_ip values send the most data for one interface
select src_ip, sum(bits) from raw_data where router = 'r1' and interface = 2 group by src_ip order by sum(bits) desc limit 3
I have already tried some solutions, but none of them is a good fit. For example:
RDBMS
MySQL seems fine except for a few problems:
the data is too big
I have many more columns than I described here. To improve query speed, I have to add indexes on most of the columns. But I think creating indexes on a big table, with the indexes covering so many columns, is not a good idea, right?
OpenTSDB
OpenTSDB is a good time series database, but it is also not suitable for my requirements.
OpenTSDB has trouble with the top-N problem. My requirement "get the top 3 src_ip values sending the most data" cannot be solved with OpenTSDB.
Spark
I know that Apache Spark can be used like an RDBMS through its Spark SQL feature. I did not try it, but I guess its performance would not satisfy the real-time analysis/query requirement, right? After all, Spark is more suitable for offline computation, right?
Elasticsearch
I had high hopes for ES when I learned about the project. But it is not suitable either, because when you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted. You have to retrieve the whole result and sort it yourself. In my case, the result is too large, so sorting it would be very difficult.
So... I'm stuck here. Can anyone give some suggestions, please?
I don't see why ES would not be able to meet your requirements. I think you misunderstood this part:
But it is not suitable either, because when you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted.
Your first requirement, "draw a line using the timestamp", could easily be achieved with a query/aggregation like this:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 1
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_minute": {
"date_histogram": {
"field": "timestamp",
"interval": "1m"
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
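As a cross-check on a small sample, the per-minute bucketing that this date_histogram performs can be reproduced in plain Python. This is only a sketch: the rows are made-up sample values, the function name is hypothetical, and it assumes timestamps are epoch seconds as in the sample data in the question.

```python
from collections import defaultdict

# Illustrative rows only: (router, interface, timestamp in epoch seconds, bits)
rows = [
    ("r1", 1, 1453016443, 108),
    ("r1", 1, 1453016448, 100),
    ("r1", 1, 1453016465, 50),
    ("r2", 1, 1453016448, 999),  # different router, filtered out
]


def bits_per_minute(rows, router, interface):
    """Sum bits into one-minute buckets keyed by the start of each minute."""
    buckets = defaultdict(int)
    for r, i, ts, bits in rows:
        if r == router and i == interface:
            minute_start = ts - ts % 60  # floor to the minute boundary
            buckets[minute_start] += bits
    return dict(sorted(buckets.items()))


series = bits_per_minute(rows, "r1", 1)
print(series)
```

Each key is the start of a minute and each value is the sum of bits in that minute, which is the shape of data needed to draw the line.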
As for your second requirement, "find out which 3 src_ip values send the most data for one interface", it can also easily be achieved with a query/aggregation like this one:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 2
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_src_ip": {
"terms": {
"field": "src_ip",
"size": 3,
"order": {
"sum_bits": "desc"
}
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
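The key point is that a terms aggregation can be ordered by one of its own sub-aggregations, so the sorting happens inside Elasticsearch. To make the intended result concrete, here is a plain-Python sketch of the same top-N computation on a few made-up rows; the function name and the sample values are illustrative assumptions, not data from the question.

```python
from collections import defaultdict

# Illustrative rows only: (router, interface, src_ip, bits)
rows = [
    ("r1", 2, "10.0.0.1", 108),
    ("r1", 2, "10.0.0.3", 100),
    ("r1", 2, "10.0.0.1", 50),
    ("r1", 2, "10.0.0.4", 10),
    ("r1", 2, "10.0.0.5", 300),
    ("r2", 2, "10.0.0.9", 999),  # different router, filtered out
]


def top_src_ips(rows, router, interface, n=3):
    """Return the n (src_ip, total_bits) pairs with the largest totals."""
    totals = defaultdict(int)
    for r, i, src_ip, bits in rows:
        if r == router and i == interface:
            totals[src_ip] += bits
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


top3 = top_src_ips(rows, "r1", 2)
print(top3)
```

The buckets returned by the aggregation above should correspond to this list: one entry per src_ip, already sorted by sum_bits and truncated to 3.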
UPDATE
According to your comment, your second requirement above could change to finding the top 3 combinations of src_ip/dst_ip. This would be doable with a terms aggregation that uses a script instead of a field: the script builds the src/dest combination, and the aggregation provides the sum of bits for each pair, like this:
{
"query": {
"bool": {
"must": [
{
"term": {
"interface": 2
}
},
{
"term": {
"router": "r1"
}
}
]
}
},
"aggs": {
"by_src_ip": {
"terms": {
"script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
"size": 3,
"order": {
"sum_bits": "desc"
}
},
"aggs": {
"sum_bits": {
"sum": {
"field": "bits"
}
}
}
}
}
}
Note that in order to run this last query, you'll need to enable dynamic scripting. Also since you'll have billions of documents, scripting might not be the best solution, but it's worth giving it a try before diving further. One other possible solution would be to add a combination field (src_ip-dst_ip) at indexing time so that you can use it as a field in your terms aggregation without having to resort to scripting.
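The indexing-time alternative could look roughly like the sketch below, which adds the combined value to each document before it is sent to Elasticsearch. The field name src_dst and the helper add_flow_key are hypothetical, and the sketch assumes the new field is mapped as not_analyzed so that the terms aggregation sees the whole value as a single term.

```python
def add_flow_key(doc):
    """Return a copy of the document with a combined src/dst field added."""
    enriched = dict(doc)
    enriched["src_dst"] = "{}-{}".format(doc["src_ip"], doc["dst_ip"])
    return enriched


# Aggregation that uses the precomputed field instead of a script.
top_pairs_agg = {
    "by_src_dst": {
        "terms": {
            "field": "src_dst",  # hypothetical field added at indexing time
            "size": 3,
            "order": {"sum_bits": "desc"},
        },
        "aggs": {"sum_bits": {"sum": {"field": "bits"}}},
    }
}

doc = {
    "router": "r1",
    "interface": 1,
    "src_ip": "10.0.0.1",
    "dst_ip": "10.0.0.2",
    "bits": 108,
}
enriched = add_flow_key(doc)
print(enriched["src_dst"])
```

The cost of building the key is then paid once per document at indexing time rather than once per document on every query.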
You can try Axibase Time Series Database, which is non-relational but supports SQL queries in addition to a REST-like API. Here's a top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
ATSD Community Edition is free.
Disclosure: I work for Axibase

How to convert this SQL query to an Elasticsearch query?

I'm new to Elasticsearch querying, so I'm a little lost on how to convert this SQL query to an Elasticsearch query:
SELECT time_interval, type, sum(count)
FROM test
WHERE (&start_date <= t_date <= &end_date)
GROUP BY time_interval, type
I know I can use the "range" query to set parameters for gte and lte, but if there's a clearer way to do this, that would be even better. Thanks in advance!
Edit:
My Elasticsearch is set up with an index "test" and a type "summary", and contains JSON documents that have a few fields:
t_datetime
t_date
count
type
t_id
The IDs for these JSON documents are the t_date concatenated with the t_id values.
Assuming t_datetime is the same as time_interval, you can use the query below:
POST test/summary/_search?search_type=count
{
"aggs": {
"filtered_results": {
"filter": {
"range": {
"t_date": {
"gte": "2015-05-01",
"lte": "2015-05-30"
}
}
},
"aggs": {
"time_interval_type_groups": {
"terms": {
"script": "doc['t_datetime'].value + '_' + doc['type'].value",
"size": 0
},
"aggs": {
"sum_of_count": {
"sum": {
"field": "count"
}
}
}
}
}
}
}
}
This query makes use of scripts. On newer versions of Elasticsearch, dynamic scripting is disabled by default and must be enabled before this query will run.
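Because the script concatenates the two grouping values into a single bucket key, the client has to split the key again to recover the SQL-style rows. A minimal sketch of that step, assuming the type value contains no underscore; the function name and the sample response are made up for illustration, but the nesting follows the aggregation names used in the query above.

```python
def rows_from_response(response):
    """Turn the aggregation buckets into (time_interval, type, sum) tuples."""
    buckets = (
        response["aggregations"]["filtered_results"]
        ["time_interval_type_groups"]["buckets"]
    )
    rows = []
    for bucket in buckets:
        # Split on the last underscore: everything before it is t_datetime.
        time_interval, doc_type = bucket["key"].rsplit("_", 1)
        rows.append((time_interval, doc_type, bucket["sum_of_count"]["value"]))
    return rows


# Made-up response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "filtered_results": {
            "doc_count": 3,
            "time_interval_type_groups": {
                "buckets": [
                    {"key": "1430438400000_a", "doc_count": 2,
                     "sum_of_count": {"value": 42.0}},
                    {"key": "1430438400000_b", "doc_count": 1,
                     "sum_of_count": {"value": 7.0}},
                ]
            },
        }
    }
}

result_rows = rows_from_response(sample_response)
print(result_rows)
```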

Lucene Multiple delete query (JSON)

I have a problem with a script that I wrote for Elasticsearch. On my server I have multiple log files that need to be deleted on a daily basis. To automate this process, I wrote a Perl script that deletes my keep-alive log files.
Basically a curl -XDELETE.
But now I want to add a query to delete another log file.
Is it possible to add another JSON object without creating another delete variable?
In other words, can I add something to my JSON that integrates a separate query that also deletes that log?
{
"query": {
"bool": {
"must": [
{
"range": {
"@timestamp": {
"to": "2014-08-24T00:00:00.000+01:00"
}
}
},
{
"query_string": {
"fields": [
"log_message"
],
"query": "keepAlive"
}
}
]
}
}
}
(Something like &&? Adding a second bool query?)
Because everything I add just over-specifies the query I have, leading to results I do not want.
Thank you
Not quite sure I've correctly understood what you're looking for, but it sounds like you want to combine the results of the given query with those of some other, separate query. In that case, you can nest bool queries as should clauses, something like:
{
"query": {
"bool": {
"should": [
{
"bool": {
"must": [
{
"range": {
"@timestamp": {
"to": "2014-08-24T00:00:00.000+01:00"
}
}
},
{
"query_string": {
"fields": [
"log_message"
],
"query": "keepAlive"
}
}
]
}
},
{
**Another query here**
}
]
}
}
}
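Since hand-edited JSON is prone to stray commas, one option is to build the combined body programmatically and let a JSON library serialize it. A minimal sketch: the helper names are hypothetical, the second query is a placeholder that is not taken from the question, and "@timestamp" is assumed to be the Logstash-style timestamp field (substitute whatever your mapping uses).

```python
import json


def keep_alive_before(timestamp_field, cutoff):
    """The original delete query: keepAlive messages up to the cutoff."""
    return {
        "bool": {
            "must": [
                {"range": {timestamp_field: {"to": cutoff}}},
                {"query_string": {"fields": ["log_message"], "query": "keepAlive"}},
            ]
        }
    }


def any_of(*queries):
    """Wrap independent queries so a document matching any one of them is selected."""
    return {"query": {"bool": {"should": list(queries)}}}


# Hypothetical second query; replace with whatever identifies the other log.
other_log = {"query_string": {"fields": ["log_message"], "query": "heartbeat"}}

body = any_of(
    keep_alive_before("@timestamp", "2014-08-24T00:00:00.000+01:00"),
    other_log,
)
print(json.dumps(body, indent=2))
```

With only should clauses in the outer bool, a document has to match at least one of them, so each nested query keeps its own conditions without narrowing the other.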