How to convert this SQL query to an Elasticsearch query?

I'm new to Elasticsearch querying, so I'm a little lost on how to convert this SQL query to an Elasticsearch query:
SELECT time_interval, type, sum(count)
FROM test
WHERE (&start_date <= t_date <= &end_date)
GROUP BY time_interval, type
I know I can use the "range" query to set parameters for gte and lte, but if there's a clearer way to do this, that would be even better. Thanks in advance!
Edit:
My Elasticsearch is set up with an index "test" of type "summary", which contains JSON documents with the following fields:
t_datetime
t_date
count
type
t_id
The IDs for these JSON documents are the t_date concatenated with the t_id values.

Assuming t_datetime is the same as time_interval, you can use the query below:
POST test/summary/_search?search_type=count
{
  "aggs": {
    "filtered_results": {
      "filter": {
        "range": {
          "t_date": {
            "gte": "2015-05-01",
            "lte": "2015-05-30"
          }
        }
      },
      "aggs": {
        "time_interval_type_groups": {
          "terms": {
            "script": "doc['t_datetime'].value + '_' + doc['type'].value",
            "size": 0
          },
          "aggs": {
            "sum_of_count": {
              "sum": {
                "field": "count"
              }
            }
          }
        }
      }
    }
  }
}
This query makes use of scripts. On newer versions of Elasticsearch, dynamic scripting is disabled by default and must be enabled in the cluster configuration before this query will run.
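An alternative that avoids scripting altogether is to nest one terms aggregation inside another, one level per GROUP BY column. The following is a minimal Python sketch, not the answer's own method. It assumes the field names from the question and an Elasticsearch version that supports the bool query with a filter clause (2.0 or later); `build_group_by_body` and `flatten` are hypothetical helper names, and the top-level `"size": 0` stands in for `search_type=count` on versions where that parameter no longer exists.

```python
def build_group_by_body(start_date, end_date):
    """Body for: SELECT t_datetime, type, sum(count) ... GROUP BY t_datetime, type."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "query": {
            "bool": {
                "filter": [
                    {"range": {"t_date": {"gte": start_date, "lte": end_date}}}
                ]
            }
        },
        "aggs": {
            "by_interval": {  # first GROUP BY column
                "terms": {"field": "t_datetime"},
                "aggs": {
                    "by_type": {  # second GROUP BY column
                        "terms": {"field": "type"},
                        "aggs": {"sum_of_count": {"sum": {"field": "count"}}},
                    }
                },
            }
        },
    }


def flatten(response):
    """Turn the nested buckets of a search response into (interval, type, sum) rows."""
    rows = []
    for interval in response["aggregations"]["by_interval"]["buckets"]:
        for typ in interval["by_type"]["buckets"]:
            rows.append((interval["key"], typ["key"], typ["sum_of_count"]["value"]))
    return rows
```

A terms aggregation returns only the top 10 buckets by default, so its `size` would need to be raised if there are more groups than that.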

Related

ElasticSearch: range date with fields inside fields

I need to write an Elasticsearch query that returns the items I'm looking for.
First of all, here is one item of the JSON objects in the database that the query runs against:
{
"data": {
"circuit": {
"version": "2.12.2",
"createdOn": "2020-02-04T10:38:11.282",
"expirationDate": "2020-02-06T05:50:00.000",
"expiredSoonNotification": false
}
},
"createdDate": "2020-02-04T10:38:11.282"
}
What I need is to get all items that satisfy this condition:
now < "data.circuit.expirationDate" < ("data.circuit.expirationDate" - "createdDate")/10 + now
In other words, I need all items whose expirationDate is still in the future but less than 10% of their total lifetime away from now.
I hope I have explained my issue clearly; I don't know how to use fields inside lt or gt.
Here is what I have tried so far, but it is not working:
{
  "query": {
    "bool": {
      "must_not": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "data.circuit.expirationDate": {
                    "gt": "now",
                    "lt": ("data.circuit.expirationDate" - "createdDate")/10 + now
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "sort": [
    {
      "createdDate": {
        "order": "desc"
      }
    }
  ]
}
Thank You

You cannot do math referencing other fields in a range query. You would need to encode your logic in a script query using the Elasticsearch "painless" scripting language. Script queries are significantly slower than other queries, as the script needs to be executed for every single document. You can limit the number of documents for which the script gets executed by breaking up the logic into 2 parts:
"data.circuit.expirationDate" > now
"data.circuit.expirationDate" < (("data.circuit.expirationDate" - "createdDate")/10 + now)
Your query structure would need to look like this (pseudo-code):
"query": {
  "bool": {
    "must": { "script": "data.circuit.expirationDate" < (("data.circuit.expirationDate" - "createdDate")/10 + now) },
    "filter": { "range": "data.circuit.expirationDate" > now }
  }
}
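The pseudo-code above could be turned into a real request body roughly as follows. This is a sketch under assumptions: it targets Elasticsearch 7.x or later, where date doc values in Painless expose `toInstant().toEpochMilli()`; the current time is passed in from the client as `params.now`; and `build_expiring_soon_query` is a hypothetical helper name.

```python
import time

# Painless script: true when expirationDate < (expirationDate - createdDate)/10 + now.
# Assumes ES 7.x+ date doc values; params.now is supplied by the client in epoch millis.
PAINLESS_SOURCE = (
    "long exp = doc['data.circuit.expirationDate'].value.toInstant().toEpochMilli();"
    " long created = doc['createdDate'].value.toInstant().toEpochMilli();"
    " return exp < (exp - created) / 10 + params.now;"
)


def build_expiring_soon_query(now_ms=None):
    """Combine the cheap range filter with the per-document script filter."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return {
        "query": {
            "bool": {
                "filter": [
                    # Part 1: expirationDate > now, answered by a plain range query.
                    {"range": {"data.circuit.expirationDate": {
                        "gt": now_ms, "format": "epoch_millis"}}},
                    # Part 2: the field-to-field comparison, which needs a script.
                    {"script": {"script": {
                        "lang": "painless",
                        "source": PAINLESS_SOURCE,
                        "params": {"now": now_ms},
                    }}},
                ]
            }
        }
    }
```

Passing the same `now_ms` to both clauses keeps the two halves of the condition consistent with each other.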
You should also consider whether you really need precision down to the millisecond level. Performance-wise, it would be much better to round now to a coarser unit (e.g. now/s for second-level granularity).
Pre-calculating ("data.circuit.expirationDate" - "createdDate")/10 and storing the calculated result directly in your document would furthermore increase query-performance significantly.
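One way to pre-calculate, sketched in Python: since expirationDate < (expirationDate - createdDate)/10 + now is equivalent to expirationDate - (expirationDate - createdDate)/10 < now, the left-hand side can be stored as its own date field at indexing time, and the whole condition becomes two plain range filters. The field name `expiresSoonFrom` and both helper names are assumptions made for this illustration.

```python
from datetime import datetime


def add_expires_soon_from(doc):
    """Return a copy of doc with the pre-computed threshold date added."""
    created = datetime.fromisoformat(doc["createdDate"])
    expiration = datetime.fromisoformat(doc["data"]["circuit"]["expirationDate"])
    # Point in time after which less than 10% of the lifetime remains.
    threshold = expiration - (expiration - created) / 10
    enriched = dict(doc)
    enriched["expiresSoonFrom"] = threshold.isoformat(timespec="milliseconds")  # hypothetical field
    return enriched


def build_precomputed_query():
    """Script-free query: not yet expired, but already past the threshold."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"data.circuit.expirationDate": {"gt": "now/s"}}},
                    {"range": {"expiresSoonFrom": {"lt": "now/s"}}},
                ]
            }
        }
    }


doc = {
    "data": {"circuit": {"expirationDate": "2020-02-06T05:50:00.000"}},
    "createdDate": "2020-02-04T10:38:11.282",
}
print(add_expires_soon_from(doc)["expiresSoonFrom"])
```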

How could I append time stamp range within my elasticsearch query?

I'm trying to perform an Elasticsearch query as a POST request in order to pull data from the index I created. The data in the index is a table from a MySQL DB, loaded through Logstash.
Here is my request and the JSON body:
http://localhost:9200/response_summary/_search
Body:
{
  "query": {
    "query_string": {
      "query": "transactionoperationstatus:\"charged\" AND api:\"payment\" AND operatorid:\"XL\" AND userid:*test AND time:\"2015-05-27*\" AND responsecode:(200+201)"
    }
  },
  "aggs": {
    "total": {
      "terms": {
        "field": "userid"
      },
      "aggs": {
        "total": {
          "sum": {
            "script": "Double.parseDouble(doc['chargeamount'].value)"
          }
        }
      }
    }
  }
}
In the above JSON body, I need to append the timestamp to the query_string in order to get the data from the index within a date range. I tried adding the following at the end of the query:
AND timestamp:[2015-05-27T00:00:00.128Z+TO+2015-05-27T23:59:59.128Z]"
Where am I going wrong? Any help would be appreciated.

You just need to remove the + characters. They are only necessary when sending a query via the URL query string (i.e. to URL-encode the spaces); if you use the query_string query, you don't need them:
AND timestamp:[2015-05-27T00:00:00.128Z TO 2015-05-27T23:59:59.128Z]"
                                       ^  ^
                                       |  |
                            plain spaces instead of +
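The difference between the two contexts can be seen with Python's standard library. In this small sketch, `timestamp_range` is a hypothetical helper that builds the clause as it belongs in a JSON body; `quote_plus` then shows the form that is needed only when the same clause travels in a URL.

```python
from urllib.parse import quote_plus, unquote_plus


def timestamp_range(start, end):
    """Range clause as it should appear inside a query_string query (plain spaces)."""
    return f"timestamp:[{start} TO {end}]"


clause = timestamp_range("2015-05-27T00:00:00.128Z", "2015-05-27T23:59:59.128Z")

# Only when the clause goes into a URL (?q=...) are the spaces encoded as "+".
encoded = quote_plus(clause)

print(clause)
print(encoded)
```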

How could I have MySQL sum() and group by clause within my elasticsearch query?

I'm trying to perform an Elasticsearch query as a GET request in order to pull data from the index I created. The data in the index is a table from a MySQL DB, loaded through Logstash.
Here is my request without the IN clause:
http://localhost:9200/response_summary/_search?q=api:"location"+AND+transactionoperationstatus:"charged"+AND+operatorid='DIALOG'+AND+userid:test+AND+time:"2015-05-27"
In the above, I should be able to append sum(chargeAmount+0) & group by . I tried searching the web, but couldn't find any solutions.
Any help would be appreciated.

Whatever you put after the q=... in your query uses the same syntax as a query_string query, so you can rewrite your query to leverage query_string and use aggregations to compute the desired sum:
curl -XPOST http://localhost:9200/response_summary/_search -d '{
  "query": {
    "query_string": {
      "query": "api:\"location\" AND transactionoperationstatus:\"charged\" AND operatorid:\"DIALOG\" AND userid:test AND time:\"2015-05-27\" AND responseCode:(401 403)"
    }
  },
  "aggs": {
    "total": {
      "terms": {
        "field": "chargeAmount"
      },
      "aggs": {
        "total": {
          "sum": {
            "field": "chargeAmount"
          }
        }
      }
    }
  }
}'

Can you suggest a solution for a big, relational data analyzer?

I'm looking for some suggestions on my requirements, which are described below. Feel free to contact me for any details. Suggestions on how I can describe my question more clearly are also very much appreciated :)
Requirements description
I have some data, the format is like below:
router, interface,timestamp, src_ip, dst_ip, src_port, dst_port, protocol, bits
r1, 1, 1453016443, 10.0.0.1, 10.0.0.2, 100, 200, tcp, 108
r2, 1, 1453016448, 10.0.0.3, 10.0.0.8, 200, 200, udp, 100
As you can see, it is raw network data. I have omitted some columns to make it clearer. The volume of data is very large, and it is generated very fast, around 1 billion rows every 5 minutes...
What I want is to do some real-time analysis on this data.
For example:
draw a line using the timestamp
select sum(bits) , timestamp from raw_data group by router,interface where interface = 1, router=r1.
find out which 3 src_ip sending the most data for one interface
select sum(bits) from raw_data where router=r1 and interface=2 group by src_ip order by sum(bits) desc limit 3
I have already tried some solutions, and none of them is quite suitable. For example:
rdbms
MySQL seems fine except for a few problems:
the data is too big
I have a lot more columns than I described here. To improve query speed, I have to add indexes on most of the columns. But I think creating indexes on a big table, with indexes covering that many columns, is not a good idea, right?
openTSDB
OpenTSDB is a good time series database, but it is also not suitable for my requirements.
OpenTSDB has trouble with the top-N problem. My requirement "get the top 3 src_ip sending the most data" cannot be solved with OpenTSDB.
Spark
I know that Apache Spark can be used like an RDBMS through its Spark SQL feature. I did not try it, but I guess the performance would not satisfy the real-time analysis/query requirement, right? After all, Spark is more suitable for offline calculation, right?
Elastic Search
I had high hopes for ES when I learned about this project, but it is not suitable either. When you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted. You have to retrieve all the results and sort them yourself. In my case, the result set is too large, so sorting it would be very difficult.
So... I'm stuck here. Can anyone give some suggestions, please?

I don't see why ES would not be able to meet your requirements. I think you misunderstood this part:
"But it is not suitable either. When you aggregate on more than one column, you have to use the so-called nested bucket aggregation in Elasticsearch, and the result of this aggregation cannot be sorted."
Your first requirement, "draw a line using the timestamp", could easily be achieved with a query/aggregation like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 1
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_minute": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "1m"
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
As for your second requirement, "find out which 3 src_ip sending the most data for one interface", it can also easily be achieved with a query/aggregation like this one:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "field": "src_ip",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
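Because the terms aggregation is ordered by the sum_bits sub-aggregation, the buckets already come back sorted and no client-side sorting is needed. A small Python sketch of reading them out follows; the response fragment is made up for illustration (only its shape follows what a terms aggregation with a sum sub-aggregation returns), and `top_talkers` is a hypothetical helper name.

```python
def top_talkers(response, agg_name="by_src_ip", metric_name="sum_bits"):
    """Return (key, total) pairs in the order Elasticsearch returned the buckets."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket[metric_name]["value"]) for bucket in buckets]


# Made-up response fragment; real values depend on the indexed data.
sample_response = {
    "aggregations": {
        "by_src_ip": {
            "buckets": [
                {"key": "10.0.0.3", "doc_count": 4, "sum_bits": {"value": 900.0}},
                {"key": "10.0.0.1", "doc_count": 7, "sum_bits": {"value": 650.0}},
                {"key": "10.0.0.9", "doc_count": 2, "sum_bits": {"value": 120.0}},
            ]
        }
    }
}

for src_ip, total_bits in top_talkers(sample_response):
    print(src_ip, total_bits)
```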
UPDATE
According to your comment, your second requirement above could change to finding the top 3 combinations of src_ip/dst_ip. This would be doable with a terms aggregation that uses a script instead of a field; the script builds the src/dst combination, and the aggregation provides the sum of bits for each pair, like this:
{
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "interface": 2
          }
        },
        {
          "term": {
            "router": "r1"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_src_ip": {
      "terms": {
        "script": "[doc.src_ip.value, doc.dst_ip.value].join('-')",
        "size": 3,
        "order": {
          "sum_bits": "desc"
        }
      },
      "aggs": {
        "sum_bits": {
          "sum": {
            "field": "bits"
          }
        }
      }
    }
  }
}
Note that in order to run this last query, you'll need to enable dynamic scripting. Also since you'll have billions of documents, scripting might not be the best solution, but it's worth giving it a try before diving further. One other possible solution would be to add a combination field (src_ip-dst_ip) at indexing time so that you can use it as a field in your terms aggregation without having to resort to scripting.
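The indexing-time alternative could look roughly like this in Python. The field name `src_dst` and both helper names are assumptions for illustration; the document would be enriched before it is sent to Elasticsearch, and the aggregation then uses the new field directly.

```python
def with_pair_field(doc, field="src_dst"):
    """Return a copy of doc with a combined src_ip-dst_ip field added."""
    enriched = dict(doc)
    enriched[field] = f"{doc['src_ip']}-{doc['dst_ip']}"
    return enriched


def top_pairs_agg(field="src_dst", size=3):
    """Script-free terms aggregation over the pre-computed combination field."""
    return {
        "by_pair": {
            "terms": {"field": field, "size": size, "order": {"sum_bits": "desc"}},
            "aggs": {"sum_bits": {"sum": {"field": "bits"}}},
        }
    }


row = {"router": "r1", "interface": 1, "src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "bits": 108}
print(with_pair_field(row)["src_dst"])
```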

You can try Axibase Time Series Database, which is non-relational but supports SQL queries in addition to a REST-like API. Here's a top-N query example:
SELECT entity, avg(value) FROM cpu_busy
WHERE time between now - 1 * hour and now
GROUP BY entity
ORDER BY avg(value) DESC
LIMIT 3
https://axibase.com/docs/atsd/sql/#grouping
ATSD Community Edition is free.
Disclosure: I work for Axibase

Conversion from sql to elastic search query

How can I convert the following SQL query into an Elasticsearch query?
SELECT sum(`price_per_unit`*`quantity`) as orders
FROM `order_demormalize`
WHERE date(`order_date`)='2014-04-15'

You need to use a script to compute the product of the values. For newer versions of Elasticsearch, enable dynamic scripting by adding the line script.disable_dynamic: false to the elasticsearch.yml file. Note that this may open a security hole in your Elasticsearch cluster, so enable scripting judiciously. Try the query below:
POST <indexname>/<typename>/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "order_date": "2014-04-15"
        }
      }
    }
  },
  "aggs": {
    "orders": {
      "sum": {
        "script": "doc['price_per_unit'].value * doc['quantity'].value"
      }
    }
  }
}
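The filtered query and search_type=count were removed in Elasticsearch 5.0, so on later versions the same request has to be shaped differently. The Python sketch below is one possible translation, assuming a 5.x to 7.x cluster where Painless is the default inline scripting language and order_date is mapped as a date; `build_orders_total_body` is a hypothetical helper name. The date(order_date) = '2014-04-15' condition is expressed as a half-open range covering that day.

```python
def build_orders_total_body(day, next_day):
    """Body for: SELECT sum(price_per_unit * quantity) WHERE date(order_date) = day."""
    return {
        "size": 0,  # replaces search_type=count: aggregations only, no hits
        "query": {
            "bool": {  # replaces the removed "filtered" query
                "filter": [
                    {"range": {"order_date": {"gte": day, "lt": next_day}}}
                ]
            }
        },
        "aggs": {
            "orders": {
                "sum": {
                    "script": {
                        "lang": "painless",
                        "source": "doc['price_per_unit'].value * doc['quantity'].value",
                    }
                }
            }
        },
    }


body = build_orders_total_body("2014-04-15", "2014-04-16")
```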
