ElasticSearch multiple terms search (JSON)

I have a ton of items in a DB with many columns, and I need to search across two of those columns to get one data set.
The first column, genericCode, groups together any rows that share that code.
The second column, genericId, calls out a specific row to add, because that row is missing the genericCodes I'm looking for.
The back-end C# builds my JSON for me as follows, but it returns nothing:
{
  "from": 0,
  "size": 50,
  "aggs": {
    "paramA": {
      "nested": {
        "path": "paramA"
      },
      "aggs": {
        "names": {
          "terms": {
            "field": "paramA.name",
            "size": 10
          },
          "aggs": {
            "specValues": {
              "terms": {
                "field": "paramA.value",
                "size": 10
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "locationId": {
              "value": 1
            }
          }
        },
        {
          "terms": {
            "genericCode": [
              "10A",
              "20B"
            ]
          }
        },
        {
          "terms": {
            "genericId": [
              11223344
            ]
          }
        }
      ]
    }
  }
}
I get an empty result set. If I remove either of the "terms" clauses I get what I would expect, which makes sense: inside "must" every clause has to match the same document, so a row that matches the genericCode list but not the genericId list is excluded. So, I just need to combine those terms into one search.
I've gone through a lot of the documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html
and still can't seem to find what I'm looking for.
Let's say I'm Jelly Belly and I want to create a bag of jelly beans with all the Star Wars and Disney jelly beans, and I also want to add all beans of the color green. That is basically what I'm trying to do.
EDIT: Changing the "must" to "should" isn't quite right either. I need it to be (in pseudo-SQL):
SELECT *
FROM myTable
WHERE locationId = 1
AND (
  genericCode IN ('this', 'that')
  OR
  genericId IN (1234, 5678)
)
The locationId separates our data in an important way.
I found this post: elasticsearch bool query combine must with OR
and it has gotten me closer, but not all the way there...
I've tried several iterations of should > must > should > must building this query and get varying results, but nothing accurate.

Here is the query that is working. It helped when I realized I was passing in the wrong data for one of my parameters. Doh.
Nest the should inside the must, as @khachik noted in the comment above. I had this some time ago, but it wasn't working due to the blunder mentioned above.
{
  "from": 0,
  "size": 10,
  "aggs": {
    "specs": {
      "nested": {
        "path": "paramA"
      },
      "aggs": {
        "names": {
          "terms": {
            "field": "paramA.name",
            "size": 10
          },
          "aggs": {
            "specValues": {
              "terms": {
                "field": "paramA.value",
                "size": 10
              }
            }
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "locationId": {
              "value": 1
            }
          }
        },
        {
          "bool": {
            "should": [
              {
                "terms": {
                  "genericCode": [
                    "101",
                    "102"
                  ]
                }
              },
              {
                "terms": {
                  "genericId": [
                    3078711,
                    3119430
                  ]
                }
              }
            ]
          }
        }
      ]
    }
  }
}
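As a side note, none of these clauses need relevance scoring, so the same logic can also run in filter context, which skips scoring and lets Elasticsearch cache the clauses. A minimal sketch of the equivalent query part (same fields and values as above; minimum_should_match is spelled out, though 1 is already the default for a bool with only should clauses):
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "locationId": 1
          }
        },
        {
          "bool": {
            "should": [
              { "terms": { "genericCode": [ "101", "102" ] } },
              { "terms": { "genericId": [ 3078711, 3119430 ] } }
            ],
            "minimum_should_match": 1
          }
        }
      ]
    }
  }
}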

Related

Elasticsearch - How to get the latest record in each group with filter?

I have a few records in Elasticsearch that I want to group by user_id, fetching the latest record for each group, but only where that latest record's event_type is 1.
If a user's latest record has an event_type other than 1, that user should not be fetched at all. I did this with a MySQL query; please let me know how I can do the same in Elasticsearch.
The MySQL query that produces what I want:
SELECT * FROM user_events
WHERE id IN (SELECT MAX(id) FROM user_events GROUP BY user_id)
  AND event_type = 1;
I need the same output from Elasticsearch aggregations.
Elasticsearch query:
GET test_analytic_report/_search
{
  "from": 0,
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "event_date": {
              "gte": "2022-10-01",
              "lte": "2023-02-06"
            }
          }
        }
      ]
    }
  },
  "sort": {
    "event_date": {
      "order": "desc"
    }
  },
  "aggs": {
    "group": {
      "terms": {
        "field": "user_id"
      },
      "aggs": {
        "group_docs": {
          "top_hits": {
            "size": 1,
            "_source": ["user_id", "event_date", "event_type"],
            "sort": {
              "user_id": "desc"
            }
          }
        }
      }
    }
  }
}
With the above query, I have two users, whose user_ids are 55 and 56. The query fetches data with other event_types, but I want only event_type=1, and only when it is the latest record; if a user's last record does not have event_type=1, that user should not appear. For example, user_id 56's latest record has event_type 2, so it should not come back in my aggregation.
I tried, but it's not returning the exact result I want.
Note: event_date is the current date and time. The sample data was inserted manually, which is why the dates differ.
GET user_events/_search
{
  "size": 1,
  "query": {
    "term": {
      "event_type": 1
    }
  },
  "sort": [
    {
      "id": {
        "order": "desc"
      }
    }
  ]
}
Explanation: this is an Elasticsearch API request in JSON format. It retrieves the latest event of type 1 (specified by "event_type": 1 in the query) from the user_events index, limited to a single result ("size": 1) and sorted in descending order by the id field ("order": "desc" in the sort).
If your ES version supports it, you can do this with the field collapse feature. Here is an example query:
{
  "_source": false,
  "query": {
    "bool": {
      "filter": {
        "term": {
          "event_type": 1
        }
      }
    }
  },
  "collapse": {
    "field": "user_id",
    "inner_hits": {
      "name": "the_record",
      "size": 1,
      "sort": [
        {
          "id": "desc"
        }
      ]
    }
  },
  "sort": [
    {
      "id": {
        "order": "desc"
      }
    }
  ]
}
In the response, you will see that the document you want is in inner_hits, under the name you gave it; in my example that is the_record. You can increase the inner hits size if you want more records in each group, and sort them as needed.
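Roughly, each hit in the response then carries its group's documents under inner_hits, along these lines (shape only; IDs and values are illustrative):
{
  "hits": {
    "hits": [
      {
        "_id": "...",
        "fields": { "user_id": [ 55 ] },
        "inner_hits": {
          "the_record": {
            "hits": {
              "hits": [
                { "_source": { "user_id": 55, "event_type": 1 } }
              ]
            }
          }
        }
      }
    ]
  }
}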
Tl;dr:
There are many ways to go about it:
Sorting
Collapsing
Latest transform
All of those solutions are approximations of what you could get with SQL. But my personal favourite is the transform.
Solution: transform jobs
Set up
We create 2 users, with 2 events each:
PUT 75324839/_bulk
{"create":{}}
{"user_id": 1, "type": 2, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 1, "type": 1, "date": "2016-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 1, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 2, "date": "2016-01-01T00:00:00.000Z"}
Transform job
This transform job runs against the index 75324839.
It finds the latest document per user_id, based on the value of the date field,
and stores the results in latest_75324839.
PUT _transform/75324839
{
  "source": {
    "index": [
      "75324839"
    ]
  },
  "latest": {
    "unique_key": [
      "user_id"
    ],
    "sort": "date"
  },
  "dest": {
    "index": "latest_75324839"
  }
}
If you were to query latest_75324839, you would find:
{
  "hits": [
    {
      "_index": "latest_75324839",
      "_id": "AGvuZWuqqz7c5ytICzX5Z74AAAAAAAAA",
      "_score": 1,
      "_source": {
        "date": "2016-01-01T00:00:00.000Z",
        "user_id": 1,
        "type": 1
      }
    },
    {
      "_index": "latest_75324839",
      "_id": "AA3tqz9zEwuio1D73_EArycAAAAAAAAA",
      "_score": 1,
      "_source": {
        "date": "2016-01-01T00:00:00.000Z",
        "user_id": 2,
        "type": 2
      }
    }
  ]
}
Get the final results
To get the number of users with type=1, a simple search query like the following will do:
GET latest_75324839/_search
{
  "query": {
    "term": {
      "type": {
        "value": 1
      }
    }
  },
  "aggs": {
    "number_of_user": {
      "cardinality": {
        "field": "user_id"
      }
    }
  }
}
Side notes
This transform job runs in batch mode, meaning it only runs once.
It is also possible to run it continuously, so that latest_75324839 always holds the latest event per user_id.
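For reference, a continuous version of the same job only needs a sync block telling the transform how to detect new documents (the frequency and delay values here are assumptions; pick what suits your ingest lag):
PUT _transform/75324839_continuous
{
  "source": {
    "index": [
      "75324839"
    ]
  },
  "latest": {
    "unique_key": [
      "user_id"
    ],
    "sort": "date"
  },
  "dest": {
    "index": "latest_75324839"
  },
  "frequency": "60s",
  "sync": {
    "time": {
      "field": "date",
      "delay": "60s"
    }
  }
}
followed by POST _transform/75324839_continuous/_start to set it running.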
You are looking for an SQL HAVING clause, which would allow you to filter results after grouping, but sadly there is nothing equivalent in Elasticsearch.
So it is not possible to:
sort, collapse and filter afterwards (even post_filter does not help here)
use a top_hits aggregation with custom sorting and then filter
use any map/reduce scripted aggregations, as they do not support sorting
work with subqueries
Basically, Elasticsearch is not a database. Any sorting or relation to other documents should be based on scoring, and the score should be calculated independently for each document, distributed across the shards.
But there is a tiny loophole which might be the solution for your use case. It is based on a top_metrics aggregation followed by a bucket_selector to eliminate the unwanted event types:
GET test_analytic_report/_search
{
  "size": 0,
  "aggs": {
    "by_id": {
      "terms": {
        "field": "user_id",
        "size": 100
      },
      "aggs": {
        "tm": {
          "top_metrics": {
            "metrics": {
              "field": "event_type"
            },
            "sort": [
              {
                "id": {
                  "order": "desc"
                }
              }
            ]
          }
        },
        "event_type_filter": {
          "bucket_selector": {
            "buckets_path": {
              "event_type": "tm.event_type"
            },
            "script": "params.event_type == 1"
          }
        }
      }
    }
  }
}
If you require more fields from the source document, you can add them to the top_metrics. It is sorted by id here, but you could also use event_date.
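For instance, to also return the event date with each bucket, metrics accepts a list of fields; a sketch of just the changed part:
"top_metrics": {
  "metrics": [
    { "field": "event_type" },
    { "field": "event_date" }
  ],
  "sort": [
    { "id": { "order": "desc" } }
  ]
}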

JSON schema help, array of objects

I am trying to write a JSON schema where the key "pStock" holds the total stock as an array of bike sizes ('size'), and each size has an inventory or 'count'. I have two versions of the same code. The first one returns an error message, even though the syntax looks correct to my eye:
"pStock": [
{
"size": {
"type": "string",
"count": {
"type": "number"
}
}
}
}
]
Here is the second version, which returns no errors, but I'm not quite sure it's saying what I want it to say:
"pStock": {
"type": ["object"],
"size": {
"type": "string",
"count": {
"type": "number"
}
}
}
EDIT 1
I appreciate all of these responses. I made a silly error in posting; below is the correct "wrong" code that isn't working. It produces the error: 'Error, schema is invalid: data/properties/pStock should be object,boolean at Ajv.validateSchema'
"pStock": [
{
"size": {
"type": "string",
"count": {
"type": "number"
}
}
}
]
Any help would be greatly appreciated.
Count the opening and closing curly braces in your first JSON: it has 3 opening and 4 closing.
"pStock": [
{ // Open 1
"size": { // Open 2
"type": "string",
"count": { // Open 3
"type": "number"
} // Close 3
} // Close 2
} // Close 1
} // Close what?
]
Just remove the last one and it will work.
You are missing the closing square bracket ] on the pStock array because you have an extra brace }, i.e.
"pStock": [
{
"size": {
"type": "string",
"count": {
"type": "number"
}
}
}
} <--- this is wrong
]
should be
{
  "pStock": [
    {
      "size": {
        "type": "string",
        "count": {
          "type": "number"
        }
      }
    }
  ]
}
The first version should look like this:
"pStock": [
{
"size": {
"type": "string",
"count": {
"type": "number"
}
}
}
]
You had one closing brace } too many.
The second version does not represent what you wanted: it does not contain the array of sizes.
But you could create this instead (pStock with one key per size, each holding its inventory/count):
"pStock": {
"size1": {
inventory: "5",
count: 4
},
"size2": {
inventory: "5",
count: 4
}
}
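For completeness, if the goal really is a JSON Schema describing an array of size objects, the conventional shape uses "type": "array" with an items subschema rather than a literal array; a sketch (my own, not taken from the answers above):
"pStock": {
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "size": { "type": "string" },
      "count": { "type": "number" }
    }
  }
}
That also explains the Ajv error: every value under properties must itself be a schema (an object or boolean), never an array.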

Do I need to merge two Elasticsearch queries or can I use an or-type operator?

I have two Elasticsearch queries (which I use via the elastic package in R).
One query gathers the number of times a feature is loaded, the other gathers the number of times a feature is unloaded.
My needs have now changed: I need to gather both types of data/states together in the same dataset (the state can be either TRUE or FALSE, and I want both in one dataset).
What I want to do: identify both cases, where visible is either TRUE or FALSE.
Therefore, I want to know the best approach: should I (attempt to) merge the queries, or should I use an or-type operator?
If it is the latter, how would I go about it?
For completeness, here are my minified queries (unminified versions are at the end of this question):
loads_body <- '{"size":0,"query":{"bool":{"must":[{"match":{"merchant":"a6xzTHtpQs"}},{"term":{"visible":true}},{"range":{"time":{"gte":"2018-04-02T06:00:00","lte":"2018-04-03T05:59:59","time_zone":"+00:00"}}}]}},"aggs":{"daily":{"date_histogram":{"field":"time","interval":"hour","time_zone":"+00:00","min_doc_count":0,"extended_bounds":{"min":"2018-04-02T06:00:00","max":"2018-04-03T05:59:59"}}}}}'
and
unloads_body <- '{"size":0,"query":{"bool":{"must":[{"match":{"merchant":"a6xzTHtpQs"}},{"term":{"visible":false}},{"range":{"time":{"gte":"2018-04-02T06:00:00","lte":"2018-04-03T05:59:59","time_zone":"+00:00"}}}]}},"aggs":{"daily":{"date_histogram":{"field":"time","interval":"hour","time_zone":"+00:00","min_doc_count":0,"extended_bounds":{"min":"2018-04-02T06:00:00","max":"2018-04-03T05:59:59"}}}}}'
Unminified queries:
loads_body <- '{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "merchant": "a6xzTHtpQs"
          }
        },
        {
          "term": {
            "visible": true
          }
        },
        {
          "range": {
            "time": {
              "gte": "2018-04-02T06:00:00",
              "lte": "2018-04-03T05:59:59",
              "time_zone": "+00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "time",
        "interval": "hour",
        "time_zone": "+00:00",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2018-04-02T06:00:00",
          "max": "2018-04-03T05:59:59"
        }
      }
    }
  }
}'
and
unloads_body <- '{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "merchant": "a6xzTHtpQs"
          }
        },
        {
          "term": {
            "visible": false
          }
        },
        {
          "range": {
            "time": {
              "gte": "2018-04-02T06:00:00",
              "lte": "2018-04-03T05:59:59",
              "time_zone": "+00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "daily": {
      "date_histogram": {
        "field": "time",
        "interval": "hour",
        "time_zone": "+00:00",
        "min_doc_count": 0,
        "extended_bounds": {
          "min": "2018-04-02T06:00:00",
          "max": "2018-04-03T05:59:59"
        }
      }
    }
  }
}'
Yes, you can use a single query with sub-aggregations to do what you are looking for. Something along these lines:
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "merchant": "a6xzTHtpQs"
          }
        },
        {
          "range": {
            "time": {
              "gte": "2018-04-02T06:00:00",
              "lte": "2018-04-03T05:59:59",
              "time_zone": "+00:00"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "Visible_agg": {
      "terms": {
        "field": "visible"
      },
      "aggs": {
        "daily": {
          "date_histogram": {
            "field": "time",
            "interval": "hour",
            "time_zone": "+00:00",
            "min_doc_count": 0,
            "extended_bounds": {
              "min": "2018-04-02T06:00:00",
              "max": "2018-04-03T05:59:59"
            }
          }
        }
      }
    }
  }
}
This should produce the histograms in two buckets: one for "visible": true and another for "visible": false.
Is this what you are looking for?
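For reference, on a boolean field the terms aggregation keys its buckets as 0/1 with a key_as_string, so the response should look roughly like this (doc counts are illustrative, and the daily buckets are elided):
{
  "aggregations": {
    "Visible_agg": {
      "buckets": [
        {
          "key": 0,
          "key_as_string": "false",
          "doc_count": 120,
          "daily": { "buckets": [ ... ] }
        },
        {
          "key": 1,
          "key_as_string": "true",
          "doc_count": 98,
          "daily": { "buckets": [ ... ] }
        }
      ]
    }
  }
}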

Elasticsearch bucket aggregation using concatenated parameter

I'm using the Elasticsearch API, and the schema of the document is as follows:
{
  "name": "",
  "born_year": "",
  "born_month": "",
  "born_day": "",
  "book_type": "",
  "price": <some number>,
  "country": ""
}
Now what I need is to get the document count per name for those born before a given date (i.e. born_year + born_month + born_day < "20051220"). How can I achieve this?
I tried this:
{
  "query": {
    "query_string": {
      "query": "country:\"SL\""
    }
  },
  "size": 0,
  "aggs": {
    "total": {
      "terms": {
        "field": "name"
      }
    }
  }
}
But I have no idea how I can add a filter for the birthday.
As mentioned by @Val, you need to add a real date field, which you can easily populate by concatenating these three fields at creation time.
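One way to populate such a field at index time (my own sketch, not from the original answer; the pipeline name is made up, and it assumes born_month and born_day are zero-padded) is an ingest pipeline with a set processor:
PUT _ingest/pipeline/build_birth_date
{
  "processors": [
    {
      "set": {
        "field": "date",
        "value": "{{born_year}}-{{born_month}}-{{born_day}}"
      }
    }
  ]
}
Index your documents with ?pipeline=build_birth_date and map date as a date field (format yyyy-MM-dd).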
As for filtering on a date range, there are two ways, and the two return different result sets.
The level of filtering is your choice: you mentioned querying on the country field, but you have not mentioned at which level you want to filter on the date range, so I will give you queries for both cases.
Mappings, assuming you create the date field:
{
  "name": "",
  "born_year": "",
  "born_month": "",
  "born_day": "",
  "book_type": "",
  "price": <some number>,
  "country": "",
  "date": ""
}
Case 1) Filtering the date range for the name aggregation only; here the document count will not be affected by the date-range filter:
{
  "query": {
    "query_string": {
      "query": "country:\"SL\""
    }
  },
  "aggs": {
    "total": {
      "filter": {
        "range": {
          "date": {
            "gte": "your_date_min",
            "lte": "your_date_max"
          }
        }
      },
      "aggs": {
        "NAME": {
          "terms": {
            "field": "name",
            "size": 10
          }
        }
      }
    }
  }
}
Case 2) Here both the document count and the aggregation are filtered by the date range, because the range filter is added at query level:
{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "country:\"SL\""
          }
        },
        {
          "range": {
            "date": {
              "gte": "your_date_min",
              "lte": "your_date_max"
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "total": {
      "terms": {
        "field": "name",
        "size": 10
      }
    }
  }
}
So adding a filter to the aggregation affects only the aggregation counts.
Edit:
Approach 1) With a Groovy script, concatenate the strings, parse the result to an integer, and compare it with your input date (the match_all below stands in for your real query):
{
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ],
      "filter": {
        "script": {
          "script": {
            "inline": "(doc['year'].value + doc['month'].value + doc['date'].value).toInteger() > 19910701",
            "params": {
              "param1": 19911122
            }
          }
        }
      }
    }
  }
}
Make sure that, when indexing, single-digit months and days like 6 are indexed as 06.
Approach 2) Parse the strings into an exact date (preferred; again, match_all stands in for your real query):
{
  "query": {
    "bool": {
      "must": [
        { "match_all": {} }
      ],
      "filter": {
        "script": {
          "script": {
            "inline": "Date.parse('dd-MM-yyyy', doc['date'].value + '-' + doc['month'].value + '-' + doc['year'].value).format('dd-MM-yyyy') > param1",
            "params": {
              "param1": "04-05-1991"
            }
          }
        }
      }
    }
  }
}
The second approach is much better, as you don't have to worry about maintaining the string for each field (date, month, day) and later parsing it to a proper int for comparison.

Elasticsearch the terms filter raise "filter does not support [mediatest]"

My query is like this:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "online": 1
              }
            },
            {
              "terms": {
                "mediaType": "flash"
              }
            }
          ]
        }
      }
    }
  }
}
It raises a QueryParsingException [[comos_v2] [terms] filter does not support [mediaType]], even though the field "mediaType" does not exist in the mapping at all.
My question is: why doesn't the term filter raise the exception?
The above is not valid Query DSL: in a terms filter, the value for the "mediaType" field should be an array.
It should be the following:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "online": 1
              }
            },
            {
              "terms": {
                "mediaType": ["flash"]
              }
            }
          ]
        }
      }
    }
  }
}
It's 2021; I'm using .keyword for an exact text match, but you can just as easily omit it:
{"query":
{"bool":
{"must":
[
{"term":
{"variable1.keyword":var1Here}
},
{"term":
{"variable2.keyword":var2Here}
}
]
}
}
}
It's simply a matter of "term" vs "terms"; it's very easy to miss the singular/plural distinction.
I had a very similar error with this query, in which I was trying to delete a specific zone:
'{"query":{"terms":{"zoneid":25070}}}'
I was getting an error when I ran the above query.
As soon as I changed "terms" to "term", the query executed with no issues, like this:
'{"query":{"term":{"zoneid":25070}}}'