I'm using Elasticsearch 2. I have a big database of locations, each of which has a gps attribute, which is a geo_point.
My frontend application displays a Google Maps component with the results, filtered by my query, let's say pizza. The problem is that the dataset has grown a lot, and the client wants the results spread evenly across the map.
So if I search for a specific query in New York, I would like to have results all over New York, but I'm currently receiving 400 results concentrated in one populous area of Manhattan.
My naive approach was to just filter by distance:
{
"size":400,
"query":{
"bool":{
"must":{
"match_all":{
}
},
"filter":{
"geo_distance":{
"distance":"200km",
"gps":[
-73.98502023369585,
40.76195656809083
]
}
}
}
}
}
This doesn't guarantee that the results will be spread across the map.
How can I do it?
I've tried using the Geo-Distance Aggregation for this:
{
"size":400,
"query":{
"bool":{
"must":{
"match_all":{
}
},
"filter":{
"geo_distance":{
"distance":"200km",
"gps":[
-73.98502023369585,
40.76195656809083
]
}
}
}
},
"aggs":{
"per_ring":{
"geo_distance":{
"field":"gps",
"unit":"km",
"origin":[
-73.98502023369585,
40.76195656809083
],
"ranges":[
{
"from":0,
"to":100
},
{
"from":100,
"to":200
}
]
}
}
}
}
But I just receive a results list plus the number of elements that belong to each bucket; the results list itself is not guaranteed to be spread out:
"aggregations": {
"per_ring": {
"buckets": [
{
"key": "*-100.0",
"from": 0,
"from_as_string": "0.0",
"to": 100,
"to_as_string": "100.0",
"doc_count": 33821
},
{
"key": "100.0-200.0",
"from": 100,
"from_as_string": "100.0",
"to": 200,
"to_as_string": "200.0",
"doc_count": 6213
}
]
}
}
I would like to grab half of the results from one bucket, half from the other bucket.
I've also attempted to use the Geohash Grid Aggregation, but that doesn't give me sample results for every bucket either; it just provides the areas.
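A top_hits sub-aggregation nested under the geohash grid would presumably give per-cell samples (a minimal sketch, with the precision and sizes picked arbitrarily), but I would still have to merge the per-cell hits client-side:
{
  "size": 0,
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "200km",
          "gps": [
            -73.98502023369585,
            40.76195656809083
          ]
        }
      }
    }
  },
  "aggs": {
    "grid": {
      "geohash_grid": {
        "field": "gps",
        "precision": 4
      },
      "aggs": {
        "samples": {
          "top_hits": {
            "size": 3
          }
        }
      }
    }
  }
}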
So how do I get an evenly spaced distribution of results across my map with one Elasticsearch query?
Thanks!
I think introducing some randomness may give you the desired result. I am assuming you're seeing the same distribution because of index ordering (you're not scoring based on distance, and you're taking the first 400, so you are most likely seeing the same result set).
{
"size": 400,
"query": {
"function_score": {
"query": {
"bool": {
"must": [
{
"match_all": {}
}
],
"filter": {
"geo_distance": {
"distance": "200km",
"gps": [
-73.98502023369585,
40.76195656809083
]
}
}
}
},
"functions": [
{
"random_score": {}
}
]
}
}
}
Random score in Elasticsearch
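If the ordering needs to stay stable between requests (for example while paginating), random_score also accepts a seed; a sketch of just that part:
"functions": [
  {
    "random_score": {
      "seed": 42
    }
  }
]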
Related
I have a few records in Elasticsearch. I want to group the records by user_id and fetch the latest record whose event_type is 1.
If the latest record's event_type value is not 1, we should not fetch that record. I did it with a MySQL query; please let me know how I can do the same in Elasticsearch.
After executing this MySQL query:
SELECT * FROM user_events
WHERE id IN( SELECT max(id) FROM `user_events` group by user_id ) AND event_type=1;
I need the same output from Elasticsearch aggregations.
Elasticsearch Query:
GET test_analytic_report/_search
{
"from": 0,
"size": 0,
"query": {
"bool": {
"must": [
{
"range": {
"event_date": {
"gte": "2022-10-01",
"lte": "2023-02-06"
}
}
}
]
}
},
"sort": {
"event_date": {
"order": "desc"
}
},
"aggs": {
"group": {
"terms": {
"field": "user_id"
},
"aggs": {
"group_docs": {
"top_hits": {
"size": 1,
"_source": ["user_id", "event_date", "event_type"],
"sort": {
"user_id": "desc"
}
}
}
}
}
}
}
I have the above query, and two users whose user_ids are 55 and 56. The query fetched data for other event_types as well, but I want only the latest record per user, and only if its event_type is 1; if a user's last record does not have event_type=1, that user should not appear in my aggregations.
In the above table, the latest record for user_id 56 has event_type 2, so it should not appear in our aggregations.
I tried, but it's not returning the exact result that I want.
Note: event_date is the current date and time. As per the above image, I inserted the data manually, which is why the dates differ.
GET user_events/_search
{
"size": 1,
"query": {
"term": {
"event_type": 1
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
Explanation: this is an Elasticsearch API request in JSON format. It retrieves the single latest event of type 1 from the user_events index: the term query matches "event_type": 1, "size": 1 limits the response to one hit, and the sort orders the results descending by the id field.
If your ES version supports it, you can do this with the field collapse feature. Here is an example query:
{
"_source": false,
"query": {
"bool": {
"filter": {
"term": {
"event_type": 1
}
}
}
},
"collapse": {
"field": "user_id",
"inner_hits": {
"name": "the_record",
"size": 1,
"sort": [
{
"id": "desc"
}
]
}
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
In the response, you will see that the document you want is in inner_hits, under the name you give it. In my example it is the_record. You can increase the size of the inner hits if you want more records in each group, and sort them as needed.
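For instance, with the query above, each top-level hit should carry the grouped record roughly like this (a trimmed sketch; the values are invented):
"hits": [
  {
    "_id": "...",
    "fields": {
      "user_id": [55]
    },
    "inner_hits": {
      "the_record": {
        "hits": {
          "hits": [
            {
              "_source": {
                "user_id": 55,
                "event_type": 1
              }
            }
          ]
        }
      }
    }
  }
]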
TL;DR
There are many ways to go about it:
Sorting
Collapsing
Latest Transform
All of those solutions approximate what you could get with SQL.
But my personal favourite is the transform.
Solution - transform jobs
Set up
We create 2 users, with 2 events each:
PUT 75324839/_bulk
{"create":{}}
{"user_id": 1, "type": 2, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 1, "type": 1, "date": "2016-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 1, "date": "2015-01-01T00:00:00.000Z"}
{"create":{}}
{"user_id": 2, "type": 2, "date": "2016-01-01T00:00:00.000Z"}
Transform job
This transform job is going to run against the index 75324839.
It will find the latest document per user_id, based on the value of the date field.
The results are going to be stored in latest_75324839.
PUT _transform/75324839
{
"source": {
"index": [
"75324839"
]
},
"latest": {
"unique_key": [
"user_id"
],
"sort": "date"
},
"dest": {
"index": "latest_75324839"
}
}
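The job then has to be started once; it does not run on its own after creation:
POST _transform/75324839/_start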
If you were to query latest_75324839, you would find:
{
"hits": [
{
"_index": "latest_75324839",
"_id": "AGvuZWuqqz7c5ytICzX5Z74AAAAAAAAA",
"_score": 1,
"_source": {
"date": "2017-01-01T00:00:00.000Z",
"user_id": 1,
"type": 1
}
},
{
"_index": "latest_75324839",
"_id": "AA3tqz9zEwuio1D73_EArycAAAAAAAAA",
"_score": 1,
"_source": {
"date": "2016-01-01T00:00:00.000Z",
"user_id": 2,
"type": 2
}
}
]
}
Get the final results
To get the number of users whose latest event has type=1, a simple search query such as the following will do:
GET latest_75324839/_search
{
"query": {
"term": {
"type": {
"value": 1
}
}
},
"aggs": {
"number_of_user": {
"cardinality": {
"field": "user_id"
}
}
}
}
Side notes
This transform job has been run in batch mode, which means it will only run once.
It is possible to run it in a continuous fashion, so that the destination index always holds the latest event per user_id.
Here are some examples.
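A continuous version would mainly differ by a sync section and a frequency (a sketch; the job name, delay, and frequency values are arbitrary):
PUT _transform/75324839_continuous
{
  "source": {
    "index": [
      "75324839"
    ]
  },
  "latest": {
    "unique_key": [
      "user_id"
    ],
    "sort": "date"
  },
  "dest": {
    "index": "latest_75324839"
  },
  "frequency": "1m",
  "sync": {
    "time": {
      "field": "date",
      "delay": "60s"
    }
  }
}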
You are looking for the SQL HAVING clause, which would allow you to filter results after grouping. But sadly there is nothing equivalent in Elasticsearch.
So it is not possible to:
sort, collapse and filter afterwards (even post_filter does not help here)
use a top_hits aggregation with custom sorting and then filter
use any map/reduce scripted aggregations, as they do not support sorting
work with subqueries.
So basically, Elasticsearch is not a relational database: any sorting or relation to other documents should be based on scoring, and the score has to be calculated independently for each document, distributed across shards.
But there is a tiny loophole which might be the solution for your use case. It is based on a top_metrics aggregation followed by a bucket_selector to eliminate the unwanted event types:
GET test_analytic_report/_search
{
"size": 0,
"aggs": {
"by_id": {
"terms": {
"field": "user_id",
"size": 100
},
"aggs": {
"tm": {
"top_metrics": {
"metrics": {
"field": "event_type"
},
"sort": [
{
"id": {
"order": "desc"
}
}
]
}
},
"event_type_filter": {
"bucket_selector": {
"buckets_path": {
"event_type": "tm.event_type"
},
"script": "params.event_type == 1"
}
}
}
}
}
}
If you require more fields from the source document, you can add them to the top_metrics. It is sorted by id now, but you could also use event_date.
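For example, metrics accepts a list, so the event date can be returned alongside the type (a sketch of just the changed sub-aggregation):
"tm": {
  "top_metrics": {
    "metrics": [
      { "field": "event_type" },
      { "field": "event_date" }
    ],
    "sort": [
      { "event_date": { "order": "desc" } }
    ]
  }
}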
I have a ton of items in a DB with many columns, and I need to search across two of these columns to get one data set.
The first column, genericCode, would group together any of the rows that have that code.
The second column, genericId, calls out a specific row to add, because it is missing from the list of genericCodes I'm looking for.
The back-end C# sets up my JSON for me as follows, but it returns nothing:
{
"from": 0,
"size": 50,
"aggs": {
"paramA": {
"nested": {
"path": "paramA"
},
"aggs": {
"names": {
"terms": {
"field": "paramA.name",
"size": 10
},
"aggs": {
"specValues": {
"terms": {
"field": "paramA.value",
"size": 10
}
}
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"locationId": {
"value": 1
}
}
},
{
"terms": {
"genericCode": [
"10A",
"20B"
]
}
},
{
"terms": {
"genericId": [
11223344
]
}
}
]
}
}
}
I get an empty result set. If I remove either of the "terms" clauses, I get what I would expect. So I just need to combine those two terms into one search.
I've gone through a lot of the documentation here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html
and still can't seem to find what I'm looking for.
Let's say I'm Jelly Belly and I want to create a bag of jelly beans with all the Star Wars and Disney jelly beans, and I also want to add all beans of the color green. That is basically what I'm trying to do.
EDIT: Changing the "must" to "should" isn't quite right either. I need it to be (in pseudo-SQL):
SELECT *
FROM myTable
WHERE locationId = 1
AND (
genericCode IN ("this", "that")
OR
genericId IN (1234, 5678)
)
The locationId separates our data in an important way.
I found this post: elasticsearch bool query combine must with OR, and it has gotten me closer, but not all the way there.
I've tried several iterations of should > must > should > must while building this query and get varying results, but nothing accurate.
Here is the query that is working. It helped when I realized I was passing in the wrong data for one of my parameters (doh).
Nest the should inside the must, as @khachik noted in the comment above. I had this some time ago, but it wasn't working due to the above blunder:
{
"from": 0,
"size": 10,
"aggs": {
"specs": {
"nested": {
"path": "paramA"
},
"aggs": {
"names": {
"terms": {
"field": "paramA.name",
"size": 10
},
"aggs": {
"specValues": {
"terms": {
"field": "paramA.value",
"size": 10
}
}
}
}
}
}
},
"query": {
"bool": {
"must": [
{
"term": {
"locationId": {
"value": 1
}
}
},
{
"bool": {
"should": [
{
"terms": {
"genericCode": [
"101",
"102"
]
}
},
{
"terms": {
"genericId": [
3078711,
3119430
]
}
}
]
}
}
]
}
}
}
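One optional hardening: the inner bool relies on the default behaviour that at least one should clause must match; you can state that explicitly with minimum_should_match (a sketch of just the inner clause):
{
  "bool": {
    "minimum_should_match": 1,
    "should": [
      { "terms": { "genericCode": [ "101", "102" ] } },
      { "terms": { "genericId": [ 3078711, 3119430 ] } }
    ]
  }
}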
I am trying to run a query on Elasticsearch with a geo distance filter. This query works:
{
"filter":
{
"geo_distance": {
"center_point": { "lon": 77.2909989, "lat": 28.6854955 },
"distance": "100m",
"order": "asc"
}
}
}
However, if I change the ordering of the keys in the "geo_distance" JSON object, so that distance comes first and then center_point, the query fails:
{
"filter":
{
"geo_distance": {
"distance": "100m",
"order": "asc",
"center_point": { "lon": 77.2909989, "lat": 28.6854955 }
}
}
}
The query fails with this error:
illegal latitude value [251.71875] for [geo_distance]
I believe the ordering of keys should not have any impact.
How can this be fixed?
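The geo_distance filter does not document an order option, so the parser presumably treats the unknown key as another geo-point field and fails while decoding "asc"; a sketch of a version-independent alternative that drops order from the filter and sorts by _geo_distance instead:
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "100m",
          "center_point": { "lon": 77.2909989, "lat": 28.6854955 }
        }
      }
    }
  },
  "sort": [
    {
      "_geo_distance": {
        "center_point": { "lon": 77.2909989, "lat": 28.6854955 },
        "order": "asc",
        "unit": "m"
      }
    }
  ]
}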
I am trying to perform partial word matching on the _id field in my Elasticsearch instance.
After searching the official documentation, I found out that the best way to do this is to create an n-gram analyzer, so using Sense I did this:
PUT /index2
{"settings": {
"number_of_shards": 1,
"analysis": {
"filter": {
"partial_filter": {
"type": "ngram",
"min_gram": 2,
"max_gram": 20
}
},
"analyzer": {
"partial": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"partial_filter"
]
}
}
}
}}
I have tried to test the analyzer using:
POST /index2/_analyze
{
"analyzer": "partial",
"text": "brown fox"
}
And it worked as expected, producing the proper combinations.
The next step should be to apply the analyzer to the relevant fields, so I tried to do this:
PUT /index2/_mapping/type2
{
"type2": {
"properties": {
"_id": {
"type": "string",
"analyzer": "partial"
}
}
}
}
But I am getting an error:
"reason": "Field [_id] is defined twice in [type2]"
This is probably because the _id field gets created during the index2 creation, along with the analyzer.
So my question is: how can I use partial search on the _id field?
Is there any other way to do this?
Thanks in advance!
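Since _id itself cannot be given a custom analyzer, one workaround is to duplicate the identifier into a regular field at index time and run the partial analyzer on that copy (a sketch; id_partial is an invented field name):
PUT /index2/_mapping/type2
{
  "type2": {
    "properties": {
      "id_partial": {
        "type": "string",
        "analyzer": "partial"
      }
    }
  }
}
Each document is then indexed with id_partial set to the same value as its _id, and the match query targets id_partial instead of _id.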
I am building a faceted filtering function for a webshop, something like this:
Filter on Brand:
[ ] LG (10)
[ ] Apple (5)
[ ] HTC (3)
Filter on OS:
[ ] Android 4 (11)
[ ] Android 5 (2)
[ ] IOS (5)
I am using aggregation and filtering in Elasticsearch, which is working out pretty well for me after a few days of learning ES (loving it!). But sadly, I am now stuck on the actual filtering.
If I click on 'LG', the IOS filter will be disabled, (5) will change to (0), and the results on the right side will change to 13 Android phones. Great, so far so good.
Now if I click on 'Android 4', only 11 phones will show on the right side. Awesome! So far so good :)
But now, if I click on 'Android 5', all results disappear. I'm not sure what I'm doing wrong; I would expect all the LG phones with Android 4 and the ones with Android 5 to show up.
Below is a sample query for the last case. Please note that the query also includes some other fields that I am using to build the faceted filtering:
{
"size":100,
"query":{
"filtered":{
"query":{
"match_all":[
]
},
"filter":{
"bool":{
"must":[
{
"term":{
"brand.untouched":"LG"
}
},
{
"term":{
"operating_system.untouched":"Android 4"
}
},
{
"term":{
"operating_system.untouched":"Android 5"
}
}
],
"should":[
],
"must_not":{
"missing":{
"field":"model"
}
}
}
},
"strategy":"query_first"
}
},
"aggs":{
"brand.untouched":{
"terms":{
"field":"brand.untouched"
}
},
"operating_system.untouched":{
"terms":{
"field":"operating_system.untouched"
}
},
"camera1":{
"histogram":{
"field":"camera1",
"interval":5,
"min_doc_count":0
}
},
"price_seperate":{
"histogram":{
"field":"price_seperate",
"interval":125,
"min_doc_count":0
}
}
}
}
Does anyone know the solution? Thanks so much.
Your query is searching for documents in which operating_system.untouched is both "Android 4" and "Android 5", which will never be the case, hence you get zero results. You can simply make use of the Terms Filter, so that documents where the value of operating_system.untouched is either "Android 4" or "Android 5" match. Below is the updated query you should be using:
{
"size":100,
"query":{
"filtered":{
"filter":{
"bool":{
"must":[
{
"terms":{
"brand.untouched": [
"LG"
]
}
},
{
"terms":{
"operating_system.untouched": [
"Android 4",
"Android 5"
]
}
}
],
"must_not":{
"missing":{
"field":"model"
}
}
}
},
"strategy":"query_first"
}
},
"aggs":{
"brand.untouched":{
"terms":{
"field":"brand.untouched"
}
},
"operating_system.untouched":{
"terms":{
"field":"operating_system.untouched"
}
},
"camera1":{
"histogram":{
"field":"camera1",
"interval":5,
"min_doc_count":0
}
},
"price_seperate":{
"histogram":{
"field":"price_seperate",
"interval":125,
"min_doc_count":0
}
}
}
}
If you want to add another set of categories, like a price range, you just need to add a bool should clause inside the bool must clause. See below for an example where you want to filter on a field price over two ranges, (0, 100] and (100, 200]. What this basically means is that you can nest must and should filters to realize any boolean logic you want for filtering in Elasticsearch:
...
"must":[
{
"terms":{
"brand.untouched": [
"LG"
]
}
},
{
"terms":{
"operating_system.untouched": [
"Android 4",
"Android 5"
]
}
},
"bool": {
"should": [
{
"range": {
"price": {
"gt": 0,
"lte": 100
}
}
},
{
"range": {
"price": {
"gt": 100,
"lte": 200
}
}
}
]
}
}
],
...