ElasticSearch: exact match has a lower score than partial match

I am trying to implement address autocomplete using ElasticSearch.
Suppose, I have three fields, which I would like to implement search on:
{
"address_name": "George st.",
"number": "1",
"city_name": "London"
}
According to this article, I have configured my index and type like this:
{
"settings": {
"analysis": {
"filter": {
"nGram_filter": {
"type": "nGram",
"min_gram": 1,
"max_gram": 20,
"token_chars": [
"letter",
"digit",
"punctuation",
"symbol"
]
}
},
"analyzer": {
"nGram_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding",
"nGram_filter"
]
},
"whitespace_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"asciifolding"
]
}
}
}
},
"mappings": {
"address": {
"_all": {
"analyzer": "nGram_analyzer",
"search_analyzer": "whitespace_analyzer"
},
"properties": {
"address_name": {
"type": "string"
},
"number": {
"type": "string",
"boost": 2
},
"city_name": {
"type": "string"
},
"local": {
"type": "integer",
"include_in_all": false,
"index": "no"
},
"place_id": {
"type": "integer",
"include_in_all": false,
"index": "no"
},
"has_number": {
"type": "integer",
"include_in_all": false,
"index": "no"
}
}
}
}
}
Full search query:
{
"size": 100,
"query": {
"match": {
"_all": {
"query": "George st. 1 London",
"operator": "and"
}
}
}
}
As I search with the query George st. 1 London, ElasticSearch first returns George st. 19 London, George st. 17 London, etc., while the exact match George st. 1 London is returned only in X-th place and has a lower score than the first ones.
I tried to understand why this happens by adding the explain parameter to the end of the search URL, but it didn't help.
Is there any way to solve this problem?
Thank you.

Basically, since you're running all fields through an nGram token filter at indexing time, it means that for the number field,
17 will be tokenized as 1 and 17, and
19 will be tokenized as 1 and 19.
Hence, all three documents you mention will have the token 1 indexed for their number field.
Then at query time, you're using the whitespace analyzer, which means that George st. 1 London will be tokenized into the following tokens: George, st, 1 and London.
From there, we can draw two conclusions:
all three documents will match no matter what (since all tokens match a given field)
there's no way with the current settings and mapping that you can give more weight to the document George st. 1 London than to the others.
The easiest way out of this is to not apply nGram to the number field so that the street number needs to be matched exactly and not with prefixes.
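One possible way to act on that (a sketch of mine, not from the original answer): keep the _all match for recall, and add a should clause on the number field itself, which, unlike _all, is not run through the nGram filter, so that documents whose street number exactly equals one of the query tokens score higher:
{
  "size": 100,
  "query": {
    "bool": {
      "must": {
        "match": {
          "_all": {
            "query": "George st. 1 London",
            "operator": "and"
          }
        }
      },
      "should": {
        "match": {
          "number": "George st. 1 London"
        }
      }
    }
  }
}
Here the should clause analyzes the query with the number field's own analyzer and only scores documents whose indexed number (e.g. 1) equals one of the query tokens, so George st. 1 London gets boosted above George st. 19 London.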

Related

Is there any way to define a scoping mechanism in JSON Schema for Arrays of Objects?

I would like to use JSON Schema to validate my data which exists as an array of objects. In this use-case, I have a list of people and I want to make sure they possess certain properties, but these properties aren't exhaustive.
For instance, if we have a person named Bob, I want to make sure that Bob's height, ethnicity and location are set to certain values. But I don't care much about Bob's other properties like hobbies, weight, relationshipStatus.
There is one caveat and it is that there can be multiple Bobs, so I don't want to check for all Bobs. It just so happens that each person has a unique ID given to them and I want to check properties of a person by the specified id.
Here is an example of all the people that exist:
{
"people": [
{
"name": "Bob",
"id": "ei75dO",
"age": "36",
"height": "68",
"ethnicity": "american",
"location": "san francisco",
"weight": "174",
"relationshipStatus": "married",
"hobbies": ["camping", "traveling"]
},
{
"name": "Leslie",
"id": "UMZMA2",
"age": "32",
"height": "65",
"ethnicity": "american",
"location": "pawnee",
"weight": "139",
"relationshipStatus": "married",
"hobbies": ["politics", "parks"]
},
{
"name": "Kapil",
"id": "HkfmKh",
"age": "27",
"height": "71",
"ethnicity": "indian",
"location": "mumbai",
"weight": "166",
"relationshipStatus": "single",
"hobbies": ["tech", "games"]
},
{
"name": "Arnaud",
"id": "xSiIDj",
"age": "42",
"height": "70",
"ethnicity": "french",
"location": "paris",
"weight": "183",
"relationshipStatus": "married",
"hobbies": ["cooking", "reading"]
},
{
"name": "Kapil",
"id": "fDnweF",
"age": "38",
"height": "67",
"ethnicity": "indian",
"location": "new delhi",
"weight": "159",
"relationshipStatus": "married",
"hobbies": ["tech", "television"]
},
{
"name": "Gary",
"id": "ZX43NI",
"age": "29",
"height": "69",
"ethnicity": "british",
"location": "london",
"weight": "172",
"relationshipStatus": "single",
"hobbies": ["parkour", "guns"]
},
{
"name": "Jim",
"id": "uLqbVe",
"age": "26",
"height": "72",
"ethnicity": "american",
"location": "scranton",
"weight": "179",
"relationshipStatus": "single",
"hobbies": ["parkour", "guns"]
}
]
}
And here is what I specifically want to check for in each person:
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"type": "object",
"properties": {
"people": {
"type": "array",
"contains": {
"anyOf": [
{
"type": "object",
"properties": {
"id": {
"const": "ei75dO"
},
"name": {
"const": "Bob"
},
"ethnicity": {
"const": "american"
},
"location": {
"const": "los angeles"
},
"height": {
"const": "68"
}
},
"required": ["id", "name", "ethnicity", "location", "height"]
},
{
"type": "object",
"properties": {
"id": {
"const": "fDnweF"
},
"name": {
"const": "Kapil"
},
"location": {
"const": "goa"
},
"height": {
"const": "65"
}
},
"required": ["id", "name", "location", "height"]
},
{
"type": "object",
"properties": {
"id": {
"const": "xSiIDj"
},
"name": {
"const": "Arnaud"
},
"location": {
"const": "paris"
},
"relationshipStatus": {
"const": "single"
}
},
"required": ["id", "name", "location", "relationshipStatus"]
},
{
"type": "object",
"properties": {
"id": {
"const": "uLqbVe"
},
"relationshipStatus": {
"const": "married"
}
},
"required": ["id", "relationshipStatus"]
}
]
}
}
},
"required": ["people"]
}
Note that for Bob, I only want to check that his name in the records is Bob, his ethnicity is american and that his location and height are set properly.
For Kapil, notice that there are 2 of them in the record. I only want to validate the array object pertaining to Kapil with the id fDnweF.
And for Jim, I only want to make sure that his relationshipStatus is set to married.
So my question would be: is there any way in JSON Schema to say, hey, when you come across an array of objects, instead of running validation against each element in the data, only run it against objects that match a specific identifier? In our instance, we would say that the identifier is id. You can imagine that this identifier can be anything; for example, it could have been socialSecurity# if the list of people were all from America.
The issue with the current schema is that when it tries to validate the objects, it generates a giant list of errors with no clear indication of which object failed with which value.
In an ideal scenario AJV (which I currently use) would generate errors that should look something like:
---------Bob-------------
path: people[0].location
expected: "los angeles"
// Notice how this isn't Kapil at index 2 since we provided the id which matches kapil at index 4
---------Kapil-----------
path: people[4].location
expected: "goa"
---------Kapil-----------
path: people[4].height
expected: "65"
---------Arnaud----------
path: people[3].relationshipStatus
expected: "single"
-----------Jim-----------
path: people[6].relationshipStatus
expected: "married"
Instead, AJV currently spits out errors with no clear indication of where the failure might be. If Bob failed to match the expected value of location, it says that every person including Bob has an invalid location, which from our perspective is incorrect.
How can I define a schema that resolves this use-case, so that we can use JSON Schema to pinpoint which elements in our data aren't in compliance with what our schema states? All so that we can store these schema errors cleanly for reporting purposes and come back to these reports to see exactly which people (represented by index values of the array) failed which values.
Edit:
Assume that we would also like to check Bob's relatives as well. For instance, we want to create a schema to check that the relative with the given ID also has location set to "los angeles", and another relative set to "orange county".
{
"people": [{
"name": "Bob",
"id": "ei75d0",
"relationshipStatus": "married",
"height": "68",
"relatives": [
{
"name": "Tony",
"id": "UDX5A6",
"location": "los angeles",
},
{
"name": "Lisa",
"id": "WCX4AG",
"location": "orange county",
}
]
}]
}
My question then would be: can the if/then/else be applied to nested elements as well? I'm not having success, but I'll continue trying to get it to work and will post an update here if/once I do.
How can I define a schema that can resolve this use-case and we can use JSON Schema to pinpoint which elements in our data aren't in compliance with what our schema states
It's a little fiddly, but I've gone from "this isn't possible" to "you can just about do this".
If you re-structure your schema to the following...
{
"$schema": "https://json-schema.org/draft/2019-09/schema",
"type": "object",
"properties": {
"people": {
"type": "array",
"items": {
"allOf":[
{
"if": {
"properties": {
"id": {
"const": "uLqbVe"
}
}
},
"then": {
"type": "object",
"properties": {
"id": {
"const": "uLqbVe"
},
"relationshipStatus": {
"const": "married"
}
},
"required": ["id", "relationshipStatus"]
},
"else": true
}
]
}
}
},
"required": ["people"]
}
What we're doing here is: for each item in the array, if the object has the specific ID, then apply the rest of the validation; otherwise, the item is considered valid.
It's wrapped in an allOf so you can do the same pattern multiple times.
The caveat is that, if you don't include all the IDs, or if you don't carefully check your schema, you will get told everything is valid.
Ideally, you should additionally check that the IDs you are expecting are actually there. (It's fine to do so in the same schema; see the sketch below.)
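For instance, a separate clause along these lines (a sketch, to be combined with the schema above, e.g. via an outer allOf) fails validation whenever no element with the expected ID is present at all:
{
  "properties": {
    "people": {
      "type": "array",
      "contains": {
        "type": "object",
        "properties": { "id": { "const": "uLqbVe" } },
        "required": ["id"]
      }
    }
  }
}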
You can see this mostly working if you test it on https://jsonschema.dev by removing the $schema property. (This playground is only draft-07, but none of the keywords you use need anything above draft-07 anyway.)
You can test this working on https://json-everything.net/json-schema which then gives you full validation response.
AJV by default doesn't give you all the validation results. There's an option to enable them (allErrors), but I'm not in a position to test the result myself right now.

Azure Cost Management API does not allow me to select columns

I tried to use the Azure Cost Management - Query Usage API to get details (certain columns) on all costs for a given subscription. The body I use for the request is
{
"type": "Usage",
"timeframe": " BillingMonthToDate ",
"dataset": {
"granularity": "Daily",
"configuration": {
"columns": [
"MeterCategory",
"CostInBillingCurrency",
"ResourceGroup"
]
}
}
}
But the response I get back is this:
{
"id": "xxxx",
"name": "xxxx",
"type": "Microsoft.CostManagement/query",
"location": null,
"sku": null,
"eTag": null,
"properties": {
"nextLink": null,
"columns": [
{
"name": "UsageDate",
"type": "Number"
},
{
"name": "Currency",
"type": "String"
} ],
"rows": [
[
20201101,
"EUR"
],
[
20201102,
"EUR"
],
[
20201103,
"EUR"
],
...
]
}
}
The JSON continues listing all the dates with the currency.
When I use the dataset.aggregation or dataset.grouping clauses in the JSON, I do get costs returned, but then I don't get the detailed column information that I want. And of course it is not possible to combine these two clauses with the dataset.columns clause. Does anyone have any idea what I'm doing wrong?
I found a solution without using the dataset.columns clause (which might just be a faulty clause?). By grouping the data according to the columns I want, I can also get the data for those column values:
{
"type": "Usage",
"timeframe": "BillingMonthToDate",
"dataset": {
"granularity": "Daily",
"aggregation": {
"totalCost": {
"name": "PreTaxCost",
"function": "Sum"
}
},
"grouping": [
{
"type": "Dimension",
"name": "SubscriptionName"
},
{
"type": "Dimension",
"name": "ResourceGroupName"
},
{
"type": "Dimension",
"name": "meterSubCategory"
},
{
"type": "Dimension",
"name": "MeterCategory"
}
]
}
}
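For reference, the body can be POSTed to the Cost Management query endpoint roughly like this (the subscription scope, api-version and file name below are assumptions; adjust them to your environment):
curl -X POST \
  "https://management.azure.com/subscriptions/<subscription-id>/providers/Microsoft.CostManagement/query?api-version=2021-10-01" \
  -H "Authorization: Bearer <access-token>" \
  -H "Content-Type: application/json" \
  -d @query-body.json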

Fiware Context Broker with entities geolocated

I have a problem in retrieving entities using georeferenced queries.
I use the v2 syntax.
This is my query:
GET /v2/entities?georel=near;maxDistance:1000&geometry=point&coords=13.52,43.61
and this is my entity:
{
"id": "p1",
"type": "pm",
"address": {
"type": "Text",
"value": "Via Roma "
},
"allowedVehicleType": {
"type": "Text",
"value": "car"
},
"category": {
"type": "Text",
"value": "onstreet"
},
"location": {
"type": "geo:json",
"value": {
"type": "Point",
"coordinates": [ 13.5094, 43.6246 ]
}
},
"name": {
"type": "Text",
"value": "p1"
},
"totalSpotNumber": {
"type": "Number",
"value": 32
}
}
What is wrong?
I followed the official documentation but I cannot get any results either.
I also tried to reverse the coordinates, but the result does not change.
Any suggestion is welcome.
Note that longitude comes before latitude in GeoJSON coordinates, while the coords parameter works the opposite way (latitude first).
Thus, assuming that your entity is located in Ancona city, I think that using "coordinates": [ 43.6246, 13.5094 ] will solve the problem.
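For completeness, the other way around is also an option: keep the entity as it is (its GeoJSON already has longitude first) and write the query's coords with latitude first instead, for example:
GET /v2/entities?georel=near;maxDistance:1000&geometry=point&coords=43.61,13.52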

Schema to load json data to google big query

I have a question for the project that we are doing...
I tried to load this JSON into Google BigQuery and am not able to get the votes object fields from the JSON input. I tried the "record" and the "string" types in the schema.
{
"votes": {
"funny": 10,
"useful": 10,
"cool": 10
},
"user_id": "OlMjqqzWZUv2-62CSqKq_A",
"review_id": "LMy8UOKOeh0b9qrz-s1fQA",
"stars": 4,
"date": "2008-07-02",
"text": "This is what this 4-star bar is all about.",
"type": "review",
"business_id": "81IjU5L-t-QQwsE38C63hQ"
}
Also, I am not able to get the tables populated from the JSON below for the categories and neighborhoods arrays. What should my schema be for these inputs? The docs unfortunately didn't help much in this case, or maybe I am not looking in the right place.
{
"business_id": "Iu-oeVzv8ZgP18NIB0UMqg",
"full_address": "3320 S Hill St\nSouth East LA\nLos Angeles, CA 90007",
"schools": [
"University of Southern California"
],
"open": true,
"categories": [
"Medical Centers",
"Health and Medical"
],
"neighborhoods": [
"South East LA"
]
}
I am able to get the regular fields, but that's about it... Any help is appreciated!
For business it seems you want schools to be a repeated field. Your schema should be:
"schema": {
"fields": [
{
"name": "business_id",
"type": "string"
},
{
"name": "full_address",
"type": "string"
},
{
"name": "schools",
"type": "string",
"mode": "repeated"
},
{
"name": "open",
"type": "boolean"
}
]
}
For votes it seems you want record. Your schema should be:
"schema": {
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "votes",
"type": "record",
"fields": [
{
"name": "funny",
"type": "integer",
},
{
"name": "useful",
"type": "integer"
},
{
"name": "cool",
"type": "integer"
}
]
}
]
}
Source
I was also stuck on this problem, but the issue I faced was that one has to remember to flag the mode as repeated for the records (source).
Also, please note that these cannot have a null value (source).
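Putting the pieces together, a schema file for the review document could look like the following sketch (table and file names are made up), loaded for example with the bq CLI and newline-delimited JSON input:
[
  { "name": "votes", "type": "RECORD", "fields": [
    { "name": "funny", "type": "INTEGER" },
    { "name": "useful", "type": "INTEGER" },
    { "name": "cool", "type": "INTEGER" }
  ]},
  { "name": "user_id", "type": "STRING" },
  { "name": "review_id", "type": "STRING" },
  { "name": "stars", "type": "INTEGER" },
  { "name": "date", "type": "STRING" },
  { "name": "text", "type": "STRING" },
  { "name": "type", "type": "STRING" },
  { "name": "business_id", "type": "STRING" }
]
bq load --source_format=NEWLINE_DELIMITED_JSON mydataset.reviews reviews.json schema.json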

Facets tokenize tags with spaces. Is there a solution?

I have a problem with facets tokenizing tags that contain spaces.
I have the following mappings:
curl -XPOST "http://localhost:9200/pictures" -d '
{
"mappings" : {
"pictures" : {
"properties" : {
"id": { "type": "string" },
"description": {"type": "string", "index": "not_analyzed"},
"featured": { "type": "boolean" },
"categories": { "type": "string", "index": "not_analyzed" },
"tags": { "type": "string", "index": "not_analyzed", "analyzer": "keyword" },
"created_at": { "type": "double" }
}
}
}
}'
And My Data is:
curl -X POST "http://localhost:9200/pictures/picture" -d '{
"picture": {
"id": "4defe0ecf02a8724b8000047",
"title": "Victoria Secret PhotoShoot",
"description": "From France and Italy",
"featured": true,
"categories": [
"Fashion",
"Girls",
],
"tags": [
"girl",
"photoshoot",
"supermodel",
"Victoria Secret"
],
"created_at": 1405784416.04672
}
}'
And My Query is:
curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
{
"query": {
"text": {
"tags": {
"query": "Victoria Secret"
}
}
},
"facets": {
"tags": {
"terms": {
"field": "tags"
}
}
}
}'
The Output result is:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
},
"facets" : {
"tags" : {
"_type" : "terms",
"missing" : 0,
"total" : 0,
"other" : 0,
"terms" : [ ]
}
}
}
Now, I get total: 0 in facets and total: 0 in hits.
Any idea why it's not working?
I know that when I remove the keyword analyzer from tags and make it "not_analyzed", then I get results.
But there is still a problem with case sensitivity.
If I run the same query above after removing the keyword analyzer, then I get this result:
facets: {
tags: {
_type: terms
missing: 0
total: 12
other: 0
terms: [
{
term: photoshoot
count: 1
}
{
term: girl
count: 1
}
{
term: Victoria Secret
count: 1
}
{
term: supermodel
count: 1
}
]
}
}
Here, with "not_analyzed", Victoria Secret is kept as a single term (the space is preserved), but it is case-sensitive: when I query with the lowercase "victoria secret" it doesn't give any results.
Any suggestions??
Thanks,
Suraj
The first examples are not totally clear to me. If you use the KeywordAnalyzer it means that the field will be indexed as it is, but then it makes much more sense to just not analyze the field at all, which is the same. The mapping you posted contains both
"index": "not_analyzed", "analyzer": "keyword"
which doesn't make a lot of sense. If you are not analyzing the field, why would you select an analyzer for it?
Apart from this, of course, if you don't analyze the field, the tag Victoria Secret will be indexed as it is, thus the query victoria secret won't match. If you want it to be case-insensitive you need to define a custom analyzer which uses the KeywordTokenizer (since you don't want to tokenize the value) and the LowercaseTokenFilter. You can define a custom analyzer through the index settings analysis section and then use it in your mapping. But that way the facet would always be lowercase, which is something that you don't like, I guess. That's why it's better to define a multi field and index the field using two different analysis chains, one for the facet and one for search.
You can create the index like this:
curl -XPOST "http://localhost:9200/pictures" -d '{
"settings" : {
"analysis" : {
"analyzer" : {
"lowercase_analyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : [ "lowercase"]
}
}
}
},
"mappings" : {
"pictures" : {
"properties" : {
"id": { "type": "string" },
"description": {"type": "string", "index": "not_analyzed"},
"featured": { "type": "boolean" },
"categories": { "type": "string", "index": "not_analyzed" },
"tags" : {
"type" : "multi_field",
"fields" : {
"tags": { "type": "string", "analyzer": "lowercase_analyzer" },
"facet": {"type": "string", "index": "not_analyzed"},
}
},
"created_at": { "type": "double" }
}
}
}
}'
Then the custom lowercase_analyzer will be applied by default to the text query too when you search on that field, so that you can either search for Victoria Secret or victoria secret and get the result back. You need to change the facet part and make the facet on the new tags.facet field, which is not analyzed.
Furthermore, you might want to have a look at the match query since the text query has been deprecated with the latest elasticsearch version (0.19.9).
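Based on the mapping above, the adjusted request would then look roughly like this (a sketch): a match query on tags, which is lowercased at index and search time, and the facet on the not-analyzed tags.facet sub-field:
curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
{
  "query": {
    "match": {
      "tags": "victoria secret"
    }
  },
  "facets": {
    "tags": {
      "terms": {
        "field": "tags.facet"
      }
    }
  }
}'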
I think this makes some sense to my answer:
https://gist.github.com/2688072