Elasticsearch query with nested sets - json

I am pretty new to Elasticsearch, so please bear with me and let me know if I need to provide any additional information. I have inherited a project and need to implement new search functionality. The document/mapping structure is already in place but can be changed if it can not facilitate what I am trying to achieve. I am using Elasticsearch version 5.6.16.
A company is able to offer a number of services. Each service offering is grouped together in a set. Each set is composed of 3 categories:
Product(s) (ID 1)
Process(es) (ID 3)
Material(s) (ID 4)
The document structure looks like:
[{
"id": 4485,
"name": "Company A",
// ...
"services": {
"595": {
"1": [
95, 97, 91
],
"3": [
475, 476, 471
],
"4": [
644, 645, 683
]
},
"596": {
"1": [
91, 89, 76
],
"3": [
476, 476, 301
],
"4": [
644, 647, 555
]
},
"597": {
"1": [
92, 93, 89
],
"3": [
473, 472, 576
],
"4": [
641, 645, 454
]
}
}
}]
In the above example; 595, 596 and 597 are IDs relating to the set. 1, 3 and 4 relate to the categories (mentioned above).
The mapping looks like:
[{
"id": {
"type": "long"
},
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"services": {
"properties": {
// ...
"595": {
"properties": {
"1": {"type": "long"},
"3": {"type": "long"},
"4": {"type": "long"}
}
},
"596": {
"properties": {
"1": {"type": "long"},
"3": {"type": "long"},
"4": {"type": "long"}
}
},
// ...
}
},
}]
When searching for a company that provides a Product (ID 1), a search for 91 and 95 would return Company A because those IDs are within the same set. But if I were to search for 95 and 76, it would not return Company A: while the company does provide both of these products, they are not in the same set. The same rules would apply when searching Processes and Materials, or a combination of these.
I am looking for confirmation that the current document/mapping structure will facilitate this type of search.
If so, given 3 arrays of IDs (Products, Processes and Materials), what is the JSON to find all companies that provide these services within the same set?
If not, how should the document/mapping be changed to allow this search?
Thank you for your help.
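For clarity, the matching rule described above can be sketched in plain Python. This is purely an illustration of the requirement (not Elasticsearch code); the function name and data layout are assumptions based on the document structure shown:

```python
def matches(services, products=(), processes=(), materials=()):
    """Return True if any single set covers ALL requested IDs, per category."""
    for categories in services.values():
        # Category IDs: "1" = Products, "3" = Processes, "4" = Materials.
        if (set(products) <= set(categories.get("1", []))
                and set(processes) <= set(categories.get("3", []))
                and set(materials) <= set(categories.get("4", []))):
            return True
    return False

# The "services" object of Company A from the question.
company_a = {
    "595": {"1": [95, 97, 91], "3": [475, 476, 471], "4": [644, 645, 683]},
    "596": {"1": [91, 89, 76], "3": [476, 476, 301], "4": [644, 647, 555]},
}

print(matches(company_a, products=[91, 95]))  # True  (both in set 595)
print(matches(company_a, products=[95, 76]))  # False (95 is in 595, 76 in 596)
```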

It is a bad idea to use what is really a value (an ID) as a field name, as that can lead to the creation of a great many inverted indexes (remember that in Elasticsearch an inverted index is created for every field), and I feel it is not reasonable to have something like that.
Instead, change your data model to something like the one below. I have also included sample documents, the possible queries you can apply, and how the response would appear.
Note that, just for the sake of simplicity, I'm focusing only on the services field that you mentioned in your mapping.
Mapping:
PUT my_services_index
{
"mappings": {
"properties": {
"services":{
"type": "nested", <----- Note this
"properties": {
"service_key":{
"type": "keyword" <----- Note that I have mentioned keyword here. Feel free to use text and keyword if you plan to implement partial + exact search.
},
"product_key": {
"type": "keyword"
},
"product_values": {
"type": "keyword"
},
"process_key":{
"type": "keyword"
},
"process_values":{
"type": "keyword"
},
"material_key":{
"type": "keyword"
},
"material_values":{
"type": "keyword"
}
}
}
}
}
}
Notice that I've made use of the nested datatype. I'd suggest you go through that link to understand why we need it instead of the plain object type.
Sample Document:
POST my_services_index/_doc/1
{
"services":[
{
"service_key": "595",
"process_key": "1",
"process_values": ["95", "97", "91"],
"product_key": "3",
"product_values": ["475", "476", "471"],
"material_key": "4",
"material_values": ["644", "645", "643"]
},
{
"service_key": "596",
"process_key": "1",
"process_values": ["91", "89", "75"],
"product_key": "3",
"product_values": ["476", "476", "301"],
"material_key": "4",
"material_values": ["644", "647", "555"]
}
]
}
Notice how you can now manage your data if it ends up having multiple combinations of product_key, process_key and material_key.
The way to interpret the above document is that you have two nested documents inside a single document of my_services_index.
Sample Query:
POST my_services_index/_search
{
"_source": "services.service_key",
"query": {
"bool": {
"must": [
{
"nested": { <---- Note this
"path": "services",
"query": {
"bool": {
"must": [
{
"term": {
"services.service_key": "595"
}
},
{
"term": {
"services.process_key": "1"
}
},
{
"term": {
"services.process_values": "95"
}
}
]
}
},
"inner_hits": {} <---- Note this
}
}
]
}
}
}
Note that I've made use of Nested Query.
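The original question asks for three arrays of IDs (Products, Processes and Materials) that must all fall within the same set. Since each set is one nested document, that can be expressed by extending the same nested query with one term clause per ID; the following is a sketch against the proposed mapping, using IDs from the sample document:

```
POST my_services_index/_search
{
  "query": {
    "nested": {
      "path": "services",
      "query": {
        "bool": {
          "must": [
            { "term": { "services.process_values": "95" } },
            { "term": { "services.process_values": "91" } },
            { "term": { "services.product_values": "475" } },
            { "term": { "services.material_values": "644" } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
```

Because every clause sits inside a single nested query, all of the IDs must match within the same nested services document, i.e. within the same set.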
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.828546,
"hits" : [ <---- Note this. Which would return the original document.
{
"_index" : "my_services_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.828546,
"_source" : {
"services" : [
{
"service_key" : "595",
"process_key" : "1",
"process_values" : [
"95",
"97",
"91"
],
"product_key" : "3",
"product_values" : [
"475",
"476",
"471"
],
"material_key" : "4",
"material_values" : [
"644",
"645",
"643"
]
},
{
"service_key" : "596",
"process_key" : "1",
"process_values" : [
"91",
"89",
"75"
],
"product_key" : "3",
"product_values" : [
"476",
"476",
"301"
],
"material_key" : "4",
"material_values" : [
"644",
"647",
"555"
]
}
]
},
"inner_hits" : { <--- Note this, which would tell you which inner document has been a hit.
"services" : {
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 1.828546,
"hits" : [
{
"_index" : "my_services_index",
"_type" : "_doc",
"_id" : "1",
"_nested" : {
"field" : "services",
"offset" : 0
},
"_score" : 1.828546,
"_source" : {
"service_key" : "595",
"process_key" : "1",
"process_values" : [
"95",
"97",
"91"
],
"product_key" : "3",
"product_values" : [
"475",
"476",
"471"
],
"material_key" : "4",
"material_values" : [
"644",
"645",
"643"
]
}
}
]
}
}
}
}
]
}
}
Note that I've made use of the keyword datatype. Please feel free to choose whatever datatype suits your business requirements for each of the fields.
The idea I've provided here is to help you understand the document model.
Hope this helps!

Related

Json Path Read from a Kafka Message

I have a Kafka message like the one below, and I'm trying to read data from it using JsonPath. However, I'm having a challenge reading some of the attributes. Here is the sample message.
sample1:
{
"header": {
"bu": "google",
"id": "12345",
"bum": "google",
"originTimestamp": "2021-10-09T15:17:09.842+00:00",
"batchSize": "0",
"jobType": "Batch"
},
"payload": {
"derivationdetails": {
"Id": "6783jhvvh897u31y283y",
"itemid": "1234567",
"batchid": 107,
"attributes": {
"itemid": "1234567",
"lineNbr": "1498",
"cat": "5929",
"Id": "6783jhvvh897u31y283y",
"indicator": "false",
"subcat": "3514"
},
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
}
]
}
}
}
}
Sample 2: the same message, but note the difference in "payload"
{
"header": {
"bu": "google",
"id": "12345",
"bum": "google",
"originTimestamp": "2021-10-09T15:17:09.842+00:00",
"batchSize": "0",
"jobType": "Batch"
},
"payload": {
"Id": "6783jhvvh897u31y283y",
"itemid": "1234567",
"batchid": 107,
"attributes": {
"itemid": "1234567",
"lineNbr": "1498",
"cat": "5929",
"Id": "6783jhvvh897u31y283y",
"indicator": "false",
"subcat": "3514"
},
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
}
]
}
}
}
If you observe, sometimes the message has "derivationdetails" and sometimes it doesn't. But irrespective of its existence, I need to read the values of Id, itemid and batchid. I tried using
$.payload[*].id
$.payload[*].itemid
$.payload[*].batchid
But I see that batchid returns null even though it has a value in the message, and the fields under "attributes" return null when I use the above. For fields under "attributes" I am using this (example):
$.payload.attributes.itemId
And I'm completely blank on how to read the part below.
"Exception": {
"values": [
{
"type": "PICK",
"value": "blocked",
"Reason": [
"RULE"
],
"rules": [
"439"
]
}
],
"rulesBagInfo": [
{
"Idtype": "XXXX",
"uniqueid": "7889423rbhevfhjaufdyeuiryeukjbdafvjd",
"rulesMatch": [
"439"
]
I'm new to this and need some suggestions on how to read the attributes properly. Any help would be much appreciated. Thanks.
Use .. (recursive descent, or deep scan; JsonPath borrows this syntax from E4X) to get the values. Note that it returns a list, which will contain multiple entries if the same key appears at several depths.
The JsonPath expressions below will each return a list with one item, for both sample1 and sample2:
$.payload..attributes.Id
$.payload..attributes.itemid
$.payload..batchid
$.payload..Exception
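For intuition, the deep-scan behaviour can be reproduced in plain Python; this small helper is only an illustration of what .. does, not the JsonPath library itself:

```python
def deep_scan(node, key):
    """Collect every value stored under `key`, at any depth (like `..key`)."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(deep_scan(v, key))  # also look inside the value
    elif isinstance(node, list):
        for item in node:
            found.extend(deep_scan(item, key))
    return found

# Abbreviated version of sample2 (no "derivationdetails" wrapper).
sample2 = {
    "header": {"bu": "google", "id": "12345"},
    "payload": {
        "Id": "6783jhvvh897u31y283y",
        "itemid": "1234567",
        "batchid": 107,
        "attributes": {"itemid": "1234567", "Id": "6783jhvvh897u31y283y"},
    },
}

print(deep_scan(sample2["payload"], "batchid"))  # [107]
print(deep_scan(sample2["payload"], "itemid"))   # ['1234567', '1234567']
```

The second call shows why deep scan can return more than one entry: itemid exists both directly under payload and inside attributes.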

How to search a Claim using extension field

I have a Claim payload, to which I have added an extension block (not sure how/where the url came from):
"extension" : [{
"url" : "http://hl7.org/fhir/StructureDefinition/iso-21090-EN-use",
"valueString" : "MAPD"
}],
I want to search this claim record using the extension but don't know how to do it.
I tried a GET request to https://<azure_fhir_server>/Claim?extension=MAPD but it says:
{
"severity": "warning",
"code": "not-supported",
"diagnostics": "The search parameter 'extension' is not supported for resource type 'Claim'."
}
=====================
EDIT:
As suggested by @Nik Klassen, I posted the following payload to /SearchParameter:
{
"resourceType" : "SearchParameter",
"id": "b072f860-7ecd-4d73-a490-74acd673f8d2",
"name": "extensionValueString",
"status": "active",
"url" : "http://hl7.org/fhir/SearchParameter/extension-valuestring",
"description": "Returns a Claim with extension.valueString matching the specified one in request.",
"code" : "lob",
"base" : [
"Claim"
],
"type" : "string",
"expression" : "Claim.extension.where(url ='http://hl7.org/fhir/SearchParameter/extension-valuestring').extension.value.string"
}
I also ran $reindex on Claim, but couldn't find the column lob (the $reindex response is below):
{
"resourceType": "Parameters",
"id": "ee8786d2-616a-4b81-8f6a-8089591b1225",
"meta": {
"versionId": "1"
},
"parameter": [
{
"name": "_id",
"valueString": "28e808d6-e420-4a33-bb0b-7cd325c8c169"
},
{
"name": "status",
"valueString": "http://hl7.org/fhir/fm-status|active"
},
{
"name": "priority",
"valueString": "http://terminology.hl7.org/CodeSystem/processpriority|normal"
},
{
"name": "facility",
"valueString": "Location/Location"
},
{
"name": "patient",
"valueString": "Patient/f8d8477c-1ef4-4878-abed-51e514bfd91f"
},
{
"name": "encounter",
"valueString": "Encounter/67062d00-2531-3ebd-8558-1de2fd3e5aab"
},
{
"name": "use",
"valueString": "http://hl7.org/fhir/claim-use|claim"
},
{
"name": "identifier",
"valueString": "TEST"
},
{
"name": "_lastUpdated",
"valueString": "2021-08-25T07:39:15.3050000+00:00"
},
{
"name": "created",
"valueString": "1957-04-12T21:23:35+05:30"
}
]
}
I read somewhere that I need to create a StructureDefinition, but I don't know how to do that.
Basically I want to add a field "LOB" as an extension to all my resources, and search them using: GET: https://fhir_server/{resource}?lob=<value>
By default you can only search on fields that are part of the FHIR spec. These are listed in a "Search Parameters" section on the page for each resource type, i.e. https://hl7.org/fhir/claim.html#search. To search on extensions you will need to create a custom SearchParameter https://learn.microsoft.com/en-us/azure/healthcare-apis/fhir/how-to-do-custom-search, i.e.
POST {{FHIR_URL}}/SearchParameter
{
"resourceType" : "SearchParameter",
"id" : "iso-21090-EN-use",
"url" : "ttp://hl7.org/fhir/SearchParameter/iso-21090-EN-use",
... some required fields ...
"code" : "iso-use",
"base" : [
"Claim"
],
"type" : "token",
"expression" : "Claim.extension.where(url = 'http://hl7.org/fhir/StructureDefinition/iso-21090-EN-use').value.string"
}
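Once the SearchParameter is created and the Claim resource type has been reindexed (as with the $reindex you ran), the value of code becomes the query parameter name. A sketch, assuming the code iso-use from the example above:

```
GET {{FHIR_URL}}/Claim?iso-use=MAPD
```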

Mongodb Aggregate : replace value of one collection with matching value of other collection

I am new to MongoDB. I have two collections, like this:
The first collection's name is a:
db.a.find()
{
"_id": "1234",
"versions": [{
"owner_id": ObjectId("100000"),
"versions": 1,
"type" : "info",
"items" : ["item1","item3","item7"]
},
{
"owner_id": ObjectId("100001"),
"versions": 2,
"type" : "bug",
"OS": "Ubuntu",
"Dependencies" : "Trim",
"items" : ["item1","item7"]
}
]}
The second collection's name is b:
db.b.find()
{
"_id": ObjectId("100000"),
"email": "abc#xyz.com"
} {
"_id": ObjectId("100001"),
"email": "bbc#xyz.com"
}
Expected output is:
{
"_id": "1234",
"versions":[{
"owner_id": "abc#xyz.com",
"versions": 1,
"type" : "info",
"items" : ["item1","item3","item7"]
},
{
"owner_id": "bbc#xyz.com",
"versions": 2,
"type" : "bug",
"OS": "Ubuntu",
"Dependencies" : "Trim",
"items" : ["item1","item7"]
}
] }
Requirement: the fields inside each document of versions are not fixed.
Example: versions[0] has 4 key-value pairs and versions[1] has 6 key-value pairs.
So I am looking for a query which can replace owner_id with email while keeping all other fields in the output.
I tried :
db.a.aggregate(
[
{$unwind:"$versions"},
{$lookup : {from : "b", "localField":"versions.owner_id", "foreignField":"_id", as :"out"}},
{$project : {"_id":1, "versions.owner_id":{$arrayElemAt:["$out.email",0]}}},
{$group:{_id:"$_id", versions : {$push : "$versions"}}}
]
).pretty()
Please help.
Thank You!!!
Instead of the $project pipeline stage, use $addFields.
Example:
db.a.aggregate([
{ $unwind: "$versions" },
{
$lookup: {
from: "b",
localField: "versions.owner_id",
foreignField: "_id",
as: "out"
}
},
{
$addFields: {
"versions.owner_id": { $arrayElemAt: ["$out.email",0] }
}
},
{ $group: { _id: "$_id", versions: { $push: "$versions" } } }
]).pretty()
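For intuition, the effect of the unwind, lookup, addFields, group pipeline can be mimicked in plain Python (ObjectIds replaced with plain strings; purely an illustration, not MongoDB code):

```python
# Collection "a" (one document) and collection "b" flattened to an _id -> email map.
a_doc = {
    "_id": "1234",
    "versions": [
        {"owner_id": "100000", "versions": 1, "type": "info"},
        {"owner_id": "100001", "versions": 2, "type": "bug", "OS": "Ubuntu"},
    ],
}
b_docs = {"100000": "abc@xyz.com", "100001": "bbc@xyz.com"}

# Like $addFields, this overwrites owner_id but keeps every other key of each
# version, which is why $addFields is preferred over $project here: the fields
# per version are not fixed.
result = {
    "_id": a_doc["_id"],
    "versions": [
        {**v, "owner_id": b_docs[v["owner_id"]]} for v in a_doc["versions"]
    ],
}

print(result["versions"][0]["owner_id"])  # abc@xyz.com
```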

MongoDB syntax error

I am having trouble with the syntax (SyntaxError: Unexpected token ILLEGAL) in MongoDB. This command was copied directly from a MongoDB instruction PDF and I cannot find out what is wrong.
Also, I don't know if it is relevant, but I am using Codeanywhere with a MEAN stack.
db.restaurants.insert(
{
"address" : {
"street" : "2 Avenue",
"zipcode" : "10075",
"building" : "1480",
"coord" : [ ­73.9557413, 40.7720266 ],
},
"borough" : "Manhattan",
"cuisine" : "Italian",
"grades" : [
{
"date" : ISODate("2014­10­01T00:00:00Z"),
"grade" : "A",
"score" : 11
},
{
"date" : ISODate("2014­01­16T00:00:00Z"),
"grade" : "B",
"score" : 17
}
],
"name" : "Vella",
"restaurant_id" : "41704620"
}
)
Try to replace:
"coord" : [ ­73.9557413, 40.7720266 ],
with:
"coord" : [ ­73.9557413, 40.7720266 ]
The comma at the end of the subdocument is extra.
By the way, the JSON standard allows only double-quoted strings as property keys, so also try this variant:
"coord" : [ "­73.9557413", "40.7720266" ]
I checked your entire JSON document with a JSON validator; here is a valid version:
{
"address": {
"street": "2 Avenue",
"zipcode": "10075",
"building": "1480",
"coord": ["73.9557413", "40.7720266"]
},
"borough": "Manhattan",
"cuisine": "Italian",
"grades": [{
"date": "20141001T00:00:00Z",
"grade": "A",
"score": 11
}, {
"date": "20140116T00:00:00Z",
"grade": "B",
"score": 17
}],
"name": "Vella",
"restaurant_id": "41704620"
}

Facets tokenize tags with spaces. Is there a solution?

I have a problem with facets tokenizing tags that contain spaces.
I have the following mappings:
curl -XPOST "http://localhost:9200/pictures" -d '
{
"mappings" : {
"pictures" : {
"properties" : {
"id": { "type": "string" },
"description": {"type": "string", "index": "not_analyzed"},
"featured": { "type": "boolean" },
"categories": { "type": "string", "index": "not_analyzed" },
"tags": { "type": "string", "index": "not_analyzed", "analyzer": "keyword" },
"created_at": { "type": "double" }
}
}
}
}'
And My Data is:
curl -X POST "http://localhost:9200/pictures/picture" -d '{
"picture": {
"id": "4defe0ecf02a8724b8000047",
"title": "Victoria Secret PhotoShoot",
"description": "From France and Italy",
"featured": true,
"categories": [
"Fashion",
"Girls",
],
"tags": [
"girl",
"photoshoot",
"supermodel",
"Victoria Secret"
],
"created_at": 1405784416.04672
}
}'
And My Query is:
curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
{
"query": {
"text": {
"tags": {
"query": "Victoria Secret"
}
}
},
"facets": {
"tags": {
"terms": {
"field": "tags"
}
}
}
}'
The Output result is:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
},
"facets" : {
"tags" : {
"_type" : "terms",
"missing" : 0,
"total" : 0,
"other" : 0,
"terms" : [ ]
}
}
}
Now I get total 0 in facets and total: 0 in hits.
Any idea why it's not working?
I know that when I remove the keyword analyzer from tags and make it "not_analyzed", then I get a result.
But there is still the problem of case sensitivity.
If I run the same query as above after removing the keyword analyzer, then I get this result:
facets: {
tags: {
_type: terms
missing: 0
total: 12
other: 0
terms: [
{
term: photoshoot
count: 1
}
{
term: girl
count: 1
}
{
term: Victoria Secret
count: 1
}
{
term: supermodel
count: 1
}
]
}
}
Here Victoria Secret is case-sensitive when "not_analyzed", but the space is kept intact; however, when I query with the lowercase "victoria secret" it doesn't return any results.
Any suggestions?
Thanks,
Suraj
The first examples are not totally clear to me. If you use the KeywordAnalyzer, it means the field will be indexed as it is, but then it makes much more sense to simply not analyze the field at all, which amounts to the same thing. The mapping you posted contains both
"index": "not_analyzed", "analyzer": "keyword"
which doesn't make a lot of sense: if you are not analyzing the field, why would you select an analyzer for it?
Apart from this, if you don't analyze the field then of course the tag Victoria Secret will be indexed as it is, thus the query victoria secret won't match. If you want it to be case-insensitive, you need to define a custom analyzer which uses the KeywordTokenizer, since you don't want to tokenize the value, together with the LowercaseTokenFilter. You can define a custom analyzer through the analysis section of the index settings and then use it in your mapping. But that way the facet would always be lowercase, which I guess is something you don't want. That's why it's better to define a multi field and index the field using two different text analysis chains: one for the facet and one for search.
You can create the index like this:
curl -XPOST "http://localhost:9200/pictures" -d '{
"settings" : {
"analysis" : {
"analyzer" : {
"lowercase_analyzer" : {
"type" : "custom",
"tokenizer" : "keyword",
"filter" : [ "lowercase"]
}
}
}
},
"mappings" : {
"pictures" : {
"properties" : {
"id": { "type": "string" },
"description": {"type": "string", "index": "not_analyzed"},
"featured": { "type": "boolean" },
"categories": { "type": "string", "index": "not_analyzed" },
"tags" : {
"type" : "multi_field",
"fields" : {
"tags": { "type": "string", "analyzer": "lowercase_analyzer" },
"facet": {"type": "string", "index": "not_analyzed"},
}
},
"created_at": { "type": "double" }
}
}
}
}'
Then the custom lowercase_analyzer will also be applied by default to the text query when you search on that field, so you can search for either Victoria Secret or victoria secret and get the result back. You then need to change the facet part so that it runs on the new tags.facet field, which is not analyzed.
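A sketch of what the adjusted search could look like against this mapping: the query targets the analyzed tags field while the facet targets the untouched tags.facet sub-field (syntax as in the question's Elasticsearch version; adjust as needed):

```
curl -X POST "http://localhost:9200/pictures/_search?pretty=true" -d '
{
  "query": {
    "text": {
      "tags": { "query": "victoria secret" }
    }
  },
  "facets": {
    "tags": {
      "terms": { "field": "tags.facet" }
    }
  }
}'
```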
Furthermore, you might want to have a look at the match query since the text query has been deprecated with the latest elasticsearch version (0.19.9).
I think this gist will make my answer clearer:
https://gist.github.com/2688072