Mapping format on elasticsearch

I'm trying to upload a JSON document to my server via Elasticsearch. I want to define a mapping before I upload it, but I keep getting a search phase execution exception error.
The json data looks like this
{"geometry":{"type":"Point","coordinates":[-73.20266100000001,45.573647]},"properties":{"persistent_id":"XVCPFsbsqB7h4PrxEtCU3w==","timestamp":1408216040000,"tower_id":"10.48.66.178"}}
So far I've tried this as my mapping. I'm not sure what I am doing wrong...
curl –XPUT 'http://localhost:9200/carrier/_search?q=coordinates?pretty=true' -d'
{ “geometry”: {
“type” : {“type” : “string”},
“coordinates” : {“type” : “geo_point”}
},
“properties” : {
“persistent_id” : {“type” : “string”},
“timestamp”: { “type” : “long”},
“tower_id” : {“type” : “string”}
}'

There are a few problems here. First, you need to use a put mapping request instead of a search request. The body of the request has to start with the name of the type, followed by the list of properties (fields) that you are adding. Second, you probably copied the example from some documentation where all ASCII quotes (") were replaced with their fancy Unicode versions (“ and ”), and the dash in front of the XPUT parameter looks like an en dash (–) instead of a normal hyphen (-). You need to replace all fancy quotes and dashes with their ASCII versions. Altogether, the working statement should look like this (assuming doc as your document type):
curl -XPUT 'http://localhost:9200/carrier/doc/_mapping' -d '{
"doc": {
"properties": {
"geometry": {
"properties": {
"type": {
"type": "string"
},
"coordinates": {
"type": "geo_point"
}
}
},
"properties": {
"properties": {
"persistent_id": {
"type": "string"
},
"timestamp": {
"type": "long"
},
"tower_id": {
"type": "string"
}
}
}
}
}
}'
Then you can add a document like this:
curl -XPUT 'http://localhost:9200/carrier/doc/1' -d '{"geometry":{"type":"Point","coordinates":[-73.20266100000001,45.573647]},"properties":{"persistent_id":"XVCPFsbsqB7h4PrxEtCU3w==","timestamp":1408216040000,"tower_id":"10.48.66.178"}}'
Please note that, in order to add the mapping, you might need to delete and recreate the index if you have already tried to add documents to it and a mapping was already created.
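The quote-and-dash cleanup described above can be sketched in a few lines of Python. This is only an illustration: the function name to_ascii_punctuation and the sample strings are made up for this example, and the character table covers just the common "smart" punctuation marks.

```python
import json

# Map common typographic characters to their ASCII equivalents.
ASCII_FIXES = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
})


def to_ascii_punctuation(text):
    """Return text with fancy quotes and dashes replaced by ASCII ones."""
    return text.translate(ASCII_FIXES)


# Hypothetical fragment pasted from a page that "prettified" the quotes.
pasted = "{ \u201ctype\u201d : {\u201ctype\u201d : \u201cstring\u201d} }"
cleaned = to_ascii_punctuation(pasted)

# After cleaning, the fragment parses as ordinary JSON.
parsed = json.loads(cleaned)
print(parsed)
```

Running the request body through json.loads before sending it is a cheap way to catch this kind of copy-and-paste damage early.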

This is because you're using the _search endpoint in order to install your mapping.
You have to use the _mapping endpoint instead, like this:
curl -XPUT 'http://localhost:9200/carrier/_mapping/geometry' -d '{
...your mapping...
}'

Related

Elasticsearch dynamic mapping for object within attribute

Wondering if I can create a "dynamic mapping" within an elasticsearch index. The problem I am trying to solve is the following: I have a schema that has an attribute that contains an object that can differ greatly between records. I would like to mirror this data within elasticsearch if possible but believe that automatic mapping may get in the way.
Imagine a scenario where I have a schema like the following:
{
name: string
origin: string
payload: object // can be of any type / schema
}
Is it possible to create a mapping that supports this? I do not need to query the records by this payload attribute, but it would be great if I can.
Note that I have checked the documentation but am unsure whether what Elastic calls dynamic mapping is what I am looking for.

It's certainly possible to specify which queryable fields you expect the payload to contain and what those fields' mappings should be.
Let's say each doc will include the fields payload.livemode and payload.created_at. If these are the only two fields you'll want to perform queries on, and you'd like to disable dynamic, index-time mappings autogenerated by Elasticsearch for the rest of the fields, you can use dynamic templates like so:
PUT my-payload-index
{
"mappings": {
"dynamic_templates": [
{
"variable_payload": {
"path_match": "payload",
"mapping": {
"type": "object",
"dynamic": false,
"properties": {
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"livemode": {
"type": "boolean"
}
}
}
}
}
],
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"origin": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Then, as you ingest your docs:
POST my-payload-index/_doc
{
"name": "abc",
"origin": "web.dev",
"payload": {
"created_at": "2021-04-05 08:00:00",
"livemode": false,
"abc":"def"
}
}
POST my-payload-index/_doc
{
"name": "abc",
"origin": "web.dev",
"payload": {
"created_at": "2021-04-05 08:00:00",
"livemode": true,
"modified_at": "2021-04-05 09:00:00"
}
}
and verify with
GET my-payload-index/_mapping
no new mappings will be generated for the fields payload.abc nor payload.modified_at.
Not only that — the new fields will also be ignored, as per the documentation:
These fields will not be indexed or searchable, but will still appear in the _source field of returned hits.
Side note: if fields are neither stored nor searchable, they're effectively the opposite of enabled.
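For comparison, here is a minimal sketch of a simpler variant that maps payload directly instead of going through a dynamic template. It is not the approach used above. It assumes that either nothing inside payload needs to be searchable (enabled set to false, so the contents are kept in _source but not parsed or indexed) or that only a known set of subfields does (dynamic set to false plus explicit properties). The helper name payload_mapping and the simplified name/origin mappings are hypothetical.

```python
import json


def payload_mapping(searchable_fields=None):
    """Build a mapping for the free-form payload object.

    With no searchable fields, the object is disabled entirely.
    Otherwise only the listed subfields are mapped, and unknown
    subfields are left unmapped because dynamic is false.
    """
    if not searchable_fields:
        return {"type": "object", "enabled": False}
    return {
        "type": "object",
        "dynamic": False,
        "properties": searchable_fields,
    }


# Request body for creating the index (simplified name/origin mappings).
index_body = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "origin": {"type": "text"},
            "payload": payload_mapping(),
        }
    }
}

# Variant that keeps two payload subfields queryable.
partial = payload_mapping({
    "created_at": {"type": "date", "format": "yyyy-MM-dd HH:mm:ss"},
    "livemode": {"type": "boolean"},
})

print(json.dumps(index_body, indent=2))
```

The printed JSON would be sent as the body of the index-creation request; whether the disabled or the partially mapped variant fits depends on whether any payload field will ever be queried.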
The Big Picture
Working with variable contents of a single, top-level object is quite standard. Take for instance the stripe event object — each event has an id, an api_version and a few other shared params. Then there's the data object that's analogous to your payload field.
Now, all is fine, until you need to aggregate on the contents of your payload. See, since the content is variable, so are the data paths / accessors. But wildcards in aggregation paths don't work in Elasticsearch. Scripts do but are onerous to maintain.
Back to Stripe. They partially solved it through what they call polymorphic, typed hashes, as discussed in their blog on API design.
A pretty neat approach that's worth emulating.
P.S. I discuss dynamic templates in more detail in the chapter "Mapping Automation" of my ES Handbook.

rename invalid keys from JSON

I have the following flow in NiFi; the JSON has 1000+ objects in it.
InvokeHTTP -> SplitJson -> PutMongo
The flow works fine until I receive keys in the JSON with "." in the name, e.g. "spark.databricks.acl.dfAclsEnabled".
My current solution is not optimal: I have jotted down the bad keys and use multiple ReplaceText processors to replace "." with "_". I am not using a regex; I am using a string-literal find/replace. So each time I get a failure in the PutMongo processor, I insert a new ReplaceText processor.
This is not maintainable. I am wondering if I can use JOLT for this. Some information about the input JSON:
1) There is no set structure; the only thing that is guaranteed is that everything will be in the events array. The event object itself is free-form.
2) Maximum list size = 1000.
3) It is third-party JSON, so I can't ask for a change in format.
Also, a key with "." can appear anywhere, so I am looking for a JOLT spec that can cleanse keys at all levels and rename them.
{
"events": [
{
"cluster_id": "0717-035521-puny598",
"timestamp": 1531896847915,
"type": "EDITED",
"details": {
"previous_attributes": {
"cluster_name": "Kylo",
"spark_version": "4.1.x-scala2.11",
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled": "true",
"spark.databricks.repl.allowedLanguages": "python,sql"
},
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autotermination_minutes": 10,
"enable_elastic_disk": true,
"cluster_source": "UI"
},
"attributes": {
"cluster_name": "Kylo",
"spark_version": "4.1.x-scala2.11",
"node_type_id": "Standard_DS3_v2",
"driver_node_type_id": "Standard_DS3_v2",
"autotermination_minutes": 10,
"enable_elastic_disk": true,
"cluster_source": "UI"
},
"previous_cluster_size": {
"autoscale": {
"min_workers": 1,
"max_workers": 8
}
},
"cluster_size": {
"autoscale": {
"min_workers": 1,
"max_workers": 8
}
},
"user": ""
}
},
{
"cluster_id": "0717-035521-puny598",
"timestamp": 1535540053785,
"type": "TERMINATING",
"details": {
"reason": {
"code": "INACTIVITY",
"parameters": {
"inactivity_duration_min": "15"
}
}
}
},
{
"cluster_id": "0717-035521-puny598",
"timestamp": 1535537117300,
"type": "EXPANDED_DISK",
"details": {
"previous_disk_size": 29454626816,
"disk_size": 136828809216,
"free_space": 17151311872,
"instance_id": "6cea5c332af94d7f85aff23e5d8cea37"
}
}
]
}

I created a template using ReplaceText and RouteOnContent to perform this task. The loop is required because the regex only replaces the first . in the JSON key on each pass. You might be able to refine this to perform all substitutions in a single pass, but after fuzzing the regex with the look-ahead and look-behind groups for a few minutes, re-routing was faster. I verified this works with the JSON you provided, and also JSON with the keys and values on different lines (: on either):
...
"spark_conf": {
"spark.databricks.acl.dfAclsEnabled":
"true",
"spark.databricks.repl.allowedLanguages"
: "python,sql"
},
...
You could also use an ExecuteScript processor with Groovy to ingest the JSON, quickly filter all JSON keys that contain ., perform a collect operation to do the replacement, and re-insert the keys in the JSON data if you want a single processor to do this in a single pass.
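The single-pass idea can be sketched as a recursive key rename. The sketch below is plain Python rather than Groovy and does not touch the NiFi scripting API; the function name rename_keys and the trimmed sample document are illustrative only. It assumes that replacing "." with "_" never makes two keys in the same object collide.

```python
import json


def rename_keys(node, old=".", new="_"):
    """Return a copy of node with `old` replaced by `new` in every dict key.

    Walks nested dicts and lists; scalar values are returned unchanged,
    so dots inside values (e.g. version strings) are preserved.
    """
    if isinstance(node, dict):
        return {
            key.replace(old, new): rename_keys(value, old, new)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_keys(item, old, new) for item in node]
    return node


# Trimmed-down version of the events document from the question.
raw = """
{"events": [{"details": {"previous_attributes": {
    "spark_version": "4.1.x-scala2.11",
    "spark_conf": {
        "spark.databricks.acl.dfAclsEnabled": "true",
        "spark.databricks.repl.allowedLanguages": "python,sql"
    }}}}]}
"""

cleaned = rename_keys(json.loads(raw))
print(json.dumps(cleaned, indent=2))
```

Because the function works on the parsed structure rather than on the text, it does not depend on how keys and values are laid out across lines.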

JSON returned by Solr

I'm using Solr in order to index my data.
Through Solr's UI I added two fields in the Schema window: word and messageid.
Then I posted the following document:
curl -X POST -H "Content-Type: application/json" 'http://localhost:8983/solr/messenger/update.json/docs' --data-binary '{"word":"hello","messageid":"23523"}'
I received the following JSON:
{
"responseHeader": {
"status": 0,
"QTime": 55
}
}
When I go to the Query window and execute a query without parameters, I get the following JSON:
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"q": "*:*",
"indent": "on",
"wt": "json",
"_": "1488911768817"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
"_src_": "{\"word\":\"hello\",\"messageid\":\"23523\"}",
"_version_": 1561232739042066432
}
]
}
}
Shouldn't my JSON appear more like the following one?:
//More Code
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": "92db6722-d10d-447a-b5b1-13ad9b70b3e2",
"word": "hello",
"messageid": "23523",
"_version_": 1561232739042066432
}
//More Code
I want this so that I can later filter on these fields using query parameters.

It turns out you were using the so-called 'custom JSON indexing' approach, which is described here. You can tweak it as described in the wiki in order to extract the desired fields. Here is an excerpt for your reference:
split: Defines the path at which to split the input JSON into multiple Solr documents and is required if you have multiple documents in a single JSON file. If the entire JSON makes a single solr document, the path must be “/”. It is possible to pass multiple split paths by separating them with a pipe (|) example : split=/|/foo|/foo/bar . If one path is a child of another, they automatically become a child document
f: This is a multivalued mapping parameter. The format of the parameter is target-field-name:json-path. The json-path is required. The target-field-name is the Solr document field name, and is optional. If not specified, it is automatically derived from the input JSON. The default target field name is the fully qualified name of the field. Wildcards can be used here, see the section Wildcards below for more information.
But I would recommend using the standard approach to indexing documents, which is the good old update command from here. It would look more like this:
curl 'http://localhost:8983/solr/messenger/update?commit=true' --data-binary '{"word":"hello","messageid":"23523"}' -H 'Content-type:application/json'
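If you stay with custom JSON indexing, the split and f parameters quoted above are passed as query-string parameters. The following is a rough sketch that only builds such a URL; the helper name build_custom_json_url is made up, and the /update/json/docs path, host, and core name are assumptions based on the question's setup.

```python
from urllib.parse import urlencode


def build_custom_json_url(base, core, field_paths, split="/"):
    """Build an update URL with one f=target:json-path pair per field."""
    params = [("split", split)]
    params += [("f", f"{name}:{path}") for name, path in field_paths]
    params.append(("commit", "true"))
    # Keep "/" and ":" readable instead of percent-encoding them.
    query = urlencode(params, safe="/:")
    return f"{base}/solr/{core}/update/json/docs?{query}"


url = build_custom_json_url(
    "http://localhost:8983",
    "messenger",
    [("word", "/word"), ("messageid", "/messageid")],
)
print(url)
```

The resulting URL would take the place of the one in the original curl command, with the same JSON document as the request body.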

Elasticsearch match all tags within given array

Currently developing a tag search application using elasticsearch, I have given each document within the index an array of tags, here's an example of how a document looks:
_source: {
title: "Keep in touch scheme",
intro: "<p>hello this is a test</p> ",
full: " <p>again this is a test mate</p>",
media: "",
link: "/training/keep-in-touch",
tags: [
"employee",
"training"
]
}
I would like to be able to make a search and only return documents with all of the specified tags.
Using the above example, if I searched for a document with tags ["employee", "training"] then the above result would be returned.
In contrast, if I searched with tags ["employee", "other"], then nothing would be returned; all tags within the search query must match.
Currently I am doing:
query: {
bool: {
must: [
{ match: { tags: ["employee","training"] }}
]
}
}
but I am just getting returned exceptions like
IllegalStateException[Can't get text on a START_ARRAY at 1:128];
I have also tried concatenating the arrays and using comma-delimited strings, however this seems to match anything given the first tag matches.
Any suggestions on how to approach this? Cheers

Option 1: The following example should work (v2.3.2):
curl -XPOST 'localhost:9200/yourIndex/yourType/_search?pretty' -d '{
"query": {
"bool": {
"must": [
{ "term": { "tags": "employee" } } ,
{ "term": { "tags": "training" } }
]
}
}
}'
Option 2: You can also try:
curl -XPOST 'localhost:9200/yourIndex/yourType/_search?pretty' -d '{
"query": {
"filtered": {
"query": {"match_all": {}},
"filter": {
"terms": {
"tags": ["employee", "training"]
}
}
}
}
}'
But without "minimum_should_match": 1 the results are not quite accurate.
I also found "execution": "and", but its results are not accurate either.
Option 3: You can also try query_string. It works perfectly, but looks a little bit complicated:
curl -XPOST 'localhost:9200/yourIndex/yourType/_search?pretty' -d '{
"query" : {
"query_string": {
"query": "(tags:employee AND tags:training)"
}
}
}'
Maybe it will be helpful for you...

To ensure that the set contains only the specified values, maintain a secondary field that keeps track of the tag count. Then you can query as shown below to get the desired results:
"query":{
"bool":{
"must":[
{"term": {"tags": "employee"}},
{"term": {"tags": "training"}},
{"term": {"tag_count": 2}}
]
}
}
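Since the query needs one term clause per tag, it is convenient to generate it from the tag list. This is a minimal sketch: the function name all_tags_query is hypothetical, and the tag_count field is assumed to be maintained at index time as described above.

```python
import json


def all_tags_query(tags, exact=False):
    """Build a bool query that requires every tag in `tags`.

    With exact=True, an extra clause on the assumed tag_count field
    restricts matches to documents that have no other tags.
    """
    unique_tags = list(dict.fromkeys(tags))  # drop duplicates, keep order
    must = [{"term": {"tags": tag}} for tag in unique_tags]
    if exact:
        must.append({"term": {"tag_count": len(unique_tags)}})
    return {"query": {"bool": {"must": must}}}


body = all_tags_query(["employee", "training"], exact=True)
print(json.dumps(body, indent=2))
```

With exact left at False, the result has the same shape as the two-term bool query in Option 1.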

Elasticsearch completion suggester phrase instead of terms

I am developing a search engine with Elasticsearch 1.6 and it's all working great. I get the data from my MySQL database with the JDBC importer from Jorg Prante. I would like to use the Elasticsearch completion suggester as documented here. The only problem is that I cannot find out how to do this without having tags like those shown in the examples everywhere. I only have the title of a product, which is quite long.
So I would like to know how to make this work as expected using the full phrase of the title, or otherwise how to split the title phrase into tags and add them.
This is my current mapping for the field 'title', but it only returns a (not very relevant) whole phrase.
curl -XPUT "http://localhost:9200/jdbc/" -d'
{
"mappings": {
"jdbc": {
"properties": {
"title": {
"type": "completion",
"index_analyzer": "simple",
"search_analyzer": "simple",
"payloads": true
}
}
}
}
}'
