I'm wondering whether I can create a "dynamic mapping" within an Elasticsearch index. The problem I am trying to solve is the following: I have a schema with an attribute containing an object that can differ greatly between records. I would like to mirror this data within Elasticsearch if possible, but I believe automatic mapping may get in the way.
Imagine a scenario where I have a schema like the following:
{
name: string
origin: string
payload: object // can be of any type / schema
}
Is it possible to create a mapping that supports this? I do not need to query the records by this payload attribute, but it would be great if I could.
Note that I have checked the documentation but am not sure whether what Elastic calls dynamic mapping is what I am looking for.
It's certainly possible to specify which queryable fields you expect the payload to contain and what those fields' mappings should be.
Let's say each doc will include the fields payload.livemode and payload.created_at. If these are the only two fields you'll want to perform queries on, and you'd like to disable dynamic, index-time mappings autogenerated by Elasticsearch for the rest of the fields, you can use dynamic templates like so:
PUT my-payload-index
{
"mappings": {
"dynamic_templates": [
{
"variable_payload": {
"path_match": "payload",
"mapping": {
"type": "object",
"dynamic": false,
"properties": {
"created_at": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss"
},
"livemode": {
"type": "boolean"
}
}
}
}
}
],
"properties": {
"name": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"origin": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword"
}
}
}
}
}
}
Then, as you ingest your docs:
POST my-payload-index/_doc
{
"name": "abc",
"origin": "web.dev",
"payload": {
"created_at": "2021-04-05 08:00:00",
"livemode": false,
"abc":"def"
}
}
POST my-payload-index/_doc
{
"name": "abc",
"origin": "web.dev",
"payload": {
"created_at": "2021-04-05 08:00:00",
"livemode": true,
"modified_at": "2021-04-05 09:00:00"
}
}
and verify with
GET my-payload-index/_mapping
no new mappings will be generated for the fields payload.abc or payload.modified_at.
Not only that — the new fields will also be ignored, as per the documentation:
These fields will not be indexed or searchable, but will still appear in the _source field of returned hits.
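You can see the effect on search too. A minimal sketch against the index above: a query on one of the explicitly mapped subfields matches as usual, whereas the same style of query on payload.abc finds nothing, because that field was never indexed.
GET my-payload-index/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "payload.livemode": true } },
        { "range": { "payload.created_at": { "gte": "2021-04-01 00:00:00" } } }
      ]
    }
  }
}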
Side note: if fields are neither stored nor searchable, they're effectively the opposite of enabled.
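To illustrate that side note, here's a minimal sketch (the index name my-payload-index-no-search is my own) of the variant where you don't want to query any payload subfields at all and only keep the object in _source:
PUT my-payload-index-no-search
{
  "mappings": {
    "properties": {
      "payload": {
        "type": "object",
        "enabled": false
      }
    }
  }
}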
The Big Picture
Working with variable contents of a single, top-level object is quite standard. Take for instance the Stripe event object — each event has an id, an api_version and a few other shared params. Then there's the data object that's analogous to your payload field.
Now, all is fine, until you need to aggregate on the contents of your payload. See, since the content is variable, so are the data paths / accessors. But wildcards in aggregation paths don't work in Elasticsearch. Scripts do but are onerous to maintain.
Back to Stripe. They partially solved it through what they call polymorphic, typed hashes, as discussed in their blog post on API design.
A pretty neat approach that's worth emulating.
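In practice that means every payload carries a discriminator naming its own shape, so the set of possible shapes stays enumerable and both mappings and aggregation paths can stay static. A rough sketch of such a document (my own illustration, not Stripe's exact format):
POST my-payload-index/_doc
{
  "name": "abc",
  "origin": "web.dev",
  "payload": {
    "object": "invoice",
    "created_at": "2021-04-05 08:00:00",
    "livemode": true
  }
}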
P.S. I discuss dynamic templates in more detail in the chapter "Mapping Automation" of my ES Handbook.
Related
I am reverse engineering an app that sends queries to
SOMESERVERNAME.analysis.windows.net/public/reports/querydata via an HTTP POST of a JSON-structured query.
Some initial lines of a sample query are at the end of this message.
I can't find any documentation on this anywhere. I don't know if this is some secret API or what. I ultimately would like to ignore the aggregations altogether and just dump the raw data, which seems to sit in some flat-file-type container on the back end, but without API documentation I'm stuck re-running the handful of super-basic queries I've been able to intercept.
Note: this app is an embedded analytics page created with Power BI, but the only REST API I can find for Power BI has nothing to do with querying; it only covers basic object management.
Thanks!
{
"version": "1.0.0",
"queries": [
{
"Query": {
"Commands": [
{
"SemanticQueryDataShapeCommand": {
"Query": {
"Version": 2,
"From": [
{
"Name": "s",
"Entity": "Sheet1"
}
],
"Select": [
{
"Aggregation": {
"Expression": {
"Column": {
"Expression": {
"SourceRef": {
"Source": "s"
}
},
"Property": "Total"
}
},
"Function": 0
},
"Name": "Sum(Sheet1.Total)"
}
],
"Where": [
{
"Condition": {
"In": {
"Expressions": [
{
"Column": {
"Expression": {
"SourceRef": {
"Source": "s"
}
},
"Property": "Year"
}
}
],
"Values": [
[
{
"Literal": {
"Value": "'2018'"
}
}
]
]
}
}
},
............
I have built a client that scrapes data off a specific Power BI report using the same API, but you'll probably be able to adapt it to your use case. Maybe we can even abstract the code into a more generalized Power BI client!
Having tinkered with the API for two days, I realised that there are many ways the data can be formatted:
"nested"/multidimensional data can be unflattened, flattened by 1 degree, etc.
a primary "table" of a result dataset (in data.PH) can reference others (in data.SH)
The basics are as follows:
A dataset is structured like a multidimensional table, with cells containing values.
In a set of cells, the first always has a field S that contains the schema of that cell and all subsequent cells.
The schema maps a field of each cell's object with a selection from your query, e.g. the G0 field with the queried column age.
My client seems to work only with a specific type of query (SemanticQueryDataShapeCommand), a specific number of dimensions and a specific column marked as primary (via Binding.Primary). But maybe that helps! https://github.com/derhuerst/fetch-bvg-occupancy/blob/1ebb864b1ff7130f9d2f0ab031c6d78bcabdd633/lib/parse-dataset.js
The only documented way to use this API is through the ADOMD.NET or OleDb provider.
If you want to send a DAX/MDX query and retrieve data programmatically, there's a sample of how to front-end the service with a simple REST API here.
We have a heavily nested JSON document containing server metrics. The document contains > 1000 fields, some of which are completely irrelevant to us for analytic purposes, so I would like to remove them before indexing the document in Elastic.
However, I am unable to find the correct filter to use, as the fields I want to remove have common names in multiple different objects within the document.
The source document looks like this (reduced in size for brevity):
[
{
"server": {
"is_master": true,
"name": "MYServer",
"id": 2111
},
"metrics": {
"Server": {
"time": {
"boundary": {},
"type": "TEXT",
"display_name": "Time",
"value": "2018-11-01 14:57:52"
}
},
"Mem_OldGen": {
"used": {
"boundary": {},
"display_name": "Used(mb)",
"value": 687
},
"committed": {
"boundary": {},
"display_name": "Committed(mb)",
"value": 7116
},
"cpu_count": {
"boundary": {},
"display_name": "Cores",
"value": 4
}
}
}
}
]
The data is loaded into Logstash using the http_poller input plugin and needs to be processed before being sent to Elastic for indexing.
I am trying to remove the fields that are not relevant for us to track for analytical purposes; these include the "display_name" and "boundary" fields from each JSON object in the different metrics.
I have tried using the mutate filter to remove the fields, but because they exist in so many different objects it requires too many coded paths to be added to the Logstash config.
I have also looked at the ruby filter, which seems promising as it can inspect the event, but I am unable to get it to crawl the entire JSON document, or, more importantly, actually remove the fields.
Here is what I was trying as a test:
filter {
split{
field => "message"
}
ruby {
code => '
event.get("[metrics][Mem_OldGen][used]").to_hash.keys.each { |k|
logger.info("field is:", k)
if k.include?("display_name")
event.remove(k)
end
if k.include?("boundary")
event.remove(k)
end
}
'
}
}
It first splits the input at the message level to create one event per server, then tries to remove the fields from a specific metric.
Any help would be greatly appreciated.
If I understand correctly, you want to keep just the value key.
So, considering the response hash:
response = {
"server": {
"is_master": true,
"name": "MYServer",
"id": 2111
},
"metrics": {
...
You could do:
response[:metrics].transform_values { |hh| hh.transform_values { |h| h.delete_if { |k,v| k != :value } } }
#=> {:server=>{:is_master=>true, :name=>"MYServer", :id=>2111}, :metrics=>{:Server=>{:time=>{:value=>"2018-11-01 14:57:52"}}, :Mem_OldGen=>{:used=>{:value=>687}, :committed=>{:value=>7116}, :cpu_count=>{:value=>4}}}}
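If you want to apply the same idea inside Logstash, here's a rough, untested sketch of a ruby filter. It assumes, as in your config, that the incoming array has already been split into one event per server and that metrics is a top-level field of each event:
filter {
  ruby {
    code => '
      metrics = event.get("metrics")
      if metrics.is_a?(Hash)
        metrics.each do |group, fields|
          next unless fields.is_a?(Hash)
          fields.each do |name, attrs|
            next unless attrs.is_a?(Hash)
            # keep only the "value" key of each metric object
            event.set("[metrics][#{group}][#{name}]", attrs.select { |k, _| k == "value" })
          end
        end
      end
    '
  }
}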
I'd like to use JSON Schema to validate some values. I have two objects, call them trackedItems and trackedItemGroups. The trackedItemGroups are a group name and a list of trackedItems names. For example, the schema is similar to:
"TrackedItems": {
"type": "array",
"items": {
"type": "object",
"properties": {
"TrackedItemName": { "type": "string" },
"Properties": { ---- }
}
}
},
"TrackedItemGroups": {
"type": "array",
"items": {
"type": "object",
"properties": {
"GroupName": {
"type": "string"
},
"TrackedItems": {
"type": "array",
"items": {"type": "string"}
}
}
}
}
I'd like to validate that every string in a TrackedItemGroups's TrackedItems array is a name that's been defined in TrackedItems.TrackedItemName.
This would be something like using the enum keyword to restrict the values, but the enum list would be generated based on the values in TrackedItems.TrackedItemName.
How can I write the schema to use the JSON's own data for validation?
I'm aware I could move things around, i.e. the TrackedItems define the group they're in, but there are hundreds of tracked items and this organization works much better for my use case.
I've tried this:
"TrackedItems": {
"type": "array",
"items": {
"oneOf": [
{"$ref":"#/properties/TrackedItems/items/properties/TrackedItemName"}
]
}
}
But this results in an error:
Newtonsoft.Json.Schema.JSchemaReaderException: Could not resolve
schema reference
'#/properties/TrackedItems/items/properties/TrackedItemName'.
For a data example, if I had the TrackedItems:
Item1, Item2, ItemA, ItemB, ItemC
And groups:
Group1:
Item1, ItemB, ItemC
Group2:
Item1, Item2, ItemZ
Group2 would throw a violation because it contains an item not defined in TrackedItems.
Being a vocabulary for validation (and certain other things described by trivial assertions), JSON Schema does not provide a way to verify the consistency of data.
Validation means assertions like "Verify that X is a string."
Consistency means things like "Verify that X is the ID of an existing, active user."
Since the data being compared might live in another database altogether, and since these sorts of assertions are non-trivial, JSON Schema leaves verifying the consistency of data up to the application and/or other technologies. Some implementations have vendor-specific extensions for intra-document comparisons; however, these are not standardized, and I'm not aware of any that would work here.
A $ref reference doesn't work here, as it's just a way to substitute in another schema by reference. If you can manage to get the reference to work (I'm not sure why you got an error; that's an implementation-specific detail), this schema:
{ "oneOf": [
{"$ref":"#/properties/TrackedItems/items/properties/TrackedItemName"}
] }
Is the exact same thing as saying:
{ "oneOf": [
{"type": "string"}
] }
Since you're asking "verify that one of the following statements is true" and there is only one statement, this is also the same as simply:
{"type": "string"}
This is not to say you can't declare relationships between data in JSON using JSON Schema, but JSON Schema is somewhat opinionated about using URIs and hyperlinks to do so.
I'm looking for some pointers on mapping a somewhat dynamic structure for consumption by Elasticsearch.
The raw structure itself is JSON, but the problem is that a portion of the structure uses variable keys, rather than the outer elements of the structure being static.
To provide a somewhat redacted example, my json looks like this:
"stat": {
"state": "valid",
"duration": 5,
},
"12345-abc": {
"content_length": 5,
"version": 2
}
"54321-xyz": {
"content_length": 2,
"version", 1
}
The first block is easy; Elasticsearch does a great job of mapping the "stat" portion of the structure, and if I were to dump a lot of that data into an index it would work as expected. The problem is that the next two blocks are essentially the same thing, but the raw JSON is formatted in such a way that a unique element has crept into the structure, and Elasticsearch wants to map that by default, generating a mapping that looks like this:
"stat": {
"properties": {
"state": {
"type": "string"
},
"duration": {
"type": "double"
}
}
},
"12345-abc": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
},
"54321-xyz": {
"properties": {
"content_length": {
"type": "double"
},
"version": {
"type": "double"
}
}
}
I'd like the ability to index all of the "content_length" data, but it's getting split up, and with some of the variable names in use, when I drop the data into Kibana I wind up with really long field names that are next to useless.
Is it possible to provide a generic tag for the structure? Or is this more simply addressed at the JSON generation phase, with our developers hard-coding a generic structure name and adding an identifier field?
Any insight / help greatly appreciated.
Thanks!
If those keys like 12345-abc are generated and can take a potentially unbounded set of values, it will be hard (if not impossible) to run useful queries or aggregations. It's not really clear what your exact use case for analyzing the data is, but you should probably have a look at nested objects (https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-objects.html) and generate your input JSON according to what you want to query for. You will likely get better aggregation results if you put these additional objects into an array, with a dedicated field holding what is currently your key.
{
"stat": ...,
"things": [
{
"thingkey": "12345-abc",
"content_length": 5,
"version": 2
},
...
]
}
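With that shape you could, for example, map things as a nested field and aggregate across all entries regardless of what the original key was. A sketch, assuming a recent Elasticsearch version; the index name is just a placeholder:
PUT my-metrics-index
{
  "mappings": {
    "properties": {
      "things": {
        "type": "nested",
        "properties": {
          "thingkey": { "type": "keyword" },
          "content_length": { "type": "long" },
          "version": { "type": "long" }
        }
      }
    }
  }
}

GET my-metrics-index/_search
{
  "size": 0,
  "aggs": {
    "all_things": {
      "nested": { "path": "things" },
      "aggs": {
        "by_key": {
          "terms": { "field": "things.thingkey" },
          "aggs": {
            "avg_content_length": { "avg": { "field": "things.content_length" } }
          }
        }
      }
    }
  }
}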
I'm trying to create a JSON schema for an existing JSON file that looks something like this:
{
"variable": {
"name": "age",
"type": "integer"
}
}
In the schema, I want to ensure the type property has the value string or integer:
{
"variable": {
"name": "string",
"type": {
"type": "string",
"enum": ["string", "integer"]
}
}
}
Unfortunately it blows up with message: ValidationError {is not any of [subschema 0]....
I've read that there are "no reserved words" in JSON Schema, so I assume a property named type is valid, as long as I declare it correctly?
The accepted answer from jruizaranguren doesn't actually answer the question.
The problem is that given JSON (not JSON schema, JSON data) that has a field named "type", it's hard to write a JSON schema that doesn't choke.
Imagine that you have an existing JSON data feed (data, not schema) that contains:
"ids": [ { "type": "SSN", "value": "123-45-6789" },
{ "type": "pay", "value": "8675309" } ]
What I've found in trying to work through the same problem is that instead of putting
"properties": {
"type": { <======= validation chokes on this
"type": "string"
}
you can put
"patternProperties": {
"^type$": {
"type": "string"
}
but I'm still working through how to mark it as a required field. It may not be possible.
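For what it's worth, my understanding (worth verifying against your validator) is that required is evaluated independently of properties and patternProperties, so combining the two might cover that part:
"patternProperties": {
  "^type$": {
    "type": "string"
  }
},
"required": ["type"]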
I think, based on looking at the "schema" in the original question, that JSON schemas have evolved quite a lot since then - but this is still a problem. There may be a better solution.
According to the specification, in the Valid types section for type:
The value of this keyword MUST be either a string or an array. If it is an array, elements of the array MUST be strings and MUST be unique.
String values MUST be one of the seven primitive types defined by the core specification.
Later, in Conditions for successful validation:
An instance matches successfully if its primitive type is one of the types defined by keyword. Recall: "number" includes "integer".
In your case:
{
"variable": {
"name": "string",
"type": ["string", "integer"]
}
}