Schemaless Support for Elasticsearch Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to simply not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with the approach. However, after running multiple tests, we haven't been able to get queries on non-mapped properties to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
This means ES would create a new mapping for each 'random' field, and we would use a phrase multi-match query with a "starts with" wildcard for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "red",
      "type": "phrase",
      "fields": ["customData_*.favoriteColor"]
    }
  }
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!

You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
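As an illustration (a minimal sketch; the index name test, type name resource, and host eshost are assumptions chosen to match the examples above), dynamic mapping locks in the type of the first value it sees, so a later document whose value has a different type is rejected:

# First document: favoriteColor gets dynamically mapped as a numeric field
curl -XPOST 'eshost:9200/test/resource/1' -d '
{ "customData": { "favoriteColor": 10 } }'

# Second document: the string "red" cannot be parsed into the numeric field,
# so Elasticsearch returns a mapping/parsing error
curl -XPOST 'eshost:9200/test/resource/2' -d '
{ "customData": { "favoriteColor": "red" } }'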
For your requirements, you should take a look at the SIREn Solutions Elasticsearch plugin, which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.

Fields with the same mapping are stored as the same Lucene field in the Lucene index (one per Elasticsearch shard). Each distinct Lucene field gets its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized for storing many documents that share the same fields in a compressed way; mapping a different field for each document prevents Lucene from applying those optimizations.
You should use Elasticsearch nested documents to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent/child documents as a single document block.
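One common way to apply this (a sketch only; the customAttributes field, the resource type name, and the pre-5.x not_analyzed string syntax are assumptions, not something from the question) is to flatten customData into an array of key/value pairs under a fixed nested mapping, so the set of Lucene fields never grows no matter what users store:

# Create the index with a fixed nested mapping for the key/value pairs
curl -XPUT 'eshost:9200/test' -d '
{
  "mappings": {
    "resource": {
      "properties": {
        "customAttributes": {
          "type": "nested",
          "properties": {
            "key":   { "type": "string", "index": "not_analyzed" },
            "value": { "type": "string" }
          }
        }
      }
    }
  }
}'

# Query for documents whose customData has favoriteColor = red
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "nested": {
      "path": "customAttributes",
      "query": {
        "bool": {
          "must": [
            { "term":  { "customAttributes.key":   "favoriteColor" } },
            { "match": { "customAttributes.value": "red" } }
          ]
        }
      }
    }
  }
}'

The trade-off is that every value is indexed as a string (or you maintain one typed value field per primitive type), but the mapping stays constant and each document only pays for the pairs it actually contains.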

Related

JSON object with id as primary key schema design

I want to use an ID as the primary key in a JSON object. This way all users in the list are unique.
Like so:
{
  "user": [{
    "id": 1,
    "name": "bob"
  }]
}
In an application, I have to search for the id in all elements of the list 'user'.
But I can also use the ID as an index to get easier access to a specific user.
Like so:
{
  "user": {
    "1": {
      "name": "bob"
    }
  }
}
In an application, I can now simply write user["3"] to get the correct user.
What should I use? Are there any disadvantages to the second option? I'm sure there is a best practice.
It depends on what format you want objects to look like, how much processing you want to do on your objects and how much data you have.
When dealing with web data you will often see the first format. If there is a lot of data, you will need to iterate through all records to find a matching id, because your data is an array. Often that lookup is delegated to the lower-level data store anyway (e.g. a database with an index on id), so this may not be an issue. This format is clean and binds easily.
Your second option works best when you need efficient lookups, since a dictionary of key-value pairs allows significantly faster access in large datasets. Using a numeric key (even though you are forcing it to be a string) is not supported by all libraries, though. You can prefix your id with an alphabetic value and then simply add the prefix when doing a lookup. I have used "k" in this example, but you can choose any prefix that makes sense for your data. I use this format when storing objects as a JSON binary data type in databases.
{
  "user": {
    "k1": {
      "name": "bob"
    }
  }
}

How to easily change a recurring property name in multiple schemas?

To be able to deserialize polymorphic types, I use a type discriminator across many of my JSON objects. E.g., { "$type": "SomeType", "otherProperties": "..." }
For the JSON schemas of concrete types, I specify a const value for $type.
{
  "type": "object",
  "properties": {
    "$type": { "const": "SomeType" },
    "otherProperties": { "type": "string" }
  }
}
This works, but distributes the chosen "$type" property name throughout many different JSON schemas. In fact, we are considering renaming it to "__type" to play more nicely with BSON.
Could I have prevented having to rename this property in all affected schemas?
I tried searching for a way to load the property name from elsewhere. As far as I can tell $ref only works for property values.
JSON Schema has no ability to dynamically load key names from another location the way you are asking, specifically because the value will differ per schema and you want only the key to be loaded from elsewhere.
While you can't do this with JSON Schema, you could use a templating tool such as Jsonnet. I've seen this work well at scale.
This would require you have a pre-processing step, but it sounds like that's something you're planning for already, creating some sort of pipeline to generate your schemas.
A word of warning, watch out for existing schema generation tooling. It is often only good for scaffolding, and requires lots of modifications. It sounds like you're building your own, which is likely a better approach.
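As a rough sketch of the Jsonnet route (the file names and layout here are assumptions, not an established convention), the discriminator key can live in a single file and be spliced into each schema as a computed field name:

// typeKey.libsonnet -- the one place the discriminator name is defined
'$type'

// someType.schema.jsonnet
local typeKey = import 'typeKey.libsonnet';

{
  type: 'object',
  properties: {
    // computed field name: renaming "$type" to "__type" becomes a one-file change
    [typeKey]: { const: 'SomeType' },
    otherProperties: { type: 'string' },
  },
}

Running jsonnet over the *.schema.jsonnet files in your pre-processing step then emits plain JSON Schema documents, and the rename touches only typeKey.libsonnet.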

What is the best practice in REST-Api, to pass structured data or key-value pair?

I have a data structure similar to the one given below, which I am supposed to process. I am designing an API which should accept a POST request like this (ignore the headers, etc.):
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
Somehow I do not feel this is a correct way to pass/accept data. Here I am talking about structured data vs. key-value pairs. While I can extract the fields ("Name", "Id") directly into object attributes, for the key-value pairs I need to loop through the collection and compare strings (e.g. "Nature of work") to extract values.
I searched a few sites looking for best practices but could not reach any conclusion. Are there any best practices, suggestions, etc.?
I don't think you are going to find any firm, evidence-based arguments against including a list of key-value pairs in your message schema. But that's the sort of thing to look for: people writing about message schema design, how to design messages to support change, and so on.
As a practical matter, there's not a whole lot of difference between
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
and
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": {
    "Country of origin": "England",
    "Nature of work": "Secret Agent/Spy"
  }
}
In the early days of the World Wide Web, "everything" was key-value pairs, because it was easy to describe a collection of key-value pairs in such a way that a general-purpose component, like a web browser, could work with it (i.e., HTML form definitions). It got the job done.
It's usually good to structure your response data the same as what you'd expect the input of the corresponding POST, PUT, and PATCH endpoints to be. That way, altering a record does not require the consuming entity to transform the data first. In that context, an array of objects with "name"/"value" fields is much easier to write input validation for.
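For instance (a sketch, not part of the original answer), one JSON Schema fragment validates the array-of-pairs format no matter which headers a client sends:

{
  "type": "object",
  "properties": {
    "Name": { "type": "string" },
    "Id": { "type": "string" },
    "Message": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "Header": { "type": "string" },
          "Value": { "type": "string" }
        },
        "required": ["Header", "Value"]
      }
    }
  }
}

With the map form, by contrast, the property names themselves are data, so generic validation is limited to something like additionalProperties: { "type": "string" }.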

Can I return document data after an FTS match?

Suppose I have this data:
{
  "test": "Testing1234",
  "false": "Falsify"
}
And then using curl, I write this query:
{"explain": true, "fields": [ "*" ], "highlight": {}, "query": { "query": "Testing"}}
I get a response from Couchbase. This includes the document id, as well as a locations object that returns details about where my query matched text in the document, including the parent object. All useful information.
However, I do not receive any additional context. For instance, say I have 100 documents with "test": "TestingXXXX", where XXXX is a random string. My search will not give me XXXX, nor does it provide any way to read other fields in the same document (for instance, if I wanted to fetch the "false" property). I will simply get 100 different document IDs to query. The response is technically enough to obtain all the needed information, but it results in me making 100 additional requests based on info parsed from the original response.
Is there any way to return context with FTS matches when using the REST API, without simply querying every document that is matched?
You can get the complete objects by issuing the FTS query from within N1QL using the CURL() function, and then joining that up with the objects themselves.
https://developer.couchbase.com/documentation/server/current/n1ql/n1ql-language-reference/curl.html
Your query would have roughly this form:
SELECT *
FROM yourTable
USE KEYS CURL(ftsURL, ftsQuery, ...)
You'll need to wrap the CURL function in some transformation functions to turn the FTS result into an array of ids.
I realize this is quite schematic, since I don't have a full example handy, but work up through these steps (a rough end-to-end sketch follows the list):
Issue the FTS query through CURL() in N1QL.
Transform the FTS results into an array of ids.
Embed the request for the array of ids into a SELECT query using USE KEYS.
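Putting those steps together, a sketch might look roughly like this (the FTS endpoint, index name yourFtsIndex, bucket name, and credentials are all assumptions; the option names follow the CURL() documentation linked above):

SELECT *
FROM yourTable
USE KEYS ARRAY hit.id FOR hit IN
  CURL("http://localhost:8094/api/index/yourFtsIndex/query",
       { "data": '{"query": {"query": "Testing"}, "size": 100}',
         "header": "Content-Type: application/json",
         "user": "Administrator:password" }).hits
END;

The ARRAY ... FOR ... IN ... END comprehension is what turns the FTS hits into the array of document keys that USE KEYS expects.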
I figured it out. It's not an issue with the query; the fields were not being indexed. To fix it, I changed the index setting "Store Dynamic Fields" to "True". That said, the highlighting did return a lot of extra detail, and I'm sure it also increases query times quite a bit. The Couchbase documentation seemed to imply it is only used for debugging, so I would like to leave this open in case anyone has further suggestions.

MySQL: Embedded JSON vs table

I'm designing the database schema for a video production project management app and struggling with how to persist some embedded, but not repeatable data. In the few CS courses I took, part of normalizing a relational database was identifying repeatable blocks and encapsulating them into their own table. What if I have a block of embedded/nested data that I know is likely to be unique to the record?
Example: A video record has many shoot_locations. Those locations are most likely never to be repeated. shoot_locations can also contain multiple shoot_times. Represented in JSON, this might look like:
{
  video: {
    shoot_locations: [
      {
        name: "Bob's Pony Shack",
        address: "99 Horseman Street, Anywhere, US 12345",
        shoot_times: {
          shoot_at: "2015-08-15 21:00:00",
          ...
        }
      },
      {
        name: "Jerry's Tackle",
        address: "15 Pike Place, Anywhere, US 12345",
        shoot_times: {
          shoot_at: "2015-08-16 21:00:00",
          ...
        }
      }
    ],
    ...
  }
}
Options...
store the shoot_locations in a JSON field (available in MySQL 5.7.8?)
create a separate table for the data.
something else?
I get the sense I should split embedded data into its own tables and save JSON for non-crucial metadata.
Summary
What's the best option to store non-repeating embedded data?
ONE of the reasons for normalizing a database is to reduce redundancy (your "repeatable blocks").
ANOTHER reason is to allow "backwards" querying. If you wanted to know which video was shot at "15 Pike Place", your JSON solution would fail (you'd have to resort to sequentially reading and decoding JSON, which defeats the purpose of an RDBMS).
Good rules of thumb:
Structured data - put in tables and columns
Data that might be part of query conditions - put in tables and columns
Unstructured data you know you'll never query by - put into BLOBs, XML or JSON fields
If in doubt, use tables and columns. You might have to spend some extra time initially, but you will never regret it. People have regretted their choice for JSON fields (or XML, for that matter) again and again and again. Did I mention "again"?
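As a sketch of the tables-and-columns route (table and column names are assumptions based on the example, not a prescribed schema):

CREATE TABLE videos (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL
);

CREATE TABLE shoot_locations (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  video_id BIGINT UNSIGNED NOT NULL,
  name VARCHAR(255) NOT NULL,
  address VARCHAR(255) NOT NULL,
  FOREIGN KEY (video_id) REFERENCES videos (id)
);

CREATE TABLE shoot_times (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  shoot_location_id BIGINT UNSIGNED NOT NULL,
  shoot_at DATETIME NOT NULL,
  FOREIGN KEY (shoot_location_id) REFERENCES shoot_locations (id)
);

-- "Backwards" querying then becomes a simple join:
SELECT v.*
FROM videos v
JOIN shoot_locations sl ON sl.video_id = v.id
WHERE sl.address LIKE '15 Pike Place%';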