JSON object with id as primary key: schema design

I want to use an ID as the primary key in a JSON object. This way all users in the list are unique.
Like so:
{
  "user": [{
    "id": 1,
    "name": "bob"
  }]
}
In an application, I then have to search all elements of the 'user' list to find a given id.
Alternatively, I can use the ID as a key to get direct access to a specific user.
Like so:
{
  "user": {
    "1": {
      "name": "bob"
    }
  }
}
In an application, I can now simply write user["3"] to get the correct user.
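For illustration, here is roughly how the two access patterns look in JavaScript (the data and variable names are made up for this example):

// Option 1: array of users - finding an id requires a linear scan
const data1 = { user: [{ id: 1, name: "bob" }, { id: 3, name: "alice" }] };
const byScan = data1.user.find(u => u.id === 3); // { id: 3, name: "alice" }

// Option 2: object keyed by id - lookup is a direct property access
const data2 = { user: { "1": { name: "bob" }, "3": { name: "alice" } } };
const byKey = data2.user["3"]; // { name: "alice" }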
What should I use? Are there any disadvantages to the second option? I'm sure there is a best practice.

It depends on what you want the objects to look like, how much processing you want to do on them, and how much data you have.
When dealing with web data you will often see the first format. If there is a lot of data, you will need to iterate through all records to find a matching id, because your data is an array. That lookup is often pushed down to the underlying data store anyway (e.g. a database), where it may already be indexed, so this may not be an issue. This format is clean and binds easily.
Your second option works best when you need efficient lookups: a dictionary of key-value pairs allows significantly faster access in large datasets. However, a key that looks numeric (even though you are forcing it to be a string) is not handled well by all libraries. You can prefix the id with an alphabetic character and simply add the prefix when doing a lookup. I have used k in this example, but choose a prefix that makes sense for your data. I use this format when storing objects in a database's binary JSON data type.
{
  "user": {
    "k1": {
      "name": "bob"
    }
  }
}
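A quick JavaScript sketch of the prefixed lookup described above (the helper name getUser is made up for this example):

const store = { user: { k1: { name: "bob" }, k2: { name: "carol" } } };

// Prepend the prefix before indexing into the map
function getUser(store, id) {
  return store.user["k" + id];
}

getUser(store, 1); // { name: "bob" }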

Related

How to store large JSON documents (>20MB) in MongoDB without using GridFS

I want to store a large document in MongoDB; however, these are the two ways I will interact with it:
I do frequent reads of that data and need to get a part of it using aggregations.
When I need to write to the document, I will build it from scratch again, i.e. remove the existing document and insert a new one.
Here is what a sample document looks like:
{
  "objects_1": [
    {}
  ],
  "objects_2": [
    {}
  ],
  "objects_3": [
    {}
  ],
  "policy_1": [
    {}
  ],
  "policy_2": [
    {}
  ],
  "policy_3": [
    {}
  ]
}
Here is how I want to access that data:
{
  "objects_1": [
    {}
  ]
}
If I was storing it in a conventional way, I would write a query like this:
db.getCollection('configuration').aggregate([
  { $match: { _id: "FAAAAAAAAAAAA" } },
  { $project: {
      "_id": 0,
      "a_objects": {
        $filter: {
          input: "$settings.a_objects",
          as: "arrayItem",
          cond: { $eq: [ "$$arrayItem.name", "objectName" ] }
        }
      }
  }}
])
However, since the size of the document is >16 MB, we can't save it directly to MongoDB. The size can be up to 50 MB.
Solutions I thought of:
I thought of storing the JSON data in GridFS and reading it as per the docs here: https://docs.mongodb.com/manual/core/gridfs/ . However, I would then need to read the entire file every time I want to look up only one object inside the large JSON blob, and I need to do such reads frequently, on multiple large documents, which would lead to high memory usage.
I thought of splitting the JSON into parts, storing each object in its own separate collection, and reassembling the JSON when I need to fetch the entire document.
How should I approach this problem? Is there something obvious that I am missing here?
I think your problem is that you're not using the right tools for the job, or not using the tools you have in the way they were meant to be used.
If you want to persist large objects as JSON, then I'd argue that a database isn't a natural choice for that, especially if the objects are large. I'd be looking at storage systems designed to do that well (say, if your solution is on Azure/AWS/GCP, see what specialist service they offer), or even just the file system if you run on a local server.
There's no reason why you can't have the JSON in a file and related data in a database. Yes, there are issues with that, but the limitations of MongoDB won't be among them.
I do frequent reads of that data and need to get a part of that data using aggregations
If you are doing frequent reads, and only for part of the data, then forcing your system to always read the whole record means you are just penalizing yourself. One option is to store the bits that are highly read in a way that doesn't incur the performance penalty of the full read.
Storing objects as JSON means you can change your program and data without having to worry about what the database looks like, which is convenient. But it also has its limitations. If you think you have hit those limitations, then now might be the time to consider a re-architecture.
I thought of splitting the JSON into parts, storing each object in its own separate collection, and reassembling the JSON when I need to fetch the entire document
That's definitely worth looking into. You just need to make sure that the different parts are not stored in the same table / rows, otherwise there'll be no improvement. Think carefully about how you split the objects up - think about the key scenarios the objects deal with, e.g. you mention reads. Designing the sub-objects to align with key scenarios is the way to go.
For example, if you commonly show an object's summary in a list of object summaries (e.g. search results), then the summary text, object name, id are candidates for object data that you would split out.
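As a rough mongo-shell sketch of that split, assuming a hypothetical configuration_parts collection keyed by the parent id and a section name:

// Store each top-level section of the big document as its own (hypothetical) part document
db.configuration_parts.insertMany([
  { parentId: "FAAAAAAAAAAAA", section: "objects_1", data: [ /* ... */ ] },
  { parentId: "FAAAAAAAAAAAA", section: "policy_1", data: [ /* ... */ ] }
]);

// Frequent reads only touch the section they need
db.configuration_parts.findOne({ parentId: "FAAAAAAAAAAAA", section: "objects_1" });

// Reassemble the full document only when it is really required
var parts = db.configuration_parts.find({ parentId: "FAAAAAAAAAAAA" }).toArray();
var full = parts.reduce(function (doc, p) { doc[p.section] = p.data; return doc; }, {});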

What is the best practice in a REST API: pass structured data or key-value pairs?

I have a data structure similar to the one given below, which I am supposed to process. I am designing an API which should accept a POST request similar to the one given below (ignore the headers, etc.).
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
Somehow I do not feel it's a correct way to pass/accept data. Here I am talking about structured data vs. key-value pairs. While I can extract the fields ("Name", "Id") directly into object attributes, for key-value pairs I need to loop through the collection and compare strings (e.g. "Nature of work") to extract values.
I searched a few sites looking for best practices but could not reach any conclusion. Are there any best practices, suggestions, etc.?
I don't think you are going to find any firm, evidence based arguments against including a list of key value pairs in your message schema. But that's the sort of thing to look for - people writing about message schema design, and how to design messages to support change, and so on.
As a practical matter, there's not a whole lot of difference between
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": [
    {
      "Header": "Country of origin",
      "Value": "England"
    },
    {
      "Header": "Nature of work",
      "Value": "Secret Agent/Spy"
    }
  ]
}
or
{
  "Name": "Johny English",
  "Id": "534dsf",
  "Message": {
    "Country of origin": "England",
    "Nature of work": "Secret Agent/Spy"
  }
}
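For what it's worth, converting from the first shape to the second is a one-liner in JavaScript, which is part of why the practical difference is small (example data taken from above):

// Array of { Header, Value } pairs -> plain object keyed by header
const pairs = [
  { Header: "Country of origin", Value: "England" },
  { Header: "Nature of work", Value: "Secret Agent/Spy" }
];
const asMap = Object.fromEntries(pairs.map(p => [p.Header, p.Value]));
// { "Country of origin": "England", "Nature of work": "Secret Agent/Spy" }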
In the early days of the world wide web, "everything" was key-value pairs, because it was easy to describe a collection of key-value pairs in such a way that a general-purpose component, like a web browser, could work with it (i.e. definitions of HTML forms). It got the job done.
It's usually good to structure your response data the same as what you'd expect the input of the corresponding POST, PUT, and PATCH endpoints to be. This means the consuming entity doesn't have to transform the data before altering a record. In that context, arrays of objects with "name"/"value" fields are much easier to write input validation for.
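As a small illustration of that last point, a validator for the array-of-pairs shape does not need to know the header names in advance (a minimal hand-rolled sketch, not tied to any particular validation library):

// Validate the generic "Message" array shape without knowing which headers will appear
function isValidMessage(message) {
  return Array.isArray(message) && message.every(function (entry) {
    return entry !== null &&
      typeof entry === "object" &&
      typeof entry.Header === "string" &&
      typeof entry.Value === "string";
  });
}

isValidMessage([{ Header: "Country of origin", Value: "England" }]); // true
isValidMessage([{ Country: "England" }]); // false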

Proper JSON format in NoSQL

I would like to keep a DB of nested groups (for example, a company hierarchy: a manager has managers who manage other managers who manage employees...).
How should one represent this structure as a JSON?
Should the name of each manager be a key and the managers below him be an object? (Assume each manager has a unique name)
{
  "manager1": {
    "sub_manager1": {
      ...
    },
    "sub_manager2": {
      ...
    }
  }
}
Or, should the JSON consist of "recursive objects", i.e, a key-value object where key is an identifier and value is an array of same, key-value objects?
In this case, the key-value pair would be called "name"-"employees".
{
  "name": "manager1",
  "employees": [
    {
      "name": "sub_manager1",
      "employees": [ ... ]
    },
    {
      "name": "sub_manager2",
      "employees": [ ... ]
    }
  ]
}
In the first example, each manager has a unique key (better performance on search?).
In the second example, all objects have the same keys (easier looping?).
In my view you should use the second approach.
Benefits:
It is more extensible. You can add more data to the manager entity later, if needed.
Looping is easy, since every value sits behind a named field.
It is closer to the real world, as manager names may or may not be unique.
You will not lose search performance, because you still have the key "name" and, as you say, the values are unique. Even if they were not unique, NoSQL databases generally store the range of values for a key on the same node.
When you ask for details about the manager(s) named "xyz", the search process is roughly as follows:
You hit the API.
A node receives the request.
The request is forwarded to the node(s) whose range "xyz" belongs to.
Only the data on those nodes is scanned, and the matching records are returned.
Also, in my view, the first approach creates as many keys as there are managers. Given the limited number of nodes, a single node will still be scanned if you look up "xyz" as a key.
You will get better performance with approach 2 if you search for "xyz" and "xyz1" in the same query: since the string values are close to each other, they will mostly end up on the same node. In the first approach there is less chance of them landing on the same node, because they are entirely different keys and are not treated as neighbours.
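As a small JavaScript sketch of working with the second structure (the function name findByName is made up for this example):

// Recursively search the "name"/"employees" tree for a manager by name
function findByName(node, target) {
  if (node.name === target) return node;
  for (const child of node.employees || []) {
    const hit = findByName(child, target);
    if (hit) return hit;
  }
  return null;
}

const tree = {
  name: "manager1",
  employees: [
    { name: "sub_manager1", employees: [] },
    { name: "sub_manager2", employees: [] }
  ]
};

findByName(tree, "sub_manager2"); // { name: "sub_manager2", employees: [] }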

Return a field as object or as primitive type in JSON in a REST API?

Currently I'm working on a REST API with an object that has a status. Should I return the status as a string or as an object?
When is it smart to change from field being a primitive type to a field being an object?
[
  {
    "id": 1,
    "name": "Hello",
    "status": "active"
  },
  {
    "id": 1,
    "name": "Hello",
    "status": {
      "id": 0,
      "name": "active"
    }
  }
]
In terms of extensibility I would suggest going for an object.
Using an object also has the advantage of separating the responsibility of identifying (e.g. via an id field) and describing (e.g. via a name or description field), in your case, a status.
If i18n becomes necessary, the object would also need to carry a string identifier alongside the translated name.
All these things are not possible with simple primitives. Conclusion: go for an object.
Other interesting remarks are given here.
It depends on what you need to pass.
If you only want to distinguish between different states and have all other related information (strings, translations, images) on the client either way, you might only want to send a simple integer value and use an enum on the client side. This reduces the data to the smallest amount.
If you have data that changes within one status on the server side, you need an object to pass everything else.
But best practice here would be to reduce data as much as possible.
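A small sketch of the "plain value plus client-side enum" variant mentioned above (the codes and labels are made up for this example):

// Client-side lookup table from a compact status code to display data
const STATUS = Object.freeze({
  0: { key: "active", label: "Active" },
  1: { key: "inactive", label: "Inactive" }
});

// The API response only carries the small value
const response = { id: 1, name: "Hello", status: 0 };
const statusInfo = STATUS[response.status]; // { key: "active", label: "Active" }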

Schemaless Support for Elasticsearch Queries

Our REST API allows users to add custom schemaless JSON to some of our REST resources, and we need it to be searchable in Elasticsearch. This custom data and its structure can be completely different across resources of the same type.
Consider this example document:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
All fields except customData adhere to a schema. customData is always a JSON Object, but all the fields and values within that Object can vary dramatically from resource to resource. There is no guarantee that any given field name or value (or even value type) within customData is the same across any two resources as users can edit these fields however they wish.
What is the best way to support search for this?
We thought a solution would be to simply not create any mapping for customData when the index is created, but then it becomes unqueryable (which is contrary to what the ES docs say). This would be the ideal solution if queries on non-mapped properties worked and there were no performance problems with that approach. However, after running multiple tests we haven't been able to get it to work.
Is this something that needs any special configuration? Or are the docs incorrect? Some clarification as to why it is not working would be greatly appreciated.
Since this is not currently working for us, we’ve thought of a couple alternative solutions:
Reindexing: this would be costly as we would need to reindex every index that contains that document and do so every time a user updates a property with a different value type. Really bad for performance, so this is likely not a real option.
Use multi-match query: we would do this by appending a random string to the customData field name every time there is a change in the customData object. For example, this is what the document being indexed would look like:
{
  "givenName": "Joe",
  "username": "joe",
  "email": "joe@mailinator.com",
  "customData_03ae8b95-2496-4c8d-9330-6d2058b1bbb9": {
    "favoriteColor": "red",
    "someObject": {
      "someKey": "someValue"
    }
  }
}
This means ES would create a new mapping for each ‘random’ field, and we would use phrase multi-match query using a "starts with" wild card for the field names when performing the queries. For example:
curl -XPOST 'eshost:9200/test/_search?pretty' -d '
{
  "query": {
    "multi_match": {
      "query": "red",
      "type": "phrase",
      "fields": ["customData_*.favoriteColor"]
    }
  }
}'
This could be a viable solution, but we are concerned that having too many mappings like this could affect performance. Are there any performance repercussions for having too many mappings on an index? Maybe periodic reindexing could alleviate having too many mappings?
This also just feels like a hack and something that should be handled by ES natively. Am I missing something?
Any suggestions about any of this would be much appreciated.
Thanks!
You're correct that Elasticsearch is not truly schemaless. If no mapping is specified, Elasticsearch infers field type primitives based upon the first value it sees for that field. Therefore your non-deterministic customData object can get you in trouble if you first see "favoriteColor": 10 followed by "favoriteColor": "red".
For your requirements, you should take a look at SIREn Solutions Elasticsearch plugin which provides a schemaless solution coupled with an advanced query language (using Twig) and a custom Lucene index format to speed up indexing and search operations for non-deterministic data.
Fields with the same mapping are stored as the same Lucene field in the Lucene index (an Elasticsearch shard). Each distinct Lucene field has its own inverted index (term dictionary and index entries) and its own doc values. Lucene is highly optimized to store documents that share fields in a compressed way, so using a mapping with a different field for each document prevents Lucene from applying that optimization.
You should use Elasticsearch nested documents to search efficiently. The underlying technology is Lucene BlockJoin, which indexes parent and child documents together as a single document block.
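One common way to combine nested documents with arbitrary customData is to flatten it into an array of key/value entries and map that array as nested. Below is a sketch of the mapping, an indexed document, and a query, written as JavaScript object literals that would be sent as the request bodies to the usual mapping and search endpoints; the index layout and field names here are assumptions for illustration, not something from the question.

// Mapping for the index: customData becomes an array of nested { key, value } entries
const mapping = {
  properties: {
    customData: {
      type: "nested",
      properties: {
        key: { type: "keyword" },
        value: { type: "keyword" }
      }
    }
  }
};

// Document as indexed: { "favoriteColor": "red", ... } is flattened into entries
const doc = {
  username: "joe",
  customData: [
    { key: "favoriteColor", value: "red" },
    { key: "someObject.someKey", value: "someValue" }
  ]
};

// Search body: find documents whose customData contains favoriteColor = red
const query = {
  query: {
    nested: {
      path: "customData",
      query: {
        bool: {
          must: [
            { term: { "customData.key": "favoriteColor" } },
            { term: { "customData.value": "red" } }
          ]
        }
      }
    }
  }
};

With this layout the mapping never changes when users add new custom fields, at the cost of losing per-field value types (everything is indexed as a keyword in this sketch).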