MongoDB best practice to store data - JSON

I've read some of the MongoDB documentation, but I wasn't able to find an answer to my question.
I'm developing an application where I want to store JSON documents. I've read about indexes and so on, but one question remains.
The data I want to store contains information that does not need to be loaded by the client as a whole. So I planned to normalize the data, split my big JSON document into smaller ones, and offer each of them through a separate REST endpoint.
Now I was thinking about creating a different collection for each group of JSON documents.
The reason is that I want to reduce the search space compared to storing everything in one collection.
So each user will have 5 collections, and I expect 1 million users.
Is this a good solution in terms of performance and scaling?
Is querying multiple collections more expensive than querying one?

Recently, while working on a project, my team and I faced this situation: we had a huge data set that was expected to grow rapidly.
We had MongoDB in place, and as the data grew, performance started to degrade. The main reason was that the data was spread over multiple collections, so we had to use lookups to join the collections and get the data.
Interestingly, the way the two collections are mapped to each other plays a very important role in performance.
Our initial structure was:
Collection A {
"_id" : ...,
"info" : [
// list of object id of other collection
]
}
The info field was used to map to the "_id" of documents in Collection B.
Since MongoDB uses _id as the unique identifier, no matter what indexes we had, it scanned all documents of Collection B, and if B holds gigabytes or terabytes of data, it takes very long to get even one matching document.
So we made the following change:
We removed the array of object ids from Collection A and added a new field to Collection B that holds the _id of the corresponding document in Collection A.
Long story short, we reversed the mapping.
We then created an index on the Collection B fields used in the query. This improved performance a lot.
So having multiple collections is not a bad idea: with a proper mapping between collections, MongoDB can provide excellent performance. You can also use sharding to improve it further.
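The reversed mapping described above can be illustrated with a minimal plain-Python sketch. It does not use the MongoDB API: the collections are modeled as lists of dicts, the field name a_id and the helper names are hypothetical, and the dict returned by build_index only stands in for an index on the reference field in Collection B.

```python
from collections import defaultdict

# Reversed layout: each Collection B document carries the _id of its parent
# document in Collection A (the field name "a_id" is an assumption).
collection_a = [{"_id": "a1"}, {"_id": "a2"}]
collection_b = [
    {"_id": "b1", "a_id": "a1", "value": 10},
    {"_id": "b2", "a_id": "a1", "value": 20},
    {"_id": "b3", "a_id": "a2", "value": 30},
]


def build_index(docs, field):
    """Group documents by the value of `field`, like a secondary index."""
    index = defaultdict(list)
    for doc in docs:
        index[doc[field]].append(doc)
    return index


# Stand-in for an index on Collection B's reference field.
b_by_a_id = build_index(collection_b, "a_id")


def children_of(a_id):
    """Return the B documents of one A document without scanning all of B."""
    return b_by_a_id.get(a_id, [])


print([doc["_id"] for doc in children_of("a1")])
```

The point of the sketch is only that the lookup goes through a keyed structure on the child side instead of walking every document in B.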

Related

searching Mysql table with Elasticsearch

Let's say I have the following "expenses" MySQL table:
id | amount | vendor | tag
----------------------------
1  | 100    | google | foo
2  | 450    | GitHub | bar
3  | 22     | GitLab | fizz
4  | 75     | AWS    | buzz
I'm building an API that should return expenses based on partial "vendor" or "tag" filters, so vendor="Git" should return records 2 and 3, and tag="zz" should return records 3 and 4.
I was thinking of using Elasticsearch for this, but I'm not sure what the correct approach is.
most articles I read suggest replicating the table records (using logstash pipe or other methods) to elastic index.
So my API doesn't even query the DB and return an array of documents directly from ES?
Is this considered good practice? replicating the whole table to elastic?
What about table relations... What If I want to filter by nested table relation?...
So my API doesn't even query the DB and return an array of documents
directly from ES?
Yes. Since you are querying Elasticsearch, you will get results only from Elasticsearch. Another way is to get just the ids from Elasticsearch and use them to retrieve the documents from MySQL, but this might impact response time.
Is this considered good practice? replicating the whole table to
elastic? What about table relations... What If I want to filter by
nested table relation?...
It is not about good or bad practice. It is all about what type of functionality and use case you want to implement; based on that, you choose the technology stack and decide whether data can be duplicated. Lots of companies use Elasticsearch as a secondary data source with duplicated data, simply because their use case is best served by Elasticsearch or another NoSQL database.
Elasticsearch is a NoSQL database and does not maintain any relationships between data. Hence, you need to denormalize your data before indexing it into Elasticsearch. You can read this article for more about denormalization and why it is required.
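As a rough illustration of the denormalization step, here is a minimal Python sketch that flattens a related row into each document before indexing. The vendors table, its columns, and the function name are hypothetical; they are not part of the question's schema, and the indexing call itself is left out.

```python
# Hypothetical relational rows: "expenses" plus a related "vendors" table.
expenses = [
    {"id": 2, "amount": 450, "vendor_id": 10, "tag": "bar"},
    {"id": 3, "amount": 22, "vendor_id": 11, "tag": "fizz"},
]
vendors = {
    10: {"name": "GitHub", "category": "saas"},
    11: {"name": "GitLab", "category": "saas"},
}


def denormalize(expense, vendors_by_id):
    """Copy the related vendor fields into the expense document itself."""
    vendor = vendors_by_id[expense["vendor_id"]]
    return {
        "id": expense["id"],
        "amount": expense["amount"],
        "tag": expense["tag"],
        "vendor_name": vendor["name"],
        "vendor_category": vendor["category"],
    }


# Each document is now self-contained and can be filtered without a join.
documents = [denormalize(expense, vendors) for expense in expenses]
print(documents[0])
```

The trade-off is that a change to a vendor row has to be propagated to every document that embeds it.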
Elasticsearch provides the nested and join data types for parent-child relationships, but both have some limitations and a performance impact.
Below is what the documentation says about the join field type:
The join field shouldn’t be used like joins in a relation database. In
Elasticsearch the key to good performance is to de-normalize your data
into documents. Each join field, has_child or has_parent query adds a
significant tax to your query performance. It can also trigger global
ordinals to be built.
Below is what the documentation says about the nested field type:
When ingesting key-value pairs with a large, arbitrary set of keys,
you might consider modeling each key-value pair as its own nested
document with key and value fields. Instead, consider using the
flattened data type, which maps an entire object as a single field and
allows for simple searches over its contents. Nested documents and
queries are typically expensive, so using the flattened data type for
this use case is a better option.
most articles I read suggest replicating the table records (using
logstash pipe or other methods) to elastic index.
Yes, you can use Logstash or any language client (Java, Python, etc.) to sync data from the DB to Elasticsearch. You can check this SO answer for more information on this.
Your Search Requirements
If you go ahead with Elasticsearch, you can use the N-gram tokenizer or a regexp query to meet your search requirements.
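To see why n-grams cover the partial-match requirement, here is a minimal plain-Python sketch of an n-gram inverted index over the example table. It is not Elasticsearch code; the gram sizes (2 to 3), the lowercasing, and all function names are assumptions chosen for illustration.

```python
from collections import defaultdict

EXPENSES = [
    {"id": 1, "amount": 100, "vendor": "google", "tag": "foo"},
    {"id": 2, "amount": 450, "vendor": "GitHub", "tag": "bar"},
    {"id": 3, "amount": 22, "vendor": "GitLab", "tag": "fizz"},
    {"id": 4, "amount": 75, "vendor": "AWS", "tag": "buzz"},
]

MIN_GRAM, MAX_GRAM = 2, 3  # assumed gram sizes


def ngrams(text, n):
    """Return the set of lowercase substrings of length n."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(rows, field):
    """Map every 2- and 3-gram of `field` to the ids of the rows containing it."""
    index = defaultdict(set)
    for row in rows:
        for n in range(MIN_GRAM, MAX_GRAM + 1):
            for gram in ngrams(row[field], n):
                index[gram].add(row["id"])
    return index


def search(index, rows, field, query):
    """Find rows whose `field` contains `query`, using the index for candidates."""
    q = query.lower()
    if len(q) < MIN_GRAM:
        raise ValueError("query is shorter than the smallest indexed gram")
    grams = ngrams(q, min(len(q), MAX_GRAM))
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # Verify candidates, since sharing all grams does not guarantee a substring match.
    return sorted(r["id"] for r in rows if r["id"] in candidates and q in r[field].lower())


vendor_index = build_index(EXPENSES, "vendor")
tag_index = build_index(EXPENSES, "tag")
print(search(vendor_index, EXPENSES, "vendor", "Git"))
print(search(tag_index, EXPENSES, "tag", "zz"))
```

With the sample rows, the two searches return the ids asked for in the question (2 and 3 for "Git", 3 and 4 for "zz").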
Maybe you can try TiDB: https://medium.com/@shenli3514/simplify-relational-database-elasticsearch-architecture-with-tidb-c19c330b7f30
If you want to scale your MySQL and have fast filtering and aggregating, TiDB could simplify the architecture and reduce development work.

Rails - JSON column vs separate table

I'm currently working on a Ruby on Rails project in which I have objects with an association to instructions: each object can have zero or more instruction objects that hold some basic data, like title, data (string), and position (for ordering them in the UI). I tried looking this up on Google but found no relevant answer. The instructions are specific to each object and shouldn't be used for lookup or search of any kind, so I figured I should store them as JSON within the object's own table instead of making a join table. My reasoning is that the join table would explode when there are many objects, and because of that, querying for each object's instructions would take longer over time. Is that a reasonable concern that justifies storing this data as JSON instead of using a has_many association?
Think of using JSON in an RDBMS as a form of denormalization. There are legitimate reasons to use denormalization, but you must keep in mind that it always optimizes for one type of query at the expense of other types of queries.
For example, in this case you could query your object and it would include the JSON document containing all instructions. But if you wanted to search for a specific instruction, it would be quite complex to search for the row that has a JSON document containing that instruction. Have you thought about how you would query that?
Using normalized database design, i.e. the join table you mention, allows for more flexibility in queries. You can query the object table, or you can query the instruction table. Either way, you then simply join to the other table to get the corresponding rows.
The way to make this more optimized is to use indexes on the columns you want to search. See my presentation How to Design Indexes, Really or the video.
Using JSON creates a lot of complexity that you probably haven't considered. See my presentation How to Use JSON in MySQL Wrong.
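The indexing advice above can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the Rails database, and the table, column, and index names are assumptions. The query plan shows the per-object lookup going through the index rather than scanning the whole instructions table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE objects (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE instructions (
        id INTEGER PRIMARY KEY,
        object_id INTEGER NOT NULL REFERENCES objects(id),
        title TEXT,
        data TEXT,
        position INTEGER
    );
    -- Index on the foreign key plus the ordering column.
    CREATE INDEX idx_instructions_object ON instructions (object_id, position);
""")
conn.execute("INSERT INTO objects (id, name) VALUES (1, 'first'), (2, 'second')")
conn.executemany(
    "INSERT INTO instructions (object_id, title, data, position) VALUES (?, ?, ?, ?)",
    [(1, "unpack", "...", 1), (1, "assemble", "...", 2), (2, "plug in", "...", 1)],
)

query = "SELECT title FROM instructions WHERE object_id = ? ORDER BY position"
titles = [row[0] for row in conn.execute(query, (1,))]

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (1,)))
print(titles)
print(plan)
```

The exact plan text differs between SQLite versions and will look different again in MySQL or PostgreSQL, but the idea carries over: with an index on the foreign key, fetching one object's instructions does not depend on scanning the rows of every other object.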

Store "extended" metadata on entities stored in Azure Cosmos DB as JSON documents

We are building a REST API in .NET deployed to Azure App Service / Azure API App. From this API, clients can create "Products" and query "Products". The product entity has a set of common fields that all clients have to provide when creating a product, like the fields below (example):
{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product"
}
We store these products currently as self-contained documents in Azure Cosmos DB.
Question 1: Partitioning.
The collection will not store a huge number of documents; we are talking about a maximum of around 2 500 000 documents of 1 - 5 KB each (estimates). We have currently chosen the id field (which is our system-generated id, not the internal Cosmos DB document id) as the partition key, which means 2 500 000 logical partitions with one document in each. The documents will be used in some low-latency workloads, but these workloads will query by id (the partition key). Clients will also query by e.g. name, and then we have a fan-out query, but those queries will not be latency-critical. In the portal, you can't create a single-partition collection anymore, but you can do it from the SDK or use a fixed partition key value. If we have all these documents in one single partition (we are talking about data far below 10 GB here), we will never get any fan-out queries, but will rely more on the index within the one logical partition. So the question: even if we don't have huge amounts of data, is it still wise to partition the way we currently do?
Question 2: Extended metadata.
We will face clients that want to write client/application/customer-specific metadata beyond the basic common fields. What is the best way to do this?
Some brainstorming from me below.
1: Just dump everything in one self-contained document.
One option is to allow clients of the API to add a nested "extendedMetadata" field with key-value pairs when creating a product. Cosmos DB is schema-agnostic, so in theory this should work fine. Some products can have zero extended metadata, while other products can have a lot. For the clients, we can promise the basic common fields, but for the extended metadata field we cannot promise anything in terms of number of fields, naming etc. The document size will then vary. As mentioned, these products will still be used in latency-critical workloads that query by "id" (the partition key). The extended metadata will never be used in any latency-critical workloads. How, and how much, does document size generally affect performance / throughput? For the latency-critical read scenario, will the query optimizer go straight to the right partition and then use the index to quickly retrieve the document fields of interest? Or will the whole document always be loaded and processed, regardless of which fields you want to query?
{
"id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
"name": "Product One",
"description": "The number one product",
"extendedMetadata": {
"prop1": "prop1",
"prop2": "prop2",
"propN": "propN"
}
}
The extended metadata is only useful to retrieve from the same API in certain situations. We can then do something like:
api.org.com/products/{id} -- will always return a product with the basic common fields
api.org.com/products/{id}/extended -- will return the full document (basic + extended metadata)
2: Split the document
One option might be to do some kind of splitting. If a client creates a product through the API that contains extended metadata, we can implement some logic that splits the document. I guess the split can be done in many ways; brainstorming below. I guess the main objective of splitting the documents (which requires more work on write operations) is to get better throughput in case the document size plays a significant role here (in most cases, the clients will be OK with the basic common fields).
One basic document that only contains the basic common fields, and one extended document (with the same id) that contains the basic common fields + extended metadata (duplication of the basic common fields). We can add a "type" field that differentiates between the basic and extended document. If a client asks for extended, we will only query documents of type "extended".
One basic document that only contains the basic common fields + a reference to an extended document that only contains the extended metadata. This means a read operation where the client asks for a product with extended metadata requires reading two documents.
Look into splitting it into different collections: one collection holds the basic documents with throughput dedicated to low-latency read scenarios, and one collection holds the extended metadata.
Sorry for the long post. I hope this was understandable; looking forward to your feedback!
Answer 1:
If you can guarantee that the documents' total size will never be more than 10 GB, then creating a fixed collection is the way to go, for two reasons.
First, there is no need for a cross-partition query. I'm not saying it will be lightning fast without partitioning, but because you are only interacting with a single physical partition, it will be faster than going into every single physical partition looking for data.
(Keep in mind, however, that every time people think that they can guarantee things like the max size of something, it usually doesn't work out.)
The /id partitioning strategy is only efficient if you can ALWAYS provide the id. This is called a read. If you need to search by any other property, you are performing a query. This is where the system wouldn't do so well.
Ideally, you should design your Cosmos DB collection in a way that you never do a cross-partition query as part of your everyday workload. Maybe once in a blue moon for reporting reasons.
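The difference between a read by id and a query on another property can be pictured with a toy model. This is a plain-Python sketch, not the Cosmos DB SDK: the class, the CRC32-based placement, and the partition count are all assumptions made only to count how many partitions each operation has to visit.

```python
import zlib


class ToyPartitionedStore:
    """Toy model: documents are hashed by their id onto a fixed set of partitions."""

    def __init__(self, partition_count):
        self.partitions = [dict() for _ in range(partition_count)]
        self.partitions_touched = 0

    def _partition_for(self, key):
        # Deterministic placement of a partition key value.
        return self.partitions[zlib.crc32(key.encode("utf-8")) % len(self.partitions)]

    def upsert(self, doc):
        self._partition_for(doc["id"])[doc["id"]] = doc

    def read(self, doc_id):
        """Lookup by partition key: exactly one partition is visited."""
        self.partitions_touched = 1
        return self._partition_for(doc_id).get(doc_id)

    def query(self, predicate):
        """Filter on any other property: every partition has to be visited."""
        self.partitions_touched = len(self.partitions)
        return [d for p in self.partitions for d in p.values() if predicate(d)]


store = ToyPartitionedStore(partition_count=4)
for i in range(1, 9):
    store.upsert({"id": f"p{i}", "name": f"Product {i}"})

by_id = store.read("p3")
read_cost = store.partitions_touched
by_name = store.query(lambda d: d["name"] == "Product 3")
query_cost = store.partitions_touched
print(read_cost, query_cost)
```

Both calls find the same document, but the read visits one partition while the query by name visits all of them, which is the fan-out the answer warns about.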
Answer 2:
Cosmos DB is a NoSQL schema-less database for a reason.
The second approach in your brainstorming would be fitting for a traditional RDBMS, but we don't have that here.
You can simply go with your first approach and either have everything under a single property or just have them at the top level.
Remember that you can map the response to any object that you want, so you can simply have two DTOs, a slim and an extended version, and map to one or the other depending on the endpoint.
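A minimal sketch of the two-DTO idea, written in Python rather than .NET for brevity. The class and function names are hypothetical, and the stored document is assumed to look like the example in the question.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ProductSlim:
    """Returned by the basic endpoint: common fields only."""
    id: str
    name: str
    description: str


@dataclass
class ProductExtended(ProductSlim):
    """Returned by the extended endpoint: common fields plus metadata."""
    extended_metadata: dict = field(default_factory=dict)


def to_slim(doc):
    return ProductSlim(doc["id"], doc["name"], doc["description"])


def to_extended(doc):
    # A missing "extendedMetadata" property maps to an empty dict.
    metadata = dict(doc.get("extendedMetadata", {}))
    return ProductExtended(doc["id"], doc["name"], doc["description"], metadata)


stored = {
    "id": "cbf3f7aa-4743-4198-b307-260f703c42c1",
    "name": "Product One",
    "description": "The number one product",
    "extendedMetadata": {"prop1": "prop1"},
}
print(asdict(to_slim(stored)))
print(asdict(to_extended(stored)))
```

Both endpoints read the same stored document; only the mapping differs, so no splitting logic is needed on write.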
Hope this helps.

Database optimized for searching in large number of objects with different attributes

I am currently searching for an alternative to our aging MySQL database, which uses an EAV approach. Current projects seem to have outgrown traditional table-oriented database structures, and especially searches in such databases.
I have heard of and researched various NoSQL database systems, but I can't find anything that seems to be what I'm looking for. Maybe you can help.
I'll show you a generalized example of what kind of data I have and what operations I want to execute on it:
I have an object that has a small number of META attributes, i.e. attributes that are common to all instances of my objects. For example:
DataObject Common (META) Attributes
Unique ID (Some kind of string containing a unique identifier)
Created Date (A date time showing creation time of the object)
Type (Some kind of type identifier, maybe something like "Article", "News", "Image" or "Video")
... I think you get the idea
Then each of my objects has a variable number of other attributes. Most probably, many objects will share a number of these attributes, but there is no rule. For my sample, let's say each object instance has between 5 and 20 such attributes. Here are some samples:
Data Object variable Attributes
Color (Some CSS like color string)
Name (A string)
Category (The category or Tag of this item) (Maybe we also have more than one of these?)
URL (a url containing some website)
Cost (a number with decimals)
... And a whole lot of other stuff mostly being of the usual column types
References to other data is an idea, but not a MUST at the moment. I could provide those within my application logic if needed.
A small sample:
Image
Unique ID = "0s987tncsgdfb64s5dxnt"
Created Date = "2013-11-21 12:23:11"
Type = "Image"
Title = "A cute cat"
Category = "Animal"
Size = "10234"
Mime = "image/jpeg"
Filename = "cat_123.jpg"
Copyright = "None"
Typical Operations
An average storage would probably have around 1-5 million such objects, each with 5-20 attributes.
Apart from the usual stuff like writing one object to the database or reading it by its uid, the most problematic operations are these:
Search by several attributes - Select every DataObject that has Type "News", whose Title contains "blue", and whose Created Date is after 2012.
Paged bulk read - Get a large number of objects from a search (see above), starting at element 100 and ending at 250.
Get many objects with all of their attributes - When reading larger numbers of objects, I need to get every object with all of its attributes in one call.
Storage Requirements
Persistence - The storage needs to be persistent and not in-memory only. If the server reboots, the data has to be at the same point in time as when it shut down. No memory-only systems.
Integrity - All data is important; nothing can be ignored. So every single write has to be stored securely. Systems (Redis?) that tend to lose something now and then aren't usable. Systems with a high degree of asynchronicity are also problematic. If data changes, every responsible node should see that.
Complexity - The system should be fairly easy to set up and maintain. So systems that force the admin to take many week-long courses in their use aren't really a solution here. The same goes for huge data warehouses with loads of nodes. Clustering is nice, but it should also be possible to get a cheap system with one node.
tl;dr
Need a super fast database system with object-oriented data and fast searches, even with hundreds of thousands of items.
A reason as to why I am searching for a better alternative to mysql can be found here: Need MySQL optimization for complex search on EAV structured data
Update
Key-value stores like Redis weren't an option, as we need to do some heavy searching inside our data, something which isn't possible in a typical key-value store.
In the end, we are using MongoDB with a slightly optimized schema to make the best use of MongoDB's indexes.
Some small drawbacks still remain, but they are acceptable at the moment:
- MongoDB's aggregate function cannot work with very large result sets. We have to use find (and refine our data structure to make that sufficient).
- You cannot sort large datasets on specific values, as it would take up too much memory. You also can't create indexes on those values, as they are schema-free.
I don't know if you want a more sophisticated answer than mine, but maybe I can inspire you a little.
MySQL is scalable and can be used for exactly your case. I think it's more of an optimization and server problem if your database is slow. Many systems with massive amounts of data use MySQL and work perfectly, though NoSQL (Not Only SQL) is built for large amounts of data with different attributes.
There are many different NoSQL providers, and they have different ways of handling your data.
Think about that before you choose a NoSQL platform.
The possibilities are:
Key-value stores - e.g. Redis, Voldemort, Oracle BDB
Column stores - e.g. Cassandra, HBase
Document stores - e.g. CouchDB, MongoDB
Graph databases - e.g. Neo4j, InfoGrid, Infinite Graph
Most websites use document-based storage, but Facebook, for example, uses column-based storage because of its many dynamic attributes.
You can try a document-based NoSQL database at http://try.mongodb.org/
In the end, it really depends on how you build and optimize your database, not on which technology you choose, though choosing the right technology can save a bunch of time.
The system we have developed uses a combination of MySQL and NoSQL, depending on what data we are working with: MySQL for the system itself and NoSQL for all the data we import via APIs.
Hope this inspires a little, and feel free to ask any questions.

Storing JSON in database vs. having a new column for each key

I am implementing the following model for storing user related data in my table - I have 2 columns - uid (primary key) and a meta column which stores other data about the user in JSON format.
uid | meta
--------------------------------------------------
1 | {name:['foo'],
| emailid:['foo@bar.com','bar@foo.com']}
--------------------------------------------------
2 | {name:['sann'],
| emailid:['sann@bar.com','sann@foo.com']}
--------------------------------------------------
Is this a better way (performance-wise, design-wise) than the one-column-per-property model, where the table will have many columns like uid, name, emailid?
What I like about the first model is that you can add as many fields as you want; there is no limitation.
Also, now that I have implemented the first model, I was wondering: how do I perform a query on it? For example, I want to fetch all the users who have a name like 'foo'.
Question - Which is the better way to store user-related data (keeping in mind that the number of fields is not fixed) in a database: JSON or column-per-field? Also, if the first model is implemented, how do I query the database as described above? Should I use both models, by storing all the data which may be searched by a query in a separate row and the other data in JSON (in a different row)?
Update
Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?
Updated 4 June 2017
Given that this question/answer have gained some popularity, I figured it was worth an update.
When this question was originally posted, MySQL had no support for JSON data types and the support in PostgreSQL was in its infancy. Since 5.7, MySQL now supports a JSON data type (in a binary storage format), and PostgreSQL JSONB has matured significantly. Both products provide performant JSON types that can store arbitrary documents, including support for indexing specific keys of the JSON object.
However, I still stand by my original statement that your default preference, when using a relational database, should still be column-per-value. Relational databases are still built on the assumption that the data within them will be fairly well normalized. The query planner has better optimization information when looking at columns than when looking at keys in a JSON document. Foreign keys can be created between columns (but not between keys in JSON documents). Importantly: if the majority of your schema is volatile enough to justify using JSON, you might want to at least consider if a relational database is the right choice.
That said, few applications are perfectly relational or document-oriented. Most applications have some mix of both. Here are some examples where I personally have found JSON useful in a relational database:
When storing email addresses and phone numbers for a contact, where storing them as values in a JSON array is much easier to manage than multiple separate tables
Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)
Storing configuration data that has no defined schema (if you're building Zapier, or IFTTT and need to store configuration data for each integration)
I'm sure there are others as well, but these are just a few quick examples.
Original Answer
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
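A minimal sketch of that split, using Python's built-in sqlite3 and json modules as a stand-in for MySQL: the searchable value gets its own indexed column, and the display-only data is stored as a JSON string. The table layout, index name, and helper functions are assumptions for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT NOT NULL, meta TEXT NOT NULL)"
)
# Only the column that is searched gets an index.
conn.execute("CREATE INDEX idx_users_name ON users (name)")


def add_user(uid, name, display_only):
    """Store the searchable name as a column and everything else as JSON text."""
    conn.execute(
        "INSERT INTO users (uid, name, meta) VALUES (?, ?, ?)",
        (uid, name, json.dumps(display_only)),
    )


def find_by_name_prefix(prefix):
    """Search on the real column; decode the JSON only for display."""
    rows = conn.execute(
        "SELECT uid, name, meta FROM users WHERE name LIKE ? ORDER BY uid",
        (prefix + "%",),
    )
    return [{"uid": uid, "name": name, **json.loads(meta)} for uid, name, meta in rows]


add_user(1, "foo", {"emailid": ["foo@bar.com", "bar@foo.com"]})
add_user(2, "sann", {"emailid": ["sann@bar.com", "sann@foo.com"]})
print(find_by_name_prefix("fo"))
```

The database never has to look inside the JSON in this layout; it is written and read back as an opaque value.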
Like most things "it depends". It's not right or wrong/good or bad in and of itself to store data in columns or JSON. It depends on what you need to do with it later. What is your predicted way of accessing this data? Will you need to cross reference other data?
Other people have explained the technical trade-offs pretty well.
Not many people have discussed how your app and features evolve over time and how this data storage decision impacts your team.
One of the temptations of using JSON is to avoid migrating the schema, so if the team is not disciplined, it's very easy to stick yet another key/value pair into a JSON field. There's no migration for it, no one remembers what it's for, and there is no validation on it.
My team used JSON alongside traditional columns in Postgres, and at first it was the best thing since sliced bread. JSON was attractive and powerful, until one day we realized that the flexibility came at a cost and had suddenly become a real pain point. Sometimes that point creeps up really quickly, and then it becomes hard to change because we've built so many other things on top of this design decision.
Over time, as we added new features, having the data in JSON led to more complicated-looking queries than we would have had if we had stuck to traditional columns. So we started fishing certain key values back out into columns so that we could make joins and comparisons between values. Bad idea. Now we had duplication. A new developer would come on board and be confused: which is the value I should be saving back into, the JSON one or the column?
The JSON fields became junk drawers for little pieces of this and that. There was no data validation at the database level, and no consistency or integrity between documents. That pushed all the responsibility into the app, instead of getting hard type and constraint checking from traditional columns.
Looking back, JSON allowed us to iterate very quickly and get something out the door. It was great. However, after we reached a certain team size, its flexibility also allowed us to hang ourselves with a long rope of technical debt, which then slowed down subsequent feature work. Use with caution.
Think long and hard about what the nature of your data is. It's the foundation of your app. How will the data be used over time? And how is it likely TO CHANGE?
Just tossing it out there, but WordPress has a structure for this kind of stuff (at least WordPress was the first place I observed it, it probably originated elsewhere).
It allows limitless keys, and is faster to search than using a JSON blob, but not as fast as some of the NoSQL solutions.
uid | meta_key | meta_val
----------------------------------
1 name Frank
1 age 12
2 name Jeremiah
3 fav_food pizza
.................
EDIT
For storing history/multiple keys
uid | meta_id | meta_key | meta_val
----------------------------------------------------
1 1 name Frank
1 2 name John
1 3 age 12
2 4 name Jeremiah
3 5 fav_food pizza
.................
and query via something like this:
select meta_val from `table` where meta_key = 'name' and uid = 1 order by meta_id desc
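The key/value layout and the query above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the table name user_meta and the index are made up, and LIMIT 1 is added to the original query so that only the most recent value comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE user_meta (
           meta_id INTEGER PRIMARY KEY,
           uid INTEGER NOT NULL,
           meta_key TEXT NOT NULL,
           meta_val TEXT
       )"""
)
# Lookups are always by user and key, so index that pair.
conn.execute("CREATE INDEX idx_user_meta_lookup ON user_meta (uid, meta_key)")
conn.executemany(
    "INSERT INTO user_meta (meta_id, uid, meta_key, meta_val) VALUES (?, ?, ?, ?)",
    [
        (1, 1, "name", "Frank"),
        (2, 1, "name", "John"),
        (3, 1, "age", "12"),
        (4, 2, "name", "Jeremiah"),
        (5, 3, "fav_food", "pizza"),
    ],
)


def latest_value(uid, key):
    """Return the most recently stored value for a key, or None if absent."""
    row = conn.execute(
        "SELECT meta_val FROM user_meta "
        "WHERE meta_key = ? AND uid = ? ORDER BY meta_id DESC LIMIT 1",
        (key, uid),
    ).fetchone()
    return row[0] if row else None


print(latest_value(1, "name"))
```

Because every value lives in one text column, numbers such as the age come back as strings and have to be converted by the application.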
The drawback of this approach is exactly what you mentioned:
it makes it VERY slow to find things, since each time you need to perform a text search on it.
A value-per-column design instead matches the whole string.
Your approach (JSON based data) is fine for data you don't need to search by, and just need to display along with your normal data.
Edit: Just to clarify, the above goes for classic relational databases. NoSQL use JSON internally, and are probably a better option if that is the desired behavior.
Basically, the first model you are using is called document-based storage. You should have a look at popular document-based NoSQL databases like MongoDB and CouchDB. In document-based databases, you store data as JSON documents and can then query those documents.
The second model is the popular relational database structure.
If you want to use a relational database like MySQL, then I would suggest you only use the second model. There is no point in using MySQL and storing data as in the first model.
To answer your second question, there is no way to query for name like 'foo' if you use the first model.
It seems that you're mainly hesitating whether to use a relational model or not.
As it stands, your example would fit a relational model reasonably well, but the problem may come of course when you need to make this model evolve.
If you only have one (or a few pre-determined) levels of attributes for your main entity (user), you could still use an Entity Attribute Value (EAV) model in a relational database. (This also has its pros and cons.)
If you anticipate that you'll get less structured values that you'll want to search using your application, MySQL might not be the best choice here.
If you were using PostgreSQL, you could potentially get the best of both worlds. (This really depends on the actual structure of the data here... MySQL isn't necessarily the wrong choice either, and the NoSQL options can be of interest, I'm just suggesting alternatives.)
Indeed, PostgreSQL can build indexes on (immutable) functions (which MySQL can't, as far as I know), and in recent versions you could use PLV8 on the JSON data directly to build indexes on specific JSON elements of interest, which would improve the speed of your queries when searching for that data.
EDIT:
Since there won't be too many columns on which I need to perform
search, is it wise to use both the models? Key-per-column for the data
I need to search and JSON for others (in the same MySQL database)?
Mixing the two models isn't necessarily wrong (assuming the extra space is negligible), but it may cause problems if you don't make sure the two data sets are kept in sync: your application must never change one without also updating the other.
A good way to achieve this would be to have a trigger perform the automatic update, by running a stored procedure within the database server whenever an update or insert is made. As far as I'm aware, the MySQL stored procedure language probably lacks support for any sort of JSON processing. Again, PostgreSQL with PLV8 support (and possibly other RDBMSs with more flexible stored procedure languages) should be more useful (updating your relational column automatically using a trigger is quite similar to updating an index in the same way).
Short answer:
You have to mix the two.
Use JSON for data that you are not going to build relations on, like contact data, addresses, or product variables.
Sometimes joins on a table are an overhead, for example in OLAP. Say I have two tables, ORDERS and ORDER_DETAILS. To get all the order details, we have to join the two tables, and this makes the query slower as the number of rows in the tables grows, say into the millions. A left/right join is also slower than an inner join.
I think that if we add a JSON string/object to the respective ORDERS entry, the JOIN can be avoided, and report generation will be faster.
You are trying to fit a non-relational model into a relational database, I think you would be better served using a NoSQL database such as MongoDB. There is no predefined schema which fits in with your requirement of having no limitation to the number of fields (see the typical MongoDB collection example). Check out the MongoDB documentation to get an idea of how you'd query your documents, e.g.
db.mycollection.find(
{
name: 'sann'
}
)
As others have pointed out, queries will be slower. I'd suggest adding at least an '_ID' column to query by instead.