I am curious what techniques database developers and architects use to create stored procedures (or functions) that return dynamically filtered data for large-scale databases.
For example, let's take a database with millions of people in it, and we want to provide a stored procedure "get-person-list" which takes a JSON parameter. Within this JSON parameter, we can define filters such as $.filter.name.first, $.filter.name.last, $.filter.phone.number, $.filter.address.city, etc.
The frontend (web solution) allows the user to define one or more filters, so the front-end can say "Show me everyone with a First name of Ted and last name of Smith in San Diego."
The payload would look like this:
{
"filter": {
"name": {
"last": "smith",
"first": "ted"
},
"address": {
"city": "san diego"
}
}
}
Now, what would the best technique be to write a single stored procedure capable of handling numerous (dozens or more) filter settings (dynamically) and returning the proper result set all with the best optimization/speed?
Is it possible to do this with CTE, or are prepared statements based on IF/THEN logic (building out the SQL to be executed based on filter value) the best/only real method?
How do big companies with huge databases and thousands of users write their calls to return complex dynamic lists of data as quickly as possible?
Everything Bill wrote is true, and good advice.
I'll take it a little further. You're proposing building a search layer into your system, which is fine.
You're proposing an interface in which you pass a JSON object to code inside the DBMS. That's not fine. That code will either have a bunch of canned queries handling the various search scenarios, or will have a mess of string-handling code that reads the JSON, puts together appropriate queries, then uses MySQL's PREPARE statement to run them. From my experience that is, with respect, a really bad idea.
Here's why:
The stored-procedure language has very weak string-handling support compared to host languages. No sprintf. No arrays of strings. No join or implode operators. Clunky regex, and not always present on every server. You're going to need string handling to build search queries.
Stored procedures are trickier to debug, test, deploy, and maintain than ordinary application code. That work requires special skills and special access.
You will need to maintain this code, especially if your system proves successful. You'll add requirements that will require expanding your search capabilities.
It's impossible (seriously, impossible) to know what your actual application usage patterns will be at scale. You surely will, as a consequence of growth, find usage patterns that surprise you. My point is that you can't design and build a search system and then forget about it. It will evolve along with your app.
To keep up with evolving usage patterns, you'll need to refactor some queries and add some indexes. You will be under pressure when you do that work: People will be complaining about performance. See points 1 and 2 above.
MySQL / MariaDB's stored procedures aren't compiled with an optimizing compiler, unlike Oracle and SQL Server's. So there's no compelling performance win.
So don't use a stored procedure for this. Please. Ask me how I know this sometime.
If you need a search module with a JSON interface, implement it in your favorite language (php, C#, nodejs, java, whatever). It will be easier to debug, test, deploy, and maintain.
To write a query that searches a variety of columns, you would have to write dynamic SQL. That is, write code to parse your JSON payload for the filter keys and values, and format SQL expressions in a string that is part of a dynamic SQL statement. Then prepare and execute that string.
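For illustration only, here is a hedged sketch of what such a dynamically built statement might look like on the MySQL side for the example payload in the question. The table and column names (people, last_name, first_name, city) are assumptions; the point is that only the filters actually present in the JSON get appended, and the values are passed as bound parameters:
SET @sql = CONCAT(
  'SELECT * FROM people WHERE 1=1',
  ' AND last_name = ?',
  ' AND first_name = ?',
  ' AND city = ?');                 -- append one fragment per filter found in the payload
PREPARE stmt FROM @sql;
SET @last = 'smith', @first = 'ted', @city = 'san diego';
EXECUTE stmt USING @last, @first, @city;
DEALLOCATE PREPARE stmt;
Whether that string is assembled in a stored procedure or, as argued above, in application code, the prepare/execute step is the same.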
In general, you can't "optimize for everything." Trying to optimize when you don't know in advance which queries your users will submit is a nigh-impossible task. There's no perfect solution.
The most common method of optimizing search is to create indexes. But you need to know the types of search in advance to create indexes. You need to know which columns will be included, and which types of search operations will be used, because the column order in an index affects optimization.
For N columns, there are N-factorial permutations of columns, but clearly this is impractical because MySQL only allows 64 indexes per table. You simply can't create all the indexes needed to optimize every possible query your users attempt.
The alternative is to optimize queries partially, by indexing a few combinations of columns, and hope that these help the users' most common queries. Use application logs to determine what the most common queries are.
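For example (a sketch only, with assumed table and column names), if the logs show that last-name/first-name and city searches dominate, you might index just those combinations:
ALTER TABLE people
  ADD INDEX idx_name (last_name, first_name),
  ADD INDEX idx_city (city);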
There are other types of indexes. You could use fulltext indexing, either the implementation built in to MySQL, or else supplement your MySQL database with ElasticSearch or similar technology. These provide a different type of index that effectively indexes everything with one index, so you can search based on multiple columns.
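A minimal sketch of the built-in MySQL option, again with assumed table and column names:
ALTER TABLE people
  ADD FULLTEXT INDEX ft_person (first_name, last_name, city);

SELECT * FROM people
WHERE MATCH(first_name, last_name, city)
      AGAINST('+ted +smith +"san diego"' IN BOOLEAN MODE);
One FULLTEXT index covers searches touching any combination of those columns, which is what makes this attractive when the filter combinations are unpredictable.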
There's no single product that is "best." Which fulltext indexing technology meets your needs requires you to evaluate different products. This is some of the unglamorous work of software development — testing, benchmarking, and matching product features to your application requirements. There are few types of work that I enjoy less. It's a toss-up between this and resolving git merge conflicts.
It's also more work to manage copies of data in multiple datastores, making sure data changes in your SQL database are also copied into the fulltext search index. This involves techniques like ETL (extract, transform, load) and CDC (change data capture).
But you asked how big companies with huge databases do this, and this is how.
Input
I do that "all the time". The web page has a <form>. When submitted, I look for the fields of that form that were filled in, then build
WHERE this = "..."
AND that = "..."
into the suitable SELECT statement.
Note: I leave out any fields that were not specified in the form; I make sure to escape the strings.
I'm walking through $_GET[] instead of JSON, so it is quite easy.
INDEXing
If you have a column for each possible field, then it is a matter of providing indexes only for the columns most likely to be searched on. (There are practical and even hard-coded limits on indexes.)
If you have stored the attributes in an EAV table structure, you have my condolences. Search the [entity-attribute-value] tag for many other poor souls who wandered into that swamp.
If you store the attributes in JSON, well, that is likely to be an order of magnitude worse than EAV.
If you throw all the information into a FULLTEXT column and use MATCH, then you can get enough speed for "millions" of rows. But it comes with various caveats (word length, stoplist, word endings, surprise matches, etc.).
If you would like to discuss further, then scale back your expectations and make a list of likely search keys. We can then discuss what technique might be best.
I have tables table_a and table_b in my database and they are mapped in Slick with TableQuery objects. I need to copy a restricted set of data from table_a to table_b.
Let the table query objects be tableQueryA and tableQueryB. The logic for filtering and copying the data is complex, so
I am thinking of getting the Scala collection equivalent of each table query object in a for/yield and treating them as normal collections. But everything happens in one transaction. The code looks something like this:
for {
  collA <- tableQueryA.filter(.....something....).result
  collB <- tableQueryB.filter(.....somethingElse.....).result
  // ...... do something with collA and collB
} yield ...something
Is there any harm in doing it this way, i.e. handling them as Scala collections and processing them?
I am using Slick 3.2.
By doing two separate tableQueryX.filter().result calls, you'll be executing two separate queries against the database. You could replace them with one query that joins the two tables.
It's hard to say which approach is better in terms of performance, as it depends on the number of filter/where clauses and on which indexes the database can use to satisfy them. If you need top-notch performance, try both approaches and pick the faster one.
If both of your queries return a large amount of data, you also need to consider your application's memory usage, because all the data is loaded before the Scala collection API can be used.
I don't see any harm as long as the data volume is small, but it's better to filter the data at the DB level to avoid any potential out-of-memory errors.
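For illustration only, if the filtering and copying logic can be expressed in SQL, the whole copy can stay in the database. A hedged sketch; the column names, join key, and filter conditions below are placeholders, not taken from the question:
INSERT INTO table_b (col1, col2)
SELECT a.col1, a.col2
FROM table_a AS a
LEFT JOIN table_b AS b
       ON b.a_id = a.id            -- assumed join key
WHERE a.some_flag = 1              -- stands in for ".....something....."
  AND b.id IS NULL;                -- e.g. copy only rows not already present in table_b
If the transformation between the two tables genuinely needs application logic, loading both result sets as shown in the question is fine at small volumes; the sketch above is only for the cases where the work can be pushed down.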
In MySQL 5.7 a new data type for storing JSON data in MySQL tables has been added. It will obviously be a great change in MySQL. They listed some benefits:
Document Validation - Only valid JSON documents can be stored in a JSON column, so you get automatic validation of your data.
Efficient Access - More importantly, when you store a JSON document in a JSON column, it is not stored as a plain text value. Instead, it is stored in an optimized binary format that allows for quicker access to object members and array elements.
Performance - Improve your query performance by creating indexes on values within the JSON columns. This can be achieved with "functional indexes" on virtual columns.
Convenience - The additional inline syntax for JSON columns makes it very natural to integrate document queries within your SQL. For example (features.feature is a JSON column):
SELECT feature->"$.properties.STREET" AS property_street FROM features WHERE id = 121254;
Wow! They include some great features. Now it is easier to manipulate data, and it is possible to store more complex data in a column.
So MySQL is now flavored with NoSQL.
Now I can imagine a query for JSON data something like
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN
(
  SELECT JSON_EXTRACT(data,"$.inverted")
  FROM t1                                    -- rows look like {"series": 3, "inverted": 8}
  WHERE JSON_EXTRACT(data,"$.inverted") < 4
);
So can I store lots of small relations in a few JSON columns? Is it good? Does it break normalization? If this is possible, then I guess it will act like NoSQL in a MySQL column. I really want to know more about this feature, and the pros and cons of the MySQL JSON data type.
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN ...
Using a column inside an expression or function like this spoils any chance of the query using an index to help optimize the query. The query shown above is forced to do a table-scan.
The claim about "efficient access" is misleading. It means that after the query examines a row with a JSON document, it can extract a field without having to parse the text of the JSON syntax. But it still takes a table-scan to search for rows. In other words, the query must examine every row.
By analogy, if I'm searching a telephone book for people with first name "Bill", I still have to read every page in the phone book, even if the first names have been highlighted to make it slightly quicker to spot them.
MySQL 5.7 allows you to define a virtual column in the table, and then create an index on the virtual column.
ALTER TABLE t1
ADD COLUMN series AS (JSON_EXTRACT(data, '$.series')),
ADD INDEX (series);
Then if you query the virtual column, it can use the index and avoid the table-scan.
SELECT * FROM t1
WHERE series IN ...
This is nice, but it kind of misses the point of using JSON. The attractive part of using JSON is that it allows you to add new attributes without having to do ALTER TABLE. But it turns out you have to define an extra (virtual) column anyway, if you want to search JSON fields with the help of an index.
But you don't have to define virtual columns and indexes for every field in the JSON document—only those you want to search or sort on. There could be other attributes in the JSON that you only need to extract in the select-list like the following:
SELECT JSON_EXTRACT(data, '$.series') AS series FROM t1
WHERE <other conditions>
I would generally say that this is the best way to use JSON in MySQL. Only in the select-list.
When you reference columns in other clauses (JOIN, WHERE, GROUP BY, HAVING, ORDER BY), it's more efficient to use conventional columns, not fields within JSON documents.
I presented a talk called How to Use JSON in MySQL Wrong at the Percona Live conference in April 2018. I'll update and repeat the talk at Oracle Code One in the fall.
There are other issues with JSON. For example, in my tests it required 2-3 times as much storage space for JSON documents compared to conventional columns storing the same data.
MySQL is promoting their new JSON capabilities aggressively, largely to dissuade people against migrating to MongoDB. But document-oriented data storage like MongoDB is fundamentally a non-relational way of organizing data. It's different from relational. I'm not saying one is better than the other, it's just a different technique, suited to different types of queries.
You should choose to use JSON when JSON makes your queries more efficient.
Don't choose a technology just because it's new, or for the sake of fashion.
Edit: The virtual column implementation in MySQL is supposed to use the index if your WHERE clause uses exactly the same expression as the definition of the virtual column. That is, the following should use the index on the virtual column, since the virtual column is defined AS (JSON_EXTRACT(data,"$.series"))
SELECT * FROM t1
WHERE JSON_EXTRACT(data,"$.series") IN ...
Except I have found by testing this feature that it does NOT work for some reason if the expression is a JSON-extraction function. It works for other types of expressions, just not JSON functions. UPDATE: this reportedly works, finally, in MySQL 5.7.33.
The following, from "MySQL 5.7 brings sexy back with JSON", sounds good to me:
Using the JSON Data Type in MySQL comes with two advantages over storing JSON strings in a text field:
Data validation. JSON documents will be automatically validated and invalid documents will produce an error.
Improved internal storage format. The JSON data is converted to a format that allows quick read access to the data in a structured format. The server is able to lookup subobjects or nested values by key or index, allowing added flexibility and performance.
...
Specialised flavours of NoSQL stores (Document DBs, Key-value stores and Graph DBs) are probably better options for their specific use cases, but the addition of this datatype might allow you to reduce complexity of your technology stack. The price is coupling to MySQL (or compatible) databases. But that is a non-issue for many users.
Note the language about document validation, as it is an important factor. I guess a battery of tests needs to be performed to compare the two approaches. Those two being:
MySQL with JSON data types
MySQL without
From what I am seeing, the net has only shallow slideshares so far on the topic of MySQL / JSON / performance.
Perhaps your post can be a hub for it. Or perhaps performance is an afterthought, not sure, and you are just excited not to have to create a bunch of tables.
From my experience, the JSON implementation, at least in MySQL 5.7, is not very useful due to its poor performance.
Well, it is not so bad for reading data and validation. However, JSON modification is 10-20 times slower with MySQL than with Python or PHP.
Let's imagine very simple JSON:
{ "name": "value" }
Let's suppose we have to convert it to something like this:
{ "name": "value", "newName": "value" }
You can create a simple script with Python or PHP that will select all rows and update them one by one. You are not forced to do it in one huge transaction, so other applications can use the table in parallel. Of course, you can also make it one huge transaction if you want, so you get a guarantee that MySQL will perform "all or nothing", but other applications will most probably not be able to use the database while the transaction executes.
I have a 40 million row table, and a Python script updates it in 3-4 hours.
Now we have MySQL JSON, so we don't need Python or PHP anymore; we can do something like this:
UPDATE `JsonTable` SET `JsonColumn` = JSON_SET(`JsonColumn`, '$.newName', JSON_EXTRACT(`JsonColumn`, '$.name'));
It looks simple and excellent. However, its speed is 10-20 times slower than the Python version, and it is a single transaction, so other applications cannot modify the table data in parallel.
So, if we want to just duplicate a JSON key in a 40 million row table, we cannot use the table at all for 30-40 hours. It makes no sense.
As for reading data, from my experience direct access to a JSON field via JSON_EXTRACT in WHERE is also extremely slow (much slower than TEXT with LIKE on a non-indexed column). Virtual generated columns perform much faster; however, if we know our data structure beforehand, we don't need JSON, we can use traditional columns instead. And when we use JSON where it is really useful, i.e. when the data structure is unknown or changes often (for example, custom plugin settings), creating virtual columns on a regular basis for every possible new key doesn't look like a good idea.
Python and PHP handle JSON validation like a charm, so it is questionable whether we need JSON validation on the MySQL side at all. Why not also validate XML and Microsoft Office documents, or check spelling? ;)
I got into this problem recently, and I sum up my experience as follows:
1. There isn't one way to solve every question.
2. You should use JSON properly.
One case:
I have a table named CustomField, and it must have two columns: name and fields.
name is a localized string; its content should look like:
{
"en":"this is English name",
"zh":"this is Chinese name"
...(other languages)
}
And fields should be like this:
[
  {
    "field1": "value",
    "field2": "value"
    ...
  },
  {
    "field1": "value",
    "field2": "value"
    ...
  }
  ...
]
As you can see, both the name and the fields can be saved as JSON, and it works!
However, if I use the name to search this table very frequently, what should I do? Use JSON_CONTAINS, JSON_EXTRACT...? Obviously, it's not a good idea to keep it as JSON anymore; we should save it in an independent table: CustomFieldName.
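A hedged sketch of what that independent table could look like; every name and type here is illustrative rather than taken from the original schema:
CREATE TABLE CustomFieldName (
  custom_field_id BIGINT       NOT NULL,   -- references CustomField
  locale          VARCHAR(10)  NOT NULL,   -- 'en', 'zh', ...
  name            VARCHAR(255) NOT NULL,
  PRIMARY KEY (custom_field_id, locale),
  KEY idx_name (name)
);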
From the above case, I think you should keep these ideas in mind:
Why does MySQL support JSON?
Why do you want to use JSON? Does your business logic actually need it? Or is there something else?
Never be lazy.
Thanks
I strongly disagree with some of the things said in other answers (which, to be fair, were written a few years ago).
We have very carefully started to adopt JSON fields with a healthy skepticism. Over time we've been adding this more.
This generally describes the situation we are in:
Like 99% of applications out there, we are not doing things at a massive scale. We work with many different applications and databases, the majority of these are capable of running on modest hardware.
We have processes and know-how in place to make changes if performance does become a problem.
We have a general idea of which tables are going to be large and think carefully about how we optimize queries for them.
We also know in which cases this is not really needed.
We're pretty good at data validation and static typing at the application layer.
Lastly,
When we use JSON for storing complex data, that data is never referenced directly by other tables. We also tend to never need to use them in where clauses in hot paths.
So with all this in mind, using a little JSON field instead of 1 or more tables vastly reduces the complexity of queries and data model. Removing this complexity makes it easier to write certain queries, makes our code simpler and just generally saves time.
Complexity and performance is something that needs to be carefully balanced. JSON fields should not be blindly applied, but for the cases where this works it's fantastic.
'JSON fields don't perform well' is a valid reason to not use JSON fields, if you are at a place where that performance difference matters.
One specific example is that we have a table where we store settings for video transcoding. The settings table has 1 'profile' per row, and the settings themselves have a maximum nesting level of 4 (arrays and objects).
Despite this being a large database overall, there are only a few hundred of these records in the database. Suggesting to split this into 5 tables would yield no benefit and lots of pain.
This is an extreme example, but we have plenty of others (with more rows) where the decision to use JSON fields is a few years in the past, and hasn't yet caused an issue.
Last point: it is now possible to directly index on JSON fields.
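For instance, MySQL 8.0.13+ accepts a functional index directly over a JSON path, without a separate generated column. A sketch with assumed names, loosely based on the transcoding-settings example above; the CAST and COLLATE mirror the pattern the MySQL manual uses for indexing JSON string values:
CREATE INDEX idx_settings_profile
  ON transcoding_settings ((CAST(settings->>'$.profile' AS CHAR(64)) COLLATE utf8mb4_bin));

-- A query that repeats the same expression can then use the index:
SELECT *
FROM transcoding_settings
WHERE CAST(settings->>'$.profile' AS CHAR(64)) COLLATE utf8mb4_bin = 'h264_1080p';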
I have a rather large application storing data in MongoDB (Mongoose), despite the fact that my data is absolutely relational and could be represented very well as tables with schemas. The specific issue is that I have a lot of relations between objects. So I need to perform very deep populations — 25+ for each request in total.
A good option would be to rewrite the app for MySQL. However, there are tons of code bound to MongoDB. The question is: if there will be a growing amount of relations between objects by ObjectID, will it still be as efficient as MySQL, or should I dive into the code and move the app completely to MySQL?
In both cases I use an ORM: Mongoose now, Sequelize if I move.
Is Mongo really efficient in working with relations? I mean, SQL was designed to join tables with relations, so I hope it has some optimizations under the hood. Relations seem to be a bit of an unusual use case for Mongo. So I worry whether the logically same query, gathering data from 25 collections in Mongo versus joining data from 25 tables in MySQL, may be slower for Mongo.
Here's an example of the schema I'm using. Populated fields are marked with *.
Man
-[friends_ids] --> [Man]*
-friends_ids*: ...
-pets_ids*: ...
-...
-[pets_ids] -> [Pet]*
-name
-avatars*: [Avatar]
-path
-size
-...
My thoughts about relations: let's imagine a Man object that should have a [friends] field. Let's work through it.
MySQL ORM:
from MANS table find Man where id=:id.
from MAN-TO-MANS table find all records where friend id = :id of Man from step 1
from MANS table find all records where id = :id of Men from step 2
join it into one Man object with friends field populated
Mongo:
from MANS collection find Man where _id=:_id. Get its friends' _ids array at this step (not populated)
from MANS collection find all documents where _id = :_id of Men from step 1
join it into one Man object with friends field populated
No requests to JOIN tables. Am I right?
So I need to perform very deep populations — 25+ for each request in total.
A common misconception is that MongoDB does not support JOINs. While this is partially true it is also quite untrue. The reality is that MongoDB does not support server-side joins.
The MongoDB motto is client side JOINing.
This motto can work against you; the application does not always understand the best way to JOIN, so you have to pick your schema, queries and JOINs very carefully in MongoDB to ensure that you are not querying inefficiently.
25+ is perfectly possible for MongoDB, that's not the problem. The problem will be what JOINs you are doing.
This leads onto:
Is Mongo really efficient in working with relations?
Let me give you an example of where MongoDB would actually be faster than MySQL.
Imagine you have a group collection with each group document containing a user_ids field which is represented as an array of ObjectIds which directly relate to the _id field in the user collection.
Doing two queries, one for the group and one for the users would likely be faster than MySQL in this specific case since MongoDB, for one, would not need to atomically write out a result set using your IO bandwidth for common tasks.
This being said though, for anything complex you will get hammered by the fact that the application does not truly know how to use index intersection and merging to create a reasonably performant JOIN.
So for example, say you wish to JOIN between 3 tables in one query, paginating by the third JOINed table. That would probably kill MongoDB's performance, while not being such an inefficient JOIN for MySQL to perform.
However, you might also find that those JOINs are not scalable anyway and are in fact killing any performance you get on MySQL.
if there will be a growing amount of relations between objects by ObjectID, will it still be as efficient as MySQL, or should I dive into the code and move the app completely to MySQL?
It depends on the queries, but I have at least given you some pointers.
Your question is a bit broad, but I interpret it in one of two ways.
One, you are saying that you have references 25 levels deep, and in that case using populate is just not going to work. I dearly hope this is not the pickle you find yourself in. Moving to SQL won't help you either, the fact is you'll be going back to the database too many times no matter what. But if this is how it's got to be, you can tackle it using a variation of the materialized path pattern, which will allow you to select subtrees much more efficiently within your very deep data tree. See here for a discussion: http://docs.mongodb.org/manual/tutorial/model-tree-structures-with-materialized-paths/
The other interpretation is that you have 25 relations between collections. Let's say in this case there is one collection in Mongo for every letter of the English alphabet, and documents in collection A have references to one or more documents in each of collections B-Z. In this case, you might be ok. Mongoose populate lets you populate multiple reference paths, and I doubt if there is a limit it is anywhere as low as 25. So you'd do something like docA.populate("B C ... Z"). In this case also, moving to SQL won't help you per se, you'll still be required to join on multiple tables.
Of course, your original statement that this could all be done in SQL is valid, there doesn't seem to have been a specific reason to use (or not use) Mongo here, just seems to be the way things were done. However, it also seems that whether you use NoSQL or SQL approaches here isn't the determining factor in whether you will see inefficiency. Rather, it's whether you model the data correctly within whatever solution you choose.
I am implementing the following model for storing user related data in my table - I have 2 columns - uid (primary key) and a meta column which stores other data about the user in JSON format.
uid | meta
--------------------------------------------------
1 | {name:['foo'],
| emailid:['foo@bar.com','bar@foo.com']}
--------------------------------------------------
2 | {name:['sann'],
| emailid:['sann@bar.com','sann@foo.com']}
--------------------------------------------------
Is this a better way (performance-wise, design-wise) than the one-column-per-property model, where the table will have many columns like uid, name, emailid.
What I like about the first model is that you can add as many fields as you want; there is no limitation.
Also, I was wondering, now that I have implemented the first model: how do I perform a query on it? For example, I want to fetch all the users who have a name like 'foo'.
Question - Which is the better way to store user-related data (keeping in mind that the number of fields is not fixed) in the database: JSON or column-per-field? Also, if the first model is implemented, how do I query the database as described above? Should I use both models, by storing all the data which may be searched by a query in separate columns and the other data in JSON (in a different column)?
Update
Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?
Updated 4 June 2017
Given that this question/answer have gained some popularity, I figured it was worth an update.
When this question was originally posted, MySQL had no support for JSON data types and the support in PostgreSQL was in its infancy. Since 5.7, MySQL now supports a JSON data type (in a binary storage format), and PostgreSQL JSONB has matured significantly. Both products provide performant JSON types that can store arbitrary documents, including support for indexing specific keys of the JSON object.
However, I still stand by my original statement that your default preference, when using a relational database, should still be column-per-value. Relational databases are still built on the assumption that the data within them will be fairly well normalized. The query planner has better optimization information when looking at columns than when looking at keys in a JSON document. Foreign keys can be created between columns (but not between keys in JSON documents). Importantly: if the majority of your schema is volatile enough to justify using JSON, you might want to at least consider whether a relational database is the right choice.
That said, few applications are perfectly relational or document-oriented. Most applications have some mix of both. Here are some examples where I personally have found JSON useful in a relational database:
When storing email addresses and phone numbers for a contact, where storing them as values in a JSON array is much easier to manage than multiple separate tables
Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)
Storing configuration data that has no defined schema (if you're building Zapier, or IFTTT and need to store configuration data for each integration)
I'm sure there are others as well, but these are just a few quick examples.
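To make the first example above concrete, a minimal sketch (all names are illustrative): the identity stays relational, while the repeating contact details live in JSON arrays:
CREATE TABLE contacts (
  id     BIGINT       NOT NULL PRIMARY KEY,
  name   VARCHAR(100) NOT NULL,   -- searched and joined on, so a real column
  emails JSON,                    -- e.g. ["foo@bar.com", "bar@foo.com"]
  phones JSON                     -- e.g. ["+1-555-0100"]
);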
Original Answer
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
Like most things "it depends". It's not right or wrong/good or bad in and of itself to store data in columns or JSON. It depends on what you need to do with it later. What is your predicted way of accessing this data? Will you need to cross reference other data?
Other people have answered pretty well what the technical trade-offs are.
Not many people have discussed that your app and features evolve over time and how this data storage decision impacts your team.
One of the temptations of using JSON is to avoid schema migrations, so if the team is not disciplined, it's very easy to stick yet another key/value pair into a JSON field. There's no migration for it, no one remembers what it's for, and there is no validation on it.
My team used JSON alongside traditional columns in Postgres, and at first it was the best thing since sliced bread. JSON was attractive and powerful, until one day we realized that the flexibility came at a cost and was suddenly a real pain point. Sometimes that point creeps up really quickly, and then it becomes hard to change because we've built so many other things on top of this design decision.
Over time, as we added new features, having the data in JSON led to more complicated-looking queries than we would have had if we had stuck to traditional columns. So then we started fishing certain key values back out into columns so that we could make joins and comparisons between values. Bad idea. Now we had duplication. A new developer would come on board and be confused: which value should I be saving back into? The JSON one or the column?
The JSON fields became junk drawers for little pieces of this and that. No data validation on the database level, no consistency or integrity between documents. That pushed all that responsibility into the app instead of getting hard type and constraint checking from traditional columns.
Looking back, JSON allowed us to iterate very quickly and get something out the door. It was great. However, after we reached a certain team size, its flexibility also allowed us to hang ourselves with a long rope of technical debt, which then slowed down subsequent feature development. Use with caution.
Think long and hard about what the nature of your data is. It's the foundation of your app. How will the data be used over time? And how is it likely TO CHANGE?
Just tossing it out there, but WordPress has a structure for this kind of stuff (at least WordPress was the first place I observed it, it probably originated elsewhere).
It allows limitless keys, and is faster to search than using a JSON blob, but not as fast as some of the NoSQL solutions.
uid | meta_key | meta_val
----|----------|---------
 1  | name     | Frank
 1  | age      | 12
 2  | name     | Jeremiah
 3  | fav_food | pizza
.................
EDIT
For storing history/multiple keys
uid | meta_id | meta_key | meta_val
----|---------|----------|---------
 1  | 1       | name     | Frank
 1  | 2       | name     | John
 1  | 3       | age      | 12
 2  | 4       | name     | Jeremiah
 3  | 5       | fav_food | pizza
.................
and query via something like this:
select meta_val from `table` where meta_key = 'name' and uid = 1 order by meta_id desc
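A query like that benefits from a composite index covering both the lookup and the ordering; a hedged sketch against the example table above (the index name is arbitrary):
ALTER TABLE `table`
  ADD INDEX idx_uid_key (uid, meta_key, meta_id);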
The drawback of the approach is exactly what you mentioned: it makes it VERY slow to find things, since each time you need to perform a text search on it.
Value-per-column instead matches against the whole string.
Your approach (JSON-based data) is fine for data you don't need to search by and just need to display along with your normal data.
Edit: Just to clarify, the above goes for classic relational databases. NoSQL databases use JSON internally and are probably a better option if that is the desired behavior.
Basically, the first model you are using is called document-based storage. You should have a look at popular NoSQL document-based databases like MongoDB and CouchDB. Basically, in document-based DBs, you store data as JSON documents and then you can query those documents.
The second model is the popular relational database structure.
If you want to use a relational database like MySQL, then I would suggest you use only the second model. There is no point in using MySQL and storing data as in the first model.
To answer your second question: there is no way to query name like 'foo' if you use the first model.
It seems that you're mainly hesitating whether to use a relational model or not.
As it stands, your example would fit a relational model reasonably well, but the problem may come of course when you need to make this model evolve.
If you only have one (or a few pre-determined) levels of attributes for your main entity (user), you could still use an Entity Attribute Value (EAV) model in a relational database. (This also has its pros and cons.)
If you anticipate that you'll get less structured values that you'll want to search using your application, MySQL might not be the best choice here.
If you were using PostgreSQL, you could potentially get the best of both worlds. (This really depends on the actual structure of the data here... MySQL isn't necessarily the wrong choice either, and the NoSQL options can be of interest, I'm just suggesting alternatives.)
Indeed, PostgreSQL can build indexes on (immutable) functions (which MySQL can't, as far as I know) and, in recent versions, you could use PLV8 on the JSON data directly to build indexes on specific JSON elements of interest, which would improve the speed of your queries when searching for that data.
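On current PostgreSQL versions, the common case no longer even needs PLV8: a plain expression index over a json/jsonb key works. A sketch that assumes the key holds a scalar value and uses illustrative names:
CREATE TABLE users (
  uid  BIGINT PRIMARY KEY,
  meta JSONB
);

CREATE INDEX idx_users_meta_name ON users ((meta ->> 'name'));

-- A query using the same expression can use the index:
SELECT * FROM users WHERE meta ->> 'name' = 'foo';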
EDIT:
Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?
Mixing the two models isn't necessarily wrong (assuming the extra space is negligible), but it may cause problems if you don't make sure the two data sets are kept in sync: your application must never change one without also updating the other.
A good way to achieve this would be to have a trigger perform the automatic update, by running a stored procedure within the database server whenever an update or insert is made. As far as I'm aware, the MySQL stored procedure language probably lacks support for any sort of JSON processing. Again, PostgreSQL with PLV8 support (and possibly other RDBMSs with more flexible stored procedure languages) should be more useful (updating your relational column automatically using a trigger is quite similar to updating an index in the same way).
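On MySQL 5.7 and later, where the JSON functions discussed earlier are available, such a sync trigger becomes feasible. A hedged sketch with assumed table and column names (a matching BEFORE INSERT trigger would also be needed):
CREATE TRIGGER users_meta_sync
BEFORE UPDATE ON users
FOR EACH ROW
  SET NEW.name = JSON_UNQUOTE(JSON_EXTRACT(NEW.meta, '$.name'));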
Short answer:
you have to mix the two;
use JSON for data that you are not going to create relations with, like contact data, addresses, and product variables.
Sometimes joins on the tables will be an overhead, let's say for OLAP. If I have two tables, one being an ORDERS table and the other ORDER_DETAILS, then for getting all the order details we have to join the two tables. This will make the query slower as the number of rows in the tables increases, let's say into the millions or so; a left/right join is slower than an inner join.
I think if we add a JSON string/object to the respective ORDERS entry, the JOIN will be avoided and report generation will be faster...
You are trying to fit a non-relational model into a relational database; I think you would be better served using a NoSQL database such as MongoDB. There is no predefined schema, which fits in with your requirement of having no limit on the number of fields (see the typical MongoDB collection example). Check out the MongoDB documentation to get an idea of how you'd query your documents, e.g.
db.mycollection.find(
{
name: 'sann'
}
)
As others have pointed out, queries will be slower. I'd suggest adding at least an '_ID' column and querying by that instead.