I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how to model the "product information"?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible and working with this monster-table.
Pro:
Easy queries
Easy layout
Con:
Lot of NULL Values
The actual code becomes sensitive towards the query (different type, different columns are required)
EAV-Pattern
Obviously the EAV-Pattern can provide a nicer solution for this. However, I've worked with EAV in the past, and performance can become a problem with a huge number of entries.
Searching is easy, but listing a "normalized table" requires one join per actual column, which is slow.
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
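To make the "one join per column" cost concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The table and column names (products, product_attributes, attr, value) are hypothetical and only serve the illustration; a real EAV schema would likely look different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE product_attributes (
        product_id INTEGER REFERENCES products(id),
        attr TEXT,
        value TEXT,
        PRIMARY KEY (product_id, attr)
    );
    INSERT INTO products VALUES (1, 'cat'), (2, 'car');
    INSERT INTO product_attributes VALUES
        (1, 'color', 'white'),
        (2, 'wheels', '4'),
        (2, 'bhp', '123');
""")

def flat_rows(conn, attrs):
    """Build a flat listing: one LEFT JOIN per requested attribute."""
    selects = ["p.id", "p.name"]
    joins = []
    params = []
    for i, attr in enumerate(attrs):
        alias = f"a{i}"
        selects.append(f"{alias}.value")
        joins.append(
            f"LEFT JOIN product_attributes {alias} "
            f"ON {alias}.product_id = p.id AND {alias}.attr = ?"
        )
        params.append(attr)
    sql = (
        f"SELECT {', '.join(selects)} FROM products p "
        f"{' '.join(joins)} ORDER BY p.id"
    )
    return conn.execute(sql, params).fetchall()

rows = flat_rows(conn, ["color", "wheels", "bhp"])
```

Three attributes already mean three joins, and every attribute value comes back as text, which is the "not normalized" drawback in practice.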
Single Table per category
Basically the opposite of the EAV-Pattern: Create one table per product-type, i.e. "cats", "dogs", "cars", ...
While this might be feasible for a small, fixed number of categories, it becomes a nightmare to maintain when the number of categories keeps growing.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet I found recommendations to mix both approaches: Use a single Table for the common information, while grouping other attributes into "attribute-groups" which are organized in the EAV-Fashion.
However, here I think, this would basically import the drawbacks of EACH approach... You need to work with regular Tables (basic information) and do a huge amount of joins to get ALL information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root-table").
However, I don't really like this, as it seems hard(er) to query and to work-with than a regular database layout.
Automating single tables
Another idea was automating the part of "creating tables" per category (and therefore automating the queries on those), while maintaining a "master-table" just containing the id and the category information, in order to get the best performance for an undetermined amount of tables...?
i.e.:
Products
id | category | actualId
1 | cat | 1
2 | car | 1
cats
id | color | mew
1 | white | true
cars
id | wheels | bhp
1 | 4 | 123
The (abstract) Products table would allow querying for everything, while details are available through an easy join between "actualId" and the responsible table.
However, this would lead to problems if you want to run a "show all" query, because this is not solvable by SQL alone: the table name (in the join) needs to be explicit in the query.
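One way around the explicit-table-name problem is to let the application generate one join per category table. The sketch below uses Python's sqlite3 with the example tables from above; the DETAIL_TABLES whitelist and the show_all helper are assumptions made for illustration, not part of any existing design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, actualId INTEGER);
    CREATE TABLE cats (id INTEGER PRIMARY KEY, color TEXT, mew INTEGER);
    CREATE TABLE cars (id INTEGER PRIMARY KEY, wheels INTEGER, bhp INTEGER);
    INSERT INTO products VALUES (1, 'cat', 1), (2, 'car', 1);
    INSERT INTO cats VALUES (1, 'white', 1);
    INSERT INTO cars VALUES (1, 4, 123);
""")

# Assumption: the application keeps a fixed mapping from category to detail table.
# Table names cannot be bound as query parameters, so they must come from a whitelist.
DETAIL_TABLES = {"cat": "cats", "car": "cars"}

def show_all(conn):
    """Return every product with its details, using one query per category table."""
    result = []
    for category, table in DETAIL_TABLES.items():
        cur = conn.execute(
            f"SELECT p.id AS product_id, p.category, d.* "
            f"FROM products p JOIN {table} d ON d.id = p.actualId "
            f"WHERE p.category = ?",
            (category,),
        )
        columns = [c[0] for c in cur.description]
        result.extend(dict(zip(columns, row)) for row in cur)
    return sorted(result, key=lambda r: r["product_id"])

all_products = show_all(conn)
```

Each category returns rows with a different shape, so the caller still has to cope with heterogeneous records; the master table only removes the need to know in advance where a product lives.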
What other options are available? There are a lot of "webshops", each dealing with this problem more or less. How do they solve it in an efficient way?
I strongly disagree with your opinion that the "monster" table approach leads to "Easy queries", and that the EAV approach will cause performance issues (premature optimization?). And it doesn't have to require complex queries:
SELECT base.id, base.other_attributes
, GROUP_CONCAT(CONCAT(ext.key, '[', ext.type, ']', ext.value))
FROM base_attributes base
LEFT JOIN extended_attributes ext
ON base.id=ext.id
WHERE base.id=?
GROUP BY base.id, base.other_attributes
;
You would need to do some parsing on the above, but a wee bit of polishing would give something parseable as JSON or XML without putting your data inside anonymous blobs
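As a rough illustration of the parsing step, here is a sketch in Python with sqlite3. It is an adaptation, not the MySQL query itself: SQLite uses || for concatenation and takes the separator as a second argument to GROUP_CONCAT. The separator choice, the CASTS table, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE base_attributes (id INTEGER PRIMARY KEY, other_attributes TEXT);
    CREATE TABLE extended_attributes (id INTEGER, "key" TEXT, type TEXT, value TEXT);
    INSERT INTO base_attributes VALUES (1, 'common data');
    INSERT INTO extended_attributes VALUES
        (1, 'wheels', 'int', '4'),
        (1, 'color', 'str', 'red');
""")

SEP = "\x1f"  # ASCII unit separator; assumed never to occur in the stored data
CASTS = {"int": int, "str": str}  # hypothetical mapping from the type tag to a converter

def load(conn, product_id):
    """Fetch one product and fold its extended attributes into a dict."""
    row = conn.execute(
        """
        SELECT base.id, base.other_attributes,
               GROUP_CONCAT(ext."key" || '[' || ext.type || ']' || ext.value, char(31))
        FROM base_attributes base
        LEFT JOIN extended_attributes ext ON base.id = ext.id
        WHERE base.id = ?
        GROUP BY base.id, base.other_attributes
        """,
        (product_id,),
    ).fetchone()
    if row is None:
        return None
    record = {"id": row[0], "other_attributes": row[1]}
    for piece in (row[2] or "").split(SEP):
        if not piece:
            continue
        key, rest = piece.split("[", 1)
        type_name, value = rest.split("]", 1)
        record[key] = CASTS[type_name](value)
    return record

product = load(conn, 1)
```

The resulting dict can be serialized to JSON directly, while the data itself stays in typed, queryable rows.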
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).
Related
For a social network site, I need to store frequently modified lists for each entity(& millions of such entities) which are:
frequently appended to
frequently read
sometimes reduced
lists are keyed by primary key
I'm already storing some other types of data in an RDBMS. I know that I could store those lists in an RDBMS as a many-to-many relationship like this: create a table listItems with two columns, listId and listItem, and to generate any particular list, just do a SELECT query for all records WHERE listId = x. But storing lists this way in an RDBMS is not ideal where high scalability is concerned. Instead I would like to store prepared lists in a natural way, so that retrieval performance is maximized, because I need to fetch around a hundred such lists for a user whenever a user logs in and views a page.
So how do I solve this? What kind of database should be used for this data? Probably one that allows a variable number of columns keyed by a primary key, like Cassandra?
I used the same method, that is, storing a two-column row for every record. We later moved that to a text file of formatted HTML, then to JSON, and finally to MongoDB.
But since you have frequent operations, I suggest Cassandra, HBase, or other Google Bigtable-style implementations such as Accumulo, Cloudata, and Hypertable.
Cloudata may be the right one for you.
As you pointed out, the solution must be performant and scalable. I'd suggest using Redis with its LIST data structure, which gives O(1) inserts and O(N) fetches (N being the number of elements to fetch, assuming you're fetching the last ones from the lists), and scaling it horizontally with some hashing algorithm. I don't know how much data you are going to store or how many machines are available, but it will definitely be the best choice performance-wise, since nothing beats memory access speed.
If the amount of data is huge and you can't keep it all in RAM, then Cassandra can do the job. Storing lists ordered by time is a nice fit for it, even more so with the partition strategy Zanson mentioned.
One more thought: you said read performance must be at its maximum, and once a user logs in you will need to fetch a hundred lists for this user. Why not prepare a single list for each user? That way there will be more writes, but reads will be optimized since you will only need to fetch the latest entries from one list. I'm not sure if that fits your task, just a thought. :)
I would recommend SSDB (https://github.com/ideawu/ssdb), a network wrapper around Google's LevelDB. SSDB is designed to store collection data such as lists, maps, and zsets (sorted sets). You can use it roughly like this (pseudocode):
ssdb->hset(listId, listItem1);
ssdb->hset(listId, listItem2);
ssdb->hset(listId, listItem3);
...
list = ssdb->hscan(listId, 100);
// now list = [listItem1, listItem2, listItem3, ...]
The number of items in one map is limited only by the size of the hard disk. Another solution is Redis, but Redis stores all data in memory (say no more than 30GB), so it probably won't fit your project.
C++, PHP, Python, Java, Lua, and more clients are supported by SSDB.
Cassandra has native support for storing sets/maps/lists. If your queries will always be pulling the whole thing down, then they are a very easy way to deal with this type of thing.
http://www.datastax.com/dev/blog/cql3_collections
http://cassandra.apache.org/doc/cql3/CQL.html#collections
If your lists are tied to a user, you can make them different columns on the user's row/partition, and then queries for the multiple lists will be fast, as they will all be in the same partition for a given user.
Cassandra can be used very well for such use cases. Create as many Columnfamilies as you want for the returned data sets/queries. Cassandra works best with de-normalized data or sets like 1:m, m:m relations.
I know you didn't want to consider relational databases, but I think for this simple situation there is also a scalable solution with a relational database. The main benefit would be that you don't need to maintain a separate database system.
To gain scalability, all NoSQL solutions will distribute your data across multiple nodes. You can do this in your application code, spreading your data out across multiple relational databases. To keep the load balanced, you may need to move data occasionally, but it may be sufficient to simply spawn a new database for every N lists.
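A minimal sketch of what "doing this in your application code" could look like, in Python. The host names are hypothetical placeholders, and simple modulo placement is only one possible strategy: changing the number of shards moves most keys, which is where the occasional data moves mentioned above come from.

```python
import hashlib

# Hypothetical database hosts; in practice these would be real connection settings.
SHARDS = ["db0.example.internal", "db1.example.internal", "db2.example.internal"]

def shard_for(list_id, shards=SHARDS):
    """Pick a shard from a stable hash of the list id (simple modulo placement)."""
    digest = hashlib.sha256(str(list_id).encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]

# The same id always routes to the same database.
target = shard_for(42)
```

A cryptographic hash is used here only because it is stable across processes and machines, unlike Python's built-in hash() for strings.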
In Cassandra you can have wide rows, up to 2 billion columns per row. If that's enough for an entity's cumulative list items, you can store all of the entity's lists in a single row and then retrieve them all together.
With Cassandra's "composite columns" you can store the elements of each list sequentially and in order, delete a single column (a list item) when you want, and handle an insertion by simply inserting a column.
Something like this:
       | list_1_Id:item1Id | list_1_Id:item2Id | list_2_Id:item1Id | ... | list_n_Id:item3Id |
entity | item1Value        | item2Value        | item1Value        | ... | item3Value        |
So in practice you deal with columns (= items) rather than lists, which makes your work much easier.
Depending on the size of your lists, consider splitting the entity's row into multiple rows.
Something like this:
                 | item1Id    | item2Id    | item3Id    | item4Id    |...
entiId_list_1_Id | item1Value | item2Value | item3Value | item4Value |...
                 | item1Id    | item2Id    | item3Id    | item4Id    |...
entiId_list_2_Id | item1Value | item2Value | item3Value | item4Value |...
...
You can also put the item value in the column name and leave the column value empty to reduce size.
For example, you can insert a new item simply by doing:
//columns are sorted by their id if they have any
insert into entityList[entityId][listId][itemId] = item value;
or
//columns are sorted by their value
insert into entityList[entityId][listId][itemvalue] = nothing;
and delete:
delete from entityList where entityId='d' and listId='o' and itemId='n';
Or you can do it from your application by using a rich client like Hector.
I am implementing the following model for storing user related data in my table - I have 2 columns - uid (primary key) and a meta column which stores other data about the user in JSON format.
uid | meta
--------------------------------------------------
1 | {name:['foo'],
| emailid:['foo#bar.com','bar#foo.com']}
--------------------------------------------------
2 | {name:['sann'],
| emailid:['sann#bar.com','sann#foo.com']}
--------------------------------------------------
Is this a better way (performance-wise, design-wise) than the one-column-per-property model, where the table will have many columns like uid, name, emailid?
What I like about the first model is that you can add as many fields as you want; there is no limitation.
Also, now that I have implemented the first model, I was wondering: how do I perform a query on it? For example, how do I fetch all the users who have a name like 'foo'?
Question - Which is the better way to store user-related data in a database (keeping in mind that the number of fields is not fixed): JSON or column-per-field? Also, if the first model is implemented, how do I query the database as described above? Should I use both models, storing all the data that may be searched by a query in separate columns and the other data in JSON (in a different column)?
Update
Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?
Updated 4 June 2017
Given that this question/answer have gained some popularity, I figured it was worth an update.
When this question was originally posted, MySQL had no support for JSON data types and the support in PostgreSQL was in its infancy. Since 5.7, MySQL now supports a JSON data type (in a binary storage format), and PostgreSQL JSONB has matured significantly. Both products provide performant JSON types that can store arbitrary documents, including support for indexing specific keys of the JSON object.
However, I still stand by my original statement that your default preference, when using a relational database, should still be column-per-value. Relational databases are still built on the assumption that the data within them will be fairly well normalized. The query planner has better optimization information when looking at columns than when looking at keys in a JSON document. Foreign keys can be created between columns (but not between keys in JSON documents). Importantly: if the majority of your schema is volatile enough to justify using JSON, you might want to at least consider whether a relational database is the right choice.
That said, few applications are perfectly relational or document-oriented. Most applications have some mix of both. Here are some examples where I personally have found JSON useful in a relational database:
When storing email addresses and phone numbers for a contact, where storing them as values in a JSON array is much easier to manage than multiple separate tables
Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)
Storing configuration data that has no defined schema (if you're building Zapier, or IFTTT and need to store configuration data for each integration)
I'm sure there are others as well, but these are just a few quick examples.
Original Answer
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
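To illustrate why a JSON blob in a plain text column is awkward to search, here is a small sketch in Python with sqlite3 (the table names and sample rows are made up for the example). A text search over the blob has to scan every row and cannot tell keys apart, so it also matches 'foo' inside an e-mail address.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users_json (uid INTEGER PRIMARY KEY, meta TEXT);
    CREATE TABLE users_cols (uid INTEGER PRIMARY KEY, name TEXT, emailid TEXT);
    CREATE INDEX users_cols_name ON users_cols (name);
""")

# Hypothetical sample data, stored once per model.
people = [
    (1, "foo", "foo@example.com"),
    (2, "sann", "sann@foo.example.com"),
]
for uid, name, email in people:
    conn.execute(
        "INSERT INTO users_json VALUES (?, ?)",
        (uid, json.dumps({"name": [name], "emailid": [email]})),
    )
    conn.execute("INSERT INTO users_cols VALUES (?, ?, ?)", (uid, name, email))

# Text search over the blob: every row is scanned, and user 2 matches via the e-mail.
blob_hits = [r[0] for r in conn.execute(
    "SELECT uid FROM users_json WHERE meta LIKE '%foo%' ORDER BY uid")]

# Column search: only the name column is compared. Whether the index is used for
# a prefix pattern like this depends on the database and its collation settings.
col_hits = [r[0] for r in conn.execute(
    "SELECT uid FROM users_cols WHERE name LIKE 'foo%' ORDER BY uid")]
```

Here blob_hits contains both users while col_hits contains only the intended one.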
Like most things "it depends". It's not right or wrong/good or bad in and of itself to store data in columns or JSON. It depends on what you need to do with it later. What is your predicted way of accessing this data? Will you need to cross reference other data?
Other people have answered pretty well what the technical trade-offs are.
Not many people have discussed that your app and features evolve over time and how this data storage decision impacts your team.
One of the temptations of using JSON is to avoid migrating the schema, so if the team is not disciplined, it's very easy to stick yet another key/value pair into a JSON field. There's no migration for it, and no one remembers what it's for. There is no validation on it.
My team used JSON alongside traditional columns in Postgres, and at first it was the best thing since sliced bread. JSON was attractive and powerful, until one day we realized that the flexibility came at a cost and it was suddenly a real pain point. Sometimes that point creeps up really quickly, and then it becomes hard to change because we've built so many other things on top of this design decision.
Over time, as we added new features, having the data in JSON led to more complicated-looking queries than we would have had if we had stuck to traditional columns. So then we started fishing certain key values back out into columns so that we could make joins and comparisons between values. Bad idea. Now we had duplication. A new developer would come on board and be confused: which is the value I should be saving back into, the JSON one or the column?
The JSON fields became junk drawers for little pieces of this and that. No data validation on the database level, no consistency or integrity between documents. That pushed all that responsibility into the app instead of getting hard type and constraint checking from traditional columns.
Looking back, JSON allowed us to iterate very quickly and get something out the door. It was great. However, after we reached a certain team size, its flexibility also allowed us to hang ourselves with a long rope of technical debt, which then slowed down subsequent feature work. Use with caution.
Think long and hard about what the nature of your data is. It's the foundation of your app. How will the data be used over time? And how is it likely TO CHANGE?
Just tossing it out there, but WordPress has a structure for this kind of stuff (at least WordPress was the first place I observed it, it probably originated elsewhere).
It allows limitless keys, and is faster to search than using a JSON blob, but not as fast as some of the NoSQL solutions.
uid | meta_key | meta_val
----------------------------------
1 name Frank
1 age 12
2 name Jeremiah
3 fav_food pizza
.................
EDIT
For storing history/multiple keys
uid | meta_id | meta_key | meta_val
----------------------------------------------------
1 1 name Frank
1 2 name John
1 3 age 12
2 4 name Jeremiah
3 5 fav_food pizza
.................
and query via something like this:
select meta_val from `table` where meta_key = 'name' and uid = 1 order by meta_id desc
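Here is a runnable sketch of that key/value layout in Python with sqlite3. The table name user_meta and the composite index are assumptions for the example; the index on (uid, meta_key, meta_id) is what keeps "latest value for a key" lookups cheap.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_meta (
        meta_id INTEGER PRIMARY KEY,
        uid INTEGER NOT NULL,
        meta_key TEXT NOT NULL,
        meta_val TEXT
    );
    CREATE INDEX user_meta_lookup ON user_meta (uid, meta_key, meta_id);
    INSERT INTO user_meta (meta_id, uid, meta_key, meta_val) VALUES
        (1, 1, 'name', 'Frank'),
        (2, 1, 'name', 'John'),
        (3, 1, 'age', '12'),
        (4, 2, 'name', 'Jeremiah'),
        (5, 3, 'fav_food', 'pizza');
""")

def current_value(conn, uid, key):
    """Return the most recently stored value for a key, or None if absent."""
    row = conn.execute(
        "SELECT meta_val FROM user_meta "
        "WHERE uid = ? AND meta_key = ? ORDER BY meta_id DESC LIMIT 1",
        (uid, key),
    ).fetchone()
    return row[0] if row else None

def history(conn, uid, key):
    """Return every stored value for a key, oldest first."""
    return [r[0] for r in conn.execute(
        "SELECT meta_val FROM user_meta "
        "WHERE uid = ? AND meta_key = ? ORDER BY meta_id",
        (uid, key))]
```

Note that every value is stored as text, so numeric comparisons (for example on age) need a cast in the query or in the application.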
The drawback of the approach is exactly what you mentioned:
it makes it VERY slow to find things, since each time you need to perform a text search on the JSON.
With one value per column, a query instead compares against the whole column value.
Your approach (JSON-based data) is fine for data you don't need to search by and just need to display along with your normal data.
Edit: Just to clarify, the above goes for classic relational databases. Many NoSQL document stores work with JSON-like documents natively, and are probably a better option if that is the desired behavior.
Basically, the first model you are using is called document-based storage. You should have a look at popular NoSQL document-based databases like MongoDB and CouchDB. In document-based databases you store data as JSON documents, and you can then query those documents.
The second model is the popular relational database structure.
If you want to use a relational database like MySQL, then I would suggest using only the second model. There is no point in using MySQL and storing data as in the first model.
To answer your second question, there is no efficient way to query for a name like 'foo' if you use the first model: the database can only run a text search over the whole JSON string.
It seems that you're mainly hesitating whether to use a relational model or not.
As it stands, your example would fit a relational model reasonably well, but the problem may come of course when you need to make this model evolve.
If you only have one (or a few pre-determined) levels of attributes for your main entity (user), you could still use an Entity Attribute Value (EAV) model in a relational database. (This also has its pros and cons.)
If you anticipate that you'll get less structured values that you'll want to search using your application, MySQL might not be the best choice here.
If you were using PostgreSQL, you could potentially get the best of both worlds. (This really depends on the actual structure of the data here... MySQL isn't necessarily the wrong choice either, and the NoSQL options can be of interest, I'm just suggesting alternatives.)
Indeed, PostgreSQL can build indexes on (immutable) functions (which MySQL can't, as far as I know), and in recent versions you could use PLV8 on the JSON data directly to build indexes on specific JSON elements of interest, which would improve the speed of your queries when searching for that data.
EDIT:
Since there won't be too many columns on which I need to perform
search, is it wise to use both the models? Key-per-column for the data
I need to search and JSON for others (in the same MySQL database)?
Mixing the two models isn't necessarily wrong (assuming the extra space is negligible), but it may cause problems if you don't make sure the two data sets are kept in sync: your application must never change one without also updating the other.
A good way to achieve this would be to have a trigger perform the automatic update, by running a stored procedure within the database server whenever an update or insert is made. As far as I'm aware, the MySQL stored procedure language probably lacks support for any sort of JSON processing. Again, PostgreSQL with PLV8 support (and possibly other RDBMSs with more flexible stored procedure languages) should be more useful (updating your relational column automatically using a trigger is quite similar to updating an index in the same way).
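Where a trigger is not available, a simpler substitute is to funnel every write through one function that sets the JSON document and its searchable copy in the same statement. The sketch below does that in Python with sqlite3; it is an application-level alternative, not the trigger approach itself, and the users table layout and save_user helper are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT, meta TEXT)")
conn.execute("CREATE INDEX users_name ON users (name)")

def save_user(conn, uid, meta):
    """Write the JSON document and its searchable name copy together."""
    names = meta.get("name") or [None]
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(
            "INSERT OR REPLACE INTO users (uid, name, meta) VALUES (?, ?, ?)",
            (uid, names[0], json.dumps(meta)),
        )

save_user(conn, 1, {"name": ["foo"], "emailid": ["foo@example.com"]})
save_user(conn, 1, {"name": ["bar"], "emailid": ["foo@example.com"]})

# Search goes through the indexed column; the full document comes back with it.
found = conn.execute("SELECT uid, meta FROM users WHERE name = ?", ("bar",)).fetchall()
stale = conn.execute("SELECT uid FROM users WHERE name = ?", ("foo",)).fetchall()
```

This only stays consistent as long as nothing else writes to the table directly, which is the guarantee a database trigger would give and this sketch does not.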
Short answer:
You have to mix them.
Use JSON for data that you are not going to relate to anything else, like contact data, addresses, and product variables.
Sometimes joins are an overhead, for OLAP for example. Say I have two tables, ORDERS and ORDER_DETAILS. To get all the order details we have to join the two tables, which makes the query slower as the number of rows grows into the millions. A left/right join is also slower than an inner join.
I think that if we add a JSON string/object to the respective ORDERS entry, the JOIN is avoided and report generation will be faster.
You are trying to fit a non-relational model into a relational database, I think you would be better served using a NoSQL database such as MongoDB. There is no predefined schema which fits in with your requirement of having no limitation to the number of fields (see the typical MongoDB collection example). Check out the MongoDB documentation to get an idea of how you'd query your documents, e.g.
db.mycollection.find(
{
name: 'sann'
}
)
As others have pointed out, queries will be slower. I'd suggest adding at least an '_ID' column and querying by that instead.
I am a database admin and developer in MySQL, and I have been working with MySQL for a couple of years. I recently acquired and studied O'Reilly's High Performance MySQL, 2nd Edition, to improve my skills in MySQL advanced features, high performance and scalability, because I have often been frustrated by my lack of advanced MySQL knowledge (a lack which, to a large extent, I still have).
Currently, I am working on an ambitious web project. In this project, we will have quite a lot of content and users from the beginning. I am the designer of the database, and this database must be very fast (some inserts, but mostly, and more importantly, READS).
I want to discuss these requirements here:
There will be several kinds of items
The items have some fields and relations in common
The items also have some special fields and relations that make them different from each other
Those items will have to be listed all together, ordered or filtered by common fields or relations
The items will also have to be listed by type only (for example item_specialA)
I have some basic design doubts, and I would like you to help me decide and learn which design approach would be better for a high-performance MySQL database.
Classical approach
The following diagram shows the classical approach, which is the first one you may think of when thinking in database terms: Database diagram
Centralized approach
But maybe we can improve it with a pseudo object-oriented paradigm, centralizing the common items and the relations in one common item table. It would also be useful for listing all kinds of items: Database diagram
Advantages and disadvantages of each one?
Which approach would you choose, or which changes would you apply given the requirements above?
Thanks all in advance!!
What you have are two distinct data mapping strategies. What you called "classical" is "one table per concrete class" in other sources, and what you called "centralized" is "one table per class" (Mapping Objects to Relational Databases: O/R Mapping In Detail). They both have their advantages and disadvantages (see the reference above). The queries in the first strategy will be faster (you will need to join only 2 tables vs 3 in the second strategy).
I think that you should explore the classic supertype/subtype pattern. There are some examples on SO.
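For reference, a small supertype/subtype sketch in Python with sqlite3. All table and column names apart from item_specialA are hypothetical. Common fields live in one items table, so the mixed listing never touches the subtype tables, and a per-type listing needs a single join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        id INTEGER PRIMARY KEY,
        item_type TEXT NOT NULL,
        title TEXT NOT NULL,
        created_at TEXT NOT NULL
    );
    CREATE INDEX items_type_created ON items (item_type, created_at);
    CREATE TABLE item_specialA (
        item_id INTEGER PRIMARY KEY REFERENCES items(id),
        special_field TEXT
    );
    CREATE TABLE item_specialB (
        item_id INTEGER PRIMARY KEY REFERENCES items(id),
        other_field INTEGER
    );
    INSERT INTO items VALUES
        (1, 'specialA', 'first', '2024-01-01'),
        (2, 'specialB', 'second', '2024-01-02'),
        (3, 'specialA', 'third', '2024-01-03');
    INSERT INTO item_specialA VALUES (1, 'a-one'), (3, 'a-three');
    INSERT INTO item_specialB VALUES (2, 7);
""")

# Mixed listing ordered by a common field: only the supertype table is read.
mixed = conn.execute(
    "SELECT id, title FROM items ORDER BY created_at DESC").fetchall()

# Per-type listing: one join from the supertype to the matching subtype table.
only_a = conn.execute("""
    SELECT i.id, i.title, a.special_field
    FROM items i JOIN item_specialA a ON a.item_id = i.id
    WHERE i.item_type = 'specialA'
    ORDER BY i.created_at
""").fetchall()
```

Relations shared by all item kinds (comments, tags and so on) can then reference items.id once instead of once per concrete table.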
If you're looking mostly for speed, consider selective use of MyISAM tables, use a centralized "object" table, and add just one additional table with the correct indexes, in this form:
object_type | object_id | property_name | property_value
user | 1 | photos | true
city | 2 | photos | true
user | 5 | single | true
city | 2 | metro | true
city | 3 | population | 135000
And so on. Lookups on primary keys or indexed keys (object_type, object_id, property_name), for example, will be blazing fast. You also avoid ending up with 457 tables as new properties appear.
It isn't exactly a well-designed or perfectly normalized database, and if you are planning a big long-term site you should consider caching, or at least a denormalized approach: denormalized MySQL tables like this one, Redis, or maybe MongoDB.
I have a MySQL DB containing entry for pages of a website.
Let's say it has fields like:
Table pages:
id | title | content | date | author
Each of the pages can be voted by users, so I have two other tables
Table users:
id | name | etc etc etc
Table votes:
id | id_user | id_page | vote
Now, I have a page where I show a list of the pages (10-50 at a time) with various information along with the average vote of the page.
So, I was wondering if it were better to:
a) Run the query to display the pages (note that this is already fairly heavy as it queries three tables) and then for each entry do another query to calculate the mean vote (or add a 4th join to the main query?).
or
b) Add an "average vote" column to the pages table, which I will update (along with the votes table) when a user votes on the page.
nico
Use the database for what it's meant for; option a is by far your best bet. It's worth noting that your query isn't actually particularly heavy, joining three tables; SQL really excels at this sort of thing.
Be cautious of this sort of attempt at premature optimization of SQL; SQL is far more efficient at what it does than most people think it is.
Note that another benefit of using your option a is that there's less code to maintain, and less chance of data diverging as code gets updated; it's a lifecycle benefit, and these are too often ignored in favor of minuscule optimization benefits.
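A sketch of option a as a single query, in Python with sqlite3. The schema follows the tables in the question; treating pages.author as a reference to users.id and the index on votes.id_page are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author INTEGER REFERENCES users(id)
    );
    CREATE TABLE votes (
        id INTEGER PRIMARY KEY,
        id_user INTEGER REFERENCES users(id),
        id_page INTEGER REFERENCES pages(id),
        vote INTEGER
    );
    CREATE INDEX votes_page ON votes (id_page);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO pages VALUES (1, 'Home', 1), (2, 'About', 2);
    INSERT INTO votes (id_user, id_page, vote) VALUES (1, 1, 4), (2, 1, 5), (1, 2, 3);
""")

# One query for the whole listing: the average is computed per page by GROUP BY,
# and LEFT JOIN keeps pages that have no votes yet (their average is NULL).
listing = conn.execute("""
    SELECT p.id, p.title, u.name, AVG(v.vote) AS avg_vote
    FROM pages p
    JOIN users u ON u.id = p.author
    LEFT JOIN votes v ON v.id_page = p.id
    GROUP BY p.id, p.title, u.name
    ORDER BY p.id
    LIMIT 50
""").fetchall()
```

This avoids the per-row follow-up queries described in option a; the average arrives with the page data in one round trip.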
You might "repeat yourself" (violate DRY) for the sake of performance. The trade-offs are (a) extra storage, and (b) extra work in keeping everything self-consistent within your DB.
There are advantages/disadvantages both ways. Optimizing too early has its own set of pitfalls, though.
Honestly, for this issue, I would recommend redundant information. Multiple votes for multiple pages can really create a heavy load for a server, in my opinion. If you foresee real traffic on your website, of course... :-)
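If you do go for redundant data, the main risk is the stored value drifting from the votes table. One possible sketch, in Python with sqlite3, keeps a running count and sum on the page (a variation on the single "average vote" column, chosen here because it can be updated without re-reading all votes) and updates both tables in one transaction. The column names vote_count and vote_sum are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pages (
        id INTEGER PRIMARY KEY,
        title TEXT,
        vote_count INTEGER NOT NULL DEFAULT 0,
        vote_sum INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE votes (
        id INTEGER PRIMARY KEY,
        id_user INTEGER,
        id_page INTEGER REFERENCES pages(id),
        vote INTEGER
    );
    INSERT INTO pages (id, title) VALUES (1, 'Home');
""")

def cast_vote(conn, user_id, page_id, vote):
    """Record the vote and update the page totals atomically."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO votes (id_user, id_page, vote) VALUES (?, ?, ?)",
            (user_id, page_id, vote),
        )
        conn.execute(
            "UPDATE pages SET vote_count = vote_count + 1, "
            "vote_sum = vote_sum + ? WHERE id = ?",
            (vote, page_id),
        )

def average_vote(conn, page_id):
    """Read the precomputed totals; no scan of the votes table is needed."""
    count, total = conn.execute(
        "SELECT vote_count, vote_sum FROM pages WHERE id = ?", (page_id,)
    ).fetchone()
    return total / count if count else None

cast_vote(conn, 1, 1, 4)
cast_vote(conn, 2, 1, 5)
```

Changing or deleting a vote needs the matching adjustment to the totals, which is the extra consistency work this approach buys into.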
We have a MySQL database table for products. We are using a cache layer to reduce database load, but we think it's a good idea to minimize the data that needs to be stored in the cache layer to speed up the application further.
All the products in the database that are visible to visitors have a price attached to them:
The prices are stored in a different table, called prices. There are multiple price categories depending on which discount level each visitor (customer) qualifies for. From time to time there are campaigns, which means that a special price is available for each product. The special prices are stored in a table called specials.
Is it bad to make a temp table that binds the tables together?
It would only hold the necessary information and would of course be cached.
productId | hasPrice | hasSpecial
----------|----------|-----------
1         | 1        | 0
2         | 1        | 1
That way, it would be super easy to know whether a specific product really has a price, without having to iterate through the complete prices or specials table each time a product is listed or presented.
Are temp tables a common thing for web applications, or is it just bad design?
If you're going to cache this data anyways, does it really need to be in a temp table? You would only incur the overhead of the query when you needed to rebuild the cache, so the temp table might not even be necessary.
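As a sketch of the "just query when rebuilding the cache" route, the flags can be computed with EXISTS subqueries and handed to the cache layer directly. This example uses Python with sqlite3; the columns other than productId and the sample rows are assumptions, and the returned dict stands in for whatever the cache actually stores.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (productId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE prices (productId INTEGER, priceCategory TEXT, price REAL);
    CREATE TABLE specials (productId INTEGER, price REAL);
    CREATE INDEX prices_product ON prices (productId);
    CREATE INDEX specials_product ON specials (productId);
    INSERT INTO products VALUES (1, 'widget'), (2, 'gadget'), (3, 'unpriced');
    INSERT INTO prices VALUES
        (1, 'retail', 9.5), (2, 'retail', 20.0), (2, 'wholesale', 15.0);
    INSERT INTO specials VALUES (2, 12.0);
""")

def load_price_flags(conn):
    """Compute hasPrice/hasSpecial per product without a temp table."""
    rows = conn.execute("""
        SELECT p.productId,
               EXISTS (SELECT 1 FROM prices pr WHERE pr.productId = p.productId),
               EXISTS (SELECT 1 FROM specials s WHERE s.productId = p.productId)
        FROM products p
    """)
    return {pid: (bool(has_price), bool(has_special))
            for pid, has_price, has_special in rows}

# This dict is what would be written to the cache when it is rebuilt.
price_flags = load_price_flags(conn)
```

With indexes on productId in both tables, each EXISTS is an index probe rather than an iteration over the whole prices or specials table.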
You should approach it like any other performance problem: decide how much performance is necessary, then iterate, testing on production-grade hardware in your lab. Do not do needless optimisations.
You should profile your app and discover whether it's doing too many queries or the queries themselves are slow; in my experience, most cases of web-app slowness are caused by doing too many queries, even when the queries are very simple.
Normally the best engineering solution is to restructure the database, in some cases denormalising, so that the common read use-cases require fewer queries. Caching may be helpful as well, but refactoring so you need fewer queries is often the best option.
Essentially, you can increase the amount of work on the write path to reduce the amount on the read path, if you are planning to do a lot more reading than writing.