Structuring a database to handle unknown name/value pairs - MySQL

Here's the idea: I expect to receive thousands of queries, each containing a number of name/value pairs. These start off as associative arrays, so I have fairly good control over what can happen to the data. The NVPs vary depending on the source. For example, if the source is "A", I might receive the array (in JSON for ease of explanation): {'key1':'test1','key2':'test2'}, but if the source is "B", I might receive {'DifferentKey1':'test1','DifferentKey2':'test2'}. I'm selecting which keys I want to store in my database, so in this case I might want to keep only DifferentKey1 from source B's array and discard the rest.
My main issue here is that these arrays could technically be completely unrelated content-wise. They share a very general association (they're both arrays containing stats), but beyond that they're very different (the sources are different, i.e. different games/sports).
I was thinking SQL: a table filled with games and their respective ids would be a good way of linking general NVP strings. For example:
Games table:
| id | name   |
|----|--------|
| 1  | golf   |
| 2  | soccer |
NVP table:
| id | game_id | nvp                                                       |
|----|---------|-----------------------------------------------------------|
| 1  | 1       | team1score=87;team2score=94;team3score=73;               |
| 2  | 2       | team1score=2;team2score=1;extratime=200;numyellowcards=4; |
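For concreteness, a minimal DDL sketch of that layout (the column types are my assumptions, not settled decisions):
CREATE TABLE games (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);
CREATE TABLE nvp (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    game_id INT NOT NULL,
    -- the raw pair string, e.g. 'team1score=87;team2score=94;'
    nvp     TEXT NOT NULL,
    FOREIGN KEY (game_id) REFERENCES games(id)
);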
Hope that's clear enough. Do you see what I mean, though? If there's an indeterminate amount of data that I may use, how can I structure a table? Thanks.
Edit: I guess I should note that this setup obviously WOULD work; the question is whether it's the best setup performance-wise. Maybe not? I'm not sure; let's see what you guys can come up with!

SQL databases are great for highly relational data, but in a case like this, where the data is not relational and there is no fixed schema, you might be better off using a NoSQL solution. There are a lot of them, and I haven't used them enough to be sure which would work best for you. If your data can fit in RAM, Redis is great.

The common way of storing name/value pairs in a relational database is known as "Entity/Attribute/Value" (EAV). You'll find a lot of discussion about it on Stack Overflow.
It all depends on what your application wants to do with the data. Storing it is easy; querying is much harder.
If you're building a sports application, you are likely to have domain concepts you want to support: for football, showing a league position based on games played; for golf, showing the number of birdies or eagles. You will probably want to show all the games a particular team/player has played in a season.
Some things are easy to build in a relational database and have amazing performance over huge data sets. Find the highest-scoring game ever, find the last game of the 1998 season, find all the games featuring player x: all a great fit, as long as you can build a schema that represents those domain concepts.
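For instance, against an illustrative games(id, season, game_date, team1score, team2score) table (names are assumptions, not from the question), those lookups are one-liners:
-- Highest-scoring game ever:
SELECT * FROM games ORDER BY (team1score + team2score) DESC LIMIT 1;
-- Last game of the 1998 season:
SELECT * FROM games WHERE season = 1998 ORDER BY game_date DESC LIMIT 1;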
From what you write, it does sound like you will have a fixed number of sports. The data coming into your system doesn't sound particularly structured, but you should be able to map it onto a domain model. If that's true, I recommend building a relational schema that reflects the domain logic of each sport.
If that's not true - if you can't reason about the domain in advance - the relational model is a bad fit, and NoSQL is probably better. But you will run into the same problem - extracting meaning from name/value pairs is going to be hard!

Related

EAV vs null vs Mixed

I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how to model the product information?
There are some approaches, each with its own advantages and disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible and working with this monster-table.
Pro:
Easy queries
Easy layout
Con:
Lots of NULL values
The application code becomes sensitive to the product type (different types require different columns)
EAV-Pattern
Obviously the EAV pattern can provide a nicer solution for this. However, I've worked with EAV in the past, and when it comes down to performance, it can become a problem with a huge number of entries.
Searching is easy, but listing a "normalized table" requires one join per actual column -> slow (a sketch of this follows the Pro/Con list below).
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
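To illustrate the join-per-column problem, here is a minimal sketch, with an assumed EAV layout of attributes(product_id, name, value), of rebuilding one three-attribute product "row":
SELECT p.id,
       a1.value AS color,
       a2.value AS weight,
       a3.value AS price
FROM products p
LEFT JOIN attributes a1 ON a1.product_id = p.id AND a1.name = 'color'
LEFT JOIN attributes a2 ON a2.product_id = p.id AND a2.name = 'weight'
LEFT JOIN attributes a3 ON a3.product_id = p.id AND a3.name = 'price';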
Single Table per category
Basically the opposite of the EAV pattern: create one table per product type, e.g. "cats", "dogs", "cars", ...
While this might be possible for a small number of categories, it becomes a nightmare for a steadily growing number of categories that you have to maintain.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet, I found recommendations to mix both approaches: use a single table for the common information, while grouping other attributes into "attribute groups" organized in the EAV fashion.
However, I think this would basically import the drawbacks of EACH approach: you need to work with regular tables (basic information) AND do a huge number of joins to get ALL the information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems harder to query and work with than a regular database layout.
Automating single tables
Another idea was automating the "create one table per category" part (and therefore automating the queries on those tables), while maintaining a "master table" containing just the id and the category information, in order to get the best performance for an undetermined number of tables.
i.e.:
Products
| id | category | actualId |
|----|----------|----------|
| 1  | cat      | 1        |
| 2  | car      | 1        |
cats
| id | color | mew  |
|----|-------|------|
| 1  | white | true |
cars
| id | wheels | bhp |
|----|--------|-----|
| 1  | 4      | 123 |
The (abstract) Products table allows querying for everything, while details are available via an easy join on "actualId" with the responsible table.
However, this leads to problems if you want to run a "show all" query, because that is not solvable by SQL alone: the table name in the join needs to be explicit in the query.
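A sketch of the per-category lookup this describes (names from the tables above); the point is that "cats" has to be spelled out, so one query cannot span all categories:
-- Works for one known category...
SELECT p.id, c.color, c.mew
FROM Products p
JOIN cats c ON c.id = p.actualId
WHERE p.category = 'cat';
-- ...but a "show all categories" query would need a UNION over every
-- such table, maintained by hand or generated by application code.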
What other options are available? There are a lot of "webshops", each dealing with this problem more or less - how do they solve it in an efficient way?
I strongly disagree with your claim that the "monster table" approach leads to "easy queries", and with the assumption that the EAV approach will cause performance issues (premature optimization?). And EAV doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.key, '[', ext.type, ']', ext.value))
FROM base_attributes base
LEFT JOIN extended_attributes ext
       ON base.id = ext.id
WHERE base.id = ?
GROUP BY base.id, base.other_attributes;
You would need to do some parsing of the result, but a wee bit of polishing would give you something parseable as JSON or XML, without putting your data inside anonymous blobs.
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).

Storing JSON in database vs. having a new column for each key

I am implementing the following model for storing user-related data in my table: two columns, uid (the primary key) and meta, which stores other data about the user in JSON format.
uid | meta
--------------------------------------------------
 1  | {name: ['foo'],
    |  emailid: ['foo@bar.com', 'bar@foo.com']}
--------------------------------------------------
 2  | {name: ['sann'],
    |  emailid: ['sann@bar.com', 'sann@foo.com']}
--------------------------------------------------
Is this a better way (performance-wise, design-wise) than the one-column-per-property model, where the table would have many columns like uid, name, emailid?
What I like about the first model is that you can add as many fields as you want; there is no limitation.
Also, now that I have implemented the first model, how do I perform a query on it? For example, I want to fetch all the users who have a name like 'foo'.
Question - Which is the better way to store user-related data (keeping in mind that the number of fields is not fixed): JSON or column-per-field? Also, if the first model is implemented, how do I query the database as described above? Should I use both models, storing the data which may be searched by a query in separate columns and the other data as JSON (in another column)?
Update
Since there won't be too many columns on which I need to perform searches, is it wise to use both models: key-per-column for the data I need to search, and JSON for the rest (in the same MySQL database)?
Updated 4 June 2017
Given that this question/answer has gained some popularity, I figured it was worth an update.
When this question was originally posted, MySQL had no support for JSON data types and the support in PostgreSQL was in its infancy. Since version 5.7, MySQL supports a JSON data type (in a binary storage format), and PostgreSQL's JSONB has matured significantly. Both products provide performant JSON types that can store arbitrary documents, including support for indexing specific keys of a JSON object.
However, I still stand by my original statement that your default preference, when using a relational database, should still be column-per-value. Relational databases are still built on the assumption that the data within them will be fairly well normalized. The query planner has better optimization information when looking at columns than when looking at keys in a JSON document. Foreign keys can be created between columns (but not between keys in JSON documents). Importantly: if the majority of your schema is volatile enough to justify using JSON, you might want to at least consider whether a relational database is the right choice.
That said, few applications are perfectly relational or document-oriented. Most applications have some mix of both. Here are some examples where I personally have found JSON useful in a relational database:
When storing email addresses and phone numbers for a contact, where storing them as values in a JSON array is much easier to manage than multiple separate tables
Saving arbitrary key/value user preferences (where the value can be boolean, textual, or numeric, and you don't want to have separate columns for different data types)
Storing configuration data that has no defined schema (if you're building Zapier, or IFTTT and need to store configuration data for each integration)
I'm sure there are others as well, but these are just a few quick examples.
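As a concrete illustration of the key-indexing support mentioned above, here is a hedged sketch (MySQL 5.7+; the table and key names are assumptions) that exposes one JSON key as a generated column so it can be indexed:
CREATE TABLE user_prefs (
    uid   INT PRIMARY KEY,
    prefs JSON,
    -- extract one JSON key into a generated column for indexing:
    theme VARCHAR(32) GENERATED ALWAYS AS
          (JSON_UNQUOTE(JSON_EXTRACT(prefs, '$.theme'))) STORED,
    INDEX idx_theme (theme)
);
-- This query can now use idx_theme instead of scanning every document:
SELECT uid FROM user_prefs WHERE theme = 'dark';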
Original Answer
If you really want to be able to add as many fields as you want with no limitation (other than an arbitrary document size limit), consider a NoSQL solution such as MongoDB.
For relational databases: use one column per value. Putting a JSON blob in a column makes it virtually impossible to query (and painfully slow when you actually find a query that works).
Relational databases take advantage of data types when indexing, and are intended to be implemented with a normalized structure.
As a side note: this isn't to say you should never store JSON in a relational database. If you're adding true metadata, or if your JSON is describing information that does not need to be queried and is only used for display, it may be overkill to create a separate column for all of the data points.
Like most things, "it depends". Storing data in columns or JSON isn't right or wrong / good or bad in and of itself. It depends on what you need to do with it later. What is your predicted way of accessing this data? Will you need to cross-reference other data?
Other people have answered the technical trade-offs pretty well.
Fewer people have discussed how your app and features evolve over time, and how this data-storage decision impacts your team.
One of the temptations of using JSON is to avoid migrating the schema, so if the team is not disciplined, it's very easy to stick yet another key/value pair into a JSON field. There's no migration for it, no one remembers what it's for, and there is no validation on it.
My team used JSON alongside traditional columns in Postgres, and at first it was the best thing since sliced bread. JSON was attractive and powerful, until one day we realized that its flexibility came at a cost and it suddenly became a real pain point. Sometimes that point creeps up really quickly, and then it becomes hard to change because we've built so many other things on top of this design decision.
Over time, as we added new features, having the data in JSON led to more complicated-looking queries than if we had stuck to traditional columns. So we started fishing certain key values back out into columns so that we could make joins and comparisons between values. Bad idea. Now we had duplication. A new developer would come on board and be confused: which value should I be saving back into, the JSON one or the column?
The JSON fields became junk drawers for little pieces of this and that: no data validation at the database level, no consistency or integrity between documents. That pushed all of that responsibility into the app, instead of getting hard type and constraint checking from traditional columns.
Looking back, JSON allowed us to iterate very quickly and get something out the door, and that was great. However, after we reached a certain team size, its flexibility also allowed us to hang ourselves with a long rope of technical debt, which then slowed down subsequent feature development. Use with caution.
Think long and hard about the nature of your data; it's the foundation of your app. How will the data be used over time? And how is it likely TO CHANGE?
Just tossing it out there, but WordPress has a structure for this kind of thing (at least, WordPress was the first place I observed it; it probably originated elsewhere).
It allows limitless keys, and is faster to search than using a JSON blob, but not as fast as some of the NoSQL solutions.
uid | meta_key | meta_val
----------------------------------
 1  | name     | Frank
 1  | age      | 12
 2  | name     | Jeremiah
 3  | fav_food | pizza
.................
EDIT
For storing history/multiple keys
uid | meta_id | meta_key | meta_val
----------------------------------------------------
 1  | 1       | name     | Frank
 1  | 2       | name     | John
 1  | 3       | age      | 12
 2  | 4       | name     | Jeremiah
 3  | 5       | fav_food | pizza
.................
and query via something like this:
SELECT meta_val
FROM `table`
WHERE meta_key = 'name' AND uid = 1
ORDER BY meta_id DESC;
The drawback of the JSON approach is exactly what you mentioned:
it makes it VERY slow to find things, since each time you need to perform a text search on the blob.
A value per column, by contrast, matches the whole string exactly.
Your approach (JSON based data) is fine for data you don't need to search by, and just need to display along with your normal data.
Edit: Just to clarify, the above goes for classic relational databases. NoSQL databases use JSON internally, and are probably a better option if that is the desired behavior.
Basically, the first model you are using is called document-based storage. You should have a look at popular NoSQL document databases like MongoDB and CouchDB. Basically, in document-based DBs, you store data as JSON documents and then you can query those documents.
The second model is the familiar relational database structure.
If you want to use a relational database like MySQL, I would suggest using only the second model. There is no point in using MySQL and storing data as in the first model.
To answer your second question: there is no way to query by name like 'foo' if you use the first model.
It seems that you're mainly hesitating over whether to use a relational model or not.
As it stands, your example would fit a relational model reasonably well, but the problem may of course come when you need to make this model evolve.
If you only have one (or a few pre-determined) levels of attributes for your main entity (user), you could still use an Entity Attribute Value (EAV) model in a relational database. (This also has its pros and cons.)
If you anticipate that you'll get less structured values that you'll want to search using your application, MySQL might not be the best choice here.
If you were using PostgreSQL, you could potentially get the best of both worlds. (This really depends on the actual structure of the data here... MySQL isn't necessarily the wrong choice either, and the NoSQL options can be of interest, I'm just suggesting alternatives.)
Indeed, PostgreSQL can build indexes on (immutable) functions (which MySQL can't, as far as I know), and in recent versions you could use PLV8 on the JSON data directly to build indexes on specific JSON elements of interest, which would improve the speed of your queries when searching for that data.
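For illustration, in current PostgreSQL the built-in JSON operators are enough for this, with no PLV8 required. A hedged sketch, assuming a users table with a jsonb column named meta:
-- Expression index on one key of the document:
CREATE INDEX idx_users_meta_name ON users ((meta->>'name'));
-- Searches on that key can now use the index:
SELECT uid FROM users WHERE meta->>'name' = 'foo';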
EDIT:
"Since there won't be too many columns on which I need to perform search, is it wise to use both the models? Key-per-column for the data I need to search and JSON for others (in the same MySQL database)?"
Mixing the two models isn't necessarily wrong (assuming the extra space is negligible), but it may cause problems if you don't make sure the two data sets are kept in sync: your application must never change one without also updating the other.
A good way to achieve this would be to have a trigger perform the automatic update, running a stored procedure within the database server whenever an update or insert is made. As far as I'm aware, MySQL's stored procedure language probably lacks support for any sort of JSON processing. Again, PostgreSQL with PLV8 support (and possibly other RDBMSs with more flexible stored procedure languages) should be more useful (updating your relational column automatically via a trigger is quite similar to updating an index the same way).
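(Worth noting: per the 2017 update above, MySQL 5.7+ has since gained JSON functions, so this trigger idea now works there too. A hedged sketch, with the table and column names assumed, and meta assumed to hold a scalar $.name; a matching BEFORE INSERT trigger would be needed as well:)
CREATE TRIGGER users_sync_name
BEFORE UPDATE ON users
FOR EACH ROW
SET NEW.name = JSON_UNQUOTE(JSON_EXTRACT(NEW.meta, '$.name'));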
Short answer
You have to mix the two:
use JSON for data that you are not going to create relations with, like contact data, addresses, and product variables.
Sometimes joins on tables are an overhead, say for OLAP. Suppose I have two tables: an ORDERS table and an ORDER_DETAILS table. To get all the order details, we have to join the two tables, and this makes the query slower as the number of rows grows, say into the millions (and a LEFT/RIGHT JOIN is slower than an INNER JOIN).
I think if we store a JSON string/object in the respective ORDERS entry, the JOIN can be avoided, and report generation will be faster.
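A hedged sketch of that denormalization (MySQL 5.7+ JSON column; the table and column names are assumptions; note it trades the join away at the cost of duplicating the line items):
-- Copy the line items into a JSON column on orders:
ALTER TABLE orders ADD COLUMN details JSON;
-- Reports can then read one row per order, with no join:
SELECT id, order_date, details FROM orders;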
You are trying to fit a non-relational model into a relational database. I think you would be better served using a NoSQL database such as MongoDB. There is no predefined schema, which fits in with your requirement of having no limitation on the number of fields (see the typical MongoDB collection example). Check out the MongoDB documentation to get an idea of how you'd query your documents, e.g.
db.mycollection.find(
    {
        name: 'sann'
    }
)
As others have pointed out, queries will be slower. I'd suggest adding at least an '_id' column and querying by that instead.

CSVs in database columns - not a good idea? [duplicate]

This question already has answers here:
Is storing a delimited list in a database column really that bad?
(10 answers)
Closed 8 years ago.
A while ago, I came to the realization that the way I would like to hold a player's skills in a game would be in CSV format. In the player's stats I made a varchar of skills stored as CSV (1,6,9,10, etc.). I made a 'skills' table with affiliated stats for each skill (name, effect), and when it comes time to see what skills a player has, all I have to do is query that single column and use PHP's str_getcsv() to check whether a certain skill exists, because the skills will be in an array.
However, my coworker suggests that a superior system is to have each skill simply be an entry in a master "skills" table that each player uses, with each skill having a foreign key ID pointing to the player. I just query all rows in this table, and what's returned is their skills!
At first I thought this wouldn't be very good at all, but it appears the Internet disagrees. I understand that it's less searchable, but it was never my intention to ask "does the player have x skill?" or "show me all players with this skill!". At worst, if I wanted such data, I'd just write a PHP report for it that would, admittedly, be slow.
But it appears as though the normalized way really is faster?! I'm having trouble finding a hard answer that extends beyond "yeah, it's good and normalized". Can Stack Overflow help me out?
Edit: Thanks, guys! I never realized how bad this was. And sorry about the dupe, but believe me, I didn't type all of that without at least checking for dupes. :P
Putting comma-separated values into a single field in a database is not just a bad idea, it is the incarnation of Satan expressed in a database model.
It cannot accurately represent a great many situations (cases in which the value contains a comma or something else your CSV-consuming code has trouble with); it often has problems with values nested in other values; it cannot be properly indexed; it cannot be used in database JOINs; it is difficult to dedupe; it cannot have additional information added to it (the number of times the skill was earned, in your case, or a skill level); it cannot participate in relational integrity; and it cannot enforce type constraints. The list is almost endless.
This is especially true in MySQL, which has the very convenient group_concat function that makes it easy to present this data as a comma-separated string when needed, while still maintaining the full functionality and speed of a normalized database.
You gain nothing from the comma-separated approach but lose searchability and performance. Get Satan behind thee, and normalize your data.
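For illustration, assuming a normalized layout of players(id, name), skills(id, name, effect), and a player_skills(player_id, skill_id) join table (names are assumptions; the next answer sketches the DDL), group_concat hands the application its familiar CSV on the way out:
SELECT p.id,
       GROUP_CONCAT(s.name ORDER BY s.name) AS skills_csv
FROM players p
JOIN player_skills ps ON ps.player_id = p.id
JOIN skills s ON s.id = ps.skill_id
GROUP BY p.id;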
Well, there are things such as scalability to consider. What if you need to add or remove a skill? How about renaming a skill? What happens if the number of skills outgrows the size of your field? It's bad practice to have to resize a field just to accommodate something like this.
What about maintainability? Could another developer come in and understand what you've done? What happens if the same skill is given to a player twice?
Your coworker's suggestion is not correct either. You would have 3 tables in this case: a master player table, a skills table, and a table that has a relationship to both, creating a many-to-many relationship that allows a single skill to be associated with many players and many players to have the same skill.
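A minimal sketch of that three-table layout (column names and types are assumptions):
CREATE TABLE players (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(50) NOT NULL
);
CREATE TABLE skills (
    id     INT AUTO_INCREMENT PRIMARY KEY,
    name   VARCHAR(50) NOT NULL,
    effect VARCHAR(255)
);
-- The join table makes the relationship many-to-many, and its composite
-- primary key prevents granting the same skill to a player twice:
CREATE TABLE player_skills (
    player_id INT NOT NULL,
    skill_id  INT NOT NULL,
    PRIMARY KEY (player_id, skill_id),
    FOREIGN KEY (player_id) REFERENCES players(id),
    FOREIGN KEY (skill_id)  REFERENCES skills(id)
);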
Since the database will index the content (assuming you use indexes), it will be very, very fast to search the content and get the desired results. Remember: databases are designed to hold a lot of information, and a relational database such as MySQL is made for relations.
Another matter is the maintainability of the system. It will be much, much easier to maintain a system that's normalized, and it will be easier when you need to remove or add a skill.
When you want to get the skill information for a player from the database, you can easily fetch the data connected to the relevant skills with a simple JOIN.
I say: Let the database do what it does best - handle the data. And let your programming do what it should do ;)

How can several different datatypes be saved in one table

This is my situation: I am constructing an ad-like application in Django and MySQL. I am using a flexible-ad approach where we have:
a table with ad categories (several categories such as home, furniture, cars, etc.)
id_category
name
a table with details for the ad categories (home: area, square meters. car: seats, color.)
id_detail
id_category (the category the detail describes)
name
type (boolean, char, int, long, etc.)
the ad table (I am selling a house. I am selling a car.)
id_ad
id_category
text
date
a table where I plan to consolidate the details of the ads (home: A-area, 500 sq. meters. car: 5 seats, red.)
id_detail_ad
id_ad
id_detail
value
Is this possible? Can I have a table of details for all the ads, even if the details include numbers, text, booleans, etc.? Or would I have to save them all as text and then interpret them via code accordingly? Please share your opinions. Thank you.
Relational databases don't support user-defined data types the way OODBs do. I recommend separating the details value into several typed columns, as this will improve performance as well as future usability and scalability.
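A hedged sketch of that suggestion, applied to the detail_ad table from the question: give each data type its own nullable column and populate exactly one of them, per the detail's declared type (the value_* column names are assumptions):
CREATE TABLE detail_ad (
    id_detail_ad INT AUTO_INCREMENT PRIMARY KEY,
    id_ad        INT NOT NULL,
    id_detail    INT NOT NULL,
    -- exactly one of these is populated, matching the detail's "type":
    value_int    INT          NULL,
    value_bool   BOOLEAN      NULL,
    value_text   VARCHAR(255) NULL
);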
Consider having one table for each ad type. This is the old-school RDBMS way to model the data you're describing. It means you'll have to add a table to your database every time you add an ad type. I think you'll find this is not as bad as it sounds. The benefit of this approach will be less code written for data management and/or better use of your object/relational mapping library (disclaimer: I've never used Django, so your mileage may vary, but this would definitely apply to other tools).
It's a bit of a hack, but you can store any kind of information you don't care about indexing in a text column in MySQL. You can use either pickle, if you don't care about the information being readable, or (better) jsonpickle, which is human-readable and easy to access with jsonpickle.encode and jsonpickle.decode. We do this at my job, and it works swimmingly.
These tables look an awful lot like key-value pairs... in relational database design, this is a big no-no; you want to do what Ben is recommending.
Yet I've seen that a lot of web CMSes use this sort of arrangement in their table structures, almost as if they expect the tables to be structured that way. If Django wants you to solve the problem this way, you might have to do it that way, in which case the other comments about pickling data in/out of columns, or using a BLOB column, work well.
However, if Django expects tables like this then, quite frankly, the Django designers didn't know a thing about how to use a relational database correctly or efficiently, and their framework should use a different database engine, one that operates on key-value pairs. It's a poor design if they force the user/programmer into situations like that.

Anyone used SQL Server 2008 HierarchyID type to store genealogy data?

I have a genealogical database (about sheep, actually) that is used by breeders to research genetic information. In each record I store fatherid and motherid. In a separate table I store complete 'roll-up' information so that I can quickly tell the complete family tree of any animal without recursing through the entire database...
I recently discovered the HierarchyID type built into SQL Server 2008. On the surface it sounds promising, but I am wondering if anyone has used it enough to know whether or not it would be appropriate for my type of app (i.e. two parents, multiple kids)? All the samples I have found/read so far deal with manager/employee-type relationships, where a given boss can have multiple employees and each employee has a single boss.
The needs of my app are similar, but not quite the same.
I am sure I will dig into this new technology anyway, but it would be nice to shortcut my research if someone already knew that it was not designed in such a fashion that it would allow me to make use of it.
I am also curious what kind of performance people are seeing using this new data type versus other methods that do the same thing.
Assuming each sheep has one male parent and one female parent, and that no sheep can be its own parent (leading to an Ovine Temporal Paradox), then what about using two HierarchyIDs?
CREATE TABLE dbo.Sheep(
    MotherHID hierarchyid NOT NULL,
    FatherHID hierarchyid NOT NULL,
    Name nvarchar(50) NOT NULL
)
GO
ALTER TABLE dbo.Sheep
ADD CONSTRAINT PK_Sheep PRIMARY KEY CLUSTERED (
    MotherHID,
    FatherHID
)
GO
By making them a joint PK, you'd be uniquely identifying each sheep as the product of its maternal hierarchy and its paternal hierarchy.
There may be some inherent problem lurking here, so proceed cautiously with a couple of simple prototypes, but initially it seems like it would work for you.
I can't see how it would work. In a regular hierarchy there is a single chain to the root, so the type can store the path (which is what the binary is) to each node. However, with multiple parents this isn't possible: even if you split matriarchy and patriarchy, you still have 1 mother, 2 grandmothers, 4 great-grandmothers, etc. (not even getting into some of the more "interesting" scenarios possible, especially with livestock). There is no single logical path to encode, so no: I don't think this can work in your case.
I'm happy to be corrected, though.
Using two separate HierarchyIDs to indicate father and mother would work well.
However, you definitely would NOT want to use those as a unique identifier of the row, since it's a two-to-many situation (two sheep can have multiple children together).
I don't see anything inherently wrong with using HierarchyId for ancestry, for sheep at least. For people, the relationships are much more complicated than "this person begat that person", so obviously the use would be limited to breeding.
SQL Server's hierarchyid is not a robust solution for many genealogy analytic questions. It is based on ORDPATH, and I've used it for a while in genealogy, but there are too many scenarios in genealogy that cannot be readily addressed with ORDPATH methods for directed acyclic graphs. A graph database is much more robust and well suited for genealogy. I use Neo4j: http://stumpf.org/genealogy-blog/graph-databases-in-genealogy.