Decorating an existing relational SQL database with NoSQL features - MySQL

We have a relational database (MySQL) with a table that stores "Whatever". This table has many fields that store properties of different (logical and data-) types. Now the request is to add another 150 new, unrelated properties.
We certainly do not want to add 150 new columns. I see two other options:
Add a simple key-value table (ID, FK_Whatever, Key, Value and maybe Type) where *FK_Whatever* references the Whatever ID and Key would be the name of the property. Querying with JOIN would work (see the sketch after the two options).
Add a large text field to the Whatever table and serialize the 150 new properties into it (as XML, maybe). That would, in a way, be the NoSQL way of storing data. Querying those fields would mean implementing some smart full-text statements.
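For illustration, here is a minimal sketch of option 1 (the table and column names are assumptions, not an established schema). Note also that on MySQL 5.7+, option 2 could use the native JSON column type and JSON_EXTRACT instead of full-text tricks:
CREATE TABLE whatever_properties (
  ID INT AUTO_INCREMENT PRIMARY KEY,
  FK_Whatever INT NOT NULL,
  `Key` VARCHAR(100) NOT NULL,
  `Value` VARCHAR(255),
  UNIQUE (FK_Whatever, `Key`),
  FOREIGN KEY (FK_Whatever) REFERENCES Whatever (ID)
);
-- one joined row per property:
SELECT w.*, p.`Value` AS color
FROM Whatever w
LEFT JOIN whatever_properties p
  ON p.FK_Whatever = w.ID AND p.`Key` = 'color';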
Type safety is lost in both cases, but we don't really need that anyway.
I have a feeling that there is a smarter solution to this common problem (we cannot move to a NoSql database for various reasons). Does anyone have a hint?

In an earlier project where we needed to store arbitrary extended attributes for a business object, we created an extended schema as follows:
CREATE TABLE ext_fields
(
    systemId INT,
    fieldId INT,
    dataType INT -- represented using an enum at the application layer
    -- other attributes
);
CREATE TABLE request_ext
(
    systemId INT, -- composite primary key in the business object table
    requestId INT, -- composite primary key in the business object table
    fieldId INT,
    boolean_value BIT,
    integer_value INT,
    double_value REAL,
    string_value NVARCHAR(256),
    text_value NVARCHAR(MAX)
);
A given record will have only one of the _value columns set, based on the data type of the field as defined in the ext_fields table. This allowed us to keep the type of the field and its value, and it worked pretty well in utilizing all the filtering methods provided by the DBMS for those data types.
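A hedged example of that filtering (the fieldId and threshold are illustrative): to find all extended values of an integer-typed field above some bound, you query the column that matches the field's declared type:
SELECT r.systemId, r.requestId, r.integer_value
FROM request_ext r
WHERE r.fieldId = 42 -- hypothetical integer-typed field
  AND r.integer_value > 10;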
My two cents!

Related

Relational database design

I'm trying to understand entities, tables and foreign keys. I have the following:
AnObject - I have identified this as an entity type.
ID (Primary Key)
Description
State
DependsOn
Creator
Now, State can only take one of two values, [Alive, Dead], though it could possibly gain another in the future. An object can only be in one state at a time, but it will likely change between the two.
Question:
Should State be its own entity type? Would it be an entity type or just a table? Should State have a foreign key to AnObject or vice versa? E.g.
State
ID (PK)
Description
AnObject_ID (Foreign Key references AnObject)
Question: The DependsOn attribute of AnObject can have multiple values of other AnObject entity types. Obviously a field cannot hold multiple values, so I'm not sure how to model this.
The Creator attribute of AnObject also takes a restricted set of values [Fred, Jim, Dean]. Should I have an entity type (table) for a Creator with a foreign key to the AnObject ID? So a Creator can create 0, 1, or m AnObjects, but an AnObject can only have one creator?
Thanks,
State could just be an enum field, unless you need users to be able to add other State values via a user interface, in which case you could use a lookup table (one-to-many relationship) as you suggested. I don't know what database you're using, but here's some info on the enum type in MySQL: http://dev.mysql.com/doc/refman/5.6/en/enum.html.
If you use a lookup table, then AnObject should have a field called StateID that points to the desired row in the State table.
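A minimal sketch of that lookup-table variant, with assumed names and types:
CREATE TABLE State (
  ID INT AUTO_INCREMENT PRIMARY KEY,
  Description VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE AnObject (
  ID INT AUTO_INCREMENT PRIMARY KEY,
  Description VARCHAR(255),
  StateID INT NOT NULL,
  FOREIGN KEY (StateID) REFERENCES State (ID)
);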
It sounds like DependsOn is a many-to-many relationship. For that you will need a join table, e.g.:
Table: Dependencies
Primary key (called a "composite key" because it's made up of more than one field):
AnObjectParentID
AnObjectChildID
I've assumed that the dependencies are needed for a parent-child relationship but if that's not the case you might want to name the table or fields differently.
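A sketch of that join table in MySQL terms, assuming AnObject has an INT primary key named ID:
CREATE TABLE Dependencies (
  AnObjectParentID INT NOT NULL,
  AnObjectChildID INT NOT NULL,
  PRIMARY KEY (AnObjectParentID, AnObjectChildID),
  FOREIGN KEY (AnObjectParentID) REFERENCES AnObject (ID),
  FOREIGN KEY (AnObjectChildID) REFERENCES AnObject (ID)
);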
You can add extra tables for enumeration values with a foreign key from AnObject to them. State will probably be best represented as a single field of type varchar not null. You can have the primary key for a table be a varchar field - it doesn't have to be an int.
This will constrain the values but allow you to use reasonable syntax to query the thing (i.e. WHERE state = 'Alive'). Although in this case I think you're prematurely abstracting things - I'd keep it simple and just have a simple bool column IsDead.
DependsOn is a one-way attribute (you presumably can't have A depend on B and also B depend on A). The real issue here is how you're intending to query these items and how many of them there will be. If you want to pull out the whole chain of dependencies at once and the chain is long, you want to avoid doing hundreds of individual queries to do that. What is your use case?
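For what it's worth, if you are on MySQL 8.0 or later, a recursive CTE can pull the whole chain in one query. A sketch against the Dependencies table above, assuming the dependency graph is acyclic (object 1 is just an example):
WITH RECURSIVE chain AS (
  SELECT AnObjectChildID FROM Dependencies WHERE AnObjectParentID = 1
  UNION ALL
  SELECT d.AnObjectChildID
  FROM Dependencies d
  JOIN chain c ON d.AnObjectParentID = c.AnObjectChildID
)
SELECT * FROM chain;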

Modeling the storage of multiple data types that also have parent child relationships

I'm trying to design a MySQL database for a project I've started but I cannot figure out the best way to do it.
It's an OOP system that contains different types of objects, all of which need to be stored in the database. But those objects also need to maintain parent-child relationships with one another. I also want the flexibility to easily add new data types once the system is in production.
As far as I can see I have three options: one that is pure relational, one which I think is entity-attribute-value (I don't properly understand EAV), and the last a hybrid design that I've thought of myself, but I assume it has already been thought of before and has a proper name.
The relational design would consist of two tables: one large table with columns that allow it to store any type of object, and a second table to maintain the parent-child relationships of the rows in the first table.
The EAV design would also have two tables: one being an EAV table with three columns (entity id, attribute and value), the second table relating the parent-child relationships of these entities.
The hybrid design would have a table for each type of object, then a parent-child relation table that would have to store the id of the parent, the id of the child, and some sort of identifier of the tables these ids come from.
I'm sure this problem has been tackled and solved hundreds of times before and I would appreciate any references so I can read about the solutions.
This is the only truly relational design:
CREATE TABLE Objects (
object_id INT AUTO_INCREMENT PRIMARY KEY,
parent_object_id INT,
-- also attribute columns common to all object types
FOREIGN KEY (parent_object_id) REFERENCES Objects (object_id)
);
CREATE TABLE RedObjects (
object_id INT PRIMARY KEY,
-- attribute columns for red objects
FOREIGN KEY (object_id) REFERENCES Objects (object_id)
);
CREATE TABLE BlueObjects (
object_id INT PRIMARY KEY,
-- attribute columns for blue objects
FOREIGN KEY (object_id) REFERENCES Objects (object_id)
);
CREATE TABLE YellowObjects (
object_id INT PRIMARY KEY,
-- attribute columns for yellow objects
FOREIGN KEY (object_id) REFERENCES Objects (object_id)
);
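Fetching one type then becomes a plain join of the common table with the type table; for example, all red objects together with their shared attributes (a sketch):
SELECT o.*, r.*
FROM Objects o
JOIN RedObjects r ON r.object_id = o.object_id;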
But MySQL (before 8.0) does not support recursive queries, so if you need to do complex queries, to fetch the whole tree for instance, you'll need to use another method to store the relationships. I suggest a Closure Table design:
CREATE TABLE Paths (
ancestor_id INT,
descendant_id INT,
length INT DEFAULT 0,
PRIMARY KEY (ancestor_id, descendant_id),
FOREIGN KEY (ancestor_id) REFERENCES Objects (object_id),
FOREIGN KEY (descendant_id) REFERENCES Objects (object_id)
-- this may need additional indexes to support different queries
);
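To sketch how it's used (ids are illustrative): fetching the whole subtree under object 4 is a single join, and inserting a new object 9 under 4 means copying 4's ancestor paths plus a zero-length self-path:
-- the subtree under object 4, including object 4 itself:
SELECT o.*
FROM Objects o
JOIN Paths p ON p.descendant_id = o.object_id
WHERE p.ancestor_id = 4;
-- register object 9 as a new child of object 4:
INSERT INTO Paths (ancestor_id, descendant_id, length)
SELECT ancestor_id, 9, length + 1 FROM Paths WHERE descendant_id = 4
UNION ALL
SELECT 9, 9, 0;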
I describe more about the Closure Table here:
My answer to What is the most efficient/elegant way to parse a flat table into a tree?
My presentation Models for Hierarchical Data with SQL and PHP
My book SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Yes, you can very well use the EAV design. It works for the application we created, although it took about 3 years of refinement.
You can also use a generic table structure and use any particular table for a group of objects, or just create one generic table for each object. Which table holds which object then becomes part of a metadata repository.
If you use a val_int, val_string type of structure, you will have NULL columns except where the value is stored. There are sparse column features in MS SQL Server which you might consider using, and disk space is fairly cheap these days. So the only drawback vis-a-vis a traditional structure is NxR rows (say R attributes per object) instead of N rows.
Other than that, a few things to look out for are object instance GUIDs, dynamic SQL generation, and so on.

Database Structure for Inconsistent Data

I am creating a database for my company that will store many different types of information. The categories are Brightness, Contrast, Chromaticity, etc. Each category has a number of data points which my company would like to start storing.
Normally, I would create a table for each category which would store the corresponding data (this is how I learned to do it). However, sometimes these categories have "sub-data" which changes the number of fields required in each table.
My question is then how do people handle the inconsistency of data when structuring their databases? Do they just keep adding more tables for extra data or is it something else altogether?
There are a few (and thank goodness only a few) unbendable rules about relational database models. One of them is: if you don't know what to store, you'll have a hard time storing it. Chances are you'll have an even harder time retrieving it.
That said, the reality of business rules is often less clear cut than the ivory tower of database design. Most importantly, you might want or even need a way to introduce a new property without changing the schema.
Here are two feasible ways to go about this:
Use a datastore that specializes in loose or nonexistent schemas (NoSQL and friends). Explaining this in detail is the subject of a CS thesis, not a Stack Overflow answer.
My recommendation: use a separate properties table. Here is how this goes:
Assuming, for the sake of argument, that your products always have a (unique string) name, an (integer) id, brightness, contrast and chromaticity, plus sometimes an (integer) foo and a (string) bar, consider these tables:
CREATE TABLE products (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
brightness INT,
contrast INT,
chromaticity INT,
UNIQUE INDEX(name)
);
CREATE TABLE properties (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
proptype ENUM('null','int','string') NOT NULL default 'null',
UNIQUE INDEX(name)
);
INSERT INTO properties VALUES
(0,'foo','int'),
(0,'bar','string');
CREATE TABLE product_properties (
id INT PRIMARY KEY AUTO_INCREMENT,
products_id INT NOT NULL,
properties_id INT NOT NULL,
intvalue INT NOT NULL,
stringvalue VARCHAR(250) NOT NULL,
UNIQUE INDEX(products_id,properties_id)
);
now your "standard" properties would be in the products table as usual, while the "optional" properties would be stored in a row of product_properties, that references the product id and property id, with the value being in intvalue or stringvalue.
Selecting products including their foo if any would look like
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.products_id
AND product_properties.properties_id=1
or even
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.products_id
LEFT JOIN properties
ON product_properties.properties_id=properties.id
WHERE properties.name='foo' OR properties.name IS NULL
Please understand that this incurs a performance penalty - in fact you trade performance against flexibility: adding another property is nothing more than INSERTing a row into properties; the schema stays the same.
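To make that concrete, here is a hedged sketch of adding a hypothetical integer property "baz" and a value for product 7 (the ids are illustrative; the empty string fills the unused NOT NULL stringvalue column):
INSERT INTO properties (id, name, proptype) VALUES (NULL, 'baz', 'int');
INSERT INTO product_properties (products_id, properties_id, intvalue, stringvalue)
VALUES (7, LAST_INSERT_ID(), 42, '');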
If you're not MySQL-bound, then other databases have table inheritance or arrays to solve some of these niche cases. PostgreSQL is a very nice database that you can use as easily and freely as MySQL.
With MySQL you could:
1. Change your tables: add the extra columns and allow NULL in the subcategory data that you don't need. This way integrity can still be checked, since you can still put constraints on the columns. Unless you really have a lot of subcategory columns, I'd recommend this; otherwise, option 3.
2. Store subcategory data dynamically in a separate table that has a category_id, a category_row_id, a subcategory identifier (= type of subcategory) and a value column. That way you can retrieve your data by linking it via the category_id (determines the table) and the category_row_id (links to the PK of the original category table row). The bad part: you can't use foreign keys or constraints properly to enforce integrity, so you'd need to write hairy insert/update triggers to keep some control, which pushes the burden of integrity and referential checking solely onto the client (in which case you'd probably be better off going the NoSQL route). In short, I wouldn't recommend this.
3. Make a separate subcategory table per category table. Columns can be fixed, or variable via value column(s) plus an optional subcategory identifier; foreign keys can still be used, and integrity is best maintained with fixed columns, since you'll have the full range of constraints at your disposal. If you have a lot of subcategory columns that would otherwise clutter your regular category table, I'd recommend this with fixed columns. Like the previous option, I'd never recommend going dynamic for anything but throwaway data.
Alternatively, if your subcategory data is very variable and volatile: use NoSQL with a document database such as MongoDB. Mind you, you can keep all your regular data in a proper RDBMS and just store side-data in the document database, though that's probably not recommended.
If your subcategory data is in a known fixed state and not prone to change, I'd just add the extra columns to the specific category table. Keep in mind that the major feature of a proper DBMS is safeguarding the integrity of your data via checks and constraints; doing away with that is never really a good idea.
If you are not limited to MySQL, you can consider Microsoft SQL Server and its Sparse Columns feature. This will allow you to expand your schema to include however many columns you want, without incurring the storage penalty for columns that are not pertinent to a given row.
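For completeness, a brief T-SQL sketch of the feature (table and column names are invented for illustration); columns marked SPARSE cost no storage in rows where they are NULL:
CREATE TABLE Measurements (
  id INT IDENTITY PRIMARY KEY,
  brightness INT NULL,
  contrast INT NULL,
  calibration_offset INT SPARSE NULL, -- rarely populated
  vendor_code VARCHAR(50) SPARSE NULL -- rarely populated
);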

Normalizing MySQL data

I'm new to MySQL, and just learned about the importance of data normalization. My database has a simple structure:
I have 1 table called users with fields:
userName (string)
userEmail (string)
password (string)
requests (an array of dictionaries in JSON string format)
data (another array of dictionaries in JSON string format)
deviceID (string)
Right now, this is my structure. Being very new to MySQL, I'm really not seeing why the above structure is a bad idea. Why would I need to normalize this and make separate tables? That's the first question: why? (Some have also said not to put JSON in my table. Why or why not?)
The second question is how? With the above structure, how many tables should I have, and what would be in each table?
Edit:
So maybe normalization is not absolutely necessary here, but maybe there's a better way to implement my data field? The data field is an array of dictionaries: each dictionary is just a note item with a few keys (title, author, date, body). What I do now, which I think might be inefficient, is this: every time a user composes a new note, I send that note from my app to PHP. I take the JSON array of dictionaries already stored for that user, convert it to a PHP array, append the new note to the end, convert the whole thing back to JSON, and put it back in the table. This process is repeated every time a new note is composed. Is there a better way to do this? Maybe a user's data should be a table, with each row being a note - but I'm not really sure how that would work.
The answer to all your questions really depends on what the JSON data is for, and whether you'll ever need to use some property of that data to determine which rows are returned.
If your data truly has no schema, and you're really just using it to store data that will be used by an application that knows how to retrieve the correct row by some other criteria (such as one of the other fields) every time, there's no reason to store it as anything other than exactly as that application expects it (in this case, JSON).
If the JSON data DOES contain some structure that is the same for all entries, and if it's useful to query this data directly from the database, you would want to create one or more tables (or maybe just some more fields) to hold this data.
As a practical example of this, if the data field contains JSON enumerating services for that user in an array, and each service has a unique id, type, and price, you might want a separate table with the following fields (using your own naming conventions):
serviceId (integer)
userName (string)
serviceType (string)
servicePrice (float)
And each service for that user would get its own entry. You could then query for users that have a particular service, which, depending on your needs, could be very useful. In addition to easy querying, indexing certain fields of the separate tables can also make for very quick queries.
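For example (the table name userServices is an assumption, matching the fields above):
SELECT userName
FROM userServices
WHERE serviceType = 'backup'
  AND servicePrice < 10.00;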
Update: Based on your explanation of the data stored, and the way you use it, you probably do want it normalized. Something like the following:
# user table
userId (integer, auto-incrementing)
userName (string)
userEmail (string)
password (string)
deviceID (string)
# note table
noteId (integer, auto-incrementing)
userId (integer, matches user.userId)
noteTime (datetime)
noteData (string, possibly split into separate fields depending on content, such as subject, etc.)
# request table
requestId (integer, auto-incrementing)
userId (integer, matches user.userId)
requestTime (datetime)
requestData (string, again split as needed)
You could then query like so:
# Get a user
SELECT * FROM user WHERE userId = '123';
SELECT * FROM user WHERE userNAme = 'foo';
# Get all requests for a user
SELECT * FROM request WHERE userId = '123';
# Get a single request
SELECT * FROM request WHERE requestId = '325325';
# Get all notes for a user
SELECT * FROM note WHERE userId = '123';
# Get all notes from last week
SELECT * FROM note WHERE userId = '123' AND noteTime > CURDATE() - INTERVAL 1 WEEK;
# Add a note to user 123
INSERT INTO note (noteId, userId, noteData) VALUES (null, 123, 'This is a note');
Notice how much more you can do with normalized data, and how easy it is? It's trivial to locate, update, append, or delete any specific component.
Normalization is a philosophy. Some people think it fits their database approach, some don't. Many modern database solutions even focus on denormalization to improve speeds.
Normalization often doesn't improve speed. However, it greatly improves the simplicity of accessing and writing data. For example, if you wanted to add a request, you would have to write a completely new JSON field. If it was normalized, you could simply add a row to a table.
In normalization, an "array of dictionaries in JSON string format" is always bad. An array of dictionaries translates to a list of rows, which is a table.
If you're new to databases: NORMALIZE. Denormalization is something for professionals.
A main benefit of normalization is to eliminate redundant data, but since each user's data is unique to that user, there is no benefit to splitting this table and normalizing. Furthermore, since the front-end will employ the dictionaries as JSON objects anyway, undue complication and a decrease in performance would result from trying to decompose this data.
Okay, here is a normalized MySQL data model. Note: you can separate authors and titles into two tables to further reduce data redundancy. You can probably use similar techniques for the "requests" dictionaries:
CREATE TABLE USERS(
UID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
userName varchar(255) UNIQUE,
password varchar(30),
userEmail varchar(255) UNIQUE,
deviceID varchar(255)
) ENGINE=InnoDB;
CREATE TABLE BOOKS(
BKID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
FKUSERS int,
Title varchar(255),
Author varchar(50)
) ENGINE=InnoDB;
ALTER TABLE BOOKS
ADD FOREIGN KEY (FKUSERS)
REFERENCES USERS(UID);
CREATE TABLE NOTES(
ID int NOT NULL AUTO_INCREMENT PRIMARY KEY,
FKUSERS int,
FKBOOKS int,
Date date,
Notes text
) ENGINE=InnoDB;
ALTER TABLE NOTES
ADD FOREIGN KEY BKNO (FKUSERS)
REFERENCES USERS(UID);
ALTER TABLE NOTES
ADD FOREIGN KEY (FKBOOKS)
REFERENCES BOOKS(BKID);
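As a usage sketch against this model, pulling every note a given user made, together with the book it belongs to, is one join (user id 1 is illustrative):
SELECT n.Date, b.Title, b.Author, n.Notes
FROM NOTES n
JOIN BOOKS b ON b.BKID = n.FKBOOKS
WHERE n.FKUSERS = 1
ORDER BY n.Date;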
In your case, I would abstract out the class that handles this table and keep the data denormalized. If, in the future, the data access patterns change and I need to normalize the data, I can just do so with less impact on the program: I would only need to change the class that handles this set of data to query the normalized tables, but return the data as if the database structure never changed.

How to store data with dynamic number of attributes in a database

I have a number of different objects with a varying number of attributes. Until now I have saved the data in XML files which easily allow for an ever changing number of attributes. But I am trying to move it to a database.
What would be your preferred way to store this data?
A few strategies I have identified so far:
Having one single field named "attributes" in the object's table and store the data serialized or json'ed in there.
Storing the data in two tables (objects, attributes) and using a third to save the relations, making it a true n:m relation. Very clean solution, but possibly very expensive to fetch an entire object and all its attributes
Identifying attributes all objects have in common and creating fields for these to the object's table. Store the remaining attributes as serialized data in another field. This has an advantage over the first strategy, making searches easier.
Any ideas?
If you ever plan on searching for specific attributes, it's a bad idea to serialize them into a single column, since you'll have to use per-row functions to get the information out - this rarely scales well.
I would opt for your second choice. Have a list of attributes in an attribute table, the objects in their own table, and a many-to-many relationship table called object attributes.
For example:
objects:
object_id integer
object_name varchar(20)
primary key (object_id)
attributes:
attr_id integer
attr_name varchar(20)
primary key (attr_id)
object_attributes:
object_id integer references (objects.object_id)
attr_id integer references (attributes.attr_id)
oa_value varchar(20)
primary key (object_id,attr_id)
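Fetching an object with all its attributes is then a two-way join (object_id 1 is illustrative):
SELECT o.object_name, a.attr_name, oa.oa_value
FROM objects o
JOIN object_attributes oa ON oa.object_id = o.object_id
JOIN attributes a ON a.attr_id = oa.attr_id
WHERE o.object_id = 1;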
Your concern about performance is noted but, in my experience, it's always more costly to split a column than to combine multiple columns. If it turns out that there are performance problems, it's perfectly acceptable to break 3NF for performance reasons.
In that case I would store it the same way but also have a column with the raw serialized data. Provided you use insert/update triggers to keep the columnar and combined data in sync, you won't have any problems. But you shouldn't worry about that until an actual problem surfaces.
By using those triggers, you minimize the work required to only when the data changes. By trying to extract sub-column information, you do unnecessary work on every select.
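A hedged sketch of one such trigger in MySQL 5.7+ terms (it assumes a hypothetical raw_attrs JSON column added to objects and attribute names without embedded quotes; deletes and updates would need matching triggers):
CREATE TRIGGER oa_after_insert
AFTER INSERT ON object_attributes
FOR EACH ROW
  UPDATE objects
  SET raw_attrs = JSON_SET(
        COALESCE(raw_attrs, '{}'),
        CONCAT('$."', (SELECT attr_name FROM attributes WHERE attr_id = NEW.attr_id), '"'),
        NEW.oa_value)
  WHERE object_id = NEW.object_id;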
A variation on your 2nd solution is just two tables (assuming all attributes are of a single type):
T1: |Object data columns|Object_id|
T2: |Object id|attribute_name|attribute value| (unique index on first 2 columns)
This is even more efficient when combined with your 3rd solution, e.g. all of the common fields go into T1.
Stuffing more than one attribute into the same blob is not recommended - you cannot filter by attributes, and you cannot update them efficiently.
Let me give some concreteness to what DVK was saying.
Assuming the values are of the same type, the table would look like this (good luck, I feel you're going to need it):
dynamic_attribute_table
------------------------
id NUMBER
key VARCHAR
value SOMETYPE?
example (cars):
|id| key | value |
---------------------------
| 1|'Make' |'Ford' |
| 1|'Model' |'Edge' |
| 1|'Color' |'Blue' |
| 2|'Make' |'Chevrolet'|
| 2|'Model' |'Malibu' |
| 2|'MaxSpeed'|'110mph' |
Thus,
entity 1 = { ('Make', 'Ford'), ('Model', 'Edge'), ('Color', 'Blue') }
and,
entity 2 = { ('Make', 'Chevrolet'), ('Model', 'Malibu'), ('MaxSpeed', '110mph') }.
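When a fixed set of keys is known up front, the rows can be pivoted back into one row per entity with conditional aggregation, for example:
SELECT id,
  MAX(CASE WHEN `key` = 'Make' THEN `value` END) AS make,
  MAX(CASE WHEN `key` = 'Model' THEN `value` END) AS model
FROM dynamic_attribute_table
GROUP BY id;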
If you are using a relational db, then I think you did a good job listing the options. They each have their pros and cons. YOU are in the best position to decide what works best for your circumstances.
The serialized approach is probably the fastest (depending on your code for de-serializing), but it means that you won't be able to query the data with SQL. If you say that you don't need to query the data with SQL, then I agree with @longneck: maybe you should use a key/value style db instead of a relational db.
EDIT - reading more of your comments: WHY are you switching to a db if speed is your main concern? What's BAD about your current XML implementation?
I used to implement this scheme:
t_class (id RAW(16), parent RAW(16)) -- holds the class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- hold GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties
t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.
RAW(16) is where Oracle holds GUIDs
If you want to select all properties for an object, you issue:
SELECT i.*
FROM (
SELECT id
FROM t_class
START WITH
id = (SELECT class FROM t_declaration WHERE id = :object_id)
CONNECT BY
parent = PRIOR id
) c
JOIN t_property p
ON p.class = c.id
LEFT JOIN
t_instance i
ON i.id = :object_id
AND i.class = p.class
AND i.property = p.property
t_property holds stuff you normally don't search on (like text descriptions etc.)
Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.
You don't have to use fast tables and limit all your data to these four tables.
Sounds like you need something like CouchDB, not an RDBMS.
If you are going to edit/manipulate/delete the attributes at a later point, making a true n:m relation (your second option) is the one I would go for. (Or try to make it two tables where the same attribute repeats, but data size will be high.)
If you are not dealing with the attributes (just capturing and showing the data), then you can go ahead and store them in one field with some separator (make sure the separator won't occur in the attribute values).
I am assuming you do not have digital attribute soup, but that there is some order to your data.
Otherwise, an RDBMS might not be the best fit; something along the lines of NoSQL might work better.
If your objects are of different types, you should generally have one table per type.
Especially if you want to connect them using primary keys. It also helps to bring order and sanity if you have Products, Orders, Customers, etc tables, instead of just an Object and Attribute table.
Then look at your attributes. Anything that exists for more than, say, 50% of the objects in that type category, make it a column in the object's table and use NULL when it's not being used.
Anything that is mandatory, should, of course, be defined as a NOT NULL column.
The rest, you can either have one or several "extra attributes" tables for.
You could put the attribute names into the table with the values, or normalize them out in a separate table and only use the primary key in the value table.
You may also find that you have combinations of data. For instance, a variant of an object type always has a certain set of attributes while another variant of the same object type has another set of attributes.
In that case, you might want to do something like:
MainObjectTable:
mainObjectId: PRIMARY KEY
columns...
MainObjectVariant1Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant1Columns...
MainObjectVariant2Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant2Columns...
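Querying then joins the main table with whichever variant tables apply; a sketch (the column lists above are placeholders):
SELECT m.*, v1.*
FROM MainObjectTable m
LEFT JOIN MainObjectVariant1Table v1
  ON v1.mainObjectId = m.mainObjectId;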
I think the hard work that will pay off in the long run is to analyze the data, find the objects and the commonly used attributes, and turn them into a good "object/ERD/DB" model.