Database Structure for Inconsistent Data - MySQL

I am creating a database for my company that will store many different types of information. The categories are Brightness, Contrast, Chromaticity, etc. Each category has a number of data points which my company would like to start storing.
Normally, I would create a table for each category which would store the corresponding data. (This is how I learned to do it.) However, sometimes these categories have "sub-data" which would change the number of fields required in each table.
My question is then how do people handle the inconsistency of data when structuring their databases? Do they just keep adding more tables for extra data or is it something else altogether?

There are a few (and thank goodness only a few) unbendable rules about relational database models. One of those is that if you don't know what to store, you have a hard time storing it. Chances are, you'll have an even harder time retrieving it.
That said, the reality of business rules is often less clear cut than the ivory tower of database design. Most importantly, you might want or even need a way to introduce a new property without changing the schema.
Here are two feasible ways to go at this:
Use a datastore that specializes in loose or nonexistent schemas (NoSQL and friends). Explaining this in detail is a subject for a CS thesis, not a Stack Overflow answer.
My recommendation: use a separate properties table. Here is how this goes:
Assuming, for the sake of argument, that your products always have a (unique string) name, an (integer) id, brightness, contrast, and chromaticity, plus sometimes an (integer) foo and a (string) bar, consider these tables:
CREATE TABLE products (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
brightness INT,
contrast INT,
chromaticity INT,
UNIQUE INDEX(name)
);
CREATE TABLE properties (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
proptype ENUM('null','int','string') NOT NULL default 'null',
UNIQUE INDEX(name)
);
-- 0 in the AUTO_INCREMENT column makes MySQL assign the next ids
-- (1 for 'foo', 2 for 'bar'), unless NO_AUTO_VALUE_ON_ZERO is set
INSERT INTO properties VALUES
(0,'foo','int'),
(0,'bar','string');
CREATE TABLE product_properties (
id INT PRIMARY KEY AUTO_INCREMENT,
product_id INT NOT NULL,
property_id INT NOT NULL,
intvalue INT, -- filled for 'int' properties, NULL otherwise
stringvalue VARCHAR(250), -- filled for 'string' properties, NULL otherwise
UNIQUE INDEX(product_id,property_id)
);
Now your "standard" properties live in the products table as usual, while each "optional" property is stored as a row of product_properties that references the product id and the property id, with the value in either intvalue or stringvalue.
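For example, storing a product that carries the optional foo property takes one row in each table. This assumes the INSERT above gave 'foo' the id 1; 'Widget A' and the value 42 are invented for illustration:
INSERT INTO products (name, brightness, contrast, chromaticity)
VALUES ('Widget A', 10, 20, 30);
-- attach foo=42 to the product just created (property_id 1 = 'foo')
INSERT INTO product_properties (product_id, property_id, intvalue)
VALUES (LAST_INSERT_ID(), 1, 42);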
Selecting products including their foo if any would look like
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.product_id
AND product_properties.property_id=1
or, looking the property up by name instead of hard-coding its id:
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.product_id
AND product_properties.property_id=(SELECT id FROM properties WHERE name='foo')
(Beware of filtering on properties.name in a WHERE clause after a LEFT JOIN: that would silently drop products whose only optional properties are something other than foo.)
Please understand that this incurs a performance penalty - in fact, you trade performance for flexibility: adding another property is nothing more than INSERTing a row into properties; the schema stays the same.
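For instance, introducing a hypothetical (string) property baz would be a single row, with no ALTER TABLE anywhere:
INSERT INTO properties (name, proptype) VALUES ('baz', 'string');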

If you're not MySQL-bound, other databases have table inheritance or arrays to solve some of these niche cases. PostgreSQL is a very nice database that you can use as easily and freely as MySQL.
With MySQL you could:
1. Change your tables: add the extra columns and allow NULL in the subcategory data you don't need. That way integrity can still be checked, since you can still put constraints on the columns. Unless you really have a lot of subcategory columns I'd recommend this; otherwise, option 3.
2. Store subcategory data dynamically in a separate table that has a category_id, a category_row_id, a subcategory identifier (= the type of subcategory) and a value column. That way you can retrieve your data by linking via the category_id (which determines the table) and the category_row_id (which links to the PK of the original category table row). The bad part: you can't properly use foreign keys or constraints to enforce integrity, so you'd need to write hairy insert/update triggers to keep some control, which pushes the burden of integrity and referential checking solely onto the client. (In that case you'd probably be better off going the NoSQL route.) In short, I wouldn't recommend this.
3. Make a separate subcategory table per category table (see the sketch below). Columns can be fixed, or variable via value column(s) plus an optional subcategory identifier; foreign keys can still be used, and integrity is best maintained with fixed columns, since you'll have the full range of constraints at your disposal. If you have a lot of subcategory columns that would otherwise clutter your regular category table, I'd recommend this with fixed columns. As with the previous option, I'd never go dynamic for anything but throwaway data.
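To make option 3 concrete, here is a minimal sketch with fixed columns; the table and column names are invented for illustration:
CREATE TABLE brightness (
id INT PRIMARY KEY AUTO_INCREMENT,
measured_at DATETIME NOT NULL,
value INT NOT NULL
) ENGINE=InnoDB;
-- fixed subcategory columns, integrity enforced by a real foreign key
CREATE TABLE brightness_details (
brightness_id INT PRIMARY KEY,
sensor_model VARCHAR(50) NOT NULL,
calibration INT NOT NULL,
FOREIGN KEY (brightness_id) REFERENCES brightness(id) ON DELETE CASCADE
) ENGINE=InnoDB;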
Alternatively, if your subcategory data is very variable and volatile: go NoSQL with a document database such as MongoDB. Mind you, you can keep all your regular data in a proper RDBMS and store just the side-data in the document database, though that's probably not recommended.
If your subcategory data is in a known fixed state and not prone to change I'd just add the extra columns to the specific category table. Keep in mind that the major feature of a proper DBMS is safeguarding the integrity of your data via checks and constraints, doing away with that never really is a good idea.

If you are not limited to MySQL, you can consider Microsoft SQL Server and its sparse columns. This will allow you to expand your schema to include however many columns you want, without incurring the storage penalty for columns that are not pertinent to a given row.

Related

What is the proper way to store 'metadata' in relational database?

I have a table called assets, where an asset can belong to a user, team, or division, and possibly multiple of each. My issue is that the assets are highly variable and can have properties associated with them that are different for each one.
ex. These could be assets:
1.)
type:workbench
cost:200
vendor:Acme Co.
color:black
2.)
type:microscope
serial_no:BH-00102
purchase_date:1337800923
cost:2040
and this could go on for hundreds to thousands of different types of assets.
How would I store this type of data in a normalized way that would be easy to query, without altering my tables every time a new asset type is added? Some of the fields are the present across all assets too, such as cost.
So far I figure that I should have:
assets
id,cost,purchase_date,asset_type_id
asset_types
id,name
division_assets
division_id,asset_id
user_assets
user_id,asset_id
But I do not know where to put the data that varies.
When I've been faced with this in the past, the "best" answer always ends up varying depending on how much processing I want to do in the database, vs how much in the client code.
For what it's worth, often the approach that has worked best for me in the past has been to end up with one table per optional attribute (in particular, not one table per entity type). So, in your examples above
assets (as per your example)
asset_types (as per your example)
division_assets (as per your example)
user_assets (as per your example)
colours
asset_id, colour
weights
asset_id, weight
serial_numbers
asset_id, serial_number
Of course, depending on the trade-offs you need to make, this might be a bad choice for you. Personally, I like to keep the schema for data as explicit as possible, including data types and constraints, so I have no drama in changing the tables next time a new attribute comes along.
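As a concrete sketch of one of those attribute tables, with an explicit type and foreign key (the VARCHAR length is an assumption), plus the typical lookup:
CREATE TABLE colours (
asset_id INT NOT NULL PRIMARY KEY,
colour VARCHAR(20) NOT NULL,
FOREIGN KEY (asset_id) REFERENCES assets(id)
);
-- assets together with their colour, where one exists
SELECT assets.*, colours.colour
FROM assets
LEFT JOIN colours ON colours.asset_id = assets.id;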
I would suggest this:
assets (
id
asset_type_id
vendor_id
cost
purchase_date
)
asset_properties (
id
asset_id
asset_property_type_id
value
)
asset_property_types (
id
property_type
)
asset_types (
id
asset_type
)
vendors (
id
vendor
)
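A sketch of how retrieval might look with this layout, pulling one asset (id 42 is invented) together with all of its variable properties:
SELECT a.id, t.asset_type, pt.property_type, p.value
FROM assets a
JOIN asset_types t ON t.id = a.asset_type_id
LEFT JOIN asset_properties p ON p.asset_id = a.id
LEFT JOIN asset_property_types pt ON pt.id = p.asset_property_type_id
WHERE a.id = 42;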
You can add another table for asset_metadata
asset_metadata
asset_metadata_id,asset_id,metadata_name,metadata_value
If you want to normalize and categorize the metadata, normalize it this way:
asset_metadata
asset_metadata_id,asset_id,metadata_name_id,metadata_value
metadata_name
metadata_name_id,metadata_name_text
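For example, recording the microscope's serial number from the question under the normalized layout takes one name row and one value row (this assumes metadata_name_id is AUTO_INCREMENT; the asset_id is invented):
INSERT INTO metadata_name (metadata_name_text) VALUES ('serial_no');
INSERT INTO asset_metadata (asset_id, metadata_name_id, metadata_value)
VALUES (7, LAST_INSERT_ID(), 'BH-00102');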
I'd recommend putting the common attributes like cost in conventional columns. Then add one more column in which you put a serialized collection of all the other, variable asset attributes.
CREATE TABLE assets (
asset_id INT AUTO_INCREMENT PRIMARY KEY,
cost NUMERIC(9,2),
purchase_date DATE,
variables TEXT
);
You can serialize the collection as JSON or XML or whatever you want. Use whatever is most easily processed by your application code.
INSERT INTO assets VALUES (123, 49.95, CURDATE(), 'color: black; vendor: Acme Co.');
The advantage is that you can add new attributes to the text blob at any time. The disadvantage is that you can't read or write an individual attribute, you have to treat the whole collection as a lump.
But you can index individual attributes to make them searchable. You need to create a new table for each attribute you want to be searchable (but this could be a small subset of all attributes):
CREATE TABLE asset_color (
asset_id INT NOT NULL,
color VARCHAR(10),
PRIMARY KEY (asset_id, color),
KEY(color)
);
Not every asset is recorded in this table, only those assets that have a color.
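Keeping that attribute table in sync is the application's job; recording the color of the example asset above would be just:
INSERT INTO asset_color (asset_id, color) VALUES (123, 'black');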
Then you can do an indexed search for all assets that have a color attribute:
SELECT assets.*
FROM assets INNER JOIN asset_color USING (asset_id);
You can also do an indexed search limited to assets that have a color attribute, and the color is black:
SELECT assets.*
FROM assets INNER JOIN asset_color USING (asset_id)
WHERE color = 'black';
There is really no way to design a normalized database that permits variable attributes. All normal forms require first that the table be a relation. And a relation by definition must have a fixed set of attributes.
Other people are recommending an EAV table, but the "value" column in an EAV doesn't meet the definition of a relational column with a type (other consequences of this are that constraints don't work in an EAV table). Therefore an EAV table isn't a relation, and cannot satisfy any normal form either.
You can create two new tables:
1) Defining multiple asset attributes in the following table (as many as the asset may have)
asset_id
asset_attribute
asset_value
2) asset_attribute table
attribute_id
asset_attribute
The logic would be that asset_attributes will need to be first defined in the asset_attribute table and then it can be used (linked/tagged) with any asset (as a foreign key, from a drop down list on UI) and a proper value entered.
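A minimal sketch of those two tables with the foreign key wired up (the types, lengths, and the name of the value table are assumptions):
CREATE TABLE asset_attribute (
attribute_id INT PRIMARY KEY AUTO_INCREMENT,
asset_attribute VARCHAR(50) NOT NULL UNIQUE
);
CREATE TABLE asset_attribute_value (
asset_id INT NOT NULL,
attribute_id INT NOT NULL,
asset_value VARCHAR(200) NOT NULL,
PRIMARY KEY (asset_id, attribute_id),
FOREIGN KEY (attribute_id) REFERENCES asset_attribute(attribute_id)
);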
Hope this helps.

Representing News Posts in MySQL

I'm currently working on a blog for a college news organization. Each post, though, will represent a full show, with multiple contributors and multiple titles.
For example, a post might have three news stories, each with its own title and some contributors for each:
"Story 1" by (id1) and (id2)
"Story 2" by (id3)
"Story 3" by (id4) and (id5)
So for each post, there would be an index (1, 2, 3...) for each individual story, a VARCHAR for the title, and ids that represent contributors, whose details are stored in another "contributors" table. The problem is that I don't know how many stories there will be, or how many contributors there will be per story. It could range from about 3 at the least up to 6. In case our show expands in the future, I'd like the capability to scale up to even more than 6 stories per post, too.
I want to represent this structure concisely in a MySQL column, but I'm not sure how to do that. One solution would be to create another MySQL table to save the details for each individual story, but I'd prefer to avoid that hassle. The ideal solution would be if I could somehow create an "array" within a MySQL column, which could store (for each story) an index, a string, and multiple ids to show who the contributors are.
Is this possible, or will I have to create a new table to keep track of each story?
Don't use a column - use a table. It can be a simple InnoDB table, which doesn't really hurt performance at all. Define a combined primary key (story_id, contributor_id) and insert all contributions into that table.
What you describe is an M:N relationship, and what you ask for - packing it into a single column - is something you should never do: it's a very bad idea and is, in fact, nearly impossible to handle properly in relational databases.
Save yourself some future heartburn. Create the extra table. It looks like a table of [Posts] with a one-to-many relationship to [Stories] where [Stories] has a many-to-many relationship to [Contributors].
You could store a comma-delimited string of contributor ids or story ids in one column, but how, exactly, would you relate them? Your best bet in that case would be to make it an 'array' of 'arrays', where your main string consists of pairs of strings strung together with commas. I (so it's just my opinion, okay?) would avoid that unless totally necessary (and I can't think of one instance where it would be)...
So create your relationships tables. Just to illustrate one approach to the idea:
-- a story may have multiple contributors
CREATE TABLE story_contributor_rel (
story_id INT NOT NULL
, contributor_id INT NOT NULL
);
-- a post may have multiple stories
CREATE TABLE post_story_rel (
post_id INT NOT NULL
, story_id INT NOT NULL
);
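Pulling a whole post back together is then a couple of joins; this sketch assumes hypothetical stories(story_id, title) and contributors(contributor_id, name) tables:
SELECT ps.post_id, s.title, c.name
FROM post_story_rel ps
JOIN stories s ON s.story_id = ps.story_id
JOIN story_contributor_rel sc ON sc.story_id = s.story_id
JOIN contributors c ON c.contributor_id = sc.contributor_id
WHERE ps.post_id = 1
ORDER BY s.story_id;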
Or cheat it a bit, but I'd recommend against this also(!):
-- a less-normalized way
CREATE TABLE post_relationships (
post_id INT NOT NULL
, story_id INT NOT NULL
, contributor_id INT NOT NULL
);
These are just the simplest approaches. Naturally, you'd want to have additional identity columns and/or proper indexing and primary key settings, but this is just the way I can best illustrate the point I'm driving at.
Imagine this, too: if you were to put all those relationships in logical columns, then without the application it would not be easy for anyone to understand what's going on in your tables. If you keep logic out of the table structures and track relationships properly (meaning relationship tables), everything stays transparent. One look at these tables and it won't take anyone long to understand them.
That's just my opinion. :) Cheers!

About database design

I need some ideas about my database design. I have about 5 fields for basic information about the user, such as name, email, gender, etc.
Then I want to have about 5 fields for optional information such as messenger id's.
And 1 optional text field for info about user.
Should I create only one table with all the fields together, or should I create a separate table for the 5 optional fields in order to avoid redundancy, etc.?
Thanks.
I'd stick with only one table.
Adding another table would only make things more complicated, and you would gain really little disk space.
And I really don't see how this can be redundant in any way ;)
I think you should definitely stick with one table. Since all the information is relevant to a user and does not reflect any other logical model (like an article, blog post or such), you can safely keep everything in one place, even if some of it is optional.
I would create only one table for the additional fields - but not with 5 fixed fields; rather with a foreign key relation to the base table and key/value pair info. Something like:
create table users (
user_id integer,
name varchar(200),
-- the rest of the fields
);
create table users_additional_info (
user_id integer references users(user_id) not null,
ai_type varchar(10) not null, -- type of additional info: messenger, extra email
ai_value varchar(200) not null
);
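Fetching a user together with any messenger ids would then look something like this (user_id 1 is invented):
SELECT u.name, ai.ai_value
FROM users u
LEFT JOIN users_additional_info ai
ON ai.user_id = u.user_id AND ai.ai_type = 'messenger'
WHERE u.user_id = 1;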
Eventually you might want an additional_info table to hold possible valid values for extra info: messenger, extra email, whatever. But that is up to you. I wouldn't bother.
It depends on how many people will be having all of that optional information and whether you plan on adding more fields. If you think you're going to add more fields in the future, it might be useful to move that information to a meta table using the EAV pattern: http://en.wikipedia.org/wiki/Entity-attribute-value_model
So, if you're unsure, your table would be like
User : id, name, email, gender, field1, field2
User_Meta : id, user_id, attribute, value
Using the user_id field in your meta table, you can link it to your user table and add as many sparsely used optional fields as you want.
Note: this pays off ONLY if you have many sparsely populated optional fields. Otherwise just keep them as regular fields in the one table.
I would suggest using a single table for this. Databases are very good at optimizing away space for empty columns.
Splitting this table out into two or more tables is an example of vertical partitioning and in this case is likely to be a case of premature optimization. However, this technique can be useful when you have columns that you only need to query some of the time, eg. large binary blobs.

Database design: objects with different attributes

I'm designing a product database where products can have very different attributes depending on their type, but attributes are fixed for each type and types are not manageable at all. E.g.:
magazine: title, issue_number, pages, copies, close_date, release_date
web_site: name, bandwidth, hits, date_from, date_to
I want to use InnoDB and enforce database integrity as much as the engine allows. What's the recommended way to handle this?
I hate those designs where tables have 100 columns and most of the values are NULL so I thought about something like this:
product_type
============
product_type_id INT
product_type_name VARCHAR
product
=======
product_id INT
product_name VARCHAR
product_type_id INT -> Foreign key to product_type.product_type_id
valid_since DATETIME
valid_to DATETIME
magazine
========
magazine_id INT
title VARCHAR
product_id INT -> Foreign key to product.product_id
issue_number INT
pages INT
copies INT
close_date DATETIME
release_date DATETIME
web_site
========
web_site_id INT
name VARCHAR
product_id INT -> Foreign key to product.product_id
bandwidth INT
hits INT
date_from DATETIME
date_to DATETIME
This can handle cascaded product deletion but... Well, I'm not fully convinced...
This is a classic OO design to relational tables impedance mismatch. The table design you've described is known as 'table per subclass'. The three most common designs are all compromises compared to what your objects actually look like in your app:
Table per concrete class
Table per hierarchy
Table per subclass
The design you don't like - "where tables have 100 columns and most of the values are NULL" - is #2, one table to store the whole specialization hierarchy. This is the least flexible for all kinds of reasons, including that if your app requires a new subclass, you need to add columns. The design you describe accommodates change much better, because you can extend it by adding a new subclass table described by a value in product_type.
The remaining option - 1. Table per concrete class - is usually undesirable because of the duplication involved in implementing all the common fields in each specialization table. The advantages, though, are that you won't need to perform any joins, and the subclass tables can even live on different db instances in a very large system.
The design you described is perfectly viable. The variation below is how it might look if you were using an ORM tool to do your CRUD operations. Notice how the ID in each sub-class table IS the FK value to the parent table in the hierarchy. A good ORM will automatically manage the correct sub-class table CRUD based on the discriminator values in product.id and product.product_type_id alone. Whether you are planning on using an ORM or not, look at Hibernate's joined-subclass documentation, if only to see the design decisions they made.
product
=======
id INT
product_name VARCHAR
product_type_id INT -> Foreign key to product_type.product_type_id
valid_since DATETIME
valid_to DATETIME
magazine
========
id INT -> Foreign key to product.product_id
title VARCHAR
..
web_site
========
id INT -> Foreign key to product.product_id INT
name VARCHAR
..
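With this layout, fetching all magazines together with their common product fields is a single join on the shared id - a sketch against the tables above:
SELECT p.id, p.product_name, p.valid_since, p.valid_to, m.title
FROM product p
JOIN magazine m ON m.id = p.id;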
You seem to be roughly on the right track, except that you may need to consider the difference between "a product" and what's often called "a stock-keeping unit" (SKU). Is a 25-units box of paper clips (of a certain specific kind) the same "product" as a 50-units box thereof? In terms of a store, or any kind of inventory system, the distinction matters; in some cases, indeed, a simple distinction in packaging of what's otherwise the same amount of the same underlying "product" may give you distinct SKUs to keep track of.
You need to decide where you want to keep track of this issue, if it matters to your application (it may be OK to have the products laid out as you do, and deal with packaging for SKU purposes in other tables, for example, even though for some apps that might be a slight overhead).
This is actually a standard way to "enforce" a sort of OO design in a classical RDBMS.
All the "common" attributes go on the master table (e.g. Price, if it is mantained at the product table level, could easily be part of the main table) while the specifics go on a subtable.
In theory, if you have sub-sub-types (e.g. magazines could be subtyped into daily newspapers and 4-colour periodicals, maybe with periodicals having a date interval for shelf life), you could add one or more sublevels too...
This is a pretty common (and proven) design. The only concern is that the master table will always be joined with at least one subtable for most operations. If you have zillions of items, this could have performance implications.
On the other hand, a common operation like deleting an item (I'd suggest a logical deletion, setting a flag to "true" on the master table) would be done once, in the same way, for every kind of subtype.
Anyway, go for it. And maybe Google around for "object-oriented to RDBMS mappings" or some such for a complete discussion.

How to store data with dynamic number of attributes in a database

I have a number of different objects with a varying number of attributes. Until now I have saved the data in XML files, which easily allow for an ever-changing number of attributes. But I am trying to move it to a database.
What would be your preferred way to store this data?
A few strategies I have identified so far:
Having one single field named "attributes" in the object's table and storing the data serialized or JSON'ed in there.
Storing the data in two tables (objects, attributes) and using a third to save the relations, making it a true n:m relation. Very clean solution, but possibly very expensive to fetch an entire object and all its attributes
Identifying attributes all objects have in common and creating fields for these in the object's table. Store the remaining attributes as serialized data in another field. This has an advantage over the first strategy, making searches easier.
Any ideas?
If you ever plan on searching for specific attributes, it's a bad idea to serialize them into a single column, since you'll have to use per-row functions to get the information out - this rarely scales well.
I would opt for your second choice. Have a list of attributes in an attribute table, the objects in their own table, and a many-to-many relationship table called object_attributes.
For example:
objects:
object_id integer
object_name varchar(20)
primary key (object_id)
attributes:
attr_id integer
attr_name varchar(20)
primary key (attr_id)
object_attributes:
object_id integer references (objects.object_id)
attr_id integer references (attributes.attr_id)
oa_value varchar(20)
primary key (object_id,attr_id)
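With that layout, listing every attribute of one object is a straightforward pair of joins (object_id 1 is invented for the example):
SELECT o.object_name, a.attr_name, oa.oa_value
FROM objects o
JOIN object_attributes oa ON oa.object_id = o.object_id
JOIN attributes a ON a.attr_id = oa.attr_id
WHERE o.object_id = 1;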
Your concern about performance is noted but, in my experience, it's always more costly to split a column than to combine multiple columns. If it turns out that there are performance problems, it's perfectly acceptable to break 3NF for performance reasons.
In that case I would store it the same way but also have a column with the raw serialized data. Provided you use insert/update triggers to keep the columnar and combined data in sync, you won't have any problems. But you shouldn't worry about that until an actual problem surfaces.
By using those triggers, you minimize the work required to only when the data changes. By trying to extract sub-column information, you do unnecessary work on every select.
A variation on your 2nd solution is just two tables (assuming all attributes are of a single type):
T1: |Object data columns|Object_id|
T2: |Object id|attribute_name|attribute value| (unique index on first 2 columns)
This is even more efficient when combined with the 3rd solution, e.g. all of the common fields go into T1.
Stuffing more than one attribute into the same blob is not recommended - you cannot filter by attributes, and you cannot update them efficiently.
Let me give some concreteness to what DVK was saying.
Assuming the values are of the same type, the table would look like this (good luck, I feel you're going to need it):
dynamic_attribute_table
------------------------
id NUMBER
key VARCHAR
value SOMETYPE?
example (cars):
|id| key | value |
---------------------------
| 1|'Make' |'Ford' |
| 1|'Model' |'Edge' |
| 1|'Color' |'Blue' |
| 2|'Make' |'Chevrolet'|
| 2|'Model' |'Malibu' |
| 2|'MaxSpeed'|'110mph' |
Thus,
entity 1 = { ('Make', 'Ford'), ('Model', 'Edge'), ('Color', 'Blue') }
and,
entity 2 = { ('Make', 'Chevrolet'), ('Model', 'Malibu'), ('MaxSpeed', '110mph') }.
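If you need the attributes back as columns rather than rows, the usual trick is a conditional-aggregation pivot; note that key is a reserved word in MySQL, hence the backticks:
SELECT id,
MAX(CASE WHEN `key` = 'Make' THEN value END) AS make,
MAX(CASE WHEN `key` = 'Model' THEN value END) AS model
FROM dynamic_attribute_table
GROUP BY id;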
If you are using a relational db, then I think you did a good job listing the options. They each have their pros and cons. YOU are in the best position to decide what works best for your circumstances.
The serialized approach is probably the fastest (depending on your code for de-serializing), but it means that you won't be able to query the data with SQL. If you say that you don't need to query the data with SQL, then I agree with #longneck, maybe you should use a key/value style db instead of a relational db.
EDIT - reading more of your comments: WHY are you switching to a db if speed is your main concern? What's BAD about your current XML implementation?
I used to implement this scheme:
t_class (id RAW(16), parent RAW(16)) -- holds class hierarchy.
t_property (class RAW(16), property VARCHAR) -- holds class members.
t_declaration (id RAW(16), class RAW(16)) -- hold GUIDs and types of all class instances
t_instance (id RAW(16), class RAW(16), property VARCHAR2(100), textvalue VARCHAR2(200), intvalue INT, doublevalue DOUBLE, datevalue DATE) -- holds 'common' properties
t_class1 (id RAW(16), amount DOUBLE, source RAW(16), destination RAW(16)) -- holds 'fast' properties for class1.
t_class2 (id RAW(16), comment VARCHAR2(200)) -- holds 'fast' properties for class2
--- etc.
RAW(16) is where Oracle holds GUIDs
If you want to select all properties for an object, you issue:
SELECT i.*
FROM (
SELECT id
FROM t_class
START WITH
id = (SELECT class FROM t_declaration WHERE id = :object_id)
CONNECT BY
parent = PRIOR id
) c
JOIN t_property p
ON p.class = c.id
LEFT JOIN
t_instance i
ON i.id = :object_id
AND i.class = p.class
AND i.property = p.property
t_property holds the stuff you normally don't search on (like text descriptions etc.)
Fast properties are in fact normal tables you have in the database, to make the queries efficient. They hold values only for the instances of a certain class or its descendants. This is to avoid extra joins.
You don't have to use fast tables and limit all your data to these four tables.
Sounds like you need something like CouchDB, not an RDBMS.
If you are going to edit/manipulate/delete the attributes at a later point, a true n:m relationship (your second option) is the one I would go for. (You could also make it two tables, where the same attribute name repeats, but the data size will be higher.)
If you are not dealing with the attributes (just capturing and showing the data), then you can go ahead and store them in one field with some separator (make sure the separator won't occur in the attribute values).
I am assuming you do not have digital attribute soup, but that there is some order to your data.
Otherwise, an RDBMS might not be the best fit. Something along the lines of NoSQL might work better.
If your objects are of different types, you should generally have one table per type.
Especially if you want to connect them using primary keys. It also helps to bring order and sanity if you have Products, Orders, Customers, etc tables, instead of just an Object and Attribute table.
Then look at your attributes. Anything that exists for more than, say, 50% of the objects in that type category, make it a column in the object's table and use NULL when it's not being used.
Anything that is mandatory, should, of course, be defined as a NOT NULL column.
For the rest, you can have one or several "extra attributes" tables.
You could put the attribute names into the table with the values, or normalize them out in a separate table and only use the primary key in the value table.
You may also find that you have combinations of data. For instance, a variant of an object type always has a certain set of attributes while another variant of the same object type has another set of attributes.
In that case, you might want to do something like:
MainObjectTable:
mainObjectId: PRIMARY KEY
columns...
MainObjectVariant1Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant1Columns...
MainObjectVariant2Table:
mainObjectId: FOREIGN KEY TO MainObjectTable
variant2Columns...
I think the hard work that will pay off in the long run is to analyze the data, find the objects and the commonly used attributes, and turn them into a good "object/ERD/DB" model.