Database design: objects with different attributes - mysql

I'm designing a product database where products can have very different attributes depending on their type, but attributes are fixed for each type and types are not manageable at all. E.g.:
magazine: title, issue_number, pages, copies, close_date, release_date
web_site: name, bandwidth, hits, date_from, date_to
I want to use InnoDB and enforce database integrity as much as the engine allows. What's the recommended way to handle this?
I hate those designs where tables have 100 columns and most of the values are NULL so I thought about something like this:
product_type
============
product_type_id INT
product_type_name VARCHAR
product
=======
product_id INT
product_name VARCHAR
product_type_id INT -> Foreign key to product_type.product_type_id
valid_since DATETIME
valid_to DATETIME
magazine
========
magazine_id INT
title VARCHAR
product_id INT -> Foreign key to product.product_id
issue_number INT
pages INT
copies INT
close_date DATETIME
release_date DATETIME
web_site
========
web_site_id INT
name VARCHAR
product_id INT -> Foreign key to product.product_id
bandwidth INT
hits INT
date_from DATETIME
date_to DATETIME
This can handle cascaded product deletion but... Well, I'm not fully convinced...

This is a classic OO design to relational tables impedance mismatch. The table design you've described is known as 'table per subclass'. The three most common designs are all compromises compared to what your objects actually look like in your app:
1. Table per concrete class
2. Table per hierarchy
3. Table per subclass
The design you don't like - "where tables have 100 columns and most of the values are NULL" - is option 2, table per hierarchy: one table that stores the whole specialization hierarchy. This is the least flexible for all kinds of reasons, including that if your app requires a new sub-class, you need to add columns. The design you describe accommodates change much better because you can extend it by adding a new sub-class table described by a value in product_type.
The remaining option, 1 (table per concrete class), is usually undesirable because of the duplication involved in implementing all the common fields in each specialization table. The advantages, though, are that you won't need to perform any joins and the sub-class tables can even be on different db instances in a very large system.
The design you described is perfectly viable. The variation below is how it might look if you were using an ORM tool to do your CRUD operations. Notice how the ID in each sub-class table IS the FK value to the parent table in the hierarchy. A good ORM will automatically manage the correct sub-class table CRUD based on product.id and the discriminator value in product.product_type_id alone. Whether you are planning on using an ORM or not, look at Hibernate's joined sub-class documentation, if only to see the design decisions they made.
product
=======
id INT
product_name VARCHAR
product_type_id INT -> Foreign key to product_type.product_type_id
valid_since DATETIME
valid_to DATETIME
magazine
========
id INT -> Foreign key to product.id
title VARCHAR
..
web_site
========
id INT -> Foreign key to product.id
name VARCHAR
..
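As a rough illustration of the joined sub-class layout above, here is a minimal sketch in Python using the standard-library sqlite3 module. It uses SQLite rather than MySQL/InnoDB purely so that it runs without a server; the column types, the reduced column list and the sample rows are assumptions made for the example. The point it shows is that magazine.id is both the primary key and the foreign key to product.id, so an orphan sub-class row is rejected and deleting a product cascades to its sub-class row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE product_type (
    product_type_id   INTEGER PRIMARY KEY,
    product_type_name TEXT NOT NULL UNIQUE
);
CREATE TABLE product (
    id              INTEGER PRIMARY KEY,
    product_name    TEXT NOT NULL,
    product_type_id INTEGER NOT NULL REFERENCES product_type(product_type_id)
);
-- the sub-class key is the parent key: one column, PK and FK at once
CREATE TABLE magazine (
    id           INTEGER PRIMARY KEY REFERENCES product(id) ON DELETE CASCADE,
    title        TEXT NOT NULL,
    issue_number INTEGER
);
INSERT INTO product_type VALUES (1, 'magazine'), (2, 'web_site');
INSERT INTO product VALUES (10, 'sample product', 1);   -- hypothetical row
INSERT INTO magazine VALUES (10, 'sample title', 1);    -- hypothetical row
""")


def magazine_rows(conn):
    """Join the parent row with its sub-class row on the shared key."""
    return conn.execute(
        "SELECT p.id, p.product_name, m.title, m.issue_number "
        "FROM product p JOIN magazine m ON m.id = p.id"
    ).fetchall()


rows_before = magazine_rows(conn)

# A magazine row without a matching product violates the foreign key.
try:
    conn.execute("INSERT INTO magazine VALUES (99, 'orphan', 1)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the product removes its magazine row through ON DELETE CASCADE.
conn.execute("DELETE FROM product WHERE id = 10")
remaining = conn.execute("SELECT COUNT(*) FROM magazine").fetchone()[0]
```

In MySQL the same idea would be written with an InnoDB FOREIGN KEY ... ON DELETE CASCADE clause on magazine.id.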

You seem to be roughly on the right track, except that you may need to consider the difference between "a product" and what's often called "a stock-keeping unit" (SKU). Is a 25-units box of paper clips (of a certain specific kind) the same "product" as a 50-units box thereof? In terms of a store, or any kind of inventory system, the distinction matters; in some cases, indeed, a simple distinction in packaging of what's otherwise the same amount of the same underlying "product" may give you distinct SKUs to keep track of.
You need to decide where you want to keep track of this issue, if it matters to your application (it may be OK to have the products laid out as you do, and deal with packaging for SKU purposes in other tables, for example, even though for some apps that might be a slight overhead).

This is actually a standard way to "enforce" a sort of OO design in a classical RDBMS.
All the "common" attributes go on the master table (e.g. Price, if it is maintained at the product table level, could easily be part of the main table) while the specifics go on a subtable.
In theory if you have sub-sub-types (e.g. magazines could be subtyped in daily newspapers and 4-colours periodicals, maybe, with periodicals having a date interval for shelf-life) you could add one or more sublevels too...
This is a pretty common (and proven) design. The only concern is that the master table will always be joined with at least one subtable for most operations. If you have zillions of items this could have performance implications.
On the other hand, common operations like deleting an item (I'd suggest a logical deletion, setting a flag to "true" on the master table) would be done once for every kind of subtype.
Anyway, go for it. And maybe google around for "Object oriented to RDBMS mappings" or somesuch for a complete discussion.

Related

How to structure a Bill of Materials that has multiple options

I am stuck trying to develop a Bill of Materials in Access. I have a table called IM_Item_Registry where I have the Item_Code and a boolean for whether it's a component. Where I'm stuck is that past sins of the company created several part numbers for the same ingredient from different vendors. A product may use ingredient 1 at the beginning of the run and ingredient 2 at the end of a run depending on inventory, and it may switch from job to job (lack of discipline and random purchasing based on price). It's creating a headache for me because they typically have different inclusions. How would I go about adding in the flexibility to use both? Or would it just be easier to make multiple versions and then select those versions upon scheduling?
I know this is loaded and I can include more detail if needed, but I appreciate your help. I've been researching how to do this for a couple of weeks now.
EDIT (3/28/2019)
This is for an injection molding company.
IM_Item_Registry (Fields: Item_Code, Category (raw, manufactured, customer supplied, assembly component), Description, Component (boolean), active (boolean), Unit of Measure).
For this bill of materials, 100011 produces a component; let's call it a handle. Bill 100011 uses raw resin 700049 at 98% inclusion and raw color 600020 at 2% inclusion. However, we may run out of raw color 600020 and have to run it out of 600051, which would change 700049 to 98.5% inclusion because 600051 requires 1.5% inclusion to achieve the same color.
I would like to create a table that calls out the general term: let's say 600020 and 600051 are both "yellow color additive". Then create a "ghost" number that calls for either 600020 or 600051 and gives both formulation recipes. When production starts they would scan in which color they actually used, to create the production BOM themselves and record which color was used and how much. Is there a way to do this in Access database structuring?
I'm assuming I would need the item_registry table, a BoM table (fields: BOM#, ParentID, Ghost_ID) and then a components table (fields: Ghost_ID, item_code, Inclusion Rate).
Database normalization is the guiding principle for designing efficient, useful tables and relationships in a relational database. Access forms, subforms, reports, etc. require properly normalized tables to work as intended. There are various levels of normalization, but the common idea is to avoid duplication of data between rows and columns of data. Having duplicate data requires a lot of overhead in storage and in ensuring that actions on the database do not create inconsistent states (contradictory data values). Well-normalized tables allow useful constraints to be defined between data columns and/or rows to ensure that data is valid.
The [BoM] table as proposed in the question is not normalized. But before we get to that, the ParentID was not defined and it's not clear what it represents. Instead, to help show why it's not normalized, let me add a [Product] column to the [BoM] table. Then if such a handle has two alternative lists of components (ghosts?), the table would look like
BOMID  Product        GhostID
-----  -------------  -------
1      Handle         1
1      Handle         2
See the duplication? And now if the product is renamed, for instance to "Bronze Handle", then both rows need to be updated for a single conceptual element. It also introduces the possibility of having contradictory data like
BOMID  Product        GhostID
-----  -------------  -------
1      Handle         1
1      Bronze Handle  2
Enough said about that, since I've already gone on too much about normalization concepts here. Following is a basic normalized schema which would serve you better, but notice that it's not too different from what you proposed in the question. The only real difference is that the BoM table is normalized by splitting its columns (and purpose) into another table.
I do not list all columns here, only primary and foreign keys and a few other meaningful columns. PK = Primary Key (unique, non-null key), FK = Foreign Key. Proper indices should be defined on the PK and FK columns AND relationships defined with appropriate constraints.
Table: [IM_Item_Registry]
Item_Code (PK)
Table: [BOM]
BOMID (PK)
ProductID (FK)
Table: [BOM_Option]
OptionID (PK)
BOMID (FK)
Primary (boolean) - flags the primary/usual list of components
Description
Table: [Option_Items]
OptionID (FK; part of composite PK)
Item_Code (FK; part of composite PK)
Inclusion_Rate
The [BOM].[ProductID] column alludes to another table with details of the product which should be defined separately from the Bill of Material. If this database really is super-simplistic, then it could just be a string field [Product] containing the name, but I assume there are more useful details to store. Perhaps this is what the ParentID also alluded to? (I suggest choosing names that are not so abstract like "parent" and "ghost", hence my choice of the word "option".)
Really, since [BOM_Option] should be limited to a single primary option per BOM, it would fulfill proper normalization to create another table like
Table: [BOM_Primary]
[BOMID] (FK and PK) - Primary key so only one primary option can be defined at once
[OptionID] (FK)
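To make the schema above concrete, here is a minimal sketch in Python with the standard-library sqlite3 module (SQLite instead of Access, only so that it is runnable as-is). The item codes and inclusion rates come from the question; the descriptions, the column types and the choice to keep ProductID as plain text are assumptions. It uses the [BOM_Primary] variant rather than the Primary flag, and shows that each option's inclusion rates can be checked and that a second primary option for the same BOM is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE IM_Item_Registry (Item_Code TEXT PRIMARY KEY, Description TEXT);
-- ProductID would reference a product table in a fuller schema (assumption)
CREATE TABLE BOM (BOMID INTEGER PRIMARY KEY, ProductID TEXT NOT NULL);
CREATE TABLE BOM_Option (
    OptionID    INTEGER PRIMARY KEY,
    BOMID       INTEGER NOT NULL REFERENCES BOM(BOMID),
    Description TEXT
);
CREATE TABLE Option_Items (
    OptionID       INTEGER NOT NULL REFERENCES BOM_Option(OptionID),
    Item_Code      TEXT NOT NULL REFERENCES IM_Item_Registry(Item_Code),
    Inclusion_Rate REAL NOT NULL,
    PRIMARY KEY (OptionID, Item_Code)
);
-- BOMID is the primary key, so each BOM has at most one primary option
CREATE TABLE BOM_Primary (
    BOMID    INTEGER PRIMARY KEY REFERENCES BOM(BOMID),
    OptionID INTEGER NOT NULL REFERENCES BOM_Option(OptionID)
);
INSERT INTO IM_Item_Registry VALUES
    ('700049', 'raw resin'), ('600020', 'raw color'), ('600051', 'raw color');
INSERT INTO BOM VALUES (100011, 'handle');
INSERT INTO BOM_Option VALUES
    (1, 100011, 'with 600020'), (2, 100011, 'with 600051');
INSERT INTO Option_Items VALUES
    (1, '700049', 98.0), (1, '600020', 2.0),
    (2, '700049', 98.5), (2, '600051', 1.5);
INSERT INTO BOM_Primary VALUES (100011, 1);
""")

# Total inclusion per option; each recipe should add up to 100 percent.
totals = conn.execute(
    "SELECT OptionID, SUM(Inclusion_Rate) FROM Option_Items "
    "GROUP BY OptionID ORDER BY OptionID"
).fetchall()

# Components of the primary (usual) option for one BOM.
primary_items = conn.execute(
    "SELECT oi.Item_Code, oi.Inclusion_Rate "
    "FROM BOM_Primary bp JOIN Option_Items oi ON oi.OptionID = bp.OptionID "
    "WHERE bp.BOMID = ? ORDER BY oi.Item_Code",
    (100011,),
).fetchall()

# A second primary option for the same BOM violates the primary key.
try:
    conn.execute("INSERT INTO BOM_Primary VALUES (100011, 2)")
    second_primary_rejected = False
except sqlite3.IntegrityError:
    second_primary_rejected = True
```

Recording what production actually used would then be a matter of storing the chosen OptionID with each production run, which is outside this sketch.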

What is the proper way to store 'metadata' in relational database?

I have a table called assets, where an asset can belong to a user, team, or division, and possibly multiple of each. My issue is that the assets are highly variable, and can have properties associated with them that are different for each one.
ex. These could be assets:
1.)
type:workbench
cost:200
vendor:Acme Co.
color:black
2.)
type:microscope
serial_no:BH-00102
purchase_date:1337800923
cost:2040
and this could go on for hundreds to thousands of different types of assets.
How would I store this type of data in a normalized way that would be easy to query, without altering my tables every time a new asset type is added? Some of the fields are present across all assets too, such as cost.
So far I figure that I should have:
assets
id,cost,purchase_date,asset_type_id
asset_types
id,name
division_assets
division_id,asset_id
user_assets
user_id,asset_id
But I do not know where to put the data that varies.
When I've been faced with this in the past, the "best" answer always ends up varying depending on how much processing I want to do in the database, vs how much in the client code.
For what it's worth, often the approach that has worked best for me in the past has been to end up with one table per optional attribute (in particular, not one table per entity type). So, in your examples above
assets (as per your example)
asset_types (as per your example)
division_assets (as per your example)
user_assets (as per your example)
colours
asset_id, colour
weights
asset_id, weight
serial_numbers
asset_id, serial_number
Of course, depending on the trade-offs you need to make, this might be a bad choice for you. Personally, I like to keep the schema for data as explicit as possible, including data types and constraints, so I have no drama in changing the tables next time a new attribute comes along.
I would suggest this:
assets (
id
asset_type_id
vendor_id
cost
purchase_date
)
asset_properties (
id
asset_id
asset_property_type_id
value
)
asset_property_types (
id
property_type
)
asset_types (
id
asset_type
)
vendors (
id
vendor
)
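A minimal sketch of how the tables suggested above could be used, written in Python with the standard-library sqlite3 module (SQLite rather than MySQL so it runs as-is). The sample values come from the question; the column types, the UNIQUE constraint and the helper load_asset are assumptions for illustration, and the vendors table is left out for brevity. Note that every variable value ends up stored as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE asset_types (
    id INTEGER PRIMARY KEY, asset_type TEXT NOT NULL UNIQUE
);
CREATE TABLE asset_property_types (
    id INTEGER PRIMARY KEY, property_type TEXT NOT NULL UNIQUE
);
CREATE TABLE assets (
    id            INTEGER PRIMARY KEY,
    asset_type_id INTEGER NOT NULL REFERENCES asset_types(id),
    cost          NUMERIC
);
CREATE TABLE asset_properties (
    id                     INTEGER PRIMARY KEY,
    asset_id               INTEGER NOT NULL REFERENCES assets(id),
    asset_property_type_id INTEGER NOT NULL REFERENCES asset_property_types(id),
    value                  TEXT NOT NULL,   -- every value is stored as text
    UNIQUE (asset_id, asset_property_type_id)
);
INSERT INTO asset_types VALUES (1, 'workbench'), (2, 'microscope');
INSERT INTO asset_property_types VALUES (1, 'color'), (2, 'serial_no');
INSERT INTO assets VALUES (1, 1, 200), (2, 2, 2040);
INSERT INTO asset_properties (asset_id, asset_property_type_id, value) VALUES
    (1, 1, 'black'), (2, 2, 'BH-00102');
""")


def load_asset(conn, asset_id):
    """Return the fixed columns plus the variable properties as one dict."""
    asset_type, cost = conn.execute(
        "SELECT t.asset_type, a.cost FROM assets a "
        "JOIN asset_types t ON t.id = a.asset_type_id WHERE a.id = ?",
        (asset_id,),
    ).fetchone()
    props = dict(conn.execute(
        "SELECT pt.property_type, p.value FROM asset_properties p "
        "JOIN asset_property_types pt ON pt.id = p.asset_property_type_id "
        "WHERE p.asset_id = ?",
        (asset_id,),
    ).fetchall())
    return {"type": asset_type, "cost": cost, **props}


workbench = load_asset(conn, 1)
microscope = load_asset(conn, 2)
```

Adding a new kind of property is then an INSERT into asset_property_types rather than a schema change, at the price of losing per-attribute data types and constraints.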
You can add another table for asset_metadata
asset_metadata
asset_metadata_id,asset_id,metadata_name,metadata_value
If you want to normalize and categorize the metadata, normalize it this way:
asset_metadata
asset_metadata_id,asset_id,metadata_name_id,metadata_value
metadata_name
metadata_name_id,metadata_name_text
I'd recommend putting the common attributes like cost in conventional columns. Then add one more column in which you put a serialized collection of all the other variable asset attributes.
CREATE TABLE assets (
asset_id INT AUTO_INCREMENT PRIMARY KEY,
cost NUMERIC(9,2),
purchase_date DATE,
variables TEXT
);
You can serialize the collection as JSON or XML or whatever you want. Use whatever is most easily processed by your application code.
INSERT INTO assets VALUES (123, 49.95, CURDATE(), 'color: black; vendor: Acme Co.');
The advantage is that you can add new attributes to the text blob at any time. The disadvantage is that you can't read or write an individual attribute, you have to treat the whole collection as a lump.
But you can index individual attributes to make them searchable. You need to create a new table for each attribute you want to be searchable (but this could be a small subset of all attributes):
CREATE TABLE asset_color (
asset_id INT NOT NULL,
color VARCHAR(10),
PRIMARY KEY (asset_id, color),
KEY(color)
);
Not every asset is recorded in this table, only those assets that have a color.
Then you can do an indexed search for all assets that have a color attribute:
SELECT assets.*
FROM assets INNER JOIN asset_color USING (asset_id);
You can also do an indexed search limited to assets that have a color attribute, and the color is black:
SELECT assets.*
FROM assets INNER JOIN asset_color USING (asset_id)
WHERE color = 'black';
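One way the application side of this could look, as a hedged sketch in Python with the standard-library json and sqlite3 modules (SQLite stands in for MySQL so the example runs on its own). The helper save_asset, the choice of JSON and the second sample asset are assumptions; the idea taken from the text is that the blob holds all variable attributes while asset_color holds a searchable copy of just one of them.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE assets (
    asset_id  INTEGER PRIMARY KEY,
    cost      NUMERIC,
    variables TEXT            -- JSON-serialized variable attributes
);
CREATE TABLE asset_color (
    asset_id INTEGER NOT NULL REFERENCES assets(asset_id),
    color    TEXT NOT NULL,
    PRIMARY KEY (asset_id, color)
);
CREATE INDEX asset_color_color ON asset_color (color);
""")


def save_asset(conn, asset_id, cost, variables):
    """Store the blob and keep the searchable color table in step with it."""
    with conn:  # one transaction: blob and index row succeed or fail together
        conn.execute(
            "INSERT INTO assets VALUES (?, ?, ?)",
            (asset_id, cost, json.dumps(variables)),
        )
        if "color" in variables:
            conn.execute(
                "INSERT INTO asset_color VALUES (?, ?)",
                (asset_id, variables["color"]),
            )


save_asset(conn, 123, 49.95, {"color": "black", "vendor": "Acme Co."})
save_asset(conn, 124, 2040, {"serial_no": "BH-00102"})

# Indexed search on color, then decode the whole blob in application code.
black = [
    (asset_id, json.loads(blob))
    for asset_id, blob in conn.execute(
        "SELECT assets.asset_id, assets.variables "
        "FROM assets INNER JOIN asset_color USING (asset_id) "
        "WHERE color = 'black'"
    )
]
```

The cost of this design is that the application has to keep the blob and the per-attribute table consistent on every write, which is why the helper does both inside one transaction.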
There is really no way to design a normalized database that permits variable attributes. All normal forms require first that the table be a relation. And a relation by definition must have a fixed set of attributes.
Other people are recommending an EAV table, but the "value" column in an EAV doesn't meet the definition of a relational column with a type (other consequences of this are that constraints don't work in an EAV table). Therefore an EAV table isn't a relation, and cannot satisfy any normal form either.
You can create two new tables:
1) Defining multiple asset attributes in the following table (as many as the asset may have)
asset_id
asset_attribute
asset_value
2) asset_attribute table
attribute_id
asset_attribute
The logic would be that asset_attributes will need to be first defined in the asset_attribute table and then it can be used (linked/tagged) with any asset (as a foreign key, from a drop down list on UI) and a proper value entered.
Hope this helps.

Database Structure for Inconsistent Data

I am creating a database for my company that will store many different types of information. The categories are Brightness, Contrast, Chromaticity, etc. Each category has a number of data points which my company would like to start storing.
Normally, I would create a table for each category which would store the corresponding data. (This is how I learned to do it.) However, sometimes these categories have "sub-data" which would change the number of fields required in each table.
My question is then how do people handle the inconsistency of data when structuring their databases? Do they just keep adding more tables for extra data or is it something else altogether?
There are a few (and thank goodness only a few) unbendable rules about relational database models. One of those is, that if you don't know what to store, you have a hard time storing it. Chances are, you'll have an even harder time retrieving it.
That said, the reality of business rules is often less clear cut than the ivory tower of database design. Most importantly, you might want or even need a way to introduce a new property without changing the schema.
Here are two feasible ways to go at this:
1. Use a datastore that specializes in loose or nonexistent schemas (NoSQL and friends). Explaining this in detail is the subject of a CS thesis, not a Stack Overflow answer.
2. My recommendation: use a separate properties table. Here is how this goes:
Assuming for the sake of argument that your products always have a (unique string) name, an (integer) id, brightness, contrast and chromaticity, plus sometimes an (integer) foo and a (string) bar, consider these tables:
CREATE TABLE products (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
brightness INT,
contrast INT,
chromaticity INT,
UNIQUE INDEX(name)
);
CREATE TABLE properties (
id INT PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(50) NOT NULL,
proptype ENUM('null','int','string') NOT NULL default 'null',
UNIQUE INDEX(name)
);
INSERT INTO properties VALUES
(0,'foo','int'),
(0,'bar','string');
CREATE TABLE product_properties (
id INT PRIMARY KEY AUTO_INCREMENT,
product_id INT NOT NULL,
property_id INT NOT NULL,
-- only the column matching properties.proptype is filled; the other stays NULL
intvalue INT NULL,
stringvalue VARCHAR(250) NULL,
UNIQUE INDEX(product_id,property_id)
);
Now your "standard" properties would be in the products table as usual, while the "optional" properties would be stored in a row of product_properties that references the product id and property id, with the value being in intvalue or stringvalue.
Selecting products including their foo if any would look like
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.product_id
AND product_properties.property_id=1
or even
SELECT
products.*,
product_properties.intvalue AS foo
FROM products
LEFT JOIN product_properties
ON products.id=product_properties.product_id
-- look the property up by name inside the join condition, so products
-- that only have other properties are still returned (with foo = NULL)
AND product_properties.property_id=(
SELECT id FROM properties WHERE name='foo'
)
Please understand, that this incurs a performance penalty - in fact you trade performance against flexibility: Adding another property is nothing more than INSERTing a row in properties, the schema stays the same.
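The trade-off can be tried out with a small sketch in Python using the standard-library sqlite3 module (SQLite in place of MySQL so that it runs without a server). The product names, the sample rows, the helper add_property and the third property name are made up for the example; the table layout is a simplified variant of the one above with product_id and property_id as column names and a composite primary key instead of a surrogate id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE properties (
    id       INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE,
    proptype TEXT NOT NULL CHECK (proptype IN ('int', 'string'))
);
CREATE TABLE product_properties (
    product_id  INTEGER NOT NULL REFERENCES products(id),
    property_id INTEGER NOT NULL REFERENCES properties(id),
    intvalue    INTEGER,
    stringvalue TEXT,
    PRIMARY KEY (product_id, property_id)
);
INSERT INTO products VALUES (1, 'panel A'), (2, 'panel B'), (3, 'panel C');
INSERT INTO properties VALUES (1, 'foo', 'int'), (2, 'bar', 'string');
INSERT INTO product_properties VALUES
    (1, 1, 7, NULL),        -- panel A has foo
    (1, 2, NULL, 'x'),      -- ... and bar
    (2, 2, NULL, 'y');      -- panel B has only bar; panel C has nothing
""")


def add_property(conn, name, proptype):
    """A new optional property is just a row; no ALTER TABLE is needed."""
    with conn:
        return conn.execute(
            "INSERT INTO properties (name, proptype) VALUES (?, ?)",
            (name, proptype),
        ).lastrowid


# Every product appears once; foo is NULL (None) where it is not set.
foo_by_product = conn.execute("""
    SELECT p.name, pp.intvalue AS foo
    FROM products p
    LEFT JOIN product_properties pp
           ON pp.product_id = p.id
          AND pp.property_id = (SELECT id FROM properties WHERE name = 'foo')
    ORDER BY p.id
""").fetchall()

baz_id = add_property(conn, "baz", "int")  # hypothetical third property
```

Each additional optional property that a query needs costs one more join of this kind, which is where the performance penalty mentioned above comes from.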
If you're not bound to mysql, other databases have table inheritance or arrays to solve some of those niche cases. PostgreSQL is a very nice database that you can use as easily and freely as mysql.
With mysql you could:
1. Change your tables: add the extra columns and allow NULL in the subcategory columns that you don't need. This way integrity can be checked, since you can still put constraints on the columns. Unless you really have a lot of subcategory columns, I'd recommend this; otherwise option 3.
2. Store subcategory data dynamically in a separate table that has a category_id, a category_row_id, a subcategory identifier (= type of subcategory) and a value column. That way you can retrieve your data by linking it via the category_id (determines the table) and the category_row_id (links to the PK of the original category table row). The bad thing: you can't use foreign keys or constraints properly to enforce integrity. You'd need to write hairy insert/update triggers to still have some control there; otherwise the burden of integrity checking and referential checking falls solely on the client (in which case you'd probably be better off going the NoSQL route). In short, I wouldn't recommend this.
3. Make a separate subcategory table per category table. Columns can be fixed, or variable via value column(s) plus an optional subcategory identifier, and foreign keys can still be used. Fixed columns are best for maintaining integrity, since you'll have the full range of constraints at your disposal. If you have a lot of subcategory columns that would otherwise hopelessly clutter your regular category table, then I'd recommend using this with fixed columns. As with the previous option, I'd never recommend going dynamic for anything but throwaway data.
4. Alternatively, if your subcategory is very variable and volatile, use NoSQL with a document database such as MongoDB. Mind you, you can keep all your regular data in a proper RDBMS and just store side-data in the document database, though that's probably not recommended.
If your subcategory data is in a known fixed state and not prone to change, I'd just add the extra columns to the specific category table. Keep in mind that the major feature of a proper DBMS is safeguarding the integrity of your data via checks and constraints; doing away with that is never really a good idea.
If you are not limited to MySQL, you can consider Microsoft SQL Server and its sparse columns. These allow you to expand your schema to include however many columns you want, without incurring the storage penalty for columns that are not pertinent for a given row.

Representing News Post in mySQL

I'm currently working on a blog for a college news organization. Each post, though, will represent a full show, with multiple contributors and multiple titles.
For example, a post might have three news stories, each with its own title and some contributors for each:
"Story 1" by (id1) and (id2)
"Story 2" by (id3)
"Story 3" by (id4) and (id5)
So for each post, there would be an index (1, 2, 3...) for each individual story, a VARCHAR for the title, and id's that represent contributors, whose details are stored in another "contributors" table. The problem is that I don't know how many stories there will be, or how many contributors there will be per story. It could range from ~3 at the least to up to 6. In case our show expands in the future, I'd like to have the capability to scale up to even more than 6 posts, too.
I want to represent this structure concisely in a mySQL column, but I'm not sure how to do that. One solution would be to create another mySQL table to save the details for each individual story, but I'd prefer to avoid that hassle. The ideal solution would be if I could somehow create an "array" within a mySQL column, which could store (for each story) an index, a string, and multiple id's to show who the contributors are.
Is this possible, or will I have to create a new table to keep track of each story?
Don't use a column - use a table. It can be a simple InnoDB table which doesn't really hurt performance at all. Define a combined primary key (story_id, contributor_id) and insert all contributions in that table.
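What you describe in your question is a many-to-many (M:N) relationship, and the table above is called an M:N table. As for packing an array into a single column: don't ever go there - it's a very bad thing to do and is, in fact, nearly impossible to query properly in relational databases.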
Save yourself some future heartburn. Create the extra table. It looks like a table of [Posts] with a one-to-many relationship to [Stories] where [Stories] has a many-to-many relationship to [Contributors].
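The structure described above can be sketched as follows, in Python with the standard-library sqlite3 module (SQLite rather than MySQL so it runs without a server). The table and column names, including story_index and the helper stories_for_post, are assumptions; the sample data mirrors the three stories from the question, with the placeholder ids used as contributor names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
CREATE TABLE contributors (
    contributor_id INTEGER PRIMARY KEY, name TEXT NOT NULL
);
-- one post has many stories
CREATE TABLE stories (
    story_id    INTEGER PRIMARY KEY,
    post_id     INTEGER NOT NULL REFERENCES posts(post_id),
    story_index INTEGER NOT NULL,          -- 1, 2, 3... within the post
    title       TEXT NOT NULL,
    UNIQUE (post_id, story_index)
);
-- stories and contributors are many-to-many
CREATE TABLE story_contributors (
    story_id       INTEGER NOT NULL REFERENCES stories(story_id),
    contributor_id INTEGER NOT NULL REFERENCES contributors(contributor_id),
    PRIMARY KEY (story_id, contributor_id)
);
INSERT INTO posts VALUES (1);
INSERT INTO contributors VALUES
    (1, 'id1'), (2, 'id2'), (3, 'id3'), (4, 'id4'), (5, 'id5');
INSERT INTO stories VALUES
    (1, 1, 1, 'Story 1'), (2, 1, 2, 'Story 2'), (3, 1, 3, 'Story 3');
INSERT INTO story_contributors VALUES (1, 1), (1, 2), (2, 3), (3, 4), (3, 5);
""")


def stories_for_post(conn, post_id):
    """Return [(title, [contributor names])] in story order."""
    grouped = {}
    rows = conn.execute(
        "SELECT s.story_index, s.title, c.name "
        "FROM stories s "
        "JOIN story_contributors sc ON sc.story_id = s.story_id "
        "JOIN contributors c ON c.contributor_id = sc.contributor_id "
        "WHERE s.post_id = ? ORDER BY s.story_index, c.contributor_id",
        (post_id,),
    )
    for story_index, title, name in rows:
        grouped.setdefault((story_index, title), []).append(name)
    return [(title, names) for (_, title), names in grouped.items()]


show = stories_for_post(conn, 1)
```

Nothing in this layout limits a post to six stories or a story to a fixed number of contributors; both just become more rows.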
You could store a comma-delimited string value of contributor ids or story ids in one column, but how exactly would you relate them? Your best bet in that case would seem to be an 'array' of 'arrays', where your main string consists of pairs of strings strung together with commas. I (so it's just my opinion, okay?) would avoid this unless totally necessary (and I can't think of one instance at this time)...
So create your relationships tables. Just to illustrate one approach to the idea:
-- a story may have multiple contributors
CREATE TABLE story_contributor_rel (
story_id INT NOT NULL
, contributor_id INT NOT NULL
);
-- a post may have multiple stories
CREATE TABLE post_story_rel (
post_id INT NOT NULL
, story_id INT NOT NULL
);
Or cheat it a bit, but I'd recommend against this also(!):
-- a less-normalized way
CREATE TABLE post_relationships (
post_id INT NOT NULL
, story_id INT NOT NULL
, contributor_id INT NOT NULL
);
These are just the simplest approaches. Naturally, you'd want to have additional identity columns and/or proper indexing and primary key settings, but this is the simplest way I can illustrate the point I'm driving at.
Imagine this too: if you were to pack all those relationships into logical columns, then without the application it would not be easy for anyone to understand what's going on in your tables. If you keep that logic out of the column contents and track relationships properly (meaning relationship tables), the structure becomes transparent. One look at these tables and it would not take long to understand.
That's just my opinion. :) Cheers!

Table design and class hierarchies

Hopefully someone can shed some light on this issue through either an example, or perhaps some suggested reading. I'm wondering what is the best design approach for modeling tables after their class hierarchy equivalencies. This can best be described through an example:
abstract class Card{
private $_name = '';
private $_text = '';
}
class MtgCard extends Card{
private $_manaCost = '';
private $_power = 0;
private $_toughness = 0;
private $_loyalty = 0;
}
class PokemonCard extends Card{
private $_energyType = '';
private $_hp = 0;
private $_retreatCost = 0;
}
Now, when modeling tables to synchronize with this class hierarchy, I've gone with something very similar:
TABLE Card
id INT, AUTO_INCREMENT, PK
name VARCHAR(255)
text TEXT
TABLE MtgCard
id INT, AUTO_INCREMENT, PK
card_id INT, FK(card.id)
manacost VARCHAR(32)
power INT
toughness INT
loyalty INT
TABLE PokemonCard
id INT, AUTO_INCREMENT, PK
card_id INT, FK(card.id)
hp INT
energytype ENUM(...)
retreatcost INT
The problem I'm having is trying to figure out how to associate each Card record with the record containing its details from the corresponding table. Specifically, how to determine what table I should be looking in.
Should I add a VARCHAR column to Card to hold the name of the associated table? That's the only resolution that my peers and I have come to, but it seems too "dirty". Keeping the design extensible is the key here, allowing for the easy addition of new subclasses.
If someone could provide an example or resources showing a clean way of mirroring class/table hierarchies, it would be most appreciated.
Google "generalization specialization relational modeling". You'll find several excellent articles on the subject of how to model the gen-spec pattern using relational tables. This same question has been asked many times in SO, with slightly different details.
The best of these articles will confirm your decision to have one table for generalized data and separate tables for specialized data. The biggest difference will be the way they recommend using primary and foreign keys. Basically, they recommend that specialized tables have a single column that does double duty. It serves as the primary key to the specialized table, but it's also a foreign key that duplicates the PK of the generalized table.
This is a little complicated to maintain, but it's very sweet at join time.
Also keep in mind that DDL is required when a new class is added to the hierarchy.
Basically don't.
Forget about class hierarchies, storage models, and anything that is specific to your app and your particular app language. Unless you want to use the RDb as a mere storage location for your files, a dependent slave.
If you want the power and flexibility (specifically extensibility) of the relational Database, then you need to model it independent of any app, and using RDb principles, not app language requirements. Leave your app context behind for a while and design the database as a database. Learn about them. Normalise (eliminate all duplication). Learn about the structures and rules, and implement them. When you do that, your queries and your "mapping", will be effortless. There will be no "impedance". Use the correct datatypes and there will be no mismatch.
The structure you require is an ordinary subtype-supertype. Those are Relational Database terms that have been in existence for over 30 years in the RM, and over 23 years in Relational Database products. No need to call them funny new names. Wikipedia is not an academic reference.
Given your tables, which are quite correct as a starting point (you've Normalised automatically), you need:
Rename Card.Id as Card.CardId
Remove the ids for the subtypes, they are 100% redundant; the CardId is both the PK and the FK.
Add a discriminator Card.CardType CHAR(1) or TINYINT. This will identify which subtype to join with, when the CardType is not known.
It appears you do not fully understand the concept of Foreign Keys, so that would be good to gear up on first. It is implemented here in its simple, ordinary form:
ALTER TABLE MtgCard
ADD CONSTRAINT Card_MtgCard_fk
FOREIGN KEY (CardId)
REFERENCES Card(CardId)
The relation between Card and MtgCard or PokemonCard is always 1::1. The supertype is complete only when there is a Card plus { MtgCard | PokemonCard } with the same CardId. In your case there can be only one subtype, easy to enforce with a simple CHECK constraint.
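As a hedged illustration of the subtype-supertype structure with a discriminator, here is a minimal sketch in Python using the standard-library sqlite3 module (SQLite, so it runs without a server). The answer only says a simple CHECK constraint is enough; the particular mechanism below, a CHECK on each subtype's CardType combined with a composite foreign key to Card (CardId, CardType), is one commonly used way to do it and is an assumption here, as are the type codes 'M' and 'P', the sample rows and the helper load_card.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Card (
    CardId   INTEGER PRIMARY KEY,
    CardType TEXT NOT NULL CHECK (CardType IN ('M', 'P')),  -- discriminator
    Name     TEXT NOT NULL,
    UNIQUE (CardId, CardType)          -- target for the composite FKs below
);
CREATE TABLE MtgCard (
    CardId   INTEGER PRIMARY KEY,      -- PK and FK at once, no extra id
    CardType TEXT NOT NULL DEFAULT 'M' CHECK (CardType = 'M'),
    ManaCost TEXT,
    FOREIGN KEY (CardId, CardType) REFERENCES Card (CardId, CardType)
);
CREATE TABLE PokemonCard (
    CardId   INTEGER PRIMARY KEY,
    CardType TEXT NOT NULL DEFAULT 'P' CHECK (CardType = 'P'),
    Hp       INTEGER,
    FOREIGN KEY (CardId, CardType) REFERENCES Card (CardId, CardType)
);
INSERT INTO Card VALUES (1, 'M', 'card one'), (2, 'P', 'card two');
INSERT INTO MtgCard (CardId, ManaCost) VALUES (1, '1G');
INSERT INTO PokemonCard (CardId, Hp) VALUES (2, 60);
""")

# Fixed whitelist mapping discriminator values to subtype tables.
SUBTYPE_TABLE = {"M": "MtgCard", "P": "PokemonCard"}


def load_card(conn, card_id):
    """Read the supertype row, then use CardType to pick the subtype table."""
    name, card_type = conn.execute(
        "SELECT Name, CardType FROM Card WHERE CardId = ?", (card_id,)
    ).fetchone()
    table = SUBTYPE_TABLE[card_type]  # comes from the whitelist, not user input
    detail = conn.execute(
        f"SELECT * FROM {table} WHERE CardId = ?", (card_id,)
    ).fetchone()
    return name, table, detail


first = load_card(conn, 1)

# Card 1 is typed 'M', so giving it a PokemonCard row breaks the composite FK.
try:
    conn.execute("INSERT INTO PokemonCard (CardId, Hp) VALUES (1, 50)")
    wrong_subtype_rejected = False
except sqlite3.IntegrityError:
    wrong_subtype_rejected = True
```

This answers the original question of which table to look in: the discriminator on Card says so, and no table name has to be stored as a string.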
In other cases, more than one subtype is quite legal. For example, a Person can be a Teacher and a Student at the same time; the subtypes there are Person Is a Teacher and Person Is a Student.
In Relational Databases there is no concept of joining "from" or "to" (or up/down or left/right), those notions are only there to assist us humans; you can start with any table/key you have, and go to any table you need. The tables in-between are demanded only in the absence of Relational Identifiers (ie. where additional Surrogates, ID columns, are used as PKs instead of meaningful natural keys).
In the example, using your terms, you can go straight from Enrollment to Person (eg, to grab the LastName) or to Course (to grab the Name) without having to visit the intermediate tables; the relation lines are solid.
Now, class hierarchies ("Is" or "Is a") and anything else, are simple and effortless.
Quick Reference to Standard Relational Database Diagrams.