Storing variable values in a database efficiently - mysql

I am currently dealing with a data structure similar to the one linked here:
http://sqlfiddle.com/#!2/2ad8f/1
There will be a field (fruits in this case) that can contain highly variable options - quantity, colour, type, etc. I am trying to work out an efficient way of storing this data and using it programmatically in a frontend.
I have thought about creating new fields (e.g. a field for quantity, a field for colour, etc), however the data can be highly variable and I will be dealing with many, many rows. Potentially 1-2 million. I don't want to create a "texture" field for example that is only used for 100/1,000,000 rows.
The "fruits" here would never be ordered by or referenced by the database storage engine.
My best idea so far is to store a JSON object as a string (see the second insert in link), however is there a more efficient method?

If you want to place all your attributes into one text container, you may as well be using a text file instead of a relational database. The database will have a lot of overhead that you are simply not using so why have it?
If you want this in a relational form, then let's go through some simple modeling.
We have different kinds of fruit. These fruit can have different attributes, and even different kinds of attributes. Here is one simple way:
create table Fruit(
ID int auto_increment primary key,
Name varchar( 20 ) not null, -- Apple, Orange, etc.
Type varchar( 20 ), -- Macintosh, Granny Smith, Navel, etc.
Size char( 1 ), -- S, M, L
Qty int not null
-- other data such as price, shelflife, whatever
);
So now we create a table for each type of disparate attribute:
create table Attr(
ID int auto_increment primary key,
Type varchar( 20 ), -- Color, Texture, Taste, etc.
Value varchar( 10 ) -- Red, Green, Juicy, Sweet, Sour, etc.
);
Each fruit can have several attributes and each attribute may apply to several kinds of fruit, so you need a many-to-many cross table between them:
create table FruitAttr(
FruitID int,
AttrID int,
primary key( FruitID, AttrID )
);
with FruitID a foreign key to Fruit and AttrID a foreign key to Attr. Now we can create a Basket table which will define each individual basket.
create table Basket(
ID int auto_increment primary key,
Name varchar( 20 ) not null, -- Graduation, Funeral, Birthday, etc.
Price decimal (19,4)
-- other basket-specific attributes
);
A basket is made up of several selections of fruit and each fruit may appear in several types of basket. So there is the same relationship between Basket and Fruit as between Fruit and Attr: many-to-many. As we've already modeled one of those tables, I'll leave that to you.
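A minimal sketch of that remaining cross table, for anyone who wants to check their own version. Python's built-in sqlite3 module stands in for MySQL so it runs anywhere; the BasketFruit name, the trimmed columns and the sample rows are assumptions made for illustration.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL here; tables are trimmed to key columns.
conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces FKs when asked
conn.executescript("""
create table Fruit(
    ID   integer primary key,
    Name varchar( 20 ) not null
);
create table Basket(
    ID   integer primary key,
    Name varchar( 20 ) not null
);
-- Hypothetical cross table, built the same way as FruitAttr.
create table BasketFruit(
    BasketID int not null references Basket( ID ),
    FruitID  int not null references Fruit( ID ),
    primary key( BasketID, FruitID )
);
insert into Fruit( ID, Name ) values ( 1, 'Apple' ), ( 2, 'Orange' ), ( 3, 'Pear' );
insert into Basket( ID, Name ) values ( 1, 'Birthday' ), ( 2, 'Graduation' );
insert into BasketFruit( BasketID, FruitID ) values ( 1, 1 ), ( 1, 2 ), ( 2, 2 );
""")


def basket_contents(basket_name):
    """Return the fruit names in one basket, sorted alphabetically."""
    rows = conn.execute(
        """
        select f.Name
        from Basket b
        join BasketFruit bf on bf.BasketID = b.ID
        join Fruit f on f.ID = bf.FruitID
        where b.Name = ?
        order by f.Name
        """,
        (basket_name,),
    ).fetchall()
    return [name for (name,) in rows]


before = basket_contents("Birthday")

# Substitute pears for apples in every basket with a single statement.
with conn:
    conn.execute("update BasketFruit set FruitID = 3 where FruitID = 1")

after = basket_contents("Birthday")
```

The substitution is one update on the cross table; no basket row and no fruit row has to change.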
There are enhancements and changes that could be made to tailor these tables closer to your specific uses, but we now have a workable solution.
So very quickly we have gone from one table to five tables. That may seem like we've complicated everything but if you have to work with them, you will find we have made our (meaning your) life a whole lot easier, especially when you add new types of baskets or fruit, change the makeup of a basket, substitute one fruit (severe core rot suddenly makes Granny Smiths unavailable), or any number of ways you will need to change your data.
After all, it is a relational database and relations are established between tables, not between substrings within strings. So the DML and queries to work with these relations will be so much easier than trying to manipulate text strings.
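To make the point about queries concrete, here is a rough sketch of an attribute search over the Fruit, Attr and FruitAttr tables. It uses Python's sqlite3 module in place of MySQL, the tables are trimmed, and the sample rows and the fruits_with helper are assumptions for illustration only.

```python
import sqlite3

# SQLite (in memory) stands in for MySQL; sample data is invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Fruit(
    ID   integer primary key,
    Name varchar( 20 ) not null,
    Type varchar( 20 ),
    Qty  int not null
);
create table Attr(
    ID    integer primary key,
    Type  varchar( 20 ),
    Value varchar( 10 )
);
create table FruitAttr(
    FruitID int references Fruit( ID ),
    AttrID  int references Attr( ID ),
    primary key( FruitID, AttrID )
);
insert into Fruit( ID, Name, Type, Qty ) values
    ( 1, 'Apple',  'Granny Smith', 12 ),
    ( 2, 'Apple',  'Macintosh',     5 ),
    ( 3, 'Orange', 'Navel',         8 );
insert into Attr( ID, Type, Value ) values
    ( 1, 'Color', 'Green' ),
    ( 2, 'Color', 'Red' ),
    ( 3, 'Taste', 'Sour' );
insert into FruitAttr( FruitID, AttrID ) values ( 1, 1 ), ( 1, 3 ), ( 2, 2 );
""")


def fruits_with(attr_type, attr_value):
    """Return (Name, Type) for every fruit that carries the given attribute."""
    return conn.execute(
        """
        select f.Name, f.Type
        from Fruit f
        join FruitAttr fa on fa.FruitID = f.ID
        join Attr a on a.ID = fa.AttrID
        where a.Type = ? and a.Value = ?
        order by f.ID
        """,
        (attr_type, attr_value),
    ).fetchall()


sour_fruits = fruits_with("Taste", "Sour")
```

The same question asked of a JSON string column would need string matching or application-side parsing of every row.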

Designing SQL database for an item with multiple names

I am creating a table for dietary_supplement where a supplement can have many ingredients.
I am having trouble designing the table for the ingredients.
The issue is that an ingredient can have many names or an acronym.
For example, vitaminB1 has other names like Thiamine and thiamin.
An acronym such as BHA can stand for both Butylated hydroxyanisole and beta hydroxy acid (this is actually an ingredient for skincare products, but I am using it anyway because it makes a good example).
I am also concerned about spacing and "-". For example, someone can spell vitaminA without a space and someone else can write vitamin A. Also, beta hydroxy acid can be written as β-hydroxy acid (with "-") or β hydroxy acid (without "-").
I have two options in mind:
1) Put all the names for one ingredient in a single column, using a semicolon to separate the names, e.g. beta hydroxy acid;BHA;β-hydroxy acid;β hydroxy acid
- This would be easy, but I am not sure it is a smart way to design the database when I have to perform search actions etc.
2) Create a table for all the names and relate it to a table for ingredients.
- This is the option I am leaning towards, but I wonder if there are better ways to do this. And do I have to create separate rows for the same item when the only difference is spacing or a "-"?
Make a mapping table of 'name' to 'canonical_name' (or id). It would have rows like
Thiamine vitaminB1
thiamin vitaminB1
vitaminB1 vitaminB1
B1 vitaminB1
By using a collation ending with _ci, you don't need to worry about capitalization.
When ingesting the data for a supplement, first look up the name to get the canonical_name, then use the latter in any other table(s).
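In that 2-column table, have
PRIMARY KEY(name),
INDEX(canonical_name, name)
so that you can go in either direction. (The key has to be on name: canonical_name repeats on every row for the same ingredient, so it cannot be the primary key on its own.)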
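A small sketch of the lookup in both directions, assuming the table is keyed on name. It runs on Python's sqlite3 module rather than MySQL, so COLLATE NOCASE is used as a rough stand-in for a _ci collation; the IngredientAlias table name and the two helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- COLLATE NOCASE is SQLite's rough counterpart of a MySQL *_ci collation.
create table IngredientAlias(
    name           varchar( 63 ) collate nocase primary key,
    canonical_name varchar( 63 ) not null
);
create index IX_Alias_Canonical on IngredientAlias( canonical_name, name );
insert into IngredientAlias( name, canonical_name ) values
    ( 'Thiamine',  'vitaminB1' ),
    ( 'thiamin',   'vitaminB1' ),
    ( 'vitaminB1', 'vitaminB1' ),
    ( 'B1',        'vitaminB1' );
""")


def canonical(name):
    """Map any known spelling to its canonical name, or None if unknown."""
    row = conn.execute(
        "select canonical_name from IngredientAlias where name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


def aliases(canonical_name):
    """List every spelling recorded for one canonical name."""
    rows = conn.execute(
        "select name from IngredientAlias where canonical_name = ?",
        (canonical_name,),
    ).fetchall()
    return [r[0] for r in rows]


found = canonical("THIAMINE")  # matches 'Thiamine' despite the different case
```

An acronym that maps to two ingredients, like BHA in the question, would not fit a key on name alone; the key would then have to cover both columns.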
Create one table for ingredients and one for supplements, give both a common column, and join on that column when you want to select.
It might be something like this:
CREATE TABLE Ingredient (
Id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, ImagePath VARCHAR(63)
, Description TEXT
-- other ingredient's non-name dependent properties
);
CREATE TABLE IngredientName (
Id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, IngredientId INTEGER UNSIGNED NOT NULL
, IsMain TINYINT(1) UNSIGNED NOT NULL DEFAULT 0
, Name VARCHAR(63) NOT NULL
, KEY IX_IngredientName_IngredientId_IsMain (IngredientId, IsMain)
, UNIQUE KEY IX_IngredientName_IngredientId_Name (IngredientId, Name)
, CONSTRAINT FK_IngredientName_IngredientId FOREIGN KEY (`IngredientId`) REFERENCES `Ingredient` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE
);
Or you can add an Ingredient.Name column to hold the main name and then get rid of IngredientName.IsMain.
For spaces you should use some name normalization in your application, such as removing consecutive spaces, capitalizing, and normalizing spaces around commas, dashes etc. Of course, you can apply such normalization in the database with a trigger if you like.
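As a rough illustration of application-side normalization, the following sketch applies the steps listed above. The normalize_name function and its exact rules are assumptions; adjust them to whatever spellings your data actually contains.

```python
import re


def normalize_name(raw):
    """Collapse spacing and dash variants so near-identical spellings compare equal."""
    name = raw.strip()
    name = re.sub(r"\s*-\s*", "-", name)    # no spaces around dashes
    name = re.sub(r"\s*,\s*", ", ", name)   # exactly one space after commas
    name = re.sub(r"\s+", " ", name)        # collapse runs of whitespace
    return name.strip().upper()             # capitalize for comparison


same = normalize_name("β - hydroxy acid") == normalize_name("β-hydroxy  acid")
```

Normalization of this kind only removes formatting noise. Spellings that differ in substance, such as "vitaminA" versus "vitamin A" or a name written with and without the dash, still need their own rows in the names table.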
There are some other possibilities.
You should first think about what the use cases for the DB will be.
This is very important. There is no 'the best universal DB design'.
If you need some special search cases you might need special DB design or at least indexes.
P.S. I believe that putting different names in one field as separator-delimited values is a bad idea.

How to store a data whose type can be numeric, date or string in mysql

We're developing a monitoring system. In our system, values are reported by agents running on different servers. The observations reported can be values like:
A numeric value, e.g. "CPU USAGE" = 55 (meaning 55% of the CPU is in use).
Certain event was fired. e.g. "Backup completed".
Status: e.g. SQL Server is offline.
We want to store these observations (which are not known in advance and will be added dynamically to the system without recompiling).
We are considering adding different columns to the observations table like this:
IntMeasure -> INTEGER
FloatMeasure -> FLOAT
Status -> varchar(255)
So if the value we wish to store is a number, we can use IntMeasure or FloatMeasure according to the type. If the value is a status, we can store the status literal string (or a status id if we decide to add a Statuses(id, name) table).
We suppose a more correct design is possible, but it would probably become too slow and obscure due to joins and dynamic table names depending on types. How would a join work if we can't specify the tables in advance in the query?
I haven't done a formal study, but from my own experience I would guess that more than 80% of database design flaws are generated from designing with performance as the most important (if not only) consideration.
If a good design calls for multiple tables, create multiple tables. Don't automatically assume that joins are something to be avoided. They are rarely the true cause of performance problems.
The primary consideration, first and foremost in all stages of database design, is data integrity. "The answer may not always be correct, but we can get it to you very quickly" is not a goal any shop should be working toward. Once data integrity has been locked down, if performance ever becomes an issue, it can be addressed. Don't sacrifice data integrity, especially to solve problems that may not exist.
With that in mind, look at what you need. You have observations you need to store. These observations can vary in the number and types of attributes and can be things like the value of a measurement, the notification of an event and the change of a status, among others and with the possibility of future observations being added.
This would appear to fit into a standard "type/subtype" pattern, with the "Observation" entry being the type and each type or kind of observation being the subtype, and suggests some form of type indicator field such as:
create table Observations(
...,
ObservationKind char( 1 ) check( ObservationKind in( 'M', 'E', 'S' )),
...
);
But hardcoding a list like this in a check constraint has a very low maintainability level. It becomes part of the schema and can be altered only with DDL statements. Not something your DBA is going to look forward to.
So have the kinds of observations in their own lookup table:
ID  Name         Meaning
==  ===========  =======
M   Measurement  The value of some system metric (CPU_Usage).
E   Event        An event has been detected.
S   Status       A change in a status has been detected.
(The char field could just as well be int or smallint. I use char here for illustration.)
Then fill out the Observations table with a PK and the attributes that would be common to all observations.
create table Observations(
ID int auto_increment primary key,
ObservationKind char( 1 ) not null,
DateEntered date not null,
...,
constraint FK_ObservationKind foreign key( ObservationKind )
references ObservationKinds( ID ),
constraint UQ_ObservationIDKind unique( ID, ObservationKind )
);
It may seem strange to create a unique index on the combination of Kind field and the PK, which is unique all by itself, but bear with me a moment.
Now each kind or subtype gets its own table. Note that each kind of observation gets a table, not the data type.
create table Measurements(
ID int not null,
ObservationKind char( 1 ) check( ObservationKind = 'M' ),
Name varchar( 32 ) not null, -- Such as "CPU Usage"
Value double not null, -- such as 55.00
..., -- other attributes of Measurement observations
constraint PK_Measurements primary key( ID, ObservationKind ),
constraint FK_Measurements_Observations foreign key( ID, ObservationKind )
references Observations( ID, ObservationKind )
);
The first two fields will be the same for the other kinds of observations except the check constraint will force the value to the appropriate kind. The other fields may differ in number, name and data type.
Let's examine an example tuple that may exist in the Measurements table:
ID    ObservationKind  Name       Value  ...
====  ===============  =========  =====
1001  M                CPU Usage  55.0   ...
In order for this tuple to exist in this table, a matching entry must first exist in the Observations table with an ID value of 1001 and an observation kind value of 'M'. No other entry with an ID value of 1001 can exist in either the Observations table or the Measurements table and cannot exist at all in any other of the "kind" tables (Events, Status). This works the same way for all the kind tables.
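The rules in the previous paragraph can be tried out with a small sketch. It uses Python's sqlite3 module as a stand-in for MySQL, trims the tables to the columns shown above, and adds invented sample rows; try_insert_measurement is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces FKs when asked
conn.executescript("""
create table ObservationKinds(
    ID   char( 1 ) primary key,
    Name varchar( 20 ) not null
);
create table Observations(
    ID              integer primary key,
    ObservationKind char( 1 ) not null references ObservationKinds( ID ),
    DateEntered     date not null,
    constraint UQ_ObservationIDKind unique( ID, ObservationKind )
);
create table Measurements(
    ID              int not null,
    ObservationKind char( 1 ) not null check( ObservationKind = 'M' ),
    Name            varchar( 32 ) not null,
    Value           double not null,
    constraint PK_Measurements primary key( ID, ObservationKind ),
    constraint FK_Measurements_Observations foreign key( ID, ObservationKind )
        references Observations( ID, ObservationKind )
);
insert into ObservationKinds( ID, Name ) values ( 'M', 'Measurement' ), ( 'E', 'Event' );
-- Sample rows: 1001 and 1003 are measurements, 1002 is an event.
insert into Observations( ID, ObservationKind, DateEntered ) values
    ( 1001, 'M', '2024-01-01' ),
    ( 1002, 'E', '2024-01-01' ),
    ( 1003, 'M', '2024-01-01' );
insert into Measurements( ID, ObservationKind, Name, Value )
    values ( 1001, 'M', 'CPU Usage', 55.0 );
""")


def try_insert_measurement(obs_id, kind):
    """Attempt one insert and report whether the constraints let it through."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "insert into Measurements( ID, ObservationKind, Name, Value ) "
                "values ( ?, ?, 'CPU Usage', 60.0 )",
                (obs_id, kind),
            )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


event_as_measurement = try_insert_measurement(1002, "M")  # 1002 is an Event
```

Observation 1002 is registered as an event, so it cannot be slipped into Measurements under either kind: 'M' fails the foreign key and 'E' fails the check. Note that MySQL versions before 8.0.16 parse check constraints but do not enforce them.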
I would further recommend creating a view for each kind of observation which will provide a join of each kind with the main observation table:
create view MeasurementObservations as
select ...
from Observations o
join Measurements m
on m.ID = o.ID;
Any code that works solely with measurements would need to only hit this view instead of the underlying tables. Using views to create a wall of abstraction between the application code and the raw data greatly enhances the maintainability of the database.
Now the creation of another kind of observation, such as "Error", involves a simple Insert statement to the ObservationKinds table:
F Fault A fault or error has been detected.
Of course, you need to create a new table and view for these error observations, but doing so will have no impact on existing tables, views or application code (except, of course, to write the new code to work with the new observations).
Just create it as a VARCHAR
This will allow you to store whatever data you require in it. It is much more difficult to do queries based on the number in the field such as
Select * from table where MyVARCHARField > 50 -- get CPU > 50
However if you think you want to do this, then either you need a field per item or a generalised table such as
create table ObservationValues( -- table name and column sizes are placeholders
Description varchar( 255 ),
ValueType varchar( 10 ), -- can be String, Float, Int
ValueString varchar( 255 ),
ValueFloat float,
ValueInt int
);
Then when you are filling the data you can put your value in the correct field and select like this:
select Description, ValueInt from ObservationValues where Description like '%cpu%' and ValueInt > 50;
I used two columns for a similar problem. The first column held the data type and the second column held the data as a varchar.
The first column had codes (e.g. 1 = integer, 2 = string, 3 = date and so on), which could be combined with the value to compare values (e.g. find the max integer where type = 1).
I did not have joins, but I think you can use this approach. It will also help you if more data types are introduced tomorrow.
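A short sketch of the type-code approach, including the cast that the "max integer" query needs. It runs on Python's sqlite3 module instead of MySQL; the Readings table, its columns and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table Readings(
    Name      varchar( 32 ) not null,
    ValueType int not null,            -- 1 = integer, 2 = string, 3 = date
    Value     varchar( 255 ) not null  -- everything is stored as text
);
insert into Readings( Name, ValueType, Value ) values
    ( 'CPU Usage',  1, '55' ),
    ( 'CPU Usage',  1, '9' ),
    ( 'CPU Usage',  1, '100' ),
    ( 'SQL Server', 2, 'offline' );
""")

# Without a cast the comparison is textual, so '9' sorts above '100'.
max_as_text = conn.execute(
    "select max( Value ) from Readings where ValueType = 1"
).fetchone()[0]

# Casting first gives the numeric maximum (MySQL would use cast( Value as signed )).
max_as_int = conn.execute(
    "select max( cast( Value as integer ) ) from Readings where ValueType = 1"
).fetchone()[0]
```

The type code tells the query which cast is safe to apply, which is why the two columns have to be used together.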

Database design confusion

I'm developing a classifieds site. And I'm totally stuck at database design level.
An advertisement can only be in 1 category.
In my database I have table called "ads", which has columns, common for all advertisements.
CREATE TABLE Ads (
AdID int not null,
AdDate datetime not null,
AdCategory int not null,
AdHeading varchar(255) not null,
AdText varchar(255) not null,
etc...
);
I also have a lot of categories.
Ads that are posted in "cars" category, for example, have additional columns like make, model, color, etc. Ads, posted in "housing" have columns like housing type, sqft. etc...
I did something like:
CREATE TABLE Cars (
AdID int not null,
CarMake varchar (255) not null,
CarModel varchar(255) not null,
...
);
CREATE TABLE Housing (
AdID int not null,
HousingType varchar (255) not null
...
);
AdId in those is a foreign key to Ads.
But when I need to retrieve information from Ads, I have to look up all those additional tables and check whether AdId in Ads equals AdId in those tables.
For every category I need a new table. I'm gonna end up with like 15 tables or so.
I had an idea to have boolean columns in the Ads table like is_Cars, is_Housing, etc., but having 15 columns where 14 would be NULL seems horrible.
Is there any better way to design this database? I need my database to be in a 3rd normal form, this is the most important requirement.
Don't worry too much - it's a well known dilemma, there are no 'silver bullets' and all solutions have some trade-offs. Your solution sounds good to me, and is commonly used in the industry. On the down side it has JOINS as you mentioned (which is a well-known trade-off of normalization anyway), and also each new product type requires a new TABLE. On the up side the table structure precisely reflects your business logic, it's readable and efficient in storage.
Your other suggestion, as far as I understand, was a single table where each row has a "type" indication - car, house etc (btw no need for multiple columns such as 'is_car', 'is_house' - it's simpler to have a single column 'type', e.g. type=1 indicates car, type=2 indicates house etc). Then multiple columns where some of them are unused for some product types.
Well, here the advantage is capability to add new types dynamically (even user-defined types) without changing the database schema. Also no 'JOINs'. On the down side you'll be storing & retrieving lots of 'null' cells, and also the schema would be less descriptive: e.g. it's harder to put a constraint "carModel column is not nullable", because it is nullable for houses (you can use triggers, but it's less readable).
Personally I prefer the 1st solution (of course depending on the usecase, but the 1st solution is my first instinct). And I can use it with some peace of mind after considering the trade-offs, e.g. understanding that I'm tolerating those JOINS as payment for a readable & compact schema.
One, you are confusing categories and product specifications.
Two, you need to read up on Table Inheritance.
If you don't mind nulls, use Single Table Inheritance. All "categories" (cars, houses, ...) go in one table and have a "type" column.
If you don't like nulls, use Class Table Inheritance. Make a master table with the primary keys that you point your category foreign key at. Make child tables for each type (cars, houses, ...) whose primary key is also a foreign key to the master table. This is easier with an ORM like Hibernate.
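A minimal sketch of Class Table Inheritance for the ads case, using Python's sqlite3 module in place of MySQL. The tables are trimmed versions of those in the question; the category codes, sample rows and the single left-join query are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces FKs when asked
conn.executescript("""
create table Ads(
    AdID       integer primary key,
    AdCategory int not null,            -- hypothetical codes: 1 = cars, 2 = housing
    AdHeading  varchar( 255 ) not null
);
-- Each child table's primary key is also a foreign key to the master table.
create table Cars(
    AdID     integer primary key references Ads( AdID ),
    CarMake  varchar( 255 ) not null,
    CarModel varchar( 255 ) not null
);
create table Housing(
    AdID        integer primary key references Ads( AdID ),
    HousingType varchar( 255 ) not null
);
insert into Ads( AdID, AdCategory, AdHeading ) values
    ( 1, 1, 'Used hatchback' ),
    ( 2, 2, 'Studio flat' );
insert into Cars( AdID, CarMake, CarModel ) values ( 1, 'Honda', 'Civic' );
insert into Housing( AdID, HousingType ) values ( 2, 'Apartment' );
""")

# One query returns every ad with whichever child columns apply to it.
all_ads = conn.execute("""
    select a.AdID, a.AdHeading, c.CarMake, c.CarModel, h.HousingType
    from Ads a
    left join Cars c on c.AdID = a.AdID
    left join Housing h on h.AdID = a.AdID
    order by a.AdID
""").fetchall()
```

The NOT NULL rules stay on the child tables, and the nulls appear only in the query result, not in storage. When the category of an ad is already known, joining just the one matching child table is enough.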

When to replace a database column with an ID instead

I'm helping a friend design a database but I'm curious if there is a general rule of thumb for the following:
TABLE_ORDER
OrderNumber
OrderType
The column OrderType has the possibility of coming from a preset list of Order Types. Should I allow VARCHAR values to be used in the OrderType column (ex. Production Order, Sales Order, etc...) Or should I separate it out into another table and have it referenced as a foreign key instead from the TABLE_ORDER as the following?:
TABLE_ORDER
OrderNumber
OrderTypeID
TABLE_ORDER_TYPE
ID
OrderType
If the order type list is fixed and will not change, you could opt not to make a separate table. But in that case, do not make it a VARCHAR; make it an ENUM.
You can index this better, and you will end up with arguably the same type of database as when you make it an ID with lookup-table.
But if there is any chance at all that you will need to add types, just go for the second option. You can add an interface later, and you can easily build "get all types" kinds of pages etc.
I would say use another table say "ReferenceCodes" for example:
Type, Name, Description, Code
Then you can just use the Code throughout the database and need not worry about the name associated with that code. If you use a name (for example the order type in your case), it would be really difficult to change the name later on. This is what we actually do in our system.
In a perfect world, any column that can contain duplicate data should be an id or an ENUM. This helps you make sure that the data is always internally consistent and can reduce database size as well as speed up queries.
For something like this structure, I would probably create a master_object table that you could use for multiple types. OrderType would reference the master_object table. You could then use the same table for other data. For example, let's say you had another table - Payments, with a column of PaymentType. You could use the master_object table to also store the values and meta-data for that column. This gives you quite a bit of flexibility without forcing you to create a bunch of small tables, each containing 2-10 rows.
Brian
If the list is small ( less than 10 items ) then you could choose to model it as your first but put a column constraint to limit the inputs to the values in your list. This forces the entries to belong to your list, but your list should not change often.
e.g. check order_type in ('Val1','Val2',...'Valn')
If the list will ever change, if it is used in multiple tables, you are required to support multiple languages or any other design criteria that demands variability, then create your type table (you are always safe with this choice, it is why it is the most used).
You can collect all such tables into a 'codes' table that generalizes the concept
CREATE TABLE Codes (
Code_Class CHARACTER VARYING(30) NOT NULL,
Code_Name CHARACTER VARYING(30) NOT NULL,
Code_Value_1 CHARACTER VARYING(30),
Code_Value_2 CHARACTER VARYING(30),
Code_Value_3 CHARACTER VARYING(30),
CONSTRAINT PK_Codes PRIMARY KEY (Code_Class, Code_Name)
);
insert into codes ( code_class, code_name, code_value_1 )
values ( 'STATE', 'New York', 'NY' ),
( 'STATE', 'California', 'CA' );
-- .... and so on for the remaining states
You can then place an UPDATE/INSERT trigger on the table.column under change that should be constrained to a list of states. Let's say an employee table has a column EMP_STATE to hold state short-forms.
The trigger would simply call a select statement like
SELECT code_name
, code_value_1
INTO v_state_name, v_state_short_name
FROM codes
WHERE code_class = 'STATE'
AND code_value_1 = new.EMP_STATE;
if( not found ) then
raise( some error to fail the trigger and the insert );
end if;
This can be extended to other types:
insert into codes ( code_class, code_name )
values ( 'ORDER_TYPE', 'Production' ),
( 'ORDER_TYPE', 'Sales' );
-- .... and so on for the remaining order types
select code_name
, code_value_1
into v_state_name, v_state_short_name
from codes
where code_class = 'ORDER_TYPE'
and code_name = 'Sales';
This last method, although generally applicable, can be over-used. It also has the downside that you cannot use different data types for the columns (code_name, code_value_*).
The general rule of thumb: create a 'TYPE' (e.g. ORDER_TYPE) table (to hold the values you wish to constrain an attribute to for each type), use an ID as the primary key, use a single sequence to generate all such id's (for all your 'TYPE' tables). The many TYPE tables may clutter your model, but the meaning will be clear to your developers (the ultimate goal).
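A minimal sketch of that rule of thumb, using the TABLE_ORDER and TABLE_ORDER_TYPE layout from the question. Python's sqlite3 module stands in for the real database; the sample rows and the two helper functions are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite only enforces FKs when asked
conn.executescript("""
create table TABLE_ORDER_TYPE(
    ID        integer primary key,
    OrderType varchar( 30 ) not null unique
);
create table TABLE_ORDER(
    OrderNumber integer primary key,
    OrderTypeID int not null references TABLE_ORDER_TYPE( ID )
);
insert into TABLE_ORDER_TYPE( ID, OrderType ) values ( 1, 'Production' ), ( 2, 'Sales' );
insert into TABLE_ORDER( OrderNumber, OrderTypeID ) values ( 5001, 2 );
""")


def order_type_of(order_number):
    """Resolve an order's type name through the lookup table."""
    row = conn.execute(
        """
        select t.OrderType
        from TABLE_ORDER o
        join TABLE_ORDER_TYPE t on t.ID = o.OrderTypeID
        where o.OrderNumber = ?
        """,
        (order_number,),
    ).fetchone()
    return row[0] if row else None


def try_insert_order(order_number, order_type_id):
    """Attempt one insert and report whether the foreign key lets it through."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "insert into TABLE_ORDER( OrderNumber, OrderTypeID ) values ( ?, ? )",
                (order_number, order_type_id),
            )
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# Renaming a type touches one row; existing orders pick up the new name.
with conn:
    conn.execute("update TABLE_ORDER_TYPE set OrderType = 'Sales Order' where ID = 2")

renamed = order_type_of(5001)
```

Because orders store only the ID, the rename is a single-row update, and an order pointing at a type that does not exist is refused by the foreign key.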

Two tables with same columns or one table with additional column?

Say I have two tables (Apples and Oranges) with the same columns and just a different table name. Would there be any advantages/disadvantages to turning this into one table (let's say it's called Fruit) with an additional column 'type' which would then store a value of either Apple or Orange?
Edit to clarify:
CREATE TABLE apples
(
id int,
weight int,
variety varchar(255)
);
CREATE TABLE oranges
(
id int,
weight int,
variety varchar(255)
);
OR
CREATE TABLE fruit
(
id int,
weight int,
variety varchar(255),
type ENUM('apple', 'orange')
);
Depends on constraints:
Do you have foreign keys or CHECKs on apples that don't exist on oranges (or vice-versa)?
Do you need to keep keys unique across both tables (so no apple can have the same ID as some orange)?
If the answers to these two questions are "yes" and "no", keep the tables separate (so constraints can be made table-specific [1]).
If the answers are "no" and "yes", merge them together (so you can create a key that spans both).
If the answers are "yes" and "yes", consider emulating inheritance [2].
[1] Lookup data is a typical example of tables that look similar, yet must be kept separate so FKs can be kept separate.
[2] Specifically, this is the "all classes in separate tables" strategy for representing inheritance (aka. category, subclassing, subtyping, generalization hierarchy etc.). You might want to take a look at this post for more info.
If there really are no further business rules (and resultant underlying data requirements) that separate the two sub-types, then I would use one table with an FK to a FruitType lookup table.
You don't mention what you will be using to access the schema, which may affect which approach you take (e.g. if you are using a platform which provides an ORM to your database then this may be worth noting).
The advantage would be normalization. Your tables would then be in 2NF (second normal form).
Your fruit type would be a foreign key to a table with those fruits like so:
CREATE TABLE fruit_type (type varchar(15) PRIMARY KEY);
CREATE TABLE fruits (id int, weight int, variety varchar(255), type varchar(15), FOREIGN KEY (type) REFERENCES fruit_type (type));