I need to figure out which datatype to use for states.
Should it be SET, VARCHAR, or something else?
CREATE TABLE actors(
state SET('USA','Germany', ...)
)
alternatively
CREATE TABLE actors(
state VARCHAR(30)
)
Assuming there are going to be tens or even over a hundred countries, it's best to use a separate table.
CREATE TABLE states(
state_id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(30)
);
It's also recommended to use a foreign key on state_id, so that if you want to delete a state from your database it won't silently break other data that depends on it.
If each actor is going to be assigned to only one state (a many-to-one relationship), you can use a column in the actors table:
CREATE TABLE actors(
actor_id INT ...,
state_id INT,
FOREIGN KEY (state_id) REFERENCES states(state_id)
)
Or, if each actor can be assigned to multiple states (a many-to-many relationship), use another table for these relations:
CREATE TABLE actors(
actor_id INT ...
)
CREATE TABLE actors_to_states(
actor_id INT,
state_id INT,
PRIMARY KEY (actor_id, state_id)
)
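For example, to list each actor together with their states, you can join through the link table (a sketch using the table names above):
SELECT a.actor_id, s.name
FROM actors a
JOIN actors_to_states ats ON ats.actor_id = a.actor_id
JOIN states s ON s.state_id = ats.state_id;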
SET is a compound datatype holding values from a predefined set of possible values. If a table contains such data then, according to relational database theory, it is not in 1NF, so there are only a few special cases where this approach is reasonable. In most cases I suggest using a separate table for countries, as in the example below:
CREATE TABLE countries (id SMALLINT PRIMARY KEY, name VARCHAR(100));
To answer this type of question, you should do a little data analysis and ask some questions of your data, such as:
What is the maximum size of my data?
In your case that is the country with the longest name. Note this down and add 20 to stay on the safe side.
Will my data always contain numbers, characters, or a combination?
In your case only characters, so VARCHAR fits.
Also plan your data model so that you don't need to edit it afterwards; I would not recommend SET in that case.
I recommend you use the standard abbreviations (US for USA, DE for Germany) and put them in
country_code CHAR(2) CHARACTER SET ascii NOT NULL
That way, it is compact (2 bytes) and readable by users. Then, if you want, you can have another table that spells out the country names.
If an actor can belong to multiple states, then this won't work, and you do need to have a SET. If you need that, we can discuss it further.
I store CountryCode in my database, and I have only 5 options to store in the column CountryCode: "EG, AE, BH, QA, KW".
Should I use CHAR(2), TINYINT, or ENUM('EG', 'AE', 'BH', 'QA', 'KW'), and why?
Use the 2-letter standard country_codes.
And make it CHAR(2) CHARACTER SET ascii. And decide between ascii_bin (which disallows case folding) and ascii_general_ci (which allows case folding).
That would be 2 bytes.
ENUM and TINYINT UNSIGNED would be only one byte, but the total number of countries is dangerously close to 256. At that point you would need a 2-byte ENUM or SMALLINT.
An argument in favor of CHAR(2): It is human readable (mostly). And, if you need more info about each country (full name, population, etc), you can still have a table with PRIMARY KEY(country_code) and easily (and efficiently) JOIN when needed.
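For illustration, a minimal sketch of that layout (the countries table, and the orders table with a country_code column that joins to it, are hypothetical names, not from the question):
CREATE TABLE countries (
country_code CHAR(2) CHARACTER SET ascii NOT NULL,
full_name VARCHAR(100) NOT NULL,
PRIMARY KEY(country_code)
);
SELECT o.*, c.full_name
FROM orders o
JOIN countries c ON c.country_code = o.country_code;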
Your list of 5 country codes is too long and too likely to change; don't use ENUM.
In general, ENUM should be limited to very short lists that are unlikely to change. Also, consider starting the list with something like 'unknown' instead of making the field NULLable.
If you're quite sure the list of accepted values is not going to grow much, I would go with the ENUM to keep the values clean, avoiding faulty inputs like 'Bh', 'eg', 'kW', or the like.
ENUMs are fine, but there are drawbacks in terms of maintenance:
listing the allowed values requires accessing the definition of the table
adding new possible values to the list requires modifying the structure of the table
if more than one table has a CountryCode column, you need to repeat the ENUM definition in each one
So this should be used only in cases where the list is not meant to change over time, and a single column uses it.
In all other cases, it is simpler to have a referential table that stores the values, and create foreign keys in the referencing table(s):
-- referential table
create table countries (countryCode varchar(2) primary key);
insert into countries values ('EG'), ('AE'), ('BH'), ('QA'), ('KW');
-- referencing table
create table mytable (
id int, -- and/or other columns of the table ...
countryCode varchar(2),
-- note: MySQL silently ignores an inline REFERENCES clause, so declare the key at table level
foreign key (countryCode) references countries(countryCode)
);
With this technique, you get the full benefit and flexibility of foreign keys: easy maintenance, data integrity, possible indexing, nice options such as on delete cascade, and so on.
I am creating a table for dietary_supplement where a supplement can have many ingredients.
I am having trouble designing the table for the ingredients.
The issue is that an ingredient can have many names or an acronym.
For example, vitaminB1 has other names like Thiamine and thiamin.
An acronym can also be ambiguous: BHA can stand for both Butylated hydroxyanisole and beta hydroxy acid (this is actually an ingredient for skincare products, but I am using it anyway because it makes a good example).
I am also concerned about spacing and "-". For example, someone can spell vitaminA without spacing while someone else writes vitamin A. Also, beta hydroxy acid can be written as β-hydroxy acid (with "-") or β hydroxy acid (without "-").
What I have in mind are 2 options:
1) Put all the names for one ingredient in a single column, using a semicolon to separate the names, e.g. beta hydroxy acid;BHA;β-hydroxy acid;β hydroxy acid
- This would be easy, but I am not sure it is a smart way to design the database when I have to perform search actions, etc.
2) Create a table for all the names and relate it to a table for ingredients.
- This is the option I lean towards, but I wonder if there are better ways to do this. And do I have to create separate rows for the same items that differ only in spacing and "-"?
Make a mapping table of 'name' to 'canonical_name' (or id). It would have rows like
name       canonical_name
=========  ==============
Thiamine   vitaminB1
thiamin    vitaminB1
vitaminB1  vitaminB1
B1         vitaminB1
By using a collation ending with _ci, you don't need to worry about capitalization.
When ingesting the data for a supplement, first look up the name to get the canonical_name, then use the latter in any other table(s).
In that 2-column table, have
PRIMARY KEY(name),
INDEX(canonical_name, name)
so that you can go in either direction. (name has to be the unique key here, since several names map to the same canonical_name.)
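A minimal sketch of such a mapping table in MySQL (the table name, column sizes, and collation are illustrative):
CREATE TABLE ingredient_names (
name VARCHAR(100) NOT NULL,
canonical_name VARCHAR(100) NOT NULL,
PRIMARY KEY(name), -- each alias maps to exactly one canonical name
INDEX(canonical_name, name) -- lists all aliases of a canonical name
) DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
INSERT INTO ingredient_names (name, canonical_name) VALUES
('Thiamine', 'vitaminB1'),
('thiamin', 'vitaminB1'),
('vitaminB1', 'vitaminB1'),
('B1', 'vitaminB1');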
Create a table for ingredients and a table for supplements, give them a column they share, and just join them when you want to select.
It might be something like this:
CREATE TABLE Ingredient (
Id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, ImagePath VARCHAR(63)
, Description TEXT
-- other ingredient's non-name dependent properties
);
CREATE TABLE IngredientName (
Id INTEGER UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
, IngredientId INTEGER UNSIGNED NOT NULL
, IsMain TINYINT(1) UNSIGNED NOT NULL DEFAULT 0
, Name VARCHAR(63) NOT NULL
, KEY IX_IngredientName_IngredientId_IsMain (IngredientId, IsMain)
, UNIQUE KEY IX_IngredientName_IngredientId_Name (IngredientId, Name)
, CONSTRAINT FK_IngredientName_IngredientId FOREIGN KEY (`IngredientId`) REFERENCES `Ingredient` (`Id`) ON DELETE CASCADE ON UPDATE CASCADE
);
Or you can add an Ingredient.Name column to hold the main name and get rid of IngredientName.IsMain.
For spaces, you should apply some name normalization in your application, such as removing consecutive spaces, normalizing capitalization, and normalizing spaces around commas, dashes, etc. Sure, you can also apply such normalization in the database in a trigger if you like, as sketched below.
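For instance, a minimal sketch of such a trigger against the IngredientName table above (the exact normalization rules are yours to choose):
DELIMITER //
CREATE TRIGGER TRG_IngredientName_Normalize
BEFORE INSERT ON IngredientName
FOR EACH ROW
BEGIN
SET NEW.Name = TRIM(NEW.Name);
-- collapse double spaces (repeat for longer runs if needed)
SET NEW.Name = REPLACE(NEW.Name, '  ', ' ');
-- drop spaces around dashes: 'beta - hydroxy' -> 'beta-hydroxy'
SET NEW.Name = REPLACE(REPLACE(NEW.Name, ' -', '-'), '- ', '-');
END//
DELIMITER ;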
There are some other possibilities.
You should first think about what the use cases for the DB will be.
This is very important: there is no 'best universal DB design'.
If you need some special search cases, you might need a special DB design, or at least special indexes.
P.S. I believe that putting different names in one field as a delimiter-separated value is a bad idea.
We're developing a monitoring system. In our system, values are reported by agents running on different servers. The observations reported can be values like:
A numeric value, e.g. "CPU USAGE" = 55, meaning 55% of the CPU is in use.
Certain event was fired. e.g. "Backup completed".
Status: e.g. SQL Server is offline.
We want to store these observations (which are not known in advance and will be added dynamically to the system without recompiling).
We are considering adding different columns to the observations table like this:
IntMeasure -> INTEGER
FloatMeasure -> FLOAT
Status -> varchar(255)
So if the value we wish to store is a number, we can use IntMeasure or FloatMeasure according to the type. If the value is a status, we can store the status literal string (or a status id if we decide to add a Statuses(id, name) table).
We suppose it's possible to have a more correct design, but wouldn't it become too slow and opaque due to joins and dynamic table names depending on types? How would a join work if we can't specify the tables in advance in the query?
I haven't done a formal study, but from my own experience I would guess that more than 80% of database design flaws are generated from designing with performance as the most important (if not only) consideration.
If a good design calls for multiple tables, create multiple tables. Don't automatically assume that joins are something to be avoided. They are rarely the true cause of performance problems.
The primary consideration, first and foremost in all stages of database design, is data integrity. "The answer may not always be correct, but we can get it to you very quickly" is not a goal any shop should be working toward. Once data integrity has been locked down, if performance ever becomes an issue, it can be addressed. Don't sacrifice data integrity, especially to solve problems that may not exist.
With that in mind, look at what you need. You have observations you need to store. These observations can vary in the number and types of attributes and can be things like the value of a measurement, the notification of an event and the change of a status, among others and with the possibility of future observations being added.
This would appear to fit into a standard "type/subtype" pattern, with the "Observation" entry being the type and each type or kind of observation being the subtype, and suggests some form of type indicator field such as:
create table Observations(
...,
ObservationKind char( 1 ) check( ObservationKind in( 'M', 'E', 'S' )),
...
);
But hardcoding a list like this in a check constraint has a very low maintainability level. It becomes part of the schema and can be altered only with DDL statements. Not something your DBA is going to look forward to.
So have the kinds of observations in their own lookup table:
ID  Name         Meaning
==  ===========  =============================================
M   Measurement  The value of some system metric (CPU_Usage).
E   Event        An event has been detected.
S   Status       A change in a status has been detected.
(The char field could just as well be int or smallint. I use char here for illustration.)
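A sketch of that lookup table and its rows (column sizes are illustrative):
create table ObservationKinds(
ID char( 1 ) primary key,
Name varchar( 32 ) not null,
Meaning varchar( 128 ) not null
);
insert into ObservationKinds( ID, Name, Meaning )
values( 'M', 'Measurement', 'The value of some system metric (CPU_Usage).' ),
( 'E', 'Event', 'An event has been detected.' ),
( 'S', 'Status', 'A change in a status has been detected.' );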
Then fill out the Observations table with a PK and the attributes that would be common to all observations.
create table Observations(
ID int identity primary key,
ObservationKind char( 1 ) not null,
DateEntered date not null,
...,
constraint FK_ObservationKind foreign key( ObservationKind )
references ObservationKinds( ID ),
constraint UQ_ObservationIDKind unique( ID, ObservationKind )
);
It may seem strange to create a unique index on the combination of Kind field and the PK, which is unique all by itself, but bear with me a moment.
Now each kind or subtype gets its own table. Note that each kind of observation gets a table, not the data type.
create table Measurements(
ID int not null,
ObservationKind char( 1 ) check( ObservationKind = 'M' ),
Name varchar( 32 ) not null, -- Such as "CPU Usage"
Value double not null, -- such as 55.00
..., -- other attributes of Measurement observations
constraint PK_Measurements primary key( ID, ObservationKind ),
constraint FK_Measurements_Observations foreign key( ID, ObservationKind )
references Observations( ID, ObservationKind )
);
The first two fields will be the same for the other kinds of observations except the check constraint will force the value to the appropriate kind. The other fields may differ in number, name and data type.
Let's examine an example tuple that may exist in the Measurements table:
ID    ObservationKind  Name       Value  ...
====  ===============  =========  =====
1001  M                CPU Usage  55.0   ...
In order for this tuple to exist in this table, a matching entry must first exist in the Observations table with an ID value of 1001 and an ObservationKind value of 'M'. No other entry with an ID value of 1001 can exist in either the Observations table or the Measurements table, and none can exist in any of the other "kind" tables (Events, Status). This works the same way for all the kind tables.
I would further recommend creating a view for each kind of observation which will provide a join of each kind with the main observation table:
create view MeasurementObservations as
select ...
from Observations o
join Measurements m
on m.ID = o.ID;
Any code that works solely with measurements would need to only hit this view instead of the underlying tables. Using views to create a wall of abstraction between the application code and the raw data greatly enhances the maintainability of the database.
Now the creation of another kind of observation, such as "Error", involves a simple Insert statement to the ObservationKinds table:
F Fault A fault or error has been detected.
Of course, you need to create a new table and view for these error observations, but doing so will have no impact on existing tables, views or application code (except, of course, to write the new code to work with the new observations).
Just create it as a VARCHAR.
This will allow you to store whatever data you require in it. However, it is much more difficult to do queries based on the number in the field, such as
Select * from MyTable where MyVARCHARField > 50 -- get CPU > 50
If you think you will want to do this, then you need either a field per item or a generalised table such as:
Create Table Measures ( -- table name is illustrative; none was given
Description VARCHAR(255),
ValueType VARCHAR(10), -- can be 'String', 'Float', 'Int'
ValueString VARCHAR(255),
ValueFloat FLOAT,
ValueInt INT
);
Then, when you are filling in the data, you can put your value in the correct field and select like this:
Select Description, ValueInt from Measures where Description like '%cpu%' and ValueInt > 50
I used two columns for a similar problem. The first column held the data type and the second held the data as a VARCHAR.
The first column held codes (e.g. 1 = integer, 2 = string, 3 = date and so on), which could be combined to compare values (e.g. find the max integer where type = 1).
I did not have joins, but I think you can use this approach. It will also help you if more data types are introduced tomorrow.
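A sketch of such a query, with hypothetical table and column names (the VARCHAR payload has to be cast before numeric comparison):
SELECT MAX(CAST(value_data AS SIGNED)) AS max_int
FROM observations
WHERE value_type = 1; -- 1 = integer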
I'm helping a friend design a database but I'm curious if there is a general rule of thumb for the following:
TABLE_ORDER
OrderNumber
OrderType
The column OrderType can only come from a preset list of order types. Should I allow VARCHAR values in the OrderType column (e.g. Production Order, Sales Order, etc.), or should I separate it out into another table and reference it as a foreign key from TABLE_ORDER instead, as follows?
TABLE_ORDER
OrderNumber
OrderTypeID
TABLE_ORDER_TYPE
ID
OrderType
If the order type list is fixed and will not change, you could opt not to make a separate table. But in that case, do not make it VARCHAR; make it an ENUM, as sketched below.
You can index this better, and you will end up with arguably the same type of database as when you make it an ID with a lookup table.
But if there is any chance at all that you will need to add types, just go for the second option. You can add an interface later, and you can easily make "get all types" kinds of pages, etc.
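For illustration, the ENUM variant might look like this (values taken from the question):
CREATE TABLE TABLE_ORDER (
OrderNumber INT PRIMARY KEY,
OrderType ENUM('Production Order', 'Sales Order') NOT NULL
);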
I would say use another table, say "ReferenceCodes", for example:
Type, Name, Description, Code
Then you can just use the Code throughout the database and need not worry about the name associated with that code. If you use a name (for example, order type in your case), it would be really difficult to change the name later on. This is what we actually do in our system.
In a perfect world, any column that can contain duplicate data should be an id or an ENUM. This helps you make sure that the data is always internally consistent and can reduce database size as well as speed up queries.
For something like this structure, I would probably create a master_object table that you could use for multiple types. OrderType would reference the master_object table. You could then use the same table for other data. For example, let's say you had another table - Payments, with a column of PaymentType. You could use the master_object table to also store the values and meta-data for that column. This gives you quite a bit of flexibility without forcing you to create a bunch of small tables, each containing 2-10 rows.
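A sketch of what such a master_object table might look like (the columns are illustrative assumptions):
CREATE TABLE master_object (
id INT AUTO_INCREMENT PRIMARY KEY,
object_type VARCHAR(30) NOT NULL, -- e.g. 'OrderType', 'PaymentType'
name VARCHAR(50) NOT NULL,
UNIQUE KEY (object_type, name)
);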
If the list is small (fewer than 10 items), you could choose to model it as in your first option, but put a column constraint on it to limit the inputs to the values in your list. This forces the entries to belong to your list, but your list should not change often.
e.g. check order_type in ('Val1','Val2',...'Valn')
If the list will ever change, if it is used in multiple tables, or if you are required to support multiple languages or any other design criterion demands variability, then create your type table (you are always safe with this choice, which is why it is the most used).
You can collect all such tables into a 'codes' table that generalizes the concept:
CREATE TABLE Codes (
Code_Class CHARACTER VARYING(30) NOT NULL,
Code_Name CHARACTER VARYING(30) NOT NULL,
Code_Value_1 CHARACTER VARYING(30),
Code_Value_2 CHARACTER VARYING(30),
Code_Value_3 CHARACTER VARYING(30),
CONSTRAINT PK_Codes PRIMARY KEY (Code_Class, Code_Name)
);
insert into codes ( code_class, code_name, code_value_1 )
values ( 'STATE', 'New York', 'NY' ),
( 'STATE', 'California', 'CA' ),
... ;
You can then place an UPDATE/INSERT trigger on the table.column under change that should be constrained to the list of states. Let's say an employee table has a column EMP_STATE to hold state short forms.
The trigger would simply call a select statement like
SELECT code_name
, code_value_1
INTO v_state_name, v_state_short_name
FROM codes
WHERE code_class = 'STATE'
AND code_value_1 = new.EMP_STATE;
if( not found ) then
raise( some error to fail the trigger and the insert );
end if;
This can be extended to other types:
insert into codes ( code_class, code_name )
values ( 'ORDER_TYPE', 'Production' ),
( 'ORDER_TYPE', 'Sales' ),
... ;
select code_name
, code_value_1
into v_type_name, v_type_code
from codes
where code_class = 'ORDER_TYPE'
and code_name = 'Sales';
This last method, although generally applicable, can be over-used. It also has the downside that you cannot use different data types (code_name, code_value_*).
The general rule of thumb: create a 'TYPE' table (e.g. ORDER_TYPE) to hold the values you wish to constrain an attribute to, for each type; use an ID as the primary key, and use a single sequence to generate all such IDs (for all your 'TYPE' tables). The many TYPE tables may clutter your model, but the meaning will be clear to your developers (the ultimate goal).
I have a person table and I want users to be able to create custom many-to-many relations of information with persons: educations, residences, employments, languages, and so on. These might require different numbers of columns. E.g.
Person_languages(person_fk,language_fk)
Person_Educations(person,institution,degree,field,start,end)
I thought of something like this (not correct SQL):
create Tables(
table_id PRIMARY_KEY,
table_name_fk FOREIGN_KEY(Table_name),
person_fk FOREIGN_KEY(Person),
table_description TEXT
)
A table holding all the custom table names and descriptions.
create Table_columns(
column_id PRIMARY_KEY,
table_fk FOREIGN_KEY(Tables),
column_name_fk FOREIGN_KEY(Columns),
rank_column INT,
)
Table holding the columns in each custom table and the order they are to be displayed in.
create Table_rows(
row_id PRIMARY_KEY,
table_fk FOREIGN_KEY(Tables),
row_nr INT,
)
Table holding the rows of each custom table.
create Table_cells(
cell_id PRIMARY_KEY,
table_fk FOREIGN_KEY(Tables),
row_fk FOREIGN_KEY(Table_rows),
column_fk FOREIGN_KEY(Table_columns),
cell_content_type_fk FOREIGN_KEY(Content_types),
cell_object_id INT,
)
Table holding cell info.
If any custom table starts to be used with most persons and becomes large, the idea was to then extract it into a separate hard-coded many-to-many table just for that table.
Is this a stupid idea? Is there a better way to do this?
I strongly advise against such a design; you are on the road to an extremely fragmented and hard-to-read schema.
IIUC, your base problem is that you have a common set of (universal) properties for a person that may be extended by other (non-universal) properties.
I'd tackle this by keeping the universal properties in the person table and creating two more tables: property_types, which translates a property name into an INT primary key, and person_properties, which combines person PK, property PK, and value.
If you set the PK of this table to be (person,property) you get the best possible index locality for the person, which makes requesting all properties for a person a very fast query.
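A minimal sketch of that layout in MySQL (table and column names are illustrative):
CREATE TABLE property_types (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(64) NOT NULL UNIQUE -- e.g. 'language', 'degree'
);
CREATE TABLE person_properties (
person INT NOT NULL,
property INT NOT NULL,
value VARCHAR(255) NOT NULL,
PRIMARY KEY (person, property), -- one value per property; clusters a person's rows together
FOREIGN KEY (property) REFERENCES property_types(id)
-- plus a FOREIGN KEY from person to your person table's primary key
);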