Let's say there are warehouses, each storing items of a specific type.
So there are tables with fields
Warehouse - ID,Name,Type
Item - ID,Name,Type
WarehouseItem - Warehouse, Item
Type - ID, Name
The question is: given that a Warehouse only holds Items of a specific Type, what database normalization rule is this breaking?
Is this database normalized?
(The problem's example is made up, but I basically have this problem in real life.)
I'm making some assumptions from just looking at your metadata without any data examples, but on first glance it appears that your schema for the most part is normalized. Technically speaking your table is 3NF (which should be your target) if it meets all of the following standards:
It is also 1NF - Each entry only contains atomic data (or a single piece of info)
It is also 2NF - No partial dependency, meaning that when you have a composite primary key (a key made up of more than one column), all non-key data depends on the entire key and not on just part of it
It is 3NF - No transitive dependency meaning all data is only dependent on the primary key and not some other column in the table
Note that there are also higher normalized forms but they are mostly academic as you begin experiencing performance degradation the more you normalize
Given this definition:
Warehouse appears 3NF assuming that each warehouse can only have one Type. If not then you would be failing the transitive dependency and would need to move Type information to a new table.
Item too appears 3NF assuming only one Type can be assigned
Type appears to contain redundant data and should be removed unless of course you have a many-to-many relationship between Type and Warehouse and/or Item. In that case, you would want to introduce a bridge entity (aka composite entity) between Type and Warehouse or Item to create two 1-to-many relationships.
Lastly, if I'm reading this correctly, WarehouseItem appears to be a bridge-entity between Warehouse and Item to break up the many-to-many relationship between them. If this is correct, you should be able to argue that this table is 3NF assuming the combination of Warehouse and Item represent a composite key.
So assuming I interpreted your schema correctly, once you eliminate the redundant Type table, then yes I would say this setup technically meets 3NF. Note that your requirement that
given that a Warehouse only holds Items of a specific Type
may require you introduce a new type field which will mean you need to reevaluate your normalization of that table. If you have two distinct types (a WarehouseType and an ItemType) then you may need to keep that Type table after all and turn it into a mapping table between those two new fields. But I'd need to see data examples to better evaluate.
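One possible way to make the database itself enforce the "a Warehouse only holds Items of its Type" rule is a pair of composite foreign keys. The sketch below is not the approach described above, just an illustration under assumptions: it keeps the Type table, adds a hypothetical Type column to WarehouseItem, and uses SQLite through Python so it can be run as-is. The sample rows and the `store` helper are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE Type (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE Warehouse (
    ID   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Type INTEGER NOT NULL REFERENCES Type(ID),
    UNIQUE (ID, Type)            -- target of the composite FK below
);
CREATE TABLE Item (
    ID   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    Type INTEGER NOT NULL REFERENCES Type(ID),
    UNIQUE (ID, Type)
);
CREATE TABLE WarehouseItem (
    Warehouse INTEGER NOT NULL,
    Item      INTEGER NOT NULL,
    Type      INTEGER NOT NULL,  -- hypothetical column shared by both FKs
    PRIMARY KEY (Warehouse, Item),
    FOREIGN KEY (Warehouse, Type) REFERENCES Warehouse(ID, Type),
    FOREIGN KEY (Item, Type)      REFERENCES Item(ID, Type)
);
INSERT INTO Type VALUES (1, 'Frozen'), (2, 'Dry');
INSERT INTO Warehouse VALUES (1, 'North', 1);
INSERT INTO Item VALUES (10, 'Peas', 1), (11, 'Rice', 2);
""")

def store(warehouse_id, item_id, type_id):
    """Try to store an item; return True if the database accepts the row."""
    try:
        with conn:
            conn.execute("INSERT INTO WarehouseItem VALUES (?, ?, ?)",
                         (warehouse_id, item_id, type_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

Because the same Type value has to match both the warehouse and the item, a mismatched pair cannot be inserted. The cost is that Type is now repeated in WarehouseItem, so this trades some redundancy for declarative enforcement.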
I can't find a term for what I'm trying to do so that may be limiting my ability to find info related to my question.
I'm trying to relate product identifiers and product processing codes (orange table in fig.) with validation against what product types and subtypes are valid for each process code based on process type. Importantly, each product identifier is related to a product type (see ProductIdentifier table) and each process code is related to process type (see ProcessCode table). I minimized the attributes in the tables below to only those necessary for my question.
In the above example, when I INSERT INTO the RunProcessTypeOne table, I need to validate that the ProductCode for RoleOneProductIdentifier is present in ProductTypeTwo. Similarly, I need to validate that the ProductCode for RoleTwoProductIdentifier is present in ProductSubtypeOne.
Of course I can use a stored procedure that inserts into the RunProcessTypeOne table after running SELECT to check for the presence of the ProductCode related to RoleOneProductIdentifier and RoleTwoProductIdentifier in the relevant tables. This doesn't seem optimal since I'm having to run three SELECTs for every INSERT. Plus, it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign key).
Are there alternatives to this approach? Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
If more details are helpful: The relationship between ProductCode and ProcessCode is many-to-many but there are rules that define product roles in each process and only certain product types or subtypes may fulfill those roles. ProductTypeOne might include attributes that define a specific kind of product like color or shape. ProductIdentifier includes the many lots of any ProductCode that are manufactured. ProcessCode includes settings that are put on a machine for processing. ProductType by way of ProductCode determines if a ProductIdentifier is valid for a particular ProcessType. Individual ProcessCodes don't discriminate valid ProductIdentifiers, only the ProcessType related to the ProcessCode would discriminate.
it seems fishy that the relationship between ProcessTypes and ProductCodes would only be known within the stored procedure and not via relationships established between the tables themselves (foreign key).
Yes that's an important observation, good to see you questioning the current schema. The fact of the matter is that SQL is not very powerful when it comes to representing data structures. So often a stored procedure is the only/least worst approach.
I'll make a suggestion for how to achieve this without stored procedures, but I won't call it "optimal": there's likely to be a performance hit for INSERTs (and worse for UPDATEs), because the SQL engine will probably be in effect carrying out the same SELECTs as you'd code in a stored procedure.
Split table ProductIdentifier into two:
ProductIdentifierTypeTwo PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeTwo.ProductCode.
ProductIdentifierTypeOne PK ProductIdentifier, ProductCode FK REFERENCES ProductTypeOne.ProductCode.
Also CREATE VIEW ProductIdentifier UNION the two sub-tables, PK ProductIdentifier. This makes sure ProductIdentifier isn't duplicated between the two types.
IOW this avoids the ProductIdentifier table directly referencing the ProductCode table, where it can only examine ProductType as a column value, not as a referential structure.
Then
RunProcessTypeOne.RoleOneProductIdentifier FK REFERENCES ProductIdentifierTypeTwo.ProductIdentifier.
RunProcessTypeOne.RoleTwoProductIdentifier FK REFERENCES ProductIdentifierTypeOne.ProductIdentifier.
Making the original ProductIdentifier a VIEW is the least non-optimal way to manage updates (I'm guessing from your comment): ProductIdentifiers are less volatile than RunProcesses.
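A minimal sketch of the split-table idea, run against SQLite from Python so it can be tried directly. The column types, the `RunID` surrogate key, the sample rows and the `run` helper are assumptions for illustration; only the table and column names mentioned above come from the discussion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE ProductTypeOne (ProductCode TEXT PRIMARY KEY);
CREATE TABLE ProductTypeTwo (ProductCode TEXT PRIMARY KEY);
CREATE TABLE ProductIdentifierTypeOne (
    ProductIdentifier TEXT PRIMARY KEY,
    ProductCode TEXT NOT NULL REFERENCES ProductTypeOne(ProductCode)
);
CREATE TABLE ProductIdentifierTypeTwo (
    ProductIdentifier TEXT PRIMARY KEY,
    ProductCode TEXT NOT NULL REFERENCES ProductTypeTwo(ProductCode)
);
-- Read-only replacement for the original single table.
CREATE VIEW ProductIdentifier AS
    SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeOne
    UNION ALL
    SELECT ProductIdentifier, ProductCode FROM ProductIdentifierTypeTwo;
CREATE TABLE RunProcessTypeOne (
    RunID INTEGER PRIMARY KEY,  -- hypothetical surrogate key
    RoleOneProductIdentifier TEXT NOT NULL
        REFERENCES ProductIdentifierTypeTwo(ProductIdentifier),
    RoleTwoProductIdentifier TEXT NOT NULL
        REFERENCES ProductIdentifierTypeOne(ProductIdentifier)
);
INSERT INTO ProductTypeOne VALUES ('P1');
INSERT INTO ProductTypeTwo VALUES ('P2');
INSERT INTO ProductIdentifierTypeOne VALUES ('lot-1a', 'P1');
INSERT INTO ProductIdentifierTypeTwo VALUES ('lot-2a', 'P2');
""")

def run(role_one, role_two):
    """Insert a run; return True if both role FKs are satisfied."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO RunProcessTypeOne"
                " (RoleOneProductIdentifier, RoleTwoProductIdentifier)"
                " VALUES (?, ?)", (role_one, role_two))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this layout no stored procedure is needed: swapping the two roles is rejected by the foreign keys. Note that in SQLite a plain view carries no key of its own, so keeping an identifier out of both sub-tables at once would need a separate mechanism.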
Re your more general question:
Is there a standard for handling this type of validation where you need to validate individual instances (e.g. ProductIdentifiers) of entity types based on the relationships between those types (e.g. the relationship between ProductTypeTwo and ProcessTypeOne)?
There are facilities included in the SQL standard. Most vendors haven't implemented them, or only partially support them -- essentially because implementing them would need running SELECTs with tricky logic as part of table updates.
You should be able to CREATE VIEW with a filter to only the rows that are the target of some FK.
(Your dba is likely to object that VIEWs come with an unacceptable performance hit. In this example, you'd have a single ProductIdentifier table, with the two sub-tables I suggest above as VIEWs. But maintaining those views would need joining to ProductCode to filter by ProductType.)
Then you should be able to define a FK to the VIEW rather than to the base table.
(This is the bit many SQL vendors don't support.)
I have two tables currently with the same primary key, can I have these two tables with the same primary key?
Also are all the tables in 3rd normal form
Ticket:
-------------------
Ticket_id* PK
Flight_name* FK
Names*
Price
Tax
Number_bags
Travel class:
-------------------
Ticket id * PK
Customer_5star
Customer_normal
Customer_2star
Airmiles
Lounge_discount
ticket_economy
ticket_business
ticket_first
food allowance
drink allowance
the rest of the tables in the database are below
Passengers:
Names* PK
Credit_card_number
Credit_card_issue
Ticket_id *
Address
Flight:
Flight_name* PK
Flight_date
Source_airport_id* FK
Dest_airport_id* FK
Source
Destination
Plane_id*
Airport:
Source_airport_id* PK
Dest_airport_id* PK
Source_airport_country
Dest_airport_country
Pilot:
Pilot_name* PK
Plane id* FK
Pilot_grade
Month
Hours flown
Rate
Plane:
Plane_id* PK
Pilot_name* FK
This is not meant as an answer but it became too long for a comment...
Not to sound harsh, but your model has some serious flaws and you should probably take it back to the drawing board.
Consider what would happen if a Passenger buys a second Ticket for instance. The Passenger table should not hold any reference to tickets. Maybe a passenger can have more than one credit card though? Shouldn't Credit Cards be in their own table? The same applies to Addresses.
Why does the Airport table hold information that really is about destinations (or paths/trips)? You already record trip information in the Flights table. It seems to me that the Airport table should hold information pertaining to a particular airport (like name, location?, IATA code et cetera).
Can a Pilot just be associated with one single Plane? Doesn't sound very likely. The pilot table should not hold information about planes.
And the Planes table should not hold information on pilots as a plane surely can be connected to more than one pilot.
And so on... there are most likely other issues too, but these pointers should give you something to think about.
The only tables that sort of look ok to me are Ticket and Flight.
Re same primary key:
Yes there can be multiple tables with the same primary key. Both in principle and in good practice. We declare a primary or other unique column set to say that those columns (and supersets of them) are unique in a table. When that is the case, declare such column sets. This happens all the time.
Eg: A typical reasonable case is "subtyping"/"subtables", where entities of a kind identified by a candidate key of one table are always or sometimes also of the kind identified by the same values in another table. (If always then the one table's candidate key values are also in the other table's. And so we would declare a foreign key from the one to the other. We would say the one table's kind of entity is a subtype of the other's.) On the other hand sometimes one table is used with attributes of both kinds and attributes inapplicable to one kind are not used. (Ie via NULL or a tag indicating kind.)
Whether you should have cases of the same primary key depends on other criteria for good design as applied to your particular situation. You need to learn design including normalization.
Eg: All keys simple and 3NF implies 5NF, so if your two tables have the same set of values as only & simple primary key in every state and they are both in 3NF then their join contains exactly the same information as they do separately. Still, maybe you would keep them separate for clarity of design, for likelihood of change or for performance based on usage. You didn't give that information.
Re normal forms:
Normal forms apply to tables. The highest normal form of a table is a property independent of any other table. (Although you might choose that form based on what forms & tables are alternatives.)
In order to normalize or determine a table's highest normal form one needs to know (in general) all the functional dependencies in it. (For normal forms above BCNF, also join dependencies.) You didn't give them. They are determined by what the meaning of the table is (ie how to determine what rows go in it in any given situation) and the possible situations that can arise. You didn't give them. Your expectation that we could tell you about the normal forms your tables are in without giving such information suggests that you do not understand normalization and need to educate yourself about it.
Proper design also needs this information and in general all valid states that can arise from situations that arise. Ie constraints among given tables. You didn't give them.
Having two tables with the same key goes against the idea of removing redundancy in normalization.
Excluding that, are these tables in 1NF and 2NF?
Judging by the Names field, I'd suggest that table1 is not. If multiple names can belong to one ticket, then you need a new table, most likely with a composite key of ticket_id,name.
I have a base entity (items) that will host a vast range of item types (>200) with totally different properties. I want a clean, portable and fast solution and have come up with an idea that maybe has a name I'm unaware of.
Here it goes:
items-entity holds base class fields + additional fields for subclass fields but with dummy names: ItemID,ItemNo,ItemTypeID,int1,int2,dec1,dec2,dec3,str1,str2
referenced itemtype-record holds name of type and child entity (1:n):
itemtypefields [itemtypeid,name,type,realfield]
example in [53,MaxPressure,dec,dec3]
It's limitations:
hard to estimate field requirements in baseclass
harder to add domains/checkconstraints based on child type
need application layer to translate tagged sql to real query
Only possible to query one type at a time since shared attributes may be defined to different "real-fields".
3rd bullet explained:
select ItemNo,_MaxPressure_ from items where ItemTypeID=10 and _MaxPressure_>42
should translate to:
select ItemNo,dec3 as MaxPressure from items where ItemTypeID=10 and dec3>42
(can't do that with sp's or udf's right - or would it be possible?)
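The application-layer translation described in the 3rd bullet could look roughly like this. It is only a sketch: `FIELD_MAP` is a hypothetical in-memory stand-in for the itemtypefields records, and the parsing is deliberately naive (it assumes a single lowercase ` from ` and `_Name_` tags only).

```python
import re

# Hypothetical stand-in for itemtypefields: item type id -> {name: real field}
FIELD_MAP = {10: {"MaxPressure": "dec3"}}

TAG = re.compile(r"_([A-Za-z][A-Za-z0-9]*)_")

def translate(tagged_sql, item_type_id):
    """Replace _Name_ tags with the real column for the given item type."""
    fields = FIELD_MAP[item_type_id]
    select_part, sep, rest = tagged_sql.partition(" from ")

    def aliased(match):
        # In the select list, keep the friendly name as a column alias.
        name = match.group(1)
        return f"{fields[name]} as {name}"

    def plain(match):
        # Everywhere else (where, order by, ...) use the real column only.
        return fields[match.group(1)]

    return TAG.sub(aliased, select_part) + sep + TAG.sub(plain, rest)

print(translate(
    "select ItemNo,_MaxPressure_ from items"
    " where ItemTypeID=10 and _MaxPressure_>42", 10))
```

A tag that is not defined for the item type raises `KeyError`, which is one way the fourth limitation (one type per query) shows up in practice.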
But benefits of:
Performance
Ease of CRUD-operations
Easier to sort/filter at application level.
Now - does it have a name?
This antipattern is called One True Lookup Table.
In a relational database, each column needs to be defined as one logical type. I don't mean one SQL data type like INT or VARCHAR, I mean everything in that column from start to finish must be from the same set of values, and you should be able to tell one value apart from another value.
You can't put shoe size and average temperature and threads per inch into the same column of a given table, and still call it a relation.
Basically, your database would not be a database at all -- it would be a spreadsheet.
Read almost any book by C. J. Date, such as SQL and Relational Theory for a proper explanation of relations and types.
Re your comment:
Read the Q again before lecturing about elementary books and mocking about semi structured data.
Okay, I have re-read your post.
The classic use of One True Lookup Table isn't exactly what you're doing, but what you're doing shares the same problems with OTLT.
Suppose you have "MaxPressure" stored in column dec3 for ItemType 10. Suppose there are a fixed set of valid choices for the value of MaxPressure, and you want to put those in another lookup table, so that no one can enter an invalid MaxPressure value.
Now: declare a foreign key constraint on dec3 referencing your MaxPressures lookup table. You can't -- the problem is that the foreign key constraint applies to the dec3 column in all rows, not just those rows where ItemType is 10.
The reason is that you're storing more than one set of values in a single column. The same problem arises for any other kind of constraint -- unique constraints, check constraints, even NOT NULL. And you can't declare a DEFAULT value for the column either, because you probably have a different correct default for each ItemType (and some ItemTypes have no default for that attribute).
The reason that I referred to the C. J. Date book is that he gives a crisp definition for a type: it's a named finite set, over which the equality operation is defined. That is, you can tell if the value "42" on one row is the same as the value "42" on another row. In a relational column, that must be true because they must come from the same original set of values. In your table, dec3 could have the value "42" when it's MaxPressure, but "42" for another ItemType when it's threads per inch. Therefore they aren't the same value "42". If you had a unique constraint, these two 42's would not be considered duplicates. If you had a foreign key, each of the different 42's would reference a different lookup table, etc.
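To make the foreign key problem concrete, here is a small sketch using SQLite from Python. The MaxPressures values, the second item type (11, imagined as "threads per inch") and the `add_item` helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE MaxPressures (value REAL PRIMARY KEY);
CREATE TABLE items (
    ItemID     INTEGER PRIMARY KEY,
    ItemTypeID INTEGER NOT NULL,
    -- The constraint is declared on the column, so it covers every row,
    -- not only rows where ItemTypeID = 10.
    dec3 REAL REFERENCES MaxPressures(value)
);
INSERT INTO MaxPressures VALUES (42), (60);
""")

def add_item(item_type_id, dec3):
    """Insert an item; return True if the database accepts it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO items (ItemTypeID, dec3) VALUES (?, ?)",
                (item_type_id, dec3))
        return True
    except sqlite3.IntegrityError:
        return False

ok_pressure = add_item(10, 42)  # a valid MaxPressure for type 10
ok_threads = add_item(11, 18)   # threads per inch for type 11, rejected
```

The second insert fails even though 18 is a perfectly good value for type 11, because the database has no way of knowing that dec3 means something else in that row.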
What you're doing is not a valid relational database design.
Don't bristle at my referring you to a resource on relational database design unless you understand that.
Part of my schema for a travel project has the following tables
Cruises
Flights
Hotels
CarParking
I need a container that wraps one or more of these products into a package. One Cruise/Hotel etc might be part of many packages. I initially thought of
Package
- PackageId
- Etc
PackageItem
- PackageItemId
- PackageId (fk)
- ItemId (fk)
- ItemType
Where ItemType would indicate whether it's a Cruise, Flight, Hotel etc. I suppose I could use Triggers to enforce referential integrity.
My other idea was
Package
- ...
PackageItem
- PackageItemId
- PackageId (fk)
- CruiseId (nullable fk)
- FlightId (nullable fk)
- HotelId (nullable fk)
- CarParkingId (nullable fk)
- etc
I suppose each has its pros and cons, but I can't decide. Which do you think is better, which would you choose if you had to implement something like this?
Database is MySQL. Platform is C# MVC ASP.NET
(I did search and there were a few similar questions but nothing that corresponded all that well)
The first option is the most flexible. And I tend to go with flexibility.
Advantage: Common Queries
If you want a report on cruises, the query is the same as one for hotels, but with a different WHERE clause.
Using the second form you need to join on and select from different tables.
Advantage: Growth without Schema Changes
If you need to add Excursions to your model (something that can certainly have many associated to a single package), you just create a new Excursions type.
Using the second form you need to add new fields to your tables, create new tables to hold the data, and update your queries and logic to use those new tables and fields.
Cost: Data moving to a form not friendly for human digestion
Many people could legitimately say that this shouldn't matter at all. I say that it matters in so far as you have to take account of it...
- It can make debugging harder, so you need to be more regimented and methodical
- It means your GUI has to be smarter in transforming your data for display
Also, although this is a cost, it has the benefit of forcing you into a mindset where you are less likely to make simplistic assumptions and sloppy mistakes. This is a cost that I like to have.
Fallacy: Constraints can't be enforced
Constraint - Each package component must be either Hotel, Parking, Flight or Cruise
Method - Have a component_type table, and FK to that table
Constraint - Only one of each type allowed per package
Method - UNIQUE constraint on (package_id, component_type_id)
Constraint - Each component can only be within one package
Method - UNIQUE constraint on (component_id)
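The three constraint/method pairs above can be sketched as follows, using SQLite from Python. Table and column names beyond `component_type`, `package_id`, `component_type_id` and `component_id` are hypothetical, and the sketch assumes component ids are unique across all component types (otherwise the last UNIQUE constraint would collide between, say, hotel 1 and cruise 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE component_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE package (id INTEGER PRIMARY KEY);
CREATE TABLE package_item (
    id                INTEGER PRIMARY KEY,
    package_id        INTEGER NOT NULL REFERENCES package(id),
    component_id      INTEGER NOT NULL,
    -- only known component types are allowed
    component_type_id INTEGER NOT NULL REFERENCES component_type(id),
    -- only one of each type per package
    UNIQUE (package_id, component_type_id),
    -- each component belongs to at most one package
    UNIQUE (component_id)
);
INSERT INTO component_type VALUES
    (1, 'Hotel'), (2, 'Parking'), (3, 'Flight'), (4, 'Cruise');
INSERT INTO package VALUES (1), (2);
""")

def add_item(package_id, component_id, component_type_id):
    """Attach a component to a package; return True if accepted."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO package_item"
                " (package_id, component_id, component_type_id)"
                " VALUES (?, ?, ?)",
                (package_id, component_id, component_type_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

What this does not check is that `component_id` actually points at a row of the right kind; that part still depends on how the component details are stored.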
Cost - Deferred complexity
In my opinion, the normalised table to map Packages to Components is actually simple and elegant. The next step, is to decide how to store the associated details of a component.
A single global "component" table could hold all the fields, but allow them to be nullable. Thus a HOTEL would have a NULL Flight_Number. But all components would have a Price.
Or you could create an Entity_Attribute_Value table. This can be formed in such a way as to prevent hotels having a flight number...
- component_attributes table = (id, type_id, attribute_id, attribute_value)
- (type_id, attribute_id) can be foreign keyed to allowable combinations
It's impossible (afaik) to enforce REQUIRED fields, such as Price.
The Value is often stored as a VARCHAR.
For that reason, and others, searching the data by Value becomes hard.
final opinion
I would not use option 2, as this is highly constrained and merges two considerations together - How to hold data for different component types (hotels, flights, etc) and how to relate them to their parent packages.
I would instead recommend that you consider the multitude of ways for holding the component data, and make that decision based on your needs. Then relate those components to the packages using a 1:many normalised mapping table. Your option 1.
You haven't mentioned in the question whether you need to support multiple products of the same type inside a single package - i.e. whether a package can contain multiple Hotels, for example.
1) If support for multiple same-type products per package is required then you should go the first way, but maybe split relationships into separate tables per product type, i.e.
PackageHotelItem
- PackageItemId
- PackageId (fk)
- HotelId (fk)
PackageCruiseItem
- PackageItemId
- PackageId (fk)
- CruiseId (fk)
... etc.
This way you will be able to have referential integrity via normal FK mechanism.
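A minimal sketch of one such per-type link table, here PackageHotelItem, using SQLite from Python. The Package and Hotels columns, the sample rows and the `link_hotel` helper are assumptions; the other link tables would follow the same shape.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE Package (PackageId INTEGER PRIMARY KEY);
CREATE TABLE Hotels  (HotelId INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE PackageHotelItem (
    PackageItemId INTEGER PRIMARY KEY,
    PackageId INTEGER NOT NULL REFERENCES Package(PackageId),
    HotelId   INTEGER NOT NULL REFERENCES Hotels(HotelId)
);
INSERT INTO Package VALUES (1);
INSERT INTO Hotels VALUES (5, 'Harbour View'), (6, 'Old Town Inn');
""")

def link_hotel(package_id, hotel_id):
    """Add a hotel to a package; return True if both FKs are satisfied."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO PackageHotelItem (PackageId, HotelId)"
                " VALUES (?, ?)", (package_id, hotel_id))
        return True
    except sqlite3.IntegrityError:
        return False
```

Several hotels can be linked to the same package, and a link to a hotel that does not exist is rejected without any trigger.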
2) If you don't need such support then you may use your second solution.
What is normalization in MySQL and in which case and how we need to use it?
I'll attempt to explain normalization in layman's terms here. First off, it is something that applies to relational databases in general (Oracle, Access, MySQL), so it is not only for MySQL.
Normalisation is about making sure each table has only the minimal set of fields and getting rid of unwanted dependencies. Imagine you have an employee record, and each employee belongs to a department. If you store the department as a field along with the other data of the employee, you have a problem - what happens if a department is removed? You have to update all the department fields, and there's opportunity for error. And what if some employees do not have a department (newly assigned, perhaps)? Now there will be null values.
So normalisation, in brief, is about avoiding fields that would be null, and making sure that all the fields in a table belong to the one domain of data being described. For example, in the employee table, the fields could be id, name, social security number, but those three fields have nothing to do with the department. Only the employee id is needed to determine which department the employee belongs to. So this implies that which department an employee is in should be stored in another table.
Here's a simple normalization process.
EMPLOYEE ( < employee_id >, name, social_security, department_name)
This is not normalized, as explained. A normalized form could look like
EMPLOYEE ( < employee_id >, name, social_security)
Here, the Employee table is only responsible for one set of data. So where do we store which department the employee belongs to? In another table
EMPLOYEE_DEPARTMENT ( < employee_id >, department_name )
This is not optimal. What if the department name changes? (it happens in the US government all the time). Hence it is better to do this
EMPLOYEE_DEPARTMENT ( < employee_id >, department_id )
DEPARTMENT ( < department_id >, department_name )
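The final three-table layout can be tried out with a short sketch like the one below (SQLite via Python). The column types, the sample rows and the two helper functions are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
CREATE TABLE EMPLOYEE (
    employee_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    social_security TEXT
);
CREATE TABLE DEPARTMENT (
    department_id INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL
);
CREATE TABLE EMPLOYEE_DEPARTMENT (
    employee_id   INTEGER PRIMARY KEY REFERENCES EMPLOYEE(employee_id),
    department_id INTEGER NOT NULL REFERENCES DEPARTMENT(department_id)
);
INSERT INTO EMPLOYEE VALUES (1, 'Ann', '000-00-0001'), (2, 'Bob', '000-00-0002'),
                            (3, 'Cy',  '000-00-0003');
INSERT INTO DEPARTMENT VALUES (1, 'Sales');
INSERT INTO EMPLOYEE_DEPARTMENT VALUES (1, 1), (2, 1);  -- Cy has no department yet
""")

def department_of(employee_id):
    """Return the department name for an employee, or None if unassigned."""
    row = conn.execute(
        "SELECT d.department_name FROM EMPLOYEE_DEPARTMENT ed"
        " JOIN DEPARTMENT d ON d.department_id = ed.department_id"
        " WHERE ed.employee_id = ?", (employee_id,)).fetchone()
    return row[0] if row else None

def rename_department(department_id, new_name):
    """Rename a department by touching exactly one row."""
    with conn:
        conn.execute(
            "UPDATE DEPARTMENT SET department_name = ? WHERE department_id = ?",
            (new_name, department_id))
```

Renaming the department updates a single DEPARTMENT row and every employee in it sees the new name, while an employee without a department simply has no EMPLOYEE_DEPARTMENT row instead of a null field.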
There are first normal form, second normal form and third normal form. But unless you are studying a DB course, I usually just go for the most normalized form I could understand.
Normalization is not for MySQL only. It's a general database concept.
Normalization is the process of
efficiently organizing data in a
database. There are two goals of the
normalization process: eliminating
redundant data (for example, storing
the same data in more than one table)
and ensuring data dependencies make
sense (only storing related data in a
table). Both of these are worthy goals
as they reduce the amount of space a
database consumes and ensure that data
is logically stored.
Normal forms in SQL are given below.
First Normal Form (1NF): A relation is
said to be in 1NF if it has only
single valued attributes, neither
repeating groups nor arrays are permitted.
Second Normal Form (2NF): A relation
is said to be in 2NF if it is in 1NF
and every non key attribute is fully
functionally dependent on the primary
key.
Third Normal Form (3NF): We say that a
relation is in 3NF if it is in 2NF and
has no transitive dependencies.
Boyce-Codd Normal Form (BCNF): A
relation is said to be in BCNF if and
only if every determinant in the
relation is a candidate key.
Fourth Normal Form (4NF): A relation
is said to be in 4NF if it is in BCNF
and contains no multivalued dependency.
Fifth Normal Form (5NF): A relation is
said to be in 5NF if and only if every
join dependency in relation is implied
by the candidate keys of relation.
Domain-Key Normal Form (DKNF): We say
that a relation is in DKNF if it is
free of all modification anomalies.
Insertion, Deletion, and update
anomalies come under modification
anomalies
See also
Database Normalization Basics
It's a technique for ensuring that your data remains consistent, by eliminating duplication. So a database in which the same information is stored in more than one table is not normalized.
See the Wikipedia article on Database normalization.
(It's a general technique for relational databases, not specific to MySQL.)
While creating a database schema for your application, you need to make sure that you avoid any information being stored in more than one column across different tables.
As every table in your DB identifies a significant entity in your application, a unique identifier is a must-have column for each of them.
Now, while deciding the storage schema, various kinds of relationships are identified between these entities (tables), viz. one-to-one, one-to-many, many-to-many.
For a one-to-one relationship (eg. A
Student has a unique rank in the
class), same table could be used to
store columns (from both tables).
For a one-to-many relationship (eg.
A semester can have multiple
courses), a foreign key is
created in the child table (the "many" side), referencing the parent.
For a many-to-many relationship (eg.
A Prof. attends to many students and
vice-versa), a third table needs to
be created (with primary key from
both tables as a composite key), and
related data of the both tables will
be stored.
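The one-to-many and many-to-many cases above can be sketched like this, using SQLite from Python. All table names, column names and rows are illustrative assumptions based on the semester/course and professor/student examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default in SQLite
conn.executescript("""
-- One-to-many: the FK lives on the "many" side (Course).
CREATE TABLE Semester (semester_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Course (
    course_id   INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    semester_id INTEGER NOT NULL REFERENCES Semester(semester_id)
);
-- Many-to-many: a third table with a composite primary key.
CREATE TABLE Professor (professor_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE Student   (student_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE ProfessorStudent (
    professor_id INTEGER NOT NULL REFERENCES Professor(professor_id),
    student_id   INTEGER NOT NULL REFERENCES Student(student_id),
    PRIMARY KEY (professor_id, student_id)
);
INSERT INTO Semester VALUES (1, 'Autumn');
INSERT INTO Course VALUES (1, 'Databases', 1), (2, 'Algorithms', 1);
INSERT INTO Professor VALUES (1, 'Prof A'), (2, 'Prof B');
INSERT INTO Student VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO ProfessorStudent VALUES (1, 1), (1, 2), (2, 1);
""")

def courses_in(semester_id):
    """Titles of all courses in a semester, alphabetically."""
    rows = conn.execute(
        "SELECT title FROM Course WHERE semester_id = ? ORDER BY title",
        (semester_id,))
    return [r[0] for r in rows]

def students_of(professor_id):
    """Names of all students linked to a professor, alphabetically."""
    rows = conn.execute(
        "SELECT s.name FROM ProfessorStudent ps"
        " JOIN Student s ON s.student_id = ps.student_id"
        " WHERE ps.professor_id = ? ORDER BY s.name", (professor_id,))
    return [r[0] for r in rows]
```

Each fact is stored once: a course names its semester, and a professor/student pairing is a single row in the link table.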
Once you attend to all these scenarios, your db-schema will be normalized to 4NF.
In the field of relational database
design, normalization is a systematic
way of ensuring that a database
structure is suitable for
general-purpose querying and free of
certain undesirable
characteristics—insertion, update, and
deletion anomalies—that could lead to
a loss of data integrity.[1] E.F.
Codd, the inventor of the relational
model, introduced the concept of
normalization and what we now know as
the first normal form in 1970.[2] Codd
went on to define the second and third
normal forms in 1971,[3] and Codd and
Raymond F. Boyce defined the
Boyce-Codd normal form in 1974.[4]
Higher normal forms were defined by
other theorists in subsequent years,
the most recent being the sixth normal
form introduced by Chris Date, Hugh
Darwen, and Nikos Lorentzos in
2002.[5]
Informally, a relational database
table (the computerized representation
of a relation) is often described as
"normalized" if it is in the third
normal form (3NF).[6] Most 3NF tables
are free of insertion, update, and
deletion anomalies, i.e. in most cases
3NF tables adhere to BCNF, 4NF, and
5NF (but typically not 6NF).
A standard piece of database design
guidance is that the designer should
create a fully normalized design;
selective denormalization can
subsequently be performed for
performance reasons.[7] However, some
modeling disciplines, such as the
dimensional modeling approach to data
warehouse design, explicitly recommend
non-normalized designs, i.e. designs
that in large part do not adhere to
3NF.[8]
Edit: Source: http://en.wikipedia.org/wiki/Database_normalization