How to reference groups of records in relational databases? - relational-database

Humans
| HumanID | FirstName | LastName | Gender |
|---------+-----------+----------+--------|
| 1 | Isaac | Newton | M |
| 2 | Marie | Curie | F |
| 3 | Tim | Duncan | M |
Animals
| AnimalID | Species | NickName |
|----------+---------+----------|
| 4 | Tiger | Ronnie |
| 5 | Dog | Snoopy |
| 6 | Dog | Bear |
| 7 | Cat | Sleepy |
How do I reference a group of records in other tables?
For example:
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+---------|
| 8 | Rice | ??? |
What I want to store in EatenBy may be:
a single record in the Humans or Animals table (e.g. Tim Duncan)
a group of records in a table (e.g. all dogs, all males, all females)
a whole table (e.g. all humans)
A simple solution is to use a concatenated string, which includes primary keys
from different tables and special strings such as 'Humans' and 'M'.
The application could parse the concatenated string.
Foods
| FoodID | FoodName | EatenBy |
|--------+----------+--------------|
| 8 | Rice | Humans, 6, 7 |
Using a concatenated string is a bad idea from the perspective of
relational database design.
Another option is to add another table and use a foreign key.
Foods
| FoodID | FoodName |
|--------+----------|
| 8 | Rice |
EatenBy
| FoodID | EatenBy |
|--------+---------|
| 8 | Humans |
| 8 | 6 |
| 8 | 7 |
It's better than the first solution. The problem is that the EatenBy field stores values with different meanings. Is that a problem? How do I model this requirement? How do I achieve 3NF?
The example tables here are a bit contrived, but I do run into situations like
this at work. I have seen quite a few tables just use a concatenated string. I think it is bad but can't think of a more relational way.

This Answer is laid out in chronological order. The Question progressed in terms of detail, noted as Updates. There is a series of matching Responses.
The progression from the initial question to the final answer stands as a learning experience, especially for OO/ORM types. Major headings mark Responses, minor headings mark subjects.
The Answer exceeds the maximum post length, so I provide the Responses as links in order to overcome that.
Response to Initial Question
You might have seen something like that at work, but that doesn't mean it was right, or acceptable. CSVs break 1NF. You can't search that field easily. You can't update that field easily. You have to manage the content (eg. avoid duplicates; ordering) manually, via code. You don't have a database or anything resembling one; you have a grand Record Filing System that you have to write mountains of code to "process". Just like the bad old days of 1970s ISAM data processing.
The problem is that you seem to want a relational database. Perhaps you have heard of its data integrity, its relational power (Join power, for you, at this stage), and its speed. A Record Filing System has none of that.
If you want a Relational database, then you are going to have to:
think about the data relationally, and apply Relational Database Methods, such as modelling the data, as data, and nothing but data (not as data values).
Then classifying the data (no relation whatever to the OO class or classifier concept).
Then relating the classified data.
The second problem is, and this is typical of OO types, they concentrate on, obsess on, the data values, rather than on the meaning of the data; how it is classified; how it relates to other data; etc.
No question, you did not think that concept up yourself; your "teachers" fed it to you, I see it all the time. And they love the Record Filing Systems. Notice that instead of giving table definitions, you state that you give "structure", but you list data values instead.
In case you don't appreciate what I am saying, let me assure you that this is a classic problem in the OO world, and the solution is easy, if you apply the principles. Otherwise it is an endless mess in the OO stack. Recently I completely eliminated an OO proposal + solution that a very well known mathematician, who supports the OO monolith, proposed. It is a famous paper.
I relationalised the data (ie. I simply placed the data in the Relational context: modelled and Normalised it, which took a grand total of ten minutes), and the problem disappeared, the proposal + solution was not required. Read the Hidders Response. Note, I was not attempting to destroy the paper, I was trying to understand the data, which was presented in schizophrenic form, and the easiest way to do that is to erect a Relational data model. That simple act destroyed the paper.
Please note that the link is an extract of a formal report of a paid assignment for a customer, a large Australian bank, who has kindly given me permission to publish the extract with a view to educating the public about the dangers of ignoring Relational database principles, especially by OO proponents.
The exact same process happened with a second, more famous paper Kohler Response. This response is much smaller, less formal, it was not paid work for a customer. That author was theorising about yet another abnormal "normal form".
Therefore, I would ask you to:
forget about "table structures" or definitions
forget about what you want
forget about implementation options
forget ID columns, completely and totally
forget EatenBy
think about what you have in terms of data, the meaning of the data, not as data values or example data, not as what you want to do with it
think about how that data is classified, and how it can be classified.
how the data relates to other data. (You may think that your EatenBy is that but it isn't, because the data has no organisation yet, to form relationships upon.)
If I look at my crystal ball, most of it is dark, but from the little flecks of light that I can see, it looks like you want:
Things
Groups of Things
Relationships between Things and ThingGroups
The Things are nouns, subjects. Eventually we will be doing something between those subjects, that will be verbs or action statements. That will form Predicates (First Order Logic). But not now; for now, we want only the Things.
Now if you can modify your question and tell me more about your Things, and what they mean, I can give you a complete data model.
Response to Update 1 re Hierarchy
Record IDs are Physical, Non-relational
If you want a Relational Database, you need Relational Keys, not Record IDs. Additionally, starting the Data Modelling exercise with an ID stamped on every file cripples the exercise.
Please read this Answer.
Hierarchies Exist in the Data
If you want a full discourse, please ask a new question. Here is a quick summary.
Hierarchies occur naturally in the world, they are everywhere. That results in hierarchies being implemented in many databases. The Relational Model was founded on, and is a progression of, the Hierarchical Model. It supports hierarchies brilliantly. Unfortunately the famous writers do not understand the RM, they teach only pre-1970s Record Filing Systems badged as "relational". Likewise, they do not understand hierarchies, let alone hierarchies as supported in the RM, so they suppress it.
The result of that is, the hierarchies that are everywhere, that have to be implemented, are not recognised as such, and thus they are implemented in a grossly incorrect and massively inefficient manner.
Conversely, if the hierarchy that occurs in the data that is being modelled, is modelled correctly, and implemented using genuine Relational constructs (Relational Keys, Normalisation, etc) the result is an easy-to-use and easy-to-code database, as well as being devoid of data duplication (in any form) and extremely fast. It is quite literally Relational at its best.
There are three types of Hierarchies that occur in data.
Hierarchy Formed in Sequence of Tables
This requirement, the need for Relational Keys, occurs in every database, and conversely, the lack of it cripples the database and produces a Record Filing System, with none of the integrity, power or speed of a Relational Database.
The hierarchy is plainly visible in the form of the Relational Key, which progresses in compounding, in any sequence of tables: father, son, grandson, etc. This is essential for ordinary Relational data integrity, the kind that Hidders and 95% of the database implementations do not have.
The Hidders Response has a great example of Hierarchies:
a. that exist naturally in the data
b. that OO types are blind to [as Hidders evidently is]
c. they implement RFS with no integrity, and then they try to "fix" the problem in the object layers, adding even more complexity.
Whereas I implemented the hierarchy in a classic Relational form, and the problem disappeared entirely, eliminating the proposed "solution", the paper. Relational-isation eliminates theory.
The two hierarchies in those four tables are:
Domain::Animal::Harvest
Domain::Activity::Harvest
Note that Hidders is ignorant of the fact that the data is an hierarchy; that his RFS doesn't have integrity precisely because it is not Relational; that placing the data in the Relational context provides the very integrity he is seeking outside it; that the Relational Model eliminates all such "problems", and makes all such "solutions" laughable.
Here's another example, although the modelling is not yet complete. Please make sure to examine the Predicates, and page 2 for the actual Keys. The hierarchies are:
Subject::CategorySubject::ExaminationResult
Category::CategorySubject::ExaminationResult
Person::Registrant::Candidate::ExaminationResult
Note that last one is a progression of state of the business instrument, thus the Key does not compound.
Hierarchy of Rows within One Table
Typically a tree structure of some sort, there are literally millions of them. For any given Node, this supports a single ancestor or parent, and unlimited children. Done properly, there is no limit to the number of levels, or the height of the tree (ie. unlimited ancestor and progeny generations).
The terms ancestor and descendant used here are plain technical terms; they do not have the OO connotations and limitations.
You do need recursion in the server, in order to traverse the tree structure, so that you can write simple procs and functions that are recursive.
Here is one for Messages. Please read both the question and the Answer, and visit the linked Message Data Model. Note that the seeker did not mention Hierarchy or tree, because the knowledge of Hierarchies in Relational Databases is suppressed, but (from the comments) once he saw the Answer and the Data Model, he recognised it for the hierarchy that it is, and that it suited him perfectly. The hierarchy is:
Message::Message[Message]::Message[::Message[Message]] ...
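As a minimal sketch (my illustration, not part of the linked Data Model), assuming a hypothetical Message table whose ParentMessageId column points at the parent row, a recursive query can traverse such a tree in DBMSs that support recursive CTEs (MySQL 8+, PostgreSQL; SQL Server uses WITH without the RECURSIVE keyword):

-- Traverse one thread from its root, generation by generation.
WITH RECURSIVE MessageTree AS (
    SELECT MessageId, ParentMessageId, Subject, 0 AS Depth
    FROM   Message
    WHERE  MessageId = 1                     -- the root of the thread we want
    UNION ALL
    SELECT m.MessageId, m.ParentMessageId, m.Subject, t.Depth + 1
    FROM   Message m
    JOIN   MessageTree t ON m.ParentMessageId = t.MessageId
)
SELECT * FROM MessageTree ORDER BY Depth;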
Hierarchy of Rows within One Table, Via an Associative Table
This hierarchy provides an ancestor/descendant structure for multiple ancestors or parents. It requires two relationships, therefore an additional Associative Table is required. This is commonly known as the Bill of Materials structure. Unlimited height, recursively traversed.
The Bill of Materials Problem was a limitation of Hierarchical DBMS, that we overcame partially in Network DBMS. It was a burning issue at the time, and one of IBM's specific problems that Dr E F Codd was explicitly charged to overcome. Of course he met those goals, and exceeded them spectacularly.
Here is the Bill of Materials hierarchy, modelled and implemented correctly.
Please excuse the preamble, it is from an article, skip the top two rows, look at the bottom row.
Person::Progeny is also given.
The hierarchies are:
Part[Assembly]::Part[Component] ...
Part[Component]::Part[Assembly] ...
Person[Parent]::Person[Child] ...
Person[Child]::Person[Parent] ...
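A minimal DDL sketch of the Bill of Materials shape (my naming, not the linked model): one Part table, plus an Associative Table carrying the two relationships, so that a Part can have many components and belong to many assemblies:

CREATE TABLE Part (
    PartCode CHAR(12)    NOT NULL,
    Name     VARCHAR(60) NOT NULL,
    CONSTRAINT PK_Part PRIMARY KEY (PartCode)
);

-- One row per (assembly, component) pair; both columns reference Part,
-- which is what yields the multi-parent hierarchy.
CREATE TABLE PartAssembly (
    AssemblyCode  CHAR(12) NOT NULL,
    ComponentCode CHAR(12) NOT NULL,
    Quantity      INTEGER  NOT NULL,
    CONSTRAINT PK_PartAssembly PRIMARY KEY (AssemblyCode, ComponentCode),
    CONSTRAINT FK_Assembly  FOREIGN KEY (AssemblyCode)  REFERENCES Part (PartCode),
    CONSTRAINT FK_Component FOREIGN KEY (ComponentCode) REFERENCES Part (PartCode)
);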
Ignorance Of Hierarchy
Separate from the fact that hierarchies commonly exist in the data, that they are not recognised as such (due to the suppression), and that they are therefore not implemented as hierarchies: when they are recognised, they are implemented in the most ridiculous, ham-fisted ways.
Adjacency List
The suppressors hilariously state that "the Relational Model doesn't support hierarchies", in denial of the fact that it is founded on the Hierarchical Model (each such statement provides plain evidence that they are ignorant of the basic concepts in the RM, which they allege to be postulating about). So they can't use the proper name. Adjacency List is the stupid name they use instead.
Generally, the implementation will have recognised that there is an hierarchy in the data, but the implementation will be very poor, limited by physical Record IDs, etc, absent of Relational Integrity, etc.
And they are clueless as to how to traverse the tree; they do not grasp that one needs recursion.
Nested Sets
An abortion, straight from hell. A Record Filing System within a Record Filing system. Not only does this generate masses of duplication and break Normalisation rules, this fixes the records in the filing system in concrete.
Moving a single node requires the entire affected part of the tree to be re-written. Beloved of the Date, Darwen and Celko types.
The MS HIERARCHYID Datatype does the same thing. Gives you a mass of concrete that has to be jack-hammered and poured again, every time a node changes.
Ok, it wasn't so short.
Response to Update 2
Response to Update 3
Response to Update 4

For each category that eats the food, you should add one table. For example, if a food may be eaten by a specific gender, you would have:
Food_Gender(FoodID,GenderID)
for humans you would have:
Food_Human(FoodID,HumanID)
for animals species:
Food_AnimalSpc(FoodID,Species)
for an entire table:
Food_Table(FoodID,TableID)
and so on for other categories.
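A hedged sketch of two of those junction tables in SQL, reusing the Foods, Humans and Animals tables from the question (column types are my assumptions):

CREATE TABLE Food_Human (
    FoodID  INTEGER NOT NULL,
    HumanID INTEGER NOT NULL,
    PRIMARY KEY (FoodID, HumanID),
    FOREIGN KEY (FoodID)  REFERENCES Foods  (FoodID),
    FOREIGN KEY (HumanID) REFERENCES Humans (HumanID)
);

-- One row per (food, species) pair; (8, 'Dog') covers all dogs at once.
CREATE TABLE Food_AnimalSpc (
    FoodID  INTEGER     NOT NULL,
    Species VARCHAR(30) NOT NULL,
    PRIMARY KEY (FoodID, Species),
    FOREIGN KEY (FoodID) REFERENCES Foods (FoodID)
);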

Related

MySQL table management [duplicate]

Why do database guys go on about normalisation?
What is it? How does it help?
Does it apply to anything outside of databases?
Normalization is basically to design a database schema such that duplicate and redundant data is avoided. If the same information is repeated in multiple places in the database, there is the risk that it is updated in one place but not the other, leading to data corruption.
There are a number of normalization levels, from first normal form (1NF) through fifth normal form (5NF). Each normal form describes how to get rid of a specific problem.
First normal form (1NF) is special because it is not about redundancy. 1NF disallows nested tables, more specifically columns which allow tables as values. Nested tables are not supported by SQL in the first place, so most normal relational databases will be in 1NF by default. So we can ignore 1NF for the rest of the discussion.
The normal forms 2NF to 5NF all concern scenarios where the same information is represented multiple times in the same table.
For example consider a database of moons and planets:
Moon(PK) | Planet | Planet kind
------------------------------
Phobos | Mars | Rock
Deimos | Mars | Rock
Io | Jupiter | Gas
Europa | Jupiter | Gas
Ganymede | Jupiter | Gas
The redundancy is obvious: the fact that Jupiter is a gas planet is repeated three times, once for each moon. This is a waste of space, but much more seriously, this schema makes inconsistent information possible:
Moon(PK) | Planet | Planet kind
------------------------------
Phobos | Mars | Rock
Deimos | Mars | Rock
Io | Jupiter | Gas
Europa | Jupiter | Rock <-- Oh no!
Ganymede | Jupiter | Gas
A query can now give inconsistent results which can have disastrous consequences.
(Of course a database cannot protect against wrong information being entered. But it can protect against inconsistent information, which is just as serious a problem.)
The normalized design would split the table into two tables:
Moon(PK)  | Planet(FK)
----------------------
Phobos    | Mars
Deimos    | Mars
Io        | Jupiter
Europa    | Jupiter
Ganymede  | Jupiter

Planet(PK) | Planet kind
------------------------
Mars       | Rock
Jupiter    | Gas
Now no fact is repeated multiple times, so there is no possibility of inconsistent data. (It may look like there still is some repetition since the planet names are repeated, but repeating primary key values as foreign keys does not violate normalization since it does not introduce a risk of inconsistent data.)
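As a sketch in SQL (table and type names are my assumptions, not the answer's):

CREATE TABLE Planet (
    Name VARCHAR(30) NOT NULL,
    Kind VARCHAR(10) NOT NULL,   -- 'Rock' or 'Gas'
    PRIMARY KEY (Name)
);

CREATE TABLE Moon (
    Name   VARCHAR(30) NOT NULL,
    Planet VARCHAR(30) NOT NULL,
    PRIMARY KEY (Name),
    FOREIGN KEY (Planet) REFERENCES Planet (Name)
);

Each fact about a planet now lives in exactly one row, so the database cannot contradict itself about it.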
Rule of thumb
If the same information can be represented with fewer individual cell values, not counting foreign keys, then the table should be normalized by splitting it into more tables. For example the first table has 12 individual values, while the two tables only have 9 individual (non-FK) values. This means we eliminate 3 redundant values.
We know the same information is still there, since we can write a join query which returns the same data as the original un-normalized table.
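For instance, against the sketch tables above, this join reproduces the original five-row table:

SELECT m.Name AS Moon, m.Planet, p.Kind AS PlanetKind
FROM   Moon   m
JOIN   Planet p ON p.Name = m.Planet;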
How do I avoid such problems?
Normalization problems are easily avoided by giving a bit of thought to the conceptual model, for example by drawing an entity-relationship diagram. Planets and moons have a one-to-many relationship, which means they should be represented in two different tables with a foreign-key association. Normalization issues happen when multiple entities with a one-to-many or many-to-many relationship are represented in the same table row.
Is normalization important? Yes, it is very important. With normalization errors in a database, you run the risk of getting invalid or corrupt data into it. Since data "lives forever", it is very hard to get rid of corrupt data once it has entered the database.
But I don't really think it is important to distinguish between the different normal forms from 2NF to 5NF. It is typically pretty obvious when a schema contains redundancies - whether it is 3NF or 5NF which is violated is less important as long as the problem is fixed.
(There are also some additional normal forms like DKNF and 6NF which are only relevant for special purpose systems like data-warehouses.)
Don't be scared of normalization. The official technical definitions of the normalization levels are quite obtuse. They make it sound like normalization is a complicated mathematical process. However, normalization is basically just common sense, and you will find that if you design a database schema using common sense, it will typically be fully normalized.
There are a number of misconceptions around normalization:
some believe that normalized databases are slower, and that denormalization improves performance. This is only true in very special cases, however; typically the normalized database is also the fastest.
sometimes normalization is described as a gradual design process where you have to decide "when to stop". But actually the normalization levels just describe different specific problems. The problems solved by normal forms above 3NF are pretty rare in the first place, so chances are that your schema is already in 5NF.
Does it apply to anything outside of databases? Not directly, no. The principles of normalization are quite specific to relational databases. However, the general underlying theme - that you shouldn't have duplicate data whose instances can get out of sync - applies broadly. This is basically the DRY principle.
Most importantly it serves to remove duplication from the database records.
For example, if there is more than one place (table) where the name of a person could come up, you move the name to a separate table and reference it everywhere else. That way, if you need to change the person's name later, you only have to change it in one place.
It is crucial for proper database design, and in theory you should use it as much as possible to keep your data integrity. However, when retrieving information from many tables you lose some performance, and that's why you will sometimes see denormalised (also called flattened) database tables used in performance-critical applications.
My advice is to start with a good degree of normalisation and only denormalise when really needed.
P.S. Also check this article to read more on the subject and on the so-called normal forms: http://en.wikipedia.org/wiki/Database_normalization
Normalization is a procedure used to eliminate redundancy and functional dependencies between columns in a table.
There exist several normal forms, generally indicated by a number. A higher number means fewer redundancies and dependencies. Any SQL table is in 1NF (first normal form), pretty much by definition. Normalizing means changing the schema (often by partitioning the tables) in a reversible way, giving a model which is functionally identical, except with less redundancy and fewer dependencies.
Redundancy and dependency of data are undesirable because they can lead to inconsistencies when modifying the data.
It is intended to reduce redundancy of data.
For a more formal discussion, see the Wikipedia http://en.wikipedia.org/wiki/Database_normalization
I'll give a somewhat simplistic example.
Consider an organization's database that contains family members:
id  | name       | address
----------------------------
214 | Mr. Chris  | 123 Main St.
317 | Mrs. Chris | 123 Main St.
could be normalized as
id  | name       | familyID
---------------------------
214 | Mr. Chris  | 27
317 | Mrs. Chris | 27
and a family table
ID | address
--------------
27 | 123 Main St.
Near-Complete normalization (BCNF) is usually not used in production, but is an intermediate step. Once you've put the database in BCNF, the next step is usually to De-normalize it in a logical way to speed up queries and reduce the complexity of certain common inserts. However, you can't do this well without properly normalizing it first.
The idea being that the redundant information is reduced to a single entry. This is particularly useful in fields like addresses, where Mr. Chris submits his address as Unit-7 123 Main St. and Mrs. Chris lists Suite-7 123 Main Street, which would show up in the original table as two distinct addresses.
Typically, the technique used is to find repeated elements, and isolate those fields into another table with unique ids and to replace the repeated elements with a primary key referencing the new table.
Quoting CJ Date: Theory IS practical.
Departures from normalization will result in certain anomalies in your database.
Departures from First Normal Form will cause access anomalies, meaning that you have to decompose and scan individual values in order to find what you are looking for. For example, if one of the values is the string "Ford, Cadillac", as given by an earlier response, and you are looking for all the occurrences of "Ford", you are going to have to break open the string and look at the substrings. This, to some extent, defeats the purpose of storing the data in a relational database.
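A small sketch of that pain, assuming a hypothetical Person table with a non-atomic Cars column: the only option is substring matching, which cannot use an index on the value and also over-matches (it would find 'Fordson' too):

-- Non-atomic column: slow and unreliable.
SELECT * FROM Person WHERE Cars LIKE '%Ford%';

-- With a normalized PersonCar(PersonID, Make) table, the query is exact and indexable:
SELECT PersonID FROM PersonCar WHERE Make = 'Ford';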
The definition of First Normal Form has changed since 1970, but those differences need not concern you for now. If you design your SQL tables using the relational data model, your tables will automatically be in 1NF.
Departures from Second Normal Form and beyond will cause update anomalies, because the same fact is stored in more than one place. These problems make it impossible to store some facts without storing other facts that may not exist, and therefore have to be invented. Or, when the facts change, you may have to locate all the places where a fact is stored and update all those places, lest you end up with a database that contradicts itself. And, when you go to delete a row from the database, you may find that if you do, you are deleting the only place where a fact that is still needed is stored.
These are logical problems, not performance problems or space problems. Sometimes you can get around these update anomalies by careful programming. Sometimes (often) it's better to prevent the problems in the first place by adhering to normal forms.
Notwithstanding the value in what's already been said, it should be mentioned that normalization is a bottom-up approach, not a top-down approach. If you follow certain methodologies in your analysis of the data, and in your initial design, you can be guaranteed that the design will conform to 3NF at the very least. In many cases, the design will be fully normalized.
Where you may really want to apply the concepts taught under normalization is when you are given legacy data, out of a legacy database or out of files made up of records, and the data was designed in complete ignorance of normal forms and the consequences of departing from them. In these cases you may need to discover the departures from normalization, and correct the design.
Warning: normalization is often taught with religious overtones, as if every departure from full normalization is a sin, an offense against Codd. (little pun there). Don't buy that. When you really, really learn database design, you'll not only know how to follow the rules, but also know when it's safe to break them.
As Martin Kleppmann says in his book Designing Data-Intensive Applications:
Literature on the relational model distinguishes several different normal forms, but the distinctions are of little practical interest. As a rule of thumb, if you’re duplicating values that could be stored in just one place, the schema is not normalized.
Normalization is one of the basic concepts. It means that two things do not influence each other.
In databases, it specifically means that two (or more) tables do not contain the same data, i.e. do not have any redundancy.
At first sight that is really good, because your chances of synchronization problems are close to zero and you always know where your data is. But as your number of tables grows, you will have problems crossing the data and getting summary results.
So, in the end you will finish with a database design that is not purely normalized, with some redundancy (it will be at one of the possible levels of normalization).


EAV vs null vs Mixed

I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even if I'm not building a webshop): how to model the product information?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible and working with this monster-table.
Pro:
Easy queries
Easy layout
Con:
Lot of NULL Values
The application code becomes sensitive to the query (different types require different columns)
EAV-Pattern
Obviously the EAV-Pattern can provide a nicer solution for this. However, I've been working with EAV in the past, and when it comes down to performance, it can become a problem for a huge number of entries.
Searching is easy, but listing a "normalized table" requires one join per actual column -> slow (see the sketch after this list).
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
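A minimal sketch of that listing cost, with hypothetical table and column names: every attribute you want back as a column is either one more self-join or one more CASE branch in a pivot.

-- EAV store: one row per (entity, attribute) pair.
-- Rebuilding a three-column "row" takes three branches (or three joins):
SELECT e.entity_id,
       MAX(CASE WHEN a.attr_name = 'color'  THEN a.attr_value END) AS color,
       MAX(CASE WHEN a.attr_name = 'weight' THEN a.attr_value END) AS weight,
       MAX(CASE WHEN a.attr_name = 'brand'  THEN a.attr_value END) AS brand
FROM   entity e
LEFT JOIN entity_attribute a ON a.entity_id = e.entity_id
GROUP BY e.entity_id;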
Single Table per category
Basically the opposite of the EAV-Pattern: Create one table per product-type, i.e. "cats", "dogs", "cars", ...
While this might be possible for a countable number of categories, it becomes a nightmare for a steadily growing number of categories, if you have to maintain those.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet I found recommendations to mix both approaches: Use a single Table for the common information, while grouping other attributes into "attribute-groups" which are organized in the EAV-Fashion.
However, I think this would basically import the drawbacks of EACH approach: you need to work with regular tables (basic information) and do a huge number of joins to get ALL the information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems hard(er) to query and to work with than a regular database layout.
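For what it's worth, a sketch of querying such a column, assuming MySQL 5.7+ and a hypothetical products table with a JSON details column; it works, but the attribute paths live in query strings rather than in the schema:

SELECT id, name,
       JSON_UNQUOTE(JSON_EXTRACT(details, '$.color')) AS color
FROM   products
WHERE  JSON_EXTRACT(details, '$.wheels') = 4;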
Automating single tables
Another idea was automating the "create a table per category" part (and therefore automating the queries on those), while maintaining a "master table" containing just the id and the category information, in order to get the best performance for an undetermined number of tables...?
i.e.:
Products
id | category | actualId
1 | cat | 1
2 | car | 1
cats
id | color | mew
1 | white | true
cars
id | wheels | bhp
1 | 4 | 123
the (abstract) Product table would allow querying for everything, while details are available via an easy join on "actualId" with the responsible table.
However, this would lead to problems if you want to run a "show all" query, because this is not solvable by SQL alone, since the table name in the join needs to be explicit in the query.
What other options are available? There are a lot of "webshops", each dealing with this problem more or less - how do they solve it in an efficient way?
I strongly disagree with your opinion that the "monster" table approach leads to "Easy queries", and that the EAV approach will cause performance issues (premature optimization?). And it doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.`key`, '[', ext.type, ']', ext.value)) AS ext_attributes
FROM base_attributes base
LEFT JOIN extended_attributes ext
    ON base.id = ext.id
WHERE base.id = ?
GROUP BY base.id, base.other_attributes;
You would need to do some parsing on the above, but a wee bit of polishing would give you something parseable as JSON or XML, without putting your data inside anonymous blobs.
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).

Incremental MySQL database design where future needs are unknown

I am using MySQL, InnoDB, and running it on Ubuntu 13.04.
My general question is: If I don't know how my database is going to evolve or what my needs will eventually be, should I not worry about redundancy and relationships now?
Here is my situation:
I'm currently building a baseball database from scratch, but I am unsure how I should proceed. Right now, I'm approaching the design in a modular fashion. For example, I am currently writing a python script to parse the XML feed of a sports betting website which tells me the money line and the over/under. Since I need to start recording the information, I am wondering if I should just go ahead and populate the tables and worry about keys and such later.
So for example, my python sports odds scraping script would populate three tables (Game,Money Line, Over/Under) like so:
DateTime = Date and time of observation
Game
+-----------+-----------+--------------+
| Home Team | Away Team | Date of Game |
+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+-----------+----------+
| Home Team | Away Team | Date of Game | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+----------+----------+
| Home Team | Away Team | Date of Game | Total | Over | Under | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+----------+
I feel like I should be doing something with the redundant (home team, away team, date of game) columns of information, but I don't really know how my database is going to expand, and in what ways I will be linking everything together. I'm basically building a database so I can answer complicated questions such as:
How does weather in Detroit affect the betting lines when Justin Verlander is pitching against teams who have averaged 5 or fewer runs per game for 20 games prior to the appearance against Verlander? (As you can see, complex questions create complex relationships and queries.)
So is it alright if I go ahead and start collecting data as shown above, or is this going to create a big headache for me down the road?
The topic of future proofing a database is a large one. In general, the more successful a database is, the more likely it is to be subjected to mission creep, and therefore to have new requirements.
One very basic question is this: who will be providing the new requirements? From the way you wrote the question, it sounds like you have built the database to fit your own requirements, and you will also be inventing or discovering the new requirements down the road. If this is not true, then you need to study the evolving pattern of your client(s) needs, so as to at least guess where mission creep is likely to lead you.
Normalization is part of the answer, and this aspect has been dealt with in a prior answer. In general, a partially denormalized database is less future-proof than a fully normalized database. A denormalized database has been adapted to present needs, and the more adapted something is, the less adaptable it is. But normalization is far from the whole answer. There are other aspects of future-proofing as well.
Here's what I would do. Learn the difference between analysis and design, especially with regard to databases. Learn how to use ER modeling to capture the present requirements WITHOUT including the present design. Warning: not all experts in ER modeling use it to express requirements analysis. In particular, you omit foreign keys from an analysis model because foreign keys are a feature of the solution, not a feature of the problem.
In parallel, maintain a relational model that conforms to the requirements of your ER model and also conforms to rules of normalization, and other rules of simple sound design.
When a change comes along, first see if your ER model needs to be updated. Sometimes the answer is no. If the answer is yes, first update your ER model, then update your relational model, then update your database definitions.
This is a lot of work. But it can save you a lot of work, if the new requirements are truly crucial.
Try normalizing your data (so that you do not have redundant info) like:
Game
+---+-----------+-----------+--------------+
|ID | Home Team | Away Team | Date of Game |
+---+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+
| Game_ID | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+
| Game_ID | Total | Over | Under | DateTime |
+-----------+-----------+--------------+-----------+-----------+
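A hedged DDL sketch of that shape for MySQL/InnoDB (column types and the composite keys are my assumptions):

CREATE TABLE Game (
    ID       INT         NOT NULL,
    HomeTeam VARCHAR(40) NOT NULL,
    AwayTeam VARCHAR(40) NOT NULL,
    GameDate DATE        NOT NULL,
    PRIMARY KEY (ID)
) ENGINE=InnoDB;

-- Each odds observation references the game instead of repeating its identity.
CREATE TABLE MoneyLine (
    Game_ID  INT          NOT NULL,
    HomeLine DECIMAL(6,2) NOT NULL,
    AwayLine DECIMAL(6,2) NOT NULL,
    DateTime DATETIME     NOT NULL,
    PRIMARY KEY (Game_ID, DateTime),
    FOREIGN KEY (Game_ID) REFERENCES Game (ID)
) ENGINE=InnoDB;

CREATE TABLE OverUnder (
    Game_ID  INT          NOT NULL,
    Total    DECIMAL(5,1) NOT NULL,
    Over     DECIMAL(6,2) NOT NULL,
    Under    DECIMAL(6,2) NOT NULL,
    DateTime DATETIME     NOT NULL,
    PRIMARY KEY (Game_ID, DateTime),
    FOREIGN KEY (Game_ID) REFERENCES Game (ID)
) ENGINE=InnoDB;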
You can read more on NORMALIZATION here

Which of these MySQL database designs (attached) is best for mostly read high performance?

I am a database admin and developer in MySQL. I have been working with MySQL for a couple of years. I recently acquired and studied O'Reilly's High Performance MySQL, 2nd Edition, to improve my skills in MySQL's advanced features, high performance and scalability, because I have often been frustrated by my lack of advanced MySQL knowledge (which, in large part, I still have).
Currently, I am working on an ambitious web project. In this project, we will have quite a lot of content and users from the beginning. I am the designer of the database, and this database must be very fast (some inserts, but mostly and more importantly READS).
I want to discuss these requirements here:
There will be several kinds of items
The items have some fields and relations in common
The items also have some special fields and relations that make them different from each other
Those items will have to be listed all together, ordered or filtered by common fields or relations
The items will also have to be listed by type only (for example, item_specialA)
I have some basic design doubts, and I would like you to help me decide and learn which design approach would be better for a high-performance MySQL database.
Classical approach
The following diagram shows the classical approach, which is the first one you might think of when thinking in database terms: Database diagram
Centralized approach
But maybe we can improve on it with a pseudo-object-oriented paradigm, centralizing the common fields and relations in one common item table. It would also be useful for listing all kinds of items: Database diagram
Advantages and disadvantages of each one?
Which approach would you choose, or which changes would you apply, given the requirements above?
Thanks all in advance!!
What you have are two distinct data-mapping strategies. The one you called "classical" is "one table per concrete class" in other sources, and the one you called "centralized" is "one table per class" (Mapping Objects to Relational Databases: O/R Mapping In Detail). They both have their advantages and disadvantages (follow the link above). The queries in the first strategy will be faster (you need to join only 2 tables vs 3 in the second strategy).
I think that you should explore the classic supertype/subtype pattern. Here are some examples from SO.
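A minimal sketch of that pattern with assumed names: the supertype table holds the common fields (and a type discriminator), and each subtype table shares the supertype's key, which makes both the "list everything" and the "list only item_specialA" queries straightforward:

CREATE TABLE Item (
    ItemID   INT         NOT NULL,
    ItemType CHAR(1)     NOT NULL,   -- discriminator, e.g. 'A', 'B'
    Name     VARCHAR(60) NOT NULL,   -- common fields live here
    PRIMARY KEY (ItemID)
);

-- Subtype table: same key as the supertype, plus the special fields.
CREATE TABLE ItemSpecialA (
    ItemID     INT         NOT NULL,
    ExtraField VARCHAR(60) NOT NULL,
    PRIMARY KEY (ItemID),
    FOREIGN KEY (ItemID) REFERENCES Item (ItemID)
);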
If you're looking mostly for speed, consider selective use of MyISAM tables, use a centralized "object" table, and just one additional table with correct indexes, in this form:
object_type | object_id | property_name | property_value
---------------------------------------------------------
user        | 1         | photos        | true
city        | 2         | photos        | true
user        | 5         | single        | true
city        | 2         | metro         | true
city        | 3         | population    | 135000
and so on. Lookups on primary keys or indexed keys (object_type, object_id, property_name), for example, will be blazing fast. Also, you avoid ending up with 457 tables as new properties appear.
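A sketch of that table with the composite key described (naming and types are my assumptions):

CREATE TABLE object_property (
    object_type    VARCHAR(20)  NOT NULL,
    object_id      INT          NOT NULL,
    property_name  VARCHAR(40)  NOT NULL,
    property_value VARCHAR(255) NOT NULL,
    PRIMARY KEY (object_type, object_id, property_name)
) ENGINE=MyISAM;

-- A point lookup then walks the composite key from the left:
SELECT property_value
FROM   object_property
WHERE  object_type = 'city' AND object_id = 3 AND property_name = 'population';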
It isn't exactly a well-designed or perfectly normalized database, and if you are aiming for a long-term big site, you should consider caching, or at least a denormalized paradigm: denormalized MySQL tables like this one, Redis, or maybe MongoDB.