(edited 1/5 10:22hr. added some explanation about my notation. And added some additional information I received)
I am doing a course on database design, and we are currently doing ERDs and designing databases in MySQL Workbench. Think 1st, 2nd and 3rd NF, creating schemas, tables, constraints, etc.
Most of it is pretty clear to me.
However there's one aspect where things remain unclear: the X:X to 1:many relationship vs the X:X to 0:many relationship (meaning: whatever to 0:many, vs whatever to 1:many, etc).
In some cases it's obvious, in others not so much. Whenever it's unclear to me, it's mostly something like this:
Example :
an artist has 1 to many paintings. A painting has 1-and only 1 artist.
Relationship:
|artist| 1:1 -------- 1:many |painting|
the same in another notation
|artist| ||------------ 1< |painting|
This seems fair, but....Then there's the thought: I could be a new artist, not having produced a painting yet.
Or: I could be entering a new artist into an artist table, not having entered his paintings yet (which could lead to a practical issue).
Another example:
A workshop has 1 to many participants. A participant enters 0-to-many workshops.
Relationship:
|workshop| many:0 ------- 1:many |participant|
Okay. However: a workshop could have 0 participants (no one wants to participate, probably leading to cancellation).
Or: I could be entering a new workshop into a table, not having added any participants yet.
Another example:
An event is held at 1 and only 1 location. A location has 1 to many events.
Relationship: |event| many:1 -------- 1:1 |location|
However, maybe you're entering a new (future) location, and there have not been events there yet.
Long story short: I am having a hard time establishing the minimum cardinality in cases like the above.
Also, when I'm designing a db and get Workbench to forward engineer the SQL for creating the tables (based on my ERD), there doesn't seem to be any difference between a X to 1/many vs a X to 0/many variant. Which makes me wonder: what's the actual (practical) effect or implication of doing one or the other? Maybe the implications (further down the road) make for an easier choice?
Can anyone explain this matter in a simple (fool-proof) way?
Looking forward to understanding!
Addition 1/5:
I've talked about my question/issue with a teacher. He agreed with me that certain minimum cardinalities could lead to a deadlock:
a row cannot be inserted into one table without there being an occurrence in the other, and vice versa.
He explained to me that the ERD is a logical model, not per se a physical model. In other words, the ERD's minimum cardinality is not necessarily meant for technical implementation.
Well, if that is the case, I understand his point. Usually an artist has at least one painting. A workshop normally has at least one participant. A location usually has at least one event. So on a logical level, that seems fine.
On a technical/implementation level, it is another deal. You should be able to enter an artist, workshop or location without there already being occurrences in another table.
My question now is:
is this true? Is an ERD a logical model, not a technical model?
and if that is so, WHAT is the reason for adding the minimum cardinality? It seems of little use.
Let's continue with your artist::paintings relationship, I think that's the clearest. When we say 1 to many, we often mean 0/1 to 0/many, but not all the 4 permutations work or are meaningful.
1. How does "I can have no artists with no paintings" sound? That's the zero-to-zero permutation. It does not sound wrong, but it's a degenerate case that is of no use to us.
2. 1 artist to 0 paintings is OK, as matbailie described, and presents no problems.
3. 1 artist to many paintings is OK, and is the main use case.
4. 0 artists to many paintings is just not correct and should not be supported in the model. A painting must have an artist.
Because cases 1 & 4 don't work, it is not really correct to say 0/1 to 0/many. It is more correct to characterize the cardinality as 1 to 0/many, which encompasses cases 2 & 3 above.
You might say, 'but there are cases where we don't know the artist so we should leave an opportunity to have paintings with no artists'. This statement feels to me like you are leaving the realm of the ERD and entering into physical design. From a design standpoint you could just as easily say there is an artist we just don't know who, so let's create an UNKNOWN ARTIST record and connect those paintings to it.
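To make the 1 to 0/many reading concrete, here is a minimal sketch of how it could look once it reaches a table definition. The table and column names (artist, painting, artist_id) are assumptions for illustration, and Python's built-in sqlite3 module stands in for MySQL, so details may differ on your platform.

```python
import sqlite3

# SQLite is only a stand-in for MySQL; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE artist (
        artist_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    CREATE TABLE painting (
        painting_id INTEGER PRIMARY KEY,
        title       TEXT NOT NULL,
        -- NOT NULL plus a foreign key: every painting has exactly one artist
        artist_id   INTEGER NOT NULL REFERENCES artist(artist_id)
    );
""")

# Case 2: an artist with zero paintings is accepted.
conn.execute("INSERT INTO artist (artist_id, name) VALUES (1, 'New Artist')")

# Case 4: a painting with no artist is rejected by NOT NULL.
try:
    conn.execute("INSERT INTO painting (title, artist_id) VALUES ('Orphan', NULL)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A painting that points at a non-existent artist is rejected by the FK.
try:
    conn.execute("INSERT INTO painting (title, artist_id) VALUES ('Ghost', 99)")
    ghost_rejected = False
except sqlite3.IntegrityError:
    ghost_rejected = True

# Case 3: the main use case, a painting for an existing artist.
conn.execute("INSERT INTO painting (title, artist_id) VALUES ('First Work', 1)")
painting_count = conn.execute("SELECT COUNT(*) FROM painting").fetchone()[0]
```

Note that nothing in this sketch can express "an artist must have at least one painting". A plain foreign key only constrains the child row, which is consistent with the generated SQL looking the same for the "X to 1/many" and "X to 0/many" variants.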
Your second example (workshops::participants) looks more like a 0/many to 0/many. If you run through the same 4 permutations, they all look credible, though the 0-to-0 case still seems kind of ludicrous.
Your last example is another 1 to 0/many because events without locations cannot be held. When you get to the physical level you can talk about the best way to handle virtual events.
So, none of your examples seem to show a true zero-to-many (which is more accurately stated 0/1 to 0/many). I'm thinking they are pretty rare, if they exist at all. It would have to be associated with some kind of optional activity where if you did enroll in it there were a constrained set of choices.
But... A participant can be in 0-N workshops. So that needs to be many-to-many.
In general "zero" is a degenerate case of "1" or "N"; it is rarely worth worrying about.
I like to start by identifying the "entities" in the model. Your participant, workshop, painting, event, artist, location are excellent examples of such. Then I look for "relations" between obvious pairs.
I say there are only 3 cases. And I like to remember that the two "entities" are manifested as two "database tables":
1:1 -- At which point I ask why the two entities (database tables) are not merged together. The two tables share the same unique key.
1:many -- Represented by the id of the "1" as a column in the "many" table.
Many:many -- This needs a link table between the two tables. This linking table has two ids.
"Id" means a unique key for a table. It is usually a number, but that is not a requirement.
For "0"...
1:0 or many:0 -- You may need a LEFT JOIN to provide NULLs when the entry on the "0" side is missing
Many:many -- If either id is non-existent (the zero case), then there are no rows for that relationship.
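As a rough illustration of the many:many case and its "zero" variant, the sketch below uses a link table for the workshop and participant entities and a LEFT JOIN to keep a workshop that has no participants yet. The names are assumptions, and sqlite3 is used only so the example runs as-is; the SQL should carry over to MySQL with minor changes.

```python
import sqlite3

# Hypothetical schema; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE workshop (workshop_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE participant (participant_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- Link table: one row per (workshop, participant) pair, keyed by both ids
    CREATE TABLE workshop_participant (
        workshop_id    INTEGER NOT NULL REFERENCES workshop(workshop_id),
        participant_id INTEGER NOT NULL REFERENCES participant(participant_id),
        PRIMARY KEY (workshop_id, participant_id)
    );
    INSERT INTO workshop VALUES (1, 'Pottery'), (2, 'Weaving');
    INSERT INTO participant VALUES (10, 'Ann'), (11, 'Bob');
    -- Weaving has no rows here: that is the "zero" case
    INSERT INTO workshop_participant VALUES (1, 10), (1, 11);
""")

# LEFT JOIN keeps Weaving; COUNT(column) skips the NULLs it produces.
rows = conn.execute("""
    SELECT w.title, COUNT(wp.participant_id) AS participants
    FROM workshop AS w
    LEFT JOIN workshop_participant AS wp ON wp.workshop_id = w.workshop_id
    GROUP BY w.workshop_id, w.title
    ORDER BY w.workshop_id
""").fetchall()
```

With an inner JOIN instead of LEFT JOIN, the workshop without participants would simply disappear from the result.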
Then comes defining the INDEXes for efficient access. And, optionally, FOREIGN KEYs for integrity. The indexes that represent how two entities are related are prime candidates for FKs. Other INDEXes should be added to optimize WHERE clauses in SQL queries.
In all cases, the id/FK/index may be "composite" -- meaning that it is two or more columns that are used for a single id/FK/index.
I am working on a multiple-choice online test project. I have designed a database to store the results, but I am looking for a more optimized way.
Requirements:
Every question has four options.
Only one option can be selected and that needs to be stored in database.
My design:
tables:
students (stud_id, name, email)
tests (test_id, testname, duration)
questions (que_id, question, opt1, opt2, opt3, opt4, answer, test_id)
answers (stud_id, que_id, answer)
This way answers can be stored, but it increases the number of records, because a new record is added to the answers table for every question a student solves.
e.g.
If one test consists of 100 questions and 1000 students take that test, there will be 100 records per student, and 100k records for 1000 students.
Is there a better way to do this, with fewer records?
Initial Response
Understanding the Data
You have done good work. As far as the data is concerned, the design is correct, but incomplete. There are two errors:
opt1…opt4 is a repeating group, which breaks 1NF. It must be placed in a separate table.
Further, there seems to be no option name or descriptor, which is strange (what do you paint on the page, next to each radio button?)
If you ever add a fifth option, that is now catered for; if you have questions with less than four options, that is now catered for.
Conversely, you have a fixed set of columns, and if there are any such changes in the future, you have to change both the database and the existing code. And the code will be horrendous (extra processing instead of direct SELECTs)
Your answers table has no integrity. As it stands, answers can be recorded against a question that the student was not asked, or for a test that the student did not sit. Prevention of that type of error is ordinary fare in a Relational Database, and it is not possible in a Record Filing System.
In these dark days of IT, this is a common trend. People focus on the data values; they imagine the values in spreadsheet form, and they go directly to implementing objects that contain those values, instead of understanding the data and what it means.
answers(stud_id, que_id, answer) has no meaning, no integrity, unless the context of a student_test is asserted.
The third item is not an error, because you did not give it as a requirement. However, it seems to me that a question can be used in more than one test. The way you have set it up, such questions will be duplicated (the whole point of a database is to Normalise it, such that there is no duplication).
Of course, the consequence is an Associative Table, test_question.
Questions
This way answers can be stored, but it increases the number of records, because a new record is added to the answers table for every question a student solves.
Yes. That is normal for a database.
Is there a better way to do this, with fewer records?
For a Record Filing System, yes. For a database, no. Since you have tagged your question as database-design, I will assume that that is what you want.
A database is a collection of facts, not of records with related fields. The facts are about the real world, limited to the scope of the database and app.
It is important to determine the discrete facts that we need, because subordinate facts depend on higher-order facts. That is database design. And we Normalise the data, as we progress, as part of one and the same exercise. Normalisation has the purpose of eliminating duplication, otherwise you have Update Anomalies. And we determine Relational Keys, as we progress, again as part of one and the same exercise. Relational Keys provide the logical structure of a Relational database, ie. the logical integrity.
e.g. If one test consists of 100 questions and 1000 students take that test, there will be 100 records per student, and 100k records for 1000 students.
Yes. But that is expressed in ISAM record-processing terms. In database terms, you cannot get around the fact that the database stores:
facts about 100 questions
facts about 1,000 students
facts about 1,000 students times the 100 choices they made
You need to get your head around two things: the large number of discrete facts; and the use of compound Keys. Both are essential to Relational databases. If either of those are missing, or you implement them with reluctance, you will not have the integrity, power, or speed of a Relational database, you will have a pre-1970's ISAM Record Filing System.
Further, the SQL platforms, and to some degree the non-SQL platforms such as MySQL, are heavily optimised for processing sets of data (not record-by-record); heavy I/O and caching; etc. If you implement the structures required for high concurrency, you will obtain even more performance.
Implementation
As far as the implementation is concerned, and particularly since you are concerned about performance, there are errors. A restatement would be, the implementation should not be attempted until the data is understood and modelled correctly.
The problem across the board is that you have added a surrogate (there is no such thing as "surrogate key", it is simply a surrogate, a physical record id). It is far too early in the modelling exercise to add surrogates: the exercise has not progressed enough, and the model is not stable.
Surrogates are always an additional column plus the underlying index. Obviously that consumes resources, and has a cost on inserts and deletes.
Surrogates do not provide row uniqueness, which is demanded in a relational database.
The Relational Model demands that Keys are made up from the data. Relational Keys provide row uniqueness.
A surrogate isn't made up from the data. Therefore it is not a Relational Key, and it does not provide any of the qualities of one.
If a surrogate is used, it does not replace the Key, it is in addition to the Key. Which is why we evaluate the need for surrogates after, not before, modelling the data. It is an implementation concern, not a modelling one.
Solution
Rather than going back and forth, let me provide the proposal, and you can discuss it.
Student Test Data Model (Page 1 only, for those following the progression).
If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
For test and question, I have left id columns in, but note that you will be much better off with short, meaningful codes.
student_id is valid because both name and email are too large to migrate to the child tables.
Please check the Verb Phrases carefully, they comprise a set of Predicates. The remainder of the Predicates can be determined directly from the model. If this is not clear, please ask.
See if you can determine that this is a collection of facts, and each fact is discrete precisely because other facts depend on it; that it is not a collection of records with fields that are related.
Your answers table has no integrity. As it stands, answers can be recorded against a question that the student was not asked, or for a test that the student did not sit. Prevention of that type of error is ordinary fare in a Relational Database, and it is not possible in a Record Filing System.
That is now prevented. The answers table, now named student_response, now has some integrity. A student is registered for a test in student_test, and the student_responses are constrained to student_test.
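For readers without access to the diagram, here is a minimal sketch of the kind of constraint being described. It is not the linked data model: the table and column names (student_test, student_response, question_no, option_no) are assumptions, and sqlite3 stands in for MySQL. The point it tries to show is that a compound foreign key lets the database itself refuse a response from a student who is not registered for the test.

```python
import sqlite3

# Hypothetical simplification of the model; SQLite stands in for MySQL.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE test (test_id INTEGER PRIMARY KEY, test_name TEXT NOT NULL);
    CREATE TABLE question (
        test_id     INTEGER NOT NULL REFERENCES test(test_id),
        question_no INTEGER NOT NULL,
        PRIMARY KEY (test_id, question_no)
    );
    CREATE TABLE student_test (
        student_id INTEGER NOT NULL REFERENCES student(student_id),
        test_id    INTEGER NOT NULL REFERENCES test(test_id),
        PRIMARY KEY (student_id, test_id)
    );
    CREATE TABLE student_response (
        student_id  INTEGER NOT NULL,
        test_id     INTEGER NOT NULL,
        question_no INTEGER NOT NULL,
        option_no   INTEGER NOT NULL,
        -- one response per student per question
        PRIMARY KEY (student_id, test_id, question_no),
        -- the student must be registered for this test
        FOREIGN KEY (student_id, test_id)
            REFERENCES student_test(student_id, test_id),
        -- the question must belong to this test
        FOREIGN KEY (test_id, question_no)
            REFERENCES question(test_id, question_no)
    );
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO test VALUES (100, 'Algebra');
    INSERT INTO question VALUES (100, 1), (100, 2);
    INSERT INTO student_test VALUES (1, 100);  -- only Ann sits the test
""")

def try_response(student_id, test_id, question_no, option_no):
    """Return True if the database accepts the response, False if it refuses."""
    try:
        conn.execute(
            "INSERT INTO student_response VALUES (?, ?, ?, ?)",
            (student_id, test_id, question_no, option_no),
        )
        return True
    except sqlite3.IntegrityError:
        return False

ann_ok = try_response(1, 100, 1, 3)                 # registered, valid question
bob_blocked = not try_response(2, 100, 1, 2)        # Bob never sat the test
bad_question_blocked = not try_response(1, 100, 99, 1)  # question not in test
duplicate_blocked = not try_response(1, 100, 1, 4)  # second answer, same question
```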
Please comment/discuss.
Response to Comments
I will add additional table subject (subject_id, subject_name) and add that subject_id in question table as FK is this okay?
Yes, by all means. But that has consequences. Some advice to make sure we do that properly, across the board:
As explained, do not use surrogates (Record IDs) unless you absolutely have to. Short Codes are much better for Identifiers, for both users and developers.
If you would like more info on the problems related to ID columns, read this Answer.
Subject is important. It is the context in which (a) a question exists, and (b) a test exists. They did exist as independent items (page 1 of the DM), but now they are subordinate to subject. The addition substantially improves data integrity.
The fact of a student registration and the fact of a student sitting for a test, are discrete and separate facts.
Gratefully, that eliminated two surrogates question_id and test_id. Short codes such as CHAR(2) are easier and more meaningful.
Note the improvement in the table names, improved clarity.
I have updated the Student Test Data Model (Page 2 only, for those following the progression).
However, that exposes something (that is why we model data, paper is cheap, many drafts are normal). If we evaluate the Predicates (readily visible in the Data Model, as detailed in the IDEF1X Notation document):
each subject_test was taken by 0-to-n student_tests
each student_test is [a taking of] 1 subject_test
each student took 0-to-n student_tests
each student_test is taken by 1 student
those Predicates are not accurate. A student can sit for a test in any subject. Given the new subject table, I would think that we want students to be registered for subjects, and therefore student_test to be constrained to subjects that the student is registered for.
If you would like information on the important Relational concept of Predicates, and how it is used to both understand and verify the model, visit this Answer, scroll down until you find the Predicate section, and read that carefully.
I have updated the Student Test Data Model (Page 3). Now we have even more integrity, such that student_test is constrained to subjects that the student is registered for. The relevant Predicates are:
each student registered for 0-to-n student_subjects
each student_subject is a registration of 1 student
each subject attracted 0-to-n student_subjects
each student_subject is an attraction of 1 subject
each subject_test was taken by 0-to-n student_tests
each student_test is [a taking of] 1 subject_test
each student_subject took 0-to-n student_tests
each student_test is taken by 1 student_subject
Now the data model appears to be complete.
Context is everything in a database.
The data hierarchies are plainly visible in the compounding of the Keys.
Notice that it is the Relational Keys, in the child tables, that provide Relational Integrity with the parent tables, to every higher level (parent, grandparent) in the hierarchy.
In case it is not obvious, notice the power of Relational Joins. Something you cannot do with Record Filing Systems that have ID fields in every File. Eg:
- Join `student_response` directly to `subject` on `subject_code`, without having to navigate the two levels in-between
- Join `student_response` directly to `student` on `student_id`, without having to navigate the two levels in-between
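A small sketch of the first of those joins follows. The columns shown (subject_code, test_code, question_no, option_no) are assumptions about what the compound key might carry, the intermediate tables are omitted for brevity, and sqlite3 stands in for MySQL. Because the parent's key is part of the child's key, the lowest-level table can be joined straight to the top-level one.

```python
import sqlite3

# Hypothetical columns; only the two ends of the hierarchy are created here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE subject (
        subject_code TEXT PRIMARY KEY,
        subject_name TEXT NOT NULL
    );
    CREATE TABLE student_response (
        student_id   INTEGER NOT NULL,
        subject_code TEXT    NOT NULL REFERENCES subject(subject_code),
        test_code    TEXT    NOT NULL,
        question_no  INTEGER NOT NULL,
        option_no    INTEGER NOT NULL,
        -- the key compounds downward through the hierarchy
        PRIMARY KEY (student_id, subject_code, test_code, question_no)
    );
    INSERT INTO subject VALUES ('MA', 'Mathematics'), ('PH', 'Physics');
    INSERT INTO student_response VALUES
        (1, 'MA', 'T1', 1, 2),
        (1, 'MA', 'T1', 2, 4),
        (1, 'PH', 'T1', 1, 1);
""")

# Responses per subject, joined directly on subject_code.
rows = conn.execute("""
    SELECT s.subject_name, COUNT(*) AS responses
    FROM student_response AS r
    JOIN subject AS s ON s.subject_code = r.subject_code
    GROUP BY s.subject_code, s.subject_name
    ORDER BY s.subject_code
""").fetchall()
```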
No, there is no better design, because the design has nothing to do with how many records will be in the tables. You will choose the same design, no matter whether you deal with ten students or ten thousand.
Your table design looks good. Don't worry about the number of records. A dbms is made to deal with large tables. And 100k records is still a small database. I wouldn't even change this design if there were billions of answers to store.
If you want to normalize the data, then I'd create the tables a little differently.
Your Student table looks fine. Generally, I use a singular name for tables, rather than plural.
Student
-------
Student ID
Name
Email
...
Here's the Test table:
Test
----
Test ID
Test Name
...
We tie students to tests with a junction table.
StudentTest
-----------
Student ID
Test ID
Test Started Timestamp
Test Duration
...
The time and length of the test vary from student to student, so those columns are included on the StudentTest table.
The Question table.
Question
--------
Question ID
Question Text
And the Answer table.
Answer
------
Answer ID
Answer Text
Now here's where things get tricky. You could assign questions to a test based on the ID, like this.
TestQuestion
------------
Test ID
Question ID
But if you do that, and someone changes the Question text after the test, then the Question ID is pointing to a different question than the question on the test.
To solve this problem, we create history tables like this:
QuestionHistory
---------------
QuestionHistory ID
Question Text
AnswerHistory
-------------
AnswerHistory ID
Answer Text
So, we create the TestQuestion table like this:
TestQuestion
------------
Test ID
QuestionHistory ID
And copy the questions as well as the answers to the history tables.
For similar reasons, we create the QuestionAnswer table like this:
QuestionAnswer
--------------
QuestionHistory ID
AnswerHistory ID
Is Correct Answer
Your code could make sure that each question has 4 possible answers. The database allows for more or fewer than 4 possible answers.
Finally, we tie the student's answers to the test questions.
StudentQuestionAnswer
---------------------
Student ID
Test ID
QuestionHistory ID
AnswerHistory ID
Is Correct Answer
Yes, the Test ID column is duplicated here. This is so you can query by the test as well as the student that took the test.
The Is Correct Answer field has a different meaning in the QuestionAnswer table and the StudentQuestionAnswer table. In the QuestionAnswer table, the Is Correct Answer boolean points to the correct answer. In the StudentQuestionAnswer table, the Is Correct Answer boolean signifies that the student answered the question correctly.
This should be a complete question / answer database. You could tie tests to courses if you want.
You can store a student's answers as a single ~-separated record, with the corresponding question ids in a matching ~-separated record. This way there is only one record per student id, and you can still decode the answer for a particular question id.
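A minimal sketch of that encoding in application code is shown below. The function names and the use of single letters for answers are assumptions for illustration only.

```python
def encode(answers: dict[int, str]) -> tuple[str, str]:
    """Pack {question_id: answer} into two parallel ~-separated strings."""
    que_ids = "~".join(str(que_id) for que_id in answers)
    picks = "~".join(answers.values())
    return que_ids, picks

def decode(que_ids: str, picks: str) -> dict[int, str]:
    """Rebuild {question_id: answer} from the two ~-separated strings."""
    return {
        int(que_id): answer
        for que_id, answer in zip(que_ids.split("~"), picks.split("~"))
    }

# One row per student would hold these two strings.
stored = encode({1: "B", 2: "D", 3: "A"})
answer_to_question_2 = decode(*stored)[2]
```

Keep in mind that with this layout the database sees only two opaque strings, so it cannot index or constrain individual answers; all of that work moves into the application.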
I'm rewriting a system that is currently linked to a MySQL database that is roughly 1GB in size. There are hundreds of thousands of articles, each with a list of contributors (think Wiki style). I've not yet been given access to the existing database schema, but while I wait I've been brainstorming a bit.
Basically, what I'm wondering is if having an article_contributors table would be an efficient way of handling this or if there is a better method to approaching this situation. Considering there are roughly 200,000 articles, if there are 5 contributors on each, that'd be 1,000,000 rows in the meta table.
I'd call that a one-to-many table, not a "meta" table. Or else a multi-valued attribute.
Storing contributors in a separate table, one per row, is the proper way of designing a relational database. There may be other ways to store the data, but they are not relational.
Consider my answer to Is storing a delimited list in a database column really that bad? Storing the contributors as a list in the articles table causes a lot of common SQL queries to break or become horribly inefficient. If you need to do a variety of queries against this data, you will thank yourself for storing it in a normalized fashion.
On the other hand, if you never query anything but the list of contributors as an indivisible unit, then why not store it denormalized (as a list)? That's a valid choice too -- but it depends on how you're going to use the table.
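To illustrate one way a delimited list can go wrong, here is a small sketch comparing a substring search on a list column with a lookup in a one-row-per-contributor table. The table names, column names and sample data are made up, and sqlite3 stands in for MySQL.

```python
import sqlite3

# Hypothetical tables holding the same data in both styles.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (
        article_id       INTEGER PRIMARY KEY,
        title            TEXT NOT NULL,
        contributor_list TEXT NOT NULL          -- denormalized: 'ann,bob'
    );
    CREATE TABLE article_contributor (          -- normalized: one row each
        article_id  INTEGER NOT NULL REFERENCES article(article_id),
        contributor TEXT    NOT NULL,
        PRIMARY KEY (article_id, contributor)
    );
    INSERT INTO article VALUES (1, 'Cats', 'ann,bob'), (2, 'Dogs', 'joann');
    INSERT INTO article_contributor VALUES (1, 'ann'), (1, 'bob'), (2, 'joann');
""")

# Substring search on the list also matches 'joann'.
from_list = [r[0] for r in conn.execute(
    "SELECT article_id FROM article "
    "WHERE contributor_list LIKE '%ann%' ORDER BY article_id"
)]

# Equality on the normalized table matches only 'ann' and can use an index.
from_rows = [r[0] for r in conn.execute(
    "SELECT article_id FROM article_contributor "
    "WHERE contributor = 'ann' ORDER BY article_id"
)]
```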
By the way, 1 million rows is not a large MySQL database by some people's standards. This week I'm advising a client who has a table with 900 million rows.
An interesting question!
You're going to need to see the schema to get a straight answer about this. That's because the schema probably embodies some core decisions made by experts in bibliography (reference librarians, etc).
If you try to use a join table (articles_contributors) so you can avoid listing a given contributor multiple times when she contributes to multiple articles, you're implicitly declaring that you can create a canonical list of contributors, with a contributor_id for each distinct person.
In the world of bibliography and library science, that sort of list is called a "controlled vocabulary". It's controlled by an "authority." (Read this: http://en.wikipedia.org/wiki/Authority_control) That is, some organization has the responsibility to decide whether this "Jane Smaith" is a different person from that "Jane Smith." That is surprisingly hard to do correctly with people.
For an example of a relatively simple controlled vocabulary, see the "North American Industry Classification System" (NAICS). This has a code for each distinct kind of industry. http://www.census.gov/eos/www/naics/ It's controlled by national committees in three countries. Many bibliographic databases that cover industry include those terms as one of the ways of classifying their contents.
The designers of the system you're soon to take over will have made decisions about these kinds of controlled vocabularies. Will they have one for contributors? You could wait and see, or you could ask. But one thing is sure: the bibliographic designers won't be too delighted if you, on your own authority, create that kind of controlled vocabulary.
The Library of Congress in the USA doesn't attempt to create a controlled list of authors and contributors.
Edit
If you do have a definitive list of contributors, it is a good idea to create a join table articles_contributors as you suggested. You should consider the following columns:
article_id      primary key
contributor_id  primary key
role            primary key; values like "author", "illustrator", "editor", etc.
order           1, 2, 3, so contributors can be listed in proper order
contact         1 or 0, indicating whether readers should contact this author for more info
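As a sketch only, those columns might translate into a table like the one below. The three "primary key" columns are read here as one composite key, the order column is renamed sort_order because ORDER is a reserved word in SQL, and sqlite3 stands in for MySQL; all of these are my assumptions rather than part of the existing schema.

```python
import sqlite3

# Hypothetical rendering of the suggested columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles_contributors (
        article_id     INTEGER NOT NULL,
        contributor_id INTEGER NOT NULL,
        role           TEXT    NOT NULL,
        sort_order     INTEGER NOT NULL,
        contact        INTEGER NOT NULL DEFAULT 0 CHECK (contact IN (0, 1)),
        -- one row per (article, contributor, role) combination
        PRIMARY KEY (article_id, contributor_id, role)
    );
    INSERT INTO articles_contributors VALUES
        (1, 7, 'author',      1, 1),
        (1, 7, 'illustrator', 2, 0),   -- same person, second role
        (1, 9, 'author',      3, 0);
""")

# Repeating an existing (article, contributor, role) is refused.
try:
    conn.execute("INSERT INTO articles_contributors VALUES (1, 7, 'author', 4, 0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Credits for one article, in display order.
credits = conn.execute("""
    SELECT contributor_id, role
    FROM articles_contributors
    WHERE article_id = 1
    ORDER BY sort_order
""").fetchall()
```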
I have a hierarchical data structure which, as far as I can see, needs to have a series of successive many-to-many relationships.
It goes something like this:
Company
Account
Treaty
Benefit
Policy
Person
With the following relationships:
Company 1---∞ Account
Account 1---∞ Treaty
...all still fun
And then, many to many:
Treaty ∞---∞ Benefit, so I create the relational table TreatyBenefit, and do:
Treaty 1---∞ TreatyBenefit ∞---1 Benefit
Now, for a specific Treaty and a specific Benefit (i.e. a TreatyBenefit) there can be many Policies. But again, a single policy can also fall under multiple TreatyBenefits.
So, then I have TreatyBenefit 1---∞ TreatyBenefitPolicy ∞---1 Policy
And then of course, the same applies to Person, so I also then get:
TreatyBenefitPolicy 1---∞ TreatyBenefitPolicyPerson ∞---1 Person
What I would like to know is if there are any conventions for naming tables so that you can avoid names that become so long that they are essentially meaningless? Or are there better approaches to the design that avoids this kind of structure entirely?
Thanks
Karl
IMHO, unless there are other strong, widely accepted, meaningful business-centric names for these entities / concepts, I would stick with the trusted Many:Many mangles that you've described above.
Also, each of the 6 entity names you've listed is reasonably concise, so there seems little point in abbreviating them: e.g. Ben, Per, Pol, Acc, Co, etc. would cause more confusion than benefit.
In designing RDBMS schema, I wonder if there is formal principle of concrete objects: for example, if it is Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
But what about a table such as Courses (as in school). It can have a description, number of units, offered only in Autumn (Fall) or Spring, etc, which are the "general properties" of a course.
And then there is actual CourseSessions, which has information about the time_from and time_to (such as 10 to 11am), whether it is Monday, Wednesday or Tue / Thur, and the instructor teaching it, and also pointing back using a course_id to the Courses table.
So the above 2 tables are both needed.
Are there principles of table design for "concrete" vs "abstract"?
Update: what I mean by "abstract" here is that a course is an abstract idea... there can be multiple instances of it... such as the course Physics 10 from 10-11am, and another at 12-1pm.
for example, if it is Persons table, then each record is very concrete and unique. Each record in fact represents a unique person.
That is the hope, but not the reality of the situation.
By immigration or legal death status, it is possible for there to be two (or more records) that represent the same person. Uniquely identifying people is difficult - first, middle and surnames can match but actually reflect different people. SSN/SIN are not reliable, because they can change (immigration, legally dead). A name doesn't guarantee gender, and gender can be changed.
Are there principles of table design for "concrete" vs "abstract"
The classification of being "concrete" vs "abstract" is arbitrary, subject to interpretation. Does the start and end date really make a Course session "concrete"? Because I can book numerous things in [Calendaring software of choice] - doesn't mean class actually took place, or that final grades are legitimate values...
Table design is based on business rules, and the logical entities (which can become tables in the physical model) required to support those rules. Normalization helps make these entities more obvious.
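For what it's worth, the two entities from the question could be sketched as below: one table for the general properties of a course, and one for each scheduled session pointing back at it. The column names beyond those mentioned in the question, and the sample values, are assumptions; sqlite3 stands in for whatever RDBMS is used.

```python
import sqlite3

# Hypothetical columns and data for the Courses / CourseSessions pair.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE courses (
        course_id INTEGER PRIMARY KEY,
        name      TEXT    NOT NULL,
        units     INTEGER NOT NULL
    );
    CREATE TABLE course_sessions (
        session_id INTEGER PRIMARY KEY,
        course_id  INTEGER NOT NULL REFERENCES courses(course_id),
        days       TEXT NOT NULL,
        time_from  TEXT NOT NULL,
        time_to    TEXT NOT NULL,
        instructor TEXT NOT NULL
    );
    INSERT INTO courses VALUES (1, 'Physics 10', 4);
    INSERT INTO course_sessions VALUES
        (1, 1, 'Mon/Wed', '10:00', '11:00', 'Instructor A'),
        (2, 1, 'Tue/Thu', '12:00', '13:00', 'Instructor B');
""")

# One course row, many session rows.
sessions = conn.execute("""
    SELECT c.name, s.days, s.time_from
    FROM courses AS c
    JOIN course_sessions AS s ON s.course_id = c.course_id
    ORDER BY s.session_id
""").fetchall()
```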
The relational data model, based on mathematics, provides a way to design your data model so that certain operations are guaranteed to be correct.
Unfortunately, this kind of data model does not by itself solve performance issues in a database. How to organize the tables for a given business domain depends not only on the abstract model of the objects or on database normalization, but also on performance planning for your system. Yes, the abstraction leaks.
For example, there are two design strategies for tree structures: the adjacency model and the materialized path model (The Art of SQL). Which one is better depends on which operations need to be optimized.
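A rough sketch of the two strategies on the same small tree follows. The table names and the dotted path format are my own assumptions, not taken from the book, and sqlite3 is used so the example runs as-is (MySQL 8.0 and later also accept WITH RECURSIVE). Fetching a subtree needs a recursive query in the adjacency model but only a prefix match in the materialized path model.

```python
import sqlite3

# Same tree stored twice: root -> (a -> a1), b. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Adjacency model: each row points at its parent
    CREATE TABLE node_adj (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node_adj(id),
        name      TEXT NOT NULL
    );
    -- Materialized path model: each row stores its full path from the root
    CREATE TABLE node_path (
        path TEXT PRIMARY KEY,
        name TEXT NOT NULL
    );
    INSERT INTO node_adj VALUES
        (1, NULL, 'root'), (2, 1, 'a'), (3, 1, 'b'), (4, 2, 'a1');
    INSERT INTO node_path VALUES
        ('1', 'root'), ('1.2', 'a'), ('1.3', 'b'), ('1.2.4', 'a1');
""")

# Subtree of 'a' in the adjacency model: walk down level by level.
adj_subtree = [r[0] for r in conn.execute("""
    WITH RECURSIVE sub(id, name) AS (
        SELECT id, name FROM node_adj WHERE id = 2
        UNION ALL
        SELECT n.id, n.name
        FROM node_adj AS n
        JOIN sub AS s ON n.parent_id = s.id
    )
    SELECT name FROM sub ORDER BY id
""")]

# Subtree of 'a' in the materialized path model: a prefix match.
path_subtree = [r[0] for r in conn.execute("""
    SELECT name FROM node_path
    WHERE path = '1.2' OR path LIKE '1.2.%'
    ORDER BY path
""")]
```

The trade-off runs the other way for updates: moving a subtree is a single parent_id change in the adjacency model, while the materialized path model has to rewrite the path of every descendant.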
There is a good and classical article I recommend: The Law of Leaky Abstractions
Abstraction has its price (& it is often higher than expected)
By Keith Cooper
The art of SQL, of course, the soul of database design in my opinion.