Improve database design for online exam - mysql

I am working on a multiple-choice online test project. I have designed a database to store the results, but I am looking for a more optimized way.
Requirements:
Every question has four options.
Only one option can be selected, and that choice needs to be stored in the database.
My design:
tables:
students
stud_id, name, email
tests
test_id, testname, duration
questions
que_id, question, opt1, opt2, opt3, opt4, answer, test_id
answers
stud_id, que_id, answer
This way the answers can be stored, but it increases the number of records, as for every question solved by a student a new record will be added to the answers table.
e.g.
One test consists of 100 questions and 1,000 students take that test; for every student there will be 100 records (one for each question), and for 1,000 students, 100k records.
Is there any better way to do this, where the number of records will be fewer?

Initial Response
Understanding the Data
You have done good work. As far as the data is concerned, the design is correct, but incomplete. There are two errors:
opt1…opt4 is a repeating group, which breaks 1NF. It must be placed in a separate table.
Further, there seems to be no option name or descriptor, which is strange (what do you paint on the page, next to each radio button?).
With the options in a separate table: if you ever add a fifth option, that is now catered for; if you have questions with fewer than four options, that is now catered for.
Conversely, with a fixed set of columns, if there are any such changes in the future, you have to change both the database and the existing code. And the code will be horrendous (extra processing instead of direct SELECTs).
Your answers table has no integrity. As it stands, answers can be recorded against a question that the student was not asked, or for a test that the student did not sit. Prevention of that type of error is ordinary fare in a Relational Database, and it is not possible in a Record Filing System.
In these dark days of IT, this is a common trend. People focus on the data values; they imagine the values in spreadsheet form, and they go directly to implementing objects that contain those values, instead of understanding the data and what it means.
answers(stud_id, que_id, answer) has no meaning, no integrity, unless the context of a student_test is asserted.
The third item is not an error, because you did not give it as a requirement. However, it seems to me that a question can be used in more than one test. The way you have set it up, such questions will be duplicated (the whole point of a database is to Normalise it, such that there is no duplication).
Of course, the consequence is an Associative Table, test_question.
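As a rough MySQL sketch of those corrections (a sketch only: it keeps your id columns for brevity, although short codes are recommended later in this answer, and it assumes InnoDB so the FOREIGN KEYs are enforced):

CREATE TABLE test (
    test_id   INT          NOT NULL,
    testname  VARCHAR(100) NOT NULL,
    duration  INT          NOT NULL,  -- minutes assumed
    PRIMARY KEY (test_id)
);

CREATE TABLE question (
    question_id  INT          NOT NULL,
    question     VARCHAR(500) NOT NULL,
    PRIMARY KEY (question_id)
);

-- The repeating group opt1..opt4 becomes rows, one per option.
CREATE TABLE question_option (
    question_id  INT          NOT NULL,
    option_no    TINYINT      NOT NULL,  -- 1..n, no longer fixed at four
    option_text  VARCHAR(200) NOT NULL,
    is_correct   BOOLEAN      NOT NULL DEFAULT FALSE,
    PRIMARY KEY (question_id, option_no),
    FOREIGN KEY (question_id) REFERENCES question (question_id)
);

-- Associative table: one question can appear on many tests.
CREATE TABLE test_question (
    test_id      INT NOT NULL,
    question_id  INT NOT NULL,
    PRIMARY KEY (test_id, question_id),
    FOREIGN KEY (test_id)     REFERENCES test (test_id),
    FOREIGN KEY (question_id) REFERENCES question (question_id)
);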
Questions
This way the answers can be stored, but it increases the number of records, as for every question solved by a student a new record will be added to the answers table.
Yes. That is normal for a database.
Is there any better way to do this, where the number of records will be fewer?
For a Record Filing System, yes. For a database, no. Since you have tagged your question as database-design, I will assume that that is what you want.
A database is a collection of facts, not of records with related fields. The facts are about the real world, limited to the scope of the database and app.
It is important to determine the discrete facts that we need, because subordinate facts depend on higher-order facts. That is database design. And we Normalise the data, as we progress, as part of one and the same exercise. Normalisation has the purpose of eliminating duplication, otherwise you have Update Anomalies. And we determine Relational Keys, as we progress, again as part of one and the same exercise. Relational Keys provide the logical structure of a Relational database, ie. the logical integrity.
e.g. One test consists of 100 questions and 1,000 students take that test; for every student there will be 100 records (one for each question), and for 1,000 students, 100k records.
Yes. But that is expressed in ISAM record-processing terms. In database terms, you cannot get around the fact that the database stores:
facts about 100 questions
facts about 1,000 students
facts about 1,000 students times the 100 choices they made
You need to get your head around two things: the large number of discrete facts; and the use of compound Keys. Both are essential to Relational databases. If either of those are missing, or you implement them with reluctance, you will not have the integrity, power, or speed of a Relational database, you will have a pre-1970's ISAM Record Filing System.
Further, the SQL platforms, and to some degree the non-SQL platforms such as MySQL, are heavily optimised for processing sets of data (not record-by-record); heavy I/O and caching; etc. If you implement the structures required for high concurrency, you will obtain even more performance.
Implementation
As far as the implementation is concerned, and particularly since you are concerned about performance, there are errors. To restate: the implementation should not be attempted until the data is understood and modelled correctly.
The problem, across the board, is that you have added a surrogate (there is no such thing as a "surrogate key"; it is simply a surrogate, a physical record id). It is far too early in the modelling exercise to add surrogates: the exercise hasn't progressed enough, and the model is not stable.
Surrogates are always an additional column plus the underlying index. Obviously that consumes resources, and has a cost on inserts and deletes.
Surrogates do not provide row uniqueness, which is demanded in a relational database.
The Relational Model demands that Keys are made up from the data. Relational Keys provide row uniqueness.
A surrogate isn't made up from the data. Therefore it is not a Relational Key, and it does not provide any of the qualities of one.
If a surrogate is used, it does not replace the Key, it is in addition to the Key. Which is why we evaluate the need for surrogates after, not before, modelling the data. It is an implementation concern, not a modelling one.
Solution
Rather than going back and forth, let me provide the proposal, and you can discuss it.
Student Test Data Model (Page 1 only, for those following the progression).
If you are not used to the Notation, please be advised that every little tick, notch, and mark, the solid vs dashed lines, the square vs round corners, means something very specific. Refer to the IDEF1X Notation.
For test and question, I have left the id columns in, but note that you will be much better off with short, meaningful codes.
student_id is valid because both name and email are too large to migrate to the child tables.
Please check the Verb Phrases carefully, they comprise a set of Predicates. The remainder of the Predicates can be determined directly from the model. If this is not clear, please ask.
See if you can determine that this is a collection of facts, and each fact is discrete precisely because other facts depend on it; that it is not a collection of records with fields that are related.
Your answers table has no integrity. As it stands, answers can be recorded against a question that the student was not asked, or for a test that the student did not sit. Prevention of that type of error is ordinary fare in a Relational Database, and it is not possible in a Record Filing System.
That is now prevented. The answers table, now named student_response, now has some integrity. A student is registered for a test in student_test, and the student_responses are constrained to student_test.
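A rough MySQL sketch of that constraint, continuing the sketch above (column names and types are assumptions; the model itself uses compound Keys throughout):

CREATE TABLE student (
    student_id  INT          NOT NULL,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(100) NOT NULL,
    PRIMARY KEY (student_id)
);

CREATE TABLE student_test (
    student_id  INT NOT NULL,
    test_id     INT NOT NULL,
    PRIMARY KEY (student_id, test_id),
    FOREIGN KEY (student_id) REFERENCES student (student_id),
    FOREIGN KEY (test_id)    REFERENCES test (test_id)
);

CREATE TABLE student_response (
    student_id   INT     NOT NULL,
    test_id      INT     NOT NULL,
    question_id  INT     NOT NULL,
    option_no    TINYINT NOT NULL,
    PRIMARY KEY (student_id, test_id, question_id),
    -- This compound FK guarantees the student actually sat the test ...
    FOREIGN KEY (student_id, test_id)
        REFERENCES student_test (student_id, test_id),
    -- ... this one guarantees the question was on that test ...
    FOREIGN KEY (test_id, question_id)
        REFERENCES test_question (test_id, question_id),
    -- ... and this one guarantees a valid option for that question.
    FOREIGN KEY (question_id, option_no)
        REFERENCES question_option (question_id, option_no)
);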
Please comment/discuss.
Response to Comments
I will add an additional table subject (subject_id, subject_name) and add that subject_id to the question table as an FK. Is this okay?
Yes, by all means. But that has consequences. Some advice to make sure we do that properly, across the board:
As explained, do not use surrogates (Record IDs) unless you absolutely have to. Short Codes are much better for Identifiers, for both users and developers.
If you would like more info on the problems related to ID columns, read this Answer.
Subject is important. It is the context in which (a) a question exists, and (b) a test exists. They did exist as independent items (page 1 of the DM), but now they are subordinate to subject. The addition substantially improves data integrity.
The fact of a student registration and the fact of a student sitting for a test, are discrete and separate facts.
Happily, that eliminated two surrogates, question_id and test_id. Short codes such as CHAR(2) are easier and more meaningful.
Note the improvement in the table names: improved clarity.
I have updated the Student Test Data Model (Page 2 only, for those following the progression).
However, that exposes something (that is why we model data, paper is cheap, many drafts are normal). If we evaluate the Predicates (readily visible in the Data Model, as detailed in the IDEF1X Notation document):
each subject_test was taken by 0-to-n student_tests
each student_test is [a taking of] 1 subject_test
each student took 0-to-n student_tests
each student_test is taken by 1 student
those Predicates are not accurate. A student can sit for a test in any subject. Given the new subject table, I would think that we want students to be registered for subjects, and therefore student_test to be constrained to subjects that the student is registered for.
If you would like more information on the important Relational concept of Predicates, and how they are used to both understand and verify the model, visit this Answer, scroll down until you find the Predicate section, and read that carefully.
I have updated the Student Test Data Model (Page 3). Now we have even more integrity, such that student_test is constrained to subjects that the student is registered for. The relevant Predicates are:
each student registered for 0-to-n student_subjects
each student_subject is a registration of 1 student
each subject attracted 0-to-n student_subjects
each student_subject is an attraction of 1 subject
each subject_test was taken by 0-to-n student_tests
each student_test is [a taking of] 1 subject_test
each student_subject took 0-to-n student_tests
each student_test is taken by 1 student_subject
Now the data model appears to be complete.
Context is everything in a database.
The data hierarchies are plainly visible in the compounding of the Keys.
Notice that it is the Relational Keys, in the child tables, that provide Relational Integrity with the parent tables, to every higher level (parent, grandparent) in the hierarchy.
In case it is not obvious, notice the power of Relational Joins. Something you cannot do with Record Filing Systems that have ID fields in every File. Eg:
- Join `student_response` directly to `subject` on `subject_code`, without having to navigate the two levels in-between
- Join `student_response` directly to `student` on `student_id`, without having to navigate the two levels in-between
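For example (sketches only; the column names are assumed from the Keys shown in the model):

-- student_response to subject, with no intermediate tables:
SELECT sr.student_id, s.name AS subject_name
FROM   student_response sr
JOIN   subject s ON s.subject_code = sr.subject_code;

-- student_response straight to student, skipping student_subject
-- and student_test:
SELECT st.name, COUNT(*) AS responses
FROM   student_response sr
JOIN   student st ON st.student_id = sr.student_id
GROUP BY st.name;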

No, there is no better design, because the design has nothing to do with how many records will be in the tables. You will choose the same design, no matter whether you deal with ten students or ten thousand.
Your table design looks good. Don't worry about the number of records. A DBMS is made to deal with large tables, and 100k records is still a small database. I wouldn't even change this design if there were billions of answers to store.

If you want to normalize the data, then I'd create the tables a little differently.
Your Student table looks fine. Generally, I use a singular name for tables, rather than plural.
Student
-------
Student ID
Name
Email
...
Here's the Test table:
Test
----
Test ID
Test Name
...
We tie students to tests with a junction table.
StudentTest
-----------
Student ID
Test ID
Test Started Timestamp
Test Duration
...
The time and length of the test vary from student to student, so those columns are included on the StudentTest table.
The Question table.
Question
--------
Question ID
Question Text
And the Answer table.
Answer
------
Answer ID
Answer Text
Now here's where things get tricky. You could assign questions to a test based on the ID, like this.
TestQuestion
------------
Test ID
Question ID
But if you do that, and someone changes the Question text after the test, then the Question ID is pointing to a different question than the question on the test.
To solve this problem, we create history tables like this:
QuestionHistory
---------------
QuestionHistory ID
Question Text
AnswerHistory
-------------
AnswerHistory ID
Answer Text
So, we create the TestQuestion table like this:
TestQuestion
------------
Test ID
QuestionHistory ID
And copy the questions as well as the answers to the history tables.
For similar reasons, we create the QuestionAnswer table like this:
QuestionAnswer
--------------
QuestionHistory ID
AnswerHistory ID
Is Correct Answer
Your code could make sure that each question has 4 possible answers. The database allows for more or fewer than 4 possible answers.
Finally, we tie the student's answers to the test questions.
StudentQuestionAnswer
---------------------
Student ID
Test ID
QuestionHistory ID
AnswerHistory ID
Is Correct Answer
Yes, the Test ID column is duplicated here. This is so you can query by the test as well as the student that took the test.
The Is Correct Answer field has a different meaning in the QuestionAnswer table and the StudentQuestionAnswer table. In the QuestionAnswer table, the Is Correct Answer boolean points to the correct answer. In the StudentQuestionAnswer table, the Is Correct Answer boolean signifies that the student answered the question correctly.
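A minimal MySQL sketch of those two tables (the types are assumptions, Is Correct Answer is rendered as BOOLEAN, and the FKs to the Student/Test tables are omitted for brevity):

CREATE TABLE QuestionHistory (
    QuestionHistoryID INT AUTO_INCREMENT PRIMARY KEY,
    QuestionText      TEXT NOT NULL
);

CREATE TABLE AnswerHistory (
    AnswerHistoryID INT AUTO_INCREMENT PRIMARY KEY,
    AnswerText      TEXT NOT NULL
);

CREATE TABLE QuestionAnswer (
    QuestionHistoryID INT     NOT NULL,
    AnswerHistoryID   INT     NOT NULL,
    IsCorrectAnswer   BOOLEAN NOT NULL,  -- marks which option is right
    PRIMARY KEY (QuestionHistoryID, AnswerHistoryID),
    FOREIGN KEY (QuestionHistoryID) REFERENCES QuestionHistory (QuestionHistoryID),
    FOREIGN KEY (AnswerHistoryID)   REFERENCES AnswerHistory (AnswerHistoryID)
);

CREATE TABLE StudentQuestionAnswer (
    StudentID         INT     NOT NULL,
    TestID            INT     NOT NULL,
    QuestionHistoryID INT     NOT NULL,
    AnswerHistoryID   INT     NOT NULL,
    IsCorrectAnswer   BOOLEAN NOT NULL,  -- here: the student got it right
    PRIMARY KEY (StudentID, TestID, QuestionHistoryID),
    FOREIGN KEY (QuestionHistoryID, AnswerHistoryID)
        REFERENCES QuestionAnswer (QuestionHistoryID, AnswerHistoryID)
);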
This should be a complete question / answer database. You could tie tests to courses if you want.

You can store the answer details as a ~-separated record against the corresponding question ids, which are also ~-separated. This way there will be only one record per student id. You can also decode the answer for a particular question id.

Related

database design: X:X to 1:many vs X:X to 0:many

(Edited 1/5 10:22: added some explanation about my notation, and added some additional information I received.)
I am doing a course on database design and currently we're doing ERDs and designing databases in MySQL Workbench. Think 1st, 2nd and 3rd NF, creating schemas, tables, constraints, etc.
Most of it is pretty clear to me.
However, there's one aspect where things remain unclear: the X:X to 1:many relationship vs the X:X to 0:many relationship (meaning: whatever to 0:many, vs whatever to 1:many, etc.).
In some cases it's obvious, in others not so much. Whenever it's unclear to me, it's mostly something like this:
Example :
an artist has 1 to many paintings. A painting has 1-and only 1 artist.
Relationship:
|artist| 1:1 -------- 1:many |painting|
the same in another notation
|artist| ||------------ 1< |painting|
This seems fair, but... then there's the thought: I could be a new artist, not having produced a painting yet.
Or: I could be entering a new artist into the artist table, not yet having entered his paintings (which could lead to a practical issue).
Another example:
A workshop has 1 to many participants. A participant enters 0-to-many workshops.
Relationship:
|workshop| many:0 ------- 1:many |participant|
Okay. However: a workshop could have 0 participants (no one wants to participate, probably leading to cancellation).
Or: I could be entering a new workshop into a table, not having added any participants yet.
Another example:
An event is held at 1 and only 1 location. A location has 1 to many events.
Relationship: |event| many:1 -------- 1:1 |location|
However, maybe you're entering a new (future) location, and there have not been events there yet.
Long story short: I am having a hard time establishing the minimal cardinality in cases like the above.
Also, when I'm designing a db and get Workbench to forward engineer the SQL for creating the tables (based on my ERD), there doesn't seem to be any difference between an X to 1/many vs an X to 0/many variant. Which makes me wonder: what's the actual (practical) effect or implication of doing one or the other? Maybe the implications (further down the road) make for an easier choice?
Can anyone explain this matter in a simple (fool-proof) way?
Looking forward to understanding!
Addition 1/5:
I've talked about my question/issue with a teacher. He agreed with me that certain minimum cardinalities could lead to a deadlock:
one table cannot be inserted into without there being an occurrence in the other, and vice versa.
He explained to me that the ERD is a logical model, not per se a physical model. In other words, the ERD's minimum cardinality is not necessarily meant for technical implementation.
Well, if that is the case, I understand his point. Usually an artist has at least one painting. A workshop normally has at least one participant. A location usually has at least one event. So on a logical level, that seems fine.
On a technical/implementation level, it is another matter. You should be able to enter an artist, workshop or location without there already being occurrences in another table.
My question now is:
is this true? Is an ERD a logical model, not a technical model?
and if that is so, WHAT is the reason for adding the minimum cardinality? It seems of little use.
Let's continue with your artist::paintings relationship; I think that's the clearest. When we say 1 to many, we often mean 0/1 to 0/many, but not all 4 permutations work or are meaningful.
How does "I can have no artists with no paintings" sound? That's the zero-to-zero permutation. It does not sound wrong, but it's a degenerate case that is of no use to us.
1 artist to 0 paintings is OK, as matbailie described, and presents no problems.
1 artist to many paintings is OK, and is the main use case.
0 artists to many paintings is just not correct and should not be supported in the model. A painting must have an artist.
Because cases 1 & 4 don't work, it is not really correct to say 0/1 to 0/many. It is more correct to characterize the cardinality as 1 to 0/many, which encompasses cases 2 & 3 above.
You might say, 'but there are cases where we don't know the artist so we should leave an opportunity to have paintings with no artists'. This statement feels to me like you are leaving the realm of the ERD and entering into physical design. From a design standpoint you could just as easily say there is an artist we just don't know who, so let's create an UNKNOWN ARTIST record and connect those paintings to it.
Your second example (workshops::participants) looks more like a 0/many to 0/many. If you run through the same 4 permutations, they all look credible, though the 0-to-0 case still seems kind of ludicrous.
Your last example is another 1 to 0/many because events without locations cannot be held. When you get to the physical level you can talk about the best way to handle virtual events.
So, none of your examples seem to show a true zero-to-many (which is more accurately stated 0/1 to 0/many). I'm thinking they are pretty rare, if they exist at all. It would have to be associated with some kind of optional activity where if you did enroll in it there were a constrained set of choices.
But... A participant can be in 0-N workshops. So that needs to be many-to-many.
In general "zero" is a degenerate case of "1" or "N"; it is rarely worth worrying about.
I like to start by identifying the "entities" in the model. Your participant, workshop, painting, event, artist, location are excellent examples of such. Then I look for "relations" between obvious pairs.
I say there are only 3 cases. And I like to remember that the two "entities" are manifested as two "database tables":
1:1 -- At which point I ask why the two entities (database tables) are not merged together. The two tables share the same unique key.
1:many -- Represented by the id of the "1" as a column in the "many" table.
Many:many -- This needs a link table between the two tables. This linking table has two ids.
"Id" means a unique key for a table. It is usually a number, but that is not a requirement.
For "0"...
1:0 or many:0 -- You may need a LEFT JOIN to provide NULLs when the entry on the "0" side is missing
Many:many -- If either id is non-existent (the zero case), then there are no rows for that relationship.
Then comes defining the INDEXes for efficient access. And, optionally, FOREIGN KEYs for integrity. The indexes that represent how two entities are related are prime candidates for FKs. Other INDEXes should be added to optimize WHERE clauses in SQL queries.
In all cases, the id/FK/index may be "composite" -- meaning that it is two or more columns that are used for a single id/FK/index.
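A short MySQL illustration of the three cases and the zero handling (names and types are illustrative only):

CREATE TABLE artist (
    artist_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

-- 1:many -- the id of the "1" side is a column in the "many" table.
CREATE TABLE painting (
    painting_id INT PRIMARY KEY,
    artist_id   INT NOT NULL,  -- NOT NULL: every painting has an artist
    FOREIGN KEY (artist_id) REFERENCES artist (artist_id)
);

CREATE TABLE workshop    (workshop_id    INT PRIMARY KEY);
CREATE TABLE participant (participant_id INT PRIMARY KEY);

-- many:many -- a link table holding the two ids.
CREATE TABLE workshop_participant (
    workshop_id    INT NOT NULL,
    participant_id INT NOT NULL,
    PRIMARY KEY (workshop_id, participant_id),
    FOREIGN KEY (workshop_id)    REFERENCES workshop (workshop_id),
    FOREIGN KEY (participant_id) REFERENCES participant (participant_id)
);

-- The "0" case: a LEFT JOIN supplies NULLs for artists with no paintings yet.
SELECT a.name, p.painting_id
FROM   artist a
LEFT JOIN painting p ON p.artist_id = a.artist_id;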

"Merging" Multiple Database Tables [closed]

I've read through multiple questions here on SO regarding merging multiple databases into one; however, they primarily deal with uniform schemas/tables. My apologies if I'm repeating a question.
I have an assortment of database tables that are all similar, but not identical. For example, imagine ten databases with ten "User" tables. All contain a userid (we'll use this for reference). Most contain username and email columns. Some will contain other columns, such as skype, msn, phone, etc., that exist in only a few of the other tables, or in none of them.
I want to merge this content into one database, with the prerequisite that, moving forward, the possibility of additional databases also containing unique columns will also need to be merged into the new database.
I've been looking at EAV tables, and was considering something along the lines of (continuing with the example above) a master user table with a newly-assigned user id (id), an originating-database reference of some type (database_id), and the originating user id (native_user_id). I'd then have a separate properties table with a primary key (id), an entity key (user_id), an attribute column (attribute), and a value column (value).
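In MySQL terms, the layout described above would look roughly like this (a sketch; the table names are hypothetical):

CREATE TABLE merged_user (
    id             INT AUTO_INCREMENT PRIMARY KEY,  -- newly-assigned user id
    database_id    INT NOT NULL,                    -- originating database
    native_user_id INT NOT NULL,                    -- user id in that database
    UNIQUE (database_id, native_user_id)
);

CREATE TABLE user_property (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    user_id   INT          NOT NULL,         -- entity key
    attribute VARCHAR(64)  NOT NULL,         -- e.g. 'skype', 'msn', 'phone'
    value     VARCHAR(255) NOT NULL,
    UNIQUE (user_id, attribute),
    FOREIGN KEY (user_id) REFERENCES merged_user (id)
);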
The issue at hand is that almost everything I've read recommends against EAV tables while implying there are better ways to go about this. However, I've not actually found any material that covers what this method would be.
So, my questions:
Are EAV Tables really that bad?
What practical major downfalls that I should plan ahead for should I go the EAV table route (any examples of personal experience would be swell)?
What alternatives exist for handling this type of scenario besides EAV tables (while accommodating future attributes without tedious ALTER TABLE commands)?
I used EAV in a project to address requirements similar to yours: lack of a universal data model in the messy real world.
In my case, EAV allowed incremental change as the company grew by acquisition, which in turn caused continual expansion, refinement, or generalization of the data model. The project ultimately failed because management withdrew support for it.
I learned that EAV presents itself to management and users as needlessly complex unless you do the work to create concise views to hide the complexity while preserving the completeness of the data. I also learned that EAV imposes a demand to fill in the "missing answers" in a meaningful way. It isn't enough to say that every answer to a question that wasn't asked in database X is "NULL". Sometimes that is not the right answer. "NULL" becomes a synonym for "I don't know; the attribute didn't exist in this database so no-one ever decided what the value should be".
This is a fairly broad question, eh?
If you have your tables already in SQL I suggest you try experimenting with this sort of UNION ALL query.
SELECT 'one'  AS dbid,
       id     AS id,
       first  AS first_name,
       last   AS last_name
FROM   first_table
UNION ALL
SELECT 'two'      AS dbid,
       member_id  AS id,
       fname      AS first_name,
       lname      AS last_name
FROM   members
Etcetera. The idea is to use a UNION ALL query to try to coerce your various sources of information into a single result set, and figure out which of your values from those various sources are somehow conformable. If the lion's share of your data is conformable -- that is, you can simply move it over into appropriate columns in your new tables, you'll avoid the worst pitfalls of EAV storage.
Once you have done that, you can use EAV style storage for your remaining information.
I hope this helps you plan this migration a bit.

Designing a database : Which is the better approach? [closed]

I am designing a database and am wondering which approach should I use. I am going to describe the database I intend to design and the possible approaches that I can use to store the data in the tables.
Please recommend which approach I should use and why?
About the data:
A) I have seven attributes that need to be taken care of. These are just examples and not the actual ones I intend to store. Let me call them:
1) Name
2) DOB (modified: I had earlier put in Age here)
3) Gender
4) Marital Status
5) Salary
6) Mother Tongue
7) Father's Name
B) There will be a minimum of 10000 rows in the table and they can go up from there in the long term
C) The number of attributes can change over the period of time. That is, new attributes can be added to the existing dataset. No attributes will ever be removed.
Approach 1
Create a table with 7 attributes and store the data as it is. Add new columns if and when new attributes need to be added.
Pro: Easier to read the data and information is well organized
Con: There can be a lot of null values in certain rows for certain attributes for which values are unknown.
Approach 2
Create a table with 3 attributes. Let them be called:
1) Attr_Name: stores the attribute name, e.g. name, age, gender, etc.
2) Attr_Value: stores the value for the above attribute, e.g. Tom, 25, Male
3) Unique ID: uniquely identifies the (Name, Value) pair in the database, e.g. SSN
So, in approach 2, in case new attributes need to be added for certain rows, we can just add them to the hashmap we have created without worrying about null values.
Pro: Hashmap structure. Eliminates nulls.
Con: Data is not easy to read. Information cannot be easily grasped.
C) The Question
Which is the better approach?
I feel that approach 1 is the better approach, because it's not too tough to handle null values, the data is well organized, and it's easy to grasp this kind of data. Please suggest which approach I should use and why?
Thanks!
This is a typical narrow table (attribute based) vs. wide table discussion. The problem with approach #2 is that you are probably going to have to pivot the data, to get it into a form the user can work with (back into a wide view format). This can be very resource intensive as the number of rows grows, and as the number of attributes grows. It's also hard to look at the table, in raw table view, and see what's going on.
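To make "pivot" concrete, this is the kind of query a narrow table forces on you, sketched against a hypothetical person_attribute (person_id, attr_name, attr_value) table:

SELECT person_id,
       MAX(CASE WHEN attr_name = 'Name'   THEN attr_value END) AS name,
       MAX(CASE WHEN attr_name = 'DOB'    THEN attr_value END) AS dob,
       MAX(CASE WHEN attr_name = 'Gender' THEN attr_value END) AS gender
FROM   person_attribute
GROUP BY person_id;

-- With the wide table of Approach 1, the same result is simply:
-- SELECT person_id, name, dob, gender FROM person;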
We have had this discussion many times at our company. We have some tables that lend themselves very well to an attribute-type schema. We've always decided against it because of the necessity to pivot the data and the inability to view the data and have it make sense (but this is the lesser of the two problems for us - we just don't want to pivot millions of rows of data).
BTW, I wouldn't store age as a number; I would store the birth date, if you have it. Also, I don't know what 'Mother Tongue' refers to, but if it's the language the mother speaks, I would store this as an FK to a master language table. It's more efficient, and it lessens the problem of bad data from a misspelled language.
Your second option is one of the worst design mistakes you can make. This should only be done when you have hundreds of attributes that change constantly and are in no way the same from object to object (such as medical lab tests). If you need to do that, then do not under any circumstances use a relational database to do it. NoSQL databases handle EAV designs far better than relational ones.
Another problem with design 2 is that it becomes almost impossible to have good data integrity, as you cannot correctly enforce FKs and data types, or add constraints to the data. Since integrity should never be enforced only in the application (things other than the application often affect the data), this factor alone is enough to make your second idea foolish and foolhardy.
The first design will perform better in general. It will be easier to write queries, and it will force you to think about what needs to change when you add an attribute (this is a plus, not a minus), instead of having to design to always show all attributes whether you need them or not. If you would have a lot of nulls, then add a related table rather than more columns (you can have one-to-one related tables). Usually in this case you might have something that you know only a subset of the records will have, and they often fall into groupings by subject fairly naturally. For instance, you might have general people-related attributes (name, phone, email, address) that belong in one table. Then you might have student-related attributes that belong in a separate table and teacher-related attributes that belong in a third table. Or you might have things you need for all insurance policies, and separate tables for vehicle insurance, health insurance, house insurance and life insurance.
There is a third design possibility. If you have a set of attributes you know up front then put them in one table and have an EAV table only for attributes that cannot be determined at design time. This is the common pattern when the application wants to have the flexibility for the user to add customer specific data fields.
I don't think anyone can really determine which one is better immediately, but here are a couple of things to think about:
Do you have sample data? If yes, then see if there will be a lot of nulls; if there are not, then just go with option 1
Do you have a good sense of how the attributes will grow? For instance, looking at the attributes you listed above, you may not know all of them, but they all do exist - so in theory you could fill the table. If you will have a lot of sparse data, then #2 may work
When you do get new types of data can you group it into another table and use a foreign key? For instance if you want to capture the address you could always have an address table that references your initial table
What type of queries do you plan on using? It's much harder to query a key-value table than a "normal one" (not super hard, just harder - if you're comfortable using implied joins and the like to normalize the data then it's probably not a big deal).
Overall I'd be really careful before you implemented #2 - I've done it for certain specialized cases (metrics gathering where I have dozens of different metrics and don't really want to maintain dozens of different tables) but in general it's more trouble than it's worth.
For something like this I'd just create one table, and either add columns as you go along, or just create new tables for new data structures if necessary.

What is the most efficient way to store a list in a relational database?

I have read many strong statements here and elsewhere on the subject of storing arrays in MySQL. The rules of normalization seem to suggest it's a bad idea, and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on, it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone in this position wrongly thinks, but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1
student1 course 2
student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternately, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1,course 2, course n).
There is a table for student performance, to record every course taken and the metrics associated with student performance. It will be queried to report on student performance, quality of course, etc., but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they log in, just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible for is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea - for example, new course_name, pretest_score, posttest_score, time_to_complete, ... Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question: is it better to store an "inelegant" arrayed list of eligible and completed courses in a registered student table, or to dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
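For instance, with a course table and a student_course junction table (hypothetical names; a completed flag is assumed), the login-time list is a single indexed join:

SELECT c.course_id, c.course_name
FROM   student_course sc
JOIN   course c ON c.course_id = sc.course_id
WHERE  sc.student_id = 42     -- the logged-in student
  AND  sc.completed  = 1;     -- drop this line to list eligible courses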
This is a bad idea for two obvious reasons:
DBMS can't enforce proper referential* (and possibly domain) integrity, and relying on application-level integrity is almost always a bad idea.
While the database will be able to answer the query: "based on given student, give me courses", you won't be able to (efficiently) go in the opposite direction, should you ever need to.
* What's to stop a buggy application from storing a non-existent ID in the array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they log in, just to give them a list of completed courses.
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
When a student enrolls into course, insert a row in STUDENT_COURSE, set COMPLETED to 0 and leave performance fields NULL.
When the student completed the course, set COMPLETED to 1 and fill the performance fields.
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, which means that getting courses of the given student is extremely fast.
If you need to go in the opposite direction (get the students of a given course), add an index on the same fields but in the opposite order: {COURSE_ID, STUDENT_ID}. You might even consider a covering index in this case.
Since we are talking about a small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
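-- A sketch of the idea (the column types, and STUDENT_COURSE's Key of
-- {STUDENT_ID, COURSE_ID}, are assumptions):
CREATE TABLE COMPLETED_STUDENT_COURSE (
    STUDENT_ID INT NOT NULL,
    COURSE_ID  INT NOT NULL,
    PRIMARY KEY (STUDENT_ID, COURSE_ID),
    FOREIGN KEY (STUDENT_ID, COURSE_ID)
        REFERENCES STUDENT_COURSE (STUDENT_ID, COURSE_ID)
) ENGINE = InnoDB;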
The COMPLETED_STUDENT_COURSE is a B-Tree only for completed courses (and essentially a subset of STUDENT_COURSE which is a B-Tree for all enrolled courses).
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this. Perhaps your project may be such.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So, given both cases of arrays vs. relational tables, ask yourself if either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine, because you can retrieve it by a primary key like a student ID. But if you wanted to know how many students are on course A, the array method would be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find your answer. You can use MySQL's EXPLAIN, or just time it using your program that executes the queries. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about the strength of its ISAM engine. Then I had to work on a large application that involved millions of records, and there I noticed that each time a new record came in, indexes had to be rebuilt. So we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.

Schema design for when users can define fields

Greetings stackers,
I'm trying to come up with the best database schema for an application that lets users create surveys and present them to the public. There are a bunch of "standard" demographic fields that most surveys (but not all) will include, like First Name, Last Name, etc. And of course users can create an unlimited number of "custom" questions.
The first thing I thought of is something like this:
Survey
ID
SurveyName
SurveyQuestions
SurveyID
Question
Responses
SurveyID
SubmitTime
ResponseAnswers
SurveyID
Question
Answer
But that's going to suck every time I want to query data out. And it seems dangerously close to the Inner-Platform Effect.
An improvement would be to include as many fields as I can think of in advance in the responses table:
Responses
SurveyID
SubmitTime
FirstName
LastName
Birthdate
[...]
Then at least queries for data from these common columns is straightforward, and I can query, say, the average age of everyone who ever answered any survey where they gave their birthdate.
But it seems like this will complicate the code a bit. Now to see which questions are asked in a survey I have to check which common response fields are enabled (using, I guess, a bitfield in Survey) AND what's in the SurveyQuestions table. And I have to worry about special cases, like if someone tries to create a "custom" question that duplicates a "common" question in the Responses table.
Is this the best I can do? Am I missing something?
Your first schema is the better choice of the two. At this point, you shouldn't worry about performance problems. Worry about making a good, flexible, extensible design. There are all sorts of tricks you can do later to cache data and make queries faster. Using a less flexible database schema in order to solve a performance problem that may not even materialize is a bad decision.
Besides, many (perhaps most) survey results are only viewed periodically and by a small number of people (event organizers, administrators, etc.), so you won't constantly be querying the database for all of the results. And even if you were, the performance will be fine. You would probably paginate the results somehow anyway.
The first schema is much more flexible. You can, by default, include questions like name and address, but for anonymous surveys, you could simply not create them. If the survey creator wants to only view everyone's answers to three questions out of five hundred, that's a really simple SQL query. You could set up a cascading delete to automatically deleting responses and questions when a survey is deleted. Generating statistics will be much easier with this schema too.
Here is a slightly modified version of the schema you provided. I assume you can figure out what data types go where :-)
surveys
survey_id (index)
title
questions
question_id (index, auto increment)
survey_id (link to surveys->survey_id)
question
responses
response_id (index, auto increment)
survey_id (link to surveys->survey_id)
submit_time
answers
answer_id (index, auto increment)
question_id (link to questions->question_id)
answer
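A MySQL rendering of that outline might look like the following. Note one assumption: the outline doesn't show how an answer ties back to the response it belongs to, so a response_id column is added here.

CREATE TABLE surveys (
    survey_id INT AUTO_INCREMENT PRIMARY KEY,
    title     VARCHAR(255) NOT NULL
);

CREATE TABLE questions (
    question_id INT AUTO_INCREMENT PRIMARY KEY,
    survey_id   INT  NOT NULL,
    question    TEXT NOT NULL,
    FOREIGN KEY (survey_id) REFERENCES surveys (survey_id)
);

CREATE TABLE responses (
    response_id INT AUTO_INCREMENT PRIMARY KEY,
    survey_id   INT      NOT NULL,
    submit_time DATETIME NOT NULL,
    FOREIGN KEY (survey_id) REFERENCES surveys (survey_id)
);

CREATE TABLE answers (
    answer_id   INT AUTO_INCREMENT PRIMARY KEY,
    question_id INT  NOT NULL,
    response_id INT  NOT NULL,  -- assumed: ties the answer to one response
    answer      TEXT NOT NULL,
    FOREIGN KEY (question_id) REFERENCES questions (question_id),
    FOREIGN KEY (response_id) REFERENCES responses (response_id)
);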
I would suggest you always take a normalized approach to your database schema and then later decided if you need to create a solution for performance reasons. Premature optimization can be dangerous. Premature database de-normalization can be disastrous!
I would suggest that you stick with the original schema and later, if necessary, create a reporting table that is a de-normalized version of your normalized schema.
One change that may or may not help simplify things would be to not link the ResponseAnswers back to the SurveyID. Rather, create an ID per response and per question, and let your ResponseAnswers table contain the fields ResponseID, QuestionID, Answer. Although this would require keeping unique identifiers for each unit, it would help keep things a little more normalized. The response answers do not need to be associated with the survey being answered, only with the specific question they answer and the response they belong to.
I created a customer surveys system at my previous job and came up with a schema very similar to what you have. It was used to send out surveys (on paper) and tabulate the responses.
A couple of minor differences:
Surveys were NOT anonymous, and this was made very clear in the printed forms. It also meant that the demographic data in your example was known in advance.
There was a pool of questions which were attached to the surveys, so one question could be used on multiple surveys and analyzed independently of the survey it appeared on.
Handling different types of questions got interesting -- we had a 1-3 scale (e.g., Worse/Same/Better), 1-5 scale (Very Bad, Bad, OK, Good, Very Good), Yes/No, and Comments.
There was special code to handle the comments, but the other question types were handled generically by having a table of question types and another table of valid answers for each type.
To make querying easier you could probably create a function to return the response based on a survey ID and question ID.
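For example, against the schema sketched earlier (and assuming the response_id link), such a query could look like:

SELECT r.response_id, r.submit_time, a.answer
FROM   responses r
JOIN   answers a ON a.response_id = r.response_id
WHERE  r.survey_id   = 1     -- the survey
  AND  a.question_id = 7;    -- the question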