My question is about normalizing data.
INFO
I'm trying to tabulate test results in a database. The information I'd like to record is: test_instance, user_id, test_id, completed (date/time), duration (of the test), score, incorrect questions, and reviewed questions.
For the most part, I think I'd organise the info according to TABLE 1, but I've come a little unstuck trying to work out the best way to record incorrect or reviewed questions. Please note that I DON'T want to put all the incorrect questions together in one entry as per TABLE 2.
I'd like to make a separate entry for each incorrectly marked question (or reviewed question).
NOTE: Reviewed questions are ones that at one time or another were marked incorrectly and hence need to be tested again.
TABLE 1
---------------------------------------------------------------
| instance | user_id | test_id | completed | duration | score |
---------------------------------------------------------------
| 1        | 23      | 33      | 2JAN2012  | 20m      | 75    |
| 2        | 11      | 12      | 10DEC2011 | 35m      | 100   |
| 3        | 1       | 3       | 3JUL2008  | 1m       | 0     |
| 4        | 165     | 213     | 4SEP2010  | 10m      | 50    |
---------------------------------------------------------------
TABLE 2
----------------------
| instance | wrong Q |
----------------------
| 1        | 3,5,7   |
----------------------
Ultimately, I'd like to know how many times a user has gotten a particular question wrong over time. Also, I need to keep track of which test the wrong questions came from. This is the same for the reviewed questions.
Incidentally it's possible for questions to be reviewed AND wrong in the same instance.
I've come up with 2 different ways to represent the data, but I don't like either of them.
-------------------------------------------------
| instance | Q number | Wrong | Reviewed |
-------------------------------------------------
OR
---------------------------------------------------
| user_id | test_id | Q number | Wrong | Reviewed |
---------------------------------------------------
Note: the Wrong/Reviewed columns count how many times that question number has fallen into each category.
MY QUESTIONS SUMMARISED
How can I efficiently represent wrong/reviewed questions in a table? Is TABLE 1 set up efficiently?
EDIT: Questions that have been answered incorrectly can be used to generate new tests. Only incorrect questions will be used for these tests. If a generated test is taken, the questions tested will be marked as reviewed. The score will not be updated, as it will be a new test and a new test_id will be generated.
NOTE: It is possible to retake old tests, but the score will not be updated. A new instance will be created for each test that is taken.
Regarding the generated tests, I guess this means I will need one more table to keep track of which quiz the questions originally came from. Sorry, I hadn't thought it all the way through to the end.
THANKS
It was difficult for me to choose an answer as everyone gave me really useful information. My final design will take into consideration everything you have said. Thanks again.
Revisiting my answer after your updates, I came up with this kind of layout, which I think would work quite nicely.
As a prerequisite, I'm assuming you have your tests and questions somewhere. For consistency, I'm including them (with only relevant columns) in my layout.
USERS
- user id
TESTS
- test id
QUESTIONS
- question id
- test id
Then for the interesting part. Considering how you say:
Questions that have been answered incorrectly can be used to generate
new tests. Only incorrect questions will be used for the tests
You don't mention how many times a test can be retaken; I assume indefinitely, or at least more than once.
TEST INSTANCE
- instance id [PK]
- revision id [PK]
- user id
- completed
- duration
COMMENT: you may want to consider replacing completed and duration with
a start and end timestamp. They will serve the same purpose without
the need for any calculations at insert/update.
TEST INSTANCE SCORE
- instance id [FK, TEST INSTANCE (instance id)]
- score
FAILED QUESTIONS
- question id [FK, QUESTION (question id)]
- instance id [FK, TEST INSTANCE (instance id)]
- reviewed [FK, TEST INSTANCE (revision id)]
Then to my comments.
As I see it, an actual new test for the failed questions wouldn't make sense, so instead I added a revision id to the TEST INSTANCE table. Each time a test is retaken, a new record for the same instance id is created with a new revision id (e.g. from a running number sequence).
Any failed questions would be stored in FAILED QUESTIONS along with the instance id and initially a NULL value for reviewed. When a failed question is considered reviewed, its reviewed column would be updated with the revision id of the latest test instance for instance id.
With this approach, you will have a complete history of how many times a failed question has been attempted before it was successfully reviewed.
Furthermore, I chose in my revised answer to move the score to its own table, because you said scores won't be updated despite reviewing the failed questions, and my proposed model would otherwise have introduced data duplication. You'll notice I left the revision id out of that table, because for a test instance (and any number of revisions) there is only one score.
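As a rough MySQL sketch of this layout (the snake_case names and column types are my assumptions, not part of the layout above):

CREATE TABLE test_instance (
    instance_id INT NOT NULL,
    revision_id INT NOT NULL,
    user_id     INT NOT NULL,
    completed   DATETIME,
    duration    TIME,
    PRIMARY KEY (instance_id, revision_id)
);

CREATE TABLE test_instance_score (
    instance_id INT NOT NULL PRIMARY KEY,
    score       INT NOT NULL
);

CREATE TABLE failed_question (
    question_id INT NOT NULL,
    instance_id INT NOT NULL,
    reviewed    INT NULL, -- revision_id of the reviewing revision; NULL until reviewed
    PRIMARY KEY (question_id, instance_id),
    FOREIGN KEY (instance_id, reviewed)
        REFERENCES test_instance (instance_id, revision_id)
);

Note that MySQL only enforces the composite foreign key once reviewed is non-NULL, which matches the "NULL until reviewed" lifecycle described above.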
Talking about normalization, and just to make sure that you can retrieve all kinds of calculated data out of your database, I'd propose a more complex model, which will end up being easier to manage...
You'll need the following tables
test_table
PK: id_test
testDescription
question_table
PK: id_question
FK: id_test
questionDescription
instance_table (please note that duration and score will be calculated later on)
PK: id_instance
FK: id_user
FK: id_test
startingTime
endingTime
question_instance_table
PK: id_question_instance
FK: id_instance
FK: id_question
questionResult (Boolean)
(please note here that the PK could be id_instance + id_question ...)
Back to your needs, we then have the following:
duration is calculated from startingTime and endingTime in instance_table
score is calculated from the count of true values in the questionResult field
you can track and compare answers to the same question over time for one user
thus your reviewed questions can be defined as questions with at least one false value for a specific user
if your database supports null values for boolean fields, you'll have the possibility to follow unanswered questions (with questionResult = NULL). Otherwise, I advise you to use or build a three-state field (an integer with NULL allowed, plus 0 and 1 values, for example) to follow unanswered questions (NULL), wrong answers (0), and correct answers (1).
Score, being 100 * (number of good answers) / (number of questions in the test), can easily be calculated via SQL aggregates.
You could even calculate partial scores as number of good answers/number of questions answered in the test.
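For illustration, a sketch of the duration and score calculations in SQL, using the names above and assuming questionResult is stored as 0/1:

SELECT i.id_instance,
       TIMEDIFF(i.endingTime, i.startingTime) AS duration,
       100 * SUM(q.questionResult) / COUNT(*) AS score
FROM instance_table i
JOIN question_instance_table q ON q.id_instance = i.id_instance
GROUP BY i.id_instance, i.startingTime, i.endingTime;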
This model accepts any number of tests, any number of questions per test, any number of instances, any number of users...
Of course, it can be further improved by adding missing properties to the tables (testNumber and questionNumber fields, for example), etc.
Assuming the number of questions on a test doesn't change, and that each question is worth the same number of marks, I suggest the following tables:
test
----
test_id
number_of_questions
test_instance
-------------
instance_id
user_id
test_id
completed
duration
notable_questions
-----------------
instance_id
question_id
status (W - Wrong, R - Reviewed)
So, for example:
test:
---------------------------------
| test_id | number_of_questions |
---------------------------------
| 3       | 50                  |
| 12      | 100                 |
| 33      | 78                  |
| 213     | 50                  |
---------------------------------
test_instance:
----------------------------------------------------------
| instance_id | user_id | test_id | completed | duration |
----------------------------------------------------------
| 1           | 23      | 33      | 2JAN2012  | 20m      |
| 2           | 11      | 12      | 10DEC2011 | 35m      |
| 3           | 1       | 3       | 3JUL2008  | 1m       |
| 4           | 23      | 213     | 4SEP2010  | 10m      |
----------------------------------------------------------
notable_questions:
--------------------------------------
| instance_id | question_id | status |
--------------------------------------
| 1           | 3           | W      |
| 1           | 5           | W      |
| 1           | 7           | W      |
| 4           | 2           | R      |
--------------------------------------
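With this layout, the original goal (how many times a user has gotten a particular question wrong over time) becomes a straightforward join; a sketch against the tables above:

SELECT ti.user_id, nq.question_id, COUNT(*) AS times_wrong
FROM notable_questions nq
JOIN test_instance ti ON ti.instance_id = nq.instance_id
WHERE nq.status = 'W'
GROUP BY ti.user_id, nq.question_id;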
From the above example, I assume that an instance correlates directly to a user_id + test_id combination.
If that is so, you can consider having table 2 in the following format:
Instance | question_id | status | date
The PK for the table should be on instance, question_id and status.
Entries in this table will not be updated, only inserted. That way you can have:
Instance | question_id | status | date
1        | 3           | W      | 1/1/2011
1        | 3           | R      | 1/5/2011
This will allow complete tracking of wrong and reviewed questions, and the date of review. If you don't need the date of review, don't define this column :)
You can add an index on the instance and status fields, so that when you access the table your search will be more efficient.
*Additional data that can be added to the second table is a "new test_id" and "new question_id" for reviewed questions, so that you can check whether, for the same question (assuming a question_id is generated each time), you still have failures.
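A sketch of that table in MySQL (the table name question_status is my own; the post above doesn't name it):

CREATE TABLE question_status (
    instance    INT NOT NULL,
    question_id INT NOT NULL,
    status      CHAR(1) NOT NULL, -- 'W' = wrong, 'R' = reviewed
    date        DATE NOT NULL,
    PRIMARY KEY (instance, question_id, status),
    INDEX idx_instance_status (instance, status)
);

-- rows are only ever inserted, never updated
INSERT INTO question_status VALUES (1, 3, 'W', '2011-01-01');
INSERT INTO question_status VALUES (1, 3, 'R', '2011-05-01');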
I don't know whether this has already been answered; I haven't found any answers. In MySQL tables, the rows are arranged in the order of the primary key. For example:
+----+--------+
| id | name   |
+----+--------+
| 1  | john   |
| 2  | Bryan  |
| 3  | Princy |
| 5  | Danny  |
+----+--------+
If I insert another row with insert into demo_table values (4, 'Michael'), the table will look like:
+----+---------+
| id | name    |
+----+---------+
| 1  | john    |
| 2  | Bryan   |
| 3  | Princy  |
| 4  | Michael |
| 5  | Danny   |
+----+---------+
But I need the table to be like
+----+---------+
| id | name    |
+----+---------+
| 1  | john    |
| 2  | Bryan   |
| 3  | Princy  |
| 5  | Danny   |
| 4  | Michael |
+----+---------+
I want the new row to be appended to the end of the table, i.e. the rows of the table should stay in the order of insertion. Can anybody suggest a query to achieve this? Thank you in advance for any answers.
There is in general no internal order to the records in a MySQL table. The only order that exists is the one you impose at the time you query, typically using an ORDER BY clause. But there is a bigger design problem here: if you want to order the records by the time they were inserted, you should either add a dedicated timestamp column to your table, or perhaps make the id column auto-increment.
If you want to go with the latter option, here is how you would do that:
ALTER TABLE demo_table MODIFY COLUMN id INT auto_increment;
Then, do your insertions like this:
INSERT INTO demo_table (name) VALUES ('Michael');
The database will choose an id value for the Michael record, and in general it would be greater than any already existing id value. If you need absolute control, then adding a timestamp column might make more sense.
Just add another column, created (a TIMESTAMP), to your table to store the time of insertion.
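For example, assuming the demo_table from the question:

ALTER TABLE demo_table ADD COLUMN created TIMESTAMP;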
Then use this command for the insertion:
INSERT INTO demo_table (id, name, created) VALUES (4, 'Michael', NOW());
The NOW() function returns the current date and time.
Since you are recording the timestamp, it can also be used for future reference.
It's not clear why you want to control the "order" in which the data is stored in your table. The relational model does not support this; unless you specify an ORDER BY clause, the order in which records are returned is not deterministic. Even if it looks like data is stored in a particular sequence, the underlying database engine can change its mind at any point in time without breaking the standards or documented behaviours.
The fact that you observe a particular order when executing a SELECT query without ORDER BY is a side effect. Side effects are usually harmless, right up to the point where the main feature changes and the side effect's behaviour changes too.
What's more, it's generally a bad idea to rely on the primary key to have "meaning". I assume your id column is the primary key; you should really not rely on any business meaning in primary keys, which is why most people use surrogate keys. Depending on keys to indicate the order in which records were created is probably harmless, but it still seems like a side effect to me. In this respect, I don't support @TimBiegeleisen's otherwise excellent answer.
If you care about the order in which records were entered, make this explicit in the schema by adding a timestamp column, and write your select statement to order by that timestamp. This is the least sensitive to bugs or changes in the underlying logic/database engine.
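A minimal sketch of that approach, assuming the created timestamp column suggested in the answer above:

SELECT id, name
FROM demo_table
ORDER BY created, id;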
I am trying to create a database for different types of events. Each event has arbitrary, user-created properties of different types, for example "number of guests", "special song to play", "time the clown arrives". Not every event has a clown, but one user could still have different events with a clown. My basic concept is:
propID | name   | type
------ | ------ | ------
1      | #guest | number
2      | clown  | time
and another table with every event, each with a unique eventID. The problem is that a simple approach like
eventID | propID | value
------- | ------ | -----
1       | 1      | 20
1       | 2      | 10:00
does not really work because of the different data types.
Now I've thought about some possible solutions, but I don't really know which one is best, or whether there is an even better solution.
1. I store all values as strings and use the datatype from the property table to interpret them. I think this is called EAV and is not considered good practice.
2. There are only a limited number of meaningful datatypes, which could lead to a table like this:
eventID | propID | stringVal | timeVal | numberVal
------- | ------ | --------- | ------- | ---------
1       | 1      | null      | null    | 20
1       | 2      | null      | 10:00   | null
3. Use the possible datatypes for multiple tables like:
propDateEvent propNumberEvent
-------------------------- --------------------------
eventID | propId | value eventID | propId | value
--------|--------|-------- --------|--------|--------
1 | 2 | 10:00 1 | 1 | 20
Somehow I think every solution has its ups and downs. #1 feels like the simplest but least robust; #3 seems like the cleanest solution, but it gets pretty complicated if I want to add, e.g., a priority for the properties per event.
All the options you propose are variations on entity/attribute/value, or EAV. The basic concept is that you store entities (in your case, events), their attributes (#guest, clown), and the values of those attributes as rows, not columns.
There are lots of EAV questions on Stack Overflow, discussing the benefits and drawbacks.
Your 3 options provide different ways of storing the data - but you don't address the ways in which you want to retrieve that data, or verify the data you're about to store. This is the biggest problem with EAV.
How will you enforce the rule that all events must have "#guests" as a mandatory field (for instance)? How will you find all events that have at least 20 guests and no clown booked? How will you show a list of events between two dates, ordered by date and number of guests?
If those requirements don't matter to you, EAV is fine. If they do, consider using a document to store this user-defined data (JSON or XML). MySQL can query those documents natively, you can enforce business logic much more easily, and you won't have to write horribly convoluted queries for even the simplest business cases.
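A minimal sketch of the document approach, assuming MySQL 5.7+ with its native JSON type (the table and property names are only illustrative):

CREATE TABLE event (
    id         INT PRIMARY KEY AUTO_INCREMENT,
    properties JSON
);

INSERT INTO event (properties)
VALUES ('{"guests": 20, "clown_time": "10:00"}');

-- events with at least 20 guests and no clown booked
SELECT id
FROM event
WHERE properties->'$.guests' >= 20
  AND properties->'$.clown_time' IS NULL;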
What I have
I have the following schema:
users table:
| id | name | ... | created_at | updated_at |
groups table:
| id | name | ... | created_at | updated_at |
messages table:
| id | text | ... | created_at | updated_at |
user_messages table (Pivot):
| user_id | message_id | sent_at |
user_groups table (Pivot):
| user_id | group_id | joined_at |
For now the project is using only a MySQL database.
Problem
Storing a many-to-many relationship this way is traditional, and it is fine. But in this case I am a little confused.
Groups can include unlimited users, and a group can send 10-1000 messages (or more) per day. For example, let's take some basic numbers (real life could bring millions, too): 1 group, 10,000 users, 100 messages per day. The relation row count per day: 10,000 x 100 = 1,000,000. One million rows per day for one group, and the group count can also be in the thousands.
One idea comes up at once: why do you need a pivot for messages? The answer: the "targeted" sending option is required (for sending a message to only x users, not to all).
Question
My question is: what is the correct way to store this kind of data?
Maybe I need to use another database system, or maybe these numbers are not a problem for MySQL.
Assume a very large database. A table with 900 million records.
Method A:
Table: Posts
+----------+---------------+------------------+----------------+
| id (int) | item_id (int) | post_type (ENUM) | Content (TEXT) |
+----------+---------------+------------------+----------------+
| 1        | 1             | user             | some text ...  |
| 2        | 1             | page             | some text ...  |
| 3        | 1             | group            | some text ...  |
+----------+---------------+------------------+----------------+
// row 1 : User with ID 1 has a post with ID #1
// row 2 : Page with ID 1 has a post with ID #2
// row 3 : Group with ID 1 has a post with ID #3
The goal is to display 20 records from all 3 post_types on a page.
SELECT * FROM posts LIMIT 20
But I am worried about the number of records with this method.
Method B:
Split the 900 million records into 3 tables of 300 million each.
Table: User Posts
+----------+---------------+----------------+
| id (int) | user_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1        | 1             | some text ...  |
| 2        | 2             | some text ...  |
| 3        | 3             | some text ...  |
+----------+---------------+----------------+
Table: Page Posts
+----------+---------------+----------------+
| id (int) | page_id (int) | Content (TEXT) |
+----------+---------------+----------------+
| 1        | 1             | some text ...  |
| 2        | 2             | some text ...  |
| 3        | 3             | some text ...  |
+----------+---------------+----------------+
Table: Group Posts
+----------+----------------+----------------+
| id (int) | group_id (int) | Content (TEXT) |
+----------+----------------+----------------+
| 1        | 1              | some text ...  |
| 2        | 2              | some text ...  |
| 3        | 3              | some text ...  |
+----------+----------------+----------------+
Now, to get a list of 20 posts to display:
SELECT * FROM User_Posts LIMIT 10
SELECT * FROM Page_Posts LIMIT 10
SELECT * FROM group_posts LIMIT 10
// and make an array or object of the result, and display it in the output.
With this method, I would sort the results in an array in PHP and then send them to the page.
Which method is preferred?
Will separating a 900-million-record table into three tables affect the speed of reading and writing in MySQL?
This is actually a discussion about Single Table Inheritance vs. Table-Per-Class Inheritance, leaving out Joined Inheritance. The former corresponds to Method A, the second to your Method B, and a Method C would be having the IDs of all your posts in one table and deferring the attributes specific to group or user posts into different tables.
While a big table always has its negative impacts related to full table scans, the approach of splitting tables has its own, too. It depends on how often your application needs to access the whole list of posts vs. retrieving only certain post types.
Another consideration you should take into account is data partitioning, which can be done with MySQL or Oracle Database, for example. It is a way of organizing your data within tables that creates opportunities for information lifecycle management (which data is accessed when and how often; can part of it be moved and compressed, reducing database size and increasing the speed of access to the remaining data in the table). It is basically split into three major techniques: range-based partitioning, list-based partitioning, and hash-based partitioning.
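For illustration, a range-partitioning sketch in MySQL for the posts table from Method A (the partition boundaries are arbitrary):

CREATE TABLE posts (
    id        INT NOT NULL,
    item_id   INT NOT NULL,
    post_type ENUM('user', 'page', 'group') NOT NULL,
    content   TEXT,
    PRIMARY KEY (id)
)
PARTITION BY RANGE (id) (
    PARTITION p0 VALUES LESS THAN (300000000),
    PARTITION p1 VALUES LESS THAN (600000000),
    PARTITION p2 VALUES LESS THAN MAXVALUE
);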
Other features, not so commonly supported, related to reducing table size are those dealing with inserts carrying a timestamp that invalidates the inserted data automatically after a certain time period has expired.
What is indeed a major application design decision, and can boost performance, is distinguishing between read and write accesses to the database at the application level.
Consider a MySQL backend: because write accesses are obviously more critical to database performance than read accesses, you could set up one MySQL instance for writing to the database and another as a replica of it for the read accesses. This is also debatable, though, mainly when it comes to real-time decisions, where absolute consistency of the data at any given time is a must.
Using object pools as a layer between your application and the database is also a technique to improve application performance, though I don't know of existing solutions in the PHP world yet. Oracle Hot Cache is a pretty sophisticated example of this.
You could build your own, implemented on top of an in-memory database or using memcache, though.
I'm just after some opinions on the best way to achieve the following outcome:
I would like to store in my MySQL database products which can be voted on by users (each vote is worth +1). I also want to be able to see how many times in total a user has voted.
To my simple mind, the following table structure would be ideal:
table: product          table: user             table: user_product_vote
+----+-------------+    +----+-------------+    +----+------------+---------+
| id | product     |    | id | username    |    | id | product_id | user_id |
+----+-------------+    +----+-------------+    +----+------------+---------+
| 1  | bananas     |    | 1  | matthew     |    | 1  | 1          | 2       |
| 2  | apples      |    | 2  | mark        |    | 2  | 2          | 2       |
| .. | ..          |    | .. | ..          |    | .. | ..         | ..      |
This way I can do a COUNT of the user_product_vote table for each product or user.
For example, when I want to look up bananas and the number of votes to show on a web page I could perform the following query:
SELECT p.product AS product, COUNT(v.id) AS votes
FROM product p
LEFT JOIN user_product_vote v ON p.id = v.product_id
WHERE p.id = 1
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
A more simple approach would be to have a 'votes' column in the product table that is incremented each time a vote is added.
table: product
+----+-------------+-------+
| id | product     | votes |
+----+-------------+-------+
| 1  | bananas     | 2     |
| 2  | apples      | 5     |
| .. | ..          | ..    |
While this is more resource-friendly, I lose data (e.g. I can no longer prevent a person from voting twice, as there is no record of their voting activity).
My questions are:
i) Am I being overly worried about server resources, and should I just stick with the three-table option? (i.e. do I need to have more faith in the ability of the database to handle large queries?)
ii) Is there a more efficient way of achieving the outcome without losing information?
You can never be too worried about resources. When you first start building an application, you should always have resources, space, speed, etc. in mind; if your site's traffic grows dramatically and you never built with resources in mind, you will start running into problems.
As for the vote system, personally I would keep the votes like so:
table: product          table: user             table: user_product_vote
+----+-------------+    +----+-------------+    +----+------------+---------+
| id | product     |    | id | username    |    | id | product_id | user_id |
+----+-------------+    +----+-------------+    +----+------------+---------+
| 1  | bananas     |    | 1  | matthew     |    | 1  | 1          | 2       |
| 2  | apples      |    | 2  | mark        |    | 2  | 2          | 2       |
| .. | ..          |    | .. | ..          |    | .. | ..         | ..      |
Reasons:
Firstly, user_product_vote does not contain text, blobs, etc.; it's purely integers, so it takes up fewer resources anyway.
Secondly, you have more of a doorway to new entities within your application, such as "total votes in the last 24 hours" or "highest-rated product over the past 24 hours".
Take this example for instance:
table: user_product_vote
+----+------------+---------+-----------+-------+
| id | product_id | user_id | vote_type | time  |
+----+------------+---------+-----------+-------+
| 1  | 1          | 2       | product   | 224.. |
| 2  | 2          | 2       | page      | 218.. |
| .. | ..         | ..      | ..        | ..    |
And a simple query:
SELECT COUNT(id) AS total FROM user_product_vote WHERE vote_type = 'product' AND time BETWEEN ... AND ...
Another thing: if a user voted at 1 AM and then tried to vote again at 2 PM, you can easily check when they last voted and whether they should be allowed to vote again.
There are so many opportunities that you will be missing if you stick with your incremental example.
In regards to your COUNT(), no matter how much you optimize your queries, it would not really make a difference on a large scale.
With an extremely large user base, your resource usage will be looked at from a different perspective, such as load balancers, server settings in general, Apache, caching, etc.; there's only so much you can do with your queries.
If my site became hugely successful (we can all dream) and I had thousands of users voting on thousands of products, I fear that performing such a COUNT with every page view would be highly inefficient in terms of server resources.
Don't waste your time solving imaginary problems. MySQL is perfectly able to process thousands of records in fractions of a second; this is what databases are for. A clean and simple database and code structure is far more important than the mythical "optimization" that no one needs.
Why not mix and match both? Simply keep the final counts in the product and user tables, so that you don't have to count every time, and keep the votes table, so that there is no double voting.
Edit:
To explain it a bit further: the product and user tables will each have a column called "votes". Every time an insert succeeds in user_product_vote, increment the relevant user and product records. This avoids duplicate votes, and you won't have to run the complex count query every time either.
Edit:
Also, I am assuming that you have created a unique index on product_id and user_id; in that case, any duplication attempt will automatically fail, and you won't have to check the table before inserting. You will just need to make sure the insert query ran and that you got a valid value back from insert_id.
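A sketch of that setup, reusing the table names above and assuming the votes columns already exist:

-- one vote per user per product, enforced by the index
ALTER TABLE user_product_vote
    ADD UNIQUE INDEX uq_product_user (product_id, user_id);

-- a duplicate vote attempt now fails on the unique index
INSERT INTO user_product_vote (product_id, user_id) VALUES (1, 2);

-- on success, bump the cached counters
UPDATE product SET votes = votes + 1 WHERE id = 1;
UPDATE user SET votes = votes + 1 WHERE id = 2;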
You have to balance the desire for your site to perform quickly (for which the second schema would be best) against the ability to count votes for specific users and prevent double voting (for which I would choose the first schema). Because you are only using integer columns in the user_product_vote table, I don't see how performance could suffer too much. Many-to-many relationships are common, just as you have implemented with user_product_vote. If you do want to count votes for specific users and prevent double voting, a user_product_vote table is the only clean way I can think of implementing it; anything else could result in sparse records, duplicate records, and all kinds of bad things.
You don't want to update the product table directly with an aggregate every time someone votes; this will lock product rows, which will then affect other queries that use products.
Assuming that not all product queries need to include the votes column, you could keep a separate productvotes table to hold the running totals, and keep your userproductvote table as a means to enforce your per-product voting business rules and for auditing.
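A sketch of that separation (productvotes is the name used above; the column names are my assumptions):

CREATE TABLE productvotes (
    product_id INT NOT NULL PRIMARY KEY,
    vote_count INT NOT NULL DEFAULT 0
);

-- on each successful insert into userproductvote:
UPDATE productvotes SET vote_count = vote_count + 1 WHERE product_id = 1;

-- reads that need totals touch this small table instead of locking
-- rows in the main product table
SELECT p.product, pv.vote_count
FROM product p
JOIN productvotes pv ON pv.product_id = p.id
WHERE p.id = 1;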