Database too large - store as a row or serialise data? - mysql

I have a Quiz App that consists of many Modules containing Questions. Each question has many Categories (many-to-many). Every time a quiz is completed, the user's score is sent to the Scores table. (I've attached an entity-relation diagram for clarification purposes).
I have been thinking of breaking down the user scores according to categories (i.e. a user when completing a quiz will get an overall quiz score along with score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the Scores table. With multiple users, the Scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the Scores table. For example, if I wanted to calculate the average score for a user for a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the JSON data, but of course, this has its limitations.

The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following columns (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?) and creation date.
As your table grows, partition it. Potential candidates could be creation date buckets (by year or year/month would probably work well), or, if students are in particular classes, class buckets.
As your table grows even more, you could move the partitions to different disks (how you partitioned the data will be even more crucial here, because if a query has to go across too many partitions you may end up hurting performance instead of helping).
Beyond that, another suggestion would be to break the scores table into two: score and scoreDetail. The score table would contain top-level data such as user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet that 80% of the time people only care about the top score anyway. This way you only reach out to the bigger table when someone wants the details of their score in a particular exam.
Finally, you probably want to have the score by category in rows rather than columns to make it easier to do analysis and aggregations, but this is not necessarily a performance booster and really depends on how you plan to use the data.
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest creating a random data set that represents a few years' worth of data and experimenting with that.
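As a rough illustration of the score / scoreDetail split and the suggested index, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for MySQL, and every table, column and index name in it is an assumption made up for the example, not something taken from the asker's schema.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE score (
    id            INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    exam_id       INTEGER NOT NULL,
    overall_score REAL    NOT NULL,
    created_at    TEXT    NOT NULL
);
-- One row per (score, category): categories are stored as rows, not columns.
CREATE TABLE score_detail (
    score_id       INTEGER NOT NULL REFERENCES score(id),
    category_id    INTEGER NOT NULL,
    category_score REAL    NOT NULL,
    PRIMARY KEY (score_id, category_id)
);
-- Hypothetical composite index on the columns suggested above.
CREATE INDEX idx_score_user_exam ON score (user_id, exam_id, created_at);
""")

conn.execute("INSERT INTO score VALUES (1, 7, 100, 80.0, '2024-01-10')")
conn.execute("INSERT INTO score VALUES (2, 7, 101, 60.0, '2024-02-10')")
conn.executemany(
    "INSERT INTO score_detail VALUES (?, ?, ?)",
    [(1, 5, 90.0), (1, 6, 70.0), (2, 5, 50.0), (2, 6, 70.0)],
)


def average_for_category(user_id, category_id):
    """Average score of one user in one category, or None if there is no data."""
    row = conn.execute(
        """
        SELECT AVG(d.category_score)
        FROM score AS s
        JOIN score_detail AS d ON d.score_id = s.id
        WHERE s.user_id = ? AND d.category_id = ?
        """,
        (user_id, category_id),
    ).fetchone()
    return row[0]


avg_cat5 = average_for_category(7, 5)
print(avg_cat5)
```

Queries that only need the overall score never touch score_detail, which is the point of the split.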

I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you will need to deserialize every time you want to read from your table. That would eliminate the possibility to use SQL statements for sorting, filtering, JOINing etc.
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.
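To make the deserialization cost concrete, here is a small sketch of what the serialized variant tends to look like. It assumes a hypothetical score_blob table holding one JSON string per completed quiz, and again uses Python's sqlite3 as a stand-in for MySQL. The averaging has to happen in application code instead of in an SQL AVG().

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: all category scores of one quiz packed into one text column.
conn.execute("CREATE TABLE score_blob (user_id INTEGER, category_scores TEXT)")
conn.executemany(
    "INSERT INTO score_blob VALUES (?, ?)",
    [
        (7, json.dumps({"philosophy": 90, "logic": 70})),
        (7, json.dumps({"philosophy": 50, "logic": 70})),
    ],
)


def average_from_blobs(user_id, category):
    """Fetch every blob for the user, decode each one, then average in Python."""
    rows = conn.execute(
        "SELECT category_scores FROM score_blob WHERE user_id = ?", (user_id,)
    ).fetchall()
    decoded = [json.loads(blob) for (blob,) in rows]
    values = [scores[category] for scores in decoded if category in scores]
    return sum(values) / len(values) if values else None


avg_philosophy = average_from_blobs(7, "philosophy")
print(avg_philosophy)
```

Every row for the user is read and decoded even though only one category is wanted, and the database cannot filter or sort on the values inside the text column.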

Related

T-SQL Optimized conditional join

Hey guys it's Brian from OMDbAPI.com
I hit a little speed bump when trying to use a single query for both Movie and Episode data. I recently started collecting additional Episode details in a separate table (only two new columns have been added, Season # and Episode #). I put them in a separate table because those columns would be null in my main table 90% of the time, but the other columns do work across movies/episodes (title/rating/release date/plot etc.)
So I'm trying to use a single query for returning Movie data, but if the ID has a type = 'episode', return the additional fields from the other table. The problem is that I don't know whether the ID is an episode until it's queried, and the fewer calls to the database (and the smaller the execution plan) the better, as this is called hundreds of times per second (currently 25+ million requests a day).
I created a small SQL Fiddle of what I'm trying to achieve.
My question is what is the best method with the least performance cost to show these fields if it's an episode and completely suppress them if not? Is Dynamic SQL my only option? Thanks.
Supposing that each Movie row is associated with at most one Episode row, you are certain to get the best query plans by putting the episode data in the Movie table instead of in a separate one. That avoids having to determine during query execution whether to look at the episode data, and it also avoids any need for a JOIN when you do need it.
Having the 90% NULL episode data in your Movie table will cost you some space, and therefore it will have some performance impact, but I'm inclined to think that the resulting simpler query plans will offset that cost.
JOINing the tables every time is your next best bet, I think. That gives you reasonably simple query plans, and looks for performance gains through reducing the size of the Movie data. Still, as a general rule, the fewer JOINs you perform, the faster your queries will run.
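The "JOIN every time" option could look roughly like the sketch below. It is written with Python's sqlite3 rather than T-SQL so that it can run anywhere, and the table and column names are guesses, not the real OMDb schema. The LEFT JOIN always runs, and the episode fields are suppressed in application code when the row is not an episode.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movie (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    type  TEXT NOT NULL
);
-- At most one episode row per movie row, as the answer supposes.
CREATE TABLE episode (
    movie_id INTEGER PRIMARY KEY REFERENCES movie(id),
    season   INTEGER NOT NULL,
    episode  INTEGER NOT NULL
);
INSERT INTO movie VALUES (1, 'Some Film', 'movie'), (2, 'Some Pilot', 'episode');
INSERT INTO episode VALUES (2, 1, 1);
""")


def fetch_title(movie_id):
    """One round trip: episode columns come back as NULL for non-episodes."""
    row = conn.execute(
        """
        SELECT m.title, m.type, e.season, e.episode
        FROM movie AS m
        LEFT JOIN episode AS e ON e.movie_id = m.id
        WHERE m.id = ?
        """,
        (movie_id,),
    ).fetchone()
    if row is None:
        return None
    title, kind, season, episode = row
    result = {"title": title, "type": kind}
    if kind == "episode":
        # Only expose the extra fields when they apply.
        result["season"] = season
        result["episode"] = episode
    return result


print(fetch_title(1))
print(fetch_title(2))
```

This keeps a single static query and avoids dynamic SQL; whether it beats the single-table layout is something only a measurement on the real data can tell.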

Is it necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can (and have) created shopping carts and other complex databases that work just fine without creating any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently, some types of data are more granular than others, as items can always be rolled up under a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings come in ... it is in the detail for each record. The store table has names, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data, adding up to far more space consumption than would occur just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, as well as the size of the overall table, adversely affects the speed of querying that data. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly improve the response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
Short answer: in a relational database like MySQL, yes. Check this out about referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use a relational database for your project. In fact, the trend is to use non-relational databases (NoSQL), like MongoDB, to achieve the same results with better performance. More about RDBMS vs NoSQL: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think that with this example you will understand better:
Let's say we want to create an online store. We have at minimum Users, Payments and Events (events about the pages the user navigates to, or other actions). In this scenario we want to link the Users with the Payments in a secure and relational way. We do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the tables Users and Payments and link them with proper Foreign Keys. However, for the events, there are going to be a lot of them per user (maybe millions), and we need to track them quickly without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
To sum up, you can use a hybrid of SQL and NoSQL, but whether you use one, the other or both kinds of solution, do it properly.
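For the Users / Payments part of that example, here is a minimal sketch of what a foreign key buys you. It uses Python's sqlite3 as a stand-in for MySQL, and the table layout is invented for the illustration. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; in MySQL the equivalent enforcement comes from InnoDB foreign key constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch (assumption: SQLite stand-in)
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE payments (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    amount  REAL NOT NULL
);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO payments VALUES (1, 1, 9.99);
""")


def try_execute(sql, params=()):
    """Run one statement in its own transaction; report whether the DB accepted it."""
    try:
        with conn:
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


# A payment pointing at a user that does not exist is refused by the database.
orphan_payment_ok = try_execute("INSERT INTO payments VALUES (2, 999, 5.00)")
# Deleting a user who still has payments is refused as well.
delete_user_ok = try_execute("DELETE FROM users WHERE id = 1")

print(orphan_payment_ok, delete_user_ok)
```

Without the declared relationship, both statements would succeed and the application alone would be responsible for noticing the damage.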

What is the most efficient way to store a list in a relational database?

I have read many strong statements here and elsewhere on the subject of storing arrays in MySQL. The rules of normalization seem to suggest it's a bad idea, and searching within the stored array fosters inelegant code. HOWEVER, for the application I am working on it seems like a reasonable solution to store an array in a field. I'm sure that is what everyone wrongly thinks in this position, but I can't figure out a better way. Here is the setup:
I have a series of tables that store registered students, courses they can take and their performance on each course. All are "normalized" to avoid duplication and errors. I want to be able to generate a "myCourses" section so after login the student sees courses they are eligible for and courses they have taken but are free to review. The approach that comes to mind is two arrays; my_eligible_courses and my_completed_courses. On registration, the student is given a set of courses for which they are eligible. This could be stored as rows where there are multiple occurrences of studentid, one for each course they can take:
student1 course 1
student1 course 2
student1 course n
The table could then be queried for all of student 1's eligible courses and displayed as a list when the student logs in.
Alternately, studentid could be a primary key and in a column "eligible_courses" there would be an array (course 1,course 2, course n).
There is a table for student performance, to record every course taken and metrics associated with student performance. It will be queried to report on student performance, quality of course etc but this table will grow quite large. I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
One other complication is that the set of courses a student is eligible for is variable and expanding as new courses are developed, which to me seems to suggest that generating a set of new columns for each new course is a bad idea; for example, new course_name, pretest_score, posttest_score, time_to_complete, ... Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
So to restate the question, is it better to store "inelegant" arrayed list of eligible and completed courses in a registered student table or dynamically generate these lists?
I'm guessing this is still too vague but any discussion of db design that gives an example of an inelegant array vs a restructured schema would be appreciated.
You should feel confident that if you have indexes on your tables for the appropriate columns, querying for my_completed_courses will be pretty snappy.
When your table grows to the point that you notice slowdown, you can configure your MySQL server with appropriate memory allocation settings so that it can keep more data cached in memory. Or you could look into that now.
In response to the edit you made about adding new courses: Don't add a new column for each course. Don't add a new table for each course. Create a table for courses, and add rows for each course.
You should then be able to join your tables together on indexed columns to generate the list of data you need.
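A minimal sketch of that advice, using Python's sqlite3 as a stand-in for MySQL. The course and eligible_course tables and the sample course names are assumptions made up for the example. New courses become new rows, and the list is produced by a join on indexed columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE course (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per (student, course) pair instead of an array column.
CREATE TABLE eligible_course (
    student_id INTEGER NOT NULL,
    course_id  INTEGER NOT NULL REFERENCES course(id),
    PRIMARY KEY (student_id, course_id)
);
INSERT INTO course VALUES (1, 'Algebra'), (2, 'Biology'), (3, 'Chemistry');
INSERT INTO eligible_course VALUES (1, 1), (1, 2), (2, 3);
""")


def eligible_courses(student_id):
    """Names of the courses a student may take, via a join on the primary keys."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM eligible_course AS e
        JOIN course AS c ON c.id = e.course_id
        WHERE e.student_id = ?
        ORDER BY c.name
        """,
        (student_id,),
    ).fetchall()
    return [name for (name,) in rows]


before = eligible_courses(1)

# Adding a newly developed course is two INSERTs, not a schema change.
with conn:
    conn.execute("INSERT INTO course VALUES (4, 'Drama')")
    conn.execute("INSERT INTO eligible_course VALUES (1, 4)")

after = eligible_courses(1)
print(before, after)
```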
This is a bad idea for two obvious reasons:
DBMS can't enforce proper referential [X] (and possibly domain) integrity and relying on application-level integrity is almost always a bad idea.
While the database will be able to answer the query: "based on given student, give me courses", you won't be able to (efficiently) go in the opposite direction, should you ever need to.
[X] What's to stop a buggy application from storing a non-existent ID in array? Or deleting a course that is still referenced by students? Even if your application is careful about course deletion, there is no way to do it efficiently - you'll need a full table scan to examine all arrays.
Why are you even trying this? A link (aka. junction) table would solve these problems, for a moderate cost of some additional storage space.
If you are really concerned about storage space, you could even switch the DBMS and use one that supports leading-edge index compression (such as Oracle).
I'm having a hard time believing that the most efficient way to generate a list of my_completed_courses is to query this table by studentid every time they login just to give them a list of completed courses.
Databases are very good at querying humongous amounts of data. In this case, if you use the clustering properly, the DBMS will be able to get this data in very few I/O operations, meaning very fast. Did you perform any actual benchmarks? Have you measured any actual performance problem?
Also, a table for each new course seems like a complicated solution for the relatively mundane endpoint of generating a simple set of lists.
Generating a new table may be justified in case it will have different columns. But, that doesn't sound like what you are trying to do.
It seems to me that you simply need:
CHECK (
(COMPLETED = 0 AND (performance fields) IS NULL)
OR (COMPLETED = 1 AND (performance fields) IS NOT NULL)
)
When a student enrolls into course, insert a row in STUDENT_COURSE, set COMPLETED to 0 and leave performance fields NULL.
When the student completed the course, set COMPLETED to 1 and fill the performance fields.
(BTW, you could even omit COMPLETED altogether and just rely on testing the performance fields for NULL.)
InnoDB tables are clustered, which means that rows in STUDENT_COURSE belonging to the same student are stored physically close together, which means that getting courses of the given student is extremely fast.
If you need to go in the opposite direction (get students of a given course), add an index on same fields but in opposite order: {COURSE_ID, STUDENT_ID}. You might even consider covering in this case.
Since we are talking about small number of rows, leaving COMPLETED unindexed is just fine. If you are really concerned about that, you can even do something like:
The COMPLETED_STUDENT_COURSE is a B-Tree only for completed courses (and essentially a subset of STUDENT_COURSE which is a B-Tree for all enrolled courses).
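Here is a rough sketch of the STUDENT_COURSE idea described in this answer, written with Python's sqlite3 so that it runs as-is. The single posttest_score column stands in for the "performance fields" and is an assumption, as is the index name. One caveat worth knowing: MySQL only enforces CHECK constraints from version 8.0.16 on; earlier versions parse and ignore them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student_course (
    student_id     INTEGER NOT NULL,
    course_id      INTEGER NOT NULL,
    completed      INTEGER NOT NULL DEFAULT 0,
    posttest_score REAL,  -- hypothetical stand-in for the performance fields
    PRIMARY KEY (student_id, course_id),
    CHECK (
        (completed = 0 AND posttest_score IS NULL)
        OR (completed = 1 AND posttest_score IS NOT NULL)
    )
);
-- Same fields in the opposite order, for "students of a given course".
CREATE INDEX idx_course_student ON student_course (course_id, student_id);
""")


def enroll(student_id, course_id):
    with conn:
        conn.execute(
            "INSERT INTO student_course (student_id, course_id) VALUES (?, ?)",
            (student_id, course_id),
        )


def complete(student_id, course_id, posttest_score):
    with conn:
        conn.execute(
            "UPDATE student_course SET completed = 1, posttest_score = ? "
            "WHERE student_id = ? AND course_id = ?",
            (posttest_score, student_id, course_id),
        )


def completed_courses(student_id):
    rows = conn.execute(
        "SELECT course_id FROM student_course "
        "WHERE student_id = ? AND completed = 1 ORDER BY course_id",
        (student_id,),
    )
    return [course_id for (course_id,) in rows]


def students_in_course(course_id):
    rows = conn.execute(
        "SELECT student_id FROM student_course WHERE course_id = ? ORDER BY student_id",
        (course_id,),
    )
    return [student_id for (student_id,) in rows]


enroll(1, 10)
enroll(1, 11)
enroll(2, 10)
complete(1, 10, 88.0)

# Marking a course completed without performance data violates the CHECK.
try:
    complete(1, 11, None)
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

print(completed_courses(1), students_in_course(10), check_rejected)
```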
Here are a few thoughts that I believe may assist you in making a good decision.
Generally, it is a rule to use correctly normalised tables. But there can be exceptions to this. Perhaps your project may be such.
Most of the time, new developers tend to focus on getting the data into a DB. They get stuck when it comes to retrieving it for a specific purpose. So given both cases of arrays vs. relational tables, ask yourself if either method serves your purpose. For example, if you wanted to list the courses of student X, your array method is just fine. This is because you can retrieve it by the primary key, like a student ID. But if you wanted to know how many students are on course A, the array method will be a horrible way to go.
Then again, the above point would depend on your data volume as well. For example, if you only have about a hundred students, you'll probably not notice a difference in performance. But if you're looking at several thousand records and you have a big list of courses for students, the array approach is not the way to go.
Benchmark. This is the best way for you to find out your answer. You can use MySQL's EXPLAIN or just time it using your program that executes the queries. Try each method with your standard volume of data and see which one works best. For example, in the recent past, MySQL was boasting about the strength of the ISAM engine. Then I had to work on a large application that involved millions of records. And here, I noticed that each time a new record came in, indexes had to be rebuilt. So we had to bend the rules. Likewise, you'd better do your tests with the correct volumes of data and make a better decision.
But do not take this example as a rule. Rather, go by the standards of normalisation and only bend the rules for exceptions.
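As a sketch of what such a benchmark can look like, the snippet below fills a hypothetical performance table with generated rows, then inspects the query plan and times the same query before and after adding an index. It uses Python's sqlite3 and SQLite's EXPLAIN QUERY PLAN as stand-ins for MySQL and its EXPLAIN; the table, the index name and the data volume are all assumptions, and the timings will differ on a real server.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE performance (student_id INTEGER, course_id INTEGER, score REAL)"
)
# 2000 students x 20 courses = 40,000 generated rows.
conn.executemany(
    "INSERT INTO performance VALUES (?, ?, ?)",
    ((s, c, 50.0) for s in range(2000) for c in range(20)),
)

QUERY = "SELECT course_id FROM performance WHERE student_id = ?"


def plan(sql):
    """Return the human-readable plan text (4th column of EXPLAIN QUERY PLAN)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (42,)).fetchall()
    return " ".join(row[3] for row in rows)


def timed(sql, repeats=50):
    """Wall-clock seconds for running the query `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        conn.execute(sql, (42,)).fetchall()
    return time.perf_counter() - start


plan_before = plan(QUERY)
time_before = timed(QUERY)

conn.execute("CREATE INDEX idx_perf_student ON performance (student_id)")

plan_after = plan(QUERY)
time_after = timed(QUERY)

print(plan_before, time_before)
print(plan_after, time_after)
```

The plan text tells you whether the index is used at all; the timings tell you whether it matters at your data volume.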

MySQL: normalization, is this a valid exception?

We have 10 years of archived sports data, spread across separate databases.
Trying to consolidate all the data into a single database. Since we'll be handling 10X the number of records, I'm trying to make schema redesign changes now to avoid potential performance hit.
One change entails breaking up the team roster table into 2 tables; one, a players table that stores fixed data: playerID, firstName, lastName, birthDate, etc., and another, the new roster table that stores variable data about a player: yearInSchool, jerseyNumber, position, height, weight, etc. This will allow us to, among other things, create career 4 year aggregate views of player stats.
Fair enough, makes sense, but then again, when I look at queries that tally, for example, a player's aggregate scoring stats, I have to join on both player & roster tables, in addition to scoring and schedule tables, in order to get all the information needed.
Where I'm considering denormalizing is with player first and last name. If I store player first and last name in the roster table, then I can omit the player table from the equation for stat queries, which I'm assuming will be a big performance win given that total record count per table will be over 100K (i.e. most query joins will be across tables that each contain at least 100K records, and up to, for now, 300K).
So, where to draw the line with denormalization in this case? I assume duplicating first, last name is OK. Generally I enjoy non-duplication/integrity of data, but I suspect site visitors enjoy performance more!
First thought is, are you sure you've exhausted tuning options to get good SELECT performance without denormalising here?
I'm very much with you in the sense of "no sacred cows" and denormalise when necessary, but this sounds like a case where it shouldn't be too hard to get decent performance.
Of course you guys have done your own exploration; if you've ruled that out, then my personal opinion is that it's acceptable, yeah.
One issue - what happens if a player's name changes? Can it do so in your system? Would you use a transaction to update all roster details in a single COMMIT operation? For a historical records DB this could be totally irrelevant, mind you.
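If names are duplicated into the roster table, a name change could be handled along these lines. This is only a sketch with Python's sqlite3 standing in for MySQL; the column names follow the question loosely and the sample data is invented. Both UPDATEs run inside one transaction, so either every copy of the name changes or none does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (
    player_id  INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL
);
-- Denormalized: the name is repeated on every roster row.
CREATE TABLE roster (
    player_id     INTEGER NOT NULL,
    season        INTEGER NOT NULL,
    first_name    TEXT NOT NULL,
    last_name     TEXT NOT NULL,
    jersey_number INTEGER,
    PRIMARY KEY (player_id, season)
);
INSERT INTO player VALUES (1, 'Sam', 'Jones');
INSERT INTO roster VALUES (1, 2012, 'Sam', 'Jones', 12), (1, 2013, 'Sam', 'Jones', 12);
""")


def rename_player(player_id, first_name, last_name):
    """Update the master row and all duplicated copies in a single commit."""
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            "UPDATE player SET first_name = ?, last_name = ? WHERE player_id = ?",
            (first_name, last_name, player_id),
        )
        conn.execute(
            "UPDATE roster SET first_name = ?, last_name = ? WHERE player_id = ?",
            (first_name, last_name, player_id),
        )


rename_player(1, "Sam", "Smith")

roster_names = [
    name
    for (name,) in conn.execute(
        "SELECT DISTINCT last_name FROM roster WHERE player_id = 1"
    )
]
print(roster_names)
```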

The right way to plan my database

I'm creating a music sharing site, so each user can set up his account, add songs etc..
I would like to add the ability for users to give points to one another based on whether they like the song.
For example user1 has some songs in his collection, user2 likes a song so he clicks "I like" resulting in giving a point to user1.
Now I would like to know if my idea of creating the "Points table" in my database is somewhat right and correct.
I decided to create a separate table to hold data about points, this table would have id column, who gave the point to who, song id column, date column etc. My concern is that in my table I will have a row for every single point that has been given.
Of course it's nice to have all this specific info, but I'm not sure if this is the right way to go, or whether I'm wasting resources, space and so on.
Maybe I could redesign my songs Table to have additional column points, and I would just count how many points each song has.
I need some advice on this. Maybe I shouldn't really worry about my design, optimization and scalability, since today's technology is so fast and powerful and database queries are near-instant.
IMO, it's better to use a transactional table to track the points given to a user based on their song lists. Consider how Stack Overflow (SO) works: if you up-vote a question or solution, you can remove your vote at a later time. If SO used only a summation column, it would be impossible to support this type of functionality.
I wouldn't worry too much about the number of rows in your points table, as it will probably be pretty narrow; generously, 10 columns at the most. Not to mention that the table would be a pivot table between users, so it would consist mostly of int values.
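A minimal sketch of such a transactional points table, again with Python's sqlite3 as a stand-in for MySQL. The point table, its columns and the one-point-per-user-per-song rule are assumptions for the example; INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE point (
    giver_id    INTEGER NOT NULL,
    receiver_id INTEGER NOT NULL,
    song_id     INTEGER NOT NULL,
    given_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (giver_id, song_id)  -- assumed rule: one point per user per song
);
CREATE INDEX idx_point_song ON point (song_id);
""")


def like(giver_id, receiver_id, song_id):
    """Record a point; a repeated like from the same user is silently ignored."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO point (giver_id, receiver_id, song_id) "
            "VALUES (?, ?, ?)",
            (giver_id, receiver_id, song_id),
        )


def unlike(giver_id, song_id):
    """Take a point back, which a bare counter column could not attribute."""
    with conn:
        conn.execute(
            "DELETE FROM point WHERE giver_id = ? AND song_id = ?",
            (giver_id, song_id),
        )


def song_points(song_id):
    return conn.execute(
        "SELECT COUNT(*) FROM point WHERE song_id = ?", (song_id,)
    ).fetchone()[0]


like(2, 1, 10)
like(3, 1, 10)
like(2, 1, 10)  # duplicate, ignored
points_after_likes = song_points(10)

unlike(2, 10)
points_after_unlike = song_points(10)

print(points_after_likes, points_after_unlike)
```

If counting ever becomes a bottleneck, a cached total on the songs table can be added later without giving up the per-point rows.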
Part of the issue is really simple. If you need to know
who gave a point
to whom
for which song
on which date
then you need to record all that information.
Wasn't that simple?
If you only need to know the totals, then you can just store the totals.
As for scale, say you have 20,000 users, each with an average of 200 songs. Let's say 1 in 10 gets any up votes, averaging 30 per song. That's 4 million user-songs; 400,000 that get up votes, at 30 per song you have 12 million rows. That's not that many. If the table gets too many rows, partitioning on "to whom" would speed things up a lot.
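The estimate above, spelled out step by step so the figures can be swapped for your own. The input numbers are the ones from the answer; the variable names are just for illustration.

```python
# Inputs taken from the estimate above.
users = 20_000
songs_per_user = 200
upvoted_one_in = 10          # 1 in 10 songs gets any up votes
votes_per_upvoted_song = 30  # average for songs that get votes

user_songs = users * songs_per_user                   # 4,000,000
upvoted_songs = user_songs // upvoted_one_in          # 400,000
point_rows = upvoted_songs * votes_per_upvoted_song   # 12,000,000

print(user_songs, upvoted_songs, point_rows)
```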