MySQL Database Design - Specific Case, Columns or additional table?

I am really only a hobbyist with aspirations far too grand. That said, I am trying to figure out the right way to create my database so that database changes don't require client refactoring, but are also fast. Please respond as if I wouldn't understand typical development or DBA terminology well.
The situation: I was trying to determine how many books each user has rated. I consider a book rated if it has any two of the following:
-Overall rating (ratings table)
-sub rating (ratings table)
-tag (book_tags table)
-list (book_lists table)
*Related tables: users, tags, lists
The problem: I have 10 sub ratings and the two overall ratings all in the ratings table, each in its own column (guessing this is bad, but not sure). Should I instead have a ratings table (12 rows, one per rating type) and a book_ratings table where each row records one type of rating given by a user for a book?
-e.g. book_ratings: id | user_id | book_id | rating_id
If yes, what happens if there are 500k books, 12 rating types per book, 10,000 users, and a total of 5 billion rows in that book_ratings table? Is that going to run super slow? Another consideration is that I may want to add more sub rating types in the future, which is partially why I think it might be valuable to change it, but it's a lot of work, so I wanted to check first.
Thanks!

You should model your system to make it usable and extendable. Having 12 rating columns is going to cause you a lot of pain when you want to aggregate results etc. There are plenty of examples of those kinds of pains on this site.
As it grows you optimize by adding indexes, clustering, data partitioning, etc.
But if you know you are going to have massive amounts of data right away, you might want to consider some "Big Data" solutions and perhaps go the NoSQL way.

Yes, I would change the structure as you describe - it is more flexible and more 'correct' (normalized).
You'll have 5 billion rows (which would indeed be bad) only if EACH user gives ALL the ratings to ALL the books, and that seems unlikely. A great majority of users will not rate anything, and a great majority of books will not attract any ratings.
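To make that concrete, here is a minimal sketch of the normalized layout both answers are describing. All table and column names here are my own invention, not from the question:

CREATE TABLE rating_types (
    id   TINYINT UNSIGNED PRIMARY KEY,
    name VARCHAR(50) NOT NULL            -- 'overall', plus one row per sub rating
);

CREATE TABLE book_ratings (
    user_id        INT NOT NULL,
    book_id        INT NOT NULL,
    rating_type_id TINYINT UNSIGNED NOT NULL,
    value          TINYINT NOT NULL,     -- the score the user gave
    PRIMARY KEY (user_id, book_id, rating_type_id),
    KEY idx_book (book_id)               -- supports per-book aggregation
);

-- Adding a 13th rating type later is a data change, not a schema change:
INSERT INTO rating_types (id, name) VALUES (13, 'new_sub_rating');

The composite primary key enforces one row per user/book/rating type, and "how many books has each user rated" becomes a GROUP BY over book_ratings instead of a check across 12 columns.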

Related

Database too large - store as a row or serialise data?

I have a Quiz App that consists of many Modules containing Questions. Each question has many Categories (many-to-many). Every time a quiz is completed, the user's score is sent to the Scores table. (I've attached an entity-relationship diagram for clarification purposes.)
I have been thinking of breaking down the user scores according to categories (i.e. when completing a quiz, a user will get an overall quiz score along with a score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the Scores table. With multiple users, the Scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the Scores table. For example, if I wanted to calculate the average score for a user for a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the data as JSON, but of course, this has its limitations.
The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following columns (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?), and creation date.
As your table grows, partition it. Potential candidates could be creation-date buckets (by year, or year/month, would probably work well), or maybe, if students are in particular classes, you could have class buckets.
As your table grows even more, you could move the partitions to different disks (how you partition the data will be even more crucial here, because if a query has to go across too many partitions you may end up hurting performance instead of helping).
Beyond that, another suggestion would be to break the scores table into two: score and scoreDetail. The score table would contain top-level stuff like user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet 80% of the time people only care about the top score anyway. This way you only reach out to the bigger table when someone wants the details of their score in a particular exam.
Finally, you probably want to have the score by category in rows rather than columns to make it easier to do analysis and aggregations, but this is not necessarily a performance booster and really depends on how you plan to use the data.
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest just creating a random data set that represents a few years' worth of data and playing with that.
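As a rough sketch of the score/scoreDetail split and the date-based partitioning suggested above (all names and types are illustrative, since the real schema is only in the attached diagram):

CREATE TABLE score (
    id            BIGINT NOT NULL AUTO_INCREMENT,
    user_id       INT NOT NULL,
    exam_id       INT NOT NULL,
    overall_score DECIMAL(5,2) NOT NULL,
    created_at    DATE NOT NULL,
    PRIMARY KEY (id, created_at),            -- MySQL requires the partition key in every unique key
    KEY idx_user_exam (user_id, exam_id, created_at)
)
PARTITION BY RANGE (YEAR(created_at)) (      -- the year buckets suggested above
    PARTITION p2023 VALUES LESS THAN (2024),
    PARTITION p2024 VALUES LESS THAN (2025),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- no FOREIGN KEY back to score: MySQL partitioned tables don't support them
CREATE TABLE score_detail (
    score_id       BIGINT NOT NULL,
    category_id    INT NOT NULL,             -- one row per category, not one column per category
    category_score DECIMAL(5,2) NOT NULL,
    PRIMARY KEY (score_id, category_id)
);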
I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you would need to deserialize every time you want to read from your table. That would eliminate the possibility of using SQL statements for sorting, filtering, JOINing, etc.
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.
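For instance, assuming the row-per-category score/score_detail layout sketched earlier, the "average score for a user for a specific category" question stays plain SQL (42 and 7 are placeholder ids):

SELECT AVG(d.category_score) AS avg_category_score
FROM score AS s
JOIN score_detail AS d ON d.score_id = s.id
WHERE s.user_id = 42
  AND d.category_id = 7;

With serialized JSON, every matching row would first have to be pulled out and decoded in application code.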

Database design - performance

I'm designing a database for a club. There is a "member" entity that practices activities (sports, etc.). These activities have different categories according to the age of the member. Each activity in its respective category will have schedules. In turn, the members will have their own schedules because, for example, they could attend only 2 days of an activity that takes place 4 days a week.
I solved it like this:
https://ibb.co/dsJkHz
My question is: Is this a viable method for solving this problem? It seems a bit complicated, and I don't think it's ideal/optimized for performance. I'm sure there must be another way. Thanks!
Performance will be fine as long as you have indexes on the id columns that will be searched or joined on. The other option is enums in some places to limit the joins, which you can get away with if there are 4 categories that will never or only rarely change, but it looks like you're trying to follow best practices and plan for the future.
Your schema gives categories and activities their own tables because, even though they could be a fixed list, that list can grow or names can change. From what I can tell, the other tables are required for one-to-many relationships, so even though other options are available, they seem like a bad route to take.
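Since the actual schema is only visible in the linked image, here is a hypothetical example of what "indexes on the id columns that will be searched or joined on" looks like for one of the junction tables (names invented):

CREATE TABLE member_schedule (
    member_id   INT NOT NULL,
    schedule_id INT NOT NULL,
    PRIMARY KEY (member_id, schedule_id),   -- covers lookups by member
    KEY idx_schedule (schedule_id)          -- covers joins from the schedule side
);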

How do I design my tables to allow for a massive number of rows?

I'm working on a website that, boiled down, will work for the end user as a glorified to-do list. SQL is where I have the least experience I need for doing this well. Ignoring whether or not this will actually get a user base that massive in reality, how could I design for the scenario where I have tens of thousands or more people adding dozens of their own items to this table?
Here's the layout I currently have planned for the Items table:
ItemID | UserID| Content | Subcontent | Parent | Hierarchy | Days | Note | Alert | Deadline
So, the items created by each user are contained in that table, to be queried using something like "SELECT * FROM Items WHERE UserID = $thisUser", then placed on the page and handled correctly using the other information from that row.
With this layout, would hundreds of thousands or millions of entries become a serious performance problem? If you have any suggestions or resources that you think would be helpful, I would appreciate them. Thank you.
If you index the column UserID, some hundred thousand or a few million rows should be no big problem. If we speak of even more rows, maybe several tens or hundreds of millions, you should think of a way to evenly distribute the items according to their users. However, the row count is only one aspect influencing performance. The modelling of your data and the code which queries your database are likely to have more impact.
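Concretely, that index is a single statement, using the column names from the question:

CREATE INDEX idx_user ON Items (UserID);

-- the per-user lookup can then use the index instead of scanning the whole table
-- ($thisUser is the question's own application-side placeholder):
SELECT * FROM Items WHERE UserID = $thisUser;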
I believe you need to rethink your database layout. Rarely are individual users going to use the same content. I think you should have a table for each user; then it would be UserID | ItemID | Content | Subcontent....
This allows you to maintain your database when a user quits.

Is having millions of tables and millions of rows within them a common practice in MySQL database design?

I am doing database design for an upcoming web app, and I was wondering, from anybody using MySQL heavily in their current web apps, whether this sort of design is efficient for a web app with, let's say, 80,000 users.
1 DB
in DB, millions of tables for features for each user, and within each table, potentially millions of rows.
While this design is very dynamic and scales nicely, I was wondering a few things.
1. Is this a common design in web applications today?
2. How would this perform, time-wise, when querying millions of rows?
3. How does a DB perform if it contains MILLIONS of tables? (Again, time-wise, and is this even possible?)
4. If it performs well under the above conditions, how would it perform under strenuous load, if all 80,000 users accessed the DB 20-30 times each for 10-15 minute sessions every day?
5. How much server space would this require, very generally speaking (reiterating: millions of tables, each containing potentially millions of rows, with 10-15 columns filled with text)?
Any help is appreciated.
1 - Definitely not. Almost anyone you ask will tell you millions of tables is a terrible idea.
2 - Millions of ROWS is common, so just fine.
3 - Probably terribly, especially if the queries are written by someone who thinks it's OK to have millions of tables. That tells me this is someone who doesn't understand databases very well.
4 - See #3
5 - Impossible to tell. You will have a lot of extra overhead from the extra tables as they all need extra metadata. Space needed will depend on indexes and how wide the tables are, along with a lot of other factors.
In short, this is a very very very seriously bad idea and you should not do it.
Millions of rows is perfectly normal usage, and can respond quickly if properly optimized and indexed.
Millions of tables is an indication that you've made a major goof in how you've architected your application. Millions of rows times millions of tables times 80,000 users means what, 80 quadrillion records? I strongly doubt you have that much data.
Having millions of rows in a table is perfectly normal and MySQL can handle this easily, as long as you use appropriate indexes.
Having millions of tables on the other hand seems like a bad design.
In addition to what others have said, don't forget that finding the right table based on the given table name also takes time. How much time? Well, this is internal to the DBMS and likely not documented, but probably more than you think.
So, a query searching for a row can take either:
1. the time to find the table + the time to find the row in a (relatively) small table, or
2. just the time to find a row in one large table.
Option (2) is likely to be faster.
Also, frequently using different table names in your queries makes query preparation less effective.
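To illustrate with MySQL's own prepared-statement syntax (items and user_id are placeholder names): one table means one statement that every user's query can reuse, while a table per user leaves nothing to parameterize.

PREPARE get_items FROM 'SELECT * FROM items WHERE user_id = ?';
SET @uid = 42;
EXECUTE get_items USING @uid;   -- reused for any user, parsed once
-- with a table per user, 'SELECT * FROM items_42', 'SELECT * FROM items_43', ...
-- would each have to be parsed separately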
If you are thinking of having millions of tables, I can't imagine that you are actually designing millions of logically distinct tables. Rather, I would strongly suspect that you are creating tables dynamically based on data. That is, rather than creating a FIELD for, say, the user id, and storing one or more records for each user, you are contemplating creating a new TABLE for each user id. And then you'll have thousands and thousands of tables that all have exactly the same fields in them. If that's what you're up to: Don't. Stop.
A table should represent a logical TYPE of thing that you want to store data for. You might make a city table, and then have one record for each city. One of the fields in the city table might indicate what country that city is in. DO NOT create a separate table for each country holding all the cities for each country. France and Germany are both examples of "country" and should go in the same table. They are not different kinds of thing, a France-thing and a Germany-thing.
Here's the key question to ask: What data do I want to keep in each record? If you have 1,000 tables that all have exactly the same columns, then almost surely this should be one table with a field that has 1,000 possible values. If you really seriously keep totally different information about France than you keep about Germany, like for France you want a list of provinces with capital city and the population but for Germany you want a list of companies with industry and chairman of the board of directors, then okay, those should be two different tables. But at that point the difference is likely NOT France versus Germany but something else.
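In other words, a sketch of the city example (column names invented):

-- one city table with a country field, not one table per country:
CREATE TABLE city (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    name         VARCHAR(100) NOT NULL,
    country_code CHAR(2) NOT NULL,    -- 'FR', 'DE', ...: values in a field, not separate tables
    KEY idx_country (country_code)
);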
1] Look up dimension and fact tables in database design. You can start with http://en.wikipedia.org/wiki/Database_model#Dimensional_model.
2] Be careful about indexing too much: for high write/update loads you don't want to index too much, because that gets very expensive (think of the average or worst case of balancing a B-tree). For high-read tables, index only the fields you search by. For example, in
select * from mytable where A = '' and B = '';
you may want to index A and B.
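For that query, a single composite index covers both conditions (assuming the mytable name from the example above):

CREATE INDEX idx_a_b ON mytable (A, B);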
3] It may not be necessary to start thinking about replication yet, but since you are talking about 10^6 entries and tables, maybe you should.
So, instead of me telling you a flat no for the millions of tables question (and yes my answer is NO), I think a little research will serve you better. As far as millions of records, it hints that you need to start thinking about "scaling out" -- as opposed to "scaling up."
SQL Server has many ways you can support large tables. You may find some help by splitting your indexes across multiple partitions (filegroups), placing large tables on their own filegroup, and putting the indexes for the large table on another set of filegroups.
A filegroup is basically a named set of files, typically placed on a separate drive. Each drive has its own dedicated read and write heads; the more drives, the more heads are searching the indexes at a time, and thus the faster your records are found.
Here is a page that talks in detail about filegroups.
http://cm-bloggers.blogspot.com/2009/04/table-and-index-partitioning-in-sql.html
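A bare-bones illustration, in SQL Server syntax since that is what this answer describes (the boundary values and filegroup names are made up, and the filegroups must already exist in the database):

CREATE PARTITION FUNCTION pf_by_year (int)
    AS RANGE RIGHT FOR VALUES (2023, 2024);          -- 3 ranges: <2023, 2023, >=2024

CREATE PARTITION SCHEME ps_by_year
    AS PARTITION pf_by_year TO (fg_old, fg_2023, fg_2024);

CREATE TABLE big_scores (
    id         INT NOT NULL,
    score_year INT NOT NULL
) ON ps_by_year (score_year);                        -- each row lands on the matching filegroup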

The right way to plan my database

I'm creating a music sharing site, so each user can set up his account, add songs, etc.
I would like to add the ability for users to give points to one another based on whether they like the song.
For example, user1 has some songs in his collection; user2 likes a song, so he clicks "I like", resulting in a point being given to user1.
Now I would like to know whether my idea of creating a "Points" table in my database is right.
I decided to create a separate table to hold data about points. This table would have an id column, who gave the point to whom, a song id column, a date column, etc. My concern is that in my table I will have a row for every single point that has been given.
Of course it's nice to have all this specific info, but I'm not sure if this is the right way to go, or whether I'm wasting resources, space, and so on.
Maybe I could redesign my songs table to have an additional points column, and I would just count how many points each song has.
I need some advice on this. Maybe I shouldn't really worry about my design, optimization, and scalability, since today's technology is so fast and powerful and database queries are nearly instant.
IMO, it's better to use a transactional table to track the points given to a user based on their song lists. Consider how Stack Overflow (SO) works: if you up-vote a question or answer, you can remove your vote at a later time. If SO used a summation column, it would be impossible to support this type of functionality.
I wouldn't worry too much about the number of rows in your points table, as it will probably be pretty narrow; generously, 10 columns at the most. Not to mention the table would be a pivot table between users, so it would consist mostly of int values.
Part of the issue is really simple. If you need to know
who gave a point
to whom
for which song
on which date
then you need to record all that information.
Wasn't that simple?
If you only need to know the totals, then you can just store the totals.
As for scale, say you have 20,000 users, each with an average of 200 songs. Let's say 1 in 10 songs gets any up-votes, averaging 30 per song. That's 4 million user-songs; 400,000 that get up-votes; at 30 per song, you have 12 million rows. That's not that many. If the table gets too many rows, partitioning on "to whom" would speed things up a lot.
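Putting both answers together, a sketch of such a points table (names invented; the unique key is what lets a vote be removed later, Stack Overflow-style):

CREATE TABLE points (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    from_user_id INT NOT NULL,                      -- who gave the point
    to_user_id   INT NOT NULL,                      -- to whom
    song_id      INT NOT NULL,                      -- for which song
    created_at   DATETIME NOT NULL,                 -- on which date
    UNIQUE KEY uniq_vote (from_user_id, song_id),   -- one vote per user per song; DELETE to un-vote
    KEY idx_to_user (to_user_id)                    -- supports totals and partitioning on "to whom"
);

-- totals on demand, instead of maintaining a counter column:
SELECT to_user_id, COUNT(*) AS total_points
FROM points
GROUP BY to_user_id;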