I'm creating a music sharing site where each user can set up an account, add songs, etc.
I would like to add the ability for users to give points to one another based on whether they like a song.
For example, user1 has some songs in his collection; user2 likes one of them and clicks "I like", which gives a point to user1.
Now I would like to know whether my idea of creating a "points" table in my database is the right approach.
I decided to create a separate table to hold data about points; this table would have an id column, who gave the point to whom, a song id column, a date column, etc. My concern is that this table will have a row for every single point that has ever been given.
Of course it's nice to have all this specific info, but I'm not sure if this is the right way to go, or whether I'm wasting resources, space, and so on.
Maybe I could redesign my songs table to have an additional points column, and I would just count how many points each song has.
I need some advice on this. Maybe I shouldn't really worry about my design, optimization, and scalability, since today's technology is so fast and powerful and database queries are nearly instant.
IMO, it's better to use a transactional table to track the points given to a user based on their song lists. Consider how Stack Overflow (SO) works: if you up-vote a question or answer, you can remove your vote at a later time. If SO used a summation column, it would be impossible to support this type of functionality.
I wouldn't worry too much about the number of rows in your points table, as it will probably be pretty narrow; generously, 10 columns at most. Not to mention the table would be a pivot table between users, so it would be comprised mostly of int values.
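As a concrete sketch (MySQL-style syntax; table and column names are only illustrative, not your actual schema), the transactional table could look like this:

    -- Minimal sketch of a transactional points table.
    CREATE TABLE points (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
        from_user_id INT UNSIGNED NOT NULL,   -- who gave the point
        to_user_id   INT UNSIGNED NOT NULL,   -- who received it
        song_id      INT UNSIGNED NOT NULL,   -- for which song
        given_at     DATETIME     NOT NULL,   -- on which date
        PRIMARY KEY (id),
        UNIQUE KEY uq_one_vote (from_user_id, song_id)  -- one point per user per song
    );

    -- Removing a vote later (the SO behavior) is then trivial:
    DELETE FROM points WHERE from_user_id = 42 AND song_id = 1001;

    -- And per-song totals are a simple aggregate:
    SELECT song_id, COUNT(*) AS points
    FROM points
    GROUP BY song_id;

The UNIQUE key doubles as the guard against double-voting; drop it if you want to allow multiple points per user per song.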
Part of the issue is really simple. If you need to know
who gave a point
to whom
for which song
on which date
then you need to record all that information.
Wasn't that simple?
If you only need to know the totals, then you can just store the totals.
As for scale, say you have 20,000 users, each with an average of 200 songs. Let's say 1 in 10 songs gets any up-votes, averaging 30 per song. That's 4 million user-songs; 400,000 of them get up-votes, and at 30 votes per song you have 12 million rows. That's not that many. If the table gets too many rows, partitioning on "to whom" would speed things up a lot.
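For what that could look like, here is a MySQL-flavored sketch (other engines differ; names are illustrative). Note that in MySQL every unique key must include the partitioning column, which is why the primary key below is composite:

    -- Hash-partition the points table on "to whom", so queries for one
    -- user's received points touch a single partition.
    CREATE TABLE points (
        id           INT UNSIGNED NOT NULL AUTO_INCREMENT,
        to_user_id   INT UNSIGNED NOT NULL,
        from_user_id INT UNSIGNED NOT NULL,
        song_id      INT UNSIGNED NOT NULL,
        given_at     DATETIME     NOT NULL,
        PRIMARY KEY (id, to_user_id)
    )
    PARTITION BY HASH (to_user_id)
    PARTITIONS 16;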
I have a quiz app that consists of many modules containing questions. Each question has many categories (many-to-many). Every time a quiz is completed, the user's score is sent to the scores table. (I've attached an entity-relationship diagram for clarification purposes.)
I have been thinking of breaking down the user scores according to categories (i.e. a user when completing a quiz will get an overall quiz score along with score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the scores table. With multiple users, the scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the scores table, for example when calculating the average score for a user in a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the scores as JSON, but of course this has its limitations.
The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?), and creation date.
As your table grows, partition it. Potential candidates could be creation-date buckets (by year, or year/month, would probably work well), or, if students are in particular classes, class buckets.
As your table grows even more, you could move the partitions to different disks (how you partitioned the data will be even more crucial here, because if a query has to go across too many partitions you may end up hurting performance instead of helping).
Beyond that, another suggestion would be to break the scores table into two: score and scoreDetail. The score table would contain top-level data such as user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet that 80% of the time people only care about the top score anyway. This way you only reach into the bigger table when someone wants the details of their score in a particular exam.
Finally, you probably want to have the scores by category in rows rather than columns, to make it easier to do analysis and aggregations; this is not necessarily a performance booster, though, and really depends on how you plan to use the data. A sketch of the split follows below.
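Here is a minimal sketch of that score/scoreDetail split, with category scores stored as rows (MySQL-style syntax; table and column names are illustrative, and the score ranges are an assumption):

    -- Parent table: one row per completed quiz, carrying the top-level score.
    CREATE TABLE score (
        id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id     INT UNSIGNED NOT NULL,
        quiz_id     INT UNSIGNED NOT NULL,
        total_score DECIMAL(5,2) NOT NULL,
        created_at  DATETIME     NOT NULL,
        KEY idx_user_quiz (user_id, quiz_id, created_at)  -- covers common lookups
    );

    -- Child table: one row per category per completed quiz (rows, not columns).
    CREATE TABLE score_detail (
        id             INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        score_id       INT UNSIGNED NOT NULL,
        category_id    INT UNSIGNED NOT NULL,
        category_score DECIMAL(5,2) NOT NULL,
        KEY idx_score (score_id),
        KEY idx_category (category_id),
        FOREIGN KEY (score_id) REFERENCES score (id)
    );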
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest just creating a random data set that represents a few years worth of data and play with that.
I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you would need to deserialize every time you want to read from the table, which eliminates the possibility of using SQL for sorting, filtering, joining, etc.
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.
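For instance, the "average score for a user in a specific category" from the question is a single indexed query when the scores live in rows (reusing the illustrative score/score_detail tables sketched earlier), but it cannot be expressed in SQL at all once the scores are serialized into a blob:

    -- Hypothetical ids, just for illustration.
    SELECT AVG(sd.category_score) AS avg_category_score
    FROM score AS s
    JOIN score_detail AS sd ON sd.score_id = s.id
    WHERE s.user_id = 42        -- the user in question
      AND sd.category_id = 7;   -- the category in question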
I want to design a database for storing the location (latitude and longitude) of several buses (it would be a tracking app). I will have one device on each bus, sending its location to the server every 10 seconds, to be stored in the database. I will also have clients that look up the location of a given bus.
I have thought about 2 possible solutions. One would be creating 2 tables: one to store the complete report, with the following columns: bus number, timestamp, latitude, and longitude; and a second table with the same columns, but storing just the last location, so its entries would be updated all the time. In the first table I would have a lot of rows, growing too fast; that's why I thought about creating the second table, to store just the last location and improve the performance of the client queries.
And the reason I don't keep just this second table (which would be enough for the clients' functionality) is that I want a complete report in order to have statistics.
The second solution would be creating just one table with all the reports, and when a client wants to know the location, I would look for the latest entry for that specific bus, ordering by timestamp. But as I said before, if the table grows too much, this may take too long in the future. Another possibility is to clean the database entries every week, for example: compute the statistics, store them in one table, and clear the main one.
In addition, I would also like to know what the primary key of that table should be. I have read that using a timestamp as the primary key is not advised, because if the difference in time is too small, the database will treat two reports as having the same timestamp and won't store one of them. But as I will be reporting every 10 seconds, maybe it wouldn't be a problem. Another option is to add an ID for every report, but I think that would just be a waste of space. Or maybe I could have no primary key at all (something that I think is not advised either...).
For the second table, I would use the line number of the bus as the primary key, because I would have just one entry (the last one) for every bus, so there wouldn't be repetitions.
Any help with all of this? I'm quite a newbie in this area, as you can see, and I would appreciate it very much.
So to sum up, my questions are:
Would having 2 tables improve performance, or is just 1 enough?
What would be the primary key in the table with all the reports?
Thank you for reading and for helping :D
Just go with the two tables approach. Simple and clean.
IMHO, stick with a simple auto-increment integer key on the big table. Many engines will create one for you anyway if you don't (so you're not saving anything by not creating it).
As a simple number, it's also great for ordering results by date (technically, the order they were received in, but that should generally be the same). It will perform better than date ordering, and doesn't require creating another index.
(A timestamp is not a good key: while it should generally be unique for one bus, what if several buses happen to send their updates in the same second?)
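A sketch of the two-table approach (assuming MySQL; ON DUPLICATE KEY UPDATE is MySQL-specific, and other engines have MERGE/upsert equivalents; names and coordinates are illustrative):

    -- Full history: one row per report, auto-increment surrogate key.
    CREATE TABLE bus_report (
        id          BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        bus_id      INT UNSIGNED NOT NULL,
        reported_at DATETIME NOT NULL,
        latitude    DECIMAL(9,6) NOT NULL,
        longitude   DECIMAL(9,6) NOT NULL,
        KEY idx_bus_time (bus_id, reported_at)   -- for per-bus statistics
    );

    -- Last known location: exactly one row per bus, keyed by the bus itself.
    CREATE TABLE bus_last_location (
        bus_id      INT UNSIGNED NOT NULL PRIMARY KEY,
        reported_at DATETIME NOT NULL,
        latitude    DECIMAL(9,6) NOT NULL,
        longitude   DECIMAL(9,6) NOT NULL
    );

    -- Every 10-second report does both writes:
    INSERT INTO bus_report (bus_id, reported_at, latitude, longitude)
    VALUES (17, NOW(), 40.416775, -3.703790);

    INSERT INTO bus_last_location (bus_id, reported_at, latitude, longitude)
    VALUES (17, NOW(), 40.416775, -3.703790)
    ON DUPLICATE KEY UPDATE
        reported_at = VALUES(reported_at),
        latitude    = VALUES(latitude),
        longitude   = VALUES(longitude);

Client lookups then hit only the small bus_last_location table, and the weekly statistics/cleanup job reads only bus_report.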
We have 10 years of archived sports data, spread across separate databases.
Trying to consolidate all the data into a single database. Since we'll be handling 10X the number of records, I'm trying to make schema redesign changes now to avoid potential performance hit.
One change entails breaking up the team roster table into 2 tables: one, a players table that stores fixed data (playerID, firstName, lastName, birthDate, etc.), and another, the new roster table, that stores variable data about a player (yearInSchool, jerseyNumber, position, height, weight, etc.). This will allow us to, among other things, create 4-year career aggregate views of player stats.
Fair enough, makes sense, but then again, when I look at queries that tally, for example, a player's aggregate scoring stats, I have to join on both the players and roster tables, in addition to the scoring and schedule tables, in order to get all the information needed.
Where I'm considering denormalizing is with the player's first and last name. If I store first and last name in the roster table, then I can omit the players table from the equation for stat queries, which I'm assuming will be a big performance win given that the total record count per table will be over 100K (i.e. most query joins will be across tables that each contain at least 100K records, and up to, for now, 300K).
So, where to draw the line with denormalization in this case? I assume duplicating first and last name is OK. Generally I enjoy non-duplication/integrity of data, but I suspect site visitors enjoy performance more!
First thought is, are you sure you've exhausted tuning options to get good SELECT performance without denormalising here?
I'm very much with you in the sense of "no sacred cows" and denormalise when necessary, but this sounds like a case where it shouldn't be too hard to get decent performance.
Of course you've done your own exploration; if you've ruled that out, then my personal opinion is that it's acceptable, yeah.
One issue: what happens if a player's name changes? Can it do so in your system? Would you use a transaction to update all roster details in a single COMMIT? For a historical-records DB this could be totally irrelevant, mind you.
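On the tuning point: before duplicating the names, it may be worth confirming the players join is as cheap as it can be. A hedged sketch below, where the roster/scoring join columns (rosterID, points) are guesses at your schema: with a covering index, a stats query can read firstName/lastName straight out of the index without touching the players table data at all.

    -- Covering index: the name lookup in stat queries can be satisfied
    -- entirely from this index.
    CREATE INDEX idx_player_names ON players (playerID, firstName, lastName);

    -- Hypothetical aggregate query; rosterID/points are assumed names.
    SELECT p.firstName, p.lastName, SUM(sc.points) AS careerPoints
    FROM players AS p
    JOIN roster  AS r  ON r.playerID  = p.playerID
    JOIN scoring AS sc ON sc.rosterID = r.rosterID
    GROUP BY p.playerID, p.firstName, p.lastName;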
I am really only a hobbyist, with aspirations far too grand. That said, I am trying to figure out the right way to create my database so that database changes don't require client refactoring, but it's also fast. Please respond as if I don't understand typical development or DBA terminology well.
The situation: I was trying to determine how many books each user has rated. I consider a book rated if it has any two of the following:
-Overall rating (ratings table)
-sub rating (ratings table)
-tag (book_tags table)
-list (book_lists table)
*Related tables: users, tags, lists
The problem: I have 10 sub-ratings and the two overall ratings all in the ratings table, each in its own column (guessing this is bad, but not sure). Should I instead have a ratings table (12 rows, one per rating type) and a book_ratings table where each row records one type of rating given by a user?
-e.g. book_ratings: id | user_id | book_id | rating_id
If yes, what happens if there are 500k books, 12 rating types per book, 10,000 users, and a total of 5 billion rows in that book_ratings table? Is that going to run super slow? Another consideration is that I may want to add more sub-rating types in the future, which is partly why I think it might be valuable to change it, but it's a lot of work, so I wanted to check first.
Thanks!
You should model your system to make it usable and extendable. Having 12 rating columns is going to cause you a lot of pain when you want to aggregate results etc. There are plenty of examples of those kinds of pains on this site.
As it grows you optimize by adding indexes, clustering, data partitioning, etc.
But if you know you are going to have massive amounts of data right away, you might want to consider some "Big Data" solutions and perhaps go the NoSQL way.
Yes, I would change the structure as you describe - it is more flexible and more 'correct' (normalized).
You'll have 5 billion rows (which would indeed be bad) only if EACH user gives ALL the ratings to ALL the books, which seems unlikely. A great majority of users will not rate anything, and a great majority of books will not attract any ratings.
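A sketch of that structure (MySQL-style syntax; names are illustrative, and the rating_value column is an assumption since your example row omits it), with the 12 rating types stored as data rather than columns so new sub-ratings are just new rows:

    -- Lookup table: one row per rating type (~12 rows today, grows
    -- without any schema change).
    CREATE TABLE rating_types (
        id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(50)  NOT NULL
    );

    -- One row per (user, book, rating type) actually given.
    CREATE TABLE book_ratings (
        id             BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        user_id        INT UNSIGNED NOT NULL,
        book_id        INT UNSIGNED NOT NULL,
        rating_type_id INT UNSIGNED NOT NULL,
        rating_value   TINYINT UNSIGNED NOT NULL,
        UNIQUE KEY uq_rating (user_id, book_id, rating_type_id),
        KEY idx_book (book_id)
    );

    -- "How many books has each user rated?", counting a book as rated
    -- when it has 2+ rating rows. (The tag/list signals from the
    -- question would be UNIONed in from their own tables similarly.)
    SELECT user_id, COUNT(*) AS rated_books
    FROM (
        SELECT user_id, book_id
        FROM book_ratings
        GROUP BY user_id, book_id
        HAVING COUNT(*) >= 2
    ) AS rated
    GROUP BY user_id;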
I've seen several questions on how to secure ranking systems (like star ratings for movies, products, etc.) and prevent their abuse, but nothing on actually implementing one. To simplify this question: security is not a concern for me; the people accessing this system are all trusted, and abuse of the ranking system, if it were to happen, is trivial and easier to revert than to cause. Anyway, I'm curious how to store the votes.
One thought is to have a votes table that logs each vote, and then, either immediately, at scheduled times, or on every load of the product (this seems inefficient, but maybe not), the votes are tallied and a double between 0 and 5 is written to the product's entry in the products table.
Alternatively, I store a total score and a number of votes in the products table, just divide the one by the other when I display, and add to the total and increment the count when someone votes.
Or is there a better way to do it that I haven't thought of? I'd kind of like to just have a 'rating' field in the product table, but I can't think of a way to update votes without some additional data.
Again, data integrity is important but by no means essential. Any thoughts?
I would keep a "score" with your products, but would also keep a votes table to see who voted for what. When somebody votes: insert the vote, update the product score.
This allows quick sorting, and you also have a table you can recalculate the scores from and use to stop people double-voting.
There is no need to wait to write the vote and update the score. Waiting will introduce problems, and if the system behaves like a traditional one (many more reads than writes), it gives you no benefits.
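A sketch of that write path (illustrative table and column names, MySQL-style syntax), wrapped in a transaction so the vote row and the cached score can't drift apart:

    START TRANSACTION;

    -- Log who voted for what: enables recalculation and double-vote checks.
    INSERT INTO votes (user_id, product_id, stars, voted_at)
    VALUES (42, 7, 4, NOW());

    -- Keep a running total and count on the product row; the displayed
    -- rating is simply total_stars / vote_count, no tallying pass needed.
    UPDATE products
    SET total_stars = total_stars + 4,
        vote_count  = vote_count + 1
    WHERE id = 7;

    COMMIT;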
You mean you'll store the votes separately in a table and then update the respective product's ranking in the products table on a defined schedule?
That seems like an inefficient way of storing it. Maybe there is a reason behind it, but why would you not want to store all votes in one table and reference each vote to its product? That gives you a real-time count.
In the UI you'd show the average of all the votes, rounded to the nearest integer. That would suffice, wouldn't it? Or am I missing something?
I agree with Oli. In addition, you can cache your score: update the product score in the cache, and have your application always pick up the cached value. Thus even on a page refresh you would get the latest score without hitting the database.