Implementing A Ranking System - mysql

I've seen several questions on how to secure and prevent abuse of ranking systems (like giving stars to movies, products, etc.) but nothing on actually implementing one. To simplify this question, security is not a concern to me: the people accessing this system are all trusted, and abuse of the ranking system, if it were to happen, would be trivial and easier to revert than to cause. Anyway, I'm curious how to store the votes.
One thought is to have a votes table, that logs each vote, and then either immediately, at scheduled times, or on every load of the product (this seems inefficient, but maybe not) the votes are tallied and a double between 0 and 5 is updated into the product's entry in the product table.
Alternatively, I store in the products table a total score and a number of votes, and just divide that out when I display, and add the vote to total and increment number when someone votes.
Or is there a better way to do it that I haven't thought of? I'd kind of like to just have a 'rating' field in the product table, but can't think of a way to update votes without some additional data.
Again, data integrity is important, but by no means essential. Any thoughts?

I would keep a "score" with your products but would also keep a vote table to see who voted for what. When somebody votes: insert the vote, then update the product score.
This allows quick sorting, and you also have a table to recalculate the scores from and to stop people double-voting.
There is no need to wait to write the vote and update the scores. That will introduce problems, and if it's acting like a traditional system (lots more reads than writes), it gives you no benefits.
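A minimal sketch of "insert vote, update product score", assuming Python's built-in sqlite3 as a stand-in for MySQL so it runs anywhere. The table and column names (products, votes, score_total, vote_count) and the helper functions are made up for illustration; the running total plus vote count is the variant described in the question.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical schema: a cached total/count on the product, plus one row per vote.
    conn.executescript("""
        CREATE TABLE products (
            id          INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            score_total INTEGER NOT NULL DEFAULT 0,
            vote_count  INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE votes (
            product_id INTEGER NOT NULL,
            user_id    INTEGER NOT NULL,
            score      INTEGER NOT NULL CHECK (score BETWEEN 0 AND 5),
            PRIMARY KEY (product_id, user_id),  -- one vote per user per product
            FOREIGN KEY (product_id) REFERENCES products(id)
        );
    """)

def cast_vote(conn, product_id, user_id, score):
    """Insert the vote and update the cached totals in a single transaction.

    Returns False if the vote is rejected (duplicate vote or out-of-range score).
    """
    try:
        with conn:  # commits on success, rolls back both statements on error
            conn.execute(
                "INSERT INTO votes (product_id, user_id, score) VALUES (?, ?, ?)",
                (product_id, user_id, score),
            )
            conn.execute(
                "UPDATE products SET score_total = score_total + ?, "
                "vote_count = vote_count + 1 WHERE id = ?",
                (score, product_id),
            )
    except sqlite3.IntegrityError:
        return False
    return True

def average_score(conn, product_id):
    """Divide the cached total by the cached count at display time."""
    total, count = conn.execute(
        "SELECT score_total, vote_count FROM products WHERE id = ?",
        (product_id,),
    ).fetchone()
    return total / count if count else None
```

Because the votes table is kept, the cached columns can always be rebuilt from it with a SUM and COUNT per product if they ever drift.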

You mean you'll store the votes separately in a table and then update the respective ranking of the product in the product table with a defined strategy?
That seems like an inefficient way of storing it. Maybe there is a reason behind it, but why would you not want to store all votes in one table and reference each vote to its product? This gives you a real-time count.
In the UI you'd calculate the average of all the votes, rounded to the nearest integer, and show that. That would suffice, wouldn't it? Or am I missing something?

I agree with Oli. In addition, you can cache your score. So you update the product score in the cache and your application always picks up the cache value. Thus even on a page refresh, you would get the latest score without hitting the database.

Related

Database too large - store as a row or serialise data?

I have Quiz App that constitutes many Modules containing Questions. Each question has many Categories (many-to-many). Every time a quiz is completed, the user's score is sent to the Scores Table. (I've attached an entity-relation diagram for clarification purposes).
I have been thinking of breaking down the user scores according to categories (i.e. a user when completing a quiz will get an overall quiz score along with score for each category).
However, if each quiz consists of at least 30 questions, there could be around 15-20 categories per quiz. So if one user completes a quiz, it would create a minimum of 15-20 rows in the Scores table. With multiple users, the Scores table would get really big really fast.
I assume this would affect the performance of retrieving data from the Scores table. For example, if I wanted to calculate the average score for a user for a specific category.
Does anyone have a better suggestion for how I can still be able to store scores based on categories?
I thought about serialising the JSON data, but of course, this has its limitations.
The DB should be able to handle millions of rows and there is nothing inherently wrong with your design. A few things I would suggest:
Put indexes on the following columns (or combinations of them): user id, exam id (which I assume is what you call scorable id), exam type (scorable type?) and creation date.
As your table grows, partition it. Potential candidates could be creation date buckets (by year or year/month would probably work well), or, if students are in particular classes, you could have class buckets.
As your table grows even more, you could move the partitions to different disks (how you partitioned the data will be even more crucial here, because if a query has to go across too many partitions you may end up hurting performance instead of helping).
Beyond that, another suggestion would be to break the scores table into two: score and scoreDetail. The score table would contain top-level stuff like user id, exam id, overall score, etc., while the child table would contain the scores by category (philosophy, etc.). I would bet 80% of the time people only care about the top score anyway. This way you only reach out to the bigger table when someone wants to get the details of their score in a particular exam.
Finally, you probably want to have the score by category in rows rather than columns to make it easier to do analysis and aggregations, but this is not necessarily a performance booster and really depends on how you plan to use the data.
In the end though, the best optimizations really depend on how you plan to use your data. I would suggest just creating a random data set that represents a few years worth of data and play with that.
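The score / scoreDetail split and the "average score for a user for a specific category" query from the question could look roughly like this. It is only a sketch using Python's sqlite3 in place of MySQL; the column names, the index and the two helper functions are assumptions, not the asker's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Parent table: one narrow row per completed quiz.
    CREATE TABLE score (
        id            INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL,
        exam_id       INTEGER NOT NULL,
        overall_score REAL NOT NULL,
        created_at    TEXT NOT NULL
    );
    CREATE INDEX idx_score_user_exam ON score (user_id, exam_id, created_at);

    -- Child table: one row per category, stored as rows rather than columns.
    CREATE TABLE score_detail (
        score_id       INTEGER NOT NULL,
        category       TEXT NOT NULL,
        category_score REAL NOT NULL,
        PRIMARY KEY (score_id, category),
        FOREIGN KEY (score_id) REFERENCES score(id)
    );
""")

def record_quiz(conn, user_id, exam_id, overall, per_category, created_at):
    """Store the overall score and its per-category breakdown together."""
    with conn:
        cur = conn.execute(
            "INSERT INTO score (user_id, exam_id, overall_score, created_at) "
            "VALUES (?, ?, ?, ?)",
            (user_id, exam_id, overall, created_at),
        )
        conn.executemany(
            "INSERT INTO score_detail (score_id, category, category_score) "
            "VALUES (?, ?, ?)",
            [(cur.lastrowid, cat, val) for cat, val in per_category.items()],
        )

def average_for_category(conn, user_id, category):
    """Average of one user's scores in one category; None if there are no rows."""
    return conn.execute(
        "SELECT AVG(d.category_score) FROM score_detail d "
        "JOIN score s ON s.id = d.score_id "
        "WHERE s.user_id = ? AND d.category = ?",
        (user_id, category),
    ).fetchone()[0]

record_quiz(conn, 1, 100, 70.0, {"philosophy": 60.0, "logic": 80.0}, "2024-01-05")
record_quiz(conn, 1, 101, 85.0, {"philosophy": 90.0}, "2024-02-10")
print(average_for_category(conn, 1, "philosophy"))  # 75.0
```

Loading a table like this with a few years' worth of random data, as suggested above, is a cheap way to check whether the indexes hold up before worrying about partitioning.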
I doubt that serialization would give you a significant benefit.
I would even dare to say that you'd kind of limit the power of a database by doing so.
Relational databases are designed to store a lot of rows in their tables, and they also usually use their own compression algorithms, so you should be fine.
Additionally, you will need to deserialize every time you want to read from your table. That would eliminate the possibility to use SQL statements for sorting, filtering, JOINing etc.
So in the end you will probably cause yourself more trouble by serializing than by simply storing the rows.

Is it necessary to link or join tables in MySQL?

I've created many databases before, but I have never linked two tables together. I've tried looking around, but cannot find WHY one would need to link two or more tables together.
There is a good tutorial here that goes over database relationships, but does not explain why they would be needed. He just simply says that they are.
Are they truly necessary? I understand that (in his example) all orders have a customer, and so one would link the orders table to the customers table, but I just don't see why this would be absolutely necessary. I can create (and have created) shopping carts and other complex databases that work just fine without any table relationships.
I've just started playing around with MySQL Workbench v6.0 for a new project that has a fairly large and complex database, and so I'm wondering if I am losing anything by creating the entire project without relationships?
NOTE: Please let me know if this question is too general or off topic, and I will change it. I understand that a lot can be said about this topic, and so I'm really just looking to know if I am opening myself up to any security issues or significant performance issues by not using relationships. Please be specific in your response; "Yes you are opening yourself up to performance issues" is useless and not helpful for myself, nor for anyone else looking at this thread at a later date. Please include details and specifics in your response.
Thank you in advance!
As Sam D points out in the comments, entire books can be written about database design and why having tables with relationships can make a lot of sense.
That said, theoretically, you lose absolutely no expressive/computational power by just putting everything in the same table. The primary arguments against doing so likely deal with performance and maintenance issues that might arise.
The answer revolves around granularity, space consumption, speed, and detail.
Inherently different types of data will be more granular than others, as items can always be rolled up to a larger umbrella. For a chain of stores, items sold can be rolled up into transactions, transactions can be rolled up into register batches, register batches can be rolled up to store sales, store sales can be rolled up to company sales. The two options then are:
Store the data at the lowest grain in a single table
Store the data in separate tables that are dedicated to purpose
In the first case, there would be a lot of redundant data, as each item sold at location 3 of 430 would have store, date, batch, transaction, and item information. That redundant data takes up a large volume of space, when you could very easily create separated tables for their unique purpose.
In this example, let's say there were a thousand transactions a day totaling a million items sold from that one store. By creating separate tables you would have:
Stores = 430 records
Registers = 10 records
Transactions = 1000 records
Items sold = 1000000 records
I'm sure you're asking where the space savings come in ... it is in the detail for each record. The store table has names, address, phone, etc. The register has number, purchase date, manager who reconciles, etc. Transactions have customer, date, time, amount, tax, etc. If these values were duplicated for every record in a single table, it would be a massive redundancy of data, adding up to far more space consumption than would occur just by linking a field in one table (transaction id) to a field in another table (item id) to show that relationship.
Additionally, the amount of space consumed, as well as the size of the overall table, inversely impacts the speed of querying that data. By keeping tables small and capitalizing on the relationship identifiers to link between them, you can greatly reduce the response time. Every time the query engine needs to find a value, it traverses the table until it finds it (that is a grave oversimplification, but not untrue), so the larger and broader the table, the longer the seek time. These problems do not exist with insignificant volumes of data, but for organizations that deal with millions, billions, trillions of records (I work for one of them), storing everything in a single table would make the application unusable.
There is so very, very much more on this topic, but hopefully this gives a bit more insight.
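To make the store example above a bit more concrete, here is a rough sketch of the separate tables and a query that rolls items up to store sales. It uses Python's sqlite3 rather than MySQL so it runs as-is; the table names, columns and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Store details are written once, not repeated on every item sold.
    CREATE TABLE stores (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT NOT NULL
    );
    CREATE TABLE transactions (
        id       INTEGER PRIMARY KEY,
        store_id INTEGER NOT NULL,
        sold_on  TEXT NOT NULL,
        FOREIGN KEY (store_id) REFERENCES stores(id)
    );
    -- The lowest grain only carries a transaction id linking it upward.
    CREATE TABLE items_sold (
        id             INTEGER PRIMARY KEY,
        transaction_id INTEGER NOT NULL,
        product        TEXT NOT NULL,
        price          REAL NOT NULL,
        FOREIGN KEY (transaction_id) REFERENCES transactions(id)
    );

    INSERT INTO stores VALUES (3, 'Location 3', '1 Example Street');
    INSERT INTO transactions VALUES (1, 3, '2024-01-05'), (2, 3, '2024-01-05');
    INSERT INTO items_sold (transaction_id, product, price) VALUES
        (1, 'pen', 2.0), (1, 'notebook', 5.0), (2, 'pen', 2.0);
""")

def store_sales(conn, store_id):
    """Roll items up through transactions to one (name, item count, total) row."""
    return conn.execute(
        "SELECT s.name, COUNT(i.id), SUM(i.price) "
        "FROM stores s "
        "JOIN transactions t ON t.store_id = s.id "
        "JOIN items_sold i ON i.transaction_id = t.id "
        "WHERE s.id = ? "
        "GROUP BY s.id, s.name",
        (store_id,),
    ).fetchone()

print(store_sales(conn, 3))  # ('Location 3', 3, 9.0)
```

The store's name and address live in exactly one row here, however many items are sold, which is where the space saving described above comes from.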
Short answer: in a relational database like MySQL, yes. Check this out about referential integrity: http://databases.about.com/cs/administration/g/refintegrity.htm
That does not mean that you have to use a relational database for your project. In fact, the trend is to use non-relational databases (NoSQL), like MongoDB, to achieve the same results with better performance. More about RDBMS vs NoSQL: http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/
I think that with this example you will understand better:
Let's say we want to create an online store. We have at minimum Users, Payments and Events (events about the pages the user navigates to, or other actions). In this scenario we want to link the Users with the Payments in a secure and relational way. We do not want a Payment to be lost or assigned to another User. So we can use an RDBMS like MySQL to create the tables Users and Payments and link them with proper Foreign Keys. However, there are going to be a lot of events per user (maybe millions), and we need to track them quickly without killing the relational database. In that case a NoSQL database like MongoDB makes total sense.
To sum up, you can use a hybrid of SQL and NoSQL, but whether you use one, the other or both kinds of solution, do it properly.
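As an illustration of what "linked with proper Foreign Keys" buys you in the Users/Payments example, here is a small sketch. It assumes Python's sqlite3 instead of MySQL (SQLite only enforces foreign keys after the PRAGMA shown), and the table layout and the add_payment helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite-specific: foreign key enforcement is off unless enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (
        id    INTEGER PRIMARY KEY,
        email TEXT NOT NULL
    );
    CREATE TABLE payments (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        amount  REAL NOT NULL,
        FOREIGN KEY (user_id) REFERENCES users(id)
    );
    INSERT INTO users VALUES (1, 'a@example.com');
""")

def add_payment(conn, user_id, amount):
    """Record a payment; the database refuses it if the user does not exist."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO payments (user_id, amount) VALUES (?, ?)",
                (user_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def delete_user(conn, user_id):
    """Delete a user; refused while payments still point at that user."""
    try:
        with conn:
            conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
        return True
    except sqlite3.IntegrityError:
        return False

print(add_payment(conn, 1, 9.99))   # True: user 1 exists
print(add_payment(conn, 42, 5.00))  # False: no such user, so no orphaned payment
print(delete_user(conn, 1))         # False: the payment would lose its user
```

Without the foreign key, all three statements would succeed and it would be up to application code to notice the orphaned rows.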

Getting average or keeping temp data in db - performance concern

I am building a little app for users to create collections. I want to have a rating system in there. And now, since I want to cover all my fields, let's pretend that I have a lot of visitors. Performance comes into play, especially with rates.
Let's suppose that I have rates table, and there I have id, game_id, user_id and rate. Data comes simple, for every user there is one entry. Let's suppose again, that 1000 users will rate one game. And I want to print out average rate on that game subpage (and somewhere else, like on the games list). For now, I have two scenarios to go with:
Getting AVG each time the game is displayed.
Creating another column in games, called temprate, and storing the rate for the game there. It would be updated every time someone votes.
Both scenarios have obvious flaws. The first is more stressful for my host, since it will definitely consume more of the machine's resources. The second means more work when rating (getting all the game data, submitting the rate, getting the new AVG).
Please advise me: which scenario should I go with? Or maybe you have some other ideas?
I work with PDO and no framework.
So I've finally managed to solve this issue. I used file caching based on dumping arrays into files. I just go with something like if (cache) { $var = cache } else { $var = db }. I am using JG Cache for now, and I'll probably write something similar myself soon, but for now it's a great solution.
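The "if (cache) ... else ... db" pattern can be sketched like this. Note that this is an in-memory Python version with a time-to-live, not the file-based JG Cache mentioned above; the TtlCache class, its method names and the key format are all hypothetical.

```python
import time

class TtlCache:
    """Tiny cache-aside helper: serve a cached value until it expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, or call loader() (e.g. the AVG query) and cache it."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key, e.g. right after a new vote for that game is stored."""
        self._entries.pop(key, None)
```

A page would then call something like cache.get_or_load("game:42:avg", run_avg_query), where run_avg_query is whatever function executes the AVG query, and call invalidate for that key whenever a new rate is submitted.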
I'd have gone with a variation of your "number 2" solution (update a separate rating column), maybe in a separate table just for this.
If the number of writes becomes a problem, then that'd be well after select avg(foo) from ... does, and there are lots of ways to mitigate it by just updating the average rating periodically or just processing new votes every so often.
Eventually you likely can't just do an avg() anyway, because you have to check each vote for fraud, calculate a sort score and who knows what else.

The right way to plan my database

I'm creating a music sharing site, so each user can set up his account, add songs etc..
I would like to add the ability for users to give points to one another based on whether they like the song.
For example user1 has some songs in his collection, user2 likes a song so he clicks "I like" resulting in giving a point to user1.
Now I would like to know if my idea of creating the "Points table" in my database is somewhat right and correct.
I decided to create a separate table to hold data about points. This table would have an id column, who gave the point to whom, a song id column, a date column, etc. My concern is that in my table I will have a row for every single point that has been given.
Of course it's nice to have all this specific info, but I'm not sure if this is the right way to go, or whether I'm wasting resources, space and so on.
Maybe I could redesign my songs table to have an additional points column, and I would just count how many points each song has.
I need some advice on this. Maybe I shouldn't really worry about my design, optimization and scalability, since today's technology is so fast and powerful and database queries are almost instant.
IMO, it's better to use a transactional table to track the points given to a user based on their song lists. Consider how Stack Overflow (SO) works: if you up-vote a question or solution, you can remove your vote at a later time. If SO used only a summation column, it would be impossible to support this type of functionality.
I wouldn't worry too much about the number of rows in your points table, as it will probably be pretty narrow: generously, 10 columns at the most. Not to mention that the table would be a pivot table between users, so it would consist mostly of int values.
Part of the issue is really simple. If you need to know
who gave a point
to whom
for which song
on which date
then you need to record all that information.
Wasn't that simple?
If you only need to know the totals, then you can just store the totals.
As for scale, say you have 20,000 users, each with an average of 200 songs. Let's say 1 in 10 gets any up votes, averaging 30 per song. That's 4 million user-songs; 400,000 that get up votes, at 30 per song you have 12 million rows. That's not that many. If the table gets too many rows, partitioning on "to whom" would speed things up a lot.
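The estimate above, written out as plain arithmetic. The figures are the ones from this answer; only the variable names are mine.

```python
users = 20_000
songs_per_user = 200          # average collection size
votes_per_voted_song = 30     # average up votes on a song that gets any

total_songs = users * songs_per_user              # 4,000,000 user-songs
voted_songs = total_songs // 10                   # 1 in 10 gets votes: 400,000
point_rows = voted_songs * votes_per_voted_song   # 12,000,000 rows in the points table

print(f"{total_songs:,} songs, {voted_songs:,} with votes, {point_rows:,} point rows")
```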

MySQL: Where do I store each user's profile information in a website?

Sorry if this has been covered - I've been looking for hours but I think I simply lack the vocabulary to search effectively.
I'm trying to figure out how I should store profile information for each user. By profile information I don't mean information like email and the like, but more their preferences regarding the site I'm working on.
It's a language learning site, and I want users to be able to save their "progress", giving them the option to flag a lesson as learned.
I also want to keep track of which exercises they have done, so that I can try to only give them exercises they haven't done (or when they've used up the available exercises, start from the least recent). I'm just not sure where to store all this information.
Should I have a lookup table linking users to lessons? I fear this will get huge as the number of users and tables increases. Seeing as it's just a boolean, I considered giving each user an int (and later more ints as an array) where each bit represents a lesson, and performing bitwise operations on those numbers to get the information about which lessons they've saved... though that sounds like it could be cumbersome in the future.
As for remembering which exercises they've done, I fear this will lead to a huge amount of waste if I try to save it in mysql. Could I try to have this done on the user's computer using cookies, and anybody who has cookies disabled will simply have to deal with repeating exercise questions?
Maybe I should think about other tables and even other databases! I don't know!
Sorry for all the rambling nonsense. At the very least I'd appreciate some pointers towards what I need to read up on...
A lookup table between the users and the exercises is the simplest and most flexible, and you really shouldn't have to worry about the size of it. It'll have a user id, an exercise id, and some sort of progress variable, so (depending on your needs) that's probably going to be less than 10 bytes of space per row. 1 million rows wouldn't even take up 10MB of space.
I'd probably just have records only get created in the table once the user has made some sort of progress on a particular exercise. So if you ever try to look up a user's progress on an exercise and a row isn't found, that means that they haven't done anything on that exercise. That way you only need to create rows to represent progress, and it should keep the number fairly low overall.
You'll need a junction table to link each user to different exercises (many-to-many relationship):
user_id(int) exercise_id(int) learned(boolean)
You don't have to have entries for every possible combination, you can add each combination when a lesson is flagged as learned.
The bitwise method is going down a bad road, you'd need a bit for each lesson... it's not scalable.
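Here is a rough sketch of such a junction table for exercises, including the "give them one they haven't done, otherwise the least recent" rule from the question. It assumes Python's sqlite3 in place of MySQL; the done_at column and the mark_done / next_exercise helpers are my own additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    -- Junction table: a row exists only once the user has done the exercise.
    CREATE TABLE user_exercises (
        user_id     INTEGER NOT NULL,
        exercise_id INTEGER NOT NULL,
        done_at     TEXT NOT NULL,   -- ISO date, so text order is time order
        PRIMARY KEY (user_id, exercise_id),
        FOREIGN KEY (exercise_id) REFERENCES exercises(id)
    );
    INSERT INTO exercises VALUES (1, 'Greetings'), (2, 'Numbers'), (3, 'Colours');
""")

def mark_done(conn, user_id, exercise_id, done_at):
    """Record (or refresh) the date a user last did an exercise."""
    with conn:
        # INSERT OR REPLACE is SQLite syntax; in MySQL this would be
        # INSERT ... ON DUPLICATE KEY UPDATE or REPLACE INTO.
        conn.execute(
            "INSERT OR REPLACE INTO user_exercises (user_id, exercise_id, done_at) "
            "VALUES (?, ?, ?)",
            (user_id, exercise_id, done_at),
        )

def next_exercise(conn, user_id):
    """Prefer an exercise never done; otherwise the one done least recently."""
    row = conn.execute(
        "SELECT e.id FROM exercises e "
        "LEFT JOIN user_exercises ue "
        "  ON ue.exercise_id = e.id AND ue.user_id = ? "
        "ORDER BY ue.done_at IS NOT NULL, ue.done_at, e.id "
        "LIMIT 1",
        (user_id,),
    ).fetchone()
    return row[0] if row else None
```

A missing row simply means "not done yet", so nothing has to be created up front for every user and exercise combination.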