Is it a good idea to store like count in the following format?
like table:
u_id | post_id | user_id
And get a post's like count with COUNT(u_id)?
What if there were thousands of likes for each post? The like table is going to be filled with billions of rows after a few months.
What are other efficient ways to do so?
In two words, the answer is: yes, it is OK (to store data about each like any user makes on any post).
But let me just separate this into several questions:
Q. Is there any other way to get the count, or even something better than:
SELECT COUNT(u_id) FROM likes WHERE post_id = ?
A. Why not? You can save the count in your posts table and increase/decrease it every time a user likes/unlikes a post. You can set up a trigger (or stored procedure) to automate this action. Then, to get the counter, you just need:
SELECT counter FROM posts WHERE post_id = ?
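For example, a pair of triggers could maintain that counter automatically (a sketch in MySQL syntax; it assumes posts has a counter column defaulting to 0):

-- bump the cached counter whenever a like is added or removed
CREATE TRIGGER likes_after_insert AFTER INSERT ON likes
FOR EACH ROW
    UPDATE posts SET counter = counter + 1 WHERE post_id = NEW.post_id;

CREATE TRIGGER likes_after_delete AFTER DELETE ON likes
FOR EACH ROW
    UPDATE posts SET counter = counter - 1 WHERE post_id = OLD.post_id;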
If you like the previous Q/A and think it is a good idea, I have a next question:
Q. Why do we need the likes table then?
A. That depends on your application design and requirements. According to the column set you posted, u_id, post_id, user_id (I would even add a timestamp column), your requirement is to store info about the user as well as the post at the moment it is liked. That means you can recognize whether a user has already liked a post and refuse multilikes. If you don't care about multilikes, a historical timeline, or stats, you can drop the likes table.
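For instance, the table could be declared like this (a sketch in MySQL; liked_at is the extra timestamp column suggested above, and the UNIQUE key is what refuses multilikes):

CREATE TABLE likes (
    u_id     INT UNSIGNED NOT NULL AUTO_INCREMENT,
    post_id  INT UNSIGNED NOT NULL,
    user_id  INT UNSIGNED NOT NULL,
    liked_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (u_id),
    UNIQUE KEY uq_one_like_per_user (post_id, user_id)
);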
Last question I see here:
Q. The like table is going to be filled with billions of rows after a few months, isn't it?
A. I wish you that success, but IMHO you are 99% wrong. To get just 1M records you need 1,000 active users (which is a very, very good number for a personal startup; you are building the whole app with no architect or designer involved?), and every one of those users would have to like every one of 1,000 posts, if you have that many.
My point here is: fortunately, you have enough time before your database becomes big enough to really hurt your application. Until your table reaches 10-20M records, you need not worry about size and performance.
Related
Let's say I have a table with posts, and each post has the index of the topic it belongs to. And I have a table with topics, with an integer field representing the number of posts in each topic. When I create a new post, I increase this value by 1, and when I delete a post, I decrease it by 1.
I do it so I don't have to query the database each time I need to count the number of posts in certain topics.
But I have heard that this approach may not be safe to use, and the actual number of posts in the table may not match the stored value.
Is there any certain info on how safe this is?
Without transactions, the primary issue is timing. Consider a delete and two users:
Time   User 1              User 2
----   -----------------   ---------------
1      count = count - 1
2      update finishes     How many posts?
3      delete post         Count returned
4      delete finishes
Remember that actions such as updates and deletes take a finite amount of time, even if they take effect all at once. Because of this, User 2 will get the wrong number of posts. This is a race condition, and it may or may not be an issue in your application.
Transactions fix this particular problem, by ensuring that resetting the count and deleting the post both take effect "at the same time".
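In SQL, that could look like this (a sketch; the names topics, posts, post_count, topic_id, and post_id are assumptions based on the description above):

START TRANSACTION;
UPDATE topics SET post_count = post_count - 1 WHERE topic_id = ?;
DELETE FROM posts WHERE post_id = ?;
COMMIT;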
A secondary issue is data quality. Your data consistency checks are outside the database. Someone can come directly into the database and say "Oh, these posts from user X should be removed". That user might then delete those posts en masse -- but "forget" or not know to change the associated values.
This can be a big issue. Triggers should solve this problem.
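For example, a trigger moves the bookkeeping inside the database, so even a manual mass delete keeps the counter honest (a sketch in MySQL syntax, same assumed names as above; an AFTER INSERT twin would increment the counter):

CREATE TRIGGER posts_after_delete AFTER DELETE ON posts
FOR EACH ROW
    UPDATE topics SET post_count = post_count - 1 WHERE topic_id = OLD.topic_id;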
I would like to achieve something like you see on Facebook:
- Posting a status
- Commenting on a status
- Liking a status (likes for comments are not implemented yet)
My table structure is like this:
Posts     Users      Comments   Likes
-------   --------   --------   -------
ID        ID         ID         ID
UserID    Username   PostID     PostID
Content              UserID     UserID
Date                 Content
                     Date
So right now, when someone accesses the main page, the system shows the 10 latest posts. My query uses LEFT JOINs on these tables.
If, for example, there are 10 posts without any comments or likes, the query returns 10 records.
But each comment or like makes the query return an extra row, with NULL values in the columns that don't apply.
In the end, simply to retrieve 10 posts, my query returns at least 50 rows (if each post has some comments and likes).
I was wondering if that will cause problems in the future, and whether I should instead use multiple queries and parse the results into an array, like:
1. Select the 10 latest posts
2. Save the IDs into an array and all post data into a global array
3. Parse the array and build a prepared query for the comments, something like:
SELECT * FROM COMMENTS WHERE PostID IN (1, 2, 3, 4, 5, 6,...)
4. Save the result into the global array
5. Repeat for the likes table
I hope my explanation was clear enough :) Thank you
Doing one 50-row query reduces the overhead of communicating with the server; on the other hand, it adds processing after the rows are retrieved.
It really depends on the overall solution.
However, unless the application is performance-critical with the server being the bottleneck, I would go with the 10 result sets, one per post, probably using some class/widget/object to display each post on the page.
I'm not an expert, but if I understand correctly, your options are:
A) The single mega-query that returns a lot of NULLs and repeated values.
B) Three queries: one for all posts, one for all comments, and one for all likes (each joined with the users table; by "all" I mean all you are interested in). You can then process them into objects or structs or dictionaries in whatever language you are using to query the database.
I would go with the second (B) because it is easier, the increase in data volume seems benign, and it is probably a more flexible design.
What I would prefer NOT to do is one query per post. That would probably become a problem sooner rather than later, and certainly much sooner than A or B.
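For illustration, option B could be as few as three statements (a sketch using the table and column names from the question; the ID lists in queries 2 and 3 come from the first result set):

-- 1) the ten latest posts, with their authors
SELECT p.*, u.Username
FROM Posts p
JOIN Users u ON u.ID = p.UserID
ORDER BY p.Date DESC
LIMIT 10;

-- 2) all comments on those posts
SELECT c.*, u.Username
FROM Comments c
JOIN Users u ON u.ID = c.UserID
WHERE c.PostID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);

-- 3) all likes on those posts
SELECT l.*, u.Username
FROM Likes l
JOIN Users u ON u.ID = l.UserID
WHERE l.PostID IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10);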
I am developing a forum in PHP and MySQL, and I want to make it as efficient as I can.
I have made these two tables
tbl_threads
tbl_comments
Now, the problem is that there is a like and dislike button under each comment. I have to store the user_name of whoever clicked the Like or Dislike button, together with the comment_id. I made a user_likes column and a user_dislikes column in tbl_comments to store comma-separated user_names. But on this forum I have read that this is not an efficient way; I have been advised to create a third table to store the likes and dislikes, bringing my database design into 1NF.
But the problem is, if I make a third table tbl_user_opinion with two fields like this
1. comment_id
2. type (like or dislike)
will I have to run as many SQL queries as there are comments on my page to get the like and dislike data for each comment? Won't that be inefficient? I think there is some confusion on my part here. Can someone clarify this?
Given your relational schema, there are two ways to solve this. The first, "clean" one is to build your like table and run COUNT(*) on the appropriate column.
The second one is to store a counter in each comment, indicating how many up- and down-votes it has received.
If you want to check whether a specific user has voted on a comment, you only have to look up one row, which you can easily handle as its own query and merge the two result sets outside of your database (for this, use a query returning the comment_id and the kind of vote the user cast in a specific thread).
Your approach with a comma-separated list does not perform well, because you cannot query it without parsing strings in application code. If you have a database, use it!
("One piece of information, one row!")
The comma-separated list violates the principle of atomicity, and therefore 1NF. You'll have a hard time maintaining referential integrity and, for the most part, querying as well.
Here is one way to do it in a normalized fashion:
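(A DDL sketch of such a schema, assuming MySQL/InnoDB, where the primary key is also the clustering key; {COMMENT_ID, USER_ID} is the primary key of each vote table:)

CREATE TABLE COMMENT (
    COMMENT_ID INT NOT NULL PRIMARY KEY
    -- <other COMMENT fields>
);

CREATE TABLE UP_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID    INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);

CREATE TABLE DOWN_VOTE (
    COMMENT_ID INT NOT NULL,
    USER_ID    INT NOT NULL,
    PRIMARY KEY (COMMENT_ID, USER_ID),
    FOREIGN KEY (COMMENT_ID) REFERENCES COMMENT (COMMENT_ID)
);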
This is very clustering-friendly: it groups up-votes belonging to the same comment physically close together (ditto for down-votes), making the following query rather efficient:
SELECT
    COMMENT.COMMENT_ID,
    <other COMMENT fields>,
    COUNT(DISTINCT UP_VOTE.USER_ID) - COUNT(DISTINCT DOWN_VOTE.USER_ID) SCORE
FROM COMMENT
LEFT JOIN UP_VOTE
    ON COMMENT.COMMENT_ID = UP_VOTE.COMMENT_ID
LEFT JOIN DOWN_VOTE
    ON COMMENT.COMMENT_ID = DOWN_VOTE.COMMENT_ID
WHERE
    COMMENT.COMMENT_ID = <whatever>
GROUP BY
    COMMENT.COMMENT_ID,
    <other COMMENT fields>;
Please measure on realistic amounts of data whether that works fast enough for you. If not, denormalize the model and cache the total score in the COMMENT table, keeping it current through triggers every time a row is inserted into or deleted from the *_VOTE tables.
If you also need to get which comments a particular user voted on, you'll need indexes on *_VOTE {USER_ID, COMMENT_ID}, i.e. the reverse of the primary/clustering key above.¹
¹ This is one of the reasons why I didn't go with just one VOTE table containing an additional field that can be either 1 (for up-vote) or -1 (for down-vote): it's less efficient to cover with secondary indexes.
Summary: what is the most efficient way to store information similar to the Like system on FB? That is, a tally of likes is kept, the person who liked it is kept, etc.
It needs to be associated with a user id so as to know who liked it. The issue is: do you have a column with a comma-delimited list of the ids of the things that were liked, or a separate column for each like (way too many columns)? The stored info would be a boolean value (1/0), but it needs to be associated with the user as well as the "page" that was liked.
My thought was this:
Column name = likes, e.g.:
1,2,3,4,5
That is, the user has "liked" the pages with ids 1, 2, 3, 4 and 5. To calculate total likes, a tally would be taken and stored in the database with the pages themselves (that table already exists).
That seems the best way to me, but is there a better option that anyone can think of?
P.S. I'm not doing FB likes, but it's the easiest explanation.
EDIT: Similar idea to the up/down voting here on Stack Overflow.
In this case, the best way would be to create a new table to keep track of the likes. Suppose you have a table posts with a column post_id, which contains all the posts (on which the users can vote), and another table users with a column user_id, which contains all the users.
You should create a table likes with at least two columns, something like like_postid and like_userid. Now, every time a user likes a post, create a new row in this table with the id of the post (the value of post_id from posts) that is liked and the id of the user (the value of user_id from users) who likes it. Of course you can add some more columns to the likes table (for instance, to keep track of when a like was created).
What you have here is called a many-to-many relationship. Google it to get some more information and some advice on how to implement it correctly (you will find that a comma-separated list of ids is not among the best practices).
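A sketch of such a table (MySQL syntax; like_created is the optional extra column mentioned above, and the composite primary key also keeps a user from liking the same post twice):

CREATE TABLE likes (
    like_postid  INT NOT NULL,
    like_userid  INT NOT NULL,
    like_created TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (like_postid, like_userid),
    FOREIGN KEY (like_postid) REFERENCES posts (post_id),
    FOREIGN KEY (like_userid) REFERENCES users (user_id)
);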
Update based on comments:
If I'm correct, you want to get a list of all users (ordered by name) who have voted on an artist. You could do that something like this:
SELECT Artists.Name, Users.Users_Name
FROM Artists
JOIN Votes
    ON Votes.page_ID = Artists.ID
JOIN Users
    ON Votes.Votes_Userid = Users.User_ID
WHERE Artists.Name = "dfgdfg"
ORDER BY Users.Users_Name
There's one strange thing here: the column in your Votes table that contains the artist id seems to be called page_ID. Also, you're a bit inconsistent in your column names (not really bad, but something to keep in mind if you want to be able to understand your code after leaving it alone for 6 months). In your comment you say that you only make one join, but you actually do two joins: if you specify two table names (as you do: JOIN Users, Votes), SQL joins both of those tables.
Based on the query you posted in the comments, I can tell you haven't got much experience using joins. I suggest you read up on how to use them; it will really improve your ability to write good code.
I have a project where I have posts, for example.
The task is this: I must show a user his last post visits.
This is my solution: every time a user visits a new (for him) topic, I create a new record in the visits table.
The visits table has this structure: id, user_id, post_id, last_visit.
Now my visits table has ~14,000,000 records, and it's still growing every day.
Maybe my solution isn't optimal and there is another way to store user visits?
It's important to save every visit as a standalone record, because I also have a feature that selects and uses a user's visits. And I can't purge this table, because the data could be needed a month or a year later. How could I optimize this situation?
Nope, you don't really have much choice other than to store your visit data in a table with columns for (at a bare minimum) user id, post id, and timestamp, if you need to track the last time each user visited each post.
I question whether you need an id field in that table, rather than using a composite key on (user_id, post_id), but I'd expect that to have a minor effect, provided that you already have a unique index on (user_id, post_id). (If you don't have an index on that pair of fields, adding one should improve query performance considerably and making it a unique index or composite key will protect against accidentally inserting duplicate records.)
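For example (a sketch; the lookup below is the kind of query the index accelerates):

ALTER TABLE visits ADD UNIQUE KEY uq_user_post (user_id, post_id);

SELECT last_visit FROM visits WHERE user_id = ? AND post_id = ?;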
If performance is still an issue despite proper indexing, you should be able to improve it a bit by segmenting the table into a collection of smaller tables, but segment it by user_id or post_id (rather than by date, as previous answers have suggested). If you break it up by user or post id, then you will still be able to determine whether a given user has previously viewed a given post, and if so on what date, with only a single query. If you segment it by date, that information will be spread across all the tables and, in the worst-case scenario of a user who has never previously viewed a post (which I expect to be fairly common), you'll need to query each and every table separately before you have a definitive answer.
As for whether to segment by user id or by post id, that depends on whether you will more often be looking for all posts viewed by a user (segment by user_id to get them all in one query) or all users who have viewed a post (segment by post_id).
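MySQL's built-in partitioning can stand in for the manual segmentation described above (a sketch; it assumes the composite primary key (user_id, post_id) mentioned earlier, since every unique key must include the partitioning column):

-- spread the rows for different users across 32 physical partitions
ALTER TABLE visits
    PARTITION BY HASH (user_id)
    PARTITIONS 32;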
If it doesn't need to be long-lasting, you could store it in the session instead. If it does, you could either break the records apart by table, say one per month, or only store the last 5-10 pages visited and delete old ones as new ones come in. You could also change it to pages visited today, this week, etc.
If you do need all 14 million records, I would create a separate historical table to archive the visits that are not the most relevant for day-to-day site operation.
At the end of the month (or week, or quarter, etc.), have some scheduled logic archive records beyond a certain cutoff point into the historical table, reducing the number of records in the "live" table. This should help increase query speed on the "live" table, since it would hold fewer records.
If you do need to query all of the data, you can use both tables and have all of the data available to you.
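A sketch of that scheduled job (visits_archive is a hypothetical table with the same structure as visits; the cutoff date is whatever you choose):

-- move everything older than the cutoff into the archive table ...
INSERT INTO visits_archive
SELECT * FROM visits WHERE last_visit < ?;

-- ... then trim the live table
DELETE FROM visits WHERE last_visit < ?;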
You could delete the ones you don't need. If you only want to show the last 10 visited posts, then:
DELETE FROM visits
WHERE user_id = ?
  AND id NOT IN (
    SELECT id FROM (
      SELECT id FROM visits
      WHERE user_id = ?
      ORDER BY last_visit DESC
      LIMIT 10
    ) AS keep_rows
  );
(I think that's the best way to do that query; any MySQL guru can tell me otherwise? The extra derived table is needed because MySQL won't let a DELETE reference its own target table in a subquery, nor use LIMIT inside an IN subquery. You can ORDER BY in a DELETE, but its LIMIT only takes one parameter, so you can't do LIMIT 10, 100 there.)
Run it after inserting/updating each new row, or every few days if you like.
Having a structure like (id, user_id, post_id, last_visit) for your visits table makes it appear as though you are saving all post visits, not just the last one per topic. Don't you need a topic ID in there somewhere, so that you can determine what their last post PER TOPIC was, and so you know which row to replace when they visit the same topic more than once?
Store the post_ids in $_SESSION, and then, using MySQL's IN with one SELECT query, you will be able to show his visited posts. But all those ids will be destroyed after the member closes his browser. Still, this is much faster and more optimal than using the database.
edit: sorry, I didn't notice that you must store those records in the database and use them months later. Then I have no idea how to optimize it, but with 14 million records you should definitely use indexes.