So I've found through researching myself that the best way I can design a structure for liking posts is by having a database like the following. Let's say like Reddit, a post can be upvoted, downvoted, or not voted on at all.
The database would then having three columns, [username,post,liked].
Liked could be some kind of boolean, 1 indicating liked, and 0 indicating disliked.
Then to find a post like amount, I would do SELECT COUNT(*) FROM likes WHERE post=12341 AND liked=1 for example, then do the same for liked=0(disliked), and do the addition server side along with controversy percentage.
So I have a few concerns, first off, what would be the appropriate way to find out if a user liked a post? Would I try to select the liked boolean value, and either retrieve or catch error. Or would I first check if the record exist, and then do another select to find out the value? What if I want to check if a user liked multiple posts at once?
Secondly, would this table not need a primary key? Because no row will have the same post and username, should I use a compound primary key?
For performance you will want to alter your database plans:
User Likes Post table
Fields:
Liked should be a boolean, you are right. You can transform this to -1/+1 in your code. You will cache the numeric totals elsewhere.
Username should be UserID. You want only numeric values in this table for speed.
Post should be PostID for the same reason.
You also want a numeric primary key because they're easier to search against, and to perform sub-selects with.
And create a unique index on (Username, Post), because this table is mainly an index built for speed.
So did a user vote on a post?
select id
from user_likes_post
where userID = 123 and postID = 456;
Did the user like the post?
select id
from user_likes_post
where userID = 123 and postID = 456 and liked = true;
You don't need to worry about errors, you'll either get results or you won't, so you might as well go straight to the value you're after:
select liked from user_liked_post where userID=123 and postID=456
Get all the posts they liked:
select postID
from user_likes_post
where userID = 123 and liked = true;
Post Score table
PostID
TotalLikes
TotalDislikes
Score
This second table will be dumped and refreshed every n minutes by calculating on the first table. This second table is your cached aggregate score that you'll actually load for all users visiting that post. Adjust the frequency of this repeat dump-and-repopulate schedule however you see fit. For a small hobby or student project, just do it every 30 seconds or 2 minutes; bigger sites, every 10 or 15 minutes. For an even bigger site like reddit, you'd want to make the schema more complex to allow busier parts of the site to have faster refresh.
// this is not exact code, just an outline
totalLikes =
select count(*)
from user_likes_post
where postID=123 and liked=true
totalDislikes =
select count(*)
from user_likes_post
where postID=123 and liked=false
totalVotes = totalLikes + totalDislikes
score = totalLikes / totalVotes;
(You can simulate an update by involving the user's localStorage -- client-side Javascript showing a bump-up or down on the posts that user has voted on.)
Given your suggested 3-column table and the selects you suggest, be sure to have
PRIMARY KEY(username, post) -- helps with "did user like a post"
INDEX(post_id, liked) -- for that COUNT
When checking whether a user liked a post, either do a LEFT JOIN so that you get one of three things: 1=liked, 0=unliked, or NULL=not voted. Or you could use EXISTS( SELECT .. )
Tables need PKs.
I agree with Rick James that likes table should be uniquely indexed by (username, post) pair.
Also I advise you to let a bit redundancy and keep the like_counter in the posts table. It will allow you to significantly reduce the load on regular queries.
Increase or decrease the counter right after successful adding the like/dislike record.
All in all,
to get posts with likes: plain select of posts
no need to add joins and aggregate sub-queries.
to like/dislike: (1) insert into likes, on success (2) update posts.like_counter.
unique index prevents duplication.
get know if user has already liked the post: select from likes by username+post pair.
index helps to do it fast
My initial thought was that the problem is because boolean type is not rich enough to express the possible reactions to a post. So instead of boolean, you needed an enum with possible states of Liked, Disliked, and the third and the default state of Un-reacted.
Now however it seems, you can do away with boolean too because you do not need to record the Un-reacted state. A lack of reaction means that you do not add the entry in the table.
What would be the appropriate way to find out if a user liked a post?
SELECT Liked
FROM Likes
WHERE Likes.PostId == 1234
AND Likes.UserName == "UniqueUserName";
If the post was not interacted with by the user, there would be no results. Otherwise, 1 if liked and 0 if disliked.
What if I want to check if a user liked multiple posts at once?
I think for that you need to store a timestamp too. You can then use that timestamp to see if it there are multiple liked post within a short duration.
You could employ k-means clustering to figure if there are any "cluster" of likes. The complete explanation is too big to add here.
Would this table not need a primary key?
Of course it would. But Like is a weak entity depending upon the Post. So it would require the PK of Post, which is the field post (I assume). Combined with username we would have the PK because (post, username) would be unique for user's reaction.
Related
Say I'll get all the followers of a certain content from my project; here is my db
table
contents
users
Now, everytime I want to get content's numbers of followers, I have this table here to get connections with users called content-followers.
table
contents
users
content-followers <
columns
user_id
content_id
Now my concern is say this will run getting the numbers of followers of a content, but this will be along with the other queries and stuff and I understand it may get the sql slower on process.
See, everytime people will visit the content, I'll have to show that count, but that count (as I imagine) will run through the entire table just to count.
Is there other way to make it simple? Like counting only once a certain time and save to contents table?
I have no proper database lessons so, thanks guys for your help in advance!
CREATE TABLE ContentFollowers (
user_id ...,
content_id ...,
PRIMARY KEY(user_id, content_id),
INDEX(content_id, user_id)
) ENGINE=InnoDB;
SELECT ...,
( SELECT COUNT(*) FROM ContentFollowers
WHERE user_id = u.id
) AS follower_count
FROM Contents AS c
JOIN Users AS u ON ...
WHERE ...
The COUNT(*) will efficiently use the PRIMARY KEY of ContentFollowers. The added time taken will be a few milliseconds, even with many millions of users and contents.
If you want to discuss further, please provide the SHOW CREATE TABLE for each relevant table and your tentative SELECT (which will have more than what I specified). So "... counting only once ..." should be unnecessary (and a hassle).
Is it possible for a "user" to "follow" a "content" more than once? This is a potential hack to mess up your numbers, but I think what I say here avoids that possibility. (A PRIMARY KEY includes an 'uniqueness' constraint.) Without this, a user could repeatedly click on [Follow] to inflate the number of 'followers'.
In what you have specified so far, I don't see the need for a TRIGGER. Furthermore, a Trigger would reopen the possibility of the above 'hack'.
I would like to know which is the best when I want to check if somebody already participated to a members related event (like a poll):
Imagine that I have a table that stores all the voters votes. Over time, it can reach a very big size (10000+ entries/votes for 500 different polls).
When I want to check if a member has already voted to my new poll, what's the best? :
1/ Make a SELECT or a COUNT on the "10000+ entries VOTERS table" to see if said USERID already voted to my new poll.
2/ Having a TEXT columns in my POLL_main_infos Table where i stock/CONCAT the USERIDS like these:
"1,15,42,12,523,8521,7444, etc etc."
And to check, I get that columns as a variable then in my PHP script, I use a REGEX to check if a USERID is already present in it (like looking for ",42,", meaning the USER with the ID "42" already participated to said poll.
Also, if the second solution is the best, should I stock the IDS in a text column or a BLOB?
Anyways, thank you very much in advance!
In a Relational Database system it's usually best to store data normalized, so no list of user ids, never.
So using Holmes IV suggested tables it's a simple:
SELECT 'User already voted'
WHERE EXISTS
(
SELECT * FROM Participation
WHERE PollId = 123
AND UserId = 456
)
With an appropriate index (a UNIQUE index if a user might vote only once per poll) this will be very fast regardless of the number of rows.
How much freedom do you have with this? I would say you need 3 tables, 1 for each : Polls, Voters, Participation . In the Polls table you store Poll_ID and Poll_Description. In voter you store all Voter_ID and any other voter information you might have. Then in the participation you store, Poll_ID,Voter_ID, and results or simply a date of when they participated. Then when you want to see such information you query the 3rd table, and join it on the other too for the specifics
The best way to check if a user already voted is to query the Polls_Voters tables in your setup it would be
Select *
from Polls_Voters
where Voter_ID = 'ID of voter you are looking for'
and poll_id = 'ID of poll you wish to see if they voted in'
I made a voting system and recorded votes in vote table as a row like this:
post_id | user_id | votes | date
Then, I add this vote to user-meta table which has a structure like this:
user_id | meta_key | meta_value
In this table, giving an user_id with the meta_key of "votes", I can save or get an array of post_ids as meta_value.
The reason I save one vote in two tables is--
1. Why use the Vote table?
I need to calculate vote_total and vote_popularity across all posts, this is not easy to do if I save it only as an element of an array in the user's meta table.
2. Why use the user meta table?
I need to check if a user already voted for a post or not. The user meta is cached once set. So, I only need to do a check of in_array($post_id, $my_voted_posts). If without the user-meta table, for each post the user visit, I will have to check the vote table.
Is there a better way?
Although my voting system works fine. But somehow I feel this is not efficient way of doing this. I would like to consult you professionals, to learn how I can improve it for better performance. Thanks!
To start, I think checking the vote table is fine if you have an efficient index. You can use composite indexes to index post_id and user_id, which will make a query like select * from votes where post_id = X and vote_id = Y fairly performant.
There will come a point where you will need to denormalize your data for speed, but I don't think it should be in the same layer as your normalized data. Perhaps you can use redis / memcache for the user metadata?
I want to create a table where my users can associate a friendship between one another. Which at the same time this table will work in conjunction to what I would to be a one-to-many relation between various other tables I am attempting to work up.
Right now I am thinking of something like this
member_id, friend_id, active, date
member_id would be the column of the user making the call, friend_id would be the column of the friend they are attempting to tie to, active would be a toggle of sorts 0 = pending, 1 = active, date would just be a logged date of the last activity on that particular row.
Now my confusion is if I were to query I would typically query for member_id then base the rest of the query off of associated friend_id's to display data accordingly to the right people. So with this logic of sorts in mind, that makes me think I would have to have 2 rows per request. One where its the member_id who's requesting and the friend_id of the request inserted into the table, then one thats the opposite so I could query accordingly every time. So in essences its like double dipping for every one action requested to this particular table I need to make 2 like actions to make it work.
Which in all does not make sense to me as far as optimization goes. So in all my question is what is the proper way to handle data for relations like this? Or am I actually thinking sanely about this being an approach to handling it?
If a friendship is always mutual, then you can choose between data redundancy (i.e. both directions having a row) for the sake of simpler queries, or learn to live with slightly more complex queries. I'd personally avoid data redundancy unless there is a compelling reason otherwise - you're not just wasting space and performance, but you'll need to be careful when enforcing it - a simple CHECK is incapable of referencing other rows and depending on your DBMS a trigger may be limited in what it can do with a mutating table.
An easy way ensure to only one row per friendship is to always insert the lower value in member_id and higher value in friend_id (make a constraint CHECK (member_id < friend_id) to enforce it). Then, when you query, you'll have search in both directions - for example, finding all friends of the given person (identified by person_id) would look something like this:
SELECT *
FROM
person
WHERE
id <> :person_id
AND (
id IN (
SELECT friend_id
FROM friendship
WHERE member_id = :person_id
)
OR
id IN (
SELECT member_id
FROM friendship
WHERE friend_id = :person_id
)
)
BTW, in this scheme, you'd probably want to rename member_id and friend_id to, say, friend1_id and friend2_id...
Two ways to look at it:
WHERE ((friend_id = x AND member_id = y) OR (friend_id = y AND member_id = x))
would allow you to query by simply stating one side of the relationship. If both sides are added, this method would still work without causing duplicate rows to be returned.
Conversely, adding both sides of the relationship, so that your queries consist of
WHERE friend_id = x AND member_id = y
not only makes queries easier to write, but also easier to plan (meaning better DB performance).
My vote is for the latter option.
Beautiful - there's no problem with your table as-is.
ALSO:
I'm not sure if this cardinality is "one to many", or "many to many":
http://en.wikipedia.org/wiki/Cardinality_%28data_modeling%29
Q: I were to query I would typically query for member_id then base the
rest of the query off of associated friend_id's to display data
accordingly to the right people
A: Frankly, I don't see any problem querying "member to friend", or "friend to member" (or any other combinations - e.g. friends who share friends). Again, it looks good.
Introduce a helper table like:
users
user_id, name, ...
friendship
user_id, friend_id, ....
select u.name as user, u2.name as friend from users u
inner join friendship f on f.user_id = u.user_id
inner join users u2 on u2.user_id = f.friend_id
I think this is pretty similar to what you have, just putting a query as an example.
I want to include a like/dislike system similar to Facebook and so far, I have set the like/dislike columns as a 'text' type. This is so that I can add the id for the user(s) who liked/disliked a post. Would that be the best way of doing it? Also, in addition to the question above, how would I stop a user pressing the like and dislike button again? Since, once a user has liked a post, it should display an unlike/undislike option? A concept/idea would be great of how to do this.
While it's hard to make armchair decisions, here are my ideas:
First, you could have a 'likes' integer column for each post. When a user clicks up and down, have that number increment or decrement. This offers no protection against users clicking multiple times, but it's easy and fast.
Another way would be to have a 'Like' table, with columns post_id, user_id, and score. score can have two values: '1' or '-1'. All 3 columns are integers. When the user clicks 'like', you do an INSERT/UPDATE command on the row with user_id & post_id matching.
Then, to see the final score for a post, you do a SELECT SUM(score) FROM that_table WHERE post_id = ?.
With this second method, if you wanted to see the name of the most recent clicker, you could add a timestamp column and search for the most recent entry.
I would create a table with the following structure:
Table: Likes
PostId bigint
UserId bigint
Like bit(true, false)
Then set PostId and UserId together as the primary key. This will prevent the database from inserting multiple likes/unlikes for the same post.
In your code, check to see if the user/post combination exists and then toggle the bit value if it does or set the bit value to true if it does not.