I have a system where there can be "posts", "comments" to the posts, and "comments" to other comments. Think of it as a very simplified facebook comment system.
To back the data, I chose to use one table for both posts and comments, as their structure is pretty much the same:
ID (of comment or post) | TopParentID (if post, same as ID, if comment, ID of post) | DirectParentID (0 if post, either ID of post or ID of parent comment if comment) | Some more fields that are the same for both post and comments
What I'd like to achieve: Select, for example, the first 20 posts. However, with the posts, also select the comments of that post.
I know that this sounds like something where a JOIN with another table would be more optimal than having only one table, but I thought that it would be beneficial to let both posts and comments use the same ID counter, to make it easier to find both direct comments to a post, and comments to a comment of a post.
However, I do not know how to design my query with the above table design; I'd somehow need to select the first x rows where DirectParentID = 0, but also rows where TopParentID = the post id of each selected row.
Should I just change my table design after all?
You could do something like this:
SELECT *
FROM t
WHERE topParentId IN (SELECT id
FROM t
WHERE directParentId = 0
AND id <= 20);
LIMIT can't be used in sub-queries, so you would need a where clause to filter out your parent posts (maybe by date or user or something).
If you want to use limit, you can use a 'self-join' like so:
SELECT t.*
FROM (SELECT *
FROM t
WHERE t.did = 0
LIMIT 20) AS t1
INNER JOIN t
ON t.tid = t1.id;
There isn't anything wrong with your design. Self referencing tables are common. You can consider your table to contain 'comment-able objects' and both comments and posts could be categorized into that. But queries which treat posts and comments as two separate entities may be more difficult/complex to create for a table treating them as a single entity (at least, that's the case for me :))
I recommend you have one table for the "Thread" plus one table for "Comments". Probably "Posts" can simply be thrown with Comments, but do not try to put Thread in also.
If you want a hierarchy (Comments to Comments), the Post should be distinguished from Comments simply by directParentId = 0.
If you don't want a hierarchy, then all you need is a thread_id on all Posts/Comments, plus a simple flag to say which is the original Post.
Related
So I've found through researching myself that the best way I can design a structure for liking posts is by having a database like the following. Let's say like Reddit, a post can be upvoted, downvoted, or not voted on at all.
The database would then having three columns, [username,post,liked].
Liked could be some kind of boolean, 1 indicating liked, and 0 indicating disliked.
Then to find a post like amount, I would do SELECT COUNT(*) FROM likes WHERE post=12341 AND liked=1 for example, then do the same for liked=0(disliked), and do the addition server side along with controversy percentage.
So I have a few concerns, first off, what would be the appropriate way to find out if a user liked a post? Would I try to select the liked boolean value, and either retrieve or catch error. Or would I first check if the record exist, and then do another select to find out the value? What if I want to check if a user liked multiple posts at once?
Secondly, would this table not need a primary key? Because no row will have the same post and username, should I use a compound primary key?
For performance you will want to alter your database plans:
User Likes Post table
Fields:
Liked should be a boolean, you are right. You can transform this to -1/+1 in your code. You will cache the numeric totals elsewhere.
Username should be UserID. You want only numeric values in this table for speed.
Post should be PostID for the same reason.
You also want a numeric primary key because they're easier to search against, and to perform sub-selects with.
And create a unique index on (Username, Post), because this table is mainly an index built for speed.
So did a user vote on a post?
select id
from user_likes_post
where userID = 123 and postID = 456;
Did the user like the post?
select id
from user_likes_post
where userID = 123 and postID = 456 and liked = true;
You don't need to worry about errors, you'll either get results or you won't, so you might as well go straight to the value you're after:
select liked from user_liked_post where userID=123 and postID=456
Get all the posts they liked:
select postID
from user_likes_post
where userID = 123 and liked = true;
Post Score table
PostID
TotalLikes
TotalDislikes
Score
This second table will be dumped and refreshed every n minutes by calculating on the first table. This second table is your cached aggregate score that you'll actually load for all users visiting that post. Adjust the frequency of this repeat dump-and-repopulate schedule however you see fit. For a small hobby or student project, just do it every 30 seconds or 2 minutes; bigger sites, every 10 or 15 minutes. For an even bigger site like reddit, you'd want to make the schema more complex to allow busier parts of the site to have faster refresh.
// this is not exact code, just an outline
totalLikes =
select count(*)
from user_likes_post
where postID=123 and liked=true
totalDislikes =
select count(*)
from user_likes_post
where postID=123 and liked=false
totalVotes = totalLikes + totalDislikes
score = totalLikes / totalVotes;
(You can simulate an update by involving the user's localStorage -- client-side Javascript showing a bump-up or down on the posts that user has voted on.)
Given your suggested 3-column table and the selects you suggest, be sure to have
PRIMARY KEY(username, post) -- helps with "did user like a post"
INDEX(post_id, liked) -- for that COUNT
When checking whether a user liked a post, either do a LEFT JOIN so that you get one of three things: 1=liked, 0=unliked, or NULL=not voted. Or you could use EXISTS( SELECT .. )
Tables need PKs.
I agree with Rick James that likes table should be uniquely indexed by (username, post) pair.
Also I advise you to let a bit redundancy and keep the like_counter in the posts table. It will allow you to significantly reduce the load on regular queries.
Increase or decrease the counter right after successful adding the like/dislike record.
All in all,
to get posts with likes: plain select of posts
no need to add joins and aggregate sub-queries.
to like/dislike: (1) insert into likes, on success (2) update posts.like_counter.
unique index prevents duplication.
get know if user has already liked the post: select from likes by username+post pair.
index helps to do it fast
My initial thought was that the problem is because boolean type is not rich enough to express the possible reactions to a post. So instead of boolean, you needed an enum with possible states of Liked, Disliked, and the third and the default state of Un-reacted.
Now however it seems, you can do away with boolean too because you do not need to record the Un-reacted state. A lack of reaction means that you do not add the entry in the table.
What would be the appropriate way to find out if a user liked a post?
SELECT Liked
FROM Likes
WHERE Likes.PostId == 1234
AND Likes.UserName == "UniqueUserName";
If the post was not interacted with by the user, there would be no results. Otherwise, 1 if liked and 0 if disliked.
What if I want to check if a user liked multiple posts at once?
I think for that you need to store a timestamp too. You can then use that timestamp to see if it there are multiple liked post within a short duration.
You could employ k-means clustering to figure if there are any "cluster" of likes. The complete explanation is too big to add here.
Would this table not need a primary key?
Of course it would. But Like is a weak entity depending upon the Post. So it would require the PK of Post, which is the field post (I assume). Combined with username we would have the PK because (post, username) would be unique for user's reaction.
Let's say I have a table posts that contains the User_id of the user who posted and the post post_id of the post. And I have another table comments that contains only the post that it belongs to child_of_post.
The problem is here: I need to select only from comments but at the same time get the user_id of the post that the comment belongs to.
So should I use a join like:
SELECT user_id FROM comments INNER JOIN posts ON child_of_post = post_id
Reading this confused me even more, I don't really know how to explain it, but, in general if I need to use the same value like and id, should I save that value in every table that I need it ? Or should I save it only in one table and use joins to retrieve it ?
Is using a join better that adding one more column to a table ?
Is using a join better than adding one more column to a table ?
In general : Yes.
Your database design looks good. As a general principle, avoid duplicating data across tables. This is inefficient in terms of storage, and also can quickly turn into a maintenance nightmare when it comes to modifying data, which ultimately threatens the integrity of your data.
Instead of duplicating data, the usual approach is to store a reference to the table row where the original data is stored ; this is called a foreign key, and it offers various functionalities that help maintain data integrity (prevent inserts of orphan records in the child table, delete child records when the parent is deleted, ...).
In your use case, you indeed would need to JOIN to recover the user that created the original post, like :
SELECT p.user_id, c.*
FROM comments c
INNNER JOIN posts p ON c.child_of_post = p.post_id
Assuming that post_id is the primary key of table posts, such JOIN with an equality condition referencing the primary key of another table, is very efficient, especially if you create an index on referencing column comments.child_of_post.
PS : it is a good practice to give aliases to table names and use them to index the fields in the query ; it avoids subtle bugs caused by column name clashes (when both tables have fields with the same name), and makes the query easier.
If I have a column in a table which is tags for example, and in that column an entry can be tagged with various keywords. Is it acceptable for an entry to have a list in that column instead of a single value?
So for example if I was putting a question into my table and I wanted to tag it as "math" and "physics", would it be acceptable to put math, physics into the tags column? And would it then later be possible to write a query that would be something along the lines of:
SELECT * FROM questions WHERE tags CONTAINS 'physics'
I know that CONTAINS is not a SQL operator but is there something that does that functionality? Or would i need to pull all entries from the table and then one by one search their tags for physics?
Never, never, never store multiple values in one column!
That is the 3rd rule of DB normalization.
Use another table to map the tags to the questions:
questions table
---------------
id
title
.....
tags table
----------
id
name
question_tags table
-------------------
question_id
tag_id
After that you can look for questions that have a specific tag like this
select q.*
from questions q
join question_tags qt on qt.question_id = q.id
join tags t on t.id = qt.tag_id
where t.name = 'physics'
Its not really 3rd normal form to have a column full of tags. Rather, you might consider having a tags table that relates to the questions table. Consider the following tables. This sets up a many to many relationship between questions and tags.
Questions:
QuestionID|*
Tags:
TagID|TagText
QuestionTags:
QuestionID|TagID
Then if you wanted all the physics questions a query something like this:
Select * from questions where questionid in (select questionID from questionTags where tagid in (select tagid from tags where tagtext = 'physics'))
As far as something like CONTAINS is concerned, you might be looking for the LIKE operator
SELECT * FROM questions WHERE tags LIKE '%physics%'
The % on each side act as wildcards, so it doesn't matter where in the string your search term is.
However, as mentioned by drew010, it is not efficient at all as it has to search through the entire table. If your table is large, this method will be very slow and is not recommended.
My team working on a php/MySQL website for a school project. I have a table of users with typical information (ID,first name, last name, etc). I also have a table of questions with sample data like below. For this simplified example, all the answers to the questions are numerical.
Table Questions:
qid | questionText
1 | 'favorite number'
2 | 'gpa'
3 | 'number of years doing ...'
etc.
Users will have the ability fill out a form to answer any or all of these questions. Note: users are not required to answer all of the questions and the questions themselves are subject to change in the future.
The answer table looks like this:
Table Answers:
uid | qid | value
37 | 1 | 42
37 | 2 | 3.5
38 | 2 | 3.6
etc.
Now, I am working on the search page for the site. I would like the user to select what criteria they want to search on. I have something working, but I'm not sure it is efficient at all or if it will scale (not that these tables will ever be huge - like I said, it is a school project). For example, I might want to list all users whose favorite number is between 100 and 200 and whose GPA is above 2.0. Currently, I have a query builder that works (it creates a valid query that returns accurate results - as far as I can tell). A result of the query builder for this example would look like this:
SELECT u.ID, u.name (etc)
FROM User u
JOIN Answer a1 ON u.ID=a1.uid
JOIN Answer a2 ON u.ID=a2.uid
WHERE 1
AND (a1.qid=1 AND a1.value>100 AND a1.value<200)
AND (a2.qid=2 AND a2.value>2.0)
I add the WHERE 1 so that in the for loops, I can just add " AND (...)". I realize I could drop the '1' and just use implode(and,array) and add the where if array is not empty, but I figured this is equivalent. If not, I can change that easy enough.
As you can see, I add a JOIN for every criteria the searcher asks for. This also allows me to order by a1.value ASC, or a2.value, etc.
First question:
Is this table organization at least somewhat decent? We figured that since the number of questions is variable, and not every user answers every question, that something like this would be necessary.
Main question:
Is the query way too inefficient? I imagine that it is not ideal to join the same table to itself up to maybe a dozen or two times (if we end up putting that many questions in). I did some searching and found these two posts which seem to kind of touch on what I'm looking for:
Mutiple criteria in 1 query
This uses multiple nested (correct term?) queries in EXISTS
Search for products with multiple criteria
One of the comments by youssef azari mentions using 'query 1' UNION 'query 2'
Would either of these perform better/make more sense for what I'm trying to do?
Bonus question:
I left out above for simplicity's sake, but I actually have 3 tables (for number valued questions, booleans, and text)
The decision to have separate tables was because (as far as I could think of) it would either be that or have one big answers table with 3 value columns of different types, having 2 always empty.
This works with my current query builder - an example query would be
SELECT u.ID,...
FROM User u
JOIN AnswerBool b1 ON u.ID=b1.uid
JOIN AnswerNum n1 ON u.ID=n1.uid
JOIN AnswerText t1 ON u.ID=t1.uid
WHERE 1
AND (b1.qid=1 AND b1.value=true)
AND (n1.qid=16 AND n1.value<999)
AND (t1.qid=23 AND t1.value LIKE '...')
With that in mind, what is the best way to get my results?
One final piece of context:
I mentioned this is for a school project. While this is true, then eventual goal (it is an undergrad senior design project) is to have a department use our site for students creating teams for their senior design. For a rough estimate of size, every semester, the department would have somewhere around 200 or so students use our site to form teams. Obviously, when we're done, the department will (hopefully) check our site for security issues and other stuff they need to worry about (what with FERPA and all). We are trying to take into account all common security practices and scalablity concerns, but in the end, our code may be improved by others.
UPDATE
As per nnichols suggestion, I put in a decent amount of data and ran some tests on different queries. I put around 250 users in the table, and about 2000 answers in each of the 3 tables. I found the links provided very informative
(links removed because I can't hyperlink more than twice yet) Links are in nnichols' response
as well as this one that I found:
http://phpmaster.com/using-explain-to-write-better-mysql-queries/
I tried 3 different types of queries, and in the end, the one I proposed worked the best.
First: using EXISTS
SELECT u.ID,...
FROM User u WHERE 1
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=# AND value>#) -- or any condition on value
AND EXISTS
(SELECT * FROM AnswerNumber
WHERE uid=u.ID AND qid=another # AND some_condition(value))
AND EXISTS
(SELECT * FROM AnswerText
...
I used 10 conditions on each of the 3 answer tables (resulting in 30 EXISTS)
Second: using IN - a very similar approach (maybe even exactly?) which yields the same results
SELECT u.ID,...
FROM User u WHERE 1
AND (u.ID) IN (SELECT uid FROM AnswerNumber WHERE qid=# AND ...)
...
again with 30 subqueries.
The third one I tried was the same as described above (using 30 JOINs)
The results of using EXPLAIN on the first two were as follows: (identical)
The primary query on table u had a type of ALL (bad, though users table is not huge) and rows searched was roughly twice the size of the user table (not sure why). Each other row in the output of EXPLAIN was a dependent query on the relevant answer table, with a type of eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall not bad.
For the query I suggested (JOINing):
The primary query was actually on whatever table you joined first (in my case AnswerBoolean) with type of ref (better than ALL). The number of rows searched was equal to the number of questions answered by anyone (as in 50 distinct questions have been answered by anyone) (which will be much less than the number of users). For each additional row in EXPLAIN output, it was a SIMPLE query with type eq_ref (good) using WHERE and key=PRIMARY KEY and only searching 1 row. Overall almost the same, but a smaller starting multiplier.
One final advantage to the JOIN method: it was the only one I could figure out how to order by various values (such as n1.value). Since the other two queries were using subqueries, I could not access the value of a specific subquery. Adding the order by clause did change the extra field in the first query to also have 'using temporary' (required, I believe, for order by's) and 'using filesort' (not sure how to avoid that). However, even with those slow-downs, the number of rows is still much less, and the other two (as far as I could get) cannot use order by.
You could answer most of these questions yourself with a suitably large test dataset and the use of EXPLAIN and/or the profiler.
Your INNER JOINs will almost certainly perform better than switching to EXISTS but again this is easy to test with a suitable test dataset and EXPLAIN.
If you were implementing a blog application - what would you prefer -
Having a counter in the "POSTS" table storing the number of comments
SELECT comment_count
FROM posts WHERE post_id = $id
...or counting the number of comments for a particular post from a "COMMENTS" table:
SELECT COUNT(*)
FROM comments
WHERE post_id = $id
Which one is more optimized?
I would use the second form, COUNT, until I was sure that performance in that particular SQL query was a problem. What you're suggesting in the first is basically denormalization, which is fine and dandy when you know for sure you need it.
Indexes would allow you to perform the second query pretty quickly regardless.
Let's look at the factors.
Case 1: When you display the post you display the comments. That means you retrieve them all and can count them as you display them. In that case, no.
Case 2: When you display the post you do not display the comments, but a link that says "15 comments." In that case there is an equation.
Materializing the count:
Cost of one comment save = 1 Insert + 1 Update
Cost of one post display = 1 Read
Average Number of blog displays = D
Average Number of Comments = C
So, for what ratio of Displays D vs. Comments C is it true that:
C * (Insert + Update) < D * (Read)
Since it is usually true that D >> C, I would suggest that the cost of that extra update vanishes.
This may not be so important for a blog, but it is a crucial formula to know when you have many tables and need to make these decisions.