Weird bug in sql script - mysql

SELECT p.*, u.user_id, u.user_name,
COUNT(c.comment_id) AS count,
COUNT(v.vote_post_id) /
(TIMESTAMPDIFF(MINUTE, v.vote_timestamp, SYSDATE()) + 1) AS rate
FROM posts AS p
LEFT JOIN comments AS c ON (p.post_id = c.comment_post_id)
LEFT JOIN post_votes AS v ON (p.post_id = v.vote_post_id)
LEFT JOIN users AS u ON (p.postedby_id = u.user_id)
GROUP BY p.post_id
ORDER BY COUNT(v.vote_post_id) /
(TIMESTAMPDIFF(MINUTE, v.vote_timestamp, SYSDATE()) + 1) DESC
This is the script I'm working on. My database isn't well populated for testing yet, but the comment counts for the first two results come out doubled. Can you see any obvious mistakes here? I have another version of the script that works fine:
SELECT p.*, u.user_id, u.user_name,
COUNT(c.comment_id) AS count
FROM posts AS p
LEFT JOIN comments AS c ON (p.post_id = c.comment_post_id)
LEFT JOIN users AS u ON (p.postedby_id = u.user_id)
GROUP BY p.post_id
ORDER BY COUNT(c.comment_post_id) /
(TIMESTAMPDIFF(MINUTE, p.post_timestamp, SYSDATE()) + 1) DESC

Multiple votes per post will cause your comments to be duplicated: each comment row is repeated once for every vote row on the same post. Instead, join to a sub-select on the post_votes table that groups by vote_post_id, so the total votes per post arrive as a single value.
COUNT is a keyword in MySQL (it is the name of the aggregate function), so I don't recommend using it as a column alias in your result set.
If you only need the comment count and not the comments themselves, put that in a sub-select too, or the posts will be doubled up.
SELECT p.*, u.user_id, u.user_name,
COALESCE(c.comment_count, 0) AS comment_count,
COALESCE(v.vote_count, 0) AS total_votes,
COALESCE(v.vote_count, 0) / (TIMESTAMPDIFF(MINUTE, p.post_timestamp, SYSDATE()) + 1) AS votes_per_minute
FROM posts AS p
LEFT JOIN (SELECT comment_post_id, COUNT(comment_post_id) AS comment_count FROM comments GROUP BY comment_post_id) AS c ON (p.post_id = c.comment_post_id)
LEFT JOIN (SELECT vote_post_id, COUNT(vote_post_id) AS vote_count FROM post_votes GROUP BY vote_post_id) AS v ON (p.post_id = v.vote_post_id)
LEFT JOIN users AS u ON (p.postedby_id = u.user_id)
-- No outer GROUP BY is needed: each derived table returns at most one row per post.
ORDER BY votes_per_minute DESC
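A minimal sketch of the row multiplication and of the derived-table fix. It uses Python's built-in sqlite3 as a stand-in for MySQL (an assumption made so the example is runnable anywhere); the table and column names follow the question, while the vote_id column and the sample rows are made up for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_id INTEGER PRIMARY KEY, comment_post_id INTEGER);
CREATE TABLE post_votes (vote_id INTEGER PRIMARY KEY, vote_post_id INTEGER);
INSERT INTO posts VALUES (1), (2);
-- post 1: two comments and three votes; post 2: one comment, no votes
INSERT INTO comments (comment_post_id) VALUES (1), (1), (2);
INSERT INTO post_votes (vote_post_id) VALUES (1), (1), (1);
""")

# Joining both child tables directly: every comment pairs with every vote.
naive = con.execute("""
SELECT p.post_id, COUNT(c.comment_id), COUNT(v.vote_id)
FROM posts p
LEFT JOIN comments c ON p.post_id = c.comment_post_id
LEFT JOIN post_votes v ON p.post_id = v.vote_post_id
GROUP BY p.post_id
ORDER BY p.post_id
""").fetchall()

# Aggregating each child table first: at most one row per post on each side.
fixed = con.execute("""
SELECT p.post_id, COALESCE(c.comment_count, 0), COALESCE(v.vote_count, 0)
FROM posts p
LEFT JOIN (SELECT comment_post_id, COUNT(*) AS comment_count
           FROM comments GROUP BY comment_post_id) c
       ON p.post_id = c.comment_post_id
LEFT JOIN (SELECT vote_post_id, COUNT(*) AS vote_count
           FROM post_votes GROUP BY vote_post_id) v
       ON p.post_id = v.vote_post_id
ORDER BY p.post_id
""").fetchall()

print(naive)  # [(1, 6, 6), (2, 1, 0)]
print(fixed)  # [(1, 2, 3), (2, 1, 0)]
```

Post 1 has 2 comments and 3 votes, so the direct join produces 2 x 3 = 6 rows and both counts report 6.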

Clearly you are using MySQL, because it is one of the few databases that allow this type of GROUP BY. It is a bad choice 100% of the time to use it, however. You should group by all the fields in the SELECT list that are not part of an aggregate function, the way other databases require. This is how a GROUP BY needs to work. This may clear up your problem, or it may not, depending on the data. If some of the joined tables have multiple records with different data in some of the fields, you still may not get one record per post when you group properly. In that case, you need to write a derived table for the join, or use an aggregate on the field that has more than one value to tell the database which value to use.

Related

How to properly join these three tables in SQL?

I'm currently creating a small application where users can post a text which can be commented and the post can also be voted (+1 or -1).
This is my database:
Now I want to select all information of all posts with status = 1 plus two extra columns: One column containing the count of comments and one column containing the sum (I call it score) of all votes.
I currently use the following query, which correctly adds the count of the comments:
SELECT *, COUNT(comments.fk_commented_post) as comments
FROM posts
LEFT JOIN comments
ON posts.id_post = comments.fk_commented_post
AND comments.status = 1
WHERE posts.status = 1
GROUP BY posts.id_post
Then I tried to additionally add the sum of the votes, using the following query:
SELECT *, COUNT(comments.fk_commented_post) as comments, SUM(votes_posts.type) as score
FROM posts
LEFT JOIN comments
ON posts.id_post = comments.fk_commented_post
AND comments.status = 1
LEFT JOIN votes_posts
ON posts.id_post = votes_posts.fk_voted_post
WHERE posts.status = 1
GROUP BY posts.id_post
The result is no longer correct for either the votes or the comments. Somehow some of the values seem to be getting multiplied...
This is probably simpler using correlated subqueries:
select p.*,
(select count(*)
from comments c
where c.fk_commented_post = p.id_post and c.status = 1
) as num_comments,
(select sum(vp.type)
from votes_posts vp
where vp.fk_voted_post = p.id_post
) as num_score
from posts p
where p.status = 1;
The problem with the join is that the counts get messed up because the two other tables are not related to each other, so you get a Cartesian product of comments and votes for each post.
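A small sketch of the correlated-subquery approach, run through Python's sqlite3 module as a stand-in for MySQL (an assumption); the schema is reduced to the columns the query touches and the sample rows are invented. Each subquery is evaluated per post on its own, so comments and votes never multiply each other.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (id_post INTEGER PRIMARY KEY, status INTEGER);
CREATE TABLE comments (fk_commented_post INTEGER, status INTEGER);
CREATE TABLE votes_posts (fk_voted_post INTEGER, type INTEGER);
INSERT INTO posts VALUES (1, 1), (2, 1);
INSERT INTO comments VALUES (1, 1), (1, 1);
INSERT INTO votes_posts VALUES (1, 1), (1, 1), (1, -1);
""")

rows = con.execute("""
SELECT p.id_post,
       (SELECT COUNT(*) FROM comments c
         WHERE c.fk_commented_post = p.id_post AND c.status = 1) AS num_comments,
       (SELECT SUM(vp.type) FROM votes_posts vp
         WHERE vp.fk_voted_post = p.id_post) AS score
FROM posts p
WHERE p.status = 1
ORDER BY p.id_post
""").fetchall()

print(rows)  # [(1, 2, 1), (2, 0, None)]
```

COUNT(*) over no rows is 0, but SUM over no rows is NULL, so wrap the score in COALESCE(..., 0) if a post without votes should show 0.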
You want to join the comment count and the vote score to the posts. So, aggregate first, then join.
select
p.*,
coalesce(c.cnt, 0) as comments,
coalesce(v.score, 0) as score
from posts p
left join
(
select fk_commented_post as id_post, count(*) as cnt
from comments
where status = 1
group by fk_commented_post
) c on c.id_post = p.id_post
left join
(
select fk_voted_post as id_post, sum(type) as score
from votes_posts
group by fk_voted_post
) v on v.id_post = p.id_post
where p.status = 1
order by p.id_post;

Three tables join and count showing wrong result

I have 3 tables. I need to get all the info from catalog, join the ratings table and the comments table, and then count the comments per catalog entry. My SQL query:
SELECT
catalog.catalog_id,
catalog.slug,
catalog.title,
catalog.city,
catalog.street,
catalog.image,
COUNT(ratings.rate) AS votes,
COUNT(comments.catalog_id) AS total_comments,
ROUND(SUM(ratings.rate) / COUNT(ratings.rate)) AS average
FROM
catalog
LEFT JOIN ratings ON ratings.object_id = catalog.catalog_id
LEFT JOIN comments ON comments.catalog_id = catalog.catalog_id
GROUP BY
catalog.catalog_id
ORDER BY
average,
votes DESC
Everything looks fine except total_comments, which shows 6 even though the comments table has only 2 rows. I think it's a problem with the grouping. I've tried GROUP BY catalog.catalog_id, comments.catalog_id, but it didn't help.
My tables:
The problem is that you have multiple ratings and multiple comments per catalog entry, so the two joins produce a Cartesian product of ratings and comments for each entry.
The correct solution is to pre-aggregate the data before joining.
SELECT c.*, r.votes, cm.total_comments,
ROUND(r.sumrate / r.votes) AS average
FROM catalog c LEFT JOIN
(SELECT r.object_id, COUNT(*) as votes, SUM(r.rate) as sumrate
FROM ratings r
GROUP BY r.object_id
) r
ON r.object_id = c.catalog_id LEFT JOIN
-- the comments subquery needs its own alias (cm) so it does not clash with catalog c
(SELECT cm.catalog_id, COUNT(*) as total_comments
FROM comments cm
GROUP BY cm.catalog_id
) cm
ON cm.catalog_id = c.catalog_id
ORDER BY average, votes DESC;
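To see where a figure like 6 can come from, here is a sketch of the original double join on made-up data, run with Python's sqlite3 as a stand-in for MySQL (an assumption). The two comments match the question; the three ratings are assumed, since the question does not show the ratings table. The `* 1.0` is only there because SQLite would otherwise do integer division.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE catalog (catalog_id INTEGER PRIMARY KEY);
CREATE TABLE ratings (object_id INTEGER, rate INTEGER);
CREATE TABLE comments (catalog_id INTEGER);
INSERT INTO catalog VALUES (1);
INSERT INTO ratings VALUES (1, 5), (1, 4), (1, 3);  -- assumed: three ratings
INSERT INTO comments VALUES (1), (1);               -- two comments
""")

row = con.execute("""
SELECT COUNT(*)                   AS joined_rows,
       COUNT(ratings.rate)        AS votes,
       COUNT(comments.catalog_id) AS total_comments,
       ROUND(SUM(ratings.rate) * 1.0 / COUNT(ratings.rate)) AS average
FROM catalog
LEFT JOIN ratings  ON ratings.object_id   = catalog.catalog_id
LEFT JOIN comments ON comments.catalog_id = catalog.catalog_id
GROUP BY catalog.catalog_id
""").fetchone()

print(row)  # (6, 6, 6, 4.0)
```

With 3 ratings and 2 comments the join yields 3 x 2 = 6 rows, so both counts report 6. The average still looks right because every rating is repeated the same number of times, which can hide the fact that votes is inflated too.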

Post with average comments per hour

SELECT p.*,
u.user_id,
u.user_name,
count(c.comment_post_id) AS comments
FROM posts AS p
LEFT JOIN comments AS c
ON (p.post_id = c.comment_post_id)
LEFT JOIN users AS u
ON (p.postedby_id = u.user_id)
WHERE c.comment_added > NOW() - INTERVAL 1 HOUR
GROUP BY p.post_id
ORDER BY count(c.comment_post_id) DESC
This produces a list of posts sorted by the number of comments in the last hour. The problem is that posts with no comments in the last hour do not make the list.
So is there a way to order by the average number of comments per hour since the post was created? That way all the posts make the list, and the most discussed posts are always on top.
http://sqlfiddle.com/#!9/820044/1/0
EDIT:
SELECT p.*, u.user_id, u.user_name,
COUNT(c.comment_post_id) / (TIMESTAMPDIFF(HOUR, p.post_timestamp, SYSDATE()) + 1) AS rate
FROM posts AS p
LEFT JOIN comments AS c ON (p.post_id = c.comment_post_id)
LEFT JOIN users AS u ON (p.postedby_id = u.user_id)
GROUP BY p.post_id
ORDER BY COUNT(c.comment_post_id) / (TIMESTAMPDIFF(HOUR, p.post_timestamp, SYSDATE()) + 1) DESC
This is what solved the problem.
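Because of the WHERE clause, you select only posts that have comments within the last hour: for a post without comments the LEFT JOIN yields NULL in c.comment_added, and the WHERE condition rejects that row. What you can try is: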
SELECT p.*,
u.user_id,
u.user_name,
SUM(c.comment_added > NOW() - INTERVAL 1 HOUR) AS comments
FROM posts AS p
LEFT JOIN comments AS c
ON (p.post_id = c.comment_post_id)
LEFT JOIN users AS u
ON (p.postedby_id = u.user_id)
GROUP BY p.post_id
ORDER BY
COUNT(c.comment_post_id) /
(TIMESTAMPDIFF(HOUR, p.post_timestamp, SYSDATE()) + 1) DESC
For each comment made within the last hour, the boolean expression c.comment_added > NOW() - INTERVAL 1 HOUR evaluates to 1 and so contributes one to the SUM(...).
For the second part, TIMESTAMPDIFF(HOUR, p.post_timestamp, SYSDATE()) gives the number of hours since the post was published. Dividing the number of comments by the number of hours gives us the sorting criterion. We add one to the hours to avoid division by zero.
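The sorting criterion can be worked through in plain Python. This is only an illustration of the arithmetic, not part of the query; the function name and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def comments_per_hour(comment_count: int, posted_at: datetime, now: datetime) -> float:
    """Comments divided by (whole hours since posting + 1)."""
    # Whole elapsed hours, truncated, like TIMESTAMPDIFF(HOUR, posted_at, now).
    whole_hours = int((now - posted_at).total_seconds() // 3600)
    # The +1 keeps the divisor non-zero for posts less than an hour old.
    return comment_count / (whole_hours + 1)

now = datetime(2024, 1, 1, 12, 0)
fresh = comments_per_hour(3, now - timedelta(minutes=10), now)  # 3 / (0 + 1)
older = comments_per_hour(12, now - timedelta(hours=5), now)    # 12 / (5 + 1)

print(fresh, older)  # 3.0 2.0
```

A ten-minute-old post with 3 comments therefore sorts above a five-hour-old post with 12.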
Bit hard to see without the data, but to solve your immediate problem (i.e. get the zero-recent-comments posts showing), don't you just want to move your constraint into the LEFT JOIN?
LEFT JOIN comments AS c
ON (p.post_id = c.comment_post_id AND c.comment_added > NOW() - INTERVAL 1 HOUR)
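A sketch of the difference between filtering in WHERE and filtering in the ON clause, using Python's sqlite3 as a stand-in for MySQL (an assumption). To keep it deterministic, comment_added is stored as plain minutes and "now" is fixed at 1000; both are simplifications invented for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
CREATE TABLE comments (comment_post_id INTEGER, comment_added INTEGER);
INSERT INTO posts VALUES (1), (2), (3);
-- post 1: two recent comments; post 2: one old comment; post 3: none
INSERT INTO comments VALUES (1, 990), (1, 995), (2, 100);
""")
cutoff = 1000 - 60  # "one hour ago" in the made-up minute scale

# Condition in WHERE: rows with NULL or old comment_added are discarded.
in_where = con.execute("""
SELECT p.post_id, COUNT(c.comment_post_id)
FROM posts p
LEFT JOIN comments c ON p.post_id = c.comment_post_id
WHERE c.comment_added > ?
GROUP BY p.post_id ORDER BY p.post_id
""", (cutoff,)).fetchall()

# Condition in ON: unmatched posts survive with a NULL comment, counted as 0.
in_on = con.execute("""
SELECT p.post_id, COUNT(c.comment_post_id)
FROM posts p
LEFT JOIN comments c ON p.post_id = c.comment_post_id AND c.comment_added > ?
GROUP BY p.post_id ORDER BY p.post_id
""", (cutoff,)).fetchall()

print(in_where)  # [(1, 2)]
print(in_on)     # [(1, 2), (2, 0), (3, 0)]
```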

Duplicated rows

SQL Query:
SELECT
T.*,
U.nick AS author_nick,
P.id AS post_id,
P.name AS post_name,
P.author AS post_author_id,
P.date AS post_date,
U2.nick AS post_author
FROM
zero_topics T
LEFT JOIN
zero_posts P
ON
T.id = P.topic_id
LEFT JOIN
zero_players U
ON
T.author = U.uuid
LEFT JOIN
zero_players U2
ON
P.author = U2.uuid
ORDER BY
CASE
WHEN P.date is null THEN T.date
ELSE P.date
END DESC
Output:
Topics:
Posts:
Question: Why is topic id 22 duplicated? In MySQL I have two topics (ids 22 and 23) and two posts (ids 24 and 25). I want to see each topic with its last post only.
If a join produces multiple results and you want only at most one result, you have to rewrite the join and/or filtering criteria to provide that result. If you want only the latest result of all the results, it's doable and reasonably easy once you use it a few times.
select a.Data, b.Data
from Table1 a
left join Table2 b
on b.JoinValue = a.JoinValue
and b.DateField =(
select Max( DateField )
from Table2
where JoinValue = b.JoinValue );
The correlated subquery pulls out the one date that is the highest (most recent) value of all the joinable candidates. That then becomes the row that takes part in the join -- or, of course, nothing if there are no candidates at all. This is a pattern I use quite a lot.
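Applied to the tables from the question, the pattern might look like the sketch below, run with Python's sqlite3 as a stand-in for MySQL (an assumption). The ids come from the question; the dates, the reduced column list, and the guess that both posts belong to topic 22 are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE zero_topics (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE zero_posts (id INTEGER PRIMARY KEY, topic_id INTEGER, date TEXT);
INSERT INTO zero_topics VALUES (22, 'first'), (23, 'second');
-- assumed: both posts belong to topic 22, topic 23 has none
INSERT INTO zero_posts VALUES (24, 22, '2020-01-01'), (25, 22, '2020-01-02');
""")

rows = con.execute("""
SELECT T.id, P.id
FROM zero_topics T
LEFT JOIN zero_posts P
       ON P.topic_id = T.id
      AND P.date = (SELECT MAX(P2.date)
                    FROM zero_posts P2
                    WHERE P2.topic_id = P.topic_id)
ORDER BY T.id
""").fetchall()

print(rows)  # [(22, 25), (23, None)]
```

If two posts in a topic share the same latest date, both still join, so a tie-breaker (for example the highest post id) is needed when dates are not unique.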

Count votes in subquery or use join - which is faster?

I am working on a forum system (mysql) and I'm not sure which path to choose for better performance when retrieving in a single query posts, up and down votes and if the current user voted for each post.
The first option is this:
SELECT posts.post_id, post_content, display_name,
(SELECT COUNT(post_id) FROM post_votes WHERE post_votes.post_id=posts.post_id AND post_votes.user_id='+user_id+') voted,
(SELECT COUNT(post_id) FROM post_votes WHERE post_votes.post_id=posts.post_id AND up_vote=1) upvotes,
(SELECT COUNT(post_id) FROM post_votes WHERE post_votes.post_id=posts.post_id AND up_vote=0) downvotes
FROM posts JOIN users ON users.user_id=posts.user_id WHERE parent_id ='+parent_id+' ORDER BY post_id DESC
The second option is to replace all the count sub-queries with LEFT JOIN and count.
Are there any advantages to one method over the other?
Edit:
Since I'm looking to retrieve all posts rather than a single row that groups posts, I came up with this query (with some inspiration from here):
SELECT p.post_id, post_content, display_name,
COALESCE(v.upvotes, 0) AS upvotes,
COALESCE(v.downvotes, 0) AS downvotes,
COALESCE(v.voted, 0) AS voted
FROM posts p
LEFT JOIN (
SELECT post_id,
SUM(vt.up_vote = 1) AS upvotes,
SUM(vt.up_vote = 0) AS downvotes,
MAX(IF(vt.user_id = ' + user_id + ', vt.up_vote, NULL)) voted
FROM post_votes vt
GROUP BY vt.post_id
)
v ON v.post_id = p.post_id
JOIN users ON users.user_id=p.user_id
WHERE parent_id =' + parent_id + ' ORDER BY post_id DESC
I have run both solutions on my demo db (tiny at the moment, with fewer than 100 rows in each table) and the durations were identical.
The question is which one will be faster in the long term.
I can hardly think of anything where a subquery was faster than a join.
In this case you don't even need a subquery. Do it all in one query:
SELECT
p.post_id,
p.post_content,
u.display_name,
COALESCE(SUM(pv.user_id = 'whatever'), 0) AS voted,
COALESCE(SUM(pv.up_vote = 1), 0) AS upvotes,
COALESCE(SUM(pv.up_vote = 0), 0) AS downvotes
FROM posts p
JOIN users u ON u.user_id = p.user_id
-- join every vote on the post; filtering on user_id here would hide other users' votes
LEFT JOIN post_votes pv ON p.post_id = pv.post_id
WHERE p.parent_id ='+parent_id+'
GROUP BY p.post_id
ORDER BY p.post_id DESC
A comparison such as pv.up_vote = 1 inside the SUM() function evaluates to either true or false, which MySQL treats as 1 or 0. That's why we use SUM() here: it counts the rows for which the comparison is true. And voila, everything in one query.
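A runnable sketch of this kind of conditional aggregation, using Python's sqlite3 as a stand-in for MySQL (an assumption; SQLite also evaluates comparisons to 1 or 0). The schema is cut down to the vote columns, the sample rows and the current_user value are hypothetical, and the user id is passed as a bound parameter rather than concatenated into the SQL string.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY);
CREATE TABLE post_votes (post_id INTEGER, user_id INTEGER, up_vote INTEGER);
INSERT INTO posts VALUES (1), (2);
-- post 1: two upvotes and one downvote; post 2: no votes
INSERT INTO post_votes VALUES (1, 10, 1), (1, 11, 1), (1, 12, 0);
""")

current_user = 11  # hypothetical id of the logged-in user

rows = con.execute("""
SELECT p.post_id,
       COALESCE(SUM(pv.up_vote = 1), 0) AS upvotes,
       COALESCE(SUM(pv.up_vote = 0), 0) AS downvotes,
       COALESCE(SUM(pv.user_id = ?), 0) AS voted
FROM posts p
LEFT JOIN post_votes pv ON pv.post_id = p.post_id
GROUP BY p.post_id
ORDER BY p.post_id
""", (current_user,)).fetchall()

print(rows)  # [(1, 2, 1, 1), (2, 0, 0, 0)]
```

The COALESCE calls matter for posts without votes: the LEFT JOIN gives them a single all-NULL vote row, and SUM over only NULLs is NULL rather than 0.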