How to sort groups in MySQL join operator? - mysql

In MySQL I have this query:
SELECT * FROM threads t
JOIN (
SELECT c.*
FROM comments c
WHERE c.thread_id = t.id
ORDER BY date_sent
ASC LIMIT 1
) d ON t.id = d.thread_id
ORDER By d.date_sent DESC
Basically I have two tables, threads and comments. Comments have a foreign key to the thread table. I want to get the earliest comment row for each thread row. Threads should have at least 1 comment. If it doesn't, then the thread row shouldn't be included.
In the query above, I select from threads and then join it with a subquery. I want to use t.id, where t is the table selected outside the brackets. Inside the brackets I build a result set of the comments for the current thread, and do the sorting and limiting there.
Then afterwards, I sort it again, so it's earliest on top. However, when I run this, it gives an error: #1054 - Unknown column 't.id' in 'where clause'.
Does anyone know what's wrong here?
Thanks

The unknown column t.id is due to the fact that the alias t is unknown inside the subquery, but indeed it isn't needed anyway since you join it in the ON clause.
Instead of a LIMIT 1, use a MIN(date_sent) aggregate grouped by thread_id in the subquery. Be careful also using SELECT * in a join query, if columns in both tables have the same names; better to list the columns explicitly.
SELECT
/* List the columns you explicitly need here rather than *
if there is any name overlap (like `id` for example) */
t.*,
c.*
FROM
threads t
/* join threads against the subquery returning only thread_id and earliest date_sent */
INNER JOIN (
SELECT thread_id, MIN(date_sent) AS firstdate
FROM comments
GROUP BY thread_id
) earliest ON t.id = earliest.thread_id
/* then join the subquery back against the full comments table to get the other columns
in that table. The join is done on both thread_id and the date_sent timestamp */
INNER JOIN comments c
ON earliest.thread_id = c.thread_id
AND earliest.firstdate = c.date_sent
ORDER BY c.date_sent DESC
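To see the MIN-then-join-back pattern on concrete rows, here is a minimal sketch. It runs against an in-memory SQLite database from Python as a stand-in for MySQL, and the table contents (and the `title` and `body` columns) are made up for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER REFERENCES threads(id),
        date_sent TEXT,
        body TEXT
    );
    -- Hypothetical sample data: thread 3 has no comments at all.
    INSERT INTO threads VALUES (1, 'first'), (2, 'second'), (3, 'no comments');
    INSERT INTO comments VALUES
        (10, 1, '2020-01-05', 'later comment on thread 1'),
        (11, 1, '2020-01-02', 'earliest comment on thread 1'),
        (12, 2, '2020-01-03', 'only comment on thread 2');
""")

# Find the earliest date per thread, then join back to comments for the full row.
earliest_per_thread = conn.execute("""
    SELECT t.id, t.title, c.id, c.date_sent
    FROM threads t
    INNER JOIN (
        SELECT thread_id, MIN(date_sent) AS firstdate
        FROM comments
        GROUP BY thread_id
    ) earliest ON t.id = earliest.thread_id
    INNER JOIN comments c
        ON earliest.thread_id = c.thread_id
       AND earliest.firstdate = c.date_sent
    ORDER BY c.date_sent DESC
""").fetchall()

print(earliest_per_thread)
```

Thread 3 drops out because of the inner joins. Note that if two comments in the same thread share the same earliest date_sent, this pattern returns that thread once per tied comment.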

Michael's answer is correct. This is another answer that follows more the form of your query. You can do what you want as a correlated subquery and then join in the additional information:
SELECT *
FROM (SELECT t.*,
(SELECT c.id
FROM comments c
WHERE c.thread_id = t.id
ORDER BY c.date_sent ASC
LIMIT 1
) as firstcommentid
FROM threads t
) t JOIN
comments c
on t.firstcommentid = c.id
ORDER By c.date_sent DESC;
It is possible that this has better performance, because it does not require aggregating all the data. However, for performance, you would want an index on comments(thread_id, date_sent, id).
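Here is a small sketch of the correlated-subquery form together with the suggested index. It uses Python's sqlite3 module as a stand-in for MySQL; the sample rows, the index name, and the extra `id` tie-breaker in the ORDER BY are assumptions added for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE threads (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE comments (
        id INTEGER PRIMARY KEY,
        thread_id INTEGER REFERENCES threads(id),
        date_sent TEXT
    );
    -- Composite index matching the subquery's filter and sort order (hypothetical name).
    CREATE INDEX idx_comments_thread_date ON comments (thread_id, date_sent, id);
    -- Hypothetical data: comments 10 and 11 tie on date_sent; thread 3 has no comments.
    INSERT INTO threads VALUES (1, 'first'), (2, 'second'), (3, 'no comments');
    INSERT INTO comments VALUES
        (10, 1, '2020-01-02'),
        (11, 1, '2020-01-02'),
        (12, 1, '2020-01-05'),
        (13, 2, '2020-01-03');
""")

# The scalar subquery picks one comment id per thread; the outer join fetches that row.
first_comments = conn.execute("""
    SELECT t.id, t.title, c.id, c.date_sent
    FROM (
        SELECT th.*,
               (SELECT c2.id
                FROM comments c2
                WHERE c2.thread_id = th.id
                ORDER BY c2.date_sent ASC, c2.id ASC
                LIMIT 1) AS first_comment_id
        FROM threads th
    ) t
    JOIN comments c ON t.first_comment_id = c.id
    ORDER BY c.date_sent DESC
""").fetchall()

print(first_comments)
```

Because the subquery returns a single id, a tie on date_sent yields exactly one row per thread, unlike a join on the minimum date. Threads without comments get a NULL id and are dropped by the inner join.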

Related

Order by select count(*) and LIMIT is very slow

I have this query in my program. When I sort by the select count(*) field, the query runs very slowly, and I don't know why.
The problem is that ordering by posts_count runs much slower than ordering by any other field.
Here's the query:
select `tags`.*, (select count(*) from `posts` inner join `post_tag` on `posts`.`id` = `post_tag`.`post_id` where `tags`.`id` = `post_tag`.`tag_id`) as `posts_count` from `tags` order by `posts_count` asc limit 15 offset 0;
Here's the execution time:
Could someone please help me improve this query? Thank you.
What I expect is for the query to run faster.
/* COUNT(pt.tag_id) rather than COUNT(*), so tags with no posts count as 0 */
SELECT t.*, COUNT(pt.tag_id) AS posts_count
FROM tags AS t
LEFT OUTER JOIN post_tag AS pt ON t.id = pt.tag_id
GROUP BY t.id
ORDER BY posts_count ASC LIMIT 15 OFFSET 0;
You should make sure post_tag has an index starting with the tag_id column. You didn't include your table definition in your question, so I must assume the index is there. If the primary key starts with tag_id, that's okay too.
You don't need to join to posts, if I can assume that every row in post_tag references an existing row in posts. You can get the information you need by joining to post_tag alone.
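One detail worth checking when counting over a LEFT OUTER JOIN is how tags with no posts are counted. The sketch below compares COUNT(*) with COUNT(pt.tag_id) on made-up data, using Python's sqlite3 module as a stand-in for MySQL; the `name` column, the primary key layout, and the sample rows are assumptions.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
    -- Primary key starts with tag_id, so lookups by tag can use it as an index.
    CREATE TABLE post_tag (
        post_id INTEGER,
        tag_id INTEGER,
        PRIMARY KEY (tag_id, post_id)
    );
    -- Hypothetical data: tag 3 is attached to no posts.
    INSERT INTO tags VALUES (1, 'sql'), (2, 'python'), (3, 'unused');
    INSERT INTO post_tag VALUES (100, 1), (101, 1), (100, 2);
""")

# The placeholder is filled with a fixed expression below, never with user input.
query = """
    SELECT t.id, t.name, {count_expr} AS posts_count
    FROM tags AS t
    LEFT OUTER JOIN post_tag AS pt ON t.id = pt.tag_id
    GROUP BY t.id
    ORDER BY posts_count ASC, t.id ASC
    LIMIT 15 OFFSET 0
"""

# COUNT(*) counts the NULL-extended row the outer join produces for an unused tag.
with_star = conn.execute(query.format(count_expr="COUNT(*)")).fetchall()
# COUNT(pt.tag_id) skips NULLs, so an unused tag is counted as 0.
with_column = conn.execute(query.format(count_expr="COUNT(pt.tag_id)")).fetchall()

print(with_star)
print(with_column)
```

With COUNT(*), the unused tag reports 1 rather than 0, which also changes the sort order. Counting a column from the joined table matches what the original correlated subquery returned.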

SQL check if thread timestamp is newer than reply timestamp in JOIN Statement

So I'm kinda new to SQL joins and was thinking on going full overkill probably.
What I want to do is join my four tables together.
What I want to accomplish is that I want all the information from category, and I want it to be matched to the replies with the newest timestamp and then I want to join the t.title which t.id matches r.thread_id
SELECT c.*, t.id, t.title, r.timestamp, u.id, u.username
FROM forum_category AS c
LEFT JOIN forum_threads AS t ON (c.id = t.category_id)
LEFT JOIN forum_replies AS r ON (t.id = r.thread_id
AND r.timestamp =
(
SELECT timestamp
FROM forum_replies
ORDER BY timestamp DESC LIMIT 1
))
LEFT JOIN users AS u ON (r.user_id = u.id)
GROUP BY c.id
As it is now, this code seems to work, though I haven't tested it a lot.
However, I need to expand it to check whether t.timestamp is newer than the latest r.timestamp and, if so, join that one instead, with t.title, t.timestamp and t.user_id.
So if a thread is newer than the latest reply.
I know I could make the first post a reply and solve it that way. But I'm not doing that right now if it's possible to solve in the SQL statement.
SQL layout imgur here:
https://imgur.com/a/nCn2a
One helpful technique is to use Subqueries to break up the mental logic of what your query is trying to do. Basically, a subquery takes the place of a regular table in any query.
So, first up, we need to get the most recent time stamp in the replies for each thread:
select thread_id, max(timestamp) as LatestReply
from forum_replies
group by thread_id
Let's call this our MostRecentThreadSubquery. So, it would let us do something like:
select * from
forum_threads t
LEFT JOIN
(
select thread_id, max(timestamp) as LatestReply
from forum_replies
group by thread_id
) as MostRecentThreadSubquery
on t.id = MostRecentThreadSubquery.thread_id
Make sense? We're no longer joining the forum_threads table against the forum_replies table - we've made a subquery to help us list the most recent reply for each thread id.
Now, we add the SQL CASE statement, to get something like:
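select
t.id as thread_id,
t.category_id,
-- a thread with no replies has a NULL LatestReply, so fall back to its own timestamp
CASE WHEN MostRecentThreadSubquery.LatestReply IS NULL
OR t.timestamp > MostRecentThreadSubquery.LatestReply
THEN t.timestamp
ELSE MostRecentThreadSubquery.LatestReply
END as MostRecentTimestamp
from -- ... the rest of that earlier SQL statement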
Okay, so now we've got a query that, for every thread_id, has the most recent timestamp - whether that's from the forum_replies or from the forum_threads table.
... and you guessed it. We're going to make it another subquery. Let's call it our MostRecentPerThread
select *
from forum_category AS c
LEFT JOIN
(
-- ... that previous query, also selecting t.category_id ...
) as MostRecentPerThread
on c.id = MostRecentPerThread.category_id
Make sense? You're using subqueries as a way of logically breaking down your query into smaller components. You no longer have one gigantic query. You've got a small subquery that simply gets the timestamp of the most recent reply. You've got a small subquery that compares that first subquery to the threads table to get the most recent timestamp. And you've got a main query that uses the second subquery to merge it with the categories table.
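Putting the three layers together might look like the sketch below. It runs on an in-memory SQLite database from Python as a stand-in for MySQL. The column layout is guessed from the question (threads carry category_id and timestamp, replies carry thread_id and timestamp), the sample rows are invented, and the NULL check for threads without replies is an addition of this sketch.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE forum_category (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE forum_threads (
        id INTEGER PRIMARY KEY, category_id INTEGER, title TEXT, timestamp TEXT
    );
    CREATE TABLE forum_replies (
        id INTEGER PRIMARY KEY, thread_id INTEGER, timestamp TEXT
    );
    -- Hypothetical data: thread 2 has no replies, category 2 has no threads.
    INSERT INTO forum_category VALUES (1, 'General'), (2, 'Empty');
    INSERT INTO forum_threads VALUES
        (1, 1, 'A', '2021-01-01'),
        (2, 1, 'B', '2021-01-10');
    INSERT INTO forum_replies VALUES
        (1, 1, '2021-01-05'),
        (2, 1, '2021-01-07');
""")

latest_activity = conn.execute("""
    SELECT c.id, c.name, m.thread_id, m.title, m.MostRecentTimestamp
    FROM forum_category AS c
    LEFT JOIN (
        -- Middle layer: newer of the thread's own timestamp and its latest reply.
        SELECT t.id AS thread_id, t.category_id, t.title,
               CASE WHEN r.LatestReply IS NULL OR t.timestamp > r.LatestReply
                    THEN t.timestamp
                    ELSE r.LatestReply
               END AS MostRecentTimestamp
        FROM forum_threads t
        LEFT JOIN (
            -- Innermost layer: latest reply per thread.
            SELECT thread_id, MAX(timestamp) AS LatestReply
            FROM forum_replies
            GROUP BY thread_id
        ) r ON t.id = r.thread_id
    ) m ON c.id = m.category_id
    ORDER BY c.id, m.thread_id
""").fetchall()

for row in latest_activity:
    print(row)
```

This yields one row per thread, with each category repeated for its threads and a NULL-filled row for an empty category. Reducing it to a single row per category would need one more aggregation step on top.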

Joining on "greater than" returning more than one row for left table

I have a query.
SELECT * FROM users LEFT JOIN ranks ON ranks.minPosts <= users.postCount
This returns a row for every match. By adding GROUP BY users.id I get one row per user id.
However, when the rows are grouped I only get the first row. I would instead like the row with the highest value of ranks.minPosts.
Is there a way to do this? Also, would it be faster (use fewer resources) to just run two separate queries?
Assuming there is only one column in ranks that you want, you can do this using a correlated subquery:
SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u;
If you need the entire row from ranks, then join it back in:
SELECT ur.*, r.*
FROM (SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u
) ur join
ranks r
on ur.minPosts = r.minPosts;
(The * is for convenience; you should list out the columns you want.)
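As a rough check of the join-back version, here is a sketch on invented data, run through Python's sqlite3 module as a stand-in for MySQL. The `name` and `title` columns and the sample rows are hypothetical, and the sketch assumes minPosts is unique within ranks.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, postCount INTEGER);
    -- Assumes each minPosts threshold appears only once in ranks.
    CREATE TABLE ranks (rank_id INTEGER PRIMARY KEY, title TEXT, minPosts INTEGER UNIQUE);
    -- Hypothetical sample data.
    INSERT INTO users VALUES (1, 'ann', 5), (2, 'bob', 10), (3, 'cy', 250);
    INSERT INTO ranks VALUES (1, 'Newbie', 0), (2, 'Member', 10), (3, 'Veteran', 100);
""")

# The scalar subquery finds the highest threshold each user has reached;
# joining back on that threshold pulls in the rest of the rank row.
user_ranks = conn.execute("""
    SELECT ur.id, ur.name, r.title, r.minPosts
    FROM (
        SELECT u.*,
               (SELECT r2.minPosts
                FROM ranks r2
                WHERE r2.minPosts <= u.postCount
                ORDER BY r2.minPosts DESC
                LIMIT 1) AS minPosts
        FROM users u
    ) ur
    JOIN ranks r ON ur.minPosts = r.minPosts
    ORDER BY ur.id
""").fetchall()

print(user_ranks)
```

If two ranks could share the same minPosts, the join back would return both, so selecting the rank's primary key in the subquery would be the safer variant.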
Because you're using mysql, this will work:
SELECT * FROM (
SELECT *, users.id user_id
FROM users
LEFT JOIN ranks ON ranks.minPosts <= users.postCount
ORDER BY ranks.minPosts DESC
) x
GROUP BY user_id
With ONLY_FULL_GROUP_BY disabled, MySQL typically returns the first row encountered for each unique group, so if you first order the data, then use the non-standard grouping behaviour, you'll usually get the row you want.
Disclaimer:
The MySQL documentation says not to rely on this: the values chosen for non-aggregated columns are nondeterministic, and the optimizer may ignore an ORDER BY inside a derived table. Since MySQL 5.7.5, ONLY_FULL_GROUP_BY is enabled by default, so a query like this is typically rejected unless you change the SQL mode. If you use this convenient approach, consider that it is not recommended by MySQL and that later releases may not continue to behave this way.
What we'd really like to do would be to order the rows by ranks.minPosts before the group by. Unfortunately MySQL doesn't support that without using a subquery of some form.
If the ranks are already ordered by their ids then you can extract the id by selecting MAX(ranks.id), and if they're not, you can still get the highest ranks.minPosts by selecting MAX(ranks.minPosts). However, it would be nice to be able to get the entire record. I guess you're left with the subquery solution, which is as follows:
SELECT <fields> FROM users LEFT JOIN
(SELECT * FROM ranks ORDER BY minPosts DESC) as r
ON r.minPosts <= users.postCount GROUP BY users.id

Not sure how to approach this with a mySQL query for two tables

I have two tables users and distance. In a page I need to list all users with a simple query such as select * from users where active=1 order by id desc.
Sometimes I need to output data from the distance table along with this query where the user ID field in users is matched in the distance table in EITHER of two columns, say userID_1 and userID_2. Also in the distance table either of the two mentioned columns must also match a specified id ($userID) as well in the where clause.
This is the best that I came up with:
select
a.*,
b.distance
from
users a,
distance b
where
((b.userID_1='$userID' and a.id=b.userID_2)
or (a.id=b.userID_1 and b.userID_2='$userID'))
and a.active=1
order by a.id desc
The only problem with this query is that if there is no entry in the distance table for the where clause to find a match, the query does not return anything at all. I still want it to return the row from the user table and return distance as null if there are no matches.
I cannot figure out if I need to use a JOIN, UNION, SUBQUERY or anything else for this situation.
Thanks.
Use a left join
select
a.*,
b.distance
from
users a
left join distance b on
(b.userID_1=? and a.id=b.userID_2)
or (b.userID_2=? and a.id=b.userID_1)
where
a.active=1
order by a.id desc
and use a prepared statement. Substituting text into a query is vulnerable to SQL Injection attacks.
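A minimal sketch of the parameterized left join follows, using Python's sqlite3 module as a stand-in for a MySQL driver. The `?` placeholder is sqlite3's style; MySQL drivers for Python commonly use `%s` instead. The sample rows and the `name` column are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, active INTEGER);
    CREATE TABLE distance (userID_1 INTEGER, userID_2 INTEGER, distance REAL);
    -- Hypothetical data: user 4 is inactive.
    INSERT INTO users VALUES (1, 'a', 1), (2, 'b', 1), (3, 'c', 1), (4, 'd', 0);
    INSERT INTO distance VALUES (1, 2, 5.0), (3, 1, 7.5);
""")

def users_with_distance(conn, user_id):
    """Return (user id, distance to user_id or None) for every active user."""
    sql = """
        SELECT a.id, b.distance
        FROM users a
        LEFT JOIN distance b ON
               (b.userID_1 = ? AND a.id = b.userID_2)
            OR (b.userID_2 = ? AND a.id = b.userID_1)
        WHERE a.active = 1
        ORDER BY a.id DESC
    """
    # The value is bound by the driver, never spliced into the SQL text.
    return conn.execute(sql, (user_id, user_id)).fetchall()

rows_for_user_1 = users_with_distance(conn, 1)
print(rows_for_user_1)
```

User 1 has no distance to itself, so it still appears with a NULL (None) distance, which is the behaviour the question asked for.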
You need a left join between 'users' and 'distance'. As a result (pun not intended), you will always get the rows from the 'users' table along with any matching rows (if any) from 'distance'.
I notice that you are using the SQL-89 join syntax ("implicit joins") as opposed to SQL-92 join syntax ("explicit joins"). I wrote about this once.
I suggest that you change your query to
select a.*, b.distance
from users a left join distance b
on ((b.userID_1='$userID' and a.id=b.userID_2)
or (a.id=b.userID_1 and b.userID_2='$userID'))
where a.active=1
order by a.id desc
Try this:
select a.*, b.distance
from users a
left join distance b on (a.id=b.userID_1 or a.id=b.userID_2) and
(b.userID_1 = '$userID' or b.userID_2 = '$userID')
where a.active=1
order by a.id desc

Why doesn't my content field match my MAX(id) field in MySQL?

I'm trying to get a subset of data based on the latest id and dates. It seems that when selecting other fields in the table they are not in sync with the max id and dates returned.
Any idea how I can fix this?
MySQL:
SELECT MAX(m.id) as id, m.sender_id, m.receiver_id, MAX(m.date) as date, m.content, l.username, p.gender
FROM messages m
LEFT JOIN login_users l on l.user_id = m.sender_id
LEFT JOIN profiles p ON p.user_id = l.user_id
WHERE m.receiver_id=3
GROUP BY m.sender_id ORDER BY date DESC LIMIT 0, 7
The data for content isn't the correct one. It seems to be returning random content and not the content that is tied to the row for max id and max date.
Do I need to do some sort of sub select to fix this?
To answer the question in the title, "Why doesn't my content field match my MAX(id) field", that's because there is no guarantee that the values returned for the non-aggregate fields will be from the row where the MAX value is found. This is the documented behavior, and this is what we expect.
Other DBMSs would throw an error on the statement. MySQL is just more lax: you are getting values from one row, but it's not guaranteed to be the row on which either of the MAX values (id or date) is found.
You have two separate aggregate expressions, MAX(m.id) and MAX(m.date). Note that there is no guarantee that those values will come from the same row.
The rule in other databases is that every non-aggregate expression in the SELECT list needs to appear in the GROUP BY. (MySQL is more lax about that, and doesn't make that a requirement.)
One way to "fix" the query so that it does return values from the row with the MAX value is to use an inline view (query) that gets the MAX(id) grouped by what you want to GROUP BY, and then a JOIN back to the original table to get other values on the row.
From your statement it's not clear what result set you want returned. If you want the row that has the maximum id and you also want the row with the maximum date, then you could do something like this:
SELECT m.id
, m.sender_id
, m.receiver_id
, m.date
, m.content
, l.username
, p.gender
FROM ( SELECT t.sender_id
, t.receiver_id
, MAX(t.id) AS max_id
, MAX(t.date) AS max_date
FROM messages t
WHERE t.receiver_id=3
GROUP
BY t.sender_id
, t.receiver_id
) s
JOIN messages m
ON m.sender_id = s.sender_id
AND m.receiver_id = s.receiver_id
AND ( m.id = s.max_id OR m.date = s.max_date)
LEFT
JOIN login_users l on l.user_id = m.sender_id
LEFT
JOIN profiles p ON p.user_id = l.user_id
ORDER BY m.date DESC LIMIT 0, 7
The inline view aliased as "s" returns the max values, and then that gets joined back to the messages table, aliased as "m".
NOTE
In most cases, we find that a JOIN (query) will perform better than an IN (query), because of the different access plans. You can see the difference in plans with an EXPLAIN.
For performance, you'll want an index
... ON messages (`receiver_id`, `sender_id`, `id`, `date`)
There's an equality predicate on receiver_id, so that should be the leading column, to get a range scan (instead of a full scan). You want the sender_id column next, because that should allow MySQL to avoid a "Using filesort" operation to get the rows grouped. The id and date columns are included, so that the inline view query can be satisfied entirely from the index pages without a need to access the pages in the table. (The EXPLAIN should show "Using where; Using index".)
That same index should also be suitable for the outer query, though it does need to access the "content" column from the table pages, so the EXPLAIN will not show "Using index" for that step. (It's likely that the "content" column is much longer than we would want in the index.)
Using a join
SELECT LatestM.id, m.sender_id, m.receiver_id, m.date, m.content, l.username, p.gender
FROM (
SELECT sender_id, MAX(id) AS id
FROM messages
WHERE receiver_id=3
GROUP BY sender_id
) LatestM
INNER JOIN messages m
ON LatestM.sender_id = m.sender_id AND LatestM.id = m.id
LEFT JOIN login_users l on l.user_id = m.sender_id
LEFT JOIN profiles p ON p.user_id = l.user_id
WHERE m.receiver_id = 3
ORDER BY date DESC
LIMIT 0, 7
Problem with this is that if the latest id does not reflect the latest date then the date returned will not be the latest one.
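To illustrate the MAX(id)-per-sender join on concrete rows, here is a sketch using Python's sqlite3 module as a stand-in for MySQL. The sample messages are invented, and the login_users and profiles joins are left out to keep it short.

```python
import sqlite3

# In-memory SQLite database as a stand-in for MySQL (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (
        id INTEGER PRIMARY KEY,
        sender_id INTEGER,
        receiver_id INTEGER,
        date TEXT,
        content TEXT
    );
    -- Hypothetical data: message 4 is addressed to a different receiver.
    INSERT INTO messages VALUES
        (1, 5, 3, '2020-01-01', 'old from 5'),
        (2, 5, 3, '2020-01-03', 'new from 5'),
        (3, 6, 3, '2020-01-02', 'only from 6'),
        (4, 5, 9, '2020-01-04', 'to someone else');
""")

def latest_per_sender(conn, receiver_id, limit=7):
    """Return the highest-id message from each sender to receiver_id."""
    sql = """
        SELECT m.id, m.sender_id, m.date, m.content
        FROM (
            SELECT sender_id, MAX(id) AS id
            FROM messages
            WHERE receiver_id = ?
            GROUP BY sender_id
        ) latest
        INNER JOIN messages m
            ON latest.sender_id = m.sender_id AND latest.id = m.id
        ORDER BY m.date DESC
        LIMIT ?
    """
    return conn.execute(sql, (receiver_id, limit)).fetchall()

inbox = latest_per_sender(conn, 3)
print(inbox)
```

Every selected column comes from the joined row itself, so content always belongs to the message whose id was picked. As noted above, that is the latest message by date only if ids increase with date.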
Well, you could probably solve it without a subselect, but doing one is fairly straightforward. Something like this should work: make the subselect return the ids of the interesting rows in messages, and fetch the data for only those rows.
SELECT m.id as id, m.sender_id, m.receiver_id, m.date as date,
m.content, l.username, p.gender
FROM messages m
LEFT JOIN login_users l on l.user_id = m.sender_id
LEFT JOIN profiles p ON p.user_id = l.user_id
WHERE m.id IN (
SELECT max(id) FROM messages
WHERE receiver_id=3
GROUP BY sender_id
)
ORDER BY date DESC
LIMIT 0, 7
The reason that your original query does not match up fields is that GROUP BY really requires aggregate functions (like MAX/MIN/SUM/...) applied to every field you select that's not grouped by. The reason the query even runs is that MySQL does not enforce that, but instead returns indeterminate values from any matching row. AFAIK, most other SQL RDBMSs refuse to run the query.
EDIT: As for performance, a few indexes that are likely to help are;
CREATE INDEX ix_inner ON messages(receiver_id, sender_id, id);
CREATE INDEX ix_login_users ON login_users(user_id);
CREATE INDEX ix_profiles ON profiles(user_id);