I am trying to code a forum website and I want to display a list of threads. Each thread should be accompanied by info about the first post (the "head" of the thread) as well as the last. My current database structure is the following:
threads table:
id - int, PK, not NULL, auto-increment
name - varchar(255)
posts table:
id - int, PK, not NULL, auto-increment
thread_id - FK for threads
The tables have other fields as well, but they are not relevant for the query. I am interested in querying threads and somehow JOINing with posts so that I obtain both the first and last post for each thread in a single query (with no subqueries). So far I am able to do it using multiple queries, and I have defined the first post as being:
SELECT *
FROM threads t
LEFT JOIN posts p ON t.id = p.thread_id
ORDER BY p.id
LIMIT 0, 1
The last post is pretty much the same except for ORDER BY id DESC. Now, I could select multiple threads with their first or last posts, by doing:
SELECT *
FROM threads t
LEFT JOIN posts p ON t.id = p.thread_id
GROUP BY t.id
ORDER BY p.id
But of course I can't get both at once, since I would need to sort both ASC and DESC at the same time.
What is the solution here? Is it even possible to use a single query? Is there any way I could change the structure of my tables to facilitate this? If this is not doable, then what tips could you give me to improve the query performance in this particular situation?
You could do something with a subquery and joins:
SELECT first.text as first_post_text, last.text as last_post_text
FROM
(SELECT MAX(id) as max_id, MIN(id) as min_id FROM posts WHERE thread_id = 1234) as sub
JOIN posts first ON (sub.min_id = first.id)
JOIN posts last ON (sub.max_id = last.id)
But that doesn't solve your problem of doing it without subqueries.
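The same idea extends from one thread to all threads by grouping the derived table on thread_id. The following is a minimal sketch, not a tuned implementation: it uses Python's built-in sqlite3 module as a stand-in for MySQL, and the sample rows, the aliases fp and lp, and the helper names are made up for illustration.

```python
import sqlite3


def make_forum_db():
    """Build a tiny in-memory forum with two threads and five posts (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE threads (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            thread_id INTEGER REFERENCES threads(id),
            text TEXT
        );
        INSERT INTO threads (id, name) VALUES (1, 'Welcome'), (2, 'Help');
        INSERT INTO posts (id, thread_id, text) VALUES
            (1, 1, 'hello'), (2, 2, 'question'), (3, 1, 'reply'),
            (4, 2, 'answer'), (5, 1, 'bye');
        """
    )
    return conn


def first_and_last_posts(conn):
    """Return (thread id, name, first post text, last post text) per thread.

    The derived table finds the lowest and highest post id in each thread;
    the posts table is then joined twice, once per end of the thread.
    Threads without posts are omitted because these are inner joins.
    """
    return conn.execute(
        """
        SELECT t.id, t.name, fp.text, lp.text
        FROM (SELECT thread_id, MIN(id) AS min_id, MAX(id) AS max_id
              FROM posts
              GROUP BY thread_id) AS sub
        JOIN threads t ON t.id = sub.thread_id
        JOIN posts fp ON fp.id = sub.min_id
        JOIN posts lp ON lp.id = sub.max_id
        ORDER BY t.id
        """
    ).fetchall()
```

Calling first_and_last_posts(make_forum_db()) returns one row per thread. In MySQL an index on posts(thread_id, id) should let the grouped subquery be answered from the index alone.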
You could add columns to your threads table so that you keep the id of the first and last post of each thread. The first post would never change, but every time you added a new post you would have to update that record in the threads table, so that would double your writes, and you may need to use a transaction to avoid race conditions.
Or you could go so far as to duplicate information about the first and last post in the threads row. Say you needed the user_id of the poster, the timestamp it was posted, and the first 100 characters of the post. You could create 6 new columns in the threads table to contain those pieces of data for the first and last post. It duplicates data, but it means you may be able to display a list of threads without needing to query the posts table at all.
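A rough sketch of the "keep the ids in the threads row" variant, again using Python's sqlite3 module in place of MySQL. The column names first_post_id and last_post_id and the helpers add_post and list_threads are assumptions for illustration; the insert and the update share one transaction so the cached ids cannot drift from the posts table.

```python
import sqlite3


def make_cached_db():
    """Create threads with cached first/last post ids (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE threads (
            id INTEGER PRIMARY KEY,
            name TEXT,
            first_post_id INTEGER,
            last_post_id INTEGER
        );
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            thread_id INTEGER REFERENCES threads(id),
            text TEXT
        );
        INSERT INTO threads (id, name) VALUES (1, 'Welcome'), (2, 'Help');
        """
    )
    return conn


def add_post(conn, thread_id, text):
    """Insert a post and refresh the thread's cached ids in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        cur = conn.execute(
            "INSERT INTO posts (thread_id, text) VALUES (?, ?)",
            (thread_id, text),
        )
        post_id = cur.lastrowid
        # first_post_id is set only once; last_post_id moves on every post.
        conn.execute(
            """
            UPDATE threads
            SET first_post_id = COALESCE(first_post_id, ?),
                last_post_id = ?
            WHERE id = ?
            """,
            (post_id, post_id, thread_id),
        )
    return post_id


def list_threads(conn):
    """List threads with first and last post text via two primary-key lookups."""
    return conn.execute(
        """
        SELECT t.name, fp.text, lp.text
        FROM threads t
        LEFT JOIN posts fp ON fp.id = t.first_post_id
        LEFT JOIN posts lp ON lp.id = t.last_post_id
        ORDER BY t.id
        """
    ).fetchall()
```

The thread list then needs no grouping or sorting over posts at all, at the cost of one extra write per new post.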
As an example, consider an application with 2 MySQL tables: posts and comments. I want to fetch the posts ordered by the latest comment time. If there are lots of comments per post, this will be slow. I want to cache each post's latest comment's time somewhere.
If I cache the latest comment time for each post in Redis, then I can't use it for sorting in MySQL. Here are the approaches I can think of:
Add a "latest comment time" column to the posts table, then update this column whenever a new comment is created (could have performance issues because MySQL locks the row)
Create new table with only the post ID and latest comment time, then update this table whenever a new comment is created (need to join with main posts table)
Store the tuple (latest comment time, post ID) in Redis Sorted Sets, then fetch the post IDs from Redis (if I have a lot of conditions in the where clause, it'll be hard to represent these conditions in Redis)
My main concern is the frequency of updating the latest comment time. Even if I batched it (e.g. update each post at most once per minute), it could still be slow.
Which of these methods is "good" or "bad"? Are there better methods?
Latest comment for each post for one user:
SELECT p.*
FROM ( SELECT post_id, MAX(dt) AS last_comment_dt
FROM Comments
WHERE user_id = ?
GROUP BY post_id ) AS x
JOIN Posts AS p USING(post_id)
ORDER BY last_comment_dt DESC
Index:
Comments: INDEX(user_id, post_id, dt)
Latest comment for each post:
SELECT p.*
FROM ( SELECT post_id, MAX(dt) AS last_comment_dt
FROM Comments
GROUP BY post_id ) AS x
JOIN Posts AS p USING(post_id)
ORDER BY last_comment_dt DESC
Index:
Comments: INDEX(post_id, dt)
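To see the second query in action, here is a small sketch that runs it against an in-memory SQLite database through Python's sqlite3 module (an assumption; the original targets MySQL). The title column, the index name, and the sample rows are invented for the demo.

```python
import sqlite3


def make_comments_db():
    """Three posts, one of which has no comments (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Posts (post_id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE Comments (
            comment_id INTEGER PRIMARY KEY,
            post_id INTEGER REFERENCES Posts(post_id),
            dt TEXT
        );
        -- Composite index from the answer: lets MAX(dt) per post_id
        -- be read from the index.
        CREATE INDEX comments_post_dt ON Comments (post_id, dt);
        INSERT INTO Posts (post_id, title) VALUES (1, 'A'), (2, 'B'), (3, 'C');
        INSERT INTO Comments (post_id, dt) VALUES
            (1, '2024-01-01'), (2, '2024-01-05'), (1, '2024-01-03');
        """
    )
    return conn


def posts_by_latest_comment(conn):
    """Return posts ordered by their most recent comment, newest first."""
    return conn.execute(
        """
        SELECT p.post_id, p.title, x.last_comment_dt
        FROM (SELECT post_id, MAX(dt) AS last_comment_dt
              FROM Comments
              GROUP BY post_id) AS x
        JOIN Posts AS p USING (post_id)
        ORDER BY x.last_comment_dt DESC
        """
    ).fetchall()
```

Note that a post without comments (post 3 in the sample data) does not appear, because the derived table has no row for it.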
I have this query and I want to know whether I can optimize it, because it currently takes a long time to execute (about 4-5 seconds):
SELECT *
FROM `posts` ml INNER JOIN
posts_tag_one gt
ON gt.post_id = ml.id AND gt.tag_id = 15 INNER JOIN
posts_tag_two gg
ON gg.post_id = ml.id AND gg.tag_id = 5
WHERE active = '1' AND NOT ml.id = '639474'
ORDER BY ml.id DESC
LIMIT 5
The database has 600k+ posts; posts_tag_one has about 5 million records and posts_tag_two has 475k+ records.
The example above has only 2 joins, but in some cases I have up to 4, and the other tables have 300k-400k records each.
I am using foreign keys and indexes on the posts_tag_one and posts_tag_two tables, but the query is still slow.
Any advice would help. Thanks!
By the transitive property (if a=b and b=c, then a=c), ml.id = gt.post_id = gg.post_id. Since you are pre-qualifying specific tags, I would rewrite the query so that the most selective table comes first and add better indexes, to see whether the cardinality of the data helps. MySQL also has a keyword, STRAIGHT_JOIN, that tells the engine to join the tables in the order written instead of letting the optimizer reorder them. I have used it many times and have seen significant improvement.
SELECT STRAIGHT_JOIN
*
FROM
posts_tag_two gg
INNER JOIN posts_tag_one gt
ON gg.post_id = gt.post_id
AND gt.tag_id = 15
INNER JOIN posts ml
ON gt.post_id = ml.id
AND ml.active = 1
WHERE
gg.tag_id = 5
AND NOT gg.post_id = 639474
ORDER BY
gg.post_id DESC
LIMIT 5
I would ensure the following table / multi-field indexes
table index
Posts_Tag_One ( tag_id, post_id )
Posts_Tag_Two ( tag_id, post_id )
posts ( id, active )
By starting with the Posts_Tag_Two table, which you are pre-filtering for tag_id = 5, you cut the list down to the pre-qualified rows FIRST, instead of starting with ALL posts and checking which ones carry the tag.
The second-level join is to the POSTS_TAG_ONE table on the same ID, with that level filtered by its tag_id = 15.
Only then does the engine need to go to the POSTS table to check active.
Since the order is based on the ID descending, and posts_tag_two.post_id is the same value as posts.id, the index on the posts_tag_two table should return the records already sorted.
HTH, and I would be interested to know the final performance difference. Again, I have used STRAIGHT_JOIN many times with significant improvement in performance. I also typically do NOT do "SELECT *" for all tables / all columns. Get only what you need.
FEEDBACK
@eshirvana, in MANY cases, yes, the optimizer does this by default. But sometimes the designer knows the makeup of the data better. Take the scenario with POSTS in the lead position. You have a room of boxes for posts, and each box contains, say, 10k records. You have to go through all 10k records, then move to the next box, until you get through 400k records (again, just as an example). Once you find those, the engine goes to the join on the filtered criteria for a specific tag. Those are ordered by ID too, so you have to do a one-to-one correlation. So the question is which table should stay in the primary position.
Now consider the index by tag, starting with one of the posts_tag tables (the smaller one, posts_tag_two, by choice).
You again have a room of boxes, but each box holds only one tag. If you have 300 tag IDs available, you have already cut out a large share of the records, leaving just the small sample you pre-qualified.
The other posts_tag table is similarly a room of boxes, also broken down by tag. So you only have to grab the box for tag #15.
Now you have two very finite sets of records that the JOIN can match on the ID that exists in both. Only once that is done do you ever need to go to the posts table, which by ID is going to be quick and direct. And with the active status in the index, the engine never needs to go to any actual data pages until all conditions are met. Only then does it pull the records from the 3 respective tables being returned.
Sounds like posts_tags is a many-to-many mapping table? It needs two indexes: (post_id, tag_id) and (tag_id, post_id). One of those should probably be the PRIMARY KEY (having an auto_increment id is wasteful and slows things down). The other should be INDEX (not UNIQUE). More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
But, why have both posts_tag_two and posts_tag_one?
In addition to those 'composite' keys, do not also have the single-column (post_id) or (tag_id).
If tag is simply a short string, don't bother normalizing it; simply have it in the table.
For further discussion, please provide SHOW CREATE TABLE for each table. And EXPLAIN SELECT ....
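As a rough illustration of the mapping-table layout described above, the sketch below merges the tags into a single hypothetical posts_tags table with a composite primary key plus the reverse index, then looks up posts carrying two tags. It runs on SQLite through Python's sqlite3 module rather than MySQL, so it demonstrates the schema and the query shape only, not MySQL's execution plan. The index name, the sample rows, and the helper names are assumptions.

```python
import sqlite3


def make_tag_db():
    """Posts plus one many-to-many mapping table with both composite keys."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE posts (id INTEGER PRIMARY KEY, active INTEGER NOT NULL);
        CREATE TABLE posts_tags (
            tag_id INTEGER NOT NULL,
            post_id INTEGER NOT NULL REFERENCES posts(id),
            PRIMARY KEY (tag_id, post_id)   -- no surrogate auto-increment id
        );
        -- Reverse direction as a plain (non-unique) index.
        CREATE INDEX posts_tags_by_post ON posts_tags (post_id, tag_id);
        INSERT INTO posts (id, active) VALUES (1, 1), (2, 1), (3, 0), (4, 1);
        INSERT INTO posts_tags (tag_id, post_id) VALUES
            (15, 1), (5, 1), (15, 2), (15, 3), (5, 3), (15, 4), (5, 4);
        """
    )
    return conn


def posts_with_both_tags(conn, tag_a, tag_b, exclude_id, limit=5):
    """Return ids of active posts that carry both tags, newest id first."""
    rows = conn.execute(
        """
        SELECT p.id
        FROM posts_tags a
        JOIN posts_tags b ON b.post_id = a.post_id AND b.tag_id = :tag_b
        JOIN posts p ON p.id = a.post_id AND p.active = 1
        WHERE a.tag_id = :tag_a
          AND a.post_id <> :exclude_id
        ORDER BY a.post_id DESC
        LIMIT :limit
        """,
        {"tag_a": tag_a, "tag_b": tag_b, "exclude_id": exclude_id, "limit": limit},
    ).fetchall()
    return [post_id for (post_id,) in rows]
```

With the sample data, posts 1 and 4 are active and carry both tags 5 and 15; post 3 carries both but is inactive, and post 2 has only tag 15.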
I have two DB tables, one that stores events and a second that stores any associated comments for each event.
DB Tables:
events: id, owner_id, timestamp
comments: cmt_id, parent_id(events id), cmt_time
I'm trying to get the last 5 comments for each event based on a specific owner_id.
This is how I'm joining my tables:
SELECT * FROM `events`
LEFT JOIN comments ON comments.parent_id=events.id
WHERE owner_id=X
ORDER BY timestamp DESC LIMIT 0,5
Any idea how I can get the number of comments based on the event_id?
Your question is about the number of comments for each event (at least as I interpret it). For this, you want to use a group by:
SELECT e.id, COUNT(c.parent_id) AS NumComments
FROM events e LEFT JOIN
comments c
ON c.parent_id = e.id
WHERE e.owner_id = X
GROUP BY e.id;
As for the query in your question, it does not do what you want it to do ("I'm trying to get the last 5 comments for each event based on a specific owner_id."). Instead, it returns five rows in total for the given owner, not five per event. Period.
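For the "last 5 comments per event" part, one possible approach (not taken from the answers here) is a correlated subquery that keeps a comment only if fewer than N newer comments exist for the same event. The sketch below assumes that a higher cmt_id means a newer comment and runs on SQLite through Python's sqlite3 module; the sample rows and helper names are made up. On MySQL 8.0 or later, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY cmt_time DESC) is a common alternative.

```python
import sqlite3


def make_events_db():
    """Owner 7 has events 1 (seven comments) and 2 (two comments)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (id INTEGER PRIMARY KEY, owner_id INTEGER);
        CREATE TABLE comments (
            cmt_id INTEGER PRIMARY KEY AUTOINCREMENT,
            parent_id INTEGER REFERENCES events(id),
            cmt_time TEXT
        );
        INSERT INTO events (id, owner_id) VALUES (1, 7), (2, 7), (3, 8);
        """
    )
    rows = [(1, "2024-01-0%d" % day) for day in range(1, 8)]
    rows += [(2, "2024-02-01"), (2, "2024-02-02"), (3, "2024-03-01")]
    conn.executemany(
        "INSERT INTO comments (parent_id, cmt_time) VALUES (?, ?)", rows
    )
    return conn


def last_comments_per_event(conn, owner_id, n=5):
    """Return (event id, cmt_id) for the newest n comments of each event."""
    return conn.execute(
        """
        SELECT e.id, c.cmt_id
        FROM events e
        JOIN comments c ON c.parent_id = e.id
        WHERE e.owner_id = :owner
          AND (SELECT COUNT(*)
               FROM comments newer
               WHERE newer.parent_id = c.parent_id
                 AND newer.cmt_id > c.cmt_id) < :n
        ORDER BY e.id, c.cmt_id DESC
        """,
        {"owner": owner_id, "n": n},
    ).fetchall()
```

The correlated subquery runs once per comment row, so on large tables an index on comments(parent_id, cmt_id) would matter.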
You can do the table join and then use the COUNT() function to count how many comments are associated with a given event_id.
http://www.w3schools.com/sql/sql_func_count.asp
I would give an example, but I'm not entirely sure what you would like your end dataset to look like. COUNT(col) counts the rows in the result (or in each group) where col is not NULL.
I hope it'll be legal to post this, as I'm aware of other similar posts on this topic. But I'm not able to get the other solutions to work, so I'm posting my own scenario. In most of the other examples, like this one, I'm unsure how they refer to the table names and rows. Is it through the punctuation?
SELECT bloggers.*, COUNT(post_id) AS post_count
FROM bloggers LEFT JOIN blogger_posts
ON bloggers.blogger_id = blogger_posts.blogger_id
GROUP BY bloggers.blogger_id
ORDER BY post_count
I have a table with articles, and a statistics table that gets a new record every time an article is read. I am trying to write a query that sorts my article table by the number of records for each article id in the statistics table, like a "sort by views" function.
my 2 tables:
article
id
statistics
pid <- same as article id
Looking at other examples, I'm lacking the left join. I just can't wrap my head around how to make that work. My query at the moment looks like this:
$query = "SELECT *, COUNT(pid) AS views FROM statistics GROUP BY pid ORDER BY views DESC";
Any help is greatly appreciated!
SELECT article.*, COUNT(statistics.pid) AS views
FROM article LEFT JOIN statistics ON article.id = statistics.pid
GROUP BY article.id
ORDER BY views DESC
Ideas:
Combine both tables using a join
If an article has no statistics, fill up with NULL, i.e. use a left join
COUNT only counts non-NULL values, so count by right table to give correct zero results
GROUP BY to obtain exactly one result row for every article, i.e. to count statistics for each article individually
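The difference between counting the right-hand column and counting rows is easiest to see on a small data set. The sketch below runs the query above on SQLite through Python's sqlite3 module (a stand-in for MySQL), with an extra COUNT(*) column added purely for comparison; the sample rows, the joined_rows alias, and the helper names are invented.

```python
import sqlite3


def make_article_db():
    """Three articles; article 3 has never been read (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE article (id INTEGER PRIMARY KEY);
        CREATE TABLE statistics (pid INTEGER REFERENCES article(id));
        INSERT INTO article (id) VALUES (1), (2), (3);
        INSERT INTO statistics (pid) VALUES (1), (1), (2);
        """
    )
    return conn


def articles_by_views(conn):
    """Return (article id, views, joined_rows), most viewed first.

    COUNT(statistics.pid) skips the NULL produced by the LEFT JOIN for an
    unread article, so it yields 0; COUNT(*) counts that padded row as 1.
    """
    return conn.execute(
        """
        SELECT article.id,
               COUNT(statistics.pid) AS views,
               COUNT(*) AS joined_rows
        FROM article
        LEFT JOIN statistics ON article.id = statistics.pid
        GROUP BY article.id
        ORDER BY views DESC, article.id
        """
    ).fetchall()
```

For the unread article the two counts disagree, which is why the count has to be taken over the column from the right-hand table.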
I want to copy an existing MySQL table over to a duplicate table, but with two additional fields populated by data retrieved from other queries. It's for a forums software that never captured the original creation date of the thread (it always overwrote the time with the most recent reply).
So I want to take the existing table, 'threads'
thread_id
thread_author
thread_subject
thread_last_reply_date
and create a new table, 'new_threads' of the same structure, but with two extra fields:
thread_id
thread_author
thread_subject
thread_last_reply_date
thread_creation_date
thread_last_poster_id
Both thread_creation_date and thread_last_poster_id could be populated from dependent queries. For example, last_poster_id with a query of:
SELECT post_author FROM posts WHERE thread_id = ? AND post_date = thread_last_reply_date
And for the thread_creation_date:
SELECT MIN(post_date) FROM posts WHERE thread_id = ?
That's essentially the process I would do with a PHP script.. A series of SELECTs and then inserting records one by one. Any advice or direction on how to do this type of process in pure SQL would be crazy helpful (if it's possible).
EDIT: an example of the technique would be helpful. I don't need an exact answer for the above. I'm sure everyone has better things to do.
Just had a similar problem.
This solution worked for me:
INSERT INTO `new_thread` SELECT *,null,null FROM `thread` WHERE 1=1
To create your target table (empty):
CREATE TABLE new_threads
SELECT t.*,
CAST(NULL AS DATETIME) AS thread_creation_date,
CAST(NULL AS UNSIGNED) AS thread_last_poster_id
FROM threads t WHERE 1=0;
and to populate it:
INSERT INTO new_threads (thread_id, thread_author, thread_subject, thread_last_reply_date, thread_creation_date, thread_last_poster_id)
SELECT t.thread_id, t.thread_author, t.thread_subject, t.thread_last_reply_date,
MIN(p.post_date),
(SELECT p1.post_author
FROM posts p1
WHERE p1.thread_id = t.thread_id
AND p1.post_date = t.thread_last_reply_date
LIMIT 1) AS thread_last_poster_id
FROM threads t
JOIN posts p ON t.thread_id = p.thread_id
GROUP BY t.thread_id, t.thread_author, t.thread_subject, t.thread_last_reply_date;
(untested)
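Since the statements above are untested, here is a small end-to-end sketch of the same INSERT ... SELECT technique, run on SQLite through Python's sqlite3 module instead of MySQL. The table and column names follow the question; the column types, the sample rows, and the helper names are assumptions, and the target table is created with an explicit CREATE TABLE because the demo does not rely on MySQL's CREATE TABLE ... SELECT.

```python
import sqlite3


def make_migration_db():
    """Old threads table, posts, and an empty new_threads target (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE threads (
            thread_id INTEGER PRIMARY KEY,
            thread_author INTEGER,
            thread_subject TEXT,
            thread_last_reply_date TEXT
        );
        CREATE TABLE posts (
            post_id INTEGER PRIMARY KEY,
            thread_id INTEGER,
            post_author INTEGER,
            post_date TEXT
        );
        CREATE TABLE new_threads (
            thread_id INTEGER PRIMARY KEY,
            thread_author INTEGER,
            thread_subject TEXT,
            thread_last_reply_date TEXT,
            thread_creation_date TEXT,
            thread_last_poster_id INTEGER
        );
        INSERT INTO threads VALUES (1, 10, 'A', '2024-01-03'), (2, 12, 'B', '2024-02-01');
        INSERT INTO posts (thread_id, post_author, post_date) VALUES
            (1, 10, '2024-01-01'), (1, 11, '2024-01-03'), (2, 12, '2024-02-01');
        """
    )
    return conn


def migrate_threads(conn):
    """Copy every thread that has posts into new_threads, filling both new columns."""
    with conn:  # one transaction for the whole copy
        conn.execute(
            """
            INSERT INTO new_threads (thread_id, thread_author, thread_subject,
                                     thread_last_reply_date, thread_creation_date,
                                     thread_last_poster_id)
            SELECT t.thread_id, t.thread_author, t.thread_subject,
                   t.thread_last_reply_date,
                   MIN(p.post_date),
                   (SELECT p1.post_author
                    FROM posts p1
                    WHERE p1.thread_id = t.thread_id
                      AND p1.post_date = t.thread_last_reply_date
                    LIMIT 1)
            FROM threads t
            JOIN posts p ON p.thread_id = t.thread_id
            GROUP BY t.thread_id, t.thread_author, t.thread_subject,
                     t.thread_last_reply_date
            """
        )
    return conn.execute("SELECT * FROM new_threads ORDER BY thread_id").fetchall()
```

Because of the inner join, a thread with no rows in posts is not copied; a LEFT JOIN would keep it with a NULL creation date.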