MySQL - COUNT and retrieve n rows from a subquery - mysql

Context:
I have an app that shows posts and comments on the home page.
My intention is to limit the number of posts shown (ie, 10 posts) and...
Limit the number of comments shown per post (ie, 2 comments).
Show the total number of comments in the front end (ie, "read all 10 comments")
MySQL:
(SELECT *
FROM (SELECT *
FROM post
ORDER BY post_timestamp DESC
LIMIT 0, 10) AS p
JOIN user_profiles
ON user_id = p.post_author_id
LEFT JOIN (SELECT *
FROM data
JOIN pts
ON pts_id = pts_id_fk) AS d
ON d.data_id = p.data_id_fk
LEFT JOIN (SELECT *
FROM comment
JOIN user_profiles
ON user_id = comment_author_id
ORDER BY comment_id ASC) AS c
ON p.post_id = c.post_id_fk))
I've failed to insert LIMIT and COUNT in this code to get what I want - any suggestions? - will be glad to post more info if needed.

If I'm understanding you correctly you want no more than 10 posts (and 2 comments) to come back for each unique user in the returned result set.
This is very easy in SQLServer / Oracle / Postgre using a "row_number() PARTITION BY".
Unfortunately there is no such function in MySql. Similar question has been asked here:
ROW_NUMBER() in MySQL
I'm sorry I can't offer a more specific solution for MySql. Definitely further research "row number partition by" equivalents for MySql.
The essence of what this does:
You can add a set of columns that make up a unique set, say user id for example sake (this is the "partition") A "row number" column is then added to each row that matches the partition and starts over when it changes.
This should illustrate:
user_id row_number
1 1
1 2
1 3
2 1
2 2
You can then add an outer query that says: select where row_number <= 10, which can be used in your case to limit to no more than 10 posts. Using the max row_number for that user to determine the "read all 10 comments" part.
Good luck!

This is the skeleton of the query you're looking for:
select * from (
select p1.id from posts p1
join posts p2 on p1.id <= p2.id
group by p1.id
having count(*) <= 3
order by p1.post_timestamp desc
) p left join (
select c1.id, c2.post_id from comments c1
join comments c2 on c1.id <= c2.id and c1.post_id = c2.post_id
group by c1.id
having count(*) <= 2
order by c1.comment_timestamp desc
) c
on p.id = c.post_id
It will get posts ordered by their descending timestamp but only the top 3 of them. That result will be joined with the top 2 comments of each post order by their descending timestamp. Just change the column names and it will work :)

Related

Avoiding the “n+1 selects” problem when I want a subset of related resources

Imagine I'm designing a multi-user blog and I have user, post, and comment tables with the obvious meanings. On the main page, I want to show the ten most recent posts along with all their related comments.
The naive approach would be to SELECT the ten most recent posts, probably JOINed with the users that wrote them. And then I can loop through them to SELECT the comments, again, probably JOINed with the users that wrote them. This would require 11 selects: 1 for the posts and 10 for their comments, hence the name of the famous anti-pattern: n+1 selects.
The usual advice for avoiding this anti-pattern is to use the IDs from the first query to fetch all related comments in a second query which may look something like this:
SELECT
*
FROM
comments
WHERE
post_id IN (/* A comma separated list of post IDs returned from the first query */)
As long as that comma separated list is in reasonably short we managed to fetch all the data we need by issuing only two SELECT queries instead of eleven. Great.
But what if I only want the top three comments for each post? I didn't try but I can probably come up with some LEFT JOIN trickery to fetch the most recent posts along with their top three comments in a single query but I'm not sure it would be scalable. What if I want the top hundred comments which would exceed the join limit of 61 tables of a typical MySQL installation for instance?
What is the usual solution for this other than reverting to n+1 selects anti-pattern? What is the most efficient way to fetch items with a subset of items related to each one in this fairly typical scenario?
It is usually a better option to run as few queries as possible, and then implement some application logic on top of it if needed. In your use case, I would build a query that returns both the most recent posts and the most recent associated comments, with proper ordering to make the application processing easier. Then your application can take care of displaying them.
Assuming that you use MySQL (since you mentionned it in your question), let's start with a query that gives you the 10 most recent posts:
SELECT * FROM posts ORDER BY post_date DESC LIMIT 10
Then you can join this with the corresponding comments:
SELECT
p.*,
c.*
FROM
(SELECT * FROM posts ORDER BY post_date DESC LIMIT 10) p
INNER JOIN comments c ON c.post_id = p.id
Finally, let's set up a limit on the number of comments per posts. For this, you can use ROW_NUMBER() (available in MySQL 8.0) to rank the comments per post, and then filter only the a given number of comments. This gives you the 10 most recent posts along with each of their 3 most recents comments:
SELECT *
FROM (
SELECT
p.*,
c.*,
ROW_NUMBER() OVER(PARTITION BY p.post_id ORDER BY c.comment_date DESC) rn
FROM
(SELECT * FROM posts ORDER BY post_date DESC LIMIT 10) p
INNER JOIN comments c ON c.post_id = p.id
) x
WHERE rn <= 3
ORDER BY p.post_date DESC, c.comment_date DESC
Query results are ordered by post, then by comment date. So when your application fetches the resuts, you get 1 to 3 records per post, in sequence.
If you want the last 10 posts
SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10
Now if you want the comment of those posts:
SELECT c.comment_id, u.name
FROM comments c
JOIN users u
on c.user_id = u.user_id
WHERE c.post_id IN ( SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10 )
Now for the last 3 comments is where rdbms version is important so you can use row_number or not:
SELECT *
FROM (
SELECT c.comment_id, u.name,
row_number() over (partition by c.post_id order by c.comment_date DESC) as rn
FROM comments c
JOIN users u
on c.user_id = u.user_id
WHERE c.post_id IN ( SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10 )
) x
WHERE x.rn <= 3
You can do this in one query:
select . . . -- whatever columns you want here
from (select p.*
from posts p
order by <datecol> desc
fetch first 10 rows only
) p join
users u
on p.user_id = u.user_id join
comments c
on c.post_id = p.post_id;
This returns the posts/users/comments in one table, mixing the columns. But it only requires one query.

MySQL: combination of LEFT JOIN and ORDER BY is slow

There are two tables: posts (~5,000,000 rows) and relations (~8,000 rows).
posts columns:
-------------------------------------------------
| id | source_id | content | date (int) |
-------------------------------------------------
relations columns:
---------------------------
| source_id | user_id |
---------------------------
I wrote a MySQL query for getting 10 most recent rows from posts which are related to a specific user:
SELECT p.id, p.content
FROM posts AS p
LEFT JOIN relations AS r
ON r.source_id = p.source_id
WHERE r.user_id = 1
ORDER BY p.date DESC
LIMIT 10
However, it takes ~30 seconds to execute it.
I already have indexes at relations for (source_id, user_id), (user_id) and for (source_id), (date), (date, source_id) at posts.
EXPLAIN results:
How can I optimize the query?
Your WHERE clause renders your outer join a mere inner join (because in an outer-joined pseudo record user_id will always be null, never 1).
If you really want this to be an outer join then it is completely superfluous, because every record in posts either has or has not a match in relations of course. Your query would then be
select id, content
from posts
order by "date" desc limit 10;
If you don't want this to be an outer join really, but want a match in relations, then we are talking about existence in a table, an EXISTS or IN clause hence:
select id, content
from posts
where source_id in
(
select source_id
from relations
where user_id = 1
)
order by "date" desc
limit 10;
There should be an index on relations(user_id, source_id) - in this order, so we can select user_id 1 first and get an array of all desired source_id which we then look up.
Of course you also need an index on posts(source_id) which you probably have already, as source_id is an ID. You can even speed things up with a composite index posts(source_id, date, id, content), so the table itself doesn't have to be read anymore - all the information needed is in the index already.
UPDATE: Here is the related EXISTS query:
select id, content
from posts p
where exists
(
select *
from relations r
where r.user_id = 1
and r.source_id = p.source_id
)
order by "date" desc
limit 10;
You could put an index on the date column of the posts table, I believe that will help the order-by speed.
You could also try reducing the number of results before ordering with some additional where statements. For example if you know the that there will likely be ten records with the correct user_id today, you could limit the date to just today (or N days back depending on your actual data).
Try This
SELECT p.id, p.content FROM posts AS p
WHERE p.source_id IN (SELECT source_id FROM relations WHERE user_id = 1)
ORDER BY p.date DESC
LIMIT 10
I'd consider the following :-
Firstly, you only want the 10 most recent rows from posts which are related to a user. So, an INNER JOIN should do just fine.
SELECT p.id, p.content
FROM posts AS p
JOIN relations AS r
ON r.source_id = p.source_id
WHERE r.user_id = 1
ORDER BY p.date DESC
LIMIT 10
The LEFT JOIN is needed if you want to fetch the records which do not have a relations mapping. Hence, doing the LEFT JOIN results in a full table scan of the left table, which as per your info, contains ~5,000,000 rows. This could be the root cause of your query.
For further optimisation, consider moving the WHERE clause into the ON clause.
SELECT p.id, p.content
FROM posts AS p
JOIN relations AS r
ON (r.source_id = p.source_id AND r.user_id = 1)
ORDER BY p.date DESC
LIMIT 10
I would try with a composite index on relations :
INDEX source_user (user_id,source_id)
and change the query to this :
SELECT p.id, p.content
FROM posts AS p
INNER JOIN relations AS r
ON ( r.user_id = 1 AND r.source_id = p.source_id )
ORDER BY p.date DESC
LIMIT 10

SQL top records based on two tables relations

I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:
article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}
I would like to get all articles that contain the TOP X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.
SELECT [entity|keyword]_id, count(*) as num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10
How would I get all articles that have a relation to the top entities and keywords?
This was what I imagined, but I know it doesn't work because of the group by entity limiting the article_id's that return.
SELECT * FROM article
WHERE EXISTS (
[... where article is mentioned in top X entities.. ]
) AND EXISTS (
[... where article is mentioned in top X keywords.. ]
);
If I understand you correct the objective of the query is to find the articles that have a relation to both one of the top 10 entities as well as to one of the top 10 keywords. If this is the case the following query should do that, by requiring that the article returned has a match in both the set of top 10 entities and the set of top 10 keywords.
Please give it a try.
SELECT a.id
FROM article a
INNER JOIN article_entity ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
SELECT entity_id, COUNT(article_id) AS article_entity_count
FROM article_entity
GROUP BY entity_id
ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
SELECT keyword_id, COUNT(article_id) AS article_keyword_count
FROM article_keyword
GROUP BY keyword_id
ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;
The downside to using a simplelimit 10in the two subqueries for top entities/keywords is that it won't handle ties, so if the 11th keyword was just as popular as the 10th it still won't get chosen. This can be fixed though by using a ranking function, but afaik MySQL doesn't have anything build in (like RANK() window functions in Oracle or MSSQL).
I set up a sample SQL Fiddle (but using fewer data points andlimit 2as I'm lazy).
Not knowing the volume of data you are working with, I would first recommend that you have two storage columns on your article table for count of entities and keywords respectively. Then via triggers on adding/deleting from each, update the respective counter columns. This way, you don't have to do a burning query each time needed, especially in a web-based interface. Then, you can just select from the articles table ordered by the E+K counts descending and be done with it, instead of constant sub-querying the underlying tables.
Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to be doing a limit of 10 records for each set. Lets throw this scenario into the picture. Say you have articles 1-20 all a range of 10, 9 and 8 entities and 1-2 keywords. Then articles 21-50 have the reverse... 10, 9, 8 keywords and 1-2 entities. Now, you have articles 51-58 that have 7 entities AND 7 keywords total of 14 combined points. None of the queries would have caught this as entities would only return the qualifying 1-20 records and keywords records 21-50. Articles 51-58 would be so far down on the list, it would not even be considered even though its total is 14.
To handle this, each sub-query is a full query specifically on the article ID and its count. Simple order by the article_ID as that is basis of the join to the master article table.
Now, the coalesce() will get the count if so available, otherwise 0 and add the two values together. From that, the results are ordered with the highest counts first (thus getting scenario sample articles 51-58 plus a few of the others) when the limit is applied.
SELECT
a.id,
coalesce( JustE.ECount, 0 ) ECount,
coalesce( JustK.KCount, 0 ) KCount,
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
from
article a
LEFT JOIN ( select article_id, COUNT(*) as ECount
from article_entity
group by article_id
order by article_id ) JustE
on a.id = JustE.article_id
LEFT JOIN ( select article_id, COUNT(*) as KCount
from article_keyword
group by article_id
order by article_id ) JustK
on a.id = JustK.article_id
order by
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
limit 10
I took this in several steps
tl;dr This shows all the articles from the top (4) keywords and entities:
Here's a fiddle
select
distinct article_id
from
(
select
article_id
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
article_id
from
article_keyword ak
inner join
(select
keyword_id, count(*)
from
article_keyword
group by
keyword_id
order by
count(*) desc
limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles
Explanation:
This starts with an effort to find the top X entities. (4 seemed to work for the number of associations i wanted to make in the fiddle)
I didn't want to select articles here because it skews the group by, you want to focus solely on the top entities. Fiddle
select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4
Then I selected all the articles from these top entities. Fiddle
select
*
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.
This will give you all articles that have a relation to the top (x) entities and keywords.
This gets the top 10 keyword articles that are also a top 10 entity. You may not get 10 records back because it is possible that an article only meets one of the criteria (top entity but not top keyword or top keyword but not top entity)
select *
from article a
inner join
(select count(*),ae.article_id
from article_entity ae
group by ae.article_id
order by count(*) Desc limit 10) e
on a.id = e.article_id
inner join
(select count(*),ak.article_id
from article_keyword ak
group by ak.article_id
order by count(*) Desc limit 10) k
on a.id = k.article_id

Advise for best query performance (MySQL)

I'm looking for the best MySQL query for that situation:
I'm listing 10 last posts of a member.
table for posts:
post_id | uid | title | content | date
The member have the possibility to subscribe to other member posts, so that posts are listed in the same list (sorted by date - same table)
So it's ok to select last posts of userid X and userid Y
But I'd like to allow members to diable display of some posts (the ones he doesn't want to be displayed).
My problem is: how can I make that as simple as possible for MySQL?... I thought about a second table where I put the post ids that the user doesn't want:
table postdenied
uid | post_id
Then make a select like:
select * from posts as p where not exists (select 1 from postdenied as d where d.post_id = p.post_id and d.uid = p.uid) order by date DESC limit 10
I'm right?
Or is there something better?
Thanks
If I understand correctly, the posts.uid column stores the ID of the poster. And the postdenied.uid stores the ID of the user that doesn't want to see a certain post.
If the above assumptions are correct, then your query is fine, except that you should not join on the uid columns, only on the post_id ones. And you should have a parameter or constant the userID (noted as #X in the code below) of the user that you want to show all the posts - except those he has "denied":
select p.*
from posts as p
where not exists
(select 1
from postdenied as d
where d.post_id = p.post_id
and d.uid = #X -- #X is the userID of the specific user
)
order by date DESC
limit 10 ;
Another approach to implementing this would be with a LEFT JOIN clause.
SELECT * FROM posts AS p
LEFT JOIN postdenied as d ON d.post_id = p.post_id and d.uid = p.uid
WHERE d.uid IS NULL
ORDER BY date DESC
LIMIT 10
It's unclear to me whether this would be more amenable to the query optimizer. If you have a large amount of data, it may be worth testing both queries and seeing if one is more performant than the other.
See http://sqlfiddle.com/#!2/be7e3/1
Appreciation to ypercube and Lamak for their feedback on my original answer

Mysql: How to select top 10 users of the last 24 hours for a ranking list (and avoid multiple user entries)

I want to list the top 10 users in the last 24 hours with the highest WPM (words per minute) value. If a user has multiple highscores, only the highest score should be shown. I got this far:
SELECT results.*, users.username
FROM results
JOIN users
ON (results.user_id = users.id)
WHERE results.user_id != 0 AND results.created >= DATE_SUB(NOW(),INTERVAL 24 HOUR)
GROUP BY results.user_id
ORDER BY wpm DESC LIMIT 10
My Problem is, that my code doesn't fetch the highest value. For example:
user x has 2 highscores but instead of selecting the row with the highest wpm for this user, the row with a lower value is selected instead. If I use "MAX(results.wpm)" I get the highest wpm. This would be fine, but I also need the keystrokes for this row. My problem is that even though I fetch the correct user I don't receive the right row for this user (the row which made the user reach the top 10).
This is the results table:
id | user_id | wpm | keystrokes | count_correct_words |
count_wrong_words | created
(editing answer as we cannot use LIMIT inside a subquery)
Here's another attempt...
SELECT users.username, R1.*
FROM users
JOIN results R1 ON users.userId = R1.userId
JOIN (SELECT userId, MAX(wpm) AS wpm FROM results GROUP BY userId) R2 ON R2.wpm = R1.wpm AND R2.userId = R1.userId
WHERE R1.user_id != 0 AND R1.created >= DATE_SUB(NOW(),INTERVAL 24 HOUR)
ORDER BY R1.wpm DESC LIMIT 10;
We use max() to first isolate the maximum wpm's for every user_id, then inner join the Results table with this subset to get the full row information.
Thanks Oceanic, I think your subquery approach was what gave me the working idea:
The problem was, that GROUP BY picked the first column for the aggregation(?), I now use a subquery to first order the results by wpm and use this "tmp table" for my operation
SELECT t1.*, users.username
FROM (SELECT results.* FROM results WHERE results.user_id != 0 AND results.created >= DATE_SUB(NOW(),INTERVAL 24 HOUR) ORDER BY wpm DESC) t1
JOIN users ON (t1.user_id = users.id)
GROUP BY t1.user_id
ORDER BY wpm DESC
LIMIT 10
This seems to work fine.