MYSQL - Using SUM - mysql

I am trying to find the top 10 authors who has written the most number of books.
I have two table as follows:
Author table:
author_name
publisher_key
Publication table:
publisher_id
publisher_key
title
year
pageno
To find the result, I tried using the following query:
SELECT a.author_name, SUM(p.pageno)
FROM author a JOIN publication p ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
LIMIT 10;
I have no idea why when I run this query it takes ages though the number of records is only 200.

Try
SELECT a.author_name, count(*)
FROM author a
INNER JOIN publication p ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
ORDER BY 2 desc
LIMIT 10;
You want to know who write most number of books, so you need to count the number of registries by author.
The order by 2 desc will order your query from the bigger number to the lesser 2 means the second field on the select list.

If it is as ydoow suggested that it maybe a locking issue, try running your select with NOLOCK to confirm.
SELECT a.author_name, count(*)
FROM author a WITH (nolock)
INNER JOIN publication p WITH (nolock) ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
ORDER BY 2 desc
LIMIT 10;
You can get more info here:
Any way to select without causing locking in MySQL?

Related

Avoiding the “n+1 selects” problem when I want a subset of related resources

Imagine I'm designing a multi-user blog and I have user, post, and comment tables with the obvious meanings. On the main page, I want to show the ten most recent posts along with all their related comments.
The naive approach would be to SELECT the ten most recent posts, probably JOINed with the users that wrote them. And then I can loop through them to SELECT the comments, again, probably JOINed with the users that wrote them. This would require 11 selects: 1 for the posts and 10 for their comments, hence the name of the famous anti-pattern: n+1 selects.
The usual advice for avoiding this anti-pattern is to use the IDs from the first query to fetch all related comments in a second query which may look something like this:
SELECT
*
FROM
comments
WHERE
post_id IN (/* A comma separated list of post IDs returned from the first query */)
As long as that comma separated list is in reasonably short we managed to fetch all the data we need by issuing only two SELECT queries instead of eleven. Great.
But what if I only want the top three comments for each post? I didn't try but I can probably come up with some LEFT JOIN trickery to fetch the most recent posts along with their top three comments in a single query but I'm not sure it would be scalable. What if I want the top hundred comments which would exceed the join limit of 61 tables of a typical MySQL installation for instance?
What is the usual solution for this other than reverting to n+1 selects anti-pattern? What is the most efficient way to fetch items with a subset of items related to each one in this fairly typical scenario?
It is usually a better option to run as few queries as possible, and then implement some application logic on top of it if needed. In your use case, I would build a query that returns both the most recent posts and the most recent associated comments, with proper ordering to make the application processing easier. Then your application can take care of displaying them.
Assuming that you use MySQL (since you mentionned it in your question), let's start with a query that gives you the 10 most recent posts:
SELECT * FROM posts ORDER BY post_date DESC LIMIT 10
Then you can join this with the corresponding comments:
SELECT
p.*,
c.*
FROM
(SELECT * FROM posts ORDER BY post_date DESC LIMIT 10) p
INNER JOIN comments c ON c.post_id = p.id
Finally, let's set up a limit on the number of comments per posts. For this, you can use ROW_NUMBER() (available in MySQL 8.0) to rank the comments per post, and then filter only the a given number of comments. This gives you the 10 most recent posts along with each of their 3 most recents comments:
SELECT *
FROM (
SELECT
p.*,
c.*,
ROW_NUMBER() OVER(PARTITION BY p.post_id ORDER BY c.comment_date DESC) rn
FROM
(SELECT * FROM posts ORDER BY post_date DESC LIMIT 10) p
INNER JOIN comments c ON c.post_id = p.id
) x
WHERE rn <= 3
ORDER BY p.post_date DESC, c.comment_date DESC
Query results are ordered by post, then by comment date. So when your application fetches the resuts, you get 1 to 3 records per post, in sequence.
If you want the last 10 posts
SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10
Now if you want the comment of those posts:
SELECT c.comment_id, u.name
FROM comments c
JOIN users u
on c.user_id = u.user_id
WHERE c.post_id IN ( SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10 )
Now for the last 3 comments is where rdbms version is important so you can use row_number or not:
SELECT *
FROM (
SELECT c.comment_id, u.name,
row_number() over (partition by c.post_id order by c.comment_date DESC) as rn
FROM comments c
JOIN users u
on c.user_id = u.user_id
WHERE c.post_id IN ( SELECT p.post_id
FROM post p
ORDER BY p.publish_date DESC
LIMIT 10 )
) x
WHERE x.rn <= 3
You can do this in one query:
select . . . -- whatever columns you want here
from (select p.*
from posts p
order by <datecol> desc
fetch first 10 rows only
) p join
users u
on p.user_id = u.user_id join
comments c
on c.post_id = p.post_id;
This returns the posts/users/comments in one table, mixing the columns. But it only requires one query.

MYSQL entries from one table that appear most often in another table

I have 2 tables, authors and books
authors contains the unique id authorId
books also contains this as a foreign key
I need to know the authors with the most number of books. If 2 or more authors are tied for the greatest number of books, I need to show both authors
I have been able to achieve this by first getting the maximum count
SELECT #maxCount := (MAX(counter)) FROM (SELECT count(*) AS counter FROM books GROUP BY authorId) AS counts;
and then using it to get the Ids with that count as part of my author selection
SELECT *
FROM authors
WHERE authorId IN (
SELECT authorId
FROM books
GROUP BY authorId
HAVING COUNT(*) = #maxCount
);
I've been told that I am not allowed to use variables and that what I've done is horribly inefficient if the tables grow very large.
Am I missing something obvious here? Is there a way to do this in a single statement without a variable (or temp table), and without having to select/group the entire books table twice?
SELECT author, COUNT(*)
FROM authors
JOIN books
ON authors.authorId=books.AuthorId
GROUP BY author
ORDER BY COUNT(*) DESC
Will give you a list ordered by the number of books for each author. I don't have an instance nearby to test, and tend to steer clear of embedded variables but expect something like....
SELECT *
FROM (
SELECT author
, #maxcount:=IF(COUNT(*)>#maxcount,COUNT(*), #maxcount)
, COUNT(*) AS cnt
FROM authors
JOIN books
ON authors.authorId=books.AuthorId
GROUP BY author
ORDER BY COUNT(*) DESC
) ilv
WHERE cnt=#maxcount;
Performance still sucks with large datasets (even with the right indexes). If you have to run this query frequently with >100,000 records, then you might consider denormalizing your data.
Symcbean solution is great... you can add Limit 1 to it, to get only one instance.
SELECT A.authorId, A.name, COUNT(*) AS num_books
FROM authors A
INER JOIN books B
ON A.authorId=B.AuthorId
GROUP BY A.authorId, A.name
ORDER BY COUNT(*) DESC
LIMIT 1
But if you want to get all the authors who share the max number of books, your best bet is to store the max(count) in a variable, or temp table and use it in second query.
for example, you can store the info in the following temp table
CREATE TEMPORARY TABLE IF NOT EXISTS maxBooks AS (
SELECT authorId, COUNT(*) AS num_books
FROM books
GROUP BY authorId
ORDER BY COUNT(*) DESC
LIMIT 1
)
now you can join it to your table to get counts which are equal to max count

SQL top records based on two tables relations

I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:
article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}
I would like to get all articles that contain the TOP X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.
SELECT [entity|keyword]_id, count(*) as num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10
How would I get all articles that have a relation to the top entities and keywords?
This was what I imagined, but I know it doesn't work because of the group by entity limiting the article_id's that return.
SELECT * FROM article
WHERE EXISTS (
[... where article is mentioned in top X entities.. ]
) AND EXISTS (
[... where article is mentioned in top X keywords.. ]
);
If I understand you correct the objective of the query is to find the articles that have a relation to both one of the top 10 entities as well as to one of the top 10 keywords. If this is the case the following query should do that, by requiring that the article returned has a match in both the set of top 10 entities and the set of top 10 keywords.
Please give it a try.
SELECT a.id
FROM article a
INNER JOIN article_entity ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
SELECT entity_id, COUNT(article_id) AS article_entity_count
FROM article_entity
GROUP BY entity_id
ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
SELECT keyword_id, COUNT(article_id) AS article_keyword_count
FROM article_keyword
GROUP BY keyword_id
ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;
The downside to using a simplelimit 10in the two subqueries for top entities/keywords is that it won't handle ties, so if the 11th keyword was just as popular as the 10th it still won't get chosen. This can be fixed though by using a ranking function, but afaik MySQL doesn't have anything build in (like RANK() window functions in Oracle or MSSQL).
I set up a sample SQL Fiddle (but using fewer data points andlimit 2as I'm lazy).
Not knowing the volume of data you are working with, I would first recommend that you have two storage columns on your article table for count of entities and keywords respectively. Then via triggers on adding/deleting from each, update the respective counter columns. This way, you don't have to do a burning query each time needed, especially in a web-based interface. Then, you can just select from the articles table ordered by the E+K counts descending and be done with it, instead of constant sub-querying the underlying tables.
Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to be doing a limit of 10 records for each set. Lets throw this scenario into the picture. Say you have articles 1-20 all a range of 10, 9 and 8 entities and 1-2 keywords. Then articles 21-50 have the reverse... 10, 9, 8 keywords and 1-2 entities. Now, you have articles 51-58 that have 7 entities AND 7 keywords total of 14 combined points. None of the queries would have caught this as entities would only return the qualifying 1-20 records and keywords records 21-50. Articles 51-58 would be so far down on the list, it would not even be considered even though its total is 14.
To handle this, each sub-query is a full query specifically on the article ID and its count. Simple order by the article_ID as that is basis of the join to the master article table.
Now, the coalesce() will get the count if so available, otherwise 0 and add the two values together. From that, the results are ordered with the highest counts first (thus getting scenario sample articles 51-58 plus a few of the others) when the limit is applied.
SELECT
a.id,
coalesce( JustE.ECount, 0 ) ECount,
coalesce( JustK.KCount, 0 ) KCount,
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
from
article a
LEFT JOIN ( select article_id, COUNT(*) as ECount
from article_entity
group by article_id
order by article_id ) JustE
on a.id = JustE.article_id
LEFT JOIN ( select article_id, COUNT(*) as KCount
from article_keyword
group by article_id
order by article_id ) JustK
on a.id = JustK.article_id
order by
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
limit 10
I took this in several steps
tl;dr This shows all the articles from the top (4) keywords and entities:
Here's a fiddle
select
distinct article_id
from
(
select
article_id
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
article_id
from
article_keyword ak
inner join
(select
keyword_id, count(*)
from
article_keyword
group by
keyword_id
order by
count(*) desc
limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles
Explanation:
This starts with an effort to find the top X entities. (4 seemed to work for the number of associations i wanted to make in the fiddle)
I didn't want to select articles here because it skews the group by, you want to focus solely on the top entities. Fiddle
select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4
Then I selected all the articles from these top entities. Fiddle
select
*
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.
This will give you all articles that have a relation to the top (x) entities and keywords.
This gets the top 10 keyword articles that are also a top 10 entity. You may not get 10 records back because it is possible that an article only meets one of the criteria (top entity but not top keyword or top keyword but not top entity)
select *
from article a
inner join
(select count(*),ae.article_id
from article_entity ae
group by ae.article_id
order by count(*) Desc limit 10) e
on a.id = e.article_id
inner join
(select count(*),ak.article_id
from article_keyword ak
group by ak.article_id
order by count(*) Desc limit 10) k
on a.id = k.article_id

MYSQL group by and inner join

I have an article table which holds the number of articles views for each day. A new record is created to hold the count for each seperate day for each article.
The query below gets the article id and total views for the top 5 viewed article id for all time :
SELECT article_id,
SUM(article_count) as cnt
FROM article_views
GROUP BY article_id
ORDER BY cnt DESC
LIMIT 5
I also have a seperate article table which holds all the article fields. I want to ammend the query above to join to the article table and get two fields for each article id. I have tried to do this below but count is comming back incorrectly :
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
Im not sure exactly what im doing wrong. Do I need to do a subquery?
Add articles.article_title, articles.artcile_url to the GROUP BY clause:
SELECT
article_views.article_id,
articles.article_title,
articles.artcile_url,
SUM( article_views.article_count ) AS cnt
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id,
articles.article_title,
articles.artcile_url
ORDER BY cnt DESC
LIMIT 5;
The reason you were not getting correct result set, is that when you select rows that are not included in the GROUP BY nor in an aggregate function in the SELECT clause MySQL picks up random value.
You are using a MySQL (mis) feature called Hidden Columns, because article title is not in the group by. However, this may or may not be causing your problem.
If the counts are wrong, then I think you have duplicate article_id in the article table. You can check this by doing:
select article_id, count(*) as cnt
from articles
group by article_id
having cnt > 1
If any appear, then that is your problem. If they all have different titles, then grouping by the title (as suggested by Mahmoud) would fix the problem.
If not, one way to fix it is the following:
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views INNER JOIN
(select a.* from articles group by article_id) articles
ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
This chooses an abitrary title for the article.
Your query looks basically right to me...
But the value returned for cnt is going to be dependent upon article_id column being UNIQUE in the articles table. We'd assume that it's the primary key, and absent a schema definition, that's only an assumption.)
Also, we're likely to assume there's a foreign key between the tables, that is, there are no values of article_id in the articles_view table which don't match a value of article_id on a row from the articles table.
To check for "orphan" article_id values, run a query like:
SELECT v.article_id
FROM articles_view v
LEFT
JOIN articles a
ON a.article_id = v.article_id
WHERE a.article_id IS NULL
To check for "duplicate" article_id values in articles, run a query like:
SELECT a.article_id
FROM articles a
GROUP BY a.article_id
HAVING COUNT(1) > 1
If either of those queries returns rows, that could be an explanation for the behavior you observe.

MySQL is not using INDEX in subquery

I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not sepend on the grouping items). This can give indeterminate (semi-random) results.
Second, ( to avoid the indeterminate results) you have added an ORDER BY inside a subquery which (non-standard or not) is not documented anywhere in MySQL documentation that it should work as expected. So, it may be working now but it may not work in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
The: SQL-fiddle
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.