I have 2 tables, authors and books
authors contains the unique id authorId
books also contains this as a foreign key
I need to know which authors have the most books. If two or more authors are tied for the greatest number of books, I need to show all of them
I have been able to achieve this by first getting the maximum count
SELECT @maxCount := MAX(counter) FROM (SELECT COUNT(*) AS counter FROM books GROUP BY authorId) AS counts;
and then using it to get the Ids with that count as part of my author selection
SELECT *
FROM authors
WHERE authorId IN (
SELECT authorId
FROM books
GROUP BY authorId
HAVING COUNT(*) = @maxCount
);
I've been told that I am not allowed to use variables and that what I've done is horribly inefficient if the tables grow very large.
Am I missing something obvious here? Is there a way to do this in a single statement without a variable (or temp table), and without having to select/group the entire books table twice?
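For reference, one variable-free way to keep this to a single statement is a scalar subquery in the HAVING clause; it still groups the books table twice, and it assumes authorId is the primary key of authors (so selecting a.* with this GROUP BY is legal in MySQL):
SELECT a.*
FROM authors a
JOIN books b ON b.authorId = a.authorId
GROUP BY a.authorId
HAVING COUNT(*) = (
    SELECT COUNT(*)
    FROM books
    GROUP BY authorId
    ORDER BY COUNT(*) DESC
    LIMIT 1
);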
SELECT author, COUNT(*)
FROM authors
JOIN books
ON authors.authorId = books.authorId
GROUP BY author
ORDER BY COUNT(*) DESC
Will give you a list ordered by the number of books for each author. I don't have an instance nearby to test, and I tend to steer clear of embedded variables, but I'd expect something like:
SELECT *
FROM (
SELECT author
, @maxcount := IF(COUNT(*) > @maxcount, COUNT(*), @maxcount) AS running_max
, COUNT(*) AS cnt
FROM authors
JOIN books
ON authors.authorId = books.authorId
JOIN (SELECT @maxcount := 0) init -- initialize the variable so the IF() comparison can fire
GROUP BY author
ORDER BY COUNT(*) DESC
) ilv
WHERE cnt = @maxcount;
Performance still sucks with large datasets (even with the right indexes). If you have to run this query frequently with >100,000 records, then you might consider denormalizing your data.
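A minimal sketch of that denormalization, assuming a hypothetical bookCount column on authors kept current by triggers:
ALTER TABLE authors ADD COLUMN bookCount INT NOT NULL DEFAULT 0;

-- keep the counter in sync as books are added and removed
CREATE TRIGGER books_ai AFTER INSERT ON books
FOR EACH ROW UPDATE authors SET bookCount = bookCount + 1 WHERE authorId = NEW.authorId;

CREATE TRIGGER books_ad AFTER DELETE ON books
FOR EACH ROW UPDATE authors SET bookCount = bookCount - 1 WHERE authorId = OLD.authorId;

-- the tied-for-most authors then come straight off the counter column
SELECT * FROM authors
WHERE bookCount = (SELECT MAX(bookCount) FROM authors);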
Symcbean's solution is great... you can add LIMIT 1 to it to get only one row.
SELECT A.authorId, A.name, COUNT(*) AS num_books
FROM authors A
INNER JOIN books B
ON A.authorId=B.AuthorId
GROUP BY A.authorId, A.name
ORDER BY COUNT(*) DESC
LIMIT 1
But if you want to get all the authors who share the max number of books, your best bet is to store the max count in a variable or temp table and use it in a second query.
For example, you can store the info in the following temp table:
CREATE TEMPORARY TABLE IF NOT EXISTS maxBooks AS (
SELECT authorId, COUNT(*) AS num_books
FROM books
GROUP BY authorId
ORDER BY COUNT(*) DESC
LIMIT 1
)
Now you can join it back to your tables to get the authors whose counts equal the max count:
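A sketch of that join, re-deriving the per-author counts and matching them against the stored max:
SELECT a.*, c.num_books
FROM authors a
JOIN (
    SELECT authorId, COUNT(*) AS num_books
    FROM books
    GROUP BY authorId
) c ON c.authorId = a.authorId
JOIN maxBooks m ON m.num_books = c.num_books;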
Related
I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:
article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}
I would like to get all articles that contain the TOP X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.
SELECT [entity|keyword]_id, count(*) as num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10
How would I get all articles that have a relation to the top entities and keywords?
This was what I imagined, but I know it doesn't work, because the GROUP BY on the entity limits the article_ids that come back.
SELECT * FROM article
WHERE EXISTS (
[... where article is mentioned in top X entities.. ]
) AND EXISTS (
[... where article is mentioned in top X keywords.. ]
);
If I understand you correctly, the objective of the query is to find the articles that are related both to one of the top 10 entities and to one of the top 10 keywords. If so, the following query should do that, by requiring that the returned article has a match in both the set of top 10 entities and the set of top 10 keywords.
Please give it a try.
SELECT a.id
FROM article a
INNER JOIN article_entity ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
SELECT entity_id, COUNT(article_id) AS article_entity_count
FROM article_entity
GROUP BY entity_id
ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
SELECT keyword_id, COUNT(article_id) AS article_keyword_count
FROM article_keyword
GROUP BY keyword_id
ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;
The downside to using a simple LIMIT 10 in the two subqueries for top entities/keywords is that it won't handle ties, so if the 11th keyword were just as popular as the 10th it still wouldn't get chosen. This can be fixed by using a ranking function, but AFAIK MySQL doesn't have anything built in (like the RANK() window function in Oracle or MSSQL).
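(MySQL 8.0 did later add window functions; on that version, a tie-safe top-10 entity list might look like the sketch below, with the keyword side identical:)
SELECT entity_id
FROM (
    SELECT entity_id,
           RANK() OVER (ORDER BY COUNT(article_id) DESC) AS rnk
    FROM article_entity
    GROUP BY entity_id
) ranked
WHERE rnk <= 10;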
I set up a sample SQL Fiddle (but using fewer data points and LIMIT 2, as I'm lazy).
Not knowing the volume of data you are working with, I would first recommend that you have two storage columns on your article table, for the count of entities and keywords respectively. Then, via triggers on adding to or deleting from each association table, update the respective counter columns. This way you don't have to run an expensive query every time it's needed, especially in a web-based interface. You can then just select from the articles table ordered by the E+K counts descending and be done with it, instead of constantly sub-querying the underlying tables.
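A minimal sketch of those counter columns, assuming hypothetical names entity_count and keyword_count; the delete-side and keyword-side triggers follow the same pattern:
ALTER TABLE article
    ADD COLUMN entity_count INT NOT NULL DEFAULT 0,
    ADD COLUMN keyword_count INT NOT NULL DEFAULT 0;

-- one representative trigger: bump the counter when an entity is attached
CREATE TRIGGER article_entity_ai AFTER INSERT ON article_entity
FOR EACH ROW UPDATE article SET entity_count = entity_count + 1 WHERE id = NEW.article_id;

-- the read side then becomes a simple ordered select
SELECT id, entity_count + keyword_count AS total_count
FROM article
ORDER BY total_count DESC
LIMIT 10;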
Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to be limiting to 10 records for each set. Let's throw this scenario into the picture. Say articles 1-20 all have in the range of 10, 9 and 8 entities and 1-2 keywords. Then articles 21-50 have the reverse: 10, 9, 8 keywords and 1-2 entities. Now you have articles 51-58 that have 7 entities AND 7 keywords, a combined total of 14. None of those queries would have caught these, as the entities side would only return the qualifying records 1-20 and the keywords side records 21-50. Articles 51-58 would be so far down the list that they would not even be considered, even though their total is 14.
To handle this, each sub-query is a full query specifically on the article ID and its count, simply ordered by the article_id, as that is the basis of the join to the master article table.
Now, coalesce() takes the count if available, otherwise 0, and the two values are added together. From that, the results are ordered with the highest counts first (thus catching the scenario's articles 51-58 plus a few of the others) before the limit is applied.
SELECT
a.id,
coalesce( JustE.ECount, 0 ) ECount,
coalesce( JustK.KCount, 0 ) KCount,
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
from
article a
LEFT JOIN ( select article_id, COUNT(*) as ECount
from article_entity
group by article_id
order by article_id ) JustE
on a.id = JustE.article_id
LEFT JOIN ( select article_id, COUNT(*) as KCount
from article_keyword
group by article_id
order by article_id ) JustK
on a.id = JustK.article_id
order by
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
limit 10
I took this in several steps
tl;dr This shows all the articles from the top (4) keywords and entities:
Here's a fiddle
select
distinct article_id
from
(
select
article_id
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
article_id
from
article_keyword ak
inner join
(select
keyword_id, count(*)
from
article_keyword
group by
keyword_id
order by
count(*) desc
limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles
Explanation:
This starts with an effort to find the top X entities. (4 seemed to work for the number of associations I wanted to make in the fiddle.)
I didn't want to select articles here because it skews the group by; you want to focus solely on the top entities. Fiddle
select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4
Then I selected all the articles from these top entities. Fiddle
select
*
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.
This will give you all articles that have a relation to the top (x) entities and keywords.
This gets the top 10 articles by keyword count that are also in the top 10 by entity count. You may not get 10 records back, because it is possible that an article only meets one of the criteria (top entity but not top keyword, or top keyword but not top entity).
select *
from article a
inner join
(select count(*),ae.article_id
from article_entity ae
group by ae.article_id
order by count(*) Desc limit 10) e
on a.id = e.article_id
inner join
(select count(*),ak.article_id
from article_keyword ak
group by ak.article_id
order by count(*) Desc limit 10) k
on a.id = k.article_id
There are 3 entities: articles, journals and subscribers. There are no restrictions on how to store the data in the database.
The same article can be simultaneously published in several journals.
How to select all published articles from subscribed journals sorted
by date of publication and without repeats?
The easiest way:
Create a table with articles:
posts
p_id, j1_id, j2_id, text, date
Create a table with subscriptions:
follows
f_id, u_id, j_id (u_id is a user id from the users table)
Execute:
example query
select posts.*
from posts
inner join follows on (j_id = j1_id or j_id = j2_id)
where u_id = 1
order by date desc
This query returns data with duplicates. You can use DISTINCT or GROUP BY, but either one adds an extra sorting operation to remove the duplicates.
The other way it can be done is with UNION, but that also performs an implicit DISTINCT.
(select posts.* from posts inner join follows on j_id = j1_id where u_id = 1)
union
(select posts.* from posts inner join follows on j_id = j2_id where u_id = 1)
order by date desc
Perhaps I chose the wrong storage structure.
So the actual question: is it possible to do something about this problem, to minimize the time required on big data?
You can use the following table structure:
posts : pid, text, date
journals : jid, jtext
journals_posts : jid, pid
follows : fid, uid, jid
select distinct posts.* from posts
inner join journals_posts on journals_posts.pid = posts.pid
inner join follows on follows.jid = journals_posts.jid
where follows.uid = <userid>
To take care of speed you can create indexes on
journals_posts(jid)
follows(uid)
You might be required to create indexes on other fields as well; check with EXPLAIN which tables are scanned without using indexes.
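A sketch of those indexes plus the EXPLAIN check (index names are arbitrary, and uid 1 stands in for a sample user):
create index idx_journals_posts_jid on journals_posts (jid);
create index idx_follows_uid on follows (uid);

explain
select distinct posts.* from posts
inner join journals_posts on journals_posts.pid = posts.pid
inner join follows on follows.jid = journals_posts.jid
where follows.uid = 1;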
I've a table of articles, a table of authors, and a table that maps articles to authors.
I'm doing the following query to find out the authors with the most articles:
SELECT a.*, count(*) c
FROM articleAuthors aa
LEFT JOIN authors a ON aa.author_id=a.id
GROUP BY (author_name)
ORDER BY c DESC LIMIT 50
However this query takes a whole minute to complete. The articleAuthors table has about 1,000,000 records.
How could I speed up this GROUP BY query?
Assuming the articleAuthors table has more than 50 distinct authors, I would pre-query just that component and limit it to the 50 records you want. Ensure an index exists on (author_id), and that your authors table has an index on (id). Change your query to:
select
a.*,
JustAuthorIDs.cntPerAuthor
from
( select
aa.author_id,
count(*) cntPerAuthor
from
articleAuthors aa
group by
aa.author_id
order by
cntPerAuthor DESC
limit 50 ) JustAuthorIDs
JOIN Authors a
on JustAuthorIDs.author_ID = a.id
The ORDER BY count descending in the pre-query means the aggregate is computed once, comes back pre-ordered largest count first, and stops after 50 records. Then a simple join to the authors table gets the name and whatever else.
I have the GROUP BY based on the author_id instead of the name: what if you have two authors called "bill board"? The actual IDs will be distinct between the two of them.
Now, with the above being a query, you will always be required to scan through all million records every time. For something like this, it would PROBABLY be better to add a single AuthoredItems column to the authors table. Then, via a trigger on the articleAuthors table, when an entry gets added or deleted, just update the final count for the one author in the authors table (a trigger sketch follows the query below). Then build an index on the AuthoredItems column, and you can super-simplify the query:
select a.*
from authors a
order by a.AuthoredItems desc
limit 50
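A sketch of the insert-side trigger for that counter (the delete side mirrors it with -1 and OLD.author_id):
CREATE TRIGGER articleAuthors_ai AFTER INSERT ON articleAuthors
FOR EACH ROW UPDATE authors SET AuthoredItems = AuthoredItems + 1 WHERE id = NEW.author_id;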
My database has news articles and blog posts. The primary key for both is an ItemID that is unique across both tables.
The articles are in a table that has the following fields
item_id
title
body
date_posted
The blogposts table has the following fields
item_id
title
body
date_posted
both tables have extra fields unique to them.
I have a third table that holds meta information about articles and posts.
The items table has the following fields
item_id
source_id
...
every blogpost and article has a record in the items table and a record in its respective table.
What I am trying to do is build a query that counts the number of items posted per day. I can do it for one table using a count grouped by date_posted, but how do I combine the article and post counts in one query?
Similar to Dems, but slightly simpler:
select date_posted, count(*)
from (select date_posted from article union all
select date_posted from blogposts) v
group by date_posted
You can do it two ways.
1. Join everything together and then aggregate (See Tom H's answer).
2. Aggregate each table, UNION them, and aggregate again.
Option 1 may seem shorter, but it can mean that you don't benefit from indexes on the root tables (as the rows have to be re-ordered for the JOIN). So I'll show option 2, which is the direction you were headed anyway.
SELECT
date_posted,
SUM(daily_count) AS daily_count
FROM
(
SELECT date_posted, COUNT(*) AS daily_count FROM article GROUP BY date_posted
UNION ALL
SELECT date_posted, COUNT(*) AS daily_count FROM blogposts GROUP BY date_posted
)
AS combined
GROUP BY
date_posted
This should be fastest, provided that you have an index on each table where date_posted is the first field in the index. Otherwise the tables will still need to be re-ordered for the aggregation.
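For example, assuming no such indexes exist yet (names arbitrary):
CREATE INDEX idx_article_date_posted ON article (date_posted);
CREATE INDEX idx_blogposts_date_posted ON blogposts (date_posted);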
I would have used a different table design for this, with types and subtypes. Your Items table has a single column primary key and your Blog_Posts and Articles tables' primary keys are the same ID with a foreign key to the Items table. That would make something like this pretty easy to do and also helps to ensure data integrity.
With your existing design, your best bet is probably something like this:
SELECT
I.item_id,
I.source_id,
COALESCE(A.date_posted, B.date_posted) AS date_posted,
COUNT(*) AS date_count
FROM
Items I
LEFT OUTER JOIN Articles A ON
A.item_id = I.item_id AND
I.source_id = 'A' -- Or whatever the Articles ID is
LEFT OUTER JOIN Blog_Posts B ON
B.item_id = I.item_id AND
I.source_id = 'B' -- Or whatever the Blog_Posts ID is
GROUP BY
I.item_id,
I.source_id,
COALESCE(A.date_posted, B.date_posted)
You could also try using a UNION:
SELECT
SQ.item_id,
SQ.source_id,
SQ.date_posted,
COUNT(*) AS date_count
FROM
(
SELECT I1.item_id, I1.source_id, A.date_posted
FROM Items I1
INNER JOIN Articles A ON A.item_id = I1.item_id
WHERE I1.source_id = 'A'
UNION ALL
SELECT I2.item_id, I2.source_id, B.date_posted
FROM Items I2
INNER JOIN Blog_Posts B ON B.item_id = I2.item_id
WHERE I2.source_id = 'B'
) AS SQ
GROUP BY
SQ.item_id,
SQ.source_id,
SQ.date_posted
select item_id, date_posted from blogposts where /* some conditions */
union all
select item_id, date_posted from articles where /* some conditions */
You'll probably need to put that into a subquery and, if you so desire, join it with other tables when running the GROUP BY, as sketched below. But the main point is that UNION is the operator you use to combine like data from different tables. UNION ALL tells the database that you don't need it to combine duplicate records, since you know the two tables will never share an item_id, so it's a little faster (probably).
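A sketch of that subquery wrapped in the GROUP BY, with the placeholder conditions dropped:
select date_posted, count(*) as items_per_day
from (
    select item_id, date_posted from blogposts
    union all
    select item_id, date_posted from articles
) as combined
group by date_posted;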
I have 2 tables, authors and authors_sales.
The authors_sales table is updated each hour, so it is huge.
What I need is to create a ranking, for that I need to join both tables (authors has all the author data while authors_sales has just sales numbers)
How can I create a final table with the ranking of authors ordering it by sales?
The common key is the: authorId
I tried with LEFT JOIN, but I must be doing something wrong, because I get the whole authors_sales table back, not just the latest rows.
Any tip in the right direction much appreciated
If you're looking for aggregate sales data, you'd want to join the tables and group by the authorId. Something like...
select authors.authorId, SUM(authors_sales.sale_amt) as total_sales
from authors
inner join authors_sales on authors_sales.authorId = authors.authorId
group by authors.authorId
order by total_sales desc
However (I couldn't tell from your question whether the above scenario or the next one applies), if you're only looking for the most recent value in the authors_sales table (if the data in this table is already aggregated), you can join on a nested query for authors_sales, such as...
select a.authorId, t.sale_amt from authors a
inner join
(select authors_sales.authorId,
authors_sales.sale_amt,
authors_sales.some_identifier
from authors_sales
order by some_identifier desc
limit 1) t
on t.authorId = a.authorId
order by t.sale_amt desc
The some_identifier would be how you determine which record is the most recent in authors_sales, whether that is a timestamp of when it was inserted or an incremental primary key, however it is set up. Depending on whether the data in authors_sales is already aggregated, one of these two should do it for you...
select a.*, sum(b.sales) as total_sales
from authors as a
inner join authors_sales as b
using (authorId)
group by a.authorId
order by total_sales desc;
/* assuming column sales = total for each row in authors_sales */
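If you literally need the ranking persisted as its own table, a sketch (the table name author_ranking is hypothetical):
CREATE TABLE author_ranking AS
SELECT a.authorId, SUM(b.sales) AS total_sales
FROM authors AS a
INNER JOIN authors_sales AS b USING (authorId)
GROUP BY a.authorId
ORDER BY total_sales DESC;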