My database has news articles and blog posts. The primary key for both is an ItemID that is unique across both tables.
The articles are in a table that has the following fields
item_id
title
body
date_posted
The blogposts table has the following fields
item_id
title
body
date_posted
Both tables also have extra fields unique to them.
I have a third table that holds meta information about articles and posts.
The items table has the following fields
item_id
source_id
...
Every blog post and article has a record in the items table and a record in its respective table.
What I am trying to do is build a query that counts the number of items posted per day. I can do it for one table using a count grouped by date_posted, but how do I combine the article and blog post counts in one query?
Similar to Dems, but slightly simpler:
select date_posted, count(*)
from (select date_posted from article union all
select date_posted from blogposts) v
group by date_posted
You can do it two ways.
1. Join everything together and then aggregate (See Tom H's answer).
2. Aggregate each table, UNION them, and aggregate again.
Option 1 may seem shorter, but it means you may not benefit from indexes on the root tables (as the rows have to be re-ordered for the JOIN). So I'll show option 2, which is the direction you were headed anyway.
SELECT
date_posted,
SUM(daily_count) AS daily_count
FROM
(
SELECT date_posted, COUNT(*) AS daily_count FROM article GROUP BY date_posted
UNION ALL
SELECT date_posted, COUNT(*) AS daily_count FROM blogposts GROUP BY date_posted
)
AS combined
GROUP BY
date_posted
This should be fastest, provided that you have an index on each table where date_posted is the first field in the index. Otherwise the tables will still need to be re-ordered for the aggregation.
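A quick way to convince yourself that the two shapes (union the rows then count, versus count per table then sum) return the same numbers is to run both against a tiny in-memory database. This is only a sketch using Python's sqlite3 module rather than whatever server the question targets, and the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article   (item_id INTEGER PRIMARY KEY, date_posted TEXT);
    CREATE TABLE blogposts (item_id INTEGER PRIMARY KEY, date_posted TEXT);
    -- Made-up sample data; item_id is unique across both tables.
    INSERT INTO article   VALUES (1, '2024-01-01'), (2, '2024-01-01'), (3, '2024-01-02');
    INSERT INTO blogposts VALUES (4, '2024-01-01'), (5, '2024-01-03');
""")

# Shape A: union the raw rows, then count once.
union_then_count = conn.execute("""
    SELECT date_posted, COUNT(*) AS daily_count
    FROM (SELECT date_posted FROM article
          UNION ALL
          SELECT date_posted FROM blogposts) AS v
    GROUP BY date_posted
    ORDER BY date_posted
""").fetchall()

# Shape B: count per table, union the partial counts, then sum them.
count_then_sum = conn.execute("""
    SELECT date_posted, SUM(daily_count) AS daily_count
    FROM (SELECT date_posted, COUNT(*) AS daily_count FROM article GROUP BY date_posted
          UNION ALL
          SELECT date_posted, COUNT(*) AS daily_count FROM blogposts GROUP BY date_posted) AS combined
    GROUP BY date_posted
    ORDER BY date_posted
""").fetchall()

print(union_then_count)
print(count_then_sum)
```

Both queries agree on the result; which one is faster depends on the engine and the indexes, so it is worth checking the query plan on the real data.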
I would have used a different table design for this, with types and subtypes. Your Items table has a single column primary key and your Blog_Posts and Articles tables' primary keys are the same ID with a foreign key to the Items table. That would make something like this pretty easy to do and also helps to ensure data integrity.
With your existing design, your best bet is probably something like this:
SELECT
I.item_id,
I.source_id,
COALESCE(A.date_posted, B.date_posted) AS date_posted,
COUNT(*) AS date_count
FROM
Items I
LEFT OUTER JOIN Articles A ON
A.item_id = I.item_id AND
I.source_id = 'A' -- Or whatever the Articles ID is
LEFT OUTER JOIN Blog_Posts B ON
B.item_id = I.item_id AND
I.source_id = 'B' -- Or whatever the Blog_Posts ID is
GROUP BY
I.item_id,
I.source_id,
COALESCE(A.date_posted, B.date_posted)
You could also try using a UNION:
SELECT
SQ.item_id,
SQ.source_id,
SQ.date_posted,
COUNT(*) AS date_count
FROM
(
SELECT I1.item_id, I1.source_id, A.date_posted
FROM Items I1
INNER JOIN Articles A ON A.item_id = I1.item_id
WHERE I1.source_id = 'A'
UNION ALL
SELECT I2.item_id, I2.source_id, B.date_posted
FROM Items I2
INNER JOIN Blog_Posts B ON B.item_id = I2.item_id
WHERE I2.source_id = 'B'
) SQ
GROUP BY
SQ.item_id,
SQ.source_id,
SQ.date_posted
select item_id, date_posted from blogposts where /* some conditions */
union all select item_id, date_posted from articles where /* some conditions */
You'll probably need to put that into a subquery, and if you so desire, join it with other tables, when running the group by. But the main point is that union is the operator you use to combine like data from different tables. union all tells the database that you don't need it to combine duplicate records, since you know that the two tables will never share an item_id, so it's a little faster (probably).
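To see why union all matters for counting, here is a small sketch (Python's sqlite3, invented sample rows). Once item_id is dropped from the select list, plain union collapses identical dates into one row, which would silently break a per-day count:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles  (item_id INTEGER PRIMARY KEY, date_posted TEXT);
    CREATE TABLE blogposts (item_id INTEGER PRIMARY KEY, date_posted TEXT);
    -- Made-up sample data: three items, all posted on the same day.
    INSERT INTO articles  VALUES (1, '2024-01-01'), (2, '2024-01-01');
    INSERT INTO blogposts VALUES (3, '2024-01-01');
""")

# UNION ALL keeps every row from both tables.
all_rows = conn.execute("""
    SELECT date_posted FROM articles
    UNION ALL
    SELECT date_posted FROM blogposts
""").fetchall()

# UNION removes duplicate rows, so the three identical dates become one.
distinct_rows = conn.execute("""
    SELECT date_posted FROM articles
    UNION
    SELECT date_posted FROM blogposts
""").fetchall()

print(len(all_rows), len(distinct_rows))
```

So union all is not only (probably) faster here, it is also the one that keeps the row count intact when the selected columns alone do not make each row unique.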
I have 2 tables. Persons and Images.
One person may have many images.
I want to select each person with their primary image, BUT(!) if none of the images is marked isPrimary=1, return the first one.
At most one image per person may have isPrimary=1.
SELECT
*,
(
SELECT id
FROM Image
WHERE personId=r.id AND isPrimary=1
LIMIT 1
) AS primaryImageId
FROM Persons r
ORDER BY id DESC;
I did it with subselect, join is also good...
Thanks
You can use:
SELECT r.*,
(SELECT i.id
FROM Image i
WHERE i.personId = r.id
ORDER BY i.isPrimary DESC, i.id ASC
LIMIT 1
) AS primaryImageId
FROM Persons r
ORDER BY r.id DESC;
This orders the images, with the primary one first -- and then takes the first image.
You should learn to qualify all column references -- this is especially important when using correlated subqueries. I assume that the alias r makes sense on the persons table in your native language. Otherwise, use something sensible such as p.
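As a sanity check of the ordering trick, here is a minimal sketch run with Python's sqlite3 on invented rows: one person with a primary image, one with images but no primary, and one with no images at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Persons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Image   (id INTEGER PRIMARY KEY, personId INTEGER, isPrimary INTEGER);
    -- Made-up sample data.
    INSERT INTO Persons VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO Image   VALUES (10, 1, 0), (11, 1, 1),   -- Ann: image 11 is primary
                               (20, 2, 0), (21, 2, 0);   -- Bob: no primary image
""")

rows = conn.execute("""
    SELECT r.id,
           (SELECT i.id
            FROM Image i
            WHERE i.personId = r.id
            ORDER BY i.isPrimary DESC, i.id ASC  -- primary first, then lowest id
            LIMIT 1) AS primaryImageId
    FROM Persons r
    ORDER BY r.id DESC
""").fetchall()

print(rows)
```

Ann gets her flagged image, Bob falls back to his lowest image id, and Cy (no images) gets NULL, which is what the correlated subquery returns when it finds no rows.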
I have 2 tables, authors and books
authors contains the unique id authorId
books also contains this as a foreign key
I need to know the authors with the most number of books. If 2 or more authors are tied for the greatest number of books, I need to show both authors
I have been able to achieve this by first getting the maximum count
SELECT @maxCount := (MAX(counter)) FROM (SELECT count(*) AS counter FROM books GROUP BY authorId) AS counts;
and then using it to get the Ids with that count as part of my author selection
SELECT *
FROM authors
WHERE authorId IN (
SELECT authorId
FROM books
GROUP BY authorId
HAVING COUNT(*) = @maxCount
);
I've been told that I am not allowed to use variables and that what I've done is horribly inefficient if the tables grow very large.
Am I missing something obvious here? Is there a way to do this in a single statement without a variable (or temp table), and without having to select/group the entire books table twice?
SELECT author, COUNT(*)
FROM authors
JOIN books
ON authors.authorId=books.AuthorId
GROUP BY author
ORDER BY COUNT(*) DESC
Will give you a list ordered by the number of books for each author. I don't have an instance nearby to test, and tend to steer clear of embedded variables but expect something like....
SELECT *
FROM (
SELECT author
, @maxcount:=IF(COUNT(*)>@maxcount,COUNT(*), @maxcount)
, COUNT(*) AS cnt
FROM authors
JOIN books
ON authors.authorId=books.AuthorId
GROUP BY author
ORDER BY COUNT(*) DESC
) ilv
WHERE cnt=@maxcount;
Performance still sucks with large datasets (even with the right indexes). If you have to run this query frequently with >100,000 records, then you might consider denormalizing your data.
Symcbean's solution is great... you can add LIMIT 1 to it to get only one row.
SELECT A.authorId, A.name, COUNT(*) AS num_books
FROM authors A
INNER JOIN books B
ON A.authorId=B.AuthorId
GROUP BY A.authorId, A.name
ORDER BY COUNT(*) DESC
LIMIT 1
But if you want to get all the authors who share the max number of books, your best bet is to store the max(count) in a variable, or temp table and use it in second query.
For example, you can store the info in the following temp table:
CREATE TEMPORARY TABLE IF NOT EXISTS maxBooks AS (
SELECT authorId, COUNT(*) AS num_books
FROM books
GROUP BY authorId
ORDER BY COUNT(*) DESC
LIMIT 1
)
Now you can join it to your table to get the authors whose counts are equal to the max count.
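The second step might look roughly like the sketch below. It runs on Python's sqlite3 rather than MySQL, the sample rows are invented, and the bookId column is a hypothetical key added only so the books table has something to insert into:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (authorId INTEGER PRIMARY KEY, name TEXT);
    -- bookId is a hypothetical surrogate key, not part of the question.
    CREATE TABLE books (bookId INTEGER PRIMARY KEY,
                        authorId INTEGER REFERENCES authors(authorId));
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO books (authorId) VALUES (1), (1), (2), (2), (3);

    -- Step 1: remember the highest per-author book count.
    CREATE TEMPORARY TABLE maxBooks AS
        SELECT authorId, COUNT(*) AS num_books
        FROM books
        GROUP BY authorId
        ORDER BY COUNT(*) DESC
        LIMIT 1;
""")

# Step 2: keep every author whose count equals that maximum (ties included).
top_authors = conn.execute("""
    SELECT A.authorId, A.name, COUNT(*) AS num_books
    FROM authors A
    INNER JOIN books B ON A.authorId = B.authorId
    GROUP BY A.authorId, A.name
    HAVING COUNT(*) = (SELECT num_books FROM maxBooks)
    ORDER BY A.authorId
""").fetchall()

print(top_authors)
```

Ann and Bob are tied at two books each, so both come back. Only num_books from the temp table is used; which of the tied authorId values it happens to hold does not matter.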
I have three Tables:
Posts:
id, title, authorId, text
authors:
id, name, country
Comments:
id, authorId, text, postId
I want to run a mysql command which selects the first 5 posts which were written by authors, whose country is 'Ireland'. In the same call, I want to retrieve all the comments for those five posts, and also the author info.
I've tried the following:
SELECT posts.id as 'posts.id', posts.title as 'posts.title' (etc. etc. list all fields in three table)
FROM
(SELECT * FROM posts, authors WHERE authors.country = 'ireland' AND authors.id = posts.authorId LIMIT 0, 5 ) as posts
LEFT JOIN
comments ON comments.postId = posts.id,
authors
WHERE
authors.id = posts.authorId
I had to include every field with an alias ^ because there was a duplicate for id, and more fields in future may become duplicates as I'm looking for a generic solution.
My two questions are:
1) I am getting a duplicate field entry from within my subselect for id, so do I have to list out all my fields as aliases again within the subselect or is there only one field I need for a subselect
2) Is there a way to auto-alias my call? At the moment I've just aliased every field in the main select but can it do this for me so there are no duplicates?
Sorry if this isn't very clear it's a bit of a messy problem! Thanks.
You are doing an unnecessary join back to the author table in your query. You get all the fields you want in the posts subquery. I would rename this to something other than an existing table, perhaps pa to indicate posts and authors.
You say you want the first 5 posts, but have no order clause. A better form of the query is:
SELECT pa.id as 'posts.id', pa.title as 'posts.title' (etc. etc. list all fields in three table)
FROM (SELECT *
FROM posts join
authors
on authors.id = posts.authorId
WHERE authors.country = 'ireland'
order by posts.date -- assumes posts has a date column; the schema in the question does not list one
LIMIT 0, 5
) pa LEFT JOIN
comments c
ON c.postId = pa.id
Note that this returns the first five posts and their authors (as specified in the question). But one author may be responsible for all five posts.
In MySQL, you can use * and it will get rid of duplicate aliases in the from clause. I think this is dangerous. It is better to list all the columns you want.
To answer your questions:
You can select as many (or as few) columns as you need from a sub-query
You do not need to join the authors table again since you already selected all fields in the sub-query (and so get rid of duplicate columns names).
A few additional remarks...
... about the JOIN syntax
Prefer the form
FROM t1 JOIN t2 ON (t1.fk = t2.pk)
to the obsolete, obscure
FROM t1, t2 WHERE t1.fk = t2.pk
... about the use of a LIMIT clause without an ORDER BY clause
The order in which rows are returned by a SELECT statement without an ORDER BY clause is undefined. Therefore, a LIMIT n clause without an ORDER BY clause could return any n rows in theory.
Your final query should look like this:
SELECT *
FROM (
SELECT *
FROM posts
JOIN authors ON (authors.id = posts.authorId )
WHERE authors.country = 'ireland'
ORDER BY posts.id DESC -- assuming this column is monotonically increasing
LIMIT 5
) AS last_posts
LEFT JOIN comments ON ( comments.postId = last_posts.id )
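For what it's worth, here is a sketch of the same shape with every column listed and aliased explicitly, so the sub-query never exposes two columns called id. It runs on Python's sqlite3 with invented rows; the alias names (post_id, post_title, author_name, comment_id) are my own choices, not anything from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE posts    (id INTEGER PRIMARY KEY, title TEXT, authorId INTEGER, text TEXT);
    CREATE TABLE comments (id INTEGER PRIMARY KEY, authorId INTEGER, text TEXT, postId INTEGER);
    -- Made-up sample data.
    INSERT INTO authors  VALUES (1, 'Aoife', 'ireland'), (2, 'Bill', 'usa');
    INSERT INTO posts    VALUES (1, 'p1', 1, 'body'), (2, 'p2', 2, 'body'), (3, 'p3', 1, 'body');
    INSERT INTO comments VALUES (1, 2, 'nice', 3), (2, 1, 'thanks', 3);
""")

rows = conn.execute("""
    SELECT lp.post_id, lp.post_title, lp.author_name, c.id AS comment_id
    FROM (
        -- Explicit aliases: no duplicate 'id' column leaves the sub-query.
        SELECT p.id AS post_id, p.title AS post_title, a.name AS author_name
        FROM posts p
        JOIN authors a ON a.id = p.authorId
        WHERE a.country = 'ireland'
        ORDER BY p.id DESC   -- assuming id increases with each new post
        LIMIT 5
    ) AS lp
    LEFT JOIN comments c ON c.postId = lp.post_id
    ORDER BY lp.post_id DESC, c.id
""").fetchall()

print(rows)
```

Post 3 appears once per comment and post 1, which has no comments, still appears with a NULL comment_id because of the LEFT JOIN.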
I have an article_views table which holds the number of article views for each day. A new record is created to hold the count for each separate day for each article.
The query below gets the article id and total views for the top 5 viewed article id for all time :
SELECT article_id,
SUM(article_count) as cnt
FROM article_views
GROUP BY article_id
ORDER BY cnt DESC
LIMIT 5
I also have a separate articles table which holds all the article fields. I want to amend the query above to join to the articles table and get two fields for each article id. I have tried to do this below, but the count is coming back incorrectly:
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
I'm not sure exactly what I'm doing wrong. Do I need to use a subquery?
Add articles.article_title, articles.artcile_url to the GROUP BY clause:
SELECT
article_views.article_id,
articles.article_title,
articles.artcile_url,
SUM( article_views.article_count ) AS cnt
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id,
articles.article_title,
articles.artcile_url
ORDER BY cnt DESC
LIMIT 5;
The reason you were not getting the correct result set is that when you select columns that are neither included in the GROUP BY clause nor wrapped in an aggregate function, MySQL picks an arbitrary value for them.
You are using a MySQL (mis) feature called Hidden Columns, because article title is not in the group by. However, this may or may not be causing your problem.
If the counts are wrong, then I think you have duplicate article_id in the article table. You can check this by doing:
select article_id, count(*) as cnt
from articles
group by article_id
having cnt > 1
If any appear, then that is your problem. If they all have different titles, then grouping by the title (as suggested by Mahmoud) would fix the problem.
If not, one way to fix it is the following:
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views INNER JOIN
(select a.* from articles a group by a.article_id) articles
ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
This chooses an arbitrary title for the article.
Your query looks basically right to me...
But the value returned for cnt is going to depend on the article_id column being UNIQUE in the articles table. (We'd assume that it's the primary key, but absent a schema definition, that's only an assumption.)
Also, we're likely to assume there's a foreign key between the tables, that is, there are no values of article_id in the article_views table which don't match a value of article_id on a row from the articles table.
To check for "orphan" article_id values, run a query like:
SELECT v.article_id
FROM article_views v
LEFT
JOIN articles a
ON a.article_id = v.article_id
WHERE a.article_id IS NULL
To check for "duplicate" article_id values in articles, run a query like:
SELECT a.article_id
FROM articles a
GROUP BY a.article_id
HAVING COUNT(1) > 1
If either of those queries returns rows, that could be an explanation for the behavior you observe.
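To illustrate how a duplicate article_id inflates the sum, here is a small sketch in Python's sqlite3 with invented data. The articles table deliberately has no primary key so that a duplicate row can exist:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- No PRIMARY KEY on purpose, so article 1 can appear twice.
    CREATE TABLE articles      (article_id INTEGER, article_title TEXT);
    CREATE TABLE article_views (article_id INTEGER, article_count INTEGER);
    INSERT INTO articles      VALUES (1, 'A'), (1, 'A'), (2, 'B');
    INSERT INTO article_views VALUES (1, 10), (1, 5), (2, 20);
""")

# Totals from article_views alone.
plain = conn.execute("""
    SELECT article_id, SUM(article_count) AS cnt
    FROM article_views
    GROUP BY article_id
    ORDER BY article_id
""").fetchall()

# Same totals after the join: each view row for article 1 matches two
# rows in articles, so its counts are added twice.
joined = conn.execute("""
    SELECT v.article_id, SUM(v.article_count) AS cnt
    FROM article_views v
    INNER JOIN articles a ON a.article_id = v.article_id
    GROUP BY v.article_id
    ORDER BY v.article_id
""").fetchall()

print(plain)
print(joined)
```

Article 1 jumps from 15 to 30 after the join while article 2 is unaffected, which is the kind of discrepancy the duplicate check above would explain.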
I have 2 tables authors and authors_sales
The table authors_sales is updated each hour so is huge.
What I need is to create a ranking, for that I need to join both tables (authors has all the author data while authors_sales has just sales numbers)
How can I create a final table with the ranking of authors ordering it by sales?
The common key is the: authorId
I tried with LEFT JOIN but I must be doing something wrong because I get all the authors_sales table, not just the last.
Any tip in the right direction much appreciated
If you're looking for aggregate data of the sales, you'd want to join the tables, group by the authorId. Something like...
select authors.author_id, SUM(author_sales.sale_amt) as total_sales
from authors
inner join author_sales on author_sales.author_id = authors.author_id
group by authors.author_id
order by total_sales desc
However (I couldn't distinguish from your question whether the above scenario or next is true), if you're only looking for the max value of the author_sales table (if the data in this table is already aggregated), you can join on a nested query for author_sales, such as...
select authors.author_id, t.sale_amt from authors
inner join
(select top 1 author_sales.author_id,
author_sales.sale_amt,
author_sales.some_identifier
from author_sales order by some_identifier desc) t
on t.author_id = authors.author_id
order by t.sale_amt desc
The some_identifier would be how you determine which record is the most recent for author_sales, whether it is a timestamp of when it was inserted or an incremental primary key, however it is set up. Depending on if the data in author_sales is aggregated already, one of these two should do it for you...
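If what you need is the most recent row per author rather than the single most recent row in the whole table, one common variation is to join against a sub-query that picks the highest some_identifier for each author. This is only a sketch under that assumption, run on Python's sqlite3 with invented rows and the same placeholder column names as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT);
    -- some_identifier stands in for whatever marks the newest row
    -- (an auto-increment key here; a timestamp would work the same way).
    CREATE TABLE author_sales (some_identifier INTEGER PRIMARY KEY,
                               author_id INTEGER,
                               sale_amt INTEGER);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO author_sales VALUES (1, 1, 100), (2, 2, 50), (3, 1, 120), (4, 2, 300);
""")

latest = conn.execute("""
    SELECT a.author_id, s.sale_amt
    FROM authors a
    INNER JOIN author_sales s ON s.author_id = a.author_id
    INNER JOIN (
        -- Newest row id per author.
        SELECT author_id, MAX(some_identifier) AS last_id
        FROM author_sales
        GROUP BY author_id
    ) m ON m.author_id = s.author_id AND m.last_id = s.some_identifier
    ORDER BY s.sale_amt DESC
""").fetchall()

print(latest)
```

Each author contributes exactly one row (their newest), and the outer ORDER BY turns that into a ranking.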
select a.*, sum(b.sales)
from authors as a
inner join authors_sales as b
using (authorId)
group by b.authorId
order by sum(b.sales) desc;
/* assuming column sales = total for each row in authors_sales */
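Here is a runnable sketch of the summed ranking, using Python's sqlite3 with invented rows. It assumes, like the comment above, that each row in authors_sales holds a sales figure to be added up; the name column is a made-up stand-in for the author data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors       (authorId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE authors_sales (authorId INTEGER, sales INTEGER);
    -- Made-up sample data: several sales rows per author.
    INSERT INTO authors       VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO authors_sales VALUES (1, 5), (1, 7), (2, 20), (3, 1);
""")

ranking = conn.execute("""
    SELECT a.authorId, a.name, SUM(s.sales) AS total_sales
    FROM authors a
    INNER JOIN authors_sales s ON s.authorId = a.authorId
    GROUP BY a.authorId, a.name      -- one output row per author
    ORDER BY total_sales DESC        -- highest seller first
""").fetchall()

print(ranking)
```

Grouping collapses the hourly rows into one line per author, which is what stops the whole authors_sales table from coming back.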