I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:
article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}
I would like to get all articles that contain the TOP X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.
SELECT [entity|keyword]_id, count(*) as num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10
How would I get all articles that have a relation to the top entities and keywords?
This was what I imagined, but I know it doesn't work because of the group by entity limiting the article_id's that return.
SELECT * FROM article
WHERE EXISTS (
[... where article is mentioned in top X entities.. ]
) AND EXISTS (
[... where article is mentioned in top X keywords.. ]
);
If I understand you correct the objective of the query is to find the articles that have a relation to both one of the top 10 entities as well as to one of the top 10 keywords. If this is the case the following query should do that, by requiring that the article returned has a match in both the set of top 10 entities and the set of top 10 keywords.
Please give it a try.
SELECT a.id
FROM article a
INNER JOIN article_entity ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
SELECT entity_id, COUNT(article_id) AS article_entity_count
FROM article_entity
GROUP BY entity_id
ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
SELECT keyword_id, COUNT(article_id) AS article_keyword_count
FROM article_keyword
GROUP BY keyword_id
ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;
The downside to using a simplelimit 10in the two subqueries for top entities/keywords is that it won't handle ties, so if the 11th keyword was just as popular as the 10th it still won't get chosen. This can be fixed though by using a ranking function, but afaik MySQL doesn't have anything build in (like RANK() window functions in Oracle or MSSQL).
I set up a sample SQL Fiddle (but using fewer data points andlimit 2as I'm lazy).
Not knowing the volume of data you are working with, I would first recommend that you have two storage columns on your article table for count of entities and keywords respectively. Then via triggers on adding/deleting from each, update the respective counter columns. This way, you don't have to do a burning query each time needed, especially in a web-based interface. Then, you can just select from the articles table ordered by the E+K counts descending and be done with it, instead of constant sub-querying the underlying tables.
Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to be doing a limit of 10 records for each set. Lets throw this scenario into the picture. Say you have articles 1-20 all a range of 10, 9 and 8 entities and 1-2 keywords. Then articles 21-50 have the reverse... 10, 9, 8 keywords and 1-2 entities. Now, you have articles 51-58 that have 7 entities AND 7 keywords total of 14 combined points. None of the queries would have caught this as entities would only return the qualifying 1-20 records and keywords records 21-50. Articles 51-58 would be so far down on the list, it would not even be considered even though its total is 14.
To handle this, each sub-query is a full query specifically on the article ID and its count. Simple order by the article_ID as that is basis of the join to the master article table.
Now, the coalesce() will get the count if so available, otherwise 0 and add the two values together. From that, the results are ordered with the highest counts first (thus getting scenario sample articles 51-58 plus a few of the others) when the limit is applied.
SELECT
a.id,
coalesce( JustE.ECount, 0 ) ECount,
coalesce( JustK.KCount, 0 ) KCount,
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
from
article a
LEFT JOIN ( select article_id, COUNT(*) as ECount
from article_entity
group by article_id
order by article_id ) JustE
on a.id = JustE.article_id
LEFT JOIN ( select article_id, COUNT(*) as KCount
from article_keyword
group by article_id
order by article_id ) JustK
on a.id = JustK.article_id
order by
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
limit 10
I took this in several steps
tl;dr This shows all the articles from the top (4) keywords and entities:
Here's a fiddle
select
distinct article_id
from
(
select
article_id
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
article_id
from
article_keyword ak
inner join
(select
keyword_id, count(*)
from
article_keyword
group by
keyword_id
order by
count(*) desc
limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles
Explanation:
This starts with an effort to find the top X entities. (4 seemed to work for the number of associations i wanted to make in the fiddle)
I didn't want to select articles here because it skews the group by, you want to focus solely on the top entities. Fiddle
select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4
Then I selected all the articles from these top entities. Fiddle
select
*
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.
This will give you all articles that have a relation to the top (x) entities and keywords.
This gets the top 10 keyword articles that are also a top 10 entity. You may not get 10 records back because it is possible that an article only meets one of the criteria (top entity but not top keyword or top keyword but not top entity)
select *
from article a
inner join
(select count(*),ae.article_id
from article_entity ae
group by ae.article_id
order by count(*) Desc limit 10) e
on a.id = e.article_id
inner join
(select count(*),ak.article_id
from article_keyword ak
group by ak.article_id
order by count(*) Desc limit 10) k
on a.id = k.article_id
Related
I have two tables, one for image records (posts) and the other one is for likes records. So i made an INNER JOIN from one table to another because i needed to select the image and the quantity of likes that particular image has. but i also need to order them by the quantity of likes so i can make a top 10 of most voted images on the site, so here is my query:
SELECT
COUNT(DISTINCT B.votes),
A.id_image,
A.image,
A.title
FROM likes_images AS B INNER JOIN images AS A ON A.id_image = B.id_image
GROUP BY A.title
ORDER BY COUNT(DISTINCT B.votes) ASC
LIMIT 10
It works, but it's only ordering the images by the title (Alphabetical). I want to order them from the most voted to the less voted.
Any ideas?
In most SQL implementations, GROUP BY criterion implies any ORDER BY clause to be ignored in favor of criterion.
So you might try this:
SELECT L.id_image, A.image, A.title,
* FROM (
SELECT COUNT(votes) AS likes, id_image
FROM likes_images
GROUP BY id_image
) AS L
JOIN images B ON B.id_image = L.id_image
ORDER BY L.likes DESC
LIMIT 10
Note that I set ORDER BY to DESC (since you want top10 I don't understand you chose ASC)!
I am trying to find the top 10 authors who has written the most number of books.
I have two table as follows:
Author table:
author_name
publisher_key
Publication table:
publisher_id
publisher_key
title
year
pageno
To find the result, I tried using the following query:
SELECT a.author_name, SUM(p.pageno)
FROM author a JOIN publication p ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
LIMIT 10;
I have no idea why when I run this query it takes ages though the number of records is only 200.
Try
SELECT a.author_name, count(*)
FROM author a
INNER JOIN publication p ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
ORDER BY 2 desc
LIMIT 10;
You want to know who write most number of books, so you need to count the number of registries by author.
The order by 2 desc will order your query from the bigger number to the lesser 2 means the second field on the select list.
If it is as ydoow suggested that it maybe a locking issue, try running your select with NOLOCK to confirm.
SELECT a.author_name, count(*)
FROM author a WITH (nolock)
INNER JOIN publication p WITH (nolock) ON a.publisher_key = p.publisher_key
GROUP BY a.author_name
ORDER BY 2 desc
LIMIT 10;
You can get more info here:
Any way to select without causing locking in MySQL?
I have an article table which holds the number of articles views for each day. A new record is created to hold the count for each seperate day for each article.
The query below gets the article id and total views for the top 5 viewed article id for all time :
SELECT article_id,
SUM(article_count) as cnt
FROM article_views
GROUP BY article_id
ORDER BY cnt DESC
LIMIT 5
I also have a seperate article table which holds all the article fields. I want to ammend the query above to join to the article table and get two fields for each article id. I have tried to do this below but count is comming back incorrectly :
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
Im not sure exactly what im doing wrong. Do I need to do a subquery?
Add articles.article_title, articles.artcile_url to the GROUP BY clause:
SELECT
article_views.article_id,
articles.article_title,
articles.artcile_url,
SUM( article_views.article_count ) AS cnt
FROM article_views
INNER JOIN articles ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id,
articles.article_title,
articles.artcile_url
ORDER BY cnt DESC
LIMIT 5;
The reason you were not getting correct result set, is that when you select rows that are not included in the GROUP BY nor in an aggregate function in the SELECT clause MySQL picks up random value.
You are using a MySQL (mis) feature called Hidden Columns, because article title is not in the group by. However, this may or may not be causing your problem.
If the counts are wrong, then I think you have duplicate article_id in the article table. You can check this by doing:
select article_id, count(*) as cnt
from articles
group by article_id
having cnt > 1
If any appear, then that is your problem. If they all have different titles, then grouping by the title (as suggested by Mahmoud) would fix the problem.
If not, one way to fix it is the following:
SELECT article_views.article_id, SUM( article_views.article_count ) AS cnt, articles.article_title, articles.artcile_url
FROM article_views INNER JOIN
(select a.* from articles group by article_id) articles
ON articles.article_id = article_views.article_id
GROUP BY article_views.article_id
ORDER BY cnt DESC
LIMIT 5
This chooses an abitrary title for the article.
Your query looks basically right to me...
But the value returned for cnt is going to be dependent upon article_id column being UNIQUE in the articles table. We'd assume that it's the primary key, and absent a schema definition, that's only an assumption.)
Also, we're likely to assume there's a foreign key between the tables, that is, there are no values of article_id in the articles_view table which don't match a value of article_id on a row from the articles table.
To check for "orphan" article_id values, run a query like:
SELECT v.article_id
FROM articles_view v
LEFT
JOIN articles a
ON a.article_id = v.article_id
WHERE a.article_id IS NULL
To check for "duplicate" article_id values in articles, run a query like:
SELECT a.article_id
FROM articles a
GROUP BY a.article_id
HAVING COUNT(1) > 1
If either of those queries returns rows, that could be an explanation for the behavior you observe.
I have these tables and queries as defined in sqlfiddle.
First my problem was to group people showing LEFT JOINed visits rows with the newest year. That I solved using subquery.
Now my problem is that that subquery is not using INDEX defined on visits table. That is causing my query to run nearly indefinitely on tables with approx 15000 rows each.
Here's the query. The goal is to list every person once with his newest (by year) record in visits table.
Unfortunately on large tables it gets real sloooow because it's not using INDEX in subquery.
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id
Does anyone know how to force MySQL to use INDEX already defined on visits table?
Your query:
SELECT *
FROM people
LEFT JOIN (
SELECT *
FROM visits
ORDER BY visits.year DESC
) AS visits
ON people.id = visits.id_people
GROUP BY people.id;
First, is using non-standard SQL syntax (items appear in the SELECT list that are not part of the GROUP BY clause, are not aggregate functions and do not sepend on the grouping items). This can give indeterminate (semi-random) results.
Second, ( to avoid the indeterminate results) you have added an ORDER BY inside a subquery which (non-standard or not) is not documented anywhere in MySQL documentation that it should work as expected. So, it may be working now but it may not work in the not so distant future, when you upgrade to MySQL version X (where the optimizer will be clever enough to understand that ORDER BY inside a derived table is redundant and can be eliminated).
Try using this query:
SELECT
p.*, v.*
FROM
people AS p
LEFT JOIN
( SELECT
id_people
, MAX(year) AS year
FROM
visits
GROUP BY
id_people
) AS vm
JOIN
visits AS v
ON v.id_people = vm.id_people
AND v.year = vm.year
ON v.id_people = p.id;
The: SQL-fiddle
A compound index on (id_people, year) would help efficiency.
A different approach. It works fine if you limit the persons to a sensible limit (say 30) first and then join to the visits table:
SELECT
p.*, v.*
FROM
( SELECT *
FROM people
ORDER BY name
LIMIT 30
) AS p
LEFT JOIN
visits AS v
ON v.id_people = p.id
AND v.year =
( SELECT
year
FROM
visits
WHERE
id_people = p.id
ORDER BY
year DESC
LIMIT 1
)
ORDER BY name ;
Why do you have a subquery when all you need is a table name for joining?
It is also not obvious to me why your query has a GROUP BY clause in it. GROUP BY is ordinarily used with aggregate functions like MAX or COUNT, but you don't have those.
How about this? It may solve your problem.
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
If you need to show the person, the most recent visit, and the note from the most recent visit, you're going to have to explicitly join the visits table again to the summary query (virtual table) like so.
SELECT a.id, a.name, a.year, v.note
FROM (
SELECT people.id, people.name, MAX(visits.year) year
FROM people
JOIN visits ON people.id = visits.id_people
GROUP BY people.id, people.name
)a
JOIN visits v ON (a.id = v.id_people and a.year = v.year)
Go fiddle: http://www.sqlfiddle.com/#!2/d67fc/20/0
If you need to show something for people that have never had a visit, you should try switching the JOIN items in my statement with LEFT JOIN.
As someone else wrote, an ORDER BY clause in a subquery is not standard, and generates unpredictable results. In your case it baffled the optimizer.
Edit: GROUP BY is a big hammer. Don't use it unless you need it. And, don't use it unless you use an aggregate function in the query.
Notice that if you have more than one row in visits for a person and the most recent year, this query will generate multiple rows for that person, one for each visit in that year. If you want just one row per person, and you DON'T need the note for the visit, then the first query will do the trick. If you have more than one visit for a person in a year, and you only need the latest one, you have to identify which row IS the latest one. Usually it will be the one with the highest ID number, but only you know that for sure. I added another person to your fiddle with that situation. http://www.sqlfiddle.com/#!2/4f644/2/0
This is complicated. But: if your visits.id numbers are automatically assigned and they are always in time order, you can simply report the highest visit id, and be guaranteed that you'll have the latest year. This will be a very efficient query.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT id_people, max(id) id
FROM visits
GROUP BY id_people
)m
JOIN people p ON (p.id = m.id_people)
JOIN visits v ON (m.id = v.id)
http://www.sqlfiddle.com/#!2/4f644/1/0 But this is not the way your example is set up. So you need another way to disambiguate your latest visit, so you just get one row per person. The only trick we have at our disposal is to use the largest id number.
So, we need to get a list of the visit.id numbers that are the latest ones, by this definition, from your tables. This query does that, with a MAX(year)...GROUP BY(id_people) nested inside a MAX(id)...GROUP BY(id_people) query.
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON (p.id_people = v.id_people AND p.year = v.year)
GROUP BY v.id_people
The overall query (http://www.sqlfiddle.com/#!2/c2da2/1/0) is this.
SELECT p.id, p.name, v.year, v.note
FROM (
SELECT v.id_people,
MAX(v.id) id
FROM (
SELECT id_people,
MAX(year) year
FROM visits
GROUP BY id_people
)p
JOIN visits v ON ( p.id_people = v.id_people
AND p.year = v.year)
GROUP BY v.id_people
)m
JOIN people p ON (m.id_people = p.id)
JOIN visits v ON (m.id = v.id)
Disambiguation in SQL is a tricky business to learn, because it takes some time to wrap your head around the idea that there's no inherent order to rows in a DBMS.
I'm wondering if there's a way to speed up a mysql query which is ordered by multiple subqueries.
On a music related site users can like different things like artists, songs, albums etc. These "likes" are all stored in the same table. Now I want to show a list of artists ordered by the number of "likes" by the users friends and all users. I want to show all artists, also those who have no likes at all.
I have the following query:
SELECT `artists`.*,
// friend likes
(SELECT COUNT(*)
FROM `likes`
WHERE like_type = 'artist'
AND like_id = artists.id
AND user_id IN (1,2,3,4, etc) // ids of friends
GROUP BY like_id
) AS `friend_likes`,
// all likes
(SELECT COUNT(*)
FROM `likes`
WHERE like_type = 'artist'
AND like_id = artists.id
GROUP BY like_id
) AS `all_likes`
FROM artists
ORDER BY
friend_likes DESC,
all_likes DESC,
artists.name ASC
The query takes ± 1.5 seconds on an artist table with 2000 rows. I'm afraid that this takes longer and longer as the table gets bigger and bigger. I tried using JOINS by can't seem to get this working because the subqueries contain WHERE statements.
Any ideas in the right direction would be greatly appreciated!
Try using JOINs instead of subqueries:
SELECT
artists.*, -- do you really need all this?
count(user_id) AS all_likes,
sum(user_id IN (1, 2, 3, 4)) AS friend_likes
FROM artists a
LEFT JOIN likes l
ON l.like_type = 'artist' AND l.like_id = a.id
GROUP BY a.id
ORDER BY
friend_likes DESC,
all_likes DESC,
artists.name ASC;
If this doesn't make the query faster, try adding indices, or consider selecting less fields.
You need to break this down a bit to see where the time goes. You're absolutely right that 1.5 sec on 2000 rows won't scale well. I suspect you need to look at indexes and foreign-key relationships. Look at each count/group-by query individually to get them tuned as best you can then recombine.
try rolling up into a query using inline IF() and go through the table/join ONCE
SELECT STRAIGHT_JOIN
artists.*
, LikeCounts.AllCount
, LikeCounts.FriendLikeCount
FROM
(SELECT
like_id
, count(*) AllCount
, sum( If( User_id in ( 1, 2, 3, 4 ), 1, 0 ) as FriendLikeCount
FROM
friend_likes
WHERE
like_type = 'artist'
GROUP BY
like_id ) LikeCounts
JOIN artists ON LikeCounts.like_id = artists.id
ORDER BY
LikeCounts.FriendLikeCount DESC
, LikeCounts.AllCount DESC
, artists.name ASC