How to select n% random rows with specific condition in SQL? - mysql

I have two tables.
Table1 has two columns: brand and review_counter.
Table2 also has two columns: brand and review.
There are several reviews for each brand. Is there any way in SQL to randomly select around 10% of reviews for each brand, without using "top n" command?
For example for 'Sony' there are 2,005 reviews, and I need to select 10% of them, 200 reviews.
Thank you in advance.

Possible method using user variables as counters. Compared to a solution similar to that above, this is likely to bring back a more accurate 10% of reviews per brand, but it is also likely to be slower (neither solution will be fast, as both rely on using RAND() on every row of the tables).
This gets all the rows, ordered by brand and then RAND(). It uses that as a sub query and adds a sequence number, resetting back to 1 for the first record of each brand. That in turn is used as the source for a query which keeps only the records whose generated sequence number is <= a tenth of the number of reviews for that brand.
SELECT sub1.brand, sub1.review
FROM
(
SELECT sub0.brand, sub0.reviews_wanted, sub0.review, @cnt:=IF(@brand = brand, @cnt+1, 1) AS cnt, @brand := brand
FROM
(
SELECT Table1.brand, (Table1.review_counter * 0.1) AS reviews_wanted, Table2.review
FROM Table1
INNER JOIN Table2
ON Table1.brand = Table2.brand
ORDER BY Table1.brand, RAND()
) sub0
CROSS JOIN (SELECT @cnt:=0, @brand:='') sub2
) sub1
WHERE cnt <= sub1.reviews_wanted
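On MySQL 8.0 or later, the same per-brand numbering can be written with ROW_NUMBER() instead of user variables. The following is only a sketch under stated assumptions, not a tested MySQL query: it runs against an in-memory SQLite database from Python so that it is self-contained (SQLite 3.25+ is needed for window functions), it uses SQLite's RANDOM() where MySQL would use RAND(), and the helper names and demo data are made up for illustration.

```python
import sqlite3


def sample_reviews_per_brand(conn, fraction=0.1):
    """Return (brand, review) rows: a random `fraction` of each brand's reviews."""
    sql = """
        SELECT brand, review
        FROM (
            SELECT t2.brand,
                   t2.review,
                   t1.review_counter * ? AS reviews_wanted,
                   -- number the reviews of each brand in random order
                   ROW_NUMBER() OVER (PARTITION BY t2.brand
                                      ORDER BY RANDOM()) AS cnt
            FROM Table1 AS t1
            JOIN Table2 AS t2 ON t1.brand = t2.brand
        ) AS numbered
        WHERE cnt <= reviews_wanted
    """
    return conn.execute(sql, (fraction,)).fetchall()


def build_demo_db():
    """Hypothetical demo data: 20 reviews for Sony, 10 for Acme."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (brand TEXT, review_counter INTEGER)")
    conn.execute(
        "CREATE TABLE Table2 (id INTEGER PRIMARY KEY, brand TEXT, review TEXT)"
    )
    for brand, n in (("Sony", 20), ("Acme", 10)):
        conn.execute("INSERT INTO Table1 VALUES (?, ?)", (brand, n))
        conn.executemany(
            "INSERT INTO Table2 (brand, review) VALUES (?, ?)",
            [(brand, f"{brand} review {i}") for i in range(n)],
        )
    return conn


conn = build_demo_db()
for brand, review in sample_reviews_per_brand(conn):
    print(brand, review)
```

Which reviews are printed changes on every run, but the number of rows per brand stays at a tenth of review_counter.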
EDIT.
This might be a bit more memory efficient (although probably slower).
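This has a sub query that gets the unique id of all the reviews for a brand in a random order, along with a count that is one tenth of the number of reviews for the brand. It then uses the count with SUBSTRING_INDEX to get the ids of the first random 10%, and joins that using FIND_IN_SET with the reviews table. Note that the result of GROUP_CONCAT is truncated at group_concat_max_len (1024 bytes by default), so that setting has to be raised for brands with many reviews.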
SELECT sub0.brand, Table2.review
FROM
(
SELECT Table1.brand, CEIL(Table1.review_counter * 0.1) AS reviews_wanted, GROUP_CONCAT(Table2.id ORDER BY RAND()) AS id
FROM Table1
INNER JOIN Table2
ON Table1.brand = Table2.brand
GROUP BY Table1.brand, reviews_wanted
) sub0
INNER JOIN Table2
ON FIND_IN_SET(Table2.id, SUBSTRING_INDEX(sub0.id, ',', reviews_wanted))
You might be able to do something a bit more efficient using one of the solutions in "How can I optimize MySQL's ORDER BY RAND() function?"

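RAND() generates random values v in the range 0 <= v < 1, so keeping the rows where that value is below 0.1 selects roughly 10% of them (the count is approximate, not exact). Why don't you try this?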
UPDATED2
SELECT review
FROM
(
SELECT review, (review_counter * RAND()) / review_counter AS rand
FROM Table1 INNER JOIN Table2 ON Table1.brand = Table2.brand
) t
WHERE rand < 0.1
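To see what a "random value below 0.1" filter does to the row count, here is a small Python simulation of the same idea. It is an illustration only: the function name and the 2,005 fake reviews are assumptions, and Python's random module stands in for MySQL's RAND().

```python
import random


def bernoulli_sample(rows, fraction=0.1, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Hypothetical data: 2,005 reviews for one brand, as in the question.
reviews = [f"Sony review {i}" for i in range(2005)]
sample = bernoulli_sample(reviews, 0.1, seed=42)
print(len(sample))  # near 200, but usually not exactly 200
```

Because every row is decided independently, the sample size fluctuates around 10% rather than hitting it exactly, and small brands can end up with no rows at all.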

Related

MySQL Spring complicated query - ways to order and query efficiency

I run this complicated query on Spring JPA Repository.
My goal is to get all info from the site table, ordering it by events severity on each site.
This is my query:
SELECT alls.* FROM sites AS alls JOIN
(
SELECT distinct ets.id FROM
(
SELECT s.id, et.`type`, et.severity_level, COUNT(et.`type`) FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
ORDER BY et.severity_level, COUNT(et.`type`) DESC
) AS ets
) as etsd ON alls.id = etsd.id
The second select (the one with "distinct") returns site_ids ordered correctly by severity.
Note that there are different event_types + severity in each site, and I use pagination on the answer, so I need the distinct.
The problem is - the main select doesn't keep this order.
Is there any way to keep the order in one complicated query?
Another related question - one of my ideas was making two queries:
1. The "select distinct" query that returns the order, saved in a list ("order list").
2. The main "sites" query (which becomes very simple) with "where id in {order list}".
3. Order the result of the second query in code by "order list".
I use the query every 10 seconds, so it is very sensitive on performance.
What seems to be faster in this case - original complicated query or those 2?
Any insight will be appreciated.
Thanks a lot.
A quirk of SQL's declarative set-oriented syntax for us procedural programmers: ORDER BY clauses in subqueries are not carried through to the outer query, except sometimes by accident. If you want ordering at any query level, you must specify it at that level or you will get unpredictable results. The query optimizers are usually smart enough to avoid wasting sort operations.
Your requirement: give at most one sites row for each sites.id value, ordered by the worst event. Worst: lowest event severity, and if there is more than one event with the lowest severity, the largest count.
Use this sort of thing to get the "worst" for each id, in place of DISTINCT.
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
This gives at most one row per sites.id value. Then your outer query is
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
/* your inner query */
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
Putting it all together:
SELECT alls.*
FROM sites alls
JOIN (
SELECT id, MIN(severity_level) severity_level, MAX(num) num
FROM (
SELECT s.id, et.severity_level, COUNT(et.`type`) num
FROM sites AS s
JOIN users_sites AS us ON (s.id=us.site_id)
JOIN users AS u ON (us.user_id=u.user_id)
JOIN areas AS a ON (s.id=a.site_id)
JOIN panels AS p ON (a.id=p.area_id)
JOIN events AS e ON (p.id=e.panel_id)
JOIN event_types AS et ON (e.event_type_id=et.id)
WHERE u.user_id="98765432-123a-1a23-123b-11a1111b2cd3"
GROUP BY s.id , et.`type`, et.severity_level
) ets
GROUP BY id
) worstevents ON alls.id = worstevents.id
ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
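As a quick check of the ordering logic, here is a sketch that runs the outer part of the query from Python against an in-memory SQLite database. The table ets is a hypothetical stand-in that already holds the rows the inner query would produce (one row per site, type and severity with its count), and the sample data is invented.

```python
import sqlite3


def sites_by_worst_event(conn):
    """Return site ids ordered by lowest severity level, then highest count."""
    sql = """
        SELECT alls.id
        FROM sites AS alls
        JOIN (
            SELECT id, MIN(severity_level) AS severity_level, MAX(num) AS num
            FROM ets              -- stand-in for the inner query
            GROUP BY id
        ) AS worstevents ON alls.id = worstevents.id
        ORDER BY worstevents.severity_level, worstevents.num DESC, alls.id
    """
    return [row[0] for row in conn.execute(sql)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE ets (id INTEGER, type TEXT, severity_level INTEGER, num INTEGER)"
)
conn.executemany("INSERT INTO sites VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])
conn.executemany(
    "INSERT INTO ets VALUES (?, ?, ?, ?)",
    [
        (1, "door", 2, 5),
        (1, "fire", 3, 1),
        (2, "fire", 1, 2),
        (3, "smoke", 1, 7),
        (3, "door", 2, 1),
    ],
)
print(sites_by_worst_event(conn))  # [3, 2, 1]
```

Sites 2 and 3 both have a severity 1 event, so the larger count puts site 3 first; site 1 comes last because its lowest severity is 2. The ORDER BY sits on the outermost query, which is what makes the order dependable.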
An index on users.user_id will help performance for these single-user queries.
If you still have performance trouble, please read this and ask another question.

Unwanted result with two "Join" in SQL

I'm currently stuck on a problem with my database. I have a table of film reviews, a table of positives and another one of negatives. These last two are linked to the id of a review.
Here are the positive and negative tables:
I'd like to get this result:
But I have this one instead:
Here's my SQL code to get this result:
SELECT positives.libelle AS positive, negatives.libelle AS negative FROM reviews LEFT JOIN positives ON positives.review_id = reviews.id LEFT JOIN negatives ON negatives.review_id = reviews.id WHERE reviews.id = 1
The result that you want is not really in a relational format -- because the column values on a given row really have nothing to do with each other.
MySQL does not support full join, so my recommendation is union all with row_number() (available from MySQL 8.0) to enumerate the rows and group by to bring them together:
SELECT MAX(positive) AS positive, MAX(negative) AS negative
FROM ((SELECT p.review_id, p.libelle AS positive, NULL AS negative,
ROW_NUMBER() OVER (PARTITION BY p.review_id ORDER BY p.id) AS seqnum
FROM positives p
WHERE p.review_id = 1
) UNION ALL
(SELECT n.review_id, NULL, n.libelle,
ROW_NUMBER() OVER (PARTITION BY n.review_id ORDER BY n.id) AS seqnum
FROM negatives n
WHERE n.review_id = 1
)
) pn
GROUP BY review_id, seqnum
ORDER BY review_id, seqnum;
Note this will return no rows if the review has neither positives nor negatives. You can incorporate a left join if that really is a consideration.
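The pairing by row number can be tried out with a small self-contained script. This is a sketch under assumptions: it uses Python's sqlite3 module (SQLite 3.25+ for window functions) instead of MySQL, it groups by the generated sequence number, and the libelle values and the helper name paired_feedback are invented.

```python
import sqlite3

PAIRING_SQL = """
    SELECT MAX(positive) AS positive, MAX(negative) AS negative
    FROM (
        SELECT review_id, libelle AS positive, NULL AS negative,
               ROW_NUMBER() OVER (PARTITION BY review_id ORDER BY id) AS seqnum
        FROM positives
        WHERE review_id = ?
        UNION ALL
        SELECT review_id, NULL, libelle,
               ROW_NUMBER() OVER (PARTITION BY review_id ORDER BY id) AS seqnum
        FROM negatives
        WHERE review_id = ?
    ) AS pn
    GROUP BY review_id, seqnum   -- one output row per sequence number
    ORDER BY review_id, seqnum
"""


def paired_feedback(conn, review_id):
    """Return (positive, negative) pairs for one review, padded with None."""
    return conn.execute(PAIRING_SQL, (review_id, review_id)).fetchall()


conn = sqlite3.connect(":memory:")
for table in ("positives", "negatives"):
    conn.execute(
        f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, review_id INTEGER, libelle TEXT)"
    )
conn.executemany(
    "INSERT INTO positives (review_id, libelle) VALUES (?, ?)",
    [(1, "good acting"), (1, "nice music"), (1, "great plot")],
)
conn.execute("INSERT INTO negatives (review_id, libelle) VALUES (1, 'too long')")

for pair in paired_feedback(conn, 1):
    print(pair)
```

With three positives and one negative this yields three rows: the first pairs "good acting" with "too long", and the other two carry None in the negative column instead of repeating it.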
If you don't need any information from the reviews table, why join it at all? Just filter each table by review_id = 1.
The following query doesn't cover your need totally, but it may be helpful for your problem. Joining both tables to the same review pairs every positive with every negative, whereas a UNION simply stacks the two lists, so prefer UNION over the joins here.
(SELECT positives.libelle AS positive,NULL AS negative FROM positives WHERE review_id=1)
UNION
(SELECT NULL,negatives.libelle FROM negatives WHERE review_id=1)

SQL top records based on two tables relations

I have three main items I am storing: Articles, Entities, and Keywords. This makes 5 tables:
article { id }
entity {id, name}
article_entity {id, article_id, entity_id}
keyword {id, name}
article_keyword {id, article_id, keyword_id}
I would like to get all articles that contain the TOP X keywords + entities. I can get the top X keywords or entities with a simple group by on the entity_id/keyword_id.
SELECT [entity|keyword]_id, count(*) as num FROM article_entity
GROUP BY entity_id ORDER BY num DESC LIMIT 10
How would I get all articles that have a relation to the top entities and keywords?
This was what I imagined, but I know it doesn't work because of the group by entity limiting the article_id's that return.
SELECT * FROM article
WHERE EXISTS (
[... where article is mentioned in top X entities.. ]
) AND EXISTS (
[... where article is mentioned in top X keywords.. ]
);
If I understand you correctly, the objective of the query is to find the articles that have a relation to both one of the top 10 entities and one of the top 10 keywords. If this is the case, the following query should do that, by requiring that the article returned has a match in both the set of top 10 entities and the set of top 10 keywords.
Please give it a try.
SELECT a.id
FROM article a
INNER JOIN article_entity ae ON a.id = ae.article_id
INNER JOIN article_keyword ak ON a.id = ak.article_id
INNER JOIN (
SELECT entity_id, COUNT(article_id) AS article_entity_count
FROM article_entity
GROUP BY entity_id
ORDER BY article_entity_count DESC LIMIT 10
) top_ae ON ae.entity_id = top_ae.entity_id
INNER JOIN (
SELECT keyword_id, COUNT(article_id) AS article_keyword_count
FROM article_keyword
GROUP BY keyword_id
ORDER BY article_keyword_count DESC LIMIT 10
) top_ak ON ak.keyword_id = top_ak.keyword_id
GROUP BY a.id;
The downside to using a simple limit 10 in the two subqueries for top entities/keywords is that it won't handle ties, so if the 11th keyword was just as popular as the 10th it still won't get chosen. This can be fixed by using a ranking function, but afaik MySQL doesn't have anything built in (like the RANK() window functions in Oracle or MSSQL); window functions, including RANK(), only arrived in MySQL 8.0.
I set up a sample SQL Fiddle (but using fewer data points and limit 2 as I'm lazy).
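For servers that do have window functions, tie handling could look roughly like this. It is a sketch, not part of the answer above: it runs on SQLite (3.25+) through Python's sqlite3 module, and the function name and sample rows are assumptions.

```python
import sqlite3


def top_entities_with_ties(conn, n):
    """Return ids of entities whose usage count ranks in the top n, ties included."""
    sql = """
        SELECT entity_id
        FROM (
            SELECT entity_id, RANK() OVER (ORDER BY num DESC) AS rnk
            FROM (
                SELECT entity_id, COUNT(*) AS num
                FROM article_entity
                GROUP BY entity_id
            ) AS counts
        ) AS ranked
        WHERE rnk <= ?
        ORDER BY entity_id
    """
    return [row[0] for row in conn.execute(sql, (n,))]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article_entity (article_id INTEGER, entity_id INTEGER)")
conn.executemany(
    "INSERT INTO article_entity VALUES (?, ?)",
    [
        (1, 1), (2, 1), (3, 1),  # entity 1 is used 3 times
        (1, 2), (2, 2),          # entity 2 is used twice
        (2, 3), (3, 3),          # entity 3 is used twice (tie with entity 2)
        (1, 4),                  # entity 4 is used once
    ],
)
print(top_entities_with_ties(conn, 2))  # [1, 2, 3]
```

Entities 2 and 3 share rank 2, so asking for the top 2 returns three entities, where a plain LIMIT 2 would drop one of the tied pair arbitrarily.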
Not knowing the volume of data you are working with, I would first recommend that you have two storage columns on your article table for count of entities and keywords respectively. Then via triggers on adding/deleting from each, update the respective counter columns. This way, you don't have to do a burning query each time needed, especially in a web-based interface. Then, you can just select from the articles table ordered by the E+K counts descending and be done with it, instead of constant sub-querying the underlying tables.
Now, that said, the other suggestions are somewhat similar to what I am posting, but they all appear to apply a limit of 10 records to each set. Let's throw this scenario into the picture. Say articles 1-20 each have 10, 9 or 8 entities and 1-2 keywords. Then articles 21-50 have the reverse: 10, 9 or 8 keywords and 1-2 entities. Now, articles 51-58 have 7 entities AND 7 keywords, for a total of 14 combined points. None of the queries would catch these, as the entities subquery would only return qualifying records from articles 1-20 and the keywords subquery from articles 21-50. Articles 51-58 would be so far down each list that they would not even be considered, even though their total is 14.
To handle this, each sub-query is a full query specifically on the article ID and its count. Simple order by the article_ID as that is basis of the join to the master article table.
Now, the coalesce() will get the count if so available, otherwise 0 and add the two values together. From that, the results are ordered with the highest counts first (thus getting scenario sample articles 51-58 plus a few of the others) when the limit is applied.
SELECT
a.id,
coalesce( JustE.ECount, 0 ) ECount,
coalesce( JustK.KCount, 0 ) KCount,
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) TotalCnt
from
article a
LEFT JOIN ( select article_id, COUNT(*) as ECount
from article_entity
group by article_id
order by article_id ) JustE
on a.id = JustE.article_id
LEFT JOIN ( select article_id, COUNT(*) as KCount
from article_keyword
group by article_id
order by article_id ) JustK
on a.id = JustK.article_id
order by
coalesce( JustE.ECount, 0 ) + coalesce( JustK.KCount, 0 ) DESC
limit 10
I took this in several steps
tl;dr This shows all the articles from the top (4) keywords and entities:
Here's a fiddle
select
distinct article_id
from
(
select
article_id
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
union all
select
article_id
from
article_keyword ak
inner join
(select
keyword_id, count(*)
from
article_keyword
group by
keyword_id
order by
count(*) desc
limit 4) top_keywords on ak.keyword_id = top_keywords.keyword_id) as articles
Explanation:
This starts with an effort to find the top X entities. (4 seemed to work for the number of associations I wanted to make in the fiddle.)
I didn't want to select articles here because it skews the group by, you want to focus solely on the top entities. Fiddle
select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4
Then I selected all the articles from these top entities. Fiddle
select
*
from
article_entity ae
inner join
(select
entity_id, count(*)
from
article_entity
group by
entity_id
order by
count(*) desc
limit 4) top_entities on ae.entity_id = top_entities.entity_id
Obviously the same logic needs to happen for the keywords. The queries are then unioned together (fiddle) and the distinct article ids are pulled from the union.
This will give you all articles that have a relation to the top (x) entities and keywords.
This gets the top 10 keyword articles that are also a top 10 entity. You may not get 10 records back because it is possible that an article only meets one of the criteria (top entity but not top keyword or top keyword but not top entity)
select *
from article a
inner join
(select count(*),ae.article_id
from article_entity ae
group by ae.article_id
order by count(*) Desc limit 10) e
on a.id = e.article_id
inner join
(select count(*),ak.article_id
from article_keyword ak
group by ak.article_id
order by count(*) Desc limit 10) k
on a.id = k.article_id

Joining on "greater than" returning more than one row for left table

I have a query.
SELECT * FROM users LEFT JOIN ranks ON ranks.minPosts <= users.postCount
This returns a row every time it is matched. By using a GROUP BY users.id I get one row per individual id.
However, when they group I only get the first row. I would instead like the row with the highest value of ranks.minPosts
Is there a way to do this, also, would it be faster (less resources) to just use two different queries?
Assuming there is only one column in ranks that you want, you can do this using a correlated subquery:
SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u;
If you need the entire row from ranks, then join it back in:
SELECT ur.*, r.*
FROM (SELECT u.*,
(select r.minPosts
from ranks r
where r.minPosts <= u.PostCount
order by minPosts desc
limit 1
) as minPosts
FROM users u
) ur join
ranks r
on ur.minPosts = r.minPosts;
(The * is for convenience; you should list out the columns you want.)
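Here is a runnable sketch of the "join it back in" variant, using Python's sqlite3 module rather than MySQL. Assumptions to note: the ranks.name column, the sample rows and the helper name are invented, minPosts is taken to be unique in ranks, and a LEFT JOIN is used so that users below the lowest rank still appear.

```python
import sqlite3


def users_with_rank(conn):
    """Return (user id, postCount, rank name) using the highest rank reached."""
    sql = """
        SELECT ur.id, ur.postCount, r.name
        FROM (
            SELECT u.*,
                   (SELECT r2.minPosts
                    FROM ranks AS r2
                    WHERE r2.minPosts <= u.postCount
                    ORDER BY r2.minPosts DESC
                    LIMIT 1) AS minPosts
            FROM users AS u
        ) AS ur
        LEFT JOIN ranks AS r ON ur.minPosts = r.minPosts
        ORDER BY ur.id
    """
    return conn.execute(sql).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, postCount INTEGER)")
conn.execute("CREATE TABLE ranks (name TEXT, minPosts INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "ann", 120), (2, "bob", 50), (3, "cat", 7)],
)
conn.executemany(
    "INSERT INTO ranks VALUES (?, ?)",
    [("Newbie", 10), ("Member", 50), ("Veteran", 100)],
)
for row in users_with_rank(conn):
    print(row)
```

Each user gets exactly one row: ann maps to Veteran, bob to Member, and cat, who has fewer posts than any rank requires, gets None for the rank name.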
Because you're using mysql, this will work:
SELECT * FROM (
SELECT *, users.id user_id
FROM users
LEFT JOIN ranks ON ranks.minPosts <= users.postCount
ORDER BY ranks.minPosts DESC
) x
GROUP BY user_id
In practice MySQL has tended to return the first row encountered for each unique group, so if you first order the data, then use the non-standard grouping behaviour, you'll get the row you want.
Disclaimer:
Although this has often worked in practice, the MySQL documentation says not to rely on it: the server is free to choose any row from each group. Since MySQL 5.7.5 the ONLY_FULL_GROUP_BY SQL mode is enabled by default, and with that mode on, queries like this one are rejected. If you use this convenient approach, keep in mind that it is not recommended by MySQL and that later releases may not continue to behave in this way.
What we'd really like to do would be to order the rows by ranks.minPosts before the group by. Unfortunately MySQL doesn't support that without using a subquery of some form.
If the ranks are already ordered by their ids then you can extract the id by selecting MAX(ranks.id), and if they're not, you can still get the highest ranks.minPosts by selecting MAX(ranks.minPosts). However, it would be nice to be able to get the entire record. I guess you're left with the subquery solution, which is as follows:
SELECT <fields> FROM users LEFT JOIN
(SELECT * FROM ranks ORDER BY minPosts DESC) as r
ON r.minPosts <= users.postCount GROUP BY users.id

MySQL UNION or Create an intermediate Table?

I have two tables (MySQL, InnoDB) which are identical to each other, holding some score information. I want to create a leader-board out of those two.
Scores from Table-1 have more weight (^2) in comparison to Table-2.
One strategy is to write a SQL query using a UNION statement and get the result on the fly; the other is to create another table and insert data from both tables into it on a timed basis (e.g. a cron job).
My question is how expensive this UNION query would be?
SELECT
username,
specialId,
SUM(correct) points,
IFNULL((SUM(foracc) / SUM(completegames)), 0) accuracy
FROM
((SELECT
u.username,
u.specialId,
POW(p.correct, 2) as correct,
p.correct as foracc, #to calculate accuracy
p.completegames
FROM
Table_01 p
LEFT JOIN users u ON u.userid = p.userid
WHERE
p.year = 2012 AND p.type = 1) UNION (SELECT
u.username,
u.specialId,
p.correct,
p.correct as foracc, #to calculate accuracy
p.completegames
FROM
Table_02 p
LEFT JOIN users u ON u.userid = p.userid
WHERE
p.year = 2012 AND p.type = 1)) AS Table_aggregatedscore
GROUP BY 1
ORDER BY points DESC , accuracy DESC
LIMIT 10;
Both Table_01 and Table_02 have more than 20 million rows.
PS. I don't have time/ability to benchmark those two things.
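UNION ALL does not check for unique rows, so it can speed up the query. It is also the safer choice here: plain UNION collapses identical rows into one, so two equal score rows for the same user would be counted only once.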
You can also move some aggregation inside subselects to reduce intermediate amount of data (in case your aggregation reduces row count).
Or you can LEFT JOIN (instead of UNION) both tables with users and move POW inside SUM, (SUM(Table_02.correct) + SUM(POW(Table_01.correct, 2))) AS points.
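The difference between UNION and UNION ALL on score data can be checked with a few rows. This sketch uses Python's sqlite3 module in place of MySQL; the reduced table layout and sample values are assumptions, and correct * correct replaces POW(correct, 2) because SQLite builds do not always include POW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("Table_01", "Table_02"):
    conn.execute(
        f"CREATE TABLE {table} (userid INTEGER, correct INTEGER, completegames INTEGER)"
    )
# Hypothetical data: user 1 has two identical rows in Table_01.
conn.executemany("INSERT INTO Table_01 VALUES (?, ?, ?)", [(1, 3, 5), (1, 3, 5)])
conn.executemany("INSERT INTO Table_02 VALUES (?, ?, ?)", [(1, 9, 5)])


def total_points(conn, union_keyword):
    """Sum the weighted points of user 1 using either UNION or UNION ALL."""
    if union_keyword not in ("UNION", "UNION ALL"):
        raise ValueError("union_keyword must be 'UNION' or 'UNION ALL'")
    sql = f"""
        SELECT SUM(correct)
        FROM (
            SELECT userid, correct * correct AS correct, completegames
            FROM Table_01
            {union_keyword}
            SELECT userid, correct, completegames
            FROM Table_02
        ) AS scores
        WHERE userid = 1
    """
    return conn.execute(sql).fetchone()[0]


print(total_points(conn, "UNION ALL"))  # 27
print(total_points(conn, "UNION"))      # 9
```

All three rows end up as (1, 9, 5) after squaring, so UNION keeps a single copy and reports 9 points, while UNION ALL keeps every row and reports the intended 27.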
Build a second table. Once you've loaded the existing data, add a trigger on each original table that inserts into the new table whenever a record is created; that way you don't have to run batches.