Ordering MySQL 8 results by count existence in a crosswalk table

I have the following MySQL 8 tables:
[submissions]
===
id
submission_type
name
[reject_reasons]
===
id
name
[submission_reject_reasons] -- crosswalk joining the first 2 tables
===
id
submission_id
reject_reason_id
In my application, users can submit submissions, and other users can request changes to those submissions. When they request these rejections, 1+ entries get saved to the submission_reject_reasons table (which stores the ID of the submission for which rejections are requested, as well as the ID of the reason for why the rejection is being made). So a typical entry in the table might look like:
id submission_id reject_reason_id
==============================================
45 384 294
Where submission_id = 384 is the "Fizz Buzz" submission and reject_reason_id = 294 is the "Missing Required Field" reason.
I currently have a query that fetches all the reject_reasons out of the DB:
SELECT * FROM reject_reasons
I now want to modify this query to sort the results based on their usage frequency. Meaning the query might currently return:
294 | Missing Required Field
14 | Malformed Entry
1885 | Makes No Sense
etc. But let's say there are 5 entries in the submission_reject_reasons table where 294 (Missing Required Field) is the reject_reason_id, 15 entries where 1885 (Makes No Sense) is present, and 120 entries where 14 (Malformed Entry) is present. I need a query that returns all reject_reasons sorted by their count in the submission_reject_reasons (SRR) table, descending, so that the most frequently used appear earlier in the sort. Hence the result set would be:
14 | Malformed Entry --> because there are 120 instances of this in the SRR table
1885 | Makes No Sense --> because there are 15 instances in the SRR
294 | Missing Required Field --> because there are only 5 instances in the SRR
Furthermore, I need a ranking from most-used to least-used. If a reason doesn't exist in the SRR table it should have a default "count" of zero (0) but should still come back in the query. If 2+ reason counts are tied, then I don't care how they are sorted. Any ideas here? I need the final result set to only contain the rr.id and rr.name field/values.
My best attempt is not getting me anywhere:
SELECT rr.id, rr.name
FROM reject_reasons AS rr
LEFT JOIN submission_reject_reasons AS srr on rr.id = srr.reject_reason_id
GROUP BY rr.id
ORDER BY COUNT(*) DESC
Can anyone help me over the finish line here? Can anyone spot where I'm going awry? Thanks in advance!

You should be grouping by the reject reason ID. COUNT(*) is what you want to count in each group.
SELECT rr.id, rr.name
FROM reject_reasons AS rr
JOIN submission_reject_reasons AS srr on rr.id = srr.reject_reason_id
GROUP BY rr.id
ORDER BY COUNT(*) DESC
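There's no need for any EXISTS check, since the INNER JOIN won't return any reject reasons that don't exist in submission_reject_reasons. If unused reasons must still come back with a count of zero, as the question asks, keep the LEFT JOIN and order by COUNT(srr.id) instead: COUNT(*) counts the NULL-extended row of an unmatched reason as 1, while COUNT(srr.id) ignores it.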
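The zero-count requirement from the question can be checked with a small sketch. It assumes SQLite (through Python's sqlite3 module) as a stand-in for MySQL 8, and it adds a made-up "Never Used" reason (id 7) that is not in the original post; the SELECT uses only syntax that both engines accept.

```python
import sqlite3

# SQLite stands in for MySQL 8 here; only the SELECT at the bottom matters.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE reject_reasons (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE submission_reject_reasons (
        id INTEGER PRIMARY KEY,
        submission_id INTEGER,
        reject_reason_id INTEGER
    );
    INSERT INTO reject_reasons VALUES
        (294, 'Missing Required Field'),
        (14, 'Malformed Entry'),
        (1885, 'Makes No Sense'),
        (7, 'Never Used');  -- hypothetical reason with no crosswalk rows
""")

# Usage counts taken from the question: 120, 15 and 5 crosswalk rows.
usage = {14: 120, 1885: 15, 294: 5}
for reason_id, n in usage.items():
    conn.executemany(
        "INSERT INTO submission_reject_reasons (submission_id, reject_reason_id) "
        "VALUES (?, ?)",
        [(i, reason_id) for i in range(n)],
    )

rows = conn.execute("""
    SELECT rr.id, rr.name
    FROM reject_reasons AS rr
    LEFT JOIN submission_reject_reasons AS srr ON rr.id = srr.reject_reason_id
    GROUP BY rr.id, rr.name
    ORDER BY COUNT(srr.id) DESC   -- counts matched rows only, so unused reasons get 0
""").fetchall()

for row in rows:
    print(row)
```

Grouping by both rr.id and rr.name is not strictly needed when id is the primary key, but it keeps the query valid on engines that do not infer that dependency.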

Related

Creating a SQL view from tables without UIDs

I have two tables:
match_rating, which has data on a team's performance in a match. There are naturally two tuples for every matchId (since there are two teams to each match). The PK is matchId, teamId.
event, which has information on events during matches. The PK is an autoincremented UID, and it contains the Foreign Keys match_id and subject_team_id as well.
Now I want to create a new view which counts how many times certain events happen in a match, for each team, with fields like this:
But for the life of me I cannot get around the fact that there are 1) two tuples for each match in the match_rating table, and 2) querying the event table on match_id returns events for both teams.
The closest I got was something like this:
SELECT SUM(
CASE
WHEN evt.event_type_id = 101 THEN 1
WHEN evt.event_type_id = 111 THEN 1
WHEN evt.event_type_id = 121 THEN 1
[etc]
END
) AS 'mid_chances',
SUM(
CASE
WHEN evt.event_type_id = 103 THEN 1
WHEN evt.event_type_id = 113 THEN 1
WHEN evt.event_type_id = 123 THEN 1
[etc]
END
) AS 'right_chances',
mr.tactic,
mr.tactic_skill,
mr.bp,
evt.match_id,
evt.subject_team_id
FROM event evt
JOIN match_rating mr
ON evt.match_id = mr.match_id
WHERE evt.event_type_id BETWEEN 100 AND 104 OR
evt.event_type_id BETWEEN 110 AND 114 OR
evt.event_type_id BETWEEN 120 AND 124 OR
[etc]
GROUP BY evt.match_id
ORDER BY `right_chances` DESC
But still, this counts the events twice, reporting 2 events where there was only 1, 6 for 3 events and so on. I have tried grouping on team_id as well (GROUP BY evt.match_id AND team_id) , but that returns only 2 rows with all events counted.
I hope I have made my problem clear, and it should be obvious that I really need a good tip or two.
Edit for clarity (sorry):
Sample data for match_rating table:
Sample data for the event table:
What I would like to see as the result is this:
That is, two tuples for each match, one for each team, where the types of events that team had is summed up. Thanks so much for looking into this!

Update after comments/feedback
OK.. just to confirm, what you want is
Each row of the output represents a team within a match
Other values (other than match_id and team_id) are sums or other aggregations across multiple rows?
If that is the case, then I believe you should be doing a GROUP BY the match_id and team_id. This should cause the correct number of rows to be generated (one for each match_id/team_id combination). You say in your question that you have tried it already - I suggest reviewing it (potentially after also considering the below).
With your data, it appears that the 'event' table also has a field which indicates the team_id. To ensure you only get the relevant team's events, I suggest your join between match_rating and event be on both fields e.g.,
FROM event evt
JOIN match_rating mr
ON evt.match_id = mr.match_id
AND evt.subject_team_id = mr.team_id
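A small sketch of the difference, assuming SQLite via Python's sqlite3 as a stand-in for MySQL and a few invented rows (match 1, teams 10 and 20). The first query joins on match_id only and reproduces the doubled counts; the second joins on both keys and groups by both columns, separated by a comma. Note that GROUP BY evt.match_id AND team_id groups by a single boolean expression rather than by two columns, which would explain getting only two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE match_rating (match_id INTEGER, team_id INTEGER, tactic TEXT,
                               PRIMARY KEY (match_id, team_id));
    CREATE TABLE event (id INTEGER PRIMARY KEY, match_id INTEGER,
                        subject_team_id INTEGER, event_type_id INTEGER);
    -- Invented sample rows: one match, two teams, three events.
    INSERT INTO match_rating VALUES (1, 10, 'press'), (1, 20, 'counter');
    INSERT INTO event (match_id, subject_team_id, event_type_id) VALUES
        (1, 10, 101), (1, 10, 111), (1, 20, 103);
""")

sums = """
    SUM(CASE WHEN evt.event_type_id IN (101, 111, 121) THEN 1 ELSE 0 END) AS mid_chances,
    SUM(CASE WHEN evt.event_type_id IN (103, 113, 123) THEN 1 ELSE 0 END) AS right_chances
"""

# Join on match_id only: every event pairs with both rating rows, so counts double.
doubled = conn.execute(f"""
    SELECT evt.match_id, {sums}
    FROM event evt
    JOIN match_rating mr ON evt.match_id = mr.match_id
    GROUP BY evt.match_id
""").fetchall()

# Join on match and team, then group by both columns: one row per team per match.
per_team = conn.execute(f"""
    SELECT evt.match_id, evt.subject_team_id, {sums}
    FROM event evt
    JOIN match_rating mr
      ON evt.match_id = mr.match_id
     AND evt.subject_team_id = mr.team_id
    GROUP BY evt.match_id, evt.subject_team_id
    ORDER BY evt.match_id, evt.subject_team_id
""").fetchall()

print(doubled)
print(per_team)
```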
Previous answer - does not answer the question (as per later comments)
Just confirming - the issue is that when you run it, for each match it returns 2 rows - one for each team - but you want to do processing on both teams as one row only?
As such, you could do a few things (e.g., self-join the match rating table to itself, with Team1 ratings and Team2 ratings).
Alternatively, you could modify your FROM to have joins to match_rating twice - where the first has the lower ID for the two teams e.g.,
FROM event evt
JOIN match_rating mr_team1
ON evt.match_id = mr_team1.match_id
JOIN match_rating mr_team2
ON evt.match_id = mr_team2.match_id
AND mr_team1.team_id < mr_team2.team_id
Of course, your processing then needs to be modified to take this into account e.g., one row represents a match, and you have a bunch of data for team1 and similar data for team2. You'd then, I assume, compare the data for team1 columns and team2 columns to get some sort of rating etc (e.g., chance for Team1 to win, etc).

ORDER BY and GROUP BY those results in a single query

I am trying to query a dataset from a single table, which contains quiz answers/entries from multiple users. I want to pull out the highest scoring entry from each individual user.
My data looks like the following:
ID TP_ID quiz_id name num_questions correct incorrect percent created_at
1 10154312970149546 1 Joe 3 2 1 67 2015-09-20 22:47:10
2 10154312970149546 1 Joe 3 3 0 100 2015-09-21 20:15:20
3 125564674465289 1 Test User 3 1 2 33 2015-09-23 08:07:18
4 10153627558393996 1 Bob 3 3 0 100 2015-09-23 11:27:02
My query looks like the following:
SELECT * FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`
ORDER BY `correct` DESC
In my mind, what that should do is get the two users from the IN clause, order them by the number of correct answers and then group them together, so I should be left with the 2 highest scores from those two users.
In reality it's giving me two results, but the one from Joe gives me the lower of the two values (2), with Bob first with a score of 3. Swapping to ASC ordering keeps the scores the same but places Joe first.
So, how could I achieve what I need?

You're after the groupwise maximum, which can be obtained by joining the grouped results back to the table:
SELECT * FROM entries NATURAL JOIN (
SELECT TP_ID, MAX(correct) correct
FROM entries
WHERE TP_ID IN ('10153627558393996', '10154312970149546')
GROUP BY TP_ID
) t
Of course, if a user has multiple records with the maximal score, it will return all of them; should you only want some subset, you'll need to express the logic for determining which.
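The groupwise-maximum join can be tried on the question's four rows. This sketch assumes SQLite via Python's sqlite3 in place of MySQL, keeps only a subset of the columns, and spells the NATURAL JOIN out as an explicit ON condition so the matching columns are visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (ID INTEGER PRIMARY KEY, TP_ID TEXT, name TEXT,
                          correct INTEGER, percent INTEGER);
    INSERT INTO entries VALUES
        (1, '10154312970149546', 'Joe', 2, 67),
        (2, '10154312970149546', 'Joe', 3, 100),
        (3, '125564674465289', 'Test User', 1, 33),
        (4, '10153627558393996', 'Bob', 3, 100);
""")

best = conn.execute("""
    SELECT e.ID, e.name, e.correct
    FROM entries e
    JOIN (
        -- One row per user: that user's highest number of correct answers.
        SELECT TP_ID, MAX(correct) AS correct
        FROM entries
        WHERE TP_ID IN ('10153627558393996', '10154312970149546')
        GROUP BY TP_ID
    ) t ON t.TP_ID = e.TP_ID AND t.correct = e.correct
    ORDER BY e.name
""").fetchall()

print(best)
```

Joe's lower-scoring entry (ID 1) is filtered out because its correct value does not equal his per-user maximum.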

MySQL is quite lax when it comes to GROUP BY clauses, but as a rule of thumb you should follow the rule that other DBMSs enforce:
In a GROUP BY query, each selected column should either be part of the GROUP BY clause or be wrapped in an aggregate function.
For your query I would suggest:
SELECT `TP_ID`,`name`,max(`correct`) FROM `entries`
WHERE `TP_ID` IN('10153627558393996', '10154312970149546')
GROUP BY `TP_ID`,`name`
Since your table seems quite denormalized, the name part of the GROUP BY could be omitted, but it might be necessary in other cases.
ORDER BY is only used to specify in which order the results are returned but does nothing about what results are returned - so you need to apply the max()-function to get the highest number of right answers.

Select max date by grouping?

PLEASE will someone help? I've put HOURS into this silly, stupid problem. This Stack Overflow post is EXACTLY my question, and I have tried BOTH suggested solutions to no avail.
Here are MY specifics. I have extracted 4 records from my actual database, and excluded no fields:
master_id date_sent type mailing response
00001 2015-02-28 00:00:00 PHONE NULL NULL
00001 2015-03-13 14:45:20 EMAIL ThankYou.html NULL
00001 2015-03-13 14:34:43 EMAIL ThankYou.html NULL
00001 2015-01-11 00:00:00 EMAIL KS_PREVIEW TRUE
00001 2015-03-23 21:42:03 EMAIL MailChimp Update #2 NULL
(sorry about the alignment of the columns.)
I want to get the most recent mailing and date_sent for each master_id. (My extract is of only one master_id to make this post simple.)
So I run this query:
SELECT master_id,date_sent,mailing
FROM contact_copy
WHERE type="EMAIL"
and get the expected result:
master_id date_sent mailing
1 3/13/2015 14:45:20 ThankYou.html
1 3/13/2015 14:34:43 ThankYou.html
1 1/11/2015 0:00:00 KS_PREVIEW
1 3/23/2015 21:42:03 MailChimp Update #2
BUT, when I add this simple aggregation to get the most recent date:
SELECT master_id,max(date_sent),mailing
FROM contact_copy
WHERE type="EMAIL"
group BY master_id
;
I get an UNEXPECTED result:
master_id max(date_sent) mailing
00001 2015-03-23 21:42:03 ThankYou.html
So my question: why is it returning the WRONG MAILING?
It's making me nuts! Thanks.
By the way, I'm not a developer, so sorry if I'm breaking some etiquette rule of asking. :)

That's because when you use GROUP BY, every column in the SELECT list has to be either one of the GROUP BY columns or an aggregate, and mailing is neither.
You should use a subquery or a join to make it work:
SELECT cc.master_id, cc.date_sent, cc.mailing
FROM contact_copy cc
JOIN
( SELECT master_id, max(date_sent) AS date_sent
FROM contact_copy
WHERE type="EMAIL"
group BY master_id
) result
ON cc.master_id = result.master_id AND cc.date_sent = result.date_sent
WHERE cc.type="EMAIL"

You're getting an "unexpected" result because of a MySQL specific extension to the GROUP BY functionality. The result you're getting is actually expected, according to the MySQL Reference Manual.
Ref: https://dev.mysql.com/doc/refman/5.5/en/group-by-handling.html
Other database engines would reject your query as invalid, with an error along the lines of "non-aggregate expressions included in the SELECT list not included in the GROUP BY".
We can get MySQL to behave like other databases (and return an error for that query) if we include ONLY_FULL_GROUP_BY in the SQL mode.
Ref: https://dev.mysql.com/doc/refman/5.5/en/sql-mode.html#sqlmode_only_full_group_by
To get the result you are looking for...
If the (master_id,type,date_sent) tuple is UNIQUE in contact_copy (that is, if for given values of master_id and type, there will be no "duplicate" values of date_sent), we could use a JOIN operation to retrieve the specified result.
First, we write a query to get the "maximum" date_sent for a given master_id and type. For example:
SELECT mc.master_id
, mc.type
, MAX(mc.date_sent) AS max_date_sent
FROM contact_copy mc
WHERE mc.master_id = '0001'
AND mc.type = 'EMAIL'
To retrieve the entire row associated with that "maximum" date_sent, we can use that query as an inline view. That is, wrap the query text in parens, assign an alias, and then reference that as if it were a table, for example:
SELECT c.master_id
, c.date_sent
, c.mailing
FROM ( SELECT mc.master_id
, mc.type
, MAX(mc.date_sent) AS max_date_sent
FROM contact_copy mc
WHERE mc.master_id = '0001'
AND mc.type = 'EMAIL'
) m
JOIN contact_copy c
ON c.master_id = m.master_id
AND c.type = m.type
AND c.date_sent = m.max_date_sent
Note that if there are multiple rows that have the same values of master_id,type and date_sent, there is potential to return more than one row. You could add a LIMIT 1 clause to guarantee that you return only one row; which of those rows is returned is indeterminate, without an ORDER BY clause before the LIMIT clause.
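The inline-view approach extends to every master_id at once by grouping inside the view. The sketch below assumes SQLite via Python's sqlite3 in place of MySQL, stores the dates as ISO strings, and adds an invented second contact ('00002') that is not in the original post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact_copy (master_id TEXT, date_sent TEXT, type TEXT,
                               mailing TEXT, response TEXT);
    INSERT INTO contact_copy VALUES
        ('00001', '2015-02-28 00:00:00', 'PHONE', NULL, NULL),
        ('00001', '2015-03-13 14:45:20', 'EMAIL', 'ThankYou.html', NULL),
        ('00001', '2015-03-13 14:34:43', 'EMAIL', 'ThankYou.html', NULL),
        ('00001', '2015-01-11 00:00:00', 'EMAIL', 'KS_PREVIEW', 'TRUE'),
        ('00001', '2015-03-23 21:42:03', 'EMAIL', 'MailChimp Update #2', NULL),
        -- Invented rows for a second contact, to show the per-group behaviour.
        ('00002', '2015-01-05 09:00:00', 'EMAIL', 'KS_PREVIEW', NULL),
        ('00002', '2015-04-01 10:00:00', 'PHONE', NULL, NULL);
""")

latest = conn.execute("""
    SELECT c.master_id, c.date_sent, c.mailing
    FROM ( SELECT mc.master_id, mc.type, MAX(mc.date_sent) AS max_date_sent
           FROM contact_copy mc
           WHERE mc.type = 'EMAIL'
           GROUP BY mc.master_id, mc.type
         ) m
    JOIN contact_copy c
      ON c.master_id = m.master_id
     AND c.type = m.type
     AND c.date_sent = m.max_date_sent
    ORDER BY c.master_id
""").fetchall()

for row in latest:
    print(row)
```

Because mailing is read from the joined row rather than from the grouped query, it always belongs to the row that holds the maximum date_sent.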

Mysql join queries not returning all rows as it should

I have an event software. The first table stores individuals who signed up for an event: EventIndividuals. I have a second table that stores the t-shirt they selected as a gift when they attend the event: EventIndividualShirtXref.
When I run the following query to see how many individuals are attending the 1st event it returns 31 rows correctly:
SELECT Id
FROM EventIndividuals
WHERE EventId = 1
Then when I run my second query to pair them up with a shirt it only returns 22 rows:
SELECT *
FROM EventIndividualShirtXref
WHERE EventIndividualId IN(SELECT Id FROM EventIndividuals WHERE EventId = 1)
I also tried running the next query using a join and it still only returns 22 rows:
SELECT esx.*
FROM EventIndividualShirtXref esx
INNER JOIN EventIndividuals ei
ON esx.EventIndividualId = ei.Id
WHERE ei.EventId = 1
I checked the indexing and the columns are indexed correctly.
Is this enough info provided to figure out why the 31 rows are cut to 22?
This has never happened to me before and it makes no sense.
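One possible explanation, offered only as a guess, is that some of the 31 individuals have no row in EventIndividualShirtXref; both IN (...) and INNER JOIN return only matching rows. The sketch below assumes SQLite via Python's sqlite3 in place of MySQL, an invented ShirtId column, and a handful of made-up rows. It lists the individuals an inner join would drop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EventIndividuals (Id INTEGER PRIMARY KEY, EventId INTEGER);
    CREATE TABLE EventIndividualShirtXref (
        Id INTEGER PRIMARY KEY,
        EventIndividualId INTEGER,
        ShirtId INTEGER  -- hypothetical column, not named in the question
    );
    -- Made-up data: four people in event 1, only two of them picked a shirt.
    INSERT INTO EventIndividuals VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 2);
    INSERT INTO EventIndividualShirtXref (EventIndividualId, ShirtId)
        VALUES (1, 10), (3, 11);
""")

# Anti-join: keep individuals for whom the LEFT JOIN found no crosswalk row.
missing = conn.execute("""
    SELECT ei.Id
    FROM EventIndividuals ei
    LEFT JOIN EventIndividualShirtXref esx ON esx.EventIndividualId = ei.Id
    WHERE ei.EventId = 1
      AND esx.EventIndividualId IS NULL
    ORDER BY ei.Id
""").fetchall()

print(missing)
```

If a query like this returns nine rows on the real data, the missing rows are simply attendees without a shirt record, and a LEFT JOIN from EventIndividuals would list all 31.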

Obtain running frequency distribution from previous N rows of MySQL database

I have a MySQL database where one column contains status codes. The column is of type int and the values will only ever be 100,200,300,400. It looks like below; other columns removed for clarity.
id | status
----------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100
12 400
13 400
14 400
15 300
16 300
The id field is auto-generated and will always be sequential. I want to have a third column displaying a comma-separated string of the frequency distribution of the status codes of the previous 10 rows. It should look like this.
id | status | freq
-----------------------------------
1 300
2 100
3 100
4 200
5 200
6 300
7 100
8 400
9 300
10 300
11 100 300,100,200,400 -- from rows 1-10
12 400 100,300,200,400 -- from rows 2-11
13 400 100,300,200,400 -- from rows 3-12
14 400 300,400,100,200 -- from rows 4-13
15 300 400,300,100,200 -- from rows 5-14
16 300 300,400,100 -- from rows 6-15
I want the most frequent code listed first. And where two status codes have the same frequency it doesn't matter to me which is listed first but I did list the smaller code before the larger in the example. Lastly, where a code doesn't appear at all in the previous ten rows, it shouldn't be listed in the freq column either.
And to be very clear the row number that the frequency string appears on does NOT take into account the status code of that row; it's only the previous rows.
So what have I done? I'm pretty green with SQL. I'm a programmer and I find this SQL language a tad odd to get used to. I managed the following self-join select statement.
select *, avg(b.status) freq
from sample a
join sample b
on (b.id < a.id) and (b.id > a.id - 11)
where a.id > 10
group by a.id;
Using the aggregate function avg, I can at least demonstrate the concept. The derived table b provides the correct rows to the avg function but I just can't figure out the multi-step process of counting and grouping rows from b to get a frequency distribution and then collapse the frequency rows into a single string value.
Also I've tried using standard stored functions and procedures in place of the built-in aggregate functions, but it seems the b derived table is out of scope or something. I can't seem to access it. And from what I understand writing a custom aggregate function is not possible for me as it seems to require developing in C, something I'm not trained for.
Here's sql to load up the sample.
create table sample (
id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY(id),
status int
);
insert into sample(status) values(300),(100),(100),(200),(200),(300)
,(100),(400),(300),(300),(100),(400),(400),(400),(300),(300),(300)
,(100),(400),(100),(100),(200),(500),(300),(100),(400),(200),(100)
,(500),(300);
The sample has 30 rows of data to work with. I know it's a long question, but I just wanted to be as detailed as I could be. I've worked on this for a few days now and would really like to get it done.
Thanks for your help.

The only way I know of to do what you're asking is to use a BEFORE INSERT trigger. It has to be BEFORE INSERT because you want to update a value in the row being inserted, which can only be done in a BEFORE trigger. Unfortunately, that also means it won't have been assigned an ID yet, so hopefully it's safe to assume that at the time a new record is inserted, the last 10 records in the table are the ones you're interested in. Your trigger will need to get the values of the last 10 ID's and use the GROUP_CONCAT function to join them into a single string, ordered by the COUNT. I've been using SQL Server mostly and I don't have access to a MySQL server at the moment to test this, but hopefully my syntax will be close enough to at least get you moving in the right direction:
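-- Assumes sample has a freq VARCHAR(50) column; in the mysql client, change DELIMITER before running.
CREATE TRIGGER sample_trigger BEFORE INSERT ON sample
FOR EACH ROW
BEGIN
    DECLARE _freq VARCHAR(50);
    -- Count each status among the 10 most recent rows, then join the codes, most frequent first.
    SELECT GROUP_CONCAT(tbl.status ORDER BY tbl.Occurrences DESC) INTO _freq
    FROM (SELECT last10.status, COUNT(*) AS Occurrences
          FROM (SELECT status FROM sample ORDER BY id DESC LIMIT 10) AS last10
          GROUP BY last10.status) AS tbl;
    SET NEW.freq = _freq;
END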

SELECT id, GROUP_CONCAT(status ORDER BY freq desc) FROM
(SELECT a.id as id, b.status, COUNT(*) as freq
FROM
sample a
JOIN
sample b ON (b.id < a.id) AND (b.id > a.id - 11)
WHERE
a.id > 10
GROUP BY a.id, b.status) AS sub
GROUP BY id;
SQL Fiddle
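To check the self-join against the expected output in the question, here is a sketch that runs the inner counting query on the first 16 sample rows. It assumes SQLite via Python's sqlite3 in place of MySQL; because older SQLite versions do not accept ORDER BY inside group_concat, the final concatenation is done in Python, with ties broken by the smaller status code as in the question's example.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (id INTEGER PRIMARY KEY AUTOINCREMENT, status INTEGER)")

# First 16 values of the question's INSERT statement; ids become 1..16.
statuses = [300, 100, 100, 200, 200, 300, 100, 400, 300, 300,
            100, 400, 400, 400, 300, 300]
conn.executemany("INSERT INTO sample (status) VALUES (?)", [(s,) for s in statuses])

# Same self-join as the answer: b ranges over the 10 rows preceding a.
counted = conn.execute("""
    SELECT a.id, b.status, COUNT(*) AS freq
    FROM sample a
    JOIN sample b ON b.id < a.id AND b.id > a.id - 11
    WHERE a.id > 10
    GROUP BY a.id, b.status
    ORDER BY a.id, freq DESC, b.status
""").fetchall()

# Collapse the already-sorted (id, status, freq) rows into one string per id.
freq_by_id = {
    row_id: ",".join(str(status) for _, status, _ in group)
    for row_id, group in groupby(counted, key=lambda r: r[0])
}

for row_id, freq in freq_by_id.items():
    print(row_id, freq)
```

In MySQL the Python step is unnecessary, since GROUP_CONCAT(status ORDER BY freq DESC) in the outer query does the same collapsing.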
