Grouping items with even distribution in SQL (sql-server-2008)

Consider a student table with 104 rows. I need to create groups with a minimum of 10 students in each group. With 104 students, iterating over the students and filling the groups in order gives 10 groups of 10 students and 1 group of 4. There's a rule that the group holding the remaining students cannot have fewer than 5 students in it (here the last group consists of only 4). I'm considering two possible approaches:
Roll up the last group when it has fewer than 5 students and assign each of its students to any of the other groups, or
Spread the last group evenly across the other groups.
How can I achieve either of these? Many thanks.
Eric

You can use ntile.
Distributes the rows in an ordered partition into a specified number
of groups. The groups are numbered, starting at one. For each row,
NTILE returns the number of the group to which the row belongs.
Some sample code:
declare @NumberOfStudents int
declare @StudentsPerGroup int
set @StudentsPerGroup = 10
set @NumberOfStudents = 104
select StudentID,
       ntile(@NumberOfStudents / @StudentsPerGroup) over(order by StudentID) as GroupID
from Students
Try it out on SE-Data.
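To see which group sizes this produces for the numbers in the question, here is a minimal Python sketch that mimics NTILE's distribution rule (when the rows do not divide evenly, the earlier groups each receive one extra row). The helper name ntile_sizes is made up for illustration and is not part of SQL Server.

```python
def ntile_sizes(total_rows, groups):
    """Sizes of the buckets NTILE(groups) would produce over total_rows rows.

    Mimics the rule that, when rows do not divide evenly, the first
    (total_rows % groups) buckets each receive one extra row.
    """
    base, extra = divmod(total_rows, groups)
    return [base + 1 if i < extra else base for i in range(groups)]


number_of_students = 104
students_per_group = 10
# Same integer division as in the T-SQL sample: 104 / 10 = 10 groups.
sizes = ntile_sizes(number_of_students, number_of_students // students_per_group)
print(sizes)
```

With 104 students this gives four groups of 11 and six groups of 10, so no group falls below the minimum of 10.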

Here is a variant of approach 2. The first part declares the counters. As I don't have any data on students, I generate @maxStudents rows with a single column, id, using a recursive CTE.
The first CTE (students) generates the list of @maxStudents students. The second (s) selects the students and assigns each a row number (obviously not necessary here, but essential once you plug in your own query that retrieves students). It also returns the total number of students.
The final select places students into groups. Students in the last group are relocated to other groups if that last group has fewer than @minGroupSize members. Approach 1 can be achieved by replacing the THEN part of the CASE expression with a constant, for example 0, which (after the + 1) places them all in group one.
declare @group_size int
set @group_size = 10
declare @maxStudents int
set @maxStudents = 104
declare @minGroupSize int
set @minGroupSize = 5
;with students as (
    select 1 id
    union all
    select 2 * id + b
    from students cross join (select 0 b union all select 1) b
    where 2 * id + b <= @maxStudents
),
s as (
    select students.id, row_number() over(order by students.id) - 1 rowNumber, count (*) over () TotalStudents
    from students
)
select s.id StudentID,
       case when TotalStudents % @group_size < @minGroupSize
                 and rowNumber >= (TotalStudents / @group_size * @group_size)
            then rowNumber - (TotalStudents / @group_size * @group_size)
            else rowNumber / @group_size
       end + 1 Group_number
from s
order by 2, 1
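The arithmetic in the CASE expression can be checked outside the database. The following Python sketch mirrors it row by row, assuming zero-based row numbers as in the query; the function name group_number is hypothetical.

```python
from collections import Counter


def group_number(row_number, total_students, group_size=10, min_group_size=5):
    """Mirror of the CASE expression; row_number is zero-based."""
    # Number of rows covered by completely filled groups.
    full = total_students // group_size * group_size
    if total_students % group_size < min_group_size and row_number >= full:
        # Leftover student: hand out one per group, starting with group 1.
        return row_number - full + 1
    return row_number // group_size + 1


total = 104
sizes = Counter(group_number(r, total) for r in range(total))
print(sorted(sizes.items()))
```

For 104 students the four leftover students land in groups 1 to 4, giving four groups of 11 and six of 10. With 107 students the remainder of 7 is not below the minimum, so it stays as an eleventh group.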

Related

Query for getting top 5 candidate in every group in single table

I have a table that stores student marks in each subject, and I need a query that returns the top 5 students in every subject, i.e. those who scored the highest marks.
Here is a sample table:
My expected output looks something like this:
The top five students in PCM, ART and PCB on the basis of their marks. If two or more students score the same marks, those records also need to be in the list, all with a single query.

Original Answer
Technically, what you want to accomplish is not possible using a single SQL query. Had you only wanted one student per subject you could have achieved that using GROUP BY, but in your case it won't work.
The only way I can think of to get 5 students for each subject is to write x queries, one per subject, and use UNION to glue them together. Such a query returns at most 5x rows.
Since you want to get the top 5 students based on the mark, you will have to use an ORDER BY clause, which, in combination with the UNION clauses will cause an error. To avoid that, you will have to use subqueries, so that UNION and ORDER BY clauses are not on the same level.
Query:
-- Select the 5 students with the highest mark in the `PCM` subject.
(
SELECT *
FROM student
WHERE subject = 'PCM'
ORDER BY studentMarks DESC
LIMIT 5
)
UNION
(
SELECT *
FROM student
WHERE subject = 'PCB'
ORDER BY studentMarks DESC
LIMIT 5
)
UNION
(
SELECT *
FROM student
WHERE subject = 'ART'
ORDER BY studentMarks DESC
LIMIT 5
);
Check out this SQLFiddle to evaluate the result yourself.
Updated Answer
This update allows more than 5 students to be returned when several students share the same grade in a particular subject.
Instead of using LIMIT 5 to get the top 5 rows, we use LIMIT 4,1 to get the fifth-highest grade and then select all students whose grade in the given subject is greater than or equal to it. However, if a subject has fewer than 5 students, the LIMIT 4,1 subquery returns NULL. In that case we essentially want every student, so we fall back to the minimum grade.
To achieve this, you need to repeat the following piece of code x times, once per subject, and join the copies together using UNION. This is only practical for a small handful of subjects; beyond that the query becomes unmaintainable.
Code:
-- Select the students with the top 5 highest marks in the `x` subject.
SELECT *
FROM student
WHERE studentMarks >= (
-- If there are less than 5 students in the subject return them all.
IFNULL (
(
-- Get the fifth highest grade.
SELECT studentMarks
FROM student
WHERE subject = 'x'
ORDER BY studentMarks DESC
LIMIT 4,1
), (
-- Get the lowest grade.
SELECT MIN(studentMarks)
FROM student
WHERE subject = 'x'
)
)
) AND subject = 'x';
Check out this SQLFiddle to evaluate the result yourself.
Alternative:
After some research I found an alternative, simpler query that will yield the same result as the one presented above based on the data you have provided without the need of "hardcoding" every subject in its own query.
In the following solution, we define a couple of variables that help us control the data:
one to cache the subject of the previous row and
one to save an incremental value that differentiates the rows having the same subject.
Query:
-- Select the students having the top 5 marks in each subject.
SELECT studentID, studentName, studentMarks, subject FROM
(
    -- Use an incremented value to differentiate rows with the same subject.
    SELECT *, (@n := if(@s = subject, @n + 1, 1)) as n, @s := subject
    FROM student
    CROSS JOIN (SELECT @n := 0, @s := NULL) AS b
) AS a
WHERE n <= 5
ORDER BY subject, studentMarks DESC;
Check out this SQLFiddle to evaluate the result yourself.
Ideas were taken from the following threads:
Get top n records for each group of grouped results
How to SELECT the newest four items per category?
Select X items from every type
Getting the latest n records for each group

The query below produces almost what I wanted; it may help others in the future.
SELECT a.studentId, a.studentName, a.StudentMarks,a.subject FROM testquery AS a WHERE
(SELECT COUNT(*) FROM testquery AS b
WHERE b.subject = a.subject AND b.StudentMarks >= a.StudentMarks) <= 2
ORDER BY a.subject ASC, a.StudentMarks DESC

How to GROUP BY 2 different columns together

I have 2 columns holding the IDs of the users participating in a transaction, source_id and destination_id. I'm building a function to sum all transactions grouped by any user participating in them, either as source or as destination.
The problem is, when I do:
select count (*) from transactions group by source_id, destination_id
it groups first by source and then by destination, but I want to group them together. Is that possible using only SQL?
Sample Data
source_user_id destination_user_id
1 4
3 4
4 1
3 2
Desired result:
Id Count
4 - 3 (4 appears 3 times in any of the columns)
3 - 2 (3 appears 2 times in any of the columns)
1 - 2 (1 appears 2 times in any of the columns)
2 - 1 (2 appears 1 time in any of the columns)
As you can see on the example result, I want to know the number of times an id will appear in any of the 2 fields.

Use union all to get the id's into one column and get the counts.
select id,count(*)
from (select source_id as id from tbl
union all
select destination_id from tbl
) t
group by id
order by count(*) desc,id
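As a quick check, the same query can be run on the sample data from the question. The sketch below uses Python's sqlite3 module with an in-memory database (an assumption made for convenience; the question itself is about MySQL), with the table and column names taken from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (source_id INTEGER, destination_id INTEGER)")
# Sample data from the question.
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(1, 4), (3, 4), (4, 1), (3, 2)])

counts = conn.execute("""
    SELECT id, COUNT(*) AS cnt
    FROM (SELECT source_id AS id FROM tbl
          UNION ALL
          SELECT destination_id FROM tbl) AS t
    GROUP BY id
    ORDER BY cnt DESC, id
""").fetchall()
conn.close()
print(counts)
```

UNION ALL matters here: a plain UNION would remove duplicate ids before counting, so every id would be counted once.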

edited to add: Thank you for clarifying your question. The following isn't what you need.
Sounds like you want to use the concatenate function.
https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_concat
GROUP BY CONCAT(source_id,"_",destination_id)
The underscore is intended to distinguish "source_id=1, destination_id=11" from "source_id=11, destination_id=1". (We want them to be 1_11 and 11_1 respectively.) If you expect these IDs to contain underscores, you'd have to handle this differently, but I assume they're integers.

It may look like this.
-- Sum (not count) the per-column totals; union all keeps identical (id, total) rows that union would merge.
select id, sum(total) as total
from (select source_id as id, count(destination_id) as total from transactions group by source_id
      union all
      select destination_id as id, count(source_id) as total from transactions group by destination_id) q
group by id

mysql query to generate a commision report based on referred members

A person gets a 10% commission for purchases made by the friends he referred.
There are two tables :
Reference table
Transaction table
Reference Table
Person_id Referrer_id
3 1
4 1
5 1
6 2
Transaction Table
Person_id Amount Action Date
3 100 Purchase 10-20-2011
4 200 Purchase 10-21-2011
6 400 Purchase 12-15-2011
3 200 Purchase 12-30-2011
1 50 Commision 01-01-2012
1 10 Cm_Bonus 01-01-2012
2 20 Commision 01-01-2012
How to get the following Resultset for Referrer_Person_id=1
Month Ref_Pur Earn_Comm Todate_Earn_Comm BonusRecvd Paid Due
10-2011 300 30 30 0 0 30
11-2011 0 0 30 0 0 30
12-2011 200 20 50 0 0 50
01-2012 0 0 50 10 50 0
Labels used above are:
Ref_Pur = Total Referred Friend's Purchase for that month
Earn_Comm = 10% Commision earned for that month
Todate_Earn_Comm = Total running commission earned up to that month
The MySQL code that I wrote:
SELECT dx1.month,
       dx1.ref_pur,
       dx1.earn_comm,
       ( @cum_earn := @cum_earn + dx1.earn_comm ) as todate_earn_comm
FROM
(
    select date_format(`date`,'%Y-%m') as month,
           sum(amount) as ref_pur ,
           (sum(amount)*0.1) as earn_comm
    from transaction tr, reference rf
    where tr.person_id=rf.person_id and
          tr.action='Purchase' and
          rf.referrer_id=1
    group by date_format(`date`,'%Y-%m')
    order by date_format(`date`,'%Y-%m')
) as dx1
JOIN (select @cum_earn:=0) e;
How can I extend the query to also include the BonusRecvd, Paid and Due transactions, which do not depend on the reference table,
and also generate a row for the month '11-2011', even though no transactions occurred in that month?

If you want to include commission payments and bonuses into the results, you'll probably need to include corresponding rows (Action IN ('Commision', 'Cm_Bonus')) into the initial dataset you are using to calculate the results on. Or, at least, that's what I would do, and it might be like this:
SELECT t.Amount, t.Action, t.Date
FROM Transaction t LEFT JOIN Reference r ON t.Person_id = r.Person_id
WHERE r.Referrer_id = 1 AND t.Action = 'Purchase'
OR t.Person_id = 1 AND t.Action IN ('Commision', 'Cm_Bonus')
And when calculating monthly SUMs, you can use CASE expressions to distinguish among Amounts related to different types of Action. This is how the corresponding part of the query might look:
…
IFNULL(SUM(CASE Action WHEN 'Purchase' THEN Amount END) , 0) AS Ref_Pur,
IFNULL(SUM(CASE Action WHEN 'Purchase' THEN Amount END) * 0.1, 0) AS Earn_Comm,
IFNULL(SUM(CASE Action WHEN 'Cm_Bonus' THEN Amount END) , 0) AS BonusRecvd,
IFNULL(SUM(CASE Action WHEN 'Commision' THEN Amount END) , 0) AS Paid
…
When calculating the Due values, you can initialise another variable and use it quite similarly to @cum_earn, except you'll also need to subtract Paid, something like this:
(@cum_due := @cum_due + Earn_Comm - Paid) AS Due
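To check the running-total logic against the expected result set in the question, here is a small Python sketch that carries two accumulators (cumulative earned commission and cumulative due) across already-aggregated monthly rows. The function name running_report and the tuple layout are assumptions made for illustration; integer arithmetic is used for the 10% so the values stay exact.

```python
def running_report(monthly, commission_percent=10):
    """monthly: (month, ref_pur, bonus, paid) tuples in chronological order.

    Returns rows of
    (month, ref_pur, earn_comm, todate_earn_comm, bonus, paid, due).
    """
    cum_earn = 0
    cum_due = 0
    report = []
    for month, ref_pur, bonus, paid in monthly:
        earn = ref_pur * commission_percent // 100
        cum_earn += earn          # running commission earned so far
        cum_due += earn - paid    # running amount still owed
        report.append((month, ref_pur, earn, cum_earn, bonus, paid, cum_due))
    return report


# Monthly sums taken from the expected result set in the question.
monthly = [
    ("10-2011", 300, 0, 0),
    ("11-2011", 0, 0, 0),
    ("12-2011", 200, 0, 0),
    ("01-2012", 0, 10, 50),
]
report = running_report(monthly)
for row in report:
    print(row)
```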
One last problem seems to be missing months. To address it, I would do the following:
Get the first and the last date from the subset to be processed (as obtained by the query at the beginning of this post).
Get the corresponding month for each of the dates (i.e. another date which is merely the first of the same month).
Using a numbers table, generate a list of months covering the two calculated in the previous step.
Filter out the months that are present in the subset to be processed and use the remaining ones to add dummy transactions to the subset.
As you can see, the "subset to be processed" needs to be touched twice when performing these steps. So, for efficiency, I would insert that subset into a temporary table and use that table, instead of executing the same (sub)query several times.
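The month-filling steps above can be sketched in Python as follows. This illustrates only the logic, not the SQL that a numbers table would drive; the helper name months_between is hypothetical and the dates are taken from the sample transactions.

```python
from datetime import date


def months_between(first, last):
    """Yield the first day of every month from first's month to last's month."""
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        yield date(year, month, 1)
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Steps 1 and 2: first and last transaction dates, reduced to their months.
first_date = date(2011, 10, 20)
last_date = date(2012, 1, 1)
# Months that actually occur in the subset to be processed.
present = {date(2011, 10, 1), date(2011, 12, 1), date(2012, 1, 1)}

# Step 3: every month in the range. Step 4: keep those with no transactions.
all_months = list(months_between(first_date, last_date))
missing = [m for m in all_months if m not in present]
print(missing)
```

Each missing month would then get a dummy transaction with a zero amount so that it shows up as a row in the report.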
The numbers table mentioned in Step #3 is a tool that I would recommend always keeping handy. You only need to initialise it once, and its uses may turn out to be numerous, if you pardon the pun. Here's but one way to populate a numbers table:
CREATE TABLE numbers (n int);
INSERT INTO numbers (n) SELECT 0;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
INSERT INTO numbers (n) SELECT cnt + n FROM numbers, (SELECT COUNT(*) AS cnt FROM numbers) s;
/* repeat as necessary; every repeated line doubles the number of rows */
And that seems to be it. I will not post a complete solution here to spare you the chance to try to use the above suggestions in your own way, in case you are keen to. But if you are struggling or just want to verify that they can be applied to the required effect, you can try this SQL Fiddle page for a complete solution "in action".

SQL query that reports N or more consecutive absents from attendance table

I have a table that looks like this:
studentID | subjectID | attendanceStatus | classDate  | classTime | lecturerID
12345678  | 1234      | 1                | 2012-06-05 | 15:30:00  | 87654321
12345678  | 1234      | 0                | 2012-06-08 | 02:30:00  |
I want a query that reports whether a student has been absent for 3 or more consecutive classes, for a given studentID and a specific subject, between 2 specific dates. Each class can have a different time. The schema for that table is:
PK(`studentID`, `classDate`, `classTime`, `subjectID`, `lecturerID`)
Attendance Status: 1 = Present, 0 = Absent
Edit: Worded question so that it is more accurate and really describes what was my intention.

I wasn't able to create an SQL query for this. So instead, I tried a PHP solution:
Select all rows from table, ordered by student, subject and date
Create a running counter for absents, initialized to 0
Iterate over each record:
    If student and/or subject is different from previous row
        Reset the counter to 0 (present) or 1 (absent)
    Else, that is when student and subject are same
        Set the counter to 0 (present) or increment it by 1 (absent)
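The steps above can be sketched as follows. The answer mentions PHP; Python is used here purely for illustration, and the function name absent_runs and the sample rows are made up.

```python
def absent_runs(rows):
    """rows: (student_id, subject_id, attendance_status) tuples, already
    sorted by student, subject and class date (1 = present, 0 = absent).

    Returns each row extended with the current run of consecutive absences.
    """
    result = []
    prev_key = None
    run = 0
    for student_id, subject_id, status in rows:
        key = (student_id, subject_id)
        if key != prev_key:
            # New student and/or subject: restart the counter.
            run = 0 if status == 1 else 1
            prev_key = key
        else:
            # Same student and subject: reset on presence, extend on absence.
            run = 0 if status == 1 else run + 1
        result.append((student_id, subject_id, status, run))
    return result


# Hypothetical attendance data for two students in subject 10.
rows = [
    (1, 10, 1), (1, 10, 0), (1, 10, 0), (1, 10, 0), (1, 10, 1),
    (2, 10, 0), (2, 10, 0),
]
runs = [r[3] for r in absent_runs(rows)]
flagged = sorted({(s, sub) for s, sub, _, run in absent_runs(rows) if run >= 3})
print(runs, flagged)
```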
I then realized that this logic can easily be implemented using MySQL variables, so:
SET @studentID = 0;
SET @subjectID = 0;
SET @absentRun = 0;
SELECT *,
       CASE
           WHEN (@studentID = studentID) AND (@subjectID = subjectID) THEN @absentRun := IF(attendanceStatus = 1, 0, @absentRun + 1)
           WHEN (@studentID := studentID) AND (@subjectID := subjectID) THEN @absentRun := IF(attendanceStatus = 1, 0, 1)
       END AS absentRun
FROM table4
ORDER BY studentID, subjectID, classDate
You can probably nest this query inside another query that selects records where absentRun >= 3.
SQL Fiddle

This query works for intended result:
SELECT DISTINCT first_day.studentID
FROM student_visits first_day
LEFT JOIN student_visits second_day
ON first_day.studentID = second_day.studentID
AND DATE(second_day.classDate) - INTERVAL 1 DAY = date(first_day.classDate)
LEFT JOIN student_visits third_day
ON first_day.studentID = third_day.studentID
AND DATE(third_day.classDate) - INTERVAL 2 DAY = date(first_day.classDate)
WHERE first_day.attendanceStatus = 0 AND second_day.attendanceStatus = 0 AND third_day.attendanceStatus = 0
It joins the table 'student_visits' (let's call your original table that) to itself step by step on 3 consecutive dates for each student and finally checks for absence on each of those days. DISTINCT makes sure that the result won't contain duplicates when a student is absent for more than 3 consecutive days.
This query doesn't consider absence in a specific subject, just consecutive absence for each student for 3 or more days. To consider the subject, simply add a subjectID condition to each ON clause, for example:
ON first_day.subjectID = second_day.subjectID
P.S.: not sure that it's the fastest way (at least it's not the only).

Unfortunately, MySQL (before version 8.0) does not support window functions. This would be much easier with row_number() or, better yet, cumulative sums (as supported in Oracle).
I will describe the solution. Imagine that you have two additional columns in your table:
ClassSeqNum -- a sequence starting at 1 and incrementing by 1 for each class date.
AbsentSeqNum -- a sequence starting at 1 the first time a student misses a class and then incrementing by 1 on each subsequent absence.
The key observation is that the difference between these two values is constant for consecutive absences. Because you are using mysql, you might consider adding these columns to the table. They are a bit challenging to add in the query, which is why this answer is so long.
Given the key observation, the answer to your question is provided by the following query:
select studentid, subjectid, absenceid, count(*) as cnt
from (select a.*, (ClassSeqNum - AbsentSeqNum) as absenceid
from Attendance a
) a
group by studentid, subjectid, absenceid
having count(*) > 2
(Okay, this gives every sequence of absences for a student for each subject, but I think you can figure out how to whittle this down just to a list of students.)
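To make the key observation concrete, here is a small Python sketch for a single student and subject. It assigns both sequence numbers while walking the classes in order and groups the absences by their difference. The function name absence_streaks and the sample attendance list are made up for illustration.

```python
from collections import Counter


def absence_streaks(statuses, min_length=3):
    """statuses: attendance flags (1 = present, 0 = absent) for one student
    and subject, in class order.

    Returns the lengths of all absence streaks of at least min_length.
    """
    absent_seq = 0
    islands = Counter()
    for class_seq, status in enumerate(statuses, start=1):
        if status == 0:
            absent_seq += 1
            # class_seq - absent_seq stays constant within one streak.
            islands[class_seq - absent_seq] += 1
    return [length for length in islands.values() if length >= min_length]


streaks = absence_streaks([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0])
print(streaks)
```

In this example the three absence streaks get the differences 1, 2 and 3; only the streaks of length 3 and 4 pass the filter, which corresponds to the having count(*) > 2 condition.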
How do you assign the sequence numbers? In mysql, you need to do a self join. So, the following adds the ClassSeqNum:
select a.StudentId, a.SubjectId, a.ClassDate, count(*) as ClassSeqNum
from Attendance a join
     Attendance a1
     on a.studentid = a1.studentid and a.SubjectId = a1.Subjectid and
        a.ClassDate >= a1.classDate
group by a.StudentId, a.SubjectId, a.ClassDate
And the following adds the absence sequence number:
select a.StudentId, a.SubjectId, a.ClassDate, count(*) as AbsentSeqNum
from Attendance a join
     Attendance a1
     on a.studentid = a1.studentid and a.SubjectId = a1.Subjectid and
        a.ClassDate >= a1.classDate
where a.AttendanceStatus = 0 and a1.AttendanceStatus = 0
group by a.StudentId, a.SubjectId, a.ClassDate
So the final query looks like:
-- Assumes one class per date for each student and subject.
-- CTEs (WITH) need MySQL 8.0+; on older versions, inline cs and ab as derived tables.
with cs as (
    select a.StudentId, a.SubjectId, a.ClassDate, count(*) as ClassSeqNum
    from Attendance a join
         Attendance a1
         on a.studentid = a1.studentid and a.SubjectId = a1.Subjectid and
            a.ClassDate >= a1.classDate
    group by a.StudentId, a.SubjectId, a.ClassDate
),
ab as (
    select a.StudentId, a.SubjectId, a.ClassDate, count(*) as AbsentSeqNum
    from Attendance a join
         Attendance a1
         on a.studentid = a1.studentid and a.SubjectId = a1.Subjectid and
            a.ClassDate >= a1.classDate
    where a.AttendanceStatus = 0 and a1.AttendanceStatus = 0
    group by a.StudentId, a.SubjectId, a.ClassDate
)
select studentid, subjectid, absenceid, count(*) as cnt
from (select cs.studentid, cs.subjectid,
             (cs.ClassSeqNum - ab.AbsentSeqNum) as absenceid
      from cs join
           ab
           on cs.studentid = ab.studentid and cs.subjectid = ab.subjectid and
              cs.ClassDate = ab.ClassDate
     ) t
group by studentid, subjectid, absenceid
having count(*) > 2

SELECT rows with minimum count(*)

Let's say I have a simple table voting with columns
id (primary key), token (int), candidate (int), rank (int).
I want to extract all rows having a specific rank, grouped by candidate and, most importantly, only those with the minimum count(*).
So far I have reached
SELECT candidate, count( * ) AS count
FROM voting
WHERE rank =1
AND candidate <200
GROUP BY candidate
HAVING count = min( count )
But it returns an empty set. If I replace min(count) with the actual minimum value, it works properly.
I have also tried
SELECT candidate,min(count)
FROM (SELECT candidate,count(*) AS count
FROM voting
where rank = 1
AND candidate < 200
group by candidate
order by count(*)
) AS temp
But this returned only 1 row. I have 3 rows with the same minimum count but different candidates, and I want all 3 of them.
Can anyone help me? The number of rows with the same minimum count(*) value would also help.
The sample is quite big, so I am showing some dummy values
1 $sampleToken1 101 1
2 $sampleToken2 102 1
3 $sampleToken3 103 1
4 $sampleToken4 102 1
Here, when grouped by candidate, there are 3 rows with the following count( * ) results
candidate count( * )
101 1
103 1
102 2
I want the top 2 rows to be shown, i.e. those with count(*) = 1 or whatever the minimum is

Try to use this script as a pattern -
-- find minimum count
SELECT MIN(cnt) INTO @min FROM (SELECT COUNT(*) cnt FROM voting GROUP BY candidate) t;
-- show records with minimum count
SELECT * FROM voting t1
JOIN (SELECT candidate FROM voting GROUP BY candidate HAVING COUNT(*) = @min) t2
ON t1.candidate = t2.candidate;
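A single-statement variation of this pattern, with the filters from the question applied in both places, can be tried on the dummy rows. The sketch below uses Python's sqlite3 module with an in-memory database rather than MySQL (an assumption made for convenience), and the column name rank is quoted to stay clear of the window function of the same name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE voting (id INTEGER PRIMARY KEY, token TEXT, '
    'candidate INTEGER, "rank" INTEGER)'
)
# Dummy rows from the question.
conn.executemany(
    "INSERT INTO voting VALUES (?, ?, ?, ?)",
    [
        (1, "sampleToken1", 101, 1),
        (2, "sampleToken2", 102, 1),
        (3, "sampleToken3", 103, 1),
        (4, "sampleToken4", 102, 1),
    ],
)

# Compare each candidate's count with the smallest per-candidate count.
min_rows = conn.execute("""
    SELECT candidate, COUNT(*) AS cnt
    FROM voting
    WHERE "rank" = 1 AND candidate < 200
    GROUP BY candidate
    HAVING COUNT(*) = (SELECT MIN(c)
                       FROM (SELECT COUNT(*) AS c
                             FROM voting
                             WHERE "rank" = 1 AND candidate < 200
                             GROUP BY candidate) AS per_candidate)
    ORDER BY candidate
""").fetchall()
conn.close()
print(min_rows)
```

The number of rows returned is also the number of candidates that share the minimum count.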

Remove your HAVING keyword completely, it is not correctly written.
and add SUB SELECT into the where clause to fit that criteria.
(ie. select cand, count(*) as count from voting where rank = 1 and count = (select ..... )

The HAVING keyword can not use the MIN function in the way you are trying. Replace the MIN function with an absolute value such as HAVING count > 10
