Using Groupby with Duplicated - duplicates

I have researched this and have not found a way to do the following. I need to
group by a set of columns (here: studentid, subj, topic, lesson).
Then I need to find the duplicated rows within a subset of columns
(here: testtime, responsetime).
Finally, I need to create a column that indicates whether each row (across those 2 columns) is a duplicate or not.
Starting dataframe
studentid subj topic lesson testtime responsetime
1 1 math add a timestamp1 45sec
2 1 math add a timestamp1 45sec
3 1 math add a timestamp2 30sec
4 1 math add a timestamp3 15sec
5 1 math add b timestamp1 0sec
6 1 math add b timestamp1 0sec
7 1 math add b timestamp1 45sec
8 1 math add b timestamp1 45sec
What I have tried:
Method 1:
Putting duplicated in a user-defined function to create a column indicating whether the row is a duplicate or not. This fails with ERROR: 'function' object is not subscriptable
def check_dup(list):
    return df.duplicated([list],keep='first')
df_alt['dup_values'] = df.groupby(['studentidd', 'subj','topic','lesson']).apply(check_dup['testtime','responsetime'],axis=1)
Method 2:
Using multiindexing, but the problem there is that the duplicated function is looking for duplicates in the index rows, and not in a separate column set ('testtime','responsetime'):
dfnew['dup_indicator'] = df.set_index(['studentidd', 'subj','topic','lesson']).duplicated(['testtime','responsetime'],keep=False)
Desired dataframe
studentid subj topic lesson testtime responsetime dup_indicator
1 1 math add a timestamp1 45sec 1
2 1 math add a timestamp1 45sec 1
3 1 math add a timestamp2 30sec 0
4 1 math add a timestamp3 15sec 0
5 1 math add b timestamp1 0sec 1
6 1 math add b timestamp1 0sec 1
7 1 math add b timestamp1 45sec 1
8 1 math add b timestamp1 45sec 1

You don't need to use groupby or modify the index to accomplish this. Just pass duplicated every column that should define a duplicate, i.e. the grouping columns plus testtime and responsetime:
> df
studentid subj topic lesson testtime responsetime
1 1 math add a timestamp1 45sec
2 1 math add a timestamp1 45sec
3 1 math add a timestamp2 30sec
4 1 math add a timestamp3 15sec
5 1 math add b timestamp1 0sec
6 1 math add b timestamp1 0sec
7 1 math add b timestamp1 45sec
8 1 math add b timestamp1 45sec
> dup_cols = ['studentid', 'subj', 'topic', 'lesson', 'testtime', 'responsetime']
> df.loc[df.duplicated(subset=dup_cols, keep=False), 'dup_indicator'] = 1
> df['dup_indicator'] = df['dup_indicator'].fillna(0)
> df
studentid subj topic lesson testtime responsetime dup_indicator
1 1 math add a timestamp1 45sec 1.0
2 1 math add a timestamp1 45sec 1.0
3 1 math add a timestamp2 30sec 0.0
4 1 math add a timestamp3 15sec 0.0
5 1 math add b timestamp1 0sec 1.0
6 1 math add b timestamp1 0sec 1.0
7 1 math add b timestamp1 45sec 1.0
8 1 math add b timestamp1 45sec 1.0
To break down the steps:
Find all rows where df.duplicated returns True; with keep=False, that is every row that has a duplicate on the columns passed to the subset argument
Filter the dataframe to select the duplicate rows using .loc
Create a new column dup_indicator and assign 1 to the duplicate rows
Use fillna to assign 0 to the non-duplicate rows
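The steps above can also be collapsed into a single assignment. The following is a minimal sketch, assuming the sample data from the question is rebuilt by hand; converting the boolean mask with astype(int) is a swapped-in alternative to the .loc plus fillna approach, and it yields integer 1/0 instead of 1.0/0.0.

```python
import pandas as pd

# Sample data rebuilt from the question (values typed in by hand).
df = pd.DataFrame({
    "studentid": [1] * 8,
    "subj": ["math"] * 8,
    "topic": ["add"] * 8,
    "lesson": ["a"] * 4 + ["b"] * 4,
    "testtime": ["timestamp1", "timestamp1", "timestamp2", "timestamp3",
                 "timestamp1", "timestamp1", "timestamp1", "timestamp1"],
    "responsetime": ["45sec", "45sec", "30sec", "15sec",
                     "0sec", "0sec", "45sec", "45sec"],
})

dup_cols = ["studentid", "subj", "topic", "lesson", "testtime", "responsetime"]

# keep=False marks every member of a duplicate set, not only the repeats.
# The boolean mask is converted straight to 1/0, so no fillna is needed.
df["dup_indicator"] = df.duplicated(subset=dup_cols, keep=False).astype(int)

print(df)
```

Because the grouping columns are part of the subset, two rows only count as duplicates when they belong to the same studentid/subj/topic/lesson group.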

Related

MySQL: sum values based on a state field on a table with millions of rows

I have a table with millions of rows (SF_COLLECTIONS)
ID MEMBERID COLLECTIONID CARDID STATE (D / M) HOWMANY
1 1 1 1 D 1
2 1 1 2 D 2
3 2 1 1 M 1
4 2 1 2 M 1
5 2 2 3 D 1
6 1 2 3 M 2
and I want to know, for every COLLECTIONID, the SUM of the HOWMANY field for STATE=D and for STATE=M.
So I tried this approach:
select COLLECTIONID,
sum(if(STATE='D',HOWMANY,0)) as HMD,
sum(if(STATE='M',HOWMANY,0)) as HMM
from SF_COLLECTIONS
group by COLLECTIONID
and it takes about 15 seconds to return.
Any suggestions for better performance?
Thanks in advance
Please try this:
SELECT COLLECTIONID,
SUM(CASE WHEN STATE='D' THEN HOWMANY END) AS HMD,
SUM(CASE WHEN STATE='M' THEN HOWMANY END) AS HMM
FROM SF_COLLECTIONS
GROUP BY COLLECTIONID;
Duration / Fetch Time
0.00068 sec / 0.000012 sec
Your existing query:
SELECT COLLECTIONID,
SUM(IF(STATE='D',HOWMANY,0)) AS HMD,
SUM(IF(STATE='M',HOWMANY,0)) AS HMM
FROM SF_COLLECTIONS
GROUP BY COLLECTIONID;
Duration / Fetch Time
0.00072 sec / 0.000012 sec
With only 6 rows there is a 0.00004 sec difference.
For 10 million rows, I think it would be noticeable :)
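To check that the two forms return the same sums on the sample data, here is a small sketch using Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained). The IF(...) form is written as CASE ... ELSE 0, which is its equivalent. One difference to keep in mind: CASE without ELSE produces NULL for non-matching rows, so a collection with no rows in a given state would get NULL rather than 0.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SF_COLLECTIONS ("
    "ID INTEGER, MEMBERID INTEGER, COLLECTIONID INTEGER, "
    "CARDID INTEGER, STATE TEXT, HOWMANY INTEGER)"
)
rows = [
    (1, 1, 1, 1, "D", 1),
    (2, 1, 1, 2, "D", 2),
    (3, 2, 1, 1, "M", 1),
    (4, 2, 1, 2, "M", 1),
    (5, 2, 2, 3, "D", 1),
    (6, 1, 2, 3, "M", 2),
]
conn.executemany("INSERT INTO SF_COLLECTIONS VALUES (?, ?, ?, ?, ?, ?)", rows)

# CASE without ELSE: non-matching rows contribute NULL, which SUM ignores.
case_query = """
    SELECT COLLECTIONID,
           SUM(CASE WHEN STATE = 'D' THEN HOWMANY END) AS HMD,
           SUM(CASE WHEN STATE = 'M' THEN HOWMANY END) AS HMM
    FROM SF_COLLECTIONS
    GROUP BY COLLECTIONID
    ORDER BY COLLECTIONID
"""

# Equivalent of IF(STATE = 'D', HOWMANY, 0): non-matching rows contribute 0.
else_query = """
    SELECT COLLECTIONID,
           SUM(CASE WHEN STATE = 'D' THEN HOWMANY ELSE 0 END) AS HMD,
           SUM(CASE WHEN STATE = 'M' THEN HOWMANY ELSE 0 END) AS HMM
    FROM SF_COLLECTIONS
    GROUP BY COLLECTIONID
    ORDER BY COLLECTIONID
"""

case_result = conn.execute(case_query).fetchall()
else_result = conn.execute(else_query).fetchall()
print(case_result)  # [(1, 3, 2), (2, 1, 2)]
```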

Group every 3 rows with a MySQL query

Can anyone help me group every 3 rows?
I have data like this:
num score
1 3
1 3
1 3
1 3
1 3
1 3
4 3
4 3
4 3
2 3
2 3
2 3
and I want a result like this:
num count(num)
1 3
1 3
2 3
4 3
You want to group every 3 rows, but without further conditions we can only make assumptions about how you would like to group. Supposing each row in a 3-row group has the same content and the score value from the base table has no influence on the grouping, I guess you just want to get num once for every 3 rows, since those 3 rows share the same num. We also use the constant 3 for the count(num) column. If all these assumptions hold, try this:
select num, 3 as 'count(num)'
from (select num, @row_id:=@row_id+1 as row_id
      from test, (select @row_id:=0) t1) t2
where row_id mod 3 = 0
order by num;
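The same logic (number the rows, keep every third one, sort by num) can be sketched in plain Python to check it against the desired output. This assumes the rows arrive in the order listed in the question; the variable names are illustrative only.

```python
# Rows from the question as (num, score), in the order they were listed.
rows = [(1, 3)] * 6 + [(4, 3)] * 3 + [(2, 3)] * 3

# Number the rows from 1 and keep every third one, mirroring
# "row_id mod 3 = 0" in the query; then sort by num.
picked = sorted(
    num for row_id, (num, _score) in enumerate(rows, start=1)
    if row_id % 3 == 0
)

# The count column is the constant 3, as in the query.
result = [(num, 3) for num in picked]
print(result)  # [(1, 3), (1, 3), (2, 3), (4, 3)]
```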

What should be the MySQL query having dynamic group by clause?

I need a MySQL query for the problem below.
Consider a table holding students and their marks in a particular subject.
Schema
std_id int(11)
marks int(11)
Sample data
std_id marks
1 10
2 15
3 90
4 120
5 25
6 29
7 121
8 122
Now I have a web app in which a form takes an input (int) from the user.
For example, 12.
Then I need to show the total number of student ids (std_id) in each corresponding marks group.
E.g.
std_total (tot no of students) group (marks range we got from form)
1 0-11
1 12-23
2 24-35
1 84-95
3 120-131
@Barmar Your answer was almost correct; I made a few changes to clean up the output. Your query gives the output below:
0-11 2
1-12 2
2-13 1
3-14 1
4-15 1
6-17 1
7-18 2
My query returns the output:
0-11 2
12-23 2
24-35 1
36-47 1
48-59 1
72-83 1
84-95 2
SELECT CONCAT(FLOOR(marks/12)*12, '-', FLOOR(marks/12)*12+11) AS `group`, COUNT(*) as `std_total`
FROM yourTable
GROUP BY `group`
Use division and FLOOR() to get the beginning of each range.
SELECT CONCAT(FLOOR(marks/12), '-', FLOOR(marks/12)+11) AS `group`, COUNT(*) as `std_total`
FROM yourTable
GROUP BY `group`
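The range arithmetic from the corrected query (lower bound FLOOR(marks/12)*12, upper bound 11 higher) can be checked in Python against the sample marks from the question. This is a minimal sketch; bucket_counts is a hypothetical helper name, and width plays the role of the value entered in the form.

```python
from collections import Counter


def bucket_counts(marks, width):
    """Count marks per range of the given width, e.g. 0-11, 12-23 for width 12."""
    # Integer division then multiplication gives the lower bound of each range.
    counts = Counter((m // width) * width for m in marks)
    return [(f"{lo}-{lo + width - 1}", n) for lo, n in sorted(counts.items())]


# Sample marks from the question, in std_id order.
marks = [10, 15, 90, 120, 25, 29, 121, 122]

result = bucket_counts(marks, 12)
for label, total in result:
    print(total, label)
```

With width 12 this reproduces the expected table from the question: 1 student in 0-11, 1 in 12-23, 2 in 24-35, 1 in 84-95 and 3 in 120-131.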

mysql - How to tell if the date has passed for any row in a SELECT

I'm trying to build a summary with an indicator showing whether the date of any row in the SELECT is already past today... or just show the day difference (if > 0 : + , if < 0 : - ).
Like this: these are my tables.
Suppose today is 2013-12-25
tb_employee:
ID_EMP EMPLOYEE
1 Employee 1
2 Employee 2
3 Employee 3
tb_requirement:
ID_REQ REQUIREMENT
1 Requirement 1
2 Requirement 2
3 Requirement 3
4 Requirement 4
tb_detail:
ID_DET ID_EMP ID_REQ EXPIRATION
1 1 1 2013-12-29
2 1 2 2013-12-28
3 1 3 2013-12-31
4 2 2 2014-01-05
5 2 3 2013-12-20
6 2 4 2013-12-15
Now, the SELECT query should show something like this:
ID_EMP EMPLOYEE REQUIREMENTS_GOT ANY_REQ_EXPIRE
1 Employee 1 3 YES
2 Employee 2 3 NO
I hope I explained it well. Maybe it could be done with DATEDIFF?
Thank you for your answers... and of course, Merry Christmas!
Since you're trying to determine whether any of the requirements has expired, you should compare the earliest expiration date to today's date. There's no need to use DATEDIFF; a simple > operator inside a CASE expression will do. Note that id_emp exists in both tables, so it has to be qualified with a table name:
SELECT tb_employee.id_emp,
       employee,
       COUNT(*) AS requirements_got,
       CASE WHEN CURDATE() > MIN(expiration) THEN 'yes' ELSE 'no' END AS any_req_expire
FROM tb_detail
JOIN tb_employee ON tb_detail.id_emp = tb_employee.id_emp
GROUP BY tb_employee.id_emp, employee
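To try the MIN(expiration) comparison on the question's sample tables, here is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL. Two assumptions are made for the sake of a reproducible example: today's date is passed in as a parameter fixed at 2013-12-25 instead of calling CURDATE(), and dates are stored as ISO strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb_employee (id_emp INTEGER, employee TEXT)")
conn.execute(
    "CREATE TABLE tb_detail ("
    "id_det INTEGER, id_emp INTEGER, id_req INTEGER, expiration TEXT)"
)
conn.executemany(
    "INSERT INTO tb_employee VALUES (?, ?)",
    [(1, "Employee 1"), (2, "Employee 2"), (3, "Employee 3")],
)
conn.executemany(
    "INSERT INTO tb_detail VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2013-12-29"),
        (2, 1, 2, "2013-12-28"),
        (3, 1, 3, "2013-12-31"),
        (4, 2, 2, "2014-01-05"),
        (5, 2, 3, "2013-12-20"),
        (6, 2, 4, "2013-12-15"),
    ],
)

# "today" is bound as a parameter (assumption) so the result does not
# depend on when the example is run.
query = """
    SELECT e.id_emp,
           e.employee,
           COUNT(*) AS requirements_got,
           CASE WHEN ? > MIN(d.expiration) THEN 'yes' ELSE 'no' END AS any_req_expire
    FROM tb_detail d
    JOIN tb_employee e ON d.id_emp = e.id_emp
    GROUP BY e.id_emp, e.employee
    ORDER BY e.id_emp
"""
result = conn.execute(query, ("2013-12-25",)).fetchall()
for row in result:
    print(row)
```

Employee 3 has no rows in tb_detail, so the inner join leaves that employee out of the result.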

mysql dividing all columns by values from one column (for vector calculation etc)

I have a table of this general format. It was generated via pivoting, so the number of columns is not fixed.
id c1 c2... total
10 0 2 1 1 0 4
9 0 1 0 1 0 2
8 1 2 0 0 0 3
7 0 0 0 1 0 1
6 0 1 0 1 1 3
5 1 0 0 1 2 4
4 0 1 1 0 0 2
3 0 3 0 1 1 5
2 2 2 2 0 0 6
1 1 0 1 0 0 2
What I need is to take the "total" column (the last one on the right) and divide each of the {c1, c2, c3....} columns by their respective total. For instance, in row 10, c2=2, so c2/total = 2/4 = 0.5.
Just to emphasize, the number of columns is not fixed; this is a sample table.
Is it possible to do this purely in MySQL, or is an external script needed?
Many thanks
EDIT TO CLARIFY:
My initial data, pre-pivoting, looks like this:
2 2
8 1
2 2
1 5
3 1
9 1
5 3
4 1
1 2
10 5
6 4
4 5
5 2
10 3
5 4
3 1
6 1
6 3
3 4
3 1
5 4
7 3
2 5
10 1
9 3
where the first column is "id" and the second is "c". As shown, it needs to be transformed into a contingency table of sorts, where each id has a count for each "c" {c1,c2,c3...}.
Is there an efficient way to code this data into the format @bobwienholt mentioned below? (I'm new to MySQL; in fact I taught it to myself today for the pivoting. Apologies if this is trivial.)
If I were you, I would structure my table as follows:
-- backticks keep `row` usable as an identifier on MySQL versions where ROW is a reserved word
CREATE TABLE data ( `row` INT, col INT, value INT );
Then you can do this:
SELECT d.`row`, d.col, d.value/t.total
FROM (
    SELECT `row`, SUM(value) AS total
    FROM data
    GROUP BY `row`
) t INNER JOIN data d
ON d.`row` = t.`row`
ORDER BY d.`row`, d.col;
It would work for any number of "rows" and "columns".
Ok, based on your edit... I would just import the data as is. So you would create a table like this:
CREATE TABLE data ( id INT, c INT);
Then you could import your data using LOAD DATA INFILE. You should consult the MySQL docs to learn how to use that.
Then, you would get all your c1, c2, etc counts like this:
SELECT id, c, COUNT(1) as num
FROM data
GROUP BY id, c;
That would yield results like (based on your sample data):
id c num
1 2 1
1 5 1
2 2 2
2 5 1
3 1 3
So, basically for id 3, c1 = 3... for id 2 c2=2, etc.
Your total column would be:
SELECT id, SUM(num) as total
FROM (
SELECT id, c, COUNT(1) as num
FROM data
GROUP BY id, c
) x
GROUP BY id;
In this scenario, you wouldn't have to pivot your data.
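Putting the two queries together, each count divided by its id's total gives the fractions the question asks for. The following is a sketch of that arithmetic in plain Python on the pre-pivot sample data; the variable names are illustrative, and in MySQL the same result would come from joining the count query to the total query on id.

```python
from collections import Counter

# Pre-pivot sample data from the question as (id, c) pairs.
pairs = [
    (2, 2), (8, 1), (2, 2), (1, 5), (3, 1), (9, 1), (5, 3), (4, 1), (1, 2),
    (10, 5), (6, 4), (4, 5), (5, 2), (10, 3), (5, 4), (3, 1), (6, 1), (6, 3),
    (3, 4), (3, 1), (5, 4), (7, 3), (2, 5), (10, 1), (9, 3),
]

# Equivalent of: SELECT id, c, COUNT(1) ... GROUP BY id, c
counts = Counter(pairs)

# Equivalent of: SELECT id, SUM(num) ... GROUP BY id
totals = Counter(id_ for id_, _c in pairs)

# Each (id, c) count divided by the total for that id.
fractions = {(id_, c): n / totals[id_] for (id_, c), n in counts.items()}

for (id_, c), frac in sorted(fractions.items()):
    print(id_, c, frac)
```

For id 3, for example, c=1 occurs 3 times out of 4 rows, so its fraction is 0.75. No pivoted table is needed, and the number of distinct c values does not have to be known in advance.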