mysql query to efficiently remove duplicates - mysql

Hi folks and thanks for reading
I have a quiz feature on my site which stores a score, username and ip address as the most important columns. I currently have a horrible series of views bringing back the high scores based on the criteria I need which are...
Lowest score first but...only the lowest score for each Quiz user.
The complexity lies if the user has changed ip, i.e. keeps the same username but has a different ip OR if the user keeps the same IP address but changes user name.
It's easier to explain with an example.
First visitor has 4 entries but from 3 different IP Addresses
Second user from 2 IP Addresses
Third user using one IP Address but using 3 Usernames
Table with VALUES(UserID, IPA, Score)
User 1, IP1, 13
User 1, IP1, 20
User 1, IP2, 30
User 1, IP3, 10
User 2, IP4, 20
User 2, IP5, 22
User 2, IP5, 15
User 3, IP6, 12
User 3, IP6, 20
User 4, IP6, 15
User 5, IP6, 11
The highscore query would present you with
User 1, IP3, 10
User 5, IP6, 11
User 2, IP5, 15
The score value is highly unlikely to be duplicated but I guess it is possible. The figures above are simplified to explain my conundrum!
Can anyone suggest an efficient way of removing these duplicates as my table is now over 15,000 records and the views are creaking!
Many thanks.

To identify occurrences of duplicate (UserID,IPA) tuples is pretty straightforward:
SELECT s.UserID
, s.IPA
FROM mytable s
GROUP
BY s.UserID
, s.IPA
HAVING COUNT(1) > 1
To get the lowest score, you could add MIN(s.Score) to the select list.
Deleting duplicates is a little more difficult, in that you don't seem to have any guarantee of uniqueness. Some will recommend that you copy the rows you want to keep out to a separate table, and then either swap the tables with renames, or truncate the original table and reload from the new table. (That usually turns out to be the most efficient approach.)
CREATE TABLE newtable LIKE mytable ;
INSERT INTO newtable (UserID,IPA,Score)
SELECT s.UserID
, s.IPA
, MIN(Score) AS Score
FROM mytable s
GROUP
BY s.UserID
, s.IPA ;
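The copy-out approach can be checked against the sample rows from the question. The following is only a minimal sketch: it runs the same GROUP BY through Python's built-in sqlite3 module rather than MySQL, so `CREATE TABLE ... AS SELECT` stands in for `CREATE TABLE ... LIKE` plus `INSERT ... SELECT`, and the function name `dedupe_lowest_scores` is made up for illustration.

```python
import sqlite3

# Sample rows from the question: (UserID, IPA, Score)
ROWS = [
    ("User 1", "IP1", 13), ("User 1", "IP1", 20), ("User 1", "IP2", 30),
    ("User 1", "IP3", 10), ("User 2", "IP4", 20), ("User 2", "IP5", 22),
    ("User 2", "IP5", 15), ("User 3", "IP6", 12), ("User 3", "IP6", 20),
    ("User 4", "IP6", 15), ("User 5", "IP6", 11),
]


def dedupe_lowest_scores(rows):
    """Copy one row per (UserID, IPA), keeping the lowest Score, into newtable."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (UserID TEXT, IPA TEXT, Score INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    # SQLite stand-in for CREATE TABLE ... LIKE followed by INSERT ... SELECT
    con.execute(
        """
        CREATE TABLE newtable AS
        SELECT s.UserID, s.IPA, MIN(s.Score) AS Score
          FROM mytable s
         GROUP BY s.UserID, s.IPA
        """
    )
    result = con.execute(
        "SELECT UserID, IPA, Score FROM newtable ORDER BY UserID, IPA"
    ).fetchall()
    con.close()
    return result


deduped = dedupe_lowest_scores(ROWS)
for row in deduped:
    print(row)
```

The eleven sample rows collapse to one row per (UserID, IPA) pair, each carrying the lowest score seen for that pair.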
If you want to identify duplicates by just UserID, the same approach can work. If it isn't important that the IPA value comes from the row with the lowest score, it's a little easier. I can put together the query that gets the row that has the lowest score for the user.
If you want to delete rows from the existing table, without adding a unique identifier (like an AUTO_INCREMENT id column) on each row, that can be done too.
This will get you partway, deleting all rows for a given (UserID,IPA) that have a score higher than the lowest score:
DELETE t.*
FROM mytable t
JOIN ( SELECT s.UserID
, s.IPA
, MIN(s.Score) AS Score
FROM mytable s
GROUP
BY s.Userid
, s.IPA
) k
ON k.UserID = t.UserID
AND k.IPA = t.IPA
AND k.Score < t.Score
But that will still leave duplicate (UserID,IPA,Score) tuples. Without some other column on the table that makes each row unique, it's a little more difficult to remove those duplicates. (Again, a common technique is to copy the rows you want to keep to another table, and either swap tables or reload the original table from the saved rows.)
FOLLOWUP
Note that views (both stored and inline) can be expensive performancewise, with MySQL, since the views get materialized as temporary MyISAM tables (MySQL calls them "derived tables").
But correlated subqueries can be even more problematic on large sets.
So, choose your poison.
If the table has an index ON (UserID, Score, IPA), here's how I would get the resultset:
SELECT IF(@prev_user=t.UserID,@i:=@i+1,@i:=1) AS seq
, @prev_user := t.UserID AS UserID
, t.IPA
, t.Score
FROM mytable t
JOIN (SELECT @i := NULL, @prev_user := NULL) i
GROUP
BY t.UserID ASC
, t.Score ASC
, t.IPA ASC
HAVING seq = 1
This takes advantage of some MySQL-specific features: user variables and the guarantee that the GROUP BY will return a sorted resultset. (The EXPLAIN output will show "Using index", which means we avoid a sort operation, but the query will still create a derived table.) We use the user variables to identify the "first" row for each UserID, and the HAVING clause eliminates all but that first row.
test case:
create table mytable (UserID VARCHAR(6), IPA varchar(3), Score INT);
create index mytable_IX ON mytable (UserID, Score, IPA);
insert into mytable values ('User 1','IP1',13)
,('User 1','IP1',20)
,('User 1','IP2',30)
,('User 1','IP3',10)
,('User 2','IP4',20)
,('User 2','IP5',22)
,('User 2','IP5',15)
,('User 3','IP6',12)
,('User 3','IP6',20)
,('User 4','IP6',15)
,('User 5','IP6',11);
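The "first row per UserID" logic can also be written with a window function. This is a sketch under assumptions rather than the query above: it uses `ROW_NUMBER()` (available in MySQL 8.0, where GROUP BY no longer sorts implicitly and the `GROUP BY ... ASC` syntax was dropped) and runs it through Python's sqlite3 module (SQLite 3.25 or later) against the test-case rows. The function name `lowest_row_per_user` is hypothetical.

```python
import sqlite3

# Same rows as the test case: (UserID, IPA, Score)
ROWS = [
    ("User 1", "IP1", 13), ("User 1", "IP1", 20), ("User 1", "IP2", 30),
    ("User 1", "IP3", 10), ("User 2", "IP4", 20), ("User 2", "IP5", 22),
    ("User 2", "IP5", 15), ("User 3", "IP6", 12), ("User 3", "IP6", 20),
    ("User 4", "IP6", 15), ("User 5", "IP6", 11),
]


def lowest_row_per_user(rows):
    """Return the (UserID, IPA, Score) row with the lowest Score per UserID."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE mytable (UserID TEXT, IPA TEXT, Score INTEGER)")
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", rows)
    result = con.execute(
        """
        SELECT UserID, IPA, Score
          FROM (SELECT t.*,
                       -- seq = 1 marks the lowest-scoring row of each user
                       ROW_NUMBER() OVER (PARTITION BY t.UserID
                                          ORDER BY t.Score, t.IPA) AS seq
                  FROM mytable t) ranked
         WHERE seq = 1
         ORDER BY Score, UserID
        """
    ).fetchall()
    con.close()
    return result


for row in lowest_row_per_user(ROWS):
    print(row)
```

Like the user-variable query, this keeps one row per UserID only; it does not merge users who share an IP address.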
Another followup
To eliminate 'User 4' and 'User 5' from the resultset (it's not at all clear why you would want or need to do that): if it's because those users have only one row in the table, then you could add a JOIN to a subquery (inline view) that gets a list of UserID values where there is more than one row, like this:
SELECT IF(@prev_user=t.UserID,@i:=@i+1,@i:=1) AS seq
, @prev_user := t.UserID AS UserID
, t.IPA
, t.Score
FROM mytable t
JOIN ( SELECT d.UserID
FROM mytable d
GROUP
BY d.UserID
HAVING COUNT(1) > 1
) m
ON m.UserID = t.UserID
CROSS
JOIN (SELECT @i := NULL, @prev_user := NULL) i
GROUP
BY t.UserID ASC
, t.Score ASC
, t.IPA ASC
HAVING seq = 1

Related

find first transaction after created date and added to a column MySQL

I am using MySQL version 8.0
MRE:
create table users(
user varchar(5),
work_type varchar(20),
time datetime
);
insert into users(user, work_type, time)
Values ("A", "create", "2020-01-01 11:11:11")
, ("A", "bought", "2020-01-04 16:11:11")
, ("A", "bought", "2020-01-07 18:10:10")
, ("A", "bought", "2020-01-08 12:00:11")
, ("A", "create", "2020-02-02 15:17:11")
, ("A", "bought", "2020-02-02 16:11:11");
In my table for each user there is a "work_type" column which specifies what user does.
user work_type time
A create 2020-01-01 11:11:11
A bought 2020-01-04 16:11:11
A bought 2020-01-07 18:10:10
A bought 2020-01-08 12:00:11
A create 2020-02-02 15:17:11
A bought 2020-02-02 16:11:11
After each time user A does a "create", I want to find only the first bought time that follows it and add it as a new column
user work_type time bought_time
A create 2020-01-01 11:11:11 2020-01-04 16:11:11
A create 2020-02-02 15:17:11 2020-02-02 16:11:11
Notice that user A can have multiple rows with the create work_type. Above is the desired output; however, there will be multiple users as well.
A correlated subquery in the select list can retrieve a single value. I use the order by time asc limit 1 clauses to limit the number of returned rows to 1:
select t.*, (select t2.`time` from yourtable t2 where t2.user=t.user and t2.`time` > t.`time` and t2.work_type='bought' order by t2.`time` asc limit 1) as bought_time
from yourtable t
where work_type='create'
The above query is fine, as long as you have at least 1 bought record after each create one. If you cannot guarantee this and you have no other fields to link a create with the subsequent bought, then you have to complicate things to check for the type of the next record after the create. Note: I do not filter on the work_type field in the subquery any longer:
select t.*, (select if(t2.work_type='bought',t2.`time`,null) from yourtable t2 where t2.user=t.user and t2.`time` > t.`time` order by t2.`time` asc limit 1) as bought_time
from yourtable t
where work_type='create'
If the create and subsequent bought records form part of a set, then I would definitely create a field that links them together, meaning that this field would have the same value for all records belonging to the same set. This way it would be really easy to identify which records form part of the set.
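To see the first correlated subquery at work, here is a small sketch that runs it on the rows from the question. It is an illustration under assumptions: it uses Python's sqlite3 module instead of MySQL, stores the timestamps as ISO-formatted text (so string comparison matches chronological order), and the function name `first_bought_after_create` is invented for this example.

```python
import sqlite3

# Rows from the question's MRE: (user, work_type, time)
EVENTS = [
    ("A", "create", "2020-01-01 11:11:11"),
    ("A", "bought", "2020-01-04 16:11:11"),
    ("A", "bought", "2020-01-07 18:10:10"),
    ("A", "bought", "2020-01-08 12:00:11"),
    ("A", "create", "2020-02-02 15:17:11"),
    ("A", "bought", "2020-02-02 16:11:11"),
]


def first_bought_after_create(events):
    """For each 'create' row, look up the earliest later 'bought' time."""
    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE users ("user" TEXT, work_type TEXT, "time" TEXT)')
    con.executemany("INSERT INTO users VALUES (?, ?, ?)", events)
    result = con.execute(
        """
        SELECT t."user", t.work_type, t."time",
               (SELECT t2."time"
                  FROM users t2
                 WHERE t2."user" = t."user"
                   AND t2."time" > t."time"
                   AND t2.work_type = 'bought'
                 ORDER BY t2."time" ASC
                 LIMIT 1) AS bought_time
          FROM users t
         WHERE t.work_type = 'create'
         ORDER BY t."user", t."time"
        """
    ).fetchall()
    con.close()
    return result


for row in first_bought_after_create(EVENTS):
    print(row)
```

A create row with no later bought row for the same user gets NULL (None in Python) in bought_time, because the subquery returns no rows.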
Solution for your problem:
SELECT * FROM
(
SELECT
user
,work_type
,CASE WHEN UPPER(work_type) = 'CREATE' THEN time END time
,CASE WHEN UPPER(work_type) = 'CREATE'
THEN LEAD(time) OVER(PARTITION BY user ORDER BY time) END bought_time
FROM
Table1) A
WHERE UPPER(work_type) = 'CREATE';
Link for demo:
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=ac0cf9375025b964769fd28514db0ce1

In SQL or mySQL, how can I find a key which has the highest sum in another column while the key appears in two column?

Suppose I have a video chatting app that records the username of two users and the length of the call, the table is data of all the calls.
A person can appear in both user1 and user2. For example, in the table David appears in both user1 and user2. Using the data that we have on the table, how can I write a SQL query that finds the user who has the longest total call length? In this case, David has the longest total call length, which is 50 minutes.
You can use a LEAST/GREATEST trick here:
SELECT user, SUM(length) AS total_length
FROM
(
SELECT LEAST(User1, User2) AS user, length
FROM yourTable
UNION ALL
SELECT GREATEST(User1, User2), length
FROM yourTable
) t
GROUP BY
user
ORDER BY
SUM(length) DESC
with dat as
(
Select 'Jhony' User1, 'Jennifer' User2, 23 Call_Length union all
Select 'David','Michael',10 union all
Select 'Lisa','David',40 union all
Select 'Lisa','Jennifer',5
)
Select top 1 sum(a.call_length+nvl(b.call_length,0)),a.user1,b.user2 from
dat a
left join dat b on a.user1=b.user2
group by a.user1,b.user2
order by sum(a.call_length+nvl(b.call_length,0)) desc
Another way, which might look dirtier since MySQL (before 8.0) doesn't support window functions like other RDBMSs, but which always gives you the exact result, including multiple users tied for the highest total length, is to combine the rows for both users, calculate each user's total length, and compare that value in the outer query against the maximum total from the same query, without using LIMIT.
SELECT Caller, SUM(length) TotalLength
FROM
(
SELECT User1 AS Caller, length FROM calls UNION ALL
SELECT User2, length FROM calls
) a
GROUP BY Caller
HAVING SUM(length) = (
SELECT MAX(TotalLength)
FROM
(
SELECT Caller, SUM(length) TotalLength
FROM
(
SELECT User1 AS Caller, length FROM calls UNION ALL
SELECT User2, length FROM calls
) a
GROUP BY Caller
) a
)
Here's a Demo.
I would combine the call times for both users (user1 and user2), then group by user, and take the top record by total call time.
Select user, sum(calltime) as calltime
From
(Select user1 as user, calltime
from tbl
Union all
Select user2, calltime
from tbl
) t
Group by user
Order by calltime desc
Limit 1;
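The UNION ALL approach shared by several of the answers can be tried on the four sample calls that appear in one of the answers above. This is only a sketch: it runs through Python's sqlite3 module rather than MySQL, and the table name `calls`, the column name `call_length` and the function name `total_call_lengths` are assumptions made for illustration.

```python
import sqlite3

# Sample calls taken from the demo data in one of the answers:
# (user1, user2, call_length in minutes)
CALLS = [
    ("Jhony", "Jennifer", 23),
    ("David", "Michael", 10),
    ("Lisa", "David", 40),
    ("Lisa", "Jennifer", 5),
]


def total_call_lengths(calls):
    """Return (user, total minutes) pairs, longest total first."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE calls (user1 TEXT, user2 TEXT, call_length INTEGER)")
    con.executemany("INSERT INTO calls VALUES (?, ?, ?)", calls)
    result = con.execute(
        """
        SELECT caller, SUM(call_length) AS total_length
          FROM (SELECT user1 AS caller, call_length FROM calls
                UNION ALL
                SELECT user2, call_length FROM calls) both_sides
         GROUP BY caller
         ORDER BY total_length DESC, caller
        """
    ).fetchall()
    con.close()
    return result


totals = total_call_lengths(CALLS)
print(totals[0])  # the user with the longest total call length
```

Each call is counted once for each participant, so David's total is the 10 minutes from his call as user1 plus the 40 minutes from his call as user2.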

MySQL : Group By Clause Not Using Index when used with Case

I'm using MySQL.
I can't change the DB structure, so sadly that's not an option.
THE ISSUE:
When I use GROUP BY with CASE (as needed in my situation), MySQL uses
filesort and the delay is humongous (approx. 2-3 minutes):
http://sqlfiddle.com/#!9/f97d8/11/0
But when I don't use CASE, just GROUP BY group_id, MySQL easily uses
the index and the result is fast:
http://sqlfiddle.com/#!9/f97d8/12/0
Scenario: DETAILED
Table msgs, containing records of sent messages, with fields:
id,
user_id, (the guy who sent the message)
type, (0 means it's a group msg. All the msgs sent under a group are marked with its group_id. So let's say group_id = 5 sent 5 msgs: the table will have 5 records with group_id = 5 and type = 0. For type > 0, the group_id will be NULL, because all other types have no group_id, as they are individual msgs sent to a single recipient)
group_id (if type=0, will contain group_id, else NULL)
Table contains approx 10 million records for user id 50001 and with different types (i.e group as well as individual msgs)
Now the QUERY:
SELECT
msgs.*
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.user_id IN (50111)
AND msgs.type IN (0, 1, 5, 7)
GROUP BY CASE `msgs`.`type` WHEN 0 THEN `msgs`.`group_id` ELSE `msgs`.`id` END
ORDER BY `msgs`.`group_id` DESC
LIMIT 100
I HAVE to get the summary in a single QUERY,
so msgs sent to a group, let's say group 5 (which has 5 records in this table), will be shown as 1 record in the summary (I may show COUNT later, but that's not an issue).
The individual msgs have NULL as group_id, so I can't just put 'GROUP BY group_id' because that would group all individual msgs into a single record, which is not acceptable.
Sample output can be something like:
id  owner_id  type  group_id  COUNT
1   50001     0     2         5
1   50001     1     NULL      1
1   50001     4     NULL      1
1   50001     0     7         5
1   50001     5     NULL      1
1   50001     5     NULL      1
1   50001     5     NULL      1
1   50001     0     10        5
Now the problem is that the GROUP BY condition with CASE (which I currently think I have to use, because I only need to group by group_id if type=0) is causing a lot of delay because it's not using indexes, which it does if I don't use CASE (i.e. just GROUP BY group_id). Please view the SQLFiddles above to see the EXPLAIN results.
Can anyone please give advice on how to get it optimized?
UPDATE
I tried a workaround that does somewhat work (it drops the INITIAL queries to 1 sec). It uses UNION to shrink the resultset, since a huge resultset forces MySQL to write to disk for the filesort: it limits the resultset of group msgs and of individual msgs separately (see the query below).
-- The first part of the union retrieves group msgs (those that have type 0 and need to be grouped by group_id). It applies a limit to cap the out-of-control result set
-- The second query retrieves individual msgs (those with type != 0, grouped by msgs.id; not necessary, but just to be safe from duplicate entries due to joins). It applies a limit to cap the out-of-control result set
-- The outer query combines the two to retrieve the desired resultset
Here's the query:
SELECT
*
FROM
(
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (msgs.user_id = accounts.id)
WHERE 1
AND accounts.id IN (50111 ) AND type = 0
GROUP BY msgs.group_id
ORDER BY msgs.id DESC
LIMIT 40
)
UNION
ALL
(
SELECT
msgs.id as reference_id, user_id, type, group_id
FROM
msgs
INNER JOIN accounts
ON (
msgs.user_id = accounts.id
)
WHERE 1
AND msgs.type != 0
AND accounts.id IN (50111)
GROUP BY msgs.id
ORDER BY msgs.id
LIMIT 40
)
) AS temp
ORDER BY reference_id
LIMIT 20,20
But it has a lot of caveats:
- I need to handle the limit in the inner queries as well. Let's say 20 recs per page, and I'm on page 4. For the inner queries, I need to apply limit 0,80, since I'm not sure how many records each of the two parts contributed to the previous 3 pages. So, as the records per page and the number of pages grow, my query grows heavier. Let's say 1k recs per page, and I'm on page 100, or 1K: the load gets heavier and the time increases exponentially
- I need to handle ordering in the inner queries and then apply it again on the resultset prepared by the union; conditions need to be applied to both inner queries separately (but that's not much of an issue)
- I can't use calc_found_rows, so I will need to get the count using separate queries
The main issue is the first one. The higher I go with the pagination, the heavier it gets.
Would this run faster?
SELECT id, user_id, type, group_id
FROM
( SELECT id, user_id, type, group_id, IFNULL(group_id, id) AS foo
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
) AS x
GROUP BY foo
ORDER BY `group_id` DESC
LIMIT 100
It needs INDEX(user_id, type).
Does this give the 'correct' answer?
SELECT DISTINCT *
FROM msgs
WHERE user_id IN (50111)
AND type IN (0, 1, 5, 7)
GROUP BY IFNULL(group_id, id)
ORDER BY `group_id` DESC
LIMIT 100
(It needs the same index)
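The effect of grouping on IFNULL(group_id, id) can be seen on a few made-up rows. This is a sketch under assumptions: the sample messages are invented, `COALESCE` is used as the portable spelling of `IFNULL`, and the query runs through Python's sqlite3 module rather than MySQL, so it says nothing about index usage. It also assumes a group_id never equals the id of an individual message; if it did, the two would be merged into one group.

```python
import sqlite3

# Hypothetical sample rows: (id, user_id, type, group_id)
MSGS = [
    (1, 50111, 0, 102), (2, 50111, 0, 102),   # two msgs to group 102
    (3, 50111, 1, None), (4, 50111, 5, None),  # two individual msgs
    (5, 50111, 0, 107), (6, 50111, 0, 107), (7, 50111, 0, 107),
]


def summarize(msgs):
    """Collapse group msgs to one row per group; keep individual msgs apart."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE msgs (id INTEGER PRIMARY KEY, user_id INTEGER,"
        " type INTEGER, group_id INTEGER)"
    )
    con.executemany("INSERT INTO msgs VALUES (?, ?, ?, ?)", msgs)
    result = con.execute(
        """
        SELECT COALESCE(group_id, id) AS grp_key, COUNT(*) AS n
          FROM msgs
         WHERE user_id = 50111
           AND type IN (0, 1, 5, 7)
         GROUP BY COALESCE(group_id, id)
         ORDER BY grp_key
        """
    ).fetchall()
    con.close()
    return result


for grp_key, n in summarize(MSGS):
    print(grp_key, n)
```

Individual messages fall back to their own id as the grouping key, so each stays a separate row, while all messages of one group share the group_id key.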

Mysql Ranking Query on 2 columns

Table
id user_id rank_solo lp
1 1 15 45
2 2 7 79
3 3 17 15
How can I sort out a ranking query that sorts on rank_solo ( This ranges from 0 to 28) and if rank_solo = rank_solo , uses lp ( 0-100) to further determine ranking?
(If lp = lp, add a ranking for no tie rankings)
The query should give me the ranking from a certain random user_id. How is this performance wise on 5m+ rows?
So
User_id 1 would have ranking 2
User_id 2 would have ranking 3
User_id 3 would have ranking 1
You can get the ranking using variables:
select t.*, (@rn := @rn + 1) as ranking
from t cross join
(select @rn := 0) params
order by rank_solo desc, lp;
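For the ranking of one particular user_id, an alternative is to count the rows that rank ahead of that user instead of numbering every row. The sketch below is an assumption-laden illustration, not the query above: it treats a higher rank_solo as better (which matches the expected output in the question), assumes a higher lp wins when rank_solo is equal, breaks any remaining tie by the lower user_id, and runs through Python's sqlite3 module. The function name `ranking_for` is hypothetical.

```python
import sqlite3

# Rows from the question: (id, user_id, rank_solo, lp)
PLAYERS = [(1, 1, 15, 45), (2, 2, 7, 79), (3, 3, 17, 15)]


def ranking_for(user_id, rows):
    """Return the 1-based ranking of user_id, or None if the user is unknown."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE t (id INTEGER, user_id INTEGER, rank_solo INTEGER, lp INTEGER)"
    )
    con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
    me = con.execute(
        "SELECT rank_solo, lp FROM t WHERE user_id = ?", (user_id,)
    ).fetchone()
    if me is None:
        con.close()
        return None
    rank_solo, lp = me
    # Count everyone who sorts strictly ahead of this user.
    (ahead,) = con.execute(
        """
        SELECT COUNT(*)
          FROM t
         WHERE rank_solo > ?
            OR (rank_solo = ? AND lp > ?)
            OR (rank_solo = ? AND lp = ? AND user_id < ?)
        """,
        (rank_solo, rank_solo, lp, rank_solo, lp, user_id),
    ).fetchone()
    con.close()
    return ahead + 1


for uid in (1, 2, 3):
    print(uid, ranking_for(uid, PLAYERS))
```

On a large table, a composite index on (rank_solo, lp, user_id) would likely help the counting query, though that is not measured here.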
You can use ORDER BY to sort your query:
SELECT *
FROM `Table`
ORDER BY rank_solo, lp
I'm not sure I quite understand what you're saying. With that many rows, create an index on the fields you're using to do your selects. For example, in the MySQL client use:
create index RANKINGS on mytablename(rank_solo,lp,user_id);
Depending on what you use in your query to select the data, you may change the index or add another index with a different field combination. This has improved performance on my tables by a factor of 10 or more.
As for the query, if you're selecting a specific user then could you not just use:
select rank_solo from table where user_id={user id}
If you want the highest ranking individual, you could:
select * from yourtable order by rank_solo,lp limit 1
Remove the limit 1 to list them all.
If I've misunderstood, please comment.
An alternative would be to use a 2nd table.
table2 would have the following fields:
rank (auto_increment)
user_id
rank_solo
lp
With the rank field as auto increment, as it's populated, it will automatically populate with values beginning with "1".
Once the 2nd table is ready, just do this when you want to update the rankings:
delete from table2;
insert into table2 select user_id,rank_solo,lp from table1 order by rank_solo,lp;
It may not be "elegant" but it gets the job done. Plus, if you create an index on both tables, this query would be very quick since the fields are numeric.

MySQL Query to find row duplicates based on condition with limit

I have two tables:
Members:
id username
Trips:
id member_id flag_status created
("YES" or "NO")
I can do a query like this:
SELECT
Trip.id, Trip.member_id, Trip.flag_status
FROM
trips Trip
WHERE
Trip.member_id = 1711
ORDER BY
Trip.created DESC
LIMIT
3
Which CAN give results like this:
id member_id flag_status
8 1711 YES
9 1711 YES
10 1711 YES
My goal is to know if the member's last three trips all had a flag_status = "YES", if any of the three != "YES", then I don't want it to count.
I also want to be able to remove the WHERE Trip.member_id = 1711 clause, and have it run for all my members, and give me the total number of members whose last 3 trips all have flag_status = "YES"
Any ideas?
Thanks!
http://sqlfiddle.com/#!2/28b2d
In that sqlfiddle, when the correct query i'm seeking runs, I should see results such as:
COUNT(Member.id)
2
The two members that should qualify are members 1 and 3. Member 5 fails because one of his trips has flag_status = "NO"
You could use the GROUP_CONCAT function to obtain a list of all of the statuses, ordered by id in descending order:
SELECT
member_id,
GROUP_CONCAT(flag_status ORDER BY id DESC) as status
FROM
trips
GROUP BY
member_id
HAVING
SUBSTRING_INDEX(status, ',', 3) NOT LIKE '%NO%'
and then, using SUBSTRING_INDEX, you can extract only the last three status flags and exclude the members whose flags contain a NO. Please see fiddle here. I'm assuming that all of your rows are ordered by ID, but if you have a created date it would be better to use:
GROUP_CONCAT(flag_status ORDER BY created DESC) as status
as Raymond suggested. Then, you could also return just the count of the rows returned using something like:
SELECT COUNT(*)
FROM (
...the query above...
) as q
Although I like the simplicity of fthiella's solution, I just can't think of a solution that depends so much on data representation. In order not to depend on it you can do something like this:
SELECT COUNT(*) FROM (
SELECT member_id FROM (
SELECT
flag_status,
@flag_index := IF(member_id = @member, @flag_index + 1, 1) flag_index,
@member := member_id member_id
FROM trips, (SELECT @member := 0, @flag_index := 1) init
ORDER BY member_id, id DESC
) x
WHERE flag_index <= 3
GROUP BY member_id
HAVING SUM(flag_status = 'NO') = 0
) x
Fiddle here. Note I've slightly modified the fiddle to remove one of the users.
The process basically ranks the trips for each of the members based on their id desc and then only keeps the last 3 of them. Then it makes sure that none of the fetched trips has a NO in the flag_status. Finally, all the matching members are counted.
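The same rank-then-filter idea can be expressed with a window function instead of user variables. This is a sketch under assumptions: the trips below are invented (the fiddle data is not reproduced here), `ROW_NUMBER()` requires MySQL 8.0 or, as used here, SQLite 3.25 or later through Python's sqlite3 module, and the function name `members_with_last_three_yes` is made up. Like the answers above, it also counts members with fewer than three trips if none of them is flagged otherwise.

```python
import sqlite3

# Hypothetical trips: (id, member_id, flag_status, created)
TRIPS = [
    (1, 1, "NO", "2013-01-01"), (2, 1, "YES", "2013-01-02"),
    (3, 1, "YES", "2013-01-03"), (4, 1, "YES", "2013-01-04"),
    (5, 3, "YES", "2013-01-01"), (6, 3, "YES", "2013-01-02"),
    (7, 3, "YES", "2013-01-03"),
    (8, 5, "YES", "2013-01-01"), (9, 5, "NO", "2013-01-02"),
    (10, 5, "YES", "2013-01-03"),
]


def members_with_last_three_yes(trips):
    """Return member ids whose three most recent trips are all flagged YES."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE trips (id INTEGER, member_id INTEGER,"
        " flag_status TEXT, created TEXT)"
    )
    con.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", trips)
    rows = con.execute(
        """
        SELECT member_id
          FROM (SELECT member_id, flag_status,
                       -- rn = 1 is the most recent trip of each member
                       ROW_NUMBER() OVER (PARTITION BY member_id
                                          ORDER BY created DESC, id DESC) AS rn
                  FROM trips) ranked
         WHERE rn <= 3
         GROUP BY member_id
        HAVING SUM(flag_status <> 'YES') = 0
         ORDER BY member_id
        """
    ).fetchall()
    con.close()
    return [member_id for (member_id,) in rows]


qualifying = members_with_last_three_yes(TRIPS)
print(len(qualifying), qualifying)
```

Member 1 qualifies even though an older trip is flagged NO, because only the three most recent trips are inspected.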