Removing duplicate user entries from mySQL database table - mysql

I have a table in my database to store user data. I found a defect in the code that adds data to this table: if a network timeout occurs, the code updates the next user's data with the previous user's data. I've fixed this defect, but I need to clean the database. I've added a flag to indicate the rows that need to be ignored, and my goal is to set this flag for the duplicates. In some cases, though, duplicate values may be legitimate, so I am more interested in finding cases where several users share the same data (i.e., more than 2 users).
Here's an example (tablename = Data):
id  user_id  data1  data2  data3  datetime      flag
1   usr1     3      2      2      2012-02-16..  0
2   usr2     3      2      2      2012-02-16..  0
3   usr3     3      2      2      2012-02-16..  0
In this case, I'd like to set the flag to 1 (to indicate ignore) for ids 2 and 3, since we know usr1 was the original datapoint (assuming the oldest dates are earlier in the list).
At this point there are so many entries in the table that I'm not sure of the best way to identify the users that have duplicate entries.
I'm looking for a MySQL command to identify the problem data first; then I'll be able to mark the entries. Could someone guide me in the right direction?

Well, first select duplicate data with their min user id:
CREATE TEMPORARY TABLE duplicates
SELECT MIN(user_id) AS user_id, data1, data2, data3 -- alias needed so the UPDATE can reference user_id
FROM Data
GROUP BY data1, data2, data3
HAVING COUNT(*) > 1 -- at least two rows
AND COUNT(*) = COUNT(DISTINCT user_id) -- all user_ids must be different
AND TIMESTAMPDIFF(MINUTE, MIN(`datetime`), MAX(`datetime`)) <= 45; -- all rows within 45 minutes
(I'm not sure whether I used TIMESTAMPDIFF properly.)
Now we can update the flag in those rows where user_id is different:
UPDATE duplicates
INNER JOIN Data ON Data.data1 = duplicates.data1
AND Data.data2 = duplicates.data2
AND Data.data3 = duplicates.data3
AND Data.user_id != duplicates.user_id
SET Data.flag = 1;
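The two steps above can be tried out on a small sample before touching the real table. The following is only a sketch: it uses Python's built-in sqlite3 module as a stand-in for MySQL, so the multi-table UPDATE is rewritten with EXISTS, and the 45-minute TIMESTAMPDIFF check is left out. The sample rows and the helper name flag_copied_rows are made up for illustration.

```python
import sqlite3


def flag_copied_rows(conn):
    """Flag rows that share (data1, data2, data3) with a smaller user_id."""
    # Step 1: one row per duplicated triple, remembering the smallest user_id.
    conn.execute("""
        CREATE TEMP TABLE duplicates AS
        SELECT MIN(user_id) AS user_id, data1, data2, data3
        FROM Data
        GROUP BY data1, data2, data3
        HAVING COUNT(*) > 1
           AND COUNT(*) = COUNT(DISTINCT user_id)
    """)
    # Step 2: flag every row in a duplicated group except the one kept in step 1.
    conn.execute("""
        UPDATE Data
        SET flag = 1
        WHERE EXISTS (
            SELECT 1
            FROM duplicates d
            WHERE d.data1 = Data.data1
              AND d.data2 = Data.data2
              AND d.data3 = Data.data3
              AND d.user_id <> Data.user_id
        )
    """)
    return [row[0] for row in
            conn.execute("SELECT id FROM Data WHERE flag = 1 ORDER BY id")]


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Data (
        id INTEGER PRIMARY KEY, user_id TEXT,
        data1 INTEGER, data2 INTEGER, data3 INTEGER,
        flag INTEGER DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO Data (user_id, data1, data2, data3) VALUES (?, ?, ?, ?)",
    [("usr1", 3, 2, 2), ("usr2", 3, 2, 2), ("usr3", 3, 2, 2), ("usr4", 7, 1, 5)],
)
flagged = flag_copied_rows(conn)
print(flagged)
```

Running a SELECT against the duplicates table first, before the UPDATE, is a cheap way to review which groups would be touched.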

UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
If there are duplicate times involved, maybe try this:
UPDATE Data A
LEFT JOIN
(
SELECT user_id,data1,data2,data3,`datetime`,min(id) min_id
FROM Data GROUP BY user_id,data1,data2,data3,`datetime`
) B
ON A.id = B.min_id
SET A.flag = IF(ISNULL(B.min_id),1,0);
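As a rough illustration of what this "keep the lowest id per group" approach does, here is a sketch using Python's sqlite3 module in place of MySQL (the LEFT JOIN update is expressed with a CASE and a subquery, and the sample rows are invented). Because user_id is part of the GROUP BY, it flags repeated rows belonging to the same user, not rows shared between different users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Data (
        id INTEGER PRIMARY KEY, user_id TEXT,
        data1 INTEGER, data2 INTEGER, data3 INTEGER,
        flag INTEGER DEFAULT 0
    )
""")
conn.executemany(
    "INSERT INTO Data (user_id, data1, data2, data3) VALUES (?, ?, ?, ?)",
    [("usr1", 3, 2, 2), ("usr1", 3, 2, 2), ("usr2", 3, 2, 2)],
)

# Keep flag = 0 on the first (lowest id) row of each group, set 1 on the rest.
conn.execute("""
    UPDATE Data
    SET flag = CASE
        WHEN id IN (SELECT MIN(id)
                    FROM Data
                    GROUP BY user_id, data1, data2, data3)
        THEN 0 ELSE 1
    END
""")
flagged_ids = [row[0] for row in
               conn.execute("SELECT id FROM Data WHERE flag = 1 ORDER BY id")]
print(flagged_ids)
```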

Related

Mysql select optimization (huge db)

I have a SELECT request in MySQL that takes between 25 and 30s, which is extremely long, and I was wondering if you could help me speed it up.
CREATE TEMPORARY TABLE results(
id VARCHAR(30),
secondid VARCHAR(5),
allele VARCHAR(30),
translation VARCHAR(10),
level VARCHAR(20),
subgroup VARCHAR(20),
subgroup2 VARCHAR(20)
);
INSERT INTO results(id, secondid, allele, level) SELECT DISTINCT t1.id, t1.secondid, t1.texte, t3.texte
FROM database t1
JOIN database t2 ON t1.id=t2.id
JOIN database t3 ON t1.id=t3.id AND t1.secondid=t3.secondid
WHERE (t1.qualifier,t2.qualifier) = ("allele","organism") AND t3.qualifier = "level_length" AND t3.texte NOT REGEXP "X" AND t3.texte IS NOT NULL
AND t2.texte = ? AND t1.texte REGEXP ?
GROUP BY t1.texte;
UPDATE results SET translation = (SELECT t1.qualifier
FROM database t1
JOIN database t2 ON t1.id=t2.id AND t1.secondid=t2.secondid
JOIN database t3 ON t1.id=t3.id AND t1.secondid=t3.secondid
WHERE t1.qualifier IN ("protein","ncRNA","rRNA") AND t2.texte=results.allele AND t3.texte=results.level LIMIT 1);
UPDATE results SET subgroup = (SELECT t2.subgrp
FROM alleledb.alleleSubgroups t1
JOIN alleledb.subgroups t2 ON t1.subgroup=t2.subgroup
WHERE t1.gene=SUBSTRING_INDEX(results.allele, "*", 1) AND t1.species=? LIMIT 1);
ALTER TABLE results DROP id, DROP secondid;
SELECT * FROM results ORDER BY subgroup ASC, level ASC;
DROP TABLE results;
I need to go through many dbs to do the join (same id). The databases are huge, but the results to extract are quite small (less than 1% of the whole database). The majority of the results are stored in the same db, in different rows (with the same id and secondid). However, id and secondid are not unique to the rows I need to select; only the combination of the two is.
Thank you.
I would start by creating a proper composite index on your database table, first on
(qualifier, id, secondid, texte)
This will help your joins and the WHERE filtering, and the engine will NOT have to go back to the actual table rows, since the index already contains the data you are interested in.
Next, I would adjust the query/joins. Since you are specifically looking for the "allele" and "organism" qualifiers from t1 and t2 respectively, put those conditions on the respective aliases.
I have no idea what you are doing with your REGEXP "X" or "?" values for texte, but you'll figure that out after.
Here is how I would revise the queries:
insert into ...
SELECT DISTINCT
t1.id,
t1.secondid,
t1.texte,
t3.texte
FROM
database t1
JOIN database t2
ON t1.id = t2.id
AND t2.qualifier = 'organism'
JOIN database t3 ON
t1.id = t3.id
AND t1.secondid = t3.secondid
AND t3.qualifier = 'level_length'
WHERE
t1.qualifier = 'allele'
AND t1.texte REGEXP ?
-- I would move these t2 and t3 into the respective JOINs above directly.
AND t3.texte NOT REGEXP "X"
AND t3.texte IS NOT NULL
AND t2.texte = ?
GROUP BY
t1.texte;
A second index on (id, secondid) will help the joins to t2 and t3 in your UPDATE commands, since those joins have no qualifier condition.
Also for the UPDATE commands, as Rick mentioned, without an ORDER BY clause you have no guarantee WHICH record is returned by the LIMIT 1.
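To make the LIMIT 1 point concrete, here is a small sketch using Python's sqlite3 module rather than MySQL. The table allele_subgroups, its priority column and the sample rows are all hypothetical; they only stand in for whatever column decides which subgroup should win. SQLite has no SUBSTRING_INDEX, so substr/instr is used to take the part before the "*".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE results (allele TEXT, subgroup TEXT);
    -- 'priority' is an assumed column that ranks candidate subgroups
    CREATE TABLE allele_subgroups (gene TEXT, subgrp TEXT, priority INTEGER);
    INSERT INTO results (allele) VALUES ('ECK*01');
    INSERT INTO allele_subgroups VALUES ('ECK', 'B', 2), ('ECK', 'A', 1);
""")

# With ORDER BY in the subquery, LIMIT 1 always picks the same row.
conn.execute("""
    UPDATE results
    SET subgroup = (
        SELECT s.subgrp
        FROM allele_subgroups s
        WHERE s.gene = substr(results.allele, 1, instr(results.allele, '*') - 1)
        ORDER BY s.priority, s.subgrp
        LIMIT 1
    )
""")
chosen = conn.execute("SELECT subgroup FROM results").fetchone()[0]
print(chosen)
```

Adding a second sort key (here subgrp) keeps the choice stable even when two rows share the same priority.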
First of all, thank you for all your help.
My first table (the one used by the INSERT and the first UPDATE, named database) looks like this:
I want everything shown in red. In other words, I need the parameters that have the same id and secondid as the "level", which is unique within an id, whereas other parameters may be repeated within the same id (with a different secondid).
I am filtering using the allele name (ECK in EC locus) with the REGEXP and the species. For example, all alleles from the EC locus of human.
Then (last update), I take one parameter (allele), take a substring of it and look it up in a database that gives me one id (one row -> one id). I then use this id on another database that gives me one or two rows (one subgroup, or rarely two). Since my example only has one group, the absence of ORDER BY went unnoticed. But yes, I want to order the results (get the subgroup that contains the allele first). I don't know how to do that.
Finally, I can try to create an index, but given the size of the db, I'm worried about how long it would take to build and how large it would be. Would it significantly improve the time, and can I remove it afterwards?
The REGEXP "X" is there to remove every match that is not relevant for this parameter (I don't want them).
The ? is user input (the species, which occurs twice, and the locus).
The operations on the first database take 30s; the last operation on the two databases lasts 1-2s. The others (drop, select) take <20ms (not the problem).

Optimising a SQL query with a huge where clause

I am working on a system (with Laravel) where users can fill a few filters to get the data they need.
Data is not prepared in real time. Once the filters are set, a job is pushed to the queue, and once the query finishes a CSV file is created. The user then receives an email with the file so that they can download it.
I have seen some errors in the jobs where it took longer than 30 minutes to process one job, and when I checked I saw that some users had created filters with more than 600 values.
These filter values are translated like this:
SELECT filed1,
field2,
field6
FROM table
INNER JOIN table2
ON table.id = table2.cid
/* this is how we try not to give same data to the users again so we used NOT IN */
WHERE table.id NOT IN(SELECT data_id
FROM data_access
WHERE data_user = 26)
AND ( /* this bit is auto populated with the filter values */
table2.filed_a = 'text a'
OR table2.filed_a = 'text b'
OR table2.filed_a = 'text c' )
Well, I was not expecting users to go wild and fine-tune with a huge filter set. It is okay for them to do this, but I need a solution to make this query quicker.
One way is to create a temp table on the fly with the filter values and convert the query to an INNER JOIN, but I'm not sure whether it would improve the performance.
Also, on a normal day the system would need to create at least 40-ish temp tables and delete them afterwards. Would this become another issue in the long run?
I would love to hear any other suggestions, other than the temp table method, that may help me solve this issue.
I would suggest writing the query like this:
SELECT ?.filed1, ?.field2, ?.field6 -- qualify column names (but no effect on performance)
FROM table t JOIN
table2 t2
ON t.id = t2.cid
WHERE NOT EXISTS (SELECT 1
FROM data_access da
WHERE t.id = da.data_id AND da.data_user = 26
) AND
t2.filed_a IN ('text a', 'text b', 'text c') ;
Then I would recommend indexes. Most likely:
table2(filed_a, cid)
table(id) (may not be necessary if id is already the primary key)
data_access(data_id, data_user)
You can test this as your own query. I don't know how to get Laravel to produce this (assuming it meets your performance objectives).
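For a quick experiment outside Laravel, the NOT EXISTS plus IN form can be built with one bound placeholder per filter value. This is only a sketch using Python's sqlite3 module as a stand-in for MySQL; the table names items and item_details (replacing the placeholder names table and table2), the sample rows and the helper fetch_unseen are assumptions for illustration.

```python
import sqlite3


def fetch_unseen(conn, user_id, filter_values):
    """Rows matching any filter value that this user has not received yet."""
    if not filter_values:
        return []  # an empty IN () list is not valid in MySQL
    placeholders = ", ".join("?" for _ in filter_values)
    sql = f"""
        SELECT t.id, t2.filed_a
        FROM items t
        JOIN item_details t2 ON t.id = t2.cid
        WHERE NOT EXISTS (SELECT 1
                          FROM data_access da
                          WHERE da.data_id = t.id AND da.data_user = ?)
          AND t2.filed_a IN ({placeholders})
        ORDER BY t.id
    """
    return conn.execute(sql, [user_id, *filter_values]).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY);
    CREATE TABLE item_details (cid INTEGER, filed_a TEXT);
    CREATE TABLE data_access (data_id INTEGER, data_user INTEGER);
    -- the indexes suggested above
    CREATE INDEX idx_details ON item_details (filed_a, cid);
    CREATE INDEX idx_access ON data_access (data_id, data_user);
    INSERT INTO items VALUES (1), (2), (3);
    INSERT INTO item_details VALUES (1, 'text a'), (2, 'text b'), (3, 'text z');
    INSERT INTO data_access VALUES (1, 26);
""")
rows = fetch_unseen(conn, 26, ["text a", "text b", "text c"])
print(rows)
```

Only the placeholders are interpolated into the SQL string; the filter values themselves are passed as bound parameters.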

Compare differences in 2 tables

I am running a MySQL Server on Ubuntu, patched up to date...
In MySQL, I have 2 tables in a database. I am trying to get a stock query change working and it kind of is, but it's not :(
What I have is a table (table A) that holds the last time I have checked stock levels, and another table (table B) that holds current stock levels. Each table has identical column names and types.
What I want to do is report on the changes from table B. The reason is that there are about 1/2 million items in this table - and I cannot just update each item using the table as a source as I am limited to 100 changes at a time. So, ideally, I want to get the changes - store them in a temporary table, and use that table to update our system with just those changes...
The following below brings back the changes but shows both Table A and Table B.
I have tried using a Left Join to only report back on Table B but I'm not a mysql (or any SQL) guy, and googling all this... Can anyone help please. TIA. Stuart
SELECT StockItemName,StockLevel
FROM (
SELECT StockItemName,StockLevel FROM stock
UNION ALL
SELECT StockItemName,StockLevel FROM stock_copy
) tbl
GROUP BY StockItemName,StockLevel
HAVING count(*) = 1
ORDER BY StockItemName;
The query below returns the records that have a different stock level in the two tables. It uses an INNER JOIN so that rows whose levels match are left out.
SELECT s.StockItemName, s.StockLevel AS StockLevel, sc.StockLevel AS CopyStockLevel
FROM stock s
INNER JOIN stock_copy sc ON sc.Id = s.Id AND sc.StockLevel <> s.StockLevel
ORDER BY s.StockItemName
OK, I solved it. As there wasn't a unique ID on each table that could be matched, rather than make one, I used 3 columns as the unique key and left joined on that.
SELECT sc.StockItem, sc.StockItemName, sc.Warehouse, sc.stocklevel
FROM stock s
LEFT JOIN stock_copy sc ON (sc.StockItem = s.StockItem AND sc.StockItemName = s.StockItemName AND sc.Warehouse = s.Warehouse AND sc.StockLevel <> s.StockLevel)
having sc.StockLevel is not Null;
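The composite-key comparison can be checked on a couple of rows. This is a sketch using Python's sqlite3 module instead of MySQL, with invented sample data; it uses an inner join with the level comparison in WHERE, which gives the same rows as the LEFT JOIN plus IS NOT NULL filter. Items that exist in only one of the tables are not reported by this join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stock (StockItem TEXT, StockItemName TEXT,
                        Warehouse TEXT, StockLevel INTEGER);
    CREATE TABLE stock_copy (StockItem TEXT, StockItemName TEXT,
                             Warehouse TEXT, StockLevel INTEGER);
    INSERT INTO stock VALUES ('A1', 'Widget', 'W1', 5), ('A2', 'Gadget', 'W1', 3);
    INSERT INTO stock_copy VALUES ('A1', 'Widget', 'W1', 7), ('A2', 'Gadget', 'W1', 3);
""")

# Match rows on the three-column key and keep only those whose level differs.
changed = conn.execute("""
    SELECT sc.StockItem, sc.StockItemName, sc.Warehouse, sc.StockLevel
    FROM stock s
    JOIN stock_copy sc
      ON sc.StockItem = s.StockItem
     AND sc.StockItemName = s.StockItemName
     AND sc.Warehouse = s.Warehouse
    WHERE sc.StockLevel <> s.StockLevel
    ORDER BY sc.StockItem
""").fetchall()
print(changed)
```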

How to select distinct rows from a table without a primary key

I need to show a notification on user login if there are any unread messages. So if multiple users send messages (5 each, say) while the user is offline, these messages should be shown on login, which means I have to show the last message from each user.
I use a join to find the records.
In this scenario, the sending user (MessageFrom) is not a primary key.
This is my query
SELECT
UserMessageConversations.MessageFrom, UserMessageConversations.MessageFromUserName,
UserMessages.MessageTo, UserMessageConversations.IsGroupChat,
UserMessageConversations.IsLocationChat,
UserMessageConversations.Message, UserMessages.UserGroupID,UserMessages.LocationID
FROM
UserMessageConversations
LEFT OUTER JOIN
UserMessages ON UserMessageConversations.UserMessageID = UserMessages.UserMessageID
WHERE
UserMessageConversations.MessageTo = 743
AND UserMessageConversations.ReadFlag = 0
This is the output obtained from above query.
MessageFrom -582 appears twice. I need only one record of this user.
How is it possible
I'm not entirely sure I totally understand your question - but one approach would be to use a CTE (Common Table Expression).
With this CTE, you can partition your data by some criteria - i.e. your MessageFrom - and have SQL Server number all your rows starting at 1 for each of those partitions, ordered by some other criteria - this is the point that's entirely unclear from your question, whether you even care what the rows for each MessageFrom number are sorted on (do you have some kind of a MessageDate or something that you could order by?) ...
So try something like this:
;WITH PartitionedMessages AS
(
SELECT
umc.MessageFrom, umc.MessageFromUserName,
um.MessageTo, umc.IsGroupChat,
umc.IsLocationChat,
umc.Message, um.UserGroupID, um.LocationID ,
ROW_NUMBER() OVER(PARTITION BY umc.MessageFrom
ORDER BY MessageDate DESC) AS 'RowNum' -- MessageDate is assumed; the sort column is unclear from the question
FROM
dbo.UserMessageConversations umc
LEFT OUTER JOIN
dbo.UserMessages um ON umc.UserMessageID = um.UserMessageID
WHERE
umc.MessageTo = 743
AND umc.ReadFlag = 0
)
SELECT
MessageFrom, MessageFromUserName, MessageTo,
IsGroupChat, IsLocationChat,
Message, UserGroupID, LocationID
FROM
PartitionedMessages
WHERE
RowNum = 1
Here, I am selecting only the "first" entry for each "partition" (i.e. for each MessageFrom), ordered by an "imagined" MessageDate column so that the most recent (the newest) message would be selected.
Does that approach what you're looking for??
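The ROW_NUMBER() idea can be tried on a toy table. This is only a sketch: it runs on Python's sqlite3 module (window functions need SQLite 3.25 or newer) rather than SQL Server, and the messages table, its message_date column and the sample rows are assumptions, since the question does not show a date column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE messages (message_from INTEGER, message_to INTEGER,
                           message TEXT, message_date INTEGER,
                           read_flag INTEGER);
    INSERT INTO messages VALUES
        (582, 743, 'hi', 1, 0),
        (582, 743, 'are you there?', 2, 0),
        (601, 743, 'hello', 5, 0);
""")

# Number each sender's unread messages from newest to oldest, keep number 1.
latest = conn.execute("""
    WITH PartitionedMessages AS (
        SELECT message_from, message,
               ROW_NUMBER() OVER (PARTITION BY message_from
                                  ORDER BY message_date DESC) AS row_num
        FROM messages
        WHERE message_to = 743 AND read_flag = 0
    )
    SELECT message_from, message
    FROM PartitionedMessages
    WHERE row_num = 1
    ORDER BY message_from
""").fetchall()
print(latest)
```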
If you think of them as same rows, I assume you don't care about the message field.
In this case you can use the DISTINCT clause:
SELECT DISTINCT
UserMessageConversations.MessageFrom, UserMessageConversations.MessageFromUserName,
UserMessages.MessageTo, UserMessageConversations.IsGroupChat,
UserMessageConversations.IsLocationChat,
UserMessages.UserGroupID,UserMessages.LocationID
FROM
UserMessageConversations
LEFT OUTER JOIN
UserMessages ON UserMessageConversations.UserMessageID = UserMessages.UserMessageID
WHERE
UserMessageConversations.MessageTo = 743
AND UserMessageConversations.ReadFlag = 0
In general, with the DISTINCT clause you get one row for every distinct combination of the selected columns.
If your requirement is instead to show a single field for all the messages (for example, every message folded into a single message with a separator between them), you can use an aggregate function, but in SQL Server that does not seem to be so easy.

MySQL query to fetch most recent record

I've got four tables. The structure of these tables is shown below (I am only showing the relevant column names).
User (user_id)
User_RecordType (user_id, recordType_id)
RecordType (recordType_id)
Record (recordType_id, record_timestamp, record_value)
I need to find the most recent record_value for each RecordType that a given user has access to. Timestamps are stored as seconds since the epoch.
I can get the RecordTypes that the user has access to with the query:
SELECT RecordType.recordType_id
FROM User, User_RecordType, RecordType
WHERE User.user_id=User_RecordType.user_id
AND User_RecordType.recordType_id=RecordType.recordType_id;
What this query doesn't do is also fetch the most recent Record for each RecordType that the user has access to. Ideally, I'd like to do this all in a single query and without using any stored procedures.
So, can somebody please lend me some of their SQL-fu? Thanks!
SELECT
Record.recordType_id,
Record.record_value
FROM
Record
INNER JOIN
(
SELECT
recordType_id,
MAX(record_timestamp) AS `record_timestamp`
FROM
Record
GROUP BY
recordType_id
) max_values
ON
max_values.recordType_id = Record.recordType_id
AND
max_values.record_timestamp = Record.record_timestamp
INNER JOIN
User_RecordType
ON
User_RecordType.recordType_id = Record.recordType_id
WHERE
User_RecordType.user_id = ?
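Here is a small sketch of this "join against the per-group maximum" pattern, run with Python's sqlite3 module as a stand-in for MySQL. The table and column names follow the question; the sample rows and the helper latest_records are invented. If two records of one type share the same maximum timestamp, both are returned.

```python
import sqlite3


def latest_records(conn, user_id):
    """Most recent record_value for each record type the user can access."""
    return conn.execute("""
        SELECT Record.recordType_id, Record.record_value
        FROM Record
        INNER JOIN (
            SELECT recordType_id, MAX(record_timestamp) AS record_timestamp
            FROM Record
            GROUP BY recordType_id
        ) max_values
          ON max_values.recordType_id = Record.recordType_id
         AND max_values.record_timestamp = Record.record_timestamp
        INNER JOIN User_RecordType
          ON User_RecordType.recordType_id = Record.recordType_id
        WHERE User_RecordType.user_id = ?
        ORDER BY Record.recordType_id
    """, (user_id,)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE User_RecordType (user_id INTEGER, recordType_id INTEGER);
    CREATE TABLE Record (recordType_id INTEGER, record_timestamp INTEGER,
                         record_value TEXT);
    INSERT INTO User_RecordType VALUES (1, 10), (1, 20), (2, 30);
    INSERT INTO Record VALUES (10, 100, 'a'), (10, 200, 'b'),
                              (20, 150, 'c'), (30, 300, 'd');
""")
user1_latest = latest_records(conn, 1)
print(user1_latest)
```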