I have a model called HeroStatus with the following attributes:
id
user_id
recordable_type
hero_type (can be NULL!)
recordable_id
created_at
There are over 100 hero_statuses, and a user can have many hero_statuses, but can't have the same hero_status more than once.
A user's hero_status is uniquely identified by the combination of recordable_type + hero_type + recordable_id. Essentially, there can't be a duplicate hero_status for a given user.
Unfortunately, I didn't have a validation in place to assure this, so I got some duplicate hero_statuses for users after I made some code changes. For example:
user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2010-05-03 18:30:30'

user_id = 18
recordable_type = 'Evil'
hero_type = 'Halitosis'
recordable_id = 1
created_at = '2009-03-03 15:30:00'

user_id = 18
recordable_type = 'Good'
hero_type = 'Hugs'
recordable_id = 1
created_at = '2009-02-03 12:30:00'

user_id = 18
recordable_type = 'Good'
hero_type = NULL
recordable_id = 2
created_at = '2009-12-03 08:30:00'
(The last two are obviously not duplicates; the first two are.) So what I want to do is get rid of the duplicate hero_statuses. Which one? The one with the most recent created_at.
I have three questions:
How do I remove the duplicates using a SQL-only approach?
How do I remove the duplicates using a pure Ruby solution? Something similar to this: Removing "duplicate objects".
How do I put a validation in place to prevent duplicate entries in the future?
For an SQL-only approach, I would use this query (I'm assuming the ids are unique):
DELETE FROM HeroStatus WHERE id IN
  (SELECT hs.id FROM
    (SELECT user_id, recordable_type, hero_type, recordable_id,
            MAX(created_at) AS created_at
       FROM HeroStatus
      GROUP BY user_id, recordable_type, hero_type, recordable_id
     HAVING COUNT(id) > 1) AS del
   INNER JOIN HeroStatus AS hs ON
        hs.user_id = del.user_id AND hs.recordable_type = del.recordable_type
    AND hs.hero_type = del.hero_type AND hs.recordable_id = del.recordable_id
    AND hs.created_at = del.created_at)
A bit of a monster! The query finds all duplicates using the natural key (user_id, recordable_type, hero_type, recordable_id) and, for each group, picks the largest created_at value (the most recently created row). It then finds the ids of those rows (by joining back to the main table) and deletes them. Two caveats: because hero_type can be NULL, the equality join will skip rows whose hero_type is NULL (MySQL's NULL-safe operator <=> handles those), and MySQL will not let you delete from a table that the same statement selects from, so you may need to wrap the subquery in one more derived table.
(Please try this on a copy of the table first and verify you get the results you want! :-)
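The question also asks for a pure-Ruby approach. A minimal sketch, assuming the HeroStatus model above; it loads every row into memory, so it is only reasonable for small tables like this one:

# Group records by the natural key, then destroy all but the oldest
# record in each group (the newer ones are the unwanted duplicates).
grouped = HeroStatus.all.group_by do |hs|
  [hs.user_id, hs.recordable_type, hs.hero_type, hs.recordable_id]
end
grouped.each_value do |group|
  next if group.size < 2
  group.sort_by(&:created_at).drop(1).each(&:destroy)
end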
To prevent this happening in future, add a unique index or constraint over the columns user_id, recordable_type, hero_type, recordable_id. E.g.
ALTER TABLE HeroStatus
ADD UNIQUE (user_id, recordable_type, hero_type, recordable_id)
EDIT:
You add (and remove) this index within a migration like this:
add_index(:hero_statuses, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true)
remove_index(:hero_statuses, :column => [:user_id, :recordable_type, :hero_type, :recordable_id])
Or, if you want to explicitly name it:
add_index(:hero_statuses, [:user_id, :recordable_type, :hero_type, :recordable_id], :unique => true, :name => :my_unique_index)
remove_index(:hero_statuses, :name => :my_unique_index)
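To cover the third question at the model level as well, here is a sketch of the ActiveRecord validation. The unique index above remains the real guarantee, since validates_uniqueness_of is subject to race conditions between concurrent requests:

class HeroStatus < ActiveRecord::Base
  # Rejects a second status with the same natural key for the same user.
  validates_uniqueness_of :recordable_id,
    :scope => [:user_id, :recordable_type, :hero_type]
end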
Sometimes you need to just roll up your sleeves and do some serious SQL to kill off all the rows you don't want. This is easy if it's a one-shot thing, and not too hard to roll into a Rake task you can fire on demand.
For instance, to pick one id per natural key, it is reasonable to use something like the following (taking the lowest id in each group, which is normally the oldest row):

SELECT MIN(id) AS id FROM hero_statuses
GROUP BY user_id, recordable_type, hero_type, recordable_id

Given that these are the rows you want to keep, you can go about removing all the ones you don't want. The extra derived table is needed on MySQL, which refuses to delete from a table the same statement selects from:

DELETE FROM hero_statuses WHERE id NOT IN
(SELECT id FROM
  (SELECT MIN(id) AS id FROM hero_statuses
   GROUP BY user_id, recordable_type, hero_type, recordable_id) AS keepers)
As with any operation that involves DELETE FROM, I hope you don't just fire this off on your production data without the usual precautions of backing things up.
As to how to prevent this in the future: since these columns form the natural key, create a unique index on them:

add_index :hero_statuses, [ :user_id, :recordable_type, :hero_type, :recordable_id ], :unique => true
This will cause ActiveRecord to raise an exception when you attempt to introduce a duplicate record. One benefit of a unique index is that you can make use of MySQL's "INSERT IGNORE ..." or "INSERT ... ON DUPLICATE KEY UPDATE ..." features to recover from potential duplications.
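On the Rails side, a sketch of recovering from the violation, assuming Rails 3+ (where the driver error surfaces as ActiveRecord::RecordNotUnique) and a hypothetical attrs hash holding the key columns:

begin
  HeroStatus.create!(attrs)
rescue ActiveRecord::RecordNotUnique
  # The row already exists; fetch it instead of creating a duplicate.
  HeroStatus.where(attrs).first
end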
I'm using MySQL. For a private-messaging system, I have two tables (plus one that lists the conversations):

msg_individus (= members of a conversation)
mi_mcid : id of the conversation
mi_uid : id of the user
mi_ustatus : status of the conversation for the user (opened or closed)
mi_datelecture : the last time (timestamp) the user opened the conversation

For now I indexed (mi_mcid, mi_uid) as the primary key.
msg_messages (= messages of the conversation)
msg_id : id of the message
msg_uid : id of the user who wrote the message
msg_mcid : id of the conversation
msg_text : content of the message
msg_timestamp : when the message was posted

For now I indexed msg_id as the primary key and msg_mcid as an index.
Here's the thing: I want to know whether there is a message unread by the user. For that, I compare the latest msg_timestamp with mi_datelecture; if the first is greater than the second, there's something new.
But for some reason the performance of this query is very bad, and I can't figure out how to index properly and how to structure the query to improve it.
This is what I built :
SELECT 1 FROM msg_messages as msg
WHERE msg.msg_uid != :u_id
AND msg.msg_status = "1"
AND msg.msg_mcid IN (SELECT mi.mi_mcid
FROM msg_individus as mi
WHERE mi.mi_uid = :uid
AND mi.mi_ustatus = "2"
AND mi.mi_datelecture < msg.msg_timestamp)
LIMIT 0,1
I tried to set some indexes on msg_status, mi_uid and mi_ustatus, for example, and even if things got a little better, performance is still sad, haha. When I don't compare mi_datelecture and msg_timestamp it takes about 0.05 s to run, versus about 0.20 s when I do.
Thank you for your advice.
(from Comment) New attempt:
SELECT 1
FROM msg_messages as msg
WHERE msg.msg_uid != :u_id
AND msg.msg_status = "1"
AND EXISTS
(
SELECT *
FROM msg_individus as mi
WHERE mi.mi_mcid = msg.msg_mcid
AND mi.mi_uid = :uid
AND mi.mi_ustatus = "2"
AND (mi.mi_datelecture = "0"
OR mi.mi_datelecture < msg.msg_timestamp)
)
LIMIT 0,1
Create this index (MySQL requires a name for it):

CREATE INDEX idx_mi_uid_ustatus_datelecture_mcid ON msg_individus
    (mi_uid, mi_ustatus, mi_datelecture, mi_mcid);
It is a covering index suitable for your subquery. The subquery can be satisfied completely from the index.
If you need more help, read this, then ask another question.
If all you need is "existence", use EXISTS ( SELECT 1 ... ) instead of LIMIT 1.
Change IN ( SELECT ... ) into either a JOIN or EXISTS ( SELECT 1 ... ); either is likely to be faster.
And see my comment that questions whether the query even provides the desired info.
Then we can, and should, discuss indexes.
How do you use a conditional OR in Thinking Sphinx?
The situation is:
I have a Message model with sender_id and recipient_id attributes. I would like to compose this query:
Message.where("sender_id = ? OR recipient_id = ?", business_id, business_id)
Right now I'm searching twice: once for all the messages that have recipient_id = business_id, and again for all the messages that have sender_id = business_id. Then I just merge the results.
I feel that there's a more efficient way to do this.
EDIT - Adding index file
ThinkingSphinx::Index.define :message, with: :active_record, delta: ThinkingSphinx::Deltas::DelayedDelta do
  # fields
  indexes body

  # attributes
  has job_id
  has sender_id
  has recipient_id
end
Sphinx doesn't allow for OR logic between attributes, only fields. However, a workaround would be to combine the two columns into a third attribute:
has [sender_id, recipient_id], :as => :business_ids, :multi => true
And then you can search on the combined values like so:
Message.search :with => {:business_ids => business_id}
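Putting it together, the index file from the question would look something like this sketch:

ThinkingSphinx::Index.define :message, with: :active_record, delta: ThinkingSphinx::Deltas::DelayedDelta do
  # fields
  indexes body

  # attributes
  has job_id
  # Combined multi-value attribute; one :with filter now matches either column.
  has [sender_id, recipient_id], :as => :business_ids, :multi => true
end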
I have 3 models as follows:
(I'm also describing the database structure so that anyone not familiar with Ruby on Rails is able to help me.)
Thread.rb

class Thread
  has_many :thread_profils
  has_many :profils, :through => :thread_profils
end

Table threads
integer: id (PK)

ThreadProfil.rb

class ThreadProfil
  belongs_to :thread
  belongs_to :profil
end

Table thread_profils
integer: id (PK)
integer: thread_id (FK)
integer: profil_id (FK)

Profil.rb

class Profil
end

Table profils
integer: id (PK)
In one of my controllers I am looking for the most optimized way to find the Thread ids that include exactly two profils (the current one and some other one).
I have current_profil.id and another profil's id, and I can't figure out a simple way to get that collection/list/array of Thread ids while issuing as few SQL queries as possible.
For now the only solution I have found is the following one, which I don't consider "optimized" at all.
thread_profils = ThreadProfil.where(:profil_id => current_profil.id)
thread_ids = thread_profils.map do |association|
  profils = Thread.find(association.thread_id).profils.map do |profil|
    profil.id if profil.id != current_profil.id
  end.compact
  if (profils - [id]).empty?
    association.thread_id
  end
end.compact
That is processing the following SQL queries :
SELECT `thread_profils`.* FROM `thread_profils` WHERE `thread_profils`.`profil_id` = [current_profil.id]
And for each result :
SELECT `threads`.* FROM `threads` WHERE `threads`.`id` = [thread_id] LIMIT 1
SELECT `profils`.* FROM `profils` INNER JOIN `thread_profils` ON `profils`.`id` = `thread_profils`.`profil_id` WHERE `thread_profils`.`thread_id` = [thread_id]
Is there any lightweight way to do that, either with Rails or directly with SQL?
Thanks
I found the following query in SQL:
SELECT array_agg(thread_id) FROM "thread_profils" WHERE "thread_profils"."profil_id" = 1 GROUP BY profil_id HAVING count(thread_id) = 2
Note: array_agg is a Postgres aggregate function. MySQL has group_concat, which would give you a comma-delimited string of ids instead of an array.
This sql was generated by the following Rails code:
ThreadProfil.select('array_agg(thread_id)').where(profil_id: 1).group(:profil_id).having('count(thread_id) = 2').take
This generates the right query, but the result is not meaningful as a ThreadProfil; still, you might be able to work further with this to get what you want.
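Note that the query above counts how many threads profil 1 belongs to, rather than how many profils each thread has. A sketch of a grouping that matches the stated requirement (threads whose participants are exactly the two given profils; other_profil is a hypothetical name for the second one):

# One query: keep thread ids that have exactly two rows, both of which
# belong to our two profils.
ThreadProfil.group(:thread_id)
  .having('COUNT(*) = 2 AND SUM(CASE WHEN profil_id IN (?, ?) THEN 1 ELSE 0 END) = 2',
          current_profil.id, other_profil.id)
  .pluck(:thread_id)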
I have a table with the following fields:
id | domainname | domain_certificate_no | keyvalue
An example of the output of a select statement:
'57092', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_1', '55525772666'
'57093', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_2', '22225554186'
'57094', '02a1fae.netsolstores.com', '02a1fae.netsolstores.com_3', '22444356259'
'97168', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_1', '55525772666'
'97169', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_2', '22225554186'
'97170', '02aa6aa.netsolstores.com', '02aa6aa.netsolstores.com_3', '22444356259'
I need to sanitize my db as follows: I want to remove the domain names whose first domain_certificate_no has a repeated keyvalue. In this example I look at the field domain_certificate_no = '02aa6aa.netsolstores.com_1'; since it is number 1 and its keyvalue repeats one that already exists, I want to remove the whole chain ('02aa6aa.netsolstores.com_2' and '02aa6aa.netsolstores.com_3') by deleting the domain name the chain belongs to, which is '02aa6aa.netsolstores.com'.
How can I automate the checking process for the whole DB? So I want a query that checks any domain name matching the pattern '%.%.%' EDIT: AND sharing the same parent domain name (in this example, netsolstores.com); if it finds that cert no. 1 belonging to this domain name has a repeated key value, it deletes as described; otherwise it does nothing. Please note that it is OK for a domain_certificate_no to have a repeated value if it is not number 1.
EDIT: I only compare the repeated keyvalues within the same second-level domain name. For example, in this question I compare the values that share the domain name .netsolstores.com. If I have another domain name with sublevel domains, I do the same there. The point is that I don't need to compare across the whole DB, only among the values with a shared domain name (but different subdomains).
I'm not sure what is supposed to happen to '02aa6aa.netsolstores.com_1' itself in your example.
For any repeated keyvalue, the following keeps only the rows belonging to the minimum domain name:
with tt as (
      select t.*,
             substr(domain_certificate_no,
                    instr(domain_certificate_no, '_') + 1, 1000) as version,
             left(domain_certificate_no, instr(domain_certificate_no, '_') - 1) as dcn
      from t
     )
select tt.*
from tt join
     (select keyvalue, min(dcn) as mindcn
      from tt
      group by keyvalue
     ) tsum
     on tt.keyvalue = tsum.keyvalue and
        tt.dcn = tsum.mindcn
For the data you provide, this seems to do the trick. This will not return the "_1" version of the repeats. If that is important, the query can be pretty easily modified.
Although I prefer to be more positive (thinking about the rows to keep rather than delete), the following should delete what you want:
with tt as (
      select t.*,
             substr(domain_certificate_no,
                    instr(domain_certificate_no, '_') + 1, 1000) as version,
             left(domain_certificate_no, instr(domain_certificate_no, '_') - 1) as dcn
      from t
     ),
     tokeep as (
      select tt.*
      from tt join
           (select keyvalue, min(dcn) as mindcn
            from tt
            group by keyvalue
           ) tsum
           on tt.keyvalue = tsum.keyvalue and
              tt.dcn = tsum.mindcn
     )
delete from t
where t.id not in (select id from tokeep)
There are other ways to express this that are possibly more efficient (depending on the database). This, though, keeps the structure of the original query.
By the way, when trying new DELETE code, be sure that you stash a copy of the table. It is easy to make a mistake with DELETE (and UPDATE). For instance, if you leave out the WHERE clause, all the rows will disappear, after the long painful process of logging all of them. You might find it faster to simply select the desired results into a new table, validate them, then truncate the old table and re-insert them.
Rails query:
Detail.created_at_gt(15.days.ago.to_datetime).find_each do |d|
  # some code
end
Equivalent MySQL query:
SELECT * FROM `details` WHERE (details.id >= 0) AND
(details.created_at > '2012-07-01 12:22:32')
ORDER BY details.id ASC LIMIT 1000
By using find_each, Rails checks for details.id >= 0 and orders the details in ascending order by id.
Here, I want to avoid those two actions, because in my case they make the query scan the whole table when I have a lot of data to process (i.e., the index on created_at is not used), so this is inefficient. Can anyone help?
Here is the source of find_in_batches, which find_each uses:
http://apidock.com/rails/ActiveRecord/Batches/find_in_batches
Click the Show source link.
The essential lines are:
relation = relation.reorder(batch_order).limit(batch_size)
records = relation.where(table[primary_key].gteq(start)).all
and
records = relation.where(table[primary_key].gt(primary_key_offset)).to_a
You must order records by the primary key or another unique index to process them in batches and to select the next batch.
You can't batch by created_at because it is not unique, but you can mix ordering by created_at with selecting by the unique id:
relation = relation.reorder('created_at ASC, id ASC').limit(batch_size)
records = relation.where(table[primary_key].gteq(start)).all
# ...
while records.any?
  records_size = records.size
  primary_key_offset = records.last.id
  created_at_key = records.last.created_at
  yield records
  break if records_size < batch_size
  if primary_key_offset
    records = relation.where('created_at > :ca OR (created_at = :ca AND id > :id)',
                             :ca => created_at_key, :id => primary_key_offset).to_a
  else
    raise "Primary key not included in the custom select clause"
  end
end
If you are absolutely sure that no record with the same created_at value will be repeated more than batch_size times, you could use created_at as the only key in batch processing.
Either way, you need an index on created_at for this to be efficient.
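To make that reusable, here is a sketch wrapping the loop above in a class method (find_each_by_created_at is a hypothetical name; assumes Rails 3+ and an index on (created_at, id)):

class Detail < ActiveRecord::Base
  # Batch iteration keyed on (created_at, id) instead of id alone, so the
  # created_at index can drive the scan.
  def self.find_each_by_created_at(batch_size = 1000)
    relation = reorder('created_at ASC, id ASC').limit(batch_size)
    records = relation.to_a
    while records.any?
      records.each { |record| yield record }
      break if records.size < batch_size
      last = records.last
      records = relation.where('created_at > :ca OR (created_at = :ca AND id > :id)',
                               :ca => last.created_at, :id => last.id).to_a
    end
  end
end

Usage:

Detail.find_each_by_created_at { |detail| puts detail.id }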
Detail.where('created_at > ?', 15.days.ago.to_datetime).order('details.id ASC').limit(1000)
You don't have to explicitly check details.id >= 0, as Rails does that for you by default.
It would be better to use scopes and the Arel style of querying:
class Detail < ActiveRecord::Base
  table = self.arel_table
  scope :created_after, lambda { |date| where(table[:created_at].gt(date)).limit(1000) }
end
Then you can find 1000 records that were created after some date:
@details = Detail.created_after(15.days.ago.to_datetime)