MySQL - find all duplicate records

I have a table with 55 columns. This table is going to be populated with data from a CSV file. I have created a PHP script which reads in the CSV file and inserts the records.
Whilst scanning through the CSV file I noticed there are some rows that are duplicates. I want to eliminate all duplicate records.
My question is: what would be the best way of doing this? I assume it will be one of these two options:
1. Remove / skip duplicate records at source, i.e. duplicate records never get inserted into the table.
2. Insert all records from the CSV file, then query the table to find and remove all duplicate records.
For option 1, would this be possible to do using MS Excel or even just a text editor?
For option 2, I came across some possible solutions, but surely this would result in a rather large query. I am looking for something short and simple. Is this at all possible to do?

A good way is to define a key for the table. A key is a set of fields that makes each record unique and that all other fields depend on. (In the worst case the key consists of every column in your table, but usually you can define a smaller one.) You can then let the database itself enforce that key, for example with a primary key constraint or a unique index, as sketched below.
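A minimal sketch of that approach; the table and column names here are hypothetical, with col_a and col_b standing for whatever fields form the key:

ALTER TABLE csv_import ADD UNIQUE KEY uq_record (col_a, col_b);
-- with the key in place, INSERT IGNORE silently skips duplicate rows:
INSERT IGNORE INTO csv_import (col_a, col_b, col_c)
VALUES ('x', 'y', 'z');

If you would rather be told about the duplicates, a plain INSERT will raise error 1062 instead of skipping them.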

Related

Insert statement yields different results than the select all in SQL

I'm currently trying to create a table of theater locations that only has three locations. I imported denormalized data that I tried to normalize with this statement:
insert into theater(`name`, email, address, phone)
select distinct theater, theater_email, theater_address, theater_phone
from denormalized_tickets;
When I comment out the first line and run just the SELECT, I get the result I'm looking for.
But when I query the theater table with select * from theater;, it returns each theater duplicated 12 times.
How should I solve this? Is there anything I'm overlooking?
As discussed in the comments above, INSERT creates new rows each time you execute it. If you do that multiple times, you may add more rows every time.
Vasya recommended creating a UNIQUE index to block new rows from being created with the same values. This may or may not be appropriate for a given table. For instance, what if you want to allow multiple rows to have the same values?
Another thing you might like to read about is MySQL's REPLACE statement. The syntax is similar to INSERT, but if the new row duplicates the value of a primary key or unique key column(s), it first deletes the old row and then inserts the new one. This won't help if you have no unique key defined, though, because how would MySQL know there's a conflict?
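For instance, a hedged sketch that assumes the theater name alone should be unique (pick the real natural key if it isn't), applied after first clearing out the rows that are already duplicated:

ALTER TABLE theater ADD UNIQUE KEY uq_theater_name (`name`);
-- re-running the import now replaces matching rows instead of stacking duplicates:
REPLACE INTO theater (`name`, email, address, phone)
SELECT DISTINCT theater, theater_email, theater_address, theater_phone
FROM denormalized_tickets;

Keep in mind that REPLACE deletes and re-inserts, so any auto-increment id on theater changes; INSERT IGNORE is the gentler choice when the old row should win.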

How to properly deal with long key constraints (longer than 3072 bytes) in MySQL?

I can see that there are similar questions and answers on SO regarding this problem.
I need to create a unique constraint on 7 columns together.
alter table ga_data_model add constraint uq_1234596 unique (portal_id,date,dimension,country,os,os_version,theme);
There have been various answers suggesting prefix keys to solve this issue. However, because of the nature of my data, simply indexing the first one or two characters is dangerous, as rows that only differ further in would collide as false duplicates. So such a solution won't work for me:
alter table ga_data_model add constraint uq_1234596 unique (portal_id,date(2),dimension(2),country(1),os(2),os_version(1),theme(2));
I was thinking of creating a new column in my table that contains the calculated hash of these columns, and creating my constraint on that. But this would mean that every time I want to insert something into the DB, I first need to do a SELECT on this column, calculate the hash for the new values, compare them, and then save or not save. I think this is a bit too expensive, considering that I will be doing a lot of write operations.
Has anyone had the same problem and found a better solution than the one I explained above?
Thanks!
every time I want to insert something into the DB, I first need to do a SELECT on this column, calculate the hash for the new values, compare them, and then save or not save
No - you just save it, and if you get a unique key violation then you already have the data. Also, implement the hash calculation as a table trigger; that way there's no backdoor for amending the data. A sketch follows below.
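A minimal sketch of that trigger route, assuming a new CHAR(40) row_hash column is added; SHA1 and the '|' separator are just one possible choice:

ALTER TABLE ga_data_model
  ADD COLUMN row_hash CHAR(40) NOT NULL,
  ADD CONSTRAINT uq_1234596 UNIQUE (row_hash);

CREATE TRIGGER ga_data_model_bi BEFORE INSERT ON ga_data_model
FOR EACH ROW
  SET NEW.row_hash = SHA1(CONCAT_WS('|',
      NEW.portal_id, NEW.date, NEW.dimension, NEW.country,
      NEW.os, NEW.os_version, NEW.theme));

A matching BEFORE UPDATE trigger closes the backdoor for edits. On MySQL 5.7+ a STORED generated column with a unique index achieves the same thing without triggers.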

inserting rows into a MySQL table containing a predefined Primary Key

I have been looking for a solution to this problem for some time and have not managed to find anything satisfactory. I know that similar problems have been answered many times, but the answers are usually workarounds rather than standard solutions.
The problem in my particular case is:
I have one table whose Primary Key is predefined, so it cannot be an auto increment. It is also used by several other tables as a foreign key. The columns are:
NID - my Primary Key
PID - the key from external source
Serial
bla1
bla2
NID is already in the ids table (the target table), not in the source table
PID is already in the source file/table, not in the target table
the other columns are in both tables
The pair NID-PID would be a unique match as these would be used further after matching.
Now I need to be able to insert values into this table on a weekly basis, as these would be sent to me in CSV/Excel files of hundreds of records, so some easy way would be best, especially one that makes the import easy to validate.
Since there is no auto increment PK, I get an error:
1062 - Duplicate entry '' for key 'NID'
I was thinking about creating a unique index on multiple fields, like:
CREATE UNIQUE INDEX unique_index ON ids (NID,PID);
But it did not work very well either:
1062 - Duplicate entry '107521' for key 'unique_index'
I also tried to create a separate table with the data to be imported, but I get the same error.
The question is: what is the best way to insert records into a table that contains a PK, and to continue doing so on a regular basis without altering existing data? What should I do to achieve this?
I would really appreciate any help since I'm stuck.
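A hedged sketch of the separate-table route, assuming Serial is the column both sides share and that the weekly file only supplies PID values for rows already present in ids:

CREATE TABLE ids_staging (
  PID INT,
  Serial VARCHAR(64),  -- hypothetical types; match the real ids table
  bla1 VARCHAR(64),
  bla2 VARCHAR(64)
);
LOAD DATA LOCAL INFILE 'weekly.csv'
INTO TABLE ids_staging
FIELDS TERMINATED BY ',' IGNORE 1 LINES;
-- match on the shared column and fill in PID; NID is never written,
-- so the 1062 error on the primary key cannot occur:
UPDATE ids t
JOIN ids_staging s ON t.Serial = s.Serial
SET t.PID = s.PID;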

Updating existing lines in MySQL and treating Duplicate Keys

I have a MySQL database containing data about users of an application. This application is already in production, but improvements are added every day. The last improvement I made changed the way data is collected and inserted into the database.
Just to be clearer, my database is composed of 5 tables containing user data and 1 table that relates all of them through foreign keys. These 5 foreign keys, together, form my Unique Index on this "Main Table".
The issue is that one of the tables containing user data changed its format, and I want to remove all the data older than the modification I made in my application (just from this table; the others I need to keep untouched). However, this dataset has foreign keys in the main table, and I can't just drop those lines from the main table because the other information I have there is important. I tried changing the value of the foreign key for this specific table, but then, obviously, I ran into a problem with duplicated indexes.
Reading on the internet, I found a solution to my problem using "Insert ... On duplicate key update ...", but I'm not inserting data, just updating it. I have an idea of how to write a PHP program to update my database, but is there an easier solution? Is it possible to avoid these problems using just MySQL syntax?
It might be worth looking at the link below:
http://www.kavoir.com/2009/05/mysql-insert-if-doesnt-exist-otherwise-update-the-existing-row.html
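The pattern from that link looks like the sketch below; main_table and data_col are hypothetical stand-ins, and since the five foreign keys already match an existing row, the statement behaves as a pure update:

INSERT INTO main_table (fk1, fk2, fk3, fk4, fk5, data_col)
VALUES (1, 2, 3, 4, 5, 'new value')
ON DUPLICATE KEY UPDATE data_col = VALUES(data_col);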

overwrite mysql table data

I have a web crawler. The web crawler gathers links from the web pages I give it, but when it retrieves the links, some are duplicated because of how the website is built. Is there a way in MySQL to overwrite a row if a new row is exactly the same as an old one?
Say I have http://www.facebook.com in a link field and I manage to pick up http://www.facebook.com again; I would like the latter to overwrite the old row, so that I don't have clashes in my search engine.
I'm assuming that you want to update a date_updated column if the URL already exists; otherwise there is no good reason to do an update.

INSERT INTO `scrapping_table` (`url`)
VALUES ('www.facebook.com')
ON DUPLICATE KEY UPDATE `date_updated` = NOW();  -- assumes a UNIQUE index on `url`
Look into ON DUPLICATE KEY actions:
http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html
Basically, make the columns you're concerned with a unique key, write your insert statement, and then add
ON DUPLICATE KEY UPDATE col = <overwriting value>
If your link field is unique, then you can use
INSERT INTO mytable (link_field, x_column, y_column)
VALUES ('www.facebook.com', 'something new for x', 'something new for y')
ON DUPLICATE KEY UPDATE x_column = 'something new for x', y_column = 'something new for y';
Just make sure your link field is unique. If your table has more unique fields than that, I suggest using the second method below instead, because the MySQL docs suggest avoiding an ON DUPLICATE KEY UPDATE clause on tables with multiple unique indexes.
Set your link field as unique, then before inserting a row try
SELECT primary_id FROM mytable WHERE link_field = 'www.facebook.com';
and count the number of rows returned:
=> if count > 0, UPDATE the row using the primary_id we just grabbed through the SELECT;
=> if count == 0, just insert your row.
(The flow is sketched below.)
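Here it is as plain statements; last_seen is a hypothetical column, and the counting happens in application code:

-- step 1: look for the link
SELECT primary_id FROM mytable WHERE link_field = 'www.facebook.com';
-- step 2a: a row came back, so update it by the id returned (42 stands in for it):
UPDATE mytable SET last_seen = NOW() WHERE primary_id = 42;
-- step 2b: no row came back, so insert instead:
INSERT INTO mytable (link_field, last_seen) VALUES ('www.facebook.com', NOW());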
Beware!!
When operating a web crawler that will probably find millions of links, you want to minimize the queries each "crawl" process fires...
Do you want to create a unique link table that will feed the bots, or do you want to prevent duplicate search results?
Unique URL pool table:
While crawling a page, save the URLs to an array (or list) and make sure (e.g. with !in_array()) that it holds only unique values; you will find that each page you crawl includes a lot of repeated links, so clean them up before going to SQL.
Convert the URLs to hashes (e.g. a "simhash" of 32 binary digits).
Now open a connection to the DB and check whether each hash exists; if it does, dump it! Don't update (that makes a second process). Match the links using the hashes over an indexed table; it will be far faster. A sketch of such a table follows.
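MD5 stands in here for the simhash mentioned above, since MySQL has no built-in simhash; the table name url_pool is hypothetical:

CREATE TABLE url_pool (
  url_hash BINARY(16) PRIMARY KEY,  -- fixed-width hash keeps the index compact
  url TEXT NOT NULL
);
-- duplicates are dumped, not updated, exactly as advised above:
INSERT IGNORE INTO url_pool (url_hash, url)
VALUES (UNHEX(MD5('http://www.facebook.com')), 'http://www.facebook.com');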
Preventing duplicate search results:
If you indexed the URLs with the methodology above, you should not find duplicate URLs; if you do, it means there is a problem in your crawling operation.
Even if you have duplicate values in another table and want to search it without returning duplicate results, you can use DISTINCT in your query.
Good luck!