Efficiently Removing duplicate rows - mysql

I have a table that should hold 28 million records, but it now holds 56 million because I assumed the LOAD DATA LOCAL INFILE command would ignore rows that were already in the table. Now I need a way to efficiently remove the duplicate rows. What is the best way to approach this?
If I do not want to touch my table, can I just select the unique rows with a statement like this:
select distinct (l1.lat, l2.lon) from A, B;

Select originals into a new/temp table, delete the 56 million records, insert your originals.
Example:
INSERT INTO new_fresh_table
SELECT a, b, c, d FROM table_with_dupes
GROUP BY a, b, c, d
If you've duped your IDs somehow (not sure how that's possible with a PK), you need to use GROUP BY on every single column. Write a SELECT against the metadata to generate that SELECT for you.
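As a rough sketch of that approach, here is the GROUP BY copy run from Python against an in-memory SQLite database. SQLite is only a stand-in for MySQL here and the sample rows are made up. The column list is read from the table metadata with PRAGMA table_info; in MySQL the equivalent lookup would go against information_schema.COLUMNS.

```python
import sqlite3

# SQLite stands in for MySQL; the sample data is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_with_dupes (a INTEGER, b INTEGER, c TEXT, d TEXT);
    CREATE TABLE new_fresh_table  (a INTEGER, b INTEGER, c TEXT, d TEXT);
""")
rows = [(1, 2, "x", "y"), (3, 4, "p", "q")]
# Simulate the accidental double load: every row ends up in the table twice.
conn.executemany("INSERT INTO table_with_dupes VALUES (?, ?, ?, ?)", rows + rows)

# Field 1 of each PRAGMA table_info row is the column name.
columns = [info[1] for info in conn.execute("PRAGMA table_info(table_with_dupes)")]
column_list = ", ".join(columns)

# Grouping by every column collapses exact duplicates to one row each.
conn.execute(
    f"INSERT INTO new_fresh_table "
    f"SELECT {column_list} FROM table_with_dupes GROUP BY {column_list}"
)

rows_before = conn.execute("SELECT COUNT(*) FROM table_with_dupes").fetchone()[0]
rows_after = conn.execute("SELECT COUNT(*) FROM new_fresh_table").fetchone()[0]
print(rows_before, rows_after)
```

Once the new table looks right, the old one can be emptied or swapped out as described above.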

You didn't specify how the records are duped. Is it Primary Key? Name? What?
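From O'Reilly's SQL Cookbook (highly recommended, even for SQL pros):
delete from dupes
where id not in ( select min(id) from dupes group by name )
In MySQL the subquery has to be wrapped in a derived table, because a DELETE cannot select directly from its own target table:
delete from dupes
where id not in ( select min_id from ( select min(id) as min_id from dupes group by name ) as keep )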
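Here is a small sketch of the keep-the-lowest-id delete, run from Python on an in-memory SQLite database with made-up rows. SQLite accepts the subquery on the target table as written; MySQL generally wants it wrapped in a derived table.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dupes (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO dupes (id, name) VALUES (?, ?)",
    [(1, "NAPOLEON"), (2, "DYNAMITE"), (3, "DYNAMITE"),
     (4, "SHE SELLS"), (5, "DYNAMITE")],
)

# Keep the lowest id for each name and delete every other row.
conn.execute("""
    DELETE FROM dupes
    WHERE id NOT IN (SELECT MIN(id) FROM dupes GROUP BY name)
""")

remaining = conn.execute("SELECT id, name FROM dupes ORDER BY id").fetchall()
print(remaining)
```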

If you cannot touch the table and have to use it as is, why don't you create a view that shows only the distinct records?
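A minimal sketch of the view idea, using Python with an in-memory SQLite database as a stand-in for MySQL. The table name locations and its lat/lon columns are assumptions for illustration.

```python
import sqlite3

# SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE locations (lat REAL, lon REAL)")
points = [(52.5, 13.4), (48.9, 2.4)]
# Every point is stored twice, as after an accidental double load.
conn.executemany("INSERT INTO locations VALUES (?, ?)", points + points)

# The view leaves the base table untouched and hides the duplicates.
conn.execute(
    "CREATE VIEW distinct_locations AS SELECT DISTINCT lat, lon FROM locations"
)

base_count = conn.execute("SELECT COUNT(*) FROM locations").fetchone()[0]
distinct_rows = conn.execute(
    "SELECT lat, lon FROM distinct_locations ORDER BY lat"
).fetchall()
print(base_count, distinct_rows)
```

Note that the view still does the DISTINCT work on every query, so it avoids touching the table but not the cost of deduplicating.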

Related

MySQL copy distinct values from table to table

One of my tables contains data (numbers) that I would like to copy to another table, but the problem is that the data is not unique: there can be 2 or more rows with the same value I would like to copy (I need to copy each number only once). The table has around 3 million records. Is there any efficient way to do this?
Would this work for you?
INSERT INTO destination_table (the_value_field) SELECT DISTINCT the_value_field FROM origin_table;
Suppose there are two columns a, b in your table:
INSERT INTO new_table (a, b)
SELECT a, b FROM old_table
GROUP BY a, b;
You can extend this with more columns.
This can be a slow process on a very large table.
An alternative is to copy all values into new_table first, using
INSERT INTO new_table SELECT * FROM old_table;
and then delete the duplicate records from the new table. This can be relatively faster and is sure to complete.
You can use SELECT DISTINCT to select only the unique values.
https://www.w3schools.com/sql/sql_distinct.asp
SELECT DISTINCT `val` FROM `table_name`

how to compare huge table of mysql

I have a huge MySQL table which contains more than 33 million records. How can I compare the records in my table to find the non-duplicate ones? Unfortunately a plain SELECT statement doesn't work, because the table is so large.
Please provide me a solution
First, create a snapshot of your database or of the tables you want to compare.
Optionally, you can also limit the range of data you want to compare, for example to only 3 years of data. This way your select query won't hog all the resources.
The snapshot will be a bunch of files, each representing a table and containing the primary key or business key of each record. (I am assuming you can compare data based on that key. If that's not the case, record all the fields in your file.)
Next, read each record from the file and do a select against the corresponding table. If there is more than one record, you know it is a duplicate.
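The read-a-key-then-select loop could look roughly like the sketch below. It uses Python with an in-memory SQLite database as a stand-in for MySQL. The table name records, the column business_key, and the list standing in for the snapshot file are all assumptions for illustration. An index on the key column keeps each lookup cheap.

```python
import sqlite3

# SQLite stands in for MySQL; names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (business_key TEXT, payload TEXT)")
conn.execute("CREATE INDEX idx_records_key ON records (business_key)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("k1", "a"), ("k2", "b"), ("k2", "b"), ("k3", "c")],
)

# Stand-in for the snapshot file: one distinct key per line.
snapshot_keys = ["k1", "k2", "k3"]

duplicate_keys = []
for key in snapshot_keys:
    # One small indexed lookup per key instead of one huge query.
    (matches,) = conn.execute(
        "SELECT COUNT(*) FROM records WHERE business_key = ?", (key,)
    ).fetchone()
    if matches > 1:
        duplicate_keys.append(key)

print(duplicate_keys)
```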
Thanks
Look at the explain plan and see what the DB is actually doing for the NOT IN.
You could try refactoring, with an index on subscriber as Roy suggested if necessary. I'm not familiar enough with MySQL to know whether the optimizer will execute these identically.
SELECT *
FROM contracts
WHERE NOT EXISTS
( SELECT 1
FROM edms
WHERE edms.subscriber=contracts.subscriber
);
-- or
SELECT C.*
FROM contracts AS C
LEFT JOIN edms AS E
ON E.subscriber = C.subscriber
WHERE E.subscriber IS NULL;
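To compare the variants on a toy data set, here is a sketch in Python using an in-memory SQLite database as a stand-in for MySQL (the sample rows are made up). It also shows one behavioral difference worth knowing: if the subquery column contains a NULL, NOT IN returns no rows at all, while NOT EXISTS and the LEFT JOIN form still return the unmatched ones.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contracts (subscriber INTEGER);
    CREATE TABLE edms (subscriber INTEGER);
    CREATE INDEX idx_edms_subscriber ON edms (subscriber);
    INSERT INTO contracts VALUES (1), (2), (3);
    INSERT INTO edms VALUES (1), (NULL);
""")

not_exists = [r[0] for r in conn.execute("""
    SELECT subscriber FROM contracts
    WHERE NOT EXISTS (SELECT 1 FROM edms
                      WHERE edms.subscriber = contracts.subscriber)
    ORDER BY subscriber
""")]

left_join = [r[0] for r in conn.execute("""
    SELECT C.subscriber
    FROM contracts AS C
    LEFT JOIN edms AS E ON E.subscriber = C.subscriber
    WHERE E.subscriber IS NULL
    ORDER BY C.subscriber
""")]

# The NULL in edms makes every NOT IN comparison unknown or false.
not_in = [r[0] for r in conn.execute("""
    SELECT subscriber FROM contracts
    WHERE subscriber NOT IN (SELECT subscriber FROM edms)
    ORDER BY subscriber
""")]

print(not_exists, left_join, not_in)
```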

Dropping duplicate MySQL rows based on column data

I have a table called sg with the following columns:
player_uuid, player_name, coins, kills, deaths, and wins
However, I ran into an issue that caused some duplicate rows, and some of those rows have been modified. So, I am wondering how to drop the rows with older data. That said...
How do I drop the duplicate rows where player_uuid is the same? But I only want to drop the rows where coins, kills, deaths, or wins is smaller than in its duplicate.
Example data: http://i.stack.imgur.com/Xieod.png
In this case, I want to keep the row with 46 deaths and delete the row with 43 deaths.
I couldn't come up with a single delete statement because of the way the data is structured, so here is one delete pass per column instead.
The way it works: for each UUID, determine which row is to be kept (the one with the max value of the given column), then join back on the table itself to determine which rows are not to be kept, store those in a temporary table, and delete everything marked in that temporary table from the main data table (called someTable). The benefit of this approach is that if you have more than one duplicate (3, 4, 5 rows and so on), they will also be deleted.
CREATE TEMPORARY TABLE tempTable AS
SELECT a.player_uuid, a.kills
FROM someTable a
LEFT JOIN (SELECT MAX(kills) AS kills, player_uuid, 1 AS keepRow
FROM someTable
GROUP BY player_uuid
) b ON a.player_uuid=b.player_uuid AND a.kills=b.kills
WHERE b.keepRow IS NULL; -- no match: this row does not hold the max for its UUID
DELETE a.* FROM someTable a, tempTable b
WHERE a.player_uuid=b.player_uuid AND a.kills=b.kills;
Repeat for the other columns (wins,coins,deaths) by replacing all kills with the other column names.
Always test delete code first :)
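One way to test it is on a throwaway copy with a handful of rows. The sketch below does that from Python with an in-memory SQLite database as a stand-in for MySQL; the sample rows are made up. SQLite has no multi-table DELETE, so the second step is written with EXISTS instead, which is an adaptation and not the MySQL syntax.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE someTable (player_uuid TEXT, player_name TEXT, kills INTEGER)"
)
conn.executemany(
    "INSERT INTO someTable VALUES (?, ?, ?)",
    [("u1", "alice", 43), ("u1", "alice", 46),
     ("u2", "bob", 10),
     ("u3", "carol", 5), ("u3", "carol", 7), ("u3", "carol", 6)],
)

conn.executescript("""
    -- Step 1: collect rows that do not carry the max kills for their UUID.
    CREATE TEMPORARY TABLE tempTable AS
    SELECT a.player_uuid, a.kills
    FROM someTable a
    LEFT JOIN (SELECT player_uuid, MAX(kills) AS kills
               FROM someTable GROUP BY player_uuid) b
      ON a.player_uuid = b.player_uuid AND a.kills = b.kills
    WHERE b.player_uuid IS NULL;

    -- Step 2: delete everything that was marked.
    DELETE FROM someTable
    WHERE EXISTS (SELECT 1 FROM tempTable t
                  WHERE t.player_uuid = someTable.player_uuid
                    AND t.kills = someTable.kills);
""")

remaining = conn.execute(
    "SELECT player_uuid, kills FROM someTable ORDER BY player_uuid"
).fetchall()
print(remaining)
```

The player with a single row (u2 here) is left alone, and the one with three rows is reduced to the highest value.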
Also, while you are at it, add a unique index to prevent this from happening again:
CREATE UNIQUE INDEX idx_st_nn_1 ON someTable(player_uuid);
When you then try to insert a faulty record, your code will just get an error in return. The best code to handle inserts in that case would be:
INSERT INTO someTable(player_uuid,kills) VALUES ('someplayer',1000)
ON DUPLICATE KEY UPDATE kills=1000;
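For a feel of how the unique index behaves, here is a sketch in Python with an in-memory SQLite database as a stand-in for MySQL. SQLite does not have ON DUPLICATE KEY UPDATE, so the helper upsert_kills (a hypothetical name) catches the uniqueness error and falls back to an UPDATE. In MySQL the single statement above does the same job in one round trip.

```python
import sqlite3

# SQLite stands in for MySQL; the starting row is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE someTable (player_uuid TEXT, kills INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx_st_nn_1 ON someTable (player_uuid)")
conn.execute("INSERT INTO someTable (player_uuid, kills) VALUES ('someplayer', 900)")


def upsert_kills(player_uuid, kills):
    """Insert the row, or update kills if the UUID already exists."""
    try:
        conn.execute(
            "INSERT INTO someTable (player_uuid, kills) VALUES (?, ?)",
            (player_uuid, kills),
        )
        return "inserted"
    except sqlite3.IntegrityError:
        # The unique index rejected the duplicate UUID.
        conn.execute(
            "UPDATE someTable SET kills = ? WHERE player_uuid = ?",
            (kills, player_uuid),
        )
        return "updated"


first = upsert_kills("someplayer", 1000)
second = upsert_kills("otherplayer", 5)
stored = conn.execute(
    "SELECT player_uuid, kills FROM someTable ORDER BY player_uuid"
).fetchall()
print(first, second, stored)
```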
What also helps is having a time indicator column; then only one delete would have to be executed:
ALTER TABLE someTable ADD COLUMN last_updated TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;
A timestamp column defined this way updates itself, so no code changes are required to use it.

Which one faster on Check and Skip Insert if existing on SQL / MySQL

I have read many articles about this. I want to hear from you.
My problem is:
A table: ID (INT, Unique, Auto Increment), Title (varchar), Content (text), Keywords (varchar)
My PHP code always inserts new records, but it must not accept a record that duplicates an existing Title or Keywords. So, the Title or Keywords can't be the primary key. My PHP code needs to check for existing records and insert around 10-20 records at a time.
So, I check like this:
SELECT * FROM TABLE WHERE TITLE=XXX
And if it returns nothing, then I do the INSERT.
I read some other posts, and some guy says:
INSERT IGNORE INTO Table values()
Another guy suggests:
SELECT COUNT(ID) FROM TABLE
If it returns 0, then do the INSERT.
I don't know which of those queries is faster.
And I have one more question: what is the difference between these queries, and which is faster?
SELECT COUNT(ID) FROM ..
SELECT COUNT(0) FROM ...
SELECT COUNT(1) FROM ...
SELECT COUNT(*) FROM ...
All of them show me the total number of records in the table, but I don't know whether MySQL thinks the number 0 or 1 refers to my ID field. Even if I do SELECT COUNT(1000), I still get the total number of records in my table, while my table only has 4 columns.
I'm using MySQL Workbench. Is there any option for testing speed in this app?
I would use the INSERT ... ON DUPLICATE KEY UPDATE command. One important comment from the documentation states that: "...if there is a single multiple-column unique index on the table, then the update uses (seems to use) all columns (of the unique index) in the update query."
So if there is a UNIQUE(Title,Keywords) constraint on the table in the example, then, you would use:
INSERT INTO table (Title,Content,Keywords) VALUES ('blah_title','blah_content','blah_keywords')
ON DUPLICATE KEY UPDATE Content='blah_content';
It should work, and it is one query to the database.
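As a sketch of letting the database do the duplicate check, the snippet below uses Python with an in-memory SQLite database as a stand-in for MySQL. The table name articles and the batch are made up. INSERT OR IGNORE is SQLite's spelling of MySQL's INSERT IGNORE; both rely on the UNIQUE constraint to skip rows that already exist, so no SELECT is needed before each insert.

```python
import sqlite3

# SQLite stands in for MySQL; table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        title TEXT,
        content TEXT,
        keywords TEXT,
        UNIQUE (title, keywords)
    )
""")

batch = [
    ("t1", "c1", "k1"),
    ("t2", "c2", "k2"),
    ("t1", "c1 again", "k1"),  # same title and keywords as the first row
]

# The unique constraint silently drops the third row.
conn.executemany(
    "INSERT OR IGNORE INTO articles (title, content, keywords) VALUES (?, ?, ?)",
    batch,
)

stored = conn.execute("SELECT title, content FROM articles ORDER BY id").fetchall()
print(stored)
```

If a duplicate of Title alone or Keywords alone should be rejected, two separate single-column UNIQUE constraints would be needed instead of the combined one.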
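SELECT COUNT(*) FROM ... is generally at least as fast as SELECT COUNT(ID) FROM ..., because COUNT(*) just counts rows, while COUNT(ID) counts only the rows where ID is not NULL. COUNT(0), COUNT(1) and COUNT(1000) count every row too, since the number is a constant expression that is never NULL, not a column position. For the insert itself, you can skip the check and build something like this: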
INSERT INTO table (a,b,c) VALUES (1,2,3)
ON DUPLICATE KEY UPDATE c=3;
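To see what the different COUNT forms actually count, here is a small sketch in Python with an in-memory SQLite database as a stand-in for MySQL (the table t and its rows are made up). COUNT of a constant counts every row, while COUNT of a column skips NULLs in that column.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, title TEXT, keywords TEXT)")
conn.executemany(
    "INSERT INTO t (id, title, keywords) VALUES (?, ?, ?)",
    [(1, "a", "k"), (2, "b", None), (3, "c", None)],
)

# A constant argument is never NULL, so it is counted once per row.
counts = conn.execute("""
    SELECT COUNT(*), COUNT(0), COUNT(1), COUNT(1000), COUNT(id), COUNT(keywords)
    FROM t
""").fetchone()
print(counts)
```

Only the last value differs, because two of the three rows have a NULL in keywords.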

SQL: Select Keys that doesn't exist in one table

I got a table with a normal setup of auto inc. ids. Some of the rows have been deleted so the ID list could look something like this:
(1, 2, 3, 5, 8, ...)
Then, from another source (Edit: Another source = NOT in a database) I have this array:
(1, 3, 4, 5, 7, 8)
I'm looking for a query I can use on the database to get the list of ID:s NOT in the table from the array I have. Which would be:
(4, 7)
Does such a query exist? My solution right now is either creating a temporary table so that the condition "WHERE table.id IS NULL" works, or, probably worse, using the PHP function array_diff to see what's missing after having retrieved all the ids from the table.
Since the list of ids is closing in on millions of rows, I'm eager to find the best solution.
Thank you!
/Thomas
Edit 2:
My main application is a rather simple table which is populated by a lot of rows. This application is administered using a browser, and I'm using PHP as the interpreter for the code.
Everything in this table is to be exported to another system (which is a 3rd-party product), and there's as yet no way of doing this besides manually using the import function in that program. It's also possible to insert new rows in the other system, although the agreed routine is to never ever do this.
The problem is then that my system cannot be 100% sure that the user did everything correctly after he/she pressed the "export" key, or that no rows have ever been created in the other system.
From the other system I can get a CSV file containing all the rows that system has. So, by comparing the CSV file and my table I can see if:
* There are any rows missing in the other system that should have been imported
* If someone has created rows in the other system
The problem isn't "solving it". It's finding the best solution, since there is so much data in the rows.
Thanks again!
/Thomas
We can use the MySQL NOT IN operator.
SELECT id
FROM table_one
WHERE id NOT IN ( SELECT id FROM table_two )
Edited
If you are getting the source from a CSV file, you can simply put the values in directly.
I am assuming that the CSV looks like 1,2,3,...,n:
SELECT id
FROM table_one
WHERE id NOT IN ( 1,2,3,...,n );
EDIT 2
Or, if you want to select the other way around, you can use mysqlimport to import the data into a temporary table in the MySQL database, retrieve the result, and delete the table.
Like this.
Create the table:
CREATE TABLE my_temp_table(
ids INT
);
Load the .csv file:
LOAD DATA LOCAL INFILE 'yourIDs.csv' INTO TABLE my_temp_table
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(ids);
Select the records:
SELECT ids FROM my_temp_table
WHERE ids NOT IN ( SELECT id FROM table_one );
Drop the table:
DROP TABLE IF EXISTS my_temp_table;
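The whole temp-table round trip can be sketched as follows, using Python with an in-memory SQLite database as a stand-in for MySQL. The ids are the ones from the question, and the io.StringIO object stands in for the exported CSV file. The rows are inserted with executemany here; with MySQL you would use LOAD DATA as shown above.

```python
import csv
import io
import sqlite3

# SQLite stands in for MySQL; the ids are the example values from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_one (id INTEGER PRIMARY KEY)")
conn.executemany(
    "INSERT INTO table_one (id) VALUES (?)", [(i,) for i in (1, 2, 3, 5, 8)]
)

# Stand-in for the CSV file exported from the other system.
csv_file = io.StringIO("1,3,4,5,7,8\n")
csv_ids = [int(value) for row in csv.reader(csv_file) for value in row]

conn.execute("CREATE TEMPORARY TABLE my_temp_table (ids INTEGER)")
conn.executemany(
    "INSERT INTO my_temp_table (ids) VALUES (?)", [(i,) for i in csv_ids]
)

# Ids that are in the CSV but not in the table.
missing_from_table = [r[0] for r in conn.execute("""
    SELECT ids FROM my_temp_table
    WHERE ids NOT IN (SELECT id FROM table_one)
    ORDER BY ids
""")]

# Ids that are in the table but not in the CSV.
missing_from_csv = [r[0] for r in conn.execute("""
    SELECT id FROM table_one
    WHERE id NOT IN (SELECT ids FROM my_temp_table)
    ORDER BY id
""")]

conn.execute("DROP TABLE IF EXISTS my_temp_table")
print(missing_from_table, missing_from_csv)
```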
What about using a left join? Something like this:
select second_table.id
from second_table
left join first_table on first_table.id = second_table.id
where first_table.id is null
You could also go with a sub-query; depending on the situation, it might, or might not, be faster, though:
select second_table.id
from second_table
where second_table.id not in (
select first_table.id
from first_table
)
Or with a not exists:
select second_table.id
from second_table
where not exists (
select 1
from first_table
where first_table.id = second_table.id
)
The function you are looking for is NOT IN (an alias for <> ALL)
The MYSQL documentation:
http://dev.mysql.com/doc/refman/5.0/en/all-subqueries.html
An Example of its use:
http://www.roseindia.net/sql/mysql-example/not-in.shtml
Enjoy!
The problem is that T1 could have a million rows or ten million rows, and that number could change, so you don't know how many rows your comparison table, T2, the one that has no gaps, should have, for doing a WHERE NOT EXISTS or a LEFT JOIN testing for NULL.
But the question is, why do you care if there are missing values? I submit that, when an application is properly architected, it should not matter if there are gaps in an autoincrementing key sequence. Even an application where gaps do matter, such as a check-register, should not be using an autoincrementing primary key as a synonym for the check number.
Care to elaborate on your application requirement?
OK, I've read your edits/elaboration. Synchronizing two databases where the second is not supposed to insert any new rows, but might do so, sounds like a problem waiting to happen.
Neither approach suggested above (WHERE NOT EXISTS or LEFT JOIN) is air-tight and neither is a way to guarantee logical integrity between the two systems. They will not let you know which system created a row in situations where both tables contain a row with the same id. You're focusing on gaps now, but another problem is duplicate ids.
For example, if both tables have a row with id 13887, you cannot assume that database1 created the row. It could have been inserted into database2, and then database1 could insert a new row using that same id. You would have to compare all column values to ascertain that the rows are the same or not.
I'd suggest therefore that you also explore GUID as a replacement for autoincrementing integers. You cannot prevent database2 from inserting rows, but at least with GUIDs you won't run into a problem where the second database has inserted a row and assigned it a primary key value that your first database might also use, resulting in two different rows with the same id. CreationDateTime and LastUpdateDateTime columns would also be useful.
However, a proper solution, if it is available to you, is to maintain just one database and give users remote access to it, for example, via a web interface. That would eliminate the mess and complication of replication/synchronization issues.
If a remote-access web-interface is not feasible, perhaps you could make one of the databases read-only? Or does database2 have to make updates to the rows? Perhaps you could deny insert privilege? What database engine are you using?
I have the same problem: I have a list of values from the user, and I want to find the subset that does not exist in another table. I did it in Oracle by building a pseudo-table in the select statement. Here's a way to do it in Oracle. Try it in MySQL without the "from dual":
-- find ids from user (1,2,3) that *don't* exist in my person table
-- build a pseudo table and join it with my person table
select pseudo.id from (
select '1' as id from dual
union select '2' as id from dual
union select '3' as id from dual
) pseudo
left join person
on person.person_id = pseudo.id
where person.person_id is null
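The same pseudo-table idea without "from dual" can be sketched from Python, with an in-memory SQLite database as a stand-in for MySQL. The person rows and the user_ids list are made up. The UNION list is generated with one placeholder per value, so the ids are passed as parameters and not pasted into the SQL text.

```python
import sqlite3

# SQLite stands in for MySQL; the rows are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (person_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO person (person_id) VALUES (?)", [(1,), (3,)])

# Values supplied by the user (hypothetical).
user_ids = [1, 2, 3]

# Build "SELECT ? AS id UNION SELECT ? AS id ..." with one placeholder per value.
pseudo_table = " UNION ".join("SELECT ? AS id" for _ in user_ids)
query = f"""
    SELECT pseudo.id
    FROM ({pseudo_table}) AS pseudo
    LEFT JOIN person ON person.person_id = pseudo.id
    WHERE person.person_id IS NULL
    ORDER BY pseudo.id
"""

missing_ids = [row[0] for row in conn.execute(query, user_ids)]
print(missing_ids)
```

This is fine for a short list; for lists approaching millions of ids, loading them into a temporary table is likely the more practical route.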