Remove duplicates from TWO columns - MySQL

Good morning Stackoverflowians,
I have a very big table with duplicates across two columns. That is, if the values in col1 and col2 on row b already appear in col1 and col2 on row a, I should keep only row a:
## table_1
col1 col2
1 10
1 10
1 10
1 11
1 11
1 12
2 20
2 20
2 21
2 21
# should return this table without duplicates
col1 col2
1 10
1 11
1 12
2 20
2 21
My previous code accounts only for col1, and I don't know how to write this query over two columns:
CREATE TABLE temp LIKE db.table_1;
INSERT INTO temp SELECT * FROM table_1 WHERE 1 GROUP BY col1;
DROP TABLE table_1;
ALTER TABLE temp RENAME table_1;
So I thought about this:
CREATE TABLE temp LIKE db.table_1;
INSERT INTO temp(col1,col2)
SELECT DISTINCT col1,col2 FROM table_1;
then drop and rename, as before.
But I'm not sure it's going to work, and MySQL tends to be unstable here; if it takes too long I will have to kill the query, and that may crash the server again .. T.T
We have 200,000,000 rows and all of them have at least one duplicate..
Any code suggestions? :)
Also, how long would it take? Minutes or hours?

You already know quite a lot :)
You can also try this:
Use INSERT IGNORE rather than INSERT. If a record doesn't duplicate an existing record, MySQL inserts it as usual. If the record is a duplicate, the IGNORE keyword tells MySQL to discard it silently without generating an error.
Read from the existing table and write into a new table using INSERT IGNORE. This way you can control the insert process depending on your resource usage.
Note that INSERT IGNORE needs a UNIQUE key covering (col1, col2) to detect the duplicates, and that ignored key violations are demoted to warnings rather than errors, so check SHOW WARNINGS (or the affected-rows count) to see how many rows were skipped.
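A minimal sketch of that approach (assuming table_1 holds only col1 and col2; adjust names and types to your schema):
-- Build an empty copy with a composite unique key, so a
-- duplicate (col1, col2) pair cannot be inserted twice.
CREATE TABLE temp LIKE table_1;
ALTER TABLE temp ADD UNIQUE KEY uq_col1_col2 (col1, col2);
-- INSERT IGNORE keeps the first occurrence of each pair and
-- demotes duplicate-key errors to warnings.
INSERT IGNORE INTO temp SELECT * FROM table_1;
-- Swap the tables only after verifying the row counts.
RENAME TABLE table_1 TO table_1_old, temp TO table_1;
DROP TABLE table_1_old;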

The DISTINCT clause is the way to go, but it will take a while to run on that many records. I'd add an auto-increment ID column as your PK. Then you can run the deduplication in stages that won't time out, as sketched below.
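A rough sketch of that staged approach (column types and batch size are assumptions; tune the ranges to your server):
-- Add a surrogate key so the work can be split into ranges.
ALTER TABLE table_1 ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY;
-- The unique key catches duplicate pairs that span batches.
CREATE TABLE temp (col1 INT, col2 INT, UNIQUE KEY uq_col1_col2 (col1, col2));
INSERT IGNORE INTO temp SELECT col1, col2 FROM table_1 WHERE id BETWEEN 1 AND 10000000;
INSERT IGNORE INTO temp SELECT col1, col2 FROM table_1 WHERE id BETWEEN 10000001 AND 20000000;
-- ...repeat for the remaining ranges, then drop and rename.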
Good luck and HTH
-- Joe

Related

"Filtering" huge MariaDB/Mysql table based on different table

Struggling with a large dataset in my MariaDB database. I have two tables: table A contains 57 million rows and table B around 500. Table B is a subset of ids related to a column in table A. I want to delete all rows from A which do not have a corresponding ID in table B.
Example table A:
classification_id  Name
20                 Mercedes
30                 Kawasaki
80                 Leitz
70                 HP
Example table B:
classification_id  Type
20                 car
30                 bike
40                 bus
50                 boat
So in this example the last two rows from table A would be deleted (or a mirror table would be made containing only the first two rows, that's also fine).
I tried to do the second one using an inner join but this query took a few minutes before giving an out of memory exception.
Any suggestions on how to tackle this?
Try this:
DELETE FROM `table A` WHERE classification_id NOT IN (SELECT classification_id FROM `table B`);
Since you say that the filter table contains a relatively small number of rows, your best bet would be creating a separate table that contains the same columns as the original table A and the rows that match your criteria, then replace the original table and drop it. Also, with this number of IDs you probably want to use WHERE IN () instead of joins - as long as the field you're using there is indexed, it will usually be way faster. Bringing it all together:
CREATE TABLE new_A AS
SELECT A.* FROM A
WHERE classification_id IN (SELECT classification_id FROM B);
RENAME TABLE A TO old_A, new_A to A;
DROP TABLE old_A;
Things to be aware of:
Backup your data! And test the queries thoroughly before running that DROP TABLE. You don't want to lose 57M rows of data because of a random answer at StackOverflow.
If A has any indexes or foreign keys, these won't be copied over, so you'll have to recreate them all manually. I'd recommend running SHOW CREATE TABLE A first and making note of its structure. Alternatively, you may consider creating new_A explicitly, using the output of SHOW CREATE TABLE A as a template, and then performing INSERT INTO new_A SELECT ... with the same query instead of CREATE TABLE new_A AS SELECT ..., as sketched below.
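A minimal sketch of that explicit-structure variant (the column definitions here are placeholders; paste the real ones from SHOW CREATE TABLE A):
-- Recreate the table with its original definition so indexes
-- survive the copy.
CREATE TABLE new_A (
  classification_id INT NOT NULL,
  Name VARCHAR(100),
  KEY idx_classification_id (classification_id)
);
INSERT INTO new_A
SELECT A.* FROM A
WHERE classification_id IN (SELECT classification_id FROM B);
RENAME TABLE A TO old_A, new_A TO A;
DROP TABLE old_A;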

Insert 100 million records into a MySQL database

I'm starting to learn MySQL and I need a database with 100 million records. I'm trying to use a basic loop, but that just takes too long. Can anyone show me how this could be done? Each record can just be a number or a bit, but there have to be 100 million of them.
1. DROP TABLE IF EXISTS my_table;
2. CREATE TABLE my_table(id SERIAL PRIMARY KEY);
3. INSERT INTO my_table VALUES(NULL);
4. INSERT INTO my_table SELECT NULL FROM my_table;
5. Repeat line 4 another 26 times (each run doubles the row count; 27 runs in total turn 1 row into 2^27, about 134 million rows).
Failing that, see http://datacharmer.blogspot.com/2006/06/filling-test-tables-quickly.html

A way to keep and update only one row with TYPE=3

The table:
ID   TYPE   USER_ID
===================
1    1      15
2    1      15
.    3      15
.    1      15
.
should keep multiple USER_ID's with TYPE=1 but only 0 or 1 row where TYPE=3.
In the case that TYPE=3, upon insert I need to either update or create (much like insert on duplicate key update) that row.
Is there a good way to accomplish this without first SELECTing, and updating or inserting depending on the SELECT results in the program?
Preferably doing this in a single command, and without triggers?
One way might be to add the new tuple, hold the id you just added in a variable, and then:
delete where type = 3 and id != {Added id}
It would work, but I want to make the disclaimer that it seems dodgy somehow.
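A rough sketch of that insert-then-delete idea (assuming the table is named tbl, ID is an AUTO_INCREMENT column, and we are working with USER_ID 15):
-- Insert the new TYPE=3 row, then remove any other TYPE=3
-- rows for that user inside one transaction.
START TRANSACTION;
INSERT INTO tbl (type, user_id) VALUES (3, 15);
SET @new_id = LAST_INSERT_ID();
DELETE FROM tbl WHERE type = 3 AND user_id = 15 AND id <> @new_id;
COMMIT;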
You can do the update with a subquery. In this case, because you are reading from and writing to the same table, you have to wrap the subquery in an aliased derived table; MySQL refuses to update a table it is selecting from directly.
Say you want to update the user_id of the first row with type=3 to 20:
UPDATE tbl SET user_id=20
WHERE id = (SELECT A.id
            FROM (SELECT MIN(id) AS id
                  FROM tbl
                  WHERE type=3) A);

MySQL, insert if all columns don't exist

I have a table with 3 columns. None of the columns is a unique key.
I want to run an insert only if a row doesn't already exist with the exact same values in each column.
given the following table:
a b c
----------
1 3 5
7 1 3
9 49 4
a=3 b=4 c=3 should insert
a=7 b=1 c=3 should not insert (a row with these exact values exists)
The solutions I have found so far need a unique primary key.
The most efficient way is to add a UNIQUE KEY to your table. You could also write an algorithm that compares the values, but you do not want to do that if you have many columns in your table.
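A minimal sketch of the unique-key approach (the table name t is assumed; the columns come from the question):
-- A composite unique key over all three columns makes the
-- exact-duplicate check the database's job.
ALTER TABLE t ADD UNIQUE KEY uq_abc (a, b, c);
-- Duplicates are now skipped; the error is demoted to a warning.
INSERT IGNORE INTO t (a, b, c) VALUES (7, 1, 3);  -- skipped, already present
INSERT IGNORE INTO t (a, b, c) VALUES (3, 4, 3);  -- inserted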
I'm not sure I get your point correctly, but I hope this helps.
First of all, SELECT the row with a WHERE clause:
SELECT * FROM table WHERE a=$a AND b=$b AND c=$c
Then fetch_array or fetch_row: if a row comes back, that means 'do not insert'.
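The check and the insert can also be combined into one statement with NOT EXISTS, which avoids the round trip (literal values and table name t used for illustration; without a unique key this is still not fully race-proof under concurrent writers):
-- Insert (3, 4, 3) only if no identical row is already there.
INSERT INTO t (a, b, c)
SELECT 3, 4, 3 FROM DUAL
WHERE NOT EXISTS (SELECT 1 FROM t WHERE a=3 AND b=4 AND c=3);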

MySQL table 30 million records insert into another

I have a 30 million record MySQL table.
It has about 20 columns, of which I will use 15 to insert into another table.
Now I can't use PHP to load this large dataset (selecting 30 million rows and loading them into memory isn't feasible). What would be the best method of loading all these records? MySQL 5.X.
I'm using EMS to connect to the database.
What about doing an INSERT INTO MySmallerTable SELECT Col1, col2, col3... FROM MyBiggerTable
It might be worth breaking it into multiple INSERT
Like:
INSERT INTO ... SELECT ... WHERE ID between 1 and 100000;
INSERT INTO ... SELECT ... WHERE ID between 100001 and 200000;
etc.
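A concrete sketch of that batched copy (column names are placeholders for the 15 you need; the ranges assume an indexed numeric ID, as above):
-- Each statement commits a bounded amount of work.
INSERT INTO MySmallerTable (col1, col2, col3)
SELECT col1, col2, col3
FROM MyBiggerTable
WHERE ID BETWEEN 1 AND 100000;
INSERT INTO MySmallerTable (col1, col2, col3)
SELECT col1, col2, col3
FROM MyBiggerTable
WHERE ID BETWEEN 100001 AND 200000;
-- ...continue until the maximum ID is covered.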
You can do this:
INSERT INTO new_table (`col1`,`col2`,`col3`) SELECT `oldcol1`,`oldcol2`,`oldcol3`
FROM old_table LIMIT 0,100000
and repeat it in a PHP loop, changing the LIMIT start value each time (add an ORDER BY on a unique column so the batches are deterministic and no row is skipped or copied twice).
There are a few ways you can do this, including the one user M_M provided above. I have not used EMS and am not sure what it can and can't do, but I have used Workbench extensively.
A.
Create the new destination table
Create a view on the source table with the columns of interest
Insert into the destination from the source with a simple INSERT INTO DESTINATION_TABLE SELECT * FROM SOURCE_VIEW (see the sketch below)
B.
Use mysqldump
Upload into new table
Alter table by dropping the columns you don't need
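A minimal sketch of option A (all names here are placeholders):
-- The view narrows the source to the 15 columns of interest.
CREATE VIEW source_view AS
SELECT col1, col2, col3   -- ...list the 15 columns you need
FROM old_table;
-- The copy then runs entirely server-side; no rows travel
-- through the client.
INSERT INTO new_table
SELECT * FROM source_view;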