Remove (purge) duplicate/multiple records from MariaDB - MySQL

Briefly: the database is imported from a foreign source, so I cannot prevent duplicates; I can only prune and clean the database afterwards.
The foreign db changes daily, so I want to automate the pruning process.
It resides on:
MariaDB v10.4.6, managed predominantly through the phpMyAdmin GUI v4.9.0.1 (both pretty much up to date as of this writing).
This is a radio browsing database.
It has multiple columns, but only a few matter to me:
StationID (a unique entry number and the primary key; because of it the db never sees new entries as duplicates)
Name, url, home-page, country, etc.
There are no row numbers.
I want to remove entries with a duplicated url, based on the following:
a duplicated url may have a country attached, but some country values are NULL (= empty),
so I want to remove all duplicates except one that contains a country name, if there is such an entry; if there is none, keep just one url, regardless of name (names are multilingual, so some duplicated urls also have various names, which I do not care about).
StationID (unique number, but not consecutive; also the primary db key)
Name (variable, least important)
url (variable, but I do want to remove the duplicates)
country (variable, frequently NULL/empty; I want to eliminate the entries with empty values as far as possible)
One url has to stay by any means (not to be deleted).
I have tried a multitude of queries; some work for SELECT but do NOT work for DELETE, and some hang my machine when executed. Here are some queries I tried (remember, I use MariaDB, not Oracle or MS SQL):
SELECT * from `radio`.`Station`
WHERE (`radio`.`Station`.`Url`, `radio`.`Station`.`Name`) IN (
SELECT `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
HAVING COUNT(*) > 1)
This one should show all entries (not only one per group), but it hangs my machine.
This query gets me as close as possible:
SELECT *
FROM `radio`.`Station`
WHERE `radio`.`Station`.`StationID` NOT IN (
SELECT MAX(`radio`.`Station`.`StationID`)
FROM `radio`.`Station`
GROUP BY `radio`.`Station`.`Url`,`radio`.`Station`.`Name`,`radio`.`Station`.`Country`)
However this query lists more entries:
SELECT *, COUNT(`radio`.`Station`.`Url`) FROM `radio`.`Station` GROUP BY `radio`.`Station`.`Name`,`radio`.`Station`.`Url` HAVING (COUNT(`radio`.`Station`.`Url`) > 1);
But all of these queries group them and display only one row.
I also tried UNION, INNER JOIN, but failed.
WITH cte AS..., but phpMyadmin does NOT like this query, and mariadb cli also did not like it.
I also tried something of this kind, published at oracle blog, which did not work, and I really had no clue what was what in this function:
select *
from (
select f.*,
count(*) over (
partition by `radio`.`Station`.`Url`, `radio`.`Station`.`Name`
) ct
from `radio`.`Station` f
)
where ct > 1
I did not know what f.* was, and the query did not like ct.

Given
drop table if exists radio;
create table radio
(stationid int,name varchar(3),country varchar(3),url varchar(3));
insert into radio values
(1,'aaa','uk','a/b'),
(2,'bbb','can','a/b'),
(3,'bbb',null,'a/b'),
(4,'bbb',null,'b/b'),
(5,'bbb',null,'b/b');
You could give the null countries a unique value (using coalesce). Fortunately stationid is unique, so:
select t.stationid,t.name,t.country,t.url
from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry= coalesce(t.country,t.stationid);
Yields
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
Translated to a delete
delete t from radio t
join
(select url,max(coalesce(country,stationid)) cntry from radio t group by url) s
on s.url = t.url and s.cntry <> coalesce(t.country,t.stationid);
MariaDB [sandbox]> select * from radio;
+-----------+------+---------+------+
| stationid | name | country | url |
+-----------+------+---------+------+
| 1 | aaa | uk | a/b |
| 5 | bbb | NULL | b/b |
+-----------+------+---------+------+
2 rows in set (0.00 sec)
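One thing to watch with the join above: it keeps every row that carries the winning country value, so two rows that share both url and country would both survive. Below is a minimal sketch of a variant that keeps exactly one row per url and still prefers a row with a country. It runs on Python's built-in sqlite3 with an in-memory database purely as a stand-in for MariaDB; the `keepers` table name is hypothetical, and row 6 is an extra sample row added here to show the same-country case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not MariaDB
conn.executescript("""
CREATE TABLE radio (stationid INTEGER PRIMARY KEY, name TEXT, country TEXT, url TEXT);
INSERT INTO radio VALUES
  (1,'aaa','uk','a/b'),
  (2,'bbb','can','a/b'),
  (3,'bbb',NULL,'a/b'),
  (4,'bbb',NULL,'b/b'),
  (5,'bbb',NULL,'b/b'),
  (6,'ccc','uk','a/b');   -- extra row: same url AND same country as row 1

-- One keeper per url: the lowest stationid that has a country,
-- or the lowest stationid overall when every country is NULL.
CREATE TEMP TABLE keepers AS
SELECT url,
       COALESCE(MIN(CASE WHEN country IS NOT NULL THEN stationid END),
                MIN(stationid)) AS keep_id
FROM radio
GROUP BY url;

-- Delete everything that is not a keeper.
DELETE FROM radio WHERE stationid NOT IN (SELECT keep_id FROM keepers);
""")

remaining = conn.execute(
    "SELECT stationid, name, country, url FROM radio ORDER BY stationid"
).fetchall()
print(remaining)
```

With this sample data the survivors are stationid 1 (for a/b) and stationid 4 (for b/b). The two statements use only plain aggregates and a temporary table, so they should carry over to MariaDB, but that is untested here.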

Fix 2 problems at once:
Dup rows already in table
Dup rows can still be put in table
Do this for each table:
-- new, old and `real` are placeholder names; REAL is a reserved word, hence the backticks
CREATE TABLE new LIKE `real`;
ALTER TABLE new ADD UNIQUE(x,y); -- will prevent future dups
INSERT IGNORE INTO new -- IGNORE dups
SELECT * FROM `real`;
RENAME TABLE `real` TO old, new TO `real`;
DROP TABLE old;
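To show the copy-into-a-table-with-a-unique-index idea end to end, and one way to steer which duplicate survives, here is a small sketch. It rests on several assumptions: it uses Python's sqlite3 rather than MariaDB, so `INSERT OR IGNORE` stands in for `INSERT IGNORE` and the `CREATE TABLE ... LIKE` step is written out by hand. The table names `station` and `station_new`, the index name, and the sample rows are all hypothetical. The `ORDER BY` feeds rows that have a country first, so they win over NULL-country rows with the same url.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not MariaDB
conn.executescript("""
CREATE TABLE station (stationid INTEGER PRIMARY KEY, name TEXT, country TEXT, url TEXT);
INSERT INTO station VALUES
  (1,'aaa',NULL,'a/b'),
  (2,'bbb','can','a/b'),
  (3,'bbb',NULL,'b/b'),
  (4,'bbb',NULL,'b/b');

-- Empty copy with a unique index on url: this is what blocks future dups.
CREATE TABLE station_new (stationid INTEGER PRIMARY KEY, name TEXT, country TEXT, url TEXT);
CREATE UNIQUE INDEX station_new_url ON station_new (url);

-- Rows with a country sort first (country IS NULL evaluates to 0 for them),
-- so the first row inserted for each url is the preferred one.
INSERT OR IGNORE INTO station_new
SELECT stationid, name, country, url
FROM station
ORDER BY (country IS NULL), stationid;

-- Swap the tables.
ALTER TABLE station RENAME TO station_old;
ALTER TABLE station_new RENAME TO station;
DROP TABLE station_old;
""")

# A later import that repeats a url is silently skipped by the unique index.
conn.execute("INSERT OR IGNORE INTO station VALUES (9, 'zzz', 'uk', 'a/b')")

rows = conn.execute(
    "SELECT stationid, name, country, url FROM station ORDER BY stationid"
).fetchall()
print(rows)
```

Here station 2 is kept for a/b because it has a country, and station 3 is kept for b/b. Whether MariaDB processes an `INSERT IGNORE ... SELECT ... ORDER BY` in the same row order is worth checking on a copy of the data first.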

Related

SQL query to retrieve data from table

How to retrieve odd rows from the table?
In the Base table always Cr_id is duplicated 2 times.
Base table
I want a SELECT statement that retrieves only those c_id =1 where Cr_id is always first as shown in the output table.
Output table
Just see the base table and output table you should automatically know what I want, Thanx.
Just testing min date should be enough
drop table if exists t;
create table t(c_id int,cr_id int,dt date);
insert into t values
(1,56,'2020-12-17'),(56,56,'2020-12-17'),
(1,8,'2020-12-17'),(56,8,'2020-12-17'),
(123,78,'2020-12-17'),(1,78,'2020-12-18');
select c_id,cr_id,dt
from t
where c_id = 1 and
dt = (select min(dt) from t t1 where t1.cr_id = t.cr_id);
+------+-------+------------+
| c_id | cr_id | dt |
+------+-------+------------+
| 1 | 56 | 2020-12-17 |
| 1 | 8 | 2020-12-17 |
+------+-------+------------+
2 rows in set (0.002 sec)
What you're looking for could be "partition by", at least if you're working on MSSQL.
(In the future, please include more background; SQL is not just SQL.)
https://codingsight.com/grouping-data-using-the-over-and-partition-by-functions/
I have an old query lying around that can put a sorting index on data that lacks one, although the underlying reason is 99.9% sure to be bad data design.
Typically I use this query to remove bad data, but you may rewrite it as a join instead, so that you can identify the rows you need.
I'm not writing that join out here, to make the point that bad data design means more work when reading the data afterwards, which seems to be the real root cause here.
DELETE t
FROM
(
SELECT ROW_NUMBER () OVER (PARTITION BY column_1 ,column_2, column_3 ORDER BY column_1,column_2 ,column_3 ) AS Seq
FROM Table
)t
WHERE Seq > 1

MySQL select with all where and one or more where not

Table structure and data (I know data in IP/domain fields might not make much sense, but this is for illustration purposes):
rec_id | account_id | product_id | ip | domain | some_data
----------------------------------------------------------------------------
1 | 1 | 1 | 192.168.1.1 | 127.0.0.1/test | abc
2 | 1 | 1 | 192.168.1.1 | 127.0.0.1/other | xyz
3 | 1 | 1 | 192.168.1.2 | 127.0.0.1/test | ooo
Table has unique index ip_domain combined from ip and domain fields (so records with identical values in both fields can't exist).
In each case I know values for account_id, product_id, ip, domain fields, and I need to get other rows that have the SAME account_id, product_id values and one (or both) of ip, domain values are DIFFERENT.
Example: I know that account_id=1, product_id=1, ip=192.168.1.1, domain=127.0.0.1/test (so it matches rec_id 1), I need to select records with IDs 2 and 3 (because record 2 has different domain and record 3 has different ip).
So, I used query:
SELECT * FROM table WHERE
account_id='1' AND product_id='1' AND ip!='192.168.1.1' AND domain!='127.0.0.1/test'
Of course, it returned 0 rows. Looked at mysql multiple where and where not in and wrote:
SELECT * FROM table WHERE
account_id='1' AND product_id='1' AND installation_ip NOT IN ('192.168.1.1') AND installation_domain NOT IN ('127.0.0.1/test')
My guess is that this query is identical (just formatted different way), so 0 rows again. Found some more examples too, but none worked in my case.
The syntax is correct, but you're using the wrong logical operation
SELECT *
FROM table
WHERE account_id='1' AND product_id='1' AND
(ip != '192.168.1.1' OR domain != '127.0.0.1/test')
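The reasoning behind the fix is De Morgan's law: "not (ip matches AND domain matches)" is the same as "ip differs OR domain differs". A small sketch of the difference, using Python's sqlite3 as a stand-in for MySQL and a hypothetical table name `records` (since `table` itself is a reserved word), with the three rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not MySQL
conn.executescript("""
CREATE TABLE records (rec_id INTEGER PRIMARY KEY, account_id INTEGER,
                      product_id INTEGER, ip TEXT, domain TEXT, some_data TEXT);
INSERT INTO records VALUES
  (1, 1, 1, '192.168.1.1', '127.0.0.1/test',  'abc'),
  (2, 1, 1, '192.168.1.1', '127.0.0.1/other', 'xyz'),
  (3, 1, 1, '192.168.1.2', '127.0.0.1/test',  'ooo');
""")

known = (1, 1, "192.168.1.1", "127.0.0.1/test")  # the record we already have

# AND demands that BOTH ip and domain differ: no row qualifies.
with_and = [r[0] for r in conn.execute(
    "SELECT rec_id FROM records WHERE account_id = ? AND product_id = ? "
    "AND ip != ? AND domain != ? ORDER BY rec_id", known)]

# OR demands that AT LEAST ONE of them differs: rows 2 and 3 qualify.
with_or = [r[0] for r in conn.execute(
    "SELECT rec_id FROM records WHERE account_id = ? AND product_id = ? "
    "AND (ip != ? OR domain != ?) ORDER BY rec_id", known)]

print(with_and, with_or)  # [] [2, 3]
```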
Select * from table
Where ROWID <> myRowid
And account_id = '1'
And product_id = '1';
myRowid is the unique id that identifies each record (ROWID is a pseudo-column in DBMSs such as Oracle; in the table above, rec_id would play that role). Retrieve it with your first select statement and pass it back when running this one. This returns all the rows with account_id = 1 and product_id = 1 except the one you have selected.
If your inputs are not defined, or if you want a list, then you may want to look at the GROUP BY clause. Also, you may look at GROUP_CONCAT.
Query would be something like:
SELECT ACCOUNT_ID, PRODUCT_ID, GROUP_CONCAT(DISTINCT CONCAT(IP, '|', DOMAIN) SEPARATOR ','), COUNT(1)
FROM TABLE
GROUP BY ACCOUNT_ID, PRODUCT_ID
P.S.: I don't have MySQL installed, hence the query syntax is not verified.

Update table row value to a random row value from another table

I have 2 MySQL tables.
One table has a column that lists all the states
colStates | column2 | column 3
------------------------------
AK | stuff | stuff
AL | stuff | stuff
AR | stuff | stuff
etc.. | etc.. | etc..
The second table has a column(randomStates) with all NULL values that need to be populated with a randomly selected state abbreviation.
Something like...
UPDATE mytable SET `randomStates`= randomly selected state value WHERE randomStates IS NULL
Can someone help me with this statement. I have looked around at other posts, but I don't understand them.
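This works for me with trial data in SQLite:
UPDATE mytable
SET randomStates = (SELECT colStates FROM
(SELECT * FROM first_table ORDER BY RANDOM())
WHERE randomStates IS NULL)
WHERE randomStates IS NULL
The inner WHERE refers to the row being updated, which makes the subquery get evaluated again for each row. Without that wrapping SELECT you end up with the same random value inserted into all the NULL randomStates fields (i.e. if you just do SELECT colStates FROM first_table ORDER BY RANDOM() you don't get what you want). The outer WHERE leaves rows that already have a value untouched. Note that MySQL spells the function RAND(), not RANDOM().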
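A runnable sketch of the same idea, assuming SQLite through Python's sqlite3 (as in this answer) rather than MySQL. The three sample states, the `id` column and the pre-filled 'TX' row are made up for illustration. The subquery mentions `mytable.randomStates`, which is meant to make SQLite treat it as correlated and draw a fresh random state for each updated row; the outer WHERE keeps already populated rows untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE first_table (colStates TEXT);
INSERT INTO first_table VALUES ('AK'), ('AL'), ('AR');

CREATE TABLE mytable (id INTEGER PRIMARY KEY, randomStates TEXT);
INSERT INTO mytable (randomStates) VALUES (NULL), (NULL), (NULL), (NULL), ('TX');
""")

conn.execute("""
UPDATE mytable
SET randomStates = (SELECT colStates
                    FROM first_table
                    WHERE mytable.randomStates IS NULL  -- reference to the outer row
                    ORDER BY RANDOM()
                    LIMIT 1)
WHERE randomStates IS NULL                              -- only fill the gaps
""")

rows = conn.execute("SELECT id, randomStates FROM mytable ORDER BY id").fetchall()
print(rows)
```

The printed states differ from run to run. On MySQL the equivalent function is `RAND()`, and whether the subquery is evaluated once or per row there should be checked against real data.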

Distinct - which items are taken? The first or the last occurrence?

If I use following query:
SELECT DISTINCT comment FROM table;
And I have for example following data: (IDs are just there to SHOW the order...)
ID | comment
-------------
1 | comment1
2 | comment1
3 | comment2
4 | comment1
What I could get back are following three results:
Result 1:
1 | comment1
3 | comment2
Result 2:
3 | comment2
4 | comment1
Result 3:
order is unpredictable
Question 1:
Is the result independent of the platform? Can I make sure that I always get a predictable result?
Question 2:
I want to select all distinct comments and get only the NEWEST of each, meaning I always want to get result 2. Is it possible to achieve that? Maybe ordering by the key would affect the result?
Your query doesn't request the ID column, only the comment column:
SELECT DISTINCT comment FROM table;
In the result, the ID is not included, so the row each value comes from is irrelevant.
comment1
comment2
As for how it will sort them, I think it depends on index order. I'll do a test to confirm:
mysql> create table t (id int primary key, comment varchar(100));
mysql> insert into t values
-> (1, 'comment2'),
-> (2, 'comment1'),
-> (3, 'comment2'),
-> (4, 'comment1');
The default order is that of the primary key:
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment2 |
| comment1 |
+----------+
Whereas if we have an index on the requested column, it returns the values in index order:
mysql> create index i on t(comment);
mysql> select distinct comment from t;
+----------+
| comment |
+----------+
| comment1 |
| comment2 |
+----------+
I'm assuming the InnoDB storage engine, because everyone should be using InnoDB. ;-)
Your last question indicates that you really want a query that doesn't involve DISTINCT at all, but it's a greatest-n-per-group question. This type of question is very common, and it has been asked and answered hundreds of times on StackOverflow. Follow the link and read the many solutions.
You can experiment and see which of the unique rows is returned, and you can experiment and see which order they're returned in, but that will only show you how things turn out with your experimental table, today, under the current database engine version. Bottom line:
If you SELECT DISTINCT comment the id is immaterial because it's not in your SELECT
If you don't ORDER BY the database will determine the order.
If you want the most recent distinct comment with its ID, this will work every time (full disclosure: this replaces an earlier answer that works but was over-thinking the problem):
SELECT comment, MAX(id)
FROM myTable
GROUP BY comment
ORDER BY 2 DESC;
Note that the ORDER BY 2 DESC assumes that the higher the ID, the more recent the comment.
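For a quick check of that query against the four rows from the question, here is a small sketch using Python's sqlite3 as a stand-in for MySQL. The alias `newest_id` is added here only for readability; it replaces the positional `ORDER BY 2`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in, not MySQL
conn.executescript("""
CREATE TABLE myTable (id INTEGER PRIMARY KEY, comment TEXT);
INSERT INTO myTable VALUES
  (1, 'comment1'),
  (2, 'comment1'),
  (3, 'comment2'),
  (4, 'comment1');
""")

# One row per distinct comment, carrying the highest (newest) id of its group.
rows = conn.execute("""
SELECT comment, MAX(id) AS newest_id
FROM myTable
GROUP BY comment
ORDER BY newest_id DESC
""").fetchall()

print(rows)  # [('comment1', 4), ('comment2', 3)]
```

This yields the ids 4 and 3 that the question calls "Result 2", listed newest first.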
If you select a single distinct column, the other will not be returned.
select distinct column from table
is the same result as
select column from table group by column
In both these cases, the sort order of column is unpredictable; it depends on the execution plan, which may vary with larger amounts of data, different table structures, and different database versions.
To mimic your result, one would have to do:
select id, column from table group by column
which is an illegal grouping. If your SQL mode permits it to run, the id returned for each group is indeterminate.
If you mean select distinct * from table, then all distinct rows will be returned, in your case the whole table.

SQL checking duplicates in one column and deleting another

I need to delete around 300,000 duplicates in my database. I want to check the Card_id column for duplicates, then check for duplicate timestamps. Then delete one copy and keep one. Example:
| Card_id | Time |
| 1234 | 5:30 |
| 1234 | 5:45 |
| 1234 | 5:30 |
| 1234 | 5:45 |
So remaining data would be:
| Card_id | Time |
| 1234 | 5:30 |
| 1234 | 5:45 |
I have tried several different delete statements, and merging into a new table but with no luck.
UPDATE: Got it working!
Alright after many failures I got this to work for DB2.
delete from(
select card_id, time, row_number() over (partition by card_id, time) rn
from card_table) as A
where rn > 1
rn numbers the rows within each (card_id, time) group, so it goes above 1 only when there are duplicates. Every row with rn > 1, i.e. each extra copy, is deleted.
I strongly suggest you take this approach:
create temporary table tokeep as
select distinct card_id, time
from t;
truncate table t;
insert into t(card_id, time)
select *
from tokeep;
That is, store the data you want. Truncate the table, and then regenerate it. By truncating the table, you get to keep triggers and permissions and other things linked to the table.
This approach should also be faster than deleting many, many duplicates.
If you are going to do that, you ought to insert a proper id as well:
create temporary table tokeep as
select distinct card_id, time
from t;
truncate table t;
alter table t add column id int auto_increment primary key; -- MySQL requires an auto_increment column to be a key
insert into t(card_id, time)
select *
from tokeep;
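A minimal runnable sketch of the store, empty and reload steps, using the four sample rows from the question. It makes two assumptions: it uses Python's sqlite3 instead of the asker's database, and because SQLite has no TRUNCATE, a plain `DELETE FROM t` stands in for it. The surrounding transaction is an addition made here so the table is not left empty if the reload fails.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.executescript("""
CREATE TABLE t (card_id INTEGER, time TEXT);
INSERT INTO t VALUES
  (1234, '5:30'),
  (1234, '5:45'),
  (1234, '5:30'),
  (1234, '5:45');

BEGIN;
-- 1. Store one copy of every (card_id, time) pair.
CREATE TEMP TABLE tokeep AS
SELECT DISTINCT card_id, time FROM t;

-- 2. Empty the original table (stand-in for TRUNCATE TABLE t).
DELETE FROM t;

-- 3. Reload the de-duplicated rows.
INSERT INTO t (card_id, time)
SELECT card_id, time FROM tokeep;
COMMIT;
""")

rows = sorted(conn.execute("SELECT card_id, time FROM t").fetchall())
print(rows)  # [(1234, '5:30'), (1234, '5:45')]
```

Note that in MySQL, TRUNCATE TABLE causes an implicit commit, so there it cannot be rolled back the way the DELETE in this sketch can.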
If you don't have a primary key or candidate key, there is probably no way to do this with a single command. Try the solution below.
Create table with duplicates
select Card_id,Time
into COPY_YourTable
from YourTable
group by Card_id,Time
having count(1)>1
Remove duplicates using COPY_YourTable
delete from YourTable
where exists
(
select 1
from COPY_YourTable c
where c.Card_id = YourTable.Card_id
and c.Time = YourTable.Time
)
Copy data without duplicates
insert into YourTable
select Card_id,Time
from COPY_YourTabl