Trying to delete duplicate rows based on a hash in MySQL

I'm trying to delete duplicate rows (which all have the same nid) based on the hash value.
I want to keep the initial (oldest) row for each hash.
For some reason, I get the error "You can't specify target table 'node_revision' for update in FROM clause".
I tried aliasing my tables, but that doesn't seem to work. What am I doing wrong?
delete from node_revision
WHERE nid NOT IN(SELECT MIN(nid) FROM node_revision GROUP BY hash)
(timestamp is just for illustration; I don't actually want it used in any queries)
| nid | hash | timestamp |
|-----|---------|-----------|
| 2 | 123456 | 123364600 |
| 2 | 123456 | 123364601 |
| 2 | 1234567 | 123364602 |
Rows 1 and 3 would survive in this case.

You can phrase this as a left join:
delete nr from node_revision nr left join
(SELECT MIN(nid) as minnid
FROM node_revision
GROUP BY hash
) nrkeep
on nr.nid = nrkeep.minnid
where nrkeep.minnid is null;
You can also "trick" MySQL into using the subquery:
DELETE FROM node_revision
WHERE nid NOT IN (SELECT minnid
FROM (SELECT MIN(nid) as minnid FROM node_revision GROUP BY hash
) t
);
MySQL has a well-documented limitation: a table that is being modified by an UPDATE or DELETE cannot also be read in a subquery of the same statement. This query gets around the limitation by wrapping the subquery in a derived table (t), so the list of minnids is materialized before the delete runs.
EDIT:
Based on the example now in the question, you should use timestamp as follows:
delete nr from node_revision nr left join
(SELECT hash, nid, min(timestamp) as mintimestamp
FROM node_revision
GROUP BY hash, nid
) nrkeep
on nr.hash = nrkeep.hash and
nr.nid = nrkeep.nid and
nr.timestamp = nrkeep.mintimestamp
where nrkeep.mintimestamp is null;
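As a rough, runnable illustration of the "keep the earliest row per hash" idea, the sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. That substitution is an assumption made only so the example runs anywhere: SQLite does not have MySQL's target-table restriction, and the row-value comparison needs SQLite 3.15 or newer. The rows are the ones from the question; the function name delete_duplicate_hashes is hypothetical.

```python
import sqlite3


def delete_duplicate_hashes(conn):
    """Keep only the earliest-timestamp row for each hash."""
    # The inner SELECT is wrapped in a derived table (alias t), mirroring
    # the MySQL workaround; SQLite itself would not need the wrapper.
    conn.execute(
        """
        DELETE FROM node_revision
        WHERE (hash, timestamp) NOT IN (
            SELECT hash, min_ts
            FROM (SELECT hash, MIN(timestamp) AS min_ts
                  FROM node_revision
                  GROUP BY hash) AS t
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node_revision (nid INTEGER, hash TEXT, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO node_revision VALUES (?, ?, ?)",
    [
        (2, "123456", 123364600),
        (2, "123456", 123364601),
        (2, "1234567", 123364602),
    ],
)

delete_duplicate_hashes(conn)
remaining = conn.execute(
    "SELECT nid, hash, timestamp FROM node_revision ORDER BY timestamp"
).fetchall()
print(remaining)
```

On the sample data this leaves the first and third rows, which matches what the question asks for.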

Related

Remove continuous duplicated values with different IDs in MySQL

I know there are a ton of similar questions about finding and removing duplicate values in MySQL, but my question is a bit different:
I have a table with the columns id, timestamp and price. A script scrapes data from another webpage and saves it in the database every 10 seconds. Sometimes the data ends up like this:
| id | timestamp | price |
|----|-----------|-------|
| 1 | 12:13 | 100 |
| 2 | 12:14 | 120 |
| 3 | 12:15 | 100 |
| 4 | 12:16 | 100 |
| 5 | 12:17 | 110 |
As you can see, the price 100 appears three times, and removing the row with ID = 4 will shrink the table without damaging data integrity. I need to remove consecutive duplicate records except the first one (the one with the lowest ID or timestamp).
Is there an efficient way to do it? (There are about a million records.)
I have edited my scraping script so it checks for a duplicated price before adding a row, but I still need to shrink and maintain my old data.

Since MySQL 8.0 you can use the window function LAG() as follows:
delete tbl.* from tbl
join (
-- use lag(price) to get the price from the previous row
select id, lag(price) over (order by id) price from tbl
) l
-- rows whose price equals the previous row's price will be deleted
on tbl.id = l.id and tbl.price = l.price;
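The LAG() approach above can be tried on the question's sample data with the sketch below. It runs against SQLite through Python's sqlite3 module, which is an assumption made purely for portability (window functions need SQLite 3.25 or newer). The DELETE is phrased with IN rather than MySQL's multi-table delete syntax, and the function name is hypothetical.

```python
import sqlite3


def delete_consecutive_duplicates(conn):
    """Delete each row whose price equals the price of the previous row by id."""
    conn.execute(
        """
        DELETE FROM tbl
        WHERE id IN (
            SELECT id
            FROM (SELECT id,
                         price,
                         LAG(price) OVER (ORDER BY id) AS prev_price
                  FROM tbl) AS l
            WHERE price = prev_price
        )
        """
    )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, timestamp TEXT, price INTEGER)"
)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?)",
    [
        (1, "12:13", 100),
        (2, "12:14", 120),
        (3, "12:15", 100),
        (4, "12:16", 100),
        (5, "12:17", 110),
    ],
)

delete_consecutive_duplicates(conn)
kept_ids = [row[0] for row in conn.execute("SELECT id FROM tbl ORDER BY id")]
print(kept_ids)
```

Only the row with id 4 is removed, because it is the only row whose price equals that of the row immediately before it.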

I am just grouping by price and keeping only one record per group. The lowest id gets displayed. I hope the query below helps.
select id,timestamp,price from yourTable group by price having count(price)>0;

My query is based on @Tim Biegeleisen's.
-- delete records
DELETE
FROM yourTable t1
-- where an older one with the same price exists
WHERE EXISTS (SELECT 1
FROM yourTable t2
WHERE t2.price = t1.price
AND t2.id < t1.id
-- and no record with a different price exists between this one and the older one
AND NOT EXISTS (SELECT 1
FROM yourTable t3
WHERE t1.price <> t3.price
AND t3.id > t2.id
AND t3.id < t1.id));
It deletes records for which an older record with the same price exists and no record with a different price lies between the two.
If the id column is not numeric and ascending, the same check could be done on the timestamp column.

MySQL - Table Query Inner Joining to itself

Consider the above query result,
Is there a way I can join the table to itself to get the following results?
POH_ID | JOH_ID | .............
-------------------------------------------
NULL | JOH_00000002 | .............
POH_00000002 | JOH_00000001 | .............
POH_00000001 | JOH_00000001 | .............
Meaning: if there's only a single row for a JOH_ID, I retrieve that row; if there is more than one row with the same JOH_ID, I retrieve the ones that have a POH_ID.
The result in the photo is a result of a query

You could find the count of rows for each joh_id and join it with the main table, keeping the rows that either are the only row for their joh_id or have a non-null poh_id:
select t.*
from your_table t
join (
select joh_id, count(*) as cnt
from your_table
group by joh_id
) t2 on t.joh_id = t2.joh_id
where t2.cnt = 1 or t.poh_id is not null;
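To see the count-and-join filter in action, here is a small sketch using Python's sqlite3 module as a stand-in for MySQL. The input rows are hypothetical, since the original screenshot is not available; they were chosen so that the query returns the three rows shown in the question. The ORDER BY is added only to make the output order predictable.

```python
import sqlite3

QUERY = """
SELECT t.poh_id, t.joh_id
FROM your_table AS t
JOIN (SELECT joh_id, COUNT(*) AS cnt
      FROM your_table
      GROUP BY joh_id) AS t2
  ON t.joh_id = t2.joh_id
WHERE t2.cnt = 1 OR t.poh_id IS NOT NULL
ORDER BY t.joh_id DESC, t.poh_id DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (poh_id TEXT, joh_id TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [
        (None, "JOH_00000002"),            # only row for this joh_id: kept
        ("POH_00000002", "JOH_00000001"),  # has a poh_id: kept
        ("POH_00000001", "JOH_00000001"),  # has a poh_id: kept
        (None, "JOH_00000001"),            # NULL poh_id in a multi-row group: dropped
    ],
)

result = conn.execute(QUERY).fetchall()
for row in result:
    print(row)
```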

Faster SQL query than join

I have a big table with more than 10,000 rows and it will grow to 1,000,000 in the near future, and I need to run a query which gives back a Time value for each keyword for each user. I have one right now which is quite slow because I use left joins and it needs one subquery / keyword:
SELECT rawdata.user, t1.Facebook_Time, t2.Outlook_Time, t3.Excel_time
FROM
rawdata left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Facebook_Time'
FROM rawdata
WHERE MainWindowTitle LIKE '%Facebook%'
GROUP by user)t1 on rawdata.user = t1.user left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Outlook_Time'
FROM rawdata
WHERE MainWindowTitle LIKE '%Outlook%'
GROUP by user)t2 on rawdata.user = t2.user left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Excel_Time'
FROM rawdata
WHERE MainWindowTitle LIKE '%Excel%'
GROUP by user)t3 on rawdata.user = t3.user
The table looks like this:
WindowTitle | StartTime | EndTime | User
------------|-----------|---------|---------
Form1 | DateTime | DateTime| user1
Form2 | DateTime | DateTime| user2
... | ... | ... | ...
Form_n | DateTime | DateTime| user_n
The output should look like this:
User | Keyword | SUM(EndTime-StartTime)
-------|-----------|-----------------------
User1 | 'Facebook'| 00:34:12
User1 | 'Outlook' | 00:12:34
User1 | 'Excel' | 00:43:13
User2 | 'Facebook'| 00:34:12
User2 | 'Outlook' | 00:12:34
User2 | 'Excel' | 00:43:13
... | ... | ...
User_n | ... | ...
And the question is, which is the fastest way in MySQL to do this?

I think your wildcard searches are probably what's slowing it down the most, since a LIKE pattern with a leading % cannot use an index on that column. Avoiding the sub-queries in favour of a straight join might also help, but the wildcard searches are far worse. Is there any way you could change the table to have a categoryName or categoryID that can be indexed and does not require a wildcard search? Like "where categoryName = 'Outlook'"
To optimize the data in your tables, add a categoryID (ideally this would reference a separate table, but let's just use arbitrary numbers for this example):
alter table rawdata add column categoryID int not null;
alter table rawdata add index (categoryID);
Then populate the categoryID field for the existing data:
update rawdata set categoryID=1 where MainWindowTitle like '%Outlook%';
update rawdata set categoryID=2 where MainWindowTitle like '%Facebook%';
-- etc...
Then change your insert to follow the same rules.
Then make your SELECT query like this (changed wild cards to categoryID):
SELECT rawdata.user, t1.Facebook_Time, t2.Outlook_Time, t3.Excel_time
FROM
rawdata left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Facebook_Time'
FROM rawdata
WHERE categoryID = 2
GROUP by user)t1 on rawdata.user = t1.user left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Outlook_Time'
FROM rawdata
WHERE categoryID = 1
GROUP by user)t2 on rawdata.user = t2.user left join
(SELECT user, sec_to_time(SuM(time_to_sec(EndTime-StartTime))) as 'Excel_Time'
FROM rawdata
WHERE categoryID = 3
GROUP by user)t3 on rawdata.user = t3.user
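Once a categoryID column exists, one possible variation (not the query from the answer above) is a single GROUP BY over user and categoryID, which produces the one-row-per-user-and-keyword shape the question asks for without any joins. The sketch below runs on SQLite via Python's sqlite3 and makes several assumptions: start and end times are stored as integer seconds, the sample rows are invented, and the composite index is only a suggestion. In MySQL with DATETIME columns the aggregate would presumably be written as SEC_TO_TIME(SUM(TIMESTAMPDIFF(SECOND, StartTime, EndTime))).

```python
import sqlite3

QUERY = """
SELECT user, categoryID, SUM(EndTime - StartTime) AS total_seconds
FROM rawdata
GROUP BY user, categoryID
ORDER BY user, categoryID
"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rawdata ("
    "user TEXT, categoryID INTEGER, StartTime INTEGER, EndTime INTEGER)"
)
# Hypothetical index so the grouping can be served from the index.
conn.execute("CREATE INDEX idx_user_category ON rawdata (user, categoryID)")
conn.executemany(
    "INSERT INTO rawdata VALUES (?, ?, ?, ?)",
    [
        ("user1", 2, 0, 100),    # Facebook, 100 seconds
        ("user1", 2, 200, 250),  # Facebook, 50 seconds
        ("user1", 1, 0, 30),     # Outlook, 30 seconds
        ("user2", 3, 10, 20),    # Excel, 10 seconds
    ],
)

totals = conn.execute(QUERY).fetchall()
for row in totals:
    print(row)
```

Whether this is actually faster on the real table depends on the data and indexes, so it is worth checking with EXPLAIN before adopting it.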

Query runs too slowly when there are no results. How can I improve it?

I have three tables
filters (id, name)
items(item_id, name)
items_filters(item_id, filter_id, value_id)
values(id, filter_id, filter_value)
about 20000 entries in items.
about 80000 entries in items_filters.
SELECT i.*
FROM items_filters itf INNER JOIN items i ON i.item_id = itf.item_id
WHERE (itf.filter_id = 1 AND itf.value_id = '1')
OR (itf.filter_id = 2 AND itf.value_id = '7')
GROUP BY itf.item_id
WITH ROLLUP
HAVING COUNT(*) = 2
LIMIT 0,10;
It takes 0.008 when there are entries that match the query and 0.05 when no entries match.
I tried different variations before:
SELECT * FROM items WHERE item_id IN (
SELECT `item_id`
FROM `items_filters`
WHERE (`filter_id`='1' AND `value_id`=1)
OR (`filter_id`='2' AND `value_id`=7)
GROUP BY `item_id`
HAVING COUNT(*) = 2
) LIMIT 0,6;
This completely freezes mysql when there are no entries.
What I really don't get is that
SELECT i.*
FROM items_filters itf INNER JOIN items i ON i.item_id = itf.item_id
WHERE itf.filter_id = 1 AND itf.value_id = '1' LIMIT 0,1
takes ~0.05 when no entries are found and ~0.008 when there are some.
Explain
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | i | ALL | PRIMARY | NULL | NULL | NULL | 10 | Using temporary; Using filesort |
| 1 | SIMPLE | itf | ref | item_id | item_id | 4 | ss_stylet.i.item_id | 1 | Using where; Using index |

Aside from ensuring there is an index on items_filters covering both columns (filter_id, value_id), I would prequalify your item IDs up front with a group by, THEN join to the items table. It looks like you are trying to find items that meet two specific conditions and, for those, grab the item details.
I've also left the "group by with rollup" in the outer, even though there will be a single instance per ID returned from the inner query. But since the inner query is already applying the limit of 0,10 records, it's not throwing too many results to be joined to your items table.
However, since you are not doing any aggregates, I believe the outer group by and rollup are not really going to provide you any benefit and could otherwise be removed.
SELECT i.*
FROM
( select itf.item_id
from items_filters itf
WHERE (itf.filter_id = 1 AND itf.value_id = '1')
OR (itf.filter_id = 2 AND itf.value_id = '7')
GROUP BY itf.item_id
HAVING COUNT(*) = 2
LIMIT 0, 10 ) PreQualified
JOIN items i
ON PreQualified.item_id = i.item_id
Another approach MIGHT be to do a JOIN in the inner query so you don't even need to apply a group by and having. Since you are explicitly looking for exactly two conditions, I would then try the following. This way, the first qualifier is that it MUST have an entry with filter_id = 1 and value_id = '1'. If it doesn't even hit THAT entry, it would never CARE about the second. Then, by applying a join to the same table (aliased itf2), it has to find, on that same item_id, the conditions for the second (filter_id = 2 and value_id = '7'). This basically forces a look almost like a single pass against the one entry FIRST and foremost before CONSIDERING anything else. That would STILL result in your limited set of 10 before getting item details.
SELECT i.*
FROM
( select itf.item_id
from items_filters itf
join items_filters itf2
on itf.item_id = itf2.item_id
AND itf2.filter_id = 2
AND itf2.value_id = '7'
WHERE
itf.filter_id = 1 AND itf.value_id = '1'
LIMIT 0, 10 ) PreQualified
JOIN items i
ON PreQualified.item_id = i.item_id
I also removed the group by / with rollup as per your comment of duplicates (which is what I expected).
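The self-join variant can be exercised with a small sketch like the one below. It uses Python's sqlite3 as a stand-in for MySQL, and the sample rows, the index name idx_filter_value and the helper find_items are all hypothetical. The composite index is one guess at what "an index on (filter_id, value_id)" could look like, with item_id appended so the inner query can be answered from the index alone.

```python
import sqlite3


def find_items(conn, first, second, limit=10):
    """Return items that carry both (filter_id, value_id) pairs."""
    sql = """
        SELECT i.item_id, i.name
        FROM (SELECT itf.item_id
              FROM items_filters AS itf
              JOIN items_filters AS itf2
                ON itf.item_id = itf2.item_id
               AND itf2.filter_id = ?
               AND itf2.value_id = ?
              WHERE itf.filter_id = ? AND itf.value_id = ?
              LIMIT ?) AS prequalified
        JOIN items AS i ON prequalified.item_id = i.item_id
        ORDER BY i.item_id
    """
    # Placeholders appear in the order: second pair, first pair, limit.
    params = (second[0], second[1], first[0], first[1], limit)
    return conn.execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE items (item_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE items_filters (item_id INTEGER, filter_id INTEGER, value_id INTEGER);
    CREATE INDEX idx_filter_value ON items_filters (filter_id, value_id, item_id);
    """
)
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "red shirt"), (2, "blue shirt"), (3, "red hat")],
)
conn.executemany(
    "INSERT INTO items_filters VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 7), (2, 1, 1), (3, 2, 7)],
)

matches = find_items(conn, (1, 1), (2, 7))
no_matches = find_items(conn, (1, 1), (2, 99))
print(matches, no_matches)
```

Only item 1 has both filter values, so it is the only match; the second call shows the empty-result case that was slow in the question.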

That looks like four tables to me.
Run EXPLAIN on the query and look for a full table scan (type ALL in the output). If you see one, add indexes on the columns in the WHERE clauses. Those will certainly help.

How to select an item, the one below and the one above in MYSQL

I have a database with IDs that are non-integers like this:
b01
b02
b03
d01
d02
d03
d04
s01
s02
s03
s04
s05
etc. The letters represent the type of product, the numbers the next one in that group.
I'd like to be able to select an ID, say d01, and get b03, d01, d02 back. How do I do this in MYSQL?

Here is another way to do it using UNIONs. I think this is a little easier to understand and more flexible than the accepted answer. Note that the example assumes the id field is unique, which appears to be the case based on your question.
The SQL query below assumes your table is called demo and has a single unique id field, and the table has been populated with the values you listed in your question.
( SELECT id FROM demo WHERE STRCMP ( 'd01', id ) > 0 ORDER BY id DESC LIMIT 1 )
UNION ( SELECT id FROM demo WHERE id = 'd01' ORDER BY id ) UNION
( SELECT id FROM demo WHERE STRCMP ( 'd01', id ) < 0 ORDER BY id ASC LIMIT 1 )
ORDER BY id
It produces the following result: b03, d01, d02.
This solution is flexible because you can change each of the LIMIT 1 statements to LIMIT N where N is any number. That way you can get the previous 3 rows and the following 6 rows, for example.
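A sketch of the same UNION idea with the LIMIT values as parameters is shown below, run against SQLite through Python's sqlite3 (an assumption for portability). Each limited branch is wrapped in a derived table because SQLite does not accept parenthesized SELECTs inside a UNION, and plain < and > comparisons replace STRCMP. The helper name neighbours and its parameters are hypothetical.

```python
import sqlite3


def neighbours(conn, target, before=1, after=1):
    """Return the target id plus up to `before`/`after` ids on either side."""
    sql = """
        SELECT id FROM (SELECT id FROM demo WHERE id < ?
                        ORDER BY id DESC LIMIT ?) AS prev
        UNION
        SELECT id FROM demo WHERE id = ?
        UNION
        SELECT id FROM (SELECT id FROM demo WHERE id > ?
                        ORDER BY id ASC LIMIT ?) AS nxt
        ORDER BY id
    """
    params = (target, before, target, target, after)
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT PRIMARY KEY)")
ids = ["b01", "b02", "b03", "d01", "d02", "d03", "d04",
       "s01", "s02", "s03", "s04", "s05"]
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in ids])

print(neighbours(conn, "d01"))
print(neighbours(conn, "d01", before=3, after=2))
```

The first call returns b03, d01 and d02; the second widens the window to three rows before and two after.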

Note: this is from Microsoft SQL Server, but the only thing that needs tweaking for MySQL is the isnull function (MySQL's equivalent is IFNULL).
select *
from #test
where id between isnull((select max(id) from #test where id < 'd01'),'d01')
and isnull((select min(id) from #test where id > 'd01'),'d01')

Find your target row,
SELECT p.id FROM product AS p WHERE p.id = 'd01'
and the row above it with no other row between the two.
LEFT JOIN product AS p1 ON p1.id > p.id -- gets the rows above it
LEFT JOIN -- gets the rows between the two which needs to not exist
product AS p1a ON p1a.id > p.id AND p1a.id < p1.id
and similarly for the row below it. (Left as an exercise for the reader.)
In my experience this is also quite efficient.
SELECT
p.id, p1.id, p2.id
FROM
product AS p
LEFT JOIN
product AS p1 ON p1.id > p.id
LEFT JOIN
product AS p1a ON p1a.id > p.id AND p1a.id < p1.id
LEFT JOIN
product AS p2 ON p2.id < p.id
LEFT JOIN
product AS p2a ON p2a.id < p.id AND p2a.id > p2.id
WHERE
p.id = 'd01'
AND p1a.id IS NULL
AND p2a.id IS NULL

Although this is not a direct answer to your question, I personally wouldn't rely on the natural order, since it may change due to imports/exports and produce side effects that are not easily understood by fellow programmers. What about creating an alternate INTEGER index and firing another query? "WHERE id > ...yourdesiredid ... LIMIT 1"?

mysql> describe test;
+-------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+-------+
| id | varchar(50) | YES | | NULL | |
+-------+-------------+------+-----+---------+-------+
mysql> select * from test;
+------+
| id |
+------+
| b01 |
| b02 |
| b03 |
| b04 |
+------+
mysql> select * from test where id >= 'b02' LIMIT 3;
+------+
| id |
+------+
| b02 |
| b03 |
| b04 |
+------+

What about using a cursor? This would let you traverse the returned set one row at a time. Using it with two variables (like "current" and "last"), you could inchworm along the result until you hit your target. Then return the value of "last" (for n-1), your entered target (n), and then iterate one more time and return the "current" (n+1).
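The "last"/"current" traversal can be sketched on the client side as follows. This uses a Python DB-API cursor over SQLite rather than a MySQL stored-procedure cursor, which is a deliberate substitution to keep the example short and runnable; the function name neighbours_by_scan is hypothetical.

```python
import sqlite3


def neighbours_by_scan(conn, target):
    """Walk the ordered ids once, remembering the previous one.

    Returns (previous, target, following), with None for a missing
    neighbour, or None if the target id is not in the table.
    """
    cursor = conn.execute("SELECT id FROM demo ORDER BY id")
    last = None
    for (current,) in cursor:
        if current == target:
            # Advance one more row to pick up the id after the target.
            following = cursor.fetchone()
            return last, current, (following[0] if following else None)
        last = current
    return None


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (id TEXT PRIMARY KEY)")
ids = ["b01", "b02", "b03", "d01", "d02", "d03", "d04",
       "s01", "s02", "s03", "s04", "s05"]
conn.executemany("INSERT INTO demo VALUES (?)", [(i,) for i in ids])

print(neighbours_by_scan(conn, "d01"))
```

Note that this reads rows from the start of the table until it reaches the target, so on a large table the query-based approaches are likely to be cheaper.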
