Keep only last two rows for grouped columns in table - mysql

I have a table "History" with about 300.000 rows, which is filled with new data daily. I want to keep only the last two lines of every refSchema/refId combination.
Actually I go this way:
First Step:
SELECT refSchema,refId FROM History GROUP BY refSchema,refId
With this statement I get all combinations (about 40,000 of them).
Second Step:
Then I run a foreach loop that looks up the rows for each combination from the query above, like this:
SELECT id
FROM History
WHERE refSchema = ? AND refId = ? AND state = 'done'
ORDER BY importedAt DESC
LIMIT 2, 2000
Please keep in mind that I want to keep the last two rows in my table, which is why I use LIMIT 2, 2000 (skip the two newest, take up to 2000 of the rest). If I find matching rows I put the IDs in an array called idList.
Final Step:
I delete all IDs from the array like this:
DELETE FROM History WHERE id in ($idList)
This doesn't seem to perform well, because every combination needs its own extra query. Is there a single DELETE statement that does the magic and avoids the 40,000 extra queries?
Edit: I use AWS Aurora DB.

If you are using MySQL 8+, then one conceptually simple way to proceed here is to use a CTE to identify, per refSchema/refId group, the two most recent rows which you do want to retain. Then, delete any record whose id does not appear in this whitelist (comparing on the refSchema/refId pair instead would match every row, because the retained rows share their group's key):
WITH cte AS (
    SELECT id
    FROM
    (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
        FROM History
    ) t
    WHERE rn IN (1, 2)
)
DELETE
FROM History
WHERE id NOT IN (SELECT id FROM cte);
If you can't use a CTE, then try inlining the above CTE:
DELETE
FROM History
WHERE id NOT IN (
    SELECT id
    FROM
    (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY refSchema, refId ORDER BY importedAt DESC) rn
        FROM History
    ) t
    WHERE rn IN (1, 2)
);
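If your Aurora cluster is on the 5.7-compatible line (Aurora MySQL 2.x, so no window functions), a rough alternative is a self-join delete: remove every row that has at least two strictly newer rows in its group. This is a sketch, not tested against your schema, and it assumes importedAt values are distinct within each refSchema/refId group (ties would need id as a tiebreaker):

DELETE h
FROM History h
JOIN History n1
    ON n1.refSchema = h.refSchema
   AND n1.refId = h.refId
   AND n1.importedAt > h.importedAt
JOIN History n2
    ON n2.refSchema = h.refSchema
   AND n2.refId = h.refId
   AND n2.importedAt > n1.importedAt;

MySQL allows a multi-table DELETE with a self-join, and each matching h row is deleted only once no matter how many join matches it has. Add your state = 'done' filter if, as in your second step, only completed rows should be purged.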

Related

How to update the Group Number in a sequence in all rows

I have a table in which the data is stored as:
I want only the Branch column to be updated so that the sequence becomes continuous. That means the rows with ID 1 to 4 keep the same value, the rows with ID 5 to 7 get Branch 3, rows 8 and 9 get Branch 4, rows 10 to 12 get Branch 5, and so on.
My desired output would look like this:
I don't want the rows to be reordered: they should keep their current sequence with the continuously increasing ID column, and only the Branch column should be renumbered.
I tried doing it with a loop, but that code grew so large and error-prone that I started looking for a more direct approach.
Is it possible through CTEs or any other approach?
How can I do so?
I use more columns than necessary just to show what is going on; rn is only there to show how grp is created. You only need grp for the final result.
The idea is to create a group sequence based on Id. Then, using DENSE_RANK(), you get your desired sequence.
This assumes ID is a sequential number without holes; if your ID has holes, you need to use ROW_NUMBER() first to create such a sequence.
WITH cte as (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY [Branch] ORDER BY [Id]) as rn,
           [Id] - ROW_NUMBER() OVER (PARTITION BY [Branch] ORDER BY [Id]) as grp
    FROM Table1
)
SELECT *, DENSE_RANK() OVER (ORDER BY grp) as new_branch
FROM cte
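The query above only SELECTs the new numbering; to actually write it back, here is a sketch of the update step, assuming SQL Server (to match the bracketed identifiers above) and that Id is unique:

WITH cte AS (
    SELECT [Id],
           [Id] - ROW_NUMBER() OVER (PARTITION BY [Branch] ORDER BY [Id]) AS grp
    FROM Table1
), ranked AS (
    SELECT [Id], DENSE_RANK() OVER (ORDER BY grp) AS new_branch
    FROM cte
)
UPDATE t
SET [Branch] = r.new_branch
FROM Table1 AS t
JOIN ranked AS r ON r.[Id] = t.[Id];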

How to query a table with over 200 million rows?

I have a table USERS with only one column, USER_ID. There are more than 200M IDs; they are not consecutive and are not ordered. The column has an index, USER_ID_INDEX. I have the DB in MySQL and also in Google BigQuery, but I haven't been able to get what I need in either of them.
I need to know how to query these 2 things:
1) Which is the row number for a particular USER_ID (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SET @row := 0;
SELECT @row := @row + 1 AS row FROM USERS WHERE USER_ID = 100001366260516;
It runs fast but returns row=1, because the row counting applies only to the filtered result set.
SELECT USER_ID, @row := @row + 1 AS row FROM (SELECT USER_ID FROM USERS ORDER BY USER_ID ASC) t WHERE USER_ID = 100002034141760
It takes forever (I didn't wait to see the result).
In BigQuery:
SELECT ROW_NUMBER() OVER() row, USER_ID
FROM (SELECT USER_ID from USERS.USER_ID ORDER BY USER_ID ASC)
WHERE USER_ID = 1063650153
It takes forever (I didn't wait to see the result).
2) Which USER_ID is in a particular row (once the table is ordered by USER_ID)
For this, I've tried in MySQL:
SELECT USER_ID FROM USERS ORDER BY USER_ID ASC LIMIT 150000000000, 1
It takes 5 minutes to return a result. Why? Isn't it supposed to be fast if there is an index?
In BigQuery, I didn't find a way, because LIMIT offset, num_rows doesn't even exist there.
I could copy the ordered table into a new one and add a column called RANK, with an INDEX on it. But it will be a mess whenever I want to add or remove a row.
Any ideas on how to solve these two queries?
Thanks,
Natalia
For (1), try this:
SELECT count(user_id)
FROM USERS
WHERE USER_ID <= 100001366260516;
You can check the explain plan, but it should just be doing a range scan of the index.
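For instance (reusing the literal from the question), something along these lines should show an index range scan rather than a full table scan:

EXPLAIN SELECT COUNT(USER_ID)
FROM USERS
WHERE USER_ID <= 100001366260516;
-- expect type: range and Extra: Using index in the output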
For (2), your question was: "Why? Isn't it supposed to be fast if it has an index?". Yes, it will use the index. But then it has to count up to row 150,000,000,000 using an index scan. Hmm, that is beyond the end of the table (if it is not a typo). In any case, an index scan is quite different from an index lookup, which is fast. It takes time, and more time still if the index does not fit into memory.
The proper syntax for row_number(), by the way, would be:
SELECT row, USER_ID
FROM (SELECT USER_ID, row_number() over (order by user_id) as row
      FROM USERS.USER_ID)
WHERE USER_ID = 1063650153;
I don't know if it will be that much faster, but at least you are not explicitly ordering the rows first.
If these are the types of queries you need to do, then think about a way to include the ordering information as a column in the table.
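A sketch of that idea on MySQL 8 (the users_ranked table and rnk column are made-up names, and the table has to be rebuilt or maintained whenever USERS changes):

-- materialize the ordering once
CREATE TABLE users_ranked AS
SELECT USER_ID, ROW_NUMBER() OVER (ORDER BY USER_ID) AS rnk
FROM USERS;

ALTER TABLE users_ranked
    ADD UNIQUE INDEX idx_uid (USER_ID),
    ADD UNIQUE INDEX idx_rnk (rnk);

-- (1) row number for a given USER_ID
SELECT rnk FROM users_ranked WHERE USER_ID = 100001366260516;

-- (2) USER_ID at a given row
SELECT USER_ID FROM users_ranked WHERE rnk = 150000000;

Both lookups then become single index seeks.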

Find and remove duplicate rows by two columns

I read all the relevant duplicate questions/answers, and I found this to be the most relevant answer:
INSERT IGNORE INTO temp (MAILING_ID, REPORT_ID)
SELECT DISTINCT MAILING_ID, REPORT_ID
FROM table_1;
The problem is that I want to remove duplicates based on col1 and col2, but I also want the insert to include all the other fields of table_1.
I tried to add all the relevant columns this way:
INSERT IGNORE INTO temp (M_ID, MAILING_ID, REPORT_ID, MAILING_NAME, VISIBILITY, EXPORTED)
SELECT DISTINCT M_ID, MAILING_ID, REPORT_ID, MAILING_NAME, VISIBILITY, EXPORTED
FROM table_1;

M_ID (int, primary), MAILING_ID (int), REPORT_ID (int), MAILING_NAME (varchar), VISIBILITY (varchar), EXPORTED (int)
But it inserted all rows into temp (including duplicates), presumably because M_ID is the primary key, so every selected row is distinct.
The simplest way to delete duplicate rows by multiple columns is to add a UNIQUE index:
ALTER IGNORE TABLE your_table ADD UNIQUE (field1, field2, field3);
The IGNORE above makes sure that only the first row found is kept; the rest are discarded.
(You can then drop that index if you need future duplicates and/or know they won't happen again.) Note that ALTER IGNORE was removed in MySQL 5.7, so this only works on older versions.
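If you do want to drop the index afterwards, a quick sketch (MySQL names such an index after its first column by default, but check with SHOW INDEX first):

SHOW INDEX FROM your_table;
ALTER TABLE your_table DROP INDEX field1;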
This works perfectly in any version of MySQL, including 5.7+. It also avoids the error "You can't specify target table 'my_table' for update in FROM clause" by using a double-nested subquery. It deletes only ONE duplicate row per group (the latest one), so if you have 3 or more duplicates in a group, you can run the query multiple times. It never deletes unique rows.
DELETE FROM my_table
WHERE id IN (
    SELECT calc_id FROM (
        SELECT MAX(id) AS calc_id
        FROM my_table
        GROUP BY identField1, identField2
        HAVING COUNT(id) > 1
    ) temp
)
I needed this query because I wanted to add a UNIQUE index on two columns but there were some duplicate rows that I needed to discard first.
For MySQL:
DELETE t1 FROM yourtable t1
INNER JOIN yourtable t2
    ON t1.identField1 = t2.identField1
   AND t1.identField2 = t2.identField2
   AND t1.id < t2.id;
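This keeps the row with the highest id in each group. If you would rather keep the earliest row instead, flip the comparison:

DELETE t1 FROM yourtable t1
INNER JOIN yourtable t2
    ON t1.identField1 = t2.identField1
   AND t1.identField2 = t2.identField2
   AND t1.id > t2.id;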
You first need to find your duplicates by grouping on the two fields with a HAVING clause:
SELECT identField1, identField2, COUNT(*)
FROM yourTable
GROUP BY identField1, identField2
HAVING COUNT(*) > 1
If this returns what you want, you can use it as a subquery and delete (note the row comparison against both fields, and the extra derived table MySQL needs when deleting from the same table it selects from; also note this removes every copy of a duplicated pair, not just the extras):
DELETE FROM yourTable
WHERE (identField1, identField2) IN (
    SELECT identField1, identField2 FROM (
        SELECT identField1, identField2
        FROM yourTable
        GROUP BY identField1, identField2
        HAVING COUNT(*) > 1
    ) t
)
You can also get the primary IDs of the duplicates by grouping on the two fields (colA and colB stand in for your column names):
SELECT MAX(id) AS id, COUNT(*)
FROM table
GROUP BY colA, colB
HAVING COUNT(*) > 1;
and then:
DELETE FROM table
WHERE id IN (
    SELECT id FROM (
        SELECT MAX(id) AS id
        FROM table
        GROUP BY colA, colB
        HAVING COUNT(*) > 1
    ) t
);
Like the query above, this uses MAX() to remove only one duplicate per group per run, so repeat it until no duplicates remain.
NOTE: this solution is an alternative, old-school one.
If you couldn't achieve what you wanted with the queries above, you can try my "old school" method:
First, run this query to get the duplicate records:
SELECT column1, column2, COUNT(*)
FROM table
GROUP BY column1, column2
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
After that, select those results and paste them into Notepad++.
Now, using Notepad++'s find-and-replace, turn each line into a "delete" query followed by an "insert" query, like this (from now on, for security reasons, my values will be AAAA).
Special note: add one more new line after the last line of your data inside Notepad++, because the regex matches the '\r\n' at the end of each line:
Find what regex: \D*(\d+)\D*(\d+)\D*\r\n
Replace with string: delete from table where column1 = $1 and column2 = $2; insert into table set column1 = $1, column2 = $2;\r\n
Now, finally, paste those queries into your MySQL Workbench query console and execute them. You will see only one occurrence of each duplicate record.
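For example, with made-up values 12 and 34, a pasted result line like

12    34

becomes, after the find-and-replace:

delete from table where column1 = 12 and column2 = 34; insert into table set column1 = 12, column2 = 34;

so every copy of the pair is wiped and then re-inserted exactly once.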
This answer is for a relation table consisting of just two columns and no ID column. I think you can apply it to your situation.
In a large data set, if you are selecting multiple columns in the select clause, e.g.:
select x, y, z from table1
and the requirement is to remove duplicates based on two of them (from the above example, say y and z), then you can use the following instead of the combination of GROUP BY and a subquery, which performs badly:
select x, y, z
from (
    select x, y, z,
           row_number() over (partition by y, z) as index_num
    from table1
) main
where main.index_num = 1
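On MySQL 8 the same window function can drive the delete directly. A sketch, assuming x uniquely identifies rows (the CTE contains a window function, so it is materialized, which is what makes selecting from the target table allowed here):

WITH d AS (
    SELECT x,
           ROW_NUMBER() OVER (PARTITION BY y, z ORDER BY x) AS rn
    FROM table1
)
DELETE FROM table1
WHERE x IN (SELECT x FROM d WHERE rn > 1);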

Select all except top N in MySQL

I have a table with data that I don't need to keep for very long, so every night I want to remove all rows except the last 20.
To do that, I found the following query:
DELETE FROM Table WHERE ID NOT IN (
    SELECT id FROM (
        SELECT TOP 10 ID FROM Table
    ) AS x
)
MySQL doesn't support TOP, so I rewrote it to use LIMIT instead:
DELETE FROM Table WHERE ID NOT IN (
    SELECT id FROM (
        SELECT ID FROM Table ORDER BY ID DESC LIMIT 10
    ) AS x
)
Unfortunately, MySQL doesn't seem to support LIMIT within these subqueries either. So what do I do now?
How do I select all except the 10 rows with the highest ID?
I could probably just delete all the records that are older than a day or something, but it feels like I should be able to do it this way.
As MySQL has foibles when reading from the same table as you are deleting from, the simplest option is often a temp table.
INSERT INTO yourTempTable
SELECT id FROM yourTable
ORDER BY id DESC LIMIT 10;

DELETE FROM yourTable WHERE id NOT IN (SELECT id FROM yourTempTable);
Or many variations thereof (using joins instead of IN, etc).
The main consideration isn't about how to write the second query, it's about race conditions.
Your data could be changed by another process between the temp table and the delete. If that is possible and matters, you need to wrap it all in a transaction and slap a table lock on yourTable.
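Putting it together for the keep-last-20 case from the question (keep20 is a made-up name; wrap it in a transaction with a table lock if concurrent writes are possible):

CREATE TEMPORARY TABLE keep20 AS
    SELECT ID FROM yourTable ORDER BY ID DESC LIMIT 20;

DELETE FROM yourTable
WHERE ID NOT IN (SELECT ID FROM keep20);

DROP TEMPORARY TABLE keep20;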
Another way:
DELETE t
FROM TableX AS t
CROSS JOIN (
    SELECT Id
    FROM TableX
    ORDER BY Id DESC
    LIMIT 1 OFFSET 9
) AS tenth
WHERE t.Id < tenth.Id
There are a few ways to do this. I would prefer this one:
create table tmp_table select * from your_table order by id desc limit 20;
truncate table your_table;
insert into your_table select * from tmp_table;
drop table tmp_table;
It seems very lengthy at first glance, but personally I think it is understandable and very low risk (plus, it should be more efficient than doing a JOIN).
PS: contrary to a common assumption, TRUNCATE does reset the AUTO_INCREMENT counter; re-inserting the saved rows with their original IDs pushes it back above the highest existing ID.
PS2: this approach assumes there are no writes during the truncate; you can take a table lock to preserve data integrity.

Selecting last row WITHOUT any kind of key

I need to get the last (newest) row in a table (using MySQL's natural order - i.e. what I get without any kind of ORDER BY clause), however there is no key I can ORDER BY on!
The only 'key' in the table is an indexed MD5 field, so I can't really ORDER BY on that. There's no timestamp, autoincrement value, or any other field that I could easily ORDER on either. This is why I'm left with only the natural sort order as my indicator of 'newest'.
And, unfortunately, changing the table structure to add a proper auto_increment is out of the question. :(
Anyone have any ideas on how this can be done w/ plain SQL, or am I SOL?
If it's MyISAM you can do it in two queries
SELECT COUNT(*) FROM yourTable;
SELECT * FROM yourTable LIMIT useTheCountHere - 1,1;
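Since LIMIT does not accept expressions, gluing the two steps together needs a prepared statement. A sketch (the same caveats as below apply):

SELECT COUNT(*) - 1 INTO @offset FROM yourTable;
SET @sql = CONCAT('SELECT * FROM yourTable LIMIT ', @offset, ', 1');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;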
This is unreliable however because
It assumes rows are only added to this table and never deleted.
It assumes no other writes are performed to this table in the meantime (you can lock the table)
MyISAM tables can be reordered using ALTER TABLE, so that the insert order is no longer preserved.
It's not reliable at all in InnoDB, since this engine can reorder the table at will.
Can I ask why you need to do this?
In Oracle, and possibly MySQL too, the optimiser will choose the quickest access path to return your results. So even with static data there is the potential to run the same query twice and get a different answer.
You can assign row numbers using the ROW_NUMBER function and then sort by this value using the ORDER BY clause. (Note that with an empty OVER(), the numbering follows no guaranteed order, so this relies on the same undefined natural order as everything else here.)
SELECT *,
ROW_NUMBER() OVER() AS rn
FROM table
ORDER BY rn DESC
LIMIT 1;
Basically, you can't do that.
Normally I'd suggest adding a surrogate primary key with auto-increment and ORDER BY that:
SELECT *
FROM yourtable
ORDER BY id DESC
LIMIT 1
But in your question you write...
changing the table structure to add a proper auto_increment is out of the question.
So another less pleasant option I can think of is using a simulated ROW_NUMBER using variables:
SELECT * FROM
(
    SELECT T1.*, @rownum := @rownum + 1 AS rn
    FROM yourtable T1, (SELECT @rownum := 0) T2
) T3
ORDER BY rn DESC
LIMIT 1
Please note that this has serious performance implications: it requires a full scan, and the results are not guaranteed to be returned in any particular order in the subquery - you might get them in insertion order, but then again you might not; when you don't specify the order, the server is free to choose any order it likes. It will probably choose the order the rows are stored in on disk, in order to do as little work as possible, but relying on this is unwise.
Without an order by clause you have no guarantee of the order in which you will get your result. The SQL engine is free to choose any order.
But if for some reason you still want to rely on this order, then the following will indeed return the last record from the result (MySql only):
select *
from (select *,
             @rn := @rn + 1 rn
      from mytable,
           (select @rn := 0) init
     ) numbered
where rn = @rn
In the sub query the records are retrieved without order by, and are given a sequential number. The outer query then selects only the one that got the last attributed number.
We can use HAVING for that kind of problem, but only if the table actually has a numeric id column (which this question explicitly lacks):
SELECT MAX(id) AS last_id, column1, column2
FROM table
HAVING id = last_id;
A more robust way to express the same idea is WHERE id = (SELECT MAX(id) FROM table).