Optimising "NOT IN(...)" query for millions of rows

Optimising "NOT IN(...)" query for millions of rows - mysql

Note: I do not have access to the source code/database to which this question pertains. The two tables in question are located on different servers.
I'm working with a 3rd party company that have systems integrated with our own. They have a query that runs something like this;
DELETE FROM table WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
It's pretty much referencing around 3 million values in the NOT IN.
I'm trying to point out that this seems like an inefficient method for deleting multiple rows and keeping all the ones noted in the query. The problem is, as I don't have the access myself to the source code/database I'm not totally sure what to suggest as a solution.
I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
With this limited knowledge, what possible suggestions could I present to them?
The first thing that comes to mind is having some kind of flag column that indicates whether it's been deleted or not. When the sync script runs it would first perform an update on the target server for all rows marked as deleted (or insert for new rows), then a second query to delete all rows marked for deletion.
Is there more logical way to do something like this, bearing in mind complete overhauls in functionality are out of the question. Only small tweaks to the current process will be possible for a number of reasons.

Instead of
DELETE FROM your_table
WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
you could do
delete t1
from your_table t1
left join table_where_the_ids_come_from t2 on t1.column = t2.id
where t2.id is null

I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
I know this is obvious, but why don't these two servers stay in sync using replication? I'm guessing it's because aside from this one table, they don't have identical data.
If out-of-the-box replication isn't flexible enough, you could use a change-data capture tool.
The idea is that the tool monitors changes in a MySQL binary log stream, and reacts to them. The reaction is user-defined, and it can include applying the same change to another MySQL instance, which would keep them in sync.
Here's a blog that shows how to use Maxwell, which is one of the open-source CDC tools, this one released from Zendesk:
https://www.percona.com/blog/2016/09/13/mysql-cdc-streaming-binary-logs-and-asynchronous-triggers/
A couple of advantages of this approach:
No need to re-sync the whole table. You'd only apply incremental changes as they occur.
No need to schedule re-syncs daily or whatever. Since incremental changes are likely to be small, you could apply the changes nearly immediately.

Deleting a large number of rows will take a huge amount of time. This is likely to require a full table scan. As it finds rows to delete, it will stress the undo/redo log. It will clog replication (if using such). Etc.
How many rows do you expect to delete?
Better would be to break the list up into chunks of 1000. (This applies whether using IN(list of constants) or JOIN.) But, since you are doing NOT, it gets stickier. Possibly the best way is to copy over what you want:
CREATE TABLE new LIKE real;
INSERT INTO new
SELECT * FROM real WHERE id IN (...); -- without NOT
RENAME TABLE real TO old,
new TO real;
DROP TABLE old;
I go into details of chunking, partitioning, and other techniques in Big Deletes .

Related

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I'm want to know is if it's more efficient to just delete everything in the table each time the script is called and repopulate it, or go through each record to determine if any changes were made and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know if the time it takes to do this would put a huge load on the server. The script is written in PHP.

700 records is an extremely small number of records to have performance concerns. Don't even think about it, do whichever is easier for you.
But if it is performance that you are after, updating rows is slower than inserting rows, (especially if you are not expecting any generated keys, so an insertion is a one-way operation to the database instead of a roundtrip to and from the database,) and TRUNCATE TABLE tends to be faster than DELETE * FROM.

If you have IDs for the proper inventory talking about SQL DB, then it would be good practice to update them, since in theory your IDs will get exhausted (overflow).
Another approach would be to use some NoSQL DB like MongoDB and simply update the DB with given json bodies apparently with existing IDs, and the DB itself will figure it out on its own.

How to continuously remove anything older than the newst 10 entries of a MySQLdatabase (Possibly in JPQL/JPA)

I'm looking for a way to continuously monitor and delete the oldest entries so that the database is never larger than a certain value. I'm only interested in the latest 10 for example and everything past that number should be deleted. The database is updated through varous programs but the program that does the monitoring and deleting will probably be a Java EE application with JPA. I don't know at which layer of the implementation this will be done. If MySQL has build in management that does this, if I'll have to write a query that does this, or if there is a feature of Java that can do this.
Edit: I'm using an autoincremented id that could be used to determine threshhold of deleting.

This is a complex problem, because unless your table is not linked to any other table, you might very well have the latest row in table A referencing a very old row in table B. In this case, although the table B's row is very old, you can't delete it without breaking the coherence of your database.
Doing it "continuously" is even harder (read: impossible). I would first
examine if it's really needed. Disks are cheap, and 10 entries in an enterprise database is really nothing.
implement some purge mechanism and execute it very now and then, when the database is not used by anyone else.

I'll have a stab without knowing anything about your table schema:
DELETE FROM MyTable WHERE Id NOT IN (SELECT TOP 10 Id FROM MyTable ORDER BY Date DESC)
This is pretty inefficient to run all the time and there may by a MySql-specific TRUNCATE that does the job nicer. You'd probably get better performance from limiting your reads to the 10 rows you need, and actually archiving / deleting the extraneous data only periodically.

Best Approach for Checking and Inserting Records

EDIT: To clarify the records originally come from a flat-file database and is not in the MySQL database.
In one of our existing C programs which purpose is to take data from the flat-file and insert them (based on criteria) into the MySQL table:
Open connection to MySQL DB
for record in all_record_of_my_flat_file:
if record contain a certain field:
if record is NOT in sql_table A: // see #1
insert record information into sql_table A and B // see #2
Close connection to MySQL DB
select field from sql_table A where field=XXX
2 inserts
I believe that management did not feel it is worth it to add the functionality so that when the field in the flat file is created, it would be inserted into the database. This is specific to one customer (that I know of). I too, felt it odd that we use tool such as this to "sync" the data. I was given the duty of using and maintaining this script so I haven't heard too much about the entire process. The intent is to primarily handle additional records so this is not the first time it is used.
This is typically done every X months to sync everything up or so I'm told. I've also been told that this process takes roughly a couple of days. There is (currently) at most 2.5million records (though not necessarily all 2.5m will be inserted and most likely much less). One of the table contains 10 fields and the other 5 fields. There isn't much to be done about iterating through the records since that part can't be changed at the moment. What I would like to do is speed up the part where I query MySQL.
I'm not sure if I have left out any important details -- please let me know! I'm also no SQL expert so feel free to point out the obvious.
I thought about:
Putting all the inserts into a transaction (at the moment I'm not sure how important it is for the transaction to be all-or-none or if this affects performance)
Using Insert X Where Not Exists Y
LOAD DATA INFILE (but that would require I create a (possibly) large temp file)
I read that (hopefully someone can confirm) I should drop indexes so they aren't re-calculated.
mysql Ver 14.7 Distrib 4.1.22, for sun-solaris2.10 (sparc) using readline 4.3

Why not upgrade your MySQL server to 5.0 (or 5.1), and then use a trigger so it's always up to date (no need for the monthly script)?
DELIMITER //
CREATE TRIGGER insert_into_a AFTER INSERT ON source_table
FOR EACH ROW
BEGIN
IF NEW.foo > 1 THEN
SELECT id AS #testvar FROM a WHERE a.id = NEW.id;
IF #testvar != NEW.id THEN
INSERT INTO a (col1, col2) VALUES (NEW.col1, NEW.col2);
INSERT INTO b (col1, col2) VALUES (NEW.col1, NEW.col2);
END IF
END IF
END //
DELIMITER ;
Then, you could even setup update and delete triggers so that the tables are always in sync (if the source table col1 is updated, it'll automatically propagate to a and b)...

Here's my thoughts on your utility script...
1) Is just a good practice anyway, I'd do it no matter what.
2) May save you a considerable amount of execution time. If you can solve a problem in straight SQL without using iteration in a C-Program, this can save a fair amount of time. You'll have to profile it first to ensure it really does in a test environment.
3) LOAD DATA INFILE is a tactic to use when inserting a massive amount of data. If you have a lot of records to insert (I'd write a query to do an analysis to figure out how many records you'll have to insert into table B), then it might behoove you to load them this way.
Dropping the indexes before the insert can be helpful to reduce running time, but you'll want to make sure you put them back when you're done.
Although... why aren't all the records in table B in the first place? You haven't mentioned how processing works, but I would think it would be advantageous to ensure (in your app) that the records got there without your service script's intervention. Of course, you understand your situation better than I do, so ignore this paragraph if it's off-base. I know from experience that there are lots of reasons why utility cleanup scripts need to exist.
EDIT: After reading your revised post, your problem domain has changed: you have a bunch of records in a (searchable?) flat file that you need to load into the database based on certain criteria. I think the trick to doing this as quickly as possible is to determine where the C application is actually the slowest and spends the most time spinning its proverbial wheels:
If it's reading off the disk, you're stuck, you can't do anything about that, unless you get a faster disk.
If it's doing the SQL query-insert operation, you could try optimizing that, but your'e doing a compare between two databases (the flat-file and the MySQL one)
A quick thought: by doing a LOAD DATA INFILE bulk insert to populate a temporary table very quickly (perhaps even an in-memory table if MySQL allows that), and then doing the INSERT IF NOT EXISTS might be faster than what you're currently doing.
In short, do profiling, and figure out where the slowdown is. Aside from that, talk with an experienced DBA for tips on how to do this well.

I discussed with another colleague and here is some of the improvements we came up with:
For:
SELECT X FROM TABLE_A WHERE Y=Z;
Change to (currently waiting verification on whether X is and always unique):
SELECT X FROM TABLE_A WHERE X=Z LIMIT 1;
This was an easy change and we saw some slight improvements. I can't really quantify it well but I did:
SELECT X FROM TABLE_A ORDER BY RAND() LIMIT 1
and compared the first two query. For a few test there was about 0.1 seconds improvement. Perhaps it cached something but the LIMIT 1 should help somewhat.
Then another (yet to be implemented) improvement(?):
for record number X in entire record range:
if (no CACHE)
CACHE = retrieve Y records (sequentially) from the database
if (X exceeds the highest record number in cache)
CACHE = retrieve the next set of Y records (sequentially) from the database
search for record number X in CACHE
...etc
I'm not sure what to set Y to, are there any methods for determining what's a good sized number to try with? The table has 200k entries. I will edit in some results when I finish implementation.

What is the best way to update (or replace) an entire database table on a live machine?

I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying the database while I'm rebuilding it. The amount of data isn't small (couple hundred megabytes), so it won't load that instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem, and was wondering what the best method would be. Here's my ideas so far:
Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?

I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.

BEGIN;
DELETE FROM TABLE;
INSERT INTO TABLE;
COMMIT;
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data, anything afterwards will run on the new data. The database will actually clear the old table once the last user is done with it. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about any lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it, and SQL Server calls it "snapshotting," and I can't remember the details off the top of my head since I rarely use the thing.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.

We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.

Which database server are you using? SQL 2005 and above provides a locking method called "Snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot locking would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....

Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows into a target table from a source which don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement - check out more here, here or here.
Other database system probably will have similar implementations - it's a SQL:2003 Standard statement after all.
Marc

Use different table names(mytable_[yyyy]_[wk]) and a view for providing you with a constant name(mytable). Once a new table is completely imported update your view so that it uses that table.

Replication with lots of temporary table writes

I've got a database which I intend to replicate for backup reasons (performance is not a problem at the moment).
We've set up the replication correctly and tested it and all was fine.
Then we realized that it replicates all the writes to the temporary tables, which in effect meant that replication of one day's worth of data took almost two hours for the idle slave.
The reason for that is that we recompute some of the data in our db via cronjob every 15 mins to ensure it's in sync (it takes ~3 minutes in total, so it is unacceptable to do those operations during a web request; instead we just store the modifications without attempting to recompute anything while in the web request, and then do all of the work in bulk). In order to process that data efficiently, we use temporary tables (as there's lots of interdependencies).
Now, the first problem is that temporary tables do not persist if we restart the slave while it's in the middle of processing transactions that use that temp table. That can be avoided by not using temporary tables, although this has its own issues.
The more serious problem is that the slave could easily catch up in less than half an hour if it wasn't for all that recomputation (which it does one after the other, so there's no benefit of rebuilding the data every 15 mins... and you can literally see it stuck at, say 1115, only to quickly catch up and got stuck at 1130 etc).
One solution we came up with is to move all that recomputation out of the replicated db, so that the slave doesn't replicate it. But it has disadvantages in that we'd have to prune the tables it eventually updates, making our slave in effect "castrated", ie. we'd have to recompute everything on it before we could actually use it.
Did anyone have a similar problem and/or how would you solve it? Am I missing something obvious?

I've come up with the solution. It makes use of replicate-do-db mentioned by Nick. Writing it down here in case somebody had a similar problem.
The problem with just using replicate-(wild-)do* options in this case (like I said, we use temp tables to repopulate a central table) is that either you ignore temp tables and repopulate the central one with no data (which causes further problems as all the queries relying on the central table being up-to-date will produce different results) or you ignore the central table, which has a similar problem. Not to mention, you have to restart mysql after adding any of those options to my.cnf. We wanted something that would cover all those cases (and future ones) without the need for any further restart.
So, what we decided to do is to split the database into the "real" and a "workarea" databases. Only the "real" database is replicated (I guess you could decide on a convention of table names to be used for replicate-wild-do-table syntax).
All the temporary table work is happening in "workarea" db, and to avoid the dependency problem mentioned above, we won't populate the central table (which sits in "real" db) by INSERT ... SELECT or RENAME TABLE, but rather query the tmp tables to generate a sort of a diff on the live table (ie. generate INSERT statements for new rows, DELETE for the old ones and update where necessary).
This way the only queries that are replicated are exactly the updates that are required, nothing else, ie. some (most?) of the recomputation queries hapenning every fifteen minutes might not even make its way to slave, and the ones that do will be minimal and not computationally expensive at all, just simple INSERTs and DELETEs.

In MySQL, as of 5.0 I believe, you can do table wildcards to replicate specific tables. There are a number of command-line options that can be set but you can also do this via your MySQL config file.
[mysqld]
replicate-do-db = db1
replicate-do-table = db2.mytbl2
replicate-wild-do-table= database_name.%
replicate-wild-do-table= another_db.%
The idea being that you tell it to not replicate any tables other than the ones you specify.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008