MySQL trigger vs application insert for history

I have a main table in MySQL and need a history table to track changes to it.
I have two approaches:
Trigger: create a trigger on the main table that inserts into the history table for any change to the main table.
Application: insert into the history table from the application whenever it inserts into or updates the main table.
I am trying to work out which approach performs better.

Assuming your trigger performs exactly the same operation as the separate logging query (e.g. both insert a row to your history table whenever you modify your table), there is no significant performance difference between your two options, as both do the same amount of work.
The decision is usually design driven - or the preference of whoever makes the guidelines you have to follow.
Some advantages of using a trigger for your history log:
You cannot forget to log, e.g. by coding mistakes in your app, and don't have to take care of it in every quick and dirty maintenance script. MySQL does it for you.
You have direct access to all column values in the trigger, including their previous values and specifically the primary key (new.id). This makes logging trivial (see the sketch after this list).
If you do batch modifications, it can be complicated to write an equivalent logging query. For a delete from tablename where xyz, you would probably run an insert into historytable ... select ... where xyz first, and if xyz is a slow condition that ends up deleting nothing, you have just doubled your execution time. So much for performance. For an update tablename set a = rand() where a > 0.5, good luck writing a proper separate logging query at all.
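As a rough illustration of such a trigger (the history columns main_id, old_value, new_value and changed_at are placeholders I made up; the other names follow the examples above):

CREATE TRIGGER tablename_after_update
AFTER UPDATE ON tablename
FOR EACH ROW
INSERT INTO historytable (main_id, old_value, new_value, changed_at)
VALUES (NEW.id, OLD.a, NEW.a, NOW());

Equivalent AFTER INSERT and AFTER DELETE triggers would complete the history log; MySQL fires them automatically, so the application cannot forget to log.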
Some advantages of not using a trigger to log:
You have control over when and what you log. E.g. if you want to log only specific changes done by end users in your application, but not those made by batch scripts or automatic processes, it might be easier (and faster) to log explicitly just what you want to log.
You may want to log additional information not available to the trigger (and that you don't want to store in the main table), e.g. the Windows login or the last button the user pressed to access the function that modified the data.
It might be more convenient to write a general logging function in a programming language, where you can use metadata to e.g. dynamically generate the logging query or compare old and new values in a loop over all columns, than to maintain three triggers (insert, update, delete) for every table, where you usually have to list every column explicitly.
Since you are especially interested in performance: although it's probably more a theoretical than a practical advantage, if you do a lot of batch modifications, it might be faster to write the log in batches too (e.g. inserting 1000 history rows at once will be faster than inserting 1000 rows individually via a trigger). But you will have to design your logging query properly, and the logging query itself must not be slow.
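For example, reusing the batch update from above, the old values could be logged in a single statement before the change is applied (a sketch with the same placeholder column names as the trigger example):

INSERT INTO historytable (main_id, old_value, changed_at)
SELECT id, a, NOW() FROM tablename WHERE a > 0.5;

UPDATE tablename SET a = RAND() WHERE a > 0.5;

Note that capturing the new values afterwards would need the logged ids, because after the update the WHERE condition no longer matches the same rows.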

Related

How to perform check-and-insert in mysql?

In my code I need to do the following:
Check a MySQL table (InnoDB) if a particular row (matching some criteria) exists. If it does, return it. If it doesn't, create it and then return it.
The problem I seem to have is race conditions. Every now and then two processes run so closely together that they both check the table at the same time, don't see the row, and both insert it - thus duplicate data.
I'm reading the MySQL documentation trying to come up with some way to prevent this. What I've come up with so far:
Unique indexes seem to be one option, but they're not universal (they only work when the matching criteria are themselves unique for all rows).
Transactions even at SERIALIZABLE level don't protect against INSERT, period.
Neither do SELECT ... LOCK IN SHARE MODE or SELECT ... FOR UPDATE.
A LOCK TABLES ... WRITE would do it, but it's a very drastic measure - other processes won't be able to read from the table, and I need to lock ALL tables that I intend to use until I unlock them.
Basically, I'd like to do either of the following:
Prevent all INSERTs into the table from processes other than mine, while allowing SELECT/UPDATE (this is probably impossible because it makes so little sense most of the time).
Organize some sort of manual locking. The two processes would coordinate among themselves which one gets to do the select/insert dance, while the other waits. This needs some sort of operation that waits until the lock is released. I could probably implement a spin-lock (one process repeatedly checks if the other has released the lock), but I'm afraid that it would be too resource intensive.
I think I found an answer myself. Transactions + SELECT ... FOR UPDATE in an InnoDB table can provide a synchronization lock (aka mutex). Have all processes lock on a specific row in a specific table before they start their work. Then only one will be able to run at a time and the rest will wait until the first one finishes its transaction.
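The pattern could look roughly like this (just a sketch; the lock table, its row and the target table names are all made up):

-- one-time setup: a table holding one row per named lock
-- CREATE TABLE app_locks (name VARCHAR(64) PRIMARY KEY);
-- INSERT INTO app_locks (name) VALUES ('create_special_row');

START TRANSACTION;
-- take the mutex: this blocks until any other transaction holding the lock row commits
SELECT name FROM app_locks WHERE name = 'create_special_row' FOR UPDATE;
-- now the check-and-insert cannot race with another process doing the same dance
SELECT * FROM target_table WHERE some_criteria = 'xyz';
INSERT INTO target_table (some_criteria) VALUES ('xyz');  -- only if the SELECT above returned nothing
COMMIT;  -- commits the work and releases the lock row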

Logging of data changes in MySQL tables using ADO.NET

Is there any way to get the latest change in a MySQL database using ADO.NET,
i.e. which table and column changed, which operation was performed, and the old and new values? This should cover both single-table and multi-table changes. I want to log the changes in my own new table.
There are several ways change tracking can be implemented for MySQL:
Triggers: you can add a DB trigger for insert/update/delete that creates an entry in the audit log.
Application logic: track changes in your data layer. The implementation depends heavily on that layer; if you use an ADO.NET DataAdapter, the RowUpdating event is suitable for this purpose.
You also have the following alternatives for how to store the audit log in a MySQL database:
Use one table for the audit log, with columns like id, table, operation, new_value (string), old_value (string) - a rough schema sketch follows this list. This approach has several drawbacks: the table grows very fast (it holds the history of changes in all tables), it stores values as strings, it duplicates data between old/new pairs, and calculating the changeset costs some resources on every insert/update.
Use a 'mirror' table (say, with a '_log' suffix) for each table that has change tracking enabled. On insert/update you execute an additional insert into the mirror table - as a result you get a record 'snapshot' on every save, and from these snapshots it is possible to work out what changed and when. The performance overhead on insert/update is minimal, and you don't need to determine which values actually changed - but the mirror table will hold a lot of redundant data, as a full row copy is saved even if only one column changed.
A hybrid solution where record 'snapshots' are saved temporarily and then processed in the background, to store only the differences in an optimal way without affecting application performance.
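The single-table variant could be defined roughly like this (column names and types are only illustrative, not a recommendation):

CREATE TABLE audit_log (
  id BIGINT AUTO_INCREMENT PRIMARY KEY,
  table_name VARCHAR(64) NOT NULL,
  row_id BIGINT NOT NULL,
  operation ENUM('INSERT','UPDATE','DELETE') NOT NULL,
  old_value TEXT,
  new_value TEXT,
  changed_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
);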
There is no single best solution for all cases; everything depends on the concrete application requirements: how many inserts/updates are performed, how the audit log is used, etc.

Optimising "NOT IN(...)" query for millions of rows

Note: I do not have access to the source code/database to which this question pertains. The two tables in question are located on different servers.
I'm working with a 3rd party company that has systems integrated with our own. They have a query that runs something like this:
DELETE FROM table WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
It's pretty much referencing around 3 million values in the NOT IN.
I'm trying to point out that this seems like an inefficient way of deleting some rows while keeping all the ones listed in the query. The problem is that, since I don't have access to the source code/database myself, I'm not totally sure what to suggest as a solution.
I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
With this limited knowledge, what possible suggestions could I present to them?
The first thing that comes to mind is having some kind of flag column that indicates whether it's been deleted or not. When the sync script runs it would first perform an update on the target server for all rows marked as deleted (or insert for new rows), then a second query to delete all rows marked for deletion.
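Roughly what I have in mind (just a sketch; the deleted column and the names are made up):

UPDATE target_table SET deleted = 1
WHERE id IN (101, 102, 103);  -- ids reported as deleted by the source since the last sync

DELETE FROM target_table WHERE deleted = 1;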
Is there a more logical way to do something like this, bearing in mind that complete overhauls of the functionality are out of the question? Only small tweaks to the current process will be possible, for a number of reasons.
Instead of
DELETE FROM your_table
WHERE column NOT IN(1,2,3,4,5,.....3 000 000)
you could do
-- anti-join: delete every row of your_table whose value has no match in the table the ids come from
delete t1
from your_table t1
left join table_where_the_ids_come_from t2 on t1.column = t2.id
where t2.id is null
I know the idea of this query is to keep a target server synced up with a source server. So if a row is deleted on the source server, the target server will reflect that change when this (and other) query is run.
I know this is obvious, but why don't these two servers stay in sync using replication? I'm guessing it's because aside from this one table, they don't have identical data.
If out-of-the-box replication isn't flexible enough, you could use a change-data capture tool.
The idea is that the tool monitors changes in a MySQL binary log stream, and reacts to them. The reaction is user-defined, and it can include applying the same change to another MySQL instance, which would keep them in sync.
Here's a blog post that shows how to use Maxwell, one of the open-source CDC tools, released by Zendesk:
https://www.percona.com/blog/2016/09/13/mysql-cdc-streaming-binary-logs-and-asynchronous-triggers/
A couple of advantages of this approach:
No need to re-sync the whole table. You'd only apply incremental changes as they occur.
No need to schedule re-syncs daily or whatever. Since incremental changes are likely to be small, you could apply the changes nearly immediately.
Deleting a large number of rows will take a huge amount of time. This is likely to require a full table scan. As it finds rows to delete, it will stress the undo/redo log. It will clog replication (if using such). Etc.
How many rows do you expect to delete?
Better would be to break the list up into chunks of 1000. (This applies whether using IN(list of constants) or JOIN.) But, since you are doing NOT, it gets stickier. Possibly the best way is to copy over what you want:
CREATE TABLE `new` LIKE `real`;
INSERT INTO `new`
    SELECT * FROM `real` WHERE id IN (...); -- without NOT
RENAME TABLE `real` TO `old`,
             `new` TO `real`;
DROP TABLE `old`;
I go into details of chunking, partitioning, and other techniques in Big Deletes.

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I want to know is whether it's more efficient to just delete everything in the table each time the script is called and repopulate it, or to go through each record, determine if any changes were made, and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know if the time it takes to do this would put a huge load on the server. The script is written in PHP.
700 records is an extremely small number to have performance concerns about. Don't even think about it; do whichever is easier for you.
But if it is performance that you are after: updating rows is slower than inserting rows (especially if you are not expecting any generated keys, so an insert is a one-way trip to the database instead of a round trip), and TRUNCATE TABLE tends to be faster than DELETE FROM.
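The delete-and-repopulate variant could look roughly like this (table and column names are made up; in practice the PHP script would build one multi-row INSERT from the current inventory):

TRUNCATE TABLE inventory_snapshot;
INSERT INTO inventory_snapshot (item_id, quantity, checked_at)
VALUES (1, 40, NOW()),
       (2, 13, NOW());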
If the inventory rows have auto-increment IDs in a SQL database, it would be better practice to update them in place, since repeatedly deleting and re-inserting keeps consuming new IDs and in theory the ID range can get exhausted (overflow).
Another approach would be to use a NoSQL DB like MongoDB and simply upsert the given JSON documents under their existing IDs; the DB itself will figure out what changed on its own.

What is the best way to update (or replace) an entire database table on a live machine?

I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying it while I'm rebuilding it. The amount of data isn't small (a couple hundred megabytes), so it won't load instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem, and was wondering what the best method would be. Here's my ideas so far:
Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
Just telling users that the site is going through maintenance and putting the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?
I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
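The swap script could look roughly like this (PostgreSQL; foo and foo_new as above, foo_old is just a name I picked for the retired copy):

BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
COMMIT;
DROP TABLE foo_old;  -- once you're sure the new data is good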
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.
BEGIN;
DELETE FROM your_table;
-- load the new data; staging_table is just a placeholder for wherever it comes from
INSERT INTO your_table SELECT * FROM staging_table;
COMMIT;
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data, anything afterwards will run on the new data. The database will actually clear the old table once the last user is done with it. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about any lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it, and SQL Server calls it "snapshotting," and I can't remember the details off the top of my head since I rarely use the thing.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.
We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.
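A very rough sketch of the idea (PostgreSQL, with made-up names; unlike the setup described above, this version does not auto-create the sub-tables, it only routes rows to one existing weekly partition):

CREATE TABLE measurements (
  id bigserial,
  taken_on date NOT NULL,
  value numeric
);

CREATE TABLE measurements_2024_w01 (
  CHECK (taken_on >= DATE '2024-01-01' AND taken_on < DATE '2024-01-08')
) INHERITS (measurements);

CREATE OR REPLACE FUNCTION measurements_insert_router() RETURNS trigger AS $$
BEGIN
  -- route each row to the child table for its week; a real version would create
  -- the child table on the fly if it does not exist yet
  IF NEW.taken_on >= DATE '2024-01-01' AND NEW.taken_on < DATE '2024-01-08' THEN
    INSERT INTO measurements_2024_w01 VALUES (NEW.*);
  ELSE
    RAISE EXCEPTION 'no partition for date %', NEW.taken_on;
  END IF;
  RETURN NULL;  -- the row is not stored in the parent table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER measurements_insert_trigger
BEFORE INSERT ON measurements
FOR EACH ROW EXECUTE PROCEDURE measurements_insert_router();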
Which database server are you using? SQL Server 2005 and above provide an isolation level called "snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to see the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot isolation would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....
Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows from a source into a target table when they don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement.
Other database systems probably have similar implementations - it's a SQL:2003 standard statement after all.
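A hedged sketch of what a MERGE can look like (table and column names are invented; the last clause is SQL Server's extension for the optional delete):

MERGE INTO inventory AS target
USING weekly_import AS source
  ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET quantity = source.quantity
WHEN NOT MATCHED THEN
  INSERT (id, quantity) VALUES (source.id, source.quantity)
WHEN NOT MATCHED BY SOURCE THEN
  DELETE;  -- removes target rows that no longer show up in the import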
Use different table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table is completely imported, update your view so that it points at that table.
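For example (MySQL syntax; the week-stamped names simply follow the [yyyy]_[wk] pattern above, assuming last week's table was mytable_2024_04):

CREATE TABLE mytable_2024_05 LIKE mytable_2024_04;
-- import the new week's data into mytable_2024_05 here
CREATE OR REPLACE VIEW mytable AS SELECT * FROM mytable_2024_05;
DROP TABLE mytable_2024_04;  -- optional, once the old week's data is no longer needed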