I want to create a SQL trigger that inserts a new row if and only if it passes a given condition. I can think of a couple ways to do this, but I'm not sure which is the best or correct way.
1. Do an AFTER INSERT trigger and then delete the new row if it fails the condition.
2. Do a BEFORE INSERT trigger and raise an application error if it fails.
3. ???
Option 1 creates a race condition. I would avoid that explicitly.
Option 2 is likely to cause significantly slower INSERTs, but can work.
Option 3 is a stored procedure, but you'll probably need to call the proc for each row inserted, and unless you set up security correctly you may not actually prevent users from inserting data directly.
Option 4 is to insert everything into a staging or transaction table, and then use a broker or procedure with queries or views to move only valid data to the live table. This is extremely old school and relatively nasty, since you're not using an RDBMS like a modern RDBMS anymore. Expect lots of problems with key violation issues and synchronization. And you have the same security problem as Option 3. This method is usually only used today for bulk import and export.
Option 5 is to validate your data in the application instead of the DB. This will work, but runs into problems when your customers try to use your RDBMS like an RDBMS. Then you hit the same security problem as Option 3. It won't actually fix problems or prevent storage of invalid data by programs outside your application.
Option 6 is to use an RDBMS that enforces CHECK constraints, which is just about everything except older MySQL and MariaDB. MS SQL Server, Oracle, DB2, PostgreSQL, even MS Access and SQLite support CHECK constraints. MySQL parsed but silently ignored them until 8.0.16 (MariaDB until 10.2.1), which is moderately ridiculous.
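To make Options 2 and 6 concrete, here is a minimal sketch using Python's built-in sqlite3 module, since SQLite is one of the engines listed as supporting CHECK constraints. The table names, the quantity column and the try_insert helper are made up for illustration, and trigger syntax differs between engines, so treat this as a sketch rather than something to paste into MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Option 6: a CHECK constraint rejects bad rows declaratively.
conn.execute("""
    CREATE TABLE orders_checked (
        id INTEGER PRIMARY KEY,
        quantity INTEGER NOT NULL CHECK (quantity > 0)
    )
""")

# Option 2: a BEFORE INSERT trigger that raises an error on bad rows.
conn.execute(
    "CREATE TABLE orders_triggered (id INTEGER PRIMARY KEY, quantity INTEGER NOT NULL)"
)
conn.execute("""
    CREATE TRIGGER reject_bad_quantity
    BEFORE INSERT ON orders_triggered
    WHEN NEW.quantity <= 0
    BEGIN
        SELECT RAISE(ABORT, 'quantity must be positive');
    END
""")


def try_insert(table, quantity):
    """Return True if the row was stored, False if the database rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(f"INSERT INTO {table} (quantity) VALUES (?)", (quantity,))
        return True
    except sqlite3.DatabaseError:
        return False
```

Either way the rejection happens inside the database, so it also applies to programs that bypass your application.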
I have a quick question that I can't seem to find online, not sure I'm using the right wording or not.
Do MySQL databases automatically synchronize queries coming in at around the same time? For example, if I send a query to insert something into a database at the same time another connection sends a query to select something from that database, does MySQL automatically lock the database while the insert is happening, and then unlock it when it's done, allowing the select query to access it?
Thanks
Do MySql databases automatically synchronize queries coming in at around the same time?
Yes.
Think of it this way: there's no such thing as simultaneous queries. MySQL always carries out one of them first, then the second one. (This isn't exactly true; the server is far more complex than that. But it robustly provides the illusion of sequential queries to us users.)
If, from one connection you issue a single INSERT query or a single UPDATE query, and from another connection you issue a SELECT, your SELECT will get consistent results. Those results will reflect the state of data either before or after the change, depending on which query went first.
You can even do stuff like this (read-modify-write operations) and maintain consistency.
UPDATE my_table
SET update_count = update_count + 1,
    update_time = NOW()
WHERE id = 42;  -- my_table and 42 stand in for your own table and row id
If you must do several INSERT or UPDATE operations as if they were one, you'll need to use the InnoDB engine, and you'll need to use transactions. While the transaction is in progress, plain SELECTs from other connections are not blocked; they keep seeing the data as it was before the transaction's changes, and see all of those changes at once after you commit. Teaching you to use transactions is beyond the scope of a Stack Overflow answer.
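As a rough illustration of "several operations as if they were one", here is a small sketch using Python's sqlite3 module as a stand-in for InnoDB. The accounts table and the transfer and balances helpers are invented for the example; the point is only that both UPDATEs take effect together or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts (id, balance) VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(src, dst, amount):
    """Move an amount between two rows; both UPDATEs commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.DatabaseError:
        return False


def balances():
    """Return the current balances as {id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

If the second UPDATE violates the CHECK constraint, the first one is rolled back too, so no other connection ever sees a half-finished transfer.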
The key to understanding how a modern database engine like InnoDB works is Multi-Version Concurrency Control or MVCC. This is how simultaneous operations can run in parallel and then get reconciled into a consistent "view" of the database when fully committed.
If you've ever used Git you know how you can have several updates to the same base happening in parallel but so long as they can all cleanly merge together there's no conflict. The database works like that as well, where you can begin a transaction, apply a bunch of operations, and commit it. Should those apply without conflict the commit is successful. If there's trouble the transaction is rolled back as if it never happened.
This ability to juggle multiple operations simultaneously is what makes a transaction-capable database engine really powerful. It's an important component necessary to meet the ACID standard.
MyISAM, MySQL's default engine before version 5.5, doesn't have any of these features and locks the whole table on any write operation to avoid conflict. It works like you thought it did.
When creating a database in MySQL you have your choice of engine, but using InnoDB should be your default. There's really no reason at all to use MyISAM as any of the interesting features of that engine (e.g. full-text indexes) have been ported over to InnoDB.
I currently have a PostgreSQL database, because one of the pieces of software we're using only supports this particular database engine. I then have a query which summarizes and splits the data from the app into a more useful format.
In my MySQL database, I have a table which contains an identical schema to the output of the query described above.
What I would like to develop is an hourly cron job which will run the query against the PostgreSQL database, then insert the results into the MySQL database. During the hour period, I don't expect to ever see more than 10,000 new rows (and that's a stretch) which would need to be transferred.
Both databases are on separate physical servers, continents apart from one another. The MySQL instance runs on Amazon RDS - so we don't have a lot of control over the machine itself. The PostgreSQL instance runs on a VM on one of our servers, giving us complete control.
The duplication is, unfortunately, necessary because the PostgreSQL database only acts as a collector for the information, while the MySQL database has an application running on it which needs the data. For simplicity, we're wanting to do the move/merge and delete from PostgreSQL hourly to keep things clean.
To be clear - I'm a network/sysadmin guy - not a DBA. I don't really understand all of the intricacies necessary in converting one format to the other. What I do know is that the data being transferred consists of 1xVARCHAR, 1xDATETIME and 6xBIGINT columns.
The closest guess I have for an approach is to use some scripting language to make the query, convert results into an internal data structure, then split it back out to MySQL again.
In doing so, are there any particular good or bad practices I should be wary of when writing the script? Or - any documentation that I should look at which might be useful for doing this kind of conversion? I've found plenty of scheduling jobs which look very manageable and well-documented, but the ongoing nature of this script (hourly run) seems less common and/or less documented.
Open to any suggestions.
Use the same database system on both ends and use replication
If your remote end was also PostgreSQL, you could use streaming replication with hot standby to keep the remote end in sync with the local one transparently and automatically.
If the local end and remote end were both MySQL, you could do something similar using MySQL's various replication features like binlog replication.
Sync using an external script
There's nothing wrong with using an external script. In fact, even if you use DBI-Link or similar (see below) you probably have to use an external script (or psql) from a cron job to initiate replication, unless you're going to use PgAgent to do it.
Either accumulate rows in a queue table maintained by a trigger procedure, or make sure you can write a query that always reliably selects only the new rows. Then connect to the target database and INSERT the new rows.
If the rows to be copied are too big to comfortably fit in memory, you can use a cursor and read them in batches with FETCH.
I'd do the work in this order:
1. Connect to PostgreSQL
2. Connect to MySQL
3. Begin a PostgreSQL transaction
4. Begin a MySQL transaction. If your MySQL is using MyISAM, go and fix it now.
5. Read the rows from PostgreSQL, possibly via a cursor or with DELETE FROM queue_table RETURNING *
6. Insert them into MySQL
7. DELETE any rows from the queue table in PostgreSQL if you haven't already.
8. COMMIT the MySQL transaction.
9. If the MySQL COMMIT succeeded, COMMIT the PostgreSQL transaction. If it failed, ROLLBACK the PostgreSQL transaction and try the whole thing again.
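The ordering above can be sketched roughly as follows. To keep the example runnable without any servers, two in-memory SQLite connections stand in for the PostgreSQL source and the MySQL target; in a real script you would open those with a PostgreSQL driver and a MySQL driver instead. The table and column names (queue_table, live_table, name, seen_at, hits) are placeholders, not your actual schema.

```python
import sqlite3


def sync_queue(source, target):
    """Copy queued rows from source to target; return how many rows moved."""
    rows = source.execute(
        "SELECT id, name, seen_at, hits FROM queue_table ORDER BY id"
    ).fetchall()
    if not rows:
        return 0
    try:
        target.executemany(
            "INSERT INTO live_table (name, seen_at, hits) VALUES (?, ?, ?)",
            [row[1:] for row in rows],
        )
        # Delete only the rows that were read, not anything queued since.
        source.execute("DELETE FROM queue_table WHERE id <= ?", (rows[-1][0],))
        target.commit()  # commit the remote side first
    except Exception:
        target.rollback()
        source.rollback()  # the queue keeps its rows, so the next run retries
        raise
    source.commit()  # only reached once the target commit succeeded
    return len(rows)


# Stand-in databases for the sketch.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE queue_table (id INTEGER PRIMARY KEY, name TEXT, seen_at TEXT, hits INTEGER)"
)
source.executemany(
    "INSERT INTO queue_table (name, seen_at, hits) VALUES (?, ?, ?)",
    [("a", "2024-01-01 00:00:00", 1), ("b", "2024-01-01 01:00:00", 2)],
)
source.commit()

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE live_table (name TEXT, seen_at TEXT, hits INTEGER)")
```

Calling sync_queue(source, target) from the hourly cron job would move whatever is queued; if the target insert or commit fails, nothing is deleted from the queue and the next run picks the same rows up again.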
The PostgreSQL COMMIT is incredibly unlikely to fail because it's a local database, but if you need perfect reliability you can use two-phase commit on the PostgreSQL side, where you:
1. PREPARE TRANSACTION in PostgreSQL
2. COMMIT in MySQL
3. Then either COMMIT PREPARED or ROLLBACK PREPARED in PostgreSQL, depending on the outcome of the MySQL commit.
This is likely too complicated for your needs, but is the only way to be totally sure the change happens on both databases or neither, never just one.
BTW, seriously, if your MySQL is using MyISAM table storage, you should probably remedy that. It's vulnerable to data loss on crash, and it can't be transactionally updated. Convert to InnoDB.
Use DBI-Link in PostgreSQL
Maybe it's because I'm comfortable with PostgreSQL, but I'd do this using a PostgreSQL function that used DBI-Link via PL/PerlU to do the job.
When replication should take place, I'd run a PL/PgSQL or PL/Perl procedure that uses DBI-Link to connect to the MySQL database and insert the data from the queue table.
Many examples exist for DBI-Link, so I won't repeat them here. This is a common use case.
Use a trigger to queue changes and DBI-Link to sync
If you only want to copy new rows and your table is append-only, you could write a trigger procedure that appends all newly INSERTed rows into a separate queue table with the same definition as the main table. When you want to sync, your sync procedure can then in a single transaction LOCK TABLE the_queue_table IN EXCLUSIVE MODE;, copy the data, and DELETE FROM the_queue_table;. This guarantees that no rows will be lost, though it only works for INSERT-only tables. Handling UPDATE and DELETE on the target table is possible, but much more complicated.
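Here is a minimal sketch of the queue-table idea, again using Python's sqlite3 module as a stand-in so it runs anywhere. In PostgreSQL the trigger would be a trigger function written in PL/pgSQL and the sync step would take the table lock described above; the readings and readings_queue tables and the drain_queue helper are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE readings (name TEXT, seen_at TEXT, hits INTEGER);
    CREATE TABLE readings_queue (name TEXT, seen_at TEXT, hits INTEGER);

    -- Every row inserted into the main table is also appended to the queue.
    CREATE TRIGGER queue_new_readings
    AFTER INSERT ON readings
    BEGIN
        INSERT INTO readings_queue (name, seen_at, hits)
        VALUES (NEW.name, NEW.seen_at, NEW.hits);
    END;
""")


def drain_queue():
    """Return the queued rows and empty the queue in one transaction."""
    with conn:
        rows = conn.execute(
            "SELECT name, seen_at, hits FROM readings_queue ORDER BY rowid"
        ).fetchall()
        conn.execute("DELETE FROM readings_queue")
    return rows
```

The application keeps inserting into the main table as usual; only the sync job ever touches the queue.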
Add MySQL to PostgreSQL with a foreign data wrapper
Alternately, for PostgreSQL 9.1 and above, I might consider using the MySQL Foreign Data Wrapper, ODBC FDW or JDBC FDW to allow PostgreSQL to see the remote MySQL table as if it were a local table. Then I could just use a writable CTE to copy the data (inserting into a foreign table needs PostgreSQL 9.3 or later and a wrapper that supports writes).
WITH moved_rows AS (
DELETE FROM queue_table RETURNING *
)
INSERT INTO mysql_table
SELECT * FROM moved_rows;
In short you have two scenarios:
1) Make destination pull the data from source into its own structure
2) Make source push out the data from its structure to destination
I'd rather try the second one: look for a way to do it with a PostgreSQL trigger, some special "virtual" table, or a PL/pgSQL function. Then, instead of an external script, you can run the procedure by executing a query from cron, or possibly from inside Postgres, which has some options for scheduling operations.
I'd choose the second scenario because Postgres is much more flexible, so you'll simply have more possibilities when you need to manipulate the data in special, DIY ways.
An external script probably isn't a good solution, e.g. because you will need to treat binary data with special care, or convert dates from DATE to VARCHAR and then to DATE again. Inside an external script, various text-stored data will probably be just strings, and you will need to quote them too.
A really weird (for me) problem is occurring lately. In an application that accepts user submitted data the following occurs at random:
Rows from the Database Table where the user submitted data is stored are disappearing.
Please note that there is NO DELETE, DROP, TRUNCATE or other SQL statement issued on the database table except from the INSERT statement.
Could this be a MySQL bug? I did some research on mysql.com (forums, bugs, etc.) and found 2 similar cases, but without a solid answer (just suggestions).
Some info you might find useful:
Storage Engine: InnoDB
User Submitted Data sanitized and checked for SQL Injection attempts
Appreciate any suggestions, info.
regards,
Here's 3 possibilities:
1. The data never got to the database in the first place. Something happened elsewhere so the data disappeared. Maybe intermittent network issues, an overloaded server, or an application bug.
2. A database transaction was not committed, and got rolled back. Maybe a bug in your application code, maybe some invalid data screwed things up, maybe a concurrency exception occurred, etc.
3. A bug in MySQL.
I'd look at 1. and 2. first.
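Possibility 2 is easy to reproduce. The sketch below uses Python's sqlite3 module rather than MySQL, and the submissions table and helper names are invented, but the effect is the same in any transactional engine: an INSERT that is never committed looks fine to the connection that made it and then vanishes when that connection goes away.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "demo.db")


def insert_row(value, commit):
    """Insert one row on a fresh connection, optionally committing before closing."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS submissions (body TEXT)")
    conn.commit()
    conn.execute("INSERT INTO submissions (body) VALUES (?)", (value,))
    if commit:
        conn.commit()
    conn.close()  # closing with an open transaction discards the uncommitted INSERT


def stored_rows():
    """Read back what actually reached the database file."""
    conn = sqlite3.connect(db_path)
    try:
        return [row[0] for row in conn.execute("SELECT body FROM submissions ORDER BY rowid")]
    finally:
        conn.close()
```

Calling insert_row("kept", commit=True) and then insert_row("lost", commit=False) leaves only the first row in the file, with no DELETE ever issued.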
A table on which you only ever insert (and presumably select) and never update or delete should be really stable. Are you absolutely certain you're protecting thoroughly against SQL injection attacks? Because those could (of course) delete rows and such if successful.
You haven't mentioned which table engine you're using (there are several), but it's well worth running whatever diagnostic tools there are for it on the table in question. For instance, on a MyISAM table, run myisamchk. Or more generically (this works for several table types), use the CHECK TABLE statement.
Have you had issues with the underlying storage? It may be worth checking for those.
Activating binlog and periodically monitoring DELETE queries can help to identify the culprit.
One more case to add to the above: the application may have separate client-side and server-side parts, and changes initiated on the client side can trigger additional logic on the server side.
For example, in our case, a local admin panel updated an order's information with pay_date = NULL, and the PHP website processed this table to clean up overdue orders. As the PHP logic was developed by another programmer, it looked strange when updating an order resulted in records disappearing after some time.
The same applies to cron jobs working on the MySQL database on a schedule.
SETUP
I have to insert a couple million rows in either SQL Server 2000/2005, MySQL, or Access. Unfortunately I don't have an easy way to use bulk insert or BCP or any of the other ways that a normal human would go about this. The inserts will happen on one particular database but that code needs to be db agnostic -- so I can't do bulk copy, or SELECT INTO, or BCP. I can however run specific queries before and after the inserts, depending on which database I'm importing to.
eg.
If IsSqlServer() Then
DisableTransactionLogging();
ElseIf IsMySQL() Then
DisableMySQLIndices();
End If
... do inserts ...
If IsSqlServer() Then
EnableTransactionLogging();
ElseIf IsMySQL() Then
EnableMySQLIndices();
End If
QUESTION
Are there any interesting things I can do to SQL Server that might speed up these inserts?
For example, is there a command I could issue to tell SQL Server, "Hey, don't bother recording these transactions in the transaction log".
Or maybe I could say, "Hey, I have a million rows coming in, so don't update your index until I'm totally finished".
ALTER INDEX [IX_TableIndex] ON Table DISABLE
... inserts
ALTER INDEX [IX_TableIndex] ON Table REBUILD
(Note: Above index disable only works on 2005, not 2000. Bonus points if you know a way to do this on 2000).
What about MySQL, and Access?
The single biggest thing that will kill performance here is the fact that (it sounds like) you're executing a million different INSERTs against the DB. Each INSERT is treated as a single operation. If you can do this as a single operation, then you will almost certainly have a huge performance improvement.
Both MySQL and SQL Server support 'selects' of constant expressions without a table name, so this should work as one statement:
INSERT INTO MyTable(ID, name)
SELECT 1, 'Fred'
UNION ALL SELECT 2, 'Wilma'
UNION ALL SELECT 3, 'Barney'
UNION ALL SELECT 4, 'Betty'
It's not clear to me if Access supports that, not having Access available. HOWEVER, Access does support constants in a SELECT, as far as I can tell, and you can coerce the above into ANSI SQL-92 (which should be supported by all 3 engines; it's about as close to 'DB agnostic' as you'll get) by just adding
FROM OneRowTable
to the end of every individual SELECT, where 'OneRowTable' is a table with just one row of dummy data.
This should let you insert a million rows of data in much much less than a million INSERT statements -- and things like index reshuffling will be done once, rather than a million times. You may have much less need for other optimisations after that.
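If you're generating these statements from code, a sketch like the following builds the UNION ALL form with placeholders instead of pasting values into the SQL. It runs against SQLite through Python's sqlite3 module purely so it can be tried out; build_union_insert is a hypothetical helper, and the placeholder style (? here) depends on the driver you actually use.

```python
import sqlite3


def build_union_insert(table, columns, row_count):
    """Build one INSERT ... SELECT ... UNION ALL statement with ? placeholders."""
    placeholders = ", ".join("?" for _ in columns)
    selects = " UNION ALL ".join(f"SELECT {placeholders}" for _ in range(row_count))
    return f"INSERT INTO {table} ({', '.join(columns)}) {selects}"


rows = [(1, "Fred"), (2, "Wilma"), (3, "Barney"), (4, "Betty")]
sql = build_union_insert("MyTable", ["ID", "name"], len(rows))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, name TEXT)")
# Flatten the row tuples into one parameter list matching the placeholders.
conn.execute(sql, [value for row in rows for value in row])
conn.commit()
```

Engines limit how long a statement and how many parameters they accept, so for a couple million rows you would send chunks of a few hundred rows per statement rather than one giant statement.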
Is this a regular process or a one-time event?
I have, in the past, just scripted out the current indexes, dropped them, inserted the rows, then re-added the indexes.
SQL Server Management Studio can script out the indexes from the right-click menus...
For SQL Server:
You can set the recovery model to "Simple", so your transaction log will be kept small. Do not forget to set it back afterwards.
Disabling the indexes is actually a good idea. This will work on SQL 2005, not on SQL Server 2000.
alter index [INDEX_NAME] on [TABLE_NAME] disable
And to enable
alter index [INDEX_NAME] on [TABLE_NAME] rebuild
And then just insert the rows one by one. You have to be patient, but at least it is somewhat faster.
If it is a one-time thing (or it happens often enough to justify automating this), also consider dropping/disabling all indexes, and then adding/re-enabling them again when the insert is done.
The trouble with setting the recovery model to simple is that it affects any other users entering data at the same time, and thus will make their changes unrecoverable.
Same thing with disabling the indexes: this disables them for everyone and may make the database run slower than a slug.
Suggest you run the import in batches.
If this is not something that needs to be read terribly quickly, you can do an INSERT DELAYED into the table on MySQL. This allows your code to continue running without having to wait for the insert to actually happen. This does have some limitations, but if your primary concern is to get the program to finish quickly, this may help. Be warned that there is a nice long list of situations where this may not act as expected (for one, it only works with some storage engines such as MyISAM, not InnoDB, and it was deprecated in MySQL 5.6). Check the docs.
I do not know if this functionality works for Access or MS SQL, though.
Have you considered using the Factory pattern? I'm guessing you're writing the code for this, so with the factory pattern you could code up a factory that returns a concrete "IDataInserter" type class that would do the work for you.
This would still allow you to be data agnostic and get the fastest method for each type of database.
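A rough sketch of what that factory could look like, in Python for brevity. The class names, the make_inserter function and the MyTable / IX_TableIndex names are illustrative only; the SQL Server statements are the ones from the question, and ALTER TABLE ... DISABLE KEYS is the MySQL counterpart (it only affects non-unique indexes on MyISAM tables).

```python
from abc import ABC, abstractmethod


class DataInserter(ABC):
    """Common interface; each subclass supplies engine-specific setup and teardown SQL."""

    @abstractmethod
    def before_insert(self):
        ...

    @abstractmethod
    def after_insert(self):
        ...


class SqlServerInserter(DataInserter):
    def before_insert(self):
        return "ALTER INDEX [IX_TableIndex] ON [MyTable] DISABLE"

    def after_insert(self):
        return "ALTER INDEX [IX_TableIndex] ON [MyTable] REBUILD"


class MySqlInserter(DataInserter):
    def before_insert(self):
        return "ALTER TABLE MyTable DISABLE KEYS"

    def after_insert(self):
        return "ALTER TABLE MyTable ENABLE KEYS"


class GenericInserter(DataInserter):
    """Fallback (e.g. Access): no engine-specific statements to run."""

    def before_insert(self):
        return None

    def after_insert(self):
        return None


def make_inserter(engine):
    """Return the inserter for an engine name, falling back to the generic one."""
    inserters = {"sqlserver": SqlServerInserter, "mysql": MySqlInserter}
    return inserters.get(engine.lower(), GenericInserter)()
```

The calling code then asks make_inserter for an object once and never needs another If IsSqlServer() branch.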
SQL Server 2000/2005, MySQL, and Access can all load directly from a tab-delimited text file; they just have different commands to do it. If you've got the case statement to determine which DB you're importing into, just figure out each one's preferred way of importing a text file.
Can you use DTS (2000) or SSIS (2005) to build a package to do this? DTS and SSIS can both pull from the same source and pipe out to the different potential destinations. Go for SSIS if you can. There's a lot of good, fast technology in there along with functionality to embed the IsSQLServer, IsMySQL, etc. logic.
It's worth considering breaking your inserts into smaller batches; a single transaction with lots of queries will be slow.
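Batching can be sketched like this, again with Python's sqlite3 module standing in for the real target. The batched and insert_in_batches helpers and the batch size of 1000 are arbitrary choices for the example; the right size depends on the engine and is worth measuring.

```python
import sqlite3
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(conn, rows, batch_size=1000):
    """Insert rows, committing once per batch; return the number of commits."""
    commits = 0
    for chunk in batched(rows, batch_size):
        with conn:  # one transaction per batch
            conn.executemany("INSERT INTO MyTable (ID, name) VALUES (?, ?)", chunk)
        commits += 1
    return commits


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (ID INTEGER PRIMARY KEY, name TEXT)")
```

Because rows can be a generator, the couple million rows never need to sit in memory at once.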
You might consider using SQL Server's bulk-logged recovery model during your bulk insert.
http://msdn.microsoft.com/en-us/library/ms190422(SQL.90).aspx
http://msdn.microsoft.com/en-us/library/ms190203(SQL.90).aspx
You might also disable the indexes on the target table during your inserts.
I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying the database while I'm rebuilding it. The amount of data isn't small (couple hundred megabytes), so it won't load that instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem, and was wondering what the best method would be. Here's my ideas so far:
1. Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
2. Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
3. Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?
I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
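The rename script could look roughly like this. SQLite also happens to have transactional DDL, so the sketch uses Python's sqlite3 module to stay runnable; the PostgreSQL statements would be the same ALTER TABLE ... RENAME TO inside a BEGIN/COMMIT. The foo_old name and the swap_tables helper are made up for the example.

```python
import sqlite3

# isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("INSERT INTO foo VALUES (1, 'old')")
conn.execute("CREATE TABLE foo_new (id INTEGER PRIMARY KEY, label TEXT)")
conn.execute("INSERT INTO foo_new VALUES (1, 'new')")


def swap_tables():
    """Replace foo with foo_new in one transaction; roll back if any step fails."""
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE foo RENAME TO foo_old")
        conn.execute("ALTER TABLE foo_new RENAME TO foo")
        conn.execute("DROP TABLE foo_old")
        conn.execute("COMMIT")
    except sqlite3.DatabaseError:
        conn.execute("ROLLBACK")  # both foo and foo_new are left untouched
        raise
```

After swap_tables() returns, queries against foo see the new data and the old table is gone.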
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.
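BEGIN;
DELETE FROM my_table;
-- my_table_staging is a placeholder for wherever the new week's rows come from
INSERT INTO my_table SELECT * FROM my_table_staging;
COMMIT;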
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data, anything afterwards will run on the new data. The database will actually clear the old table once the last user is done with it. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about any lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it, and SQL Server calls it "snapshotting," and I can't remember the details off the top of my head since I rarely use the thing.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.
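To see the "old data until commit" behaviour for yourself, here is a small sketch with two connections. It uses SQLite in WAL mode through Python's sqlite3 module as a stand-in, since that also gives readers a snapshot; the weekly_data table and the connection names are invented, and the details of isolation differ between engines.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "weekly.db")

# isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
writer = sqlite3.connect(db_path, isolation_level=None)
writer.execute("PRAGMA journal_mode=WAL")  # readers keep working while a writer is active
writer.execute("CREATE TABLE weekly_data (value TEXT)")
writer.execute("INSERT INTO weekly_data VALUES ('last week')")

reader = sqlite3.connect(db_path, isolation_level=None)


def visible_to_reader():
    """Return what a separate connection can currently see."""
    return [row[0] for row in reader.execute("SELECT value FROM weekly_data")]


writer.execute("BEGIN")
writer.execute("DELETE FROM weekly_data")
writer.execute("INSERT INTO weekly_data VALUES ('this week')")
during_reload = visible_to_reader()  # still the old contents
writer.execute("COMMIT")
after_reload = visible_to_reader()   # the new contents, all at once
```

At no point does the reader see an empty or half-loaded table.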
We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.
Which database server are you using? SQL Server 2005 and above provides an isolation level called "Snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot isolation would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....
Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows into a target table from a source which don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement - check out more here, here or here.
Other database systems will probably have similar implementations - it's a SQL:2003 Standard statement after all.
Marc
Use different table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table is completely imported, update your view so that it uses that table.
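A minimal sketch of the view approach, using Python's sqlite3 module so it can be run as-is. The weekly table names follow the mytable_[yyyy]_[wk] pattern with made-up dates, and publish is a hypothetical helper; on other engines you may be able to use CREATE OR REPLACE VIEW instead of the drop-and-create pair.

```python
import sqlite3

# isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE mytable_2024_01 (value TEXT)")
conn.execute("INSERT INTO mytable_2024_01 VALUES ('week 1')")
conn.execute("CREATE VIEW mytable AS SELECT * FROM mytable_2024_01")


def publish(table_name):
    """Repoint the mytable view at a fully loaded weekly table."""
    conn.execute("BEGIN")
    try:
        conn.execute("DROP VIEW IF EXISTS mytable")
        conn.execute(f'CREATE VIEW mytable AS SELECT * FROM "{table_name}"')
        conn.execute("COMMIT")
    except sqlite3.DatabaseError:
        conn.execute("ROLLBACK")
        raise


# Load the next week's table completely, then switch the view over.
conn.execute("CREATE TABLE mytable_2024_02 (value TEXT)")
conn.execute("INSERT INTO mytable_2024_02 VALUES ('week 2')")
publish("mytable_2024_02")
```

Application queries keep using the name mytable throughout, and old weekly tables can be dropped whenever it's convenient.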