Relying on MySQL features vs my script

I've always relied on my PHP code for most of the processing I need to do, even when I know it could be done with a MySQL query or feature. For example:
I know that MySQL has a FOREIGN KEY feature that helps maintain data integrity, but I don't rely on it. I might as well have my scripts do this, since it's more flexible; I'm basically using MySQL as STORAGE and my SCRIPTS as the processor.
I would like to keep things that way and put most of the load on my code. I make sure my scripts are robust enough to check for conflicts, orphaned rows, etc. every time they make changes, and I even have a SYSTEM CHECK routine that runs through all of these data verification processes. So I really try to do everything on the script side, as long as it doesn't significantly impact performance (I know MySQL can do some things faster internally, and I do use MySQL's COUNT() function, of course).
Of course, any direct changes made to the tables will not trigger the routines in my script, but that's a different story. I'm pretty comfortable doing this and I plan to keep doing it until I'm convinced otherwise.
The only thing I really have an issue with right now is checking for duplicates.
My current routine basically inserts products with serial numbers. I need to make sure no duplicate serial numbers are entered into the database.
I could simply rely on MySQL's UNIQUE constraint to ensure this, or I could do it on the script side, which is what I did.
This product routine is a BATCH routine where anything from 1 to 500 products is entered into the database in one call to the script.
Obviously I check for duplicates both within the submitted data and against the data already in the database. Here's a chunk of my routine:
for ($i = 1; $i <= $qty; $i++) {

    $serial = $serials_array[$i - 1]; // -1 because arrays start at zero

    // Check for duplicates within the submitted data ++++++++++++++
    if (isset($serial_check[$serial])) { // duplicate found!
        exit("stat=err&statMsg=Duplicate serial found in your entry! ($serial)");
    } else {
        $serial_check[$serial] = 1;
    }

    // Check for duplicates already in the database
    if (db_checkRow("inventory_stocks", "WHERE serial='$serial'")) {
        exit("stat=err&statMsg=Serial Number is already used. ($serial)");
    }
    //++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
}
OK, so basically it's:
1) Check the submitted data for duplicates by building an array that I can check each submitted serial number against - THIS IS no problem and really fast with PHP, even up to 1000 records.
2) But to check the database for duplicates, I have to call a function I made (db_checkRow) which basically issues a SELECT statement for EACH serial submitted and sees if there's a hit/duplicate.
So, basically, it's 500 SELECT statements to check for duplicates versus just using MySQL's UNIQUE constraint feature.
Does it really matter much??
Another reason I design my software like this is that if I ever need to deploy my stuff on a different database, I don't rely too much on database features, so I can port my application with very little tweaking.

It's almost guaranteed that MySQL will be faster at checking duplicates. Unless you are running your PHP on some uber-machine and MySQL is running on an old wristwatch, the index check will be faster and better optimized than anything you can do in PHP.
Not to mention that your process is only fine until someone else (or some other app) starts writing to the db. And you save yourself having to write the duplicate-checking code in the first place - and again in the next app - and so on.
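For what it's worth, even if the check stays in PHP, it doesn't need 500 queries; one SELECT with an IN() list can report every clashing serial in a single round trip. A rough sketch against the inventory_stocks table from the question (the serial values are placeholders, and the quoting/escaping of the submitted values is omitted):

SELECT serial
FROM inventory_stocks
WHERE serial IN ('SN0001', 'SN0002', 'SN0003');  -- the full batch of submitted serials

Any serials that come back are already used, so they can all be reported at once.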

You're wrong. You're very, dangerously wrong.
The database has been designed for a specific function, and you will never beat MySQL at enforcing a unique constraint. The database has been designed to do exactly that as quickly as possible. You can't do it faster or more efficiently in PHP, because you still have to access the database to determine whether the data you're inserting would be a duplicate.
This is easily demonstrated by the fact that you have 500 select statements to enforce a single unique constraint. As your table grows this will get even more ridiculous. What happens when your table hits 2,000 rows? What if you have a new table with a million rows?
Use the database features that have been designed explicitly to make your life easy.
You're also assuming that the only way the database will be accessed is through the application. This is an extremely dangerous assumption that is almost certain to be incorrect as time progresses.
Please read this programmers question, which seems like it's been written just for you. Simply put, “Never do in code what you can get the SQL server to do well for you”. I cannot emphasise this enough.
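To make that concrete, here is a minimal sketch of what letting MySQL enforce the rule looks like, using the inventory_stocks table and serial column from the question (the constraint name and the extra product_name column are made up):

ALTER TABLE inventory_stocks
    ADD CONSTRAINT uq_serial UNIQUE (serial);

-- A batch insert then needs no pre-checking: with InnoDB, if any serial
-- collides, MySQL raises error 1062 (duplicate entry) and the whole
-- statement is rolled back, so nothing is half-inserted.
INSERT INTO inventory_stocks (serial, product_name)
VALUES ('SN0001', 'Widget'), ('SN0002', 'Widget');

The PHP side only has to catch that error and turn it into the same "Serial Number is already used" message.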

Related

Database query efficiency

My boss is having me create a database table that keeps track of some of our inventory with various parameters. It's meant to be implemented as a cron job that runs every half hour or so, but the scheduling part isn't important since we've already discussed that we're handling it later.
What I want to know is whether it's more efficient to just delete everything in the table each time the script is called and repopulate it, or to go through each record, determine whether any changes were made, and update each entry accordingly. It's easier to do the former, but given that we have over 700 separate records to keep track of, I don't know if the time it takes would put a huge load on the server. The script is written in PHP.
700 records is an extremely small number of records to have performance concerns. Don't even think about it, do whichever is easier for you.
But if it is performance you are after: updating rows is slower than inserting rows (especially if you are not expecting any generated keys, so an insert is a one-way trip to the database instead of a round trip), and TRUNCATE TABLE tends to be faster than DELETE FROM.
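As a rough illustration of the wipe-and-repopulate variant (the table and column names are invented for the example):

TRUNCATE TABLE inventory_snapshot;
INSERT INTO inventory_snapshot (item_id, qty, price)
VALUES (1, 10, 9.99),
       (2, 3, 4.50);   -- ...one row per tracked item, built up by the PHP script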
If your inventory rows have IDs (talking about a SQL DB), then it would be better practice to update them, since in theory your IDs will eventually get exhausted (overflow) if you keep deleting and re-inserting.
Another approach would be to use a NoSQL DB like MongoDB and simply update it with the given JSON bodies, keeping the existing IDs; the DB will then figure out on its own whether to insert or update.

SQL Server 2008 - How to implement a "Watch Dog Service" which woofs when too many insert statements on a table

As my title describes: how can I implement something like a watchdog service in SQL Server 2008 that performs the following task: alerting or taking an action when too many inserts are committed to a table?
For instance: in a normal situation the error table gets 10 error messages in one second. If there are more than 100 error messages (100 inserts) in one second, then: ALERT!
Would appreciate it if you could help me.
P.S.: No. SQL Jobs are not an option because the watchdog should be live and woof on the fly :-)
Integration Services? Are there easier ways to implement such a service?
Kind regards,
Sani
I don't understand your problem exactly, so I'm not entirely sure whether my answer actually solves anything or just makes an underlying problem worse. Especially if you are facing performance or concurrency problems, this may not work.
If you can update the original table, just add a datetime2 field like
InsertDate datetime2 NOT NULL DEFAULT GETDATE()
Preferably, put an index on that column, and then, at whatever interval fits, poll the table to see how many rows have an InsertDate > GETDATE() - X.
For this particular case, you might benefit from making the polling query read uncommitted (or use WITH (NOLOCK)), although one has to be careful when doing so.
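A rough sketch of that approach, assuming the error table is called ErrorLog (the table, column and index names here are placeholders):

ALTER TABLE ErrorLog
    ADD InsertDate datetime2 NOT NULL DEFAULT GETDATE();
CREATE INDEX IX_ErrorLog_InsertDate ON ErrorLog (InsertDate);

-- polling query: how many rows were inserted in the last second?
SELECT COUNT(*)
FROM ErrorLog WITH (NOLOCK)
WHERE InsertDate > DATEADD(second, -1, GETDATE());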
If you can't modify the table itself and you can't or won't make another process or job monitor the relevant variables, I'd suggest the following:
Make a 'counter' table that just has one Datetime2 column.
On the original table, create an AFTER INSERT trigger that:
Deletes all rows where the datetime-field is older than X seconds.
Inserts one row with current time.
Counts to see if too many rows are now present in the counter-table.
Acts if necessary - ie. by executing a procedure that will signal sender/throw exception/send mail/whatever.
If you can modify the original table, add the datetime column to that table instead and make the trigger count all rows that aren't yet X seconds old, and act if necessary.
I would also look into getting another process (e.g. a SQL Job or a homemade service or similar) to do all the housekeeping, i.e. deleting old rows, counting rows and acting on the result. Keeping this as the work of the trigger is not a good design and will probably cause problems in the long run.
If possible, you should consider having some other process doing the housekeeping.
Update: A better solution will probably be to make the trigger insert notifications (ie. datetimes) into a queue - if you then have something listening against that queue, you can write logic to determine whether your threshold has been exceeded. However, that will require you to move some of your logic to another process, which I initially understood was not an option.
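For reference, here is a minimal sketch of the counter-table trigger outlined above (all names are invented, using the 100-inserts-per-second threshold from the question; the caveats about doing housekeeping inside a trigger still apply):

CREATE TABLE ErrorRateCounter (InsertTime datetime2 NOT NULL);
GO
CREATE TRIGGER trg_ErrorLog_RateWatch ON ErrorLog
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- housekeeping: drop counter rows older than one second
    DELETE FROM ErrorRateCounter
    WHERE InsertTime < DATEADD(second, -1, GETDATE());
    -- record one counter row per inserted error row
    INSERT INTO ErrorRateCounter (InsertTime)
    SELECT GETDATE() FROM inserted;
    -- act if the one-second threshold is exceeded
    IF (SELECT COUNT(*) FROM ErrorRateCounter) > 100
        EXEC dbo.usp_RaiseWatchdogAlert;  -- hypothetical alerting procedure
END;
GO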

How to update database of ~25,000 music files?

Update:
I wrote a working script that finishes this job in a reasonable length of time, and seems to be quite reliable. It's coded entirely in PHP and is built around the array_diff() idea suggested by saccharine (so, thanks saccharine!).
You can access the source code here: http://pastebin.com/ddeiiEET
I have a MySQL database that is an index of mp3 files in a certain directory, together with their attributes (ie. title/artist/album).
New files are often added to the music directory. At the moment it contains about 25,000 MP3 files, and I need to create a cron job that goes through it each day or so, adding any files it doesn't find in the database.
The problem is that I don't know the best / least taxing way of doing this. I'm assuming a MySQL query would have to be run for each file on each cron run (to check whether it's already indexed), so the script would unavoidably take a little while to run (which is okay; it's an automated process). However, because of this, my usual language of choice (PHP) might not suffice, as it isn't designed for long-running scripts like this (or is it...?).
It would obviously be nice, but I'm not fussed about deleting index entries for deleted files (if files do get deleted, it's always manual cleaning up, and I don't mind just going into the database by hand to fix the index).
By the way, it would need to be recursive; the files are mostly in an Artist/Album/Title.mp3 structure, but they aren't religiously ordered like that, and the script would certainly have to be able to fetch ID3 tags for new files. In fact, ideally, I would like the script to fetch ID3 tags for each file on every run, and either add a new row to the database or update the existing one if it has changed.
Anyway, I'm starting from the ground up with this, so the most basic advice first I guess (such as which programming language to use - I'm willing to learn a new one if necessary). Thanks a lot!
First, a dumb question: would it not be possible to simply order the files by date added and only iterate through the files added in the last day? I'm not very familiar with working with files, but it seems like it should be possible.
If all you want to do is improve the speed of your current code, I would recommend checking that your data is properly indexed. Queries are a lot faster when they can search through a table's index, so if you're searching on columns that aren't the key, you might want to change your setup. You should also avoid using "SELECT *" and instead use "SELECT COUNT(*)", as MySQL will then return an int instead of objects.
You can also do everything in a few MySQL queries, though it will increase the complexity of your PHP code. Call the array with information about all the files $files. Select the data from the db where the files in the db match a file in $files. Something like this:
"SELECT id FROM MUSIC WHERE id IN ($files)"
Read the returned array and label it $db_files. Then find all files in $files array that don't appear in $db_files array using array_diff(). Label the missing files $missing_files. Then insert the files in $missing_files into the db.
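If you would rather let MySQL do the diff instead of array_diff(), an alternative sketch is to bulk-load the scanned file list into a temporary table and ask for the rows the index doesn't have yet (the music table and path column names are guesses, not the asker's actual schema):

CREATE TEMPORARY TABLE scanned_files (path VARCHAR(255) PRIMARY KEY);
-- (populate scanned_files with one row per file found by the directory scan)
SELECT s.path
FROM scanned_files AS s
LEFT JOIN music AS m ON m.path = s.path
WHERE m.path IS NULL;  -- files not yet in the index: fetch their ID3 tags and INSERT them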
What kind of engine are you using? If you're using MyISAM, the whole table will be locked while you update it. But still, 25k rows is not that much, so the update should basically finish in (at most) a few minutes. If it is InnoDB, just update it, since InnoDB uses row-level locking and you should still be able to use the table while it's being updated.
By the way, if you're not using any full-text search on that table, I believe you should convert it to InnoDB, as you can then use foreign keys, which would help you a lot when joining tables. Also, it scales better, AFAIK.

MySQL table modified timestamp

I have a test server that uses data from a test database. When I'm done testing, it gets moved to the live database.
The problem is, I have other projects that rely on the data now in production, so I have to run a script that grabs the data from the tables I need, deletes the data in the test DB and inserts the data from the live DB.
I have been trying to figure out a way to improve this model. The problem isn't so much in the migration, since the data only gets updated once or twice a week (without any action on my part). The problem is having the migration take place only when it needs to. I would like to have my migration script include a quick check against the live tables and the test tables and, if need be, make the move. If there haven't been updates, the script quits.
This way, I can include the update script in my other scripts and not have to worry if the data is in sync.
I can't use timestamps. For one, I have no control over the tables on the live side once they go live, and it also seems a bit silly to bulk up the tables just for convenience.
I tried doing a "SHOW TABLE STATUS FROM livedb", but because the tables are all InnoDB there is no "Update_time"; plus, it appears that the "Create_time" was this morning, leading me to believe that the database is backed up and re-created daily.
Is there any other property in the table that would show which of the two is newer? A "Newest Row Date" perhaps?
In short: make the development-to-live update a first-class part of your application. Instead of depending on the database engine to supply the information you need to make the decision (to update or not to update... that is the question), just implement it as part of your application. Otherwise, you're trying to fit a round peg into a square hole.
Without knowing what your data model is, and without understanding at all what your synchronization model is, you have a few options:
Match primary keys against live database vs. the test database. When test > live IDs, do an update.
Use timestamps in a table to determine if it needs to be updated
Use the md5 hash of a database table and modification date (UTC) to determine if a table has changed.
Long story short: Database synchronization is very hard. Implement a solution which is specific to your application. There is no "generic" solution which will work ideally.
If you have an autoincrement in your tables, you could compare the maximum autoincrement values to see if they're different.
But which version of mysql are you using?
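The comparison itself could look something like this (assuming both schemas sit on the same MySQL server, that the table is called mytable, and that id is the auto-increment column - all placeholder names):

SELECT
    (SELECT MAX(id) FROM livedb.mytable) AS live_max,
    (SELECT MAX(id) FROM testdb.mytable) AS test_max;

If live_max is greater than test_max, new rows have arrived on the live side and the migration should run; otherwise the script can just quit.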
Rather than rolling your own, you could use a preexisting solution for keeping databases in sync. I've heard good things about SQLYog's SJA (see here). I've never used it myself, but I've been very impressed with their other programs.

What is the best way to update (or replace) an entire database table on a live machine?

I'm being given a data source weekly that I'm going to parse and put into a database. The data will not change much from week to week, but I should be updating the database on a regular basis. Besides this weekly update, the data is static.
For now, rebuilding the entire database isn't a problem, but eventually this database will be live and people could be querying it while I'm rebuilding it. The amount of data isn't small (a couple hundred megabytes), so it won't load instantaneously, and personally I want a bit more of a foolproof system than "I hope no one queries while the database is in disarray."
I've thought of a few different ways of solving this problem and was wondering what the best method would be. Here are my ideas so far:
Instead of replacing entire tables, query for the difference between my current database and what I want to place in the database. This seems like it could be an unnecessary amount of work, though.
Creating dummy data tables, then doing a table rename (or having the server code point towards the new data tables).
Just telling users that the site is going through maintenance and put the system offline for a few minutes. (This is not preferable for obvious reasons, but if it's far and away the best answer I'm willing to accept that.)
Thoughts?
I can't speak for MySQL, but PostgreSQL has transactional DDL. This is a wonderful feature, and means that your second option, loading new data into a dummy table and then executing a table rename, should work great. If you want to replace the table foo with foo_new, you only have to load the new data into foo_new and run a script to do the rename. This script should execute in its own transaction, so if something about the rename goes bad, both foo and foo_new will be left untouched when it rolls back.
The main problem with that approach is that it can get a little messy to handle foreign keys from other tables that key on foo. But at least you're guaranteed that your data will remain consistent.
A better approach in the long term, I think, is just to perform the updates on the data directly (your first option). Once again, you can stick all the updating in a single transaction, so you're guaranteed all-or-nothing semantics. Even better would be online updates, just updating the data directly as new information becomes available. This may not be an option for you if you need the results of someone else's batch job, but if you can do it, it's the best option.
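A sketch of that rename script, using the foo / foo_new names from above:

BEGIN;
ALTER TABLE foo RENAME TO foo_old;
ALTER TABLE foo_new RENAME TO foo;
COMMIT;
-- once you're happy with the result: DROP TABLE foo_old;

If either rename fails, the whole transaction rolls back and foo is left untouched.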
BEGIN;
DELETE FROM my_table;
INSERT INTO my_table VALUES (...);  -- bulk-load the new week's data here
COMMIT;
Users will see the changeover instantly when you hit commit. Any queries started before the commit will run on the old data, anything afterwards will run on the new data. The database will actually clear out the old rows once the last user is done with them. Because everything is "static" (you're the only one who ever changes it, and only once a week), you don't have to worry about any lock issues or timeouts. For MySQL, this depends on InnoDB. PostgreSQL does it too, and SQL Server calls it "snapshot" isolation, though I can't remember the details off the top of my head since I rarely use the thing.
If you Google "transaction isolation" + the name of whatever database you're using, you'll find appropriate information.
We solved this problem by using PostgreSQL's table inheritance/constraints mechanism.
You create a trigger that auto-creates sub-tables partitioned based on a date field.
This article was the source I used.
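Very roughly, the inheritance side of that looks like this in PostgreSQL (a sketch only, with invented table and column names):

CREATE TABLE measurements (
    created_at date NOT NULL,
    value      numeric
);
CREATE TABLE measurements_2012_w01 (
    CHECK (created_at >= DATE '2012-01-02' AND created_at < DATE '2012-01-09')
) INHERITS (measurements);
-- a BEFORE INSERT trigger on measurements then routes each new row into the
-- matching child table, creating the child first if it doesn't exist yet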
Which database server are you using? SQL Server 2005 and above provide an isolation level called "snapshot". It allows you to open a transaction, do all of your updates, and then commit, all while users of the database continue to view the pre-transaction data. Normally, your transaction would lock your tables and block their queries, but snapshot isolation would be perfect in your case.
More info here: http://blogs.msdn.com/craigfr/archive/2007/05/16/serializable-vs-snapshot-isolation-level.aspx
But it requires SQL Server, so if you're using something else....
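If you are on SQL Server, enabling it looks roughly like this (MyDatabase is a placeholder name; depending on how your readers connect, you may also want READ_COMMITTED_SNAPSHOT so that they don't block on the rebuild transaction):

ALTER DATABASE MyDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- then the weekly rebuild runs inside one snapshot transaction:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
-- delete and reload the table here
COMMIT;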
Several database systems (since you didn't specify yours, I'll keep this general) do offer the SQL:2003 Standard statement called MERGE which will basically allow you to
insert new rows into a target table from a source which don't exist there yet
update existing rows in the target table based on new values from the source
optionally even delete rows from the target that don't show up in the import table anymore
SQL Server 2008 is the first Microsoft offering to have this statement - check out more here, here or here.
Other database systems will probably have similar implementations - it's a SQL:2003 standard statement, after all.
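A rough sketch of what that looks like in SQL Server 2008's dialect (all table and column names here are invented for illustration):

MERGE INTO inventory AS target
USING weekly_import AS source
    ON target.item_id = source.item_id
WHEN MATCHED THEN
    UPDATE SET target.qty = source.qty
WHEN NOT MATCHED BY TARGET THEN
    INSERT (item_id, qty) VALUES (source.item_id, source.qty)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;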
Marc
Use different table names (mytable_[yyyy]_[wk]) and a view to provide a constant name (mytable). Once a new table has been completely imported, update the view so that it uses that table.
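For example, in MySQL (the week-suffixed names are just placeholders):

CREATE TABLE mytable_2012_01 LIKE mytable_2011_52;
-- bulk-load the new week's data into mytable_2012_01, then repoint the view:
CREATE OR REPLACE VIEW mytable AS
SELECT * FROM mytable_2012_01;

Queries against mytable keep working throughout, and the switch happens only once the new table is fully loaded.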