Missing autonumbers in Access

I have a very basic database with only one main table and a few lookups - no relationships.
The Autonumber field in the main table has little code associated with its form; however, I am noticing that every 10 records or so it skips a number.
I have locked the DB up pretty tight from the users, so they have no delete access and can only modify records very sparingly once created. They have no way to delete a wrong entry - they must tick a box called CANCELED in order to remove the entry from the list and start again. The ONLY way to delete a record is to SHIFT-open the database, open the table and delete from there... I doubt they are doing that, but anything is possible...
The question is this - I have seen numerous web discussions on similar issues, but the solutions generally point to some code, a formatting issue, or a SQL / Access thing... I have no such setup... it's a straight front end / back end DB using linked tables on the local network.
Can someone please advise whether this is just an Access thing to ignore, or whether it is very unusual and something is going on, i.e. someone IS deleting records? If someone IS deleting records, is there any way I can password-protect the database when it is opened in edit mode? Or can I password-protect the table itself?
Or even better - is there a way I can add some fields and code to see what the heck is going on, i.e. whether Access is just not creating that number or whether someone is messing with me?
Thanks

The basic rule for database auto numbers is simply that they are INTERNAL numbers – END of story! When you load a Word document, do you care about the memory segment number used? Auto numbers exist to set up relations between tables. They are a “concept”, and whether the tables are linked by pictures, apes in the jungle eating bananas, or some auto number sequence, you do NOT care.
However, to answer your question: if you jump to a new record and then start typing, the record is dirty. Of course the user might decide, hey, I don't want to add this record. If they go Edit -> Undo, or hit Ctrl-Z and then exit, the record is never created or saved – but the auto number has already been incremented. And since the database is multi-user, one user starts working, and then another – both will be assigned an auto number, but both may decide not to save.
Auto numbers are NOT to be given meaning for the end user, and really users should never even see them. Users never see the computer memory segment that a record or Word document loads into either – they don't care.
How internal indexing works and how tables are laid out are the SOLE concern of the database engine, and have ZERO to do with you or your users.
Now of course you are “aware” that your computer has memory, but you would NOT expose the “memory” location used to your end users, since such internal housekeeping numbers are just that – internal housekeeping numbers.
In addition to users hitting undo and bailing out of adding a record, deleting records in general will also produce gaps.
If you're looking for some kind of number sequence, then create an invoice number field, or whatever. While an invoice number can be required, if you use internal auto numbers then your database design can function even when you don't have some social insurance number or some silly invoice number yet. What do those have to do with you, the developer, building relations between tables? (Answer: absolutely nothing at all!!!)
The fact that your database functions fine without an invoice number or other numbers has ZERO to do with internal numbers used for housekeeping and to maintain relationships.
You define relationships in your database – these have ZERO to do with what your users think about, know about, etc. Such numbers have no more meaning than the memory segment used in your computer's RAM to load a record into.
If you need some kind of invoice number, or some other sequence number, then you have to add that design part to your database. Such numbers have ZERO to do with some internal numbers that Access uses and maintains to build relationships with.
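As a rough sketch of that idea (the table and index names here are purely hypothetical, and this is Access/Jet-style SQL you would run from a query in SQL view): add your own user-facing number alongside the autonumber.

ALTER TABLE tblOrders ADD COLUMN InvoiceNumber LONG;
CREATE UNIQUE INDEX idxInvoiceNumber ON tblOrders (InvoiceNumber);

The unique index only guarantees that no two records ever share an invoice number; how you assign the next number is a separate design decision in your form code or business logic, and it stays completely independent of the internal autonumber.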
In a multi-user environment, and due to additions or deletions, you as a general rule might as well assume auto numbers are random – they have no meaning to users, nor any to business rules that require some kind of sequence number.

UNIQUE constraint vs checking before INSERT

I have a SQL server table RealEstate with columns - Id, Property, Property_Value. This table has about 5-10 million rows and can increase even more in the future. I want to insert a row only if a combination of Id, Property, Property_Value does not exist in this table.
Example Table -
1,Rooms,5
1,Bath,2
1,Address,New York
2,Rooms,2
2,Bath,1
2,Address,Miami
Inserting 2,Address,Miami should NOT be allowed. But, 2,Price,2billion is okay. I am curious to know which is the "best" way to do this and why. The why part is most important to me. The two ways of checking are -
At application level - The app should check if a row exists before it inserts a row.
At database level - Set unique constraints on all 3 columns and let the database do the checking instead of the person/app.
Is there any scenario where one would be better than the other?
Thanks.
PS: I know there is a similar question already, but it does not answer my problem -
Unique constraint vs pre checking
Also, I think that UNIQUE is applicable to all databases, so I don't think I should remove the mysql and oracle tags.
I think in most cases the differences between the two are going to be small enough that the choice should mostly be driven by picking the implementation that ends up being most understandable to someone looking at the code for the first time.
However, I think exception handling has a few small advantages:
Exception handling avoids a potential race condition. The 'check, then insert' method might fail if another process inserts a record between your check and your insert. So, even if you're doing 'check then insert', you still want exception handling on the insert, and if you're already doing exception handling anyway, then you might as well do away with the initial check.
If your code is not a stored procedure and has to interact with the database via the network (i.e. the application and the db are not on the same box), then you want to avoid having two separate network calls (one for the check and the other for the insert) and doing it via exception handling provides a straightforward way of handling the whole thing with a single network call. Now, there are tons of ways to do the 'check then insert' method while still avoiding the second network call, but simply catching the exception is likely to be the simplest way to go about it.
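As a minimal sketch, assuming the RealEstate table from the question and MySQL-style syntax (the constraint name is made up):

-- Enforce the combination once, at the database level
ALTER TABLE RealEstate
    ADD CONSTRAINT uq_realestate_row UNIQUE (Id, Property, Property_Value);

-- The application then issues a single INSERT and catches the duplicate-key
-- error (ER_DUP_ENTRY in MySQL, error 2627 in SQL Server) instead of checking first
INSERT INTO RealEstate (Id, Property, Property_Value)
VALUES (2, 'Address', 'Miami');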
On the other hand, exception handling requires a unique constraint (which is really a unique index), which comes with a performance tradeoff:
Creating a unique constraint will be slow on very large tables and it will cause a performance hit on every single insert to that table. On truly large databases you also have to budget for the extra disk space consumed by the unique index used to enforce the constraint.
On the other hand, it might make selecting from the table faster if your queries can take advantage of that index.
I'd also note that if you're in a situation where what you actually want to do is 'update else insert' (i.e. if a record with the unique value already exists then you want to update that record, else you insert a new record) then what you actually want to use is your particular database's UPSERT method, if it has one. For SQL Server and Oracle, this would be a MERGE statement.
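For example, a sketch of the 'update else insert' case in SQL Server's MERGE syntax (Oracle's syntax differs slightly, and MySQL would use INSERT ... ON DUPLICATE KEY UPDATE instead), again using the question's table:

-- Upsert: update the value if the (Id, Property) pair exists, otherwise insert it
MERGE INTO RealEstate AS target
USING (SELECT 2 AS Id, 'Address' AS Property, 'Miami' AS Property_Value) AS source
    ON target.Id = source.Id AND target.Property = source.Property
WHEN MATCHED THEN
    UPDATE SET target.Property_Value = source.Property_Value
WHEN NOT MATCHED THEN
    INSERT (Id, Property, Property_Value)
    VALUES (source.Id, source.Property, source.Property_Value);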
Provided the cost of #1 (doing a lookup) is reasonable, I would do both. At least in Oracle, which is the database I have the most experience with.
Rationale:
Unique/primary keys should be a core part of your data model design; I can't see any reason not to implement them. If you have so much data that performance suffers from maintaining the unique index:
that's a lot of data
partition it or archive it away from your OLTP work
The more constraints you have, the safer your data is against application logic errors.
If you check that a row exists first, you can easily extract other information from that row to use as part of an error message, or otherwise fork the application logic to cope with the duplication.
In Oracle, rolling back DML statements is relatively expensive because Oracle expects to succeed (i.e. COMMIT changes that have been written) by default.
This does not answer the question directly, but I thought it might be helpful to post it here since it's better than Wikipedia and the link might just become dead someday.
Link - http://www.celticwolf.com/blog/2010/04/27/what-is-a-race-condition/
Wikipedia has a good description of a race condition, but it’s hard to follow if you don’t understand the basics of programming. I’m going to try to explain it in less technical terms, using the example of generating an identifier as described above. I’ll also use analogies to human activities to try to convey the ideas.
A race condition is when two or more programs (or independent parts of a single program) all try to acquire some resource at the same time, resulting in an incorrect answer or conflict. This resource can be information, like the next available appointment time, or it can be exclusive access to something, like a spreadsheet. If you’ve ever used Microsoft Excel to edit a document on a shared drive, you’ve probably had the experience of being told by Excel that someone else was already editing the spreadsheet. This error message is Excel’s way of handling the potential race condition gracefully and preventing errors.
A common task for programs is to identify the next available value of some sort and then assign it. This technique is used for invoice numbers, student IDs, etc. It’s an old problem that has been solved before. One of the most common solutions is to allow the database that is storing the data to generate the number. There are other solutions, and they all have their strengths and weaknesses.
Unfortunately, programmers who are ignorant of this area or simply bad at programming frequently try to roll their own. The smart ones discover quickly that it’s a much more complex problem than it seems and look for existing solutions. The bad ones never see the problem or, once they do, insist on making their unworkable solution ever more complex without fixing the error. Let’s take the example of a student ID. The neophyte programmer says “to know what the next student number should be, we’ll just get the last student number and increment it.” Here’s what happens under the hood:
Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and gets the highest student number in use, 5012.
George’s program, at the same time, gets the same answer to the same question.
Both programs decide that the new student ID for the record that they’re saving should be 5013. They add that information to the record and then save it in the database.
Now Bob Smith (Betty’s student) and Gina Verde (George’s student) have the same student ID.
This student ID will be attached to all sorts of other records, from grades to meal cards for the dining hall. Eventually this problem will come to light and someone will have to spend a lot of time assigning one of them a new ID and sorting out the mixed-up records.
When I describe this problem to people, the usual reaction is “But how often will that happen in practice? Never, right?”. Wrong. First, when data entry is being done by your staff, it’s generally done during a relatively small period of time by everyone. This increases the chances of an overlap. If the application in question is a web application open to the general public, the chances of two people hitting the “Save” button at the same time are even higher. I saw this in a production system recently. It was a web application in public beta. The usage rate was quite low, with only a few people signing up every day. Nevertheless, six pairs of people managed to get identical IDs over the space of a few months. In case you’re wondering, no, neither I nor anyone from my team wrote that code. We were quite surprised, however, at how many times that problem occurred. In hindsight, we shouldn’t have been. It’s really a simple application of Murphy’s Law.
How can this problem be avoided? The easiest way is to use an existing solution to the problem that has been well tested. All of the major databases (MS SQL Server, Oracle, MySQL, PostgreSQL, etc.) have a way to increment numbers without creating duplicates. MS SQL Server calls it an “identity” column, while MySQL calls it an “auto_increment” column, but the function is the same. Whenever you insert a new record, a new identifier is automatically created and is guaranteed to be unique. This would change the above scenario as follows:
Betty, an admin. assistant in the admissions office fires up the student management program. Note that this is really just a copy of the program that runs on her PC. It talks to the database server over the school’s network, but has no way to talk to other copies of the program running on other PCs.
Betty creates a new student record for Bob Smith, entering all of the information.
While Betty is doing her data entry, George, another admin. assistant, fires up the student management program on his PC and begins creating a record for Gina Verde.
George is a faster typist, so he finishes at the same time as Betty. They both hit the “Save” button at the same time.
Betty’s program connects to the database server and hands it the record to be saved.
George’s program, at the same time, hands over the other record to be saved.
The database server puts both records into a queue and saves them one at a time, assigning the next available number to them.
Now Bob Smith (Betty’s student) gets ID 5013 and Gina Verde (George’s student) gets id 5014.
With this solution, there is no problem with duplication. The code that does this for each database server has been tested repeatedly over the years, both by the manufacturer and by users. Millions of applications around the world rely on it and continue to stress test it every day. Can anyone say the same about their homegrown solution?
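In MySQL terms, that "let the database generate the number" approach looks roughly like this (a hypothetical students table, not code from the article):

-- The server, not the client program, hands out the next student id
CREATE TABLE students (
    student_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    full_name  VARCHAR(100) NOT NULL,
    PRIMARY KEY (student_id)
);

-- Betty's and George's programs each just insert the record;
-- the server assigns 5013 to whichever arrives first and 5014 to the other
INSERT INTO students (full_name) VALUES ('Bob Smith');
INSERT INTO students (full_name) VALUES ('Gina Verde');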
There is at least one well tested way to create identifiers in the software rather than in the database: uuids (Universally Unique Identifiers). However, a uuid takes the form of xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx where “x” stands for a hexadecimal digit (0-9 and a-f). Do you want to use that for an invoice number, student ID or some other identifier seen by the public? Probably not.
To summarize, a race condition occurs when two programs, or two independent parts of a program, attempt to access some information or access a resource at the same time, resulting in an error, be it an incorrect calculation, a duplicated identifier or conflicting access to a resource. There are many more types of race conditions than I’ve presented here and they affect many other areas of software and hardware.
The description of your problem is exactly why primary keys can be compound, i.e., they can consist of multiple fields. That way, the database will handle the uniqueness for you, and you don't need to care about it.
In your case, the table definition could be something like the following:
CREATE TABLE `real_estate` (
  `id` int(11) NOT NULL,
  `property` varchar(255) NOT NULL,
  `property_value` varchar(255) NOT NULL,
  PRIMARY KEY (`id`, `property`, `property_value`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
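With a key like that in place, the behaviour from the question falls out automatically. A quick sketch (the duplicate INSERT below simply fails with a duplicate-key error that the application can catch or ignore):

-- First insert succeeds
INSERT INTO `real_estate` (`id`, `property`, `property_value`)
VALUES (2, 'Address', 'Miami');

-- Re-inserting the same combination violates the key and raises ER_DUP_ENTRY
INSERT INTO `real_estate` (`id`, `property`, `property_value`)
VALUES (2, 'Address', 'Miami');

-- A new combination for the same id is still fine
INSERT INTO `real_estate` (`id`, `property`, `property_value`)
VALUES (2, 'Price', '2billion');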

Should "binary flags" for each user be placed in the main Users table or in its own "binary flags" table?

On one of my sites, I have a main Users table with each user's unique user id, e-mail address, password, etc.
I need to start keeping track of a lot of binary flags related to each user, such as whether they have confirmed their e-mail, whether they have posted a message, whether they have upgraded their account, whether they have done X, whether they have done Y, etc.
Each of these flags is a simple "0" (false) or "1" (true), and based on these flags, my site shows the user or does different things.
My question is, does it make more sense to add these binary flags to the main Users table or to create a separate table for the binary flags or something else?
Please try to explain your reasoning (and the advantages of your approach) so that I understand where you're coming from.
Do all these flags need to be stored, or can they be calculated? For example, if a user hasn't posted any message, this can easily be determined by querying the MESSAGE table.
Physically storing "calculable" flags is redundant and opens the possibility of data inconsistencies. For example, what if a user adds a message but a bug in your application prevents the flag update? Such "denormalization" may be justified for performance reasons, but only make this decision after you have measured the performance on realistic amounts of data and representative workloads.
OTOH, some flags may be "real" (e.g. whether the user has confirmed the e-mail). If such flags are relatively static (i.e. you know them in advance, at the time you are designing your data model), store them directly as simple boolean (or equivalent) fields in the USER table itself.
Only if you need considerable run-time flexibility should you consider using a separate FLAG table in an N:1 relationship with the USER table. This is a kind of EAV.
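A minimal sketch of the two styles in MySQL terms (all names here are illustrative, not from the question):

-- Static, well-known flags live directly on the user row
CREATE TABLE users (
    user_id         INT UNSIGNED NOT NULL AUTO_INCREMENT,
    email           VARCHAR(255) NOT NULL,
    email_confirmed TINYINT(1)   NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id)
);

-- Flags that may be added at run time go in a separate N:1 table (EAV-style)
CREATE TABLE user_flags (
    user_id   INT UNSIGNED NOT NULL,
    flag_name VARCHAR(64)  NOT NULL,
    flag_set  TINYINT(1)   NOT NULL DEFAULT 0,
    PRIMARY KEY (user_id, flag_name),
    FOREIGN KEY (user_id) REFERENCES users (user_id)
);

-- A "calculable" flag such as "has posted a message" is better derived than stored
SELECT EXISTS (SELECT 1 FROM messages WHERE messages.user_id = 42) AS has_posted;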
You have advantages in keeping them together and advantages in separating them: if you put the flags in the Users table, you have all the information about that user with a simple query on the user ID, instead of using a join to retrieve it.
On the other hand, keeping them in a separate table makes them "logically" separated from the data in the Users table, which might be completely unrelated (even if both concern the user), giving you a clearer database structure.
Another thing to take into account is how often you have to change and retrieve this data: if, for example, you just need the flags on login, then you might want to keep them in the same table and get all the login data at once; if, instead, you have to change them repeatedly, a separate table is the better choice.
That said, I would go for the two-table solution in any case, but that's just how I like to see them in the DB schema.

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored in the database for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) consists of key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater separately for frequently changing and seldom-changing data based on the storage engine. This would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e., if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to make a report showing common properties across all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction, any help on the matter would be great. Or if someone has done something similar and can offer advise I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice INSERT ... ON DUPLICATE KEY UPDATE extension when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
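A sketch of that four-column table in MySQL terms (the column sizes, table name and sample values are assumptions):

CREATE TABLE system_config (
    system_id VARCHAR(64)  NOT NULL,   -- system identifier
    posttime  DATETIME     NOT NULL,   -- when the value was posted
    name      VARCHAR(64)  NOT NULL,   -- parameter name
    value     VARCHAR(255) NOT NULL,   -- parameter value
    PRIMARY KEY (system_id, posttime, name)
);

-- With history: every post is a plain INSERT of new data points
INSERT INTO system_config (system_id, posttime, name, value)
VALUES ('unit-0042', NOW(), 'firmware_version', '1.8.3');

-- Without history: key on (system_id, name) instead and upsert, e.g.
-- INSERT INTO system_config (system_id, name, value)
--   VALUES ('unit-0042', 'firmware_version', '1.8.3')
--   ON DUPLICATE KEY UPDATE value = VALUES(value);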
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY storage engine for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
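A sketch of that MEMORY-table idea (the on-disk history table is assumed to already exist with the same columns plus a timestamp):

-- "Current state" table kept entirely in RAM; contents are lost on a server restart
CREATE TABLE system_state_current (
    system_id VARCHAR(64)  NOT NULL,
    name      VARCHAR(64)  NOT NULL,
    value     VARCHAR(255) NOT NULL,
    PRIMARY KEY (system_id, name)
) ENGINE=MEMORY;

-- Periodic job: copy a snapshot into the on-disk history table
INSERT INTO system_state_history (system_id, posttime, name, value)
SELECT system_id, NOW(), name, value
FROM system_state_current;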
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)) - see the sketch after these two options. That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
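A sketch of the first option, assuming MySQL/InnoDB and the 20-byte binary hash storage from the "first trick" above (names are illustrative):

CREATE TABLE saved_searches (
    search_id BIGINT UNSIGNED NOT NULL,   -- incrementing id handed out per saved dataset
    item_id   BINARY(20)      NOT NULL,   -- raw SHA-1 bytes of the Solr document id
    PRIMARY KEY (search_id, item_id)
) ENGINE=InnoDB;

-- Saving a dataset: bulk-insert its ids, ideally already sorted
INSERT INTO saved_searches (search_id, item_id)
VALUES (42, UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709')),
       (42, UNHEX('356a192b7913b04c54574d18c28d46e6395428ab'));

-- Expiring a session's dataset deletes one contiguous cluster of rows
DELETE FROM saved_searches WHERE search_id = 42;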

Never delete entries? Good idea? Usual?

I am designing a system, and I don't think it's a good idea to give the end user the ability to delete entries in the database. I think that way because the end user, once given admin rights, might end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries, or at least think that they did, if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it would just set the flag to "false" or something similar. Then there would be some kind of super admin - my company's team - who could change this field.
I already saw that at another company I worked for, but I was wondering whether it is a good idea. I could just make regular database backups and roll back if they make an error, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words, transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status etc., but you cannot dismiss the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
Like, a city can be deleted from the database. But when a rule appeared that said "send an SMS to all customers in Moscow", the cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (like, he made a report on which some management decision was based, and then the data the report was based on disappeared), it was considered OK to delete that data.
But if the decision affected some immediate actions with customers (like calling, or messing with the customer's balance, etc.), everything that led to those decisions was kept forever.
It may vary from one business model to another: sometimes, it may be required to record even internal data, sometimes it's OK to delete data that affects outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted, then there's no way to automatically roll back that deletion if it was made in error. Also, keeping a row around along with its previous state is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive that record into a different database. This way it's been physically moved to a location that is not normally searched. You might add a couple fields to capture who deleted it and when; but the point is it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
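A rough sketch of that archive-then-purge flow in MySQL (the database, table and column names are hypothetical):

-- The archive copy lives in a separate schema, out of the way of normal queries
CREATE TABLE archive_db.customers_deleted LIKE main_db.customers;
ALTER TABLE archive_db.customers_deleted
    ADD COLUMN deleted_by VARCHAR(64),
    ADD COLUMN deleted_at DATETIME;

-- "Deleting" a customer = copy the row to the archive, then remove it from the live table
INSERT INTO archive_db.customers_deleted
SELECT c.*, 'jsmith', NOW()
FROM main_db.customers AS c
WHERE c.customer_id = 123;

DELETE FROM main_db.customers WHERE customer_id = 123;

-- Permanent delete after 30 days
DELETE FROM archive_db.customers_deleted
WHERE deleted_at < NOW() - INTERVAL 30 DAY;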
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense. The user then has the impression that an item was deleted because it no longer shows up on the application screen. The scenarios where I would implement this include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive, let's say a list of categories that I want to be dynamic... I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic I will check whether the category is used anywhere before allowing the delete.
I suggest having a second database, like DB_Archives, where you add every row deleted from the main DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it's referenced elsewhere. This becomes overly complicated when your DB structure is massive.
This is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
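For example, a sketch in MySQL terms (the project described used MSSQL 2005, but the idea carries over; the table name and status codes are illustrative):

-- Reports and application code read the view, so they never have to remember the status filter
CREATE VIEW orders_active AS
SELECT *
FROM orders
WHERE status = 1;   -- 1 = active; other codes mark deleted / historic / needs-correction rows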
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB and thus kept the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will need an extra check (IsDeleted=false) etc. It does not sound like much, but when you build a larger application and 9 out of the 11 tables in a query require a non-deletion check... it's tedious and error-prone. (Well, yeah, then there are deleted/non-deleted views... when you remember to create/use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" values for very, very old data.
I haven't tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now OMG the world is dying!1eleven", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever it's a good idea; for performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags (moving to a totally different database seems a bit like overkill, but you can easily change to that more drastic approach later if eventually the amount of accumulated data turns out to be a problem for a single db with normal and "old stuff" tables).