MySQL primary key cleanup

I have a fairly large database that I use for tracking items installed in a home by our service reps. For programmatic simplicity I wrote the tracking page so that every time anyone updates, removes or adds a new installed item it totally clears that home's installed item list and rebuilds it from scratch.
This works very well and has been error-free in actual use, but now I've come into a different problem that I'm a bit worried about. The primary key that is used to track each particular item in the home has grown far faster than the actual number of rows, because every update clears out the old numbers and starts again from the highest auto_increment. This means I have large gaps in my ids and my highest id is thousands of numbers higher than the actual count of installed measures.
For clarification: I don't care that there are gaps in the ids, I built my system to only use that number as a foreign key reference to the billing information for it and it's never displayed. My actual concern is that I'm going to run out of digits far, far sooner than should be possible.
I know that I could change my script around to be "more efficient" and not delete items that don't change, and I may end up doing that in the future (this issue is a symptom of the purpose of my tracking radically changing in the middle of a project. Thanks, boss), but in the meantime I'd like to know if there is a way to "clean up" my ids. Everything that depends on those numbers is set to cascade, so there shouldn't be an issue with updating the keys. Basically I'd like to start with 1, eliminate the gaps between the ids, and avoid clashing with existing ids as the script runs.
I'm hoping that someone can provide a simple means of doing this, hopefully one that can be implemented as a stored procedure and run routinely.

There are two options to reset the auto_increment:
Truncate the table
Reset the auto_increment
This is done by:
ALTER TABLE tablename AUTO_INCREMENT=10000
So you can simply reset the auto_increment value. Note, though, that MySQL will not set the counter below the current maximum value in the column, so this only helps once the existing ids have been renumbered or the rows removed.
Otherwise, I would recommend increasing the integer size: use BIGINT instead of INT.
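A minimal sketch of the renumbering the question actually asks for, assuming a table named installed_items with an auto-increment primary key id and child tables declared with ON UPDATE CASCADE (the table and column names are placeholders, and this should be tried against a backup or in a maintenance window, not run blindly on production):

SET @n := 0;
-- reassign ids 1..N in the current id order; the cascades update the child rows
UPDATE installed_items SET id = (@n := @n + 1) ORDER BY id;
-- InnoDB clamps this to MAX(id) + 1 if 1 is too low, so the next insert continues cleanly
ALTER TABLE installed_items AUTO_INCREMENT = 1;

Wrapped in a stored procedure, something like this could be scheduled to run routinely, as the question suggests.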

Related

missing autonumbers in Access

I have a very basic database with only one main table and a few lookups - no relationships.
The Autonumber field in the main table has little code associated with its form; however, I am noticing that every 10 records or so it skips a number.
I have locked the DB up pretty tight from the users, so they have no delete access and can only modify records very sparingly once created. They have no way to delete a wrong entry - they must tick a box called CANCELED in order to remove the entry from the list and start again. The ONLY way to delete a record is to SHIFT-open the database, open the table and delete from there... I doubt they are doing that but anything is possible...
Question is this - I have seen numerous web discussions on similar issues, but the solutions generally point to some code or a formatting issue or a SQL / Access thing... I have no such system... it's a straight front end / back end DB using linked tables on the local network.
Can someone please advise if this is just an Access thing and just to ignore it, or is this very unusual and something is going on in that someone IS deleting records? If someone IS deleting records - is there any way I can maybe PW protect it if it tries to open in edit mode? Or can I PW protect the table itself maybe?
Or even better - is there a way I can maybe add some fields and code and see what the heck is going on? Whether it is Access just not creating that number or if someone is messing with me?
Thanks
The basic rule for database auto numbers is simply that they are INTERNAL numbers – END of story! I mean, when you load a Word document do you care about the memory segment number used? Auto numbers are used to set up relations between tables. They are a “concept”, and whether the tables are linked by pictures, apes in the jungle eating bananas or some auto number sequence, you do NOT care.
However, to answer your question if you jump to a new record, and then start typing, the record is dirty. Of course the user might decide, hey, I don’t want to add this record. If they go edit->undo, or hit control-z and then exit, the record is not created nor is it saved. However, the auto number will get incremented. I mean since the database is multi-user, then one user starts working, and then another – they will both be assigned an auto number – but both may decide to not save.
Auto numbers are NOT to be given meaning to the end user, and really users should never see them. Users never see the computer memory segment that a record or Word document loads into either – they don’t care.
How internal indexing works, how tables are laid out, and how they operate are the SOLE concern of the database engine, and have ZERO to do with you, or your users.
Now of course you are “aware” that your computer has memory, but you would NOT expose the “memory” location used to your end users, since such internal housekeeping numbers are just that – internal housekeeping numbers.
In addition to users hitting un-do and bailing on record addition, general deleting of records will also produce gaps.
If you're looking for some kind of number sequence, then create an invoice number field, or whatever. While an invoice number can be required, if you use internal auto numbers then your database design can still function even when you don’t have some Social Insurance Number or some silly invoice number. What do they have to do with you as the developer building relations between tables? (Answer: absolutely nothing at all!!!)
The fact that your database functions fine without an invoice number or other numbers has ZERO to do with internal numbers used for housekeeping and to maintain relationships.
You define relationships in your database – these have ZERO to do with what your users think about, know about, etc. Such numbers have no more meaning than the memory segment used in your computer’s RAM to load a record into.
If you need some kind of invoice number, or some other sequence number, then you have to add that design part to your database. Such numbers have ZERO to do with some internal numbers that Access uses and maintains to build relationships with.
In a multi-user environment, and due to additions or deletions, you as a general rule might as well assume auto numbers are random – they have no meaning to users, nor any to business rules that require some kind of sequence number.

Designing tables for deleted data

I have many tables where data needs to be "marked for deletion" but not deleted, or toggle between published and hidden data.
The most intuitive way to handle these cases is to add a column in the database, such as deleted int(1) or public int(1). This raises the concern of forgetting to specify WHERE deleted=0 every single time that table is accessed.
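For concreteness, that column approach looks roughly like this, using the article table mentioned below purely as a placeholder:

ALTER TABLE article ADD COLUMN deleted TINYINT(1) NOT NULL DEFAULT 0;
-- every read then has to remember the filter, which is exactly the concern above
SELECT * FROM article WHERE deleted = 0;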
I considered overcoming this by creating duplicate tables for deleted/unpublished data, such as article => article_deleted, and moving the data instead of deleting it. This presents two issues:
Foreign key constraints end up being extremely annoying to maintain
Number of tables with hidden content doubles (in my case ~20 becomes ~40 tables)
My last idea is to create a duplicate of the entire database called unreleased and migrate data there.
My question isn't about safety of the data management, but more of - what is the right way of doing it from the beginning?
I have run into this exact issue before and I think it is a bad idea to create an unnecessarily cumbersome DB because you are afraid of bad code.
I think it would be a better idea to do thorough testing on your Test server before you release to production. Even I was tripped up by the "Deleted" column a few times when I first encountered it but I eventually caught on, and if you have a proper Dev/Test/Production environment you should be fine.
In summary, keep the delete column and demand more from your coders.
UPDATE:
Alternatively you could create a view that only pulls the records that aren't deleted and make sure everyone uses that for select queries.
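A minimal sketch of such a view, assuming the article table and deleted column used as examples in the question (names are placeholders):

CREATE VIEW article_active AS
SELECT * FROM article WHERE deleted = 0;
-- application code selects from article_active and can no longer forget the filter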
I think your initial approach is "correct" and "right", but your concern about it being slightly error-prone is a valid one.
You'll probably just have to make sure that your test procedures are rigorous enough to catch errors.
The first approach is the best I've come up with. I call the column active instead of deleted. The record exists but it can be either active or inactive. That way, if you really do need to delete things, the terminology doesn't get screwy.
Saying "Delete the inactive records" makes sense but saying "Delete the deleted records" just gets confusing.

MySQL InnoDB auto_increment value increases by 2 instead of 1. Virus?

There's an InnoDB table for storing comments for blog posts used by a custom built web application.
Recently I noticed that the auto incremented primary key values for the comments are incrementing by 2 instead of just 1.
I also noticed that in another MySQL table, which is used for remembering the last few commenters' footprint signatures (e.g. IP, session ID, user agent string, etc.), the name of the PHP session starts with "viruskinq", which is weird because I thought it should always be a hexadecimal md5-like string.
Google yields only a couple of results for "viruskinq", all in Turkish. It is interesting because approximately a year ago the website in question was defaced by Turkish villains. (I'm 100% sure that the attackers didn't succeed because of any security holes in my app, because other websites, hosted by the same company, were defaced too at that time.)
The site is on a shared host, using Linux.
Do you think it is possible that the server itself may still be under the influence of those hackers? Examining the comments' id values revealed that this doubling phenomenon has existed since this May, but the defacing happened almost a year ago.
What other causes could there be that explain the weird behavior of the auto increment value? The application hasn't been changed and at older comments the auto incremented primary key values are in order.
Edit: Summary of the solution
The hosting company informed me that the reason for the doubled auto increment value is that they use a master-slave MySQL architecture, and according to them this phenomenon is normal.
They also admitted that various hackers are constantly attacking their servers, "especially the sessions" and they cannot do anything about it.
I think I better start packing my things and move to a better webhost.
I really, really doubt this is a virus. Double-check whether that really is the session ID that starts with that string (which would indeed be reason for some concern). My guess would be this is a kid who discovered how to alter the User Agent string in the browser, and you are seeing the results of that, which is entirely harmless.
In regard to the increment problem:
First, check the auto_increment_increment setting of your MySQL server. Maybe it was set to 2 for some reason?
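One way to check it on the server in question (a standard MySQL command, shown as a sketch):

SHOW VARIABLES LIKE 'auto_increment%';
-- auto_increment_increment and auto_increment_offset are often set per server
-- in replicated setups so that two servers never hand out the same id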
Second, if it's not that, I would look at all DELETE operations that the comment system runs on the table. Do comments recognized as spam get deleted? Can you log deletions for a while, or switch to soft deletions?
Also, try to create some subsequent comments yourself. Does the same phenomenon occur? What if you add records using MySQL manually?
Look through the PHP code inserting a submitted comment making really sure there is nothing that could lead to this behaviour.
Try moving the comment system to a different server - preferably a local one, maybe freshly set up - to see whether the behaviour persists there.
Could it just be that the server's auto_increment_increment setting is 2?
See: MySQL autoincrement column jumps by 10- why?

How can I fix this scaling issue with soft deleting items?

I have a database where most tables have a delete flag. So the system soft deletes items (so they are no longer accessible, except by admins for example).
What worries me is that in a few years, when the tables are much larger, the overall speed of the system is going to be reduced.
What can I do to counteract effects like that?
Do I index the delete field?
Do I move the deleted data to an identical delete table and back when undeleted?
Do I spread out the data over a few MySQL servers over time? (based on growth)
I'd appreciate any and all suggestions or stories.
UPDATE:
So partitioning seems to be the key to this. But wouldn't partitioning just create two "tables", one with the deleted items and one without?
So over time the deleted partition will grow large, and the occasional fetches from it will be slow (and slower over time).
Would the speed difference be something I should worry about? I fetch most (if not all) data by some key value (some are searches, but they can be slow for this setup).
I'd partition the table on the DELETE flag.
The deleted rows will be physically kept in another place, but from SQL's point of view the table remains the same.
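A minimal sketch of what that could look like, assuming a table named article with a deleted flag (names are placeholders). Note that MySQL requires the partitioning column to be part of every unique key, including the primary key, and partitioned tables cannot have foreign keys:

CREATE TABLE article (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  deleted TINYINT(1) NOT NULL DEFAULT 0,
  body TEXT,
  PRIMARY KEY (id, deleted)
)
PARTITION BY LIST (deleted) (
  PARTITION p_active VALUES IN (0),
  PARTITION p_deleted VALUES IN (1)
);

Queries that filter on deleted = 0 then only touch the p_active partition.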
Oh, hell yes, index the delete field. You're going to be querying against it all the time, right? Compound indexes with other fields you query against a lot, like parent IDs, might also be a good idea.
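For example, a compound index like the one described, with hypothetical table and column names:

-- serves queries such as: SELECT ... WHERE parent_id = ? AND deleted = 0
CREATE INDEX idx_parent_deleted ON comments (parent_id, deleted);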
Arguably, this decision could be made later if and only if performance problems actually appear. It very much depends on how many rows are added at what rate, your box specs, etc. Obviously, the level of abstraction in your application (and the limitations of any libraries you are using) will help determine how difficult such a change will be.
If it becomes a problem, or you are certain that it will be, start by partitioning on the deleted flag between two tables, one that holds current data and one that holds historical/deleted data. IF, as you said, the "deleted" data will only be available to administrators, it is reasonable to suppose that (in most applications) the total number of users (here limited only to admins) will not be sufficient to cause a problem. This means that your admins might need to wait a little while longer when searching that particular table, but your user base (arguably more important in most applications) will experience far less latency. If performance becomes unacceptable for the admins, you will likely want to index the user_id (or transaction_id or whatever) field you access the deleted records by (I generally index every field by which I access the table, but at certain scale there can be trade-offs regarding which indexes are most worthwhile).
Depending on how the data is accessed, there are other simple tricks you can employ. If the admin is looking for a specific record most of the time (as opposed to, say, reading a "history" or "log" of user activity), one can often assume that more recent records will be looked at more often than old records. Some DBs include tuning options for making recent records easier to find than older records, but you'll have to look it up for your particular database. Failing that, you can do it manually. The easiest way would be to have an ancient_history table that contains all records older than n days, weeks or months, depending on your constraints and suspected usage patterns. Newer data then lives inside a much smaller table. Even if the admin is going to "browse" all the records rather than searching for a specific one, you can start by showing the first n days and have a link to see all days should they not find what they are looking for (e.g., most online banking applications that let you browse transactions show only the first 30 days of history unless you request otherwise).
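A rough sketch of that ancient_history idea, with hypothetical table and column names; ideally run inside a transaction or in batches so the two statements stay consistent:

-- move records older than 90 days into an archive table with the same structure
INSERT INTO comments_archive
SELECT * FROM comments WHERE created_at < NOW() - INTERVAL 90 DAY;
DELETE FROM comments WHERE created_at < NOW() - INTERVAL 90 DAY;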
Hopefully you can avoid having to go a step further, and sharding on user_id or some such scheme. Depending on the scale of the rest of your application, you might have to do this anyway. Unless you are positive that you will need to, I strongly suggest using vertical partitioning first (e.g., keeping your forum_posts on a separate machine than your sales_records), as it is FAR easier to set up and maintain. If you end up needing to shard on user_id, I suggest using google ;-]
Good luck. BTW, I'm not a DBA so take this with a grain of salt.

Unique, numeric, incremental identifier

I need to generate unique, incremental, numeric transaction ids for each request I make to a certain XML RPC. These numbers only need to be unique across my domain, but will be generated on multiple machines.
I really don't want to have to keep track of this number in a database and deal with row locking etc on every single transaction. I tried to hack this using a microsecond timestamp, but there were collisions with just a few threads - my application needs to support hundreds of threads.
Any ideas would be appreciated.
Edit: What if each transaction id just has to be larger than the previous request's?
If you're going to be using this from hundreds of threads, working on multiple machines, and require an incremental ID, you're going to need some centralized place to store and lock the last generated ID number. This doesn't necessarily have to be in a database, but that would be the most common option. A central server that did nothing but serve IDs could provide the same functionality, but that probably defeats the purpose of distributing this.
If they need to be incremental, any form of timestamp won't be guaranteed unique.
If you don't need them to be incremental, a GUID would work. Potentially doing some type of merge of the timestamp + a hardware ID on each system could give unique identifiers, but the ID number portion would not necessarily be unique.
Could you use a pair of Hardware IDs + incremental timestamps? This would make each specific machine's IDs incremental, but not necessarily be unique across the entire domain.
---- EDIT -----
I don't think using any form of timestamp is going to work for you, for 2 reasons.
First, you'll never be able to guarantee that 2 threads on different machines won't try to schedule at exactly the same time, no matter what resolution of timer you use. At a high enough resolution, it would be unlikely, but not guaranteed.
Second, to make this work, even if you could resolve the collision issue above, you'd have to get every system to have exactly the same clock with microsecond accuracy, which isn't really practical.
This is a very difficult problem, particularly if you don't want to create a performance bottleneck. You say that the IDs need to be 'incremental' and 'numeric' -- is that a concrete business constraint, or one that exists for some other purpose?
If these aren't necessary, you can use UUIDs, which most common platforms have libraries for. They allow you to generate many (millions!) of IDs in very short timespans while being quite confident there will be no collisions. The relevant article on Wikipedia claims:
In other words, only after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%.
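If leaning on an existing MySQL connection just for generation is acceptable (the question says a database round trip per request is not ideal, so treat this purely as an illustration), MySQL itself exposes both flavours:

SELECT UUID();        -- RFC 4122 style string UUID
SELECT UUID_SHORT();  -- 64-bit unsigned integer built from the server id, startup time and a counter;
                      -- only unique across machines if the servers involved have distinct server ids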
If you remove 'incremental' from your requirements, you could use a GUID.
I don't see how you can implement incremental across multiple processes without some sort of common data.
If you target a Windows platform, did you try Interlocked API ?
Google for GUID generators for whatever language you are looking for, and then convert that to a number if you really need it to be numeric. It isn't incremental though.
Or have each thread "reserve" a thousand (or million, or billion) transaction IDs and hand them out one at a time, and "reserve" the next bunch when it runs out. Still not really incremental.
I'm with the GUID crowd, but if that's not possible, could you consider using db4o or SQLite over a heavy-weight database?
If each client can keep track of its own "next id", then you could talk to a central server and get a range of ids, perhaps 1000 at a time. Once a client runs out of ids, it will have to talk to the server again.
This would give your system a central source of ids, and still avoid having to talk to the database for every id.
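A sketch of that range-reservation idea backed by a single counter table (all names are hypothetical); each client grabs a block of 1000 ids in one short transaction and then hands them out locally:

-- one-row table holding the next unallocated id
CREATE TABLE id_allocator (next_id BIGINT UNSIGNED NOT NULL);
INSERT INTO id_allocator VALUES (1);

-- run by a client whenever its local block runs out
START TRANSACTION;
SELECT next_id FROM id_allocator FOR UPDATE;  -- locks the row against other clients
UPDATE id_allocator SET next_id = next_id + 1000;
COMMIT;
-- the client now owns [next_id, next_id + 999] and increments locally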