Cascading Number MySQL

OK, I'm new to MySQL, so excuse any stupid questions, but I'm having a problem with a simple database table I created.
I have it set up so each new entry gets an automatically incrementing id; however, if I delete the third row, there is a gap. I have heard there is a way to cascade the other numbers down when a previous entry is deleted. I have looked but couldn't find anything that would let me change this in phpMyAdmin without recreating the table.
How would I do this?

This question pops up every once in a while, and there's a common confusion that auto_increment gives you sequential numbers you can use, well, for sequences. It doesn't.
The sole purpose of a primary key is to uniquely identify a row. An incrementing integer fits that role perfectly: it's easy to implement inside MySQL's C/C++ code, and it works extremely fast and well.
But that's all it does. You don't get a nice, sequential numbering feature out of it. You can't use it for what you want because there are, as you called them, gaps.
And no, you don't make MySQL fill the gaps. It's bad, it's dangerous, and it creates problems you haven't even thought of.
The bottom line is that you should never rely on auto_increment to reuse "wasted" numbers.
Here's why:
InnoDB, the default MySQL engine, uses the primary key internally to physically organize records on disk. It relies on every next id being greater than the one before. I won't get into details, but the key idea is that the index and the data live on the same pages. That makes InnoDB extremely fast at primary key lookups (SELECT col FROM table WHERE id = 1000000 types of queries).
Now, what happens when you "reuse" keys that had gaps? Imagine this scenario: you have 1 million records, with no numbers lost.
You delete record 500 000.
Afterwards, you add a new record, and using your logic you need to "reuse" number 500 000. So you do it.
But InnoDB expects every next record to be larger. So to conform to your needs, it has to rebalance what it has already written, and it has to start from record 500 000. Now you have 500 000 records being reorganized, which means 500 000 checks and writes going on. This kills your performance, COMPLETELY. Say you have a mechanical hard drive, capable of about 200-300 input/output operations per second (IOPS). If every reorganized record requires 1 I/O, reorganizing 500 000 of them takes roughly 30-40 minutes. Now you have inserts that take half an hour or more to complete.
The other problem, much more severe than the performance one, is the concurrency and isolation problem.
What people don't understand is that MySQL (and other relational databases) solves the problem of concurrency, or if you will, simultaneous access: the "what happens when two users write at the same time" problem.
MySQL takes care of that, and more. And the behavior where it "wastes" numbers is part of what makes that possible. Every transaction relies on the primary key at some point; an auto_increment value is assigned before the transaction even commits. So when a transaction fails or isn't committed, that auto_increment value gets "wasted". That's actually DESIRABLE behavior.
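You can see this for yourself. A sketch, assuming a hypothetical InnoDB table `t` (any table with an AUTO_INCREMENT key behaves the same way):

```sql
CREATE TABLE t (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    val VARCHAR(20)
) ENGINE=InnoDB;

INSERT INTO t (val) VALUES ('first');   -- gets id 1

START TRANSACTION;
INSERT INTO t (val) VALUES ('second');  -- id 2 is assigned immediately
ROLLBACK;                               -- the row is gone, but id 2 is spent

INSERT INTO t (val) VALUES ('third');   -- gets id 3, leaving a gap at 2
```

This is exactly why the gaps are harmless: they are the price of handing out ids without making concurrent transactions wait on each other.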
I definitely didn't list all the disadvantages, just the two I could think of (and I didn't describe them in full detail for lack of time, but the principles apply).
The conclusion: do not "reuse" wasted auto_increment numbers, do not listen to people who tell you how to do it, and do not assume your project won't hit the problems listed above. If you do, prepare to encounter problems you could never have thought of.

Related

Does deleting rows from a table affect DB performance?

I'm working on a script that uses a MySQL database with auto-increment primary key tables, where users may need to remove (lots of) rows: mistaken, duplicated, canceled data and so on.
For now, I use a TINYINT column named 'deleted' in each table and update rows to deleted=1 instead of deleting them.
Considering the deleted data as unimportant:
which way do you suggest for a better database and better performance?
does deleting (possibly many) rows every day affect SELECT queries on large tables?
is it better to delete the rows immediately?
or keep the rows using the 'deleted' column, delete them (for example) monthly, then re-index the data?
I've searched for this, but most of the results were personal opinions or preferences rather than referenced or tested data.
PS) Edit:
Referring to the question and the picture below, there's one more point to ask on this topic, and I would be grateful if you could guide me.
Deleting a row (row 6) while the auto-increment index was at 225 led the unsorted table to put the next inserted row (id=225) in the deleted row's place (at least visually!). If deletions happen many times, the primary column and its rows end up completely out of order and messed up.
Should this be considered a good feature of the database (filling up the deleted spaces), something bad that reduces performance, or neither, with what it shows on screen not mattering at all?
Thanks.
What percentage of the table is "deleted"?
If it is less than, say, 20%, it would be hard to measure any difference between a soft "deleted=1" and a hard "DELETE FROM tbl". The disk space would probably be the same. A 16KB block would either have soft-deleted rows to ignore, or the block would not be "full".
Let's say 80% of the rows have been deleted. Now there are some noticeable differences.
In the "soft-delete" case, a SELECT will be looking at 5 rows to find only 1. While this sounds terrible, it does not translate into 5 times the effort. There is overhead for fetching a block; if it contains 4 soft-deleted rows and 1 useful row, that overhead is shared. Once a useful row is found, there is overhead to deliver that row to the client, but that applies only to the 1 row.
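As a sketch of what the soft-delete pattern looks like in practice (the `deleted` flag column is the asker's own convention, not a MySQL feature; table and column names are hypothetical):

```sql
-- Soft delete: flag the row instead of removing it.
UPDATE tbl SET deleted = 1 WHERE id = 123;

-- Every read must now filter on the flag:
SELECT * FROM tbl WHERE deleted = 0 AND some_col = 'x';
```

If most queries filter on `deleted = 0`, putting the flag first in a composite index (e.g. `INDEX(deleted, some_col)`) lets the index skip the dead rows instead of fetching and discarding them.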
In the "hard-delete" case, blocks are sometimes coalesced. That is, when two "adjacent" blocks become less than half full, they may be combined into a single block. (Or so the documentation says.) This helps to cut down on the number of blocks that need to be touched. But it does not shrink the disk space -- hard-deleted rows leave space that can be reused; deleted blocks can be reused. Blocks are not returned to the OS.
A "point-query" is a SELECT where you specify exactly the row you want (eg, WHERE id = 123). That will be very fast with either type of delete. The only possible change is if the BTree is a different depth. But even if 80% of the rows are deleted, the BTree is unlikely to change in depth. You need to get to about 99% deleted before the depth changes. (A million rows has a depth of about 3; 100M -> 4.)
Range queries (eg, WHERE blah BETWEEN ... AND ...) will notice some degradation if most rows are soft-deleted -- but, as already mentioned, there is a slight degradation with either deletion method.
So, is this my "opinion"? Yes. But it is based on an understanding of how InnoDB tables work. And it is based on "experience" in the sense that I have detected nothing to significantly shake this explanation in about 19 years of using InnoDB.
Further... With hard-delete, you have the option of freeing up the free space with OPTIMIZE TABLE. But I have repeatedly said "don't bother" and elaborated on why.
On the other hand, if you need to delete a big chunk of a table (either one-time or repeatedly), see my blog on efficient techniques: http://mysql.rjweb.org/doc.php/deletebig
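The core idea behind those techniques is to delete in small chunks, so no single statement holds locks on millions of rows. A minimal sketch (table and column names are hypothetical):

```sql
-- Repeat this statement until it reports 0 rows affected.
-- Each pass touches at most 1000 rows, keeping transactions
-- short and replication lag low.
DELETE FROM tbl
 WHERE created_at < '2020-01-01'
 ORDER BY id
 LIMIT 1000;
```
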
(Re: the PS)
SELECT without an ORDER BY -- It is 'fair game' for the query to return the rows in any order it feels like. If you want a certain order, add ORDER BY.
What Engine is being used? MyISAM and InnoDB work differently; neither is predictable without ORDER BY.
If you wanted the new entry to have id=6, that is a different problem. (And I will probably argue against designing the ids like that.)
The simple answer is no. DBMS systems are designed to handle changes at any time while keeping performance in mind. Sometimes deletes will affect it a little bit, but there is no need to worry about it.

Seeking a performant solution for accessing unique MySQL entries

I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
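A minimal sketch of such a table (names are hypothetical); `CHAR(36)` holds the textual GUID, or `BINARY(16)` if you store it compactly:

```sql
CREATE TABLE entries (
    guid CHAR(36) NOT NULL PRIMARY KEY,  -- e.g. from System.Guid.NewGuid().ToString()
    payload TEXT
) ENGINE=InnoDB;

-- B-tree lookup on the primary key: logarithmic in table size.
SELECT payload FROM entries
 WHERE guid = '3f2504e0-4f89-11d3-9a0c-0305e82c3301';
```
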
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234.
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use such, be aware that GUIDs perform poorly if the table is bigger than RAM.

How to fix values missed by MySQL auto_increment

I have a MySQL table where (to an already existing table) I added another column, "Number", that is auto_incremented and has a UNIQUE KEY constraint.
There are 17000+ records in the table. After adding the "Number" column, one value is missing - there is a value of 14 369 and the next one is 14 371.
I tried removing the column and adding it again, but the value is still missing.
What might be the problem, and what is the least painful way to solve this?
There is no problem and there is nothing to fix.
MySQL's auto_increment provides unique values, and it calculates them using a sequential-increment algorithm (it simply increments a number).
That algorithm is the fastest reliable way of generating guaranteed-unique values.
That's its job. It doesn't "reuse" numbers and forcing it to do so comes with disastrous performance and stability.
Since queries do fail sometimes, these numbers get "lost" and you can't have them back.
If you require sequential numbers for whatever reason, create a procedure or scheduled event and maintain the numbers yourself.
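One way to maintain your own gap-free numbering (in a separate column, never the primary key; table and column names are hypothetical) is to renumber periodically with a user variable, run from a scheduled event:

```sql
-- Renumber a dedicated seq column in id order.
-- Run this from a scheduled event, e.g. nightly.
SET @n := 0;
UPDATE t SET seq = (@n := @n + 1) ORDER BY id;
```
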
You have to bear in mind that MySQL is a transactional database designed to operate under concurrent access. If it were to reuse these numbers, performance would be abysmal, since it would have to take locks and force clients to wait while it reorganizes the numbers.
InnoDB, the default engine, uses primary key values to organize records on disk. If you were to change any of those values, it would start rewriting records, incurring a HUGE I/O cost that depends on the amount of data on disk - it could bring the whole server to a grinding halt.
TL;DR: there is no problem, there is nothing to fix, don't do it. If you persist, expect abnormal behavior.

Optimizing Innodb table indexes with GUID/UUID keys

I have an InnoDB based schema with roughly 100 tables, most use GUID/UUID's as the primary key. I started this at a point in time where I didn't really understand the implications of a UUID PK with regard to Disk IO and fragmentation, but wanted the benefits of avoiding a single key dispenser when dealing with server clusters. We're not currently dealing with large numbers of rows, but we will be (in the hundreds of millions) and I would like to be prepared for that.
Now that I understand indexing in InnoDB better, specifically the clustered nature of the primary key, I can see that my UUID's are a poor choice for scalability from a DISK IO perspective, but I don't want to stop using them due to the server clustering requirement.
The accepted/recommended solution seems to be a mix of an auto-increment PK (INT|BIGINT) with a UNIQUE-indexed UUID column. My intention is to add a new first column ai_col to each table and assign it as the new PK; I'm taking cues from:
http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html
I would then update/recreate a new "UNIQUE" index on my UUID keys and continue to use them in our application layer.
My expectation is that once this is done that I can essentially ignore the ai_col and everything else runs business as usual. InnoDB will have a relatively small int based PK from which to cluster on and append to the other unique indexes.
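For one table, the migration would be along these lines (a sketch; `ai_col` is the name from the question, other names are hypothetical):

```sql
ALTER TABLE my_table
    DROP PRIMARY KEY,
    ADD COLUMN ai_col BIGINT UNSIGNED NOT NULL AUTO_INCREMENT FIRST,
    ADD PRIMARY KEY (ai_col),
    ADD UNIQUE KEY uk_uuid (uuid);
```

The application keeps querying by `uuid` through the unique secondary index, while InnoDB clusters on the compact `ai_col`.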
Question 1: Am I correct in assuming that in this new scenario, I can have my cake and eat it too?
The follow up question is with regard to smaller 'associational' tables, i.e. Only two columns, both Foreign Keys to other tables joining them implicitly. In these cases I have typically two indexes, one being a UNIQUE two column index with the more heavily used column first, then a second single index on the other column. I know that this is essentially 2.5x as large as the actual row data, but it seems to really help our more complex queries during optimization, and is on smaller tables so relatively acceptable.
Most of these associational tables will only be a fraction the number of records in the primary tables because they're typically more specific, however, there are a few cases where these have many multiples the number of records as their foreign parents, i.e. potentially billions.
Question 2: Is it a good idea to add the numeric PK's to these tables as well? I'm guessing that the answer will be something along the lines of "Benchtest it" but I'm just looking for helpful nuggets of wisdom.
If I've obviously mis-interpreted anything or you can offer insights that I may not be considering, I'd really appreciate that too!
Many thanks!
EDIT: As promised in the answer, I just wanted to follow up for anyone interested... This solution has worked famously :) Read and write performance increased across the board, and so far it's been tested up to about 6 billion i/o's / month, without breaking a sweat.
Without any other suggestions, confirmations, or otherwise, I've begun testing on our dev server with a number of less-used tables, but ones that would nonetheless be affected if the new AI-based ids were going to affect our application layer.
So far it's looking good: indexes are performing as expected, and the new table fields haven't required any changes to our application layer; we've basically been able to ignore them.
I haven't run any thorough bench testing of the actual disk IO under heavy load, but from the sheer amount of information out there on the subject, I can surmise that we're in good shape for scaling up.
Once this has been in place for a while I'll drop in a follow up in case anyone's in the same boat we were.

Is there any reason not to use auto_increment on an index for a database table?

I've inherited the task of maintaining a very poorly-coded e-commerce site and I'm working on refactoring a lot of the code and trying to fix ongoing bugs.
Every database insert (adding an item to cart, etc.) begins with a grab_new_id function, which COUNTs the number of rows in the table and then, starting with that number, queries the database to find an unused index number. In addition to being terrible performance-wise (there are 40,000+ rows already, and indexes are regularly deleted, so sometimes it takes several seconds just to find a new id), this breaks regularly when two operations are performed simultaneously, as two entries get added with duplicate id numbers.
This seems idiotic to me - why not just use auto-increment on the index field? I've tested it both ways, and adding rows to the table without specifying an index id is (obviously) many times faster. My question is: can anyone think of any reason the original programmer might have done this? Is there some school of thought where auto_increment is somehow considered bad form? Are there databases that don't have auto-increment capabilities?
I've seen this before from someone that didn't know that feature existed. Definitely use the auto-increment feature.
Some people take the "roll your own" approach to everything, often because they haven't taken the time to see if that is an available feature or if someone else had already come up with it. You'll often see crazy workarounds or poor performing/fragile code from these people. Inheriting a bad database is no fun at all, good luck!
Well, Oracle has sequences but not auto-generated ids, as I understand it. Usually, though, this kind of thing is done by devs who don't understand database programming and who hate to see gaps in the data (as you get from rollbacks). There are also people who like to create the id themselves so they have it available beforehand to use for child tables, but most databases with auto-generated ids also have a way to return that id to the user at the time of creation.
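In MySQL, that mechanism is `LAST_INSERT_ID()`, which returns the auto-generated id from the current connection's most recent insert (table names here are hypothetical):

```sql
INSERT INTO parent (name) VALUES ('example');

-- Safe under concurrency: LAST_INSERT_ID() is per-connection,
-- so other clients' inserts can't interleave with your value.
INSERT INTO child (parent_id, detail)
VALUES (LAST_INSERT_ID(), 'detail row');
```
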
The only issue I've found partially reasonable (but totally avoidable!) against auto_inc fields is that some backup tools by default include auto_inc values in the table definition even if you don't include data in the dump, which can be inconvenient.
Depending on the specific situation, there are clearly many reasons for not using consecutive numbers as a primary key.
However, given that I do want consecutive numbers as a primary key, I see no reason not to use the built-in auto_increment functionality MySQL offers.
It was probably done that way for historical reasons; i.e. earlier versions didn't have autoinc variables. I've written code that uses manual autoinc fields on databases that don't support autoinc types, but my code wasn't quite as inefficient as pulling a count().
One issue with using autoinc fields as a primary key is that moving records in and out of tables may result in the primary key changing. So, I'd recommend designing in a "LegacyID" field up front that can be used as future storage for the primary key for times when you are moving records in and out of the table.
They may just have been inexperienced and unfamiliar with auto increment. One reason I can think of, but doesn't necessarily make much sense, is that it is difficult (not impossible) to copy data from one environment to another when using auto increment id's.
For this reason, I have used sequential Guids as my primary key before for ease of transitioning data, but counting the rows to populate the ID is a bit of a WTF.
Two things to watch for:
1. Your RDBMS may not intelligently set the auto-increment value upon restart. Our engineers were rolling their own auto-increment key to get around the auto-increment field jumping by hundreds of thousands whenever the server restarted. However, at some point Sybase added an option to set the size of the auto-increment jump.
2. The other place where auto-increment can get nasty is if you are replicating databases in a master-master configuration. If you write to both databases (NOT ADVISED), you can run into identity collisions.
I doubt either of these were the case, but things to be aware of.
I could see it if the ids were generated on the client and pushed into the database; that's common practice when speed is necessary. But what you described seems over the top and unnecessary. Remove it and use an auto-incrementing id.