Everyone says don't re-use deleted MySQL keys, e.g. the Stack Overflow question "I want to reuse the gaps of the deleted rows".
I have read all of the "expert" opinions but have not found a single answer that gives a valid reason why not. Everyone simply asks "why do you want to"?
Well here is a very good reason. If my users have a choice of entering URL mysite.com/person.php?id=123 or a URL mysite.com/person.php?id=123456789123, which one would they most likely prefer?
So can anyone give me a reason why re-using 123 would be a bad idea? I am actually not talking about one record. My records get added and deleted in blocks of several thousand. Updates are very rare and I am the only person who does updates.
There are also no dependencies. Nothing points to those records so there are no integrity issues with other tables.
When I want to add another block of records I will have a simple search routine that searches for the first block of unused record keys large enough to accommodate all of the records being added. Much the same way that hard disk space usage works.
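The search routine described above could look roughly like this. It is only a sketch: the function name `first_free_block` and the idea of loading the used ids into memory are my assumptions, and it ignores concurrency (two callers could pick the same block unless the table is locked).

```python
from typing import Iterable


def first_free_block(used_ids: Iterable[int], block_size: int, start: int = 1) -> int:
    """Return the first id of the lowest run of `block_size` consecutive unused ids."""
    candidate = start
    for used in sorted(set(used_ids)):
        if used < candidate:
            continue  # below the range we are still considering
        if used - candidate >= block_size:
            break  # the gap [candidate, used) is large enough
        candidate = used + 1  # gap too small, restart just after this used id
    return candidate


# Hypothetical key usage: ids 4-6 and 9-19 are free.
used = [1, 2, 3, 7, 8, 20]
small_block = first_free_block(used, 3)
medium_block = first_free_block(used, 4)
large_block = first_free_block(used, 20)
```

When no gap is large enough, the function falls through to the first id after the highest used one, which is what plain auto-increment would have given anyway.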
Keys are usually used as unique identifiers. If they are used again, they stop being unique and become shared. That is the logic behind the advice not to reuse keys.
So I would suggest splitting the key and the user's id into two fields: keep the key as the unique key, and make the id "choosable" via a gap-finding function.
Before you split, create a new column called user_id and copy into it the users' id (which is currently your key).
Then make this column unique, so that you prevent accidental id reuse.
And you are home free.
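A minimal sketch of that split, using Python's built-in sqlite3 module as a stand-in for MySQL (the syntax differs slightly, e.g. MySQL spells it AUTO_INCREMENT). The table name `person`, the column name `user_id`, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
conn.executemany("INSERT INTO person (name) VALUES (?)", [("Ann",), ("Bob",), ("Cy",)])

# Add the public id column, seed it from the current key, then make it unique.
conn.execute("ALTER TABLE person ADD COLUMN user_id INTEGER")
conn.execute("UPDATE person SET user_id = id")
conn.execute("CREATE UNIQUE INDEX person_user_id ON person (user_id)")

# Bob (key 2, user_id 2) is deleted; a new row reuses the public id 2.
conn.execute("DELETE FROM person WHERE name = 'Bob'")
conn.execute("INSERT INTO person (name, user_id) VALUES ('Dee', 2)")
dee = conn.execute("SELECT id, user_id FROM person WHERE name = 'Dee'").fetchone()

# The unique index rejects an accidental second use of a public id.
try:
    conn.execute("INSERT INTO person (name, user_id) VALUES ('Eve', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The internal key keeps growing and is never handed out twice, while the short public id can be recycled.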
I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
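To make the logarithmic growth concrete, here is a rough back-of-the-envelope sketch. The fanout of 32 keys per index page is an assumption picked only so the numbers line up with the example above; real indexes typically hold many more keys per page, which makes the tree even shallower.

```python
def btree_levels(row_count: int, fanout: int) -> int:
    """Smallest number of levels such that fanout ** levels >= row_count."""
    levels = 1
    capacity = fanout
    while capacity < row_count:
        capacity *= fanout
        levels += 1
    return levels


FANOUT = 32  # illustrative assumption, not a MySQL constant

levels_thousand = btree_levels(1_000, FANOUT)
levels_million = btree_levels(1_000_000, FANOUT)
```

Under this assumption a thousand times more rows costs two extra levels, not a thousand times more reads.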
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234;
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use such, be aware that GUIDs perform poorly if the table is bigger than RAM.
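As a sketch of what "manufacturing the id in the client" means, here is the Python equivalent of the .NET call mentioned in the question. The column types in the comments are assumptions; storing the 16-byte form rather than the 36-character string is one common way to keep the index smaller.

```python
import uuid

# Generated on the client; no round trip to the database is needed.
entry_id = uuid.uuid4()

as_text = str(entry_id)    # 36 characters, e.g. for a CHAR(36) column
as_bytes = entry_id.bytes  # 16 bytes, e.g. for a BINARY(16) column

# The compact form converts back to the same identifier.
restored = uuid.UUID(bytes=as_bytes)
```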
I would like to know if there's some regular way to handle duplicates in the database without actually removing the duplicated rows. Or a specific name for what I'm trying to achieve, so I can check it out.
Why would I keep duplicates? Because I have to monitor them. I have to know that they're duplicates and are not e.g. searchable, but at the same time, I have to keep them, because I update the rows from external source and if I'd remove them, they'd go back to the database as soon as I update from external source.
I have two ideas:
Have an additional boolean column "searchable". I feel it's a partial solution, though, and it could turn out to be insufficient in the future.
Have an additional column "duplicate_of". It would hold the id of the row that this row duplicates. It would be a foreign key to the same table, which is kind of weird, isn't it?
I know it's not a specific programming question, but I think that someone must have handled a similar situation (Facebook - Pages they keep track of those which are duplicates of others) and it would be great to know a verified solution.
EDIT: these are close duplicates, identified mainly by their location (lat, lng), so DISTINCT is probably not a solution here
I would create a view that has DISTINCT values. Having an additional column to be searchable sounds tedious. Your second idea is actually more feasible and there is nothing weird about a self-referencing table.
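A self-referencing table can be sketched like this, again with Python's sqlite3 standing in for MySQL. The table `place`, the view `searchable_place`, and the sample rows are made up for illustration, and the view filters on `duplicate_of IS NULL` instead of using DISTINCT, since the duplicates in the question are near matches rather than identical rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute(
    """
    CREATE TABLE place (
        id           INTEGER PRIMARY KEY,
        name         TEXT NOT NULL,
        lat          REAL,
        lng          REAL,
        duplicate_of INTEGER REFERENCES place (id)  -- NULL means "not a duplicate"
    )
    """
)
conn.executemany(
    "INSERT INTO place (id, name, lat, lng, duplicate_of) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Cafe Roma", 52.2297, 21.0122, None),
        (2, "Caffe Roma", 52.2298, 21.0121, 1),  # near-duplicate of row 1
        (3, "Book Nook", 50.0647, 19.9450, None),
    ],
)

# Duplicates stay in the table (so re-imports do not recreate them) but are not searchable.
conn.execute(
    "CREATE VIEW searchable_place AS SELECT id, name FROM place WHERE duplicate_of IS NULL"
)
visible = conn.execute("SELECT id, name FROM searchable_place ORDER BY id").fetchall()
```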
The solution depends on several other factors. In particular, does the database support real deletes and updates (apart from setting the duplication information)?
You have a range of solutions. One is to place distinct values in a separate table, periodically. This works well if you have batch inserts, and no updates/deletes.
If you have a database that is being updated, then you might want to maintain a version number on the record. This lets you track it. Presumably, if it is a duplicate, there is another duplicate key inside it.
The problem with your second approach is that it can result in a tree-like structure of duplicates, where A-->B-->C and D-->C, so A and D are duplicates, but this is not obvious. If you always point to the earliest value and there are no updates or deletes, then this solution is reasonable.
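If chains of duplicates do occur, one way to make them obvious is to follow each chain to its root. The sketch below assumes the duplicate_of column has been loaded into a dict, and the rows A, B, C, D are hypothetical (A points to B, B and D point to C).

```python
from typing import Dict, Hashable, Optional


def canonical_id(row_id: Hashable, duplicate_of: Dict[Hashable, Optional[Hashable]]) -> Hashable:
    """Follow duplicate_of links until reaching a row that is not a duplicate."""
    seen = set()
    while duplicate_of.get(row_id) is not None:
        if row_id in seen:
            raise ValueError(f"cycle in duplicate chain at {row_id!r}")
        seen.add(row_id)
        row_id = duplicate_of[row_id]
    return row_id


links = {"A": "B", "B": "C", "C": None, "D": "C"}
root_of_a = canonical_id("A", links)
root_of_d = canonical_id("D", links)
```

Two rows are duplicates of each other exactly when they resolve to the same root.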
I have a table titled videos. In it there are three columns: media_id, project_id, and video_url. My question is, is it necessary for me to have media_id? I'm not using it in any other tables. I would expect there to be multiple rows with the same project_id but different video_urls.
Having or not having surrogate ID's for something has nothing to do with normalization.
(copyright catcall)
Having or not having surrogate IDs for something depends on whether or not you have a use for them. You already gave the answer to that yourself. It also depends on whether there is a significant likelihood that, even if there is no actual use for one right now, such a use might emerge in the near future.
You could use project_id and video_url together as the key in your model, but at the physical level I would not like to use a URL as part of a key.
By this I mean I prefer an ID or number, to avoid typing in a long string each time the key is referenced in other tables.
I would consider it necessary. This is purely based on the fact that the media entry is unique and there could be multiple media entries for any one project. This keeps a unique id for the row, a proper project relationship and the valuable URL data for the media resource.
I've inherited the task of maintaining a very poorly-coded e-commerce site and I'm working on refactoring a lot of the code and trying to fix ongoing bugs.
Every database insert (adding an item to cart, etc.) begins with a grab_new_id function which COUNTs the number of rows in the table, then, starting with that number, queries the database to find an unused index number. In addition to being terrible performance-wise (there are 40,000+ rows already, and rows are regularly deleted, so sometimes it takes several seconds just to find a new id), this breaks regularly when two operations are performed simultaneously, as two entries are added with duplicate id numbers.
This seems idiotic to me - why not just use auto-increment on the index field? I've tested it both ways, and adding rows to the table without specifying an index id is (obviously) many times faster. My question is: can anyone think of any reason the original programmer might have done this? Is there some school of thought where auto_increment is somehow considered bad form? Are there databases that don't have auto-increment capabilities?
I've seen this before from someone that didn't know that feature existed. Definitely use the auto-increment feature.
Some people take the "roll your own" approach to everything, often because they haven't taken the time to see if that is an available feature or if someone else had already come up with it. You'll often see crazy workarounds or poor performing/fragile code from these people. Inheriting a bad database is no fun at all, good luck!
Well, Oracle has sequences but not auto-generated ids, as I understand it. However, usually this kind of stuff is done by devs who don't understand database programming and who hate to see gaps in the data (as you get from rollbacks). There are also people who like to create the id themselves, so they have it available beforehand to use for child tables, but most databases with autogenerated ids also have a way to return that id to the user at the time of creation.
The only objection to auto_inc fields that I found partially reasonable (but totally avoidable!) is that some backup tools by default include the current auto_inc value in the table definition even if you don't include data in the db dump, which may be inconvenient.
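Getting the generated id back at insert time can be sketched as follows, with Python's sqlite3 standing in for MySQL (where the counterpart is LAST_INSERT_ID()). The table `cart_item` and the helper `add_item` are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cart_item (id INTEGER PRIMARY KEY AUTOINCREMENT, sku TEXT NOT NULL)"
)


def add_item(connection: sqlite3.Connection, sku: str) -> int:
    """Insert a row and return the id the database generated for it."""
    cursor = connection.execute("INSERT INTO cart_item (sku) VALUES (?)", (sku,))
    return cursor.lastrowid


first = add_item(conn, "A-1")
second = add_item(conn, "B-2")
conn.execute("DELETE FROM cart_item WHERE id = ?", (second,))
third = add_item(conn, "C-3")  # the deleted id is not handed out again
```

The returned id can go straight into child rows, so there is no need to count rows or probe for a free number first.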
Depending on the specific situation, there are clearly many reasons for not using consecutive numbers as a primary key.
However, given that I do want consecutive numbers as a primary key, I see no reason not to use the built-in auto_increment functionality MySQL offers.
It was probably done that way for historical reasons; i.e. earlier versions didn't have autoinc variables. I've written code that uses manual autoinc fields on databases that don't support autoinc types, but my code wasn't quite as inefficient as pulling a count().
One issue with using autoinc fields as a primary key is that moving records in and out of tables may result in the primary key changing. So, I'd recommend designing in a "LegacyID" field up front that can be used as future storage for the primary key for times when you are moving records in and out of the table.
They may just have been inexperienced and unfamiliar with auto increment. One reason I can think of, though it doesn't necessarily make much sense, is that it is difficult (not impossible) to copy data from one environment to another when using auto-increment ids.
For this reason, I have used sequential Guids as my primary key before for ease of transitioning data, but counting the rows to populate the ID is a bit of a WTF.
Two things to watch for:
1. Check that your RDBMS sets the auto-increment value sensibly upon restart. Our engineers were rolling their own auto-increment key to get around the auto-increment field jumping by hundreds of thousands whenever the server restarted. However, at some point Sybase added an option to set the size of the auto-increment.
2. The other place where auto-increment can get nasty is if you are replicating databases and are using a master-master configuration. If you write on both databases (NOT ADVISED), you can run into identity collisions.
I doubt either of these were the case, but things to be aware of.
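For the master-master case, MySQL has the auto_increment_increment and auto_increment_offset settings, which give each server its own interleaved id sequence. The sketch below only simulates that arithmetic in Python; the two-server setup and the helper name `id_stream` are assumptions.

```python
from itertools import count, islice
from typing import Iterator


def id_stream(offset: int, increment: int) -> Iterator[int]:
    """Ids a server would generate with the given offset and increment."""
    return count(start=offset, step=increment)


SERVER_COUNT = 2  # increment equals the number of writable servers

server_a_ids = list(islice(id_stream(offset=1, increment=SERVER_COUNT), 5))
server_b_ids = list(islice(id_stream(offset=2, increment=SERVER_COUNT), 5))
```

Because each server only ever produces ids from its own residue class, the two sequences never overlap.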
I could see it if the ids were generated on the client and pushed into the database; this is common practice when speed is necessary. But what you described seems over the top and unnecessary. Remove it and start using an auto-incrementing id.
I'm a beginning programmer, building a non-commercial web-site.
I need a user ID, and I thought it would be logical to use a simple INTEGER field with auto-increment for that. Does that make sense? The user ID will not be directly used by users (they'll have to select a user-name); should I care about where the values start (presumably 1)?
Any other best practices I should incorporate in building my 'Users' table?
Thanks!
JDelage
Your design is correct. Your internal PK should be a meaningless number, not seen by users of the system and maintained automatically. It doesn't matter if it starts at 1 and it doesn't matter if it's sequential or not, or if you have "holes" in the sequence. (For cases in which you do expose the number to end users, it is sometimes important that the numbers be neither sequential nor fully-populated so that they are not guessable).
Users should identify themselves to the system with another, meaningful piece of information (such as an email address). That piece of information should either be guaranteed unique (using a UNIQUE index) or else your front end must provide an interface for disambiguation.
Among the benefits of this design are:
The meaningful identifier for the account can be changed by updating one value in one record of one table, rather than requiring updates all around the database.
Your PK value, which will appear many, many times in the database, is a small and efficiently indexed integer while your user-facing identifier can be of any type you want including a longish text string.
You can use a non-unique identifier with disambiguation if the application calls for it.
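A minimal sketch of that layout, using Python's sqlite3 in place of MySQL. The tables `users` and `posts`, their columns, and the sample addresses are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY AUTOINCREMENT, email TEXT NOT NULL UNIQUE)"
)
conn.execute(
    """
    CREATE TABLE posts (
        id      INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL REFERENCES users (id),  -- small integer, repeated often
        title   TEXT NOT NULL
    )
    """
)

user_id = conn.execute(
    "INSERT INTO users (email) VALUES (?)", ("old@example.com",)
).lastrowid
conn.execute("INSERT INTO posts (user_id, title) VALUES (?, ?)", (user_id, "Hello"))

# Changing the meaningful identifier touches one value in one row of one table.
conn.execute("UPDATE users SET email = ? WHERE id = ?", ("new@example.com", user_id))

owner_email = conn.execute(
    "SELECT u.email FROM posts p JOIN users u ON u.id = p.user_id WHERE p.title = ?",
    ("Hello",),
).fetchone()[0]
```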
auto_increment is okay.
But you shouldn't care about its particular number.
Quite the contrary, you should never be concerned with the identifier's particular value. Take it as an abstract identifier only.
Though I doubt it can be invisible to users. Do you have another identifier to use? Auto-increment identifiers are usually visible to users as well. For example, your ID here is 98361, and nobody is hiding it. Such numbers are very handy: being unique, unchanging, and forever bound to a particular table row, they will always identify the same thing (a user, an article, etc).
An auto-incrementing field is fine unless you need to do things like share this ID across multiple databases; in that case you will probably need to create the id value yourself. Also beware of exporting and importing data. If you are not careful, all the id values will get reassigned.
In general I avoid auto incrementing fields so I have more control over how the id values are generated. Which is not to say I care what the values are just that they are unique. These are internal values the end user should never see.
Yes, that is correct. Auto-Increment starts at 1, usually. It's not usually accepted to have 0 as an ID.
If you are storing passwords, do not store them as clear text. Use a salted, deliberately slow password hash (bcrypt, for example); md5 is popular but too fast to protect passwords.
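Passwords are generally better protected by a salted, deliberately slow hash than by a single pass of a fast hash. Here is a sketch using PBKDF2 from Python's standard library; the iteration count and the helper names are assumptions, and a dedicated password-hashing library would be a reasonable alternative.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 200_000  # illustrative value; tune it for your hardware


def hash_password(password: str) -> Tuple[bytes, bytes]:
    """Return (salt, digest); store both alongside the user row."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected)


salt, stored = hash_password("correct horse")
good_login = verify_password("correct horse", salt, stored)
bad_login = verify_password("wrong guess", salt, stored)
```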
Yes, auto incrementing is fine. You will probably be saving passwords as well; make sure these are protected with a salted password hash rather than plain md5 or reversible encryption.
Also make sure you index the columns you will use to perform lookups, such as email etc... to avoid full table scans.