Is there any good practice for this?
I would like to solve the problem of the primary key hitting its limit rather than just avoid it, because that is what will happen in my specific case.
If it's unavoidable... what can I do?
This is a MySQL question; does Sybase SQL Anywhere have the same problem?
Why would you hit the limit on that field? If you defined it with a datatype large enough it should be able to hold all your records.
If you use an unsigned BIGINT you can have up to 18,446,744,073,709,551,615 records!
You should pick the correct type for the primary key; if you know you will have lots of rows, you could use BIGINT instead of the commonly used INT.
In MySQL you can easily adjust the primary key column's range with an ALTER TABLE statement. You should also make that column UNSIGNED, because an auto-increment primary key is always positive.
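For example, a minimal sketch of such a change (the table and column names topics and id are just placeholders, not from the question):

-- Widen the auto-increment primary key to an unsigned BIGINT (placeholder names).
ALTER TABLE topics
    MODIFY id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;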
When the limit is eventually reached, you could maybe add some handling logic in the ON DUPLICATE KEY UPDATE clause of your INSERT statements.
Well, it depends on the auto-increment column's datatype.
An unsigned INT goes up to 4,294,967,295.
If you want to prevent the error, you can check the last auto-increment value with LAST_INSERT_ID().
If it's approaching the datatype's max, either do not allow insertion or handle it in other ways.
Other than that, I can only suggest you use BIGINT, which you are very unlikely to max out in most scenarios.
Can't give you a foolproof answer though :)
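A tiny sketch of that check, assuming a hypothetical table mytable with a name column and an unsigned INT auto-increment id:

INSERT INTO mytable (name) VALUES ('example');
-- Compare the id that was just generated against the unsigned INT maximum.
SELECT LAST_INSERT_ID() AS last_id,
       4294967295 - LAST_INSERT_ID() AS ids_remaining;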
I know this question might be too old, but I would like to answer as well.
Strictly speaking, it is impossible to avoid that scenario entirely: there is a physical limit to how many storage drives humankind is able to make. In practice, though, actually filling all available storage is very unlikely to happen.
As others have told you, an UNSIGNED BIGINT is able to handle up to 18,446,744,073,709,551,615 records, probably the way to go in "most" cases.
Here is another idea: by monitoring the number of records in your table (for example, when it is 85% full relative to the key's maximum), you could back up that table into another table/region/database, scale your infrastructure accordingly, and then reset the initial table.
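A minimal sketch of that idea, assuming a hypothetical table events and an archive table events_archive with the same structure:

-- Copy the existing rows into the archive table.
INSERT INTO events_archive SELECT * FROM events;
-- Empty the original table; TRUNCATE also resets the AUTO_INCREMENT counter to 1.
TRUNCATE TABLE events;

Keep in mind that after the reset, newly generated ids will repeat values that were already handed out, so this is only safe if nothing else still references the old ids.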
And my last approach: some companies make a tiny change to their license and usage agreement, stating that they will shut down an account if the user does not log in for a certain amount of time (say, 6 months for free users, 6 years for pro users, 60 years for ultimate users... <-- hey! you can also use different tables for those too!).
Hope somebody finds this useful.
Related
I know very little about MySQL (or web development in general). I'm a Unity game dev and I've got a situation where users (of a region the size of which I haven't decided yet, possibly globally) can submit entries to an online database. The users must be able to then locate their entry at any time.
For this reason, I've generated a guid from .Net (System.Guid.NewGuid()) and am storing that in the database entry. This works for me! However... I'm no expert, but my gut tells me that looking up a complex string in what could be a gargantuan table might have terrible performance.
That said, it doesn't seem like anything other than a globally unique identifier will solve my problem. Is there a more elegant solution that I'm not seeing, or a way to mitigate against any issues this design pattern might create?
Thanks!
Make sure you define the GUID column as the primary key in the MySQL table. That will cause MySQL to create an index on it, which will enable MySQL to quickly find a row given the GUID. The table might be gargantuan but (assuming a regular B-tree index) the time required for a lookup will increase logarithmically relative to the size of the table. In other words, if it requires 2 reads to find a row in a 1,000-row table, finding a row in a 1,000,000-row table will only require 2 more reads, not 1,000 times as many.
As long as you have defined the primary key, the performance should be good. This is what the database is designed to do.
Obviously there are limits to everything. If you have a billion users and they're submitting thousands of these entries every second, then maybe a regular indexed MySQL table won't be sufficient. But I wouldn't go looking for some exotic solution before you even have a problem.
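A minimal sketch of such a table (the table and column names are illustrative, not taken from the question):

CREATE TABLE entries (
    guid CHAR(36) NOT NULL,   -- e.g. the string produced by System.Guid.NewGuid()
    payload TEXT,
    PRIMARY KEY (guid)        -- MySQL automatically builds an index on the primary key
);

-- Lookups by GUID then use that index:
SELECT payload FROM entries WHERE guid = '123e4567-e89b-12d3-a456-426614174000';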
If you have a key of the row you want, and you have an index on that key, then this query will take less than a second, even if the table has a billion rows:
SELECT ... FROM t WHERE id = 1234.
The index in question might be the PRIMARY KEY, or it could be a secondary key.
GUIDs/UUIDs should be used only if you need to manufacture unique ids in multiple clients without asking the database for an id. If you do use them, be aware that GUIDs perform poorly once the table is bigger than RAM.
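One common mitigation, assuming MySQL 8.0+ (which provides UUID_TO_BIN and BIN_TO_UUID; the table name below is made up), is to store the UUID as 16 bytes instead of a 36-character string, which shrinks the primary key index considerably:

CREATE TABLE entries_bin (
    guid BINARY(16) NOT NULL,  -- 16 bytes instead of 36 characters
    payload TEXT,
    PRIMARY KEY (guid)
);

-- Convert to and from the textual form at the boundaries:
INSERT INTO entries_bin (guid, payload)
VALUES (UUID_TO_BIN('123e4567-e89b-12d3-a456-426614174000'), 'hello');

SELECT BIN_TO_UUID(guid) AS guid, payload
FROM entries_bin
WHERE guid = UUID_TO_BIN('123e4567-e89b-12d3-a456-426614174000');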
For my project I want to create an online billboard system where everyone can post a topic.
I am trying to design the database using SQL to store the information for each topic, including the topic's id as the primary key.
At first I designed the id as an integer with auto-increment, as I think that's the simplest way. Then I thought about it and realized that the integer has a limit (the number may be high, but it is there), so I'm here looking for another method.
Now I am thinking of some pseudo-random algorithm, or hashing the topic's name, but it's still not clear to me.
I also found GUIDs while researching here, but I'm not sure whether they can be used.
I would appreciate suggestions on how to deal with the size limit of an integer primary key, or any keywords for further research.
This answer assumes MySQL/MariaDB, because it uses the terminology "auto-increment" for such columns (as opposed to other databases that use identity or serial).
If int isn't big enough, you can use bigint.
Although I consider it unlikely that you'll exceed the threshold for int (it works for many applications), bigint would require great effort on your and your computer's part, for a long, long time, to exceed its maximum value.
This is explained in the documentation.
With int, the maximum value supported by SQL Server is 2,147,483,647.
Just for completeness, I will also add that yet another option is to change the data type of the column to bigint (maximum value 9,223,372,036,854,775,807 - this will allow you to insert a million rows per second, for almost 300,000 years in a row).
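A quick sanity check of that figure, assuming a steady one million inserts per second:

-- bigint max divided by (rows per second * seconds per year) ~= 292,000 years
SELECT 9223372036854775807 / (1000000 * 60 * 60 * 24 * 365.25)
       AS years_at_one_million_rows_per_second;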
Or if you fear that you will overflow even that, you can consider using decimal(38,0) - the maximum here is a number consisting of 38 9's (which will allow you to maintain that same pace for a whopping 31,709,791,983,764,586,504,312,531 years).
http://sqlblog.com/blogs/hugo_kornelis
I searched Google for a question I have been asking myself since this morning, but couldn't find any information or article about it.
I was wondering whether, in the following situation, I could improve performance (even by a small percentage):
Context: I have two columns: ID and AddedAt (AddedAt is the Unix timestamp of when the row is created).
Theoretically, if you insert a new row, ID will be +1 and AddedAt will be the current time.
Now, let's say it is impossible in the current situation to have two simultaneous inserts. Would it be better to use AddedAt as the PK and remove the ID column? AddedAt would then be a single, unique column serving as both the PK and the Unix timestamp. In the end, I would have one column instead of two.
The only downside I see is perhaps the size of the key that will be created on AddedAt, since a Unix timestamp nowadays is 10 digits.
Would it be better in this situation? What's your opinion?
EDIT: What about using the timestamp plus milliseconds?
Timestamps are in seconds. While you might not have simultaneous inserts today, as the world tends to speed up you might get multiple inserts in a second. Build your system to function soundly--don't use timestamps as primary keys.
Also, with statement-based replication, timestamps sometimes aren't consistent across databases... Row-based replication alleviates this, but it's still another reason for concern when using them.
From a convention standpoint, primary keys should have some clear meaning to people other than yourself if they are anything more than a plain old auto-incrementing id field. Generally, people expect numeric or character values for keys, not things like blobs, timestamps, datetimes, etc. This is especially true if the key is later used as a foreign key in another table; using a timestamp as a foreign key can be confusing to later developers. Sure, if you have a varchar GUID field you know is unique, use it as the key. Just remember that when it is used as a foreign key you are going to eat up quite a bit of memory if it is a huge string.
Assuming you can guarantee that two events won't occur within the same 1-second interval, then sure, you could use the timestamp field as a PK.
That being said, why are you worried about key sizes? A timestamp may be 10 digits, but its internal storage requirement is only 4 bytes. By comparison, an int is also 4 bytes, so you wouldn't be losing anything - unless you're using bigints, in which case it's 8 bytes.
Also, note that timestamp fields are subject to the Y2038 problem. They're essentially Unix timestamps that auto-format into a human-readable date for you. If your app is going to be around for more than 26 years, then you should stick with an int/bigint, whose limit depends on how fast you insert rows rather than on a fixed date/time.
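As a small sketch of that failure mode (the table name is made up), two inserts landing in the same second collide on a TIMESTAMP primary key:

CREATE TABLE events_by_time (
    AddedAt TIMESTAMP NOT NULL,
    payload VARCHAR(100),
    PRIMARY KEY (AddedAt)
);

INSERT INTO events_by_time (AddedAt, payload) VALUES (NOW(), 'first');
-- If this runs within the same second, it fails with a duplicate-key error:
INSERT INTO events_by_time (AddedAt, payload) VALUES (NOW(), 'second');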
The primary key is not only a technical thing, it is the business representation of something that makes each object represented by a row unique.
A timestamp is a unique field of your object because you cannot (in your case) insert two objects at the same time, but it is NOT the primary definition of a business object (if you had a business object called "timestamp" then yes, the time when it was inserted should be the primary key)
An ID stands for "my client has a physical id that represents him": in the past, we would give numbers to clients on papers, bills...
Never forget that computer science is not the objective per se but the means to achieve your goals.
I would leave the ID column as the primary key, as there are scenarios in which the Unix timestamp will give you a value you're not expecting. One is that inserting very quickly in succession can return the same timestamp, and another is the server admin deciding to monkey with the server's time settings.
Joins will also probably be much more obvious, as people typically expect the primary key to be some sort of unique id, not a timestamp.
Yes, of course, but the performance gain will be minimal and only while adding new records.
Moreover, you will be forced to use the timestamp as the foreign key in all related objects.
It is worth considering only if you expect many inserts per second and a lot of records (to save storage on the id column and its index), but as you said the timestamp will be unique, so that's at most 1 record per second :-)
I use ids for almost all my tables; you never know when they will come in handy. But today I read this...
Be extra careful to make sure that, according to convention, your ‘id’ column (or primary key) is:
char(36) and never varchar(36)
CakePHP will work with both definitions, however you will be sacrificing about 50% of the performance of your DB (MySQL in particular). This will be most evident in more complicated SELECT’s, which might require some JOIN’s or calculations.
I wonder... why even use something text-based, when you only have to save integers? I care a great deal about using the right formats for the right content, so I wonder if char gives any performance improvements over integers?
I would strongly suggest using ints. I am doing some modelling for my thesis and I work on large datasets. I had to create a table with about 70,000,000 rows. My primary key was varchar + int. At the beginning, one cycle of creating a 5-digit number of rows took 5 minutes; soon it took 40. Dropping the primary key fixed my performance issue. I guess that ensuring uniqueness was becoming more and more time consuming. I had no similar issues when my primary key was an int.
It is personal experience though, so maybe someone can give a more theoretical and reliable answer.
char doesn't give any improvement over integer. But it's useful when you need to prevent users from knowing about or tampering with rows they shouldn't access.
Let's say each user has a profile picture named /img/$id.jpg (the simplest case, since you don't have to store any data in the DB for this; of course, there are other ways). If you use an integer, someone can loop through all the profile pictures you have. With a UUID, they can't.
When you have a lot of records, the auto-increment int is better for performance. You can put the UUID in another field (secret_key, for example).
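A minimal sketch of that layout (all names are illustrative): the integer remains the primary key used for joins and index performance, while the UUID is the only value exposed to users:

CREATE TABLE users (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    secret_key CHAR(36) NOT NULL,   -- e.g. filled with UUID() on insert
    name VARCHAR(100),
    PRIMARY KEY (id),
    UNIQUE KEY uk_secret_key (secret_key)
);

-- Public lookups go through the non-guessable value:
SELECT id, name FROM users WHERE secret_key = '123e4567-e89b-12d3-a456-426614174000';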
The database I am working on right now has records being added and removed by the tens of thousands; because of this, there are gaps in the auto-incremented key hundreds of thousands wide, and the auto-increment value is well into the billions.
These numbers are never stored to reference the individual record but are used to reference the record when doing on-the-fly calculations.
Is there any reason to remove these gaps and reset the auto-increment number or is this inconsequential?
The id field is an unsigned int; should I change it to an unsigned bigint? From what I understand, right now if it hits 4,294,967,295 it will break.
The only reason I'd worry about it is if you find yourself close to that 2^32 limit. If you're not using the column as a row id, then don't even worry about it.
EDIT If you are using this column for any kind of identifying information, then I'd switch the column over to a GUID or something, because you're gonna get overflow, and then you'll get duplicate values. And that's no bueno.
I don't know what the growth rate of your autoincrement field is, but it should be simple math for you to estimate when you will hit the 4294967295 limit.
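A back-of-the-envelope sketch of that math, assuming MySQL; the schema/table names and the 500,000-ids-per-day growth rate are placeholders you would replace with your own measurements:

-- Read the next auto-increment value and estimate how long until the unsigned INT max.
SELECT AUTO_INCREMENT,
       (4294967295 - AUTO_INCREMENT) / 500000 AS days_left_at_500k_ids_per_day
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'mydb' AND TABLE_NAME = 'mytable';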
If you still feel that you need to do something, you have the following options:
Reset the current count to 1, ideally by dropping the column and recreating it. Since you are not using this field for referential integrity, this should be a quick and simple fix until the next time...
Change the datatype to an unsigned BIGINT. Now you can go up to 18446744073709551615. But you need more space in the heap to store this increased amount of data, and you have only postponed your problem.
Change from an autoincrement (INT / BIGINT) to a UUID. Then you can stop worrying about numbers and the nature of infinity, but you most likely will have to change all of your client code.
On a separate note, I sense a poor decision or two somewhere earlier up the line here.