I want to create a system of online billboard where everyone can post a topic as my project.
I try to design the database using SQL to store the information of each topic, including the topic's id as primary key.
At first I design the id using integer datatype with auto-increment, as I think it's the simplest way. Then I thought about it and found out that the integer has limit(the number may be high but it is there), so I'm here finding another method.
Now I think of some pseudo-random algorithms, or use the hashing of topic's name but still not clear.
I also find the GUID from research in here, but not sure can it be used.
I wish you suggest me some ways of how to deal with the limit size of integer as primary key, or suggest me any keywords for me to do further research.
This answer assumes MySQL/MariaDB, because it uses the terminology "auto-increment" for such columns (as opposed to other databases that use identity or serial).
If int isn't big enough, you can use bigint.
Although I might consider it unlikely that you'll exceed the thresholds for int (it works for many applications), bigint would require great effort on you and your computers part for a long, long time to exceed the maximum value.
This is explained in the documentation.
With int, the maximum value supported by SQL Server is 2,147,483,647.
Just for completeness, I will also add that yet another option is to change the data type of the column to bigint (maximum value 9,223,372,036,854,775,807 - this will allow you to insert a million rows per second, for almost 300,000 years in a row).
Or if you fear that you will overflow even that, you can consider using decimal(38,0) - the maximum here is a number consisting of 38 9's (which will allow you to maintain that same pace for a whopping 31,709,791,983,764,586,504,312,531 years).
http://sqlblog.com/blogs/hugo_kornelis
Related
We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated from dozens, hundreds, or even thousands of remote computers and being inserted at a rate of 100-40,000 inserts per second, and we'll never do any updates.
The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planing to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.
We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing as varchar(36) and I now realize we'd be better off using binary(16) - though how much better off I'm not sure.
The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?
I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!
At my job, we use UUID as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way).
It's one of those things that when you have less than 1000 records it;s ok, but when you have millions, it's the worst thing you can do. Why? Because UUID are not sequential, so everytime a new record is inserted MSSQL needs to go look at the correct page to insert the record in, and then insert the record. The really ugly consequence with this is that the pages end up all in different sizes and they end up fragmented, so now we have to do de-fragmentation periodic.
When you use an autoincrement, MSSQL will always go to the last page, and you end up with equally sized pages (in theory) so the performance to select those records is much better (also because the INSERTs will not block the table/page for so long).
However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.
I would recommend the following model:
PK INT Identity
Additional column automatically generated as UUID.
This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).
NOTE: That the best solution is to use NEWSEQUENTIALID (like I was saying in the comments), but for legacy app with not much time to refactor (and even worse, not controlling all inserts), it is not possible to do.
But indeed as of 2017, I'd say the best solution here is NEWSEQUENTIALID or doing Guid.Comb with NHibernate.
A UUID is a Universally Unique ID. It's the universally part that you should be considering here.
Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.
I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).
If your IDs do not need to be universally unique, then I don't think that you can do much better then just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time)
Something to take into consideration is that Autoincrements are generated one at a time and cannot be solved using a parallel solution. The fight for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.
On performance, briefly:
A UUID like the one above is 36
characters long, including dashes. If
you store this VARCHAR(36), you're
going to decrease compare performance
dramatically. This is your primary
key, you don't want it to be slow.
At its bit level, a UUID is 128 bits,
which means it will fit into 16 bytes,
note this is not very human readable,
but it will keep storage low, and is
only 4 times larger than a 32-bit int,
or 2 times larger than a 64-bit int.
I will use a VARBINARY(16)
Theoretically, this can work without a
lot of overhead.
I recommend reading the following two posts:
Brian "Krow" Aker's Idle Thoughts - Myths, GUID vs Autoincrement
To UUID or not to UUID ?
I reckon between the two, they answer your question.
I tend to avoid UUID simply because it is a pain to store and a pain to use as a primary key but there are advantages. The main one is they are UNIQUE.
I usually solve the problem and avoid UUID by using dual key fields.
COLLECTOR = UNIQUE ASSIGNED TO A MACHINE
ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)
This offers me two things. Speed of auto-inc fields and uniqueness of data being stored in a central location after it is collected and grouped together. I also know while browsing the data where it was collected which is often quite important for my needs.
I have seen many cases while dealing with other data sets for clients where they have decided to use UUID but then still have a field for where the data was collected which really is a waste of effort. Simply using two (or more if needed) fields as your key really helps.
I have just seen too many performance hits using UUID. They feel like a cheat...
Instead of centrally generating unique keys for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. Then you solve the problem of overhead by connecting for each insert.
Keyserver maintains next available id
Server 1 requests id block.
Keyserver returns (1,1000)
Server 1 can insert a 1000 records until it needs to request a new block
Server 2 requests index block.
Keyserver returns (1001,2000)
etc...
You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.
I realize this question is rather old but I did hit upon it in my research. Since than a number of things happened (SSD are ubiquitous InnoDB got updates etc).
In my research I found this rather interesting post on performance:
claiming that due to the randomness of a GUID/UUID index trees can get rather unbalanced. in the MariaDB KB I found another post suggested a solution.
But since than the new UUID_TO_BIN takes care of this. This function is only available in MySQL (tested version 8.0.18) and not in MariaDB (version 10.4.10)
TL;DR: Store UUID as converted/optimized BINARY(16) values.
I would assign each server a numeric ID in a transactional manner.
Then, each record inserted will just autoincrement its own counter.
Combination of ServerID and RecordID will be unique.
ServerID field can be indexed and future select performance
based on ServerID (if needed) may be much better.
The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:
choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
make the UUID a secondary key with an auto-increment int primary key
... but these are all hacks--and probably fragile ones at that.
The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.
What about some hand crafted UID? Give each of the thousands of servers an ID and make primary key a combo key of autoincrement,MachineID ???
Since the primary key is generated decentralised, you don't have the option of using an auto_increment anyway.
If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of UUIDs. They are easier to generate and can at least not hurt the performance of the database.
The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important, how much performance is improved?
The main case where UUIDs cause miserable performance is ...
When the INDEX is too big to be cached in the buffer_pool, each lookup tends to be a disk hit. For HDD, this can slow down the access by 10x or worse. (No, that is not a typo for "10%".) With SSDs, the slowdown is less, but still significant.
This applies to any "hash" (MD5, SHA256, etc), with one exception: A type-1 UUID with its bits rearranged.
Background and manual optimization: UUIDs
MySQL 8.0: see UUID_TO_BIN() and BIN_TO_UUID()
MariaDB 10.7 carries this further with its UUID datatype.
I'm creating a database for combustion experiments. Each experiment has some scientific metadata which I call 'details'. For example ('Fuel', 'C2H6') or ('Pressure', 120). Because the same detail names (like 'Fuel') show up a lot, I created a table just to store the names and units. Here's a simplified version:
CREATE TABLE properties (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(50) NOT NULL,
units NVARCHAR(15) NOT NULL DEFAULT 'dimensionless',
);
I also created a table called 'details' which maps 'properties' to values.
CREATE TABLE details (
id INT AUTO_INCREMENT PRIMARY KEY,
property_id INT NOT NULL,
value VARCHAR(30),
FOREIGN KEY(property_id) REFERENCES properties(id)
);
This isn't ideal because the value attribute is sometimes a chemical name and sometimes a float. In the future, there may even be new entries that have integer values. Storing everything in a VARCHAR seems wasteful. Since it'll be hard to change later, I want to make the right decision now.
I've been researching this for hours and have considered four options:
Store everything as varchar under value (simplest to develop)
Use an EAV model (most complicated to develop).
Create a column for each type, and have plenty of NULL entries.
value_float, value_int, value_char
Use the JSON datatype.
Looking into each one, it seems like they're all bad in different ways. (1) is bad since it takes up extra space and I have to do extra operations to parse strings into numeric values. (2) is bad because of the huge increase in complexity (four extra tables and a lot more join operations), plus I hear EAV is to be avoided. (3) is a middle-ground in complexity, but there will be two NULL values for each table entry. (4) seems similar to (1), and I'm not sure how it might be better or worse.
I don't expect to have huge growth on this database or millions of entries. It just needs to be fast and searchable for researchers. I'm willing to have more backend complexity for a better/faster user experience.
By now I realize that there aren't that many clear-cut answers in database design. I'm simply asking for some insight into my three options, or perhaps another option I haven't thought of.
EDIT: Added JSON as an option.
Well, you have to sacrify something. Either HD space, or performance, or specific/general dimension or easy/complex to develop dimension. Choose a mix suitable for your needs and situation. - I solved it in 2000 in a general kind of EAV solution this way: basic record had a common properties shared by majority of events, then joins to properties without values (associative table), and those ones very specific properties/values I stored in a BLOB in XML like tags. This way I combined frequent properties with those very specific ones. AS this was intended as VERY GENERAL solution, you probably don't need, I'd sacrifice space, it's cheap today. Who cares if you take more space than it's "correct according to data modeling theory". Ok data model will be ugly, so what ? - You'll still need to decide on specific/general dimension - how specific attributes will be solved - either as specific columns (yes if they are repeated often) or in Property-TypeOfProperty-Value type of table.
I use id's for almost all my tables, you never know when they come handy. But today I read this...
Be extra careful to make sure that, according to convention, your ‘id’ column (or primary key) is:
char(36) and never varchar(36)
CakePHP will work with both definitions, however you will be sacrificing about 50% of the performance of your DB (MySQL in particular). This will be most evident in more complicated SELECT’s, which might require some JOIN’s or calculations.
I wonder... why even use something text-based, when you only have to save integers? I care a great deal about using the right formats for the right content, so I wonder if char gives any performance improvements over integers?
I would strongly suggestest using ints. I am doing some modelling for my thesis and I work on large datasets. I had to create a table with about ~70.000.000 rows. My primary key was varchar + int. At the beginning one cycle of creating 5-digit number of rows took 5 minutes, soon it became 40. Dropping the primary key fixed my performance issue. I guess that it is because ensuring uniqueness and it was becoming more and more time consuming. I had no similar issues when my primary key was int.
it is personal experience though, so maybe someone can give more theoretic and reliable answer.
char doesn't give any improvement over integer. But it's useful when you need to prevent users from knowing or tampering with other rows that you don't want them to.
Let's say each user have a profile picture with the naming /img/$id.jpg (the simplest case, since you don't have to store any data in DB for this information. Of course, there are other ways) If you use integer, someone can loop through all profile pictures that you have. With UUID, they can't.
When you have a lot of records, the auto increment int is better for performance. You can put the uuid in another field (secret_key, for example).
We're considering using UUID values as primary keys for our MySQL database. The data being inserted is generated from dozens, hundreds, or even thousands of remote computers and being inserted at a rate of 100-40,000 inserts per second, and we'll never do any updates.
The database itself will typically get to around 50M records before we start to cull data, so not a massive database, but not tiny either. We're also planing to run on InnoDB, though we are open to changing that if there is a better engine for what we're doing.
We were ready to go with Java's Type 4 UUID, but in testing have been seeing some strange behavior. For one, we're storing as varchar(36) and I now realize we'd be better off using binary(16) - though how much better off I'm not sure.
The bigger question is: how badly does this random data screw up the index when we have 50M records? Would we be better off if we used, for example, a type-1 UUID where the leftmost bits were timestamped? Or maybe we should ditch UUIDs entirely and consider auto_increment primary keys?
I'm looking for general thoughts/tips on the performance of different types of UUIDs when they are stored as an index/primary key in MySQL. Thanks!
At my job, we use UUID as PKs. What I can tell you from experience is DO NOT USE THEM as PKs (SQL Server by the way).
It's one of those things that when you have less than 1000 records it;s ok, but when you have millions, it's the worst thing you can do. Why? Because UUID are not sequential, so everytime a new record is inserted MSSQL needs to go look at the correct page to insert the record in, and then insert the record. The really ugly consequence with this is that the pages end up all in different sizes and they end up fragmented, so now we have to do de-fragmentation periodic.
When you use an autoincrement, MSSQL will always go to the last page, and you end up with equally sized pages (in theory) so the performance to select those records is much better (also because the INSERTs will not block the table/page for so long).
However, the big advantage of using UUID as PKs is that if we have clusters of DBs, there will not be conflicts when merging.
I would recommend the following model:
PK INT Identity
Additional column automatically generated as UUID.
This way, the merge process is possible (UUID would be your REAL key, while the PK would just be something temporary that gives you good performance).
NOTE: That the best solution is to use NEWSEQUENTIALID (like I was saying in the comments), but for legacy app with not much time to refactor (and even worse, not controlling all inserts), it is not possible to do.
But indeed as of 2017, I'd say the best solution here is NEWSEQUENTIALID or doing Guid.Comb with NHibernate.
A UUID is a Universally Unique ID. It's the universally part that you should be considering here.
Do you really need the IDs to be universally unique? If so, then UUIDs may be your only choice.
I would strongly suggest that if you do use UUIDs, you store them as a number and not as a string. If you have 50M+ records, then the saving in storage space will improve your performance (although I couldn't say by how much).
If your IDs do not need to be universally unique, then I don't think that you can do much better then just using auto_increment, which guarantees that IDs will be unique within a table (since the value will increment each time)
Something to take into consideration is that Autoincrements are generated one at a time and cannot be solved using a parallel solution. The fight for using UUIDs eventually comes down to what you want to achieve versus what you potentially sacrifice.
On performance, briefly:
A UUID like the one above is 36
characters long, including dashes. If
you store this VARCHAR(36), you're
going to decrease compare performance
dramatically. This is your primary
key, you don't want it to be slow.
At its bit level, a UUID is 128 bits,
which means it will fit into 16 bytes,
note this is not very human readable,
but it will keep storage low, and is
only 4 times larger than a 32-bit int,
or 2 times larger than a 64-bit int.
I will use a VARBINARY(16)
Theoretically, this can work without a
lot of overhead.
I recommend reading the following two posts:
Brian "Krow" Aker's Idle Thoughts - Myths, GUID vs Autoincrement
To UUID or not to UUID ?
I reckon between the two, they answer your question.
I tend to avoid UUID simply because it is a pain to store and a pain to use as a primary key but there are advantages. The main one is they are UNIQUE.
I usually solve the problem and avoid UUID by using dual key fields.
COLLECTOR = UNIQUE ASSIGNED TO A MACHINE
ID = RECORD COLLECTED BY THE COLLECTOR (auto_inc field)
This offers me two things. Speed of auto-inc fields and uniqueness of data being stored in a central location after it is collected and grouped together. I also know while browsing the data where it was collected which is often quite important for my needs.
I have seen many cases while dealing with other data sets for clients where they have decided to use UUID but then still have a field for where the data was collected which really is a waste of effort. Simply using two (or more if needed) fields as your key really helps.
I have just seen too many performance hits using UUID. They feel like a cheat...
Instead of centrally generating unique keys for each insertion, how about allocating blocks of keys to individual servers? When they run out of keys, they can request a new block. Then you solve the problem of overhead by connecting for each insert.
Keyserver maintains next available id
Server 1 requests id block.
Keyserver returns (1,1000)
Server 1 can insert a 1000 records until it needs to request a new block
Server 2 requests index block.
Keyserver returns (1001,2000)
etc...
You could come up with a more sophisticated version where a server could request the number of needed keys, or return unused blocks to the keyserver, which would then of course need to maintain a map of used/unused blocks.
I realize this question is rather old but I did hit upon it in my research. Since than a number of things happened (SSD are ubiquitous InnoDB got updates etc).
In my research I found this rather interesting post on performance:
claiming that due to the randomness of a GUID/UUID index trees can get rather unbalanced. in the MariaDB KB I found another post suggested a solution.
But since than the new UUID_TO_BIN takes care of this. This function is only available in MySQL (tested version 8.0.18) and not in MariaDB (version 10.4.10)
TL;DR: Store UUID as converted/optimized BINARY(16) values.
I would assign each server a numeric ID in a transactional manner.
Then, each record inserted will just autoincrement its own counter.
Combination of ServerID and RecordID will be unique.
ServerID field can be indexed and future select performance
based on ServerID (if needed) may be much better.
The short answer is that many databases have performance problems (in particular with high INSERT volumes) due to a conflict between their indexing method and UUIDs' deliberate entropy in the high-order bits. There are several common hacks:
choose a different index type (e.g. nonclustered on MSSQL) that doesn't mind it
munge the data to move the entropy to lower-order bits (e.g. reordering bytes of V1 UUIDs on MySQL)
make the UUID a secondary key with an auto-increment int primary key
... but these are all hacks--and probably fragile ones at that.
The best answer, but unfortunately the slowest one, is to demand your vendor improve their product so it can deal with UUIDs as primary keys just like any other type. They shouldn't be forcing you to roll your own half-baked hack to make up for their failure to solve what has become a common use case and will only continue to grow.
What about some hand crafted UID? Give each of the thousands of servers an ID and make primary key a combo key of autoincrement,MachineID ???
Since the primary key is generated decentralised, you don't have the option of using an auto_increment anyway.
If you don't have to hide the identity of the remote machines, use Type 1 UUIDs instead of UUIDs. They are easier to generate and can at least not hurt the performance of the database.
The same goes for varchar (char, really) vs. binary: it can only help matters. Is it really important, how much performance is improved?
The main case where UUIDs cause miserable performance is ...
When the INDEX is too big to be cached in the buffer_pool, each lookup tends to be a disk hit. For HDD, this can slow down the access by 10x or worse. (No, that is not a typo for "10%".) With SSDs, the slowdown is less, but still significant.
This applies to any "hash" (MD5, SHA256, etc), with one exception: A type-1 UUID with its bits rearranged.
Background and manual optimization: UUIDs
MySQL 8.0: see UUID_TO_BIN() and BIN_TO_UUID()
MariaDB 10.7 carries this further with its UUID datatype.
Is there any good practice for this?
I wish I could solve the problem when primary key hit the limit and not to avoid it. Because this is what will happen in my specific problem.
If it's unavoidable... What can i do?
This is mysql question, is there Sybase sql anywhere same problem?
Why would you hit the limit on that field? If you defined it with a datatype large enough it should be able to hold all your records.
If you used an unsigned bigint you can have up to 18,446,744,073,709,551,615 records!
You should pick the correct type for the primary key, if you know you will have lots of rows you could use bigint instead of the commonly used int.
In mysql you can easily adjust the primary key collumn with the alter table statement to adjust the range.
you should also use the unsigned property on that collumn because an auto increment primary key is always positive.
when the limit is reached you could maybe create some algorithm to put inside the ON DUPLICATE KEY UPDATE statement of an INSERT
Well, depending on the autoincrement column's datatype.
Unsign int goes up to 4294967295.
If you want to prevent the error, you can check the value last autoincrement value: LAST_INSERT_ID()
If it's approaching the datatype's max, either do not allow insertion or handle it in other ways.
Other than that, I can only suggest you use bigint so you can almost not hit the max for most scenario.
Can't give you a foolproof answer though :)
I know this question might be too old, but I would like to answer as well.
It is actually impossible to make that scenario unavoidable. Just by
thinking there is a physical limit about how many storage drives
humankind is able to make. But this is certainly not likely to happen
to fill all available storage.
As others have told you, an UNSIGNED BIGINT is able to handle up to 18,446,744,073,709,551,615 records, probably the way to go in "most" cases.
Here is another idea: by triggering the number of records of your database (for example, 85% full) you could backup that table into another table/region/database, and scale your infrastructure accordingly. And then reset that initial table.
And my last approach: some companies opt to do a tiny change to their license and use agreement, that they will shutdown an account if the user do not log in for a certain amount of time (say, 6 months for free users, 6 years for pro users, 60 years for ultimate users... <-- hey! you can also use different tables for those too!).
Hope somebody find this useful.