I am creating a REST API with a MySQL database. I would like to know if using auto-incrementing IDs as primary keys, (to keep good performance) and unique uuid fields (used as API ID) is a bad idea? If so why?
(from Comment) The purpose of the UUID is to provide an opaque id in the API, while using a simpler, more efficient, BIGINT for internal purposes.
UUIDs have these benefits:
They can be created independently by multiple clients, while being unique.
They obfuscate the ids. (Example: avoid hackers discovering valid ids.)
IDs have these benefits:
Smaller disk space (and cache) needed, hence somewhat faster.
Temporally oriented ("recent" inserts are clustered "together"). This is a performance benefit for very large tables (or small RAMs).
"Natural" Primary keys (a column or combination of columns that is intrinsically unique):
may be smaller
may be faster
more logical.
Example: In the case of a many-to-many mapping table (just 2 ids pointing to two other tables), PRIMARY KEY(a_id, b_id), INDEX(b_id, a_id) is clearly faster and smaller.
UUIDs are 36 or 16 bytes; ids are 8 bytes or 4 or smaller. A natural key may take 0 extra bytes (or may not).
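The sizes quoted above can be checked with a few lines of Python. This is only a sketch: the sample UUID is arbitrary, and the struct formats are stand-ins for the INT and BIGINT column types.

```python
import struct
import uuid

# An arbitrary, fixed UUID used only for measuring sizes.
sample = uuid.UUID("12345678-1234-5678-1234-567812345678")

text_size = len(str(sample))         # textual form with dashes, as stored in CHAR(36)
binary_size = len(sample.bytes)      # packed form, as stored in BINARY(16)
int_size = struct.calcsize("<i")     # 4-byte integer, comparable to INT
bigint_size = struct.calcsize("<q")  # 8-byte integer, comparable to BIGINT

print(text_size, binary_size, int_size, bigint_size)
```

The packed form is what makes the 16-byte figure possible; storing the dashed text costs more than twice as much per key.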
To answer your question: "It depends".
The tables I build have PKs:
Natural - 2/3 of the tables
Auto_inc - 1/3
UUID - essentially none.
(PS: I find REST to be clumsy and provide no real benefits, so I avoid it.)
Based on Comment
Probably you want:
An auto_inc id everywhere in the database;
A UUID for opaquely sending to the user. This avoids various hacking games that might be played with an auto_inc.
So, in the main table,
CREATE TABLE main (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
uuid BINARY(16) NOT NULL,
....
PRIMARY KEY (id),
UNIQUE(uuid),
...
) ENGINE=InnoDB
When creating a new row, compute a new UUID, strip the dashes and convert it with UNHEX(). (MySQL 8.0 also provides UUID_TO_BIN(), which does both steps.)
When sending a message to the user, include uuid, not id.
When receiving a reply message, quickly switch to using id by looking it up via that available index. Perhaps this way:
SELECT id FROM main WHERE uuid = ?
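A minimal sketch of that flow from the application side. It assumes Python, and it uses an in-memory SQLite database as a stand-in for MySQL (BLOB in place of BINARY(16)); the helper names insert_row and internal_id are made up for illustration.

```python
import sqlite3
import uuid

# SQLite stands in for MySQL here; BLOB plays the role of BINARY(16).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE main ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"  # internal key, never sent to clients
    " uuid BLOB NOT NULL UNIQUE,"             # opaque key exposed by the API
    " name TEXT NOT NULL)"
)

def insert_row(name: str) -> str:
    """Create a row and return the public (dashed) UUID to hand to the client."""
    public_id = uuid.uuid4()
    conn.execute(
        "INSERT INTO main (uuid, name) VALUES (?, ?)",
        (public_id.bytes, name),  # .bytes is the 16-byte packed form
    )
    return str(public_id)

def internal_id(public_id: str):
    """Translate a public UUID back to the internal id via the unique index."""
    row = conn.execute(
        "SELECT id FROM main WHERE uuid = ?",
        (uuid.UUID(public_id).bytes,),
    ).fetchone()
    return row[0] if row else None

first = insert_row("alpha")
second = insert_row("beta")
print(first, internal_id(first))
```

After the lookup, every further query and join can use the small integer id, and the UUID only appears at the API boundary.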
Related
If I set the primary key to be INT type (AUTO_INCREMENT) or set it in UUID, what is the difference between these two in the database performance (SELECT, INSERT etc) and why?
UUID returns a universal unique identifier (hopefully also unique if imported to another DB as well).
To quote from MySQL doc (emphasis mine):
A UUID is designed as a number that is globally unique in space and
time. Two calls to UUID() are expected to generate two different
values, even if these calls are performed on two separate computers
that are not connected to each other.
On the other hand, a simple INT primary key (e.g. AUTO_INCREMENT) will return a unique integer for the specific DB and DB table, but it is not universally unique (so if imported to another DB, chances are there will be primary key conflicts).
In terms of performance, there shouldn't be any noticeable difference between auto-increment and UUID. Most posts (including some by the authors of this site) say as much. Of course a UUID may take a little more time (and space), but this is not a performance bottleneck in most (if not all) cases. Having the column as the primary key should make both choices roughly equal with respect to performance. See the references below:
To UUID or not to UUID?
Myths, GUID vs Autoincrement
Performance: UUID vs auto-increment in cakephp-mysql
UUID performance in MySQL?
Primary Keys: IDs versus GUIDs (coding horror)
(UUID vs auto-increment performance results, adapted from Myths, GUID vs Autoincrement)
UUID pros / cons (adapted from Primary Keys: IDs versus GUIDs)
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes.
Note
I would read carefully the mentioned references and decide whether to use UUID or not depending on my use case. That said, in many cases UUIDs would be indeed preferable. For example one can generate UUIDs without using/accessing the database at all, or even use UUIDs which have been pre-computed and/or stored somewhere else. Plus you can easily generalise/update your database schema and/or clustering scheme without having to worry about IDs breaking and causing conflicts.
In terms of possible collisions, for example using v4 UUIDS (random), the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.
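That figure can be reproduced with the usual birthday-problem approximation. The sketch below assumes that a version-4 UUID carries 122 random bits (128 minus the fixed version and variant bits); the function name is made up for illustration.

```python
import math

def collision_probability(count: int, random_bits: int = 122) -> float:
    """Approximate chance of at least one duplicate among `count` random values.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * space)).
    """
    space = 2 ** random_bits
    return -math.expm1(-count * (count - 1) / (2 * space))

# 103 trillion version-4 UUIDs, as in the sentence above.
p = collision_probability(103 * 10**12)
print(f"{p:.3e}")
```

The result is very close to one in a billion, which is where the quoted number comes from.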
A UUID cannot be treated as a primary key until it has been persisted in the DB, so a round trip still happens; until the transaction succeeds you cannot assume the value is a PK. Most UUIDs are time-based, MAC-based, name-based or random. Given that we are moving heavily towards container-based deployments, and containers tend to be assigned MAC addresses from a predictable starting sequence, relying on MAC addresses will not work. Time-based generation is no guarantee either, because it assumes that systems are always in exact time sync, which is sometimes not true. A GUID cannot guarantee that a collision will never occur, only that one is very unlikely within a given short period; given enough time, and enough systems running in parallel, that guarantee will eventually fail.
http://www.ietf.org/rfc/rfc4122.txt
For MySQL, which uses clustered primary key, version 4 randomly generated UUID will hurt insertion performance if used as the primary key. This is because it requires reordering the rows to place the newly inserted row at the right position inside the clustered index.
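A toy model of why this matters, assuming Python. A sorted list stands in for the clustered index (it is not how InnoDB stores pages), and the function simply counts how many inserts land at the end of the existing order versus somewhere in the middle.

```python
import bisect
import uuid

def count_append_inserts(keys) -> int:
    """Insert keys into a sorted list; count those that land at the very end."""
    ordered = []
    appended = 0
    for key in keys:
        position = bisect.bisect_left(ordered, key)
        if position == len(ordered):
            appended += 1  # new maximum: nothing has to move
        ordered.insert(position, key)  # otherwise later entries are shifted
    return appended

sequential_keys = list(range(1000))                     # like AUTO_INCREMENT
random_keys = [uuid.uuid4().bytes for _ in range(1000)]  # like random UUIDs

sequential_appends = count_append_inserts(sequential_keys)
random_appends = count_append_inserts(random_keys)
print(sequential_appends, random_appends)
```

Every sequential key is an append, while only a handful of the random keys are; the rest land in the middle of the existing order, which in a real clustered index means touching pages that are not the most recently written one.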
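FWIW, PostgreSQL stores rows in a heap instead of a clustered primary key, so using a random UUID as the primary key does not force table rows to be reordered on insert. (The primary-key index itself is still a B-tree, so random values are still inserted throughout that index.)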
For more information, this article has a more comprehensive comparison between UUID and Int: Choose Primary Key - UUID or Auto Increment Integer
I am trying to design an ecommerce web application in MySQL and I am having problems choosing the correct primary keys for the user table. The example given is just a sample for illustration.
The user table has the following definition:
CREATE TABLE IF NOT EXISTS `mydb`.`user` (
`id` INT NOT NULL ,
`username` VARCHAR(25) NOT NULL ,
`email` VARCHAR(25) NOT NULL ,
`external_customer_id` INT NOT NULL ,
`subscription_end_date` DATETIME NULL ,
`column_1` VARCHAR(45) NULL ,
`column_2` VARCHAR(45) NULL ,
`colum_3` VARCHAR(45) NULL ,
PRIMARY KEY (`id`) ,
UNIQUE INDEX `username_UNIQUE` (`username` ASC) ,
UNIQUE INDEX `email_UNIQUE` (`email` ASC) ,
UNIQUE INDEX `customer_id_UNIQUE` (`external_customer_id` ASC) )
ENGINE = InnoDB
I am facing following issues with the primary key candidate columns:
Id column
Pros
No business meaning (stable primary key)
faster table joins
compacter index
cons
not a "natural" key
All attribute table(s) must be joined with the "master" user table, thus non-joining direct queries are not possible
causes less "natural" SQL queries
Leaks information:
i) A user can figure out the number of registered users if the start value is 0 (changing the start value sorts this out)
ii) A user who registers a profile as user_A at time_X and, some time later, as user_B at time_Y can easily calculate the number of users registered per unit of time over that period: ((Id for user_B) - (Id for user_A)) / (time_Y - time_X)
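The second leak is simple arithmetic. A small sketch, assuming Python; the ids and dates are invented purely for illustration.

```python
from datetime import datetime

def signup_rate_per_day(id_a: int, time_a: datetime, id_b: int, time_b: datetime) -> float:
    """Estimate registrations per day from two auto-increment ids and their timestamps."""
    elapsed_days = (time_b - time_a).total_seconds() / 86400
    return (id_b - id_a) / elapsed_days

# Hypothetical observation: id 1000 on 1 January, id 1700 one week later.
rate = signup_rate_per_day(1000, datetime(2024, 1, 1), 1700, datetime(2024, 1, 8))
print(rate)
```

Changing the start value does not help here, because only the difference between the two ids is used.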
email column
Pros
None
Cons
a user should be able to change the email address. Not suitable for primary key
username column
Pros
a "natural" primary key
Less table joins
simpler and more "natural" queries
Cons
varchar column is slower when joining tables
an index on a varchar column is less compact than int column index
very difficult to change username since foreign keys are dependent on the value. Solution: "Syncing" all foreign keys in the application, or not allowing a user to change the username, e.g. a user should delete the profile and register a new one
external_customer column
pros
can be used as an external reference for a customer and holds no information (maybe non-editable username can be used instead? )
cons
might leak information if it is auto incremental (if possible)
problematic to generate a unique value if an auto incremental surrogate id is already in use, since the MySQL InnoDB engine does not support multiple auto_increment columns in the same table
What are the common practices when choosing user table primary keys for a scalable ecommerce web application? All feedback appreciated.
I don't have anything to say about some of your analysis. If I've cut some of your pros or cons, that only means I don't think I have anything useful to add.
Id column
Pros
No business meaning (stable primary key)
faster table joins
compacter index
First, any column or set of columns declared NOT NULL UNIQUE has all the properties of a primary key. You can use any of them as the target for a foreign key reference, which is what all this is really about.
In your case, your structure allows 4 columns to be targets of a foreign key reference: id, username, email, and external_customer_id. You don't have to use the same one all the time. It might make sense to use id for 90% of your FK references, and email for 10% of them.
Stability doesn't have anything to do with whether a column has business meaning. Stability has to do with how often, and under what circumstances, a value might change. "Stable" doesn't mean "immutable" unless you're running Oracle. (Oracle can't do ON UPDATE CASCADE.)
Depending on your table structure and indexing, a natural key might perform faster. Natural keys make some joins unnecessary. I did tests before I built our production database. It's probably going to be decades before we reach the point that joins on ID numbers will outperform fewer joins and natural keys. I've written about those tests either on SO or on DBA.
You have three other unique indexes. (Good for you. I think at least 90% of the people who build a database don't get that right.) So it's not just that an index on an ID number is more compact than either of those three; it's also an additional index. (In this table.)
email column
Pros
None
An email address can be considered stable and unique. You can't stop people from sharing email addresses, regardless of whether it's the target for a foreign key reference.
But email addresses can be "lost". In the USA, most university students lose their *.edu email addresses within a year or so of graduation. If your email address comes through a domain that you're paying for, and you stop paying, the email address goes away. I imagine it's possible for email addresses like those to be given to new users. Whether that creates an unbearable burden is application-dependent.
Cons
a user should be able to change the email address. Not suitable for primary key
All values in a SQL database can be changed. It's only unsuitable if your environment doesn't let your dbms honor an ON UPDATE CASCADE declaration in a timely manner. My environment does. (But I run PostgreSQL on decent, unshared hardware.) YMMV.
username column
Pros
a "natural" primary key
Less table joins
simpler and more "natural" queries
Fewer joins is an important point. I have been on consulting gigs where I've seen the mindless use of ID numbers make people write queries with 40+ joins. Judicious use of natural keys eliminated up to 75% of them.
It's not important to always use surrogate keys as the target for your foreign keys (unless Oracle) or to always use natural keys as the target. It's important to think.
Cons
varchar column is slower when joining tables
an index on a varchar column is less compact than int column index
You can't really say that joining on a varchar() is slower without qualifying that claim. The fact is that, although most joins on varchar() are slower than joins on id numbers, they're not necessarily so slow that you can't use them. If a query takes 4ms with id numbers, and 6ms with varchar(), I don't think that's a good reason to disqualify the varchar(). Also, using a natural key will eliminate a lot of joins, so overall system response might be faster. (Other things being equal, 40 4ms joins will underperform 10 6ms joins.)
I can't recall any case in my database career (25+ years) where the width of an index was the deciding factor in choosing the target for a foreign key.
external_customer column
pros
can be used as an external reference for a customer and holds no information (maybe non-editable username can be used instead? )
There are actually few systems that let me change my username. Most will let me change my real name (I think), but not my username. I think an uneditable username is completely reasonable.
In general, web applications try to keep their database schema away from the customer - including primary keys. I think you're conflating your schema design with authentication methods - there's nothing stopping you from allowing users to log in with their email address, even if your database design uses an integer to uniquely identify them.
Whenever I've designed systems like this, I've used an ID column - either integer or GUID for the primary key. It's fast, doesn't change due to pesky real life situations, and is a familiar idiom to developers.
I've then worked out the best authentication scheme for the app in hand - most people expect to login with their email address these days, so I'd stick with that. Of course, you could also let them login with their Facebook, Twitter, or Google accounts. Has nothing to do with my primary key, though...
I think that with the username column you also have this con:
A user should be able to change the username. Not suitable for primary key.
So for the same reason that you won't use the email I won't use the username. For me the internal user integer id is the best approach.
I have recently started a new job and noticed that all the SQL tables use the GUID data type for the primary key.
In my previous job we used integers (Auto-Increment) for the primary key and it was a lot easier to work with, in my opinion.
For example, say you had two related tables; Product and ProductType - I could easily cross check the 'ProductTypeID' column of both tables for a particular row to quickly map the data in my head, because it's easy to remember a number (2, 4, 45 etc) as opposed to (E75B92A3-3299-4407-A913-C5CA196B3CAB).
The extra frustration comes from me wanting to understand how the tables are related, sadly there is no Database diagram :(
A lot of people say that GUIDs are better because you can define the unique identifier in your C# code, for example using Guid.NewGuid(), without requiring SQL Server to do it (with NEWID()) - this also allows you to know provisionally what the ID will be.... but I've seen that it is possible to still retrieve the 'next auto-incremented integer' too.
A DBA contractor reported that our queries could be up to 30% faster if we used the Integer type instead of GUIDS...
Why does the GUID data type exist, what advantages does it really provide?... Even if its a choice by some professional there must be some good reasons as to why its implemented?
GUIDs are good as identity fields in certain cases:
When you have multiple instances of SQL (different servers) and you need to combine the different updates later on without affecting referential integrity
Disconnected clients that create data - this way they can create data without worrying that the ID field is already taken
GUIDs are generated to be globally unique, which is why they are suited for such scenarios.
Contrary to what most folks here seem to preach, I see GUID's as more of a plague than a blessing. Here's why:
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so. Plus, you can only use it as a default for a column in your table - you cannot get a new sequential GUID in T-SQL code (like a trigger or something) - another major drawback.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: about 27 MB vs. 107 MB - and that's just on a single table!
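The arithmetic behind that quick calculation can be sketched as follows, assuming Python. It counts only the key bytes (4 for INT, 16 for a GUID) per row and per nonclustered index entry, and ignores all other row and page overhead; the function name is made up for illustration.

```python
MIB = 1024 * 1024

def key_storage_mib(rows: int, key_bytes: int, nonclustered_indexes: int):
    """Return (base table, nonclustered indexes, total) key storage in MiB."""
    base = rows * key_bytes / MIB
    # The clustering key is repeated in every entry of every nonclustered index.
    indexes = base * nonclustered_indexes
    return base, indexes, base + indexes

int_base, int_indexes, int_total = key_storage_mib(1_000_000, 4, 6)
guid_base, guid_indexes, guid_total = key_storage_mib(1_000_000, 16, 6)

print(f"INT:  {int_base:.2f} + {int_indexes:.2f} = {int_total:.2f} MiB")
print(f"GUID: {guid_base:.2f} + {guid_indexes:.2f} = {guid_total:.2f} MiB")
```

The factor of four between the two totals is just the ratio of the key widths, multiplied through every index that carries the clustering key.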
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
INT
Advantage:
Numeric values (and specifically integers) are better for performance when used in joins, indexes and conditions.
Numeric values are easier to understand for application users if they are displayed.
Disadvantage:
If your table is large, it is quite possible that it will run out of values: beyond the maximum of the type there is no further identity value to use.
GUID
Advantage:
Unique across the server.
Disadvantage:
String values are not as optimal as integer values for performance when used in joins, indexes and conditions.
More storage space is required than INT.
credit goes to : http://blog.sqlauthority.com/2010/04/28/sql-server-guid-vs-int-your-opinion/
There are a ton of Google-able articles on using GUIDs as PKs and almost all of them say the same thing your DBA contractor says -- queries are faster without GUIDs as keys.
The primary use I've seen in practice (we've never used them as PKs) is with replication. The MSDN page for uniqueidentifier says about the same.
It is globally unique, so that each record in your table has a GUID that is shared by no other item of any kind in the world. Handy if you need this kind of exclusive identification (if you are replicating the database, or combining data from multiple sources). Otherwise, your dba is correct - GUIDs are much larger and less efficient than integers, and you could speed up your db (30%? maybe...)
They basically save you from the sometimes more complicated logic of using
SET @InsertID = SCOPE_IDENTITY()
I am not very familiar with databases and the theories behind how they work. Is it any slower from a performance standpoint (inserting/updating/querying) to use Strings for Primary Keys than integers?
For example, I have a database that would have about 100 million rows, with columns like mobile number, name and email. Mobile number and email would be unique, so can I have the mobile number or email as the primary key?
Will it affect my query performance when I search based on email or mobile number? Similarly, the primary key will be used as a foreign key in 5 to 6 tables or even more.
I am using MySQL database
Technically yes, but if a string makes sense to be the primary key then you should probably use it. This all depends on the size of the table you're making it for and the length of the string that is going to be the primary key (longer strings == harder to compare). I wouldn't necessarily use a string for a table that has millions of rows, but the amount of performance slowdown you'll get by using a string on smaller tables will be minuscule compared to the headaches that you can have by having an integer that doesn't mean anything in relation to the data.
Another issue with using Strings as a primary key is that because the index is constantly put into sequential order, when a new key is created that would be in the middle of the order the index has to be resequenced... if you use an auto number integer, the new key is just added to the end of the index.
Inserts to a table having a clustered index where the insertion occurs in the middle of the sequence DOES NOT cause the index to be rewritten. It does not cause the pages comprising the data to be rewritten. If there is room on the page where the row will go, then it is placed in that page. The single page will be reformatted to place the row in the right place in the page. When the page is full, a page split will happen, with half of the rows on the page going to one page, and half going on the other. The pages are then relinked into the linked list of pages that comprise the data of the table that has the clustered index. At most, you will end up writing 2 pages of database.
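The page-split behaviour described above can be modelled in a few lines. This is a toy sketch in Python, not how any real engine lays out pages: PAGE_CAPACITY and insert_key are invented for illustration, and the list of pages is assumed to start with at least one non-empty page.

```python
import bisect

PAGE_CAPACITY = 4  # hypothetical number of rows that fit on one page

def insert_key(pages, key) -> int:
    """Insert key into a list of sorted pages; return how many pages were written."""
    # Pick the last page whose first key is <= key (or the first page).
    first_keys = [page[0] for page in pages]
    index = max(bisect.bisect_right(first_keys, key) - 1, 0)
    page = pages[index]
    bisect.insort(page, key)  # place the row in order within that single page
    if len(page) <= PAGE_CAPACITY:
        return 1  # there was room: only this page is rewritten
    # Page overflow: split it into two halves and relink them in place.
    half = len(page) // 2
    pages[index:index + 1] = [page[:half], page[half:]]
    return 2

pages = [[10, 20, 30, 40], [50, 60, 70, 80]]
writes_on_split = insert_key(pages, 25)    # first page is full, so it splits
writes_with_room = insert_key(pages, 27)   # the new half-empty page has room
print(writes_on_split, writes_with_room, pages)
```

In this model an insert in the middle never touches more than two pages, whatever the size of the table.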
Strings are slower in joins and in real life they are very rarely really unique (even when they are supposed to be). The only advantage is that they can reduce the number of joins if you are joining to the primary table only to get the name. However, strings are also often subject to change thus creating the problem of having to fix all related records when the company name changes or the person gets married. This can be a huge performance hit and if all tables that should be related somehow are not related (this happens more often than you think), then you might have data mismatches as well. An integer that will never change through the life of the record is a far safer choice from a data integrity standpoint as well as from a performance standpoint. Natural keys are usually not so good for maintenance of the data.
I also want to point out that the best of both worlds is often to use an autoincrementing key (or in some specialized cases, a GUID) as the PK and then put a unique index on the natural key. You get the faster joins, you don't get duplicate records, and you don't have to update a million child records because a company name changed.
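A small sketch of that combination, assuming Python with an in-memory SQLite database as a stand-in for the real DBMS. The company and invoice tables and the helper try_insert_company are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
  id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key used by child tables
  name TEXT NOT NULL UNIQUE                -- natural key, still enforced as unique
);
CREATE TABLE invoice (
  id         INTEGER PRIMARY KEY AUTOINCREMENT,
  company_id INTEGER NOT NULL REFERENCES company(id)
);
""")
conn.execute("INSERT INTO company (name) VALUES ('Acme')")
conn.execute("INSERT INTO invoice (company_id) VALUES (1)")

def try_insert_company(name: str) -> bool:
    """Return True if the row was inserted, False if the unique index rejected it."""
    try:
        conn.execute("INSERT INTO company (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert_company("Acme")

# Renaming the company touches one row; no child row has to change.
conn.execute("UPDATE company SET name = 'Acme Corp' WHERE id = 1")
joined_name = conn.execute(
    "SELECT c.name FROM invoice i JOIN company c ON c.id = i.company_id"
).fetchone()[0]
print(duplicate_accepted, joined_name)
```

The unique index keeps duplicates out, and the rename is a single-row update because the child table references the integer id rather than the name.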
Too many variables. It depends on the size of the table, the indexes, nature of the string key domain...
Generally, integers will be faster. But will the difference be large enough to care? It's hard to say.
Also, what is your motivation for choosing strings? Numeric auto-increment keys are often so much easier as well. Is it semantics? Convenience? Replication/disconnected concerns? Your answer here could limit your options. This also brings to mind a third "hybrid" option you're forgetting: Guids.
It doesn't matter what you use as a primary key so long as it is UNIQUE. If you care about speed or good database design use the int unless you plan on replicating data, then use a GUID.
If this is an access database or some tiny app then who really cares. I think the reason why most of us developers slap the old int or guid at the front is because projects have a way of growing on us, and you want to leave yourself the option to grow.
Don't worry about performance until you have got a simple and sound design that agrees with the subject matter that the data describes and fits well with the intended use of the data. Then, if performance problems emerge, you can deal with them by tweaking the system.
In this case, it's almost always better to go with a string as a natural primary key, provided you can trust it. Don't worry if it's a string, as long as the string is reasonably short, say about 25 characters max. You won't pay a big price in terms of performance.
Do the data entry people or automatic data sources always provide a value for the supposed natural key, or is it sometimes omitted? Is it occasionally wrong in the input data? If so, how are errors detected and corrected?
Are the programmers and interactive users who specify queries able to use the natural key to get what they want?
If you can't trust the natural key, invent a surrogate. If you invent a surrogate, you might as well invent an integer. Then you have to worry about whether to conceal the surrogate from the user community. Some developers who didn't conceal the surrogate key came to regret it.
Indices imply lots of comparisons.
Typically, strings are longer than integers and collation rules may be applied for comparison, so comparing strings is usually more computationally intensive task than comparing integers.
Sometimes, though, it's faster to use a string as a primary key than to make an extra join with a string to numerical id table.
Two reasons to use integers for PK columns:
We can set identity for integer field which incremented automatically.
When we create PKs, the db creates an index (Cluster or Non Cluster) which sorts the data before it's stored in the table. By using an identity on a PK, the optimizer need not check the sort order before saving a record. This improves performance on big tables.
Yes, but unless you expect to have millions of rows, not using a string-based key because it's slower is usually "premature optimization." After all, strings are stored as big numbers while numeric keys are usually stored as smaller numbers.
One thing to watch out for, though, is if you have clustered indices on any key and are doing large numbers of inserts that are non-sequential in the index. Every line written will cause the index to re-write. If you're doing batch inserts, this can really slow the process down.
What is your reason for having a string as a primary key?
I would just set the primary key to an auto incrementing integer field, and put an index on the string field.
That way if you do searches on the table they should be relatively fast, and all of your joins and normal look ups will be unaffected in their speed.
You can also control the amount of the string field that gets indexed. In other words, you can say "only index the first 5 characters" if you think that will be enough. Or if your data can be relatively similar, you can index the whole field.
From performance standpoint - Yes string(PK) will slow down the performance when compared to performance achieved using an integer(PK), where PK ---> Primary Key.
From requirement standpoint - Although this is not a part of your question still I would like to mention. When we are handling huge data across different tables we generally look for the probable set of keys that can be set for a particular table. This is primarily because there are many tables and mostly each or some table would be related to the other through some relation ( a concept of Foreign Key ). Therefore we really cannot always choose an integer as a Primary Key, rather we go for a combination of 3, 4 or 5 attributes as the primary key for that tables. And those keys can be used as a foreign key when we would relate the records with some other table. This makes it useful to relate the records across different tables when required.
Therefore for Optimal Usage - We always make a combination of 1 or 2 integers with 1 or 2 string attributes, but again only if it is required.
I would probably use an integer as your primary key, and then just have your string (I assume it's some sort of ID) as a separate column.
create table sample (
sample_pk INT NOT NULL AUTO_INCREMENT,
sample_id VARCHAR(100) NOT NULL,
...
PRIMARY KEY(sample_pk)
);
You can always do queries and joins conditionally on the string (ID) column (where sample_id = ...).
There may be a big misunderstanding about strings in databases. Almost everyone thinks that the database representation of numbers is more compact than that of strings, and that numbers in a database are represented as they are in memory. BUT that is not always true. In some cases the number representation is closer to a string-like representation than to anything else.
The speed of using a number or a string depends more on the indexing than on the type itself.
By default ASPNetUserIds are 128 char strings and performance is just fine.
If the key HAS to be unique in the table it should be the Key. Here's why;
primary string key = Correct DB relationships, 1 string key(The primary), and 1 string Index(The Primary).
The other option is a typical int Key, but if the string HAS to be unique you'll still probably need to add an index because of non-stop queries to validate or check that its unique.
So using an int identity key = Incorrect DB Relationships, 1 int key(Primary), 1 int index(Primary), Probably a unique string Index, and manually having to validate the same string doesn't exist(something like a sql check maybe).
To get better performance using an int over a string for the primary key, when the string HAS to be unique, it would have to be a very odd situation. I've always preferred to use string keys. And as a good rule of thumb, don't denormalize a database until you NEED to.
I am in the process of designing a database for high volume data and I was wondering what datatype to use for the primary keys?
There will be table partitioning and the database will ultimately be clustered and will be hot failover to alternative datacentres.
EDIT
Tables - think chat system for multiple time periods and multiple things to chat about with multiple users chatting about the time period and thing.
Exponential issues are what I am thinking about - i.e. something could generate billions of rows in a small time period, i.e. before we could change the database or a DBA could do DBA things
Mark - I share your concern about GUIDs - I don't like coding with GUIDs flying about.
With just the little bit of info you've provided, I would recommend using a BigInt, which would take you up to 9,223,372,036,854,775,807, a number you're not likely to ever exceed. (Don't start with an INT and think you can easily change it to a BigInt when you exceed 2 billion rows. It's possible (I've done it), but can take an extremely long time, and involve significant system disruption.)
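To put those limits in perspective, here is a rough sketch in Python of how long a signed INT and a signed BIGINT would last at a constant insert rate. The rates are invented for illustration and the helper name is hypothetical.

```python
INT_MAX = 2**31 - 1     # largest signed 4-byte integer
BIGINT_MAX = 2**63 - 1  # largest signed 8-byte integer

SECONDS_PER_YEAR = 365 * 24 * 3600

def years_until_exhausted(max_value: int, inserts_per_second: float) -> float:
    """Years before an ever-increasing key reaches max_value at a constant rate."""
    return max_value / inserts_per_second / SECONDS_PER_YEAR

int_years = years_until_exhausted(INT_MAX, 1_000)            # 1,000 rows per second
bigint_years = years_until_exhausted(BIGINT_MAX, 1_000_000)  # 1,000,000 rows per second
print(f"INT: {int_years:.3f} years, BIGINT: {bigint_years:,.0f} years")
```

At a thousand rows per second an INT is gone in under a month, while a BIGINT outlasts any realistic system even at a million rows per second.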
Kimberly Tripp has an Excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes, and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
Unique (otherwise useless as a key)
Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
Static (you don't want to have to change all related records)
Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
If you use a BigInt as an increasing identity as your key and your clustered index, that should satisfy all four of these requirements.
Edit: Kimberly's article I mentioned above (GUIDs as PRIMARY KEYs and/or the clustering key) talks about why a (client generated) GUID is a bad choice for a clustering key:
But, a GUID that is not sequential - like one that has it's values generated in the client (using .NET) OR generated by the newid() function (in SQL Server) can be a horribly bad choice - primarily because of the fragmentation that it creates in the base table but also because of its size. It's unnecessarily wide (it's 4 times wider than an int-based identity - which can give you 2 billion (really, 4 billion) unique rows). And, if you need more than 2 billion you can always go with a bigint (8-byte int) and get 2^63-1 rows.
SQL Server has a function called NEWSEQUENTIALID() that allows you to generate sequential GUIDs that avoid the fragmentation issue, but they still have the problem of being unnecessarily wide.
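For comparison, a client can also build a 16-byte identifier that sorts roughly by creation time. The sketch below, in Python, is an illustrative layout only (a 6-byte millisecond timestamp followed by 10 random bytes); it is not what NEWSEQUENTIALID() produces, and time_ordered_id is a made-up name.

```python
import os
import time

def time_ordered_id(now_ms=None) -> bytes:
    """Return 16 bytes: a big-endian millisecond timestamp, then random bytes."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    # The timestamp comes first so that byte-wise comparison follows time order.
    return now_ms.to_bytes(6, "big") + os.urandom(10)

earlier = time_ordered_id(1_000)
later = time_ordered_id(2_000)
print(earlier.hex(), later.hex())
```

The ordering only holds for plain byte-wise comparison, such as a BINARY(16) column; a database that sorts its GUID type by a different byte order would need a different layout. The value is also still 16 bytes wide, so the size objection remains.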
You can always go for int but taking into account your partitioning/clustering I'd suggest you look into uniqueidentifier which will generate globally unique keys.
int tends to be the norm unless you need massive volume of data, and has the advantage of working with IDENTITY etc; Guid has some advantages if you want the numbers to be un-guessable or exportable, but if you use a Guid (unless you generate it yourself as "combed") you should ensure it is non-clustered (the index, that is; not the farm), as it won't be incremental.
I think that int will be very good for it.
The range of INTEGER is -2,147,483,648 to 2,147,483,647.
You can also use UniqueIdentifier (GUID), but in that case consider:
table row size limit in MSSQL
storage + memory. Imagine you have tables with 10000000 rows and growing
flexibility: there are T-SQL operators available for INT like >, <, =, etc...
GUID is not optimized for ORDER BY/GROUP BY queries and for range queries in general