EF 4.0 Guid or Int as A primary Key - sql-server-2008

I am Implementing custom ASPNetMembership using EF 4.0
Is there any reason why i should use Guid as a primary key in User tables?
As far as i know Int as a PK on SQL Server more performanced than strings.
And Int is easier to iterate.
Also, for security purpose if i need to pass any int id somewhere e.g in url i may encrypt it somehow and pass it like a string with no probs.
But if i want to use auto generated Guid on SQL Server side using EF 4.0 i need to do this trick http://leedumond.com/blog/using-a-guid-as-an-entitykey-in-entity-framework-4/
I can't see any cases why i should use Guid as PK, may be only one if system going to have millions ans millions users, but also, theoretically, Guid could be duplicated sometime isn't so?
Anyway Int32 size is 2,147.483.647 it is pretty much even for very-very big system, but if this number is still not enough I may go with Int64, in that cases I may have 9,223.372.036.854.775.807 rows. Pretty much huh?
From another hand, M$ using Guids as PK in their ASPNetMembership implementation. [aspnetdb].[aspnet_Users] -> PK UserId Type uniqueidentifier,
should be some reasons/explanation why the did it?!
May be some one has any ideas/experience about that?

I would agree 100% with you - using an INT IDENTITY is much better!
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
1) the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
2) the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Disk space is cheap - that's not the point!

Go with the INT PK. See Kimberly L. Tripp's article: GUIDs as PRIMARY KEYs and/or the clustering key

Until Entity Framework introduces any concept of batching there is no reason for not using INT IDENTITY. Guid is only useful when you want to set Id of the new record on the client.

Related

API REST & MySQL : using both auto increment ID and uuid

I am creating a REST API with a MySQL database. I would like to know if using auto-incrementing IDs as primary keys, (to keep good performance) and unique uuid fields (used as API ID) is a bad idea? If so why?
(from Comment) The purpose of the UUID is to provide an opaque id in the API, while using a simpler, more efficient, BIGINT for internal purposes.
UUIDs have these benefits:
They can be created independently by multiple clients, while being unique.
They obfuscate the ids. (Example: avoid hackers discovering valid ids.)
IDs have these benefits:
Smaller disk space (and cache) needed, hence somewhat faster.
Temporarily oriented ("recent" inserts are clustered "together"). This is a performance benefit for very large tables (or small RAMs).
"Natural" Primary keys (a column or combination of columns that is intrinsically unique):
may be smaller
may be faster
more logical.
Example: In the case of a many-to-many mapping table (just 2 ids pointing to two other tables), PRIMARY KEY(a_id, b_id), INDEX(b_id, a_id) is clearly faster and smaller.
UUIDs are 36 or 16 bytes; ids are 8 bytes or 4 or smaller. A natural key may take 0 extra bytes (or may not).
To answer your question: "It depends".
The tables I build have PKs:
Natural - 2/3 of the tables
Auto_inc - 1/3
UUID - essentially none.
(PS: I find REST to be clumsy and provide no real benefits, so I avoid it.)
Based on Comment
Probably you what:
An auto_inc id everywhere in the database;
A UUID for opaquely sending to the user. This avoids various hacking games that might be played with an auto_inc.
So, in the the main table,
CREATE TABLE main (
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
uuid BINARY(16) NOT NULL,
....
PRIMARY KEY (id),
UNIQUE(uuid),
...
) ENGINE=InnoDB
When creating a new row, compute a new UUID, strip the dashes and convert FROM_HEX().
When sending a message to the user, include uuid, not id.
When receiving a reply message, quickly switch to using id by looking it up via that available index. Perhaps this way:
SELECT id FROM main WHERE uuid = ?

The differences between INT and UUID in MySQL

If I set the primary key to be INT type (AUTO_INCREMENT) or set it in UUID, what is the difference between these two in the database performance (SELECT, INSERT etc) and why?
UUID returns a universal unique identifier (hopefuly also unique if imported to another DB as well).
To quote from MySQL doc (emphasis mine):
A UUID is designed as a number that is globally unique in space and
time. Two calls to UUID() are expected to generate two different
values, even if these calls are performed on two separate computers
that are not connected to each other.
On the other hand a simply INT primary id key (e.g. AUTO_INCREMENT) will return a unique integer for the specific DB and DB table, but which is not universally unique (so if imported to another DB chances are there will be primary key conflicts).
In terms of performance, there shouldn't be any noticeable difference using auto-increment over UUID. Most posts (including some by the authors of this site), state as such. Of course UUID may take a little more time (and space), but this is not a performance bottleneck for most (if not all) cases. Having a column as Primary Key should make both choices equal wrt to performance. See references below:
To UUID or not to UUID?
Myths, GUID vs Autoincrement
Performance: UUID vs auto-increment in cakephp-mysql
UUID performance in MySQL?
Primary Keys: IDs versus GUIDs (coding horror)
(UUID vs auto-increment performance results, adapted from Myths, GUID vs Autoincrement)
UUID pros / cons (adapted from Primary Keys: IDs versus GUIDs)
GUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if
you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of
clustered indexes.
Note
I would read carefully the mentioned references and decide whether to use UUID or not depending on my use case. That said, in many cases UUIDs would be indeed preferable. For example one can generate UUIDs without using/accessing the database at all, or even use UUIDs which have been pre-computed and/or stored somewhere else. Plus you can easily generalise/update your database schema and/or clustering scheme without having to worry about IDs breaking and causing conflicts.
In terms of possible collisions, for example using v4 UUIDS (random), the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.
A UUID key cannot be pk until unless persisted in DB so round tripping will happen until then you cannot assume its pk without a successful transaction. Most of the UUID use time based, mac based, name based or some random uuid. Given we are moving heavily towards container based deployments and they have a pattern for starting sequence MAC addresses relying on mac addresses will not work. Time based is not going to guarantee as the assumption is systems are always in exact time sync which is not true sometimes as clocks will not follow the rules. GUID cannot guarantee that collision will never occur just that in given short period of time it will not occur but given enough time and systems running in parallel and proliferations of systems that guarantee will eventually fail.
http://www.ietf.org/rfc/rfc4122.txt
For MySQL, which uses clustered primary key, version 4 randomly generated UUID will hurt insertion performance if used as the primary key. This is because it requires reordering the rows to place the newly inserted row at the right position inside the clustered index.
FWIW, PostgreSQL uses heap instead of clustered primary key, thus using UUID as the primary key won't impact PostgreSQL's insertion performance.
For more information, this article has a more comprehensive comparison between UUID and Int: Choose Primary Key - UUID or Auto Increment Integer

MySQL relational tables and Primary Keys data types

I'm working on building a MySQL database that will have many related tables. I know I can create a primary key that is unique and auto increments, then reference that Primary Key as a Foreign Key in the related table. My question is what are best practices for Primary/Foreign key data types (int, varchar, etc..). I want to make sure I get this right the first time.
Thank you
In MySQL most commonly used datatype is INT UNSIGNED. It's 4 bytes long and can store numbers up to 4 billion which is more than enough for majority of applications. Using BIGINT (8B - 18446744073709551615 range) is usually an overkill when just starting with your application. Before your database gets too large for INT primary key, you'll need to hire a professional DBA anyway to help you with other issues.
CHAR or BINARY datatypes can be used, when GUID type Primary key is used - such keys make it easier to shard data across multiple database instances, however with MySQL a multi master replication is more often seen.
When using InnoDB tables it's good to remember, that PK will be implicitly the last column in all other indexes created on specific table. This makes it important to use reasonably short datatype for PK (again 4B INT usually does its job well).

SQL GUID Vs Integer

I have recently started a new job and noticed that all the SQL tables use the GUID data type for the primary key.
In my previous job we used integers (Auto-Increment) for the primary key and it was a lot more easier to work with in my opinion.
For example, say you had two related tables; Product and ProductType - I could easily cross check the 'ProductTypeID' column of both tables for a particular row to quickly map the data in my head because its easy to store the number (2,4,45 etc) as opposed to (E75B92A3-3299-4407-A913-C5CA196B3CAB).
The extra frustration comes from me wanting to understand how the tables are related, sadly there is no Database diagram :(
A lot of people say that GUID's are better because you can define the unique identifer in your C# code for example using NewID() without requiring SQL SERVER to do it - this also allows you to know provisionally what the ID will be.... but I've seen that it is possible to still retrieve the 'next auto-incremented integer' too.
A DBA contractor reported that our queries could be up to 30% faster if we used the Integer type instead of GUIDS...
Why does the GUID data type exist, what advantages does it really provide?... Even if its a choice by some professional there must be some good reasons as to why its implemented?
GUIDs are good as identity fields in certain cases:
When you have multiple instances of SQL (different servers) and you need to combine the different updates later on without affecting referential integrity
Disconnected clients that create data - this way they can create data without worrying that the ID field is already taken
GUIDs are generated to be globally unique, which is why they are suited for such scenarios.
Contrary to what most folks here seem to preach, I see GUID's as more of a plague than a blessing. Here's why:
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so. Plus, you can only use it as a default for a column in your table - you cannot get a new sequential GUID in T-SQL code (like a trigger or something) - another major drawback.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
INT
Advantage:
Numeric values (and specifically integers) are better for performance when used in joins, indexes and conditions.
Numeric values are easier to understand for application users if they are displayed.
Disadvantage:
If your table is large, it is quite possible it will run out of it and after some numeric value there will be no additional identity to use.
GUID
Advantage:
Unique across the server.
Disadvantage:
String values are not as optimal as integer values for performance when used in joins, indexes and conditions.
More storage space is required than INT.
credit goes to : http://blog.sqlauthority.com/2010/04/28/sql-server-guid-vs-int-your-opinion/
There are a ton of Google-able articles on using GUIDs as PKs and almost all of them say the same thing your DBA contractor says -- queries are faster without GUIDs as keys.
The primary use I've seen in practice (we've never used them as PKs) is with replication. The MSDN page for uniqueidentifier says about the same.
It is globally unique, so that each record in your table has a GUID that is shared by no other item of any kind in the world. Handy if you need this kind of exclusive identification (if you are replicating the database, or combining data from multiple source). Otherwise, your dba is correct - GUIDs are much larger and less efficient that integers, and you could speed up your db (30%? maybe...)
They basically save you from more sometimes complicated logic of using
set #InsertID = scope_identity()

Key DataType for a high volume SQL Server 2008?

I in the process of designing a database for high volume data and I was wondering what datatype to use for the primary keys?
There will be table partitioning and the database will ultimatley be clustered and will be hot failover to alternative datacentres.
EDIT
Tables - think chat system for multiple time periods and multiple things to chat about with multiple users chatting about the time period and thing.
Exponential issues are what I am thinking about - ie something could generate billions of rows in small time period. ie before we could change the database or DBA doing DBA things
Mark - I share your concearn of GUID - I dont like coding with GUIDs flying about.
With just the little bit of info you've provided, I would recommend using a BigInt, which would take you up to 9,223,372,036,854,775,807, a number you're not likely to ever exceed. (Don't start with an INT and think you can easily change it to a BigInt when you exceed 2 billion rows. Its possible (I've done it), but can take an extremely long time, and involve significant system disruption.)
Kimberly Tripp has an Excellent series of blog articles (GUIDs as PRIMARY KEYs and/or the clustering key and The Clustered Index Debate Continues) on the issue of creating clustered indexes, and choosing the primary key (related issues, but not always exactly the same). Her recommendation is that a clustered index/primary key should be:
Unique (otherwise useless as a key)
Narrow (the key is used in all non-clustered indexes, and in foreign-key relationships)
Static (you don't want to have to change all related records)
Always Increasing (so new records always get added to the end of the table, and don't have to be inserted in the middle)
If you use a BigInt as an increasing identity as your key and your clustered index, that should satisfy all four of these requirements.
Edit: Kimberly's article I mentioned above (GUIDs as PRIMARY KEYs and/or the clustering key) talks about why a (client generated) GUID is a bad choice for a clustering key:
But, a GUID that is not sequential -
like one that has it's values
generated in the client (using .NET)
OR generated by the newid() function
(in SQL Server) can be a horribly bad
choice - primarily because of the
fragmentation that it creates in the
base table but also because of its
size. It's unnecessarily wide (it's 4
times wider than an int-based identity
- which can give you 2 billion (really, 4 billion) unique rows). And,
if you need more than 2 billion you
can always go with a bigint (8-byte
int) and get 263-1 rows.
SQL has a function called NEWSEQUENTIALID() that allows you to generate sequential GUIDs that avoid the fragmentation issue, but they still have the problem of being unnecessarily wide.
You can always go for int but taking into account your partitioning/clustering I'd suggest you look into uniqueidentifier which will generate globally unique keys.
int tends to be the norm unless you need massive volume of data, and has the advantage of working with IDENTITY etc; Guid has some advantages if you want the numbers to be un-guessable or exportable, but if you use a Guid (unless you generate it yourself as "combed") you should ensure it is non-clustered (the index, that is; not the farm), as it won't be incremental.
I thik that int will be very good for it.
The range of INTEGER is - 2147483648 to 2147483647.
also you can use UniqueIdentifier (GUID), but in this case
table row size limit in MSSQL
storage + memory. Imagine you have tables with 10000000 rows and growing
flexibility: there are T-SQL operators available for INT like >, <, =, etc...
GUID is not optimized for ORDER BY/GROUP BY queries and for range queries in general