Been looking into using multi-column primary keys and as performance is extremely important with the size of traffic and database I need to know if there is anything to consider before I start throwing out the unique ID method on many of my tables and start using mulit column primary keys.
So, what are the performance/optimisation pros/cons to using multi column primary keys versus a basic single column, auto-inc primary key?
Is there a particular reason that you need/want to use multi-column keys instead of an (I assume) already created single-column key?
One of the problems with Natural Keys is dealing with cascading an update to the key value across all the foreign keys. A surrogate key such as an auto-increment column avoids this.
In terms of performance, depending on the row count, the data types of the columns, your storage engine, and the amount of RAM you have dedicated to MySQL, multi-column keys can affect performance due to the sheer size of the index.
In my experience, it is almost always easier in terms of development and maintenance to use a surrogate key as the PK and then create indexes that cover your queries across the natural keys. However, the only way to determine the true performance impact for your application is to benchmark it with realistic a realistic load and dataset.
HTH -
Chris
I wouldn't think that there would be any performance problems with multiple primary keys. It's more or less equivalent to having multiple indexes (you will spend a little bit more time computing index values when doing inserts).
Sometimes the data model makes more sense with multiple keys. I'd worry about being straightforward first and worry about performance second. You can always add more indexes, improve your queries, or twiddle server settings.
I think the most I've encountered was a 4-column primary key. Makes me cringe a little bit, but it worked¹.
[1] "worked" is defined to mean "the application performed to specification", and is not meant to imply that actual tasks were accomplished using said application. :)
Related
Even if I don't have a primary key or unique key, InnoDB still creates a cluster index on a synthetic column as described below.
https://dev.mysql.com/doc/refman/5.5/en/innodb-index-types.html
So, why does InnoDB have to require clustered index? Is there a defenite reason clustered index must exist here?
In Oracle Database or MSSQL I don't see they require this.
Also, I don't think cluster index have so tremendous advantage comparing to ordinary table either.
It is true that looking for data using clustering key does not need an additional disk read and faster than when I don't have one but without cluster index, secondary index can look up faster by using physical rowID.
Therefore, I don't see any reason for insisting using it.
Other vendors have a "ROWNUM" or something like that. InnoDB is much simpler. Instead of having that animal, it simply requires something that you will usually want anyway. In both cases, it is a value that uniquely identifies a row. This is needed for guts of transactions -- knowing which row(s) to lock, etc, to provide transactional integrity. (I won't go into the rationale here.)
In requiring (or providing) a PK, and in doing certain other simplifications, InnoDB sacrifices several little-used (or easily worked around) features: Multiple pks, multiple clustered indexes, no pk, etc.
Since the "synthetic column" takes 6 bytes, it is almost always better to simply provide id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, even if you don't use it. But if you don't use it, but do have a non-NULL UNIQUE key, then you may as well make it the PK. (As MySQL does by default.)
A lookup by a secondary key first gets the PK value from the secondary key's BTree. Then the main BTree (with the data ordered by the PK) is drilled down to find the row. Hence, secondary keys can be slower that use of the PK. (Usually this is not enough slower to matter.) So, this points out one design decision that required a PK.) (Other vendors use ROWNUM, or something, to locate the record, instead of the PK.)
Back to "Why?". There are many decisions in MySQL where the designers said "simplicity is better for this free product, let's not bother building some complex, but little-used feature. At first there were no subqueries (temp tables were a workaround). No Views (they are only syntactic sugar). No Materialized Views (OK, this may be a failing; but they can be simulated). No bit-mapped or hash or isam (etc) indexing (BTree is very good for "all-around" usage).
Also, by always "clustering" the PK with the data, lookups via the PK are inherently faster than the competition (no going through a ROWNUM). (Secondary key lookups may not be faster.)
Another difference -- MySQL was very late in implementing "index merge", wherein it uses two indexes, then ANDs or ORs the results. This can be efficient with ROWNUMs, but not with clustered PKs.
(I'm not a MySQL/MariaDB/Percona developer, but I have used them since 1999, and have been to virtually all major MySQL Conferences, where inside info is often divulged. So, I think I have enough insight into their thinking to present this answer.)
What would be the performance penalty of using strings as primary keys instead of bigints etc.? String comparison is much more expensive than integer comparison, but on the other hand I can imagine that internally a DBMS will compute hash keys to reduce the penalty.
An application that I work on uses strings as primary keys in several tables (MySQL). It is not trivial to change this, and I'd like to know what can be gained performance wise to justify the work.
on the other hand I can imagine that
internally a DBMS will compute hash
keys to reduce the penalty.
The DB needs to maintain a B-Tree (or a similar structure) with the key in a way to have them ordered.
If the key is hashed and stored it in the B-Tree that would be fine to check rapidly the uniqueness of the key -- the key can still be looked up efficiently. But you would not be able to search efficient for range of data (e.g. with LIKE) because the B-Tree is no more ordered according to the String value.
So I think most DB really store the String in the B-Tree, which can (1) take more space than numeric values and (2) require the B-Tree to be re-balanced if keys are inserted in arbitrary order (no notion of increasing value as with numeric pk).
The penalty in practice can range from insignificant to huge. It all depends on the usage, the number of rows, the average size of the string key, the queries which join table, etc.
In our product we use varchar(32) for primary keys (GUIDs) and we haven't met performance issues of this. Our product is a web site with extreme overload and is critical to be stable.
We use SQL Server 2005.
Edit: In our biggest tables we have more than 3 000 000 records with lots of inserts and selects from them. I think in general, the benefit of migrating to int key will be very low, but the problems while migrating very high.
One thing to watch out for is page splits (I know this can happen in SQL Server - probably the same in MySQL).
Primary keys are physically ordered. By using an auto-increment integer you guarantee that each time you insert you are inserting the next number up, so there is no need for the db to reorder the keys. If you use strings however, the pk you insert may need to be placed in the middle of the other keys to maintain the pk order. That process of reordering the pks on the insert can get expensive.
It depends on several factors: RDBMS, number of indexes involving those columns but in general it will be more efficient using ints, folowed by bigints.
Any performance gains depend on usage, so without concrete examples of table schema and query workload it is hard to say.
Unless it makes sense in the domain (I'm thinking unique something like social security number), a surrogate integer key is a good choice; referring objects do not need to have their FK reference updated when the referenced object changes.
In my application I usually use my primary keys as a way to access data. However, I've been told in order to increase performance, I should index columns in my table. But I have no idea what columns to index.
Now the Questions:
Is it a good idea to create an index on your primary key?
How would I know what columns to index?
Is it a good idea to create an index on your primary key?
Primary keys are implemented using a unique index automatically in Postgres. You are done here.
The same is true for MySQL. See:
Is the primary key automatically indexed in MySQL?
How would I know what columns to index?
For advice on additional indices, see:
Optimize PostgreSQL read-only tables
Again, the basics are the same for MySQL and Postgres. But Postgres has more advanced features like partial or functional indices if you need them. Start with the basics, though.
Your primary key will already have an index that is created for you automatically by PostgreSQL. You do not need to index the column again.
As far as the rest of the fields go, take a look at the article here on figuring out cardinality:
http://kirk.webfinish.com/2013/08/some-help-to-find-uniqueness-in-a-large-table-with-many-fields/
Fields that are completely unique are candidates, fields that have no uniqueness at all are useless to index. The sweet spot is the cardinality in the middle (.5).
And of course you should take a look at which columns you are using in the WHERE clause. It is useless to index columns that are not a part of your quals.
Primary keys will have an idex only if you formally define them as primary keys. Where most people forget to make indexes are Foriegn keys which are not generally automatically indexed and almost always will be involved in joins and thus indexed. Other candidates for indexes are things you frequently filter data on that have a large number fo possible values, things like names, part numbers, start Dates, etc.
1) Is it a good idea to make your primary key as an Index?(assuming the primary key is unique,an id
All DBMSes I know of will automatically create an index underneath the PK.
In case of MySQL/InnoDB, PK will not just be indexed, but that index will be clustered index.
(BTW, just saying "primary key" implies it is unique, so there is no need to explicitly state "assuming the primary key is unique".)
2) how would I know what columns to index ?
That depends on which queries need to be supported.
But beware that adding indexes is not free and is a matter of engineering tradeoff - while some queries might benefit from an index, some may actually suffer from it. For example:
An index on FOO would significantly speed-up the SELECT * FROM T WHERE FOO = ....
However, the same index would somewhat slow-down the INSERT INTO T VALUES (...).
In most situations you'd favor large speedup in SELECT over small slowdown in INSERT, but that may not always be the case.
Indexing and the database performance in general are a complex topic beyond the scope of a humble StackOverflow post, but if you are interested I warmly recommend reading Use The Index, Luke!.
Your primary key will always be an index.
Always create indexes in columns that help to reduce the search, for example if in the column there are only 3 different values among more than a thousand it is a good sign to make it index.
Hi I am a newbie to mysql
Here are my questions:
What is the use of Mysql Index key?
Does it make a difference in mysql queries with defining an Index key and without it?
Are all primary keys default Index key?
Thanks a million
1- Defining an index on a column (or set of columns) makes searching on that column (or set) much faster, at the expense of additional disk space.
2- Yes, the difference is that queries using that column will be much faster.
3- Yes, as it's usual to search by the primary key, it makes sense for that column to always be indexed.
Read more on MySQL indexing here.
An index is indeed an additional set of records. Nothing more.
Things that make indexes access faster are:
Internally there's more chance that the engine put in buffer the index than the whole table rows
The index is smaller so to parse it means reading less blocks of the hard drive
The index is sorted already, so finding a given value is easy
In case of being not null, it's even faster (for various reasons, but the most important thing to know is that the engine doesn't store null values in indexes)
To know whether or not an index is useful is not so easy to guess (obviously I'm not talking about the primary key) and should be investigated. Here are some counterparts when it might slow down your operations:
It will slow down inserts and updates on indexed fields
It requires more maintenance: statistics have to be built for each index so the computing could take a significantly longer time if you add many indexes
It might slow down the queries when the statistics are not up to date. This effect could be catastrophic because the engine would actually go "the wrong way"
It might slow down when the query are inadequate (anyway indexes should not be a rule but an exception: no index, except if there's an urge on certain queries. I know usually every table has at least one index, but it came after investigations)
We could comment this last point a lot, but I think every case is special and many examples of this already exist in internet.
Now about your question 'Are all primary keys default Index key?', I must say that it is not like this that the optimizer works. When there are various indexes defined in a table, the more efficient index combination will be compiled with on the fly datas and some other static datas (indexes statistics), in order to reach the best performances. There's no default index per se, every situation leeds to a different result.
Regards.
I have recently started a new job and noticed that all the SQL tables use the GUID data type for the primary key.
In my previous job we used integers (Auto-Increment) for the primary key and it was a lot more easier to work with in my opinion.
For example, say you had two related tables; Product and ProductType - I could easily cross check the 'ProductTypeID' column of both tables for a particular row to quickly map the data in my head because its easy to store the number (2,4,45 etc) as opposed to (E75B92A3-3299-4407-A913-C5CA196B3CAB).
The extra frustration comes from me wanting to understand how the tables are related, sadly there is no Database diagram :(
A lot of people say that GUID's are better because you can define the unique identifer in your C# code for example using NewID() without requiring SQL SERVER to do it - this also allows you to know provisionally what the ID will be.... but I've seen that it is possible to still retrieve the 'next auto-incremented integer' too.
A DBA contractor reported that our queries could be up to 30% faster if we used the Integer type instead of GUIDS...
Why does the GUID data type exist, what advantages does it really provide?... Even if its a choice by some professional there must be some good reasons as to why its implemented?
GUIDs are good as identity fields in certain cases:
When you have multiple instances of SQL (different servers) and you need to combine the different updates later on without affecting referential integrity
Disconnected clients that create data - this way they can create data without worrying that the ID field is already taken
GUIDs are generated to be globally unique, which is why they are suited for such scenarios.
Contrary to what most folks here seem to preach, I see GUID's as more of a plague than a blessing. Here's why:
GUIDs may seem to be a natural choice for your primary key - and if you really must, you could probably argue to use it for the PRIMARY KEY of the table. What I'd strongly recommend not to do is use the GUID column as the clustering key, which SQL Server does by default, unless you specifically tell it not to.
You really need to keep two issues apart:
the primary key is a logical construct - one of the candidate keys that uniquely and reliably identifies every row in your table. This can be anything, really - an INT, a GUID, a string - pick what makes most sense for your scenario.
the clustering key (the column or columns that define the "clustered index" on the table) - this is a physical storage-related thing, and here, a small, stable, ever-increasing data type is your best pick - INT or BIGINT as your default option.
By default, the primary key on a SQL Server table is also used as the clustering key - but that doesn't need to be that way! I've personally seen massive performance gains when breaking up the previous GUID-based Primary / Clustered Key into two separate key - the primary (logical) key on the GUID, and the clustering (ordering) key on a separate INT IDENTITY(1,1) column.
As Kimberly Tripp - the Queen of Indexing - and others have stated a great many times - a GUID as the clustering key isn't optimal, since due to its randomness, it will lead to massive page and index fragmentation and to generally bad performance.
Yes, I know - there's newsequentialid() in SQL Server 2005 and up - but even that is not truly and fully sequential and thus also suffers from the same problems as the GUID - just a bit less prominently so. Plus, you can only use it as a default for a column in your table - you cannot get a new sequential GUID in T-SQL code (like a trigger or something) - another major drawback.
Then there's another issue to consider: the clustering key on a table will be added to each and every entry on each and every non-clustered index on your table as well - thus you really want to make sure it's as small as possible. Typically, an INT with 2+ billion rows should be sufficient for the vast majority of tables - and compared to a GUID as the clustering key, you can save yourself hundreds of megabytes of storage on disk and in server memory.
Quick calculation - using INT vs. GUID as Primary and Clustering Key:
Base Table with 1'000'000 rows (3.8 MB vs. 15.26 MB)
6 nonclustered indexes (22.89 MB vs. 91.55 MB)
TOTAL: 25 MB vs. 106 MB - and that's just on a single table!
Some more food for thought - excellent stuff by Kimberly Tripp - read it, read it again, digest it! It's the SQL Server indexing gospel, really.
GUIDs as PRIMARY KEY and/or clustered key
The clustered index debate continues
Ever-increasing clustering key - the Clustered Index Debate..........again!
Marc
INT
Advantage:
Numeric values (and specifically integers) are better for performance when used in joins, indexes and conditions.
Numeric values are easier to understand for application users if they are displayed.
Disadvantage:
If your table is large, it is quite possible it will run out of it and after some numeric value there will be no additional identity to use.
GUID
Advantage:
Unique across the server.
Disadvantage:
String values are not as optimal as integer values for performance when used in joins, indexes and conditions.
More storage space is required than INT.
credit goes to : http://blog.sqlauthority.com/2010/04/28/sql-server-guid-vs-int-your-opinion/
There are a ton of Google-able articles on using GUIDs as PKs and almost all of them say the same thing your DBA contractor says -- queries are faster without GUIDs as keys.
The primary use I've seen in practice (we've never used them as PKs) is with replication. The MSDN page for uniqueidentifier says about the same.
It is globally unique, so that each record in your table has a GUID that is shared by no other item of any kind in the world. Handy if you need this kind of exclusive identification (if you are replicating the database, or combining data from multiple source). Otherwise, your dba is correct - GUIDs are much larger and less efficient that integers, and you could speed up your db (30%? maybe...)
They basically save you from more sometimes complicated logic of using
set #InsertID = scope_identity()