There is a table that contains more id data than real data.
user_id int unsigned NOT NULL,
project_id int unsigned NOT NULL,
folder_id int unsigned NOT NULL,
file_id int unsigned NOT NULL,
data TEXT NOT NULL
The only way to create a unique primary key for this table would be a composite of (user_id, project_id, folder_id, file_id). I have frequently seen 2-column composite primary keys, but is it ok to have 4 or even more? According to MySQL: "All storage engines support at least 16 indexes per table and a total index length of at least 256 bytes. Most storage engines have higher limits.", so I know it is at least possible to do.
Past this, there are frequent queries to this table for various combinations of these ids. For example, find all projects for user X, find all files for user X, find all files for project Y and folder Z, etc. Should there be a separate individual index key on each of the id columns, or if there is a composite primary key that already contains all the columns does this make further individual keys redundant? There will be about 10 million - 50 million rows in the table at any time.
To summarize: is it ok to have a composite primary key with 4 (or more) id columns, and if there is a composite key does it make additional individual keys for each of those columns redundant?
Yes, it is ok to have a composite primary key with 4 or more columns.
It doesn't necessarily make additional keys for each of those columns redundant. For example, a key (a, b, c) will not be useful for a query SELECT ... WHERE b = 4. For that type of query you would rather have key (b) or key (b, c).
You need to examine your expected queries to determine which indexes you'll need. See this talk for more details: http://youtu.be/AVNjqgf7zNw
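For example, given the queries listed in the question (all projects for user X, all files for project Y and folder Z), a starting point might be the following sketch; the table name user_file_data and the exact secondary indexes are assumptions to be verified with EXPLAIN against your real queries:

CREATE TABLE user_file_data (
    user_id INT UNSIGNED NOT NULL,
    project_id INT UNSIGNED NOT NULL,
    folder_id INT UNSIGNED NOT NULL,
    file_id INT UNSIGNED NOT NULL,
    data TEXT NOT NULL,
    -- serves queries that constrain a leftmost prefix:
    -- (user_id), (user_id, project_id), ...
    PRIMARY KEY (user_id, project_id, folder_id, file_id),
    -- assumed secondary index for "all files for project Y and folder Z",
    -- which the primary key cannot serve since user_id is unconstrained
    KEY idx_project_folder (project_id, folder_id)
) ENGINE=InnoDB;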
Yes, this is OK if the data model supports it. You haven't shared much about your overall DB schema and how these items relate to each other, so it's hard to determine whether this is the best approach. In other words, is this truly the only way in which these four items are related to each other, or, for example, are the files REALLY related to projects and projects related to users, or something like that, such that splitting up these join tables makes more logical sense?
If you are querying individual columns within this primary key, that might suggest to me that your schema is not quite correct. At a minimum you might need to add individual indexes on these columns to support such queries.
You're going to regret creating a compound primary key: it becomes really obnoxious to address individual rows, and secondary indexes in InnoDB must contain the primary key as a row identifier. You can create a compound UNIQUE key instead, though.
You can have a composite key with a fairly large number of components, though keep in mind the more you add the bigger the index will get and the slower it will be to update when you do an INSERT. As your database grows in size, insert operations may get cripplingly slow.
This is why, whenever possible, you should try and minimize your index size.
I need a table to store some ratings. In this table I have a composite index (user_id, post_id) and another column to identify different rating systems.
user_id - bigint
post_id - bigint
type - varchar
...
Composite Index (user_id, post_id)
This table has no primary key, because a primary key must be unique and an INDEX need not be unique; in my case, uniqueness would be a problem.
For example I can have
INSERT INTO tbl_rate
(user_id,post_id,type)
VALUES
(24,1234,'like'),
(24,1234,'love'),
(24,1234,'other');
Could the missing PRIMARY KEY cause performance problems? Is my table structure good, or do I need to change it?
Thank you
A few points:
It sounds like you are just taking what is currently unique about the table and making that the primary key. That works. And natural keys have some advantages when it comes to querying, because of locality (the data for each user is stored in the same area), and because the table is clustered by that key, which eliminates lookups to the data if you are searching by the columns in the primary key.
But, using a natural primary key like you chose has disadvantages for performance as well.
Using a very large primary key will make all other indexes very large in InnoDB, because the primary key is included in each secondary index entry.
Using a natural primary key isn't as fast as a surrogate key for INSERTs, because in addition to being bigger, it can't just insert at the end of the table each time; it has to insert in the section for that user and post, etc.
Also, if you are searching by time, you will most likely be seeking all over the table with a natural key unless time is your first column. Surrogate keys tend to be local in time and can often be just right for some queries.
Using a natural key like yours as a primary key can also be annoying. What if you want to refer to a particular vote? You need a few fields. Also it's a little difficult to use with lots of ORMs.
Here's the Answer
I would create your own surrogate key and use it as the primary key, rather than relying on InnoDB's internal row id, because you'll be able to use it for updates and lookups.
ALTER TABLE tbl_rate
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ADD PRIMARY KEY(id);
But, if you do create a surrogate primary key, I'd also make your key a UNIQUE. Same cost but it enforces correctness.
ALTER TABLE tbl_rate
ADD UNIQUE ( user_id, post_id, type );
Could the missing PRIMARY KEY cause performance problems?
Yes, in InnoDB for sure, as InnoDB will use an algorithm to create its own "ROWID", which is defined in dict0boot.ic:
/** Returns a new row id.
@return the new id */
UNIV_INLINE
row_id_t
dict_sys_get_new_row_id(void)
/*=========================*/
{
row_id_t id;
mutex_enter(&(dict_sys->mutex));
id = dict_sys->row_id;
if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
dict_hdr_flush_row_id();
}
dict_sys->row_id++;
mutex_exit(&(dict_sys->mutex));
return(id);
}
The main problem in that code is mutex_enter(&(dict_sys->mutex));, which blocks other threads from entering this section if one thread is already running it.
This means inserts are effectively serialized at that point, much as a table lock in MyISAM would serialize them.
% may take a few nanoseconds. That is insignificant compared to everything else. Anyway, #define DICT_HDR_ROW_ID_WRITE_MARGIN 256
Yes, Rick James, this is indeed insignificant compared to what was mentioned above.
The C/C++ compiler would micro-optimize it further to get even more performance out of it by emitting lighter CPU instructions.
Still, the main performance concern is the mutex mentioned above.
Also, the modulo operator (%) is a relatively CPU-heavy instruction.
But depending on the C/C++ compiler (and/or configuration options), it might be optimized when DICT_HDR_ROW_ID_WRITE_MARGIN is a power of two, into something like (0 == (id & (DICT_HDR_ROW_ID_WRITE_MARGIN - 1))), since bitmasking is much faster. I believe DICT_HDR_ROW_ID_WRITE_MARGIN is indeed a power of 2 (it is defined as 256).
I have a large sensor data count table, say SENSORS_COUNT, with a string SID referring to another table SENSOR_DEFINITIONS with the same primary key SID. As there are millions of data points, the index on the string primary key becomes 1) bloated and 2) slow. The total number of sensors is pretty small (< 2000).
I can think of 3 different ways of making the queries faster:
Use a join table which translates the string key into a corresponding integer key, and refer to that with joins in all queries
Load the string/integer translation as a hash in program memory and refer to that within the code
Use an index on the string primary id (which would be slower than an integer, though)
I'm trying to build a system for a variety of sensors which may have different types of string ids (but the same schema). What would be the best recommendation to go about it?
EDIT 1: This is the schema. And yes (thanks for the correction), in the SENSORS_COUNT table, SID is not a primary key.
TABLE: SENSOR_DEFINITIONS (2000 records)
SID : VARCHAR(20), PRIMARY KEY
SNAME: VARCHAR(50)
TABLE: SENSORS_COUNT (N million records)
SID: VARCHAR(20)
DATETIME: TIMESTAMP
VALUE: INTEGER
For "large" tables, normalization becomes more important. Especially when the table is too big to be cached.
So, I agree with the choice of using a SMALLINT UNSIGNED (2 bytes, 0..64K) for the 2000 sensor names, not a VARCHAR(...).
Without seeing (1) the SHOW CREATE TABLE and (2) some critical SELECTs, it is hard to give further advice.
Probably, a "composite" PRIMARY KEY would be better than an AUTO_INCREMENT. It might be (sensor_id, datetime), but it would help to see the selects first.
Do not have two tables with the same schema (without a good reason).
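A rough sketch of that normalization (a sketch only; names, types, and the exact primary key are assumptions to be checked against your real SELECTs):

CREATE TABLE sensor_definitions (
    sensor_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    sid VARCHAR(20) NOT NULL,
    sname VARCHAR(50) NOT NULL,
    PRIMARY KEY (sensor_id),
    UNIQUE KEY sensor_definitions_sid (sid)  -- keeps the string id unique
) ENGINE=InnoDB;

CREATE TABLE sensors_count (
    sensor_id SMALLINT UNSIGNED NOT NULL,  -- 2 bytes instead of up to 21
    `datetime` TIMESTAMP NOT NULL,
    `value` INT NOT NULL,
    PRIMARY KEY (sensor_id, `datetime`)    -- composite, as suggested above
) ENGINE=InnoDB;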
I have a table that uses 2 foreign key fields and a date field.
Is it common to have a table use 3 or more fields as a primary key? And are there any disadvantages to doing this?
--
My 3 tables are employees, training, and emp_training. The employees table holds employee data, the training table holds different training courses, and I am designing the emp_training table to have the fields EmployeeID (FK), TrainingID (FK), and OnDate.
An employee can do multiple training courses, and can do the same training course multiple times. But they cannot do the same training course more than once on the same day.
Which is better to implement:
Option A - Make all 3 fields a primary key
Option B - Add an autonumber PK field, and use a query to find any potential duplicates.
I've created many tables before using 2 fields as a primary key, but never 3, so I'm curious if there is any disadvantage to proceeding with option A
It's worth mentioning that with SQL Server, the PK is by default the one and only clustered key, but you are allowed to create a non-clustered PK as well.
You may define a new clustered index which is not the PK. "Primary Key" is actually just a name...
The most important question is: which columns participate in the clustered key? And (this is the very most important question): do they have an implicit sort order? And (very important too): are there many update operations which change the content of participating columns?
You must be aware that a clustered key defines the physical order on your hard disk. In other words: the clustered key is the table itself. You can think of it as an index with all columns included. If your leading column is (worst case) a GUID, each insert into your table will be out of order. This leads to 99.99% fragmentation.
If a clustered index is bound to the time of insert or a running number (best case), it will never go into fragmentation!
What makes things worse: If there is a clustered key (whether it's called PK or not), it will be used as lookup key for other indexes.
So: in many cases it is best to use a running number as the clustered key, plus a non-clustered multi-column index, which is much faster to rebuild than if it were the clustered one.
All indexes will profit from this!
My advice for you:
Option C: a running number as PK, and additionally a unique multi-column key to ensure data integrity. No need to implement your own duplicate-checking logic here...
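As a rough sketch in MySQL terms (the question mentions autonumber, so adapt to your actual DBMS; table and column names follow the question):

CREATE TABLE emp_training (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,  -- running number as PK
    EmployeeID INT UNSIGNED NOT NULL,         -- FK to employees
    TrainingID INT UNSIGNED NOT NULL,         -- FK to training
    OnDate DATE NOT NULL,
    PRIMARY KEY (id),
    -- enforces "no same course twice on the same day" declaratively
    UNIQUE KEY emp_training_UX1 (EmployeeID, TrainingID, OnDate)
) ENGINE=InnoDB;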
Yes, you can have a poor strategy in choosing too many columns for your composite Primary Key (PK), if a better strategy could be employed for uniqueness via secondary indexes.
Remember that the PK is special: there is only one physical / clustered ordering of your data. Changes to the data via INSERTs and UPDATEs (and the attendant shuffling) have overhead there that would not exist if uniqueness were maintained in a secondary index.
So the following can have not-so-insignificant differences:
A primary key with 5 composite columns
vs.
A primary key with 1 or 2 columns plus
Secondary indexes that maintain uniqueness if thought through well
The former mandates movement of data between data pages to maintain the clustered index (the PK), which might suggest why one so often sees:
(
id int auto_increment primary key,
...
)
in table designs.
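For comparison, the alternative with a short PK plus a uniqueness-maintaining secondary index might look like this (a sketch only; column names are placeholders):

CREATE TABLE example (
    id INT AUTO_INCREMENT PRIMARY KEY,  -- narrow clustered key
    a INT NOT NULL,
    b INT NOT NULL,
    c INT NOT NULL,
    d INT NOT NULL,
    e INT NOT NULL,
    -- uniqueness is enforced in a secondary index;
    -- the clustered index stays insert-ordered
    UNIQUE KEY example_UX1 (a, b, c, d, e)
);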
Performance with Index Width:
The width of the surrogate PK above is narrow, while a 5-column composite PK can be quite wide. Wider keys propagated to child relationships will slow performance and concurrency.
Cases of FK compositions:
Special cases of compositions of foreign keys simply cannot be achieved without the use of a single column index, preferably the PK, as seen in this recent Answer of mine.
I don't think there is any problem with creating a table with a composite PK; such tables are needed in larger databases. There is no real problem in creating a table where the two FKs, together with the OnDate field, form the PK. Both ways are viable.
Good luck!
If you assign a primary key to more than one column, it becomes a composite primary key. For example:
CREATE TABLE employee(
    training VARCHAR(10),
    emp_training VARCHAR(20),
    OnDate DATE,
    PRIMARY KEY (training, emp_training, OnDate)
)
The combination of training, emp_training, and OnDate must be unique, and none of those columns can be NULL.
As already stated you can have a single primary key which consists of multiple columns.
If the question was how to make the columns primary keys separately, that's not possible. However, you can create one primary key and add two unique keys.
While performing INSERT...ON DUPLICATE KEY UPDATE on InnoDB in MySQL, we are often told to ignore the potential gaps in auto_increment columns. What if such gaps are very likely and cannot be ignored?
As an example, suppose there is one table rating that stores the users' ratings of items. The table schema is something like:
CREATE TABLE rating (
id INT AUTO_INCREMENT PRIMARY KEY,
user_id INT NOT NULL,
item_id INT NOT NULL,
rating INT NOT NULL,
UNIQUE KEY tuple (user_id, item_id),
FOREIGN KEY (user_id) REFERENCES user(id),
FOREIGN KEY (item_id) REFERENCES item(id)
);
It is possible that there are many users and many items, and users may frequently change the ratings of items that they have already rated. Every time a rating is changed, a gap is created if we use INSERT...ON DUPLICATE KEY UPDATE; otherwise we have to query twice (do a SELECT first), which harms performance, or check affected rows, which cannot accommodate multi-row INSERTs.
For some system where 100K users each has rated 10 items and changes half of the ratings every day, the auto_increment id will be exhausted within two years. Then what should we do to prevent it in practice?
Full answer.
Gaps are OK! Just use a bigger id field, for example BIGINT. Don't try to reuse gaps; this is a bad idea. Don't think about performance or optimization in this case: it's a waste of time.
Another solution is to make a composite key the primary key. In your case, you can remove the id field and use the pair (user_id, item_id) as the primary key.
In the case of "rating", the most frequent operations are "delete by user_id" and inserting, so you don't really need this id primary key for functionality. But you always need some primary key to be present in the table.
The only drawback of this method is that when you want to delete just one row from the table, you will need to use a query something like:
DELETE FROM rating WHERE user_id = 123 AND item_id=1234
instead of old
DELETE FROM rating WHERE id = 123
But in this case it isn't hard to change one line of code in your application. Furthermore, in most cases people don't need such functionality.
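Applied to the table from the question, the composite-key version would look something like this (a sketch; everything except the removed id column is kept from the original definition):

CREATE TABLE rating (
    user_id INT NOT NULL,
    item_id INT NOT NULL,
    rating INT NOT NULL,
    PRIMARY KEY (user_id, item_id),  -- replaces both id and the UNIQUE key
    FOREIGN KEY (user_id) REFERENCES user(id),
    FOREIGN KEY (item_id) REFERENCES item(id)
);

INSERT...ON DUPLICATE KEY UPDATE works against this primary key exactly as it did against the UNIQUE key, and there is no AUTO_INCREMENT counter left to exhaust.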
We work with large tables and have tables with hundreds of millions of records. We repeatedly use INSERT IGNORE or INSERT .. ON DUPLICATE KEY UPDATE. Making the column an unsigned BIGINT will avoid the id exhaustion issue.
But I would suggest you think of a long-term solution as well, with some known facts:
SELECT and then INSERT/UPDATE is quite often faster than INSERT .. ON DUPLICATE KEY UPDATE, again based on your data size and other factors; see the sketch after this list.
If you have two unique keys (or one primary and one unique key), your query might not always be predictable; it can give replication errors if you use statement-based replication.
The id is not the only issue with large tables. If you have a table with more than some 300M records, performance degrades drastically. You need to think of partitioning/clustering/sharding your database/tables pretty soon.
Personally, I would suggest not using INSERT .. ON DUPLICATE KEY UPDATE. Read extensively on its usage and performance impact if you are planning a highly scalable service.
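For reference, the SELECT-then-INSERT/UPDATE pattern mentioned in the first point might look like the following sketch (it needs the surrounding transaction, or SELECT ... FOR UPDATE, to be safe under concurrency):

START TRANSACTION;
-- lock the row (if it exists) so no other session races us
SELECT rating FROM rating WHERE user_id = 24 AND item_id = 1234 FOR UPDATE;
-- if the SELECT returned a row:
UPDATE rating SET rating = 5 WHERE user_id = 24 AND item_id = 1234;
-- otherwise:
INSERT INTO rating (user_id, item_id, rating) VALUES (24, 1234, 5);
COMMIT;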
I am implementing a friends list for users in my database, where the list will store the friends accountID.
I already have a similar structure in my database for achievements, where I have a separate table that holds pairs of accountID and achievementID, but my concern with this approach is that it is inefficient: if there are 1 million users with 100 achievements each, there are 100 million entries in this table. Then trying to get every achievement for a user with a certain accountID would be a linear scan of the table (I think).
I am considering having a comma-separated string of accountIDs for my friends-list table. I realize how annoying it will be to deal with the data as a string, but at least it would be guaranteed log(n) search time for a user, with accountID as the primary key and the second column being the list string.
Am I wrong about the search time for these two different structures?
MySQL can make effective use of appropriate indexes, for queries designed to use those indexes, avoiding a "scan" operation on the table.
If you are ALWAYS dealing with the complete set of achievements for a user, retrieving the entire set, and storing the entire set, then a comma separated list in a single column can be a workable approach.
HOWEVER... that design breaks down when you want to deal with individual achievements. For example, if you want to retrieve a list of users that have a particular achievement. Now, you're doing expensive full scans of all achievements for all users, doing "string searches", dependent on properly formatted strings, and MySQL is unable to use an index scan to efficiently retrieve that set.
So, the rule of thumb: if you NEVER need to individually access an achievement, NEVER need to remove an achievement from a user in the database, NEVER need to add an individual achievement for a user, and you will ONLY EVER pull the achievements as an entire set and store them as an entire set, in and out of the database, then the comma separated list is workable.
I hesitate to recommend that approach, because it never turns out that way. Inevitably, you'll want a query to get a list of users that have a particular achievement.
With the comma separated list column, you're into some ugly SQL:
SELECT a.user_id
FROM user_achievement_list a
WHERE CONCAT(',',a.list,',') LIKE '%,123,%'
ugly in the sense that MySQL can't use an index range scan to satisfy the predicate; MySQL has to look at EVERY SINGLE list of achievements, and then do a string scan on each and every one of them, from the beginning to the end, to find out if a row matches or not.
And it's downright excruciating if you want to use the individual values in that list to do a join operation, to "lookup" a row in another table. That SQL just gets horrendously ugly.
And declarative enforcement of data integrity is impossible; you can't define any foreign key constraints that restrict the values that are added to the list, or remove all occurrences of a particular achievement_id from every list it occurs in.
Basically, you're "giving up" the advantages of a relational data store, so don't expect the database to be able to do any work with that type of column. As far as the database is concerned, it's just a blob of data; it might as well be a .jpg image stored in that column. MySQL isn't going to help with retrieving or maintaining the contents of that list.
On the other hand, if you go with a design that stores the individual rows, each achievement for each user as a separate row, and you have an appropriate index available, the database can be MUCH more efficient at returning the list, and the SQL is more straightforward:
SELECT a.user_id
FROM user_achievements a
WHERE a.achievement_id = 123
A covering index would be appropriate for that query:
... ON user_achievements (achievement_id, user_id)
An index with user_id as the leading column would be suitable for other queries:
... ON user_achievements (user_id, achievement_id)
FOLLOWUP
Use EXPLAIN SELECT ... to see the access plan that MySQL generates.
For your example, retrieving all achievements for a given user, MySQL can do a range scan on the index to quickly locate the set of rows for the one user. MySQL doesn't need to look at every page in the index, the index is structured as a tree (at least, in the case of B-Tree indexes) so it can basically eliminate a whole boatload of pages it "knows" that the rows you are looking for can't be. And with the achievement_id also in the index, MySQL can return the resultset right from the index, without a need to visit the pages in the underlying table. (For the InnoDB engine, the PRIMARY KEY is the cluster key for the table, so the table itself is effectively an index.)
With a two column InnoDB table (user_id, achievement_id), with those two columns as the composite PRIMARY KEY, you would only need to add one secondary index, on (achievement_id, user_id).
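Concretely, that two-column design might look like the following (a sketch with assumed names):

CREATE TABLE user_achievements (
    user_id BIGINT UNSIGNED NOT NULL,
    achievement_id BIGINT UNSIGNED NOT NULL,
    -- clustered; serves "all achievements for user X"
    PRIMARY KEY (user_id, achievement_id),
    -- secondary; serves "all users with achievement Y"
    KEY user_achievements_IX1 (achievement_id, user_id)
) ENGINE=InnoDB;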
FOLLOWUP
Q: By secondary index, do you mean a 3rd column that contains the key for the composite (userID, achievementID) table? My create table query looks like this:
CREATE TABLE `UserFriends`
(`AccountID` BIGINT(20) UNSIGNED NOT NULL
,`FriendAccountID` BIGINT(20) UNSIGNED NOT NULL
,`Key` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT
, PRIMARY KEY (`Key`)
, UNIQUE KEY `AccountID` (`AccountID`, `FriendAccountID`)
);
A: No, I don't mean the addition of a third column. If the only two columns in the table are the foreign keys to another table (it looks like they both refer to the same table), the columns are both NOT NULL, there is a UNIQUE constraint on the combination of the columns, and there are no other attributes on the table, then I would consider not using a surrogate key at all. I would make the UNIQUE KEY the PRIMARY KEY.
Personally, I would be using InnoDB, with the innodb_file_per_table option enabled. And my table definition would look something like this:
CREATE TABLE user_friend
( account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, friend_account_id BIGINT(20) UNSIGNED NOT NULL COMMENT 'PK, FK ref account.id'
, PRIMARY KEY (account_id, friend_account_id)
, UNIQUE KEY user_friend_UX1 (friend_account_id, account_id)
, CONSTRAINT FK_user_friend_user FOREIGN KEY (account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
, CONSTRAINT FK_user_friend_friend FOREIGN KEY (friend_account_id)
REFERENCES account (id) ON UPDATE CASCADE ON DELETE CASCADE
) Engine=InnoDB;