I need a table to store some ratings. In this table I have a composite index (user_id, post_id) and another column to identify the rating system.
user_id - bigint
post_id - bigint
type - varchar
...
Composite Index (user_id, post_id)
This table has no primary key, because a primary key must be unique while a plain index need not be; in my case uniqueness is exactly the problem.
For example, I can have:
INSERT INTO tbl_rate
(user_id,post_id,type)
VALUES
(24,1234,'like'),
(24,1234,'love'),
(24,1234,'other');
Can the missing PRIMARY KEY cause performance problems? Is my table structure good, or do I need to change it?
Thank you
A few points:
It sounds like you are just taking what is currently unique about the table and making that the primary key. That works. And natural keys have some advantages when it comes to querying because of locality: the data for each user is stored in the same area. And because the table is clustered by that key, lookups to the data are eliminated when you search by the columns of the primary key.
But, using a natural primary key like you chose has disadvantages for performance as well.
Using a very large primary key will make all other indexes very large in innodb because the primary key is included in each index value.
Using a natural primary key isn't as fast as a surrogate key for INSERTs because, in addition to being bigger, a row can't just be inserted at the end of the table each time. It has to be inserted in the section for that user and post.
Also, if you are searching by time, you will most likely be seeking all over the table with a natural key, unless time is your first column. Surrogate keys tend to be local in time and can often be just right for some queries.
Using a natural key like yours as a primary key can also be annoying. What if you want to refer to a particular vote? You need several fields. It's also a little difficult to use with many ORMs.
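To make that concrete, here is a minimal sketch (using the tbl_rate columns above) of addressing one row with only the natural key versus with a surrogate id:

-- with only the natural key, every lookup or delete repeats all of it:
DELETE FROM tbl_rate
WHERE user_id = 24 AND post_id = 1234 AND type = 'love';

-- with the surrogate key added below, one value identifies the row:
DELETE FROM tbl_rate WHERE id = 42;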
Here's the Answer
I would create your own surrogate key and use it as a primary key rather than rely on innodb's internal primary key because you'll be able to use it for updates and lookups.
ALTER TABLE tbl_rate
ADD id INT UNSIGNED NOT NULL AUTO_INCREMENT,
ADD PRIMARY KEY(id);
But, if you do create a surrogate primary key, I'd also make your natural key UNIQUE. Same cost, but it enforces correctness.
ALTER TABLE tbl_rate
ADD UNIQUE ( user_id, post_id, type );
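With that in place, a duplicate rating is rejected instead of silently accumulating. For example (a sketch; the key name reported in the error depends on how the index was created):

INSERT INTO tbl_rate (user_id, post_id, type)
VALUES (24, 1234, 'like');
-- running the same INSERT again now fails with error 1062 (duplicate entry)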
Can the missing PRIMARY KEY cause performance problems?
Yes, in InnoDB for sure, as InnoDB will use an algorithm to create its own "ROWID", which is defined in dict0boot.ic:
/** Returns a new row id.
@return the new id */
UNIV_INLINE
row_id_t
dict_sys_get_new_row_id(void)
/*=========================*/
{
    row_id_t id;

    mutex_enter(&(dict_sys->mutex));

    id = dict_sys->row_id;

    if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
        dict_hdr_flush_row_id();
    }

    dict_sys->row_id++;

    mutex_exit(&(dict_sys->mutex));

    return(id);
}
The main problem in that code is mutex_enter(&(dict_sys->mutex));, which blocks other threads from entering if one thread is already running this code. In effect it serializes inserts, much the same as a MyISAM table lock would.
% may take a few nanoseconds. That is insignificant compared to everything else. Anyway, #define DICT_HDR_ROW_ID_WRITE_MARGIN 256
Indeed, Rick James, that is insignificant compared to what was mentioned above. The C/C++ compiler would micro-optimize it further, making the CPU instructions lighter to get even more performance out of it. Still, the main performance concern remains the mutex mentioned above.
Also, the modulo operator (%) is a CPU-heavy instruction. But depending on the C/C++ compiler (and/or configuration options) it might be optimized when DICT_HDR_ROW_ID_WRITE_MARGIN is a power of two, into something like (0 == (id & (DICT_HDR_ROW_ID_WRITE_MARGIN - 1))), since bitmasking is much faster. I believe DICT_HDR_ROW_ID_WRITE_MARGIN is indeed a power of 2 (256, per the #define quoted above).
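You can convince yourself that the two tests agree when the margin is a power of two with plain MySQL expressions (256 = 2^8, so the mask is 255):

SELECT 512 % 256 AS modulo_test, 512 & 255 AS bitmask_test; -- both 0
SELECT 513 % 256 AS modulo_test, 513 & 255 AS bitmask_test; -- both 1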
Related
I have a table called customer_type with fields (id, customer_type). It has 5 rows, each describing a customer type.
I also have a table called quote that uses customer_type_id as one of the foreign key columns:
CREATE TABLE `quote` (
`id` int NOT NULL AUTO_INCREMENT,
`number` int NOT NULL,
`customer_type_id` tinyint(4) DEFAULT NULL,
`comments` text,
PRIMARY KEY (`id`),
KEY `fk_customer_type` (`customer_type_id`),
CONSTRAINT `fk_customer_type`
    FOREIGN KEY (`customer_type_id`)
    REFERENCES `customer_type` (`id`)
);
There are other columns and indices in the quote table, total of 10 indices. Lately INSERTs to the database started being slow and one possible reason could be having too many indices.
And so I want to remove some of them, including, for example, fk_customer_type. The cardinality of that index is 5, while the cardinality of some other indices is much higher (e.g. 5,000 or 20,000).
I cannot simply drop the index because of the foreign key constraint.
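For reference, MySQL refuses to drop an index that a foreign key constraint needs, so the constraint has to go first. A sketch using the names from the CREATE TABLE above:

ALTER TABLE quote DROP FOREIGN KEY fk_customer_type;
ALTER TABLE quote DROP INDEX fk_customer_type;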
Question
Does my situation warrant removing this foreign key and corresponding foreign key constraint?
Where my reasons for removing are:
reduce the number of indices in hopes of improving INSERT performance
cardinality of 'customer_type_id' is very low, to where performance will not likely be affected
My reasons against removing could be:
I will lose the referential integrity (foreign key constraint)
Are there any specific downsides that will happen if I do remove the index? Is it worth keeping the index just to keep the index constraint?
The slowlog is an excellent way to identify the slowest queries. More: mysql.rjweb.org/doc.php/mysql_analysis#slow_queries_and_slowlog
I started with that comment because I suspect that your question about cardinality and FKs and dropping indexes has very little to do with your performance problems.
The cardinality of that TINYINT is low in the quote table. So? If you look up all the rows with customer_type_id = 2, the Optimizer will probably ignore INDEX(customer_type_id). But I don't think you run that query. Let's see SHOW CREATE TABLE quote.
The customer_type table is tiny. Its data and index(es) are so trivial that I don't even want to discuss them. And you are probably not adding much to it, ever.
During INSERT INTO quote ..., the FOREIGN KEY constraint needs to check that there is a matching row in customer_type; that takes a small amount of CPU, but probably zero I/O. There will be one read the first time, then that block will stay in cache (see innodb_buffer_pool_size) until shutdown.
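If you want to check how big that cache is on your server, the setting is an ordinary server variable:

SHOW VARIABLES LIKE 'innodb_buffer_pool_size';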
Lately INSERTs to the database started being slow and one possible reason could be having too many indices.
I see 2 indexes in quote:
The PRIMARY KEY is clustered with the data, and is necessary. And, since it is AUTO_INCREMENT, inserts will go at the "end" of the table -- quite efficient.
KEY(customer_type_id) -- I've already explained that it is probably useless. But it is not a big deal. That is, it would not explain your perceived slowdown.
There are other columns and indices in the quote table
Let's see them!
Normally 10 indexes is no big deal. But there could be a UUID or GUID or something else that raises a red flag. Also, if quote is a billion rows long, other issues raise their ugly heads. Or big TEXT/BLOB columns.
(Of course, if you have a billion rows, then you are threatening to overflow AUTO_INCREMENT. That is messy to repair.)
Show us some of the slow queries; I predict that you could use some "composite" indexes.
Batching inserts can greatly speed them up. Describe the inserts (randomly coming from multiple clients / 1000 at a time / whatever).
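For example, one multi-row INSERT is far cheaper than the equivalent single-row statements. A sketch against the quote table (the values are made up):

INSERT INTO quote (number, customer_type_id, comments)
VALUES (1001, 2, 'first'),
       (1002, 3, 'second'),
       (1003, 1, 'third');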
At work we have a big database with unique indexes instead of primary keys and all works fine.
I'm designing new database for a new project and I have a dilemma:
In DB theory the primary key is a fundamental element, and that's OK, but in REAL projects what are the advantages and disadvantages of each?
What do you use in projects?
EDIT: ...and what about primary keys and replication on MS SQL server?
What is a unique index?
A unique index on a column is an index on that column that also enforces the constraint that you cannot have two equal values in that column in two different rows. Example:
CREATE TABLE table1 (foo int, bar int);
CREATE UNIQUE INDEX ux_table1_foo ON table1(foo); -- Create unique index on foo.
INSERT INTO table1 (foo, bar) VALUES (1, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (2, 2); -- OK
INSERT INTO table1 (foo, bar) VALUES (3, 1); -- OK
INSERT INTO table1 (foo, bar) VALUES (1, 4); -- Fails!
Duplicate entry '1' for key 'ux_table1_foo'
The last insert fails because it violates the unique index on column foo when it tries to insert the value 1 into this column for a second time.
In MySQL a unique constraint allows multiple NULLs.
It is possible to make a unique index on multiple columns.
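For example (a fresh table, since table1 above already constrains foo on its own):

CREATE TABLE table2 (foo int, bar int);
CREATE UNIQUE INDEX ux_table2_foo_bar ON table2(foo, bar);
INSERT INTO table2 (foo, bar) VALUES (1, 2); -- OK
INSERT INTO table2 (foo, bar) VALUES (1, 3); -- OK, foo alone may repeat
INSERT INTO table2 (foo, bar) VALUES (1, 2); -- Fails! duplicate (foo, bar) pair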
Primary key versus unique index
Things that are the same:
A primary key implies a unique index.
Things that are different:
A primary key also implies NOT NULL, but a unique index can be nullable (illustrated just after this list).
There can be only one primary key, but there can be multiple unique indexes.
If there is no clustered index defined then the primary key will be the clustered index.
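To illustrate the nullability difference in MySQL terms, reusing table1 from the earlier example (recall that MySQL allows repeated NULLs under a unique index):

INSERT INTO table1 (foo, bar) VALUES (NULL, 5); -- OK
INSERT INTO table1 (foo, bar) VALUES (NULL, 6); -- also OK: NULLs don't collide
-- a PRIMARY KEY on foo would have rejected NULL outright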
You can see it like this:
A Primary Key IS Unique
A Unique value doesn't have to be the Representation of the Element
Meaning? Well, a primary key is used to identify the element: if you have a "Person", you would want a personal identification number (SSN or such) which is primary to your Person.
On the other hand, the person might have an e-mail address which is unique, but doesn't identify the person.
I always have primary keys, even in relationship tables (the mid-table / connection table). Why? Well, I like to follow a standard when coding: if the "Person" has an identifier and the Car has an identifier, then the Person -> Car table should have an identifier as well!
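A minimal sketch of that convention (hypothetical Person/Car connection table, MySQL syntax): the table gets its own id, while a unique constraint still guards the pairing:

CREATE TABLE person_car (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    person_id INT NOT NULL,
    car_id INT NOT NULL,
    UNIQUE (person_id, car_id)
);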
Foreign keys work with unique constraints as well as primary keys. From Books Online:
A FOREIGN KEY constraint does not have to be linked only to a PRIMARY KEY constraint in another table; it can also be defined to reference the columns of a UNIQUE constraint in another table.
For transactional replication, you need the primary key. From Books Online:
Tables published for transactional replication must have a primary key. If a table is in a transactional replication publication, you cannot disable any indexes that are associated with primary key columns. These indexes are required by replication. To disable an index, you must first drop the table from the publication.
Both answers are for SQL Server 2005.
The choice of when to use a surrogate primary key as opposed to a natural key is tricky. Answers such as "always" or "never" are rarely useful. I find that it depends on the situation.
As an example, I have the following tables:
CREATE TABLE toll_booths (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
...
UNIQUE(name)
);
CREATE TABLE cars (
vin VARCHAR(17) NOT NULL PRIMARY KEY,
license_plate VARCHAR(10) NOT NULL,
...
UNIQUE(license_plate)
);
CREATE TABLE drive_through (
id INTEGER NOT NULL PRIMARY KEY,
toll_booth_id INTEGER NOT NULL REFERENCES toll_booths(id),
vin VARCHAR(17) NOT NULL REFERENCES cars(vin),
at TIMESTAMP DEFAULT CURRENT_TIMESTAMP NOT NULL,
amount NUMERIC(10,4) NOT NULL,
...
UNIQUE(toll_booth_id, vin)
);
We have two entity tables (toll_booths and cars) and a transaction table (drive_through). The toll_booths table uses a surrogate key because it has no natural attribute that is guaranteed not to change (the name can easily be changed). The cars table uses a natural primary key because it has a non-changing unique identifier (vin). The drive_through transaction table uses a surrogate key for easy identification, but also has a unique constraint on the attributes that are guaranteed to be unique at the time the record is inserted.
http://database-programmer.blogspot.com has some great articles on this particular subject.
There are no disadvantages of primary keys.
To add to what @MrWiggles and @Peter Parker said: when a table doesn't have a primary key, some applications won't let you edit data (they will say something like "cannot edit / delete data without a primary key"). PostgreSQL allows multiple NULL values in a UNIQUE column; a PRIMARY KEY doesn't allow NULLs. Also, some ORMs that generate code may have problems with tables without primary keys.
UPDATE:
As far as I know, it is not possible to replicate tables without primary keys in MSSQL, at least not without problems (details).
If something is a primary key then, depending on your DB engine, the entire table gets sorted by the primary key. This means that lookups are much faster on the primary key because there is no dereferencing, as there is with any other kind of index. Besides that, it's just theory.
In addition to what the other answers have said, some databases and systems may require a primary key to be present. One situation comes to mind: when using enterprise replication with Informix, a PK must be present for a table to participate in replication.
As long as you do not allow NULL for a value, they should be handled the same, but NULL values are handled differently across databases (AFAIK MS SQL does not allow more than one NULL value in a UNIQUE column, while MySQL and Oracle allow several). So to get the same behavior you must define the column as NOT NULL with a UNIQUE index.
There is no such thing as a primary key in relational data theory, so your question has to be answered on the practical level.
Unique indexes are not part of the SQL standard. The particular implementation of a DBMS will determine what are the consequences of declaring a unique index.
In Oracle, declaring a primary key will result in a unique index being created on your behalf, so the question is almost moot. I can't tell you about other DBMS products.
I favor declaring a primary key. This has the effect of forbidding NULLs in the key column(s) as well as forbidding duplicates. I also favor declaring REFERENCES constraints to enforce entity integrity. In many cases, declaring an index on the column(s) of a foreign key will speed up joins. This kind of index should in general not be unique.
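For example, a plain non-unique index on a foreign key column (hypothetical orders/customers schema):

CREATE INDEX ix_orders_customer_id ON orders (customer_id);
-- speeds up: SELECT ... FROM orders JOIN customers ON orders.customer_id = customers.id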
There are some disadvantages of CLUSTERED INDEXES vs UNIQUE INDEXES.
As already stated, a CLUSTERED INDEX physically orders the data in the table.
This means that when you have a lot of inserts or deletes on a table containing a clustered index, every time (well, almost, depending on your fill factor) you change the data, the physical table needs to be updated to stay sorted.
In relatively small tables this is fine, but when you get to tables with GBs worth of data and inserts/deletes affect the sorting, you will run into problems.
I almost never create a table without a numeric primary key. If there is also a natural key that should be unique, I put a unique index on it as well. Joins are faster on integers than on multicolumn natural keys, and data only needs to change in one place (natural keys tend to need updating, which is a bad thing when they participate in primary key / foreign key relationships). If you are going to need replication, use a GUID instead of an integer, but for the most part I prefer a key that is user-readable, especially if users need to see it to distinguish between one John Smith and another.
The few times I don't create a surrogate key are when I have a joining table that is involved in a many-to-many relationship. In this case I declare both fields as the primary key.
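A sketch of that pattern (hypothetical author/book tables): the two foreign key columns together form the primary key, so no surrogate is needed:

CREATE TABLE author_book (
    author_id INT NOT NULL,
    book_id INT NOT NULL,
    PRIMARY KEY (author_id, book_id)
);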
My understanding is that a primary key and a unique index with a NOT NULL constraint are the same (*); I suppose one chooses one or the other depending on what the specification explicitly states or implies (a matter of what you want to express and explicitly enforce). If it requires uniqueness and not-null, make it a primary key. If it just happens that all parts of a unique index are not-null without any requirement for that, just make it a unique index.
The sole remaining difference is that you may have multiple not-null unique indexes, while you can't have multiple primary keys.
(*) Except for one practical difference: a primary key can be the default unique key for some operations, like defining a foreign key. For example, if one defines a foreign key referencing a table and does not provide the column name, and the referenced table has a primary key, then the primary key will be the referenced column. Otherwise, the referenced column will have to be named explicitly.
Others here have mentioned DB replication, but I don't know about it.
A unique index can have one NULL value; it creates a nonclustered index by default. A primary key cannot contain NULL values; it creates a clustered index by default. (Both statements describe SQL Server.)
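Both defaults can be overridden explicitly. A T-SQL sketch on a hypothetical table:

CREATE TABLE t (
    id   INT NOT NULL,
    code INT NULL,
    CONSTRAINT pk_t PRIMARY KEY CLUSTERED (id),
    CONSTRAINT uq_t_code UNIQUE NONCLUSTERED (code)
);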
In MSSQL, primary keys should be monotonically increasing for best performance on the clustered index. Therefore an integer IDENTITY column is better than any natural key that might not be monotonically increasing.
If it were up to me...
You need to satisfy the requirements of the database and of your applications.
Adding an auto-incrementing integer or long id column to every table to serve as the primary key takes care of the database requirements.
You would then add at least one other unique index to the table for use by your application. This would be the index on employee_id, or account_id, or customer_id, etc. If possible, this index should not be a composite index.
I would favor indices on several fields individually over composite indices. The database will use the single-field indices whenever the WHERE clause includes those fields, but it can only use a composite index for a leftmost prefix of its columns: it can't use the second field of a composite index unless your WHERE clause also constrains the first.
I am all for using calculated or function-based indices, and would recommend them over composite indices. It is very easy to make use of a function index by using the same function in your WHERE clause.
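For instance, MySQL 8.0 supports functional index expressions directly (hypothetical customers table; note the extra parentheses the syntax requires):

CREATE INDEX ix_upper_name ON customers ((UPPER(name)));
-- usable when the query repeats the same expression:
SELECT * FROM customers WHERE UPPER(name) = 'SMITH';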
This takes care of your application requirements.
It is highly likely that other, non-primary indexes are actually mappings from each index's key values to primary key values, not to rowids. This allows physical sorting operations and deletes to occur without those indices having to be rebuilt.
There is a table that contains more id data than real data.
user_id int unsigned NOT NULL,
project_id int unsigned NOT NULL,
folder_id int unsigned NOT NULL,
file_id int unsigned NOT NULL,
data TEXT NOT NULL
The only way to create a unique primary key for this table would be a composite of (user_id, project_id, folder_id, file_id). I have frequently seen 2 column composite primary keys, but is it ok to have 4 or even more? According to MySQL: "All storage engines support at least 16 indexes per table and a total index length of at least 256 bytes. Most storage engines have higher limits.", so I know at least it is possible to do.
Past this, there are frequent queries to this table for various combinations of these ids. For example, find all projects for user X, find all files for user X, find all files for project Y and folder Z, etc. Should there be a separate individual index key on each of the id columns, or if there is a composite primary key that already contains all the columns does this make further individual keys redundant? There will be about 10 million - 50 million rows in the table at any time.
To summarize: is it ok to have a composite primary key with 4 (or more) id columns, and if there is a composite key does it make additional individual keys for each of those columns redundant?
Yes, it is ok to have a composite primary key with 4 or more columns.
It doesn't necessarily make additional keys for each of those columns redundant. For example, a key (a, b, c) will not be useful for a query SELECT ... WHERE b = 4. For that type of query you would rather have key (b) or key (b, c).
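Applied to the question's table (call it tbl, with PRIMARY KEY (user_id, project_id, folder_id, file_id)), a sketch of what the composite key can and cannot serve:

-- served by a leftmost prefix of the primary key:
SELECT * FROM tbl WHERE user_id = 1;
SELECT * FROM tbl WHERE user_id = 1 AND project_id = 2;

-- not a leftmost prefix, so it needs its own index:
SELECT * FROM tbl WHERE project_id = 2 AND folder_id = 3;
CREATE INDEX ix_project_folder ON tbl (project_id, folder_id);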
You need to examine your expected queries to determine which indexes you'll need. See this talk for more details: http://youtu.be/AVNjqgf7zNw
Yes, this is OK if the data model supports it. You haven't shared much about your overall DB schema and how these items relate to each other, so it's hard to determine whether this is the best approach. In other words, is this truly the only way in which these four items are related, or, for example, are files REALLY related to projects and projects related to users, such that splitting this into separate join tables makes more logical sense?
If you are querying individual columns within this primary key, this might suggest to me that your schema is not quite correct. At a minimum you might need to add individual indexes on these columns to support such queries.
You're going to regret creating a compound primary key: it becomes really obnoxious to address individual rows, and secondary indexes in MySQL must contain the primary key as the row identifier. You can create a compound UNIQUE key instead, though.
You can have a composite key with a fairly large number of components, though keep in mind the more you add the bigger the index will get and the slower it will be to update when you do an INSERT. As your database grows in size, insert operations may get cripplingly slow.
This is why, whenever possible, you should try and minimize your index size.
I would like advice about a MySQL table design for an event logger.
Our needs:
- track a lot of actions
- 10,000 actions / second
- 1 billion rows at this time
Our hardware:
- 2*Xeon (seen as 32 CPUs by the system)
- 128 GB RAM
- 6*600 SSD with RAID 10
Our table design:
CREATE TABLE IF NOT EXISTS `log_event` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `id_event` smallint(6) NOT NULL,
  `id_user` bigint(20) NOT NULL,
  `date` int(11) NOT NULL,
  `data` bigint(20) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `id_event_2` (`id_event`,`data`),
  KEY `id_user` (`id_user`),
  KEY `date` (`date`),
  KEY `id_event_4` (`id_event`,`date`,`data`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

ALTER TABLE `log_event`
  ADD CONSTRAINT `log_event_ibfk_1` FOREIGN KEY (`id_user`)
  REFERENCES `inscription` (`id_inscri`) ON DELETE CASCADE ON UPDATE CASCADE;
Our problems:
- We have an auto-increment as primary key, but it is not really used. Is it a problem to remove it? We would have no primary key then: how do we identify a row?
- We would like to do partitioning, but with the foreign key it seems to be impossible?
- We don't do bulk inserts. Is it a good idea to insert into a MEMORY table without indexes and copy the data every 5 minutes?
Do you have any ideas for optimizing this? Any best practices for this kind of system?
Thanks!
François
Primary keys of relational tables (relations) come in two types:
Natural - exists in the subject area and completely determines each row of the table.
Natural primary keys can be simple (consisting of only one column) or composite (consisting of more than one column). It is not recommended to put a natural primary key on a large string column.
Artificial - a special column injected by the database designer/developer to improve table performance: when the natural key is composite and has to be used in a related table (as a foreign key), when it is simple but large and would produce data overhead when copied into related tables as a foreign key, or when it is slow to search (for example, CRUD operations on VARCHAR IDs can be slower than on INT IDs). There may be other reasons. TL;DR: an artificial key is one special column that completely determines each row of a relational table and improves its performance for CRUD operations.
We have an auto-increment as primary key, but it is not really used. Is it a problem to remove it? We would have no primary key then: how do we identify a row?
If you do not need to reference your table from other tables (as a source), then you can probably remove the artificial key without any consequences. Still, I recommend you set some other PRIMARY KEY on this table to avoid data duplication, and for clarity (if that matters).
Your table by itself (if properly normalized) will have a natural key among its candidate keys. It might be composite (consisting of a few columns); that is normal. But don't make strings primary, because a PRIMARY KEY always carries an index, which produces data overhead. A combination of INT or small VARCHAR columns is fine.
Consider as an option: id_event + id_user + date.
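If that combination really is unique in your data (two events by the same user in the same second would break it), the swap is a single ALTER; an untested sketch:

ALTER TABLE log_event
    DROP COLUMN id,
    ADD PRIMARY KEY (id_event, id_user, `date`);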
We don't do bulk inserts. Is it a good idea to insert into a MEMORY table without indexes and copy the data every 5 minutes?
It is not a bad idea, but it is not a good idea until it is properly tested. Run a load test before using it for real.
If you do not reference the MEMORY table from other tables, you can still join it with any InnoDB table, but you lose InnoDB functionality (referential integrity). If losing ON DELETE CASCADE ON UPDATE CASCADE against the parent table is not a concern, it can be done. As for me, InnoDB is not so slow that switching the table engine is worth it in your case.
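If you do test that route, here is a minimal sketch of the staging pattern (column list copied from the log_event definition above; untested, and the flush would need guarding against concurrent writes):

CREATE TABLE log_event_buffer (
    id_event smallint NOT NULL,
    id_user bigint NOT NULL,
    `date` int NOT NULL,
    data bigint NOT NULL
) ENGINE=MEMORY;

-- every few minutes, flush the buffer in one bulk statement:
INSERT INTO log_event (id_event, id_user, `date`, data)
SELECT id_event, id_user, `date`, data FROM log_event_buffer;
DELETE FROM log_event_buffer;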
How optimal is it to have a primary key on 3 or 4 fields? If the table has say millions of records, is it going to be heavy on the server running a query such as:
SELECT * FROM my_table WHERE field_1 = '123' AND field_2 = '123' AND field_3 = 'hours';
The primary key is created on these fields:
field_1 int(11)
field_2 int(11)
field_3 varchar(20)
What I'm considering as an alternative is to keep those fields but put the primary key on a separate field holding an MD5 hash of the data, such as MD5(CONCAT(field_1, '-', field_2, '-', field_3)), so that my script queries just one field:
SELECT * FROM my_table WHERE field_hash = MD5('123-123-hours');
So basically I'm just wondering whether method 1 is as optimal as method 2 for a table with millions of records.
I'd say your best option is to use a surrogate auto-incrementing field as the PK. Failing that I'd just use the three fields.
The md5 hash doesn't seem worth the complexity. I really don't see the benefit of that approach in any scenario. Don't try to outsmart the DB engine. If a hash was indeed faster, the indexing engine would be implemented internally that way for composite keys. It is not, which should tell you something.
With the surrogate key you get faster joins, with the composite key you get some performance benefits when you have queries that only return fields that are part of the primary key (covering indexes).
You can read about the composite key performance from the answers to question Composite Primary Key performance drawback in MySQL
Before doing this kind of optimization, you should always measure the effects. That is, create two tables with the same data, one using the composite key and the other the hash, and test which one works better in your use case.
In general, I don't like using nonsensical key values if not absolutely necessary. If the hash is used as the primary key, it means that users of the database must be aware of the ID generation process. This leads to more documentation that will not be read, and to errors in the long run.
Instead of using the composite key, you might want to see if there is a possibility of normalizing your database further. Does the composite key represent a different entity, and should it actually form a second table where you can attach a surrogate key to the set of columns?
Another option is to use a surrogate key in the current table and then place a unique constraint on the current composite key, e.g.:
CREATE TABLE my_table (
    id int(11) PRIMARY KEY,
    field_1 int(11),
    field_2 int(11),
    field_3 varchar(20),
    CONSTRAINT uq_composite UNIQUE (field_1, field_2, field_3)
);
I would try to avoid using a non-sequential primary key (i.e. a string, or numbers that are randomly generated) because this causes more I/O on the disk and reduces performance on some storage engines (particularly MyISAM).