I have some InnoDB tables with only 2 int columns, which are foreign keys to the primary keys of other tables.
E.g. one table is user_items. It has 2 columns, userId and itemId, both foreign keys to the user and item tables, set to cascade on update or delete.
Should I add a 3rd column to such tables and make it a primary key, or is it better the way it is right now, in terms of performance or any other benefits?
Adding a third ID column just for the sake of adding an ID column makes no sense. In fact it simply adds processing overhead (index maintenance) when you insert or delete rows.
A primary key is not necessarily "an ID column".
If you only allow a single association between user and item (a user cannot be assigned the same item twice), then it does make sense to define (userid, itemid) as the primary key of your table.
If you do allow the same pair to appear more than once then of course you don't need that constraint.
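A minimal sketch of the composite-key idea, using Python's built-in sqlite3 module as a stand-in for MySQL/InnoDB (an assumption made only so the snippet runs anywhere). The table and column names come from the question; the sample values are made up.

```python
import sqlite3

# SQLite is only a stand-in here; the CREATE TABLE shape carries over to MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_items (
        userId INTEGER NOT NULL,
        itemId INTEGER NOT NULL,
        PRIMARY KEY (userId, itemId)  -- the pair itself is the key, no extra id column
    )
    """
)

conn.execute("INSERT INTO user_items (userId, itemId) VALUES (1, 10)")
conn.execute("INSERT INTO user_items (userId, itemId) VALUES (1, 11)")  # same user, other item

# Assigning the same item to the same user a second time violates the key.
try:
    conn.execute("INSERT INTO user_items (userId, itemId) VALUES (1, 10)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM user_items").fetchone()[0]
print(duplicate_rejected, row_count)
```

If the same pair is allowed to repeat, this constraint is exactly what you would leave out.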
You already have a natural key {userId, itemId}. Unless there is a specific reason to add another (surrogate) key, just use your existing key as primary.
Some reasons for the surrogate may include:
Keeping child FKs "slimmer".
Elimination of child cascading updates.
ORM-friendliness.
I don't think that any of this applies to your case.
Also, please be aware that InnoDB tables are clustered, and secondary indexes in clustered tables are more expensive than secondary indexes in heap-based tables. So ideally, you should avoid secondary indexes whenever you can.
In general, if it adds no real complexity to the code you're writing and the table is expected to contain no more than 100,000-500,000 rows, I'd recommend adding the primary key. I also sometimes recommend adding created_at and updated_at columns.
Yes, they require more storage -- but it's minimal. There's also the issue that the primary key index will have to be maintained and so inserts and updates may be slower if the table becomes large. But unless the table is large (hundreds of thousands or millions of rows) it will probably make no difference in processing speed.
So unless the table is going to be quite large, the space and processing speed impact are insignificant -- so you make the decision on how much effort it takes to maintain it and the potential utility it provides. If it takes very little extra code to do, then virtually any utility it provides might make it worthwhile.
One of the best reasons to have a primary key is to give the rows a natural order based on the order they were inserted. If you ever want to retrieve the last 100 (or first 100) rows added, it's very simple and fast if you have an auto-increment primary key on the table.
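As a rough illustration of the "last 100 rows" query, here is a sketch using Python's sqlite3 module in place of MySQL. The events table, its payload column and the last_rows helper are hypothetical names, and it assumes ids are handed out in insertion order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "  # hypothetical auto-increment key
    "payload TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{n}",) for n in range(1, 501)],
)


def last_rows(connection, limit):
    """Return the most recently inserted rows, newest first, by walking the key backwards."""
    return connection.execute(
        "SELECT id, payload FROM events ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()


newest = last_rows(conn, 100)
print(newest[0], newest[-1])
```

Because the ordering column is the primary key, the query needs no additional index.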
Adding inserted_at and updated_at columns can provide similar utility in terms of fetching data based on date ranges. Again, unless the number of rows is going to be very large, it may be worth evaluating these as well.
From what I understand, a fact table does not get its own primary key, and adding a surrogate key is somewhat a waste of space. Hence, the combination of foreign keys is the primary key for the fact table.
But in my case I was not able to do that, because the key combination can potentially repeat in the fact table, e.g. the same person paid twice on the same day in the same restaurant. In this case, the primary key is no longer unique...
Is there any way to solve this problem without adding a surrogate key?
Thanks in advance!
If you are building a table like this, a primary key or unique key combination is strongly recommended. If you want to avoid adding a surrogate PK, you could add a unique transaction number, so that the combination of customer number and transaction number serves as the key.
InnoDB, if you don't provide a PK, will provide one for you. But it is 6 bytes and hidden. Compared to a 4-byte surrogate INT, this is bigger!
Check the data; there may be a "natural" PK that is a column or combination of columns.
Generally, for DW, the only index I have on the Fact table is the PK. Then I use "Summary tables" for the bulk of accesses. These are smaller and faster. In an extreme case, I will purge old Fact rows (via DROP PARTITION) but hang onto the Summary tables 'forever'. This keeps the disk space in check, while losing virtually nothing useful of the data.
Bottom line: Provide an explicit PK for every table.
I am currently learning about foreign keys and trying to add them as much as I can in my application to ensure data integrity. I am using InnoDB on MySQL.
My clicks table has a structure something like...
id, timestamp, link_id, user_id, ip_id, user_agent_id, ... etc for about 12 _id columns.
Obviously these all point to other tables, so should I add a foreign key on them? MySQL is creating an index automatically for every foreign key, so essentially I'll have an index on every column? Is this what I want?
FYI - this table will essentially be my most bulky table. My research basically tells me I'm sacrificing performance for integrity but doesn't suggest how harsh the performance drop will be.
Right before inserting such a row, you did 12 inserts or lookups to get the ids, correct? Then, as you do the INSERT, it will do 12 checks to verify that all of those ids have a match. Why bother; you just verified them with the code.
Sure, have FKs in development. But in production, you should have weeded out all the coding mistakes, so FKs are a waste.
A related tip -- Don't do all the work at once. Put the raw (not-yet-normalized) data into a staging table. Periodically do bulk operations to add new normalization keys and get the _id's back. Then move them into the 'real' table. This has the added advantage of decreasing the interference with reads on the table. If you are expecting more than 100 inserts/second, let's discuss further.
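The staging-table idea might look roughly like the sketch below. It uses Python's sqlite3 module as a stand-in for MySQL, normalizes only one column (user_agent) for brevity, and the staging_clicks table and flush_staging helper are hypothetical names. SQLite's INSERT OR IGNORE plays the role that INSERT IGNORE would play in MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging_clicks (link_id INTEGER NOT NULL, user_agent TEXT NOT NULL);
    CREATE TABLE user_agents (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE clicks (
        id INTEGER PRIMARY KEY,
        link_id INTEGER NOT NULL,
        user_agent_id INTEGER NOT NULL
    );
    """
)


def flush_staging(connection):
    """Move all staged rows into clicks in one transaction, creating missing lookup rows in bulk."""
    with connection:  # commits on success, rolls back on error
        # 1. Add any user agents not seen before (existing names are skipped).
        connection.execute(
            "INSERT OR IGNORE INTO user_agents (name) "
            "SELECT DISTINCT user_agent FROM staging_clicks"
        )
        # 2. Copy the rows over, swapping the raw string for its id.
        connection.execute(
            "INSERT INTO clicks (link_id, user_agent_id) "
            "SELECT s.link_id, ua.id FROM staging_clicks AS s "
            "JOIN user_agents AS ua ON ua.name = s.user_agent"
        )
        # 3. Empty the staging table for the next batch.
        connection.execute("DELETE FROM staging_clicks")


conn.executemany(
    "INSERT INTO staging_clicks (link_id, user_agent) VALUES (?, ?)",
    [(1, "Firefox"), (2, "Chrome"), (3, "Firefox")],
)
flush_staging(conn)

agent_count = conn.execute("SELECT COUNT(*) FROM user_agents").fetchone()[0]
click_count = conn.execute("SELECT COUNT(*) FROM clicks").fetchone()[0]
staged_left = conn.execute("SELECT COUNT(*) FROM staging_clicks").fetchone()[0]
print(agent_count, click_count, staged_left)
```

Each batch costs a handful of set-based statements instead of a dozen lookups per click.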
The generic answer is that if you considered a data item so important that you created a lookup table for the possible values, then you should create a foreign key relationship to ensure you are not getting any orphan records.
However, you should reconsider, whether all data items (fields) in your clicks table need a lookup table. For example ip_id field probably represents an IP address. You can simply store the IP address directly in the clicks table, you do not really need a lookup table, since IP addresses have a wide range and the IP addresses are unique.
Based on the re-evaluation of the fields, you may be able to reduce the number of related tables, thus the number of foreign keys and indexes.
Here are three things to consider:
What is the ratio of reads to writes on this table? If you are reading much more often than writing, then more indexes could be good, but if it is the other way around then the cost of maintaining those indexes becomes harder to bear.
Are some of the foreign keys not very selective? If you have an index on the gender_id column then it is probably a waste of space. My general rule is that indexes without included columns should have about 1000 distinct values (unless values are unique) and then tweak from there.
Are some foreign keys rarely or never going to be used as a filter for a query? If you have a last_modified_user_id field but you never have any queries that will return a list of items which were last modified by a particular user then an index on that field is less useful.
A little bit of knowledge about indexes can go a long way. I recommend http://use-the-index-luke.com
I have a MySQL table of 3 integer fields. None of the fields has a unique value on its own, but the three of them combined are unique.
When I query this table, I only search by the first field.
Which approach is recommended for indexing such table?
Having a multiple-field primary key on the 3 fields, or setting an index on the first field, which is not unique?
Thanks,
Doori Bar
Both. You'll need the multi-field primary key to ensure uniqueness, and you'll want the index on the first field for speed during searches.
You can have a UNIQUE Constraint on the three fields combined to meet your data quality standards. If you are primarily searching by Field1 then you should have an index on it.
You should also consider how you JOIN this table.
Your indexes should really support the bigger workload first - you will have to look at the execution plan to determine what suits you best.
The primary key will prevent your application from accidentally inserting duplicate rows. You probably want that.
Order the columns in the PK correctly though or make an index on the first column clustered for better performance. Compare how the query runs (with the PK present) and with and without the index on the first column.
If you're using InnoDB, you must have a clustered index. If you don't specify one, MySQL will use one in the background anyway. So, you may as well use a clustered (unique) primary key by combining all three columns.
The primary key will also then prevent duplicates, which is a bonus.
If you're returning all three integer fields, then you'll have a covering index, which means that the database won't even have to touch the actual record. It will get everything it needs right from the index.
The only caveat would be inserts. Updating a clustered index, especially on multiple columns, does carry some performance penalty. It will be up to you to test and determine the best approach.
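To make the three-column key concrete, here is a small sketch with Python's sqlite3 module standing in for MySQL. The table name triples and the columns a, b, c are hypothetical. It assumes the field you search by is the leading column of the key, in which case the key's own index can serve that lookup. The query plan text is SQLite-specific and varies by version, so treat it as indicative only; check MySQL's EXPLAIN on your own data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE triples (
        a INTEGER NOT NULL,
        b INTEGER NOT NULL,
        c INTEGER NOT NULL,
        PRIMARY KEY (a, b, c)  -- searched column first, so lookups on a can use the key
    )
    """
)
conn.executemany(
    "INSERT INTO triples (a, b, c) VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 2, 1), (2, 1, 1)],
)

# Lookup by the first field only.
matches = conn.execute(
    "SELECT a, b, c FROM triples WHERE a = ? ORDER BY b, c", (1,)
).fetchall()

# Ask SQLite how it would run the lookup; the last column holds the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT a, b, c FROM triples WHERE a = ?", (1,)
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)

print(matches)
print(plan_detail)
```

Whether an extra single-column index still helps is something to measure rather than assume.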
Can anyone tell me if a table in a relational database (such as MySQL / SQL SERVER) can be without a primary key?
For example, I could have table day_temperature, where I register temperature and time. I don't see the reason to have a primary key for such a table.
Technically, you can declare such a table.
But in your case, the time should be made the PRIMARY KEY, since it's probably wrong to have different temperatures for the same time and probably useless to have the same reading more than once.
Logically, each table should have a PRIMARY KEY so that you could distinguish two records.
If you don't have a candidate key in your data, just create a surrogate one (AUTO_INCREMENT, SERIAL or whatever your database offers).
The only excuse for not having a PRIMARY KEY is a log or similar table which is subject to heavy DML, where having an index on it would impact performance beyond the level of tolerance.
As always, it depends.
A table does not have to have a primary key. It is much more important to have the correct indexes. How a primary key affects indexes depends on the database engine (e.g. whether it creates a unique index for the primary key column or columns).
However, in your case (and 99% of other cases too), I would add a new auto-increment unique column like temp_id and make it a surrogate primary key.
It makes maintaining this table much easier -- for example finding and removing records (e.g. duplicated records) -- and believe me -- for every table there comes a time to fix things :(.
If the possibility of having duplicate entries (for example for the same time) is not a problem, and you don't expect to have to query for specific records or range of records, you can do without any kind of key.
You don't need a PK, but it's recommended that you have one. It's the best way to identify unique rows. Sometimes you don't want an auto-incremented int PK, but would rather create the PK on something else. For example in your case, if there's only one unique row per time, you should create the PK on the time. It makes lookups based on time faster, plus it ensures that they're unique (you can be sure that the data integrity isn't violated).
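A minimal sketch of a time-keyed table, with Python's sqlite3 module standing in for MySQL. The table name is the one from the question; the column names measured_at and temperature and the sample readings are made up. Timestamps are stored as ISO-8601 text here only because SQLite has no DATETIME type; in MySQL a DATETIME column would be the natural choice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE day_temperature (
        measured_at TEXT NOT NULL PRIMARY KEY,  -- ISO-8601 text sorts chronologically
        temperature REAL NOT NULL
    )
    """
)
readings = [
    ("2024-01-01 06:00:00", 2.5),
    ("2024-01-01 12:00:00", 7.0),
    ("2024-01-02 06:00:00", 1.0),
]
conn.executemany("INSERT INTO day_temperature VALUES (?, ?)", readings)

# A second reading for an already recorded time is refused by the key.
try:
    conn.execute(
        "INSERT INTO day_temperature VALUES (?, ?)", ("2024-01-01 12:00:00", 9.9)
    )
    second_reading_rejected = False
except sqlite3.IntegrityError:
    second_reading_rejected = True

# Range lookups on the key column can use the primary key index.
first_day = conn.execute(
    "SELECT measured_at, temperature FROM day_temperature "
    "WHERE measured_at >= ? AND measured_at < ? ORDER BY measured_at",
    ("2024-01-01 00:00:00", "2024-01-02 00:00:00"),
).fetchall()
print(second_reading_rejected, first_day)
```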
Even if you do not add a primary key to an InnoDB table in MySQL, MySQL adds a hidden clustered index to that table. If you do not define a primary key, MySQL locates the first UNIQUE index where all the key columns are NOT NULL and InnoDB uses it as the clustered index.
If the table has no primary key or suitable UNIQUE index, InnoDB internally generates a clustered index GEN_CLUST_INDEX on a synthetic column containing row ID values.
https://dev.mysql.com/doc/refman/8.0/en/innodb-index-types.html
The time would then become your primary key. It will help index that column so that you can query data based on say a date range. The PK is what ultimately makes your row unique, so in your example, the datetime is the PK.
I would include a surrogate/auto-increment key, especially if there is any possibility of duplicate time/temperature readings. You would have no other way to uniquely identify a duplicate row.
I ran into the same question on one of the tables I made.
The problem was that the PK was supposed to be composed of all the columns of the table. That works, but it means the table size grows very fast with each row inserted.
I chose not to have a PK, and only have an index on the column I do the lookup on.
When you replicate a database on MySQL, a table without a primary key may cause delay in the replication.
http://lists.mysql.com/mysql/227217
The most common mistake when using ROW or MIXED is the failure to
verify that every table you want to replicate has a PRIMARY KEY on
it. This is a mistake because when a ROW event (such as the one
documented above) is sent to the slave and neither the master's copy
nor the slave's copy of the table has a PRIMARY KEY on the table,
there is no way to easily identify which unique row you want
replication to change.
According to your answer I would consider three options:
Put a PK on both cols together. This still allows multiple rows with the same temp or the same time; it only rules out two rows with the same temp AND time.
Don't put a PK at all, but do put a unique index on both cols (one unique index containing both cols). This would allow for nulls in temp and time, but needs more space to maintain the index.
These two options would be best for retrieval speed if you have heavy reads, but would result in a lower insert rate, as indices would have to be updated as well.
Don't put any index at all, nor a PK. This would be best for inserts but very bad for searching. It is useful for logging where retrieval is done by another mechanism, or when the inserting device is not required to check for dups.
Also, it is very important to consider cardinality here and think about the future consequences of using an auto-incremented number. If you're planning to do A LOT OF inserts, an auto-incremented INT is a risk because it can eventually run out (an unsigned INT tops out at 4,294,967,295), although an unsigned BIGINT is large enough that this is rarely a practical concern. In your example I guess you'll be saving data daily - for how long? This would matter more if you saved temp every minute... so I'll take this as an extreme example.
I guess it is best to think about what you need from the table. Are you doing "save-and-forget" for the entire year for the temp at every minute? Are you going to use this table frequently in real-time decision making in your business logic? I think it is best to segregate data necessary for real-time use (OLTP) from long-term data that is seldom required and whose retrieval latency is allowed to be high (OLAP). It's even worth duplicating the data into two different tables: one heavily indexed and erased once in a while to control cardinality, and a second one saved on a magnetic disk with almost no indices at all (it is possible to transfer a schema from your main fs into another fs).
I've got a better example of a table that doesn't need a primary key - a joiner table. Say I have a table with something called "capabilities", and another table with something called "groups", and I want a joiner table that tells me all the capabilities that all the groups might have, so it's basically
create table capability_group
( capability_id varchar(32),
group_id varchar(32));
There is no reason to have a primary key on that, because you never address a single row - you either want all the capabilities for a given group, or all the groups for a given capability. It would be better to have a unique constraint on (capability_id, group_id), and separate indexes on both fields.
I have some MySQL tables that have auto-incrementing ids that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key, so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help to have related tables, where you reference the content on one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates a row from another row (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is ok to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response is, in and of itself, downvotable.
Please do educate me, and the OP, by leaving comments for or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. As promised, I returned and added to my response a list of reasons why it may be detrimental to [again, systematically] impose an auto-incremented PK.
My original response:
You bet! you can remove these!
Before you do anything to the database, make sure you have a backup, in particular if the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically
ALTER TABLE myTable DROP COLUMN id
(you also need to remove the PK constraint before removing the id, if the table has such a constraint)
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an autoincremented ID key, regardless of the relative little extra storage requirement these keys add.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or, the application manages the key generation
The key supplied "natively" in the data doesn't necessarily need to be a single-column key; it can be a composite key, although in these cases one may wish to study the situation more closely, particularly if the overall key is a bit long.
Here are some of the drawbacks of using an auto-incremented primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked
i.e. the server may allow record insertions or updates which create a duplicated [native] key (even though the artificial, auto-incremented primary key hides this reality)
When relying on the auto-incremented PK for the support of joins between tables, when part of the [native] key values have to be updated...
...we either create the need to delete the record in full and re-insert it with the new values,
...or the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without a native or application-supplied primary key, not so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can be made more difficult depending on the DBMS (need to declare the underlying column as a plain integer prior to the copy, then need to start the auto-increment again...)
For narrow tables, i.e. tables with a few columns only, the relative cost of the auto-incremented PK can be significant, and impact performance in a non negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
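The first drawback in the list, integrity going unchecked, can be sketched as follows. Python's sqlite3 module stands in for MySQL, and the two membership tables and the insert_pair helper are hypothetical names invented for the illustration. The only difference between the tables is a UNIQUE constraint on the native key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Surrogate key only: nothing guards the native (user_id, group_id) pair.
    CREATE TABLE membership_loose (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL,
        group_id INTEGER NOT NULL
    );
    -- Same surrogate key, plus a constraint on the native key.
    CREATE TABLE membership_strict (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER NOT NULL,
        group_id INTEGER NOT NULL,
        UNIQUE (user_id, group_id)
    );
    """
)


def insert_pair(connection, table, user_id, group_id):
    """Try to insert one pair; return True if the server accepted it."""
    # table is one of the two fixed names above, never user input.
    try:
        connection.execute(
            f"INSERT INTO {table} (user_id, group_id) VALUES (?, ?)",
            (user_id, group_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Insert the same pair twice into each table.
loose_results = [insert_pair(conn, "membership_loose", 1, 2) for _ in range(2)]
strict_results = [insert_pair(conn, "membership_strict", 1, 2) for _ in range(2)]
print(loose_results, strict_results)
```

With only the auto-incremented id, the duplicate pair is accepted silently; each copy simply gets a fresh id.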
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
Yes, if you can figure out another primary key.
There is obviously a flaw in your table design. For example, say you had a table like
relation_id(PK), parent_id, child_id .
If it is known that the combination of parent_id and child_id is unique, then you can make the primary key (parent_id, child_id) and drop the column relation_id.
There may be endless other possible cases, but just bear in mind that a primary key helps you locate data quickly, as well as helping your design make sense.