I'm helping with a Rails application that is intended to be multi-tenanted. This means there will be data from multiple users/organisations in the database tables, and the access path will often be along the lines of "get me all the data for my organisation".
We're using MySQL as the database.
Rails by default creates a primary key on the table using the id column. The id column is auto-incremented. This is nice in some ways - rows are always added at the end of the table. However, consider the following situation:
An object called foo. A foo has an id, and always has an organisation_id
Over time each organisation creates foos in the database, these foos are interleaved throughout the table (they are stored in id sequence)
A use case that involves listing all foos for this organisation
The problem I have is that the foos for an organisation are not located closely together in the database, in fact they're spread around very sub-optimally. Ideally I'd create a primary key of (organisation_id, id) on the table, which would result in all foos for a given organisation being side by side in the table.
Unfortunately, when I do this Rails gives me an 'Unknown primary key for table foos in model Foo' error. I think I could deal with this by using the composite keys gem to rails, but it seems like there should be some way to make this transparent at the database level.
Is there an alternate approach?
For reference, the command on the database to change my index was:
ALTER TABLE foos ADD KEY (id); # needed because the id column is auto-increment
ALTER TABLE foos DROP PRIMARY KEY, ADD PRIMARY KEY(organisation_id, id);
EDIT 1: A blog post that indicates success doing exactly this with composite_primary_keys gem. Which gives me a bit more confidence with that approach, problem is that it's from 2008, so things may have moved on. http://www.joehruska.com/?p=6
EDIT 2: Another option I was considering was partitioning instead - the number of organisations probably wouldn't exceed the maximum partitions, and I could probably group them a bit without losing too much benefit. Unfortunately, the key quote from the MySQL manual is "every unique key on the table must use every column in the table's partitioning expression" (this also includes the table's primary key): http://dev.mysql.com/doc/refman/5.6/en/partitioning-limitations-partitioning-keys-unique-keys.html
So I'm still back needing a composite primary key again. I'm a little surprised that Rails cares so much about the primary key, rather than simply that a key is present.
If you don't want to use composite_primary_keys then you may be stuck just relying on a standard index on :organisation_id or [:organisation_id, :id]
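As a rough sketch of that fallback, the snippet below builds a foos table with the default id primary key plus a secondary index on (organisation_id, id) and checks that the "all foos for my organisation" query is answered from the index. It uses Python's built-in sqlite3 purely as a runnable stand-in for MySQL, so the query plan text is SQLite's, not MySQL's; the index name and the sample data are made up for illustration.

```python
import sqlite3

# SQLite is only a stand-in here; the same CREATE INDEX statement is valid in MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foos ("
    " id INTEGER PRIMARY KEY,"
    " organisation_id INTEGER NOT NULL,"
    " name TEXT)"
)
# Roughly what add_index :foos, [:organisation_id, :id] would produce in a migration.
# The index name is hypothetical.
conn.execute(
    "CREATE INDEX index_foos_on_organisation_id_and_id"
    " ON foos (organisation_id, id)"
)

# Interleave rows from three organisations, as described in the question.
conn.executemany(
    "INSERT INTO foos (organisation_id, name) VALUES (?, ?)",
    [(i % 3 + 1, f"foo{i}") for i in range(30)],
)

query = "SELECT id FROM foos WHERE organisation_id = ? ORDER BY id"
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (2,)).fetchall()
plan_text = " ".join(row[3] for row in plan)  # column 3 holds the plan detail
ids = [row[0] for row in conn.execute(query, (2,))]
print(plan_text)
```

The rows still live in id order in the table itself; the index only gives the lookup a contiguous range to walk, which is why it is a weaker substitute for the composite primary key.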
My understanding is that Rails cares so much about primary keys because of the assumptions it makes about relationships between models. Perhaps it should be improved; you could always suggest it as a future feature.
In MySQL, one can create an index on a (non-unique) column along with a table, e.g.
create table orders(
orderid varchar(20) not null unique,
customerid varchar(20),
index(customerid)
);
Having not found a corresponding option in Oracle, i.e. creating the index on table creation rather than as a separate command afterwards, I suspect it is not possible. Is this correct? If so, what is the reason behind this - efficiency, as for example discussed in "Insertion of data after creating index on empty table or creating unique index after inserting data on oracle?"
Thanks in advance!
Other than indexes defined as part of a primary or unique constraint there does not appear to be a way to define an index as part of a CREATE TABLE statement in Oracle. Although the USING INDEX clause is part of the constraint-state element of the CREATE TABLE statement, a missing right parenthesis error is issued if you try to include a USING INDEX clause in any constraint definition except a PRIMARY or UNIQUE constraint - see this db<>fiddle for examples.
As to "why" - that's a question only someone on the architecture team at Oracle could answer. From my personal user-oriented point of view, I see no particular value to being able to create an index as part of the CREATE TABLE statement, but then I'm accustomed to how Oracle works and have my thought patterns oriented in that particular direction. YMMV.
The root reason for the difference is that MySQL and Oracle are two distinctly different products, developed at different times by different software engineering teams. The fact that MySQL is owned by Oracle means nothing in this case. MySQL was a separate and separately developed product which was subsequently purchased by Oracle. As for why the two separate and distinct design teams made the decisions they did ... you'd have to ask them. But I'm pretty certain it has nothing to do with operational efficiency as you suggest. Once a table and index are created, there is no difference between having created an index as part of the CREATE TABLE vs. creating the index separately. And so there would be no difference in efficiency of any DML on said table.
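For what it's worth, here is the two-statement form of the orders example from the question: the table first, then the index as a separate command. The sketch runs against Python's built-in sqlite3 only so that it is executable anywhere; it is not Oracle, and the index name orders_customerid_idx is an assumption for illustration.

```python
import sqlite3

# sqlite3 is a stand-in engine so the sketch can be run without a server.
conn = sqlite3.connect(":memory:")

# Step 1: the table, with the unique constraint from the question.
conn.execute(
    "CREATE TABLE orders ("
    " orderid VARCHAR(20) NOT NULL UNIQUE,"
    " customerid VARCHAR(20))"
)

# Step 2: the non-unique index, created as a separate statement.
# The index name is hypothetical.
conn.execute("CREATE INDEX orders_customerid_idx ON orders (customerid)")

# List the indexes the table ended up with (column 1 is the index name).
index_names = [row[1] for row in conn.execute("PRAGMA index_list('orders')")]
print(index_names)
```

Once both statements have run, the table carries the same two indexes it would have had if the second one had been declared inline.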
I think I should be counted as database newbie, so read the question as a newbie question. I currently create a table, which holds environment variables for a number of hosts, like this:
create table envs (
host varchar(255),
envname varchar(255),
envvalue varchar(8192),
PRIMARY KEY(host, envname)
);
Very simple, one table holding all the data I need. Common operation is to get all the environment variables for a given host, another is to get a given environment variable for a given host, third example operation would be to get a given environment variable for all hosts and list duplicates.
Performance is not expected to be an issue, it's going to be maybe tens of hosts, dozens of variables per host, average max 1 query per second.
Now I've read that having composite primary key is not necessarily a good idea. Is this true for above use case? If it is true, how should I change the database design? If not, is the above one-table database fine for the purposes I listed above?
I don't see a problem here with the primary key. The semantics of a primary key is to uniquely identify the non-key attribute values for the key values. As I assume that for one host and one envname there is at most one envvalue the primary key makes perfect sense.
It could be that some people argue against composite primary keys because they are afraid of performance issues. However performance considerations should never influence the choice of the primary key. Many database systems automatically create an index structure for the primary key; the choice of this index structure can influence performance. However this choice can mostly be changed manually and should be done at a later point if you really have performance issues.
Your one-table design and choice of primary key is fine.
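To make that concrete, here is a small sketch of the three operations from the question run against the table exactly as defined there. It uses Python's built-in sqlite3 as a stand-in for whatever database you end up with, and the host names and values are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, sample data only
conn.execute(
    "CREATE TABLE envs ("
    " host VARCHAR(255),"
    " envname VARCHAR(255),"
    " envvalue VARCHAR(8192),"
    " PRIMARY KEY (host, envname))"
)
conn.executemany(
    "INSERT INTO envs VALUES (?, ?, ?)",
    [
        ("srv01", "PATH", "/usr/bin"),
        ("srv01", "LANG", "C"),
        ("srv02", "PATH", "/usr/bin"),
        ("srv02", "LANG", "en_GB.UTF-8"),
    ],
)

# 1. All environment variables for a given host.
all_for_host = conn.execute(
    "SELECT envname, envvalue FROM envs WHERE host = ? ORDER BY envname",
    ("srv01",),
).fetchall()

# 2. One environment variable for one host (a primary key lookup).
one_value = conn.execute(
    "SELECT envvalue FROM envs WHERE host = ? AND envname = ?",
    ("srv02", "LANG"),
).fetchone()[0]

# 3. One variable across all hosts, keeping only values shared by several hosts.
duplicates = conn.execute(
    "SELECT envvalue, COUNT(*) FROM envs"
    " WHERE envname = ? GROUP BY envvalue HAVING COUNT(*) > 1",
    ("PATH",),
).fetchall()
```

The first two queries are served directly by the composite key; the third does not start with host, so on a bigger table it might want its own index on envname, but at tens of hosts that hardly matters.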
Now I've read that having composite primary key is not necessarily a good idea. Is this true for above use case?
No. Use a composite primary key on (host, envname).
If it is true, how should I change the database design?
N/A.
If not, is the above one-table database fine for the purposes I listed above?
Yes: it's known as the Entity–Attribute–Value model.
It's a bad idea, because you store unique values (host, envname) several times.
What if you were to change the hostname from srv01 to srv01_new? You'd have to change every occurrence of srv01 in your table. And what if, some day, you decide you need to create a new table that holds additional information about every single host?
Now, if you change the hostname, you have to change that information as well.
To get to your question: It's not an issue of performance, but of normalization.
Databases should generally be normalized as far as possible. If you are intrigued enough, read on.
You should create one table for your hosts, having a unique id (int) as primary key and a unique (index) name as the hostname.
Your table should then only reference the id of the host, not the name. This way, your hostname is only stored once in your whole database and can be altered to whatever you desire, without breaking other tables.
If your environment names are unique, too, you should create another table for those, having the same layout as the hosts table (id, name).
Your combination table then stores the id of the host and the id of the environment, along with the value. You must of course keep the combined primary key, so every combination of host/environment is unique and easily indexable.
Then, you have a many-to-many-relationship with additional attributes and perfect normalization.
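A minimal sketch of that layout, assuming the table names hosts, envnames and host_envs (the names are mine, pick whatever fits), run through Python's built-in sqlite3 as a stand-in database. It also shows the rename from srv01 to srv01_new touching exactly one row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine; table names are hypothetical
conn.executescript(
    """
    CREATE TABLE hosts (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(255) NOT NULL UNIQUE
    );
    CREATE TABLE envnames (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(255) NOT NULL UNIQUE
    );
    -- The combination table keeps the combined primary key.
    CREATE TABLE host_envs (
        host_id    INTEGER NOT NULL REFERENCES hosts(id),
        envname_id INTEGER NOT NULL REFERENCES envnames(id),
        envvalue   VARCHAR(8192),
        PRIMARY KEY (host_id, envname_id)
    );
    INSERT INTO hosts (id, name) VALUES (1, 'srv01');
    INSERT INTO envnames (id, name) VALUES (1, 'PATH'), (2, 'LANG');
    INSERT INTO host_envs VALUES (1, 1, '/usr/bin'), (1, 2, 'C');
    """
)

# Renaming the host changes one row; host_envs is untouched.
cur = conn.execute("UPDATE hosts SET name = 'srv01_new' WHERE name = 'srv01'")
rows_changed = cur.rowcount

rows = conn.execute(
    "SELECT h.name, e.name, he.envvalue"
    " FROM host_envs he"
    " JOIN hosts h ON h.id = he.host_id"
    " JOIN envnames e ON e.id = he.envname_id"
    " ORDER BY e.name"
).fetchall()
```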
We have built an application with MySQL as the database. Every week we export the data dump from the database, and delete all the data. Now we want to merge all these dumps together for some data-analysis tasks.
The problem we are facing is that the "id" field for all the tables is Auto-Increment, so it starts with 1 in all the data dumps, which causes duplicate IDs in the table. I am sure there must be better ways to do it since it should be a pretty common task in MySQL administration.
What would be the best way to go about it?
If you can easily identify your foreign key fields (like they take the form *_id) then you can use the scripting language of your choice to modify the primary and foreign keys in the dump files by adding an "id space offset".
For example, let's say you have two dump files and you know their primary key range does not exceed 1,000,000: you increment the primary and foreign keys in the second dump file by 1,000,000.
This is not entirely trivial to implement, as you will have to detect the position of the foreign key fields in the statements and then modify values at the same column position elsewhere in the statement.
If your foreign keys are not easily identifiable by a common naming convention then you must keep separate information per table about how to find their positions based on column position.
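Here is a small sketch of the offset idea in Python. It deliberately skips the hard part (parsing the INSERT statements out of the dump) and assumes each row has already been turned into a dict of column name to value; the function name offset_ids, the sample columns and the *_id naming convention are all assumptions for illustration.

```python
def offset_ids(row, offset):
    """Return a copy of row with id and every *_id column shifted by offset.

    Assumes the primary key is called "id" and foreign keys end in "_id".
    NULL keys (None) are left alone.
    """
    shifted = {}
    for column, value in row.items():
        is_key = column == "id" or column.endswith("_id")
        if is_key and value is not None:
            shifted[column] = value + offset
        else:
            shifted[column] = value
    return shifted


OFFSET = 1_000_000  # must exceed the largest id in the first dump

# Rows from the second dump file, already parsed (hypothetical data).
dump_two = [
    {"id": 1, "customer_id": 7, "total": 30},
    {"id": 2, "customer_id": None, "total": 12},
]
merged_rows = [offset_ids(row, OFFSET) for row in dump_two]
```

Because the same offset is applied to the primary key and to every foreign key within one dump, rows that pointed at each other before still point at each other afterwards.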
Good luck.
The best way would be that you have another database that acts as data warehouse into which you copy the contents of your app's database. After that, you don't truncate all the tables, you simply use DELETE FROM tablename - that way, your auto_increments won't get reset.
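To illustrate the point about the counter surviving a DELETE, here is a tiny sketch. It runs on Python's built-in sqlite3 with an AUTOINCREMENT column as a stand-in for a MySQL AUTO_INCREMENT column, so treat it as an analogy rather than proof of MySQL's behaviour; the events table is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the application database
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)", [("a",), ("b",), ("c",)]
)

# After the weekly export: remove the rows, but do not truncate or recreate the table.
conn.execute("DELETE FROM events")

# The next insert carries on from the previous high-water mark.
cur = conn.execute("INSERT INTO events (payload) VALUES (?)", ("d",))
next_id = cur.lastrowid
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```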
It's an ugly solution to have something exported, then truncate the database, then expect an import will proceed properly. Even if you go around the problem of clashing auto increments (there's the INSERT ... ON DUPLICATE KEY UPDATE statement that allows you to do something if a unique key constraint fails), nothing guarantees that relations between tables (foreign keys) will be preserved.
This is a broad topic and the solution given is quick and not nice. Some other people will probably suggest other methods, but if you are doing this to offload the db your app uses - it's a bad design. Try to google MySQL's partitioning support if you're aiming for better performance with a larger data set.
For the data you've already dumped, load it into a table that doesn't use the ID column as a primary key. You don't have to define any primary key. You will have multiple rows with the same ID, but that won't impede your data analysis.
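A quick sketch of what that could look like, using Python's built-in sqlite3 as a stand-in. The dump_label column is my own addition so the weeks can still be told apart after merging; it is not required, and the table name and numbers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the analysis database
# No primary key at all: id is just an ordinary column here.
# dump_label is a hypothetical extra column recording which dump a row came from.
conn.execute(
    "CREATE TABLE orders_archive (dump_label TEXT, id INTEGER, total INTEGER)"
)

week1 = [(1, 30), (2, 12)]  # (id, total) pairs from the first dump
week2 = [(1, 8), (2, 50)]   # ids start at 1 again in the second dump

for label, rows in (("week1", week1), ("week2", week2)):
    conn.executemany(
        "INSERT INTO orders_archive VALUES (?, ?, ?)",
        [(label, row_id, total) for row_id, total in rows],
    )

row_count = conn.execute("SELECT COUNT(*) FROM orders_archive").fetchone()[0]
total_by_week = conn.execute(
    "SELECT dump_label, SUM(total) FROM orders_archive"
    " GROUP BY dump_label ORDER BY dump_label"
).fetchall()
```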
Going forward, you can set up a discipline where you dump and then DELETE the rows that are more than, say, one day old. That way your ID will keep incrementing.
Or, you can copy this data to a table that uses the ARCHIVE storage engine. This is good for retaining data for analysis, because it compresses its contents.
So I'm trying to do "my own" version of phpMyAdmin in the sense that I'm trying to do a bunch of general operations to tables.
Right now, I'm stuck at the 'edit a row' operation. Is there a command to edit the last selected row that I can use? Is there something that would let me do something along the lines of
update t set <blah blah> where (select * from t limit 0,1);
I ask because I can't think of any other unique characteristics that my rows have as some primary keys are combinations of two foreign keys.
Thanks!
While you're probably expecting an answer along these lines, I'll step up and say it anyway: you should reconsider your database structure.
Combining two foreign keys together into a single primary key (alternatively, multiple primary keys) is a great way to force yourself into corners. You'll have to write a lot of custom code that will be unique to your database, which others will have a difficult time understanding, and therefore it'll be difficult to get help. It'll also become difficult to debug your own code, since you'll have problems returning to this non-standard code in the future when your project has grown.
Ideally, you should have a unique index on those two foreign keys, but have a single primary key that's automatically generated. You can use the primary key for operations like what you're suggesting, but also have fast lookup times on the foreign keys because of the index.
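Roughly what I mean, sketched with Python's built-in sqlite3 as a stand-in for MySQL. The memberships table and its columns are hypothetical; the point is the shape: a generated id for "edit this row", plus a unique index that keeps the foreign key pair unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, hypothetical table
conn.executescript(
    """
    CREATE TABLE memberships (
        id       INTEGER PRIMARY KEY,   -- single generated key
        user_id  INTEGER NOT NULL,      -- foreign key 1
        group_id INTEGER NOT NULL,      -- foreign key 2
        role     TEXT
    );
    CREATE UNIQUE INDEX memberships_user_group
        ON memberships (user_id, group_id);
    INSERT INTO memberships (user_id, group_id, role)
        VALUES (1, 10, 'member'), (2, 10, 'member');
    """
)

# Look the row up by its foreign keys, then edit it by its single id.
row_id = conn.execute(
    "SELECT id FROM memberships WHERE user_id = ? AND group_id = ?", (2, 10)
).fetchone()[0]
conn.execute("UPDATE memberships SET role = 'admin' WHERE id = ?", (row_id,))

# The unique index still rejects a second row for the same pair.
try:
    conn.execute(
        "INSERT INTO memberships (user_id, group_id, role) VALUES (1, 10, 'dup')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

roles = conn.execute(
    "SELECT user_id, role FROM memberships ORDER BY user_id"
).fetchall()
```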
I have some mysql tables that have auto incrementing id's that are primary keys, but I notice that I never actually use them... I used to think that every table must have a primary key so I guess that is why I created them before. Should I remove them all if I don't use them at all?
Unless you are running into space problems I wouldn't remove them.
They are a life saver in case you by mistake (or oversight) populate the database with repeated/wrong data.
They also help to have related tables, where you reference the content on one table through the autogenerated id.
This is assuming you have indexes for the other columns you use to actually query the data (if you don't, then more reason to keep the autoincrement ids and use them!).
No.
You should keep them; a database always needs something that differentiates a row from another row (a "Key" of some sort).
If you have something that is guaranteed to be unique for each row, then you can use that as a key; otherwise keep the Primary Key and the Auto generated ID.
I'd personally keep them. They will be especially useful at a later date if you expand the database design and need to reference this table.
Interesting!...
I seem to hold a minority opinion here, getting both upvoted and downvoted to currently an even 0, yet no one in the majority opinion (see responses above) seems to make much of a case for keeping the id field, and the downvoters didn't even bother leaving comments hinting at why doing away with the id is such a bad idea.
In their defense, my own original response did not include any strong argument as to why it is ok to do away with the id attribute in some cases (which seem to apply to the OP). Maybe such a gratuitous response makes it, in and of itself, a downvotable response.
Please do educate me, and the OP, by leaving comments pro or against the _systematic_ (and I stress "systematic") need to include auto-incremented non-semantic primary keys in all tables. As promised, I returned and added to my response to provide a list of reasons why it may be detrimental to [again, systematically] impose an auto-incremented PK.
My original response:
You bet! You can remove these!
Before you do anything to the database make sure you have a backup, in particular if the DB size is significant.
Use the ALTER TABLE statement to remove the id in the tables where you want to remove it. Specifically
ALTER TABLE myTable DROP COLUMN id
(you also need to remove the PK constraint before removing the id, if the table has such a constraint)
EDIT (Added later)
There are many cases where it just doesn't make sense to carry along an autoincremented ID key, regardless of the relative little extra storage requirement these keys add.
In all these cases, the underlying implication is that
either the data itself supplies a primary key,
or, the application manages the key generation
The key supplied "natively" in the data doesn't necessarily need to be a single column key, it can be a composite key, although in these cases one may wish to study the situation more closely, particularly if the overall key is a bit long.
Here are some of the drawbacks of using an auto-incremented primary key in lieu of a native or application-supplied key:
The effective data integrity may go unchecked
i.e. the server may allow record insertions or updates which create a duplicated [native] key (even though the artificial, autoincremented primary key hides this reality)
When relying on the auto-incremented PK for the support of joins between tables, when part of the [native] key values have to be updated...
...we either create the need of deleting the record in full and re-inserting it with the new values,
...or the risk of keeping outdated/incorrect links.
A common "follow-up" with auto-incremented keys is to create a clustered index on the table for this key.
This does make sense for tables without a native or application-supplied primary key, not so much for data sets that have such keys.
Effectively this prevents choosing a key for the clustered index which may be more beneficial for the most common query patterns.
Migrating tables with an auto-incremented key can be made more difficult depending on the DBMS (need to declare the underlying column as plain integer prior to the copy, then need to start the autoincrement again...)
For narrow tables, i.e. tables with a few columns only, the relative cost of the auto-incremented PK can be significant, and impact performance in a non negligible fashion.
When inserting new records along with associated records in related tables, the auto-incremented key needs to be obtained after the insertion of the main record, before the related records can be inserted; the logic is simpler when the column values supporting the link are known ahead of time.
To summarize, the idea that so long as the storage can carry the [relatively minimal] extra "weight" of the artificial primary key, we should include and use such a key, is not without drawbacks of its own.
A final consideration is that just like it is rather easy to remove such keys when we don't need them, they too can be easily added, post-facto, when/if it becomes apparent that they are useful in a particular situation. Neither form of refactoring (adding vs. removing the auto-incremented columns) is risk free, but neither is a major production either.
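To illustrate the last drawback in the list above (having to fetch the generated key before inserting related records), here is a small side-by-side sketch. It uses Python's built-in sqlite3 as a stand-in database; the table names and the ORD-1 reference are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine, hypothetical tables
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY AUTOINCREMENT, reference TEXT);
    CREATE TABLE order_lines (order_id INTEGER NOT NULL, sku TEXT);

    CREATE TABLE orders_natural (reference TEXT PRIMARY KEY);
    CREATE TABLE order_lines_natural (order_reference TEXT NOT NULL, sku TEXT);
    """
)

# Auto-incremented key: insert the parent, read the key back, then insert the child.
cur = conn.execute("INSERT INTO orders (reference) VALUES (?)", ("ORD-1",))
generated_id = cur.lastrowid  # only known after the parent insert
conn.execute("INSERT INTO order_lines VALUES (?, ?)", (generated_id, "widget"))

# Application-supplied key: the linking value is known before anything is inserted.
reference = "ORD-1"
conn.execute("INSERT INTO orders_natural VALUES (?)", (reference,))
conn.execute("INSERT INTO order_lines_natural VALUES (?, ?)", (reference, "widget"))

surrogate_lines = conn.execute("SELECT order_id, sku FROM order_lines").fetchall()
natural_lines = conn.execute(
    "SELECT order_reference, sku FROM order_lines_natural"
).fetchall()
```

Neither version is hard, but the first one forces an ordering and a round trip that the second does not need.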
Yes, if you can figure out another primary key.
There is obviously a flaw in your table design. For example, suppose you had a table like
relation_id(PK), parent_id, child_id .
It is known that the combination of parent_id and child_id is unique, then you can assign the primary key to be parent_id + child_id, and then drop the column relation_id.
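A short sketch of the resulting table, with relation_id gone and the pair acting as the primary key. Python's built-in sqlite3 is used as a stand-in database, and the table name relations is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine; table name is hypothetical
conn.execute(
    "CREATE TABLE relations ("
    " parent_id INTEGER NOT NULL,"
    " child_id  INTEGER NOT NULL,"
    " PRIMARY KEY (parent_id, child_id))"
)
conn.execute("INSERT INTO relations VALUES (1, 2)")
conn.execute("INSERT INTO relations VALUES (1, 3)")

# The composite key itself now guards against a repeated pair.
try:
    conn.execute("INSERT INTO relations VALUES (1, 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

pairs = conn.execute(
    "SELECT parent_id, child_id FROM relations ORDER BY child_id"
).fetchall()
```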
There may be endless other possible cases, but just bear in mind that a primary key helps you locate data quickly, as well as helping your design make sense.