State description
I have two databases, DB1 and DB2, that have the same table, Author, with the fields Author.ID and Author.AuthorName.
The DB1.Author has the AUTO_INCREMENT on its Author.ID field, while the DB2.Author does not have the AUTO_INCREMENT since it relies on the correctness of DB1 data.
Both tables have the PRIMARY index on Author.ID and a UNIQUE index on Author.AuthorName.
DB2.Author has rows copied from the DB1.Author.
Both databases use MariaDB version 10.6.7.
The problem
The DB1 manager deleted some entries in the DB1.Author table, and then renumbered the IDs to leave no gaps. This means they might have had:
ID | AuthorName
---|-----------
 1 | A
 2 | B
 3 | C
Then they deleted the row where the AuthorName was 'B':
ID | AuthorName
---|-----------
 1 | A
 3 | C
And they finally updated the indexes to have no gaps (3-C changed to 2-C):
ID | AuthorName
---|-----------
 1 | A
 2 | C
Now I need to find a way to copy such updated state of the rows from the DB1.Author to the DB2.Author without deleting everything from the DB2.Author table, so that I don't lose the data on CASCADE effects.
What is the best approach for this?
My shot
This is what I did, but it obviously cannot work: in the case of a duplicate key, it would attempt to create another duplicate (updating the row with ID 2 would try to set AuthorName to 'C', which already exists on ID 3):
INSERT INTO DB2.Author (ID, AuthorName)
SELECT DB1.Author.ID, DB1.Author.AuthorName FROM DB1.Author
ON DUPLICATE KEY UPDATE
ID = DB1.Author.ID,
AuthorName = DB1.Author.AuthorName;
Additional ways?
Other than a possible SQL query solution, are there any other ways to automatically update the table data in one database when the other database changes its data? I would need to replicate only some tables, while other, linked tables are different.
tl;dr: your problem is your DB manager. The solution is to get them to undo the damage they caused by restoring the data to how it was before. Deleting rows is fine; updating primary keys is never OK.
Do not create a workaround or validate the mistake by accommodating it, because doing so makes it more likely to happen again.
Full answer.
Your actual problem is your "DB manager", who violated a fundamental rule of databases: Never update surrogate key values!
In your case it's even more tragic, because gaps in the ID column values don't matter in any way. If gaps do matter, you're in even worse shape. Allow me to explain...
The author's name is your actual identifier. We know this because there is a unique constraint on it.
The ID column is a surrogate key. Surrogate keys are most conveniently implemented as auto-incrementing integers, but they would work just as well if they were random (unique) numbers. Gaps, and even the choice of values themselves, are irrelevant to the effectiveness of surrogate keys.
You need to treat the DB2 table as completely wrong, since the update of primary keys on the source table has completely spoilt it:
Delete everything in the DB2 table
Insert into the DB2 table everything from the DB1 table
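A sketch of that resync as a single transaction, using Python's sqlite3 as a stand-in for the two MariaDB schemas (the ATTACH and the sample data are illustrative; any dependent tables with CASCADE foreign keys would need the same treatment):

```python
import sqlite3

con = sqlite3.connect(":memory:")          # plays the role of DB2
con.execute("ATTACH ':memory:' AS db1")    # plays the role of DB1

con.execute("CREATE TABLE db1.Author (ID INTEGER PRIMARY KEY, AuthorName TEXT UNIQUE)")
con.execute("CREATE TABLE Author (ID INTEGER PRIMARY KEY, AuthorName TEXT UNIQUE)")

# DB1 after the manager's renumbering; DB2 still holds the stale copy.
con.executemany("INSERT INTO db1.Author VALUES (?, ?)", [(1, "A"), (2, "C")])
con.executemany("INSERT INTO Author VALUES (?, ?)", [(1, "A"), (2, "B"), (3, "C")])

with con:                                  # one transaction: all or nothing
    con.execute("DELETE FROM Author")
    con.execute("INSERT INTO Author SELECT ID, AuthorName FROM db1.Author")

print(con.execute("SELECT ID, AuthorName FROM Author ORDER BY ID").fetchall())
```

Because delete and reinsert happen in one transaction, no reader ever sees the table half-empty, and a failure rolls back to the stale-but-consistent state.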
Going forward, without being condescending, the users with access to DB1 need training (or perhaps you need to reconsider the security on the DB). Updating a primary key value is the wrong thing to do. Gapless sequences are a silly thing to want, especially when you have known dependencies. In fact, gapless sequences are often cited as a database security weakness, because they make it easy to simply cycle through all the data.
You probably want to consider commercial solutions for logical data replication. If they don't support updates of primary keys, you can use that as a good enough reason not to allow them.
I would also invest time in making sure there's no other logical corruption of data like this.
Related
When creating tables with MySQL in phpMyAdmin, I always run into an issue with primary keys and their auto-increments. When I insert rows into my table, the AUTO_INCREMENT works perfectly, adding 1 to the primary key for each new row. But when I delete a row, for example the row where 'id = 4', and then add a new row, the new row gets 'id = 5' instead of 'id = 4'. It acts as if the old row was never deleted.
Here is an example of the SQL statement:
CREATE TABLE employe(
id INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
name VARCHAR(30) NOT NULL
)
ENGINE = INNODB;
How do I find a solution to this problem ?
Thank you.
I'm pretty sure this is by design. If you had IDs up to 6 in your table and you deleted ID 2, would you want the next insert to get an ID of 2? If there were any dependency on that data, for example if it was user data and the ID identified users, reusing the value would invalidate pre-existing information: if user X was deleted and the same ID was later assigned to user Y, that could cause integrity issues in dependent systems.
Also, imagine a table with 50 billion rows. Should the database run an O(n) search for the smallest missing ID every time you insert a new record? I can see that getting out of hand really quickly.
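The behaviour is easy to reproduce outside MySQL as well. Here is a sketch in Python's sqlite3, whose AUTOINCREMENT keyword mirrors InnoDB's refusal to reuse deleted ids (the table name follows the example above; the data is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE employe (
    id INTEGER PRIMARY KEY AUTOINCREMENT,   -- sqlite's analogue of AUTO_INCREMENT
    name TEXT NOT NULL)""")

for n in ["a", "b", "c", "d"]:
    con.execute("INSERT INTO employe (name) VALUES (?)", (n,))

con.execute("DELETE FROM employe WHERE id = 4")    # delete the last row
con.execute("INSERT INTO employe (name) VALUES ('e')")

# The gap is not reused: the new row gets id 5, not 4.
print(con.execute("SELECT id FROM employe ORDER BY id").fetchall())
```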
Some links you might like to read:
Principles of Transaction-Oriented Database Recovery (1983)
How can we re-use the deleted id from any MySQL-DB table?
Why do you care?
Primary keys are internal row identifiers that are not supposed to be sexy or good looking. As long as they identify each row uniquely, they serve their purpose.
Now, if you care about its value, then you probably want to expose the primary key value somewhere, and that's a big red flag. If you need an external, visible identifier, you can create a secondary column with any formatting sequence and values you want.
As a side note, the term AUTO_INCREMENT is a bit misleading. It doesn't mean the values increase one by one all the time; it just means the engine will try to produce sequential numbers, as long as that is possible. In multi-threaded apps that's usually not possible, since batches of numbers are reserved per thread, so the actual row-insertion order may not follow the natural numbering. Row deletions have a similar effect, as do INSERTs that are rolled back.
Primary keys are meant to be used for joining tables together and for indexing; they are not meant for human consumption. Reordering primary key columns could orphan data and wreak havoc on your queries.
Tip: add another column to your table and reorder that column at will if needed (and show that column to your users instead of the primary key).
There is a table. No PK, 2 FK, with some arbitrary number of columns.
Unfortunately FK are not unique in any way.
Adding new data is easy.
Deleting data (finding a row) is ok if I put unique constraint to some other col.
(DELETE ... WHERE fk1=:fk1 AND fk2=:fk2 AND ucol=:ucol)
What to do with UPDATE?
I can't use that ucol because the ucol itself might be subject to change. I have several solutions, but none of them seems OK.
Solution1:
Put a PK in the table and use it for DELETE and UPDATE. Deleting will leave a lot of holes in it, but that's no problem. In theory, it can run out of PK numbers (int, unsigned int) if there's heavy inserting and deleting going on.
Solution1a
Make a candidate key (CK) of (fk1, fk2, some new column) and use that to locate the row. It's effectively the same as just using a PK.
Solution2
Use a timestamp with microtime / a hash / a unique-key generator / something to populate a new unique column. That column is used as the PK to locate the row for UPDATE and DELETE. Excellent, but only if the uniqueness algorithm does its job perfectly.
My question:
Is there something better? Something that doesn't require fancy algorithms and has no risk of overflowing an auto-incremented PK...
----------------- edit----------------
Solution2a
Use MySQL's UUID()! It's far better (and easier to use) than creating a custom timestamp / hash / something unique.
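MySQL's UUID() has a close analogue in Python's uuid module. A sketch of locating rows by a UUID column, using sqlite3 as a stand-in (the table and column names are illustrative):

```python
import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (row_uuid TEXT PRIMARY KEY, fk1 INT, fk2 INT, payload TEXT)")

rid = str(uuid.uuid4())                     # what MySQL's UUID() would generate
con.execute("INSERT INTO t VALUES (?, 1, 2, 'old')", (rid,))

# UPDATE and DELETE locate the row by its UUID, no matter how often
# fk1, fk2, or any other mutable column changes.
con.execute("UPDATE t SET payload = 'new' WHERE row_uuid = ?", (rid,))
print(con.execute("SELECT payload FROM t WHERE row_uuid = ?", (rid,)).fetchone()[0])
```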
My suggestion is to add a PK to the table, for the following reasons:
1. It gives each row a unique ID, which helps with DELETE and UPDATE scripts.
2. The PK creates a clustered index on the column, which improves the table's performance when retrieving data.
3. It's always advised to provide a PK in each table.
4. In the future you can use the PK as an FK in another table if required.
I would like to ask you a design question:
I am designing a table that makes me scratch my head. I'm not sure what the best approach is, and I feel like I am missing something:
There are two tables A and B and one M:N relationship table between them. The relationship table has right now these values:
A.ID, B.ID, From, To
Business requirements:
At any time, the A:B relationship can only be 1:1
An A:B pair can repeat over time, as defined by the From and To datetime values, which specify an interval
Example: Car/Driver.
Any car can have only 1 driver at any time
Any driver can drive only 1 car at any time (this is NOT Top Gear, OK? :) )
A driver can change cars after some time, and can return to the same car
Now, I am not sure:
- What PK should I go with? (A, B) is not enough; adding From and To doesn't feel right; maybe an auto-increment PK?
- Is there any way to enforce the business requirements through DB design?
- For business reasons, I would prefer this not to be in a historical table. Why? Let's assume the car is rented and I want to know, given a date, who had which car rented on that date. Splitting the data into a historical table would require more joins :(
I feel like I am missing something, some kind of general pattern... or I don't know...
Thankful for any help, so thank you :)
I don't think you are actually missing anything. I think you've got a handle on what the problem is.
I've read a couple of articles about how to handle "temporal" data in a relational database.
Bottom line consensus is that the traditional relational model doesn't have any builtin mechanism for supporting temporal data.
There are several approaches, some better suited to particular requirements than others, but all of the approaches feel like they are "duct taped" on.
(I was going to say "bolted on", but I thought a tip of the hat to Red Green was in order: "... the handyman's secret weapon, duct tape", and "if the women don't find you handsome, they should at least find you handy.")
As far as a PRIMARY KEY or UNIQUE KEY for the table, you could use the combination of (a_id, b_id, from). That would give the row a unique identifier.
But, that doesn't do anything to prevent overlapping "time" ranges.
There is no declarative constraint for a MySQL table that prevents "overlapping" datetime ranges stored as "start","end" or "start","duration", etc., at least in the general case. (If you had very well-defined ranges, and triggers that rounded the from to an even four-hour boundary and forced the duration to exactly four hours, you could use a UNIQUE constraint. In the more general case, for any ol' values of from and to, the UNIQUE constraint does not work for us.)
A CHECK constraint is insufficient (since you would need to look at other rows), and even if it weren't, MySQL of that era parses but does not actually enforce CHECK constraints.
The only way (I know of) to get the database to enforce such a constraint would be a TRIGGER that looks for the existence of another row for which the affected (inserted/updated) row would conflict.
You'd need both a BEFORE INSERT trigger and a BEFORE UPDATE trigger. The trigger would need to query the table, to check for the existence of a row that "overlaps" the new/modified row
SELECT 1
FROM mytable t
WHERE t.a_id = NEW.a_id
AND t.b_id = NEW.b_id
AND t.from <> OLD.from
AND < (t.from, t.to) overlaps (NEW.from,NEW.to) >
Obviously, that last line is pseudocode for the actual syntax that would be required.
The line before that would only be needed in the BEFORE UPDATE trigger, so we don't find (as a "match") the row being updated. The actual check there would really depend on the selection of the PRIMARY KEY (or UNIQUE KEY)(s).
With MySQL 5.5, we can use the SIGNAL statement to return an error, if we find the new/updated row would violate the constraint. With previous versions of MySQL, we can "throw" an error by doing something that causes an actual error to occur, such as running a query against a table name that we know does not exist.
And finally, this type of functionality doesn't necessarily have to be implemented in a database trigger; this could be handled on the client side.
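On the client side, the overlap test itself is simple: two ranges overlap exactly when each one starts before the other ends. A sketch in Python (assuming [from, to) half-open intervals, so two assignments that merely touch at an endpoint do not conflict):

```python
from datetime import datetime

def overlaps(a_from, a_to, b_from, b_to):
    """Half-open intervals [from, to) overlap iff each starts before the other ends."""
    return a_from < b_to and b_from < a_to

d = datetime
# Partial overlap: conflict.
assert overlaps(d(2024, 1, 1), d(2024, 1, 10), d(2024, 1, 5), d(2024, 1, 20))
# Back-to-back ranges sharing an endpoint: no conflict.
assert not overlaps(d(2024, 1, 1), d(2024, 1, 10), d(2024, 1, 10), d(2024, 1, 20))
print("overlap checks pass")
```

The same two comparisons are what the pseudocode `(t.from, t.to) overlaps (NEW.from, NEW.to)` line in the trigger query would expand to.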
How about three tables:
TCar, TDriver, TLog
TCar
pkCarID
fkDriverID
name
A unique index on fkDriverID ensures a driver is only ever in one car, turning the foreign key fkDriverID into a 1:1 relationship.
TDriver
pkDriverID
name
TLog
pkLogID (surrogate pk)
fkCarID
fkDriverID
from
to
With two joins you can get any information you described. If you just need to find car data by driver ID, or driver data by car ID, you can do it with one join.
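As a sketch of how the unique index enforces the 1:1 rule, here is the TCar/TDriver shape in Python's sqlite3, used as a stand-in for MySQL (the data values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE TDriver (pkDriverID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE TCar (
    pkCarID INTEGER PRIMARY KEY,
    fkDriverID INTEGER UNIQUE REFERENCES TDriver(pkDriverID),  -- UNIQUE makes it 1:1
    name TEXT);
INSERT INTO TDriver VALUES (1, 'Alice');
INSERT INTO TCar VALUES (1, 1, 'Mini'), (2, NULL, 'Golf');
""")

try:
    # Putting Alice in a second car at the same time violates the unique index.
    con.execute("UPDATE TCar SET fkDriverID = 1 WHERE pkCarID = 2")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("second assignment rejected:", rejected)
```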
Thank you everyone for your input. So far I am thinking about this approach, and I would be thankful for any criticism / pointing out of flaws:
Tables (pseudo SQL code):
Car (ID pk auto_increment, name)
Driver (ID pk auto_increment, name)
Assignment (CarID unique, DriverID unique, from datetime), composite PK (CarID, DriverID)
AssignmentHistory (CarID, DriverID, from datetime, to datetime), no PK
Of course, CarID is an FK to Car(ID) and DriverID is an FK to Driver(ID).
The next stage is two triggers (and boy oh boy, I hope this can be done in MySQL; it works on MSSQL, but I don't have a MySQL DB handy right now to test):
!!! Warning: MSSQL for now
create trigger Assignment_Update on Assignment instead of update as
    delete Assignment
    from Assignment
    join inserted
      on (inserted.CarID = Assignment.CarID
          or inserted.DriverID = Assignment.DriverID)
     and (inserted.CarID <> Assignment.CarID
          or inserted.DriverID <> Assignment.DriverID);
    insert into Assignment
    select * from inserted;

create trigger Assignment_Delete on Assignment after delete as
    insert into AssignmentHistory
    select CarID, DriverID, [from], GETDATE() from deleted;
I tested it a bit, and it seems to do what I need it to do for each business case.
It is popular to save all versions of posts when editing (as in Stack Exchange projects), so that old versions can be restored. I wonder what the best way to save all versions is.
Method 1: Store all versions in the same table, adding a column for ordering or for marking the active version. This makes the table very long.
Method 2: Create an archive table to store older versions.
In both methods, I wonder how to deal with the row ID, which is the main identifier of the article.
The "best" way to save revision history depends on what your specific goals/constraints are -- and you haven't mentioned these.
But here are some thoughts about your two suggested methods:
Create one table for posts and one for post history, for example:
create table posts (
id int primary key,
userid int
);
create table posthistory (
postid int,
revisionid int,
content varchar(1000),
foreign key (postid) references posts(id),
primary key (postid, revisionid)
);
(Obviously there would be more columns, foreign keys, etc.) This is straightforward to implement and easy to understand (and easy to let the RDBMS maintain referential integrity), but as you mentioned, it may result in posthistory having too many rows to be searched quickly enough.
Note that postid is a foreign key in posthistory (and the PK of posts).
Use a denormalized schema where all of the latest revisions are in one table and previous revisions are in a separate table. This requires more logic in the application: when adding a new version, replace the post with the same id in the posts table, and also add the new version to the revision table.
(This may be what SE sites use, based on the data dump in the SE Data Explorer. Or maybe not, I can't tell.)
For this approach, postid is also a foreign key in the posthistory table, and the primary key in the posts table.
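A sketch of that flow in Python's sqlite3, used as a stand-in for MySQL (the save_revision helper is illustrative, not part of any library):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, content TEXT);
CREATE TABLE posthistory (
    postid INTEGER REFERENCES posts(id),
    revisionid INTEGER,
    content TEXT,
    PRIMARY KEY (postid, revisionid));
""")

def save_revision(postid, content):
    # Replace the live row, then append the new version to the history.
    con.execute("INSERT OR REPLACE INTO posts (id, content) VALUES (?, ?)",
                (postid, content))
    rev = con.execute("SELECT COALESCE(MAX(revisionid), 0) + 1 FROM posthistory "
                      "WHERE postid = ?", (postid,)).fetchone()[0]
    con.execute("INSERT INTO posthistory VALUES (?, ?, ?)", (postid, rev, content))

save_revision(1, "first draft")
save_revision(1, "edited draft")
print(con.execute("SELECT content FROM posts WHERE id = 1").fetchone()[0])
print(con.execute("SELECT COUNT(*) FROM posthistory WHERE postid = 1").fetchone()[0])
```

Reads of the current version only ever touch the small posts table; the growing posthistory table is consulted only when restoring an old revision.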
In my opinion, an interesting approach is:
to define another table, for example posts_archive (it would contain all the columns of the posts table, plus an auto-incremented primary key and optionally a date...)
to feed this table through after-insert and after-update triggers defined on the posts table.
If the size of the table is an issue, then the second option would be the better choice. That way the active version can be returned quickly from a smaller table, and restoring an older version from the larger archive table is accepted to take longer. That said, the size of the table should not be an issue with a sensible database and indexing.
Either way, you need a primary key that consists of multiple table columns instead of just row ID. The trivial answer would be to include a timestamp containing the time each revision was created into the key, so that ID continues to identify a specific article, and ID and revision time together identify a specific revision of the article.
Dealing with temporal data is a known problem.
Method 1 simply changes your table identifier: you end up with a table containing messageID, version, description, ... with a primary key of (messageID, version).
Modifying the data is done by simply adding a row with an incremented version. Querying is a little more complicated.
Method 2 is more tedious: you end up with a table with a rowID and a second table that is exactly the same as in method 1. Then, on every update, you have to remember to copy the data into the "backup table".
Method 3: the answer given by Matt.
In my opinion, methods 1 and 3 are better. The schema is simpler in 1, but method 3 lets you keep unversioned data for your posts.
So imagine you have multiple tables in your database, each with its own structure and each with a PRIMARY KEY of its own.
Now you want a Favorites table so that users can add items as favorites. Since there are multiple tables, the first thing that comes to mind is to create one Favorites table per table:
Say you have a table called Posts with PRIMARY KEY (post_id) and you create a Post_Favorites with PRIMARY KEY (user_id, post_id)
This would probably be the simplest solution, but could it be possible to have one Favorites table joining across multiple tables?
I've though of the following as a possible solution:
Create a new table called Master with primary key (master_id). Add insert triggers on all tables in your database that generate a new master_id and write it alongside the row in your table. Let's also say that we record in the Master table where each master_id has been used (i.e. in which table).
Now you can have one Favorites table with PRIMARY KEY (user_id, master_id)
You can select from the Favorites table and join each individual table on master_id to get the favorites per table. But would it be possible to get all the favorites with one query (maybe not a plain query, but a stored procedure)?
Do you think that this is a stupid approach? Since you will perform one query per table what are you gaining by having a single table?
What are your thoughts on the matter?
One way would be to sub-type all possible tables to a generic super-type (Entity) and then link user preferences to that super-type. For example:
I think you're on the right track, but a table-based inheritance approach would be great here:
Create a table master_ids, with just one column: an int-identity primary key field called master_id.
On your other tables, (users as an example), change the user_id column from being an int-identity primary key to being just an int primary key. Next, make user_id a foreign key to master_ids.master_id.
This largely preserves data integrity. The only place you can trip up is if you have a master_id = 1, and with a user_id = 1 and a post_id = 1. For a given master_id, you should have only one entry across all tables. In this scenario you have no way of knowing whether master_id 1 refers to the user or to the post. A way to make sure this doesn't happen is to add a second column to the master_ids table, a type_id column. Type_id 1 can refer to users, type_id 2 can refer to posts, etc.. Then you are pretty much good.
Code "gymnastics" may be a bit necessary for inserts. If you're using a good ORM, it shouldn't be a problem. If not, stored procs for inserts are the way to go. But you're having your cake and eating it too.
I'm not sure I really understand the alternative you propose.
But in general, when given the choice of 1) "more tables" or 2) "a mega-table supported by a bunch of fancy code", your interests are best served by more tables without the code gymnastics.
A red flag was "Add triggers on all tables in your database": each trigger fire is a performance hit of its own.
The database designers have built in all kinds of technology to optimize tables/indexes, much of it behind the scenes without you knowing it. Just sit back and enjoy the ride.
Try these for inspiration: Database Answers (no affiliation with me).
An alternative to your approach might be to have the favorites table as (user_id, object_id, object_type). When inserting into the favorites table, just record the type of the favorite. However, I don't see a single simple query working with either your approach or mine. One way to go about it might be to use UNION to get one combined result set and then identify what type each record is based on the type column. Another thing you can do is turn the UNION query into a MySQL VIEW and simply query that VIEW.
The benefit of using a single table for favorites is simplicity, which some might consider to be against database normalization rules. But on the upside, you don't have to create so many favorites tables, and you can add anything to favorites easily just by coming up with a new object_type identifier.
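A sketch of the (user_id, object_id, object_type) design with the UNION query, using Python's sqlite3 as a stand-in (the posts/videos tables and the data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE videos (video_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE favorites (user_id INT, object_id INT, object_type TEXT,
                        PRIMARY KEY (user_id, object_id, object_type));
INSERT INTO posts VALUES (1, 'a post');
INSERT INTO videos VALUES (1, 'a video');
INSERT INTO favorites VALUES (7, 1, 'post'), (7, 1, 'video');
""")

# One combined result set via UNION ALL; object_type tells the rows apart.
rows = con.execute("""
    SELECT f.object_type, p.title FROM favorites f
      JOIN posts p ON f.object_type = 'post' AND p.post_id = f.object_id
    WHERE f.user_id = 7
    UNION ALL
    SELECT f.object_type, v.title FROM favorites f
      JOIN videos v ON f.object_type = 'video' AND v.video_id = f.object_id
    WHERE f.user_id = 7
    ORDER BY 1""").fetchall()
print(rows)
```

In MySQL this UNION could be wrapped in a VIEW, as suggested above, so that application code only ever queries one name.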
It sounds like you have an is-a type relationship that needs to be modeled. All of the items that can be favourited are a type of "item". It sounds like you are on the right track, but I wouldn't use triggers. What could be the right answer if I have understood correctly, is to pull all the common fields into a single table called items (master is a poor name, master of what?), this should include all the common data that would be needed when you need a users favourite items, I'd expect this to include fields like item_id (primary key), item_type and human_readable_name and maybe some metadata about when the item was created, modified etc. Each of your specific item types would have its own table containing data specific to that item type with an item_id field that has a foreign key relationship to the item table. Then you'd wrap each item type in its own insertion, update and selection SPs (i.e. InsertItemCheese, UpdateItemMonkey, SelectItemCarKeys). The favourites table would then work as you describe, but you only need to select from the item table. If your app needs the specific data for each item type, it would have to be queried for each item (caching is your friend here).
If MySQL supports SPs with multiple result sets, you could write one that outputs all the items as one result set, followed by a result set for each item type, if you ever need all the specific item data in one go. In most cases I would not expect you to need all the data all the time.
Keep in mind that not EVERY use of a PK column needs a constraint. Take a logging table, for example: even though it holds a copy of the PK column from the table being logged, you can't build a foreign-key constraint on it.
What would be the worst possible case? You insert a record for Oprah's TV show into the favorites table, and next year you delete the Oprah show from the list of TV shows but don't delete that ID from the favorites table. Will that break anything? Probably not: when you join favorites to TV shows, that record simply falls out of the result set.
There are a couple of ways to share values for PKs. Oracle has the advantage of sequences; if you don't have those, you can add a "step" to your autonumber fields. There's always a risk, though.
Say you think you'll never have more than 10 tables of "things which could be favorited". Then start your PKs at 0 for the first table and increment by 10, at 1 for the second table and increment by 10, at 2 for the third, and so on. That guarantees all the values will be unique across those 10 tables. The risk is that a future requirement will add table 11; you can always 'pad' your guesstimate.
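The interleaved-range scheme above can be sketched as plain generators (the table count of 10 and the step are the assumptions just stated; in practice the step would be configured in the database's autonumber settings):

```python
# Table i hands out ids i, i+10, i+20, ... so up to ten tables never collide.
def make_id_generator(table_index, step=10):
    n = table_index
    while True:
        yield n
        n += step

posts_ids = make_id_generator(0)   # first table: 0, 10, 20, ...
shows_ids = make_id_generator(1)   # second table: 1, 11, 21, ...

print([next(posts_ids) for _ in range(3)])
print([next(shows_ids) for _ in range(3)])
```

The trade-off is exactly the one named above: the step is a hard ceiling on the number of participating tables, so the guesstimate needs padding.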