I have a revision system running in one of my projects.
A database table looks like the following:
objectID // (no primary/auto_increment!)
versionID
time
deletionTime
// (additionalFields)
The newest revision always has versionID = 0.
On an update, the entry with versionID = 0 gets updated to MAX(versionID)+1.
The updated entry will then be inserted with versionID = 0.
Also, the time of the insert will be saved.
The revision system works good like it is, but I have some difficulties with the system as it is at the moment:
I have another table with no revision system that is related to the revision table:
ID
parentID // the reference to the revision table
time
If I query entries out of this table, I need to join my revision table and get the revision that was relevant when my entry was created.
This means that I need a subquery to select all revisions with revisionTable.time <= otherTable.time
and then group by the entryID.
This does not seem to be a really good solution to me. There will be always a subquery needed for this.
I used this revision system with the current revision as "versionID = 0" because of the ability to easily query all current revisions over all objects.
My solution for a better revision system is as follows:
ID // primary, auto_increment
objectID // multiple revisions share the same objectID
isRevision // bool, the current revision has it set to 0
time
// (additionalColumns)
With this system, I could still query all current revisions and could use the primary ID of the revisions in my other table.
What do you think? Better ideas?
Related
Suppose we have a table with an auto-increment primary key. I want to load all IDs greater than the last ID I have seen.
SELECT id
FROM mytable
WHERE id > 10;
With the naive approach, I risk skipping IDs:
Transaction 1 claims ID 11.
Transaction 2 claims ID 12.
Transaction 2 commits.
I read all IDs >10. I see 12, and next time I will read all IDs >12. I have skipped 11.
Transaction 1 commits. For all intents and purposes, ID 11 now exists.
As a solution, I propose to do a double check to ensure that no intermediate IDs are about to be committed:
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*)
FROM mytable
WHERE id > 10
AND id <= 12; -- Where 12 was the max seen in the first query
If the count is greater than the number of IDs seen in the first query, then it is definitely possible that additional IDs will be committed in the gaps in that sequence.
The question is: does the reverse hold true? If the count is equal to (or less than) the number of IDs seen in the first query, is it guaranteed that there will be no values in between? Or am I missing some possible scenario where the IDs are being claimed, yet the READ UNCOMMITTED query does not see them yet?
For this question, please disregard:
Manual ID insertions.
Rewinding the auto-increment counter.
Mysql locks the table during auto_increment.
See
https://dev.mysql.com/doc/refman/5.7/en/innodb-auto-increment-handling.html
So that normally that problem doesn't occur, if there is no bug in the Version.
The lock works like a semaphore/critical Section.
I've been developing with SQL databases since 1992, and I have never seen an occasion where using READ UNCOMMITTED was the right solution to any problem.
I guess you are using MySQL as a kind of queue. That is, you rely on the auto-increment ID as the head of the queue.
I don't think you can do this in the way you describe, because of the fact that the order in which transactions generate their auto-inc ID's is not the same order as they commit.
I suggest you need to have another column called processed or something like that. Then you can query for records you have not yet processed:
SELECT id FROM mytable WHERE processed = false ORDER BY id
This way, the query will always return any records you have not seen yet. If ID 11 is committed after you have already seen ID 12, it will show up the next time you run this query.
Once you have done whatever you are going to do with a record, then:
UPDATE mytable SET processed = true WHERE id = ?
An even better solution, without the need to have a processed column, is to use a message queue to complement the SQL database.
When a client adds a record, they should also post the ID of the record they just inserted into the message queue. It's important that this client post to the message queue after they commit the record, or else a consumer of the message queue could get notified of an ID they can't see yet.
TL;DR: Is this design correct and how should I query it?
Let's say we have history tables for city and address designed like this:
CREATE TABLE city_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
name VARCHAR(128) NOT NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
CREATE TABLE address_history (
id BIGINT UNSIGNED NOT NULL PRIMARY KEY,
city_id INT NULL,
building_no VARCHAR(10) NULL,
history_at DATETIME NOT NULL,
obj_id INT UNSIGNED NOT NULL
);
Original tables are pretty much the same except for history_id and obj_id (city: id, name; address: id, city_id, building_no). There's also a foreign key relation between city and address (city_id).
History tables are populated on every change of the original entry (create, update, delete) with the exact state of the entry at given time.
obj_id holds id of original object - no foreign key, because original entry can be deleted and history entries can't. history_at is the time of creation of history entry.
History entries are created for every table independently - change in city name creates city_history entry but does not create address_history entry.
So to see what was the state of the whole address with city (e.g. on printed documents) at any T1 point in time, we take from both history tables most recent entries for given obj_id created before T1, right?
With this design in theory we should be able to see the state of signle address with city at any given point of time. Could anyone help me create such a query for given address id and time? Please note that there could be multiple records with the same exact timestamp.
There is also a need to create a report for showing every change of state of given address in given time period with entries like "city_name, building_no, changed_at". Is it something that can be created with SQL query? Performance doesn't matter here so much, such reports won't be generated so often.
The above report will probably be needed in an interactive version where user can filter results e.g. by city name or building number. Is it still possible to do in SQL?
In reality address table and address_history table have 4 more foreign keys that should be joined in report (street, zip code, etc.). Wouldn't the query be like ten pages long to provide all the needed functionality?
I've tried to build some queries, play with greatest-n-per-group, but I don't think I'm getting anywhere with this. Is this design really OK for my use cases (if so, can you please provide some queries for me to play with to get where I want?)? Or should I rethink the whole design?
Any help appreciated.
(My answer copied from here, since that question never marked an answer as accepted.)
My normal "pattern" in (very)pseudo code:
Table A: a_id (PK), a_stuff
Table A_history: a_history_id (PK), a_id(FK referencing A.a_id), valid_from, valid_to, a_stuff
Triggers on A:
On insert: insert values into A_history with valid_from = now, and valid_to = null.
On update: set valid_to = now for last history record of a_id; and do the same insert from the "on insert" trigger with the updated values for the row.
On delete: set valid_to = now for last history record of a_id.
In this scenario, you'd query history with "x >= from and x < to" (not BETWEEN as the a previous record's "from" value should match the next's to "value").
Additionally, this pattern also makes "change log" reports easier.
Without a table dedicated to change logging, the relevant records can be found just by SELECT * FROM A_history WHERE valid_from BETWEEN [reporting interval] OR valid_to BETWEEN [reporting interval].
If there is a central change log table, the triggers can just be modified to include log entry inserts as well. (Unless log entries include "meta" data such as reason for change, who changed, etc... obviously).
Note: This pattern can be implemented without triggers. Using a stored procedure, or even just multiple queries in code, can actually negate the need for the non-history table.
The history table's "a_id" would need to be replaced with whatever uniquely identifies the record normally though; it could still be an id value, but these values would need synthesized when inserting, and known when updating/deleting.
Queries:
(if not new) UPDATE the most recent entry's valid_to.
(if not deleting) INSERT new entry
This is a very "traditional" Problem, when it comes down to versioning (or monitoring) of changes to a certain row.
There are various "solutions", each having its own drawback and advantage.
The following "statements" are a result of my expericence, they are neither perfect, nor do I claim they are the "only ones"!
1.) Creating a "history table": That's the worst Idea of all. You would always need to take into account which table you need to query, depending on DATA that should be queried. That's a "Chicken-Egg" Problem...
2.) Using ONE Table with ONE (increasing) "Revision" Number: That's a better approach, but it will get "hard" to query: Determining the "most recent row" per "id" is very costly no matter which aproach is used.
My personal expierence is, that following the pattern of a "double linked List" ist the best to solve this, when it comes down to Millions of records:
3.) Maintain two columns among every entity, let's say prev_version_id and next_version_id. prev_version_id points to NULL, if there is no previous version. next_version_id points to NULL if there is no later version.
This approach would require you to ALWAYS perform two actions upon an update:
Create the new row
Update the old rows reference (next_version_id) to the just insterted row.
However, when your database has grown to something like 100 Million Rows, you will be very happy that you have choosen this path:
Querying the "Oldest" Version is as simple as querying where ISNULL(prev_version_id) and entity_id = 5
Querying the "Latest" Version is as simple as querying where ISNULL(next_version_id) and entity_id = 5
Getting a full version history will just target the entity_id=5 of the data-table, sortable by either prev_version_id or next_version_id.
The very often neglected fact: The first two queries will also work to get a list of ALL first versions or of ALL recent versions of an entity - in about NO TIME! (Don't underestimate how "costly" it can be do determine the most recent version of an entity otherwise! Believe me, when "testing" everything seems equaly fine, but the real struggle starts when live-data with millions of records is used.)
cheers,
dognose
I have an old website and a new website... the old website had 4500 orders placed on it, tracked by a table with a primary key for the order id.
When the new website was launched, it was launched before migrating old orders into it. To accomplish this, the auto_increment value on the new orders table was set to 5000 so any new order placed would not collide with an old id.
This allows orders to continue being placed on the new website, all is well...
Now I'd like to run my import script to bring in the old orders into the new website.
Is it possible to temporarily lower the auto_increment value on the new orders table to my desired order id?
Disclaimer: This is a migration from a Drupal 5 Ubercart based site, to a Drupal 7 Commerce based site, so I do not (easily) have control over the complex queries involved in assembling the new orders, and cannot simply (AFAIK) supply an order id when assembling an order, because the system always refers to the next available primary key value in the table when creating an order. I can easily take the site offline to run the script, so nothing gets out of sync.
For importing the "old" orders you don't need to rely on the autoincrement id-- they already have ids, and you probably want to keep those!
Modify your import script to insert the complete old records into the new table, id and all! As long as the ids don't collide, it shouldn't be a problem.
You can always run ALTER TABLE table_name AUTO_INCREMENT = 1; (or whatever number you want).
The question is: do you WANT to?
If you have any records already in the database, it's probably best to insure the next auto increment value is larger than the maximum already in your database.
This is related to How to change ID in mysql
I also have checked other questions and none are quite like this one.
As we know, innodb has a feature. If I want to channge an id of a record for example, then all other table that point to the previous ID will magically be updated.
What about if I want to MERGE 2 records?
Say I have 2 businesses.
They have 2 ID.
I want to merge them into one. I also want to use innodb awesome feature to automatically change things.
I can't just change one of the id to the other ID. Or can I?
What would you do to merge 2 simmilar records in database?
Of course what actually goes into the combined record will be business decisions.
Basically I just do not want to pin point all the other table one by one. I think on update rule is there for a reason. Is there a way where I just change slaveID to masterID, keep ALL data in master the same, and then have the database itself (rather than my program) to repoint all tables that point to slaveID to point to masterID? of course, records for slaveID will be gone anyway.
For example, with normal mysql engine, you can change ID, and then you have to go through all table that points to the old ID to point the new ID instead. With innodb, that repointing is done by the database engine itself. Which is kind of cool. Why would anyone use non innodb engine anyway.
I want to do the same but for merging.
Trying to set a records primary key to an already existing value will simply result in a key violation error. While this is simple on a first glance, it has a side effect: You can not use ON UPDATE CASCADE to merge two records - it will simply not work.
If you have the possibility to change the schema, you can use the old but good redirect-trick:
(Assuming your IDs are positive, maybe unsigend ints)
add a field redirect int not null default 0
Create a view:
.
CREATE VIEW tablename_view
SELECT
-- repeat next line for every field apart from redirect
IF(s.redirect>0,m.<fieldname>,s.<fieldname>
FROM tablename AS s
LEFT JOIN tablename AS m ON s.redirect=m.id
When you merge a record (slave) into another record (master) run UPDATE tablename SET redirect=<id_of_master> WHERE id=<id_of_slave>
Adapt your select queries to select from tablename_view instead of tablename
Create and use a maintenance script to weed out merger slaves
I am in the process of rewriting a company's entire system. The original developer was a bit silly and generated ID numbers for each customer report randomly in his database. Each ID number is up to 7 digits long - but could be anything.
I am migrating over all his old data to our new, far more logically structured database. I obviously want to use a MySQL auto-increment for our ID field. However, it's vital that we keep the old ID numbers as customers still phone up each day with those to reference against.
Ideally, the perfect scenario would be we go live December 1st - everything up to December 1st is all randomly IDed, and from December 1st onwards they automatically increment starting at the highest random ID in the old database.
Is such a thing possible with MySQL without any issues? I am currently using two columns - one, our logical autoincrementing ID, and a second column called old_id which was being used during migration. But we need the call centre staff to only be using one ID or mass confusion will ensue.
Thanks!
If you start numbering from the highest random value, just changing the field to autoincrement should be enough, the normal behaviour is that mysql won't change ids already set, and starts numbering from the highest value+1.
If you want to start from a specific value (say 10,000,000) you can set
ALTER TABLE theTableInQuestion AUTO_INCREMENT=10000000
Of course, be sure to create backups and test, but it should not pose any problems at all. (Note that the old records will be stored in order of the id-field, which is random, and won't reflect the creation order.)
As you need to keep the old IDs, I'm going to assume that you're going to create a new column for autoincrement ID that will become your primary key but keep the existing ID column and rename it (to old_id, maybe?). I'm also going to assume you record when a customer signed up.
If you make your old ID column nullable (allow NULL as a valid value) then you can simply check whether or not the old ID column is NULL. If it's not NULL then treat that as the ID, otherwise use the autoincrement column.
Finding a customer:
SELECT *
FROM customer
WHERE (id = /*Put your ID here*/ AND reg_date >= /*Put the date the new regime starts here*/)
OR (id_old = /*put your ID here*/ AND reg_date < /*Put the date the new regime starts here*/)
This will occasionally return 2 rows so you'll have to use some other criteria to uniquely identify the customer in question.
As for associating an old customer with other tables in the database, you can always use the new ID internally throughout the entire DB once its generated. You will have to update tables that are using the old ID as the foreign key, obviously.
UPDATE target_table
JOIN customers on target_table.cust_id = customers.id_old
SET target_table.cust_id = customers.id;
(Note: The above is just a quick and dirty query that hasn't been tested! I'd suggest testing on a copy of the database before you try it for real!)