Should id or timestamp be used to determine the creation order of rows within a database table? (given possibility of incorrectly set system clock) - mysql

A database table is used to store editing changes to a text document.
The database table has four columns: {id, timestamp, user_id, text}
A new row is added to the table each time a user edits the document. The new row has an auto-incremented id, and a timestamp matching the time the data was saved.
To determine what editing changes a user made during a particular edit, the text from the row inserted in response to his or her edit is compared to the text in the previously inserted row.
To determine which row is the previously inserted row, either the id column or the timestamp column could be used. As far as I can see, each method has advantages and disadvantages.
Determining the creation order using id
Advantage: Immune to problems resulting from incorrectly set system clock.
Disadvantage: Seems to be an abuse of the id column since it prescribes meaning other than identity to the id column. An administrator might change the values of a set of ids for whatever reason (eg. during a data migration), since it ought not matter what the values are so long as they are unique. Then the creation order of rows could no longer be determined.
Determining the creation order using timestamp
Advantage: The id column is used for identity only, and the timestamp is used for time, as it ought to be.
Disadvantage: This method is only reliable if the system clock is known to have been correctly set each time a row was inserted into the table. How could one be convinced that the system clock was correctly set for each insert? And how could the state of the table be fixed if ever it was discovered that the system clock was incorrectly set for a not precisely known period in the past?
I seek a strong argument for choosing one method over the other, or a description of another method that is better than the two I am considering.

Using the sequential id would be simpler, as it's probably(?) a primary key and thus indexed and quicker to access. Given that you have user_id, you can quickly ascertain the last and prior edits.
Using the timestamp is also applicable, but it's likely to be a longer entry, we don't know whether it's indexed at all, and there is the potential for collisions. You rightly point out that system clocks can change... whereas sequential ids cannot.
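As a minimal sketch of the id-based lookup (assuming the table is called edits; adjust the names to your schema):

-- text of the row inserted immediately before the row with id @current_id
SELECT text
FROM edits
WHERE id < @current_id
ORDER BY id DESC
LIMIT 1;

The same query with timestamp substituted for id is where the clock problems you describe come into play.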
Given your update:
As it's difficult to see what your exact requirements are, I've included this as evidence of what a particular project required for 200K+ complex documents and millions of revisions.
This comes from my own experience building a fully auditable doc/profiling system for an internal team of more than 60 full-time researchers. We ended up using both an id and a number of other fields (including timestamp) to provide audit-trailing and full versioning.
The system we built has more than 200 fields for each profile, and thus versioning a document was far more complex than just storing a block of changed text/content for each one. Yet each profile could be edited, approved, rejected, rolled back, published, and even exported as a PDF or another format as ONE document.
What we ended up doing (after a lot of strategy/planning) was to store sequential versions of the profile, but they were keyed primarily on an id field.
Timestamps
Timestamps were also captured as a secondary check, and we made sure to keep the system clocks accurate (across a cluster of servers) with cron scripts that checked the time alignment regularly and corrected it where necessary. We also used ntpd to prevent clock drift.
Other captured data
Other data captured for each edit included (but was not limited to):
User_id
User_group
Action
Approval_id
There were also other tables that fulfilled internal requirements (including automatically generated annotations for the documents), as some of the profile editing was done using data from bots (built using NER/machine learning/AI), with approval from one of the team required before edits/updates could be published.
An action log was also kept of all user actions, so that in the event of an audit one could review an individual user's activity - even when a user didn't have the permissions to perform an action, the attempt was still logged.
With regard to migration, I don't see it as a big problem, as you can easily preserve the id sequences when moving/dumping/transferring data. Perhaps the only issue would be if you needed to merge datasets; you could always write a migration script in that event, so from a personal perspective I consider that disadvantage somewhat diminished.
It might be worth looking at the Stack Overflow table structures for their Data Explorer (which is reasonably sophisticated). You can see the table structure here: https://data.stackexchange.com/stackoverflow/query/new, which comes from a question on meta: How does SO store revisions?
As a revision system, SO works well and the markdown/revision functionality is probably a good example to pick over.

Use Id. It's simple and works.
The only caveat is if you routinely add rows from a store-and-forward server, so rows may be added later but should be treated as having been added earlier.

Or add another column whose sole purpose is to record the editing order. I suggest you do not use datetime for this.
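For example, a sketch of that extra column (edit_order is a hypothetical name):

ALTER TABLE edits ADD COLUMN edit_order BIGINT UNSIGNED NOT NULL;
-- the application assigns the next value explicitly on each insert,
-- so renumbering ids during a migration never disturbs the ordering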

Related

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored in a database for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the current state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. This matters because I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e. if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to produce a report showing common properties across all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice, I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID (VARCHAR) - the system identifier
POSTTIME (DATETIME) - the time the information was posted
NAME (VARCHAR) - the name of the parameter
VALUE (VARCHAR) - the value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
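A sketch of that table in MySQL DDL (the table name and column sizes here are assumptions):

CREATE TABLE system_config (
  system_id VARCHAR(32)  NOT NULL,  -- system identifier
  posttime  DATETIME     NOT NULL,  -- when the information was posted
  name      VARCHAR(64)  NOT NULL,  -- parameter name
  value     VARCHAR(255),           -- parameter value
  PRIMARY KEY (system_id, posttime, name)
);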
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice INSERT ... ON DUPLICATE KEY UPDATE extension when you post data. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
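For that no-history variant, the upsert would look roughly like this (same table minus POSTTIME, so the primary key becomes (system_id, name)):

INSERT INTO system_config (system_id, name, value)
VALUES ('sys-0001', 'poll_interval', '180')
ON DUPLICATE KEY UPDATE value = VALUES(value);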
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY storage engine for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
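A sketch of the MEMORY table and the periodic copy job (table and column names are assumptions):

CREATE TABLE current_data (
  system_id VARCHAR(32) NOT NULL,
  name      VARCHAR(64) NOT NULL,
  value     VARCHAR(255),
  posttime  DATETIME    NOT NULL,
  PRIMARY KEY (system_id, name)
) ENGINE=MEMORY;

-- run every few minutes/hours to persist a snapshot to an on-disk table
INSERT INTO data_history (system_id, name, value, posttime)
SELECT system_id, name, value, posttime FROM current_data;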
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Cross Stream Data changes - EDW

I have a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B must be re-processed. So a common process is required to identify the changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, if you need to show multiple versions of one fact or dimension value to users etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here)
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures)
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
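As a rough illustration of steps 3-5, flagging re-loaded PKs could look something like this (all table and column names are hypothetical):

-- mark rows in the current batch whose PK was already loaded in an earlier batch
UPDATE loading_table cur
JOIN loading_table prev
  ON prev.source_pk = cur.source_pk
 AND prev.load_batch < cur.load_batch
SET cur.needs_reprocess = 1
WHERE cur.load_batch = @current_batch;

A stored procedure can then pick up the flagged rows and regenerate whatever depends on them.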
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
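In MySQL terms that would be a BINARY(20) column with UNHEX()/HEX() at the boundaries; a sketch, assuming plain 40-character SHA1 hex strings:

CREATE TABLE items (item_id BINARY(20) PRIMARY KEY);
INSERT INTO items (item_id) VALUES (UNHEX('da39a3ee5e6b4b0d3255bfef95601890afd80709'));
SELECT HEX(item_id) FROM items;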
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order: (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing large numbers of insertions and deletions. In that circumstance you pay a cost for insertion, and double that cost for deletion. You must also iterate over the entire search result.
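A sketch of that table (using BINARY(20) to match the raw-bytes trick above):

CREATE TABLE saved_searches (
  search_id BIGINT UNSIGNED NOT NULL,
  item_id   BINARY(20)      NOT NULL,
  PRIMARY KEY (search_id, item_id)
);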
If your search items have an incrementing primary id, such that any new insertion to the database will have a higher value than anything already in the database, that is the most efficient. Alternatively, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
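The zero-space variant could then be as small as this (names are hypothetical):

CREATE TABLE saved_queries (
  search_id   BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  query_text  TEXT            NOT NULL,  -- the original search
  max_item_id BIGINT UNSIGNED NOT NULL   -- highest item id at save time
);
-- replaying the search later: run query_text again, restricted to id <= max_item_id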

How to design a SQL db with undo-redo?

I'm trying to figure out how to design my DB tables to allow Undo-Redo.
Pretend you have a tasks table with the following structure:
id <int>
title <varchar>
memo <string>
date_added <datetime>
date_due <datetime>
Now assume that over a few days and multiple log-ins several edits have taken place, but a user wants to go back to one of the earlier versions.
Would you have a separate table tracking the changes - or - would you try to keep the changes within the tasks table ("ghost" rows, for lack of a better term)?
Would you track all of the columns or just the ones that changed each time?
If it matters, I'm using MySQL. Also, if it matters, I'd like to be able to show the history (ala Photoshop) and allow a user to switch to any version.
Bonus question: Would you save the whole memo cell on a change or would you try to save the delta only? Reason I ask is because the memo cell could be large and only a single word or character might be changed each revision. Granted, saving the delta would require parsing, but if undos aren't expected very often, wouldn't it be better to save space rather than processing time?
Thank you for your help.
I would create a History table for your tasks table. Same structure as tasks + a new field named previousId. This would hold the previous change's id, so you can go back and forth through different changes (undo/redo).
Why a new History table? For a simple reason: do not overload tasks table with things that it was not designed for.
As for space, in the History table, instead of a memo, use a binary field and zip the text content you want to store. Don't try to detect changes; you will end up with buggy code, which will result in frustration and wasted time...
Optimization:
Even better, you may keep only three columns in History table:
1. taskId (foreign key to tasks)
2. data - a binary field. Before saving in the History table, create an XML string holding only the fields that have changed.
3. previousId (will help maintain a queue of changes and allow navigation back and forth)
As for data field, create an XML string like this:
<task>
<title>Title was changed</title>
<date_added>2011-03-26 01:29:22</date_added>
</task>
This will basically tell you that this time you changed only the title and the date_added fields.
After the XML string is built, just zip it if you want and store it into History table's data field.
XML also allows for flexibility: if you add or remove a field in the tasks table, you don't need to update the History table as well. The structures of the tasks table and the History table stay decoupled, so you don't have to change two tables for every schema change.
PS: don't forget to add some indexes so you can navigate the History table quickly. Fields to be indexed: taskId and previousId, as you will need fast queries against this table.
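A sketch of that History table (an id column is added so previousId has something to point to; types and sizes are assumptions):

CREATE TABLE history (
  id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  taskId     INT UNSIGNED NOT NULL,   -- foreign key to tasks
  data       MEDIUMBLOB   NOT NULL,   -- zipped XML of the changed fields
  previousId INT UNSIGNED NULL,       -- previous change in the undo/redo chain
  INDEX idx_task (taskId),
  INDEX idx_prev (previousId),
  FOREIGN KEY (taskId) REFERENCES tasks(id)
);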
Hope this helps.
When I do similar types of things using SQL, I always use a second table for revision history. This prevents your primary table from getting overly large with versions. The rationale is that retrieving the current record happens almost 100% of the time, while viewing history and rolling back (undo) is very infrequent.
If you only have a single UNDO or history, then tracking in-table is probably fine.
Whether you want to save deltas or the entire cell depends on expected growth/usage. If you are comfortable creating the logic to manage deltas, that will save you space. If new versions aren't created very often, I wouldn't start with that (applying YAGNI).
You might want to compress revisions in delta form but you should still have the current revision in full for quick retrieval.
However, older-to-newer deltas require lots of processing unless you have some non-delta base to work from, and newer-to-older deltas require reprocessing every time something changes. So deltas usually buy you few benefits at the cost of greater complexity.
Last I checked (some years ago), MediaWiki, the software behind Wikipedia, stored full texts, provided a means to compress older revisions with gzip to save space, and used a dedicated archive table for deleted revisions/pages.
Their website has an ER diagram of their database layout which you might find useful.
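If you take that route in MySQL, the current revision can stay in plain text while older revisions are compacted with COMPRESS()/UNCOMPRESS(); a sketch, where the revisions table and its columns are hypothetical:

-- compact everything except the current revision of each task
UPDATE revisions
SET body_zipped = COMPRESS(body), body = NULL
WHERE is_current = 0 AND body IS NOT NULL;

-- reading an old revision back
SELECT UNCOMPRESS(body_zipped) FROM revisions WHERE id = ?;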

Version Tracking with mysql

I have a database with books.
Each book has one author and one publisher, plus some prices, IDs and descriptions.
I want to keep track of changes made to one product. One way is to save the product with time AND id as primary key.
Are there other ways?
Are there database systems (I've only used MySQL) that can keep track of changes automatically?
greetings...
What you are asking for is mostly covered by the "Change Data Capture" (CDC) design patterns and the "Slowly Changing Dimension" (SCD) concept.
Read the Wikipedia articles on these subjects, as they provide a good bird's-eye view of the topic.
One approach is to have 2 separate tables e.g. books and book_versions with the same set of fields (author, publisher, description etc.).
Whenever your application does an insert or update into books, you insert a corresponding record into book_versions. This means that the books table contains the latest version of the record and book_versions contains the latest and all historical versions. If you're only interested in the latest version the majority of the time, you can just select from books by ID and only retrieve the history when you need it. This is the approach used by the acts_as_versioned plugin for Ruby on Rails.
You can use a trigger (MySQL has supported them since 5.0) to catch the 'update' event and write the relevant information into a 'log' table.
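A minimal sketch of such a trigger, assuming a books table and a book_versions copy table as in the previous answer (the column names are assumptions):

DELIMITER //
CREATE TRIGGER books_log_update
BEFORE UPDATE ON books
FOR EACH ROW
BEGIN
  -- copy the outgoing version into the history table
  INSERT INTO book_versions (book_id, author, publisher, description, changed_at)
  VALUES (OLD.id, OLD.author, OLD.publisher, OLD.description, NOW());
END//
DELIMITER ;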
Databases do have transaction logs, but they're probably not useful for you, as I don't think they can be trivially queried.
A simple solution is to include a modified-date as a field in your product table.
Update your stored procedures to always pull the row for a given product ID with the latest modified date.
This would allow you to have a separate stored procedure that lists all versions of a product.
I propose to add a changelog table to your system. This table is only ever written to, and it has the columns date, subject, predicate, object, where subject is the author/principal making the change, predicate is the nature of the change (create, update, delete), and object is the thing being changed. Potentially, you can further split object into id, attribute, value, where id is the book id, attribute is the string name of the attribute being changed, and value is the old value (as the new one is in the proper table).
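A sketch of that changelog table, with object split into id/attribute/value as suggested (the date column is named changed_at here for clarity; sizes are assumptions):

CREATE TABLE changelog (
  changed_at DATETIME NOT NULL,
  subject    VARCHAR(64) NOT NULL,                   -- user/principal making the change
  predicate  ENUM('create','update','delete') NOT NULL,
  book_id    INT UNSIGNED NOT NULL,
  attribute  VARCHAR(64) NOT NULL,                   -- name of the changed attribute
  old_value  TEXT,                                   -- the new value lives in the main table
  INDEX idx_book (book_id, changed_at)
);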
Any of the suggested solutions above will work; it really depends on your workload and data set size.
If you have a lot of records and you just want a historical archive for reference, you may also consider moving "old/earlier" versions off the database and instead storing them on disk in some kind of linked-list format (e.g. each stored version contains the address of the previous version, forming a linked list), keeping just a pointer to the latest version in the DB.
There are pluses and minuses to this approach, but one plus is that you can keep your DB small and just read older versions off disk. Your older versions should be immutable, so you won't need to rely on transaction/concurrency support from the DB. If your "current/up-to-date" data set is, say, 100G and your past versions are 900G, you can put the 100G database on RAID, put the past versions on cheaper storage, and copy them a few times (they are immutable, so there are no concurrency issues when replicating).
You might be interested in the concept of temporal databases used to describe things that change in time. There is a freely available book on temporal databases that describes this concept in every detail, but for something more down to earth you could read Patterns for things that change with time by Martin Fowler, my favorite programming author.