Short version: I'm looking for suggestions on how to implement a database versioning system that keeps track of relations as well as record data for moderating a social web app with user-editable content.
Long version: I'm developing a social app where all the users can edit content. To give you an example, let's say one type of item to edit is a Book. A Book can have a title and a few authors (a many-to-many relation to an authors table). Of course this example is rather simple; real items will have many more fields as well as relations, but the rule should be the same.
Now, I need to implement a versioning system to moderate changes made by users. Let's say there are two groups of users: normal users and trusted users. Changes made by normal users are logged, but aren't committed until a moderator accepts that particular change. Changes made by trusted users are committed immediately, but still logged so that they can be reverted at any time.
If I were to keep revisions of a Book from the example, I would need to keep track of changing relations with authors. I need to be able to track adding and deleting relations, and be able to revert them.
So if the title gets modified, I need to be able to revert the change. If a relation to an author gets deleted, I need to be able to revert that; and if some relation gets added, I need to be able to revert that too.
I only need to keep track of an item and its relations, not anything that it relates to. If we had tables foos, bars, and foos_bars, I would be interested only in logging foos and foos_bars; bars would be pretty independent.
I'm familiar with this great question, as well as its kind-of-an-adversary solution, and a pretty comprehensive article on the second approach and its follow-up. However, none of those give any special consideration to keeping track of relations alongside normal table data, which would be the obvious answer to my problem.
I like the one-table-for-all-history approach, as it makes it easy to keep only part of the changes and undo the others. For instance, if one user submitted fields A and B, and then a second user submitted A and B, it would be easy to undo just the second user's B-change and keep the A. It's also nice to have one table for the whole functionality, as opposed to the many tables of the other approach. It also makes it easy to see who did exactly what (e.g. modified only the foobar field) - that doesn't seem easy with the other approach. And it seems like it would be easier to automate the moderation process - we don't even really need to know the table names, as everything needed is stored in a revision record.
If I were to use the one-revisions-table-for-each-revisioned-table approach, then, having limited experience in writing triggers, I don't know whether it would be possible or relatively easy to implement a system that automatically records an edit but doesn't commit it immediately unless some parameter is set (e.g. edit_from_trusted_user == true). It also makes me think of triggers firing when I wouldn't really want them to (as the moderation wouldn't apply to e.g. changes made by an admin, or some other "objects" that could try to modify the data).
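From what I've read, a trigger could at least be gated on a connection-scoped session variable that the application sets before privileged writes - an untested sketch, with made-up table and variable names (books, books_revisions, @current_user_id):

-- Log the old row on every update, unless the application flagged
-- this connection to skip logging (e.g. for admin edits).
DELIMITER //
CREATE TRIGGER books_log_update
AFTER UPDATE ON books
FOR EACH ROW
BEGIN
    IF COALESCE(@skip_revision, 0) = 0 THEN
        INSERT INTO books_revisions (book_id, title, edited_at, edited_by)
        VALUES (OLD.id, OLD.title, NOW(), @current_user_id);
    END IF;
END//
DELIMITER ;

-- Application side, before an admin write on the same connection:
-- SET @skip_revision = 1;

Even then, a trigger can only log the change after the fact; holding a normal user's edit back until a moderator accepts it would still have to live in the application.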
No matter which solution I choose, it seems as if I'll have to add a rather artificial id to all many-to-many relation tables (instead of [book_id, author_id] I would have [id, book_id, author_id]).
I thought about implementing relations in the one table approach like so:
if we have the standard revision table structure:
[ID] [int]
[TableName] [varchar]
[RecordID] [int]
[FieldName] [varchar]
[OldValue] [varchar]
[NewValue] [varchar]
[EventType] [enum]
[EventDate] [datetime]
[UserID] [int]
we could store relations by simply setting RecordID and FieldName to NULL, setting EventType to either ADD or DELETE, and putting the relation's foreign keys into OldValue and NewValue. The only problem is that some of my relations carry additional data (like a graph's edge weight), so I would have to store that somewhere too. Then again, the operation of adding a new relation could be split into a 2-event sequence: ADD and SET(weight), but then artificial relation IDs would be needed, and I'm not sure such a solution wouldn't have some bad implications in the future.
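For example, with a surrogate relation ID, linking author 12 to book 5 with weight 3 might be logged as two rows - all values, including the encoding of the foreign-key pair in NewValue, are just one possible choice:

-- 42 is the surrogate id of the new books_authors row; user 7 made the edit.
INSERT INTO revisions
    (TableName, RecordID, FieldName, OldValue, NewValue, EventType, EventDate, UserID)
VALUES
    ('books_authors', 42, NULL,     NULL, 'book=5;author=12', 'ADD', NOW(), 7),
    ('books_authors', 42, 'weight', NULL, '3',                'SET', NOW(), 7);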
There will be around 5 to 10 versioned tables, each with, on average, 3 many-to-many relations to keep track of. I'm using MySQL with InnoDB; the app is written in PHP 5.3 and connects to the db using PDO.
Putting versioning in the app logic instead of db triggers is fine with me. I just need the whole thing to work and be reasonably efficient. I expect reverts to occur rather seldom compared to edits, and edits to be few compared to the number of views of the content. Only moderators will access revision data, to either accept or reject recent changes.
Do you have any experience implementing such system? What are suggested solutions to this problem? Any considerations that come to mind?
I searched SO and the net for quite some time, but didn't find anything to help me with the matter. However, if I missed something, I'll be grateful for any links / directions.
Thanks.
Related
In a Relational Database, what is the best way to handle removing an object from the object graph while still retaining referential integrity? At some point, this has to happen. Either through a soft or hard delete.
For example - when a product is removed, what is the best approach to make sure that the orders containing that product are still relevant, or furthermore that invoices containing orders containing that product are still relevant?
There are basically 3 "standard solutions":
Solution 1
You need the product (as in your case, because of the invoices referencing it). This means the data is VALID and the only change is that the product goes "out of stock" or "out of portfolio". In any case, your business processes will often require you to handle RMA situations or IRS-related matters, for example... which means the product must not be deleted. It is just in a different "state", which needs to be reflected by your DB data model etc.
IF you are concerned about performance, do some profiling... if need be, you have a multitude of optimization options. These are usually RDBMS-dependent; one technique is "partitioning" - every RDBMS has its own mechanics here, which differ in flexibility etc.
Solution 2
You don't need any of the data at all... just do a cascaded delete and be done with it...
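In InnoDB terms, for example, this is just a foreign key declared with ON DELETE CASCADE - a quick sketch with made-up names:

-- Order lines now vanish automatically with their product.
ALTER TABLE order_items
    ADD CONSTRAINT fk_order_items_product
    FOREIGN KEY (product_id) REFERENCES products (id)
    ON DELETE CASCADE;

DELETE FROM products WHERE id = 123;  -- also removes its order_items rows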
Solution 3
You only need historical data, but no "future business process" will ever need this entity (i.e. the product) again... in this case a common solution is to have archive tables which are filled before doing a cascaded delete on the "active/productive" tables. A slight variant of this scheme is copying the needed information into the "dependent rows" (the invoices in your case) and just deleting the active/productive row (i.e. the product in your case).
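A sketch of the archive-table flavour, assuming a products_archive table that mirrors products plus a trailing archived_at column:

-- Copy the doomed row (with a timestamp) into the archive, then delete;
-- the cascaded delete cleans up the active tables.
INSERT INTO products_archive
SELECT p.*, NOW() FROM products p WHERE p.id = 123;

DELETE FROM products WHERE id = 123;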
Conclusion
Complex systems deal with a lot of different business processes/use cases and thus tend to employ all of the above techniques - each has its place depending on the specific business processes/use cases involved...
Here is an answer I received from an unnamed source. I will say this: he is pretty well respected, and to be respectful I am not going to post his name.
I am not going to accept my own answer here, or bypass the bounty, but am just showing his answer.
"With a full-featured RDBMS you can partition the table on the "deleted_or_not" column and that will result in all of the live production rows to be stored compactly. If you don't want deprecated data to show up in reports, simply give the full table an obscure name, such as customers_including_deleted_rows and create a view "customers" (containing only the live rows) from which most of the application code queries. This assumes, of course, that there is some value to having the old data around."
Background
I am developing a social web app for poets and writers, allowing them to share their poetry, gather feedback, and communicate with other poets. I have very little formal training in database design, but I have been reading books, SO, and online DB design resources in an attempt to ensure performance and scalability without over-engineering.
The database is MySQL, and the application is written in PHP. I'm not sure yet whether we will be using an ORM library or writing SQL queries from scratch in the app. Other than the web application, Solr search server and maybe some messaging client will interact with the database.
Current Needs
The schema I have thrown together below represents the primary components of the first version of the website. Initially, users can register for the site and do any of the following:
Create and modify profile details and account settings
Post, tag and categorize their writing
Read, comment on and "favorite" other users' posts
"Follow" other users to get notifications of their activity
Search and browse content and get suggested posts/users (though we will be using the Solr search server to index DB data and run these type of queries)
Schema
Here is what I came up with on MySQL Workbench for the initial site. I'm still a little fuzzy on some relational databasey things, so go easy.
Questions
In general, is there anything I'm doing wrong or can improve upon?
Is there any reason why I shouldn't combine the ExternalAccounts table into the UserProfiles table?
Is there any reason why I shouldn't combine the PostStats table into the Posts table?
Should I expand the design to include the features we are doing in the second version just to ensure that the initial schema can support it?
Is there anything I can do to optimize the DB design for Solr indexing/performance/whatever?
Should I be using more natural primary keys, like Username instead of UserID, or zip/area code instead of a surrogate LocationID in the Locations table?
Thanks for the help!
In general, is there anything I'm doing wrong or can improve upon?
Overall, I don't see any big flaws in your current setup or schema.
What I'm wondering about is your split into 3 User* tables. I get what your intention was (keeping different user-related things separate), but I don't know if I would go with the exact same thing. If you plan on displaying only data from the User table on the site, this is fine, since the other info is not needed multiple times on the same page. But if the site has to display users' real names (like John Doe instead of doe55), then this will slow things down once the data gets bigger, since you may require joins. Having the Preferences separate seems like a personal choice; I have no argument for or against it.
Your many-to-many tables would not need an additional PK (e.g. PostFavoriteID). A combined primary key of both PostID and UserID would be enough, since PostFavoriteID is never used anywhere else. This goes for all join tables.
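A sketch of what that looks like, assuming tables named Posts and Users:

CREATE TABLE PostFavorites (
    PostID INT NOT NULL,
    UserID INT NOT NULL,
    PRIMARY KEY (PostID, UserID),  -- the pair itself is the key
    FOREIGN KEY (PostID) REFERENCES Posts (PostID),
    FOREIGN KEY (UserID) REFERENCES Users (UserID)
) ENGINE=InnoDB;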
Is there any reason why I shouldn't combine the ExternalAccounts table into the UserProfiles table?
As with the previous answer, I don't see an advantage or a disadvantage. I might put both in the same table, since the NULL (or maybe better, -1) values would not bother me.
Is there any reason why I shouldn't combine the PostStats table into the Posts table?
I would put them into the same table, using a trigger to handle incrementing the ViewCount column.
Should I expand the design to include the features we are doing in the second version just to ensure that the initial schema can support it?
You are using a normalised schema, so any additions can be done at any time.
Is there anything I can do to optimize the DB design for Solr indexing/performance/whatever?
Can't tell you; I haven't done it yet, but I know that Solr is very powerful and flexible, so I think you should be doing fine.
Should I be using more natural primary keys, like Username instead of UserID, or zip/area code instead of a surrogate LocationID in the Locations table?
There are many threads here on SO discussing this. Personally, I like a surrogate key better (or another unique numeric key if available), since it makes queries easier and faster - an int is looked up more cheaply. Also, if you allow a change of username/email/whatever-your-PK-is, then massive updates are required; with a surrogate key, you don't need to bother.
What I would also do is add things like created_at and last_accessed_at columns (best done via triggers or procedures IMO) to have some stats already available. This can really give you valuable insights.
Further strategies to increase performance would be things like memcache, counter caches, partitioned tables, etc. Such things can be discussed when you are really overrun by users, because there may be things/technologies/techniques that are very specific to your problem.
I'm not clear what's going on with your User* tables - they're set up as if they're 1:1 but the diagram reflects 1-to-many (the crow's foot symbol).
The ExternalAccounts and UserSettings could be normalised further (in which case they would then be 1-to-many!), which will give you a more maintainable design - you wouldn't need to add further columns to your schema for additional External Account or Notification Types (although this may be less scalable in terms of performance).
For example:
ExternalAccounts
UserId int,
AccountType varchar(45),
AccountIdentifier varchar(45)
will allow you to store LinkedIn, Google, etc. accounts in the same structure.
Similarly, further Notification Types can be readily added using a structure like:
UserSettings
UserId int,
NotificationType varchar(45),
NotificationFlag ENUM('on','off')
hth
Lately I've been rethinking a database design I made a couple of months ago. The main reason is that last night I read the database schema of vBulletin and saw that they use many, MANY, tables.
The current "idea" I'm using for my schema, for instance my log table, is to keep everything in one table by differencing the type of Log with an integer:
id, type, type_id, action, message
1,  1,    305,     2,      'Explanation for user Ban'
2,  2,    1045,    1,      'Reason for deletion of Article'
Where type 1 = user, type 2 = article, type_id = the ID of the user, article or whatever, and action 2 = ban, action 1 = deletion.
Should I change the design to several tables - logBans, logSomething and so on - or is it better to keep the method I'm currently using?
The issue here is subtyping. There are three basic approaches to dealing with subtypes.
Put each record type into a completely separate table;
Put a record in a parent table and then a record in a subtype table; and
Put all the records in one table, having nullable columns for the "optional" data (ie things that don't apply to that type).
Each strategy has its merits.
For example, (3) is particularly applicable if there is little to no difference between different subtypes. In your case: do different log records have extra columns if they're of a particular type? If they don't, or if there are few cases when they do, putting them all in one table makes perfect sense.
(2) is commonly used for a Party table. This is a common model in CRMs that involves a parent Party object which has subtypes for Person and Organization (Organization may also have subtypes like Company, Association, etc). Person and Organization have different properties (eg salutation, given names, date of birth, etc for Person), so it makes sense to split these up rather than using nullable columns.
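A minimal sketch of that layout (names are illustrative):

-- One parent row per party, plus one row in the matching subtype table.
CREATE TABLE Party (
    party_id   INT PRIMARY KEY,
    party_type ENUM('person','organization') NOT NULL
);

CREATE TABLE Person (
    party_id   INT PRIMARY KEY,
    given_name VARCHAR(100),
    birth_date DATE,
    FOREIGN KEY (party_id) REFERENCES Party (party_id)
);

CREATE TABLE Organization (
    party_id INT PRIMARY KEY,
    name     VARCHAR(200),
    FOREIGN KEY (party_id) REFERENCES Party (party_id)
);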
(2) is potentially more space efficient (although the overhead of NULL columns in modern DBMSs is very low). The bigger issue is that (3) might be more confusing to developers: you will get a situation where someone needs to store an extra field somewhere and will whack it in a column that's empty for that type, simply because it's easier doing that than getting approval from the DBAs to add a column (no, I'm not kidding).
(1) is probably the least frequently used scheme of the 3 in my experience.
Lastly, scalability has to be considered, and it is probably the best case for (1). At a certain point JOINs don't scale effectively and you'll need to use some kind of partitioning scheme to cut down your table sizes. (1) is one method of doing that (but a crude one).
I wouldn't worry too much about that though. You'll typically need to get to hundreds of millions or billions of records before that becomes an issue (unless your records are really really large, in which case it'll happen sooner).
It depends. If you're going to have 1,500,000,000 entries of type 1 and 1,000 entries of type 2, and you'll be doing a LOT of queries on type 2, separate the tables. If not, it's more convenient to keep only one table.
Keep in mind scalability:
How many entries of each type will I have in 1 year?
How many requests on this table will I be doing?
Can you, at some point, clear this log? Can you move it to another table (like archiving entries older than X months)?
The one drawback I see right now is that you cannot enforce foreign key integrity on your type_id since it points to many different tables.
I want to add a small tip. It's a little off-topic and quite basic, but it's a lot clearer to use an enum instead of a tinyint for status flags, i.e.
enum('user','article')
If there are only two statuses, tinyint is a little more memory-efficient, but less clear. Another disadvantage of enum is that you put a part of the business logic in the data tier - when you need to add or remove statuses, you have to alter the DB. Otherwise enum is much clearer, and I prefer it.
I would keep things as specific as possible - in this case I would create two tables.
Each table has a specific purpose so I cannot see why you would combine them.
I wouldn't do what vBulletin does. The problem with older apps like vBulletin is that while they might have started as lean machines, over time they collect a lot of entropy and end up bloated. Since there are plugins, third-party tools, and developers who've worked on the old code, breaking it is a tough choice.
That's why there is not much refactoring going on there. Don't make them your programming model. Look around, find out what works best, and use that. A lot of tables sounds like a bad thing to me, not a good one.
Pardon the elementary question but my newness to the realm of database design leaves me in a bind quite often.
I have a site that keeps growing with regard to families of information. In the beginning I had one sort of item I was describing, and all was well. That item occupied one record and had 34 columns of descriptive data attributed to it (a lot, now that I look back). As I get more and more into this stuff, I see that many developers break data out (when practical) into distinct tables.
I've now got additional tables that relate to the original item but are not always needed when describing the original item so I broke them out so they're not queried unnecessarily.
Anyhow, I have a new item I've been trying to organize which is a USER. The user table has typical columns like username, email, last_login, path to associated image, etc. These users have been making comments, which I keep in yet another table that contains columns with IDs that relate to the user and the item on which they are commenting.
Now... I am in the process of adding the obligatory user profile page to the site. Should I create yet another table containing only the essential profile data, or add the profile data to the existing user record in the original user table? I am thinking housekeeping might be a pain if I add a "Remove me from site" function, as I would have to run something that kills the user record, the user profile record, and any other data associated with that user ID in other tables.
Basically, what I am asking is: should I keep going with this "granular" design method - breaking everything out into essential parts - or does it ever serve me to consolidate into larger tables? I see a few instances where, if a user deletes their account, I'll be left with a bunch of non-relevant data. For instance, the original items are restaurants... if I make a table to record "Visits" to restaurants, containing the restaurant ID and the user ID, then if the user or the restaurant gets removed from the site, this "Visits" table will have a bunch of useless records saying either "non-existent restaurant was visited by user 45" or "restaurant 21 was visited by non-existent user".
I hope I make sense here... I'm just wondering if it's normal to end up with this "junk" data over time.
Thanks much,
Rob
Deleting that "on-relevant" data is a normal, healthy part of an application's life. It's just what happens. You just have to do it, like you brush your teeth or make your bed. Don't let two or three DELETE queries influence how your tables get structured. They're not that expensive, and honestly, if you think that's too much of a pain, you're in the wrong business :)
If you're using InnoDB tables, you can look into foreign key constraints that will take care of some of the cleanup for you.
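For the "Visits" example from the question, that might look like this (a sketch; table and column names are made up, and one row per restaurant/user pair is assumed for simplicity):

-- Visit rows disappear automatically with either parent row.
CREATE TABLE visits (
    restaurant_id INT NOT NULL,
    user_id       INT NOT NULL,
    PRIMARY KEY (restaurant_id, user_id),
    FOREIGN KEY (restaurant_id) REFERENCES restaurants (id) ON DELETE CASCADE,
    FOREIGN KEY (user_id)       REFERENCES users (id)       ON DELETE CASCADE
) ENGINE=InnoDB;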
You'll be able to make these decisions much more easily if you learn about normalization.
In general, if data all relates to the same logical entity -- the same "thing" -- then it should go in the same table. Breaking one table into two just to keep the tables smaller is generally not a good idea. Depending on what you are doing, it may or may not make queries faster, and it introduces unnecessary complexity. Let me explain.
Whether it makes queries faster depends on the nature of the data and how you use it. If you have some very large field, like "rambling_comments varchar(5000)" or some such, and it is rarely used, then breaking it into a separate table so that what's left in the "main" table is relatively small could indeed make your queries faster, for the fairly obvious reason that there is now less data to read. But if the size of the fields you are thinking of breaking out are modest, and you often need data from both tables, then queries that only use one table don't gain that much, and queries that use both now need to do a join, which is usually more expensive than reading a somewhat bigger record.
But breaking up your tables will certainly make your programs more complex. Now you have to keep track of which data is in which table. You'll constantly be checking if that field is in the Item_Descriptive_Data table or the Item_Stock_Data table or whatever. You're liable to lose track at some point and accidentally put the same field into two tables. (Or worse, you'll decide this is a good idea and do it deliberately.) Then you have redundant and potentially contradictory data.
You have to do joins every time you need data that crosses tables. You create the possibility that records in one or more of the tables may not exist. Like, if you break your User table into User_Main and User_Profile, and you need data from both tables so you do a join, what happens if there is a record in User_Profile with no corresponding record in User_Main? You're going to have to add code to check for the possibility and deal with it. Oh, and blithely saying "That can never happen, no need to worry about it" is a very dangerous attitude: No matter that it's not SUPPOSED to happen, sooner or later it will, and if you don't handle the error gracefully, you could have a real mess.
In short, breaking up tables for performance reasons is usually a premature optimization. If you find that you have some real performance problem, THEN look at the tables and see if you should denormalize for efficiency. But don't start out trashing your database just to avoid a problem that might possibly happen someday.
I am designing a system, and I don't think it's a good idea to give end users the ability to delete entries in the database. I think that way because the end user, once given admin rights, will often end up making a mess in the database and then turn to me to fix it.
Of course, they will need to be able to remove entries - or at least think that they did - if they are set as admin.
So, I was thinking that all the entries in the database should have an "active" field. If they try to remove an entry, it will just set the flag to "false" or something similar. Then there will be some kind of super admin - my company's team - who could change this field.
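To illustrate what I have in mind (with a made-up entries table):

-- "Deleting" just flips the flag...
UPDATE entries SET active = 0 WHERE id = 42;

-- ...and every ordinary query has to remember to filter on it:
SELECT * FROM entries WHERE active = 1;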
I have already seen this at another company I worked for, but I was wondering if it is a good idea. I could just make regular database backups and then roll back if they commit an error, and adding this field would add some complexity to all the queries.
What do you think? Should I do it that way? Do you use this kind of trick in your applications?
In one of our databases, we distinguished between transactional and dictionary records.
In a couple of words: transactional records are things that you cannot roll back in real life, like a call from a customer. You can change the caller's name, status, etc., but you cannot undo the call itself.
Dictionary records are things that you can change, like assigning a city to a customer.
Transactional records and things that lead to them were never deleted, while dictionary ones could be deleted all right.
By "things that lead to them" I mean that as soon as the record appears in the business rules which can lead to a transactional record, this record also becomes transactional.
For example, a city can normally be deleted from the database. But once a rule appeared that said "send an SMS to all customers in Moscow", cities became transactional records as well, or we would not be able to answer the question "why did this SMS get sent?".
A rule of thumb for distinguishing was this: is it only my company's business?
If one of my employees made a decision based on data from the database (say, he made a report on which some management decision was based, and then the data the report was based on disappeared), it was considered OK to delete that data.
But if the decision affected some immediate actions with customers (like calling, messing with the customer's balance etc.), everything that led to these decisions was kept forever.
It may vary from one business model to another: sometimes it may be required to record even internal data; sometimes it's OK to delete data that affects the outside world.
But for our business model, the rule from above worked fine.
A couple of reasons people do things like this are auditing and automated rollback. If a row is completely deleted, then there's no way to automatically roll back that deletion if it was made in error. Also, keeping a row around along with its previous state is important for auditing - a super user should be able to see who deleted what and when, as well as who changed what, etc.
Of course, that's all dependent on your current application's business logic. Some applications have no need for auditing and it may be proper to fully delete a row.
The downside to just setting a flag such as IsActive or DeletedDate is that all of your queries must take that flag into account when pulling data. This makes it more likely that another programmer will accidentally forget this flag when writing reports...
A slightly better alternative is to archive such records into a different database. That way they have been physically moved to a location that is not normally searched. You might add a couple of fields to capture who deleted the record and when; but the point is that it won't be polluting your main database.
Further, you could provide an undo feature to bring it back fairly quickly; and do a permanent delete after 30 days or something like that.
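The 30-day permanent delete can even be scheduled inside MySQL itself - a sketch with a made-up archive table, requiring the event scheduler to be enabled:

-- Runs daily; permanently purges rows archived more than 30 days ago.
CREATE EVENT purge_old_archive
ON SCHEDULE EVERY 1 DAY
DO
    DELETE FROM archived_records
    WHERE deleted_at < NOW() - INTERVAL 30 DAY;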
UPDATE concerning views:
With views, the data still participates in your indexing scheme. If the amount of potentially deleted data is small, views may be just fine as they are simpler from a coding perspective.
I prefer the method that you are describing. It's nice to be able to undo a mistake. More often than not, there is no easy way of going back on a DELETE query. I've never had a problem with this method, and unless you are filling your database with 'deleted' entries, there shouldn't be an issue.
I use a combination of techniques to work around this issue. For some things, adding the extra "active" field makes sense. The user then has the impression that an item was deleted, because it no longer shows up on the application screen. The scenarios where I would implement this include items that are required to keep a history... let's say invoices and payments. I wouldn't want such things being deleted for any reason.
However, there are some items in the database that are not so sensitive - let's say a list of categories that I want to be dynamic. I may then allow users with admin privileges to add and delete a category, and the delete could be permanent. However, as part of the application logic, I check whether the category is used anywhere before allowing the delete.
I suggest having a second database, like DB_Archives, where you add every row deleted from the DB. The is_active field negates the very purpose of foreign key constraints, and YOU have to make sure that a row is not marked as deleted while it's still referenced elsewhere. This becomes overly complicated when your DB structure is massive.
It is an accepted practice that exists in many applications (Drupal's versioning system, et al.). Since MySQL scales very quickly and easily, you should be okay.
I've been working on a project lately where all the data was kept in the DB as well. The status of each individual row was kept in an integer field (data could be active, deleted, in_need_for_manual_correction, historic).
You should consider using views to access only the active/historic/... data in each table. That way your queries won't get more complicated.
Another thing that made things easy was the use of UPDATE/INSERT/DELETE triggers that handled all the flag changing inside the DB, thus keeping the complex stuff out of the application (for the most part).
I should mention that the DB was an MSSQL 2005 server, but I guess the same approach should work with MySQL, too.
Yes and no.
It will complicate your application much more than you expect, since every table that does not allow deletion will need the extra check (IsDeleted = false) etc. It doesn't sound like much, but when you build a larger application and a query over 11 tables requires that check on 9 of them... it gets tedious and error-prone. (Well, yeah, then there are deleted/non-deleted views... when you remember to create/use them.)
Some schema upgrades will become a PITA, since you'll have to relax FKs and invent "suitable" data for very, very old records.
I've not tried it, but I have thought a moderate amount about a solution where you'd zip the row data to XML and store that in some "Historical" table. Then, in case of "must have that restored now OMG the world is dying!1eleven", it's possible to dig it out.
I agree with all respondents that if you can afford to keep old data around forever, it's a good idea. For performance and simplicity, I agree with the suggestion of moving "logically deleted" records to "old stuff" tables rather than adding "is_deleted" flags. (Moving them to a totally different database seems a bit like overkill, but you can easily switch to that more drastic approach later if the amount of accumulated data eventually turns out to be a problem for a single db with normal and "old stuff" tables.)