What is the best way to keep updates (diffs) of some entity in the database? Here on Stack Overflow we can edit questions and answers, and then we can look at any revision of a question or answer we want. For example: revisions of some random question. Maybe someone knows how it is implemented on Stack Overflow?
To be clear: in my case I have some entity (article) with some fields (name, description, content). Many users can edit the same article. I want to keep a history of the article's updates (something like version control), and I want to keep only diffs, not the whole content of the updated article. By the way, I use PostgreSQL, but I can migrate to any other database.
UPD
I've opened a bounty, so here are some requirements. You don't need to satisfy all of them, but if you do, it will be much better; nevertheless, any answer is much appreciated. So I want to have the ability:
to keep only diffs, so space isn't wasted for no purpose;
to fetch any revision (version) of an article, where fetching the latest revision must be really quick; the fetching speed of other revisions is not as important;
to fetch any diff (and the list of diffs) of an article. An article can have changes in several fields: header, description, or content (similar to how Stack Overflow tracks changes to the title and body), so this must be taken into account.
In the past, I have used diff-match-patch with excellent (and fast) results. It is available for several languages (my experience with it was in C#). I did not use it for exactly the process you are describing (we were interested in merging), but it seems to me you could:
Save the initial version of an article's text/header/whatever.
When a change is made, use diff-match-patch to compute a patch between the newly edited version and what is already in the database. To get the latest version in the database, simply apply any patches that have already been generated to the original article in order.
Save the newly generated patch.
If you wanted to speed things up even more, you could cache the latest version of the article in its own row/table/however-you-organize-things so that getting the latest version is a simple SELECT. This way, you'd have the initial version, the list of patches, and the current version, giving you some flexibility and speed.
Since you have a set of patches in sequence, fetching any version of the article would simply be a matter of applying patches up to the one desired.
You can take a look at the patch demo to see what its patches look like and get an idea of how big they are.
Like I said, I have not used it for exactly this scenario, but diff-match-patch has been designed for doing more or less exactly what you are talking about. This library is on my short list of software I can use when I have no restrictions on libraries developed out-of-house.
Update: Some example pseudocode
As an example, you could set up your tables like so (this assumes a few other tables, like Authors):
Articles
--------
id
authorId
title
content
timestamp
ArticlePatches
--------------
id
articleId
patchText
timestamp
CurrentArticleContents
----------------------
id
articleId
content
Then some basic CRUD could look like:
Insert new article:
INSERT INTO Articles (authorId, title, content, timestamp)
VALUES(#authorId, #title, #content, GETDATE())
INSERT INTO CurrentArticleContents(articleId, content)
VALUES(SCOPE_IDENTITY(),#content)
GO
Get all articles with latest content for each:
SELECT
a.id,
a.authorId,
a.title,
cac.content,
a.timestamp AS originalPubDate
FROM Articles a
INNER JOIN CurrentArticleContents cac
ON a.id = cac.articleId
Update an article's content:
//this would have to be done programmatically
currentContent =
(SELECT content
FROM CurrentArticleContents
WHERE articleId = #articleId)
//using the diff-match-patch API
patches = patch_make(currentContent, newContent);
patchText = patch_toText(patches);
//setting #patchText = patchText and #newContent = newContent:
INSERT INTO ArticlePatches(articleId, patchText, timestamp)
VALUES(#articleId, #patchText, GETDATE())
UPDATE CurrentArticleContents
SET content = #newContent
WHERE articleId = #articleId
GO
Get the article at a particular point in time:
//again, programmatically
originalContent = (SELECT content FROM Articles WHERE id = #articleId)
patchTexts =
(SELECT patchText
FROM ArticlePatches
WHERE articleId = #articleId
AND timestamp <= #selectedDate
ORDER BY timestamp ASC)
content = originalContent
foreach(patchText in patchTexts)
{
//more diff-match-patch API
patches = patch_fromText(patchText)
content = patch_apply(patches, content)[0]
}
I had a similar issue at my workplace.
I implemented an AFTER UPDATE trigger to record all the needed data in another table (where you can, of course, save only the changed fields), so the new value lives in the real table while the log lives in a separate table.
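A minimal sketch of that idea in PostgreSQL (the articles table and its columns here are only assumptions for illustration): an AFTER UPDATE trigger copies the previous values into a history table.

-- hypothetical history table for the articles example
CREATE TABLE article_history (
    id          bigserial   PRIMARY KEY,
    article_id  bigint      NOT NULL,
    name        text,
    description text,
    content     text,
    changed_at  timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION log_article_update() RETURNS trigger AS $$
BEGIN
    -- OLD holds the row as it was before the UPDATE
    INSERT INTO article_history (article_id, name, description, content)
    VALUES (OLD.id, OLD.name, OLD.description, OLD.content);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_log_update
AFTER UPDATE ON articles
FOR EACH ROW EXECUTE FUNCTION log_article_update();

(EXECUTE FUNCTION requires PostgreSQL 11+; on older versions use EXECUTE PROCEDURE.)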
OK, first, @PaulGriffin's answer is complete. And @VladimirBaranov's also got me thinking about the optimal way to do updates. I've got a tasty way to handle the use case of frequent updates and less frequent reads: lazy full-record updates.
For example, editing a document online, possibly from different devices. This optimizes for light network traffic and infrequent large DB record updates, without the client being able to go to specific versions (it always gets the latest).
The database has a collection of deltas, indexed by version (which could be a timestamp), and a lastDocument with a lastDocUpdate (version/timestamp).
Frequent use case: Client Edits
Send only the delta to the Server & update database with delta
Use Case: Client with old document version requests updates
Send all deltas since current client version
Less Frequent use case: New device, no prior data on client
On Server, look at lastDocument, apply deltas since lastDocUpdate
Save updated lastDocument in db, and send to client
The most expensive operation is updating the full document in the database, but it's only done when necessary, i.e. when a client has no version of the document.
That rare action is what actually triggers the full document update.
This setup has no extra DB writes and minimal data sent to the client (which updates its doc with the deltas), and the large text field is updated only when we are already applying the deltas on the server because a client needs the full document.
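A minimal, PostgreSQL-flavored schema sketch of that layout (table and column names are assumptions, not from the answer):

-- deltas, one row per edit, keyed by document and version
CREATE TABLE document_deltas (
    document_id bigint      NOT NULL,
    version     bigint      NOT NULL,   -- could also be a timestamp
    delta       text        NOT NULL,   -- serialized patch/diff
    created_at  timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (document_id, version)
);

-- lazily maintained snapshot, rewritten only when a client needs the full text
CREATE TABLE documents (
    document_id     bigint PRIMARY KEY,
    last_document   text   NOT NULL,
    last_doc_update bigint NOT NULL     -- version of the snapshot
);

-- "send all deltas since current client version"
SELECT delta
FROM document_deltas
WHERE document_id = 42      -- example document id
  AND version > 17          -- example client version
ORDER BY version;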
BACKSTORY IF YOU'RE INTERESTED: A friend of mine owns a magazine and has been publishing a corresponding Drupal 7 website since 2011. The site has thousands of articles and around 50,000 images supporting those articles. Unfortunately, due to copyright trolling attorneys, he's already been hit with a couple of copyright infringement lawsuits over images that he thought were from "creative commons." Since his first lawsuit in 2016, he's made sure all images are from a stock image company. But apparently, very recently, yet another image from before 2016 has caused another copyright troll to seek $18,000 (it's literally a photo of a hotdog by the way). Nevertheless, his business insurance company just wants to pay the settlement fees rather than risk anything in court, but has demanded that all potentially suspect images be deleted from the site going forward. Since 95% of the stories that have been published on his site have had fewer than 1000 views anyway (they are worth less than 50 cents from advertisers), he has agreed to take all those images down because $.50 is definitely not worth the risk of feeding any more trolls.
QUESTION: What's the best way to delete, rename or move all the images that are connected to a story node before a certain date in 2016? It would be nice if we could temporarily just change the filenames on the filesystem from "trollfood.jpg" to "trollfood.jpg.bak" (or something) so that if/when he can ensure that an image is in fact in the public domain, he can revive it. It would also be nice if we could replace all the potentially suspect image links (in the db) with placeholder image links for the time being (so that people can still read the article without wondering where the images have gone... perhaps the placeholder will be a brief explanation of the trolling situation). Anyway, it's been a minute since I've done anything with Drupal, so I've forgotten how Drupal links files to nodes (and he has some custom content types powering his main articles).
I've been able to get all the potentially suspect images in a list via mysql:
SELECT fid, filename, timestamp, from_unixtime(timestamp, "%Y-%m-%e")
FROM drupal7_therooster.file_managed
where timestamp between unix_timestamp('2011-01-01') and unix_timestamp('2017-01-01');
// here's sample output:
# fid filename timestamp from_unixtime(timestamp, "%Y-%m-%e")
6154 _MG_5147.jpg 1373763148 2013-07-14
6155 _MG_5179.jpg 1373763148 2013-07-14
6161 The Lone Bellow (4 of 5).jpg 1373866156 2013-07-15
6162 The Lone Bellow (1 of 5).jpg 1373866156 2013-07-15
Now, how can I use this to find the potentially offending stories that use these images, and perform the following:
Create a list of all the stories that use these images so I can save this in case he ever wants to revive these images. I know SQL well enough...I just don't know which tables keep which data.
Create a query that replaces these image associations in those stories with a placeholder image (so if a story uses "trollfood.jpg", that story now uses "safetyimageplaceholder.jpg" instead). Some stories have multiple images attached to them.
Once all the potentially offending articles reference a placeholder image instead, I still need to move all the offending files so they can't be accessed by lawyers. I have access via SSH, by the way. Are there any good ways of using bash commands to ONLY move/rename files that match the list I generate from an SQL query? I just want to be careful not to delete/rename/move any images that were NOT part of the query. Bear in mind the file creation dates in the filesystem are all 2017+ on the server, because the server was moved (or copied) in 2017, so the filesystem's original creation dates are inaccurate.
I know this is a long question...and it involves a Drupal site, but I think I might need the help of proper SQL and bash experts, so I've posted it here instead of the Drupal specific stackexchange. I'm totally open to any suggestions if another completely different approach is better suited for this problem. Cheers!
I was able to answer my own question. I had to do three main things:
STEP ONE: Create a query for Drupal's MySQL database that would give me a list of all potential copyright infringing files that were being used by nodes created between 2012 and 2017:
SELECT fm.fid, fm.filename,
n.title, n.nid, from_unixtime(n.created, "%Y-%m-%d") as 'node_date'
FROM file_managed fm
JOIN file_usage fu ON fm.fid = fu.fid
JOIN node n ON fu.id = n.nid
WHERE created BETWEEN unix_timestamp('2012-01-01') AND unix_timestamp('2017-01-01')
ORDER BY node_date
This is a moderately complex query, but basically it joins columns from three tables (Drupal 7's file_managed, node, and file_usage tables). The file_usage table is a shared key register of which files (via fid) are used on which nodes (via nid).
STEP TWO: Organize and filter the data to create a list of files.
I filtered and ordered the results by node created dates. I got about 48K records from the join query in step one, and then I created a Google spreadsheet to clean up and sort the data. Here's a sample of the Google spreadsheet. This sheet also includes data from the node_counter table, which tracks page views for each node. Using a simple VLOOKUP function to match the total page views for each nid, the main sheet can now be sorted by page views. I did this so I could prioritize which images attached to each node/article I should check first. This is the SQL query I used to get that data from the db, by the way:
SELECT nid, totalcount, daycount, from_unixtime(timestamp, "%Y-%m-%d") as 'date'
FROM node_counter
ORDER BY totalcount DESC
STEP THREE: Write a Shell Script that will take our filtered list of files, and move them somewhere safe (and off the public webserver).
Basically, I needed a simple BASH script that would use the list of files from step two to move them off the web server. Bear in mind, when each image file is uploaded to the server, Drupal can (and did) create about a dozen different aspect ratios and sizes, and place each of these copies into corresponding folders. For example, one image filename could be copied and resized into:
files/coolimage.jpg
files/large/coolimage.jpg
files/hero/coolimage.jpg
files/thumbnails/coolimage.jpg
files/*/coolimage.jpg etc etc.
So, I had to take a list of ~50K filenames, check for those filenames in a dozen different subfolders, and, if they were present in a subfolder, move each of them to an archive folder, all while preserving the folder/file tree structure and leaving behind files that are "safe" to keep on the public web server. So...I ended up writing THIS simple script and open-sourced it on GitHub in case anyone else might benefit from it.
That's it! Thankfully I knew some SQL, how to use Google spreadsheets, some basic bash, and, well, how to use Google and solve problems. If anyone googling finds this helpful in the future...cheers!
I'm currently working on a tool to build and alter books and articles.
a book contains articles and has a specific structure
an article can occur multiple times in the same book
it is multilingual, so each article can (but doesn't have to) exist in different languages (if there isn't any data for the chosen language, just displaying a notice or similar is fine)
later on, new languages may be added
the depth/nesting is dynamic (an article can have sub-articles)
both the articles and the structure have to be versioned (I have to be able to restore specific states later on)
here is my current approach for the database:
with this, I could grab the latest version of a book like this:
SELECT *
FROM books AS b
JOIN structure AS s ON s.book_fk = b.id
AND s.book_version_nr =
(
SELECT MAX(s2.book_version_nr)
FROM structure AS s2
WHERE s2.book_fk = b.id
)
JOIN articles AS a ON a.id = s.article_fk
JOIN article_texts AS atx ON atx.article_fk = a.id
AND atx.version_nr = s.article_version_nr
AND atx.language_fk = 'languageIdFromScript'
WHERE b.id = 'bookIdFromScript'
ORDER BY s.position ASC
However, this would mean that:
if I create a new version of an article, I have to update the most recent structure to reflect this (but that also means I "lose" the old version, since the content is now changed)
if I add, remove or move an article around, I have to create a whole new structure version for the slightest change (this would lead to massive amounts of data in the database really fast and could hurt query times)
This approach doesn't seem to be what I really want, as it is not possible to have different versions of different languages of the same article referenced in a single book structure. Also, the subselect seems to slow down the whole process heavily.
Is there any way to represent this relationship while avoiding these performance downsides?
Consider the following: have two tables for the structure. One holds the Current structure; it is updated whenever a change occurs. The other holds the History of the structure.
When you make a change, copy the current structure from Current into History, then modify Current.
This will slow down Inserts and Updates, but simplify Selects.
Multiple rows in History will point to the same Articles, Users, etc. But that is OK.
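A minimal, PostgreSQL-flavored sketch of that pattern (table and column names are assumed):

-- Current structure: one row per (book, position), always the latest state
CREATE TABLE structure_current (
    book_id    bigint NOT NULL,
    position   int    NOT NULL,
    article_id bigint NOT NULL,
    PRIMARY KEY (book_id, position)
);

-- History: a copy of every superseded state, stamped with when it was replaced
CREATE TABLE structure_history (
    book_id     bigint      NOT NULL,
    position    int         NOT NULL,
    article_id  bigint      NOT NULL,
    replaced_at timestamptz NOT NULL DEFAULT now()
);

-- Before changing a book's structure, copy the current rows into History,
-- then modify Current in place.
INSERT INTO structure_history (book_id, position, article_id)
SELECT book_id, position, article_id
FROM structure_current
WHERE book_id = 42;      -- the book being edited (example id)

UPDATE structure_current
SET article_id = 7       -- example change
WHERE book_id = 42 AND position = 3;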
What is the faster/better way to keep track of statistical data in a message board?
-> number of posts/topics
Update a column like 'number_of_posts' for each incoming post or after a post gets deleted.
Or just count(*) on the posts matching a topicId?
Just use count(*) - it's built into the database. It's well tested, and already written.
Having a special column to do this for you means you need to write the code to manage it and keep it in sync with the actual value (on adds and deletes). Why make more work for yourself?
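For reference, the counting approach is just a plain aggregate (table and column names here are assumptions):

SELECT COUNT(*) AS number_of_posts
FROM posts
WHERE topic_id = 42;   -- example topic id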
What is the "proper" (most normalized?) way to store requests in the database? For example, a user submits an article. This article must be reviewed and approved before it is posted to the site.
Which is the more proper way:
A) store it in the Articles table with an "Approved" field which is either 0, 1, or 2 (denied, approved, pending)
OR
B) Have an ArticleRequests table which has the same fields as Articles, and upon approval, move the row data from ArticleRequests to Articles.
Thanks!
Since every article is going to have an approval status, and each time an article is requested you're very likely going to need to know that status - keep it inline with the table.
Do consider calling the field ApprovalStatus, though. You may want to add a related table to contain each of the statuses unless they aren't going to change very often (or ever).
EDIT: Reasons to keep fields in related tables are:
If the related field is not always applicable, or may frequently be null.
If the related field is only needed in rare scenarios and is better described by using a foreign key into a related table of associated attributes.
In your case those above reasons don't apply.
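A minimal, PostgreSQL-flavored sketch of option A with a status lookup table (all names are illustrative):

CREATE TABLE ApprovalStatuses (
    id   smallint    PRIMARY KEY,
    name varchar(20) NOT NULL
);

INSERT INTO ApprovalStatuses (id, name)
VALUES (0, 'Denied'), (1, 'Approved'), (2, 'Pending');

CREATE TABLE Articles (
    id               bigserial PRIMARY KEY,
    title            text      NOT NULL,
    content          text      NOT NULL,
    approvalStatusId smallint  NOT NULL DEFAULT 2   -- Pending
        REFERENCES ApprovalStatuses (id)
);

-- only approved articles are shown on the site
SELECT id, title FROM Articles WHERE approvalStatusId = 1;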
Definitely do 'A'.
If you do B, you'll be creating a new table with the same fields as the other one and that means you're doing something wrong. You're repeating yourself.
I think it's better to store the data in the main table with a specific status, because then there is no need to move data between tables when an article is approved, and the article appears on the site at the same moment. If you don't want to keep disapproved articles, you can create a cron script that removes the unnecessary data or moves it to an archive table. This way you put less load on your database, because you can pick a suitable time for removing old articles, for example at night.
Regarding the concern about filtering on approval status in every query: if you are planning a very popular site with heavy load for searching or listing articles, you will likely use a standalone search server like Sphinx or Solr (MySQL is not a good solution for these purposes), and you will only push rows with status = 'Approved' into it. Delta indexing helps you keep that data up to date.
I have a requirement to store all versions of an entity in an easily indexed way and was wondering if anyone has input on what system to use.
Without versioning, the system is simply a relational database with a row per, for example, person. If the person's state changes, that row is changed to reflect this. With versioning, the entry should be updated in such a way that we can always go back to a previous version. If I could use a temporal database, this would come for free and I would be able to ask 'what is the state of all people as of yesterday at 2pm, living in Dublin and aged 30'. Unfortunately, there don't seem to be any mature open-source projects that can do temporal.
A really nasty way to do this is just to insert a new row per state change. This leads to duplication, as a person can have many fields but only one changing per update. It is also then quite slow to select the correct version for every person given a timestamp.
In theory it should be possible to use a relational database and a version control system to mimic a temporal database but this sounds pretty horrendous.
So I was wondering if anyone has come across something similar before and how they approached it?
Update
As suggested by Aaron, here's the query we currently use (in MySQL). It's definitely slow on our table with >200k rows. (id = table key, person_id = id per person, duplicated if the person has many revisions.)
select name
from person p
where p.id = (select max(id)
              from person
              where person_id = p.person_id
                and timestamp <= :timestamp)
Update
It looks like the best way to do this is with a temporal DB, but given that there aren't any open-source ones out there, the next best method is to store a new row per update. The only problems are duplication of unchanged columns and a slow query.
There are two ways to tackle this. Both assume that you always insert new rows. In every case, you must insert a timestamp (created) which tells you when a row was "modified".
The first approach uses a number to count how many instances you already have. The primary key is the object key plus the version number. The problem with this approach seems to be that you'll need a select max(version) to make a modification. In practice, this is rarely an issue since for all updates from the app, you must first load the current version of the person, modify it (and increment the version) and then insert the new row. So the real problem is that this design makes it hard to run updates in the database (for example, assign a property to many users).
The next approach uses links in the database. Instead of a composite key, you give each object a new key and you have a replacedBy field which contains the key of the next version. This approach makes it simple to find the current version (... where replacedBy is NULL). Updates are a problem, though, since you must insert a new row and update an existing one.
To solve this, you can add a back pointer (previousVersion). This way, you can insert the new rows and then use the back pointer to update the previous version.
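Minimal, PostgreSQL-flavored sketches of both layouts (all table and column names are assumptions):

-- Approach 1: composite key (object key + version number)
CREATE TABLE person_versions (
    person_id bigint      NOT NULL,
    version   int         NOT NULL,
    name      text,
    city      text,
    created   timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (person_id, version)
);

-- latest version of one person
SELECT *
FROM person_versions pv
WHERE pv.person_id = 1
  AND pv.version = (SELECT max(version)
                    FROM person_versions
                    WHERE person_id = 1);

-- Approach 2: linked rows with replacedBy / previousVersion pointers
CREATE TABLE person_rows (
    id               bigserial   PRIMARY KEY,
    person_id        bigint      NOT NULL,
    name             text,
    city             text,
    created          timestamptz NOT NULL DEFAULT now(),
    replaced_by      bigint REFERENCES person_rows (id),
    previous_version bigint REFERENCES person_rows (id)
);

-- current versions are simply the rows that were never replaced
SELECT * FROM person_rows WHERE replaced_by IS NULL;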
Here is a (somewhat dated) survey of the literature on temporal databases: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.6988&rep=rep1&type=pdf
I would recommend spending a good while sitting down with those references and/or Google Scholar to try to find some good techniques that fit your data model. Good luck!