Jun 26, 2006
Revision Control for Web Application Content
Revision control is a wonderful concept. Using revision control has helped me personally in managing content and source code. Traditionally revision control is used to track changes in source code files, but people has since extended revision control to manage other things such as system configuration files, web pages, term papers, and so on. In this article, I will focus on applying the revision control concept in managing content in web applications, for example, like how Wikipedia keeps track of changes in the articles (example).
Why?
Having a revision control is really useful when we want to keep a history of contents in our web applications. This history data can be used to revert changes to a certain revision, find out by whom and when certain changes are made, and visualise the differences between revisions. Accidental overwriting of content is no longer a problem with revision control in place. Aside from history, revision control can also help in a multi-user environment where many people may potentially work on the same content. In my opinion, these are the main benefits of having revision control.
Cons
The only main drawback of having revision control is that extra storage space is needed to keep the revision data. The actual requirements vary, depending on the implementation. If done inefficiently (e.g. storing the whole revision content instead of just the differences), then the space requirement could be significant. However, storage is cheap nowadays, and in many cases the benefits far outweigh the drawbacks.
Traditional vs. web app.
Traditional revision control systems such as Subversion or CVS operate on files. Web application contents are normally stored in a database instead of files. This is one major difference that should be kept in mind.
Requirements
I’m going to give an example of two simple implementation approaches of revision control for web application content. But first I’m going to list the requirements for this simple implementation. Let say I already have a simple web application that is used by my family to manage our notes (think of a very simple family wiki). This is what I want to get from adding a revision control to the system:
- content history, ability to store, retrieve and rollback to certain revision, as well as to show differences between revisions
- audit, I want to know who did what and when
- all database, everything is stored in the database itself, no external files
- no cheating, I can’t use subversion or cvs behind the scene
- simple, minimal and straightforward modifications in the database structure and code
- generic, the design should be database- and language-agnostic
- storage, in this case I’m willing to pay the price of space for having a history of my notes, so storage space size is not a concern
Existing system
Let say this is the (simplified) existing database structure:
users:
+ id
- name
notes:
+ id
- status
- title
- content
- created_by
- created_at
- last_modified_by
- last_modified_at
First approach: the simple method
I think one of the simplest tricks that can be used to implement revision
control is to add a revision_id column, and make it part of the primary
key. And that is exactly what I’m going to do first. Here’s the modified
database structure:
users:
+ id
- name
notes:
+ id
+ rev_id
- status
- title
- content
- rev_user_id
- rev_time
The users table is unchanged because I don’t need to put anything there
under revision control. The primary key of the notes table now becomes a
composite key (id, rev_id). The audit information now becomes the revision
metadata, rev_user_id and rev_time, because now we can track who did
what and when for each revision, not just the first and the last ones.
This method is the simplest that I can think of. There is a limitation, however, that every column in that table becomes under revision control. That’s probably fine for most needs, but there could be situations where only some of the columns should be under revision control, for example when storage space is a concern.
Now comes the more interesting part. How can we actually achieve common revision control tasks? I’m going to provide some example MySQL statements to illustrate how to do certain things:
To list all published notes:
SELECT DISTINCT id FROM notes WHERE status='published'
To retrieve the latest revision of note 5:
SELECT * FROM notes WHERE id=5 AND rev_id=(SELECT MAX(rev_id) FROM notes WHERE id=5)To create a new revision of note 5 by user 3:
BEGIN; SELECT @newid := MAX(rev_id) + 1 FROM notes WHERE id=5; INSERT INTO notes (id, rev_id, status, title, content, rev_user_id, rev_time) VALUES (5, @newid, 'published', '...', '...', 3, NOW()); COMMIT;To edit a revision: simply create a new revision. Editing or deleting a revision is beating the purpose of having revision control in the first place. By never editing or deleting any revision we can rollback to a particular revision if we need to.
- To delete a revision: simply create a new revision with status set to ‘deleted’ and exclude the deleted notes from your view. Again, we never edit or delete any revisions, this way we can “resurrect” deleted notes by simply rollback to a previous revision.
To list all revision metadata of note 5:
SELECT rev_id, rev_user_id, rev_time FROM notes WHERE id=5
- To rollback note 5 to revision 2: create a new revision for note 5 by
copying the content of revision 2 (excluding the revision metadata
rev_user_idandrev_time) into the new revision. - To show differences between revision 2 and 3 for note 5: retrieve both
revisions and pass them into a
diffprogram or its equivalent
Second approach: an alternative method
There is an alternative approach using a separate table to store revisions. This method practically addresses the aforementioned limitation of the simple method. Using the same example, the modified database structure now becomes:
users:
+ id
- name
notes:
+ id
- title
notes_revisions:
+ id
+ notes_id
- status
- content
- rev_user_id
- rev_time
In the above structure the title column is not under revision control.
Although title may not be the best example to illustrate the storage space
restriction requirement, I used it as an example to show that we can put
certain columns outside revision control. The id column of
notes_revisions is the revision id. If you’re using MySQL, then you can’t
make notes_revisions(id) column as auto_increment in this case.
The notes_id column is a foreign key referencing notes(id) and is part
of the composite primary key.
And here’s how you do common revision control tasks using this alternative method:
To list all published notes:
SELECT n.* FROM notes n LEFT OUTER JOIN notes_revisions nr ON nr.notes_id = n.id WHERE nr.status='published' AND nr.id=( SELECT MAX(id) FROM notes_revisions WHERE notes_id=nr.notes_id)To retrieve the latest revision of note 5:
SELECT n.*, nr.* FROM notes n LEFT OUTER JOIN notes_revisions nr ON nr.notes_id = n.id WHERE n.id=5 AND nr.id=(SELECT MAX(id) FROM notes_revisions WHERE notes_id=5)To create a new revision of note 5 by user 3:
BEGIN; SELECT @newrid := MAX(id) + 1 FROM notes_revisions WHERE notes_id=5; INSERT INTO notes_revisions (id, notes_id, status, content, rev_user_id, rev_time) VALUES (@newrid, 5, 'published', '...', 3, NOW()); COMMIT;To list all revision metadata of note 5:
SELECT id, rev_user_id, rev_time FROM notes_revisions WHERE notes_id=5
Other tasks are done in a similar manner to the simple method.
Addressing simultaneous edits
Many times it is useful to allow simultaneous editing of the same content without locking. The basic idea is to allow users to edit anyway and let the system merge the differences between simultaneous editing sessions when creating new revisions. To do this properly, we need the help of a diff program and a patch program (or their equivalents) to merge differences. I think this could make my application too complicated for my purposes. So I’ll settle for a less complicated workaround: during save, if a simultaneous edit is detected, then the application will deny the save request. Although I have to admit that this workaround is not better than locking. Maybe I will write another article in the future that discusses this specifically in more details.
How to detect simultaneous edits? When presenting a note for editing, the revision number must be attached to the form (e.g. as a hidden input field). After submitting the edited note and before creating a new revision, the system must first check that the revision number in the database is not higher than the current revision. If it is higher, then that means somebody else has edited the same note.
Conclusions
There are more to revision control than what is written in this article, and it could be a very complicated discussion topic. But basic revision control is good to have, and easy enough to implement in most web applications.
Related information
Acknowledgements
The alternative approach was inspired by MediaWiki. Thank God for open source!
Follow
A CMS called Drupal (http://drupal.org) already has this feature. ;)
The contents are stored in ‘node’ table and — similarly — the revisions are stored in ‘node_revisions’.
There are differences to your concept:
- ‘status’ is in stored in ‘node’, not ‘node_revisions’
- a revision is stored not automatically. In drupal, an edit can be stored as revision or just publish (in order to save space).
Great article!
Some questions popped up in my mind:
* How to handle conflict during merge (if any), present the conflict to the user?
* How to split the table in multiple repositories? Say, I want to store the old revisions in other server to save space but I want them still to be accessible.
In Software Engineering for Internet Applications
(a guide that every web developer must read)
on Content Management chapter also address this issues
http://philip.greenspun.com/seia/content-management