As discussed over a week ago, I have been pondering the idea of migrating my projects from Subversion to Git. One of the prerequisites of such a migration is the preparation of a process to cleanly migrate the revision history from the old system to the new one. Of course, such a process should attempt to preserve the revision history as close to reality as possible (regardless of what some other big projects have done by just throwing away their history; shrug).

The projects I am interested in migrating from Subversion to Git all live in Google Project Hosting. This is irrelevant for all but one detail I will mention later. However, the project hosting site is the obvious place to look for help and, as expected, there is a little document titled "Convert your project from Subversion to Git" that provides a good start.

Summarizing, there are two well-suited tools to convert a Subversion repository to Git:
  • The built-in git svn command: works pretty well and has a good amount of customization features. The only little problem is that git svn does not recognize Subversion tags as such and therefore converts them to Git branches instead of tags.
  • The third-party svn2git: while this is actually built on top of git svn, it properly handles Subversion tags and converts them to native Git tags.
If you do not have tags in your Subversion repository or do not care about them being properly converted to Git tags, you can go with git svn. However, I am interested in properly recording tags, so I chose svn2git for my needs.
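
To make the tag problem concrete: git svn leaves each Subversion tag behind as a remote branch rather than as a Git tag. The sketch below shows roughly the kind of fix-up a wrapper has to perform; it is not svn2git's actual code. It assumes the imported tags live under refs/remotes/tags/ (newer versions of git svn may use a different prefix, such as refs/remotes/origin/tags/), and the ref names in the sample list are made up. It only prints the git tag commands instead of running them.

```shell
#! /bin/sh
# Sketch: turn git-svn style "tag branches" into 'git tag' commands.
# Assumption: Subversion tags were imported as refs/remotes/tags/<name>.

# Hypothetical ref list; in a real repository it could come from:
#   git for-each-ref --format='%(refname)' refs/remotes
refs='refs/remotes/trunk
refs/remotes/tags/release-1.0
refs/remotes/tags/release-1.1
refs/remotes/feature-x'

# Read ref names on stdin and print one 'git tag' command per tag branch.
tag_commands_for() {
    while IFS= read -r ref; do
        case "${ref}" in
            refs/remotes/tags/*)
                # Strip the prefix to obtain the tag name.
                name="${ref#refs/remotes/tags/}"
                echo "git tag ${name} ${ref}"
                ;;
        esac
    done
}

tag_commands=$(printf '%s\n' "${refs}" | tag_commands_for)
printf '%s\n' "${tag_commands}"
```

Printing the commands first makes it possible to review them before anything in the repository is touched.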

Now, using svn2git by itself as instructed in the referenced document will happily convert your Subversion repository to Git. You should not have to care about anything else... unless you want to polish a few details mentioned below.

Set up

In the rest of this article, I will place all the scripts for the conversion in a ~/convert/ directory. When the script is run, it will create a ./git/ subdirectory that will end up containing the new Git repository, and a ./control/ directory that will maintain temporary control files.

Disclaimer: Be aware that all the code below is quite ugly and probably bug-ridden. It is not meant to be pretty as I only have two use cases for it. Also please note that I am, by no means, a Git expert, so the process below may be way more convoluted than is actually necessary.

Oh, and this does not deal at all with the wiki changes. Google Code stores such information in the same Subversion repository as your code. However, when migrating to Git, you get two different repositories. Adapting the code below to deal with the wiki component of your project should be easy enough, although I'm not sure I care enough about the few contents I have in the wikis to go through the hassle.

Warning: The instructions below completely mess up tags if you happen to have any in your repository. I only discovered this after following the procedure below and pushing HEAD and the tags, which resulted in a different set of revisions pushed for every tag. When I actually did this, I ended up running the steps below, then discarding the tags, and creating the tags again by hand pointing to the appropriate revisions.

Setting up authors

The first step in using svn2git is defining a mapping from Subversion authors to Git authors. This is easy enough, but there is one caveat that affects projects hosted in Google Code: the root revision of your Subversion repository is not actually authored by you; its author is (no author), so you should account for that. Something like this in your ~/.svn2git/authors file will take care of this teeny tiny detail:

= Your Name
(no author) = Your Name
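
If a repository has more than a couple of committers, the list of names to map can be extracted from the Subversion log. The following is a minimal sketch and not part of the conversion scripts in this article: it parses text in the format printed by svn log -q and writes one authors-file entry per distinct author. The sample log, the jdoe user name and the example.invalid address are all made up; a real run would feed in the output of svn log -q and then fix up every name and address by hand.

```shell
#! /bin/sh
# Sketch: build authors-file entries from 'svn log -q' style output.

# Made-up sample; a real run would use the output of 'svn log -q'.
sample_log='------------------------------------------------------------------------
r3 | jdoe | 2011-06-01 10:00:00 +0000 (Wed, 01 Jun 2011)
------------------------------------------------------------------------
r2 | jdoe | 2011-05-02 09:30:00 +0000 (Mon, 02 May 2011)
------------------------------------------------------------------------
r1 | (no author) | 2011-05-01 08:00:00 +0000 (Sun, 01 May 2011)
------------------------------------------------------------------------'

# Field 2 of every "rN | author | date" line is the Subversion author.
# LC_ALL=C keeps the sort order independent of the user's locale.
authors=$(printf '%s\n' "${sample_log}" \
    | awk -F ' [|] ' '/^r[0-9]+ / { print $2 }' \
    | LC_ALL=C sort -u \
    | sed -e 's,$, = Your Name <you@example.invalid>,')

printf '%s\n' "${authors}"
```

Note how (no author) shows up as just another entry, which is why it needs its own line in the mapping.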

However, as we will see below, we will be skipping the first revision of the repository altogether so this is actually not really necessary. I just felt like mentioning it for completeness, given that I really didn't expect (no author) to be properly recognized in this context.

References to old Subversion revision ids

Unless you have been extremely terse in your commit history and in your bug tracker, you will have plenty of cross-references pointing to Subversion revision identifiers. For example, if you are fixing a bug introduced a month ago in r123, you may as well point that out in the commit message of r321 for reference purposes. I guess you can see the problem now: if we migrate the history "as is", all these references become meaningless because the new Git history has no traces of the old revision identifiers.

The -m option to svn2git will annotate every revision with a git-svn-id line that contains the relevant information. However, this line is quite ugly because it is not really meant for human consumption: git-svn uses the information in it to allow pulls and pushes from a master Subversion repository.

What I want to do is reword the git-svn-id line to turn it into a real sentence rather than some internal control code. We can achieve this with git rebase in interactive mode: mark all revisions as "reword" and then go revision by revision fixing its message. A pain that can be automated: if an editor can do the necessary changes, we can create a fake "editor" script that performs the same modifications.

How? Store this as ~/convert/

#! /bin/sh

if grep git-svn-id "${1}"; then
    # We are editing a commit message.
    new="This revision was r\1 in Subversion."
    sed -r -i -e "s,git-svn-id[^@]+@([0-9]+).*$,${new}," "${1}"
else
    # We are editing the git-rebase interactive control file.
    sed -i -e 's,^pick,reword,' "${1}"
fi
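
To see what the commit-message branch of this script does, the same sed expression can be applied to a sample message. This is only an illustration: the commit text, URL and UUID below are made up, and it uses sed -E on standard input instead of the in-place -r -i form so that no file is modified.

```shell
#! /bin/sh
# Sketch: rewrite a git-svn-id line into a human-readable sentence.

# Made-up commit message ending in a git-svn-id line.
message='Fix crash when the input file is empty

git-svn-id: https://example.invalid/svn/trunk@123 01234567-89ab-cdef-0123-456789abcdef'

# \1 refers to the revision number captured after the "@" sign.
new="This revision was r\1 in Subversion."
rewritten=$(printf '%s\n' "${message}" \
    | sed -E -e "s,git-svn-id[^@]+@([0-9]+).*$,${new},")

printf '%s\n' "${rewritten}"
```

The subject line is left untouched; only the trailing control line changes, and the repository URL and UUID are dropped.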

With this script, we can simply do EDITOR=~/convert/ git rebase -i base-revision and get every revision tagged... but this will blow up if your repository contains empty revisions, which takes us to the next concern.

Drop empty revisions

As you probably know, Subversion supports attaching metadata to directories and files in the form of properties. These properties cannot be represented in Git, so, if you have any Subversion commits that touched properties alone, svn2git will happily convert those as empty Git commits.

There is nothing wrong with this, but things like git rebase will choke on these empty commits over and over again... and it gets quite annoying. Furthermore, these empty commits serve no purpose in Git because the changes they performed in Subversion make no sense in Git land. It is easier to just kill them all from the history.

The git rebase command above will abort on every empty revision it encounters. We can, at that point, record their identifiers for later deletion. However, recording the revision identifier will not work because, as we are doing a rebase, the identifier will have changed once we are done. Instead, and because I have been careful to write detailed commit messages, we can rely on the first line of the message (aka the subject) to identify every commit. Rerun the rebase as follows, storing the list of empty commits in ../control/empty:

first=$(git log | grep '^commit' | tail -n 1 | cut -d ' ' -f 2)
EDITOR="${convert}/" git rebase --interactive "${first}" || true
touch ../control/empty
while [ -f .git/MERGE_MSG ]; do
    # The rebase stopped on an empty commit: record its subject.
    head -n 1 .git/COMMIT_EDITMSG >>../control/empty
    EDITOR="${convert}/" git commit --allow-empty
    EDITOR="${convert}/" git rebase --continue || true
done
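
The first= line above is dense, so here is what that pipeline does when run against canned git log output instead of a real repository. The commit identifiers, names and messages are made up. git log prints the newest commits first, so the last line starting with "commit" names the oldest revision.

```shell
#! /bin/sh
# Sketch: pick the oldest commit id out of 'git log' style output.

# Made-up log text; in the real script this comes from 'git log'.
sample_log='commit 3333333333333333333333333333333333333333
Author: Your Name <you@example.invalid>
Date:   Wed Jun 1 10:00:00 2011 +0000

    Implement the parser

commit 2222222222222222222222222222222222222222
Author: Your Name <you@example.invalid>
Date:   Mon May 2 09:30:00 2011 +0000

    Add the build system

commit 1111111111111111111111111111111111111111
Author: Your Name <you@example.invalid>
Date:   Sun May 1 08:00:00 2011 +0000

    Import the initial sources'

# Keep only the "commit <id>" headers, take the last one, print the id.
first=$(printf '%s\n' "${sample_log}" \
    | grep '^commit' | tail -n 1 | cut -d ' ' -f 2)

echo "${first}"
```

Message lines are indented in the log output, so a commit message that happens to start with the word "commit" does not confuse the grep. For a linear history, git rev-list --max-parents=0 HEAD should give the same answer without parsing the log.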

With this list in mind, we create another ~/convert/ script to remove the empty revisions once we know them:

#! /bin/sh

echo "Empty revisions to be deleted:"
while IFS= read -r line; do
    grep -F -- "${line}" "${1}"
    # Control file lines look like "pick <id> <subject>".
    sed -i -e "/^pick [0-9a-f]* ${line}\$/s,^pick,fixup," "${1}"
done <../control/empty
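
The effect on the rebase control file is easier to see with a concrete example. The sketch below is a simplified stand-in for the script above: it works on temporary files, uses made-up commit identifiers and subjects, and assumes that every line of the control file has the form "pick <abbreviated id> <subject>". Subjects containing characters that are special to sed (commas, dots, brackets) would need escaping, which this sketch does not attempt.

```shell
#! /bin/sh
# Sketch: mark known-empty commits as "fixup" in a rebase control file.

workdir=$(mktemp -d)
todo="${workdir}/git-rebase-todo"
empty="${workdir}/empty"

# Made-up control file, oldest commit first as git rebase presents it.
cat >"${todo}" <<'EOF'
pick 1111111 Add the build system
pick 2222222 Set svn:ignore on the top directory
pick 3333333 Implement the parser
EOF

# Subjects of the commits that were recorded as empty.
echo "Set svn:ignore on the top directory" >"${empty}"

while IFS= read -r line; do
    # Portable alternative to 'sed -i': write a copy and move it back.
    sed -e "/^pick [0-9a-f]* ${line}\$/s,^pick,fixup," "${todo}" \
        >"${todo}.new"
    mv "${todo}.new" "${todo}"
done <"${empty}"

result=$(cat "${todo}")
printf '%s\n' "${result}"
rm -rf "${workdir}"
```

A fixup entry folds the commit into the previous one and discards its message, which for an empty commit amounts to deleting it.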

Amend the root revision

The root revision in a Google Code Subversion repository is empty: the system creates it to initialize the "directory layout" of your repository, but it serves no purpose. We can skip this root revision by passing the --revision=2 flag to svn2git.

However, no matter what we do, the git rebase above that tags Git revisions with their corresponding Subversion identifiers will happily skip our first real revision and leave it untagged: the rebase starts from that revision and only rewrites the ones that come after it. We have to manually go and fix this, which is actually quite tricky. Luckily, this reply in Stack Overflow provides the solution.

Putting it all together

Alright then. If all the above was dense and cryptic code-wise, it is time to put it all together in a script that performs all the steps for us. Assuming you already have ~/convert/ and ~/convert/ in place, now create ~/convert/ as follows:

#! /bin/sh

set -e -x

[ ${#} -eq 1 ] || exit 1
project="${1}"
# Use an absolute path because the script changes directory below.
convert=$(cd "$(dirname "${0}")" && pwd)

rm -rf git control
mkdir git control
cd git

# Start at revision 2 to skip the initial empty revision.
svn2git -v --revision=2 -m "http://${project}"

# Tag git revisions with the original Subversion revision id.
first=$(git log | grep '^commit' | tail -n 1 | cut -d ' ' -f 2)
EDITOR="${convert}/" git rebase --interactive "${first}" || true
touch ../control/empty
while [ -f .git/MERGE_MSG ]; do
    head -n 1 .git/COMMIT_EDITMSG >>../control/empty
    EDITOR="${convert}/" git commit --allow-empty
    EDITOR="${convert}/" git rebase --continue || true
done

# Drop empty revisions recorded in the previous step.
# The list is in the 'empty' file and is read by the editor script.
EDITOR="${convert}/" git rebase --interactive "${first}" || true

# Tag the root revision with the original Subversion revision id.
git tag root "$(git rev-list HEAD | tail -n 1)"
git checkout -b new-root root
EDITOR="${convert}/" git commit --amend
git checkout @{-1}
git rebase --onto new-root root
git branch -d new-root
git tag -d root

cd -
rm -rf control

Yuck. Well, it works, and it works nicely. Converting a Subversion repository to Git with all the constraints above is now as easy as: ~/convert/ your-project-name!