Three months ago, the Confluence team switched from Subversion to git, just in time for our 4.1 release. In "Confluence, git, rename, merge oh my…", we talked about the problems we encountered with merges across branches that had lots of renames. In this post, we take a step back to look at the tools we used to migrate the Confluence source code from Subversion to git.

Preparation

Firstly, we wanted to ensure that the following infrastructure supported our source code hosted in git:

  • Since our team uses Maven, we needed to ensure that the Maven releases worked properly, especially all of the custom Maven plugins and scripts we built for source code distribution and Confluence release builds.
  • We have a large number of Bamboo build plans that we use to test and release Confluence and some of its supporting libraries and plugins.
  • JIRA, FishEye and Crucible are crucial in our development workflow, so we wanted to be able to set up and test the integration first.

Secondly, we wanted to make the switch to a new version control system as smooth as possible for the team, without unnecessarily interrupting our development iterations. To accomplish this goal, we decided to set up a git mirror of our Subversion repository, which helped us test the git integration before finally flipping the switch.

Process

With the help of git svn, it is relatively straightforward to convert a Subversion repository to git:

  1. Clone the entire repository with git svn, making sure to properly map author information.
  2. Turn the svn tags that git svn maintains as branches into proper git tags.
  3. Potentially prune unused branches before pushing the git repository to its new location, in order to reduce the size of the converted git repository.
  4. Push the newly created local tags and branches to the final shared repository.

Creating the initial git repository

The authors file

The initial import should be done with a properly populated authors file that maps Subversion usernames to git author names (including email addresses). git svn uses this file to populate the author information of each git commit. The authors file has the following format:

SVN-USERNAME = Author name <email>
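Since a malformed line in the authors file may only surface hours into a long import, it can be worth checking the file up front. The following is a minimal sketch of such a check, not part of the original migration; the names parse_authors and AUTHOR_LINE are hypothetical, and the pattern simply assumes the "username = Name <email>" format shown above.

```ruby
# Matches lines of the form: SVN-USERNAME = Author name <email>
AUTHOR_LINE = /\A([^=]+?)\s*=\s*(.+?)\s*<([^<>]*)>\s*\z/

# Parse authors-file text into { username => { name:, email: } }.
# Blank lines are skipped; any other line that does not match raises an error.
def parse_authors(text)
  text.each_line.with_index(1).each_with_object({}) do |(line, lineno), map|
    next if line.strip.empty?
    m = AUTHOR_LINE.match(line.strip)
    raise ArgumentError, "line #{lineno}: expected 'username = Name <email>'" unless m
    map[m[1]] = { name: m[2], email: m[3] }
  end
end

authors = parse_authors("jdoe = John Doe <john.doe@example.com>\n")
puts authors["jdoe"][:email]
```

Running this over the finished authors file before starting the clone gives early feedback on typos such as a missing angle bracket.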

To get an initial list of usernames for all the developers that committed to the current project, the following command can be run inside a working copy of the Subversion repository:

svn log -q | grep -e '^r' | awk 'BEGIN { FS = "|" } ; { print $2 }' | sort | uniq
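The same extraction can be sketched in Ruby, which makes it easy to emit a skeleton authors file in one go. This is an illustration only: svn_usernames and authors_skeleton are hypothetical helper names, the example.com addresses are placeholders to be replaced by hand, and the sample log text is made up to mirror the shape of `svn log -q` output.

```ruby
# Extract the unique, sorted committer usernames from `svn log -q` output.
# Revision lines look like: r163240 | jdoe | 2011-09-16 07:00:41 +0000 (...)
def svn_usernames(log_output)
  log_output.each_line
            .select { |line| line =~ /\Ar\d+ \|/ }        # keep revision lines only
            .map { |line| line.split("|")[1].to_s.strip } # second field is the username
            .reject(&:empty?)
            .uniq
            .sort
end

# Emit one authors-file line per username with placeholder name and email.
def authors_skeleton(usernames, domain = "example.com")
  usernames.map { |u| "#{u} = #{u} <#{u}@#{domain}>" }
end

sample_log = <<~LOG
  ------------------------------------------------------------------------
  r2 | jdoe | 2011-09-16 07:00:41 +0000 (Fri, 16 Sep 2011)
  ------------------------------------------------------------------------
  r1 | asmith | 2011-09-15 10:12:03 +0000 (Thu, 15 Sep 2011)
LOG
puts authors_skeleton(svn_usernames(sample_log))
```

In practice the input would come from piping `svn log -q` into the script rather than from a literal string.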

For the Confluence migration, I used a simple Ruby script that retrieves staff information from LDAP and prints it in the correct format to STDOUT:

#!/usr/bin/env ruby

# Requires 'ldapsearch' to be on the $PATH

require 'ostruct'

bindDN="uid=ssaasen,ou=People,dc=atlassian,dc=com"
url="ldap://ldap.atlassian.com"
baseDN="ou=People,dc=atlassian,dc=com"

result = `ldapsearch -LLL -D #{bindDN} -x -W -H #{url} -b #{baseDN} "(uid=*)" -S uid cn mail`

records = result.split("\n\n").inject([]) do |lst, group|
  # dn: uid=ssaasen,ou=people,dc=atlassian,dc=com
  # uid: ssaasen
  # cn: Stefan Saasen
  # mail: devnull@atlassian.com
  lst << group.split("\n").inject(OpenStruct.new) do |e, line|
    key, *val = line.split(":")
    e.send("#{key.strip}=", val.join(":").strip)
    e
  end
end

records.reject {|e| !e.mail }.sort{|a,b| a.uid <=> b.uid}.each do |rec|
  name = rec.cn ? rec.cn : rec.uid
  puts "#{rec.uid} = #{name} <#{rec.mail}>"
end

This is of course only one of the many different ways to generate the authors mapping file. When you make the change, be sure to do what’s comfortable for you and your team.

Initial clone

With the authors file properly populated, the following command can be used to import the Subversion history into a git repository:

git svn clone --prefix=svn/ -s --no-metadata \
--authors-file=$BASE_DIR/authors-transform-final.txt \
http://svn.example.com/svn-repository/ local-repo

This command will clone the Subversion repository into a git repository named 'local-repo', assuming the standard Subversion layout (the -s flag expects svn-repository to contain trunk, tags and branches directories). It uses the authors file to map usernames and, because of --no-metadata, omits the git-svn-id line that git svn uses as a fallback to map Subversion revisions to git commit ids.

A word of caution: Omitting the git-svn-id metadata from every commit message will prevent git svn from restoring that information if its internal database ever gets corrupted. This option should not be used if you want to use git svn as a Subversion client!

Commits will then look like:

commit 46fd18726ae451ef4d48a5a1ce16600b66bec5d3
Author: John Doe <john.doe@example.com>
Date: Fri Sep 16 07:00:41 2011 +0000

CONFDEV-6004: make placeholders more robust in webkit and firefox

instead of:

commit 46fd18726ae451ef4d48a5a1ce16600b66bec5d3
Author: jdoe
Date: Fri Sep 16 07:00:41 2011 +0000

CONFDEV-6004: make placeholders more robust in webkit and firefox

git-svn-id: https://svn.example.com/repository/project/trunk@163240 d2a7a951-c712-0410-832a-9abccabd3052

Since the initial clone took a lengthy 12 hours, we used a slightly modified version of the command that enabled us to keep using the repository as a mirror. Due to our repository layout, we explicitly specified the trunk, tags and branches directories (using the -T, -t and -b flags):

#!/bin/bash

set -u # Don't use undefined variables
set -e # Exit on error

REPO_NAME="confluence-git"
BASE_DIR=/opt/svn-to-git/conversion

git init $REPO_NAME
cd $REPO_NAME

git svn init \
    --no-metadata \
    --prefix=svn/ \
    -Tatlassian/confluence/trunk \
    -tatlassian/confluence/tags \
    -batlassian/confluence/branches \
    -batlassian/confluence/branches/private \
    file://$BASE_DIR/atlassian-private-svn

git config svn.authorsfile $BASE_DIR/authors-transform-final.txt
# 'git config <key>' without a value only reads the key, so set it explicitly
git config svn.noMetadata true
git svn fetch

To decouple network access and svn to git conversion, we used a local svn copy using svnsync instead of directly using the Subversion server:

# Initialise the svnsync mirror
svnsync init file://$BASE_DIR/atlassian-private-svn http://svn.example.com/repository

Keeping the git repository in sync with our Subversion server was then simply a matter of running:

svnsync sync file://$BASE_DIR/atlassian-private-svn
cd confluence-git
git svn fetch --authors-prog=$BASE_DIR/update-authors.rb

For a one-off conversion it is sufficient to only use the authors file mentioned above. For ongoing synchronisation, the --authors-prog option can be used to look up author mappings that do not exist in the original authors file, a necessity if there are new developers committing to the upstream Subversion repository.

The program passed to --authors-prog receives a single argument (the Subversion username) and must print a single line of the form Name <email> to STDOUT. The script we used is a slightly more generic version of the Ruby script introduced above that looked up the name/email pair from our internal LDAP server.
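To make the contract concrete, here is a minimal sketch of what such a program could look like. It is not the script we used: the static KNOWN_AUTHORS table is a hypothetical stand-in for the LDAP lookup, and the fallback to a placeholder example.com address for unknown users is an assumption chosen so that a fetch does not abort on a new committer.

```ruby
#!/usr/bin/env ruby

# Hypothetical stand-in for a directory lookup such as LDAP.
KNOWN_AUTHORS = {
  "jdoe" => ["John Doe", "john.doe@example.com"],
}.freeze

# Build the "Name <email>" line for a Subversion username.
# Unknown usernames fall back to a placeholder identity.
def author_line(username, directory = KNOWN_AUTHORS)
  name, email = directory.fetch(username) { [username, "#{username}@example.com"] }
  "#{name} <#{email}>"
end

# git svn passes the Subversion username as the first argument.
puts author_line(ARGV[0]) if ARGV[0]
```

Saved as an executable file, a script like this can be tried by hand with a username as its only argument before pointing --authors-prog at it.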

Our git svn mirror was synced every minute, and it was used to test our build and development infrastructure. After successfully running our mirror for a few weeks we even switched some of our builds to run off of git, as it turned out to be faster than checking out from our busy Subversion server.

Final conversion

After running and exercising the git-svn mirror for a few months, we finally decided to switch after the Confluence 4.0 release. In order to turn our existing git mirror into our final repository, we turned the svn tags into proper annotated tag objects in git and pruned old and unused branches along the way.

Subversion tags

git svn does not create proper git tags for svn tags. Instead, it keeps each of them as a remote branch (under refs/remotes/svn/tags/ in our setup).

Git has two different kinds of tags: a lightweight tag and an annotated tag. The lightweight tag is simply a named reference that points to a particular commit, which allows you to refer to that commit by name instead of its SHA1 commit id. Annotated tags, on the other hand, are objects of their own that require a tag message and record the tag creator and the tag creation date.

The following script turns the svn tag branches into annotated tag objects:

#!/bin/sh

# Based on https://github.com/haarg/convert-git-dbic

set -u
set -e

git for-each-ref --format='%(refname)' refs/remotes/svn/tags/* | while read r; do
    tag=${r#refs/remotes/svn/tags/}
    sha1=$(git rev-parse "$r")

    commiterName="$(git show -s --pretty='format:%an' "$r")"
    commiterEmail="$(git show -s --pretty='format:%ae' "$r")"
    commitDate="$(git show -s --pretty='format:%ad' "$r")"
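    # Print the commit subject and body separated by a blank line.
    # Note: git tag only reads a message from stdin when given '-F -'; because
    # '-m' is used below, the '-m' string becomes the tag message.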
    git show -s --pretty='format:%s%n%n%b' "$r" | \
    env GIT_COMMITTER_EMAIL="$commiterEmail" GIT_COMMITTER_DATE="$commitDate" GIT_COMMITTER_NAME="$commiterName" \
    git tag -a -m "Tag: ${tag} sha1: ${sha1} using '${commiterName}', '${commiterEmail}' on '${commitDate}'" "$tag" "$sha1"
    # Remove the svn/tags/* ref
    git update-ref -d "$r"
done

If you compare the tags in your git-svn clone against those in the Subversion repository, you might notice that the number of tags differs. In our case, we had 535 tags in the Subversion repository but 556 tags in our git-svn clone. The difference comes from tags that were created at some point but later removed from the Subversion repository. In order to create a true copy of the Subversion repository at the time of the final switch, we used the following script to remove local tags that were no longer present in the Subversion repository:

for tag in $(git tag -l); do
    echo "Check if the tag '${tag}' still exists in Subversion"
    # 'if !' tests the exit status directly, so no 'set +e' / 'set -e' toggling is needed
    if ! svn ls "https://svn.example.com/svn/confluence/tags/${tag}" > /dev/null 2>&1; then
        echo "Tag '${tag}' doesn't exist anymore, will remove it from git repository."
        git tag -d "${tag}"
    fi
done
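With several hundred tags, one `svn ls` call per tag adds up. A possible alternative, sketched here under the assumption that a single `svn ls` of the tags directory prints one entry per line with a trailing slash, is to list both sides once and compute the difference. The helper names and the sample tag names below are hypothetical.

```ruby
# Turn the output of `svn ls <tags-url>` into plain tag names.
def svn_tag_names(svn_ls_output)
  svn_ls_output.each_line.map { |line| line.strip.chomp("/") }.reject(&:empty?)
end

# Tags that exist in git but are no longer present in Subversion.
def stale_tags(git_tags, svn_tags)
  git_tags - svn_tags
end

git_tags = ["confluence_3_5", "confluence_4_0", "confluence_4_0-beta1"]
svn_ls   = "confluence_3_5/\nconfluence_4_0/\n"

# Print the cleanup commands instead of running them, so they can be reviewed first.
stale_tags(git_tags, svn_tag_names(svn_ls)).each do |tag|
  puts "git tag -d #{tag}"
end
```

Printing the commands rather than executing them keeps the destructive step under manual control.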

Prune large, unused content

Switching your version control system is a good opportunity to get rid of unused or unnecessary history or branches. This might not be worth the effort for fairly young or small projects, but in large projects like Confluence, we felt that the size of our git clone (~800 MB) could be improved. Due to its distributed nature, a clone of a git repository transfers every single file ever added, so pruning accidentally committed files or removing unused branches benefits every consumer of the git repository.

To reduce the size of the repository on disk, it’s worth checking for large files that can be safely removed from the history. To get an idea of what the large files are in a git repository, we used the following script to print a list of the 10 largest objects:

#!/usr/bin/env ruby

# Based on http://progit.org/book/ch9-7.html

puts "Running 'git gc'" && `git gc` unless $DEBUG

# Find the 10 largest objects
`git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n --reverse | head -10`.split("\n").each do |line|
    # SHA1 type size size-in-pack-file offset-in-packfile
    # or
    # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
    sha1, type, size, *rest = line.split
    size_human_readable = sprintf "%.2f", size.to_f/1024.0**2
    puts "Resolving file information for #{sha1}" if $DEBUG
    path = `git rev-list --objects --all | \grep #{sha1}`.split.last
    $stdout.puts "sha1: #{sha1}, size: #{size_human_readable} Mb, file: #{path}"
    $stdout.flush
end

In our case, which considered all branches, including the private and possibly unrelated ones, the script yielded the following result:

$> cd /path/to/git/repo.git && ruby /opt/scripts/find-large-objetcs.rb
Counting objects: 521785, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (165873/165873), done.
Writing objects: 100% (521785/521785), done.
Total 521785 (delta 281337), reused 521785 (delta 281337)

sha1: f785359836cdc6abb85d6771d5241b21034dd31e, size: 44.73 Mb, file: rthomas/shipit13/trunk/conf-webapp/src/main/resources/com/atlassian/confluence/setup/atlassian-bundled-plugins.zip
sha1: 4d18fd8b96cb35b81dba868f905b6d0074dc92e4, size: 41.96 Mb, file: conf-acceptance-test/src/main/resources/siteExport-QAdata.zip
sha1: 89323ee2956be8bbfea9707117ffeb0f45719317, size: 41.37 Mb, file: bar/shipit14/python_scripts/102153-2.xml.zip
sha1: 6c2e4e7388f86b49ae0625cc27262e8448af659f, size: 36.46 Mb, file: replay.tar.bz2
sha1: 2ba11ba3b0f289ebcd23ff722291d2fe2a1767ff, size: 27.57 Mb, file: foo/shipit13/trunk/confluence-bundled-plugins/target/bundles/OfficeConnector-1.6.jar
sha1: e22530fdb67da29d12eefd3adda7d3f2d3b2f14b, size: 17.68 Mb, file: abc/frother/conf-webapp/src/main/resources/com/atlassian/confluence/setup/demo-site.zip
sha1: daf0227c2af2093fe409daf648cba5e05fb035f8, size: 14.36 Mb, file: conf-acceptance-test/src/main/resources/site-export-broken-trustedapp.zip
sha1: b282cdbe5079eca684683b27efa709f67b9a4702, size: 6.31 Mb, file: conf-webapp/src/main/bundled-plugins/atlassian-plugin-repository-confluence-plugin-2.0.9.jar
sha1: d22e1ce128dc658f4ed3610fed264b676f8ab4ff, size: 5.21 Mb, file: baz/shipit5/confluence/src/etc/java/com/atlassian/confluence/setup/atlassian-bundled-plugins.zip
sha1: ee6337b5de108cb6aa8f708c3a2f2454af041f58, size: 4.49 Mb, file: conf-webapp/src/main/resources/com/atlassian/confluence/setup/demo-site.zip

We see that some of the private branches in our svn repository contain files we don’t need. It’s important to note that the files listed can be part of other branches, so this list of large objects is merely one way that may help you identify branches that would be worth leaving out of the new repository.

Create the final branches

To reduce the size of the final repository we decided to prune some of the old branches and to keep only branches mapping to official streams of work (old stable branches for the maintenance releases) or important feature branches. This reduced the size of the repository from ~800 MB to ~350 MB.

The following script, for example, only considers branches starting with the prefix confluence, so it creates local branches for only a subset of all the available svn branches:

#!/bin/bash
# bash is required: the [[ ... =~ ... ]] test below is not available in plain sh

# create local branches out of svn branches
git for-each-ref --format='%(refname)' refs/remotes/svn/ | while read branch_ref; do
    branch=${branch_ref#refs/remotes/svn/}
    # Only use select branches
    if [[ "$branch" =~ ^confluence(_[0-9]|-project-[0-9]).* ]]; then
        echo "Creating Confluence branch: $branch"
        git branch -t "$branch" "$branch_ref"
        git update-ref -d "$branch_ref"
    fi
done
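Before deleting any refs, it can help to check which branch names the pattern actually selects. The sketch below applies the same regular expression in Ruby to a list of ref names; the branch names are made up for illustration and official_branches is a hypothetical helper.

```ruby
# Same pattern as in the shell script, anchored at the start of the branch name.
OFFICIAL_BRANCH = /\Aconfluence(_[0-9]|-project-[0-9])/

# Strip the remote prefix and keep only branches matching the pattern.
def official_branches(refs, prefix = "refs/remotes/svn/")
  refs.map { |ref| ref.delete_prefix(prefix) }
      .select { |branch| branch.match?(OFFICIAL_BRANCH) }
end

refs = [
  "refs/remotes/svn/confluence_3_5",
  "refs/remotes/svn/confluence-project-4.0",
  "refs/remotes/svn/jdoe-spike",
  "refs/remotes/svn/tags/confluence_4_0",
  "refs/remotes/svn/trunk",
]
puts official_branches(refs)
```

Because the pattern is anchored, names such as tags/confluence_4_0 and trunk are skipped, which is why the shell script can iterate over everything under refs/remotes/svn/.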

Push to the new shared git repository

After creating local tags and branches, the only remaining step is to push them to the shared repository that the team is going to use. We host the Confluence repository on Bitbucket:

git remote add origin https://bitbucket.org/atlassian/confluence.git
git push origin --all # Pushes all refs under refs/heads
git push origin --tags # Pushes all refs under refs/tags

Done!

Conclusion

git svn is not only a great tool that allows you to use a local git repository with a Subversion server; it is also a simple way to convert any Subversion repository into a proper git repository.

Check out our migrating to Git resources to learn more.