Three months ago, the Confluence team switched from Subversion to git, just in time for our 4.1 release. In "Confluence, git, rename, merge oh my…", we talked about the problems we encountered with merges across branches that had lots of renames. In this post, we take a step back to look at the tools we used to migrate the Confluence source code from Subversion to git.
Preparation
Firstly, we wanted to ensure that the following infrastructure supported our source code hosted in git:
- Since our team uses Maven, we needed to ensure that the Maven releases worked properly, especially all of the custom Maven plugins and scripts we built for source code distribution and Confluence release builds.
- We have a large number of Bamboo build plans that we use to test and release Confluence and some of its supporting libraries and plugins.
- JIRA, FishEye and Crucible are crucial in our development workflow, so we wanted to be able to set up and test the integration first.
Secondly, we wanted to make the switch to a new version control system as smooth as possible for the team, without unnecessarily interrupting our development iterations. To accomplish this goal, we decided to set up a git mirror of our Subversion repository, which helped us test the git integration before finally flipping the switch.
Process
With the help of git svn, it is relatively straightforward to convert a Subversion repository to git:
- Clone the entire repository with git svn, making sure to properly map author information.
- Turn the svn tags that git svn maintains as branches into proper git tags.
- Potentially prune unused branches before pushing the git repository to its new location, in order to reduce the size of the converted git repository.
- Push the newly created local tags and branches to the final shared repository.
Creating the initial git repository
The authors file
The initial import should be done with a properly populated authors file that maps Subversion usernames to git author names (including email addresses). This file is used to populate the author information of the git commits. The authors file contains one mapping per line in the following format:
jdoe = John Doe <john.doe@example.com>
To get an initial list of usernames for all the developers who committed to the current project, extract the unique author names from the output of svn log, run inside a Subversion working copy.
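One way to collect those usernames can be sketched as follows. This is a minimal sketch, assuming the usual `svn log --quiet` output in which every revision line reads `r<revision> | <username> | <date>`. The function name `extract_svn_authors` and the output file name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads `svn log --quiet` output on stdin and prints
# each distinct committer username once, sorted alphabetically.
extract_svn_authors() {
    # Revision lines look like:
    #   r163240 | jdoe | 2011-09-16 07:00:41 +0000 (Fri, 16 Sep 2011)
    # The dashed separator lines do not match the pattern and are skipped.
    awk -F ' [|] ' '/^r[0-9]+ [|] / { print $2 }' | sort -u
}

# Usage inside a Subversion working copy:
#   svn log --quiet | extract_svn_authors > usernames.txt
```

Each username in the resulting list then needs a matching line in the authors file.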
For the Confluence migration, I used a simple Ruby script that retrieves staff information from LDAP and prints it in the correct format to STDOUT:
# Requires 'ldapsearch' to be on the $PATH
require 'ostruct'

bindDN="uid=ssaasen,ou=People,dc=atlassian,dc=com"
url="ldap://ldap.atlassian.com"
baseDN="ou=People,dc=atlassian,dc=com"

result = `ldapsearch -LLL -D #{bindDN} -x -W -H #{url} -b #{baseDN} "(uid=*)" -S uid cn mail`

records = result.split("\n\n").inject([]) do |lst, group|
  # dn: uid=ssaasen,ou=people,dc=atlassian,dc=com
  # uid: ssaasen
  # cn: Stefan Saasen
  # mail: devnull@atlassian.com
  lst << group.split("\n").inject(OpenStruct.new) do |e, line|
    key, *val = line.split(":")
    e.send("#{key.strip}=", val.join(":").strip)
    e
  end
end

records.reject {|e| !e.mail }.sort{|a,b| a.uid <=> b.uid}.each do |rec|
  name = rec.cn ? rec.cn : rec.uid
  puts "#{rec.uid} = #{name} <#{rec.mail}>"
end
This is of course only one of the many different ways to generate the authors mapping file. When you make the change, be sure to do what’s comfortable for you and your team.
Initial clone
With the authors file properly populated, the following command can be used to import the Subversion history into a git repository:
git svn clone --stdlayout --no-metadata \
    --authors-file=$BASE_DIR/authors-transform-final.txt \
    http://svn.example.com/svn-repository/ local-repo
This command clones the Subversion repository into a git repository named 'local-repo', assuming a default Subversion layout (that is, svn-repository contains trunk, tags and branches directories). It uses the authors file to map authors and omits the git-svn-id line that git svn would otherwise add to every commit message and use as a fallback to map Subversion revisions to git commit ids.
A word of caution: Omitting the git-svn-id metadata from every commit message prevents git svn from restoring that information if its internal database ever gets corrupted. Do not use this option if you want to use git svn as a Subversion client!
Commits will then look like:
Author: John Doe <john.doe@example.com>
Date: Fri Sep 16 07:00:41 2011 +0000
CONFDEV-6004: make placeholders more robust in webkit and firefox
instead of:
Author: jdoe
Date: Fri Sep 16 07:00:41 2011 +0000
CONFDEV-6004: make placeholders more robust in webkit and firefox
git-svn-id: https://svn.example.com/repository/project/trunk@163240 d2a7a951-c712-0410-832a-9abccabd3052
Since the initial clone took a lengthy 12 hours, we used a slightly modified version of the command that enabled us to keep using the repository as a mirror. Due to our repository layout, we explicitly specified the trunk, tags and branches directories (using the -T, -t and -b flags):
#!/bin/bash
set -u # Don't use undefined variables
set -e # Exit on error

REPO_NAME="confluence-git"
BASE_DIR=/opt/svn-to-git/conversion

git init "$REPO_NAME"
cd "$REPO_NAME"

git svn init \
    --no-metadata \
    --prefix=svn/ \
    -Tatlassian/confluence/trunk \
    -tatlassian/confluence/tags \
    -batlassian/confluence/branches \
    -batlassian/confluence/branches/private \
    "file://$BASE_DIR/atlassian-private-svn"

git config svn.authorsfile "$BASE_DIR/authors-transform-final.txt"
# An explicit value is needed: without one, git config only reads the key
git config svn.noMetadata true
git svn fetch
To decouple network access from the svn to git conversion, we ran the conversion against a local copy of the repository created with svnsync, instead of directly against the Subversion server:
svnsync init file://$BASE_DIR/atlassian-private-svn http://svn.example.com/repository
Keeping the git repository in sync with our Subversion server was then simply a matter of running:
cd confluence-git
git svn fetch --authors-prog=$BASE_DIR/update-authors.rb
For a one-off conversion, the authors file mentioned above is sufficient. For ongoing synchronisation, the --authors-prog option can be used to look up author mappings that do not exist in the original authors file, a necessity if new developers start committing to the upstream Subversion repository.
The program passed to --authors-prog receives a single argument (the Subversion username) and must print a Name <email> pair to STDOUT. The script we used is a slightly more generic version of the Ruby script introduced above and looks up the name/email pair on our internal LDAP server.
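The contract of such a program can be sketched in a few lines of shell. This is not the LDAP-backed script we used: as a stand-in, it reads from a supplementary mapping file and falls back to a placeholder address. The file name `extra-authors.txt`, the variable `EXTRA_AUTHORS_FILE`, the function name and the example.com fallback are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical authors-prog sketch. git svn passes the Subversion username
# as the only argument and reads a "Name <email>" pair from stdout.
# EXTRA_AUTHORS_FILE holds "username = Name <email>" lines (assumed name).
EXTRA_AUTHORS_FILE="${EXTRA_AUTHORS_FILE:-extra-authors.txt}"

lookup_author() {
    username="$1"
    # Find the first line whose left-hand side equals the username.
    mapped=$(awk -F ' = ' -v user="$username" \
        '$1 == user { print $2; exit }' "$EXTRA_AUTHORS_FILE" 2>/dev/null)
    if [ -n "$mapped" ]; then
        printf '%s\n' "$mapped"
    else
        # Unknown user: fall back to a placeholder identity (assumption).
        printf '%s <%s@example.com>\n' "$username" "$username"
    fi
}

# When saved as an executable script, end the file with:
#   lookup_author "$1"
```

A fallback keeps the sync job running when a new committer shows up, at the cost of a placeholder identity ending up in the converted history until the mapping is fixed.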
Our git svn mirror was synced every minute, and it was used to test our build and development infrastructure. After successfully running our mirror for a few weeks we even switched some of our builds to run off of git, as it turned out to be faster than checking out from our busy Subversion server.
Final conversion
After running and exercising the git-svn mirror for a few months, we finally decided to switch after the Confluence 4.0 release.
In order to turn our existing git mirror into our final repository, we turned the svn tags into proper annotated tag objects in git and pruned old and unused branches along the way.
Subversion tags
In order to preserve svn tags, git svn keeps them as remote branches (under refs/remotes/svn/tags/ in our setup) instead of proper git tags.
Git has two different kinds of tags: a lightweight tag and an annotated tag. The lightweight tag is simply a named reference that points to a particular commit, so you can refer to that commit by name instead of by its SHA1 commit id. Annotated tags, on the other hand, are objects of their own that require a tag message and record the tag creator and the tag creation date.
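The difference is easy to see in a throwaway repository. The following is a minimal sketch, assuming git is installed; the function name, tag names, message and identity are made up for the demo.

```shell
#!/bin/sh
# Sketch: create one lightweight and one annotated tag in a temporary
# repository and print the type of object each tag name refers to.
demo_tag_types() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q .
        git -c user.name='Demo User' -c user.email='demo@example.com' \
            commit -q --allow-empty -m 'Initial commit'
        # Lightweight tag: just a ref pointing at the commit.
        git tag lightweight-tag
        # Annotated tag: a separate tag object with tagger, date and message.
        git -c user.name='Demo User' -c user.email='demo@example.com' \
            tag -a -m 'Release 1.0' annotated-tag
        echo "lightweight-tag: $(git cat-file -t lightweight-tag)"
        echo "annotated-tag: $(git cat-file -t annotated-tag)"
    )
    status=$?
    rm -rf "$repo"
    return $status
}
```

Calling `demo_tag_types` reports `commit` for the lightweight tag and `tag` for the annotated one, because only the annotated tag is stored as an object of its own.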
The following script turns the svn tag branches into annotated tag objects:
#!/bin/bash
# Based on https://github.com/haarg/convert-git-dbic
set -u
set -e

git for-each-ref --format='%(refname)' refs/remotes/svn/tags/* | while read r; do
    tag=${r#refs/remotes/svn/tags/}
    sha1=$(git rev-parse "$r")

    commiterName="$(git show -s --pretty='format:%an' "$r")"
    commiterEmail="$(git show -s --pretty='format:%ae' "$r")"
    commitDate="$(git show -s --pretty='format:%ad' "$r")"

    # Print the commit subject and body separated by a newline
    git show -s --pretty='format:%s%n%n%b' "$r" | \
    env GIT_COMMITTER_EMAIL="$commiterEmail" GIT_COMMITTER_DATE="$commitDate" GIT_COMMITTER_NAME="$commiterName" \
    git tag -a -m "Tag: ${tag} sha1: ${sha1} using '${commiterName}', '${commiterEmail}' on '${commitDate}'" "$tag" "$sha1"

    # Remove the svn/tags/* ref
    git update-ref -d "$r"
done
If you compare the tags in your git-svn clone against those in the Subversion repository, you might notice that the number of tags differs. In our case, we had 535 tags in the Subversion repository but 556 tags in our git-svn clone. The difference comes from tags that were once created but have since been removed from the Subversion repository. In order to create a true copy of the Subversion repository at the time of the final switch, we used the following script to remove local tags that were no longer present in the Subversion repository:
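#!/bin/bash
set -u
set -e

# Iterate over all local git tags
git tag -l | while read tag; do
    echo "Check if the tag '${tag}' still exists in Subversion"
    # Testing svn ls in the if condition keeps a failed lookup from aborting the script under 'set -e'
    if ! svn ls "https://svn.example.com/svn/confluence/tags/${tag}" > /dev/null 2>&1; then
        echo "Tag '${tag}' doesn't exist anymore, will remove it from git repository."
        git tag -d "${tag}"
    fi
done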
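Before deleting anything, it can help to list the tags that exist only on the git side. The sketch below compares two plain-text tag lists. The function name and file names are hypothetical, and it assumes that `svn ls` prints tag directories with a trailing slash.

```shell
#!/bin/sh
# Hypothetical helper: print the tags listed in the git tag file but not
# in the Subversion tag file. Both files hold one tag name per line.
tags_only_in_git() {
    git_tags="$1"
    svn_tags="$2"
    sorted_git=$(mktemp) && sorted_svn=$(mktemp) || return 1
    sort -u "$git_tags" > "$sorted_git"
    # Strip the trailing slash that svn ls appends to directory names.
    sed 's,/$,,' "$svn_tags" | sort -u > "$sorted_svn"
    # comm -23 keeps the lines that appear only in the first file.
    comm -23 "$sorted_git" "$sorted_svn"
    rm -f "$sorted_git" "$sorted_svn"
}

# Usage (the URL is the placeholder used in this post):
#   git tag -l > git-tags.txt
#   svn ls https://svn.example.com/svn/confluence/tags > svn-tags.txt
#   tags_only_in_git git-tags.txt svn-tags.txt
```

This needs a single `svn ls` call instead of one per tag, which matters with several hundred tags.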
Prune large, unused content
Switching your version control system is a good opportunity to get rid of unused or unnecessary history or branches. This might not be worth the effort for fairly young or small projects, but in large projects like Confluence, we felt that the size of our git clone (~800 MB) could be improved. Due to its distributed nature, a clone of a git repository transfers the complete history, including every file ever added, so pruning accidentally committed files or removing unused branches benefits every consumer of the git repository.
To reduce the size of the repository on disk, it’s worth checking for large files that can be safely removed from the history. To get an idea of what the large files are in a git repository, we used the following script to print a list of the 10 largest objects:
# Based on http://progit.org/book/ch9-7.html
unless $DEBUG
  # Separate statements: 'puts "..." && `git gc`' would print the output of git gc instead of the message
  puts "Running 'git gc'"
  system("git gc")
end

# Find the 10 largest objects
`git verify-pack -v .git/objects/pack/*.idx | sort -k 3 -n --reverse | head -10`.split("\n").each do |line|
  # SHA1 type size size-in-pack-file offset-in-packfile
  # or
  # SHA1 type size size-in-packfile offset-in-packfile depth base-SHA1
  sha1, type, size, *rest = line.split
  size_human_readable = sprintf "%.2f", size.to_f/1024.0**2
  puts "Resolving file information for #{sha1}" if $DEBUG
  path = `git rev-list --objects --all | grep #{sha1}`.split.last
  $stdout.puts "sha1: #{sha1}, size: #{size_human_readable} Mb, file: #{path}"
  $stdout.flush
end
In our case, which considered all branches, including the private and possibly unrelated ones, the script yielded the following result:
Counting objects: 521785, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (165873/165873), done.
Writing objects: 100% (521785/521785), done.
Total 521785 (delta 281337), reused 521785 (delta 281337)
sha1: f785359836cdc6abb85d6771d5241b21034dd31e, size: 44.73 Mb, file: rthomas/shipit13/trunk/conf-webapp/src/main/resources/com/atlassian/confluence/setup/atlassian-bundled-plugins.zip
sha1: 4d18fd8b96cb35b81dba868f905b6d0074dc92e4, size: 41.96 Mb, file: conf-acceptance-test/src/main/resources/siteExport-QAdata.zip
sha1: 89323ee2956be8bbfea9707117ffeb0f45719317, size: 41.37 Mb, file: bar/shipit14/python_scripts/102153-2.xml.zip
sha1: 6c2e4e7388f86b49ae0625cc27262e8448af659f, size: 36.46 Mb, file: replay.tar.bz2
sha1: 2ba11ba3b0f289ebcd23ff722291d2fe2a1767ff, size: 27.57 Mb, file: foo/shipit13/trunk/confluence-bundled-plugins/target/bundles/OfficeConnector-1.6.jar
sha1: e22530fdb67da29d12eefd3adda7d3f2d3b2f14b, size: 17.68 Mb, file: abc/frother/conf-webapp/src/main/resources/com/atlassian/confluence/setup/demo-site.zip
sha1: daf0227c2af2093fe409daf648cba5e05fb035f8, size: 14.36 Mb, file: conf-acceptance-test/src/main/resources/site-export-broken-trustedapp.zip
sha1: b282cdbe5079eca684683b27efa709f67b9a4702, size: 6.31 Mb, file: conf-webapp/src/main/bundled-plugins/atlassian-plugin-repository-confluence-plugin-2.0.9.jar
sha1: d22e1ce128dc658f4ed3610fed264b676f8ab4ff, size: 5.21 Mb, file: baz/shipit5/confluence/src/etc/java/com/atlassian/confluence/setup/atlassian-bundled-plugins.zip
sha1: ee6337b5de108cb6aa8f708c3a2f2454af041f58, size: 4.49 Mb, file: conf-webapp/src/main/resources/com/atlassian/confluence/setup/demo-site.zip
We see that some of the private branches in our svn repository contain files we don’t need. It’s important to note that the files listed can be part of other branches, so this list of large objects is merely one way that may help you identify branches that would be worth leaving out of the new repository.
Create the final branches
To reduce the size of the final repository, we decided to prune some of the old branches and to keep only branches mapping to official streams of work (old stable branches for the maintenance releases) or important feature branches. This reduced the size of the repository from ~800 MB to ~350 MB.
The following script, for example, only considers branches starting with the prefix confluence, so it creates local branches for only a subset of all the available svn branches:
#!/bin/bash
# create local branches out of svn branches
git for-each-ref --format='%(refname)' refs/remotes/svn/ | while read branch_ref; do
    branch=${branch_ref#refs/remotes/svn/}
    # Only use select branches
    if [[ "$branch" =~ ^confluence(_[0-9]|-project-[0-9]).* ]]; then
        echo "Creating Confluence branch: $branch"
        git branch -t "$branch" "$branch_ref"
        git update-ref -d "$branch_ref"
    fi
done
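Since the script deletes refs as it goes, a dry run that only prints the branches the pattern would keep can be useful first. The following is a minimal sketch that applies the same pattern with `grep -E`; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical dry run: read ref names on stdin and print the branch names
# the selection pattern would keep, without touching the repository.
select_confluence_branches() {
    # Strip the ref prefix, then keep names that match the pattern.
    sed 's,^refs/remotes/svn/,,' | grep -E '^confluence(_[0-9]|-project-[0-9])'
}

# Usage:
#   git for-each-ref --format='%(refname)' refs/remotes/svn/ | select_confluence_branches
```

Any branch missing from the output would be left behind as an svn remote ref and would not be pushed as a branch.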
Push to the new shared git repository
After creating local tags and branches, the only remaining step is to push them to the shared repository that the team is going to use. We host the Confluence repository on Bitbucket:
git push origin --all # Pushes all refs under refs/heads
git push origin --tags # Pushes all refs under refs/tags
Done!
Conclusion
git svn is not only a great tool for using a local git repository with a Subversion server, it is also a simple way to convert any Subversion repository into a proper git repository.
Check out our migrating to Git resources to learn more.

