Diff. Patch. Change. Delta. Δ.

Whatever you call it, diffs are pretty much the aggregate output of any developer’s day. A bunch of deletions and additions from a set of files that can be codified in a .patch file as a bunch of pluses, minuses and context indicators.

‐‐‐ a/todo.md
+++ b/todo.md
@@ -1,5 +1,4 @@
# TODO
buy anniversary present
build arduino-based vaucanson’s duck
-blog on git diffs
fork webkit
‐‐

I’m still undecided as to whether it’s depressing or amazing that our life’s work can be so simply encoded.

Regardless, git has given us a bunch of neat ways at looking at these diffs. The following post is a quick primer on the various incantations of git diff. Note that I’m going to be focusing on different output format for viewing diffs. There’s also a whole range of ways you can specify different commits and paths to diff, but that’s a topic for another day.

git diff

You’re probably already familiar with the built in git diff command. By default, diff will show you changes to tracked files in your work tree that haven’t yet been added to the index. For example:

$ cat 2cities.txt

It was the best of times,
it was the blurst of times.
$ sed -i ‘.tmp’ ‘s/blurst/worst/’ 2cities.txt

$ git diff

It was the best of times,
-it was the blurst of times.
+it was the worst of times.

This works fine, but it’s unfortunate that it shows up as one removed line followed by one added line. All I did was change one word!

git diff ‐‐color-words

git diff also has a special mode for highlighting changes with much better granularity: ‐‐color-words. This mode tokenizes added and removed lines by whitespace and then diffs those.

$ git diff ‐‐color-words

It was the best of times,
it was the blurstworst of times.

That’s an improvement! Now we’re down to the individual words that have changed. But if you look closely, we’ve really only changed three characters: “blu” has become “wo”. We can do even better.

git diff-highlight

If you clone the git source you’ll find a sub-directory called contrib. It contains a bunch of git-related tools and other interesting bits and pieces that haven’t yet been promoted to git core. One of these is a perl script called diff-highlight. diff-highlight pairs up matching lines of diff output and highlights sub-word fragments that have changed.

$ git diff | ~/src/git/contrib/diff-hightlight/diff-highlight

It was the best of times,
-it was the blurst of times.
+it was the worst of times.

Sweet! Now we’ve pared down our diff to the smallest possible change.

If you like this fine-grain way of looking at diffs, you can add a shell function alias to ~/.git/config to handle the piping to diff-highlight for you:


[alias]
# assumes git.git clone at ~/src/git
diff-highlight = "!f() { git diff \"$@\" | ~/src/git/contrib/diff-highlight/diff-highlight; }; f"

Then, you can get a nicely highlighted diff by simply invoking git diff-highlight.

Diffing binary files

If you make changes to a binary file (such as an image, pdf or archive) the default output from git diff is pretty uninteresting.

$ git diff

Binary files a/script.pdf and b/script.pdf differ

However, git does have a nifty feature that allows you to specify a shell command to transform the content of your binary files into text prior to performing the diff. It does require a little set up though.

First you need to specify a textconv filter describing how to convert a certain type of binary to text. I’m using a simple utility called pdftohtml (available via homebrew) to convert my PDFs into human readable HTML. You can set this up for a single repository by editing your .git/config file, or globally by editing ~/.gitconfig.

.git/config

 


[diff "pdfconv"]
textconv=pdftohtml -stdout

Then all you need to do is associate one or more file patterns with our pdfconv filter. You can do this by creating a .gitattributes file in the root of your repository.

.gitattributes

 


*.pdf diff=pdfconv

Then, the next time we attempt to do a diff of a binary file, we get much more meaningful output:

$ git diff

<HTML>
<BODY>
Interior. Room filled with monkeys bashing on typewriters. Mr. Burns tears a sheet of paper from a typewriter.<br>
-Burns: It was the best of times.. it was the blurst of times?<br>
+Burns: It was the best of times.. it was the blurst of times? You stupid monkey!<br>
Burns balls up the sheet and throws it at the monkey.<br>
</BODY>
</HTML>

And remember, this is just regular diff output, so we can redo our diff-highlight trick to get something a bit more palatable:

$ git diff-highlight

Interior. Room filled with monkeys bashing on typewriters. Mr. Burns tears a sheet of paper from a typewriter.<br>
-Burns: It was the best of times.. it was the blurst of times?<br>
+Burns: It was the best of times.. it was the blurst of times? You stupid monkey!<br>
Burns balls up the sheet and throws it at the monkey.<br>

The same technique can be applied to get useful diffs from all sorts of binary files, for example:

  • zips, jars and other archives: using unzip -l (or similar) in place of pdf2html will show you paths that have been added or removed between commits
  • images: exiv2 can be used to show metadata changes such as image dimensions
  • documents: conversion tools exist for transforming .odf, .doc and other document formats to plain text. In a pinch, strings will often work for binary files where no formal converter exists.

Want to talk more about git or Atlassian Stash? Leave a comment or hit me up on twitter @kannonboy

Vive la git diff!