TL;DR: never change the functionality of upgrade tasks that have been released – create new fixer tasks instead. Keep your upgrade tasks independent of other services.

From time to time we get bugs in our products that are caused by changes to upgrade tasks. In general, changing the functionality of upgrade tasks is a bad idea.

A year-old example

OnDemand uses a suite of plugins on top of our regular on-premises applications to provide the integration between the apps. Two of the services provided are Subversion hosting (or were – we no longer offer the Subversion service) and Bamboo (Bamboo is our CI offering – we don’t provide build agents as part of OnDemand; people need to use Amazon’s EC2 platform for that). The integration in this case is that Bamboo builds in OnDemand automatically know about Subversion and can check out from there without extra setup. The ability to do this is implemented by having a special user in the system that Subversion knows about and Bamboo has credentials for. About a year ago, we needed to change the way this user was created so we moved it into the upgrade task code. There was already an upgrade task which did the initial setup in Bamboo, so it seemed logical to add the new user creation code there. Our functional tests start with a clean slate so, in the functional tests, the upgrade task was run and the user created successfully – but manual upgrade testing showed that Bamboo agents could no longer check out from Subversion. The code worked well on new instances, but the users were not created on old instances! The reason was that the upgrade task had already run and was not re-run. The effect was that old instances couldn’t use Bamboo. It took us a lot of time to diagnose the problem because the upgrade task looked like it was creating the user and debugging showed that it did create the user. If we hadn’t changed the upgrade task, and instead written a new one, we’d have saved a fair bit of time.

The problem

Making alterations to upgrade tasks is a high-risk area. It’s high-risk because it’s code that only runs once and then is never run again. This means that if an upgrade task has had its functionality changed at some stage then different versions of that plugin will run with upgrade tasks that do different things. This means that if you get a fresh checkout of the plugin source, you will not be looking at the code that has run everywhere the plugin is used. After a few changes to upgrade tasks between plugin versions, you approach a situation where what has been run (such as workflows being put in a certain state at some time* or, in the case of the above example, a user existing in a certain state at a certain time**) differ significantly between instances, which means you can no longer make assumptions about what has happened in your upgrade tasks. Discovering what version has been run on which instance tends to be difficult and time consuming, so is not always checked for. The whole point of these tasks is that they apply some known change to your data. If you mess with what that change is, then the rest of your code becomes unreliable because it relies on assumptions that may no longer hold true. Upgrade task changes are particularly insidious because they will continue to pass functional tests – the new versions of tests and tasks will run, not the old versions.

If you need to change the effect of an upgrade task the thing to do is the same as what needs to be done to fix an upgrade task that has been changed in the past: leave the old upgrade task alone and create a new task which applies the data changes you need. This allows you to track the changes that have been applied to the data and ensure every change you want applied is actually applied.

The other case where an upgrade task needs to be changed is when APIs change. In general a plugin upgrade task should use the lowest-level supported API available. Using the lowest level API possible means that the number of moving parts is limited, reducing the likelihood of assumed behaviour changing. APIs from the plugin being developed should not be used as they are far more likely to change behaviour. Using a supported API means that the likelihood of having behaviour change between product versions is reduced. Sometimes an unsupported API may need to be used, but that often leads to more maintenance and can require a new plugin release for each version of the product due to binary incompatibilities. When APIs do change, a replacement API should be used; look for equivalent behaviour to achieve the same effect. If you encounter a changed API, you need to take extra care to ensure the new API continues to have the same effect that the old one did.

A solution

So how did we solve the user creation problem? At that time we were experimenting with a new model that runs some checks on every startup and puts the instance in a good state if something is amiss. We created a com.atlassian.sal.api.lifecycle.LifecycleAware which would do this. The LifecycleAware checks to see if the upgrade task has already run. If it hasn’t run, then the upgrade task will run later and create the user then, so we don’t need to do anything, but if it has already run, it will check for the user and create it if it doesn’t exist. In most cases the solution that should be used is to fix the bad state in a subsequent upgrade task.

* Of course, a customer admin can undo your changes, so the only assumption you’d want to make is that a certain event happened at a certain time, not an assumption about the current state

** Atlassian OnDemand can make some limited assumptions about state because it is a controlled environment. Yes, making these sorts of assumptions has bitten us in the past, though usually because people did not manage the state properly.