John Rotenstein, Senior Business Analyst

Here at Atlassian, we make good use of Open Source products. Not just because we can re-use code, but often because the Open Source products are just as good, or better, than commercial products.

One good example is our use of PostgreSQL. Prior to working at Atlassian, I typically used Microsoft SQL Server or Oracle as an enterprise database server, and Microsoft Access or even (gasp!) Microsoft Excel for quick & dirty situations. They don't cut the mustard around Atlassian, where we prefer the power of PostgreSQL, which is an incredible enterprise-capable database. (Our co-CEO, Mike Cannon-Brookes, tells me that many startups use MySQL until they grow large enough to hit its limitations, then move across to PostgreSQL).

Okay, back to the story at hand...

Using pl/pgsql
We're in the process of improving our online ordering system (hooray!) and need to migrate a lot of our business data to a new schema, which basically involves lots of data transformations and some simple business logic. Rather than writing a throw-away Java app to perform the migration, we decided to do it in the PostgreSQL native programming language, known as pl/pgsql ("programming language/PostgreSQL").

It's truly amazing what can be done in such a language. Database statements can be intermixed with function variables, FOR loops can iterate across the results of database queries and even table names can be used to define variable types.

Unfortunately, performance appeared to be a bit of a problem.

We needed to migrate 14,000 records and it appeared that they would take over two seconds each. That's about 9 hours to migrate the data — not too hot.

So, I did the hacker thing... I inserted timing points and had the system write the timing to a database table (real easy, since I can just type an INSERT statement directly into the procedure!).

The results looked like this:

SQL-Migrate1.png

I had written the procedures to select one input record at a time, then transform the data. While this is fine for testing, the 0.33 seconds to select a record was clearly not going to scale. So, I replaced it with a FOR loop iterating over the input dataset, which meant that it only executed once, regardless of the number of records being migrated.

Interestingly, that ending up caching some information so process A and C suddenly worked a lot faster, getting it down to a total of 0.64 seconds — or about 2.5 hours to migrate.

I then studied Process B and found that it was re-selecting data from the input record. While this was clean in terms of writing procedural code, it was not optimal in terms of execution speed. This immediately reminded me of a recent IT Conversations Podcast, where Rasmus Lerdorf (creator of PHP) said that web applications would be rejected at Yahoo! if they required more than a couple of database queries per screen.

So, I re-wrote the procedure calls, passing a database record as a parameter instead of a primary key. This is an example of the power of programming directly within a database language — I simply provided a table name as the data type of an input variable, and the database knew exactly what it meant!.

By avoiding this re-querying, I got down to this:

SQL-Migrate2.png

This gave a total of about 1.4 hours, 15% of the original time required, which is well within the 'acceptable' range for our migration run. It also means I'll get back several hours of personal time on our migration weekend, so I can play more World of Warcraft!

Lessons Learned

  • Database programming languages are quite powerful
  • Use timing points to focus optimization effort
  • Every database query avoided is time saved
  • PostgreSQL rocks!

6 Comment(s)

PostgreSQL rules! Thanks for this article.
Shame Atlassian is not delivering PostgreSQL bundled with its products instead of HSQL, which is not recommended anyway.

By Martin Blazek at November 28, 2007 11:29 AM

We love Postgres too. Unfortunately, bundling platform-specific code like PostgreSQL into our products would make the installation process several orders of magnitude more complicated. (Which distribution of Linux are you running? On what CPU architecture? With what version of glibc...?) When you're evaluating a product, you just want to get it running as quickly as possible.

By Charles Miller at November 28, 2007 12:37 PM

@Charles Miller - do you mean Atlassian evaluation products ship with just the in-memory database?

@John - I don't understand this:

'I had written the procedures to select one input record at a time, then transform the data. While this is fine for testing, the 0.33 seconds to select a record was clearly not going to scale. So, I replaced it with a FOR loop iterating over the input dataset, which meant that it only executed once, regardless of the number of records being migrated.'

Do you mean that you selected the entire dataset with one select query and then worked with it in memory? Wasn't that a large amount of data to keep in memory to process?

By anne at February 13, 2008 4:01 AM

Anne,

>> Do you mean that you selected the entire dataset with one select query and then
>> worked with it in memory? Wasn't that a large amount of data to keep in memory to process?

Yes, that's what I did. No, it doesn't take much memory because PL/PGSQL is specifically designed to do this.

Here's an example:

FOR the_license IN
SELECT *
FROM license
LIMIT qty
LOOP
perform migrate_this_license(the_license);
END LOOP;

So, it selects the number of rows indicated by 'qty', then calls a function for each returned row. PL/PGSQL takes care of moving the database cursor, retrieving the date, etc. The result is that there is one delay for the initial select (especially when using complex queries), but no delay for each iteration. Voila!

By John Rotenstein at February 13, 2008 2:48 PM

Guys -have a look at Hyperic (http://www.hyperic.com) to see how they do it. Postgres is distributed with the packages as the default option. While they obviously recommend using a separate host for the DB, it's a nice message they send out.

I haven't looked into what compiler flags are used for the specific platform builds but the 'platform independent' packages works on pretty much anything.

By Sam at February 21, 2008 9:38 PM

>> Charles: ...would make the installation process several orders of magnitude more complicated...

This may be the case in the Linux world that I am not all that familiar with; I was rather thinking about Windows packages - it would be actually quite cool to have PG preinstalled and all applications set up for it.

By Martin Blazek at February 23, 2008 9:49 AM

Post a comment

If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.





Remember personal info?

Type the characters you see in the picture above.