Shihab Hamid

Hammering Crowd

Shihab Hamid talks about Crowd
March 30, 2008

There have been a few customers wondering how Crowd scales (outside of its integration with JIRA/Confluence). Unfortunately, the answers we could think of ranged from "..yes" to "nfi" - so we decided to take a look at load testing Crowd.

Since Crowd offers a bunch of connection points for various applications, directories and databases, it's hard to give an accurate single metric for scalability. One particular evaluator was asking how Crowd would scale for 1 million users using an internal directory (MySQL) with a PHP application.

It's a massive number given that we consider 20,000 users a large user base.

Getting a million users

We ran a script to insert users into our internal directory, starting with user0 and ending with user999999 - taking 5 hours.

I came back in this morning and found that Dave, our team lead, had already verified that it was possible to authenticate and use the Crowd console without issues. This shows us that Crowd is capable of not falling over if there are 1 million users in its repository. He also decided to double-check that JIRA integration was very broken with this many users :)

Load Testing

PHP, Java, .NET or whatever is unlikely to make a huge difference if your web-services stack is slick. What matters more is which calls your application makes to Crowd. You can bet that findAllPrincipalNames will take much longer than findPrincipalByName, for example.

As the evaluator didn't have a concrete idea of the number and nature of the calls his application would be making, we decided to test out the fundamental calls made to Crowd:


  1. Authenticate

  2. Validate token

  3. Find user from token

When you go to an application, you're likely to log in (authenticate) once, but you're likely to perform many secure operations (each requiring a valid token and the corresponding user) once logged on. Thus we vaguely approximate 100 token checks per authentication call for our load test. In reality you're likely to need fewer authentications.

We had two ideas for load testing: hammer the crap out of it or approximate a reasonable load. We decided to go with the hammering option as it's possible to extrapolate performance under a reasonable load from that data - and it's much faster to simulate than mimic the 3-30 seconds a user would take to read a web page before clicking.

The Hammering

Hammering Crowd means seeing how many concurrent threads Crowd can service. So we take n users and launch n threads. Each thread performs:

for 1..100
{
  authenticatePrincipal()
  for 1..100
  {
    verifyToken()
    findPrincipalByToken()
  }
}

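Which is ~10,000 token checks per thread (each check being a verifyToken call plus a findPrincipalByToken call), on top of the 100 authentications. Note that this test is equivalent to something logging in and pressing refresh 100 times (really fast!) and repeating that 100 times.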
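To make the shape of the test concrete, here is a minimal sketch of such a harness. It is not the script we ran: `FakeCrowdClient` and its method names are hypothetical stand-ins that only count calls, and a real run would swap in a client that talks to the Crowd server.

```python
import threading
import time


class FakeCrowdClient:
    """Hypothetical stand-in for a Crowd client; it only counts calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def _count(self):
        with self._lock:
            self.calls += 1

    def authenticate_principal(self, username, password):
        self._count()
        return "token-" + username

    def verify_token(self, token):
        self._count()
        return True

    def find_principal_by_token(self, token):
        self._count()
        return token[len("token-"):]


def hammer_one_user(client, username, logins, checks_per_login):
    """One thread's workload: log in, then check the token repeatedly."""
    for _ in range(logins):
        token = client.authenticate_principal(username, "secret")
        for _ in range(checks_per_login):
            client.verify_token(token)
            client.find_principal_by_token(token)


def hammer(client, n_threads, logins=100, checks_per_login=100):
    """Run n_threads concurrent users; return elapsed wall-clock seconds."""
    threads = [
        threading.Thread(
            target=hammer_one_user,
            args=(client, "user%d" % i, logins, checks_per_login),
        )
        for i in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

With the defaults, each thread performs 100 logins and 10,000 inner iterations (one verify and one lookup each). Dividing the elapsed time by the number of requests gives an average service time per request.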

Clearly with a handful of threads, you'd expect Crowd to get smashed.

For the load test, Crowd and MySQL were on the same box: a 4-core Mac Pro, networked over a 100 Mbit line to the client, which ran on a separate box. Check out the results:

crowd-hammer-table.png

crowd-hammer-graph.png


Analysis

Making sense of the data:


  • Authentications/authentication verifications are pretty fast (~10ms).

  • Crowd performs optimally when there are 4-6 threads hammering it at the same time and doesn't appear to show signs of death for more concurrent threads.

  • The JVM heap was 128MB and it didn't die, i.e. Crowd is not hogging memory, since there's only a handful of entities it needs to load up for this authentication test.

  • Load could be limited by the generation box. However, the generation box was an 8-core beast whereas the Crowd server was on a 4-core box.

  • Crowd seems to scale for 15+ concurrent threads hammering it with authentication requests. Overnight, we ran a 50-concurrent-threads test which had an average request service time of 8.26ms. Put another way, that is roughly 120 requests serviced per second.

  • At 50 concurrent threads, we are still not maxing out the CPUs although their idle time is decreasing. We could push Crowd even further until it was either CPU, disk or network bound.

  • We can extrapolate these results by pessimistically assuming that "reasonable usage" means one authentication check per user every 10 seconds. At 120 requests per second, this means that Crowd could handle 1200 active users. Note that 10 seconds overestimates the load, especially if client libraries cache authentication for much longer (usually around 2 minutes).

  • This is still a basic test and we should investigate broader performance testing of Crowd's API.
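The extrapolation above is simple arithmetic, sketched here so the numbers can be rechecked or rerun with different assumptions. The function names are illustrative only; the inputs (8.26ms, 120 requests per second, 10 seconds) are the figures from the list above.

```python
def requests_per_second(avg_service_ms):
    """Throughput implied by an average service time in milliseconds."""
    return 1000.0 / avg_service_ms


def supported_active_users(throughput_rps, seconds_between_checks):
    """Users supportable if each makes one request per interval."""
    return throughput_rps * seconds_between_checks


# 8.26ms per request is about 121 requests/second (rounded to 120 above).
print(round(requests_per_second(8.26)))

# 120 requests/second with one check per user every 10 seconds.
print(supported_active_users(120, 10))
```

The same arithmetic with a longer interval between checks gives a proportionally larger number, but treat any such figure as a rough upper bound: it assumes evenly spread requests and nothing else loading the server.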

Going Forward

Load testing is important. It's even more important when you're middleware. Although these metrics are a start, we should consider further performance testing:


  • Various directories: OpenLDAP, ActiveDirectory.

  • Various databases: Postgres, Oracle.

  • Load testing real applications integrated with Crowd (e.g. Confluence, JIRA): might be good to compare how much fat Crowd adds to an application's standard repository.

  • Replaying logs from customers' (or our own) applications for automated load testing of Crowd (without needing to run a specific client application).

  • Profiling: determine which methods are letting us down and optimise based on potential benefit.

There are two sides to considering Crowd and load. The first is to ensure the Crowd server is lean and mean. The second, which we didn't examine in this post and which is particularly useful for Java clients, is to ensure that our client libraries are smart and efficient. That boils down to effective caching: only making requests to the Crowd server when actually necessary.
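As a rough illustration of what timeout-based client caching looks like, here is a minimal sketch. It is not how the Crowd client libraries are implemented: `CachingTokenValidator` and the `validate_remote` callback are hypothetical names, the 120-second default simply mirrors the "around 2 minutes" mentioned earlier, and the class is not thread-safe.

```python
import time


class CachingTokenValidator:
    """Caches token validation results for a fixed time (hypothetical sketch)."""

    def __init__(self, validate_remote, ttl_seconds=120.0, clock=time.monotonic):
        self._validate_remote = validate_remote  # callable that asks the server
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # token -> (is_valid, expires_at)

    def is_valid(self, token):
        now = self._clock()
        entry = self._cache.get(token)
        if entry is not None and now < entry[1]:
            return entry[0]  # still fresh: no round trip to the server
        result = self._validate_remote(token)
        self._cache[token] = (result, now + self._ttl)
        return result

    def invalidate(self, token):
        """Drop a cached entry, e.g. on logout in this application."""
        self._cache.pop(token, None)
```

A longer timeout means fewer requests to the server, at the cost of the client taking longer to notice a logout or user change that happened in another application.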


We're working on making Crowd, and Crowd integration, even slicker in 1.4!

4 Comment(s)

Great! The problem with Atlassian products in an enterprise setting has always been that you guys don't seem to test them with more than a couple users. It is very encouraging to see this change like this.

By Mikael Gueck at March 31, 2008 4:59 AM

Good to hear. You all may want to think about adding some of the internal functionality of Crowd to the "load testing" queue. For example (as noted in http://jira.atlassian.com/browse/CWD-948), Crowd falls apart if you try to import a bunch of users. Obviously Crowd can handle the load once the users are in...but getting them in proved to be a little trickier than expected.

I am still a fan though!

By Nate Nash at March 31, 2008 11:41 AM

IMHO, one of the missing caching features is a mutating cookie which applications can cache and while that cookie remains the same as last time I saw it, I can assume (within a reasonable timeframe) that nothing has changed.

During login / logout / group change or other sensitive "cache busting" times the cookie would mutate, causing the application to ignore its cache settings and talk to the Crowd server again.

My specific use case is that I have Crowd integrated with JIRA and Confluence. For performance reasons I would like to keep the cache interval high on the clients (5 minutes or something) but know that if someone logs into Confluence, then logs out of JIRA and immediately goes back to Confluence, that Confluence knows to ignore the cache and talk to Crowd.

Of course a hostile user agent could resist the cookie change, however I don't believe that's a security threat greater than what already exists through the use of Crowd & SSO.

By Dan Hardiker at March 31, 2008 12:42 PM

Nate, we're aware of the issues regarding importing users (from LDAP, CSV and XML backup) for large datasets. We plan on looking into this - it's likely we could get a speed up using JDBC batching.

Dan, you're right. There is a need to be able to synchronise logouts between client applications that use cached authentications. There is also a need to be able to synchronise user changes between client applications, for example:

1. User X logs in to application A.
2. User visits application B and is authenticated as X via single sign-on.
3. User logs out of application A and signs in as user Y.
4. User visits application B and is authenticated as Y via single sign-on.

If we simply used a timeout-based cache for authenticated users then application B will not see the user change from X to Y - in fact application B will not even know there has been a logout - breaking SSO. These subtleties are abundant.

For the upcoming 1.4 release of Crowd, we will be investigating performance improvements and more effective caching.

By Shihab Hamid at March 31, 2008 4:27 PM
