Abstract

A JVM heap dump is an extremely useful tool for debugging problems with a J2EE application. Unfortunately, when a JVM explodes, the standard jmap tool can take an inordinate amount of time to run, for lots of different reasons. Attempting a heap dump therefore extends downtime, and even then jmap regularly fails.

This blog post outlines an alternative method that uses standard tools from the Unix/Linux arsenal to capture a heap dump with only seconds of additional downtime, leaving the slow jmap process to run once the application is back in service.

Credits

Thanks to Paul De Audney, fellow Atlassian Sysadmin Team Lead, who floated the idea for this a while ago.

Our friend, the JVM

The beauty of a JVM is its isolation from the underlying OS. This makes it attractive to programmers and system administrators alike but generally makes debugging harder for everyone, as standard tools can only tell you what the JVM is doing, not the J2EE application running inside it. Also, the tools that extract the detail a developer needs are slow and unreliable and, as a result, can significantly extend downtime.

However, the JVM is essentially just another process running on the kernel, so our standard tools may be useful in other ways. As it turns out, they can be used to dump not the JVM heap but the process’s core. The JVM heap can then be extracted from this core offline, even on a completely different system! So let’s look at the high-level process and the pre-requisites.

Pre-requisites

  1. Sufficient disk space for the entire process core (RSS+Shared)
  2. Sufficient disk space to hold the extracted heap (usually < RSS+Shared)
  3. The GNU Debugger – gdb
    • sudo apt-get install gdb – for Debian/Ubuntu systems.
    • sudo yum install gdb – for RedHat/CentOS systems.
  4. The JVM’s process ID (PID)
  5. Root (or elevated) privileges to execute arbitrary commands as described below.
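
The disk space pre-requisites can be checked before attaching anything to the JVM. The following is a minimal sketch, assuming a Linux host with /proc available. The function names are hypothetical, and the default multiplier of 4 is an arbitrary cushion (a core can be noticeably larger than RSS, and the extracted heap needs room too), not a measured figure.

```bash
#!/bin/bash
# Sketch only: function names and the multiplier are illustrative assumptions.

# Resident set size of a process in kB, read from /proc (Linux-specific).
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null
}

# Free space in kB on the filesystem that holds the given directory.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Succeed if avail_kb covers need_kb scaled by a safety multiplier.
enough_kb() {
    local need_kb=$1 avail_kb=$2 multiplier=${3:-4}
    [ "$avail_kb" -ge $(( need_kb * multiplier )) ]
}

# Usage: check_space PID DIR
# Fails if the PID cannot be read or DIR looks too small for core plus heap.
check_space() {
    local rss
    rss=$(rss_kb "$1") || return 1
    [ -n "$rss" ] || return 1
    enough_kb "$rss" "$(free_kb "$2")"
}
```

For example, `check_space 16837 /tmp || echo "Not enough room in /tmp"` would warn before the JVM is ever paused.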

The Process

Summary

  1. Get the JVM PID
  2. Dump the PID’s core to a file with gdb
  3. Extract the JVM Heap from the core file in #2 with jmap

Detail

  1. First of all, find the PID of the JVM whose heap you need (pgrep would also work if there is only a single Java application on the host):
    [cc lang=’bash’ line_numbers=’false’ nowrap=”0″]ps -ef|grep java
    # Will output similar to
    UID PID PPID C STIME TTY TIME CMD
    1234 16837 11678 0 Mar13 ? 00:56:51 /usr/lib/jvm/java-7-oracle/bin/java -Djava.util.logging.config.file=/opt/java/tools/tomcat/j2ee-application/conf/logging.properties -server -Xms1024m -Xmx1024m -D -XX:MaxPermSize=256m -Dconfluence.home=….[/cc]
  2. Now that we have the PID (16837 above) we can tell gdb to dump the core for that process. This stage in the process involves invoking gdb as root (or the application owner) to attach to the PID concerned. Then we want to dump the core to a specific file, detach and quit. You can do all that interactively or batch it (see “Making it better…” below). For now, we’ll assume we’re doing it interactively.
    NOTE: as soon as you attach the debugger to the process, that process will stop executing until detached. This isn’t necessarily a problem on a crashed application, but can be a problem if you are heap dumping a running application. In the latter case, using old school jmap directly on the JVM is preferred, even if it is slower.
    [cc lang=”bash” line_numbers=”off”]sudo gdb --pid=16837
    …bunch of info…
    (gdb) gcore /tmp/jvm.core
    Saved corefile /tmp/jvm.core
    (gdb) detach
    (gdb) quit[/cc]
    Or as the application owner:
    [cc lang=”bash” line_numbers=”off”]sudo -u j2ee_application_owner gdb --pid=16837
    ..etc..[/cc]
    There may be some warnings after “gcore /tmp/jvm.core”, but these are informational and don’t interfere with the resultant core file.
  3. Once you have a core file (/tmp/jvm.core specifically) you can restart the J2EE application and restore service, as the forensic data has already been captured! So, the last step is to extract the JVM heap from the process’s core:
    [cc lang=”bash” line_numbers=”off”]sudo jmap -dump:format=b,file=jvm.hprof /usr/bin/java /tmp/jvm.core
    # Which will output…
    Attaching to core /tmp/jvm.core from executable /usr/bin/java, please wait…
    Debugger attached successfully.
    Server compiler detected.
    JVM version is 23.5-b02
    Dumping heap to jvm.hprof …

    Heap dump file created[/cc]
    You now have the original core file in /tmp/jvm.core and the freshly extracted JVM heap in jvm.hprof, in the directory you ran jmap from (so /tmp/jvm.hprof if that was /tmp). From here, you simply need to send the heap dump to the relevant developer(s) and you’re done!
    Notes:

    • The jmap command can be executed anywhere as long as you have the core file and the same version of Java that was running when the core was dumped.
    • The sudo commands above are only really required because the root user created the original core file in this example. They can be replaced with “sudo -u j2ee_application_owner” as appropriate.
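
The interactive gdb session in step 2 can also be run without a prompt. The following is a minimal sketch using gdb’s `--batch` and `-ex` options. The `dump_core` function and the `DRY_RUN` switch are hypothetical names introduced here for illustration, and they are not part of gdb.

```bash
#!/bin/bash
# Sketch only: dump_core and DRY_RUN are illustrative names, not part of gdb.

# Usage: dump_core PID COREFILE
# Attaches gdb to PID, writes COREFILE, then detaches, with no interactive prompt.
dump_core() {
    local pid=$1 corefile=$2
    local cmd=(gdb --batch --pid="$pid" -ex "gcore $corefile" -ex detach)
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        printf '%s\n' "${cmd[*]}"
    else
        "${cmd[@]}"
    fi
}
```

Run as root (or as the application owner), `dump_core 16837 /tmp/jvm.core` should mirror the interactive session above. Setting `DRY_RUN=1` first prints the gdb command so it can be reviewed before the JVM is paused.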

Benefits

  1. FAST.
    In our testing, dumping a 2.7GB core file from a process consuming 1.6GB of resident memory (RSS) took <5 sec (not including the time to fire up gdb etc.). Basically, the dump happens at almost the maximum available disk write speed. Our kit can write to disk at almost 8GB/sec – YMMV.
  2. RELIABLE.
    Doesn’t rely on the JVM. The jmap step can be executed on a different machine if load is critical (see the note above though – jmap is sensitive to the JVM version).
  3. RAPID REMEDIATION.
    After dumping the core, you can restart the application safe in the knowledge you’ve captured the forensic data.

Making it better…

We can actually make this better by creating a script that generates the core using gcore and runs the whole process given only a PID. Here is an idea:
[cc lang=”bash” line_numbers=”on” escaped=”true”]#!/bin/bash

# atlassian-heap-dump.sh - dump a heap using GDB for a crashed application
# Accepts a single argument: the PID of the JVM
# Author: James Gray (jgray@atlassian.com)
# Copyright Atlassian P/L
# License: MIT

# Are we root?
if [ "$UID" -ne 0 ]; then
    echo "Be gone peon - you must be root"
    exit 1
fi

# Did we get a command line argument?
if [ -z "$1" ]; then
    # 1st command line arg is empty...dump usage and quit
    echo "Must have a JVM PID to dump"
    echo -e "eg,\n$(basename "$0") PID\n"
    exit 1
fi

# OK, we have a PID, we are root...hit it
JVM_CORE=/tmp/jvm.core
JVM_HEAP=/tmp/application-name-$(date +'%Y%m%d').hprof
# gcore appends ".PID" to the name given with -o, hence ${JVM_CORE}.${1}
JMAP_OPTS="-dump:format=b,file=${JVM_HEAP} /usr/bin/java ${JVM_CORE}.${1}"
GCORE_OPTS="-o ${JVM_CORE} ${1}"
HERE="$(pwd)"

# Go to /tmp ... just so we know where we are.
cd /tmp

# Now run gdb and get the core:
echo "Dumping the core for PID: \"${1}\""
# Left unquoted on purpose so the options split into separate words
gcore ${GCORE_OPTS}

# Now get the heap and dump it to the preferred name:
echo "Core created at ${JVM_CORE}.${1} - YOU CAN NOW RESTART THE APPLICATION"
jmap ${JMAP_OPTS}
echo "Your JVM Heap is now available at: ${JVM_HEAP}"

# Clean up after ourselves:
echo "Deleting redundant core file"
rm -f "${JVM_CORE}.${1}" >/dev/null 2>&1

# Go back to whence we came...
cd "${HERE}"[/cc]

To-do – exercise for the reader

  • Make the script sanitise the $1 argument – is it a PID? Is the PID a JVM?
  • Check exit statuses for the gcore and jmap commands before destroying files etc.
  • Maybe automate the process restart after completing the core dump?
  • Maybe use minimum privileges (eg, J2EE run user) instead of root.
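
As a starting point for the first two items, here is a minimal sketch of helpers a script could call before and during the dump. All three function names are hypothetical. The JVM check assumes a Linux /proc filesystem and a process whose command name is exactly “java”, which will not hold if the JVM is started through a wrapper binary.

```bash
#!/bin/bash
# Sketch only: helper names are illustrative; /proc/PID/comm is Linux-specific.

# Succeed if the argument is a non-empty string of digits.
is_pid() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Succeed if the process's command name is exactly "java".
is_jvm() {
    [ "$(cat "/proc/$1/comm" 2>/dev/null)" = "java" ]
}

# Run a command and abort the script if it fails, so later steps
# (such as deleting the core file) never run after a failed dump.
run_or_die() {
    "$@" || { echo "FAILED: $*" >&2; exit 1; }
}
```

A script could then guard its work with something like `is_pid "$1" && is_jvm "$1" || exit 1`, and wrap the gcore and jmap calls in `run_or_die` so that the core file is only removed once the heap has actually been extracted.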

So you want your JVM’s heap…