Profiling Android apps with Flamegraphs

Companion code for this post is available on Github

TL;DR:

I built a tool to turn Android trace output into flame graphs. You can check out the source code here, or get started immediately by uploading a trace file here.

If you’ve ever tried to debug a performance issue in an Android app, you’ve probably become familiar with Traceview, which reads the .trace files generated by the Debug.startMethodTracing API call and displays them in a more or less readable manner. However, for me at least, Traceview is less than ideal. The interface is rather clunky, the scrolling behaviour is questionable (zooming and scrolling down at the same time? Just what I wanted?), and it’s very difficult to pick out the call chains that are consuming most of your time, especially if multiple threads are involved.

Traceview. Isn't it obvious from the coloured bars what's going on?

One of the most useful performance visualization and analysis tools I am aware of is Brendan Gregg’s Flame Graphs, which make it easy to identify long-running sections of your code. However, I couldn’t find any existing tooling for converting the Android trace format to the flat stack format expected by the flamegraph generator, so it was time to get familiar with the internals of the format.

First things first, we need to gather a trace. In my case, I have an app that takes an appreciable amount of time to load the first Activity, so I’m going to add trace sections to onCreate, onStart and onResume, following this pattern:

private static final int MEGABYTE = 1024 * 1024;

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    // Call the resulting trace file 'onCreate.trace', and allow a 128MiB
    // buffer for collecting trace data.
    Debug.startMethodTracing("onCreate", 128 * MEGABYTE);
    // Existing onCreate code
    [...]
    // Stop method tracing
    Debug.stopMethodTracing();
}

Once that’s added, we can start up the app normally and wait for it to finish booting. Note that, like all profilers, this trace mechanism adds overhead! Don’t treat the absolute timings in these traces as accurate; base your decisions on the differences between traces instead. Your app will also take noticeably longer to start while profiling is active - this is normal.

Once it has loaded, use adb to check that your traces have been created:

ross@mjolnir:/h/ross$ adb shell ls -l /sdcard/Android/data/com.schlaikjer.cookbook/files/
total 112400
-rw-rw---- 1 u0_a122 sdcard_rw 57473027 2017-02-26 14:26 onCreate.trace
-rw-rw---- 1 u0_a122 sdcard_rw     6255 2017-02-26 14:26 onResume.trace
-rw-rw---- 1 u0_a122 sdcard_rw    60809 2017-02-26 14:26 onStart.trace

And then pull all of them off the device so that we can take a look at them:

ross@mjolnir:/h/ross$ for F in {Create,Start,Resume}; do adb pull /sdcard/Android/data/com.schlaikjer.cookbook/files/on$F.trace; done
4693 KB/s (57473027 bytes in 11.959s)
710 KB/s (60809 bytes in 0.083s)
79 KB/s (6255 bytes in 0.077s)

If we crack one open with our editor of choice, we can see that the files begin with three plain text sections, followed by what looks like a lot of binary data:

*version
3
data-file-overflow=false
clock=dual
elapsed-time-usec=18482984
num-method-calls=4086234
clock-call-overhead-nsec=767
vm=art
pid=17816
*threads
17816   main
17821   Jit thread pool worker thread 0
[...]
*methods
0x7b0   java.lang.BootClassLoader   getInstance ()Ljava/lang/BootClassLoader; ClassLoader.java
0x7ac   java.lang.ClassLoader   findLoadedClass (Ljava/lang/String;)Ljava/lang/Class;   ClassLoader.java
[...]
*end
SLOW^C^@ ^@&<95>
[...]

Some of these (elapsed time, number of calls, vm name &c) are pretty intuitive. For the rest of them, the best way to figure out what they are is to look at the code that generates them!

os << StringPrintf("%cversion\n", kTraceTokenChar);
os << StringPrintf("%d\n", GetTraceVersion(clock_source_));
os << StringPrintf("data-file-overflow=%s\n", overflow_ ? "true" : "false");
if (UseThreadCpuClock()) {
  if (UseWallClock()) {
    os << StringPrintf("clock=dual\n");
  } else {
    os << StringPrintf("clock=thread-cpu\n");
  }
} else {
  os << StringPrintf("clock=wall\n");
}
os << StringPrintf("elapsed-time-usec=%" PRIu64 "\n", elapsed);
if (trace_output_mode_ != TraceOutputMode::kStreaming) {
  size_t num_records = (final_offset - kTraceHeaderLength) / GetRecordSize(clock_source_);
  os << StringPrintf("num-method-calls=%zd\n", num_records);
}
os << StringPrintf("clock-call-overhead-nsec=%d\n", clock_overhead_ns_);
os << StringPrintf("vm=art\n");
os << StringPrintf("pid=%d\n", getpid());
if ((flags_ & kTraceCountAllocs) != 0) {
  os << StringPrintf("alloc-count=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_OBJECTS));
  os << StringPrintf("alloc-size=%d\n", Runtime::Current()->GetStat(KIND_ALLOCATED_BYTES));
  os << StringPrintf("gc-count=%d\n", Runtime::Current()->GetStat(KIND_GC_INVOCATIONS));
}

Based on this, it looks like our version header consists of:

- The *version token, followed by the trace format version (3 here), which depends on the clock source
- data-file-overflow: whether the trace buffer overflowed before tracing finished
- clock: dual, thread-cpu or wall, depending on which clocks were used
- elapsed-time-usec: the total duration of the trace
- num-method-calls: the number of records in the file (only written in non-streaming mode)
- clock-call-overhead-nsec: the measured overhead of reading the clock
- vm: the runtime that generated the trace (art, in our case)
- pid: the process ID of the traced app

There are also three fields that were not present in the trace we took a look at, since they are only emitted when allocation counting (kTraceCountAllocs) is enabled:

- alloc-count: the number of objects allocated
- alloc-size: the number of bytes allocated
- gc-count: the number of garbage collections

With that under our belt, let’s move on to the Threads section. This one is pretty simple:

void Trace::DumpThreadList(std::ostream& os) {
  Thread* self = Thread::Current();
  for (auto it : exited_threads_) {
    os << it.first << "\t" << it.second << "\n";
  }
  Locks::thread_list_lock_->AssertNotHeld(self);
  MutexLock mu(self, *Locks::thread_list_lock_);
  Runtime::Current()->GetThreadList()->ForEach(DumpThread, &os);
}

So each entry under the thread section is just a tuple of thread ID and human-readable thread name. The method section is similar, but has a few more fields:

std::string Trace::GetMethodLine(ArtMethod* method) {
  method = method->GetInterfaceMethodIfProxy(kRuntimePointerSize);
  return StringPrintf("%#x\t%s\t%s\t%s\t%s\n", (EncodeTraceMethod(method) << TraceActionBits),
      PrettyDescriptor(method->GetDeclaringClassDescriptor()).c_str(), method->GetName(),
      method->GetSignature().ToString().c_str(), method->GetDeclaringClassSourceFile());
}

So the method section is a list of tuples of:

- Method ID (as printed, already shifted left by TraceActionBits)
- Declaring class descriptor
- Method name
- Method signature
- Declaring class source file
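
Both of these sections are tab-separated, which makes them easy to consume. As a rough illustration (this isn’t the repository’s exact code, and the helper names are made up for the example):

% Parse one line of the *threads section: "<tid>\t<thread name>"
parse_thread_line(Line) ->
    [IdBin, Name] = binary:split(Line, <<"\t">>),
    {binary_to_integer(IdBin), Name}.

% Parse one line of the *methods section, following GetMethodLine's format:
% "<hex method id>\t<class>\t<method>\t<signature>\t<source file>"
parse_method_line(Line) ->
    [<<"0x", Hex/binary>>, Class, Method, Signature, SourceFile] =
        binary:split(Line, <<"\t">>, [global]),
    {binary_to_integer(Hex, 16), Class, Method, Signature, SourceFile}.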

That’s all the text sections dealt with. Now we can take a look at the binary data at the end of the file. Luckily, this section is actually described at the top of trace.h:

// File format:
//     header
//     record 0
//     record 1
//     ...
//
// Header format:
//     u4  magic ('SLOW')
//     u2  version
//     u2  offset to data
//     u8  start date/time in usec
//     u2  record size in bytes (version >= 2 only)
//     ... padding to 32 bytes
//
[...]
//
// Record format v3:
//     u2  thread ID
//     u4  method ID | method action
//     u4  time delta since start, in usec
//     u4  wall time since start, in usec (when clock == "dual" only)
//
// 32 bits of microseconds is 70 minutes.
//
// All values are stored in little-endian order.

The interesting thing to note here (and something that tripped me up for a while) is the method ID | method action section of the record format. If we take a look over in trace.cc, we can see how that’s implemented:

uint32_t Trace::EncodeTraceMethodAndAction(ArtMethod* method, TraceAction action) {
  uint32_t tmid = (EncodeTraceMethod(method) << TraceActionBits) | action;
  DCHECK_EQ(method, DecodeTraceMethod(tmid));
  return tmid;
}

Where TraceAction is defined as:

enum TraceAction {
    kTraceMethodEnter = 0x00,       // method entry
    kTraceMethodExit = 0x01,        // method exit
    kTraceUnroll = 0x02,            // method exited by exception unrolling
    // 0x03 currently unused
    kTraceMethodActionMask = 0x03,  // two bits
};

So with this data, we know that the method ID encoded in a trace record, once the lower two bits are masked off, will match one of the method IDs in the *methods section of the plain text header. We can then use the lower two bits to work out whether each entry is a method entry or an exit (via either return or stack unwind).
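
As a quick sanity check of the scheme, we can replay it on a value from the header we dumped earlier: ClassLoader.findLoadedClass is 0x7ac in the *methods section, so a record for its exit stores 0x7ac OR 0x01 (these are illustrative calculations in an Erlang shell, not actual trace contents):

1> 16#7AC bor 16#01.        % method ID with kTraceMethodExit OR'd in
1965
2> 1965 band 16#FFFFFFFC.   % mask off the action bits: 1964 = 16#7AC again
1964
3> 1965 band 16#03.         % the low two bits give the action (1 = exit)
1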

Armed with this, let’s start writing a parser for these files. I chose Erlang as a learning exercise, and also because I intended to make use of its excellent binary matching syntax in conjunction with binary comprehensions. Since we know the magic for the binary section of the trace file (SLOW), let’s take a look at how we can easily parse out the header and the records using binary matching.

-define(TRACE_HEADER_MAGIC, "SLOW").

% Find the location of the trace header
{HeaderPos, _} = binary:match(Data, <<?TRACE_HEADER_MAGIC>>),

% Match out the entire header specification into variables
<<?TRACE_HEADER_MAGIC, VersionBin:2/binary, DataOffsetBin:2/binary,
  StartTimeBin:8/binary, RecordSizeBin:2/binary>> = binary:part(Data, {HeaderPos, 18}),

% Remember all numbers are little endian
DataOffset = binary:decode_unsigned(DataOffsetBin, little),
RecordSize = binary:decode_unsigned(RecordSizeBin, little),

% Now that we have the header start and header size, we can start parsing out
% the call records themselves. First, excerpt the section of the trace that
% contains the binary data
SectionStart = HeaderPos + DataOffset,
SectionEnd = byte_size(Data),
RecordSection = binary_part(Data, {SectionStart, SectionEnd - SectionStart}),

% Now that we have the records, we can break them up based on the RecordSize
% that the header specified and parse them
Records = [Record || <<Record:RecordSize/binary>> <= RecordSection],
ParsedRecords = [parse_trace_record(Record) || Record <- Records].

As you can see, extracting the records section and parsing the header was pretty simple using the binary syntax. We use the same approach to parse out the records themselves:

parse_trace_record(Record) ->
    <<ThreadId:2/binary, MethodIdActionBin:4/binary,
      TimeDelta:4/binary, WallTimeDelta:4/binary>> = Record,

    % Decode the method ID and action from a binary to an integer
    MethodIdAction = binary:decode_unsigned(MethodIdActionBin, little),

    % Now remember that this is a 4-byte integer, and that the top bits
    % are the actual method ID
    MethodId = MethodIdAction band 16#FFFFFFFC,

    % While the action is the lower two bits.
    % Convert to an atom for readability
    MethodAction = case MethodIdAction band 16#00000003 of
                       16#00 -> enter;
                       16#01 -> exit;
                       16#02 -> unwind
                   end,

    #call_record{
       thread_id=binary:decode_unsigned(ThreadId, little),
       method_id=MethodId,
       method_action=MethodAction,
       time_delta=binary:decode_unsigned(TimeDelta, little),
       wall_time_delta=binary:decode_unsigned(WallTimeDelta, little),
       child_time=0
      }.
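
As a quick check of the record layout, we can build a 14-byte dual-clock record by hand and run it through the parser. The thread and method IDs below come from the header we dumped earlier (17816 is the main thread, 0x7ac is findLoadedClass); the timings are made up for the example:

% OR-ing 0x01 into the method ID marks this record as a method exit
Example = <<17816:16/little, (16#7AC bor 16#01):32/little,
            1000:32/little, 1200:32/little>>,
#call_record{thread_id=17816, method_id=16#7AC, method_action=exit} =
    parse_trace_record(Example).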

That’s most of the hard work! The full parser code, including the sections for the thread and method tables, can be seen in the final parser implementation here. Now that we have our call records, method IDs and thread IDs, we need to convert that data into the format that the flame graph generator can handle. It expects to receive data as semicolon-delimited stack frames, followed by a space and a number representing the time / samples / cycles spent in the final call of that stack. To calculate this, we iterate over the records for each thread and perform the following:

For a method entry: Push the method name onto a stack, so that we can keep track of what methods have been called. Also push the full method record onto a second stack, so that we can reference its timings later.

For a method exit/unwind: Here’s where the real logic happens. When a method exits, the stack should contain the matching method entry record. We can use the times on the two records to calculate how much time was spent in this method call altogether. If there is a parent on the stack, we update it to reflect how much time was spent in the current call - this allows self time and child call time to be tracked separately. We then take the self time (subtracting any child time from the current call) and update a map of method name list -> time with it. Keying on the full chain of names deduplicates identical call chains.

Once we have iterated over all the calls, we should have a map of lists of method names to timings. From there, we can just join the names with semicolons, append a space and the timing, and it’s all set to be processed by flamegraph.pl.
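
To make that concrete, here’s a condensed Erlang sketch of the per-thread pass just described. It’s an illustration rather than the exact code from the repository: it assumes Records holds the #call_record{} entries for a single thread in file order, keys timings off the thread-CPU time_delta, and leans on a hypothetical method_name/1 helper that resolves a method ID against the *methods table.

% Walk one thread's records, maintaining a stack of open calls and a map of
% call chain -> accumulated self time.
aggregate(Records) ->
    aggregate(Records, [], #{}).

% Out of records: return the accumulated map
aggregate([], _Stack, Acc) ->
    Acc;
% Method entry: push the name, the entry timestamp, and a running child time
aggregate([#call_record{method_action=enter, method_id=Id, time_delta=T} | Rest],
          Stack, Acc) ->
    aggregate(Rest, [{method_name(Id), T, 0} | Stack], Acc);
% Method exit or unwind: pop the matching entry and attribute the self time
aggregate([#call_record{time_delta=T} | Rest],
          [{Name, EntryTime, ChildTime} | Stack], Acc) ->
    Total = T - EntryTime,
    Self = Total - ChildTime,
    % Key the accumulator on the full call chain, outermost frame first
    Chain = lists:reverse([Name | [N || {N, _, _} <- Stack]]),
    Acc1 = maps:update_with(Chain, fun(S) -> S + Self end, Self, Acc),
    % Credit this call's total time to its parent, if there is one
    Stack1 = case Stack of
                 [{PName, PEntry, PChild} | Up] -> [{PName, PEntry, PChild + Total} | Up];
                 [] -> []
             end,
    aggregate(Rest, Stack1, Acc1).

% Render the map in the flat format flamegraph.pl expects: frame;frame;... time
to_flat(Acc) ->
    [[lists:join(";", Chain), " ", integer_to_list(Time), "\n"]
     || {Chain, Time} <- maps:to_list(Acc)].

Each key of the resulting map is a complete call chain, so to_flat/1 emits exactly the frame;frame;frame time lines that flamegraph.pl consumes.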

The code implementing this logic can be found here.

Once we have the flat trace data, we can generate the graph:

flamegraph.pl \
        --title "onCreate" \
        --hash \
        --countname "microseconds" \
        onCreate.flat > onCreate.svg

Et voilà!

In order to make this process a lot simpler and less manual, the trace parsing and graph generation have all been rolled into a simple server that you can run. Instructions are available on Github, and there is also a public copy of the server running at https://aflame.rhye.org/ if you want to try out uploading your own traces.

The full per-thread breakdown for the example onCreate trace we generated can be found online here.
